# Final Project

Name: Kartikeya Sharma <br/>Class: CSCI 349 - Intro to Data Mining  
Semester: Spring 2021  
Instructor: Brian King  

### Step 0: Enivornment Setup

#### Step 0.0: Installations

- **phonenumbers:** Python wrapper for Google's phone numbers API: ```conda install phonenumbers```<br/>
- **us:** package for conveniently working with state abbreviations: ```pip install us``` because conda-forge installation does not work (and is not listed as a primary way of installing this package, so we're stuck with pip)<br/>
- **uszipcode:** package that has a vast amount of information on zipcodes, including the major city, post office city, common city, county, state, **area code list** (helpful), **latitude** (helpful), **longitude** (helpful), timezone, demographics (population, population_density, population by age, population by gender, population by race, etc.)... talk about data mining ```pip install uszipcode``` (pip had a solid installation of this)

#### Step 0.1: Imports

In [76]:
%time
# imports used in the course
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# from  matplotlib.ticker import PercentFormatter
# from sklearn.model_selection import train_test_split, KFold
# from sklearn.utils import shuffle
# from sklearn.metrics import classification_report, confusion_matrix, f1_score
# from sklearn.model_selection import cross_validate, cross_val_predict
# from sklearn.model_selection import GridSearchCV

# from sklearn.preprocessing import StandardScaler
# from sklearn.tree import DecisionTreeClassifier

# import tensorflow as tf
# from tensorflow import keras
# from tensorflow.keras import Input, Model
# from tensorflow.keras.layers import Dense, Activation
# from tensorflow.keras.optimizers import Adam, SGD
# from scikeras.wrappers import KerasClassifier

CPU times: user 2 µs, sys: 1 µs, total: 3 µs
Wall time: 4.05 µs


In [77]:
# custom imports (not used in the course)
import phonenumbers
import us
from uszipcode import SearchEngine # importing what is needed per the documentation

```uszipcode``` documentation<sup>5</sup>

In [78]:
%time 
try:
    df_raw = pd.read_csv(\
    "../data/Consumer_Complaints_Data_-_Unwanted_Calls_raw.csv"\
    )
except:
    # import from online if not available on machine
    df_raw = pd.read_csv('https://query.data.world/s/24xzbr2jaeuohhwmrzhj7jcyrdamlw')

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 4.05 µs


In [87]:
df = df_raw.sample(frac=0.01).copy(deep=True)

```df``` will be considered the sample of the data set, where ```df_raw``` will be considered as the whole data set; this is for simplicity and convenience by design, so I don't have to write ```df_samp``` or ```df_sample``` every time.
<br/><br/>Data provided from data.world.<sup>1</sup>

### Step 1: Pre-Processing

#### Step 1.0 Inspect Data at a Macro Level

In [80]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4263 entries, 130601 to 70067
Data columns (total 13 columns):
 #   Column                                   Non-Null Count  Dtype 
---  ------                                   --------------  ----- 
 0   Ticket ID                                4263 non-null   int64 
 1   Ticket Created                           4263 non-null   object
 2   Date of Issue                            4219 non-null   object
 3   Time of Issue                            3250 non-null   object
 4   Form                                     4263 non-null   object
 5   Method                                   4239 non-null   object
 6   Issue                                    4263 non-null   object
 7   Caller ID Number                         3418 non-null   object
 8   Type of Call or Messge                   4217 non-null   object
 9   Advertiser Business Number               1796 non-null   object
 10  State                                    4260 non-null

In [82]:
df.head()

Unnamed: 0,Ticket ID,Ticket Created,Date of Issue,Time of Issue,Form,Method,Issue,Caller ID Number,Type of Call or Messge,Advertiser Business Number,State,Zip,Location (Center point of the Zip Code)
130601,304303,05/23/2015 01:38:57 AM +0000,05/22/2015,5:30 P.M.,Phone,Wired,Telemarketing (including do not call and spoof...,480-500-5307,Live Voice,,AZ,85042,"AZ 85042\n(33.378287, -112.032881)"
60775,824552,02/22/2016 03:38:40 PM +0000,02/22/2016,12:46 A.M.,Phone,Wireless (cell phone/other mobile device),Robocalls,,Text Message,,GA,30252,"GA 30252\n(33.468174, -84.073668)"
180535,493206,08/27/2015 04:32:07 AM +0000,08/26/2015,1:23 PM,Phone,Wireless (cell phone/other mobile device),Telemarketing (including do not call and spoof...,,Live Voice,214-687-4420,TN,37354,"TN 37354\n(35.503791, -84.351119)"
205489,589107,10/14/2015 02:20:49 PM +0000,10/13/2015,5:45 pm,Phone,Internet (VOIP),Robocalls,626-333-6138,Prerecorded Voice,,NC,28226,"NC 28226\n(35.107444, -80.81869)"
307133,1213645,09/22/2016 06:39:21 PM +0000,09/22/2016,10:09 am,Phone,Wireless (cell phone/other mobile device),Robocalls,843-486-8171,Prerecorded Voice,843-486-8171,MN,55437,"MN 55437\n(44.825997, -93.344715)"


#### Step 1.1: Discard, Format, Downsize Each Column

Column #0 "Ticket ID": Unless there are some duplicate ticket values that should be looked into further, I intend on discarding them.<sup>2</sup>

In [83]:
df["Ticket ID"].value_counts().sort_values(ascending=False).head(1)

235384    1
Name: Ticket ID, dtype: int64

Clearly no duplicate values. Discarding Column #0 "Ticket ID."

In [84]:
df.drop(columns="Ticket ID", inplace=True)

<br/>Column #1 "Ticket Created": This column represents when the ticket was created in GMT time. This will be imported into GMT time and then converted to EST (FCC headquarters and our time zone). Also, we don't need seconds information.

In [85]:
df["Ticket Created"].isna().sum()

0

None are NaN at the moment.

In [9]:
def _convert_to_EST_or_nan(ts_str: str):
    try:
        ts = pd.to_datetime(ts_str)
        ts = ts.tz_convert("US/Eastern")
        ts = ts.floor('Min')
        return ts
    except:
        return pd.NaT
    
df["Ticket Created"] = df["Ticket Created"].apply(_convert_to_EST_or_nan)

In [None]:
df = df[df["Ticket Created"].notnull()]

In [10]:
df["Ticket Created"].head()

294624   2016-05-02 18:15:00-04:00
173828   2014-12-16 18:09:00-05:00
192864   2014-12-23 11:39:00-05:00
18430    2016-07-01 18:19:00-04:00
189689   2015-09-15 10:01:00-04:00
Name: Ticket Created, dtype: datetime64[ns, US/Eastern]

How many are NaN now?

In [86]:
df["Ticket Created"].isna().sum()

0

Still 0 NaN values! Looks like no observations *failed* during the conversion.

<br/>Column #2 "Date of Issue": This column represents the date when the user actually experienced the issue. It will be stored as a date without a time zone (since not all observations gave a particular time so that the dates can be informatively localized accordingly).

In [75]:
df["Date of Issue"].isna().sum()

12

No NaN values at the moment.

In [73]:
def _convert_datetime(dt):
    try:
        return pd.to_datetime(dt)
    except: 
        return np.NaN

In [70]:
df["Date of Issue"] = df["Date of Issue"].apply(_convert_datetime)

<br/>Column #3 "Time of Issue": This column represents the time when the user actually experienced the issue. This information appears to be too specific for our analysis.<sup>2</sup>

In [11]:
df.drop(columns="Time of Issue", inplace=True)

<br/>Column #4 "Form": This represents the method through which the consumers were contacted by the reported caller. These should all be "Phone"; we don't need it.<sup>2</sup>

In [12]:
df["Form"].value_counts()

Phone    4263
Name: Form, dtype: int64

In [13]:
df.drop(columns="Form", inplace=True)

<br/>Column #5 "Method": How the reported caller contacted the consumer.<sup>2</sup>

In [14]:
df["Method"] = pd.Categorical(df["Method"], ordered=False)

In [15]:
df["Method"].head()

294624                                        Wired
173828    Wireless (cell phone/other mobile device)
192864    Wireless (cell phone/other mobile device)
18430     Wireless (cell phone/other mobile device)
189689                                        Wired
Name: Method, dtype: category
Categories (3, object): ['Internet (VOIP)', 'Wired', 'Wireless (cell phone/other mobile device)']

There are three possibilities (nominal) categories: Internet (VOIP), which is something like Google Voice, wired, which is a landline, and wireless, which is a cell phone.<sup>2</sup>

It is not necessary to clarify that wireless means cell phone or other mobile device. This category will be renamed to wireless.

In [39]:
df["Method"].cat.rename_categories(
    new_categories = {"Wireless (cell phone/other mobile device)": "Wireless"},
    inplace=True
)

<br/>Column #6 "Issue": 

In [16]:
df["Issue"] = pd.Categorical(df["Issue"], ordered=False)

In [17]:
df["Issue"].head()

294624                                            Robocalls
173828    Telemarketing (including do not call and spoof...
192864    Telemarketing (including do not call and spoof...
18430                                             Robocalls
189689    Telemarketing (including do not call and spoof...
Name: Issue, dtype: category
Categories (3, object): ['Robocalls', 'Telemarketing (including do not call and spoo..., 'Unwanted Calls']

Looking close at the FCC descriptions of the different attributes, robocalls and telemarketing are currently being marked as unwanted calls instead of those two categories; hence, the values in this column are not consistent throughout across the reports/over time. This column will be removed, and the data set will be considered as containing data that represents reports of 'unwanted/spam' calls.<sup>2</sup>

In [18]:
df.drop(columns="Issue", inplace=True)

<br/>Column #7 "Caller ID Number": number (of reported call from corresponding unwanted caller) that appeared on the consumer's caller ID<sup>2</sup>

We will use the Python wrapper of google's phonenumbers library.<sup>3</sup>

Per "Please enter the phone number in the following format 555-555-5555" instructions on the FCC complaint form, all numbers are assumed to be USA, which makes sense since the FCC is a USA regulatory body which has jurisdiction in the USA.<sup>4</sup>

In [19]:
def _parse_ph_number(ph_num: str):
    try:
        ph_num = str(ph_num)
        ph_num_parsed = phonenumbers.parse(ph_num, "US")
    except phonenumbers.NumberParseException: 
        print(ph_num)
        return np.NaN
    if not phonenumbers.is_possible_number(ph_num_parsed):
        print(2)
        return np.NaN
    if not phonenumbers.is_valid_number(ph_num_parsed):
        print(3)
        return np.NaN
    return ph_num_parsed

In [20]:
%%capture 
# %%capture to not print the slew of lines that comes out with phonenumbers methods
# for whatever reason (there are some patches that need to be resolved with
# wrapping the Google Java library in Python)
df["Caller ID Number"] = \
df["Caller ID Number"].apply(_parse_ph_number);

In [21]:
df = df[df["Caller ID Number"].notnull()] 

In [22]:
df["Caller ID Number"].head()

192864    Country Code: 1 National Number: 8133473401
404231    Country Code: 1 National Number: 2106257494
253800    Country Code: 1 National Number: 3153633174
250483    Country Code: 1 National Number: 4256817161
379501    Country Code: 1 National Number: 8009564126
Name: Caller ID Number, dtype: object

<br/>Column #8 "Type of Call or Messge": The type of the call (or message) received. Live voice? Prerecorded message? Text message?

In [23]:
df["Type of Call or Messge"] = pd.Categorical(df["Type of Call or Messge"], ordered=False)

In [24]:
df["Type of Call or Messge"].head()

192864    Prerecorded Voice
404231      Abandoned Calls
253800      Abandoned Calls
250483      Abandoned Calls
379501    Prerecorded Voice
Name: Type of Call or Messge, dtype: category
Categories (5, object): ['Abandoned Calls', 'Autodialed Live Voice Call', 'Live Voice', 'Prerecorded Voice', 'Text Message']

<br/>Column #9 "Advertiser Business Number": The number of the advertiser that the caller claims to be associated with.

In [25]:
%%capture 
# %%capture to not print the slew of lines that comes out with phonenumbers methods
# for whatever reason (there are some patches that need to be resolved with
# wrapping the Google Java library in Python)
df["Advertiser Business Number"] = \
df["Advertiser Business Number"].apply(_parse_ph_number);

Not every caller provides a business number which they're associated with. We do not need to remove all of the observations that had NaN values for this attribute.

In [26]:
df["Advertiser Business Number"].head()

192864    NaN
404231    NaN
253800    NaN
250483    NaN
379501    NaN
Name: Advertiser Business Number, dtype: object

<br/>Column #10 "State": State in which the reporting consumer resides in<sup>2</sup>

In [27]:
def _validate_state(state_ab: str):
    try:
        if us.states.lookup(state_ab) in us.states.STATES:
            return state_ab
        else:
            return np.NaN
    except: # if np.NaN, for example
        return np.NaN

```us``` documentation<sup>6</sup>

In [28]:
df["State"] = df["State"].apply(_validate_state)

We don't have to eliminate observations without valid state abbreviations. Perhaps we can extrapolate the state of the customer from their area code, i.e. the phone number that received the call.

In [29]:
df["State"] = pd.Categorical(df["State"], ordered=False)

State abbreviations are values within a nominal variable.

In [30]:
df["State"].head()

192864    FL
404231    TX
253800    NY
250483    WA
379501    MT
Name: State, dtype: category
Categories (50, object): ['AK', 'AL', 'AR', 'AZ', ..., 'WA', 'WI', 'WV', 'WY']

We have each of the 50 states covered in our data sample. Yay!

<br/>Column #11 "Zip": The zip code of where the consumer resides<sup>2</sup>

Because we don't need all of the vast amount that this package provides *at the moment*, given that the latitude, longitude, state information is already given in our data set, we will start by using the simple version of the backend database provided by the ```uszipcode``` package.

In [31]:
search = SearchEngine(simple_zipcode=True, db_file_dir="../data/zip_data_raw")

In [32]:
def _validate_and_get_info_zipcode(zip: str):
    try:
        zip_obj = search.by_zipcode(zip)
        if zip_obj.zipcode is None:
            return np.NaN
        return zip_obj
    except:
        return np.NaN

In [33]:
df["Zip Info"] = df["Zip"].apply(_validate_and_get_info_zipcode)

We will now remove all entries with invalid ZIP Codes. We will be using the ZIP Code objects, which are now stored in the "Zip Info" column (newly created column), to also store otherr information associated with the Zip Code, such as latitude and longitude information; this way, the information is consolidated within the SimpleZipcode object. Plus, we can avoid painfully parsing "Location..." column (next column), keep the data consistent within the uszipcode package (information all from one database), and verify the validity of the latitude/longitude data all at the same time through the SimpleZipcode object.

In [34]:
df = df[df["Zip Info"]!=np.NaN]

A zip code is technically a nominal variable (even though there may be *many* of them). We can perhaps sort the data sample by a zip code category to understand which zip codes had the most reports, for instance. Making the Zip Info column into a Categorical may not make sense if the pd.Categorical method does not pick up on two different objects potentially being the same if their contents are the same. For simplicity, the Zip Info column will not be converted into a pd.Categorical.

In [35]:
df["Zip"] = pd.Categorical(df["Zip"], ordered=False)

In [36]:
df["Zip"].head()

192864    33609
404231    78232
253800    13421
250483    98072
379501    59486
Name: Zip, dtype: category
Categories (2421, object): ['00000', '00725', '00792', '00977', ..., '99502', '99516', '99709', '99803']

```uszipcode``` documentation<sup>5</sup>

<br/>Column #12 "Location (Center point of the Zip Code)": center of the zip code, not by consumer's address or anything specifically (data set maintains consumer privacy, which is the ethical thing to do and is solid computer science research ethics)<sup>2</sup>

Per the explanation above, I am removing this column.<br/>
*"We will be using the ZIP Code objects, which are now stored in the "Zip Info" column (newly created column), to also store otherr information associated with the Zip Code, such as latitude and longitude information; this way, the information is consolidated within the SimpleZipcode object. Plus, we can avoid painfully parsing "Location..." column (next column), keep the data consistent within the uszipcode package (information all from one database), and verify the validity of the latitude/longitude data all at the same time through the SimpleZipcode object."*

In [37]:
df.drop(columns="Location (Center point of the Zip Code)", inplace=True)

<br/>All in all (after the column conversions):

In [40]:
df.head()

Unnamed: 0,Ticket Created,Date of Issue,Method,Caller ID Number,Type of Call or Messge,Advertiser Business Number,State,Zip,Zip Info
192864,2014-12-23 11:39:00-05:00,12/23/2014,Wireless,Country Code: 1 National Number: 8133473401,Prerecorded Voice,,FL,33609,"SimpleZipcode(zipcode='33609', zipcode_type='S..."
404231,2017-03-29 17:40:00-04:00,03/29/2017,Wired,Country Code: 1 National Number: 2106257494,Abandoned Calls,,TX,78232,"SimpleZipcode(zipcode='78232', zipcode_type='S..."
253800,2016-01-29 14:57:00-05:00,01/28/2016,Wired,Country Code: 1 National Number: 3153633174,Abandoned Calls,,NY,13421,"SimpleZipcode(zipcode='13421', zipcode_type='S..."
250483,2016-01-22 14:40:00-05:00,01/22/2016,Wireless,Country Code: 1 National Number: 4256817161,Abandoned Calls,,WA,98072,"SimpleZipcode(zipcode='98072', zipcode_type='S..."
379501,2017-02-10 17:47:00-05:00,02/10/2017,Wireless,Country Code: 1 National Number: 8009564126,Prerecorded Voice,,MT,59486,"SimpleZipcode(zipcode='59486', zipcode_type='S..."


### Step 2: EDA

In [41]:
#### Step 2.0: Plotting General Distributional Data

In [45]:
df.groupby(df["Date of Issue"].dt.month).count().plot(kind="bar")

AttributeError: Can only use .dt accessor with datetimelike values

Retrieved from Stack Overflow<sup>7</sup>

### References
1. Data from https://data.world/kgarrett/unwanted-calls
2. Verified description of the variable using https://opendata.fcc.gov/Consumer/Consumer-Complaints-Data-Unwanted-Calls/vakf-fz8e
3. Python wrapper of Google's ```phonenumbers``` library (documentation): https://pypi.org/project/phonenumbers/
4. FCC unwanted call complaint form: https://consumercomplaints.fcc.gov/hc/en-us/requests/new?ticket_form_id=39744
5. ```uszipcode``` documentation: https://pypi.org/project/uszipcode/
6. ```us``` documentation: https://pypi.org/project/us/
7. Stack Overflow: https://stackoverflow.com/questions/27365467/can-pandas-plot-a-histogram-of-dates