# Final Project

Name: Kartikeya Sharma <br/>Class: CSCI 349 - Intro to Data Mining  
Semester: Spring 2021  
Instructor: Brian King  

### Installations

- **phonenumbers:** Python port of Google's libphonenumber library: ```conda install phonenumbers```<br/>
- **us:** package for conveniently working with state abbreviations: ```pip install us``` (the conda-forge installation does not work and is not listed as a primary way of installing this package, so we're stuck with pip)<br/>
- **uszipcode:** package with a vast amount of information on zip codes, including the major city, post office city, common city, county, state, **area code list** (helpful), **latitude** (helpful), **longitude** (helpful), timezone, and demographics (population, population density, population by age, gender, race, etc.)... talk about data mining: ```pip install uszipcode``` (pip installed this cleanly)

### Imports

In [None]:
%%time
# imports used in the course
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from matplotlib.ticker import PercentFormatter
from sklearn.model_selection import train_test_split, KFold
from sklearn.utils import shuffle
from sklearn.metrics import classification_report, confusion_matrix, f1_score
from sklearn.model_selection import cross_validate, cross_val_predict
from sklearn.model_selection import GridSearchCV

from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.optimizers import Adam, SGD
from scikeras.wrappers import KerasClassifier

In [None]:
# custom imports (not used in the course)
import phonenumbers
import us
from uszipcode import SearchEngine # importing what is needed per the documentation

```uszipcode``` documentation<sup>5</sup>

In [None]:
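%%time
# Load the full raw data set from its expected location.
try:
    df_raw = pd.read_csv(
        "../data/Consumer_Complaints_Data_-_Unwanted_Calls_raw.csv"
    )
except FileNotFoundError as err:
    # No alternative location is configured, so fail with a clear message.
    raise FileNotFoundError(
        "Raw complaints CSV not found under ../data/"
    ) from err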

In [None]:
df = df_raw.sample(frac=0.01).copy(deep=True)

```df``` is the working sample of the data set, whereas ```df_raw``` is the whole data set; this naming is deliberate, so I don't have to write ```df_samp``` or ```df_sample``` every time.
<br/><br/>Data provided from data.world.<sup>1</sup>
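
Because ```sample``` draws rows at random, each run of the notebook works on a different 1% of the data unless the draw is seeded. A minimal sketch of a repeatable sample, assuming a toy frame in place of ```df_raw``` and an arbitrary seed (neither comes from this notebook):

```python
import pandas as pd

# Toy stand-in for df_raw; the real frame comes from the complaints CSV.
toy_raw = pd.DataFrame({"ticket": range(1000)})

SEED = 349  # hypothetical seed; any fixed integer works

sample_a = toy_raw.sample(frac=0.01, random_state=SEED)
sample_b = toy_raw.sample(frac=0.01, random_state=SEED)

# Same seed, same rows: the 1% sample is repeatable across runs.
print(sample_a.index.equals(sample_b.index))
print(len(sample_a))
```

Whether to fix the seed is a design choice: a fixed seed makes later observations about the sample reproducible, while an unseeded draw shows how much they vary from sample to sample.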

### Step 1: Pre-Processing

#### Step 1.0 Inspect Data at a Macro Level

In [None]:
df.info()

In [None]:
df.head()

#### Step 1.1: Discard, Format, Downsize Each Column

Column #0 "Ticket ID": Unless there are duplicate ticket IDs that should be looked into further, I intend to discard this column.<sup>2</sup>

In [None]:
df["Ticket ID"].value_counts().sort_values(ascending=False).head(1)

No duplicate values in the sample. Discarding Column #0 "Ticket ID".
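
The same question can be asked directly with ```Series.is_unique``` and ```Series.duplicated```. A small sketch on made-up IDs (the toy frame is an assumption, not notebook data):

```python
import pandas as pd

# Made-up ticket IDs; 103 is deliberately repeated.
toy = pd.DataFrame({"Ticket ID": [101, 102, 103, 103]})

ids = toy["Ticket ID"]
has_duplicates = not ids.is_unique

# keep=False marks every occurrence of a repeated value, not just the later ones
repeated = ids[ids.duplicated(keep=False)].tolist()

print(has_duplicates)
print(repeated)
```

Note that a check on the 1% sample cannot rule out duplicates in the full data set; running it on ```df_raw``` would be needed for that.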

In [None]:
df.drop(columns="Ticket ID", inplace=True)

<br/>Column #1 "Ticket Created": This column records when the ticket was created, in GMT. It will be parsed as GMT and then converted to US Eastern time (the time zone of FCC headquarters and ours). We also don't need the seconds, so timestamps are truncated to the minute.

In [None]:
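def _convert_to_EST_or_nan(ts_str: str):
    """Parse a GMT timestamp, convert it to US Eastern, and drop the seconds."""
    try:
        # utc=True makes the result timezone-aware so tz_convert can be applied
        ts = pd.to_datetime(ts_str, utc=True)
        ts = ts.tz_convert("US/Eastern")
        return ts.floor("min")
    except (ValueError, TypeError, OverflowError):
        # unparseable timestamps become NaT and are dropped below
        return pd.NaT

df["Ticket Created"] = df["Ticket Created"].apply(_convert_to_EST_or_nan)
df = df[df["Ticket Created"].notnull()]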

In [None]:
df["Ticket Created"].head()
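
The same conversion can also be done on the whole column at once instead of row by row. A minimal sketch, assuming the timestamps are GMT strings in a single consistent format (the toy values below are made up, and the real column's format may differ):

```python
import pandas as pd

# Made-up GMT timestamps plus one unparseable entry.
toy = pd.Series(["2021-03-01 17:45:30", "2021-07-04 03:10:59", "not a date"])

eastern = (
    pd.to_datetime(toy, utc=True, errors="coerce")  # bad values become NaT
    .dt.tz_convert("US/Eastern")                     # GMT -> US Eastern
    .dt.floor("min")                                 # drop the seconds
)
print(eastern)
```

"US/Eastern" follows daylight saving time, so the March timestamp lands five hours behind GMT and the July one four hours behind.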

<br/>Column #2 "Date of Issue": This column records the date on which the consumer actually experienced the issue. The consumer's zip code will be mapped to a time zone, and the date will then be interpreted in that time zone. *I will return to this once I have processed the consumer time zone information.*<sup>2</sup>

<br/>Column #3 "Time of Issue": This column represents the time when the user actually experienced the issue. This information appears to be too specific for our analysis.<sup>2</sup>

In [None]:
df.drop(columns="Time of Issue", inplace=True)

<br/>Column #4 "Form": This represents the method through which the consumers were contacted by the reported caller. These should all be "Phone"; we don't need it.<sup>2</sup>

In [None]:
df["Form"].value_counts()

In [None]:
df.drop(columns="Form", inplace=True)

<br/>Column #5 "Method": How the reported caller contacted the consumer.<sup>2</sup>

In [None]:
df["Method"] = pd.Categorical(df["Method"], ordered=False)

In [None]:
df["Method"].head()

There are three possible (nominal) categories: Internet (VoIP), which is something like Google Voice; wired, which is a landline; and wireless, which is a cell phone.<sup>2</sup>
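
Storing a low-cardinality column as a categorical is also part of the "downsize" goal of this step, since each distinct label is stored only once. A rough sketch of the effect, using made-up labels rather than the exact strings in the data set:

```python
import pandas as pd

# Made-up values standing in for the "Method" column.
methods = pd.Series(["Wireless", "Wired", "Internet (VOIP)"] * 1000)

as_object = methods.memory_usage(deep=True)
as_category = methods.astype("category").memory_usage(deep=True)

# The categorical keeps one copy of each label plus small integer codes.
print(as_object, as_category)
print(methods.astype("category").cat.categories.tolist())
```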

<br/>Column #6 "Issue": The category of the issue reported by the consumer.<sup>2</sup>

In [None]:
df["Issue"] = pd.Categorical(df["Issue"], ordered=False)

In [None]:
df["Issue"].head()

Looking closely at the FCC descriptions of the different attributes, robocalls and telemarketing are currently being marked as unwanted calls instead of under their own categories; hence, the values in this column are not consistent across reports over time. This column will be removed, and the data set will be treated as reports of 'unwanted/spam' calls.<sup>2</sup>
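
One way to see this kind of labeling drift is to cross-tabulate the category against the year of the report. A toy sketch (the years, labels, and counts are invented purely for illustration and are not taken from the data set):

```python
import pandas as pd

# Invented reports: the older labels stop appearing in the last year.
toy = pd.DataFrame({
    "year":  [2015, 2015, 2016, 2016, 2017, 2017],
    "Issue": ["Robocalls", "Telemarketing", "Robocalls",
              "Unwanted Calls", "Unwanted Calls", "Unwanted Calls"],
})

# Rows are years, columns are issue labels, cells are report counts.
counts = pd.crosstab(toy["year"], toy["Issue"])
print(counts)
```

On the real data, the year could be taken from the "Ticket Created" column; a label whose counts drop to zero while another rises would point to a change in how reports were categorized.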

<br/>Column #7 "Caller ID Number": The number of the reported unwanted caller as it appeared on the consumer's caller ID.<sup>2</sup>

We will use ```phonenumbers```, the Python port of Google's libphonenumber library.<sup>3</sup>

Per the "Please enter the phone number in the following format 555-555-5555" instructions on the FCC complaint form, all numbers are assumed to be US numbers, which makes sense since the FCC is a US regulatory body with jurisdiction in the US.<sup>4</sup>

In [None]:
def _parse_ph_number(ph_num: str):
    """Return a parsed US phone number, or NaN if it cannot be parsed or is not valid."""
    try:
        ph_num_parsed = phonenumbers.parse(str(ph_num), "US")
    except phonenumbers.NumberParseException:
        return np.nan
    # reject numbers of impossible length, then numbers that fail US validation
    if not phonenumbers.is_possible_number(ph_num_parsed):
        return np.nan
    if not phonenumbers.is_valid_number(ph_num_parsed):
        return np.nan
    return ph_num_parsed

In [None]:
%%capture 
# %%capture to not print the slew of lines that comes out with phonenumbers methods
# for whatever reason (there are some patches that need to be resolved with
# wrapping the Google Java library in Python)
df["Caller ID Number"] = \
df["Caller ID Number"].apply(_parse_ph_number);

In [None]:
df = df[df["Caller ID Number"].notnull()] 

In [None]:
df["Caller ID Number"].head()

<br/>Column #8 "Type of Call or Messge": The type of the call (or message) received. Live voice? Prerecorded message? Text message?

In [None]:
df["Type of Call or Messge"] = pd.Categorical(df["Type of Call or Messge"], ordered=False)

In [None]:
df["Type of Call or Messge"].head()

<br/>Column #9 "Advertiser Business Number": The number of the advertiser that the caller claims to be associated with.

In [None]:
%%capture 
# %%capture to not print the slew of lines that comes out with phonenumbers methods
# for whatever reason (there are some patches that need to be resolved with
# wrapping the Google Java library in Python)
df["Advertiser Business Number"] = \
df["Advertiser Business Number"].apply(_parse_ph_number);

Not every caller provides an associated business number, so observations with NaN values for this attribute do not need to be removed.

In [None]:
df["Advertiser Business Number"].head()
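
Since these rows are kept, it can still be useful to know how much of the column is missing. A quick sketch on made-up values (the toy series is an assumption, not notebook data):

```python
import numpy as np
import pandas as pd

# Made-up column: two reports with a business number, two without.
toy = pd.Series(["202-555-0101", np.nan, np.nan, "202-555-0102"])

# isna() yields booleans, so their mean is the share of missing values.
missing_share = toy.isna().mean()
print(f"{missing_share:.0%} of rows have no business number")
```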

<br/>Column #10 "State": The state in which the reporting consumer resides.<sup>2</sup>

In [None]:
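def _validate_state(state_ab: str):
    """Return the state value if it maps to a state in us.states.STATES, else NaN."""
    try:
        if us.states.lookup(state_ab) in us.states.STATES:
            return state_ab
        return np.nan
    except Exception:  # e.g. the input is NaN rather than a string
        return np.nan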

```us``` documentation<sup>6</sup>

In [None]:
df["State"] = df["State"].apply(_validate_state)

We don't have to eliminate observations without valid state abbreviations. Perhaps we can infer the consumer's state from the area code of the phone number that received the call.

In [None]:
df["State"] = pd.Categorical(df["State"], ordered=False)

State abbreviations are values within a nominal variable.

In [None]:
df["State"].head()

We have each of the 50 states covered in our data sample. Yay!

<br/>Column #11 "Zip": The zip code of where the consumer resides<sup>2</sup>

Because we don't need all of the information that this package provides *at the moment*, given that the latitude, longitude, and state information is already in our data set, we will start with the simple version of the backend database provided by the ```uszipcode``` package.

In [None]:
search = SearchEngine(simple_zipcode=True, db_file_dir="../data/zip_data_raw")

In [None]:
# df is not re-sampled here: a fresh df_raw.sample(...) would discard all of the preprocessing above

In [None]:
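def _validate_and_get_info_zipcode(zip_code: str):
    """Return the uszipcode object for a ZIP Code, or NaN if it is not found."""
    try:
        zip_obj = search.by_zipcode(zip_code)
        if zip_obj.zipcode is None:
            return np.nan
        return zip_obj
    except Exception:  # lookup failed, e.g. the input is NaN
        return np.nan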

In [None]:
df["Zip Info"] = df["Zip"].apply(_validate_and_get_info_zipcode)

We will now remove all entries with invalid ZIP Codes. We will be using the ZIP Code objects, which are now stored in the "Zip Info" column (newly created column), to also store other information associated with the Zip Code, such as latitude and longitude information; this way, the information is consolidated within the SimpleZipcode object. Plus, we can avoid painfully parsing the "Location..." column (next column), keep the data consistent within the uszipcode package (information all from one database), and verify the validity of the latitude/longitude data all at the same time through the SimpleZipcode object.

In [None]:
df = df[df["Zip Info"].notnull()]
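
Missing values have to be filtered with ```notna()```/```notnull()``` rather than an equality test, because NaN compares unequal to everything, including itself. A small sketch on made-up values (the toy series is an assumption):

```python
import numpy as np
import pandas as pd

# Made-up column with one missing entry.
toy = pd.Series(["zip_a", np.nan, "zip_b"], dtype=object)

# NaN != NaN is True, so this comparison keeps every row, missing or not.
kept_by_comparison = toy[toy != np.nan]

# notna() is the test that actually identifies missing entries.
kept_by_notna = toy[toy.notna()]

print(len(kept_by_comparison))
print(len(kept_by_notna))
```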

A zip code is technically a nominal variable (even though there may be *many* of them). We could, for instance, group the data sample by zip code to see which zip codes had the most reports. Converting the Zip Info column into a Categorical may not make sense, because pd.Categorical may not treat two distinct objects with the same contents as the same category. For simplicity, the Zip Info column will not be converted into a pd.Categorical.

In [None]:
df["Zip"] = pd.Categorical(df["Zip"], ordered=False)

In [None]:
df["Zip"].head()
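
With the zip code stored as a categorical, finding the zip codes with the most reports is a single ```value_counts``` call. A minimal sketch with made-up zip codes (not values from the sample):

```python
import pandas as pd

# Made-up zip codes, kept as strings so any leading zeros survive.
toy_zip = pd.Series(["17837", "17837", "10001", "90210", "17837", "10001"])
toy_zip = pd.Series(pd.Categorical(toy_zip, ordered=False))

# value_counts sorts by frequency, most reported zip code first.
reports_per_zip = toy_zip.value_counts()
print(reports_per_zip.head(2))
```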

```uszipcode``` documentation<sup>5</sup>

<br/>Column #12 "Location (Center point of the Zip Code)": The center point of the zip code, not the consumer's specific address (the data set maintains consumer privacy, which is the ethical thing to do and is solid computer science research ethics).<sup>2</sup>

Per the explanation above, I am removing this column.<br/>
*"We will be using the ZIP Code objects, which are now stored in the "Zip Info" column (newly created column), to also store other information associated with the Zip Code, such as latitude and longitude information; this way, the information is consolidated within the SimpleZipcode object. Plus, we can avoid painfully parsing the "Location..." column (next column), keep the data consistent within the uszipcode package (information all from one database), and verify the validity of the latitude/longitude data all at the same time through the SimpleZipcode object."*

In [None]:
df.drop(columns="Location (Center point of the Zip Code)", inplace=True)

<br/>All in all (after initial preprocessing of columns):

In [None]:
df.head()

### References
1. Data from https://data.world/kgarrett/unwanted-calls
2. Verified description of the variable using https://opendata.fcc.gov/Consumer/Consumer-Complaints-Data-Unwanted-Calls/vakf-fz8e
3. ```phonenumbers```, the Python port of Google's libphonenumber library (documentation): https://pypi.org/project/phonenumbers/
4. FCC unwanted call complaint form: https://consumercomplaints.fcc.gov/hc/en-us/requests/new?ticket_form_id=39744
5. ```uszipcode``` documentation: https://pypi.org/project/uszipcode/
6. ```us``` documentation: https://pypi.org/project/us/