# Final Project

Name: Kartikeya Sharma <br/>Class: CSCI 349 - Intro to Data Mining  
Semester: Spring 2021  
Instructor: Brian King  

### Installations

```conda install phonenumbers``` # Python wrapper for Google's phone numbers API

### Imports

In [24]:
%time
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from  matplotlib.ticker import PercentFormatter
from sklearn.model_selection import train_test_split, KFold
from sklearn.utils import shuffle
from sklearn.metrics import classification_report, confusion_matrix, f1_score
from sklearn.model_selection import cross_validate, cross_val_predict
from sklearn.model_selection import GridSearchCV

from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.optimizers import Adam, SGD
from scikeras.wrappers import KerasClassifier

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 5.01 µs


In [2]:
import phonenumbers
import sys, os # for turning print on and off

In [26]:
%time df_raw = pd.read_csv(\
"../data/Consumer_Complaints_Data_-_Unwanted_Calls_raw.csv"\
)

CPU times: user 1.18 s, sys: 95.9 ms, total: 1.28 s
Wall time: 1.28 s


In [27]:
df = df_raw.sample(frac=0.01).copy(deep=True)

Data from https://data.world/kgarrett/unwanted-calls<sup>1</sup>

### Step 1: Pre-Processing

#### Step 1.0 Inspect Data at a Macro Level

In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4263 entries, 73412 to 301999
Data columns (total 13 columns):
 #   Column                                   Non-Null Count  Dtype 
---  ------                                   --------------  ----- 
 0   Ticket ID                                4263 non-null   int64 
 1   Ticket Created                           4263 non-null   object
 2   Date of Issue                            4240 non-null   object
 3   Time of Issue                            3209 non-null   object
 4   Form                                     4263 non-null   object
 5   Method                                   4252 non-null   object
 6   Issue                                    4263 non-null   object
 7   Caller ID Number                         3400 non-null   object
 8   Type of Call or Messge                   4237 non-null   object
 9   Advertiser Business Number               1781 non-null   object
 10  State                                    4262 non-null

In [29]:
df.head()

Unnamed: 0,Ticket ID,Ticket Created,Date of Issue,Time of Issue,Form,Method,Issue,Caller ID Number,Type of Call or Messge,Advertiser Business Number,State,Zip,Location (Center point of the Zip Code)
73412,971417,05/11/2016 01:57:31 PM +0000,05/11/2016,8:45 am,Phone,Wireless (cell phone/other mobile device),Telemarketing (including do not call and spoof...,305-587-2982,Live Voice,,VA,22191,"VA 22191\n(38.627945, -77.272331)"
424582,1750185,07/01/2017 07:25:48 PM +0000,06/30/2017,11:46 A.M.,Phone,Wired,Unwanted Calls,219-696-4928,Live Voice,219-696-4928,IN,46356,"IN 46356\n(41.268847, -87.420272)"
329486,568291,10/04/2015 02:00:46 AM +0000,10/03/2015,8:31 a.m.,Phone,Wired,Telemarketing (including do not call and spoof...,832-615-8408,Live Voice,,CA,95624,"CA 95624\n(38.427382, -121.347323)"
358777,1395977,01/13/2017 08:10:55 PM +0000,01/11/2017,,Phone,Wireless (cell phone/other mobile device),Unwanted Calls,740-990-8156,Live Voice,,WA,98204,"WA 98204\n(47.899424, -122.255648)"
89070,153613,02/27/2015 06:22:43 PM +0000,02/27/2015,11:00 A.M.,Phone,Wireless (cell phone/other mobile device),Telemarketing (including do not call and spoof...,,Live Voice,,NY,10044,"NY 10044\n(40.761127, -73.950758)"


#### Step 1.1: Discard, Format, Downsize Each Column

Column #0 "Ticket ID": Unless there are some duplicate ticket values that should be looked into further, I intend on discarding them.<sup>2</sup>

In [30]:
df["Ticket ID"].value_counts().sort_values(ascending=False).head(1)

1533953    1
Name: Ticket ID, dtype: int64

Clearly no duplicate values. Discarding Column #0 "Ticket ID."

In [31]:
df.drop(columns="Ticket ID", inplace=True)

<br/>Column #1 "Ticket Created": This column represents when the ticket was created in GMT time. This will be imported into GMT time and then converted to EST (FCC headquarters and our time zone). Also, we don't need seconds information.

In [32]:
def _convert_to_EST_or_nan(ts_str: str):
    try:
        ts = pd.to_datetime(ts_str)
        ts = ts.tz_convert("US/Eastern")
        ts = ts.floor('Min')
        return ts
    except:
        return pd.NaT
    
df["Ticket Created"] = df["Ticket Created"].apply(_convert_to_EST_or_nan)
df = df[df["Ticket Created"].notnull()]
df.index = df["Ticket Created"]
df.drop(columns="Ticket Created", inplace=True)

<br/>Column #2 "Date of Issue": This column represents the date when the user actually experienced the issue. The consumer zip code will be mapped to a time zone, to which a particular date will be mapped to. *I will return to this once I have processed the consumer time zone information.*<sup>2</sup>

<br/>Column #3 "Time of Issue": This column represents the time when the user actually experienced the issue. This information appears to be too specific for our analysis.<sup>2</sup>

In [33]:
df.drop(columns="Time of Issue", inplace=True)

<br/>Column #4 "Form": This represents the method through which the consumers were contacted by the reported caller. These should all be "Phone"; we don't need it.<sup>2</sup>

In [34]:
df["Form"].value_counts()

Phone    4263
Name: Form, dtype: int64

In [35]:
df.drop(columns="Form", inplace=True)

<br/>Column #5 "Method": How the reported caller contacted the consumer.<sup>2</sup>

In [36]:
df["Method"] = pd.Categorical(df["Method"], ordered=False)

In [37]:
df["Method"].cat.categories

Index(['Internet (VOIP)', 'Wired',
       'Wireless (cell phone/other mobile device)'],
      dtype='object')

There are three possibilities (nominal) categories: Internet (VOIP), which is something like Google Voice, wired, which is a landline, and wireless, which is a cell phone.<sup>2</sup>

<br/>Column #6 "Issue": 

In [38]:
df["Issue"] = pd.Categorical(df["Issue"], ordered=False)

In [39]:
df["Issue"].cat.categories

Index(['Robocalls', 'Telemarketing (including do not call and spoofing)',
       'Unwanted Calls'],
      dtype='object')

Looking close at the FCC descriptions of the different attributes, robocalls and telemarketing are currently being marked as unwanted calls instead of those two categories; hence, the values in this column are not consistent throughout across the reports/over time. This column will be removed, and the data set will be considered as containing data that represents reports of 'unwanted/spam' calls.<sup>2</sup>

<br/>Column #7 "Caller ID Number": number (of reported call from corresponding unwanted caller) that appeared on the consumer's caller ID

We will use the Python wrapper of google's phonenumbers library.<sup>3</sup>

Per "Please enter the phone number in the following format 555-555-5555" instructions on the FCC complaint form, all numbers are assumed to be USA, which makes sense since the FCC is a USA regulatory body which has jurisdiction in the USA.<sup>4</sup>

In [40]:
df = df_raw.sample(frac=0.01)

In [41]:
df.head()

Unnamed: 0,Ticket ID,Ticket Created,Date of Issue,Time of Issue,Form,Method,Issue,Caller ID Number,Type of Call or Messge,Advertiser Business Number,State,Zip,Location (Center point of the Zip Code)
315571,1263348,10/13/2016 10:24:04 PM +0000,10/13/2016,3:15 PM,Phone,Wired,Robocalls,888-856-2755,Prerecorded Voice,888-856-2755,CA,94901,"CA 94901\n(37.973771, -122.51209)"
66887,598294,10/18/2015 08:03:34 PM +0000,10/04/2015,,Phone,Wired,Telemarketing (including do not call and spoof...,844-776-6275,Abandoned Calls,,CA,0,"CA 00000\n(18.113421, -66.163661)"
400894,1722081,06/22/2017 06:57:25 PM +0000,06/22/2017,2:45 p.m.,Phone,Internet (VOIP),Unwanted Calls,978-233-3802,Abandoned Calls,,MA,1420,"MA 01420\n(42.579854, -71.809066)"
134668,320170,06/03/2015 04:52:33 AM +0000,06/02/2015,8:30 P.M.,Phone,Wired,Robocalls,213-921-4695,Prerecorded Voice,213-921-4695,CA,92128,"CA 92128\n(32.99986, -117.073175)"
388063,1505801,03/15/2017 04:11:01 PM +0000,03/15/2017,,Phone,Wired,Unwanted Calls,509-320-4229,Prerecorded Voice,509-320-4229,VA,23462,"VA 23462\n(36.836388, -76.150289)"


In [42]:
def _parse_ph_number(ph_num: str):
    try:
        ph_num = str(ph_num)
        ph_num_parsed = phonenumbers.parse(ph_num, "US")
    except phonenumbers.NumberParseException: 
        print(ph_num)
        return np.NaN
    if not phonenumbers.is_possible_number(ph_num_parsed):
        print(2)
        return np.NaN
    if not phonenumbers.is_valid_number(ph_num_parsed):
        print(3)
        return np.NaN
    return ph_num_parsed

In [None]:
# Disable
def blockPrint():
    sys.stdout = open(os.devnull, 'w')

# Restore
def enablePrint():
    sys.stdout = sys.__stdout__

blockPrint() and enablePrint() utilities used DIRECTLY from the source.<sup>5</sup>

In [None]:
blockPrint()

In [None]:
df["Caller ID Number"] = \
df["Caller ID Number"].apply(_parse_ph_number);

In [44]:
df["Caller ID Number"].head()

315571    Country Code: 1 National Number: 8888562755
66887     Country Code: 1 National Number: 8447766275
400894    Country Code: 1 National Number: 9782333802
134668    Country Code: 1 National Number: 2139214695
388063    Country Code: 1 National Number: 5093204229
Name: Caller ID Number, dtype: object

<br/>Column #8 "Type of Call or Messge": 

<br/>Column #9 "Advertiser Business Number": 

<br/>Column #10 "State": 

<br/>Column #11 "Zip": 

<br/>Column #12 "Location (Center point of the Zip Code)": 

### References
1. Data from https://data.world/kgarrett/unwanted-calls
2. Verified description of the variable using https://opendata.fcc.gov/Consumer/Consumer-Complaints-Data-Unwanted-Calls/vakf-fz8e
3. Python wrapper of Google's phonenumbers library: https://pypi.org/project/phonenumbers/
4. FCC unwanted call complaint form: https://consumercomplaints.fcc.gov/hc/en-us/requests/new?ticket_form_id=39744
5. Stack Overflow: https://stackoverflow.com/questions/8391411/how-to-block-calls-to-print