# Binary Classification: Voicemails
### Similar to Kaggle's "Titanic" problem, we will build a model that predicts one of two possible outcomes. We will use metadata (variables that describe details about the calls, without transcripts) together with already-labeled data to predict whether or not a call is a voicemail.


### As always, we will start by importing the necessary libraries

In [8]:
import pandas as pd 
import numpy as np
import numpy
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import *
from datetime import datetime
from dateutil.relativedelta import *
from time import time
import boto3
from sklearn.feature_selection import SelectKBest, f_classif
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar
from sklearn.feature_selection import chi2
from random import randrange
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix


### We will now access our data by downloading the 'output.csv' file from the data-science-tutorials bucket in S3 on AWS.

In [9]:
bucket = 'data-science-tutorials'
key = 'output.csv'

s3 = boto3.resource('s3')

s3.Bucket(bucket).download_file(key,key)

df = pd.read_csv('./output.csv')


### First, let's take a look at our dataframe:

In [10]:
print(df)

                CALLID                 date day_of_week  \
0     16050390083911f7  2016-05-03T12:04:43     Tuesday   
1     1605092f460485e7  2016-05-09T09:07:42      Monday   
2     160509fe3f1cc9c6  2016-05-09T15:56:09      Monday   
3     1605107036c49a76  2016-05-10T10:08:37     Tuesday   
4     1605047cea3ff9b4  2016-05-04T15:15:31   Wednesday   
5     160601f2ebcad14b  2016-06-01T13:59:04   Wednesday   
6     16060315848e0390  2016-06-03T10:45:30      Friday   
7     160601f300f8b0df  2016-06-01T09:50:38   Wednesday   
8     160609f173e2810b  2016-06-09T08:15:36    Thursday   
9     160609bf3d7779c4  2016-06-09T16:41:09    Thursday   
10    16071476e1540a8d  2016-07-14T15:53:41    Thursday   
11    160705b0e6e67685  2016-07-05T11:49:52     Tuesday   
12    1607163a321d9bf4  2016-07-16T11:35:00    Saturday   
13    1607076b6261523c  2016-07-07T16:01:20    Thursday   
14    16071359ec4acd90  2016-07-13T14:19:57   Wednesday   
15    160801981d2e1916  2016-08-01T10:20:39      Monday 

### Before we can build machine learning models, we must preprocess the data: clean it, create new variables, and apply any other transformations that may improve our model. Since the *date* feature is written as a date attached to a time, we will separate the two and create a new *time* feature.

In [11]:
#create the 'time' column by dropping the first 11 characters of 'date' (the date and the 'T' separator)
df['time'] = df['date'].map(lambda x: str(x)[11:])


### We can create a new feature out of *date* called *holiday*, a boolean indicating whether or not a given day is a U.S. federal holiday. To create it, however, the *date* feature needs to be in 'datetime' format so the calendar function will work.

In [12]:
#convert date to datetime
df['date'] = pd.to_datetime(df['date'], errors='coerce')

#check for holidays 
dr = pd.date_range(start='2016-05-03', end='2016-08-29')
cal = calendar()
holidays = cal.holidays(start=dr.min(), end=dr.max())
#normalize() resets each timestamp to midnight so it can match the holiday dates
df['holiday'] = df['date'].dt.normalize().isin(holidays)
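
### As a quick sanity check on the holiday logic, here is a minimal, self-contained sketch on three made-up timestamps (the `sample` series is hypothetical and not taken from 'output.csv'). It resets each timestamp to midnight before comparing it against the calendar, since the holiday dates carry no time component.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# Hypothetical timestamps: Memorial Day, Independence Day, and an ordinary Tuesday
sample = pd.to_datetime(pd.Series([
    '2016-05-30T08:00:00',
    '2016-07-04T09:15:00',
    '2016-07-05T11:49:52',
]))

# Federal holidays within the same window the tutorial uses
holidays = USFederalHolidayCalendar().holidays(start='2016-05-03', end='2016-08-29')

# normalize() sets the time to 00:00:00 so timestamps can match holiday dates
is_holiday = sample.dt.normalize().isin(holidays)
print(is_holiday.tolist())
```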


### Here we will transform *date* so that it contains only digits (YYYYMMDD).

In [13]:
#convert date to a string of digits in YYYYMMDD form
df['date'] = df['date'].dt.strftime('%Y%m%d')

### We will also create a new *hour* variable out of *time* by taking the first two characters and converting them into an integer. This is useful because it reduces each timestamp to one of 24 values, which gives the models a coarser, more general signal than the exact time.

In [14]:
#create new feature 'hour'
df['hour'] = df['time'].map(lambda x: str(x)[:2])
df['hour'] = df['hour'].astype(int)

### In order to process *day_of_week* in a machine learning model, each day of the week must be mapped to an integer.

In [15]:
weekday = {'Monday': 0, 'Tuesday': 1, 'Wednesday': 2, 'Thursday': 3, 'Friday': 4, 'Saturday': 5, 'Sunday': 6}
df['day_of_week'] = df['day_of_week'].map(weekday)

### We can create another variable out of *day_of_week* that outputs a boolean for whether or not it is a weekend. This could be useful to our model because there may be a correlation between whether or not a call goes to voicemail and whether or not it is the weekend. 

In [16]:
#create feature 'weekend'
weekend = [5,6]
df['weekend'] = df['day_of_week'].isin(weekend)
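
### The string slicing and dictionary mapping above can also be done with pandas' datetime accessors. The sketch below is an alternative, not the approach used in this tutorial. It assumes the timestamps are still in datetime form (so it would have to run before *date* is converted to digits), and the `calls` frame is a hypothetical two-row sample built from timestamps in the first printout.

```python
import pandas as pd

# Hypothetical frame with two timestamps taken from the first printout
calls = pd.DataFrame({'date': pd.to_datetime(['2016-05-03T12:04:43',
                                              '2016-07-16T11:35:00'])})

calls['hour'] = calls['date'].dt.hour              # 0-23
calls['day_of_week'] = calls['date'].dt.dayofweek  # Monday=0 ... Sunday=6
calls['weekend'] = calls['day_of_week'] >= 5       # Saturday or Sunday
print(calls)
```

### The integer codes match the `weekday` dictionary above (Monday is 0, Sunday is 6), so the two approaches should produce the same *day_of_week* and *weekend* columns.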


### Now that we are done creating new variables and adjusting old ones, we can visualize our dataframe. 

In [17]:
print(df)

                CALLID      date  day_of_week  call_duration_seconds  \
0     16050390083911f7  20160503            1                    103   
1     1605092f460485e7  20160509            0                    123   
2     160509fe3f1cc9c6  20160509            0                    335   
3     1605107036c49a76  20160510            1                    291   
4     1605047cea3ff9b4  20160504            2                    361   
5     160601f2ebcad14b  20160601            2                     73   
6     16060315848e0390  20160603            4                    134   
7     160601f300f8b0df  20160601            2                    181   
8     160609f173e2810b  20160609            3                    156   
9     160609bf3d7779c4  20160609            3                    121   
10    16071476e1540a8d  20160714            3                    148   
11    160705b0e6e67685  20160705            1                     46   
12    1607163a321d9bf4  20160716            5                   

### It's time to start machine learning!
### We are going to start off by using almost all of our predictors as X and the feature that we are predicting, *Voice Mail*, as y. 

In [18]:
predictors = ['date', 'call_duration_seconds', 'hour', 'day_of_week','holiday','weekend','switches']
X = df[predictors]
y = df['Voice Mail']


### Next we will score our features using SelectKBest. Each score corresponds to a feature in the predictors list and indicates how strongly that feature is associated with whether or not a call is a voicemail. Because k=7 equals the number of predictors, no feature is dropped at this step; we only inspect the scores. In this example, we use `f_classif` (the ANOVA F-value), which is the default scoring function.

In [19]:


# Perform feature selection
selector = SelectKBest(f_classif, k=7)
fit = selector.fit(X, y)
# summarize scores, labeling each score with its column name from predictors
numpy.set_printoptions(precision=3)
results = pd.DataFrame({
    'feature names': predictors,
    'Score': fit.scores_})
result_df = results.sort_values(by='Score', ascending=False)
result_df = result_df.set_index('Score')
print(result_df, "\n")

                    feature names
Score                            
320.722246               switches
274.982391  call_duration_seconds
245.292680                weekend
134.955448                   hour
69.824174             day_of_week
0.559400                  holiday
0.003179                     date 
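
### With k equal to the number of columns, SelectKBest only ranks features. The sketch below shows what happens when k is smaller, using synthetic data as a stand-in (the `X_demo` and `y_demo` arrays are assumptions made for illustration, not the call data): `get_support()` reports which columns survive and `transform()` returns only those columns.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n = 200
y_demo = rng.integers(0, 2, size=n)

# Column 0 tracks the label closely; columns 1 and 2 are pure noise
X_demo = np.column_stack([
    y_demo + rng.normal(scale=0.3, size=n),
    rng.normal(size=n),
    rng.normal(size=n),
])

selector_demo = SelectKBest(f_classif, k=1).fit(X_demo, y_demo)
print(selector_demo.scores_)        # one F-score per column
print(selector_demo.get_support())  # boolean mask of the columns that were kept
X_reduced = selector_demo.transform(X_demo)
print(X_reduced.shape)
```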



### Here we repeat the same procedure, this time scoring with the chi-squared test (`chi2`). Across both results, *call_duration_seconds*, *weekend*, and *switches* receive the highest scores, which suggests they are the most informative features for predicting voicemail.

In [20]:
# Perform feature selection
selector = SelectKBest(chi2, k=7)
fit = selector.fit(X, y)
# summarize scores, labeling each score with its column name from predictors
numpy.set_printoptions(precision=3)
results = pd.DataFrame({
    'feature names': predictors,
    'Score': fit.scores_})
result_df = results.sort_values(by='Score', ascending=False)
result_df = result_df.set_index('Score')
print(result_df, "\n")

                      feature names
Score                              
30273.063469  call_duration_seconds
21197.204312               switches
225.533137                  weekend
84.364875                      hour
83.066879               day_of_week
0.558666                    holiday
0.001209                       date 
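
### One caveat when comparing the two tables: scikit-learn's `chi2` score grows with the magnitude of a feature, while the F-score from `f_classif` does not depend on units. That may be part of the reason the chi-squared scores of the large-valued features stand out so much. The sketch below illustrates this on synthetic data (all names and values here are assumptions for illustration) by scoring the same feature twice, once multiplied by 100.

```python
import numpy as np
from sklearn.feature_selection import chi2, f_classif

rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=300)

# One non-negative feature that is somewhat larger for the positive class
feature = rng.uniform(0, 1, size=300) + 0.5 * y_demo
X_small = feature.reshape(-1, 1)
X_large = 100 * X_small   # same information, different units

chi2_small, _ = chi2(X_small, y_demo)
chi2_large, _ = chi2(X_large, y_demo)
f_small, _ = f_classif(X_small, y_demo)
f_large, _ = f_classif(X_large, y_demo)

print("chi2:", chi2_small[0], chi2_large[0])  # second value is 100x the first
print("F:   ", f_small[0], f_large[0])        # both values agree
```

### For this reason, chi-squared scores are easiest to compare between features that share a scale, such as counts or boolean indicators.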



### Since the *date* feature scores close to zero in both tests, we will drop it and redefine our variables.


In [21]:
predictors = ['call_duration_seconds', 'hour', 'day_of_week','holiday','weekend','switches']
X = df[predictors]
y = df['Voice Mail']

### While some features score better than others, we are going to keep all of the remaining ones. Here, we will use [K-Fold Cross Validation](https://machinelearningmastery.com/k-fold-cross-validation/) so that we get a more reliable accuracy estimate for each model.

In [22]:
#K-Fold Cross Validation

gaussian = GaussianNB()
gaussian_scores = cross_val_score(gaussian, X, y, cv=10, scoring = "accuracy")

print("GaussianNB scores:", gaussian_scores)
print("Mean:", gaussian_scores.mean())
print("Standard Deviation:", gaussian_scores.std(), "\n")

sgd = linear_model.SGDClassifier(max_iter=5, tol=None)
sgd_scores = cross_val_score(sgd, X, y, cv=10, scoring = "accuracy")

print("SGD scores:", sgd_scores)
print("Mean:", sgd_scores.mean())
print("Standard Deviation:", sgd_scores.std(), "\n")


knn = KNeighborsClassifier(n_neighbors = 3)
knn_scores = cross_val_score(knn, X, y, cv=10, scoring = "accuracy")

print("KNN scores:", knn_scores)
print("Mean:", knn_scores.mean())
print("Standard Deviation:", knn_scores.std(), "\n")



decision_tree = DecisionTreeClassifier()
dt_scores = cross_val_score(decision_tree, X, y, cv=10, scoring = "accuracy")

print("Decision tree scores:", dt_scores)
print("Mean:", dt_scores.mean())
print("Standard Deviation:", dt_scores.std(), "\n")


logreg = LogisticRegression()
logreg_scores = cross_val_score(logreg, X, y, cv=10, scoring = "accuracy")

print("Logistic Regression scores:", logreg_scores)
print("Mean:", logreg_scores.mean())
print("Standard Deviation:", logreg_scores.std(), "\n")

rf = RandomForestClassifier(n_estimators=100)
rf_scores = cross_val_score(rf, X, y, cv=10, scoring = "accuracy")

print("Random Forest Scores:", rf_scores)
print("Mean:", rf_scores.mean())
print("Standard Deviation:", rf_scores.std())
print("\n")

GaussianNB scores: [0.802 0.786 0.784 0.772 0.786 0.794 0.782 0.758 0.621 0.782]
Mean: 0.7667142996571987
Standard Deviation: 0.049805342176859155 

SGD scores: [0.844 0.24  0.776 0.85  0.832 0.794 0.826 0.545 0.864 0.866]
Mean: 0.7436381433525734
Standard Deviation: 0.19039603346955356 

KNN scores: [0.834 0.854 0.856 0.84  0.85  0.86  0.838 0.852 0.85  0.864]
Mean: 0.8497753015012058
Standard Deviation: 0.009137030870610633 

Decision tree scores: [0.828 0.832 0.818 0.82  0.824 0.836 0.82  0.79  0.828 0.822]
Mean: 0.8217556398225592
Standard Deviation: 0.012056866493128177 

Logistic Regression scores: [0.852 0.834 0.834 0.836 0.844 0.846 0.836 0.816 0.85  0.828]
Mean: 0.8375612718450874
Standard Deviation: 0.010412213656773677 

Random Forest Scores: [0.862 0.87  0.868 0.868 0.876 0.896 0.858 0.866 0.882 0.888]
Mean: 0.8733805471221885
Standard Deviation: 0.011322659683253786
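
### The cell above repeats the same three steps for every model. As a more compact alternative, the sketch below loops over a dictionary of models and collects the mean and standard deviation into one dataframe. It runs on synthetic data from `make_classification` as a stand-in for X and y (an assumption made so the example is self-contained), so its numbers are not comparable to the results above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y (the real call data is not used here)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

models = {
    'Naive Bayes': GaussianNB(),
    'KNN': KNeighborsClassifier(n_neighbors=3),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
}

rows = []
for name, model in models.items():
    # 10-fold cross-validated accuracy, as in the cell above
    scores = cross_val_score(model, X_demo, y_demo, cv=10, scoring='accuracy')
    rows.append({'Model': name, 'Mean': scores.mean(), 'Std': scores.std()})

summary = pd.DataFrame(rows).sort_values(by='Mean', ascending=False)
print(summary)
```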




### We can also create a dataframe with the mean scores for each model so we can easily visualize our results. The Random Forest Classifier seems to be the best model, with an accuracy score of around 87%. 

In [23]:
results = pd.DataFrame({
    'Model 1': ['Naive Bayes','Stochastic Gradient Descent', 'KNN', 'Decision Tree', 
              'Logistic Regression', 'Random Forest'],
    'Score': [gaussian_scores.mean(), sgd_scores.mean(), knn_scores.mean(), dt_scores.mean(), logreg_scores.mean(), 
             rf_scores.mean()]})
result_df = results.sort_values(by='Score', ascending=False)
result_df = result_df.set_index('Score')
print(result_df, "\n")


                              Model 1
Score                                
0.873381                Random Forest
0.849775                          KNN
0.837561          Logistic Regression
0.821756                Decision Tree
0.766714                  Naive Bayes
0.743638  Stochastic Gradient Descent 



### A [confusion matrix](https://machinelearningmastery.com/confusion-matrix-machine-learning/) is useful because it gives you insight not only into the errors being made by your classifier but more importantly the types of errors that are being made. 

In [24]:
#confusion matrix on random forest model
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
predictions = cross_val_predict(rf, X, y, cv=3)
print("confusion matrix:")
print(confusion_matrix(y, predictions))

confusion matrix:
[[3600  212]
 [ 394  793]]


### Reading the matrix with rows as the actual class and columns as the predicted class: about 3,600 calls were correctly classified as not being a voicemail, and about 210 non-voicemail calls were incorrectly classified as voicemails. On the other hand, about 390 voicemails were incorrectly classified as not being a voicemail, and about 790 voicemails were correctly classified as voicemails.
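
### The four counts can be turned into summary metrics directly. The sketch below copies the numbers from the confusion matrix printed above and assumes, as the interpretation above does, that the second row and column correspond to the voicemail class.

```python
import numpy as np

# Counts copied from the confusion matrix printed above
cm = np.array([[3600, 212],
               [394, 793]])

# scikit-learn lays the matrix out as rows = actual class, columns = predicted class
# (assumption: the second class is "voicemail")
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the calls predicted as voicemail, the share that were
recall = tp / (tp + fn)      # of the actual voicemails, the share that were found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

### Under that assumption, accuracy is about 0.88, precision about 0.79, and recall about 0.67, so roughly a third of the actual voicemails are missed even though overall accuracy looks high.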