# Telecom Churn Analysis

## Overview

Most telecom companies suffer from voluntary churn (the loss of customers to a competitor). The churn rate has a strong impact on customer lifetime value because it affects the length of service and therefore the company's future revenue. For example, a company with a 25% annual churn rate has an average customer lifetime of 4 years; similarly, a company with a 50% churn rate has an average customer lifetime of 2 years. It is estimated that 75 percent of the 17 to 20 million subscribers who sign up with a new wireless carrier every year come from another wireless provider, which means they are churners. Telecom companies spend hundreds of dollars to acquire a new customer, and when that customer leaves, the company loses not only the future revenue from that customer but also the resources spent acquiring them. Churn erodes profitability.
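
The lifetime figures above follow from a simple rule of thumb: if a constant fraction of customers leaves each year, the expected customer lifetime is the reciprocal of the annual churn rate. A minimal sketch, assuming a constant churn rate (the function name is illustrative and not part of the analysis):

```python
def average_lifetime_years(annual_churn_rate: float) -> float:
    """Expected customer lifetime in years, assuming a constant annual churn rate."""
    if not 0 < annual_churn_rate <= 1:
        raise ValueError("annual_churn_rate must be in (0, 1]")
    return 1 / annual_churn_rate


for rate in (0.25, 0.50):
    print(f"churn {rate:.0%} -> {average_lifetime_years(rate):.0f} years")
```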

Telecom companies have used two approaches to address churn: (a) an untargeted approach and (b) a targeted approach. The untargeted approach relies on a superior product and mass advertising to increase brand loyalty and thus retain customers. The targeted approach relies on identifying customers who are likely to churn and providing a suitable intervention to encourage them to stay.<a name="cite_ref-1"></a>[<sup>[1]</sup>](#cite_note-1)

<a name="cite_note-1"></a>1. [^](#cite_ref-1) Data from CrowdAnalytix, https://www.crowdanalytix.com/contests/why-customer-churn


## Business Understanding

SyriaTel is a smaller wireless provider that does not have the budget for a large-scale advertising campaign. However, it also cannot afford to keep losing customers, so it has turned to us to build a model that predicts which customers will churn. When a customer is flagged as likely to churn, SyriaTel will contact that customer with promotional deals in the hope of retaining them.

In this scenario we split the data into two parts: (1) the columns used to predict churn and (2) the target column, `churn`, which records whether the customer churned. A value of "True" means the customer churned, and "False" means the customer did not.

In general, it is important to decide beforehand whether a false positive or a false negative is worse, and to tune the model accordingly. In our case, a false positive means we predicted that a customer would churn when they did not. As a result, we send them promotional deals and perhaps a discount, which causes a slight loss of profit because these customers were already happy paying full price. A false negative, on the other hand, means we predicted that a customer would not churn when in fact they did. We then lose the customer without ever having sent them a promotional deal to try to keep them. The loss from this mistake is far greater than the loss from a false positive. As such, when building models we will try to minimize false negatives (that is, maximize recall) as much as possible.
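
To make the trade-off concrete, the sketch below compares the total cost of the mistakes made by two hypothetical models. The dollar amounts and error counts are made-up placeholders, not figures from SyriaTel or the dataset; the point is only that when a false negative costs far more than a false positive, a model with more false alarms but fewer missed churners can be cheaper overall.

```python
# Hypothetical costs, chosen only for illustration.
COST_FALSE_POSITIVE = 20    # discount given to a customer who would have stayed
COST_FALSE_NEGATIVE = 400   # revenue lost when a churner is never contacted


def total_error_cost(false_positives: int, false_negatives: int) -> int:
    """Total cost of a model's mistakes under the assumed per-error costs."""
    return (false_positives * COST_FALSE_POSITIVE
            + false_negatives * COST_FALSE_NEGATIVE)


def recall(true_positives: int, false_negatives: int) -> float:
    """Share of actual churners that the model caught."""
    return true_positives / (true_positives + false_negatives)


# Two made-up models, each evaluated on 100 actual churners.
cautious = {"tp": 70, "fp": 5, "fn": 30}     # few false alarms, misses many churners
aggressive = {"tp": 85, "fp": 25, "fn": 15}  # more false alarms, misses fewer churners

for name, m in (("cautious", cautious), ("aggressive", aggressive)):
    print(name, recall(m["tp"], m["fn"]), total_error_cost(m["fp"], m["fn"]))
```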


Photo by <a href="https://unsplash.com/@giggiulena?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Mario Caruso</a> on <a href="https://unsplash.com/photos/0C9VmZUqcT8?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
  

This public dataset is provided by the CrowdAnalytix community as part of their churn prediction competition. The real name of the telecom company has been anonymized. The dataset contains 20 predictor variables, mostly about customer usage patterns. There are 3333 records, of which 483 are churners and the remaining 2850 are non-churners. Thus, churners make up roughly 14.5% of the dataset.

The company name and the year of the data have been anonymized, but the dataset was used in a competition in 2012, so it is presumably from around that time. Data source: https://www.crowdanalytix.com/contests/why-customer-churn

In [103]:
import pandas as pd
import numpy as np
np.random.seed(42)
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import precision_score, recall_score,\
classification_report, confusion_matrix
from sklearn.model_selection import train_test_split, GridSearchCV,\
cross_val_score, RandomizedSearchCV

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from sklearn.ensemble import VotingClassifier, BaggingClassifier, RandomForestClassifier, \
ExtraTreesClassifier, StackingClassifier, AdaBoostClassifier, GradientBoostingClassifier

import xgboost

In [2]:
!ls data

archive.zip                        bigml_59c28831336c6604c800002a.csv


In [3]:
df = pd.read_csv('data/bigml_59c28831336c6604c800002a.csv')
df.head()

Unnamed: 0,state,account length,area code,phone number,international plan,voice mail plan,number vmail messages,total day minutes,total day calls,total day charge,...,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls,churn
0,KS,128,415,382-4657,no,yes,25,265.1,110,45.07,...,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False
1,OH,107,415,371-7191,no,yes,26,161.6,123,27.47,...,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False
2,NJ,137,415,358-1921,no,no,0,243.4,114,41.38,...,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False
3,OH,84,408,375-9999,yes,no,0,299.4,71,50.9,...,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False
4,OK,75,415,330-6626,yes,no,0,166.7,113,28.34,...,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   state                   3333 non-null   object 
 1   account length          3333 non-null   int64  
 2   area code               3333 non-null   int64  
 3   phone number            3333 non-null   object 
 4   international plan      3333 non-null   object 
 5   voice mail plan         3333 non-null   object 
 6   number vmail messages   3333 non-null   int64  
 7   total day minutes       3333 non-null   float64
 8   total day calls         3333 non-null   int64  
 9   total day charge        3333 non-null   float64
 10  total eve minutes       3333 non-null   float64
 11  total eve calls         3333 non-null   int64  
 12  total eve charge        3333 non-null   float64
 13  total night minutes     3333 non-null   float64
 14  total night calls       3333 non-null   

In [5]:
df['area code'] = df['area code'].astype(object)

df.dtypes

state                      object
account length              int64
area code                  object
phone number               object
international plan         object
voice mail plan            object
number vmail messages       int64
total day minutes         float64
total day calls             int64
total day charge          float64
total eve minutes         float64
total eve calls             int64
total eve charge          float64
total night minutes       float64
total night calls           int64
total night charge        float64
total intl minutes        float64
total intl calls            int64
total intl charge         float64
customer service calls      int64
churn                        bool
dtype: object

In [6]:
df.churn.value_counts()

False    2850
True      483
Name: churn, dtype: int64
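
Because the classes are imbalanced, accuracy alone can be misleading. As a quick check using only the counts shown above, a model that always predicts "no churn" would already be right about 85.5% of the time while catching no churners at all:

```python
n_stay, n_churn = 2850, 483   # counts from df.churn.value_counts()
n_total = n_stay + n_churn

churn_ratio = n_churn / n_total
baseline_accuracy = n_stay / n_total   # always predict False (no churn)
baseline_recall = 0 / n_churn          # no churner is ever flagged

print(f"churn ratio:       {churn_ratio:.3f}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"baseline recall:   {baseline_recall:.3f}")
```

Any accuracy figure later in the notebook is best read against this baseline.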

In [7]:
y = df.churn
X = df.drop(['churn', 'phone number'], axis = 1)

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

In [9]:
X_train_cat = X_train.select_dtypes('object')

ohe = OneHotEncoder(drop='first', sparse=False)

dums = ohe.fit_transform(X_train_cat)
dums_df = pd.DataFrame(dums, columns=ohe.get_feature_names(), index=X_train_cat.index)

In [10]:
dums_df.head()

Unnamed: 0,x0_AL,x0_AR,x0_AZ,x0_CA,x0_CO,x0_CT,x0_DC,x0_DE,x0_FL,x0_GA,...,x0_VA,x0_VT,x0_WA,x0_WI,x0_WV,x0_WY,x1_415,x1_510,x2_yes,x3_yes
817,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1373,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
679,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
56,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1993,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [11]:
X_train_nums = X_train.select_dtypes(['int64', 'float64'])

ss = StandardScaler()

ss.fit(X_train_nums)

nums_df = pd.DataFrame(ss.transform(X_train_nums), columns=X_train_nums.columns, index=X_train_nums.index)

In [12]:
nums_df.head()

Unnamed: 0,account length,number vmail messages,total day minutes,total day calls,total day charge,total eve minutes,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls
817,3.601382,-0.584936,-1.547653,-0.429657,-1.54717,-0.729987,-1.840891,-0.731087,1.255804,0.925634,1.256197,-1.300791,0.634849,-1.304132,0.318978
1373,0.184951,-0.584936,-1.244014,0.224176,-1.244071,-0.138082,0.499864,-0.139179,0.16509,-0.353704,0.164841,-2.194793,-0.18437,-2.191525,1.813519
679,-0.650176,-0.584936,0.787609,-1.133785,0.787772,2.491952,0.549667,2.493068,0.147339,0.209205,0.147309,-0.549828,1.863677,-0.549186,-0.428293
56,1.020079,-0.584936,-0.969818,-0.127888,-0.9702,-0.408385,-1.890695,-0.408439,-1.178086,1.437368,-1.176344,-0.800149,-1.003589,-0.800835,-0.428293
1993,-0.371801,-0.584936,0.675354,-0.228477,0.675192,1.29433,-1.143645,1.295326,0.26568,0.516246,0.265649,-2.051753,-0.59398,-2.045833,-1.175564
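
`StandardScaler` rescales each numeric column to zero mean and unit variance using statistics learned from the training set only; the same training mean and standard deviation are then reused for the test set. A minimal pure-Python sketch of that idea on made-up numbers (the helper names are illustrative):

```python
from statistics import fmean, pstdev


def fit_scaler(train_values):
    """Learn the mean and population standard deviation from training data."""
    return fmean(train_values), pstdev(train_values)


def transform(values, mean, std):
    """Apply z = (x - mean) / std using previously learned statistics."""
    return [(x - mean) / std for x in values]


train_minutes = [100.0, 150.0, 200.0, 250.0, 300.0]   # toy training column
test_minutes = [200.0, 350.0]                          # toy test column

mean, std = fit_scaler(train_minutes)
train_scaled = transform(train_minutes, mean, std)
test_scaled = transform(test_minutes, mean, std)   # reuses the training statistics

print(train_scaled)
print(test_scaled)
```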


In [13]:
X_train_clean = pd.concat([nums_df, dums_df], axis=1)

X_train_clean.head()

Unnamed: 0,account length,number vmail messages,total day minutes,total day calls,total day charge,total eve minutes,total eve calls,total eve charge,total night minutes,total night calls,...,x0_VA,x0_VT,x0_WA,x0_WI,x0_WV,x0_WY,x1_415,x1_510,x2_yes,x3_yes
817,3.601382,-0.584936,-1.547653,-0.429657,-1.54717,-0.729987,-1.840891,-0.731087,1.255804,0.925634,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1373,0.184951,-0.584936,-1.244014,0.224176,-1.244071,-0.138082,0.499864,-0.139179,0.16509,-0.353704,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
679,-0.650176,-0.584936,0.787609,-1.133785,0.787772,2.491952,0.549667,2.493068,0.147339,0.209205,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
56,1.020079,-0.584936,-0.969818,-0.127888,-0.9702,-0.408385,-1.890695,-0.408439,-1.178086,1.437368,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1993,-0.371801,-0.584936,0.675354,-0.228477,0.675192,1.29433,-1.143645,1.295326,0.26568,0.516246,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [14]:
X_test_cat = X_test.select_dtypes('object')

test_dums = ohe.transform(X_test_cat)

test_dums_df = pd.DataFrame(test_dums, columns=ohe.get_feature_names(), index=X_test_cat.index)

In [15]:
X_test_nums = X_test.select_dtypes(['int64','float64'])

test_nums = ss.transform(X_test_nums)

test_nums_df = pd.DataFrame(test_nums, columns=X_test_nums.columns, index=X_test_nums.index)

In [16]:
X_test_clean = pd.concat([test_nums_df, test_dums_df], axis=1)

X_test_clean.head()

Unnamed: 0,account length,number vmail messages,total day minutes,total day calls,total day charge,total eve minutes,total eve calls,total eve charge,total night minutes,total night calls,...,x0_VA,x0_VT,x0_WA,x0_WI,x0_WV,x0_WY,x1_415,x1_510,x2_yes,x3_yes
438,0.311486,-0.584936,-0.452712,-0.379362,-0.452767,2.56298,0.300651,2.562705,-0.21952,1.181501,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
2674,-0.852632,-0.584936,-1.297381,0.827714,-1.297113,0.329524,1.19711,0.329704,-0.239243,2.102624,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1345,-0.068118,-0.584936,-3.30508,-5.056782,-3.305141,-0.810881,1.49593,-0.810008,-0.659356,-0.609571,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1957,1.17192,-0.584936,0.610946,-1.08349,0.611325,0.067112,-0.446399,0.067408,-0.874343,0.669766,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2148,-0.118732,-0.584936,-0.655138,0.073292,-0.655194,0.473554,-1.342858,0.473619,0.535893,-0.456051,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


### Logistic Regression

In [17]:
lr = LogisticRegression(random_state=42)

lr.fit(X_train_clean, y_train)

LogisticRegression(random_state=42)

In [18]:
scores = cross_val_score(estimator=lr, X = X_train_clean, y= y_train, cv=10)
scores

array([0.86142322, 0.86891386, 0.87265918, 0.86516854, 0.86142322,
       0.86142322, 0.84210526, 0.84962406, 0.84586466, 0.87593985])

In [19]:
np.median(scores)

0.8614232209737828

In [20]:
lr.score(X_test_clean, y_test)

0.8530734632683659
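
`cross_val_score` and `.score` report accuracy for classifiers unless told otherwise. Since the stated goal is to maximize recall, it may be worth cross-validating on recall as well. The sketch below shows this on synthetic data from `make_classification` (an assumption made to keep the example self-contained; with the notebook's data one would pass `X_train_clean` and `y_train` instead):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the churn data (roughly 15% positives).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.85, 0.15], random_state=42
)

model = LogisticRegression(max_iter=1000, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Same folds, two different metrics.
accuracy_scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
recall_scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="recall")

print("accuracy per fold:", accuracy_scores.round(3))
print("recall per fold:  ", recall_scores.round(3))
```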

### Logistic Regression Pipeline

In [21]:
df_cat = X_train.select_dtypes(include=['object'])

# select numerical columns
df_num = X_train.select_dtypes(include=['int64', 'float64'])

In [22]:
num_pipe = Pipeline(steps=[('ss', StandardScaler())])


cat_pipe = Pipeline(steps=[('ohe', OneHotEncoder(drop='first', sparse=False))])

transformer = ColumnTransformer(transformers=[
    ('numerical', num_pipe, df_num.columns),
    ('categorical', cat_pipe, df_cat.columns)
])

In [23]:
model_pipe_lr = Pipeline(steps=[
    ('transformer', transformer),
    ('lr', LogisticRegression(random_state=42))
])

model_pipe_lr.fit(X_train, y_train)

Pipeline(steps=[('transformer',
                 ColumnTransformer(transformers=[('numerical',
                                                  Pipeline(steps=[('ss',
                                                                   StandardScaler())]),
                                                  Index(['account length', 'number vmail messages', 'total day minutes',
       'total day calls', 'total day charge', 'total eve minutes',
       'total eve calls', 'total eve charge', 'total night minutes',
       'total night calls', 'total night charge', 'total intl minutes',
       'total intl calls', 'total intl charge', 'customer service calls'],
      dtype='object')),
                                                 ('categorical',
                                                  Pipeline(steps=[('ohe',
                                                                   OneHotEncoder(drop='first',
                                                                                 s

In [24]:
model_pipe_lr.score(X_train, y_train)

0.872093023255814

In [25]:
model_pipe_lr.score(X_test, y_test)

0.8530734632683659

### KNN

In [26]:
knn = KNeighborsClassifier()

knn.fit(X_train_clean, y_train)

KNeighborsClassifier()

In [27]:
scores = cross_val_score(estimator=knn, X = X_train_clean, y= y_train, cv=10)
scores

array([0.87640449, 0.86516854, 0.88764045, 0.88389513, 0.88389513,
       0.89138577, 0.86842105, 0.87218045, 0.87969925, 0.90225564])

In [28]:
np.median(scores)

0.8817971896032215

In [29]:
knn.score(X_test_clean, y_test)

0.8935532233883059

In [30]:
grid = {
    'n_neighbors': [3 ,5 , 11],
    'metric': ['manhattan', 'minkowski'],
    'weights': ['uniform', 'distance']
}

In [31]:
knn

KNeighborsClassifier()

In [32]:
gs = GridSearchCV(estimator=knn, param_grid= grid, cv=5)

In [33]:
gs.fit(X_train_clean, y_train)

GridSearchCV(cv=5, estimator=KNeighborsClassifier(),
             param_grid={'metric': ['manhattan', 'minkowski'],
                         'n_neighbors': [3, 5, 11],
                         'weights': ['uniform', 'distance']})

In [34]:
gs.best_params_

{'metric': 'minkowski', 'n_neighbors': 5, 'weights': 'uniform'}

These happen to be the same as the default settings used by our original model.

In [35]:
gs.best_estimator_.score(X_test_clean, y_test)

0.8935532233883059

### Pipeline for KNN

In [36]:
model_pipe_knn = Pipeline(steps=[
    ('transformer', transformer),
    ('knn', KNeighborsClassifier())
])

model_pipe_knn.fit(X_train, y_train)

Pipeline(steps=[('transformer',
                 ColumnTransformer(transformers=[('numerical',
                                                  Pipeline(steps=[('ss',
                                                                   StandardScaler())]),
                                                  Index(['account length', 'number vmail messages', 'total day minutes',
       'total day calls', 'total day charge', 'total eve minutes',
       'total eve calls', 'total eve charge', 'total night minutes',
       'total night calls', 'total night charge', 'total intl minutes',
       'total intl calls', 'total intl charge', 'customer service calls'],
      dtype='object')),
                                                 ('categorical',
                                                  Pipeline(steps=[('ohe',
                                                                   OneHotEncoder(drop='first',
                                                                                 s

In [37]:
model_pipe_knn.score(X_train, y_train)

0.9036009002250562

In [38]:
model_pipe_knn.score(X_test, y_test)

0.8935532233883059

In [40]:
param_grid = {
    'knn__n_neighbors': [3, 5, 11],
    'knn__metric': ['manhattan', 'minkowski'],
    'knn__weights': ['uniform', 'distance']
}

gs = GridSearchCV(model_pipe_knn, param_grid=param_grid, cv=5)
gs.fit(X_train, y_train)
gs.best_params_

{'knn__metric': 'manhattan', 'knn__n_neighbors': 3, 'knn__weights': 'uniform'}

In [41]:
gs.best_estimator_.score(X_test, y_test)

0.8875562218890555

### Decision Trees

In [42]:
clf = DecisionTreeClassifier(random_state=42)

clf.fit(X_train_clean, y_train)

DecisionTreeClassifier(random_state=42)

In [43]:
clf.score(X_train_clean, y_train)

1.0

In [44]:
clf.score(X_test_clean, y_test)

0.9280359820089955

In [45]:
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 3, 5, 7, 9],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

In [46]:
clf

DecisionTreeClassifier(random_state=42)

In [47]:
gs = GridSearchCV(clf, param_grid=param_grid, cv=5)

In [48]:
gs.fit(X_train_clean, y_train)

GridSearchCV(cv=5, estimator=DecisionTreeClassifier(random_state=42),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [None, 3, 5, 7, 9],
                         'min_samples_leaf': [1, 2, 4],
                         'min_samples_split': [2, 5, 10]})

In [49]:
gs.best_params_

{'criterion': 'entropy',
 'max_depth': 7,
 'min_samples_leaf': 2,
 'min_samples_split': 10}

In [50]:
gs.best_estimator_.score(X_test_clean, y_test)

0.9385307346326837

### Pipeline for Decision Tree

In [51]:
model_pipe_tree = Pipeline(steps=[
    ('transformer', transformer),
    ('dt', DecisionTreeClassifier(random_state=42))
]) 

In [52]:
model_pipe_tree.fit(X_train, y_train)

Pipeline(steps=[('transformer',
                 ColumnTransformer(transformers=[('numerical',
                                                  Pipeline(steps=[('ss',
                                                                   StandardScaler())]),
                                                  Index(['account length', 'number vmail messages', 'total day minutes',
       'total day calls', 'total day charge', 'total eve minutes',
       'total eve calls', 'total eve charge', 'total night minutes',
       'total night calls', 'total night charge', 'total intl minutes',
       'total intl calls', 'total intl charge', 'customer service calls'],
      dtype='object')),
                                                 ('categorical',
                                                  Pipeline(steps=[('ohe',
                                                                   OneHotEncoder(drop='first',
                                                                                 s

In [53]:
model_pipe_tree.score(X_train, y_train)

1.0

In [54]:
model_pipe_tree.score(X_test, y_test)

0.9280359820089955

In [55]:
# Define the hyperparameter grid to search over
param_grid = {
    'dt__criterion': ['gini', 'entropy'],
    'dt__max_depth': [None, 3, 5, 7, 9],
    'dt__min_samples_split': [2, 5, 10],
    'dt__min_samples_leaf': [1, 2, 4],
}

In [56]:
# Create a GridSearchCV object
grid_search = GridSearchCV(
    estimator=model_pipe_tree,
    param_grid=param_grid,
    cv=5,
    n_jobs=-1
)

# Fit the GridSearchCV object to the training data
grid_search.fit(X_train, y_train)

# Print the best hyperparameters and corresponding score
print("Best parameters: ", grid_search.best_params_)
print("Best score: ", grid_search.best_score_)

Best parameters:  {'dt__criterion': 'entropy', 'dt__max_depth': 7, 'dt__min_samples_leaf': 2, 'dt__min_samples_split': 10}
Best score:  0.9403594943468881
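
Inside a `Pipeline`, hyperparameters are addressed as `<step name>__<parameter>`, which is why the grid keys above carry the `dt__` prefix. The small helpers below (hypothetical, not part of scikit-learn) add that prefix to a plain grid and count how many model fits the search implies:

```python
from math import prod


def prefix_grid(grid: dict, step_name: str) -> dict:
    """Rename each key to '<step_name>__<key>' for use with a Pipeline."""
    return {f"{step_name}__{key}": values for key, values in grid.items()}


def count_fits(grid: dict, cv: int) -> int:
    """Number of fits a grid search performs, excluding the final refit."""
    return prod(len(values) for values in grid.values()) * cv


tree_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [None, 3, 5, 7, 9],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

pipeline_grid = prefix_grid(tree_grid, "dt")
print(sorted(pipeline_grid))
print(count_fits(pipeline_grid, cv=5))
```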


### Voting Classifier

In [57]:
w_avg = VotingClassifier(estimators=[
    ('lr', LogisticRegression()), 
    ('knn', KNeighborsClassifier(metric= 'minkowski', 
                                 n_neighbors= 5, 
                                 weights= 'uniform')),
    ('dt', DecisionTreeClassifier(criterion= 'entropy',
                                  max_depth= 7,
                                  min_samples_leaf= 2,
                                  min_samples_split= 10))
])

w_avg.fit(X_train_clean, y_train)

VotingClassifier(estimators=[('lr', LogisticRegression()),
                             ('knn', KNeighborsClassifier()),
                             ('dt',
                              DecisionTreeClassifier(criterion='entropy',
                                                     max_depth=7,
                                                     min_samples_leaf=2,
                                                     min_samples_split=10))])

In [58]:
scores = cross_val_score(estimator=w_avg, X=X_train_clean, y=y_train, cv=5)
scores

array([0.89513109, 0.89868668, 0.90243902, 0.87617261, 0.9043152 ])

In [59]:
np.median(scores)

0.8986866791744841

In [60]:
w_avg.score(X_train_clean, y_train)

0.9208552138034508

In [61]:
w_avg.score(X_test_clean, y_test)

0.9070464767616192

Now let's try again with the same three models, this time weighting their votes.

In [62]:
w_avg = VotingClassifier(estimators=[
    ('lr', LogisticRegression()), 
    ('knn', KNeighborsClassifier(metric= 'minkowski', 
                                 n_neighbors= 5, 
                                 weights= 'uniform')),
    ('dt', DecisionTreeClassifier(criterion= 'entropy',
                                  max_depth= 7,
                                  min_samples_leaf= 2,
                                  min_samples_split= 10))],
    weights=[.2, .3, .5]
)

w_avg.fit(X_train_clean, y_train)

VotingClassifier(estimators=[('lr', LogisticRegression()),
                             ('knn', KNeighborsClassifier()),
                             ('dt',
                              DecisionTreeClassifier(criterion='entropy',
                                                     max_depth=7,
                                                     min_samples_leaf=2,
                                                     min_samples_split=10))],
                 weights=[0.2, 0.3, 0.5])

In [63]:
w_avg.score(X_train_clean, y_train)

0.9204801200300075

In [64]:
w_avg.score(X_test_clean, y_test)

0.9055472263868066
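
With the default `voting='hard'`, `VotingClassifier` takes a weighted majority vote over the predicted labels. The simplified pure-Python sketch below illustrates the idea; it is not scikit-learn's implementation. Note that with weights `[.2, .3, .5]` the tree's vote exactly balances the other two models combined, so whenever the tree disagrees with both of them the result is a tie. scikit-learn documents that ties are resolved by ascending class order, which the sketch mimics.

```python
from collections import defaultdict


def weighted_hard_vote(labels, weights):
    """Return the label with the largest total weight; ties go to the smallest label."""
    totals = defaultdict(float)
    for label, weight in zip(labels, weights):
        totals[label] += weight
    best_label, best_total = None, float("-inf")
    for label in sorted(totals):          # ascending order, so ties keep the smaller label
        if totals[label] > best_total:
            best_label, best_total = label, totals[label]
    return best_label


weights = [2, 3, 5]   # same proportions as [.2, .3, .5], kept as integers to avoid float noise

# Order of votes: logistic regression, KNN, decision tree (1 = churn, 0 = no churn).
print(weighted_hard_vote([0, 0, 1], weights))   # tree alone vs. the other two: a 5-5 tie
print(weighted_hard_vote([0, 1, 1], weights))   # KNN and tree agree: 8 vs. 2
```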

So far our best model has been DecisionTreeClassifier with the following parameters:
- Criterion: 'entropy'
- max_depth: 7
- min_samples_leaf: 2
- min_samples_split: 10

This model reaches an accuracy of 0.94 on the test set.

Before we move on let's check out the classification report of this model.

In [65]:
dt = DecisionTreeClassifier(criterion='entropy', 
                            max_depth=7, 
                            min_samples_leaf=2, 
                            min_samples_split=10)
dt.fit(X_train_clean, y_train)
y_preds = dt.predict(X_test_clean)
print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

       False       0.95      0.98      0.96       566
        True       0.86      0.71      0.78       101

    accuracy                           0.94       667
   macro avg       0.90      0.85      0.87       667
weighted avg       0.94      0.94      0.94       667



### Bagging

In [66]:
bag = BaggingClassifier(n_estimators=100, verbose=1, random_state=42)

In [67]:
bag.fit(X_train_clean, y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.4s finished


BaggingClassifier(n_estimators=100, random_state=42, verbose=1)

In [68]:
scores = cross_val_score(estimator=bag, X=X_train_clean, y=y_train, cv=5)
scores

array([0.94569288, 0.94559099, 0.93621013, 0.94183865, 0.9587242 ])

In [69]:
np.median(scores)

0.9455909943714822

In [70]:
bag.score(X_test_clean, y_test)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s finished


0.9535232383808095

In [71]:
#potential best model(bag)
y_pred = bag.predict(X_test_clean)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

       False       0.96      0.99      0.97       566
        True       0.93      0.75      0.83       101

    accuracy                           0.95       667
   macro avg       0.94      0.87      0.90       667
weighted avg       0.95      0.95      0.95       667



[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s finished


We now have a model with 95% accuracy, our best yet! However, recall for churners is only 75%. Let's see if we can improve that by lowering the decision threshold.

In [149]:
# potential best model(bag)
y_prob = bag.predict_proba(X_test_clean)[:,1]
y_pred = (y_prob >= 0.32).astype(int)
bag_report = classification_report(y_test, y_pred)
print(bag_report)

              precision    recall  f1-score   support

       False       0.97      0.96      0.97       566
        True       0.80      0.83      0.82       101

    accuracy                           0.94       667
   macro avg       0.88      0.90      0.89       667
weighted avg       0.94      0.94      0.94       667



[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s finished


In [73]:
print(confusion_matrix(y_test, y_pred))

[[545  21]
 [ 17  84]]
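
The precision, recall, and accuracy in the report can be recomputed directly from this confusion matrix, whose rows are the actual classes and whose columns are the predicted classes (scikit-learn's convention):

```python
# Confusion matrix printed above: rows = actual, columns = predicted.
tn, fp = 545, 21
fn, tp = 17, 84

precision = tp / (tp + fp)                  # of customers flagged as churners, how many churned
recall = tp / (tp + fn)                     # of actual churners, how many were flagged
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```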


The best we can do without changing the model much is a recall of 0.83, at the cost of about one percentage point of accuracy. Precision, however, took a bigger hit, dropping from 0.93 to 0.80. As discussed earlier, false positives are relatively cheap in this setting, so this matters less than the gain in recall, although the trade-off may still not be worthwhile.
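
One caveat: the 0.32 cutoff appears to have been chosen by looking at test-set results, so the reported test metrics are probably somewhat optimistic. A more defensible approach is to pick the threshold on a separate validation split and only then evaluate once on the test set. A minimal pure-Python sketch of such a threshold sweep on made-up validation probabilities (all values and names here are illustrative):

```python
def precision_recall_at(y_true, y_prob, threshold):
    """Precision and recall when predicting churn for probabilities >= threshold."""
    y_pred = [p >= threshold for p in y_prob]
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def best_recall_threshold(y_true, y_prob, thresholds, min_precision):
    """Among thresholds that keep precision acceptable, pick the one with the best recall."""
    best = None
    for threshold in thresholds:
        precision, recall = precision_recall_at(y_true, y_prob, threshold)
        if precision >= min_precision and (best is None or recall > best[2]):
            best = (threshold, precision, recall)
    return best


# Made-up validation labels (True = churned) and predicted churn probabilities.
y_val = [True, True, True, True, False, False, False, False, False, False]
p_val = [0.90, 0.60, 0.40, 0.20, 0.45, 0.30, 0.10, 0.05, 0.02, 0.01]

print(best_recall_threshold(y_val, p_val, [0.5, 0.4, 0.3, 0.2], min_precision=0.6))
```

The minimum acceptable precision is a business decision; 0.6 here is an arbitrary placeholder.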

### Random Forest

In [75]:
rfr = RandomForestClassifier(max_features='sqrt', max_samples=.5, random_state=42)

In [76]:
rfr.fit(X_train_clean, y_train)

RandomForestClassifier(max_features='sqrt', max_samples=0.5, random_state=42)

In [78]:
scores = cross_val_score(rfr, X_train_clean, y_train, cv=5)
scores

array([0.91011236, 0.93245779, 0.91932458, 0.90994371, 0.92120075])

In [79]:
np.median(scores)

0.9193245778611632

In [80]:
score = rfr.score(X_test_clean, y_test)
score

0.9295352323838081

In [84]:
# Recompute predictions with the random forest before reporting
y_pred = rfr.predict(X_test_clean)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

       False       0.92      1.00      0.96       566
        True       0.98      0.54      0.70       101

    accuracy                           0.93       667
   macro avg       0.95      0.77      0.83       667
weighted avg       0.93      0.93      0.92       667



While the precision is quite high, the recall is very low.

In [87]:
feat_import = {name:score
               for name, score
                   in zip(X_test_clean.columns, rfr.feature_importances_)
}
feat_import

{'account length': 0.035498190022040924,
 'number vmail messages': 0.020592308421105654,
 'total day minutes': 0.11804135151960904,
 'total day calls': 0.03804331663135271,
 'total day charge': 0.1212535905430427,
 'total eve minutes': 0.05391157862233688,
 'total eve calls': 0.03249380678395122,
 'total eve charge': 0.055131823626085846,
 'total night minutes': 0.03803670265221742,
 'total night calls': 0.03583430578252492,
 'total night charge': 0.03705201431684862,
 'total intl minutes': 0.04161032935796411,
 'total intl calls': 0.03582124658729292,
 'total intl charge': 0.04692170198477239,
 'customer service calls': 0.09768470683566628,
 'x0_AL': 0.001019097368012481,
 'x0_AR': 0.002401339668776563,
 'x0_AZ': 0.0021205199319452275,
 'x0_CA': 0.002646821365896307,
 'x0_CO': 0.0016856259637319762,
 'x0_CT': 0.0017611339844510335,
 'x0_DC': 0.0016934429404031938,
 'x0_DE': 0.001368623384628852,
 'x0_FL': 0.001678356176437631,
 'x0_GA': 0.0016283981151337514,
 'x0_HI': 0.0008706765510
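
A dictionary this long is easier to read once sorted. The sketch below ranks a handful of the importances printed above (values copied from that output and rounded to four decimals); in the notebook one would pass the full `feat_import` dictionary instead. The helper name is illustrative.

```python
def top_features(importances: dict, n: int = 5) -> list:
    """Return the n (name, importance) pairs with the largest importance."""
    return sorted(importances.items(), key=lambda item: item[1], reverse=True)[:n]


# Subset of the values shown above, rounded for readability.
feat_import_subset = {
    "account length": 0.0355,
    "total day minutes": 0.1180,
    "total day charge": 0.1213,
    "total eve minutes": 0.0539,
    "total eve charge": 0.0551,
    "total intl charge": 0.0469,
    "customer service calls": 0.0977,
    "x0_AL": 0.0010,
}

for name, importance in top_features(feat_import_subset, n=3):
    print(f"{name:<24} {importance:.4f}")
```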

### Extra Trees 

In [88]:
etr = ExtraTreesClassifier(max_features='sqrt', 
                          max_samples=.5,
                          bootstrap=True,
                          random_state=42)

In [89]:
etr.fit(X_train_clean, y_train)

ExtraTreesClassifier(bootstrap=True, max_features='sqrt', max_samples=0.5,
                     random_state=42)

In [90]:
scores = cross_val_score(etr, X_train_clean, y_train, cv=5)
scores

array([0.88014981, 0.88742964, 0.89305816, 0.87804878, 0.88930582])

This is not great, so let's move on.

### Stacking

In [97]:
estimators = [
    ('lr', LogisticRegression()),
    ('knn', KNeighborsClassifier()),
    ('dt', DecisionTreeClassifier())
]

sr = StackingClassifier(estimators)

sr.fit(X_train_clean, y_train)

StackingClassifier(estimators=[('lr', LogisticRegression()),
                               ('knn', KNeighborsClassifier()),
                               ('dt', DecisionTreeClassifier())])

In [98]:
sr.score(X_train_clean, y_train)

0.9872468117029257

In [99]:
sr.score(X_test_clean, y_test)

0.9355322338830585

In [100]:
y_pred = sr.predict(X_test_clean)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

       False       0.94      0.99      0.96       566
        True       0.89      0.65      0.75       101

    accuracy                           0.94       667
   macro avg       0.92      0.82      0.86       667
weighted avg       0.93      0.94      0.93       667



### AdaBoost

In [107]:
abc = AdaBoostClassifier()

abc.fit(X_train_clean, y_train)

AdaBoostClassifier()

In [108]:
abc.score(X_train_clean, y_train)

0.8934733683420856

In [116]:
y_pred = abc.predict(X_test_clean)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

       False       0.89      0.97      0.93       566
        True       0.65      0.34      0.44       101

    accuracy                           0.87       667
   macro avg       0.77      0.65      0.69       667
weighted avg       0.86      0.87      0.85       667



With a recall of only 0.34 on the churn class, this is not great, so let's move on.

### Gradient Boost

In [112]:
gbc = GradientBoostingClassifier(max_depth=2, n_estimators=3, learning_rate=1.0)

gbc.fit(X_train_clean, y_train)

GradientBoostingClassifier(learning_rate=1.0, max_depth=2, n_estimators=3)

In [113]:
gbc.score(X_train_clean, y_train)

0.9017254313578394

In [115]:
y_pred = gbc.predict(X_test_clean)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

       False       0.90      0.99      0.94       566
        True       0.84      0.42      0.56       101

    accuracy                           0.90       667
   macro avg       0.87      0.70      0.75       667
weighted avg       0.89      0.90      0.88       667



Although the accuracy is good at 0.90, the recall on the churn class is again quite low at 0.42.

### XGBoost

In [118]:
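# Note: 'reg:squarederror' is a regression objective; 'binary:logistic' is the
# usual objective for binary classification. It is kept here to match the results below.
xgb = xgboost.XGBClassifier(random_state=42, objective='reg:squarederror')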

xgb.fit(X_train_clean, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=0, num_parallel_tree=1,
              objective='reg:squarederror', random_state=42, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [119]:
xgb.score(X_train_clean, y_train)

1.0

In [120]:
# Potential best model (XGBoost)
y_pred = xgb.predict(X_test_clean)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

       False       0.96      0.98      0.97       566
        True       0.89      0.76      0.82       101

    accuracy                           0.95       667
   macro avg       0.92      0.87      0.89       667
weighted avg       0.95      0.95      0.95       667



This default model performs very similarly to our bag model (our best model so far). The bag had an accuracy of 95% with a recall of 75% and a precision of 93%. This model has the same accuracy, with slightly worse precision (89%) and slightly better recall (76%). Just as we tweaked our bag model, let's tweak the current model to improve the recall by lowering the probability threshold used to predict churn.

In [152]:
y_prob = xgb.predict_proba(X_test_clean)[:,1]
y_pred = (y_prob >= 0.25).astype(int)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

       False       0.97      0.95      0.96       566
        True       0.75      0.84      0.79       101

    accuracy                           0.93       667
   macro avg       0.86      0.90      0.88       667
weighted avg       0.94      0.93      0.94       667



In [153]:
print(bag_report)

              precision    recall  f1-score   support

       False       0.97      0.96      0.97       566
        True       0.80      0.83      0.82       101

    accuracy                           0.94       667
   macro avg       0.88      0.90      0.89       667
weighted avg       0.94      0.94      0.94       667



If we set the probability threshold for XGBoost to the same value used for the bag model, the two models have the same precision and accuracy, but the XGBoost model has a recall of 0.81. If we lower the threshold to 0.25, we arguably have a better model for our specific case: compared with the bag model, recall is 1 percentage point better (0.84 vs. 0.83), while precision is 5 points worse (0.75 vs. 0.80) and overall accuracy is 1 point worse (0.93 vs. 0.94).
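
The trade-off discussed here, where a lower threshold buys recall at the cost of precision, can be sketched on a toy example. The labels and probabilities below are hypothetical and are not taken from the SyriaTel data, and `precision_recall` is a helper written for this illustration only.

```python
def precision_recall(y_true, y_prob, threshold):
    """Precision and recall for the positive class at a given threshold."""
    y_pred = [p >= threshold for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels (1 = churned) and predicted churn probabilities
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_prob = [0.90, 0.60, 0.40, 0.20, 0.45, 0.30, 0.10, 0.05, 0.15, 0.02]

for threshold in (0.50, 0.25):
    precision, recall = precision_recall(y_true, y_prob, threshold)
    print(f"threshold={threshold:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

On this toy data, dropping the threshold from 0.50 to 0.25 catches one more churner but also flags two loyal customers, which is the same kind of exchange seen in the reports above.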

Let's run a grid search to see if we can improve the model.

In [159]:
param_grid = {
    "max_depth": [4, 6, 8],
    "learning_rate": [0.1, 0.3, .5],
    "gamma": [0, 0.1, 0.5]
}

# Create GridSearchCV object
grid_search = GridSearchCV(xgb, param_grid=param_grid, cv=5, n_jobs=-1)

# Fit the GridSearchCV object on training data
grid_search.fit(X_train_clean, y_train)

GridSearchCV(cv=5,
             estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                     colsample_bylevel=1, colsample_bynode=1,
                                     colsample_bytree=1, gamma=0, gpu_id=-1,
                                     importance_type='gain',
                                     interaction_constraints='',
                                     learning_rate=0.300000012,
                                     max_delta_step=0, max_depth=6,
                                     min_child_weight=1, missing=nan,
                                     monotone_constraints='()',
                                     n_estimators=100, n_jobs=0,
                                     num_parallel_tree=1,
                                     objective='reg:squarederror',
                                     random_state=42, reg_alpha=0, reg_lambda=1,
                                     scale_pos_weight=1, subsample=1,
                      

In [160]:
grid_search.best_params_

{'gamma': 0.1, 'learning_rate': 0.3, 'max_depth': 4}

In [161]:
grid_search.score(X_train_clean, y_train)

0.9804951237809453
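
When no `scoring` argument is given, `GridSearchCV` ranks candidates by the estimator's default score, which for classifiers is accuracy, whereas the metric this project cares most about is recall on churners. One possible variation is to search on recall directly. The sketch below is not the notebook's method: it uses synthetic data from `make_classification` and a plain decision tree as stand-ins, both assumptions made purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (roughly 15% positives); not the SyriaTel data
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.85, 0.15], random_state=42
)

param_grid = {"max_depth": [2, 4, 6]}

# scoring="recall" ranks candidates by recall on the positive class instead of accuracy
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid=param_grid,
    scoring="recall",
    cv=5,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

With `scoring="recall"`, `best_score_` is the mean cross-validated recall, so it is not comparable to the accuracy-based scores reported elsewhere in this section.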

In [192]:
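# Potential best model (XGBoost)
# Note: this scores the original `xgb` model, not the tuned `grid_search` estimator,
# so the report below is identical to the one from In [152].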
y_prob = xgb.predict_proba(X_test_clean)[:,1]
y_pred = (y_prob >= 0.25).astype(int)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

       False       0.97      0.95      0.96       566
        True       0.75      0.84      0.79       101

    accuracy                           0.93       667
   macro avg       0.86      0.90      0.88       667
weighted avg       0.94      0.93      0.94       667



Let's run a grid search on our bag model to see if we can improve it further.

In [164]:
base_estimator = DecisionTreeClassifier()

# Define the BaggingClassifier
bagging = BaggingClassifier(base_estimator=base_estimator, random_state=42)

# Define the parameter grid to search over
param_grid = {
    'n_estimators': [10, 100],
    'max_samples': [0.5, 1.0],
    'bootstrap': [True, False],
}

# Define the GridSearchCV object
grid_search = GridSearchCV(bagging, param_grid=param_grid, cv=5)

# Fit the GridSearchCV object to the data
grid_search.fit(X_train_clean, y_train)

# Print the best hyperparameters and corresponding score
print("Best Hyperparameters: ", grid_search.best_params_)
print("Best Score: ", grid_search.best_score_)

Best Hyperparameters:  {'bootstrap': False, 'bootstrap_features': False, 'max_features': 1.0, 'max_samples': 0.5, 'n_estimators': 100}
Best Score:  0.9478613740329278


In [190]:
# Potential best model (bag)
y_prob = grid_search.predict_proba(X_test_clean)[:,1]
y_pred = (y_prob >= 0.32).astype(int)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

       False       0.97      0.97      0.97       566
        True       0.85      0.81      0.83       101

    accuracy                           0.95       667
   macro avg       0.91      0.89      0.90       667
weighted avg       0.95      0.95      0.95       667



In [186]:
confusion_matrix(y_test, y_pred)

array([[551,  15],
       [ 19,  82]])
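
As a check, the churn-class metrics in the report above can be recomputed by hand from this confusion matrix. scikit-learn lays the matrix out with actual classes as rows and predicted classes as columns, so for the sorted labels `[False, True]` it reads `[[TN, FP], [FN, TP]]`. A minimal sketch of the arithmetic:

```python
# Counts copied from the confusion matrix above: rows are actual, columns are predicted
tn, fp = 551, 15   # customers who stayed: correctly kept / wrongly flagged as churners
fn, tp = 19, 82    # customers who churned: missed / correctly flagged

precision = tp / (tp + fp)                   # share of flagged customers who really churned
recall = tp / (tp + fn)                      # share of churners the model caught
accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all customers classified correctly

print(f"precision={precision:.2f}  recall={recall:.2f}  accuracy={accuracy:.2f}")
```

In business terms, the model misses 19 of the 101 churners in the test set and would send promotional offers to 15 customers who were not going to leave.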