# Customer Churn - Classification Model
This notebook is the second of two exploring the Telco Customer Churn dataset.

For a financial services firm, it is important to know which services clients find most useful. With this knowledge, you can prioritize discounts and identify which customers to offer special deals to when they are likely to leave for another company offering similar services.

Determining whether a customer will leave (churn) is a classification problem - we will take multiple points of information about each customer and predict whether or not they will continue to employ our services in the future.

This notebook acts as the home for the feature engineering and modelling. If you are interested in seeing the companion notebook with the exploratory analysis, click [here].

The data source can be found [here](https://www.kaggle.com/blastchar/telco-customer-churn).

## Importing Packages and Data

In [25]:
# import packages
# data handling
import os
import pandas as pd
import numpy as np

In [26]:
# import data
os.chdir('/Users/user1/Downloads')
filename = 'Telco_Customer_Churn.csv'
df = pd.read_csv(filename)

In [27]:
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


## Data Type Handling
In our EDA, we determined that TotalCharges needs to be coerced to a numeric datatype, with its null values filled with 0.

In [28]:
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors = 'coerce')
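# assign the result back; inplace=True on a selected column is unreliable in recent pandas versions
df['TotalCharges'] = df['TotalCharges'].fillna(0)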
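
To make the coercion step concrete, here is a minimal sketch on a toy Series. The values are hypothetical stand-ins for TotalCharges, not rows from the dataset, and the blank entry is an assumption about what a non-numeric value might look like.

```python
import pandas as pd

# Toy stand-in for TotalCharges: numbers stored as strings plus one blank entry
# (hypothetical values, not taken from the dataset)
raw = pd.Series(['29.85', ' ', '1889.5'], name='TotalCharges')

# errors='coerce' turns anything that cannot be parsed into NaN instead of raising
numeric = pd.to_numeric(raw, errors='coerce')

# fillna returns a new Series, which we keep by assigning it to a name
filled = numeric.fillna(0)

print(numeric.isna().sum())
print(filled.tolist())
```

The blank entry becomes NaN after coercion and 0 after filling, while the parseable strings become floats.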

## Splitting Train and Test Data
Let's convert Churn to a numerical feature and separate it out from the rest of our data. Let's also verify that it was correctly converted.

In [29]:
y = df['Churn'].map({'Yes':1, 'No':0})
y.head()

0    0
1    0
2    1
3    0
4    1
Name: Churn, dtype: int64

Let's create 'X' - a matrix of our features. In this step, we also drop customerID, since it is an identifier / key, which doesn't contribute anything to our model.

In [30]:
X = df.drop(['Churn', 'customerID'], axis = 1)
X['SeniorCitizen'] = X['SeniorCitizen'].map({0:'No', 1:'Yes'})
X.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges
0,Female,No,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85
1,Male,No,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5
2,Male,No,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15
3,Male,No,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75
4,Female,No,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65


Everything looks ready to be separated into training and test sets. For this, we will use sklearn's train_test_split. One concern is that it splits the rows randomly, so the proportion of churned customers may differ between the training and test sets.

As a note, SeniorCitizen is converted with a map method (a quick and dirty approach) to make sure that it gets processed as an object column when encoded rather than as a numeric column - might want to change this later.

In [31]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=0)
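
One possible way to address the imbalance concern raised above is the `stratify` argument of train_test_split. The sketch below uses synthetic data (`X_demo` and `y_demo` are made-up stand-ins, not our X and y) and is not the split used in the rest of this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the churn target (25% positives)
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 250 + [0] * 750)

# stratify=y_demo keeps the class ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print(y_tr.mean(), y_te.mean())
```

Both splits keep the same share of positives as the full target, which a purely random split does not guarantee.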

## Feature Scaling and Engineering
Now that we have our train/test split, we can work on two main things before we move to actually modelling using our data. First, we should convert all of our categorical features to numeric features. 

Because none of our categorical variables has an inherent order (that is, no value is "bigger" than another in size or relationship), we should use One-Hot Encoding to create dummy variables. This keeps the algorithm from assuming that the categories are ordinal and learning relationships that do not exist.

However, the number of features will drastically increase, so it may be important to have an iterative process of feature selection (utilize something like sklearn's RFE) to later reduce the number of features.

More information on this for Mitchell's reference: https://towardsdatascience.com/a-look-into-feature-importance-in-logistic-regression-models-a4aa970f9b0f
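
As a rough illustration of the RFE idea mentioned above, here is a minimal sketch on synthetic data. `X_demo` stands in for the one-hot encoded training matrix, and keeping 8 features is an arbitrary choice for the example, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for the one-hot encoded training matrix
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

# Repeatedly fit the model and drop the weakest feature until 8 remain
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=8)
selector.fit(X_demo, y_demo)

# support_ is a boolean mask over the original columns
print(selector.support_.sum())

# transform keeps only the selected columns
X_reduced = selector.transform(X_demo)
print(X_reduced.shape)
```

With a DataFrame of named columns, the same mask can be used to list which features were kept.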

In [32]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
oh_encoder = OneHotEncoder()
cat_cols = X_train.columns[X_train.dtypes == 'object']
cat_cols

Index(['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService',
       'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup',
       'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies',
       'Contract', 'PaperlessBilling', 'PaymentMethod'],
      dtype='object')

I'm going to build the pipeline using ColumnTransformer rather than using fit_transform on the train data directly so that we can experiment with StandardScaling and Min-Max scaling easily later.

Next, let's define our categorical pipeline. For this dataset, we will use OneHot encoding.

In [33]:
cat_pipeline = Pipeline([
    ('onehot', oh_encoder)
])

We've prepared a categorical pipeline (which will be used in a larger pipeline later).

However, instead of using the categorical pipeline now (which would return a sparse matrix with no numerical columns), we will temporarily use pd.get_dummies and concatenate our numerical columns back in.

In [34]:
dummies = pd.get_dummies(X_train[cat_cols])
X_train_prep = pd.concat([dummies, X_train[['tenure', 'MonthlyCharges', 'TotalCharges']]], axis=1)
X_train_prep.head()

Unnamed: 0,gender_Female,gender_Male,SeniorCitizen_No,SeniorCitizen_Yes,Partner_No,Partner_Yes,Dependents_No,Dependents_Yes,PhoneService_No,PhoneService_Yes,...,Contract_Two year,PaperlessBilling_No,PaperlessBilling_Yes,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check,tenure,MonthlyCharges,TotalCharges
2920,1,0,1,0,0,1,1,0,0,1,...,1,1,0,0,0,0,1,72,85.1,6155.4
2966,1,0,0,1,1,0,1,0,1,0,...,0,0,1,0,0,1,0,14,46.35,672.7
6099,1,0,1,0,0,1,0,1,0,1,...,1,1,0,1,0,0,0,71,24.7,1810.55
5482,0,1,1,0,0,1,0,1,0,1,...,0,0,1,0,0,0,1,33,73.9,2405.05
2012,1,0,1,0,0,1,1,0,0,1,...,0,0,1,0,0,1,0,47,98.75,4533.7


You can see that our categorical features have been one-hot encoded and our numerical data has been added back in with no transformation. We'll take one more step before moving on to models / testing different pipelines - saving the columns of this matrix so we can determine feature importance later.

In [35]:
full_cols = X_train_prep.columns

We are all done with the initial preparation of the training data - time to build an initial model and see how it performs.

## Building an initial model
For now, I'm going to stick with one main model - Logistic Regression. We'll use this model to build a function that will evaluate our models for accuracy, recall, precision, and F1.

In [36]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

logreg = LogisticRegression(solver = 'lbfgs', max_iter = 250)

def eval_model(model, cv = 5, model_name = None):
    scoring = ['accuracy', 'recall', 'precision', 'f1']
    scores = cross_validate(model, X_train_prep, y_train, scoring = scoring, cv = cv)
    for i in scoring:
        print('Mean {}: {}'.format(i, scores[str('test_' + i)].mean()))

eval_model(logreg)



Mean accuracy: 0.8061750300881438
Mean recall: 0.5575968992248062
Mean precision: 0.6622692912348084
Mean f1: 0.6046740928311374


Because we are more interested in identifying the customers who will churn than those who will not, we'd like to play it safe and prioritize recall over precision. A model chosen for higher recall tends to produce more false positives and fewer false negatives. A false negative (a lost customer we did not see coming) creates a much larger loss for our company than a false positive (a deal offered to a customer who was not going to churn).
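
To make the recall / precision trade-off concrete, here is a small worked example. The labels are hypothetical (10 imaginary customers), not predictions from our model.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for 10 customers: 1 = churned, 0 = stayed
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# recall = tp / (tp + fn): share of actual churners that we caught
recall = recall_score(y_true, y_pred)

# precision = tp / (tp + fp): share of flagged customers that really churned
precision = precision_score(y_true, y_pred)

print(tn, fp, fn, tp)
print(recall, precision)
```

Here we catch 3 of the 4 churners (recall 0.75) at the cost of flagging 2 customers who stayed (precision 0.6). The single false negative is the expensive mistake in our setting.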

## Experimenting with Different Pipelines
First, how does using a StandardScaler affect our results?

In [37]:
num_cols = X_train.columns[X_train.dtypes != 'object']

full_pipeline = ColumnTransformer([
    ('cat', cat_pipeline, cat_cols),
    ('num', StandardScaler(), num_cols)
])

def apply_pipeline(data, pipeline):
    transformed_data = pipeline.fit_transform(data)
    return pd.DataFrame(transformed_data, columns = full_cols)

# using apply_pipeline and .head() allows us to view the changes that the different Transformers apply

X_train_prep = apply_pipeline(X_train, full_pipeline)
X_train_prep.head()[num_cols]

Unnamed: 0,tenure,MonthlyCharges,TotalCharges
0,1.612532,0.674154,1.704761
1,-0.747907,-0.614894,-0.709771
2,1.571835,-1.335097,-0.208672
3,0.02534,0.301578,0.05314
4,0.595101,1.128231,0.990579


Here, we redefine and apply the full pipeline with a StandardScaler. 

In [38]:
eval_model(logreg)

Mean accuracy: 0.8059967800574978
Mean recall: 0.5655968992248063
Mean precision: 0.6586693994181699
Mean f1: 0.6078690784992602


Recall and F1 are slightly better (recall rises from 0.558 to 0.566), while accuracy and precision dip marginally - what about MinMaxScaling?

In [39]:
from sklearn.preprocessing import MinMaxScaler

full_pipeline = ColumnTransformer([
    ('cat', cat_pipeline, cat_cols),
    ('num', MinMaxScaler(feature_range = (0,1)), num_cols)
])

X_train_prep = apply_pipeline(X_train, full_pipeline)
X_train_prep.head()[num_cols]

Unnamed: 0,tenure,MonthlyCharges,TotalCharges
0,1.0,0.665174,0.708756
1,0.194444,0.279602,0.077457
2,0.986111,0.064179,0.208473
3,0.458333,0.553731,0.276926
4,0.652778,0.800995,0.522027


In [40]:
eval_model(logreg)

Mean accuracy: 0.8043989895963479
Mean recall: 0.5589368770764119
Mean precision: 0.6564007348699026
Mean f1: 0.6031707499487482


MinMax Scaling performs slightly worse than StandardScaling. Let's stick with StandardScaling for now, so here's our final result with logreg before hyperparameter tuning.

Final Data Preparation:

In [41]:
full_pipeline = ColumnTransformer([
    ('cat', cat_pipeline, cat_cols),
    ('num', StandardScaler(), num_cols)
])

X_train_prep = apply_pipeline(X_train, full_pipeline)
eval_model(logreg)

Mean accuracy: 0.8059967800574978
Mean recall: 0.5655968992248063
Mean precision: 0.6586693994181699
Mean f1: 0.6078690784992602
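
One caveat with the approach above: apply_pipeline fits the scaler on all of X_train before cross-validation, so each validation fold has already influenced the scaling statistics. A possible alternative is to put the preprocessing and the model into a single Pipeline so that everything is refit inside each fold. The sketch below uses a synthetic frame (`X_demo`, `y_demo`, and all of their values are made up) rather than our data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for X_train / y_train (values are made up)
rng = np.random.RandomState(0)
n = 200
X_demo = pd.DataFrame({
    'Contract': rng.choice(['Month-to-month', 'One year', 'Two year'], size=n),
    'tenure': rng.randint(0, 73, size=n),
    'MonthlyCharges': rng.uniform(18, 120, size=n),
})
# Noisy rule: shorter tenure makes the positive class more likely
noise = rng.normal(0, 15, size=n)
y_demo = ((X_demo['tenure'] + noise) < 24).astype(int)

preprocess = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Contract']),
    ('num', StandardScaler(), ['tenure', 'MonthlyCharges']),
])

# Preprocessing and model travel together, so both are refit on each fold
model = Pipeline([
    ('prep', preprocess),
    ('clf', LogisticRegression(solver='lbfgs', max_iter=250)),
])

scoring = ['accuracy', 'recall', 'precision', 'f1']
scores = cross_validate(model, X_demo, y_demo, cv=5, scoring=scoring)
for metric in scoring:
    print('Mean {}: {:.3f}'.format(metric, scores['test_' + metric].mean()))
```

The printed metrics describe the synthetic data only. The point is the structure, which passes raw features to cross_validate instead of a pre-transformed matrix.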


## Initializing Different Models
First, let's import some more classification models.

In [42]:
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

Let's start with GaussianNB, or Naïve Bayes.

In [43]:
GB = GaussianNB()
eval_model(GB)

Mean accuracy: 0.7062489126290685
Mean recall: 0.8374440753045403
Mean precision: 0.4720115688851777
Mean f1: 0.6033786659647309


Compared to our LogisticRegression model from earlier, Naive Bayes performs significantly better on recall (0.84 versus 0.57, an improvement of roughly 0.27), although its precision and accuracy are noticeably lower.

In [44]:
DTC = DecisionTreeClassifier()
eval_model(DTC)

Mean accuracy: 0.7287863198936885
Mean recall: 0.5089723145071983
Mean precision: 0.4915335154158683
Mean f1: 0.499984205851996


The Decision Tree Classifier performs worse than our LogReg model on every metric.

In [45]:
RFC = RandomForestClassifier(n_estimators = 100)
eval_model(RFC)

Mean accuracy: 0.7916205959514218
Mean recall: 0.49564784053156147
Mean precision: 0.6416651123453619
Mean f1: 0.5589476414859471


The Random Forest Classifier has stronger precision than Naive Bayes and the Decision Tree, but it trails Logistic Regression slightly on precision and performs worse than both our Logistic Regression and Naive Bayes models on recall.

In [46]:
Support_Vector = SVC(gamma = 'auto')
eval_model(Support_Vector)

Mean accuracy: 0.8024491072400324
Mean recall: 0.4850033222591362
Mean precision: 0.6818551010562737
Mean f1: 0.5665229485429812


The Support Vector Machine model performs similarly to the Random Forest Classifier, with a higher precision score but the worst recall score yet.

Right now, our two best models are Naive Bayes and Logistic Regression.

## Hyperparameter Tuning
As we determined in our last step, Logistic Regression and Naive Bayes look the most promising. Now we can use RandomizedSearchCV to tune the parameters for each of these models and figure out which one we will use.

In [47]:
# Random Search CV
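
The cell above is still a placeholder. As a rough sketch of what a randomized search for Logistic Regression could look like, the example below tunes `C` and `class_weight` on synthetic data and scores by recall, in line with our priority. The search space, `n_iter`, and the data are all assumptions made for illustration, not tuned choices for this dataset.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, imbalanced stand-in for the prepared training data
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.75, 0.25], random_state=0
)

# Hypothetical search space: C is sampled on a log scale,
# class_weight is picked from a short list
param_distributions = {
    'C': loguniform(1e-3, 1e2),
    'class_weight': [None, 'balanced'],
}

search = RandomizedSearchCV(
    LogisticRegression(solver='lbfgs', max_iter=250),
    param_distributions=param_distributions,
    n_iter=10,          # number of random parameter combinations to try
    scoring='recall',   # optimize for recall
    cv=5,
    random_state=0,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

The same pattern could be applied to GaussianNB by swapping the estimator and searching over its `var_smoothing` parameter.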