Problem statement:

Customer churn has risen over the last several quarters, so the company has decided to investigate the factors driving it. Past retention efforts have been reactive: the suggested steps were applied only once the customer had already reached the point of no return. The idea is to use machine learning to predict each customer's likelihood of churn and the reasons behind it, which may differ from customer to customer. This enables a targeted response tailored to each customer's preferences, helping them stay and improving customer loyalty.

You are a data scientist on the team tasked with driving the implementation. You will be responsible for designing the end-to-end pipeline for training a supervised churn-prediction model, i.e., data preparation, understanding the factors and their trends, model training, and validation of results.

Dataset: Telco-Customer-Churn.csv

In [23]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt
%matplotlib inline

In [42]:
telcoDF = pd.read_csv('Telco-Customer-Churn.csv')
telcoDF.head(25)

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes
5,9305-CDSKC,Female,0,No,No,8,Yes,Yes,Fiber optic,No,...,Yes,No,Yes,Yes,Month-to-month,Yes,Electronic check,99.65,820.5,Yes
6,1452-KIOVK,Male,0,No,Yes,22,Yes,Yes,Fiber optic,No,...,No,No,Yes,No,Month-to-month,Yes,Credit card (automatic),89.1,1949.4,No
7,6713-OKOMC,Female,0,No,No,10,No,No phone service,DSL,Yes,...,No,No,No,No,Month-to-month,No,Mailed check,29.75,301.9,No
8,7892-POOKP,Female,0,Yes,No,28,Yes,Yes,Fiber optic,No,...,Yes,Yes,Yes,Yes,Month-to-month,Yes,Electronic check,104.8,3046.05,Yes
9,6388-TABGU,Male,0,No,Yes,62,Yes,No,DSL,Yes,...,No,No,No,No,One year,No,Bank transfer (automatic),56.15,3487.95,No


# Applying the following cleanup steps to the data<br>
1> "No internet service" is equivalent to "No" in several fields, so we replace it with "No".<br>
2> "No phone service" is equivalent to "No" in several fields, so we replace it with "No".<br>
3> The TotalCharges field has a few entries that are blank or contain special characters. Since at any given point in time TotalCharges is either the sum of all charges or equal to MonthlyCharges,<br>
we replace blank values in TotalCharges with the value from MonthlyCharges when that value is not null.<br>
4> The customerID column uniquely identifies each record and therefore carries no predictive information, so we drop it from the EDA, pre-processing and model building steps.<br>
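The four steps above can be sketched on a small made-up frame. This is only an illustration under stated assumptions: the rows are invented, only a few of the real column names are used, and blanks are handled with `pd.to_numeric(errors="coerce")`, which is a swapped-in alternative to a plain string replacement.

```python
import pandas as pd

# Toy frame reusing a few Telco column names; the values are invented.
toy = pd.DataFrame({
    "customerID": ["A-1", "B-2", "C-3"],
    "MultipleLines": ["No phone service", "Yes", "No"],
    "OnlineSecurity": ["No internet service", "No", "Yes"],
    "MonthlyCharges": [20.0, 55.5, 80.25],
    "TotalCharges": [" ", "111.0", "802.5"],
})

# Steps 1 and 2: collapse both "no service" labels into a plain "No".
clean = toy.replace(["No internet service", "No phone service"], "No")

# Step 3: unparseable entries (such as " ") become NaN,
# then fall back to the MonthlyCharges value of the same row.
total = pd.to_numeric(clean["TotalCharges"], errors="coerce")
clean["TotalCharges"] = total.fillna(clean["MonthlyCharges"])

# Step 4: drop the identifier column.
clean = clean.drop(columns=["customerID"])
print(clean)
```

One side effect of `pd.to_numeric` is that TotalCharges comes out as a float column directly, so no separate type conversion is needed in this sketch.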

# Replacing No internet service and No phone service values with No in every column where they occur

In [46]:
telcoDF = telcoDF.replace('No internet service','No').replace('No phone service','No')
# telcoDF.head(25)

# Replacing blank TotalCharges values with MonthlyCharges where MonthlyCharges is not null

In [60]:
# first replace blanks in the TotalCharges column with NaN
# (assigning the result back avoids in-place modification through a column selection)
telcoDF['TotalCharges'] = telcoDF['TotalCharges'].replace(' ', np.nan)
# telcoDF.describe(include='all')

# then replace the NaN values with the MonthlyCharges value of the same row
telcoDF['TotalCharges'] = telcoDF['TotalCharges'].fillna(telcoDF['MonthlyCharges'])
# telcoDF.to_csv('data1.csv')

# remove the customerID column from the dataframe: it uniquely identifies each record
# and so carries no predictive information
telcoDF = telcoDF.drop(['customerID'], axis = 1)

In [66]:
# use a dictionary to convert the datatypes of specific columns
convert_dict = {'TotalCharges': float}
telcoDF = telcoDF.astype(convert_dict)
# telcoDF.dtypes

In [67]:
# split the data into dependent (y) and independent (X) variables
telcoDFX = telcoDF[['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure',
       'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity',
       'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV',
       'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod',
       'MonthlyCharges', 'TotalCharges']]
telcoDFy = telcoDF[['Churn']]

In [70]:
# telcoDFX.head(5)
# telcoDFy.head(5)

In [79]:
# Generate dummies (one-hot encoding) for all categorical/binary/object-type variables
# dummyClmns =['gender', 'SeniorCitizen', 'Partner', 'Dependents','PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity','OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV','StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod']
telcoDFX = pd.get_dummies(telcoDFX) #, columns=dummyClmns)
# telcoDFX.head(5)
# telcoDF.columns
# telcoDFX.to_csv('data1.csv')
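As a small illustration of what `pd.get_dummies` does when called without a `columns` argument: object-typed columns are expanded into one indicator column per level, while numeric columns pass through unchanged. The toy values below are invented, and `drop_first=True` is shown only as an optional variant that the notebook does not use.

```python
import pandas as pd

# Invented sample with two categorical columns and one numeric column.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month"],
    "Partner": ["Yes", "No", "No", "Yes"],
    "tenure": [1, 34, 60, 2],
})

# Default behaviour: one indicator column per category level.
full = pd.get_dummies(toy)

# Variant: drop the first level of each categorical column to avoid
# perfectly collinear indicators (useful for linear models).
reduced = pd.get_dummies(toy, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Because integer columns pass through unchanged, a 0/1 column such as SeniorCitizen would stay a single numeric column rather than being split into two indicators.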

In [80]:
# convert the target column into binary values 1/0 using a label encoder for classification
# the label_encoder object maps string labels to integer codes
label_encoder = preprocessing.LabelEncoder()

# Encode labels in column 'Churn'; assign() returns a new DataFrame,
# which avoids pandas' SettingWithCopyWarning on the slice taken from telcoDF
telcoDFy = telcoDFy.assign(Churn=label_encoder.fit_transform(telcoDFy['Churn']))
telcoDFy['Churn'].unique()

array([0, 1])
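To confirm which integer stands for which label, the fitted encoder's `classes_` attribute can be inspected. A minimal sketch on a handful of invented Yes/No labels:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoded = encoder.fit_transform(["No", "Yes", "Yes", "No"])

# classes_ holds the labels in sorted order; the position is the integer code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(encoded)
```

Since the labels are sorted alphabetically, "No" maps to 0 and "Yes" maps to 1, so class 1 is the churn class in every confusion matrix that follows.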

In [83]:
# split the dataset into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(telcoDFX, telcoDFy, test_size = 0.25, random_state = 0)

In [85]:
# standardise all feature columns (numeric and one-hot encoded) so the values share a common scale for better ML performance
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
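The cell above scales every column, including the one-hot indicators. If only the truly numeric columns should be standardised, one possible variant is a `ColumnTransformer` with `remainder="passthrough"`. The sketch below is an assumption about how that could look, not what the notebook does; it uses a three-column toy frame whose values are taken from the first rows of the preview table.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

train = pd.DataFrame({
    "tenure": [1, 34, 2, 45],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
    "Partner_Yes": [1, 0, 0, 0],
})
test = pd.DataFrame({"tenure": [8], "MonthlyCharges": [99.65], "Partner_Yes": [0]})

numeric_cols = ["tenure", "MonthlyCharges"]

# Scale only the numeric columns; indicator columns are passed through as-is.
scaler = ColumnTransformer(
    [("num", StandardScaler(), numeric_cols)],
    remainder="passthrough",
)

# Fit on the training data only, then reuse the same statistics on the test data.
train_scaled = scaler.fit_transform(train)
test_scaled = scaler.transform(test)
print(train_scaled)
print(test_scaled)
```

The output is a NumPy array with the scaled numeric columns first and the passed-through columns after them.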

# Logistic regression

In [95]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, np.ravel(y_train))

LogisticRegression(random_state=0)

In [96]:
# apply predictions on test dataset
y_pred = classifier.predict(X_test)
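The problem statement asks for a likelihood of churn and the reasons behind it, whereas `predict` returns only hard 0/1 labels. A minimal sketch of how both could be read from a fitted logistic regression follows; it runs on synthetic data from `make_classification`, so the features and numbers are purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the scaled feature matrix and the 0/1 churn target.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression(random_state=0).fit(X, y)

# Column 1 of predict_proba is the estimated probability of class 1.
churn_prob = model.predict_proba(X)[:, 1]

# Coefficients: the sign gives the direction of the effect, the magnitude its
# strength (comparable across features only when inputs share a scale).
ranking = np.argsort(-np.abs(model.coef_[0]))

print(churn_prob[:5])
print("features ordered by |coefficient|:", ranking)
```

With the notebook's objects, the equivalent calls would be `classifier.predict_proba(X_test)[:, 1]` and `classifier.coef_[0]`, with the coefficient positions matching the column order of `telcoDFX`.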

In [97]:
# Generate confusion matrix and accuracy for the results obtained
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[1159  139]
 [ 215  248]]


0.7989778534923339
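Accuracy alone does not show how many actual churners the model misses. Precision and recall for the churn class can be derived directly from the confusion matrix printed above, using scikit-learn's layout of actual classes in rows and predicted classes in columns:

```python
import numpy as np

# Confusion matrix from the logistic regression output above.
cm = np.array([[1159, 139],
               [215, 248]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the predicted churners, how many really churned
recall = tp / (tp + fn)      # of the real churners, how many were caught

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

The accuracy reproduces the 0.79898 reported above, while recall comes out at about 0.536 (248 of 463 churners) and precision at about 0.641 (248 of 387 churn predictions).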

# Gaussian Naive Bayes

In [98]:
from sklearn.naive_bayes import GaussianNB
classifierNB = GaussianNB()
classifierNB.fit(X_train, np.ravel(y_train))

GaussianNB()

In [99]:
# apply predictions on test dataset
y_pred_NB = classifierNB.predict(X_test)

In [100]:
# Generate confusion matrix and accuracy for the results obtained
from sklearn.metrics import confusion_matrix, accuracy_score
cmNB = confusion_matrix(y_test, y_pred_NB)
print(cmNB)
accuracy_score(y_test, y_pred_NB)

[[976 322]
 [123 340]]


0.7473026689381034

# Stochastic Gradient Descent

In [105]:
from sklearn.linear_model import SGDClassifier
classifierSgd = SGDClassifier(loss='modified_huber',shuffle=True,random_state=0)
classifierSgd.fit(X_train, np.ravel(y_train))

SGDClassifier(loss='modified_huber', random_state=0)

In [106]:
# apply predictions on test dataset
y_pred_sgd = classifierSgd.predict(X_test)

In [107]:
# Generate confusion matrix and accuracy for the results obtained
from sklearn.metrics import confusion_matrix, accuracy_score
cmSgd = confusion_matrix(y_test, y_pred_sgd)
print(cmSgd)
accuracy_score(y_test, y_pred_sgd)

[[1143  155]
 [ 374   89]]


0.6996024985803521

# K-Nearest Neighbour

In [135]:
from sklearn.neighbors import KNeighborsClassifier
classifierKnn = KNeighborsClassifier(n_neighbors=12)
classifierKnn.fit(X_train, np.ravel(y_train))

KNeighborsClassifier(n_neighbors=12)
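The value `n_neighbors=12` is fixed by hand here. One common way to choose it is a cross-validated grid search; the sketch below shows the idea on synthetic data, and the candidate values are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled training data.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Try a few neighbourhood sizes with 5-fold cross-validation.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 6, 12, 24]},
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Running the search on the training split only keeps the test split untouched for the final evaluation.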

In [136]:
# apply predictions on test dataset
y_pred_knn = classifierKnn.predict(X_test)

In [137]:
from sklearn.metrics import confusion_matrix, accuracy_score
cmKnn = confusion_matrix(y_test, y_pred_knn)
print(cmKnn)
accuracy_score(y_test, y_pred_knn)

[[1150  148]
 [ 249  214]]


0.7745599091425327

# Decision Tree Classifier

In [201]:
from sklearn.tree import DecisionTreeClassifier
classifierDt = DecisionTreeClassifier(max_depth=8,random_state=0,max_features=None,min_samples_leaf=14)
classifierDt.fit(X_train, np.ravel(y_train))

DecisionTreeClassifier(max_depth=8, min_samples_leaf=14, random_state=0)

In [202]:
# apply predictions on test dataset
y_pred_dt = classifierDt.predict(X_test)

In [205]:
from sklearn.metrics import confusion_matrix, accuracy_score
cmDt = confusion_matrix(y_test, y_pred_dt)
print(cmDt)
accuracy_score(y_test, y_pred_dt)

[[1148  150]
 [ 225  238]]


0.787052810902896

# Random Forest Classifier

In [245]:
from sklearn.ensemble import RandomForestClassifier
classifierRft = RandomForestClassifier(n_estimators=70,oob_score=True,n_jobs=-1,random_state=0,max_features=None,min_samples_leaf=30)
classifierRft.fit(X_train, np.ravel(y_train))

RandomForestClassifier(max_features=None, min_samples_leaf=30, n_estimators=70,
                       n_jobs=-1, oob_score=True, random_state=0)
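The forest is built with `oob_score=True`, which makes an out-of-bag accuracy estimate available after fitting, and tree ensembles also expose `feature_importances_`. A short sketch on synthetic data (not the Telco features) of how both attributes could be read:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

forest = RandomForestClassifier(
    n_estimators=70, oob_score=True, min_samples_leaf=30, random_state=0
)
forest.fit(X, y)

# Accuracy estimated on the samples each tree did not see during bootstrapping.
print("OOB accuracy:", round(forest.oob_score_, 3))

# Impurity-based importances, one value per feature, summing to 1.
print("importances:", forest.feature_importances_.round(3))
```

For the notebook's model the same attributes would be `classifierRft.oob_score_` and `classifierRft.feature_importances_`, the latter aligned with the column order of `telcoDFX`.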

In [246]:
# apply predictions on test dataset
y_pred_rft = classifierRft.predict(X_test)

In [247]:
from sklearn.metrics import confusion_matrix, accuracy_score
cmRft = confusion_matrix(y_test, y_pred_rft)
print(cmRft)
accuracy_score(y_test, y_pred_rft)

[[1162  136]
 [ 237  226]]


0.7881885292447472

# Support Vector Classifier

In [275]:
from sklearn.svm import SVC
classifierSvc = SVC(kernel='linear',C=0.025,random_state=0)
classifierSvc.fit(X_train, np.ravel(y_train))

SVC(C=0.025, kernel='linear', random_state=0)

In [276]:
# apply predictions on test dataset
y_pred_svc = classifierSvc.predict(X_test)

In [277]:
from sklearn.metrics import confusion_matrix, accuracy_score
cmSvc = confusion_matrix(y_test, y_pred_svc)
print(cmSvc)
accuracy_score(y_test, y_pred_svc)

[[1166  132]
 [ 224  239]]


0.7978421351504826
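To compare the seven models side by side, the confusion matrices printed above can be collected and summarised in one place. The sketch below only re-derives metrics from those reported numbers; it does not retrain anything.

```python
# Confusion matrices copied from the outputs above (rows: actual, columns: predicted).
confusion = {
    "LogisticRegression": [[1159, 139], [215, 248]],
    "GaussianNB": [[976, 322], [123, 340]],
    "SGDClassifier": [[1143, 155], [374, 89]],
    "KNeighborsClassifier": [[1150, 148], [249, 214]],
    "DecisionTreeClassifier": [[1148, 150], [225, 238]],
    "RandomForestClassifier": [[1162, 136], [237, 226]],
    "SVC": [[1166, 132], [224, 239]],
}

summary = {}
for name, ((tn, fp), (fn, tp)) in confusion.items():
    total = tn + fp + fn + tp
    summary[name] = {
        "accuracy": (tn + tp) / total,
        "recall": tp / (tp + fn),      # share of actual churners caught
        "precision": tp / (tp + fp),   # share of churn predictions that were right
    }

# List the models from highest to lowest churn recall.
for name, m in sorted(summary.items(), key=lambda kv: kv[1]["recall"], reverse=True):
    print(f"{name:<24} acc={m['accuracy']:.3f} "
          f"recall={m['recall']:.3f} precision={m['precision']:.3f}")
```

Logistic regression has the highest accuracy (about 0.799), but Gaussian Naive Bayes, sixth of seven on accuracy, catches the most churners (340 of 463, recall about 0.734). Which model is preferable depends on how the cost of a missed churner compares with the cost of a false alarm.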