
# Telecom Customer Churn Prediction

## Packages

In [2]:
import pandas as pd
import numpy as np
import os

# Matplotlib for visualization
from matplotlib import pyplot as plt
# display plots in the notebook
%matplotlib inline

# Seaborn for easier visualization
import seaborn as sns

# scikit-learn
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, LabelEncoder, LabelBinarizer
from sklearn.compose import ColumnTransformer, make_column_transformer

# from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV

# Function for creating model pipelines - imblearn
from imblearn.pipeline import make_pipeline as imbl_pipe

# Over-sampling using SMOTE
from imblearn.over_sampling import SMOTE

# Classification metrics
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

from sklearn.linear_model import LogisticRegression 
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
import xgboost as xgb

import joblib


## Load Analytical Base Table

In [4]:
df = pd.read_csv("./Resources/Analytical_Base_Table.csv")
print(f"Dataframe dimensions: {df.shape}")
df.head()

Dataframe dimensions: (6499, 21)


Unnamed: 0,CustomerID,Gender,Senior_Citizen,Partner,Dependents,Tenure,Phone_Service,Multiple_Lines,Internet_Service,Online_Security,...,Device_Protection,Tech_Support,Streaming_TV,Streaming_Movies,Contract,Paperless_Billing,Payment_Method,Monthly_Charges,Total_Charges,Churn
0,7590-VHVEG,0,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,0
1,5575-GNVDE,1,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,0
2,3668-QPYBK,1,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,1
3,7795-CFOCW,1,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,0
4,9237-HQITU,0,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,1


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6499 entries, 0 to 6498
Data columns (total 21 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   CustomerID         6499 non-null   object 
 1   Gender             6499 non-null   int64  
 2   Senior_Citizen     6499 non-null   int64  
 3   Partner            6499 non-null   object 
 4   Dependents         6499 non-null   object 
 5   Tenure             6499 non-null   int64  
 6   Phone_Service      6499 non-null   object 
 7   Multiple_Lines     6499 non-null   object 
 8   Internet_Service   6499 non-null   object 
 9   Online_Security    6499 non-null   object 
 10  Online_Backup      6499 non-null   object 
 11  Device_Protection  6499 non-null   object 
 12  Tech_Support       6499 non-null   object 
 13  Streaming_TV       6499 non-null   object 
 14  Streaming_Movies   6499 non-null   object 
 15  Contract           6499 non-null   object 
 16  Paperless_Billing  6499 

### Separate the dataframe into features and target

In [7]:
X = df.drop(["CustomerID","Churn"], axis=1)

y = df["Churn"]

# display shapes of X and y
print(X.shape, y.shape)

(6499, 19) (6499,)


## Create a Train Test Split

In [8]:
random_state = 10

# Split X and y into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=random_state,
                                                    stratify=y)

# Print number of observations in X_train, X_test, y_train, and y_test
print(len(X_train), len(X_test), len(y_train), len(y_test))

4549 1950 4549 1950
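
Passing `stratify` is meant to keep the churn proportion the same in both splits. A minimal sketch on synthetic data shows how this could be checked. The arrays `X_demo` and `y_demo` are made up for illustration, and the 26% positive rate is an assumption chosen to mirror the churn rate reported in the EDA:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): 1000 rows, 26% positives
rng = np.random.default_rng(10)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 260 + [0] * 740)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=10, stratify=y_demo
)

# With stratification the positive rate should be nearly identical in both splits
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train positive rate: {train_rate:.3f}, test positive rate: {test_rate:.3f}")
```

On the notebook's own split, the same check would compare `y_train.mean()` with `y_test.mean()`, since `Churn` is coded as 0/1.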


## Load Saved Models

In [9]:
dt_model = joblib.load('models/challa_decision_tree.sav')
knn_model = joblib.load('models/challa_knn.sav')
lr_model = joblib.load('models/challa_logistic_regression.sav')
rf_model = joblib.load('models/challa_random_forest.sav')
xgb_model = joblib.load('models/challa_XGBoost_model.sav')


**Dictionary `'models'`**

In [10]:
# Create models dictionary; it will be needed for plotting
models = {
    'dt' : 'Decision Tree',
    'knn' : 'K-nearest Neighbors',
    'lr' : 'Logistic Regression',
    'rf' : 'Random Forest',
    'xgb' : 'XGBoost'
}

**Dictionary `'loaded_models'`**

In [11]:
# Dictionary of all loaded models
loaded_models = {
    'dt' : dt_model,
    'knn': knn_model,
    'lr' : lr_model,
    'rf' : rf_model,
    'xgb' : xgb_model
}

The `target_names` variable will be used later when printing evaluation results.

In [12]:
target_names = ['Stays', 'Exits']

### Helper Functions

**The function for creating the dataframe with evaluation metrics for each model.**

<pre>input: loaded models dictionary
output: evaluation metrics dataframe</pre>

In [13]:
def evaluation_test(fit_models):
    lst = []
    for name, model in fit_models.items():
        pred = model.predict(X_test)
        lst.append([name, 
                    precision_score(y_test, pred, average='macro'),
                    recall_score(y_test, pred, average='macro'),
                    f1_score(y_test, pred, average='macro'),
                    accuracy_score(y_test, pred)])

    eval_df = pd.DataFrame(lst, columns=['model', 'precision', 'recall', 'f1_macro', 'accuracy'])
    eval_df.set_index('model', inplace = True)
    return eval_df

**The helper function for displaying confusion matrix and classification report.**

<pre>input: loaded models dictionary, models dictionary and a dictionary key for one of the models
output: confusion matrix dataframe and classification report</pre>

In [14]:
def class_rep_cm(fit_models, models, model_id):
    # Predict classes using model_id
    pred = fit_models[model_id].predict(X_test)
    print()
    print('\t', models[model_id])
    print('\t', '='*len(models[model_id]))

    # Display confusion matrix for y_test and pred
    conf_df = pd.DataFrame(confusion_matrix(y_test, pred), columns=target_names, index=target_names)
    conf_df.index.name = 'True Labels'
    conf_df = conf_df.rename_axis('Predicted Labels', axis='columns')
    display(conf_df)
    
    # Display classification report
    print()
    print(classification_report(y_test, pred, target_names=target_names))


In [17]:
def evaluation_train(fit_models):
    lst = []
    for name, model in fit_models.items():
        pred = model.predict(X_train)
        lst.append([name, 
                    precision_score(y_train, pred, average='macro'),
                    recall_score(y_train, pred, average='macro'),
                    f1_score(y_train, pred, average='macro'),
                    accuracy_score(y_train, pred)])

    eval_df = pd.DataFrame(lst, columns=['model', 'precision', 'recall', 'f1_macro', 'accuracy'])
    eval_df.set_index('model', inplace = True)
    return eval_df

### Display evaluation metrics

In [18]:
evaluation_train(loaded_models)

model,precision,recall,f1_macro,accuracy
dt,0.716612,0.731772,0.723084,0.777094
knn,0.827503,0.864975,0.842346,0.870081
lr,0.717153,0.768475,0.724508,0.755551
rf,0.845431,0.871204,0.856698,0.88437
xgb,0.762633,0.777417,0.769249,0.815564


In [19]:
evaluation_test(loaded_models)

model,precision,recall,f1_macro,accuracy
dt,0.712504,0.731389,0.720126,0.772308
knn,0.653232,0.674663,0.659668,0.715385
lr,0.717949,0.769138,0.725431,0.75641
rf,0.724289,0.748517,0.733567,0.781026
xgb,0.724914,0.745827,0.733312,0.782564


During cross-validation we tried two scorers, `f1_macro` and accuracy, and then chose the model with the better recall for the positive class ("Exits").
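
One way to read the train and test tables above together is to look at the gap between training and test accuracy for each model. The sketch below copies the accuracy values from those tables; the variable names are illustrative and not part of the original notebook:

```python
# Accuracy values copied from the train and test tables above
train_acc = {"dt": 0.777094, "knn": 0.870081, "lr": 0.755551, "rf": 0.884370, "xgb": 0.815564}
test_acc = {"dt": 0.772308, "knn": 0.715385, "lr": 0.756410, "rf": 0.781026, "xgb": 0.782564}

# Gap between train and test accuracy; a large positive gap suggests overfitting
gaps = {name: round(train_acc[name] - test_acc[name], 4) for name in train_acc}

# List models from largest to smallest gap
for name, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>4}: {gap:+.4f}")
```

K-nearest Neighbors and Random Forest show the largest gaps, while Logistic Regression and Decision Tree score almost the same on both sets.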

### Display confusion matrix and classification report 

In [20]:
# Display classification report and confusion matrix for all models

for model in models.keys():
    class_rep_cm(loaded_models, models, model)


	 Decision Tree


True Labels \ Predicted Labels,Stays,Exits
Stays,1174,261
Exits,183,332



              precision    recall  f1-score   support

       Stays       0.87      0.82      0.84      1435
       Exits       0.56      0.64      0.60       515

    accuracy                           0.77      1950
   macro avg       0.71      0.73      0.72      1950
weighted avg       0.78      0.77      0.78      1950


	 K-nearest Neighbors


True Labels \ Predicted Labels,Stays,Exits
Stays,1092,343
Exits,212,303



              precision    recall  f1-score   support

       Stays       0.84      0.76      0.80      1435
       Exits       0.47      0.59      0.52       515

    accuracy                           0.72      1950
   macro avg       0.65      0.67      0.66      1950
weighted avg       0.74      0.72      0.72      1950


	 Logistic Regression


True Labels \ Predicted Labels,Stays,Exits
Stays,1065,370
Exits,105,410



              precision    recall  f1-score   support

       Stays       0.91      0.74      0.82      1435
       Exits       0.53      0.80      0.63       515

    accuracy                           0.76      1950
   macro avg       0.72      0.77      0.73      1950
weighted avg       0.81      0.76      0.77      1950


	 Random Forest


True Labels \ Predicted Labels,Stays,Exits
Stays,1173,262
Exits,165,350



              precision    recall  f1-score   support

       Stays       0.88      0.82      0.85      1435
       Exits       0.57      0.68      0.62       515

    accuracy                           0.78      1950
   macro avg       0.72      0.75      0.73      1950
weighted avg       0.80      0.78      0.79      1950


	 XGBoost


True Labels \ Predicted Labels,Stays,Exits
Stays,1182,253
Exits,171,344



              precision    recall  f1-score   support

       Stays       0.87      0.82      0.85      1435
       Exits       0.58      0.67      0.62       515

    accuracy                           0.78      1950
   macro avg       0.72      0.75      0.73      1950
weighted avg       0.80      0.78      0.79      1950
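
The "Exits" rows of the classification reports can be recomputed directly from the confusion matrices. The following sketch copies the counts from the output above; the helper `exits_precision_recall` is a hypothetical name introduced here for illustration:

```python
# Confusion matrices copied from the output above, laid out as
# [[TN, FP], [FN, TP]] with "Exits" as the positive class
conf_matrices = {
    "dt":  [[1174, 261], [183, 332]],
    "knn": [[1092, 343], [212, 303]],
    "lr":  [[1065, 370], [105, 410]],
    "rf":  [[1173, 262], [165, 350]],
    "xgb": [[1182, 253], [171, 344]],
}

def exits_precision_recall(cm):
    """Return (precision, recall) for the positive class of a 2x2 matrix."""
    (tn, fp), (fn, tp) = cm
    return tp / (tp + fp), tp / (tp + fn)

scores = {name: exits_precision_recall(cm) for name, cm in conf_matrices.items()}
for name, (precision, recall) in scores.items():
    print(f"{name:>4}: precision={precision:.2f} recall={recall:.2f}")
```

Sorting `scores` by recall puts Logistic Regression first, which agrees with the classification reports.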



**Data Description:**

* The target variable is `Churn`. It is 1 for customers who churned and 0 for those who did not.
* Many variables are binary with 'Yes'/'No' values.
* There are three continuous variables: `Tenure`, `Monthly_Charges`, and `Total_Charges`.
* The shape of the data is (6499, 21).

**Data Cleaning:**

* The column `Total_Charges` had 9 missing values, which I imputed with the column median.
* The continuous variables have no outliers.
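
Median imputation of this kind can be sketched in pandas as follows. The values are toy data for illustration only, and the cleaning code actually used for this table is not shown in this notebook:

```python
import pandas as pd

# Toy column with one missing value (illustrative data, not the real table)
total_charges = pd.Series([29.85, 1889.5, None, 108.15, 1840.75], name="Total_Charges")

# Series.median() skips missing values by default
median_value = total_charges.median()

# Replace the missing entries with the median
filled = total_charges.fillna(median_value)
print(median_value, filled.isna().sum())
```

The median is usually preferred over the mean for a right-skewed column such as `Total_Charges`, because it is less affected by a few very large values.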

**Key Observations from EDA:**

* `Tenure`: The average tenure of customers with the company is around 32 months.
* `Monthly_Charges`: The average monthly charge is 64.77 USD.
* `Total_Charges`: The average total charge is 2282.94 USD. The distribution is skewed slightly to the right.
* `Senior_Citizen`: About 16% of customers are senior citizens.
* `Dependents`: More than 70% of customers don't have dependents.
* `Phone_Service`: More than 90% of customers have phone service enabled.
* `Internet_Service`: 44% of customers use Fiber optic for internet service and 34% use DSL, while the rest have no internet service at all.
* `Contract`: 55% of customers have month-to-month contracts. The other two contract types are one-year and two-year.
* `Payment_Method`: Electronic check is the most used payment method among the four methods of payment.
* `Churn`: The churn rate in the data is about 26%.
* `Churn vs Senior_Citizen`: Among senior citizens, the churn rate is about 41%. Senior citizens are more likely to churn than other customers.
* `Churn vs Internet_Service`: Among customers without internet service, the churn rate is very low (8%), while it is highest for Fiber optic users (42%).
* `Churn vs Contract`: As the length of the contract increases, the likelihood of churning decreases. The churn rate is 43% for month-to-month contracts, 11% for one-year contracts, and only 3% for two-year contracts.
* `Churn vs Payment_Method`: Customers with Electronic Check payment have a higher churn rate than any other payment method.
* `Churn vs Tenure`: As tenure increases, the customers are less likely to churn. Customers with low tenure have churned the most.
* `Churn vs Charges`: Customers who churned have higher monthly charges but lower total charges.
* `Contract vs Internet_Service`: Among month-to-month contract customers, Fiber optic is the most used internet service. Among one-year and two-year contract customers, DSL is used more than Fiber optic.
* `Contract vs Payment_Method`: Among month-to-month contract customers, Electronic check is used extensively. Among one-year and two-year contract customers, Credit card and Bank transfer are used more than the other payment methods.
* `Internet_Service vs Payment_Method`: Customers without internet service use Mailed check the most. Customers with Fiber optic internet service use Electronic check the most.

* In many other columns, like Online_Security, Online_Backup, Tech_Support, Streaming_Movies, etc. there is a level named 'No internet service'. Moreover, the count for the 'No internet service' level is also the same in all columns. This means that customers with No internet service don't have access to many other services like online security, streaming movies, etc.
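
Group-wise churn rates like those listed above can be obtained with a `groupby` on the 0/1 target. This sketch uses toy data with the same column names as the table; the numbers are illustrative, not the real rates:

```python
import pandas as pd

# Toy data (illustrative only) using the table's column names
toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 4 + ["One year"] * 4 + ["Two year"] * 4,
    "Churn": [1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
})

# Because Churn is coded 0/1, the group mean is the churn rate of that group
churn_by_contract = toy.groupby("Contract")["Churn"].mean()
print(churn_by_contract)
```

Replacing `"Contract"` with another categorical column, such as `"Internet_Service"` or `"Payment_Method"`, gives the corresponding comparison.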

### Conclusions


* The K-Nearest Neighbors and Random Forest models overfit the training data and do not generalise well: their accuracy drops from 0.87 to 0.72 and from 0.88 to 0.78, respectively, between the training and test sets.
* The Logistic Regression, Decision Tree and XGBoost models reach similar accuracy on the training and test sets.
* Logistic Regression generalises well (its training and test metrics are nearly identical) and has the highest recall for "Exits" (0.80), although its precision for that class is modest (0.53).
* What precision and recall mean for customer churn:

**Precision** - Of all the customers that the algorithm predicts will churn, how many of them do actually churn?

**Recall** – What percentage of customers that end up churning does the algorithm successfully find?

Both precision and recall values are important for customer churn.

High recall with low precision means the model flags many customers who would have stayed as likely churners, which adds overhead for the business.
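
In terms of the confusion matrix, with "Exits" as the positive class, the two definitions can be written as formulas. The worked numbers use the Logistic Regression confusion matrix from the test output above (TP = 410, FP = 370, FN = 105):

$$
\text{Precision} = \frac{TP}{TP + FP} = \frac{410}{410 + 370} \approx 0.53,
\qquad
\text{Recall} = \frac{TP}{TP + FN} = \frac{410}{410 + 105} \approx 0.80
$$

So Logistic Regression finds about 80% of the customers who leave, but 370 of the 780 customers it flags would have stayed.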



### Recommendations

* The company should offer bonus plans or discounts on recharges to customers who are likely to churn.
* The company should improve its Fiber optic internet service, since Fiber optic customers have the highest churn rate.
* The company should give newly joined customers attractive offers so that they stay longer with the network, and should encourage customers to sign long-term contracts.
* According to the model, customers who use the internet to stream TV and movies are more likely to churn than those who don't. This may be due to poor internet connectivity, which the company should investigate and improve.
* Online Security and Tech Support should be provided to as many customers as possible.