<a href="https://colab.research.google.com/github/jsroa15/BCG/blob/main/Churn_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Churn prediction: Model and Evaluation**

# **Importing packages and modules**

In [44]:
#Necessary imports

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#Machine learning models

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
import xgboost as xgb
from sklearn.naive_bayes import GaussianNB

#Machine learning utilities
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split 
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score

# **Loading data**


In [45]:
X_train=pd.read_csv('/content/drive/MyDrive/Colab Notebooks/BCG/X_train.csv')
X_test=pd.read_csv('/content/drive/MyDrive/Colab Notebooks/BCG/X_test.csv')
y_train=pd.read_csv('/content/drive/MyDrive/Colab Notebooks/BCG/y_train.csv')
y_test=pd.read_csv('/content/drive/MyDrive/Colab Notebooks/BCG/y_test.csv')



The earlier exploratory data analysis (EDA) showed that the original data is imbalanced: about 90% of customers were retained and only about 10% churned.

Based on that finding, we balance the training data with synthetic samples.

In [46]:
#Fixing issue with unbalanced data. Generation of synthetic samples

#Methodology based on: https://towardsdatascience.com/methods-for-dealing-with-imbalanced-data-5b761be45a18


from imblearn.over_sampling import SMOTE

#Convert to NumPy arrays first; ravel() flattens y to the 1-D shape the estimators expect
X_train=X_train.values
y_train=y_train.values.ravel()
X_test=X_test.values
y_test=y_test.values.ravel()

#Oversample the training set only. Recent imbalanced-learn releases use
#sampling_strategy and fit_resample in place of the removed ratio and fit_sample
sm = SMOTE(random_state=123, sampling_strategy=1.0)
X_train_smote, y_train_smote = sm.fit_resample(X_train, y_train)


In [47]:
y_train

array([0, 0, 0, ..., 0, 0, 0])

In [48]:
#Checking

from collections import Counter

print(Counter(y_train)) #Categories on initial y_train

print(Counter(y_train_smote)) #Categories in final y_train

Counter({0: 9689, 1: 1055})
Counter({0: 9689, 1: 9689})
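SMOTE builds each synthetic row by interpolating between a minority-class row and one of its nearest minority neighbours; here it added 9689 - 1055 = 8634 synthetic churn rows. The sketch below illustrates that idea with NumPy only. The function `smote_like_samples` and its parameters are hypothetical names, and the code is a simplification for illustration, not imbalanced-learn's implementation. It assumes there are more than `k` minority rows.

```python
import numpy as np


def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    n = len(X_minority)

    # Pairwise Euclidean distances between minority rows.
    diff = X_minority[:, None, :] - X_minority[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)              # a row is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)       # random minority row
    pick = rng.integers(0, k, size=n_new)       # which of its k neighbours to use
    mate = neighbours[base, pick]
    lam = rng.random((n_new, 1))                # position along the segment, in [0, 1)
    return X_minority[base] + lam * (X_minority[mate] - X_minority[base])


rng = np.random.default_rng(1)
X_min = rng.normal(size=(20, 3))                # 20 toy minority rows, 3 features
synthetic = smote_like_samples(X_min, n_new=50)
print(synthetic.shape)
```

Because every synthetic row is a convex combination of two real minority rows, it always lies inside the region already covered by the minority class.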


# **Modeling**

## **Baseline model: Naive Bayes classifier with unbalanced data**

We are going to create a function to measure the performance of a classification model.





In [49]:
def baseline_model_performance_classification(model,Xtrain,Ytrain,Xtest,Ytest):

  '''
  Fit the model and compute the confusion matrix, AUC, accuracy, precision, recall and F1 score on the test set

  **Note**: Before calling the function it's necessary to instantiate a classification model

  Parameters
  ------------------------------

  model: Classification model created prior to apply function.
  Xtrain: Your X train dataset
  Ytrain: Your Y train dataset
  Xtest: Your X test dataset
  Ytest: Your Y test dataset

  '''

  #Fit the model

  model.fit(Xtrain,Ytrain)

  #Make predictions

  y_pred=model.predict(Xtest)
  y_pred_proba=model.predict_proba(Xtest)[:,1]

  #Printing Metrics

  print('\nConfusion Matrix')
  print(confusion_matrix(Ytest, y_pred))


  print('\nScores')
  print('------------------------')
  print('AUC:',np.round(roc_auc_score(Ytest,y_pred_proba)*100,2),'%')
  print('Accuracy:',np.round(accuracy_score(Ytest,y_pred)*100,2),'%')
  print('Precision:',np.round(precision_score(Ytest,y_pred)*100,2),'%')
  print('Recall:',np.round(recall_score(Ytest,y_pred)*100,2),'%')
  print('F1 score:',np.round(f1_score(Ytest,y_pred)*100,2))



In [50]:
#Call the new function

model=GaussianNB()
baseline_model_performance_classification(model,X_train,y_train,X_test,y_test)



Confusion Matrix
[[2167  675]
 [ 186  107]]

Scores
------------------------
AUC: 60.49 %
Accuracy: 72.54 %
Precision: 13.68 %
Recall: 36.52 %
F1 score: 19.91


In [51]:
#Measure performance with balanced data

model=GaussianNB()
baseline_model_performance_classification(model,X_train_smote,y_train_smote,X_test,y_test)



Confusion Matrix
[[1661 1181]
 [ 126  167]]

Scores
------------------------
AUC: 59.76 %
Accuracy: 58.31 %
Precision: 12.39 %
Recall: 57.0 %
F1 score: 20.35


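From the above we can see that training on the balanced data trades accuracy for recall: recall rises from 36.52% to 57.0%, while accuracy drops from 72.54% to 58.31%. AUC is almost unchanged (60.49% vs 59.76%) and precision stays low (13.68% vs 12.39%).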
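The scores for both baseline runs can be recomputed by hand from the two confusion matrices printed above, which makes the accuracy/recall trade-off easier to follow. This is a small sketch; `metrics_from_confusion` is a hypothetical helper name, and it assumes scikit-learn's layout of `[[TN, FP], [FN, TP]]`.

```python
import numpy as np


def metrics_from_confusion(cm):
    """Return accuracy, precision, recall and F1 (in %) for a 2x2 confusion matrix.

    Rows are true classes and columns are predictions: [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = np.asarray(cm, dtype=float)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    values = {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
    return {name: round(100 * float(v), 2) for name, v in values.items()}


# Confusion matrices copied from the two Naive Bayes runs above.
unbalanced = metrics_from_confusion([[2167, 675], [186, 107]])
balanced = metrics_from_confusion([[1661, 1181], [126, 167]])
print(unbalanced)
print(balanced)
```

With balanced training data the model flags many more customers as churners (FP grows from 675 to 1181), which lifts recall but costs accuracy.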

Now let's see what a more robust model looks like.

## **Robust models**

Before moving forward, we need to choose the metric used to evaluate the models. We use ROC AUC to compare and assess them.

The following function performs that evaluation with cross-validation on the training set and a final check on the test set.

In [52]:
def performance_classification_auc(model_dictionary,Xtrain,ytrain,Xtest,ytest,folds=5):
  '''
  Evaluate the performance of classification models via CV. The base metric is ROC AUC

  Parameters
  ------------
    model_dictionary: A dictionary whose keys are the instantiated models and whose values are the model names
    Xtrain: The X train dataset
    ytrain: The y train dataset
    Xtest: The X test dataset
    ytest: The y test dataset
    folds: Number of folds to cross-validate. Default 5

  '''
  null_list1=[]
  null_list2=[]

  for key,value in model_dictionary.items():
    metric=cross_val_score(key,Xtrain,ytrain,cv=folds,scoring='roc_auc')
    
    #Store the mean ROC AUC from CV
    metric_mean=round(metric.mean()*100,2)


    #Fit and predict
    key.fit(Xtrain,ytrain)
    y_pred=key.predict_proba(Xtest)[:,1]

    #Calculate the ROC AUC Score in Test (use the ytest argument, not the global y_test)
    metric_test=round(roc_auc_score(ytest,y_pred)*100,2)
    null_list1.append(metric_mean)
    null_list2.append(metric_test)

  index=[x for x in model_dictionary.values()]
  dff=pd.DataFrame([null_list1,null_list2],index=['ROC AUC Cross Val','ROC AUC Test set']).transpose()
  dff.index=index
  print(dff)

In [53]:
xg=XGBClassifier(random_state=123)
rf=RandomForestClassifier()
dt=DecisionTreeClassifier(random_state=123)
gb=GradientBoostingClassifier(random_state=123)
ad=AdaBoostClassifier(random_state=123)

models={xg:'Xgboost',rf:'Random Forest',dt:'Decision Tree',gb:'GradientBoost',ad:'Adaboosting'}

performance_classification_auc(models,X_train,y_train,X_test,y_test,folds=10)

               ROC AUC Cross Val  ROC AUC Test set
Xgboost                    68.06             67.86
Random Forest              68.47             69.01
Decision Tree              55.85             54.34
GradientBoost              67.77             66.91
Adaboosting                65.67             65.02


In [54]:
%time performance_classification_auc(models,X_train_smote,y_train_smote,X_test,y_test,folds=10)

               ROC AUC Cross Val  ROC AUC Test set
Xgboost                    95.74             66.73
Random Forest              98.02             68.20
Decision Tree              89.40             55.47
GradientBoost              95.76             66.43
Adaboosting                94.35             62.43
CPU times: user 3min 54s, sys: 208 ms, total: 3min 55s
Wall time: 3min 55s


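From the above, we can see that the classifiers trained on the unbalanced data struggle to separate churners from retained customers: the best test ROC AUC is 69.01% (Random Forest), and the single Decision Tree reaches only 54.34%.

The classifiers trained on the balanced data score very well in CV (89% to 98%) but no better on the test set (55% to 68%). The CV scores are inflated because SMOTE was applied before the folds were split, so each validation fold contains synthetic samples interpolated from rows in the training folds, whereas the test set holds only real customers. Balancing the data this way is therefore not worthwhile here.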
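The gap between CV and test scores on the balanced data can be reproduced on toy data. The sketch below compares oversampling before the CV split with oversampling inside each training fold. It rests on several assumptions: the data is synthetic (`make_classification`), random duplication of minority rows stands in for SMOTE so that only NumPy and scikit-learn are needed, and `oversample_minority` and `cv_auc` are hypothetical helper names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def oversample_minority(X, y, rng):
    """Duplicate random class-1 rows until both classes have the same size."""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    keep = np.concatenate([majority, minority, extra])
    return X[keep], y[keep]


def cv_auc(X, y, resample_inside_folds, seed=0):
    """Mean 5-fold ROC AUC, oversampling either before or inside the split."""
    rng = np.random.default_rng(seed)
    if not resample_inside_folds:
        X, y = oversample_minority(X, y, rng)        # copies leak into validation folds
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in folds.split(X, y):
        X_tr, y_tr = X[train_idx], y[train_idx]
        if resample_inside_folds:
            X_tr, y_tr = oversample_minority(X_tr, y_tr, rng)  # validation stays untouched
        clf = RandomForestClassifier(n_estimators=50, random_state=seed).fit(X_tr, y_tr)
        scores.append(roc_auc_score(y[val_idx], clf.predict_proba(X[val_idx])[:, 1]))
    return float(np.mean(scores))


X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           weights=[0.9, 0.1], flip_y=0.05, random_state=0)
leaky = cv_auc(X, y, resample_inside_folds=False)
honest = cv_auc(X, y, resample_inside_folds=True)
print(f"oversample before CV: {leaky:.3f}   oversample inside folds: {honest:.3f}")
```

With imbalanced-learn, the same effect is usually achieved by putting the sampler and the model in an `imblearn.pipeline.Pipeline` and passing that pipeline to `cross_val_score`.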

The strategy to follow is hyperparameter tuning, and then an ensemble to try to improve performance.

## **Hyperparameter tuning**

Based on the previous results, the best models are Xgboost and Random Forest.

Let's set up the hyperparameter search.

### Random Forest

In [55]:
#Exploring Random Forest parameters

rf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [56]:
#Some parameters to evaluate

rf_params={
  'max_depth':[3,6,9,12],
  'n_estimators':[100,200,500],
  'max_features':['log2','auto','sqrt'],
  'min_samples_leaf':[2,10,30,40],
  'random_state':[123]
}

rf_cv=RandomizedSearchCV(rf,rf_params,scoring='roc_auc',cv=3)

%time rf_cv.fit(X_train,y_train)


CPU times: user 1min 16s, sys: 122 ms, total: 1min 16s
Wall time: 1min 16s


RandomizedSearchCV(cv=3, error_score=nan,
                   estimator=RandomForestClassifier(bootstrap=True,
                                                    ccp_alpha=0.0,
                                                    class_weight=None,
                                                    criterion='gini',
                                                    max_depth=None,
                                                    max_features='auto',
                                                    max_leaf_nodes=None,
                                                    max_samples=None,
                                                    min_impurity_decrease=0.0,
                                                    min_impurity_split=None,
                                                    min_samples_leaf=1,
                                                    min_samples_split=2,
                                                    min_weight_fraction_leaf=0.0,
               

In [57]:
#Extracting best score of ROC AUC

print('ROC AUC: ',round(rf_cv.best_score_*100,3))

ROC AUC:  68.149


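From the above we can see that tuning brings no real gain: the best CV ROC AUC is 68.149%, compared with 68.47% for the default Random Forest (measured with 10 folds rather than 3). Now we can try the same approach with the balanced dataset.

Because of the computational cost, we again use a randomized search (RandomizedSearchCV) rather than an exhaustive grid search.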
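To make the cost argument concrete, the following arithmetic counts model fits for the Random Forest search space used in this notebook. It assumes 3-fold CV and the `RandomizedSearchCV` default of `n_iter=10`, and it ignores the final refit on the full training set.

```python
from math import prod

# Number of candidate values per hyperparameter in rf_params.
grid_sizes = {
    "max_depth": 4,
    "n_estimators": 3,
    "max_features": 3,
    "min_samples_leaf": 4,
    "random_state": 1,
}
cv_folds = 3
n_iter = 10  # RandomizedSearchCV default number of sampled combinations

combinations = prod(grid_sizes.values())
grid_fits = combinations * cv_folds      # exhaustive grid search
random_fits = n_iter * cv_folds          # randomized search

print(combinations, grid_fits, random_fits)
```

The randomized search therefore trains roughly 14 times fewer models, at the price of exploring only 10 of the 144 combinations.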

In [58]:
rf_params={
  'max_depth':[3,6,9,12],
  'n_estimators':[100,200,500],
  'max_features':['log2','auto','sqrt'],
  'min_samples_leaf':[2,10,30,40],
  'random_state':[123]
}

rf_cv_rand=RandomizedSearchCV(rf,rf_params,scoring='roc_auc',cv=3)

%time rf_cv_rand.fit(X_train_smote,y_train_smote)

CPU times: user 3min 12s, sys: 165 ms, total: 3min 12s
Wall time: 3min 12s


RandomizedSearchCV(cv=3, error_score=nan,
                   estimator=RandomForestClassifier(bootstrap=True,
                                                    ccp_alpha=0.0,
                                                    class_weight=None,
                                                    criterion='gini',
                                                    max_depth=None,
                                                    max_features='auto',
                                                    max_leaf_nodes=None,
                                                    max_samples=None,
                                                    min_impurity_decrease=0.0,
                                                    min_impurity_split=None,
                                                    min_samples_leaf=1,
                                                    min_samples_split=2,
                                                    min_weight_fraction_leaf=0.0,
               

In [59]:
#Extracting best score of ROC AUC

print('ROC AUC: ',round(rf_cv_rand.best_score_*100,3))

ROC AUC:  97.112


### XGBoost Classifier

Now we are going to tune the parameters for XGBoost.

In [60]:
#Extract parameters

xg.get_params()

{'base_score': 0.5,
 'booster': 'gbtree',
 'colsample_bylevel': 1,
 'colsample_bynode': 1,
 'colsample_bytree': 1,
 'gamma': 0,
 'learning_rate': 0.1,
 'max_delta_step': 0,
 'max_depth': 3,
 'min_child_weight': 1,
 'missing': None,
 'n_estimators': 100,
 'n_jobs': 1,
 'nthread': None,
 'objective': 'binary:logistic',
 'random_state': 123,
 'reg_alpha': 0,
 'reg_lambda': 1,
 'scale_pos_weight': 1,
 'seed': None,
 'silent': None,
 'subsample': 1,
 'verbosity': 1}

In [61]:
xg_params={
'min_child_weight': [i for i in np.arange(1,15,1)],
 'gamma': [i for i in np.arange(0,6,0.5)],
 'subsample': [i for i in np.linspace(0.3,1,15)],
 'colsample_bytree': [i for i in np.linspace(0.3,1,15)],
 'max_depth': [i for i in np.arange(1,15,2)],
 'scale_pos_weight':[i for i in np.arange(1,15,1)],
 'learning_rate': [i for i in np.arange(0.01,0.15,0.01)], #start at 0.01: a learning rate of 0 cannot learn
 'n_estimators' : [i for i in np.arange(100,2000,100)], #start at 100: 0 trees gives an empty model
 'colsample_bylevel': [i for i in np.linspace(0.3,1,15)],
 'colsample_bynode': [i for i in np.linspace(0.3,1,15)],


}

xg_cv_rand=RandomizedSearchCV(xg,xg_params,scoring='roc_auc',cv=5)

xg_cv_rand.fit(X_train,y_train)

print('ROC AUC: ',round(xg_cv_rand.best_score_*100,3))



ROC AUC:  67.886


In [64]:
model_xg=xg_cv_rand.best_estimator_
model_rf=rf_cv.best_estimator_

## Ensemble methods

To try to improve performance, we are going to use a voting classifier.



In [89]:
# Define the list of classifiers

classifiers = [('XGBoost Classifier', model_xg), ('Random Forest',model_rf)]

#Instantiate Voting Classifier

vc = VotingClassifier(estimators=classifiers)


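vc.fit(X_train, y_train)
#predict returns hard class labels (the default is voting='hard'), not probabilities
y_pred_vc = vc.predict(X_test)

#ROC AUC computed from labels, so it is not comparable with the probability-based scores above
roc_auc=roc_auc_score(y_test,y_pred_vc)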

print(round(roc_auc*100,2))


50.17


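The voting classifier did not give an improvement: it scores 50.17%. This figure is computed from hard class labels rather than predicted probabilities, so it is not directly comparable with the ROC AUC values above. Our final model will be the XGBoost classifier.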
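For a like-for-like ROC AUC, a voting ensemble can average predicted probabilities (`voting='soft'`) and be scored with `predict_proba`. The sketch below shows both variants side by side on synthetic data. Logistic regression and a random forest are stand-ins for the tuned models, so the numbers say nothing about the churn dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption: roughly 10% positives, like the churn data).
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.9, 0.1],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
]

# Hard voting: predict() returns 0/1 labels, so the ROC curve has a single point.
hard_vote = VotingClassifier(estimators=estimators).fit(X_tr, y_tr)
auc_from_labels = roc_auc_score(y_te, hard_vote.predict(X_te))

# Soft voting: class probabilities are averaged and can be ranked for ROC AUC.
soft_vote = VotingClassifier(estimators=estimators, voting="soft").fit(X_tr, y_tr)
auc_from_proba = roc_auc_score(y_te, soft_vote.predict_proba(X_te)[:, 1])

print(f"AUC from labels: {auc_from_labels:.3f}   AUC from probabilities: {auc_from_proba:.3f}")
```

`VotingClassifier` clones the estimators it receives, so the same list can be reused for both ensembles.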

In [87]:
y_pred=model_xg.predict_proba(X_test)[:,1]

roc_auc=roc_auc_score(y_test,y_pred)

print(round(roc_auc*100,2))

68.13


# **Final evaluation and conclusions**

## Feature importance for XGBoost Classifier

In [80]:
#Extracting feature importances for XGBoost Classifier

fi=model_xg.feature_importances_

aux=pd.read_csv('/content/drive/MyDrive/Colab Notebooks/BCG/X_train.csv')

fi=pd.DataFrame(fi,columns=['feature_importance'],index=aux.columns)

fi.sort_values(by='feature_importance',ascending=False)



| feature | feature_importance |
|---|---|
| origin_up_kamkkxfxxuwbdslkwifmmcsiusiuosws | 0.059809 |
| channel_sales_lmkebamcaaclubfxadlmueccxoimlema | 0.042912 |
| origin_up_lxidpiddsbxsbosboudacockeimpuepw | 0.039937 |
| has_gas_t | 0.037898 |
| channel_sales_ewpakwlliwisiwduibdlfmalxowmwpci | 0.035863 |
| margin_gross_pow_ele | 0.035706 |
| num_years_antig | 0.0333 |
| date_end_qtr_Q4 | 0.03106 |
| cons_last_month | 0.030504 |
| channel_sales_foosdfpfkusacimwkcsosbicdxkicaua | 0.029979 |


In [84]:
#Model Evaluation

baseline_model_performance_classification(model_xg,X_train,y_train,X_test,y_test)


Confusion Matrix
[[2728  114]
 [ 220   73]]

Scores
------------------------
AUC: 68.13 %
Accuracy: 89.35 %
Precision: 39.04 %
Recall: 24.91 %
F1 score: 30.42


In [104]:
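#Evaluate a stricter decision threshold (0.65 instead of the default 0.5)
y_pred=model_xg.predict_proba(X_test)[:,1]
y_pred_pro=y_pred>0.65
print(roc_auc_score(y_test,y_pred)) #ROC AUC uses the probabilities, so the threshold does not affect it
print(precision_score(y_test,y_pred_pro))
print(recall_score(y_test,y_pred_pro))
print(f1_score(y_test,y_pred_pro))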

0.6813100902359296
0.4716981132075472
0.17064846416382254
0.25062656641604014
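Raising the decision threshold from the default 0.5 to 0.65 increases precision (39.04% to 47.17%) and lowers recall (24.91% to 17.06%), while ROC AUC stays the same. The sketch below sweeps a few thresholds to show that trade-off in general. The labels and scores are randomly generated stand-ins, and `precision_recall_at` is a hypothetical helper name.

```python
import numpy as np


def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when rows with score > threshold are flagged as churn."""
    flagged = scores > threshold
    tp = np.sum(flagged & (y_true == 1))
    precision = tp / flagged.sum() if flagged.any() else float("nan")
    recall = tp / np.sum(y_true == 1)
    return float(precision), float(recall)


# Toy data: about 10% positives, whose scores are shifted upwards on average.
rng = np.random.default_rng(0)
y_true = (rng.random(3000) < 0.1).astype(int)
scores = np.clip(rng.normal(0.3 + 0.25 * y_true, 0.15), 0, 1)

thresholds = [0.35, 0.50, 0.65]
results = [precision_recall_at(y_true, scores, t) for t in thresholds]
for t, (p, r) in zip(thresholds, results):
    print(f"threshold={t:.2f}  precision={p:.3f}  recall={r:.3f}")
```

Recall can only fall as the threshold rises, because fewer customers are flagged. Whether precision improves enough to justify the lost recall depends on the relative cost of a missed churner and an unnecessary retention offer.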
