## Customer Churn Analysis: Ding

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Reading the dataset
df = pd.read_csv('Test Dataset.csv')

### Exploratory Data Analysis and Data Preprocessing

In [None]:
df.head()

In [None]:
# Checking for duplicated rows in the dataset
df.duplicated().sum()

In [None]:
# Dropping the customer identification column, as we do not need it for our analysis
df.drop('customer_id', inplace=True, axis = 1)

In [None]:
print('Number of rows :',df.shape[0])

In [None]:
print('Number of columns :',df.shape[1])

In [None]:
# Checking the spread of the data
df.describe()

In [None]:
# Checking the datatype of each column and the number of non-null values in each column
df.info()

#### Seems like we have very few missing values

In [None]:
# Checking number of null values in each column
df.isnull().sum()

In [None]:
# Checking the rows with null values
df[df.isnull().any(axis=1)]

#### Since we have only 4 rows with missing data, we can drop them as it is a very small proportion of total data

In [None]:
# Dropping the rows with null values
df = df.dropna()

#### We have a few categorical variables in our dataset. We need to convert them to numerical columns

In [None]:
for feature in df:
    if df[feature].dtype == 'object':
        df[feature] = pd.Categorical(df[feature]).codes
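
Note that `pd.Categorical(...).codes` replaces the labels with integers and discards the labels themselves. A minimal sketch of how the code-to-label mapping could be kept for later interpretation is shown below. The `demo` frame, its values, and the `mappings` dictionary are hypothetical and only illustrate the idea; they are not part of the original dataset.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset
demo = pd.DataFrame({
    'platform': ['web', 'app', 'web', 'app'],
    'amount': [10.0, 5.0, 7.5, 3.0],
})

mappings = {}  # column name -> {integer code: original label}
for feature in demo:
    # Encode every non-numeric column, remembering what each code means
    if not pd.api.types.is_numeric_dtype(demo[feature]):
        cat = pd.Categorical(demo[feature])
        mappings[feature] = dict(enumerate(cat.categories))
        demo[feature] = cat.codes

print(mappings)
```

With such a mapping, the encoded values that appear in later group-by outputs can be translated back to their original labels.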

In [None]:
df.info()

In [None]:
df.isnull().sum().sum()

#### We have all the variables in numerical format and there are no null values as we dropped them earlier

### We need to perform predictive analytics using Machine Learning models to understand who is likely to churn.
### However, it is also important to understand how our users are churning.
### Let us perform some descriptive analysis on the data to understand how the customers are churning:

### Descriptive Analysis:

In [None]:
# Checking the proportion of users who had churned
df.churned.value_counts(normalize=True)

#### 72.6% of our customers are churning, which is quite high
#### Let us try to understand how the churn rate differs between domestic and international users

In [None]:
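# Pass the column as a keyword argument; recent seaborn versions no longer accept it positionally
sns.countplot(x='is_domestic', hue='churned', data=df)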

#### The plot shows that domestic users are churning at a high rate. Let us check the churn rate in numbers

In [None]:
df.groupby('is_domestic')['churned'].value_counts(normalize=True)

#### The churn rate of domestic users is 84%, while the overall churn rate is 72.6%. Domestic users are churning faster.
#### Let us try to understand this further based on the platform users use for top-ups.

In [None]:
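# Pass the column as a keyword argument; recent seaborn versions no longer accept it positionally
sns.countplot(x='platform', hue='churned', data=df)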

In [None]:
df.groupby('platform')['churned'].value_counts(normalize=True)

In [None]:
df.groupby(['is_domestic','platform'])['churned'].value_counts(normalize=True)

#### The Web users churn at 79% while App users are churning at the rate of 57.4%
#### However, domestic web users are churning at 87%
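
The segment-level churn rates can also be read from a single table. The sketch below assumes that `churned` is encoded as 0/1 with 1 meaning churned, so that the mean of the column is the churn rate. The `demo` frame and its values are made up for illustration and do not come from the dataset.

```python
import pandas as pd

# Hypothetical encoded data: 1 = domestic, 1 = web, 1 = churned (assumed encoding)
demo = pd.DataFrame({
    'is_domestic': [1, 1, 1, 1, 0, 0, 0, 0],
    'platform':    [1, 1, 0, 0, 1, 1, 0, 0],
    'churned':     [1, 1, 1, 0, 1, 0, 0, 0],
})

# Mean of a 0/1 column is the share of 1s, i.e. the churn rate per segment
rates = demo.pivot_table(index='is_domestic', columns='platform',
                         values='churned', aggfunc='mean')
print(rates)
```

Each cell of `rates` is the churn rate for one combination of `is_domestic` and `platform`, which makes the segments easier to compare than the stacked `value_counts` output.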

### We understand that web users have a higher churn rate than App users.
### To understand this further, we need to answer a few questions. For example:
##### - If the same customer uses the Application regularly and the Website occasionally, are they counted as 2 different users?
##### - What are the ways to encourage users to use the Application?

## Predictive Analytics:

### Model Building and evaluation:

In [None]:
# Separating dependent and independent variables
x = df.drop('churned', axis=1)  # drop returns a new DataFrame; df is left unchanged
y = df['churned']

In [None]:
# Splitting the dataset into training set and test set
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.30, random_state=0)

In [None]:
print('x_train',x_train.shape)
print('x_test',x_test.shape)
print('y_train',y_train.shape)
print('y_test',y_test.shape)

In [None]:
from sklearn.metrics import roc_auc_score, roc_curve, classification_report, confusion_matrix, accuracy_score

#### Our variables are on very different scales. Let us consider algorithms that are tolerant to outliers and to the scale of the variables, starting with a decision tree

#### Decision Tree Classifier:

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

In [None]:
dtree = DecisionTreeClassifier()

In [None]:
# Model hyperparameter tuning for best combination of model parameters using param grid and cross validation
param_grid = {
    'criterion': ['gini'],
    'max_depth': [4,6,8,10],
    'min_samples_leaf': [50,100,150],
    'min_samples_split': [150,200,300],
    'random_state': [0]
}

grid_search = GridSearchCV(estimator = dtree, param_grid = param_grid, cv = 3)


In [None]:
grid_search.fit(x_train, y_train)
print(grid_search.best_params_)

In [None]:
best_grid_dtcl = grid_search.best_estimator_
best_grid_dtcl

In [None]:
# checking feature importances
print (pd.DataFrame(best_grid_dtcl.feature_importances_, columns = ["Imp"], index = x_train.columns).sort_values('Imp',ascending=False))


In [None]:
## prediction
ytrain_predict_dtcl = best_grid_dtcl.predict(x_train)
ytest_predict_dtcl = best_grid_dtcl.predict(x_test)

ytrain_predict_prob_dtcl = best_grid_dtcl.predict_proba(x_train)
ytest_predict_prob_dtcl = best_grid_dtcl.predict_proba(x_test)


In [None]:
# Model metrics for Decision Tree
print('Training accuracy:',accuracy_score(y_train, ytrain_predict_dtcl))
print('Testing accuracy:', accuracy_score(y_test, ytest_predict_dtcl)) 
print('Confusion matrix for training set: \n',confusion_matrix(y_train, ytrain_predict_dtcl)) 
print('Confusion matrix for test set: \n',confusion_matrix(y_test, ytest_predict_dtcl)) 
print('classification report for training set: \n',classification_report(y_train, ytrain_predict_dtcl))

print('classification report for test set: \n',classification_report(y_test, ytest_predict_dtcl))
 
print('training roc_auc_score:',roc_auc_score(y_train, ytrain_predict_prob_dtcl[:,1]))
print('test set roc_auc_score:',roc_auc_score(y_test, ytest_predict_prob_dtcl[:,1]))


#### Decision tree model metric analysis:

### The prediction accuracy for both the training and test sets is around 80%.
### Also, out of all the positive predictions, 82% are truly positive (precision).
### Out of all the positives in the dataset, 92% are captured by the model and predicted as positive, which gives our model a very strong recall.
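
As a reminder of how these two metrics are derived, the sketch below computes precision and recall by hand from a confusion matrix. The labels are a small made-up example, not the model's actual predictions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels for illustration only (1 = churned, 0 = retained)
y_true = [1, 1, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 1, 0, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted churners who really churned
recall = tp / (tp + fn)     # share of actual churners the model caught

print('precision:', precision, 'sklearn:', precision_score(y_true, y_pred))
print('recall:', recall, 'sklearn:', recall_score(y_true, y_pred))
```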


#### However, in the feature importances, we see that a few features contribute very little to the model's predictions.

#### Let us use the Random Forest classifier, which combines many decision trees in an ensemble

#### Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rfcl = RandomForestClassifier()

### A random forest builds multiple decision trees on different subsets of features and predicts the class by voting across the trees (the mode), so features that a single tree barely uses can still contribute to the prediction.


In [None]:
param_grid = {
    
    'max_depth': [4,6,8,10],
    'min_samples_leaf': [50,100,150],
    'min_samples_split': [150,200,300],
    
    'max_features': [4, 6, 8],
    'n_estimators': [101, 301],

    'random_state': [0]
}

In [None]:
grid_search = GridSearchCV(estimator = rfcl, param_grid = param_grid, cv = 3)

In [None]:
grid_search.fit(x_train, y_train)


In [None]:
grid_search.best_params_


In [None]:
best_grid_rfcl = grid_search.best_estimator_
best_grid_rfcl


In [None]:
print (pd.DataFrame(best_grid_rfcl.feature_importances_, columns = ["Imp"], index = x_train.columns).sort_values('Imp',ascending=False))

#### We see that the features with low importance in the decision tree have a higher importance in the random forest model

In [None]:
# Prediction
ytrain_predict_rfcl = best_grid_rfcl.predict(x_train)
ytest_predict_rfcl = best_grid_rfcl.predict(x_test)

ytrain_predict_prob_rfcl = best_grid_rfcl.predict_proba(x_train)
ytest_predict_prob_rfcl = best_grid_rfcl.predict_proba(x_test)


In [None]:
print('Training accuracy:',accuracy_score(y_train, ytrain_predict_rfcl))
print('Testing accuracy:', accuracy_score(y_test, ytest_predict_rfcl)) 
print('Confusion matrix for training set: \n',confusion_matrix(y_train, ytrain_predict_rfcl)) 
print('Confusion matrix for test set: \n',confusion_matrix(y_test, ytest_predict_rfcl)) 
print('classification report for training set: \n',classification_report(y_train, ytrain_predict_rfcl))
print('classification report for test set: \n',classification_report(y_test, ytest_predict_rfcl))
print('training roc_auc_score:',roc_auc_score(y_train, ytrain_predict_prob_rfcl[:,1]))
print('test roc_auc_score:',roc_auc_score(y_test, ytest_predict_prob_rfcl[:,1]))


#### Random Forest Classifier model metric analysis:

### The prediction accuracy for both the training and test sets is still around 80%, almost the same as for the decision tree. So is the precision.
### However, recall has increased to 94%, making it a better model.

#### Let us consider another ensemble algorithm, the XGBoost classifier, to check whether it gives better results

#### XGBoost Model:

In [None]:
import xgboost as xgb



In [None]:

XGB_model=xgb.XGBClassifier(random_state=1,learning_rate=0.01)
XGB_model.fit(x_train, y_train)



In [None]:


y_train_predict_XGB = XGB_model.predict(x_train)
y_test_predict_XGB = XGB_model.predict(x_test)

y_train_predict_prob_XGB = XGB_model.predict_proba(x_train)
y_test_predict_prob_XGB = XGB_model.predict_proba(x_test)


In [None]:
print('Training accuracy:',accuracy_score(y_train, y_train_predict_XGB))
print('Testing accuracy:',accuracy_score(y_test, y_test_predict_XGB)) 
print('Confusion matrix for training set: \n',confusion_matrix(y_train, y_train_predict_XGB)) 
print('Confusion matrix for test set: \n',confusion_matrix(y_test, y_test_predict_XGB)) 
print('classification report for training set: \n',classification_report(y_train, y_train_predict_XGB))
print('classification report for test set: \n',classification_report(y_test, y_test_predict_XGB))
print('training roc_auc_score:',roc_auc_score(y_train, y_train_predict_prob_XGB[:,1]))
print('test roc_auc_score:',roc_auc_score(y_test, y_test_predict_prob_XGB[:,1]))


#### XGBoost Classifier model metric analysis:
#### The metrics are almost the same as for our random forest classifier, but XGBoost does not perform as well as the random forest in terms of recall

### Scope for Model Optimization techniques:

### Since we selected tree-based algorithms whose performance is not affected by the scale of the features, scaling the data is not necessary.
### However, we know that the data is not completely balanced: 72.6% of the customers are churners.
### From the metrics, we notice that the models perform well for class 1, but recall and precision are weaker for class 0.
### A possible reason is this class imbalance: there are fewer 0s than 1s in our data.
### Although roughly 3:7 is an acceptable ratio for a binary target, balancing the data would give both classes equal visibility to our Machine Learning models.

#### We can use the SMOTE technique to balance the data and re-run our models to check if we get better results
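
The core idea of SMOTE is to create synthetic minority-class rows by interpolating between a minority row and one of its nearest minority neighbours. The sketch below is a simplified, hand-written illustration of that idea for a binary target, not the reference implementation; the function name `smote_like_oversample` and the demo data are hypothetical. The imbalanced-learn package provides a full `SMOTE` class that would normally be used instead.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_oversample(X, y, minority_label, k=5, random_state=0):
    """Simplified SMOTE-style oversampling for a binary target (illustrative only)."""
    rng = np.random.default_rng(random_state)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X_min = X[y == minority_label]
    # Number of synthetic rows needed to match the majority class size
    n_needed = int((y != minority_label).sum() - len(X_min))
    if n_needed <= 0:
        return X, y
    # Ask for k + 1 neighbours because each row is its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), size=n_needed)            # random minority rows
    chosen = neighbours[base, rng.integers(0, k, size=n_needed)]  # one neighbour each
    gap = rng.random((n_needed, 1))                               # interpolation weight
    synthetic = X_min[base] + gap * (X_min[chosen] - X_min[base])
    X_out = np.vstack([X, synthetic])
    y_out = np.concatenate([y, np.full(n_needed, minority_label)])
    return X_out, y_out

# Hypothetical demo data with roughly the 3:7 class ratio discussed above
demo_rng = np.random.default_rng(0)
X_demo = demo_rng.normal(size=(100, 3))
y_demo = np.array([1] * 73 + [0] * 27)
X_bal, y_bal = smote_like_oversample(X_demo, y_demo, minority_label=0)
print(np.bincount(y_bal))
```

Oversampling should be applied to the training set only, so that the test set keeps the original class distribution. Interpolation is also only meaningful for numeric features; integer-encoded categorical columns would need separate handling.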

### Conclusion:

### In Churn analysis, we would want to make sure to capture most of the churners. The metric for this is Recall / Sensitivity
### In those terms, we can say our random forest classifier is doing a great job by capturing 94% of churners in its predictions
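
Since recall is the metric we care about, the grid search could also be asked to select hyperparameters by recall rather than by the default accuracy. The sketch below shows this with `scoring='recall'` on synthetic data; the data, the reduced parameter grid, and the variable names are assumptions made for illustration and do not reproduce the results above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the churn data: roughly 70% positives (churners)
X_demo, y_demo = make_classification(n_samples=600, n_features=8,
                                     weights=[0.3, 0.7], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30,
                                          random_state=0, stratify=y_demo)

# Small illustrative grid; the real search would use the grid defined earlier
param_grid = {'max_depth': [4, 6], 'min_samples_leaf': [10, 25]}

search = GridSearchCV(RandomForestClassifier(n_estimators=50, random_state=0),
                      param_grid=param_grid, cv=3, scoring='recall')
search.fit(X_tr, y_tr)

test_recall = recall_score(y_te, search.predict(X_te))
print(search.best_params_, test_recall)
```

Optimising recall alone can favour models that label almost everyone as a churner, so precision should still be checked alongside it.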