# Problem Statement
Customer churn and engagement have become one of the top issues for most banks. Acquiring a new customer costs significantly more than retaining an existing one, so it is of utmost importance for a bank to retain its customers.  
 
We have data from MeBank (name changed) covering 7124 customers. The data set has a dependent variable, “Exited”, and various independent variables.  
  
Based on the data, build a model to predict whether a customer will exit the bank. Split the data into train and test sets (70:30), build the model on the train set, and test it on the test set. Then provide recommendations to the bank so that it can retain the customers who are on the verge of exiting.


# Data Dictionary
<b>CustomerID</b> - Bank ID of the Customer  
<b>Surname</b> - Customer’s Surname  
<b>CreditScore</b> - Current Credit score of the customer  
<b>Geography</b> - Current country of the customer  
<b>Gender</b> - Customer’s Gender  
<b>Age</b> - Customer’s Age  
<b>Tenure</b> - Duration of the customer’s association with the bank, in years  
<b>Balance</b> - Current balance in the bank account.  
<b>Num of Dependents</b> - Number of dependents  
<b>Has Crcard</b> - 1 denotes customer has a credit card and 0 denotes customer does not have a credit card  
<b>Is Active Member</b> - 1 denotes customer is an active member and 0 denotes customer is not an active member  
<b>Estimated Salary</b> - Customer’s approx. salary  
<b>Exited</b> - 1 denotes customer has exited the bank and 0 denotes otherwise  

### Load library and import data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns  # used for the box plots, pair plot and heatmap below
from sklearn.neural_network import MLPClassifier

In [3]:
churn=pd.read_csv("Churn_Modelling.csv")

### Inspect the data

In [4]:
churn.head()

Unnamed: 0,RowNumber,CustomerId,Surname,Credit Score,Geography,Gender,Age,Tenure,Balance,Num of Dependents,Has CrCard,Is Active Member,Estimated Salary,Exited
0,1,15634602,Hargrave,619.0,France,Female,42,2.0,3000.0,1,1.0,1.0,101348.88,1
1,2,15647311,Hill,608.0,Spain,Female,41,1.0,83807.86,1,0.0,1.0,112542.58,0
2,3,15619304,Onio,502.0,France,Female,42,8.0,159660.8,3,1.0,0.0,113931.57,1
3,4,15701354,Boni,699.0,France,Female,39,1.0,3000.0,2,0.0,0.0,93826.63,0
4,5,15737888,Mitchell,850.0,Spain,Female,43,2.0,125510.82,1,1.0,1.0,79084.1,0


In [5]:
churn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7124 entries, 0 to 7123
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   RowNumber          7124 non-null   int64  
 1   CustomerId         7124 non-null   int64  
 2   Surname            7124 non-null   object 
 3   Credit Score       7118 non-null   float64
 4   Geography          7120 non-null   object 
 5   Gender             7116 non-null   object 
 6   Age                7124 non-null   object 
 7   Tenure             7110 non-null   float64
 8   Balance            7121 non-null   object 
 9   Num of Dependents  7124 non-null   int64  
 10  Has CrCard         7111 non-null   float64
 11  Is Active Member   7114 non-null   float64
 12  Estimated Salary   7123 non-null   float64
 13  Exited             7124 non-null   int64  
dtypes: float64(5), int64(4), object(5)
memory usage: 779.3+ KB


The Age and Balance variables hold numeric data, but their data type is object. It appears that some special character is present in these variables.  
There are also missing values in some variables.
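
One way to locate the entries that force a column to the object type is to coerce it to numeric and look at what fails to parse. The snippet below is a minimal sketch on a made-up Series standing in for `churn.Age`; the values are illustrative only.

```python
import pandas as pd

# Toy stand-in for an object-typed column such as churn.Age (values are made up).
s = pd.Series(["42", "41", "?", None, "39"], name="Age")

# errors="coerce" turns anything that cannot be parsed into NaN.
as_num = pd.to_numeric(s, errors="coerce")

# Entries that were present in the raw data but failed to parse as numbers.
bad = s[as_num.isna() & s.notna()]
print(bad.unique())
```

Applied to the real columns, the same check would list whichever special characters are present without having to scan sorted values by eye.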

# EDA

### Removing unwanted variables

In [0]:
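# remove the variables and check the first 10 rows
# (assumption: the identifier columns RowNumber, CustomerId and Surname are the unwanted ones)
churn = churn.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1)

churn.head(10)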

Checking dimensions after removing unwanted variables,

### Summary

In [0]:
churn.describe(include="all")

In [0]:
churn.shape

### Proportion of observations in Target classes

In [0]:
# Get the proportions
churn.Exited.value_counts(normalize=True)



### Checking for Missing values

In [0]:
# Are there any missing values ?
churn.isnull().sum()




There are some missing values

### Checking for inconsistencies in Balance and Age variable

In [0]:
churn.Balance.sort_values()

There are 3 cases where '?' is present and 3 cases where the value is missing for the Balance variable.  
The summary also confirms the count of missing values.  
To confirm the count of '?', run value_counts():

In [0]:
churn.Balance.value_counts()

In [None]:
churn[churn.Balance=="?"]

This confirms there are 3 cases having ?

In [0]:
churn.Age.value_counts().sort_values()

There is 1 case where ? is present

### Replacing ? as Nan in Age and Balance variable
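
The replacement itself can be done with `DataFrame.replace`. The snippet below is a minimal sketch on a toy frame standing in for `churn`; the values are made up, and the same call on `churn[['Age', 'Balance']]` is an assumption about how this step was carried out.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `churn`; the values are made up for illustration.
toy = pd.DataFrame({"Age": ["42", "?", "39"],
                    "Balance": ["3000.0", "83807.86", "?"]})

# Swap the '?' placeholder for NaN so that pandas counts it as missing.
toy[["Age", "Balance"]] = toy[["Age", "Balance"]].replace("?", np.nan)

print(toy.isnull().sum())
```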

Verifying count of missing values for Age and Balance variable below:

In [0]:
churn.Balance.isnull().sum()

In [0]:
churn.Age.isnull().sum()

### Imputing missing values

In [None]:
sns.boxplot(churn['Credit Score'])

As Outliers are present in the "Credit Score", so we impute the null values by median
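
The reason for preferring the median here is that a single extreme value pulls the mean but barely moves the median. A minimal sketch with made-up scores (not taken from the data set):

```python
import numpy as np
import pandas as pd

# Made-up credit scores with one extreme low value and one missing entry.
scores = pd.Series([600.0, 620.0, 640.0, 660.0, np.nan, 50.0])

print("mean:  ", scores.mean())    # dragged down by the value 50
print("median:", scores.median())  # stays near the bulk of the data

# Impute the missing entry with the median.
filled = scores.fillna(scores.median())
print(filled.tolist())
```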

In [None]:
sns.boxplot(churn['Tenure'])

In [None]:
sns.boxplot(churn['Estimated Salary'])

Substituting the mean value for all other numeric variables

In [0]:
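# Median for Credit Score (outliers present); mean for the other numeric variables
churn['Credit Score'] = churn['Credit Score'].fillna(churn['Credit Score'].median())
for column in ['Tenure', 'Estimated Salary']:
    churn[column] = churn[column].fillna(churn[column].mean())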

In [0]:
churn.isnull().sum()

### Converting Object data type into Categorical

In [0]:
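for column in ['Geography', 'Gender', 'Has CrCard', 'Is Active Member']:
    if churn[column].dtype == 'object':
        # .codes marks missing values as -1; map them back to NaN so they can still be imputed
        codes = pd.Categorical(churn[column]).codes
        churn[column] = np.where(codes == -1, np.nan, codes)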
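
It may help to see how `pd.Categorical(...).codes` treats missing entries: they receive the code -1 rather than staying NaN, so a later `fillna` would not see them unless they are mapped back. A minimal sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

# Made-up Geography values with one missing entry.
geo = pd.Series(["France", "Spain", np.nan, "France"])

codes = pd.Categorical(geo).codes
print(codes)  # the missing entry is coded as -1

# Map -1 back to NaN, then fill with the most frequent code (the mode).
restored = pd.Series(np.where(codes == -1, np.nan, codes))
filled = restored.fillna(restored.mode()[0])
print(filled.tolist())
```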

In [0]:
churn.head()

In [0]:
churn.info()

### Substituting the mode value for all categorical variables

In [0]:
for column in churn[['Geography','Gender','Has CrCard','Is Active Member']]:
    mode = churn[column].mode()
    churn[column] = churn[column].fillna(mode[0])

In [0]:
churn.isnull().sum()

Age and Balance are still not addressed. Getting the modal value

In [0]:
churn['Balance'].mode()

In [0]:
churn['Age'].mode()

Replacing nan with modal values,

In [0]:
churn['Balance']=churn['Balance'].fillna(3000)
churn['Age']=churn['Age'].fillna(37)

In [0]:
churn.isnull().sum()

There are no more missing values.

In [0]:
churn.info()

Age and Balance are still object, which has to be converted

### Converting Age and Balance to numeric variables

In [0]:
churn['Age']=churn['Age'].astype(str).astype(int)
churn['Balance']=churn['Balance'].astype(str).astype(float)

### Checking for Duplicates

In [0]:
# Are there any duplicates ?
dups = churn.duplicated()
print('Number of duplicate rows = %d' % (dups.sum()))
churn[dups]

There are no Duplicates

### Checking for Outliers

In [0]:
plt.figure(figsize=(15,15))
churn[['Age','Balance','Credit Score', 'Tenure', 'Estimated Salary']].boxplot(vert=0)

Only a very small number of outliers are present. They are not significant, as they will have little effect on the ANN predictions.
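
To put a number on what the box plot shows, the outliers can be counted with the 1.5 × IQR rule, which is the default whisker rule in matplotlib box plots. The helper `iqr_outlier_count` below is a hypothetical name, and the ages are made up.

```python
import pandas as pd

def iqr_outlier_count(s: pd.Series) -> int:
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    outside = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
    return int(outside.sum())

# Made-up ages with one extreme value.
ages = pd.Series([30, 32, 35, 37, 38, 40, 41, 43, 92])
print(iqr_outlier_count(ages))
```

Calling the helper on each continuous column of `churn` would give a count per variable to compare against the number of rows.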

### Checking pairwise distribution of the continuous variables

In [0]:
import seaborn as sns
sns.pairplot(churn[['Age','Balance','Credit Score', 'Tenure', 'Estimated Salary']])

### Checking for Correlations

In [0]:
# construct heatmap with only continuous variables
plt.figure(figsize=(10,8))
sns.set(font_scale=1.2)
sns.heatmap(churn[['Age','Balance','Credit Score', 'Tenure', 'Estimated Salary']].corr(), annot=True)

There is hardly any correlation between the variables

### Train Test Split

In [0]:
from sklearn.model_selection import train_test_split

In [0]:
#Extract x and y
x = churn.drop('Exited', axis=1)
y = churn['Exited']



In [0]:
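#split data into 70% training and 30% test data
# random_state is an arbitrary seed, added here only for reproducibility
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=1)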


In [0]:
# Checking dimensions on the train and test data
print('x_train: ',x_train.shape)
print('x_test: ',x_test.shape)
print('y_train: ',y_train.shape)
print('y_test: ',y_test.shape)
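
When the target classes are imbalanced, passing `stratify` to `train_test_split` keeps the class proportions the same in both parts. This is an optional variation, not necessarily what this notebook used. A minimal sketch on synthetic data, where the 20% positive rate is an assumption chosen for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and a target with an assumed 20% positive rate.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.array([1] * 200 + [0] * 800)

# stratify=y preserves the 80:20 class ratio in both train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```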

### Scaling the variables

In [0]:
from sklearn.preprocessing import StandardScaler

In [0]:
#Initialize an object for StandardScaler
sc = StandardScaler()



In [0]:
#Scale the training data
x_train = sc.fit_transform(x_train)



In [0]:
x_train

In [0]:
# Apply the transformation on the test data
x_test = sc.transform(x_test)

In [0]:
x_test
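
The scaler is fitted on the training data only, and the test data is transformed with the training mean and standard deviation, so no information from the test set leaks into preprocessing. A minimal sketch of this pattern on synthetic arrays (the name `scaler` and the data are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic train and test features drawn from the same distribution.
rng = np.random.default_rng(0)
train = rng.normal(loc=50, scale=10, size=(700, 2))
test = rng.normal(loc=50, scale=10, size=(300, 2))

scaler = StandardScaler()
train_s = scaler.fit_transform(train)  # learns mean and std from train only
test_s = scaler.transform(test)        # reuses the training statistics

print(train_s.mean(axis=0).round(3), train_s.std(axis=0).round(3))
print(test_s.mean(axis=0).round(3))
```

The scaled training columns have mean 0 and standard deviation 1 by construction; the test columns are only approximately centred, which is expected.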

### Building Neural Network Model

In [0]:
clf = MLPClassifier(hidden_layer_sizes=100, max_iter=5000,
                     solver='sgd', verbose=True,  random_state=21,tol=0.01)

In [0]:
# Fit the model on the training data
clf.fit(x_train, y_train)



### Predicting training data

In [0]:
# use the model to predict the training data
y_pred = clf.predict(x_train)




### Evaluating model performance on training data

In [0]:
from sklearn.metrics import confusion_matrix,classification_report

In [0]:
confusion_matrix(y_train,y_pred)

In [0]:
print(classification_report(y_train, y_pred))
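
Sensitivity (the recall of class 1) and specificity can be read directly off the confusion matrix. A minimal sketch with made-up labels, not the notebook's predictions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions (1 = exited, 0 = stayed).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 1])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

sensitivity = tp / (tp + fn)  # share of actual exits that were caught
specificity = tn / (tn + fp)  # share of non-exits correctly identified
print(sensitivity, specificity)
```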

In [0]:
# AUC and ROC for the training data
# predict probabilities
probs = clf.predict_proba(x_train)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
from sklearn.metrics import roc_auc_score
auc = roc_auc_score(y_train, probs)
print('AUC: %.3f' % auc)
# calculate roc curve
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_train, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(fpr, tpr, marker='.')
# show the plot
plt.show()

### Predicting Test Data and comparing model performance

In [0]:
y_pred = clf.predict(x_test)

In [0]:
confusion_matrix(y_test, y_pred)

In [0]:
print(classification_report(y_test, y_pred))

In [0]:
# AUC and ROC for the test data

# predict probabilities
probs = clf.predict_proba(x_test)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(y_test, probs)
print('AUC: %.3f' % auc)
# calculate roc curve
fpr, tpr, thresholds = roc_curve(y_test, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(fpr, tpr, marker='.')
# show the plot
plt.show()

### Model Tuning through Grid Search

**The code below may take a long time to run. These values can be used instead: {'hidden_layer_sizes': 500, 'max_iter': 5000, 'solver': 'adam', 'tol': 0.01}**
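
To skip the search, the suggested values can be passed straight to `MLPClassifier`. The snippet below is a minimal sketch on synthetic data: `make_classification` stands in for the scaled churn features, and the names `params`, `model` and `sc_demo` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn features and target.
X, y = make_classification(n_samples=500, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=1)

# Scale using statistics from the training part only.
sc_demo = StandardScaler().fit(X_tr)

# The values suggested above, used directly instead of running the grid search.
params = {'hidden_layer_sizes': 500, 'max_iter': 5000, 'solver': 'adam', 'tol': 0.01}
model = MLPClassifier(random_state=1, **params)
model.fit(sc_demo.transform(X_tr), y_tr)

score = model.score(sc_demo.transform(X_te), y_te)  # mean accuracy on held-out data
print(score)
```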

In [0]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'hidden_layer_sizes': [100,200,300,500],
    'max_iter': [5000,2500,7000,6000],
    'solver': ['sgd','adam'],
    'tol': [0.01],
}

nncl = MLPClassifier(random_state=1)

grid_search = GridSearchCV(estimator = nncl, param_grid = param_grid, cv = 10)

In [0]:
grid_search.fit(x_train, y_train)

In [0]:
grid_search.best_params_

In [0]:
best_grid = grid_search.best_estimator_

In [0]:
best_grid

In [0]:
ytrain_predict = best_grid.predict(x_train)
ytest_predict = best_grid.predict(x_test)

In [0]:
confusion_matrix(y_train,ytrain_predict)

In [0]:
# Accuracy of Train data
best_grid.score(x_train, y_train)


In [0]:
print(classification_report(y_train,ytrain_predict))

In [0]:
#from sklearn.metrics import roc_curve,roc_auc_score
rf_fpr, rf_tpr,_=roc_curve(y_train,best_grid.predict_proba(x_train)[:,1])
plt.plot(rf_fpr,rf_tpr, marker='x', label='NN')
plt.plot(np.arange(0,1.1,0.1),np.arange(0,1.1,0.1))
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.show()
print('Area under Curve is', roc_auc_score(y_train,best_grid.predict_proba(x_train)[:,1]))

In [0]:
confusion_matrix(y_test,ytest_predict)

In [0]:
# Accuracy of Test data
best_grid.score(x_test, y_test)


In [0]:
print(classification_report(y_test,ytest_predict))

In [0]:
#from sklearn.metrics import roc_curve,roc_auc_score
rf_fpr, rf_tpr,_=roc_curve(y_test,best_grid.predict_proba(x_test)[:,1])
plt.plot(rf_fpr,rf_tpr, marker='x', label='NN')
plt.plot(np.arange(0,1.1,0.1),np.arange(0,1.1,0.1))
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.show()
print('Area under Curve is', roc_auc_score(y_test,best_grid.predict_proba(x_test)[:,1]))

In [0]:
best_grid.score(x_test, y_test)  # mean accuracy on the test data

## Conclusion

The AUC is 86% on the training data and 84% on the test data. Precision and recall are also similar between the training and test sets, which suggests that the model is not overfitting.
  
The best_grid model performs better than the initial clf model, whose sensitivity was much lower.

The overall model performance is moderate, which is enough to start predicting whether a new customer will churn.