<a href="https://colab.research.google.com/github/kelvinren/Bank-Customer-Dataset/blob/master/RSF3G3_TanYongShen(19WMR02140)%2CYimJianwei(19WMR01569).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**RSF3 G3**               
**Tan Yong Shen ( 19WMR02140 )**    
**Yim Jianwei ( 19WMR01569 )**

# Bank Customer Data set  ( Customer Churn Analysis ) 

# Introduction

# Business Understanding

A bank's core business is lending money to eligible individuals, businesses and other parties that need funds. The bank charges a fee for this service, known as interest on loans. The money it lends out is in turn funded through its other services, such as bank accounts, fixed deposits, remittance services and foreign exchange transactions. Through this cycle, a bank earns its profit by charging a higher interest rate on loans than it pays on deposits. Some studies report that acquiring a new customer can cost five times more than satisfying and retaining an existing one (Landis, 2020). Managing and predicting customer churn is therefore an essential objective for a bank: customers are the main source of funds in this business model, and handling them poorly could cause massive customer churn and may even result in bankruptcy. The goal of this project is to compare different prediction models on how accurately they predict whether a customer is likely to churn, using customer data from three neighbouring countries: France, Spain and Germany. We also visualise which factors contribute to customer churn.

**Preparation**

1.   Import Libraries
2.   Load Data Set



In [None]:
#1. Import Libraries
from __future__ import print_function
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns #visualization
import matplotlib.pyplot as plt #visualization
%matplotlib inline

import itertools
import warnings
warnings.filterwarnings("ignore")
import os
import io
import plotly.offline as py #visualization
py.init_notebook_mode(connected=True) #visualization
import plotly.graph_objs as go #visualization
import plotly.tools as tls #visualization
import plotly.figure_factory as ff #visualization
#print(os.listdir("../input"))

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# 2. Load Data Set
BANK_DATA = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Bank_Customer_Dataset.csv')

**Data Preview**

In [None]:
BANK_DATA.head()

# Data Understanding

**Data Size**

In [None]:
# Understand the size of data
BANK_DATA.size

**Data Shape**

In [None]:
# Understand the Data Shape
# Result = 10000 Rows, 14 Columns
BANK_DATA.shape

**Data Columns**

In [None]:
# Understand what are the columns available 
# Result = 14 Columns
BANK_DATA.columns

**Data Types**

In [None]:
# Understand the data type of each column
BANK_DATA.dtypes

**Missing Values**

In [None]:
# Check for missing values
# Result = No missing values
BANK_DATA.isnull().sum()

**Remove unnecessary columns**

In [None]:
# Irrelevant Data = Row number, ID, Name
# These columns are identifiers and are not useful for the analysis
BANK_DATA = BANK_DATA.drop(["RowNumber", "CustomerId", "Surname"], axis = 1)

**Separate Churn and Not Churn Customers**

In [None]:
Churn     = BANK_DATA[BANK_DATA["Exited"] == 1]
Not_Churn = BANK_DATA[BANK_DATA["Exited"] == 0]

In [None]:
target_col = ["Exited"]
cat_cols   = BANK_DATA.nunique()[BANK_DATA.nunique() < 6].keys().tolist()
cat_cols   = [x for x in cat_cols if x not in target_col]
num_cols   = [x for x in BANK_DATA.columns if x not in cat_cols + target_col]

**Data Set After Manipulation :**

In [None]:
# We can see that the first three columns are removed.
BANK_DATA.head()

In [None]:
# Data size is reduced to 110000 (10000 rows x 11 columns)
BANK_DATA.size

In [None]:
# New Data Shape 
BANK_DATA.shape

**Customer Churn Analysis ( Percentage ) :**

In [None]:
# 0 = Not Churn
# 1 = Churn
# Result = Not Churn (79.63%), Churn (20.37%)
Churn_Percentage = BANK_DATA['Exited'].value_counts(normalize = True) * 100
Churn_Percentage

**Customer Churn Analysis ( Bar Chart ) :**

In [None]:
# Bar Chart representation of the target label percentage.
total_len = len(BANK_DATA['Exited'])
sns.set()
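# Pass the column as a keyword argument; recent seaborn versions no longer accept it positionally
sns.countplot(x=BANK_DATA.Exited).set_title('Data Distribution')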
ax = plt.gca()
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width()/2.,
            height + 2,
            '{:.2f}%'.format(100 * (height/total_len)),
            fontsize=14, ha='center', va='bottom')
sns.set(font_scale=1.5)
ax.set_xlabel("Labels for exited column")
ax.set_ylabel("Numbers of records")
plt.show()

**Customer Churn Analysis ( Relationship with Categorical Variables ) :**

In [None]:
# Review the relationship between 'Exited' and each categorical variable
fig, axarr = plt.subplots(2, 3, figsize=(25, 12))
sns.countplot(x='Geography', hue='Exited', data=BANK_DATA, ax=axarr[0][0])
sns.countplot(x='Gender', hue='Exited', data=BANK_DATA, ax=axarr[0][1])
sns.countplot(x='NumOfProducts', hue='Exited', data=BANK_DATA, ax=axarr[0][2])
sns.countplot(x='HasCrCard', hue='Exited', data=BANK_DATA, ax=axarr[1][0])
sns.countplot(x='IsActiveMember', hue='Exited', data=BANK_DATA, ax=axarr[1][1])
# Only five variables are plotted, so hide the unused sixth panel
axarr[1][2].axis('off')
plt.show()

Based on the graphs above : 

1. **Geography** : Germany has the highest rate of customer churn, while Spain has the lowest.

2. **Gender** : Female customers have a higher churn rate than male customers.

3. **Number Of Products** : Customers with only one product make up the largest share of churned customers, followed by customers with 2, 3 and 4 products.

4. **Has Credit Card** : The majority of the customers who churned hold a credit card. Note that this is a count rather than a rate, so on its own it does not show that card holders are more likely to churn.

5. **Active Status** : Inactive members account for more churned customers than active members.
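The count plots above show how many customers churned in each group, which is not the same as how likely a customer in that group is to churn. The following is a minimal sketch of the difference, using a small made-up DataFrame (`toy` and its values are hypothetical and only borrow the column names `HasCrCard` and `Exited`); the same `groupby` pattern could be applied to `BANK_DATA`.

```python
import pandas as pd

# Hypothetical toy data: NOT the bank data set, only the same column names.
toy = pd.DataFrame({
    "HasCrCard": [1, 1, 1, 1, 1, 1, 0, 0],
    "Exited":    [1, 1, 0, 0, 0, 0, 1, 0],
})

# Number of churned customers per group (what a count plot displays).
churn_counts = toy.groupby("HasCrCard")["Exited"].sum()

# Churn rate per group: the mean of a 0/1 column is the share of ones.
churn_rates = toy.groupby("HasCrCard")["Exited"].mean()

print(churn_counts)
print(churn_rates)
```

In this toy example the card holders contain more churned customers in absolute terms, yet their churn rate is lower, because the group itself is larger.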



**Correlation ( Heatmap ) :**

In [None]:
# Visualization of the correlation matrix using heatmap plot
sns.set()
sns.set(font_scale = 1.25)
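# Keep only numeric columns: corr() cannot handle the text columns Geography and Gender
sns.heatmap(BANK_DATA[BANK_DATA.columns[:10]].select_dtypes(include = "number").corr(), annot = True, fmt = ".1f")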
plt.show()

# Data Preparation

In [None]:
# One-Hot encoding our categorical attributes
list_cat = ['Geography', 'Gender']
BANK_DATA = pd.get_dummies(BANK_DATA, columns = list_cat, prefix = list_cat)

**Data set preview :**

In [None]:
# Print the first five rows
BANK_DATA.head()

**Drop Exited Column:**

In [None]:
# Drop Exited column because it contains the answer/target/label for each row. 
X = BANK_DATA.drop('Exited', axis=1)
y = BANK_DATA.Exited

# Modeling

In [None]:
# Import different models 
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
# Scoring function
from sklearn.metrics import roc_auc_score, roc_curve

**Train-test Split :**

In [None]:
# Splitting the dataset in training and test set
# 75% of the data is used for training
# The remaining 25% is held out to evaluate the trained models on unseen data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
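Because the split above is random, the accuracy figures change from run to run, and the share of churned customers can differ slightly between the training and test sets. One possible variation, sketched here on made-up data (`X_toy`, `y_toy` and the seed `42` are assumptions for illustration), fixes `random_state` and passes `stratify` so both parts keep the same class balance.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 80 non-churned (0) and 20 churned (1) rows.
X_toy = pd.DataFrame({"feature": np.arange(100)})
y_toy = pd.Series([0] * 80 + [1] * 20, name="Exited")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.25,    # same 75/25 split as above
    stratify=y_toy,    # keep the churn share equal in both parts
    random_state=42,   # arbitrary fixed seed so the split is reproducible
)

# Share of churned customers in each part
print(y_tr.mean(), y_te.mean())
```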

**K-Nearest Neighbour ( KNN ) :**

In [None]:
# Initialization of the KNN
knMod = KNeighborsClassifier(n_neighbors = 5, weights = 'uniform', algorithm = 'auto', leaf_size = 30, p = 2,
                             metric = 'minkowski', metric_params = None)
# Fitting the model with training data 
knMod.fit(X_train, y_train)
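KNN classifies by distance (here Minkowski with `p = 2`, i.e. Euclidean), so columns with large values such as `Balance` or `EstimatedSalary` can dominate columns such as `Age`. The notebook fits KNN on unscaled features; the sketch below is not the method used here but shows, on synthetic data, how standardising the features first can change the result. All names and values in it are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical toy features: one informative column on a small scale and one
# uninformative column on a much larger scale (like a salary in currency units).
n = 400
informative = rng.normal(size=n)
noise_large = rng.normal(scale=50_000, size=n)
X_toy = np.column_stack([informative, noise_large])
y_toy = (informative > 0).astype(int)

X_fit, y_fit = X_toy[:300], y_toy[:300]
X_eval, y_eval = X_toy[300:], y_toy[300:]

# KNN on raw features: distances are driven almost entirely by the large column.
raw_knn = KNeighborsClassifier(n_neighbors=5).fit(X_fit, y_fit)

# KNN after standardising each column to zero mean and unit variance.
scaled_knn = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=5)
).fit(X_fit, y_fit)

raw_acc = raw_knn.score(X_eval, y_eval)
scaled_acc = scaled_knn.score(X_eval, y_eval)
print(raw_acc, scaled_acc)
```

Putting the scaler inside a pipeline means it is fitted on the training rows only, so no information from the evaluation rows leaks into the scaling.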

***Logistic Regression ( LR ) :***

In [None]:
# Initialization of the Logistic Regression
# multi_class is omitted: the target is binary, and the argument is deprecated in recent scikit-learn versions
lrMod = LogisticRegression(penalty = 'l2', dual = False, tol = 0.0001, C = 1.0, fit_intercept = True,
                            intercept_scaling = 1, class_weight = None, 
                            random_state = None, solver = 'liblinear', max_iter = 100,
                            verbose = 2)
# Fitting the model with training data 
lrMod.fit(X_train, y_train)

**Random Forest ( RF ) :**

In [None]:
# Initialization of the Random Forest model
rfMod = RandomForestClassifier(n_estimators=10, criterion='gini')
# Fitting the model with training data 
rfMod.fit(X_train, y_train)

**Model Accuracy Computation :**

**KNN :**

In [None]:
# Compute the model accuracy on the given test data and labels
knn_acc = knMod.score(X_test, y_test)
# Predicted probability of churn (class 1) for each test customer
knn_proba = knMod.predict_proba(X_test)[:, 1]

**LR :**

In [None]:
# Compute the model accuracy on the given test data and labels
lr_acc = lrMod.score(X_test, y_test)
# Predicted probability of churn (class 1) for each test customer
lr_proba = lrMod.predict_proba(X_test)[:, 1]

**RF :**

In [None]:
# Compute the model accuracy on the given test data and labels
rf_acc = rfMod.score(X_test, y_test)
# Predicted probability of churn (class 1) for each test customer
rf_proba = rfMod.predict_proba(X_test)[:, 1]

# Evaluation

In [None]:
models = ['KNN', 'Logistic Regression', 'Random Forest']
accuracy = [knn_acc, lr_acc, rf_acc]

d = {'Accuracy': accuracy}
df_metrics = pd.DataFrame(d, index = models)
df_metrics

Based on the result above, **RF** has the highest accuracy of the three models, followed by **LR**, while **KNN** has the lowest accuracy.
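Accuracy should be read against the class balance: since about 79.63% of customers did not churn, a model that always predicts "not churn" would already score close to 80%. The notebook imports `roc_auc_score` but reports accuracy only; the sketch below, on synthetic data generated with `make_classification` (not the bank data, and with hypothetical variable names), illustrates how a majority-class baseline and ROC AUC could be placed next to accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data with roughly 80% negatives and 20% positives.
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=0
)

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

baseline_acc = baseline.score(X_te, y_te)
forest_acc = forest.score(X_te, y_te)

# ROC AUC uses the predicted probability of the positive class, not the labels.
baseline_auc = roc_auc_score(y_te, baseline.predict_proba(X_te)[:, 1])
forest_auc = roc_auc_score(y_te, forest.predict_proba(X_te)[:, 1])

print("accuracy:", baseline_acc, forest_acc)
print("ROC AUC :", baseline_auc, forest_auc)
```

The baseline reaches a high accuracy without learning anything, while its ROC AUC stays at 0.5, which is why the two metrics are worth reporting together on an imbalanced target.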

# Conclusion 

In conclusion, based on the results above, we have achieved our objectives. During Data Understanding, we found that customers from Germany churn the most. Female customers are more likely to churn than male customers. Customers who use only one product make up the largest group of churned customers. In addition, most of the customers who churned hold a credit card. Lastly, inactive members churn more than active members.

Of the prediction models tested above, Random Forest ( RF ) produced the highest accuracy, at about 85%. RF is therefore the best of the three models, by accuracy, for predicting whether a customer will churn.