# Introduction to Neural Networks: Bank Churn Prediction Project 7
## Paige Singleton
## November 2022
## Problem Statement
"Given a Bank customer, build a neural network-based classifier that can determine whether they will leave or not in the next 6 months."
## Data Dictionary
- CustomerId: Unique ID which is assigned to each customer
- Surname: Last name of the customer 
- CreditScore: It defines the credit history of the customer.  
- Geography: A customer’s location    
- Gender: It defines the Gender of the customer   
- Age: Age of the customer     
- Tenure: Number of years for which the customer has been with the bank
- NumOfProducts: It refers to the number of products that a customer has purchased through the bank.
- Balance: Account balance
- HasCrCard: A categorical variable that indicates whether the customer has a credit card.
- EstimatedSalary: Estimated salary
- IsActiveMember: A categorical variable that indicates whether the customer is an active member of the bank (active meaning the customer uses bank products regularly, makes transactions, etc.)
- Exited: A categorical variable that indicates whether the customer left the bank within six months. It takes two values:
    - 0 = No (customer did not leave the bank)
    - 1 = Yes (customer left the bank)

## Section 1: Exploratory Data Analysis
- Define problem statement
- Read the dataset 
- Print the overview of the data (statistical summary, shape, info, etc) 
- Eliminate the unique features from the dataset with proper reasoning 
- Univariate analysis 
- Bivariate analysis

Note: This section reuses code from previous projects.

In [1]:
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as pyplot
import matplotlib.pyplot as plt
import seaborn as sns
# Library to split data
from sklearn.model_selection import train_test_split
# Library to standardize the data
from sklearn.preprocessing import StandardScaler
#To import different metrics 
from sklearn.metrics import explained_variance_score, mean_squared_error, r2_score, mean_absolute_error
# Importing callbacks API
from keras import callbacks
# Importing tensorflow library
import tensorflow as tf
# importing different functions to build models
from tensorflow.keras.layers import Dense, Dropout,InputLayer
from tensorflow.keras.models import Sequential
# Importing Batch Normalization
from keras.layers import BatchNormalization
# Importing backend
from tensorflow.keras import backend
# Importing shuffle
from random import shuffle
from keras.callbacks import ModelCheckpoint
# Importing optimizers
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.optimizers import RMSprop
# Library to avoid the warnings
import warnings
warnings.filterwarnings("ignore")

In [2]:
#Read CSV
data = pd.read_csv("Churn.csv")
data.head()

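| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |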


In [3]:
#Make a back-up copy of original dataset
data_copy=data.copy()

In [4]:
# Display 10 random rows
np.random.seed(1)
data.sample(n=10)

| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9953 | 9954 | 15655952 | Burke | 550 | France | Male | 47 | 2 | 0.0 | 2 | 1 | 1 | 97057.28 | 0 |
| 3850 | 3851 | 15775293 | Stephenson | 680 | France | Male | 34 | 3 | 143292.95 | 1 | 1 | 0 | 66526.01 | 0 |
| 4962 | 4963 | 15665088 | Gordon | 531 | France | Female | 42 | 2 | 0.0 | 2 | 0 | 1 | 90537.47 | 0 |
| 3886 | 3887 | 15720941 | Tien | 710 | Germany | Male | 34 | 8 | 147833.3 | 2 | 0 | 1 | 1561.58 | 0 |
| 5437 | 5438 | 15733476 | Gonzalez | 543 | Germany | Male | 30 | 6 | 73481.05 | 1 | 1 | 1 | 176692.65 | 0 |
| 8517 | 8518 | 15671800 | Robinson | 688 | France | Male | 20 | 8 | 137624.4 | 2 | 1 | 1 | 197582.79 | 0 |
| 2041 | 2042 | 15709846 | Yeh | 840 | France | Female | 39 | 1 | 94968.97 | 1 | 1 | 0 | 84487.62 | 0 |
| 1989 | 1990 | 15622454 | Zaitsev | 695 | Spain | Male | 28 | 0 | 96020.86 | 1 | 1 | 1 | 57992.49 | 0 |
| 1933 | 1934 | 15815560 | Bogle | 666 | Germany | Male | 74 | 7 | 105102.5 | 1 | 1 | 1 | 46172.47 | 0 |
| 9984 | 9985 | 15696175 | Echezonachukwu | 602 | Germany | Male | 35 | 7 | 90602.42 | 2 | 1 | 1 | 51695.41 | 0 |


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB


In [6]:
# Number of rows and columns
print(f'There are {data.shape[0]} rows and {data.shape[1]} columns.') 

There are 10000 rows and 14 columns.


In [7]:
# Check for null values in dataset
data.isnull().sum()

RowNumber          0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

In [9]:
# Calculate Descriptive stats
data.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| RowNumber | 10000.0 | 5000.5 | 2886.89568 | 1.0 | 2500.75 | 5000.5 | 7500.25 | 10000.0 |
| CustomerId | 10000.0 | 15690940.0 | 71936.186123 | 15565701.0 | 15628528.25 | 15690740.0 | 15753230.0 | 15815690.0 |
| CreditScore | 10000.0 | 650.5288 | 96.653299 | 350.0 | 584.0 | 652.0 | 718.0 | 850.0 |
| Age | 10000.0 | 38.9218 | 10.487806 | 18.0 | 32.0 | 37.0 | 44.0 | 92.0 |
| Tenure | 10000.0 | 5.0128 | 2.892174 | 0.0 | 3.0 | 5.0 | 7.0 | 10.0 |
| Balance | 10000.0 | 76485.89 | 62397.405202 | 0.0 | 0.0 | 97198.54 | 127644.2 | 250898.09 |
| NumOfProducts | 10000.0 | 1.5302 | 0.581654 | 1.0 | 1.0 | 1.0 | 2.0 | 4.0 |
| HasCrCard | 10000.0 | 0.7055 | 0.45584 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| IsActiveMember | 10000.0 | 0.5151 | 0.499797 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| EstimatedSalary | 10000.0 | 100090.2 | 57510.492818 | 11.58 | 51002.11 | 100193.9 | 149388.2 | 199992.48 |


In [10]:
# Understand the number of unique values in each column
data.nunique()

RowNumber          10000
CustomerId         10000
Surname             2932
CreditScore          460
Geography              3
Gender                 2
Age                   70
Tenure                11
Balance             6382
NumOfProducts          4
HasCrCard              2
IsActiveMember         2
EstimatedSalary     9999
Exited                 2
dtype: int64
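
The counts above show that RowNumber and CustomerId take a distinct value on every row, and Surname has 2932 distinct values, so these columns behave like identifiers rather than predictors. A minimal sketch of removing them follows. It assumes a small toy frame `sample` copied from the first rows printed earlier, and the list `id_cols` is an assumption based on the `nunique()` output, not a step taken from the notebook itself.

```python
import pandas as pd

# Toy frame copied from the first rows shown by data.head(); in the notebook
# the same call would be applied to `data` instead of `sample`.
sample = pd.DataFrame({
    "RowNumber": [1, 2, 3],
    "CustomerId": [15634602, 15647311, 15619304],
    "Surname": ["Hargrave", "Hill", "Onio"],
    "CreditScore": [619, 608, 502],
    "Exited": [1, 0, 1],
})

# Identifier-like columns (assumed list, based on the nunique() output above).
id_cols = ["RowNumber", "CustomerId", "Surname"]

# errors="ignore" keeps the call safe if a column was already removed.
reduced = sample.drop(columns=id_cols, errors="ignore")
print(reduced.columns.tolist())
```

EstimatedSalary is also nearly unique (9999 distinct values), but it is a continuous measurement rather than an identifier, so it is kept in this sketch.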

In [11]:
#Check for duplicates
data[data.duplicated()].count()

RowNumber          0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

In [12]:
#Understand the number of each value for the following categorical columns: 
cat_cols=['Geography','Gender','Tenure','NumOfProducts','HasCrCard','IsActiveMember','Exited']
for col in cat_cols:
    print(data[col].value_counts())
    print("_________________________________")

France     5014
Germany    2509
Spain      2477
Name: Geography, dtype: int64
_________________________________
Male      5457
Female    4543
Name: Gender, dtype: int64
_________________________________
2     1048
1     1035
7     1028
8     1025
5     1012
3     1009
4      989
9      984
6      967
10     490
0      413
Name: Tenure, dtype: int64
_________________________________
1    5084
2    4590
3     266
4      60
Name: NumOfProducts, dtype: int64
_________________________________
1    7055
0    2945
Name: HasCrCard, dtype: int64
_________________________________
1    5151
0    4849
Name: IsActiveMember, dtype: int64
_________________________________
0    7963
1    2037
Name: Exited, dtype: int64
_________________________________


In [16]:
#Understand the percentage share of each value for the following categorical columns: 
cat_cols=['Geography','Gender','Tenure','NumOfProducts','HasCrCard','IsActiveMember','Exited']
for col in cat_cols:
    print(data[col].value_counts(normalize=True)*100)
    print("_________________________________")

France     50.14
Germany    25.09
Spain      24.77
Name: Geography, dtype: float64
_________________________________
Male      54.57
Female    45.43
Name: Gender, dtype: float64
_________________________________
2     10.48
1     10.35
7     10.28
8     10.25
5     10.12
3     10.09
4      9.89
9      9.84
6      9.67
10     4.90
0      4.13
Name: Tenure, dtype: float64
_________________________________
1    50.84
2    45.90
3     2.66
4     0.60
Name: NumOfProducts, dtype: float64
_________________________________
1    70.55
0    29.45
Name: HasCrCard, dtype: float64
_________________________________
1    51.51
0    48.49
Name: IsActiveMember, dtype: float64
_________________________________
0    79.63
1    20.37
Name: Exited, dtype: float64
_________________________________
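
For the bivariate step, one simple option is to compare the churn rate across the levels of a categorical column. The sketch below is only an illustration: it assumes a toy frame `toy` built from the five rows shown by `data.head()`, so the percentages it produces say nothing about the full dataset.

```python
import pandas as pd

# Geography and Exited values of the five rows printed by data.head().
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "France", "France", "Spain"],
    "Exited": [1, 0, 1, 0, 0],
})

# The mean of a 0/1 target within each group is that group's churn rate.
churn_rate = toy.groupby("Geography")["Exited"].mean().mul(100).round(2)
print(churn_rate)
```

Replacing `toy` with `data` and looping over the columns in `cat_cols` (excluding `Exited`) would give the same view for every categorical feature.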


## Section 2: EDA Insights 
- Identify key meaningful observations on individual variables and the relationship between variables.

## Section 3: Data Pre-processing
- Split the target variable and predictors 
- Split the data into train and test 
- Categorical Encoding 
- Normalize the data
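
The four steps above can be sketched as follows. This is a hedged illustration, not the notebook's own code: the frame `toy` is synthetic, the helper `encode` and the `CATEGORIES` mapping are hypothetical names, and the 80/20 stratified split is an assumed choice. The scaler is fitted on the training split only so that no test statistics leak into training.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn data (assumption: real code would use `data`).
rng = np.random.default_rng(1)
n = 200
toy = pd.DataFrame({
    "CreditScore": rng.integers(350, 851, n),
    "Geography": rng.choice(["France", "Germany", "Spain"], n),
    "Gender": rng.choice(["Male", "Female"], n),
    "Age": rng.integers(18, 93, n),
    "Balance": rng.uniform(0, 250000, n),
    "Exited": rng.choice([0, 1], n, p=[0.8, 0.2]),
})

# 1. Split the target variable and predictors.
X = toy.drop(columns="Exited")
y = toy["Exited"]

# 2. Split into train and test, keeping the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

# 3. One-hot encode with fixed category lists so both splits get the same columns.
CATEGORIES = {"Geography": ["France", "Germany", "Spain"], "Gender": ["Female", "Male"]}

def encode(df):
    df = df.copy()
    for col, cats in CATEGORIES.items():
        df[col] = pd.Categorical(df[col], categories=cats)
    return pd.get_dummies(df, columns=list(CATEGORIES), drop_first=True, dtype=int)

X_train, X_test = encode(X_train), encode(X_test)

# 4. Standardize numeric columns: fit on train, apply to both splits.
num_cols = ["CreditScore", "Age", "Balance"]
X_train = X_train.astype({c: "float64" for c in num_cols})
X_test = X_test.astype({c: "float64" for c in num_cols})
scaler = StandardScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])
```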

## Section 4: Model Building
- Comment on which metric to use and why 
- Build the Neural Network model
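
One consideration for the metric choice is the class imbalance seen in Section 1 (7963 customers stayed, 2037 left). The sketch below, which assumes a trivial "nobody churns" predictor purely for illustration, shows why accuracy alone can mislead on this data and why a metric such as recall on the churn class is often examined as well. It does not state which metric the project ultimately adopts.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Labels with the same class counts as the Exited column in the dataset.
y_true = np.array([0] * 7963 + [1] * 2037)

# Hypothetical baseline that predicts "did not leave" for every customer.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred, zero_division=0)
print(f"accuracy={acc:.4f}, recall={rec:.4f}")
```

The baseline reaches about 80% accuracy while identifying none of the customers who leave, so any trained model should be judged against that floor.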

## Section 5: Model Performance Evaluation
- Find the optimal threshold using the ROC curve 
- Comment on model performance 
- Can model performance be improved? Check and comment 
- Build another model to implement these improvements 
- Include all the models that were trained on the way to the final one
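
A common way to pick a threshold from the ROC curve is to maximize the difference between the true positive rate and the false positive rate (Youden's J statistic). The sketch below assumes that criterion, which the outline above does not specify, and uses synthetic scores in place of a trained network's predicted probabilities.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Synthetic labels and scores (assumption: real code would use y_test and
# the model's predicted probabilities for the positive class).
rng = np.random.default_rng(1)
y_true = rng.choice([0, 1], size=1000, p=[0.8, 0.2])
scores = np.clip(rng.normal(0.3 + 0.3 * y_true, 0.15), 0, 1)

# roc_curve returns one (fpr, tpr) point per candidate threshold.
fpr, tpr, thresholds = roc_curve(y_true, scores)

# Youden's J: choose the threshold where tpr - fpr is largest.
best = np.argmax(tpr - fpr)
optimal_threshold = thresholds[best]

# Convert scores to class labels with the chosen threshold.
y_pred = (scores >= optimal_threshold).astype(int)
auc = roc_auc_score(y_true, scores)
print(f"AUC={auc:.3f}, threshold={optimal_threshold:.3f}")
```

Other criteria, such as a minimum recall target set by the business, would lead to a different threshold from the same curve.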

## Section 6: Actionable Insights and Recommendations
- Key takeaways for the business