# Artificial Neural Networks for Classification

## Introduction

The goal of this case study is to build a demographic segmentation model that tells the bank which of its customers are at the highest risk of leaving.

To achieve this, we will implement an Artificial Neural Network (ANN) classification model with TensorFlow and Python to predict customer churn, that is, whether or not each customer stays with the bank.

## Importing the libraries

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf

In [2]:
#check the tensorflow version that we are using
tf.__version__

'2.1.0'

## Part 1 - Data preprocessing 

### 1.1 Importing the dataset

Let’s load the dataset and visualize the information.

In [3]:
#loading the dataset
dataset = pd.read_csv('Churn_Modelling.csv')
dataset.head()

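|   | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |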


In [4]:
#check the datatype of the dataset
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB


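The independent variables (the features) are the columns from CreditScore to EstimatedSalary, which we store in the matrix X. The dependent variable is the last column, Exited, which indicates whether the customer left the bank (1) or stayed (0); we store it in the vector y.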

In [5]:
#get the independent variables and the dependent variable
X = dataset.iloc[:, 3:-1].values #we do not need the first 3 columns (identifiers with no impact on the dependent variable)
y = dataset.iloc[:, -1].values

In [6]:
print(X)

[[619 'France' 'Female' ... 1 1 101348.88]
 [608 'Spain' 'Female' ... 0 1 112542.58]
 [502 'France' 'Female' ... 1 0 113931.57]
 ...
 [709 'France' 'Female' ... 0 1 42085.58]
 [772 'Germany' 'Male' ... 1 0 92888.52]
 [792 'France' 'Female' ... 1 0 38190.78]]


In [7]:
print(y)

[1 0 1 ... 1 1 0]


### 1.2 Taking care of missing data

In [8]:
#check if there are null values in the dataset
dataset.isnull().sum().sum()

0

We do not have to take care of any missing data.

### 1.3 Encoding categorical data

#### A) Encoding the "Gender" column

__Important:__ The column "Gender" has index 2 in the matrix of features X, not in the original dataset (in X, index 0 is the column "CreditScore" and index 1 is the column "Geography").

In [9]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
X[:, 2] = le.fit_transform(X[:, 2])

In [10]:
print(X)

[[619 'France' 0 ... 1 1 101348.88]
 [608 'Spain' 0 ... 0 1 112542.58]
 [502 'France' 0 ... 1 0 113931.57]
 ...
 [709 'France' 0 ... 0 1 42085.58]
 [772 'Germany' 1 ... 1 0 92888.52]
 [792 'France' 0 ... 1 0 38190.78]]


#### B) One Hot Encoding the "Geography" column

In [11]:
#encoding the Geography column by creating dummy variables

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [1])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

In [12]:
print(X)

[[1.0 0.0 0.0 ... 1 1 101348.88]
 [0.0 0.0 1.0 ... 0 1 112542.58]
 [1.0 0.0 0.0 ... 1 0 113931.57]
 ...
 [1.0 0.0 0.0 ... 0 1 42085.58]
 [0.0 1.0 0.0 ... 1 0 92888.52]
 [1.0 0.0 0.0 ... 1 0 38190.78]]
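
The first three columns of the output above are the dummy variables for Geography. As a rough sketch of how they are laid out, the NumPy snippet below one-hot encodes a small made-up array, assuming the categories are taken in sorted order (France, Germany, Spain), which is consistent with the output shown.

```python
import numpy as np

geography = np.array(["France", "Spain", "France", "Germany"])  # made-up sample

# Unique categories in sorted order: ['France' 'Germany' 'Spain']
categories = np.unique(geography)

# Compare every value with every category: one indicator column per category
one_hot = (geography[:, None] == categories[None, :]).astype(float)

print(one_hot)
# France  -> [1. 0. 0.]
# Germany -> [0. 1. 0.]
# Spain   -> [0. 0. 1.]
```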


### 1.4 Splitting the dataset into the Training set and Test set

In [13]:
#getting the training set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
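
With test_size = 0.2 and 10,000 rows, the split leaves 8,000 rows for training and 2,000 for testing. The sketch below illustrates the idea of a shuffled split with NumPy; it is not scikit-learn's implementation, and the seed and variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)       # fixed seed so the split is reproducible

n_rows = 10_000
test_size = 0.2

shuffled = rng.permutation(n_rows)   # row indices in random order
n_test = round(n_rows * test_size)   # number of rows held out for testing

test_idx = shuffled[:n_test]
train_idx = shuffled[n_test:]

print(len(train_idx), len(test_idx))  # 8000 2000
```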

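### 1.5 Feature Scaling

__Feature scaling is essential when training neural networks.__

Gradient-based training behaves poorly when features have very different ranges (here, Balance is in the tens of thousands while HasCrCard is 0 or 1), so we scale all the variables of our dataset. Note that the scaler is fitted on the training set only and then applied to the test set, so that no information from the test set leaks into training.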

In [14]:
#applying feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
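
The following sketch shows what standardization does, using NumPy on made-up numbers: each column is shifted by the training mean and divided by the training standard deviation, and the same training statistics are reused for the test rows. The arrays and names are assumptions for illustration only.

```python
import numpy as np

# Made-up data: two features on very different scales
train = np.array([[600.0, 40.0], [700.0, 50.0], [800.0, 30.0]])
test = np.array([[650.0, 45.0]])

# Statistics are computed on the training set only
mean = train.mean(axis=0)
std = train.std(axis=0)      # population standard deviation (ddof=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std   # reuse the training mean and std

print(train_scaled.mean(axis=0))    # approximately [0. 0.]
print(train_scaled.std(axis=0))     # approximately [1. 1.]
print(test_scaled)
```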

## Part 2 - Building the ANN

### 2.1 Initializing the ANN

To build an ANN, we first create the object that will be the ANN itself. This object is an instance of the “Sequential” class, because an ANN is a sequence of layers: it starts at the input layer, continues through fully connected hidden layers, and ends at the output layer.

The “Sequential” class comes from the Keras module of TensorFlow 2 (tf.keras).

In [15]:
#create the object for the ANN
ann = tf.keras.models.Sequential()

### 2.2 Adding the input layer and the first hidden layer

To add a fully connected layer to an ANN at any stage, we use the “Dense” class. We call the “add” method of our “Sequential” object and pass it a new “Dense” object. For the hidden layer we choose 6 neurons. There is no default for this value: it is a hyperparameter that we have to pick ourselves.

__Important:__ We do not have to specify the number of input features, because Keras infers it the first time the model sees data. Once we pass the matrix of features to the “fit” method, the first layer is built for its 12 columns (3 dummy variables for Geography plus the 9 remaining features).

On the other hand, we use the “rectifier” function (ReLU) as the activation function, which introduces non-linearity between the input layer and the first hidden layer.

In [16]:
#add the input layer and the first hidden layer (fully connected)
ann.add(tf.keras.layers.Dense(units=6, activation='relu'))
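
To make the computation of this layer concrete, here is a minimal NumPy sketch of a fully connected layer with 6 units and the rectifier activation, applied to one observation with 12 features. The random weights and the variable names are assumptions for the example; Keras initializes and stores its own weights.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, n_units = 12, 6
x = rng.normal(size=(1, n_features))        # one scaled observation (made up)
W = rng.normal(size=(n_features, n_units))  # one weight per input/neuron pair
b = np.zeros(n_units)                       # one bias per neuron

z = x @ W + b              # weighted sum for each of the 6 neurons
a = np.maximum(z, 0.0)     # rectifier: negative values become 0

n_params = W.size + b.size
print(a.shape)             # (1, 6)
print(n_params)            # 78 trainable parameters in this layer
```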

### 2.3 Adding the second hidden layer 

For the second hidden layer, we will use the same code as the first hidden layer.

In [17]:
#add the second hidden layer
ann.add(tf.keras.layers.Dense(units=6, activation='relu'))

### 2.4 Adding the output layer

We now add the output layer, fully connected to the second hidden layer, using the “Dense” class again.

We want to predict a binary variable (0 or 1), so a single neuron is enough.

Because this is the output layer, we replace the “rectifier” activation function with the “sigmoid” activation function. This gives us:
* The predicted probability that each customer leaves the bank.
* The prediction of whether each customer leaves or stays, obtained by comparing that probability with a threshold such as 0.5.


In [18]:
#add the output layer
ann.add(tf.keras.layers.Dense(units=1, activation='sigmoid'))
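
The sigmoid function squashes any real number into the interval (0, 1), which is why its output can be read as a probability. A small sketch in plain Python, with made-up input values:

```python
import math

def sigmoid(z):
    """Map a real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

for z in (-2.0, 0.0, 2.0):
    p = sigmoid(z)
    # A probability above 0.5 is read as "the customer leaves"
    print(z, round(p, 3), p > 0.5)
```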

## Part 3 - Training the ANN

### 3.1 Compiling the ANN

We use the “compile” method to compile the ANN. It takes three main parameters:
* __Optimizer__ =>> the algorithm that updates the weights through stochastic gradient descent in order to reduce the loss in the next iteration. A very common choice is the “adam” optimizer.
* __Loss function__ =>> it measures the difference between the predictions and the real results. For binary classification we use the “binary_crossentropy” loss function; when predicting more than two categories we use the “categorical_crossentropy” loss function instead.
* __Metrics__ =>> the list of metrics used to evaluate the model during training; here we use “accuracy”.


In [19]:
#compiling the ANN
ann.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
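
As an illustration of what the binary cross-entropy loss measures, the sketch below computes it by hand in plain Python for made-up labels and probabilities. The function name and the clipping constant are assumptions for the example, not the Keras implementation.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y_true = [1, 0, 1, 0]
good_probs = [0.9, 0.1, 0.8, 0.2]   # confident and correct
bad_probs = [0.1, 0.9, 0.2, 0.8]    # confident and wrong

loss_good = binary_crossentropy(y_true, good_probs)
loss_bad = binary_crossentropy(y_true, bad_probs)
print(round(loss_good, 4), round(loss_bad, 4))
```

Predictions that are confident and correct give a small loss, while confident mistakes are penalized heavily.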

### 3.2 Training the ANN on the Training set

We use the “fit” method to train our ANN. It takes two main parameters:

* __Number of epochs__ => An epoch is one full pass over the training set. Forward propagation and backpropagation are repeated over many epochs, and the loss is usually reduced a little in each one. Here we use 100 epochs.
* __Batch size__ => Instead of updating the weights after every single observation, we propagate the observations in batches and update the weights once per batch. We use 32, which is also the Keras default.

Backpropagation computes how much each weight contributed to the error, so that the weights of the connections between the neurons can be adjusted and the loss is lower the next time the ANN makes a prediction.

The adjustment itself is done with “gradient descent”: the weights are changed in small steps in the direction opposite to the gradient (the derivative) of the loss function, which leads towards a minimum of the loss. This calculation is done batch by batch and repeated over the following iterations (epochs) of the data that we send through the ANN.
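
As a rough illustration of the mini-batch gradient descent loop described above, the sketch below trains a single sigmoid neuron on made-up data with NumPy. It uses plain gradient descent rather than Adam, and the data, learning rate, and variable names are assumptions chosen for the example. With 8,000 training rows and a batch size of 32, the model in this notebook performs 250 weight updates per epoch; the toy data here gives 8.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 256 observations, 3 features, labels from a known linear rule
X = rng.normal(size=(256, 3))
y = (X @ np.array([1.5, -2.0, 0.5]) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_fn(w, b):
    """Binary cross-entropy over the whole toy dataset."""
    p = np.clip(sigmoid(X @ w + b), 1e-7, 1 - 1e-7)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

w, b = np.zeros(3), 0.0
learning_rate, batch_size, epochs = 0.1, 32, 20
history = [loss_fn(w, b)]                 # loss before any training

for epoch in range(epochs):
    order = rng.permutation(len(X))       # reshuffle the rows every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        error = sigmoid(X[idx] @ w + b) - y[idx]   # gradient of the loss w.r.t. z
        w -= learning_rate * (X[idx].T @ error) / len(idx)
        b -= learning_rate * error.mean()
    history.append(loss_fn(w, b))         # loss after this epoch

steps_per_epoch = int(np.ceil(len(X) / batch_size))
print(steps_per_epoch)                    # 8 weight updates per epoch
print(history[0], history[-1])            # the loss goes down over the epochs
```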


In [20]:
#training the ANN
ann.fit(X_train, y_train, batch_size = 32, epochs = 100)

Train on 8000 samples
Epoch 1/100
...

<tensorflow.python.keras.callbacks.History at 0x24eb5b64948>

## Part 4 - Making the predictions and evaluating the model

### 4.1 Predicting the result of a single observation 

Now we will predict the outcome of a single observation, that is, a single customer.

We use our ANN model to predict whether the customer with the following information will leave the bank:

Geography: France

Credit Score: 600

Gender: Male

Age: 40 years old

Tenure: 3 years

Balance: $ 60000

Number of Products: 2

Does this customer have a credit card? Yes

Is this customer an Active Member: Yes

Estimated Salary: $ 50000

Question: Should we say goodbye to that customer?

In [21]:
#predicting the result of a single observation
#predicted probability
print(ann.predict(sc.transform([[1, 0, 0, 600, 1, 40, 3, 60000, 2, 1, 1, 50000]])))

[[0.12957944]]


In [22]:
#predicting the result of a single observation
#predicted class (probability compared with the 0.5 threshold)
print(ann.predict(sc.transform([[1, 0, 0, 600, 1, 40, 3, 60000, 2, 1, 1, 50000]])) > 0.5)

[[False]]


Solution =>> The predicted probability of leaving is about 0.13, which is below the 0.5 threshold, so our ANN model predicts that this customer stays with the bank!
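
The order of the 12 values passed to the model matters: first the three Geography dummy variables, then the remaining features in their original column order. The helper below is a hypothetical sketch (build_features is not part of the notebook) that assembles the row from readable inputs, assuming the encodings produced earlier (France, Germany, Spain for the dummy variables; Female = 0, Male = 1).

```python
def build_features(geography, credit_score, gender, age, tenure, balance,
                   num_products, has_card, is_active, salary):
    """Hypothetical helper: return the 12 unscaled features in model order."""
    geo_order = ["France", "Germany", "Spain"]   # dummy-variable column order
    gender_code = {"Female": 0, "Male": 1}       # label-encoded Gender
    one_hot = [1 if geography == g else 0 for g in geo_order]
    return one_hot + [credit_score, gender_code[gender], age, tenure,
                      balance, num_products, has_card, is_active, salary]

row = build_features("France", 600, "Male", 40, 3, 60000, 2, 1, 1, 50000)
print(row)   # [1, 0, 0, 600, 1, 40, 3, 60000, 2, 1, 1, 50000]
```

The resulting row still has to be wrapped in a 2D array and scaled with the fitted scaler before it is passed to the “predict” method, as in the cells above.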

### 4.2 Predicting the test results

Now, let’s check how the model predicts the test results.

We call the “predict” method on the test set and store the predictions in a new vector called y_pred. Then we compare y_pred with the real results (y_test) side by side.

In [23]:
#predicting the absolute results of the test set
y_pred = ann.predict(X_test)
y_pred = (y_pred > 0.5) #show the results in a non-probability form
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

[[0 0]
 [0 1]
 [0 0]
 ...
 [0 0]
 [0 0]
 [0 0]]


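Comments =>> In the rows shown, the predictions (left column) agree with the real results (right column) in five of six cases. A handful of rows is not enough to judge the model, so we need a proper metric.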

### 4.3 Building the Confusion Matrix

A more reliable way to check the performance of the model is to build a confusion matrix and compute the accuracy of the ANN on the test set.

So, let's do it!

In [24]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[1541   54]
 [ 258  147]]


0.844

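About 84% of the test observations (1,688 of 2,000) were predicted correctly.

The model is not perfect, but it classifies most of the test set correctly. Keep in mind that accuracy alone can be flattering here: 1,595 of the 2,000 test customers stayed, and the model missed 258 of the 405 customers who actually left.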
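
The numbers in the confusion matrix can be turned into a few complementary metrics. The sketch below recomputes the accuracy by hand and adds precision, recall, and a majority-class baseline, using the matrix printed above (rows are the real classes, columns the predicted classes). The variable names are illustrative.

```python
# Confusion matrix from the test set: rows = real class, columns = predicted class
cm = [[1541, 54],
      [258, 147]]
(tn, fp), (fn, tp) = cm

total = tn + fp + fn + tp
accuracy = (tn + tp) / total    # share of all predictions that are correct
precision = tp / (tp + fp)      # of the predicted leavers, how many really left
recall = tp / (tp + fn)         # of the real leavers, how many were detected
baseline = (tn + fp) / total    # accuracy of always predicting "stays"

print(f"accuracy:  {accuracy:.3f}")    # 0.844
print(f"precision: {precision:.3f}")   # 0.731
print(f"recall:    {recall:.3f}")      # 0.363
print(f"baseline:  {baseline:.3f}")    # 0.797
```

The accuracy is only a few points above the baseline, and the recall shows that roughly a third of the customers who left were detected, which is worth keeping in mind when the goal is to find customers at risk of leaving.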