# Building an ANN with TensorFlow 2 for Classification
###            (Predicting whether a customer will leave the bank or not)
We build a fully connected neural network, i.e. one made only of fully connected (dense) layers, with no convolutional or recurrent layers.
- Here we will have an input vector containing different features, and we'll predict an outcome that is a binary variable.
  - An ANN can be used for both classification and regression problems.
    - Here we use it for a classification problem.
- **Dataset**:
  - We'll be using a dataset from a bank that collected some information about its customers.
  - The bank observed, over a certain period of time, whether each customer left or not, and recorded the outcome in the Exited 
    column.<br>
  - The bank gathered all these features to understand the relationship between the features and whether a customer 
    leaves or not. 
     - i.e. the bank wants to understand why customers leave, and once it manages to build a model that predicts whether a 
       customer will leave or not, it will deploy the model on new customers. 
       - Every customer the model predicts will leave can then be given a special offer 
         so that they stay with the bank.
         - All this is called **Customer Retention**, i.e. preventing customers from leaving the bank.
  - **Features**
    - **The features are**:
      - CustomerId (dropped before training, see Part 1)
      - CreditScore
      - Geography
      - Gender
      - Age
      - Tenure
      - Balance
      - NumOfProducts
      - HasCrCard
      - IsActiveMember
      - EstimatedSalary
  - **Exited**:
    - Exited is the dependent variable, which tells us whether the customer left the bank or not.
      - 0: Customer did not leave the bank
      - 1: Customer left the bank
 

# Part 0 -  libraries

### &nbsp; 1 Importing the libraries

In [142]:
import numpy as np
import pandas as pd
# import matplotlib.pyplot as plt
import tensorflow as tf

In [143]:
# Printing the tensorflow version
print(tf.__version__)

2.21.0-dev20250721


# Part 1 - Data Preprocessing

### &nbsp; 1. Importing the dataset

In [144]:
dataset = pd.read_csv('Churn_Modelling.csv')

# The first, second and third columns (index, CustomerId, Surname) will not help our prediction, so we drop them.
X = dataset.iloc[:, 3 : -1].values       # all rows, columns from index 3 up to (but excluding) the last one: the features
Y = dataset.iloc[:, -1].values           # all rows, last column only: the dependent variable (Exited)

In [145]:
# Printing the matrix of features X,
  # i.e. all the features, without the dependent variable
print(X)


[[619 'France' 'Female' ... 1 1 101348.88]
 [608 'Spain' 'Female' ... 0 1 112542.58]
 [502 'France' 'Female' ... 1 0 113931.57]
 ...
 [709 'France' 'Female' ... 0 1 42085.58]
 [772 'Germany' 'Male' ... 1 0 92888.52]
 [792 'France' 'Female' ... 1 0 38190.78]]


In [146]:

# Printing the dependent variable Y
print(Y)

[1 0 1 ... 1 1 0]


### &nbsp; &nbsp; 2. Encoding categorical data
- We have two categorical variables: Geography and Gender.

#### &nbsp; &nbsp; &nbsp; &nbsp; 2.1 Label Encoding the "Gender" column
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; i.e. converting the categorical variable into a numerical one <br>
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; i.e. assigning 0 and 1 to Female and Male respectively (LabelEncoder numbers the classes in alphabetical order)

In [147]:
from sklearn.preprocessing import LabelEncoder
Le = LabelEncoder()

# We have to apply the label encoding only to the Gender column
X[:, 2] = Le.fit_transform(X[:, 2])          # Take all the rows but only the column of index 2 (Gender), encode it, ...
                                             # ... then assign the result back to the Gender column 


In [148]:
# Checking that 'Male' and 'Female' no longer appear in the dataset:
    # Female was encoded as 0 and Male as 1
print(X)

[[619 'France' 0 ... 1 1 101348.88]
 [608 'Spain' 0 ... 0 1 112542.58]
 [502 'France' 0 ... 1 0 113931.57]
 ...
 [709 'France' 0 ... 0 1 42085.58]
 [772 'Germany' 1 ... 1 0 92888.52]
 [792 'France' 0 ... 1 0 38190.78]]
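
The 0/1 assignment seen in this output follows from how label encoding orders the classes. As a minimal NumPy sketch (an illustration of the idea, not the scikit-learn implementation itself; the `genders` array is made-up data), sorting the unique labels and numbering them reproduces the same mapping:

```python
import numpy as np

# Hypothetical sample of the Gender column (illustration only)
genders = np.array(['Female', 'Male', 'Male', 'Female'])

# np.unique returns the sorted unique labels and, with return_inverse=True,
# the position of each original value within that sorted list
classes, codes = np.unique(genders, return_inverse=True)

print(classes)  # sorted labels: 'Female' comes before 'Male'
print(codes)    # integer code assigned to each row
```

Because 'Female' sorts before 'Male', Female receives 0 and Male receives 1.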


#### &nbsp; &nbsp; &nbsp; &nbsp; 2.2 One Hot Encoding the "Geography" column
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; i.e. We perform One Hot Encoding on the Geography column because there is no **numerical order** or **rank**<br>
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; between France, Germany and Spain.<br> 
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; i.e. Using **numbers** would **mislead the model** into thinking some countries are **"higher"** or <br>
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; **"closer"** to others, which could hurt performance. Therefore, One-Hot Encoding creates <br>
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; one **binary column** for each country.<br>
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; i.e. One-Hot Encoding avoids introducing **false numerical relationships** that would imply a ranking and, as a consequence, bias the model.<br>

In [149]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers = [('encoder', OneHotEncoder(), [1])], remainder = 'passthrough')   # We want to apply One Hot Encoding on the Geography column
                                                                                                        # ... i.e. index 1 of the feature variable  X
X = np.array(ct.fit_transform(X))

In [150]:
print(X)

[[1.0 0.0 0.0 ... 1 1 101348.88]
 [0.0 0.0 1.0 ... 0 1 112542.58]
 [1.0 0.0 0.0 ... 1 0 113931.57]
 ...
 [1.0 0.0 0.0 ... 0 1 42085.58]
 [0.0 1.0 0.0 ... 1 0 92888.52]
 [1.0 0.0 0.0 ... 1 0 38190.78]]


- France was encoded as [1.0, 0.0, 0.0] <br>
- Spain was encoded as [0.0, 0.0, 1.0]  <br>
- Germany was encoded as [0.0, 1.0, 0.0]  <br>
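
To make the mapping concrete, here is a small NumPy sketch of one-hot encoding (an illustration with made-up data, not the internals of `OneHotEncoder`): each country gets its own column, in alphabetical order, and each row contains a single 1.

```python
import numpy as np

# Hypothetical sample of the Geography column (illustration only)
countries = np.array(['France', 'Spain', 'Germany', 'France'])

# Sorted unique categories and the category index of every row
categories, idx = np.unique(countries, return_inverse=True)

# Row i of the identity matrix is the one-hot vector of category i
one_hot = np.eye(len(categories))[idx]

print(categories)
print(one_hot)
```

The column order is France, Germany, Spain, which is why Germany lands in the middle column.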

### &nbsp; &nbsp; 3. Splitting the dataset into the Training set and Test set

In [151]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)

### &nbsp; &nbsp; 4. Feature Scaling
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; **Note**: Feature Scaling is essential in Deep Learning, because gradient-based training behaves much better when all inputs are on a comparable scale. <br>
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; i.e. We're **standardizing** the data so that each feature has a mean of zero and a standard deviation of one. <br>
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; - It is applied to all our feature variables, irrespective of whether they are already in the desired scale/range <br>


In [152]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

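# Applying feature scaling to the features of both the training set and the test set (the dependent variable is not scaled).
X_train = sc.fit_transform(X_train)         # the scaler is fitted on the training set only, in order to avoid information leakage from the test set.
X_test = sc.transform(X_test)               # the test set is only transformed, using the mean and standard deviation learned on the training set.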
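
What the scaler computes can be sketched by hand in NumPy. The arrays below are made-up values, not rows of the bank dataset; the point is that the mean and standard deviation come from the training rows only and are then reused on the test rows.

```python
import numpy as np

# Hypothetical training and test features (two columns, illustration only)
train = np.array([[600.0, 40.0], [700.0, 50.0], [800.0, 30.0]])
test = np.array([[650.0, 45.0]])

# "fit": learn the statistics on the training set only
mean = train.mean(axis=0)
std = train.std(axis=0)   # population standard deviation (ddof=0)

# "transform": apply the same statistics to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

After scaling, every training column has mean 0 and standard deviation 1, while the test rows are simply expressed on that same scale.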

# Part 2 - Building the ANN

### &nbsp; &nbsp; 1. Initializing the ANN
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; - Here we'll initialize the ANN as a sequence of layers

In [153]:
# 1. Creating a variable which is nothing other than the ANN itself,
    # i.e. the ANN is created as an object of the Sequential class, which allows building an ANN as a sequence of layers.
ann = tf.keras.models.Sequential()         # The Sequential class is taken from the models module of the Keras library

### &nbsp; &nbsp; 2. Adding the input layer and the first hidden layer

In [154]:
# The way to add a layer to an ANN is to call the add() method of the Sequential model and pass it an object of the Dense class,
    # i.e. the class that creates a fully connected layer
ann.add(tf.keras.layers.Dense(units = 6 , activation = 'relu'))    # The Dense class is taken from the layers module of the Keras library.
                                                                   # The layers module contains the tools to add layers to our ANN.
                                                                   # units corresponds to the number of neurons we want to have in the 1st hidden layer. 
                                                    # The input neurons are simply our feature variables, from the credit score to the estimated salary.
                                                    # The activation function is the rectifier activation function, i.e. ReLU.
                                                    # ReLU is one of the most popular activation functions for hidden layers.
# Rule of thumb used here: units = number of input features / 2.
# The 10 original features give 10 / 2 = 5; the 12 columns of X after One Hot Encoding give 12 / 2 = 6, the value used above.
# units is a hyperparameter: it sets the number of neurons in a layer, not the number of hidden layers.

#### &nbsp; &nbsp; &nbsp; &nbsp; Determine the number of neurons in the hidden layers.

&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; There is no fixed rule for choosing the number of neurons; it is found by experiment. Some common starting points:<br>
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; - Start with a number of hidden neurons between the **number of input features (n)** and **2–3× n**, then **tune** based on **validation performance**<br>
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; - Too few neurons → underfitting.<br>
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; - Too many neurons → overfitting, slower training, more compute.<br>
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; - Use dropout or L2 regularization if you go large.<br>
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; - Always monitor validation loss to adjust the architecture.<br>
<br>
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; - n is the number of input features, and these guidelines assume fully connected (dense) layers used for classification or regression.

| Situation                        | Suggested Hidden Neurons                    |
| -------------------------------- | ------------------------------------------- |
| **Simple problem**               | $\text{hidden\_neurons} \approx n \ \text{or} \ \frac{n}{2}$ |
| **Moderate complexity**          | $\text{hidden\_neurons}\approx 1.5n \ \text{or} \approx 2n$ |
| **High complexity / large data** | $\text{hidden\_neurons} \approx 3n$ or more |
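
The table can be turned into a small helper. `suggest_hidden_units` is a hypothetical function written for this notebook, not part of Keras; it only encodes the starting points listed in the table, and the result still has to be tuned against validation performance.

```python
def suggest_hidden_units(n_features, complexity="simple"):
    """Return a (low, high) starting range for the number of hidden neurons."""
    ranges = {
        "simple": (n_features // 2, n_features),               # n/2 to n
        "moderate": (int(1.5 * n_features), 2 * n_features),   # 1.5n to 2n
        "high": (3 * n_features, 3 * n_features),              # 3n (or more)
    }
    return ranges[complexity]

# 12 input columns, as in X after One Hot Encoding
n = 12
for level in ("simple", "moderate", "high"):
    print(level, suggest_hidden_units(n, level))
```

With 12 inputs, the "simple" row gives a range of 6 to 12 neurons, which contains the 6 units used in the first hidden layer.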



### &nbsp; &nbsp; 3. Adding the second hidden layer
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; - Here we add a 2nd hidden layer in order to build a deep learning model as opposed to a shallow model.

In [155]:
# To add a new layer we just copy and paste the code above,
    # i.e. the 2nd layer is added the same way as the 1st.
ann.add(tf.keras.layers.Dense(units = 5 , activation = 'relu'))
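
As a rough sketch of what the two hidden `Dense` layers compute, each layer multiplies its input by a weight matrix, adds a bias and applies ReLU. The weights below are random placeholders (Keras initializes and trains its own), and the 12 inputs are assumed to correspond to the 12 columns of X after encoding.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense_relu(x, W, b):
    """One fully connected layer followed by ReLU: max(0, xW + b)."""
    return np.maximum(0.0, x @ W + b)

# One hypothetical scaled observation with 12 features
x = rng.normal(size=(1, 12))

# Placeholder parameters for the two hidden layers (12 -> 6 -> 5)
W1, b1 = rng.normal(size=(12, 6)), np.zeros(6)
W2, b2 = rng.normal(size=(6, 5)), np.zeros(5)

h1 = dense_relu(x, W1, b1)    # output of the 1st hidden layer
h2 = dense_relu(h1, W2, b2)   # output of the 2nd hidden layer

# Trainable parameters in the two hidden layers: weights plus biases
n_params = W1.size + b1.size + W2.size + b2.size
print(h1.shape, h2.shape, n_params)
```

Under these assumptions the two hidden layers hold 12 × 6 + 6 + 6 × 5 + 5 = 113 trainable parameters.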

### &nbsp; &nbsp; 4. Adding the output layer
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; - Here we will add the output layer which will contain what we want to predict.

In [156]:

ann.add(tf.keras.layers.Dense(units = 1 , activation = 'sigmoid'))  # We're still using the Dense class because we want to create a fully connected layer, i.e. ...
                                                                    # ... we want our output layer to be fully connected to the previous (2nd hidden) layer.
                                                                    # Since the output, i.e. Exited in our dataset, is binary (it can only be 0 or 1), ...
                                                                    # ... we need just one neuron in the output layer: a single value between 0 and 1 ...
                                                                    # ... is enough to encode the two classes.                                                              
# For the output layer we want the sigmoid activation function, because ...
# ... the sigmoid function (for binary prediction) gives us not only the prediction but also the probability of the outcome.
# ... For a prediction with more than two classes we would use the softmax function instead, with one output neuron per class.
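
The difference between the two output activations can be sketched in a few lines of NumPy. The input values are arbitrary and only serve as an illustration.

```python
import numpy as np

def sigmoid(z):
    """Maps any real number to a probability between 0 and 1 (binary case)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Maps a vector of scores to probabilities that sum to 1 (multi-class case)."""
    e = np.exp(z - np.max(z))   # subtracting the max improves numerical stability
    return e / e.sum()

p_leave = sigmoid(0.0)                            # one output neuron
class_probs = softmax(np.array([2.0, 1.0, 0.1]))  # three output neurons

print(p_leave)
print(class_probs, class_probs.sum())
```

A sigmoid output of exactly 0.5 sits on the decision boundary, while softmax distributes a total probability of 1 across all classes.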

# Part 3 - Training the ANN

### &nbsp; &nbsp; 1. Compiling the ANN with an optimizer, a loss function and metrics

In [157]:
# Using the ann object to call the compile() method
ann.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])   # optimizer is the learning algorithm
                                                                                            # a good default is an optimizer that performs stochastic gradient descent (SGD) ...
                                                                                            # ... with adaptive learning rates, such as the adam optimizer.
                                                                                            # SGD updates the weights in order to reduce the loss
                                                                                       # loss is the loss function
                                                                                            # binary_crossentropy is the loss function for binary prediction
                                                                                            # categorical_crossentropy is the loss function for prediction with more than two classes
                                                                                       # metrics is the list of metrics used to evaluate the performance of the model.
                                                                                            # metrics are entered in a pair of square brackets since we can pass several ...
                                                                                            # ... metrics at the same time.
                                                                                            # metrics could be: accuracy, precision, recall
                                                                                                # metrics = ['accuracy'] is the accuracy metric
                                                                                                # metrics = ['accuracy', tf.keras.metrics.Precision()] is the accuracy and precision metric
                                                                                                # metrics = ['accuracy', tf.keras.metrics.Precision(), tf.keras.metrics.Recall()] is the accuracy, precision and recall metric
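
To see what `binary_crossentropy` penalizes, here is a minimal single-sample sketch in plain Python. It is a simplification: Keras additionally averages the loss over the batch and clips the probabilities to avoid log(0).

```python
import math

def binary_crossentropy(y_true, p):
    """Loss for one observation: y_true is 0 or 1, p is the predicted probability of class 1."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# The customer really left (y_true = 1)
confident_right = binary_crossentropy(1, 0.9)   # model says 90% chance of leaving
confident_wrong = binary_crossentropy(1, 0.1)   # model says 10% chance of leaving

print(confident_right, confident_wrong)
```

A confident wrong prediction costs far more than a confident right one, which is what pushes the weights towards well-calibrated probabilities.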

### &nbsp; &nbsp; 2. Training the ANN on the Training set over a certain number of epochs

In [158]:
ann.fit(X_train, Y_train, batch_size = 32, epochs = 100)   # batch_size = 32 is also the Keras default value
                                                           # epochs = 100 is our choice (the Keras default is epochs = 1)


# To train our model, we use the ann object to call the fit() method
    # fit() is the usual method to train an ML model and takes the following parameters:
        # X_train: the matrix of features of the training set
        # Y_train: the dependent variable of the training set
            # When training an ANN, we usually set 2 more parameters:
                # batch_size: the number of samples processed before the weights of the model are updated
                    # batch learning is efficient when training an ANN on a large dataset, ...
                    # ... because the model updates its weights once per batch instead of once per sample
                    # The batch size usually chosen is 32
                # epochs: the number of times the model goes through the entire training set, so as to improve the accuracy of the model over time.
                    # Here we choose 100 epochs, but any number works as long as it is not too small for the model to learn properly.


                                                            

Epoch 1/100
250/250 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step - accuracy: 0.7959 - loss: 0.5302
Epoch 2/100
250/250 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8076 - loss: 0.4538
Epoch 3/100
250/250 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8117 - loss: 0.4353
Epoch 4/100
250/250 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8151 - loss: 0.4273
Epoch 5/100
250/250 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8184 - loss: 0.4216
Epoch 6/100
250/250 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.8217 - loss: 0.4165
Epoch 7/100
250/250 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8253 - loss: 0.4122
Epoch 8/100
250/250 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8291 - loss: 0.4078
Epoch 9/100
250/250 ━━━━━━━━━━━

<keras.src.callbacks.history.History at 0x224b7179590>

- An accuracy of 0.8637 on the training set means that about 86 out of every 100 training observations are predicted correctly.
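
The `250/250` shown on each epoch line of the training log is the number of batches per epoch. A quick check in plain Python, assuming 8000 training rows (an assumption inferred from the 2000 test rows in the confusion matrix of Part 4 and `test_size = 0.2`):

```python
import math

n_train = 8000      # assumed number of training rows (80% of the dataset)
batch_size = 32
epochs = 100

# One weight update per batch; the last batch may be smaller than batch_size
steps_per_epoch = math.ceil(n_train / batch_size)
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, total_updates)
```

Under that assumption the weights are updated 250 times per epoch, i.e. 25,000 times over the whole training run.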

# Part 4 - Making the predictions and evaluating the model

### &nbsp; &nbsp; 1. Predict the result of a single observation
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;  Use our ANN model to predict the result of a single observation:<br>
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; **Geography**: France<br>
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; **Credit Score**: 600<br>
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; **Gender**: Male<br>
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; **Age**: 40<br>
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; **Tenure**: 3 <br>
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; **Balance**: 60000<br>
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; **Number of Products**: 2<br>
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; **Has Credit Card**: Yes<br>
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; **Is Active Member**: Yes<br>
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; **Estimated Salary**: 50000<br>
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; **So, should we say goodbye to that customer?**<br>

### &nbsp; &nbsp; Solution

In [None]:
# In order to predict a single observation, we use the predict() method of the ann.
    # Any input of the predict() method must be a 2D array, i.e. in double brackets.
    # Since Geography is a categorical feature, we enter its dummy variables as created in the One Hot Encoding cell above,
        # where France was encoded as 1.0, 0.0, 0.0, i.e. 1, 0, 0
    # Male was label encoded as 1 and Female as 0.
    # The predict() method must be applied to an observation on the same scale as the one used for training. ...
        # ... Since sc was fitted on the training set, we call its transform() method here and not fit_transform(): ...
        # ... refitting on a single row would discard the training mean and standard deviation and turn every feature into 0.
        # On a new observation we can only apply the transform() method.
        
print(ann.predict(sc.transform([[1, 0, 0, 600, 1, 40, 3, 60000, 2, 1, 1, 50000]])) > 0.5)
    # predict() returns the probability that the observation belongs to class 1, i.e. that the customer leaves.
    # Comparing it with 0.5 turns the probability into a decision: 1/True if it is greater than 0.5, 0/False otherwise.

# If we want the result in probability form instead, we drop the comparison:
# print(ann.predict(sc.transform([[1, 0, 0, 600, 1, 40, 3, 60000, 2, 1, 1, 50000]])))


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 75ms/step
[[False]]


- The sigmoid activation function at the output layer gives us the probability that the customer leaves the bank.
  - If the probability is **greater than 0.5**, the customer is predicted to **leave** the bank.
  - If the probability is **less than 0.5**, the customer is predicted to **stay** with the bank.
    - **So, should we say goodbye to that customer?** The result is False, meaning the customer is predicted **not to leave** the bank.
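
The reason a new observation must go through `transform` rather than `fit_transform` can be checked with a small NumPy sketch. The numbers are made up, and `standardize` is a hand-written helper that only mimics the idea of the scaler: fitting on the single row itself subtracts the row from itself, so every feature becomes 0 whatever the customer looks like.

```python
import numpy as np

def standardize(fit_on, apply_to):
    """Scale apply_to with the mean and standard deviation computed on fit_on."""
    mean = fit_on.mean(axis=0)
    std = fit_on.std(axis=0)
    std[std == 0] = 1.0   # constant columns: avoid dividing by zero
    return (apply_to - mean) / std

# Hypothetical training features and one new observation (illustration only)
train = np.array([[600.0, 40.0], [700.0, 50.0], [800.0, 30.0]])
new_row = np.array([[650.0, 45.0]])

refit = standardize(new_row, new_row)   # statistics taken from the new row itself
reused = standardize(train, new_row)    # statistics taken from the training set

print(refit)    # all zeros: the information in the row is lost
print(reused)   # the row expressed on the training scale
```

Only the second variant gives the network an input on the scale it was trained on.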
  

### &nbsp; &nbsp; 2. Predicting the Test set results
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; - Display the predicted values and the real values side by side, i.e. the vector of predicted values next to the vector of real values.

In [162]:
# Predicting the Test set results
Y_pred = ann.predict(X_test)

# Converting the probabilities into binary predictions i.e. 0 or 1
Y_pred = (Y_pred > 0.5)           # if the probability is greater than 0.5, it will be 1/True
                                  # and if the probability is less than 0.5, it will be 0/False

print(np.concatenate((Y_pred.reshape(len(Y_pred),1), Y_test.reshape(len(Y_test),1)),1))

63/63 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step
[[0 0]
 [0 1]
 [0 0]
 ...
 [0 0]
 [0 0]
 [0 0]]
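
The thresholding and the side-by-side display can be tried on a tiny example. The probabilities and labels below are made up; only the NumPy mechanics match the cell above.

```python
import numpy as np

# Hypothetical model outputs (one probability per row) and true labels
probs = np.array([[0.12], [0.81], [0.47]])
y_true = np.array([0, 1, 1])

# Probabilities above 0.5 become True (1), the others False (0)
y_pred = probs > 0.5

# Reshape both into column vectors and join them along axis 1:
# left column = prediction, right column = real value
side_by_side = np.concatenate(
    (y_pred.reshape(len(y_pred), 1), y_true.reshape(len(y_true), 1)), axis=1
)
print(side_by_side)
```

Concatenating a boolean column with an integer column yields integers, which is why the output shows 0 and 1 rather than False and True.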


### &nbsp; &nbsp; 3. Making the Confusion Matrix
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; - Get the confusion matrix and the final accuracy on the test set.

In [163]:
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(Y_test, Y_pred)
print(cm)
accuracy_score(Y_test, Y_pred)

[[1523   72]
 [ 200  205]]


0.864

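- Meaning that out of 100 customers, about 86 were predicted correctly: accuracy = (1523 + 205) / 2000 = 0.864.
  - 1523 correct predictions that the customer stays with the bank.
  - 205 correct predictions that the customer leaves the bank.
  - 200 incorrect predictions that the customer stays with the bank (these customers actually left).
  - 72 incorrect predictions that the customer leaves the bank (these customers actually stayed).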
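
The four cells of the confusion matrix also give precision and recall for the "leaves the bank" class. A short worked computation in plain Python, using the values printed above (rows are the real classes, columns the predicted classes):

```python
# Cells of the confusion matrix printed above
tn, fp = 1523, 72    # real 0: predicted 0, predicted 1
fn, tp = 200, 205    # real 1: predicted 0, predicted 1

total = tn + fp + fn + tp

accuracy = (tp + tn) / total    # share of all predictions that are correct
precision = tp / (tp + fp)      # of the customers flagged as leaving, how many really left
recall = tp / (tp + fn)         # of the customers who really left, how many were flagged

print(accuracy, round(precision, 3), round(recall, 3))
```

Accuracy is 0.864, but recall is only about 0.51: roughly half of the customers who actually left were not flagged, which matters for a retention campaign.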