<a href="https://colab.research.google.com/github/hrampadarath/Deep_Learning_examples/blob/main/ANN_grid_search_cv.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### ANN grid-search with cross-validation

The previous ANN notebook [ANN_cross_validation.ipynb](ANN_cross_validation.ipynb) used cross-validation to test the stability of the model. In this notebook, the hyperparameters will be tuned using grid search.

In [None]:
import pandas as pd
import numpy as np

### 1. Data preprocessing

The first step is to load the data. This is the same as in the previous notebook. Columns that carry only descriptive information will be removed.

In [None]:
# mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# load the dataframe
df = pd.read_csv("/content/drive/My Drive/datasets/Churn_Modelling.csv")
df.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [None]:
# print information regarding each column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB


The goal of this exercise is to predict whether a customer will leave the bank in the next 6 months. The ground-truth label is given in the "Exited" column: 1 means the customer has left; 0 means the customer has not left.

In [None]:
# define the target column
target = df["Exited"]

A few columns are of no use for gaining insight: RowNumber, CustomerId, and Surname. I would also hesitate to use Gender, as it could be used as a tool for discrimination. One could argue the same for Age, although it is kept here.

In [None]:
features = df.drop(["RowNumber","CustomerId","Surname","Gender","Exited"], axis=1)
features.head()

Unnamed: 0,CreditScore,Geography,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary
0,619,France,42,2,0.0,1,1,1,101348.88
1,608,Spain,41,1,83807.86,1,0,1,112542.58
2,502,France,42,8,159660.8,3,1,0,113931.57
3,699,France,39,1,0.0,2,0,0,93826.63
4,850,Spain,43,2,125510.82,1,1,1,79084.1


Before continuing, we need to take care of the Geography column, which is categorical.

In [None]:
geo_dummy = pd.get_dummies(df["Geography"])
geo_dummy.head()

Unnamed: 0,France,Germany,Spain
0,1,0,0
1,0,0,1
2,1,0,0
3,1,0,0
4,0,0,1
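As a small illustration of the one-hot encoding step, here is a sketch on a toy Series (the values are made up for the example, not taken from the dataset). One assumption worth flagging: recent pandas versions return boolean columns from `get_dummies` by default, so `dtype=int` is passed to get the 0/1 integers shown in the output above.

```python
import pandas as pd

# Toy stand-in for the Geography column (values chosen for illustration only)
geo = pd.Series(["France", "Spain", "France", "Germany"], name="Geography")

# One column per category; dtype=int gives 0/1 instead of True/False
dummies = pd.get_dummies(geo, dtype=int)
print(dummies)

# Every row has exactly one active category
print(dummies.sum(axis=1).tolist())
```

The columns come out in sorted category order, which is why France, Germany, and Spain appear in that order.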


Drop the Geography column and concatenate the geo_dummy data.

In [None]:
features = pd.concat([features.drop(["Geography"],axis=1),geo_dummy],axis=1)
features.head()

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,France,Germany,Spain
0,619,42,2,0.0,1,1,1,101348.88,1,0,0
1,608,41,1,83807.86,1,0,1,112542.58,0,0,1
2,502,42,8,159660.8,3,1,0,113931.57,1,0,0
3,699,39,1,0.0,2,0,0,93826.63,1,0,0
4,850,43,2,125510.82,1,1,1,79084.1,0,0,1


### 2. Train/test split
We still apply a train/test split to the data.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features,target, random_state=0)

In [None]:
print('Shape of X_train = {}, and of y_train = {}'.format(np.shape(X_train), np.shape(y_train)))

Shape of X_train = (7500, 11), and of y_train = (7500,)


In [None]:
print('Shape of X_test = {}, and of y_test = {}'.format(np.shape(X_test), np.shape(y_test)))

Shape of X_test = (2500, 11), and of y_test = (2500,)
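The 7500/2500 shapes follow from the default behaviour of `train_test_split`, which holds out 25% of the rows when neither `test_size` nor `train_size` is given. A minimal sketch on placeholder arrays of the same shape (the arrays here are synthetic, not the churn data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows and columns as the features
X = np.arange(10000 * 11).reshape(10000, 11)
y = np.arange(10000)

# Default split: 25% of the rows go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
print(X_tr.shape, X_te.shape)

# Spelling out test_size=0.25 with the same random_state gives the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.25, random_state=0)
print(np.array_equal(X_te, X_te2))
```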


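Before beginning with our ANN, the features need to be brought onto a common scale. To do so we will use the preprocessing.StandardScaler class from scikit-learn, which standardizes each feature to zero mean and unit variance (it does not map values into a fixed range such as (0,1) or (-1,1)). The scaler is fit on the training set only and then applied to the test set.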

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train)
X_train_scale = scaler.transform(X_train)
X_train = pd.DataFrame(X_train_scale,columns=X_train.columns)

# apply to test set
X_test_scale = scaler.transform(X_test)
X_test = pd.DataFrame(X_test_scale,columns=X_test.columns)
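To make the effect of the scaler concrete, the sketch below uses synthetic data (the numbers are made up and only loosely resemble a credit score and an age column). It checks that `StandardScaler` matches the manual formula `(x - mean) / std` computed from the training set, and that the scaled values are not confined to a fixed interval.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic features on very different scales
toy_train = rng.normal(loc=[600.0, 40.0], scale=[100.0, 10.0], size=(200, 2))
toy_test = rng.normal(loc=[600.0, 40.0], scale=[100.0, 10.0], size=(50, 2))

# Fit on the training data only, then reuse the same statistics for the test data
toy_scaler = StandardScaler().fit(toy_train)
train_scaled = toy_scaler.transform(toy_train)
test_scaled = toy_scaler.transform(toy_test)

# Manual equivalent: subtract the training mean, divide by the training std (ddof=0)
manual = (toy_train - toy_train.mean(axis=0)) / toy_train.std(axis=0)

print(np.allclose(train_scaled, manual))
print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
print(train_scaled.min().round(2), train_scaled.max().round(2))
```

The test set is transformed with the training statistics, so its mean and standard deviation will be close to, but not exactly, 0 and 1.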

### 3. ANN model and grid-search

We will use a combination of scikit-learn and the Keras classifier wrapper.

In [None]:
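import tensorflow as tf
# Note: this wrapper module has been removed from recent TensorFlow releases;
# the separate scikeras package provides a replacement KerasClassifier.
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV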

As before, we define the model as a function. This time, however, there are a few parameters we want to tune, so the optimizer is passed in as an argument.

In [None]:
def build_model(optimizer):
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Dense(32, input_dim=11, activation="relu"))
    model.add(tf.keras.layers.Dense(8, activation="relu"))
    model.add(tf.keras.layers.Dense(1, activation="sigmoid"))
    #compile model
    model.compile(loss="binary_crossentropy",optimizer=optimizer,metrics=["accuracy"])
    return model
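As a quick check on the size of the network defined above, the number of trainable parameters can be worked out by hand: each Dense layer has (inputs × units) weights plus one bias per unit. A short sketch of that arithmetic follows; the helper name `dense_params` is hypothetical and introduced only for illustration.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# 11 input features, then Dense layers with 32, 8 and 1 units
layer_sizes = [11, 32, 8, 1]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
total_params = sum(per_layer)

print(per_layer, total_params)  # [384, 264, 9] 657
```

Calling `summary()` on a built model should report the same per-layer counts, which is a handy way to confirm the architecture is what was intended.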

Create the classifier for scikit-learn. The grid search will tune the batch_size and the optimizer; the number of epochs is fixed at a single value.

In [None]:
classifier = KerasClassifier(build_fn = build_model)

Now define a dictionary with the parameters to search over.

In [None]:
parameters = {"batch_size": [50, 100],
             "epochs": [500],
             "optimizer": ["adam","rmsprop"]}

Now for the grid search.

In [None]:
grid_search = GridSearchCV(estimator = classifier,
                          param_grid = parameters,
                          scoring = "accuracy",
                          cv = 10)
grid_search = grid_search.fit(X_train, y_train)

Streaming output truncated to the last 5000 lines.
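The search above is expensive because every parameter combination is trained once per cross-validation fold. The sketch below counts the fits using `itertools.product`, which enumerates the same combinations a grid search iterates over. It assumes the grid and `cv = 10` from the cells above, and that `GridSearchCV` is left at its default of refitting the best model once on the full training set.

```python
from itertools import product

parameters = {"batch_size": [50, 100],
              "epochs": [500],
              "optimizer": ["adam", "rmsprop"]}

# Every combination of the listed values
names = list(parameters)
combinations = [dict(zip(names, values)) for values in product(*parameters.values())]

cv_folds = 10                                   # matches cv = 10 in the search
n_cv_fits = len(combinations) * cv_folds        # one fit per combination per fold
n_total_fits = n_cv_fits + 1                    # plus the final refit of the best model
total_epochs = n_total_fits * parameters["epochs"][0]

print(len(combinations), n_cv_fits, n_total_fits, total_epochs)  # 4 40 41 20500
```

With that many epochs, adding more values to the grid multiplies the run time quickly, so it pays to keep the grid small or reduce the number of epochs while exploring.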

In [None]:
print(grid_search.best_params_)
print(grid_search.best_score_)

{'batch_size': 100, 'epochs': 500, 'optimizer': 'adam'}
0.8484
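The score above is the mean cross-validated accuracy on the training data; the held-out test set has not been used yet. The sketch below shows one way to inspect every grid point and then score the refit best model on held-out data. To keep it fast and self-contained it uses a `LogisticRegression` stand-in on synthetic data rather than the Keras classifier, so the numbers it prints say nothing about the churn model.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data (stand-in for the churn features)
X, y = make_classification(n_samples=400, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

search = GridSearchCV(estimator=LogisticRegression(max_iter=1000),
                      param_grid={"C": [0.1, 1.0, 10.0]},
                      scoring="accuracy",
                      cv=5)
search.fit(X_tr, y_tr)

# One row per parameter combination, with mean and spread across the folds
results = pd.DataFrame(search.cv_results_)[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]]
print(results.sort_values("rank_test_score"))

# Accuracy of the refit best model on data the search never saw
test_accuracy = search.score(X_te, y_te)
print(search.best_params_, test_accuracy)
```

Looking at `std_test_score` alongside the mean helps judge whether the gap between the best and second-best settings is larger than the fold-to-fold noise.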
