# Artificial Neural Networks

### Practical Session

Prof. Dr. Georgios K. Ouzounis
<br/>email: [georgios.ouzounis@go.kauko.lt](georgios.ouzounis@go.kauko.lt)

## Contents

1. Challenge
2. Getting the data-set
3. Loading and pre-processing the data
4. Compiling the ANN
5. Deploying the ANN
6. Testing individual cases
7. Improving the model

## Challenge

We are given a data set consisting of 14 features and a total of 10000 records of customers of a financial institution. 

Among the features is one named **Exited**, which takes binary values: true means that the given customer rejected a product, false that he/she retained it.

The goal of this exercise is to train a model that predicts, as accurately as possible, the decision of new customers described by the same features.


<img src="https://www.cuinsight.com/wp-content/uploads/2015/07/27178-d21ed95f.jpg" width="600"/>

## Getting the data-set

The data-set is a comma-separated values file (CSV) and contains a sample data table of 10000 records. It can be found at the [Kaggle.com website](https://www.kaggle.com/aakash50897/churn-modellingcsv) or at various web locations after searching for its filename:

**Churn_Modelling.csv**

Example locations: 

- [Floobits.com](https://floobits.com/calvinlow18/ANN/raw/Churn_Modelling.csv)
- [Pushkar Mandot (Medium blog writer) G-drive](https://drive.google.com/file/d/0By9Y49AzZGaUemtpNWtQMWdqRDA/view)

Open a terminal session:

<img src="https://www.macworld.co.uk/cmsdata/features/3608274/Terminalicon2_thumb800.png" align="left" width="200"/>

Use the wget command to download it from the selected location:

```shell
wget https://floobits.com/calvinlow18/ANN/raw/Churn_Modelling.csv
```

## Loading and pre-processing the data 


### Import the libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

[numpy](http://www.numpy.org): the fundamental package for scientific computing with Python. Among other things, it contains a powerful N-dimensional array object that can be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined.

[matplotlib](https://matplotlib.org): a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.

[pandas](https://pandas.pydata.org): a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.

### Import & explore the dataset

The variable dataset is a pandas DataFrame holding the contents of the opened file. To scout its contents use the **info()** and **head()** methods.

In [None]:
#Importing the dataset
dataset = pd.read_csv('Churn_Modelling.csv')

In [None]:
# view the features
dataset.info()

In [None]:
# view the head of the file
dataset.head()

### Data Cleaning/ Splitting

The **independent variables** are to be stored in matrix X. Evidently, neither the row ID (column 0) nor the customer number (column 1) nor the surname (column 2) can influence the decision of the customer, thus we read all other features leaving these three out.

The **dependent variable**, i.e. the one we want to predict, is to be stored in a separate vector y that holds the contents of column 13 alone.

In [None]:
# all the independent variables stored in columns 3 to 12 
# are stored in X 
X = dataset.iloc[:, 3:13].values 
X[0,:]

In [None]:
# column index 13 : the dependent variables
y = dataset.iloc[:, 13].values 
y[0]
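
The column selection above relies on the fact that positional slices are half-open. A minimal sketch on a one-row toy frame illustrates this; the column names are an assumption about the header of Churn_Modelling.csv and the values are made up:

```python
import pandas as pd

# Toy frame with 14 columns; names assumed to match the CSV header, values invented
columns = ['RowNumber', 'CustomerId', 'Surname', 'CreditScore', 'Geography',
           'Gender', 'Age', 'Tenure', 'Balance', 'NumOfProducts',
           'HasCrCard', 'IsActiveMember', 'EstimatedSalary', 'Exited']
toy = pd.DataFrame([[1, 101, 'Doe', 600, 'France', 'Male', 40, 3,
                     60000.0, 2, 1, 1, 50000.0, 0]], columns=columns)

# iloc slices are half-open: 3:13 selects positions 3, 4, ..., 12
X_toy = toy.iloc[:, 3:13].values
y_toy = toy.iloc[:, 13].values

print(list(toy.columns[3:13]))   # the ten feature names
print(X_toy.shape, y_toy.shape)  # one row, ten features; one label
```

The upper bound 13 is excluded, so the label column never leaks into X.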

### Encoding categorical data

The independent variables **Geography** and **Gender** are **strings** (reported as **object** by the **info()** function) that need to be encoded into discrete variables, as discussed previously in the **Features** session.

**LabelEncoder** takes the values of a column as its argument and converts all categorical entries to integer labels.


In [None]:
# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

In [None]:
# geography column: enumerate countries
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1]) 

In [None]:
# gender column: enumerate female/male
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])

In [None]:
X[0,:]

Label encoding introduces a new problem in our data. LabelEncoder has replaced France with 0, Germany with 1 and Spain with 2, which implies an ordering; but Germany is not higher than France, and France is not smaller than Spain. We therefore need to create dummy variables for Geography.

Read more on [dummy variables here](https://en.wikiversity.org/wiki/Dummy_variable_%28statistics%29).

We don’t need to do the same for the **Gender** variable as it is binary.



Create dummy variables using the ScikitLearn library class [OneHotEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html). It replaces a categorical column with one binary column per category, so that each category is encoded by a unique combination of 0s & 1s. [Read more here](https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f).

Read more on the [dummy variable trap](http://www.algosome.com/articles/dummy-variable-trap-regression.html): one of the dummy columns is redundant and must be removed.

In [None]:
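# one-hot encode the Geography column (index 1); all other columns pass through unchanged.
# The categorical_features argument was removed in scikit-learn 0.22, hence the ColumnTransformer
from sklearn.compose import ColumnTransformer
onehotencoder = ColumnTransformer([('geography', OneHotEncoder(), [1])], remainder = 'passthrough', sparse_threshold = 0)

In [None]:
# the dummy columns come first, followed by the remaining features
X = onehotencoder.fit_transform(X).astype(float)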

In [None]:
# remove the first column to avoid the dummy variable trap
X = X[:, 1:] 

In [None]:
X[0,:]
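
The encoding steps above can be sketched with plain NumPy on a handful of made-up country values. This is only an illustration of the idea, not the ScikitLearn implementation:

```python
import numpy as np

countries = np.array(['France', 'Spain', 'Germany', 'France'])

# Label encoding: np.unique sorts the categories and returns one integer code per entry
categories, codes = np.unique(countries, return_inverse=True)

# One-hot encoding: row i of the identity matrix is the dummy vector for code i
one_hot = np.eye(len(categories))[codes]

# Drop the first dummy column (France) to avoid the dummy variable trap;
# France is then represented by zeros in both remaining columns
dummies = one_hot[:, 1:]

print(categories)
print(codes)
print(dummies)
```

After dropping the first column, the two remaining columns stand for Germany and Spain, which is why a French customer is later entered as 0, 0.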

### Split the dataset to training and testing sets

Next, we need to divide our data set into two subsets, one for training and one for testing. 
ScikitLearn library provides the function [train_test_split()](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html):

**sklearn.model_selection.train_test_split()**

that splits arrays or matrices into random train and test subsets.


In [None]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
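
A small sketch on synthetic data (50 made-up rows, not the churn data) shows what the two arguments do: test_size fixes the proportion held out, and random_state makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 50 rows, 2 features, alternating binary labels
X_toy = np.arange(100).reshape(50, 2)
y_toy = np.arange(50) % 2

# 20% of the rows are held out for testing
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
print(X_tr.shape, X_te.shape)

# The same random_state reproduces exactly the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
print(np.array_equal(X_te, X_te2))
```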

### Feature Scaling

Feature scaling is essential, as discussed in the **Features** lecture, and needs to be applied to both the training and test sets.

That is simply because some variables have values in the thousands while others have values in the tens or ones. It is very important to ensure that none of our variables dominates the others.

It is computed using the ScikitLearn library [StandardScaler()](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html), which is fitted on the training set only and then applied to both the training and test sets.

In [None]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

In [None]:
X_train = sc.fit_transform(X_train)

In [None]:
X_test = sc.transform(X_test)
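
What fit_transform and transform compute can be sketched by hand with NumPy. The numbers below are made up; the point is that the mean and standard deviation come from the training rows only and are then reused for the test rows:

```python
import numpy as np

# Made-up values for two features on very different scales
train = np.array([[600.0, 40.0],
                  [700.0, 50.0],
                  [800.0, 60.0]])
test = np.array([[650.0, 45.0]])

# "fit": per-column statistics of the training set only
mean = train.mean(axis=0)
std = train.std(axis=0)   # population standard deviation (ddof=0), as StandardScaler uses

# "transform": the same statistics are applied to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled)
print(test_scaled)
```

Reusing the training statistics keeps information about the test set out of the pre-processing.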

## Compiling the ANN

### Import the keras libraries

<a href="https://keras.io"><img src="https://s3.amazonaws.com/keras.io/img/keras-logo-2018-large-1200.png" width="400" align="left"/></a>

- Import the sequential model from the Keras API to initialize our ANN;
- Import the Dense layer template from the Keras API to add hidden layers;
- Create an instance of the sequential model called classifier since our job is in the classification domain.

The Dense layer is a layer in which all inputs are connected to all outputs!


In [None]:
# Importing the Keras libraries and packages
import keras
from keras.models import Sequential
from keras.layers import Dense


In [None]:
# Initialising the ANN
classifier = Sequential()

### Add First Hidden Layer

The first Dense layer added to our classifier:

- consists of 6 units (neurons), thus generating 6 outputs;
- has a uniform kernel initialization (weight matrix);
- applies a ReLU activation function on the output of each unit;
- takes 11 inputs: the 10 original features, with Geography replaced by its 2 dummy columns.


In [None]:
# Adding the input layer and the first hidden layer
classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11))

### Add Second Hidden Layer


The second Dense layer added to our classifier:

- consists of 6 units (neurons), thus generating 6 outputs;
- has a uniform kernel initialization (weight matrix);
- applies a ReLU activation function on the output of each unit;
- takes as input the outputs of the previous layer; 


In [None]:
# Adding the second hidden layer
classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))

### Add Output Layer

The output Dense layer added to our classifier:

- consists of 1 unit (neuron), thus generating a single output: the predicted probability that the customer exits;
- has a uniform kernel initialization (weight matrix);
- applies a Sigmoid activation function on the output of the single unit;
- takes as input the outputs of the previous layer; 

If the number of categories in the output layer is more than 2 we then need to use the SoftMax activation function.


In [None]:
# Adding the output layer
classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))
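
To make the three layers concrete, here is a NumPy sketch of the forward pass through an 11-6-6-1 network. It is an illustration only: the weights are random (the ±0.05 range is an assumption about what the 'uniform' initializer draws), the biases start at zero, and no training takes place.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

layer_sizes = [11, 6, 6, 1]   # inputs, two hidden layers, one output unit

# One weight matrix and one bias vector per Dense layer (illustrative values)
weights = [rng.uniform(-0.05, 0.05, size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def forward(x):
    a = x
    # hidden layers: affine map followed by ReLU
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(a @ W + b)
    # output layer: affine map followed by sigmoid, giving a probability
    return sigmoid(a @ weights[-1] + biases[-1])

batch = rng.normal(size=(4, 11))   # four made-up, already scaled customers
probs = forward(batch)
n_params = sum(W.size + b.size for W, b in zip(weights, biases))

print(probs.shape)   # one probability per customer
print(n_params)      # (11*6+6) + (6*6+6) + (6*1+1)
```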

### Compile the ANN

In the model compilation we customize the:
    
- [Optimizer](https://keras.io/optimizers/): the algorithm used to find the optimal set of weights. Adam is an adaptive variant of Stochastic Gradient Descent (SGD).
- [Loss function](https://keras.io/losses/#available-loss-functions): the quantity that the optimizer minimizes. With binary outputs we use a logarithmic loss function called binary_crossentropy. If the dependent variable took more than 2 values, we would use categorical_crossentropy.
- [Metric](https://keras.io/metrics/): the metric used to evaluate the model; it is reported during training but does not drive the weight updates. We use accuracy.

In [None]:
#compile the ANN
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
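
As an illustration of the loss named above, the binary cross-entropy can be written in a few lines of NumPy. This is a sketch of the formula, not the Keras implementation; the clipping constant eps is an assumption added to avoid log(0):

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    # clip the probabilities so that the logarithm stays finite
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    losses = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return float(np.mean(losses))

y_true = np.array([1.0, 0.0, 1.0, 0.0])
confident = np.array([0.9, 0.1, 0.8, 0.2])   # mostly correct, confident predictions
uncertain = np.array([0.5, 0.5, 0.5, 0.5])   # no information at all

loss_confident = binary_crossentropy(y_true, confident)
loss_uncertain = binary_crossentropy(y_true, uncertain)

print(loss_confident)
print(loss_uncertain)   # equals ln(2)
```

Confident and correct predictions give a lower loss than predictions of 0.5 for every customer.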

## Deploying the ANN

### Fit the ANN to the training set

We can now train our ANN using the data in our training set X_train and our class labels (dependent variable) in y_train. Parameters that can be specified are the:

- Batch size: the number of observations fed into the model before each update of the weight matrix.
- Number of epochs: the number of passes over the whole training set.

[more here](https://keras.io/models/model/#fit)


In [None]:
# Fitting the ANN to the Training set
classifier.fit(X_train, y_train, batch_size = 10, epochs = 100)
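
As a rough check on what these two parameters imply, the sketch below counts the weight updates, assuming the 80/20 split of the 10000 records used earlier:

```python
import math

n_records = 10000
n_train = n_records * 80 // 100   # 80% of the records are used for training
batch_size = 10
epochs = 100

# one weight update per batch; the last batch of an epoch may be smaller
updates_per_epoch = math.ceil(n_train / batch_size)
total_updates = updates_per_epoch * epochs

print(updates_per_epoch)
print(total_updates)
```

A smaller batch size means more updates per epoch and usually a longer training time.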

### Predicting the Test set results

Objective: using the ANN trained on our training set, let's see how well it performs on our test set, for which we have ground truth, i.e. we know the results.

For each probability returned we generate a categorical outcome (true/false) by thresholding it at a value of 50%.


In [None]:
# Predicting the Test set results
y_pred = classifier.predict(X_test)

In [None]:
# threshold the probabilities into True > 0.5 or False
y_pred = (y_pred > 0.5) 

In [None]:
y_pred[0]

### Evaluating the model

A [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix) is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known. Use the ScikitLearn library **confusion_matrix()** function to compute and display it.

<img src="https://miro.medium.com/max/712/1*Z54JgbS4DUwWSknhDCvNTQ.png" align="left"/>

In [None]:
# computing the Confusion Matrix
from sklearn.metrics import confusion_matrix

In [None]:
cm = confusion_matrix(y_test, y_pred)

In [None]:
cm

In [None]:
# total number of test samples: the sum of all confusion-matrix entries
s2 = cm.sum()

In [None]:
accuracy = (cm[0,0] + cm[1,1])/s2
accuracy
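
The same four counts yield other summary figures besides accuracy. The sketch below uses a made-up confusion matrix (the numbers are hypothetical, not results of this model) laid out in the ScikitLearn convention, with rows for the true class and columns for the predicted class:

```python
import numpy as np

# Hypothetical counts: rows = true class (0, 1), columns = predicted class (0, 1)
cm_toy = np.array([[1500, 100],
                   [ 200, 200]])

tn, fp, fn, tp = cm_toy.ravel()

acc_toy = (tn + tp) / cm_toy.sum()   # share of all predictions that are correct
precision_toy = tp / (tp + fp)       # share of predicted exits that are real exits
recall_toy = tp / (tp + fn)          # share of real exits that were detected

print(acc_toy, precision_toy, recall_toy)
```

With imbalanced classes, accuracy alone can look good while recall on the minority class stays low, as in this made-up example.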

## Testing individual cases

In this lecture we will learn how to predict the behaviour of a new data sample outside our training and test data sets. 

A new observation (data entry) is given. Given the model we trained, can we predict if this new customer is likely to stay or to go?

<img src="https://catalystforbusiness.com/wp-content/uploads/2017/12/customer-care.jpg" align="left" width="400"/>

New customer data

| Geography | Credit Score | Gender | Age | Tenure | Balance | Number of Products | Has Credit Card | Is Active Member | Estimated Salary | 
|---|---|---|---|---|---|---|---|---|---|
| France | 600 | Male | 40 | 3 | 60000 | 2 | Yes | Yes | 50000 |


### Predicting new observations

The new data need to be placed in the same order/format as in the case of the training/test sets.

1. Create a new NumPy array and populate it accordingly.
2. Use sc.transform to transform the vector to the desired format.
3. Request a prediction and threshold it as before.


In [None]:
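# request a prediction from the ANN using the new data, formatted like the training data;
# column order: Germany, Spain, Credit Score, Gender, Age, Tenure, Balance, Number of Products, Has Credit Card, Is Active Member, Estimated Salary
# the first entry is written as a float so that the whole numpy array is of type float
new_prediction = classifier.predict(sc.transform(np.array([[0.0, 0, 600, 1, 40, 3, 60000, 2, 1, 1, 50000]]))) # predict() expects a 2D array: one row per customer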

In [None]:
new_prediction = (new_prediction > 0.5)
new_prediction
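
Typing the feature row by hand is error-prone. A small helper can build it from readable values; encode_customer and the two lookup tables below are hypothetical names introduced for illustration, and they assume the encoding used earlier (France as the dropped dummy column, Female = 0, Male = 1):

```python
import numpy as np

# Dummy columns (Germany, Spain) left after dropping the France column
GEOGRAPHY_DUMMIES = {'France': (0, 0), 'Germany': (1, 0), 'Spain': (0, 1)}
# Alphabetical label encoding of the Gender column
GENDER_CODES = {'Female': 0, 'Male': 1}

def encode_customer(geography, credit_score, gender, age, tenure, balance,
                    n_products, has_card, is_active, salary):
    """Return a 1 x 11 float array in the column order used for training."""
    germany, spain = GEOGRAPHY_DUMMIES[geography]
    return np.array([[germany, spain, credit_score, GENDER_CODES[gender], age,
                      tenure, balance, n_products, int(has_card),
                      int(is_active), salary]], dtype=float)

# The new customer from the table above
row = encode_customer('France', 600, 'Male', 40, 3, 60000, 2, True, True, 50000)
print(row)
```

The resulting row can then be passed through sc.transform and classifier.predict exactly as in the cell above.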

## Improving the model

In this lecture we will learn how to evaluate, improve and tune the ANN 

### Evaluate the ANN

You can use Sequential Keras models (single-input only) as part of your Scikit-Learn workflow via the wrappers found at keras.wrappers.scikit_learn.py.

There are [two wrappers available](https://keras.io/scikit-learn-api/). Consider the first: keras.wrappers.scikit_learn.KerasClassifier(build_fn=None, \**sk_params), which implements the Scikit-Learn classifier interface.

In [None]:
# Evaluating the ANN

# load the libraries
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from keras.models import Sequential
from keras.layers import Dense

We can use the Keras scikit_learn wrapper to compute some statistics about our ANN

1. Create the equivalent scikit_learn compatible classifier.
2. Parameterize it as before and run k-fold cross validation
3. Obtain the metrics

Define a function to configure your classifier as requested:

In [None]:
#define our classifier

def build_classifier():
    classifier = Sequential()
    classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11))
    classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))
    classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))
    classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
    return classifier


We need to wrap the Keras classifier so that the scikit_learn library can compute the k-fold cross validation. The latter produces one accuracy value per fold; we are interested in their mean.

If necessary, use dropout regularization to reduce overfitting:

- [Dropout Regularization in Deep Learning Models With Keras](https://machinelearningmastery.com/dropout-regularization-deep-learning-models-keras/)
- [Getting started with the Keras Sequential model](https://keras.io/getting-started/sequential-model-guide/)

Browser compatibility for parallel execution using Jupyter notebooks: 
- test on Safari **Failed**, 
- test on Chrome **Pass** (~16min).


In [None]:
# Run k-fold cross validation

# configure the classifier as needed; set the building function, the batch size and the number of epochs, as before
classifier = KerasClassifier(build_fn = build_classifier, batch_size = 10, epochs = 100)


In [None]:
# Run the k-fold cross validation; n_jobs = number of cpus, when set to -1 it means use all
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10, n_jobs = -1) 
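
To illustrate what cv = 10 means, the sketch below splits 8000 row indices into 10 folds with plain NumPy. It is a simplified picture: ScikitLearn's own splitter may, for instance, stratify the folds by class, which this sketch does not do.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, k = 8000, 10   # size of the training set and number of folds

# shuffle the row indices once, then cut them into k nearly equal folds
indices = rng.permutation(n_samples)
folds = np.array_split(indices, k)

split_sizes = []
for i, val_idx in enumerate(folds):
    # fold i is held out for validation; the other k-1 folds form the training part
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    split_sizes.append((len(train_idx), len(val_idx)))

print(split_sizes[0])
```

Each of the 10 runs trains a fresh model on 9 folds and scores it on the remaining one, so every row is used for validation exactly once.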

In [None]:
mean = accuracies.mean()
mean

In [None]:
# std() returns the standard deviation of the fold accuracies, not the variance
std_dev = accuracies.std()
std_dev

The mean cross-validated accuracy changes only insignificantly relative to the accuracy obtained earlier. This suggests that **no overfitting** occurs!

### Improving the ANN

If overfitting were to be observed, one way to counter it and make the model more general is by using dropout regularization. 

Dropout randomly switches off a fraction of the neurons of a layer at each training step. The dropout rate (named p in Keras 1 and rate from Keras 2 onwards) specifies the fraction of neurons to be switched off in each layer.

We do not need to run this since no overfitting is observed in our case.


In [None]:
# add this library
from keras.layers import Dropout

In [None]:
# re-initialising the ANN
classifier = Sequential()

In [None]:
# Adding the input layer and the first hidden layer
classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11)) 
# rate: fraction of units to drop (this argument was named p in Keras 1)
classifier.add(Dropout(rate = 0.1))

In [None]:
# Adding the second hidden layer
classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))
classifier.add(Dropout(rate = 0.1))


In [None]:
# Adding the output layer
classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))


In [None]:
# Compiling the ANN
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
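
What a dropout layer does to its input can be sketched in NumPy. This is the common "inverted dropout" formulation, given as an illustration under that assumption rather than as the Keras source code:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, rate, training=True):
    # at prediction time the layer passes its input through unchanged
    if not training or rate == 0.0:
        return activations
    # keep each unit with probability 1 - rate ...
    keep = rng.random(activations.shape) >= rate
    # ... and rescale the survivors so that the expected activation is unchanged
    return activations * keep / (1.0 - rate)

a = np.ones((1000, 6))          # made-up activations of a 6-unit layer
dropped = dropout(a, rate=0.1)

zero_fraction = float(np.mean(dropped == 0.0))
print(zero_fraction)            # close to the dropout rate
print(dropped.max())            # surviving units are scaled by 1 / (1 - rate)
```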

### Tuning the ANN

We can use the Keras scikit_learn wrapper to compute some statistics about our ANN:

1. Create the equivalent scikit_learn compatible classifier.
2. Parameterize it as before, add more options and run k-fold cross validation for each parameter set
3. Obtain global metrics and get the best settings/accuracy

In [None]:
# load the libraries; note the Grid Search Cross Validation lib
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense

In [None]:
#define our classifier
def build_classifier(optimizer):
    classifier = Sequential()
    classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11))
    classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))
    classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))
    classifier.compile(optimizer = optimizer, loss = 'binary_crossentropy', metrics = ['accuracy'])
    return classifier


Create the KerasClassifier without setting the batch size or the number of epochs; these are supplied by the grid search.

Create a separate dictionary of parameters, each with a number of different settings.

Run GridSearchCV using the classifier as estimator, the parameters dictionary, and by specifying the number of k-folds and the scoring metric.


In [None]:
# configure the classifier as needed; set the building function
classifier = KerasClassifier(build_fn = build_classifier)


In [None]:
# enter different options for the batch size, the number of epochs and the optimizer:
parameters = {'batch_size': [25, 32], 'epochs': [100, 500], 'optimizer': ['adam', 'rmsprop']}

In [None]:
# Customize the Grid Search CV
grid_search = GridSearchCV(estimator = classifier, param_grid = parameters, scoring = 'accuracy', cv = 10)
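
To get a feeling for the cost of this search, the combinations in the grid can be enumerated with the standard library. The sketch only counts model fits; it does not train anything:

```python
from itertools import product

param_grid = {'batch_size': [25, 32],
              'epochs': [100, 500],
              'optimizer': ['adam', 'rmsprop']}
cv_folds = 10

# every combination of one value per parameter
names = list(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*param_grid.values())]

n_fits = len(combinations) * cv_folds

for combo in combinations:
    print(combo)
print(n_fits)
```

Each combination is trained once per fold, and by default GridSearchCV refits the best combination once more on the full training set, so expect a long run time.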

Fitting the grid search to the training data reveals which parameter setting scores the highest accuracy.

Printing out the best parameters we observe the following:


In [None]:
# fit the grid_search model to our training data
grid_search = grid_search.fit(X_train, y_train)

In [None]:
# obtain the best parameters and best accuracy
best_parameters = grid_search.best_params_
best_parameters

In [None]:
best_accuracy = grid_search.best_score_
best_accuracy

<img src="https://drive.google.com/uc?id=1ssIjY7LC98PSTGfU9RlWpig-5pEjpD-r" align="left" width="400"/>