# Artificial Neural Network
In this notebook we will learn to build an Artificial Neural Network, based on the training data generated by the notebook number 2. But...


What is an Artificial Neural Network? 

#### The Perceptron

Before giving the definition of a neural network, we need to take a step back and establish what is meant by an Artificial Neuron (or **Perceptron**). 

An Artificial Neuron is a function that maps an input vector $\{x_1, ..., x_k\}$ to a scalar output $y$ via a weight
vector $\{w_1, ..., w_k\}$ and a function $f$ (typically non-linear).
<img src="images/neuron.png">
$$y = f\left(\sum_{i=1}^{k} w_ix_i\right) = f(w^Tx)$$
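
As a minimal sketch of the definition above, the neuron can be written as a plain Python function. The names `neuron` and `step`, and the input and weight values, are assumptions made for illustration; they are not part of the notebook's pipeline.

```python
def neuron(x, w, f):
    """Return f(w^T x) for an input list x and a weight list w."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    weighted_sum = sum(w_i * x_i for w_i, x_i in zip(w, x))
    return f(weighted_sum)


def step(z):
    """Illustrative threshold activation: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0


# Hypothetical inputs and weights, chosen only for illustration
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 0.5]
y = neuron(x, w, step)  # weighted sum is 0.5 - 2.0 + 1.5 = 0.0
print(y)
```

Passing `f` as an argument keeps the weighted sum separate from the activation, so the same function works with any activation.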




#### The Activation Function
The function $f$ is called the activation function and generates a non-linear input/output relationship. A common choice for the activation function is the **Logistic function** (or **Sigmoid**).
<img src="images/activation.png" height="500" width="500">
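
A short sketch of the logistic function and its derivative, using only the standard library. The function names are chosen here for illustration, and very large negative inputs would overflow `math.exp` in this simple form.

```python
import math


def sigmoid(z):
    """Logistic function: maps a real z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    """Derivative of the logistic function: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


# The output saturates towards 0 and 1, and the slope is largest at z = 0
for z in (-5.0, 0.0, 5.0):
    print(z, round(sigmoid(z), 4), round(sigmoid_derivative(z), 4))
```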

#### Find the weights: an optimization problem

We want to find the weights $\{w_1, ..., w_k\}$ such that the **objective function** (or **loss**) is minimized. The objective function  measures the difference between the actual output $t$ and the predicted output $y$.

To find the weights, we will use **Gradient Descent**:
- It is an iterative optimization algorithm used in machine learning to find a minimum of the objective function.
- At each step, compute the partial derivative of the objective function $J(w)$ with respect to each element $w_i$ of the vector $\{w_1, ..., w_k\}$.
<img src="images/gradient.png" height="500" width="500">

- Then update each weight using the gradient descent update equation, where $\eta$ is the learning rate:

$$w^{new}_i = w^{old}_i - \eta \frac{\partial J(w)}{\partial w_i}$$
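
The update equation above can be sketched on a one-dimensional toy objective. The objective $J(w) = (w - 3)^2$, the starting point and the three values of $\eta$ are assumptions picked for illustration only.

```python
def gradient_descent(grad, w0, eta, steps):
    """Apply w <- w - eta * grad(w) for a fixed number of steps."""
    w = w0
    for _ in range(steps):
        w = w - eta * grad(w)
    return w


# Illustrative objective J(w) = (w - 3)^2, whose minimum is at w = 3
def grad_J(w):
    return 2.0 * (w - 3.0)


w_small = gradient_descent(grad_J, w0=0.0, eta=0.01, steps=50)
w_good = gradient_descent(grad_J, w0=0.0, eta=0.1, steps=50)
w_large = gradient_descent(grad_J, w0=0.0, eta=1.1, steps=50)
print(w_small, w_good, w_large)
```

With the same number of steps, the smallest $\eta$ is still far from the minimum, the middle one lands close to it, and the largest one moves further away at every step.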

####  The Hyperparameters of an Artificial Neural Network

Hyperparameters are the parameters which determine the **network structure** (e.g. Number of Hidden Units) and the parameters which determine **how the network is trained** (e.g. Learning Rate).

1. **Learning Rate $\eta$**
    
    - Training parameter that controls the size of weight changes in the learning phase of the training algorithm.
    - The learning rate determines how much an updating step influences the current value of the weights.

*If $\eta$ is too small, many updates are required before reaching the minimum*. 

*If $\eta$ is too large, drastic updates can lead to divergent behavior and miss the minimum*.

<img src="images/small.png" height="250" width="200">
<img src="images/big.png" height="250" width="250"> 

2. **Momentum $\alpha$**
    - Momentum simply adds a fraction of the previous weight update to the current one.
    - When the gradient keeps pointing in the same direction, this will increase the size of the steps taken towards the minimum.
    
    <img src="images/momentum.png" height="250" width="250"> 
    $$\Delta w_i(t+1) = - \eta \frac{\partial J(w)}{\partial w_i} + \alpha \Delta w_i(t) $$

3. **Weight decay $\lambda$**
     - Weight decay $\lambda>0$ penalizes large weights.
     - By shrinking the weights toward zero at every update, it decreases their magnitude and helps prevent overfitting.
     $$\Delta w_i(t+1) = - \eta \frac{\partial J(w)}{\partial w_i} - \lambda \eta w_i(t) $$

4. **Number of epochs**
    - The number of epochs is the number of times the whole training data is shown to the network while training.
    
5. **Batch size**
   - The number of samples shown to the network before the gradient computation and the parameter update.
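
The hyperparameters listed above can be sketched together in a few lines. Combining the momentum and weight decay terms in a single update is an assumption made for compactness, and `sgd_step`, `batches` and all numeric values are hypothetical.

```python
def sgd_step(w, grad, prev_delta, eta, alpha=0.0, lam=0.0):
    """One update with momentum (alpha) and weight decay (lam).

    delta_i = -eta * grad_i + alpha * prev_delta_i - lam * eta * w_i
    """
    delta = [
        -eta * g_i + alpha * d_i - lam * eta * w_i
        for w_i, g_i, d_i in zip(w, grad, prev_delta)
    ]
    new_w = [w_i + d_i for w_i, d_i in zip(w, delta)]
    return new_w, delta


def batches(samples, batch_size):
    """Yield consecutive mini-batches; the last one may be smaller."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


# Hypothetical weights and gradient, for illustration only
w = [1.0, -2.0]
grad = [0.5, 0.5]
new_w, delta = sgd_step(w, grad, [0.0, 0.0], eta=0.1, alpha=0.9, lam=0.01)
print(new_w, delta)

# 10 samples with batch_size=4 give 3 parameter updates per epoch
updates_per_epoch = len(list(batches(list(range(10)), 4)))
print(updates_per_epoch)
```

One epoch is one full pass over `batches(...)`, so the number of updates in a training run is roughly the number of epochs times the number of batches per epoch.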


## 1. Load the knowledge base
First of all, you need to load the knowledge base, i.e. the training data contained in one of the files generated in the previous notebook. Set `m`, `N` and `num_of_matches` to select the right file.

To do this:

In [None]:
import pandas

# These parameters must be set to load the correct training set

m = 1
N = 10
num_of_matches = 10

path = 'output/train_set_m{}/num_of_matches_{}.txt'.format(m, num_of_matches)
# Pass the separator by keyword: recent pandas versions reject it as a positional argument
dataset = pandas.read_csv(path, sep=',', header=None)
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values



print("Dataset: " + path + '\n')
print(dataset)
print("\nx:")
print(X)
print("\ny:")
print(y)

#### Use Label Encoding and One-Hot Encoding!
A Label Encoder converts categorical (or text) labels into integers, which predictive models can handle. One-hot encoding then takes the label-encoded column and splits it into one column per class: each row gets a 1 in the column of its class and 0s everywhere else.

This ensures that each example has an expected probability of 1.0 for its actual class and 0.0 for all other classes when the `softmax` activation function is used. It can be achieved using the Keras `to_categorical()` function.

In [None]:
from sklearn.preprocessing import LabelEncoder
from keras.utils import to_categorical

encoder = LabelEncoder()
encoder.fit(y)
encoded_y = encoder.transform(y)
y_tc = to_categorical(encoded_y, 4)
print(y)
print("is converted into")
print(encoded_y)
print("\n one hot encoding")
print(y_tc)
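
To make the two encoding steps concrete, here is a library-free sketch of what they do. The helper names and the labels `"A"`, `"B"`, `"C"` are hypothetical and are not taken from the training file.

```python
def label_encode(labels):
    """Map each distinct label to an integer, in sorted order."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels], classes


def one_hot(encoded, num_classes):
    """Turn integer class ids into rows containing a single 1."""
    return [[1 if j == k else 0 for j in range(num_classes)] for k in encoded]


# Hypothetical labels, for illustration only
labels = ["B", "A", "B", "C"]
encoded, classes = label_encode(labels)
rows = one_hot(encoded, len(classes))
print(classes)
print(encoded)
print(rows)
```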

## 2.  Split your data!
Divide your data into a **training set** and a **test set**, because later we want to evaluate the classifier's performance on examples it has not seen during training.

To do so, invoke these simple commands:

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y_tc, test_size=0.3, random_state=4)

# print the shapes of the new X objects
print("\nTraining set dimensions (X_train):")
print(X_train.shape)
print("\nTest set dimensions (X_test):")
print(X_test.shape)

# print the shapes of the new y objects
print("\nTraining set dimensions (y_train):")
print(y_train.shape)
print("\nTest set dimensions (y_test):")
print(y_test.shape)



#### Scaling data
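Many machine learning algorithms require that features are on the same scale. Also, optimization algorithms such as gradient descent work best if our features are centered at mean zero with a standard deviation of one, i.e., each feature has the mean and variance of a standard normal distribution. The scaler is fitted on the training set only and then applied to both sets, so that no information from the test set leaks into training.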

In [None]:
from sklearn.preprocessing import StandardScaler


# Define the scaler
scaler = StandardScaler().fit(X_train)
# Scale the training set
X_train = scaler.transform(X_train)
# Scale the test set
X_test = scaler.transform(X_test)
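
The arithmetic behind this kind of standardization can be sketched for a single feature. The helper names and the numbers are illustrative, and this simple version does not handle a feature with zero standard deviation.

```python
import math


def fit_scaler(column):
    """Return the mean and population standard deviation of a feature."""
    mean = sum(column) / len(column)
    var = sum((v - mean) ** 2 for v in column) / len(column)
    return mean, math.sqrt(var)


def transform(column, mean, std):
    """Standardize values with statistics computed elsewhere."""
    return [(v - mean) / std for v in column]


# Hypothetical feature values: statistics come from the training part only
train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]
mean, std = fit_scaler(train)
scaled_train = transform(train, mean, std)
scaled_test = transform(test, mean, std)
print(scaled_train)
print(scaled_test)
```

Note that the test values are scaled with the training mean and standard deviation, so the scaled test set does not need to have mean zero itself.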

## 3.  Build the Neural Network Architecture

Now we are ready to build our classifier: the core data structure of Keras is a model, a way to organize layers. The simplest type of model is the **Sequential model**, a linear stack of layers.

In [None]:
from keras.models import Sequential

model = Sequential()

Stacking layers is as easy as `.add()`. Activations can either be used through an Activation layer, or through the `activation` argument supported by all forward layers.

##### Avoid overfitting with Dropout

Dropout is a regularization method that approximates training a large number of neural networks with different architectures in parallel.

During training, some number of layer outputs are randomly ignored or “dropped out.” This has the effect of making the layer look like, and be treated like, a layer with a different number of nodes and connectivity to the prior layer. In effect, each update to a layer during training is performed with a different “view” of the configured layer. 

Dropout has the effect of making the training process noisy, forcing nodes within a layer to probabilistically take on more or less responsibility for the inputs.
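
The mechanism can be sketched on a plain list of activations. This is an illustrative "inverted dropout" variant, in which surviving values are scaled up so that the expected sum stays the same; the function name and the values are assumptions.

```python
import random


def dropout(values, rate, rng):
    """Zero each value with probability `rate` and scale the
    survivors by 1 / (1 - rate)."""
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]


rng = random.Random(0)  # fixed seed so the run is repeatable
activations = [1.0] * 10
dropped = dropout(activations, rate=0.5, rng=rng)
print(dropped)
```

Each call draws a new random mask, which is why every training update sees a different "view" of the layer. At prediction time no values are dropped.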

In [None]:
from keras.layers import Activation, Dense, Dropout

model.add(Dense(45, input_shape=(X_train.shape[1],)))
model.add(Activation('elu'))
model.add(Dropout(0.1))
model.add(Dense(30))
model.add(Activation('elu'))
model.add(Dropout(0.1))
model.add(Dense(15))
model.add(Activation('elu'))
model.add(Dropout(0.1))
model.add(Dense(4))
model.add(Activation('softmax'))

Once your model looks good, configure its learning process with `.compile()`:

In [None]:
from keras.optimizers import Adam

# `learning_rate` replaces the older `lr` argument name in recent Keras versions
adm = Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, amsgrad=False)
model.compile(loss='categorical_crossentropy',
              optimizer=adm,
              metrics=['accuracy'])
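
To see what the `softmax` output layer and the `categorical_crossentropy` loss compute for a single sample, here is a standard-library sketch. The output scores are hypothetical, and the functions illustrate the formulas rather than the Keras implementation.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(target, probs, eps=1e-12):
    """Loss for one sample: -sum_k t_k * log(p_k)."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, probs))


probs = softmax([2.0, 1.0, 0.1, -1.0])  # hypothetical output scores
target = [1, 0, 0, 0]                   # one-hot true class
loss = categorical_crossentropy(target, probs)
print(probs, loss)
```

Because the target is one-hot, the loss reduces to minus the logarithm of the probability assigned to the true class.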

You can now iterate on your training data in batches:

In [None]:
history = model.fit(X_train, y_train,
          epochs=200,
          batch_size=128, validation_split=0.2)

## 3.  Evaluate the Classifier
This phase is very important and allows us to evaluate the model based on some standard metrics, such as **accuracy**.

In [None]:
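import numpy as np
from sklearn.metrics import confusion_matrix


# Make predictions for the labels of the test set
# (predict() returns class probabilities; argmax picks the most likely class)
y_pred = np.argmax(model.predict(X_test), axis=1)
print("\nPredictions")
print(y_pred)

# Confusion matrix: rows are true classes, columns are predicted classes
print("\nConfusion matrix")
print(confusion_matrix(np.argmax(y_test, axis=1), y_pred))

# Evaluate model
score = model.evaluate(X_test, y_test, verbose=1)
# The score is a list that holds the loss and the accuracy
print("\nScore (loss,accuracy)")
print(score)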
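
As a sketch of how accuracy relates to the predicted probabilities, the following library-free example derives class predictions, a confusion matrix and the accuracy by hand. The probabilities and targets are made up for illustration.

```python
def argmax(row):
    """Index of the largest entry in a list."""
    return max(range(len(row)), key=row.__getitem__)


def confusion(true_ids, pred_ids, num_classes):
    """matrix[i][j] counts samples of true class i predicted as j."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_ids, pred_ids):
        matrix[t][p] += 1
    return matrix


# Hypothetical softmax outputs and one-hot targets for four samples
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.4, 0.3], [0.2, 0.2, 0.6]]
targets = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]

pred_ids = [argmax(p) for p in probs]
true_ids = [argmax(t) for t in targets]
matrix = confusion(true_ids, pred_ids, 3)
# Correct predictions lie on the diagonal of the matrix
accuracy = sum(matrix[i][i] for i in range(3)) / len(true_ids)
print(pred_ids, matrix, accuracy)
```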

## 4.  The Loss plot

This graph allows us to understand whether our model is overfitting. One way to limit overfitting is to stop training at the epoch where the validation loss begins to increase: a rising validation loss while the training loss keeps falling is a symptom of overfitting.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

plt.figure(figsize=(5.0, 5.0), dpi=120)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.plot(history.epoch, np.array(history.history['loss']),
           label='Train Loss')
plt.plot(history.epoch, np.array(history.history['val_loss']),
           label = 'Val loss')
plt.legend()
plt.ylim([0, 2])
plt.show()
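
The idea of choosing the number of epochs from the validation loss can be sketched as follows. The `patience` rule and the loss values are assumptions for illustration; Keras ships an `EarlyStopping` callback built around a similar idea.

```python
def best_epoch(val_loss, patience):
    """Return the epoch with the lowest validation loss seen before
    `patience` consecutive epochs pass without improvement."""
    best, best_idx, waited = float("inf"), -1, 0
    for epoch, loss in enumerate(val_loss):
        if loss < best:
            best, best_idx, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_idx


# Hypothetical validation losses: they fall, then rise again
val_loss = [1.20, 0.90, 0.70, 0.65, 0.66, 0.70, 0.75, 0.60]
stop_at = best_epoch(val_loss, patience=3)
print(stop_at)
```

A small `patience` stops early at the first sustained rise, while a larger one tolerates temporary increases and may find a later, lower minimum.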


## 5.  The Accuracy plot

This graph can provide an indication of useful things about the training of the model, such as:

   - Its speed of convergence over epochs (slope).
   - Whether the model may have already converged (plateau of the line).
   - Whether the model may be over-learning the training data (inflection for validation line).

In [None]:
import matplotlib.pyplot as plt
import numpy as np


plt.figure(figsize=(5.0, 5.0), dpi=120)
plt.xlabel('Epoch')
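plt.ylabel('Accuracy')
# The history key is 'accuracy' in recent Keras versions and 'acc' in older ones
acc_key = 'accuracy' if 'accuracy' in history.history else 'acc'
plt.plot(history.epoch, np.array(history.history[acc_key]),
           label='Train Acc')
plt.plot(history.epoch, np.array(history.history['val_' + acc_key]),
           label = 'Val Acc')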
plt.legend()
plt.ylim([0, 1])
plt.show()


## 6. Hyperparameters Tuning

Hyperparameter optimization is a big part of deep learning.

The reason is that neural networks are notoriously difficult to configure and there are a lot of parameters that need to be set. On top of that, individual models can be very slow to train.


In the next (optional) notebook you can retrace the steps performed in this notebook with a different approach: hyperparameter tuning using Grid Search.
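
As a rough preview of the idea, a grid search simply scores every combination of candidate values and keeps the best one. The grid, the helper names and the placeholder scoring function below are all hypothetical; a real scorer would train the network with the given parameters and return its validation accuracy.

```python
from itertools import product


def grid_search(grid, score_fn):
    """Score every combination in `grid` and return the best one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Placeholder scorer that peaks at learning_rate=0.01 and batch_size=64
def toy_score(params):
    lr_penalty = abs(params["learning_rate"] - 0.01)
    batch_penalty = abs(params["batch_size"] - 64) / 1000
    return -lr_penalty - batch_penalty


grid = {"learning_rate": [0.1, 0.01, 0.001], "batch_size": [32, 64, 128]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```

The cost grows with the product of the list lengths (here 3 x 3 = 9 trainings), which is why slow models make tuning expensive.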

## 7. Save your model

You have reached the last step of this notebook. If you are satisfied with the developed model, all you have to do is save it in the same folder that contains the training data. This is important because the model will be loaded in the next notebook. Good luck!

In [None]:
path_of_model = 'output/train_set_m{}/model_N_N_{}_{}.h5'.format(m, num_of_matches, num_of_matches)
model.save(path_of_model)

print('model saved in {}'.format(path_of_model))

In [None]:
from IPython.display import Image

Image(url="images/ultron.jpg")