## Multi-layer perceptron

I am not even going to try to write a better intro to neural nets than this:

https://jalammar.github.io/visual-interactive-guide-basics-neural-networks/

### Softmax Equation

Given an array of values $x$ of length $n$, the softmax of the value $x_i$ at position $i$ is:

$$\frac{e^{x_i}}{\sum_{j=1}^{n}e^{x_j}}$$
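As a hedged aside, exponentiating large values directly can overflow. A common workaround, sketched here with NumPy, is to subtract the maximum before exponentiating, which leaves the result unchanged. The function name `softmax` and the sample values are just for illustration.

```python
import numpy as np

def softmax(x):
    """Softmax of a 1-D array, shifted by max(x) for numerical stability."""
    x = np.asarray(x, dtype=float)
    shifted = x - np.max(x)            # same result as before, but exp() cannot overflow
    exp_shifted = np.exp(shifted)
    return exp_shifted / exp_shifted.sum()

probs = softmax([1.0, 3.0, 8.0, 4.0, 12.0])
print(probs.round(2))
print(probs.sum())
```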

### Deep Neural Network

When a network has multiple hidden layers (the layers between the input layer and the softmax output layer), it is called deep.

### Backpropagation

Neural nets are trained using a technique called backpropagation. At a very high level, you pass a training example through your network (forward pass), then measure its error, and then you go backwards through each layer to measure the contribution of each connection to the error (backwards pass). You then use this information to adjust the weights of your connections using gradient descent. 
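To make the forward pass / backward pass / update loop concrete, here is a minimal NumPy sketch. It is not how a framework implements backpropagation. It assumes a toy XOR data set, one hidden layer of 4 tanh units, a sigmoid output, and a learning rate of 0.5, all picked only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumption): XOR, 4 examples with 2 features each
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

# One hidden layer (4 tanh units) and one sigmoid output unit
W1 = rng.normal(scale=0.5, size=(2, 4))
b1 = np.zeros(4)
W2 = rng.normal(scale=0.5, size=(4, 1))
b2 = np.zeros(1)
learning_rate = 0.5  # illustrative value

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

losses = []
for step in range(2000):
    # Forward pass: compute predictions and measure the error
    h = np.tanh(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)
    losses.append(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

    # Backward pass: chain rule, from the output layer back to the input
    d_out = (p - y) / len(X)                    # gradient w.r.t. output pre-activation
    dW2 = h.T @ d_out
    db2 = d_out.sum(axis=0)
    d_hidden = (d_out @ W2.T) * (1 - h ** 2)    # tanh'(z) = 1 - tanh(z)^2
    dW1 = X.T @ d_hidden
    db1 = d_hidden.sum(axis=0)

    # Gradient descent: adjust each weight against its gradient
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1

print(losses[0], losses[-1])
```

The loss printed at the end should be lower than the one at the start, which is all this sketch is meant to show.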

### Vanishing/Exploding gradients

When your gradients get too small or too large, learning suffers. For example, a zero gradient will stop learning altogether, and when your gradients get too large your learning can diverge.
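A rough way to see why gradients vanish: during the backward pass, each layer multiplies the gradient by the derivative of its activation function. The sketch below uses the logistic function, whose derivative is at most 0.25. As a simplifying assumption it ignores the weights and takes the best case at every layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # largest possible value is 0.25, reached at z = 0

# Best case (assumption): every layer sits exactly at z = 0 and weights are ignored
for n_layers in (1, 5, 10, 20):
    scale = sigmoid_grad(0.0) ** n_layers
    print(n_layers, scale)
```

Even in this best case, the factor shrinks geometrically with depth.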

### Activation Functions

The article above does not talk much about activation functions. Typically, in an MLP, each neuron sums its weighted incoming connections and then applies an activation function. Historically, that activation function was the logistic function, which makes each neuron basically a logistic regression. The logistic function tends to suffer from the vanishing gradient problem.

Another very popular activation function now is relu. Relu(z) = max(0,z). This is very fast to compute and in practice works very well. This function suffers less from the vanishing gradient problem.

One problem with relu is that the connections can die. This happens if the inputs to a neuron end up negative resulting in a zero gradient. Thus, the **leaky relu** was invented: Leaky Relu(x) = max($\alpha$x, x) where $\alpha$ is usually a value of 0.01 or 0.02. The $\alpha$ value is the slope when x < 0 and ensures that the activation never truly dies, though it can become quite small.

**Elu** is another activation function that often performs better than the relu variants but is slower to compute than a leaky relu. Again, when x > 0 you just get x. But when x < 0 you get $\alpha$(exp(x) - 1). The function approaches -$\alpha$ when x is a large negative number. Usually, $\alpha$ is set to 1. With $\alpha$ = 1, this function is also smooth everywhere, including at zero.
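The three activation functions above can be sketched in a few lines of NumPy. The function names and the sample input are illustrative. The default `alpha` values follow the text (0.01 for leaky relu, 1 for elu).

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # alpha is the slope for z < 0
    return np.maximum(alpha * z, z)

def elu(z, alpha=1.0):
    # expm1(z) = exp(z) - 1; the minimum keeps exp from overflowing for large positive z
    return np.where(z > 0, z, alpha * np.expm1(np.minimum(z, 0.0)))

z = np.array([-5.0, -1.0, 0.0, 2.0])
print(relu(z))
print(leaky_relu(z))
print(elu(z))
```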

### Batch Normalization

As we have learned, it is important to scale (or normalize) your data before feeding it to a neural net. Another important normalization step happens right before each activation function: you normalize the layer's inputs again by subtracting the mean and dividing by the standard deviation. Since you are working with a batch, you use the batch mean and standard deviation. Each batch normalization layer also learns an appropriate scaling and shifting factor for its standardized values.

This technique has been shown to reduce the vanishing/exploding gradient problem, allow the use of larger learning rates, and make the network less sensitive to initialization. On the downside, it slows down prediction at runtime.
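Here is a minimal sketch of the training-time computation, assuming a 2-D array of pre-activations with one row per example. The names `gamma` (scale), `beta` (shift), and `eps` are the conventional ones but are assumptions here, and the input data is made up. At prediction time, frameworks typically use running averages of the mean and variance instead of batch statistics, which this sketch leaves out.

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-5):
    """Training-time batch normalization for a (batch_size, n_features) array."""
    batch_mean = z.mean(axis=0)
    batch_var = z.var(axis=0)
    z_hat = (z - batch_mean) / np.sqrt(batch_var + eps)   # standardize each feature
    return gamma * z_hat + beta                            # learned scale and shift

rng = np.random.default_rng(0)
z = rng.normal(loc=3.0, scale=10.0, size=(64, 4))   # made-up pre-activations
out = batch_norm_forward(z, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))
```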


### Cross-entropy

$$-\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}y_{k}^{i}\log(p_{k}^{i})$$

Where:

* $m$ - the number of data points
* $K$ - the number of classes
* $y_{k}^{i}$ - the true class value for row $i$, class $k$. Either zero or one, depending on whether $k$ is the correct class
* $p_{k}^{i}$ - the value predicted by your model for class $k$, row $i$. Usually from your softmax

This is the cost function you are trying to minimize.
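The formula translates almost directly into NumPy. In this sketch the labels are assumed to be one-hot encoded, the predictions are made up, and the small `eps` clip is an added safeguard against `log(0)` rather than part of the formula.

```python
import numpy as np

def cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean cross-entropy; y_true is one-hot (m, K), p_pred holds probabilities (m, K)."""
    p_pred = np.clip(p_pred, eps, 1.0)    # safeguard (assumption): avoid log(0)
    return -np.mean(np.sum(y_true * np.log(p_pred), axis=1))

y_true = np.array([[1, 0, 0], [0, 1, 0]])
confident_right = np.array([[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]])
confident_wrong = np.array([[0.05, 0.9, 0.05], [0.8, 0.1, 0.1]])
print(cross_entropy(y_true, confident_right))
print(cross_entropy(y_true, confident_wrong))
```

Confident, correct predictions give a small cost. Confident, wrong predictions are penalized heavily.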

### Important to Remember

* Scale data - usually zero to one
* Shuffle data

### Tuning Hyper-parameters

* Random search usually works better than grid search
* Start with reasonable, known architectures
* Number of hidden layers:
    * A deeper network is often valuable because it can learn a hierarchy of features. Deep networks usually converge faster and generalize better.
    * More complex problems often require deeper networks and more data
* Number of neurons:
    * Typically size the layers to form a funnel, with fewer and fewer neurons at each layer. This comes back to the hierarchy idea: you might need more neurons to learn the lower-level features.
    * You can also try using the same number of neurons in every layer, so there are fewer hyper-parameters to tune
* Usually more value in going deeper than wider
* Can try going deeper and wider than you think necessary and use regularization techniques, such as early stopping, to prevent overfitting.
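As a rough illustration of random search over the choices listed above, the sketch below draws random configurations with a funnel-shaped layer layout. Every name and range here (`sample_config`, the layer widths, the learning-rate and dropout ranges) is hypothetical and would need to be adapted to your problem. Each sampled configuration would then be trained and scored on validation data.

```python
import random

random.seed(0)

def sample_config():
    """Draw one hypothetical configuration; all ranges are illustrative only."""
    n_layers = random.randint(1, 5)
    first_layer = random.choice([64, 128, 256, 512])
    # Funnel shape: halve the width at each layer, but never go below 8 neurons
    layer_sizes = [max(first_layer // (2 ** i), 8) for i in range(n_layers)]
    return {
        "layer_sizes": layer_sizes,
        "learning_rate": 10 ** random.uniform(-4, -2),   # sampled on a log scale
        "dropout_rate": random.uniform(0.0, 0.5),
    }

configs = [sample_config() for _ in range(10)]
for config in configs:
    print(config)
```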

### Initialization

It turns out that with neural nets, how you initialize your weights can be quite important. Instead of naive random initialization, it is usually preferred to use either Xavier or He initialization.

P. 278 of Hands on Machine learning has a good description of these initializations.

If you are going to use Relu or Elu activation functions, I would recommend He, which is supported by Keras:

https://keras.io/initializers/#he_normal

For He normal, you initialize from a truncated normal distribution centered around 0 with a standard deviation of sqrt(2 / number of inputs). The sqrt(2 / (number of inputs + number of outputs)) form belongs to Xavier (Glorot) normal initialization.
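A small NumPy sketch of He-style initialization, assuming a standard deviation of sqrt(2 / number of inputs) as documented for Keras' `he_normal`. For simplicity it draws from a plain normal distribution rather than a truncated one, so it is an approximation of what Keras does, not a reimplementation. The function name and layer sizes are illustrative.

```python
import numpy as np

def he_normal(n_inputs, n_outputs, rng):
    """Zero-mean normal weights with std sqrt(2 / n_inputs).

    Simplification (assumption): untruncated normal, unlike Keras' he_normal.
    """
    std = np.sqrt(2.0 / n_inputs)
    return rng.normal(loc=0.0, scale=std, size=(n_inputs, n_outputs))

rng = np.random.default_rng(0)
W = he_normal(784, 128, rng)   # e.g. weights of a 784 -> 128 dense layer
print(W.std(), np.sqrt(2.0 / 784))
```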

### Transfer Learning

It turns out that the weights of a trained neural network can be reused by other networks with the same architecture. For example, imagine Google has trained a neural network on millions of images from Google Search to predict 100 categories. Now, you would like to take a few thousand photos from your own photo collection and train a neural network to detect whether or not you are in a photo (binary classification).

It turns out that you can start with Google's network and weights (assuming you can get them) and use them as a starting place for your network. Assuming you are okay with the rest of their architecture, you would just need to change the last layer to 2 nodes instead of 100 and learn those weights from scratch.

This is a really powerful idea and allows you to train much faster and with less data.

This is such a good idea that you are almost always better off starting with pre-trained weights if you can find them, even if the problem they were trained on is not that close to your problem. Obviously, the closer the problems the better.

Many deep learning frameworks have what are called model zoos where you can find pre-trained models. Keras' model zoo is here: https://keras.io/applications/

You can find more details on how to perform some of these techniques using Keras here: https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html

Lastly, another valuable option when you have little data is to pre-train your own network on related data. For example, if you want to classify whether you are in an image or not but only have a few images of yourself, you can first train your network on images of people in general and then fine-tune it with the images of yourself.

### Optimizers

We have previously discussed vanilla gradient descent, where you move in the direction opposite to the gradient in proportion to the learning rate. It turns out that there are faster techniques for finding the minimum (or a minimum) of your cost function. These faster techniques are very valuable with neural nets, which already take a long time to train.

We won't cover these in too much detail, but there is a good description here:

http://ruder.io/optimizing-gradient-descent/index.html#whichoptimizertochoose

And starting on p.295 of Hands on Machine Learning.

Generally, a good place to start can be the Adam optimizer.
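To give a feel for what Adam does differently from vanilla gradient descent, here is a hedged NumPy sketch of its update rule applied to a toy quadratic cost. The cost function, starting point, step count, and learning rate of 0.1 are assumptions chosen for illustration. The `beta1`, `beta2`, and `eps` defaults are the commonly used ones.

```python
import numpy as np

def adam_minimize(grad_fn, theta, learning_rate=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, n_steps=500):
    m = np.zeros_like(theta)   # running average of gradients
    v = np.zeros_like(theta)   # running average of squared gradients
    for t in range(1, n_steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)   # bias correction for the zero initialization
        v_hat = v / (1 - beta2 ** t)
        theta = theta - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Toy cost (assumption): f(theta) = (theta_0 - 3)^2 + 10 * (theta_1 + 1)^2, minimum at (3, -1)
def grad(theta):
    return np.array([2 * (theta[0] - 3), 20 * (theta[1] + 1)])

theta = adam_minimize(grad, np.array([0.0, 0.0]))
print(theta)
```

Dividing by the running average of squared gradients gives each parameter its own effective step size, which helps when the cost is much steeper in some directions than others.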

### Regularization

As we have discussed, neural nets can be quite prone to overfitting. Thus, we have some techniques to combat this:

* **Early Stopping:** Keep track of your validation error after every iteration and stop training when it stops going down. Usually, you would say something like: if the validation error has not decreased in 5 consecutive iterations, stop.
* **L1 and L2:** Just like logistic and linear regression, we can add a penalty term for large weights.
* **Dropout:** This is probably the most popular regularization method and is seen in many architectures. It is simple: at every iteration, every neuron has some probability of being turned off or inactive during that iteration (except the output neurons). This probability is usually referred to as the dropout rate and is a hyper-parameter you have to choose. This forces the network to become pretty robust: at any time it can lose a neuron, so it can't learn to become too dependent on a small set of neurons, including the inputs. This isn't too different from a random forest, where each decision tree sees slightly different samples and features. With dropout, every iteration trains a slightly different neural net that sees different features (or neurons).
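The early stopping rule can be sketched in plain Python. The function name, the `patience` parameter name, and the validation errors below are all made up for illustration. In practice the errors would come from evaluating your model during training.

```python
def early_stopping_epoch(validation_errors, patience=5):
    """Return the index at which training would stop.

    Stops once the validation error has not improved for `patience`
    consecutive checks; returns the last index if that never happens.
    """
    best_error = float("inf")
    checks_without_improvement = 0
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_error = error
            checks_without_improvement = 0
        else:
            checks_without_improvement += 1
            if checks_without_improvement >= patience:
                return epoch
    return len(validation_errors) - 1

# Made-up validation errors: improve for a while, then drift upward
errors = [0.50, 0.40, 0.35, 0.33, 0.34, 0.35, 0.36, 0.36, 0.37, 0.38, 0.39]
stop_epoch = early_stopping_epoch(errors, patience=5)
print(stop_epoch)
```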

Keras has a dropout layer: https://keras.io/layers/core/#dropout
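For intuition, here is a NumPy sketch of what a dropout layer does during training. It uses the "inverted dropout" variant, which rescales the surviving activations so their expected value is unchanged. That variant is an assumption about implementation detail, and the function name and inputs are illustrative.

```python
import numpy as np

def dropout_forward(activations, dropout_rate, rng, training=True):
    """Inverted dropout: zero out units at random and rescale the survivors."""
    if not training or dropout_rate == 0.0:
        return activations                      # no dropout at prediction time
    keep_prob = 1.0 - dropout_rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones((1000, 100))              # made-up layer outputs
dropped = dropout_forward(activations, dropout_rate=0.5, rng=rng)
print((dropped == 0).mean())   # fraction of units turned off
print(dropped.mean())          # mean activation stays close to the original
```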

### Data Augmentation

Neural nets - especially deep ones - love data. Sometimes you don't have a lot of data or would like more data, but getting additional samples can be costly. One way of dealing with this is by augmenting your current data via transformations.

This idea is quite popular in computer vision. Say your task is to predict whether or not a dog is in an image and you have 5,000 images. To get more images you can randomly transform the 5,000 images you have. For example, you can change the rotation, brightness, size, etc. This then creates additional data while still not changing the label (the picture still contains a dog or not).

These augmentations usually lead to your net being more robust to the transformations you applied and less prone to over-fit.

Keras supports image data augmentation: https://keras.io/preprocessing/image/
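To show the idea without any framework, here is a minimal NumPy sketch that applies a random horizontal flip and a random brightness shift. The function name, the choice of transformations, and the shift range are assumptions for illustration, and the random array only stands in for a real image.

```python
import numpy as np

def augment(image, rng):
    """Return a randomly transformed copy of a (height, width) image scaled to [0, 1]."""
    augmented = image.copy()
    if rng.random() < 0.5:
        augmented = augmented[:, ::-1]                # horizontal flip
    augmented = augmented + rng.uniform(-0.1, 0.1)    # brightness shift (illustrative range)
    return np.clip(augmented, 0.0, 1.0)

rng = np.random.default_rng(0)
image = rng.random((28, 28))      # stand-in for a real image
label = 1                         # e.g. "contains a dog"
augmented_images = [augment(image, rng) for _ in range(5)]
augmented_labels = [label] * 5    # the transformations do not change the label
```

Only use transformations that keep the label valid. A horizontal flip is fine for a dog photo but could change the meaning of some handwritten characters.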

Even if your data are not images, though, you may be able to think of some creative ways of augmenting your data.

In [1]:
import numpy as np

values = np.array([1.0, 3.0, 8.0, 4.0, 12.0])
exp_values = np.exp(values)
softmax = exp_values / sum(exp_values)
print([round(x,2) for x in softmax])
print(sum(softmax))

[0.0, 0.0, 0.02, 0.0, 0.98]
1.0


## Example using Python

In [35]:
from __future__ import division  # __future__ imports must come before all other imports

from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout, BatchNormalization
from keras.utils import np_utils
from keras.datasets import mnist
from sklearn.metrics import confusion_matrix
import numpy as np

In [3]:
(x_train, y_train), (x_test, y_test) = mnist.load_data()

y_train = np_utils.to_categorical(y_train, 10)
y_test = np_utils.to_categorical(y_test, 10)

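def vectorize_image(images):
    # Scale pixel values from [0, 255] to [0, 1]
    scaled_images = images / 255
    # Flatten each 28x28 image into a 784-long vector (reshape the scaled copy, not the raw input)
    return scaled_images.reshape(scaled_images.shape[0], -1)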

x_train = vectorize_image(x_train)
x_test = vectorize_image(x_test)

Downloading data from https://s3.amazonaws.com/img-datasets/mnist.npz

In [51]:
model = Sequential([
    Dense(128, input_shape=(784,), activation='elu', kernel_initializer='he_normal'),
    BatchNormalization(),
    Dropout(0.5),
    Dense(64, activation='elu', kernel_initializer='he_normal'),
    BatchNormalization(),
    Dropout(0.5),
    Dense(10, activation='softmax')
])

In [52]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_17 (Dense)             (None, 128)               100480    
_________________________________________________________________
batch_normalization_3 (Batch (None, 128)               512       
_________________________________________________________________
dropout_5 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_18 (Dense)             (None, 64)                8256      
_________________________________________________________________
batch_normalization_4 (Batch (None, 64)                256       
_________________________________________________________________
dropout_6 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_19 (Dense)             (None, 10)                650       
Total para

In [53]:
model.compile(optimizer='adam',
              loss='categorical_crossentropy')

In [54]:
model.fit(x_train, y_train, epochs=20, batch_size=64, validation_split=0.1)

Train on 54000 samples, validate on 6000 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x126926908>

In [55]:
test_predictions = np.argmax(model.predict(x_test),1)
y_test_sparse = np.argmax(y_test, 1)

In [56]:
confusion_matrix(y_test_sparse, test_predictions)

array([[ 969,    0,    1,    1,    1,    2,    3,    1,    2,    0],
       [   0, 1126,    3,    1,    0,    0,    1,    0,    4,    0],
       [   1,    3, 1006,    5,    2,    0,    1,    9,    5,    0],
       [   1,    0,    6,  991,    0,    3,    0,    6,    3,    0],
       [   0,    0,    3,    0,  966,    0,    6,    1,    1,    5],
       [   2,    1,    0,   10,    1,  863,    5,    3,    5,    2],
       [   5,    3,    1,    0,    2,    6,  937,    0,    4,    0],
       [   1,    4,    9,    2,    0,    0,    0, 1008,    0,    4],
       [   5,    2,    2,    6,    5,    2,    3,    5,  942,    2],
       [   4,    6,    0,   10,   14,    1,    0,    8,    4,  962]])

In [57]:
np.sum(y_test_sparse == test_predictions) / test_predictions.shape

array([0.977])