# Deep Neural Networks (L2, part 2)

## 1. Neural Network Architecture
A neural network combines simple linear models (perceptrons) into more complicated non-linear models.
We can also weight the individual models, assigning higher relevance to the outputs of certain models or layers.
<img src="//aind-notes.s3-website-eu-west-1.amazonaws.com/img/nna.png" style="display: inline-block; height: 180px">
<img src="//aind-notes.s3-website-eu-west-1.amazonaws.com/img/nna1.png" style="display: inline-block; height: 180px">
This is similar to the simple models from earlier: a linear combination of the input values times the weights, plus a bias `(y = xW + b)`. Now the model is a linear combination of the previous models' outputs times the weights, plus a bias.

In essence, neural networks compose simpler linear models into more complex models.
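
As a minimal sketch of this idea, the snippet below combines two perceptron-style models into a third one. The weights, biases, the `sigmoid` activation, and the function names are illustrative assumptions, not values taken from the figures.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linear_model(x, weights, bias):
    """A single perceptron-style model: sigmoid(x . w + b)."""
    z = sum(xi * wi for xi, wi in zip(x, weights)) + bias
    return sigmoid(z)

def combined_model(x):
    # Two simple linear models over the same input (weights are made up).
    out_a = linear_model(x, weights=[5.0, -2.0], bias=-8.0)
    out_b = linear_model(x, weights=[7.0, -3.0], bias=1.0)
    # The next layer is again a linear combination, now of the previous
    # models' outputs, so each earlier model gets its own relevance weight.
    return linear_model([out_a, out_b], weights=[7.0, 5.0], bias=-6.0)

print(combined_model([0.4, 0.6]))
```

Because a sigmoid sits between the two layers, the combined decision boundary is no longer a straight line.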

## 2. Feedforward
Feedforward is the process neural networks use to turn the input into an output. Training a neural network means choosing the parameters (weights) on the network's edges so that it models the data well.

<img src="//aind-notes.s3-website-eu-west-1.amazonaws.com/img/ffn.png" style="display: inline-block; height: 200px">
<img src="//aind-notes.s3-website-eu-west-1.amazonaws.com/img/ffn1.png" style="display: inline-block; height: 200px">
    
    Here the bias is treated as a weight whose input is always 1
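
A feedforward pass can be sketched as follows. This assumes sigmoid activations and a hypothetical 2-2-1 network with made-up weights; the bias is folded into each weight row by appending a constant input of 1.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weight_rows):
    """One layer: append a constant 1 to the inputs so the last weight in
    each row acts as the bias, then apply the sigmoid."""
    extended = list(inputs) + [1.0]
    return [sigmoid(sum(w * v for w, v in zip(row, extended)))
            for row in weight_rows]

def feedforward(x, layers):
    """Push the input through every layer in order."""
    activation = x
    for weight_rows in layers:
        activation = layer(activation, weight_rows)
    return activation

# Hypothetical 2-2-1 network; the last entry of each row is the bias.
hidden = [[0.5, -0.4, 0.1],
          [0.3, 0.8, -0.2]]
output = [[1.2, -0.7, 0.05]]
y_hat = feedforward([1.0, 2.0], [hidden, output])
print(y_hat)
```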

## 3. Backpropagation
In a nutshell:
* Do a feedforward operation
* Compare the output of the model with the desired output
* Calculate the error
* Run the feedforward operation backwards (backpropagation) to spread the error to each of the weights and biases
* Update the weights to get a better model
* Repeat until the model is good (convergence)
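
The loop above can be sketched for the smallest possible case, a single sigmoid unit. The toy data (logical OR), the cross-entropy error, the learning rate, and the epoch count are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy data (an assumption for illustration): the logical OR function.
inputs = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
targets = [0.0, 1.0, 1.0, 1.0]

def predict(x, weights, bias):
    return sigmoid(weights[0] * x[0] + weights[1] * x[1] + bias)

def train(epochs=5000, learning_rate=0.5):
    weights, bias = [0.0, 0.0], 0.0
    n = len(inputs)
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, y in zip(inputs, targets):
            y_hat = predict(x, weights, bias)   # feedforward
            # Compare with the desired output. For cross-entropy error with
            # a sigmoid output, dE/dz simplifies to (y_hat - y).
            delta = y_hat - y
            # Spread the error to each weight and the bias (averaged).
            grad_w[0] += delta * x[0] / n
            grad_w[1] += delta * x[1] / n
            grad_b += delta / n
        # Update the parameters to get a slightly better model.
        weights[0] -= learning_rate * grad_w[0]
        weights[1] -= learning_rate * grad_w[1]
        bias -= learning_rate * grad_b
    return weights, bias

weights, bias = train()
print([round(predict(x, weights, bias), 3) for x in inputs])
```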

<img src="//aind-notes.s3-website-eu-west-1.amazonaws.com/img/backp.png" style="height: 300px">

### Backpropagation math
<img src="//aind-notes.s3-website-eu-west-1.amazonaws.com/img/backp1.png" style="display: inline-block; height: 200px">
<img src="//aind-notes.s3-website-eu-west-1.amazonaws.com/img/backp2.png" style="display: inline-block; height: 200px">

        The gradient of E is a vector (one partial derivative per weight)
#### Chain Rule
<img src="//aind-notes.s3-website-eu-west-1.amazonaws.com/img/chainrule.png" style="height: 300px">

#### Process
First we perform a feedforward operation. (Here the bias is interpreted as a weight for convenience, such that it can fit in a matrix).
<img src="//aind-notes.s3-website-eu-west-1.amazonaws.com/img/backprop.png" style="height: 300px">
We then compare the output of the model with the desired output and calculate the partial derivative of the error function with respect to each weight using the chain rule (only W_11 is shown).
<img src="//aind-notes.s3-website-eu-west-1.amazonaws.com/img/backprop1.png" style="display: inline-block; height: 200px">
<img src="//aind-notes.s3-website-eu-west-1.amazonaws.com/img/backprop2.png" style="display: inline-block; height: 200px">
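
To make the chain rule concrete, here is a small sketch for one first-layer weight. It assumes a 2-2-1 network with sigmoid activations, a cross-entropy error, and no biases (omitted for brevity); all numeric values are made up. The analytic derivative is checked against a finite-difference estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, w2):
    """Return the hidden activations and the network output."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    y_hat = sigmoid(sum(w * hj for w, hj in zip(w2, h)))
    return h, y_hat

def error(y, y_hat):
    """Cross-entropy error for a single example (an assumed choice)."""
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)

def grad_w11(x, y, w1, w2):
    """dE/dW_11 as a product of local derivatives (chain rule)."""
    h, y_hat = forward(x, w1, w2)
    dE_dz_out = y_hat - y            # error w.r.t. the output pre-activation
    dz_out_dh1 = w2[0]               # output pre-activation w.r.t. h_1
    dh1_dz1 = h[0] * (1 - h[0])      # sigmoid derivative at hidden unit 1
    dz1_dw11 = x[0]                  # hidden pre-activation w.r.t. W_11
    return dE_dz_out * dz_out_dh1 * dh1_dz1 * dz1_dw11

def numeric_grad_w11(x, y, w1, w2, eps=1e-6):
    """Central finite difference, used only to sanity-check the result."""
    def error_with(value):
        shifted = [row[:] for row in w1]
        shifted[0][0] = value
        return error(y, forward(x, shifted, w2)[1])
    return (error_with(w1[0][0] + eps) - error_with(w1[0][0] - eps)) / (2 * eps)

x, y = [0.5, -1.0], 1.0
w1 = [[0.1, -0.2], [0.4, 0.3]]   # w1[j][i]: weight from input i to hidden j
w2 = [0.7, -0.5]
print(grad_w11(x, y, w1, w2), numeric_grad_w11(x, y, w1, w2))
```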

## 4. Training Optimization

### Stochastic Gradient Descent (SGD)
We take small subsets of the data, run each through the neural network, and calculate the error function and its gradient on that subset.
We still want to use all the data, so we split it into separate batches and take one update step per batch.
The gradient calculated on a batch may be less accurate than the gradient for the entire data set, but it is much faster to compute and, in practice, accurate enough.
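
A minimal sketch of mini-batch SGD, assuming a one-variable linear model with mean squared error and noise-free toy data generated from `y = 2x + 1`. The batch size, learning rate, and epoch count are illustrative choices.

```python
import random

def make_batches(data, batch_size, rng):
    """Shuffle the data and cut it into consecutive batches."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    return [shuffled[i:i + batch_size]
            for i in range(0, len(shuffled), batch_size)]

def batch_gradient(batch, w, b):
    """Gradient of the mean squared error computed on one batch only."""
    n = len(batch)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in batch) / n
    return grad_w, grad_b

def sgd(data, epochs=300, batch_size=5, learning_rate=0.3, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        # Every epoch still touches all the data, one batch at a time.
        for batch in make_batches(data, batch_size, rng):
            grad_w, grad_b = batch_gradient(batch, w, b)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

data = [(i / 19, 2.0 * (i / 19) + 1.0) for i in range(20)]
w, b = sgd(data)
print(w, b)
```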

### Learning Rate
If the learning rate is large, the optimizer takes large steps; if it is small, the optimizer takes small steps. Usually, we want to take larger steps at the beginning and smaller steps at the end.

A general rule of thumb: if the model isn't working, try decreasing the learning rate.
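
One way to get larger steps early and smaller steps later is a decaying learning rate. The sketch below assumes an exponential decay schedule and the toy function `f(w) = (w - 3)^2`; both are illustrative choices, as are the specific rates. It also shows a constant rate that is too large.

```python
def decayed_rate(initial_rate, decay, step):
    """Exponential decay: the rate shrinks by a constant factor per step."""
    return initial_rate * decay ** step

def descend(rates, w=0.0):
    """Gradient descent on f(w) = (w - 3)^2, using one rate per step."""
    for rate in rates:
        w -= rate * 2.0 * (w - 3.0)
    return w

decaying = [decayed_rate(0.4, 0.95, t) for t in range(50)]
w_decay = descend(decaying)        # ends close to the minimum at w = 3
w_too_big = descend([1.1] * 50)    # overshoots further on every step
print(w_decay, w_too_big)
```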

### Split data into train, test and validation set
Helps alleviate overfitting and underfitting by separating the data into different sets used to train and to evaluate the model.


    Usually we first try to overfit the model and then apply different techniques to make it generalize better.
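
A minimal sketch of such a split. The 60/20/20 proportions, the function name `split_dataset`, and the fixed seed are assumptions for illustration.

```python
import random

def split_dataset(samples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle once, then cut into train, validation and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]   # whatever remains
    return train, val, test

train_set, val_set, test_set = split_dataset(range(100))
print(len(train_set), len(val_set), len(test_set))
```

The model is fitted on the training set, tuned against the validation set, and scored once on the test set at the end.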
    
### Early Stopping
<img src="img/earlystopping.png">
<img src="img/earlystopping1.png">

### Regularization (L1 and L2)
Regularization punishes large weights (coefficients):

<img src="img/regularization.png" style="display: inline-block; height: 200px">
<img src="img/regularization1.png" style="display: inline-block; height: 200px">
Both L1 and L2 are popular; which one to choose depends on the problem:

**L1**:
* Tends to produce sparse vectors such as `(1, 0, 0, 1, 0)`: small weights tend to go to 0. Useful if we want to reduce the number of weights and end up with a small set.
* Good for feature selection

**L2**:
* Tends not to favor sparse vectors, producing e.g. `(0.5, 0.3, -0.2, 0.4, 0.1)`, since it tries to keep all weights homogeneously small.
* Normally better for training models
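
The difference can be seen by comparing the two penalty terms on a pair of small weight vectors. The sketch assumes the usual definitions (sum of absolute values for L1, sum of squares for L2, each scaled by a coefficient `lam`); the vectors `(1, 0)` and `(0.5, 0.5)` are chosen for illustration.

```python
def l1_penalty(weights, lam=1.0):
    """L1: lam times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam=1.0):
    """L2: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)

sparse = [1.0, 0.0]
spread = [0.5, 0.5]

# L1 charges both vectors the same, so it has no reason to avoid sparsity.
print(l1_penalty(sparse), l1_penalty(spread))
# L2 charges the sparse vector more, so it prefers evenly small weights.
print(l2_penalty(sparse), l2_penalty(spread))
```

In training, the chosen penalty is added to the error function, so the optimizer trades prediction error against weight size.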

### Dropout
Sometimes while training a neural network, some parts of the network end up dominating with very large weights, whereas others don't. To counter this, we randomly turn some nodes of the network off during each epoch and turn them back on afterwards.
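
A minimal sketch of dropout applied to one layer's activations. It assumes the "inverted dropout" variant, where surviving activations are scaled by `1 / keep_prob` so their expected value stays the same; the function name and parameters are hypothetical.

```python
import random

def dropout(activations, keep_prob, rng, training=True):
    """Randomly zero activations during training; pass through otherwise."""
    if not training:
        return list(activations)
    out = []
    for a in activations:
        if rng.random() < keep_prob:
            out.append(a / keep_prob)   # kept node, rescaled (assumption)
        else:
            out.append(0.0)             # node switched off for this pass
    return out

rng = random.Random(0)
hidden = [1.0, 2.0, 3.0, 4.0]
print(dropout(hidden, keep_prob=0.5, rng=rng))
print(dropout(hidden, keep_prob=0.5, rng=rng, training=False))
```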

### Vanishing Gradient (changing the Activation Function)
With the sigmoid function, the derivatives can get very small, making it harder to train our model.

To fix this, we can change the activation function:

<img src="//aind-notes.s3-website-eu-west-1.amazonaws.com/img/tanh.png" style="display: inline-block; height: 250px">
<img src="//aind-notes.s3-website-eu-west-1.amazonaws.com/img/relu.png" style="display: inline-block; height: 250px">

* tanh is very similar to sigmoid, but since its range is between -1 and 1, the gradients are larger.
* ReLU returns the value itself if it is positive and 0 if it is negative (the maximum of x and 0). The derivative is 1 for positive inputs, which can improve training time significantly without sacrificing much accuracy.
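
The derivatives of the three activation functions can be compared directly. This is a small sketch using the standard closed forms; the function names are chosen for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)              # never larger than 0.25

def tanh_prime(x):
    return 1.0 - math.tanh(x) ** 2    # reaches 1 at x = 0

def relu(x):
    return max(0.0, x)

def relu_prime(x):
    return 1.0 if x > 0 else 0.0      # constant 1 for positive inputs

for x in (0.0, 2.0, 5.0):
    print(x, sigmoid_prime(x), tanh_prime(x), relu_prime(x))
```

Backpropagation multiplies one such derivative per layer, so factors of at most 0.25 shrink the gradient quickly as the network gets deeper.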


### Local Minima
Techniques for avoiding getting stuck in a local minimum:
* Random Restarts
* Change the optimizer, e.g. momentum, Adam, Adagrad, etc.
<img src="//aind-notes.s3-website-eu-west-1.amazonaws.com/img/momentum.png">
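
A minimal sketch of a momentum update, assuming the common formulation in which a velocity term accumulates past gradients with a decay factor `beta`. The toy function `f(w) = (w - 2)^2`, the hyperparameters, and the function names are illustrative assumptions.

```python
def momentum_descent(grad, w, learning_rate=0.1, beta=0.9, steps=300):
    """Gradient descent where each step also carries a decayed sum of the
    previous steps, weighted by beta."""
    velocity = 0.0
    for _ in range(steps):
        velocity = beta * velocity + grad(w)   # older gradients fade by beta
        w -= learning_rate * velocity
    return w

def quadratic_grad(w):
    """Gradient of f(w) = (w - 2)^2."""
    return 2.0 * (w - 2.0)

w_final = momentum_descent(quadratic_grad, w=-3.0)
print(w_final)
```

Since the velocity remembers earlier steps, the update does not stop the instant the local gradient reaches zero, which is what can carry it over a small hump.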