## Neural network - Training & Optimization

There are many reasons why a network might not work as expected:
* Our architecture can be poorly chosen.
* Our data can be noisy.
* Our model could take years to run, and we need it to run faster.

### Testing
We separate the data into a *__training__* set and a *__testing__* set. We __train__ our model on the training set without looking at the testing set, and then we __evaluate__ the results on the testing set to see how we did.

![](images/training_testing_01.png)
![](images/training_testing_02.png)
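
The split described above can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the function name `train_test_split`, the 20% test fraction, and the fixed seed are all assumptions made for the example.

```python
import random

def train_test_split(points, test_fraction=0.2, seed=0):
    """Shuffle a copy of the data and split it into (training, testing) lists."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = list(points)            # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(10))                 # stand-in for 10 labeled points
train, test = train_test_split(data, test_fraction=0.2)
print(len(train), len(test))           # 8 2
```

The testing list is set aside and only used once training is finished.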


### Types of errors

![](images/training_types_of_errors_1.png)

*__Underfitting__*: Trying to solve a complex problem with a simple solution.

`Error due to bias`: The classifier is too simple.
![](images/training_types_of_errors_2.png)


*__Overfitting__*: Trying to solve a simple problem with an overly complex solution.

`Error due to variance`: The classifier is too specific.
It will fit the training data but it will fail to generalize.
![](images/training_types_of_errors_3.png)


#### In neural networks
The model in the middle will probably generalize better. It treats the outlying point as noise, while the model on the right gets confused by it and tries to fit it too well.

![](images/training_types_of_errors_4.png)

```
We will err on the side of an overly complicated model, and then we'll apply certain techniques to prevent overfitting.
```

### Early Stopping

For example, this model fits the training data really well, but it generalizes horribly.

![](images/training_types_of_errors_5.png)

![](images/training_types_of_errors_6.png)

#### Model Complexity graph

* Y-axis: Measure of the error.
* X-axis: Measure of the complexity of the model.

![](images/training_types_of_errors_7.png)

We do gradient descent until the testing error stops decreasing and starts to increase. At that moment we stop.
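
The stopping rule can be sketched as follows. This is only an illustration under assumptions: the helper `early_stopping_epoch`, its `patience` parameter, and the list of testing errors are hypothetical, and the error values are made up.

```python
def early_stopping_epoch(test_errors, patience=1):
    """Return the epoch with the lowest testing error, stopping as soon as
    the error has failed to improve for `patience` consecutive epochs."""
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(test_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error   # still improving
        elif epoch - best_epoch >= patience:
            break                                   # testing error went up: stop
    return best_epoch

# Made-up testing errors, one per epoch: they fall, then rise again.
test_errors = [0.90, 0.60, 0.45, 0.40, 0.43, 0.50, 0.62]
print(early_stopping_epoch(test_errors))  # 3
```

A larger `patience` tolerates a few noisy epochs before stopping.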

Solution 2 is a great prediction. It's super accurate, but this hints a bit at overfitting.

### Regularization

![](images/regularization_01.png)
When we multiply by 10 and take the sigmoid of $10x_1+10x_2$, our predictions are much better since they are closer to zero and one. But the function becomes much steeper and it's much harder to do gradient descent, since the derivatives are mostly close to zero and then very large when we get to the middle of the curve. Therefore, in order to do gradient descent properly, we want a model like the one on the left more than a model like the one on the right.

The first model is better even if it gives a larger error. When we apply sigmoid to small values, we get the function on the left, which has a nice slope for gradient descent.
![](images/regularization_02.png)
The model on the right is too *__CERTAIN__* and gives little room for applying gradient descent. The points that it classifies incorrectly will generate larger errors, and it will be hard to tune the model to correct them.
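
To make the comparison concrete, here is a small sketch that evaluates both models and their slopes. The sample points `(0, 0)` and `(1, 1)` and the helper names are assumptions chosen for illustration. They are not taken from the figures.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def prediction(x1, x2, scale):
    """Model output sigmoid(scale * x1 + scale * x2)."""
    return sigmoid(scale * x1 + scale * x2)

def slope_wrt_x1(x1, x2, scale):
    """Derivative of the prediction with respect to x1 (chain rule)."""
    p = prediction(x1, x2, scale)
    return scale * p * (1.0 - p)

for scale in (1, 10):
    for point in ((0.0, 0.0), (1.0, 1.0)):
        print(scale, point, prediction(*point, scale), slope_wrt_x1(*point, scale))
```

With `scale = 10` the prediction at `(1, 1)` is almost exactly 1, but the slope there is close to zero. In the middle of the curve, at `(0, 0)`, the slope is 2.5 instead of 0.25.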

To prevent this type of overfitting, we tweak the error function a bit. Basically, we want to punish large coefficients:
![](images/regularization_03.png)
We take the error function and add a term that is big when the weights are big.


__Two ways to do this:__

*__L1 regularization:__* Add the sum of the absolute values of the weights times a constant lambda ($\lambda$).

*__L2 regularization:__* Add the sum of the squares of the weights times the same constant.

Both terms are large if the weights are large. The lambda parameter tells how much we want to penalize the coefficients. If lambda is large, we penalize them a lot.

![](images/regularization_04.png)

When we apply *__L1__* we tend to end up with sparse vectors, meaning that small weights tend to go to zero.
If we want to reduce the number of weights and end up with a small set, we can use L1. It's good for problems with hundreds of features: L1 regularization helps us select which ones are important and turns the rest into zeros.

*__L2__*, on the other hand, tends not to favor sparse vectors, since it tries to keep all the weights homogeneously small. This one normally gives better results for training models.
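
A short sketch of the two penalty terms may help here. The function names, the value `lam = 1.0`, and the example weight vectors are assumptions chosen for illustration.

```python
def l1_penalty(weights, lam):
    """lambda times the sum of the absolute values of the weights."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """lambda times the sum of the squares of the weights."""
    return lam * sum(w * w for w in weights)

lam = 1.0                 # hypothetical regularization strength
sparse = [1.0, 0.0]       # one weight carries everything
spread = [0.5, 0.5]       # same total, spread evenly

print(l1_penalty(sparse, lam), l1_penalty(spread, lam))  # 1.0 1.0
print(l2_penalty(sparse, lam), l2_penalty(spread, lam))  # 1.0 0.5
```

L1 charges both vectors the same, so it has no objection to the sparse one. L2 charges the evenly spread vector less, which is why it pushes toward small, homogeneous weights.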


Conclusion:
* L1 regularization: produces vectors with sparse weights.
* L2 regularization: produces vectors with small, homogeneous weights.

### Dropout

Sometimes part of the network has very large weights and ends up dominating all the training.
![](images/dropout_0.png)


To solve this, we turn part of the network off while training as we go through the epochs. We give the algorithm a parameter: the probability that each node gets dropped at a particular epoch.

![](images/dropout_1.png)

A value of 0.2 means that each node gets turned off with a probability of 20 percent.
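
As a rough sketch of the idea, the function below zeroes each node's output independently with the given probability. The name `dropout` and the sample activations are assumptions made for the example. Library implementations usually also rescale the surviving activations and draw a new mask on every forward pass, and this sketch leaves both out.

```python
import random

def dropout(activations, drop_prob, rng):
    """Turn each node off (output 0.0) independently with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else a for a in activations]

rng = random.Random(0)                       # seeded only to make runs repeatable
layer_output = [0.5, 1.2, -0.7, 2.0, 0.3]    # made-up activations of one layer
print(dropout(layer_output, 0.2, rng))       # on average 1 in 5 nodes is zeroed
```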

### Local minima
A local minimum is a place we can reach with gradient descent that is not the minimum of the whole error function.

![](images/local_minima.png)


### Random Restart
We simply start from a few different random places and do gradient descent from all of them. This increases the probability that we'll get to the global minimum, or at least a pretty good local minimum.

![](images/random_restart.png)
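
Here is a minimal one-dimensional sketch of random restart. The error curve `f(x) = x^4 - 3x^2 + x` is an assumption chosen because it has one local and one global minimum. The learning rate, step count, and start range are likewise illustrative.

```python
import random

def f(x):
    """Toy error curve: local minimum near x = 1.13, global one near x = -1.30."""
    return x**4 - 3 * x**2 + x

def df(x):
    """Derivative of f."""
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x, learning_rate=0.01, steps=1000):
    for _ in range(steps):
        x -= learning_rate * df(x)   # step against the gradient
    return x

def random_restart(n_starts, rng):
    """Run gradient descent from several random starts and keep the best result."""
    starts = [rng.uniform(-2.0, 2.0) for _ in range(n_starts)]
    candidates = [gradient_descent(x0) for x0 in starts]
    return min(candidates, key=f)

print(gradient_descent(1.5))                  # gets stuck in the local minimum
print(random_restart(20, random.Random(0)))   # finds the global minimum
```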


### Vanishing Gradient
*__PROBLEM:__*
Looking at the sigmoid function, the curve gets very flat on the sides. If we calculate the derivative at a point far to the right or far to the left, this derivative is almost zero.
![](images/vanishing_gradient.png)

This is not good, since the derivative tells us in what direction to move. This is even worse in multi-layer perceptrons.

The derivative of the error function with respect to a weight is the product of all the derivatives calculated at the nodes in the corresponding path to the output.

![](images/vanishing_gradient_2.png)

All these derivatives are derivatives of a sigmoid function, so they are small, and the product of a bunch of small numbers is tiny.

![](images/vanishing_gradient_3.png)

This makes training difficult because gradient descent gives us very tiny changes to make to the weights, which means we take very tiny steps and may never reach the minimum.
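
The effect is easy to see numerically. The sketch below multiplies the sigmoid derivatives along one path through the network. It is a simplification: the helper `path_gradient` is hypothetical, and the weights that would also appear in the product are left out.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # never larger than 0.25

def path_gradient(pre_activations):
    """Product of the sigmoid derivatives at each node along one path."""
    product = 1.0
    for z in pre_activations:
        product *= sigmoid_derivative(z)
    return product

# Best case for sigmoid: every node sits at z = 0, where the slope is 0.25.
for depth in (1, 5, 10):
    print(depth, path_gradient([0.0] * depth))
```

Even in this best case, ten layers shrink the gradient to `0.25 ** 10`, which is below one millionth.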


One way to solve this is to change the activation function.

#### Hyperbolic tangent function
This is similar to sigmoid, but since the range is between -1 and 1, the derivatives are larger near zero (up to 1, compared with at most 0.25 for sigmoid).
![](images/hyperbolic_tangent_function.png)
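
A quick numeric comparison of the two derivatives, as a sketch with helper names chosen for this example:

```python
import math

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def tanh_derivative(z):
    return 1.0 - math.tanh(z) ** 2

for z in (0.0, 0.5, 1.0):
    print(z, sigmoid_derivative(z), tanh_derivative(z))
```

Around zero the tanh slope is several times larger than the sigmoid slope. Far from zero both curves flatten out, so tanh reduces the vanishing gradient problem without removing it.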


#### Rectified linear unit (ReLU)
Very simple. It says: if you are positive, I will return the same value; if you are negative, I will return 0.
In other words, it returns the maximum of x and zero: $\max(x, 0)$.

![](images/relu_function.png)

This one can improve training without affecting accuracy much, since the derivative is one if the number is positive.

With multiple ReLU units, note that the last unit is still a sigmoid function, since our final output still needs to be a probability between zero and one.
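
The following sketch shows ReLU, its derivative, and a tiny forward pass with ReLU hidden units and a sigmoid output. The network shape, the weights, and the function `forward` are made up for illustration, and biases are omitted.

```python
import math

def relu(x):
    return max(x, 0.0)

def relu_derivative(x):
    return 1.0 if x > 0 else 0.0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(inputs, hidden_weights, output_weights):
    """One hidden layer of ReLU units followed by a single sigmoid output."""
    hidden = [relu(sum(w * x for w, x in zip(row, inputs))) for row in hidden_weights]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))

# Made-up weights: two inputs, two hidden ReLU units, one output.
probability = forward([1.0, -2.0], [[1.0, 0.0], [0.0, 1.0]], [2.0, 3.0])
print(probability)  # a value between 0 and 1
```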

### Batch vs Stochastic Gradient Descent
In the gradient descent algorithm, we take a bunch of steps following the negative of the gradient of the height, which is the error function. Each step is called an epoch.

![](images/batch_stochastic_gradient_descent_1.png)

In an epoch we take our input, namely all of our data, and run it through the entire neural network. Then we find our predictions and calculate the error, namely how far they are from their actual labels. Finally, we back-propagate this error in order to update the weights in the neural network.

This will give us a better boundary for predicting our data. *__This is done for all the data.__*

![](images/batch_stochastic_gradient_descent_2.png)


The idea behind `Stochastic gradient descent` is simply that we take small subsets of the data, run them through the neural network, calculate the gradient of the error function based on those points, and then move one step in that direction. We still want to use all our data:
* We split the data into several batches (e.g., 24 points split into 4 batches of 6 points each).
* We take the points in the first batch and run them through the neural network, calculate the error and its gradient, and back-propagate to update the weights.
* This gives us new weights, which define a better boundary region.
* Now we take the points in the second batch and do the same thing.
* And then we do the same thing for the 3rd and 4th batches.
* This took us 4 steps, whereas normal gradient descent takes only one step with all the data.
* The 4 steps were less accurate, but in practice it's much better to take a bunch of slightly inaccurate steps than to take one good one.

![](images/batch_stochastic_gradient_descent_3.png)
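
The batching steps above can be sketched on a toy problem. Everything here is an assumption chosen for illustration: the one-weight model `y = w * x`, the 24 made-up points near `y = 3x`, the learning rate, and the helper names.

```python
import random

def make_batches(points, batch_size):
    """Split the data into consecutive batches of batch_size points."""
    return [points[i:i + batch_size] for i in range(0, len(points), batch_size)]

def gradient(w, batch):
    """Gradient of the mean squared error of the model y = w * x on one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def run_epoch(w, points, batch_size, learning_rate):
    """One pass over all the data, with one weight update per batch."""
    steps = 0
    for batch in make_batches(points, batch_size):
        w -= learning_rate * gradient(w, batch)
        steps += 1
    return w, steps

rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(24)]
points = [(x, 3 * x + rng.gauss(0, 0.1)) for x in xs]   # 24 points near y = 3x

w_batch, steps_batch = run_epoch(0.0, points, 24, 0.1)  # all data: 1 step
w_sgd, steps_sgd = run_epoch(0.0, points, 6, 0.1)       # 4 batches of 6: 4 steps
print(steps_batch, w_batch)
print(steps_sgd, w_sgd)
```

Both runs look at every point once, but the batched run makes four updates and ends up closer to the target weight of 3.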


### Learning Rate Decay

General rule for choosing the learning rate:
* If your learning rate is too big, you take huge steps. This could be fast at the beginning, but you may miss the minimum and keep going, which will make your model pretty chaotic.
* If you have a small learning rate, you will take steady steps and have a better chance of arriving at your local minimum.

![](images/learning_rate_1.png)

*__RULE__*: If your model is not working, decrease the learning rate.

![](images/learning_rate_2.png)

```
The best learning rates are those that decrease as the model gets closer to a solution.
```
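
One simple way to get a decreasing learning rate is an exponential schedule, sketched below. This is only one possible choice. The function name and the values `0.1` and `0.5` are assumptions made for the example.

```python
def decayed_learning_rate(initial_rate, decay_rate, epoch):
    """Exponential decay: shrink the rate by a constant factor every epoch."""
    return initial_rate * decay_rate ** epoch

for epoch in range(5):
    print(epoch, decayed_learning_rate(0.1, 0.5, epoch))
```

The steps start large and get smaller every epoch, so the model moves fast at first and more carefully as it approaches a solution.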

### Momentum


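Momentum helps us push through a local minimum: instead of letting only the current gradient define the step, we also take the previous steps into account. We can take the average of the previous steps. Even better, we can weight them so that the previous step matters a lot and the steps before that matter less and less.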


![](images/momentum_1.png)

*__Momentum__* is a constant beta ($\beta$) between 0 and 1 that attaches to the steps as follows:
* the previous step gets multiplied by 1,
* the one before by $\beta$, the one before that by $\beta^2$, the one before that by $\beta^3$, etc.

In this way, the steps that happened a long time ago matter less than the ones that happened recently. We can see that this gets us over the hump. Once we get to the global minimum, momentum will still push us away a bit, but not as much.
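
The weighting can be written out directly, as in the sketch below. The helper names and the example steps are assumptions. The second function is the running form commonly used in practice, shown here only to confirm that it gives the same weighted sum.

```python
def momentum_step(previous_steps, beta):
    """Weighted sum of past steps (oldest first in the list):
    the most recent step is weighted by 1, then beta, beta**2, and so on."""
    return sum(beta ** age * step
               for age, step in enumerate(reversed(previous_steps)))

def running_velocity(steps, beta):
    """Same quantity computed incrementally: v = beta * v + step."""
    v = 0.0
    for step in steps:
        v = beta * v + step
    return v

steps = [4.0, 2.0, 1.0]                  # made-up past steps, oldest first
print(momentum_step(steps, 0.5))         # 1*1.0 + 0.5*2.0 + 0.25*4.0 = 3.0
print(running_velocity(steps, 0.5))      # 3.0
```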