### Training Optimization 

Many things can go wrong when training a model, ranging from the dataset and the architecture to the other hyperparameters.

### Testing

How do we decide which model is better on a given dataset and problem?

![Why testing?](images/training_testing.png)

This is a classic, if simple, example in which the difference is made by the performance on the testing set.

**Train set** - the data we use to train the network.

**Test set** - data the network never sees during training, which we use as an indicator of how well the model is doing.
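A minimal sketch of how such a split could be made, assuming the dataset is just a Python list. The function name `train_test_split`, the 20% test fraction, and the seed are illustrative choices, not something taken from the course material.

```python
import random

def train_test_split(samples, test_fraction=0.2, seed=0):
    """Shuffle a copy of `samples` and split it into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # The first n_test shuffled items are held out; the rest are used for training.
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(10))  # stand-in for real samples
train_set, test_set = train_test_split(data)
print(len(train_set), len(test_set))
```

The important property is that the two sets do not overlap, so the test error is measured on data the model has never trained on.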

**Tip**: In ML, when choosing between a simple model that does well and a more complex model that does only a little better, the simpler model should be selected.

### Overfitting and underfitting

Two major concepts in training a network are illustrated in the image below. Essentially, based on what we know about the problem, we are trying to choose a model that is neither too simple nor too complex.

What is **underfitting**? As an analogy, it's like trying to kill Godzilla with a fly swatter: the model is too simple for the problem. On the other hand, **overfitting** is like trying to kill a fly with a bazooka: the model is too complex for the problem.

![Overfitting and underfitting](images/underfitting_overfitting.png)

#### Early Stopping

When training a model, it is essential to consider how long to train it: for how many epochs? 5, 10, 100, 500? The way I like to think about it is that at some point during training the test error reaches its minimum, and after that the model starts to memorize the training data. When that happens, it's time to stop.

The graph below, taken from the video, is a good representation of how we should think about overfitting. While gradient descent is slowly reaching the bottom, the training and test errors stay about the same until they reach the Goldilocks zone. After that they start to diverge: as epochs pass, the training error becomes really small, but the test error grows because the model fails to generalize and is basically memorizing the training data.

![Early stopping](images/early_stopping.png)

One of the techniques used is early stopping: if the test error hasn't decreased after a certain number of epochs, training should stop.
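The stopping rule can be sketched as follows. This is a rough illustration, assuming we already have a list of per-epoch errors on held-out data; the `patience` parameter (how many epochs without improvement we tolerate) and the error values are made up for the example.

```python
def early_stopping(errors, patience=3):
    """Return (stop_epoch, best_epoch) for a list of per-epoch held-out errors.

    Training stops once the error has failed to improve for `patience`
    consecutive epochs. Epochs are counted from 0.
    """
    best_error = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, error in enumerate(errors):
        if error < best_error:
            best_error = error
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Never ran out of patience: train to the end.
    return len(errors) - 1, best_epoch

# Hypothetical errors: they fall at first, then rise once overfitting begins.
errors = [0.9, 0.6, 0.4, 0.35, 0.36, 0.40, 0.45, 0.50]
stop_epoch, best_epoch = early_stopping(errors, patience=3)
print(stop_epoch, best_epoch)
```

In a real training loop one would typically also keep a copy of the weights from `best_epoch`, since that is the model we actually want.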

#### Regularization

It is generally considered that a simple model is better than a complex one if they're both doing well. Remember from the previous section that a complex model can be too specific to the training points. This can lead to overfitting in a subtle way. For example, when the weights are large the sigmoid becomes much steeper, which makes gradient descent hard: the derivatives are very close to 0 almost everywhere and then very large near the middle of the curve, so they vary a lot. Secondly, misclassified points are heavily penalized, which makes the model difficult to adjust. To solve this problem we are going to apply regularization techniques, which penalize large weights.

![Regularization](images/regularization.png)
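As a rough sketch of the two usual penalties: L1 adds the sum of the absolute values of the weights to the error, and L2 adds the sum of their squares, each scaled by a constant. The name `lam` for that constant and the example weights are assumptions made for illustration.

```python
def l1_penalty(weights, lam):
    """L1 regularization term: lam * sum of |w|."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """L2 regularization term: lam * sum of w^2."""
    return lam * sum(w * w for w in weights)

def regularized_error(error, weights, lam, penalty):
    """Original error plus the chosen penalty on the weights."""
    return error + penalty(weights, lam)

small_weights = [1.0, 1.0]
large_weights = [10.0, 10.0]
for name, penalty in (("L1", l1_penalty), ("L2", l2_penalty)):
    print(name, penalty(small_weights, 0.1), penalty(large_weights, 0.1))
```

With both penalties the model with large weights pays more, and L2 punishes large weights much harder than L1 because the weights are squared.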

Now, which of the two, L1 or L2, is better? It's actually more complicated; each has its own strengths, compared in the image below.

![Regularization L1 vs L2](images/l1_l2_regularization.png)

#### Dropout

Dropout is a common and very useful way to avoid overfitting. It involves turning off each node with a certain probability during every epoch. Why is it useful? Because not all nodes train and contribute equally, so we need to balance out the way our network learns. Since we do this many times, on average every node gets turned off about equally often, so it won't be the case that some are turned off more often than others.

![Dropout](images/dropout.png)
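A small sketch of the idea, assuming a layer is just a list of activations. The function name `dropout`, the probability 0.2, and the activation values are illustrative. Practical implementations usually also rescale the surviving activations, which is left out here to keep the example short.

```python
import random

def dropout(activations, drop_prob, rng):
    """Zero each activation independently with probability `drop_prob`."""
    return [0.0 if rng.random() < drop_prob else a for a in activations]

rng = random.Random(0)
activations = [0.5, 1.2, 0.8, 2.0]  # made-up outputs of four nodes
n_passes = 2000
times_dropped = [0] * len(activations)
for _ in range(n_passes):
    output = dropout(activations, drop_prob=0.2, rng=rng)
    for i, value in enumerate(output):
        if value == 0.0:
            times_dropped[i] += 1

# Fraction of passes in which each node was turned off.
drop_fractions = [count / n_passes for count in times_dropped]
print(drop_fractions)
```

Over many passes every node is dropped roughly the same fraction of the time, which is the "on average" argument from the paragraph above.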

### More on Gradient Descent

**Local Minima** - when gradient descent reaches a point with a low error that is not the overall minimum we are looking for. It basically gets stuck there and cannot advance any further.

One way to solve this is to use **random restarts**: start gradient descent from several random places and keep the best result.
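To make this concrete, here is a toy sketch on a one-dimensional function with two valleys. The function `f`, the learning rate, the number of steps, and the sampling range are all made up for the illustration.

```python
import random

def f(x):
    """Toy error function: local minimum near x = 1.13, lower minimum near x = -1.30."""
    return x**4 - 3 * x**2 + x

def grad_f(x):
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x, learning_rate=0.01, steps=1000):
    """Plain gradient descent from starting point x."""
    for _ in range(steps):
        x -= learning_rate * grad_f(x)
    return x

def random_restarts(n_restarts, rng):
    """Run gradient descent from several random starts and keep the lowest result."""
    candidates = [gradient_descent(rng.uniform(-2.0, 2.0)) for _ in range(n_restarts)]
    return min(candidates, key=f)

stuck_x = gradient_descent(1.5)                 # a single unlucky start
best_x = random_restarts(10, random.Random(0))  # several random starts
print(stuck_x, f(stuck_x))
print(best_x, f(best_x))
```

A single run that starts on the right-hand side stays in the shallower valley, while trying several starting points makes it likely that at least one of them lands in the deeper one.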

#### Vanishing gradient

A fundamental problem when performing gradient descent is that the derivative of the sigmoid is a small number. Backpropagation multiplies many such derivatives together, and the product is smaller still, so the steps taken by gradient descent become very small, which in turn may not allow us to reach the solution. The deeper the network, the worse the vanishing gradient problem becomes.

Solving this problem requires using other activation functions. Two popular options are the *Hyperbolic Tangent Function* ($tanh$) and the *Rectified Linear Unit* ($ReLU$).
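A quick numeric illustration of how fast this product shrinks. As a simplification, this sketch ignores the weights and only multiplies sigmoid derivatives taken at their largest possible value; the helper name `gradient_scale` is made up.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def gradient_scale(depth):
    """Best-case product of `depth` sigmoid derivatives (the derivative peaks at x = 0)."""
    return sigmoid_derivative(0.0) ** depth

for depth in (1, 5, 10, 20):
    print(depth, gradient_scale(depth))
```

Even in this best case the factor for 10 layers is already below one in a million, so the earliest layers barely move.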

$$tanh(x) = \frac {e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

$tanh$ is similar to the sigmoid, but its output range is larger, $(-1, 1)$ instead of $(0, 1)$, which leads to bigger derivatives.

$$relu(x) =\begin{cases}
    x & \text{if } x \geq 0\\
    0 & \text{if } x < 0
  \end{cases}$$
  
What $relu$ offers is a derivative of exactly 1 for positive inputs, so the product of derivatives shrinks much less than with the sigmoid and thus the GD steps will be bigger.
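The difference between the three activations can be checked numerically. This is only a sketch: it ignores the weights, evaluates every derivative at the same arbitrary input 0.5, and treats the ReLU derivative at exactly 0 as 0 by convention.

```python
import math

def sigmoid_derivative(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def tanh_derivative(x):
    return 1.0 - math.tanh(x) ** 2

def relu(x):
    return x if x >= 0 else 0.0

def relu_derivative(x):
    # Convention chosen here: slope 1 for positive inputs, 0 otherwise.
    return 1.0 if x > 0 else 0.0

depth = 10  # number of derivatives multiplied together
x = 0.5     # arbitrary positive input
products = {
    "sigmoid": sigmoid_derivative(x) ** depth,
    "tanh": tanh_derivative(x) ** depth,
    "relu": relu_derivative(x) ** depth,
}
for name, value in products.items():
    print(name, value)
```

The sigmoid product collapses towards zero, the $tanh$ product is noticeably larger, and the $relu$ product stays at 1 for positive inputs.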

### Batch vs Stochastic Gradient Descent

**Batch Gradient Descent (BGD)** - in each epoch we take all of our data in one go and run it through the network. The problem with it is that if we have a lot of data, every single step is expensive in memory and computation, so it takes a long time to train a NN.

With **Stochastic Gradient Descent (SGD)**, we only take subsets of the data (think of random samples) and feed them through the NN. As an example, if we split the data into small batches and take one step per batch, we can train the network a lot faster, even though each step will be a bit less accurate. In practice, SGD is the way to go due to the speed-accuracy tradeoff.
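Here is a toy sketch of training on small random batches. Everything in it is made up for the illustration: the model is a single weight `w` fitting the line y = 3x, and the batch size, learning rate, and number of epochs are arbitrary.

```python
import random

def minibatches(samples, batch_size, rng):
    """Shuffle the samples and yield them in consecutive batches."""
    shuffled = list(samples)
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]

# Toy data generated from y = 3x, so the ideal weight is 3.
data = [(0.1 * i, 3.0 * 0.1 * i) for i in range(1, 21)]

w = 0.0
learning_rate = 0.05
rng = random.Random(0)
for epoch in range(50):
    for batch in minibatches(data, batch_size=4, rng=rng):
        # Gradient of the mean squared error over this batch only.
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= learning_rate * grad

print(w)
```

Each epoch now takes five cheap steps instead of one expensive step over all 20 points. Setting `batch_size=len(data)` would turn the same loop back into batch gradient descent.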

### Learning rate decay

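**High learning rate** - fast in the beginning but might miss the minimum, making GD more chaotic.

**Low learning rate** - steady steps; slow, but with a better chance of reaching the minimum.

Decaying the learning rate means starting high and lowering it as training progresses.

**Rule of thumb**: if the model is not working, decrease the learning rate.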

![Learning rate decay](images/learning_rate_decay.png)
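One simple way to lower the learning rate over time is to multiply it by a constant factor every epoch. This exponential schedule is just one common choice, picked here as an assumption; the function name and the numbers are illustrative.

```python
def decayed_learning_rate(initial_rate, decay_factor, epoch):
    """Exponential decay: the rate is multiplied by `decay_factor` once per epoch."""
    return initial_rate * decay_factor ** epoch

for epoch in range(5):
    print(epoch, decayed_learning_rate(0.1, 0.5, epoch))
```

Training starts with big steps while the error is large and switches to small, steady steps as it gets close to the minimum.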

### Momentum

Momentum is the idea of taking a weighted average of the most recent previous steps, which gives GD enough push to power through local minima on the way to a better minimum.

![Momentum](images/momentum.png)
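A minimal sketch of the update, assuming the usual formulation in which each new step is the plain gradient step plus a constant `beta` (between 0 and 1) times the previous step. The names `momentum_step` and `beta` and the numeric values are assumptions made for the example.

```python
def momentum_step(previous_step, gradient, beta=0.9, learning_rate=0.1):
    """Return the new step: the plain GD step plus beta times the previous step."""
    return -learning_rate * gradient + beta * previous_step

# Feed in a constant gradient of 1.0 and watch the step size build up.
step = 0.0
steps = []
for gradient in [1.0, 1.0, 1.0]:
    step = momentum_step(step, gradient, beta=0.5)
    steps.append(step)

print(steps)
```

Unrolling the recursion shows that older steps are weighted by increasing powers of `beta`, so the most recent steps matter the most. When the gradient becomes tiny in a shallow local minimum, the accumulated step can still carry GD over the hump.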

