# Training Neural Networks

## 1. Hyperparameters

Weights and biases are the parameters we can update using gradients.

However, there are many other aspects of a neural network that we can change (<mark>**hyperparameters**</mark>):

- Batch size
- Number of layers
- Layer size
- Type of activation function
- Learning rate

$\implies$ Nested loop:

- Outer loop of optimization: hyperparameters (number of iterations)
- Inner loop of optimization: weights, bias

![Alt text](images/img24.png)

There are so many possible configurations. **How do we search for the best-performing models?**

<mark>**We look at configurations which include all hyperparameters**</mark>

There are 2 ways to tune hyperparameters:

- Grid search (enumerate all configurations) --> very expensive
- **Random search** (e.g. limit to 20 random models) --> more common
    - Define a search space by setting limits for each hyperparameter
    - Train randomly sampled models, and pick the one with the best validation performance

![Alt text](images/img25.png)
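
Random search can be sketched in a few lines of plain Python. This is only an illustration: the search space limits are made up, and `validation_score` is a hypothetical placeholder that stands in for training a model and measuring its validation performance.

```python
import random

def sample_config(rng):
    """Draw one hyperparameter configuration from a hypothetical search space."""
    return {
        "batch_size": rng.choice([16, 32, 64, 128]),
        "num_layers": rng.randint(1, 4),               # inclusive on both ends
        "layer_size": rng.choice([32, 64, 128, 256]),
        "activation": rng.choice(["relu", "tanh"]),
        "learning_rate": 10 ** rng.uniform(-4, -1),    # log-uniform in [1e-4, 1e-1]
    }

def validation_score(config):
    """Placeholder: a real version would train a network and evaluate it."""
    # Toy formula so the sketch runs; higher is better.
    lr_penalty = abs(config["learning_rate"] - 0.01) * 100
    depth_penalty = abs(config["num_layers"] - 2)
    return 1.0 / (1.0 + lr_penalty + depth_penalty)

def random_search(num_trials=20, seed=0):
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(num_trials):                        # outer loop: hyperparameters
        config = sample_config(rng)
        score = validation_score(config)               # inner loop (training) would run here
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

best_config, best_score = random_search()
print(best_config, round(best_score, 3))
```

Sampling the learning rate on a log scale is a common choice (an assumption here, not something the notes prescribe), since useful values span several orders of magnitude.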

***
## 2. Optimizers

An optimizer is an algorithm that, based on the value of the **loss function**, decides how each parameter (weight/bias) should change.

The optimizer solves the **credit assignment problem**: how do we assign credit (or blame) to the parameters based on how the network performs?

<mark>All neural network optimizers in this course will be based on **gradient descent**.</mark>

**PyTorch** automates the gradient computation.

### Stochastic Gradient Descent (SGD)

- For each iteration, evaluate a training sample from the dataset **taken at random**
- **Computing the gradient takes less time** per step, but training may not actually be faster overall, since the noisy steps mean more iterations are needed
- The optimization path looks rather erratic
- SGD allows more of a **global search** for an optimum, which often results in a better set of weights for your model
- In contrast, plain (batch) GD computes the gradient on the entire training data

![Alt text](images/img26.png)

Red: using only 1 example at a time --> **SGD**
- Slower, less stable: the gradient jumps around from one example to the next (cat, then dog)
- BUT: more randomness

Black: using all the data at once --> **Batch Gradient Descent**
- Faster, more stable
- BUT: follows the mean gradient, which aggregates over all examples and **drops a lot of information**

### <mark>Mini-Batch Gradient Descent</mark> (Combination of the two)

- Instead of working with one sample at a time... can apply **batching**...

> - Use our network to make the predictions for **n samples**
> - Compute the average loss for those **n samples**
> - Take an optimization step on the average loss of those **n samples**, which completes one iteration

**Batch size** (n): Number of training examples used per optimization “step”, a.k.a. iteration.  
**Iteration**: One step; the parameters are updated once per iteration.  
**Epoch**: One pass in which all the training data is used once to update the parameters.  

Suppose there are 1000 samples in the training data and batch size = 20 $\implies$ 1 epoch contains 50 iterations
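
The relationship between batch size, iterations, and epochs can be checked with a small mini-batch loop. This is a sketch in plain Python on a made-up linear regression problem (`xs`, `ys`, and the learning rate are assumptions chosen for illustration), mirroring the example of 1000 samples with batch size 20.

```python
import random

def minibatch_gd(xs, ys, batch_size=20, epochs=5, lr=0.1, seed=0):
    """Fit y ~ w * x + b by mini-batch gradient descent on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    iterations = 0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)                           # new random order every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the average squared error over the n samples in the batch.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w                           # one parameter update = one iteration
            b -= lr * grad_b
            iterations += 1
    return w, b, iterations

# Hypothetical noiseless data: 1000 samples of y = 3x + 1 with x in [0, 1).
xs = [i / 1000 for i in range(1000)]
ys = [3 * x + 1 for x in xs]
w, b, iterations = minibatch_gd(xs, ys)
print(iterations)   # 250 = 5 epochs * 50 iterations per epoch
```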

### Gradient Descent: N-Dimensional

A deep neural network has millions or billions of parameters  

<mark>**Real gradient descent of a deep network is optimization in millions of dimensions!**</mark>

Most points of zero gradients are saddle points.  
Plateaus are a problem but can be addressed using specialized variants on gradient descent  

Prob [all dimensions have gradient = 0] ~ 0

It's possible to be at a minimum in 1 dimension and at a maximum in another dimension (**saddle point**)

--> We can easily escape it (since we are trying to minimize in all dimensions, there is still a direction that goes downhill)
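
A toy example may help here. The function $f(x, y) = x^2 - y^2$ (chosen for illustration, not taken from the course) has a saddle point at the origin: a minimum along $x$ and a maximum along $y$. The sketch below starts gradient descent very close to the saddle and shows that it drifts away along $y$.

```python
def saddle_descent(x, y, lr=0.1, steps=30):
    """Gradient descent on f(x, y) = x**2 - y**2, which has a saddle at (0, 0)."""
    for _ in range(steps):
        grad_x, grad_y = 2 * x, -2 * y    # partial derivatives of f
        x -= lr * grad_x                  # x is pulled toward the minimum along x
        y -= lr * grad_y                  # y is pushed away from the maximum along y
    return x, y

x, y = saddle_descent(x=1.0, y=1e-3)
print(x, y)   # x shrinks toward 0 while |y| grows away from the saddle
```

Even a tiny offset in $y$ is enough: each step multiplies it by $(1 + 2 \cdot lr)$, so the iterate leaves the saddle instead of getting stuck.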

Plateaus can easily be addressed as well

### **SGD with Momentum**

<mark>**Ravines**</mark>: areas where the surface curves much more steeply in one dimension than in another, common around **local optima**.

SGD has trouble navigating ravines → it oscillates across the slopes of the ravine --> SLOW

Momentum helps **accelerate** SGD in the relevant direction and dampens the oscillations of SGD

<img src="images/img28.png" width="50%" height="50%">

The momentum term increases for dimensions whose gradients point in the same  
directions and reduces updates for dimensions whose gradients change directions

Analogy → we push a ball down a hill. The ball accumulates momentum as it rolls downhill,  
becoming faster and faster on the way until it reaches its terminal velocity

$$
\begin{cases}
    v_{ji}^t = \lambda v_{ji}^{t-1} - \gamma \frac{\partial E}{\partial w_{ji}^{t-1}} \\
    w_{ji}^t = w_{ji}^{t-1} + v_{ji}^t
\end{cases}
$$

With $\lambda = 0$, this reduces to the previous formula (plain gradient descent). When $\lambda$ is set to a non-zero value, movement is accelerated
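
The effect of $\lambda$ can be seen on a small ravine-shaped surface. This is a sketch under assumptions: the loss function, the starting point, $\gamma = 0.02$, and $\lambda = 0.9$ are all made up for illustration.

```python
def loss(w1, w2):
    """A ravine-shaped surface: 50x steeper along w2 than along w1."""
    return 0.5 * (w1 ** 2 + 50 * w2 ** 2)

def grad(w1, w2):
    return w1, 50 * w2

def descend(lam, gamma=0.02, steps=200):
    """Apply v = lam * v - gamma * grad, then w = w + v, starting from (1, 1)."""
    w = [1.0, 1.0]
    v = [0.0, 0.0]
    for _ in range(steps):
        g = grad(*w)                          # gradient at the current weights
        for k in range(2):
            v[k] = lam * v[k] - gamma * g[k]  # velocity accumulates past gradients
            w[k] = w[k] + v[k]
    return loss(*w)

plain = descend(lam=0.0)       # lambda = 0 reduces to plain gradient descent
momentum = descend(lam=0.9)    # lambda = 0.9 is an assumed, commonly used value
print(plain, momentum)         # the momentum run ends at a lower loss
```

With the same $\gamma$ and the same number of steps, the momentum run makes much faster progress along the shallow direction of the ravine.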

### <mark>Adaptive Moment Estimation (Adam)</mark>

$\implies$ USE FOR OPTIMIZER (very stable and gives good results)

Adaptive learning rates → <mark>**each weight has its own rate**</mark>, instead of a fixed $\gamma$

$$m_t = \beta _1 m_{t-1} + (1 - \beta _1)({\partial E \over \partial w_{ji}})$$
$$v_t = \beta _2 v_{t-1} + (1 - \beta _2)({\partial E \over \partial w_{ji}})^2$$
$$w_{ji}^{t+1} = w_{ji}^{t} - ({\gamma \over \sqrt{v_t} + \epsilon})m_t$$

**The added denominator is dependent on gradient of each weight**

$({\gamma \over \sqrt{v_t} + \epsilon})$ is the **new learning rate**

This incorporates both momentum and an adaptive learning rate (a different rate per weight)

- rapid convergence
- requires minimal tuning
- commonly used optimizer

    `torch.optim.Adam(model.parameters(), lr=0.001)`
    

<img src="images/img29.png" width="20%" height="20%">
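
To make the three update equations concrete, here is a sketch that applies them literally in plain Python. It follows the simplified form written above, so it leaves out the bias-correction terms that `torch.optim.Adam` also applies. The loss function, starting point, and $\gamma = 0.01$ are assumptions for illustration.

```python
import math

def loss(w):
    """Toy loss with its minimum at w = (3, -1)."""
    return (w[0] - 3) ** 2 + 10 * (w[1] + 1) ** 2

def grad(w):
    return [2 * (w[0] - 3), 20 * (w[1] + 1)]

def adam(grad_fn, w, gamma=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    """Simplified Adam (no bias correction), one m and v per weight."""
    w = list(w)
    m = [0.0] * len(w)
    v = [0.0] * len(w)
    for _ in range(steps):
        g = grad_fn(w)
        for k in range(len(w)):
            m[k] = beta1 * m[k] + (1 - beta1) * g[k]         # running mean of gradients
            v[k] = beta2 * v[k] + (1 - beta2) * g[k] ** 2    # running mean of squared gradients
            w[k] -= gamma / (math.sqrt(v[k]) + eps) * m[k]   # per-weight effective learning rate
    return w

w_final = adam(grad, [0.0, 0.0])
print(loss([0.0, 0.0]), loss(w_final))
```

Note how the two weights see very different gradient magnitudes, yet the division by $\sqrt{v_t}$ gives them steps of a similar size.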

***
## 3. Learning Rate ($\gamma$)

The learning rate determines the **size of the step** that an optimizer takes during each iteration

$$w_{ji}^{t+1} = w_{ji}^{t} - \gamma \frac{\partial E}{\partial w_{ji}}$$

The right learning rate size is important, and depends on many different things:
- The learning problem
- The optimizer
- The batch size
    - Large batch → larger learning rates.
    - Small batch → smaller learning rate.
- The stage of training
    - Reduce as training progresses

<img src="images/img30.png" width="70%" height="70%">

$\implies$ **Begin with a large learning rate, then decrease it** (learning rate scheduling)
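
One simple schedule is exponential decay, where the learning rate is multiplied by a constant factor after every epoch. A minimal sketch, with the initial rate 0.1 and decay factor 0.5 chosen arbitrarily:

```python
def exponential_decay(initial_lr, decay_rate, epoch):
    """Learning rate after `epoch` epochs: initial_lr * decay_rate ** epoch."""
    return initial_lr * decay_rate ** epoch

schedule = [exponential_decay(0.1, 0.5, epoch) for epoch in range(4)]
print(schedule)   # [0.1, 0.05, 0.025, 0.0125]
```

PyTorch ships ready-made schedulers for this, for example `torch.optim.lr_scheduler.ExponentialLR` and `torch.optim.lr_scheduler.StepLR`.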

***
## 4. Normalization

We always normalize the inputs to **prevent the model from paying disproportionate attention to the features with a larger range (magnitude)**.

First layer:
- $\mu$ = average of X across samples
- $\sigma$ = standard deviation across samples

<img src="images/img31.png" width="30%" height="30%">
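
As a sketch of input normalization for a single feature, the snippet below subtracts the mean across samples and divides by the standard deviation across samples. The feature values are made up.

```python
import statistics

def standardize(column):
    """Rescale one feature column to zero mean and unit standard deviation."""
    mu = statistics.mean(column)          # average across samples
    sigma = statistics.pstdev(column)     # standard deviation across samples
    return [(x - mu) / sigma for x in column]

heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]   # hypothetical feature with a large range
normalized = standardize(heights_cm)
print(normalized)
```

In practice $\mu$ and $\sigma$ are computed on the training data and then reused for validation and test data; a constant feature ($\sigma = 0$) would need special handling that this sketch omits.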

For the following layers: the output of each layer also needs to be normalized before it is passed to the next layer

### 4.1 Batch Normalization

Normalize activations batch-wise for each layer

<img src="images/img32.png" width="50%" height="50%">

**Inference Time (application time)**: We're given just the input data, possibly a single example, so there is no batch to compute $\mu$ and $\sigma$ from, even though the model was trained with batch-wise normalization

THEREFORE, we need to add something. **Keep a moving average** of $\mu$ and $\sigma$ during training <mark>**(Step 5)**</mark>, so we can use it at inference time

<img src="images/img33.png" width="90%" height="90%">
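
The moving-average idea can be sketched for a single feature as follows. This is a simplified illustration: the class name and the `momentum` value are assumptions, and the learnable scale and shift that batch normalization layers usually include are left out.

```python
import math

class BatchNorm1Feature:
    """Batch normalization for a single feature, tracking running statistics."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum = momentum      # weight given to the newest batch statistic
        self.eps = eps
        self.running_mean = 0.0
        self.running_var = 1.0

    def train_step(self, batch):
        mu = sum(batch) / len(batch)
        var = sum((x - mu) ** 2 for x in batch) / len(batch)
        # Moving averages, kept so that inference does not need a batch.
        self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mu
        self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        return [(x - mu) / math.sqrt(var + self.eps) for x in batch]

    def infer(self, x):
        # A single input is normalized with the stored statistics.
        return (x - self.running_mean) / math.sqrt(self.running_var + self.eps)

bn = BatchNorm1Feature()
for _ in range(200):
    out = bn.train_step([8.0, 10.0, 12.0])    # hypothetical batches with mean 10
print(bn.infer(10.0))                          # close to 0, since 10 is the running mean
```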

Pros:

- Higher learning rate → speeding up the training
- Regularizes the model
- Less sensitivity to initialization

Cons:
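- Depends on batch size → Unreliable with small batches, since the batch statistics become noisy
- Cannot work with pure SGD (batch size of 1), since there are no batch statistics to compute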

### <mark>4.2 Layer Normalization</mark>

USE THIS 

**Normalization is applied on the neuron** for a single instance across all features **in each example**

- Simpler to implement, no moving averages or running statistics to keep track of
- Not dependent on batch size

<img src="images/img34.png" width="20%" height="20%">
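
A minimal sketch of layer normalization for one example, assuming the activations of a layer are given as a plain list (the values are made up):

```python
import math

def layer_norm(features, eps=1e-5):
    """Normalize one example across its own features (no batch statistics needed)."""
    mu = sum(features) / len(features)
    var = sum((x - mu) ** 2 for x in features) / len(features)
    return [(x - mu) / math.sqrt(var + eps) for x in features]

example = [1.0, 2.0, 3.0, 4.0]      # hypothetical activations of one layer for one example
normalized = layer_norm(example)
print(normalized)
```

Because the statistics come from the example itself, the same function is used unchanged at training and inference time.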

***
## 5. Regularization

Techniques to **avoid overfitting** by making the problem more difficult, so that it is hard for the model to memorize the training data

### 5.1 Dropout

Forces a neural network to learn more robust features
- **During training** → In each training step: drop activations (neurons) (set to 0) with probability p
    - e.g. each neuron has probability p of being dropped (and 1 - p of being kept)
    - The network **must adapt and learn based on the underlying distribution of data rather than memorizing**
- **During inference** → multiply weights by (1-p) to keep the same distribution
    - During inference, we have the full network
    - *For example, during training a neuron receives on average K' = (1 - p)K of its K inputs. Therefore, during inference, when all K inputs are present, we must multiply each of the K inputs to the neuron by (1 - p), to get something closer to K'*

![Alt text](images/img35.png)
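
The scaling argument can be checked numerically. The sketch below follows the convention in these notes (drop with probability p during training, scale by 1 - p at inference); the activation values and p = 0.3 are made up.

```python
import random

def dropout_train(activations, p, rng):
    """Training: set each activation to 0 with probability p."""
    return [0.0 if rng.random() < p else a for a in activations]

def dropout_infer(activations, p):
    """Inference: keep every activation but scale by (1 - p)."""
    return [a * (1 - p) for a in activations]

rng = random.Random(0)
activations = [1.0] * 10000
p = 0.3
train_mean = sum(dropout_train(activations, p, rng)) / len(activations)
infer_mean = sum(dropout_infer(activations, p)) / len(activations)
print(train_mean, infer_mean)   # both close to 1 - p = 0.7
```

Note that `torch.nn.Dropout` uses the equivalent "inverted" form: it scales the kept activations by 1 / (1 - p) during training, so nothing needs to be rescaled at inference.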



### 5.2 Weight Decay (L2)

Prevents the weights from growing too much → **Lowering variance**

Add a term to the loss function (magnitude of the entire weight matrix) $\implies$ **The magnitude of the weights should not get too big**

Weight reduction is multiplicative and proportional to the scale of W

<img src="images/img36.png" width="50%" height="50%">
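
Why the reduction is multiplicative can be seen from a single gradient step. The sketch assumes the penalty has the form (weight_decay / 2) times the sum of squared weights; the names `lr` and `weight_decay` and all values are chosen for illustration.

```python
def sgd_step_with_weight_decay(w, grad, lr, weight_decay):
    """One gradient step on loss + (weight_decay / 2) * sum(w_k ** 2)."""
    # The penalty adds weight_decay * w_k to each gradient, which is the same as
    # shrinking w_k by the factor (1 - lr * weight_decay) before the usual step.
    return [(1 - lr * weight_decay) * wk - lr * gk for wk, gk in zip(w, grad)]

w = [4.0, -2.0]
zero_grad = [0.0, 0.0]          # isolate the effect of the penalty
shrunk = sgd_step_with_weight_decay(w, zero_grad, lr=0.1, weight_decay=0.5)
print(shrunk)                   # every weight shrinks by 5%, larger weights lose more
```

PyTorch optimizers such as `torch.optim.SGD` and `torch.optim.Adam` expose this through a `weight_decay` argument.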




### 5.3 Early Stopping with Patience

Recall: we should stop the training once we observe that the validation error is increasing.  
However, stopping at the first increase will sometimes miss better results that come later

- In each training iteration, observe the validation loss
- As soon as the validation loss starts to increase, start a counter
- If the validation loss decreases, reset the counter
- Otherwise, wait for a fixed number of iterations (patience) and then stop the training

<img src="images/img37.png" width="50%" height="50%">
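
The patience rule can be sketched as a small function over a list of validation losses. This version uses a common variant of the rule: the counter increases whenever the loss fails to improve on the best value seen so far, and resets on every new best. The loss values are made up.

```python
def early_stopping_step(val_losses, patience):
    """Return the index at which training would stop, or None if it never stops."""
    best = float("inf")
    counter = 0
    for step, val_loss in enumerate(val_losses):
        if val_loss < best:
            best = val_loss
            counter = 0                 # improvement: reset the counter
        else:
            counter += 1                # no improvement: keep waiting
            if counter >= patience:
                return step
    return None

# Hypothetical validation losses: a temporary bump at index 3, then a real rise.
losses = [1.0, 0.8, 0.7, 0.75, 0.6, 0.65, 0.7, 0.72, 0.8]
print(early_stopping_step(losses, patience=3))   # 7
```

With `patience=1` the same function stops at index 3 and never reaches the better loss of 0.6 at index 4, which is exactly the problem patience is meant to avoid.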