# Optimization Algorithms

## Contents

1. [**Prerequisites**](#Prerequisites)  
    1. [Exponentially Weighted Average](#Exponentially-Weighted-Average)
2. [**Optimization Algorithms**](#Optimization-Algorithms)  
    1. [Momentum](#Momentum)  
    2. [Nesterov Accelerated Gradient](#Nesterov-Accelerated-Gradient)  
    3. [RMSProp](#RMSProp)  
    4. [Adam](#Adam) 
3. [**Common Practices**](#Common-Practices)
    1. [Dynamic learning rate](#Dynamic-learning-rate)

# Prerequisites

## Exponentially Weighted Average

TODO

# Optimization Algorithms

We can use variants of stochastic or mini-batch gradient descent that may speed up learning and help escape local minima. One such technique is gradient descent with momentum. Two other popular optimization algorithms that have been shown to work well on a wide range of neural network architectures are RMSProp and Adam.

## Momentum

Several heuristics can help avoid getting stuck in local minima and may accelerate learning. For example, we can include a **weight momentum** term in the weight update. The basic idea is to compute an **exponentially weighted average** (governed by a parameter $\beta$) of the gradients and to use that average, instead of the raw gradient, to update the weights. Assume we are performing mini-batch gradient descent, and consider the update step for the $l$-th layer (we omit the superscript denoting the layer in what follows). First we compute the derivatives $\mathrm{d}W$ and $\mathrm{d}b$ of the loss function with respect to that layer's weights and biases on the current mini-batch. Then we update the parameters as follows:

$$V_{\mathrm{d}W} = \beta V_{\mathrm{d}W} + (1-\beta) \mathrm{d}W$$

$$V_{\mathrm{d}b} = \beta V_{\mathrm{d}b} + (1-\beta) \mathrm{d}b$$

$$W = W - \alpha V_{\mathrm{d}W}$$

$$b = b - \alpha V_{\mathrm{d}b}$$

In this way we smooth out the steps of gradient descent and follow a more direct path. Weight momentum can also be interpreted through an analogy with physics: the current value of the weights represents the position of an object rolling down a surface defined by the cost function. The term $\mathrm{d}W$ can be thought of as an acceleration and the term $V_{\mathrm{d}W}$ as a velocity. The hyperparameter $\beta$, which is smaller than $1$, acts as _friction_ and prevents the object from speeding up without limit. Because each step is no longer taken independently of the previous ones, the object can gain momentum and, eventually, escape a local minimum.

In practice $V_{\mathrm{d}W}$ is a matrix with the same shape as $W$, $V_{\mathrm{d}b}$ is a vector with the same shape as $b$, and both are initialized to zero. Typically, $\beta$ is set to a value around $0.9$.
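
The update rule above can be sketched in a few lines of NumPy. This is only a minimal illustration: the function name `momentum_step`, the toy loss, and the chosen values of `alpha` are assumptions made for the example, not part of the notes.

```python
import numpy as np

def momentum_step(W, b, dW, db, V_dW, V_db, alpha=0.01, beta=0.9):
    """One gradient descent step with momentum (hypothetical helper)."""
    # Exponentially weighted averages of the gradients
    V_dW = beta * V_dW + (1 - beta) * dW
    V_db = beta * V_db + (1 - beta) * db
    # Update the parameters with the averaged gradients
    W = W - alpha * V_dW
    b = b - alpha * V_db
    return W, b, V_dW, V_db

# Toy problem (assumption): minimize sum(W^2) + sum(b^2)
W = np.array([[1.0, -2.0]])
b = np.array([0.5])
V_dW = np.zeros_like(W)  # same shape as W, initialized to zero
V_db = np.zeros_like(b)  # same shape as b, initialized to zero

for _ in range(100):
    dW, db = 2 * W, 2 * b  # gradients of the toy loss
    W, b, V_dW, V_db = momentum_step(W, b, dW, db, V_dW, V_db, alpha=0.1)
```

In a real network, `dW` and `db` would come from backpropagation on the current mini-batch rather than from a closed-form gradient.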

## Nesterov Accelerated Gradient

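Nesterov Accelerated Gradient differs slightly from momentum: we "look into the future" and evaluate the gradient at the point where the accumulated momentum is about to take us, rather than at the current weights. Note that the notation in this section differs from the previous one: here $\eta$ is the learning rate, $\alpha$ is the momentum coefficient, and $\delta^{(k)}$ is the update applied at iteration $k$. The update equations become: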

$$\delta^{(k+1)} = -\eta \nabla L(W^{(k)} + \alpha \delta^{(k)}) + \alpha\delta^{(k)}$$

$$W^{(k+1)} = W^{(k)} + \delta^{(k+1)}$$

The same holds for $b$. With Nesterov momentum, we first move along the update accumulated in the previous iteration (from $W^{(k)}$ to $W^{(k)} + \alpha\delta^{(k)}$), then compute the gradient at that point ($-\eta \nabla L(W^{(k)} + \alpha \delta^{(k)})$), and finally make a correction.

<img src="images/neural_networks/Nesterov.png" style="width:40em; display: block; margin-left: auto; margin-right: auto;" />
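
The two Nesterov equations translate almost directly into code. The sketch below is illustrative only: `nesterov_step`, `grad_fn`, and the quadratic toy loss are hypothetical names and choices, and the values of `eta` and `alpha` are arbitrary.

```python
import numpy as np

def nesterov_step(W, delta, grad_fn, eta=0.1, alpha=0.9):
    """One Nesterov step; eta is the learning rate, alpha the momentum coefficient."""
    # Look ahead: move along the previously accumulated update
    lookahead = W + alpha * delta
    # Evaluate the gradient at the look-ahead point and correct the update
    delta = -eta * grad_fn(lookahead) + alpha * delta
    return W + delta, delta

def grad_fn(W):
    # Gradient of the toy loss L(W) = 0.5 * ||W||^2 (an assumption for the example)
    return W

W = np.array([1.0, -2.0])
delta = np.zeros_like(W)  # no accumulated update before the first iteration
for _ in range(50):
    W, delta = nesterov_step(W, delta, grad_fn)
```

Passing the gradient as a function, rather than as a precomputed array, reflects the key difference from plain momentum: the gradient has to be evaluated at a point other than the current weights.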

## RMSProp

Root Mean Square Propagation ([RMSProp](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf)) is quite similar to weight momentum, but it divides the gradient by the square root of a running weighted average of the squared gradients. The motivation is that the magnitude of the gradient can differ between weights and change during learning, which makes it difficult to choose a single global learning rate.

$$S_{\mathrm{d}W} = \beta_2 S_{\mathrm{d}W} + (1-\beta_2) \mathrm{d}W^2$$

$$S_{\mathrm{d}b} = \beta_2 S_{\mathrm{d}b} + (1-\beta_2) \mathrm{d}b^2$$

$$W = W - \alpha \frac{\mathrm{d}W}{\sqrt{S_{\mathrm{d}W}}+\varepsilon}$$

$$b = b - \alpha \frac{\mathrm{d}b}{\sqrt{S_{\mathrm{d}b}}+\varepsilon}$$

Note that the squaring operation is applied element-wise. We use different symbols ($S_{\mathrm{d}W}$ and $\beta_2$) than for momentum because Adam combines both momentum and RMSProp. $\varepsilon$ is a small term (in practice we can set it to $10^{-8}$) that prevents division by a quantity close to zero, which would make the update explode. Note also that, unlike weight momentum, here we damp the oscillations by dividing by a term that tracks the recent magnitude of the derivative of the loss function with respect to $W$ or $b$. Therefore, wider steps (in directions in which the function is steeper) are damped more than narrow steps (in directions in which the function is flatter). In other words, every parameter effectively gets its own learning rate.

A consequence of this is that we can use a larger learning rate $\alpha$ without diverging in the steepest directions.
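
A small NumPy sketch makes the per-parameter scaling visible. The helper name `rmsprop_step` and the default `beta2=0.9` are assumptions for illustration (the notes do not fix a value of $\beta_2$ for RMSProp on its own); only the update for $W$ is shown, since $b$ is handled identically.

```python
import numpy as np

def rmsprop_step(W, dW, S_dW, alpha=0.01, beta2=0.9, eps=1e-8):
    """One RMSProp step for W (hypothetical helper)."""
    # Running average of the element-wise squared gradient
    S_dW = beta2 * S_dW + (1 - beta2) * dW**2
    # Each component of the gradient is scaled by its own running magnitude
    W = W - alpha * dW / (np.sqrt(S_dW) + eps)
    return W, S_dW

# Two parameters whose gradients differ by four orders of magnitude
W = np.array([1.0, 1.0])
dW = np.array([100.0, 0.01])
S_dW = np.zeros_like(W)

W_new, S_dW = rmsprop_step(W, dW, S_dW)
step = W - W_new  # both components end up with nearly the same step size
```

With plain gradient descent the first component would move ten thousand times farther than the second; after the RMSProp scaling the two steps are almost equal in size.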

## Adam

Adaptive moment estimation (Adam) combines weight momentum and RMSProp. The moving averages are initialized to zero:

$$V_{\mathrm{d}W}=0, \quad S_{\mathrm{d}W}=0, \quad V_{\mathrm{d}b}=0, \quad S_{\mathrm{d}b}=0.$$

Then, on iteration $t$:
- Compute $\mathrm{d}W$ and $\mathrm{d}b$ using the current mini-batch 
- Compute Momentum exponentially weighted average:  
    - $V_{\mathrm{d}W}=\beta_1 V_{\mathrm{d}W} + (1-\beta_1)\mathrm{d}W$  
    - $V_{\mathrm{d}b}=\beta_1 V_{\mathrm{d}b} + (1-\beta_1)\mathrm{d}b$  
- Compute the RMSProp update terms:
    - $S_{\mathrm{d}W}=\beta_2 S_{\mathrm{d}W} + (1-\beta_2)\mathrm{d}W^2$  
    - $S_{\mathrm{d}b}=\beta_2 S_{\mathrm{d}b} + (1-\beta_2)\mathrm{d}b^2$  
- Perform the bias correction:
    - $V_{\mathrm{d}W}^{corr} = \frac{V_{\mathrm{d}W}}{1-\beta_1^t}$  
    - $V_{\mathrm{d}b}^{corr} = \frac{V_{\mathrm{d}b}}{1-\beta_1^t}$  
    - $S_{\mathrm{d}W}^{corr} = \frac{S_{\mathrm{d}W}}{1-\beta_2^t}$  
    - $S_{\mathrm{d}b}^{corr} = \frac{S_{\mathrm{d}b}}{1-\beta_2^t}$  
- Update the weights:
    - $W := W - \alpha \frac{V_{\mathrm{d}W}^{corr}}{\sqrt{S_{\mathrm{d}W}^{corr}}+\varepsilon}$
    - $b := b - \alpha \frac{V_{\mathrm{d}b}^{corr}}{\sqrt{S_{\mathrm{d}b}^{corr}}+\varepsilon}$
    
This algorithm has some hyperparameters that have to be tuned, and others that are typically set to common default values:
- Learning rate $\alpha$ needs to be tuned (we could try a few values and choose the one yielding the best result, or adopt learning rate decay).
- $\beta_1 = 0.9$
- $\beta_2 = 0.999$
- $\varepsilon = 10^{-8}$
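
Putting the steps together, one Adam update for $W$ might look like the sketch below (the update for $b$ is identical). The function name `adam_step` and the toy loss are assumptions made for the example; the default hyperparameters are the common values listed above.

```python
import numpy as np

def adam_step(W, dW, V_dW, S_dW, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for W at iteration t, with t starting from 1 (hypothetical helper)."""
    # Momentum: exponentially weighted average of the gradients
    V_dW = beta1 * V_dW + (1 - beta1) * dW
    # RMSProp: exponentially weighted average of the squared gradients
    S_dW = beta2 * S_dW + (1 - beta2) * dW**2
    # Bias correction (t must be >= 1, otherwise the denominators are zero)
    V_corr = V_dW / (1 - beta1**t)
    S_corr = S_dW / (1 - beta2**t)
    # Parameter update
    W = W - alpha * V_corr / (np.sqrt(S_corr) + eps)
    return W, V_dW, S_dW

# Toy problem (assumption): minimize sum(W^2)
W = np.array([1.0, -2.0])
V_dW = np.zeros_like(W)
S_dW = np.zeros_like(W)
for t in range(1, 201):
    dW = 2 * W  # gradient of the toy loss
    W, V_dW, S_dW = adam_step(W, dW, V_dW, S_dW, t, alpha=0.05)
```

Note that the uncorrected averages `V_dW` and `S_dW` are the ones carried over to the next iteration; the corrected values are used only for the current update.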

# Common Practices

## Dynamic learning rate

Instead of using a fixed learning rate chosen a priori, $\alpha$ is often replaced by a learning rate that decreases over time, for example:

$$\alpha = \frac{\alpha_0}{1 + \text{decay_rate} \cdot \text{epoch_num}}$$

where $\alpha_0$ is the initial learning rate.

There exist other learning rate decay methods, for instance:

- **Exponential decay**:

$$\alpha = k^{\text{epoch_num}}\cdot \alpha_0$$

where $k$ is a constant, e.g. $k=0.95$.

- **Based on epoch number:**

$$\alpha = \frac{k}{\sqrt{\text{epoch_num}}} \cdot \alpha_0$$

where $k$ is a constant.

- **Based on batch size:**

$$\alpha = \frac{k}{\sqrt{\text{batch_size}}} \cdot \alpha_0$$

where $k$ is a constant.
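
The schedules above are simple enough to write as plain Python functions. The function names below are hypothetical, and the sample values of `alpha0`, `decay_rate`, and `k` are arbitrary choices for illustration.

```python
import math

def inverse_time_decay(alpha0, decay_rate, epoch_num):
    # alpha = alpha0 / (1 + decay_rate * epoch_num)
    return alpha0 / (1 + decay_rate * epoch_num)

def exponential_decay(alpha0, k, epoch_num):
    # alpha = k^epoch_num * alpha0
    return k**epoch_num * alpha0

def epoch_sqrt_decay(alpha0, k, epoch_num):
    # alpha = k / sqrt(epoch_num) * alpha0, defined for epoch_num >= 1
    return k / math.sqrt(epoch_num) * alpha0

def batch_size_scaling(alpha0, k, batch_size):
    # alpha = k / sqrt(batch_size) * alpha0
    return k / math.sqrt(batch_size) * alpha0

# Print the first few learning rates produced by each epoch-based schedule
for epoch_num in range(1, 5):
    print(
        epoch_num,
        inverse_time_decay(0.2, 1.0, epoch_num),
        exponential_decay(0.2, 0.95, epoch_num),
        epoch_sqrt_decay(0.2, 1.0, epoch_num),
    )
```

In a training loop, the chosen function would be called once per epoch and its result passed as $\alpha$ to the optimizer step.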



