Optimization Algorithms

- Gradient descent methods
- Newton and quasi-Newton methods


# Gradient methods

In ML the default approach to optimization is gradient descent.

It was first described by Cauchy in 1847.

A good analogy is fog in the mountains. You need to get down to the valley, but you can see only within a radius of 10 meters, so all you can do is follow the local slope.

In simple terms, the GD algorithm does the following:
1. chooses a starting point
2. computes the gradient, whose negative is the direction of steepest descent
3. updates the current location by taking a step in that direction
4. if the update is insignificantly small => stop
5. otherwise goes back to step 2

Suppose we want to minimize $f(\theta)$, where $\theta = (\theta_1, \theta_2, \ldots, \theta_n)$. We choose some initial approximation $\theta_0$ and repeat the update process:

$$\theta_{t} = \theta_{t-1} + \Delta_t$$
where
$$\Delta_t = - \eta \cdot grad(\theta_{t-1})$$

NOTE: if we need to maximize $f(\theta)$ instead of minimizing it, we <u>add</u> the gradient: $\Delta_t = \eta \cdot grad(\theta_{t-1})$
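
The loop described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function name `gradient_descent`, the tolerance `tol`, the iteration cap `max_iter` and the quadratic test function are all assumptions made for this example.

```python
def gradient_descent(grad, theta0, eta=0.1, tol=1e-8, max_iter=10_000):
    """Minimize a function given a callable `grad` that returns its gradient."""
    theta = list(theta0)                       # 1. starting point
    for _ in range(max_iter):
        g = grad(theta)                        # 2. gradient at the current point
        delta = [-eta * gi for gi in g]        # Delta_t = -eta * grad(theta_{t-1})
        theta = [ti + di for ti, di in zip(theta, delta)]  # 3. update location
        if max(abs(di) for di in delta) < tol:  # 4. update is insignificantly small
            break
    return theta


# Hypothetical test function: f(theta) = (theta_1 - 3)^2 + (theta_2 + 1)^2,
# whose minimum is at (3, -1).
def grad_f(theta):
    return [2 * (theta[0] - 3), 2 * (theta[1] + 1)]


theta_min = gradient_descent(grad_f, [0.0, 0.0])
```

For maximization, the only change would be the sign in the `delta` line.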

## Variations of GD

There are 3 flavors of the Gradient Descent algorithm:

| Variant | How it updates |
| --- | --- |
|**Batch**| computes gradients for all data samples and combines them into one big update per pass over the data |
|**Stochastic**| applies an update after each individual data sample |
|**Minibatch**| combines gradients over a small batch of samples and applies one update per batch |
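
As a rough sketch of how the three flavors differ in code, the hypothetical `train` function below fits a single weight `w` in the model `y = w * x`. The only thing that changes between the variants is `batch_size`. The toy data, the squared-error loss and the hyperparameters are assumptions chosen for this example.

```python
import random


def sample_grad(w, x, y):
    # Gradient of the per-sample loss 0.5 * (w * x - y) ** 2 with respect to w
    return (w * x - y) * x


def train(data, batch_size, eta=0.05, epochs=200, seed=0):
    """batch_size == len(data): batch GD; 1: stochastic GD; otherwise minibatch GD."""
    rng = random.Random(seed)
    data = list(data)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)                      # visit samples in random order
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Average the per-sample gradients of the batch into one update
            g = sum(sample_grad(w, x, y) for x, y in batch) / len(batch)
            w -= eta * g
    return w


# Noise-free toy data generated with the true weight 2.0
data = [(x, 2.0 * x) for x in [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]]
w_batch = train(data, batch_size=len(data))
w_sgd = train(data, batch_size=1)
w_mini = train(data, batch_size=2)
```

On this noise-free problem all three variants recover the same weight. On real data the stochastic and minibatch versions follow a noisier path but make many more updates per pass.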

Here is an illustration of the convergence of these 3 algorithms:
<img src= "gd_variants.png" width=500>

## Modifications of GD

### SGD with Momentum

Suppose we do standard Gradient Descent:
$$\theta_{t} = \theta_{t-1} + \Delta_t$$

We want our updates 

$$\Delta_t = - \eta \cdot grad(\theta_{t-1})$$

to be less susceptible to small arbitrary changes in direction. That's why we add momentum - some fraction ($\gamma$) of the previous update:

$$\Delta_t = \gamma \cdot \Delta_{t-1} - \eta \cdot grad(\theta_{t-1})$$

Physical analogy - moving objects do not stop immediately when they need to change direction, they continue moving for some time.

The advantage of momentum is easy to see in "ravines", where locally large gradients point in the wrong direction. Along the steep dimension the gradients keep changing sign and largely cancel out in the accumulated update, while along the dimension with more stable gradients they add up => convergence is faster.
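
A minimal sketch of the momentum update on a toy "ravine" follows. The function name `momentum_descent`, the quadratic test function (steep along $\theta_2$, shallow along $\theta_1$) and the values of $\eta$ and $\gamma$ are assumptions for this example.

```python
def momentum_descent(grad, theta0, eta=0.02, gamma=0.9, steps=300):
    theta = list(theta0)
    delta = [0.0] * len(theta)                 # previous update, Delta_{t-1}
    for _ in range(steps):
        g = grad(theta)
        # Delta_t = gamma * Delta_{t-1} - eta * grad(theta_{t-1})
        delta = [gamma * d - eta * gi for d, gi in zip(delta, g)]
        theta = [t + d for t, d in zip(theta, delta)]
    return theta


# Hypothetical ravine: f(theta) = 0.5 * (theta_1^2 + 25 * theta_2^2), minimum at (0, 0)
def grad_ravine(theta):
    return [theta[0], 25.0 * theta[1]]


theta_momentum = momentum_descent(grad_ravine, [10.0, 1.0])
```

Setting `gamma=0.0` turns this back into plain gradient descent, which makes it easy to compare the two on the same function.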

<img src = "momentum.png" width = 500>

### Nesterov Accelerated Momentum
We do SGD with momentum but slightly modify the order of execution:
1. first we follow the momentum from the previous step
2. then we add the gradient computed at that look-ahead point as a correction

This gives us the following formula, where $\Delta_{t-1}$ is the update from the previous step:
$$\Delta_t = \gamma \cdot \Delta_{t-1} - \eta \cdot grad(\theta_{t-1} + \gamma \cdot \Delta_{t-1})$$
$$\theta_{t} = \theta_{t-1} + \Delta_t$$

[ICML (2013)](http://proceedings.mlr.press/v28/sutskever13.pdf)
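
The two steps can be sketched as follows. Compared with classical momentum, the only difference is the point at which the gradient is evaluated. The function name `nesterov_descent`, the test function and the hyperparameter values are assumptions for this example.

```python
def nesterov_descent(grad, theta0, eta=0.02, gamma=0.9, steps=300):
    theta = list(theta0)
    delta = [0.0] * len(theta)                 # previous update
    for _ in range(steps):
        # 1. follow the momentum from the previous step (look-ahead point)
        lookahead = [t + gamma * d for t, d in zip(theta, delta)]
        # 2. add the gradient at the look-ahead point as a correction
        g = grad(lookahead)
        delta = [gamma * d - eta * gi for d, gi in zip(delta, g)]
        theta = [t + d for t, d in zip(theta, delta)]
    return theta


# Hypothetical ravine: f(theta) = 0.5 * (theta_1^2 + 25 * theta_2^2), minimum at (0, 0)
def grad_ravine(theta):
    return [theta[0], 25.0 * theta[1]]


theta_nesterov = nesterov_descent(grad_ravine, [10.0, 1.0])
```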

### Adagrad
We would like to define some schedule that decreases the learning rate $\eta$. 
Here we do it by simply adding a normalizer that divides the learning rate by the square root of the sum of accumulated squared gradients:
$$\theta_{t} = \theta_{t-1} - \frac{\eta}{\sqrt{\sum_{\tau=1}^{t} grad^2(\theta_{\tau-1}) + \epsilon}} \cdot grad(\theta_{t-1})$$

Note that the denominator is different for each component of $\theta$: the squares, the square root and the division are applied element-wise. In matrix form, the accumulated squared gradients form a diagonal matrix.

$\epsilon$ here just prevents division by zero.

[JMLR (2011)](https://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf)
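
A minimal sketch of the per-coordinate normalizer is shown below. The function name `adagrad`, the test function and the fairly large `eta=1.0` (chosen so that the toy problem converges quickly) are assumptions for this example.

```python
import math


def adagrad(grad, theta0, eta=1.0, eps=1e-8, steps=1000):
    theta = list(theta0)
    accum = [0.0] * len(theta)                 # per-coordinate sum of squared gradients
    for _ in range(steps):
        g = grad(theta)
        accum = [a + gi * gi for a, gi in zip(accum, g)]
        # Each coordinate gets its own effective learning rate eta / sqrt(accum + eps)
        theta = [t - eta / math.sqrt(a + eps) * gi
                 for t, a, gi in zip(theta, accum, g)]
    return theta


# Hypothetical ravine: f(theta) = 0.5 * (theta_1^2 + 25 * theta_2^2), minimum at (0, 0)
def grad_ravine(theta):
    return [theta[0], 25.0 * theta[1]]


theta_adagrad = adagrad(grad_ravine, [10.0, 1.0])
```

Because `accum` only grows, the effective learning rate can only shrink over time.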

### Adadelta
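The learning rate decrease proposed in Adagrad tends to be too aggressive: the sum only grows, so the steps eventually become very small. We would like to make it more gradual. 
So we modify Adagrad by replacing the sum of accumulated squared gradients with a sliding (exponentially decaying) average of squared gradients.
We will refer to the square root of this sliding average as $RMS[grad(\theta_{t-1})]$

$$\theta_{t} = \theta_{t-1} - \frac{\eta}{RMS[grad(\theta_{t-1})]} \cdot grad(\theta_{t-1})$$

The original Adadelta paper additionally replaces $\eta$ with the RMS of previous updates.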

[arxiv (2012)](https://arxiv.org/abs/1212.5701)

### RMSprop
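This method is almost the same as Adadelta: it also divides the learning rate by the sliding RMS of recent gradients
$$\theta_{t} = \theta_{t-1} - \frac{\eta}{RMS[grad(\theta_{t-1})]} \cdot grad(\theta_{t-1})$$
It stems from Rprop (resilient backpropagation).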

[Hinton's lecture slides](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf)
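
The sliding average can be sketched as an exponential moving average of squared gradients. The function name `rmsprop`, the decay rate `rho`, the placement of `eps` outside the square root, the test function and the hyperparameter values are assumptions for this example.

```python
import math


def rmsprop(grad, theta0, eta=0.01, rho=0.9, eps=1e-8, steps=3000):
    theta = list(theta0)
    avg_sq = [0.0] * len(theta)                # sliding average of squared gradients
    for _ in range(steps):
        g = grad(theta)
        avg_sq = [rho * a + (1 - rho) * gi * gi for a, gi in zip(avg_sq, g)]
        # Divide the learning rate by the RMS of recent gradients, per coordinate
        theta = [t - eta / (math.sqrt(a) + eps) * gi
                 for t, a, gi in zip(theta, avg_sq, g)]
    return theta


# Hypothetical ravine: f(theta) = 0.5 * (theta_1^2 + 25 * theta_2^2), minimum at (0, 0)
def grad_ravine(theta):
    return [theta[0], 25.0 * theta[1]]


theta_rmsprop = rmsprop(grad_ravine, [10.0, 1.0])
```

Unlike the accumulated sum in Adagrad, `avg_sq` forgets old gradients, so the effective learning rate does not decay to zero.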

### Adam
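Now we would like to add momentum to RMSprop like we did in earlier algorithms. So we replace the raw gradient with its exponential moving average $m_t$ (momentum):

$$m_t = \beta_1 \cdot m_{t-1} + (1-\beta_1) \cdot grad(\theta_{t-1})$$
$$\theta_{t} = \theta_{t-1} - \frac{\eta}{RMS[grad(\theta_{t-1})]} \cdot m_t$$

Here $\beta_1$ plays the role of the momentum rate $\gamma$, and $\beta_2$ is the decay rate of the sliding average inside $RMS$. The original paper also applies a bias correction to both averages.

[arxiv (2014)](https://arxiv.org/abs/1412.6980)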
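
A minimal sketch of Adam is given below, including the bias correction from the original paper. The function name `adam`, the test function, the step count and `eta=0.1` (larger than the commonly used 0.001, so that the toy problem converges in few steps) are assumptions for this example.

```python
import math


def adam(grad, theta0, eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    theta = list(theta0)
    m = [0.0] * len(theta)                     # moving average of gradients (momentum)
    v = [0.0] * len(theta)                     # moving average of squared gradients
    for t in range(1, steps + 1):
        g = grad(theta)
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
        # Bias correction compensates for m and v being initialized at zero
        m_hat = [mi / (1 - beta1 ** t) for mi in m]
        v_hat = [vi / (1 - beta2 ** t) for vi in v]
        theta = [ti - eta * mh / (math.sqrt(vh) + eps)
                 for ti, mh, vh in zip(theta, m_hat, v_hat)]
    return theta


# Hypothetical ravine: f(theta) = 0.5 * (theta_1^2 + 25 * theta_2^2), minimum at (0, 0)
def grad_ravine(theta):
    return [theta[0], 25.0 * theta[1]]


theta_adam = adam(grad_ravine, [10.0, 1.0])
```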

### Adamax
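As a slight modification of the Adam algorithm, the $L^2$ norm used to scale the gradient is generalized to an arbitrary $L^p$ norm. Adamax takes the limit $p \to \infty$, which turns the sliding average of squared gradients into a simple running maximum of absolute gradient values.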

[arxiv (2014)](https://arxiv.org/abs/1412.6980)
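
A sketch of the $L^\infty$ variant follows. The function name `adamax`, the test function and the hyperparameter values are assumptions for this example, and the small `eps` in the denominator is an extra guard against division by zero added here.

```python
def adamax(grad, theta0, eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    theta = list(theta0)
    m = [0.0] * len(theta)                     # moving average of gradients (momentum)
    u = [0.0] * len(theta)                     # decayed running maximum of |gradient|
    for t in range(1, steps + 1):
        g = grad(theta)
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        u = [max(beta2 * ui, abs(gi)) for ui, gi in zip(u, g)]
        step = eta / (1 - beta1 ** t)          # bias correction for the momentum term
        # eps is an assumption of this sketch, guarding against u == 0
        theta = [ti - step * mi / (ui + eps) for ti, mi, ui in zip(theta, m, u)]
    return theta


# Hypothetical ravine: f(theta) = 0.5 * (theta_1^2 + 25 * theta_2^2), minimum at (0, 0)
def grad_ravine(theta):
    return [theta[0], 25.0 * theta[1]]


theta_adamax = adamax(grad_ravine, [10.0, 1.0])
```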


### Nadam

Nadam takes the Adam algorithm and replaces its classical momentum with Nesterov accelerated momentum.

[CS229 project report (2015)](http://cs229.stanford.edu/proj2015/054_report.pdf)

## Which algorithm to use

As of 2021, Adam is widely considered the state-of-the-art approach and is commonly recommended as the default algorithm for most deep learning applications.



<img src = "optimization.png" width = 350>

Adam parameters:
- $\eta$ - initial learning rate
- $\beta_1$ - momentum retention rate (decay rate of the moving average of gradients)
- $\beta_2$ - decay rate of the moving average of squared gradients, which scales $\eta$
- $\epsilon$ - prevents division by zero
