# "[OptimizationTheory] CH04. Gradient based Optimizations"
> Optimization theory summary note.

- toc: false
- badges: false
- comments: false
- categories: [optimization-theory]
- hide_{github,colab,binder,deepnote}_badge: true

#### 4.1. Introduction
Gradient descent is an iterative first-order optimisation algorithm used to find a local minimum of a given function.

#### 4.2. Gradient Descent

##### Algorithm.4.1. Gradient Descent

Gradient descent is an iterative method for finding a stationary point of an unconstrained optimization problem: <br>
$$
 \theta^* = \underset{\mathbf{\theta}} {\arg\min} L (\mathbf{\theta}) 
$$

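A first-order Taylor expansion of $L$ around $\mathbf{\theta}$ along a unit direction $\mathbf{d}$ gives

$$
L(\mathbf{\theta} + \eta \mathbf{d}) \approx L(\mathbf{\theta}) + \eta \bigtriangledown  _\mathbf{\theta} ^ T L( \mathbf{\theta} ) \mathbf{d} \quad \text{where} \quad \eta > 0, \,\  \left \| \mathbf{d}  \right \| = 1 
$$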

$$
L(\mathbf{\theta} + \eta \mathbf{d}) - L(\mathbf{\theta}) \approx \eta \bigtriangledown  _\mathbf{\theta} ^ T L( \mathbf{\theta} ) \mathbf{d} = \eta \cos{(\phi)} \left \| \bigtriangledown  _\mathbf{\theta} ^ T L( \mathbf{\theta} ) \right \| 
$$

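Here $\phi$ is the angle between $\mathbf{d}$ and the gradient. To decrease $L$ as fast as possible, find the direction vector $\mathbf{d}$ that makes $ L(\mathbf{\theta} + \eta \mathbf{d} ) - L(\mathbf{\theta}) \le 0 $ as negative as possible: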

$$
\cos{(\phi)} = -1 \,\ \rightarrow \,\ \mathbf{d} = - \frac{\bigtriangledown_\mathbf{\theta} L( \mathbf{\theta} ) }{ \left \| \bigtriangledown  _\mathbf{\theta} L( \mathbf{\theta} ) \right \| } 
$$

$$
\therefore \mathbf{\theta} + \eta \mathbf{d} = \mathbf{\theta} - \eta \frac{\bigtriangledown_\mathbf{\theta} L( \mathbf{\theta} ) }{ \left \| \bigtriangledown  _\mathbf{\theta} L( \mathbf{\theta} ) \right \| } = \mathbf{\theta} - \alpha \bigtriangledown_\mathbf{\theta} L( \mathbf{\theta} ), \quad \alpha = \frac{\eta}{ \left \| \bigtriangledown  _\mathbf{\theta} L( \mathbf{\theta} ) \right \| }
$$
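The update $\mathbf{\theta} \leftarrow \mathbf{\theta} - \alpha \bigtriangledown_\mathbf{\theta} L(\mathbf{\theta})$ can be sketched in a few lines of plain Python. The toy loss, the step size `alpha=0.1`, and the step count are assumptions chosen for illustration only.

```python
def grad_L(theta):
    """Gradient of the toy loss L(theta) = (theta[0] - 3)^2 + 2 * (theta[1] + 1)^2."""
    return [2.0 * (theta[0] - 3.0), 4.0 * (theta[1] + 1.0)]


def gradient_descent(theta, alpha=0.1, steps=200):
    """Repeat theta <- theta - alpha * grad L(theta) for a fixed number of steps."""
    for _ in range(steps):
        g = grad_L(theta)
        theta = [t - alpha * gi for t, gi in zip(theta, g)]
    return theta


theta_star = gradient_descent([0.0, 0.0])
print(theta_star)  # close to the minimizer [3, -1]
```

With a fixed $\alpha$ the iteration only converges when $\alpha$ is small enough relative to the curvature of $L$; here both coordinates contract at every step.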


#### 4.3. 4 Types of Gradient Descent

$ (i) \,\ \text{Standard (or steepest) Gradient Descent} $
$$ \mathbf{w} \leftarrow \mathbf{w} - \eta \bigtriangledown \mathbb{E}[J(\mathbf{w})] $$
 - Practically infeasible: evaluating the expectation requires the true distribution of the data $\mathbf{x}$, which is unknown
 - So, in practice, the expectation is replaced by a sample mean
<br><br>

$ (ii) \,\ \text{Stochastic(online) Gradient Descent} $
$$ \mathbf{w} \leftarrow \mathbf{w} - \eta \bigtriangledown J_i(\mathbf{w}) $$
 - Simple to implement 
 - Effective for large-scale problems
 - Requires much less memory
 - Unstable (zigzagging) updates
 - Idea: each update uses only one data point
 - It can converge, but the trajectory is somewhat unstable
<br><br>

$ (iii) \,\ \text{Batch gradient Descent} $
$$ \mathbf{w} \leftarrow \mathbf{w} - \eta \bigtriangledown \sum_{i=1}^{N} J_i (\mathbf{w}) $$
 - Accurate estimation of gradients
 - Parallelization of learning
 - Requires large memory
 - The high time complexity of each update can be a problem (slow)
 - Convergence itself is not a problem
 - Idea: each update uses all of the data
<br><br>

$ (iv) \,\ \text{Mini-Batch Gradient Descent} $
$$ \mathbf{w} \leftarrow \mathbf{w} - \eta \bigtriangledown \sum_{i \in \mathfrak{I}} J_i (\mathbf{w}), \quad 1 \le \left | \mathfrak{I} \right | \le N $$
 - Most general version: $\left | \mathfrak{I} \right | = 1$ gives stochastic and $\left | \mathfrak{I} \right | = N$ gives batch gradient descent
 - Effective for dealing with a large amount of training data
 - Idea: each update uses several data points
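
The stochastic, mini-batch, and batch variants differ only in how many samples enter the gradient sum. A minimal sketch, assuming a hypothetical per-sample loss $J_i(w) = \frac{1}{2}(w x_i - y_i)^2$ on noise-free toy data generated from $y = 2x$:

```python
import random

# Toy data for y = 2x (an assumption for illustration, not from the note).
xs = [i / 10 for i in range(1, 11)]
ys = [2.0 * x for x in xs]


def grad_J(w, i):
    """Gradient of the per-sample loss J_i(w) = 0.5 * (w * x_i - y_i)^2."""
    return (w * xs[i] - ys[i]) * xs[i]


def train(batch_size, eta=0.1, steps=500, seed=0):
    """w <- w - eta * sum of grad J_i over a randomly drawn index set of the given size."""
    rng = random.Random(seed)
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        batch = rng.sample(range(n), batch_size)  # indices without replacement
        w -= eta * sum(grad_J(w, i) for i in batch)
    return w


w_sgd = train(batch_size=1)          # stochastic (online) gradient descent
w_mini = train(batch_size=4)         # mini-batch gradient descent
w_batch = train(batch_size=len(xs))  # batch gradient descent
print(w_sgd, w_mini, w_batch)        # all approach the true slope 2.0
```

Because the gradients are summed rather than averaged, a larger batch also means a larger effective step, so $\eta$ usually has to be retuned when the batch size changes.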

#### 4.4. Newton's Method
Newton's method is a root-finding algorithm. In numerical analysis, many equations can be solved with it or with the bisection search algorithm. We use it for optimization as well, because the first-order necessary condition for a minimum is $\nabla L = \mathbf{0}$, so minimizing $L$ reduces to finding a zero of its gradient.

##### Algorithm.4.2. Newton-Raphson Method for Multivariate Functions
In the gradient-update setting, we linearize $\nabla L$ around $\mathbf{w}_0$, which gives the tangent hyperplane

$$
\mathbf{y} = \nabla^2 L(\mathbf{w}_0)^T (\mathbf{w} - \mathbf{w}_0) + \nabla L(\mathbf{w}_0)
$$

The next $\mathbf{w}$ is obtained by solving the following equation:

$$
\nabla^2 L(\mathbf{w}_0)^T (\mathbf{w} - \mathbf{w}_0) + \nabla L(\mathbf{w}_0) = \mathbf{0}
$$

Therefore, writing $H(\mathbf{w}_0) = \nabla^2 L(\mathbf{w}_0)$ for the (symmetric) Hessian,

$$
\mathbf{w}_1 = \mathbf{w}_0 - H(\mathbf{w}_0)^{-1} \nabla L(\mathbf{w}_0)
$$

We can also consider a second-order polynomial approximation of $L$, i.e. its Taylor series expansion, and minimize it with respect to $\Delta \mathbf{w}$.

$$
L(\mathbf{w} + \Delta \mathbf{w}) \approx L(\mathbf{w}) + \nabla L (\mathbf{w})^T \Delta \mathbf{w} + \frac{1}{2} \Delta \mathbf{w}^T H(\mathbf{w}) \Delta \mathbf{w}
$$

$$
\frac{\partial}{\partial \Delta \mathbf{w}} L(\mathbf{w} + \Delta \mathbf{w}) \approx \nabla L(\mathbf{w}) + H(\mathbf{w}) \Delta \mathbf{w} = \mathbf{0}
$$

$$
\therefore \,\ \Delta \mathbf{w} = - H(\mathbf{w})^{-1} \nabla L(\mathbf{w})
$$

The above result is the same as the result of the Newton Method.
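
As a small check of the Newton update $\mathbf{w}_1 = \mathbf{w}_0 - H(\mathbf{w}_0)^{-1} \nabla L(\mathbf{w}_0)$, the sketch below applies one step to a hypothetical 2D quadratic $L(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T A \mathbf{w} - \mathbf{b}^T \mathbf{w}$ with $A = \begin{pmatrix} 3 & 1 \\ 1 & 2 \end{pmatrix}$ and $\mathbf{b} = (1, 1)^T$. For a quadratic the second-order Taylor expansion is exact, so a single step should land on the minimizer. The linear system is solved directly instead of forming the inverse.

```python
def grad(w):
    """Gradient A w - b of the toy quadratic."""
    return [3.0 * w[0] + w[1] - 1.0, w[0] + 2.0 * w[1] - 1.0]


def hessian(w):
    """Hessian A of the toy quadratic (constant in w)."""
    return [[3.0, 1.0], [1.0, 2.0]]


def solve_2x2(H, g):
    """Solve H d = g for a 2x2 matrix H with Cramer's rule."""
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    return [(H[1][1] * g[0] - H[0][1] * g[1]) / det,
            (H[0][0] * g[1] - H[1][0] * g[0]) / det]


def newton_step(w):
    """One update w <- w - H(w)^{-1} grad L(w)."""
    delta = solve_2x2(hessian(w), grad(w))
    return [wi - di for wi, di in zip(w, delta)]


w1 = newton_step([5.0, -4.0])
print(w1)  # approximately [0.2, 0.4], the solution of A w = b
```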

#### 4.4. Quasi-Newton Method
The inverse of the Hessian matrix appearing in Newton's method is expensive to compute, which makes it hard to use in practice. Quasi-Newton methods reduce the amount of computation by replacing it with an approximation built from the change in the gradient between iterations. The BFGS method is a representative example worth exploring.
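
To illustrate the idea without the full BFGS machinery, the sketch below uses the one-dimensional analogue, the secant method: the second derivative is never computed and is replaced by the gradient difference $(g_k - g_{k-1}) / (w_k - w_{k-1})$. This is a simplified stand-in, not BFGS itself, and the toy loss $L(w) = e^w - 2w$ is an assumption for illustration.

```python
import math


def grad(w):
    """Derivative of the toy loss L(w) = exp(w) - 2w, which is minimized at w = ln 2."""
    return math.exp(w) - 2.0


def secant_quasi_newton(w_prev, w, tol=1e-12, max_iter=50):
    """Newton-like iteration with the curvature estimated from two gradients."""
    g_prev, g = grad(w_prev), grad(w)
    for _ in range(max_iter):
        if abs(g) < tol or w == w_prev:
            break
        h = (g - g_prev) / (w - w_prev)  # curvature estimate from gradient change
        if h == 0.0:
            break
        w_prev, g_prev = w, g
        w = w - g / h                    # Newton step with h in place of L''(w)
        g = grad(w)
    return w


w_star = secant_quasi_newton(0.0, 1.0)
print(w_star, math.log(2.0))
```

In several dimensions the same secant condition only partially determines the curvature matrix; BFGS is one particular way of choosing an update that satisfies it.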

#### 4.5. Update Rule with Momentum
We can add a momentum term to the update equation to keep learning from slowing down and to reduce instability. The basic update rule is as follows:

$$
\mathbf{w}_{k + 1} = \mathbf{w}_k - \eta \nabla_\mathbf{w} L(\mathbf{w}_k) + \gamma (\mathbf{w}_k - \mathbf{w}_{k - 1}) \,\ \text{for} \,\ k \ge 2.
$$
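
A minimal sketch of the momentum update, assuming the usual heavy-ball form in which the momentum term is $\gamma$ times the previous displacement $\mathbf{w}_k - \mathbf{w}_{k-1}$. The ill-conditioned toy loss $L(\mathbf{w}) = \frac{1}{2}(w_1^2 + 25 w_2^2)$ and the hyperparameters are assumptions for illustration; setting `gamma=0` recovers plain gradient descent.

```python
def grad(w):
    """Gradient of the toy loss L(w) = 0.5 * (w[0]^2 + 25 * w[1]^2)."""
    return [w[0], 25.0 * w[1]]


def momentum_descent(w, eta=0.03, gamma=0.9, steps=300):
    """w_next = w - eta * grad L(w) + gamma * (w - w_prev)."""
    w_prev = list(w)  # zero initial displacement
    for _ in range(steps):
        g = grad(w)
        w_next = [wi - eta * gi + gamma * (wi - pi)
                  for wi, gi, pi in zip(w, g, w_prev)]
        w_prev, w = w, w_next
    return w


w_mom = momentum_descent([10.0, 1.0])            # with momentum
w_gd = momentum_descent([10.0, 1.0], gamma=0.0)  # plain gradient descent
print(w_mom, w_gd)  # the momentum run ends much closer to the minimizer [0, 0]
```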

There are various variants of the gradient update algorithm using momentum.

##### Algorithm.4.4. Nesterov Accelerated Gradient(NAG)
With momentum, NAG evaluates the gradient not at $\mathbf{w}_k$ but at the look-ahead point that has already been shifted along the previous direction of movement.

$$
\mathbf{w}_{k + 1} = \mathbf{w}_k - \eta \nabla_\mathbf{w} L(\mathbf{w}_k + \gamma (\mathbf{w}_k - \mathbf{w}_{k - 1})) + \gamma (\mathbf{w}_k - \mathbf{w}_{k - 1}) \,\ \text{for} \,\ k \ge 2.
$$
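
A minimal sketch of NAG, assuming the look-ahead point is $\mathbf{w}_k + \gamma(\mathbf{w}_k - \mathbf{w}_{k-1})$, i.e. the current iterate shifted by the momentum term. The toy loss and hyperparameters are assumptions for illustration.

```python
def grad(w):
    """Gradient of the toy loss L(w) = 0.5 * (w[0]^2 + 25 * w[1]^2)."""
    return [w[0], 25.0 * w[1]]


def nag(w, eta=0.03, gamma=0.9, steps=300):
    """Gradient step taken from the look-ahead point instead of the current iterate."""
    w_prev = list(w)  # zero initial displacement
    for _ in range(steps):
        # Look-ahead point: current iterate shifted by the momentum term.
        look = [wi + gamma * (wi - pi) for wi, pi in zip(w, w_prev)]
        g = grad(look)
        w_prev, w = w, [li - eta * gi for li, gi in zip(look, g)]
    return w


w_nag = nag([10.0, 1.0])
print(w_nag)  # close to the minimizer [0, 0]
```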

#### 4.6. Update Rule with Adaptive Learning Rate
If the learning rate is too small, training takes too long; if it is too large, the iterates zigzag or diverge and training does not proceed properly. <br>
AdaGrad addresses this problem through learning rate decay. However, it has a problem of its own (the step size converges to zero), so the methods that follow it are used instead.

##### Algorithm.4.5. Adaptive Gradient(AdaGrad)

$$
\mathbf{w}_{k + 1} = \mathbf{w}_k - \frac{\eta}{\sqrt{\epsilon + \mathbf{d}_k}} \odot \nabla_\mathbf{w} L(\mathbf{w}_k), \,\ \mathbf{d}_k = \mathbf{d}_{k - 1} +  \nabla_\mathbf{w} L(\mathbf{w}_k) \odot \nabla_\mathbf{w} L(\mathbf{w}_k)
$$

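The above algorithm has a fatal flaw. Since $\mathbf{d}_k$ never decreases and keeps growing as gradients accumulate, the effective learning rate $\eta / \sqrt{\epsilon + \mathbf{d}_k}$ shrinks toward zero and the updates eventually stall.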
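
The shrinking step size can be observed directly. A minimal sketch of the AdaGrad update, recording the effective per-coordinate learning rate $\eta / \sqrt{\epsilon + d_k}$ at every step; the toy loss and the values of `eta` and `eps` are assumptions for illustration.

```python
import math


def grad(w):
    """Gradient of the toy loss L(w) = 0.5 * (w[0]^2 + 25 * w[1]^2)."""
    return [w[0], 25.0 * w[1]]


def adagrad(w, eta=0.5, eps=1e-8, steps=100):
    """AdaGrad with an element-wise running sum d of squared gradients."""
    d = [0.0] * len(w)
    history = []  # effective learning rate per coordinate at each step
    for _ in range(steps):
        g = grad(w)
        d = [di + gi * gi for di, gi in zip(d, g)]
        scale = [eta / math.sqrt(eps + di) for di in d]
        w = [wi - si * gi for wi, si, gi in zip(w, scale, g)]
        history.append(scale)
    return w, history


w_final, history = adagrad([10.0, 1.0])
print(history[0], history[-1])  # the effective learning rate only ever shrinks
```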

##### Algorithm.4.6. Root Mean Square Propagation(RMSProp)

$$
\mathbf{w}_{k + 1} = \mathbf{w}_k - \frac{\eta}{\sqrt{\epsilon + \mathbf{d}_k}} \odot \nabla_\mathbf{w} L(\mathbf{w}_k), \,\ \mathbf{d}_k = \gamma \mathbf{d}_{k - 1} + (1 - \gamma) \nabla_\mathbf{w} L(\mathbf{w}_k) \odot \nabla_\mathbf{w} L(\mathbf{w}_k)
$$
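
The difference from AdaGrad lies only in the accumulator. As a rough illustration, the sketch below feeds the same constant gradient to both accumulators (a scalar case with hypothetical values `eta=0.01`, `gamma=0.9`): the AdaGrad sum grows without bound, while the RMSProp moving average settles near the squared gradient, so the effective learning rate stays close to $\eta$.

```python
import math


def effective_step(grads, eta=0.01, eps=1e-8, gamma=None):
    """Return eta / sqrt(eps + d) after processing all gradients.

    gamma=None uses the AdaGrad running sum; otherwise the RMSProp moving average.
    """
    d = 0.0
    for g in grads:
        if gamma is None:
            d = d + g * g                          # AdaGrad accumulator
        else:
            d = gamma * d + (1.0 - gamma) * g * g  # RMSProp accumulator
    return eta / math.sqrt(eps + d)


grads = [1.0] * 1000
ada_step = effective_step(grads)
rms_step = effective_step(grads, gamma=0.9)
print(ada_step, rms_step)  # AdaGrad has decayed to about eta / sqrt(1000); RMSProp stays near eta
```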

##### Algorithm.4.7. Adaptive Delta(AdaDelta)

$$
\mathbf{w}_{k + 1} = \mathbf{w}_k - \frac{\sqrt{\epsilon + \mathbf{u}_k}}{\sqrt{\epsilon + \mathbf{d}_k}} \odot \nabla_\mathbf{w} L(\mathbf{w}_k), \,\ \mathbf{d}_k = \gamma \mathbf{d}_{k - 1} + (1 - \gamma) \nabla_\mathbf{w} L(\mathbf{w}_k) \odot \nabla_\mathbf{w} L(\mathbf{w}_k), \,\ \mathbf{u}_k = \gamma \mathbf{u}_{k - 1} + (1 - \gamma) \Delta \mathbf{w}_{k-1} \odot \Delta \mathbf{w}_{k-1}, \,\ \Delta \mathbf{w}_{k-1} = - \frac{\sqrt{\epsilon + \mathbf{u}_{k-1}}}{\sqrt{\epsilon + \mathbf{d}_{k-1}}} \odot \nabla_\mathbf{w} L(\mathbf{w}_{k-1}) \,\ \text{for} \,\ k \ge 2.
$$
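
A minimal scalar sketch of AdaDelta, assuming the commonly used formulation in which $u$ is a moving average of the squared parameter updates and replaces the learning rate $\eta$. The toy loss $L(w) = \frac{1}{2}w^2$ and the values of `gamma` and `eps` are assumptions for illustration.

```python
import math


def adadelta(w, grad, gamma=0.9, eps=1e-6, steps=100):
    """AdaDelta for a scalar parameter: no learning rate, only two moving averages."""
    d = 0.0  # moving average of squared gradients
    u = 0.0  # moving average of squared updates
    for _ in range(steps):
        g = grad(w)
        d = gamma * d + (1.0 - gamma) * g * g
        delta = -math.sqrt(eps + u) / math.sqrt(eps + d) * g  # uses u from earlier updates
        u = gamma * u + (1.0 - gamma) * delta * delta
        w = w + delta
    return w


w_one = adadelta(10.0, lambda w: w, steps=1)     # gradient of 0.5 * w^2 is w
w_many = adadelta(10.0, lambda w: w, steps=100)
print(w_one, w_many)  # steps start tiny (scale set by eps) and move w toward 0
```

The first updates are very small because $u$ starts at zero, so the initial step scale is set entirely by $\epsilon$.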

#### 4.6. Hybrid Update Rule with Momentum and Adaptive Learning Rate

- Adaptive Moment Estimation(Adam) : Momentum + RMSProp
- Nesterov-accelerated Adaptive Moment Estimation(NAdam) : NAG + Adam
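
A minimal scalar sketch of Adam, combining a momentum-style average of gradients with an RMSProp-style average of squared gradients. The bias-correction terms and the hyperparameter values (`beta1=0.9`, `beta2=0.999`, `eps=1e-8`) follow the commonly used formulation and are assumptions here, since the note does not state them.

```python
import math


def adam(w, grad, eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1):
    """Adam for a scalar parameter."""
    m = 0.0  # moving average of gradients (momentum part)
    v = 0.0  # moving average of squared gradients (RMSProp part)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)  # bias correction for the zero initialization
        v_hat = v / (1.0 - beta2 ** t)
        w = w - eta * m_hat / (math.sqrt(v_hat) + eps)
    return w


w_first = adam(10.0, lambda w: w, steps=1)   # gradient of 0.5 * w^2 is w
w_later = adam(10.0, lambda w: w, steps=500)
print(w_first, w_later)  # the first step has size close to eta, whatever the gradient scale
```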

#### 4.7. Learning Rate Scheduler

When implementing neural networks, the optimizer is important, but so is the learning rate scheduler. The following learning rate schedulers are commonly used in PyTorch.

1. Constant Learning Rate
2. LambdaLR

```python
from torch.optim.lr_scheduler import LambdaLR
scheduler = LambdaLR(optimizer, lr_lambda=lambda epoch: 0.95 ** epoch)  # lr = initial lr * 0.95^epoch
```

```python
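from torch.optim.lr_scheduler import LambdaLR

def func(epoch):
    # Piecewise-constant multiplier applied to the initial learning rate.
    if epoch < 40:
        return 0.5
    elif epoch < 70:
        return 0.5 ** 2
    elif epoch < 90:
        return 0.5 ** 3
    else:
        return 0.5 ** 4

scheduler = LambdaLR(optimizer, lr_lambda=func)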
```

3. StepLR

```python
from torch.optim.lr_scheduler import StepLR
scheduler = StepLR(optimizer, step_size=200, gamma=0.5)  # halve the lr every 200 epochs
```

4. MultiStepLR

```python
from torch.optim.lr_scheduler import MultiStepLR
scheduler = MultiStepLR(optimizer, milestones=[200, 350], gamma=0.5)  # halve the lr at epochs 200 and 350
```

5. ExponentialLR

```python
from torch.optim.lr_scheduler import ExponentialLR
scheduler = ExponentialLR(optimizer, gamma=0.95)  # multiply the lr by 0.95 every epoch
```

6. CosineAnnealingLR

```python
from torch.optim.lr_scheduler import CosineAnnealingLR
scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=0.001)  # cosine decay to eta_min over 100 epochs
```

7. CosineAnnealingWarmRestarts

```python
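from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=50, T_mult=2, eta_min=0.001)  # first cycle 50 epochs, each next cycle twice as long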
```

8. Custom CosineAnnealingWarmRestarts

```python
import math
import torch.optim as optim
from torch.optim.lr_scheduler import _LRScheduler

class CosineAnnealingWarmUpRestarts(_LRScheduler):
    def __init__(self, optimizer, T_0, T_mult=1, eta_max=0.1, T_up=0, gamma=1., last_epoch=-1):
        if T_0 <= 0 or not isinstance(T_0, int):
            raise ValueError("Expected positive integer T_0, but got {}".format(T_0))
        if T_mult < 1 or not isinstance(T_mult, int):
            raise ValueError("Expected integer T_mult >= 1, but got {}".format(T_mult))
        if T_up < 0 or not isinstance(T_up, int):
            raise ValueError("Expected non-negative integer T_up, but got {}".format(T_up))
        self.T_0 = T_0
        self.T_mult = T_mult
        self.base_eta_max = eta_max
        self.eta_max = eta_max
        self.T_up = T_up
        self.T_i = T_0
        self.gamma = gamma
        self.cycle = 0
        self.T_cur = last_epoch
        super(CosineAnnealingWarmUpRestarts, self).__init__(optimizer, last_epoch)
    
    def get_lr(self):
        if self.T_cur == -1:
            return self.base_lrs
        elif self.T_cur < self.T_up:
            return [(self.eta_max - base_lr)*self.T_cur / self.T_up + base_lr for base_lr in self.base_lrs]
        else:
            return [base_lr + (self.eta_max - base_lr) * (1 + math.cos(math.pi * (self.T_cur-self.T_up) / (self.T_i - self.T_up))) / 2
                    for base_lr in self.base_lrs]

    def step(self, epoch=None):
        if epoch is None:
            epoch = self.last_epoch + 1
            self.T_cur = self.T_cur + 1
            if self.T_cur >= self.T_i:
                self.cycle += 1
                self.T_cur = self.T_cur - self.T_i
                self.T_i = (self.T_i - self.T_up) * self.T_mult + self.T_up
        else:
            if epoch >= self.T_0:
                if self.T_mult == 1:
                    self.T_cur = epoch % self.T_0
                    self.cycle = epoch // self.T_0
                else:
                    n = int(math.log((epoch / self.T_0 * (self.T_mult - 1) + 1), self.T_mult))
                    self.cycle = n
                    self.T_cur = epoch - self.T_0 * (self.T_mult ** n - 1) / (self.T_mult - 1)
                    self.T_i = self.T_0 * self.T_mult ** (n)
            else:
                self.T_i = self.T_0
                self.T_cur = epoch
                
        self.eta_max = self.base_eta_max * (self.gamma**self.cycle)
        self.last_epoch = math.floor(epoch)
        for param_group, lr in zip(self.optimizer.param_groups, self.get_lr()):
            param_group['lr'] = lr
            
            
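# `model` is assumed to be an existing torch.nn.Module.
# lr=0 so that the warm-up ramps linearly from 0 up to eta_max over the first T_up epochs.
optimizer = optim.Adam(model.parameters(), lr=0)
scheduler = CosineAnnealingWarmUpRestarts(optimizer, T_0=150, T_mult=1, eta_max=0.1, T_up=10, gamma=0.5)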
```
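
To see what the built-in schedulers above compute, their learning rates can be written in closed form. The sketch below is a plain-Python illustration rather than PyTorch code. It assumes the scheduler is stepped once per epoch starting from epoch 0, and the initial learning rate `0.1` is a hypothetical value.

```python
import bisect
import math


def step_lr(base_lr, epoch, step_size, gamma):
    """StepLR: multiply by gamma once every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


def multistep_lr(base_lr, epoch, milestones, gamma):
    """MultiStepLR: multiply by gamma once for each milestone already reached."""
    return base_lr * gamma ** bisect.bisect_right(milestones, epoch)


def exponential_lr(base_lr, epoch, gamma):
    """ExponentialLR: multiply by gamma every epoch."""
    return base_lr * gamma ** epoch


def cosine_annealing_lr(base_lr, epoch, T_max, eta_min):
    """CosineAnnealingLR: half a cosine period from base_lr down to eta_min."""
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / T_max)) / 2


for epoch in (0, 50, 100, 200, 350):
    print(epoch,
          step_lr(0.1, epoch, step_size=200, gamma=0.5),
          multistep_lr(0.1, epoch, milestones=[200, 350], gamma=0.5),
          exponential_lr(0.1, epoch, gamma=0.95))

# Values within the first annealing period (T_max=100).
print([cosine_annealing_lr(0.1, e, T_max=100, eta_min=0.001) for e in (0, 50, 100)])
```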