# Gradient Descent and Its Variants

Gradient Descent (GD) is a fundamental optimization algorithm widely used in machine learning and mathematical optimization. It serves as the backbone for training various models, including linear regression, neural networks, and support vector machines. The main objective of Gradient Descent is to find the model parameters that minimize a given objective function (a maximization problem can be handled by minimizing the negated objective).

In the context of supervised learning, the objective function typically represents the loss or cost function, which measures the discrepancy between the model's predictions and the actual ground-truth labels. The goal of Gradient Descent is to iteratively update the model parameters in a way that reduces this discrepancy, effectively improving the model's performance.


## Mathematical Formulation

Let's define the following terms for the mathematical formulation of GD:

- $\theta$: Model parameters that we want to optimize.
- $J(\theta)$: The loss function that measures the difference between predicted and true labels.
- $\eta$: Learning rate, which controls the step size of parameter updates.
- $X$: The input data.
- $y$: The true labels corresponding to the input data. 
- $f_\theta$: The parametric model, which produces predictions $\hat{y} = f_\theta(x)$.

We define the loss function as the sum of the per-example losses $\mathcal{L}$ over the $N$ training examples:  
$J(\theta) = \sum_{n=1}^{N}\mathcal{L}(y_n, f_{\theta}(x_n))$  

The objective of GD is to find the optimal parameters $\theta^*$ that minimize the loss function $J(\theta)$:  
$\theta^* = \arg \min_{\theta} J(\theta)$


The update rule for GD can be expressed as:

$\theta_{t+1} = \theta_{t} - \eta \cdot \nabla J(\theta_{t})$

where $\nabla J(\theta)$ represents the gradient of the loss function with respect to the model parameters $\theta$.
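As a minimal sketch of this update rule, the snippet below applies it to the one-dimensional objective $J(\theta) = (\theta - 3)^2$. The function name `gradient_descent`, the number of steps, and the starting point are illustrative assumptions, not part of any library.

```python
import numpy as np


def gradient_descent(grad_fn, theta0, eta=0.1, n_steps=100):
    """Plain GD: repeat theta <- theta - eta * grad J(theta)."""
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(n_steps):
        theta = theta - eta * grad_fn(theta)
    return theta


# J(theta) = (theta - 3)^2, so grad J(theta) = 2 * (theta - 3)
theta_opt = gradient_descent(lambda th: 2.0 * (th - 3.0), theta0=[20.0])
print(theta_opt)
```

With this learning rate, each step shrinks the distance to the minimizer by a constant factor, so the iterate approaches $\theta^* = 3$ geometrically.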

## Variants of GD

GD has several variants that incorporate additional techniques to improve convergence and stability. Some of the popular variants include:

1. **Mini-batch SGD**: Instead of using all data points, it uses a small mini-batch of data points to compute the gradient at each iteration. This strikes a balance between computational efficiency and the stability of the gradient estimate.

2. **Momentum**: Momentum introduces a velocity term that helps the optimization process dampen oscillations and accelerate convergence. The velocity accumulates an exponentially decaying sum of past gradients, and the parameters are updated along it.

3. **Nesterov Momentum**: Nesterov Momentum is an extension of classic Momentum that takes into account the lookahead position when computing the gradient, leading to faster convergence.

4. **RMSprop**: RMSprop adapts the learning rate for each parameter based on the root mean square of the past gradients. It helps to handle different scales of gradients and improves convergence.

5. **Adam**: Adam combines the ideas of Momentum and RMSprop. It adapts the learning rate for each parameter based on the first and second moments of the past gradients.

Each variant of GD has its strengths and is suited for different scenarios. The choice of the variant often depends on the specific problem and the characteristics of the dataset.
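Classic Momentum is the building block that Nesterov Momentum extends, so a small sketch may help. It assumes the common velocity form $v \leftarrow \mu v - \eta \nabla J(\theta)$, $\theta \leftarrow \theta + v$; the function name `momentum_gd` and the default values are illustrative assumptions.

```python
import numpy as np


def momentum_gd(grad_fn, theta0, eta=0.1, mu=0.9, n_steps=300):
    """GD with classic momentum (velocity form, assumed here)."""
    theta = np.asarray(theta0, dtype=float).copy()
    velocity = np.zeros_like(theta)
    for _ in range(n_steps):
        # The velocity is a decaying sum of past (scaled) gradients.
        velocity = mu * velocity - eta * grad_fn(theta)
        theta = theta + velocity
    return theta


# J(theta) = (theta - 3)^2, so grad J(theta) = 2 * (theta - 3)
theta_opt = momentum_gd(lambda th: 2.0 * (th - 3.0), theta0=[20.0])
print(theta_opt)
```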


## Mini-batch SGD

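Let us assume that the dataset of $N$ examples is split into $M$ mini-batches, each of size $B$ (so that $N = M \cdot B$). Then we obtain the following sequence of mini-batches:  
$\underbrace{(x_1, y_1), \dots, (x_B, y_B)}_{\text{mini-batch } 1},\ \dots,\ \underbrace{(x_{(M-1)B+1}, y_{(M-1)B+1}), \dots, (x_{MB}, y_{MB})}_{\text{mini-batch } M}$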
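One way to iterate over mini-batches is sketched below for a noise-free linear regression problem. The helper names (`minibatch_sgd`, `mse_grad`), the synthetic data, and the use of a mean (rather than a sum) over each mini-batch are assumptions made for illustration.

```python
import numpy as np


def minibatch_sgd(X, y, grad_fn, theta0, eta=0.05, batch_size=20,
                  n_epochs=50, seed=0):
    """Mini-batch SGD: one parameter update per mini-batch."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    n = X.shape[0]
    for _ in range(n_epochs):
        order = rng.permutation(n)  # reshuffle the data every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # one mini-batch
            theta = theta - eta * grad_fn(theta, X[idx], y[idx])
    return theta


def mse_grad(theta, Xb, yb):
    # Gradient of the mean squared error over the mini-batch.
    return 2.0 * Xb.T @ (Xb @ theta - yb) / len(yb)


rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
true_theta = np.array([2.0, -1.0])
y = X @ true_theta  # noise-free targets, so true_theta is the exact minimizer
theta_hat = minibatch_sgd(X, y, mse_grad, theta0=np.zeros(2))
print(theta_hat)
```

Reshuffling each epoch is a common choice, not a requirement; a fixed split into mini-batches would also match the description above.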

## Nesterov Momentum

Nesterov Momentum is an extension of the classic Momentum optimization algorithm. It addresses the issue of oscillations in the optimization process, allowing for faster convergence. In Nesterov Momentum, the algorithm considers the lookahead position when computing the gradient, which helps to better estimate the direction of the gradient and adjust the velocity accordingly.

The update rule for Nesterov Momentum is as follows:

1. Calculate the lookahead position $\theta_{\text{lookahead}}$, where $\mu$ is the momentum coefficient:
   $$
   \theta_{\text{lookahead}} = \theta + \mu \cdot \text{velocity}
   $$

2. Compute the gradient of the loss function with respect to the lookahead position $\theta_{\text{lookahead}}$:
   $$
   \nabla J(\theta_{\text{lookahead}})
   $$

3. Update the velocity using the previous velocity, the momentum coefficient $\mu$, the learning rate $\eta$, and the gradient at the lookahead position:
   $$
   \text{velocity} = \mu \cdot \text{velocity} - \eta \cdot \nabla J(\theta_{\text{lookahead}})
   $$

4. Finally, update the model parameters using the updated velocity (the learning rate is already included in it):
   $$
   \theta = \theta + \text{velocity}
   $$

Because the gradient is evaluated at the lookahead position, Nesterov Momentum can correct the velocity before an overshoot happens. In practice it often reduces oscillations and converges faster than classic Momentum.
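The four steps above can be sketched in NumPy as follows. The function name `nesterov_gd`, the default hyperparameters, and the test objective $J(\theta) = (\theta - 3)^2$ are illustrative assumptions.

```python
import numpy as np


def nesterov_gd(grad_fn, theta0, eta=0.1, mu=0.9, n_steps=300):
    """Gradient descent with Nesterov momentum, following steps 1 to 4."""
    theta = np.asarray(theta0, dtype=float).copy()
    velocity = np.zeros_like(theta)
    for _ in range(n_steps):
        lookahead = theta + mu * velocity      # step 1: lookahead position
        grad = grad_fn(lookahead)              # step 2: gradient at lookahead
        velocity = mu * velocity - eta * grad  # step 3: velocity update
        theta = theta + velocity               # step 4: parameter update
    return theta


# J(theta) = (theta - 3)^2, so grad J(theta) = 2 * (theta - 3)
theta_opt = nesterov_gd(lambda th: 2.0 * (th - 3.0), theta0=[20.0])
print(theta_opt)
```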

## RMSProp

RMSProp (Root Mean Square Propagation) is an adaptive learning rate optimization algorithm. It helps to handle the issues of varying gradient scales and ill-conditioned optimization surfaces. RMSProp adaptively adjusts the learning rate for each parameter based on the root mean square of the past gradients.

The update rule for RMSProp is as follows:

1. Initialize a moving average of the squared gradients:
   $$
   \text{squared\_gradients} = 0
   $$

2. Update the moving average of the squared gradients using the decay rate $\gamma$ (the square is taken element-wise):
   $$
   \text{squared\_gradients} = \gamma \cdot \text{squared\_gradients} + (1 - \gamma) \cdot (\nabla J(\theta))^2
   $$

3. Update the model parameters using the learning rate $\eta$ and the scaled gradient, where $\epsilon$ is a small constant added for numerical stability:
   $$
   \theta = \theta - \frac{\eta}{\sqrt{\text{squared\_gradients} + \epsilon}} \cdot \nabla J(\theta)
   $$

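RMSProp scales each parameter's step by the recent magnitude of its gradients: parameters with consistently large gradients take smaller steps, and parameters with small gradients take larger ones. This makes the method less sensitive to differences in gradient scale across parameters.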
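A minimal NumPy sketch of these steps is shown below. The function name `rmsprop` and the default values for `eta`, `gamma`, and `eps` are assumptions chosen for illustration. With a constant learning rate, the iterate is only expected to settle close to the minimizer rather than converge to it exactly.

```python
import numpy as np


def rmsprop(grad_fn, theta0, eta=0.05, gamma=0.9, eps=1e-8, n_steps=1000):
    """RMSProp following steps 1 to 3."""
    theta = np.asarray(theta0, dtype=float).copy()
    squared_gradients = np.zeros_like(theta)  # step 1: initialize the average
    for _ in range(n_steps):
        grad = grad_fn(theta)
        # step 2: moving average of element-wise squared gradients
        squared_gradients = gamma * squared_gradients + (1 - gamma) * grad**2
        # step 3: per-parameter scaled update
        theta = theta - eta / np.sqrt(squared_gradients + eps) * grad
    return theta


# J(theta) = (theta - 3)^2, so grad J(theta) = 2 * (theta - 3)
theta_opt = rmsprop(lambda th: 2.0 * (th - 3.0), theta0=[20.0])
print(theta_opt)  # expected to land close to 3
```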

## Adam

Adam (Adaptive Moment Estimation) is another adaptive learning rate optimization algorithm that combines the ideas of Momentum and RMSProp. It not only adapts the learning rate for each parameter but also keeps track of the first and second moments of the past gradients.

The update rule for Adam is as follows:

1. Initialize the first and second moment estimates:
   $$
   \text{m} = 0, \quad \text{v} = 0
   $$

2. Update the first moment estimate:
   $$
   \text{m} = \beta_1 \cdot \text{m} + (1 - \beta_1) \cdot \nabla J(\theta)
   $$

3. Update the second moment estimate:
   $$
   \text{v} = \beta_2 \cdot \text{v} + (1 - \beta_2) \cdot (\nabla J(\theta))^2
   $$

4. Compute bias-corrected first and second moment estimates, where $t$ is the iteration number (starting at 1):
   $$
   \hat{\text{m}} = \frac{\text{m}}{1 - \beta_1^t}, \quad \hat{\text{v}} = \frac{\text{v}}{1 - \beta_2^t}
   $$

5. Update the model parameters using the learning rate $\eta$ and the bias-corrected estimates:
   $$
   \theta = \theta - \frac{\eta}{\sqrt{\hat{\text{v}}} + \epsilon} \cdot \hat{\text{m}}
   $$

Adam efficiently combines the benefits of both Momentum and RMSProp, making it a popular choice for optimizing deep learning models.
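The five steps above can be sketched in plain NumPy as follows. This is a standalone illustration, not the implementation behind `mllib.optimizers.Adam`; the function name `adam` and the default values for `beta1`, `beta2`, and `eps` are assumptions (they are the commonly used defaults).

```python
import numpy as np


def adam(grad_fn, theta0, eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8,
         n_steps=1000):
    """Adam following steps 1 to 5."""
    theta = np.asarray(theta0, dtype=float).copy()
    m = np.zeros_like(theta)  # step 1: first moment estimate
    v = np.zeros_like(theta)  # step 1: second moment estimate
    for t in range(1, n_steps + 1):
        grad = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * grad     # step 2
        v = beta2 * v + (1 - beta2) * grad**2  # step 3
        m_hat = m / (1 - beta1**t)             # step 4: bias correction
        v_hat = v / (1 - beta2**t)
        theta = theta - eta / (np.sqrt(v_hat) + eps) * m_hat  # step 5
    return theta


# J(theta) = (theta - 3)^2, so grad J(theta) = 2 * (theta - 3)
theta_opt = adam(lambda th: 2.0 * (th - 3.0), theta0=[20.0])
print(theta_opt)
```

The iteration counter starts at 1 so that the bias-correction denominators $1 - \beta_1^t$ and $1 - \beta_2^t$ are never zero.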

These optimization algorithms offer various improvements over traditional Gradient Descent and play a crucial role in training complex machine learning models effectively and efficiently. The choice of the optimization algorithm depends on the specific problem, dataset, and model architecture.


In [1]:
from mllib.optimizers import Adam
import numpy as np


def loss_fn(z, y, x):
    # f(x) = x^2 - 6x + 9 = (x - 3)^2, minimized at x = 3 (z and y are unused)
    return x**2 - 6*x + 9


def gradient_fn(z, y, x):
    # f'(x) = 2x - 6 (z and y are unused)
    return 2*x - 6


adam = Adam(
    gradient_fn=gradient_fn,
    parameters=np.array([20.]),
    loss_fn=loss_fn,
    learning_rate=1e-1,
    max_iter=1000,
    tolerance=1e-12,
    batch_size=1,
)
X = y = np.array([0])  # dummy data: the objective does not depend on X or y
x_opt = adam.optimize(X, y)
# The minimizer is x = 3; the tolerance must be small for the check to be meaningful.
assert abs(x_opt - 3) <= 1e-2, "Wrong answer"
print(f"x_opt: {x_opt} f_opt: {loss_fn(None, None, x_opt)} "
      f"grad_fn: {gradient_fn(None, None, x_opt)}")


x_opt: [3.00002925] f_opt: [8.55525428e-10] grad_fn: [5.84986886e-05]
