# IMPORTING NECESSARY LIBRARIES

In [2]:
import numpy as np
import torch
import torch.nn as nn

# OPTIMIZATION ALGORITHMS

Optimization algorithms minimize (or maximize) a loss function by adjusting the model's parameters during training. In PyTorch they live in `torch.optim`.

They include:

1. Gradient descent
2. Momentum
3. Nesterov Accelerated Gradient (NAG)
4. Adagrad
5. RMSProp
6. Adam (Adaptive Moment Estimation)
7. AdamW (Weight decay Adam)

## Gradient descent

Gradient descent adjusts the parameters by computing the gradient of the loss function and stepping in the opposite direction. Its variants differ in how much data is used for each update:

1. Batch gradient descent (the entire dataset per update)
2. Stochastic gradient descent (one sample per update)
3. Mini-batch gradient descent (a small subset of the dataset per update)

The update rule for gradient descent: ***param = param - learning_rate * gradient of the loss function w.r.t. param (theta)***
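The three variants differ only in how many samples feed each update. As a rough, pure-Python sketch (the function `fit_slope`, the toy data, and the hyperparameters are all illustrative assumptions), the same loop covers them all by changing `batch_size`:

```python
import random


def fit_slope(xs, ys, batch_size, lr=0.01, epochs=200, seed=0):
    """Fit y = w * x by gradient descent on the mean squared error."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        # Walk through the shuffled data in chunks of `batch_size`
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of mean((w * x - y)^2) w.r.t. w over this batch
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            # Update rule: param = param - learning_rate * gradient
            w -= lr * grad
    return w


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by y = 2x, so the ideal w is 2

w_batch = fit_slope(xs, ys, batch_size=len(xs))  # batch gradient descent
w_sgd = fit_slope(xs, ys, batch_size=1)          # stochastic gradient descent
w_mini = fit_slope(xs, ys, batch_size=2)         # mini-batch gradient descent
```

On this noise-free toy problem all three settings should end up close to `w = 2`; on real data they trade off gradient noise against the cost of each update.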

In [None]:
class SGD:
    """Minimal SGD sketch. Assumes a hypothetical `model` whose `parameters()`
    and `gradients()` methods return dicts keyed by parameter name."""

    def __init__(self, model, learning_rate=0.01):
        self.model = model
        self.lr = learning_rate

    def step(self):
        params = self.model.parameters()
        # self.model.gradients() returns a dictionary of the computed gradient for each parameter name
        for name, grad in self.model.gradients().items():
            # update rule: param = param - lr * grad
            params[name] -= self.lr * grad

In PyTorch, `torch.optim.SGD` performs the stochastic gradient descent update automatically, using the gradients computed by `loss.backward()`:
```python
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # created once, outside the training loop
optimizer.step()  # called inside the training loop, once per batch
```
PyTorch updates the model's parameters automatically when you call `optimizer.step()`, which goes immediately after `loss.backward()`. Before the backward pass, clear the previous gradients with `optimizer.zero_grad()`; otherwise they accumulate across batches.

All of this happens inside the training loop, once for every batch that is loaded.
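Putting those calls together, a minimal end-to-end sketch might look like the following. The toy data (`y = 3x + 1`), the `nn.Linear` model, the `MSELoss` criterion, and the hyperparameters are assumptions chosen only to make the example runnable:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Toy regression data: y = 3x + 1 (illustrative only)
inputs = torch.linspace(-1, 1, 32).unsqueeze(1)
targets = 3 * inputs + 1

model = nn.Linear(1, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # created once, outside the loop

loader = DataLoader(TensorDataset(inputs, targets), batch_size=8, shuffle=True)

for epoch in range(200):
    for batch_inputs, batch_targets in loader:
        optimizer.zero_grad()                     # clear gradients from the previous batch
        outputs = model(batch_inputs)             # forward pass
        loss = criterion(outputs, batch_targets)
        loss.backward()                           # compute gradients
        optimizer.step()                          # update weights and bias

final_loss = criterion(model(inputs), targets).item()
```

After training, `model.weight` and `model.bias` should sit close to 3 and 1, and `final_loss` should be close to zero.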

Parameters that can be passed to `torch.optim.SGD`:

1. Required
   - **model.parameters()**: the learnable parameters of the model (normally the weights and biases). These are the values that keep being updated until a good solution is found.
   - **lr**: the learning rate, which controls the step size of each parameter update
2. Optional
   - **momentum**: adds a fraction of the previous update (the velocity) to the current one to accelerate convergence
   - **weight_decay**: adds an L2 penalty (as in ridge regression) to help prevent overfitting
   - **dampening**: scales down the contribution of the current gradient to the momentum buffer
   - **nesterov**: enables Nesterov momentum, which looks ahead by applying momentum before the gradient update
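To see how these options interact, here is a rough pure-Python sketch of one update for a single scalar parameter. It is modeled on the algorithm given in the `torch.optim.SGD` documentation, but `sgd_step` is a hypothetical helper written for illustration, not PyTorch's actual implementation:

```python
def sgd_step(param, grad, buf=None, lr=0.01, momentum=0.0,
             dampening=0.0, weight_decay=0.0, nesterov=False):
    """One SGD update for a scalar. Returns (new_param, new_momentum_buffer)."""
    if weight_decay != 0:
        # L2 penalty: pull the parameter toward zero
        grad = grad + weight_decay * param
    if momentum != 0:
        if buf is None:
            buf = grad  # first step: the buffer starts as the gradient
        else:
            buf = momentum * buf + (1 - dampening) * grad
        if nesterov:
            grad = grad + momentum * buf  # look-ahead correction
        else:
            grad = buf
    return param - lr * grad, buf


p_plain, _ = sgd_step(1.0, 0.5, lr=0.1)
p_decay, _ = sgd_step(1.0, 0.5, lr=0.1, weight_decay=0.1)
p_mom1, buf = sgd_step(1.0, 0.5, lr=0.1, momentum=0.9)
p_mom2, buf = sgd_step(p_mom1, 0.5, buf=buf, lr=0.1, momentum=0.9)
p_nag, _ = sgd_step(1.0, 0.5, lr=0.1, momentum=0.9, nesterov=True)
```

With the same gradient on two consecutive steps, the second momentum step moves further than the first because the buffer has grown from 0.5 to 0.95.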

## Momentum

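Momentum accelerates gradient descent by accumulating a velocity vector that takes past gradients into account.

It is an enhancement of SGD that speeds up convergence by using a moving average of past gradients. It maintains a velocity `V_t`, and the parameters are updated using that velocity instead of the raw gradient.

First compute the velocity: `V_t = (momentum factor)(previous velocity) + (1 - momentum factor)(current gradient)`. Then use the velocity to update the parameters: `param -= lr * V_t`. Because consistent gradient directions add up in the velocity while oscillating ones cancel out, convergence is faster. Note that `torch.optim.SGD` uses a slightly different form, `V_t = momentum * V_{t-1} + (1 - dampening) * gradient`, where `dampening` defaults to 0.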

**Lower momentum** reduces the influence of past gradients, making updates more sensitive to new gradients, while **higher momentum** increases the influence of past gradients, smoothing updates but risking overshooting the minimum.
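A small pure-Python sketch of the moving-average form of the velocity update makes the effect of the momentum factor visible. The helper names and the gradient sequence are illustrative assumptions:

```python
def momentum_step(param, grad, velocity, lr=0.1, beta=0.9):
    """Moving-average momentum: blend the old velocity with the new gradient."""
    velocity = beta * velocity + (1 - beta) * grad
    param = param - lr * velocity
    return param, velocity


def velocities(grads, beta):
    """Return the velocity after each gradient in `grads`."""
    param, velocity, history = 0.0, 0.0, []
    for grad in grads:
        param, velocity = momentum_step(param, grad, velocity, beta=beta)
        history.append(velocity)
    return history


grads = [1.0, 1.0, 1.0, -1.0]       # the gradient flips sign on the last step
high = velocities(grads, beta=0.9)  # roughly [0.1, 0.19, 0.271, 0.1439]
low = velocities(grads, beta=0.1)   # roughly [0.9, 0.99, 0.999, -0.8001]
```

With `beta=0.9` the velocity builds slowly and barely reacts to the sign flip, while with `beta=0.1` it follows the newest gradient almost immediately.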

With momentum:

1. Convergence speed: **Faster**
2. Oscillations: **Lower**
3. Escaping local minima: **Easier**
4. Stability: **More stable**

In PyTorch, momentum is an optional parameter of `torch.optim.SGD`.
```py
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```

Again in the training loop, remember:
```py
optimizer.zero_grad()
# forward pass
outputs = model(inputs)
loss = criterion(outputs, targets)
# backward pass
loss.backward()
# weight update
optimizer.step()
# repeat the cycle for the next batch
```

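## Nesterov Accelerated Gradient (NAG)

Similar to momentum, but it looks ahead: the gradient is calculated at the position the current velocity is about to carry the parameters to, rather than at the current position.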
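As a rough illustration, the classic look-ahead formulation can be sketched in pure Python on the toy function `f(x) = x^2`. The function, the starting point, and the hyperparameters are assumptions, and PyTorch's `nesterov=True` option uses a reformulated version of this update rather than this exact code:

```python
def grad_f(x):
    """Gradient of the toy objective f(x) = x^2."""
    return 2.0 * x


def nag_step(param, velocity, lr=0.1, beta=0.9):
    # Look ahead to where the current velocity would carry the parameter
    lookahead = param + beta * velocity
    # Evaluate the gradient there, not at the current position
    velocity = beta * velocity - lr * grad_f(lookahead)
    return param + velocity, velocity


param, velocity = 5.0, 0.0
for _ in range(200):
    param, velocity = nag_step(param, velocity)
```

After the loop, `param` should be very close to the minimum at 0.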

## Adagrad

Adjusts the learning rate for each parameter individually based on its past gradients: parameters with large accumulated squared gradients get smaller steps.
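A minimal pure-Python sketch of the idea for a single scalar parameter follows. The helper `adagrad_step` and the constant gradient are illustrative assumptions:

```python
import math


def adagrad_step(param, grad, grad_sq_sum, lr=0.1, eps=1e-10):
    # Accumulate the sum of all squared gradients seen so far
    grad_sq_sum += grad ** 2
    # Scale the step by the inverse square root of that sum
    step = lr * grad / (math.sqrt(grad_sq_sum) + eps)
    return param - step, grad_sq_sum, step


param, grad_sq_sum = 1.0, 0.0
steps = []
for _ in range(3):
    param, grad_sq_sum, step = adagrad_step(param, 1.0, grad_sq_sum)
    steps.append(step)
```

Even though the gradient stays at 1.0, the step shrinks each time (about 0.1, then 0.1/sqrt(2), then 0.1/sqrt(3)) because the accumulated sum only grows.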

## RMSProp

Fixes Adagrad's decaying learning rate problem by using a moving average of squared gradients instead of their running sum.
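A small pure-Python sketch for a single scalar parameter shows the difference. The helper `rmsprop_step`, the smoothing factor `alpha`, and the constant gradient are illustrative assumptions:

```python
import math


def rmsprop_step(param, grad, sq_avg, lr=0.01, alpha=0.9, eps=1e-8):
    # Exponential moving average of squared gradients (old values fade out)
    sq_avg = alpha * sq_avg + (1 - alpha) * grad ** 2
    step = lr * grad / (math.sqrt(sq_avg) + eps)
    return param - step, sq_avg, step


param, sq_avg = 1.0, 0.0
steps = []
for _ in range(200):
    param, sq_avg, step = rmsprop_step(param, 1.0, sq_avg)
    steps.append(step)
```

With a constant gradient, the step settles near `lr` instead of shrinking toward zero, because the moving average levels off rather than growing without bound.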

## Adam (Adaptive Moment Estimation)

Combines momentum and RMSProp for adaptive learning rates.
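The combination can be sketched in pure Python for a single scalar parameter. The helper `adam_step` is an illustrative assumption; the default values shown are the commonly used ones:

```python
import math


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Momentum part: moving average of gradients
    m = beta1 * m + (1 - beta1) * grad
    # RMSProp part: moving average of squared gradients
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction, since m and v start at zero (t counts steps from 1)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# First step from param = 1.0 with a very large and a very small gradient
after_big, _, _ = adam_step(1.0, 100.0, m=0.0, v=0.0, t=1)
after_small, _, _ = adam_step(1.0, 0.01, m=0.0, v=0.0, t=1)
```

Both calls move the parameter by roughly `lr`, which illustrates how the division by the squared-gradient average makes the step size largely independent of the gradient's scale.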

## AdamW (Weight decay Adam)

Variant of Adam that decouples weight decay from the gradient-based update step.
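To illustrate what "decoupled" means, the pure-Python sketch below compares Adam with an L2 penalty folded into the gradient against an AdamW-style step that shrinks the parameter directly. All helper names and hyperparameter values are illustrative assumptions, not PyTorch's implementation:

```python
import math


def adam_update(grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the Adam update amount plus the new moment estimates."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return lr * m_hat / (math.sqrt(v_hat) + eps), m, v


def adam_l2_step(param, grad, m, v, t, lr=0.1, weight_decay=0.1):
    # Coupled: the penalty is added to the gradient and then rescaled by Adam
    grad = grad + weight_decay * param
    update, m, v = adam_update(grad, m, v, t, lr)
    return param - update, m, v


def adamw_step(param, grad, m, v, t, lr=0.1, weight_decay=0.1):
    # Decoupled: shrink the parameter directly, independent of the gradient statistics
    param = param - lr * weight_decay * param
    update, m, v = adam_update(grad, m, v, t, lr)
    return param - update, m, v


# One step from param = 1.0 with a zero loss gradient
p_l2, _, _ = adam_l2_step(1.0, 0.0, m=0.0, v=0.0, t=1)
p_w, _, _ = adamw_step(1.0, 0.0, m=0.0, v=0.0, t=1)
```

With a zero loss gradient, the AdamW-style step shrinks the parameter by exactly `lr * weight_decay` (1.0 to 0.99), while the coupled version moves it by about `lr` (1.0 to roughly 0.9) because the penalty gets normalized like any other gradient.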