# optim

> The `Optim` module in minima is a toolbox for optimizing the parameters of your deep learning models. Built on a high-level,  
> intuitive, and pythonic API, it provides several out-of-the-box optimization strategies, such as Stochastic Gradient Descent (SGD), AdaGrad, RMSProp, and Adam.  
> At the heart of this module lies the abstract `Optimizer` class, which defines a standard interface for all optimization strategies.  
> Each concrete optimizer implements this interface, which ensures consistent usage and makes it easy to swap strategies in your training loop.  

> Among the features of this module are:  
> - Efficient gradient computations and updates.  
> - Advanced optimization strategies with adaptive learning rates.  
> - Easy to extend to custom optimization strategies.  
> - Supports weight decay regularization for avoiding overfitting.  

> Whether you're training a simple linear regression or a complex deep neural network, `Optim` has got you covered. With its simple and consistent interface, the module makes the task of optimizing your models a breeze.


In [None]:
#| default_exp optim

In [None]:
#| export
import minima as mi
from minima.nn import Parameter
from minima.autograd import Tensor
from minima import init
import numpy as np

In [None]:
#| export
class Optimizer:
    """
    Base class for all optimizers. Not meant to be instantiated directly.

    This class represents the abstract concept of an optimizer, and contains methods that 
    all concrete optimizer classes must implement. It is designed to handle the parameters 
    of a machine learning model, providing functionality to perform a step of optimization 
    and to zero out gradients.
    
    Parameters
    ----------
    params : Iterable
        The parameters of the model to be optimized.

    Raises
    ------
    NotImplementedError
        If the `step` method is not implemented in a subclass.
    """
    def __init__(
        self,
        params # The parameters of the model to be optimized.
    ):
        self.params = params

    def step(self):
        """
        Performs a single optimization step.

        This method must be overridden by any subclass to provide the specific optimization logic.
        
        Raises
        ------
        NotImplementedError
            If the method is not implemented in a subclass.
        """
        raise NotImplementedError()

    def zero_grad(self):
        """
        Resets the gradient of every parameter in `params` to `None`.

        Call this between backward passes so that gradients from one pass
        do not accumulate into the next.
        """
        for p in self.params:
            p.grad = None
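To illustrate how a custom strategy plugs into this interface, here is a minimal sketch that runs without minima. The `Optimizer` class below is a stand-in that mirrors the base class above; `ToyParam` and `SignSGD` are hypothetical names introduced only for this example, and the manual gradient assignment stands in for a real backward pass.

```python
class Optimizer:
    """Stand-in mirroring the base class above, so the sketch runs without minima."""
    def __init__(self, params):
        self.params = params

    def step(self):
        raise NotImplementedError()

    def zero_grad(self):
        for p in self.params:
            p.grad = None


class ToyParam:
    """Hypothetical scalar parameter exposing the two attributes an optimizer touches."""
    def __init__(self, data):
        self.data = data
        self.grad = None


class SignSGD(Optimizer):
    """Hypothetical custom strategy: move by the sign of the gradient only."""
    def __init__(self, params, lr=0.1):
        super().__init__(params)
        self.lr = lr

    def step(self):
        for p in self.params:
            if p.grad is None:  # nothing was backpropagated to this parameter
                continue
            sign = (p.grad > 0) - (p.grad < 0)
            p.data = p.data - self.lr * sign


params = [ToyParam(1.0), ToyParam(-2.0)]
opt = SignSGD(params, lr=0.1)
for _ in range(3):
    for p in params:
        p.grad = 2 * p.data  # gradient of data**2, standing in for loss.backward()
    opt.step()
    opt.zero_grad()
```

Because both `step` and `zero_grad` iterate over `self.params`, the sketch passes a list; a one-shot generator would be exhausted after the first pass.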

## SGD Optimizer

This is a PyTorch-style implementation of the classic Stochastic Gradient Descent (SGD) optimizer.

The SGD update is

$$
\theta_{t} = \theta_{t-1} - \alpha \cdot g_{t}
$$

where $\alpha$ is the learning rate, $g_{t}$ is the gradient at time step $t$, and $\theta_{t}$ represents the model parameters at time step $t$.

The learning rate $\alpha$ is a scalar hyperparameter that controls the size of the update at each iteration.

An optional momentum term can be added to the update rule:

$$
\begin{align*}
v_{t} & \leftarrow \mu v_{t-1} + (1-\mu) \cdot g_t \\
\theta_{t} & \leftarrow \theta_{t-1} - \alpha \cdot v_t 
\end{align*}
$$

where $v_{t}$ is the momentum term at time step $t$, and $\mu$ is the momentum factor. Because the gradient is scaled by $(1-\mu)$, $v_t$ is an exponential moving average of past gradients. It grows for dimensions whose gradients keep pointing in the same   
direction and shrinks for dimensions whose gradients change sign, which smooths the update direction and damps oscillations.  
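The momentum rule above can be sketched in plain Python for a single scalar parameter. This is an illustration only, not the minima implementation; the function name `sgd_momentum_step` and the toy objective $f(\theta) = \theta^2$ are assumptions made for the example.

```python
def sgd_momentum_step(theta, v, grad, lr=0.1, mu=0.9):
    """One update of the momentum rule above for a scalar parameter."""
    v = mu * v + (1 - mu) * grad  # exponential moving average of gradients
    theta = theta - lr * v        # move against the averaged gradient
    return theta, v


# Minimise f(theta) = theta**2, whose gradient is 2 * theta.
theta, v = 1.0, 0.0
history = []
for _ in range(200):
    theta, v = sgd_momentum_step(theta, v, grad=2 * theta)
    history.append(theta)
```

Setting `mu=0` recovers the plain SGD update, since `v` then equals the current gradient.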

A weight decay term can also be included, which adds a regularization effect:

$$
\theta_{t} = (1 - \alpha \cdot \lambda) \cdot \theta_{t-1} - \alpha \cdot g_t
$$

where $\lambda$ is the weight decay factor. This results in the model weights shrinking at each time step, which can prevent overfitting by keeping the model complexity in check.
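As a small numeric check of the weight decay rule, the sketch below applies the shrink factor $(1 - \alpha \cdot \lambda)$ and then a plain gradient step. With a zero gradient, the weight should decay geometrically. The helper name `decay_then_step` is hypothetical and the values are chosen only for illustration.

```python
def decay_then_step(theta, grad, lr=0.1, wd=0.01):
    """Shrink the weight by (1 - lr * wd), then take a plain gradient step."""
    theta = theta * (1 - lr * wd)
    return theta - lr * grad


# With a zero gradient, only the decay acts on the weight.
theta = 1.0
for _ in range(100):
    theta = decay_then_step(theta, grad=0.0)

expected = (1 - 0.1 * 0.01) ** 100  # closed form of 100 pure-decay steps
```

Here `theta` and `expected` agree up to floating-point rounding, which shows that the decay compounds multiplicatively across steps.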

In [None]:
#| export
class SGD(Optimizer):
    """
    Implements stochastic gradient descent (optionally with momentum).

    This is a basic optimizer that's suitable for many machine learning models, and is often
    used as a baseline for comparing other optimizers' performance.

    Parameters
    ----------
    params : Iterable
        The parameters of the model to be optimized.
    lr : float, optional
        The learning rate.
    momentum : float, optional
        The momentum factor.
    wd : float, optional
        The weight decay (L2 regularization).
    """
    def __init__(
        self,
        params, # The parameters of the model to be optimized.
        lr=0.01, # The learning rate.
        momentum=0.0, # The momentum factor.
        wd=0.0 # The weight decay (L2 regularization).
    ):
        super().__init__(params)

        self.lr = lr
        self.momentum = momentum
        self.u = {}
        self.wd = wd

    def step(self):
        """
        Performs a single optimization step.

        This method uses the current gradients to adjust the parameters using stochastic gradient descent.
        """
        for self.idx, p in enumerate(self.params):
            self._reg_step(p)
            self._opt_step(p)

    def _opt_step(self, p):
        """
        Performs the optimization step for a single parameter tensor.

        If momentum is set, it applies momentum by using a running average of the previous gradients.
        """
        # TODO: `p.grad` can arrive as float64 while the parameter tensor is float32.
        # Temporary workaround: cast the gradient to float32 explicitly.
        grad = Tensor(p.grad, dtype='float32')
        if self.idx not in self.u:
            self.u[self.idx] = init.zeros(*p.shape)
        self.u[self.idx] = self.momentum * self.u[self.idx] + (1 - self.momentum) * grad
        p.data = p.data - self.lr * self.u[self.idx]

    def _reg_step(self, p):
        """
        Applies weight decay for a single parameter tensor.

        This form of L2 regularization can help prevent overfitting.
        """
        if self.wd != 0:
            p.data *= (1 - self.lr * self.wd)

In [None]:
# import minima as mi
# import numpy as np
# import minima.nn as nn
# from minima.data import *
# X_tr = mi.Tensor(np.random.randn(50, 30))
# y_tr = mi.Tensor(np.random.choice([0, 1], size=(50,)))

In [None]:
# class NeuralNetwork(nn.Module):
#     def __init__(self, input_shape, output_shape):
#         super(NeuralNetwork, self).__init__()
#         self.dense1 = nn.Linear(in_features=input_shape, out_features=24)
#         self.dense2 = nn.Linear(24, 24)
#         self.dense3 = nn.Linear(24, 24)
#         self.dense4 = nn.Linear(24, output_shape)
#         self.relu = nn.ReLU()
#         self.softmax = nn.Softmax()

#     def forward(self, x):
#         x = self.relu(self.dense1(x))
#         x = self.relu(self.dense2(x))
#         x = self.relu(self.dense3(x))
#         # print(self.dense4(x))
#         x = self.dense4(x)
#         return x

# # Create the neural network
# input_shape = 30  # Replace with the actual input shape
# output_shape = 2  # Replace with the actual output shape

# network = NeuralNetwork(input_shape, output_shape)


In [None]:
# # Custom Dataset class
# class MyDataset(Dataset):
#     def __init__(self, X, y):
#         self.X = mi.Tensor(X)
#         self.y = mi.Tensor(y)
    
#     def __len__(self):
#         return len(self.X)
    
#     def __getitem__(self, index):
#         return self.X[index], self.y[index]

# tr_ds = MyDataset(X_tr, y_tr)
# # val_ds = MyDataset(X_val, y_val)

# # Creating the data loader
# batch_size = 2
# tr_dl = DataLoader(tr_ds, batch_size=batch_size, shuffle=True)
# # val_dl = DataLoader(val_ds, batch_size=64, shuffle=True)

In [None]:

# network = NeuralNetwork(input_shape, output_shape)
# opt = SGD(network.parameters(), lr=0.01)
# bce = nn.CrossEntropyLoss()

# network.train()
# num_epochs = 70

# for epoch in range(num_epochs):
#     train_losses = []
#     val_losses = []
#     train_accs = []
#     val_accs = []
    
#     # Training phase
#     network.train()
#     for xb, yb in tr_dl:
#         preds = network(xb)
#         loss = bce(preds, yb)
#         # import pdb; pdb.set_trace()
#         loss.backward()
#         opt.step()
#         opt.zero_grad()
#         train_losses.append(loss)
        
#         Calculate accuracy
#         _, predicted_labels = preds.argmax(axis=1)
#         accuracy = (predicted_labels == yb).sum().item() / yb.shape[0]
#         train_accs.append(accuracy)
    
#     # Validation phase
#     network.eval()
#     with torch.no_grad():
#         for xb_val, yb_val in val_dl:
#             preds_val = network(xb_val)
#             val_loss = bce(preds_val, yb_val)
#             val_losses.append(val_loss.item())
            
#             # Calculate accuracy
#             _, predicted_labels_val = torch.max(preds_val, dim=1)
#             accuracy_val = (predicted_labels_val == yb_val).sum().item() / yb_val.size(0)
#             val_accs.append(accuracy_val)
    
#     avg_train_loss = sum(train_losses) / len(train_losses)
#     avg_val_loss = sum(val_losses) / len(val_losses)
#     avg_train_acc = sum(train_accs) / len(train_accs)
#     avg_val_acc = sum(val_accs) / len(val_accs)
    
#     Print epoch-wise loss and accuracy
#     print(f"epoch {epoch + 1:02d}/{num_epochs:02d} - loss: {avg_train_loss:.4f} - acc: {avg_train_acc:.4f} - val_loss: {avg_val_loss:.4f} - val_acc: {avg_val_acc:.4f}")
#     print(f"epoch {epoch + 1:02d}/{num_epochs:02d} - loss: {avg_train_loss}")


## AdaGrad Optimizer

> **Intuitive explanation**
  
Imagine you're trying to navigate your way across a complex terrain - like a big mountain with lots of hills, valleys and flat areas.   
Your goal is to find the lowest valley. This is much like the problem a neural network faces when it's trying to find the optimal values for its weights - the lowest point in its loss function.

You start at a random point on this terrain, which is like initializing your model with random weights. Now, you need to figure out which direction to go in to get to the lowest point.    
You can't see the whole terrain at once, but you can look around your current location and see which way is downhill. This is like calculating the gradient of the loss function with respect to the weights.   

In a basic gradient descent algorithm, you would just go in the direction of the steepest slope with a fixed step size. But this approach can lead to problems.    
What if you're on a steep slope and you take too big of a step? You might overshoot the valley you're trying to get to. Or, what if you're on a flat part of the   
terrain and you take too small of a step? You might get stuck and not make much progress.  

This is where AdaGrad comes in. AdaGrad is like a smart hiker that adjusts its step size based on the terrain it's currently on.   
If it's on a steep slope, it takes smaller steps to avoid overshooting the valley. If it's on a flat area, it takes bigger steps to make faster progress. 

It does this by keeping track of the sum of the squares of the gradients it has seen so far (a kind of memory) and uses this sum to scale down the step size.   
This means that parameters with larger gradients will have their learning rate decreased more, while parameters with smaller gradients will have their learning rate decreased less.   

The neat thing about AdaGrad is that it adjusts the learning rate for each parameter individually, based on what it's learned about the landscape around that parameter.   
This can be especially useful when dealing with sparse data, where only a few parameters might be updated frequently.
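To make the sparse-data point concrete, here is a minimal sketch with two scalar parameters: one receives a gradient on every step, the other only on the last step. The helper `adagrad_update` and all values are assumptions chosen for illustration, not part of minima.

```python
import math


def adagrad_update(cache, grad, lr=0.1, eps=1e-7):
    """Return (step, new_cache) for one scalar parameter under AdaGrad."""
    cache = cache + grad ** 2                  # accumulate squared gradients
    step = lr / math.sqrt(cache + eps) * grad  # scale the step by the history
    return step, cache


frequent_cache, rare_cache = 0.0, 0.0
for t in range(1, 101):
    # The frequent feature produces a gradient on every step.
    frequent_step, frequent_cache = adagrad_update(frequent_cache, grad=1.0)
    # The rare feature only produces a gradient on the final step.
    rare_grad = 1.0 if t == 100 else 0.0
    rare_step, rare_cache = adagrad_update(rare_cache, grad=rare_grad)
```

On the final iteration both parameters see the same gradient, yet the rarely updated one takes a step roughly ten times larger because its accumulated history is much smaller.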

> **Detailed explanation**

Building on the foundational concepts of Stochastic Gradient Descent (SGD), AdaGrad changes how the step size is chosen.  
Unlike traditional SGD, which uses a single learning rate $\alpha$ across all parameters, AdaGrad uses a per-parameter learning rate. The AdaGrad update is:

$$
\theta_{t} = \theta_{t-1} - \frac{\alpha}{\sqrt{G_t + \epsilon}} \cdot g_{t}
$$

where $\theta_{t}$ represents the model parameters at time step $t$, $\alpha$ is the initial learning rate, $g_{t}$ is the gradient at time step $t$, $G_{t}$ is a diagonal matrix   
where each diagonal element $i, i$ is the sum of the squares of the gradients w.r.t. $\theta_i$ up to time step $t$, and $\epsilon$ is a smoothing term to avoid division by zero (usually on the order of $10^{-7}$).

In AdaGrad, each parameter $\theta_i$ gets its own learning rate, which is inversely proportional to the square root of the sum of the squares of past gradients.  
This sum is the `cache` in the implementation, which holds a history of squared gradients. The greater the sum of the past squared gradients for a particular parameter, the smaller the learning rate for that parameter.

This feature allows AdaGrad to normalize the updates made during training, preventing any single weight from rising too high compared to the others.  
This is particularly beneficial when dealing with sparse data, as the less frequently updated parameters are allowed larger updates when they do get updated, thereby effectively utilizing more neurons for training.

However, it's important to note that AdaGrad has a tendency to decrease the learning rate quite aggressively due to the constant accumulation of the square of gradients in $G_{t}$.   
This can sometimes lead to premature and excessive decay of the learning rate during training, causing the model to stop learning before reaching the optimal point.   
This monotonic decrease in the learning rate is one reason AdaGrad is not as widely used, except in some specific applications.
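The monotonic decay described above is easy to see numerically. The sketch below tracks the effective learning rate $\alpha / \sqrt{G_t + \epsilon}$ for one scalar parameter under a constant gradient; the constant gradient and the hyperparameter values are assumptions chosen to keep the arithmetic simple.

```python
import math

lr, eps = 0.1, 1e-7
cache = 0.0
effective_lrs = []
for _ in range(100):
    grad = 1.0          # constant gradient, for illustration only
    cache += grad ** 2  # G_t only ever grows
    effective_lrs.append(lr / math.sqrt(cache + eps))
```

With a unit gradient the effective learning rate follows roughly $\alpha / \sqrt{t}$, so after 100 steps it has fallen to about a tenth of its initial value and it never recovers.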

To summarize, AdaGrad adds a valuable tool to our optimization toolkit by providing an adaptive learning rate for each individual parameter.  
It elegantly solves the problem of learning rate selection and normalization of parameter updates, and while it has some limitations, it's a   
powerful concept that has paved the way for further innovations in optimization algorithms.

In [None]:
#| export
class AdaGrad(Optimizer):
    """
    Implements AdaGrad optimization algorithm.

    AdaGrad is an optimizer with parameter-wise learning rates, which adapts the learning rate
    based on how frequently a parameter gets updated during training. It's particularly useful
    for sparse data.

    Parameters
    ----------
    params : Iterable
        The parameters of the model to be optimized.
    lr : float, optional
        The initial learning rate.
    wd : float, optional
        The weight decay (L2 regularization).
    eps : float, optional
        A small constant for numerical stability.
    """
    def __init__(
        self,
        params,  # The parameters of the model to be optimized.
        lr=0.001,  # The initial learning rate.
        wd=0.0,  # The weight decay (L2 regularization).
        eps=1e-7,  # A small constant for numerical stability.
    ):
        super().__init__(params)

        self.lr = lr
        self.cache = {}
        self.wd = wd
        self.eps = eps

    def step(self):
        """
        Performs a single optimization step.

        This method uses the current gradients to adjust the parameters using AdaGrad algorithm.
        """
        for self.idx, p in enumerate(self.params):
            self._reg_step(p)
            self._opt_step(p)

    def _opt_step(self, p):
        """
        Performs the optimization step for a single parameter tensor.

        It computes parameter-wise learning rates and updates the parameters accordingly.
        """
        if self.idx not in self.cache:
            self.cache[self.idx] = init.zeros(*p.shape)
        self.cache[self.idx] += p.grad.data ** 2
        p.data = p.data - (self.lr / (self.cache[self.idx] + self.eps) ** 0.5 ) * p.grad.data

    def _reg_step(self, p):
        """
        Applies weight decay for a single parameter tensor.

        This form of L2 regularization can help prevent overfitting.
        """
        if self.wd != 0:
            p.data *= (1 - self.lr * self.wd)

## RMSProp Optimizer

RMSProp, short for Root Mean Square Propagation, is an optimization algorithm that maintains an adaptive learning rate for each parameter in a model. 

It adapts to different landscapes of the loss function by maintaining a moving (or 'running') average  
of the squared gradients, effectively measuring the scale of recent gradients. This running average, also known as the cache, is calculated as follows:   

$$
cache_{t} = \rho \cdot cache_{t-1} + (1-\rho) \cdot (g_{t})^2
$$

where $\rho$ is the decay rate that determines how much of the history of squared gradients we retain. This cache term holds a form of "memory" of the magnitude of recent gradients, and its contents "move" with the data over time. 

Then, the parameter update rule becomes:

$$
\theta_{t} = \theta_{t-1} - \frac{\alpha}{\sqrt{cache_{t} + \epsilon}} \cdot g_{t}
$$

where $\epsilon$ is a small constant for numerical stability, often around $10^{-8}$ (this implementation defaults to $10^{-7}$). Normalizing by the square root of the cache keeps the effective learning rate changing smoothly and   
helps retain the global direction of parameter updates, making the updates more resilient to fluctuations in the gradient.    

RMSProp introduces a new hyperparameter, $\rho$, the cache memory decay rate. Given the momentum-like properties of RMSProp, even small gradient updates can have substantial effects    
due to the adaptive learning rate updates. As such, the default learning rate often used with RMSProp is smaller, around $0.001$, to ensure stability.   
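The following sketch shows the RMSProp update for a single scalar parameter and how the cache behaves under a constant gradient. Unlike AdaGrad's ever-growing sum, the moving average settles near $g^2$, so the effective learning rate levels off instead of decaying toward zero. The function name `rmsprop_step` and the constant gradient are assumptions made for illustration.

```python
import math


def rmsprop_step(theta, cache, grad, lr=0.001, rho=0.9, eps=1e-7):
    """One RMSProp update for a scalar parameter."""
    cache = rho * cache + (1 - rho) * grad ** 2         # moving average of squared gradients
    theta = theta - lr / math.sqrt(cache + eps) * grad  # step scaled by recent gradient size
    return theta, cache


theta, cache = 0.0, 0.0
for _ in range(200):
    theta, cache = rmsprop_step(theta, cache, grad=2.0)

effective_lr = 0.001 / math.sqrt(cache + 1e-7)
```

With a gradient of 2 the cache converges to about 4, which gives an effective learning rate of about `lr / 2`.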

In [None]:
#| export
class RMSProp(Optimizer):
    """
    Implements RMSProp optimization algorithm.

    RMSProp is an optimizer with parameter-wise adaptive learning rates, which adapt the learning rate
    for each parameter individually, making it suitable for dealing with sparse or multi-scale data.

    Parameters
    ----------
    params : Iterable
        The parameters of the model to be optimized.
    lr : float, optional
        The initial learning rate.
    wd : float, optional
        The weight decay (L2 regularization).
    eps : float, optional
        A small constant for numerical stability.
    rho : float, optional
        The decay rate for the moving average of squared gradients.
    """
    def __init__(
        self,
        params,  # The parameters of the model to be optimized.
        lr=0.001,  # The initial learning rate.
        wd=0.0,  # The weight decay (L2 regularization).
        eps=1e-7,  # A small constant for numerical stability.
        rho=0.9, # The decay rate for the moving average of squared gradients.
    ):
        super().__init__(params)

        self.lr = lr
        self.cache = {}
        self.wd = wd
        self.eps = eps
        self.rho = rho

    def step(self):
        """
        Performs a single optimization step.

        This method uses the current gradients to adjust the parameters using RMSProp algorithm.
        """
        for self.idx, p in enumerate(self.params):
            self._reg_step(p)
            self._opt_step(p)

    def _opt_step(self, p):
        """
        Performs the optimization step for a single parameter tensor.

        It computes parameter-wise learning rates and updates the parameters accordingly.
        """
        if self.idx not in self.cache:
            self.cache[self.idx] = init.zeros(*p.shape)
        self.cache[self.idx] = self.rho * self.cache[self.idx] + (1 - self.rho) * p.grad.data ** 2
        p.data = p.data - (self.lr / (self.cache[self.idx] + self.eps) ** 0.5 ) * p.grad.data

    def _reg_step(self, p):
        """
        Applies weight decay for a single parameter tensor.

        This form of L2 regularization can help prevent overfitting.
        """
        if self.wd != 0:
            p.data *= (1 - self.lr * self.wd)

## Adam Optimizer

This is a PyTorch-like implementation of the popular optimizer *Adam* from the paper
 [Adam: A Method for Stochastic Optimization](https://papers.labml.ai/paper/1412.6980).

The *Adam* update is
$$
\begin{align}
m_t &\leftarrow \beta_1 m_{t-1} + (1 - \beta_1) \cdot g_t \\
v_t &\leftarrow \beta_2 v_{t-1} + (1 - \beta_2) \cdot g_t^2 \\
\hat{m}_t &\leftarrow \frac{m_t}{1-\beta_1^t} \\
\hat{v}_t &\leftarrow \frac{v_t}{1-\beta_2^t} \\
\theta_t &\leftarrow \theta_{t-1} - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{align}
$$
where $\alpha$, $\beta_1$, $\beta_2$ and $\epsilon$ are scalar hyperparameters.
$m_t$ and $v_t$ are estimates of the first and second moments of the gradient.
$\hat{m}_t$  and $\hat{v}_t$ are the bias-corrected moments.
$\epsilon$ guards against division by zero, but also acts as a hyperparameter
that damps the effect of variance in the gradients.

The effective step taken, assuming $\epsilon = 0$, is
$$\Delta t = \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t}}$$
This is bounded by,
$$\vert \Delta t \vert \le \alpha \cdot \frac{1 - \beta_1}{\sqrt{1-\beta_2}}$$
when $1-\beta_1 \gt \sqrt{1-\beta_2}$
and
$$\vert \Delta t\vert  \le \alpha$$
otherwise.
And in most common scenarios,
$$\vert \Delta t \vert \approx \alpha$$
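The approximation $\vert \Delta t \vert \approx \alpha$ can be checked with a small scalar sketch. Under a constant gradient $g$, the bias-corrected moments equal $g$ and $g^2$, so each step has magnitude close to $\alpha$ regardless of the gradient's scale. The function `adam_step` and the chosen values are assumptions for illustration; note that the bias correction uses the time step `t`, starting at 1.

```python
import math


def adam_step(theta, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter at time step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad       # biased first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2  # biased second moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)             # bias-corrected second moment
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


theta, m, v = 0.0, 0.0, 0.0
steps = []
for t in range(1, 4):
    new_theta, m, v = adam_step(theta, m, v, grad=50.0, t=t)
    steps.append(abs(new_theta - theta))
    theta = new_theta
```

Even though the gradient is 50, every entry in `steps` is approximately `lr`, which is the scale invariance that the bound describes.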

In [None]:
#| export
class Adam(Optimizer):
    """
    Implements the Adam optimization algorithm.

    Adam is an adaptive learning rate optimization algorithm designed for training
    deep neural networks. It combines a moving average of the gradient with a moving average of
    the squared gradient to derive an individual learning rate for each parameter.

    Parameters
    ----------
    params : Iterable
        The parameters of the model to be optimized.
    lr : float, optional
        The learning rate. Default is 1e-5.
    beta1 : float, optional
        The exponential decay rate for the first moment estimates. Default is 0.9.
    beta2 : float, optional
        The exponential decay rate for the second moment estimates. Default is 0.999.
    eps : float, optional
        A small constant for numerical stability. Default is 1e-8.
    weight_decay : float, optional
        Weight decay (L2 penalty). Default is 0.

    Attributes
    ----------
    t : int
        The time step for the Adam optimizer.
    exp_avg : dict
        The dictionary to store the exponential moving average of gradient values.
    exp_avg_sq : dict
        The dictionary to store the exponential moving average of squared gradient values.
    """
    def __init__(
        self,
        params, # The parameters of the model to be optimized.
        lr=1e-5, # The learning rate $\alpha$.
        beta1=0.9, # The exponential decay rate for the first moment estimates.
        beta2=0.999, # The exponential decay rate for the second moment estimates.
        eps=1e-8, # A small constant $\epsilon$ for numerical stability.
        weight_decay=0.0, # The weight decay factor (L2 penalty).
    ):
        super().__init__(params)
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.wd = weight_decay
        self.t = 0

        self.exp_avg = {}
        self.exp_avg_sq = {}

    def step(self):
        """
        Performs a single optimization step.

        This method updates the parameters based on the current gradient.
        """
        # Advance the global time step once per call; it drives the bias correction.
        self.t += 1
        for self.idx, p in enumerate(self.params):
            self._reg_step(p)
            self._opt_step(p)

    def _opt_step(self, p):
        """
        Performs the optimization step for a single parameter tensor.

        The method updates the moving averages of the gradient (m) and the squared gradient (v), and then 
        computes the bias-corrected estimates of these two variables. These bias-corrected estimates are 
        then used to update the parameter.
        """
        if self.idx not in self.exp_avg:
            self.exp_avg[self.idx] = init.zeros(*p.shape)
            self.exp_avg_sq[self.idx] = init.zeros(*p.shape)
        
        # Update biased first and second moment estimates
        self.exp_avg[self.idx] = self.beta1 * self.exp_avg[self.idx] + (1 - self.beta1) * p.grad.data
        self.exp_avg_sq[self.idx] = self.beta2 * self.exp_avg_sq[self.idx] + (1 - self.beta2) * p.grad.data**2
        
        # Compute bias-corrected estimates using the time step `t`, not the parameter index
        exp_avg_hat = self.exp_avg[self.idx] / (1 - self.beta1 ** self.t)
        exp_avg_sq_hat = self.exp_avg_sq[self.idx] / (1 - self.beta2 ** self.t)
        p.data = p.data - self.lr * exp_avg_hat / (exp_avg_sq_hat ** 0.5 + self.eps)

    def _reg_step(self, p):
        """
        Applies weight decay for a single parameter tensor.

        This form of L2 regularization can help prevent overfitting. It adjusts the parameter by 
        a small factor of its current value.
        """
        if self.wd != 0:
            p.data *= (1 - self.lr * self.wd)
        # Equivalent formulations of the same update:
        # p.data = p.data - self.lr * self.wd * p.data
        # p.data -= self.lr * self.wd * p.data

#| hide
## Export

In [None]:
#| hide
import nbdev; nbdev.nbdev_export()