# optim

> The `Optim` module in minima is a flexible and powerful toolbox for optimizing the parameters of your deep learning models. Built on a high-level,  
> intuitive, and pythonic API, it provides several out-of-the-box optimization strategies, such as Stochastic Gradient Descent (SGD), Adam, and more.  
> At the heart of this module lies the abstract `Optimizer` class, which defines a standard interface for all optimization strategies.  
> Each concrete optimizer implements this interface, which ensures consistent usage and makes it easy to swap strategies in your training loop.  

> Among the features of this module are:  
> - Efficient gradient computations and updates.  
> - Advanced optimization strategies with adaptive learning rates.  
> - Easy to extend to custom optimization strategies.  
> - Supports weight decay regularization for avoiding overfitting.  

> Whether you're training a simple linear regression or a complex deep neural network, `Optim` has got you covered. With its simple and consistent interface, the module makes the task of optimizing your models a breeze.


In [None]:
#| default_exp optim

In [None]:
#| export
import minima as mi
from minima.nn import Parameter
from minima.autograd import Tensor
from minima import init
import numpy as np

In [None]:
#| export
class Optimizer:
    """
    Base class for all optimizers. Not meant to be instantiated directly.

    This class represents the abstract concept of an optimizer, and contains methods that 
    all concrete optimizer classes must implement. It is designed to handle the parameters 
    of a machine learning model, providing functionality to perform a step of optimization 
    and to zero out gradients.
    
    Parameters
    ----------
    params : Iterable
        The parameters of the model to be optimized.

    Raises
    ------
    NotImplementedError
        If the `step` method is not implemented in a subclass.
    """
    def __init__(
        self,
        params # The parameters of the model to be optimized.
    ):
        self.params = params

    def step(self):
        """
        Performs a single optimization step.

        This method must be overridden by any subclass to provide the specific optimization logic.
        
        Raises
        ------
        NotImplementedError
            If the method is not implemented in a subclass.
        """
        raise NotImplementedError()

    def zero_grad(self):
        """
        Resets the gradients of all `params` by setting them to `None`.

        This method is typically used before backpropagation to ensure that gradients 
        are not being accumulated from multiple passes.
        """
        for p in self.params:
            p.grad = None

## SGD Optimizer

This is a PyTorch-style implementation of the classic optimizer Stochastic Gradient Descent (SGD).

The SGD update is

$$
\theta_{t} = \theta_{t-1} - \alpha \cdot g_{t}
$$

where $\alpha$ is the learning rate, $g_{t}$ is the gradient at time step $t$, and $\theta_{t}$ represents the model parameters at time step $t$.

The learning rate $\alpha$ is a scalar hyperparameter that controls the size of the update at each iteration.

An optional momentum term can be added to the update rule:

$$
\begin{align*}
v_{t} & \leftarrow \mu v_{t-1} + (1-\mu) \cdot g_t \\
\theta_{t} & \leftarrow \theta_{t-1} - \alpha \cdot v_t 
\end{align*}
$$

where $v_{t}$ is the momentum term at time step $t$, and $\mu$ is the momentum factor. Because the gradient is scaled by $(1-\mu)$, $v_{t}$ is an exponential moving average of past gradients. It grows for dimensions whose gradients keep pointing in the same  
direction and shrinks for dimensions whose gradients change direction, which damps oscillations.  
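
To make the effect of the momentum term concrete, here is a minimal scalar sketch of the update rule above in plain Python. The function name `sgd_momentum` and the chosen values of `lr` and `mu` are assumptions for illustration, not part of minima.

```python
def sgd_momentum(grads, theta=0.0, lr=0.1, mu=0.9):
    """Apply v <- mu*v + (1-mu)*g and theta <- theta - lr*v for each gradient."""
    v = 0.0
    for g in grads:
        v = mu * v + (1 - mu) * g
        theta = theta - lr * v
    return theta, v


# Gradients that keep the same sign: v builds up toward the gradient value.
_, v_same = sgd_momentum([1.0] * 20)

# Gradients that flip sign every step: consecutive terms cancel and v stays small.
_, v_flip = sgd_momentum([1.0, -1.0] * 10)

print(v_same, v_flip)
```

After 20 steps with a constant gradient of 1, `v_same` equals $1 - \mu^{20}$, while `v_flip` stays well below the gradient magnitude.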

A weight decay term can also be included, which adds a regularization effect:

$$
\theta_{t} = (1 - \alpha \cdot \lambda) \cdot \theta_{t-1} - \alpha \cdot g_t
$$

where $\lambda$ is the weight decay factor. This results in the model weights shrinking at each time step, which can prevent overfitting by keeping the model complexity in check.
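
As a quick check of the formula above, the following scalar sketch applies the weight decay update with a zero gradient, so the shrinkage is visible on its own. It also compares the update against plain SGD on a gradient with an added L2 term $\lambda \theta$. The function names and the values of `lr` and `wd` are illustrative assumptions.

```python
def sgd_weight_decay_step(theta, grad, lr=0.1, wd=0.5):
    """theta <- (1 - lr*wd) * theta - lr * grad"""
    return (1 - lr * wd) * theta - lr * grad


def sgd_l2_step(theta, grad, lr=0.1, wd=0.5):
    """Plain SGD on the gradient with an added L2 term wd * theta."""
    return theta - lr * (grad + wd * theta)


# With a zero gradient the weight shrinks by a factor (1 - lr*wd) each step.
theta = 1.0
for _ in range(3):
    theta = sgd_weight_decay_step(theta, grad=0.0)
print(theta)  # approximately 0.95 ** 3

# For plain SGD (no momentum) the two formulations agree.
a = sgd_weight_decay_step(2.0, grad=0.3)
b = sgd_l2_step(2.0, grad=0.3)
print(a, b)
```

The equivalence shown here holds for plain SGD; once momentum or adaptive scaling is involved, shrinking the weights directly and adding an L2 term to the gradient are no longer the same update.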

In [None]:
#| export
class SGD(Optimizer):
    """
    Implements stochastic gradient descent (optionally with momentum).

    This is a basic optimizer that's suitable for many machine learning models, and is often
    used as a baseline for comparing other optimizers' performance.

    Parameters
    ----------
    params : Iterable
        The parameters of the model to be optimized.
    lr : float, optional
        The learning rate.
    momentum : float, optional
        The momentum factor.
    wd : float, optional
        The weight decay (L2 regularization).
    """
    def __init__(
        self,
        params, # The parameters of the model to be optimized.
        lr=0.01, # The learning rate.
        momentum=0.0, # The momentum factor.
        wd=0.0 # The weight decay (L2 regularization).
    ):
        super().__init__(params)

        self.lr = lr
        self.momentum = momentum
        self.u = {}
        self.wd = wd

    def step(self):
        """
        Performs a single optimization step.

        This method uses the current gradients to adjust the parameters using stochastic gradient descent.
        """
        for self.idx, p in enumerate(self.params):
            self._reg_step(p)
            self._opt_step(p)

    def _opt_step(self, p):
        """
        Performs the optimization step for a single parameter tensor.

        If momentum is set, it applies momentum by using a running average of the previous gradients.
        """
        if self.idx not in self.u:
            self.u[self.idx] = init.zeros(*p.shape)
        self.u[self.idx] = self.momentum * self.u[self.idx] + (1 - self.momentum) * p.grad.data
        p.data = p.data - self.lr * self.u[self.idx]

    def _reg_step(self, p):
        """
        Applies weight decay for a single parameter tensor.

        This form of L2 regularization can help prevent overfitting.
        """
        if self.wd != 0:
            p.data *= (1 - self.lr * self.wd)

## Adam Optimizer

This is a PyTorch-like implementation of the popular *Adam* optimizer from the paper
 [Adam: A Method for Stochastic Optimization](https://papers.labml.ai/paper/1412.6980).

The *Adam* update is
$$
\begin{align}
m_t &\leftarrow \beta_1 m_{t-1} + (1 - \beta_1) \cdot g_t \\
v_t &\leftarrow \beta_2 v_{t-1} + (1 - \beta_2) \cdot g_t^2 \\
\hat{m}_t &\leftarrow \frac{m_t}{1-\beta_1^t} \\
\hat{v}_t &\leftarrow \frac{v_t}{1-\beta_2^t} \\
\theta_t &\leftarrow \theta_{t-1} - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{align}
$$
where $\alpha$, $\beta_1$, $\beta_2$ and $\epsilon$ are scalar hyperparameters.
$m_t$ and $v_t$ are estimates of the first and second moments of the gradient.
$\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected moments.
$\epsilon$ guards against division by zero, but it also acts as a hyperparameter
that counteracts variance in the gradients.
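
To see why the bias correction matters, it helps to work through the first step by hand. This is a worked illustration that assumes $m_0 = v_0 = 0$ and the default $\beta_1 = 0.9$, $\beta_2 = 0.999$:

$$
\begin{align*}
m_1 &= \beta_1 \cdot 0 + (1 - \beta_1) \cdot g_1 = 0.1 \cdot g_1 \\
v_1 &= \beta_2 \cdot 0 + (1 - \beta_2) \cdot g_1^2 = 0.001 \cdot g_1^2 \\
\hat{m}_1 &= \frac{m_1}{1-\beta_1^1} = \frac{0.1 \cdot g_1}{0.1} = g_1 \\
\hat{v}_1 &= \frac{v_1}{1-\beta_2^1} = \frac{0.001 \cdot g_1^2}{0.001} = g_1^2
\end{align*}
$$

Without the correction, $m_1$ and $v_1$ would be strongly biased toward their zero initialization. Dividing by $1-\beta^t$ removes that bias, and the divisor approaches 1 as $t$ grows, so the correction fades out over time.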

The effective step taken, assuming $\epsilon = 0$, is
$$\Delta t = \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t}}$$
This is bounded by,
$$\vert \Delta t \vert \le \alpha \cdot \frac{1 - \beta_1}{\sqrt{1-\beta_2}}$$
when $1-\beta_1 \gt \sqrt{1-\beta_2}$
and
$$\vert \Delta t\vert  \le \alpha$$
otherwise.
And in most common scenarios,
$$\vert \Delta t \vert \approx \alpha$$
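
The step-size behaviour described above can be checked numerically with a scalar version of the update equations. This is a minimal sketch in plain Python rather than the module's implementation. The function name `adam_steps` and the gradient sequence are arbitrary choices for illustration.

```python
import math


def adam_steps(grads, lr=0.01, beta1=0.9, beta2=0.999, eps=0.0):
    """Return the update lr * m_hat / (sqrt(v_hat) + eps) for each gradient in turn."""
    m = v = 0.0
    deltas = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g        # first moment estimate
        v = beta2 * v + (1 - beta2) * g * g    # second moment estimate
        m_hat = m / (1 - beta1 ** t)           # bias correction uses the step count t
        v_hat = v / (1 - beta2 ** t)
        deltas.append(lr * m_hat / (math.sqrt(v_hat) + eps))
    return deltas


lr, beta1, beta2 = 0.01, 0.9, 0.999
deltas = adam_steps([3.0, -0.5, 2.0, 0.1, -4.0], lr, beta1, beta2)

# Upper bound from the text: lr * (1 - beta1) / sqrt(1 - beta2) when that ratio
# exceeds 1, and lr otherwise.
bound = lr * max(1.0, (1 - beta1) / math.sqrt(1 - beta2))

print(deltas)
print(bound)
```

On the first step the update has magnitude `lr` regardless of the gradient's scale, since $\hat{m}_1 / \sqrt{\hat{v}_1} = g_1 / \vert g_1 \vert$.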

In [None]:
#| export
class Adam(Optimizer):
    """
    Implements the Adam optimization algorithm.

    Adam is an adaptive learning rate optimization algorithm that has been designed specifically for training 
    deep neural networks. It leverages the power of adaptive learning rates methods to find individual learning 
    rates for each parameter.

    Parameters
    ----------
    params : Iterable
        The parameters of the model to be optimized.
    lr : float, optional
        The learning rate. Default is 0.01.
    beta1 : float, optional
        The exponential decay rate for the first moment estimates. Default is 0.9.
    beta2 : float, optional
        The exponential decay rate for the second moment estimates. Default is 0.999.
    eps : float, optional
        A small constant for numerical stability. Default is 1e-8.
    weight_decay : float, optional
        Weight decay (L2 penalty). Default is 0.

    Attributes
    ----------
    t : int
        The time step for the Adam optimizer.
    exp_avg : dict
        The dictionary to store the exponential moving average of gradient values.
    exp_avg_sq : dict
        The dictionary to store the exponential moving average of squared gradient values.
    """
    def __init__(
        self,
        params, # `params` is the list of parameters
        lr=0.01, # `lr` is the learning rate $\alpha$
        beta1=0.9, # The exponential decay rate for the first moment estimates. Default is 0.9.
        beta2=0.999, # The exponential decay rate for the second moment estimates. Default is 0.999.
        eps=1e-8, # `eps` is $\epsilon$, a small constant for numerical stability
        weight_decay=0.0, # `weight_decay` is the weight decay factor $\lambda$ (a float)
    ):
        super().__init__(params)
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.wd = weight_decay
        self.t = 0

        self.exp_avg = {}
        self.exp_avg_sq = {}

    def step(self):
        """
        Performs a single optimization step.

        This method advances the time step `t` and updates the parameters based on the current gradient.
        """
        self.t += 1
        for self.idx, p in enumerate(self.params):
            self._reg_step(p)
            self._opt_step(p)

    def _opt_step(self, p):
        """
        Performs the optimization step for a single parameter tensor.

        The method updates the moving averages of the gradient (m) and the squared gradient (v), and then 
        computes the bias-corrected estimates of these two variables. These bias-corrected estimates are 
        then used to update the parameter.
        """
        if self.idx not in self.exp_avg:
            self.exp_avg[self.idx] = init.zeros(*p.shape)
            self.exp_avg_sq[self.idx] = init.zeros(*p.shape)
        
        # Update biased first and second moment estimates
        self.exp_avg[self.idx] = self.beta1 * self.exp_avg[self.idx] + (1 - self.beta1) * p.grad.data
        self.exp_avg_sq[self.idx] = self.beta2 * self.exp_avg_sq[self.idx] + (1 - self.beta2) * p.grad.data**2
        
        # Compute bias-corrected first and second moment estimates.
        # The correction depends on the step count `t`, not on the parameter index.
        exp_avg_hat = self.exp_avg[self.idx] / (1 - self.beta1 ** self.t)
        exp_avg_sq_hat = self.exp_avg_sq[self.idx] / (1 - self.beta2 ** self.t)
        p.data = p.data - self.lr * exp_avg_hat / (exp_avg_sq_hat ** 0.5 + self.eps)

    def _reg_step(self, p):
        """
        Applies weight decay for a single parameter tensor.

        The parameter is shrunk directly by a factor of `1 - lr * wd`, independently of the 
        adaptive gradient update (decoupled weight decay, rather than an L2 term added to the gradient).
        """
        if self.wd != 0:
            p.data *= (1 - self.lr * self.wd)
        # The following forms are all equivalent:
        # p.data *= (1 - self.lr * self.wd)
        # p.data = p.data - self.lr * self.wd * p.data
        # p.data -= self.lr * self.wd * p.data

#| hide
## Export

In [None]:
#| hide
import nbdev; nbdev.nbdev_export()