<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

---

[source](https://github.com/m0saan/minima/blob/main/minima/optim.py#L14){target="_blank" style="float:right; font-size:smaller"}

### Optimizer

>      Optimizer (params)

Base class for all optimizers. Not meant to be instantiated directly.

This class represents the abstract concept of an optimizer, and contains methods that 
all concrete optimizer classes must implement. It is designed to handle the parameters 
of a machine learning model, providing functionality to perform a step of optimization 
and to zero out gradients.

|    | **Type** | **Details** |
| -- | -------- | ----------- |
| params | Iterable | The parameters of the model to be optimized. |
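
The contract described above can be pictured with a small sketch. This is not the library's code: the `Param` class and the method names `step` and `zero_grad` are assumptions made for illustration, and the real class may name or structure them differently.

```python
class Param:
    """Hypothetical stand-in for a model parameter: a value plus its gradient."""

    def __init__(self, data):
        self.data = data
        self.grad = None


class SketchOptimizer:
    """Illustrative base class only; not the library's actual implementation."""

    def __init__(self, params):
        # Materialize the iterable so it can be traversed on every step.
        self.params = list(params)

    def step(self):
        # Concrete optimizers override this with their own update rule.
        raise NotImplementedError

    def zero_grad(self):
        # Clear stored gradients so the next backward pass starts fresh.
        for p in self.params:
            p.grad = None
```

Storing the parameters as a list matters when `params` is a generator, which would otherwise be exhausted after the first pass.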

## SGD Optimizer

This is a PyTorch-style implementation of the classic optimizer Stochastic Gradient Descent (SGD).

The SGD update is

$$
\theta_{t} = \theta_{t-1} - \alpha \cdot g_{t}
$$

where $\alpha$ is the learning rate, and $g_{t}$ is the gradient at time step $t$. $\theta_{t}$ represents the model parameters at time step $t$.

The learning rate $\alpha$ is a scalar hyperparameter that controls the size of the update at each iteration.
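
As a minimal sketch of this update rule, the snippet below applies it to plain Python lists. The helper name `sgd_step` and the toy objective $f(\theta) = \sum_i \theta_i^2$ are illustrative assumptions, not part of the library.

```python
def sgd_step(theta, grad, lr):
    """Return theta - lr * grad, element-wise, for lists of floats."""
    return [t - lr * g for t, g in zip(theta, grad)]


# Minimize f(theta) = sum(theta_i ** 2), whose gradient is 2 * theta.
theta = [1.0, -2.0]
for _ in range(100):
    grad = [2.0 * t for t in theta]
    theta = sgd_step(theta, grad, lr=0.1)
```

With `lr=0.1` each step scales every coordinate by $1 - 0.1 \cdot 2 = 0.8$, so the iterates shrink geometrically toward the minimum at zero.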

An optional momentum term can be added to the update rule:

$$
\begin{align*}
v_{t} & \leftarrow \mu v_{t-1} + (1-\mu) \cdot g_t \\
\theta_{t} & \leftarrow \theta_{t-1} - \alpha \cdot v_t 
\end{align*}
$$

where $v_{t}$ is the momentum term (velocity) at time step $t$, and $\mu$ is the momentum factor. The momentum term grows along dimensions whose gradients keep pointing in the same direction and shrinks along dimensions whose gradients change direction, which damps oscillations and speeds up progress along consistent directions.
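
The effect on consistent versus alternating gradients can be checked with a short sketch of the $(1-\mu)$ form of the update above. The function name `momentum_step` and the synthetic gradient sequences are assumptions chosen for illustration.

```python
def momentum_step(theta, v, grad, lr, mu):
    """One SGD-with-momentum step: v <- mu*v + (1-mu)*g, theta <- theta - lr*v."""
    v = [mu * vi + (1.0 - mu) * g for vi, g in zip(v, grad)]
    theta = [t - lr * vi for t, vi in zip(theta, v)]
    return theta, v


# Feed 50 gradients of +1 to one velocity, and gradients that flip
# sign on every step to another, then compare the two velocities.
v_same, v_flip = [0.0], [0.0]
theta = [0.0]
for i in range(50):
    _, v_same = momentum_step(theta, v_same, [1.0], lr=0.1, mu=0.9)
    _, v_flip = momentum_step(theta, v_flip, [(-1.0) ** i], lr=0.1, mu=0.9)
```

After the loop, `v_same` is close to the gradient value of 1, while `v_flip` stays small in magnitude because successive contributions largely cancel.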

A weight decay term can also be included, which adds a regularization effect:

$$
\theta_{t} = (1 - \alpha \cdot \lambda) \cdot \theta_{t-1} - \alpha \cdot g_t
$$

where $\lambda$ is the weight decay factor. This results in the model weights shrinking at each time step, which can prevent overfitting by keeping the model complexity in check.
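
For plain SGD, the shrinkage form above is algebraically the same as adding $\lambda \theta_{t-1}$ to the gradient, which is the gradient of an L2 penalty $\frac{\lambda}{2}\lVert\theta\rVert^2$. The sketch below compares both forms numerically; the function names are hypothetical.

```python
def sgd_wd_step(theta, grad, lr, wd):
    """Shrink each weight by (1 - lr * wd), then take the gradient step."""
    return [(1.0 - lr * wd) * t - lr * g for t, g in zip(theta, grad)]


def sgd_l2_step(theta, grad, lr, wd):
    """Equivalent form: add wd * theta to the gradient before stepping."""
    return [t - lr * (g + wd * t) for t, g in zip(theta, grad)]


theta, grad = [1.0, -3.0], [0.5, 0.25]
a = sgd_wd_step(theta, grad, lr=0.1, wd=0.01)
b = sgd_l2_step(theta, grad, lr=0.1, wd=0.01)
```

The two results agree up to floating-point rounding. This equivalence is specific to plain SGD and does not carry over automatically to adaptive methods.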

---

[source](https://github.com/m0saan/minima/blob/main/minima/optim.py#L63){target="_blank" style="float:right; font-size:smaller"}

### SGD

>      SGD (params, lr=0.01, momentum=0.0, wd=0.0)

Implements stochastic gradient descent (optionally with momentum).

This is a basic optimizer that's suitable for many machine learning models, and is often
used as a baseline for comparing other optimizers' performance.

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| params | Iterable |  | The parameters of the model to be optimized. |
| lr | float | 0.01 | The learning rate. |
| momentum | float | 0.0 | The momentum factor. |
| wd | float | 0.0 | The weight decay (L2 regularization). |

## Adam Optimizer

This is a PyTorch-like implementation of the popular optimizer *Adam* from the paper
 [Adam: A Method for Stochastic Optimization](https://papers.labml.ai/paper/1412.6980).

The *Adam* update is
$$
\begin{align}
m_t &\leftarrow \beta_1 m_{t-1} + (1 - \beta_1) \cdot g_t \\
v_t &\leftarrow \beta_2 v_{t-1} + (1 - \beta_2) \cdot g_t^2 \\
\hat{m}_t &\leftarrow \frac{m_t}{1-\beta_1^t} \\
\hat{v}_t &\leftarrow \frac{v_t}{1-\beta_2^t} \\
\theta_t &\leftarrow \theta_{t-1} - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{align}
$$
where $\alpha$, $\beta_1$, $\beta_2$ and $\epsilon$ are scalar hyperparameters, and $g_t^2$ is the element-wise square of the gradient.
$m_t$ and $v_t$ are estimates of the first and second moments of the gradient.
$\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected moments.
$\epsilon$ guards against division by zero, and also acts as a hyperparameter
that damps the effect of variance in the gradients.
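
The five update lines above translate almost directly into code. The following is a minimal sketch on plain Python lists, assuming a hypothetical helper `adam_step` and a toy quadratic objective. It leaves out weight decay and is not the library's implementation.

```python
import math


def adam_step(theta, m, v, grad, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for lists of floats; t is the 1-based step count."""
    # Exponential moving averages of the gradient and its square.
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    # Bias correction compensates for initializing m and v at zero.
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    theta = [
        th - lr * mh / (math.sqrt(vh) + eps)
        for th, mh, vh in zip(theta, m_hat, v_hat)
    ]
    return theta, m, v


# Minimize f(theta) = sum(theta_i ** 2), whose gradient is 2 * theta.
theta, m, v = [1.0, -2.0], [0.0, 0.0], [0.0, 0.0]
for t in range(1, 1001):
    grad = [2.0 * th for th in theta]
    theta, m, v = adam_step(theta, m, v, grad, t)
```

On the very first step the bias-corrected ratio $\hat{m}_1 / \sqrt{\hat{v}_1}$ equals $g_1 / \vert g_1 \vert$, so each coordinate moves by almost exactly `lr` regardless of the gradient's scale.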

The effective step taken, assuming $\epsilon = 0$, is
$$\Delta_t = \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t}}$$
This is bounded by
$$\vert \Delta_t \vert \le \alpha \cdot \frac{1 - \beta_1}{\sqrt{1-\beta_2}}$$
when $1-\beta_1 \gt \sqrt{1-\beta_2}$,
and
$$\vert \Delta_t \vert \le \alpha$$
otherwise.
In most common scenarios,
$$\vert \Delta_t \vert \approx \alpha$$
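
Two simple gradient sequences illustrate these magnitudes. The sketch below computes $\vert \hat{m}_t / \sqrt{\hat{v}_t} \vert$, the effective step divided by $\alpha$ with $\epsilon = 0$, for a scalar parameter. The helper `step_ratio` and both input sequences are assumptions chosen for illustration; this is a numerical check on two cases, not a proof of the bound.

```python
import math


def step_ratio(grads, beta1=0.9, beta2=0.999):
    """Return |m_hat / sqrt(v_hat)| after feeding a scalar gradient sequence."""
    m = v = 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return abs(m_hat) / math.sqrt(v_hat)


# (1 - beta1) / sqrt(1 - beta2) for the default betas.
bound = (1 - 0.9) / math.sqrt(1 - 0.999)

steady = step_ratio([0.5] * 200)            # the same gradient at every step
sparse = step_ratio([0.0] * 5000 + [0.5])   # one gradient after a long silence
```

The constant sequence gives a ratio of 1, matching $\vert \Delta_t \vert \approx \alpha$. The sparse sequence comes close to `bound` without exceeding it, since for the default betas $1 - \beta_1 = 0.1$ is larger than $\sqrt{1 - \beta_2} \approx 0.032$.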

---

[source](https://github.com/m0saan/minima/blob/main/minima/optim.py#L126){target="_blank" style="float:right; font-size:smaller"}

### Adam

>      Adam (params, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-08,
>            weight_decay=0.0)

Implements the Adam optimization algorithm.

Adam is an adaptive learning rate optimization algorithm designed for training 
deep neural networks. It uses running estimates of the gradient's first and second moments to set an 
individual step size for each parameter.

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| params | Iterable |  | The parameters of the model to be optimized. |
| lr | float | 0.01 | The learning rate $\alpha$. |
| beta1 | float | 0.9 | The exponential decay rate $\beta_1$ for the first moment estimates. |
| beta2 | float | 0.999 | The exponential decay rate $\beta_2$ for the second moment estimates. |
| eps | float | 1e-08 | The constant $\epsilon$ added to the denominator for numerical stability. |
| weight_decay | float | 0.0 | The weight decay coefficient. |