📝 **Author:** Amirhossein Heydari - 📧 **Email:** <amirhosseinheydari78@gmail.com> - 📍 **Origin:** [mr-pylin/pytorch-workshop](https://github.com/mr-pylin/pytorch-workshop)

---


**Table of contents**<a id='toc0_'></a>    
- [Dependencies](#toc1_)    
- [Optimizer](#toc2_)    
  - [Built-in Optimizers](#toc2_1_)    
    - [Optimizer: SGD](#toc2_1_1_)    
    - [Optimizer: Adam](#toc2_1_2_)    
    - [Optimizer: Adagrad](#toc2_1_3_)    
  - [Custom Optimizers](#toc2_2_)    
- [Adjust Learning Rate](#toc3_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Dependencies](#toc0_)


In [1]:
import torch
from torch import optim

# <a id='toc2_'></a>[Optimizer](#toc0_)

- An **optimizer** updates model **parameters** during training to minimize the loss function.  
- It adjusts weights based on **gradients** computed through **backpropagation**.  
- The optimizer must be told which parameters to update; this is usually done by passing `model.parameters()` as its first argument.  

🛠 **Optimizer Methods**:

- `zero_grad()` **clears** the **gradients** left over from the previous **iteration**. PyTorch accumulates gradients across `backward()` calls, so stale values would otherwise be added to the new ones.
- `step()` **updates** the model parameters using the computed **gradients**.

🖥 **Typical Workflow**:

```python
output = model(data)            # Forward pass
loss = loss_fn(output, target)  # Compute loss
loss.backward()                 # Backpropagation (compute gradients)
optimizer.step()                # Update parameters
optimizer.zero_grad()           # Clear gradients before the next iteration
```
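
To see why clearing matters, here is a minimal sketch showing that `backward()` adds to existing gradients instead of overwriting them. The tensor `w` and the toy loss are made up for illustration only.

```python
import torch

# A toy parameter; the loss sum(3 * w) has gradient 3 for every element
w = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass
(3 * w).sum().backward()
first = w.grad.clone()

# Second backward pass without clearing: the new gradient is added to the old one
(3 * w).sum().backward()
accumulated = w.grad.clone()

# zero_grad() resets the gradient (recent releases set it to None by default)
opt = torch.optim.SGD([w], lr=0.1)
opt.zero_grad()

print(first)        # gradient after one backward pass
print(accumulated)  # twice as large, because nothing was cleared
print(w.grad)       # None or zeros, depending on the PyTorch version
```

Calling `zero_grad()` at the start of each iteration instead of the end is equally valid, as long as it happens before the next `backward()`.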

ℹ️ **Learn more**:

- details about gradient: [**02-gradient.ipynb**](../02-gradient.ipynb)
- details about models: [**model.ipynb**](./model.ipynb)
- details about losses: [**loss.ipynb**](./loss.ipynb)


## <a id='toc2_1_'></a>[Built-in Optimizers](#toc0_)

- PyTorch provides several built-in optimizers in `torch.optim`, each with **different** strategies for updating model parameters.

<table style="margin:0 auto;">
  <thead>
    <tr>
      <th>Optimizer</th>
      <th>Best For</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><pre>optim.SGD()</pre></td>
      <td>General training, often best with momentum</td>
    </tr>
    <tr>
      <td><pre>optim.Adam()</pre></td>
      <td>Default choice, works well for most models</td>
    </tr>
    <tr>
      <td><pre>optim.RMSprop()</pre></td>
      <td>RNNs and nonstationary data</td>
    </tr>
    <tr>
      <td><pre>optim.Adagrad()</pre></td>
      <td>Sparse datasets with rare features</td>
    </tr>
    <tr>
      <td><pre>optim.AdamW()</pre></td>
      <td>Adam with decoupled weight decay (decay applied directly to the weights rather than through the gradient)</td>
    </tr>
  </tbody>
</table>

📝 **Docs**:

- `torch.optim`: [docs.pytorch.org/docs/stable/optim.html](https://docs.pytorch.org/docs/stable/optim.html)


### <a id='toc2_1_1_'></a>[Optimizer: SGD](#toc0_)

- Implements stochastic gradient descent (optionally with momentum).
- Updates parameters using gradients computed via **backpropagation**. 
- Converges reliably on convex problems, but plain SGD can be slow or unstable on deep networks with noisy gradients, which is why momentum is often enabled.  
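
As a rough illustration of the momentum update, the sketch below replays three `optim.SGD` steps by hand on a toy quadratic loss. It assumes `dampening=0`, no weight decay, and no Nesterov momentum; the names `manual` and `buf` are illustrative only.

```python
import torch
from torch import optim

# One parameter tensor optimized by PyTorch, and a plain copy updated by hand
p = torch.nn.Parameter(torch.tensor([1.0, -2.0]))
manual = p.detach().clone()

lr, momentum = 0.1, 0.9
opt = optim.SGD([p], lr=lr, momentum=momentum)

buf = None  # hand-written momentum buffer
for _ in range(3):
    # PyTorch update
    opt.zero_grad()
    loss = (p ** 2).sum()
    loss.backward()
    opt.step()

    # Manual update: the gradient of sum(x^2) is 2x
    g = 2 * manual
    buf = g.clone() if buf is None else momentum * buf + g  # first step uses the raw gradient
    manual = manual - lr * buf

matches = torch.allclose(p.detach(), manual)
print(matches)
```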

🎛 **Optimizer Parameters**:

<table style="margin:0 auto;">
  <thead>
    <tr>
      <th>Parameter</th>
      <th>Type</th>
      <th>Default</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><span style="font-family: monospace;">params</span></td>
      <td>iterable</td>
      <td><b>Required</b></td>
      <td>Iterable of parameters or <span style="font-family: monospace;">named_parameters</span> to optimize. Can also be a list of dicts defining parameter groups.</td>
    </tr>
    <tr>
      <td><span style="font-family: monospace;">lr</span></td>
      <td>float or Tensor</td>
      <td><span style="font-family: monospace;">1e-3</span></td>
      <td>Learning rate (step size).</td>
    </tr>
    <tr>
      <td><span style="font-family: monospace;">momentum</span></td>
      <td>float</td>
      <td><span style="font-family: monospace;">0.0</span></td>
      <td>Momentum factor for smoothing updates and accelerating convergence.</td>
    </tr>
    <tr>
      <td><span style="font-family: monospace;">dampening</span></td>
      <td>float</td>
      <td><span style="font-family: monospace;">0.0</span></td>
      <td>Dampening for momentum. If <span style="font-family: monospace;">&gt;0</span>, the current gradient's contribution to the momentum buffer is scaled by <span style="font-family: monospace;">1 - dampening</span>.</td>
    </tr>
    <tr>
      <td><span style="font-family: monospace;">weight_decay</span></td>
      <td>float</td>
      <td><span style="font-family: monospace;">0.0</span></td>
      <td>L2 penalty coefficient (helps reduce overfitting).</td>
    </tr>
    <tr>
      <td><span style="font-family: monospace;">nesterov</span></td>
      <td>bool</td>
      <td><span style="font-family: monospace;">False</span></td>
      <td>Enables <b>Nesterov momentum</b>, which can speed up convergence. Requires <span style="font-family: monospace;">momentum &gt; 0</span> and <span style="font-family: monospace;">dampening = 0</span>.</td>
    </tr>
    <tr>
      <td><span style="font-family: monospace;">maximize</span></td>
      <td>bool</td>
      <td><span style="font-family: monospace;">False</span></td>
      <td>If <span style="font-family: monospace;">True</span>, maximizes the objective function instead of minimizing it.</td>
    </tr>
    <tr>
      <td><span style="font-family: monospace;">foreach</span></td>
      <td>bool or None</td>
      <td><span style="font-family: monospace;">None</span></td>
      <td>Whether to use the multi-tensor <span style="font-family: monospace;">foreach</span> implementation. If <span style="font-family: monospace;">None</span>, PyTorch tries it on CUDA, where it is usually faster at the cost of higher peak memory.</td>
    </tr>
    <tr>
      <td><span style="font-family: monospace;">differentiable</span></td>
      <td>bool</td>
      <td><span style="font-family: monospace;">False</span></td>
      <td>If <span style="font-family: monospace;">True</span>, allows autograd to track the optimizer step. Can slow down training.</td>
    </tr>
    <tr>
      <td><span style="font-family: monospace;">fused</span></td>
      <td>bool or None</td>
      <td><span style="font-family: monospace;">None</span></td>
      <td>Uses a fused implementation for <b>better performance</b> on <span style="font-family: monospace;">float32</span>, <span style="font-family: monospace;">float64</span>, <span style="font-family: monospace;">float16</span>, and <span style="font-family: monospace;">bfloat16</span>.</td>
    </tr>
  </tbody>
</table>
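
The `params` argument also accepts a list of dicts, one per parameter group. The following sketch assumes a small two-layer `Sequential` model (made up for this example) and gives the second layer its own learning rate; options not set in a group fall back to the keyword defaults.

```python
import torch
from torch import optim

# Hypothetical two-layer model used only to obtain two sets of parameters
model = torch.nn.Sequential(
    torch.nn.Linear(4, 8),
    torch.nn.ReLU(),
    torch.nn.Linear(8, 1),
)

optimizer = optim.SGD(
    [
        {"params": model[0].parameters()},              # uses the default lr below
        {"params": model[2].parameters(), "lr": 1e-3},  # overrides lr for this group
    ],
    lr=1e-2,
    momentum=0.9,
)

# Each group carries its own copy of every hyperparameter
group_lrs = [group["lr"] for group in optimizer.param_groups]
group_momenta = [group["momentum"] for group in optimizer.param_groups]
print(group_lrs, group_momenta)
```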


In [None]:
# simple model
model = torch.nn.Linear(2, 1)

# optimizer
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=0.01, nesterov=True)

# training loop simulation
for i in range(3):
    loss = model(torch.randn(1, 2)).sum()  # dummy loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # log parameters
    print(f"iteration {i+1}/{3}:")
    for n, v in model.named_parameters():
        print(f"{n:6}: {v.data}")
    print()

### <a id='toc2_1_2_'></a>[Optimizer: Adam](#toc0_)

- **Adam (Adaptive Moment Estimation)** combines **momentum** and **adaptive learning rates** for efficient optimization.  
- Uses **exponential moving averages** of past gradients and squared gradients.  
- Works well for non-stationary objectives and sparse gradients.  
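
One consequence of the adaptive scaling is easy to check: with bias correction, the very first Adam step has a magnitude close to `lr` for every element, regardless of how large its gradient is. The sketch below uses a toy parameter and loss chosen only to produce gradients of very different sizes.

```python
import torch
from torch import optim

# Toy parameter; the loss sum(p^2) gives gradients [2, -4, 60]
p = torch.nn.Parameter(torch.tensor([1.0, -2.0, 30.0]))
before = p.detach().clone()

opt = optim.Adam([p], lr=0.1)
(p ** 2).sum().backward()
opt.step()

# Each element moved by roughly lr in the direction opposite to its gradient,
# even though the gradients differ by a factor of 30
step_taken = before - p.detach()
print(step_taken)
```

Later steps differ, because the moving averages then mix gradients from several iterations.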

🎛 **Optimizer Parameters**:  

<table style="margin:0 auto;">
  <thead>
    <tr>
      <th>Parameter</th>
      <th>Type</th>
      <th>Default</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><span style="font-family: monospace;">params</span></td>
      <td>iterable</td>
      <td><b>Required</b></td>
      <td>Iterable of parameters or <span style="font-family: monospace;">named_parameters</span> to optimize. Can also be a list of dicts defining parameter groups.</td>
    </tr>
    <tr>
      <td><span style="font-family: monospace;">lr</span></td>
      <td>float or Tensor</td>
      <td><span style="font-family: monospace;">1e-3</span></td>
      <td>Learning rate (step size).</td>
    </tr>
    <tr>
      <td><span style="font-family: monospace;">betas</span></td>
      <td>Tuple[float, float]</td>
      <td><span style="font-family: monospace;">(0.9, 0.999)</span></td>
      <td>Coefficients for computing running averages of gradient and squared gradient.</td>
    </tr>
    <tr>
      <td><span style="font-family: monospace;">eps</span></td>
      <td>float</td>
      <td><span style="font-family: monospace;">1e-8</span></td>
      <td>Term added to the denominator for numerical stability.</td>
    </tr>
    <tr>
      <td><span style="font-family: monospace;">weight_decay</span></td>
      <td>float</td>
      <td><span style="font-family: monospace;">0.0</span></td>
      <td>L2 penalty coefficient (helps reduce overfitting).</td>
    </tr>
    <tr>
      <td><span style="font-family: monospace;">amsgrad</span></td>
      <td>bool</td>
      <td><span style="font-family: monospace;">False</span></td>
      <td>Enables AMSGrad variant for better convergence in some cases.</td>
    </tr>
    <tr>
      <td><span style="font-family: monospace;">foreach</span></td>
      <td>bool or None</td>
      <td><span style="font-family: monospace;">None</span></td>
      <td>Whether to use the multi-tensor <span style="font-family: monospace;">foreach</span> implementation. If <span style="font-family: monospace;">None</span>, PyTorch tries it on CUDA, where it is usually faster at the cost of higher peak memory.</td>
    </tr>
    <tr>
      <td><span style="font-family: monospace;">maximize</span></td>
      <td>bool</td>
      <td><span style="font-family: monospace;">False</span></td>
      <td>If <span style="font-family: monospace;">True</span>, maximizes the objective function instead of minimizing it.</td>
    </tr>
    <tr>
      <td><span style="font-family: monospace;">capturable</span></td>
      <td>bool</td>
      <td><span style="font-family: monospace;">False</span></td>
      <td>Whether this instance is safe to capture in a CUDA graph. Passing <span style="font-family: monospace;">True</span> can impair ungraphed performance, so if you don’t intend to graph capture this instance, leave it <span style="font-family: monospace;">False</span>.</td>
    </tr>
    <tr>
      <td><span style="font-family: monospace;">differentiable</span></td>
      <td>bool</td>
      <td><span style="font-family: monospace;">False</span></td>
      <td>If <span style="font-family: monospace;">True</span>, allows autograd to track the optimizer step. Can slow down training.</td>
    </tr>
    <tr>
      <td><span style="font-family: monospace;">fused</span></td>
      <td>bool or None</td>
      <td><span style="font-family: monospace;">None</span></td>
      <td>Uses a fused implementation for <b>better performance</b> on <span style="font-family: monospace;">float32</span>, <span style="font-family: monospace;">float64</span>, <span style="font-family: monospace;">float16</span>, and <span style="font-family: monospace;">bfloat16</span>.</td>
    </tr>
  </tbody>
</table>


In [None]:
# simple model
model = torch.nn.Linear(2, 1)

# optimizer
optimizer = optim.Adam(model.parameters(), lr=0.1, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01, amsgrad=True)

# training loop simulation
for i in range(3):
    loss = model(torch.randn(1, 2)).sum()  # dummy loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # log parameters
    print(f"iteration {i+1}/{3}:")
    for n, v in model.named_parameters():
        print(f"{n:6}: {v.data}")
    print()

### <a id='toc2_1_3_'></a>[Optimizer: Adagrad](#toc0_)

- **Adagrad (Adaptive Gradient Algorithm)** adapts the learning rate individually for each parameter based on past gradients.
- Suitable for sparse data and NLP tasks but may suffer from aggressive learning rate decay.
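
The learning rate decay mentioned above can be observed directly. In this sketch the gradient is held constant at 1 (a contrived setup, with default `lr_decay=0` and `initial_accumulator_value=0`), so each step is approximately `lr / sqrt(t)` at iteration `t`.

```python
import torch
from torch import optim

# Single scalar parameter; the loss p.sum() always has gradient 1
p = torch.nn.Parameter(torch.zeros(1))
opt = optim.Adagrad([p], lr=1.0)

steps = []
for _ in range(3):
    opt.zero_grad()
    prev = p.detach().clone()
    p.sum().backward()
    opt.step()
    steps.append((prev - p.detach()).item())  # size of the update just applied

# The accumulated squared gradients grow as 1, 2, 3, so the steps shrink
print(steps)  # approximately [1.0, 0.7071, 0.5774]
```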

🎛 **Optimizer Parameters**:

<table style="margin:0 auto;">
  <thead>
    <tr>
      <th>Parameter</th>
      <th>Type</th>
      <th>Default</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><span style="font-family: monospace;">params</span></td>
      <td>iterable</td>
      <td><b>Required</b></td>
      <td>Iterable of parameters or <span style="font-family: monospace;">named_parameters</span> to optimize. Can also be a list of dicts defining parameter groups.</td>
    </tr>
    <tr>
      <td><span style="font-family: monospace;">lr</span></td>
      <td>float or Tensor</td>
      <td><span style="font-family: monospace;">1e-2</span></td>
      <td>Learning rate (step size).</td>
    </tr>
    <tr>
      <td><span style="font-family: monospace;">lr_decay</span></td>
      <td>float</td>
      <td><span style="font-family: monospace;">0.0</span></td>
      <td>Decay factor applied to the learning rate.</td>
    </tr>
    <tr>
      <td><span style="font-family: monospace;">weight_decay</span></td>
      <td>float</td>
      <td><span style="font-family: monospace;">0.0</span></td>
      <td>L2 penalty coefficient (helps reduce overfitting).</td>
    </tr>
    <tr>
      <td><span style="font-family: monospace;">initial_accumulator_value</span></td>
      <td>float</td>
      <td><span style="font-family: monospace;">0.0</span></td>
      <td>Initial value of the sum of squares of gradients.</td>
    </tr>
    <tr>
      <td><span style="font-family: monospace;">eps</span></td>
      <td>float</td>
      <td><span style="font-family: monospace;">1e-10</span></td>
      <td>Term added to the denominator to improve numerical stability.</td>
    </tr>
    <tr>
      <td><span style="font-family: monospace;">foreach</span></td>
      <td>bool or None</td>
      <td><span style="font-family: monospace;">None</span></td>
      <td>Whether to use the multi-tensor <span style="font-family: monospace;">foreach</span> implementation. If <span style="font-family: monospace;">None</span>, PyTorch tries it on CUDA, where it is usually faster at the cost of higher peak memory.</td>
    </tr>
    <tr>
      <td><span style="font-family: monospace;">maximize</span></td>
      <td>bool</td>
      <td><span style="font-family: monospace;">False</span></td>
      <td>If <span style="font-family: monospace;">True</span>, maximizes the objective function instead of minimizing it.</td>
    </tr>
    <tr>
      <td><span style="font-family: monospace;">differentiable</span></td>
      <td>bool</td>
      <td><span style="font-family: monospace;">False</span></td>
      <td>If <span style="font-family: monospace;">True</span>, allows autograd to track the optimizer step. Can slow down training.</td>
    </tr>
    <tr>
      <td><span style="font-family: monospace;">fused</span></td>
      <td>bool or None</td>
      <td><span style="font-family: monospace;">None</span></td>
      <td>Uses a fused implementation for <b>better performance</b> on <span style="font-family: monospace;">float32</span>, <span style="font-family: monospace;">float64</span>, <span style="font-family: monospace;">float16</span>, and <span style="font-family: monospace;">bfloat16</span>. Not supported for sparse or complex gradients.</td>
    </tr>
  </tbody>
</table>


In [None]:
# simple model
model = torch.nn.Linear(2, 1)

# optimizer
optimizer = optim.Adagrad(
    model.parameters(), lr=0.1, lr_decay=0.01, weight_decay=0.01, initial_accumulator_value=0.1, eps=1e-10
)

# training loop simulation
for i in range(3):
    loss = model(torch.randn(1, 2)).sum()  # dummy loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # log parameters
    print(f"iteration {i+1}/{3}:")
    for n, v in model.named_parameters():
        print(f"{n:6}: {v.data}")
    print()

## <a id='toc2_2_'></a>[Custom Optimizers](#toc0_)

- PyTorch allows defining **custom optimizers** by extending `torch.optim.Optimizer`.  
- Custom optimizers give full control over parameter updates, allowing modifications beyond standard methods like **SGD**, **Adam**, or **Adagrad**.
- To implement a custom optimizer:
  1. **Inherit** from `torch.optim.Optimizer`.
  2. **Initialize** parameters and defaults in `__init__()`.
  3. **Implement `step()`**, which defines how parameters are updated each iteration.

**Defining `step()`**
- The `step()` function is where the **gradient-based update rule** is applied.
- Inside `step()`, iterate over `self.param_groups` and update each parameter using its gradient.
- Use `@torch.no_grad()` to disable gradient tracking during updates.
- Support an optional `closure` argument: a function that re-evaluates the model and returns the loss. Some optimizers, such as **LBFGS**, require it because they evaluate the loss several times per step.
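
For reference, this is roughly how a closure is used with the built-in `optim.LBFGS`. The parameter `x`, the target values, and the quadratic loss are placeholders chosen for this sketch.

```python
import torch
from torch import optim

# Toy problem: move x towards a fixed target by minimizing the squared distance
x = torch.nn.Parameter(torch.tensor([3.0, -2.0]))
target = torch.tensor([1.0, 2.0])

opt = optim.LBFGS([x], lr=1.0, max_iter=20)


def closure():
    # The closure clears gradients, recomputes the loss, and runs backward()
    opt.zero_grad()
    loss = ((x - target) ** 2).sum()
    loss.backward()
    return loss


initial_loss = closure().item()
opt.step(closure)  # LBFGS may call the closure several times internally
final_loss = closure().item()

print(initial_loss, final_loss)
```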


📝 **Docs**:

- `torch.optim.Optimizer`: [docs.pytorch.org/docs/stable/optim.html#torch.optim.Optimizer](https://docs.pytorch.org/docs/stable/optim.html#torch.optim.Optimizer)


In [6]:
class CustomSGD(optim.Optimizer):
    def __init__(self, params, lr=0.01):
        if lr < 0.0:
            raise ValueError(f"Invalid learning rate: {lr}")
        defaults = {"lr": lr}
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
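        loss = None
        if closure is not None:
            # step() runs under no_grad, so re-enable autograd for the closure,
            # which typically calls backward()
            with torch.enable_grad():
                loss = closure()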

        for group in self.param_groups:
            lr = group["lr"]
            for param in group["params"]:
                if param.grad is not None:
                    param -= lr * param.grad

        return loss

In [None]:
# simple model
model = torch.nn.Linear(2, 1)

# optimizer
optimizer = CustomSGD(model.parameters(), lr=0.1)

# training loop simulation
for i in range(3):
    loss = model(torch.randn(1, 2)).sum()  # dummy loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # log parameters
    print(f"iteration {i+1}/{3}:")
    for n, v in model.named_parameters():
        print(f"{n:6}: {v.data}")
    print()
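
Optimizers that need memory between steps keep it in `self.state`, a dict keyed by parameter. As a sketch of that pattern, the hypothetical `MomentumSGD` below stores one momentum buffer per parameter and is compared against `optim.SGD` with the same settings; the class name and the `"buf"` key are choices made for this example.

```python
import torch
from torch import optim


class MomentumSGD(optim.Optimizer):
    """Illustrative SGD with momentum (no dampening, weight decay, or Nesterov)."""

    def __init__(self, params, lr=0.01, momentum=0.9):
        if lr < 0.0:
            raise ValueError(f"Invalid learning rate: {lr}")
        super().__init__(params, {"lr": lr, "momentum": momentum})

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        for group in self.param_groups:
            for param in group["params"]:
                if param.grad is None:
                    continue
                state = self.state[param]  # per-parameter storage
                if "buf" not in state:
                    state["buf"] = param.grad.clone()  # first step: raw gradient
                else:
                    state["buf"].mul_(group["momentum"]).add_(param.grad)
                param.add_(state["buf"], alpha=-group["lr"])

        return loss


# Two identical parameters, one per optimizer
a = torch.nn.Parameter(torch.tensor([1.0, -2.0]))
b = torch.nn.Parameter(a.detach().clone())
custom = MomentumSGD([a], lr=0.1, momentum=0.9)
builtin = optim.SGD([b], lr=0.1, momentum=0.9)

for _ in range(3):
    for p, opt in ((a, custom), (b, builtin)):
        opt.zero_grad()
        (p ** 2).sum().backward()
        opt.step()

same = torch.allclose(a.detach(), b.detach())
print(same)
```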

# <a id='toc3_'></a>[Adjust Learning Rate](#toc0_)

- PyTorch provides **learning rate schedulers** in `torch.optim.lr_scheduler` to dynamically adjust the learning rate during training.
- Call `scheduler.step()` **after** `optimizer.step()`; calling it first skips the first value of the schedule.
- Schedulers are stepped **once per epoch or once per iteration**, depending on the chosen strategy.
- Common learning rate schedulers:

<table style="margin:0 auto;">
  <thead>
    <tr>
      <th>Scheduler</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><span style="font-family: monospace;">StepLR</span></td>
      <td>Decays LR every <span style="font-family: monospace;">step_size</span> epochs.</td>
    </tr>
    <tr>
      <td><span style="font-family: monospace;">MultiStepLR</span></td>
      <td>Decays LR at multiple predefined epochs.</td>
    </tr>
    <tr>
      <td><span style="font-family: monospace;">ExponentialLR</span></td>
      <td>Decays LR by a fixed multiplicative factor every epoch.</td>
    </tr>
    <tr>
      <td><span style="font-family: monospace;">ReduceLROnPlateau</span></td>
      <td>Reduces LR when a metric has stopped improving.</td>
    </tr>
    <tr>
      <td><span style="font-family: monospace;">CosineAnnealingLR</span></td>
      <td>Uses a cosine function for annealing LR.</td>
    </tr>
    <tr>
      <td><span style="font-family: monospace;">LambdaLR</span></td>
      <td>Custom LR scheduling via a user-defined function.</td>
    </tr>
  </tbody>
</table>
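
As a small example of an epoch-based schedule, the sketch below pairs `StepLR` with a dummy training loop and records the learning rate used in each epoch. The model, data, and the values `step_size=2` and `gamma=0.5` are arbitrary choices for illustration.

```python
import torch
from torch import optim

model = torch.nn.Linear(2, 1)
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Halve the learning rate every 2 epochs
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)

lrs = []
for epoch in range(6):
    lrs.append(optimizer.param_groups[0]["lr"])  # lr used during this epoch

    # One dummy training step standing in for a full epoch
    optimizer.zero_grad()
    loss = model(torch.randn(4, 2)).sum()
    loss.backward()
    optimizer.step()

    scheduler.step()  # update the lr after the optimizer step

print(lrs)  # approximately [0.1, 0.1, 0.05, 0.05, 0.025, 0.025]
```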

📝 **Docs**:

- How to adjust learning rate: [docs.pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate](https://docs.pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate)


In [8]:
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, "min", patience=3)
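
Unlike the epoch-based schedulers, `ReduceLROnPlateau` needs the monitored metric passed to `scheduler.step()`. The sketch below feeds it a made-up sequence of validation losses that stops improving after the second epoch; with `patience=3` and the default `factor=0.1`, the learning rate drops once the metric has failed to improve for more than three consecutive epochs.

```python
import torch
from torch import optim

model = torch.nn.Linear(2, 1)
optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, "min", patience=3)

# Hypothetical validation losses: no improvement after the second epoch
val_losses = [1.0, 0.9, 0.9, 0.9, 0.9, 0.9]

lrs = []
for val_loss in val_losses:
    # One dummy training step standing in for a full epoch
    optimizer.zero_grad()
    model(torch.randn(4, 2)).sum().backward()
    optimizer.step()

    scheduler.step(val_loss)  # the metric is required here
    lrs.append(optimizer.param_groups[0]["lr"])

print(lrs)  # lr stays at 0.1 for five epochs, then drops to about 0.01
```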