<a href="https://www.kaggle.com/code/mrafraim/dl-day-36-cnn-optimizers?scriptVersionId=296516268" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Day 36: CNN Optimizers
*SGD · RMSProp · Adam · AdamW · Convergence Tradeoffs*

Welcome to Day 36! The goal is not just to “pick an optimizer.”

Today you’ll understand:

- Why some CNNs converge fast but overfit  
- Why some slow optimizers generalize better  
- How the optimizer interacts with LR, weight decay, and batch size  
- Practical strategies to switch optimizers mid-training


---

# What an Optimizer Does

An optimizer’s job is to update the network’s weights $\theta$ to minimize the loss $L(\theta)$:

$$
\theta_{t+1} = \theta_t - \alpha \cdot f(\nabla_\theta L)
$$

Where:  
- $\alpha$ → learning rate (step size)  
- $f(\cdot)$ → optimizer-specific gradient transformation (e.g., momentum, adaptive scaling)

> The choice of optimizer affects not just training speed, but also convergence stability and generalization quality.


# SGD (Stochastic Gradient Descent)


## SGD Intuition

SGD is a first-order optimization algorithm that updates weights in the direction of the negative gradient of the loss:

$$
\boxed{\theta_{t+1} = \theta_t - \alpha \nabla_\theta L}
$$

- $\theta_t$ → current weights  
- $\nabla_\theta L$ → gradient of loss w.r.t weights  
- $\alpha$ → learning rate  

> Idea: move weights opposite to the slope to reduce loss.
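
As a minimal sketch of this rule, assume a toy one-dimensional loss $L(\theta) = (\theta - 3)^2$ (a made-up example, not part of the notebook) with gradient $2(\theta - 3)$. The gradient here is exact; in real training it would be estimated from a mini-batch. The names `grad` and `sgd_step` are illustrative helpers.

```python
# Plain gradient descent on a toy loss L(theta) = (theta - 3)^2.
# The loss, starting point, learning rate, and step count are illustrative assumptions.

def grad(theta):
    """Gradient of (theta - 3)^2 with respect to theta."""
    return 2.0 * (theta - 3.0)

def sgd_step(theta, alpha):
    """One update: theta <- theta - alpha * gradient."""
    return theta - alpha * grad(theta)

theta = 0.0
alpha = 0.1
for step in range(50):
    theta = sgd_step(theta, alpha)

print(theta)  # ends up very close to the minimum at 3
```

Each step shrinks the distance to the minimum by the factor $(1 - 2\alpha) = 0.8$, which is why the choice of $\alpha$ matters so much for plain SGD.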


### Momentum Extension

Momentum helps accelerate convergence along consistent gradient directions:

$$
v_{t+1} = \mu v_t - \alpha \nabla_\theta L
$$
$$
\theta_{t+1} = \theta_t + v_{t+1}
$$

Where:  
- $v_t$ → accumulated velocity  
- $\mu$ → momentum factor (0–1)  
- $\alpha$ → learning rate  

**Mathematical Breakdown**:

1. $v_{t+1}$ combines previous direction $v_t$ with current gradient $-\alpha \nabla_\theta L$  
2. Update $\theta$ by adding velocity → smooths zig-zag steps in steep directions  
3. Gradients in consistent directions get amplified, oscillations dampened


### Pros

- Excellent generalization on unseen data  
- Simple, predictable update rule (runs are reproducible given a fixed seed, data order, LR, and batch size)  
- Momentum accelerates learning in shallow or medium-depth CNNs

### Cons
- Sensitive to learning rate
- Slower than adaptive optimizers (Adam, RMSProp)  
- May stall in very deep CNNs without proper initialization or LR schedule


## Momentum in SGD

### 1. Motivation

Plain SGD update:
$$
\theta_{t+1} = \theta_t - \alpha \nabla_\theta L
$$

- Moves weights directly opposite the gradient
- Problem: in steep, narrow valleys (common in deep CNNs), SGD can oscillate back and forth, slowing convergence
- Momentum is designed to smooth updates and accelerate along consistent directions

### 2. Momentum Update Rule

Introduce a velocity vector $v_t$:

$$
v_{t+1} = \mu v_t - \alpha \nabla_\theta L
$$
$$
\theta_{t+1} = \theta_t + v_{t+1}
$$

Where:
- $v_t$ → accumulated “motion” from past gradients  
- $\mu$ → momentum factor (0.9 is typical)  
- $\alpha$ → learning rate

### 3. Intuition
- Think of $v_t$ as velocity of a ball rolling down a slope
- Gradient acts like a force pushing the ball
- Momentum remembers past updates → builds speed in directions where gradient is consistently pointing
- Reduces oscillations across steep directions, increases movement along gentle slopes
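
To make the "narrow valley" picture concrete, here is a rough sketch on a toy quadratic loss $L(x, y) = \tfrac{1}{2}(x^2 + 50y^2)$, which is gentle along $x$ and steep along $y$. The curvatures, learning rate, and step count are assumptions chosen for illustration, and `run` is a hypothetical helper. It compares how far plain SGD and SGD with momentum end up from the minimum after the same number of steps.

```python
import math

# Toy "narrow valley": L(x, y) = 0.5 * (1 * x^2 + 50 * y^2).
# The curvatures, learning rate, and step count are illustrative assumptions.
CURVATURE = (1.0, 50.0)  # gentle direction, steep direction

def grad(theta):
    """Gradient of the quadratic loss at theta = (x, y)."""
    return tuple(c * t for c, t in zip(CURVATURE, theta))

def run(mu, alpha=0.035, steps=100):
    """Run SGD with momentum factor mu (mu=0 is plain SGD) and
    return the final distance from the minimum at (0, 0)."""
    theta = (1.0, 1.0)
    velocity = (0.0, 0.0)
    for _ in range(steps):
        g = grad(theta)
        # v <- mu * v - alpha * grad ; theta <- theta + v
        velocity = tuple(mu * v - alpha * gi for v, gi in zip(velocity, g))
        theta = tuple(t + v for t, v in zip(theta, velocity))
    return math.hypot(*theta)

dist_plain = run(mu=0.0)
dist_momentum = run(mu=0.9)
print(f"plain SGD: {dist_plain:.5f}, with momentum: {dist_momentum:.5f}")
```

With these settings plain SGD is held back by the gentle $x$ direction, where each step only shrinks the error by a factor of 0.965, while the momentum run finishes noticeably closer to the minimum.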

### 4. Step-by-Step Example

Suppose $\mu = 0.9$, $\alpha = 0.01$:

1. **First step**:  
$$
v_1 = 0 - 0.01 \nabla_\theta L_1 = -0.01 \nabla_\theta L_1
$$
$$
\theta_1 = \theta_0 + v_1
$$

2. **Second step**:  
$$
v_2 = 0.9 v_1 - 0.01 \nabla_\theta L_2
$$
- 90% of previous “velocity” carried over  
- Current gradient contributes a small push  
$$
\theta_2 = \theta_1 + v_2
$$

3. **Over time:**
- Updates accumulate along consistent directions  
- Updates cancel out in oscillating directions → smoother path
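
The accumulation and cancellation described above can be checked numerically. This sketch uses $\mu = 0.9$ and $\alpha = 0.01$ from the example; the two gradient sequences are made-up values (the notebook leaves the gradients symbolic), and `momentum_velocities` is a hypothetical helper.

```python
# Momentum update from the steps above with mu = 0.9 and alpha = 0.01.
# The gradient sequences are made-up values chosen only for illustration.

def momentum_velocities(grads, mu=0.9, alpha=0.01):
    """Return the velocity after each step, starting from v_0 = 0."""
    v = 0.0
    history = []
    for g in grads:
        v = mu * v - alpha * g  # v_{t+1} = mu * v_t - alpha * grad
        history.append(v)
    return history

consistent = momentum_velocities([1.0, 1.0, 1.0])    # same sign every step
oscillating = momentum_velocities([1.0, -1.0, 1.0])  # sign flips every step

print([round(v, 4) for v in consistent])
print([round(v, 4) for v in oscillating])
```

With consistent gradients the velocity grows in magnitude step after step (about -0.01, -0.019, -0.027), while with alternating signs it never exceeds 0.01 in magnitude because successive pushes largely cancel.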


### 5. Key Effects

- Faster convergence along shallow directions  
- Reduced zig-zag along steep directions  
- Can escape small local minima more easily  
- Requires tuning momentum factor $\mu$ carefully


> Momentum adds a memory effect to SGD. It’s like giving the optimizer “inertia,” so it doesn’t react only to the current gradient, but also to recent trends.


```python
import torch

# PyTorch SGD with momentum (assumes `model` is an existing torch.nn.Module)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```

## Silent Failures in SGD

Even with correct architecture and initialization, SGD can fail silently if hyperparameters are off:

- **Learning rate too high**  
  - Gradients explode, loss may jump unpredictably  
  - Training appears unstable, but sometimes no NaNs appear  

- **Learning rate too low**  
  - Updates are tiny → training progresses extremely slowly  
  - Loss decreases almost imperceptibly, which is easy to mistake for a plateau or for convergence  

- **Momentum misconfigured**  
  - Too high → overshoots minima, oscillates along valleys  
  - Too low → slow convergence, little benefit over plain SGD  

> SGD’s failure modes are often quiet, unlike crashes, making careful hyperparameter tuning essential.
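
A small sketch of how quiet these failures can be, using a toy loss $L(\theta) = \theta^2$ (an assumption for illustration, not the notebook's model). The three learning rates are picked to show the "too high", "too low", and "reasonable" regimes, and `final_loss` is a hypothetical helper.

```python
# Toy loss L(theta) = theta^2 with gradient 2 * theta, minimized at 0.
# The three learning rates are assumptions picked to show each regime.

def final_loss(alpha, steps=20, theta=1.0):
    """Run plain gradient descent and return the loss after `steps` updates."""
    for _ in range(steps):
        theta = theta - alpha * 2.0 * theta
    return theta ** 2

for alpha in (1.1, 1e-4, 0.1):
    print(f"alpha={alpha}: loss after 20 steps = {final_loss(alpha):.4g}")
```

None of the three runs raises an error or produces a NaN. With the large rate the loss grows, with the tiny rate it barely moves from its starting value of 1, and only the middle setting actually converges, so the loss curve is the only signal that something is wrong.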

# RMSProp (Root Mean Square Propagation)

## RMSProp Intuition

RMSProp (Root Mean Square Propagation) is an adaptive learning rate optimizer.

<p style="color:orange;">“Adaptive learning rate optimizer” is just a fancy way of saying the optimizer automatically changes the step size for each weight during training, instead of using the same fixed learning rate for all weights.</p>

- **Idea:** scale each weight’s update based on recent gradient magnitudes 
- **Purpose:** avoid vanishing/exploding updates, especially in non-stationary or steep loss landscapes 

RMSProp essentially applies the moving-average idea behind momentum to the second moment (squared gradients) rather than to the gradients themselves.

### RMSProp Update Equations

1. **Compute moving average of squared gradients**:

$$
\boxed{v_t = \beta v_{t-1} + (1-\beta) (\nabla_\theta L_t)^2}
$$

Where:  
- $v_t$ → running average of squared gradients (elementwise)  
- $\beta$ → decay factor (typical 0.9)  
- $(\nabla_\theta L_t)^2$ → elementwise square of current gradient  

2. **Update weights:**

$$
\boxed{\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{v_t + \epsilon}} \nabla_\theta L_t}
$$

- $\alpha$ → base learning rate  
- $\epsilon$ → tiny number to prevent division by zero  


**Intuition**

- RMSProp reduces step size for weights with large recent gradients 
- RMSProp increases step size for weights with small recent gradients
- Acts like per-parameter adaptive learning rates

**Analogy:**  
- You’re hiking in a landscape where some slopes are very steep and others shallow  
- RMSProp slows down on steep slopes, speeds up on shallow slopes, so you don’t overshoot or crawl unnecessarily
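
The per-parameter behavior can be sketched in a few lines of plain Python. The helper `rmsprop_step` below follows the two update equations above for a single scalar weight; the constant gradients, the learning rate of 0.01, and the 50-step horizon are simplifying assumptions for illustration.

```python
import math

def rmsprop_step(theta, v, grad, alpha=0.01, beta=0.9, eps=1e-8):
    """One RMSProp update for a single scalar weight.
    Returns the new weight and the new running average v."""
    v = beta * v + (1.0 - beta) * grad ** 2
    theta = theta - alpha / math.sqrt(v + eps) * grad
    return theta, v

# Two weights whose gradients differ by four orders of magnitude
# (constant gradients are a simplifying assumption for this sketch).
grads = {"steep": 10.0, "shallow": 1e-3}
last_step = {}
for name, g in grads.items():
    theta, v = 0.0, 0.0
    for _ in range(50):
        new_theta, v = rmsprop_step(theta, v, g)
        last_step[name] = abs(new_theta - theta)  # size of the latest update
        theta = new_theta

print(last_step)
```

Although the raw gradients differ by a factor of 10,000, both weights end up moving by roughly $\alpha = 0.01$ per step once the running average has warmed up.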


### Tiny Example

Suppose one weight $\theta$ with gradients: 0.2, 0.1, -0.05  
Set: $\alpha = 0.1$, $\beta = 0.9$, $\epsilon = 1e-8$

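**Step 1**:  

$$
v_1 = 0.9*0 + 0.1*(0.2^2) = 0.004
$$
$$
\sqrt{v_1 + \epsilon} = \sqrt{0.004 + 1e-8} \approx 0.0632
$$
$$
\theta_1 = \theta_0 - 0.1 / 0.0632 * 0.2 \approx \theta_0 - 0.316
$$

- Notice: the gradient is scaled by its recent magnitude. The first step is large because $v_0 = 0$ and RMSProp has no bias correction  

**Step 2**:  
$$
v_2 = 0.9 * 0.004 + 0.1 * 0.1^2 = 0.0036 + 0.001 = 0.0046
$$
$$
\theta_2 = \theta_1 - 0.1 / \sqrt{0.0046} * 0.1 \approx \theta_1 - 0.147
$$
- The effective learning rate $\alpha / \sqrt{v_t}$ changes only slightly (about 1.58 → 1.47) because $v_t$ averages over history; it grows only after gradients stay small for several steps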


### Pros
- Works well for non-stationary data (gradients change distribution over time)  
- Prevents exploding or vanishing updates by scaling each weight adaptively  

### Cons
- Rarely used in modern CNNs; Adam/AdamW usually preferred  
- Can be less predictable than Adam because it adapts only the learning rate and, by default, keeps no momentum on the gradients

---

### <p style="text-align:center; color:orange; font-size:18px;">RMSProp Update Equations: Detailed Breakdown</p>

### 1. The Core Equations

1. Moving average of squared gradients:
$$
\boxed{v_t = \beta v_{t-1} + (1-\beta) (\nabla_\theta L_t)^2}
$$

2. Weight update:
$$
\boxed{\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{v_t + \epsilon}} \nabla_\theta L_t}
$$


### 2. Term-by-Term Explanation

#### a) $v_t = \beta v_{t-1} + (1-\beta)(\nabla_\theta L_t)^2$

- $v_t$ → running average of squared gradients
- $\beta$ → decay factor (typically 0.9) → controls how much history matters
- $(\nabla_\theta L_t)^2$ → current squared gradient  
- **Purpose:** if a gradient is consistently large, $v_t$ grows → future updates are scaled down  
- **Intuition:** RMSProp tracks how steep the slope is for this parameter


#### b) $\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{v_t + \epsilon}} \nabla_\theta L_t$

- $\nabla_\theta L_t$ → standard gradient  
- $\sqrt{v_t + \epsilon}$ → adaptive scaling based on past gradient magnitude  
  - Large $v_t$ → divide by bigger number → smaller step  
  - Small $v_t$ → divide by smaller number → larger step  
- $\epsilon$ → tiny constant (e.g., $1e-8$) → avoids division by zero  

> This ensures all weights move at roughly comparable effective speeds, even if gradients vary a lot across parameters.

### 3. Step-by-Step Intuition

1. **Compute squared gradient** → measures “steepness” for this parameter  
2. **Combine with past history** using $\beta$ → smooths noisy gradients  
3. **Scale current gradient by inverse sqrt** → large gradients shrink, small gradients grow  
4. **Update weight** → move in negative gradient direction, but scaled adaptively


### 4. Numeric Mini Example

Suppose a single weight, $\alpha = 0.1$, $\beta = 0.9$, $\epsilon = 1e-8$

| Step | Grad | $v_t$ (squared avg) | Update $\Delta \theta$ |
|------|------|--------------------|------------------------|
| 1    | 0.2  | 0.004              | $-0.1 / \sqrt{0.004} * 0.2 \approx -0.316$ |
| 2    | 0.1  | 0.0046             | $-0.1 / \sqrt{0.0046} * 0.1 \approx -0.147$ |
| 3    | -0.05| 0.00439            | $-0.1 / \sqrt{0.00439} * (-0.05) \approx +0.075$ |

- Notice how the update step changes automatically depending on recent gradient magnitudes 
- RMSProp prevents “overshooting” in steep directions and “slowness” in shallow ones
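
The three-step example is easy to re-run in plain Python as a sanity check. This sketch uses the same $\alpha$, $\beta$, $\epsilon$, and gradients as the example and places $\epsilon$ inside the square root, matching the boxed update equation; the variable names are illustrative.

```python
import math

# Re-run the three-step example: one weight, gradients 0.2, 0.1, -0.05.
alpha, beta, eps = 0.1, 0.9, 1e-8
v = 0.0
rows = []
for step, grad in enumerate([0.2, 0.1, -0.05], start=1):
    v = beta * v + (1.0 - beta) * grad ** 2     # running average of g^2
    delta = -alpha / math.sqrt(v + eps) * grad  # change applied to theta
    rows.append((step, grad, v, delta))
    print(f"step {step}: grad={grad:+.2f}  v={v:.5f}  delta={delta:+.3f}")
```

The running average comes out as about 0.0040, 0.0046, and 0.0044, and the updates as about -0.316, -0.147, and +0.075. The sign of each update always opposes the sign of the current gradient, since RMSProp rescales the gradient but keeps no momentum.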

### Key Takeaways

- RMSProp = adaptive step size per weight
- Large gradients → smaller step  
- Small gradients → larger step  
- Smooths training and handles non-stationary loss surfaces

---

### <p style="text-align:center; color:orange; font-size:18px;">Decay Factor in RMSProp</p>


In RMSProp, the **decay factor** $\beta$ controls how much past gradients influence the running average:

$$
v_t = \beta v_{t-1} + (1-\beta)(\nabla_\theta L_t)^2
$$

- $\beta \in [0,1)$  
- Also called the **smoothing factor**, or informally **momentum for the second moment**
- In statistics, the **first moment** is the mean  
- The **second moment** is the mean of **squared values**, which measures magnitude (an uncentered variance)  
- In RMSProp, $v_t$ is a running average of **squared gradients**, so it tracks the second moment of the gradients

### Role

- **Large $\beta$ (e.g., 0.9)** → keeps long-term memory  
  - Running average changes gradually → smoother, more stable  
- **Small $\beta$ (e.g., 0.5)** → more weight on recent gradients 
  - Running average reacts quickly → can be noisy  

> Think of $\beta$ as controlling how fast you “forget the past”.


### Analogy

- Measure temperature every minute  
- Running average = previous average × $\beta$ + current reading × $(1-\beta)$  
- **High $\beta$** → average changes slowly, remembers past hours  
- **Low $\beta$** → average follows recent readings closely  

Same principle applies to RMSProp: $\beta$ smooths gradient magnitude history.


### Tiny Example

- $\beta = 0.9$, $v_{t-1} = 0.04$, current squared gradient = 0.01  

$$
v_t = 0.9*0.04 + 0.1*0.01 = 0.037
$$

- Most weight (0.9) comes from past → “decays slowly”  
- Small weight (0.1) comes from current → “partial update”

- If $\beta = 0.5$:

$$
v_t = 0.5*0.04 + 0.5*0.01 = 0.025
$$

- Running average reacts faster to new gradient  


Decay factor $\beta$ = how much history you remember in your running average.
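
A short sketch of the "forgetting speed" idea. `ema_update` implements the running-average formula above, and `steps_to_adapt` is a hypothetical helper that counts how many updates the average needs to get within 10% of a new, constant squared gradient. The 10% tolerance and the jump from 0.04 to 0.01 are assumptions for illustration.

```python
def ema_update(v_prev, squared_grad, beta):
    """v_t = beta * v_{t-1} + (1 - beta) * g_t^2"""
    return beta * v_prev + (1.0 - beta) * squared_grad

# Values from the tiny example above (roughly 0.037 and 0.025).
print(ema_update(0.04, 0.01, beta=0.9))
print(ema_update(0.04, 0.01, beta=0.5))

def steps_to_adapt(beta, old=0.04, new=0.01, tol=0.1):
    """Hypothetical helper: count updates until the running average is
    within `tol` (relative) of a new, constant squared gradient."""
    v, steps = old, 0
    while abs(v - new) > tol * new:
        v = ema_update(v, new, beta)
        steps += 1
    return steps

print(steps_to_adapt(0.9), steps_to_adapt(0.5))
```

The gap to the new value shrinks by a factor of $\beta$ per step, so $\beta = 0.9$ needs several times more updates than $\beta = 0.5$ to catch up. A common rule of thumb is that the average "remembers" roughly $1/(1-\beta)$ recent steps.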

---

```python
import torch

# Create an RMSprop optimizer (assumes `model` is an existing torch.nn.Module)
# model.parameters(): all trainable weights of the network
# lr=1e-3: base learning rate (the α in the equations above)
# alpha=0.99: decay factor for the running average of squared gradients
#             (PyTorch calls it `alpha`; it is the β in the equations above)
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3, alpha=0.99)
```


# Adam (Adaptive Moment Estimation)


## Adam Intuition

Adam is an adaptive gradient optimizer that combines:

1. **Momentum** → keeps a moving average of past gradients  
2. **RMSProp** → scales updates by recent gradient magnitude (adaptive learning rate)

Purpose:
- Fast convergence
- Handles noisy or sparse gradients
- Often works well "out-of-the-box" with little LR tuning


### Adam Update Equations

1. **Compute moving averages** of gradient and squared gradient:

$$
m_t = \beta_1 m_{t-1} + (1-\beta_1) \nabla_\theta L_t
$$

- $m_t$ → “momentum” term (first moment estimate)  
- $\beta_1$ → decay factor for momentum (typical 0.9)  
- $\nabla_\theta L_t$ → gradient at step $t$

$$
v_t = \beta_2 v_{t-1} + (1-\beta_2) (\nabla_\theta L_t)^2
$$

- $v_t$ → “variance” term (second moment estimate)  
- $\beta_2$ → decay factor for squared gradients (typical 0.999)  
- $(\nabla_\theta L_t)^2$ → elementwise square of gradient  

2. **Bias correction** (part of the standard Adam algorithm):

$$
\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}
$$

- Corrects the fact that $m_0$ and $v_0$ start at 0 → avoids underestimation early

3. **Parameter update**:

$$
\theta_{t+1} = \theta_t - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
$$

- $\alpha$ → base learning rate  
- $\epsilon$ → tiny number (e.g., $1e-8$) to avoid division by zero  
- Intuition: divide momentum by RMS (adaptive scaling)


**Intuition**

- $m_t$ → keeps track of trend in gradients (like momentum)  
- $v_t$ → keeps track of how big gradients are 
- $\hat{m}_t / \sqrt{\hat{v}_t}$ → scales step proportionally to signal / noise  
- Large gradients → step reduced  
- Small gradients → step amplified  

> Adam dynamically adjusts the effective learning rate for each weight
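
One consequence of these equations is that the very first Adam update has magnitude of about $\alpha$ no matter how large the gradient is, because $\hat{m}_1 = g_1$ and $\sqrt{\hat{v}_1} = |g_1|$. The sketch below checks this with a hypothetical scalar helper `adam_step`; the gradient values and $\alpha = 0.1$ are arbitrary illustrations.

```python
import math

def adam_step(theta, m, v, grad, t, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar weight at step t (t starts at 1).
    Returns the new weight and the updated moment estimates."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    m_hat = m / (1.0 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)  # bias-corrected second moment
    theta = theta - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# First update for three gradients of very different size
# (the gradient values are arbitrary illustrations).
first_steps = {}
for grad in (1e-3, 1.0, 1e3):
    theta, _, _ = adam_step(theta=0.0, m=0.0, v=0.0, grad=grad, t=1)
    first_steps[grad] = theta

print(first_steps)
```

All three weights move by roughly -0.1, so the learning rate $\alpha$ acts as an approximate cap on the step size rather than a multiplier on the raw gradient.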

### Tiny Example

Suppose one weight $\theta$ with gradients over 3 steps:

| Step | Grad ($\nabla_\theta L$) |
|------|---------------------------|
| 1    | 0.2                       |
| 2    | 0.1                       |
| 3    | -0.05                     |

Set $\alpha=0.1$, $\beta_1=0.9$, $\beta_2=0.999$, $\epsilon=1e-8$  

**Step 1**:  

$$
m_1 = 0.9*0 + 0.1*0.2 = 0.02
$$
$$
v_1 = 0.999*0 + 0.001*0.2^2 = 0.00004
$$
$$
\hat{m}_1 = 0.02 / (1-0.9^1) = 0.2
$$
$$
\hat{v}_1 = 0.00004 / (1-0.999^1) \approx 0.04
$$
$$
\theta_1 = \theta_0 - 0.1 * 0.2 / \sqrt{0.04} \approx \theta_0 - 0.1
$$

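**Step 2** (gradient 0.1):  

$$
m_2 = 0.9*0.02 + 0.1*0.1 = 0.028, \quad v_2 = 0.999*0.00004 + 0.001*0.1^2 = 0.00004996
$$
$$
\hat{m}_2 = 0.028 / (1-0.9^2) \approx 0.147, \quad \hat{v}_2 = 0.00004996 / (1-0.999^2) \approx 0.0250
$$
$$
\theta_2 \approx \theta_1 - 0.1 * 0.147 / \sqrt{0.0250} \approx \theta_1 - 0.093
$$

**Step 3** (gradient -0.05):  

$$
m_3 = 0.9*0.028 + 0.1*(-0.05) = 0.0202, \quad v_3 = 0.999*0.00004996 + 0.001*0.05^2 \approx 0.0000524
$$
$$
\hat{m}_3 = 0.0202 / (1-0.9^3) \approx 0.0745, \quad \hat{v}_3 \approx 0.0000524 / (1-0.999^3) \approx 0.0175
$$
$$
\theta_3 \approx \theta_2 - 0.1 * 0.0745 / \sqrt{0.0175} \approx \theta_2 - 0.056
$$

- The step-3 gradient is negative, yet $\theta$ still decreases because the averaged gradient $\hat{m}_3$ is still positive  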

> Notice how updates scale automatically based on both trend and magnitude of gradients.
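
For readers who want to verify the numbers, here is a plain-Python sketch of the three steps with the same gradients and hyperparameters as the example. It is an illustration of the update equations, not PyTorch's implementation, and the variable names are assumptions.

```python
import math

# Three Adam steps for one weight with gradients 0.2, 0.1, -0.05.
alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
m, v = 0.0, 0.0
deltas = []
for t, grad in enumerate([0.2, 0.1, -0.05], start=1):
    m = beta1 * m + (1.0 - beta1) * grad       # first moment
    v = beta2 * v + (1.0 - beta2) * grad ** 2  # second moment
    m_hat = m / (1.0 - beta1 ** t)             # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    delta = -alpha * m_hat / (math.sqrt(v_hat) + eps)
    deltas.append(delta)
    print(f"t={t}: m_hat={m_hat:.4f}  v_hat={v_hat:.4f}  delta={delta:+.4f}")
```

The updates come out as roughly -0.100, -0.093, and -0.056. The third gradient is negative, but the update still pushes $\theta$ down because the momentum term has not yet changed sign.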


### **Pros**:
- Fast convergence  
- Works well with sparse gradients  
- Often “plug-and-play” without LR tuning

### **Cons**:
- Can overfit (fast convergence → less implicit regularization)  
- Sometimes worse final accuracy than SGD+momentum  
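- L2 regularization (the `weight_decay` argument of `torch.optim.Adam`) is added to the gradient and then rescaled by the adaptive term, so it does not act like true weight decay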


> Adam is a smart, adaptive SGD: it tracks both the direction and the scale of gradients, which makes it very effective, but you must still monitor overfitting and final generalization.


---

### <p style="text-align:center; color:orange; font-size:18px;">Adam Optimizer: Core Update Equations Breakdown</p>

Adam combines **momentum (first moment)** with **RMSProp (second moment)** for adaptive, stable updates. Let’s go step by step.


#### Step 1. Compute Gradient
For a weight $\theta$ at time step $t$:

$$
g_t = \nabla_\theta L_t
$$

- $g_t$ is the current gradient
- Represents how much the loss changes with respect to $\theta$  


#### Step 2: First Moment (Momentum)

$$
m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
$$

- $m_t$ = running average of past gradients (momentum)  
- $\beta_1$ = decay factor for first moment (typical 0.9)  
- $(1-\beta_1)$ = fraction of current gradient included  
- **Intuition:** smooths the gradient, capturing the direction of descent  

> Like a “moving average” of gradients, it prevents oscillation in steep or noisy directions.

#### Step 3: Second Moment (RMSProp)

$$
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
$$

- $v_t$ = running average of squared gradients 
- $\beta_2$ = decay factor for second moment (typical 0.999)  
- Tracks magnitude of gradients → adaptive scaling of updates  
- Squared gradient ensures we measure size only, ignoring direction
- **Intuition:** prevents steps from being too big on steep slopes and too small on flat slopes.

#### Step 4: Bias Correction

- At initialization: $m_0 = 0$, $v_0 = 0$  
- First few steps: $m_t$ and $v_t$ are biased toward zero because they are computed as:

$$
m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t
$$

- Step 1: $m_1 = (1-\beta_1) g_1$ → much smaller than actual gradient magnitude  
- Step 2: $m_2 = \beta_1 m_1 + (1-\beta_1) g_2$ → still smaller than “true average”  

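> Without correction, both estimates are too small early on. Because $v_t$ (with $\beta_2 = 0.999$) is biased toward zero more strongly than $m_t$, the ratio $m_t / \sqrt{v_t}$ is inflated and the first updates are poorly scaled (about 3.16× too large at $t=1$ with the typical values).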

Bias correction divides by $(1 - \beta_1^t)$:

$$
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad
\hat{v}_t = \frac{v_t}{1 - \beta_2^t}
$$

- $t$ = current time step  
- $(1 - \beta^t)$ starts small → compensates for the initial “zero bias”  

**Example:**  

- $\beta_1 = 0.9$, $t=1$  
- $1 - \beta_1^1 = 0.1$ → divide by 0.1 → scale up $m_1$ 10×  
- Now $\hat{m}_1 = g_1$ → the estimate has the correct magnitude

**Intuition**

- Without correction: $m_t$ and $v_t$ underestimate the true averages, so the first updates are mis-scaled (with typical $\beta_1$, $\beta_2$ they come out too large, since $\sqrt{v_t}$ shrinks more than $m_t$)  
- With correction: first updates match the scale we expect, then later updates naturally settle  
- Bias correction is only significant in **early steps**  
- After many steps ($t \to \infty$), $\beta^t \to 0$, correction factor → 1  
- Then $\hat{m}_t \approx m_t$, $\hat{v}_t \approx v_t$  

> Ensures Adam starts with correctly scaled steps immediately, preventing unstable initial training.
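
To see the size of the bias concretely, the sketch below feeds a constant gradient (an assumption that keeps the comparison simple) into both moving averages and compares the ratio $m_t / \sqrt{v_t}$ with and without correction. $\epsilon$ is omitted for clarity, and the variable names are illustrative.

```python
import math

beta1, beta2 = 0.9, 0.999
g = 0.2  # constant gradient, an assumption to keep the comparison simple

m, v = 0.0, 0.0
ratios = []
for t in range(1, 6):
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g ** 2
    raw = m / math.sqrt(v)  # ratio without bias correction
    corrected = (m / (1.0 - beta1 ** t)) / math.sqrt(v / (1.0 - beta2 ** t))
    ratios.append((raw, corrected))
    print(f"t={t}: m={m:.4f} (true mean {g})  raw={raw:.3f}  corrected={corrected:.3f}")
```

At $t = 1$ the raw estimate $m_1 = 0.02$ is ten times smaller than the gradient, yet the raw ratio is about 3.16 rather than 1, because $v_t$ is biased toward zero even more strongly than $m_t$. The corrected ratio stays at 1 for a constant gradient.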

#### Step 5: Parameter Update

$$
\boxed{\theta_{t+1} = \theta_t - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}}
$$

- $\alpha$ = learning rate  
- $\hat{m}_t / \sqrt{\hat{v}_t}$ → **direction (momentum)** / **magnitude scaling (RMSProp)**  
- $\epsilon$ = small number to avoid division by zero
- Intuition: move weight along a smoothed direction with adaptive step size.

#### Summary of Roles

| Term          | Role |
|---------------|------|
| $g_t$         | current gradient (signal) |
| $m_t, \hat{m}_t$ | momentum → smooths direction |
| $v_t, \hat{v}_t$ | second moment → adaptive step size |
| $\alpha$      | base learning rate |
| $\epsilon$    | numerical stability |

Adam combines directional smoothing and per-parameter scaling, making it robust and fast for most deep learning tasks.

---

```python
import torch

# PyTorch Adam (assumes `model` is an existing torch.nn.Module)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```

## Adam Failure Modes

- **Quick loss drop** → may give a false sense of convergence, but validation may still be poor  
- **Overfits small datasets** rapidly due to aggressive adaptive steps  
- **Incorrect weight decay / regularization** → with plain Adam, weight decay is applied as L2 regularization through the gradient, which can harm generalization

> Adam is fast and adaptive, but does not automatically prevent overfitting.


# AdamW (Adam with Decoupled Weight Decay)


To be continued...

---
<p style="text-align:center; color:skyblue; font-size:18px;">
© 2026 Mostafizur Rahman
</p>
