<a href="https://www.kaggle.com/code/mrafraim/dl-day-37-cnn-learning-rate-scheduling?scriptVersionId=298917645" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Day 37: CNN Learning Rate Scheduling
StepLR · ReduceLROnPlateau · CosineAnnealing · LR Phase Intuition

Welcome to Day 37!

What You’ll Learn Today:

1. Why fixed learning rates are suboptimal
2. Learning rate decay intuition
3. StepLR
4. ReduceLROnPlateau
5. CosineAnnealing
6. When to use which

If you found this notebook helpful, your **<b style="color:skyblue;">UPVOTE</b>** would be greatly appreciated! It helps others discover the work and supports continuous improvement.

---

# Why Fixed Learning Rate Fails

Recall the basic gradient descent update:

$$
\boxed{\theta_{t+1} = \theta_t - \alpha \nabla L(\theta_t)}
$$

Where:
- $\theta_t$ → current parameters  
- $\alpha$ → learning rate  
- $\nabla L$ → gradient (direction of steepest increase)

We subtract the gradient to move **downhill**.

## What the Learning Rate Really Controls

The learning rate $\alpha$ determines **step size**.

It answers:

> “How far should we move in the gradient direction?”

## If $\alpha$ Is Large

$$
\theta_{t+1} = \theta_t - \textbf{large step}
$$

### Pros
- Fast movement across loss surface  
- Escapes sharp local minima  
- Explores wider regions  

### Cons
- Can overshoot minima  
- Oscillates around optimum  
- May diverge  

Large LR is good for:
- Early training
- Rough exploration

## If $\alpha$ Is Small

$$
\theta_{t+1} = \theta_t - \textbf{small step}
$$

### Pros
- Stable convergence  
- Precise fine-tuning  
- Less oscillation  

### Cons
- Very slow training  
- Can get stuck in sharp minima  
- Poor exploration  

Small LR is good for:
- Late training
- Fine-grained optimization

## Core Problem

Training broadly has two different phases:

1️. Exploration phase  
2️. Refinement phase  

A single fixed $\alpha$ cannot optimize both:

- If large → unstable at end
- If small → painfully slow at start 
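
The trade-off can be seen on a toy problem. The sketch below is an illustration only: it assumes a one-dimensional loss $L(\theta) = \theta^2$ (gradient $2\theta$), a starting point of 5.0, and three arbitrarily chosen learning rates. None of these values come from a real network.

```python
def gradient_descent(alpha, theta0=5.0, steps=20):
    """Plain gradient descent on the toy loss L(theta) = theta**2."""
    theta = theta0
    for _ in range(steps):
        grad = 2.0 * theta            # dL/dtheta for L = theta^2
        theta = theta - alpha * grad  # theta_{t+1} = theta_t - alpha * grad
    return theta


# Hypothetical learning rates chosen to show the three regimes
final_theta = {alpha: gradient_descent(alpha) for alpha in (1.1, 0.4, 0.01)}

for alpha, theta in final_theta.items():
    print(f"alpha={alpha:<5} final theta after 20 steps = {theta:.6f}")
```

With these settings, $\alpha = 1.1$ overshoots further on every step and diverges, $\alpha = 0.4$ lands essentially on the minimum, and $\alpha = 0.01$ is stable but still far from the minimum after 20 steps.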

## Geometric Intuition

Think of descending a mountain:

- At the top → you want big jumps  
- Near the bottom → you want tiny careful steps  

Using one fixed step size is inefficient.

## Strategic Conclusion

> Learning rate must change over time.

That’s why we use:
- Step decay  
- Exponential decay  
- Cosine annealing  
- OneCycle  
- Warmup schedules  

Fixed LR is simple but strategically weak for deep networks.

# Optimization Has Distinct Phases

Training a neural network is **not uniform**.  
The geometry of the loss surface and gradient magnitudes change over time.

## 1️. Exploration Phase (Early Training)

- Loss decreases rapidly  
- Gradients are large  
- Parameters are far from optimum  

Update rule:

$$
\theta_{t+1} = \theta_t - \alpha \nabla L(\theta_t)
$$

Since $\|\nabla L\|$ is large:

- Large $\alpha$ → fast movement across surface  
- Helps escape sharp or poor local regions  
- Encourages broader exploration  

Goal here:
> Move quickly toward a promising basin.

## 2️. Transition Phase (Approaching Minimum)

- Parameters enter a valley (basin)  
- Gradients shrink but fluctuate  
- Curvature becomes important  

If $\alpha$ remains large:

- Updates overshoot the minimum  
- Oscillations occur across valley walls  

Mathematically:

Large $\alpha$ × small but varying $\nabla L$ → unstable zig-zag behavior  

Goal here:
> Reduce step size to stabilize descent.


## 3️. Fine Convergence Phase (Late Training)

- Gradients are small  
- Surface curvature dominates  
- Small adjustments refine solution  

Now we need:

$$
\alpha \text{ very small}
$$

Why?

Because near minimum:

$$
\nabla L \approx 0
$$

Large $\alpha$ would:
- Bounce around minimum  
- Prevent precise convergence  

Goal here:
> Minimize noise and fine-tune parameters.
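
One way to see the "bouncing" is to add noise to the gradient, as happens with minibatches. The sketch below is a toy illustration under stated assumptions: a one-dimensional loss $L(\theta) = \theta^2$, Gaussian noise with standard deviation 1 standing in for minibatch noise, and a run that starts exactly at the minimum. The function name and all constants are hypothetical.

```python
import random


def mean_distance_from_minimum(alpha, steps=20000, seed=0):
    """Run noisy gradient descent on L(theta) = theta**2, starting at the minimum.

    Returns the average |theta| over the run, i.e. how far the iterate
    keeps wandering from the optimum at theta = 0.
    """
    rng = random.Random(seed)
    theta = 0.0
    total = 0.0
    for _ in range(steps):
        noisy_grad = 2.0 * theta + rng.gauss(0.0, 1.0)  # true gradient + noise
        theta -= alpha * noisy_grad
        total += abs(theta)
    return total / steps


wander_large = mean_distance_from_minimum(alpha=0.1)
wander_small = mean_distance_from_minimum(alpha=0.001)
print(f"alpha=0.1   -> mean |theta| = {wander_large:.4f}")
print(f"alpha=0.001 -> mean |theta| = {wander_small:.4f}")
```

In this toy setting the larger learning rate keeps the parameter noticeably farther from the minimum on average, even though both runs start at the optimum.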

## Fixed Learning Rate Fails

A single $\alpha$ cannot satisfy all phases:

| Phase | Ideal LR |
|-------|----------|
| Exploration | Large |
| Transition | Medium |
| Fine Convergence | Small |

If LR is fixed:

- Large LR → never truly settles  
- Small LR → wastes early training potential  

## Core Insight

Optimization is **dynamic**.

The learning rate should:
- Start large  
- Gradually decrease  
- Become very small near convergence  

That’s why we use:

- Step decay  
- Cosine annealing  
- OneCycle policy  
- Warmup + decay  

> No scheduling = stuck oscillating near the minimum instead of converging smoothly.

# What is Learning Rate Scheduling?

**Learning rate scheduling** is the strategy of changing the learning rate during training instead of keeping it constant.

Instead of using a fixed learning rate:

$$
\theta_{t+1} = \theta_t - \alpha \nabla L
$$

we use a time-dependent learning rate:

$$
\boxed{\theta_{t+1} = \theta_t - \alpha_t \nabla L}
$$

where $\alpha_t$ is the learning rate used at step $t$.


## Why Do We Need It?

Training usually happens in different phases:

### 1️. Early Phase > Exploration
- Loss decreases rapidly  
- Gradients are large  
- Larger learning rate helps move quickly  

### 2️. Middle Phase > Transition
- Approaching a good region  
- Oscillations may start  
- Learning rate should reduce  

### 3️. Final Phase > Fine Convergence
- Very small improvements  
- Requires precise adjustments  
- Small learning rate is necessary  

A single fixed learning rate cannot work optimally across all these phases.

## What Scheduling Does

Learning rate scheduling:

- Starts large → enables fast progress  
- Gradually reduces → improves stability  
- Prevents oscillation near minimum  
- Often improves final accuracy  


## Simple Example

### Without Scheduling

LR = 0.01 for all 30 epochs

Result:
- Fast initial progress  
- Later oscillation around the minimum  

### With Scheduling

Epoch 1–10   → 0.01<br>
Epoch 11–20  → 0.001<br>
Epoch 21–30  → 0.0001

Result:
- Early: Large steps  
- Later: Small, precise steps  
- Smoother convergence  
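
The three-stage schedule above can be written as a small lookup function. This is a hand-rolled sketch (the name `manual_lr` is hypothetical), using the 1-based epoch numbers from the example.

```python
def manual_lr(epoch):
    """Learning rate for a 1-based epoch number, following the example schedule."""
    if epoch <= 10:
        return 0.01      # epochs 1-10: large steps
    if epoch <= 20:
        return 0.001     # epochs 11-20: medium steps
    return 0.0001        # epochs 21-30: small, precise steps


for epoch in (1, 10, 11, 20, 21, 30):
    print(f"epoch {epoch:>2}: lr = {manual_lr(epoch)}")
```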

## Common Types of LR Scheduling

- Step Decay  
- Exponential Decay  
- Cosine Annealing  
- Reduce on Plateau  
- Warmup + Decay  
- Cyclical Learning Rate  

> Learning rate scheduling is the technique of dynamically adjusting the learning rate during training to improve convergence speed and stability.

# StepLR - Discrete Learning Rate Decay

StepLR is a learning rate scheduler that reduces the learning rate by a fixed factor (`gamma`) after a fixed number of epochs (`step_size`).

It implements **piecewise-constant decay**\*.

Instead of smoothly decreasing the learning rate, it drops it in sudden steps.

### StepLR Formula

$$
\boxed{\alpha_t = \alpha_0 \cdot \gamma^{\left\lfloor \frac{t}{\text{step\_size}} \right\rfloor}}
$$

Where:

- $\alpha_t$ → learning rate at epoch $t$
- $t$ → current epoch
- $\alpha_0$ → initial learning rate
- $\gamma$ → decay factor (e.g., 0.1)
- $\text{step\_size}$ → number of epochs before decay
- $\lfloor \cdot \rfloor$ → floor function (round down)
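
The formula translates directly into a one-line function. This is a plain-Python sketch of the closed form, not the library's code. The default values are illustrative, and epochs are assumed to be counted from 0.

```python
def step_lr(t, alpha0=0.01, gamma=0.1, step_size=10):
    """Closed-form StepLR value: alpha0 * gamma ** floor(t / step_size)."""
    return alpha0 * gamma ** (t // step_size)  # // is floor division for t >= 0


for t in (0, 9, 10, 19, 20, 29):
    print(f"epoch {t:>2}: lr = {step_lr(t):.6f}")
```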

**Intuition**

Think of StepLR as:

> "Train normally for some time → then suddenly reduce step size → repeat."

It allows:

- Fast learning early  
- More careful learning later  

---

### <p style="color:orange;text-align:center;">*What is Piecewise-Constant Decay? (Optional)</p>

**Piecewise-constant decay** is a function that:

- Stays constant for a period of time  
- Then suddenly drops to a new constant value  
- Repeats this pattern  

It does not change smoothly.  
It changes in discrete jumps.


Imagine a staircase:


Level 1  ──────────<br>
↓<br>
Level 2  ──────────<br>
↓<br>
Level 3  ──────────


Each flat region is constant.  
Each drop happens at specific intervals.

That staircase shape = piecewise-constant behavior.

### Mathematical View (StepLR Example)

For StepLR:

$$
\alpha_t = \alpha_0 \cdot \gamma^{\left\lfloor \frac{t}{\text{step\_size}} \right\rfloor}
$$

The learning rate:

- Remains constant within each interval  
- Changes only when $ t $ crosses multiples of `step_size`

### Simple Example

Let:

- $ \alpha_0 = 0.01 $
- $ \gamma = 0.1 $
- $ \text{step\_size} = 5 $

Then:

| Epoch | Learning Rate |
|-------|---------------|
| 0–4   | 0.01          |
| 5–9   | 0.001         |
| 10–14 | 0.0001        |

Notice:

- Within each range → LR is constant  
- At epoch 5 and 10 → LR drops suddenly  

That is **piecewise-constant decay**.
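
The "pieces" can be recovered programmatically by grouping epochs that share the same number of applied decays. The sketch below reuses the example values ($\alpha_0 = 0.01$, $\gamma = 0.1$, `step_size = 5`); the variable names are hypothetical.

```python
from itertools import groupby

alpha0, gamma, step_size = 0.01, 0.1, 5   # values from the example above

# Group consecutive epochs by how many decays have been applied so far
intervals = []
for num_decays, epochs in groupby(range(15), key=lambda t: t // step_size):
    epochs = list(epochs)
    lr = alpha0 * gamma ** num_decays
    intervals.append((epochs[0], epochs[-1], lr))

for start, end, lr in intervals:
    print(f"epochs {start:>2}-{end:>2}: lr = {lr:.4f}")
```

Each tuple in `intervals` is one flat stair: a first epoch, a last epoch, and the constant learning rate used in between.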


### Why It's Called "Piecewise"

- "Piecewise" → defined in separate intervals (pieces)
- "Constant" → value does not change within each interval
- "Decay" → value decreases over time

> Piecewise-constant decay is a step-like schedule where a value remains constant for fixed intervals and then decreases abruptly at predefined points.

---

## What the Floor Function Does

$$
\left\lfloor \frac{t}{10} \right\rfloor
$$

If `step_size = 10`:

| Epoch | t/10 | Floor | Power of γ |
|-------|------|-------|------------|
| 0–9   | 0–0.9 | 0     | γ⁰ = 1 |
| 10–19 | 1–1.9 | 1     | γ¹ |
| 20–29 | 2–2.9 | 2     | γ² |

This creates the **staircase pattern**.

## Manual Example

Given:

- $\alpha_0 = 0.01$
- $\gamma = 0.1$
- step_size = 10


### Epoch 1:

$$
\alpha_1 = 0.01 \cdot 0.1^{\lfloor 1/10 \rfloor}
$$

$$
= 0.01 \cdot 0.1^0
$$

$$
= 0.01 \cdot 1
$$

$$
= 0.01
$$


### Epoch 10:

$$
\alpha_{10} = 0.01 \cdot 0.1^{\lfloor 10/10 \rfloor}
$$

$$
= 0.01 \cdot 0.1^1
$$

$$
= 0.001
$$

So the first decay takes effect at epoch 10, that is, once epochs 0–9 have been completed.


### Epoch 15:

$$
\alpha_{15} = 0.01 \cdot 0.1^{\lfloor 15/10 \rfloor}
$$

$$
= 0.01 \cdot 0.1^1
$$

$$
= 0.001
$$

Learning rate remains constant until next boundary.


### Epoch 25:

$$
\alpha_{25} = 0.01 \cdot 0.1^{\lfloor 25/10 \rfloor}
$$

$$
= 0.01 \cdot 0.1^2
$$

$$
= 0.0001
$$


### Final Schedule

| Epoch | LR |
|--------|------|
| 0–9 | 0.01 |
| 10–19 | 0.001 |
| 20–29 | 0.0001 |

Notice the **sudden drops**.

## How to Choose `gamma`

`gamma` controls how aggressively you reduce the learning rate:

$$
\alpha_{\text{new}} = \alpha_{\text{old}} \cdot \gamma
$$

So the real question is:

> How big should each decay jump be?

Let’s approach this strategically.


### 1️. First-Principles View

Learning rate controls step size:

$$
\Delta \theta = -\alpha \nabla L
$$

Reducing LR changes:

- Exploration intensity  
- Convergence speed  
- Stability  

`gamma` determines how sharply you transition from exploration → refinement.


### 2️. Standard Practical Values

Most commonly used values:

| Gamma | Effect |
|-------|--------|
| 0.1   | Strong decay (10× smaller) |
| 0.5   | Moderate decay (2× smaller) |
| 0.8–0.9 | Gentle decay |

### Rule of thumb:
- Computer vision (ImageNet-style): **0.1**
- Smaller datasets: **0.5**
- Fine-tuning: **0.1–0.2**


### 3️. Strategic Way to Choose Gamma

Instead of guessing, think in terms of training phases.

Ask:

1. How far am I from convergence?
2. Do I need sharp stabilization or gradual refinement?

#### Case A: Large Model, Large Dataset

Use:
$$
\gamma = 0.1
$$

Why?

- You want clear phase transitions.
- Early phase = aggressive learning.
- Later phase = precise convergence.

Strong drop is fine because gradients are stable.

#### Case B: Small Dataset / Noisy Gradients

Use:
$$
\gamma = 0.5
$$

Why?

- Large abrupt drops may stall learning.
- Gentler decay avoids optimizer shock.

#### Case C: Fine-Tuning Pretrained Model

Often:
$$
\gamma = 0.1 \text{ or } 0.2
$$

Because:

- You are already near a good minimum.
- You want fast stabilization.

### 4️. Quantitative Thinking (Better Way)

Decide how small you want LR at the end.

If:

- Initial LR = $0.01$
- Final desired LR ≈ $0.0001$
- You decay 2 times

Then solve:

$$
0.01 \cdot \gamma^2 = 0.0001
$$

$$
\gamma^2 = 0.01
$$

$$
\gamma = 0.1
$$

So instead of guessing gamma,
**work backward from desired final LR.**

This is the cleanest method.
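
The same back-calculation works for any number of decay events. A minimal sketch, assuming a hypothetical helper named `gamma_for_target`:

```python
def gamma_for_target(initial_lr, final_lr, num_decays):
    """Solve initial_lr * gamma**num_decays = final_lr for gamma."""
    return (final_lr / initial_lr) ** (1.0 / num_decays)


gamma = gamma_for_target(initial_lr=0.01, final_lr=0.0001, num_decays=2)
print(f"gamma = {gamma:.4f}")
```

Spreading the same 100× total reduction over 4 decays instead of 2 would give a gentler factor of roughly 0.316 per decay.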

### Warning: Too Small Gamma

If:

$$
\gamma = 0.01
$$

Then LR collapses too fast.

You lose exploration early.
Training may stall.


### Warning: Too Large Gamma

If:

$$
\gamma = 0.9
$$

Decay barely matters.
You may oscillate too long.
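
To make both warnings concrete, the sketch below tabulates the learning rate after one, two, and three decay events. The starting value of 0.01 is an illustrative assumption.

```python
initial_lr = 0.01   # illustrative starting value

# Learning rate after 1, 2 and 3 decay events for each candidate gamma
decayed = {
    gamma: [initial_lr * gamma ** k for k in (1, 2, 3)]
    for gamma in (0.01, 0.1, 0.9)
}

for gamma, lrs in decayed.items():
    print(f"gamma={gamma:<4}: " + ", ".join(f"{lr:.2e}" for lr in lrs))
```

After three decays, $\gamma = 0.01$ has shrunk the learning rate by a factor of a million, while $\gamma = 0.9$ still leaves about 73% of the original value.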


### Practical Recommendation Framework

Choose gamma based on:

1. Desired final LR
2. Number of decay steps
3. Dataset size
4. Stability of gradients

Most robust default:

$$
\gamma = 0.1
$$

Unless you have reason not to.

## PyTorch Implementation

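```python
import torch

# Placeholder model and optimizer (illustrative only) so the snippet runs on its own
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Create a StepLR scheduler object
scheduler = torch.optim.lr_scheduler.StepLR(
    optimizer,        # The optimizer whose learning rate we want to control
    step_size=10,     # Number of epochs to wait before reducing the learning rate
    gamma=0.1         # Multiplicative decay factor (new_lr = old_lr * gamma)
)
```
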
### What This Actually Means

* The learning rate will remain unchanged for 10 epochs.

* At epoch 10 → LR becomes:

  $$
  \text{LR} = \text{LR} \times 0.1
  $$

* At epoch 20 → it is multiplied by 0.1 again.

* This creates the staircase (piecewise-constant) decay pattern.


At each epoch:

```python
# Call this once after each epoch
scheduler.step()
```

### What `scheduler.step()` Does

* It advances the scheduler's epoch counter and updates the optimizer's learning rate.

* Conceptually, the new learning rate is determined by:

  $$
  \left\lfloor \frac{\text{current\_epoch}}{\text{step\_size}} \right\rfloor
  $$

* When the counter reaches a decay boundary (10, 20, 30, ...),
  the learning rate is multiplied by `gamma`.


### Important Detail (Often Missed)

Typical training loop:

```python
for epoch in range(num_epochs):
    train(...)
    validate(...)
    scheduler.step()   # Step AFTER each epoch
```

If you call `scheduler.step()` in the wrong place (e.g., before training),
your decay timing shifts by one epoch.
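
The one-epoch shift can be reproduced without any framework. The class below is a hypothetical stand-in written for this illustration (`TinyStepLR` is not PyTorch's implementation): it counts calls to `step()` and multiplies the learning rate by `gamma` on every `step_size`-th call.

```python
class TinyStepLR:
    """Hypothetical stand-in for a step scheduler (not PyTorch's implementation)."""

    def __init__(self, lr, step_size, gamma):
        self.lr = lr
        self.step_size = step_size
        self.gamma = gamma
        self.num_steps = 0            # how many times step() has been called

    def step(self):
        self.num_steps += 1
        if self.num_steps % self.step_size == 0:
            self.lr *= self.gamma     # apply the multiplicative drop


def lrs_used_for_training(step_before_training, num_epochs=12):
    """Record the learning rate each epoch actually trains with."""
    sched = TinyStepLR(lr=0.01, step_size=10, gamma=0.1)
    used = []
    for _ in range(num_epochs):
        if step_before_training:
            sched.step()              # wrong place: before the epoch's training
        used.append(sched.lr)         # training for this epoch would happen here
        if not step_before_training:
            sched.step()              # usual place: after the epoch's training
    return used


after = lrs_used_for_training(step_before_training=False)
before = lrs_used_for_training(step_before_training=True)
print("first epoch at reduced LR (step after): ", after.index(min(after)))
print("first epoch at reduced LR (step before):", before.index(min(before)))
```

Stepping after training gives the first reduced-LR epoch at index 10. Stepping before training moves it to index 9, so one epoch of large-LR training is lost.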

**Notes:**

* `step_size` controls when decay happens.
* `gamma` controls how much decay happens.
* `scheduler.step()` triggers the update.


## Why It Works

Recall optimization phases:

1️. Early → large LR helps exploration<br>
2️. Later → smaller LR helps convergence

StepLR forces a manual phase transition.

## Weakness of StepLR

Abrupt decay can:

* Shock optimizer momentum buffers
* Cause temporary instability
* Reduce LR too early
* Waste exploration capacity

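Momentum-based optimizers especially feel this shock because the update applies the learning rate to an accumulated velocity:

$$
v_{t+1} = \beta v_t + \nabla L_t, \qquad \theta_{t+1} = \theta_t - \alpha v_{t+1}
$$

A sudden drop in $\alpha$ instantly rescales the step taken along the accumulated velocity.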

---

### <p style="text-align:center; color:orange; font-size:18px;"> Why Abrupt LR Drops Can Be a Problem (Optional)</p>

Let’s break this from first principles.

1️⃣ **What Momentum Is Really Doing**

For momentum SGD:

$$
v_{t+1} = \beta v_t + \nabla L_t
$$

$$
\theta_{t+1} = \theta_t - \alpha v_{t+1}
$$

Key idea:

- $v_t$ = **accumulated direction**
- $\alpha$ = **step size multiplier**

Momentum builds up velocity over time.  
It smooths gradients and accelerates movement in consistent directions.

Think of it as a rolling ball gaining speed downhill.


2️⃣ **What Happens Before Decay**

Suppose:

- Learning rate: $\alpha = 0.01$
- Momentum velocity is large (because training has been progressing steadily)

Updates look like:

$$
\Delta \theta \approx -0.01 \cdot v_t
$$

The system is tuned to this step size.


3️⃣ **Sudden StepLR Drop**

At epoch 10:

$$
\alpha \rightarrow 0.001
$$

But notice something:

- $v_t$ is still large (built using old scale)
- Only $\alpha$ changed

Now updates become:

$$
\Delta \theta \approx -0.001 \cdot v_t
$$

That is a 10× smaller step instantly.
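
A few lines of arithmetic show the effect. The sketch assumes a constant gradient of 1.0 (a steady downhill slope) and $\beta = 0.9$; both values are illustrative and not taken from a real training run.

```python
beta = 0.9            # momentum coefficient (illustrative)
grad = 1.0            # assume a constant gradient, i.e. a steady downhill slope

# Build up the momentum buffer: v_{t+1} = beta * v_t + grad
v = 0.0
for _ in range(200):
    v = beta * v + grad

step_before = 0.01 * v    # |delta theta| with the old learning rate
step_after = 0.001 * v    # |delta theta| right after the drop, same buffer

print(f"velocity buffer    = {v:.3f}")
print(f"step before decay  = {step_before:.4f}")
print(f"step after decay   = {step_after:.4f}")
```

The buffer settles near $1 / (1 - \beta) = 10$ and is identical on both sides of the drop. Only the multiplier in front of it changes.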


4️⃣ **Why This Feels Like a “Shock”**

The optimizer dynamics were stable at:

$$
\text{velocity size} \times \text{learning rate} = \text{effective step}
$$


Suddenly:

- Velocity is large
- Learning rate shrinks sharply

So:

- Effective step collapses
- Momentum needs time to re-adjust
- Training temporarily slows or becomes inconsistent

It’s like:

> Driving at 100 km/h and instantly switching to first gear.

System needs time to re-balance.

5️⃣ **Why Smooth Decay Is More Stable**

With smooth decay (e.g., exponential or cosine), for example exponential decay with rate $k > 0$:

$$
\alpha_t = \alpha_0 e^{-kt}
$$

Learning rate shrinks gradually.

Momentum adapts gradually too.

No sudden system shock.
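
The difference can be quantified by comparing the largest single-epoch change in each schedule. In the sketch below the rate $k = \ln(10)/10$ is an assumption, chosen so that the exponential schedule reaches the same 10× reduction every 10 epochs as a step schedule with $\gamma = 0.1$.

```python
import math

alpha0 = 0.01
k = math.log(10) / 10     # assumed rate: LR falls by 10x every 10 epochs


def exp_lr(t):
    """Exponential decay: alpha_t = alpha0 * exp(-k * t)."""
    return alpha0 * math.exp(-k * t)


def step_lr(t, gamma=0.1, step_size=10):
    """Step decay with the same overall 10x drop per 10 epochs."""
    return alpha0 * gamma ** (t // step_size)


# Largest single-epoch change in each schedule (ratio of consecutive values)
exp_ratios = [exp_lr(t + 1) / exp_lr(t) for t in range(20)]
step_ratios = [step_lr(t + 1) / step_lr(t) for t in range(20)]
print(f"exponential: every epoch multiplies LR by {min(exp_ratios):.3f}")
print(f"step:        worst epoch multiplies LR by {min(step_ratios):.3f}")
```

Both schedules end up at the same learning rate at epochs 10 and 20. The exponential one gets there by shrinking about 21% per epoch, while the step schedule makes a single 90% cut at each boundary.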


6️⃣ **"Reduce LR Too Early" Problem**

StepLR forces decay at fixed epochs.

But what if:

- Model is still far from minimum?
- Gradients are still large?
- Loss is decreasing fast?

Then reducing LR early:

- Slows exploration
- Makes convergence slower than necessary
- Traps you in suboptimal regions

This is why adaptive schedulers (e.g., ReduceLROnPlateau) often work better.


**Simple Mental Model**

Without momentum:
- StepLR = less problematic

With momentum:
- You are scaling a moving object suddenly
- System needs time to stabilize

> StepLR causes sudden learning rate drops that disrupt the balance between accumulated momentum and step size, leading to temporary instability or slowed progress.

---

### <p style="text-align:center; color:orange; font-size:18px;"> “Shock Optimizer Momentum Buffers” What It Actually Means (Optional)</p>


1️⃣ **What Is a Momentum Buffer?**

In momentum-based optimizers (SGD with momentum, Adam, etc.),  
the optimizer keeps an internal variable:

$$
v_t
$$

This is often called the **momentum buffer**.

It stores an exponentially decaying sum of past gradients:

$$
v_{t+1} = \beta v_t + \nabla L_t
$$

Think of it as:

> Accumulated direction + speed from past steps.

It has memory.


2️⃣ **Where Learning Rate Enters**

The actual parameter update is:

$$
\theta_{t+1} = \theta_t - \alpha v_{t+1}
$$

Important:

- $v_t$ is built gradually.
- $\alpha$ scales how much of that velocity we apply.

3️⃣ **What Happens in StepLR**

Before decay:

$$
\alpha = 0.01
$$

Suddenly at epoch 10:

$$
\alpha = 0.001
$$

But the momentum buffer:

$$
v_t
$$

is still large because it was built assuming the old learning rate dynamics.


4️⃣ **The “Shock”**

Suddenly:

$$
\Delta \theta = -0.001 \cdot v_t
$$

instead of:

$$
\Delta \theta = -0.01 \cdot v_t
$$

So:

- The optimizer was moving fast.
- Instantly, its effective step becomes 10× smaller.
- The stored velocity no longer matches the new step scale.

That sudden imbalance = **shock to the momentum buffer**.


5️⃣ **What You Observe During Training**

Right after a StepLR drop, you may see:

- Temporary plateau
- Slight loss spike
- Slower progress for a few epochs
- Momentum re-adjustment period

Then training stabilizes again.

6️⃣ **Why Smooth Schedules Avoid This**

With exponential or cosine decay:

- $\alpha$ changes gradually.
- Momentum buffer adapts gradually.
- No internal mismatch.

No shock.

---


## When StepLR Is Still Effective

### 1️. Large Datasets (e.g., ImageNet-Scale Training)

On very large datasets:

- Gradients are stable (low variance)
- Optimization landscape is smoother
- Training runs for many epochs

Because of this stability:

- Abrupt LR drops don’t destabilize training much
- The model has enough signal to recover quickly
- Predefined decay points often align with learning phases

In large-scale vision training, simple schedules are often sufficient.

### 2️. Fixed-Length Training Pipelines

If you know:

- Total epochs = 90
- You always train exactly 90
- Same dataset, same model, same regime

Then you can pre-plan:

- Decay at 30
- Decay at 60

This removes the need for adaptive logic.

StepLR works well when:

> The optimization trajectory is predictable.


### 3️. Classic CNN Training (ResNet-Style)

Many landmark CNN papers (e.g., early ResNet training pipelines) used milestone-based decay.

Typical pattern:

Epoch 0–29   → LR = 0.1<br>
Epoch 30–59  → LR = 0.01<br>
Epoch 60–89  → LR = 0.001

Why it works here:

- CNNs on large vision datasets train in phases.
- Early phase = rapid representation learning.
- Mid phase = feature stabilization.
- Late phase = refinement.

StepLR aligns naturally with these phases.


### 4️. When Milestones Are Known in Advance

If historical runs show:

- Loss plateaus around epoch 25
- Another plateau around epoch 55

You can intentionally schedule drops there.

This turns StepLR from "blind schedule" into "structured optimization design."
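
Unevenly spaced milestones like these do not fit a single `step_size`. PyTorch provides `torch.optim.lr_scheduler.MultiStepLR`, which takes a `milestones` list for this purpose. The plain-Python sketch below only illustrates the idea: the function name is hypothetical, the milestones 25 and 55 come from the example above, and the base LR of 0.1 with $\gamma = 0.1$ is assumed.

```python
from bisect import bisect_right


def milestone_lr(epoch, base_lr=0.1, gamma=0.1, milestones=(25, 55)):
    """LR = base_lr * gamma ** (number of milestones already reached)."""
    return base_lr * gamma ** bisect_right(milestones, epoch)


for epoch in (0, 24, 25, 54, 55, 89):
    print(f"epoch {epoch:>2}: lr = {milestone_lr(epoch):.4f}")
```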


### 5️. Strategic Insight

StepLR is strongest when:

- Training dynamics are consistent
- Dataset is large
- Hyperparameters are well-tuned
- You care about reproducibility
- Simplicity matters

It is weakest when:

- Data is small
- Training is unstable
- Optimal decay timing is unknown
- You need adaptive behavior



> StepLR works best when the optimization landscape and training duration are predictable enough that fixed milestone-based decay aligns with the natural learning phases.



# ReduceLROnPlateau - Reactive Learning Rate Control

To be continued...

---
<p style="text-align:center; color:skyblue; font-size:18px;">
© 2026 Mostafizur Rahman
</p>
