# Day 37: CNN Learning Rate Scheduling
StepLR · ReduceLROnPlateau · CosineAnnealing · LR Phase Intuition

Welcome to Day 37!

What You’ll Learn Today:

1. Why fixed learning rates are suboptimal
2. Learning rate decay intuition
3. StepLR
4. ReduceLROnPlateau
5. CosineAnnealing
6. When to use which


---

# Why Fixed Learning Rate Fails

Recall the basic gradient descent update:

$$
\theta_{t+1} = \theta_t - \alpha \nabla L(\theta_t)
$$

Where:
- $\theta_t$ → current parameters  
- $\alpha$ → learning rate  
- $\nabla L$ → gradient (direction of steepest increase)

We subtract the gradient to move **downhill**.
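As a minimal sketch of this update, assume a toy one-dimensional loss $L(\theta) = \theta^2$, whose gradient is $2\theta$. The loss and the helper names `grad` and `gd_step` are illustrative assumptions, not part of any library:

```python
def grad(theta):
    """Gradient of the toy loss L(theta) = theta**2 (an illustrative assumption)."""
    return 2.0 * theta


def gd_step(theta, alpha):
    """One gradient descent update: theta <- theta - alpha * grad(theta)."""
    return theta - alpha * grad(theta)


theta = 3.0   # current parameter
alpha = 0.1   # learning rate
theta_next = gd_step(theta, alpha)
print(theta_next)  # closer to the minimum at 0 than the starting value 3.0
```

Subtracting the gradient moves the parameter from 3.0 toward the minimum at 0.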

## What the Learning Rate Really Controls

The learning rate $\alpha$ determines **step size**.

It answers:

> “How far should we move in the gradient direction?”

## If $\alpha$ Is Large

$$
\theta_{t+1} = \theta_t - \textbf{large step}
$$

### Pros
- Fast movement across loss surface  
- Escapes sharp local minima  
- Explores wider regions  

### Cons
- Can overshoot minima  
- Oscillates around optimum  
- May diverge  

Large LR is good for:
- Early training
- Rough exploration

## If $\alpha$ Is Small

$$
\theta_{t+1} = \theta_t - \textbf{small step}
$$

### Pros
- Stable convergence  
- Precise fine-tuning  
- Less oscillation  

### Cons
- Very slow training  
- Can get stuck in sharp minima  
- Poor exploration  

Small LR is good for:
- Late training
- Fine-grained optimization

## Core Problem

Training has two different phases:

1. Exploration phase
2. Refinement phase

A single fixed $\alpha$ cannot optimize both:

- If large → unstable at end
- If small → painfully slow at start 
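Both failure modes can be seen on a toy problem. The sketch below assumes the loss $L(\theta) = \theta^2$ and a hypothetical helper `run_gd`; the specific learning rates are chosen only to make the behavior visible:

```python
def run_gd(alpha, steps=5, theta0=1.0):
    """Run gradient descent on L(theta) = theta**2 and return the visited values."""
    theta = theta0
    path = [theta]
    for _ in range(steps):
        theta = theta - alpha * 2.0 * theta  # gradient of theta**2 is 2 * theta
        path.append(theta)
    return path


large = run_gd(alpha=0.9)    # each step multiplies theta by (1 - 1.8) = -0.8
small = run_gd(alpha=0.001)  # each step multiplies theta by 0.998

print(large)  # sign flips every step: the update jumps over the minimum at 0
print(small)  # barely moves away from the starting point
```

With the large rate every update overshoots the minimum and lands on the other side; with the small rate the parameter has hardly moved after five steps.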

## Geometric Intuition

Think of descending a mountain:

- At the top → you want big jumps  
- Near the bottom → you want tiny careful steps  

Using one fixed step size is inefficient.

## Strategic Conclusion

> Learning rate must change over time.

That’s why we use:
- Step decay  
- Exponential decay  
- Cosine annealing  
- OneCycle  
- Warmup schedules  

Fixed LR is simple but strategically weak for deep networks.

# Optimization Has Distinct Phases

Training a neural network is **not uniform**.  
The geometry of the loss surface and gradient magnitudes change over time.

## 1. Exploration Phase (Early Training)

- Loss decreases rapidly  
- Gradients are large  
- Parameters are far from optimum  

Update rule:

$$
\theta_{t+1} = \theta_t - \alpha \nabla L(\theta_t)
$$

Since $\|\nabla L\|$ is large:

- Large $\alpha$ → fast movement across surface  
- Helps escape sharp or poor local regions  
- Encourages broader exploration  

Goal here:
> Move quickly toward a promising basin.

## 2. Transition Phase (Approaching Minimum)

- Parameters enter a valley (basin)  
- Gradients shrink but fluctuate  
- Curvature becomes important  

If $\alpha$ remains large:

- Updates overshoot the minimum  
- Oscillations occur across valley walls  

Mathematically:

Large $\alpha$ × small but varying $\nabla L$ → unstable zig-zag behavior  
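To see why a large step overshoots, consider a one-dimensional quadratic model of the valley, $L(\theta) = \tfrac{h}{2}\theta^2$, where $h > 0$ is the curvature. This model and the symbol $h$ are assumptions introduced here for illustration:

$$
\theta_{t+1} = \theta_t - \alpha h \theta_t = (1 - \alpha h)\,\theta_t
$$

Under this model:

- $0 < \alpha < 1/h$ → $\theta_t$ shrinks steadily toward the minimum
- $1/h < \alpha < 2/h$ → $\theta_t$ flips sign every step (zig-zag) but still shrinks
- $\alpha > 2/h$ → $|\theta_t|$ grows and training diverges

So the steeper the valley walls (larger $h$), the smaller $\alpha$ must be to stay stable.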

Goal here:
> Reduce step size to stabilize descent.


## 3. Fine Convergence Phase (Late Training)

- Gradients are small  
- Surface curvature dominates  
- Small adjustments refine solution  

Now we need:

$$
\alpha \text{ very small}
$$

Why?

Because near minimum:

$$
\nabla L \approx 0
$$

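Mini-batch gradients are noisy, so in practice they are rarely exactly zero. A large $\alpha$ amplifies that noise and would: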
- Bounce around minimum  
- Prevent precise convergence  
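The effect of noise can be sketched with a small simulation. It assumes a toy loss $L(\theta) = \theta^2$ and adds unit Gaussian noise to the gradient as a stand-in for mini-batch noise; the helper `final_spread` and all constants are hypothetical choices for illustration:

```python
import random


def final_spread(alpha, steps=2000, seed=0):
    """Average |theta| over the second half of a noisy gradient descent run."""
    rng = random.Random(seed)
    theta = 1.0
    tail = []
    for t in range(steps):
        noisy_grad = 2.0 * theta + rng.gauss(0.0, 1.0)  # true gradient plus noise
        theta -= alpha * noisy_grad
        if t >= steps // 2:
            tail.append(abs(theta))  # distance from the minimum at 0
    return sum(tail) / len(tail)


spread_large = final_spread(alpha=0.3)
spread_small = final_spread(alpha=0.01)
print(spread_large, spread_small)
```

In this sketch the run with the larger learning rate keeps hovering noticeably farther from the minimum than the run with the smaller one.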

Goal here:
> Minimize noise and fine-tune parameters.

## Fixed Learning Rate Fails

A single $\alpha$ cannot satisfy all phases:

| Phase | Ideal LR |
|-------|----------|
| Exploration | Large |
| Transition | Medium |
| Fine Convergence | Small |

If LR is fixed:

- Large LR → never truly settles  
- Small LR → wastes early training potential  

## Core Insight

Optimization is **dynamic**.

The learning rate should:
- Start large  
- Gradually decrease  
- Become very small near convergence  

That’s why we use:

- Step decay  
- Cosine annealing  
- OneCycle policy  
- Warmup + decay  

> Without scheduling, a large fixed LR tends to keep oscillating near the minimum instead of converging smoothly.

# What is Learning Rate Scheduling?

**Learning rate scheduling** is the strategy of changing the learning rate during training instead of keeping it constant.

Instead of using a fixed learning rate:

$$
\theta_{t+1} = \theta_t - \alpha \nabla L
$$

we use a time-dependent learning rate:

$$
\boxed{\theta_{t+1} = \theta_t - \alpha_t \nabla L}
$$

where $\alpha_t$ changes over time.
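In code, the only change from a fixed learning rate is that $\alpha_t$ is looked up at every step. The sketch below assumes a toy loss $L(\theta) = \theta^2$ and a hypothetical exponential schedule `exponential_lr`; neither comes from a library:

```python
def exponential_lr(t, alpha0=0.4, decay=0.9):
    """Hypothetical schedule: alpha_t = alpha0 * decay**t."""
    return alpha0 * decay ** t


theta = 1.0
history = []
for t in range(10):
    alpha_t = exponential_lr(t)            # learning rate for this step
    theta = theta - alpha_t * 2.0 * theta  # gradient of theta**2 is 2 * theta
    history.append((alpha_t, theta))

for alpha_t, value in history:
    print(f"alpha_t={alpha_t:.4f}  theta={value:.6f}")
```

Any other schedule can be swapped in by replacing the function that returns $\alpha_t$.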


## Why Do We Need It?

Training usually happens in different phases:

### 1. Early Phase → Exploration
- Loss decreases rapidly  
- Gradients are large  
- Larger learning rate helps move quickly  

### 2. Middle Phase → Transition
- Approaching a good region  
- Oscillations may start  
- Learning rate should reduce  

### 3. Final Phase → Fine Convergence
- Very small improvements  
- Requires precise adjustments  
- Small learning rate is necessary  

A single fixed learning rate cannot work optimally across all these phases.


## What Scheduling Does

Learning rate scheduling:

- Starts large → enables fast progress  
- Gradually reduces → improves stability  
- Prevents oscillation near minimum  
- Often improves final accuracy  


## Simple Example

### Without Scheduling

LR = 0.01 for all 30 epochs

Result:
- Fast initial progress  
- Later oscillation around the minimum  

### With Scheduling

Epoch 1–10   → 0.01<br>
Epoch 11–20  → 0.001<br>
Epoch 21–30  → 0.0001

Result:
- Early: Large steps  
- Later: Small, precise steps  
- Smoother convergence  
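The three-stage schedule above can be written as a plain lookup. The function name `scheduled_lr` is a hypothetical helper, and epochs are counted from 1 to match the example:

```python
def scheduled_lr(epoch):
    """Learning rate for a 1-indexed epoch under the three-stage schedule."""
    if epoch <= 10:
        return 0.01
    if epoch <= 20:
        return 0.001
    return 0.0001


for epoch in (1, 10, 11, 20, 21, 30):
    print(epoch, scheduled_lr(epoch))
```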

## Common Types of LR Scheduling

- Step Decay  
- Exponential Decay  
- Cosine Annealing  
- Reduce on Plateau  
- Warmup + Decay  
- Cyclical Learning Rate  

> Learning rate scheduling is the technique of dynamically adjusting the learning rate during training to improve convergence speed and stability.

# StepLR - Discrete Learning Rate Decay

StepLR is a learning rate scheduler that reduces the learning rate by a fixed factor (`gamma`) after a fixed number of epochs (`step_size`).

It implements **piecewise-constant decay***.

Instead of smoothly decreasing the learning rate, it drops it in sudden steps.

### StepLR Formula

$$
\boxed{\alpha_t = \alpha_0 \cdot \gamma^{\left\lfloor \frac{t}{\text{step\_size}} \right\rfloor}}
$$

Where:

- $\alpha_t$ → learning rate at epoch $t$
- $\alpha_0$ → initial learning rate
- $\gamma$ → decay factor (e.g., 0.1)
- $\text{step\_size}$ → number of epochs before decay
- $\lfloor \cdot \rfloor$ → floor function (round down)
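The formula translates directly into a few lines of Python. This is a sketch of the closed-form expression only, with a hypothetical helper name `step_lr`; it is not PyTorch's internal implementation. Python's `//` operator performs the floor division:

```python
def step_lr(t, alpha0, gamma, step_size):
    """alpha_t = alpha0 * gamma ** floor(t / step_size), for a 0-indexed epoch t."""
    return alpha0 * gamma ** (t // step_size)


# Print the learning rate at a few epochs for alpha0=0.01, gamma=0.1, step_size=10
for t in (0, 9, 10, 19, 20):
    print(t, step_lr(t, alpha0=0.01, gamma=0.1, step_size=10))
```

The value stays the same for every epoch that shares the same floor, and changes only when `t // step_size` increases.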

---

### <p style="color:orange;text-align:center;">*What is Piecewise-Constant Decay?</p>

**Piecewise-constant decay** is a function that:

- Stays constant for a period of time  
- Then suddenly drops to a new constant value  
- Repeats this pattern  

It does not change smoothly.  
It changes in discrete jumps.


Imagine a staircase:


Level 1  ──────────<br>
↓<br>
Level 2  ──────────<br>
↓<br>
Level 3  ──────────


Each flat region is constant.  
Each drop happens at specific intervals.

That staircase shape = piecewise-constant behavior.

### Mathematical View (StepLR Example)

For StepLR:

$$
\alpha_t = \alpha_0 \cdot \gamma^{\left\lfloor \frac{t}{\text{step\_size}} \right\rfloor}
$$

The learning rate:

- Remains constant within each interval  
- Changes only when $t$ crosses multiples of `step_size`


### Simple Example

Let:

- $\alpha_0 = 0.01$
- $\gamma = 0.1$
- $\text{step\_size} = 5$

Then:

| Epoch | Learning Rate |
|-------|---------------|
| 0–4   | 0.01          |
| 5–9   | 0.001         |
| 10–14 | 0.0001        |

Notice:

- Within each range → LR is constant  
- At epoch 5 and 10 → LR drops suddenly  

That is **piecewise-constant decay**.


### Why It's Called "Piecewise"

- "Piecewise" → defined in separate intervals (pieces)
- "Constant" → value does not change within each interval
- "Decay" → value decreases over time


## One-Line Definition

> Piecewise-constant decay is a step-like schedule where a value remains constant for fixed intervals and then decreases abruptly at predefined points.

---

## What the Floor Function Does

$$
\left\lfloor \frac{t}{10} \right\rfloor
$$

If `step_size = 10`:

| Epoch $t$ | $t/10$ | Floor | Power of γ |
|-----------|--------|-------|------------|
| 0–9   | 0 to <1 | 0     | γ⁰ = 1 |
| 10–19 | 1 to <2 | 1     | γ¹ |
| 20–29 | 2 to <3 | 2     | γ² |

This creates the **staircase pattern**.

## Manual Example

Given:

- $\alpha_0 = 0.01$
- $\gamma = 0.1$
- step_size = 10


### Epoch 1:

$$
\alpha_1 = 0.01 \cdot 0.1^{\lfloor 1/10 \rfloor}
$$

$$
= 0.01 \cdot 0.1^0
$$

$$
= 0.01 \cdot 1
$$

$$
= 0.01
$$


### Epoch 10:

$$
\alpha_{10} = 0.01 \cdot 0.1^{\lfloor 10/10 \rfloor}
$$

$$
= 0.01 \cdot 0.1^1
$$

$$
= 0.001
$$

So the decay takes effect at epoch 10 when epochs are counted from 0, that is, after 10 completed epochs.


### Epoch 15:

$$
\alpha_{15} = 0.01 \cdot 0.1^{\lfloor 15/10 \rfloor}
$$

$$
= 0.01 \cdot 0.1^1
$$

$$
= 0.001
$$

Learning rate remains constant until next boundary.


### Epoch 25:

$$
\alpha_{25} = 0.01 \cdot 0.1^{\lfloor 25/10 \rfloor}
$$

$$
= 0.01 \cdot 0.1^2
$$

$$
= 0.0001
$$


### Final Schedule

| Epoch | LR |
|--------|------|
| 0–9 | 0.01 |
| 10–19 | 0.001 |
| 20–29 | 0.0001 |

Notice the **sudden drops**.

## PyTorch Implementation

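```python
import torch

# Placeholder model and optimizer so the snippet runs on its own
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

scheduler = torch.optim.lr_scheduler.StepLR(
    optimizer,
    step_size=10,  # decay every 10 epochs
    gamma=0.1      # multiply LR by 0.1
)
```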

Once per epoch, after the optimizer updates, advance the schedule:

```python
scheduler.step()
```
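A minimal end-to-end sketch of how the scheduler fits into a training loop is shown below. The random data, the single linear layer, and the one-batch "epoch" are placeholders chosen for illustration; a real CNN pipeline would loop over a `DataLoader` inside each epoch:

```python
import torch

torch.manual_seed(0)

# Toy data and model: placeholders, not a real CNN setup
inputs = torch.randn(32, 10)
targets = torch.randn(32, 1)
model = torch.nn.Linear(10, 1)
loss_fn = torch.nn.MSELoss()

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

lrs = []
for epoch in range(30):
    lrs.append(scheduler.get_last_lr()[0])  # LR in effect during this epoch

    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()   # update the weights first

    scheduler.step()   # then advance the schedule, once per epoch

print(lrs[0], lrs[10], lrs[20])
```

Recording `get_last_lr()` each epoch is a cheap way to confirm that the staircase matches the schedule you intended.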

## Why It Works

Recall optimization phases:

1. Early → large LR helps exploration
2. Later → smaller LR helps convergence

StepLR forces a **manual phase transition**.

## Weakness of StepLR

Abrupt decay can:

* Shock optimizer momentum buffers
* Cause temporary instability
* Reduce LR too early
* Waste exploration capacity

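Momentum-based optimizers are especially sensitive to this shock. In the classical momentum formulation, the velocity accumulates past gradients that were already scaled by the learning rate:

$$
v_{t+1} = \beta v_t + \alpha \nabla L(\theta_t)
$$

After a sudden drop in $\alpha$, the buffer $v_t$ still holds contributions scaled by the old, larger learning rate, so the effective step shrinks only gradually rather than all at once.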
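The lag in the velocity can be sketched numerically. The example assumes the classical momentum update written above, a constant gradient of 1, and a hypothetical helper `run_momentum`; it illustrates the formula, not any specific optimizer implementation:

```python
def run_momentum(alphas, beta=0.9, grad=1.0):
    """Apply v <- beta * v + alpha * grad for each alpha; return all velocities."""
    v = 0.0
    velocities = []
    for alpha in alphas:
        v = beta * v + alpha * grad
        velocities.append(v)
    return velocities


# 100 steps at alpha = 0.1, then an abrupt drop to alpha = 0.01 for 100 steps
alphas = [0.1] * 100 + [0.01] * 100
v = run_momentum(alphas)

before_drop = v[99]   # close to the steady state 0.1 / (1 - 0.9) = 1.0
just_after = v[100]   # 0.9 * before_drop + 0.01: still dominated by the old LR
settled = v[-1]       # close to the new steady state 0.01 / (1 - 0.9) = 0.1

print(before_drop, just_after, settled)
```

Right after the drop the velocity is still roughly nine times larger than the value it eventually settles to, which is the mismatch the optimizer has to absorb.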


## When StepLR Is Still Effective

* Large datasets (ImageNet-style training)
* Fixed training length pipelines
* Classic CNN training (ResNet schedules)
* When decay milestones are known in advance

## Intuition

Think of StepLR as:

> "Train normally for some time → then suddenly reduce step size → repeat."

It allows:

- Fast learning early  
- More careful learning later  