# 🌌 Level 11: OneCycleLR and Cosine Annealing (Warm Restarts)

> **Objective:**  
> To understand and visualize advanced learning rate control techniques:  
> **OneCycleLR** and **Cosine Annealing with Warm Restarts**,  
> which vary the learning rate (and, for OneCycleLR, the momentum) over the course of training  
> with the aim of faster convergence and better generalization.


In [None]:
import numpy as np
import matplotlib.pyplot as plt


In [None]:
def one_cycle_lr(t, total_steps=100, max_lr=0.1, min_lr=0.01, pct_up=0.3):
    """Implements Leslie Smith's OneCycle policy."""
    up_steps = int(total_steps * pct_up)
    down_steps = total_steps - up_steps
    
    if t < up_steps:
        return min_lr + (max_lr - min_lr) * (t / up_steps)
    else:
        return max_lr - (max_lr - min_lr) * ((t - up_steps) / down_steps)


def cosine_annealing_lr(t, T_max=50, max_lr=0.1, min_lr=0.01):
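    """Cosine annealing learning rate without restarts."""
    # Clamp t so the rate stays at min_lr after T_max instead of rising again.
    t = np.minimum(t, T_max)
    return min_lr + (max_lr - min_lr) * (1 + np.cos(np.pi * t / T_max)) / 2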


def cosine_annealing_warm_restarts(t, T_0=30, T_mult=2, max_lr=0.1, min_lr=0.01):
    """Cosine annealing with warm restarts (SGDR)."""
    Ti = T_0
    cycle = 0
    while t >= Ti:
        t -= Ti
        Ti *= T_mult
        cycle += 1
    return min_lr + (max_lr - min_lr) * (1 + np.cos(np.pi * t / Ti)) / 2


In [None]:
steps = np.arange(0, 150)

lr_onecycle = [one_cycle_lr(t, total_steps=150, max_lr=0.08, min_lr=0.005, pct_up=0.3) for t in steps]
lr_cosine = [cosine_annealing_lr(t, T_max=100, max_lr=0.08, min_lr=0.005) for t in steps]
lr_warm = [cosine_annealing_warm_restarts(t, T_0=30, T_mult=2, max_lr=0.08, min_lr=0.005) for t in steps]

plt.figure(figsize=(8, 5))
plt.plot(steps, lr_onecycle, label="OneCycleLR", lw=2)
plt.plot(steps, lr_cosine, label="Cosine Annealing", lw=2)
plt.plot(steps, lr_warm, label="Cosine Warm Restarts", lw=2)

plt.title("Advanced Learning Rate Schedules", fontsize=13)
plt.xlabel("Iteration")
plt.ylabel("Learning Rate (η)")
plt.legend()
plt.grid(True, linestyle="--", alpha=0.6)
plt.show()


In [None]:
def one_cycle_momentum(t, total_steps=100, max_m=0.95, min_m=0.85, pct_up=0.3):
    """Momentum schedule used in OneCycleLR: moves inversely to the LR."""
    up_steps = int(total_steps * pct_up)
    down_steps = total_steps - up_steps
    if t < up_steps:
        return max_m - (max_m - min_m) * (t / up_steps)
    else:
        return min_m + (max_m - min_m) * ((t - up_steps) / down_steps)

momentum_schedule = [one_cycle_momentum(t, total_steps=150) for t in steps]

plt.figure(figsize=(8, 4))
plt.plot(steps, lr_onecycle, label="Learning Rate", color="blue")
plt.plot(steps, momentum_schedule, label="Momentum", color="red")
plt.title("OneCycle Policy: Coupled LR–Momentum Schedule", fontsize=13)
plt.xlabel("Iteration")
plt.ylabel("Value")
plt.legend()
plt.grid(True, linestyle="--", alpha=0.5)
plt.show()


## 🧠 Mathematical Insight

### 🔹 OneCycleLR (Leslie Smith, 2018)
The learning rate rises linearly for $T_{up}$ steps, then falls linearly for the remaining $T_{down}$ steps (the two phases need not be equal in length):
$$
\eta_t =
\begin{cases}
\eta_{min} + (\eta_{max} - \eta_{min}) \cdot \frac{t}{T_{up}} & \text{if } t < T_{up} \\
\eta_{max} - (\eta_{max} - \eta_{min}) \cdot \frac{t - T_{up}}{T_{down}} & \text{otherwise}
\end{cases}
$$

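Momentum is varied *inversely*, dropping from $m_{max}$ to $m_{min}$ as the learning rate rises and climbing back as it falls:
$$
m_t = m_{max} - (m_{max} - m_{min}) \cdot \frac{\eta_t - \eta_{min}}{\eta_{max} - \eta_{min}}
$$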

→ High LR + low momentum = exploration  
→ Low LR + high momentum = fine-tuning  
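
The coupling can be checked numerically. The block below is a minimal sketch, assuming the linear variant of the schedule and the same illustrative values used in the plots (`max_lr=0.08`, `min_lr=0.005`, momentum between `0.85` and `0.95`); the helper name `one_cycle` is hypothetical and not part of any library.

```python
def one_cycle(t, total_steps=150, max_lr=0.08, min_lr=0.005,
              max_m=0.95, min_m=0.85, pct_up=0.3):
    """Return (lr, momentum) at step t for a linear one-cycle schedule."""
    up_steps = int(total_steps * pct_up)
    down_steps = total_steps - up_steps
    if t < up_steps:
        frac = t / up_steps                       # goes 0 -> 1 while the LR rises
    else:
        frac = 1.0 - (t - up_steps) / down_steps  # goes 1 -> 0 while the LR falls
    lr = min_lr + (max_lr - min_lr) * frac
    momentum = max_m - (max_m - min_m) * frac     # mirror image of the LR
    return lr, momentum


# Print the start, the LR peak, and the final step.
peak = max(range(150), key=lambda t: one_cycle(t)[0])
for t in (0, peak, 149):
    lr, m = one_cycle(t)
    print(f"t={t:3d}  lr={lr:.4f}  momentum={m:.4f}")
```

At the step where the learning rate peaks, the momentum sits at its minimum, and vice versa.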

---

### 🔹 Cosine Annealing
The learning rate follows half a cosine period, from $\eta_{max}$ at $t = 0$ down to $\eta_{min}$ at $t = T$:
$$
\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})(1 + \cos(\pi t / T))
$$

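This smooth “wave” has zero slope at both ends, so the learning rate changes slowly near $\eta_{max}$ and $\eta_{min}$ and fastest around the midpoint, with none of the abrupt drops of a step schedule.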
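
To make the shape concrete, here is a small sketch that evaluates the formula at the endpoints and the midpoint and compares the per-step change at each. The values `T=100`, `max_lr=0.08`, and `min_lr=0.005` are illustrative assumptions, and `cosine_lr` is a hypothetical helper name.

```python
import math


def cosine_lr(t, T=100, max_lr=0.08, min_lr=0.005):
    """Cosine-annealed learning rate, valid for 0 <= t <= T."""
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t / T))


# Start, midpoint, and end of the schedule.
for t in (0, 50, 100):
    print(f"t={t:3d}  lr={cosine_lr(t):.4f}")

# The change from one step to the next is small near the ends
# and largest around the midpoint.
for t in (0, 50, 99):
    delta = cosine_lr(t + 1) - cosine_lr(t)
    print(f"t={t:3d}  change to next step={delta:+.6f}")
```

The midpoint value is the average of `max_lr` and `min_lr`, since the cosine term is zero there.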

---

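### 🔹 Warm Restarts (SGDR)
With warm restarts, the cosine schedule jumps back to $\eta_{max}$ at the end of each cycle:
$$
\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})(1 + \cos(\pi T_{cur} / T_i))
$$
where $T_{cur}$ is the number of steps since the last restart and $T_i = T_0 \times T_{mult}^{i}$ is the length of cycle $i$ ($i = 0, 1, 2, \dots$).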

Each restart injects *energy* into the system:  
the jump back to a high learning rate can help the optimizer leave sharp local minima, while the decay that follows lets it settle again.
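
The bookkeeping behind the restarts is easy to get wrong, so here is a minimal sketch of it. It assumes `T_0=30` and `T_mult=2` (cycles of length 30, 60, 120, ...); `locate_cycle` and `sgdr_lr` are hypothetical helper names introduced only for illustration.

```python
import math


def locate_cycle(t, T_0=30, T_mult=2):
    """Return (cycle index, steps since last restart, current cycle length)."""
    T_i, i = T_0, 0
    while t >= T_i:
        t -= T_i        # drop the steps consumed by the finished cycle
        T_i *= T_mult   # the next cycle is T_mult times longer
        i += 1
    return i, t, T_i


def sgdr_lr(t, T_0=30, T_mult=2, max_lr=0.08, min_lr=0.005):
    """Cosine annealing with warm restarts, evaluated at global step t."""
    _, T_cur, T_i = locate_cycle(t, T_0, T_mult)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * T_cur / T_i))


# Inspect the steps just before and at each restart.
for t in (0, 29, 30, 89, 90):
    i, T_cur, T_i = locate_cycle(t)
    print(f"t={t:3d}  cycle={i}  T_cur={T_cur:3d}  T_i={T_i:3d}  lr={sgdr_lr(t):.4f}")
```

With these settings the restarts fall at steps 30 and 90, where the learning rate returns to `max_lr`.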


## 🧩 Comparison of Advanced LR Strategies

| Schedule | Curve Shape | Key Benefit | Typical Use |
|-----------|--------------|--------------|--------------|
| **OneCycleLR** | Triangular up–down | Often fast convergence; high-LR phase encourages exploration | CNNs, Transformers |
| **Cosine Annealing** | Smooth decay | Stable training, no abrupt LR drops | Medium to long runs |
| **Cosine Annealing (Warm Restarts)** | Recurrent waves | Periodic exploration; may improve generalization | Long training regimes |


## 🧭 Takeaway

- **OneCycleLR** dynamically balances exploration and refinement within one training cycle.  
- **Cosine Annealing** ensures smooth, low-noise convergence.  
- **Warm Restarts** reintroduce exploration bursts in long runs.  

Together, they form the *modern foundation of learning rate control*:  
schedules of this kind appear in many training recipes for models such as **ResNet, BERT, GPT, and ViT**.

> “An optimizer is not just about gradient steps; it’s about energy choreography.”
