# Week 3: Gradient-Based Optimization for Deep Learning
**IME775: Data Driven Modeling and Optimization**
📖 **Reference**: Krishnendu Chaudhury. *Math and Architectures of Deep Learning*, Chapter 4
---
## Learning Objectives
- Understand gradient descent and its variants
- Master momentum-based acceleration
- Learn adaptive learning rate methods (Adam)
- Connect optimization to neural network training


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D

## 3.1 Gradient Descent Visualization
Basic update: $\theta_{t+1} = \theta_t - \alpha \nabla L(\theta_t)$, where $\alpha$ is the learning rate.
The gradient points in the direction of steepest ascent, so each step moves the opposite way.
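
The update rule can be sketched in a few lines of NumPy. This is a minimal illustration, not the notebook's own implementation: the helper names `rosenbrock_grad` and `gradient_descent`, the starting point `(-1, 1)`, and the step count are assumptions chosen for the example.

```python
import numpy as np

def rosenbrock(x, y, a=1, b=100):
    """Rosenbrock function; global minimum of 0 at (a, a**2)."""
    return (a - x) ** 2 + b * (y - x ** 2) ** 2

def rosenbrock_grad(x, y, a=1, b=100):
    """Analytic gradient of the Rosenbrock function (hypothetical helper)."""
    dx = -2 * (a - x) - 4 * b * x * (y - x ** 2)
    dy = 2 * b * (y - x ** 2)
    return np.array([dx, dy])

def gradient_descent(grad_f, x0, lr=0.001, n_iters=1000):
    """Plain gradient descent: theta <- theta - lr * grad."""
    x = np.asarray(x0, dtype=float).copy()
    path = [x.copy()]
    for _ in range(n_iters):
        x = x - lr * grad_f(x[0], x[1])  # step against the gradient
        path.append(x.copy())
    return np.array(path)

# Assumed starting point; any point away from the minimum works.
path = gradient_descent(rosenbrock_grad, x0=[-1.0, 1.0])
losses = [rosenbrock(p[0], p[1]) for p in path]
print(f"start loss: {losses[0]:.4f}, final loss: {losses[-1]:.4f}")
```

With a small learning rate the iterates drop quickly into the curved valley and then creep along it, which is the slow behaviour that momentum and Adam are meant to improve.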


In [None]:
# Visualize gradient descent on a 2D function
def rosenbrock(x, y, a=1, b=100):
    # Standard Rosenbrock function; global minimum at (a, a**2)
    return (a - x)**2 + b * (y - x**2)**2

## 3.2 Momentum: Accelerating Convergence
Momentum accumulates velocity in consistent gradient directions:
$$v_{t+1} = \beta v_t + \nabla L(\theta_t)$$
$$\theta_{t+1} = \theta_t - \alpha v_{t+1}$$
Like a ball rolling down a hill: velocity builds up along directions where successive gradients agree, and cancels out where they oscillate.
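
To see how much acceleration the velocity term gives, consider the idealized case of a constant gradient $g$. The recursion $v_{t+1} = \beta v_t + g$ then approaches $g / (1 - \beta)$, so with $\beta = 0.9$ the effective step is roughly ten times larger than plain gradient descent. The sketch below checks this numerically; `velocity_after` is a hypothetical helper and the constant gradient is an assumption made for illustration.

```python
def velocity_after(n_steps, grad=1.0, beta=0.9):
    """Run v <- beta * v + grad for n_steps with a constant gradient."""
    v = 0.0
    for _ in range(n_steps):
        v = beta * v + grad
    return v

beta = 0.9
limit = 1.0 / (1.0 - beta)  # geometric-series limit for grad = 1
for n in (1, 10, 50, 200):
    print(f"steps={n:3d}  velocity={velocity_after(n, beta=beta):.4f}  limit={limit:.4f}")
```

This is also why momentum is usually paired with a smaller learning rate than plain gradient descent on the same problem.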


In [None]:
def gd_momentum(grad_f, x0, lr=0.001, momentum=0.9, n_iters=100):
    path = [x0.copy()]
    x = x0.copy()
    v = np.zeros_like(x)
    for _ in range(n_iters):
        g = grad_f(x[0], x[1])
        v = momentum * v + g
        x = x - lr * v
        path.append(x.copy())
    return np.array(path)

## 3.3 Adam: Adaptive Moment Estimation
Combines momentum with adaptive learning rates:
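- **First moment** (momentum): $m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$
- **Second moment** (RMSprop): $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$
- **Bias correction**: $\hat{m}_t = m_t / (1-\beta_1^t)$, $\hat{v}_t = v_t / (1-\beta_2^t)$
- **Update**: $\theta_t = \theta_{t-1} - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$

Here $g_t = \nabla L(\theta_{t-1})$, and the square and square root are applied elementwise.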
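
One consequence of these formulas is worth checking by hand: on the very first step the bias-corrected moments reduce to $\hat{m}_1 = g_1$ and $\hat{v}_1 = g_1^2$, so the update has magnitude close to $\alpha$ regardless of how large or small the gradient is. The sketch below is an illustration under that reading of the formulas; `adam_first_step` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def adam_first_step(grad, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the amount Adam subtracts from theta at t = 1 (m_0 = v_0 = 0)."""
    grad = np.asarray(grad, dtype=float)
    m = (1 - beta1) * grad           # first moment after one step
    v = (1 - beta2) * grad ** 2      # second moment after one step
    m_hat = m / (1 - beta1)          # bias correction with t = 1
    v_hat = v / (1 - beta2)
    return lr * m_hat / (np.sqrt(v_hat) + eps)

# Gradients spanning six orders of magnitude (assumed values).
grads = np.array([1e-3, 1.0, -1e3])
steps = adam_first_step(grads)
for g, s in zip(grads, steps):
    print(f"grad={g:10.3e}  step={s:+.6f}")
```

Each coordinate moves by about `lr` in the direction opposite its gradient's sign, which is what makes Adam far less sensitive to gradient scale than plain gradient descent or momentum.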


In [None]:
class AdamOptimizer:
    def __init__(self, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.m = None
        self.v = None
        self.t = 0
    def step(self, x, grad):
        if self.m is None:
            self.m = np.zeros_like(x)
            self.v = np.zeros_like(x)
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad**2
        m_hat = self.m / (1 - self.beta1**self.t)
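        v_hat = self.v / (1 - self.beta2**self.t)
        # Apply the bias-corrected update and return the new parameters
        return x - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)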

In [None]:
# Loss curves comparison
fig4, ax4 = plt.subplots(figsize=(10, 5))
for name, path in paths_compare.items():
    losses = [rosenbrock(p[0], p[1]) for p in path]
    ax4.semilogy(losses, label=name, linewidth=2)
ax4.set_xlabel('Iteration')
ax4.set_ylabel('Loss (log scale)')
ax4.set_title('Convergence Comparison on Rosenbrock Function')
ax4.legend()
ax4.grid(True, alpha=0.3)
fig4

## 3.4 Learning Rate Schedules
| Schedule | Formula | Use Case |
|----------|---------|----------|
| Step Decay | $\alpha_t = \alpha_0 \cdot \gamma^{\lfloor t/s \rfloor}$ | Classic CNNs |
| Exponential | $\alpha_t = \alpha_0 \cdot e^{-\lambda t}$ | Smooth decay |
| Cosine | $\alpha_t = \alpha_{min} + \frac{1}{2}(\alpha_{max} - \alpha_{min})(1 + \cos(\frac{t}{T}\pi))$ | Transformers |
| Warmup | Linear increase then decay | Large batch training |


In [None]:
# Visualize learning rate schedules
fig5, axes5 = plt.subplots(2, 2, figsize=(14, 8))
epochs = np.arange(100)
alpha_0 = 0.1
# Step decay
step_lr = alpha_0 * (0.1 ** (epochs // 30))
axes5[0, 0].plot(epochs, step_lr, 'b-', linewidth=2)
axes5[0, 0].set_title('Step Decay (γ=0.1 every 30 epochs)')
axes5[0, 0].set_xlabel('Epoch')
axes5[0, 0].set_ylabel('Learning Rate')
axes5[0, 0].grid(True, alpha=0.3)
# Exponential decay
exp_lr = alpha_0 * np.exp(-0.03 * epochs)
axes5[0, 1].plot(epochs, exp_lr, 'g-', linewidth=2)
axes5[0, 1].set_title('Exponential Decay (λ=0.03)')
axes5[0, 1].set_xlabel('Epoch')
axes5[0, 1].set_ylabel('Learning Rate')
axes5[0, 1].grid(True, alpha=0.3)
# Cosine annealing
alpha_min = 0.001
cosine_lr = alpha_min + 0.5 * (alpha_0 - alpha_min) * (1 + np.cos(np.pi * epochs / 100))
axes5[1, 0].plot(epochs, cosine_lr, 'r-', linewidth=2)
axes5[1, 0].set_title('Cosine Annealing')
axes5[1, 0].set_xlabel('Epoch')
axes5[1, 0].set_ylabel('Learning Rate')
axes5[1, 0].grid(True, alpha=0.3)
# Warmup + Cosine
warmup_epochs = 10
warmup_lr = np.where(epochs < warmup_epochs,
                     alpha_0 * epochs / warmup_epochs,
                     alpha_min + 0.5 * (alpha_0 - alpha_min) * (1 + np.cos(np.pi * (epochs - warmup_epochs) / (100 - warmup_epochs))))
axes5[1, 1].plot(epochs, warmup_lr, 'm-', linewidth=2)
axes5[1, 1].axvline(warmup_epochs, color='gray', linestyle='--', alpha=0.5, label='End warmup')
axes5[1, 1].set_title('Warmup + Cosine Annealing')
axes5[1, 1].set_xlabel('Epoch')
axes5[1, 1].set_ylabel('Learning Rate')
axes5[1, 1].legend()
axes5[1, 1].grid(True, alpha=0.3)
plt.tight_layout()
fig5

## 3.5 Weight Initialization
| Method | Formula | Best For |
|--------|---------|----------|
| Xavier | $W \sim \mathcal{U}(-\sqrt{6/(n_{in}+n_{out})}, \sqrt{6/(n_{in}+n_{out})})$ | Sigmoid, Tanh |
| He | $W \sim \mathcal{N}(0, \sigma^2)$ with $\sigma = \sqrt{2/n_{in}}$ | ReLU |
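
The two rules in the table can be sketched as NumPy samplers. This is a minimal illustration: the function names `xavier_uniform` and `he_normal`, the layer sizes, and the seed are assumptions, and the weight shape `(n_in, n_out)` is chosen to match a forward pass of the form `x @ W`.

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng):
    """Sample W ~ U(-limit, limit) with limit = sqrt(6 / (n_in + n_out))."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def he_normal(n_in, n_out, rng):
    """Sample W ~ N(0, sigma^2) with sigma = sqrt(2 / n_in)."""
    std = np.sqrt(2.0 / n_in)
    return rng.normal(0.0, std, size=(n_in, n_out))

rng = np.random.default_rng(0)   # assumed seed, for reproducibility only
n_in, n_out = 512, 256           # assumed layer sizes
W_x = xavier_uniform(n_in, n_out, rng)
W_h = he_normal(n_in, n_out, rng)

# A uniform on (-l, l) has variance l**2 / 3, i.e. 2 / (n_in + n_out) here.
print(f"Xavier std: {W_x.std():.4f}  (target {np.sqrt(2.0 / (n_in + n_out)):.4f})")
print(f"He std:     {W_h.std():.4f}  (target {np.sqrt(2.0 / n_in):.4f})")
```

Both rules scale the weight variance inversely with layer width so that activation magnitudes stay roughly constant from layer to layer instead of vanishing or exploding.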


In [None]:
# Demonstrate importance of initialization
def forward_pass(W_list, x, activation='relu'):
    activations = [x]
    for W in W_list:
        x = x @ W
        if activation == 'relu':
            x = np.maximum(0, x)
        elif activation == 'tanh':
            x = np.tanh(x)
        activations.append(x)
    return activations

## Summary
| Optimizer | Key Feature | When to Use |
|-----------|-------------|-------------|
| **SGD** | Simple, generalizes well | Final training |
| **Momentum** | Accelerates convergence | Standard choice |
| **Adam** | Adaptive + momentum | Quick prototyping |

| Schedule | Key Feature | When to Use |
|----------|-------------|-------------|
| **Step Decay** | Simple, predictable | CNNs |
| **Cosine** | Smooth, few hyperparameters ($T$, $\alpha_{min}$) | Modern networks |
| **Warmup** | Stabilizes early training | Large models |
---
## References
- **Primary**: Krishnendu Chaudhury. *Math and Architectures of Deep Learning*, Chapter 4.
- **Supplementary**: Ruder, S. "An overview of gradient descent optimization algorithms."
## Connection to ML Refined Curriculum
These optimization techniques are used throughout:
- Weeks 2-3: Foundation for all optimization
- Weeks 4-13: Training any supervised learning model
