In [None]:
import marimo as mo

# Week 3: Gradient-Based Optimization for Deep Learning

**IME775: Data Driven Modeling and Optimization**

📖 **Reference**: Krishnendu Chaudhury. *Math and Architectures of Deep Learning*, Chapter 4

---

## Learning Objectives

- Understand gradient descent and its variants
- Master momentum-based acceleration
- Learn adaptive learning rate methods (Adam)
- Connect optimization to neural network training

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D

## 3.1 Gradient Descent Visualization

Basic update: $\theta_{t+1} = \theta_t - \alpha \nabla L(\theta_t)$

The gradient $\nabla L(\theta_t)$ points toward steepest ascent, so each step moves the opposite way, scaled by the learning rate $\alpha$.
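To make the update rule concrete, here is a minimal sketch of plain gradient descent. The function names `gradient_descent` and `quad_grad`, and the quadratic test loss, are illustrative assumptions and not part of the course material.

```python
import numpy as np

def gradient_descent(grad_f, theta0, lr=0.05, n_iters=200):
    """Plain gradient descent: theta <- theta - lr * grad_f(theta)."""
    theta = np.asarray(theta0, dtype=float).copy()
    path = [theta.copy()]
    for _ in range(n_iters):
        theta = theta - lr * grad_f(theta)  # step against the gradient
        path.append(theta.copy())
    return np.array(path)

# Illustrative loss (an assumption): L(theta) = 0.5 * (theta_0^2 + 10 * theta_1^2),
# whose minimum is at the origin.
def quad_grad(theta):
    return np.array([theta[0], 10.0 * theta[1]])

gd_path = gradient_descent(quad_grad, [2.0, 1.0], lr=0.05, n_iters=200)
print("final iterate:", gd_path[-1])
```

On this loss the steep coordinate shrinks quickly while the shallow one shrinks by only 5% per step, which is the slow-direction behavior that momentum is meant to address.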

## 3.2 Momentum: Accelerating Convergence

Momentum accumulates velocity in consistent gradient directions:

$$v_{t+1} = \beta v_t + \nabla L(\theta_t)$$

$$\theta_{t+1} = \theta_t - \alpha v_{t+1}$$

Like a ball rolling down a hill, the iterate speeds up along directions where successive gradients agree, while components that flip sign from step to step partly cancel.
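The two momentum equations can be sketched as follows. The helper names `momentum_descent` and `quad_grad` and the quadratic test loss are assumptions chosen for illustration; setting `beta=0` recovers plain gradient descent, which gives a like-for-like comparison.

```python
import numpy as np

def momentum_descent(grad_f, theta0, lr=0.01, beta=0.9, n_iters=100):
    """Momentum update: v <- beta * v + grad;  theta <- theta - lr * v."""
    theta = np.asarray(theta0, dtype=float).copy()
    v = np.zeros_like(theta)
    path = [theta.copy()]
    for _ in range(n_iters):
        v = beta * v + grad_f(theta)   # accumulate velocity
        theta = theta - lr * v         # move along the accumulated direction
        path.append(theta.copy())
    return np.array(path)

# Illustrative ill-conditioned quadratic (an assumption), minimum at the origin.
def quad_grad(theta):
    return np.array([theta[0], 10.0 * theta[1]])

start = [2.0, 1.0]
plain_path = momentum_descent(quad_grad, start, lr=0.01, beta=0.0, n_iters=100)
mom_path = momentum_descent(quad_grad, start, lr=0.01, beta=0.9, n_iters=100)

print("plain GD distance to optimum:", np.linalg.norm(plain_path[-1]))
print("momentum distance to optimum:", np.linalg.norm(mom_path[-1]))
```

With the same learning rate and iteration budget, the momentum run should end much closer to the optimum, because the velocity term raises the effective step size along the shallow direction.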

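## 3.3 Adam: Adaptive Moment Estimation

Adam combines momentum with per-parameter adaptive learning rates. With $g_t = \nabla L(\theta_{t-1})$ and $m_0 = v_0 = 0$:

**First moment** (momentum): $m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$

**Second moment** (RMSprop): $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$

**Bias correction**: $\hat{m}_t = \frac{m_t}{1-\beta_1^t}$, $\hat{v}_t = \frac{v_t}{1-\beta_2^t}$

**Update**: $\theta_t = \theta_{t-1} - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$

Because both moments start at zero, the raw averages are biased toward zero in early steps; the correction removes that bias.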
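A single Adam step worked numerically shows what the adaptive scaling does. This is a sketch under assumptions: the function name `adam_step` and the example gradient values are made up for illustration, and the default hyperparameters are the commonly used Adam settings.

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count. Returns (theta, m, v)."""
    m = beta1 * m + (1 - beta1) * g        # first moment
    v = beta2 * v + (1 - beta2) * g**2     # second moment
    m_hat = m / (1 - beta1**t)             # bias-corrected first moment
    v_hat = v / (1 - beta2**t)             # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0, 1.0])
g = np.array([100.0, -0.01])  # two gradients of very different magnitude
theta_new, m, v = adam_step(theta, g, np.zeros(2), np.zeros(2), t=1)

first_step = theta_new - theta
print("first step:", first_step)
```

At $t = 1$ the corrected moments reduce to $\hat{m}_1 = g_1$ and $\hat{v}_1 = g_1^2$, so each coordinate moves by roughly $\alpha$ in magnitude regardless of how large its gradient is. Plain gradient descent with the same $\alpha$ would move the two coordinates by amounts differing by a factor of $10^4$.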

In [None]:
# Standard Rosenbrock function and gradient (assumed form), minimum at (1, 1)
def rosenbrock(x, y):
    return (1 - x)**2 + 100 * (y - x**2)**2

def rosenbrock_grad(x, y):
    return np.array([-2 * (1 - x) - 400 * x * (y - x**2), 200 * (y - x**2)])

class AdamOptimizer:
    """Adam with bias-corrected moment estimates."""

    # Defaults are the commonly used Adam settings (assumed)
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.m = None
        self.v = None
        self.t = 0

    def step(self, x, grad):
        # Lazily initialize both moments to zero on the first call
        if self.m is None:
            self.m = np.zeros_like(x)
            self.v = np.zeros_like(x)
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad**2
        m_hat = self.m / (1 - self.beta1**self.t)
        v_hat = self.v / (1 - self.beta2**self.t)
        return x - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

def run_adam(grad_f, x0, lr=0.001, n_iters=100):
    adam = AdamOptimizer(lr=lr)
    path = [x0.copy()]
    x = x0.copy()
    for _ in range(n_iters):
        g = grad_f(x[0], x[1])
        x = adam.step(x, g)
        path.append(x.copy())
    return np.array(path)

def gd_simple2(grad_f, x0, lr=0.001, n_iters=100):
    path = [x0.copy()]
    x = x0.copy()
    for _ in range(n_iters):
        g = grad_f(x[0], x[1])
        x = x - lr * g
        path.append(x.copy())
    return np.array(path)

def gd_momentum2(grad_f, x0, lr=0.001, momentum=0.9, n_iters=100):
    path = [x0.copy()]
    x = x0.copy()
    v = np.zeros_like(x)
    for _ in range(n_iters):
        g = grad_f(x[0], x[1])
        v = momentum * v + g
        x = x - lr * v
        path.append(x.copy())
    return np.array(path)

# Compare optimizers on Rosenbrock
fig3, axes3 = plt.subplots(1, 3, figsize=(16, 5))
x_r = np.linspace(-2, 2, 100)
y_r = np.linspace(-1, 3, 100)
X_r, Y_r = np.meshgrid(x_r, y_r)
Z_r = rosenbrock(X_r, Y_r)
start_r = np.array([-1.0, 2.0])
n_iters_r = 300

# Run each optimizer
paths_compare = {
    'SGD': gd_simple2(rosenbrock_grad, start_r, lr=0.001, n_iters=n_iters_r),
    'Momentum': gd_momentum2(rosenbrock_grad, start_r, lr=0.001, momentum=0.9, n_iters=n_iters_r),
    'Adam': run_adam(rosenbrock_grad, start_r, lr=0.05, n_iters=n_iters_r)
}
colors_opt = ['blue', 'green', 'red']

for ax, (name, path_opt), color in zip(axes3, paths_compare.items(), colors_opt):
    ax.contour(X_r, Y_r, Z_r, levels=np.logspace(-1, 3, 15), cmap='viridis')
    # Pass the color by keyword: full color names are not valid inside a format string
    ax.plot(path_opt[:, 0], path_opt[:, 1], '.-', color=color, markersize=2,
            linewidth=0.5, alpha=0.8)
    # Mark the global minimum at (1, 1) with a star
    ax.scatter(1, 1, color='gold', s=150, marker='*', zorder=5)
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    final_loss = rosenbrock(path_opt[-1, 0], path_opt[-1, 1])
    ax.set_title(f'{name}\nFinal loss: {final_loss:.4f}')

plt.tight_layout()
fig3

In [None]:
# Loss curves comparison
fig4, ax4 = plt.subplots(figsize=(10, 5))
for name, path in paths_compare.items():
    losses = [rosenbrock(p[0], p[1]) for p in path]
    ax4.semilogy(losses, label=name, linewidth=2)
ax4.set_xlabel('Iteration')
ax4.set_ylabel('Loss (log scale)')
ax4.set_title('Convergence Comparison on Rosenbrock Function')
ax4.legend()
ax4.grid(True, alpha=0.3)
fig4

## 3.4 Learning Rate Schedules

| Schedule | Formula | Use Case |
|----------|---------|----------|
| Step Decay | $\alpha_t = \alpha_0 \cdot \gamma^{\lfloor t/s \rfloor}$ | Classic CNNs |
| Exponential | $\alpha_t = \alpha_0 \cdot e^{-\lambda t}$ | Smooth decay |
| Cosine | $\alpha_t = \alpha_{min} + \frac{1}{2}(\alpha_{max} - \alpha_{min})(1 + \cos(\frac{t}{T}\pi))$ | Transformers |
| Warmup | Linear increase then decay | Large batch training |
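The schedule formulas in the table translate directly into small functions of the step index `t`. This is a minimal sketch: the function names and default values are illustrative assumptions. The table does not say which decay follows the warmup, so cosine decay is used here as one possible choice.

```python
import math

def step_decay(t, lr0=0.1, gamma=0.5, step_size=10):
    """alpha_t = alpha_0 * gamma^floor(t / s)."""
    return lr0 * gamma ** (t // step_size)

def exponential_decay(t, lr0=0.1, lam=0.05):
    """alpha_t = alpha_0 * exp(-lambda * t)."""
    return lr0 * math.exp(-lam * t)

def cosine_schedule(t, T, lr_max=0.1, lr_min=0.0):
    """alpha_t = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * t / T))."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / T))

def warmup_then_cosine(t, T, warmup_steps=5, lr_max=0.1, lr_min=0.0):
    """Linear warmup to lr_max, then cosine decay (the decay choice is an assumption)."""
    if t < warmup_steps:
        return lr_max * (t + 1) / warmup_steps
    return cosine_schedule(t - warmup_steps, T - warmup_steps, lr_max, lr_min)

for t in (0, 5, 10, 50, 100):
    print(t, step_decay(t), exponential_decay(t),
          cosine_schedule(t, T=100), warmup_then_cosine(t, T=100))
```

Each function returns the learning rate to use at step `t`; in a training loop it would be evaluated once per iteration and passed to the optimizer update.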

## 3.5 Weight Initialization

| Method | Formula | Best For |
|--------|---------|----------|
| Xavier | $W \sim \mathcal{U}\left(-\sqrt{6/(n_{in}+n_{out})},\ \sqrt{6/(n_{in}+n_{out})}\right)$ | Sigmoid, Tanh |
| He | $W \sim \mathcal{N}(0, \sigma^2)$ with $\sigma = \sqrt{2/n_{in}}$ | ReLU |
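Both initializers in the table can be sketched with NumPy's random generator. The function names `xavier_uniform` and `he_normal`, the layer sizes, and the seed are illustrative assumptions.

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng):
    """Xavier: uniform on [-limit, limit] with limit = sqrt(6 / (n_in + n_out))."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def he_normal(n_in, n_out, rng):
    """He: zero-mean normal with standard deviation sqrt(2 / n_in)."""
    std = np.sqrt(2.0 / n_in)
    return rng.normal(0.0, std, size=(n_in, n_out))

rng = np.random.default_rng(0)   # fixed seed for reproducibility
n_in, n_out = 512, 256           # hypothetical layer sizes

W_xavier = xavier_uniform(n_in, n_out, rng)
W_he = he_normal(n_in, n_out, rng)

print("Xavier variance:", W_xavier.var(), "target:", 2.0 / (n_in + n_out))
print("He std:", W_he.std(), "target:", np.sqrt(2.0 / n_in))
```

The uniform range is chosen so that the weight variance equals $2/(n_{in}+n_{out})$, which is why the printed Xavier variance is compared against that target.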

## Summary

| Optimizer | Key Feature | When to Use |
|-----------|-------------|-------------|
| **SGD** | Simple, generalizes well | Final training |
| **Momentum** | Accelerates convergence | Standard choice |
| **Adam** | Adaptive + momentum | Quick prototyping |

| Schedule | Key Feature | When to Use |
|----------|-------------|-------------|
| **Step Decay** | Simple, predictable | CNNs |
| **Cosine** | Smooth, few extra hyperparameters | Modern networks |
| **Warmup** | Stabilizes early training | Large models |

---

## References

- **Primary**: Krishnendu Chaudhury. *Math and Architectures of Deep Learning*, Chapter 4.
- **Supplementary**: Ruder, S. "An overview of gradient descent optimization algorithms."

## Connection to ML Refined Curriculum

These optimization techniques are used throughout:

- Weeks 2-3: Foundation for all optimization
- Weeks 4-13: Training any supervised learning model