# Week 3: First-Order Optimization - Gradient Descent
**IME775: Data Driven Modeling and Optimization**
📖 **Reference**: Watt, Borhani, & Katsaggelos (2020). *Machine Learning Refined* (2nd ed.), **Chapter 3**
---
## Learning Objectives
- Understand the first-order optimality condition
- Derive and implement gradient descent
- Explore learning rate selection
- Identify weaknesses of gradient descent


In [None]:
import numpy as np
import matplotlib.pyplot as plt

## The First-Order Optimality Condition (Section 3.2)
For a differentiable function $g(w)$, a necessary condition for $w^*$ to be a local minimum is
$$\nabla g(w^*) = 0$$
### Intuition
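- The gradient $\nabla g(w)$ points in the direction of steepest **ascent**, so $-\nabla g(w)$ points in the direction of steepest descent
- At a minimum, there is no direction of descent
- Hence the gradient must be zero there (the converse fails: a zero gradient can also occur at a maximum or a saddle point)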
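
The condition can be checked numerically. The following is a minimal sketch, not part of the reference text: `numerical_gradient` is a hypothetical central-difference helper, `g` is the quadratic $g(w) = w^2 + 1$ plotted in the next cell, and `h_cubic` ($w^3$) is an added example showing that the condition is necessary but not sufficient.

```python
import numpy as np

def numerical_gradient(func, w, h=1e-6):
    """Central-difference approximation of func'(w) (hypothetical helper)."""
    return (func(w + h) - func(w - h)) / (2 * h)

def g(w):
    # Same quadratic as in the visualization cell
    return w**2 + 1

def h_cubic(w):
    # Added example: stationary at w = 0, but 0 is not a minimum
    return w**3

grad_at_min = numerical_gradient(g, 0.0)
grad_away = numerical_gradient(g, 1.5)
grad_cubic = numerical_gradient(h_cubic, 0.0)

print(f"g'(0)   = {grad_at_min:.6f}")
print(f"g'(1.5) = {grad_away:.6f}")
print(f"h'(0)   = {grad_cubic:.6f}")
# h has (numerically) zero derivative at 0, yet it is smaller just to the left
print("h(-0.1) < h(0):", h_cubic(-0.1) < h_cubic(0.0))
```

The derivative of `g` vanishes at its minimum and is nonzero elsewhere, while `h_cubic` satisfies the same condition at a point that is not a minimum.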


In [None]:
# Visualize gradient and optimality
x = np.linspace(-3, 3, 200)
g = x**2 + 1
grad_g = 2*x
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Function
ax1 = axes[0]
ax1.plot(x, g, 'b-', linewidth=2)
ax1.plot(0, 1, 'r*', markersize=15, label='Minimum at w=0')
ax1.set_xlabel('w', fontsize=12)
ax1.set_ylabel('g(w)', fontsize=12)
ax1.set_title('Function g(w) = w² + 1')
ax1.legend()
ax1.grid(True, alpha=0.3)
# Gradient
ax2 = axes[1]
ax2.plot(x, grad_g, 'g-', linewidth=2)
ax2.axhline(0, color='gray', linestyle='--')
ax2.plot(0, 0, 'r*', markersize=15, label='∇g(0) = 0')
ax2.set_xlabel('w', fontsize=12)
ax2.set_ylabel('∇g(w)', fontsize=12)
ax2.set_title('Gradient ∇g(w) = 2w')
ax2.legend()
ax2.grid(True, alpha=0.3)
plt.tight_layout()
fig

## The Geometry of First-Order Taylor Series (Section 3.3)
Near a point $w$, the function can be approximated:
$$g(w + d) \approx g(w) + \nabla g(w)^T d$$
### Key Insight
The direction of **steepest descent** is:
$$d = -\nabla g(w)$$
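because $\nabla g(w)^T (-\nabla g(w)) = -\|\nabla g(w)\|^2 < 0$ whenever $\nabla g(w) \neq 0$, so a sufficiently small step along $d$ decreases the first-order approximation. Among all directions of a fixed length, the one along $-\nabla g(w)$ gives the largest decrease (by the Cauchy-Schwarz inequality).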
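
The claim can be checked numerically. The sketch below assumes a hypothetical test function $g(w) = w_1^2 + 3w_2^2$ (chosen for illustration, not taken from the text). It compares the slope $\nabla g(w)^T d$ along the normalized negative gradient with the slope along 1000 random unit directions, then measures the error of the first-order approximation for a small step.

```python
import numpy as np

def g(w):
    # Hypothetical test function chosen for illustration
    return w[0]**2 + 3.0 * w[1]**2

def grad_g(w):
    return np.array([2.0 * w[0], 6.0 * w[1]])

w = np.array([1.0, -0.5])
grad = grad_g(w)

# Unit vector along the negative gradient
d_steepest = -grad / np.linalg.norm(grad)

# Slopes (directional derivatives) along 1000 random unit directions
rng = np.random.default_rng(0)
directions = rng.normal(size=(1000, 2))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
slopes = directions @ grad

slope_steepest = grad @ d_steepest
print("slope along -grad direction:", slope_steepest)
print("most negative random slope: ", slopes.min())

# Error of the first-order Taylor approximation for a small step
step = 1e-3 * d_steepest
taylor = g(w) + grad @ step
error = abs(g(w + step) - taylor)
print("Taylor error for small step:", error)
```

No random direction should yield a more negative slope than the negative gradient direction, and the approximation error should be on the order of the squared step length.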


## Gradient Descent (Section 3.5)
### The Algorithm
$$w^{(k+1)} = w^{(k)} - \alpha \nabla g(w^{(k)})$$
Where $\alpha > 0$ is the **learning rate** (step size).
### Pseudocode
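```python
import numpy as np

def gradient_descent(g, grad_g, w0, alpha, max_iter, tol=1e-6):
    w = w0
    for k in range(max_iter):
        grad = grad_g(w)
        # Assumed stopping rule: stop once the gradient is numerically zero
        if np.linalg.norm(grad) < tol:
            break
        w = w - alpha * grad
    return w
```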


In [None]:
# Gradient descent visualization
def f(w):

## Learning Rate Selection
The learning rate $\alpha$ determines whether gradient descent converges and how quickly:
| $\alpha$ | Effect |
|----------|--------|
| Too small | Very slow convergence |
| Just right | Smooth, efficient convergence |
| Too large | Oscillation |
| Very large | Divergence |
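
The four regimes in the table can be reproduced on a simple quadratic. This is a minimal sketch under stated assumptions: the function $g(w) = w^2$, the helper `run_gd`, the starting point, and the four values of $\alpha$ are all illustrative choices rather than values from the text.

```python
import numpy as np

def grad_g(w):
    # Gradient of the illustrative function g(w) = w**2
    return 2.0 * w

def run_gd(alpha, w0=2.0, max_iter=25):
    """Return all iterates of gradient descent (illustrative helper)."""
    history = [w0]
    w = w0
    for _ in range(max_iter):
        w = w - alpha * grad_g(w)
        history.append(w)
    return np.array(history)

# Illustrative values; the thresholds are specific to g(w) = w**2
alphas = {"too small": 0.01, "just right": 0.3, "too large": 0.9, "very large": 1.1}
results = {label: run_gd(alpha) for label, alpha in alphas.items()}

for label, hist in results.items():
    # A sign flip between consecutive iterates means the step overshot the minimum
    sign_flips = int(np.sum(np.sign(hist[1:]) != np.sign(hist[:-1])))
    print(f"{label:>10} (alpha={alphas[label]}): "
          f"final |w| = {abs(hist[-1]):.3e}, sign flips = {sign_flips}")
```

For this particular function each step multiplies the iterate by $(1 - 2\alpha)$, so $\alpha < 0.5$ converges without overshooting, $0.5 < \alpha < 1$ oscillates around the minimum while still converging, and $\alpha > 1$ diverges. These thresholds depend on the curvature of $g$ and do not carry over to other functions.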


In [None]:
import marimo as mo  # assumed: `mo` is the conventional alias for marimo

alpha_slider = mo.ui.slider(0.01, 0.5, value=0.1, step=0.01, label="Learning Rate α")
alpha_slider

In [None]:
# Interactive learning rate demo
def f_demo(w):

## Two Natural Weaknesses of Gradient Descent (Section 3.6)
### Weakness 1: Zigzagging
For ill-conditioned functions (very different curvature in different directions),
gradient descent zigzags inefficiently.
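
A small sketch of the zigzag effect, assuming a hypothetical ill-conditioned quadratic $g(w) = w_1^2 + 20w_2^2$ (not from the text). The step size is limited by the steep $w_2$ direction, so the iterates bounce across the valley in $w_2$ while making only slow progress in $w_1$.

```python
import numpy as np

def grad_g(w):
    # Gradient of the hypothetical quadratic g(w) = w1**2 + 20*w2**2
    return np.array([2.0 * w[0], 40.0 * w[1]])

alpha = 0.045          # must stay below 2/40 = 0.05 to avoid divergence along w2
w = np.array([2.0, 1.0])
path = [w.copy()]
for _ in range(30):
    w = w - alpha * grad_g(w)
    path.append(w.copy())
path = np.array(path)

# Count how often the steep coordinate w2 changes sign between iterates
w2_sign_flips = int(np.sum(path[1:, 1] * path[:-1, 1] < 0))
print("sign flips in w2 over 30 steps:", w2_sign_flips)
print("remaining fraction of w1:      ", path[-1, 0] / path[0, 0])
```

Plotting `path[:, 0]` against `path[:, 1]` on top of the contours of $g$ would show the characteristic zigzag.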
### Weakness 2: Saddle Points
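In high dimensions, saddle points are common. At a saddle point the gradient is zero even though the point is not a minimum, and because the gradient is small nearby, gradient descent takes tiny steps and can stall there.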
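
The stalling behaviour can be sketched on the saddle function $g(w) = w_1^2 - w_2^2$ that is plotted in the visualization cell of this section. The helper `run_gd`, the step size, and the two starting points are illustrative assumptions.

```python
import numpy as np

def grad_g(w):
    # Gradient of the saddle function g(w) = w1**2 - w2**2
    return np.array([2.0 * w[0], -2.0 * w[1]])

def run_gd(w0, alpha=0.1, max_iter=50):
    """Plain gradient descent; returns all iterates (illustrative helper)."""
    path = [np.array(w0, dtype=float)]
    for _ in range(max_iter):
        path.append(path[-1] - alpha * grad_g(path[-1]))
    return np.array(path)

on_axis = run_gd([1.0, 0.0])       # w2 = 0 exactly: heads straight to the saddle
perturbed = run_gd([1.0, 1e-6])    # tiny offset in w2 grows by a factor 1.2 per step

print("on-axis final point:   ", on_axis[-1])
print("on-axis gradient norm: ", np.linalg.norm(grad_g(on_axis[-1])))
print("perturbed |w2| at k=10:", abs(perturbed[10, 1]))
print("perturbed |w2| at k=50:", abs(perturbed[50, 1]))
```

The on-axis run approaches $(0, 0)$ with a vanishing gradient, so a stopping test based on the gradient norm would report convergence at a point that is not a minimum. The perturbed run lingers near the saddle for many iterations before the $w_2$ component grows enough to escape.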
### Solutions (Covered in Appendix A)
- Momentum
- Adaptive learning rates (Adam, RMSprop)
- Second-order methods


In [None]:
# Saddle point visualization
x_range = np.linspace(-2, 2, 100)
y_range = np.linspace(-2, 2, 100)
X, Y = np.meshgrid(x_range, y_range)
Z = X**2 - Y**2  # Saddle function
fig4 = plt.figure(figsize=(12, 5))
# 3D surface
ax1 = fig4.add_subplot(121, projection='3d')
ax1.plot_surface(X, Y, Z, cmap='viridis', alpha=0.8)
ax1.plot([0], [0], [0], 'r*', markersize=15)
ax1.set_xlabel('$w_1$')
ax1.set_ylabel('$w_2$')
ax1.set_zlabel('g(w)')
ax1.set_title('Saddle Point at (0, 0)')
# Contour
ax2 = fig4.add_subplot(122)
ax2.contour(X, Y, Z, levels=20, cmap='viridis')
ax2.plot(0, 0, 'r*', markersize=15, label='Saddle point: ∇g = 0')
ax2.set_xlabel('$w_1$')
ax2.set_ylabel('$w_2$')
ax2.set_title('g(w) = $w_1^2$ - $w_2^2$')
ax2.legend()
ax2.set_aspect('equal')
plt.tight_layout()
fig4

## Summary
| Concept | Key Points |
|---------|------------|
| **First-order condition** | $\nabla g(w^*) = 0$ at stationary points |
| **Gradient descent** | $w^{(k+1)} = w^{(k)} - \alpha \nabla g(w^{(k)})$ |
| **Learning rate** | Too small converges slowly; too large oscillates or diverges |
| **Weaknesses** | Zigzagging, saddle points |
---
## References
- **Primary**: Watt, J., Borhani, R., & Katsaggelos, A. K. (2020). *Machine Learning Refined* (2nd ed.), Chapter 3.
- **Supplementary**: Nocedal, J. & Wright, S. (2006). *Numerical Optimization*, Chapter 3.
## Next Week
**Second-Order Optimization: Newton's Method** (Chapter 4): Using curvature information for faster convergence.
