In [None]:
import marimo as mo

# Week 4: Second-Order Optimization - Newton's Method

**IME775: Data Driven Modeling and Optimization**

📖 **Reference**: Watt, Borhani, & Katsaggelos (2020). *Machine Learning Refined* (2nd ed.), **Chapter 4**

---

## Learning Objectives

- Understand second-order optimality conditions
- Derive and implement Newton's method
- Compare Newton's method with gradient descent
- Identify weaknesses of Newton's method

In [None]:
import numpy as np
import matplotlib.pyplot as plt

## The Second-Order Optimality Condition (Section 4.1)

For a twice-differentiable function $g(w)$:

### Necessary Conditions

If $w^*$ is a local minimum:

1. $\nabla g(w^*) = 0$ (first-order condition)
2. $\nabla^2 g(w^*) \succeq 0$ (Hessian is positive semi-definite)

### Sufficient Conditions

If at $w^*$:

1. $\nabla g(w^*) = 0$
2. $\nabla^2 g(w^*) \succ 0$ (Hessian is positive definite)

Then $w^*$ is a **strict local minimum**.
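The conditions above can be checked numerically from the eigenvalues of the Hessian. The following is a minimal sketch: the helper name `classify_stationary_point` and the tolerance `tol` are illustrative assumptions, not part of the reference text.

```python
import numpy as np

def classify_stationary_point(gradient, hessian, tol=1e-8):
    """Classify a candidate point from its gradient and Hessian (hypothetical helper)."""
    if np.linalg.norm(gradient) > tol:
        return "not stationary"
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order
    eigvals = np.linalg.eigvalsh(hessian)
    if eigvals[0] > tol:
        return "strict local minimum"   # positive definite
    if eigvals[-1] < -tol:
        return "strict local maximum"   # negative definite
    if eigvals[0] < -tol and eigvals[-1] > tol:
        return "saddle point"           # indefinite
    return "inconclusive"               # semi-definite: second-order test cannot decide

# g(w) = w1^2 + 5 w2^2 at the origin
print(classify_stationary_point(np.zeros(2), np.array([[2.0, 0.0], [0.0, 10.0]])))
# g(w) = w1^2 - w2^2 at the origin
print(classify_stationary_point(np.zeros(2), np.array([[2.0, 0.0], [0.0, -2.0]])))
```

The "inconclusive" branch covers the gap between the necessary and sufficient conditions: a positive semi-definite Hessian with a zero eigenvalue is compatible with a minimum but does not guarantee one.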

## The Hessian Matrix

The Hessian is the matrix of second partial derivatives:

$$\nabla^2 g(w) = H = \begin{bmatrix}
\frac{\partial^2 g}{\partial w_1^2} & \frac{\partial^2 g}{\partial w_1 \partial w_2} & \cdots \\
\frac{\partial^2 g}{\partial w_2 \partial w_1} & \frac{\partial^2 g}{\partial w_2^2} & \cdots \\
\vdots & \vdots & \ddots
\end{bmatrix}$$

### Properties

- Symmetric (for smooth functions)
- Eigenvalues indicate curvature
- Positive definite $\Rightarrow$ bowl-shaped (minimum)
- Negative definite $\Rightarrow$ hilltop (maximum)
- Indefinite $\Rightarrow$ saddle point
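When an analytic Hessian is not available, the matrix of second partial derivatives can be approximated with central finite differences. This is a rough sketch: the helper `numerical_hessian`, the step `h`, and the test function are assumptions chosen for illustration.

```python
import numpy as np

def numerical_hessian(g, w, h=1e-4):
    """Approximate the Hessian of g at w with central differences (illustrative helper)."""
    w = np.asarray(w, dtype=float)
    n = w.size
    H = np.zeros((n, n))
    for i in range(n):
        e_i = np.zeros(n)
        e_i[i] = h
        for j in range(n):
            e_j = np.zeros(n)
            e_j[j] = h
            # Four-point stencil for the mixed partial d^2 g / (dw_i dw_j)
            H[i, j] = (g(w + e_i + e_j) - g(w + e_i - e_j)
                       - g(w - e_i + e_j) + g(w - e_i - e_j)) / (4 * h**2)
    return H

def g_example(w):
    # Assumed example: a quadratic whose exact Hessian is [[2, 3], [3, 10]]
    return w[0]**2 + 3 * w[0] * w[1] + 5 * w[1]**2

H = numerical_hessian(g_example, [1.0, -1.0])
print(H)
print("eigenvalues:", np.linalg.eigvalsh(H))
```

Both eigenvalues of this example are positive, so the surface is bowl-shaped in every direction.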

## The Geometry of Second-Order Taylor Series (Section 4.2)

Near a point $w$:

$$g(w + d) \approx g(w) + \nabla g(w)^T d + \frac{1}{2} d^T \nabla^2 g(w) d$$

### The Quadratic Approximation

This is a quadratic function in $d$. Writing $H = \nabla^2 g(w)$, minimize it by setting its gradient with respect to $d$ to zero:

$$\nabla_d\left[g(w) + \nabla g(w)^T d + \frac{1}{2} d^T H d\right] = 0$$

$$\nabla g(w) + H d = 0$$

$$d = -H^{-1} \nabla g(w)$$

This stationary point is the minimizer of the quadratic when $H$ is positive definite.
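To see the quadratic approximation at work, the sketch below compares $g(w + d)$ with its second-order model for shrinking steps $d$, then checks that the step $d = -H^{-1} \nabla g(w)$ zeroes the model's gradient. The test function $g(w) = e^{w_1} - w_1 + w_2^2$ and all helper names are assumptions made for this illustration.

```python
import numpy as np

def g(w):
    return np.exp(w[0]) - w[0] + w[1]**2

def grad_g(w):
    return np.array([np.exp(w[0]) - 1.0, 2.0 * w[1]])

def hess_g(w):
    return np.array([[np.exp(w[0]), 0.0], [0.0, 2.0]])

def quadratic_model(w, d):
    """Second-order Taylor approximation of g(w + d)."""
    return g(w) + grad_g(w) @ d + 0.5 * d @ hess_g(w) @ d

w = np.array([1.0, 1.0])

# The approximation error shrinks roughly like |d|^3
model_errors = {}
for scale in (0.1, 0.01):
    d = scale * np.array([1.0, 1.0])
    model_errors[scale] = abs(g(w + d) - quadratic_model(w, d))
    print(f"|d| scale {scale}: model error = {model_errors[scale]:.2e}")

# The minimizer of the model solves H d = -grad g(w)
d_newton = np.linalg.solve(hess_g(w), -grad_g(w))
model_gradient = grad_g(w) + hess_g(w) @ d_newton
print("Newton step:", d_newton)
print("gradient of the model at the Newton step:", model_gradient)
```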

## Newton's Method (Section 4.3)

### The Newton Update

$$w^{(k+1)} = w^{(k)} - [\nabla^2 g(w^{(k)})]^{-1} \nabla g(w^{(k)})$$

Or equivalently:

$$w^{(k+1)} = w^{(k)} + d^{(k)}$$

where $d^{(k)}$ solves: $\nabla^2 g(w^{(k)}) d^{(k)} = -\nabla g(w^{(k)})$

### Algorithm

```python
def newtons_method(grad_g, hess_g, w0, tol, max_iter):
    w = w0
    for k in range(max_iter):
        gradient = grad_g(w)
        hessian = hess_g(w)
        # Stop once the gradient is numerically zero
        if np.linalg.norm(gradient) < tol:
            break
        # Solve H d = -gradient rather than forming the inverse explicitly
        d = np.linalg.solve(hessian, -gradient)
        w = w + d
    return w
```
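As a usage sketch, the algorithm above can be applied to a small non-quadratic problem. The function $g(w) = e^{w_1 + w_2} + w_1^2 + w_2^2$, the default values of `tol` and `max_iter`, and the starting point are all assumptions picked for illustration. The function is strictly convex, so the Hessian is positive definite everywhere.

```python
import numpy as np

def newtons_method(grad_g, hess_g, w0, tol=1e-10, max_iter=50):
    """Newton's method as in the algorithm above (defaults are illustrative)."""
    w = np.asarray(w0, dtype=float)
    for k in range(max_iter):
        gradient = grad_g(w)
        if np.linalg.norm(gradient) < tol:
            break
        d = np.linalg.solve(hess_g(w), -gradient)
        w = w + d
    return w

def grad_g(w):
    s = np.exp(w[0] + w[1])
    return np.array([s + 2.0 * w[0], s + 2.0 * w[1]])

def hess_g(w):
    s = np.exp(w[0] + w[1])
    return np.array([[s + 2.0, s], [s, s + 2.0]])

w_star = newtons_method(grad_g, hess_g, np.array([1.0, 1.0]))
print("minimizer:", w_star)
print("gradient norm at minimizer:", np.linalg.norm(grad_g(w_star)))
```

By symmetry both coordinates of the minimizer are equal, which gives a quick sanity check on the result.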

In [None]:
# Newton's method visualization
def f(w):
    return w[0]**2 + 5*w[1]**2

def grad_f(w):
    return np.array([2*w[0], 10*w[1]])

def hess_f(w):
    return np.array([[2, 0], [0, 10]])

def newtons_method(grad_f, hess_f, w0, n_iter):
    # Record every iterate so the path can be plotted
    path = [w0.copy()]
    w = w0.copy()
    for _ in range(n_iter):
        gradient = grad_f(w)
        hessian = hess_f(w)
        d = np.linalg.solve(hessian, -gradient)
        w = w + d
        path.append(w.copy())
    return np.array(path)

def gradient_descent(grad_f, w0, alpha, n_iter):
    path = [w0.copy()]
    w = w0.copy()
    for _ in range(n_iter):
        w = w - alpha * grad_f(w)
        path.append(w.copy())
    return np.array(path)

w0 = np.array([4.0, 2.0])
path_newton = newtons_method(grad_f, hess_f, w0, 5)
path_gd = gradient_descent(grad_f, w0, 0.1, 20)

# Contour plot
x_range = np.linspace(-5, 5, 100)
y_range = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x_range, y_range)
Z = X**2 + 5*Y**2

fig, ax = plt.subplots(figsize=(12, 8))
ax.contour(X, Y, Z, levels=30, cmap='viridis')
ax.plot(path_gd[:, 0], path_gd[:, 1], 'bo-', markersize=6,
        linewidth=1.5, label=f'Gradient Descent ({len(path_gd)-1} steps)')
ax.plot(path_newton[:, 0], path_newton[:, 1], 'r*-', markersize=12,
        linewidth=2, label=f'Newton ({len(path_newton)-1} steps)')
ax.plot(0, 0, 'g*', markersize=20, label='Optimum')
ax.set_xlabel('$w_1$', fontsize=12)
ax.set_ylabel('$w_2$', fontsize=12)
ax.set_title("Newton's Method vs Gradient Descent (ML Refined, Section 4.3)", fontsize=14)
ax.legend()
ax.set_aspect('equal')
fig
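The behavior in the plot can be verified by hand. For the quadratic $f(w) = w_1^2 + 5 w_2^2$ used in the cell above, the gradient is linear in $w$, so a single Newton step lands on the minimizer from any starting point:

$$\nabla f(w) = \begin{bmatrix} 2 w_1 \\ 10 w_2 \end{bmatrix} = H w, \qquad H = \begin{bmatrix} 2 & 0 \\ 0 & 10 \end{bmatrix}$$

$$w^{(1)} = w^{(0)} - H^{-1} H w^{(0)} = 0$$

With $w^{(0)} = (4, 2)$ this gives $\nabla f = (8, 20)$, $d = (-4, -2)$, and $w^{(1)} = (0, 0)$. Gradient descent with $\alpha = 0.1$ instead scales the coordinates by $1 - 2\alpha = 0.8$ and $1 - 10\alpha = 0$ per step, so $w_2$ vanishes immediately while $w_1$ is still about $4 \cdot 0.8^{20} \approx 0.046$ after 20 steps.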

## Convergence Properties

### Newton's Method Convergence

For functions with Lipschitz continuous Hessian, near a local minimum $w^*$ where the Hessian is positive definite:

$$\|w^{(k+1)} - w^*\| \leq C \|w^{(k)} - w^*\|^2$$

This is **quadratic convergence**: the number of correct digits roughly doubles each iteration!

### Comparison

| Method | Convergence Rate | Per-iteration Cost |
|--------|------------------|--------------------|
| Gradient Descent | Linear: $O((1-1/\kappa)^k)$ | $O(n)$ gradient |
| Newton's Method | Quadratic | $O(n^3)$ Hessian inverse |

Here $\kappa$ is the condition number of the Hessian and $n$ is the number of parameters.
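The quadratic rate can be observed directly by tracking the ratio $|w^{(k+1)} - w^*| / |w^{(k)} - w^*|^2$, which should settle near a constant $C$. The sketch below uses an assumed 1D test problem, $g(w) = e^w - 2w$ with minimizer $w^* = \ln 2$, which is not from the reference text. The loop stops after four iterations because the error then reaches machine precision.

```python
import numpy as np

def grad_g(w):
    return np.exp(w) - 2.0

def hess_g(w):
    return np.exp(w)

w_star = np.log(2.0)  # exact minimizer of the assumed test function

w = 1.0
errors = [abs(w - w_star)]
for _ in range(4):
    w = w - grad_g(w) / hess_g(w)
    errors.append(abs(w - w_star))

# Ratio of each error to the square of the previous one
ratios = [errors[k + 1] / errors[k]**2 for k in range(len(errors) - 1)]
for k, (err, ratio) in enumerate(zip(errors[1:], ratios), start=1):
    print(f"iter {k}: error = {err:.3e}, error / previous_error^2 = {ratio:.3f}")
```

For this problem the ratios approach $g'''(w^*) / (2 g''(w^*)) = 1/2$, so each error is about half the square of the previous one.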

## Two Natural Weaknesses of Newton's Method (Section 4.4)

### Weakness 1: Computational Cost

- Computing Hessian: $O(n^2)$ storage, $O(n^2)$ computation
- Inverting/solving: $O(n^3)$
- Impractical for large-scale ML ($n$ = millions)

### Weakness 2: Non-Convex Functions

- Newton step may go uphill (ascent direction)
- May converge to saddle point or maximum
- Hessian may be singular or indefinite

### Solutions

- **Damped Newton**: Use a step size $\alpha$: $w^{(k+1)} = w^{(k)} - \alpha H^{-1} \nabla g$
- **Hessian modification**: Ensure positive definiteness
- **Quasi-Newton methods**: Approximate Hessian (BFGS, L-BFGS)
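The first two remedies can be combined in a few lines. The sketch below is one possible variant, not the method prescribed by the reference text. It makes the Hessian positive definite by replacing each eigenvalue $\lambda$ with $\max(|\lambda|, \text{floor})$, and it chooses the step size $\alpha$ by backtracking until a sufficient-decrease (Armijo) test passes. The test function $g(w) = \tfrac{1}{4} w_1^4 - \tfrac{1}{2} w_1^2 + w_2^2$, which has a saddle at the origin and minima at $(\pm 1, 0)$, and all helper names are assumptions.

```python
import numpy as np

def g(w):
    return 0.25 * w[0]**4 - 0.5 * w[0]**2 + w[1]**2

def grad_g(w):
    return np.array([w[0]**3 - w[0], 2.0 * w[1]])

def hess_g(w):
    return np.array([[3.0 * w[0]**2 - 1.0, 0.0], [0.0, 2.0]])

def modified_direction(gradient, hessian, floor=1e-6):
    """Newton direction with each Hessian eigenvalue replaced by max(|lambda|, floor)."""
    eigvals, eigvecs = np.linalg.eigh(hessian)
    safe = np.maximum(np.abs(eigvals), floor)
    # Solve with the modified Hessian Q diag(safe) Q^T; the result is a descent direction
    return -eigvecs @ ((eigvecs.T @ gradient) / safe)

def damped_modified_newton(g, grad_g, hess_g, w0, tol=1e-8, max_iter=100):
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        gradient = grad_g(w)
        if np.linalg.norm(gradient) < tol:
            break
        d = modified_direction(gradient, hess_g(w))
        # Backtracking: halve alpha until the sufficient-decrease test passes
        alpha = 1.0
        while g(w + alpha * d) > g(w) + 1e-4 * alpha * (gradient @ d) and alpha > 1e-12:
            alpha *= 0.5
        w = w + alpha * d
    return w

w0 = np.array([0.1, 1.0])  # the Hessian is indefinite here

# Plain Newton from the same start, for comparison
w_plain = w0.copy()
for _ in range(10):
    w_plain = w_plain + np.linalg.solve(hess_g(w_plain), -grad_g(w_plain))

w_safe = damped_modified_newton(g, grad_g, hess_g, w0)
print("plain Newton:   ", w_plain)
print("modified Newton:", w_safe)
```

From this starting point plain Newton is drawn to the saddle at the origin, while the modified, damped iteration reaches the minimum at $(1, 0)$. The eigendecomposition still costs $O(n^3)$, so this addresses Weakness 2 only.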

In [None]:
# Convergence comparison
def f_1d(w):
    return (w - 2)**4

def grad_f_1d(w):
    return 4 * (w - 2)**3

def hess_f_1d(w):
    return 12 * (w - 2)**2

# Gradient descent
# Step size 0.01: with 0.1 the iterates starting from w = 5 overshoot and diverge
w_gd = [5.0]
for _ in range(50):
    w_gd.append(w_gd[-1] - 0.01 * grad_f_1d(w_gd[-1]))

# Newton
w_newton = [5.0]
for _ in range(10):
    h = hess_f_1d(w_newton[-1])
    # Guard against division by a vanishing second derivative
    if abs(h) > 1e-10:
        w_newton.append(w_newton[-1] - grad_f_1d(w_newton[-1]) / h)
    else:
        break

fig2, ax2 = plt.subplots(figsize=(10, 5))
ax2.semilogy(range(len(w_gd)), [abs(w - 2) for w in w_gd], 'b-o',
             markersize=4, label='Gradient Descent')
ax2.semilogy(range(len(w_newton)), [abs(w - 2) for w in w_newton], 'r-*',
             markersize=8, label="Newton's Method")
ax2.set_xlabel('Iteration')
ax2.set_ylabel('|w - w*|')
ax2.set_title('Convergence Comparison')
ax2.legend()
ax2.grid(True, alpha=0.3)
fig2
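A short calculation explains the straight Newton line on this semilog plot. For $f(w) = (w - 2)^4$, the Newton update simplifies to

$$w^{(k+1)} = w^{(k)} - \frac{4 (w^{(k)} - 2)^3}{12 (w^{(k)} - 2)^2} = w^{(k)} - \frac{1}{3}\left(w^{(k)} - 2\right)$$

$$w^{(k+1)} - 2 = \frac{2}{3}\left(w^{(k)} - 2\right)$$

The error shrinks by the fixed factor $2/3$ per iteration, which is linear rather than quadratic convergence. The cause is that $f''(2) = 0$: the Hessian is only positive semi-definite at the minimizer, so the positive-definiteness assumption behind the quadratic rate does not hold for this example.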

## Summary

| Concept | Key Points |
|---------|------------|
| **Second-order condition** | $\nabla g = 0$ and $H \succ 0$ |
| **Newton's method** | $w \leftarrow w - H^{-1} \nabla g$ |
| **Convergence** | Quadratic (very fast near optimum) |
| **Weaknesses** | Expensive, issues with non-convex functions |

---

## References

- **Primary**: Watt, J., Borhani, R., & Katsaggelos, A. K. (2020). *Machine Learning Refined* (2nd ed.), Chapter 4.
- **Supplementary**: Nocedal, J. & Wright, S. (2006). *Numerical Optimization*, Chapters 3, 6.

## Next Week

**Linear Regression** (Chapter 5): Applying optimization to regression problems.