# Numerical Optimization

This notebook introduces the numerical optimization methods used in machine learning to minimize loss functions efficiently and stably. It focuses on the intuition, math, and convergence behavior of common optimizers:
- Gradient descent and learning rate schedules
- Stochastic gradient descent (SGD) and mini-batching
- Momentum-accelerated gradients
- Adam optimizer (adaptive per-parameter step sizes)

## Setup

In [None]:
import numpy as np
import matplotlib.pyplot as plt
np.set_printoptions(precision=3)

def quad_fn(x, A, b, c):
    """
    f(x) = 0.5 x^T A x + b^T x + c
    
    x: (n,)
    A: (n, n) symmetric matrix
    b: (n,)
    c: scalar
    """
    return 0.5 * x @ A @ x + b @ x + c

def quad_grad(x, A, b):
    """
    grad_f(x) = A x + b

    x: (n,)
    A: (n, n) symmetric matrix (for a general A the gradient is 0.5 (A + A^T) x + b)
    b: (n,)
    """
    return A @ x + b

def rosenbrock_fn(x, y, a, b):
    """
    f(x, y) = (a - x)^2 + b(y - x^2)^2
    Global minimum f = 0 at (x, y) = (a, a^2) when b > 0.

    x: scalar
    y: scalar
    a: scalar
    b: scalar
    """
    return (a - x)**2 + b * (y - x**2)**2

def rosenbrock_grad(x, y, a, b):
    """
    grad_f(x, y) = [df/dx, df/dy]
    df/dx = -2(a - x) - 4b x (y - x^2)
    df/dy = 2b (y - x^2)
    
    x: scalar
    y: scalar
    a: scalar
    b: scalar
    """
    dfdx = -2 * (a - x) - 4 * b * x * (y - x**2)
    dfdy = 2 * b * (y - x**2)
    return np.array([dfdx, dfdy])

def saddle_fn(x, y):
    """
    f(x, y) = x^2 - y^2
    The origin is a saddle point: a minimum along x, a maximum along y.

    x: scalar
    y: scalar
    """
    return x**2 - y**2

def saddle_grad(x, y):
    """
    grad_f(x, y) = [df/dx, df/dy]
    df/dx = 2x
    df/dy = -2y

    x: scalar
    y: scalar
    """
    return np.array([2 * x, -2 * y])
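
Before the analytic gradients above are used in an optimizer, it can help to compare them against finite differences. The sketch below does this for the Rosenbrock gradient. The helper `numerical_grad`, the test point, `eps`, and the choice `a=1, b=100` are illustrative assumptions rather than part of the notebook, and the Rosenbrock definitions are repeated only so the block runs on its own.

```python
import numpy as np

def rosenbrock_fn(x, y, a, b):
    return (a - x)**2 + b * (y - x**2)**2

def rosenbrock_grad(x, y, a, b):
    dfdx = -2 * (a - x) - 4 * b * x * (y - x**2)
    dfdy = 2 * b * (y - x**2)
    return np.array([dfdx, dfdy])

def numerical_grad(f, p, eps=1e-6):
    """Central-difference estimate of the gradient of f at p (hypothetical helper)."""
    p = np.asarray(p, dtype=float)
    g = np.zeros_like(p)
    for i in range(p.size):
        step = np.zeros_like(p)
        step[i] = eps
        # Perturb one coordinate at a time in both directions.
        g[i] = (f(p + step) - f(p - step)) / (2 * eps)
    return g

a, b = 1.0, 100.0            # assumed parameter values
p0 = np.array([-1.2, 1.0])   # assumed test point

analytic = rosenbrock_grad(p0[0], p0[1], a, b)
numeric = numerical_grad(lambda p: rosenbrock_fn(p[0], p[1], a, b), p0)
max_abs_diff = np.max(np.abs(analytic - numeric))
print(analytic, numeric, max_abs_diff)
```

A large discrepancy between the two vectors would usually point to a sign or factor error in the hand-derived gradient.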

## Gradient Descent
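
As a minimal sketch of fixed-step gradient descent, `x_{k+1} = x_k - lr * grad_f(x_k)`, the block below minimizes a quadratic of the form defined in Setup. The helper `gradient_descent`, the matrix `A`, the vector `b`, the starting point, and the step size are assumptions chosen for illustration. For a symmetric positive definite `A`, a fixed step converges when `lr < 2 / lambda_max(A)`, which the code checks explicitly.

```python
import numpy as np

def gradient_descent(grad, x0, lr, n_steps):
    """Run fixed-step gradient descent and return the full iterate path."""
    x = np.asarray(x0, dtype=float).copy()
    path = [x.copy()]
    for _ in range(n_steps):
        x = x - lr * grad(x)   # step against the gradient
        path.append(x.copy())
    return np.array(path)

# Assumed test problem: f(x) = 0.5 x^T A x + b^T x with A positive definite.
A = np.array([[3.0, 0.0], [0.0, 1.0]])
b = np.array([-3.0, 1.0])

lr = 0.1
lr_max = 2.0 / np.linalg.eigvalsh(A).max()   # stability bound for this quadratic
assert lr < lr_max

path = gradient_descent(lambda x: A @ x + b, x0=[2.0, 2.0], lr=lr, n_steps=200)
x_star = np.linalg.solve(A, -b)   # exact minimizer solves A x = -b
print(path[-1], x_star)
```

Returning the whole path rather than only the final iterate makes it easy to plot the trajectory or the loss per step later on.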

## Stochastic Gradient Descent (SGD)
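
The following is a rough sketch of mini-batch SGD on a least-squares problem, where each update uses the gradient of the loss on a random subset of the data instead of the full dataset. The synthetic data, `w_true`, the helper `minibatch_grad`, and all hyperparameters (`lr`, `batch_size`, `n_epochs`) are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data (assumed for illustration).
n, d = 256, 2
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0])
y = X @ w_true + 0.01 * rng.normal(size=n)

def minibatch_grad(w, Xb, yb):
    """Gradient of 0.5 * mean((Xb @ w - yb)**2) with respect to w."""
    return Xb.T @ (Xb @ w - yb) / len(yb)

w = np.zeros(d)
lr, batch_size, n_epochs = 0.1, 16, 50
for epoch in range(n_epochs):
    order = rng.permutation(n)   # reshuffle the data every epoch
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        w = w - lr * minibatch_grad(w, X[idx], y[idx])
print(w)
```

With a constant learning rate the iterates do not settle exactly; they fluctuate around the least-squares solution at a scale set by the step size and the gradient noise.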

## Momentum-Accelerated Gradients
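
One common formulation of momentum (the heavy-ball form) keeps a velocity `v` that accumulates past gradients: `v = beta * v - lr * grad_f(x)`, then `x = x + v`. The sketch below compares it with plain gradient descent at the same step size on an ill-conditioned quadratic. The helpers `run_gd` and `run_momentum`, the matrix `A`, and the values of `lr`, `beta`, and `n_steps` are illustrative assumptions.

```python
import numpy as np

def run_gd(A, x0, lr, n_steps):
    """Plain gradient descent on f(x) = 0.5 x^T A x."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_steps):
        x = x - lr * (A @ x)
    return x

def run_momentum(A, x0, lr, beta, n_steps):
    """Heavy-ball momentum on f(x) = 0.5 x^T A x."""
    x = np.asarray(x0, dtype=float).copy()
    v = np.zeros_like(x)
    for _ in range(n_steps):
        v = beta * v - lr * (A @ x)   # decay old velocity, add new gradient step
        x = x + v
    return x

# Assumed ill-conditioned problem: curvature 100 along x[0], 1 along x[1].
A = np.diag([100.0, 1.0])
x0 = [1.0, 1.0]

x_gd = run_gd(A, x0, lr=0.01, n_steps=300)
x_mom = run_momentum(A, x0, lr=0.01, beta=0.9, n_steps=300)

# The minimizer is the origin, so the norm is the distance to the optimum.
print(np.linalg.norm(x_gd), np.linalg.norm(x_mom))
```

With these particular settings, plain gradient descent is held back by the low-curvature direction, while the momentum run ends much closer to the minimizer; other choices of `lr` and `beta` can behave differently.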

## Adam Optimizer
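
Adam maintains exponential moving averages of the gradient (`m`) and of its elementwise square (`v`), corrects both for their zero initialization, and scales each coordinate's step by `1 / sqrt(v_hat)`. The sketch below follows the standard update rule with the commonly used defaults `beta1=0.9`, `beta2=0.999`, `eps=1e-8`. The helper name `adam`, the learning rate, the step count, and the quadratic test problem are assumptions for illustration.

```python
import numpy as np

def adam(grad, x0, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=2000):
    """Minimal Adam loop; returns the final iterate."""
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)   # first-moment (mean) estimate
    v = np.zeros_like(x)   # second-moment (uncentered variance) estimate
    for t in range(1, n_steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)   # bias correction for zero initialization
        v_hat = v / (1 - beta2**t)
        x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x

# Assumed test problem: f(x) = 0.5 x^T A x + b^T x.
A = np.diag([3.0, 1.0])
b = np.array([-3.0, 1.0])

x_final = adam(lambda x: A @ x + b, x0=[2.0, 2.0])
x_star = np.linalg.solve(A, -b)   # exact minimizer for comparison
print(x_final, x_star)
```

Because the step is normalized by `sqrt(v_hat)`, the first updates have magnitude close to `lr` in every coordinate regardless of the gradient's scale, which is the main practical difference from the fixed-step methods.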