<a href="https://colab.research.google.com/github/murigugitonga/math_4_ai/blob/dev/03_calculus/08_jacobian_and_hessian.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Calculus — Jacobian & Hessian

**Author:** Murigu Gitonga  
**Objective:** Demonstrate Jacobian and Hessian matrices and their role in
multivariate optimization, neural networks & second-order learning methods.


## 1. Why Do We Need Them?

- Gradients are defined only for **scalar-valued functions**
- ML models often involve:
  - Vector-valued functions (e.g. a layer mapping inputs to activations)
  - Many parameters
- Jacobian -> first-order behavior of vector-valued functions
- Hessian -> second-order (curvature) behavior of scalar-valued functions

These concepts explain:
- Backpropagation mechanics
- Vanishing / exploding gradients
- Why optimization can be slow or unstable


## 2. Jacobian Matrix

Let $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$ be a vector-valued function:
$$
\mathbf{f}(\mathbf{x}) =
\begin{bmatrix}
f_1(x_1, x_2, \dots, x_n) \\
f_2(x_1, x_2, \dots, x_n) \\
\vdots \\
f_m(x_1, x_2, \dots, x_n)
\end{bmatrix}
$$

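The Jacobian is the $m \times n$ matrix of first-order partial derivatives, with entry $J_{ij} = \partial f_i / \partial x_j$: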
$$
J =
\begin{bmatrix}
\frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\
\frac{\partial f_2}{\partial x_1} & \cdots & \frac{\partial f_2}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n}
\end{bmatrix}
$$


### Example

Let:
$$
\mathbf{f}(x,y) =
\begin{bmatrix}
x^2 + y \\
\sin(xy)
\end{bmatrix}
$$

Partial derivatives:
$$
\frac{\partial f_1}{\partial x} = 2x, \quad
\frac{\partial f_1}{\partial y} = 1
$$

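$$
\frac{\partial f_2}{\partial x} = y\cos(xy), \quad
\frac{\partial f_2}{\partial y} = x\cos(xy)
$$

Stacking these row by row gives the Jacobian:
$$
J(x, y) =
\begin{bmatrix}
2x & 1 \\
y\cos(xy) & x\cos(xy)
\end{bmatrix}
$$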


In [1]:
import numpy as np

def jacobian(x, y):
  # Analytic Jacobian of f(x, y) = [x^2 + y, sin(xy)]; row i holds the partials of f_i
  return np.array([[2*x, 1], [y*np.cos(x*y), x*np.cos(x*y)]])

jacobian(1.0, 2.0)

array([[ 2.        ,  1.        ],
       [-0.83229367, -0.41614684]])
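
A hand-derived Jacobian is easy to get wrong, so it can help to compare it against a finite-difference approximation. The sketch below is one possible check, not part of the original notebook: the helper names `f_vec`, `analytic_jacobian`, and `numerical_jacobian`, and the step size `eps`, are assumptions chosen for illustration.

```python
import numpy as np

def f_vec(v):
    """The example function f(x, y) = [x^2 + y, sin(xy)]."""
    x, y = v
    return np.array([x**2 + y, np.sin(x * y)])

def analytic_jacobian(v):
    """Jacobian assembled from the partial derivatives derived above."""
    x, y = v
    return np.array([[2 * x, 1.0],
                     [y * np.cos(x * y), x * np.cos(x * y)]])

def numerical_jacobian(func, v, eps=1e-6):
    """Central-difference approximation; column j holds d func / d v_j."""
    v = np.asarray(v, dtype=float)
    m = func(v).shape[0]
    J = np.zeros((m, v.size))
    for j in range(v.size):
        step = np.zeros_like(v)
        step[j] = eps
        J[:, j] = (func(v + step) - func(v - step)) / (2 * eps)
    return J

point = np.array([1.0, 2.0])
J_exact = analytic_jacobian(point)
J_num = numerical_jacobian(f_vec, point)

# The two matrices should agree up to finite-difference error.
print(np.max(np.abs(J_exact - J_num)))
```

The loop perturbs one input at a time, so each pass fills one column of the Jacobian. This costs two function evaluations per input dimension, which is fine for a sanity check but too slow for large models.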

## 3. Jacobian in Machine Learning

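- The Jacobian appears in:
  - Backpropagation
  - Change of variables
  - Sensitivity analysis
- Each layer’s output depends on the previous layer’s output -> the network’s Jacobian is a chain (product) of per-layer Jacobians
- Backprop = applying the chain rule from output to input, using vector-Jacobian products instead of building each full Jacobian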
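
The chain of Jacobians described above can be sketched for a tiny two-layer network. Everything here is an illustrative assumption rather than the author's code: the weights `W1` and `W2` are random, the hidden activation is `tanh`, and the function names are hypothetical. The first function multiplies the full per-layer Jacobians; the second propagates a vector backward, which is closer to what backprop does.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))  # hypothetical layer-1 weights: R^2 -> R^3
W2 = rng.normal(size=(1, 3))  # hypothetical layer-2 weights: R^3 -> R^1

def forward(x):
    """y = W2 @ tanh(W1 @ x)."""
    h = np.tanh(W1 @ x)
    return W2 @ h

def chain_rule_jacobian(x):
    """Full Jacobian dy/dx as a product of per-layer Jacobians."""
    z = W1 @ x
    J_hidden = np.diag(1 - np.tanh(z) ** 2) @ W1  # dh/dx, shape (3, 2)
    J_output = W2                                 # dy/dh, shape (1, 3)
    return J_output @ J_hidden                    # dy/dx, shape (1, 2)

def vjp(x, upstream):
    """Vector-Jacobian product: push `upstream` from output back to input."""
    z = W1 @ x
    g = upstream @ W2               # through layer 2, shape (3,)
    g = g * (1 - np.tanh(z) ** 2)   # through tanh, elementwise (no diag matrix)
    return g @ W1                   # through layer 1, shape (2,)

x0 = np.array([0.5, -1.0])
print(chain_rule_jacobian(x0))
print(vjp(x0, np.array([1.0])))
```

With a scalar output and `upstream = [1.0]`, the vector-Jacobian product equals the single row of the full Jacobian. The backward pass never forms the diagonal `tanh` Jacobian explicitly, which is where the efficiency comes from.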


## 4. Hessian Matrix

For a scalar-valued function:
$$
f(x_1, x_2, \dots, x_n)
$$

The Hessian is the $n \times n$ matrix of second-order partial derivatives:
$$
H =
\begin{bmatrix}
\frac{\partial^2 f}{\partial x_1^2} &
\frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots \\
\frac{\partial^2 f}{\partial x_2 \partial x_1} &
\frac{\partial^2 f}{\partial x_2^2} & \cdots \\
\vdots & \vdots & \ddots
\end{bmatrix}
$$

- The Hessian captures **curvature**
- It is symmetric whenever the second partial derivatives are continuous
- Used in second-order optimization


### Example

Let:
$$
f(x,y) = x^2 + y^2
$$

Second derivatives:
$$
\frac{\partial^2 f}{\partial x^2} = 2, \quad
\frac{\partial^2 f}{\partial y^2} = 2
$$

$$
\frac{\partial^2 f}{\partial x \partial y} = 0
$$

Hessian:
$$
H =
\begin{bmatrix}
2 & 0 \\
0 & 2
\end{bmatrix}
$$


In [None]:
# Hessian of f(x, y) = x^2 + y^2 (constant, so the function takes no arguments)

def hessian():
  return np.array([
      [2, 0],
      [0, 2]
  ])

hessian()

array([[2, 0],
       [0, 2]])
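
When second derivatives are tedious to derive by hand, a finite-difference approximation gives a quick check. The following is a minimal sketch under stated assumptions: `f_scalar`, `numerical_hessian`, and the step size `eps` are names and values chosen for illustration, and the four-point central-difference stencil is just one common choice.

```python
import numpy as np

def f_scalar(v):
    """The example function f(x, y) = x^2 + y^2."""
    return v[0] ** 2 + v[1] ** 2

def numerical_hessian(func, v, eps=1e-4):
    """Approximate H[i, j] = d^2 func / (dv_i dv_j) with central differences."""
    v = np.asarray(v, dtype=float)
    n = v.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i = np.zeros(n)
            e_i[i] = eps
            e_j = np.zeros(n)
            e_j[j] = eps
            H[i, j] = (func(v + e_i + e_j) - func(v + e_i - e_j)
                       - func(v - e_i + e_j) + func(v - e_i - e_j)) / (4 * eps ** 2)
    return H

H_num = numerical_hessian(f_scalar, np.array([1.0, 2.0]))
print(np.round(H_num, 4))
```

The result should be close to the analytic Hessian above regardless of the evaluation point, because the second derivatives of this function are constant.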

## 5. Why Curvature Matters

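- Flat regions (small gradients and low curvature) -> slow learning
- Sharp curvature -> unstable updates unless the step size is small
- At a critical point (where the gradient is zero), the Hessian eigenvalues indicate:
  - All positive (positive definite) -> local minimum
  - All negative (negative definite) -> local maximum
  - Mixed signs -> saddle point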
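
The eigenvalue test above can be sketched in a few lines. This is an illustrative helper, not part of the original notebook: the name `classify_critical_point` and the tolerance `tol` are assumptions, and the function assumes its input is the symmetric Hessian at a point where the gradient is already zero.

```python
import numpy as np

def classify_critical_point(H, tol=1e-10):
    """Classify a critical point from the eigenvalues of its symmetric Hessian."""
    eigenvalues = np.linalg.eigvalsh(H)  # eigvalsh is meant for symmetric matrices
    if np.all(eigenvalues > tol):
        return "local minimum"
    if np.all(eigenvalues < -tol):
        return "local maximum"
    if np.any(eigenvalues > tol) and np.any(eigenvalues < -tol):
        return "saddle point"
    return "inconclusive"  # some eigenvalue is (near) zero

H_bowl = np.array([[2.0, 0.0], [0.0, 2.0]])     # f = x^2 + y^2
H_saddle = np.array([[2.0, 0.0], [0.0, -2.0]])  # f = x^2 - y^2
H_dome = np.array([[-2.0, 0.0], [0.0, -2.0]])   # f = -(x^2 + y^2)

for name, H in [("bowl", H_bowl), ("saddle", H_saddle), ("dome", H_dome)]:
    print(name, "->", classify_critical_point(H))
```

The "inconclusive" branch matters in practice: when an eigenvalue is zero, the second-order test alone cannot decide the type of the critical point.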


## 6. Newton’s Method

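Update rule, with the Hessian evaluated at the current iterate:
$$
\mathbf{w}_{t+1} =
\mathbf{w}_t - H(\mathbf{w}_t)^{-1} \nabla f(\mathbf{w}_t)
$$

- Uses curvature information to rescale the gradient step
- Faster convergence near a minimum (typically quadratic, versus linear for gradient descent)
- Expensive for large models: with $n$ parameters, storing $H$ takes $O(n^2)$ memory and solving the linear system takes $O(n^3)$ time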


In [None]:
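# Newton's Method in 1D

def f(w):
  return (w - 3) ** 2

def grad(w):
  # First derivative of f
  return 2 * (w - 3)

def hess(w):
  # Second derivative of f (constant for a quadratic)
  return 2

w = 0.0

# f is quadratic, so the first Newton step already lands on the minimizer w = 3;
# the remaining iterations leave w unchanged.
for _ in range(5):
  w = w - grad(w) / hess(w)

w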



3.0
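
The same update extends to several variables. The sketch below applies it to a quadratic $f(\mathbf{w}) = \tfrac{1}{2}\mathbf{w}^\top A \mathbf{w} - \mathbf{b}^\top \mathbf{w}$; the matrix `A`, the vector `b`, and the function names are hypothetical values picked for illustration. It solves the linear system $H \mathbf{d} = \nabla f$ with `np.linalg.solve` rather than forming $H^{-1}$ explicitly, which is the usual numerically safer choice.

```python
import numpy as np

# Hypothetical symmetric positive definite matrix and vector for the quadratic.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

def grad_quadratic(w):
    """Gradient of 0.5 * w^T A w - b^T w."""
    return A @ w - b

def hess_quadratic(w):
    """Hessian of the quadratic; constant and equal to A."""
    return A

w_newton = np.zeros(2)
for _ in range(3):
    # Solve H d = grad for the Newton step d instead of inverting H.
    step = np.linalg.solve(hess_quadratic(w_newton), grad_quadratic(w_newton))
    w_newton = w_newton - step

print(w_newton)
print(np.linalg.norm(grad_quadratic(w_newton)))  # gradient norm at the result
```

As in the 1D case, the objective is quadratic, so the first step already reaches the minimizer and later steps change nothing. On non-quadratic objectives the Hessian changes between iterations and several steps are generally needed.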

## 7. Key Takeaways

- Jacobian = first-order behavior of vector functions
- Hessian = second-order curvature
- Backprop uses Jacobians
- Second-order methods use Hessians
- Curvature explains optimization difficulty
