# Backpropagation via Matrix Calculus

We derive the matrix-calculus results needed to implement a reverse-mode autodiff engine, then use that engine to implement and verify backpropagation across common layers and losses:
- Derive vector- and matrix-form gradients using differentials, the trace trick, and chain rules
- Implement a reverse-mode autodiff core (vector-Jacobian products)
- Derive and code backprop for: affine $\rightarrow$ activation $\rightarrow$ softmax-CE; MSE regression; L2 regularization; BatchNorm (stretch)
- Verify gradients with high-precision finite differences on randomized cases
- Show performance wins of reverse vs forward mode on wide nets (empirical scaling)

## Setup

In [1]:
import numpy as np
import matplotlib.pyplot as plt
np.set_printoptions(precision=3)

## Matrix Calculus Basics

**Intro**
- $\Delta$: big, finite step (difference)
- $d$: tiny, infinitesimal step (differential)
- $\nabla$: operator that gives you the gradient (vector of partials)

**Differentials**

$d(\cdot)$ means a differential, i.e. the infinitesimal change in a function when its inputs change a little. For a scalar function $f : \mathbb{R}^n \rightarrow \mathbb{R}$,

$$df(x) = \nabla f(x)^T dx, \qquad f(x + \Delta x) - f(x) \approx \nabla f(x)^T \Delta x \ \text{ for small } \Delta x$$

where:
- $dx$ is the vector of infinitesimal input changes
- $\nabla f(x)$ is the gradient (rate of change of $f$ with respect to each coordinate)

So $d f(x)$ is the infinitesimal change in the output, predicted linearly from the gradient and input change.
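
To make the linear prediction concrete, the sketch below compares the actual change $f(x + \Delta x) - f(x)$ with $\nabla f(x)^T \Delta x$ for shrinking steps. The test function `f`, its hand-written gradient `grad_f`, and the step sizes are illustrative assumptions, not part of the derivation above.

```python
import numpy as np

def f(x):
    # Hypothetical scalar test function: f(x) = sum(x_i^2) + sin(x_0)
    return np.sum(x**2) + np.sin(x[0])

def grad_f(x):
    # Hand-derived gradient of the function above
    g = 2.0 * x
    g[0] += np.cos(x[0])
    return g

rng = np.random.default_rng(0)
x = rng.normal(size=4)
direction = rng.normal(size=4)

errors = []
for eps in (1e-1, 1e-2, 1e-3):
    delta_x = eps * direction
    actual = f(x + delta_x) - f(x)        # finite change in the output
    predicted = grad_f(x) @ delta_x       # first-order (differential) prediction
    errors.append(abs(actual - predicted))
    print(f"eps={eps:.0e}  actual={actual:+.6f}  predicted={predicted:+.6f}  error={errors[-1]:.2e}")
```

The gap between the two columns should shrink roughly like `eps**2`, which is the second-order term the differential ignores.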

Consider the scalar function $f(x, y) = x^T A y$ and see how $f$ changes when $x \mapsto x + dx$ and $y \mapsto y + dy$.

$$
\begin{align}
d f(x, y) &= d(x^T Ay) \\
&= (x + dx)^T A (y + dy) - x^T Ay \\
&= dx^T Ay + x^T Ady + dx^T Ady \\
&= (Ay)^T dx + (A^T x)^T dy
\end{align}
$$

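We drop $dx^T A\,dy$ because it is the product of two infinitesimals, so it is second order and negligible next to the first-order terms. The last line also uses the fact that a scalar equals its own transpose: $dx^T A y = (Ay)^T dx$ and $x^T A\,dy = (A^T x)^T dy$.

So the notation $d(x^T Ay) = (Ay)^T dx + (A^T x)^T dy$ is a compact way of saying: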
- the gradient wrt $x$ is $Ay$
- the gradient wrt $y$ is $A^T x$
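
The two gradients read off above can be checked numerically. This is a minimal sketch using central differences; the helper `numerical_grad`, the shapes, and the step size `h` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4
A = rng.normal(size=(m, n))
x = rng.normal(size=m)
y = rng.normal(size=n)

def f(x, y):
    # Bilinear form f(x, y) = x^T A y
    return x @ A @ y

def numerical_grad(fn, v, h=1e-6):
    # Central-difference gradient of a scalar function of a vector
    g = np.zeros_like(v)
    for i in range(v.size):
        e = np.zeros_like(v)
        e[i] = h
        g[i] = (fn(v + e) - fn(v - e)) / (2 * h)
    return g

grad_x_fd = numerical_grad(lambda v: f(v, y), x)
grad_y_fd = numerical_grad(lambda v: f(x, v), y)

x_matches = np.allclose(grad_x_fd, A @ y, atol=1e-6)      # gradient wrt x is A y
y_matches = np.allclose(grad_y_fd, A.T @ x, atol=1e-6)    # gradient wrt y is A^T x
print("grad wrt x matches A y:  ", x_matches)
print("grad wrt y matches A^T x:", y_matches)
```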

**Trace trick**

The trace of a square matrix is the sum of the diagonal elements.

$$\text{tr}(M) = \sum_i M_{ii}$$

The trace is cyclic, meaning $\text{tr}(ABC) = \text{tr}(BCA) = \text{tr}(CAB)$ whenever the products are defined, and linear, meaning $\text{tr}(A + B) = \text{tr}(A) + \text{tr}(B)$.
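
Both properties are easy to sanity-check in NumPy. The shapes below are arbitrary assumptions; note that the three cyclic products are square matrices of different sizes yet share the same trace.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3))
B = rng.normal(size=(3, 4))
C = rng.normal(size=(4, 2))

# Cyclic property: each product is square, but of a different size
t_abc = np.trace(A @ B @ C)   # 2 x 2
t_bca = np.trace(B @ C @ A)   # 3 x 3
t_cab = np.trace(C @ A @ B)   # 4 x 4
print(t_abc, t_bca, t_cab)

# Linearity on two square matrices of the same shape
M = rng.normal(size=(3, 3))
N = rng.normal(size=(3, 3))
lin_lhs = np.trace(M + N)
lin_rhs = np.trace(M) + np.trace(N)
print(lin_lhs, lin_rhs)
```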

Gradients are sometimes easiest to express in terms of traces. The differential of a scalar function $f(X)$ with matrix input is often written:

$$
\begin{align}
df &= \sum_{i, j} \frac{\partial f}{\partial X_{ij}} dX_{ij} \\
&= \text{tr}\left(\left(\frac{\partial f}{\partial X}\right)^T dX\right) \\
&= \text{tr}(G^T dX)
\end{align}
$$

where $G$ is the gradient $\frac{\partial f}{\partial X}$.

For example, if we take the scalar function $f(X) = \text{tr}(A^T X)$, then its differential is $df = \text{tr}(A^T dX)$ and the gradient matrix with respect to $X$ is $A$.
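
The identity $\text{tr}(G^T dX) = \sum_{i,j} G_{ij}\, dX_{ij}$ and the example above can be sketched numerically as follows. The shapes and the size of the perturbation `dX` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 2))
X = rng.normal(size=(3, 2))
dX = 1e-6 * rng.normal(size=(3, 2))   # small perturbation standing in for dX

def f(X):
    # f(X) = tr(A^T X)
    return np.trace(A.T @ X)

G = A                                  # claimed gradient of f with respect to X
df_trace = np.trace(G.T @ dX)          # tr(G^T dX)
df_sum = np.sum(G * dX)                # sum_ij G_ij dX_ij, the same number
df_actual = f(X + dX) - f(X)           # actual change (f is linear in X)

print(df_trace, df_sum, df_actual)
```

In practice `np.sum(G * dX)` is the cheaper way to evaluate the trace form, since it avoids building the full matrix product.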

**Quadratic forms**
- $f(X, W) = \frac{1}{2}\|XW - Y\|^2_F \implies \nabla_W f = X^T (XW - Y), \quad \nabla_X f = (XW - Y)W^T$
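
A finite-difference check of both least-squares gradients might look like the sketch below. The helper `numerical_grad`, the shapes, and the step size are assumptions, not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 5, 3, 2
X = rng.normal(size=(n, d))
W = rng.normal(size=(d, k))
Y = rng.normal(size=(n, k))

def loss(X, W):
    # f(X, W) = 0.5 * ||XW - Y||_F^2
    R = X @ W - Y
    return 0.5 * np.sum(R**2)

def numerical_grad(fn, M, h=1e-5):
    # Central-difference gradient of a scalar function of a matrix
    G = np.zeros_like(M)
    for idx in np.ndindex(*M.shape):
        E = np.zeros_like(M)
        E[idx] = h
        G[idx] = (fn(M + E) - fn(M - E)) / (2 * h)
    return G

R = X @ W - Y
grad_W = X.T @ R          # analytic gradient wrt W
grad_X = R @ W.T          # analytic gradient wrt X

grad_W_fd = numerical_grad(lambda M: loss(X, M), W)
grad_X_fd = numerical_grad(lambda M: loss(M, W), X)

w_err = np.max(np.abs(grad_W - grad_W_fd))
x_err = np.max(np.abs(grad_X - grad_X_fd))
print("max abs error wrt W:", w_err)
print("max abs error wrt X:", x_err)
```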

**Softmax + CE (row-wise)**
- $s = \text{softmax}(z)$ with $s_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$ and $\ell = -\sum_i y_i \log s_i$, where $\sum_i y_i = 1$ (e.g. one-hot $y$) $\implies \nabla_z \ell = s - y$
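
The $s - y$ result can be checked on a single row of logits. This is a sketch under assumptions: a one-hot target, a max-shifted softmax for numerical stability, and central differences with step `h`.

```python
import numpy as np

def softmax(z):
    # Subtracting the max leaves the result unchanged and avoids overflow
    shifted = z - np.max(z)
    e = np.exp(shifted)
    return e / np.sum(e)

def cross_entropy(z, y):
    # l = -sum_i y_i log s_i with s = softmax(z)
    return -np.sum(y * np.log(softmax(z)))

rng = np.random.default_rng(0)
num_classes = 5
z = rng.normal(size=num_classes)
y = np.zeros(num_classes)
y[2] = 1.0                      # one-hot target, so sum(y) == 1

analytic = softmax(z) - y       # claimed gradient wrt the logits

h = 1e-5
numeric = np.zeros_like(z)
for i in range(num_classes):
    e_i = np.zeros_like(z)
    e_i[i] = h
    numeric[i] = (cross_entropy(z + e_i, y) - cross_entropy(z - e_i, y)) / (2 * h)

max_abs_err = np.max(np.abs(analytic - numeric))
print("max abs error:", max_abs_err)
```

For a batch, the same formula applies independently to each row of the logit matrix.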

**Elementwise activations**
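- $\nabla_X \sum_{i,j} \phi(X_{ij}) = \phi'(X)$, with $\phi'$ applied elementwise; given an upstream gradient $G_Y$ for $Y = \phi(X)$, this becomes $G_X = G_Y \odot \phi'(X)$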
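
As an illustration, take $\phi = \tanh$, an assumed choice with $\phi'(x) = 1 - \tanh^2(x)$. The sketch checks the plain sum and also the weighted sum $\sum_{i,j} (G_Y)_{ij}\, \phi(X_{ij})$, whose gradient is the elementwise product $G_Y \odot \phi'(X)$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
G_Y = rng.normal(size=(3, 4))     # hypothetical upstream gradient for Y = phi(X)

def phi(X):
    return np.tanh(X)

def phi_prime(X):
    return 1.0 - np.tanh(X) ** 2

def numerical_grad(fn, M, h=1e-5):
    # Central-difference gradient of a scalar function of a matrix
    G = np.zeros_like(M)
    for idx in np.ndindex(*M.shape):
        E = np.zeros_like(M)
        E[idx] = h
        G[idx] = (fn(M + E) - fn(M - E)) / (2 * h)
    return G

# Case 1: f(X) = sum_ij phi(X_ij), gradient phi'(X)
sum_err = np.max(np.abs(phi_prime(X) - numerical_grad(lambda M: np.sum(phi(M)), X)))

# Case 2: L(X) = sum_ij G_Y_ij * phi(X_ij), gradient G_Y * phi'(X) (elementwise)
G_X = G_Y * phi_prime(X)
weighted_err = np.max(np.abs(G_X - numerical_grad(lambda M: np.sum(G_Y * phi(M)), X)))

print("plain sum error:   ", sum_err)
print("weighted sum error:", weighted_err)
```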

**Chain rules**
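- Matrix chain: if $dL = \text{tr}(G_Z^T dZ)$ with $Z = f(Y)$ and $Y = g(X)$, then $\text{vec}(G_Y) = J_f^T\, \text{vec}(G_Z)$ and $\text{vec}(G_X) = J_g^T\, \text{vec}(G_Y)$, where $J_f = \partial\, \text{vec}(Z) / \partial\, \text{vec}(Y)$ and $J_g = \partial\, \text{vec}(Y) / \partial\, \text{vec}(X)$
- VJP: given upstream $G$, compute $J^T \text{vec}(G)$ (equivalently the row vector $\text{vec}(G)^T J$) without forming $J$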
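
To show what "without forming $J$" buys, the sketch below computes the VJP of $Y = XW$ with two small matrix products and compares it against explicitly built Jacobians. The function name `vjp_matmul` and the shapes are hypothetical, and the Kronecker forms assume NumPy's row-major flattening (`ravel`).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 2, 3, 4
X = rng.normal(size=(n, d))
W = rng.normal(size=(d, k))
G_Y = rng.normal(size=(n, k))     # upstream gradient dL/dY for Y = X @ W

def vjp_matmul(G_Y, X, W):
    # VJP of Y = X @ W: returns (G_X, G_W) without building any Jacobian
    return G_Y @ W.T, X.T @ G_Y

G_X, G_W = vjp_matmul(G_Y, X, W)

# Explicit Jacobians under row-major flattening, for comparison only:
#   d vec(Y) / d vec(X) = kron(I_n, W^T), shape (n*k, n*d)
#   d vec(Y) / d vec(W) = kron(X, I_k),   shape (n*k, d*k)
J_X = np.kron(np.eye(n), W.T)
J_W = np.kron(X, np.eye(k))

G_X_explicit = (J_X.T @ G_Y.ravel()).reshape(n, d)
G_W_explicit = (J_W.T @ G_Y.ravel()).reshape(d, k)

print("G_X matches:", np.allclose(G_X, G_X_explicit))
print("G_W matches:", np.allclose(G_W, G_W_explicit))
print("Jacobian entries:", J_X.size + J_W.size, "vs VJP outputs:", G_X.size + G_W.size)
```

The explicit Jacobians grow with the product of input and output sizes, while the VJP only ever holds arrays the size of `X`, `W`, and `G_Y`.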