# Backpropagation via Matrix Calculus

We derive gradients with matrix calculus and implement a reverse-mode autodiff engine, then use it to implement and verify backpropagation across common layers and losses:
- Derive vector and matrix form gradients using differentials, the trace trick, and chain rules
- Implement a reverse-mode autodiff core (vector-Jacobian products)
- Derive and code backprop for: affine $\rightarrow$ activation $\rightarrow$ softmax-CE; MSE regression; L2 regularization
- Verify gradients with high-precision finite differences on randomized cases
- Show performance wins of reverse vs forward mode on wide nets (empirical scaling)

## Setup

In [1]:
import numpy as np
import matplotlib.pyplot as plt
np.set_printoptions(precision=3)

## Matrix Calculus Basics

**Intro**
- $\Delta$: finite step (difference)
- $d$: tiny, infinitesimal step (differential)
- $\partial$: derivative with respect to one variable (partial)
- $\nabla$: gradient operator (vector of partials)

**Differentials**

$d(\cdot)$ means a differential, i.e. the infinitesimal change in a function when its inputs change a little. For a scalar function $f : \mathbb{R}^n \rightarrow \mathbb{R}$,

$$df(x) = \nabla f(x)^T dx, \qquad f(x + \Delta x) - f(x) \approx \nabla f(x)^T \Delta x \ \text{ for small } \Delta x$$

where:
- $dx$ is the vector of infinitesimal input changes
- $\nabla f(x)$ is the gradient (rate of change of $f$ with respect to each coordinate)

So $d f(x)$ is the infinitesimal change in the output, predicted linearly from the gradient and input change.

Let's consider the scalar function $f(x, y) = x^T Ay$. Let's see how $f$ changes when $x \mapsto x + dx, y \mapsto y + dy$.

$$
\begin{aligned}
d f(x, y) &= d(x^T Ay) \\
&= (x + dx)^T A (y + dy) - x^T Ay \\
&= dx^T Ay + x^T A\,dy + dx^T A\,dy \\
&= (Ay)^T dx + (A^T x)^T dy
\end{aligned}
$$

We drop $dx^T Ady$ because it is the product of two infinitesimals.

So the notation $d(x^T Ay) = (Ay)^T dx + (A^T x)^T dy$ is a compact way of saying:
- the gradient wrt $x$ is $Ay$
- the gradient wrt $y$ is $A^T x$
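
As a quick sanity check of the two gradients above, the following sketch compares the linear prediction against the actual change in $f$ for small random perturbations. The sizes, the seed, and the names `grad_x`, `grad_y`, and `second_order` are illustrative assumptions, not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
x = rng.standard_normal(3)
y = rng.standard_normal(4)

def f(x, y):
    """Bilinear form f(x, y) = x^T A y."""
    return x @ A @ y

grad_x = A @ y      # gradient with respect to x
grad_y = A.T @ x    # gradient with respect to y

# Small finite perturbations stand in for the differentials dx and dy.
dx = 1e-6 * rng.standard_normal(3)
dy = 1e-6 * rng.standard_normal(4)

actual = f(x + dx, y + dy) - f(x, y)
predicted = grad_x @ dx + grad_y @ dy
second_order = dx @ A @ dy   # the term dropped in the derivation

print("actual change:   ", actual)
print("linear estimate: ", predicted)
print("dropped term:    ", second_order)
```

The dropped term scales with the product of the two step sizes, so it should be many orders of magnitude smaller than the first-order terms here.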

**Trace trick**

The trace of a square matrix is the sum of the diagonal elements.

$$\text{tr}(M) = \sum_i M_{ii}$$

The trace has a cyclic property meaning $\text{tr}(ABC) = \text{tr}(BCA) = \text{tr}(CAB)$ and linearity meaning $\text{tr}(A + B) = \text{tr}(A) + \text{tr}(B)$.

Gradients are sometimes easiest to express in terms of traces. The differential of a scalar function $f(X)$ with matrix input is often written:

$$
\begin{aligned}
df &= \sum_{i, j} \frac{\partial f}{\partial X_{ij}} dX_{ij} \\
&= \text{tr}\left(\left(\frac{\partial f}{\partial X}\right)^T dX\right) \\
&= \text{tr}(G^T dX)
\end{aligned}
$$

where $G$ is the gradient $\frac{\partial f}{\partial X}$.

For example, if we take the scalar function $f(X) = \text{tr}(A^T X)$, then its differential is $df = \text{tr}(A^T dX)$ and the gradient matrix with respect to $X$ is $A$.
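
The identity $\text{tr}(G^T dX) = \sum_{i,j} G_{ij} \, dX_{ij}$ and the cyclic property can be checked numerically. This is a minimal sketch with arbitrary shapes and random matrices chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# tr(G^T dX) equals the elementwise inner product of G and dX.
G = rng.standard_normal((3, 5))
dX = rng.standard_normal((3, 5))
trace_form = np.trace(G.T @ dX)
sum_form = np.sum(G * dX)

# Cyclic property: tr(ABC) = tr(BCA) = tr(CAB) for compatible shapes.
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((4, 2))
t_abc = np.trace(A @ B @ C)
t_bca = np.trace(B @ C @ A)
t_cab = np.trace(C @ A @ B)

print(trace_form, sum_form)
print(t_abc, t_bca, t_cab)
```

Note that the three products have different shapes (2x2, 3x3, 4x4), yet their traces agree.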

**Quadratic forms**

Let $X \in \mathbb{R}^{N \times d}$, $W \in \mathbb{R}^{d \times k}$, and $Y \in \mathbb{R}^{N \times k}$. Define the loss:

$$f(X, W) = \frac{1}{2} \|XW - Y\|_F^2 = \frac{1}{2}\text{tr}((XW - Y)^T (XW - Y))$$

Differential:
$$
\begin{aligned}
df &= \text{tr}((XW - Y)^T (dX \, W + X \, dW)) \\
&= \text{tr}(((XW - Y)W^T)^T dX) + \text{tr}((X^T(XW - Y))^T dW)
\end{aligned}
$$

Gradients:
$$
\nabla_W f = X^T(XW - Y), \qquad \nabla_X f = (XW - Y)W^T
$$
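
The two gradients above can be verified with central finite differences. The helper `numeric_grad` below is a hypothetical utility written for this check (it is not from any library), and the shapes and seed are arbitrary.

```python
import numpy as np

def numeric_grad(f, M, eps=1e-6):
    """Central-difference gradient of the scalar f() with respect to array M.

    f takes no arguments and reads M in place (hypothetical helper).
    """
    grad = np.zeros_like(M)
    for idx in np.ndindex(*M.shape):
        old = M[idx]
        M[idx] = old + eps
        f_plus = f()
        M[idx] = old - eps
        f_minus = f()
        M[idx] = old                      # restore the entry
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(2)
N, d, k = 5, 3, 2
X = rng.standard_normal((N, d))
W = rng.standard_normal((d, k))
Y = rng.standard_normal((N, k))

def loss():
    """f(X, W) = 0.5 * ||XW - Y||_F^2, reading the current X and W."""
    R = X @ W - Y
    return 0.5 * np.sum(R * R)

R = X @ W - Y          # residual
grad_W = X.T @ R       # analytic gradient with respect to W
grad_X = R @ W.T       # analytic gradient with respect to X

err_W = np.max(np.abs(grad_W - numeric_grad(loss, W)))
err_X = np.max(np.abs(grad_X - numeric_grad(loss, X)))
print("max abs error (W):", err_W)
print("max abs error (X):", err_X)
```

Because the loss is quadratic, central differences have no truncation error here, so any remaining discrepancy should come from floating-point rounding alone.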

**Softmax + CE (row-wise)**

For a row of logits (raw scores) $z \in \mathbb{R}^k$, define the softmax:
$$
s = \text{softmax}(z)
$$

$$
s_i = \frac{e^{z_i}}{\sum_j e^{z_j}}
$$

Given a target distribution $y \in \mathbb{R}^k$ with $\sum_i y_i = 1$ (often a one-hot encoding of a class label), the
cross-entropy loss is
$$
\ell(z, y) = -\sum_i y_i \log s_i = -y^T \log s
$$

Taking the logarithm of the softmax gives

$$
\log s_i = z_i - \log \sum_j e^{z_j}
$$

Substituting this into the loss and using $\sum_i y_i = 1$:
$$
\ell(z, y)
= -y^Tz + (\sum_i y_i) (\log \sum_j e^{z_j})
= -y^Tz + \log \sum_j e^{z_j}
$$

Differential:
$$
d\ell = -y^T dz + \frac{\sum_j e^{z_j} \, dz_j}{\sum_j e^{z_j}} = -y^T dz + s^T dz = (s - y)^T dz
$$

Gradient:
$$
\nabla_z \ell = s - y
$$
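
The result $\nabla_z \ell = s - y$ can be checked against central finite differences for a single row of logits. This is a sketch under stated assumptions: the function names `softmax` and `cross_entropy`, the one-hot target, and the max-shift used for numerical stability are choices made here, not something fixed by the derivation.

```python
import numpy as np

def softmax(z):
    """Softmax of a 1-D logit vector, shifted by max(z) for stability."""
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def cross_entropy(z, y):
    """Cross-entropy -y^T log softmax(z), using the log-sum-exp form."""
    m = np.max(z)
    log_sum_exp = m + np.log(np.sum(np.exp(z - m)))
    return -(y @ z) + np.sum(y) * log_sum_exp

rng = np.random.default_rng(3)
k = 5
z = rng.standard_normal(k)
y = np.zeros(k)
y[2] = 1.0                      # one-hot target (class 2 chosen arbitrarily)

analytic = softmax(z) - y       # gradient from the derivation

eps = 1e-6
numeric = np.zeros(k)
for i in range(k):
    e_i = np.zeros(k)
    e_i[i] = eps
    numeric[i] = (cross_entropy(z + e_i, y) - cross_entropy(z - e_i, y)) / (2 * eps)

print("analytic:", analytic)
print("numeric: ", numeric)
```

Since both $s$ and $y$ sum to one, the entries of the gradient sum to zero, which is a cheap extra check.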

**Elementwise activations**

Let $\phi$ be a scalar activation function (like ReLU, sigmoid, tanh, etc.) that is applied to each element of matrix $X \in \mathbb{R}^{m \times n}$ independently.

$$[\phi(X)]_{ij} = \phi(X_{ij})$$

Suppose the loss $L$ depends on $\phi(X)$, and you already know the upstream gradient
$$
G = \nabla_{\phi(X)} L \in \mathbb{R}^{m \times n}
$$

Then by the chain rule (elementwise), the gradient of $L$ with respect to $X$ is:
$$
\nabla_X L = G \odot \phi'(X)
$$
where $\odot$ denotes elementwise multiplication.
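
A minimal sketch of this rule for $\phi = \tanh$: the loss $L = \sum_{ij} C_{ij} \tanh(X_{ij})$ is an assumption chosen so that the upstream gradient is simply $G = C$, which makes the backward step easy to compare against finite differences.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((3, 4))
C = rng.standard_normal((3, 4))   # fixed weights, so dL/dphi(X) = C

def loss(X):
    """L = sum_ij C_ij * tanh(X_ij)."""
    return np.sum(C * np.tanh(X))

# Backward step: upstream gradient times the elementwise derivative.
G = C
grad_X = G * (1.0 - np.tanh(X) ** 2)

# Central finite differences, one entry at a time.
eps = 1e-6
numeric = np.zeros_like(X)
for idx in np.ndindex(*X.shape):
    E = np.zeros_like(X)
    E[idx] = eps
    numeric[idx] = (loss(X + E) - loss(X - E)) / (2 * eps)

print("max abs error:", np.max(np.abs(grad_X - numeric)))
```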

Special case: if the loss is a sum over all entries of $\phi(X)$, i.e.
$$
L = \sum_{ij} \phi(X_{ij})
$$

then $G = \mathbf{1}$ and:
$$
\nabla_X L = \phi'(X)
$$

Common activations and their derivatives:
- ReLU:

$$\phi(x) = \max(0, x), \qquad \phi'(x) = \mathbf{1}[x > 0]$$

- LeakyReLU ($\alpha$):

$$\phi(x) = \begin{cases} x & x > 0 \\ \alpha x & x \le 0 \end{cases}, \qquad
\phi'(x) = \begin{cases} 1 & x > 0 \\ \alpha & x \le 0 \end{cases}$$

- Sigmoid $\sigma(x)$:

$$\phi(x) = \frac{1}{1 + e^{-x}}, \qquad \phi'(x) = \sigma(x)(1 - \sigma(x))$$

- $\tanh$:

$$\phi(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}, \qquad \phi'(x) = 1 - \tanh^2(x)$$

- Softplus:

$$\phi(x) = \log(1 + e^x), \qquad \phi'(x) = \sigma(x)$$
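
The activation/derivative pairs listed above could be written in NumPy as follows. This is an illustrative sketch: the dictionary layout, the value `alpha = 0.1`, and the test points are assumptions, and the test points deliberately avoid $x = 0$, where ReLU and LeakyReLU are not differentiable.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

alpha = 0.1  # LeakyReLU slope for x <= 0 (arbitrary choice)

# name -> (phi, phi') pairs matching the formulas above
activations = {
    "relu": (lambda x: np.maximum(0.0, x),
             lambda x: (x > 0).astype(float)),
    "leaky_relu": (lambda x: np.where(x > 0, x, alpha * x),
                   lambda x: np.where(x > 0, 1.0, alpha)),
    "sigmoid": (sigmoid,
                lambda x: sigmoid(x) * (1.0 - sigmoid(x))),
    "tanh": (np.tanh,
             lambda x: 1.0 - np.tanh(x) ** 2),
    "softplus": (lambda x: np.logaddexp(0.0, x),   # log(1 + e^x)
                 sigmoid),
}

x = np.array([-2.0, -0.5, 0.3, 1.7])   # points away from the kink at 0
eps = 1e-6
max_err = {}
for name, (phi, dphi) in activations.items():
    numeric = (phi(x + eps) - phi(x - eps)) / (2 * eps)
    max_err[name] = np.max(np.abs(numeric - dphi(x)))
    print(f"{name:>10s}: max abs error = {max_err[name]:.2e}")
```

Softplus is written with `np.logaddexp(0.0, x)` rather than `np.log(1 + np.exp(x))` to avoid overflow for large positive inputs.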

**Chain rules**

Suppose we have a scalar loss $L$ that depends on an intermediate variable $Z$,
which itself depends on $Y$, which in turn depends on $X$:

$$
X \xrightarrow{g} Y \xrightarrow{f} Z \xrightarrow{} L
$$

We can always write the differential of the loss with respect to $Z$ as

$$
dL = \text{tr}(G_Z^T dZ)
$$

where $G_Z = \frac{\partial L}{\partial Z}$ is the upstream gradient arriving at $Z$.

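If $Z = f(Y)$ and $Y = g(X)$, then by the multivariate chain rule (treating each array as a flattened vector):

$$
G_Y = J_f^T G_Z, \qquad G_X = J_g^T G_Y = J_g^T J_f^T G_Z
$$

where $J_f = \frac{\partial Z}{\partial Y}$ and $J_g = \frac{\partial Y}{\partial X}$ are the Jacobians of the local maps. The first identity follows from $dL = G_Z^T dZ = G_Z^T J_f \, dY = (J_f^T G_Z)^T dY$.
This form emphasizes that each **Jacobian is transposed** when propagating gradients backwards, and that the transposes are applied in reverse order.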

In practice, the full Jacobian $J$ is rarely formed explicitly because it can be huge.
Instead, we compute the **vector-Jacobian product (VJP)** directly:

$$
G_Y = J^T G_Z
$$

where:
- $J$ is the Jacobian of the local function $Z = f(Y)$
- $G_Z$ is the upstream gradient $\frac{\partial L}{\partial Z}$
- $G_Y$ is the downstream gradient $\frac{\partial L}{\partial Y}$, the result of the VJP

This is exactly what reverse-mode autodiff (backpropagation) implements:
it efficiently pushes $G$ backwards layer by layer without explicitly building the Jacobian.
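
To illustrate the point about not building the Jacobian, the sketch below takes the linear map $Z = YW$ and compares the closed-form VJP $G_Y = G_Z W^T$ against the explicit Jacobian of the flattened arrays. The row-major flattening convention and the use of `np.kron` to build the Jacobian are choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(5)
N, d, k = 4, 3, 2
Y = rng.standard_normal((N, d))
W = rng.standard_normal((d, k))
G_Z = rng.standard_normal((N, k))   # upstream gradient dL/dZ for Z = Y @ W

# VJP in closed form: no Jacobian is ever materialized.
G_Y_vjp = G_Z @ W.T

# Explicit Jacobian of Z.ravel() with respect to Y.ravel() (row-major order).
J = np.kron(np.eye(N), W.T)         # shape (N*k, N*d)
jacobian_ok = np.allclose(J @ Y.ravel(), (Y @ W).ravel())

# The same gradient via the transposed Jacobian.
G_Y_jac = (J.T @ G_Z.ravel()).reshape(N, d)

print("Jacobian shape:", J.shape)
print("Jacobian reproduces Z:", jacobian_ok)
print("max abs difference:", np.max(np.abs(G_Y_vjp - G_Y_jac)))
```

The explicit Jacobian has $N k \times N d$ entries, most of them zero, while the VJP needs only one $N \times k$ by $k \times d$ matrix product.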

- Forward mode: propagate differentials with Jacobian-vector products, $dZ = J \, dY$
- Reverse mode: propagate gradients with vector-Jacobian products, $G_Y = J^T \, G_Z$
- Backprop = chaining many VJPs over a neural network
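
The two modes can be compared on a tiny chain $x \mapsto Y = \tanh(W_1 x) \mapsto Z = W_2 Y \mapsto L = \frac{1}{2}\|Z\|^2$. This network, its sizes, and the variable names are assumptions made for illustration. Forward mode pushes one input direction `dx` through the chain, reverse mode pulls the gradient back, and both should give the same directional change in $L$.

```python
import numpy as np

rng = np.random.default_rng(6)
n, h, m = 4, 6, 3
W1 = rng.standard_normal((h, n))
W2 = rng.standard_normal((m, h))
x = rng.standard_normal(n)

# Forward pass: x -> Y = tanh(W1 x) -> Z = W2 Y -> L = 0.5 * ||Z||^2
Y = np.tanh(W1 @ x)
Z = W2 @ Y
L = 0.5 * Z @ Z

# Forward mode: push a single input direction dx through each local Jacobian.
dx = rng.standard_normal(n)
dY = (1.0 - Y ** 2) * (W1 @ dx)     # JVP through tanh(W1 x)
dZ = W2 @ dY                        # JVP through W2 Y

# Reverse mode: pull the loss gradient back through the transposed Jacobians.
G_Z = Z                             # dL/dZ for L = 0.5 * ||Z||^2
G_Y = W2.T @ G_Z                    # VJP through W2 Y
G_x = W1.T @ ((1.0 - Y ** 2) * G_Y) # VJP through tanh(W1 x)

dL_forward = G_Z @ dZ               # directional derivative via forward mode
dL_reverse = G_x @ dx               # same quantity via the full gradient

print("forward mode:", dL_forward)
print("reverse mode:", dL_reverse)
```

One reverse pass yields the whole gradient `G_x`, whereas forward mode yields one directional derivative per pass, so recovering the full gradient that way would take `n` passes (one per input coordinate).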