# Backpropagation via Matrix Calculus

We go from matrix-calculus derivations to a working reverse-mode autodiff engine, then use that engine to implement and verify backpropagation across common layers and losses:
- Derive vector- and matrix-form gradients using differentials, the trace trick, and chain rules
- Implement a reverse-mode autodiff core (vector-Jacobian products)
- Derive and code backprop for: affine $\rightarrow$ activation $\rightarrow$ softmax-CE; MSE regression; L2 regularization; BatchNorm (stretch)
- Verify gradients with high-precision finite differences on randomized cases
- Show performance wins of reverse vs forward mode on wide nets (empirical scaling)

## Setup

In [1]:
import numpy as np
import matplotlib.pyplot as plt
np.set_printoptions(precision=3)

## Matrix Calculus Basics

**Differentials & trace trick**
- $d(x^T Ay) = (Ay)^T dx + (A^T x)^T dy$
- $d\,\text{tr}(A^T X) = \text{tr}(A^T dX) \implies \frac{\partial}{\partial X}\text{tr}(A^T X) = A$
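
As a quick sanity check of the two differential rules above, the sketch below perturbs the inputs by small random amounts and compares the actual change with the first-order prediction. The shapes, seed, and step size are arbitrary choices for illustration, not values taken from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
n, m = 4, 3
A = rng.standard_normal((n, m))
x = rng.standard_normal(n)
y = rng.standard_normal(m)
dx = 1e-6 * rng.standard_normal(n)  # small perturbations
dy = 1e-6 * rng.standard_normal(m)

# Bilinear form f(x, y) = x^T A y
def f(x, y):
    return x @ A @ y

actual_change = f(x + dx, y + dy) - f(x, y)
predicted_change = (A @ y) @ dx + (A.T @ x) @ dy  # (Ay)^T dx + (A^T x)^T dy
# The two differ only by the second-order term dx^T A dy (plus rounding).
bilinear_err = abs(actual_change - predicted_change)

# Trace rule: tr(A^T X) is linear in X, so its gradient with respect to X is A.
X = rng.standard_normal((n, m))
dX = 1e-6 * rng.standard_normal((n, m))
trace_change = np.trace(A.T @ (X + dX)) - np.trace(A.T @ X)
trace_pred = np.sum(A * dX)  # tr(A^T dX) = sum_ij A_ij dX_ij
trace_err = abs(trace_change - trace_pred)

print(bilinear_err, trace_err)
```

Both errors should be far smaller than the perturbation size, since the only mismatch is the second-order term and floating-point rounding.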

**Quadratic forms**
- $f(X, W) = \frac{1}{2}\|XW - Y\|^2_F \implies \nabla_W f = X^T (XW - Y), \quad \nabla_X f = (XW - Y)W^T$
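
The least-squares gradients above can be checked against central finite differences. This is a minimal sketch: `numerical_grad` is a hypothetical helper written for this example, and the matrix sizes are arbitrary.

```python
import numpy as np

def numerical_grad(f, M, eps=1e-6):
    """Central-difference gradient of the scalar f() with respect to array M.

    M is perturbed in place one entry at a time and restored afterwards,
    so f must read M each time it is called.
    """
    grad = np.zeros_like(M)
    for idx in np.ndindex(*M.shape):
        old = M[idx]
        M[idx] = old + eps
        f_plus = f()
        M[idx] = old - eps
        f_minus = f()
        M[idx] = old  # restore the original entry
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))
W = rng.standard_normal((4, 3))
Y = rng.standard_normal((5, 3))

def loss():
    # f(X, W) = 1/2 ||XW - Y||_F^2
    return 0.5 * np.sum((X @ W - Y) ** 2)

R = X @ W - Y        # residual
grad_W = X.T @ R     # analytic gradient with respect to W
grad_X = R @ W.T     # analytic gradient with respect to X

err_W = np.max(np.abs(grad_W - numerical_grad(loss, W)))
err_X = np.max(np.abs(grad_X - numerical_grad(loss, X)))
print(err_W, err_X)
```

Because the loss is quadratic, the central difference has no truncation error here, so any remaining discrepancy is floating-point rounding.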

**Softmax + CE (row-wise)**
- $\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}, \quad s = \text{softmax}(z), \quad \ell = -\sum_i y_i \log s_i \implies \nabla_z \ell = s - y$ (for targets with $\sum_i y_i = 1$)
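
The softmax cross-entropy gradient can be verified the same way for a single row. The sketch below assumes a one-hot target and uses the usual max-shift for numerical stability; the class count and target index are illustrative choices only.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)  # shifting by the max does not change the result
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(z, y):
    return -np.sum(y * np.log(softmax(z)))

rng = np.random.default_rng(0)
k = 5
z = rng.standard_normal(k)
y = np.zeros(k)
y[2] = 1.0  # one-hot target (assumed for this example)

analytic = softmax(z) - y  # claimed gradient: s - y

eps = 1e-6
numeric = np.array([
    (cross_entropy(z + eps * e_i, y) - cross_entropy(z - eps * e_i, y)) / (2 * eps)
    for e_i in np.eye(k)  # rows of the identity are the unit directions
])

max_err = np.max(np.abs(analytic - numeric))
print(max_err)
```

Since both $s$ and $y$ sum to one, the entries of the gradient sum to zero, which makes a cheap extra check.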

**Elementwise activations**
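- $\nabla_X \sum_{i,j} \phi(X_{ij}) = \phi'(X)$ with $\phi'$ applied elementwise; with an upstream gradient $G$ this becomes $G \odot \phi'(X)$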
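
To make the elementwise rule concrete, the sketch below uses $\phi = \tanh$ as an example activation (an assumption for illustration; any differentiable $\phi$ works the same way). It checks both the plain sum and a weighted sum, where `G` stands in for a hypothetical upstream gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))
G = rng.standard_normal((3, 4))  # hypothetical upstream gradient

def phi(X):
    return np.tanh(X)

def phi_prime(X):
    return 1.0 - np.tanh(X) ** 2  # derivative of tanh

def fd_grad(f, X, eps=1e-6):
    """Central-difference gradient of scalar f at X (helper for this example)."""
    grad = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = eps
        grad[idx] = (f(X + E) - f(X - E)) / (2 * eps)
    return grad

# Gradient of sum(phi(X)) is phi'(X), elementwise.
err_plain = np.max(np.abs(phi_prime(X) - fd_grad(lambda M: np.sum(phi(M)), X)))

# With upstream weights G, the gradient of sum(G * phi(X)) is G * phi'(X).
err_weighted = np.max(
    np.abs(G * phi_prime(X) - fd_grad(lambda M: np.sum(G * phi(M)), X))
)
print(err_plain, err_weighted)
```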

**Chain rules**
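- Matrix chain: $dL = \text{tr}(G_Z^T dZ),\ Z = f(Y),\ Y = g(X) \implies G_Y = J^T_{Y \rightarrow Z} G_Z,\ G_X = J^T_{X \rightarrow Y} G_Y$, where $J_{A \rightarrow B} = \partial\,\text{vec}(B) / \partial\,\text{vec}(A)$ acts on vectorized gradients
- VJP: given upstream $G$, compute $J^T G$ (equivalently the row-vector product $\text{vec}(G)^T J$) without forming $J$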
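
As an illustration of the chain rule and the VJP idea, the sketch below backpropagates through an affine map followed by softmax. The softmax Jacobian is $\text{diag}(s) - s s^T$, so its VJP reduces to $s \odot (g - s^T g)$ and never needs the full matrix. The function name `softmax_vjp`, the shapes, and the random upstream gradient are assumptions made for this example.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_vjp(s, g):
    """J^T g for J = diag(s) - s s^T, computed in O(k) without building J."""
    return s * (g - np.dot(s, g))

rng = np.random.default_rng(0)
n, k = 6, 4
W = rng.standard_normal((k, n))
x = rng.standard_normal(n)
g = rng.standard_normal(k)  # hypothetical upstream gradient with respect to s

# Forward pass: x -> z = W x -> s = softmax(z)
z = W @ x
s = softmax(z)

# Reverse pass: apply one VJP per stage, last stage first.
G_z = softmax_vjp(s, g)  # gradient with respect to z
G_x = W.T @ G_z          # gradient with respect to x (Jacobian of z = W x is W)

# Reference: build the explicit Jacobians and multiply them out.
J_softmax = np.diag(s) - np.outer(s, s)
J_total = J_softmax @ W  # Jacobian of s with respect to x
G_x_ref = J_total.T @ g

vjp_err = np.max(np.abs(G_x - G_x_ref))
print(vjp_err)
```

The reverse pass only ever holds vectors the size of each intermediate value, whereas the reference path materializes a $k \times k$ and a $k \times n$ Jacobian.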