## Block Jacobians and the Chain Rule

Suppose $f\in C^1\left(\mathcal{T}_{{\bf n}_1}\times\mathcal{T}_{{\bf n}_2};\mathcal{T}_{{\bf k}_1}\right)$, $g\in C^1\left(\mathcal{T}_{{\bf m}_1}\times\mathcal{T}_{{\bf m}_2};\mathcal{T}_{{\bf k}_2}\right)$, and $h\in C^1\left(\mathcal{T}_{{\bf k}_1}\times\mathcal{T}_{{\bf k}_2}; \mathcal{T}_{{\bf l}}\right)$. Then the function $\varphi(\mathcal{V}, \mathcal{W},\mathcal{X},\mathcal{Y}) = h(f(\mathcal{V},\mathcal{W}),g(\mathcal{X},\mathcal{Y}))$ satisfies $\varphi\in C^1\left(\mathcal{T}_{{\bf n}_1}\times\mathcal{T}_{{\bf n}_2}\times\mathcal{T}_{{\bf m}_1}\times\mathcal{T}_{{\bf m}_2};\mathcal{T}_{\bf l}\right)$. Because the domain is a Cartesian product, we cannot write down a single tensor for the Jacobian. Instead, we get a Jacobian for each block of variables:
$$
D_{\mathcal{V}}\varphi(\mathcal{V}, \mathcal{W},\mathcal{X},\mathcal{Y})\in \mathcal{T}_{{\bf l}\oplus{\bf n}_1},
$$
$$
D_{\mathcal{W}}\varphi(\mathcal{V}, \mathcal{W},\mathcal{X},\mathcal{Y})\in \mathcal{T}_{{\bf l}\oplus{\bf n}_2},
$$
$$
D_{\mathcal{X}}\varphi(\mathcal{V}, \mathcal{W},\mathcal{X},\mathcal{Y})\in \mathcal{T}_{{\bf l}\oplus{\bf m}_1},
$$
and
$$
D_{\mathcal{Y}}\varphi(\mathcal{V}, \mathcal{W},\mathcal{X},\mathcal{Y})\in \mathcal{T}_{{\bf l}\oplus{\bf m}_2}.
$$
Thinking of $h$ as $h(\mathcal{F},\mathcal{G})$, we also have the Jacobian blocks
$$
D_{\mathcal{F}} h(\mathcal{F},\mathcal{G})\in \mathcal{T}_{{\bf l}\oplus{\bf k}_1},\: D_{\mathcal{G}} h(\mathcal{F},\mathcal{G})\in \mathcal{T}_{{\bf l}\oplus{\bf k}_2},
$$
$$
D_{\mathcal{V}} f(\mathcal{V},\mathcal{W})\in \mathcal{T}_{{\bf k}_1\oplus{\bf n}_1},\: D_{\mathcal{W}} f(\mathcal{V},\mathcal{W})\in \mathcal{T}_{{\bf k}_1\oplus{\bf n}_2},
$$
$$
D_{\mathcal{X}} g(\mathcal{X},\mathcal{Y})\in \mathcal{T}_{{\bf k}_2\oplus{\bf m}_1},\: D_{\mathcal{Y}} g(\mathcal{X},\mathcal{Y})\in \mathcal{T}_{{\bf k}_2\oplus{\bf m}_2}.
$$
The chain rule in this block formulation is then (suppressing arguments)
$$
D_{\mathcal{V}}\varphi_{({\bf i},{\bf j})} = c(D_{\mathcal{F}}h, D_{\mathcal{V}} f)_{({\bf i},{\bf j})}=\left(\sum_{\bf k} \frac{\partial h_{\bf i}}{\partial f_{\bf k}} \frac{\partial f_{\bf k}}{\partial v_{\bf j}}\right)_{({\bf i},{\bf j})}
$$
$$
D_{\mathcal{W}}\varphi_{({\bf i},{\bf j})} = c(D_{\mathcal{F}}h, D_{\mathcal{W}} f)_{({\bf i},{\bf j})}=\left(\sum_{\bf k} \frac{\partial h_{\bf i}}{\partial f_{\bf k}} \frac{\partial f_{\bf k}}{\partial w_{\bf j}}\right)_{({\bf i},{\bf j})}
$$
$$
D_{\mathcal{X}}\varphi_{({\bf i},{\bf j})} = c(D_{\mathcal{G}}h, D_{\mathcal{X}} g)_{({\bf i},{\bf j})}=\left(\sum_{\bf k} \frac{\partial h_{\bf i}}{\partial g_{\bf k}} \frac{\partial g_{\bf k}}{\partial x_{\bf j}}\right)_{({\bf i},{\bf j})}
$$
$$
D_{\mathcal{Y}}\varphi_{({\bf i},{\bf j})} = c(D_{\mathcal{G}}h, D_{\mathcal{Y}} g)_{({\bf i},{\bf j})}=\left(\sum_{\bf k} \frac{\partial h_{\bf i}}{\partial g_{\bf k}} \frac{\partial g_{\bf k}}{\partial y_{\bf j}}\right)_{({\bf i},{\bf j})}
$$
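
As a rough numerical illustration of the block chain rule above, the sketch below computes $D_{\mathcal{V}}\varphi$ by contracting $D_{\mathcal{F}}h$ with $D_{\mathcal{V}}f$ using `np.tensordot`, and compares the result with a finite-difference estimate. The specific choices $f(V,W)=VW$, $g(X,Y)=X+Y$, and $h(F,G)=F\odot G$ (elementwise product), as well as the helper names `contract`, `D_V_matmul`, and `D_F_product`, are assumptions made only for this example.

```python
import numpy as np

def contract(A, B, k_ndim):
    # Standard contraction c(A, B): sum over the last k_ndim axes of A
    # and the first k_ndim axes of B.
    return np.tensordot(A, B, axes=k_ndim)

def D_V_matmul(V, W):
    # Jacobian of f(V, W) = V @ W with respect to V, shape f.shape + V.shape.
    # Entry [c, d, i, j] = d(V @ W)[c, d] / dV[i, j] = delta(c, i) * W[j, d].
    D = np.zeros((V.shape[0], W.shape[1]) + V.shape)
    for c in range(V.shape[0]):
        D[c, :, c, :] = W.T
    return D

def D_F_product(F, G):
    # Jacobian of h(F, G) = F * G (elementwise) with respect to F.
    # Entry [a, b, c, d] = delta(a, c) * delta(b, d) * G[a, b].
    D = np.zeros(F.shape + F.shape)
    for a in range(F.shape[0]):
        for b in range(F.shape[1]):
            D[a, b, a, b] = G[a, b]
    return D

def phi(V, W, X, Y):
    return (V @ W) * (X + Y)

rng = np.random.default_rng(0)
V, W = rng.normal(size=(2, 3)), rng.normal(size=(3, 2))
X, Y = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))

F, G = V @ W, X + Y
# Block chain rule: D_V phi = c(D_F h, D_V f), an element of T_{l (+) n_1}.
D_V_phi = contract(D_F_product(F, G), D_V_matmul(V, W), k_ndim=2)

# Central finite differences for the same block.
eps = 1e-6
D_fd = np.zeros_like(D_V_phi)
for i in range(V.shape[0]):
    for j in range(V.shape[1]):
        E = np.zeros_like(V)
        E[i, j] = eps
        D_fd[:, :, i, j] = (phi(V + E, W, X, Y) - phi(V - E, W, X, Y)) / (2 * eps)

block_error = np.max(np.abs(D_V_phi - D_fd))
print(D_V_phi.shape, block_error)
```

Here `k_ndim=2` because ${\bf k}_1$ has two indices in this example; the other three blocks follow the same pattern with the corresponding Jacobians swapped in.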

### Example:

Consider a feedforward neural network with a single hidden layer. The objective function may be written as
$$
\sum_{i=1}^N f_2({\bf y}^{(i)}, f_1({\bf x}^{(i)}; W, {\bf b}); V, {\bf c})
$$
for $W\in \mathcal{T}_{(k,d)}, {\bf b}\in \mathcal{T}_{(k)}, V\in\mathcal{T}_{(k, m)}, {\bf c}\in\mathcal{T}_{(m)}$. We will compute the gradient in blocks. First, we note that
$$
\nabla_W\sum_{i=1}^N f_2({\bf y}^{(i)}, f_1({\bf x}^{(i)}; W, {\bf b}); V, {\bf c})=\sum_{i=1}^N \nabla_W f_2({\bf y}^{(i)}, f_1({\bf x}^{(i)}; W, {\bf b}); V, {\bf c}).
$$
Moreover, the gradient is the transpose of the Jacobian, so it suffices to compute
$$
D_W f_2({\bf y}, f_1({\bf x}; W, {\bf b}); V, {\bf c}).
$$
Viewing $f_2$ as $f_2({\bf y},\xi; V, {\bf c})$, we then have that this Jacobian is the standard contraction of $D_\xi f_2({\bf y}, f_1({\bf x}; W, {\bf b}); V, {\bf c})$ with $D_W f_1({\bf x}; W, {\bf b})$. Similarly, 
$$
D_{\bf b} f_2({\bf y}, f_1({\bf x}; W, {\bf b}); V, {\bf c})
$$
is the contraction of $D_\xi f_2({\bf y}, f_1({\bf x}; W, {\bf b}); V, {\bf c})$ with $D_{\bf b} f_1({\bf x}; W, {\bf b})$. Because $V$ and ${\bf c}$ enter $f_2$ directly rather than through $f_1$, their blocks $D_V f_2$ and $D_{\bf c} f_2$ can be computed without this chain-rule step.

In [12]:
import numpy as np

# Compute the Jacobian of X @ W wrt W
def affine_jacobian(X, W):
    # (d_{i, j} (X @ W))_{a, b} = e_a^T X e_ie_j^T e_b
    D = np.zeros((X.shape[0], W.shape[1], W.shape[0], W.shape[1]))
    for k in range(W.shape[1]):
        D[:,k,:,k]=X
    return D
    
def logit(z):
    # Despite the name, this is the logistic sigmoid (the inverse of the logit); it is vectorized
    return 1/(1+np.exp(-z))

def Dlogit(Z):
    # The Jacobian of the matrix logit
    D = np.zeros((Z.shape[0], Z.shape[1], Z.shape[0], Z.shape[1]))
    for k in range(Z.shape[0]):
        D[k,:,k,:] = np.diag(logit(Z[k,:])*logit(-Z[k,:]))
    return D

def softmax(z):
    v = np.exp(z)
    return v / np.sum(v)

def matrix_softmax(Z):
    return np.apply_along_axis(softmax, 1, Z)

def Dmatrix_softmax(Z):
    D = np.zeros((Z.shape[0], Z.shape[1], Z.shape[0], Z.shape[1]))
    for k in range(Z.shape[0]):
        v = np.exp(Z[k,:])
        v = v / np.sum(v)
        # v is 1-D, so the rank-one term v v^T must be an outer product
        D[k,:,k,:] = np.diag(v) - np.outer(v, v)
    return D

def cross_entropy(P, Q):
    return -np.sum(P * np.log(Q))

def DQcross_entropy(P, Q):
    # Elementwise derivative of -sum(P * log(Q)) with respect to Q
    return -P / Q

def nn_loss_closure(X, Y):
    # vars[0]=W, vars[1]=b, vars[2]=V, vars[3]=c
    def f(vars):
        return cross_entropy(Y, matrix_softmax(logit(X @ vars[0] + vars[1]) @ vars[2] + vars[3]))
    return f

def nn_loss_gradient_closure(X, Y):
    def df(vars):
        # Gather all the intermediate Jacobians
        W, b, V, c = vars
        XWb = X @ W + b
        H = logit(XWb)
        HVc = H @ V + c
        DQ = DQcross_entropy(Y, matrix_softmax(HVc))
        DZ = Dmatrix_softmax(HVc)
        DV = affine_jacobian(H, V)
        # Adding the row vector c to every row equals ones((n, 1)) @ c
        # (this assumes b and c are stored as arrays of shape (1, width))
        Dc = affine_jacobian(np.ones((X.shape[0], 1)), c)
        DXWb = Dlogit(XWb)
        DW = affine_jacobian(X, W)
        Db = affine_jacobian(np.ones((X.shape[0], 1)), b)
        

## Group Problems

Finish the nn_loss_gradient_closure code using the chain rule and np.tensordot.
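
One way to test a finished gradient is to compare it against central finite differences. The helper below is a minimal sketch, not part of the original notebook: the name `numerical_gradient` is hypothetical, and it assumes the parameters are passed as a list of floating-point NumPy arrays that may be perturbed in place and restored.

```python
import numpy as np

def numerical_gradient(f, params, eps=1e-6):
    # Central-difference gradient of a scalar function f(params), where
    # params is a list of float arrays. Returns one array per parameter,
    # each with the same shape as that parameter.
    grads = []
    for p in params:
        g = np.zeros_like(p, dtype=float)
        for idx in np.ndindex(p.shape):
            old = p[idx]
            p[idx] = old + eps
            f_plus = f(params)
            p[idx] = old - eps
            f_minus = f(params)
            p[idx] = old            # restore the entry before moving on
            g[idx] = (f_plus - f_minus) / (2 * eps)
        grads.append(g)
    return grads

# Small self-check on a function whose gradient is known in closed form:
# f(A, b) = 0.5 * sum(A**2) + sum(b) has gradients A and a vector of ones.
A = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([0.5, -1.0])
quad = lambda v: 0.5 * np.sum(v[0] ** 2) + np.sum(v[1])
check = numerical_gradient(quad, [A, b])
```

The same helper should apply to the closure returned by nn_loss_closure(X, Y), called with the list of parameters W, b, V, c, since that closure takes a single list argument. This is slow (two loss evaluations per parameter entry), so it is only suitable for small test problems.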

## Backpropagation

Consider functions $f_1, f_2, f_3:\mathbb{R}^2\rightarrow\mathbb{R}$, written as $f_1(x_1,\theta_1)$, $f_2(x_2,\theta_2)$, and $f_3(x_3,\theta_3)$, and define

$$
g(x; \theta_1, \theta_2, \theta_3) = f_3(f_2(f_1(x,\theta_1),\theta_2),\theta_3).
$$

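Then, since $\theta_2$ and $\theta_3$ do not depend on $\theta_1$ (so $\frac{\partial \theta_2}{\partial\theta_1}=\frac{\partial \theta_3}{\partial\theta_1}=0$),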

$$
\begin{aligned}
\frac{\partial g}{\partial\theta_1}(x;\theta_1, \theta_2, \theta_3)
&=\frac{\partial f_3}{\partial x_3}(f_2(f_1(x,\theta_1),\theta_2),\theta_3)\frac{\partial}{\partial\theta_1}\left[f_2(f_1(x,\theta_1),\theta_2)\right] + \frac{\partial f_3}{\partial \theta_3}(f_2(f_1(x,\theta_1),\theta_2),\theta_3)\frac{\partial \theta_3}{\partial\theta_1}\\
&=\frac{\partial f_3}{\partial x_3}(f_2(f_1(x,\theta_1),\theta_2),\theta_3)\left[\frac{\partial f_2}{\partial x_2}(f_1(x,\theta_1),\theta_2)\frac{\partial f_1}{\partial\theta_1}(x,\theta_1) + \frac{\partial f_2}{\partial \theta_2}(f_1(x,\theta_1),\theta_2)\frac{\partial \theta_2}{\partial \theta_1}\right]\\
&= \frac{\partial f_3}{\partial x_3}(f_2(f_1(x,\theta_1),\theta_2),\theta_3)\frac{\partial f_2}{\partial x_2}(f_1(x,\theta_1),\theta_2)\frac{\partial f_1}{\partial\theta_1}(x,\theta_1).
\end{aligned}
$$

Similarly,

$$
\frac{\partial g}{\partial\theta_2}(x;\theta_1, \theta_2, \theta_3) = \frac{\partial f_3}{\partial x_3}(f_2(f_1(x,\theta_1),\theta_2),\theta_3)\frac{\partial f_2}{\partial \theta_2}(f_1(x,\theta_1),\theta_2)
$$

and

$$
\frac{\partial g}{\partial\theta_3}(x;\theta_1, \theta_2, \theta_3) = \frac{\partial f_3}{\partial \theta_3}(f_2(f_1(x,\theta_1),\theta_2),\theta_3).
$$

Another way to see this is to view this as the sequence of maps

$$
\begin{pmatrix}
\theta_1\\
\theta_2\\
\theta_3
\end{pmatrix}\longmapsto \begin{pmatrix}
f_1(x,\theta_1)\\
\theta_2\\
\theta_3
\end{pmatrix}\longmapsto \begin{pmatrix}
f_2(f_1(x,\theta_1),\theta_2)\\
\theta_3
\end{pmatrix}\longmapsto f_3(f_2(f_1(x,\theta_1),\theta_2),\theta_3).
$$

The Jacobians of these maps are

$$
\begin{pmatrix}
\frac{\partial f_1}{\partial\theta_1}(x,\theta_1) & 0 & 0\\
0 & 1 & 0\\
0 & 0 & 1
\end{pmatrix}, \begin{pmatrix}
\frac{\partial f_2}{\partial x_2}(f_1(x,\theta_1),\theta_2) & \frac{\partial f_2}{\partial \theta_2}(f_1(x,\theta_1),\theta_2) & 0\\
0 & 0 & 1\\
\end{pmatrix},\text{ and } \begin{pmatrix}
\frac{\partial f_3}{\partial x_3}(f_2(f_1(x,\theta_1),\theta_2),\theta_3) & \frac{\partial f_3}{\partial \theta_3}(f_2(f_1(x,\theta_1),\theta_2),\theta_3)
\end{pmatrix}
$$
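
To make the matrix picture concrete, the following sketch multiplies the three Jacobians for one hypothetical choice of layers, $f_i(x,\theta)=\tanh(\theta x)$, and compares the product with finite-difference estimates of $\partial g/\partial\theta_1$, $\partial g/\partial\theta_2$, and $\partial g/\partial\theta_3$. The layer function and the numerical values are assumptions chosen only for illustration.

```python
import numpy as np

# Hypothetical layer used for every f_i: f(x, theta) = tanh(theta * x).
def f(x, theta):
    return np.tanh(theta * x)

def df_dx(x, theta):
    return theta * (1.0 - np.tanh(theta * x) ** 2)

def df_dtheta(x, theta):
    return x * (1.0 - np.tanh(theta * x) ** 2)

def g(x, t1, t2, t3):
    return f(f(f(x, t1), t2), t3)

x, t1, t2, t3 = 0.7, 0.3, -1.2, 0.8
x2 = f(x, t1)     # input to f_2
x3 = f(x2, t2)    # input to f_3

# Jacobians of the three maps, in the order the maps are applied.
J1 = np.array([[df_dtheta(x, t1), 0.0, 0.0],
               [0.0, 1.0, 0.0],
               [0.0, 0.0, 1.0]])
J2 = np.array([[df_dx(x2, t2), df_dtheta(x2, t2), 0.0],
               [0.0, 0.0, 1.0]])
J3 = np.array([[df_dx(x3, t3), df_dtheta(x3, t3)]])

# Chain rule: the Jacobian of the composition is the product, last map first.
jacobian = J3 @ J2 @ J1    # shape (1, 3)

eps = 1e-6
fd = np.array([
    (g(x, t1 + eps, t2, t3) - g(x, t1 - eps, t2, t3)) / (2 * eps),
    (g(x, t1, t2 + eps, t3) - g(x, t1, t2 - eps, t3)) / (2 * eps),
    (g(x, t1, t2, t3 + eps) - g(x, t1, t2, t3 - eps)) / (2 * eps),
])
jacobian_error = np.max(np.abs(jacobian[0] - fd))
print(jacobian, jacobian_error)
```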

The interesting thing to note here is that

$$
\frac{\partial f_3}{\partial x_3}(f_2(f_1(x,\theta_1),\theta_2),\theta_3)
$$

is a factor for two of these partial derivatives. This redundancy becomes more pronounced as the composition gets deeper. Consider the composition of maps

$$
\begin{pmatrix}
\theta_1\\
\theta_2\\
\theta_3\\
\theta_4
\end{pmatrix}\longmapsto \begin{pmatrix}
f_1(x,\theta_1)\\
\theta_2\\
\theta_3\\
\theta_4
\end{pmatrix}\longmapsto \begin{pmatrix}
f_2(f_1(x,\theta_1),\theta_2)\\
\theta_3\\
\theta_4
\end{pmatrix}\longmapsto \begin{pmatrix}
f_3(f_2(f_1(x,\theta_1),\theta_2),\theta_3)\\
\theta_4
\end{pmatrix}\longmapsto f_4(f_3(f_2(f_1(x,\theta_1),\theta_2),\theta_3),\theta_4).
$$

The Jacobian of this composition is (by the chain rule)

$$
\tiny\begin{pmatrix}
\frac{\partial f_4}{\partial x_4}(f_3(f_2(f_1(x,\theta_1),\theta_2),\theta_3), \theta_4) & \frac{\partial f_4}{\partial \theta_4}(f_3(f_2(f_1(x,\theta_1),\theta_2),\theta_3), \theta_4)
\end{pmatrix}\begin{pmatrix}
\frac{\partial f_3}{\partial x_3}(f_2(f_1(x,\theta_1),\theta_2),\theta_3) & \frac{\partial f_3}{\partial \theta_3}(f_2(f_1(x,\theta_1),\theta_2),\theta_3) & 0\\
0 & 0 & 1
\end{pmatrix}\begin{pmatrix}
\frac{\partial f_2}{\partial x_2}(f_1(x,\theta_1),\theta_2) & \frac{\partial f_2}{\partial \theta_2}(f_1(x,\theta_1),\theta_2) & 0 & 0\\
0 & 0 & 1 & 0\\
0 & 0 & 0 & 1
\end{pmatrix}\begin{pmatrix}
\frac{\partial f_1}{\partial\theta_1}(x,\theta_1) & 0 & 0 & 0\\
0 & 1 & 0 & 0\\
0 & 0 & 1 & 0\\
0 & 0 & 0 & 1
\end{pmatrix}.
$$

This means that the gradient has the form (with suppression of arguments):

$$
\begin{pmatrix}
\frac{\partial f_4}{\partial x_4} \frac{\partial f_3}{\partial x_3} \frac{\partial f_2}{\partial x_2} \frac{\partial f_1}{\partial \theta_1}\\
\frac{\partial f_4}{\partial x_4} \frac{\partial f_3}{\partial x_3} \frac{\partial f_2}{\partial \theta_2}\\
\frac{\partial f_4}{\partial x_4} \frac{\partial f_3}{\partial \theta_3}\\
\frac{\partial f_4}{\partial \theta_4}
\end{pmatrix}.
$$

This suggests the following procedure for computing a gradient descent update with step size $\eta>0$:

1. $\theta_4^\prime = \theta_4 - \eta \frac{\partial f_4}{\partial \theta_4}$ and set $q=\frac{\partial f_4}{\partial x_4}$.
2. For $i=3, 2, 1$: set $\theta_i^\prime = \theta_i -\eta q \frac{\partial f_i}{\partial \theta_i}$ and then update $q \leftarrow q \frac{\partial f_i}{\partial x_i}$.
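
The two steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: every layer is the hypothetical function $f_i(x,\theta)=\tanh(\theta x)$, the names `layer`, `composite`, and `backprop_step` are invented for the example, and the loop starts from $q=1$ so that step 1 becomes the first iteration of the same loop. All partial derivatives are evaluated at the inputs stored during a forward pass.

```python
import math

# Hypothetical scalar layer f_i(x, theta) = tanh(theta * x) and its partials.
def layer(x, theta):
    return math.tanh(theta * x)

def d_layer_dx(x, theta):
    return theta * (1.0 - math.tanh(theta * x) ** 2)

def d_layer_dtheta(x, theta):
    return x * (1.0 - math.tanh(theta * x) ** 2)

def composite(x, thetas):
    # f_n(...f_2(f_1(x, theta_1), theta_2)..., theta_n)
    for theta in thetas:
        x = layer(x, theta)
    return x

def backprop_step(x, thetas, eta):
    # Forward pass: record the input x_i seen by each layer.
    inputs = []
    for theta in thetas:
        inputs.append(x)
        x = layer(x, theta)

    # Backward pass: q accumulates the product of df_j/dx_j for j > i.
    q = 1.0
    grads = [0.0] * len(thetas)
    new_thetas = list(thetas)
    for i in reversed(range(len(thetas))):
        grads[i] = q * d_layer_dtheta(inputs[i], thetas[i])
        new_thetas[i] = thetas[i] - eta * grads[i]
        q = q * d_layer_dx(inputs[i], thetas[i])
    return new_thetas, grads

thetas = [0.3, -1.2, 0.8, 0.5]
new_thetas, grads = backprop_step(0.7, thetas, eta=0.1)
print(grads)
```

Each shared factor $\frac{\partial f_i}{\partial x_i}$ is multiplied into `q` exactly once, so the cost of the backward pass grows linearly with the depth rather than quadratically.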

This is a simplified version of the **backpropagation** algorithm (or backprop). Now suppose that the arguments are vectors, so that we have $f_1({\bf x}_1, \Theta_1)$, $f_2({\bf x}_2, \Theta_2)$, $f_3({\bf x}_3,\Theta_3)$, and $f_4({\bf x}_4, \Theta_4)$, where each ${\bf x}_i$ is a vector input and each $\Theta_i$ is a vector of parameters. Then our composition has a *block form* given by

$$
\begin{pmatrix}
\Theta_1\\
\Theta_2\\
\Theta_3\\
\Theta_4
\end{pmatrix}\longmapsto \begin{pmatrix}
f_1({\bf x},\Theta_1)\\
\Theta_2\\
\Theta_3\\
\Theta_4
\end{pmatrix}\longmapsto \begin{pmatrix}
f_2(f_1({\bf x},\Theta_1),\Theta_2)\\
\Theta_3\\
\Theta_4
\end{pmatrix}\longmapsto \begin{pmatrix}
f_3(f_2(f_1({\bf x},\Theta_1),\Theta_2),\Theta_3)\\
\Theta_4
\end{pmatrix}\longmapsto f_4(f_3(f_2(f_1({\bf x},\Theta_1),\Theta_2),\Theta_3),\Theta_4).
$$

The chain rule then gives the Jacobian (in block form)

$$
\tiny\begin{pmatrix}
D_{{\bf x}_4}f_4(f_3(f_2(f_1({\bf x},\Theta_1),\Theta_2),\Theta_3), \Theta_4) & D_{\Theta_4}f_4(f_3(f_2(f_1({\bf x},\Theta_1),\Theta_2),\Theta_3), \Theta_4)
\end{pmatrix}\begin{pmatrix}
D_{{\bf x}_3} f_3(f_2(f_1({\bf x},\Theta_1),\Theta_2),\Theta_3) & D_{\Theta_3} f_3(f_2(f_1({\bf x},\Theta_1),\Theta_2),\Theta_3) & {\bf 0}\\
{\bf 0} & {\bf 0} & I
\end{pmatrix}\begin{pmatrix}
D_{{\bf x}_2} f_2(f_1({\bf x},\Theta_1),\Theta_2) & D_{\Theta_2} f_2(f_1({\bf x},\Theta_1),\Theta_2) & {\bf 0} & {\bf 0}\\
{\bf 0} & {\bf 0} & I & {\bf 0}\\
{\bf 0} & {\bf 0} & {\bf 0} & I
\end{pmatrix}\begin{pmatrix}
D_{\Theta_1} f_1({\bf x},\Theta_1) & {\bf 0} & {\bf 0} & {\bf 0}\\
{\bf 0} & I & {\bf 0} & {\bf 0}\\
{\bf 0} & {\bf 0} & I & {\bf 0}\\
{\bf 0} & {\bf 0} & {\bf 0} & I
\end{pmatrix}.
$$

Therefore we can represent the gradient in the block form

$$
\begin{pmatrix}
\left(D_{\Theta_1}\: f_1\right)^T \left(D_{{\bf x}_2}\: f_2\right)^T \left(D_{{\bf x}_3}\: f_3\right)^T\nabla_{{\bf x}_4}\: f_4\\
\left(D_{\Theta_2}\: f_2\right)^T \left(D_{{\bf x}_3}\: f_3\right)^T\nabla_{{\bf x}_4}\: f_4\\
\left(D_{\Theta_3}\: f_3\right)^T\nabla_{{\bf x}_4}\: f_4\\
\nabla_{\Theta_4}\: f_4
\end{pmatrix}.
$$
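
The block gradient above can be evaluated from the bottom row upward, reusing the running product of transposed Jacobians. The sketch below illustrates this under assumptions that are not in the text: the hidden layers are $f_i({\bf x},\Theta_i)=\tanh(M_i{\bf x}+\Theta_i)$ with fixed, untrained matrices $M_i$, the last layer is the scalar $f_4({\bf x},\Theta_4)=\Theta_4^T{\bf x}$, and the names `hidden`, `hidden_jacobians`, `loss`, and `block_gradient` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
dims = [3, 4, 2, 3]    # sizes of x_1 (the input), x_2, x_3, x_4
# Fixed matrices M_1, M_2, M_3 (assumed, not trained); M_i maps x_i to x_{i+1}.
Ms = [rng.normal(size=(dims[i + 1], dims[i])) for i in range(3)]

def hidden(x, theta, M):
    return np.tanh(M @ x + theta)

def hidden_jacobians(x, theta, M):
    s = 1.0 - np.tanh(M @ x + theta) ** 2
    return s[:, None] * M, np.diag(s)    # D_x f and D_Theta f

def loss(thetas, x):
    for theta, M in zip(thetas[:3], Ms):
        x = hidden(x, theta, M)
    return thetas[3] @ x                 # f_4(x_4, Theta_4) = Theta_4^T x_4

def block_gradient(thetas, x):
    # Forward pass: store the input of each hidden layer.
    inputs = []
    for theta, M in zip(thetas[:3], Ms):
        inputs.append(x)
        x = hidden(x, theta, M)

    grads = [None] * 4
    grads[3] = x.copy()                  # nabla_{Theta_4} f_4 = x_4
    q = thetas[3].copy()                 # nabla_{x_4} f_4 = Theta_4
    for i in (2, 1, 0):
        Dx, Dtheta = hidden_jacobians(inputs[i], thetas[i], Ms[i])
        grads[i] = Dtheta.T @ q          # (D_{Theta_i} f_i)^T q
        q = Dx.T @ q                     # prepend (D_{x_i} f_i)^T to the product
    return grads

thetas = [rng.normal(size=n) for n in (4, 2, 3, 3)]
x0 = rng.normal(size=3)
grads = block_gradient(thetas, x0)
print([g.shape for g in grads])
```

Only matrix-vector products with transposed Jacobians appear in the loop, so the full product of Jacobian matrices is never formed.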


