---
title: 12.3 Backpropagation
subject:  Optimization
subtitle: 
short_title: 12.3 Backpropagation
authors:
  - name: Nikolai Matni
    affiliations:
      - Dept. of Electrical and Systems Engineering
      - University of Pennsylvania
    email: nmatni@seas.upenn.edu
license: CC-BY-4.0
keywords: 
math:
  '\vv': '\mathbf{#1}'
  '\bm': '\begin{bmatrix}'
  '\em': '\end{bmatrix}'
  '\R': '\mathbb{R}'
---

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/nikolaimatni/ese-2030/HEAD?labpath=/10_Ch_11_PCA_Apps/121-Apps.ipynb)

{doc}`Lecture notes <../lecture_notes/Lecture 22 - An Introduction to Backpropagation.pdf>`

## Reading

Material related to this page, as well as additional exercises, can be

## Learning Objectives

By the end of this page, you should know:
- 

**Mach. Learning -- Midterm**

Last class, we studied the unconstrained optimization problem

\begin{equation}
\text{minimize } f(x) \tag{P}
\end{equation}

over $x \in \mathbb{R}^n$, where we look for the $x \in \mathbb{R}^n$ that makes the value of the cost function $f: \mathbb{R}^n \to \mathbb{R}$ as small as possible. We saw that one way to find either a local or global minimum $x^*$ is gradient descent. Starting at an initial guess $x^{(0)}$, we iteratively update our guess via

\begin{equation}
x^{(k+1)} = x^{(k)} - s \nabla f(x^{(k)}), \quad k = 0, 1, 2, \ldots \tag{GD}
\end{equation}

where $\nabla f(x^{(k)}) \in \mathbb{R}^n$ is the gradient of $f$ evaluated at the current guess, and $s > 0$ is a step size chosen large enough to make progress towards $x^*$, but not so big as to overshoot.
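To make the update (GD) concrete, here is a minimal sketch in Python with NumPy. The helper name `gradient_descent`, the quadratic test cost, the step size, and the iteration count are all illustrative assumptions and are not taken from the lecture.

```python
import numpy as np

def gradient_descent(grad_f, x0, step_size, num_iters):
    """Run the (GD) update x <- x - s * grad_f(x) for a fixed number of iterations."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - step_size * grad_f(x)
    return x

# Hypothetical test problem: f(x) = 0.5 * ||A x - b||^2, so grad f(x) = A^T (A x - b).
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([2.0, -1.0])

def grad_f(x):
    return A.T @ (A @ x - b)

x_gd = gradient_descent(grad_f, x0=np.zeros(2), step_size=0.2, num_iters=200)
x_star = np.linalg.solve(A, b)  # exact minimizer, since A is invertible
print(x_gd, x_star)
```

For this particular cost, a step size of 0.2 is small enough that the iterates contract towards the minimizer; a much larger step would overshoot and diverge.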

Today, we'll focus our attention on optimization problems (P) for which the cost function takes the following special form

\begin{equation}
f(x) = \sum_{i=1}^N f_i(x), \tag{S}
\end{equation}

i.e., cost functions $f$ that decompose into a sum of $N$ "sub-costs" $f_i$. Problems with cost functions of the form (S) are particularly common in machine learning.

For example, a typical problem setup in machine learning is as follows (we saw an example of this when we studied least squares for classification). We are given a set of training data $\{(z_i, y_i)\}_{i=1}^N$, comprised of inputs $z_i \in \mathbb{R}^m$ and outputs $y_i \in \mathbb{R}^p$. Our goal is to find a set of weights $x \in \mathbb{R}^n$ that parametrize a model $m$ such that $m(z_i; x) \approx y_i$ on our training data. A common way of doing this is to minimize a loss function of the form

\begin{equation}
\text{loss}(\{(z_i, y_i)\}; x) = \frac{1}{N} \sum_{i=1}^N \ell(m(z_i; x) - y_i) \tag{L}
\end{equation}

where each term $\ell(m(z_i; x) - y_i)$ penalizes the difference between our model prediction $m(z_i; x)$ on input $z_i$ and the observed output $y_i$. In this setting, the loss function (L) takes the form (S), with $f_i(x) = \frac{1}{N} \ell(m(z_i; x) - y_i)$ measuring the error between the model prediction on training input $z_i$ and the true output $y_i$.

A common choice for the "sub-loss" function is $\ell(e) = \|e\|^2$, leading to a least-squares regression problem, but note that most other choices of loss function are compatible with the following discussion.

Now suppose that we want to implement gradient descent (GD) on the loss function (L). Our first step is to compute the gradient $\nabla_x \text{loss}(\{(z_i,y_i)\};x)$. Because of the sum structure of (L), we have that:

\begin{equation}
\nabla_x \text{loss}(\{(z_i,y_i)\};x) = \frac{1}{N} \sum_{i=1}^N \nabla_x \ell(m(z_i;x) - y_i),
\end{equation}

i.e., the gradient of the loss function is the average of the gradients of the "sub-losses" on each of the $i=1,\ldots,N$ data points.

Our task now is therefore to compute the gradient $\nabla_x \ell(m(z_i;x)-y_i)$. This requires the multivariate chain rule, as $f_i(x) = \ell(m(z_i;x)-y_i)$ is a composition of the functions $\ell(e)$, $e = m(z_i;x) - y_i$, and $m(z_i;x)$.

## The Multivariate Chain Rule (Calculus 3 Ch.5)

We begin with a reminder of the chain rule for scalar functions. Let $f:\mathbb{R}\to\mathbb{R}$ and $g:\mathbb{R}\to\mathbb{R}$ be differentiable functions. Then for $h(x) = g(f(x))$, we have that:

\begin{equation}
h'(x) = g'(f(x)) f'(x). \tag{C1}
\end{equation}

If we define $y = g(t)$ and $t = f(x)$, then we can rewrite (C1) as $\frac{dh}{dx} = \frac{dg}{dt} \cdot \frac{df}{dx}$. This is a useful way of writing things as we can "cancel" $dt$ on the RHS to check that our formula is correct.

**WARNING:** $\frac{dh}{dx} = \frac{dg}{dt} \cdot \frac{df}{dx}$ is shorthand for $h'(x) = g'(f(x))f'(x)$. The "cancellation" is only a mnemonic, not a real algebraic operation!

Generalizing slightly, suppose now that $f:\mathbb{R}^n\to\mathbb{R}$ maps a vector $x\in\mathbb{R}^n$ to $f(x)\in\mathbb{R}$. Then for $h(x) = g(f(x))$, we have:

\begin{equation}
\nabla_x h(x) = g'(f(x)) \nabla f(x), \tag{C2}
\end{equation}

which we see is a natural generalization of equation (C1). It will be convenient for us later to define $\frac{df}{dx} = \nabla f(x)^T$ and $\frac{dh}{dx} = \nabla h(x)^T$. Again defining $y = g(t)$ and $t = f(x)$, we can rewrite (C2) as $\frac{dh}{dx} = \frac{dg}{dt} \cdot \frac{df}{dx}$, which looks exactly the same as before!

**WARNING:** $\frac{dh}{dx} = \frac{dg}{dt} \cdot \frac{df}{dx}$ is shorthand for $\nabla h(x)^T = g'(f(x))\nabla f(x)^T$. The "cancellation" is only a mnemonic, not a real algebraic operation!

Now, let's apply these ideas to computing the gradient of $h(x) = \ell(m(z_i;x)-y_i)$, where we'll assume for now that $m(z_i;x), y_i \in \mathbb{R}$. Applying (C2), we get

\begin{equation}
\nabla_x h(x) = \ell'(m(z_i;x)-y_i) \cdot \nabla_x (m(z_i;x) - y_i) = \ell'(m(z_i;x)-y_i) \cdot \nabla_x m(z_i;x)
\end{equation}

where we use that $\nabla_x y_i = 0$ (since it's a constant). Without knowing more about the functions $\ell$ and $m$, this is all we can say.

Example: Suppose $\ell(e) = \frac{1}{2}e^2$ and $m(z_i; x) = x^Tz_i$. Then

\begin{equation}
\ell(m(z_i;x)-y_i) = \frac{1}{2}(x^Tz_i - y_i)^2 \quad \text{and} \quad \nabla_x \ell(m(z_i;x)-y_i) = \underbrace{(x^Tz_i - y_i)}_{\ell'(m(z_i;x)-y_i)} \cdot \underbrace{z_i}_{\nabla_x m(z_i;x)}.
\end{equation}
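The example gradient can be sanity-checked numerically. The sketch below assumes the same choices as the example, $\ell(e) = \frac{1}{2}e^2$ and $m(z_i; x) = x^Tz_i$, and averages the per-sample gradients as in (L). The random data and the helper `finite_difference_grad` are hypothetical and only there to compare against a central-difference approximation.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 5, 3
Z = rng.normal(size=(N, n))   # row i is the input z_i (made-up data)
y = rng.normal(size=N)        # scalar outputs y_i (made-up data)
x = rng.normal(size=n)        # current weights

def loss(x):
    """Loss (L) with l(e) = 0.5 * e^2 and m(z; x) = x^T z."""
    return np.mean(0.5 * (Z @ x - y) ** 2)

def loss_grad(x):
    """Average of the per-sample gradients (x^T z_i - y_i) * z_i."""
    residuals = Z @ x - y                      # l'(m(z_i; x) - y_i) for each i
    return (residuals[:, None] * Z).mean(axis=0)

def finite_difference_grad(f, x, eps=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    g = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = eps
        g[k] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

grad_analytic = loss_grad(x)
grad_numeric = finite_difference_grad(loss, x)
print(np.max(np.abs(grad_analytic - grad_numeric)))
```

The printed discrepancy should be tiny, which is consistent with the chain-rule expression above.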

Next class we will have a brief introduction to deep learning. In deep learning, the function $m(z_i; x)$ is often parameterized as a "chain of function compositions"

\begin{align*}
m(z_i; x) &= M_L(M_{L-1}(\cdots(M_2(M_1(z_i))))) \tag{NN} \\
&= M_L \circ M_{L-1} \circ \cdots \circ M_2 \circ M_1(z_i).
\end{align*}

A more suggestive way of writing this parameterization (that also highlights the dependence on $x$) is

\begin{align*}
O_0 &= z_i \\
O_1 &= M_1(O_0; x_1) & O_1 \in \mathbb{R}^{n_1}, O_0 \in \mathbb{R}^{n_0} \tag{NNN} \\
O_2 &= M_2(O_1; x_2) & O_2 \in \mathbb{R}^{n_2}, O_1 \in \mathbb{R}^{n_1} \\
&\vdots \\
O_L &= M_L(O_{L-1}; x_L) & O_L \in \mathbb{R}^{n_L}, O_{L-1} \in \mathbb{R}^{n_{L-1}}
\end{align*}

Here the model parameters $x = (x_1, \ldots, x_L)$ are split across the layers $1, \ldots, L$. The intermediate outputs $O_j$ can be of different dimensions, as can the layer parameters $x_j$. While (NN) and (NNN) are equivalent, (NNN) makes explicit the dependence on the model parameters $x$ (which are typically thought of as being "attached" to the individual layers). Our goal is then to compute $\nabla_x \ell(m(z_i;x)-y_i)$ for $m$ of the form (NN), where we now allow $m: z_i \in \mathbb{R}^{n_0} \mapsto O_L \in \mathbb{R}^{n_L}$ to be vector-valued. To do this, we need the fully general multivariate chain rule.

For $h(x) = g(f(x))$ with vector-valued $f: \mathbb{R}^n \to \mathbb{R}^p$ and $g: \mathbb{R}^p \to \mathbb{R}^m$, we need to define the Jacobian matrices for $f$ and $g$:

\begin{equation}
\frac{df}{dx} = \begin{bmatrix}
\frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\
\vdots & & \vdots \\
\frac{\partial f_p}{\partial x_1} & \cdots & \frac{\partial f_p}{\partial x_n}
\end{bmatrix}
\quad \text{and} \quad
\frac{dg}{dt} = \begin{bmatrix}
\frac{\partial g_1}{\partial t_1} & \cdots & \frac{\partial g_1}{\partial t_p} \\
\vdots & & \vdots \\
\frac{\partial g_m}{\partial t_1} & \cdots & \frac{\partial g_m}{\partial t_p}
\end{bmatrix}
\tag{J}
\end{equation}

as the $p\times n$ and $m\times p$ matrices of partial derivatives, respectively.


We'll use our same intuition of "cancelling" to derive the expression:

\begin{equation}
\frac{dh}{dx} = \frac{dg}{dt} \cdot \frac{df}{dx} \tag{C3}
\end{equation}

Note that (C3) is defined by a matrix-matrix multiplication of an $m\times p$ and $p\times n$ matrix, meaning $\frac{dh}{dx} \in \mathbb{R}^{m\times n}$. The claim is that the $(i,j)$ entry of (C3) is a perfectly valid expression for $\frac{\partial h_i}{\partial x_j}$, the partial derivative of $h_i$ with respect to $x_j$. From (J) and (C3), we have

\begin{equation}
\left(\frac{dh}{dx}\right)_{i,j} = \frac{dg_i}{dt} \cdot \left[\frac{\partial f}{\partial x_j}\right] = \frac{\partial g_i}{\partial t_1} \cdot \frac{\partial f_1}{\partial x_j} + \cdots + \frac{\partial g_i}{\partial t_p} \cdot \frac{\partial f_p}{\partial x_j},
\end{equation}

which is precisely the expression we were looking for. The "cancellation rule" tells us that the $k$-th term in the sum captures the contribution to $\frac{\partial h_i}{\partial x_j}$ made through the coordinate $t_k$.
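A quick numerical check of (C3) can help build confidence in the formula. In the sketch below, the particular functions `f` and `g`, their hand-derived Jacobians, and the helper `numerical_jacobian` are all made-up illustrations; the only thing being demonstrated is that the product of the two Jacobians matches a finite-difference Jacobian of the composition.

```python
import numpy as np

def f(x):
    """Hypothetical f: R^3 -> R^2."""
    return np.array([x[0] * x[1], np.sin(x[2])])

def df_dx(x):
    """2 x 3 Jacobian of f, derived by hand."""
    return np.array([[x[1], x[0], 0.0],
                     [0.0, 0.0, np.cos(x[2])]])

def g(t):
    """Hypothetical g: R^2 -> R^2."""
    return np.array([t[0] + t[1] ** 2, np.exp(t[0])])

def dg_dt(t):
    """2 x 2 Jacobian of g, derived by hand."""
    return np.array([[1.0, 2.0 * t[1]],
                     [np.exp(t[0]), 0.0]])

def numerical_jacobian(func, x, eps=1e-6):
    """Central-difference Jacobian: column j approximates d func / d x_j."""
    cols = []
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        cols.append((func(x + e) - func(x - e)) / (2 * eps))
    return np.stack(cols, axis=1)

x = np.array([0.5, -1.0, 2.0])
dh_dx = dg_dt(f(x)) @ df_dx(x)                        # chain rule (C3)
dh_dx_numeric = numerical_jacobian(lambda v: g(f(v)), x)
print(np.max(np.abs(dh_dx - dh_dx_numeric)))
```

Note that the Jacobian of `g` is evaluated at `f(x)`, not at `x`, exactly as in the scalar rule (C1).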

We can apply this formula recursively to our function class (NN) to obtain the formula:

\begin{equation}
\frac{dm}{dx} = \frac{\partial M_L}{\partial M_{L-1}} \cdot \frac{\partial M_{L-1}}{\partial M_{L-2}} \cdots \frac{\partial M_2}{\partial M_1} \cdot \frac{\partial M_1}{\partial x} \tag{MC}
\end{equation}

which is a fully general matrix chain rule. We'll use (MC) next to explore the key idea behind backpropagation, which has been a key technical enabler of contemporary deep learning.

## Backpropagation

We are going to work out how to efficiently compute the gradient of

\[\ell(m(z_i;x)-y_i)\]

when $m$ takes the form in (NNN). We'll furthermore assume, as is often the case in deep learning, that each layer function $M_\ell$ takes the following form:

\[M_\ell(O_{\ell-1}; x_\ell) = \sigma\left(X_\ell \begin{bmatrix} O_{\ell-1} \\ 1 \end{bmatrix}\right)\]

where $X_\ell$ is an $n_\ell \times (n_{\ell-1}+1)$ matrix with entries given by $x_\ell \in \mathbb{R}^{n_\ell(n_{\ell-1}+1)}$, and $\sigma$ is a pointwise nonlinearity $\sigma(x) = (G(x_1),\ldots,G(x_p))$ called an activation function (we'll have more to say about these soon).
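As a rough illustration of the layered model (NNN) with this layer form, the sketch below runs a forward pass and keeps every intermediate output. The choice $G = \tanh$, the layer widths, and the random parameters are assumptions made purely for the example; the function names `layer` and `forward` are likewise hypothetical.

```python
import numpy as np

def layer(O_prev, X):
    """One layer: sigma(X @ [O_prev; 1]), with G = tanh as an assumed activation."""
    return np.tanh(X @ np.append(O_prev, 1.0))

def forward(z, Xs):
    """Return the list of layer outputs [O_0, O_1, ..., O_L] for input z."""
    outputs = [np.asarray(z, dtype=float)]
    for X in Xs:
        outputs.append(layer(outputs[-1], X))
    return outputs

rng = np.random.default_rng(0)
widths = [4, 3, 2]  # hypothetical n_0, n_1, n_2, so L = 2
# X_k has shape n_k x (n_{k-1} + 1); the extra column multiplies the appended 1.
Xs = [rng.normal(size=(widths[k + 1], widths[k] + 1)) for k in range(len(widths) - 1)]

outputs = forward(rng.normal(size=widths[0]), Xs)
print([O.shape for O in outputs])
```

Keeping the intermediate outputs around is a deliberate choice: the derivatives computed during backpropagation are evaluated at exactly these values.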


Applying our matrix chain rule to $\ell(m(x)-y_i)$ (we won't write $z_i$ to save space), and writing $m_j$ for the output $O_j$ of layer $j$, we get the expression

\begin{equation}
\frac{d\ell}{dx} = \frac{\partial \ell}{\partial m} \frac{dm}{dx} = \frac{\partial \ell}{\partial m_L} \frac{\partial m_L}{\partial m_{L-1}} \cdots \frac{\partial m_2}{\partial m_1} \frac{\partial m_1}{\partial x}.
\end{equation}

Here, $\frac{\partial \ell}{\partial m}$ is a $p_L$ dimensional row vector, and $\frac{\partial m_i}{\partial m_{i-1}}$ is a $p_i \times p_{i-1}$ matrix, where $p_i$ denotes the output dimension of layer $i$ (written $n_i$ in (NNN)).

In modern architectures, the layer dimensions (also called layer widths) $p_i$ can be very large (on the order of hundreds of thousands or even millions), meaning the $\frac{\partial m_i}{\partial m_{i-1}}$ matrices are *very* large: too large to store in memory explicitly.

Fortunately, since $\frac{\partial \ell}{\partial m}$ is a row vector, we can build $\frac{d\ell}{dx}$ by sequentially computing inner products. For example, if $\frac{\partial m_L}{\partial m_{L-1}} = \begin{bmatrix} a_1 & \cdots & a_{p_{L-1}} \end{bmatrix}$ has columns $a_1, \ldots, a_{p_{L-1}} \in \mathbb{R}^{p_L}$, then

\begin{align*}
\frac{\partial \ell}{\partial m_L} \frac{\partial m_L}{\partial m_{L-1}} &= \underbrace{\frac{\partial \ell}{\partial m_L}}_{1 \times p_L} \underbrace{\begin{bmatrix} a_1 & \cdots & a_{p_{L-1}} \end{bmatrix}}_{p_L \times p_{L-1}} \\
&= \begin{bmatrix} \frac{\partial \ell}{\partial m_L} a_1 & \cdots & \frac{\partial \ell}{\partial m_L} a_{p_{L-1}} \end{bmatrix},
\end{align*}

meaning we only ever need to store $\frac{\partial \ell}{\partial m_L}$ and a single column $a_i$ in memory at any given time, which is only $2p_L$ numbers, as opposed to $p_L \times p_{L-1}$. Then once we've computed $\frac{\partial \ell}{\partial m_L} \frac{\partial m_L}{\partial m_{L-1}}$, which is now a $p_{L-1}$ dimensional row vector, we can continue our way down the chain.
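The idea of walking down the chain with a row vector can be sketched as follows. The Jacobians here are random matrices with hypothetical widths, used only to show that multiplying the row vector through one column at a time gives the same answer as first forming the full product of Jacobians. In this toy example the Jacobians are stored explicitly for convenience; in a memory-constrained setting the columns would be generated on the fly.

```python
import numpy as np

rng = np.random.default_rng(0)
p = [6, 5, 4, 3]  # hypothetical widths p_0, ..., p_3, so L = 3
# jacobians[i - 1] stands in for dm_i/dm_{i-1}, a p_i x p_{i-1} matrix.
jacobians = [rng.normal(size=(p[i], p[i - 1])) for i in range(1, len(p))]
dl_dmL = rng.normal(size=p[-1])  # stands in for the row vector dl/dm_L

def row_times_matrix(row, J):
    """Compute row @ J as a sequence of inner products, one column of J at a time."""
    return np.array([row @ J[:, k] for k in range(J.shape[1])])

# Walk down the chain from layer L to layer 1, only ever holding a row vector.
row = dl_dmL
for J in reversed(jacobians):
    row = row_times_matrix(row, J)

# Reference computation: form the full product of Jacobians first.
full = dl_dmL @ (jacobians[2] @ jacobians[1] @ jacobians[0])
print(np.max(np.abs(row - full)))
```

Both orderings give the same row vector, but the first never forms a matrix-matrix product.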

What's left to do is compute the partial derivatives. Let's break down $\frac{\partial \ell}{\partial x}$ into partial derivatives with respect to a layer's parameters $x_i$. For layer $L$, we have:

\begin{equation}
\frac{\partial \ell}{\partial x_L} = \frac{\partial \ell}{\partial m_L} \frac{\partial m_L}{\partial x_L} + \frac{\partial \ell}{\partial m_L} \frac{\partial m_L}{\partial m_{L-1}} \frac{\partial m_{L-1}}{\partial x_L} = \frac{\partial \ell}{\partial m_L} \frac{\partial m_L}{\partial x_L}
\end{equation}

Since $x_L$ appears in the last layer, it shows up right away in the first term above, which is the derivative of $m_L(m_{L-1};x_L)$ with respect to $x_L$ (the 2nd argument). The second term

\begin{equation}
\frac{\partial \ell}{\partial m_L} \frac{\partial m_L}{\partial m_{L-1}} \frac{\partial m_{L-1}}{\partial x_L} = 0
\end{equation}

which measures how $m_L$ changes with respect to changes in $m_{L-1}$ caused by changes in $x_L$, is zero because $m_{L-1}$ does not depend on $x_L$ at all! This is a key observation in the backpropagation algorithm!

Let's proceed to compute the derivative with respect to the parameters $x_{L-1}$ of the next layer down:

\begin{align*}
\frac{\partial \ell}{\partial x_{L-1}} &= \frac{\partial \ell}{\partial m_L} \cdot \frac{\partial m_L}{\partial m_{L-1}} \cdot \left( \frac{\partial m_{L-1}}{\partial x_{L-1}} + \frac{\partial m_{L-1}}{\partial m_{L-2}} \cdot \frac{\partial m_{L-2}}{\partial x_{L-1}} \right) \\
&= \frac{\partial \ell}{\partial m_L} \cdot \frac{\partial m_L}{\partial m_{L-1}} \cdot \frac{\partial m_{L-1}}{\partial x_{L-1}},
\end{align*}

where the second term in the parentheses vanishes because $m_{L-2}$ does not depend on $x_{L-1}$. We see again that we can "stop" the chain as soon as we reach the layer that depends explicitly on the parameters of interest. Formally, we have:

\begin{align*}
\frac{\partial \ell}{\partial x_L} &= \frac{\partial \ell}{\partial m_L} \cdot \frac{\partial m_L}{\partial x_L} \\
\frac{\partial \ell}{\partial x_{L-1}} &= \frac{\partial \ell}{\partial m_L} \cdot \frac{\partial m_L}{\partial m_{L-1}} \cdot \frac{\partial m_{L-1}}{\partial x_{L-1}} \quad \left( \frac{\partial \ell}{\partial m_{L-1}} = \frac{\partial \ell}{\partial m_L} \cdot \frac{\partial m_L}{\partial m_{L-1}} \right) \\
\frac{\partial \ell}{\partial x_{L-2}} &= \frac{\partial \ell}{\partial m_L} \cdot \frac{\partial m_L}{\partial m_{L-1}} \cdot \frac{\partial m_{L-1}}{\partial m_{L-2}} \cdot \frac{\partial m_{L-2}}{\partial x_{L-2}} \quad \left( \frac{\partial \ell}{\partial m_{L-2}} = \frac{\partial \ell}{\partial m_{L-1}} \cdot \frac{\partial m_{L-1}}{\partial m_{L-2}} \right) \\
\frac{\partial \ell}{\partial x_j} &= \frac{\partial \ell}{\partial m_L} \cdot \frac{\partial m_L}{\partial m_{L-1}} \cdot \frac{\partial m_{L-1}}{\partial m_{L-2}} \cdots \frac{\partial m_{j+1}}{\partial m_j} \cdot \frac{\partial m_j}{\partial x_j} \quad \left( \frac{\partial \ell}{\partial m_j} = \frac{\partial \ell}{\partial m_{j+1}} \cdot \frac{\partial m_{j+1}}{\partial m_j} \right)
\end{align*}

Notice that there is a lot of reuse of expressions, which means we don't have to recompute things over and over. In particular

\begin{align*}
\frac{\partial \ell}{\partial m_{L-1}} &= \frac{\partial \ell}{\partial m_L} \cdot \frac{\partial m_L}{\partial m_{L-1}}, \quad
\frac{\partial \ell}{\partial m_j} = \frac{\partial \ell}{\partial m_{j+1}} \cdot \frac{\partial m_{j+1}}{\partial m_j},
\end{align*}

and in general
\begin{equation*}
\frac{\partial \ell}{\partial m_{j-1}} = \frac{\partial \ell}{\partial m_j} \cdot \frac{\partial m_j}{\partial m_{j-1}}
\end{equation*}

where $\frac{\partial \ell}{\partial m_j}$ will have been computed at the layer above. This is another key piece of backpropagation!

The only thing left to compute is $\frac{\partial m_j}{\partial x_j}$. This is now just an exercise in calculus, so we won't work it out in class, but the chain rule provides all the tools needed for those interested.

Optional:
We apply our chain rule to $m_j = \sigma(w)$, with $w = X_j \begin{bmatrix} m_{j-1} \\ 1 \end{bmatrix}$, to get

\begin{equation*}
\frac{\partial m_j}{\partial x_j} = \frac{\partial}{\partial x_j} \sigma\left(X_j \begin{bmatrix} m_{j-1} \\ 1 \end{bmatrix}\right) = \frac{\partial \sigma}{\partial w} \cdot \frac{\partial w}{\partial x_j}
\end{equation*}

Now for $\sigma(w) = \begin{bmatrix} G(w_1) \\ \vdots \\ G(w_{p_j}) \end{bmatrix}$, the Jacobian is diagonal: $\frac{\partial \sigma}{\partial w} = \mathrm{diag}(G'(w_1), \ldots, G'(w_{p_j}))$. Next, we need to find

\begin{equation*}
\frac{\partial w}{\partial x_j} = \frac{\partial}{\partial x_j} \left(X_j \begin{bmatrix} m_{j-1} \\ 1 \end{bmatrix}\right).
\end{equation*}

This can be computed using matrix/linear algebra (tedious). We won't work it out, but note that it can be found efficiently.
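Putting the pieces together, the following is a minimal sketch of backpropagation for the layered model, not a definitive implementation. It assumes $G = \tanh$ and the squared loss $\ell(e) = \frac{1}{2}\|e\|^2$, stores each layer's gradient as a matrix with the same shape as $X_j$ rather than as a flattened vector $x_j$, and uses made-up widths and random data. The function names are hypothetical. The result is compared against central finite differences.

```python
import numpy as np

def forward(z, Xs):
    """Return the layer outputs [m_0, ..., m_L], with m_j = tanh(X_j @ [m_{j-1}; 1])."""
    ms = [np.asarray(z, dtype=float)]
    for X in Xs:
        ms.append(np.tanh(X @ np.append(ms[-1], 1.0)))
    return ms

def loss(z, y, Xs):
    """Assumed loss l(e) = 0.5 * ||e||^2 with e = m_L - y."""
    e = forward(z, Xs)[-1] - y
    return 0.5 * e @ e

def backprop(z, y, Xs):
    """Gradients of the loss with respect to each X_j, computed from the top layer down."""
    ms = forward(z, Xs)
    dl_dm = ms[-1] - y                          # dl/dm_L for the squared loss
    grads = [None] * len(Xs)
    for j in reversed(range(len(Xs))):          # Xs[j] maps ms[j] to ms[j + 1]
        a = np.append(ms[j], 1.0)               # layer input with the appended 1
        dl_dw = dl_dm * (1.0 - ms[j + 1] ** 2)  # tanh'(w) = 1 - tanh(w)^2
        grads[j] = np.outer(dl_dw, a)           # derivative with respect to X_j
        dl_dm = Xs[j][:, :-1].T @ dl_dw         # pass dl/dm down one layer (reuse)
    return grads

rng = np.random.default_rng(0)
widths = [3, 4, 2]  # hypothetical n_0, n_1, n_2
Xs = [rng.normal(size=(widths[k + 1], widths[k] + 1)) for k in range(len(widths) - 1)]
z, y = rng.normal(size=widths[0]), rng.normal(size=widths[-1])
grads = backprop(z, y, Xs)

# Compare every entry against a central finite difference of the loss.
eps = 1e-6
max_err = 0.0
for j, X in enumerate(Xs):
    for idx in np.ndindex(X.shape):
        old = X[idx]
        X[idx] = old + eps
        up = loss(z, y, Xs)
        X[idx] = old - eps
        down = loss(z, y, Xs)
        X[idx] = old
        max_err = max(max_err, abs((up - down) / (2 * eps) - grads[j][idx]))
print(max_err)
```

The single line updating `dl_dm` is where the reuse happens: each layer receives the derivative already computed at the layer above and only multiplies by its own Jacobian.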

