# 03: Back Propagation


![](../img/03_forward_backward.png)

The training loop for a neural network involves:

1. A **forward pass**: Feed the input features ($x_1$, $x_2$) through all the layers of the network to compute our predictions, $\hat{y}$.

2. Compute the **loss** (or cost), $\mathcal{L}(y, \hat{y})$, a function of the predicted values $\hat{y}$ and the actual values $y$.

3. A **backward pass** (or _back propagation_): Feed the loss $\mathcal{L}$ back through the network to compute the rate of change of the loss (i.e. the derivative) with respect to the network parameters (the weights and biases for each node, $w$, $b$).

4. Given their derivatives, update the network parameters ($w$, $b$) using an algorithm like gradient descent.
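
The four steps above can be sketched for the simplest possible "network": a single linear node with a squared-error loss. The model, the toy data, and the names (`w`, `b`, `alpha`) are assumptions chosen for illustration, not the network in the figure.

```python
# Toy data generated from y = 2x + 1 (an assumption for illustration).
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

w, b = 0.0, 0.0   # parameters, arbitrarily initialised to zero
alpha = 0.05      # learning rate

for epoch in range(2000):
    for x, y in data:
        # 1. Forward pass: compute the prediction.
        y_hat = w * x + b
        # 2. Loss: squared error for this data point.
        loss = (y_hat - y) ** 2
        # 3. Backward pass: derivatives of the loss w.r.t. w and b.
        dL_dy_hat = 2 * (y_hat - y)
        dL_dw = dL_dy_hat * x   # since d(y_hat)/dw = x
        dL_db = dL_dy_hat       # since d(y_hat)/db = 1
        # 4. Update: one gradient descent step per parameter.
        w -= alpha * dL_dw
        b -= alpha * dL_db

print(round(w, 3), round(b, 3))
```

With this noise-free data the parameters should end up close to the values used to generate it ($w = 2$, $b = 1$).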

## Forward Pass: Recap

Each node computes a linear combination of the output of all the nodes in the previous layer, for example:

$$
z_1^{[1]} = w_{1,1}^{[1]} x_1 + w_{2,1}^{[1]} x_2 + b_1^{[1]}
$$

This is passed to an activation function, $g$ (assumed here to be the same function in all layers), to create the final output, or "activation", of each node:

$$
a_3^{[2]} = g(z_3^{[2]})
$$

For example, $\hat{y}$ can be expressed in terms of the activations of the previous layer as follows:

$$
\hat{y} = a_1^{[3]} = g\left(w_{1,1}^{[3]} a_{1}^{[2]} + w_{2,1}^{[3]} a_{2}^{[2]} + w_{3,1}^{[3]} a_{3}^{[2]} + b_1^{[3]}\right)
$$

I'll introduce a more compact notation in a moment, but for now the terms not yet defined above are:
- $w_{j,k}^{[l]}$: The weight between node $j$ in layer $l-1$ and node $k$ in layer $l$.
- $a_k^{[l]}$: The activation of node $k$ in layer $l$.
- $b_k^{[l]}$: The bias term for node $k$ in layer $l$.
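
As a rough sketch of the forward pass, the snippet below computes $z$ and $a$ for a single node. It assumes a sigmoid for the activation function $g$, and the function names (`sigmoid`, `node_forward`) and all numbers are made up for illustration.

```python
import math

def sigmoid(z):
    """Assumed activation function g."""
    return 1.0 / (1.0 + math.exp(-z))

def node_forward(inputs, weights, bias):
    """Return (z, a) for one node: z = sum_j w_j * input_j + b, a = g(z)."""
    z = sum(w * a_prev for w, a_prev in zip(weights, inputs)) + bias
    return z, sigmoid(z)

# Node 1 in layer 1, fed by the two inputs x_1 and x_2 (arbitrary values).
x = [0.5, -1.0]
w_node1 = [0.2, 0.4]   # w_{1,1}^{[1]}, w_{2,1}^{[1]}
b_node1 = 0.1          # b_1^{[1]}

z1, a1 = node_forward(x, w_node1, b_node1)
print(z1, a1)
```

The same function can be reused for nodes in later layers by passing the previous layer's activations as `inputs`.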

## Gradient Descent

Let's consider a simpler network, with one input, two hidden nodes, and one output:

![](../img/03_backprop_example_params.png)

Here I've also included a node after the network's output to represent the calculation of the loss, $\mathcal{L}(y, \hat{y})$, where $\hat{y} = g(z_1^{[2]})$ is the predicted value from the network and $y$ the true value.

This network has seven parameters: $w_1^{[1]}$, $w_2^{[1]}$, $b_1^{[1]}$, $b_2^{[1]}$, $w_1^{[2]}$, $w_2^{[2]}$, $b_1^{[2]}$.

In gradient descent we use the partial derivatives of the loss function with respect to the parameters to update the network, making a small change to each parameter of the form:

$$
w_1^{[1]}  = w_1^{[1]} - \alpha\frac{\partial \mathcal{L}}{\partial w_1^{[1]}}
$$

where $\alpha$ is the learning rate.
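
The same update applies to every parameter at once. Here's a minimal sketch, assuming the parameters and their derivatives are stored in dictionaries; the helper name `gradient_descent_step`, the key names, and the gradient values are placeholders for illustration.

```python
def gradient_descent_step(params, grads, alpha):
    """Return new parameters after one step: p <- p - alpha * dL/dp."""
    return {name: value - alpha * grads[name] for name, value in params.items()}

# Two of the seven parameters, with placeholder gradient values.
params = {"w1_1": 0.5, "b1_1": 0.0}    # w_1^{[1]}, b_1^{[1]}
grads = {"w1_1": 0.2, "b1_1": -0.1}    # dL/dw_1^{[1]}, dL/db_1^{[1]}

new_params = gradient_descent_step(params, grads, alpha=0.1)
print(new_params)
```

A positive derivative decreases the parameter and a negative derivative increases it, so each step moves the parameters in the direction that reduces the loss (for a small enough $\alpha$).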

So to perform gradient descent we need the derivatives for each parameter, i.e. we need to compute:

$$
\frac{\partial \mathcal{L}}{\partial w_1^{[1]}},
\frac{\partial \mathcal{L}}{\partial w_2^{[1]}},
\frac{\partial \mathcal{L}}{\partial b_1^{[1]}},
\frac{\partial \mathcal{L}}{\partial b_2^{[1]}},
\frac{\partial \mathcal{L}}{\partial w_1^{[2]}},
\frac{\partial \mathcal{L}}{\partial w_2^{[2]}},
\frac{\partial \mathcal{L}}{\partial b_1^{[2]}}
$$

How can we compute all those terms?

## Background: Chain Rule

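The chain rule gives the derivative of a composition of functions. If

$$
h(x) = f(g(x))
$$

then, writing $u = g(x)$ (here $f$ and $g$ are generic functions, not the loss or activation function):

$$
\frac{\mathrm{d} h(x)}{\mathrm{d}x} = \frac{\mathrm{d} f(u)}{\mathrm{d}u} \frac{\mathrm{d} g(x)}{\mathrm{d}x} \\
h'(x) = f'(u)g'(x)
$$

For example: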

$$
f(u) = u^2 \\
g(x) = e^x + x  \\
h(x) = f(g(x)) = (e^x + x)^2 \\
$$

$$
g'(x) = e^x + 1 \\
f'(u) = 2u \\
u = g(x) = e^x + x \\
h'(x) = 2(e^x + x)(e^x + 1) \\
$$
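
One way to sanity-check a chain rule result like this is to compare it with a numerical estimate of the derivative. The sketch below does that with a central finite difference; the evaluation point `x0` and step size `eps` are arbitrary choices.

```python
import math

def g(x):
    return math.exp(x) + x

def f(u):
    return u ** 2

def h(x):
    return f(g(x))

def h_prime(x):
    """Chain rule result: h'(x) = f'(u) g'(x) with u = g(x)."""
    u = g(x)
    return 2 * u * (math.exp(x) + 1)

x0 = 0.5    # arbitrary point at which to check the derivative
eps = 1e-6  # small step for the finite difference

numeric = (h(x0 + eps) - h(x0 - eps)) / (2 * eps)
analytic = h_prime(x0)
print(numeric, analytic)
```

The two values should agree to several decimal places.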

### Multi-variate chain rule

## Back Propagation

Here's the example network again, but with each edge (arrow) labeled by the partial derivative between the two connected nodes:

![](../img/03_backprop_example_diffs.png)

To compute the derivative of the loss with respect to any term in the network we can use the chain rule. Starting with the loss on the right, we move "backwards" through the network, multiplying the partial derivatives until we get to the term we want.

### Example 1: Computing the gradient for $b_1^{[2]}$

$$
\frac{\color{red}{\partial \mathcal{L}}}{\color{blue}{\partial b_1^{[2]}}} = \frac{\color{red}{\partial \mathcal{L}}}{\color{green}{\partial a_1^{[2]}}} \frac{\color{green}{\partial a_1^{[2]}}}{\partial z_1^{[2]}} \frac{\partial z_1^{[2]}}{\color{blue}{\partial b_1^{[2]}}}
$$

Here we use the log loss for one data point (remembering that $\hat{y} = a_1^{[2]}$):

$$
\mathcal{L} = - y \log(a_1^{[2]}) - (1 - y)\log(1 - a_1^{[2]}) \\
\frac{\color{red}{\partial \mathcal{L}}}{\color{green}{\partial a_1^{[2]}}} = -\frac{y}{a_1^{[2]}} + \frac{1-y}{1-a_1^{[2]}}
$$

If using a sigmoid activation function:

$$
a_1^{[2]} = \frac{1}{1+\exp(-z_1^{[2]})} \\
\frac{\color{green}{\partial a_1^{[2]}}}{\partial z_1^{[2]}} = a_1^{[2]} (1 - a_1^{[2]})
$$

$z_1^{[2]}$ is a linear combination of its inputs:

$$
z_1^{[2]} = w_1^{[2]}a_1^{[1]} + w_2^{[2]}a_2^{[1]} + b_1^{[2]} \\
\frac{\partial z_1^{[2]}}{\color{blue}{\partial b_1^{[2]}}} = 1
$$

So overall we could write the loss derivative with respect to the bias as:

$$
\frac{\color{red}{\partial \mathcal{L}}}{\color{blue}{\partial b_1^{[2]}}} = \frac{\color{red}{\partial \mathcal{L}}}{\color{green}{\partial a_1^{[2]}}} \frac{\color{green}{\partial a_1^{[2]}}}{\partial z_1^{[2]}} \cdot 1 = \frac{\color{red}{\partial \mathcal{L}}}{\partial z_1^{[2]}}
$$
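
To make the three terms concrete, the sketch below evaluates each one for a single data point and multiplies them together. The values of `y` and `z` are arbitrary. For the log loss with a sigmoid output, multiplying out the first two terms gives $a_1^{[2]} - y$, which the code also checks.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

y = 1.0   # true label (arbitrary example)
z = 0.3   # z_1^{[2]}, the input to the output node's activation (arbitrary)
a = sigmoid(z)   # a_1^{[2]}, i.e. y_hat

dL_da = -y / a + (1 - y) / (1 - a)   # derivative of the log loss w.r.t. a
da_dz = a * (1 - a)                  # derivative of the sigmoid
dz_db = 1.0                          # z is linear in b with coefficient 1

dL_db = dL_da * da_dz * dz_db
print(dL_db, a - y)   # the product simplifies to a - y
```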


### Example 2: Computing the gradient for $w_2^{[1]}$

$$
\frac{\color{red}{\partial \mathcal{L}}}{\color{magenta}{\partial w_2^{[1]}}} =
\frac{\color{red}{\partial \mathcal{L}}}{\color{green}{\partial a_1^{[2]}}}
\frac{\color{green}{\partial a_1^{[2]}}}{\partial z_1^{[2]}}
\frac{\partial z_1^{[2]}}{\color{orange}{\partial a_2^{[1]}}}
\frac{\color{orange}{\partial a_2^{[1]}}}{\color{gray}{\partial z_2^{[1]}}}
\frac{\color{gray}{\partial z_2^{[1]}}}{\color{magenta}{\partial w_2^{[1]}}}
$$

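We've seen the form of most of these derivatives in the first example. The third term follows from the expression for $z_1^{[2]}$ above, $\partial z_1^{[2]} / \partial a_2^{[1]} = w_2^{[2]}$, and the fourth is the derivative of the activation function again. That leaves the last term: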

$$
z_2^{[1]} = w_2^{[1]}x + b_2^{[1]} \\
\frac{\color{gray}{\partial z_2^{[1]}}}{\color{magenta}{\partial w_2^{[1]}}} = x
$$

For weights in later layers, the input $x$ is replaced by the node activations $a$ of the previous layer. We can relabel $x = a_1^{[0]}$ to make the general pattern clearer.

The first four terms on the right side of the expression for the derivative can be simplified to $\color{red}{\partial \mathcal{L}} / \color{gray}{\partial z_2^{[1]}}$. Then we have:

$$
\frac{\color{red}{\partial \mathcal{L}}}{\color{magenta}{\partial w_2^{[1]}}} =
\frac{\color{red}{\partial \mathcal{L}}}{\color{gray}{\partial z_2^{[1]}}}
\frac{\color{gray}{\partial z_2^{[1]}}}{\color{magenta}{\partial w_2^{[1]}}}
=
\frac{\color{red}{\partial \mathcal{L}}}{\color{gray}{\partial z_2^{[1]}}}
a_1^{[0]}
$$
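
Here's a sketch of the same calculation in code for the one-input, two-hidden-node network, assuming sigmoid activations and the log loss throughout. The parameter values, the data point, and the key names are invented for illustration (the key `"w2_1"` stands for $w_2^{[1]}$, `"b1_2"` for $b_1^{[2]}$, and so on). The result is compared with a finite-difference estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(p, x):
    """Return the activations (a_1^{[1]}, a_2^{[1]}, a_1^{[2]})."""
    a1_1 = sigmoid(p["w1_1"] * x + p["b1_1"])
    a2_1 = sigmoid(p["w2_1"] * x + p["b2_1"])
    a1_2 = sigmoid(p["w1_2"] * a1_1 + p["w2_2"] * a2_1 + p["b1_2"])
    return a1_1, a2_1, a1_2

def log_loss(p, x, y):
    y_hat = forward(p, x)[2]
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)

# Arbitrary parameter values and a single data point.
p = {"w1_1": 0.4, "b1_1": 0.1, "w2_1": -0.3, "b2_1": 0.2,
     "w1_2": 0.7, "w2_2": -0.5, "b1_2": 0.05}
x, y = 1.5, 1.0

_, a2_1, a1_2 = forward(p, x)

# The five terms of the chain, from the loss back to w_2^{[1]}.
dL_da = -y / a1_2 + (1 - y) / (1 - a1_2)
da_dz = a1_2 * (1 - a1_2)
dz_da2 = p["w2_2"]
da2_dz2 = a2_1 * (1 - a2_1)
dz2_dw = x
dL_dw2_1 = dL_da * da_dz * dz_da2 * da2_dz2 * dz2_dw

# Finite-difference check: nudge w_2^{[1]} and see how the loss changes.
eps = 1e-6
p_plus = dict(p, w2_1=p["w2_1"] + eps)
p_minus = dict(p, w2_1=p["w2_1"] - eps)
numeric = (log_loss(p_plus, x, y) - log_loss(p_minus, x, y)) / (2 * eps)
print(dL_dw2_1, numeric)
```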

### Multiple Paths

One case is not covered by the simplified network and examples above: multiple paths from the output (loss) back to the term of interest, as in this network:

![](../img/03_backprop_multipath.png)

In this case you must sum the contributions from all the possible paths (this also follows from the multi-variate chain rule).
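
As a small illustration of summing over paths, the sketch below uses a hypothetical graph (not necessarily the one in the figure) in which an activation `a` feeds two nodes, `p` and `q`, that both contribute to a squared-error loss. The weights `w_p`, `w_q` and all values are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w_p, w_q, y = 0.8, -1.5, 1.0   # arbitrary weights and target

def loss_from_a(a):
    """Hypothetical graph: a feeds both p and q, and L = (p + q - y)^2."""
    p = sigmoid(w_p * a)
    q = sigmoid(w_q * a)
    return (p + q - y) ** 2

a = 0.6
p = sigmoid(w_p * a)
q = sigmoid(w_q * a)

# One chain-rule product per path from the loss back to a.
dL_dp = 2 * (p + q - y)
dL_dq = 2 * (p + q - y)
path_p = dL_dp * p * (1 - p) * w_p   # L -> p -> a
path_q = dL_dq * q * (1 - q) * w_q   # L -> q -> a
dL_da = path_p + path_q              # sum over both paths

# Finite-difference check of the summed result.
eps = 1e-6
numeric = (loss_from_a(a + eps) - loss_from_a(a - eps)) / (2 * eps)
print(dL_da, numeric)
```

Using only one of the two paths gives a value that does not match the numerical estimate.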

### Back Propagation and Efficiency

It's important to note that:

- The derivatives in the two examples share many terms in common (e.g. the derivative of the loss with respect to the final output)
- Each term is a fairly simple combination of quantities that must be computed during the forward pass (like the activation values in hidden layers)

These properties of back propagation form the basis for efficient implementations in major frameworks (PyTorch, TensorFlow, JAX etc.), mostly via:

- Matrix operations
- Computation graphs
- Caching intermediate values
- Automatic differentiation
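
To illustrate the sharing and caching ideas (not how any particular framework implements them), here's a hand-written sketch of a backward pass for the seven-parameter example network. It assumes sigmoid activations and the log loss, for which $\partial \mathcal{L} / \partial z_1^{[2]} = a_1^{[2]} - y$. The key names (`"w2_1"` for $w_2^{[1]}$ etc.) and all values are invented for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(p, x):
    """Forward pass; returns a cache of the values the backward pass needs."""
    a1 = sigmoid(p["w1_1"] * x + p["b1_1"])
    a2 = sigmoid(p["w2_1"] * x + p["b2_1"])
    y_hat = sigmoid(p["w1_2"] * a1 + p["w2_2"] * a2 + p["b1_2"])
    return {"x": x, "a1": a1, "a2": a2, "y_hat": y_hat}

def backward(p, cache, y):
    """Derivatives of the log loss for all seven parameters."""
    x, a1, a2, y_hat = cache["x"], cache["a1"], cache["a2"], cache["y_hat"]
    dz_out = y_hat - y                          # dL/dz_1^{[2]}, computed once
    dz_h1 = dz_out * p["w1_2"] * a1 * (1 - a1)  # dL/dz_1^{[1]}, reuses dz_out
    dz_h2 = dz_out * p["w2_2"] * a2 * (1 - a2)  # dL/dz_2^{[1]}, reuses dz_out
    return {
        "w1_2": dz_out * a1, "w2_2": dz_out * a2, "b1_2": dz_out,
        "w1_1": dz_h1 * x, "b1_1": dz_h1,
        "w2_1": dz_h2 * x, "b2_1": dz_h2,
    }

def log_loss(p, x, y):
    y_hat = forward(p, x)["y_hat"]
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)

params = {"w1_1": 0.4, "b1_1": 0.1, "w2_1": -0.3, "b2_1": 0.2,
          "w1_2": 0.7, "w2_2": -0.5, "b1_2": 0.05}
x, y = 1.5, 1.0

grads = backward(params, forward(params, x), y)

# Compare every derivative with a finite-difference estimate.
eps = 1e-6
max_error = 0.0
for name in params:
    plus = dict(params)
    plus[name] += eps
    minus = dict(params)
    minus[name] -= eps
    numeric = (log_loss(plus, x, y) - log_loss(minus, x, y)) / (2 * eps)
    max_error = max(max_error, abs(numeric - grads[name]))
print(max_error)
```

Note that `backward` never recomputes an activation: everything it needs was already produced by `forward`, and `dz_out` is computed once and reused in all seven derivatives.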

## Neural Network Matrix Notation

## Background: Vectors, Matrices, and NumPy

### Dot Products

### Matrix Multiplication

### Broadcasting

## Computation Graphs and Auto-Differentiation