In [1]:
%run Latex_macros.ipynb

In [2]:
# My standard magic! You will see this in almost all my notebooks.

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Reload all modules imported with %aimport
%load_ext autoreload
%autoreload 1

%matplotlib inline

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import neural_net_helper
%aimport neural_net_helper

nnh = neural_net_helper.NN_Helper()

# Notation review
- Layer $\ll$:
$1 \le \ll \le L$
$$
\y_\llp = a_\llp \left( f_\llp( \y_{(\ll-1)}, \W_\llp ) \right)
$$
- Layer $0$ is input
$$
\y_{(0)} = \x
$$
- Layer $L$ is the *head* (Classifier/Regression)
- Layer $(L+1)$ is the Loss layer
- We omit writing a separate bias term $\b_\llp$: we fold it into the weights $\W_\llp$
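
To make the bias folding concrete, here is a minimal NumPy sketch. It assumes that $f_\llp$ is a dot product (the notation above leaves $f_\llp$ general), and the names `y_prev`, `W`, `b`, `y_aug` and `W_aug` are illustrative; they are not part of `neural_net_helper`.

```python
import numpy as np

rng = np.random.default_rng(0)

y_prev = rng.normal(size=3)     # y_(l-1): output of the previous layer (3 features)
W = rng.normal(size=(4, 3))     # weights of a layer with 4 units
b = rng.normal(size=4)          # separate bias term, one per unit

# Explicit bias: dot product plus bias
explicit = W @ y_prev + b

# Folded bias: prepend a constant feature 1 to the input,
# and prepend b as an extra column of the weight matrix
y_aug = np.concatenate(([1.0], y_prev))   # shape (4,)
W_aug = np.column_stack((b, W))           # shape (4, 4)
folded = W_aug @ y_aug

print(np.allclose(explicit, folded))
```

Both forms give the same pre-activation, which is why the separate bias term can be dropped from the notation.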


<div>
    <center>Layer notation</center>
    <br>
<img src="images/NN_Layers_select.png"> <!-- Image source: NN_Layers.drawio; select only one box for export -->
    </div>

# Back propagation

Gradient Descent updates the weights $\W$ using the derivative of the loss $\loss$ with respect to $\W$.

$$
\W = \W - \alpha \cdot \frac{\partial \loss}{\partial \W}
$$

where $0 < \alpha \le 1$ is the learning rate.
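
The update rule can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the toy loss $(\W - 3)^2$, the helper names `toy_loss` and `toy_loss_grad`, the learning rate and the number of steps are all hypothetical choices, not taken from the notebook.

```python
import numpy as np

def toy_loss(W):
    # Hypothetical loss with its minimum at W = 3 (in every component)
    return np.sum((W - 3.0) ** 2)

def toy_loss_grad(W):
    # Analytic derivative of the toy loss with respect to W
    return 2.0 * (W - 3.0)

alpha = 0.1        # learning rate (assumed value)
W = np.zeros(2)    # initial estimate of the weights

for step in range(100):
    # Gradient Descent update: W = W - alpha * dLoss/dW
    W = W - alpha * toy_loss_grad(W)

print(W, toy_loss(W))
```

Each step moves $\W$ against the gradient, so the loss shrinks toward its minimum. In a Neural Network the gradient is not available in closed form like this, which is why a procedure for computing it is needed.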

Since each layer $\ll$ has its own weights $\W_\llp$, the derivatives needed are

$$
\frac{\partial \loss}{\partial \W_\llp} \;\text{for} \; \ll=1, \ldots, L
$$

We will show how to compute these derivatives via a procedure known as *Back propagation*.

It is really nothing more than an *iterated* application of the Chain Rule of Calculus.

Recall that we created layer $(L+1)$ to compute the Loss function
$$
 \y_{(L+1)} = \mathcal{L}
$$

where layer $L$ is the "head" (Classifier/Regression).

Our computation thus looks like:

<div>
<center>Additional Loss Layer (L+1)</center>
<br>
<img src="images/NN_Layers_plus_Loss.png">
</div>

We will compute the derivative of the Loss with respect to $\y_\llp$, for each layer $1 \le \ll \le (L+1)$.

Let
$$\loss'_\llp = \frac{\partial \loss}{\partial \y_\llp}$$ 
denote the derivative of $\loss$ with respect to the output of layer $\ll$, i.e., $\y_\llp$.

This is called the **loss gradient**.


The loss gradient can be computed for each layer sequentially in *reverse order*.

That is why the procedure is called *Back propagation*.

We start at the end, with the Loss layer:
$$
\begin{array}{lll}
\loss'_{(L+1)} & = & \frac{\partial{\loss}}{\partial \y_{(L+1)}} \\
& = & \frac{\partial{\y_{(L+1)}}}{\partial \y_{(L+1)}} \\
& = & 1
\end{array}
$$

We inductively work our way backwards
- Given $\loss'_\llp$
- Compute $\loss'_{(\ll-1)}$
- Using the chain rule

$$
\begin{array}{lll}
\loss'_{(\ll-1)} & = & \frac{\partial \loss}{\partial \y_{(\ll-1)}} \\
         & = & \frac{\partial \loss}{\partial \y_\llp} \frac{\partial \y_\llp}{\partial \y_{(\ll-1)}} \\
         & = & \loss'_\llp \frac{\partial \y_\llp}{\partial \y_{(\ll-1)}}
\end{array}
$$
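
The recursion can be checked numerically on a tiny example. The sketch below assumes a hypothetical network of two scalar layers, $\y_\llp = \tanh(w_\ll \, \y_{(\ll-1)})$, followed by a squared-error Loss layer; the weights, input and target are made-up values. It runs the recursion from layer $(L+1) = 3$ down to layer 1 and compares the result with a finite-difference estimate.

```python
import numpy as np

# Hypothetical scalar network: layers 1 and 2 are tanh units, layer 3 is the loss
w = {1: 0.5, 2: -1.5}
target = 0.3

def layer(l, y_prev):
    return np.tanh(w[l] * y_prev)

def loss_layer(y_head):
    return (y_head - target) ** 2

def loss_from(l, y_l):
    """Loss as a function of y_(l): apply the layers after l, then the loss layer."""
    y_out = y_l
    for k in range(l + 1, 3):
        y_out = layer(k, y_out)
    return loss_layer(y_out)

# Forward pass
y = {0: 0.8}
for l in (1, 2):
    y[l] = layer(l, y[l - 1])
y[3] = loss_layer(y[2])

# Local gradients d y_(l) / d y_(l-1)
local = {
    3: 2.0 * (y[2] - target),        # derivative of the squared error
    2: w[2] * (1.0 - y[2] ** 2),     # derivative of tanh(w * u) with respect to u
}

# Backward pass: start with loss gradient 1 at layer (L+1), then apply the chain rule
loss_grad = {3: 1.0}
for l in (3, 2):
    loss_grad[l - 1] = loss_grad[l] * local[l]

# Finite-difference estimate of d loss / d y_(1) for comparison
eps = 1e-6
numeric = (loss_from(1, y[1] + eps) - loss_from(1, y[1] - eps)) / (2 * eps)
print(loss_grad[1], numeric)
```

The two printed numbers should agree closely, since the recursion is an exact application of the chain rule and the finite difference only approximates it.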

The loss gradient "flows backward", from $\y_{(L+1)}$ to $\y_{(1)}$.

This is referred to as the *backward pass*.

<div>
<center><strong>Backward pass: Loss to Weights</strong></center>
<br>
<img src="images/NN_Layers_plus_Loss_backward.png">
</div>

Contrast this with the information flow that leads to the prediction $\hat{\y} = \y_{(L)}$
- Information flows forward, from input $\x$ to $\y_{(L)}$
- This is called the *forward pass*

<div>
<center><strong>Forward Pass: Input to Loss</strong></center>
<br>
<img src="images/NN_Layers_plus_Loss_forward.png">
</div>

The purpose of flowing the loss gradient backwards is to find the optimal value for $\W_\llp$, the weights for each layer $\ll$, $1 \le \ll \le L$
- Via Gradient Descent, which modifies the current estimate of $\W_\llp$
- Using the derivative of the loss with respect to $\W_\llp$
- Which can be obtained via another application of the Chain Rule


$$
\begin{array}{lll}
\frac{\partial \loss}{\partial \W_\llp} & = & \frac{\partial \loss}{\partial \y_\llp} \frac{\partial \y_\llp}{\partial \W_\llp} \\
 & = & \loss'_\llp \frac{\partial \y_\llp}{\partial \W_\llp}
\end{array}
$$
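
As a sanity check of this formula, here is a small sketch for a single layer. It assumes a dense layer with a sigmoid activation that feeds a squared-error loss directly; these choices, and the names `forward`, `sq_loss` and `dL_dW`, are illustrative assumptions rather than the notebook's own code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(W, y_prev):
    # y_(l) = a(f(y_(l-1), W)) with f a dot product and a the sigmoid
    return sigmoid(W @ y_prev)

def sq_loss(y, target):
    return 0.5 * np.sum((y - target) ** 2)

rng = np.random.default_rng(0)
y_prev = rng.normal(size=3)
W = rng.normal(size=(2, 3))
target = np.array([0.2, 0.9])

y = forward(W, y_prev)
loss_grad = y - target          # d loss / d y_(l) for the squared-error loss

# Chain rule: d loss / d W[i, j] = loss_grad[i] * y[i] * (1 - y[i]) * y_prev[j]
dL_dW = np.outer(loss_grad * y * (1.0 - y), y_prev)

# Finite-difference estimate, one weight at a time
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        numeric[i, j] = (sq_loss(forward(W_plus, y_prev), target)
                         - sq_loss(forward(W_minus, y_prev), target)) / (2 * eps)

print(np.max(np.abs(dL_dW - numeric)))
```

The derivative has the same shape as $\W_\llp$, so it can be used directly in the Gradient Descent update.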

Here is a larger picture of the flow during the Forward and Backward pass at layer $\ll$.

<div>
<center>Forward and Backward pass: Detail</center>
<br>
<img src="images/Backward_pass_detail.png">
</div>

Since $\y_\llp$ is a function of
- $\y_{(\ll-1)}$, the previous layer's output
- And $\W_\llp$, the weights of layer $\ll$.

$$
\y_\llp = a_\llp \left( f_\llp( \y_{(\ll-1)}, \W_{\llp}) \right) 
$$

the computation of $\frac{\partial \y_\llp}{\partial \W_\llp}$ depends on the functional forms of $a_\llp$ and $f_\llp$,
and can be obtained via the rules of Calculus.

The derivatives of $\y_\llp$ with respect to each of its inputs
- $\frac{\partial \y_\llp}{\partial \y_{(\ll-1)}}$
- $\frac{\partial \y_\llp}{\partial \W_\llp}$

are called **local gradients**.

Note that we can compute the local gradients
- During the **forward pass**
- Since the derivatives depend only on the layer's inputs $\y_{(\ll-1)}$ and weights $\W_\llp$, and not on any value computed after layer $\ll$

We will take advantage of this fact when we demonstrate some pseudo-code for the Forward and Backward passes.

In summary, the loss gradient of the preceding layer is the product of
- The loss gradient of the current layer
- The local gradient with respect to the layer's inputs
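
The two passes can be sketched together as follows. This is a minimal illustration, not the author's implementation: it assumes dense layers with a $\tanh$ activation and a squared-error loss, and the class `TanhDense` and function `run_forward` are hypothetical names. Each layer caches its local gradients during the forward pass and reuses them in the backward pass.

```python
import numpy as np

class TanhDense:
    """Hypothetical layer: y = tanh(W @ y_prev), with no separate bias term."""

    def __init__(self, W):
        self.W = W

    def forward(self, y_prev):
        y = np.tanh(self.W @ y_prev)
        da = 1.0 - y ** 2                       # derivative of tanh
        # Local gradient with respect to the inputs: Jacobian of shape (n_out, n_in)
        self.dy_dyprev = da[:, None] * self.W
        # Cache what is needed for the local gradient with respect to W
        self.da, self.y_prev = da, y_prev
        return y

    def backward(self, loss_grad):
        # Loss gradient times local gradient with respect to W
        self.dL_dW = np.outer(loss_grad * self.da, self.y_prev)
        # Loss gradient of the preceding layer
        return loss_grad @ self.dy_dyprev

rng = np.random.default_rng(1)
layers = [TanhDense(rng.normal(size=(4, 3))), TanhDense(rng.normal(size=(2, 4)))]
x = rng.normal(size=3)
target = np.array([0.5, -0.5])

def run_forward(x):
    # Forward pass: input to loss
    y = x
    for layer in layers:
        y = layer.forward(y)
    return 0.5 * np.sum((y - target) ** 2), y

loss_value, y_head = run_forward(x)

# Backward pass: the Loss layer has loss gradient 1, and its local gradient
# with respect to the head output is (y_head - target)
loss_grad = 1.0 * (y_head - target)
for layer in reversed(layers):
    loss_grad = layer.backward(loss_grad)

print(loss_value, layers[0].dL_dW.shape, layers[1].dL_dW.shape)
```

After the backward pass every layer holds `dL_dW`, the derivative of the loss with respect to its own weights, which is what the Gradient Descent update needs.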

# Conclusion

Gradient Descent depends on the ability to compute
- The derivative of the Loss with respect to the weights

We demonstrated a procedure called Back Propagation to compute these derivatives.

The forward pass of a Neural Network is the process of computing outputs (predictions) from inputs.

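Back propagation is what happens in the backward pass: the loss gradient is propagated from the Loss layer back through each layer, yielding the derivative of the loss with respect to each layer's weights.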


In [4]:
print("Done")

Done
