In [1]:
%run Latex_macros.ipynb

<IPython.core.display.Latex object>

In [2]:
# My standard magic !  You will see this in almost all my notebooks.

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Reload all modules imported with %aimport
%load_ext autoreload
%autoreload 1

%matplotlib inline

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import neural_net_helper
%aimport neural_net_helper

nnh = neural_net_helper.NN_Helper()

# Vanishing/exploding gradients

Now that we have a better view of how backward propagation of gradients works, we are equipped
to understand the difficulties of training the weights.

Until the problems were understood, and solutions found, the evolution of Deep Learning
was extremely slow.

Let's summarize back propagation up until this point:
- We compute the loss gradient $\loss'_\llp = \frac{\partial \loss}{\partial \y_\llp}$ of each layer $\ll$ in descending order

- The backward step  to compute the loss gradient of the preceding layer is:  
    - $\loss'_{(\ll-1)} =  \loss'_\llp \frac{\partial \y_\llp}{\partial \y_{(\ll-1)}}$

When we derived back propagation, we didn't look "inside" the "local gradient" $\frac{\partial \y_\llp}{\partial \y_{(\ll-1)}}$.

We will do so now.

Let's look more deeply into the term $\frac{\partial \y_\llp}{\partial \y_{(\ll-1)}}$:

$$
\begin{array}{llll}
\frac{\partial \y_\llp}{\partial \y_{(\ll-1)}} & = & \frac{\partial a_\llp ( f_\llp(\y_{(\ll-1)}, \W_\llp))}{\partial \y_{(\ll-1)}} & (\text{def. of } \y_\llp) \\
 & = & \frac{\partial a_\llp ( f_\llp(\y_{(\ll-1)}, \W_\llp) )}{\partial f_\llp(\y_{(\ll-1)}, \W_\llp)} \frac{\partial f_\llp(\y_{(\ll-1)}, \W_\llp)}{\partial \y_{(\ll-1)}} & (\text{chain rule}) \\
 & = & a'_\llp f'_\llp &
\end{array}
$$

where we define

$$
\begin{array}{llll}
a'_\llp & = & \frac{\partial a_\llp ( f_\llp(\y_{(\ll-1)}, \W_\llp) )}{\partial f_\llp(\y_{(\ll-1)}, \W_\llp)}  & \text{derivative of } a_\llp(\ldots) \text{ wrt } f_\llp(\ldots)\\
f'_\llp & = & \frac{\partial f_\llp(\y_{(\ll-1)}, \W_\llp)}{\partial \y_{(\ll-1)}} & \text{derivative of } f_\llp(\ldots) \text{ wrt } \y_{(\ll-1)}\\
\end{array}
$$

$a'_\llp$ is the derivative of the activation function $a_\llp$.

We won't explicitly write it out other than to observe that $a'_\llp \in [0,1]$ for the commonly used activation functions (sigmoid, tanh, ReLU).


Substituting the value of the loss gradient into the backward update rule:

$$
\begin{array}{lll}
\loss'_{(\ll-1)} & = &  \loss'_\llp \frac{\partial \y_\llp}{\partial \y_{(\ll-1)}} \\
         & = &  \loss'_\llp a'_\llp f'_\llp
\end{array}
$$

Hopefully, you can see that if we iterate through single backward steps, we can derive
an expression for the loss gradient at layer $\ll$ in terms of the loss gradient
of the final layer $K$:

Since
$$\loss'_\llp  =   \loss'_{(\ll+1)} \frac{\partial \y_{(\ll+1)}}{\partial \y_\llp}$$

we get
$$\loss'_\llp  =   \loss'_{(K)} \prod_{l'=\ll+1}^K  a'_{(l')} f'_{(l')}$$
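
To make the iteration concrete, here is a sketch of the first two backward steps, starting from the final layer $K$ and using only the single-step rule:

$$
\begin{array}{lll}
\loss'_{(K-1)} & = & \loss'_{(K)} \, a'_{(K)} f'_{(K)} \\
\loss'_{(K-2)} & = & \loss'_{(K-1)} \, a'_{(K-1)} f'_{(K-1)} \\
 & = & \loss'_{(K)} \, a'_{(K)} f'_{(K)} \, a'_{(K-1)} f'_{(K-1)}
\end{array}
$$

Each additional step toward the input multiplies in one more $a' f'$ factor.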

The issue is that, since
$$
0 \le a'_\llp \le \max_{z} a'_\llp(z)
$$

the product
$$\prod_{l'=\ll+1}^K {a'_{(l')}}
$$
can become increasingly small as the number of layers $K$ grows, if $\max_{z} a'_\llp(z) < 1$.

Note: for $a_\llp = \sigma$ (the sigmoid function), $\max_{z} a'_\llp(z) = 0.25$.
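
The value $0.25$ can be checked numerically. The following is a minimal sketch that evaluates the sigmoid's derivative $\sigma(z)(1 - \sigma(z))$ on a grid; the helper names `sigmoid` and `sigmoid_prime` and the grid range are illustrative choices, not part of `neural_net_helper`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # Derivative of the sigmoid: sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.linspace(-10.0, 10.0, 2001)     # illustrative grid that contains z = 0
derivative = sigmoid_prime(z)

max_derivative = derivative.max()      # largest slope found on the grid
argmax_z = z[derivative.argmax()]      # where that slope occurs

print(f"max slope = {max_derivative:.4f} at z = {argmax_z:.2f}")
```

The maximum sits at $z = 0$, and the slope falls off toward $0$ on both sides.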

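Thus, unless offset by the $f'_\llp$ terms, $\loss'_\llp$ will quickly diminish to $0$ as $\ll$ decreases,
i.e., as we seek to compute $\loss'_\llp$ for layers $\ll$ closest to the input.

Conversely, if the $f'_\llp$ terms are large, the same product can grow very large: this is the exploding gradient case.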
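
To get a feel for how quickly the bound on the product of $a'$ terms shrinks, here is a small sketch for sigmoid activations. It ignores the $f'$ terms entirely, and the depths in `layers_between` (the number of layers separating layer $\ll$ from the final layer) are arbitrary illustrative values.

```python
max_sigmoid_slope = 0.25               # maximum of the sigmoid's derivative
layers_between = [1, 5, 10, 20]        # illustrative numbers of intervening layers

# Upper bound on the product of the a' terms across that many sigmoid layers
upper_bound = [max_sigmoid_slope ** n for n in layers_between]

for n, bound in zip(layers_between, upper_bound):
    print(f"{n:2d} sigmoid layers: product of a' terms <= {bound:.3e}")
```

Even in the best case for the sigmoid, ten layers already push the bound below one in a million.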

This means 

$$
\begin{array}{lllll}
\frac{\partial \loss}{\partial \W_\llp} & = & \frac{\partial \loss}{\partial \y_\llp} \frac{\partial \y_\llp}{\partial \W_\llp} & = & \loss'_\llp \frac{\partial \y_\llp}{\partial \W_\llp}
\end{array}
$$
will approach $0$.
Since this term is used in the update to $\W_\llp$, the weights of the earliest layers barely change: those layers effectively stop learning.
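
The effect on the weight gradients can be illustrated with a small NumPy experiment. This is only a sketch under several assumptions that are not from the text: every layer is a dense layer $f_\llp = \W_\llp \y_{(\ll-1)}$ with no bias and a sigmoid activation, the weights are random, the depth and width are arbitrary, and the loss gradient at the final layer is simply set to a vector of ones.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_layers, width = 20, 32               # hypothetical depth K and layer width

# Random weights, one square matrix per layer (an assumption for illustration)
weights = [rng.normal(0.0, 1.0 / np.sqrt(width), size=(width, width))
           for _ in range(n_layers)]

# Forward pass: keep each layer's input y_(l-1) and activation slope a'_(l)
y = rng.normal(size=width)
layer_inputs, act_slopes = [], []
for W in weights:
    layer_inputs.append(y)
    y = sigmoid(W @ y)                 # y_(l) = a_(l)(f_(l)(y_(l-1), W_(l)))
    act_slopes.append(y * (1.0 - y))   # sigmoid slope: sigma(z) * (1 - sigma(z))

# Backward pass, starting from an assumed loss gradient at the final layer
loss_grad = np.ones(width)
weight_grad_norms = []
for W, y_prev, slope in zip(reversed(weights),
                            reversed(layer_inputs),
                            reversed(act_slopes)):
    delta = slope * loss_grad          # loss'_(l) times a'_(l)
    weight_grad_norms.append(np.linalg.norm(np.outer(delta, y_prev)))  # dL/dW_(l)
    loss_grad = W.T @ delta            # loss'_(l-1): multiply in f'_(l) = W_(l)
weight_grad_norms.reverse()            # index 0 is layer 1, closest to the input

for layer in (1, n_layers // 2, n_layers):
    print(f"layer {layer:2d}: |dL/dW| = {weight_grad_norms[layer - 1]:.3e}")
```

In this setup the weight gradient of the first layer is many orders of magnitude smaller than that of the last layer, so a gradient descent step would leave the early weights essentially unchanged.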

We can now diagnose one reason that training of early Deep Learning networks was difficult:
- the use of sigmoid activations was common (inspired by biology)
- if the inputs to the sigmoid were very large or very small, we are in a region where the sigmoid's derivative is close to $0$
- even at its maximum, the derivative of the sigmoid is only $0.25$, much smaller than $1$
- the end result was that deep networks suffered from Vanishing Gradients

The ReLU function's derivative is exactly $1$ for positive inputs, so it does not shrink the product in this way, and ReLUs now tend to be
the standard activation (barring other considerations, such as the range of outputs).
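
A quick sketch of the contrast: the product of activation derivatives along a chain of layers, for ReLU versus sigmoid. The pre-activation value `0.5` and the chain length of 20 are arbitrary illustrative choices, and the helper names are hypothetical.

```python
import numpy as np

def relu_prime(z):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise
    return (z > 0).astype(float)

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

z = np.full(20, 0.5)                   # hypothetical pre-activations, one per layer

relu_product = np.prod(relu_prime(z))
sigmoid_product = np.prod(sigmoid_prime(z))

print(f"ReLU:    product of a' terms = {relu_product:.1f}")
print(f"sigmoid: product of a' terms = {sigmoid_product:.3e}")
```

Note that the ReLU derivative is $0$ for negative inputs, so a unit whose input stays negative passes no gradient at all; the sketch only covers the positive case.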

# Conclusion

Something seemingly as simple as taking derivatives turned out to have some important subtleties.

Gradients that shrink to zero or grow too large are a real problem
- They can still hinder the use of very deep (many layers) networks
- This is particularly a problem in Recurrent networks
    - The depth of the "unrolled loop" is the length of the input sequence
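
The recurrent case can be sketched with a deliberately simplified model: a scalar hidden state, a single recurrent weight `w`, and no activation function, so the local gradient at every time step is just `w`. All of these are assumptions made for illustration, as are the specific values used.

```python
sequence_length = 100                  # hypothetical number of unrolled time steps

# Gradient factor accumulated across the whole unrolled loop, for three
# hypothetical values of the scalar recurrent weight w
factors = {w: w ** sequence_length for w in (0.9, 1.0, 1.1)}

for w, factor in factors.items():
    print(f"w = {w}: gradient factor after {sequence_length} steps = {factor:.3e}")
```

A weight slightly below $1$ makes the factor vanish, and a weight slightly above $1$ makes it explode, which is why long sequences are especially sensitive.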

We will explore techniques to manage the issue of vanishing and exploding gradients.

In [4]:
print("Done")

Done
