# Vanishing Gradients

### Introduction

In this lesson, we'll learn about the vanishing gradient problem.  Any deep neural network can suffer from vanishing gradients, but as we'll see, RNNs are particularly susceptible to this issue.  Let's dive in.

### Defining the Vanishing Gradient Problem

The vanishing gradient problem is the following:
> When determining how to update our parameters through backpropagation, if the local gradients are less than one in magnitude, the overall gradient gets drastically smaller as we backpropagate further.

The vanishing gradient problem can occur in the feedforward and convolutional neural networks that we have seen so far.  But RNNs are particularly susceptible to it.

### RNNs and Vanishing Gradients

Now recall that with an RNN, the key component is that we are repeatedly calculating a hidden state, where that hidden state is a function of the previous hidden state, our current word embedding, and our weights and biases: $W_h$, $W_e$, and $b$.  We calculate the hidden state with the following formula:

$h_t = \text{tanh}(W_h h_{t-1} + W_e e_t + b)$
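The hidden-state update above can be sketched in NumPy. This is only an illustration: the sizes, the random weights, the random stand-in embeddings, and the helper name `rnn_step` are all assumptions, not values from a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
hidden_size, embed_size = 2, 3

# Randomly initialized parameters (assumed values, not a trained model).
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden-to-hidden weights
W_e = rng.normal(scale=0.1, size=(hidden_size, embed_size))   # embedding-to-hidden weights
b = np.zeros(hidden_size)


def rnn_step(h_prev, e_t):
    """Apply the update tanh(W_h h_prev + W_e e_t + b) once."""
    return np.tanh(W_h @ h_prev + W_e @ e_t + b)


# Random stand-ins for the embeddings of four words.
embeddings = rng.normal(size=(4, embed_size))

h = np.zeros(hidden_size)  # initial hidden state
hidden_states = []
for e_t in embeddings:
    h = rnn_step(h, e_t)   # each hidden state depends on the previous one
    hidden_states.append(h)

print(hidden_states[-1])   # hidden state after the fourth word
```

Note that the same `W_h` is reused at every step of the loop, which is the detail that matters for the rest of this lesson.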

Ultimately, we'll take that final hidden state, and make a prediction as to the next word, and from there determine how to update our parameters after calculating how off our prediction was.  

To make things more concrete, let's say that we are calculating our hidden state for our fourth word at step 4, and this means that we calculate our hidden state according to the following:

* $h_4 = \text{tanh}(W_h \cdot h_{3} + W_e \cdot e_{w4} + b)$

Remember that we'll use this hidden state, $h_4$, to predict our next word $w_5$, and that we calculate our loss at that word as $J_4(w_5, h_4)$.

> The above is just saying that our loss is a function of the next word we are predicting, $w_5$ and the hidden state for the current word $h_4$.

Now, we have some loss $J_4$, and our next step is to update our parameters to reduce that loss through gradient descent.  At first glance, the only quantities that influence our prediction of $w_5$ are the ones in our formula for $h_4$ above: $W_h$, $h_3$, $W_e$, $e_{w4}$, and $b$.  But really, $h_3$ arises as a function of all of the previous hidden states.

Let's remind ourselves of this visually.

<img src="h1-with-words.png" width="40%">

As we see above, our prediction of the next word is (rightfully) influenced by the preceding words through our hidden states.  

Now, let's say we want to determine how to update $h_1$ so that we reduce our cost $J_4(\theta)$.  

> Or in other words, we want to update the influence of the word "dog" on our prediction of the next word, to reduce our cost $J_4$.

To do so, we need to calculate $\frac{\delta J_4(\theta)}{\delta h_1}$.

> **Really technical note**: While we cannot directly change $h_1$, we **can** change $W_h$ which influences $h_1$.  So to determine how to update $W_h$ for the purposes of changing $h_1$, we need to determine $h_1$'s influence on $J$.

Now calculating the influence of $h_1$ on $J_4$ becomes a problem, because this equals: 

$\frac{\delta J_4(\theta)}{\delta h_1} = \frac{\delta J_4(\theta)}{\delta h_4}*W_h*W_h*W_h$.

> We'll see why the math works out this way later on.  But for right now, let's just assume this is true.

The point is that we are repeatedly multiplying by $W_h$, and this is where our vanishing gradient problem comes in.

Let's see how.  Let's set $W_h$ to be the following matrix, and see what occurs as we multiply $W_h*W_h*W_h$.

```python
import numpy as np

W_h = np.array([[.5, .01], [.3, .5]])
W_h
```

```
array([[0.5 , 0.01],
       [0.3 , 0.5 ]])
```

```python
W_h.dot(W_h).dot(W_h)
```

```
array([[0.1295 , 0.00753],
       [0.2259 , 0.1295 ]])
```

So what we can see is that the more times we multiply by $W_h$, the smaller our numbers get.  And remember, this is the vanishing gradient problem:

> When determining how to update our parameters through backpropagation, if the local gradients are less than one in magnitude, the overall gradient gets drastically smaller as we backpropagate further.

So we see this above.  We multiply by $W_h$ once for every time step between the hidden state we use to make our prediction and the hidden state of the word whose influence we are measuring.
> So here we multiply by $W_h$ three times, but this difference could be much larger.

And every time we multiply by `W_h`, our gradient becomes smaller and smaller.
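To see how quickly this shrinkage compounds, here is a small sketch that raises the same `W_h` to higher powers. The particular step counts are arbitrary choices for illustration. It also prints the largest eigenvalue magnitude of `W_h` (its spectral radius), since a standard linear algebra result is that powers of a matrix shrink toward zero exactly when that value is below one.

```python
import numpy as np

W_h = np.array([[0.5, 0.01], [0.3, 0.5]])

# Largest absolute entry of W_h multiplied by itself k times,
# for a few arbitrarily chosen step counts k.
steps = [1, 3, 5, 10, 20]
max_entries = []
for k in steps:
    W_power = np.linalg.matrix_power(W_h, k)
    max_entries.append(np.abs(W_power).max())
    print(f"k={k:2d}  largest entry of W_h^k: {max_entries[-1]:.3e}")

# The shrinkage is governed by the eigenvalues of W_h:
# if every eigenvalue has magnitude below one, the powers decay toward zero.
eigenvalues = np.linalg.eigvals(W_h)
spectral_radius = np.abs(eigenvalues).max()
print("largest eigenvalue magnitude:", spectral_radius)
```

The flip side is also worth knowing: if the largest eigenvalue magnitude were above one, the same repeated multiplication would make the numbers grow instead of shrink.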

### Why it's a problem

Now that we have seen how the vanishing gradient occurs in recurrent neural networks, let's better appreciate why this is a problem.

This is a problem because oftentimes, when understanding the meaning of a document, words earlier on can be a strong predictor of the meaning of words later in the sequence.

> And remember that our task with our RNN is to repeatedly predict the next word.

For example, we saw the influence of the word dog in our current example:

<img src="h1-with-words.png" width="40%">

But the problem can be far more extreme than just a few words apart.  For example, consider one of our Amazon reviews:

> This is my favorite coconut water that I have tried.  I have tried a lot of different coconut water, and this is the _____.    

The prediction that the next word has a strong chance of being "best" is influenced by the word "favorite" earlier on.  But because the words are 19 steps apart, it would have a small influence in our current formulation of RNNs.

Another thing to point out is that a consequence of the vanishing gradient problem is not just that the influence of earlier words is small, but that it is drastically smaller than the influence of more recent words.  And as we saw in our example, it's sometimes the earlier words that are more predictive.
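As a rough sketch of that gap, we can compare the gradient signal reaching a word one step back with the signal reaching a word 19 steps back. This uses the simplified rule of one multiplication by `W_h` per step and the example matrix from earlier in this lesson. The `upstream` vector is a made-up stand-in for the gradient of the loss with respect to the hidden state used for the prediction, and `gradient_k_steps_back` is a hypothetical helper name.

```python
import numpy as np

W_h = np.array([[0.5, 0.01], [0.3, 0.5]])

# Made-up gradient of the loss with respect to the hidden state
# at the step where the prediction is made (an assumption for illustration).
upstream = np.array([1.0, 1.0])


def gradient_k_steps_back(k):
    """Simplified case: multiply the upstream gradient by W_h once per step."""
    return upstream @ np.linalg.matrix_power(W_h, k)


recent = np.linalg.norm(gradient_k_steps_back(1))    # one step back
distant = np.linalg.norm(gradient_k_steps_back(19))  # 19 steps back, as in the review
ratio = distant / recent

print("gradient norm 1 step back:  ", recent)
print("gradient norm 19 steps back:", distant)
print("ratio distant / recent:     ", ratio)
```

With this matrix, the signal for the distant word comes out several orders of magnitude smaller than the signal for the recent one, so the update driven by the distant word is effectively drowned out.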

### A little deeper into the math

Now, we previously just skipped to the end in specifying that our derivative is $\frac{\delta J_4(\theta)}{\delta h_1} = \frac{\delta J_4(\theta)}{\delta h_4}*W_h*W_h*W_h$.  Now let's take some time to see why this is the case.

Remember that when we find $\frac{\delta J_4(\theta)}{\delta h_1}$, we are trying to calculate the influence of our hidden state at time 1 (associated with dog) on our cost function at time 4.  As we see from the diagram below, the influence of $h_1$ on $J_4$ is indirect, and thus to calculate the gradient we'll need to use the chain rule.

<img src="./backprop-rnn1.png" width="40%">

In other words, we need to calculate how a nudge of $h_1$ influences $h_2$, and then how a nudge of $h_2$ influences $h_3$ and so on.  We can write this out as the following:

$\frac{\delta J_4}{\delta h_1} =  \frac{\delta J_4}{\delta h_4}  \frac{\delta h_4}{\delta h_3} \frac{\delta h_3}{\delta h_2}  \frac{\delta h_2}{\delta h_1} $

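Now, let's see the calculation of the three derivatives, $\frac{\delta h_4}{\delta h_3} \frac{\delta h_3}{\delta h_2}  \frac{\delta h_2}{\delta h_1} $.  We lay out the formula for each hidden state, and then the related derivative.  To keep the math simple, we drop the tanh activation and treat each hidden state as a linear function of the previous one.  Take a moment to go through them.


* $h_4 = W_h h_3 + W_e e_{w4} + b \rightarrow \frac{\delta h_4}{\delta h_3} = W_h$ 
* $h_3 = W_h h_2 + W_e e_{w3} + b  \rightarrow \frac{\delta h_3}{\delta h_2} = W_h$ 
* $h_2 = W_h h_1 + W_e e_{w2} + b \rightarrow \frac{\delta h_2}{\delta h_1} = W_h$ 

> If we keep the tanh, each local derivative becomes $\text{diag}(1 - h_t^2) W_h$.  Because $1 - h_t^2$ is never larger than one, the extra factor can only shrink the gradient further.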

So we can see that $\frac{\delta J_4}{\delta h_1} = \frac{\delta h_4}{\delta h_3} \frac{\delta h_3}{\delta h_2}  \frac{\delta h_2}{\delta h_1} \frac{\delta J_4}{\delta h_4} = W_h*W_h*W_h \frac{\delta J_4}{\delta h_4}$

And as we saw, when the entries of $W_h$ are small (more precisely, when all of its eigenvalues are less than one in magnitude), multiplying these $W_h$ matrices together leads to a vanishing gradient, where our gradient gets drastically smaller as we backpropagate further.

So now, if we ask how $h_1$ should be nudged to influence $h_4$, so that we ultimately reduce $J_4(\theta)$, the chain rule gives us:

$\frac{\delta h_4}{\delta h_1} = \frac{\delta h_4}{\delta h_3} \frac{\delta h_3}{\delta h_2}  \frac{\delta h_2}{\delta h_1} = W_h*W_h*W_h$

Each additional time step between the two hidden states contributes one more factor of $W_h$.
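The chain rule product can also be checked numerically. The sketch below keeps the tanh activation, so each local derivative is `diag(1 - h_t**2) @ W_h` rather than plain `W_h`. It then compares the chain rule product against a finite-difference estimate of how $h_4$ changes when $h_1$ is nudged. The embedding weights, the embeddings, the starting `h1`, and the helper names are all assumptions made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

W_h = np.array([[0.5, 0.01], [0.3, 0.5]])
W_e = rng.normal(scale=0.5, size=(2, 3))  # assumed embedding weights
b = np.zeros(2)
embeddings = rng.normal(size=(3, 3))      # stand-in embeddings for steps 2, 3, 4


def run_from_h1(h1):
    """Run steps 2 through 4 of the tanh RNN; return h_4 and all three states."""
    states, h = [], h1
    for e_t in embeddings:
        h = np.tanh(W_h @ h + W_e @ e_t + b)
        states.append(h)
    return h, states


def local_jacobian(h_t):
    """Derivative of h_t with respect to h_{t-1} when the tanh is kept."""
    return np.diag(1 - h_t**2) @ W_h


h1 = np.array([0.2, -0.1])                # arbitrary starting hidden state
h4, (h2, h3, _) = run_from_h1(h1)

# Chain rule: dh4/dh1 = dh4/dh3 @ dh3/dh2 @ dh2/dh1
analytic = local_jacobian(h4) @ local_jacobian(h3) @ local_jacobian(h2)

# Central finite differences: nudge each component of h1 and watch h4.
eps = 1e-6
numeric = np.zeros((2, 2))
for j in range(2):
    step = np.zeros(2)
    step[j] = eps
    numeric[:, j] = (run_from_h1(h1 + step)[0] - run_from_h1(h1 - step)[0]) / (2 * eps)

linear_case = np.linalg.matrix_power(W_h, 3)  # the W_h*W_h*W_h from the lesson

print("chain rule:\n", analytic)
print("finite differences:\n", numeric)
print("linear case W_h^3:\n", linear_case)
```

The two estimates should agree closely, and for this nonnegative `W_h` every entry of the tanh version is no larger than the matching entry of `W_h*W_h*W_h`, so the simplified derivation, if anything, understates how fast the gradient shrinks.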

### Summary

In this lesson, we learned about the vanishing gradient problem.  As mentioned, a vanishing gradient occurs when our gradient gets drastically smaller as we backpropagate further.  RNNs are particularly susceptible to vanishing gradients because the hypothesis function repeatedly uses the same weight matrix $W_h$ to calculate the next hidden state.  Because of this, when backpropagating, we multiply by $W_h$ once for each time step earlier than $t$.  And this means that when predicting the next word, $w_{t+1}$, our RNNs tend not to incorporate words from much earlier time steps, $w_{t-n}$.
