# LSTMs

### Defining the Vanishing Gradient Problem

The vanishing gradient problem is the following:
> When determining how to update our parameters through backpropagation, if a local gradient is small, the gradient gets drastically smaller as we backpropagate further.

The vanishing gradient problem can occur in the feedforward and convolutional neural networks that we have seen so far. But RNNs are particularly susceptible to it.

### RNNs and Vanishing Gradients

Now recall that with an RNN, the key component is that we repeatedly calculate a hidden state, where that hidden state is a function of the previous hidden state, our current word embedding, and our weights and biases $W_h$, $W_e$, and $b$.

$h^t = \text{tanh}(W_h h^{t-1} + W_e e^t + b)$
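As a rough illustration, the update above can be sketched in NumPy. The sizes, the random weights, and the function name `rnn_step` are assumptions made up for this example; they are not taken from a trained model.

```python
import numpy as np

def rnn_step(h_prev, e_t, W_h, W_e, b):
    """One vanilla RNN update: tanh(W_h h_prev + W_e e_t + b)."""
    return np.tanh(W_h @ h_prev + W_e @ e_t + b)

# Hypothetical sizes and random weights, chosen only for illustration.
hidden_size, embed_size = 3, 2
rng = np.random.default_rng(0)
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
W_e = rng.normal(scale=0.5, size=(hidden_size, embed_size))
b = np.zeros(hidden_size)

# Apply the same weights at every step of a short sequence of embeddings.
h = np.zeros(hidden_size)
embeddings = rng.normal(size=(4, embed_size))
for e_t in embeddings:
    h = rnn_step(h, e_t, W_h, W_e, b)
```

Note that the same `W_h` is reused at every step, which is the detail that matters for the rest of this lesson.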

Ultimately, we'll take that final hidden state, and make a prediction as to the next word, and from there determine how to update our parameters after calculating how off our prediction was.  

To make things more concrete, let's say that we are predicting the fifth word, so we make a hypothesis at step 4. Writing the word embedding as $x_4$ and its weight matrix as $W_x$ (these play the roles of $e^t$ and $W_e$ above), this means that we have:

* $h_4 = \text{tanh}(W_h h_3 + W_x x_4 + b)$

And that $J_4(\theta) = J_4(w_5, h_4)$ is a function of $h_4$. Now, as we know, we have some loss $J_4$, and we then want to update our parameters to reduce that loss. At first glance, it looks like the terms that predict $w_5$ are limited to what we see above: $W_h, h_3, W_x, x_4$. But really $h_3$ is itself a function of all of the previous hidden states.

Let's remind ourselves of this visually.

<img src="h1-with-words.png" width="40%">

As we see above, our prediction of the next word is (rightfully) influenced by the preceding words through our hidden states.  And let's say we want to determine how to update $h_1$ so that we reduce our cost $J_4(\theta)$.  

> Or in other words, we want to change how the word "dog" influences our prediction of the word after "over the", to reduce our cost $J_4$. 

To do so, we need to calculate $\frac{\delta J_4(\theta)}{\delta h_1}$.

> **Really technical note**: While we cannot directly change $h_1$, we **can** change $W_h$ which influences $h_1$.  So to determine how to update $W_h$ for the purposes of changing $h_1$, we need to determine $h_1$'s influence on $J$.

Now calculating the influence of $h_1$ on $J_4$ becomes a problem, because (setting aside the tanh for now) this equals:

$\frac{\delta J_4(\theta)}{\delta h_1} = \frac{\delta J_4(\theta)}{\delta h_4}*W_h*W_h*W_h$.

The point is that we are repeatedly multiplying by $W_h$.

> If you're curious, we'll show how we get there later on.  But for right now, let's just assume this is true.

And this is where our vanishing gradient comes in.  Let's see how.  Let's set $W_h$ to be the following matrix, and see what occurs as we multiply $W_h*W_h*W_h$.

    import numpy as np

    W_h = np.array([[.5, .01], [.3, .5]])
    W_h.dot(W_h).dot(W_h)  # W_h multiplied by itself three times

    # Output:
    # array([[0.1295 , 0.00753],
    #        [0.2259 , 0.1295 ]])

So what we can see is that the more times we multiply by $W_h$, the smaller our numbers get.  And remember, this is the vanishing gradient problem:

> When determining how to update our parameters through backpropagation, if a local gradient is small, the gradient gets drastically smaller as we backpropagate further.

### Why it's a problem

Now, this is a problem because oftentimes words earlier on can have a strong influence later down in the sequence.  For example, we saw the influence of the word dog in our current example:

<img src="h1-with-words.png" width="40%">

But the problem can be way more extreme than just a few words.  For example, consider one of our Amazon reviews:

> This is my favorite coconut water that I have tried.  I have tried a lot of different coconut water, and this is the _____.    

The prediction that the next word has a strong chance of being "best" is influenced by the word "favorite" earlier on.  But because the words are 19 steps apart, "favorite" would have only a small influence in our current formulation of RNNs.
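To get a feel for what a gap of 19 steps does, here is a small sketch that reuses the example matrix `W_h` from earlier in this lesson and compares multiplying it 3 times against 19 times. It assumes the simplified picture where each step back contributes exactly one factor of $W_h$.

```python
import numpy as np

W_h = np.array([[.5, .01], [.3, .5]])  # the example matrix from earlier

# W_h multiplied by itself 3 times (a few words back) vs. 19 times.
three_steps = np.linalg.matrix_power(W_h, 3)
nineteen_steps = np.linalg.matrix_power(W_h, 19)

# Largest entry of each product, as a rough measure of the remaining signal.
print(np.abs(three_steps).max())
print(np.abs(nineteen_steps).max())
```

The second number printed is several orders of magnitude smaller than the first, which is why the influence of a word that far back nearly disappears.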

Another thing to point out is that a consequence of the vanishing gradient problem is not just that the influence of earlier words is small, but that it is drastically smaller than the influence of more recent words.  And as we saw in our example, it's the earlier words that can be more predictive.

### A little deeper into the math

Now, we previously just skipped to the end in specifying that our derivative $\frac{\delta J_4(\theta)}{\delta h_1} = \frac{\delta J_4(\theta)}{\delta h_4}*W_h*W_h*W_h$.  Let's see why this is the case.

Remember when we find $\frac{\delta J_4(\theta)}{\delta h_1}$, we are trying to calculate the influence of our hidden state at time 1 on our cost function at time 4.  As we see from the diagram below, the influence of $h_1$ on $J_4$ is indirect, and thus we'll need the chain rule.

<img src="./backprop-rnn1.png" width="40%">

In other words, we need to calculate how a nudge of $h_1$ influences $h_2$, and then how a nudge of $h_2$ influences $h_3$ and so on.  We can write this out as the following:

$\frac{\delta J_4}{\delta h_1} = \frac{\delta h_4}{\delta h_3} \frac{\delta h_3}{\delta h_2}  \frac{\delta h_2}{\delta h_1} \frac{\delta J_4}{\delta h_4}$

Now, we can calculate those three derivatives, $\frac{\delta h_4}{\delta h_3} \frac{\delta h_3}{\delta h_2}  \frac{\delta h_2}{\delta h_1}$, individually. To keep the math simple, we drop the tanh and treat each hidden state as a linear function of the previous one (the derivative of tanh is at most 1, so including it can only shrink the product further):

* $h_4 = W_h h_3 + W_x x_4 + b \rightarrow \frac{\delta h_4}{\delta h_3} = W_h$
* $h_3 = W_h h_2 + W_x x_3 + b \rightarrow \frac{\delta h_3}{\delta h_2} = W_h$
* $h_2 = W_h h_1 + W_x x_2 + b \rightarrow \frac{\delta h_2}{\delta h_1} = W_h$

So we can see that $\frac{\delta J_4}{\delta h_1} = \frac{\delta h_4}{\delta h_3} \frac{\delta h_3}{\delta h_2}  \frac{\delta h_2}{\delta h_1} \frac{\delta J_4}{\delta h_4} = W_h*W_h*W_h \frac{\delta J_4}{\delta h_4}$
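One way to sanity-check this result is to nudge $h_1$ numerically and watch how $h_4$ moves. The sketch below does that for the simplified recurrence without tanh. The values of `W_x`, `b`, the inputs `xs`, and the starting `h1` are made up for illustration; only `W_h` comes from the earlier example.

```python
import numpy as np

W_h = np.array([[.5, .01], [.3, .5]])
W_x = np.array([[.2, -.1], [.4, .3]])   # hypothetical input weights
b = np.array([.1, -.2])                 # hypothetical bias
# Hypothetical inputs for steps 2, 3 and 4.
xs = {2: np.array([1., 0.]), 3: np.array([0., 1.]), 4: np.array([1., 1.])}

def h4_from_h1(h1):
    """Unroll the simplified (no tanh) recurrence from h_1 up to h_4."""
    h = h1
    for t in (2, 3, 4):
        h = W_h @ h + W_x @ xs[t] + b
    return h

# Finite-difference Jacobian: column j is how h_4 moves when we nudge h_1[j].
h1 = np.array([.3, -.7])
eps = 1e-6
jacobian = np.zeros((2, 2))
for j in range(2):
    nudge = np.zeros(2)
    nudge[j] = eps
    jacobian[:, j] = (h4_from_h1(h1 + nudge) - h4_from_h1(h1)) / eps

# Compare against the product of the three local derivatives.
analytic = W_h @ W_h @ W_h
print(np.allclose(jacobian, analytic, atol=1e-6))
```

The inputs and bias drop out of the derivative entirely, so the numerical Jacobian depends only on the repeated $W_h$ factors.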

Put another way, the influence of $h_1$ on $h_4$ is itself a chain of local derivatives. So if we ask how $h_1$ should be nudged to have an influence on $h_4$, so that we ultimately reduce $J_4(\theta)$, we see that:

$\frac{\delta h_4}{\delta h_1} = \frac{\delta h_4}{\delta h_3} \frac{\delta h_3}{\delta h_2}  \frac{\delta h_2}{\delta h_1} = W_h*W_h*W_h$

<img src="./vanishing-grad-intuition.png" width="40%">

Consider the gradient of the loss $J$ on step $i$ with respect to the hidden state on some previous step $j$.

* Because of the chain rule, we multiply by the weight matrix once for each step between $j$ and $i$, so we keep multiplying by the same weight matrix.
* If $W_h$ is small (roughly, if its largest eigenvalue has magnitude less than one), then the gradient gets vanishingly small as the current step $i$ and the hidden state at step $j$ get further apart.
* If instead $W_h$ is large (largest eigenvalue magnitude greater than one), the gradient can explode exponentially.
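The contrast between shrinking and exploding can be sketched by comparing two matrices: the example `W_h` from earlier, and a copy scaled up by 3 (a hypothetical choice, picked only so that its largest eigenvalue magnitude exceeds one). The helper name `spectral_radius` is also just for this example.

```python
import numpy as np

def spectral_radius(W):
    """Largest eigenvalue magnitude of W."""
    return np.abs(np.linalg.eigvals(W)).max()

W_small = np.array([[.5, .01], [.3, .5]])  # the matrix used earlier
W_large = 3 * W_small                      # hypothetical, scaled up for contrast

# Size of the repeated product W^k after k steps, for each matrix.
steps = (1, 5, 10, 20)
norms = {}
for name, W in [("small", W_small), ("large", W_large)]:
    norms[name] = [np.linalg.norm(np.linalg.matrix_power(W, k)) for k in steps]
    print(name, spectral_radius(W), norms[name])
```

For the first matrix the norms fall toward zero as the number of steps grows, and for the scaled-up matrix they grow without bound.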

Why are vanishing gradients bad?

<img src="./vanishing-grad-problem.png" width="40%">

* The gradient signal from far away is lost, even though sometimes we do want to learn the connection between what happens early and what happens later.
* The gradient can also be viewed as a measure of the effect of the past on the future.

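Remember our language model problem:

* When she tried to print tickets, she found the printer was out of toner.  But she finally printed her tickets.

To predict the final "tickets", the model needs to connect it to the "tickets" at the start of the sequence. But if the gradient is small, the model can't learn this dependency.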

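As another example, consider predicting the verb in:

* The writer of the books is

Two notions of recency are at play here:

* Syntactic recency (writer, is): the verb agrees with "writer", the word it is grammatically tied to.
* Sequential recency: just how close words are in the sequence. Here "books" sits right next to the verb.

If the model mostly pays attention to recent words, it leans on sequential recency. And RNNs are better at learning from sequential recency than syntactic recency.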

Exploding gradients are a problem because we then change the parameters too much in a single update.

### Solutions

1. For exploding gradients: gradient clipping
    * If the norm of the gradient is greater than some threshold, scale it down before applying the SGD update.
    * Intuition: we still take a step in the same direction, but a smaller one.

2. For vanishing gradients
    * Problem: it is too difficult for the RNN to learn to preserve information over many time steps.
    * In a vanilla RNN, the hidden state is constantly being rewritten, in particular because of the non-linearity.  So it is not easy to preserve information from one hidden state to the next.
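Returning to the first solution for a moment, gradient clipping can be sketched in a few lines. The function name `clip_gradient`, the threshold of 5, and the example gradient are all made up for illustration.

```python
import numpy as np

def clip_gradient(grad, threshold):
    """Scale grad down to norm `threshold` if its norm exceeds the threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        return grad * (threshold / norm)
    return grad

grad = np.array([30., 40.])                   # made-up "exploded" gradient, norm 50
clipped = clip_gradient(grad, threshold=5.0)  # same direction, norm 5
print(clipped)
```

The clipped gradient points the same way as the original, so the update still moves in the same direction, just with a smaller step.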

So what about an RNN with a separate memory?  Could it then learn to preserve information?

### LSTM

* An RNN designed to address the vanishing gradient problem.
* We now have both a hidden state and a cell state.
    * Both are vectors of length n
    * The cell stores the long-term information
    * The LSTM can erase, write and read information from the cell

* The selection of which information is erased, written or read is controlled by gates.
    * The gates are also vectors of length n
    * Each element of a gate is between 0 and 1, where 1 is open and 0 is closed
    * If a gate is open, the information is passed through, and if it is closed, the information is blocked
* The gates are dynamic: their values are computed based on the current context

#### Formula

On timestep $t$:

* Forget gate $f^t$: controls what is kept vs. forgotten from the previous cell state. It is computed as a sigmoid ($\sigma$) of the previous hidden state and the current input $x_t$.
* Input gate: controls what parts of the new cell content are written to the memory cell.
* Output gate: controls what parts of the cell are output to the hidden state (this is the read function).

With these gates, we update the cell:

* $c^t$: what we write to the cell
* $f^t$: what we decide to forget

Finally, we pass the cell through a tanh and the output gate to give us the hidden state.

So the cell is an internal memory, and the hidden state is what we pass on.
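The gates described above can be sketched as a single LSTM step in NumPy. This follows the commonly used LSTM formulation rather than anything specific to this lesson: the parameter names (`W_f`, `b_f`, and so on), the choice to concatenate the hidden state with the input, the sizes, and the random weights are all assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step. Returns the new hidden state and the new cell state."""
    z = np.concatenate([h_prev, x_t])                   # gates see h_prev and x_t
    f = sigmoid(params["W_f"] @ z + params["b_f"])      # forget gate
    i = sigmoid(params["W_i"] @ z + params["b_i"])      # input gate
    o = sigmoid(params["W_o"] @ z + params["b_o"])      # output gate
    c_new = np.tanh(params["W_c"] @ z + params["b_c"])  # candidate cell content
    c_t = f * c_prev + i * c_new    # forget some old content, write some new
    h_t = o * np.tanh(c_t)          # read from the cell into the hidden state
    return h_t, c_t

n, d = 3, 2  # hypothetical hidden size and input size
rng = np.random.default_rng(0)
params = {}
for gate in "fioc":
    params[f"W_{gate}"] = rng.normal(scale=0.5, size=(n, n + d))
    params[f"b_{gate}"] = np.zeros(n)

# Run a short sequence of random inputs through the same parameters.
h, c = np.zeros(n), np.zeros(n)
for x_t in rng.normal(size=(4, d)):
    h, c = lstm_step(x_t, h, c, params)
```

Each gate is a vector of values between 0 and 1 that is multiplied elementwise with the thing it controls, which is what "open" and "closed" mean in practice.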

<img src="olah-lstm-diagram.png" width="40%">

<img src="./lstm-formulas.png" width="40%">

### Goal of LSTMs

LSTMs make it easier to preserve information over many time steps.

How does the architecture make this easier?

* If the forget gate is set to remember everything on every step, then the information in the cell is preserved indefinitely.
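To see why, it helps to write out the cell update. This is a sketch assuming the standard LSTM cell update, where $i^t$ is the input gate, $\tilde{c}^t$ is our notation for the new cell content, and $\circ$ is elementwise multiplication:

$c^t = f^t \circ c^{t-1} + i^t \circ \tilde{c}^t$

If the forget gate is fully open ($f^t = 1$) and the input gate is fully closed ($i^t = 0$), this reduces to:

$c^t = 1 \circ c^{t-1} + 0 \circ \tilde{c}^t = c^{t-1}$

Along this direct path from $c^{t-1}$ to $c^t$, the local derivative is just $f^t$ rather than a repeated multiplication by $W_h$, so the cell can carry information (and gradient) across many steps without it shrinking.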

### Vanishing/exploding gradient

* Vanishing and exploding gradients are a problem for all neural architectures, especially deep ones.
* RNNs are particularly unstable, though, because at every step we multiply by the same weight matrix again and again.