In [1]:
%run Latex_macros.ipynb


# The Long Short-Term Memory (LSTM) layer

The "vanilla" Recurrent Neural Network (RNN) layer that we introduced was powerful but somewhat limited
- It suffered from gradients that vanished or exploded
- Its memory tended to be short-term
- It was unable to capture dependencies between elements that were too far separated in time

Researchers developed advanced Recurrent Network layer types to address these specific issues.

The *Long Short-Term Memory* (LSTM) layer is one such example.

# The importance of selectively forgetting

An RNN layer, at time step $\tt$
- Takes input element $\x_\tp$
- Updates latent state $\h_\tp$
- Optionally outputs $\y_\tp$

according to the equations

$$
\begin{array}{lll}
\h_\tp & = & \phi(\W_{xh}\x_\tp  + \W_{hh}\h_{(\tt-1)}  + \b_h) \\
\y_\tp & = &  \W_{hy} \h_\tp  + \b_y \\
\end{array}
$$

where 
- $\phi$ is an activation function (usually $\tanh$)
- $\W$ are the weights of the RNN layer
    - partitioned into $\W_{xh}, \W_{hh}, \W_{hy}$
    - $\W_{xh}$: weights that update $\h_\tp$ based on $\x_\tp$
    - $\W_{hh}$: weights that update $\h_\tp$ based on $\h_{(\tt-1)}$
    - $\W_{hy}$: weights that update $\y_\tp$ based on $\h_\tp$
- $\b_h, \b_y$ are the bias terms of the two equations
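
The two update equations can be sketched as a single NumPy function. This is a minimal illustration rather than a library implementation: the function name `rnn_step`, the dimensions, and the random weights are all assumptions chosen for the example.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y, phi=np.tanh):
    """One time step of a vanilla RNN layer (hypothetical helper)."""
    # New latent state: combines the current input and the previous state
    h_t = phi(W_xh @ x_t + W_hh @ h_prev + b_h)
    # Output: a linear function of the new latent state
    y_t = W_hy @ h_t + b_y
    return h_t, y_t

rng = np.random.default_rng(0)
n_x, n_h, n_y = 3, 4, 2            # illustrative sizes, not from the text

W_xh = rng.normal(size=(n_h, n_x))
W_hh = rng.normal(size=(n_h, n_h))
W_hy = rng.normal(size=(n_y, n_h))
b_h = np.zeros(n_h)
b_y = np.zeros(n_y)

h_0 = np.zeros(n_h)                # initial latent state (assumed all zeros)
x_1 = rng.normal(size=n_x)         # first input element
h_1, y_1 = rnn_step(x_1, h_0, W_xh, W_hh, W_hy, b_h, b_y)
```

Note that every feature of the latent state is recomputed at every step: nothing in these equations lets an individual feature be left untouched.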

<table>
    <tr>
        <th><center>RNN</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_layer.png"></td>
    </tr>
</table>

Latent state $\h_\tp$ is
- A *fixed length* encoding of the variable length input sequence $[\x_{(1)} \dots \x_\tp]$
- All essential information about the prefix of $\x$ ending at step $\tt$ is recorded in $\h_\tp$
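
The fixed-length property can be checked with a small sketch: sequences of different lengths are folded into latent states of the same size. The helper `encode`, the zero initial state, and the random weights are assumptions made for illustration only.

```python
import numpy as np

def encode(xs, W_xh, W_hh, b_h):
    """Fold a sequence into its final latent state (hypothetical helper)."""
    h = np.zeros(W_hh.shape[0])    # assumed initial latent state
    for x_t in xs:                 # one RNN update per sequence element
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    return h

rng = np.random.default_rng(1)
n_x, n_h = 3, 5                    # illustrative sizes

W_xh = rng.normal(size=(n_h, n_x))
W_hh = rng.normal(size=(n_h, n_h))
b_h = np.zeros(n_h)

short_seq = rng.normal(size=(2, n_x))    # sequence of length 2
long_seq = rng.normal(size=(50, n_x))    # sequence of length 50

h_short = encode(short_seq, W_xh, W_hh, b_h)
h_long = encode(long_seq, W_xh, W_hh, b_h)

# Both encodings have n_h features, regardless of sequence length
print(h_short.shape, h_long.shape)
```

Because the size of the latent state does not grow with the sequence, a longer prefix must be summarized in the same number of features as a shorter one.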


$\h_\tp$ is charged with several tasks
- Determining the time $\tt$ output $\y_\tp$ (for a many to many RNN)
- Determining the next latent state $\h_{(\tt+1)}$

But is each and every element $\x_{(\tt')}$ (for $1 \le \tt' \lt \tt$) needed for both of these tasks?

Probably not.

Consider sequence $\x$ as frames in a movie.
- Is every detail of the early scenes of a movie
- Relevant to the final scene?

It would be very powerful
- To be able to "forget" synthetic features that are **no longer** relevant
- And *selectively* update **immediately relevant** features
- While leaving unchanged those features needed **far in the future**


This is where the "gates" (`if`, `switch/case`) of Neural Programming are relevant.

They will be used in an LSTM
- To determine ("gate") when an individual feature in latent state $\h_\tp$ is forgotten (reset)
- To determine which individual features in latent state $\h_\tp$ get updated at step $\tt$
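
The idea of per-feature gating can be sketched with element-wise masks. This is only an illustration of the concept, not the LSTM equations: the function `gated_update`, the mask names `remember` and `update`, and the hand-set values are assumptions. In a trained network the gates would typically be computed from the data and take soft values between 0 and 1.

```python
import numpy as np

def gated_update(h_prev, candidate, remember, update):
    """Per-feature gated state update (illustrative, not the LSTM equations).

    remember[j] near 0 resets ("forgets") feature j of the previous state.
    update[j] near 1 writes the candidate value into feature j.
    """
    return remember * h_prev + update * candidate

h_prev = np.array([0.9, -0.5, 0.3])       # previous latent state
candidate = np.array([0.1, 0.8, -0.7])    # proposed new values

remember = np.array([0.0, 0.0, 1.0])      # forget features 0 and 1, keep feature 2
update = np.array([0.0, 1.0, 0.0])        # write only feature 1

h_new = gated_update(h_prev, candidate, remember, update)
# feature 0: forgotten (reset to 0.0)
# feature 1: overwritten with the candidate value 0.8
# feature 2: left unchanged at 0.3
print(h_new)
```

The element-wise multiplication plays the role of an `if` applied independently to each feature, which is what lets one feature be reset while another is carried forward untouched.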



# Conclusion

The LSTM is an advanced Recurrent Neural Network (RNN) layer type.

The "gating" mechanism of Neural Programming will be key.

It will endow the LSTM with
- The ability to forget a feature
- Selectively update other features

This will allow the LSTM to mitigate the drawbacks of "vanilla" RNN layers
- Short-term memory
- Vanishing/Exploding gradients

