In [1]:
%run Latex_macros.ipynb

# The LSTM API

A vanilla RNN uses the latent state vector $\h_\tp$ for two tasks
- Determining the output $\y_\tp$ at each step (for a many-to-many RNN)
- Updating the latent state from $\h_\tp$ to $\h_{(\tt+1)}$

An LSTM separates these two tasks, using
- $\h_\tp$ as "short-term" memory (control state)
    - For controlling state transition from $\h_\tp$ to $\h_{(\tt+1)}$

and an additional vector
- $\c_\tp$ as a "long-term" memory



Some analogies for the differing roles of short and long term memory
- Your computer
    - Short term memory: RAM (Random Access Memory)
    - Long term memory: a disk, memory stick, flash card
- Your office
    - Short term memory: the desktop
    - Long term memory: the filing cabinet
    


Let's introduce [the basic elements of an LSTM](LSTM_API.ipynb)

# Inside an LSTM Layer

Time to explore the inner workings of an LSTM.
- It will seem complicated at first
- Many small, inter-connected pieces
- Lots of literature with confusing diagrams


Here is a typical diagram from the literature that tries to explain an LSTM

<table>
    <tr>
        <th><center>LSTM diagram</center></th>
    </tr>
    <tr>
        <td><img src="https://upload.wikimedia.org/wikipedia/commons/3/3b/The_LSTM_cell.png" width=600></td>
    </tr>
    <tr>
        <td>Attribution: https://commons.wikimedia.org/wiki/File:The_LSTM_cell.png</td>
    </tr>
</table>

We will try to manage the complexity
- Following the "classical" derivation of the [original paper that introduced the LSTM](https://www.bioinf.jku.at/publications/older/2604.pdf)
- But influenced by an excellent [blog post](http://blog.echen.me/2017/05/30/exploring-lstms/)

Let's [go inside an LSTM](LSTM_Workings.ipynb) to see how this happens.

#  LSTM as gated residual connections

How does an LSTM circumvent the potential problem of vanishing and exploding gradients?

Recall that a very deep NN (such as an "unrolled" RNN on a long sequence)
- is exposed to the problem of Vanishing/Exploding gradients
- because the Loss Gradient gets moderated by
$$
\frac{\partial \y_\llp}{\partial \y_{(\ll-1)} }
$$
during Back Propagation

$$
\begin{array}{lll}
\loss'_{(\ll-1)} & = &  \loss'_\llp \frac{\partial \y_\llp}{\partial \y_{(\ll-1)}} \\
         & = &  \loss'_\llp a'_\llp f'_\llp
\end{array}
$$

- where $\y_\llp$ computes
$$
a_\llp ( f_\llp(\y_{(\ll-1)}, \W_\llp))
$$
- and $a'_\llp \lt 1$ over most of the domain for many activation functions (the sigmoid's derivative, for example, is at most $0.25$)
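
The shrinking effect of repeated multiplication by $a' f'$ can be illustrated with a small sketch. It assumes a deliberately simplified setting that is not part of the original material: every layer is a scalar sigmoid unit with the same pre-activation value and the same weight, so that $f' $ equals the weight. The function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of the sigmoid; its maximum value is 0.25, at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)


def backprop_factor(num_layers, pre_activation=0.0, weight=1.0):
    """Product of a' * f' across num_layers identical scalar layers.

    Assumption: f is a scalar linear map, so f' is just `weight`.
    """
    factor = 1.0
    for _ in range(num_layers):
        factor *= sigmoid_grad(pre_activation) * weight
    return factor


# The factor that scales the Loss Gradient shrinks geometrically with depth.
for depth in (1, 10, 50):
    print(depth, backprop_factor(depth))
```

Even in the most favorable case for the sigmoid (pre-activation of $0$), each layer scales the Loss Gradient by $0.25$, so the gradient reaching a layer 50 steps back is negligible. A weight much larger than $4$ would instead make the product grow, which is the exploding case.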


<table>
    <tr>
        <th><center>RNN Loss: Backward pass</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_layer_loss_gradient.png" width=70%></td>
    </tr>
</table>

But the LSTM has the ability
- through certain settings of the gates
- to have $\y_\llp$ compute
- the identity function
$$
\y_\llp = \y_{(\ll-1)}
$$

When this happens
$$
\frac{\partial \y_\llp}{\partial \y_{(\ll-1)} } = 1
$$
and the Loss Gradient flows backward undiminished.

Examine the update equation for the long term memory of an LSTM:

$$\c_\tp = \remember_\tp \otimes \c_{(t-1)} + \save_\tp \otimes \c'_\tp$$

and consider the case when
$$
\begin{array}{llll}
\remember_{\tp,j} & = & 1 & \text{Carry the prior value of the } j^{th} \text{ element forward in full} \\
\save_{\tp, j} & = & 0 & \text{But don't allow the candidate update value } \c'_{\tp,j} \text{ to participate}
\end{array}
$$



Then
$$
\c_{\tp,j} = \c_{(\tt-1),j}
$$

and the "circuit" for this element becomes a skip connection
- Passing through prior value of element $j$ unchanged
- No moderation of the Loss Gradient
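
A minimal sketch of this skip-connection behavior, using plain Python lists in place of vectors. The gate values are set by hand rather than computed from weights, and `direct_path_gradient` considers only the direct path from $\c_{(\tt-1)}$ to $\c_\tp$ with the gates held fixed; both are simplifying assumptions, and the function names are hypothetical.

```python
def update_memory(c_prev, remember, save, candidate):
    """Element-wise update: c_t = remember * c_prev + save * candidate."""
    return [r * c + s * k for c, r, s, k in zip(c_prev, remember, save, candidate)]


def direct_path_gradient(remember_value, steps):
    """Product of d c_t / d c_(t-1) over `steps` steps, holding the gates fixed.

    Along the direct path, each step contributes a factor equal to the remember gate.
    """
    gradient = 1.0
    for _ in range(steps):
        gradient *= remember_value
    return gradient


c_start = [0.7, -1.3, 2.0]
c = list(c_start)
for _ in range(100):
    # remember = 1, save = 0: the candidate values are ignored entirely.
    c = update_memory(c, remember=[1.0] * 3, save=[0.0] * 3, candidate=[0.9, -0.4, 0.1])

print(c == c_start)                    # memory is passed through unchanged
print(direct_path_gradient(1.0, 100))  # gradient factor stays at 1
print(direct_path_gradient(0.9, 100))  # a partly closed gate shrinks it
```

With the remember gate fully open, 100 time steps leave both the memory and the gradient factor untouched, while a gate value of $0.9$ already reduces the factor by several orders of magnitude over the same span.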

By 
- Having the ability to turn a layer into the identity transformation
- Implementing a "skip connection" of a residual network

the LSTM can avoid the problem of Vanishing/Exploding gradients.

If $\focus_{\tp,j}$ is also equal to $1$, then element $j$ of

$$\h_\tp = \focus_\tp \otimes \tanh(\c_\tp)$$
reduces to $\tanh(\c_{\tp,j})$, which is unchanged as long as $\c_{\tp,j}$ is unchanged.

So, if the Loss function would be minimized by an identity transformation
- The LSTM is able to implement this transformation
- By the appropriate choice of weights $\W$


# Initial bias to "not forget"

A very deep multi-layer network may have trouble learning in early epochs of training
- Randomly initialized (untrained) weights in deep layers
- No specialization of shallow layers to uncover features that would be useful to the deeper layers
- Resulting in large Loss function values
- Large Loss Gradients make weight updates unstable

The ability of the LSTM to implement the identity transformation comes to the rescue!

If we could force $\remember_\tp$ to $1$ early in training
- We could get the LSTM to implement the identity transformation
- And allow deep LSTM layers to not influence weight updates

We simply initialize the bias $\b_f$ of the $\remember$ gate to a large positive value
$$\remember_\tp   =  f_\tp   =  \sigma(\W_{x,f} \x_\tp + \W_{h,f}\h_{(t-1)} + \b_f) $$
which pushes the sigmoid output toward $1$.
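
The effect of the bias on the $\remember$ gate can be checked numerically. The sketch below uses a scalar version of the gate equation; the input, prior state, and small weights are made-up values chosen to mimic the start of training, and `remember_gate` is a hypothetical helper name.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def remember_gate(x, h_prev, w_x, w_h, b_f):
    """Scalar version of sigma(W_xf * x + W_hf * h_prev + b_f)."""
    return sigmoid(w_x * x + w_h * h_prev + b_f)


# Hypothetical values: small weights, as is typical right after initialization.
x, h_prev = 0.5, -0.2
w_x, w_h = 0.1, -0.1

# With small weights the bias dominates the gate's pre-activation.
for b_f in (0.0, 1.0, 5.0):
    print(b_f, remember_gate(x, h_prev, w_x, w_h, b_f))
```

With a bias of $0$ the gate sits near $0.5$ and half of the memory is discarded at every step; with a bias of $5$ it is above $0.99$. Some libraries expose this idea as an option; the Keras `LSTM` layer, for example, has a `unit_forget_bias` argument.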

# What is *really* going on inside an LSTM

The mechanics of the LSTM feel complicated.

But let's not let that obscure what is going on inside the LSTM.


Let's examine the update equation for the long-term memory $\c_\tp$
$$
\begin{array}{lll}
\c'_\tp  & = & \tanh(\W_{x,c} \x_\tp + \W_{h,c}\h_{(t-1)} + \b_c) \\
\c_\tp   & = & \remember_\tp \otimes \c_{(t-1)} + \save_\tp \otimes \c'_\tp
\end{array}
$$

Note that the candidate update value $\c'_\tp$
- Has been squashed by $\tanh$ to the range $[-1, +1]$
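
The two update equations can be sketched for a single memory element. This is a scalar illustration with hypothetical names and hand-picked weights, and the gate values are passed in directly rather than computed; it is not meant as a full LSTM implementation.

```python
import math


def candidate(x, h_prev, w_xc, w_hc, b_c):
    """Scalar candidate update: tanh(W_xc * x + W_hc * h_prev + b_c)."""
    return math.tanh(w_xc * x + w_hc * h_prev + b_c)


def memory_step(c_prev, x, h_prev, remember, save, w_xc, w_hc, b_c):
    """One update of a single long-term memory element.

    Returns the new memory value and the candidate that was used.
    """
    c_cand = candidate(x, h_prev, w_xc, w_hc, b_c)
    c_new = remember * c_prev + save * c_cand
    return c_new, c_cand


# A very large input still produces a candidate no larger than 1 in magnitude.
c_new, c_cand = memory_step(c_prev=2.0, x=10.0, h_prev=0.0,
                            remember=1.0, save=1.0,
                            w_xc=1.0, w_hc=0.0, b_c=0.0)
print(c_cand, c_new)
```

Because the candidate is bounded and the gates lie between $0$ and $1$, a single step can move a memory element by at most $1$ in either direction, however large the input is.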


When
$$
\begin{array}{llll}
\remember_{\tp,j} & = & 1 & \text{Allow prior value of } j^{th} \text{ element to be updated} \\
\save_{\tp, j} & = & 1 & \text{Allow the candidate update value } \c'_{\tp,j} \text{ to participate}
\end{array}
$$

then the $j^{th}$ long-term memory element acts like a counter!
- Incremented/Decremented by $\c'_{\tp,j} \in [-1, +1]$
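
As an illustration of the counter idea, the sketch below tracks parenthesis nesting depth with a single memory element. It is an idealized, hand-wired example: the candidate values of exactly $+1$, $-1$ and $0$ stand in for what a trained $\tanh$ unit might approximately produce, and nothing here is learned. The function name `run_counter` is hypothetical.

```python
def run_counter(text, open_char="(", close_char=")"):
    """Track nesting depth with one memory element and both gates fully open."""
    c = 0.0
    trace = []
    for ch in text:
        # Idealized candidate update: +1 on open, -1 on close, 0 otherwise.
        if ch == open_char:
            candidate = 1.0
        elif ch == close_char:
            candidate = -1.0
        else:
            candidate = 0.0
        remember, save = 1.0, 1.0  # keep the prior value and add the candidate
        c = remember * c + save * candidate
        trace.append(c)
    return trace


trace = run_counter("a(b(c)d)e")
print(trace)
```

The memory element rises with each opening delimiter and falls with each closing one, so its value at any position is the current nesting depth.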

In our module on [RNN Visualization](RNN_Visualization.ipynb)
- We speculated that the elements of the latent state $\h_\tp$
- Of a vanilla RNN
- Were implementing counters

<table>
    <tr>
        <th><center>State activations after seeing prefix of input</center></th>
    </tr>
    <tr>
        <td><img src="images/Unreasonable_effectiveness_cell2.png"></td>
    </tr>
</table>

You can imagine how a counter might be used in handling text input
- As a switch indicating being inside/outside of delimiters like quotation marks
- As a measure of how deeply nested an expression is (e.g., lists)
- As a count of characters in a sentence (increasing probability of seeing an end-of-sentence delimiter)

The update equation for long-term memory
$$
\begin{array}{lll}
\c'_\tp  & = & \tanh(\W_{x,c} \x_\tp + \W_{h,c}\h_{(t-1)} + \b_c) \\
\c_\tp   & = & \remember_\tp \otimes \c_{(t-1)} + \save_\tp \otimes \c'_\tp
\end{array}
$$
is an almost direct operational implementation of the counter concept.


# Conclusion

We introduced an advanced Recurrent Layer type called the LSTM.

Similar concepts are present in the Gated Recurrent Unit (GRU), another advanced Recurrent Layer type.

These advanced concepts were designed with the specific intent of dealing with long sequences.

They are thus quite common in domains such as Natural Language Processing, where long-range dependencies are common.

