# Lab 7: Recurrent Neural Networks

## Machine learning on sequences
When performing a task on a sequence, _memory_ is often very important.
To borrow an example from [Chris Olah's excellent blog post on RNNs](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) (you should probably read this!), a neural network trained to predict the next word will have trouble filling in the blank in "I grew up in France [ . . . ] I speak fluent  ____" without memory.
If it takes a fixed-size window as input, it may predict that the next word is the name of a language, but if the first sentence comes much earlier in the text, it would need to guess which language.

Another problem is how to deal with variable-length sequences.
1-D convolution is possible on a variable-length sequence, producing another variable-length sequence, but if we want to "sum up" an entire sequence as a single vector or number (for example, classifying a sentence by content), it's unclear how to do that.
Similarly, for a fixed-size input, it's unclear how we could produce a variable-sized output.
For instance, we might want to train a neural network to generate captions for images:

![image captioning RNN](./images/captioning.png)
(Image source: ["Deep Visual-Semantic Alignments for Generating Image Descriptions"](https://cs.stanford.edu/people/karpathy/deepimagesent/))

## Simple recurrent networks
These two problems suggest that we'd like to have a specialized kind of neural network for sequences, one which learns to pick out parts of the input to remember later.
This is the key idea behind a **recurrent neural network** (RNN), which maintains a **hidden state** that evolves as it reads through the input and governs how it produces output.

At every time step $t$, a simple RNN:
 1. Reads an input $\vec{x_t}$, and computes an activation $U \vec{x_t}$ from it, for a learned matrix $U$.
 2. Reads the previous hidden state $\vec{h_{t-1}}$, and computes an activation $W \vec{h_{t-1}}$ from it, for a learned matrix $W$.
 3. Computes a new hidden state $\vec{h_t}$ for this time step by applying a nonlinearity (usually tanh) to the sum of these two activations and a learned bias term $\vec{b}$: $\vec{h_t} = \tanh \big( \vec{b} + W \vec{h_{t-1}} + U \vec{x_t} \big)$
 4. Computes an output $\vec{y_t}$ for this time step using the new hidden state: $\vec{y_t} = \theta(\vec{c} + V \vec{h_t})$, where $\theta$ is an appropriate activation function and $\vec{c}$ is a learned bias term.
 
(For this formulation, see the [Deep Learning Book, Chapter 10](http://www.deeplearningbook.org/contents/rnn.html))
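The four steps above can be sketched in a few lines of plain Python. This is only an illustration, not how a framework implements it: the helper names (`matvec`, `rnn_step`), the parameter values, and the zero initial hidden state are all assumptions made for the example, and $\theta$ defaults to the identity.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_step(x_t, h_prev, U, W, V, b, c, theta=lambda z: z):
    """One time step of a simple RNN; returns (h_t, y_t)."""
    input_act = matvec(U, x_t)          # step 1: U x_t
    hidden_act = matvec(W, h_prev)      # step 2: W h_{t-1}
    h_t = [math.tanh(bi + wi + ui)      # step 3: tanh(b + W h_{t-1} + U x_t)
           for bi, wi, ui in zip(b, hidden_act, input_act)]
    y_t = theta([ci + vi for ci, vi in zip(c, matvec(V, h_t))])  # step 4
    return h_t, y_t

# Arbitrary illustrative parameters: 2 inputs, 3 hidden units, 1 output.
U = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
W = [[0.0, 0.1, 0.0], [0.2, 0.0, -0.1], [0.0, 0.3, 0.0]]
V = [[1.0, -1.0, 0.5]]
b = [0.0, 0.0, 0.0]
c = [0.1]

h0 = [0.0, 0.0, 0.0]                    # initial hidden state (assumed zero)
h1, y1 = rnn_step([1.0, 2.0], h0, U, W, V, b, c)
h2, y2 = rnn_step([0.0, 1.0], h1, U, W, V, b, c)  # h1 feeds into the next step
```

Note that the same `U`, `W`, `V`, `b`, and `c` are passed to both calls; only the input and the hidden state change between time steps.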

![RNN example](https://karpathy.github.io/assets/rnn/charseq.jpeg)
(Image source: ["The Unreasonable Effectiveness of Recurrent Neural Networks"](https://karpathy.github.io/2015/05/21/rnn-effectiveness/) by Andrej Karpathy, which you should definitely also read)

Any neural network that operates over multiple time steps, where the output or hidden state of one time step is used as input for others, is called "recurrent".
An RNN layer that updates a hidden state in exactly this way is called a "simple RNN" -- we will introduce more complicated RNN variants later. 

tanh is the activation function of choice for producing the hidden state of simple RNNs, since it has nice properties when applied repeatedly in sequence (which is what happens when processing multiple time steps): it is approximately linear near zero, its output is bounded between -1 and 1 so hidden state values cannot blow up, and unlike ReLU it does not set negative values to zero.
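To get a feel for what "applied repeatedly" means, here is a deliberately simplified scalar sketch: a single hidden unit with a hypothetical recurrent weight `w = 1.5`, no input, and no bias, updated 50 times with either tanh or ReLU. The function names and the weight are assumptions made for the example; a real RNN has a weight matrix and inputs at every step.

```python
import math

def relu(z):
    return max(0.0, z)

def iterate(activation, w, h0, steps):
    """Apply h <- activation(w * h) repeatedly, starting from h0."""
    h = h0
    for _ in range(steps):
        h = activation(w * h)
    return h

w = 1.5  # hypothetical recurrent weight, chosen larger than 1 on purpose

tanh_pos = iterate(math.tanh, w, 0.5, 50)    # stays strictly inside (-1, 1)
tanh_neg = iterate(math.tanh, w, -0.5, 50)   # negative values are preserved
relu_pos = iterate(relu, w, 0.5, 50)         # grows like 0.5 * 1.5**50
relu_neg = iterate(relu, w, -0.5, 50)        # clipped to zero at the first step
```

In this toy setting the tanh state remains bounded and keeps its sign, while the ReLU state either grows without bound or is zeroed out.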

One view is that this forms a network with "loops", or _layers with self-connections_.
These loops allow the hidden state in one time step $t-1$ to influence the hidden state in the next time step $t$, with additive influence from the input at time $t$ through one regular dense layer, and another regular dense layer producing the output at time $t$.

Equivalently, RNNs can be viewed **unrolled**, where different time steps are seen as different parts of the neural network.
In this view, every connection in an RNN is an ordinary dense layer, with the exceptions that:
 - the hidden-to-hidden layers share weights
 - the input-to-hidden layers share weights
 - the hidden-to-output layers share weights
 - the new hidden state is based on the sum of the hidden-to-hidden layer and the input-to-hidden layer, passed through an activation function

![unrolling RNNs](https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-unrolled.png)
(Image source: ["Understanding LSTM Networks"](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) on Colah's blog. In this image $h$ is the output and $A$ is the recurrent layer that carries the hidden state from one time step to the next.)
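As a worked example of unrolling, the first two time steps of the simple RNN can be written out explicitly. This assumes an initial hidden state $\vec{h_0}$, which the definition above does not name; it is often taken to be all zeros.

$$
\begin{aligned}
\vec{h_1} &= \tanh \big( \vec{b} + W \vec{h_0} + U \vec{x_1} \big) \\
\vec{h_2} &= \tanh \big( \vec{b} + W \vec{h_1} + U \vec{x_2} \big)
           = \tanh \Big( \vec{b} + W \tanh \big( \vec{b} + W \vec{h_0} + U \vec{x_1} \big) + U \vec{x_2} \Big) \\
\vec{y_2} &= \theta \big( \vec{c} + V \vec{h_2} \big)
\end{aligned}
$$

The same $U$, $W$, and $\vec{b}$ appear at both time steps, which is the weight sharing listed above: the unrolled network is a stack of dense layers that reuse one set of parameters.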

One neat interpretation of RNNs is that they learn to carry out an algorithm, with the elements of the hidden state acting as local variables.
In this view:
 - The input-to-hidden matrix $U$ determines how new inputs influence the local variables $h$
 - The hidden-to-hidden matrix $W$ determines how the local variables should influence each other in one step of the algorithm
 - The hidden-to-output matrix $V$ determines how the local variables should combine to produce one step of output
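To make the "local variables" view concrete, here is a toy with a single hidden unit whose weights are picked by hand (not learned) so that the hidden state behaves like a flag recording whether a 1 has appeared in the input so far. The weights and the function name `seen_a_one` are assumptions made purely for illustration.

```python
import math

# Hand-picked scalar weights, chosen only for this illustration.
U = 5.0   # a 1 in the input pushes the "flag" strongly towards 1
W = 5.0   # once set, the flag keeps itself near 1
b = 0.0

def seen_a_one(xs):
    """Run the one-unit RNN over xs and return the final hidden state."""
    h = 0.0
    for x in xs:
        h = math.tanh(b + W * h + U * x)
    return h

flag_set = seen_a_one([0, 0, 1, 0, 0])   # close to 1: a 1 was seen earlier
flag_unset = seen_a_one([0, 0, 0, 0, 0]) # stays at 0: no 1 was seen
```

Here $U$ decides how an input sets the variable and $W$ decides how the variable persists from one step to the next; a trained network would have to discover weights that play these roles on its own.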
 
RNNs can be made deep (multi-layer) too, by adding a second recurrent layer that takes as its (variable-length) input the sequence of outputs of the first recurrent layer.
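A minimal sketch of stacking, under the simplifying assumption that each recurrent layer's output at a time step is just its hidden state (so there is no separate $V$ here). The helper names and parameter values are made up for the example.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_layer(xs, U, W, b):
    """Run a simple recurrent layer over xs; return one hidden state per time step."""
    h = [0.0] * len(b)                   # initial hidden state (assumed zero)
    outputs = []
    for x in xs:
        h = [math.tanh(bi + wi + ui)
             for bi, wi, ui in zip(b, matvec(W, h), matvec(U, x))]
        outputs.append(h)
    return outputs

# Layer 1: 2 inputs -> 3 hidden units. Layer 2: 3 inputs -> 2 hidden units.
U1 = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
W1 = [[0.0, 0.1, 0.0], [0.2, 0.0, -0.1], [0.0, 0.3, 0.0]]
b1 = [0.0, 0.0, 0.0]
U2 = [[0.3, -0.1, 0.2], [0.0, 0.4, -0.3]]
W2 = [[0.1, 0.0], [-0.2, 0.1]]
b2 = [0.0, 0.0]

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
layer1_out = rnn_layer(xs, U1, W1, b1)           # sequence of 3-vectors
layer2_out = rnn_layer(layer1_out, U2, W2, b2)   # sequence of 2-vectors
```

The second layer sees a sequence of the same length as the original input, with the first layer's hidden states in place of the raw inputs.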

## Common structures for recurrent networks
RNNs can use their hidden state as memory, learning what to incorporate from the input ($U$), what to remember ($W$), and how to produce output based on what's remembered ($V$).

It turns out RNNs also solve our second problem with operating over sequences: they can handle arbitrary-length sequences as input or output.

![RNN structures](https://karpathy.github.io/assets/rnn/diags.jpeg)
(Image source: ["The Unreasonable Effectiveness of Recurrent Neural Networks"](https://karpathy.github.io/2015/05/21/rnn-effectiveness/))

RNNs "by default" look at one input vector and produce one output vector per timestep.
This results in a "synchronized" architecture, where every input is associated with an output that can also use information from previous timesteps.
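The "synchronized" behaviour can be checked on a small example: one output per input, where each output depends on the current and earlier inputs but not on later ones. The sketch below uses made-up parameters, a zero initial hidden state, and an identity output activation, all of which are assumptions for illustration.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def run_rnn(xs, U, W, V, b, c):
    """Return one output vector per input vector."""
    h = [0.0] * len(b)
    ys = []
    for x in xs:
        h = [math.tanh(bi + wi + ui)
             for bi, wi, ui in zip(b, matvec(W, h), matvec(U, x))]
        ys.append([ci + vi for ci, vi in zip(c, matvec(V, h))])
    return ys

# 1 input, 2 hidden units, 1 output (arbitrary values).
U = [[1.0], [-0.5]]
W = [[0.5, 0.0], [0.3, 0.2]]
V = [[1.0, 1.0]]
b = [0.0, 0.0]
c = [0.0]

xs = [[1.0], [0.0], [0.0]]
ys = run_rnn(xs, U, W, V, b, c)

ys_prefix = run_rnn(xs[:2], U, W, V, b, c)                    # drop the last input
ys_changed = run_rnn([[-1.0], [0.0], [0.0]], U, W, V, b, c)   # change the first input
```

`ys[:2]` matches `ys_prefix`, since earlier outputs never look ahead, while `ys[2]` differs from `ys_changed[2]` even though the input at that step is identical, because the hidden state remembers the first input.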

However, by only taking the output vector from the last timestep, an RNN can produce a fixed-length output from a variable-sized input.
Sentiment analysis is one example of this.
When the output vector is treated as a representation to be input to other layers in the neural network, it's interpreted as "summarizing" the sequence into a fixed-size vector.
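A sketch of this many-to-one pattern, with sequences of two different lengths reduced to a summary of the same size. The `classify` function is a hypothetical sentiment-style head (a single logistic unit applied at the last time step), and all parameter values are arbitrary.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def summarize(xs, U, W, b):
    """Return the hidden state after the last time step of xs."""
    h = [0.0] * len(b)
    for x in xs:
        h = [math.tanh(bi + wi + ui)
             for bi, wi, ui in zip(b, matvec(W, h), matvec(U, x))]
    return h

def classify(xs, params):
    """Output of the last time step only: sigmoid(c + v . h_T)."""
    h = summarize(xs, params["U"], params["W"], params["b"])
    z = params["c"] + sum(vi * hi for vi, hi in zip(params["v"], h))
    return 1.0 / (1.0 + math.exp(-z))

params = {
    "U": [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]],
    "W": [[0.0, 0.1, 0.0], [0.2, 0.0, -0.1], [0.0, 0.3, 0.0]],
    "b": [0.0, 0.0, 0.0],
    "v": [1.0, -1.0, 0.5],
    "c": 0.0,
}

short_seq = [[1.0, 0.0], [0.0, 1.0]]
long_seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5], [0.0, 0.0]]

short_summary = summarize(short_seq, params["U"], params["W"], params["b"])
long_summary = summarize(long_seq, params["U"], params["W"], params["b"])
p_short = classify(short_seq, params)
p_long = classify(long_seq, params)
```

Both summaries have three elements regardless of sequence length, so whatever comes after the RNN can be an ordinary fixed-size layer.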

Similarly, by providing an input vector only at the first timestep, we can have an RNN produce variable-length output from a fixed-size input.
For instance, if the input is a high-level representation learned by a convolutional network on images, we could feed this (fixed-size) vector into an RNN to produce a description of the image.
To know when to stop making new time steps, a special "end-of-sequence" output is added to the end of every training example; when the network outputs this symbol, generation stops.
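The generation loop with an end-of-sequence symbol might look roughly like the toy below. Everything specific here is an assumption made for illustration: the three-word vocabulary, the `"<eos>"` token name, the single hidden unit, the hand-picked (not learned) weights, and the `max_steps` safety limit. The input is a single number standing in for the fixed-size image representation.

```python
import math

VOCAB = ["a", "dog", "<eos>"]   # toy vocabulary; "<eos>" marks end of sequence

# Hand-picked scalar weights, chosen so the hidden state decays over time.
U, W, b = 2.0, 0.5, 0.0
V = [2.0, 1.0, 0.0]             # one output score per vocabulary entry
c = [-1.0, -0.3, 0.0]

def generate(x, max_steps=10):
    """Emit tokens until "<eos>" is produced or max_steps is reached."""
    h = 0.0
    tokens = []
    for t in range(max_steps):
        x_t = x if t == 0 else 0.0           # input is given only at the first step
        h = math.tanh(b + W * h + U * x_t)
        scores = [ci + vi * h for vi, ci in zip(V, c)]
        token = VOCAB[scores.index(max(scores))]   # pick the highest-scoring token
        if token == "<eos>":
            break                            # the network signalled that it is done
        tokens.append(token)
    return tokens

caption = generate(1.0)
```

With these particular weights the loop emits "a", then "dog", then the end-of-sequence symbol, so `caption` is `["a", "dog"]`. The `max_steps` cap is a practical guard in case the network never produces the stop symbol.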

Commonly, these two ideas are combined to form a "sequence to sequence" model in which one RNN (the "encoder") summarizes a sequence into a fixed-length vector, and a second RNN (the "decoder") takes the fixed-length vector and produces a variable-length output sequence.
This flexible architecture enables mapping between sequences of two different (arbitrary) lengths.
One obvious application is machine translation, though this idea has seen use in lots of places.
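Structurally, an encoder-decoder pair can be sketched as two small functions. This is a bare-bones illustration with untrained, arbitrary weights, so the emitted tokens are meaningless; the point is only the data flow. Feeding the summary as the decoder's input at its first step is one possible design (another common choice is to use it as the decoder's initial hidden state), and the names `encode`, `decode`, and `eos` are assumptions for the example.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def step(x, h, U, W, b):
    """Simple RNN hidden-state update: tanh(b + W h + U x)."""
    return [math.tanh(bi + wi + ui)
            for bi, wi, ui in zip(b, matvec(W, h), matvec(U, x))]

def encode(xs, enc):
    """Summarize a variable-length sequence as the encoder's final hidden state."""
    h = [0.0] * len(enc["b"])
    for x in xs:
        h = step(x, h, enc["U"], enc["W"], enc["b"])
    return h

def decode(summary, dec, eos, max_steps=10):
    """Emit token indices until `eos` is produced or max_steps is reached."""
    h = [0.0] * len(dec["b"])
    zeros = [0.0] * len(summary)
    tokens = []
    for t in range(max_steps):
        x = summary if t == 0 else zeros     # summary is fed only at the first step
        h = step(x, h, dec["U"], dec["W"], dec["b"])
        scores = [ci + si for ci, si in zip(dec["c"], matvec(dec["V"], h))]
        token = scores.index(max(scores))
        if token == eos:
            break
        tokens.append(token)
    return tokens

enc = {"U": [[0.5, -0.2], [0.1, 0.3]], "W": [[0.1, 0.2], [-0.3, 0.1]], "b": [0.0, 0.0]}
dec = {"U": [[1.0, 0.5], [-0.5, 1.0]], "W": [[0.6, 0.0], [0.0, 0.6]], "b": [0.0, 0.0],
       "V": [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]], "c": [0.0, 0.0, 0.1]}

summary_short = encode([[1.0, 0.0]], enc)
summary_long = encode([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 1.0]], enc)
output_tokens = decode(summary_long, dec, eos=2, max_steps=5)
```

The input length (1 or 4 steps here) and the output length (however many tokens come before the stop symbol) are decoupled; the only thing connecting the two networks is the fixed-size summary vector.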

If you're curious, check out: [the paper that introduced the idea](https://arxiv.org/abs/1409.3215) and [the Keras blog's tutorial](https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html).

## Backpropagation through time
TODO

NOTE: mention http://www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients/

## Long-term dependencies
TODO

### Exploding gradients
TODO

### Vanishing gradients
TODO

## LSTM recurrences
TODO

## Problems for recurrences
TODO

Link to:
 - https://bair.berkeley.edu/blog/2018/08/06/recurrent/
 - https://arxiv.org/abs/1803.01271

## Recurrent neural networks in Keras
TODO

## Integrating TensorFlow and Keras
TODO