# An Introduction to Recurrent Neural Networks

Recurrent neural networks (RNNs) have become increasingly popular in machine learning for handling sequential data. In this tutorial, we will cover background about the architecture, a toy training example, and some demos for evaluating state-of-the-art pre-trained models.

**Date**: June 26, 2019

**Authors**: 

---

## 0. Background

Note: The following figures were taken from excellent blog posts by [Christopher Olah](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) and [Jianqiang Ma](https://medium.com/@jianqiangma/all-about-recurrent-neural-networks-9e5ae2936f6e).

### 0.1 What is an RNN?

In both machine learning and everyday life, many tasks make use of **sequential** data. Take language understanding, for example. As you comprehend each word in this sentence, you draw upon information from the previous words. RNNs have an **inductive bias** for handling this type of data, as parameters are shared across positions in the sequence.

How does this architecture compare to "normal" neural networks? Let $A$ be a neural network, $x_t$ the input at time step $t$, and $h_t$ the output, or hidden state vector, at that step. Consider the following diagram (credit to [colah's post](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)):

![RNN](img/rnn_unrolled.png)

The self-loop at $A$ represents the **recurrence** of the network. We can "unroll" it into multiple copies of the same network $A$, sending information down the temporal chain.
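
To make the unrolling concrete, here is a minimal sketch of a vanilla RNN forward pass in plain Python. It assumes the common update $h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$; the names `rnn_step`, `rnn_forward`, `W_xh`, `W_hh`, and `b_h` are illustrative choices, not taken from the figure. The point to notice is that the same parameters are reused at every time step.

```python
import math
import random


def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One application of the cell A: h_t = tanh(W_xh x_t + W_hh h_prev + b_h)."""
    h_t = []
    for i in range(len(b_h)):
        pre = b_h[i]
        pre += sum(W_xh[i][j] * x_t[j] for j in range(len(x_t)))
        pre += sum(W_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
        h_t.append(math.tanh(pre))
    return h_t


def rnn_forward(xs, h0, W_xh, W_hh, b_h):
    """Unroll the recurrence over a sequence, reusing the same parameters each step."""
    hs, h = [], h0
    for x_t in xs:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
        hs.append(h)
    return hs


# Small random example (sizes are arbitrary, chosen only for illustration).
rng = random.Random(0)
input_size, hidden_size, seq_len = 3, 4, 5
W_xh = [[rng.uniform(-0.5, 0.5) for _ in range(input_size)] for _ in range(hidden_size)]
W_hh = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)] for _ in range(hidden_size)]
b_h = [0.0] * hidden_size
xs = [[rng.uniform(-1, 1) for _ in range(input_size)] for _ in range(seq_len)]

hs = rnn_forward(xs, [0.0] * hidden_size, W_xh, W_hh, b_h)
print(len(hs), len(hs[0]))  # one hidden state per time step
```

Plain lists are used here only to keep the arithmetic visible; in practice a library layer such as `torch.nn.RNN` would handle batching and gradients.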

In practice, many applications use a special type of RNN called a **long short-term memory (LSTM)** network. We'll call non-LSTM RNNs "vanilla RNNs".

### 0.2 What is an LSTM?

To motivate the popularity of LSTMs, recall why we were interested in RNNs in the first place: modeling *sequences*. The idea is that items at later parts of a sequence can depend on items at earlier parts of the sequence. The distance between these related items is often called the **dependency length**.

Dependency lengths can be quite short in some cases and quite long in others, even for the same task. Returning to the language example, suppose your task is to predict the last word in a sentence. If your sentence is 

>"A car has four *wheels*."

then the gap between the target word (here, "wheels") and the relevant parts of the sequence (here, "car") is relatively small. In contrast, consider the sentence

>"A car can have sentimental value for many owners for a variety of reasons, and can come in many models, sizes, and colors; nevertheless, one defining characteristic of such a machine is that it has four *wheels*."

Here, the gap between "wheels" and "car" is relatively large. Words closer to the end of the sentence such as "machine" and "four" can help guide the prediction of "wheels", but we need the word "car" to nail down the correct word.

In theory, vanilla RNNs are able to capture these long-term dependencies, but in practice they are hard to train to do so: gradients propagated back through many time steps tend to vanish or explode (see [Hochreiter, 1991](http://people.idsia.ch/~juergen/SeppHochreiter1991ThesisAdvisorSchmidhuber.pdf) and [Bengio et al., 1994](http://ai.dinfo.unifi.it/paolo//ps/tnn-94-gradient.pdf)). This is where LSTMs come in.
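
To make the difficulty with vanilla RNNs concrete, the following toy sketch tracks how much an early hidden state influences a later one. It assumes a hypothetical scalar RNN $h_t = \tanh(w\,h_{t-1} + x_t)$, which is a simplification chosen for illustration and not a model from the papers above. By the chain rule, $\partial h_T / \partial h_0$ is a product of $T$ factors $w\,(1 - h_t^2)$, so when $|w| < 1$ it shrinks at least as fast as $|w|^T$.

```python
import math


def gradient_through_time(w, xs, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad = h0, 1.0
    for x_t in xs:
        h = math.tanh(w * h + x_t)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2)
        grad *= w * (1.0 - h * h)
    return grad


w = 0.9  # recurrent weight (arbitrary value with |w| < 1)
for T in (5, 20, 50):
    g = gradient_through_time(w, [0.5] * T)
    print(f"T={T:2d}  |dh_T/dh_0| = {abs(g):.3e}")
```

The printed magnitudes fall off rapidly as the dependency length `T` grows, which is the vanishing-gradient effect in miniature. With a larger recurrent weight and activations that do not saturate, the same product can instead grow without bound.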

#### 0.2.1 LSTM architecture

![LSTM](img/lstm.png)

**TODO**: explain cell
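
As a rough sketch of what a single LSTM cell computes, the code below assumes the standard formulation with a forget gate `f`, an input gate `i`, candidate values `g`, and an output gate `o`, as described in Olah's post. The figure above may differ in detail, and the names `lstm_step`, `init_params`, and the `W_*`/`b_*` parameter keys are illustrative. The cell state is updated as $c_t = f \odot c_{t-1} + i \odot g$, and the hidden state is $h_t = o \odot \tanh(c_t)$.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, v):
    """Compute W v + b for a matrix W (list of rows) and vectors b, v."""
    return [b_k + sum(w * v_j for w, v_j in zip(row, v)) for row, b_k in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; returns the new hidden state h_t and cell state c_t."""
    v = h_prev + x_t  # list concatenation: [h_{t-1}, x_t]
    f = [sigmoid(z) for z in affine(p["W_f"], p["b_f"], v)]    # forget gate
    i = [sigmoid(z) for z in affine(p["W_i"], p["b_i"], v)]    # input gate
    g = [math.tanh(z) for z in affine(p["W_g"], p["b_g"], v)]  # candidate values
    o = [sigmoid(z) for z in affine(p["W_o"], p["b_o"], v)]    # output gate
    # Keep part of the old cell state, add part of the candidate.
    c_t = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    h_t = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c_t)]
    return h_t, c_t


def init_params(input_size, hidden_size, rng):
    """Random weights and zero biases for the four gates (illustrative init)."""
    p = {}
    for gate in "figo":
        p["W_" + gate] = [
            [rng.uniform(-0.5, 0.5) for _ in range(hidden_size + input_size)]
            for _ in range(hidden_size)
        ]
        p["b_" + gate] = [0.0] * hidden_size
    return p


rng = random.Random(0)
input_size, hidden_size = 3, 4
params = init_params(input_size, hidden_size, rng)
h, c = [0.0] * hidden_size, [0.0] * hidden_size
for x_t in [[rng.uniform(-1, 1) for _ in range(input_size)] for _ in range(6)]:
    h, c = lstm_step(x_t, h, c, params)
print(h)
```

The additive update of `c_t` is the key design choice: when the forget gate is close to 1, information in the cell state can pass across many time steps largely unchanged, which helps with long dependency lengths.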

### 0.3 Further reading

[The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)

[The Deep Learning textbook](http://www.deeplearningbook.org/) (Chapter 10 is most relevant)

## 1. Training

```python
import torch
```