# Recurrent Neural Networks
In this notebook, we are going to discuss recurrent neural networks (RNNs), a class of nets that can predict the future (well, up to a point, of course). They can analyze time series data such as stock prices, and tell you when to buy or sell. In autonomous driving systems, they can anticipate car trajectories and help avoid accidents. More generally, they can work on sequences of arbitrary lengths, rather than on fixed-sized inputs like all the nets we have discussed so far. For example, they can take sentences, documents, or audio samples as input, making them extremely useful for natural language processing (NLP) systems such as automatic translation, speech-to-text, or sentiment analysis (e.g., reading movie reviews and extracting the rater’s feeling about the movie).

Moreover, RNNs’ ability to anticipate also makes them capable of surprising creativity. You can ask them to predict which are the most likely next notes in a melody, then randomly pick one of these notes and play it. Then ask the net for the next most likely notes, pick one and play it, and repeat the process again and again. Before you know it, your net will compose a melody such as the one produced by Google’s Magenta project. Similarly, RNNs can generate sentences, image captions, and much more. The result is not exactly Shakespeare or Mozart yet, but who knows what they will produce a few years from now?

## Recurrent Neurons

A recurrent neural network looks very much like a feedforward neural network, except it also has connections pointing backward. Let’s look at the simplest possible RNN, composed of just one neuron receiving inputs, producing an output, and sending that output back to itself, as shown in Figure 1 (left). At each time step $t$ (also called a frame), this recurrent neuron receives the inputs $x_{(t)}$ as well as its own output from the previous time step, $y_{(t-1)}$. We can represent this tiny network against the time axis, as shown in Figure 1 (right). This is called unrolling the network through time.

<img src="../images/rnn_1.png"/>
_Figure 1: A recurrent neuron (left), unrolled through time (right)_


You can easily create a layer of recurrent neurons. At each time step $t$, every neuron receives both the input vector $x_{(t)}$ and the output vector from the previous time step $y_{(t-1)}$, as shown in Figure 2. Note that both the inputs and outputs are vectors now (when there was just a single neuron, the output was a scalar).

<img src="../images/rnn_2.png"/>
_Figure 2: A layer of recurrent neurons (left), unrolled through time (right)_

Each recurrent neuron has two sets of weights: one for the inputs $x_{(t)}$ and the other for the outputs of the previous time step, $y_{(t-1)}$. Let’s call these weight vectors $w_x$ and $w_y$. If we consider the whole recurrent layer instead of just one recurrent neuron, we can place all the weight vectors in two weight matrices, $W_x$ and $W_y$. The output vector of the whole recurrent layer can then be computed pretty much as you might expect, as shown in Equation 1 ($b$ is the bias vector and $\phi \left ( \cdot  \right )$ is the activation function, e.g., ReLU).

_Equation 1. Output of a recurrent layer for a single instance_
<img src="../images/eq_1.png"/>
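
As a rough illustration of Equation 1, here is a minimal NumPy sketch of one recurrent step for a single instance. The equation image is not reproduced in this text, so the form used below, $y_{(t)} = \phi(W_x^T x_{(t)} + W_y^T y_{(t-1)} + b)$, is an assumption based on the definitions above. The layer sizes, the random weights, and the choice of tanh as $\phi$ are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

n_inputs, n_neurons = 3, 5   # illustrative sizes (assumptions)

Wx = rng.normal(size=(n_inputs, n_neurons))    # weights for the current inputs
Wy = rng.normal(size=(n_neurons, n_neurons))   # weights for the previous outputs
b = np.zeros(n_neurons)                        # bias vector


def recurrent_step(x_t, y_prev, activation=np.tanh):
    """One time step of a basic recurrent layer for a single instance."""
    return activation(Wx.T @ x_t + Wy.T @ y_prev + b)


x_t = rng.normal(size=n_inputs)   # input vector at time step t
y_prev = np.zeros(n_neurons)      # stand-in for the previous output
y_t = recurrent_step(x_t, y_prev)
print(y_t.shape)
```

The output has one value per neuron, which is why a layer's output is a vector while a single neuron's output is a scalar.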

Just like for feedforward neural networks, we can compute a recurrent layer’s output in one shot for a whole mini-batch by placing all the inputs at time step $t$ in an input matrix $X_{(t)}$ (see Equation 2).

_Equation 2. Outputs of a layer of recurrent neurons for all instances in a mini-batch_
<img src="../images/eq_2.png"/>



- $Y_{(t)}$ is an $m \times n_{neurons}$ matrix containing the layer’s outputs at time step $t$ for each instance in the mini-batch ($m$ is the number of instances in the mini-batch and $n_{neurons}$ is the number of neurons).

- $X_{(t)}$ is an $m \times n_{inputs}$ matrix containing the inputs for all instances ($n_{inputs}$ is the number of input features).

- $W_x$ is an $n_{inputs} \times n_{neurons}$ matrix containing the connection weights for the inputs of the current time step.

- $W_y$ is an $n_{neurons} \times n_{neurons}$ matrix containing the connection weights for the outputs of the previous time step.

- $b$ is a vector of size $n_{neurons}$ containing each neuron’s bias term.

- The weight matrices $W_x$ and $W_y$ are often concatenated vertically into a single weight matrix $W$ of shape $(n_{inputs} + n_{neurons}) \times n_{neurons}$ (see the second line of Equation 2).

- The notation $[X_{(t)} \; Y_{(t-1)}]$ represents the horizontal concatenation of the matrices $X_{(t)}$ and $Y_{(t-1)}$.
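
The shapes listed above can be checked with a small NumPy sketch. It computes the mini-batch output in two ways, once with separate weight matrices and once with the concatenated matrix $W$, and confirms that they agree. The sizes and random values are illustrative, and tanh stands in for $\phi$ as an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_inputs, n_neurons = 4, 3, 5   # illustrative sizes (assumptions)

X_t = rng.normal(size=(m, n_inputs))        # inputs at time step t
Y_prev = rng.normal(size=(m, n_neurons))    # outputs from time step t-1
Wx = rng.normal(size=(n_inputs, n_neurons))
Wy = rng.normal(size=(n_neurons, n_neurons))
b = rng.normal(size=n_neurons)

# First form: separate weight matrices.
Y_t = np.tanh(X_t @ Wx + Y_prev @ Wy + b)

# Second form: inputs concatenated horizontally, weights concatenated vertically.
XY = np.hstack([X_t, Y_prev])   # shape (m, n_inputs + n_neurons)
W = np.vstack([Wx, Wy])         # shape (n_inputs + n_neurons, n_neurons)
Y_t_concat = np.tanh(XY @ W + b)

print(Y_t.shape, np.allclose(Y_t, Y_t_concat))
```

The concatenated form performs a single matrix multiplication per time step instead of two.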

Notice that $Y_{(t)}$ is a function of $X_{(t)}$ and $Y_{(t-1)}$, which is a function of $X_{(t-1)}$ and $Y_{(t-2)}$, which is a function of $X_{(t-2)}$ and $Y_{(t-3)}$, and so on. This makes $Y_{(t)}$ a function of all the inputs since time $t = 0$ (that is, $X_{(0)}$, $X_{(1)}$, …, $X_{(t)}$). At the first time step, $t = 0$, there are no previous outputs, so they are typically assumed to be all zeros.
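
To make this dependency concrete, the following sketch unrolls a basic recurrent layer over a short sequence, starting from all-zero previous outputs, and then perturbs only the first input to see whether the last output changes. The function name `run_rnn`, the sizes, the weight scale, and the tanh activation are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n_steps, n_inputs, n_neurons = 2, 4, 3, 5   # illustrative sizes (assumptions)

Wx = rng.normal(scale=0.5, size=(n_inputs, n_neurons))
Wy = rng.normal(scale=0.5, size=(n_neurons, n_neurons))
b = np.zeros(n_neurons)


def run_rnn(X_seq):
    """Unroll a basic recurrent layer over X_seq of shape (m, n_steps, n_inputs)."""
    Y_prev = np.zeros((X_seq.shape[0], n_neurons))   # all zeros before t = 0
    outputs = []
    for t in range(X_seq.shape[1]):
        Y_prev = np.tanh(X_seq[:, t, :] @ Wx + Y_prev @ Wy + b)
        outputs.append(Y_prev)
    return np.stack(outputs, axis=1)                 # (m, n_steps, n_neurons)


X_seq = rng.normal(size=(m, n_steps, n_inputs))
Y_seq = run_rnn(X_seq)

# Perturb only the very first input, then compare the outputs at the last step.
X_changed = X_seq.copy()
X_changed[:, 0, :] += 1.0
Y_changed = run_rnn(X_changed)
print(np.array_equal(Y_seq[:, -1, :], Y_changed[:, -1, :]))
```

If the final outputs differ, the last time step still carries information about the input at $t = 0$.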

## Memory Cells

Since the output of a recurrent neuron at time step t is a function of all the inputs from previous time steps, you could say it has a form of _memory_. A part of a neural network that preserves some state across time steps is called a _memory cell_ (or simply a _cell_). A single recurrent neuron, or a layer of recurrent neurons, is a very _basic cell_, but later in this chapter we will look at some more complex and powerful types of cells.

In general, a cell’s state at time step $t$, denoted $h_{(t)}$ (the “h” stands for “hidden”), is a function of some inputs at that time step and its state at the previous time step: $h_{(t)} = f(h_{(t-1)}, x_{(t)})$. Its output at time step $t$, denoted $y_{(t)}$, is also a function of the previous state and the current inputs. In the case of the basic cells we have discussed so far, the output is simply equal to the state, but in more complex cells this is not always the case, as shown in Figure 3.

<img src="../images/rnn_3.png"/>
_Figure 3: A cell’s hidden state and its output may be different_
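
The distinction between state and output can be sketched with a toy cell. `ToyCell` is a hypothetical class written for this illustration, not part of any library. It keeps a hidden state $h_{(t)}$ and derives a separate output $y_{(t)}$ from it through an extra weight matrix, which is just one simple way in which the two can differ.

```python
import numpy as np


class ToyCell:
    """Hypothetical cell whose output differs from its hidden state."""

    def __init__(self, n_inputs, n_units, n_outputs, seed=0):
        rng = np.random.default_rng(seed)
        self.Wx = rng.normal(scale=0.5, size=(n_inputs, n_units))
        self.Wh = rng.normal(scale=0.5, size=(n_units, n_units))
        self.Wo = rng.normal(scale=0.5, size=(n_units, n_outputs))
        self.n_units = n_units

    def zero_state(self):
        return np.zeros(self.n_units)

    def step(self, h_prev, x_t):
        # h_(t) = f(h_(t-1), x_(t))
        h_t = np.tanh(x_t @ self.Wx + h_prev @ self.Wh)
        # y_(t) depends on the previous state and current input through h_(t),
        # but it is not the state itself.
        y_t = h_t @ self.Wo
        return h_t, y_t


cell = ToyCell(n_inputs=3, n_units=4, n_outputs=2)
h = cell.zero_state()
rng = np.random.default_rng(1)
for x_t in rng.normal(size=(5, 3)):   # a sequence of 5 input vectors
    h, y = cell.step(h, x_t)
print(h.shape, y.shape)
```

For the basic cells discussed so far, `step` would simply return the same array as both state and output.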

## Input and Output Sequences

An RNN can simultaneously take a sequence of inputs and produce a sequence of outputs (see Figure 4, top-left network). For example, this type of network is useful for predicting time series such as stock prices: you feed it the prices over the last $N$ days, and it must output the prices shifted by one day into the future (i.e., from $N - 1$ days ago to tomorrow).
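
The "shifted by one day" target can be built with plain list slicing. In this sketch, `make_shifted_pairs` is a hypothetical helper and the prices are made-up values. Each target window is the input window moved one step into the future.

```python
def make_shifted_pairs(series, n_steps):
    """Split a series into (input window, target window shifted by one step)."""
    pairs = []
    for start in range(len(series) - n_steps):
        inputs = series[start:start + n_steps]
        targets = series[start + 1:start + n_steps + 1]
        pairs.append((inputs, targets))
    return pairs


prices = [10.0, 10.5, 10.2, 10.8, 11.1, 10.9]   # made-up values
pairs = make_shifted_pairs(prices, n_steps=4)
for inputs, targets in pairs:
    print(inputs, "->", targets)
```

Only the last element of each target window is genuinely new; the others overlap with the inputs, which matches the description above.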

Alternatively, you could feed the network a sequence of inputs, and ignore all outputs except for the last one (see the top-right network). In other words, this is a sequence-to-vector network. For example, you could feed the network a sequence of words corresponding to a movie review, and the network would output a sentiment score (e.g., from –1 [hate] to +1 [love]).

Conversely, you could feed the network a single input at the first time step (and zeros for all other time steps), and let it output a sequence (see the bottom-left network). This is a vector-to-sequence network. For example, the input could be an image, and the output could be a caption for that image.

Lastly, you could have a sequence-to-vector network, called an _encoder_, followed by a vector-to-sequence network, called a _decoder_ (see the bottom-right network). For example, this can be used for translating a sentence from one language to another. You would feed the network a sentence in one language, the encoder would convert this sentence into a single vector representation, and then the decoder would decode this vector into a sentence in another language. This two-step model, called an Encoder–Decoder, works much better than trying to translate on the fly with a single sequence-to-sequence RNN (like the one represented on the top left), since the last words of a sentence can affect the first words of the translation, so you need to wait until you have heard the whole sentence before translating it.

<img src="../images/rnn_4.png"/>
_Figure 4: Seq to seq (top left), seq to vector (top right), vector to seq (bottom left), delayed seq to seq (bottom right)_
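
The data flow of the Encoder–Decoder arrangement can be sketched in NumPy. The code below uses untrained random weights, so it only shows structure and shapes, not a working translator. The function names `encode` and `decode`, the sizes, and the tanh activation are assumptions. The decoder receives the encoder's vector at its first time step and zeros afterwards, as described for vector-to-sequence networks above.

```python
import numpy as np

rng = np.random.default_rng(7)
n_inputs, n_enc, n_dec = 4, 6, 5   # illustrative sizes (assumptions)

# Untrained random weights, used only to show the data flow.
enc_Wx = rng.normal(scale=0.5, size=(n_inputs, n_enc))
enc_Wy = rng.normal(scale=0.5, size=(n_enc, n_enc))
dec_Wx = rng.normal(scale=0.5, size=(n_enc, n_dec))
dec_Wy = rng.normal(scale=0.5, size=(n_dec, n_dec))


def encode(sequence):
    """Sequence-to-vector: read the whole sequence, keep only the last output."""
    y = np.zeros(n_enc)
    for x_t in sequence:
        y = np.tanh(x_t @ enc_Wx + y @ enc_Wy)
    return y


def decode(vector, n_out_steps):
    """Vector-to-sequence: feed the vector at step 0, zeros afterwards."""
    y = np.zeros(n_dec)
    outputs = []
    for t in range(n_out_steps):
        x_t = vector if t == 0 else np.zeros_like(vector)
        y = np.tanh(x_t @ dec_Wx + y @ dec_Wy)
        outputs.append(y)
    return np.stack(outputs)


source = rng.normal(size=(7, n_inputs))   # stand-in for a 7-step input sequence
code = encode(source)
target = decode(code, n_out_steps=9)      # output length need not match the input
print(code.shape, target.shape)
```

Because decoding starts only after the whole input has been encoded, the first output step can depend on the last input step.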

Sounds promising, so let’s start coding!

In [1]:
import tensorflow as tf



## Training a Sequence Classifier

Let’s train an RNN to classify MNIST images. A convolutional neural network would be better suited for image classification (see Chapter 13), but this makes for a simple example that you are already familiar with. We will treat each image as a sequence of 28 rows of 28 pixels each (since each MNIST image is 28 × 28 pixels). We will use cells of 150 recurrent neurons, plus a fully connected layer containing 10 neurons (one per class) connected to the output of the last time step, followed by a softmax layer (see Figure 6).

<img src="../images/rnn_6.png"/>
_Figure 6: Sequence classifier_
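
Treating an image as a sequence amounts to a reshape: a flattened 784-value image becomes 28 time steps of 28 features each. The sketch below uses a synthetic array in place of real MNIST data, an assumption made to keep the example self-contained.

```python
import numpy as np

n_steps, n_inputs = 28, 28

# Stand-in for a batch of 3 flattened MNIST images (784 values each).
flat_batch = np.arange(3 * 784, dtype=np.float32).reshape(3, 784)

# Each image becomes a sequence of 28 rows with 28 pixels per row.
seq_batch = flat_batch.reshape((-1, n_steps, n_inputs))
print(seq_batch.shape)

# Time step t of an image is simply row t of the original 28 x 28 grid.
first_image = flat_batch[0].reshape(28, 28)
row_5_matches = np.array_equal(seq_batch[0, 5], first_image[5])
print(row_5_matches)
```

The RNN therefore reads each image top to bottom, one row of pixels per time step.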

The construction phase is quite straightforward; it’s pretty much the same as the MNIST classifier we built previously, except that an unrolled RNN replaces the hidden layers. Note that the fully connected layer is connected to the _states_ tensor, which contains only the final state of the RNN (i.e., the $28^{th}$ output). Also note that $y$ is a placeholder for the target classes.

In [2]:
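n_steps = 28      # time steps per instance: one per image row
n_inputs = 28     # features per time step: one per pixel in a row
n_neurons = 150   # recurrent neurons in the cell
n_outputs = 10    # one logit per digit class

learning_rate = 0.001

# This code targets the TensorFlow 1.x graph-mode API
# (tf.contrib does not exist in TensorFlow 2).
X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.int32, [None])

# outputs holds the cell output at every time step; states holds the final state.
basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)

# The fully connected layer reads only the final state.
logits = tf.layers.dense(states, n_outputs)
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
                                                          logits=logits)
loss = tf.reduce_mean(xentropy)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)

# A prediction counts as correct when the target is the top-scoring class.
correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

init = tf.global_variables_initializer()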
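
For readers who want to see what the loss and accuracy nodes compute, here is a plain NumPy sketch of the same quantities. It is a simplified re-implementation for illustration, not TensorFlow's code. The helper names are hypothetical, the logits are made up, and ties between logits are not treated the way `in_top_k` treats them.

```python
import numpy as np


def sparse_softmax_cross_entropy(labels, logits):
    """Per-instance cross entropy between integer labels and softmax(logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels]


def top1_accuracy(labels, logits):
    """Fraction of instances whose highest logit matches the label."""
    return float(np.mean(np.argmax(logits, axis=1) == labels))


# Made-up logits for 3 instances and 3 classes.
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0],
                   [1.0, 1.5, 0.0]])
labels = np.array([0, 2, 0])

per_instance = sparse_softmax_cross_entropy(labels, logits)
loss = per_instance.mean()
acc = top1_accuracy(labels, logits)
print(loss, acc)
```

The third instance is misclassified (class 1 scores highest while the label is 0), so it contributes the largest loss term and lowers the accuracy.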

Now let’s load the MNIST data and reshape the test data to [batch_size, n_steps, n_inputs] as is expected by the network. We will take care of reshaping the training data in a moment.

In [3]:
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("/tmp/data/")
X_test = mnist.test.images.reshape((-1, n_steps, n_inputs))
y_test = mnist.test.labels

Extracting /tmp/data/train-images-idx3-ubyte.gz
Extracting /tmp/data/train-labels-idx1-ubyte.gz
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz


Now we are ready to train the RNN. The execution phase is exactly the same as for the MNIST classifier in the previous notebook, except that we reshape each training batch before feeding it to the network.

In [5]:
n_epochs = 50
batch_size = 150

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            # Flattened images -> sequences of 28 rows with 28 pixels each.
            X_batch = X_batch.reshape((-1, n_steps, n_inputs))
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        # Train accuracy is measured on the last mini-batch of the epoch only.
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: X_test, y: y_test})
        print(epoch, "Train accuracy:", acc_train, "Test accuracy:", acc_test)

0 Train accuracy: 0.9266667 Test accuracy: 0.931
1 Train accuracy: 0.97333336 Test accuracy: 0.9491
2 Train accuracy: 0.9866667 Test accuracy: 0.954
3 Train accuracy: 0.9533333 Test accuracy: 0.9638
4 Train accuracy: 0.96666664 Test accuracy: 0.963
5 Train accuracy: 0.9533333 Test accuracy: 0.9686
6 Train accuracy: 0.97333336 Test accuracy: 0.9715
7 Train accuracy: 0.9866667 Test accuracy: 0.97
8 Train accuracy: 1.0 Test accuracy: 0.9749
9 Train accuracy: 0.99333334 Test accuracy: 0.9714
10 Train accuracy: 0.96666664 Test accuracy: 0.9687
11 Train accuracy: 0.9866667 Test accuracy: 0.9687
12 Train accuracy: 0.98 Test accuracy: 0.9726
13 Train accuracy: 0.99333334 Test accuracy: 0.9689
14 Train accuracy: 1.0 Test accuracy: 0.9774
15 Train accuracy: 0.9866667 Test accuracy: 0.978
16 Train accuracy: 1.0 Test accuracy: 0.9782
17 Train accuracy: 0.99333334 Test accuracy: 0.9734
18 Train accuracy: 0.99333334 Test accuracy: 0.9637
19 Train accuracy: 0.96666664 Test accuracy: 0.9749
20 Train a

Test accuracy reaches roughly 97 to 98% in the epochs shown above, which is not bad! Plus you would certainly get a better result by tuning the hyperparameters, initializing the RNN weights using He initialization, training longer, or adding a bit of regularization (e.g., dropout).
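
As a pointer for the He initialization suggestion, the sketch below draws weight matrices from a normal distribution with standard deviation $\sqrt{2 / n_{in}}$, where $n_{in}$ is the number of incoming connections. The helper `he_normal` is hypothetical and written in NumPy. How to pass such an initializer to the TensorFlow cell used above is not shown here.

```python
import numpy as np


def he_normal(fan_in, fan_out, rng):
    """Draw a (fan_in, fan_out) weight matrix with std sqrt(2 / fan_in)."""
    return rng.normal(scale=np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))


rng = np.random.default_rng(0)
n_inputs, n_neurons = 28, 150   # sizes used by the classifier above

Wx = he_normal(n_inputs, n_neurons, rng)    # input-to-hidden weights
Wy = he_normal(n_neurons, n_neurons, rng)   # hidden-to-hidden weights
print(Wx.std(), Wy.std())
```

The empirical standard deviations should land close to $\sqrt{2/28}$ and $\sqrt{2/150}$ respectively.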