# Recurrent Neural Networks with TensorFlow
Based on Chap. 14 of `Hands-On Machine Learning with Scikit-Learn and TensorFlow` by A. Géron. 

### Introduction
Recurrent neural networks (RNNs) are used for time series data, where not only the present but also the past matters. In other words, RNNs differ from feed-forward networks in that they can remember state. At each time step $t$, a recurrent neuron receives the input $\mathbf{x}_{(t)}$ as well as its own output from the previous time step, $y_{(t-1)}$. Similarly, a layer of recurrent neurons receives not only the input $\mathbf{x}_{(t)}$ but also the output $\mathbf{y}_{(t-1)}$ of the whole layer from the previous time step.

Each recurrent neuron has two sets of weights, one for the inputs and the other for the outputs of the previous time step. The output of a recurrent layer for a single instance $\mathbf{x}_{(t)}$ is 

$$
 \mathbf{y}_{(t)} = \phi ( \mathbf{W}_x^T \cdot \mathbf{x}_{(t)} + \mathbf{W}_y^T \cdot \mathbf{y}_{(t-1)} + \mathbf{b}),
$$

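where $\phi$ is the activation function, $\mathbf{W}_x$ holds the weights for the inputs, $\mathbf{W}_y$ holds the weights for the outputs of the previous time step, and $\mathbf{b}$ is the bias vector. The weights and bias play the same role as in feed-forward networks.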
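
To make the formula concrete, here is a minimal NumPy sketch of it applied to a mini-batch over two time steps. The sizes, the random weights, and the input values are illustrative assumptions. The mini-batch is stored row-wise (one instance per row), so the products are written as `X @ Wx` rather than $\mathbf{W}_x^T \cdot \mathbf{x}$.

```python
import numpy as np

rng = np.random.default_rng(42)

n_inputs, n_neurons = 3, 5  # illustrative sizes (assumption)

# Two weight matrices per layer: one for the inputs, one for the previous outputs
Wx = rng.normal(size=(n_inputs, n_neurons))
Wy = rng.normal(size=(n_neurons, n_neurons))
b = np.zeros((1, n_neurons))

def recurrent_step(X_t, Y_prev):
    """One time step of a recurrent layer for a whole mini-batch, with phi = tanh."""
    return np.tanh(X_t @ Wx + Y_prev @ Wy + b)

X0 = np.array([[0., 1., 2.], [3., 4., 5.]])  # inputs at t = 0 (two instances)
X1 = np.array([[9., 8., 7.], [0., 0., 0.]])  # inputs at t = 1

Y_init = np.zeros((X0.shape[0], n_neurons))  # no previous output before t = 0
Y0 = recurrent_step(X0, Y_init)
Y1 = recurrent_step(X1, Y0)

print(Y1.shape)  # (2, 5)
```

Note that the second instance has an all-zero input at $t = 1$, yet its output is generally non-zero, because it still depends on `Y0` through `Wy`.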

A part of a neural network that remembers state is called a _memory cell_. Examples include a layer of recurrent neurons, the long short-term memory (LSTM) cell, and the gated recurrent unit (GRU).

In general, a cell's state $\mathbf{h}_{(t)}$ at time step $t$ is a function of the inputs at that time step and the state at the previous time step: $\mathbf{h}_{(t)} = f(\mathbf{h}_{(t-1)}, \mathbf{x}_{(t)})$. Similarly, the output $\mathbf{y}_{(t)}$ is a function of the inputs and the previous state.

RNNs take a sequence as input and produce a sequence as output. This is called a sequence-to-sequence network. Ignoring all outputs but the last gives a sequence-to-vector network. Conversely, one can feed a single input to the RNN at the first time step and zeros afterwards, creating a vector-to-sequence network (or a vector-to-vector network, by ignoring all outputs except one). One can also chain a sequence-to-vector network, called an encoder, with a vector-to-sequence network, called a decoder. This could be used for translating a sentence from one language to another. An encoder-decoder works better for this task than a single sequence-to-sequence RNN, because the last words of a sentence can affect the first words of the translation, so the whole sentence should be read before translation starts.
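
The difference between these input/output arrangements can be sketched with a small NumPy RNN. This is only an illustration under assumed sizes and random, untrained weights; the helper `run_rnn` is a hypothetical name introduced here, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_neurons = 3, 4  # illustrative sizes (assumption)
Wx = rng.normal(scale=0.5, size=(n_inputs, n_neurons))
Wy = rng.normal(scale=0.5, size=(n_neurons, n_neurons))
b = np.zeros(n_neurons)

def run_rnn(x_seq):
    """Run a basic tanh RNN over x_seq of shape [n_steps, n_inputs]; return all outputs."""
    y = np.zeros(n_neurons)
    outputs = []
    for x_t in x_seq:
        y = np.tanh(x_t @ Wx + y @ Wy + b)
        outputs.append(y)
    return np.stack(outputs)  # shape [n_steps, n_neurons]

n_steps = 4
x_seq = rng.normal(size=(n_steps, n_inputs))

# Sequence-to-sequence: keep one output per time step
seq_to_seq = run_rnn(x_seq)

# Sequence-to-vector: ignore every output except the last one
seq_to_vec = run_rnn(x_seq)[-1]

# Vector-to-sequence: feed a vector at the first time step and zeros afterwards
vec = rng.normal(size=n_inputs)
vec_as_seq = np.vstack([vec, np.zeros((n_steps - 1, n_inputs))])
vec_to_seq = run_rnn(vec_as_seq)

print(seq_to_seq.shape, seq_to_vec.shape, vec_to_seq.shape)  # (4, 4) (4,) (4, 4)
```

An encoder-decoder would pass something like `seq_to_vec` from one such network as the input vector of a second one.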

As practice, let us create an RNN in TensorFlow from scratch.

In [2]:
import tensorflow as tf
import numpy as np

def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)

In [None]:
n_inputs = 3
n_neurons = 5

reset_graph()

X0 = tf.placeholder(tf.float32, [None, n_inputs]) # Input at time 0
X1 = tf.placeholder(tf.float32, [None, n_inputs]) # Input at time 1

Wx = tf.Variable(tf.random_normal(shape=(n_inputs, n_neurons)), dtype=tf.float32)
Wy = tf.Variable(tf.random_normal(shape=(n_neurons, n_neurons)), dtype=tf.float32)

b = tf.Variable(tf.random_normal(shape=(1, n_neurons)), dtype=tf.float32)

Y0 = tf.tanh(tf.matmul(X0, Wx) + b) # Output at time 0
Y1 = tf.tanh(tf.matmul(X1, Wx) + tf.matmul(Y0, Wy) + b)

X0_batch = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 0, 1]]) # Values at time step 0
X1_batch = np.array([[9, 8, 7], [0, 0, 0], [0, 0, 0], [3, 2, 1]]) # Values at time step 1

init = tf.global_variables_initializer()

with tf.Session() as sess:
    init.run()
    Y0_val, Y1_val = sess.run([Y0, Y1], feed_dict={ X0: X0_batch, X1: X1_batch })
    
print(Y1_val)

Now let us use built-in functions from TensorFlow:

In [None]:
reset_graph()

X0 = tf.placeholder(tf.float32, [None, n_inputs]) # Input at time 0
X1 = tf.placeholder(tf.float32, [None, n_inputs]) # Input at time 1

# A "factory" that creates copies of the cell to build the unrolled RNN
basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)

# `static_rnn` calls the cell factory's `__call__()` once per input, creating two copies of the cell, 
# with shared weights and bias terms, and chains them together
# The first output is a list containing the output tensors at each timestep,
# the second is a tensor containing the final states of the network (last output for basic RNN cell).
output_seqs, states = tf.contrib.rnn.static_rnn(basic_cell, [X0, X1], dtype=tf.float32)

Y0, Y1 = output_seqs

init = tf.global_variables_initializer()

with tf.Session() as sess:
    init.run()
    Y0_val, Y1_val = sess.run([Y0, Y1], feed_dict={ X0: X0_batch, X1: X1_batch })

print(Y1_val)

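The above approach does not scale: with 50 time steps in the input sequences, one would have to define 50 input placeholders and 50 output tensors by hand. Let's instead use a single placeholder for the whole sequence: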

In [None]:
reset_graph()
n_steps = 2

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs]) # Input sequences, replaces X0 and X1 above

X_rearr = tf.transpose(X, perm=[1, 0, 2]) # To format [n_steps, None, n_inputs]
X_seqs = tf.unstack(X_rearr) # Sequence of tensors of shape [None, n_inputs] (corresponds to [X0, X1] above)

basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)

# In `output_seqs`, each tensor (corresponding to different timestep) is of size [None, n_outputs]
output_seqs, states = tf.contrib.rnn.static_rnn(basic_cell, X_seqs, dtype=tf.float32)

outputs_stacked = tf.stack(output_seqs) # Merge output tensors into a single tensor, [n_steps, None, n_outputs]
outputs = tf.transpose(outputs_stacked, perm=[1, 0, 2]) # To format [None, n_steps, n_outputs]

X_batch = np.array([
    [[0, 1, 2], [3, 4, 5]], # First instance
    [[3, 4, 5], [5, 6, 7]], # Second instance
    [[5, 6, 7], [7, 8, 9]], # Third
    [[7, 8, 9], [9, 10, 11]] # Fourth
])

with tf.Session() as sess:
    tf.global_variables_initializer().run()
    outputs_val = outputs.eval(feed_dict={X: X_batch})
    
print(outputs_val)
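
The transpose/unstack/stack bookkeeping above is easy to get wrong, so here is a NumPy-only sketch of the same shape manipulations. It assumes `np.transpose`, `list(...)` and `np.stack` as stand-ins for `tf.transpose`, `tf.unstack` and `tf.stack`; the sizes and values are illustrative.

```python
import numpy as np

batch_size, n_steps, n_inputs = 4, 2, 3  # illustrative sizes (assumption)
X = np.arange(batch_size * n_steps * n_inputs).reshape(batch_size, n_steps, n_inputs)

# Same role as tf.transpose(X, perm=[1, 0, 2]): time becomes the first axis
X_rearr = np.transpose(X, (1, 0, 2))   # [n_steps, batch_size, n_inputs]

# Same role as tf.unstack: a Python list with one array per time step
X_seqs = list(X_rearr)                  # n_steps arrays of [batch_size, n_inputs]

# X_seqs[t][i] is the input of instance i at time step t
assert np.array_equal(X_seqs[1][0], X[0, 1])

# Reverse path, as applied to the outputs: stack the list, then swap the axes back
restored = np.transpose(np.stack(X_seqs), (1, 0, 2))
print(restored.shape)  # (4, 2, 3)
```

Stacking and transposing back recovers the original `[batch_size, n_steps, n_inputs]` layout, which is exactly what happens to the RNN outputs in the cell above.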

The static RNN used above still builds a graph containing one cell per time step (the weights and bias terms are shared, though), so the graph grows with the length of the input sequences. To avoid this, one can use _dynamic unrolling_ via `dynamic_rnn()`, which uses a `while_loop()` to run over the cell as many times as needed. It also accepts a single input tensor of shape `[None, n_steps, n_inputs]` and outputs a single tensor of shape `[None, n_steps, n_outputs]` (here `n_outputs` equals `n_neurons`), so there is no need to unstack, transpose, and stack as above. The following code does the same as above:

In [None]:
reset_graph()
n_steps = 2

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs]) # Input sequences, replaces X0 and X1 above

basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)

# `outputs` has shape [None, n_steps, n_neurons]; `states` equals `outputs` at the last time step
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)

X_batch = np.array([
    [[0, 1, 2], [3, 4, 5]], # First instance
    [[3, 4, 5], [5, 6, 7]], # Second instance
    [[5, 6, 7], [7, 8, 9]], # Third
    [[7, 8, 9], [9, 10, 11]] # Fourth
])

with tf.Session() as sess:
    tf.global_variables_initializer().run()
    outputs_val, states_val = sess.run([outputs, states], feed_dict={X: X_batch})
    
print(outputs_val)

Variable-length input sequences can be handled by passing the `sequence_length` argument to `dynamic_rnn()`: a 1D tensor holding the length of the input sequence for each instance in the mini-batch. Shorter sequences must be padded (for example, with zero vectors) to fit the input tensor. The RNN outputs zero vectors for every time step past the input sequence length. Variable-length output sequences (as in text translation) are generally handled by a special output called an _end-of-sequence token_ (EOS). Any output past the EOS token should be ignored.
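
The following NumPy sketch mimics this masking behaviour. It is not TensorFlow's implementation: `masked_rnn` and `truncate_at_eos` are hypothetical helpers, the weights are random, and the `"<eos>"` and `"<pad>"` token strings are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n_inputs, n_neurons = 3, 4  # illustrative sizes (assumption)
Wx = rng.normal(scale=0.5, size=(n_inputs, n_neurons))
Wy = rng.normal(scale=0.5, size=(n_neurons, n_neurons))
b = np.zeros(n_neurons)

def masked_rnn(X, seq_length):
    """Basic tanh RNN over X of shape [batch, n_steps, n_inputs].

    For each instance, steps at or beyond its length output zeros and leave
    the state untouched, so the final state is the last valid output.
    """
    batch_size, n_steps, _ = X.shape
    state = np.zeros((batch_size, n_neurons))
    outputs = np.zeros((batch_size, n_steps, n_neurons))
    for t in range(n_steps):
        new_state = np.tanh(X[:, t] @ Wx + state @ Wy + b)
        active = (t < seq_length)[:, None]  # True while the sequence is running
        outputs[:, t] = np.where(active, new_state, 0.0)
        state = np.where(active, new_state, state)
    return outputs, state

X = rng.normal(size=(2, 3, n_inputs))
seq_length = np.array([3, 1])  # the second instance has only one valid step
outputs, final_state = masked_rnn(X, seq_length)

def truncate_at_eos(tokens, eos="<eos>"):
    """Keep the output tokens before the first EOS token; ignore the rest."""
    return tokens[:tokens.index(eos)] if eos in tokens else tokens

print(truncate_at_eos(["je", "bois", "du", "lait", "<eos>", "<pad>"]))
# ['je', 'bois', 'du', 'lait']
```

For the second instance, `outputs[1, 1:]` is all zeros and `final_state[1]` equals `outputs[1, 0]`, the output at its last valid time step.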

### Training RNNs

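RNNs are trained using _backpropagation through time_ (BPTT). The network is unrolled through time, a forward pass computes the outputs and the cost, the gradients of the cost are propagated backward through the unrolled network, and finally the parameters are updated using the gradients computed during backpropagation. Because the same weights are used at every time step, their gradients are summed over the time steps.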
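
As a rough illustration, here is a minimal NumPy sketch of backpropagation through time for a basic tanh RNN and a single instance. The squared-error loss on the last output, the random data, the sizes, and the learning rate are all assumptions chosen to keep the example short; in practice TensorFlow derives the gradients automatically.

```python
import numpy as np

rng = np.random.default_rng(0)
n_steps, n_inputs, n_neurons = 4, 3, 5  # illustrative sizes (assumption)

x_seq = rng.normal(size=(n_steps, n_inputs))
target = rng.normal(size=n_neurons)  # hypothetical target for the last output
params = {
    "Wx": rng.normal(scale=0.5, size=(n_inputs, n_neurons)),
    "Wy": rng.normal(scale=0.5, size=(n_neurons, n_neurons)),
    "b": np.zeros(n_neurons),
}

def forward(params):
    """Unroll the RNN through time; return all outputs and the loss on the last one."""
    ys = [np.zeros(n_neurons)]
    for x_t in x_seq:
        ys.append(np.tanh(x_t @ params["Wx"] + ys[-1] @ params["Wy"] + params["b"]))
    loss = 0.5 * np.sum((ys[-1] - target) ** 2)
    return ys, loss

def bptt(params):
    """Propagate the gradient backward through the unrolled network."""
    ys, _ = forward(params)
    grads = {name: np.zeros_like(value) for name, value in params.items()}
    dy = ys[-1] - target                           # dLoss/dy at the last time step
    for t in range(n_steps, 0, -1):
        dz = dy * (1.0 - ys[t] ** 2)               # back through tanh
        grads["Wx"] += np.outer(x_seq[t - 1], dz)  # shared weights: gradients add up
        grads["Wy"] += np.outer(ys[t - 1], dz)
        grads["b"] += dz
        dy = dz @ params["Wy"].T                   # gradient flowing to the previous step
    return grads

grads = bptt(params)

# Sanity check against a finite-difference estimate for one weight
eps = 1e-6
params["Wy"][0, 1] += eps
loss_plus = forward(params)[1]
params["Wy"][0, 1] -= 2 * eps
loss_minus = forward(params)[1]
params["Wy"][0, 1] += eps
numeric = (loss_plus - loss_minus) / (2 * eps)
print(np.isclose(grads["Wy"][0, 1], numeric, atol=1e-6))  # True

# One plain gradient descent step using the gradients from backpropagation
learning_rate = 0.01
for name in params:
    params[name] -= learning_rate * grads[name]
```

The loop in `bptt` walks the time steps in reverse order, which is where the name "through time" comes from.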

As a toy example of training RNNs, let us use an RNN to classify MNIST images. Each row of an input image is treated as a single time step, so a $28 \times 28$ MNIST image becomes a sequence of 28 time steps with 28 inputs each.
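
The row-as-time-step view amounts to a single reshape of the flattened images. A small NumPy sketch, using a synthetic stand-in array rather than real MNIST data:

```python
import numpy as np

n_steps, n_inputs = 28, 28

# Stand-in for a mini-batch of two flattened 28x28 images (not real MNIST data)
X_flat = np.arange(2 * n_steps * n_inputs, dtype=np.float32).reshape(2, -1)

# Each image becomes a sequence of 28 rows, each row holding 28 pixel values
X_seq = X_flat.reshape((-1, n_steps, n_inputs))
print(X_seq.shape)  # (2, 28, 28)

# Time step t of image i is row t, i.e. pixels t*28 .. t*28+27 of the flat vector
i, t = 1, 5
assert np.array_equal(X_seq[i, t], X_flat[i, t * n_inputs:(t + 1) * n_inputs])
```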

In [12]:
from tensorflow.examples.tutorials.mnist import input_data
reset_graph()

n_steps = 28
n_inputs = 28

n_neurons = 150
n_outputs = 10

learning_rate = 0.001

X = tf.placeholder(tf.float32, shape=(None, n_steps, n_inputs))
y = tf.placeholder(tf.int32, shape=[None])  # 1D tensor of class labels, one per instance

basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)

logits = tf.layers.dense(states, n_outputs)

xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
loss = tf.reduce_mean(xentropy)

optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)

correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

init = tf.global_variables_initializer()

mnist = input_data.read_data_sets("/tmp/mnist/data")

X_valid = mnist.validation.images.reshape((-1, n_steps, n_inputs))
y_valid = mnist.validation.labels

n_epochs = 10
batch_size = 150

with tf.Session() as sess:
    init.run()
    for epoch in range(1, n_epochs + 1):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            X_batch = X_batch.reshape((-1, n_steps, n_inputs))
            loss_train, acc_train, _ = sess.run([loss, accuracy, training_op], feed_dict={X: X_batch, y: y_batch})

        loss_val, acc_val = sess.run([loss, accuracy], feed_dict={X: X_valid, y: y_valid})
        print("{}\tTraining loss: {:.6f}\tTraining acc: {:.6f}\tValidation loss: {:.6f}\tAccuracy: {:.2f}%".format(
                        epoch, loss_train, acc_train, loss_val, acc_val * 100))
        


Extracting /tmp/mnist/data/train-images-idx3-ubyte.gz
Extracting /tmp/mnist/data/train-labels-idx1-ubyte.gz
Extracting /tmp/mnist/data/t10k-images-idx3-ubyte.gz
Extracting /tmp/mnist/data/t10k-labels-idx1-ubyte.gz
1	Training loss: 0.275220	Training acc: 0.913333	Validation loss: 0.244868	Accuracy: 93.10%
2	Training loss: 0.124366	Training acc: 0.966667	Validation loss: 0.164343	Accuracy: 95.58%
3	Training loss: 0.246158	Training acc: 0.926667	Validation loss: 0.146260	Accuracy: 96.04%
4	Training loss: 0.161353	Training acc: 0.953333	Validation loss: 0.129532	Accuracy: 96.38%
5	Training loss: 0.150708	Training acc: 0.966667	Validation loss: 0.137861	Accuracy: 96.18%
6	Training loss: 0.116905	Training acc: 0.953333	Validation loss: 0.107175	Accuracy: 96.98%
7	Training loss: 0.099635	Training acc: 0.966667	Validation loss: 0.095419	Accuracy: 97.50%
8	Training loss: 0.165225	Training acc: 0.946667	Validation loss: 0.102198	Accuracy: 97.02%
9	Training loss: 0.174324	Training acc: 0.960000	V