# Recurrent neural networks

Just as CNNs are tailored for processing 2D data, recurrent neural networks (RNNs) are designed for processing sequences or time series by exploiting **recurrent connections**. Depending on how they are constructed, RNNs can process sequences of variable length. This is not the case for feedforward nets, which have a fixed input size, or for CNNs, where the same kernels can be reused for larger inputs but the "decision" layer usually has a fixed size.

Similarly to CNNs, RNNs also have a lot of parameter sharing, but now parameters are shared across different time-steps.

RNNs are currently part of the state-of-the-art techniques for acoustic modeling in ASR and natural language processing ([Google voice search](https://research.googleblog.com/2015/09/google-voice-search-faster-and-more.html) and many other Google products are powered by RNNs).

## Simple RNN

<img src="RNN-unrolled.png" width=650px>

This type of RNN is available on Keras as `keras.layers.recurrent.SimpleRNN`.
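To make the recurrence and the parameter sharing across time-steps concrete, here is a minimal NumPy sketch of a simple RNN forward pass. It is an illustration only, not the Keras implementation: the function name `simple_rnn_forward`, the weight names, the `tanh` activation and the zero initial state are all assumptions made for this example.

```python
import numpy as np

def simple_rnn_forward(x, W_x, W_h, b):
    """Run a simple tanh RNN over one sequence.

    x has shape (n_timesteps, n_features); the result holds every
    hidden state and has shape (n_timesteps, n_hidden).
    """
    h = np.zeros(W_h.shape[0])      # initial hidden state (assumed to be zeros)
    states = []
    for x_t in x:                   # the same W_x, W_h and b are reused at every step
        h = np.tanh(x_t @ W_x + h @ W_h + b)
        states.append(h)
    return np.stack(states)

rng = np.random.RandomState(0)
n_features, n_hidden = 4, 3
W_x = rng.randn(n_features, n_hidden) * 0.1
W_h = rng.randn(n_hidden, n_hidden) * 0.1
b = np.zeros(n_hidden)

# Sequences of different lengths go through exactly the same parameters
short_states = simple_rnn_forward(rng.randn(5, n_features), W_x, W_h, b)
long_states = simple_rnn_forward(rng.randn(12, n_features), W_x, W_h, b)
print(short_states.shape, long_states.shape)
```

Because the loop only depends on the number of rows in `x`, the same function handles sequences of 5 and 12 timesteps without any change to the parameters.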

### Issues: vanishing and exploding gradients

Simple RNNs are almost never used in practice because they are very hard to optimize when learning long-term dependencies. During backpropagation, the gradient passes through the recurrent connections once per timestep, so across many timesteps it tends to shrink or grow exponentially. These are the so-called *vanishing gradients* and *exploding gradients* issues. Gated RNNs such as long short-term memory units (LSTM) and gated recurrent units (GRU) were proposed to address these issues.
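The exponential behaviour can be seen in a deliberately simplified setting. The sketch below assumes a linear recurrence `h_t = W_h h_{t-1}` (no activation function) whose recurrent matrix is an orthogonal matrix multiplied by a hypothetical `scale` factor, so that every backward step multiplies the gradient norm by exactly `scale`. Real RNNs are nonlinear, so this is only meant to show the trend.

```python
import numpy as np

rng = np.random.RandomState(0)
Q, _ = np.linalg.qr(rng.randn(8, 8))    # orthogonal matrix: all singular values equal 1

def gradient_norms(scale, n_timesteps=50):
    """Norm of the gradient after each backward step through h_t = W_h h_{t-1}."""
    W_h = scale * Q
    g = np.ones(8) / np.sqrt(8)         # unit-norm gradient at the last timestep
    norms = []
    for _ in range(n_timesteps):
        g = W_h.T @ g                   # one step of backpropagation through time
        norms.append(np.linalg.norm(g))
    return np.array(norms)

vanishing = gradient_norms(0.9)         # norm after 50 steps is 0.9 ** 50
exploding = gradient_norms(1.1)         # norm after 50 steps is 1.1 ** 50
print(vanishing[-1], exploding[-1])
```

A recurrent matrix only 10% away from norm-preserving already changes the gradient by several orders of magnitude over 50 timesteps, in either direction.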

## LSTM

![](lstm.png)
(image from Wikimedia Commons: https://commons.wikimedia.org/wiki/File:Long_Short_Term_Memory.png)

This type of RNN is available on Keras as `keras.layers.recurrent.LSTM`.
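As a rough companion to the diagram, the following NumPy sketch implements one time-step of a commonly used LSTM formulation (without peephole connections). The function `lstm_step`, the stacked parameter layout and the gate ordering are assumptions chosen for this example and do not necessarily match how Keras stores its weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time-step.

    W, U and b hold the parameters of the four transformations
    (input gate, forget gate, output gate, candidate) stacked along
    the last axis, in that (assumed) order.
    """
    n = h_prev.shape[0]
    z = x_t @ W + h_prev @ U + b        # shape (4 * n,)
    i = sigmoid(z[:n])                  # input gate
    f = sigmoid(z[n:2 * n])             # forget gate
    o = sigmoid(z[2 * n:3 * n])         # output gate
    g = np.tanh(z[3 * n:])              # candidate values for the cell
    c = f * c_prev + i * g              # additive update of the memory cell
    h = o * np.tanh(c)                  # hidden state passed to the next layer
    return h, c

rng = np.random.RandomState(0)
n_features, n_hidden = 4, 3
W = rng.randn(n_features, 4 * n_hidden) * 0.1
U = rng.randn(n_hidden, 4 * n_hidden) * 0.1
b = np.zeros(4 * n_hidden)

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.randn(6, n_features):    # process a 6-step sequence
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)
```

The cell state `c` is updated additively and scaled by the forget gate, which is the mechanism usually credited with helping gradients survive across many timesteps.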


## GRU

GRUs are another proposed solution for the problem of vanishing gradients. They use the same gating principle as LSTMs but are simpler: they have only two gates (reset and update), and their internal memory is the same as their hidden state, instead of being kept in a separate cell as in LSTMs. There are usually no big performance gaps between LSTM and GRU networks with the same number of parameters.

GRUs are available on Keras as `keras.layers.recurrent.GRU`.
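The two gates and the absence of a separate cell can be sketched in NumPy as follows. This is one common formulation, written under assumptions: the function `gru_step` and the parameter layout are hypothetical, and the convention that the update gate `z` weights the previous state (rather than the candidate) varies between references.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W, U, b):
    """One GRU time-step; W, U, b stack the update, reset and candidate parameters."""
    W_z, W_r, W_c = np.split(W, 3, axis=1)
    U_z, U_r, U_c = np.split(U, 3, axis=1)
    b_z, b_r, b_c = np.split(b, 3)
    z = sigmoid(x_t @ W_z + h_prev @ U_z + b_z)             # update gate
    r = sigmoid(x_t @ W_r + h_prev @ U_r + b_r)             # reset gate
    h_cand = np.tanh(x_t @ W_c + (r * h_prev) @ U_c + b_c)  # candidate state
    # The hidden state is the only memory: no separate cell as in the LSTM
    return z * h_prev + (1.0 - z) * h_cand

rng = np.random.RandomState(0)
n_features, n_hidden = 4, 3
W = rng.randn(n_features, 3 * n_hidden) * 0.1
U = rng.randn(n_hidden, 3 * n_hidden) * 0.1
b = np.zeros(3 * n_hidden)

h = np.zeros(n_hidden)
for x_t in rng.randn(6, n_features):    # process a 6-step sequence
    h = gru_step(x_t, h, W, U, b)
print(h.shape)
```

With three stacked transformations instead of the four an LSTM needs, a GRU layer of the same width has fewer parameters.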


## Gradient value and norm-clipping

Another strategy that can mitigate the *exploding gradients* issue is gradient clipping. Gradients can be clipped either by their maximum absolute value or by the total L2 gradient norm. All optimizers in Keras support both modes of gradient clipping, by using the following keyword parameters:

- `clipnorm=value` (value: float > 0): Gradients will be clipped when their L2 norm exceeds this value
- `clipvalue=value` (value: float > 0): Gradients will be clipped element-wise when their absolute value exceeds this value
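The difference between the two modes is easiest to see on a small array. The sketch below is written in plain NumPy and operates on a single gradient array; the helper names `clip_by_norm` and `clip_by_value` are assumptions for this example, and it does not reproduce how Keras groups the gradients of different parameters when computing the norm.

```python
import numpy as np

def clip_by_norm(grad, clipnorm):
    """Rescale grad so that its L2 norm is at most clipnorm (direction is kept)."""
    norm = np.linalg.norm(grad)
    if norm > clipnorm:
        grad = grad * (clipnorm / norm)
    return grad

def clip_by_value(grad, clipvalue):
    """Clip every element of grad to the range [-clipvalue, clipvalue]."""
    return np.clip(grad, -clipvalue, clipvalue)

grad = np.array([3.0, -4.0])            # L2 norm is 5
by_norm = clip_by_norm(grad, 1.0)       # values 0.6 and -0.8: same direction, norm 1
by_value = clip_by_value(grad, 1.0)     # values 1 and -1: the direction changes
print(by_norm, by_value)
```

Norm clipping preserves the direction of the update, while value clipping can change it because each element is limited independently.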

## How to connect the output of a recurrent layer to another layer

We can connect the output of an RNN to the next layer in a model in two different ways:

- Use all the hidden states generated by the RNN (a sequence of feature vectors) for a given sequence
- Use only the last hidden state (one feature vector per sequence)

If using the first approach, the next layer has to support processing sequences (or you can flatten the sequence as a single vector, as we have seen in the CNN section of this tutorial). To choose whether you want a recurrent layer to output a sequence or a single vector, use the keyword parameter `return_sequences`. The default for this parameter is `False`.
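The effect of `return_sequences` on the output shape can be mimicked with a small NumPy stand-in for a recurrent layer. The function `rnn_layer` below is a hypothetical helper written for this illustration, not part of Keras.

```python
import numpy as np

def rnn_layer(X, W_x, W_h, return_sequences=False):
    """Minimal tanh RNN over a batch X of shape (n_samples, n_timesteps, n_features)."""
    h = np.zeros((X.shape[0], W_h.shape[0]))
    states = []
    for t in range(X.shape[1]):
        h = np.tanh(X[:, t] @ W_x + h @ W_h)
        states.append(h)
    if return_sequences:
        return np.stack(states, axis=1)     # (n_samples, n_timesteps, n_hidden)
    return h                                # (n_samples, n_hidden): last state only

rng = np.random.RandomState(0)
X = rng.randn(2, 7, 4)                      # 2 sequences, 7 timesteps, 4 features
W_x, W_h = rng.randn(4, 3) * 0.1, rng.randn(3, 3) * 0.1

seq = rnn_layer(X, W_x, W_h, return_sequences=True)
last = rnn_layer(X, W_x, W_h)
flat = seq.reshape(seq.shape[0], -1)        # flattening the sequence into one vector
print(seq.shape, last.shape, flat.shape)
```

The last-state output is simply the final slice of the full sequence (`seq[:, -1]`), and the flattened version has `n_timesteps * n_hidden` features per sample.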

Besides recurrent layers, Keras also supports using any layer for processing a sequence by using the `TimeDistributed` wrapper. This wrapper is equivalent to making a copy of the wrapped layer for each timestep of the sequence, with all parameters tied. For example:

In [1]:
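from keras.models import Sequential
from keras.layers import Dense, Dropout, TimeDistributed
from keras.layers.recurrent import LSTM

# Option 1: pass all hidden states on (input shape is illustrative: 50 timesteps, 100 features)
model = Sequential()
model.add(LSTM(512, return_sequences=True, input_shape=(50, 100)))
model.add(TimeDistributed(Dropout(0.5)))
model.add(TimeDistributed(Dense(256)))

# Option 2: pass only the last hidden state on
model = Sequential()
model.add(LSTM(256, input_shape=(50, 100))) # Using default value for return_sequences, which is False
model.add(Dense(256))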
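What "a copy of the wrapped layer for each timestep, with all parameters tied" means can be checked numerically. The NumPy sketch below uses two hypothetical helpers, `dense` and `time_distributed`, that stand in for a `Dense` layer and the wrapper; it is not the Keras implementation.

```python
import numpy as np

def dense(x, W, b):
    """Fully connected layer without activation."""
    return x @ W + b

def time_distributed(layer_fn, X):
    """Apply layer_fn to every timestep of X, shape (n_samples, n_timesteps, n_features)."""
    return np.stack([layer_fn(X[:, t]) for t in range(X.shape[1])], axis=1)

rng = np.random.RandomState(0)
H = rng.randn(2, 5, 8)                      # e.g. hidden states from a recurrent layer
W, b = rng.randn(8, 3), rng.randn(3)        # one set of weights shared by all timesteps

out = time_distributed(lambda x: dense(x, W, b), H)

# Same result as folding the time axis into the batch axis and applying the layer once
folded = dense(H.reshape(-1, 8), W, b).reshape(2, 5, 3)
print(out.shape, np.allclose(out, folded))
```

Since the weights are shared, the number of parameters does not depend on the sequence length.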

## Bidirectional RNNs

Bidirectional RNNs use two RNNs in parallel to process a sequence: one goes forward in time, the other backwards. The outputs of both RNNs are then concatenated and used as the input for the next layer in the network. In Keras, one can implement a bidirectional RNN by setting the keyword argument `go_backwards` on one of the two RNNs and then concatenating both outputs with the `merge` function. Note that this requires using the functional API instead of the sequential API, which we will not cover today (please check the [tutorial](http://keras.io/getting-started/functional-api-guide/) for the functional API in the official documentation).

In [None]:
from keras.layers import Input, merge
from keras.layers.recurrent import GRU

x = Input(shape=(50, 100)) # 50 timesteps with 100 features each

h1_fw = GRU(256)(x)
h1_bw = GRU(256, go_backwards=True)(x)

# With the functional API, tensors are combined with the `merge` function
h1 = merge([h1_fw, h1_bw], mode='concat', concat_axis=-1)
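The same idea can be written out in NumPy to show what the two directions compute. The helper `rnn_last_state` and its weights are assumptions for this sketch; it uses a plain tanh RNN rather than a GRU to keep the example short.

```python
import numpy as np

def rnn_last_state(X, W_x, W_h, go_backwards=False):
    """Last hidden state of a tanh RNN over X, shape (n_samples, n_timesteps, n_features)."""
    if go_backwards:
        X = X[:, ::-1]                      # process the timesteps in reverse order
    h = np.zeros((X.shape[0], W_h.shape[0]))
    for t in range(X.shape[1]):
        h = np.tanh(X[:, t] @ W_x + h @ W_h)
    return h

rng = np.random.RandomState(0)
n_features, n_hidden = 4, 3
X = rng.randn(2, 7, n_features)

# Each direction has its own set of parameters
W_x_fw, W_h_fw = rng.randn(n_features, n_hidden) * 0.1, rng.randn(n_hidden, n_hidden) * 0.1
W_x_bw, W_h_bw = rng.randn(n_features, n_hidden) * 0.1, rng.randn(n_hidden, n_hidden) * 0.1

h_fw = rnn_last_state(X, W_x_fw, W_h_fw)
h_bw = rnn_last_state(X, W_x_bw, W_h_bw, go_backwards=True)
h = np.concatenate([h_fw, h_bw], axis=-1)   # twice as many features as one direction
print(h.shape)
```

The forward state summarizes the sequence up to its last timestep, while the backward state summarizes it down to its first timestep, so the concatenation sees the whole sequence from both ends.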

## Full example

As a speech-related example, let's implement a simple model to predict the clean magnitude spectrum of speech signals from the noisy magnitude spectrum.

(Note: this is not the best way to go around solving this task, but it's a simpler model to use as an example.)

You will notice that compilation times for RNNs are usually much longer than for the other networks we have seen in this tutorial, especially when you backpropagate over a large number of timesteps. 

In [None]:
import numpy as np
np.random.seed(42)  # for reproducibility

from keras.models import Sequential
from keras.layers import Input, Dense, Dropout, TimeDistributed
from keras.layers.recurrent import LSTM

# In case your RNN is backpropagating over a large number of timesteps, you might need to do this
# (only when using the Theano backend)
import sys
sys.setrecursionlimit(50000)

# Loading data. Input is already standardized (mean = 0, standard deviation = 1)
X = np.load('stft_data_noisy.npy') # shape is (n_samples, n_frames, n_frequency_bins)
y = np.load('stft_data_clean.npy') # same as above, but scaled to the range [0, 1]

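X_train, X_test = X[:10000], X[10000:]
y_train, y_test = y[:10000], y[10000:]

model = Sequential()
# The first layer needs the input shape: (n_frames, n_frequency_bins)
model.add(LSTM(256, return_sequences=True, input_shape=X_train.shape[1:]))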
model.add(LSTM(256, return_sequences=True))
model.add(TimeDistributed(Dropout(0.5)))
model.add(TimeDistributed(Dense(y_train.shape[-1], activation='sigmoid')))

model.compile(loss='mse', optimizer='adam')
history = model.fit(X_train, y_train, batch_size=32, nb_epoch=100, validation_split=0.1)

## More info/references

- [Keras documentation for recurrent layers](http://keras.io/layers/recurrent/)

- [A. Graves, A.-R. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks,"](http://www.cs.toronto.edu/~fritz/absps/RNN13.pdf) in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 6645–6649.

- [Chapter 10 of the Deep Learning book - Sequence Modeling: Recurrent and Recursive Nets](http://www.deeplearningbook.org/contents/rnn.html)

- [A. Karpathy, "The Unreasonable Effectiveness of Recurrent Neural Networks"](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)

- [C. Olah, "Understanding LSTMs"](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)