# 15. Processing Sequences Using RNNs and CNNs

In this chapter we will cover recurrent neural networks (RNNs), which are especially useful for sequential data such as time series.

### Recurrent Neurons and Layers

A recurrent neural network looks very much like a feedforward neural network, except it also has connections pointing backward.

![RNN](images/15.RNN.png)

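Each recurrent neuron has two sets of weights: one for the inputs $x_{(t)}$ and one for the outputs of the previous time step, $y_{(t-1)}$. Let's call these weight vectors $w_x$ and $w_y$. For a whole layer, we can stack them into two matrices, $W_x$ and $W_y$.

The output vector of the layer at time step $t$ is therefore (with $b$ the bias vector and $\phi$ the activation function):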

$y_{(t)}= \phi(W_x^T x_{(t)} + W_y^T y_{(t-1)} + b)$
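
To make the equation concrete, here is a minimal NumPy sketch of a recurrent layer applied to a batch of sequences. The function name `rnn_forward`, the zero initial output and the random demo weights are assumptions made for illustration; this is not how Keras implements `SimpleRNN` internally.

```python
import numpy as np

def rnn_forward(X, W_x, W_y, b, phi=np.tanh):
    """Run one recurrent layer over a batch of sequences.

    X   : [batch_size, n_steps, n_inputs]
    W_x : [n_inputs, n_neurons]
    W_y : [n_neurons, n_neurons]
    b   : [n_neurons]
    Returns the outputs at every time step: [batch_size, n_steps, n_neurons].
    """
    batch_size, n_steps, _ = X.shape
    n_neurons = W_x.shape[1]
    y_prev = np.zeros((batch_size, n_neurons))  # assumption: y_(-1) = 0
    outputs = []
    for t in range(n_steps):
        # Batched (row-vector) form of y_(t) = phi(W_x^T x_(t) + W_y^T y_(t-1) + b)
        y_prev = phi(X[:, t] @ W_x + y_prev @ W_y + b)
        outputs.append(y_prev)
    return np.stack(outputs, axis=1)

# Demo with random (untrained) weights
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(4, 6, 3))        # 4 sequences, 6 steps, 3 inputs
W_x_demo = rng.normal(size=(3, 5)) * 0.1
W_y_demo = rng.normal(size=(5, 5)) * 0.1
b_demo = np.zeros(5)
Y_demo = rnn_forward(X_demo, W_x_demo, W_y_demo, b_demo)
print(Y_demo.shape)  # (4, 6, 5)
```

Note that the same `W_x`, `W_y` and `b` are reused at every time step; only `y_prev` changes.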

#### Memory Cells

Since the output of a recurrent neuron at time step $t$ is a function of all the inputs from previous time steps, we can say it has a form of memory.

A part of a neural network that preserves some state across time steps is called a **memory cell**.

#### Input and Output Sequences

There are several types of input-output sequences:

* Sequence-to-sequence (e.g. stock price prediction)
* Sequence-to-vector = **encoder** (e.g. sentiment score)
* Vector-to-sequence = **decoder** (e.g. caption for image)

We can also combine them. A typical example is machine translation, where an encoder is followed by a decoder.

### Training RNNs

The trick is to _unroll the network through time_ and then use regular backprop. This is called **backprop through time** (BPTT).

Simply put, we have:

1. A forward pass goes through the unrolled network
2. The output sequence is evaluated using a cost function
3. The gradients of that cost function are propagated backward through the unrolled network
4. The model parameters are updated using the gradients computed during BPTT
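
The four steps above can be sketched by hand for a tiny RNN with a single neuron and a scalar input. Everything here (the names `forward`, `cost` and `bptt_gradients`, the squared-error cost, the sine-wave data, the learning rate) is an assumption chosen for illustration; in practice Keras computes these gradients automatically.

```python
import numpy as np

def forward(params, x):
    """Step 1: forward pass through the unrolled network.
    Returns [y_(-1), y_(0), ..., y_(T-1)] with y_(-1) = 0 (assumed)."""
    w_x, w_y, b = params
    ys = [0.0]
    for x_t in x:
        ys.append(np.tanh(w_x * x_t + w_y * ys[-1] + b))
    return np.array(ys)

def cost(params, x, target):
    """Step 2: evaluate the whole output sequence (squared error)."""
    return 0.5 * np.sum((forward(params, x)[1:] - target) ** 2)

def bptt_gradients(params, x, target):
    """Step 3: propagate gradients backward through the time steps."""
    w_x, w_y, b = params
    ys = forward(params, x)
    grads = np.zeros(3)            # d cost / d (w_x, w_y, b)
    d_from_future = 0.0            # gradient reaching y_(t) from step t + 1
    for t in reversed(range(len(x))):
        d_y = (ys[t + 1] - target[t]) + d_from_future
        d_z = d_y * (1.0 - ys[t + 1] ** 2)          # derivative of tanh
        grads += d_z * np.array([x[t], ys[t], 1.0])  # shared weights accumulate
        d_from_future = d_z * w_y
    return grads

# Toy task: predict the next value of a sine wave
signal = np.sin(np.linspace(0, 3, 21))
x_seq, target_seq = signal[:-1], signal[1:]

params = np.array([0.5, 0.1, 0.0])
initial_cost = cost(params, x_seq, target_seq)
for _ in range(100):
    # Step 4: update the parameters with the BPTT gradients
    params = params - 0.01 * bptt_gradients(params, x_seq, target_seq)
final_cost = cost(params, x_seq, target_seq)
print(initial_cost, final_cost)
```

Because the weights are shared across time steps, each step adds its contribution to the same three gradients.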

### Forecasting a Time Series

Time series can be classified by the number of values per time step: **univariate** (one value) or **multivariate** (several values).  
They can also be classified by our goal: **forecasting** (predicting future values) or **imputation** (filling in missing past values).

In [1]:
import numpy as np

def generate_time_series(batch_size, n_steps):
    freq1, freq2, offsets1, offsets2 = np.random.rand(4, batch_size, 1)
    time = np.linspace(0, 1, n_steps)
    series = 0.5 * np.sin((time - offsets1) * (freq1 * 10 + 10)) # wave 1
    series += 0.2 * np.sin((time - offsets2) * (freq2 * 20 + 20)) # + wave 2
    series += 0.1 * (np.random.rand(batch_size, n_steps) - 0.5) # + noise
    return series[..., np.newaxis].astype(np.float32)

Time series are usually represented as 3D arrays of shape [batch size, time steps, dimensionality].
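
As a quick illustration of this layout, the sketch below builds placeholder arrays for a univariate and a multivariate batch. The sizes and the choice of 3 features are arbitrary assumptions.

```python
import numpy as np

batch_size, n_steps = 32, 50

# Univariate: one value per time step, so the last dimension is 1
univariate = np.zeros((batch_size, n_steps, 1), dtype=np.float32)

# Multivariate: several values per time step (3 hypothetical features here)
multivariate = np.zeros((batch_size, n_steps, 3), dtype=np.float32)

# A 2D array [batch size, time steps] gains the trailing axis with np.newaxis
flat = np.zeros((batch_size, n_steps), dtype=np.float32)
as_3d = flat[..., np.newaxis]

print(univariate.shape, multivariate.shape, as_3d.shape)
# (32, 50, 1) (32, 50, 3) (32, 50, 1)
```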

In [2]:
n_steps = 50
series = generate_time_series(10000, n_steps + 1)
X_train, y_train = series[:7000, :n_steps], series[:7000, -1]
X_valid, y_valid = series[7000:9000, :n_steps], series[7000:9000, -1]
X_test, y_test = series[9000:, :n_steps], series[9000:, -1]

#### Baseline Metrics

The simplest approach is to predict the last value in each series (**naive forecasting**):

In [3]:
y_pred = X_valid[:, -1]

In [6]:
from tensorflow import keras 

np.mean(keras.losses.mean_squared_error(y_valid, y_pred))

0.020701446

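Another simple approach is to use a fully connected network. Since it expects a flat list of features for each input, we add a `Flatten` layer. In the example below we use a single `Dense` layer with no activation, which is just a linear regression (LR) model: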

In [9]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[50, 1]),
    keras.layers.Dense(1)
])

In [12]:
model.compile(loss="mse", optimizer="adam")
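
For comparison, the same kind of linear model can be fitted in closed form with NumPy least squares, without Keras. This is only a sketch: `make_series` is a seeded 2D variant of `generate_time_series` written for this example, and the dataset sizes are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_series(batch_size, n_steps):
    """Same recipe as generate_time_series, but seeded and 2D: [batch, steps]."""
    freq1, freq2, offsets1, offsets2 = rng.random((4, batch_size, 1))
    time = np.linspace(0, 1, n_steps)
    series = 0.5 * np.sin((time - offsets1) * (freq1 * 10 + 10))   # wave 1
    series += 0.2 * np.sin((time - offsets2) * (freq2 * 20 + 20))  # + wave 2
    series += 0.1 * (rng.random((batch_size, n_steps)) - 0.5)      # + noise
    return series

n_steps = 50
data = make_series(3000, n_steps + 1)
X_tr, y_tr = data[:2000, :n_steps], data[:2000, -1]
X_va, y_va = data[2000:, :n_steps], data[2000:, -1]

# Append a column of ones so the last weight acts as the bias term
A_tr = np.hstack([X_tr, np.ones((len(X_tr), 1))])
A_va = np.hstack([X_va, np.ones((len(X_va), 1))])

# Least-squares fit: one weight per time step plus the bias
w, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)

mse_linear = np.mean((A_va @ w - y_va) ** 2)
mse_naive = np.mean((X_va[:, -1] - y_va) ** 2)  # naive forecast baseline
print("linear:", mse_linear, "naive:", mse_naive)
```

On this kind of data the linear model should come out clearly ahead of the naive forecast, which gives the RNNs below two baselines to beat.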

### Implementing a Simple RNN

Now let's try to beat our naive metrics! Here is the simplest possible RNN:

In [8]:
model = keras.models.Sequential([
    keras.layers.SimpleRNN(1, input_shape=[None, 1])
])

In [18]:
model.compile(loss="mse", optimizer="adam")

#### Deep RNNs

Let's add more layers of cells:

In [20]:
model = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.SimpleRNN(20, return_sequences=True),
    keras.layers.SimpleRNN(1)
])

Although this works well (better than our LR model), it might be preferable to replace the output layer with a `Dense` layer: it would run slightly faster, the accuracy would be roughly the same, and it would allow us to choose any output activation function we want.

In [21]:
model = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.SimpleRNN(20),
    keras.layers.Dense(1)
])

#### Forecasting Several Time Steps Ahead

But what if we want to go further than one step ahead? The first option is the intuitive one: predict the next value, append it to the inputs as if it had actually occurred, and use the model again to predict the following value, and so on.

In [22]:
series = generate_time_series(1, n_steps + 10)

X_new, Y_new = series[:, :n_steps], series[:, n_steps:]
X = X_new
for step_ahead in range(10):
    y_pred_one = model.predict(X[:, step_ahead:])[:, np.newaxis, :]
    X = np.concatenate([X, y_pred_one], axis=1)
    
Y_pred = X[:, n_steps:]

To judge the result, it is useful to compare its error with that of a naive forecast (i.e. the series stays constant for the next 10 steps) or of a simple linear model.
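
The loop and the baseline comparison can be sketched without a trained model by passing the one-step predictor as a function. The names `forecast_iteratively`, `naive_predict_next` and `linear_trend_predict_next` are hypothetical helpers written for this example; a trained Keras model's `predict` method would take the place of `predict_next`.

```python
import numpy as np

def forecast_iteratively(predict_next, X, n_ahead):
    """Feed each prediction back in as the newest input.

    X            : [batch, n_steps, 1]
    predict_next : maps [batch, steps, 1] to [batch, 1]
    Returns the n_ahead forecasts: [batch, n_ahead, 1].
    """
    n_steps = X.shape[1]
    for step_ahead in range(n_ahead):
        y_next = predict_next(X[:, step_ahead:])[:, np.newaxis, :]
        X = np.concatenate([X, y_next], axis=1)
    return X[:, n_steps:]

def naive_predict_next(window):
    """Naive forecast: repeat the last observed value."""
    return window[:, -1]

def linear_trend_predict_next(window):
    """Extend the straight line through the last two values."""
    return 2 * window[:, -1] - window[:, -2]

# Toy series 0, 1, 2, 3, 4 whose true continuation is 5, 6, 7
history = np.arange(5, dtype=float).reshape(1, 5, 1)
truth = np.array([5.0, 6.0, 7.0]).reshape(1, 3, 1)

naive_forecast = forecast_iteratively(naive_predict_next, history, 3)
trend_forecast = forecast_iteratively(linear_trend_predict_next, history, 3)

print(naive_forecast[0, :, 0])  # 4, 4, 4
print(trend_forecast[0, :, 0])  # 5, 6, 7
print(np.mean((truth - naive_forecast) ** 2), np.mean((truth - trend_forecast) ** 2))
```

With a real model, errors tend to accumulate because each forecast is built on earlier forecasts, which is why comparing against simple baselines matters here.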

The second option is to train an RNN to predict all 10 next values at once (sequence-to-vector, 10 values instead of 1). 

In [23]:
series = generate_time_series(10000, n_steps + 10)
X_train, Y_train = series[:7000, :n_steps], series[:7000, -10:, 0]
X_valid, Y_valid = series[7000:9000, :n_steps], series[7000:9000, -10:, 0]
X_test, Y_test = series[9000:, :n_steps], series[9000:, -10:, 0]

Now we just need the output layer to have 10 units instead of 1:

In [24]:
model = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.SimpleRNN(20),
    keras.layers.Dense(10)
])

This works very well (better than the linear model), but we can push things further. Instead of forecasting the next 10 values only at the last time step, we can train our model to forecast the next 10 values at each and every time step (sequence-to-sequence).

In [25]:
Y = np.empty((10000, n_steps, 10)) # each target is a sequence of 10D vectors
for step_ahead in range(1, 10 + 1):
    Y[:, :, step_ahead - 1] = series[:, step_ahead:step_ahead + n_steps, 0]
Y_train = Y[:7000]
Y_valid = Y[7000:9000]
Y_test = Y[9000:]
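
To check what these targets contain, the sketch below applies the same construction to a toy series whose value at each step is simply its time index. The toy data and the names `toy`, `X_toy`, `Y_toy` and `horizon` are assumptions made for illustration.

```python
import numpy as np

n_steps, horizon = 50, 10

# 3 identical toy series: the value at time step t is t itself
toy = np.tile(np.arange(n_steps + horizon, dtype=np.float32), (3, 1))[..., np.newaxis]
X_toy = toy[:, :n_steps]                      # inputs: steps 0 to 49

Y_toy = np.empty((3, n_steps, horizon), dtype=np.float32)
for step_ahead in range(1, horizon + 1):
    # Column step_ahead - 1 holds the value step_ahead steps into the future
    Y_toy[:, :, step_ahead - 1] = toy[:, step_ahead:step_ahead + n_steps, 0]

print(Y_toy[0, 0])    # target at time step 0: values 1 to 10
print(Y_toy[0, -1])   # target at time step 49: values 50 to 59
```

So at time step $t$ the target is the vector of values at steps $t+1$ to $t+10$, and the targets overlap heavily with the inputs except at the last steps.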

**Tip**: instead of a one-shot prediction, we could use techniques such as Monte Carlo (MC) dropout to see how our predictions vary across multiple stochastic runs.
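
As a rough illustration of the MC dropout idea, the sketch below keeps dropout active at prediction time and summarizes many stochastic forward passes. The tiny network, its random untrained weights and the dropout rate are all assumptions; with a Keras model containing `Dropout` layers, the usual equivalent is to call the model repeatedly with `training=True`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a trained model: 50 inputs, 16 hidden units, 10 outputs
W1 = rng.normal(size=(50, 16)) * 0.1
W2 = rng.normal(size=(16, 10)) * 0.1

def stochastic_forward(x, rate=0.2):
    """Forward pass with dropout left ON, so every call gives a different answer."""
    hidden = np.tanh(x @ W1)
    keep = rng.random(hidden.shape) >= rate   # random mask of units to keep
    hidden = hidden * keep / (1.0 - rate)     # inverted dropout scaling
    return hidden @ W2

x_new = rng.normal(size=(1, 50))

# 100 stochastic predictions for the same input: [100, 1, 10]
samples = np.stack([stochastic_forward(x_new) for _ in range(100)])
y_mean = samples.mean(axis=0)   # averaged forecast
y_std = samples.std(axis=0)     # spread across runs, a rough uncertainty estimate
print(y_mean.shape, y_std.shape)  # (1, 10) (1, 10)
```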

### Handling Long Sequences

An unrolled RNN on long sequences is a deep network and suffers from the same issues. Let's start to tackle the unstable gradient problem.  

#### Fighting the Unstable Gradients Problem

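Out of the tricks we already know, nonsaturating activation functions (e.g. ReLU) may not help much here and can even make the RNN more unstable, because the same weights are applied at every time step. Batch Normalization is not very effective either: it brings only small benefits, and only when applied between recurrent layers rather than across time steps.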

Another form of normalization often works better with RNNs: **Layer Normalization**. It is similar to BN, except that it normalizes across the features dimension instead of the batch dimension.
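
The difference between the two comes down to the axis over which the statistics are computed. A minimal NumPy sketch follows; the learnable scale and offset parameters that the real layers also have are left out, and the `eps` value is an assumption.

```python
import numpy as np

def layer_norm(x, eps=1e-3):
    """Normalize each instance across its features (last axis)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm(x, eps=1e-3):
    """Normalize each feature across the instances of the batch (first axis)."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
activations = rng.normal(loc=5.0, scale=3.0, size=(4, 8))  # [batch, features]

ln = layer_norm(activations)
bn = batch_norm(activations)
print(ln.mean(axis=-1))  # close to 0 for each of the 4 instances
print(bn.mean(axis=0))   # close to 0 for each of the 8 features
```

Since layer normalization only needs the features of one instance at one time step, it behaves the same way during training and testing and does not depend on the batch.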

In [26]:
class LNSimpleRNNCell(keras.layers.Layer):
    def __init__(self, units, activation="tanh", **kwargs):
        super().__init__(**kwargs)
        self.state_size = units
        self.output_size = units
        self.simple_rnn_cell = keras.layers.SimpleRNNCell(units,
                                                          activation=None)
        self.layer_norm = keras.layers.LayerNormalization()
        self.activation = keras.activations.get(activation)
    def call(self, inputs, states):
        outputs, new_states = self.simple_rnn_cell(inputs, states)
        norm_outputs = self.activation(self.layer_norm(outputs))
        return norm_outputs, [norm_outputs]

#### Tackling the Short-Term Memory Problem

After a while, RNNs _forget_ the initial inputs. To tackle this problem, various types of cells with long-term memory have been introduced.

##### LSTM cells

A Long Short-Term Memory (LSTM) cell can easily be used as a black box in Keras:

In [27]:
model = keras.models.Sequential([
    keras.layers.LSTM(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.LSTM(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])

But how does it work? It is similar to a basic cell, except that its state is split into two vectors: **$h_{(t)}$** (short-term state) and **$c_{(t)}$** (long-term state).

The key idea is that the network can learn what to store in the long-term state, what to throw away, and what to read from it. 

![LSTM](images/15.LSTM.png)

A very short explanation. The current input $x_{(t)}$ and the previous short-term state $h_{(t-1)}$ are fed to four fully connected layers:

1. $f_{(t)}$ (forget gate) decides how much of $c_{(t-1)}$ is "**[f]**orgotten" (dropped)
2. $g_{(t)}$ is the main layer; it plays the same role as in our basic cell, which only has this layer
3. $i_{(t)}$ (input gate) decides how much of $g_{(t)}$, the candidate **[i]**nput, should be added to the long-term state
4. $o_{(t)}$ (output gate) controls which parts of the long-term state should be read and **[o]**utput at this time step
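
Written in the same style as the recurrent layer equation earlier, the standard LSTM computations are usually given as follows. The weight names ($W_{xi}$, $W_{hi}$, and so on) and the symbols $\sigma$ (logistic function) and $\otimes$ (element-wise multiplication) are notation introduced here:

$i_{(t)} = \sigma(W_{xi}^T x_{(t)} + W_{hi}^T h_{(t-1)} + b_i)$

$f_{(t)} = \sigma(W_{xf}^T x_{(t)} + W_{hf}^T h_{(t-1)} + b_f)$

$o_{(t)} = \sigma(W_{xo}^T x_{(t)} + W_{ho}^T h_{(t-1)} + b_o)$

$g_{(t)} = \tanh(W_{xg}^T x_{(t)} + W_{hg}^T h_{(t-1)} + b_g)$

$c_{(t)} = f_{(t)} \otimes c_{(t-1)} + i_{(t)} \otimes g_{(t)}$

$y_{(t)} = h_{(t)} = o_{(t)} \otimes \tanh(c_{(t)})$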

In short, an LSTM cell can:
* Learn to recognize an important input (that’s the role of the input gate)
* Store it in the long-term state
* Preserve it for as long as it is needed (that’s the role of the forget gate)
* Extract it whenever it is needed
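
To see these roles in action, here is a minimal NumPy sketch of a single LSTM time step. The function `lstm_step`, the gate ordering inside the stacked weight matrices and the demo values are assumptions made for illustration, not the layout Keras uses internally.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_x, W_h, b):
    """One LSTM time step.

    W_x : [n_inputs, 4 * units], W_h : [units, 4 * units], b : [4 * units]
    The four blocks are ordered i, f, g, o (an arbitrary choice made here).
    """
    z = x_t @ W_x + h_prev @ W_h + b
    i, f, g, o = np.split(z, 4, axis=-1)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # gates take values in (0, 1)
    g = np.tanh(g)                                # candidate values
    c_t = f * c_prev + i * g                      # drop some memory, add some input
    h_t = o * np.tanh(c_t)                        # read part of the long-term state
    return h_t, c_t

rng = np.random.default_rng(0)
n_inputs, units, batch = 3, 4, 2
W_x = rng.normal(size=(n_inputs, 4 * units)) * 0.1
W_h = rng.normal(size=(units, 4 * units)) * 0.1
b = np.zeros(4 * units)

x_t = rng.normal(size=(batch, n_inputs))
h_prev = np.zeros((batch, units))
c_prev = rng.normal(size=(batch, units))

h_t, c_t = lstm_step(x_t, h_prev, c_prev, W_x, W_h, b)

# Input gate forced shut and forget gate forced open: the memory is preserved
b_hold = b.copy()
b_hold[0:units] = -20.0          # i close to 0: nothing new is stored
b_hold[units:2 * units] = 20.0   # f close to 1: nothing is forgotten
_, c_kept = lstm_step(x_t, h_prev, c_prev, W_x, W_h, b_hold)
print(np.allclose(c_kept, c_prev, atol=1e-6))  # True
```

The last few lines show the "preserve it for as long as it is needed" behaviour: when the forget gate stays near 1 and the input gate near 0, the long-term state passes through the step almost unchanged.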