[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nrkfeller/YCBS_258_Fall_2019/blob/master/4.RNN/Forecasting.ipynb)

# Time Series Problem
We create a univariate time series: a single value per time step. A multivariate time series, by contrast, has several values per time step, such as financial data tracked over time (price, volume, revenues, debt, etc.).

In this case we will try to predict future values (forecasting), but it could also be an imputation problem, where we attempt to fill in missing values. This was one of the early uses of neural networks: scientists used RNNs to fill in missing values in databases.

### Generate Time Series
* This function generates as many time series as requested by ```batch_size```
* Each time series is of length ```n_steps```
* Each series is the sum of two sine waves with random frequencies and phases, plus a bit of noise. The amplitudes are fixed
* It returns a NumPy array of shape ```[batch_size, n_steps, 1]```

In [0]:
import numpy as np
import matplotlib.pyplot as plt
from tensorflow import keras

In [0]:
def generate_time_series(batch_size, n_steps):
    freq1, freq2, offsets1, offsets2 = np.random.rand(4, batch_size, 1)
    time = np.linspace(0, 1, n_steps)
    series = 0.5 * np.sin((time - offsets1) * (freq1 * 10 + 10))  #   wave 1
    series += 0.2 * np.sin((time - offsets2) * (freq2 * 20 + 20)) # + wave 2
    series += 0.1 * (np.random.rand(batch_size, n_steps) - 0.5)   # + noise
    return series[..., np.newaxis].astype(np.float32)
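
As a quick sanity check, here is a small sketch that calls the generator and inspects the result. The function is repeated only so the snippet runs on its own. Since the amplitudes are 0.5 and 0.2 and the noise stays within ±0.05, the absolute values should not exceed roughly 0.75.

```python
import numpy as np

def generate_time_series(batch_size, n_steps):
    # Same generator as above, repeated so this snippet is self-contained.
    freq1, freq2, offsets1, offsets2 = np.random.rand(4, batch_size, 1)
    time = np.linspace(0, 1, n_steps)
    series = 0.5 * np.sin((time - offsets1) * (freq1 * 10 + 10))
    series += 0.2 * np.sin((time - offsets2) * (freq2 * 20 + 20))
    series += 0.1 * (np.random.rand(batch_size, n_steps) - 0.5)
    return series[..., np.newaxis].astype(np.float32)

demo = generate_time_series(batch_size=3, n_steps=51)
print(demo.shape)   # (3, 51, 1)
print(demo.dtype)   # float32

# Largest absolute value across the batch; bounded by 0.5 + 0.2 + 0.05.
max_abs = float(np.abs(demo).max())
print(max_abs)
```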

### Creating Data Splits
1. X_train is 7000 time series of 50 values
2. X_valid is 2000 time series of 50 values
3. X_test is the remaining 1000 time series of 50 values
4. Each y is the single value that comes immediately after the corresponding X time series

In [0]:
n_steps = 50
series = generate_time_series(10000, n_steps + 1)

X_train, y_train = series[:7000, :n_steps], series[:7000, -1]
X_valid, y_valid = series[7000:9000, :n_steps], series[7000:9000, -1]
X_test, y_test = series[9000:, :n_steps], series[9000:, -1]

for i in range(1):
  plt.plot(series[i].reshape(-1))

## Baseline Metrics

In [0]:
from sklearn.metrics import mean_squared_error

y_pred = X_valid[:, -1]

# our baseline error is about 0.02, so we must do better
mean_squared_error(y_valid, y_pred)
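
The baseline above is naive forecasting: predict that the next value equals the last observed one. The sketch below reproduces the idea with NumPy only, on hypothetical stand-in data (`toy` is a batch of noisy sine waves invented for this example, not the notebook's dataset), and compares it with a second trivial baseline that predicts the mean of each window.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 100 noisy sine waves of 51 steps each.
t = np.linspace(0, 1, 51)
phases = rng.uniform(0, 2 * np.pi, size=(100, 1))
toy = 0.5 * np.sin(10 * t + phases) + 0.05 * rng.standard_normal((100, 51))
X_toy, y_toy = toy[:, :50], toy[:, 50]

# Naive forecast: repeat the last observed value.
naive_pred = X_toy[:, -1]
naive_mse = np.mean((y_toy - naive_pred) ** 2)

# Another trivial baseline: predict the mean of the window.
mean_pred = X_toy.mean(axis=1)
mean_mse = np.mean((y_toy - mean_pred) ** 2)

print(naive_mse, mean_mse)
```

Any trained model is only interesting if it beats numbers like these on held-out data.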

In [0]:
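model_linear = keras.models.Sequential([
    # Flatten each [50, 1] series into a vector of 50 values
    keras.layers.Flatten(input_shape=[50, 1]),
    # A single linear unit: plain linear regression on the 50 inputs
    keras.layers.Dense(1)
])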

model_linear.compile(loss='mse', optimizer='adam')

model_linear.fit(X_train, y_train, epochs=20, validation_data=(X_valid, y_valid))

In [0]:
model_linear.evaluate(X_test, y_test)

## Implementing a Simple RNN
* We create a single-layer, single-neuron RNN with ```SimpleRNN```
* The default activation is ```tanh```
* In this simple RNN, the output is the same as the state: Y_t = H_t
* By default, ```SimpleRNN``` returns only the output at the last time step. In our example that is Y at t=49
* If we want the output at all time steps we must set ```return_sequences=True```
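
To make the points above concrete, here is a minimal NumPy sketch of the forward pass of a one-unit recurrent layer. It is an illustration of the recurrence, not the Keras implementation, and the weight values are arbitrary assumptions.

```python
import numpy as np

def simple_rnn_forward(X, w_x, w_h, b):
    """One-unit recurrence: h_t = tanh(w_x * x_t + w_h * h_{t-1} + b)."""
    batch_size, n_steps, _ = X.shape
    h = np.zeros((batch_size, 1))            # initial state is zero
    outputs = []
    for t in range(n_steps):
        h = np.tanh(w_x * X[:, t] + w_h * h + b)
        outputs.append(h)                    # the output at step t is the state
    all_outputs = np.stack(outputs, axis=1)  # [batch_size, n_steps, 1]
    return all_outputs, h

X_demo = np.random.rand(4, 50, 1)
all_outputs, last_output = simple_rnn_forward(X_demo, w_x=0.5, w_h=-0.3, b=0.1)

print(all_outputs.shape)   # (4, 50, 1), like return_sequences=True
print(last_output.shape)   # (4, 1), like the default last-step output
```

The three scalars `w_x`, `w_h` and `b` are the only trainable parameters of such a layer.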

In [0]:
?keras.layers.SimpleRNN

In [0]:
model_rec = keras.models.Sequential([
    # We don't specify the length of the input sequence since RNNs can process any number of timesteps
    keras.layers.SimpleRNN(1, input_shape=[None, 1])
])

model_rec.compile(loss='mse', optimizer='adam')

model_rec.fit(X_train, y_train, epochs=20, validation_data=(X_valid, y_valid), verbose=0)

model_rec.evaluate(X_test, y_test)

In [0]:
model_linear.summary()

In [0]:
# our RNN model doesn't perform as well, but it has far fewer parameters
model_rec.summary()
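
The parameter counts reported by the two summaries can be checked by hand. The helper names below are made up for this sketch.

```python
def dense_params(n_inputs, units):
    # One weight per input per unit, plus one bias per unit.
    return n_inputs * units + units

def simple_rnn_params(input_dim, units):
    # Input weights + recurrent weights + biases.
    return input_dim * units + units * units + units

print(dense_params(50, 1))       # 51: the linear model
print(simple_rnn_params(1, 1))   # 3: the single-unit RNN
```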

## Deep RNNs
* We must set ```return_sequences=True``` on every recurrent layer that feeds another recurrent layer, so that the next layer receives a 3D sequence of shape ```[batch_size, n_steps, units]``` rather than only the last output. During backpropagation, errors then flow through every time step of that layer instead of just the last one
* The last recurrent layer does not set ```return_sequences=True```, so it passes only its final output to the next layer
* The final layer is a ```Dense(1)``` layer rather than a single-unit ```SimpleRNN```. A one-unit recurrent layer would squeeze the hidden state into a single number. Since that state is not needed after the last time step, a Dense layer should perform about as well and train faster

In [0]:
# Let's try a more complex network

model_rec_deep = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.SimpleRNN(20),
    keras.layers.Dense(1)
])

model_rec_deep.compile(loss='mse', optimizer='adam')

model_rec_deep.fit(X_train, y_train, epochs=20, validation_data=(X_valid, y_valid))

In [0]:
model_rec_deep.summary()

## Forecasting Several Time Steps Ahead
Let's try to predict the next 10 values instead of only the next one.

1. The first option is to use the model we already trained, make it predict the next value, then add that value to the inputs (acting as if this predicted value had actually occurred), and use the model again to predict the following value, and so on.

In [0]:
series = generate_time_series(1, n_steps + 10)
X_new, Y_new = series[:, :n_steps], series[:, n_steps:]
X = X_new

for step_ahead in range(10):
  y_pred_one = model_rec_deep.predict(X[:, step_ahead:])[:, np.newaxis, :]
  X = np.concatenate([X, y_pred_one], axis=1)
  
Y_pred = X[:, n_steps:]

In [0]:
plt.plot(X.reshape(-1))
# the 10 forecasts occupy time steps 50 to 59
plt.plot(np.arange(n_steps, n_steps + 10), Y_pred.reshape(-1))
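
The loop above can be wrapped in a small reusable function. This is a sketch under assumptions: `forecast_iteratively` and `naive_predict` are hypothetical names, and `naive_predict` is only a stand-in for a trained model's `predict` method so that the snippet runs without TensorFlow.

```python
import numpy as np

def forecast_iteratively(predict_fn, X, n_ahead):
    """Feed each one-step prediction back in as if it had been observed.

    predict_fn maps a window of shape [batch, steps, 1] to [batch, 1].
    """
    n_steps = X.shape[1]
    for _ in range(n_ahead):
        y_next = predict_fn(X[:, -n_steps:])            # latest window only
        X = np.concatenate([X, y_next[:, np.newaxis, :]], axis=1)
    return X[:, n_steps:]                               # the forecasts only

def naive_predict(window):
    # Hypothetical stand-in for model.predict: repeat the last value.
    return window[:, -1]

X_demo = np.random.rand(2, 50, 1)
Y_demo = forecast_iteratively(naive_predict, X_demo, n_ahead=10)
print(Y_demo.shape)   # (2, 10, 1)
```

Because every prediction is built on earlier predictions, errors can accumulate as the horizon grows.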

The second option: instead of predicting the next value, appending it to the sequence, and repeating, we can train a model that predicts all 10 values at once.

For this we need each target y to be a chunk of the 10 subsequent values.

This is a sequence-to-vector RNN: it takes in a sequence and outputs a vector, in this case of 10 values.

Simply put, at the last time step we predict 10 values.

In [0]:
series = generate_time_series(10000, n_steps + 10)
X_train, Y_train = series[:7000, :n_steps], series[:7000, -10:, 0]
X_valid, Y_valid = series[7000:9000, :n_steps], series[7000:9000, -10:, 0]
X_test, Y_test = series[9000:, :n_steps], series[9000:, -10:, 0]

In [0]:
model_ten = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.SimpleRNN(20),
    keras.layers.Dense(10)
])

model_ten.compile(loss='mse', optimizer='adam')

In [0]:
model_ten.fit(X_train, Y_train, epochs=20, validation_data=(X_valid, Y_valid))

In [0]:
Y_pred = model_ten.predict(X_new)

In [0]:
# plot the curve

### Seq to seq
In the above model we predict the next 10 values only at the very last time step. Let's make a model that predicts the next 10 values at each time step.

The loss then contains a term for the output at every time step, so many more error gradients flow through the model, and they no longer have to flow only through time from the last step.

This tends to both stabilize and speed up training.

#### The model
1. We need to set ```return_sequences=True``` on every recurrent layer, including the last one, for a seq2seq model
2. The ```TimeDistributed``` layer wraps any layer and applies it at every time step, mapping inputs of shape ```[batch_size, n_steps, in_dims]``` to outputs of shape ```[batch_size, n_steps, out_dims]```. Here it wraps a single ```Dense(10)``` layer whose weights are shared across time steps: at each time step it receives the 20 outputs of the last recurrent layer and produces a vector of 10 forecasts. The model output is therefore a sequence of 10-dimensional vectors, one per time step, not a single vector of size 10.
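
A NumPy sketch of what a time-distributed dense layer computes, assuming a single weight matrix shared by all time steps. The arrays `H`, `W` and `b` are random stand-ins, not values from the notebook's model.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, n_steps, in_dims, out_dims = 4, 50, 20, 10

H = rng.standard_normal((batch_size, n_steps, in_dims))  # stand-in RNN outputs
W = rng.standard_normal((in_dims, out_dims))             # one shared weight matrix
b = np.zeros(out_dims)

# Apply the same dense layer separately at each time step...
per_step = np.stack([H[:, t] @ W + b for t in range(n_steps)], axis=1)

# ...which is equivalent to one matrix product on a reshaped tensor.
flat = (H.reshape(-1, in_dims) @ W + b).reshape(batch_size, n_steps, out_dims)

print(per_step.shape)               # (4, 50, 10)
print(np.allclose(per_step, flat))  # True
```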


In [0]:
Y = np.empty((10000, n_steps, 10)) # each target is a sequence of 10-D vectors
for step_ahead in range(1, 10 + 1):
    Y[:, :, step_ahead - 1] = series[:, step_ahead:step_ahead + n_steps, 0]
Y_train = Y[:7000]
Y_valid = Y[7000:9000]
Y_test = Y[9000:]

Y_train.shape
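
The indexing in the target construction is easier to follow on a toy series whose values equal their own time index. This sketch uses the same loop with smaller, made-up sizes (5 input steps, horizon of 3).

```python
import numpy as np

n_steps_toy, horizon = 5, 3

# One series with values 0, 1, ..., 7, shaped [batch, time, 1].
toy = np.arange(n_steps_toy + horizon, dtype=np.float32).reshape(1, -1, 1)

Y_toy = np.empty((1, n_steps_toy, horizon), dtype=np.float32)
for step_ahead in range(1, horizon + 1):
    Y_toy[:, :, step_ahead - 1] = toy[:, step_ahead:step_ahead + n_steps_toy, 0]

print(Y_toy[0, 0])    # [1. 2. 3.]  targets at time step 0
print(Y_toy[0, -1])   # [5. 6. 7.]  targets at the last time step
```

At time step t the target is the vector of the next `horizon` values, so consecutive targets overlap heavily.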

In [0]:
model_seq2seq = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.SimpleRNN(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])

In [0]:
model_seq2seq.summary()

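We train with the MSE over all time steps, but for evaluation we only care about the forecast made at the last time step. So we add a custom metric that computes the MSE on the output at the last time step only.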

In [0]:
def last_time_step_mse(Y_true, Y_pred):
    return keras.metrics.mean_squared_error(Y_true[:, -1], Y_pred[:, -1])

model_seq2seq.compile(loss="mse", optimizer="adam", metrics=[last_time_step_mse])
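
The difference between the training loss and the metric can be seen with plain NumPy. The arrays below are random stand-ins with the shapes used in this section.

```python
import numpy as np

rng = np.random.default_rng(0)
Y_true_demo = rng.standard_normal((8, 50, 10))   # [batch, n_steps, 10]
Y_pred_demo = rng.standard_normal((8, 50, 10))

# Training loss: error averaged over every time step and every horizon.
full_mse = np.mean((Y_true_demo - Y_pred_demo) ** 2)

# Metric: error of the 10 forecasts made at the last time step only.
last_mse = np.mean((Y_true_demo[:, -1] - Y_pred_demo[:, -1]) ** 2)

print(full_mse, last_mse)
```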

In [0]:
model_seq2seq.fit(X_train, Y_train, epochs=20, validation_data=(X_valid, Y_valid))

# Handling Long Sequences
As we know, RNNs suffer from vanishing gradients. This also makes it very hard for them to learn long-term patterns, as Bengio et al. explain in this paper: http://www.iro.umontreal.ca/~lisa/pointeurs/ieeetrnn94.pdf

### Fighting the Vanishing and Exploding Gradient Problem
What causes exploding gradients?

How do we typically deal with unstable (vanishing or exploding) gradients in deep networks?
1. ReLU
2. Good parameter initialization
3. Faster optimizers
4. Dropout
5. Batch Normalization

A non-saturating activation such as ReLU doesn't help much in this case and is even likely to make the gradients explode, because the same weights are applied at every time step. So in RNNs the default is a saturating activation function, namely "tanh". If your training becomes unstable (which you can monitor with TensorBoard), you can use gradient clipping.
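
Gradient clipping can be sketched in a few lines of NumPy. This illustrates clipping by norm and is not the Keras implementation; in Keras the usual route is to pass ```clipnorm``` or ```clipvalue``` when creating the optimizer. The function name is made up for this example.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    # Rescale the gradient if its L2 norm exceeds max_norm; keep its direction.
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([3.0, 4.0])        # L2 norm is 5
print(clip_by_norm(g, 1.0))     # [0.6 0.8]
print(clip_by_norm(g, 10.0))    # [3. 4.]  already small enough, unchanged
```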

Batch Normalization has been shown not to work very well with RNNs. It seems that applying the same scale and offset to the inputs at each time step does not yield good results.

Layer Normalization usually works better than BN with RNNs. It works in a similar way, but rather than normalizing across the batch dimension, it normalizes across the feature dimension. So the statistics are computed for a single example, over all of its summed inputs. This works for time series because the values are typically all of a similar nature. It also behaves the same way during training and testing, because it doesn't depend on batch statistics.
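
The difference between the two normalizations comes down to the axis used for the statistics. This NumPy sketch leaves out the learned scale and offset, and the value of `eps` is an arbitrary assumption.

```python
import numpy as np

def layer_norm(x, eps=1e-3):
    # Statistics per example, computed across the feature axis.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm_train(x, eps=1e-3):
    # Statistics per feature, computed across the batch axis.
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 20))   # [batch, features]

ln = layer_norm(x)
# Layer norm gives the same result for an example alone or inside a batch.
print(np.allclose(layer_norm(x[:1]), ln[:1]))       # True
# Batch statistics of a single example are degenerate: everything maps to 0.
print(np.allclose(batch_norm_train(x[:1]), 0.0))    # True
```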

### Implementation of Layer Norm Cell
1. Inherit from the ```keras.layers.Layer``` class. We set ```state_size``` and ```output_size```, which are the same.
2. We use a simple RNN cell with no activation, as we want the activation to take place after the layer norm
3. We create the layer norm and the activation
4. ```call()``` takes two arguments: the ```inputs``` at the current time step and the hidden ```states``` from the previous time step
5. It returns the outputs and the new state. In this cell they are the same normalized, activated tensor

In [0]:
class LNSimpleRNNCell(keras.layers.Layer):
    def __init__(self, units, activation="tanh", **kwargs):
        super().__init__(**kwargs)
        self.state_size = units
        self.output_size = units
        self.simple_rnn_cell = keras.layers.SimpleRNNCell(units,
                                                          activation=None)
        self.layer_norm = keras.layers.LayerNormalization()
        self.activation = keras.activations.get(activation)
    def call(self, inputs, states):
        outputs, new_states = self.simple_rnn_cell(inputs, states)
        norm_outputs = self.activation(self.layer_norm(outputs))
        return norm_outputs, [norm_outputs]

In [0]:
model = keras.models.Sequential([
    keras.layers.RNN(LNSimpleRNNCell(20), return_sequences=True, input_shape=[None, 1]),
    keras.layers.RNN(LNSimpleRNNCell(20), return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])

In [0]:
# you can add dropout as a hyperparameter of any RNN layers in keras
?keras.layers.SimpleRNN

In [0]:
# You can fill in this code

### Tackling the Short Term Memory Problem
As you know, the LSTM cell can be used like a simple RNN cell, but it generally performs much better: training tends to converge faster and it can detect longer-term dependencies in the data.
![image](https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-chain.png)
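
For reference, one step of the cell in the diagram can be written in NumPy as follows. This is a sketch of the standard LSTM equations; the gate ordering and the weight layout are assumptions of this example and do not necessarily match how Keras stores its weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W: [input_dim, 4*units], U: [units, 4*units], b: [4*units]."""
    z = x @ W + h_prev @ U + b
    i, f, g, o = np.split(z, 4, axis=-1)   # input, forget, candidate, output
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)
    c = f * c_prev + i * g                 # long-term state
    h = o * np.tanh(c)                     # short-term state, also the output
    return h, c

rng = np.random.default_rng(0)
units, input_dim = 20, 1
W = rng.standard_normal((input_dim, 4 * units)) * 0.1
U = rng.standard_normal((units, 4 * units)) * 0.1
b = np.zeros(4 * units)

X_demo = rng.standard_normal((3, 50, input_dim))
h = np.zeros((3, units))
c = np.zeros((3, units))
for t in range(X_demo.shape[1]):
    h, c = lstm_step(X_demo[:, t], h, c, W, U, b)

print(h.shape, c.shape)            # (3, 20) (3, 20)
print(W.size + U.size + b.size)    # 1760
```

The separate state `c` is only modified by element-wise gating, which is what helps information survive over many time steps.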

In [0]:
model = keras.models.Sequential([
    keras.layers.LSTM(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.LSTM(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])

In [0]:
# Complete

### Using 1D Convolutional Layers to Process Sequences

In [0]:
model = keras.models.Sequential([
    keras.layers.Conv1D(filters=20, kernel_size=4, strides=2, padding="valid",
                        input_shape=[None, 1]),
    keras.layers.GRU(20, return_sequences=True),
    keras.layers.GRU(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])

In [0]:
model.compile(loss="mse", optimizer="adam", metrics=[last_time_step_mse])


In [0]:
history = model.fit(X_train, Y_train[:, 3::2], epochs=20,
                    validation_data=(X_valid, Y_valid[:, 3::2]))
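
The slice ```Y_train[:, 3::2]``` keeps the targets aligned with the shortened sequence that the convolution produces. The arithmetic can be checked with a small sketch (the helper name is made up for this example):

```python
def conv1d_output_length(n_steps, kernel_size, strides):
    # Output length of a 1D convolution with "valid" padding.
    return (n_steps - kernel_size) // strides + 1

n_out = conv1d_output_length(50, kernel_size=4, strides=2)
target_steps = list(range(50))[3::2]   # time steps kept by Y[:, 3::2]

print(n_out)                                # 24
print(len(target_steps))                    # 24
print(target_steps[:3], target_steps[-1])   # [3, 5, 7] 49
```

The first convolution output covers input steps 0 to 3, so its target is the one at step 3. With a stride of 2, each following output ends two steps later.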

### Wavenet
* Stacked 1D convolutional layers
* The dilation rate doubles at every layer
* Within a block, each Conv1D layer therefore sees twice as many input time steps as the one below it
* The original paper used 3 blocks of 10 Conv1D layers with dilation rates of 1, 2, 4, 8, 16, ..., 256, 512

https://deepmind.com/blog/wavenet-generative-model-raw-audio/
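
The effect of doubling the dilation rate shows up in the receptive field, meaning the number of input time steps that can influence one output. A small sketch, assuming stride-1 causal convolutions (the function name is made up for this example):

```python
def receptive_field(dilations, kernel_size=2):
    # Each layer extends the receptive field by dilation * (kernel_size - 1).
    return 1 + sum(d * (kernel_size - 1) for d in dilations)

# Two blocks with dilation rates 1, 2, 4, 8.
print(receptive_field((1, 2, 4, 8) * 2))     # 31

# Three blocks with dilation rates 1, 2, 4, ..., 512.
wavenet_dilations = [2 ** i for i in range(10)] * 3
print(receptive_field(wavenet_dilations))    # 3070
```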


In [0]:
model = keras.models.Sequential()
model.add(keras.layers.InputLayer(input_shape=[None, 1]))
for rate in (1, 2, 4, 8) * 2:
    model.add(keras.layers.Conv1D(filters=20, kernel_size=2, padding="causal",
                                  activation="relu", dilation_rate=rate))
model.add(keras.layers.Conv1D(filters=10, kernel_size=1))
model.compile(loss="mse", optimizer="adam", metrics=[last_time_step_mse])
history = model.fit(X_train, Y_train, epochs=20,
                    validation_data=(X_valid, Y_valid))

### Final word
The last two models we created should give the best performance on this task. Similar architectures have been used to generate voice, translate text, and even compose music. All we need is data and maybe some GPUs.