#### Copyright 2019 Google LLC.

In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Advanced Sequence Prediction

Sequence prediction involves predicting the next value of a given input sequence. You've seen one application of this type of problem already: predicting stock prices in the RNN colab. Some other common applications include:

* *Weather Forecasting*: Given a sequence of observations about the weather over time, predict the weather tomorrow.

* *Product Recommendation*: Given a sequence of products added to a shopping list by a user, predict the user's next purchase.

* *Language Modeling*: Given a sequence of words, predict the next word in the sentence.

These problems have been around for a long time, and are often approached using [statistical methods](http://www.statsoft.com/Textbook/Time-Series-Analysis). In recent years, RNNs have gained popularity for solving these problems, since they are specially designed to handle sequential input.

In this lesson we will use more powerful types of RNNs: a GRU (Gated Recurrent Unit) and an LSTM (Long Short-Term Memory), to perform sequence prediction on stock prices.

## Data

We will use a generated dataset of stock prices, similar to the dataset we created for the RNN colab. We will create 10 years of stock data for a fictional company with an initial public offering (IPO) price of $42.

In [0]:
# Set random seeds for reproducible results.
import numpy as np
import tensorflow as tf

np.random.seed(42)
tf.random.set_seed(42)

In [0]:
days_per_year = 365
years = 10

eod_prices = np.zeros(days_per_year*years)
eod_prices[0] = 42.0

bias = 0.0002

for i in range(1, len(eod_prices)):
  # Find yesterday's price.
  yesterdays_price = eod_prices[i-1]
  
  # Generate a random percentage change from the normal curve.
  percentage_change = np.random.randn(1)[0]
  
  # The random number is a value on the standard normal distribution
  # with a mean 0 and variance 1. This will give us a nice range of
  # positive and negative values, but we need to divide by 100 to scale
  # them down to reasonable percentages for daily stock price changes.
  percentage_change /= 100

  # Finally we give the change just a little bit of positive bias so
  # that we get a more interesting curve.
  percentage_change += bias
  
  # Calculate the new price.
  todays_price = yesterdays_price + yesterdays_price * percentage_change
  
  # Store the price.
  eod_prices[i] = todays_price
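
The loop above compounds one day's change at a time. As an optional aside, here is a minimal sketch of the same compounding written without a loop. The seed, the short 10-day length, and the names `rng`, `changes`, `loop_prices`, and `vector_prices` are assumptions made for illustration; they are not part of the dataset we use in the rest of this colab.

```python
import numpy as np

rng = np.random.RandomState(0)  # hypothetical seed, for illustration only
n_days = 10
start_price = 42.0
bias = 0.0002

# Daily percentage changes: standard normal noise scaled down by 100,
# plus the same small positive bias used above.
changes = rng.randn(n_days - 1) / 100 + bias

# Loop version, mirroring the cell above.
loop_prices = np.zeros(n_days)
loop_prices[0] = start_price
for i in range(1, n_days):
    loop_prices[i] = loop_prices[i - 1] * (1 + changes[i - 1])

# Vectorized version: each price is the start price times the running
# product of the (1 + change) factors seen so far.
vector_prices = start_price * np.concatenate(([1.0], np.cumprod(1 + changes)))

print(np.allclose(loop_prices, vector_prices))
```

Both versions describe the same multiplicative random walk; the loop is easier to read, while the `np.cumprod` form is typically faster for long series.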

Let's quickly visualize the data to make sure it looks reasonably realistic.

In [0]:
import matplotlib.pyplot as plt

plt.plot(list(range(len(eod_prices))), eod_prices)
plt.show()

## Data Preprocessing

RNNs are sensitive to the scale of the input data, especially when using the `sigmoid` or `tanh` activation functions. It can be good practice to normalize or rescale the data to a range of 0 to 1. The [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) in scikit-learn is a good tool for normalization.

MinMaxScaler expects an input of shape `(n_samples, n_features)`. Our data only has one feature (price), so we must reshape it to work with the scaler.

In [0]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
# You can use "-1" as a stand-in for one dimension when reshaping NumPy arrays,
# and NumPy will infer the correct value from the other dimensions.
eod_prices = np.reshape(eod_prices, (-1, 1))
eod_prices = scaler.fit_transform(eod_prices)

eod_prices = np.reshape(eod_prices, -1)
eod_prices[-1]
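
To make the rescaling concrete, here is a small sketch of the arithmetic behind min-max scaling to the range 0 to 1, written in plain NumPy. The `prices` values are toy numbers chosen for illustration, not values from our dataset.

```python
import numpy as np

prices = np.array([42.0, 50.0, 46.0, 58.0])  # toy values, for illustration only

# Forward transform for a (0, 1) range: (x - min) / (max - min).
lo, hi = prices.min(), prices.max()
scaled = (prices - lo) / (hi - lo)

# Inverse transform: scaled * (max - min) + min recovers the original values.
restored = scaled * (hi - lo) + lo

print(scaled)
print(restored)
```

The smallest value maps to 0, the largest maps to 1, and everything else lands proportionally in between. The inverse step is what we rely on later when we un-normalize predictions.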

Notice that the final stock price, around $180, has been normalized to 0.095. To follow the previous RNN colab, let's also use the prices in a 100-day window to predict the next day's stock price.

In [0]:
n_steps = 100
n_samples = len(eod_prices) - n_steps

# Split eod_prices into X (100 day windows) and y (next day prices).
X = np.zeros((n_samples, n_steps))
y = np.zeros(n_samples)

for i in range(n_samples):
  X[i] = eod_prices[i:i + n_steps]
  y[i] = eod_prices[i + n_steps]

y[-1]
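
The windowing is easier to see on a tiny series. The sketch below applies the same slicing to a made-up sequence with a window of 3 instead of 100; `series`, `n_steps_toy`, `X_toy`, and `y_toy` are illustrative names only.

```python
import numpy as np

series = np.arange(10, 17)  # toy series: 10, 11, ..., 16
n_steps_toy = 3
n_samples_toy = len(series) - n_steps_toy

# Each row of X_toy is a window of 3 consecutive values;
# the matching entry of y_toy is the value right after that window.
X_toy = np.array([series[i:i + n_steps_toy] for i in range(n_samples_toy)])
y_toy = series[n_steps_toy:]

for window, target in zip(X_toy, y_toy):
    print(window, "->", target)
```

A series of 7 values with a window of 3 yields 4 samples, which matches the `len(eod_prices) - n_steps` calculation above.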

We need to create a training and testing set. To model real-life forecasting problems, we'll use the stock performance from the first 80% of days for training and the remaining days for testing, which simulates predicting the "future".

In [0]:
split_index = int(n_samples * 0.8)

X_train, X_test = X[:split_index], X[split_index:]
y_train, y_test = y[:split_index], y[split_index:]

X_train.shape, X_test.shape

## RNN Model in Keras

Keras makes it easy to experiment with different [RNN models](https://keras.io/layers/recurrent/). We'll start by training a basic RNN to show the general structure of an RNN model in Keras.

In [0]:
from tensorflow import keras

RNN models in Keras expect a 3-dimensional input with dimensions `(n_samples, n_timesteps, n_features)`. In the documentation, this is described as `(batch_size, timesteps, input_dim)`.

Each of our samples has 100 timesteps and 1 feature (price). We'll need to make `X_train` and `X_test` three dimensional using `np.reshape`. 

In [0]:
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))

X_train.shape, X_test.shape
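
If the extra dimension feels abstract, this small sketch shows the same reshape on a toy array with 2 samples and 3 timesteps. It also shows `np.newaxis` as an alternative way to add the trailing feature dimension. The array `X_2d` is made up for illustration.

```python
import numpy as np

X_2d = np.arange(6).reshape(2, 3)  # toy data: 2 samples, 3 timesteps

# Same pattern as above: keep samples and timesteps, add 1 feature.
X_a = np.reshape(X_2d, (X_2d.shape[0], X_2d.shape[1], 1))

# Equivalent alternative: append a new axis at the end.
X_b = X_2d[..., np.newaxis]

print(X_a.shape)
print(np.array_equal(X_a, X_b))
```

Either form wraps each price in its own length-1 feature vector without changing any of the values.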

### Training

Now let's set up our model. We'll use:

1.   A `SimpleRNN` layer with 25 hidden units (as in the RNN colab).
2.   A `Dense` layer with 1 output unit (the predicted price).

This model is simple enough to converge fairly quickly, so we'll train it for 5 epochs with a batch size of 50. Since this is a regression problem, we'll use [Mean Squared Error](https://en.wikipedia.org/wiki/Mean_squared_error) as the loss function.

In [0]:
n_units = 25
n_epochs = 5
batch_size = 50

In [0]:
rnn_model = keras.Sequential([
  keras.layers.SimpleRNN(n_units),
  keras.layers.Dense(1)
])

rnn_model.compile(
  loss='mean_squared_error',
  optimizer='adam'
)

In [0]:
rnn_model.fit(
    X_train,
    y_train,
    epochs=n_epochs,
    batch_size=batch_size
)
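
To make the recurrent layer a little less of a black box, here is a minimal NumPy sketch of the textbook recurrence that a basic RNN applies at each timestep: the new hidden state is `tanh` of the current input and the previous hidden state, each multiplied by a weight matrix. The weights here are random rather than trained, and the names `W`, `U`, `b`, and `n_units_toy` are assumptions for illustration, not Keras internals.

```python
import numpy as np

rng = np.random.RandomState(0)  # hypothetical seed, for illustration only
n_timesteps, n_features, n_units_toy = 5, 1, 4

# Randomly initialized parameters (a trained model would learn these).
W = rng.randn(n_features, n_units_toy) * 0.1   # input weights
U = rng.randn(n_units_toy, n_units_toy) * 0.1  # recurrent weights
b = np.zeros(n_units_toy)                      # bias

x = rng.randn(n_timesteps, n_features)  # one toy input sequence
h = np.zeros(n_units_toy)               # initial hidden state

# Walk through the sequence, carrying the hidden state forward.
for x_t in x:
    h = np.tanh(x_t @ W + h @ U + b)

print(h.shape)
```

Only the final hidden state is passed on to the `Dense` layer in our model, which turns it into a single predicted price.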

### Testing

Remember that we normalized the data to a [0, 1] range before training. To properly evaluate the model, we'll need to un-normalize the predictions, then calculate the Mean Squared Error based on the original data values.

In [0]:
from sklearn.metrics import mean_squared_error

def evaluate(model):
  predicted_train = model.predict(X_train)
  predicted_test = model.predict(X_test)

  # Un-normalize predictions.
  predicted_train = scaler.inverse_transform(predicted_train)
  predicted_test = scaler.inverse_transform(predicted_test)
  # Get original data values. The scaler expects (n_samples, n_features),
  # so reshape the targets into a single column first.
  y_train_raw = scaler.inverse_transform(y_train.reshape(-1, 1))[:, 0]
  y_test_raw = scaler.inverse_transform(y_test.reshape(-1, 1))[:, 0]
  # Calculate mean squared error.
  train_score = mean_squared_error(y_train_raw, predicted_train[:, 0])
  print('Train Score: %.2f MSE' % train_score)
  test_score = mean_squared_error(y_test_raw, predicted_test[:, 0])
  print('Test Score: %.2f MSE' % test_score)

In [0]:
evaluate(rnn_model)
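
Why does un-normalizing matter for the score? Because min-max scaling is linear, an error measured on scaled data is the raw error divided by the data's range, so the MSE differs by a factor of the range squared. The sketch below checks this with made-up numbers; the prices and the prediction errors are hypothetical.

```python
import numpy as np

raw_true = np.array([100.0, 120.0, 150.0, 200.0])  # toy prices in dollars
lo, hi = raw_true.min(), raw_true.max()

# Scale to [0, 1], then pretend a model made small errors on the scaled data.
scaled_true = (raw_true - lo) / (hi - lo)
scaled_pred = scaled_true + np.array([0.01, -0.02, 0.00, 0.03])  # hypothetical errors

# Un-normalize the predictions back to dollars.
raw_pred = scaled_pred * (hi - lo) + lo

mse_scaled = np.mean((scaled_true - scaled_pred) ** 2)
mse_raw = np.mean((raw_true - raw_pred) ** 2)

# The two scores differ by a factor of (max - min) squared.
print(np.isclose(mse_raw, mse_scaled * (hi - lo) ** 2))
```

A tiny-looking MSE on scaled data can therefore correspond to a much larger error in dollars, which is why `evaluate` reports the score on the original values.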

## GRU Model

With longer input sequences, RNNs sometimes struggle to "remember" input that they saw earlier in the sequence. The [GRU (Gated Recurrent Unit)](https://arxiv.org/pdf/1406.1078v3.pdf) aims to improve upon this by using an *update gate* and *reset gate* to control what information is kept from previous steps and what is "forgotten". 

We won't get into the details of the architecture here. (If you're interested, see the Resources section below.) What's important to know is that Keras makes it very straightforward to use a GRU instead of an RNN. It's a one-line change!

In [0]:
gru_model = keras.Sequential([
  keras.layers.GRU(n_units), # This line is the only change we have to make.
  keras.layers.Dense(1)
])

Let's train this GRU model and see how it does. You'll notice that training is slower than with the RNN because of the extra computations needed for the two gates, but the final results should be better, since the GRU can retain more information from previous steps.

In [0]:
gru_model.compile(
  loss='mean_squared_error',
  optimizer='adam'
)

gru_model.fit(
    X_train,
    y_train,
    epochs=n_epochs,
    batch_size=batch_size
)

evaluate(gru_model)
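
For readers curious about the update and reset gates mentioned at the start of this section, here is an optional NumPy sketch of one GRU step, following the standard formulation from the GRU paper linked above. The weights are random, and the function `gru_step` and the parameter names are assumptions for illustration. The Keras `GRU` layer may order some of these operations differently internally.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU step (illustrative sketch, not the Keras implementation)."""
    # Update gate: how much of the previous state to keep.
    z = sigmoid(x_t @ p["W_z"] + h_prev @ p["U_z"] + p["b_z"])
    # Reset gate: how much of the previous state feeds the candidate.
    r = sigmoid(x_t @ p["W_r"] + h_prev @ p["U_r"] + p["b_r"])
    # Candidate state, computed from the input and the reset-scaled state.
    h_candidate = np.tanh(x_t @ p["W_h"] + (r * h_prev) @ p["U_h"] + p["b_h"])
    # Blend the old state and the candidate using the update gate.
    return z * h_prev + (1.0 - z) * h_candidate

rng = np.random.RandomState(0)  # hypothetical seed, for illustration only
n_features, n_units_toy = 1, 4
params = {}
for gate in ("z", "r", "h"):
    params["W_" + gate] = rng.randn(n_features, n_units_toy) * 0.1
    params["U_" + gate] = rng.randn(n_units_toy, n_units_toy) * 0.1
    params["b_" + gate] = np.zeros(n_units_toy)

h = np.zeros(n_units_toy)
for x_t in rng.randn(5, n_features):
    h = gru_step(x_t, h, params)

print(h.shape)
```

When the update gate is close to 1, the previous state passes through almost unchanged, which is how a GRU can carry information across many timesteps.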

## LSTM Model

[LSTM (Long Short-Term Memory)](http://www.bioinf.jku.at/publications/older/2604.pdf) networks are often even better at learning such long-term dependencies. Similar to GRUs, they also use gates to control the flow of information between steps. 

LSTMs use an explicit *cell state* to control this information flow, which interacts with the *hidden state* that we're used to thinking about. An *input gate*, *forget gate*, and *output gate* control how we update the cell state at each step.

Again, don't worry too much about the specifics for now (and see Resources if you're interested)! We only need to change one line to use an LSTM instead.

In [0]:
lstm_model = keras.Sequential([
  keras.layers.LSTM(n_units), # Another magic one-line change!
  keras.layers.Dense(1)
])

In [0]:
lstm_model.compile(
  loss='mean_squared_error',
  optimizer='adam'
)

lstm_model.fit(
    X_train,
    y_train,
    epochs=n_epochs,
    batch_size=batch_size
)

evaluate(lstm_model)
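
As with the GRU, here is an optional NumPy sketch of a single LSTM step in its standard textbook form, showing how the input, forget, and output gates described above interact with the cell state. The weights are random, and `lstm_step` and the parameter names are assumptions for illustration rather than Keras internals.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step (illustrative sketch, not the Keras implementation)."""
    i = sigmoid(x_t @ p["W_i"] + h_prev @ p["U_i"] + p["b_i"])  # input gate
    f = sigmoid(x_t @ p["W_f"] + h_prev @ p["U_f"] + p["b_f"])  # forget gate
    o = sigmoid(x_t @ p["W_o"] + h_prev @ p["U_o"] + p["b_o"])  # output gate
    # Candidate values that could be written into the cell state.
    c_candidate = np.tanh(x_t @ p["W_c"] + h_prev @ p["U_c"] + p["b_c"])
    # Forget part of the old cell state, then add the gated candidate.
    c = f * c_prev + i * c_candidate
    # The hidden state is a gated view of the cell state.
    h = o * np.tanh(c)
    return h, c

rng = np.random.RandomState(0)  # hypothetical seed, for illustration only
n_features, n_units_toy = 1, 4
params = {}
for gate in ("i", "f", "o", "c"):
    params["W_" + gate] = rng.randn(n_features, n_units_toy) * 0.1
    params["U_" + gate] = rng.randn(n_units_toy, n_units_toy) * 0.1
    params["b_" + gate] = np.zeros(n_units_toy)

h = np.zeros(n_units_toy)
c = np.zeros(n_units_toy)
for x_t in rng.randn(5, n_features):
    h, c = lstm_step(x_t, h, c, params)

print(h.shape, c.shape)
```

The cell state `c` is updated only by element-wise scaling and addition, which is what lets information survive over long sequences.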

LSTM models usually take the longest to train (they need the most computations). They are also usually the most powerful, though they may need more epochs to converge.
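
One rough way to see the difference in computation is to count trainable parameters. A basic RNN layer has one set of input weights, recurrent weights, and biases; in the textbook formulations, a GRU has three such sets and an LSTM has four. The helper `recurrent_params` below is a hypothetical function written for this sketch, and the counts are a simplification. For example, Keras's default GRU configuration keeps an extra bias vector per set, so `model.summary()` may report a somewhat larger number for the GRU.

```python
def recurrent_params(units, features, n_sets):
    """Simplified parameter count for a recurrent layer.

    Each set has input weights (features x units), recurrent weights
    (units x units), and one bias per unit.
    """
    return n_sets * (units * (features + units) + units)

units, features = 25, 1  # the layer size and input features used in this colab

simple_rnn_count = recurrent_params(units, features, n_sets=1)
gru_count = recurrent_params(units, features, n_sets=3)
lstm_count = recurrent_params(units, features, n_sets=4)

print(simple_rnn_count, gru_count, lstm_count)
```

With 25 units and 1 input feature, the LSTM has four times as many recurrent-layer parameters as the basic RNN, which is consistent with its slower training.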

In practice, it's a good idea to start with an RNN and then upgrade to either a GRU or LSTM if you need a more complex model. For very long input sequences, LSTMs typically work better at capturing the longer-term dependencies in the sequence.

In [0]:
# Visualize LSTM predictions.

predicted_train = lstm_model.predict(X_train)
predicted_test = lstm_model.predict(X_test)

# Un-normalize predictions.
predicted_train = scaler.inverse_transform(predicted_train)
predicted_test = scaler.inverse_transform(predicted_test)

# The scaler expects a single column, so reshape before un-normalizing.
plt.plot(range(len(eod_prices)), scaler.inverse_transform(eod_prices.reshape(-1, 1))[:, 0])
plt.plot(range(n_steps, len(predicted_train) + n_steps), predicted_train)
plt.plot(range(len(predicted_train) + n_steps, len(eod_prices)), predicted_test)
plt.show()

# Resources

*   https://colah.github.io/posts/2015-08-Understanding-LSTMs/
*   https://towardsdatascience.com/understanding-gru-networks-2ef37df6c9be



# Exercises

## Exercise 1

Train a GRU or LSTM model on a dataset of your choice. Visualize its predictions and compare its performance to that of a basic RNN model.

### Student Solution

In [0]:
# Your answer goes here

### Answer Key

**Solution**

In [0]:
# Solutions will vary; make sure students use an appropriate dataset (i.e. sequential data),
# properly construct both an RNN and advanced RNN model,
# and appropriately visualize the data.