# **Weekly Predictor Training**
## December 2020
### Ian Yu

----

## **Table of Content**

1. [Objective](#Objective)
2. [Setup](#Setup)
3. [Preprocessing](#Preprocessing)
4. [Model Building](#Model-Building)
5. [Model Training](#Model-Training)
6. [Model Summary](#Model-Summary)
7. [Next Step](#Next-Step)

----

## **Objective**

The purpose of this notebook is to take the parameters found in the previous notebook, '3-Hypertuning with 5-Day Engineered Dataframe', and train the model with them. As before, we split model training across 3 notebooks so that the three training runs can proceed simultaneously.

[Back to Top](#Table-of-Content)

## **Setup**

As discussed in the previous notebook, we will be training a Bidirectional Long Short-Term Memory Recurrent Neural Network. Unlike the previous notebook, however, hypertuning is already done, so we will train on the entire train set, which is all dates before 2019-08-18, and keep the model with the lowest loss value.

In [1]:
# Importing boto3 to call object from AWS S3 personal bucket
import boto3
s3 = boto3.resource('s3')
s3.Object('capstone-ianyu', '2-5dengineered_df.csv').download_file('data/2-5dengineered_df.csv')

In [2]:
# Importing basic packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Read the clean dataset
df = pd.read_csv("data/2-5dengineered_df.csv", index_col = 0)

# Ensure the index is a DatetimeIndex
df.index = pd.to_datetime(df.index)

[Back to Top](#Table-of-Content)

## **Preprocessing**

We split the data into a train set and a test set on the basis that we will make predictions continuously for 250 days, so the last 250 rows form the test set. After splitting, we define the target, `SPX Today`, and the features for both sets. We also use MinMaxScaler, a common choice of scaler for time series problems, to scale the features. MinMaxScaler rescales each feature to the range [0, 1] based on the minimum and maximum seen in the train set, so that all features sit on a comparable scale. The scaler is fit on the train set only and then applied to the test set.



In [3]:
# Train-test split, continuously predicting for 250 days
train = df[:-250]
test = df[-250:]

In [4]:
# Setting features and target for train set
X_train = train.drop('SPX Today', axis = 1)
y_train = train['SPX Today']

# Setting features and target for test set
X_test = test.drop('SPX Today', axis = 1)
y_test = test['SPX Today']

In [5]:
# Fitting MinMax Scaler onto the train set
from sklearn.preprocessing import MinMaxScaler
mmscaler = MinMaxScaler()

# Converting to Dataframe to keep the Timestamps
X_train_mms = pd.DataFrame(data = mmscaler.fit_transform(X_train), columns = X_train.columns, index = train.index)
X_test_mms = pd.DataFrame(data = mmscaler.transform(X_test), columns = X_test.columns, index = test.index)
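
To make the scaling step concrete, here is a minimal sketch of the arithmetic behind min-max scaling, written in plain NumPy rather than with scikit-learn. The toy arrays and the helper name `min_max_scale` are hypothetical and only serve the illustration. It also shows a consequence of fitting on the train set only: test values can fall outside [0, 1].

```python
import numpy as np

# Hypothetical toy data: 4 training rows and 2 test rows, 2 features each
X_train_toy = np.array([[10.0, 1.0],
                        [20.0, 2.0],
                        [30.0, 3.0],
                        [40.0, 5.0]])
X_test_toy = np.array([[50.0, 4.0],
                       [25.0, 0.0]])

# The minimum and maximum of each column come from the train set only
col_min = X_train_toy.min(axis=0)
col_max = X_train_toy.max(axis=0)


def min_max_scale(X, col_min, col_max):
    """Rescale each column with (x - min) / (max - min)."""
    return (X - col_min) / (col_max - col_min)


train_scaled = min_max_scale(X_train_toy, col_min, col_max)
test_scaled = min_max_scale(X_test_toy, col_min, col_max)

# Train columns span exactly [0, 1]; test values may land outside that range
print(train_scaled.min(axis=0), train_scaled.max(axis=0))
print(test_scaled)
```

Because the index generally trends upwards over time, test-period features above the train maximum are to be expected, which is worth keeping in mind when reading the test results.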

In [6]:
# Reshaping features and target into 3D arrays (samples, sequence length, features) so they fit our model
Xtrain = np.array(X_train_mms).reshape(np.array(X_train_mms).shape[0],-1,X_train_mms.shape[1])
ytrain = np.array(y_train).reshape(np.array(y_train).shape[0],-1,1)
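
The effect of this reshape is easiest to see on a small example. The sketch below applies the same pattern to a hypothetical 5-row, 3-column frame (the names `X_toy` and `y_toy` are made up for the illustration) and prints the resulting shapes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 5 rows (days), 3 feature columns
X_toy = pd.DataFrame(np.arange(15, dtype=float).reshape(5, 3),
                     columns=["a", "b", "c"])
y_toy = pd.Series(np.arange(5, dtype=float))

# Same reshape pattern as the cell above: the middle axis is inferred with -1
X3d = X_toy.to_numpy().reshape(X_toy.shape[0], -1, X_toy.shape[1])
y3d = y_toy.to_numpy().reshape(y_toy.shape[0], -1, 1)

print(X3d.shape)  # (5, 1, 3): 5 samples, sequence length 1, 3 features
print(y3d.shape)  # (5, 1, 1): one target value per sample and time step
```

Each row of the original frame becomes its own one-step sequence, which matches a sequence length of 1.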

[Back to Top](#Table-of-Content)

## **Model Building**

The architecture that we will be building is made up of four elements. It will be a single layer of Bidirectional Long Short-Term Memory Recurrent Neural Network followed by a Time Distributed output layer.

**Recurrent Neural Network** is a type of neural network that learns from sequences by carrying information from earlier steps forward to the current one. It takes a 3D array as input, shaped as (batch size, sequence, features). Batch size is the number of data points passed into the network at once: with a batch size of 1024, we pass 1024 trading days into the network at a time. We leave it to the hypertuner to determine the optimal batch size. Sequence length is about past context: with a sequence length of 10, the model learns how the previous 10 steps affect the current input. We also leave it to the hypertuner to determine the best sequence length. The features dimension is simply the number of features in our dataset.
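
To illustrate the (batch size, sequence, features) layout for sequence lengths greater than 1, here is a minimal sketch that builds overlapping windows from a 2D feature array. The helper `make_windows` and the toy data are hypothetical and are not part of this notebook's pipeline, which uses a sequence length of 1.

```python
import numpy as np


def make_windows(features, seq_len):
    """Stack overlapping windows of `seq_len` consecutive rows.

    Input shape:  (n_rows, n_features)
    Output shape: (n_rows - seq_len + 1, seq_len, n_features)
    """
    n_rows = features.shape[0]
    n_windows = n_rows - seq_len + 1
    return np.stack([features[i:i + seq_len] for i in range(n_windows)])


# Hypothetical toy data: 6 trading days, 2 features
toy = np.arange(12, dtype=float).reshape(6, 2)

windows = make_windows(toy, seq_len=3)
print(windows.shape)  # (4, 3, 2)

# With seq_len=1, windowing reduces to a plain reshape
single_step = make_windows(toy, seq_len=1)
print(single_step.shape)  # (6, 1, 2)
```

With a sequence length of 1, each sample holds a single day, so the windowed array is the same as reshaping the 2D array to add a middle axis of size 1.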

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b5/Recurrent_neural_network_unfold.svg/2880px-Recurrent_neural_network_unfold.svg.png" height=800 width=800>

*image from [Wikipedia](https://en.wikipedia.org/wiki/Recurrent_neural_network#/media/File:Recurrent_neural_network_unfold.svg)*

**Long Short-Term Memory** is a special type of RNN whose gated memory cell lets it retain information over long sequences. In effect, this helps the model pick up seasonality, trends, and other potential long-term patterns.

**Bidirectional** is applied to the LSTM RNN for the purpose of learning the future context as well. Not only does the past affect the stock market today, but the anticipation of tomorrow's environment also affects today's market, so we need to understand how that anticipation affects today's price. The bidirectional wrapper trains a second LSTM in the same training session that reads the sequence in the opposite direction, so that at each step the network draws on context from both earlier and later positions in the sequence.

**Time Distributed Layer** is a wrapper that applies the same output layer (here a single Dense unit) to every time step of the sequence independently, so the model produces one prediction per time step and the timestamps stay aligned. It relies on the LSTM returning the full sequence (`return_sequences=True`) rather than only its final output.
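
The idea of applying one dense layer at every time step can be sketched in plain NumPy. This is an illustration of the arithmetic only, not Keras code: the array sizes, the random weights `W`, and the bias `b` are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 samples, 3 time steps, 8 hidden values per step
batch, steps, hidden = 4, 3, 8
rnn_output = rng.normal(size=(batch, steps, hidden))

# One shared set of dense weights that maps 8 hidden values to 1 output
W = rng.normal(size=(hidden, 1))
b = np.zeros(1)

# Apply the same weights separately at each time step, then restack
per_step = np.stack(
    [rnn_output[:, t, :] @ W + b for t in range(steps)], axis=1
)

# The same computation written as one broadcasted matrix product
vectorised = rnn_output @ W + b

print(per_step.shape)                     # (4, 3, 1)
print(np.allclose(per_step, vectorised))  # True
```

The output keeps the time axis, so each time step in the input has exactly one corresponding prediction.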

*Note 1*: The inclusion of both Bidirectional element and the Time Distributed Layer was inspired by [Solving Sequence Problems with LSTM in Keras: Part 2](https://stackabuse.com/solving-sequence-problems-with-lstm-in-keras-part-2/). 

*Note 2*: This architecture is the final result of much trial and error, monitoring the loss and the train/validation metrics. Experiments included much more complicated structures, such as adding the RepeatVector layer mentioned in the tutorial from *Note 1*. We did not find that more complex layers performed any better than the current architecture. Part of the decision was also influenced by [Optimizing LSTM for time series prediction in Indian stock market](https://www.sciencedirect.com/science/article/pii/S1877050920307237), which finds that more LSTM layers do not increase performance but do stabilize the model.

In a nutshell, our model architecture looks like this:

<img src="images/ModelArchitecture.png" height=800 width=800>

In [7]:
## Importing tensorflow and keras packages needed
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import Sequential
from tensorflow.keras import layers
from tensorflow.keras.layers import Dense, Dropout, Activation, Bidirectional,TimeDistributed
from tensorflow.keras.layers import LSTM
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, LearningRateScheduler
from tensorflow.keras.metrics import RootMeanSquaredError

We will be building the model based on the configuration from the previous notebook, which would be:

|Parameters|Configuration|
|:---|:---|
|**LSTM Units**|1024|
|**Sequence Length**|1|
|**Initial Learning Rate**|0.01|
|**Decay Steps**|1,000|
|**Clipnorm**|0.1|

We will also be using Mean Squared Error as our loss function and Root Mean Squared Error as our human-readable metric, since it is expressed in the same units as the target.
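
As a quick check of how the two relate, the sketch below takes the loss values of the first three epochs from the training log in the Model Training section and recomputes the Root Mean Squared Error as the square root of the loss. The variable names are made up for the illustration.

```python
import math

# (loss, reported RMSE) pairs copied from epochs 1-3 of the training log
logged = [
    (65724.8672, 256.3686),
    (2217.1577, 47.0867),
    (1806.7921, 42.5064),
]

for mse, reported_rmse in logged:
    recomputed = math.sqrt(mse)
    # The recomputed value should agree with the log up to rounding
    print(f"MSE {mse:>11.4f} -> RMSE {recomputed:.4f} (log: {reported_rmse})")
```

An RMSE of about 42 can therefore be read directly as a typical error of roughly 42 index points on the train set.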

In [8]:
## Learning rate schedule code is from Tensorflow documentation:
## https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/schedules/ExponentialDecay
lr_schedule = keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01,
    decay_steps=1000,
    decay_rate=0.9, staircase = True)

## Modeling with sequential
model = keras.Sequential()

## Building Bidirectional LSTM RNN
# Configuration from previous notebook
model.add(Bidirectional(LSTM(units = 1024, activation='relu',
                             input_shape = (1,X_train_mms.shape[1]),
                             return_sequences=True)))

# Time Distributed layer to keep the model time series
model.add(TimeDistributed(Dense(1)))

## Compiling model with Adam optimizer
# Metrics are passed as a list; the metric is logged as 'root_mean_squared_error'
model.compile(optimizer = keras.optimizers.Adam(clipnorm=0.1,learning_rate = lr_schedule),
              loss = 'mse', metrics = [RootMeanSquaredError()])
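
To see what the staircase schedule does over this run, here is a minimal sketch of the decay formula in plain Python, using the configuration above and the 271 batches per epoch shown in the training log. The function name `staircase_exponential_decay` is made up for the illustration; the schedule itself is provided by Keras.

```python
def staircase_exponential_decay(step, initial_lr=0.01,
                                decay_steps=1000, decay_rate=0.9):
    """Learning rate after `step` optimizer steps with staircase decay.

    With staircase=True the exponent is an integer division, so the
    rate drops in discrete jumps every `decay_steps` steps.
    """
    return initial_lr * decay_rate ** (step // decay_steps)


steps_per_epoch = 271  # batches per epoch, taken from the training log

for epoch in (1, 4, 8, 11):
    step = epoch * steps_per_epoch
    lr = staircase_exponential_decay(step)
    print(f"after epoch {epoch:>2} (step {step:>4}): lr = {lr:.4f}")
```

Under these assumptions the learning rate only steps down twice within 11 epochs, from 0.01 to 0.009 and then to 0.0081.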

[Back to Top](#Table-of-Content)

## **Model Training**

Now that we have built the architecture of the model, we can start training. However, we cannot train blindly, especially as we have no validation dataset to check whether we are overfitting. Therefore, we define an Early Stopping mechanism to limit overfitting. We monitor the training loss, and if it has not improved on its best value for five consecutive epochs, training stops automatically.

We will also use a Model Checkpoint. Usually the purpose of a model checkpoint is to find the checkpoint with the best validation metric, but here we use it to save the best-performing model, judged by training Root Mean Squared Error, as '5 Day Lag.h5'.
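
The patience rule can be sketched in a few lines of plain Python. This is an illustration of the logic only, not the Keras implementation: the function `run_with_early_stopping` and the loss values are hypothetical.

```python
def run_with_early_stopping(losses, patience=5):
    """Return (epoch training stopped at, best epoch, best loss).

    Training stops once `patience` consecutive epochs fail to improve
    on the best loss seen so far.
    """
    best = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            # New best: remember it (this is what a checkpoint would save)
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch, best
    return len(losses), best_epoch, best


# Hypothetical per-epoch losses; epoch 3 is the best before the stall
toy_losses = [10.0, 6.0, 5.0, 5.2, 5.1, 5.3, 5.4, 5.05, 4.0]
stopped_at, best_epoch, best_loss = run_with_early_stopping(toy_losses)
print(stopped_at, best_epoch, best_loss)  # 8 3 5.0
```

Note that an epoch counts as an improvement only if it beats the best value so far, not merely the previous epoch, and that the last value in the toy list is never reached because training has already stopped.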

In [9]:
# Define early stopping on the training loss; stops after five consecutive epochs without improvement
es = EarlyStopping(monitor='loss', patience = 5, verbose=0, mode='auto')

# Define model checkpoint to save the best full model (save_weights_only=False) for later use
mc = ModelCheckpoint("/Users/ianyu/capstone/4-models/5 Day Lag.h5",
                     monitor="root_mean_squared_error",
                     verbose=2,save_best_only=True,
                     save_weights_only=False,mode="auto",save_freq="epoch")

# Fit the model for up to 100 epochs (early stopping may end training sooner)
history = model.fit(Xtrain, ytrain, epochs = 100, verbose = 2, callbacks = [es,mc])

Epoch 1/100

Epoch 00001: root_mean_squared_error improved from inf to 256.36862, saving model to /Users/ianyu/capstone/4-models/5 Day Lag.h5
271/271 - 17s - loss: 65724.8672 - root_mean_squared_error: 256.3686
Epoch 2/100

Epoch 00002: root_mean_squared_error improved from 256.36862 to 47.08670, saving model to /Users/ianyu/capstone/4-models/5 Day Lag.h5
271/271 - 17s - loss: 2217.1577 - root_mean_squared_error: 47.0867
Epoch 3/100

Epoch 00003: root_mean_squared_error improved from 47.08670 to 42.50638, saving model to /Users/ianyu/capstone/4-models/5 Day Lag.h5
271/271 - 17s - loss: 1806.7921 - root_mean_squared_error: 42.5064
Epoch 4/100

Epoch 00004: root_mean_squared_error did not improve from 42.50638
271/271 - 17s - loss: 1841.8187 - root_mean_squared_error: 42.9164
Epoch 5/100

Epoch 00005: root_mean_squared_error did not improve from 42.50638
271/271 - 17s - loss: 1825.7399 - root_mean_squared_error: 42.7287
Epoch 6/100

Epoch 00006: root_mean_squared_error improved from 42.5

[Back to Top](#Table-of-Content)

## **Model Summary**

Lastly, we can take a look at the summary of our model.

In [11]:
# load models
model5d = keras.models.load_model('4-models/5 Day Lag.h5')

model5d.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bidirectional (Bidirectional multiple                  9715712   
_________________________________________________________________
time_distributed (TimeDistri multiple                  2049      
Total params: 9,717,761
Trainable params: 9,717,761
Non-trainable params: 0
_________________________________________________________________


Our final model has close to 10 million parameters. Training ended after 11 epochs, halted by our early stopping mechanism. In the end, the weights with a Root Mean Squared Error of 37.80 were saved.
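
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard LSTM parameter formula, 4 × units × (inputs + units + 1) per direction. The feature count of 161 is not stated anywhere in this notebook; it is inferred here as the value that makes the formula match the summary, so treat it as an assumption.

```python
units = 1024
n_features = 161  # assumption: inferred from the summary, not stated in the notebook

# Each LSTM direction has 4 gates, each with input weights,
# recurrent weights, and a bias per unit
lstm_one_direction = 4 * units * (n_features + units + 1)
bidirectional = 2 * lstm_one_direction

# The Dense(1) output layer sees the concatenated forward and backward
# outputs (2 * units values) plus a single bias
dense = 2 * units * 1 + 1

total = bidirectional + dense
print(bidirectional)  # 9715712
print(dense)          # 2049
print(total)          # 9717761
```

Almost all of the parameters sit in the recurrent weights of the two 1024-unit LSTMs; the output layer contributes only 2,049.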

[Back to Top](#Table-of-Content)

## **Next Step**

In this notebook, we preprocessed the '5-Day Lag' dataset by scaling and reshaping it to fit our Bidirectional Long Short-Term Memory Recurrent Neural Network. We trained the model as a Weekly Predictor and, to limit overfitting, used early stopping. In the end, our model has 9,717,761 parameters and achieved a Root Mean Squared Error of 37.80 on the train set.

At this point, however, we do not know how well our model will actually perform on the test set, which is 2019-08-18 and beyond. We will evaluate this model together with the other two models in a single notebook.

[Back to Top](#Table-of-Content)