# Investigating Stock Prediction with RNN

A recurrent neural network (RNN) uses its internal state to process sequential inputs, which makes it a natural fit for time series data such as stock prices. In this blog post, we will investigate the popular long short-term memory (LSTM) method for stock prediction.
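
To make "internal state" concrete, here is a minimal sketch of a plain (non-gated) recurrent cell in NumPy. It is a simplification, not the LSTM used later, and the weights and input values are made up for illustration: the hidden state `h` is updated once per time step, so the final state depends on the whole sequence.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """Run a plain recurrent cell over a sequence and return the final hidden state."""
    h = np.zeros(W_h.shape[0])  # the internal state starts at zero
    for x_t in inputs:          # one update per time step
        h = np.tanh(W_x @ x_t + W_h @ h + b)
    return h

rng = np.random.default_rng(0)
hidden, features = 4, 1
W_x = rng.normal(size=(hidden, features))  # input-to-hidden weights (random, illustrative)
W_h = rng.normal(size=(hidden, hidden))    # hidden-to-hidden weights (random, illustrative)
b = np.zeros(hidden)

prices = np.array([[0.10], [0.20], [0.15]])  # made-up, already scaled values
state = rnn_forward(prices, W_x, W_h, b)
print(state.shape)  # (4,)
```

An LSTM replaces the single `tanh` update with gated updates, but the idea of carrying state from one step to the next is the same.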

## Data Import and Clean-up

In [None]:
#install yfinance in Colab
%pip install yfinance

In [65]:
#import the required libraries
import math
from datetime import datetime

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from plotly.io import write_html

import keras
import tensorflow as tf
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout
from keras.preprocessing.sequence import TimeseriesGenerator
from tensorflow.keras.optimizers import Adam
from sklearn.preprocessing import MinMaxScaler, StandardScaler

import pandas_datareader as web
from pandas_datareader import data as pdr
import yfinance as yfin

In [31]:
#yfinance api call to import data
yfin.pdr_override()
df = pdr.get_data_yahoo("^GSPC ^VIX", start="2002-01-01", end="2022-05-03")

[*********************100%***********************]  2 of 2 completed


In [32]:
#data cleanup
df["sp500"] = df["Adj Close"]["^GSPC"]
df["volume"] = df["Volume"]["^GSPC"]
df["vix"] = df["Adj Close"]["^VIX"]
df = df.reset_index()
df = df.drop(columns = ["Adj Close", "Volume"])
df = df.drop(columns = ["Close", "High", "Low", "Open"])
df.head()



| | Date | sp500 | volume | vix |
|---|---|---|---|---|
| 0 | 2002-01-02 | 1154.670044 | 1171000000.0 | 22.709999 |
| 1 | 2002-01-03 | 1165.27002 | 1398900000.0 | 21.34 |
| 2 | 2002-01-04 | 1172.51001 | 1513000000.0 | 20.450001 |
| 3 | 2002-01-07 | 1164.890015 | 1308300000.0 | 21.940001 |
| 4 | 2002-01-08 | 1160.709961 | 1258800000.0 | 21.83 |


In [61]:
#parse the dates and use them as the index
df['Date'] = pd.to_datetime(df['Date'])
#plain assignment works across pandas versions (set_axis no longer accepts inplace)
df.index = df['Date']

In [66]:
#visualize our data
trace1 = go.Scatter(
    x = df["Date"],
    y = df["sp500"],
    mode = 'lines',
    name = 'SP500'
)
layout = go.Layout(
    title = "S&P500 Index from 2002 to 2022",
    xaxis = {'title' : "Date"},
    yaxis = {'title' : "Close"}
)
fig = go.Figure(data=[trace1], layout=layout)
fig.show()
write_html(fig, "sp500.html")

In [None]:
#creates dataset for training and testing purposes
close_data = df['sp500'].values
close_data = close_data.reshape((-1,1))

split_percent = 0.80
split = int(split_percent*len(close_data))

#training and testing
close_train = close_data[:split]
close_test = close_data[split:]

date_train = df['Date'][:split]
date_test = df['Date'][split:]

In [None]:
#this will be the length for our RNN input
look_back = 20

#generate time series using this function
train_generator = TimeseriesGenerator(close_train, close_train, length=look_back, batch_size=20)     
test_generator = TimeseriesGenerator(close_test, close_test, length=look_back, batch_size=1)
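
To see what these generators feed the network, here is a small NumPy sketch of the same windowing idea. The helper name `sliding_windows` is hypothetical, and it assumes the generator's default stride and sampling rate: every sample is `look_back` consecutive values, and the target is the value that comes right after.

```python
import numpy as np

def sliding_windows(series, look_back):
    """Return (X, y): X[i] holds look_back consecutive values, y[i] is the value right after."""
    X = np.array([series[i:i + look_back] for i in range(len(series) - look_back)])
    y = np.array(series[look_back:])
    return X, y

toy = np.arange(10.0)  # stand-in for a price series
X, y = sliding_windows(toy, look_back=3)
print(X.shape, y.shape)  # (7, 3) (7,)
print(X[0], y[0])        # [0. 1. 2.] 3.0
```

Note that a series of length `n` yields only `n - look_back` samples, because the first `look_back` values have no complete window in front of them.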

## First Model with Many-to-One LSTM

Let's build our first model with LSTM. We will use the many-to-one approach: the model takes 20 consecutive data points as input and outputs a single data point, the next closing price. The model is built from LSTM layers and a dropout layer.

In [None]:
#make our first model
model = Sequential()
#LSTM layers are the backbone of our RNN
model.add(
    LSTM(50,
        return_sequences=True,
        activation='relu',
        input_shape=(look_back,1))
)
#only the first layer needs input_shape
model.add(
    LSTM(50,
        activation='relu')
)
#dropout layer to reduce overfitting
model.add(Dropout(0.2))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')

#fit our model to the training data
#(fit accepts generators directly; fit_generator is deprecated in TensorFlow 2.x)
model.fit(train_generator, epochs=25, verbose=1)


Now let's see how our model predicts.

In [None]:
#make prediction (predict accepts generators directly; predict_generator is deprecated in TensorFlow 2.x)
prediction = model.predict(test_generator)

#reshape our prediction
close_train = close_train.reshape((-1))
close_test = close_test.reshape((-1))
prediction = prediction.reshape((-1))

#plots the three segments of data points, the training data, the predicted trend, and the actual price
trace1 = go.Scatter(
    x = date_train,
    y = close_train,
    mode = 'lines',
    name = 'Training Data'
)
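#the generator yields no prediction for the first look_back test days,
#so the predictions line up with the dates starting at look_back
trace2 = go.Scatter(
    x = date_test[look_back:],
    y = prediction,
    mode = 'lines',
    name = 'Prediction'
)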
trace3 = go.Scatter(
    x = date_test,
    y = close_test,
    mode='lines',
    name = 'Actual Price'
)
layout = go.Layout(
    title = "SP500 Prediction",
    xaxis = {'title' : "Date"},
    yaxis = {'title' : "Close"}
)
fig = go.Figure(data=[trace1, trace2, trace3], layout=layout)
fig.show()
write_html(fig, "prediction1.html")



Our model looks extremely promising: it appears to have predicted every major turning point in the stock market. If this were real, we would all be billionaires. But is there a catch? It almost looks too good to be true. Let's see how it performs in the real world.

## Making Predictions with Our First Model

We will predict 60 days into the future, 30 days of which are known data (known to us, not to the model). Let's see how our model performs.

In [None]:
#reshape our data
close_data = close_data.reshape((-1))

def predict(num_prediction, model):
    """
    Takes in a model and a number of days to predict, outputs the last known value followed by the predicted values
    """
    prediction_list = close_data[-look_back:]
    for _ in range(num_prediction):
        #each prediction is appended and fed back in as input for the next step
        x = prediction_list[-look_back:]
        x = x.reshape((1, look_back, 1))
        out = model.predict(x)[0][0]
        prediction_list = np.append(prediction_list, out)
    #keep the last known value plus the predictions
    prediction_list = prediction_list[look_back-1:]
    return prediction_list

def predict_dates(num_prediction):
    """
    Takes in the number of predictions and outputs the dates for these predictions, starting from the last known date
    """
    last_date = df['Date'].values[-1]
    #note: date_range defaults to calendar days, so weekends are included
    prediction_dates = pd.date_range(last_date, periods=num_prediction+1).tolist()
    return prediction_dates

num_prediction = 60
forecast = predict(num_prediction, model)
forecast_dates = predict_dates(num_prediction)

In [None]:
#imports the actual price for these dates and cleans up
result = pdr.get_data_yahoo("^GSPC ^VIX", start="2022-05-02", end="2022-06-03")
result["sp500"] = result["Adj Close"]["^GSPC"]
result["volume"] = result["Volume"]["^GSPC"]
result["vix"] = result["Adj Close"]["^VIX"]
result = result.reset_index()
result = result.drop(columns = ["Adj Close", "Volume"])
result = result.drop(columns = ["Close", "High", "Low", "Open"])
result['Date'] = pd.to_datetime(result['Date'])
result.index = result['Date']
close_data = result['sp500'].values
close_data = close_data.reshape((-1,1))
actual_date = result['Date']
actual_close = close_data

[*********************100%***********************]  2 of 2 completed






In [None]:
#plots the real-world prediction
trace1 = go.Scatter(
    x = forecast_dates,
    y = forecast,
    mode = 'lines',
    name = 'Forecast'
)
trace2 = go.Scatter(
    x = result['Date'],
    y = result['sp500'].values,
    mode = 'lines',
    name = 'Actual Price'
)
layout = go.Layout(
    title = "SP500 Prediction",
    xaxis = {'title' : "Date"},
    yaxis = {'title' : "Close"}
)
fig = go.Figure(data=[trace1, trace2], layout=layout)
fig.show()

With just two months of prediction, we already see a significant deviation from the actual price trend. Our model predicts that the S&P 500 will keep falling for two months with no significant rebound, which is highly unlikely based on experience. The actual prices, in contrast, follow a much more plausible path.
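
One possible reason for the one-directional forecast is that the forecasting loop feeds each prediction back in as input, so any small, consistent bias in the one-step model compounds. The sketch below illustrates the mechanism with a hypothetical stand-in model (`biased_step`, which always guesses 0.1% below the last value) rather than our trained LSTM. The helper names and numbers are assumptions for illustration only.

```python
import numpy as np

def recursive_forecast(step_fn, history, look_back, n_steps):
    """Forecast n_steps ahead by feeding each prediction back in as input."""
    window = list(history[-look_back:])
    preds = []
    for _ in range(n_steps):
        nxt = step_fn(np.array(window[-look_back:]))
        preds.append(nxt)
        window.append(nxt)  # the model now sees its own output
    return np.array(preds)

def biased_step(window):
    """Hypothetical one-step model: last value with a small downward bias."""
    return window[-1] * 0.999

history = np.full(20, 4000.0)  # made-up flat price history
forecast = recursive_forecast(biased_step, history, look_back=20, n_steps=60)
print(forecast[0], forecast[-1])
```

A 0.1% error is invisible on a one-day-ahead chart, yet after 60 recursive steps it has grown into a decline of roughly 6%.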

## Second Model with Many-to-Many RNN

We saw significant drawbacks with our first model in the real world. It is striking how well it does on the test set and how poorly it does on truly unseen data. In this section, we will try to explain this discrepancy with our second model using the many-to-many method.

Instead of feeding our model with 20 data points and extracting one prediction, we will be feeding it with multiple data points and extracting a trend line into the future.

In [42]:
#Data import and cleanup
dataset_train = pd.read_csv('^NDX.csv')
#every column except Date is treated as a numeric feature
cols = list(dataset_train)[1:]
datelist_train = list(dataset_train["Date"])
datelist_train = [datetime.strptime(date, '%Y-%m-%d').date() for date in datelist_train]
#keep only the numeric columns so they can be scaled
training_set = dataset_train[cols].values

In [44]:
#feature scaling using StandardScaler
sc = StandardScaler()
training_set_scaled = sc.fit_transform(training_set)
#separate scaler for the target column, used later to invert the predictions
sc_predict = StandardScaler()
sc_predict.fit_transform(training_set[:, 0:1])

array([[-0.703329  ],
       [-0.69582021],
       [-0.6775205 ],
       ...,
       [ 2.4043269 ],
       [ 2.47115722],
       [ 2.476595  ]])
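
The second scaler exists because the model outputs values in scaled units of the first column, and we need that column's statistics alone to map them back to prices. Here is a NumPy sketch of the same arithmetic, using made-up numbers and assuming standardization with the per-column mean and population standard deviation:

```python
import numpy as np

# Made-up feature matrix: column 0 plays the role of the target price column.
data = np.array([[100.0, 1.0e6],
                 [110.0, 1.2e6],
                 [120.0, 0.9e6],
                 [130.0, 1.1e6]])

# Per-column mean and population standard deviation.
mean = data.mean(axis=0)
std = data.std(axis=0)
scaled = (data - mean) / std

# A model trained on `scaled` emits values in scaled units of column 0,
# so only column 0's statistics are needed to turn them back into prices.
scaled_pred = np.array([0.5, -1.0])  # hypothetical model outputs
price_pred = scaled_pred * std[0] + mean[0]
print(price_pred)
```

Applying the full multi-column scaler to a single-column prediction would fail on shape, which is why a dedicated one-column scaler is fitted.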

For consistency's sake, we will look into the future for 60 days just like before. But instead of looking back for 20 days, we will likely need more data points. Let's use 90 days of trading data to predict the next 60 days of prices.

In [None]:
#create our training data set
X_train = []
y_train = []

#predict 60 days with 90 days
n_future = 60   
n_past = 90     

for i in range(n_past, len(training_set_scaled) - n_future +1):
    X_train.append(training_set_scaled[i - n_past:i, 0:dataset_train.shape[1] - 1])
    y_train.append(training_set_scaled[i + n_future - 1:i + n_future, 0])

X_train, y_train = np.array(X_train), np.array(y_train)
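
The index arithmetic in this loop is easy to get wrong, so here is a small self-contained check on synthetic data. The helper `make_horizon_windows` is a hypothetical rewrite of the loop above; unlike the original it returns `y` as a flat vector instead of a column.

```python
import numpy as np

def make_horizon_windows(features, target, n_past, n_future):
    """Pair each n_past-row window with the target value n_future steps after the window ends."""
    X, y = [], []
    for i in range(n_past, len(features) - n_future + 1):
        X.append(features[i - n_past:i])     # rows i-n_past .. i-1
        y.append(target[i + n_future - 1])   # n_future steps after row i-1
    return np.array(X), np.array(y)

n_rows, n_features = 200, 5  # synthetic sizes, not the real dataset
features = np.arange(n_rows * n_features, dtype=float).reshape(n_rows, n_features)
target = features[:, 0]

X, y = make_horizon_windows(features, target, n_past=90, n_future=60)
print(X.shape, y.shape)  # (51, 90, 5) (51,)
```

With 200 rows, only 51 samples survive, because each one needs 90 rows behind it and 60 rows ahead of it.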

In [47]:
#Similar deal: a sequential model with 2 LSTM layers and one dropout layer
model = Sequential()
model.add(LSTM(units=64, 
               return_sequences=True, 
               input_shape=(n_past, dataset_train.shape[1]-1)))
model.add(LSTM(units=10, 
               return_sequences=False))
model.add(Dropout(0.20))
model.add(Dense(units=1, activation='linear'))

#compile the model
model.compile(optimizer = Adam(learning_rate=0.01), loss='mean_squared_error')

#fit the model
model.fit(X_train, y_train, epochs=20)

In [63]:
#future dates to attach to the forecast
datelist_future = pd.date_range(datelist_train[-1], periods=n_future, freq='1d').tolist()

#convert the timestamps to plain dates
datelist_future_ = []
for this_timestamp in datelist_future:
    datelist_future_.append(this_timestamp.date())

#makes the prediction
predictions_future = model.predict(X_train[-n_future:])
predictions_train = model.predict(X_train[n_past:])

In [None]:
#cleans up the data for visualization
y_pred_future = sc_predict.inverse_transform(predictions_future)
y_pred_train = sc_predict.inverse_transform(predictions_train)

PREDICTIONS_FUTURE = pd.DataFrame(y_pred_future, columns=['Open']).set_index(pd.Series(datelist_future))
PREDICTION_TRAIN = pd.DataFrame(y_pred_train, columns=['Open']).set_index(pd.Series(datelist_train[2 * n_past + n_future -1:]))
#convert the date objects to timestamps so the index can be sliced by date string
PREDICTION_TRAIN.index = pd.to_datetime(PREDICTION_TRAIN.index)
dataset_train = pd.DataFrame(dataset_train, columns=cols)
dataset_train.index = datelist_train
dataset_train.index = pd.to_datetime(dataset_train.index)

In [67]:
START_DATE_FOR_PLOTTING = '2012-05-01'

trace1 = go.Scatter(
    x = PREDICTIONS_FUTURE.index,
    y = PREDICTIONS_FUTURE['Open'],
    mode = 'lines',
    name = 'Predicted Stock Price'
)
trace2 = go.Scatter(
    x = PREDICTION_TRAIN.loc[START_DATE_FOR_PLOTTING:].index,
    y = PREDICTION_TRAIN.loc[START_DATE_FOR_PLOTTING:]['Open'],
    mode = 'lines',
    name = 'Training Predictions'
)
trace3 = go.Scatter(
    x = dataset_train.loc[START_DATE_FOR_PLOTTING:].index,
    y = dataset_train.loc[START_DATE_FOR_PLOTTING:]['Open'],
    mode = 'lines',
    name = 'Actual Stock Prices'
)
layout = go.Layout(
    title = "NQ100 Prediction",
    xaxis = {'title' : "Date"},
    yaxis = {'title' : "Open"}
)
fig = go.Figure(data=[trace1, trace2, trace3], layout=layout)
fig.show()
write_html(fig, "prediction3.html")

This looks nothing like before. Our model has completely failed to predict the upward rally beginning in March 2020. In fact, it has failed to predict any break from an established trend. It appears to have learned to stick to existing trends in order to minimize its training error. This model is useless in the real world.

But what causes the stark difference between our first and second models? In the many-to-one model tested against seen data, the model is only ever asked to predict the next day from 20 days of actual prices. The next data point is easy to guess because stock prices are relatively continuous, so the model can always output a number close to the last one. And because every test input consists of actual prices, errors never accumulate from one prediction to the next. But when the model has to forecast further ahead without seeing the prices in between, it fails. The second model is a clear demonstration of how our approach fails to predict any change in price trends.
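
The gap between next-day and long-horizon difficulty can be illustrated without any neural network. The sketch below uses a synthetic random walk (an assumption for illustration, not market data) and a naive "persistence" forecast that simply repeats the last known value. The function name `persistence_rmse` is hypothetical.

```python
import numpy as np

def persistence_rmse(series, horizon):
    """RMSE of predicting series[t + horizon] with series[t]."""
    errors = series[horizon:] - series[:-horizon]
    return float(np.sqrt(np.mean(errors ** 2)))

rng = np.random.default_rng(42)
walk = 4000 + np.cumsum(rng.normal(0, 1, size=5000))  # synthetic random walk

print(persistence_rmse(walk, 1))   # next-day error
print(persistence_rmse(walk, 60))  # 60-day-ahead error
```

Even this zero-parameter baseline looks accurate one day ahead and much worse 60 days ahead, so a next-day chart that hugs the actual price says little about real forecasting skill.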

## Conclusion

Our many-to-many model shows how limited the real-world usefulness of modelling stock prices with an RNN is. We remain skeptical of technical analysis doctrines based on price trend analysis.