<h3>Multi-Dimensional Sequences - Stock Data</h3>

Let's return to the SPY data and try to predict the daily close price. Again, I will drop the adjusted close and volume.

Furthermore, the full 30-year dataset is fairly large, so I will use only the data from 2010 onward.

In [3]:
import pandas as pd
import numpy as np

In [4]:
spy = pd.read_csv('SPY.csv')
spy = spy.drop(['Adj Close', 'Volume'], axis=1)
spy['Date'] = pd.to_datetime(spy['Date'])
spy = spy[spy['Date'] >= '2010-01-01']
spy

Unnamed: 0,Date,Open,High,Low,Close
4264,2010-01-04,112.370003,113.389999,111.510002,113.330002
4265,2010-01-05,113.260002,113.680000,112.849998,113.629997
4266,2010-01-06,113.519997,113.989998,113.430000,113.709999
4267,2010-01-07,113.500000,114.330002,113.180000,114.190002
4268,2010-01-08,113.889999,114.620003,113.660004,114.570000
...,...,...,...,...,...
7326,2022-03-03,440.470001,441.109985,433.799988,435.709991
7327,2022-03-04,431.750000,433.369995,427.880005,432.170013
7328,2022-03-07,431.549988,432.299988,419.359985,419.429993
7329,2022-03-08,419.619995,427.209991,415.119995,416.250000


We redefine the two utility functions that process the sequential data.

In [5]:
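def gen_hist_data(data,window,target,timecol=None):
    # Build lagged copies (t-0 ... t-window) of every feature column,
    # plus the next-step value of `target` as the column "<target> (next)"
    if timecol is None:
        lag_data = pd.concat([data.shift(t).add_suffix(f" (t-{t})") for t in range(window+1)], axis=1)
    else:
        # Keep the time column once, unshifted, and lag everything else
        time_data = data[timecol]
        lag_data = pd.concat([time_data]+[data.drop(timecol,axis=1).shift(t).add_suffix(f" (t-{t})") 
                                          for t in range(window+1)], axis=1)
    lag_data = pd.concat([lag_data, data[[target]].shift(-1).add_suffix(" (next)")], axis=1)
    # Drop the first `window` rows (incomplete lags) and the last row (no next value)
    return lag_data.iloc[window:-1,:]

def split_seq_data(data,split,target,timecol=None):
    # Chronological train/test split; `target` is the label column
    if timecol is None:
        # No time column: treat `split` as a row position, applied identically to X and Y
        trainX = data.drop(target, axis=1).iloc[:split,:]
        testX = data.drop(target, axis=1).iloc[split:,:]
        trainY = data[target].iloc[:split]
        testY = data[target].iloc[split:]
    else:
        # With a time column: rows before `split` are for training, the rest for testing
        trainX = data.drop([target,timecol], axis=1).loc[data[timecol] < split, :]
        testX = data.drop([target,timecol], axis=1).loc[data[timecol] >= split, :]
        trainY = data[target][data[timecol] < split]
        testY = data[target][data[timecol] >= split]
    return trainX, testX, trainY, testY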
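To make the lagging step concrete, here is a minimal sketch of the same shift-and-concatenate idea on a made-up six-row frame with a single `Close` column. The frame `toy` and its values are purely illustrative and are not taken from the SPY data.

```python
import pandas as pd

# Toy frame standing in for the price data (values are made up).
toy = pd.DataFrame({'Close': [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]})
window = 2

# One shifted copy per lag: shift(t) moves every value t rows down,
# so row i of the shifted copy holds the value from row i - t.
lags = pd.concat(
    [toy.shift(t).add_suffix(f" (t-{t})") for t in range(window + 1)],
    axis=1,
)

# shift(-1) pulls the following row's value up, which gives the prediction target.
target = toy[['Close']].shift(-1).add_suffix(" (next)")

# Drop the first `window` rows (incomplete lags) and the last row (no target).
toy_lagged = pd.concat([lags, target], axis=1).iloc[window:-1, :]
print(toy_lagged)
```

Each remaining row now carries the current value, the two previous values, and the value to predict.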

In [18]:
window = 10
spy10 = gen_hist_data(spy, window, 'Close','Date')
spy10

Unnamed: 0,Date,Open (t-0),High (t-0),Low (t-0),Close (t-0),Open (t-1),High (t-1),Low (t-1),Close (t-1),Open (t-2),...,Close (t-8),Open (t-9),High (t-9),Low (t-9),Close (t-9),Open (t-10),High (t-10),Low (t-10),Close (t-10),Close (next)
4274,2010-01-19,113.620003,115.129997,113.589996,115.059998,114.730003,114.839996,113.199997,113.639999,114.489998,...,113.709999,113.260002,113.680000,112.849998,113.629997,112.370003,113.389999,111.510002,113.330002,113.889999
4275,2010-01-20,114.279999,114.449997,112.980003,113.889999,113.620003,115.129997,113.589996,115.059998,114.730003,...,114.190002,113.519997,113.989998,113.430000,113.709999,113.260002,113.680000,112.849998,113.629997,111.699997
4276,2010-01-21,113.919998,114.269997,111.559998,111.699997,114.279999,114.449997,112.980003,113.889999,113.620003,...,114.570000,113.500000,114.330002,113.180000,114.190002,113.519997,113.989998,113.430000,113.709999,109.209999
4277,2010-01-22,111.199997,111.739998,109.089996,109.209999,113.919998,114.269997,111.559998,111.699997,114.279999,...,114.730003,113.889999,114.620003,113.660004,114.570000,113.500000,114.330002,113.180000,114.190002,109.769997
4278,2010-01-25,110.209999,110.410004,109.410004,109.769997,111.199997,111.739998,109.089996,109.209999,113.919998,...,113.660004,115.080002,115.129997,114.239998,114.730003,113.889999,114.620003,113.660004,114.570000,109.309998
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7325,2022-03-02,432.369995,439.720001,431.570007,437.890015,435.040009,437.170013,427.109985,429.980011,432.029999,...,437.059998,443.929993,448.059998,441.940002,446.600006,443.730011,446.279999,443.179993,446.100006,435.709991
7326,2022-03-03,440.470001,441.109985,433.799988,435.709991,432.369995,439.720001,431.570007,437.890015,435.040009,...,434.230011,443.220001,446.570007,436.420013,437.059998,443.929993,448.059998,441.940002,446.600006,432.170013
7327,2022-03-04,431.750000,433.369995,427.880005,432.170013,440.470001,441.109985,433.799988,435.709991,432.369995,...,429.570007,437.329987,438.660004,431.820007,434.230011,443.220001,446.570007,436.420013,437.059998,419.429993
7328,2022-03-07,431.549988,432.299988,419.359985,419.429993,431.750000,433.369995,427.880005,432.170013,440.470001,...,421.950012,431.890015,435.500000,425.859985,429.570007,437.329987,438.660004,431.820007,434.230011,416.250000


Now split the data. I will use the data from 2021 onward for testing and everything before that for training.

In [7]:
trainX, testX, trainY, testY = split_seq_data(spy10, '2021-01-01', 'Close (next)', 'Date')

<h3> Modeling </h3>

With the data defined, we can apply regression models such as SVR, trees, and forests, just like before.
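A side note on cross-validation: the grid searches in this section pass `cv=5`, which splits the training rows into five contiguous blocks, so some folds validate on data that is older than the data they were trained on. For time-ordered rows, one alternative (an option, not what this notebook uses) is scikit-learn's `TimeSeriesSplit`, which always validates on rows that come after the training rows. A minimal sketch on a made-up array `X_demo`:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 time-ordered rows with 2 features each (placeholder values).
X_demo = np.arange(24).reshape(12, 2)

# Each fold trains on an expanding prefix and validates on the block after it.
tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X_demo))

for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

A splitter object like `tscv` can be passed as the `cv` argument of `GridSearchCV` in place of the integer.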

<h4>Support Vector Regressor</h4>

In [8]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

svr = SVR()

param_grid = [{
    'C': [0.01, 0.1, 1, 10, 100],
    'kernel' : ['rbf'],
    'gamma' : [0.01, 0.1, 1, 10, 100]
}]

grid_search = GridSearchCV(svr, param_grid, cv=5, scoring='r2', return_train_score=True)

grid_search.fit(trainX,trainY)

The best model found by the grid search (note that the score is now R2, since this is a regression task):

In [9]:
print(grid_search.best_params_)
print(grid_search.best_score_)

{'C': 1, 'gamma': 0.1, 'kernel': 'rbf'}
-25.442466841320883


And the test performance:

In [10]:
best_svr = grid_search.best_estimator_
best_svr.score(testX, testY)

-65.12180981316536
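The negative scores may be partly due to feature scale: the `gamma` values in the grid act on raw price differences, and an RBF kernel is sensitive to that. A common precaution, which is an assumption here and not something tested on the SPY data, is to standardize the features inside a pipeline so the scaler is refit on each training fold. The sketch below uses synthetic data; `X` and `y` are placeholders.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for price-like features (not SPY data).
rng = np.random.default_rng(0)
X = rng.normal(loc=100.0, scale=20.0, size=(200, 4))
y = X.mean(axis=1) + rng.normal(scale=1.0, size=200)

# The scaler is part of the estimator, so it is fit only on each training fold.
pipe = make_pipeline(StandardScaler(), SVR(kernel='rbf'))

# make_pipeline names each step after its lowercased class name ("svr").
param_grid = {'svr__C': [1, 10, 100], 'svr__gamma': [0.01, 0.1, 1]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring='r2')
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Scaling alone may not be enough for trending prices, because an RBF model cannot extrapolate far beyond the range it was trained on.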

<h4>Decision Tree Regressor</h4>

In [11]:
from sklearn.tree import DecisionTreeRegressor

dtr = DecisionTreeRegressor()

param_grid = [{
    'max_depth': [3,4,5,6],
    'max_features' : [4],
    'min_samples_split' : [2, 10, 20, 30, 40],
    'min_samples_leaf' : [1, 10, 20, 30, 40]
}]

grid_search = GridSearchCV(dtr, param_grid, cv=5, scoring='r2', return_train_score=True)

grid_search.fit(trainX,trainY)

In [12]:
print(grid_search.best_params_)
print(grid_search.best_score_)

{'max_depth': 6, 'max_features': 4, 'min_samples_leaf': 10, 'min_samples_split': 2}
0.1031421387296529


In [13]:
best_dt = grid_search.best_estimator_
best_dt.score(testX, testY)

-4.542055249564381

<h4>Random Forest Regressor</h4>

In [14]:
from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor()

param_grid = [{
    'n_estimators' : [5, 10, 20, 50],
    'max_depth': [3,4,5],
    'max_features' : [4],
    'min_samples_split' : [2, 10, 20, 30, 40],
    'min_samples_leaf' : [1, 10, 20, 30, 40]
}]

grid_search = GridSearchCV(rfr, param_grid, cv=5, scoring='r2', return_train_score=True)

grid_search.fit(trainX,trainY)

In [15]:
print(grid_search.best_params_)
print(grid_search.best_score_)

{'max_depth': 5, 'max_features': 4, 'min_samples_leaf': 10, 'min_samples_split': 40, 'n_estimators': 10}
0.09663286317349935


In [16]:
best_rf = grid_search.best_estimator_
best_rf.score(testX, testY)

-5.81642439828853
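A likely reason for the negative test scores of the tree-based models is that trees predict averages of training targets, so they cannot output a price above the highest one seen in training, while the test period sits above most of the training range. One possible workaround (a sketch under that assumption, not something evaluated in this notebook) is to predict the change from the current close and add it back. The example below compares both targets on a synthetic drifting series; all names and values are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Synthetic upward-drifting "price" series (not SPY data).
rng = np.random.default_rng(0)
price = 100 + np.cumsum(0.5 + rng.normal(scale=1.0, size=600))

# Lagged features: column t holds the price t steps back; target is the next price.
window = 3
X = np.column_stack(
    [price[window - t : len(price) - 1 - t] for t in range(window + 1)]
)
y_next = price[window + 1 :]

# Chronological split: the test rows come after (and mostly above) the training rows.
split = 450
X_train, X_test = X[:split], X[split:]
y_train, y_test = y_next[:split], y_next[split:]

# (a) Predict the price level directly.
level_rf = RandomForestRegressor(n_estimators=50, random_state=0)
level_rf.fit(X_train, y_train)
r2_level = r2_score(y_test, level_rf.predict(X_test))

# (b) Predict the change from the current price (column 0), then add it back.
diff_rf = RandomForestRegressor(n_estimators=50, random_state=0)
diff_rf.fit(X_train, y_train - X_train[:, 0])
pred_diff = X_test[:, 0] + diff_rf.predict(X_test)
r2_diff = r2_score(y_test, pred_diff)

print("R2 predicting levels: ", r2_level)
print("R2 predicting changes:", r2_diff)
```

On this synthetic series the change-based target avoids the ceiling imposed by the training range; whether it helps on real prices would need to be checked.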

<h4>Neural Network Regressor</h4>

In [19]:
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import GridSearchCV

n_features = window  # base hidden-layer width; the input itself has 4*(window+1) columns

param_grid = [{
    'hidden_layer_sizes' : [[n_features,n_features],                       #two hidden layers with n_features neurons each
                            [n_features,n_features,n_features],            #three hidden layers with n_features neurons each
                            [n_features//2,n_features//2],                 #two hidden layers with n_features/2 neurons each
                            [n_features//2,n_features//2,n_features//2],   #three hidden layers with n_features/2 neurons each
                            [n_features*2,n_features*2],                   #two hidden layers with n_features*2 neurons each
                            [n_features*2,n_features*2,n_features*2]],     #three hidden layers with n_features*2 neurons each
    'alpha' : [0.001, 0.01, 0.1, 1, 10]                                    #L2 regularization strength
}]

mlp = MLPRegressor(max_iter=2000)

grid_search = GridSearchCV(mlp, param_grid, cv=5, scoring='r2', return_train_score=True)

grid_search.fit(trainX,trainY)



Best training model:

In [20]:
print(grid_search.best_params_)
print(grid_search.best_score_)

{'alpha': 0.001, 'hidden_layer_sizes': [10, 10]}
0.9575451918785319


In [21]:
best_nn = grid_search.best_estimator_
best_nn.score(testX, testY)

0.9627193234346988

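In this case, R2 is not a very informative metric: prices trend strongly, so a model that does little more than follow the previous close can still score highly. We will also look at the root mean squared error (RMSE), which is in the same units as the price.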

In [25]:
testY_pred = best_nn.predict(testX)

from sklearn.metrics import mean_squared_error

np.sqrt(mean_squared_error(testY, testY_pred))

5.366956134199154
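An RMSE is easier to judge next to a naive baseline. A simple one is persistence, which predicts that tomorrow's close equals today's close. The sketch below computes its RMSE on a made-up series; with the frames above, the baseline prediction would presumably be the `Close (t-0)` column of the test features, but that is left as an assumption here.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up closing prices, for illustration only.
close = np.array([100.0, 101.0, 103.0, 102.0, 105.0])

y_true = close[1:]    # next-day close
y_naive = close[:-1]  # persistence forecast: today's close

# Errors are 1, 2, -1, 3, so the MSE is (1 + 4 + 1 + 9) / 4 = 3.75.
rmse_naive = np.sqrt(mean_squared_error(y_true, y_naive))
print(rmse_naive)
```

A model is only adding value if its RMSE is clearly below that of this baseline on the same test rows.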

We can also plot the predictions against the true values.

In [32]:
%matplotlib notebook

import matplotlib.pyplot as plt

date = spy.loc[spy['Date']>='2021-01-01','Date']
date = date[:-1]
plt.plot(date, testY)
plt.plot(date, testY_pred)
plt.gcf().autofmt_xdate()
plt.show()


<h3> Recurrent Neural Network </h3>

Similarly, we re-transform the data to carry a longer lag; the code below uses a window of 15. The training features also need to be reshaped into a 3-D array of shape (samples, time steps, features) for the RNN models.

In [59]:
window = 15

spy15 = gen_hist_data(spy, window, 'Close', 'Date')

trainX,testX,trainY,testY = split_seq_data(spy15,'2021-01-01','Close (next)','Date')
trainX = trainX.values.reshape(trainX.shape[0],window+1,-1)
testX = testX.values.reshape(testX.shape[0],window+1,-1)
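It is worth checking what this reshape produces. The lagged columns are laid out as all `t-0` features, then all `t-1` features, and so on, so after reshaping, time step 0 is the most recent day and the last time step is the oldest. Recurrent layers read the sequence from the first step to the last, and feeding the oldest day first is the more common convention. The sketch below shows the layout on a made-up row and one way to flip it; the flip is an option, not something the cell above does.

```python
import numpy as np

window = 2
n_feat = 4  # Open, High, Low, Close

# One flattened row laid out like the lagged frame: t-0 block, then t-1, then t-2.
# Each value encodes its origin as 10 * lag + feature index.
row = np.array([[10 * t + f for t in range(window + 1) for f in range(n_feat)]])

# Same reshape as in the cell above: (samples, time steps, features).
seq = row.reshape(row.shape[0], window + 1, -1)
print(seq[0])  # first time step is lag 0 (most recent day)

# Reverse the time axis so the oldest day comes first.
seq_chrono = seq[:, ::-1, :]
print(seq_chrono[0])
```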

In [60]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

In [70]:
model = keras.Sequential()

model.add(layers.Input([trainX.shape[1],trainX.shape[2]]))
model.add(layers.GRU(15, return_sequences=True))
model.add(layers.GRU(15, return_sequences=True))
model.add(layers.GRU(15))
model.add(layers.Dense(15, activation='relu'))
model.add(layers.Dense(1, activation='linear'))

model.compile(loss='mse', metrics=['mse'])

model.fit(trainX, trainY, epochs=500, batch_size=64, validation_split=0.2)

Epoch 1/500
...
Epoch 500/500


<keras.callbacks.History at 0x24737071370>

In [71]:
model.evaluate(trainX,trainY)



[4930.2314453125, 4930.2314453125]

In [72]:
model.evaluate(testX,testY)



[59450.64453125, 59450.64453125]

In [73]:
testY_pred = model.predict(testX)

from sklearn.metrics import r2_score

r2_score(testY, testY_pred)



-75.945692476451
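A training MSE of about 4930 corresponds to an RMSE of roughly 70 price units, so the network is underfitting even the training data. One plausible cause is that the inputs and targets are raw prices in the hundreds. A common remedy, which is an assumption here and was not tried above, is to standardize the features using statistics from the training set only. The sketch below shows how to do that for 3-D sequence arrays; `train_seq` and `test_seq` are synthetic placeholders.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins with shape (samples, time steps, features).
rng = np.random.default_rng(0)
train_seq = rng.normal(loc=200.0, scale=50.0, size=(100, 16, 4))
test_seq = rng.normal(loc=400.0, scale=20.0, size=(20, 16, 4))

n_feat = train_seq.shape[2]

# StandardScaler expects 2-D input, so flatten samples and time steps together
# and fit on the training data only to avoid leaking test statistics.
scaler = StandardScaler().fit(train_seq.reshape(-1, n_feat))

# Transform both sets with the training statistics, then restore the 3-D shape.
train_scaled = scaler.transform(train_seq.reshape(-1, n_feat)).reshape(train_seq.shape)
test_scaled = scaler.transform(test_seq.reshape(-1, n_feat)).reshape(test_seq.shape)

print(train_scaled.mean(axis=(0, 1)), train_scaled.std(axis=(0, 1)))
```

The target can be scaled the same way with a second scaler, in which case predictions have to be mapped back with `inverse_transform` before computing errors in price units.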