<a href="https://colab.research.google.com/github/kappandrew2/DataPreProcessing/blob/main/MarketResearch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Model Purpose

Utilize historical value and time attributes to predict the next day's gain or loss value

!Dataset Notes: The dataset for this data solution must come from the following web site and contain a large historical sample of data. For example:

Begin Date = 12/01/2007 (Be mindful that the last 35 periods (in this case, days) will get chopped off of the bottom of the dataset during data preprocessing)

End Date = Today's current value (to be run an hour before market close)

Ticker = SPY

Train Set = all data except last 60 periods (rows)

Prediction Set = all data from -90 periods (days) to current

https://www.wsj.com/market-data/quotes/index/SPX/historical-prices


In [681]:
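import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
#The datetime module import is dropped because the next line shadows the name
from datetime import date, datetime, timedelta
#Timestamp is imported from the public pandas namespace rather than a private module
from pandas import Timestamp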

#Connect to drive and import data set

Using google drive

Importing historical prices for ticker "SPY"

In [682]:
#Create CSV from data export
#https://www.wsj.com/market-data/quotes/index/SPX/historical-prices

from google.colab import drive
drive.mount('/content/drive', force_remount=True)

dataset = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/HistoricalPricesSPY.csv')

print(dataset)

Mounted at /content/drive
          Date     Open      High     Low   Close     Volume
0     07/29/21  439.815  441.8000  439.81  440.65   46716900
1     07/28/21  439.680  440.3000  437.31  438.83   52472359
2     07/27/21  439.910  439.9400  435.99  439.01   67397133
3     07/26/21  439.310  441.0300  439.26  441.02   43719191
4     07/23/21  437.520  440.3000  436.79  439.94   63766641
...        ...      ...       ...     ...     ...        ...
3433  12/07/07  151.420  151.5000  150.55  150.91  148951391
3434  12/06/07  148.630  151.2100  148.57  150.94  154487203
3435  12/05/07  147.930  149.2000  147.83  148.81  170813406
3436  12/04/07  146.660  147.5409  146.31  146.36  136528609
3437  12/03/07  148.190  148.4500  147.29  147.68  145852797

[3438 rows x 6 columns]


#Modify dataset content and headers

Remove contents not required for this exercise

Renaming columns to remove leading white space

Narrowing the dataset can be done via drop or select, both options are available (comment out the one not in use)

In [683]:
dataset.rename({' Close': 'Close'}, axis=1, inplace = True)
dataset = dataset[['Close', 'Date']]
#dataset = dataset.drop([' Open', ' High', ' Low', ' Volume'], axis = 1)

#Dataset information validation

Validate the data frame, column contents and data types

In [684]:
dataset['Date'] = pd.to_datetime(dataset['Date'])

dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3438 entries, 0 to 3437
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   Close   3438 non-null   float64       
 1   Date    3438 non-null   datetime64[ns]
dtypes: datetime64[ns](1), float64(1)
memory usage: 53.8 KB


#Change index to date (for troubleshooting)

Moving the date to the index helps to visually validate that processes are working correctly

!Note: This should be "off" except when troubleshooting

In [685]:
#dataset['Date_Index'] = dataset['Date']
#dataset.set_index('Date_Index', inplace=True)

dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3438 entries, 0 to 3437
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   Close   3438 non-null   float64       
 1   Date    3438 non-null   datetime64[ns]
dtypes: datetime64[ns](1), float64(1)
memory usage: 53.8 KB


#Create time attributes

Time attributes convert the date from a continuous variable into discrete (numeric categorical) values
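As a minimal sketch of the idea, the attributes can be derived on a small hypothetical sample (the `sample` frame and its three dates are assumptions for illustration). It uses `dt.isocalendar().week` rather than the deprecated `dt.week` accessor:

```python
import pandas as pd

# Hypothetical three-row sample; the notebook itself uses the full price history.
sample = pd.DataFrame(
    {'Date': pd.to_datetime(['2021-07-29', '2021-07-26', '2007-12-03'])}
)

sample['DOW'] = sample['Date'].dt.dayofweek   # Monday = 0 ... Sunday = 6
sample['DOY'] = sample['Date'].dt.dayofyear
# isocalendar() returns year/week/day columns; cast the week to a plain int64
sample['Week'] = sample['Date'].dt.isocalendar().week.astype('int64')
sample['Month'] = sample['Date'].dt.month
sample['Quarter'] = sample['Date'].dt.quarter
print(sample)
```

Each derived column is an integer code, so a tree-based model can split on it directly.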

In [686]:
dataset['DOW'] = dataset['Date'].dt.dayofweek
dataset['DOY'] = dataset['Date'].dt.dayofyear
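#dt.week is deprecated in pandas; isocalendar().week returns the same ISO week number
dataset['Week'] = dataset['Date'].dt.isocalendar().week.astype('int64')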
dataset['Month'] = dataset['Date'].dt.month
dataset['Quarter'] = dataset['Date'].dt.quarter

dataset.dtypes



Close             float64
Date       datetime64[ns]
DOW                 int64
DOY                 int64
Week                int64
Month               int64
Quarter             int64
dtype: object

#Create skip-day gain loss values (and dependent variable #1)

!Note: gain_loss-0 ultimately serves as both the dependent variable and an independent variable (a new column is created later and shifted down one row)

1) Calculate the first day's gain loss by subtracting day -1 from day 0

2) Calculate the second day's gain loss by subtracting day -2 from day 0

3) Calculate the nth day's gain loss by subtracting day -n from day 0

!Note: This should be turned into a loop using i=n where n = the rows to be processed (how many previous rows)
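One way the loop mentioned in the note could look is sketched below. It is an illustration only: `prices` and `n_lookback` are hypothetical names, and the closing prices are the first five rows of the dataset printed earlier (newest first).

```python
import pandas as pd

# Closing prices from the first rows of the printed dataset, newest first.
prices = pd.DataFrame({'Close': [440.65, 438.83, 439.01, 441.02, 439.94]})

n_lookback = 3  # hypothetical; the notebook builds 5 columns by hand
for k in range(n_lookback):
    # diff(-(k + 1)) subtracts the close k + 1 rows below, i.e. k + 1 days earlier
    prices[f'gain_loss-{k}'] = prices['Close'].diff(-(k + 1))
print(prices)
```

Because the rows are sorted newest first, a negative period in `diff` compares each close with an older one, and the oldest rows are left as NaN.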




In [687]:
dataset['gain_loss-0'] = dataset['Close'].diff(-1)
dataset['gain_loss-1'] = dataset['Close'].diff(-2)
#dataset['gain_loss-1'] = dataset['gain_loss-1'].shift(periods=-1, fill_value=0) #Removed these to experiment 
#with switching around the dependant variable rather than the independant variable
dataset['gain_loss-2'] = dataset['Close'].diff(-3) 
#dataset['gain_loss-2'] = dataset['gain_loss-2'].shift(periods=-1, fill_value=0)
dataset['gain_loss-3'] = dataset['Close'].diff(-4) 
#dataset['gain_loss-3'] = dataset['gain_loss-3'].shift(periods=-1, fill_value=0)
dataset['gain_loss-4'] = dataset['Close'].diff(-5) 
#dataset['gain_loss-4'] = dataset['gain_loss-4'].shift(periods=-1, fill_value=0)

print(dataset)
#dataset.dtypes

       Close       Date  DOW  ...  gain_loss-2  gain_loss-3  gain_loss-4
0     440.65 2021-07-29    3  ...        -0.37         0.71         5.19
1     438.83 2021-07-28    2  ...        -1.11         3.37         4.28
2     439.01 2021-07-27    1  ...         3.55         4.46         7.95
3     441.02 2021-07-26    0  ...         6.47         9.96        16.05
4     439.94 2021-07-23    4  ...         8.88        14.97         8.60
...      ...        ...  ...  ...          ...          ...          ...
3433  150.91 2007-12-07    4  ...         4.55         3.23          NaN
3434  150.94 2007-12-06    3  ...         3.26          NaN          NaN
3435  148.81 2007-12-05    2  ...          NaN          NaN          NaN
3436  146.36 2007-12-04    1  ...          NaN          NaN          NaN
3437  147.68 2007-12-03    0  ...          NaN          NaN          NaN

[3438 rows x 12 columns]


#Create binary version of skip-day gain loss values (and dependent variable #2)

!Note: gain_loss-0b ultimately serves as both the dependent variable and an independent variable (a new column is created later and shifted down one row)

This process converts each continuous gain loss variable into a binary discrete (dichotomous) variable: 1 for a gain, 0 otherwise

!Note - This process should be folded into the previous process when that process is converted into a loop

In [688]:
dataset['gain_loss-0b'] = np.where(dataset['gain_loss-0'] > 0, 1, 0)
dataset['gain_loss-1b'] = np.where(dataset['gain_loss-1'] > 0, 1, 0)
dataset['gain_loss-2b'] = np.where(dataset['gain_loss-2'] > 0, 1, 0)
dataset['gain_loss-3b'] = np.where(dataset['gain_loss-3'] > 0, 1, 0)
dataset['gain_loss-4b'] = np.where(dataset['gain_loss-4'] > 0, 1, 0)

dataset.dtypes

Close                  float64
Date            datetime64[ns]
DOW                      int64
DOY                      int64
Week                     int64
Month                    int64
Quarter                  int64
gain_loss-0            float64
gain_loss-1            float64
gain_loss-2            float64
gain_loss-3            float64
gain_loss-4            float64
gain_loss-0b             int64
gain_loss-1b             int64
gain_loss-2b             int64
gain_loss-3b             int64
gain_loss-4b             int64
dtype: object

#Aggregate the binary skip-day gain loss values

This creates a true categorical value from the binary discrete values.

The theory is that the per-period binary values (a sparse matrix) and their aggregate (a categorical value) work together to make the data more informative

!Note = This process should be included in the loop mentioned in the notes for the process above (future modifications to the data pre-processing procedures)
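The binary conversion and the aggregate could be produced in one pass, roughly as sketched here. The `gains` frame and its values are hypothetical; the NaN entry mimics the missing values at the tail of the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical gain/loss values, including a NaN like those at the tail of the dataset.
gains = pd.DataFrame({
    'gain_loss-0': [1.82, -0.18, 2.45],
    'gain_loss-1': [1.64, -2.19, np.nan],
})

flag_cols = []
for col in list(gains.columns):
    flag = col + 'b'
    # NaN > 0 evaluates to False, so a missing value silently becomes 0
    gains[flag] = np.where(gains[col] > 0, 1, 0)
    flag_cols.append(flag)

# The aggregate is the count of positive periods in each row
gains['gain_loss-total_b'] = gains[flag_cols].sum(axis=1)
print(gains)
```

Note that a NaN is indistinguishable from a loss after this step, which is one reason the incomplete tail rows need to be dropped later.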

In [689]:
dataset['gain_loss-total_b'] = dataset['gain_loss-0b'] + dataset['gain_loss-1b'] + dataset['gain_loss-2b'] + dataset['gain_loss-3b'] + dataset['gain_loss-4b']

dataset.head(-1)

Unnamed: 0,Close,Date,DOW,DOY,Week,Month,Quarter,gain_loss-0,gain_loss-1,gain_loss-2,gain_loss-3,gain_loss-4,gain_loss-0b,gain_loss-1b,gain_loss-2b,gain_loss-3b,gain_loss-4b,gain_loss-total_b
0,440.65,2021-07-29,3,210,30,7,3,1.82,1.64,-0.37,0.71,5.19,1,1,0,1,1,4
1,438.83,2021-07-28,2,209,30,7,3,-0.18,-2.19,-1.11,3.37,4.28,0,0,0,1,1,2
2,439.01,2021-07-27,1,208,30,7,3,-2.01,-0.93,3.55,4.46,7.95,0,0,1,1,1,3
3,441.02,2021-07-26,0,207,30,7,3,1.08,5.56,6.47,9.96,16.05,1,1,1,1,1,5
4,439.94,2021-07-23,4,204,29,7,3,4.48,5.39,8.88,14.97,8.60,1,1,1,1,1,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3432,152.08,2007-12-10,0,344,50,12,4,1.17,1.14,3.27,5.72,4.40,1,1,1,1,1,5
3433,150.91,2007-12-07,4,341,49,12,4,-0.03,2.10,4.55,3.23,,0,1,1,1,0,3
3434,150.94,2007-12-06,3,340,49,12,4,2.13,4.58,3.26,,,1,1,1,0,0,3
3435,148.81,2007-12-05,2,339,49,12,4,2.45,1.13,,,,1,1,0,0,0,2


#Create daily gain loss and denormalize values

1) Calculate the first day's gain loss by subtracting day -1 from day 0

2) Calculate the second day's gain loss by subtracting day -2 from day -1

3) Calculate the third day's gain loss by subtracting day -n from day -n+1

This process creates a new column and removes the top rows in accordance with the desired "lookback" period - shift over 1 and lift by 1, shift over 2 and lift by 2, shift over n and lift by n

!Note: This should be turned into a loop using i=n where n = the rows to be processed (how many previous rows)
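A possible shape for that loop is sketched below, under the assumption that the `fill_value=0` behaviour of the hand-written cell is kept. `daily` and `n_lookback` are hypothetical names, and the values are the first five daily changes from the printed dataset.

```python
import pandas as pd

# Daily gain/loss values from the first rows of the printed dataset, newest first.
daily = pd.DataFrame({'gain_loss-0': [1.82, -0.18, -2.01, 1.08, 4.48]})

n_lookback = 3  # hypothetical; the notebook builds 5 columns by hand
for k in range(n_lookback):
    # shift(-k) lifts the column by k rows, so row i holds the change from k days earlier
    daily[f'prior_day-{k}'] = daily['gain_loss-0'].shift(periods=-k, fill_value=0)
print(daily)
```

With `fill_value=0` the vacated bottom rows hold zeros rather than NaN, so a later `dropna` will not remove them on account of these columns.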


In [690]:
dataset['prior_day-0'] = dataset['gain_loss-0']
#dataset['prior_day-1'] = dataset['prior_day-1'].shift(periods=-1, fill_value=0)#Removed this to experiment 
#with switching around the dependant variable rather than the independant variable
dataset['prior_day-1'] = dataset['gain_loss-0']
dataset['prior_day-1'] = dataset['prior_day-1'].shift(periods=-1, fill_value=0)
dataset['prior_day-2'] = dataset['gain_loss-0']
dataset['prior_day-2'] = dataset['prior_day-2'].shift(periods=-2, fill_value=0)
dataset['prior_day-3'] = dataset['gain_loss-0']
dataset['prior_day-3'] = dataset['prior_day-3'].shift(periods=-3, fill_value=0)
dataset['prior_day-4'] = dataset['gain_loss-0']
dataset['prior_day-4'] = dataset['prior_day-4'].shift(periods=-4, fill_value=0)
dataset.head()

Unnamed: 0,Close,Date,DOW,DOY,Week,Month,Quarter,gain_loss-0,gain_loss-1,gain_loss-2,gain_loss-3,gain_loss-4,gain_loss-0b,gain_loss-1b,gain_loss-2b,gain_loss-3b,gain_loss-4b,gain_loss-total_b,prior_day-0,prior_day-1,prior_day-2,prior_day-3,prior_day-4
0,440.65,2021-07-29,3,210,30,7,3,1.82,1.64,-0.37,0.71,5.19,1,1,0,1,1,4,1.82,-0.18,-2.01,1.08,4.48
1,438.83,2021-07-28,2,209,30,7,3,-0.18,-2.19,-1.11,3.37,4.28,0,0,0,1,1,2,-0.18,-2.01,1.08,4.48,0.91
2,439.01,2021-07-27,1,208,30,7,3,-2.01,-0.93,3.55,4.46,7.95,0,0,1,1,1,3,-2.01,1.08,4.48,0.91,3.49
3,441.02,2021-07-26,0,207,30,7,3,1.08,5.56,6.47,9.96,16.05,1,1,1,1,1,5,1.08,4.48,0.91,3.49,6.09
4,439.94,2021-07-23,4,204,29,7,3,4.48,5.39,8.88,14.97,8.6,1,1,1,1,1,5,4.48,0.91,3.49,6.09,-6.37


#Create binary version of daily gain loss values

This process converts each continuous gain loss variable into a binary discrete (dichotomous) variable: 1 for a gain, 0 otherwise

!Note - This process should be folded into the previous process when that process is converted into a loop

In [691]:
dataset['prior_day-0b'] = np.where(dataset['prior_day-0'] > 0, 1, 0)
dataset['prior_day-1b'] = np.where(dataset['prior_day-1'] > 0, 1, 0)
dataset['prior_day-2b'] = np.where(dataset['prior_day-2'] > 0, 1, 0)
dataset['prior_day-3b'] = np.where(dataset['prior_day-3'] > 0, 1, 0)
dataset['prior_day-4b'] = np.where(dataset['prior_day-4'] > 0, 1, 0)
dataset.head()

Unnamed: 0,Close,Date,DOW,DOY,Week,Month,Quarter,gain_loss-0,gain_loss-1,gain_loss-2,gain_loss-3,gain_loss-4,gain_loss-0b,gain_loss-1b,gain_loss-2b,gain_loss-3b,gain_loss-4b,gain_loss-total_b,prior_day-0,prior_day-1,prior_day-2,prior_day-3,prior_day-4,prior_day-0b,prior_day-1b,prior_day-2b,prior_day-3b,prior_day-4b
0,440.65,2021-07-29,3,210,30,7,3,1.82,1.64,-0.37,0.71,5.19,1,1,0,1,1,4,1.82,-0.18,-2.01,1.08,4.48,1,0,0,1,1
1,438.83,2021-07-28,2,209,30,7,3,-0.18,-2.19,-1.11,3.37,4.28,0,0,0,1,1,2,-0.18,-2.01,1.08,4.48,0.91,0,0,1,1,1
2,439.01,2021-07-27,1,208,30,7,3,-2.01,-0.93,3.55,4.46,7.95,0,0,1,1,1,3,-2.01,1.08,4.48,0.91,3.49,0,1,1,1,1
3,441.02,2021-07-26,0,207,30,7,3,1.08,5.56,6.47,9.96,16.05,1,1,1,1,1,5,1.08,4.48,0.91,3.49,6.09,1,1,1,1,1
4,439.94,2021-07-23,4,204,29,7,3,4.48,5.39,8.88,14.97,8.6,1,1,1,1,1,5,4.48,0.91,3.49,6.09,-6.37,1,1,1,1,0


#Aggregate the binary daily gain loss values

This creates a true categorical value from the binary discrete values.

The theory is that the per-period binary values (a sparse matrix) and their aggregate (a categorical value) work together to make the data more informative

!Note = This process should be included in the loop mentioned in the notes for the process above (future modifications to the data pre-processing procedures)

In [692]:
dataset['prior_day-total_b'] = dataset['prior_day-0b'] + dataset['prior_day-1b'] + dataset['prior_day-2b'] + dataset['prior_day-3b'] + dataset['prior_day-4b'] 
dataset.head(5)

Unnamed: 0,Close,Date,DOW,DOY,Week,Month,Quarter,gain_loss-0,gain_loss-1,gain_loss-2,gain_loss-3,gain_loss-4,gain_loss-0b,gain_loss-1b,gain_loss-2b,gain_loss-3b,gain_loss-4b,gain_loss-total_b,prior_day-0,prior_day-1,prior_day-2,prior_day-3,prior_day-4,prior_day-0b,prior_day-1b,prior_day-2b,prior_day-3b,prior_day-4b,prior_day-total_b
0,440.65,2021-07-29,3,210,30,7,3,1.82,1.64,-0.37,0.71,5.19,1,1,0,1,1,4,1.82,-0.18,-2.01,1.08,4.48,1,0,0,1,1,3
1,438.83,2021-07-28,2,209,30,7,3,-0.18,-2.19,-1.11,3.37,4.28,0,0,0,1,1,2,-0.18,-2.01,1.08,4.48,0.91,0,0,1,1,1,3
2,439.01,2021-07-27,1,208,30,7,3,-2.01,-0.93,3.55,4.46,7.95,0,0,1,1,1,3,-2.01,1.08,4.48,0.91,3.49,0,1,1,1,1,4
3,441.02,2021-07-26,0,207,30,7,3,1.08,5.56,6.47,9.96,16.05,1,1,1,1,1,5,1.08,4.48,0.91,3.49,6.09,1,1,1,1,1,5
4,439.94,2021-07-23,4,204,29,7,3,4.48,5.39,8.88,14.97,8.6,1,1,1,1,1,5,4.48,0.91,3.49,6.09,-6.37,1,1,1,1,0,4


#Creating rolling mean attribute values

Rolling mean values are based on the daily gain loss and represent the trend direction over the prior n rows (5, 10, 15, ..., n row means)

The rolling mean works from the top row down. For example, the mean of rows 1 and 2 would appear on row 2, but we need it to land on row 1. This requires reversing the index of each desired mean column before rolling and reversing it back afterwards. The result of this process is a set of pandas Series

!Note: this process can be converted into a loop where n = list of n mean values (as described above)
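The reverse, roll, reverse pattern can be sketched as a loop over window sizes. The series values, the `windows` list and the `rolling_mean_n` column names are all hypothetical, chosen so the result is easy to check by hand.

```python
import pandas as pd

# Hypothetical daily gain/loss series, newest first.
prior_day = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], name='prior_day-0')

windows = [2, 3]  # hypothetical; the notebook uses 5, 10, 15, 20, 25 and 30
rolling_means = {}
for n in windows:
    # Reverse, roll, then reverse back so each mean lands on the newest row of its window
    rolling_means[f'rolling_mean_{n}'] = prior_day[::-1].rolling(n).mean()[::-1]

rolling_df = pd.DataFrame(rolling_means)
print(rolling_df)
```

Giving each Series its own name avoids ending up with several columns that share the name `prior_day-0` once they are concatenated onto the dataset. The oldest n - 1 rows of each column are NaN.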

In [693]:
#Rolling averages based on prior day gain loss
rolling_prior_day = dataset['prior_day-0']

rolling_prior_day_5 = rolling_prior_day[::-1].rolling(5).mean()[::-1]
rolling_prior_day_10 = rolling_prior_day[::-1].rolling(10).mean()[::-1]
rolling_prior_day_15 = rolling_prior_day[::-1].rolling(15).mean()[::-1]
rolling_prior_day_20 = rolling_prior_day[::-1].rolling(20).mean()[::-1]
rolling_prior_day_25 = rolling_prior_day[::-1].rolling(25).mean()[::-1]
rolling_prior_day_30 = rolling_prior_day[::-1].rolling(30).mean()[::-1]

print(rolling_prior_day_10)

0       0.590
1       0.259
2       0.342
3       0.394
4       0.442
        ...  
3433      NaN
3434      NaN
3435      NaN
3436      NaN
3437      NaN
Name: prior_day-0, Length: 3438, dtype: float64


#Remove NaN rows

The NaN rows at the bottom of the dataset need to be removed. They will cause errors in the analysis if left in place.

Because of this deletion (a consequence of the rolling means and shifts), the dataset must contain 35 additional periods of history beyond what is desired, as mentioned in the dataset notes at the top of this solution. In this cell the dropna call is commented out; the rows are dropped after the final dataset is assembled.

In [694]:
#dataset.dropna(inplace = True)

dataset.head(-5)

Unnamed: 0,Close,Date,DOW,DOY,Week,Month,Quarter,gain_loss-0,gain_loss-1,gain_loss-2,gain_loss-3,gain_loss-4,gain_loss-0b,gain_loss-1b,gain_loss-2b,gain_loss-3b,gain_loss-4b,gain_loss-total_b,prior_day-0,prior_day-1,prior_day-2,prior_day-3,prior_day-4,prior_day-0b,prior_day-1b,prior_day-2b,prior_day-3b,prior_day-4b,prior_day-total_b
0,440.65,2021-07-29,3,210,30,7,3,1.82,1.64,-0.37,0.71,5.19,1,1,0,1,1,4,1.82,-0.18,-2.01,1.08,4.48,1,0,0,1,1,3
1,438.83,2021-07-28,2,209,30,7,3,-0.18,-2.19,-1.11,3.37,4.28,0,0,0,1,1,2,-0.18,-2.01,1.08,4.48,0.91,0,0,1,1,1,3
2,439.01,2021-07-27,1,208,30,7,3,-2.01,-0.93,3.55,4.46,7.95,0,0,1,1,1,3,-2.01,1.08,4.48,0.91,3.49,0,1,1,1,1,4
3,441.02,2021-07-26,0,207,30,7,3,1.08,5.56,6.47,9.96,16.05,1,1,1,1,1,5,1.08,4.48,0.91,3.49,6.09,1,1,1,1,1,5
4,439.94,2021-07-23,4,204,29,7,3,4.48,5.39,8.88,14.97,8.60,1,1,1,1,1,5,4.48,0.91,3.49,6.09,-6.37,1,1,1,1,0,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3428,147.17,2007-12-14,4,348,50,12,4,-1.89,-2.20,-0.74,-4.91,-3.74,0,0,0,0,0,0,-1.89,-0.31,1.46,-4.17,1.17,0,0,1,0,1,2
3429,149.06,2007-12-13,3,347,50,12,4,-0.31,1.15,-3.02,-1.85,-1.88,0,1,0,0,0,1,-0.31,1.46,-4.17,1.17,-0.03,0,1,0,1,0,2
3430,149.37,2007-12-12,2,346,50,12,4,1.46,-2.71,-1.54,-1.57,0.56,1,0,0,0,1,2,1.46,-4.17,1.17,-0.03,2.13,1,0,1,0,1,3
3431,147.91,2007-12-11,1,345,50,12,4,-4.17,-3.00,-3.03,-0.90,1.55,0,0,0,0,1,1,-4.17,1.17,-0.03,2.13,2.45,0,1,0,1,1,3


#Create dependant variables (2 dependants)

As mentioned earlier, the gain_loss-0 attribute is a dependent variable. Its binary counterpart, gain_loss-0b, is also a dependent variable.

The dependent variables need to be shifted down one row. This positions the independent variables so that each row predicts the "day ahead". Because the data is shifted down one row, the last row must be removed.

!Note: Due to the organization of this dataset (train and test sets being time-based), this adjustment of the dependent variables produces results for the next day.
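The shift described above can be sketched with `Series.shift`, which should give the same values as inserting a row at the top and dropping the last row. This is an illustration on four values taken from the printed dataset, not the notebook's own cell; `gain` is a hypothetical name.

```python
import pandas as pd

# Daily gain/loss values from the first rows of the printed dataset, newest first.
gain = pd.Series([1.82, -0.18, -2.01, 1.08], name='gain_loss-0')

# Shift down one row: each row now carries the following day's gain/loss.
# Row 0 (the latest date) has no known next day, so it receives a placeholder 0.
y = gain.shift(1, fill_value=0).rename('y')
# The binary target is 1 where the next day's change is positive
yb = (y > 0).astype(int).rename('yb')
print(pd.concat([gain, y, yb], axis=1))
```

The placeholder in row 0 is not a real observation, so any score computed on that row should be read with care.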


In [695]:
y = pd.DataFrame(dataset['gain_loss-0']).reset_index(drop = True)
y.loc[-1] = [0]
y.index = y.index + 1
y = y.sort_index()
y.drop(y.tail(1).index, inplace = True)
y.rename(columns={'gain_loss-0': 'y'}, inplace=True)
y_df = pd.DataFrame(y, columns=['y'])

yb = pd.DataFrame(dataset['gain_loss-0b']).reset_index(drop = True)
yb.loc[-1] = [0]
yb.index = yb.index + 1
yb = yb.sort_index()
yb.drop(yb.tail(1).index, inplace = True)
yb.rename(columns={'gain_loss-0b': 'yb'}, inplace=True)
yb_df = pd.DataFrame(yb, columns=['yb'])

print(y)
print("----------------")
print(yb)

         y
0     0.00
1     1.82
2    -0.18
3    -2.01
4     1.08
...    ...
3433  1.17
3434 -0.03
3435  2.13
3436  2.45
3437 -1.32

[3438 rows x 1 columns]
----------------
      yb
0      0
1      1
2      0
3      0
4      1
...   ..
3433   1
3434   0
3435   1
3436   1
3437   0

[3438 rows x 1 columns]


In [696]:
a = len(y.index)
b = len(yb.index)
c = len(dataset.index)
d = len(rolling_prior_day_5.index)
a1 = len(y_df.index)
b1 = len(yb_df.index)

print(a)
print(b)
print(c)
print(d)
print(a1)
print(b1)

3438
3438
3438
3438
3438
3438


#Create final dataset and review

A concat procedure is necessary to create the final dataset.

There should be a total of 37 columns

In [697]:
dataset_final = pd.concat([dataset,
           rolling_prior_day_5, 
           rolling_prior_day_10, 
           rolling_prior_day_15, 
           rolling_prior_day_20, 
           rolling_prior_day_25, 
           rolling_prior_day_30,
           y_df,
           yb_df],
           axis = 1)

dataset_final.head(-5)

Unnamed: 0,Close,Date,DOW,DOY,Week,Month,Quarter,gain_loss-0,gain_loss-1,gain_loss-2,gain_loss-3,gain_loss-4,gain_loss-0b,gain_loss-1b,gain_loss-2b,gain_loss-3b,gain_loss-4b,gain_loss-total_b,prior_day-0,prior_day-1,prior_day-2,prior_day-3,prior_day-4,prior_day-0b,prior_day-1b,prior_day-2b,prior_day-3b,prior_day-4b,prior_day-total_b,prior_day-0.1,prior_day-0.2,prior_day-0.3,prior_day-0.4,prior_day-0.5,prior_day-0.6,y,yb
0,440.65,2021-07-29,3,210,30,7,3,1.82,1.64,-0.37,0.71,5.19,1,1,0,1,1,4,1.82,-0.18,-2.01,1.08,4.48,1,0,0,1,1,3,1.038,0.590,0.648667,0.6295,0.7220,0.618000,0.00,0
1,438.83,2021-07-28,2,209,30,7,3,-0.18,-2.19,-1.11,3.37,4.28,0,0,0,1,1,2,-0.18,-2.01,1.08,4.48,0.91,0,0,1,1,1,3,0.856,0.259,0.291333,0.5565,0.6288,0.478333,1.82,1
2,439.01,2021-07-27,1,208,30,7,3,-2.01,-0.93,3.55,4.46,7.95,0,0,1,1,1,3,-2.01,1.08,4.48,0.91,3.49,0,1,1,1,1,4,1.590,0.342,0.405333,0.5770,0.7260,0.458333,-0.18,0
3,441.02,2021-07-26,0,207,30,7,3,1.08,5.56,6.47,9.96,16.05,1,1,1,1,1,5,1.08,4.48,0.91,3.49,6.09,1,1,1,1,1,5,3.210,0.394,0.486667,0.7205,1.0440,0.557000,-2.01,0
4,439.94,2021-07-23,4,204,29,7,3,4.48,5.39,8.88,14.97,8.60,1,1,1,1,1,5,4.48,0.91,3.49,6.09,-6.37,1,1,1,1,0,4,1.720,0.442,0.634000,0.7420,0.7188,0.544333,1.08,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3428,147.17,2007-12-14,4,348,50,12,4,-1.89,-2.20,-0.74,-4.91,-3.74,0,0,0,0,0,0,-1.89,-0.31,1.46,-4.17,1.17,0,0,1,0,1,2,-0.748,,,,,,-2.10,0
3429,149.06,2007-12-13,3,347,50,12,4,-0.31,1.15,-3.02,-1.85,-1.88,0,1,0,0,0,1,-0.31,1.46,-4.17,1.17,-0.03,0,1,0,1,0,2,-0.376,,,,,,-1.89,0
3430,149.37,2007-12-12,2,346,50,12,4,1.46,-2.71,-1.54,-1.57,0.56,1,0,0,0,1,2,1.46,-4.17,1.17,-0.03,2.13,1,0,1,0,1,3,0.112,,,,,,-0.31,0
3431,147.91,2007-12-11,1,345,50,12,4,-4.17,-3.00,-3.03,-0.90,1.55,0,0,0,0,1,1,-4.17,1.17,-0.03,2.13,2.45,0,1,0,1,1,3,0.310,,,,,,1.46,1


#Evaluate dataset for NaN

The processes above should have created some NaN values at the tail of the dataset
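Before dropping, it can help to see which columns contribute the NaN values. A minimal sketch on a hypothetical frame (`frame`, `rolling_a` and `rolling_b` are made-up names):

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose oldest rows are NaN, as with the rolling mean columns.
frame = pd.DataFrame({
    'rolling_a': [0.5, 0.2, np.nan, np.nan],
    'rolling_b': [0.1, np.nan, np.nan, np.nan],
})

nan_per_column = frame.isna().sum()              # NaN count in each column
rows_with_nan = frame.isna().any(axis=1).sum()   # rows that dropna() would remove
cleaned = frame.dropna()

print(nan_per_column)
print("rows dropped = {}".format(len(frame) - len(cleaned)))
```

The number of rows dropped is set by the column with the most NaN values, which in the notebook would be the longest rolling window.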

In [698]:
dataset_final.dropna(inplace = True)

a = len(dataset.index)
e = len(dataset_final.index)

print("rows dropped = {}".format(a-e))

rows dropped = 30


#Create dataset splitting variable (Train and Pred)

Date variables based on today's date are required to prevent "hardcoding" dates into the model

The date_var variable will represent the most current date in the dataset. This allows the solution to be run for any timeframe.

The following code can replace the current date_var logic if the current method causes issues. Note that this method requires adjustment when backtesting:

date_var = pd.to_datetime(date.today()) 

In [699]:
date_var = dataset_final['Date'].max()
train_begin_date = dataset_final['Date'].min()
train_end_date = (date_var - pd.to_timedelta(90, unit='d'))
pred_begin_date = (date_var - pd.to_timedelta(120, unit='d'))
pred_end_date = date_var

train_begin_date = train_begin_date.to_pydatetime()
train_end_date = train_end_date.to_pydatetime()
pred_begin_date = pred_begin_date.to_pydatetime()
pred_end_date = pred_end_date.to_pydatetime()

#train_begin_date = pd.DataFrame([train_begin_date], columns=['train_begin_date'])
#train_end_date = pd.DataFrame([train_end_date], columns=['train_end_date'])
#pred_begin_date = pd.DataFrame([pred_begin_date], columns=['pred_begin_date'])
#pred_end_date = pd.DataFrame([pred_end_date], columns=['pred_end_date'])

print(train_begin_date)
print(train_end_date)
print(pred_begin_date)
print(pred_end_date)

2008-01-16 00:00:00
2021-04-30 00:00:00
2021-03-31 00:00:00
2021-07-29 00:00:00


#Split between training and predict data sets

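The most recent 90 periods (rows), approximately, generate the pred data set. The previous cell selects them as the last 120 calendar days.

All but the most recent 60 periods (rows), approximately, generate the training data set. The previous cell ends the training data 90 calendar days before the latest date.

The 30 day overlap can provide a measure of the model's degradation over time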

!Note - The market is closed on weekends and holidays. The count of days in each set will NOT equal the amount of days between begin and end dates.
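The date-based split and its overlap can be sketched on a hypothetical calendar. The `frame` below contains every calendar day, which is an assumption for illustration only; the real dataset holds trading days, so its row counts are smaller.

```python
import pandas as pd

# Hypothetical daily calendar; the real dataset contains trading days only.
frame = pd.DataFrame({'Date': pd.date_range('2021-01-01', '2021-07-29', freq='D')})

latest = frame['Date'].max()
train_end = latest - pd.Timedelta(days=90)
pred_begin = latest - pd.Timedelta(days=120)

trainset = frame[frame['Date'] <= train_end]
predset = frame[(frame['Date'] >= pred_begin) & (frame['Date'] <= latest)]
# Rows that appear in both sets: the overlap window
overlap = predset[predset['Date'] <= train_end]
print(len(trainset), len(predset), len(overlap))
```

Predictions made inside the overlap window are on rows the model has already seen during training, so they are likely to look better than predictions on the rows after `train_end`.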

In [700]:
#split predict and train datasets
predset = dataset_final[(dataset_final['Date'] >= pred_begin_date) & 
                        (dataset_final['Date'] <= pred_end_date)]
trainset = dataset_final[(dataset_final['Date'] >= train_begin_date) & 
                         (dataset_final['Date'] <= train_end_date)]
type(predset)

pandas.core.frame.DataFrame

In [701]:
len(predset)


84

In [702]:
len(trainset)


3346

In [703]:
len(dataset)

3438

In [704]:
trainset

Unnamed: 0,Close,Date,DOW,DOY,Week,Month,Quarter,gain_loss-0,gain_loss-1,gain_loss-2,gain_loss-3,gain_loss-4,gain_loss-0b,gain_loss-1b,gain_loss-2b,gain_loss-3b,gain_loss-4b,gain_loss-total_b,prior_day-0,prior_day-1,prior_day-2,prior_day-3,prior_day-4,prior_day-0b,prior_day-1b,prior_day-2b,prior_day-3b,prior_day-4b,prior_day-total_b,prior_day-0.1,prior_day-0.2,prior_day-0.3,prior_day-0.4,prior_day-0.5,prior_day-0.6,y,yb
62,417.30,2021-04-30,4,120,17,4,2,-2.76,-0.10,-0.22,-0.31,0.56,0,0,0,0,1,1,-2.76,2.66,-0.12,-0.09,0.87,0,1,0,0,1,2,0.112,0.004,0.387333,0.8345,1.1040,0.860667,0.90,1
63,420.06,2021-04-29,3,119,17,4,2,2.66,2.54,2.45,3.32,7.79,1,1,1,1,1,5,2.66,-0.12,-0.09,0.87,4.47,1,0,0,1,1,3,1.558,0.419,0.769333,1.1865,1.3016,0.760000,-2.76,0
64,417.40,2021-04-28,2,118,17,4,2,-0.12,-0.21,0.66,5.13,1.33,0,0,1,1,1,3,-0.12,-0.09,0.87,4.47,-3.80,0,0,1,1,0,2,0.266,0.595,0.720667,1.1335,1.1160,0.716333,2.66,1
65,417.52,2021-04-27,1,117,17,4,2,-0.09,0.78,5.25,1.45,5.35,0,1,1,1,1,4,-0.09,0.87,4.47,-3.80,3.90,0,1,1,0,1,3,1.070,0.466,0.760000,1.0870,0.9972,0.703667,-0.12,0
66,417.61,2021-04-26,0,116,17,4,2,0.87,5.34,1.54,5.44,2.40,1,1,1,1,1,5,0.87,4.47,-3.80,3.90,-3.04,1,1,0,1,0,3,0.480,0.597,0.750000,1.0815,1.1252,0.785000,-0.09,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3403,133.86,2008-01-23,2,23,4,1,1,3.14,1.80,0.43,-3.12,-4.31,1,1,1,0,0,3,3.14,-1.34,-1.37,-3.55,-1.19,1,0,0,0,0,1,-0.862,-0.505,-0.823333,-0.7135,-0.5324,-0.568333,1.13,1
3404,130.72,2008-01-22,1,22,4,1,1,-1.34,-2.71,-6.26,-7.45,-10.56,0,0,0,0,0,0,-1.34,-1.37,-3.55,-1.19,-3.11,0,0,0,0,0,0,-2.112,-1.047,-1.105333,-0.8040,-0.7336,-0.674000,3.14,1
3405,132.06,2008-01-18,4,18,3,1,1,-1.37,-4.92,-6.11,-9.22,-8.09,0,0,0,0,0,0,-1.37,-3.55,-1.19,-3.11,1.13,0,0,0,0,1,1,-1.618,-0.925,-1.040667,-0.6910,-0.6924,-0.558333,-1.34,0
3406,133.43,2008-01-17,3,17,3,1,1,-3.55,-4.74,-7.85,-6.72,-7.86,0,0,0,0,0,0,-3.55,-1.19,-3.11,1.13,-1.14,0,0,0,1,0,1,-1.572,-1.143,-1.074667,-0.6225,-0.5792,-0.431000,-1.37,0


#Convert dataset into X and y and refine column membership

This process separates the dependent and independent variables

X should not contain the y or yb attributes

For X, "Date" should be removed since it is a time-series value; the date attributes represent time instead

For X, "Close" should be removed due to its relationship to the independent variables

Two models will come out of this process, one for the continuous variable y and one for the binary variable yb

In [705]:
#Drop the 31 oldest rows; iloc returns a new frame instead of editing a slice in place
trainset = trainset.iloc[:-31]
X = trainset.drop(['y','yb', 'Date', 'Close'], axis = 1).values
y = trainset['y'].values
yb = trainset['yb'].values
print(X)
print("-------------------------")
print(y)
print("-------------------------")
print(yb)

[[ 4.00000000e+00  1.20000000e+02  1.70000000e+01 ...  8.34500000e-01
   1.10400000e+00  8.60666667e-01]
 [ 3.00000000e+00  1.19000000e+02  1.70000000e+01 ...  1.18650000e+00
   1.30160000e+00  7.60000000e-01]
 [ 2.00000000e+00  1.18000000e+02  1.70000000e+01 ...  1.13350000e+00
   1.11600000e+00  7.16333333e-01]
 ...
 [ 2.00000000e+00  6.50000000e+01  1.00000000e+01 ... -1.50000000e-02
  -8.32000000e-02  1.03666667e-01]
 [ 1.00000000e+00  6.40000000e+01  1.00000000e+01 ... -2.41500000e-01
  -9.00000000e-02  3.10000000e-02]
 [ 0.00000000e+00  6.30000000e+01  1.00000000e+01 ... -3.04000000e-01
   1.84000000e-02  2.33333333e-03]]
-------------------------
[ 0.9  -2.76  2.66 ... -2.77  0.84 -0.51]
-------------------------
[1 0 1 ... 0 1 0]




In [706]:
#Validation of row counts

f = len(X)
g = len(y)
h = len(yb)

print(f, g, h)

3315 3315 3315


#Train the models

The model can be extended to use any regression or classification model.

Current model inventory:

1) Random Forest
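As one example of such an extension, the binary target could be fit with a classifier instead of a regressor. The sketch below is not part of the notebook: it uses scikit-learn's `RandomForestClassifier` on synthetic data, and `X_demo`, `yb_demo` and `prob_up` are hypothetical names.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))         # hypothetical feature matrix
yb_demo = (X_demo[:, 0] > 0).astype(int)   # hypothetical binary target

classifier = RandomForestClassifier(n_estimators=100, random_state=0)
classifier.fit(X_demo, yb_demo)

# predict_proba returns one column per class; column 1 is the estimated P(yb = 1)
prob_up = classifier.predict_proba(X_demo[:5])[:, 1]
print(prob_up)
```

The probability column plays the same role as the regressor's output on yb: it can be thresholded at 0.5 to produce a buy/sell indicator.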

In [707]:
from sklearn.ensemble import RandomForestRegressor

regressor = RandomForestRegressor(n_estimators = 100, random_state = 0)

regressor.fit(X, y)

regressor_b = RandomForestRegressor(n_estimators = 100, random_state = 0)

regressor_b.fit(X, yb)


RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=0, verbose=0, warm_start=False)

#Create predict dataset

The predict dataset should match the process used to generate the training dataset

In [708]:
#Prepare predict set
Xpred = predset
Xpred = Xpred.drop(['y','yb', 'Date', 'Close'], axis = 1).values
y_actual = predset['y'].values
yb_actual = predset['yb'].values
print(Xpred)
print("-------------------------")
print(y_actual)
print("-------------------------")
print(yb_actual)

[[3.00000000e+00 2.10000000e+02 3.00000000e+01 ... 6.29500000e-01
  7.22000000e-01 6.18000000e-01]
 [2.00000000e+00 2.09000000e+02 3.00000000e+01 ... 5.56500000e-01
  6.28800000e-01 4.78333333e-01]
 [1.00000000e+00 2.08000000e+02 3.00000000e+01 ... 5.77000000e-01
  7.26000000e-01 4.58333333e-01]
 ...
 [0.00000000e+00 9.50000000e+01 1.40000000e+01 ... 1.13650000e+00
  1.04000000e+00 5.44333333e-01]
 [3.00000000e+00 9.10000000e+01 1.30000000e+01 ... 1.19550000e+00
  7.31200000e-01 3.29666667e-01]
 [2.00000000e+00 9.00000000e+01 1.30000000e+01 ... 7.45500000e-01
  1.82400000e-01 1.31333333e-01]]
-------------------------
[ 0.    1.82 -0.18 -2.01  1.08  4.48  0.91  3.49  6.09 -6.37 -3.41 -1.49
  0.65 -1.49  1.56  4.6  -3.54  1.53 -0.79  3.29  2.37  0.36  0.23  0.86
  1.51  2.5  -0.51  2.25  5.94 -7.05 -0.14 -2.37 -0.78  0.95  0.7   1.96
 -0.63  0.09 -0.41  3.83 -1.56  0.66 -0.37  0.75  0.22  0.83 -0.93  4.23
 -0.34  4.42 -1.08 -3.58 -1.06  6.3   4.87 -8.8  -3.73 -4.18  3.05  3.32
  0.13 -2

In [709]:
i = len(Xpred)
j = len(y_actual)
k = len(yb_actual)

print(i, j, k)

84 84 84


#Generate predictions

Predictions are made for both continuous and binary

In [763]:
y_pred = regressor.predict(Xpred)

yb_pred = regressor_b.predict(Xpred)

np_array = np.concatenate((y_pred.reshape(len(y_pred),1), 
                           yb_pred.reshape(len(y_pred),1),
                           y_actual.reshape(len(y_actual),1),
                           yb_actual.reshape(len(yb_actual),1),
                           ), axis = 1)

results = pd.DataFrame(np_array, columns = ['y_pred', 'yb_pred', 'y_actual', 'yb_actual'])

print(results)


      y_pred  yb_pred  y_actual  yb_actual
0  -0.969824     0.59      0.00        0.0
1   0.899154     0.56      1.82        1.0
2   0.464892     0.72     -0.18        0.0
3   0.795992     0.59     -2.01        0.0
4  -0.181100     0.56      1.08        1.0
..       ...      ...       ...        ...
79  1.472700     0.85      1.93        1.0
80  0.454100     0.81      0.47        1.0
81 -0.373200     0.12     -0.24        0.0
82  3.194900     0.85      5.75        1.0
83  2.366782     0.79      4.28        1.0

[84 rows x 4 columns]
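
The reshape-and-concatenate step above can also be expressed by handing the 1-D arrays straight to a DataFrame. This is a minimal sketch, not the notebook's own method: `build_results` is a hypothetical helper, the sample values are made up for illustration, and it assumes all four arrays have the same length.

```python
import numpy as np
import pandas as pd

def build_results(y_pred, yb_pred, y_actual, yb_actual):
    """Hypothetical helper: combine four equal-length arrays into one frame."""
    return pd.DataFrame({
        'y_pred': np.ravel(y_pred),        # flatten in case a column vector is passed
        'yb_pred': np.ravel(yb_pred),
        'y_actual': np.ravel(y_actual),
        'yb_actual': np.ravel(yb_actual),
    })

# Made-up sample values, for illustration only
demo_results = build_results(np.array([-0.5, 0.9]),
                             np.array([0.4, 0.6]),
                             np.array([0.0, 1.82]),
                             np.array([0.0, 1.0]))
print(demo_results)
```

Building from a dict keeps each column name next to its data, which makes it harder to mix up the column order.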


#Create buy/sell indicator based on predictions

1 = Buy next day

0 = Sell next day

In [764]:
results['y_pred_arg'] = np.where(results['y_pred'] > 0, 1, 0)
results['yb_pred_arg'] = np.where(results['yb_pred'] > 0.5, 1, 0)
results['y_actual_arg'] = np.where(results['y_actual'] > 0, 1, 0)
results['yb_acutal_arg'] = np.where(results['yb_actual'] > 0, 1, 0)

print(results)


      y_pred  yb_pred  y_actual  ...  yb_pred_arg  y_actual_arg  yb_acutal_arg
0  -0.969824     0.59      0.00  ...            1             0              0
1   0.899154     0.56      1.82  ...            1             1              1
2   0.464892     0.72     -0.18  ...            1             0              0
3   0.795992     0.59     -2.01  ...            1             0              0
4  -0.181100     0.56      1.08  ...            1             1              1
..       ...      ...       ...  ...          ...           ...            ...
79  1.472700     0.85      1.93  ...            1             1              1
80  0.454100     0.81      0.47  ...            1             1              1
81 -0.373200     0.12     -0.24  ...            0             0              0
82  3.194900     0.85      5.75  ...            1             1              1
83  2.366782     0.79      4.28  ...            1             1              1

[84 rows x 8 columns]
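
The thresholding used for the "..._arg" columns can be sketched in isolation as follows. `to_signal` is a hypothetical helper and the input values are made up; the thresholds (0 for the continuous prediction, 0.5 for the binary one) are the ones used in the cell above. Note that a value exactly equal to the threshold maps to 0 (Sell), because the comparison is strict.

```python
import numpy as np

def to_signal(values, threshold):
    """Hypothetical helper: 1 (Buy) where value > threshold, else 0 (Sell)."""
    return np.where(np.asarray(values) > threshold, 1, 0)

# Made-up sample values, for illustration only
continuous_signal = to_signal([-0.97, 0.0, 0.9], threshold=0)
binary_signal = to_signal([0.59, 0.5, 0.12], threshold=0.5)
print(continuous_signal, binary_signal)
```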


#Add in 'short' ticker for same period

A short ticker is one that behaves opposite to the selected ticker in this evaluation.

The concept of having a short ticker is to trade "into" it when trading "out" of the primary ticker.

Note: this section is not broken out into separate cells; use the # comments in the code as a reference.

In [765]:
type(pred_short)

pandas.core.frame.DataFrame

In [766]:
#import dataset
dataset_short = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/HistoricalPricesHIBS.csv')
#rename columns with leading white space
dataset_short.rename({' Close': 'Close'}, axis=1, inplace = True)
#create gain/loss value for short
dataset_short['gain_loss_short-0'] = dataset_short['Close'].diff(-1)
#select out values needed for model (copy so the Date assignment below works on an independent frame)
dataset_short = dataset_short[['Close', 'Date', 'gain_loss_short-0']].copy()
#convert date to datetime type
dataset_short['Date'] = pd.to_datetime(dataset_short['Date'])
#select out matching dates to pred dataset
pred_short = dataset_short[(dataset_short['Date'] >= pred_begin_date) &
                         (dataset_short['Date'] <= pred_end_date)]
#move rows down by one
y_short = pd.DataFrame(pred_short['gain_loss_short-0']).reset_index(drop = True)
y_short.loc[-1] = [0]
y_short.index = y_short.index + 1
y_short = y_short.sort_index()
y_short.drop(y_short.tail(1).index, inplace = True)
y_short.rename(columns={'gain_loss_short-0': 'y_short'}, inplace=True)
y_short_df = pd.DataFrame(y_short, columns=['y_short'])
#add short to results
results = pd.concat([results,
                    y_short_df],
                    axis = 1)
#verify y_short and y_actual are both 0 at row 0
print(results)

      y_pred  yb_pred  y_actual  ...  y_actual_arg  yb_acutal_arg  y_short
0  -0.969824     0.59      0.00  ...             0              0     0.00
1   0.899154     0.56      1.82  ...             1              1    -0.38
2   0.464892     0.72     -0.18  ...             0              0    -0.32
3   0.795992     0.59     -2.01  ...             0              0     0.43
4  -0.181100     0.56      1.08  ...             1              1    -0.58
..       ...      ...       ...  ...           ...            ...      ...
79  1.472700     0.85      1.93  ...             1              1     0.09
80  0.454100     0.81      0.47  ...             1              1    -0.02
81 -0.373200     0.12     -0.24  ...             0              0    -0.04
82  3.194900     0.85      5.75  ...             1              1    -0.14
83  2.366782     0.79      4.28  ...             1              1    -0.86

[84 rows x 9 columns]
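
The "move rows down by one" step inserts a 0 at the top and drops the last row. The same result can be obtained with `Series.shift`. This is a sketch under that reading of the code, using made-up gain/loss values rather than the HIBS data.

```python
import pandas as pd

# Made-up gain/loss values, for illustration only
gain_loss = pd.Series([-0.38, -0.32, 0.43, -0.58], name='gain_loss_short-0')

# Shift down one row, fill the vacated first slot with 0, and rename
y_short_demo = (gain_loss.reset_index(drop=True)
                         .shift(1, fill_value=0)
                         .rename('y_short'))
print(y_short_demo)
```

The last original value (-0.58 here) falls off the end, just as it does when the tail row is dropped explicitly.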


In [741]:
type(results)

pandas.core.frame.DataFrame

#Model Performance

For accurate accounting, the top row of the "results" dataset must be removed, because it has no actual value.

Model performance is based on the buy/sell relationship between the "..._arg" columns.

Each model is measured on its own, as well as in several combinations of the models.

In [767]:
results['y_buy'] = np.where(results['y_pred_arg'] >= 1, results['y_actual'], results['y_short'])
results['yb_buy'] = np.where(results['yb_pred_arg'] >= 1, results['y_actual'], results['y_short'])
results['yoryb_buy'] = np.where((results['y_pred_arg'] + results['yb_pred_arg']) >= 1, results['y_actual'], results['y_short'])
results['yandyb_buy'] = np.where((results['y_pred_arg'] + results['yb_pred_arg']) >= 2, results['y_actual'], results['y_short'])
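
To make the combined columns concrete, here is a small sketch on made-up data covering all four vote combinations. The column names match the cell above; the values are illustrative only. "or" takes the primary ticker's gain when at least one model says Buy, "and" only when both do; otherwise the short ticker's gain is taken.

```python
import numpy as np
import pandas as pd

# Made-up values covering every combination of the two signals
demo = pd.DataFrame({
    'y_pred_arg':  [1, 0, 1, 0],
    'yb_pred_arg': [1, 1, 0, 0],
    'y_actual':    [2.0, -1.0, 0.5, -3.0],
    'y_short':     [-0.2, 0.1, -0.05, 0.3],
})

votes = demo['y_pred_arg'] + demo['yb_pred_arg']   # 0, 1, or 2 Buy votes per row
demo['yoryb_buy'] = np.where(votes >= 1, demo['y_actual'], demo['y_short'])
demo['yandyb_buy'] = np.where(votes >= 2, demo['y_actual'], demo['y_short'])
print(demo)
```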


The following section should be turned into a loop

In [769]:
performance_all = results
performance_all = performance_all[1:] #Remove top row since we do not have actual value
performance_all = performance_all.sum(axis = 0)
print(performance_all)

y_pred           14.637012
yb_pred          46.980000
y_actual         44.320000
yb_actual        49.000000
y_pred_arg       54.000000
yb_pred_arg      56.000000
y_actual_arg     49.000000
yb_acutal_arg    49.000000
y_short          -3.870000
y_buy            38.020000
yb_buy           64.020000
yoryb_buy        66.180000
yandyb_buy       35.860000
dtype: float64


In [770]:
performance_30 = results.head(31)
performance_30 = performance_30[1:] #Remove top row since we do not have actual value
performance_30 = performance_30.sum(axis = 0)
print(performance_30)

y_pred           -4.370251
yb_pred          16.020000
y_actual         18.540000
yb_actual        19.000000
y_pred_arg       16.000000
yb_pred_arg      20.000000
y_actual_arg     19.000000
yb_acutal_arg    19.000000
y_short           0.630000
y_buy             4.460000
yb_buy           24.590000
yoryb_buy        23.040000
yandyb_buy        6.010000
dtype: float64


In [771]:
performance_60 = results.head(61)
performance_60 = performance_60[1:] #Remove top row since we do not have actual value
performance_60 = performance_60.sum(axis = 0)
print(performance_60)

y_pred           -0.954542
yb_pred          33.200000
y_actual         25.030000
yb_actual        35.000000
y_pred_arg       38.000000
yb_pred_arg      41.000000
y_actual_arg     35.000000
yb_acutal_arg    35.000000
y_short          -1.240000
y_buy             3.760000
yb_buy           30.220000
yoryb_buy        31.800000
yandyb_buy        2.180000
dtype: float64


In [772]:
performance_15 = results.head(16)
performance_15 = performance_15[1:] #Remove top row since we do not have actual value
performance_15 = performance_15.sum(axis = 0)
print(performance_15)

y_pred           -2.912119
yb_pred           8.280000
y_actual          9.730000
yb_actual         9.000000
y_pred_arg        8.000000
yb_pred_arg      11.000000
y_actual_arg      9.000000
yb_acutal_arg     9.000000
y_short          -0.900000
y_buy             7.900000
yb_buy           19.070000
yoryb_buy        19.070000
yandyb_buy        7.900000
dtype: float64


In [773]:
performance_45 = results.head(46)
performance_45 = performance_45[1:] #Remove top row since we do not have actual value
performance_45 = performance_45.sum(axis = 0)
print(performance_45)

y_pred           -0.769382
yb_pred          24.570000
y_actual         22.410000
yb_actual        28.000000
y_pred_arg       30.000000
yb_pred_arg      31.000000
y_actual_arg     28.000000
yb_acutal_arg    28.000000
y_short          -0.190000
y_buy             8.080000
yb_buy           25.330000
yoryb_buy        26.910000
yandyb_buy        6.500000
dtype: float64


In [774]:
performance_5 = results.head(6)
performance_5 = performance_5[1:] #Remove top row since we do not have actual value
performance_5 = performance_5.sum(axis = 0)
print(performance_5)

y_pred           3.292546
yb_pred          2.990000
y_actual         5.190000
yb_actual        3.000000
y_pred_arg       4.000000
yb_pred_arg      5.000000
y_actual_arg     3.000000
yb_acutal_arg    3.000000
y_short         -0.870000
y_buy            3.530000
yb_buy           5.190000
yoryb_buy        5.190000
yandyb_buy       3.530000
dtype: float64


In [775]:
performance_10 = results.head(11)
performance_10 = performance_10[1:] #Remove top row since we do not have actual value
performance_10 = performance_10.sum(axis = 0)
print(performance_10)

y_pred           -3.388018
yb_pred           5.410000
y_actual          5.900000
yb_actual         6.000000
y_pred_arg        5.000000
yb_pred_arg       7.000000
y_actual_arg      6.000000
yb_acutal_arg     6.000000
y_short          -1.100000
y_buy             3.820000
yb_buy           13.170000
yoryb_buy        13.170000
yandyb_buy        3.820000
dtype: float64


In [776]:
performance_20 = results.head(21)
performance_20 = performance_20[1:] #Remove top row since we do not have actual value
performance_20 = performance_20.sum(axis = 0)
print(performance_20)

y_pred           -4.807723
yb_pred          10.380000
y_actual         12.590000
yb_actual        12.000000
y_pred_arg       10.000000
yb_pred_arg      12.000000
y_actual_arg     12.000000
yb_acutal_arg    12.000000
y_short           0.550000
y_buy             8.850000
yb_buy           21.570000
yoryb_buy        20.020000
yandyb_buy       10.400000
dtype: float64


In [777]:
performance_25 = results.head(26)
performance_25 = performance_25[1:] #Remove top row since we do not have actual value
performance_25 = performance_25.sum(axis = 0)
print(performance_25)

y_pred           -2.842042
yb_pred          13.290000
y_actual         18.050000
yb_actual        17.000000
y_pred_arg       13.000000
yb_pred_arg      16.000000
y_actual_arg     17.000000
yb_acutal_arg    17.000000
y_short           0.260000
y_buy            13.380000
yb_buy           26.400000
yoryb_buy        24.850000
yandyb_buy       14.930000
dtype: float64


In [778]:
performance_35 = results.head(36)
performance_35 = performance_35[1:] #Remove top row since we do not have actual value
performance_35 = performance_35.sum(axis = 0)
print(performance_35)

y_pred           -3.437224
yb_pred          18.560000
y_actual         19.000000
yb_actual        22.000000
y_pred_arg       21.000000
yb_pred_arg      23.000000
y_actual_arg     22.000000
yb_acutal_arg    22.000000
y_short           0.960000
y_buy             4.920000
yb_buy           26.610000
yoryb_buy        23.500000
yandyb_buy        8.030000
dtype: float64


In [779]:
performance_40 = results.head(41)
performance_40 = performance_40[1:] #Remove top row since we do not have actual value
performance_40 = performance_40.sum(axis = 0)
print(performance_40)

y_pred           -2.031479
yb_pred          21.580000
y_actual         20.320000
yb_actual        24.000000
y_pred_arg       26.000000
yb_pred_arg      27.000000
y_actual_arg     24.000000
yb_acutal_arg    24.000000
y_short           1.450000
y_buy             6.240000
yb_buy           23.980000
yoryb_buy        24.820000
yandyb_buy        5.400000
dtype: float64


In [780]:
performance_50 = results.head(51)
performance_50 = performance_50[1:] #Remove top row since we do not have actual value
performance_50 = performance_50.sum(axis = 0)
print(performance_50)

y_pred           -2.131543
yb_pred          27.370000
y_actual         28.710000
yb_actual        30.000000
y_pred_arg       31.000000
yb_pred_arg      34.000000
y_actual_arg     30.000000
yb_acutal_arg    30.000000
y_short          -0.130000
y_buy             6.580000
yb_buy           33.040000
yoryb_buy        34.620000
yandyb_buy        5.000000
dtype: float64


In [781]:
performance_55 = results.head(56)
performance_55 = performance_55[1:] #Remove top row since we do not have actual value
performance_55 = performance_55.sum(axis = 0)
print(performance_55)

y_pred           -2.249643
yb_pred          30.360000
y_actual         26.440000
yb_actual        32.000000
y_pred_arg       34.000000
yb_pred_arg      37.000000
y_actual_arg     32.000000
yb_acutal_arg    32.000000
y_short          -0.800000
y_buy             0.850000
yb_buy           27.310000
yoryb_buy        28.890000
yandyb_buy       -0.730000
dtype: float64


In [782]:
performance = pd.concat([performance_5,
                         performance_10,
                         performance_15, 
                         performance_20,
                         performance_25,
                         performance_30,
                         performance_35,
                         performance_40, 
                         performance_45,
                         performance_50,
                         performance_55,
                         performance_60, 
                         performance_all], axis=1)

print(performance)

                     0          1          2   ...         10         11         12
y_pred         3.292546  -3.388018  -2.912119  ...  -2.249643  -0.954542  14.637012
yb_pred        2.990000   5.410000   8.280000  ...  30.360000  33.200000  46.980000
y_actual       5.190000   5.900000   9.730000  ...  26.440000  25.030000  44.320000
yb_actual      3.000000   6.000000   9.000000  ...  32.000000  35.000000  49.000000
y_pred_arg     4.000000   5.000000   8.000000  ...  34.000000  38.000000  54.000000
yb_pred_arg    5.000000   7.000000  11.000000  ...  37.000000  41.000000  56.000000
y_actual_arg   3.000000   6.000000   9.000000  ...  32.000000  35.000000  49.000000
yb_acutal_arg  3.000000   6.000000   9.000000  ...  32.000000  35.000000  49.000000
y_short       -0.870000  -1.100000  -0.900000  ...  -0.800000  -1.240000  -3.870000
y_buy          3.530000   3.820000   7.900000  ...   0.850000   3.760000  38.020000
yb_buy         5.190000  13.170000  19.070000  ...  27.310000  30.220000  64
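
The repeated per-window cells above could be collapsed into a loop along these lines. This is a sketch, not the notebook's code: `summarize` is a hypothetical helper, the demo frame is made up, and the columns are labeled by window name instead of the 0 to 12 integer labels that the plain `pd.concat` produces.

```python
import numpy as np
import pandas as pd

def summarize(results, windows):
    """Hypothetical helper: column sums over rows 1..n for each window n,
    plus one column for all rows after the first."""
    body = results.iloc[1:]  # drop row 0, which has no actual value
    columns = {f'performance_{n}': body.head(n).sum(axis=0) for n in windows}
    columns['performance_all'] = body.sum(axis=0)
    return pd.concat(columns, axis=1)  # dict keys become the column labels

# Made-up frame, for illustration only
demo_frame = pd.DataFrame({'y_buy': np.arange(8, dtype=float)})
demo_performance = summarize(demo_frame, windows=[2, 4])
print(demo_performance)
```

For the windows used in this notebook, the call would be `summarize(results, range(5, 65, 5))`, assuming `results` is the frame built earlier.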

In [783]:
results.to_csv('/content/drive/MyDrive/Colab Notebooks/results.csv')