<a href="https://colab.research.google.com/github/kmaciver/Ryerson_Capstone/blob/master/Approach/Step1-Benchmark/Benchmark_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os

In [0]:
from google.colab import files

uploaded = files.upload()

# Benchmark Model for Day Trade Prediction

Reading the Day Trade Data


In [0]:
DayTrade = pd.read_csv('Day_trade_data.csv', index_col='Time')
DayTrade = DayTrade.drop([DayTrade.columns[0]] ,  axis='columns')
DayTrade.head()

Dropping Volume Currency as discussed in the Feature Selection phase.

In [0]:
DayTrade = DayTrade.drop(columns='Volume_.Currency.')

## Modifying Day Trade data into a supervised problem 

The objective of the model is to predict `Weighted_Price` 10 minutes ahead of the current time-step. 
The Day Trade data contains minute-by-minute data for a total of 1735 days. Each sample must be built from a single day, so feature windows and targets never cross a day boundary.
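
To illustrate why the analysis is limited to each day, here is a minimal sketch on a made-up two-day frame. The values are invented for illustration. The notebook itself loops over days rather than using `groupby`, so this is an equivalent view and not the code used below.

```python
import pandas as pd

# Toy frame with the same column names as the notebook; the numbers are invented
toy = pd.DataFrame({
    "date": ["2020-01-01"] * 3 + ["2020-01-02"] * 3,
    "Weighted_Price": [10.0, 11.0, 13.0, 20.0, 19.0, 22.0],
})

# Differencing inside each day: the first row of every day has no predecessor
toy["diff_within_day"] = toy.groupby("date")["Weighted_Price"].diff()

# Differencing across the whole frame leaks the overnight jump into day two
toy["diff_global"] = toy["Weighted_Price"].diff()

print(toy)
```

With the per-day version, the first row of the second day is `NaN` instead of carrying the jump between the two days.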

In [0]:

# Unique days, in order of first appearance
days_in_data = list(dict.fromkeys(DayTrade["date"].values))
len(days_in_data)

So that results can be compared with the other models, the days are split in order: the first 90% are used for training and the remaining 10% for testing.

In [0]:
split = 0.9
training_days = days_in_data[:int(split*len(days_in_data))]
testing_days = days_in_data[int(split*len(days_in_data)):]
print(len(training_days),len(testing_days))

For each prediction the algorithm uses the last 25 time-steps of the current day.
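
The windowing idea can be sketched on a toy series. This example assumes a window of 3 lags and a horizon of 2 steps instead of 25 and 10, and the helper `make_supervised` and the column name `price` are hypothetical names used only for illustration.

```python
import pandas as pd

def make_supervised(series, n_lags, horizon):
    """Build lag features and a shifted target from one day's series (illustrative helper)."""
    frame = pd.DataFrame({"price": series})
    for i in range(n_lags, 0, -1):
        frame[f"price t-{i}"] = series.shift(i)   # value i steps in the past
    # Value `horizon` steps ahead of the current row
    target = series.shift(-horizon).rename(f"price t+{horizon}")
    # Keep only rows whose lags and target both fall inside the series
    valid = frame.notna().all(axis=1) & target.notna()
    return frame[valid], target[valid]

toy = pd.Series([10.0, 11.0, 13.0, 12.0, 15.0, 16.0, 18.0, 17.0])
X_toy, y_toy = make_supervised(toy, n_lags=3, horizon=2)
print(X_toy)
print(y_toy)
```

The first `n_lags` rows and the last `horizon` rows are lost, which is the same trimming the notebook applies per day with 25 and 10.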

In [0]:
# Transforming data into a supervised problem

def transform_data(data, days):
    '''Objective: Transform the data so that each row holds the current time-step and the previous 25 time-steps
       Input: data-> DataFrame to be transformed
              days-> list of days in data to be transformed
       Output: X_train, Y_train if days are training_days, or X_test, Y_test if days are testing_days'''

    n_lags = 25      # previous time-steps used as features
    shift_step = 10  # forecast period in minutes

    # Only the original feature columns are lagged
    # (looping over the lagged column names as well would create lags of lags)
    base_columns = list(data.drop(columns='date').columns)

    X_parts, Y_parts = [], []

    for day in days:
        # Filtering data for each day
        df = data[data['date'].values == day]
        df = df.drop(columns='date')

        # Performing stationarization (first difference); the first row becomes NaN and is dropped
        differenced = df.diff(1).iloc[1:]

        train_day_data = differenced.copy()

        # Shifting previous time-step information to the same row
        for column in base_columns:
            for i in range(n_lags, 0, -1):
                train_day_data[column + ' t-' + str(i)] = differenced[column].shift(i)

        label_day_data = pd.DataFrame(data=differenced['Weighted_Price'].shift(-shift_step).values,
                                      columns=['Weighted_Price t+10'], index=differenced.index)

        # The first 25 rows of train_day_data and the last 10 rows of label_day_data contain NaN values
        X_parts.append(train_day_data.iloc[n_lags:-shift_step])
        Y_parts.append(label_day_data.iloc[n_lags:-shift_step])

    # Concatenate once at the end instead of growing a DataFrame inside the loop
    X_data = pd.concat(X_parts, axis=0, sort=False)
    Y_data = pd.concat(Y_parts, axis=0, sort=False)

    return X_data, Y_data

In [0]:
X_train, Y_train = transform_data(DayTrade, training_days)

X_train.shape, Y_train.shape

In [0]:
X_test, Y_test = transform_data(DayTrade, testing_days)

X_test.shape, Y_test.shape

In [0]:
X_test.to_csv("X_test")
Y_test.to_csv("Y_test")

## Applying Regressor to Train data

Now that the data has been transformed into a supervised learning problem, a regressor model can be trained and evaluated.

In [0]:
from sklearn import linear_model

model_linear = linear_model.Lasso(alpha=0.1)
model_linear.fit(X_train.values, Y_train.values)

In [0]:
from sklearn.svm import SVR

model_svr = SVR(C=1.0, epsilon=0.2)
# SVR expects a 1-D target array, so the single-column label frame is flattened
model_svr.fit(X_train.values, Y_train.values.ravel())

In [0]:
from sklearn.ensemble import GradientBoostingRegressor

# fit gradient boosting model (the target is flattened to a 1-D array)
model = GradientBoostingRegressor(n_estimators=100, random_state=1)
model.fit(X_train.values, Y_train.values.ravel())

In [0]:
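import pickle

# Load a previously saved Gradient Boosting model; this replaces the `model` fitted above
filename = "GBR_model.sav"
with open(filename, 'rb') as f:
    model = pickle.load(f)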

In [0]:
from sklearn.metrics import mean_squared_error 

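# Predict on the raw array, matching how the model was fitted
Y_pred = model_linear.predict(X_test.values)
# scikit-learn's signature is mean_squared_error(y_true, y_pred)
mse_benchmark = mean_squared_error(Y_test, Y_pred)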
mse_benchmark

Saving trained model's weights.

In [0]:
import pickle

filename = "Lasso_model.sav"
# The context manager closes the file once the model is written
with open(filename, "wb") as f:
    pickle.dump(model_linear, f)

In [0]:
#loaded_model = pickle.load(open(filename, 'rb'))
#Y_pred2 = loaded_model.predict(X_test)
#mse_benchmark2 = mean_squared_error(Y_pred2,Y_test) 
#mse_benchmark2

## Restoring prediction values to original form 

Creating a Dataframe with all the timesteps for the testing days

In [0]:
Validation_data = DayTrade[DayTrade['date'].isin(testing_days)]
Validation_data_label = Validation_data.iloc[:,5:7]
Validation_data_label

In [0]:
Validation_data_label.to_csv("Validation_data_label.csv")

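The trained algorithm uses the last 25 time-steps of the stationarized data from each day to predict 10 minutes into the future. Differencing drops the first minute, the lag window drops the next 25, and the 10-minute horizon drops the last 10 feature rows, so of the 1440 minutes in each day the algorithm produces predictions for the last 1404 (1440 - 1 - 25 - 10).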
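
The row count and the `00:36:00` start time used for the prediction index can be checked with a short calculation. This sketch assumes every day has all 1440 one-minute rows starting at 00:00, and the date is an arbitrary placeholder.

```python
from datetime import datetime, timedelta

minutes_per_day = 1440
n_lags = 25     # previous time-steps used as features
horizon = 10    # minutes ahead being predicted

rows_after_diff = minutes_per_day - 1              # diff(1) drops the first row (00:00)
usable_rows = rows_after_diff - n_lags - horizon   # lags drop 25 rows, the shifted label drops 10

day_start = datetime(2020, 1, 1)                   # placeholder date
# First row with a full lag window is 00:26; its target lies `horizon` minutes later
first_feature_time = day_start + timedelta(minutes=1 + n_lags)
first_target_time = first_feature_time + timedelta(minutes=horizon)
last_target_time = first_target_time + timedelta(minutes=usable_rows - 1)

print(usable_rows, first_target_time.time(), last_target_time.time())
```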

During the transformations the indexes were altered, so an array with the correct indexes is created.

In [0]:
from datetime import datetime
from datetime import timedelta 

prediction_indexes = []
for day in testing_days:
  # First predicted timestamp of the day, parsed once per day
  day_start = datetime.strptime(str(day + " 00:36:00"), '%Y-%m-%d %H:%M:%S')
  for i in range(1404):
    prediction_indexes.append(str(day_start + timedelta(minutes=i)))

len(prediction_indexes)

Transforming the predictions into a Dataframe with the correct timestep as index

In [0]:
predicted_dataframe = Y_test.copy()
predicted_dataframe['Prediction'] = Y_pred
predicted_dataframe.index = prediction_indexes
predicted_dataframe

In [0]:
predicted_dataframe.to_csv("predicted_dataframe.csv")

Creating a new Dataframe that will restore the prediction values performing the inverse of the diff function applied for stationarizing the data
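
The inverse of first differencing is a running sum added to the last known level. A minimal sketch on invented toy prices, using true differences rather than model predictions:

```python
import numpy as np

prices = np.array([100.0, 101.5, 101.0, 103.0, 102.5])   # toy price levels
diffs = np.diff(prices)                                   # first differences (stationarized series)

# Inverse transform: last known level plus the running sum of the differences
restored_levels = prices[0] + np.cumsum(diffs)

print(restored_levels)
```

With true differences the original levels are recovered exactly. With predicted differences, each restored value also carries the summed error of all earlier predictions in the same day.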

In [0]:
restored = Validation_data_label.merge(predicted_dataframe, left_index=True, right_index=True, how='left')

In [0]:
# Work on NumPy arrays to avoid chained assignment on the DataFrame
price = restored['Weighted_Price'].to_numpy()
pred = restored['Prediction'].to_numpy()
pred_restored = np.full(len(restored), np.nan)
for i in range(0,len(testing_days)):
  idx_start = (1440 * i) + 35  # last minute of the day without a prediction (00:35)
  pos = idx_start + 1
  # 1404 minutes are predicted each day: start from the last known price
  # and accumulate the predicted differences
  pred_restored[pos:pos+1404] = price[idx_start] + np.cumsum(pred[pos:pos+1404])
restored['Prediction_restored'] = pred_restored

In [0]:
restored.to_csv("restored_prediciton_lasso.csv")

In [0]:
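# Drop rows without a prediction once, then compare true and restored prices
restored_clean = restored.dropna()
mse_restored = mean_squared_error(restored_clean['Weighted_Price'].values, restored_clean['Prediction_restored'].values)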
mse_restored