# OLS Baseline 
This notebook fits an OLS regression that serves as a baseline against which the other methods are compared. It also serves as a proof of concept for structuring the data correctly and saving results before moving on to other architectures.

In [115]:
import pandas as pd
import numpy as np
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from tqdm.notebook import tqdm

In [116]:
pd.set_option('display.max_columns', None)

In [117]:
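def createRollingWindow(dataset, look_back=1):
    """Flatten each run of `look_back` consecutive rows of `dataset` into one row."""
    n_windows = dataset.shape[0] - look_back
    X = pd.DataFrame(np.empty((n_windows, dataset.shape[1] * look_back)))
    for i in tqdm(range(n_windows)):
        # Window i covers rows i .. i+look_back-1, flattened row by row (time first, then variable).
        X.iloc[i] = dataset.iloc[i:i + look_back].to_numpy().flatten()
    return X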
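
The row-by-row loop above can become slow on longer samples. One possible vectorized alternative is sketched below, assuming NumPy 1.20 or later (for `sliding_window_view`). The name `rolling_window_fast` is hypothetical and not part of this notebook; the sketch is meant to reproduce the row layout of `createRollingWindow`, including the fact that the final possible window is dropped.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view


def rolling_window_fast(dataset, look_back=1):
    """Hypothetical vectorized counterpart of createRollingWindow (illustration only)."""
    values = dataset.to_numpy(dtype=float)
    n_rows, n_cols = values.shape
    n_windows = n_rows - look_back  # same count as the loop version

    # Shape: (n_rows - look_back + 1, n_cols, look_back); the window axis comes last.
    windows = sliding_window_view(values, look_back, axis=0)

    # Reorder to (window, time, variable) so flattening matches .to_numpy().flatten().
    windows = windows.transpose(0, 2, 1)[:n_windows]

    return pd.DataFrame(windows.reshape(n_windows, n_cols * look_back))


# Toy input (made-up values): 5 rows, 2 variables.
toy = pd.DataFrame(np.arange(10).reshape(5, 2), columns=["a", "b"])
toy_windows = rolling_window_fast(toy, look_back=2)
print(toy_windows)
```

Each output row holds the first month's variables followed by the second month's, which is the same ordering the loop produces.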

In [118]:
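def shift_data(steps, X, y):
    """Align each row of X with the value of y observed `steps` rows later."""
    X = X[:X.shape[0] - steps]  # drop the last `steps` rows, which have no future target
    y = y.shift(periods=-steps)[:y.shape[0] - steps].reset_index(drop=True)
    return X, y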

## Reading Data
We start by loading the relevant data from the Excel workbook used in our analysis.

In [119]:
#Read the equity premium series to a dataframe
ep = pd.read_excel('data/Augemented_Formatted_results.xls', sheet_name='Equity premium', skiprows= range(1118,1127,1))[:-1]
ep['Date'] = pd.to_datetime(ep['Date'], format='%Y%m')
ep = ep.set_index('Date')
ep = ep.loc[(ep.index >= '1950-12-01')]

In [120]:
#Read the macroeconomic variables to a dataframe
mev = pd.read_excel('data/Augemented_Formatted_results.xls', sheet_name='Macroeconomic variables', 
                    skiprows= range(1118,1126,1)).bfill()[:-1] #backward fill missing values (fillna(method='bfill') is deprecated in recent pandas)
mev = mev.loc[:, ~mev.columns.str.match('Unnamed')]  #Remove empty column
mev['Date'] = pd.to_datetime(mev['Date'], format='%Y%m') #Convert date to pandas datetime
mev = mev.set_index('Date') #Set date as index. 
mev = mev.loc[(mev.index >= '1950-12-01')]

### Data restructuring
We must create rolling windows of the macroeconomic variables (MEV) and match each window with the 1-month out-of-sample equity premium in order to train a model.

In [121]:
#Create rolling window version of the MEV dataset.  
X_mev = createRollingWindow(mev, look_back = 12)

In [122]:
#Shift equity premiums so that each value is the 1 month out-of-sample observation for the corresponding window.
y = ep.shift(periods=-12)[:ep.shape[0]-12].reset_index(drop=True)

#Convert y to a series holding only the log equity premium (the simple equity premium could be used instead)
y = y['Log equity premium'].astype('float64')
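
To sanity-check this alignment, the same shift-and-trim step can be run on a toy series with a shorter window. The following is a minimal sketch with made-up numbers: `look_back = 3`, the column `x`, and all values are assumptions for illustration, not the notebook's data.

```python
import pandas as pd

look_back = 3  # shorter than the notebook's 12, purely for readability

# Toy stand-ins for the MEV and equity premium frames (made-up values).
features = pd.DataFrame({"x": [10, 11, 12, 13, 14, 15]})
target = pd.DataFrame({"Log equity premium": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]})

# Same shift-and-trim pattern as the cell above.
y_toy = target.shift(periods=-look_back)[:target.shape[0] - look_back].reset_index(drop=True)
y_toy = y_toy["Log equity premium"].astype("float64")

# Windows built the same way createRollingWindow slices them.
windows = [features["x"].iloc[i:i + look_back].tolist()
           for i in range(len(features) - look_back)]

# Each window of rows i .. i+look_back-1 is paired with the target at row i+look_back.
for window, value in zip(windows, y_toy):
    print(window, "->", value)
```

The first window covers rows 0 to 2 and is paired with the target from row 3, which is the month immediately after the window ends.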

### Train OLS Model Windowed
Using the rolling windows, we try to predict the 1-month out-of-sample equity premium from the previous 12 months of macroeconomic variables.

In [146]:
#Create Train and test set
X_train, X_test, y_train, y_test = train_test_split(X_mev, y, train_size=168, random_state=0, shuffle=False)
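
With `shuffle=False`, `train_test_split` does not reorder the rows, so the first `train_size` observations form the training set and the remainder the test set. That keeps the split chronological, and `random_state` has no effect in that case. A small check on toy arrays (hypothetical data, not the notebook's) might look like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten time-ordered toy observations with two features each (made-up values).
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, train_size=6, shuffle=False)

# The split is a plain cut at position 6: earlier rows train, later rows test.
print("train targets:", y_tr.tolist())
print("test targets: ", y_te.tolist())
```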

In [151]:
#Train a linear regression model on MEV rolling window data and the corresponding 1 month out of sample equity premium. 
reg = LinearRegression().fit(X_train, y_train)
coefficients = reg.coef_
intercept = reg.intercept_

In [155]:
#Make a prediction
y_pred = reg.predict(X_test)

from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('R2:', metrics.r2_score(y_test, y_pred))
print('Explained Variance:', metrics.explained_variance_score(y_test, y_pred))

Mean Absolute Error: 15.805841849587061
Mean Squared Error: 977.3759656738616
Root Mean Squared Error: 31.263012741478732
R2: -525094.7996636973
Explained Variance: -488032.71709039697
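
A negative R2 is possible on held-out data. `r2_score` computes 1 - SS_res / SS_tot, where SS_tot is measured around the mean of the test targets, so any model whose squared error exceeds that of this constant prediction scores below zero. The sketch below reproduces the calculation by hand on made-up numbers (the arrays are assumptions for illustration only):

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets with small variance and deliberately poor predictions.
y_true = np.array([0.01, -0.02, 0.03, 0.00])
y_bad = np.array([1.0, -1.0, 1.0, -1.0])

ss_res = np.sum((y_true - y_bad) ** 2)          # squared error of the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of predicting the mean
r2_manual = 1.0 - ss_res / ss_tot

print("manual R2: ", r2_manual)
print("sklearn R2:", r2_score(y_true, y_bad))
```

Because the targets vary little, SS_tot is small and even moderate prediction errors produce a very large negative value.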


### Train OLS Model Vanilla
Train an OLS model without the rolling-window variation. Here we just shift the equity premium by 1 so that each row of MEV measurements is aligned with the 1-month out-of-sample equity premium.

In [164]:
X = mev[:mev.shape[0]-1]
y = ep['Log equity premium'].shift(periods=-1)[:ep['Log equity premium'].shape[0]-1].reset_index(drop=True)
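
The same one-step alignment can also be expressed with the `shift_data` helper defined earlier in this notebook. The sketch below copies that helper so the block runs on its own and applies it to toy data; the column name `dp`, the dates, and the values are assumptions for illustration.

```python
import pandas as pd


def shift_data(steps, X, y):
    # Copied from the helper defined earlier in this notebook.
    X = X[:X.shape[0] - steps]
    y = y.shift(periods=-steps)[:y.shape[0] - steps].reset_index(drop=True)
    return X, y


# Toy monthly data (made-up values).
dates = pd.date_range("2000-01-01", periods=4, freq="MS")
X_toy = pd.DataFrame({"dp": [1.0, 2.0, 3.0, 4.0]}, index=dates)
y_toy = pd.Series([0.1, 0.2, 0.3, 0.4], index=dates, name="Log equity premium")

X_shifted, y_shifted = shift_data(1, X_toy, y_toy)

# Row k of X_shifted (month k) now lines up with the premium of month k+1.
print(X_shifted["dp"].tolist())
print(y_shifted.tolist())
```

Note that `X_shifted` keeps its date index while `y_shifted` is re-indexed from zero; scikit-learn matches rows by position, so this does not affect the fit.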

In [166]:
#Create Train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=168, random_state=0, shuffle=False)

In [167]:
#Train a linear regression model on single-month MEV data (no rolling window) and the corresponding 1 month out of sample equity premium. 
reg = LinearRegression().fit(X_train, y_train)
coefficients = reg.coef_
intercept = reg.intercept_

In [168]:
#Make a prediction
y_pred = reg.predict(X_test)

from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('R2:', metrics.r2_score(y_test, y_pred))
print('Explained Variance:', metrics.explained_variance_score(y_test, y_pred))

Mean Absolute Error: 0.40171447999422094
Mean Squared Error: 0.2844314592484045
Root Mean Squared Error: 0.533321159573108
R2: -153.54061176469264
Explained Variance: -66.74475794944357


## WIP Notes
* The date is included in the rolling window; that cannot be correct.
* The y_mev is useless; it needs to be matched up with the relevant EP.
* What type of OLS regression should I run? Based on MEV and TA separately, I suppose? Perhaps read Rapach.