# Realized Volatility Prediction
## 4 (pt.1) Modeling

### Table of Contents
1. Linear Regression
2. SGDRegressor
3. PLSRegression

In [1]:
# Standard imports and libraries
import os
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Load the engineered training features, then split them into inputs and target
train = pd.read_csv('../data/interim/features1_train2.csv')
train.drop(columns=['Unnamed: 0'], inplace=True)  # leftover index column
X_train = train.drop(['stock_id', 'time_id', 'target_value'], axis=1)
y_train = train[['target_value']].values.ravel()

In [3]:
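# NOTE: this path is the same file loaded for training, so the scores below are
# in-sample; point it at the held-out split to get a true test score.
test = pd.read_csv('../data/interim/features1_train2.csv')
test.drop(columns=['Unnamed: 0'], inplace=True)  # leftover index column
X_test = test.drop(['stock_id', 'time_id', 'target_value'], axis=1)
y_test = test[['target_value']].values.ravel()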

In [4]:
train.shape

(343146, 10)

## 4.1 Linear Regression
The first model we will test is linear regression.

In [5]:
from sklearn import linear_model
linreg = linear_model.LinearRegression()
linreg.fit(X_train, y_train)
y_pred = linreg.predict(X_test)

In [6]:
def rmspe(y_true, y_pred):
    """Root mean squared percentage error; assumes y_true contains no zeros."""
    return np.sqrt(np.mean(np.square((y_true - y_pred) / y_true)))
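
To make the metric concrete, here is a small worked example of the same RMSPE formula. The arrays are made-up values chosen only for illustration (they are not taken from the dataset): every prediction is off by exactly 10% of its target, so the score comes out at about 0.1 regardless of the targets' absolute size.

```python
import numpy as np

def rmspe(y_true, y_pred):
    """Root mean squared percentage error (same formula as above)."""
    return np.sqrt(np.mean(np.square((y_true - y_pred) / y_true)))

# Hypothetical volatility-sized targets; each prediction misses by 10% of the target.
y_true_demo = np.array([0.002, 0.004, 0.010])
y_pred_demo = np.array([0.0022, 0.0036, 0.011])

score_demo = rmspe(y_true_demo, y_pred_demo)
print(score_demo)  # approximately 0.1

# The metric depends on relative errors only: rescaling both arrays leaves it unchanged.
score_scaled = rmspe(100 * y_true_demo, 100 * y_pred_demo)
print(score_scaled)
```

Because the error is divided by the true value, a miss on a very small target counts as much as a proportionally equal miss on a large one.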

In [7]:
rmspe(y_test, y_pred)

0.32666567700634125

This is only a modest improvement over our naive prediction using just past volatility (RMSPE = 0.3413544901880096): the error drops by about 0.015, or roughly 4%.

## 4.2 SGDRegressor

In [8]:
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

In [9]:
# Always scale the input. The most convenient way is to use a pipeline.
sgdreg = make_pipeline(StandardScaler(),
                    SGDRegressor(max_iter=1000, tol=1e-3))

sgdreg.fit(X_train, y_train)
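
As a rough illustration of why the pipeline is convenient, the sketch below fits the same kind of pipeline on synthetic data. The data, the `_demo` names, and the `random_state=0` setting are assumptions added here for reproducibility; none of them come from the notebook. It shows that the scaler's statistics are learned from the training rows only and then reused automatically at prediction time.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 7 features on very different scales (assumed, not the real data).
scales = np.array([1e-3, 1e-2, 0.1, 1.0, 10.0, 100.0, 1000.0])
X_demo = rng.normal(size=(500, 7)) * scales
y_demo = X_demo @ (1.0 / scales) + rng.normal(scale=0.1, size=500)

pipe_demo = make_pipeline(
    StandardScaler(),
    SGDRegressor(max_iter=1000, tol=1e-3, random_state=0),
)
pipe_demo.fit(X_demo[:400], y_demo[:400])

# make_pipeline names each step after its lowercased class name.
scaler = pipe_demo.named_steps['standardscaler']
print(np.allclose(scaler.mean_, X_demo[:400].mean(axis=0)))

# predict() applies the already-fitted scaler before the regressor.
pred_demo = pipe_demo.predict(X_demo[400:])
print(pred_demo.shape)
```

Fixing `random_state` is optional, but without it SGD's shuffling makes the score vary slightly from run to run.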

In [10]:
y_pred = sgdreg.predict(X_test)

In [11]:
rmspe(y_test, y_pred)

0.3563575776011894

This is worse than both the linear regression (RMSPE ≈ 0.327) and the naive prediction (RMSPE ≈ 0.341).

## 4.3 PLSRegression
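Partial least squares (PLS) regression is both a transformer and a regressor. It is quite similar to principal component regression (PCR): both apply a dimensionality reduction to the samples before fitting a linear regressor on the transformed data. The main difference is that the PLS transformation is supervised: it uses the target when choosing the projection directions, whereas the PCA step in PCR looks only at the features.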

We will start with a trial of 3 components, reduced from the 7 input features.

In [12]:
from sklearn.cross_decomposition import PLSRegression
plsreg = PLSRegression(n_components=3, tol=1e-05)
plsreg.fit(X_train, y_train)

In [None]:
#y_pred_plsreg = plsreg.predict(X_test)
#rmspe(y_test, y_pred_plsreg)

The prediction cell above is left commented out because execution failed repeatedly due to RAM limitations.
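
The notebook does not show the traceback, so the following is only a guess at one possible cause. In some scikit-learn versions, `PLSRegression.predict` returns a column of shape `(n_samples, 1)` even when fitted on a 1-D target. Subtracting such a column from the 1-D `y_test` inside `rmspe` makes NumPy broadcast the two into an `n_samples × n_samples` matrix. The sketch below reproduces the effect on a tiny made-up array (all `_demo` names are hypothetical) and shows that flattening the prediction with `.ravel()` avoids it.

```python
import numpy as np

def rmspe(y_true, y_pred):
    """Root mean squared percentage error (same formula as above)."""
    return np.sqrt(np.mean(np.square((y_true - y_pred) / y_true)))

y_true_demo = np.array([1.0, 2.0, 4.0])      # shape (3,), like y_test
y_pred_2d = np.array([[1.0], [2.0], [4.0]])  # shape (3, 1), a column of predictions

# Broadcasting pairs every target with every prediction.
diff = y_true_demo - y_pred_2d
print(diff.shape)  # (3, 3)

bad = rmspe(y_true_demo, y_pred_2d)            # silently wrong: nonzero despite perfect predictions
good = rmspe(y_true_demo, y_pred_2d.ravel())   # 0.0 once the shapes match

# Rough size of the broadcast float64 matrix for the row count reported by train.shape.
n_rows = 343146
print(n_rows ** 2 * 8 / 1e9, "GB")
```

If this is indeed what happened, calling `rmspe(y_test, plsreg.predict(X_test).ravel())` would be a cheap thing to try before reaching for more memory.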