In [1]:
import os
from os.path import join
from pathlib import Path

import pandas as pd
import numpy as np

processed_dir = Path('../../data/processed')

In [2]:
past_realized_volatility = pd.read_hdf(processed_dir / 'past_volatility_indexed.h5')
targets = pd.read_hdf(processed_dir / 'targets_indexed.h5')

A simple benchmark for our models is to use past realized volatility directly as the prediction of future volatility. Since the past volatility is already computed, we can calculate the root mean squared percentage error (RMSPE) right away.

In [3]:
mspe = np.mean(((past_realized_volatility - targets.loc[96]) / targets.loc[96])**2)

mspe**(1/2)

0.2455915676451374
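The metric above can be wrapped in a small helper so that later models are scored the same way. This is a minimal sketch: the function name `rmspe` and the toy arrays are illustrative assumptions, not part of the original notebook.

```python
import numpy as np


def rmspe(y_pred, y_true):
    """Root mean squared percentage error: sqrt(mean(((pred - true) / true)**2))."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean(((y_pred - y_true) / y_true) ** 2)))


# Made-up volatilities, for illustration only
toy_pred = np.array([0.011, 0.018, 0.030])
toy_true = np.array([0.010, 0.020, 0.030])
toy_score = rmspe(toy_pred, toy_true)
```

Because each error is divided by the true value, the metric is scale-free, but it is undefined whenever a target equals zero.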

We can also fit a linear model to the data and use hypothesis tests to check whether it differs significantly from simply using past realized volatility. (Is the slope different from 1? Is the intercept different from 0?)
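As a sanity check on a hand-written least-squares fit, the closed-form slope and intercept can be compared against `np.polyfit`. The sketch below uses synthetic data as a stand-in for the notebook's series. The generating slope of 0.7, the intercept of 0.001, and the noise level are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for past and future volatility (not the notebook's data)
x = rng.uniform(0.001, 0.02, size=1_000)
y = 0.7 * x + 0.001 + rng.normal(0.0, 0.0005, size=x.size)

# Closed-form simple linear regression
x_devs = x - x.mean()
y_devs = y - y.mean()
slope_cf = (x_devs * y_devs).sum() / (x_devs ** 2).sum()
intercept_cf = y.mean() - slope_cf * x.mean()

# np.polyfit with deg=1 returns the coefficients as [slope, intercept]
slope_np, intercept_np = np.polyfit(x, y, deg=1)
```

The two pairs of estimates should agree up to floating-point error. A mismatch would point to a mistake in the hand-written formulas.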

In [4]:
past_mean = past_realized_volatility.mean()
target_mean = targets.loc[96].mean()

past_devs = past_realized_volatility - past_mean
target_devs = targets.loc[96] - target_mean

slope = (past_devs * target_devs).sum() / (past_devs**2).sum()
intercept = target_mean - slope * past_mean

slope, intercept

(0.7038942983677003, 0.001070400444458277)

Based on these numbers, the intercept looks close to 0 while the slope is clearly below 1. We should use a more rigorous approach to test this.
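For reference, the standard simple-linear-regression test statistics for the two null hypotheses (intercept equal to 0, slope equal to 1) can be written as follows. The notation is introduced here for illustration and does not appear in the original notebook: x is past realized volatility, y is the target, n is the number of observations, and hats denote least-squares estimates.

```latex
\begin{aligned}
\hat{\sigma}^2 &= \frac{1}{n-2}\sum_{i=1}^{n}\bigl(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\bigr)^2, \\
\operatorname{SE}(\hat{\beta}_1) &= \sqrt{\frac{\hat{\sigma}^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}}, \\
\operatorname{SE}(\hat{\beta}_0) &= \sqrt{\hat{\sigma}^2\left(\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}\right)}, \\
t_{\text{intercept}} &= \frac{\hat{\beta}_0 - 0}{\operatorname{SE}(\hat{\beta}_0)}, \qquad
t_{\text{slope}} = \frac{\hat{\beta}_1 - 1}{\operatorname{SE}(\hat{\beta}_1)}.
\end{aligned}
```

Under the usual assumptions (independent, normally distributed errors with constant variance), each statistic follows a t distribution with n - 2 degrees of freedom when its null hypothesis holds.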

In [5]:
preds = past_realized_volatility * slope + intercept

rss = ((preds - targets.loc[96])**2).sum()

var = rss / (len(preds) - 2)

intercept_se = (var * (1 / len(preds) + past_mean**2 / (past_devs**2).sum()))**(1/2)
slope_se = (var / (past_devs**2).sum())**(1/2)

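# Approximate 95% intervals (estimate +/- 2 standard errors), kept for reference
# intercept_interval = (intercept - 2 * intercept_se, intercept + 2 * intercept_se)
# slope_interval = (slope - 2 * slope_se, slope + 2 * slope_se)

# t statistics for H0: intercept = 0 and H0: slope = 1
intercept_t = intercept / intercept_se
slope_t = (slope - 1) / slope_se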

intercept_t, slope_t

(32.45758727340186, -50.052824664085136)

In fact, both t statistics are far beyond the usual critical value of about 2 in absolute value, so the slope differs significantly from 1 and the intercept also differs significantly from 0.
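To put these statistics on a probability scale, they can be converted to two-sided p-values. The sketch below assumes the sample is large enough for the t distribution to be approximated by a standard normal. The sample size is not shown in the source, so this is an assumption. The helper name `two_sided_p_normal` is hypothetical.

```python
import math


def two_sided_p_normal(t_stat):
    """Two-sided p-value under a standard normal approximation to the t distribution."""
    return math.erfc(abs(t_stat) / math.sqrt(2.0))


# t statistics reported in the output above
p_intercept = two_sided_p_normal(32.45758727340186)
p_slope = two_sided_p_normal(-50.052824664085136)

# Sanity check: |t| = 1.96 corresponds to a p-value of roughly 0.05
p_reference = two_sided_p_normal(1.96)
```

Both p-values are numerically indistinguishable from zero, which matches the conclusion that neither null hypothesis is plausible for this data.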

In [6]:
past_val = past_realized_volatility.sample(frac=0.2, random_state=0)
val_index = past_val.index

past_train = past_realized_volatility.drop(index=val_index)
target_train = targets.loc[96].drop(index=val_index)

target_val = targets.loc[96].loc[val_index]

In [7]:
past_train_mean = past_train.mean()
target_train_mean = target_train.mean()

past_train_devs = past_train - past_train_mean
target_train_devs = target_train - target_train_mean

slope_train = (past_train_devs * target_train_devs).sum() / (past_train_devs**2).sum()
intercept_train = target_train_mean - slope_train * past_train_mean

slope_train, intercept_train

(0.7071240521986295, 0.0010513888903972359)

In [8]:
preds_val = past_val * slope_train + intercept_train

mspe_val = np.mean(((preds_val - target_val) / target_val)**2)
rmspe_val = mspe_val**(1/2)

rmspe_val

0.20311094758611253
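Note that the earlier benchmark of 0.2456 was computed on all rows, whereas this score uses only the 20% validation sample, so the two numbers are not strictly comparable. A like-for-like comparison would score the naive predictor on the same held-out rows. The sketch below shows the idea on synthetic data. All variable names and the data-generating process are assumptions for illustration, and no ordering of the two scores is implied for the real data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (not the notebook's data)
n = 5_000
past = rng.uniform(0.002, 0.02, size=n)
target = (0.7 * past + 0.001) * np.exp(rng.normal(0.0, 0.2, size=n))

# Hold out roughly 20% of rows for validation
is_val = rng.random(n) < 0.2
past_tr, target_tr = past[~is_val], target[~is_val]
past_va, target_va = past[is_val], target[is_val]

# Fit the linear model on the training rows only
devs = past_tr - past_tr.mean()
slope_tr = (devs * (target_tr - target_tr.mean())).sum() / (devs ** 2).sum()
intercept_tr = target_tr.mean() - slope_tr * past_tr.mean()


def rmspe(pred, true):
    """Root mean squared percentage error."""
    return float(np.sqrt(np.mean(((pred - true) / true) ** 2)))


# Score both predictors on the same validation rows
naive_val = rmspe(past_va, target_va)
linear_val = rmspe(past_va * slope_tr + intercept_tr, target_va)
```

In the notebook's own variables, the naive counterpart would be the RMSPE of `past_val` against `target_val`.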