In [1]:
import os

import pandas as pd
import numpy as np
from tqdm import tqdm

from optiver import Directories
from optiver.utils import realized_volatility, generate_dfs
from optiver.bench import rmspe

dirs = Directories("../..")

# Past Values as a Predictor for Realized Volatility

One way to predict realized volatility is to reuse its most recent value. The question we would like to answer is whether realized volatility tends to stay the same as it was 10 minutes ago or tends to change. If we fit a linear model to the stock data, with past volatility as the input and target volatility as the output, this is equivalent to asking whether the slope and intercept of the model equal 1 and 0, respectively. If we draw a conclusion for one stock, does it hold for all stocks? We can perform hypothesis tests to answer these questions.

## Using Past Values as Predictions

We can calculate the RMSPE when using past realized volatility as a prediction and compare it to the RMSPE of linear models to check whether a linear model is worth using at all. To calculate the realized volatility from the order book data, we first calculate the weighted average price (WAP):

$$
S = \frac{\text{BidPrice}_1 \cdot \text{AskSize}_1 + \text{AskPrice}_1 \cdot \text{BidSize}_1}{\text{BidSize}_1 + \text{AskSize}_1},
$$

and then the log return at each second based on the WAP:

$$
r_{t_1, t_2} = \log \frac{S_{t_2}}{S_{t_1}}.
$$

From there we can calculate realized volatility as the square root of the sum of squared log returns within each time bucket:

$$
\sigma = \sqrt{\sum_t{r^2_{t-1, t}}}.
$$
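The three formulas above can be sketched on a toy order book. This is only an illustration, not the `realized_volatility` helper imported from `optiver.utils`, whose implementation is not shown here. The column names (`bid_price1`, `ask_price1`, `bid_size1`, `ask_size1`), the function names, and the numbers are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def wap_sketch(book: pd.DataFrame) -> pd.Series:
    """Weighted average price from top-of-book quotes (assumed column names)."""
    return (
        book["bid_price1"] * book["ask_size1"]
        + book["ask_price1"] * book["bid_size1"]
    ) / (book["bid_size1"] + book["ask_size1"])


def log_returns_sketch(prices: pd.Series) -> pd.Series:
    """Log return between consecutive rows; the first row has no return."""
    return np.log(prices).diff().dropna()


def realized_vol_sketch(returns: pd.Series) -> float:
    """Square root of the sum of squared log returns."""
    return float(np.sqrt((returns**2).sum()))


# Toy order book for a single time bucket (made-up numbers).
book = pd.DataFrame(
    {
        "bid_price1": [1.000, 1.001, 1.000],
        "ask_price1": [1.002, 1.003, 1.002],
        "bid_size1": [100, 100, 100],
        "ask_size1": [100, 100, 100],
    }
)

prices = wap_sketch(book)
sigma = realized_vol_sketch(log_returns_sketch(prices))
print(prices.tolist(), sigma)
```

With equal bid and ask sizes the WAP reduces to the mid price, which makes the toy numbers easy to check by hand.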

In [None]:
generated_dfs = tqdm(generate_dfs(dirs.processed / "book_train"), total=92)
# Realized volatility per stock, concatenated under a (stock_id, time_id) index.
past_volatility_predictions = pd.concat(
    {stock_id: realized_volatility(df) for stock_id, df in generated_dfs},
    names=("stock_id", "time_id"),
)

target_volatility = pd.read_hdf(dirs.processed / "targets_train.h5")

rmspe(past_volatility_predictions, target_volatility)
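For reference, RMSPE (root mean squared percentage error) is conventionally defined as $\sqrt{\frac{1}{n}\sum_i \left(\frac{y_i - \hat{y}_i}{y_i}\right)^2}$. The `rmspe` helper from `optiver.bench` is not shown in this notebook, so the following is a minimal sketch of that standard definition rather than the helper itself. The name `rmspe_sketch` is hypothetical, the argument order mirrors the calls in this notebook (predictions first, targets second), and the inputs are assumed to be already aligned row by row.

```python
import numpy as np


def rmspe_sketch(y_pred, y_true) -> float:
    """Root mean squared percentage error of y_pred relative to y_true."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    # Errors are scaled by the true value, so y_true must be non-zero.
    return float(np.sqrt(np.mean(((y_true - y_pred) / y_true) ** 2)))


# Made-up values: each prediction is off by 10% of its target.
example = rmspe_sketch([0.0011, 0.0018], [0.0010, 0.0020])
print(example)
```

Because each error is divided by the target, a fixed absolute error is penalized more heavily on low-volatility buckets than on high-volatility ones.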

We get an RMSPE of ~34% when using past volatility as the prediction. How does the linear model fare?

In [3]:
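# Hold out a random 20% of the (stock_id, time_id) rows as a test set.
past_test = past_volatility_predictions.sample(frac=0.2, random_state=0).sort_index()
test_index = past_test.index

target_test = target_volatility.loc[test_index]

# The remaining 80% is used to fit the per-stock linear models.
past_train = past_volatility_predictions.drop(test_index)
target_train = target_volatility.drop(test_index)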

In [4]:
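def devs_and_mean(df):
    """Return per-stock deviations from the mean, and the per-stock means."""
    means = df.groupby(level="stock_id").mean()

    # The subtraction aligns on the shared "stock_id" index level.
    devs = df - means

    return devs, means


def group_stocks(df):
    """Group a (stock_id, time_id)-indexed object by stock."""
    return df.groupby(level="stock_id")


past_devs, past_means = devs_and_mean(past_train)
target_devs, target_means = devs_and_mean(target_train)

# Closed-form least squares per stock: slope = sum(dx * dy) / sum(dx^2).
slopes = group_stocks(past_devs * target_devs).sum() / group_stocks(past_devs**2).sum()
# The fitted line passes through the point of means.
intercepts = target_means - slopes * past_means

linear_models = pd.DataFrame({"slope": slopes, "intercept": intercepts})

# Apply each stock's model to its held-out rows (aligned on "stock_id").
predictions_test = past_test * linear_models["slope"] + linear_models["intercept"]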

display(linear_models)
print(f"RMSPE: {rmspe(predictions_test, target_test)}")

| stock_id | slope | intercept |
|---|---|---|
| 0 | 0.610802 | 0.001269 |
| 1 | 0.643120 | 0.001328 |
| 2 | 0.769605 | 0.000399 |
| 4 | 0.632173 | 0.001274 |
| 5 | 0.726581 | 0.001112 |
| ... | ... | ... |
| 118 | 0.667330 | 0.001293 |
| 119 | 0.772152 | 0.000508 |
| 123 | 0.733954 | 0.000510 |
| 125 | 0.786191 | 0.000321 |


RMSPE: 0.32760240784708133
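The slopes and intercepts computed above follow the closed-form solution for simple linear regression: the slope is the sum of products of deviations divided by the sum of squared deviations of the input, and the intercept makes the line pass through the point of means. As a sanity check, the sketch below applies the same formula to synthetic data for a single hypothetical stock and compares it with `np.polyfit`. The data-generating values (0.7, 0.001, and the noise scale) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for one hypothetical stock: target = 0.7 * past + 0.001 + noise.
past = rng.uniform(0.001, 0.01, size=500)
target = 0.7 * past + 0.001 + rng.normal(0.0, 0.0005, size=500)

# Closed-form simple linear regression, mirroring the per-stock computation.
past_dev = past - past.mean()
target_dev = target - target.mean()
slope = (past_dev * target_dev).sum() / (past_dev**2).sum()
intercept = target.mean() - slope * past.mean()

# Reference: degree-1 least-squares fit; coefficients come highest degree first.
ref_slope, ref_intercept = np.polyfit(past, target, deg=1)

print(slope, intercept)
print(ref_slope, ref_intercept)
```

The two pairs of coefficients should agree up to floating-point error, since both minimize the same sum of squared residuals.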


Fitting a linear model per stock lowers the RMSPE by roughly 1 to 2 percentage points (to about 32.8%), but we know we can do a lot better. (Hypothesis tests are skipped for now in favor of more important tasks. A better hypothesis test would simply look at the mean of $\text{target} - \text{past}$, since a linear model can have a slope and intercept different from 1 and 0 and still produce outputs that approximate the identity line over a certain range.)
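The test on the mean of $\text{target} - \text{past}$ mentioned above could look roughly like the following sketch, which computes a one-sample t statistic per stock for the null hypothesis that the mean difference is zero. It runs on synthetic stand-ins for the notebook's series, not on the real data, and it assumes the differences within a stock are independent across `time_id`, which may not hold for volatility data. All numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins for the notebook's series, indexed by (stock_id, time_id).
index = pd.MultiIndex.from_product(
    [[0, 1], range(200)], names=("stock_id", "time_id")
)
past = pd.Series(rng.uniform(0.002, 0.006, size=len(index)), index=index)

# Stock 0: target is past plus zero-mean noise. Stock 1: target is shifted up.
shift = np.where(index.get_level_values("stock_id") == 0, 0.0, 0.001)
target = past + shift + rng.normal(0.0, 0.0005, size=len(index))

# Per-stock summary of the difference target - past.
diff = target - past
stats = diff.groupby(level="stock_id").agg(["mean", "std", "count"])

# One-sample t statistic for H0: mean(target - past) = 0.
stats["t"] = stats["mean"] / (stats["std"] / np.sqrt(stats["count"]))
print(stats)
```

A large absolute t statistic for a stock suggests that its volatility systematically drifts up or down between the past and target windows, which is evidence against using the past value unchanged for that stock.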