**DEPRECATED**

In [None]:
import os
import random
from os.path import join
from pathlib import Path

import pandas as pd
import numpy as np

data_dir = Path('../../data/')
raw_dir = join(data_dir, 'raw')
splits_dir = join(data_dir, 'splits')
book_train_dir = join(raw_dir, 'book_train')
train_targets_path = join(raw_dir, 'train.csv')

os.makedirs(book_train_dir, exist_ok=True)
os.makedirs(splits_dir, exist_ok=True)

random.seed(0)

The competition test data appears to contain the same stocks as the training data, so we should be able to fit a model per stock id instead of one general model for any stock. The following cell reserves 20 stocks as a final test set to evaluate on before submitting, although it may be more appropriate to hold out part of each individual stock's data instead. For this cell and the rest of the notebook to work, the book_train data must be unzipped into the `data/raw/` directory. Once that's done, the directory structure should look like this:

```
|-- data
|   |-- raw
|   |   |-- book_train
|   |   |   |-- stock_id=0
|   |   |   |-- stock_id=1
|   |   |   |-- ...
|-- README.md
|-- LICENSE
|-- ...
```

In [None]:
stocks = os.listdir(book_train_dir)
# Sort the directory names so we can always be sure this provides the same split for a given seed.
stocks.sort()

# Hold out 20 stocks; everything sampled here goes into the training set.
train_stocks = random.sample(stocks, len(stocks) - 20)

train_stocks_path = join(splits_dir, 'train_stocks.txt')

with open(train_stocks_path, 'w') as f:
    f.writelines(train_stock + '\n' for train_stock in train_stocks)
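
The alternative mentioned above, holding out data within each stock rather than holding out whole stocks, could look roughly like the sketch below. The function name `split_time_ids`, the 20% holdout fraction, and the toy `time_id` values are all assumptions for illustration; nothing here is part of the original split.

```python
import random

def split_time_ids(time_ids, holdout_fraction=0.2, seed=0):
    """Split one stock's time_ids into (train, holdout) lists, reproducibly."""
    # Sort first so the result depends only on the seed, not on input order.
    ordered = sorted(time_ids)
    # A local generator leaves the notebook's global random state untouched.
    rng = random.Random(seed)
    n_holdout = int(len(ordered) * holdout_fraction)
    holdout = set(rng.sample(ordered, n_holdout))
    train = [t for t in ordered if t not in holdout]
    return train, sorted(holdout)

# Toy example with made-up time_ids for a single stock.
toy_time_ids = [3, 8, 14, 20, 27, 33, 41, 48, 52, 60]
train_ids, holdout_ids = split_time_ids(toy_time_ids)
```

Calling the function once per stock would give every stock its own train and holdout buckets, so a per-stock model can be evaluated on data it has not seen.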

For now we can just use the first 5 training stocks as toy data to do a bit of exploring with.

In [None]:
with open(train_stocks_path, 'r') as f:
    train_stocks_text = f.read()

train_stocks_text.splitlines()[:5]

The data comes in parquet files, which pandas can read directly. It's possible to read `data/raw/book_train` as a whole and get the data for every stock, but I advise against it: the memory it took almost crashed my computer.

In [None]:
# Read a single stock partition, reusing the directory defined in the first cell.
df = pd.read_parquet(join(book_train_dir, 'stock_id=96'))

In [None]:
df
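
To process every stock without loading the whole directory at once, one option is to read the partitions lazily, one at a time. The sketch below is only an illustration: `iter_stock_frames` and its `reader` parameter are hypothetical names, and it assumes each partition directory is named `stock_id=<n>` as in the tree shown earlier.

```python
import os
from os.path import join

import pandas as pd

def iter_stock_frames(book_dir, columns=None, reader=pd.read_parquet):
    """Yield (stock_id, DataFrame) pairs, loading one stock partition at a time."""
    for name in sorted(os.listdir(book_dir)):
        if not name.startswith('stock_id='):
            continue  # skip anything that is not a stock partition
        stock_id = int(name.split('=', 1)[1])
        # Only the requested columns are read, which keeps memory use down further.
        yield stock_id, reader(join(book_dir, name), columns=columns)
```

A loop such as `for stock_id, frame in iter_stock_frames(book_train_dir): ...` then holds only one stock in memory per iteration, provided nothing else keeps a reference to earlier frames.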

Since every row can be uniquely identified by `time_id` and `seconds_in_bucket`, reindexing based on these columns makes analysis a bit nicer.

In [None]:
df = df.set_index(['time_id', 'seconds_in_bucket'])

df

Each time bucket holds roughly 10 minutes of data. `seconds_in_bucket` is not continuous: some seconds are missing.

In [None]:
df.loc[5]
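
If a regular one-row-per-second grid is needed, the gaps can be filled by reindexing. The sketch below assumes that a missing second means the book did not change since the previous row (the notebook does not confirm this) and that a bucket is 600 seconds long, going by the "roughly 10 minutes" above. `fill_missing_seconds` is a hypothetical helper, and the toy prices are made up.

```python
import pandas as pd

def fill_missing_seconds(bucket, bucket_length=600):
    """Reindex one time bucket to every second and forward-fill the gaps."""
    # `bucket` is indexed by seconds_in_bucket only, e.g. the result of df.loc[time_id].
    full_index = pd.RangeIndex(bucket_length, name='seconds_in_bucket')
    return bucket.reindex(full_index).ffill()

# Toy bucket with seconds 2, 3, 4, 6 and 7 missing.
toy = pd.DataFrame(
    {'bid_price1': [1.00, 1.01, 0.99], 'ask_price1': [1.02, 1.03, 1.01]},
    index=pd.Index([0, 1, 5], name='seconds_in_bucket'),
)
filled = fill_missing_seconds(toy, bucket_length=8)
```

Rows before the first observed second stay `NaN` under forward-filling, so a bucket that does not start at second 0 would need separate handling.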

In [None]:
np.sqrt((df['bid_price1'] + df['ask_price1']).groupby(level='time_id').sum())

Realized volatility calculation:

In [None]:
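def realized_volatility(book_data):
    """Return the realized volatility of each time_id from level-1 book data."""
    # Weighted average price: each side's price is weighted by the opposite side's size.
    wap = (
        book_data['bid_price1'] * book_data['ask_size1']
        + book_data['ask_price1'] * book_data['bid_size1']
    ) / (book_data['bid_size1'] + book_data['ask_size1'])
    log_wap = np.log(wap)

    # Diff within each bucket so that no return spans two time_ids.
    log_returns = log_wap.groupby(level='time_id').diff().dropna()

    return np.sqrt((log_returns**2).groupby(level='time_id').sum())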

In [None]:
past_realized_volatility = realized_volatility(df)

past_realized_volatility
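
As a sanity check, the same calculation can be run on a tiny book small enough to verify by hand. The data below is made up: bid and ask prices are set equal so that the weighted average price is simply the price itself.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [(5, 0), (5, 1), (5, 2), (11, 0), (11, 1)],
    names=['time_id', 'seconds_in_bucket'],
)
toy_book = pd.DataFrame({
    'bid_price1': [1.00, 1.01, 1.00, 2.00, 2.00],
    'ask_price1': [1.00, 1.01, 1.00, 2.00, 2.00],
    'bid_size1': [10, 10, 10, 5, 5],
    'ask_size1': [10, 10, 10, 5, 5],
}, index=index)

# Same steps as realized_volatility above, inlined to keep the example standalone.
wap = (
    toy_book['bid_price1'] * toy_book['ask_size1']
    + toy_book['ask_price1'] * toy_book['bid_size1']
) / (toy_book['bid_size1'] + toy_book['ask_size1'])
log_returns = np.log(wap).groupby(level='time_id').diff().dropna()
toy_volatility = np.sqrt((log_returns ** 2).groupby(level='time_id').sum())

# By hand for time_id 5: the two log returns are +log(1.01) and -log(1.01),
# so the realized volatility is sqrt(2) * log(1.01).
expected_5 = np.sqrt(2) * np.log(1.01)
```

The price in bucket 11 never moves, so its realized volatility is zero, and because the diff is taken per bucket, the jump from 1.00 to 2.00 between buckets does not count as a return.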

We can compare this with the target realized volatility from `train.csv`.

In [None]:
train_targets_df = pd.read_csv(train_targets_path)
train_targets_df.index = pd.MultiIndex.from_frame(train_targets_df[['stock_id', 'time_id']])

targets = train_targets_df['target']

# Put both past and target in same dataframe for side-by-side comparison and graphing.
comparison_df = pd.DataFrame({'past': past_realized_volatility, 'target': targets.loc[96]})

comparison_df

A plot of past volatility against target volatility suggests a linear relationship.

In [None]:
_ = comparison_df.plot(x='past', y='target', kind='scatter')
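
One way to put a number on that impression is a least-squares line plus a correlation coefficient. The sketch below runs on synthetic data (the slope, intercept, and noise level are invented), standing in for the `past` and `target` columns of `comparison_df`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for comparison_df['past'] and comparison_df['target'].
past = rng.uniform(0.001, 0.01, size=200)
target = 0.8 * past + 0.0005 + rng.normal(0, 0.0002, size=200)

# np.polyfit returns coefficients from highest degree to lowest.
slope, intercept = np.polyfit(past, target, deg=1)

# Pearson correlation between the two series.
correlation = np.corrcoef(past, target)[0, 1]
```

On the real data, rows with a missing `past` or `target` value would need to be dropped first, for example with `comparison_df.dropna()`.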

#### Next Steps

For a baseline, we should replicate the tutorial method of using past volatility as the predicted target. We should then try a linear model based on past volatility. From there, the best next step is probably feature engineering: we still have the trade data to tap into, and there are other features we can take from the book data. For a high-dimensional approach, we could compute summary statistics over short time intervals throughout each bucket and feed them into something like an SVM.
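
As a rough sketch of the interval idea, the function below computes one realized volatility per sub-interval of each bucket, giving one feature column per interval. `interval_volatility`, the 150-second interval length, and the toy log returns are assumptions for illustration; other summary statistics could be computed the same way.

```python
import numpy as np
import pandas as pd

def interval_volatility(log_returns, interval_seconds=150):
    """Realized volatility per (time_id, interval), one column per interval."""
    idx = log_returns.index
    frame = pd.DataFrame({
        'time_id': idx.get_level_values('time_id'),
        # Integer division assigns each second to its interval number.
        'interval': idx.get_level_values('seconds_in_bucket') // interval_seconds,
        'squared': log_returns.to_numpy() ** 2,
    })
    sums = frame.groupby(['time_id', 'interval'])['squared'].sum()
    # Intervals with no observations get 0.0, i.e. no observed price movement.
    return np.sqrt(sums).unstack('interval', fill_value=0.0)

# Made-up log returns for two buckets.
toy_index = pd.MultiIndex.from_tuples(
    [(5, 1), (5, 2), (5, 151), (11, 3)],
    names=['time_id', 'seconds_in_bucket'],
)
toy_returns = pd.Series([0.003, 0.004, 0.001, 0.002], index=toy_index)
features = interval_volatility(toy_returns)
```

The result has one row per `time_id`, so it can be passed directly as a feature matrix to a model such as an SVM.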

A competitive model would probably recognize the data as time series data, but I want to get some practice with the more basic learning methods before getting into time series analysis.

In [None]:
processed_dir = join(data_dir, 'processed')

os.makedirs(processed_dir, exist_ok=True)

past_realized_volatility.to_hdf(join(processed_dir, 'past_volatility_indexed.h5'), key='past_realized_volatility')
targets.to_hdf(join(processed_dir, 'targets_indexed.h5'), key='targets')