In [1]:
import os
from pathlib import Path
from distutils.dir_util import copy_tree
import random
from random import sample

import pandas as pd
import numpy as np
from tqdm import tqdm

from optiver import Directories

dirs = Directories("../..")
random.seed(0)

# Competition Data

The task in this competition is to predict the realized volatility of a stock over the next 10 minutes, given the preceding 10 minutes of data for that stock. We have access to order book and trade data over a 10 minute window for a variety of stocks, as well as the realized volatility of each stock over the following 10 minutes. This makes the task a regression problem: the input is the order book and trade data, and the output is the predicted volatility over the next 10 minutes. Success in this competition will require mining good features from the input data and choosing a good model to map features to outputs. There are 112 stocks to work with and 3830 data points per stock.

## Explanation of Files

This is an edited version of the explanation given at https://www.kaggle.com/c/optiver-realized-volatility-prediction/data. There are three main files of interest in the competition data.

**book_train.parquet** A parquet file containing order book data for the most competitive buy and sell orders of each stock.
- `stock_id` - ID code for the stock. Not all stock IDs exist in every time bucket. ~~It is expected that the same stock IDs will be present in the testing data, so we can train models that are particular to each stock, rather than trying to create a model that works for any stock.~~ We probably have to train a model that works for any stock, because our model will be tested on real market data.
- `time_id` - ID code for the time bucket. Time IDs are not necessarily sequential but are consistent across all stocks.
- `seconds_in_bucket` - Number of seconds from the start of the bucket, always starting from 0.
- `bid_price[1/2]` - Normalized prices of the most/second most competitive buy level.
- `ask_price[1/2]` - Normalized prices of the most/second most competitive sell level.
- `bid_size[1/2]` - The number of shares on the most/second most competitive buy level.
- `ask_size[1/2]` - The number of shares on the most/second most competitive sell level.

**trade_train.parquet** A parquet file containing data on trades that were actually executed. In the market there are usually more passive buy/sell intention updates (book updates) than actual trades, so one may expect this file to be sparser than the order book.
- `stock_id` - Same as above.
- `time_id` - Same as above.
- `seconds_in_bucket` - Same as above. Note that since trade and book data are taken from the same time window and trade data is generally sparser, this field does not necessarily start from 0.
- `price` - The average price of executed transactions happening in one second. Prices have been normalized and the average has been weighted by the number of shares traded in each transaction.
- `size` - The total number of shares traded.
- `order_count` - The number of unique trade orders taking place.
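
Because seconds without a trade are simply absent from the trade data, it can help to see what a dense version of one bucket looks like. The sketch below is an illustration on made-up values, not part of the competition tooling: it reindexes a toy trade frame onto the full 0 to 599 second grid of a 10 minute bucket, filling volume and order counts with zero and leaving the price missing.

```python
import pandas as pd

# Toy trades for one time bucket; seconds with no trade are simply absent.
toy_trades = pd.DataFrame(
    {
        "price": [1.0012, 1.0014, 1.0011],
        "size": [369, 162, 47],
        "order_count": [12, 10, 6],
    },
    index=pd.Index([2, 4, 9], name="seconds_in_bucket"),
)

# A 10 minute bucket spans seconds 0..599.
full_grid = pd.RangeIndex(600, name="seconds_in_bucket")
dense = toy_trades.reindex(full_grid)

# No trade means zero volume and zero orders; the price stays missing (NaN).
dense = dense.fillna({"size": 0, "order_count": 0})
```

Whether a missing price should stay `NaN` or be forward filled depends on the feature being built, so that choice is left open here.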

**train.csv** The ground truth values for realized volatility.
- `stock_id` - Same as above.
- `time_id` - Same as above.
- `target` - The realized volatility computed over the 10 minute window following the feature data under the same stock/time_id. There is no overlap between feature and target data.
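
For context, the competition's tutorial material defines realized volatility as the square root of the sum of squared log returns of the weighted average price (WAP) at the top book level. The sketch below follows that definition on a toy frame. The helper names `wap` and `realized_volatility` and all values are illustrative assumptions, not code from this notebook.

```python
import numpy as np
import pandas as pd


def wap(book: pd.DataFrame) -> pd.Series:
    """Weighted average price from the most competitive bid/ask level."""
    return (
        book["bid_price1"] * book["ask_size1"]
        + book["ask_price1"] * book["bid_size1"]
    ) / (book["bid_size1"] + book["ask_size1"])


def realized_volatility(prices: pd.Series) -> float:
    """Square root of the sum of squared log returns."""
    log_returns = np.log(prices).diff().dropna()
    return float(np.sqrt((log_returns ** 2).sum()))


# Toy order book snapshots for a single time bucket (values are made up).
toy_book = pd.DataFrame({
    "bid_price1": [1.0000, 1.0002, 1.0001],
    "ask_price1": [1.0004, 1.0006, 1.0005],
    "bid_size1": [100, 50, 80],
    "ask_size1": [100, 150, 20],
})

toy_vol = realized_volatility(wap(toy_book))
```

The `target` column holds this quantity for the 10 minutes after each bucket, so applying the same function to the feature window gives the past volatility, a natural baseline feature.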

## Data Splitting

Portions of the data need to be reserved for testing our most promising models before submission. We will take 20 stocks and make sure that they are not seen at all by the training process, so that we can gauge how well our methods will work on new stocks. Out of the rest of the data, 20% of the (`stock_id`, `time_id`) rows are sampled into a test set for stocks that have already been seen, which works out to roughly 20% of the `time_id`s for each stock. It would also be a good idea to see how the model fares on real market data before the final submission.
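
The two-stage split described above can be sketched on a toy table. This is a minimal illustration rather than the notebook's actual code: the frame `toy` and its values are made up, only one stock is held out instead of 20, and `groupby(...).sample` is used so that each stock contributes exactly 20% of its time buckets.

```python
import pandas as pd

# Toy targets table: 5 stocks x 10 time buckets (values are placeholders).
toy = pd.DataFrame(
    [(s, t, 0.001 * (s + 1)) for s in range(5) for t in range(10)],
    columns=["stock_id", "time_id", "target"],
)

# Stage 1: hold out whole stocks so they are never seen during training.
reserved_ids = pd.Series(toy["stock_id"].unique()).sample(n=1, random_state=0)
reserved = toy[toy["stock_id"].isin(reserved_ids)]
seen = toy[~toy["stock_id"].isin(reserved_ids)]

# Stage 2: within each remaining stock, hold out 20% of its time buckets.
test = seen.groupby("stock_id").sample(frac=0.2, random_state=0)
train = seen.drop(test.index)
```

Sampling per stock guarantees an even share from every stock, while sampling across all rows at once is simpler and gives approximately the same proportions when each stock has many buckets.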

In [2]:
stocks = os.listdir(dirs.raw / "book_train.parquet")
# Sorting the directory names ensures that this will always provide the same split for a given seed.
stocks.sort()

# These stocks are allowed to be seen by the training process.
train_stocks = sample(stocks, len(stocks) - 20)

# We save the split in a plain text file in case we want to commit it and ensure we can replicate the split.
with open(dirs.splits / "train_stocks.txt", 'w') as f:
    # One stock directory name per line, with no trailing newline.
    f.write('\n'.join(train_stocks))


def reserve_stocks(src, train_dest, reserved_dest):
    """Split the stock data found in the source directory into a training directory and reserved directory."""
    os.makedirs(train_dest, exist_ok=True)
    os.makedirs(reserved_dest, exist_ok=True)

    for stock in tqdm(stocks):
        dest = train_dest if stock in train_stocks else reserved_dest
        copy_tree(os.fspath(src / stock), os.fspath(dest / stock))


reserve_stocks(dirs.raw / "book_train.parquet", dirs.interim / "book_train.parquet", dirs.interim / "book_reserved.parquet")
reserve_stocks(dirs.raw / "trade_train.parquet", dirs.interim / "trade_train.parquet", dirs.interim / "trade_reserved.parquet")

100%|██████████████████████████████████████████████████████████| 112/112 [00:14<00:00,  7.75it/s]
100%|██████████████████████████████████████████████████████████| 112/112 [00:03<00:00, 36.62it/s]


The training split for the order book and trade data is located in `data/interim/book_train.parquet` and `data/interim/trade_train.parquet`, respectively. Similar locations are given to the reserved split. This is a different convention from the raw competition data, where `book_train.parquet`, for example, contains all the training data for order books.

### Preprocessing

Before we reserve the testing set for each stock, we process the data to make it easier to index. The parquet data for each stock becomes an HDF5 file, with the book and trade dataframes indexed by `time_id` and `seconds_in_bucket` and the targets indexed by `stock_id` and `time_id`.
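
To show what this indexing buys us, here is a small sketch on a made-up frame (the name `toy_book` and its values are assumptions for illustration). Once `time_id` and `seconds_in_bucket` form a `MultiIndex`, a whole bucket or a single snapshot can be selected directly with `.loc`.

```python
import pandas as pd

# Toy order book rows for two time buckets (values are made up).
toy_book = pd.DataFrame({
    "time_id": [5, 5, 5, 11, 11],
    "seconds_in_bucket": [0, 1, 4, 0, 2],
    "bid_price1": [1.0008, 1.0008, 1.0011, 0.9995, 0.9996],
})

# Move both key columns into a two-level index.
indexed = toy_book.set_index(["time_id", "seconds_in_bucket"])

# All rows of one time bucket, keyed by seconds_in_bucket.
bucket_5 = indexed.loc[5]

# A single snapshot, addressed by (time_id, seconds_in_bucket).
price = indexed.loc[(11, 2), "bid_price1"]
```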

In [3]:
def process_input(src, dest):
    """
    Read all parquet files and reindex them by time_id and seconds_in_bucket before
    converting them to HDF format.
    """
    os.makedirs(dest, exist_ok=True)
    
    for parquet_file in tqdm(os.listdir(src)):
        df = pd.read_parquet(src / parquet_file)
        
        # Index by time_id and seconds_in_bucket; set_index also removes them as columns.
        df = df.set_index(['time_id', 'seconds_in_bucket'])
        
        # The parquet filename format is `stock_id=#`, but we don't want the =
        # character in the filename so we change it to `stock_#`.
        hdf_file = f"stock_{parquet_file[9:]}.h5"
        
        df.to_hdf(dest / hdf_file, key=hdf_file[:-3])


def process_targets(csv_file):
    """Read the CSV file and return a series containing targets and indexed by stock id and time_id."""
    targets = pd.read_csv(csv_file)

    # We decide to turn the dataframe into a series since there's only one column.
    targets = targets.set_index(['stock_id', 'time_id'])["target"]
    
    return targets


process_input(dirs.interim / "book_train.parquet", dirs.interim / "processed" / "book_train")
process_input(dirs.interim / "trade_train.parquet", dirs.interim / "processed" / "trade_train")

# Since we won't need to do anything with the reserved set after indexing it, we can immediately save it
# to the processed directory.
process_input(dirs.interim / "book_reserved.parquet", dirs.processed / "book_reserved")
process_input(dirs.interim / "trade_reserved.parquet", dirs.processed / "trade_reserved")

targets = process_targets(dirs.raw / "train.csv")

100%|████████████████████████████████████████████████████████████| 92/92 [01:21<00:00,  1.12it/s]
100%|████████████████████████████████████████████████████████████| 92/92 [00:04<00:00, 21.15it/s]
100%|████████████████████████████████████████████████████████████| 20/20 [00:15<00:00,  1.25it/s]
100%|████████████████████████████████████████████████████████████| 20/20 [00:01<00:00, 13.46it/s]


After the data has been reindexed, the final split can be made. There will then be three types of files in the `data/processed` directory:

**book_[train/test/reserved]** A directory of HDF files containing dataframes of order book data for each stock. Dataframes are indexed by `time_id` and `seconds_in_bucket`.

**trade_[train/test/reserved]** Same as above but for the trade data of each stock.

**targets_[train/test/reserved]** An HDF file containing a series of realized volatility targets. The series is indexed by `stock_id` and `time_id`.

In [4]:
def parse_stock_id(stock_id: str):
    # Format is stock_id=#, we need to get rid of 'stock_id='
    return int(stock_id[9:])


def split_index(df, index):
    """Return (rows not selected by index, rows selected by index)."""
    split_2 = df[index]
    split_1 = df.drop(split_2.index)
    
    return split_1, split_2


reserved_stock_ids = [parse_stock_id(stock) for stock in stocks if stock not in train_stocks]

train_targets, reserved_targets = split_index(targets, reserved_stock_ids)
train_targets, test_targets = split_index(train_targets, train_targets.sample(frac=0.2, random_state=0).sort_index().index)

train_targets.to_hdf(dirs.processed / "targets_train.h5", key="targets_train")
test_targets.to_hdf(dirs.processed / "targets_test.h5", key="targets_test")
reserved_targets.to_hdf(dirs.processed / "targets_reserved.h5", key="targets_reserved")


def make_test_split(directory, train_directory, test_directory, trade=False):
    """Take 20% of every training dataframe and reserve it in a special testing set."""
    for hdf_file in tqdm(os.listdir(directory)):
        # filename format is stock_#.h5
        stock_id = int(hdf_file[6:-3])
        
        train_df = pd.read_hdf(directory / hdf_file)
        
        test_index = test_targets[stock_id].index
        
        # Trade files may not have every time_id, since it's possible that there were no trades for certain time periods.
        if trade:
            available_time_ids = set(train_df.index.get_level_values("time_id"))
            if len(test_index) != len(available_time_ids):
                test_index = [index for index in test_index if index in available_time_ids]
        
        test_df = train_df.loc[test_index]
        train_df = train_df.drop(test_index)
        
        os.makedirs(train_directory, exist_ok=True)
        os.makedirs(test_directory, exist_ok=True)
        
        df_name = f"stock_{stock_id}"
        train_df.to_hdf(train_directory / hdf_file, key=df_name)
        test_df.to_hdf(test_directory / hdf_file, key=df_name)


make_test_split(dirs.interim / "processed" / "book_train", dirs.processed / "book_train", dirs.processed / "book_test")
make_test_split(dirs.interim / "processed" / "trade_train", dirs.processed / "trade_train", dirs.processed / "trade_test", trade=True)

100%|████████████████████████████████████████████████████████████| 92/92 [04:12<00:00,  2.75s/it]
100%|████████████████████████████████████████████████████████████| 92/92 [00:49<00:00,  1.85it/s]


Here is an example of training data for `stock_id=96`.

In [5]:
stock_96_book = pd.read_hdf(dirs.processed / "book_train" / "stock_96.h5")
stock_96_trades = pd.read_hdf(dirs.processed / "trade_train" / "stock_96.h5")
stock_96_targets = pd.read_hdf(dirs.processed / "targets_train.h5")[96]

print("Order book:")
display(stock_96_book)
print("Trades:")
display(stock_96_trades)
print("Target realized volatility:")
display(stock_96_targets)

Order book:


| time_id | seconds_in_bucket | bid_price1 | ask_price1 | bid_price2 | ask_price2 | bid_size1 | ask_size1 | bid_size2 | ask_size2 |
|---|---|---|---|---|---|---|---|---|---|
| 5 | 0 | 1.000766 | 1.001283 | 1.000747 | 1.001301 | 14 | 1 | 24 | 1 |
| 5 | 1 | 1.000766 | 1.001283 | 1.000747 | 1.001301 | 1 | 1 | 24 | 1 |
| 5 | 2 | 1.000932 | 1.001430 | 1.000914 | 1.001504 | 214 | 21 | 214 | 49 |
| 5 | 4 | 1.001061 | 1.001578 | 1.000969 | 1.001596 | 10 | 1 | 14 | 38 |
| 5 | 5 | 1.001061 | 1.001578 | 1.000747 | 1.001596 | 150 | 1 | 2 | 38 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 32767 | 593 | 0.999157 | 0.999543 | 0.999137 | 0.999563 | 100 | 100 | 100 | 100 |
| 32767 | 594 | 0.999157 | 0.999523 | 0.999137 | 0.999563 | 100 | 201 | 100 | 100 |
| 32767 | 595 | 0.999157 | 0.999503 | 0.999137 | 0.999563 | 100 | 102 | 100 | 100 |
| 32767 | 596 | 0.999137 | 0.999503 | 0.999117 | 0.999563 | 100 | 102 | 28 | 100 |


Trades:


| time_id | seconds_in_bucket | price | size | order_count |
|---|---|---|---|---|
| 5 | 0 | 1.001216 | 369 | 12 |
| 5 | 2 | 1.001410 | 162 | 10 |
| 5 | 4 | 1.001434 | 22 | 4 |
| 5 | 9 | 1.001109 | 47 | 6 |
| 5 | 23 | 1.000990 | 57 | 7 |
| ... | ... | ... | ... | ... |
| 32767 | 588 | 0.999726 | 1 | 1 |
| 32767 | 591 | 0.999604 | 31 | 1 |
| 32767 | 592 | 0.999462 | 14 | 2 |
| 32767 | 593 | 0.999360 | 100 | 2 |


Target realized volatility:


time_id
5        0.005130
11       0.002393
16       0.003085
62       0.002936
72       0.006340
           ...   
32750    0.002326
32751    0.004332
32753    0.002848
32763    0.004951
32767    0.003124
Name: target, Length: 3074, dtype: float64