In this document, we collect the data processing functions. The functions themselves are stored in ../proj_mod/data_processing.py, where proj_mod can be imported as a Python module when needed. The following Python code serves as an example of importing proj_mod.

In [7]:
import sys
sys.path.append("../")

import pandas as pd
import numpy as np

import time

import glob

# This forces Jupyter Notebook to reload the module instead of using the cached import
import importlib

from proj_mod import data_processing
importlib.reload(data_processing); #Adding ";" to suppress output.

## Book data harvesting function by stock and time id

We first demonstrate the function book_for_stock(), which loads the book data for a single stock_id and time_id.

In [2]:
df_book_0_5=data_processing.book_for_stock(str_file_path="../raw_data/kaggle_ORVP/book_train.parquet",stock_id=0,time_id=5, create_para=True)

In [3]:
df_book_0_5

Unnamed: 0,time_id,seconds_in_bucket,bid_price1,ask_price1,bid_price2,ask_price2,bid_size1,ask_size1,bid_size2,ask_size2,stock_id,wap,log_return
0,5,1,1.001422,1.002301,1.001370,1.002353,3,100,2,100,0,1.001448,0.000014
1,5,5,1.001422,1.002301,1.001370,1.002405,3,100,2,100,0,1.001448,0.000000
2,5,6,1.001422,1.002301,1.001370,1.002405,3,126,2,100,0,1.001443,-0.000005
3,5,7,1.001422,1.002301,1.001370,1.002405,3,126,2,100,0,1.001443,0.000000
4,5,11,1.001422,1.002301,1.001370,1.002405,3,100,2,100,0,1.001448,0.000005
...,...,...,...,...,...,...,...,...,...,...,...,...,...
296,5,585,1.003129,1.003749,1.003025,1.003801,100,3,26,3,0,1.003731,0.000245
297,5,586,1.003129,1.003749,1.002612,1.003801,100,3,2,3,0,1.003731,0.000000
298,5,587,1.003129,1.003749,1.003025,1.003801,100,3,26,3,0,1.003731,0.000000
299,5,588,1.003129,1.003749,1.002612,1.003801,100,3,2,3,0,1.003731,0.000000
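
The wap and log_return columns in the output above can be reproduced approximately with the sketch below. It assumes that wap is the level-1 weighted average price and that log_return is the difference of consecutive log(wap) values. This matches the first rows of the table up to the rounding of the displayed prices, but the helper name add_wap_and_log_return is hypothetical and is not part of proj_mod.

```python
import numpy as np
import pandas as pd

def add_wap_and_log_return(df_book: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: append `wap` and `log_return` to book snapshots."""
    df = df_book.copy()
    # Assumed formula: each level-1 price is weighted by the opposite side's size.
    df["wap"] = (
        df["bid_price1"] * df["ask_size1"] + df["ask_price1"] * df["bid_size1"]
    ) / (df["bid_size1"] + df["ask_size1"])
    # Log return between consecutive snapshots; the first row has no predecessor.
    df["log_return"] = np.log(df["wap"]).diff()
    return df.dropna(subset=["log_return"]).reset_index(drop=True)

# Three snapshots copied from the table above (seconds_in_bucket 5, 6 and 7).
df_demo = pd.DataFrame({
    "seconds_in_bucket": [5, 6, 7],
    "bid_price1": [1.001422, 1.001422, 1.001422],
    "ask_price1": [1.002301, 1.002301, 1.002301],
    "bid_size1": [3, 3, 3],
    "ask_size1": [100, 126, 126],
})
df_demo_out = add_wap_and_log_return(df_demo)
print(df_demo_out[["seconds_in_bucket", "wap", "log_return"]])
```

The first snapshot is dropped because it has no previous wap to compare against, which may be why the table above starts at seconds_in_bucket 1 rather than 0.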


## Calculating RV by stock and time id. (This is not very practical at scale due to speed.)

In the following, we show an example of the realized_vol function.

In [4]:
rv, row= data_processing.realized_vol(df_book_0_5)
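
For reference, here is a minimal sketch of the quantity being computed, assuming realized volatility is the square root of the sum of squared log returns. The name realized_vol_sketch is hypothetical. Unlike data_processing.realized_vol, which returns a pair, this sketch returns only the RV value.

```python
import numpy as np
import pandas as pd

def realized_vol_sketch(df: pd.DataFrame) -> float:
    """Hypothetical stand-in: sqrt of the sum of squared log returns."""
    log_returns = df["log_return"].to_numpy()
    return float(np.sqrt(np.sum(log_returns ** 2)))

# The first five log returns shown in the book table above.
df_demo = pd.DataFrame({"log_return": [0.000014, 0.0, -0.000005, 0.0, 0.000005]})
rv_demo = realized_vol_sketch(df_demo)
print(f"{rv_demo:.6e}")
```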

## Saving the RV values of the current 10-minute buckets to save calculation time for future use.

I created an alternative function to calculate the values en masse, since going through each DataFrame separately is extremely time-consuming.

In [2]:
path_book="../raw_data/kaggle_ORVP/book_train.parquet"

In [3]:
start_time = time.time()
df_rv=data_processing.create_df_RV_by_row_id(path_book)
end_time = time.time()
elapsed_time = end_time - start_time

print(f"Elapsed time: {elapsed_time:.2f} seconds")

Elapsed time: 62.43 seconds


In [4]:
df_rv

Unnamed: 0,time_id,RV,row_id,stock_id
0,5,0.002185,93-5,93
1,11,0.001205,93-11,93
2,16,0.001461,93-16,93
3,31,0.001693,93-31,93
4,62,0.001296,93-62,93
...,...,...,...,...
428927,32751,0.002337,104-32751,104
428928,32753,0.001500,104-32753,104
428929,32758,0.002272,104-32758,104
428930,32763,0.001949,104-32763,104


In [9]:
df_rv.to_csv("../processed_data/RV_by_row_id.csv",index=False)
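
One way to compute all RV values in a single pass is a grouped aggregation, sketched below. This is an illustration under assumptions, not the code of create_df_RV_by_row_id: the name rv_by_row_id_sketch is hypothetical, and the input is assumed to already hold stock_id, time_id, and log_return columns in memory.

```python
import numpy as np
import pandas as pd

def rv_by_row_id_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical vectorised version: one RV per (stock_id, time_id)."""
    out = (
        df.assign(sq_return=df["log_return"] ** 2)
        # sort=False keeps groups in order of first appearance.
        .groupby(["stock_id", "time_id"], sort=False)["sq_return"]
        .sum()
        .pow(0.5)
        .rename("RV")
        .reset_index()
    )
    out["row_id"] = out["stock_id"].astype(str) + "-" + out["time_id"].astype(str)
    # Same column order as the table above.
    return out[["time_id", "RV", "row_id", "stock_id"]]

# Made-up log returns, for illustration only.
df_demo = pd.DataFrame({
    "stock_id": [93, 93, 93, 104, 104],
    "time_id": [5, 5, 11, 5, 5],
    "log_return": [3e-4, 4e-4, 1e-3, 0.0, 2e-4],
})
df_rv_demo = rv_by_row_id_sketch(df_demo)
print(df_rv_demo)
```

Grouping once avoids re-reading and re-filtering the data for every row_id, which is the cost the per-DataFrame approach pays.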

## The calculation of RV values can be done in parallel now (this function also uses less memory after the initial spike):

In [5]:
start_time = time.time()
path_book="../raw_data/kaggle_ORVP/book_train.parquet"
df_rv_parallel=data_processing.create_df_RV_by_row_id_parallel(path_book)
end_time = time.time()
elapsed_time = end_time - start_time

print(f"Elapsed time: {elapsed_time:.2f} seconds")

Elapsed time: 17.24 seconds


In [6]:
df_rv_parallel

Unnamed: 0,time_id,RV,row_id,stock_id
0,5,0.002185,93-5,93
1,11,0.001205,93-11,93
2,16,0.001461,93-16,93
3,31,0.001693,93-31,93
4,62,0.001296,93-62,93
...,...,...,...,...
428927,32751,0.002337,104-32751,104
428928,32753,0.001500,104-32753,104
428929,32758,0.002272,104-32758,104
428930,32763,0.001949,104-32763,104
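
The parallel idea can be sketched as follows: split the work by stock, compute each stock's RV table independently, and concatenate the parts. The names rv_for_one_stock and rv_by_row_id_parallel_sketch are hypothetical, and how create_df_RV_by_row_id_parallel actually splits the work is not shown in this document. Threads are used here only to keep the sketch self-contained. For CPU-bound work, concurrent.futures.ProcessPoolExecutor offers the same map interface.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

def rv_for_one_stock(item):
    """Hypothetical worker: RV per time_id for a single stock."""
    stock_id, df_stock = item
    rv = df_stock.groupby("time_id", sort=False)["log_return"].agg(
        lambda s: float(np.sqrt((s ** 2).sum()))
    )
    out = rv.rename("RV").reset_index()
    out["row_id"] = str(stock_id) + "-" + out["time_id"].astype(str)
    out["stock_id"] = stock_id
    return out

def rv_by_row_id_parallel_sketch(frames: dict, max_workers: int = 4) -> pd.DataFrame:
    """Hypothetical driver: one task per stock; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        parts = list(pool.map(rv_for_one_stock, frames.items()))
    return pd.concat(parts, ignore_index=True)

# Made-up per-stock frames, for illustration only.
frames_demo = {
    93: pd.DataFrame({"time_id": [5, 5, 11], "log_return": [3e-4, 4e-4, 1e-3]}),
    104: pd.DataFrame({"time_id": [5, 5], "log_return": [0.0, 2e-4]}),
}
df_rv_parallel_demo = rv_by_row_id_parallel_sketch(frames_demo)
print(df_rv_parallel_demo)
```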


## Trade data harvesting function by stock and time id. 

The function trade_for_stock() is similar: it loads the trade data for a single stock_id and time_id.

In [24]:
df_trade_0_5=data_processing.trade_for_stock(str_file_path="../raw_data/kaggle_ORVP/trade_train.parquet",stock_id=0,time_id=5)

In [25]:
df_trade_0_5.head()

Unnamed: 0,time_id,seconds_in_bucket,price,size,order_count,stock_id
0,5,21,1.002301,326,12,0
1,5,46,1.002778,128,4,0
2,5,50,1.002818,55,1,0
3,5,57,1.003155,121,5,0
4,5,68,1.003646,4,1,0


## Precalculate the mean and std of price, size, and order count, and the sum of size and order count, from the trade data.

In [8]:
df_trade_vals=data_processing.create_df_trade_vals_by_row_id(str_path="../raw_data/kaggle_ORVP/trade_train.parquet")

In [9]:
df_trade_vals

Unnamed: 0,time_id,price_mean,price_std,size_sum,size_mean,size_std,order_count_sum,order_count_mean,order_count_std,row_id,stock_id
0,5,1.002227,0.001003,35728,140.109804,165.359986,815,3.196078,2.865898,93-5,93
1,11,1.000889,0.000439,23796,226.628571,242.518804,402,3.828571,3.861688,93-11,93
2,16,0.999648,0.000335,20642,231.932584,223.418268,330,3.707865,3.348006,93-16,93
3,31,0.999804,0.000393,12960,196.363636,193.771685,256,3.878788,2.836656,93-31,93
4,62,0.999488,0.000338,5547,84.045455,118.524705,178,2.696970,2.811896,93-62,93
...,...,...,...,...,...,...,...,...,...,...,...
428908,32751,0.999914,0.000487,8089,120.731343,150.944961,216,3.223881,3.024250,104-32751,104
428909,32753,0.999238,0.000271,7782,131.898305,164.677500,221,3.745763,3.826375,104-32753,104
428910,32758,0.999601,0.000453,2804,100.142857,114.605835,74,2.642857,3.188106,104-32758,104
428911,32763,0.999555,0.000570,24618,276.606742,280.412546,429,4.820225,4.187540,104-32763,104


In [57]:
df_trade_vals.to_csv("../processed_data/trade_vals_by_row_id.csv",index=False)
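
The summary columns above can be produced with a pandas named aggregation, as in the sketch below. The name trade_vals_by_row_id_sketch is hypothetical, and the sketch uses pandas' default sample standard deviation (ddof=1). Whether create_df_trade_vals_by_row_id uses the same convention is an assumption.

```python
import pandas as pd

def trade_vals_by_row_id_sketch(df_trade: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical aggregation: trade summaries per (stock_id, time_id)."""
    out = (
        df_trade.groupby(["stock_id", "time_id"], sort=False)
        .agg(
            price_mean=("price", "mean"),
            price_std=("price", "std"),
            size_sum=("size", "sum"),
            size_mean=("size", "mean"),
            size_std=("size", "std"),
            order_count_sum=("order_count", "sum"),
            order_count_mean=("order_count", "mean"),
            order_count_std=("order_count", "std"),
        )
        .reset_index()
    )
    out["row_id"] = out["stock_id"].astype(str) + "-" + out["time_id"].astype(str)
    # Move stock_id to the end to mirror the table above.
    cols = [c for c in out.columns if c != "stock_id"] + ["stock_id"]
    return out[cols]

# The five trades shown by df_trade_0_5.head(); not the full bucket.
df_demo = pd.DataFrame({
    "stock_id": [0] * 5,
    "time_id": [5] * 5,
    "price": [1.002301, 1.002778, 1.002818, 1.003155, 1.003646],
    "size": [326, 128, 55, 121, 4],
    "order_count": [12, 4, 1, 5, 1],
})
df_vals_demo = trade_vals_by_row_id_sketch(df_demo)
print(df_vals_demo.T)
```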

## Time series creation function

For each row id "a-b" (stock id a and time id b), we have the book data. 
We compute the RV over sub-intervals (e.g. seconds_in_bucket in the interval [0,10]) for all disjoint sub-intervals within [0,600] (e.g. [0,10], [11, 21], ...). 
This lets us work around the fact that each row_id has a different number of seconds_in_bucket entries. 
The resulting sequence of RV values can serve as time series data. 

We create a function to build this time series feature. 
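
A minimal sketch of this windowing idea is given below. The name rv_timeseries_sketch and its width and horizon parameters are hypothetical. The window width is left as a parameter because the exact boundaries used by create_RV_timeseries are not shown here, and the sketch assumes equal-width, half-open windows. Windows with no observations get an RV of 0.

```python
import numpy as np
import pandas as pd

def rv_timeseries_sketch(df: pd.DataFrame, width: int = 10, horizon: int = 600) -> np.ndarray:
    """Hypothetical version: RV per fixed-width window of seconds_in_bucket."""
    n_windows = int(np.ceil(horizon / width))
    # Window index of each row; anything at or past the horizon goes to the last window.
    idx = df["seconds_in_bucket"].to_numpy().astype(int) // width
    idx = np.minimum(idx, n_windows - 1)
    squared = df["log_return"].to_numpy() ** 2
    # Sum of squared log returns per window; empty windows stay at zero.
    sums = np.bincount(idx, weights=squared, minlength=n_windows)
    return np.sqrt(sums)

# The first five rows of the book table above.
df_demo = pd.DataFrame({
    "seconds_in_bucket": [1, 5, 6, 7, 11],
    "log_return": [0.000014, 0.0, -0.000005, 0.0, 0.000005],
})
arr_demo = rv_timeseries_sketch(df_demo, width=10, horizon=600)
print(arr_demo[:3])
```

Every row_id then yields an array of the same length regardless of how many snapshots it contains, which is what makes the result usable as a fixed-length time series.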

In [13]:
arr_RV_0_5=data_processing.create_RV_timeseries(df_in=df_book_0_5)

In [14]:
arr_RV_0_5

array([1.49832934e-05, 1.03072451e-05, 1.05685058e-03, 0.00000000e+00,
       8.98304774e-04, 8.92053424e-04, 4.54066889e-04, 9.91255076e-04,
       3.65205270e-05, 3.70139248e-05, 2.99336299e-05, 3.62542885e-04,
       9.60843370e-04, 3.78928830e-04, 8.51440932e-04, 1.09884521e-03,
       5.05653081e-04, 8.49980121e-04, 6.40812165e-04, 4.85923409e-04,
       7.97586512e-04, 4.25454588e-04, 7.69266746e-04, 5.20608145e-04,
       2.68669240e-06, 6.78738335e-04, 6.79358412e-04, 1.47598809e-04,
       4.78524245e-04, 1.65369514e-05, 6.23763457e-04, 7.32826186e-04,
       1.08703048e-03, 6.87840067e-04, 3.12315713e-04, 6.45995291e-06,
       2.41529705e-04, 4.29347021e-04, 3.58229596e-06, 7.15967292e-04,
       1.38924412e-03, 9.61347656e-06, 4.77481691e-04, 3.43513251e-04,
       2.23079881e-04, 0.00000000e+00, 6.32441530e-04, 6.41041735e-04,
       1.42769361e-04, 6.56544153e-05, 5.48786596e-06, 3.36101519e-04,
       5.06175206e-04, 5.92941392e-04, 6.63459680e-04, 6.13845227e-04,
      