# Realized Volatility Prediction
## 3 Pre-processing (Part 1)

### Table of Contents
1. Introduction and Imports
2. Book Feature Building
3. Trade Feature Building
4. Saving Features

## 3.1 Introduction and Imports

So far we have built a single feature (realized volatility) for one example bucket. In this step of the project, we build more features and apply them to the rest of the dataset by writing a series of functions, each of which computes statistics on a bucket. The generated statistics will be added to train.csv, and finally a train/test split will be performed.

Let's start with the necessary imports and load the same example bucket for reference.

In [1]:
# Standard imports and libraries
import os
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Load the target file and define path prefixes for the book and trade data
target = pd.read_csv('../data/raw/train.csv')
book_path = '../data/raw/book_train.parquet/stock_id='
trade_path = '../data/raw/trade_train.parquet/stock_id='

In [3]:
# Load example bucket for reference
stock_id = 33
time_id = 21834

book_file = pd.read_parquet(book_path + str(stock_id))
book_example = book_file[book_file['time_id'] == time_id].copy()
book_example.head()

Unnamed: 0,time_id,seconds_in_bucket,bid_price1,ask_price1,bid_price2,ask_price2,bid_size1,ask_size1,bid_size2,ask_size2
530254,21834,0,0.999342,1.000554,0.999321,1.000679,5,1,105,3
530255,21834,13,0.999321,1.000554,0.999258,1.000679,105,1,102,3
530256,21834,14,0.999321,1.000554,0.999258,1.000679,105,1,102,3
530257,21834,20,0.999321,1.000554,0.999258,1.000679,106,1,102,3
530258,21834,23,0.999342,1.000554,0.999321,1.000679,148,1,57,3


In [4]:
trade_file = pd.read_parquet(trade_path + str(stock_id))
trade_example = trade_file[trade_file['time_id'] == time_id].copy()
trade_example.head()

Unnamed: 0,time_id,seconds_in_bucket,price,size,order_count
63934,21834,58,0.999225,368,14
63935,21834,59,0.999363,5,1
63936,21834,75,0.998485,2,1
63937,21834,98,0.998861,5,1
63938,21834,120,0.999091,1,1


## 3.2 Book Feature Building

We are now ready to build a series of functions that each take a slice of a dataframe (one bucket) and compute statistics on it. The likely approach is to build time series first (bid/ask spread, WAP, trading volume, etc.) and then extract statistics from them, such as the mean, the standard deviation, and moving averages at different cutoff points.
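As a rough illustration of the "time series first, statistics second" idea, the sketch below summarizes a toy WAP series over the part of the bucket that follows each of several cutoff points. The toy data, the `cutoff_stats` helper, and the cutoff values (150, 300, 450 seconds) are assumptions made for this example and are not part of the project code.

```python
import pandas as pd

# Toy bucket with hypothetical values, not taken from the dataset
toy = pd.DataFrame({
    'seconds_in_bucket': [0, 100, 200, 300, 400, 500],
    'wap_1': [1.000, 1.001, 0.999, 1.002, 1.000, 1.001],
})

def cutoff_stats(bucket, column, cutoffs=(150, 300, 450)):
    """Mean and std of `column` over the rows at or after each cutoff (hypothetical helper)."""
    rows = {}
    for cutoff in cutoffs:
        tail = bucket.loc[bucket['seconds_in_bucket'] >= cutoff, column]
        rows[cutoff] = {'mean': tail.mean(), 'std': tail.std()}
    # One row per cutoff, one column per statistic
    return pd.DataFrame.from_dict(rows, orient='index')

stats = cutoff_stats(toy, 'wap_1')
print(stats)
```

A cutoff that leaves only one row produces a NaN standard deviation, so late cutoffs may need special handling in sparse buckets.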

In [5]:
def custom_series(book):
    """Add derived price and volume series to `book` in place."""
    # Bid/ask spread at each of the two book levels
    book['bid_ask_1'] = book['ask_price1'] - book['bid_price1']
    book['bid_ask_2'] = book['ask_price2'] - book['bid_price2']

    # Weighted average price (WAP): each price is weighted by the size on the opposite side
    book['wap_1'] = (book['bid_price1'] * book['ask_size1'] + book['ask_price1'] * book['bid_size1']) / (book['bid_size1'] + book['ask_size1'])
    book['wap_2'] = (book['bid_price2'] * book['ask_size2'] + book['ask_price2'] * book['bid_size2']) / (book['bid_size2'] + book['ask_size2'])

    # Total volume and bid-minus-ask volume difference across both levels
    book['volume_overall'] = book['bid_size1'] + book['bid_size2'] + book['ask_size1'] + book['ask_size2']
    book['volume_difference'] = (book['bid_size1'] + book['bid_size2']) - (book['ask_size1'] + book['ask_size2'])

In [6]:
custom_series(book_example)
book_example.head()

Unnamed: 0,time_id,seconds_in_bucket,bid_price1,ask_price1,bid_price2,ask_price2,bid_size1,ask_size1,bid_size2,ask_size2,bid_ask_1,bid_ask_2,wap_1,wap_2,volume_overall,volume_difference
530254,21834,0,0.999342,1.000554,0.999321,1.000679,5,1,105,3,0.001212,0.001358,1.000352,1.000641,114,106
530255,21834,13,0.999321,1.000554,0.999258,1.000679,105,1,102,3,0.001233,0.001421,1.000542,1.000638,211,203
530256,21834,14,0.999321,1.000554,0.999258,1.000679,105,1,102,3,0.001233,0.001421,1.000542,1.000638,211,203
530257,21834,20,0.999321,1.000554,0.999258,1.000679,106,1,102,3,0.001233,0.001421,1.000542,1.000638,212,204
530258,21834,23,0.999342,1.000554,0.999321,1.000679,148,1,57,3,0.001212,0.001358,1.000546,1.000611,209,201


In [7]:
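def get_base_stat(book):
    # Mean and standard deviation of every column after time_id and seconds_in_bucket
    functions = ['mean', 'std']
    return book.iloc[:, 2:].agg(functions)

# The function has its own name so that assigning the result does not overwrite it
base_stat = get_base_stat(book_example)
print(base_stat)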

      bid_price1  ask_price1  bid_price2  ask_price2  bid_size1  ask_size1  \
mean    0.998649    0.999989    0.998538    1.000150  43.243243  34.491892   
std     0.000570    0.000617    0.000580    0.000714  57.179639  45.362803   

      bid_size2  ask_size2  bid_ask_1  bid_ask_2     wap_1     wap_2  \
mean  58.572973  15.497297   0.001341   0.001612  0.999349  0.999592   
std   63.675693  27.824396   0.000221   0.000284  0.000623  0.000990   

      volume_overall  volume_difference  
mean      151.805405          51.827027  
std        99.253326          90.212264  


In [8]:
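def get_logdiff_stat(book):
    # Log-difference every feature column; the first row of each column is NaN.
    # Columns with non-positive values (volume_difference can be negative) yield NaN
    # and a RuntimeWarning from np.log.
    book_ld = np.log(book.iloc[:, 2:]).diff()
    functions = ['mean', 'std']
    logdiff_stat = book_ld.agg(functions)

    # Realized volatility: square root of the sum of squared log returns of the WAP
    real_vol_1 = np.sqrt(np.sum(book_ld['wap_1']**2))
    real_vol_2 = np.sqrt(np.sum(book_ld['wap_2']**2))
    return logdiff_stat, real_vol_1, real_vol_2

# The function has its own name so that assigning the result does not overwrite it
logdiff_stat, real_vol_1, real_vol_2 = get_logdiff_stat(book_example)
print(logdiff_stat)
print(real_vol_1)
print(real_vol_2)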

      bid_price1  ask_price1  bid_price2  ask_price2  bid_size1  ask_size1  \
mean   -0.000005   -0.000004   -0.000005   -0.000003   0.004758   0.008747   
std     0.000157    0.000206    0.000171    0.000220   1.306068   1.269269   

      bid_size2  ask_size2  bid_ask_1  bid_ask_2     wap_1     wap_2  \
mean  -0.019323  -0.005971   0.000619   0.000850 -0.000005 -0.000005   
std    1.410670   1.223297   0.150769   0.094247  0.000392  0.000479   

      volume_overall  volume_difference  
mean       -0.009194           0.025635  
std         0.668444           0.813549  
0.00530235206273329
0.006480464494648944



In [9]:
# Generate column names for book features
list_columns = base_stat.columns.to_list()
print(len(list_columns))

mean_list = ['mean_' + name for name in list_columns]
std_list = ['std_' + name for name in list_columns]
ld_mean_list = ['lg_mean_' + name for name in list_columns]
ld_std_list = ['lg_std_' + name for name in list_columns]

list_columns = ['real_vol_1', 'real_vol_2'] + mean_list + std_list + ld_mean_list + ld_std_list
print(list_columns)
print(len(list_columns))

14
['real_vol_1', 'real_vol_2', 'mean_bid_price1', 'mean_ask_price1', 'mean_bid_price2', 'mean_ask_price2', 'mean_bid_size1', 'mean_ask_size1', 'mean_bid_size2', 'mean_ask_size2', 'mean_bid_ask_1', 'mean_bid_ask_2', 'mean_wap_1', 'mean_wap_2', 'mean_volume_overall', 'mean_volume_difference', 'std_bid_price1', 'std_ask_price1', 'std_bid_price2', 'std_ask_price2', 'std_bid_size1', 'std_ask_size1', 'std_bid_size2', 'std_ask_size2', 'std_bid_ask_1', 'std_bid_ask_2', 'std_wap_1', 'std_wap_2', 'std_volume_overall', 'std_volume_difference', 'lg_mean_bid_price1', 'lg_mean_ask_price1', 'lg_mean_bid_price2', 'lg_mean_ask_price2', 'lg_mean_bid_size1', 'lg_mean_ask_size1', 'lg_mean_bid_size2', 'lg_mean_ask_size2', 'lg_mean_bid_ask_1', 'lg_mean_bid_ask_2', 'lg_mean_wap_1', 'lg_mean_wap_2', 'lg_mean_volume_overall', 'lg_mean_volume_difference', 'lg_std_bid_price1', 'lg_std_ask_price1', 'lg_std_bid_price2', 'lg_std_ask_price2', 'lg_std_bid_size1', 'lg_std_ask_size1', 'lg_std_bid_size2', 'lg_std_ask_s

In [10]:
# Flatten a statistics dataframe into a single list: all means first, then all standard deviations
def flatten(stat_df):
    mean_list = stat_df.loc['mean'].tolist()
    std_list = stat_df.loc['std'].tolist()
    return mean_list + std_list

flatten(base_stat) + flatten(logdiff_stat)

[0.9986487030982971,
 0.9999893307685852,
 0.9985381364822388,
 1.0001503229141235,
 43.24324324324324,
 34.49189189189189,
 58.57297297297297,
 15.497297297297298,
 0.0013405977515503764,
 0.0016121042426675558,
 0.9993493205846065,
 0.9995916274674801,
 151.8054054054054,
 51.82702702702703,
 0.0005703636561520398,
 0.0006166412495076656,
 0.000579889107029885,
 0.0007139858789741993,
 57.179639412445184,
 45.362803058150995,
 63.67569262118246,
 27.824395833966722,
 0.00022127307602204382,
 0.0002836256171576679,
 0.000623341012679912,
 0.000990296169962561,
 99.25332637604518,
 90.21226372528005,
 -4.54715473097167e-06,
 -3.74685942006181e-06,
 -4.546926447801525e-06,
 -3.2919190289248945e-06,
 0.004757982268227712,
 0.008746945176272291,
 -0.01932254381244246,
 -0.00597071896015277,
 0.0006190564599819481,
 0.0008495190995745361,
 -4.821228805519202e-06,
 -5.244727742962247e-06,
 -0.009193891362342776,
 0.025635476385161768,
 0.00015702302334830165,
 0.00020640854199882597,
 0.000
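The flattened values come out in the same order as the generated column names (means, standard deviations, then the two log-difference groups), so the two lists can be zipped into labeled features. The sketch below shows this on two-column stand-ins; the `base` and `logdiff` frames and their values are hypothetical, and only the naming scheme and `flatten` follow the cells above.

```python
import pandas as pd

# Hypothetical two-column stand-ins for base_stat and logdiff_stat
base = pd.DataFrame({'wap_1': [1.0, 0.1], 'wap_2': [2.0, 0.2]}, index=['mean', 'std'])
logdiff = pd.DataFrame({'wap_1': [0.01, 0.001], 'wap_2': [0.02, 0.002]}, index=['mean', 'std'])

def flatten(stat_df):
    # All means first, then all standard deviations
    return stat_df.loc['mean'].tolist() + stat_df.loc['std'].tolist()

cols = base.columns.to_list()
names = (['mean_' + c for c in cols] + ['std_' + c for c in cols]
         + ['lg_mean_' + c for c in cols] + ['lg_std_' + c for c in cols])
values = flatten(base) + flatten(logdiff)

# Guard against the names and values drifting out of step
assert len(names) == len(values)
features = pd.Series(values, index=names)
print(features)
```

Checking the two lengths before labeling is a cheap way to catch a column that was added to one list but not the other.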

## 3.3 Trade Feature Building
Here we use only the trade prices: we compute log returns from them and then the realized volatility.
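In symbols, and assuming the usual definitions, with $p_t$ the price of the $t$-th trade in a bucket containing $n$ trades (the symbols $p_t$, $r_t$, $n$ and $\sigma$ are introduced here for illustration):

```latex
r_t = \log\!\left(\frac{p_t}{p_{t-1}}\right), \qquad
\sigma = \sqrt{\sum_{t=2}^{n} r_t^{2}}
```

The sum starts at the second trade because the first trade has no predecessor and therefore no log return.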

In [11]:
def trade_stat(trade):
    log_diff = np.log(trade['price']).diff()
    real_vol = np.sqrt(np.sum(log_diff**2))
    return real_vol

real_vol_trade = trade_stat(trade_example)
print(real_vol_trade)

0.002464603


## 3.4 Saving Features
In the final step of this notebook, we will streamline the transformation process so that it can be readily reused. We will then transform the raw dataset and write the engineered features to a CSV file.
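One possible shape for this step is sketched below: compute a few features per `time_id` bucket, collect them into one dataframe, and write it out with `DataFrame.to_csv`. This is only a sketch under stated assumptions: the `bucket_features` helper, the toy book, the reduced feature set, and the `book_features.csv` file name are all hypothetical.

```python
import os
import tempfile
import numpy as np
import pandas as pd

def bucket_features(bucket):
    """Features for one bucket: realized volatility of WAP, mean/std of the spread."""
    wap = ((bucket['bid_price1'] * bucket['ask_size1']
            + bucket['ask_price1'] * bucket['bid_size1'])
           / (bucket['bid_size1'] + bucket['ask_size1']))
    log_ret = np.log(wap).diff().dropna()
    spread = bucket['ask_price1'] - bucket['bid_price1']
    return pd.Series({
        'real_vol_1': np.sqrt((log_ret ** 2).sum()),
        'mean_bid_ask_1': spread.mean(),
        'std_bid_ask_1': spread.std(),
    })

# Toy book with two buckets (hypothetical values)
book = pd.DataFrame({
    'time_id': [5, 5, 5, 11, 11, 11],
    'bid_price1': [0.999, 0.999, 1.000, 1.001, 1.000, 1.001],
    'ask_price1': [1.001, 1.002, 1.002, 1.003, 1.002, 1.003],
    'bid_size1': [10, 20, 10, 5, 5, 15],
    'ask_size1': [10, 10, 30, 5, 15, 5],
})

# One feature row per bucket, labeled with its time_id
rows = [bucket_features(bucket).rename(time_id)
        for time_id, bucket in book.groupby('time_id')]
features = pd.DataFrame(rows).rename_axis('time_id').reset_index()

# Write to a temporary directory so the sketch leaves no files behind in the project
out_path = os.path.join(tempfile.mkdtemp(), 'book_features.csv')
features.to_csv(out_path, index=False)
print(pd.read_csv(out_path))
```

Keeping the per-bucket logic in one function means the same code can be looped over every stock file before the results are joined to the targets.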

In [None]:
def 

In [12]:
import csv
 
 
# field names 
fields = ['Name', 'Branch', 'Year', 'CGPA'] 
   
# data rows of csv file 
rows = [ ['Nikhil', 'COE', '2', '9.0'], 
         ['Sanchit', 'COE', '2', '9.1'], 
         ['Aditya', 'IT', '2', '9.3'], 
         ['Sagar', 'SE', '1', '9.5'], 
         ['Prateek', 'MCE', '3', '7.8'], 
         ['Sahil', 'EP', '2', '9.1']] 
 
with open('GFG', 'w', newline='') as f:
     
    # using csv.writer method from CSV package
    write = csv.writer(f)
     
    write.writerow(fields)
    write.writerows(rows)