# Optiver Realized Volatility Prediction

## Apply your data science skills to make financial markets better


<a href="https://www.optiver.com/insights/guides/options-volatility/">Volatility</a> is one of the most prominent terms you’ll hear on any trading floor, and for good reason. In financial markets, volatility captures the amount of fluctuation in prices. High volatility is associated with periods of market turbulence and large price swings, while low volatility describes calmer, quieter markets. For trading firms like Optiver, accurately predicting volatility is essential for trading options, whose prices are <a href="https://www.optiver.com/insights/guides/options-pricing/">directly related to the volatility</a> of the underlying product.

## Understanding the data

In [2]:
import pandas as pd
import numpy as np
import seaborn as sn

train.csv - The ground truth values for the training set.

* stock_id - ID code for the stock. Because this is a CSV, the column loads as an integer rather than as a categorical.

* time_id - ID code for the time bucket.

* target - The realized volatility computed over the 10-minute window following the feature data under the same stock_id/time_id. There is no overlap between feature and target data. You can find more information in this <a href="https://www.kaggle.com/jiashenliu/introduction-to-financial-concepts-and-data?scriptVersionId=67183666#Competition-data">tutorial notebook</a>.
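
To make the target definition more concrete, here is a minimal sketch of realized volatility as the square root of the sum of squared log returns within one window. The price path and the helper names `log_return` and `realized_volatility` are assumptions made for illustration; the linked tutorial describes the price series the competition actually uses.

```python
import numpy as np
import pandas as pd


def log_return(prices: pd.Series) -> pd.Series:
    """Log returns between consecutive prices; the first (undefined) value is dropped."""
    return np.log(prices).diff().dropna()


def realized_volatility(log_returns: pd.Series) -> float:
    """Square root of the sum of squared log returns."""
    return float(np.sqrt(np.sum(log_returns ** 2)))


# Hypothetical price path for one stock within a single time window.
prices = pd.Series([100.0, 100.5, 100.2, 100.8, 100.6])
rv = realized_volatility(log_return(prices))
print(rv)
```

A flat price path gives a realized volatility of zero, and larger moves between consecutive prices increase it.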

In [3]:
dftt = pd.read_csv("train.csv")
print(dftt.info())
dftt.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 428932 entries, 0 to 428931
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   stock_id  428932 non-null  int64  
 1   time_id   428932 non-null  int64  
 2   target    428932 non-null  float64
dtypes: float64(1), int64(2)
memory usage: 9.8 MB
None


| | stock_id | time_id | target |
|---|---|---|---|
| 0 | 0 | 5 | 0.004136 |
| 1 | 0 | 11 | 0.001445 |
| 2 | 0 | 16 | 0.002168 |
| 3 | 0 | 31 | 0.002195 |
| 4 | 0 | 62 | 0.001747 |


In [4]:
id_count = pd.DataFrame(dftt.stock_id.value_counts()).sort_index()
print(id_count)

     stock_id
0        3830
1        3830
2        3830
3        3830
4        3830
..        ...
122      3830
123      3830
124      3830
125      3830
126      3830

[112 rows x 1 columns]
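
The truncated printout shows only ten of the 112 stocks. Since 112 × 3830 = 428,960, which is 28 more than the 428,932 rows reported by `info()`, at least one stock must have fewer than 3830 rows. A possible way to list such stocks is sketched below on a toy frame; the data and the names `toy`, `rows_per_stock` and `incomplete` are invented for illustration.

```python
import pandas as pd

# Toy stand-in for train.csv (values are invented for illustration).
toy = pd.DataFrame({
    "stock_id": [0, 0, 0, 1, 1, 1, 2, 2],
    "time_id":  [5, 11, 16, 5, 11, 16, 5, 16],
})

# Number of rows (time windows) per stock, ordered by stock_id.
rows_per_stock = toy["stock_id"].value_counts().sort_index()

# Stocks that have fewer rows than the best-covered stock.
incomplete = rows_per_stock[rows_per_stock < rows_per_stock.max()]
print(incomplete)
```

Applied to the real data, the same filter on `dftt` would show which stocks are missing time windows.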


In [5]:
dftt.tail()

| | stock_id | time_id | target |
|---|---|---|---|
| 428927 | 126 | 32751 | 0.003461 |
| 428928 | 126 | 32753 | 0.003113 |
| 428929 | 126 | 32758 | 0.00407 |
| 428930 | 126 | 32763 | 0.003357 |
| 428931 | 126 | 32767 | 0.00209 |


test.csv - Provides the mapping between the other data files and the submission file. As with other test files, most of the data is only available to your notebook upon submission, with just the first few rows available for download.


* stock_id - Same as above.

* time_id - Same as above.

* row_id - Unique identifier for the submission row. There is one row for each existing time ID/stock ID pair. A time window does not necessarily contain every stock.
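
In the downloadable rows of test.csv, row_id reads as the stock ID and the time ID joined by a hyphen (for example `0-4`). Assuming that pattern holds for every row, which is an inference from those few rows and not a documented guarantee, the key could be built and split as in this sketch. The frame `toy_test` is a hand-typed stand-in.

```python
import pandas as pd

# Hand-typed stand-in mirroring the three downloadable rows of test.csv.
toy_test = pd.DataFrame({"stock_id": [0, 0, 0], "time_id": [4, 32, 34]})

# Assumed pattern: "<stock_id>-<time_id>".
row_id = toy_test["stock_id"].astype(str) + "-" + toy_test["time_id"].astype(str)
print(row_id.tolist())

# Going the other way: recover the two integer IDs from the key.
parts = row_id.str.split("-", expand=True).astype(int)
parts.columns = ["stock_id", "time_id"]
```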

In [6]:
dfts = pd.read_csv("test.csv")
print(dfts.info())
dfts.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   stock_id  3 non-null      int64 
 1   time_id   3 non-null      int64 
 2   row_id    3 non-null      object
dtypes: int64(2), object(1)
memory usage: 200.0+ bytes
None


| | stock_id | time_id | row_id |
|---|---|---|---|
| 0 | 0 | 4 | 0-4 |
| 1 | 0 | 32 | 0-32 |
| 2 | 0 | 34 | 0-34 |


sample_submission.csv - A sample submission file in the correct format.

* row_id - Same as in test.csv.

* target - Same definition as in train.csv. The benchmark uses the median target value from train.csv.
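
A median benchmark of this kind can be sketched in a few lines: take the median of the training targets and assign it to every submission row. The toy frames below are invented for illustration and only imitate the column layout of train.csv and test.csv.

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (values are invented).
toy_train = pd.DataFrame({
    "stock_id": [0, 0, 0, 1, 1],
    "time_id":  [5, 11, 16, 5, 11],
    "target":   [0.004, 0.001, 0.002, 0.003, 0.005],
})
toy_test = pd.DataFrame({"row_id": ["0-4", "0-32", "0-34"]})

# One constant prediction: the median of all training targets.
median_target = toy_train["target"].median()

submission = toy_test[["row_id"]].copy()
submission["target"] = median_target
print(submission)
```

Every row receives the same prediction, which matches the constant target column in the sample file.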

In [12]:
dfs = pd.read_csv("sample_submission.csv")
print(dfs.info())
dfs.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   row_id  3 non-null      object 
 1   target  3 non-null      float64
dtypes: float64(1), object(1)
memory usage: 176.0+ bytes
None


| | row_id | target |
|---|---|---|
| 0 | 0-4 | 0.003048 |
| 1 | 0-32 | 0.003048 |
| 2 | 0-34 | 0.003048 |


## Selecting data

## Working on the models