# LOB Dataset for Projects

This Jupyter notebook downloads the FI-2010 dataset [1], which is used to train and test an AI classifier on limit order book (LOB) data. 
The code is obtained from [2].

### Data:
The FI-2010 dataset is publicly available; interested readers can consult the accompanying paper [1]. The dataset can be downloaded from: https://etsin.fairdata.fi/dataset/73eb48d7-4dbc-4a10-a52a-da745b47a649 

Alternatively, the notebook downloads the data automatically, or the data can be obtained from: 

https://drive.google.com/drive/folders/1Xen3aRid9ZZhFqJRgEMyETNazk02cNmv?usp=sharing

### References:
[1] Ntakaris A, Magris M, Kanniainen J, Gabbouj M, Iosifidis A. Benchmark dataset for mid‐price forecasting of limit order book data with machine learning methods. Journal of Forecasting. 2018 Dec;37(8):852-66. https://arxiv.org/abs/1705.03233

[2] Zhang Z, Zohren S, Roberts S. DeepLOB: Deep convolutional neural networks for limit order books. IEEE Transactions on Signal Processing. 2019 Mar 25;67(11):3001-12. https://arxiv.org/abs/1808.03668

### This notebook runs on PyTorch 1.9.0.

In [1]:
# download the data
import os 
if not os.path.isfile('data.zip'):
    !wget https://raw.githubusercontent.com/zcakhaa/DeepLOB-Deep-Convolutional-Neural-Networks-for-Limit-Order-Books/master/data/data.zip
    !unzip -n data.zip
    print('data downloaded.')
else:
    print('data already existed.')

data already existed.
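
The `!wget` and `!unzip` lines above are IPython shell escapes, so the cell only works inside Jupyter on a machine that has both tools installed. As a rough alternative, the sketch below does the same job with the Python standard library. The helper name `fetch_and_extract` and its parameters are hypothetical and not part of the original notebook. Unlike the cell above, it also checks for missing files when `data.zip` is already present.

```python
import os
import urllib.request
import zipfile

# URL copied from the download cell above.
DATA_URL = ("https://raw.githubusercontent.com/zcakhaa/"
            "DeepLOB-Deep-Convolutional-Neural-Networks-for-Limit-Order-Books/"
            "master/data/data.zip")


def fetch_and_extract(url=DATA_URL, zip_path="data.zip", dest="."):
    """Download the archive if it is missing, then extract files not yet on disk.

    Hypothetical helper: returns the list of archive members that were extracted.
    """
    if not os.path.isfile(zip_path):
        urllib.request.urlretrieve(url, zip_path)

    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            target = os.path.join(dest, member)
            # Skip files that already exist, similar to `unzip -n`.
            if not os.path.exists(target):
                archive.extract(member, dest)
                extracted.append(member)
    return extracted
```

Calling `fetch_and_extract()` with the defaults downloads `data.zip` into the working directory (if absent) and unpacks it there.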


In [2]:
!ls -la

total 1962160
drwxr-xr-x  18 andreacoletta  staff        576 Dec 16 14:01 .
drwxr-xr-x   4 andreacoletta  staff        128 Dec  2 13:34 ..
-rw-r--r--@  1 andreacoletta  staff       6148 Nov  4 15:26 .DS_Store
drwxr-xr-x  15 andreacoletta  staff        480 Dec 16 14:01 .git
drwxr-xr-x   6 andreacoletta  staff        192 Dec  2 13:37 .ipynb_checkpoints
-rw-r--r--   1 andreacoletta  staff        342 Oct 28 12:20 README.md
-rw-r--r--   1 andreacoletta  staff  132259850 Jul 14 22:58 Test_Dst_NoAuction_DecPre_CF_7.txt
-rw-r--r--   1 andreacoletta  staff  124378346 Jul 14 22:58 Test_Dst_NoAuction_DecPre_CF_8.txt
-rw-r--r--   1 andreacoletta  staff   76138106 Jul 14 22:58 Test_Dst_NoAuction_DecPre_CF_9.txt
-rw-r--r--   1 andreacoletta  staff  607324298 Jul 14 22:58 Train_Dst_NoAuction_DecPre_CF_7.txt
drwxr-xr-x  17 andreacoletta  staff        544 Nov  3 23:05 data
-rw-r--r--   1 andreacoletta  staff   562781

In [4]:
# load packages
import pandas as pd
import pickle
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
from tqdm import tqdm 
from sklearn.metrics import accuracy_score, classification_report


# Data preparation

We use the no-auction version of the dataset, normalised with the decimal-precision approach described in [1]. The first seven days are training data and the last three days are testing data. The last 20% of the training set is held out as a validation set to monitor overfitting.  

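The first 40 columns of the FI-2010 dataset hold the ask and bid prices and volumes for 10 levels of the limit order book, and only these 40 features are used in our network. The last 5 columns hold the labels for different prediction horizons. The raw text files store these along rows (one snapshot per column), so the arrays are transposed before the columns are sliced. 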
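
To make the layout concrete, here is a minimal sketch of separating the 40 features from the 5 label columns. It runs on a small synthetic array rather than the real files: the random values, the label values, and the helper name `split_features_labels` are placeholders assumed for illustration. Only the shape convention (149 rows, one snapshot per column) is taken from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n_snapshots = 6  # tiny placeholder; the real files hold far more snapshots

# Synthetic stand-in for one raw file: 144 feature rows followed by 5 label rows.
raw = np.vstack([
    rng.random((144, n_snapshots)),              # placeholder feature values
    rng.integers(1, 4, size=(5, n_snapshots)),   # placeholder label values
])


def split_features_labels(raw, n_features=40, n_labels=5):
    """Transpose to (snapshots, columns), then slice features and labels."""
    samples = raw.T                      # one row per snapshot
    x = samples[:, :n_features]          # first 40 columns: LOB prices and volumes
    y = samples[:, -n_labels:]           # last 5 columns: one label per horizon
    return x, y


x, y = split_features_labels(raw)
print(x.shape, y.shape)  # (6, 40) (6, 5)
```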

In [6]:
# please change the data_path to your local path
# data_path = '/nfs/home/zihaoz/limit_order_book/data'
dec_data = np.loadtxt('Train_Dst_NoAuction_DecPre_CF_7.txt') # 80 training - 20 validation
dec_train = dec_data[:, :int(np.floor(dec_data.shape[1] * 0.8))]
dec_val = dec_data[:, int(np.floor(dec_data.shape[1] * 0.8)):]

dec_test1 = np.loadtxt('Test_Dst_NoAuction_DecPre_CF_7.txt')
dec_test2 = np.loadtxt('Test_Dst_NoAuction_DecPre_CF_8.txt')
dec_test3 = np.loadtxt('Test_Dst_NoAuction_DecPre_CF_9.txt')
dec_test = np.hstack((dec_test1, dec_test2, dec_test3))

print(dec_train.shape, dec_val.shape, dec_test.shape)

(149, 203800) (149, 50950) (149, 139587)
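
The split above slices along the time axis without shuffling, so every validation snapshot comes after every training snapshot. The sketch below wraps the same slicing in a hypothetical helper, `chronological_split`, and checks that property on a toy array; neither the helper name nor the toy data comes from the original notebook.

```python
import numpy as np


def chronological_split(data, train_fraction=0.8):
    """Split a (features, time) array into leading and trailing time blocks."""
    split = int(np.floor(data.shape[1] * train_fraction))
    return data[:, :split], data[:, split:]


# Toy array: 3 rows, 10 time steps; row 0 doubles as a time index (0..9).
toy = np.arange(3 * 10).reshape(3, 10)
train, val = chronological_split(toy)
print(train.shape, val.shape)  # (3, 8) (3, 2)
```

Keeping the order intact matters for time series such as LOB data: a shuffled split would let the model see snapshots recorded after the ones it is validated on.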


In [7]:
# the data cover 10 days: the first 7 days form the training and validation sets,
# and the last 3 days form the test set
x_training_data = dec_train.T[:, :40]
x_validation_data = dec_val.T[:, :40]
x_test_data = dec_test.T[:, :40]

In [8]:
print(x_training_data.shape, x_validation_data.shape, x_test_data.shape)

(203800, 40) (50950, 40) (139587, 40)


In [9]:
x_training_data[0] # 40 --> 10 levels and (ask-price, ask-volume, bid-price, bid-volume)

array([0.2615 , 0.00353, 0.2606 , 0.00326, 0.2618 , 0.002  , 0.2604 ,
       0.00682, 0.2619 , 0.00164, 0.2602 , 0.00786, 0.262  , 0.00532,
       0.26   , 0.00893, 0.2621 , 0.00151, 0.2599 , 0.00159, 0.2623 ,
       0.00837, 0.2595 , 0.001  , 0.2625 , 0.0015 , 0.2593 , 0.00143,
       0.2626 , 0.00787, 0.2591 , 0.00134, 0.2629 , 0.00146, 0.2588 ,
       0.00123, 0.2633 , 0.00311, 0.2579 , 0.00128])

In [10]:
x_training_data[1]  # second time instant

array([0.2615 , 0.00211, 0.2606 , 0.00326, 0.2619 , 0.00164, 0.2604 ,
       0.00682, 0.262  , 0.00138, 0.2602 , 0.00786, 0.2621 , 0.00545,
       0.2601 , 0.00393, 0.2625 , 0.0015 , 0.26   , 0.005  , 0.2626 ,
       0.00787, 0.2599 , 0.00159, 0.2629 , 0.00146, 0.2595 , 0.001  ,
       0.2633 , 0.00311, 0.2593 , 0.00143, 0.2637 , 0.00165, 0.2591 ,
       0.00134, 0.2646 , 0.00138, 0.2588 , 0.00123])

#### Dataset info:

`x` is a 2D array in which each row is a snapshot of the order book with the following structure:
'best-ask price', 'best-ask volume', 'best-bid price', 'best-bid volume', '2-lev ask price', '2-lev ask volume', '2-lev bid price', '2-lev bid volume', ....
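
As an illustration of this layout, the sketch below reshapes one 40-value snapshot into a 10 x 4 table (one row per level) and reads off the best quotes. The values are copied from the `x_training_data[0]` output shown earlier. The mid-price and spread calculations are added here only as an example and are not part of the original notebook.

```python
import numpy as np

# First training snapshot, copied from the notebook output above.
snapshot = np.array([
    0.2615, 0.00353, 0.2606, 0.00326, 0.2618, 0.002,   0.2604, 0.00682,
    0.2619, 0.00164, 0.2602, 0.00786, 0.262,  0.00532, 0.26,   0.00893,
    0.2621, 0.00151, 0.2599, 0.00159, 0.2623, 0.00837, 0.2595, 0.001,
    0.2625, 0.0015,  0.2593, 0.00143, 0.2626, 0.00787, 0.2591, 0.00134,
    0.2629, 0.00146, 0.2588, 0.00123, 0.2633, 0.00311, 0.2579, 0.00128,
])

# One row per level; columns: ask price, ask volume, bid price, bid volume.
levels = snapshot.reshape(10, 4)
ask_price, ask_volume, bid_price, bid_volume = levels.T

# Level 1 (index 0) holds the best quotes on each side.
mid_price = (ask_price[0] + bid_price[0]) / 2
spread = ask_price[0] - bid_price[0]

print(levels.shape)
print("best ask:", ask_price[0], "best bid:", bid_price[0])
```

In this snapshot the ask prices rise and the bid prices fall with depth, which is a quick way to confirm that the columns have been read in the right order.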
