# DeepLOB: Deep Convolutional Neural Networks for Limit Order Books

### Authors: Zihao Zhang, Stefan Zohren and Stephen Roberts
Oxford-Man Institute of Quantitative Finance, Department of Engineering Science, University of Oxford

This Jupyter notebook demonstrates our recent paper [2], published in IEEE Transactions on Signal Processing. We use the FI-2010 dataset [1] and show how the model architecture is constructed.

### Data:
The FI-2010 dataset is publicly available, and interested readers can consult the accompanying paper [1]. The dataset can be downloaded from: https://etsin.fairdata.fi/dataset/73eb48d7-4dbc-4a10-a52a-da745b47a649

Alternatively, the notebook can download the data automatically (see the first code cell), or the data can be obtained from:

https://drive.google.com/drive/folders/1Xen3aRid9ZZhFqJRgEMyETNazk02cNmv?usp=sharing

### References:
[1] Ntakaris A, Magris M, Kanniainen J, Gabbouj M, Iosifidis A. Benchmark dataset for mid‐price forecasting of limit order book data with machine learning methods. Journal of Forecasting. 2018 Dec;37(8):852-66. https://arxiv.org/abs/1705.03233

[2] Zhang Z, Zohren S, Roberts S. DeepLOB: Deep convolutional neural networks for limit order books. IEEE Transactions on Signal Processing. 2019 Mar 25;67(11):3001-12. https://arxiv.org/abs/1808.03668

### This notebook runs on TensorFlow 2.

In [1]:
# # obtain data
# import os 
# if not os.path.isfile('data.zip'):
#     !wget https://raw.githubusercontent.com/zcakhaa/DeepLOB-Deep-Convolutional-Neural-Networks-for-Limit-Order-Books/master/data/data.zip
#     !unzip -n data.zip
#     print('data downloaded.')
# else:
#     print('data already existed.')

In [1]:
# limit gpu memory
import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        # list the logical devices only after memory growth is set on every GPU
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)

1 Physical GPUs, 1 Logical GPUs


In [2]:
# load packages
import pandas as pd
import pickle
import numpy as np
##############
## my imports
##############
from glob import glob
from json import loads
from sklearn.model_selection import train_test_split
from time import time
##############
import random
import keras
from keras import backend as K
from keras.models import load_model, Model
from keras.layers import Flatten, Dense, Dropout, Activation, Input, LSTM, Reshape, Conv2D, MaxPooling2D
from tensorflow.keras.optimizers import Adam
from keras.layers.advanced_activations import LeakyReLU
from keras.utils import np_utils

from sklearn.metrics import classification_report, accuracy_score
import matplotlib.pyplot as plt

# set random seeds
# np.random.seed(int(time()))
# tf.random.set_seed(int(time()))


# Data preparation

We use the no-auction version of the dataset, normalised with the decimal-precision approach described in [1]. The first 40 columns of the FI-2010 dataset hold 10 levels of ask and bid information for a limit order book, and these 40 features are the only inputs to our network. The last 5 columns of the FI-2010 dataset are the labels for the different prediction horizons.
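
As a quick illustration of this layout, the sketch below slices a synthetic array the same way `prepare_x` and `get_label` do. The array is random stand-in data; the row count of 149 and the sample count of 500 are assumptions chosen for illustration, not values read from the real files.

```python
import numpy as np

# Synthetic stand-in for one data file: rows are features/labels, columns are time steps.
# n_rows and n_samples are illustrative assumptions, not read from the real data.
n_rows, n_samples = 149, 500
rng = np.random.default_rng(0)
data = rng.normal(size=(n_rows, n_samples))
# labels drawn from {1, 2, 3}; prepare_x_y subtracts 1 before one-hot encoding
data[-5:, :] = rng.integers(1, 4, size=(5, n_samples))

features = data[:40, :].T   # first 40 rows -> (n_samples, 40)
labels = data[-5:, :].T     # last 5 rows  -> (n_samples, 5), one column per horizon

print(features.shape, labels.shape)   # (500, 40) (500, 5)
```

After the transpose, each row is one time step, which is why the text speaks of the first 40 and last 5 columns.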

In [3]:
def prepare_x(data):
    df1 = data[:40, :].T
    return np.array(df1)

def get_label(data):
    lob = data[-5:, :].T
    return lob

def data_classification(X, Y, T):
    [N, D] = X.shape
    df = np.array(X)
    dY = np.array(Y)
    dataY = dY[T - 1:N]
    dataX = np.zeros((N - T + 1, T, D))
    for i in range(T, N + 1):
        dataX[i - T] = df[i - T:i, :]
    return dataX.reshape(dataX.shape + (1,)), dataY

def prepare_x_y(data, k, T):
    x = prepare_x(data)
    y = get_label(data)
    x, y = data_classification(x, y, T=T)
    y = y[:,k] - 1
    y = np_utils.to_categorical(y, 3)
    return x, y
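
The windowing in `data_classification` can be checked on a toy input. The sketch below repeats the same loop and compares it with NumPy's `sliding_window_view` (available from NumPy 1.20); the helper names `windows_loop` and `windows_strided` are hypothetical and not part of the notebook. Window `i` covers rows `i` to `i + T - 1`, and `data_classification` pairs it with the label of its last row via `dY[T - 1:N]`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def windows_loop(X, T):
    # same indexing as data_classification: window i covers rows i .. i+T-1
    N, D = X.shape
    out = np.zeros((N - T + 1, T, D))
    for i in range(T, N + 1):
        out[i - T] = X[i - T:i, :]
    return out

def windows_strided(X, T):
    # sliding_window_view appends the window axis last: (N-T+1, D, T),
    # so swap the last two axes to get (N-T+1, T, D)
    return sliding_window_view(X, T, axis=0).transpose(0, 2, 1)

X = np.arange(12, dtype=float).reshape(6, 2)   # toy input: N=6 time steps, D=2 features
T = 3
a = windows_loop(X, T)
b = windows_strided(X, T)
print(a.shape)                # (4, 3, 2)
print(np.array_equal(a, b))   # True
```

The strided version returns a read-only view without copying, which may matter for long series; the loop version allocates the full `(N - T + 1, T, D)` array.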

In [4]:
# load every LOB data file; each row holds a coin symbol, a JSON-encoded matrix and a label
files = glob('./data/LOBdata*.csv')
dfs = [pd.read_csv(f) for f in files]
coins = pd.concat(dfs)['coin'].unique().tolist()
coinseperate = {}
coinslabel = {}
for coin in coins:
    # first row for this coin in every file that contains it
    rows = [df[df['coin'] == coin].iloc[0] for df in dfs if coin in df['coin'].to_numpy()]
    coinslabel[coin] = [row.label for row in rows]
    coinseperate[coin] = [row.matrix for row in rows]

for coin in coins:
    # decode the JSON matrices and add a trailing channel axis
    x = np.stack([loads(m) for m in coinseperate[coin]])
    coinseperate[coin] = x.reshape(x.shape + (1,))
    # one-hot encode the labels: class index = label + 1
    class_index = (np.array(coinslabel[coin]) + 1).astype(int)
    coinslabel[coin] = np.eye(3, dtype=int)[class_index]

keys = coins.copy()
allowed = ['ADA', 'XRP', 'SOL', 'DOGE', 'DOT', 'TRX', 'SHIB', 'AVAX', 'LTC', 'FTT', 'MATIC',
 'LINK', 'UNI', 'XLM', 'NEAR', 'BCH', 'ALGO', 'XMR', 'ETC', 'ATOM', 'VET', 'MANA',
 'HBAR', 'FLOW', 'HNT', 'ICP', 'TUSD', 'XTZ', 'THETA', 'FIL', 'EGLD', 'SAND', 'APE',
 'USDP', 'BTTC', 'EOS', 'ZEC', 'AXS', 'AAVE', 'IOTA', 'MKR', 'XEC', 'GRT']

# for coin in keys:
#     if coin not in allowed:
#         coins.remove(coin)

# coinseperate['CITY'].shape
trainX_CNN = np.concatenate([coinseperate[coin] for coin in coins])
trainY_CNN = np.concatenate([coinslabel[coin] for coin in coins])
# trainX_CNN = trainX_CNN.reshape([trainX_CNN.shape[0]*trainX_CNN.shape[1]]+list(trainX_CNN.shape[2:]))
# trainY_CNN = trainY_CNN.reshape([trainY_CNN.shape[0]*trainY_CNN.shape[1]]+list(trainY_CNN.shape[2:]))

# boolean masks selecting each one-hot class
mask0 = trainY_CNN[:, 0] == 1
mask1 = trainY_CNN[:, 1] == 1
mask2 = trainY_CNN[:, 2] == 1

ex0, lab0 = trainX_CNN[mask0], trainY_CNN[mask0]
ex1, lab1 = trainX_CNN[mask1], trainY_CNN[mask1]
ex2, lab2 = trainX_CNN[mask2], trainY_CNN[mask2]

# undersample class 1 (without replacement) down to the size of class 2
keep = random.sample(range(len(ex1)), len(ex2))
ex1, lab1 = ex1[keep], lab1[keep]

trainX_CNN = np.concatenate([ex0, ex1, ex2])
trainY_CNN = np.concatenate([lab0, lab1, lab2])


n_hiddens = 64
checkpoint_filepath = './model_tensorflow2/weights'
batch_size = 32
decay_epoch = 100

# stratified 80/20 split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    trainX_CNN, trainY_CNN, stratify=trainY_CNN, test_size=0.2, shuffle=True)

trainX_CNN, trainY_CNN = X_train, y_train
testX_CNN, testY_CNN = X_test, y_test


In [5]:
trainX_CNN.shape,testX_CNN.shape

((2350, 100, 40, 1), (588, 100, 40, 1))

In [6]:
# # please change the data_path to your local path
# # data_path = '/nfs/home/zihaoz/limit_order_book/data'

# dec_data = np.loadtxt('Train_Dst_NoAuction_DecPre_CF_7.txt')
# dec_train = dec_data[:, :int(np.floor(dec_data.shape[1] * 0.8))]
# dec_val = dec_data[:, int(np.floor(dec_data.shape[1] * 0.8)):]

# dec_test1 = np.loadtxt('Test_Dst_NoAuction_DecPre_CF_7.txt')
# dec_test2 = np.loadtxt('Test_Dst_NoAuction_DecPre_CF_8.txt')
# dec_test3 = np.loadtxt('Test_Dst_NoAuction_DecPre_CF_9.txt')
# dec_test = np.hstack((dec_test1, dec_test2, dec_test3))

# k = 4 # which prediction horizon
# T = 100 # the length of a single input
# n_hiddens = 64
# checkpoint_filepath = './model_tensorflow2/weights'

# trainX_CNN, trainY_CNN = prepare_x_y(dec_train, k, T)
# valX_CNN, valY_CNN = prepare_x_y(dec_val, k, T)
# testX_CNN, testY_CNN = prepare_x_y(dec_test, k, T)

# print(trainX_CNN.shape, trainY_CNN.shape)
# print(valX_CNN.shape, valY_CNN.shape)
# print(testX_CNN.shape, testY_CNN.shape)

# Model Architecture

See our paper [2] for a detailed discussion of the model architecture.

In [12]:
def create_deeplob(T, NF, number_of_lstm):
    input_lmd = Input(shape=(T, NF, 1))
    
    # build the convolutional block
    conv_first1 = tf.keras.layers.Rescaling(1e-3)(input_lmd)
    conv_first1 = Conv2D(32, (1, 2), activation='gelu',strides=(1, 2))(conv_first1)
    conv_first1 = keras.layers.LeakyReLU(alpha=0.01)(conv_first1)
    conv_first1 = Conv2D(32, (4, 1),activation='gelu', padding='same')(conv_first1)
    conv_first1 = keras.layers.LeakyReLU(alpha=0.01)(conv_first1)
    conv_first1 = Conv2D(32, (4, 1),activation='gelu', padding='same')(conv_first1)
    conv_first1 = keras.layers.LeakyReLU(alpha=0.01)(conv_first1)

    conv_first1 = Conv2D(32, (1, 2),activation='gelu', strides=(1, 2))(conv_first1)
    conv_first1 = keras.layers.LeakyReLU(alpha=0.01)(conv_first1)
    conv_first1 = Conv2D(32, (4, 1),activation='gelu', padding='same')(conv_first1)
    conv_first1 = keras.layers.LeakyReLU(alpha=0.01)(conv_first1)
    conv_first1 = Conv2D(32, (4, 1),activation='gelu', padding='same')(conv_first1)
    conv_first1 = keras.layers.LeakyReLU(alpha=0.01)(conv_first1)

    conv_first1 = Conv2D(32, (1, 10),activation='gelu')(conv_first1)
    conv_first1 = keras.layers.LeakyReLU(alpha=0.01)(conv_first1)
    conv_first1 = Conv2D(32, (4, 1),activation='gelu', padding='same')(conv_first1)
    conv_first1 = keras.layers.LeakyReLU(alpha=0.01)(conv_first1)
    conv_first1 = Conv2D(32, (4, 1),activation='gelu', padding='same')(conv_first1)
    conv_first1 = keras.layers.LeakyReLU(alpha=0.01)(conv_first1)
    
    # build the inception module
    convsecond_1 = Conv2D(64, (1, 1),activation='gelu', padding='same')(conv_first1)
    convsecond_1 = keras.layers.LeakyReLU(alpha=0.01)(convsecond_1)
    convsecond_1 = Conv2D(64, (3, 1),activation='gelu', padding='same')(convsecond_1)
    convsecond_1 = keras.layers.LeakyReLU(alpha=0.01)(convsecond_1)

    convsecond_2 = Conv2D(64, (1, 1),activation='gelu', padding='same')(conv_first1)
    convsecond_2 = keras.layers.LeakyReLU(alpha=0.01)(convsecond_2)
    convsecond_2 = Conv2D(64, (5, 1),activation='gelu', padding='same')(convsecond_2)
    convsecond_2 = keras.layers.LeakyReLU(alpha=0.01)(convsecond_2)

    convsecond_3 = MaxPooling2D((3, 1), strides=(1, 1), padding='same')(conv_first1)
    convsecond_3 = Conv2D(64, (1, 1),activation='gelu', padding='same')(convsecond_3)
    convsecond_3 = keras.layers.LeakyReLU(alpha=0.01)(convsecond_3)
    
    convsecond_output = keras.layers.concatenate([convsecond_1, convsecond_2, convsecond_3], axis=3)
    conv_reshape = Reshape((int(convsecond_output.shape[1]), int(convsecond_output.shape[3])))(convsecond_output)
    # training=True keeps dropout active at inference time as well
    conv_reshape = keras.layers.Dropout(0.2, noise_shape=(None, 1, int(conv_reshape.shape[2])))(conv_reshape, training=True)

    # build the last LSTM layer
    conv_lstm = LSTM(number_of_lstm)(conv_reshape)

    # build the output layer
    out = Dense(3, activation='softmax')(conv_lstm)
    model = Model(inputs=input_lmd, outputs=out)
    
    # NOTE: this schedule is built but never passed to the optimizer below,
    # so training runs at the fixed learning rate of 1e-6
    lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=0.001,
        decay_steps=(trainX_CNN.shape[0] // batch_size) * 100,
        decay_rate=0.9)
    adam = Adam(learning_rate=1e-6)
    model.compile(optimizer=adam, loss='categorical_crossentropy',
                  metrics=['accuracy', tf.keras.metrics.Precision()])
    return model, adam

deeplob,adam = create_deeplob(trainX_CNN.shape[1], trainX_CNN.shape[2], n_hiddens)
deeplob.summary()

Model: "model_3"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_4 (InputLayer)           [(None, 100, 40, 1)  0           []                               
                                ]                                                                 
                                                                                                  
 rescaling_3 (Rescaling)        (None, 100, 40, 1)   0           ['input_4[0][0]']                
                                                                                                  
 conv2d_42 (Conv2D)             (None, 100, 20, 32)  96          ['rescaling_3[0][0]']            
                                                                                                  
 leaky_re_lu_42 (LeakyReLU)     (None, 100, 20, 32)  0           ['conv2d_42[0][0]']        
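
The feature-axis sizes produced by the convolutional block in `create_deeplob` can be traced by hand. The sketch below is plain arithmetic, not Keras code, and the helper `valid_conv_out` is a hypothetical name. It applies the usual output-size formula for unpadded (`'valid'`) convolutions to the three `(1, k)` layers; the `(4, 1)` layers use `padding='same'` and leave both axes unchanged, so the time axis stays at `T`.

```python
def valid_conv_out(size, kernel, stride=1):
    # output length of a convolution with 'valid' padding
    return (size - kernel) // stride + 1

width = 40                      # NF: number of LOB features per time step
widths = [width]
# (kernel, stride) along the feature axis for the three (1, k) convolutions
for kernel, stride in [(2, 2), (2, 2), (10, 1)]:
    width = valid_conv_out(width, kernel, stride)
    widths.append(width)

print(widths)          # [40, 20, 10, 1]

# three inception branches of 64 filters each, concatenated on the channel axis
channels = 3 * 64
print(channels)        # 192
```

The first step agrees with the `(None, 100, 20, 32)` output shape of the first `Conv2D` layer in the summary, and the final width of 1 is what allows the `Reshape` to a `(T, 192)` sequence for the LSTM.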

# Model Training

In [13]:
%%time

model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=True,
    # no validation data is passed to fit() below, so 'val_loss' would never be
    # logged; monitor the training loss instead
    monitor='loss',
    mode='auto',
    save_best_only=True)
with tf.device('/GPU:0'):
    deeplob.fit(trainX_CNN, trainY_CNN,  # validation_data=(valX_CNN, valY_CNN),
                epochs=150, batch_size=batch_size, verbose=1, shuffle=True,
                callbacks=[model_checkpoint_callback])


Epoch 1/150
Epoch 2/150
Epoch 3/150
Epoch 4/150
Epoch 5/150
Epoch 6/150
Epoch 7/150
Epoch 8/150
Epoch 9/150
Epoch 10/150
Epoch 11/150
Epoch 12/150
Epoch 13/150
Epoch 14/150
Epoch 15/150
Epoch 16/150
Epoch 17/150
Epoch 18/150
Epoch 19/150

KeyboardInterrupt: 

In [9]:
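# predicted and true class indices on the test set
x = np.argmax(deeplob(testX_CNN).numpy(), axis=1)
y = testY_CNN.argmax(axis=1)
# c1 counts samples where the prediction or the true label is class 2;
# c2 counts how many of those are predicted correctly
c1, c2 = 0, 0
for i in range(len(x)):
    if x[i] == 2 or y[i] == 2:
        c1 += 1
        if x[i] == y[i]:
            c2 += 1
        print(x[i], y[i])
c2 / c1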

2 1
2 2
2 2
0 2
0 2
2 1
1 2
2 2
0 2
1 2
2 1
2 1
2 1
2 1
0 2
2 0
2 2
1 2
2 0
2 1
2 0
0 2
2 1
2 0
2 0
2 2
2 1
2 2
2 0
2 1
0 2
1 2
2 2
2 0
2 0
2 2
2 1
2 2
2 1
2 1
2 0
0 2
0 2
2 2
2 2
2 2
2 0
2 2
2 1
1 2
2 2
2 0
2 1
1 2
1 2
2 1
1 2
1 2
1 2
1 2
2 0
2 2
0 2
1 2
1 2
2 1
1 2
1 2
0 2
2 1
2 0
2 0
2 2
2 0
2 0
2 2
2 2
2 0
2 2
2 2
2 0
2 2
2 2
2 2
2 1
2 1
2 0
2 2
1 2
2 1
2 1
2 2
2 2
1 2
2 2
2 0
2 0
2 0
0 2
0 2
0 2
2 1
2 1
2 0
1 2
0 2
2 2
2 2
2 0
2 1
2 0
0 2
2 2
2 2
2 0
2 2
2 2
1 2
2 1
2 2
2 1
2 2
2 1
2 1
2 2
2 0
0 2
1 2
2 1
2 1
2 0
2 2
2 0
2 1
2 2
2 2
2 2
2 0
2 2
0 2
1 2
2 2
2 0
2 1
2 0
2 0
2 2
2 0
2 1
2 2
0 2
2 1
2 0
0 2
2 0
2 1
2 0
1 2
1 2
2 2
2 2
2 0
2 0
2 2
2 2
2 0
2 1
2 0
0 2
2 2
2 1
2 1
2 2
1 2
2 0
2 2
2 1
2 2
2 2
2 1
2 1
1 2
2 0
0 2
2 2
2 1
2 0
2 2
2 1
2 0
1 2
1 2
0 2
2 2
2 2
2 2
0 2
2 2
1 2
2 2
2 1
0 2
1 2
2 0
2 2
2 0
2 1
2 0
1 2
2 0
0 2
2 1
1 2
2 2
2 1
1 2
1 2
2 0
2 1
2 0
2 1
0 2
2 1
0 2
2 2
1 2
2 0
2 1
1 2
0 2
2 0
0 2
2 0
2 1
1 2
0 2
2 2
0 2
2 2
2 1
1 2
2 0
2 1
2 0
2 0
2 1
2 2
2 1
2 1
0 2


0.22105263157894736

# Model Testing

In [10]:
# deeplob.load_weights(checkpoint_filepath)
pred = deeplob.predict(testX_CNN)

In [11]:
print('accuracy_score:', accuracy_score(np.argmax(testY_CNN, axis=1), np.argmax(pred, axis=1)))
print(classification_report(np.argmax(testY_CNN, axis=1), np.argmax(pred, axis=1), digits=5))

accuracy_score: 0.36054421768707484
              precision    recall  f1-score   support

           0    0.36257   0.31313   0.33604       198
           1    0.36434   0.24103   0.29012       195
           2    0.35764   0.52821   0.42650       195

    accuracy                        0.36054       588
   macro avg    0.36152   0.36079   0.35089       588
weighted avg    0.36152   0.36054   0.35081       588
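
The summary rows of the report can be cross-checked from the per-class rows. The sketch below is only a sanity check, not part of the original notebook; it uses the recall, f1 and support values copied from the table above. It recovers the number of correct predictions per class, the overall accuracy and the macro-averaged f1.

```python
# per-class recall, f1 and support copied from the classification report
recall = [0.31313, 0.24103, 0.52821]
f1 = [0.33604, 0.29012, 0.42650]
support = [198, 195, 195]

# correct predictions per class = recall * support (rounded to whole samples)
correct = [round(r * s) for r, s in zip(recall, support)]
accuracy = sum(correct) / sum(support)
macro_f1 = sum(f1) / len(f1)

print(correct)                                  # [62, 47, 103]
print(round(accuracy, 5), round(macro_f1, 5))   # 0.36054 0.35089
```

Both values agree with the `accuracy` and `macro avg` rows of the report.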

