# Data Preparation

To take full advantage of HftBacktest, you need tick-by-tick full order book and trade feed data as input. Unfortunately, unlike the daily bar data provided by platforms such as Yahoo Finance, free tick-by-tick full order book and trade feed data for HFT is not available. However, in the case of cryptocurrency, you can collect the full raw feed yourself.

## Getting started from Binance Futures' raw feed data

You can collect the Binance Futures feed yourself using https://github.com/nkaz001/collect-binancefutures

In [1]:
import gzip

with gzip.open('usdm/btcusdt_20230404.dat.gz', 'r') as f:
    for i in range(20):
        line = f.readline()
        print(line)

b'1680652700423575 {"stream":"btcusdt@bookTicker","data":{"e":"bookTicker","u":2710246762461,"s":"BTCUSDT","b":"28145.10","B":"3.868","a":"28145.20","A":"6.887","T":1680652700430,"E":1680652700435}}\n'
b'1680652700441533 {"stream":"btcusdt@trade","data":{"e":"trade","E":1680652700455,"T":1680652700452,"s":"BTCUSDT","t":3535186032,"p":"28145.10","q":"0.002","X":"MARKET","m":true}}\n'
b'1680652700441685 {"stream":"btcusdt@trade","data":{"e":"trade","E":1680652700455,"T":1680652700452,"s":"BTCUSDT","t":3535186033,"p":"28145.10","q":"0.020","X":"MARKET","m":true}}\n'
b'1680652700441725 {"stream":"btcusdt@trade","data":{"e":"trade","E":1680652700455,"T":1680652700452,"s":"BTCUSDT","t":3535186034,"p":"28145.10","q":"0.020","X":"MARKET","m":true}}\n'
b'1680652700442528 {"stream":"btcusdt@trade","data":{"e":"trade","E":1680652700455,"T":1680652700452,"s":"BTCUSDT","t":3535186035,"p":"28145.10","q":"0.008","X":"MARKET","m":true}}\n'
b'1680652700442569 {"stream":"btcusdt@bookTicker","data":{"e":

The first token of each line is the local timestamp at which the message was received.
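As a minimal sketch of that layout, the snippet below splits one raw line (copied from the output above) into the local receive timestamp and the JSON payload. The helper name `parse_raw_line` is hypothetical and is not part of HftBacktest or the collector. The payload field `T` appears to be the exchange-side time in milliseconds, which is an assumption based on its magnitude.

```python
import json


def parse_raw_line(line: bytes):
    """Split a raw collector line into (local_timestamp, message).

    Hypothetical helper for illustration only.
    """
    # The local timestamp and the JSON payload are separated by the first space.
    timestamp_token, payload = line.decode().rstrip("\n").split(" ", 1)
    return int(timestamp_token), json.loads(payload)


sample = (
    b'1680652700441533 {"stream":"btcusdt@trade","data":{"e":"trade",'
    b'"E":1680652700455,"T":1680652700452,"s":"BTCUSDT","t":3535186032,'
    b'"p":"28145.10","q":"0.002","X":"MARKET","m":true}}\n'
)

local_ts, message = parse_raw_line(sample)

# Assumption: "T" is in milliseconds, so scale it to microseconds before comparing.
exch_ts_us = message["data"]["T"] * 1000
raw_latency_us = local_ts - exch_ts_us

print(local_ts, message["stream"], raw_latency_us)
```

For this line the local timestamp is earlier than the exchange timestamp, which suggests clock skew between the collecting machine and the exchange rather than a real negative latency.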

<div class="alert alert-info">
    
**Note:** There are currently two different implementations of the feed data collector: one in Python and another in Rust. The Python implementation records timestamps in microseconds, while the Rust implementation records timestamps in nanoseconds. Therefore, Python HftBacktest examples primarily use microseconds, whereas Rust HftBacktest examples use nanoseconds. Be mindful of the timestamp units.
    
</div>
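Because mixing units silently produces wrong results, a quick sanity check on the magnitude of a timestamp can help. The following is only a heuristic sketch: `guess_timestamp_unit` is a hypothetical helper, and the thresholds assume Unix-epoch timestamps for present-day dates.

```python
def guess_timestamp_unit(timestamp: int) -> str:
    """Guess the unit of a Unix-epoch timestamp from its magnitude.

    Heuristic only: present-day timestamps have about 10 digits in seconds,
    13 in milliseconds, 16 in microseconds and 19 in nanoseconds.
    """
    if timestamp >= 10**17:
        return "ns"
    if timestamp >= 10**14:
        return "us"
    if timestamp >= 10**11:
        return "ms"
    return "s"


# Values taken from the outputs shown in this document.
print(guess_timestamp_unit(1680652700423575))     # local timestamp, Python collector
print(guess_timestamp_unit(1713571200043000064))  # exchange timestamp, Rust collector
print(guess_timestamp_unit(1680652700430))        # "T" field inside the raw payload
```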

The raw data needs to be converted into normalized data that can be fed into HftBacktest. The `convert` method also attempts to correct the timestamps and reorders the rows accordingly.
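The library's exact correction logic is not reproduced here. The sketch below only illustrates one plausible reading of the idea, using made-up timestamps: if any local timestamp precedes its exchange timestamp, shift all local timestamps by a constant offset so that no latency is negative, then order the rows by the corrected local timestamp. Every name in it is hypothetical.

```python
import numpy as np

# Made-up timestamps for illustration (arbitrary integer units).
exch_ts = np.array([100, 100, 110, 120], dtype=np.int64)
local_ts = np.array([95, 97, 130, 118], dtype=np.int64)

# A negative value means the local clock appears to run behind the exchange clock.
latency = local_ts - exch_ts

# Shift every local timestamp by the same amount so the smallest latency becomes zero.
offset = max(0, -int(latency.min()))
local_fixed = local_ts + offset

# A stable sort keeps the original order of rows that share a timestamp.
order = np.argsort(local_fixed, kind="stable")

print(offset, local_fixed.tolist(), order.tolist())
```

A constant shift preserves the relative spacing between local timestamps, so measured latency differences between events stay intact even though the absolute latency is only an estimate.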

### For HftBacktest in Python, use the Python version of the Data Collector

In [2]:
import numpy as np

from hftbacktest.data.utils import binancefutures

data = binancefutures.convert('usdm/btcusdt_20230404.dat.gz')
np.savez_compressed('btcusdt_20230404', data=data)

Correcting the latency
local_timestamp is ahead of exch_timestamp by 18836.0
Correcting the event order
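To read a saved file back, the array is stored under the `data` key used in the call above. The sketch below round-trips two made-up rows through a temporary directory so that it runs without the real data file; the file name is a placeholder. Note that `np.savez_compressed` appends `.npz` when the given name lacks that extension.

```python
import os
import tempfile

import numpy as np

# Two illustrative rows in the six-column layout:
# event, exch_timestamp, local_timestamp, side, price, qty
rows = np.array(
    [
        [2, 1680652700452000, 1680652700460369, -1, 28145.1, 0.002],
        [2, 1680652700462000, 1680652700473746, 1, 28145.2, 0.002],
    ],
    dtype=np.float64,
)

with tempfile.TemporaryDirectory() as tmp_dir:
    base_name = os.path.join(tmp_dir, "example_20230404")  # placeholder name
    np.savez_compressed(base_name, data=rows)

    # NumPy added the ".npz" extension for us.
    saved_path = base_name + ".npz"
    with np.load(saved_path) as archive:
        loaded = archive["data"]

print(loaded.shape)
```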


You can save the data directly to a file by providing `output_filename`.

In [3]:
binancefutures.convert('usdm/btcusdt_20230405.dat.gz', output_filename='btcusdt_20230405', compress=True)

Correcting the latency
local_timestamp is ahead of exch_timestamp by 26932.0
Correcting the event order
Saving to btcusdt_20230405


array([[ 1.00000000e+00,  1.68065280e+15,  1.68065280e+15,
         1.00000000e+00,  2.23000000e+04,  2.78800000e+00],
       [ 1.00000000e+00,  1.68065280e+15,  1.68065280e+15,
         1.00000000e+00,  2.75774000e+04,  0.00000000e+00],
       [ 1.00000000e+00,  1.68065280e+15,  1.68065280e+15,
         1.00000000e+00,  2.80238000e+04,  1.63800000e+00],
       ...,
       [ 1.00000000e+00,  1.68065321e+15,  1.68065321e+15,
        -1.00000000e+00,  2.81499000e+04,  1.53200000e+00],
       [ 1.00000000e+00,  1.68065321e+15,  1.68065321e+15,
        -1.00000000e+00,  2.85725000e+04,  1.83000000e-01],
       [ 1.00000000e+00,  1.68065321e+15,  1.68065321e+15,
        -1.00000000e+00,  2.89844000e+04,  1.00000000e-03]])

### For HftBacktest in Python, use the Rust version of the Data Collector

<div class="alert alert-info">
    
**Note:** The timestamp is in nanoseconds.
    
</div>

In [4]:
import numpy as np

from hftbacktest.data.utils import binancefutures

binancefutures.convert(
    "SOLUSDT_20240420.gz",
    output_filename="SOLUSDT_20240420.npz",
    compress=True,
    timestamp_unit="ns",
    combined_stream=False
)

Correcting the latency
Correcting the event order
Saving to SOLUSDT_20240420.npz


array([[ 1.0000000e+00,  1.7135712e+18,  1.7135712e+18,  1.0000000e+00,
         1.3380900e+02,  1.0000000e+00],
       [ 1.0000000e+00,  1.7135712e+18,  1.7135712e+18,  1.0000000e+00,
         1.3702000e+02,  2.0000000e+00],
       [ 1.0000000e+00,  1.7135712e+18,  1.7135712e+18,  1.0000000e+00,
         1.3739200e+02,  0.0000000e+00],
       ...,
       [ 1.0000000e+00,  1.7136576e+18,  1.7136576e+18, -1.0000000e+00,
         1.5133300e+02,  1.5000000e+01],
       [ 1.0000000e+00,  1.7136576e+18,  1.7136576e+18, -1.0000000e+00,
         1.5133800e+02,  2.0000000e+00],
       [ 1.0000000e+00,  1.7136576e+18,  1.7136576e+18, -1.0000000e+00,
         1.5134900e+02,  2.3000000e+01]])
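One thing to keep in mind with nanosecond timestamps in a floating-point array: a `float64` has a 53-bit significand, so it represents present-day microsecond timestamps exactly but not every nanosecond timestamp. The sketch below checks this with NumPy. The nanosecond input value is an assumption, namely a millisecond exchange timestamp scaled to nanoseconds.

```python
import numpy as np

us_timestamp = 1680652700423575      # microseconds, from the raw feed shown earlier
ns_timestamp = 1713571200043000000   # assumed: a millisecond timestamp scaled to ns

# Round-trip each integer through float64.
us_roundtrip = int(np.float64(us_timestamp))
ns_roundtrip = int(np.float64(ns_timestamp))

# np.spacing gives the gap to the next representable float64 value.
us_gap = float(np.spacing(np.float64(us_timestamp)))
ns_gap = float(np.spacing(np.float64(ns_timestamp)))

print(us_roundtrip == us_timestamp, us_gap)
print(ns_roundtrip - ns_timestamp, ns_gap)
```

At this magnitude neighbouring `float64` values are 256 ns apart, so nanosecond timestamps held in a floating-point array are only accurate to roughly that resolution. Integer timestamp fields avoid the problem.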

The normalized data looks as follows. You can find more details on [Data](https://github.com/nkaz001/hftbacktest/wiki/Data).

In [5]:
import pandas as pd

df = pd.DataFrame(data, columns=['event', 'exch_timestamp', 'local_timestamp', 'side', 'price', 'qty'])
df['event'] = df['event'].astype(int)
df['exch_timestamp'] = df['exch_timestamp'].astype(int)
df['local_timestamp'] = df['local_timestamp'].astype(int)
df['side'] = df['side'].astype(int)
df

|  | event | exch_timestamp | local_timestamp | side | price | qty |
|---|---|---|---|---|---|---|
| 0 | 2 | 1680652700452000 | 1680652700460369 | -1 | 28145.1 | 0.002 |
| 1 | 2 | 1680652700452000 | 1680652700460521 | -1 | 28145.1 | 0.020 |
| 2 | 2 | 1680652700452000 | 1680652700460561 | -1 | 28145.1 | 0.020 |
| 3 | 2 | 1680652700452000 | 1680652700461364 | -1 | 28145.1 | 0.008 |
| 4 | 2 | 1680652700462000 | 1680652700473746 | 1 | 28145.2 | 0.002 |
| ... | ... | ... | ... | ... | ... | ... |
| 71014 | 1 | 1680652799975000 | 1680652799977784 | -1 | 28182.7 | 0.441 |
| 71015 | 1 | 1680652799975000 | 1680652799977784 | -1 | 28186.9 | 0.054 |
| 71016 | 1 | 1680652799975000 | 1680652799977784 | -1 | 28225.5 | 3.213 |
| 71017 | 1 | 1680652799975000 | 1680652799977784 | -1 | 28231.7 | 0.356 |
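Once the data is in a DataFrame, ordinary pandas filtering applies. The sketch below selects trade rows and computes their feed latency from three rows copied from the table. The constant `TRADE_EVENT = 2` is an assumption inferred from the fact that the first rows match the trade messages in the raw feed; consult the Data page linked above for the authoritative event codes.

```python
import pandas as pd

# Assumption inferred from the outputs in this document, not from the library source.
TRADE_EVENT = 2

columns = ["event", "exch_timestamp", "local_timestamp", "side", "price", "qty"]
sample_rows = [
    (2, 1680652700452000, 1680652700460369, -1, 28145.1, 0.002),
    (2, 1680652700462000, 1680652700473746, 1, 28145.2, 0.002),
    (1, 1680652799975000, 1680652799977784, -1, 28182.7, 0.441),
]
sample_df = pd.DataFrame(sample_rows, columns=columns)

# Keep only the trade rows.
trades = sample_df[sample_df["event"] == TRADE_EVENT]

# Feed latency in microseconds: local receive time minus exchange time.
feed_latency_us = (trades["local_timestamp"] - trades["exch_timestamp"]).tolist()

print(len(trades), feed_latency_us)
```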


### For HftBacktest in Rust, use the Rust version of the Data Collector

In [6]:
from hftbacktest.data.utils import binancefutures

binancefutures.convert(
    "SOLUSDT_20240420.gz",
    output_filename="SOLUSDT_20240420.npz",
    compress=True,
    timestamp_unit="ns",
    combined_stream=False,
    structured_array=True
)

Correcting the latency
Correcting the event order
Saving to SOLUSDT_20240420.npz


array([(3758096385, 1713571200043000064, 1713571200045828864, 133.809,  1.),
       (3758096385, 1713571200043000064, 1713571200045828864, 137.02 ,  2.),
       (3758096385, 1713571200043000064, 1713571200045828864, 137.392,  0.),
       ...,
       (3489660929, 1713657599968000000, 1713657599976203008, 151.333, 15.),
       (3489660929, 1713657599968000000, 1713657599976203008, 151.338,  2.),
       (3489660929, 1713657599968000000, 1713657599976203008, 151.349, 23.)],
      dtype=[('ev', '<i8'), ('exch_ts', '<i8'), ('local_ts', '<i8'), ('px', '<f4'), ('qty', '<f4')])
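The structured array has no separate `side` column, and the `ev` values look opaque in decimal. Written in hexadecimal, the values in the output above are `0xE0000001` and `0xD0000001`, which suggests that the low bits carry the event type and the high bits carry flags. The sketch below decodes them under that reading. The flag names and bit positions are assumptions inferred from the outputs in this document, not taken from the library's source.

```python
# Assumed bit layout, inferred from the printed values only.
EXCH_EVENT = 1 << 31   # row is visible on the exchange side
LOCAL_EVENT = 1 << 30  # row is visible on the local side
BUY = 1 << 29
SELL = 1 << 28


def decode_ev(ev: int) -> dict:
    """Split an `ev` value into event type, exchange/local flags and side."""
    if ev & BUY:
        side = 1
    elif ev & SELL:
        side = -1
    else:
        side = 0
    return {
        "event": ev & 0xFF,
        "exch": bool(ev & EXCH_EVENT),
        "local": bool(ev & LOCAL_EVENT),
        "side": side,
    }


for value in (3758096385, 3489660929):
    print(hex(value), decode_ev(value))
```

Under this reading, the first rows of the output are bid-side depth events and the last rows are ask-side depth events, which matches the `side` column of the non-structured output for the same file.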

## Creating a market depth snapshot

Because the Binance Futures exchange runs 24/7, you need an initial snapshot to obtain the (almost) complete market depth. `collect-binancefutures` fetches the snapshot only when it establishes a connection, so you need to build the initial snapshot from the start of the collected feed data.
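Conceptually, building a snapshot means replaying depth updates and keeping the last known quantity at each price level. The following is a generic order book reconstruction sketch, not the library's implementation. The function name, the `(side, price, qty)` input layout, and the convention that a quantity of zero removes a level are all assumptions made for illustration.

```python
def build_snapshot(depth_updates, tick_size):
    """Replay (side, price, qty) updates and return the surviving levels.

    Levels are keyed by integer tick to avoid floating-point key mismatches.
    Returns (bids, asks) as lists of (tick, qty), best price first.
    """
    bids, asks = {}, {}
    for side, price, qty in depth_updates:
        book = bids if side == 1 else asks
        tick = round(price / tick_size)
        if qty == 0:
            # Assumed convention: zero quantity deletes the price level.
            book.pop(tick, None)
        else:
            book[tick] = qty
    best_first_bids = sorted(bids.items(), reverse=True)
    best_first_asks = sorted(asks.items())
    return best_first_bids, best_first_asks


updates = [
    (1, 28145.1, 3.868),   # bid level added
    (-1, 28145.2, 6.887),  # ask level added
    (1, 28145.0, 1.0),     # second bid level added
    (1, 28145.0, 0.0),     # ...and removed again
    (-1, 28145.3, 2.0),    # second ask level added
]
bids, asks = build_snapshot(updates, tick_size=0.01)
print(bids, asks)
```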

### For HftBacktest in Python

In [7]:
from hftbacktest.data.utils import create_last_snapshot

# Builds 20230404 End of Day snapshot. It will be used for the initial snapshot for 20230405.
data = create_last_snapshot('btcusdt_20230404.npz', tick_size=0.01, lot_size=0.001)
np.savez('btcusdt_20230404_eod.npz', data=data)

# Builds 20230405 End of Day snapshot.
# Due to the file size limitation, btcusdt_20230405.npz does not contain data for the entire day.
create_last_snapshot(
    'btcusdt_20230405.npz',
    tick_size=0.01,
    lot_size=0.001,
    initial_snapshot='btcusdt_20230404_eod.npz',
    output_snapshot_filename='btcusdt_20230405_eod',
    compress=True
)

Load btcusdt_20230404.npz
Load btcusdt_20230405.npz


array([[ 4.00000000e+00,  1.68065321e+15, -1.00000000e+00,
         1.00000000e+00,  2.81401000e+04,  8.25100000e+00],
       [ 4.00000000e+00,  1.68065321e+15, -1.00000000e+00,
         1.00000000e+00,  2.81400000e+04,  1.62000000e-01],
       [ 4.00000000e+00,  1.68065321e+15, -1.00000000e+00,
         1.00000000e+00,  2.81399000e+04,  4.00000000e-03],
       ...,
       [ 4.00000000e+00,  1.68065321e+15, -1.00000000e+00,
        -1.00000000e+00,  3.09404800e+05,  2.00000000e-03],
       [ 4.00000000e+00,  1.68065321e+15, -1.00000000e+00,
        -1.00000000e+00,  3.09425600e+05,  7.00000000e-03],
       [ 4.00000000e+00,  1.68065321e+15, -1.00000000e+00,
        -1.00000000e+00,  3.09443200e+05,  5.00000000e-03]])

In [8]:
df = pd.DataFrame(data, columns=['event', 'exch_timestamp', 'local_timestamp', 'side', 'price', 'qty'])
df['event'] = df['event'].astype(int)
df['exch_timestamp'] = df['exch_timestamp'].astype(int)
df['local_timestamp'] = df['local_timestamp'].astype(int)
df['side'] = df['side'].astype(int)
df

|  | event | exch_timestamp | local_timestamp | side | price | qty |
|---|---|---|---|---|---|---|
| 0 | 4 | 1680652799977784 | -1 | 1 | 28155.1 | 0.060 |
| 1 | 4 | 1680652799977784 | -1 | 1 | 28155.0 | 0.004 |
| 2 | 4 | 1680652799977784 | -1 | 1 | 28154.9 | 0.001 |
| 3 | 4 | 1680652799977784 | -1 | 1 | 28154.8 | 0.001 |
| 4 | 4 | 1680652799977784 | -1 | 1 | 28154.7 | 0.002 |
| ... | ... | ... | ... | ... | ... | ... |
| 4092 | 4 | 1680652799977784 | -1 | -1 | 30827.5 | 1.620 |
| 4093 | 4 | 1680652799977784 | -1 | -1 | 31500.0 | 33.077 |
| 4094 | 4 | 1680652799977784 | -1 | -1 | 33500.0 | 11.648 |
| 4095 | 4 | 1680652799977784 | -1 | -1 | 33752.3 | 0.001 |
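When processing many consecutive days, each day's end-of-day snapshot becomes the next day's initial snapshot. The sketch below only plans the file names for such a chain and does not call HftBacktest. `snapshot_plan` is a hypothetical helper, and the naming scheme simply mirrors the file names used in the cell above.

```python
from datetime import date, timedelta


def snapshot_plan(symbol: str, start: date, days: int):
    """Return one dict per day describing which files feed which snapshot."""
    plan = []
    previous_eod = None  # the first day has no initial snapshot
    for day_index in range(days):
        day = start + timedelta(days=day_index)
        tag = f"{symbol}_{day:%Y%m%d}"
        eod_file = f"{tag}_eod.npz"
        plan.append(
            {
                "data": f"{tag}.npz",
                "initial_snapshot": previous_eod,
                "output_snapshot": eod_file,
            }
        )
        previous_eod = eod_file
    return plan


plan = snapshot_plan("btcusdt", date(2023, 4, 4), 2)
for step in plan:
    print(step)
```

Each entry can then be passed to the snapshot-building call shown above, in order, so that no day is processed before the snapshot it depends on exists.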


### For HftBacktest in Rust

In [9]:
from hftbacktest.data import convert_from_struct_arr

# Builds 20240419 End of Day snapshot without the initial snapshot.
create_last_snapshot(
    convert_from_struct_arr(np.load('SOLUSDT_20240419.npz')['data']),
    tick_size=0.001,
    lot_size=1,
    output_snapshot_filename='SOLUSDT_20240419_EOD.npz',
    compress=True,
    structured_array=True
)

array([(2684354564, 1713571199988656384, -1,  142.444, 35.),
       (2684354564, 1713571199988656384, -1,  142.443,  6.),
       (2684354564, 1713571199988656384, -1,  142.442, 10.), ...,
       (2415919108, 1713571199988656384, -1,  500.   , 83.),
       (2415919108, 1713571199988656384, -1,  750.   ,  4.),
       (2415919108, 1713571199988656384, -1, 1150.   ,  1.)],
      dtype=[('ev', '<i8'), ('exch_ts', '<i8'), ('local_ts', '<i8'), ('px', '<f4'), ('qty', '<f4')])

In [10]:
# Builds 20240420 End of Day snapshot.
create_last_snapshot(
    convert_from_struct_arr(np.load('SOLUSDT_20240420.npz')['data']),
    tick_size=0.001,
    lot_size=1,
    initial_snapshot=convert_from_struct_arr(np.load('SOLUSDT_20240419_EOD.npz')['data']),
    output_snapshot_filename='SOLUSDT_20240420_EOD.npz',
    compress=True,
    structured_array=True
)

array([(2684354564, 1713657599976203008, -1,  151.153,  10.),
       (2684354564, 1713657599976203008, -1,  151.15 , 135.),
       (2684354564, 1713657599976203008, -1,  151.148,  14.), ...,
       (2415919108, 1713657599976203008, -1,  749.999,   4.),
       (2415919108, 1713657599976203008, -1, 1000.   ,   2.),
       (2415919108, 1713657599976203008, -1, 5000.   ,   3.)],
      dtype=[('ev', '<i8'), ('exch_ts', '<i8'), ('local_ts', '<i8'), ('px', '<f4'), ('qty', '<f4')])

## Getting started from Tardis.dev data

Few vendors offer tick-by-tick full market depth data along with snapshot and trade data, and Tardis.dev is among them.

In [11]:
# https://docs.tardis.dev/historical-data-details/binance-futures

# Downloads sample Binance futures BTCUSDT trades
!wget https://datasets.tardis.dev/v1/binance-futures/trades/2020/02/01/BTCUSDT.csv.gz -O BTCUSDT_trades.csv.gz
    
# Downloads sample Binance futures BTCUSDT book
!wget https://datasets.tardis.dev/v1/binance-futures/incremental_book_L2/2020/02/01/BTCUSDT.csv.gz -O BTCUSDT_book.csv.gz

--2024-05-19 09:39:04--  https://datasets.tardis.dev/v1/binance-futures/trades/2020/02/01/BTCUSDT.csv.gz
Resolving datasets.tardis.dev (datasets.tardis.dev)... 172.64.147.51, 104.18.40.205, 2606:4700:4400::ac40:9333, ...
Connecting to datasets.tardis.dev (datasets.tardis.dev)|172.64.147.51|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3090479 (2.9M) [text/csv]
Saving to: ‘BTCUSDT_trades.csv.gz’


2024-05-19 09:39:06 (6.02 MB/s) - ‘BTCUSDT_trades.csv.gz’ saved [3090479/3090479]

--2024-05-19 09:39:07--  https://datasets.tardis.dev/v1/binance-futures/incremental_book_L2/2020/02/01/BTCUSDT.csv.gz
Resolving datasets.tardis.dev (datasets.tardis.dev)... 172.64.147.51, 104.18.40.205, 2606:4700:4400::6812:28cd, ...
Connecting to datasets.tardis.dev (datasets.tardis.dev)|172.64.147.51|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 250016849 (238M) [text/csv]
Saving to: ‘BTCUSDT_book.csv.gz’


2024-05-19 09:39:20 (19.3 MB/s) - ‘BTCUSDT_book.
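The two download URLs above share one pattern: exchange, data type, date, and symbol. The sketch below assembles such a URL from its parts. `tardis_dataset_url` is a hypothetical helper, the pattern is inferred only from the two URLs in this document, and whether a given date is downloadable without an API key is not checked here.

```python
from datetime import date


def tardis_dataset_url(exchange: str, data_type: str, day: date, symbol: str) -> str:
    """Build a dataset URL following the pattern of the two URLs above."""
    return (
        "https://datasets.tardis.dev/v1/"
        f"{exchange}/{data_type}/{day:%Y/%m/%d}/{symbol}.csv.gz"
    )


trades_url = tardis_dataset_url("binance-futures", "trades", date(2020, 2, 1), "BTCUSDT")
book_url = tardis_dataset_url(
    "binance-futures", "incremental_book_L2", date(2020, 2, 1), "BTCUSDT"
)
print(trades_url)
print(book_url)
```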

### For HftBacktest in Python

In [12]:
from hftbacktest.data.utils import tardis

data = tardis.convert(['BTCUSDT_trades.csv.gz', 'BTCUSDT_book.csv.gz'])
np.savez_compressed('btcusdt_20200201.npz', data=data)

Reading BTCUSDT_trades.csv.gz
Reading BTCUSDT_book.csv.gz
Merging
Correcting the latency
Correcting the event order


You can save the data directly to a file by providing `output_filename`. If there are too many rows, you need to increase `buffer_size`.  

In [13]:
tardis.convert(
    ['BTCUSDT_trades.csv.gz', 'BTCUSDT_book.csv.gz'],
    output_filename='btcusdt_20200201.npz',
    buffer_size=200_000_000,
    compress=True
)

Reading BTCUSDT_trades.csv.gz
Reading BTCUSDT_book.csv.gz
Merging
Correcting the latency
Correcting the event order
Saving to btcusdt_20200201.npz


array([[ 2.0000000e+00,  1.5805152e+15,  1.5805152e+15,  1.0000000e+00,
         9.3645100e+03,  1.1970000e+00],
       [ 2.0000000e+00,  1.5805152e+15,  1.5805152e+15,  1.0000000e+00,
         9.3656700e+03,  2.0000000e-02],
       [ 2.0000000e+00,  1.5805152e+15,  1.5805152e+15,  1.0000000e+00,
         9.3658600e+03,  1.0000000e-02],
       ...,
       [ 1.0000000e+00,  1.5806016e+15,  1.5806016e+15,  1.0000000e+00,
         9.3514700e+03,  3.9140000e+00],
       [ 1.0000000e+00,  1.5806016e+15,  1.5806016e+15, -1.0000000e+00,
         9.3977800e+03,  1.0000000e-01],
       [ 1.0000000e+00,  1.5806016e+15,  1.5806016e+15,  1.0000000e+00,
         9.3481400e+03,  3.9800000e+00]])
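To pick a `buffer_size`, it can help to know how many rows the input files contain. The sketch below counts data rows in a gzip-compressed CSV using only the standard library. It assumes that `buffer_size` is a row capacity, which this document implies but does not state, and the 10% headroom is an arbitrary choice. The sample file and its column names are placeholders written to a temporary directory.

```python
import gzip
import os
import tempfile


def count_csv_rows(path: str) -> int:
    """Count data rows (excluding the header line) in a gzip-compressed CSV."""
    with gzip.open(path, "rt") as csv_file:
        next(csv_file, None)  # skip the header line
        return sum(1 for _ in csv_file)


with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.csv.gz")
    with gzip.open(sample_path, "wt") as csv_file:
        csv_file.write("col_a,col_b\n1,2\n3,4\n5,6\n")  # placeholder content
    total_rows = count_csv_rows(sample_path)

# Arbitrary 10% headroom on top of the counted rows.
suggested_buffer_size = int(total_rows * 1.1) + 1
print(total_rows, suggested_buffer_size)
```

For the real inputs, sum the counts of the trades file and the book file before adding headroom, since both are merged into one array.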

You can also build the snapshot in the same way as described above.

### For HftBacktest in Rust

In [14]:
tardis.convert(
    ['BTCUSDT_trades.csv.gz', 'BTCUSDT_book.csv.gz'],
    output_filename='btcusdt_20200201.npz',
    buffer_size=200_000_000,
    compress=True,
    structured_array=True,
    timestamp_unit='ns'
)

Reading BTCUSDT_trades.csv.gz
Reading BTCUSDT_book.csv.gz
Merging
Correcting the latency
Correcting the event order
Saving to btcusdt_20200201.npz


array([(3758096386, 1580515202342000128, 1580515202497051904, 9364.51, 1.197),
       (3758096386, 1580515202342000128, 1580515202497346048, 9365.67, 0.02 ),
       (3758096386, 1580515202342000128, 1580515202497351936, 9365.86, 0.01 ),
       ...,
       (3758096385, 1580601599836000000, 1580601599962960896, 9351.47, 3.914),
       (3489660929, 1580601599836000000, 1580601599963461120, 9397.78, 0.1  ),
       (3758096385, 1580601599848000000, 1580601599973647104, 9348.14, 3.98 )],
      dtype=[('ev', '<i8'), ('exch_ts', '<i8'), ('local_ts', '<i8'), ('px', '<f4'), ('qty', '<f4')])
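Fields of a structured array are accessed by name, which makes quick checks on the converted data straightforward. The sketch below rebuilds two rows from the output above with the same dtype, converts the nanosecond exchange timestamps to datetimes with pandas, and computes the feed latency. It is a self-contained illustration and does not load the real file.

```python
import numpy as np
import pandas as pd

# Same field layout as the dtype printed above.
event_dtype = np.dtype(
    [("ev", "<i8"), ("exch_ts", "<i8"), ("local_ts", "<i8"), ("px", "<f4"), ("qty", "<f4")]
)

# Two rows copied from the output above.
rows = np.array(
    [
        (3758096386, 1580515202342000128, 1580515202497051904, 9364.51, 1.197),
        (3758096385, 1580601599836000000, 1580601599962960896, 9351.47, 3.914),
    ],
    dtype=event_dtype,
)

# Interpret the exchange timestamps as nanoseconds since the Unix epoch (UTC).
exch_time = pd.to_datetime(rows["exch_ts"], unit="ns")

# Feed latency in nanoseconds, computed on the integer fields.
latency_ns = rows["local_ts"] - rows["exch_ts"]

print(exch_time[0], exch_time[1])
print(latency_ns / 1e6)  # milliseconds
```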