# Data Preparation

To fully utilize the power of HftBacktest, you need to provide tick-by-tick full order book and trade feed data. Unfortunately, unlike the daily bar data offered by platforms such as Yahoo Finance, tick-by-tick full order book and trade feed data for HFT is not freely available. For cryptocurrencies, however, you can collect the full raw feed yourself.

## Getting started from Binance Futures' raw feed data

You can collect the Binance Futures feed yourself using the collector: https://github.com/nkaz001/hftbacktest/tree/master/collector

In [1]:
import gzip

with gzip.open('usdm/btcusdt_20240718.gz', 'r') as f:
    for i in range(20):
        line = f.readline()
        print(line)

b'1721347020984480942 {"stream":"btcusdt@depth@0ms","data":{"e":"depthUpdate","E":1721347022504,"T":1721347022504,"s":"BTCUSDT","U":4984028367545,"u":4984028370730,"pu":4984028367494,"b":[["1000.00","569.567"],["5000.00","2.024"],["33000.00","3.772"],["62109.60","0.006"],["62531.00","0.022"],["62598.50","0.802"],["63324.10","2.766"],["63726.90","0.006"],["63888.40","0.219"],["63889.30","0.015"],["63889.60","36.365"],["63890.10","0.227"],["63891.00","0.198"],["63931.10","0.028"],["63932.90","0.054"],["63933.30","0.103"],["63939.80","0.100"]],"a":[["63947.40","0.109"],["63951.10","0.000"],["63952.00","0.293"],["63975.60","0.032"],["63989.20","0.024"],["64589.70","0.000"],["64600.00","11.513"],["65229.20","0.002"],["65239.60","0.010"],["69000.00","32.274"]]}}\n'
b'1721347021080832061 {"stream":"btcusdt@bookTicker","data":{"e":"bookTicker","u":4984028370881,"s":"BTCUSDT","b":"63943.90","B":"46.371","a":"63944.00","A":"0.095","T":1721347022507,"E":1721347022507}}\n'
b'1721347021080842761 {"

The first token of each line is the timestamp at which the message was received locally.

<div class="alert alert-info">
    
**Note:** The timestamp is in nanoseconds.
    
</div>
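As a minimal sketch of the line layout described above, the following splits one collected line into its local timestamp and its JSON payload. The helper name `parse_raw_line` is hypothetical and is not part of HftBacktest. The sample is the `bookTicker` line printed earlier.

```python
import json


def parse_raw_line(line: bytes):
    """Split one collected line into (local_ts_ns, payload_dict)."""
    text = line.decode("utf-8").rstrip("\n")
    # The local receipt timestamp and the JSON message are separated by the first space.
    ts_token, _, json_text = text.partition(" ")
    return int(ts_token), json.loads(json_text)


# Sample copied from the bookTicker line printed above.
sample = (
    b'1721347021080832061 {"stream":"btcusdt@bookTicker","data":{"e":"bookTicker",'
    b'"u":4984028370881,"s":"BTCUSDT","b":"63943.90","B":"46.371","a":"63944.00",'
    b'"A":"0.095","T":1721347022507,"E":1721347022507}}\n'
)

local_ts, payload = parse_raw_line(sample)
# The Binance event time "E" is in milliseconds; convert it to nanoseconds
# so that it is comparable with the local timestamp.
exch_ts = payload["data"]["E"] * 1_000_000
print(local_ts, payload["stream"], exch_ts)
```

Putting both timestamps in the same unit makes it easy to spot clock differences between the collecting machine and the exchange before conversion.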

The raw data needs to be converted to the normalized format that can be fed into HftBacktest.  
The `convert` method also attempts to correct timestamps by reordering the rows.

In [2]:
import numpy as np

from hftbacktest.data.utils import binancefutures

data = binancefutures.convert(
    'usdm/btcusdt_20240718.gz',
    combined_stream=True
)

Correcting the latency
local_timestamp is ahead of exch_timestamp by 1522101448
Correcting the event order


The normalized data looks as follows. You can find more details on [Data](https://github.com/nkaz001/hftbacktest/wiki/Data).

In [3]:
import polars as pl

pl.DataFrame(data)

ev,exch_ts,local_ts,px,qty,order_id,ival,fval
u64,i64,i64,f64,f64,u64,i64,f64
3758096385,1721347022504000000,1721347022506582390,1000.0,569.567,0,0,0.0
3758096385,1721347022504000000,1721347022506582390,5000.0,2.024,0,0,0.0
3758096385,1721347022504000000,1721347022506582390,33000.0,3.772,0,0,0.0
3758096385,1721347022504000000,1721347022506582390,62109.6,0.006,0,0,0.0
3758096385,1721347022504000000,1721347022506582390,62531.0,0.022,0,0,0.0
…,…,…,…,…,…,…,…
3489660929,1721347200027000000,1721347200041335362,64049.2,0.0,0,0,0.0
3489660929,1721347200027000000,1721347200041335362,64058.9,1.845,0,0,0.0
3489660929,1721347200027000000,1721347200041335362,64059.0,4.074,0,0,0.0
3758096385,1721347200456000000,1721347200508101604,63959.9,20.327,0,0,0.0
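The `ev` column packs several flags into one integer. The sketch below decodes the two values visible in the table, assuming a layout in which the low byte holds the event kind and bits 31 to 28 hold the exchange, local, buy, and sell flags. The constants are defined here as assumptions and `decode_ev` is a hypothetical helper, so check them against the Data page for your HftBacktest version.

```python
# Assumed flag layout; verify against the Data documentation for your version.
EXCH_EVENT = 1 << 31
LOCAL_EVENT = 1 << 30
BUY_EVENT = 1 << 29
SELL_EVENT = 1 << 28
EVENT_KINDS = {1: "depth", 2: "trade", 3: "depth_clear", 4: "depth_snapshot"}


def decode_ev(ev: int) -> dict:
    """Break an ev value into its kind, side, and processing flags."""
    if ev & BUY_EVENT:
        side = "buy"
    elif ev & SELL_EVENT:
        side = "sell"
    else:
        side = "none"
    return {
        "kind": EVENT_KINDS.get(ev & 0xFF, "unknown"),
        "side": side,
        "exch": bool(ev & EXCH_EVENT),
        "local": bool(ev & LOCAL_EVENT),
    }


# The two ev values that appear in the table above.
for ev in (3758096385, 3489660929):
    print(hex(ev), decode_ev(ev))
```

Under these assumptions, `3758096385` (`0xe0000001`) is a bid-side depth event and `3489660929` (`0xd0000001`) is an ask-side depth event, which agrees with the prices shown in the corresponding rows.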


You can save the data directly to a file by providing `output_filename`.

In [4]:
_ = binancefutures.convert(
    'usdm/btcusdt_20240718.gz',
    output_filename='usdm/btcusdt_20240718.npz',
    combined_stream=True
)

Correcting the latency
local_timestamp is ahead of exch_timestamp by 1522101448
Correcting the event order
Saving to usdm/btcusdt_20240718.npz


## Creating a market depth snapshot

Because the Binance Futures exchange runs 24/7, you need an initial snapshot to obtain a complete (or nearly complete) market depth.  
`collect-binancefutures` fetches the snapshot only when it establishes a connection, so you need to build the initial snapshot from the start of the collected feed data.

In [5]:
from hftbacktest.data.utils.snapshot import create_last_snapshot

# Builds the 20240718 end-of-day snapshot. It will be used as the initial snapshot for 20240719.
data = create_last_snapshot(
    ['usdm/btcusdt_20240718.npz'],
    tick_size=0.1,
    lot_size=0.001
)

Bid levels are shown before ask levels in the snapshot, and levels are sorted from the best price to the farthest price.

In [6]:
pl.DataFrame(data)

ev,exch_ts,local_ts,px,qty,order_id,ival,fval
u64,i64,i64,f64,f64,u64,i64,f64
3758096388,0,0,63959.9,20.393,0,0,0.0
3758096388,0,0,63959.8,2.021,0,0,0.0
3758096388,0,0,63959.7,0.008,0,0,0.0
3758096388,0,0,63959.4,0.075,0,0,0.0
3758096388,0,0,63959.0,0.289,0,0,0.0
…,…,…,…,…,…,…,…
3489660932,0,0,90271.9,0.005,0,0,0.0
3489660932,0,0,91296.9,0.16,0,0,0.0
3489660932,0,0,93131.0,0.175,0,0,0.0
3489660932,0,0,95922.0,0.03,0,0,0.0
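The ordering described above (bids first, each side sorted from the best price outward) can be checked with a small sketch. The side flags are assumed to be bits 29 (buy) and 28 (sell) of `ev`, and `split_snapshot` is a hypothetical helper. The sample uses only the rows visible in the table, so the elided middle rows, including the best ask, are not represented.

```python
BUY_EVENT = 1 << 29   # assumed side flags in the ev field
SELL_EVENT = 1 << 28

# (ev, px) pairs copied from the first and last visible rows of the snapshot.
rows = [
    (3758096388, 63959.9), (3758096388, 63959.8), (3758096388, 63959.7),
    (3489660932, 90271.9), (3489660932, 91296.9), (3489660932, 93131.0),
]


def split_snapshot(rows):
    """Return (bid_prices, ask_prices) after checking the expected ordering."""
    sides = ["bid" if ev & BUY_EVENT else "ask" for ev, _ in rows]
    # All bid rows must come before the first ask row.
    first_ask = sides.index("ask") if "ask" in sides else len(sides)
    if "bid" in sides[first_ask:]:
        raise ValueError("bid row found after an ask row")
    bids = [px for _, px in rows[:first_ask]]
    asks = [px for _, px in rows[first_ask:]]
    # Best-to-farthest means descending bids and ascending asks.
    if bids != sorted(bids, reverse=True) or asks != sorted(asks):
        raise ValueError("levels are not sorted from best to farthest")
    return bids, asks


bids, asks = split_snapshot(rows)
print("best bid:", bids[0])
print("farthest visible ask:", asks[-1])
```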


In [7]:
from hftbacktest.data.utils.snapshot import create_last_snapshot

# Builds the 20240718 end-of-day snapshot and saves it. It will be used as the initial snapshot for 20240719.
_ = create_last_snapshot(
    ['usdm/btcusdt_20240718.npz'],
    tick_size=0.1,
    lot_size=0.001,
    output_snapshot_filename='usdm/btcusdt_20240718_eod.npz'
)

In [8]:
# Converts 20240719 data.
_ = binancefutures.convert(
    'usdm/btcusdt_20240719.gz',
    output_filename='usdm/btcusdt_20240719.npz',
    combined_stream=True
)

# Builds the last snapshot for 20240719, starting from the 20240718 end-of-day snapshot.
# Due to the file size limitation, btcusdt_20240719.npz does not contain data for the entire day.
_ = create_last_snapshot(
    ['usdm/btcusdt_20240719.npz'],
    tick_size=0.1,
    lot_size=0.001,
    output_snapshot_filename='usdm/btcusdt_20240719_last.npz',
    initial_snapshot='usdm/btcusdt_20240718_eod.npz',
)

Correcting the latency
local_timestamp is ahead of exch_timestamp by 1523469656
Correcting the event order
Saving to usdm/btcusdt_20240719.npz


In [9]:
# Builds the last snapshot for 20240719 without the initial snapshot.
_ = create_last_snapshot(
    ['usdm/btcusdt_20240719.npz'],
    tick_size=0.1,
    lot_size=0.001,
    output_snapshot_filename='usdm/btcusdt_20240719_last.npz'
)

# Builds the last snapshot for 20240719 from the 20240718 and 20240719 data, without the initial snapshot.
_ = create_last_snapshot(
    [
        'usdm/btcusdt_20240718.npz',
        'usdm/btcusdt_20240719.npz'
    ],
    tick_size=0.1,
    lot_size=0.001,
    output_snapshot_filename='usdm/btcusdt_20240719_last.npz'
)
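When processing many days, each day's end-of-day snapshot becomes the next day's initial snapshot. The sketch below only builds the list of file names for such a chain; it does not call HftBacktest. `snapshot_plan` is a hypothetical helper, and the `_eod.npz` naming simply follows the cells above.

```python
from datetime import date, timedelta


def snapshot_plan(symbol: str, first_day: date, n_days: int, root: str = "usdm"):
    """Return one dict of file names per day, chained through EOD snapshots."""
    plan = []
    prev_eod = None  # no initial snapshot exists for the first collected day
    for i in range(n_days):
        day = (first_day + timedelta(days=i)).strftime("%Y%m%d")
        eod = f"{root}/{symbol}_{day}_eod.npz"
        plan.append({
            "raw": f"{root}/{symbol}_{day}.gz",
            "converted": f"{root}/{symbol}_{day}.npz",
            "initial_snapshot": prev_eod,
            "output_snapshot_filename": eod,
        })
        prev_eod = eod  # the next day starts from this snapshot
    return plan


for step in snapshot_plan("btcusdt", date(2024, 7, 18), 2):
    print(step)
```

Each step's `raw` and `converted` entries correspond to the input and `output_filename` of `binancefutures.convert`, and the remaining entries correspond to the `initial_snapshot` and `output_snapshot_filename` arguments of `create_last_snapshot` as used above.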

## Getting started from Tardis.dev data

Few vendors offer tick-by-tick full market depth data along with snapshot and trade data, and Tardis.dev is among them.

<div class="alert alert-info">
    
**Note:** Some data may have issues with the exchange timestamp. Ideally, the exchange timestamp should reflect the moment the event occurs at the matching engine. However, some data instead carries the time at which the server sent the data.

</div>

In [10]:
# https://docs.tardis.dev/historical-data-details/binance-futures

# Downloads sample Binance futures BTCUSDT trades
!wget https://datasets.tardis.dev/v1/binance-futures/trades/2020/02/01/BTCUSDT.csv.gz -O BTCUSDT_trades.csv.gz
    
# Downloads sample Binance futures BTCUSDT book
!wget https://datasets.tardis.dev/v1/binance-futures/incremental_book_L2/2020/02/01/BTCUSDT.csv.gz -O BTCUSDT_book.csv.gz

--2024-08-03 11:42:05--  https://datasets.tardis.dev/v1/binance-futures/trades/2020/02/01/BTCUSDT.csv.gz
Resolving datasets.tardis.dev (datasets.tardis.dev)... 104.18.7.96, 104.18.6.96, 2606:4700::6812:660, ...
Connecting to datasets.tardis.dev (datasets.tardis.dev)|104.18.7.96|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3090479 (2.9M) [text/csv]
Saving to: ‘BTCUSDT_trades.csv.gz’


2024-08-03 11:42:08 (7.20 MB/s) - ‘BTCUSDT_trades.csv.gz’ saved [3090479/3090479]

--2024-08-03 11:42:08--  https://datasets.tardis.dev/v1/binance-futures/incremental_book_L2/2020/02/01/BTCUSDT.csv.gz
Resolving datasets.tardis.dev (datasets.tardis.dev)... 104.18.7.96, 104.18.6.96, 2606:4700::6812:660, ...
Connecting to datasets.tardis.dev (datasets.tardis.dev)|104.18.7.96|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 250016849 (238M) [text/csv]
Saving to: ‘BTCUSDT_book.csv.gz’


2024-08-03 11:42:33 (10.3 MB/s) - ‘BTCUSDT_book.csv.gz’ saved [250016849
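The two download URLs above share one pattern, which can be sketched as a small helper. `tardis_dataset_url` is a hypothetical name, and the pattern is inferred only from the two URLs used in this notebook.

```python
from datetime import date


def tardis_dataset_url(exchange: str, data_type: str, day: date, symbol: str) -> str:
    """Build a dataset URL following the pattern of the downloads above."""
    return (
        f"https://datasets.tardis.dev/v1/{exchange}/{data_type}/"
        f"{day:%Y/%m/%d}/{symbol}.csv.gz"
    )


day = date(2020, 2, 1)
trades_url = tardis_dataset_url("binance-futures", "trades", day, "BTCUSDT")
book_url = tardis_dataset_url("binance-futures", "incremental_book_L2", day, "BTCUSDT")
print(trades_url)
print(book_url)
```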

It is recommended to list trade files before depth files. If a depth event is caused by a trade event, placing the trade event before the depth event can produce a more realistic fill during backtesting. Note that input order only matters for ties: when two events have the same timestamp, the sorting process gives priority to the event from the file listed first.
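The tie-breaking behavior can be illustrated with a stable sort. This is a minimal sketch with made-up timestamps, not the library's implementation: because the trade events are concatenated first, a stable sort keeps them ahead of depth events that carry the same timestamp.

```python
# (timestamp, description) pairs; the timestamps are made up for illustration.
trades = [(100, "trade"), (105, "trade")]
depth = [(100, "depth"), (103, "depth"), (105, "depth")]

# Python's sorted() is stable, so for equal timestamps the original
# concatenation order (trades first) is preserved.
merged = sorted(trades + depth, key=lambda event: event[0])

for timestamp, kind in merged:
    print(timestamp, kind)
```

Swapping the concatenation to `depth + trades` would place each depth event ahead of the trade with the same timestamp, which is the situation the recommendation is meant to avoid.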

In [11]:
from hftbacktest.data.utils import tardis

data = tardis.convert(
    ['BTCUSDT_trades.csv.gz', 'BTCUSDT_book.csv.gz']
)

Reading BTCUSDT_trades.csv.gz
Reading BTCUSDT_book.csv.gz
Correcting the latency
Correcting the event order


In [12]:
pl.DataFrame(data)

ev,exch_ts,local_ts,px,qty,order_id,ival,fval
u64,i64,i64,f64,f64,u64,i64,f64
3758096386,1580515202342000000,1580515202497052000,9364.51,1.197,0,0,0.0
3758096386,1580515202342000000,1580515202497346000,9365.67,0.02,0,0,0.0
3758096386,1580515202342000000,1580515202497352000,9365.86,0.01,0,0,0.0
3758096386,1580515202342000000,1580515202497357000,9366.36,0.002,0,0,0.0
3758096386,1580515202342000000,1580515202497363000,9366.36,0.003,0,0,0.0
…,…,…,…,…,…,…,…
3489660929,1580601599812000000,1580601599944404000,9397.79,0.0,0,0,0.0
3758096385,1580601599826000000,1580601599952176000,9354.8,4.07,0,0,0.0
3758096385,1580601599836000000,1580601599962961000,9351.47,3.914,0,0,0.0
3489660929,1580601599836000000,1580601599963461000,9397.78,0.1,0,0,0.0


You can save the data directly to a file by providing `output_filename`. If there are too many rows, you need to increase `buffer_size`.  

In [13]:
_ = tardis.convert(
    ['BTCUSDT_trades.csv.gz', 'BTCUSDT_book.csv.gz'],
    output_filename='btcusdt_20200201.npz',
    buffer_size=200_000_000
)

Reading BTCUSDT_trades.csv.gz
Reading BTCUSDT_book.csv.gz
Correcting the latency
Correcting the event order
Saving to btcusdt_20200201.npz
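To judge whether the default buffer is likely to be too small, one option is to count the data rows in the input files first. The sketch below is a rough guide only: `count_csv_rows` is a hypothetical helper, and it assumes that the number of CSV rows approximates the number of converted rows, which may not hold exactly. The demo runs on a tiny temporary file rather than the real downloads.

```python
import gzip
import os
import tempfile


def count_csv_rows(paths):
    """Count data rows (header excluded) across gzip-compressed CSV files."""
    total = 0
    for path in paths:
        with gzip.open(path, "rt") as f:
            next(f, None)  # skip the header line
            total += sum(1 for _ in f)
    return total


# Demo on a small temporary file with one header line and two data rows.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv.gz")
    with gzip.open(demo_path, "wt") as f:
        f.write("timestamp,price\n1,9364.51\n2,9365.67\n")
    n_rows = count_csv_rows([demo_path])

print(n_rows)
```

For the real files, the count would be compared with the `buffer_size` passed to `tardis.convert`, leaving some headroom.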


Tardis.dev artificially inserts a start-of-day (SOD) snapshot at the beginning of each daily file. If you backtest multiple consecutive days, you do not need a snapshot at the start of every day, and processing it may increase the backtest time. An option of the converter lets you choose whether to include the Tardis.dev SOD snapshot in the converted file.
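To make the SOD snapshot concrete, the sketch below shows what it looks like in a Tardis.dev `incremental_book_L2` CSV, where snapshot rows are marked by the `is_snapshot` column. The sample rows are made up for illustration, and dropping them by hand as shown here is not how the converter handles them; in practice, use the converter's option.

```python
import csv
import io
from itertools import dropwhile

# Made-up rows in the incremental_book_L2 column layout.
sample_csv = """exchange,symbol,timestamp,local_timestamp,is_snapshot,side,price,amount
binance-futures,BTCUSDT,1580515200000000,1580515200100000,true,bid,9364.5,1.0
binance-futures,BTCUSDT,1580515200000000,1580515200100000,true,ask,9365.0,2.0
binance-futures,BTCUSDT,1580515201000000,1580515201100000,false,bid,9364.5,0.5
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Drop only the leading run of snapshot rows (the SOD snapshot).
# Snapshot rows that appear later in the file are kept.
without_sod = list(dropwhile(lambda row: row["is_snapshot"] == "true", rows))

print(len(rows), len(without_sod))
```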