# Compare crypto.com historical data and our realtime archived data

This notebook builds upon previous work. For details on the historical data source, refer to `im_v2/common/notebooks/CmampTask8547_Short_EDA_on_crypto.com_bidask_historical_data.ipynb`.
- The dataset doesn't have a signature yet; we only have snippets of data. The epic to on-board the data is https://github.com/cryptokaizen/cmamp/issues/8520

Realtime archived data comes from our downloaders, dataset signature:
`periodic_daily.airflow.archived_200ms.parquet.bid_ask.futures.v7_4.ccxt.cryptocom.v1_0_0`

In [109]:
%load_ext autoreload
%autoreload 2

import logging
import numpy as np
import glob
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import helpers.hdbg as hdbg
import helpers.henv as henv
import helpers.hprint as hprint
import helpers.hpandas as hpandas
import im_v2.common.data.client.im_raw_data_client as imvcdcimrdc
import core.finance.resampling as cfinresa

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [2]:
hdbg.init_logger(verbosity=logging.INFO)
log_level = logging.INFO

_LOG = logging.getLogger(__name__)

_LOG.info("%s", henv.get_system_signature()[0])

hprint.config_notebook()

INFO  > cmd='/venv/lib/python3.9/site-packages/ipykernel_launcher.py -f /home/.local/share/jupyter/runtime/kernel-1370d960-d200-4542-a8bb-8e9724776ca2.json'
INFO  # Git
  branch_name='CmampTask8608_Maintain_local_order_book_correctly_for_crypto.com'
  hash='07d85a13b'
  # Last commits:
    * 07d85a13b Vedanshu Joshi CmTask8720 Create pre-prod DAGs for shadow trading (#8727)        (   4 hours ago) Mon Jun 24 12:19:40 2024  (HEAD -> CmampTask8608_Maintain_local_order_book_correctly_for_crypto.com, origin/master, origin/HEAD, origin/CmampTask8733_Deploy_dash_app_behind_VPN, origin/CmampTask8608_Maintain_local_order_book_correctly_for_crypto.com, master)
    * aca12ce15 Shayan   Updated the docs with the new infra SMTP (#8738)                  (   5 hours ago) Mon Jun 24 11:17:40 2024           
    * fb325a7dc Shayan   Updated EFS throughput mode (#8736)                               (   6 hours ago) Mon Jun 24 10:07:37 2024           
# Machine info
  system=Linux
  node name=3bfe3a23d5

## Load Historical Data

Snippet of ~10 minutes of data for June 23rd, ~3:00 PM UTC

In [60]:
! ls /shared_data/CmTask8608

1719154946422  1719155126448  1719155306502
1719155058911  1719155251740  1719155414737


In [61]:
glob.glob("/shared_data/CmTask8608/*")

['/shared_data/CmTask8608/1719155251740',
 '/shared_data/CmTask8608/1719155126448',
 '/shared_data/CmTask8608/1719155058911',
 '/shared_data/CmTask8608/1719155306502',
 '/shared_data/CmTask8608/1719155414737',
 '/shared_data/CmTask8608/1719154946422']

In [62]:
dfs = []
for file in glob.glob("/shared_data/CmTask8608/*"):
    df_ = pd.read_json(file, lines=True)
    dfs.append(df_)

df = pd.concat(dfs, axis=0)

In [13]:
df = df.drop_duplicates(subset=["p"])

- We have confirmation from the CCXT Discord that the timestamp used by CCXT here https://github.com/ccxt/ccxt/blob/1cca6b0883a0e471fede443ebf8501601e40836a/python/ccxt/pro/cryptocom.py#L208 is the message publish time, i.e. the `t` field from https://exchange-docs.crypto.com/exchange/v1/rest-ws/index.html#book-instrument_name-depth

- We have confirmation from Telegram that the `p` field in the historical data also corresponds to the publish time
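
Since both sources are understood to carry the publish time in epoch milliseconds, they can be brought to the same UTC representation before any comparison. A minimal sketch, assuming a millisecond value of the same magnitude as the snippet file names (the value is only illustrative):

```python
import pandas as pd

# Illustrative epoch-millisecond value (assumption: same unit as the "p" field).
publish_ms = 1719154946422

# Convert to a timezone-aware UTC timestamp.
publish_ts = pd.to_datetime(publish_ms, unit="ms", utc=True)
print(publish_ts)

# Round trip back to epoch milliseconds (`Timestamp.value` is in nanoseconds).
round_trip_ms = publish_ts.value // 10**6
print(round_trip_ms)
```

Keeping both datasets timezone-aware in UTC avoids silent offsets when they are later merged on the timestamp.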

In [87]:
df["p"] = pd.to_datetime(df["p"], unit="ms", utc=True)

In [88]:
historical_df = df.set_index("p", drop=True)

In [89]:
historical_df.index.min()

Timestamp('2024-06-23 15:02:26.424000+0000', tz='UTC')

Get top of the book data

In [90]:
historical_df[["bid_price", "bid_size"]] = historical_df["b"].map(lambda x: x[0]).apply(pd.Series)
historical_df[["ask_price", "ask_size"]] = historical_df["a"].map(lambda x: x[0]).apply(pd.Series)

In [91]:
historical_df.index.name = "timestamp"

## Load our data

In [98]:
signature = "periodic_daily.airflow.archived_200ms.parquet.bid_ask.futures.v7_4.ccxt.cryptocom.v1_0_0"
reader = imvcdcimrdc.RawDataReader(signature, stage="preprod")
start_timestamp = historical_df.index.min() - pd.Timedelta(minutes=1)
# The window ends 1 minute before the last historical timestamp.
end_timestamp = historical_df.index.max() - pd.Timedelta(minutes=1)
archived_data = reader.read_data(start_timestamp, end_timestamp, currency_pairs=["BTC_USD"])
_LOG.log(log_level, hpandas.df_to_str(archived_data, log_level=log_level))

INFO  Loading dataset schema file: /app/amp/data_schema/dataset_schema_versions/dataset_schema_v3.json
INFO  Loaded dataset schema version v3
INFO  Loading dataset schema file: /app/amp/data_schema/dataset_schema_versions/dataset_schema_v3.json
INFO  Loaded dataset schema version v3
INFO  Loading dataset schema file: /app/amp/data_schema/dataset_schema_versions/dataset_schema_v3.json
INFO  Loaded dataset schema version v3


Unnamed: 0,timestamp,bid_size,bid_price,ask_size,ask_price,exchange_id,level,end_download_timestamp,knowledge_timestamp,currency_pair,year,month,day
2024-06-23 15:01:26.659000+00:00,1719154886659,0.203,64121.6,0.253,64121.7,cryptocom,1,2024-06-23 15:01:26.800326+00:00,2024-06-23 15:01:32.743258+00:00,BTC_USD,2024,6,23
2024-06-23 15:01:26.659000+00:00,1719154886659,0.125,64121.0,0.05,64122.2,cryptocom,2,2024-06-23 15:01:26.800326+00:00,2024-06-23 15:01:32.743258+00:00,BTC_USD,2024,6,23
2024-06-23 15:01:26.659000+00:00,1719154886659,0.01,64119.7,0.125,64124.7,cryptocom,3,2024-06-23 15:01:26.800326+00:00,2024-06-23 15:01:32.743258+00:00,BTC_USD,2024,6,23
,...,...,...,...,...,...,...,...,...,...,...,...,...
2024-06-23 15:10:26.353000+00:00,1719155426353,0.0693,64114.4,0.0079,64128.1,cryptocom,8,2024-06-23 15:10:26.520302+00:00,2024-06-23 15:10:30.460438+00:00,BTC_USD,2024,6,23
2024-06-23 15:10:26.353000+00:00,1719155426353,0.204,64114.0,0.202,64128.2,cryptocom,9,2024-06-23 15:10:26.520302+00:00,2024-06-23 15:10:30.460438+00:00,BTC_USD,2024,6,23
2024-06-23 15:10:26.353000+00:00,1719155426353,0.105,64113.8,0.204,64128.8,cryptocom,10,2024-06-23 15:10:26.520302+00:00,2024-06-23 15:10:30.460438+00:00,BTC_USD,2024,6,23


INFO  None


In [99]:
archived_data = archived_data[archived_data.level == 1].drop("timestamp", axis=1)

In [100]:
merged_df = pd.merge(historical_df, archived_data, on='timestamp', suffixes=('_historical', '_rt_archived'))

# Calculate the absolute deviation for each column
for column in ['bid_size', 'bid_price', 'ask_size', 'ask_price']:
    merged_df[f'{column}_deviation'] = abs(merged_df[f'{column}_historical'] - merged_df[f'{column}_rt_archived'])

In [96]:
merged_df.shape

(5, 25)

In [102]:
merged_df[
    ["bid_size_deviation", "bid_price_deviation", "ask_size_deviation", "ask_price_deviation"]
].describe()

Unnamed: 0,bid_size_deviation,bid_price_deviation,ask_size_deviation,ask_price_deviation
count,5.0,5.0,5.0,5.0
mean,1.1102230000000002e-17,0.0,0.0632,0.4
std,1.1610990000000001e-17,0.0,0.1413195,0.894427
min,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0
50%,1.387779e-17,0.0,1.387779e-17,0.0
75%,1.387779e-17,0.0,2.775558e-17,0.0
max,2.775558e-17,0.0,0.316,2.0


Conclusion: the datasets share very few timestamps (only 5 rows survive the exact-timestamp merge), which is surprising given that
both datasets should be snapshots using the same timestamp semantics, the message publish time.
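
The size of the overlap can be measured directly on the two indices before merging. A minimal sketch on made-up millisecond timestamps (the values and variable names are hypothetical, not taken from the datasets):

```python
import pandas as pd

# Hypothetical publish times (epoch ms) for the two sources.
hist_idx = pd.to_datetime([1000, 1100, 1250, 1400], unit="ms", utc=True)
rt_idx = pd.to_datetime([1000, 1130, 1250, 1390, 1500], unit="ms", utc=True)

# Timestamps present in both sources.
common = hist_idx.intersection(rt_idx)

# Share of the smaller dataset that has an exact timestamp match.
overlap_share = len(common) / min(len(hist_idx), len(rt_idx))
print(len(common), overlap_share)
```

A low share on real data would indicate that the two feeds rarely capture the same snapshot, which motivates comparing them on a common grid instead of on exact timestamps.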

### Align both datasets to 100ms grid

- Forward fill conservatively: `ffill(limit=10)` carries the last known value across at most 10 empty 100ms bins, i.e. 1 second

In [110]:
historical_df_100ms_grid = cfinresa.resample(historical_df, rule="100ms").last().ffill(limit=10)

In [111]:
archived_data_100ms_grid = cfinresa.resample(archived_data, rule="100ms").last().ffill(limit=10)
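
The effect of the grid alignment and the fill limit can be seen on a toy series. This sketch uses plain pandas `resample` as a stand-in for `cfinresa.resample`; the repo helper may use different bin-closing or labeling conventions, so treat the exact bin labels as an assumption.

```python
import pandas as pd

# Hypothetical ticks: two updates close together, then a 1.5 second gap.
ticks = pd.DataFrame(
    {"bid_price": [100.0, 100.5, 101.0]},
    index=pd.to_datetime([1000, 1030, 2500], unit="ms", utc=True),
)

# Keep the last update in each 100ms bin, then forward fill at most 10 bins (1 second).
grid = ticks.resample("100ms").last().ffill(limit=10)

# Bins further than 1 second from the last update stay empty.
print(grid["bid_price"].isna().sum())
```

With the limit in place, stale values are not carried across long gaps, so the comparison only covers bins where both sources had a recent update.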

In [113]:
merged_df = pd.merge(historical_df_100ms_grid, archived_data_100ms_grid, on='timestamp', suffixes=('_historical', '_rt_archived'))

# Calculate the absolute deviation for each column
for column in ['bid_size', 'bid_price', 'ask_size', 'ask_price']:
    merged_df[f'{column}_deviation'] = abs(merged_df[f'{column}_historical'] - merged_df[f'{column}_rt_archived'])

In [114]:
merged_df.shape

(4800, 25)

In [115]:
merged_df[
    ["bid_size_deviation", "bid_price_deviation", "ask_size_deviation", "ask_price_deviation"]
].describe()

Unnamed: 0,bid_size_deviation,bid_price_deviation,ask_size_deviation,ask_price_deviation
count,4765.0,4765.0,4765.0,4765.0
mean,0.03577083,0.395551,0.04355448,0.405876
std,0.07258316,1.119618,0.09250891,1.162211
min,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0
50%,1.387779e-17,0.0,1.387779e-17,0.0
75%,0.0406,0.2,0.05,0.1
max,0.687,13.4,0.9185,13.8


After aligning on a grid the results are very encouraging: we see a very close match at the top of the book
for all four compared fields (the median deviation is zero up to floating-point noise)
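
Deviations on the order of 1e-17 are floating-point noise rather than real differences. One possible way to summarize the match is the share of rows that agree within a small tolerance. The sketch below uses hypothetical deviation values in the same range as the summary above, and the tolerance `1e-12` is an arbitrary assumption:

```python
import numpy as np
import pandas as pd

# Hypothetical deviations, chosen to resemble the ranges in the summary table.
deviations = pd.DataFrame(
    {
        "bid_price_deviation": [0.0, 0.0, 0.2, 13.4],
        "bid_size_deviation": [1.387779e-17, 0.0, 0.0406, 0.687],
    }
)

# Treat differences at floating-point noise level as exact matches.
is_match = pd.DataFrame(
    np.isclose(deviations.to_numpy(), 0.0, atol=1e-12),
    columns=deviations.columns,
)

# Share of rows that match, per column.
match_share = is_match.mean()
print(match_share)
```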