This is a spin-off from the `CMTask2703_Perform_manual_reconciliation_of_OB_data` notebook.
We would like to reconcile data collected roughly every 200 ms via CCXT with historical data from CryptoChassis.

- CCXT data = CCXT real-time DB bid-ask data collection for futures
- CC data = CryptoChassis historical Parquet bid-ask futures data

# Imports

In [1]:
%load_ext autoreload
%autoreload 2

import logging

import pandas as pd

import helpers.hdatetime as hdateti
import helpers.hdbg as hdbg
import helpers.henv as henv
import helpers.hpandas as hpandas
import helpers.hparquet as hparque
import helpers.hprint as hprint
import helpers.hsql as hsql
import im_v2.im_lib_tasks as imvimlita



In [2]:
hdbg.init_logger(verbosity=logging.INFO)

_LOG = logging.getLogger(__name__)

_LOG.info("%s", henv.get_system_signature()[0])

hprint.config_notebook()

[0m[36mINFO[0m: > cmd='/venv/lib/python3.8/site-packages/ipykernel_launcher.py -f /home/.local/share/jupyter/runtime/kernel-f2e97df8-a422-4554-a88a-2cb834609870.json'
INFO  # Git
  branch_name='CmTask2912_Implement_Websocket_Extractor'
  hash='eaddf9e92'
  # Last commits:
    * eaddf9e92 jsmerix  Add websocket data reconciliation notebook                        (  18 hours ago) Tue Oct 11 17:43:16 2022  (HEAD -> CmTask2912_Implement_Websocket_Extractor, origin/CmTask2912_Implement_Websocket_Extractor)
    * 2ad6766cb jsmerix  Rename attribute                                                  (    4 days ago) Sat Oct 8 16:28:37 2022           
    * 9e48b5313 jsmerix  Update Talos extractor to avoid missing abstract methods error    (    4 days ago) Sat Oct 8 16:27:59 2022           
# Machine info
  system=Linux
  node name=e5090f74aa2a
  release=5.15.0-1019-aws
  version=#23~20.04.1-Ubuntu SMP Thu Aug 18 03:20:14 UTC 2022
  machine=x86_64
  processor=x86_64
  cpu count=8
  cpu freq=

# Load the data

For CCXT data we have multiple data points within a single second, so we resample to one-second bars by aggregating
the entries within each second (the resampling cell below takes their mean).
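The sketch below illustrates the two aggregation choices that appear in this notebook, the latest entry versus the mean within each second, on toy quotes spaced 200 ms apart. The values are made up for illustration; only the `resample(..., label="right")` pattern mirrors the notebook.

```python
import pandas as pd

# Toy quotes roughly 200 ms apart (values are illustrative, not real data).
idx = pd.date_range("2022-10-11 17:00:00", periods=7, freq="200ms", tz="UTC")
quotes = pd.DataFrame(
    {"bid_size": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0]}, index=idx
)

# One-second bins, each labeled by its right edge.
resampler = quotes.resample("1s", label="right")
# Latest entry within each second.
last_per_sec = resampler.last()
# Average of all entries within each second.
mean_per_sec = resampler.mean()

print(last_per_sec)
print(mean_per_sec)
```

The two results can differ noticeably for sizes, which change on almost every update, so it helps to know which aggregation the other vendor applies before comparing.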

## Specify universe

In [3]:
universe = [
    "binance::SOL_USDT",
    "binance::DOGE_USDT",
    "binance::BNB_USDT",
    "binance::ETH_USDT",
    "binance::BTC_USDT",
]

## Load data

In [4]:
start_ts = pd.Timestamp("2022-10-11 17:00:00+00:00")
end_ts = pd.Timestamp("2022-10-11 18:00:00+00:00")
start_ts_unix = hdateti.convert_timestamp_to_unix_epoch(start_ts)
end_ts_unix = hdateti.convert_timestamp_to_unix_epoch(end_ts)
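As a cross-check, the same conversion can be sketched with plain pandas. This assumes the helper returns milliseconds since the Unix epoch, which is what the query string printed later in the notebook suggests; `to_epoch_ms` is a hypothetical name and not part of `helpers.hdatetime`.

```python
import pandas as pd

start_ts = pd.Timestamp("2022-10-11 17:00:00+00:00")
end_ts = pd.Timestamp("2022-10-11 18:00:00+00:00")

_EPOCH = pd.Timestamp("1970-01-01", tz="UTC")


def to_epoch_ms(ts: pd.Timestamp) -> int:
    """Hypothetical helper: milliseconds since the Unix epoch for a tz-aware timestamp."""
    # Integer division of two Timedeltas yields a whole number of milliseconds.
    return int((ts - _EPOCH) // pd.Timedelta(milliseconds=1))


print(to_epoch_ms(start_ts), to_epoch_ms(end_ts))
```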

### CC data

In [5]:
filters = [("year", "=", 2022), ("month", "=", 10)]
file_name = "s3://cryptokaizen-data.preprod/reorg/daily_staged.airflow.pq/bid_ask-futures/crypto_chassis/binance/"
df = hparque.from_parquet(file_name, filters=filters, aws_profile="ck")

In [6]:
df.head()

timestamp (index),timestamp,bid_price,bid_size,ask_price,ask_size,exchange_id,knowledge_timestamp,currency_pair,year,month
2022-10-07 00:00:00+00:00,1665100800,0.4285,65997.0,0.4286,186646.0,binance,2022-10-09 00:26:22.997214+00:00,ADA_USDT,2022,10
2022-10-07 00:00:01+00:00,1665100801,0.4284,50367.0,0.4285,84562.0,binance,2022-10-09 00:26:22.997214+00:00,ADA_USDT,2022,10
2022-10-07 00:00:03+00:00,1665100803,0.4284,45702.0,0.4285,84562.0,binance,2022-10-09 00:26:22.997214+00:00,ADA_USDT,2022,10
2022-10-07 00:00:05+00:00,1665100805,0.4284,43969.0,0.4285,84562.0,binance,2022-10-09 00:26:22.997214+00:00,ADA_USDT,2022,10
2022-10-07 00:00:07+00:00,1665100807,0.4284,68826.0,0.4285,107099.0,binance,2022-10-09 00:26:22.997214+00:00,ADA_USDT,2022,10


In [7]:
df.index.max()

Timestamp('2022-10-11 23:59:59+0000', tz='UTC')

In [8]:
df_chassis = df.loc[(df.index >= start_ts) & (df.index <= end_ts)]
df_chassis = df_chassis.drop_duplicates()
df_chassis["full_symbol"] = "binance::" + df_chassis["currency_pair"]
df_chassis = df_chassis[df_chassis["full_symbol"].isin(universe)]
df_chassis = df_chassis[
    ["bid_size", "bid_price", "ask_size", "ask_price", "full_symbol"]
]
df_chassis = df_chassis.reset_index().set_index(["timestamp", "full_symbol"])
# We drop the first row because CC labels the right side of the interval during resampling, meaning that for CCXT
# we will have one fewer row.
df_chassis = df_chassis.drop(start_ts)

In [9]:
df_chassis.tail()

timestamp,full_symbol,bid_size,bid_price,ask_size,ask_price
2022-10-11 17:59:56+00:00,binance::SOL_USDT,587.0,31.63,998.0,31.64
2022-10-11 17:59:57+00:00,binance::SOL_USDT,1495.0,31.64,951.0,31.65
2022-10-11 17:59:58+00:00,binance::SOL_USDT,1512.0,31.64,951.0,31.65
2022-10-11 17:59:59+00:00,binance::SOL_USDT,1469.0,31.64,1363.0,31.65
2022-10-11 18:00:00+00:00,binance::SOL_USDT,1487.0,31.64,949.0,31.65


In [10]:
df_chassis.shape

(35632, 4)

In [11]:
df_chassis[df_chassis.index.isin(["binance::BTC_USDT"], level=1)].head()

timestamp,full_symbol,bid_size,bid_price,ask_size,ask_price
2022-10-11 17:00:01+00:00,binance::BTC_USDT,12.477,19145.8,1.741,19145.9
2022-10-11 17:00:02+00:00,binance::BTC_USDT,4.687,19145.8,19.584,19145.9
2022-10-11 17:00:03+00:00,binance::BTC_USDT,45.384,19145.9,1.496,19146.0
2022-10-11 17:00:04+00:00,binance::BTC_USDT,26.743,19146.4,10.655,19146.5
2022-10-11 17:00:05+00:00,binance::BTC_USDT,39.214,19147.6,0.512,19147.7


### CCXT data

In [12]:
env_file = imvimlita.get_db_env_path("dev")
connection_params = hsql.get_connection_info_from_env_file(env_file)
db_connection = hsql.get_connection(*connection_params)

In [13]:
query = f"SELECT * FROM public.ccxt_bid_ask_futures_test \
WHERE level = 1 AND timestamp >= {start_ts_unix} AND timestamp <= {end_ts_unix}"
query

'SELECT * FROM public.ccxt_bid_ask_futures_test WHERE level = 1 AND timestamp >= 1665507600000 AND timestamp <= 1665511200000'
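The time-window filter can also be written with bound parameters instead of string interpolation. The sketch below uses an in-memory SQLite table as a stand-in for the Postgres table, so the table name, columns, and rows are hypothetical. Placeholder syntax depends on the driver (`?` for `sqlite3`, `%s` for `psycopg2`), and whether `hsql.execute_query_to_df` accepts parameters is not shown in this notebook.

```python
import sqlite3

# In-memory stand-in for the real table (schema and rows are made up).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bid_ask (timestamp INTEGER, level INTEGER, bid_price REAL)"
)
conn.executemany(
    "INSERT INTO bid_ask VALUES (?, ?, ?)",
    [
        (1665507599000, 1, 19145.7),  # Before the window.
        (1665507600000, 1, 19145.8),  # On the start boundary (inclusive).
        (1665507600200, 2, 19145.6),  # Inside the window, but level 2.
        (1665511200000, 1, 19020.1),  # On the end boundary (inclusive).
        (1665511200200, 1, 19020.2),  # After the window.
    ],
)

# Top-of-book rows within [start, end], with values passed as parameters.
query = (
    "SELECT timestamp, level, bid_price FROM bid_ask "
    "WHERE level = ? AND timestamp >= ? AND timestamp <= ? "
    "ORDER BY timestamp"
)
rows = conn.execute(query, (1, 1665507600000, 1665511200000)).fetchall()
print(rows)
```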

In [14]:
df_ccxt = hsql.execute_query_to_df(db_connection, query)
df_ccxt["timestamp"] = df_ccxt["timestamp"].map(
    hdateti.convert_unix_epoch_to_timestamp
)
df_ccxt = df_ccxt.reset_index(drop=True).set_index(["timestamp"])



In [15]:
# Resample to 1-second bars with `label="right"` to match CryptoChassis labeling.
df_ccxt["full_symbol"] = "binance::" + df_ccxt["currency_pair"]
dfs_ccxt = []
for fs in universe:
    df_fs = df_ccxt[df_ccxt["full_symbol"] == fs]
    df_fs = (
        df_fs[["bid_size", "bid_price", "ask_size", "ask_price"]]
        .resample("S", label="right")
        .mean()
    )
    df_fs["full_symbol"] = fs
    df_fs = df_fs.reset_index().set_index(["timestamp", "full_symbol"])
    dfs_ccxt.append(df_fs)
df_ccxt_sec_last = pd.concat(dfs_ccxt)

# Analysis

In [16]:
data_ccxt = df_ccxt_sec_last
data_cc = df_chassis

In [17]:
bid_ask_cols = ["bid_size", "bid_price", "ask_size", "ask_price", "full_symbol"]

## Merge CC and DB data into one DataFrame

In [18]:
data = data_ccxt.merge(
    data_cc,
    how="outer",
    left_index=True,
    right_index=True,
    suffixes=("_ccxt", "_cc"),
)
_LOG.info("Start date = %s", data.reset_index()["timestamp"].min())
_LOG.info("End date = %s", data.reset_index()["timestamp"].max())
_LOG.info(
    "Avg observations per coin = %s",
    len(data) / len(data.reset_index()["full_symbol"].unique()),
)
# Move the same metrics from two vendors together.
data = data.reindex(sorted(data.columns), axis=1)
# NaNs observation.
_LOG.info(
    "Number of observations with NaNs in CryptoChassis = %s",
    len(data[data["bid_price_cc"].isna()]),
)
_LOG.info(
    "Number of observations with NaNs in CCXT = %s",
    len(data[data["bid_price_ccxt"].isna()]),
)
# Remove NaNs.
data = hpandas.dropna(data, report_stats=True)
#
display(data.tail())

INFO  Start date = 2022-10-11 17:00:01+00:00
INFO  End date = 2022-10-11 18:00:00+00:00
INFO  Avg observations per coin = 7163.2
INFO  Number of observations with NaNs in CryptoChassis = 184
INFO  Number of observations with NaNs in CCXT = 0
INFO  removed rows with nans: 184 / 35816 = 0.51%


timestamp,full_symbol,ask_price_cc,ask_price_ccxt,ask_size_cc,ask_size_ccxt,bid_price_cc,bid_price_ccxt,bid_size_cc,bid_size_ccxt
2022-10-11 18:00:00+00:00,binance::DOGE_USDT,0.06025,0.06025,281744.0,280241.2,0.06024,0.06024,130636.0,130546.8
2022-10-11 18:00:00+00:00,binance::ETH_USDT,1290.48,1290.48,160.683,156.342,1290.47,1290.47,0.004,0.7096
2022-10-11 18:00:00+00:00,binance::ETH_USDT,1290.48,1290.48,160.683,156.342,1290.47,1290.47,0.004,0.7096
2022-10-11 18:00:00+00:00,binance::SOL_USDT,31.65,31.65,949.0,1087.0,31.64,31.64,1487.0,1480.833333
2022-10-11 18:00:00+00:00,binance::SOL_USDT,31.65,31.65,949.0,1087.0,31.64,31.64,1487.0,1480.833333
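The tail above contains repeated `(timestamp, full_symbol)` keys. A minimal sketch of two sanity checks that could be run on such a merge is shown below: `indicator=True` reports which side each row came from, and `index.duplicated()` counts repeated keys. The frames are toy data and `make_frame` is a hypothetical helper.

```python
import pandas as pd


def make_frame(rows):
    """Hypothetical helper: build a toy frame indexed by (timestamp, full_symbol)."""
    df = pd.DataFrame(rows, columns=["timestamp", "full_symbol", "bid_price"])
    df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)
    return df.set_index(["timestamp", "full_symbol"])


ccxt = make_frame(
    [
        ("2022-10-11 17:00:01", "binance::BTC_USDT", 19145.8),
        ("2022-10-11 17:00:02", "binance::BTC_USDT", 19145.8),
        ("2022-10-11 17:00:03", "binance::BTC_USDT", 19145.9),
    ]
)
cc = make_frame(
    [
        ("2022-10-11 17:00:01", "binance::BTC_USDT", 19145.8),
        ("2022-10-11 17:00:03", "binance::BTC_USDT", 19145.9),
        # Repeated key, which multiplies rows in the merged frame.
        ("2022-10-11 17:00:03", "binance::BTC_USDT", 19145.9),
    ]
)

merged = ccxt.merge(
    cc,
    how="outer",
    left_index=True,
    right_index=True,
    suffixes=("_ccxt", "_cc"),
    indicator=True,
)
# Rows present in both sources vs. only one of them.
side_counts = merged["_merge"].value_counts()
# Number of index entries that repeat an earlier (timestamp, full_symbol) key.
n_dup = int(merged.index.duplicated().sum())
print(side_counts)
print("duplicated keys:", n_dup)
```

Repeated keys are worth resolving before computing per-symbol means or correlations, since each repeated row is otherwise counted more than once.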


## Calculate differences

In [19]:
# Full symbol will not be relevant in calculation loops below.
bid_ask_cols.remove("full_symbol")
# Each bid ask value will have a notional and a relative difference between two sources.
for col in bid_ask_cols:
    # Notional difference: CC value - DB value.
    data[f"{col}_diff"] = data[f"{col}_cc"] - data[f"{col}_ccxt"]
    # Relative value: (CC value - DB value)/DB value.
    data[f"{col}_relative_diff_pct"] = (
        100 * (data[f"{col}_cc"] - data[f"{col}_ccxt"]) / data[f"{col}_ccxt"]
    )
#
data.head()

timestamp,full_symbol,ask_price_cc,ask_price_ccxt,ask_size_cc,ask_size_ccxt,bid_price_cc,bid_price_ccxt,bid_size_cc,bid_size_ccxt,bid_size_diff,bid_size_relative_diff_pct,bid_price_diff,bid_price_relative_diff_pct,ask_size_diff,ask_size_relative_diff_pct,ask_price_diff,ask_price_relative_diff_pct
2022-10-11 17:00:01+00:00,binance::BNB_USDT,273.37,273.37,9.83,17.988333,273.36,273.36,105.32,93.118333,12.201667,13.103399,0.0,0.0,-8.158333,-45.35347,0.0,0.0
2022-10-11 17:00:01+00:00,binance::BNB_USDT,273.37,273.37,9.83,17.988333,273.36,273.36,105.32,93.118333,12.201667,13.103399,0.0,0.0,-8.158333,-45.35347,0.0,0.0
2022-10-11 17:00:01+00:00,binance::BTC_USDT,19145.9,19145.9,1.741,5.6146,19145.8,19145.8,12.477,9.6936,2.7834,28.713791,0.0,0.0,-3.8736,-68.991558,0.0,0.0
2022-10-11 17:00:01+00:00,binance::BTC_USDT,19145.9,19145.9,1.741,5.6146,19145.8,19145.8,12.477,9.6936,2.7834,28.713791,0.0,0.0,-3.8736,-68.991558,0.0,0.0
2022-10-11 17:00:01+00:00,binance::DOGE_USDT,0.06049,0.06049,108377.0,85385.0,0.06048,0.06048,353180.0,351197.5,1982.5,0.564497,0.0,0.0,22992.0,26.927446,0.0,0.0
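As a worked example, the first `binance::BNB_USDT` row above can be recomputed by hand from its displayed (rounded) bid sizes. The variable names are illustrative.

```python
# Displayed values from the first BNB_USDT row (rounded in the table).
bid_size_cc = 105.32
bid_size_ccxt = 93.118333

# Difference: CC value - DB value.
bid_size_diff = bid_size_cc - bid_size_ccxt
# Relative difference in percent: 100 * (CC value - DB value) / DB value.
bid_size_relative_diff_pct = 100 * bid_size_diff / bid_size_ccxt

print(round(bid_size_diff, 6), round(bid_size_relative_diff_pct, 4))
```

The results agree with the `bid_size_diff` and `bid_size_relative_diff_pct` columns up to the rounding of the displayed inputs.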


In [20]:
# Calculate the mean value of differences for each coin.
diff_stats = []
grouper = data.groupby(["full_symbol"])
for col in bid_ask_cols:
    diff_stats.append(grouper[f"{col}_diff"].mean())
    diff_stats.append(grouper[f"{col}_relative_diff_pct"].mean())
#
diff_stats = pd.concat(diff_stats, axis=1)

## Show stats for differences (in %)

### Prices

In [21]:
diff_stats[["bid_price_relative_diff_pct", "ask_price_relative_diff_pct"]]

full_symbol,bid_price_relative_diff_pct,ask_price_relative_diff_pct
binance::BNB_USDT,2.6e-05,2.7e-05
binance::BTC_USDT,4e-05,4e-05
binance::DOGE_USDT,-7.8e-05,-8.1e-05
binance::ETH_USDT,1e-05,1e-05
binance::SOL_USDT,-2.1e-05,-3.2e-05


As one can see, the mean relative differences between bid and ask prices in DB and CC are far below 1%: every value in the table is under 1e-4 % in magnitude.
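Note that these are means of signed differences, so positive and negative deviations can offset each other. A minimal sketch on made-up numbers shows how the signed mean, the mean absolute difference, and the maximum absolute difference can tell different stories.

```python
import pandas as pd

# Toy relative differences in percent (illustrative values only).
rel_diff_pct = pd.Series([0.5, -0.5, 0.25, -0.25])

# Signed mean: opposite-sign deviations cancel out.
signed_mean = rel_diff_pct.mean()
# Mean absolute difference: typical size of a deviation.
mean_abs = rel_diff_pct.abs().mean()
# Largest single deviation.
max_abs = rel_diff_pct.abs().max()

print(signed_mean, mean_abs, max_abs)
```

Applied per symbol to the `*_relative_diff_pct` columns, the absolute variants would complement the signed means reported above.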

### Sizes

In [22]:
diff_stats[["bid_size_relative_diff_pct", "ask_size_relative_diff_pct"]]

full_symbol,bid_size_relative_diff_pct,ask_size_relative_diff_pct
binance::BNB_USDT,0.697237,4.00301
binance::BTC_USDT,2.271014,7.265841
binance::DOGE_USDT,1.019151,0.122283
binance::ETH_USDT,2.540136,7.109853
binance::SOL_USDT,-0.186687,0.012566


## Correlations

### Bid price

In [23]:
bid_price_corr_matrix = (
    data[["bid_price_cc", "bid_price_ccxt"]].groupby(level=1).corr()
)
bid_price_corr_matrix

full_symbol,,bid_price_cc,bid_price_ccxt
binance::BNB_USDT,bid_price_cc,1.0,0.999741
binance::BNB_USDT,bid_price_ccxt,0.999741,1.0
binance::BTC_USDT,bid_price_cc,1.0,0.999416
binance::BTC_USDT,bid_price_ccxt,0.999416,1.0
binance::DOGE_USDT,bid_price_cc,1.0,0.999524
binance::DOGE_USDT,bid_price_ccxt,0.999524,1.0
binance::ETH_USDT,bid_price_cc,1.0,0.999776
binance::ETH_USDT,bid_price_ccxt,0.999776,1.0
binance::SOL_USDT,bid_price_cc,1.0,0.998906
binance::SOL_USDT,bid_price_ccxt,0.998906,1.0


The correlation stats confirm the stats above: bid prices in DB and CC are highly correlated (above 0.998 for every symbol).
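Since each 2x2 matrix carries a single informative number, the per-symbol correlation can also be collapsed into one value per symbol. The sketch below does this on toy data with the same `(timestamp, full_symbol)` index layout; the prices are made up so that one symbol is perfectly correlated and the other perfectly anti-correlated.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [
        pd.date_range("2022-10-11 17:00:01", periods=4, freq="1s", tz="UTC"),
        ["binance::BTC_USDT", "binance::ETH_USDT"],
    ],
    names=["timestamp", "full_symbol"],
)
# Rows alternate BTC, ETH for each timestamp (illustrative values only).
toy = pd.DataFrame(
    {
        "bid_price_cc": [1.0, 10.0, 2.0, 20.0, 3.0, 30.0, 4.0, 40.0],
        "bid_price_ccxt": [1.0, 40.0, 2.0, 30.0, 3.0, 20.0, 4.0, 10.0],
    },
    index=index,
)

# One Pearson correlation per symbol instead of a 2x2 matrix per symbol.
corr_per_symbol = toy.groupby(level="full_symbol").apply(
    lambda g: g["bid_price_cc"].corr(g["bid_price_ccxt"])
)
print(corr_per_symbol)
```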

### Ask price

In [24]:
ask_price_corr_matrix = (
    data[["ask_price_cc", "ask_price_ccxt"]].groupby(level=1).corr()
)
ask_price_corr_matrix

full_symbol,,ask_price_cc,ask_price_ccxt
binance::BNB_USDT,ask_price_cc,1.0,0.999729
binance::BNB_USDT,ask_price_ccxt,0.999729,1.0
binance::BTC_USDT,ask_price_cc,1.0,0.999418
binance::BTC_USDT,ask_price_ccxt,0.999418,1.0
binance::DOGE_USDT,ask_price_cc,1.0,0.999514
binance::DOGE_USDT,ask_price_ccxt,0.999514,1.0
binance::ETH_USDT,ask_price_cc,1.0,0.999777
binance::ETH_USDT,ask_price_ccxt,0.999777,1.0
binance::SOL_USDT,ask_price_cc,1.0,0.998897
binance::SOL_USDT,ask_price_ccxt,0.998897,1.0


The correlation stats confirm the stats above: ask prices in DB and CC are highly correlated (above 0.998 for every symbol).

### Bid size

In [25]:
bid_size_corr_matrix = (
    data[["bid_size_cc", "bid_size_ccxt"]].groupby(level=1).corr()
)
bid_size_corr_matrix

full_symbol,,bid_size_cc,bid_size_ccxt
binance::BNB_USDT,bid_size_cc,1.0,0.95451
binance::BNB_USDT,bid_size_ccxt,0.95451,1.0
binance::BTC_USDT,bid_size_cc,1.0,0.940405
binance::BTC_USDT,bid_size_ccxt,0.940405,1.0
binance::DOGE_USDT,bid_size_cc,1.0,0.958969
binance::DOGE_USDT,bid_size_ccxt,0.958969,1.0
binance::ETH_USDT,bid_size_cc,1.0,0.950931
binance::ETH_USDT,bid_size_ccxt,0.950931,1.0
binance::SOL_USDT,bid_size_cc,1.0,0.971197
binance::SOL_USDT,bid_size_ccxt,0.971197,1.0


The correlation stats are consistent with the stats above: bid sizes in DB and CC are highly correlated (roughly 0.94 to 0.97), though less tightly than prices.

### Ask size

In [26]:
ask_size_corr_matrix = (
    data[["ask_size_cc", "ask_size_ccxt"]].groupby(level=1).corr()
)
ask_size_corr_matrix

full_symbol,,ask_size_cc,ask_size_ccxt
binance::BNB_USDT,ask_size_cc,1.0,0.900568
binance::BNB_USDT,ask_size_ccxt,0.900568,1.0
binance::BTC_USDT,ask_size_cc,1.0,0.941578
binance::BTC_USDT,ask_size_ccxt,0.941578,1.0
binance::DOGE_USDT,ask_size_cc,1.0,0.96628
binance::DOGE_USDT,ask_size_ccxt,0.96628,1.0
binance::ETH_USDT,ask_size_cc,1.0,0.939599
binance::ETH_USDT,ask_size_ccxt,0.939599,1.0
binance::SOL_USDT,ask_size_cc,1.0,0.973529
binance::SOL_USDT,ask_size_ccxt,0.973529,1.0


The correlation stats are consistent with the stats above: ask sizes in DB and CC are highly correlated (roughly 0.90 to 0.97), though less tightly than prices.