## Load CEX data and clean up
I have access to Binance price data from 2024-01-29 00:00:00.066000 UTC to 2024-02-09 23:59:59.532000 UTC. Binance has four PENDLE markets: PENDLEBTC, PENDLEUSDT, PENDLEFDUSD, and PENDLETUSD. PENDLEUSDT is the highest volume market by far, so I'll use price data from that market.

I need to do the following work:
1. **Pull PENDLEUSDT and USDCUSDT Binance data.** I need PENDLE<>USDC Binance prices because the Uniswap pools are PENDLE<>USDC. Binance has no direct PENDLEUSDC market, so I'll derive that price from these two pairs.
2. **Use Polars to join the data by second and produce PENDLEUSDC prices at one-second granularity.** We use Google BigQuery (BQ), and I don't think I can join at second-level granularity there, so the plan is to do the join in Polars.
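
The conversion behind step 1 is a simple cross rate. The sketch below assumes Binance symbols read as base then quote (PENDLEUSDT is USDT per PENDLE, USDCUSDT is USDT per USDC). The function name `pendle_usdc` and the input values are illustrative, not part of the notebook.

```python
# Assumption: Binance symbols are BASE+QUOTE, so
#   PENDLEUSDT = USDT per 1 PENDLE
#   USDCUSDT   = USDT per 1 USDC
def pendle_usdc(pendle_usdt: float, usdc_usdt: float) -> float:
    """Convert a USDT-quoted PENDLE price into a USDC-quoted one (hypothetical helper)."""
    # (USDT / PENDLE) / (USDT / USDC) = USDC / PENDLE
    return pendle_usdt / usdc_usdt


# Illustrative inputs only: a PENDLE price near 3.17 USDT and USDC just under 1 USDT.
example = pendle_usdc(3.1687, 0.9995)
print(example)
```

Because USDC trades slightly below 1 USDT in this example, the USDC-quoted PENDLE price comes out slightly higher than the USDT-quoted one.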

Please note that, as of 2/11/24, I'm developing in a haphazard and suboptimal way. I'll go back and clean everything up later.

In [2]:
# dependencies

!pip3 install polars
!pip3 install seaborn
import polars as pl
import seaborn as sns


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3.12 -m pip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3.12 -m pip install --upgrade pip[0m


In [3]:
# see CEX_data_pull.sql. Looks like 4.2m rows and 180 MB.
# I've never worked with a dataset this large before, so pretty exciting.
# Still, not too big as I understand so should be approachable.

df = pl.read_csv('data/2024.2.10_CEX_data.csv')

In [4]:
# convert string to timestamp
df = df.with_columns(
    pl.coalesce(
        pl.col('timestamp')
            .str.strptime(pl.Datetime, '%Y-%m-%d %H:%M:%S%.f UTC', strict=False), # strict=False yields null instead of raising when parsing fails
        # coalesce a second `strptime` to catch timestamps that fall on a whole second and so have no fractional part
        pl.col('timestamp')
            .str.strptime(pl.Datetime, '%Y-%m-%d %H:%M:%S UTC', strict=False)
    )
)
# truncate timestamp to seconds
df = df.with_columns(
    pl.col("timestamp")
    .dt.truncate("1s")
    .alias("timestamp")
)

In [5]:
# calculate mean price per second for each token pair

df_mean = (
    df.group_by("symbol", "timestamp")
    .agg(pl.col("price").mean())
)


In [6]:
df_mean.head()

symbol,timestamp,price
str,datetime[μs],f64
"""PENDLEUSDT""",2024-02-09 23:59:59,3.170942
"""USDCUSDT""",2024-02-09 23:59:58,0.9995
"""USDCUSDT""",2024-02-09 23:59:56,0.99955
"""USDCUSDT""",2024-02-09 23:59:52,0.9995
"""PENDLEUSDT""",2024-02-09 23:59:51,3.1687


In [7]:
# identify the minimum and maximum times in my dataset
min_time = df_mean.select(pl.min('timestamp')).item()
max_time = df_mean.select(pl.max('timestamp')).item()

# create a polars datetime range from min to max
dates = pl.datetime_range(
            min_time,
            max_time,
            interval = '1s',
            closed = 'both',
            # eager evaluate it into a series
            eager = True
        )   

# continuous per-second time series (still a Series here; it becomes a DataFrame once prices are joined onto it)
df_full = dates
df_full.head()

literal
datetime[μs]
2024-01-29 00:00:00
2024-01-29 00:00:01
2024-01-29 00:00:02
2024-01-29 00:00:03
2024-01-29 00:00:04
2024-01-29 00:00:05
2024-01-29 00:00:06
2024-01-29 00:00:07
2024-01-29 00:00:08
2024-01-29 00:00:09
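
A quick sanity check on the size of this series: with a `1s` interval and `closed='both'`, there should be one row per second, counting both endpoints. The sketch below assumes the truncated endpoints are 2024-01-29 00:00:00 and 2024-02-09 23:59:59, as the outputs above suggest; `expected_rows` is a hypothetical name.

```python
from datetime import datetime

# Assumed endpoints after truncating to whole seconds.
min_time = datetime(2024, 1, 29, 0, 0, 0)
max_time = datetime(2024, 2, 9, 23, 59, 59)

# closed='both' includes both endpoints, hence the +1.
expected_rows = int((max_time - min_time).total_seconds()) + 1
print(expected_rows)  # 1036800, i.e. 12 days * 86,400 seconds
```

Comparing this number with `len(df_full)` would confirm the range has no gaps.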


In [11]:
# @dev TODO as of 10 AM Feb 12, 2024
# The last cell above generates a second-level time series from min_time to max_time for the Binance price data that we have.
# Next step: join the PENDLEUSDT and USDCUSDT prices onto the df_full time series.
# From there, backfill prices where there are null values. That gives full price-time data for both pairs, from which I can compute a PENDLE-USDC price for the period.
# The result is full PENDLE-USDC CEX data to compare against the Uni v3 data.