## Load CEX data and clean up
I have access to Binance price data from 2024-01-29 00:00:00.066000 UTC to 2024-02-09 23:59:59.532000 UTC. Binance has four PENDLE markets: PENDLEBTC, PENDLEUSDT, PENDLEFDUSD, and PENDLETUSD. PENDLEUSDT is the highest volume market by far, so I'll use price data from that market.

I need to do the following work:
1. **Pull PENDLEUSDT and USDTUSDC binance data.** I need PENDLE<>USDC Binance prices because the Uniswap pools are PENDLE<>USDC. 
2. **Use Polars to join data on second to produce PENDLEUSDC pricing with second-level granularity.** We use Google BQ, and I don't think I can join on second-level data in GoogleBQ. The plan is to join the data in Polars with second-level granularity.

In [3]:
# dependencies

!pip3 install polars
!pip3 install seaborn
import polars as pl
import seaborn as sns


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3.12 -m pip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3.12 -m pip install --upgrade pip[0m


In [18]:
# see CEX_data_pull.sql. Looks like 4.2m rows and 180 MB.
# I've never worked with a dataset this large before, so pretty exciting.
# Still, not too big as I understand so should be approachable.

df = pl.read_csv('data/2024.2.10_CEX_data.csv')

In [19]:
# convert string to timestamp
df = df.with_columns(
    pl.coalesce(
        pl.col('timestamp')
            .str.strptime(pl.Datetime, '%Y-%m-%d %H:%M:%S%.f UTC', strict=False), # strict = False writes to null
        # need to coalesce two `strptime` to handle errors where a whole second didn't have a decimal at the end.
        pl.col('timestamp')
            .str.strptime(pl.Datetime, '%Y-%m-%d %H:%M:%S UTC', strict=False)
    )
)
# truncate timestamp to seconds
df = df.with_columns
    (
        pl.col("timestamp")
        .dt.truncate("s")
        .alias("timestamp")
    )


# df = (
#         df.group_by("timestamp", "symbol")
#         .agg(pl.col("price").mean())
# )

In [21]:
df.head(20)

timestamp,symbol,price
datetime[μs],str,f64
2024-02-09 23:59:59.532,"""PENDLEUSDT""",3.1709
2024-02-09 23:59:59.532,"""PENDLEUSDT""",3.1699
2024-02-09 23:59:59.532,"""PENDLEUSDT""",3.1724
2024-02-09 23:59:59.532,"""PENDLEUSDT""",3.1718
2024-02-09 23:59:59.532,"""PENDLEUSDT""",3.1697
2024-02-09 23:59:59.532,"""PENDLEUSDT""",3.171
2024-02-09 23:59:59.532,"""PENDLEUSDT""",3.1724
2024-02-09 23:59:59.532,"""PENDLEUSDT""",3.1709
2024-02-09 23:59:59.532,"""PENDLEUSDT""",3.1722
2024-02-09 23:59:59.532,"""PENDLEUSDT""",3.1713


In [None]:
# # production polars
# # building my string for lazy api

# q = (
#     pl.scan_csv('data/2024.2.10_CEX_data.csv')
#     .filter(pl.col("symbol")="PENDLEUSDT"))
# )

# df_pendleusdc = q.collect()
# df_pendleusdc.head()