### Load CEX data and clean up
I have access to Binance price data starting at 2024-01-29 00:00:00.066000 UTC. I will use the PENDLEUSDT and ETHUSDT pairs to derive a PENDLE-ETH price. I think it is fine to treat Binance ETH as equivalent to WETH, because what matters for this analysis is the relative price change.

I need to do the following work:
1. **Pull PENDLEUSDT and ETHUSDT Binance data.**
2. **Use Polars to join the data on the second to produce PENDLE-WETH pricing at second-level granularity.** We use Google BigQuery (BQ), and I don't think I can join on second-level data there, so the plan is to do the join in Polars instead.

Please note that I'm developing in a haphazard and suboptimal way right now; I'll go back and clean everything up later (maybe).

In [41]:
# dependencies

# !pip3 install polars
# !pip3 install seaborn
import polars as pl
import seaborn as sns

In [42]:
# see CEX_data_pull.sql. Looks like 8m rows and 400 MB.
# I've never worked with a dataset this large before, so pretty exciting.
# Still, not too big as I understand so should be approachable.

nance = pl.read_csv('data/2024.2.19 PENDLEUSDT WETHUSDT.csv')

In [43]:
# convert string to timestamp (note: all times are UTC)
nance = nance.with_columns(
    pl.coalesce(
        # first attempt: timestamps with a fractional-second part
        pl.col('timestamp')
            .str.strptime(pl.Datetime, '%Y-%m-%d %H:%M:%S%.f UTC', strict=False), # strict=False yields null instead of raising
        # second attempt: whole-second timestamps that have no decimal part;
        # coalesce keeps the first non-null result of the two `strptime` calls
        pl.col('timestamp')
            .str.strptime(pl.Datetime, '%Y-%m-%d %H:%M:%S UTC', strict=False)
    )
)
# truncate timestamp to seconds (the column keeps its name, so no alias is needed)
nance = nance.with_columns(
    pl.col("timestamp")
    .dt.truncate("1s")
)

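# pivot nance table into one column per token pair;
# 'mean' averages all prices that fall within the same second
# (newer Polars releases rename the `columns` argument to `on`)
nance = nance.pivot(
    values='price',
    index='timestamp',
    columns='symbol',
    aggregate_function='mean'
)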

nance.head()

timestamp,ETHUSDT,PENDLEUSDT
datetime[μs],f64,f64
2024-02-04 21:36:59,2308.10375,3.1367
2024-02-04 21:36:58,2308.1,
2024-02-04 21:36:57,2308.1,
2024-02-04 21:36:56,2308.11,
2024-02-04 21:36:55,2308.11,3.136738


In [44]:
# @Dev TODO is there a way to do this without creating a new dataframe?

# create a polars datetime range from min to max
dates = pl.datetime_range(
    nance.select(pl.min('timestamp')).item(),
    nance.select(pl.max('timestamp')).item(),
    interval='1s',
    closed='both',
    # eager evaluate it into a series
    eager=True
)

# create a new df with a continuous timeseries
# this will be my master df
df = pl.DataFrame({'timestamp': dates})
df.shape

(1814401, 1)

In [45]:
# left-join nance data onto the continuous timeseries; seconds with no trades stay null
df = df.join(
    nance,
    on='timestamp',
    how='left'
)
df.shape

(1814401, 3)
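
As a quick sanity check on that row count: with `closed='both'` the range includes both endpoints, so 1,814,401 rows means 1,814,400 one-second steps. The sketch below works that out with the standard library. It assumes the range starts at 2024-01-28 23:59:59, which is the first row of the `df.head(10)` output in the next cell; the implied end timestamp is derived from that assumption, not read from the data.

```python
from datetime import datetime, timedelta

rows = 1_814_401                              # df.shape[0] reported above
first = datetime(2024, 1, 28, 23, 59, 59)     # assumption: first row of df.head(10)

span = timedelta(seconds=rows - 1)            # both endpoints are counted, so steps = rows - 1
last = first + span

print(span.days, span.seconds)                # whole days and leftover seconds in the span
print(last)                                   # implied last timestamp of the grid
```

The span comes out to exactly 21 days, and the left join keeps the row count unchanged because every row of the grid is retained whether or not a trade matched.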

In [48]:
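# generate new column for PENDLE-ETH price (Binance ETH standing in for WETH)
df = df.with_columns((pl.col("PENDLEUSDT") / pl.col("ETHUSDT")).alias("nance-PENDLE-ETH"))
# the ratio is null in any second where either pair has no price; carry the last known ratio forward
df = df.with_columns(pl.col('nance-PENDLE-ETH').fill_null(strategy="forward"))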
df.head(10)

timestamp,ETHUSDT,PENDLEUSDT,nance-PENDLE-ETH
datetime[μs],f64,f64,f64
2024-01-28 23:59:59,2256.9,,
2024-01-29 00:00:00,2256.903922,2.2398,0.000992
2024-01-29 00:00:01,2256.905,,0.000992
2024-01-29 00:00:02,2256.900667,,0.000992
2024-01-29 00:00:03,2257.187832,2.240142,0.000992
2024-01-29 00:00:04,2257.78,2.2412,0.000993
2024-01-29 00:00:05,2257.782,2.24085,0.000993
2024-01-29 00:00:06,2257.316185,2.2409,0.000993
2024-01-29 00:00:07,2256.457917,,0.000993
2024-01-29 00:00:08,2256.323333,,0.000993
