# Modeling Problem

### "Given the market state *right now*, which coin is *relatively* most likely to outperform the others over the next 15 minutes?"
- *"At time $t$, choose the coin that will outperform all the others over $[t, t+15]$"*

At each time $t$:
- Observe $N$ assets (e.g. 10) at the same timestamp
- Each asset $i$ has a feature vector $x_t^i$
- We want to choose

$$\arg\max_i \mathbb{E}[r^i_{t+1} \mid x_t^i]$$

where $r^i_{t+1}$ is the return of asset $i$ over the next 15-minute bar.

This is not **time-series forecasting**. It is **cross-sectional ranking**.
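To make the selection rule concrete, here is a minimal sketch of the argmax step at a single timestamp. The `predicted_return` values are made up for illustration (they stand in for whatever model estimates $\mathbb{E}[r^i_{t+1} \mid x_t^i]$), and the three-symbol universe is only an example.

```python
import numpy as np

symbols = ["BTC", "ETH", "SOL"]

# Hypothetical model outputs: one estimated next-bar return per asset,
# all taken at the same timestamp t (values are illustrative only).
predicted_return = np.array([0.0004, -0.0010, 0.0012])

# The decision is simply the asset with the highest estimate.
best = symbols[int(np.argmax(predicted_return))]

# Only the relative ordering matters: shifting every score by the same
# constant (e.g. a market-wide move) leaves the chosen asset unchanged.
shifted_best = symbols[int(np.argmax(predicted_return + 0.05))]
```

The second lookup illustrates why this is a ranking problem: the absolute level of the predictions can be wrong as long as the order across assets is right.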

### Cross-Sectional Ranking

Instead of forecasting one asset at a time, we compare many assets at the same timestamp and predict which one will perform best.
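As a rough sketch of what a cross-sectional target could look like, the snippet below computes a next-bar return per symbol and then ranks symbols within each timestamp. It assumes a long-format frame with the `symbol_id`, `time_period_end`, and `price_close` columns used in this notebook; the `toy` data, the close-to-close return definition, and the `fwd_return` / `fwd_rank` column names are illustrative assumptions, not the notebook's actual labeling code.

```python
import pandas as pd

# Toy long-format data (made-up prices) with the notebook's column names.
toy = pd.DataFrame({
    "symbol_id": ["BTC", "BTC", "BTC", "ETH", "ETH", "ETH"],
    "time_period_end": ["2025-01-01 00:15", "2025-01-01 00:30", "2025-01-01 00:45"] * 2,
    "price_close": [100.0, 101.0, 100.0, 50.0, 51.0, 52.0],
})
toy = toy.sort_values(["symbol_id", "time_period_end"])

# Forward return: next bar's close relative to this bar's close, per symbol.
# The last bar of each symbol has no next bar, so its value is NaN.
next_close = toy.groupby("symbol_id")["price_close"].shift(-1)
toy["fwd_return"] = next_close / toy["price_close"] - 1

# Cross-sectional rank: compare symbols within the same timestamp (1 = best).
toy["fwd_rank"] = (
    toy.groupby("time_period_end")["fwd_return"]
       .rank(ascending=False, method="first")
)

# One winner per timestamp: the symbol with the highest forward return.
labeled = toy.dropna(subset=["fwd_return"])
winner_idx = labeled.groupby("time_period_end")["fwd_return"].idxmax()
winners = labeled.loc[winner_idx, ["time_period_end", "symbol_id"]]
```

The key design point is the second `groupby`: the return is computed along time within each symbol, but the rank is computed across symbols within each timestamp.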

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pyarrow  # not called directly; used as the engine for df.to_parquet below

df = pd.read_csv("../data/kraken_15min_6mo_ohlcv.csv")

mapping = {
    "KRAKEN_SPOT_BTC_USD": "BTC",
    "KRAKEN_SPOT_ETH_USD": "ETH",
    "KRAKEN_SPOT_SOL_USD": "SOL",
    "KRAKEN_SPOT_XRP_USD": "XRP",
    "KRAKEN_SPOT_ADA_USD": "ADA",
    "KRAKEN_SPOT_DOGE_USD": "DOGE",
    "KRAKEN_SPOT_LTC_USD": "LTC",
    "KRAKEN_SPOT_AVAX_USD": "AVAX",
    "KRAKEN_SPOT_LINK_USD": "LINK",
    "KRAKEN_SPOT_DOT_USD": "DOT",
}

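# Shorten the exchange-specific symbol IDs to plain tickers
df["symbol_id"] = df["symbol_id"].replace(mapping)
# Drop time_open/time_close; the period boundaries serve as the bar timestamps
df = df.drop(columns=["time_open", "time_close"])

time_cols = ["time_period_start", "time_period_end"]

# Normalize the period boundaries to minute-resolution strings
# (this zero-padded format still sorts chronologically)
for col in time_cols:
    df[col] = pd.to_datetime(df[col]).dt.strftime("%Y-%m-%d %H:%M")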

df

Unnamed: 0,symbol_id,time_period_start,time_period_end,price_open,price_high,price_low,price_close,volume_traded,trades_count
0,BTC,2025-06-17 22:30,2025-06-17 22:45,104575.30000,104604.80000,104494.70000,104604.70000,2.644289,162
1,BTC,2025-06-17 22:45,2025-06-17 23:00,104604.80000,104656.00000,104232.30000,104248.20000,147.599820,456
2,BTC,2025-06-17 23:00,2025-06-17 23:15,104248.20000,104434.20000,104234.60000,104434.10000,4.789123,234
3,BTC,2025-06-17 23:15,2025-06-17 23:30,104434.10000,104518.10000,104434.00000,104515.50000,5.567111,154
4,BTC,2025-06-17 23:30,2025-06-17 23:45,104515.60000,104700.70000,104515.60000,104700.70000,5.864958,176
...,...,...,...,...,...,...,...,...,...
172207,DOT,2025-12-14 21:30,2025-12-14 21:45,1.96290,1.96375,1.95600,1.95755,1682.287502,8
172208,DOT,2025-12-14 21:45,2025-12-14 22:00,1.95770,1.95775,1.94925,1.95370,2242.631221,11
172209,DOT,2025-12-14 22:00,2025-12-14 22:15,1.95400,1.96370,1.95400,1.96015,165.572558,14
172210,DOT,2025-12-14 22:15,2025-12-14 22:30,1.96015,1.96095,1.95265,1.95565,2547.636277,7


# Cleaning/Alignment

At each time step, you have a **cross-sectional snapshot** of the market. Thus:
1. Sort from earliest timestamp to latest, then by symbol
2. Ensure timestamps align
    1. `df.groupby("time_period_end")`: group all rows by timestamp
    2. `["symbol_id"].nunique()`: within each timestamp group, count the number of unique symbols

In [2]:
df = df.sort_values(["time_period_end", "symbol_id"]).reset_index(drop=True)
df.head(10)

Unnamed: 0,symbol_id,time_period_start,time_period_end,price_open,price_high,price_low,price_close,volume_traded,trades_count
0,ADA,2025-06-17 22:30,2025-06-17 22:45,0.607916,0.608179,0.606332,0.606994,269705.3,128
1,AVAX,2025-06-17 22:30,2025-06-17 22:45,18.46,18.46,18.42,18.42,102.8341,19
2,BTC,2025-06-17 22:30,2025-06-17 22:45,104575.3,104604.8,104494.7,104604.7,2.644289,162
3,DOGE,2025-06-17 22:30,2025-06-17 22:45,0.169588,0.169588,0.16916,0.16916,2071852.0,72
4,DOT,2025-06-17 22:30,2025-06-17 22:45,3.7079,3.7089,3.7024,3.7089,801.3484,14
5,ETH,2025-06-17 22:30,2025-06-17 22:45,2510.22,2511.45,2506.96,2510.36,66.88134,57
6,LINK,2025-06-17 22:30,2025-06-17 22:45,12.91888,12.91907,12.89135,12.91045,2360.67,31
7,LTC,2025-06-17 22:30,2025-06-17 22:45,84.08,84.13,83.96,83.99,86.07405,39
8,SOL,2025-06-17 22:30,2025-06-17 22:45,147.23,147.27,146.97,147.03,307.2505,50
9,XRP,2025-06-17 22:30,2025-06-17 22:45,2.15341,2.15341,2.14695,2.15131,228398.3,293


In [3]:
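# Distinct symbols per timestamp; min and max both equal 10 when every bar is aligned.
# Printed explicitly because a cell only displays its last expression.
print(df.groupby("time_period_end")["symbol_id"].nunique().describe())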

df.to_parquet(
    '../data/data1.parquet',
    engine="pyarrow",
    compression="snappy"
)
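The count check above only reports whether timestamps are aligned; it does not change the data. If some timestamps turn out to be missing symbols, one possible follow-up (an assumption, not a step taken in this notebook) is to keep only the timestamps where every symbol is present. The sketch below uses made-up `toy` data and hypothetical names `incomplete` and `aligned`.

```python
import pandas as pd

# Toy data: the last timestamp is missing ETH (values are illustrative only).
toy = pd.DataFrame({
    "symbol_id": ["BTC", "ETH", "BTC", "ETH", "BTC"],
    "time_period_end": [
        "2025-01-01 00:15", "2025-01-01 00:15",
        "2025-01-01 00:30", "2025-01-01 00:30",
        "2025-01-01 00:45",
    ],
    "price_close": [100.0, 50.0, 101.0, 51.0, 102.0],
})

# Size of the full universe of symbols.
n_symbols = toy["symbol_id"].nunique()

# Per-row count of distinct symbols sharing that row's timestamp.
counts = toy.groupby("time_period_end")["symbol_id"].transform("nunique")

# Timestamps with at least one symbol missing, kept for inspection.
incomplete = sorted(toy.loc[counts < n_symbols, "time_period_end"].unique())

# Keep only complete cross-sections.
aligned = toy[counts == n_symbols].reset_index(drop=True)
```

Dropping incomplete timestamps is the simplest choice; whether it is appropriate depends on how many bars are affected, which the `describe()` summary above would indicate.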