### "Given the market state *right now*, which coin is *relatively* most likely to outperform the others over the next 15 minutes?"

At each time $t$:
- Observe $N$ assets (e.g. 10) at the same timestamp
- Each asset $i$ has a feature vector $x_t^i$
- We want to choose

$$\arg \max_i \mathbb{E}[r^i_{t+1} \mid x_t^i]$$

where $r^i_{t+1}$ is the return of asset $i$ over the next period.

This is not **time-series forecasting**. It is **cross-sectional ranking**.
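
To make the distinction concrete, here is a minimal sketch of the selection step, assuming some model has already produced one score per asset per timestamp. The `scores` frame, the `pred_return` column, and the values in it are hypothetical and only illustrate the arg max over assets.

```python
import pandas as pd

# Hypothetical model output: one predicted next-period return per (timestamp, symbol).
scores = pd.DataFrame({
    "timestamp": ["t0", "t0", "t0", "t1", "t1", "t1"],
    "symbol_id": ["BTC", "ETH", "SOL", "BTC", "ETH", "SOL"],
    "pred_return": [0.001, 0.003, -0.002, 0.000, -0.001, 0.004],
})

# Rank assets within each timestamp (1 = highest predicted return).
scores["rank"] = (
    scores.groupby("timestamp")["pred_return"]
    .rank(ascending=False, method="first")
)

# The arg max over assets: the top-ranked symbol at each timestamp.
top_idx = scores.groupby("timestamp")["pred_return"].idxmax()
picks = scores.loc[top_idx].set_index("timestamp")["symbol_id"]
print(picks)
```

Only the ordering within a timestamp matters here: adding the same constant to every score at `t0` would leave the pick unchanged, which is why the problem is a ranking problem rather than a forecast of each series in isolation.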

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pyarrow

df = pd.read_csv("../data/kraken_15min_6mo_ohlcv.csv")

mapping = {
    "KRAKEN_SPOT_BTC_USD": "BTC",
    "KRAKEN_SPOT_ETH_USD": "ETH",
    "KRAKEN_SPOT_SOL_USD": "SOL",
    "KRAKEN_SPOT_XRP_USD": "XRP",
    "KRAKEN_SPOT_ADA_USD": "ADA",
    "KRAKEN_SPOT_DOGE_USD": "DOGE",
    "KRAKEN_SPOT_LTC_USD": "LTC",
    "KRAKEN_SPOT_AVAX_USD": "AVAX",
    "KRAKEN_SPOT_LINK_USD": "LINK",
    "KRAKEN_SPOT_DOT_USD": "DOT",
}

df["symbol_id"] = df["symbol_id"].replace(mapping)   
 
df.head()

Unnamed: 0,symbol_id,time_period_start,time_period_end,time_open,time_close,price_open,price_high,price_low,price_close,volume_traded,trades_count
0,BTC,2025-06-17T22:30:00.0000000Z,2025-06-17T22:45:00.0000000Z,2025-06-17T22:34:01.7527859Z,2025-06-17T22:44:54.0998051Z,104575.3,104604.8,104494.7,104604.7,2.644289,162
1,BTC,2025-06-17T22:45:00.0000000Z,2025-06-17T23:00:00.0000000Z,2025-06-17T22:45:02.1713569Z,2025-06-17T22:59:56.0955209Z,104604.8,104656.0,104232.3,104248.2,147.59982,456
2,BTC,2025-06-17T23:00:00.0000000Z,2025-06-17T23:15:00.0000000Z,2025-06-17T23:00:00.0414009Z,2025-06-17T23:14:09.8272209Z,104248.2,104434.2,104234.6,104434.1,4.789123,234
3,BTC,2025-06-17T23:15:00.0000000Z,2025-06-17T23:30:00.0000000Z,2025-06-17T23:15:03.9804720Z,2025-06-17T23:29:54.1752009Z,104434.1,104518.1,104434.0,104515.5,5.567111,154
4,BTC,2025-06-17T23:30:00.0000000Z,2025-06-17T23:45:00.0000000Z,2025-06-17T23:30:06.4436750Z,2025-06-17T23:44:47.1663300Z,104515.6,104700.7,104515.6,104700.7,5.864958,176


## Data

At each time step, we have a **cross-sectional snapshot** of the market. Thus:
1. Sort from earliest timestamp to latest, then by symbol
2. Check that timestamps align across symbols
    1. `df.groupby("time_close")`: group all rows by timestamp
    2. `["symbol_id"].nunique()`: within each group, count the distinct symbols present at that timestamp

In [2]:
df = df.sort_values(["time_close", "symbol_id"]).reset_index(drop=True)
df.head(3)

Unnamed: 0,symbol_id,time_period_start,time_period_end,time_open,time_close,price_open,price_high,price_low,price_close,volume_traded,trades_count
0,DOT,2025-06-17T22:30:00.0000000Z,2025-06-17T22:45:00.0000000Z,2025-06-17T22:34:13.8209159Z,2025-06-17T22:42:07.2504830Z,3.7079,3.7089,3.7024,3.7089,801.348419,14
1,LINK,2025-06-17T22:30:00.0000000Z,2025-06-17T22:45:00.0000000Z,2025-06-17T22:34:06.6370680Z,2025-06-17T22:42:25.6043870Z,12.91888,12.91907,12.89135,12.91045,2360.670279,31
2,ADA,2025-06-17T22:30:00.0000000Z,2025-06-17T22:45:00.0000000Z,2025-06-17T22:34:02.1722350Z,2025-06-17T22:44:30.1537210Z,0.607916,0.608179,0.606332,0.606994,269705.314157,128


In [3]:
df.groupby("time_close")["symbol_id"].nunique().describe()

count    172208.000000
mean          1.000023
std           0.004819
min           1.000000
25%           1.000000
50%           1.000000
75%           1.000000
max           2.000000
Name: symbol_id, dtype: float64
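
A mean of roughly one symbol per group suggests that `time_close` is not a shared timestamp. In the sample rows above it differs between symbols within the same bar (it looks like the time of the last trade), while `time_period_end` is identical across them. The following is a sketch of the same check keyed on the bar boundary, using a small illustrative frame rather than the real data; the toy values are assumptions.

```python
import pandas as pd

# Toy frame mirroring the columns above; values are illustrative, not real data.
toy = pd.DataFrame({
    "symbol_id": ["DOT", "LINK", "ADA", "DOT", "LINK"],
    "time_period_end": [
        "2025-06-17T22:45:00Z", "2025-06-17T22:45:00Z", "2025-06-17T22:45:00Z",
        "2025-06-17T23:00:00Z", "2025-06-17T23:00:00Z",
    ],
    "time_close": [
        "2025-06-17T22:42:07Z", "2025-06-17T22:42:25Z", "2025-06-17T22:44:30Z",
        "2025-06-17T22:59:01Z", "2025-06-17T22:58:40Z",
    ],
})

# Grouping on the per-symbol trade time yields one symbol per group.
per_trade_time = toy.groupby("time_close")["symbol_id"].nunique()

# Grouping on the bar boundary yields the cross-section size per bar.
per_bar = toy.groupby("time_period_end")["symbol_id"].nunique()

# Bars in which at least one symbol is missing.
n_symbols = toy["symbol_id"].nunique()
incomplete = per_bar[per_bar < n_symbols]
print(per_bar)
print(incomplete)
```

On the real data, `per_bar.describe()` would show how many bars contain the full set of symbols; whether to drop or fill the incomplete bars is a separate modelling decision.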

## Target Definition

#### Pooled Model With Scale-Invariant Features

Fit one pooled model across all assets, using only scale-free features.
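
Raw prices differ by orders of magnitude between coins (about 104,000 for BTC versus about 0.61 for ADA in the rows above), so a pooled model needs inputs that do not depend on the price level. Below is a minimal sketch of three such features. The feature names `ret_1`, `range_pct` and `rel_volume` are assumptions chosen for illustration, and the toy ADA rows are the BTC rows rescaled so that the invariance is visible.

```python
import numpy as np
import pandas as pd

# Toy data: the ADA rows are the BTC rows multiplied by a constant.
toy = pd.DataFrame({
    "symbol_id": ["BTC"] * 3 + ["ADA"] * 3,
    "price_high": [101.0, 103.0, 102.0, 0.606, 0.618, 0.612],
    "price_low": [99.0, 100.0, 100.0, 0.594, 0.600, 0.600],
    "price_close": [100.0, 102.0, 101.0, 0.600, 0.612, 0.606],
    "volume_traded": [2.0, 4.0, 3.0, 2000.0, 4000.0, 3000.0],
})

g = toy.groupby("symbol_id", sort=False)

# One-bar log return, computed within each symbol so rows never mix assets.
toy["ret_1"] = g["price_close"].transform(lambda s: np.log(s).diff())

# High-low range as a fraction of the close.
toy["range_pct"] = (toy["price_high"] - toy["price_low"]) / toy["price_close"]

# Volume relative to the symbol's own expanding mean (uses past and current bars only).
toy["rel_volume"] = toy["volume_traded"] / g["volume_traded"].transform(
    lambda s: s.expanding().mean()
)

features = ["ret_1", "range_pct", "rel_volume"]
print(toy[["symbol_id"] + features])
```

Both symbols end up with the same feature values even though their prices and volumes differ by a constant factor, which is the property a pooled model relies on.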

# Analysis

In [None]:
df_raw.describe()

# Sanity Checks

* `price_high >= max(price_open, price_close)`
* `price_low <= min(price_open, price_close)`


In [None]:
(df_raw["price_high"] < df_raw[["price_open", "price_close"]].max(axis=1)).any()
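
The cell above covers only the first condition. A sketch that checks both bounds and returns the offending rows could look like this; the toy frame and its deliberately invalid rows are assumptions made for the example.

```python
import pandas as pd

# Toy OHLC rows; rows 1 and 2 are deliberately invalid.
toy = pd.DataFrame({
    "price_open": [10.0, 10.5, 10.2],
    "price_high": [10.6, 10.4, 10.3],   # row 1: high is below the open
    "price_low": [9.9, 10.1, 10.25],    # row 2: low is above the open
    "price_close": [10.5, 10.2, 10.3],
})

body_max = toy[["price_open", "price_close"]].max(axis=1)
body_min = toy[["price_open", "price_close"]].min(axis=1)

bad_high = toy["price_high"] < body_max           # high must bound open and close from above
bad_low = toy["price_low"] > body_min             # low must bound open and close from below
bad_order = toy["price_high"] < toy["price_low"]  # high must not be below low

violations = toy[bad_high | bad_low | bad_order]
print(violations)
```

Returning the rows rather than a single boolean makes it easier to see whether violations cluster in one symbol or one time range.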

# Target Definition

### Next Return

$$y = \log\left(\frac{C_{t+1}}{C_t}\right)$$

Other options:
- Future k-step return
- Future volatility
- Future price delta

If the target is:
- Heavy-tailed → robust losses
- Near-zero mean → very low signal-to-noise
- Non-stationary → differencing required
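
The next-return target can be sketched as follows for a long-format frame with several symbols. The shift is done per symbol so that the last bar of one coin is never paired with the first bar of another. The column `y_rel` (the return minus the cross-sectional mean at the same bar) is an assumption added to match the "relatively most likely to outperform" framing, and the toy prices are illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "time_period_end": ["t0", "t0", "t1", "t1", "t2", "t2"],
    "symbol_id": ["BTC", "ETH", "BTC", "ETH", "BTC", "ETH"],
    "price_close": [100.0, 10.0, 110.0, 10.0, 110.0, 11.0],
})
toy = toy.sort_values(["symbol_id", "time_period_end"]).reset_index(drop=True)

# y = log(C_{t+1} / C_t), with C_{t+1} taken from the same symbol.
next_close = toy.groupby("symbol_id")["price_close"].shift(-1)
toy["y"] = np.log(next_close / toy["price_close"])

# Relative target: each asset's return minus the mean return of all assets in that bar.
toy["y_rel"] = toy["y"] - toy.groupby("time_period_end")["y"].transform("mean")
print(toy)
```

The last bar of each symbol has no next close, so its target is `NaN` and has to be dropped before training.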

# Price Dynamics

Look for:
- Volatility clustering
- Fat tails
- Regime changes

In [None]:
close = df_raw["price_close"]
returns = np.log(close).diff()  # one-bar log returns

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
ax1.plot(close)
ax1.set_title('Price Close')
ax1.set_xlabel('Time')
ax1.set_ylabel('Price')

ax2.hist(returns.dropna(), bins=200)
ax2.set_title('Distribution of Returns')
ax2.set_xlabel('Returns')
ax2.set_ylabel('Frequency')

plt.tight_layout()  
plt.show()

In [None]:
plt.plot(df_raw["price_close"])

In [None]:
returns = np.log(df_raw["price_close"]).diff()
plt.hist(returns.dropna(), bins=200);
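
The plots can be complemented with two numbers. The sketch below uses excess kurtosis as a rough fat-tail indicator and the lag-1 autocorrelation of squared returns as a rough volatility-clustering indicator. The helper name `tail_and_clustering_stats` and the synthetic two-regime series are assumptions made for illustration; on the notebook's data the `returns` series would be passed in instead.

```python
import numpy as np
import pandas as pd


def tail_and_clustering_stats(returns: pd.Series) -> dict:
    """Summarise tail heaviness and volatility clustering of a return series."""
    r = returns.dropna()
    return {
        # Excess kurtosis: about 0 for a normal distribution, positive for fat tails.
        "excess_kurtosis": r.kurt(),
        # Autocorrelation of squared returns: positive when large moves follow large moves.
        "sq_autocorr_lag1": (r ** 2).autocorr(lag=1),
    }


# Synthetic series with a calm regime followed by a volatile one.
rng = np.random.default_rng(0)
calm = rng.normal(0.0, 0.001, size=2000)
stormy = rng.normal(0.0, 0.005, size=2000)
synthetic = pd.Series(np.concatenate([calm, stormy]))

stats = tail_and_clustering_stats(synthetic)
print(stats)
```

A regime change in volatility, as in the synthetic series, is enough to produce both effects, so these numbers do not distinguish between the three items in the list on their own.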

In [None]:
import pandas_ta as ta

df_raw["EMA10"]   = ta.ema(df_raw["price_close"], length=10)
df_raw["EMA30"]   = ta.ema(df_raw["price_close"], length=30)
df_raw["EMA200"]  = ta.ema(df_raw["price_close"], length=200)

df_raw["RSI14"]   = ta.rsi(df_raw["price_close"], length=14)
df_raw["RSI30"]   = ta.rsi(df_raw["price_close"], length=30)
df_raw["RSI200"]  = ta.rsi(df_raw["price_close"], length=200)

df_raw["MOM10"]   = ta.mom(df_raw["price_close"], length=10)
df_raw["MOM30"]   = ta.mom(df_raw["price_close"], length=30)

df_raw["PROC9"]   = ta.roc(df_raw["price_close"], length=9)

df_raw["MACD"]    = ta.macd(df_raw["price_close"])["MACD_12_26_9"]

stoch10           = ta.stoch(df_raw["price_high"], df_raw["price_low"], df_raw["price_close"], k=10, d=3)
df_raw["K10"]     = stoch10["STOCHk_10_3_3"]

stoch30           = ta.stoch(df_raw["price_high"], df_raw["price_low"], df_raw["price_close"], k=30, d=3)
df_raw["K30"]     = stoch30["STOCHk_30_3_3"]

stoch200          = ta.stoch(df_raw["price_high"], df_raw["price_low"], df_raw["price_close"], k=200, d=3)
df_raw["K200"]    = stoch200["STOCHk_200_3_3"]

df_raw = df_raw.dropna().reset_index(drop=True)


In [None]:
class MinMaxScaler:

    def __init__(self):
        self.min = None
        self.max = None

    def fit(self, X):
        self.min = np.min(X, axis=0)
        self.max = np.max(X, axis=0)
        
        return self

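    def transform(self, X):
        # Guard against constant columns, where max == min would divide by zero.
        span = self.max - self.min
        span = np.where(span == 0, 1.0, span)
        return (X - self.min) / span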

    def fit_transform(self, X):
        """
        Fit using X and then transform it. Useful when we need to scale just once.
        """
        self.fit(X)
        return self.transform(X)

scaler = MinMaxScaler()
df_scaled = pd.DataFrame(
    scaler.fit_transform(df_raw),
    columns=df_raw.columns,
    index=df_raw.index
)

df_scaled.head()
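
Fitting the scaler on the full history lets the minimum and maximum of future bars leak into earlier rows. A common alternative, sketched here with plain pandas on a toy frame, is to split chronologically, take the statistics from the training part only, and apply them to both parts. The 60/40 split and the column names `f1` and `f2` are assumptions.

```python
import pandas as pd

# Toy feature frame in chronological order.
toy = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [10.0, 10.0, 10.0, 10.0, 12.0],
})

# Chronological split: the first 60% of rows are used for fitting.
split = int(len(toy) * 0.6)
train, test = toy.iloc[:split], toy.iloc[split:]

# Statistics come from the training rows only.
lo, hi = train.min(), train.max()
span = (hi - lo).replace(0, 1.0)  # constant columns: avoid division by zero

train_scaled = (train - lo) / span
test_scaled = (test - lo) / span
print(test_scaled)
```

Test values can fall outside `[0, 1]` under this scheme, which is expected: it reflects that later data moved beyond the range seen during fitting.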

In [None]:
df_scaled.to_parquet(
    '../data/data1.parquet',
    engine="pyarrow",
    compression="snappy"
)

## SCALING

In [None]:
scaler = MinMaxScaler()

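# Scale numeric columns only; symbol and timestamp columns are strings.
df_scaled = scaler.fit_transform(df.select_dtypes(include="number"))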

df_scaled.head()

## PCA



In [None]:
close_fwd_1 = 

$$\mathbf{x}^{(i)}=\begin{bmatrix}\mathcal{C}_{p}^{(i)}\\
\mathcal{V}^{(i)}\\
\text{QAV}^{(i)}\\
\text{NOT}^{(i)}\\
\text{TBBV}^{(i)}\\
\text{RSI}_{14}^{(i)}\\
\text{RSI}_{30}^{(i)}\\
\text{RSI}_{200}^{(i)}\\
\text{MOM}_{10}^{(i)}\\
\text{MOM}_{30}^{(i)}\\
\text{MACD}^{(i)}\\
\text{PROC}_{9}^{(i)}\\
\text{EMA}_{10}^{(i)}\\
\text{EMA}_{30}^{(i)}\\
\text{EMA}_{200}^{(i)}\\
\%K_{10}^{(i)}\\
\%K_{30}^{(i)}\\
\%K_{200}^{(i)}\\
\end{bmatrix},\quad\mathbf{x}^{(i)}\in\mathbb{R}^{n}$$

In [None]:
import pandas_ta as ta


RSI 3

In [None]:
# Drop initial rows with NaNs from long windows (e.g., EMA200, RSI200, K200)
df_raw = df_raw.dropna().reset_index(drop=True)

# Select the 18 features in the paper
feature_cols = [
    "price_open",
    "price_high",
    "price_low",
    "volume_traded",
    "trades_count",
    "EMA10",
    "EMA30",
    "EMA200",
    "RSI14",
    "RSI30",
    "RSI200",
    "MOM10",
    "MOM30",
    "PROC9",
    "MACD",
    "K10",
    "K30",
    "K200"
]

target = "price_close"

X = df_raw[feature_cols]