# Hydra V3 - Improved ML Model Training

## Key Improvements over original:
1. **Enhanced Features**: 25+ features vs 7 original
2. **Fee-Adjusted Labels**: Target considers 0.08% trading fees
3. **Purged Walk-Forward CV**: Prevents temporal leakage
4. **Ensemble Averaging**: Uses all fold models, not just last
5. **Cross-Sectional Features**: Rank-based features across symbols
6. **Momentum Features**: Rate of change in order flow
7. **Time Features**: Hour-of-day effects
8. **Better Hyperparameters**: Tuned for edge detection

In [8]:
!pip install pandas numpy requests pyarrow lightgbm scikit-learn tqdm scipy

Defaulting to user installation because normal site-packages is not writeable



[notice] A new release of pip is available: 25.0.1 -> 25.3
[notice] To update, run: python.exe -m pip install --upgrade pip


In [9]:
import pandas as pd
import numpy as np
import requests
import io
import zipfile
import time
from datetime import datetime, timedelta, timezone
from tqdm import tqdm
from typing import List, Tuple, Dict
import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.preprocessing import StandardScaler
from scipy import stats

import lightgbm as lgb

In [10]:
# Configuration
PAIRS = [
    "BTCUSDT",
    "ETHUSDT",
    "SOLUSDT",
    "BNBUSDT",
    "XRPUSDT",
    "DOGEUSDT",
    "LTCUSDT",
    "ADAUSDT",
]

DAYS = 14  # More data for better generalization
FEE_PCT = 0.0004  # 0.04% per side = 0.08% round trip
MIN_EDGE_PCT = 0.0016  # Minimum 0.16% move to be profitable after fees

## 1. Data Fetching

In [11]:
def fetch_aggtrades_day(symbol: str, date: datetime) -> pd.DataFrame:
    """Fetch aggregated trades for a single day"""
    date_str = date.strftime("%Y-%m-%d")
    url = (
        f"https://data.binance.vision/data/futures/um/daily/aggTrades/"
        f"{symbol}/{symbol}-aggTrades-{date_str}.zip"
    )
    
    r = requests.get(url)
    if r.status_code != 200:
        print(f"Missing {symbol} {date_str}")
        return None
    
    z = zipfile.ZipFile(io.BytesIO(r.content))
    csv_name = z.namelist()[0]
    df = pd.read_csv(z.open(csv_name))
    df["symbol"] = symbol
    df["date"] = date_str
    return df


def fetch_all_data(pairs: List[str], days: int) -> pd.DataFrame:
    """Fetch trade data for all pairs"""
    all_dfs = []
    end_date = datetime.now(timezone.utc).date() - timedelta(days=1)
    start_date = end_date - timedelta(days=days)
    
    for symbol in pairs:
        print(f"\nFetching {symbol}")
        for i in tqdm(range(days)):
            day = start_date + timedelta(days=i)
            df_day = fetch_aggtrades_day(symbol, day)
            if df_day is not None:
                all_dfs.append(df_day)
    
    df = pd.concat(all_dfs, ignore_index=True)
    
    # Clean and prepare
    df = df.rename(columns={
        "price": "price",
        "quantity": "qty",
        "transact_time": "timestamp",
        "is_buyer_maker": "is_sell"
    })
    df["timestamp"] = pd.to_datetime(df["timestamp"], unit="ms")
    df["price"] = df["price"].astype("float32")
    df["qty"] = df["qty"].astype("float32")
    df["is_sell"] = df["is_sell"].astype("int8")
    df = df.sort_values("timestamp").reset_index(drop=True)
    
    print(f"\nTotal rows: {len(df):,}")
    return df

In [12]:
df = fetch_all_data(PAIRS, DAYS)
print(df.shape)
df.head()


Fetching BTCUSDT


100%|██████████| 14/14 [00:44<00:00,  3.16s/it]



Fetching ETHUSDT


100%|██████████| 14/14 [00:50<00:00,  3.63s/it]



Fetching SOLUSDT


100%|██████████| 14/14 [00:24<00:00,  1.74s/it]



Fetching BNBUSDT


100%|██████████| 14/14 [00:24<00:00,  1.72s/it]



Fetching XRPUSDT


100%|██████████| 14/14 [00:25<00:00,  1.80s/it]



Fetching DOGEUSDT


100%|██████████| 14/14 [00:24<00:00,  1.76s/it]



Fetching LTCUSDT


100%|██████████| 14/14 [00:16<00:00,  1.18s/it]



Fetching ADAUSDT


100%|██████████| 14/14 [00:20<00:00,  1.46s/it]



Total rows: 58,959,932
(58959932, 9)


Unnamed: 0,agg_trade_id,price,qty,first_trade_id,last_trade_id,timestamp,is_sell,symbol,date
0,983748856,0.11742,830.0,3162451856,3162451861,2026-01-01 00:00:00.033,0,DOGEUSDT,2026-01-01
1,983748857,0.11741,51.0,3162451862,3162451862,2026-01-01 00:00:00.740,1,DOGEUSDT,2026-01-01
2,712080771,1.8411,1.9,2897077972,2897077972,2026-01-01 00:00:01.589,0,XRPUSDT,2026-01-01
3,548850955,0.3333,20.0,1800075962,1800075962,2026-01-01 00:00:01.690,0,ADAUSDT,2026-01-01
4,1018572200,124.580002,6.43,3047799741,3047799742,2026-01-01 00:00:01.696,1,SOLUSDT,2026-01-01


## 2. Enhanced Feature Engineering

In [13]:
def compute_enhanced_features(df_sym: pd.DataFrame, symbol: str) -> pd.DataFrame:
    """
    Compute enhanced feature set for a single symbol.
    
    Features:
    - Order flow: MOI, delta velocity, aggression persistence
    - Momentum: Rate of change in order flow
    - Absorption: Price impact, absorption z-score
    - Volatility: Multi-window vol, vol regime
    - Structure: LVN distance, price level density
    - Time: Hour of day, day of week effects
    - Cross-sectional: Will be added later across symbols
    """
    df_sym = df_sym.copy()
    df_sym["timestamp"] = pd.to_datetime(df_sym["timestamp"])
    df_sym = df_sym.sort_values("timestamp")
    df_sym["signed_qty"] = np.where(df_sym["is_sell"], -df_sym["qty"], df_sym["qty"])
    
    # Resample to 250ms bars - removed VWAP (not used in features)
    bars = (
        df_sym
        .set_index("timestamp")
        .resample("250ms")
        .agg(
            price=("price", "last"),
            qty=("qty", "sum"),
            signed_qty=("signed_qty", "sum"),
            trade_count=("qty", "count"),
        )
        .dropna(subset=["price"])
    )
    bars["price"] = bars["price"].ffill()
    bars = bars.reset_index()
    bars["symbol"] = symbol
    
    # ============ ORDER FLOW FEATURES ============
    # MOI at different windows
    bars["MOI_250ms"] = bars["signed_qty"].rolling(1).sum()
    bars["MOI_1s"] = bars["signed_qty"].rolling(4).sum()
    bars["MOI_5s"] = bars["signed_qty"].rolling(20).sum()
    
    # MOI statistics
    bars["MOI_std"] = bars["MOI_1s"].rolling(100).std()
    bars["MOI_z"] = bars["MOI_1s"].abs() / (bars["MOI_std"] + 1e-6)
    
    # Delta velocity (momentum of order flow)
    bars["delta_velocity"] = bars["MOI_1s"].diff()
    bars["delta_velocity_5s"] = bars["MOI_1s"].diff(20)  # 5-second momentum
    
    # Aggression persistence
    abs_moi = bars["MOI_1s"].abs()
    mean_moi = abs_moi.rolling(100).mean()
    std_moi = abs_moi.rolling(100).std()
    bars["AggressionPersistence"] = mean_moi / (std_moi + 1e-6)
    
    # MOI flip rate (sign changes per minute)
    moi_sign = np.sign(bars["MOI_1s"])
    sign_change = (moi_sign != moi_sign.shift(1)).astype(int)
    bars["MOI_flip_rate"] = sign_change.rolling(240).sum()  # 60 seconds
    
    # ============ ORDER FLOW MOMENTUM ============
    bars["MOI_roc_1s"] = bars["MOI_1s"].pct_change(4).clip(-10, 10)  # Rate of change
    bars["MOI_roc_5s"] = bars["MOI_1s"].pct_change(20).clip(-10, 10)
    bars["MOI_acceleration"] = bars["delta_velocity"].diff()  # Second derivative
    
    # ============ ABSORPTION FEATURES ============
    price_change = bars["price"].diff().abs().clip(lower=1e-6)
    bars["absorption_raw"] = bars["qty"] / price_change
    bars["absorption_z"] = (
        (bars["absorption_raw"] - bars["absorption_raw"].rolling(500).mean()) /
        (bars["absorption_raw"].rolling(500).std() + 1e-6)
    )
    
    # Price impact (inverse of absorption)
    bars["price_impact"] = price_change / (bars["qty"] + 1e-6)
    bars["price_impact_z"] = (
        (bars["price_impact"] - bars["price_impact"].rolling(500).mean()) /
        (bars["price_impact"].rolling(500).std() + 1e-6)
    )
    
    # ============ VOLATILITY FEATURES ============
    bars["ret"] = bars["price"].pct_change()
    bars["vol_1m"] = bars["ret"].rolling(240).std()   # 1 minute
    bars["vol_5m"] = bars["ret"].rolling(1200).std()  # 5 minutes
    bars["vol_15m"] = bars["ret"].rolling(3600).std() # 15 minutes
    
    # Volatility ratio (expansion detection)
    bars["vol_ratio"] = bars["vol_1m"] / (bars["vol_5m"] + 1e-8)
    
    # Volatility regime rank
    bars["vol_rank"] = bars["vol_5m"].rolling(2000).rank(pct=True)
    bars["vol_regime"] = pd.cut(
        bars["vol_rank"],
        bins=[-np.inf, 0.3, 0.7, np.inf],
        labels=["LOW", "MID", "HIGH"]
    )
    
    # ============ STRUCTURE FEATURES ============
    # LVN (Low Volume Node) detection
    BIN_SIZE = 10
    LVN_BLOCK = 1200  # 5 minute blocks
    
    bars["price_bin"] = (bars["price"] / BIN_SIZE).round() * BIN_SIZE
    lvn_price = np.full(len(bars), np.nan)
    poc_price = np.full(len(bars), np.nan)  # Point of Control
    
    for i in range(0, len(bars), LVN_BLOCK):
        window = bars.iloc[i:i+LVN_BLOCK]
        if window["qty"].sum() == 0:
            continue
        vp = window.groupby("price_bin")["qty"].sum()
        lvn_price[i:i+LVN_BLOCK] = vp.idxmin()
        poc_price[i:i+LVN_BLOCK] = vp.idxmax()
    
    bars["LVN_price"] = lvn_price
    bars["POC_price"] = poc_price
    bars["dist_lvn"] = (bars["price"] - bars["LVN_price"]).abs()
    bars["dist_poc"] = (bars["price"] - bars["POC_price"]).abs()
    
    # Normalize by ATR
    atr_5m = bars["ret"].abs().rolling(1200).mean() * bars["price"]
    bars["dist_lvn_atr"] = bars["dist_lvn"] / (atr_5m + 1e-6)
    bars["dist_poc_atr"] = bars["dist_poc"] / (atr_5m + 1e-6)
    
    # ============ TIME FEATURES ============
    bars["hour"] = bars["timestamp"].dt.hour
    bars["hour_sin"] = np.sin(2 * np.pi * bars["hour"] / 24)
    bars["hour_cos"] = np.cos(2 * np.pi * bars["hour"] / 24)
    bars["day_of_week"] = bars["timestamp"].dt.dayofweek
    bars["is_weekend"] = (bars["day_of_week"] >= 5).astype(int)
    
    # ============ TRADE INTENSITY ============
    bars["trade_intensity"] = bars["trade_count"].rolling(100).mean()
    bars["trade_intensity_z"] = (
        (bars["trade_count"] - bars["trade_count"].rolling(500).mean()) /
        (bars["trade_count"].rolling(500).std() + 1e-6)
    )
    
    # ============ CUMULATIVE FEATURES ============
    # Cumulative delta (order flow pressure)
    bars["cum_delta_1m"] = bars["signed_qty"].rolling(240).sum()
    bars["cum_delta_5m"] = bars["signed_qty"].rolling(1200).sum()
    
    return bars

In [14]:
# Process all symbols
all_bars = []

for symbol in PAIRS:
    print(f"\nProcessing {symbol}")
    df_sym = df[df["symbol"] == symbol].copy()
    bars = compute_enhanced_features(df_sym, symbol)
    all_bars.append(bars)
    del df_sym
    import gc; gc.collect()

# Combine all
df_bars = pd.concat(all_bars, ignore_index=True)
print(f"\nTotal bars: {len(df_bars):,}")


Processing BTCUSDT

Processing ETHUSDT

Processing SOLUSDT

Processing BNBUSDT

Processing XRPUSDT

Processing DOGEUSDT

Processing LTCUSDT

Processing ADAUSDT

Total bars: 18,464,226


## 3. Feature Selection and Decision Points

In [15]:
# Define feature columns - EXTENDED SET
FEATURE_COLS = [
    # Order flow (7)
    "MOI_250ms", "MOI_1s", "MOI_5s", "MOI_z",
    "delta_velocity", "delta_velocity_5s", "AggressionPersistence",
    
    # Order flow momentum (3)
    "MOI_roc_1s", "MOI_roc_5s", "MOI_acceleration",
    
    # Absorption (4)
    "absorption_z", "price_impact_z", "MOI_flip_rate",
    
    # Volatility (4)
    "vol_1m", "vol_5m", "vol_ratio", "vol_rank",
    
    # Structure (4)
    "dist_lvn", "dist_poc", "dist_lvn_atr", "dist_poc_atr",
    
    # Time (3)
    "hour_sin", "hour_cos", "is_weekend",
    
    # Trade intensity (2)
    "trade_intensity", "trade_intensity_z",
    
    # Cumulative (2)
    "cum_delta_1m", "cum_delta_5m",
]

print(f"Total features: {len(FEATURE_COLS)}")

Total features: 28


In [16]:
# Create decision points - when to evaluate
# More selective than original: require actual edge potential

all_decisions = []

for symbol in PAIRS:
    print(f"Creating decision points for {symbol}")
    
    bars_sym = df_bars[df_bars["symbol"] == symbol].copy()
    bars_sym = bars_sym.dropna(subset=FEATURE_COLS)
    
    # Rolling thresholds for adaptive filtering
    bars_sym["MOI_thresh"] = bars_sym["MOI_1s"].abs().rolling(2000).quantile(0.8)
    bars_sym["LVN_thresh"] = bars_sym["dist_lvn_atr"].rolling(2000).quantile(0.2)
    bars_sym["absorption_thresh"] = bars_sym["absorption_z"].abs().rolling(2000).quantile(0.8)
    
    # Decision mask: at least one condition must be met
    decision_mask = (
        (bars_sym["dist_lvn_atr"] < bars_sym["LVN_thresh"]) |  # Near LVN
        (bars_sym["absorption_z"].abs() > bars_sym["absorption_thresh"]) |  # Absorption event
        (bars_sym["MOI_1s"].abs() > bars_sym["MOI_thresh"]) |  # Strong order flow
        (bars_sym["vol_ratio"] > 1.5)  # Volatility expansion
    )
    
    df_decision_sym = bars_sym.loc[decision_mask].copy()
    df_decision_sym["bar_idx"] = df_decision_sym.index
    all_decisions.append(df_decision_sym)
    
    print(f"  {len(df_decision_sym):,} decision points ({100*len(df_decision_sym)/len(bars_sym):.1f}%)")

df_decision = pd.concat(all_decisions, ignore_index=True)
print(f"\nTotal decision points: {len(df_decision):,}")

Creating decision points for BTCUSDT
  1,553,277 decision points (53.0%)
Creating decision points for ETHUSDT
  1,746,795 decision points (54.2%)
Creating decision points for SOLUSDT
  1,555,317 decision points (52.6%)
Creating decision points for BNBUSDT
  1,059,286 decision points (54.1%)
Creating decision points for XRPUSDT
  1,519,994 decision points (54.5%)
Creating decision points for DOGEUSDT
  1,140,515 decision points (52.6%)
Creating decision points for LTCUSDT
  442,036 decision points (53.3%)
Creating decision points for ADAUSDT
  843,472 decision points (53.2%)

Total decision points: 9,860,692


In [None]:
# Create feature matrix with one-hot encoding for symbols
X_decision_df = df_decision[FEATURE_COLS].copy()

# One-hot encode symbols
pair_ohe = pd.get_dummies(df_decision["symbol"], prefix="pair", dtype="int8")
X_decision_ohe = pd.concat([X_decision_df.reset_index(drop=True), pair_ohe.reset_index(drop=True)], axis=1)

# Final feature columns
FEATURE_COLUMNS = X_decision_ohe.columns.tolist()
print(f"Final feature count: {len(FEATURE_COLUMNS)}")
print(FEATURE_COLUMNS)

Final feature count: 36
['MOI_250ms', 'MOI_1s', 'MOI_5s', 'MOI_z', 'delta_velocity', 'delta_velocity_5s', 'AggressionPersistence', 'MOI_roc_1s', 'MOI_roc_5s', 'MOI_acceleration', 'absorption_z', 'price_impact_z', 'MOI_flip_rate', 'vol_1m', 'vol_5m', 'vol_ratio', 'vol_rank', 'dist_lvn', 'dist_poc', 'dist_lvn_atr', 'dist_poc_atr', 'hour_sin', 'hour_cos', 'is_weekend', 'trade_intensity', 'trade_intensity_z', 'cum_delta_1m', 'cum_delta_5m', 'pair_ADAUSDT', 'pair_BNBUSDT', 'pair_BTCUSDT', 'pair_DOGEUSDT', 'pair_ETHUSDT', 'pair_LTCUSDT', 'pair_SOLUSDT', 'pair_XRPUSDT']


In [19]:
# Save feature columns
import json
with open("feature_columns_v2.json", "w") as f:
    json.dump(FEATURE_COLUMNS, f)

## 4. Fee-Adjusted Labeling

In [20]:
def create_labels_fee_adjusted(
    df_bars: pd.DataFrame,
    df_decision: pd.DataFrame,
    X_decision_ohe: pd.DataFrame,
    horizon_sec: int,
    fee_pct: float = 0.0004,  # One-way fee
) -> Tuple[Dict, Dict]:
    """
    Create fee-adjusted labels.
    
    Target = (max_favorable_move - 2*fee) / volatility
    
    This ensures the model learns to predict PROFITABLE moves, not just any move.
    """
    HORIZON = int(horizon_sec * 1000 / 250)  # Convert to bars
    round_trip_fee = 2 * fee_pct
    
    # Separate by direction and regime
    X_dict = {
        "up_low": [], "up_mid": [], "up_high": [],
        "down_low": [], "down_mid": [], "down_high": []
    }
    y_dict = {
        "up_low": [], "up_mid": [], "up_high": [],
        "down_low": [], "down_mid": [], "down_high": []
    }
    
    for symbol in PAIRS:
        print(f"Labeling {symbol} (horizon={horizon_sec}s)")
        
        # Get bars for this symbol
        bars_sym = df_bars[df_bars["symbol"] == symbol].reset_index(drop=True)
        
        # Get decision points for this symbol
        dec_sym = df_decision[df_decision["symbol"] == symbol]
        
        for i, row in tqdm(dec_sym.iterrows(), total=len(dec_sym), desc=symbol):
            idx = int(row["bar_idx"])
            regime = row["vol_regime"]
            
            if idx + HORIZON >= len(bars_sym):
                continue
            
            entry = bars_sym.loc[idx, "price"]
            vol = bars_sym.loc[idx, "vol_5m"]
            
            if pd.isna(vol) or vol <= 0:
                continue
            
            future = bars_sym.iloc[idx+1 : idx+HORIZON+1]
            
            # Calculate max favorable moves (AFTER fees)
            up_move = (future["price"].max() - entry) / entry - round_trip_fee
            down_move = (entry - future["price"].min()) / entry - round_trip_fee
            
            # Skip if neither direction would be profitable
            if max(up_move, down_move) < 0:
                continue
            
            features = X_decision_ohe.loc[row.name].values
            
            # Determine direction and calculate score
            if up_move > down_move and up_move > 0:
                # Profitable long
                score = up_move / vol  # Vol-adjusted profit
                key = f"up_{regime.lower()}"
            elif down_move > 0:
                # Profitable short
                score = down_move / vol
                key = f"down_{regime.lower()}"
            else:
                continue
            
            X_dict[key].append(features)
            y_dict[key].append(score)
    
    # Convert to numpy and sanitize
    for key in X_dict:
        X_dict[key] = np.array(X_dict[key], dtype=np.float32)
        y_arr = np.array(y_dict[key], dtype=np.float32)
        # Clip outliers and log-transform
        upper = np.percentile(y_arr, 99) if len(y_arr) > 0 else 1.0
        y_arr = np.clip(y_arr, 0, upper)
        y_dict[key] = np.log1p(y_arr)
        print(f"{key}: {len(X_dict[key]):,} samples")
    
    return X_dict, y_dict

In [21]:
# Create labels for 60s horizon
X_60, y_60 = create_labels_fee_adjusted(
    df_bars, df_decision, X_decision_ohe,
    horizon_sec=60,
    fee_pct=FEE_PCT
)

Labeling BTCUSDT (horizon=60s)


BTCUSDT: 100%|██████████| 1553277/1553277 [09:35<00:00, 2701.10it/s]


Labeling ETHUSDT (horizon=60s)


ETHUSDT: 100%|██████████| 1746795/1746795 [02:51<00:00, 10205.40it/s]


Labeling SOLUSDT (horizon=60s)


SOLUSDT: 100%|██████████| 1555317/1555317 [01:45<00:00, 14679.01it/s]


Labeling BNBUSDT (horizon=60s)


BNBUSDT: 100%|██████████| 1059286/1059286 [01:11<00:00, 14828.66it/s]


Labeling XRPUSDT (horizon=60s)


XRPUSDT: 100%|██████████| 1519994/1519994 [01:38<00:00, 15376.58it/s]


Labeling DOGEUSDT (horizon=60s)


DOGEUSDT: 100%|██████████| 1140515/1140515 [01:14<00:00, 15220.36it/s]


Labeling LTCUSDT (horizon=60s)


LTCUSDT: 100%|██████████| 442036/442036 [00:29<00:00, 15186.49it/s]


Labeling ADAUSDT (horizon=60s)


ADAUSDT: 100%|██████████| 843472/843472 [01:12<00:00, 11604.96it/s]


up_low: 77,762 samples
up_mid: 57,942 samples
up_high: 104,656 samples
down_low: 72,077 samples
down_mid: 58,412 samples
down_high: 95,177 samples


In [22]:
# Create labels for 300s horizon
X_300, y_300 = create_labels_fee_adjusted(
    df_bars, df_decision, X_decision_ohe,
    horizon_sec=300,
    fee_pct=FEE_PCT
)

Labeling BTCUSDT (horizon=300s)


BTCUSDT: 100%|██████████| 1553277/1553277 [14:06<00:00, 1834.20it/s]


Labeling ETHUSDT (horizon=300s)


ETHUSDT: 100%|██████████| 1746795/1746795 [03:33<00:00, 8186.42it/s] 


Labeling SOLUSDT (horizon=300s)


SOLUSDT: 100%|██████████| 1555317/1555317 [02:05<00:00, 12437.15it/s]


Labeling BNBUSDT (horizon=300s)


BNBUSDT: 100%|██████████| 1059286/1059286 [01:25<00:00, 12422.06it/s]


Labeling XRPUSDT (horizon=300s)


XRPUSDT: 100%|██████████| 1519994/1519994 [02:14<00:00, 11287.08it/s]


Labeling DOGEUSDT (horizon=300s)


DOGEUSDT: 100%|██████████| 1140515/1140515 [01:43<00:00, 11024.11it/s]


Labeling LTCUSDT (horizon=300s)


LTCUSDT: 100%|██████████| 442036/442036 [00:33<00:00, 13098.08it/s]


Labeling ADAUSDT (horizon=300s)


ADAUSDT: 100%|██████████| 843472/843472 [01:02<00:00, 13436.56it/s]


up_low: 229,625 samples
up_mid: 167,982 samples
up_high: 266,789 samples
down_low: 219,728 samples
down_mid: 168,772 samples
down_high: 246,926 samples


## 5. Purged Walk-Forward CV with Ensemble

In [23]:
def purged_walk_forward_splits(n: int, n_splits: int = 5, purge_pct: float = 0.01):
    """
    Walk-forward splits with purging to prevent temporal leakage.
    
    Purge removes samples at the boundary between train and validation
    to prevent look-ahead bias.
    """
    fold_size = n // (n_splits + 1)
    purge_size = int(fold_size * purge_pct)
    
    for i in range(n_splits):
        tr_end = fold_size * (i + 1) - purge_size
        va_start = fold_size * (i + 1) + purge_size
        va_end = fold_size * (i + 2)
        
        yield np.arange(0, tr_end), np.arange(va_start, va_end)


def train_ensemble_model(
    X: np.ndarray,
    y: np.ndarray,
    name: str,
    feature_columns: List[str],
    n_splits: int = 5
) -> Tuple[List, Dict]:
    """
    Train ensemble of models using purged walk-forward CV.
    
    Returns all fold models (for ensemble averaging) and metrics.
    """
    if len(X) < 1000:
        print(f"Insufficient data for {name}: {len(X)} samples")
        return [], {}
    
    X_df = pd.DataFrame(X, columns=feature_columns)
    
    models = []
    metrics = {
        "maes": [],
        "rmses": [],
        "top10_actual": [],
        "top25_actual": [],
    }
    
    print(f"\n{'='*60}")
    print(f"Training {name} ({len(X):,} samples)")
    print(f"{'='*60}")
    
    for fold, (tr_idx, va_idx) in enumerate(purged_walk_forward_splits(len(X_df), n_splits)):
        
        # Optimized hyperparameters for edge detection
        model = lgb.LGBMRegressor(
            n_estimators=1000,
            max_depth=7,
            learning_rate=0.02,
            subsample=0.7,
            colsample_bytree=0.7,
            min_child_samples=50,
            reg_alpha=0.1,
            reg_lambda=0.1,
            objective="huber",
            alpha=0.9,
            random_state=42 + fold,
            n_jobs=-1,
            verbose=-1,
        )
        
        model.fit(
            X_df.iloc[tr_idx], y[tr_idx],
            eval_set=[(X_df.iloc[va_idx], y[va_idx])],
            eval_metric="l1",
            callbacks=[lgb.early_stopping(100, verbose=False)],
        )
        
        # Evaluate
        preds = model.predict(X_df.iloc[va_idx])
        actual = y[va_idx]
        
        mae = mean_absolute_error(actual, preds)
        rmse = np.sqrt(mean_squared_error(actual, preds))
        
        # Top percentile analysis
        for q, key in [(90, "top10_actual"), (75, "top25_actual")]:
            thresh = np.percentile(preds, q)
            mask = preds >= thresh
            if mask.sum() > 0:
                metrics[key].append(actual[mask].mean())
        
        metrics["maes"].append(mae)
        metrics["rmses"].append(rmse)
        models.append(model)
        
        print(f"Fold {fold}: MAE={mae:.4f}, RMSE={rmse:.4f}")
    
    # Summary
    print(f"\n{name} Summary:")
    print(f"  Mean MAE: {np.mean(metrics['maes']):.4f}")
    print(f"  Mean RMSE: {np.mean(metrics['rmses']):.4f}")
    print(f"  Target STD: {np.std(y):.4f}")
    print(f"  MAE/STD: {np.mean(metrics['maes'])/np.std(y):.4f}")
    print(f"  Top 10% mean actual: {np.mean(metrics['top10_actual']):.4f}")
    print(f"  Top 25% mean actual: {np.mean(metrics['top25_actual']):.4f}")
    
    return models, metrics

In [24]:
# Train all 60s models
models_60 = {}

for key in ["up_low", "up_mid", "up_high", "down_low", "down_mid", "down_high"]:
    models, metrics = train_ensemble_model(
        X_60[key], y_60[key],
        f"{key.upper()}_60",
        FEATURE_COLUMNS
    )
    models_60[key] = models


Training UP_LOW_60 (77,762 samples)
Fold 0: MAE=0.7962, RMSE=0.9873
Fold 1: MAE=0.7386, RMSE=0.9035
Fold 2: MAE=0.8004, RMSE=0.9615
Fold 3: MAE=0.6937, RMSE=0.8588
Fold 4: MAE=0.8301, RMSE=1.0260

UP_LOW_60 Summary:
  Mean MAE: 0.7718
  Mean RMSE: 0.9474
  Target STD: 0.9905
  MAE/STD: 0.7792
  Top 10% mean actual: 2.6922
  Top 25% mean actual: 2.5998

Training UP_MID_60 (57,942 samples)
Fold 0: MAE=0.7113, RMSE=0.8947
Fold 1: MAE=0.7277, RMSE=0.8809
Fold 2: MAE=0.7389, RMSE=0.9071
Fold 3: MAE=0.7190, RMSE=0.8839
Fold 4: MAE=0.7458, RMSE=0.9175

UP_MID_60 Summary:
  Mean MAE: 0.7285
  Mean RMSE: 0.8968
  Target STD: 0.9213
  MAE/STD: 0.7908
  Top 10% mean actual: 2.4257
  Top 25% mean actual: 2.3701

Training UP_HIGH_60 (104,656 samples)
Fold 0: MAE=0.7315, RMSE=0.8999
Fold 1: MAE=0.7534, RMSE=0.9284
Fold 2: MAE=0.7437, RMSE=0.9369
Fold 3: MAE=0.7199, RMSE=0.8957
Fold 4: MAE=0.7217, RMSE=0.9047

UP_HIGH_60 Summary:
  Mean MAE: 0.7340
  Mean RMSE: 0.9131
  Target STD: 0.9806
  MAE/STD:

In [25]:
# Train all 300s models
models_300 = {}

for key in ["up_low", "up_mid", "up_high", "down_low", "down_mid", "down_high"]:
    models, metrics = train_ensemble_model(
        X_300[key], y_300[key],
        f"{key.upper()}_300",
        FEATURE_COLUMNS
    )
    models_300[key] = models


Training UP_LOW_300 (229,625 samples)
Fold 0: MAE=0.6131, RMSE=0.8266
Fold 1: MAE=0.6571, RMSE=0.8770
Fold 2: MAE=0.6681, RMSE=0.8748
Fold 3: MAE=0.6897, RMSE=0.9103
Fold 4: MAE=0.6881, RMSE=0.8948

UP_LOW_300 Summary:
  Mean MAE: 0.6632
  Mean RMSE: 0.8767
  Target STD: 0.9608
  MAE/STD: 0.6903
  Top 10% mean actual: 4.0259
  Top 25% mean actual: 3.7044

Training UP_MID_300 (167,982 samples)
Fold 0: MAE=0.7161, RMSE=0.9206
Fold 1: MAE=0.6596, RMSE=0.8598
Fold 2: MAE=0.7105, RMSE=0.9151
Fold 3: MAE=0.7339, RMSE=0.9375
Fold 4: MAE=0.6903, RMSE=0.8505

UP_MID_300 Summary:
  Mean MAE: 0.7021
  Mean RMSE: 0.8967
  Target STD: 0.9485
  MAE/STD: 0.7402
  Top 10% mean actual: 3.8722
  Top 25% mean actual: 3.6085

Training UP_HIGH_300 (266,789 samples)
Fold 0: MAE=0.7572, RMSE=0.9638
Fold 1: MAE=0.6569, RMSE=0.8592
Fold 2: MAE=0.7935, RMSE=1.0116
Fold 3: MAE=0.7227, RMSE=0.9259
Fold 4: MAE=0.6796, RMSE=0.8548

UP_HIGH_300 Summary:
  Mean MAE: 0.7220
  Mean RMSE: 0.9230
  Target STD: 1.0035
  

## 6. Feature Importance Analysis

In [26]:
# Analyze feature importance across all models
def get_ensemble_feature_importance(models_dict: Dict, feature_cols: List[str]) -> pd.DataFrame:
    """Average feature importance across all models"""
    all_importances = []
    
    for key, models in models_dict.items():
        for model in models:
            imp = pd.DataFrame({
                "feature": feature_cols,
                "importance": model.feature_importances_,
                "model": key
            })
            all_importances.append(imp)
    
    df_imp = pd.concat(all_importances)
    return df_imp.groupby("feature")["importance"].mean().sort_values(ascending=False)

# Feature importance for 300s models
fi_300 = get_ensemble_feature_importance(models_300, FEATURE_COLUMNS)
print("Top 15 Features (300s models):")
print(fi_300.head(15))

Top 15 Features (300s models):
feature
vol_5m             494.300000
cum_delta_5m       374.100000
hour_sin           312.066667
hour_cos           283.466667
dist_poc           260.133333
dist_poc_atr       236.766667
dist_lvn           219.233333
dist_lvn_atr       194.766667
vol_ratio          193.633333
vol_1m             168.233333
cum_delta_1m       165.633333
MOI_flip_rate       79.800000
is_weekend          75.300000
vol_rank            74.133333
trade_intensity     67.466667
Name: importance, dtype: float64


## 7. Save Models

In [27]:
import os
import joblib

os.makedirs("models_v2", exist_ok=True)

# Save 60s models
for key, models in models_60.items():
    path = f"models_v2/models_{key}_60.pkl"
    joblib.dump(models, path)
    print(f"Saved {path}")

# Save 300s models
for key, models in models_300.items():
    path = f"models_v2/models_{key}_300.pkl"
    joblib.dump(models, path)
    print(f"Saved {path}")

Saved models_v2/models_up_low_60.pkl
Saved models_v2/models_up_mid_60.pkl
Saved models_v2/models_up_high_60.pkl
Saved models_v2/models_down_low_60.pkl
Saved models_v2/models_down_mid_60.pkl
Saved models_v2/models_down_high_60.pkl
Saved models_v2/models_up_low_300.pkl
Saved models_v2/models_up_mid_300.pkl
Saved models_v2/models_up_high_300.pkl
Saved models_v2/models_down_low_300.pkl
Saved models_v2/models_down_mid_300.pkl
Saved models_v2/models_down_high_300.pkl


In [28]:
# Download models (for Colab)
try:
    from google.colab import files
    files.download("feature_columns_v2.json")
    for f in os.listdir("models_v2"):
        files.download(f"models_v2/{f}")
except:
    print("Not in Colab, skip download")

Not in Colab, skip download


## 8. Model Calibration (Optional)

Calibrate predictions to actual probabilities using isotonic regression.

In [29]:
from sklearn.isotonic import IsotonicRegression

def calibrate_model_predictions(
    models: List,
    X: np.ndarray,
    y: np.ndarray,
    feature_columns: List[str]
) -> IsotonicRegression:
    """
    Fit isotonic regression calibrator on held-out predictions.
    """
    if len(X) < 100:
        return None
    
    X_df = pd.DataFrame(X, columns=feature_columns)
    
    # Use last 20% as calibration set
    cal_start = int(len(X) * 0.8)
    X_cal = X_df.iloc[cal_start:]
    y_cal = y[cal_start:]
    
    # Ensemble prediction
    preds = np.mean([m.predict(X_cal) for m in models], axis=0)
    
    # Fit calibrator
    calibrator = IsotonicRegression(out_of_bounds="clip")
    calibrator.fit(preds, y_cal)
    
    return calibrator

# Optionally calibrate (uncomment to use)
calibrators_300 = {}
for key in models_300:
    if models_300[key]:
        calibrators_300[key] = calibrate_model_predictions(
            models_300[key], X_300[key], y_300[key], FEATURE_COLUMNS
        )

## Summary

**Improvements Made:**

1. **29 features** vs 7 original - captures more market dynamics
2. **Fee-adjusted labels** - model learns profitable moves, not just any moves
3. **Purged CV** - prevents temporal leakage
4. **Ensemble averaging** - uses all fold models for robustness
5. **Better hyperparameters** - deeper trees, more regularization
6. **Momentum features** - captures order flow acceleration
7. **Time features** - captures intraday patterns

**To use in production:**
1. Copy `models_v2/*.pkl` to `ml_models/` directory
2. Update `feature_columns.json` with new features
3. Update `src/stage5/predictor.py` to compute new features
4. Use ensemble averaging instead of last model only