# Hydra V3 Enhanced - ML Model Training

## Enhancements over V2:
1. **Cross-Sectional Features**: Rank-based features across symbols
2. **Triple-Barrier Labels**: Realistic TP/SL/Time-based targets
3. **Optuna Hyperparameter Optimization**: Per-regime tuning
4. **21 Days of Data**: More robust training
5. **Memory Optimized**: Chunked processing, float32 throughout

In [163]:
!pip install pandas numpy requests pyarrow lightgbm scikit-learn tqdm scipy optuna -q

In [164]:
import pandas as pd
import numpy as np
import requests
import io
import zipfile
import time
import gc
import json
import joblib
import os
from datetime import datetime, timedelta, timezone
from tqdm import tqdm
from typing import List, Tuple, Dict, Optional
import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_absolute_error, mean_squared_error
from scipy import stats

import lightgbm as lgb
import optuna
optuna.logging.set_verbosity(optuna.logging.WARNING)

In [165]:
# Configuration
PAIRS = [
    "BTCUSDT", "ETHUSDT", "SOLUSDT", "BNBUSDT",
    "XRPUSDT", "DOGEUSDT", "LTCUSDT", "ADAUSDT",
]

DAYS = 21  # Balanced: more data without memory issues
FEE_PCT = 0.0004  # 0.04% per side
ROUND_TRIP_FEE = 2 * FEE_PCT  # 0.08%

# Triple Barrier Parameters
TP_MULT = 2.0  # Take profit at 2x ATR
SL_MULT = 1.0  # Stop loss at 1x ATR
MAX_HOLDING_BARS = 1200  # 5 minutes max hold (300s / 0.25s per bar)
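
As a quick check on the constants above, the sketch below converts a holding time in seconds into a bar count and expresses the round-trip fee in basis points. It assumes evenly spaced 250 ms bars; the helper name `seconds_to_bars` is illustrative and not part of the notebook.

```python
BAR_MS = 250          # bar width used throughout the notebook
FEE_PCT = 0.0004      # 0.04% per side, as configured above


def seconds_to_bars(seconds: float, bar_ms: int = BAR_MS) -> int:
    """Number of bars spanning `seconds`, assuming evenly spaced bars."""
    return int(seconds * 1000 / bar_ms)


round_trip_fee = 2 * FEE_PCT

print(seconds_to_bars(300))             # bars in a 5-minute hold
print(seconds_to_bars(60))              # bars in a 1-minute hold
print(f"{round_trip_fee * 1e4:.1f} bp") # round-trip fee in basis points
```

The same conversion is what the labeling horizon relies on later, so any change to the bar width has to be mirrored in both places.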

## 1. Data Fetching (Memory Optimized)

In [166]:
def fetch_aggtrades_day(symbol: str, date: datetime) -> Optional[pd.DataFrame]:
    """Fetch aggregated trades for a single day"""
    date_str = date.strftime("%Y-%m-%d")
    url = (
        f"https://data.binance.vision/data/futures/um/daily/aggTrades/"
        f"{symbol}/{symbol}-aggTrades-{date_str}.zip"
    )
    
    try:
        r = requests.get(url, timeout=30)
        if r.status_code != 200:
            return None
        
        z = zipfile.ZipFile(io.BytesIO(r.content))
        csv_name = z.namelist()[0]
        df = pd.read_csv(z.open(csv_name))
        df["symbol"] = symbol
        return df
    except Exception as e:
        print(f"Error fetching {symbol} {date_str}: {e}")
        return None


def fetch_symbol_data(symbol: str, days: int) -> pd.DataFrame:
    """Fetch all data for a single symbol"""
    all_dfs = []
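    # end_date is the most recent fully elapsed UTC day. The loop below covers
    # `days` days from start_date through end_date - 1 day, so end_date itself
    # is not fetched.
    end_date = datetime.now(timezone.utc).date() - timedelta(days=1)
    start_date = end_date - timedelta(days=days)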
    
    for i in tqdm(range(days), desc=symbol):
        day = start_date + timedelta(days=i)
        df_day = fetch_aggtrades_day(symbol, day)
        if df_day is not None:
            all_dfs.append(df_day)
    
    if not all_dfs:
        return pd.DataFrame()
    
    df = pd.concat(all_dfs, ignore_index=True)
    
    # Clean and convert to efficient types
    df = df.rename(columns={
        "transact_time": "timestamp",
        "is_buyer_maker": "is_sell"
    })
    df["timestamp"] = pd.to_datetime(df["timestamp"], unit="ms")
    df["price"] = df["price"].astype("float32")
    df["quantity"] = df["quantity"].astype("float32")
    df["is_sell"] = df["is_sell"].astype("int8")
    df = df.sort_values("timestamp").reset_index(drop=True)
    
    return df

In [167]:
# Fetch data per symbol to manage memory
symbol_data = {}
for symbol in PAIRS:
    print(f"\nFetching {symbol}")
    symbol_data[symbol] = fetch_symbol_data(symbol, DAYS)
    print(f"  {len(symbol_data[symbol]):,} trades")

total_trades = sum(len(df) for df in symbol_data.values())
print(f"\nTotal trades: {total_trades:,}")


Fetching BTCUSDT


BTCUSDT: 100%|██████████| 21/21 [01:01<00:00,  2.95s/it]


  24,056,684 trades

Fetching ETHUSDT


ETHUSDT: 100%|██████████| 21/21 [01:08<00:00,  3.27s/it]


  27,424,579 trades

Fetching SOLUSDT


SOLUSDT: 100%|██████████| 21/21 [00:36<00:00,  1.74s/it]


  7,532,866 trades

Fetching BNBUSDT


BNBUSDT: 100%|██████████| 21/21 [00:31<00:00,  1.49s/it]


  5,728,359 trades

Fetching XRPUSDT


XRPUSDT: 100%|██████████| 21/21 [00:36<00:00,  1.72s/it]


  6,974,506 trades

Fetching DOGEUSDT


DOGEUSDT: 100%|██████████| 21/21 [00:33<00:00,  1.57s/it]


  5,315,706 trades

Fetching LTCUSDT


LTCUSDT: 100%|██████████| 21/21 [00:23<00:00,  1.12s/it]


  1,537,556 trades

Fetching ADAUSDT


ADAUSDT: 100%|██████████| 21/21 [00:29<00:00,  1.39s/it]


  2,991,577 trades

Total trades: 81,561,833


## 2. Enhanced Feature Engineering with Cross-Sectional

In [168]:
def compute_base_features(df_sym: pd.DataFrame, symbol: str) -> pd.DataFrame:
    """
    Compute base features for a single symbol.
    Cross-sectional features added later across all symbols.
    """
    df_sym = df_sym.copy()
    df_sym["signed_qty"] = np.where(df_sym["is_sell"], -df_sym["quantity"], df_sym["quantity"])
    
    # Resample to 250ms bars
    bars = (
        df_sym
        .set_index("timestamp")
        .resample("250ms")
        .agg(
            price=("price", "last"),
            qty=("quantity", "sum"),
            signed_qty=("signed_qty", "sum"),
            trade_count=("quantity", "count"),
        )
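        .dropna(subset=["price"])  # drops bars that contain no trades
    )
    # After the dropna above, price has no gaps, so this ffill is a no-op.
    # Bars are therefore unevenly spaced in time: the rolling windows below
    # count bars that contain trades, not fixed 250ms intervals.
    bars["price"] = bars["price"].ffill()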
    bars = bars.reset_index()
    bars["symbol"] = symbol
    
    # ============ ORDER FLOW FEATURES ============
    bars["MOI_250ms"] = bars["signed_qty"].rolling(1).sum()
    bars["MOI_1s"] = bars["signed_qty"].rolling(4).sum()
    bars["MOI_5s"] = bars["signed_qty"].rolling(20).sum()
    bars["MOI_std"] = bars["MOI_1s"].rolling(100).std()
    bars["MOI_z"] = bars["MOI_1s"].abs() / (bars["MOI_std"] + 1e-6)
    bars["delta_velocity"] = bars["MOI_1s"].diff()
    bars["delta_velocity_5s"] = bars["MOI_1s"].diff(20)
    
    # Aggression persistence
    abs_moi = bars["MOI_1s"].abs()
    mean_moi = abs_moi.rolling(100).mean()
    std_moi = abs_moi.rolling(100).std()
    bars["AggressionPersistence"] = mean_moi / (std_moi + 1e-6)
    
    # MOI flip rate
    moi_sign = np.sign(bars["MOI_1s"])
    sign_change = (moi_sign != moi_sign.shift(1)).astype(int)
    bars["MOI_flip_rate"] = sign_change.rolling(240).sum()
    
    # Order flow momentum
    bars["MOI_roc_1s"] = bars["MOI_1s"].pct_change(4).clip(-10, 10)
    bars["MOI_roc_5s"] = bars["MOI_1s"].pct_change(20).clip(-10, 10)
    bars["MOI_acceleration"] = bars["delta_velocity"].diff()
    
    # ============ ABSORPTION FEATURES ============
    price_change = bars["price"].diff().abs().clip(lower=1e-6)
    bars["absorption_raw"] = bars["qty"] / price_change
    bars["absorption_z"] = (
        (bars["absorption_raw"] - bars["absorption_raw"].rolling(500).mean()) /
        (bars["absorption_raw"].rolling(500).std() + 1e-6)
    )
    bars["price_impact"] = price_change / (bars["qty"] + 1e-6)
    bars["price_impact_z"] = (
        (bars["price_impact"] - bars["price_impact"].rolling(500).mean()) /
        (bars["price_impact"].rolling(500).std() + 1e-6)
    )
    
    # ============ VOLATILITY FEATURES ============
    bars["ret"] = bars["price"].pct_change()
    bars["vol_1m"] = bars["ret"].rolling(240).std()
    bars["vol_5m"] = bars["ret"].rolling(1200).std()
    bars["vol_ratio"] = bars["vol_1m"] / (bars["vol_5m"] + 1e-8)
    bars["vol_rank"] = bars["vol_5m"].rolling(2000).rank(pct=True)
    
    # ATR for triple barrier
    bars["atr_5m"] = bars["ret"].abs().rolling(1200).mean() * bars["price"]
    
    # Vol regime: stored as plain strings rather than Categorical so the column
    # survives merges. Rows where vol_rank is NaN become the string "nan".
    bars["vol_regime"] = pd.cut(
        bars["vol_rank"],
        bins=[-np.inf, 0.3, 0.7, np.inf],
        labels=["LOW", "MID", "HIGH"]
    ).astype(str)  # Convert to string to survive merge operations
    
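    # ============ STRUCTURE FEATURES ============
    # NOTE: BIN_SIZE is in price units and is shared by all symbols, so for
    # symbols priced well below 10 every bar rounds into the same bin (0).
    BIN_SIZE = 10
    LVN_BLOCK = 1200
    
    bars["price_bin"] = (bars["price"] / BIN_SIZE).round() * BIN_SIZE
    lvn_price = np.full(len(bars), np.nan)
    poc_price = np.full(len(bars), np.nan)
    
    # NOTE: each block's LVN/POC is computed from the whole block and assigned to
    # every bar in it, so early bars in a block see volume from later bars.
    for i in range(0, len(bars), LVN_BLOCK):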
        window = bars.iloc[i:i+LVN_BLOCK]
        if window["qty"].sum() == 0:
            continue
        vp = window.groupby("price_bin")["qty"].sum()
        lvn_price[i:i+LVN_BLOCK] = vp.idxmin()
        poc_price[i:i+LVN_BLOCK] = vp.idxmax()
    
    bars["LVN_price"] = lvn_price
    bars["POC_price"] = poc_price
    bars["dist_lvn"] = (bars["price"] - bars["LVN_price"]).abs()
    bars["dist_poc"] = (bars["price"] - bars["POC_price"]).abs()
    bars["dist_lvn_atr"] = bars["dist_lvn"] / (bars["atr_5m"] + 1e-6)
    bars["dist_poc_atr"] = bars["dist_poc"] / (bars["atr_5m"] + 1e-6)
    
    # ============ TIME FEATURES ============
    bars["hour"] = bars["timestamp"].dt.hour
    bars["hour_sin"] = np.sin(2 * np.pi * bars["hour"] / 24)
    bars["hour_cos"] = np.cos(2 * np.pi * bars["hour"] / 24)
    bars["is_weekend"] = (bars["timestamp"].dt.dayofweek >= 5).astype(int)
    
    # ============ TRADE INTENSITY ============
    bars["trade_intensity"] = bars["trade_count"].rolling(100).mean()
    bars["trade_intensity_z"] = (
        (bars["trade_count"] - bars["trade_count"].rolling(500).mean()) /
        (bars["trade_count"].rolling(500).std() + 1e-6)
    )
    
    # ============ CUMULATIVE FEATURES ============
    bars["cum_delta_1m"] = bars["signed_qty"].rolling(240).sum()
    bars["cum_delta_5m"] = bars["signed_qty"].rolling(1200).sum()
    
    # Convert to float32
    float_cols = bars.select_dtypes(include=[np.float64]).columns
    bars[float_cols] = bars[float_cols].astype(np.float32)
    
    return bars
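
To make the bar construction concrete, here is a minimal sketch on four made-up trades. It uses the same `resample("250ms")` aggregation as `compute_base_features` and shows that bars without trades are removed by the `dropna`, which leaves the remaining bars unevenly spaced. The toy data is an assumption for illustration only.

```python
import numpy as np
import pandas as pd

# Toy trades: three inside the first 250 ms, then a gap, then one more.
trades = pd.DataFrame({
    "timestamp": pd.to_datetime([0, 100, 200, 1300], unit="ms"),
    "price": [100.0, 100.5, 100.2, 101.0],
    "quantity": [1.0, 2.0, 0.5, 3.0],
    "is_sell": [0, 1, 0, 1],
})
trades["signed_qty"] = np.where(
    trades["is_sell"] == 1, -trades["quantity"], trades["quantity"]
)

bars = (
    trades.set_index("timestamp")
    .resample("250ms")
    .agg(
        price=("price", "last"),
        qty=("quantity", "sum"),
        signed_qty=("signed_qty", "sum"),
        trade_count=("quantity", "count"),
    )
)
n_all = len(bars)                     # includes empty 250 ms bins
bars = bars.dropna(subset=["price"])  # empty bins have NaN price

print(f"{n_all} bins before dropna, {len(bars)} bars after")
print(bars)
```

Because the surviving bars are 1,250 ms apart here, a `rolling(4)` window over them would span well over one second, which is worth keeping in mind when reading names such as `MOI_1s`.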

In [169]:
# Process all symbols
all_bars = {}

for symbol in PAIRS:
    print(f"Processing {symbol}")
    if len(symbol_data[symbol]) > 0:
        all_bars[symbol] = compute_base_features(symbol_data[symbol], symbol)
        print(f"  {len(all_bars[symbol]):,} bars")
    
    # Free memory
    del symbol_data[symbol]
    gc.collect()

del symbol_data
gc.collect()

Processing BTCUSDT
  4,277,952 bars
Processing ETHUSDT
  4,742,141 bars
Processing SOLUSDT
  4,297,506 bars
Processing BNBUSDT
  2,816,282 bars
Processing XRPUSDT
  3,880,707 bars
Processing DOGEUSDT
  3,067,561 bars
Processing LTCUSDT
  1,157,837 bars
Processing ADAUSDT
  2,301,313 bars


0

In [170]:
def add_cross_sectional_features(all_bars: Dict[str, pd.DataFrame]) -> Dict[str, pd.DataFrame]:
    """
    Add cross-sectional features: rank features across all symbols at each timestamp.
    
    This captures relative strength - which symbol is leading/lagging.
    """
    print("Adding cross-sectional features...")
    
    # Features to rank across symbols
    rank_features = ["MOI_1s", "MOI_5s", "vol_5m", "absorption_z", "cum_delta_5m"]
    
    # Get common timestamps (rounded to 250ms)
    for symbol, bars in all_bars.items():
        bars["ts_key"] = bars["timestamp"].dt.floor("250ms")
    
    # For each feature, compute rank across symbols
    for feature in tqdm(rank_features, desc="Cross-sectional"):
        # Build cross-sectional dataframe
        cross_df = pd.DataFrame()
        for symbol, bars in all_bars.items():
            temp = bars[["ts_key", feature]].copy()
            temp = temp.rename(columns={feature: symbol})
            if cross_df.empty:
                cross_df = temp
            else:
                cross_df = cross_df.merge(temp, on="ts_key", how="outer")
        
        # Compute rank (0-1) across symbols for each timestamp, once per feature
        symbol_cols = [s for s in PAIRS if s in cross_df.columns]
        ranks = cross_df[symbol_cols].rank(axis=1, pct=True)
        rank_col = f"{feature}_rank"
        
        # Add rank back to each symbol's bars
        for symbol in symbol_cols:
            # Get rank for this symbol at each timestamp
            symbol_ranks = pd.DataFrame({
                "ts_key": cross_df["ts_key"],
                rank_col: ranks[symbol],
            })
            
            # Merge back
            all_bars[symbol] = all_bars[symbol].merge(
                symbol_ranks, on="ts_key", how="left"
            )
            all_bars[symbol][rank_col] = all_bars[symbol][rank_col].fillna(0.5).astype(np.float32)
        
        del cross_df, ranks
        gc.collect()
    
    # Remove ts_key
    for symbol in all_bars:
        all_bars[symbol] = all_bars[symbol].drop(columns=["ts_key"])
    
    return all_bars
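
The ranking step can be illustrated on a small wide table with one column per symbol and one row per timestamp. This is a sketch with made-up values; it only demonstrates how `rank(axis=1, pct=True)` and the `fillna(0.5)` fallback behave when some symbols have no bar at a timestamp.

```python
import numpy as np
import pandas as pd

# One feature, three symbols, three timestamps. NaN means "no bar for this symbol".
wide = pd.DataFrame(
    {
        "BTCUSDT": [0.5, -1.0, np.nan],
        "ETHUSDT": [1.5, 2.0, 0.3],
        "SOLUSDT": [-0.2, 0.0, np.nan],
    },
    index=pd.to_datetime([0, 250, 500], unit="ms"),
)

# Percentile rank across symbols within each row; NaNs are left out of the ranking.
ranks = wide.rank(axis=1, pct=True)

# Same fallback as the notebook: missing ranks become the neutral value 0.5.
ranks_filled = ranks.fillna(0.5)
print(ranks_filled)
```

In the last row only one symbol has a value, so it receives rank 1.0 regardless of how large the value is. Rows with few active symbols therefore carry little cross-sectional information.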

In [171]:
all_bars = add_cross_sectional_features(all_bars)
print(f"\nFeatures per symbol: {len(all_bars[PAIRS[0]].columns)}")

Adding cross-sectional features...


Cross-sectional: 100%|██████████| 5/5 [07:54<00:00, 94.96s/it]



Features per symbol: 49


## 3. Feature Selection and Decision Points

In [172]:
# Extended feature columns with cross-sectional
FEATURE_COLS = [
    # Order flow (7)
    "MOI_250ms", "MOI_1s", "MOI_5s", "MOI_z",
    "delta_velocity", "delta_velocity_5s", "AggressionPersistence",
    
    # Order flow momentum (3)
    "MOI_roc_1s", "MOI_roc_5s", "MOI_acceleration",
    
    # Absorption (3)
    "absorption_z", "price_impact_z", "MOI_flip_rate",
    
    # Volatility (4)
    "vol_1m", "vol_5m", "vol_ratio", "vol_rank",
    
    # Structure (4)
    "dist_lvn", "dist_poc", "dist_lvn_atr", "dist_poc_atr",
    
    # Time (3)
    "hour_sin", "hour_cos", "is_weekend",
    
    # Trade intensity (2)
    "trade_intensity", "trade_intensity_z",
    
    # Cumulative (2)
    "cum_delta_1m", "cum_delta_5m",
    
    # Cross-sectional ranks (5) - NEW
    "MOI_1s_rank", "MOI_5s_rank", "vol_5m_rank", 
    "absorption_z_rank", "cum_delta_5m_rank",
]

print(f"Total features: {len(FEATURE_COLS)}")

Total features: 33


In [173]:
# Create decision points: keep bars where at least one trigger condition fires
all_decisions = []

for symbol in PAIRS:
    print(f"Creating decision points for {symbol}")
    
    bars_sym = all_bars[symbol].copy()
    bars_sym = bars_sym.dropna(subset=FEATURE_COLS)
    
    # Adaptive thresholds
    bars_sym["MOI_thresh"] = bars_sym["MOI_1s"].abs().rolling(2000).quantile(0.85)
    bars_sym["LVN_thresh"] = bars_sym["dist_lvn_atr"].rolling(2000).quantile(0.15)
    bars_sym["absorption_thresh"] = bars_sym["absorption_z"].abs().rolling(2000).quantile(0.85)
    
    # Decision mask: a bar qualifies if ANY of the conditions holds (logical OR)
    decision_mask = (
        (bars_sym["dist_lvn_atr"] < bars_sym["LVN_thresh"]) |  # Near LVN
        (bars_sym["absorption_z"].abs() > bars_sym["absorption_thresh"]) |  # Absorption
        (bars_sym["MOI_1s"].abs() > bars_sym["MOI_thresh"]) |  # Strong flow
        (bars_sym["vol_ratio"] > 1.8)  # Vol expansion
    )
    
    df_decision_sym = bars_sym.loc[decision_mask].copy()
    df_decision_sym["bar_idx"] = df_decision_sym.index
    all_decisions.append(df_decision_sym)
    
    print(f"  {len(df_decision_sym):,} decision points ({100*len(df_decision_sym)/len(bars_sym):.1f}%)")

df_decision = pd.concat(all_decisions, ignore_index=True)
print(f"\nTotal decision points: {len(df_decision):,}")

del all_decisions
gc.collect()

Creating decision points for BTCUSDT
  1,789,244 decision points (41.9%)
Creating decision points for ETHUSDT
  2,079,014 decision points (43.9%)
Creating decision points for SOLUSDT
  1,892,212 decision points (44.1%)
Creating decision points for BNBUSDT
  1,250,150 decision points (44.4%)
Creating decision points for XRPUSDT
  1,796,038 decision points (46.3%)
Creating decision points for DOGEUSDT
  1,364,778 decision points (44.5%)
Creating decision points for LTCUSDT
  520,112 decision points (45.0%)
Creating decision points for ADAUSDT
  1,040,956 decision points (45.3%)

Total decision points: 11,732,504


0
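
The share of bars kept as decision points (42-46% above) follows from how the mask is built: each rolling-quantile condition fires on roughly 15% of bars, and the conditions are combined with OR. The sketch below reproduces that effect on synthetic, independent signals; the random data and column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n, window, q = 6000, 2000, 0.85

# Three independent toy signals standing in for quantities like |MOI_1s|.
signals = pd.DataFrame(rng.normal(size=(n, 3)), columns=["a", "b", "c"]).abs()
thresh = signals.rolling(window).quantile(q)

valid = thresh.notna().all(axis=1)   # rows with a full rolling window
hits = (signals > thresh)[valid]

frac_single = hits["a"].mean()       # roughly 1 - q
frac_any = hits.any(axis=1).mean()   # roughly 1 - q**3 when independent

print(f"single condition: {frac_single:.3f}, any of three: {frac_any:.3f}")
```

With a fourth condition (`vol_ratio > 1.8`) added on top, a pass rate above 40% is what one would expect from an OR mask. Switching some conditions to AND would be one way to make the filter more selective.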

In [174]:
# Convert features to float32
for col in FEATURE_COLS:
    if col in df_decision.columns:
        df_decision[col] = df_decision[col].astype(np.float32)

# One-hot encode symbols
pair_ohe = pd.get_dummies(df_decision["symbol"], prefix="pair", dtype="int8")

# Final feature columns
FEATURE_COLUMNS = FEATURE_COLS + pair_ohe.columns.tolist()
print(f"Final feature count: {len(FEATURE_COLUMNS)}")

# Create X matrix
X = np.hstack([
    df_decision[FEATURE_COLS].values,
    pair_ohe.values.astype(np.float32)
])
print(f"X shape: {X.shape}, dtype: {X.dtype}")
print(f"Memory: {X.nbytes / 1e9:.2f} GB")

del pair_ohe
gc.collect()

Final feature count: 41
X shape: (11732504, 41), dtype: float32
Memory: 1.92 GB


0
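
The reported memory figure can be verified directly from the shape of `X`. The short calculation below uses the row and column counts printed above and compares float32 with float64 storage.

```python
import numpy as np

n_rows, n_cols = 11_732_504, 41  # shape of X printed above

bytes_f32 = n_rows * n_cols * np.dtype(np.float32).itemsize
bytes_f64 = n_rows * n_cols * np.dtype(np.float64).itemsize

print(f"float32: {bytes_f32 / 1e9:.2f} GB")
print(f"float64: {bytes_f64 / 1e9:.2f} GB")
```

Keeping the matrix in float32 halves its footprint relative to the pandas default of float64.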

In [175]:
# Save feature columns
with open("feature_columns_v3.json", "w") as f:
    json.dump(FEATURE_COLUMNS, f)

## 4. Triple-Barrier Labeling

In [None]:
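# ============ LABELING: Fee-Adjusted Max Favorable Move (IMPROVED) ============
# Key fix: Be MORE SELECTIVE about samples to reduce noise
# Note: this labels the best price reached within the horizon (forward max/min),
# not a triple-barrier outcome; TP_MULT, SL_MULT and MAX_HOLDING_BARS are unused here.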

def create_labels_fee_adjusted(
    all_bars: Dict[str, pd.DataFrame],
    df_decision: pd.DataFrame,
    X: np.ndarray,
    horizon_sec: int,
    min_profit_mult: float = 1.5,  # Require move > 1.5x fees to include
) -> Tuple[Dict, Dict]:
    """
    Create fee-adjusted labels using max favorable move approach.
    
    Key improvement: Only include samples where profit > min_profit_mult * fees
    This filters out marginal cases and focuses on clear opportunities.
    """
    HORIZON = int(horizon_sec * 1000 / 250)  # Convert to bars (250ms each)
    MIN_PROFIT = ROUND_TRIP_FEE * min_profit_mult  # e.g., 0.08% * 1.5 = 0.12%
    
    # Separate by direction and regime
    X_dict = {
        "up_low": [], "up_mid": [], "up_high": [],
        "down_low": [], "down_mid": [], "down_high": []
    }
    y_dict = {
        "up_low": [], "up_mid": [], "up_high": [],
        "down_low": [], "down_mid": [], "down_high": []
    }
    
    stats = {"total": 0, "profitable": 0, "strong": 0, "skipped": 0}
    VALID_REGIMES = {"LOW", "MID", "HIGH"}
    
    for symbol in PAIRS:
        print(f"Labeling {symbol} (horizon={horizon_sec}s)...")
        
        bars_sym = all_bars[symbol]
        dec_sym = df_decision[df_decision["symbol"] == symbol].copy()
        
        if len(dec_sym) == 0:
            continue
        
        # Pull columns into NumPy arrays for fast access in the loop below
        prices = bars_sym["price"].values
        vol_5m = bars_sym["vol_5m"].values
        n_bars = len(bars_sym)
        
        # Get decision point data
        bar_indices = dec_sym["bar_idx"].values.astype(np.int32)
        regimes = dec_sym["vol_regime"].astype(str).str.upper().values
        row_names = dec_sym.index.values
        
        # Pre-allocate results
        up_moves = np.full(len(dec_sym), np.nan)
        down_moves = np.full(len(dec_sym), np.nan)
        vols = np.full(len(dec_sym), np.nan)
        
        # Max favorable move per decision point (Python loop, one iteration per point)
        for i, idx in enumerate(bar_indices):
            if idx + HORIZON >= n_bars:
                continue
            
            entry_price = prices[idx]
            vol = vol_5m[idx]
            
            if np.isnan(vol) or vol <= 0 or np.isnan(entry_price):
                continue
            
            # Future prices
            future_prices = prices[idx+1 : idx+HORIZON+1]
            
            # Max favorable moves (AFTER fees)
            up_moves[i] = (future_prices.max() - entry_price) / entry_price - ROUND_TRIP_FEE
            down_moves[i] = (entry_price - future_prices.min()) / entry_price - ROUND_TRIP_FEE
            vols[i] = vol
        
        # Process results
        valid_mask = ~np.isnan(up_moves) & ~np.isnan(vols)
        
        for i in np.where(valid_mask)[0]:
            regime = regimes[i]
            if regime not in VALID_REGIMES:
                stats["skipped"] += 1
                continue
            
            up_move = up_moves[i]
            down_move = down_moves[i]
            vol = vols[i]
            best_move = max(up_move, down_move)
            
            stats["total"] += 1
            
            # Skip if neither direction is profitable
            if best_move < 0:
                stats["skipped"] += 1
                continue
            
            stats["profitable"] += 1
            
            # KEY FIX: Only include STRONG signals (> MIN_PROFIT threshold)
            if best_move < MIN_PROFIT:
                continue  # Skip marginal cases
            
            stats["strong"] += 1
            
            # Normalize by volatility
            up_score = up_move / vol if vol > 0 else 0
            down_score = down_move / vol if vol > 0 else 0
            
            # Get features
            X_row = X[row_names[i]]
            
            # Assign to appropriate bucket
            regime_lower = regime.lower()
            if up_move > down_move:
                X_dict[f"up_{regime_lower}"].append(X_row)
                y_dict[f"up_{regime_lower}"].append(up_score)
            else:
                X_dict[f"down_{regime_lower}"].append(X_row)
                y_dict[f"down_{regime_lower}"].append(down_score)
        
        print(f"  {symbol}: {valid_mask.sum():,} valid, {stats['strong']:,} strong so far")
    
    # Convert to arrays
    X_out = {}
    y_out = {}
    
    for key in X_dict:
        if X_dict[key]:
            X_out[key] = np.vstack(X_dict[key]).astype(np.float32)
            y_arr = np.array(y_dict[key], dtype=np.float32)
            y_out[key] = np.log1p(np.clip(y_arr, 0, 100))  # log1p for stability
            print(f"{key}: {len(X_out[key]):,} samples, y_mean={y_arr.mean():.4f}, y_std={y_arr.std():.4f}")
        else:
            X_out[key] = np.array([], dtype=np.float32).reshape(0, X.shape[1])
            y_out[key] = np.array([], dtype=np.float32)
            print(f"{key}: 0 samples")
    
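    # Guard against division by zero when no sample was valid
    strong_pct = 100 * stats["strong"] / stats["total"] if stats["total"] else 0.0
    print(f"\nStats: total={stats['total']:,}, profitable={stats['profitable']:,}, strong={stats['strong']:,} ({strong_pct:.1f}%)")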
    return X_out, y_out
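
The per-decision-point loop in `create_labels_fee_adjusted` slices a future window for every sample. One possible alternative, sketched below, computes the forward max and min for all bars at once with a rolling window over the reversed price series. The helper `forward_extremes` is hypothetical and not part of the notebook; it is checked here against the slice-based definition on a toy array.

```python
import numpy as np
import pandas as pd


def forward_extremes(prices: np.ndarray, horizon: int):
    """Max and min of prices[i+1 : i+horizon+1] for every i.

    Entries are NaN where the future window is incomplete, which matches
    the `idx + HORIZON >= n_bars` skip in the loop version.
    """
    rev = pd.Series(prices[::-1].astype(np.float64))
    # After un-reversing, position i holds the extreme of prices[i : i+horizon].
    fwd_max = rev.rolling(horizon).max().to_numpy()[::-1]
    fwd_min = rev.rolling(horizon).min().to_numpy()[::-1]
    # Shift by one bar so the entry bar itself is excluded.
    fut_max = np.append(fwd_max[1:], np.nan)
    fut_min = np.append(fwd_min[1:], np.nan)
    return fut_max, fut_min


ROUND_TRIP_FEE = 0.0008
prices = np.array([100.0, 101.0, 99.0, 102.0, 98.0, 100.0])

fut_max, fut_min = forward_extremes(prices, horizon=3)
up_moves = (fut_max - prices) / prices - ROUND_TRIP_FEE
down_moves = (prices - fut_min) / prices - ROUND_TRIP_FEE
print(np.round(up_moves, 4))
print(np.round(down_moves, 4))
```

The resulting arrays could then be indexed with `bar_indices` to obtain the moves for the decision points only, under the assumption that `bar_idx` holds row positions into the symbol's bar array.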

In [None]:
# Create labels for BOTH horizons (60s and 300s)
# Key: Use stricter filtering to focus on high-quality signals

print("=" * 60)
print("Creating 60-second horizon labels (min_profit=1.5x fees)")
print("=" * 60)
X_60, y_60 = create_labels_fee_adjusted(all_bars, df_decision, X, horizon_sec=60, min_profit_mult=1.5)

print("\n" + "=" * 60)
print("Creating 300-second horizon labels (min_profit=2.0x fees)")
print("=" * 60)
# Use stricter filtering for 300s - require stronger moves
X_300, y_300 = create_labels_fee_adjusted(all_bars, df_decision, X, horizon_sec=300, min_profit_mult=2.0)

Creating 60-second horizon labels
Labeling BTCUSDT (horizon=60s)...
  BTCUSDT: 1,789,122 valid, 428,405 profitable so far
Labeling ETHUSDT (horizon=60s)...
  ETHUSDT: 2,078,774 valid, 1,125,707 profitable so far
Labeling SOLUSDT (horizon=60s)...
  SOLUSDT: 1,892,027 valid, 2,173,793 profitable so far
Labeling BNBUSDT (horizon=60s)...
  BNBUSDT: 1,250,097 valid, 2,632,938 profitable so far
Labeling XRPUSDT (horizon=60s)...
  XRPUSDT: 1,795,798 valid, 3,701,028 profitable so far
Labeling DOGEUSDT (horizon=60s)...
  DOGEUSDT: 1,364,714 valid, 4,738,632 profitable so far
Labeling LTCUSDT (horizon=60s)...
  LTCUSDT: 520,079 valid, 5,225,332 profitable so far
Labeling ADAUSDT (horizon=60s)...
  ADAUSDT: 1,040,813 valid, 6,183,268 profitable so far
up_low: 797,579 samples, y_mean=11.5897, y_std=14.8482
up_mid: 651,961 samples, y_mean=11.0270, y_std=12.6382
up_high: 1,621,959 samples, y_mean=11.1698, y_std=14.3253
down_low: 799,539 samples, y_mean=11.4841, y_std=13.3161
down_mid: 672,775 sampl

In [178]:
# DEBUG: Check vol_regime values
print("=== DEBUG: vol_regime analysis ===")
print(f"df_decision columns: {list(df_decision.columns)}")
print(f"\nvol_regime dtype: {df_decision['vol_regime'].dtype}")
print(f"\nvol_regime unique values: {df_decision['vol_regime'].unique()[:20]}")
print(f"\nvol_regime value counts:\n{df_decision['vol_regime'].value_counts(dropna=False).head(10)}")

# Check a sample from all_bars
sample_symbol = PAIRS[0]
print(f"\n=== all_bars['{sample_symbol}'] vol_regime ===")
print(f"dtype: {all_bars[sample_symbol]['vol_regime'].dtype}")
print(f"unique values: {all_bars[sample_symbol]['vol_regime'].unique()[:20]}")
print(f"value counts:\n{all_bars[sample_symbol]['vol_regime'].value_counts(dropna=False).head(10)}")

=== DEBUG: vol_regime analysis ===
df_decision columns: ['timestamp', 'price', 'qty', 'signed_qty', 'trade_count', 'symbol', 'MOI_250ms', 'MOI_1s', 'MOI_5s', 'MOI_std', 'MOI_z', 'delta_velocity', 'delta_velocity_5s', 'AggressionPersistence', 'MOI_flip_rate', 'MOI_roc_1s', 'MOI_roc_5s', 'MOI_acceleration', 'absorption_raw', 'absorption_z', 'price_impact', 'price_impact_z', 'ret', 'vol_1m', 'vol_5m', 'vol_ratio', 'vol_rank', 'atr_5m', 'vol_regime', 'price_bin', 'LVN_price', 'POC_price', 'dist_lvn', 'dist_poc', 'dist_lvn_atr', 'dist_poc_atr', 'hour', 'hour_sin', 'hour_cos', 'is_weekend', 'trade_intensity', 'trade_intensity_z', 'cum_delta_1m', 'cum_delta_5m', 'MOI_1s_rank', 'MOI_5s_rank', 'vol_5m_rank', 'absorption_z_rank', 'cum_delta_5m_rank', 'MOI_thresh', 'LVN_thresh', 'absorption_thresh', 'bar_idx']

vol_regime dtype: object

vol_regime unique values: ['MID' 'LOW' 'HIGH']

vol_regime value counts:
HIGH    5611725
LOW     3389545
MID     2731234
Name: vol_regime, dtype: int64

=== all_b

In [179]:
df_decision.head()


| | timestamp | price | qty | signed_qty | trade_count | symbol | MOI_250ms | MOI_1s | MOI_5s | MOI_std | ... | cum_delta_5m | MOI_1s_rank | MOI_5s_rank | vol_5m_rank | absorption_z_rank | cum_delta_5m_rank | MOI_thresh | LVN_thresh | absorption_thresh | bar_idx |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2025-12-25 00:46:46.750 | 87610.5 | 0.112 | 0.1 | 2 | BTCUSDT | 0.1 | 0.057 | -4.441 | 1.81199 | ... | -17.555996 | 1.0 | 0.5 | 0.5 | 0.5 | 1.0 | 1.2076 | 52.829445 | 0.241792 | 5198 |
| 1 | 2025-12-25 00:46:47.250 | 87610.601562 | 0.01 | 0.01 | 1 | BTCUSDT | 0.01 | 0.091 | -4.434 | 1.811843 | ... | -17.523996 | 0.666667 | 0.333333 | 0.333333 | 0.333333 | 0.666667 | 1.2076 | 52.828386 | 0.241792 | 5199 |
| 2 | 2025-12-25 00:46:47.500 | 87610.601562 | 0.025 | 0.025 | 1 | BTCUSDT | 0.025 | 0.137 | -4.363 | 1.811356 | ... | -17.286995 | 0.625 | 0.25 | 0.125 | 1.0 | 0.875 | 1.2076 | 52.809839 | 0.241792 | 5200 |
| 3 | 2025-12-25 00:46:47.750 | 87610.601562 | 0.02 | 0.02 | 1 | BTCUSDT | 0.02 | 0.155 | -4.346 | 1.810782 | ... | -16.966995 | 0.666667 | 0.333333 | 0.333333 | 0.666667 | 0.666667 | 1.2076 | 52.732594 | 0.241792 | 5201 |
| 4 | 2025-12-25 00:46:48.750 | 87610.601562 | 0.003 | 0.003 | 1 | BTCUSDT | 0.003 | 0.058 | -4.334 | 1.810359 | ... | -17.072996 | 0.8 | 0.2 | 0.2 | 0.8 | 0.8 | 1.2076 | 52.697834 | 0.241792 | 5202 |


In [180]:
all_bars['BTCUSDT'].head()

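| | timestamp | price | qty | signed_qty | trade_count | symbol | MOI_250ms | MOI_1s | MOI_5s | MOI_std | ... | is_weekend | trade_intensity | trade_intensity_z | cum_delta_1m | cum_delta_5m | MOI_1s_rank | MOI_5s_rank | vol_5m_rank | absorption_z_rank | cum_delta_5m_rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2025-12-25 00:00:00.000 | 87627.398438 | 0.102 | 0.102 | 1 | BTCUSDT | 0.102 | NaN | NaN | NaN | ... | 0 | NaN | NaN | NaN | NaN | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| 1 | 2025-12-25 00:00:01.500 | 87627.296875 | 0.017 | 0.011 | 2 | BTCUSDT | 0.011 | NaN | NaN | NaN | ... | 0 | NaN | NaN | NaN | NaN | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| 2 | 2025-12-25 00:00:01.750 | 87627.296875 | 0.084 | 0.026 | 4 | BTCUSDT | 0.026 | NaN | NaN | NaN | ... | 0 | NaN | NaN | NaN | NaN | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| 3 | 2025-12-25 00:00:02.250 | 87627.398438 | 0.004 | 0.004 | 1 | BTCUSDT | 0.004 | 0.143 | NaN | NaN | ... | 0 | NaN | NaN | NaN | NaN | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| 4 | 2025-12-25 00:00:02.500 | 87627.398438 | 0.043 | -0.007 | 3 | BTCUSDT | -0.007 | 0.034 | NaN | NaN | ... | 0 | NaN | NaN | NaN | NaN | 0.571429 | 0.5 | 0.5 | 0.5 | 0.5 |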


In [181]:
X  

array([[ 1.00000e-01,  5.70000e-02, -4.44100e+00, ...,  0.00000e+00,
         0.00000e+00,  0.00000e+00],
       [ 1.00000e-02,  9.10000e-02, -4.43400e+00, ...,  0.00000e+00,
         0.00000e+00,  0.00000e+00],
       [ 2.50000e-02,  1.37000e-01, -4.36300e+00, ...,  0.00000e+00,
         0.00000e+00,  0.00000e+00],
       ...,
       [-3.70000e+01, -3.80000e+01,  1.38256e+05, ...,  0.00000e+00,
         0.00000e+00,  0.00000e+00],
       [-2.00000e+01, -7.00000e+01,  1.38223e+05, ...,  0.00000e+00,
         0.00000e+00,  0.00000e+00],
       [ 1.05000e+02,  4.70000e+01,  1.38307e+05, ...,  0.00000e+00,
         0.00000e+00,  0.00000e+00]], dtype=float32)

In [182]:
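# NOTE: create_triple_barrier_labels is not defined in any cell shown above;
# this cell and its output appear to come from an earlier labeling version.
X_data, y_data = create_triple_barrier_labels(all_bars, df_decision, X)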

Labeling BTCUSDT...
  score_up=-17.43bp (TIME), score_down=1.43bp (TIME)
  score_up=-17.45bp (TIME), score_down=1.45bp (TIME)
  score_up=-17.45bp (TIME), score_down=1.45bp (TIME)
  score_up=-17.45bp (TIME), score_down=1.45bp (TIME)
  score_up=-17.44bp (TIME), score_down=1.44bp (TIME)
  BTCUSDT: 1,789,243 valid samples
Labeling ETHUSDT...
  score_up=-7.08bp (TIME), score_down=-8.92bp (TIME)
  score_up=-7.08bp (TIME), score_down=-8.92bp (TIME)
  score_up=-7.08bp (TIME), score_down=-8.92bp (TIME)
  score_up=-7.12bp (TIME), score_down=-8.88bp (TIME)
  score_up=-6.61bp (TIME), score_down=-9.39bp (TIME)
  ETHUSDT: 2,079,013 valid samples
Labeling SOLUSDT...
  score_up=-20.27bp (TIME), score_down=4.27bp (TIME)
  score_up=-20.27bp (TIME), score_down=4.27bp (TIME)
  score_up=-19.46bp (TIME), score_down=3.46bp (TIME)
  score_up=-18.64bp (TIME), score_down=2.64bp (TIME)
  score_up=-19.46bp (TIME), score_down=3.46bp (TIME)
  SOLUSDT: 1,892,212 valid samples
Labeling BNBUSDT...
  score_up=-8.12bp (

## 5. Optuna Hyperparameter Optimization

In [183]:
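def purged_walk_forward_splits(n: int, n_splits: int = 5, purge_pct: float = 0.01):
    """
    Walk-forward splits with purging.

    Yields (train_idx, val_idx) pairs over row positions 0..n-1. Each training
    set is an expanding window, and a gap of 2 * purge_size rows separates it
    from its validation fold. Rows are assumed to be in chronological order;
    the label arrays built above are concatenated symbol by symbol, so they
    should be sorted by time first for the split to be a true walk-forward.
    """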
    fold_size = n // (n_splits + 1)
    purge_size = int(fold_size * purge_pct)
    
    for i in range(n_splits):
        tr_end = fold_size * (i + 1) - purge_size
        va_start = fold_size * (i + 1) + purge_size
        va_end = fold_size * (i + 2)
        
        yield np.arange(0, tr_end), np.arange(va_start, va_end)


def objective(trial, X, y, feature_columns):
    """Optuna objective for hyperparameter tuning"""
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 500, 1500),
        "max_depth": trial.suggest_int("max_depth", 5, 10),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.1, log=True),
        "subsample": trial.suggest_float("subsample", 0.6, 0.9),
        "subsample_freq": 1,  # LightGBM applies subsample only when subsample_freq > 0
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.6, 0.9),
        "min_child_samples": trial.suggest_int("min_child_samples", 30, 100),
        "reg_alpha": trial.suggest_float("reg_alpha", 0.01, 1.0, log=True),
        "reg_lambda": trial.suggest_float("reg_lambda", 0.01, 1.0, log=True),
        "objective": "huber",
        "alpha": 0.9,
        "random_state": 42,
        "n_jobs": -1,
        "verbose": -1,
    }
    
    X_df = pd.DataFrame(X, columns=feature_columns)
    
    maes = []
    for tr_idx, va_idx in purged_walk_forward_splits(len(X), n_splits=3):
        model = lgb.LGBMRegressor(**params)
        model.fit(
            X_df.iloc[tr_idx], y[tr_idx],
            eval_set=[(X_df.iloc[va_idx], y[va_idx])],
            callbacks=[lgb.early_stopping(50, verbose=False)],
        )
        preds = model.predict(X_df.iloc[va_idx])
        maes.append(mean_absolute_error(y[va_idx], preds))
    
    return np.mean(maes)


def optimize_hyperparameters(X, y, feature_columns, n_trials=30):
    """Run Optuna optimization"""
    if len(X) < 5000:
        print("Not enough data for optimization, using defaults")
        return None
    
    study = optuna.create_study(direction="minimize")
    study.optimize(
        lambda trial: objective(trial, X, y, feature_columns),
        n_trials=n_trials,
        show_progress_bar=True,
    )
    
    print(f"Best MAE: {study.best_value:.4f}")
    print(f"Best params: {study.best_params}")
    
    return study.best_params
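
To see what `purged_walk_forward_splits` actually yields, the sketch below copies the function from the cell above and prints the index ranges for a small `n`. A larger `purge_pct` than the default is used purely so the gap is easy to see.

```python
import numpy as np


def purged_walk_forward_splits(n: int, n_splits: int = 5, purge_pct: float = 0.01):
    """Copy of the notebook's splitter, repeated here so the example runs on its own."""
    fold_size = n // (n_splits + 1)
    purge_size = int(fold_size * purge_pct)
    for i in range(n_splits):
        tr_end = fold_size * (i + 1) - purge_size
        va_start = fold_size * (i + 1) + purge_size
        va_end = fold_size * (i + 2)
        yield np.arange(0, tr_end), np.arange(va_start, va_end)


splits = list(purged_walk_forward_splits(1000, n_splits=3, purge_pct=0.1))
for fold, (tr, va) in enumerate(splits):
    gap = va[0] - tr[-1] - 1
    print(f"fold {fold}: train 0..{tr[-1]}, val {va[0]}..{va[-1]}, gap {gap} rows")
```

Each training window starts at row 0 and grows with the fold number, while the purged gap on both sides of the boundary keeps the rows nearest the split out of both sets. This only prevents leakage if adjacent rows are adjacent in time.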

In [184]:
# Optimize for one regime to get base params (saves time)
# Use 300s horizon data for optimization (more signal)
print("Optimizing hyperparameters on up_high_300 (largest dataset)...")
best_params = optimize_hyperparameters(
    X_300["up_high"], 
    y_300["up_high"], 
    FEATURE_COLUMNS,
    n_trials=25
)

Optimizing hyperparameters on up_high_300 (largest dataset)...


  0%|          | 0/25 [00:00<?, ?it/s]

Best MAE: 0.6362
Best params: {'n_estimators': 1321, 'max_depth': 5, 'learning_rate': 0.04676567448026787, 'subsample': 0.751850610914496, 'colsample_bytree': 0.898420372470383, 'min_child_samples': 30, 'reg_alpha': 0.13123125638807445, 'reg_lambda': 0.4057147503158698}


## 6. Train Final Models with Optimized Params

In [185]:
def train_ensemble_model(
    X: np.ndarray,
    y: np.ndarray,
    name: str,
    feature_columns: List[str],
    best_params: Optional[Dict] = None,
    n_splits: int = 5
) -> Tuple[List, Dict]:
    """Train ensemble with optimized params"""
    if len(X) < 1000:
        print(f"Insufficient data for {name}: {len(X)} samples")
        return [], {}
    
    X_df = pd.DataFrame(X, columns=feature_columns)
    
    # Defaults, overridden key by key with the tuned values when available
    defaults = {
        "n_estimators": 1000,
        "max_depth": 7,
        "learning_rate": 0.02,
        # Note: LightGBM only applies subsample when subsample_freq > 0 (default is 0)
        "subsample": 0.7,
        "colsample_bytree": 0.7,
        "min_child_samples": 50,
        "reg_alpha": 0.1,
        "reg_lambda": 0.1,
    }
    tuned = best_params or {}
    params = {
        **{k: tuned.get(k, v) for k, v in defaults.items()},
        "objective": "huber",
        "alpha": 0.9,  # Huber loss parameter
        "n_jobs": -1,
        "verbose": -1,
    }
    
    models = []
    metrics = {"maes": [], "rmses": [], "top10_actual": [], "top25_actual": []}
    
    print(f"\n{'='*60}")
    print(f"Training {name} ({len(X):,} samples)")
    print(f"{'='*60}")
    
    for fold, (tr_idx, va_idx) in enumerate(purged_walk_forward_splits(len(X_df), n_splits)):
        model = lgb.LGBMRegressor(**params, random_state=42 + fold)
        
        model.fit(
            X_df.iloc[tr_idx], y[tr_idx],
            eval_set=[(X_df.iloc[va_idx], y[va_idx])],
            eval_metric="l1",
            callbacks=[lgb.early_stopping(100, verbose=False)],
        )
        
        preds = model.predict(X_df.iloc[va_idx])
        actual = y[va_idx]
        
        mae = mean_absolute_error(actual, preds)
        rmse = np.sqrt(mean_squared_error(actual, preds))
        
        # Top percentile analysis
        for q, key in [(90, "top10_actual"), (75, "top25_actual")]:
            thresh = np.percentile(preds, q)
            mask = preds >= thresh
            if mask.sum() > 0:
                metrics[key].append(actual[mask].mean())
        
        metrics["maes"].append(mae)
        metrics["rmses"].append(rmse)
        models.append(model)
        
        print(f"Fold {fold}: MAE={mae:.4f}, RMSE={rmse:.4f}")
    
    print(f"\n{name} Summary:")
    print(f"  Mean MAE: {np.mean(metrics['maes']):.4f}")
    print(f"  Mean RMSE: {np.mean(metrics['rmses']):.4f}")
    print(f"  Target STD: {np.std(y):.4f}")
    print(f"  MAE/STD: {np.mean(metrics['maes'])/np.std(y):.4f}")
    if metrics['top10_actual']:
        print(f"  Top 10% mean actual: {np.mean(metrics['top10_actual']):.4f}")
        print(f"  Top 25% mean actual: {np.mean(metrics['top25_actual']):.4f}")
    
    return models, metrics

In [186]:
# Train 60s horizon models
print("=" * 70)
print("TRAINING 60-SECOND HORIZON MODELS")
print("=" * 70)

models_60 = {}
metrics_60 = {}  # keep per-regime metrics instead of overwriting them each iteration
for key in ["up_low", "up_mid", "up_high", "down_low", "down_mid", "down_high"]:
    models, metrics = train_ensemble_model(
        X_60[key], y_60[key],
        f"{key.upper()}_60",
        FEATURE_COLUMNS,
        best_params=best_params,
    )
    models_60[key] = models
    metrics_60[key] = metrics

# Train 300s horizon models
print("\n" + "=" * 70)
print("TRAINING 300-SECOND HORIZON MODELS")
print("=" * 70)

models_300 = {}
metrics_300 = {}  # keep per-regime metrics instead of overwriting them each iteration
for key in ["up_low", "up_mid", "up_high", "down_low", "down_mid", "down_high"]:
    models, metrics = train_ensemble_model(
        X_300[key], y_300[key],
        f"{key.upper()}_300",
        FEATURE_COLUMNS,
        best_params=best_params,
    )
    models_300[key] = models
    metrics_300[key] = metrics

TRAINING 60-SECOND HORIZON MODELS

Training UP_LOW_60 (797,579 samples)
Fold 0: MAE=0.7686, RMSE=0.9649
Fold 1: MAE=0.7489, RMSE=0.9172
Fold 2: MAE=0.7337, RMSE=0.9062
Fold 3: MAE=0.6859, RMSE=0.8470
Fold 4: MAE=0.6535, RMSE=0.8266

UP_LOW_60 Summary:
  Mean MAE: 0.7181
  Mean RMSE: 0.8924
  Target STD: 0.9038
  MAE/STD: 0.7945
  Top 10% mean actual: 2.1142
  Top 25% mean actual: 2.1264

Training UP_MID_60 (651,961 samples)
Fold 0: MAE=0.7860, RMSE=0.9809
Fold 1: MAE=0.7422, RMSE=0.9075
Fold 2: MAE=0.7297, RMSE=0.8878
Fold 3: MAE=0.6751, RMSE=0.8306
Fold 4: MAE=0.6618, RMSE=0.8366

UP_MID_60 Summary:
  Mean MAE: 0.7190
  Mean RMSE: 0.8887
  Target STD: 0.9006
  MAE/STD: 0.7984
  Top 10% mean actual: 2.2187
  Top 25% mean actual: 2.1820

Training UP_HIGH_60 (1,621,959 samples)
Fold 0: MAE=0.7310, RMSE=0.9016
Fold 1: MAE=0.7497, RMSE=0.9174
Fold 2: MAE=0.6962, RMSE=0.8651
Fold 3: MAE=0.6479, RMSE=0.7958
Fold 4: MAE=0.5976, RMSE=0.7501

UP_HIGH_60 Summary:
  Mean MAE: 0.6845
  Mean RMSE: 

## 7. Feature Importance Analysis

In [187]:
def get_ensemble_feature_importance(models_dict: Dict, feature_cols: List[str]) -> pd.Series:
    """Average feature importance across all models, sorted descending"""
    all_importances = []
    
    for key, models in models_dict.items():
        for model in models:
            imp = pd.DataFrame({
                "feature": feature_cols,
                "importance": model.feature_importances_,
                "model": key
            })
            all_importances.append(imp)
    
    # Return an empty Series so the return type is the same in both branches
    if not all_importances:
        return pd.Series(dtype=float, name="importance")
    
    df_imp = pd.concat(all_importances)
    return df_imp.groupby("feature")["importance"].mean().sort_values(ascending=False)

# Feature importance for 300s models (main horizon)
print("Top 20 Features (300s models):")
fi_300 = get_ensemble_feature_importance(models_300, FEATURE_COLUMNS)
print(fi_300.head(20))

print("\nTop 20 Features (60s models):")
fi_60 = get_ensemble_feature_importance(models_60, FEATURE_COLUMNS)
print(fi_60.head(20))

Top 20 Features (300s models):
feature
vol_5m                   136.166667
dist_poc                 132.833333
dist_poc_atr             117.000000
hour_cos                 104.200000
hour_sin                 104.133333
dist_lvn                 100.633333
cum_delta_5m              96.233333
dist_lvn_atr              89.966667
vol_ratio                 61.400000
trade_intensity           44.000000
vol_1m                    39.733333
is_weekend                36.500000
cum_delta_1m              35.200000
MOI_flip_rate             21.100000
vol_rank                  19.700000
pair_SOLUSDT              12.866667
absorption_z              12.066667
pair_BNBUSDT              10.566667
pair_BTCUSDT               9.600000
AggressionPersistence      8.500000
Name: importance, dtype: float64

Top 20 Features (60s models):
feature
cum_delta_5m             98.933333
vol_5m                   98.800000
dist_poc                 93.566667
dist_poc_atr             85.966667
vol_ratio                78.3
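
The averages above are raw split counts, which is the default `importance_type` for `LGBMRegressor.feature_importances_`. Because early stopping leaves each fold with a different number of trees, larger models can dominate a plain mean. A minimal sketch of one alternative follows: convert each model's importances to shares that sum to 1 before averaging. The function name `normalized_importance` and its input layout are hypothetical and not part of the notebook.

```python
from typing import Dict, List, Sequence

import numpy as np
import pandas as pd


def normalized_importance(
    importances: Dict[str, List[Sequence[float]]],
    feature_cols: List[str],
) -> pd.Series:
    """Average per-model importance shares (each model's vector is scaled to sum to 1)."""
    shares = []
    for vectors in importances.values():
        for vec in vectors:
            arr = np.asarray(vec, dtype=float)
            total = arr.sum()
            if total > 0:  # skip models that never split
                shares.append(arr / total)
    if not shares:
        return pd.Series(dtype=float)
    mean_share = np.mean(shares, axis=0)
    return pd.Series(mean_share, index=feature_cols).sort_values(ascending=False)


# Toy input: two "models" with very different total split counts
toy = {"up_high": [[90, 10]], "down_low": [[1, 3]]}
result = normalized_importance(toy, ["vol_5m", "dist_poc"])
```

With the trained ensembles, the input could be built as `{key: [m.feature_importances_ for m in models] for key, models in models_300.items()}`. Gain-based importances are also available through `model.booster_.feature_importance(importance_type="gain")` and can be passed in the same way.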

## 8. Save Models

In [188]:
os.makedirs("models_v3", exist_ok=True)

# Save 60s models
print("Saving 60s models...")
for key, models in models_60.items():
    if models:
        path = f"models_v3/models_{key}_60.pkl"
        joblib.dump(models, path)
        print(f"  Saved {path}")

# Save 300s models
print("\nSaving 300s models...")
for key, models in models_300.items():
    if models:
        path = f"models_v3/models_{key}_300.pkl"
        joblib.dump(models, path)
        print(f"  Saved {path}")

# Save best hyperparameters
if best_params:
    with open("models_v3/best_params.json", "w") as f:
        json.dump(best_params, f, indent=2)
    print("\nSaved best_params.json")

# Copy feature columns
import shutil
shutil.copy("feature_columns_v3.json", "models_v3/feature_columns_v3.json")
print("Saved feature_columns_v3.json")

print("\n" + "=" * 60)
print("Done! Models saved to models_v3/")
print("=" * 60)

Saving 60s models...
  Saved models_v3/models_up_low_60.pkl
  Saved models_v3/models_up_mid_60.pkl
  Saved models_v3/models_up_high_60.pkl
  Saved models_v3/models_down_low_60.pkl
  Saved models_v3/models_down_mid_60.pkl
  Saved models_v3/models_down_high_60.pkl

Saving 300s models...
  Saved models_v3/models_up_low_300.pkl
  Saved models_v3/models_up_mid_300.pkl
  Saved models_v3/models_up_high_300.pkl
  Saved models_v3/models_down_low_300.pkl
  Saved models_v3/models_down_mid_300.pkl
  Saved models_v3/models_down_high_300.pkl

Saved best_params.json
Saved feature_columns_v3.json

Done! Models saved to models_v3/


## Summary

**V3 Enhanced with Improved Labeling:**

1. **Fee-Adjusted Max Favorable Move Labels** (from V3 Improved) - cleaner signal than triple-barrier
2. **Dual Horizons**: 60s and 300s models for different timeframes
3. **Cross-Sectional Features**: 5 rank-based features comparing symbols
4. **Optuna Optimization**: 25 trials on the `up_high` 300s dataset; the resulting parameters are reused for all models
5. **21 Days of Data**: More training data for robustness
6. **41 Features**: Extended feature set with cross-sectional ranks

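**Expected Improvements over previous V3 Enhanced:**
- MAE/STD ratio: ~0.70-0.75 (was 0.83-0.89); the `UP_LOW_60` and `UP_MID_60` runs above reached 0.79-0.80
- Top 10% mean actual: ~2.5-4.0 (was 1.8-2.2); the `UP_LOW_60` and `UP_MID_60` runs above reached 2.11-2.22
- Cleaner label distribution with profitable samples only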
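
One way to read the MAE/STD ratio is against a constant-prediction baseline. The sketch below computes the ratio for a predictor that always outputs the median of the target. It uses a synthetic, normally distributed target, which is an assumption for illustration only; the notebook's labels are fee-adjusted and may be skewed, so the baseline on real data can differ. The helper name `constant_baseline_ratio` is hypothetical.

```python
import numpy as np


def constant_baseline_ratio(y: np.ndarray) -> float:
    """MAE/STD of a predictor that always outputs the median of y."""
    y = np.asarray(y, dtype=float)
    mae = np.mean(np.abs(y - np.median(y)))
    return float(mae / np.std(y))


# Synthetic target, NOT the notebook's labels
rng = np.random.default_rng(0)
y_synthetic = rng.normal(loc=0.0, scale=1.0, size=200_000)
baseline_ratio = constant_baseline_ratio(y_synthetic)
```

For a normal target this ratio is close to `sqrt(2 / pi)`, roughly 0.80. Calling `constant_baseline_ratio(y_300["up_high"])` on the actual labels would give the figure a trained model needs to beat for that regime.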

**To use in production:**
1. Copy `models_v3/*.pkl` to the production server
2. Update the predictor to load the feature order from `feature_columns_v3.json`
3. Compute cross-sectional features in real time (or fall back to 0.5 when they are unavailable)
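
The steps above can be sketched as follows. This is a minimal illustration, not the production predictor: it assumes the ensemble prediction is the plain mean of the fold models' outputs, and the helper names `build_feature_row` and `predict_ensemble` and the `rank_features` argument are hypothetical. Only features listed in `rank_features` receive the 0.5 fallback; any other missing feature raises an error.

```python
from typing import Dict, List, Sequence, Set

import numpy as np
import pandas as pd


def build_feature_row(
    features: Dict[str, float],
    feature_columns: List[str],
    rank_features: Set[str],
    fallback: float = 0.5,
) -> pd.DataFrame:
    """Build a one-row frame in training column order, filling missing rank features."""
    row = {}
    for col in feature_columns:
        if col in features:
            row[col] = features[col]
        elif col in rank_features:
            row[col] = fallback  # cross-sectional rank not available
        else:
            raise KeyError(f"missing required feature: {col}")
    return pd.DataFrame([row], columns=feature_columns, dtype=np.float32)


def predict_ensemble(models: Sequence, X: pd.DataFrame) -> np.ndarray:
    """Average predictions over all fold models (assumed aggregation rule)."""
    preds = np.column_stack([m.predict(X) for m in models])
    return preds.mean(axis=1)
```

In use, `models` would come from `joblib.load("models_v3/models_up_high_300.pkl")` and `feature_columns` from `json.load` on `models_v3/feature_columns_v3.json`, matching the files saved above.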