# Multi-Strategy Benchmark (Extended)

Compare **15 alpha extraction strategies** under identical walk-forward conditions.

| ID | Strategy | Architecture |
|----|----------|-------------|
| A | LSTM Baseline | Price-only temporal LSTM |
| B | NLP Only | Sentiment MLP |
| C | Late Ensemble | A + B with ridge meta |
| D | Residual Sentiment | Market-residualized NLP |
| E | Gated Hybrid | Gated price-NLP fusion |
| E1 | Attention Hybrid | Cross-attention price/NLP |
| E2 | Additive Residual | T + alpha*S fusion |
| F | CS-Attention LSTM | Temporal + stock attention |
| F1 | Temporal Fusion | Simplified TFT |
| G | Transformer | Encoder-only transformer |
| G1 | Efficient Transformer | Linear attention O(n*d) |
| H | Random Forest | sklearn RF (no torch) |
| I | LightGBM | Gradient boosted trees (no torch) |
| J | Short Horizon NLP | NLP MLP 5D only |
| K | Short Horizon Hybrid | Gated hybrid 5D only |

**Evaluation:** IC, Sharpe, R2, per-stock R2, directional accuracy, calibration, market-level prediction, backtest PnL, and Diebold-Mariano (DM) statistical tests.

**Goal:** Determine which structure produces real, production-surviving alpha.
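
IC (information coefficient) is the headline metric in the evaluation list above. The sketch below shows one common definition: the Spearman rank correlation between predictions and realized returns across tickers on each date, averaged over dates. It is an illustration under that assumption, not the repository's implementation; the helper names and toy data are hypothetical.

```python
def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out

def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def cross_sectional_ic(preds_by_date, rets_by_date):
    """Mean over dates of the rank correlation across tickers."""
    daily = []
    for date, preds in preds_by_date.items():
        rets = rets_by_date[date]
        tickers = sorted(preds)
        p = [preds[t] for t in tickers]
        r = [rets[t] for t in tickers]
        daily.append(pearson(ranks(p), ranks(r)))
    return sum(daily) / len(daily)

# Toy data (hypothetical): perfect ranking on day 1, inverted on day 2
preds = {"2024-01-02": {"AAPL": 0.3, "MSFT": 0.1, "JPM": -0.2},
         "2024-01-03": {"AAPL": 0.1, "MSFT": 0.2, "JPM": 0.3}}
rets = {"2024-01-02": {"AAPL": 0.02, "MSFT": 0.01, "JPM": -0.01},
        "2024-01-03": {"AAPL": 0.03, "MSFT": 0.02, "JPM": 0.01}}
ic = cross_sectional_ic(preds, rets)
print(ic)
```

Because the two days cancel out, the mean IC here is zero even though each day has a perfect (positive or negative) rank correlation, which is why the per-fold spread of IC matters as much as its mean.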

In [1]:
!pip install -q yfinance lightgbm torch optuna pyarrow scikit-learn scipy pandas numpy matplotlib

In [2]:
import os
os.chdir("/content")
!rm -rf AI-stock-investment-tool

REPO = "https://github.com/kevin6598/AI-stock-investment-tool.git"
ret = os.system("git clone %s 2>/dev/null" % REPO)
if ret != 0:
    from getpass import getpass
    token = getpass("GitHub token (repo scope): ")
    os.system("git clone https://%s@github.com/kevin6598/AI-stock-investment-tool.git" % token)
    del token

os.chdir("/content/AI-stock-investment-tool")
!git log --oneline -3

899e4a5 (HEAD -> master, origin/master, origin/HEAD) Add checkpoint/resume support to benchmark for runtime timeout recovery
2434132 Optimize benchmark performance: remove train inference, vectorize sequences, reduce epochs
0138e4a Add real-time progress display to benchmark run_all and run_walk_forward


In [3]:
import torch, sys
print("Python: %s" % sys.version)
print("PyTorch: %s" % torch.__version__)
print("CUDA: %s" % torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU: %s" % torch.cuda.get_device_name(0))

Python: 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]
PyTorch: 2.9.0+cu128
CUDA: True
GPU: Tesla T4


## 1. Dataset & Benchmark Configuration

**Dataset presets** -- choose universe size, market, and data period.
Walk-forward parameters auto-adjust based on data period length.

| Preset | Universe | Tickers | Best for |
|--------|----------|---------|----------|
| default | S&P subset + KOSPI/KOSDAQ | ~100 | Quick test (~30 min) |
| extended | Larger S&P + Korean | ~160 | Thorough benchmark (~60 min) |
| custom | User-defined | variable | Focused analysis |

| Period | Train Years | Horizons | Est. Folds |
|--------|-------------|----------|------------|
| 5y | 2 | 5D, 21D | ~4 |
| 10y | 3 | 5D, 21D, 63D | ~10 |
| 15y | 4 | 5D, 21D, 63D | ~16 |
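
To make the fold counts in the table concrete, here is a minimal sketch of how rolling walk-forward windows could be laid out in month offsets. It assumes a rolling (not expanding) training window and a one-month buffer that mirrors the `+ 1` in the fold estimate computed later in this notebook. The function name is hypothetical, the 5-day embargo is not modeled, and the repository's splitter may differ.

```python
def walk_forward_windows(span_months, train_months, val_months,
                         test_months, step_months, buffer_months=1):
    """Return a list of (train, val, test) half-open month-offset ranges."""
    folds = []
    start = 0
    while True:
        train_end = start + train_months
        val_end = train_end + val_months
        test_end = val_end + test_months
        # stop once the test window (plus buffer) no longer fits in the data
        if test_end + buffer_months > span_months:
            break
        folds.append(((start, train_end), (train_end, val_end), (val_end, test_end)))
        start += step_months
    return folds

# 10y preset in standard mode: 105 usable months, train=36, val=3, test=6, step=6
folds = walk_forward_windows(105, 36, 3, 6, 6)
print(len(folds))
for train, val, test in folds[:2]:
    print("train", train, "val", val, "test", test)
```

Each fold shifts all three windows forward by `step_months`, so test windows never overlap the training data of the same fold.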

In [4]:
# ==============================================
# DATASET CONFIGURATION -- CHANGE THESE VALUES
# ==============================================

# Universe: "default" (~100 tickers), "extended" (~160), "custom"
UNIVERSE = "extended"

# Market filter: "ALL" (KOSPI+KOSDAQ+NYSE/NASDAQ), "US" (NYSE/NASDAQ only), "KR" (KOSPI+KOSDAQ only)
MARKET = "ALL"

# Data period: "5y", "10y", "15y"
PERIOD = "10y"

# Custom tickers (only used when UNIVERSE="custom")
CUSTOM_TICKERS = [
    # US Tech
    "AAPL", "MSFT", "GOOGL", "AMZN", "NVDA", "META", "TSLA",
    # US Finance
    "JPM", "GS", "V", "MA",
    # US Healthcare
    "JNJ", "UNH", "LLY", "PFE",
    # KOSPI
    "005930.KS", "000660.KS", "035420.KS", "005380.KS", "051910.KS",
    # KOSDAQ
    "293490.KQ", "247540.KQ", "086520.KQ",
]

# ==============================================
# RUN MODE -- controls speed vs thoroughness
# ==============================================
# "quick"    : 8 key strategies, 2 horizons, fewer folds (~20-30 min)
# "standard" : all 15 strategies, auto horizons (~60-90 min)
# "full"     : all 15 strategies, all horizons, more folds (~120-180 min)
RUN_MODE = "standard"

# Strategy subset (None = use RUN_MODE default, or list specific keys)
# e.g. ["A", "C", "E", "E1", "G", "H", "I"] for a focused comparison
STRATEGY_KEYS = None

# Email notification when benchmark completes
NOTIFY_EMAIL = "seo.kevin6598@gmail.com"  # set to None to disable

# ==============================================
# BENCHMARK CONFIGURATION (auto-adjusted)
# ==============================================
_period_years = {"5y": 5, "10y": 10, "15y": 15}.get(PERIOD, 5)

# Run mode presets
if RUN_MODE == "quick":
    _max_epochs = 8
    _early_stop = 3
    if _period_years >= 10:
        _train_years = 3
        _test_months = 6
        _step_months = 12   # larger steps = fewer folds
        _val_months = 3
        _horizons = [5, 21]  # skip 63D for speed
    else:
        _train_years = 2
        _test_months = 4
        _step_months = 6
        _val_months = 2
        _horizons = [5, 21]
    _default_strategies = ["A", "B", "C", "E", "E1", "G", "H", "I"]
elif RUN_MODE == "full":
    _max_epochs = 15
    _early_stop = 4
    if _period_years >= 15:
        _train_years = 4
        _test_months = 6
        _step_months = 4     # smaller steps = more folds
        _val_months = 3
        _horizons = [5, 21, 63]
    elif _period_years >= 10:
        _train_years = 3
        _test_months = 6
        _step_months = 4
        _val_months = 3
        _horizons = [5, 21, 63]
    else:
        _train_years = 2
        _test_months = 4
        _step_months = 3
        _val_months = 2
        _horizons = [5, 21]
    _default_strategies = None  # all
else:  # standard
    _max_epochs = 10
    _early_stop = 3
    if _period_years >= 10:
        _train_years = 3
        _test_months = 6
        _step_months = 6
        _val_months = 3
        _horizons = [5, 21, 63]
    else:
        _train_years = 2
        _test_months = 4
        _step_months = 4
        _val_months = 2
        _horizons = [5, 21]
    _default_strategies = None  # all

# User override for strategy subset
STRATEGY_KEYS = STRATEGY_KEYS if STRATEGY_KEYS is not None else _default_strategies

CONFIG = {
    "horizons": _horizons,
    "walk_forward": {
        "train_years": _train_years,
        "test_months": _test_months,
        "step_months": _step_months,
        "val_months": _val_months,
        "embargo_days": 5,
    },
    "max_epochs": _max_epochs,
    "early_stop_patience": _early_stop,
    "ranking_weight": 0.5,
    "max_params": 1_500_000,
}

print("=== Dataset Configuration ===")
print("Universe: %s  |  Market: %s  |  Period: %s (%dy)" % (
    UNIVERSE, MARKET, PERIOD, _period_years))
print("\n=== Run Mode: %s ===" % RUN_MODE.upper())
if STRATEGY_KEYS:
    print("Strategies: %d selected %s" % (len(STRATEGY_KEYS), STRATEGY_KEYS))
else:
    print("Strategies: ALL 15")
print("Horizons: %s" % _horizons)
print("Walk-forward: train=%dy, test=%dmo, step=%dmo, val=%dmo" % (
    _train_years, _test_months, _step_months, _val_months))
print("Epochs: max=%d, early_stop=%d" % (_max_epochs, _early_stop))
if NOTIFY_EMAIL:
    print("\nEmail notification: %s" % NOTIFY_EMAIL)

=== Dataset Configuration ===
Universe: extended  |  Market: ALL  |  Period: 10y (10y)

=== Run Mode: STANDARD ===
Strategies: ALL 15
Horizons: [5, 21, 63]
Walk-forward: train=3y, test=6mo, step=6mo, val=3mo
Epochs: max=10, early_stop=3

Email notification: seo.kevin6598@gmail.com


## 2. Standardized Data Pipeline

Downloads OHLCV data via yfinance, builds features (technical, sentiment, macro, fundamental, risk),
and caches the result to Google Drive for fast re-runs.

First run may take **15-30 min** depending on universe size. Subsequent runs load the cached Parquet file and skip the download and feature build.
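
The pipeline cell calls `cross_sectional_normalize` after building the panel; its implementation lives in the repository and is not shown here. As a rough illustration of what cross-sectional normalization usually means, the sketch below z-scores a single feature across tickers within each date. The function name, the dict-based layout, and the choice of z-scoring are all assumptions made for the example.

```python
from statistics import mean, pstdev

def zscore_by_date(panel):
    """panel: {date: {ticker: value}} -> same shape, z-scored within each date."""
    out = {}
    for date, row in panel.items():
        mu = mean(row.values())
        sigma = pstdev(row.values())
        # guard against a constant cross-section, where sigma is zero
        out[date] = {t: (v - mu) / sigma if sigma > 0 else 0.0
                     for t, v in row.items()}
    return out

# Toy single-feature panel (hypothetical values)
raw = {"2024-01-02": {"AAPL": 1.0, "MSFT": 2.0, "JPM": 3.0}}
normalized = zscore_by_date(raw)
print(normalized["2024-01-02"])
```

Normalizing within each date uses only information available on that date, so it does not leak future values into the features.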

In [5]:
import pandas as pd
import numpy as np

# Try mounting Google Drive (optional - falls back to local storage)
DRIVE_MOUNTED = False
try:
    from google.colab import drive
    drive.mount('/content/drive', timeout_ms=60000)
    DRIVE_MOUNTED = True
    DRIVE_DIR = "/content/drive/MyDrive/ai_stock_tool"
    print("Google Drive mounted.")
except Exception as _drive_err:
    print("Drive mount failed: %s" % str(_drive_err)[:80])
    print("Using local storage instead (data will NOT persist across sessions).")
    DRIVE_DIR = "/content/ai_stock_tool"

ARTIFACT_DIR = os.path.join(DRIVE_DIR, "artifacts")
os.makedirs(ARTIFACT_DIR, exist_ok=True)

# Cache path based on universe, market, period
_cache_name = "benchmark_%s_%s_%s.parquet" % (UNIVERSE, MARKET, PERIOD)
DATA_PATH = os.path.join(DRIVE_DIR, _cache_name)
print("Data path: %s" % DATA_PATH)
print("Artifacts: %s" % ARTIFACT_DIR)

if os.path.exists(DATA_PATH):
    print("\nFound cached dataset, loading...")
    panel = pd.read_parquet(DATA_PATH)
    print("Loaded from cache: %s" % str(panel.shape))
else:
    print("\nNo cached dataset found. Building from scratch...")
    print("This may take 15-30 min depending on universe size.\n")

    from data.stock_api import get_historical_data, get_stock_info
    from data.universe_manager import UniverseManager
    from training.feature_engineering import build_panel_dataset, cross_sectional_normalize

    # --- Resolve ticker list ---
    um = UniverseManager()
    if UNIVERSE == "extended":
        um._members = UniverseManager.load_extended_universe()

    if UNIVERSE == "custom":
        tickers = CUSTOM_TICKERS
    else:
        tickers = um.get_universe_by_market(MARKET)

    print("Target: %d tickers (universe=%s, market=%s, period=%s)" % (
        len(tickers), UNIVERSE, MARKET, PERIOD))

    # --- Download OHLCV + company info ---
    import time as _time
    stock_dfs = {}
    stock_infos = {}
    failed = []
    t0 = _time.time()

    for i, ticker in enumerate(tickers):
        if (i + 1) % 10 == 0 or i == 0:
            elapsed = _time.time() - t0
            print("  [%d/%d] %s (%.0fs elapsed)" % (i + 1, len(tickers), ticker, elapsed))
        try:
            df = get_historical_data(ticker, period=PERIOD)
            if df.empty:
                failed.append(ticker)
                continue
            stock_dfs[ticker] = df
            info = get_stock_info(ticker) or {}
            stock_infos[ticker] = info
        except Exception as ex:
            failed.append(ticker)
            print("    FAILED: %s -- %s" % (ticker, str(ex)[:80]))

    dl_time = _time.time() - t0
    print("\nDownloaded: %d/%d tickers in %.0f sec" % (
        len(stock_dfs), len(tickers), dl_time))
    if failed:
        print("Failed (%d): %s%s" % (
            len(failed), ", ".join(failed[:20]),
            " ..." if len(failed) > 20 else ""))

    # --- Market index ---
    market_ticker = um.get_market_ticker(list(stock_dfs.keys()))
    print("\nMarket index: %s" % market_ticker)
    market_df = get_historical_data(market_ticker, period=PERIOD)
    if market_df.empty:
        print("WARNING: Market index download failed")
        market_df = None

    # --- Build feature panel ---
    horizons_days = CONFIG["horizons"]
    print("\nBuilding features (horizons=%s)..." % horizons_days)
    t1 = _time.time()
    panel = build_panel_dataset(stock_dfs, stock_infos, market_df, horizons_days)
    print("Raw panel: %s (%.0f sec)" % (str(panel.shape), _time.time() - t1))

    # Cross-sectional normalization
    panel = cross_sectional_normalize(panel)
    print("Normalized: %s" % str(panel.shape))

    # Save for future runs
    try:
        panel.to_parquet(DATA_PATH)
        _size_mb = os.path.getsize(DATA_PATH) / 1e6
        print("\nCached to: %s (%.1f MB)" % (DATA_PATH, _size_mb))
        if not DRIVE_MOUNTED:
            print("NOTE: Local cache only -- will be lost when runtime disconnects.")
    except Exception as _save_err:
        print("Cache save failed: %s" % str(_save_err)[:80])

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Google Drive mounted.
Data path: /content/drive/MyDrive/ai_stock_tool/benchmark_extended_ALL_10y.parquet
Artifacts: /content/drive/MyDrive/ai_stock_tool/artifacts

Found cached dataset, loading...
Loaded from cache: (346696, 124)


In [6]:
valid_tickers = panel.index.get_level_values(1).unique().tolist()
print("Panel: %s" % str(panel.shape))
print("Tickers: %d" % len(valid_tickers))

date_min = panel.index.get_level_values(0).min()
date_max = panel.index.get_level_values(0).max()
data_span_months = (date_max - date_min).days / 30.44
print("Date range: %s to %s (%.0f months)" % (
    date_min.date(), date_max.date(), data_span_months))

# --- Market Composition ---
kr_kospi = sorted([t for t in valid_tickers if t.upper().endswith(".KS")])
kr_kosdaq = sorted([t for t in valid_tickers if t.upper().endswith(".KQ")])
us_tickers = sorted([t for t in valid_tickers
                     if not t.upper().endswith((".KS", ".KQ"))])

print("\n=== Market Composition ===")
print("NYSE/NASDAQ: %d tickers" % len(us_tickers))
print("KOSPI (.KS): %d tickers" % len(kr_kospi))
print("KOSDAQ (.KQ): %d tickers" % len(kr_kosdaq))

if kr_kospi:
    print("\nKOSPI: %s" % ", ".join(kr_kospi))
if kr_kosdaq:
    print("KOSDAQ: %s" % ", ".join(kr_kosdaq))
if us_tickers and len(us_tickers) <= 30:
    print("US: %s" % ", ".join(us_tickers))
elif us_tickers:
    print("US (first 30): %s ..." % ", ".join(us_tickers[:30]))

# --- Per-ticker data coverage ---
ticker_counts = panel.groupby(level=1).size()
print("\n=== Data Coverage ===")
print("Avg samples/ticker: %.0f" % ticker_counts.mean())
print("Min: %d (%s)  Max: %d (%s)" % (
    ticker_counts.min(), ticker_counts.idxmin(),
    ticker_counts.max(), ticker_counts.idxmax()))

thin = ticker_counts[ticker_counts < 500]
if len(thin) > 0:
    print("Tickers with < 500 samples (%d): %s" % (
        len(thin), ", ".join(thin.index.tolist()[:15])))

# --- Fold estimate ---
wf = CONFIG["walk_forward"]
train_months = wf["train_years"] * 12
val_months = wf.get("val_months", 3)
test_months = wf["test_months"]
step_months = wf["step_months"]
min_needed = train_months + val_months + test_months + 1
available_after_first = data_span_months - min_needed
est_folds = max(0, 1 + int(available_after_first / step_months))
print("\n=== Walk-Forward Folds ===")
print("train=%dmo + val=%dmo + test=%dmo = %dmo minimum" % (
    train_months, val_months, test_months, min_needed))
print("Expected folds: ~%d (step=%dmo over %.0f months)" % (
    est_folds, step_months, data_span_months))
if est_folds == 0:
    print("WARNING: No folds possible! Reduce train_years or increase data period.")

# --- Feature columns ---
feature_cols = [
    c for c in panel.columns
    if not c.startswith("fwd_return_")
    and not c.startswith("residual_return_")
    and not c.startswith("ranked_target_")
    and c not in ("_close", "ticker_id")
]

price_cols = [c for c in feature_cols if not c.startswith("nlp_")]
nlp_cols = [c for c in feature_cols if c.startswith("nlp_")]
print("\n=== Features ===")
print("Total: %d  (Price: %d, NLP: %d)" % (
    len(feature_cols), len(price_cols), len(nlp_cols)))

# Check required targets
for h in CONFIG["horizons"]:
    tc = "fwd_return_%dd" % h
    if tc in panel.columns:
        non_null = panel[tc].notna().sum()
        print("  %s: %d non-null (%.1f%%)" % (
            tc, non_null, non_null / len(panel) * 100))
    else:
        print("  WARNING: %s not found!" % tc)

Panel: (346696, 124)
Tickers: 160
Date range: 2017-02-13 to 2025-11-12 (105 months)

=== Market Composition ===
NYSE/NASDAQ: 139 tickers
KOSPI (.KS): 15 tickers
KOSDAQ (.KQ): 6 tickers

KOSPI: 000270.KS, 000660.KS, 003670.KS, 005380.KS, 005930.KS, 006400.KS, 028260.KS, 035420.KS, 035720.KS, 051910.KS, 055550.KS, 068270.KS, 105560.KS, 207940.KS, 373220.KS
KOSDAQ: 086520.KQ, 112040.KQ, 196170.KQ, 247540.KQ, 263750.KQ, 293490.KQ
US (first 30): AAPL, ABBV, ABT, ADBE, AEP, AMAT, AMD, AMGN, AMT, AMZN, AON, APD, AVGO, AXP, AZO, BA, BAC, BDX, BKNG, BLK, BMY, C, CAT, CB, CDNS, CHTR, CL, CMCSA, CME, COP ...

=== Data Coverage ===
Avg samples/ticker: 2167
Min: 672 (373220.KS)  Max: 2200 (AAPL)

=== Walk-Forward Folds ===
train=36mo + val=3mo + test=6mo = 46mo minimum
Expected folds: ~10 (step=6mo over 105 months)

=== Features ===
Total: 114  (Price: 87, NLP: 27)
  fwd_return_5d: 346690 non-null (100.0%)
  fwd_return_21d: 346690 non-null (100.0%)
  fwd_return_63d: 346690 non-null (100.0%)


In [7]:
# Precompute derived features for all strategies

# Sentiment residual (for Strategy D)
market_feat = None
for cand in ["market_return", "market_return_21d", "spy_return_21d"]:
    if cand in panel.columns:
        market_feat = cand
        break

if market_feat:
    print("Market return feature: %s" % market_feat)
else:
    print("WARNING: No market return feature found for residualization")

# Volatility proxy
vol_feat = None
for cand in ["volatility_21d", "realized_vol_21d", "atr_pct"]:
    if cand in panel.columns:
        vol_feat = cand
        break

if vol_feat:
    print("Volatility feature: %s" % vol_feat)
else:
    print("WARNING: No volatility feature found")

# Data quality check
nan_pct = panel[feature_cols].isna().mean().mean() * 100
print("\nNaN rate: %.2f%%" % nan_pct)
print("Data pipeline ready.")

Volatility feature: volatility_21d

NaN rate: 0.00%
Data pipeline ready.


## 3. Strategy Definitions

All 15 strategies share an identical interface: `train()`, `predict()`, `num_parameters()`.

- **A-F**: Neural temporal/NLP models (require PyTorch)
- **E1, E2**: Novel fusion architectures (cross-attention, additive residual)
- **F1**: Simplified Temporal Fusion Transformer
- **G, G1**: Transformer variants (standard + linear attention)
- **H, I**: Tree-based models (no torch required)
- **J, K**: Short-horizon specialists (5D only)
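
The shared interface can be pictured with a toy class. This is a sketch only: the method signatures, the argument names, and the `Z_Mean_Return` name are assumptions, since the real base class in `training.strategy_benchmark` is not shown here. The `name` and `supported_horizons` attributes are included because the next cell reads them from each registered class.

```python
class MeanReturnStrategy:
    """Toy strategy: predicts the training-set mean return for every row."""

    name = "Z_Mean_Return"      # hypothetical name, not part of the registry
    supported_horizons = None   # the notebook treats None as "all horizons"

    def __init__(self):
        self._mean = 0.0

    def train(self, features, targets):
        # features are ignored here; a real strategy would fit a model
        self._mean = sum(targets) / len(targets)
        return self

    def predict(self, features):
        # one prediction per input row
        return [self._mean for _ in features]

    def num_parameters(self):
        return 1  # the single fitted mean

strategy = MeanReturnStrategy().train([[0.1], [0.2]], [0.01, 0.03])
predictions = strategy.predict([[0.3], [0.4], [0.5]])
print(predictions, strategy.num_parameters())
```

A constant predictor like this has no cross-sectional ranking power, which makes it a useful sanity baseline when checking that an evaluation harness scores it near zero.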

In [8]:
from training.strategy_benchmark import (
    BenchmarkConfig, BenchmarkEvaluator, STRATEGY_REGISTRY,
    save_benchmark_results, run_integrity_checks,
    ExtendedMetrics, DMTestResult,
)

config = BenchmarkConfig.from_dict(CONFIG)

# Filter strategies based on STRATEGY_KEYS
if STRATEGY_KEYS is not None:
    selected_registry = {k: v for k, v in STRATEGY_REGISTRY.items() if k in STRATEGY_KEYS}
    missing = set(STRATEGY_KEYS) - set(selected_registry.keys())
    if missing:
        print("WARNING: Unknown strategy keys: %s" % missing)
else:
    selected_registry = STRATEGY_REGISTRY

print("Strategies to benchmark (%d / %d):" % (len(selected_registry), len(STRATEGY_REGISTRY)))
print("-" * 55)
for key, cls in sorted(selected_registry.items()):
    sh = getattr(cls, 'supported_horizons', None)
    h_info = "all horizons" if sh is None else "%s only" % sh
    print("  %-3s: %-28s (%s)" % (key, cls.name, h_info))

# Estimate total evaluations
n_strats = len(selected_registry)
n_horizons = len(config.horizons)
# Short-horizon strategies only run at 5D
short_only = sum(1 for cls in selected_registry.values()
                 if getattr(cls, 'supported_horizons', None) == [5])
n_evals = n_strats * n_horizons - short_only * (n_horizons - 1)
print("\nEstimated evaluations: %d (strategies x horizons)" % n_evals)
print("Estimated folds per eval: ~%d" % est_folds)
print("Total training runs: ~%d" % (n_evals * est_folds))

Strategies to benchmark (15 / 15):
-------------------------------------------------------
  A  : A_LSTM_Baseline              (all horizons)
  B  : B_NLP_Only                   (all horizons)
  C  : C_Late_Ensemble              (all horizons)
  D  : D_Residual_Sentiment         (all horizons)
  E  : E_Gated_Hybrid               (all horizons)
  E1 : E1_Attention_Hybrid          (all horizons)
  E2 : E2_Additive_Residual         (all horizons)
  F  : F_CS_Attention_LSTM          (all horizons)
  F1 : F1_Temporal_Fusion           (all horizons)
  G  : G_Transformer                (all horizons)
  G1 : G1_Efficient_Transformer     (all horizons)
  H  : H_Random_Forest              (all horizons)
  I  : I_LightGBM                   (all horizons)
  J  : J_Short_NLP_5D               ([5] only)
  K  : K_Short_Hybrid_5D            ([5] only)

Estimated evaluations: 41 (strategies x horizons)
Estimated folds per eval: ~10
Total training runs: ~410


## 4. Run Benchmark

Walk-forward evaluation for selected strategies at configured horizons.

**Checkpoint/Resume:** Results are saved per-strategy to `ARTIFACT_DIR/checkpoints/`.
If the runtime times out, simply re-run this cell -- completed strategies will be
loaded from checkpoint and only remaining strategies will be computed.
Set `CLEAR_CHECKPOINTS = True` to discard previous results and start fresh.

**Estimated runtime (T4 GPU):**

| Run Mode | Strategies | Horizons | Folds (10y) | Est. Time |
|----------|-----------|----------|-------------|-----------|
| `quick` | 8 key | 5D, 21D | ~5 | **20-30 min** |
| `standard` | all 15 | 5D, 21D, 63D | ~10 | **60-90 min** |
| `full` | all 15 | 5D, 21D, 63D | ~14 | **120-180 min** |

Email notification is sent to `NOTIFY_EMAIL` when the benchmark completes.
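
The checkpoint/resume behavior described above can be sketched with the standard library. This is an assumed pattern, not the code inside `BenchmarkEvaluator.run_all`: the function name, the task dictionary, and the one-file-per-key layout are hypothetical, chosen to match the `*.pkl` files the next cell looks for.

```python
import os
import pickle
import tempfile

def run_with_checkpoints(tasks, checkpoint_dir):
    """tasks: {key: zero-arg callable}. Returns {key: result}, reusing saved results."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    results = {}
    for key, fn in tasks.items():
        path = os.path.join(checkpoint_dir, "%s.pkl" % key)
        if os.path.exists(path):
            # completed in an earlier run: load instead of recomputing
            with open(path, "rb") as fh:
                results[key] = pickle.load(fh)
            continue
        results[key] = fn()
        # write to a temp file first so a crash cannot leave a half-written checkpoint
        tmp_path = path + ".tmp"
        with open(tmp_path, "wb") as fh:
            pickle.dump(results[key], fh)
        os.replace(tmp_path, path)
    return results

calls = []

def make_task(name, value):
    def task():
        calls.append(name)  # record that real work happened
        return value
    return task

demo_dir = tempfile.mkdtemp()
tasks = {"A_5D": make_task("A_5D", 0.031), "B_5D": make_task("B_5D", 0.012)}
first = run_with_checkpoints(tasks, demo_dir)
second = run_with_checkpoints(tasks, demo_dir)  # nothing is recomputed
print(first == second, calls)
```

Writing through a temporary file and renaming it matters on a runtime that can be killed at any moment, because a truncated pickle would otherwise be mistaken for a finished result.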

In [None]:
import logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s", datefmt="%H:%M:%S")

evaluator = BenchmarkEvaluator(panel, feature_cols, config)

# Checkpoint directory -- saves each strategy result to disk after completion.
# On runtime timeout, re-run this cell to resume from the last checkpoint.
CHECKPOINT_DIR = os.path.join(ARTIFACT_DIR, "checkpoints")
os.makedirs(CHECKPOINT_DIR, exist_ok=True)

# To force a full re-run (discard previous checkpoints), set CLEAR_CHECKPOINTS = True
CLEAR_CHECKPOINTS = False
if CLEAR_CHECKPOINTS:
    import glob as _glob
    for _f in _glob.glob(os.path.join(CHECKPOINT_DIR, "*.pkl")):
        os.remove(_f)
    print("Cleared all checkpoints.")

# Show existing checkpoints
_existing = [f for f in os.listdir(CHECKPOINT_DIR) if f.endswith(".pkl")]
if _existing:
    print("Found %d checkpoint(s) from previous run -- will resume:" % len(_existing))
    for _f in sorted(_existing):
        print("  %s" % _f)
else:
    print("No checkpoints found -- starting fresh run.")

strategy_classes = list(selected_registry.values())
n_horizons_eff = len(config.horizons)
print("\nRunning %d strategies x %d horizons (mode=%s)..." % (
    len(strategy_classes), n_horizons_eff, RUN_MODE))
print("=" * 60)

import time
t0 = time.time()
results = evaluator.run_all(strategy_classes, checkpoint_dir=CHECKPOINT_DIR)
total_time = time.time() - t0

print("\n" + "=" * 60)
print("Benchmark complete in %.1f min" % (total_time / 60))
print("Total evaluations: %d" % len(results))
print("Strategies with extended metrics: %d" % sum(1 for r in results if r.extended))

# --- Email notification ---
if NOTIFY_EMAIL:
    _benchmark_summary = (
        "Benchmark complete in %.1f min\n"
        "Mode: %s | Universe: %s | Market: %s | Period: %s\n"
        "Strategies: %d | Evaluations: %d\n\n"
        % (total_time / 60, RUN_MODE, UNIVERSE, MARKET, PERIOD,
           len(strategy_classes), len(results))
    )
    # Top 5 results
    _top = sorted(results, key=lambda r: r.composite, reverse=True)[:5]
    _benchmark_summary += "=== Top 5 by Composite ===\n"
    for _r in _top:
        _benchmark_summary += "  %s/%s: IC=%.4f Sharpe=%.2f Composite=%.4f [%s]\n" % (
            _r.name, _r.horizon, _r.ic_mean, _r.sharpe, _r.composite, _r.status)

    try:
        from email.mime.text import MIMEText
        from google.colab import auth
        import google.auth
        import google.auth.transport.requests
        from googleapiclient.discovery import build
        import base64

        auth.authenticate_user()
        creds, _ = google.auth.default()
        creds.refresh(google.auth.transport.requests.Request())
        service = build('gmail', 'v1', credentials=creds)

        msg = MIMEText(_benchmark_summary)
        msg['to'] = NOTIFY_EMAIL
        msg['subject'] = '[Benchmark] %s/%s/%s done - %.0fmin' % (
            UNIVERSE, MARKET, PERIOD, total_time / 60)
        raw = base64.urlsafe_b64encode(msg.as_bytes()).decode()
        service.users().messages().send(
            userId='me', body={'raw': raw}).execute()
        print("Email sent to %s" % NOTIFY_EMAIL)
    except Exception as mail_err:
        print("Email notification failed: %s" % str(mail_err)[:100])
        print("(Summary printed above instead)")

No checkpoints found -- starting fresh run.

Running 15 strategies x 3 horizons (mode=standard)...

>>> [1/41] Running: A_LSTM_Baseline / 5D  |  elapsed 0.0min  |  ETA --
      fold 1/10 ...

## 5. Strategy Comparison Table

In [None]:
import math

# Build comparison table
rows = []
for r in results:
    rows.append({
        "Strategy": r.name,
        "Horizon": r.horizon,
        "IC": round(r.ic_mean, 4),
        "ICIR": round(r.icir, 2),
        "Sharpe": round(r.sharpe, 2),
        "IC_std": round(r.ic_std, 4),
        "Prod IC": round(r.prod_ic, 4),
        "Overfit": round(r.overfit_score, 3),
        "Composite": round(r.composite, 4),
        "Params": r.param_count,
        "Time(s)": round(r.train_time, 1),
        "Status": r.status,
    })

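df_results = pd.DataFrame(rows)
if not df_results.empty:
    # Sort so the table matches its "sorted by Composite" heading
    df_results = df_results.sort_values("Composite", ascending=False).reset_index(drop=True)
print("\nStrategy Benchmark Results (sorted by Composite):")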
print("=" * 100)
display(df_results)

# Summary by status
print("\nStatus Summary:")
for status in ["PASS", "WARN", "FAIL"]:
    count = sum(1 for r in results if r.status == status)
    names = [r.name + "/" + r.horizon for r in results if r.status == status]
    print("  %s (%d): %s" % (status, count, ", ".join(names) if names else "none"))

## 5a. Extended Metrics Table

Regression accuracy, directional metrics, calibration, and market-level predictions.
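
Two of the metrics in this table are easy to state precisely. The sketch below assumes the usual definitions: hit ratio is the share of predictions whose sign matches the realized return, and calibration slope is the OLS slope from regressing realized returns on predictions. The function names and sample values are hypothetical, and the repository's `ExtendedMetrics` may compute them differently.

```python
def hit_ratio(preds, actuals):
    """Share of observations where prediction and outcome have the same sign."""
    hits = sum(1 for p, a in zip(preds, actuals) if (p > 0) == (a > 0))
    return hits / len(preds)

def calibration_slope(preds, actuals):
    """OLS slope of actual on predicted; 1.0 means magnitudes match on average."""
    n = len(preds)
    mp, ma = sum(preds) / n, sum(actuals) / n
    cov = sum((p - mp) * (a - ma) for p, a in zip(preds, actuals))
    var = sum((p - mp) ** 2 for p in preds)
    return cov / var

# Toy predictions and realized returns (hypothetical values)
preds = [0.02, -0.01, 0.03, -0.02]
actuals = [0.01, -0.005, 0.015, 0.01]
hit = hit_ratio(preds, actuals)
slope = calibration_slope(preds, actuals)
print(hit, slope)
```

A slope well below 1 suggests the predictions are more spread out than the realized returns justify, while a slope near 0 means the predicted magnitudes carry almost no information.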

In [None]:
# Extended Metrics: Regression, Directional, Calibration, Market-level
ext_rows = []
for r in results:
    if r.extended is None:
        continue
    e = r.extended
    ext_rows.append({
        "Strategy": r.name,
        "Horizon": r.horizon,
        "RMSE": round(e.rmse, 6),
        "R2": round(e.r_squared, 4),
        "Hit%": round(e.hit_ratio * 100, 1),
        "Prec": round(e.precision, 3),
        "Recall": round(e.recall, 3),
        "F1": round(e.f1, 3),
        "Cal.Slope": round(e.calib_slope, 3),
        "Mkt.R2": round(e.market_r2, 4),
        "Mkt.Dir%": round(e.market_direction_accuracy * 100, 1),
    })

if ext_rows:
    df_ext = pd.DataFrame(ext_rows)
    print("Extended Metrics:")
    print("=" * 120)
    display(df_ext)
else:
    print("No extended metrics available.")

In [None]:
# Per-Stock R2 Summary
r2_rows = []
for r in results:
    if r.extended is None:
        continue
    e = r.extended
    r2_rows.append({
        "Strategy": r.name,
        "Horizon": r.horizon,
        "Mean R2": round(e.mean_stock_r2, 4),
        "Median R2": round(e.median_stock_r2, 4),
        "% Positive": round(e.pct_r2_positive, 1),
        "% > 0.05": round(e.pct_r2_above_005, 1),
        "N Stocks": len(e.stock_r2_values),
    })

if r2_rows:
    df_r2 = pd.DataFrame(r2_rows)
    print("Per-Stock R2 Summary:")
    print("=" * 90)
    display(df_r2)
else:
    print("No per-stock R2 data available.")

In [None]:
# Backtest Results: Long-Only and Long-Short
bt_rows = []
for r in results:
    if r.extended is None:
        continue
    e = r.extended
    bt_rows.append({
        "Strategy": r.name,
        "Horizon": r.horizon,
        "L/O CAGR": "%.2f%%" % (e.lo_cagr * 100),
        "L/O Sharpe": round(e.lo_sharpe, 2),
        "L/O MaxDD": "%.2f%%" % (e.lo_max_dd * 100),
        "L/S CAGR": "%.2f%%" % (e.ls_cagr * 100),
        "L/S Sharpe": round(e.ls_sharpe, 2),
        "L/S MaxDD": "%.2f%%" % (e.ls_max_dd * 100),
    })

if bt_rows:
    df_bt = pd.DataFrame(bt_rows)
    print("Backtest Results (Long-Only and Long-Short 20/20):")
    print("=" * 100)
    display(df_bt)
else:
    print("No backtest data available.")

In [None]:
# Diebold-Mariano Test Matrix
dm_results = BenchmarkEvaluator.compute_dm_tests(results)

if dm_results:
    # Display as table
    dm_rows = []
    for dm in dm_results:
        dm_rows.append({
            "A": dm.strategy_a.split("_")[0],
            "B": dm.strategy_b.split("_")[0],
            "Horizon": dm.horizon,
            "DM Stat": round(dm.dm_stat, 2),
            "p-value": round(dm.p_value, 4),
            "Better": dm.better.split("_")[0],
            "Sig?": "*" if dm.p_value < 0.05 else ("." if dm.p_value < 0.10 else ""),
        })
    df_dm = pd.DataFrame(dm_rows)
    print("Diebold-Mariano Tests (* = p<0.05, . = p<0.10):")
    print("=" * 80)
    display(df_dm)

    # Heatmap per horizon
    import matplotlib.pyplot as plt
    horizons = sorted(set(dm.horizon for dm in dm_results))
    fig_dm, axes_dm = plt.subplots(1, len(horizons), figsize=(8 * len(horizons), 6))
    if len(horizons) == 1:
        axes_dm = [axes_dm]

    for ax, h in zip(axes_dm, horizons):
        h_dm = [dm for dm in dm_results if dm.horizon == h]
        names = sorted(set(
            [dm.strategy_a.split("_")[0] for dm in h_dm] +
            [dm.strategy_b.split("_")[0] for dm in h_dm]
        ))
        n = len(names)
        pval_matrix = np.ones((n, n))
        name_to_idx = {nm: i for i, nm in enumerate(names)}

        for dm in h_dm:
            ia = name_to_idx.get(dm.strategy_a.split("_")[0])
            ib = name_to_idx.get(dm.strategy_b.split("_")[0])
            if ia is not None and ib is not None:
                pval_matrix[ia, ib] = dm.p_value
                pval_matrix[ib, ia] = dm.p_value

        im = ax.imshow(pval_matrix, cmap="RdYlGn", vmin=0, vmax=0.2)
        ax.set_xticks(range(n))
        ax.set_xticklabels(names, rotation=45, fontsize=7, ha="right")
        ax.set_yticks(range(n))
        ax.set_yticklabels(names, fontsize=7)
        ax.set_title("DM p-values: %s" % h, fontweight="bold")

        for i in range(n):
            for j in range(n):
                if i != j:
                    ax.text(j, i, "%.2f" % pval_matrix[i, j],
                            ha="center", va="center", fontsize=6,
                            color="white" if pval_matrix[i, j] < 0.05 else "black")

    plt.colorbar(im, ax=axes_dm[-1], shrink=0.7, label="p-value")
    plt.tight_layout()
    plt.savefig(os.path.join(ARTIFACT_DIR, "dm_test_heatmap.png"), dpi=150, bbox_inches="tight")
    plt.show()
    print("DM test heatmap saved.")
else:
    print("No DM test results (need fold predictions from multiple strategies).")
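
For reference, a simplified Diebold-Mariano statistic can be written in a few lines. This sketch assumes squared-error loss and ignores autocorrelation in the loss differential, so it omits the HAC variance correction that overlapping multi-day horizons would normally call for. It is not the code behind `BenchmarkEvaluator.compute_dm_tests`, and the function name and error series are hypothetical.

```python
import math

def dm_test(errors_a, errors_b):
    """Simplified DM test on squared-error loss (no HAC correction)."""
    # loss differential: negative values mean forecast A had the smaller loss
    d = [ea ** 2 - eb ** 2 for ea, eb in zip(errors_a, errors_b)]
    n = len(d)
    d_bar = sum(d) / n
    var_d = sum((x - d_bar) ** 2 for x in d) / (n - 1)
    dm_stat = d_bar / math.sqrt(var_d / n)
    # two-sided p-value from the standard normal distribution
    cdf = 0.5 * (1.0 + math.erf(abs(dm_stat) / math.sqrt(2.0)))
    p_value = 2.0 * (1.0 - cdf)
    return dm_stat, p_value

# Toy forecast errors (hypothetical): A is consistently closer than B
errors_a = [0.010, -0.020, 0.015, -0.012, 0.018, -0.011]
errors_b = [0.030, -0.035, 0.028, -0.031, 0.040, -0.029]
stat, p = dm_test(errors_a, errors_b)
print(stat, p)
```

With this sign convention, a negative statistic favors the first forecast, and a small p-value indicates the difference in accuracy is unlikely to be noise.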

## 6. Visualization Dashboard

In [None]:
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec

fig = plt.figure(figsize=(20, 16))
gs_layout = gridspec.GridSpec(2, 3, hspace=0.35, wspace=0.3)

# --- Panel 1: IC Comparison Bar Chart ---
ax1 = fig.add_subplot(gs_layout[0, 0])
labels = ["%s\n%s" % (r.name.split("_", 1)[-1][:12], r.horizon) for r in results]
ics = [r.ic_mean for r in results]
colors = ["#4CAF50" if r.status == "PASS" else "#FFC107" if r.status == "WARN" else "#F44336"
          for r in results]
ax1.barh(range(len(results)), ics, color=colors, edgecolor="white")
ax1.set_yticks(range(len(results)))
ax1.set_yticklabels(labels, fontsize=7)
ax1.set_xlabel("Cross-Sectional IC")
ax1.set_title("IC Comparison", fontweight="bold")
ax1.axvline(x=0, color="gray", linewidth=0.5)
ax1.invert_yaxis()

# --- Panel 2: IC Stability (fold ICs per strategy) ---
ax2 = fig.add_subplot(gs_layout[0, 1])
cmap = plt.colormaps["tab20"]
for i, r in enumerate(results):
    if r.fold_metrics:
        fold_ics = [f.ic for f in r.fold_metrics]
        x_positions = [i] * len(fold_ics)
        ax2.scatter(x_positions, fold_ics, color=cmap(i % 20), s=30, zorder=3, alpha=0.7)
        ax2.plot([i, i], [min(fold_ics), max(fold_ics)],
                 color=cmap(i % 20), linewidth=2, alpha=0.5)
ax2.set_xticks(range(len(results)))
ax2.set_xticklabels([r.name.split("_")[0] + "/" + r.horizon for r in results],
                     rotation=90, fontsize=6, ha="center")
ax2.axhline(y=0, color="gray", linewidth=0.5, linestyle="--")
ax2.set_ylabel("IC per fold")
ax2.set_title("IC Stability (per fold)", fontweight="bold")
ax2.grid(axis="y", alpha=0.3)

# --- Panel 3: Composite Score Bar ---
ax3 = fig.add_subplot(gs_layout[0, 2])
# np.isfinite also guards against NaN and avoids relying on a separate `math` import.
composites = [r.composite if np.isfinite(r.composite) else 0 for r in results]
ax3.barh(range(len(results)), composites, color=colors, edgecolor="white")
ax3.set_yticks(range(len(results)))
ax3.set_yticklabels(labels, fontsize=7)
ax3.set_xlabel("Composite Score")
ax3.set_title("Composite Score", fontweight="bold")
ax3.axvline(x=0, color="gray", linewidth=0.5)
ax3.invert_yaxis()

# --- Panel 4: Gate Activation ---
ax4 = fig.add_subplot(gs_layout[1, 0])
gate_data = [(r.name, r.horizon, r.gate_stats) for r in results if r.gate_stats]
if gate_data:
    g_labels = ["%s/%s" % (n, h) for n, h, _ in gate_data]
    g_means = []
    g_stds = []
    for _, _, gstat in gate_data:
        if isinstance(gstat, dict):
            g_means.append(gstat.get("gate_mean", 0))
            g_stds.append(gstat.get("gate_std", 0))
        else:
            g_means.append(0.5)
            g_stds.append(0)
    ax4.barh(g_labels, g_means, xerr=g_stds, color="#2196F3", capsize=5)
    ax4.set_xlabel("Gate Activation")
    ax4.set_title("Gate Distribution (Hybrid)", fontweight="bold")
    ax4.set_xlim(0, 1)
else:
    ax4.text(0.5, 0.5, "No gate stats\navailable",
             transform=ax4.transAxes, ha="center", va="center", fontsize=14, color="gray")
    ax4.set_title("Gate Distribution", fontweight="bold")

# --- Panel 5: Ensemble Weights ---
ax5 = fig.add_subplot(gs_layout[1, 1])
ens_data = [(r.name, r.horizon, r.ensemble_weights) for r in results if r.ensemble_weights]
if ens_data:
    for i, (name, horizon, w) in enumerate(ens_data):
        x = [0, 1]
        vals = [w.get("lstm", 0), w.get("nlp", 0)]
        ax5.bar([v + i * 0.3 for v in x], vals, width=0.25,
                label="%s/%s" % (name.split("_")[0], horizon))
    ax5.set_xticks([0, 1])
    ax5.set_xticklabels(["LSTM weight", "NLP weight"])
    ax5.set_title("Ensemble Weights", fontweight="bold")
    ax5.legend(fontsize=8)
    ax5.axhline(y=0, color="gray", linewidth=0.5, linestyle="--")
else:
    ax5.text(0.5, 0.5, "No ensemble weights\navailable",
             transform=ax5.transAxes, ha="center", va="center", fontsize=14, color="gray")
    ax5.set_title("Ensemble Weights", fontweight="bold")

# --- Panel 6: Horizon Sensitivity ---
ax6 = fig.add_subplot(gs_layout[1, 2])
strategy_names = sorted(set(r.name for r in results))
horizon_labels = sorted(set(r.horizon for r in results))
x_pos = np.arange(len(strategy_names))
width = 0.8 / max(len(horizon_labels), 1)

for j, h in enumerate(horizon_labels):
    h_ics = []
    for sn in strategy_names:
        match = [r for r in results if r.name == sn and r.horizon == h]
        h_ics.append(match[0].ic_mean if match else 0)
    offset = (j - len(horizon_labels) / 2 + 0.5) * width
    ax6.bar(x_pos + offset, h_ics, width, label=h)

ax6.set_xticks(x_pos)
ax6.set_xticklabels([n.split("_")[0] for n in strategy_names],
                     rotation=45, fontsize=7, ha="right")
ax6.set_ylabel("IC")
h_str = " vs ".join(horizon_labels)
ax6.set_title("Horizon Sensitivity (%s)" % h_str, fontweight="bold")
ax6.legend(fontsize=8)
ax6.axhline(y=0, color="gray", linewidth=0.5, linestyle="--")
ax6.grid(axis="y", alpha=0.3)

fig.suptitle("Multi-Strategy Benchmark Dashboard (15 Models)", fontsize=16, fontweight="bold", y=1.01)
plt.savefig(os.path.join(ARTIFACT_DIR, "strategy_benchmark.png"), dpi=150, bbox_inches="tight")
plt.show()
print("Dashboard saved.")
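
The IC panels above summarize how well predictions rank stocks against realized returns. As a rough illustration of how such a number can be produced, the sketch below assumes that IC is a Spearman-style rank correlation computed within each date and that ICIR is the mean IC divided by its standard deviation. The benchmark's own implementation is not shown in this notebook and may differ (for example in tie handling or fold aggregation); the function names and the `(n_dates, n_stocks)` array layout are hypothetical.

```python
import numpy as np


def _ranks(x):
    """Ordinal ranks 0..n-1; ties are broken by position, not averaged."""
    return np.argsort(np.argsort(x)).astype(float)


def cross_sectional_ic(preds, actuals, min_names=3):
    """Per-date rank correlation for arrays shaped (n_dates, n_stocks)."""
    preds = np.asarray(preds, dtype=float)
    actuals = np.asarray(actuals, dtype=float)
    ics = []
    for p_row, a_row in zip(preds, actuals):
        ok = ~(np.isnan(p_row) | np.isnan(a_row))
        if ok.sum() < min_names:
            continue  # too few names for a meaningful correlation
        ics.append(np.corrcoef(_ranks(p_row[ok]), _ranks(a_row[ok]))[0, 1])
    return np.array(ics)


def ic_summary(ics):
    """Return (mean IC, IC std, ICIR) for a 1-D array of per-date ICs."""
    ics = np.asarray(ics, dtype=float)
    if ics.size < 2:
        return (float(ics.mean()) if ics.size else 0.0), 0.0, 0.0
    mean, std = float(ics.mean()), float(ics.std(ddof=1))
    icir = mean / std if std > 1e-12 else 0.0
    return mean, std, icir
```

A wide spread of per-fold ICs around a small mean shows up as a low ICIR, which is what the stability panel is meant to make visible.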

## 6a. Extended Visualizations

Five additional figures: per-stock R2 histogram, predicted-vs-actual scatter, calibration slope, rolling R2, and cumulative PnL.

In [None]:
# --- Visualization 1: Per-Stock R2 Histogram ---
ext_results = [r for r in results if r.extended and r.extended.stock_r2_values]
n_ext = len(ext_results)

if n_ext > 0:
    ncols = min(5, n_ext)
    nrows = (n_ext + ncols - 1) // ncols
    # squeeze=False always returns a 2-D array of Axes, so no special case for a single plot.
    fig_r2h, axes_r2h = plt.subplots(nrows, ncols, figsize=(4 * ncols, 3 * nrows),
                                     squeeze=False)
    axes_flat = axes_r2h.ravel()

    for idx, r in enumerate(ext_results):
        if idx >= len(axes_flat):
            break
        ax = axes_flat[idx]
        vals = list(r.extended.stock_r2_values.values())
        ax.hist(vals, bins=20, color="#2196F3", edgecolor="white", alpha=0.8)
        ax.axvline(x=0, color="red", linewidth=1, linestyle="--")
        ax.axvline(x=np.median(vals), color="green", linewidth=1, linestyle="-",
                   label="median=%.3f" % np.median(vals))
        ax.set_title("%s/%s" % (r.name.split("_")[0], r.horizon), fontsize=9)
        ax.set_xlabel("R2", fontsize=8)
        ax.legend(fontsize=6)

    for idx in range(n_ext, len(axes_flat)):
        axes_flat[idx].set_visible(False)

    fig_r2h.suptitle("Per-Stock R2 Distribution", fontsize=14, fontweight="bold")
    plt.tight_layout()
    plt.savefig(os.path.join(ARTIFACT_DIR, "per_stock_r2_histogram.png"),
                dpi=150, bbox_inches="tight")
    plt.show()
    print("Per-stock R2 histogram saved.")
else:
    print("No per-stock R2 data for visualization.")
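
The histogram above reads `stock_r2_values`, a mapping from ticker to R2. How that mapping is built is not shown in this notebook; the following is a minimal sketch, assuming R2 is computed as `1 - SS_res / SS_tot` on each ticker's pooled out-of-sample predictions. The function name, the `min_obs` cutoff, and the flat array inputs are hypothetical.

```python
import numpy as np


def per_stock_r2(tickers, preds, actuals, min_obs=5):
    """Out-of-sample R2 per ticker, skipping tickers with too few or constant targets."""
    tickers = np.asarray(tickers)
    preds = np.asarray(preds, dtype=float)
    actuals = np.asarray(actuals, dtype=float)
    valid = ~(np.isnan(preds) | np.isnan(actuals))
    out = {}
    for t in np.unique(tickers):
        m = (tickers == t) & valid
        if m.sum() < min_obs:
            continue
        p, a = preds[m], actuals[m]
        ss_tot = np.sum((a - a.mean()) ** 2)
        if ss_tot <= 1e-10:
            continue  # constant target: R2 is undefined
        out[str(t)] = float(1.0 - np.sum((p - a) ** 2) / ss_tot)
    return out
```

Under this definition R2 is negative whenever the predictions do worse than the stock's own mean return, which is why the red line at zero is the natural reference in each histogram.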

In [None]:
# --- Visualization 2: Pred vs Actual Scatter ---
scatter_results = [r for r in results if r.fold_predictions]

if scatter_results:
    n_sc = len(scatter_results)
    ncols = min(5, n_sc)
    nrows = (n_sc + ncols - 1) // ncols
    # squeeze=False always returns a 2-D array of Axes, so no special case for a single plot.
    fig_sc, axes_sc = plt.subplots(nrows, ncols, figsize=(4 * ncols, 4 * nrows),
                                   squeeze=False)
    axes_flat = axes_sc.ravel()

    for idx, r in enumerate(scatter_results):
        if idx >= len(axes_flat):
            break
        ax = axes_flat[idx]
        all_p = np.concatenate([fp.predictions for fp in r.fold_predictions])
        all_a = np.concatenate([fp.actuals for fp in r.fold_predictions])
        valid = ~(np.isnan(all_p) | np.isnan(all_a))
        p, a = all_p[valid], all_a[valid]

        if len(p) > 5000:
            # Fixed seed so the subsampled scatter is reproducible between runs.
            idx_sub = np.random.default_rng(0).choice(len(p), 5000, replace=False)
            p_plot, a_plot = p[idx_sub], a[idx_sub]
        else:
            p_plot, a_plot = p, a

        ax.scatter(p_plot, a_plot, alpha=0.1, s=3, color="#2196F3")

        if len(p) > 10:
            coef = np.polyfit(p, a, 1)
            x_line = np.linspace(p.min(), p.max(), 50)
            ax.plot(x_line, coef[0] * x_line + coef[1], "r-", linewidth=1.5,
                    label="slope=%.2f" % coef[0])

        ax.set_title("%s/%s" % (r.name.split("_")[0], r.horizon), fontsize=9)
        ax.set_xlabel("Predicted", fontsize=8)
        ax.set_ylabel("Actual", fontsize=8)
        ax.legend(fontsize=7)

    for idx in range(n_sc, len(axes_flat)):
        axes_flat[idx].set_visible(False)

    fig_sc.suptitle("Prediction vs Actual", fontsize=14, fontweight="bold")
    plt.tight_layout()
    plt.savefig(os.path.join(ARTIFACT_DIR, "pred_vs_actual_scatter.png"),
                dpi=150, bbox_inches="tight")
    plt.show()
    print("Pred vs actual scatter saved.")
else:
    print("No fold predictions for scatter plot.")

In [None]:
# --- Visualization 3: Calibration Slope Bar Chart ---
cal_results = [r for r in results if r.extended is not None]

if cal_results:
    fig_cal, ax_cal = plt.subplots(figsize=(10, max(4, len(cal_results) * 0.4)))
    labels_cal = ["%s/%s" % (r.name.split("_")[0], r.horizon) for r in cal_results]
    slopes = [r.extended.calib_slope for r in cal_results]
    colors_cal = ["#4CAF50" if abs(s - 1.0) < 0.3 else "#FFC107" if abs(s - 1.0) < 0.6
                  else "#F44336" for s in slopes]

    y_pos = range(len(cal_results))
    ax_cal.barh(y_pos, slopes, color=colors_cal, edgecolor="white")
    ax_cal.set_yticks(y_pos)
    ax_cal.set_yticklabels(labels_cal, fontsize=8)
    ax_cal.axvline(x=1.0, color="blue", linewidth=2, linestyle="--", label="Ideal (slope=1.0)")
    ax_cal.set_xlabel("Calibration Slope")
    ax_cal.set_title("Calibration: Pred vs Actual Regression Slope", fontweight="bold")
    ax_cal.legend(fontsize=9)
    ax_cal.invert_yaxis()
    plt.tight_layout()
    plt.savefig(os.path.join(ARTIFACT_DIR, "calibration_slope.png"), dpi=150, bbox_inches="tight")
    plt.show()
    print("Calibration slope chart saved.")
else:
    print("No calibration data for visualization.")
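
The chart above colors each bar by how far the calibration slope is from 1.0. As an illustration of what that slope measures, here is a small sketch that assumes it is the OLS slope from regressing actual returns on predictions, the same quantity the scatter plots obtain with `np.polyfit`. The function names are hypothetical, and the benchmark's own `calib_slope` may be computed differently.

```python
import numpy as np


def calibration_slope(preds, actuals):
    """OLS slope of actual on predicted: cov(p, a) / var(p)."""
    p = np.asarray(preds, dtype=float)
    a = np.asarray(actuals, dtype=float)
    ok = ~(np.isnan(p) | np.isnan(a))
    p, a = p[ok], a[ok]
    if p.size < 2:
        return float("nan")
    var_p = np.var(p)
    if var_p <= 1e-12:
        return float("nan")  # constant predictions: slope is undefined
    return float(np.mean((p - p.mean()) * (a - a.mean())) / var_p)


def slope_band(slope):
    """Color band using the chart's thresholds: within 0.3 or 0.6 of the ideal 1.0."""
    dist = abs(slope - 1.0)
    return "green" if dist < 0.3 else "amber" if dist < 0.6 else "red"
```

A slope below 1 means the predictions are more spread out than the realized relationship supports, so their magnitudes should be shrunk before sizing positions; a slope above 1 means they are too conservative.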

In [None]:
# --- Visualization 4: Rolling R2 (126-sample window) ---
rolling_results = [r for r in results if r.fold_predictions]

if rolling_results:
    fig_roll, ax_roll = plt.subplots(figsize=(14, 6))
    window = 126

    for r in rolling_results:
        all_p = np.concatenate([fp.predictions for fp in r.fold_predictions])
        all_a = np.concatenate([fp.actuals for fp in r.fold_predictions])
        valid = ~(np.isnan(all_p) | np.isnan(all_a))
        p, a = all_p[valid], all_a[valid]

        if len(p) < window + 10:
            continue

        rolling_r2 = []
        # +1 so the final full window (ending at the last sample) is included.
        for i in range(window, len(p) + 1):
            p_w = p[i - window:i]
            a_w = a[i - window:i]
            ss_res = np.sum((p_w - a_w) ** 2)
            ss_tot = np.sum((a_w - np.mean(a_w)) ** 2)
            r2 = 1.0 - ss_res / ss_tot if ss_tot > 1e-10 else 0.0
            rolling_r2.append(r2)

        label = "%s/%s" % (r.name.split("_")[0], r.horizon)
        ax_roll.plot(rolling_r2, label=label, alpha=0.7, linewidth=1)

    ax_roll.axhline(y=0, color="gray", linewidth=0.5, linestyle="--")
    ax_roll.set_xlabel("Sample Index")
    ax_roll.set_ylabel("Rolling R2 (window=%d)" % window)
    ax_roll.set_title("Rolling R2 Across Walk-Forward Test Periods", fontweight="bold")
    ax_roll.legend(fontsize=7, ncol=3, loc="upper right")
    ax_roll.grid(alpha=0.3)
    plt.tight_layout()
    plt.savefig(os.path.join(ARTIFACT_DIR, "rolling_r2.png"), dpi=150, bbox_inches="tight")
    plt.show()
    print("Rolling R2 chart saved.")
else:
    print("No fold predictions for rolling R2.")
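
The Python loop in the cell above recomputes each window from scratch, which can be slow when a strategy has many pooled predictions. A vectorized sketch of the same rolling R2 follows, assuming NumPy 1.20 or newer for `sliding_window_view`; the function name and the `eps` parameter are hypothetical. As in the cell above, the window counts pooled samples across folds and stocks, not trading days.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def rolling_r2(preds, actuals, window=126, eps=1e-10):
    """R2 over every full window; returns len(preds) - window + 1 values."""
    p = np.asarray(preds, dtype=float)
    a = np.asarray(actuals, dtype=float)
    if p.size < window:
        return np.empty(0)
    p_w = sliding_window_view(p, window)  # shape (n - window + 1, window), a view
    a_w = sliding_window_view(a, window)
    ss_res = np.sum((p_w - a_w) ** 2, axis=1)
    ss_tot = np.sum((a_w - a_w.mean(axis=1, keepdims=True)) ** 2, axis=1)
    r2 = np.zeros(ss_res.shape)           # windows with ~constant targets stay at 0.0
    safe = ss_tot > eps
    r2[safe] = 1.0 - ss_res[safe] / ss_tot[safe]
    return r2
```

The windowed inputs are views, but the intermediate differences are materialized as an `(n - window + 1, window)` array, so memory grows with both the sample count and the window length.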

In [None]:
# --- Visualization 5: Cumulative PnL Curves ---
pnl_results = [r for r in results
               if r.extended is not None
               and r.extended.lo_equity is not None
               and len(r.extended.lo_equity) > 1]

if pnl_results:
    fig_pnl, (ax_lo, ax_ls) = plt.subplots(1, 2, figsize=(16, 6))

    for r in pnl_results:
        label = "%s/%s" % (r.name.split("_")[0], r.horizon)
        ax_lo.plot(r.extended.lo_equity, label=label, alpha=0.7, linewidth=1)
        if r.extended.ls_equity is not None and len(r.extended.ls_equity) > 1:
            ax_ls.plot(r.extended.ls_equity, label=label, alpha=0.7, linewidth=1)

    ax_lo.axhline(y=1.0, color="gray", linewidth=0.5, linestyle="--")
    ax_lo.set_xlabel("Trading Day")
    ax_lo.set_ylabel("Equity")
    ax_lo.set_title("Long-Only Cumulative PnL", fontweight="bold")
    ax_lo.legend(fontsize=6, ncol=2)
    ax_lo.grid(alpha=0.3)

    ax_ls.axhline(y=1.0, color="gray", linewidth=0.5, linestyle="--")
    ax_ls.set_xlabel("Trading Day")
    ax_ls.set_ylabel("Equity")
    ax_ls.set_title("Long-Short (20/20) Cumulative PnL", fontweight="bold")
    ax_ls.legend(fontsize=6, ncol=2)
    ax_ls.grid(alpha=0.3)

    plt.tight_layout()
    plt.savefig(os.path.join(ARTIFACT_DIR, "cumulative_pnl.png"), dpi=150, bbox_inches="tight")
    plt.show()
    print("Cumulative PnL curves saved.")
else:
    print("No equity curve data for PnL visualization.")
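
The equity curves plotted above come precomputed in `r.extended.lo_equity` and `r.extended.ls_equity`. The notebook does not define "20/20"; the sketch below assumes it means going long the top 20% and short the bottom 20% of stocks by predicted return, equal-weighted, over non-overlapping periods and without transaction costs. All names here are hypothetical and this is not the benchmark's backtest.

```python
import numpy as np


def long_short_equity(preds, rets, frac=0.2):
    """Equity curve for a top-minus-bottom portfolio; inputs shaped (n_dates, n_stocks)."""
    preds = np.asarray(preds, dtype=float)
    rets = np.asarray(rets, dtype=float)
    n_dates, n_stocks = preds.shape
    k = max(1, int(n_stocks * frac))           # names per leg
    order = np.argsort(preds, axis=1)          # ascending by predicted return
    rows = np.arange(n_dates)[:, None]
    long_ret = rets[rows, order[:, -k:]].mean(axis=1)
    short_ret = rets[rows, order[:, :k]].mean(axis=1)
    return np.cumprod(1.0 + (long_ret - short_ret))


def max_drawdown(equity):
    """Largest peak-to-trough loss as a fraction of the running peak."""
    equity = np.asarray(equity, dtype=float)
    peak = np.maximum.accumulate(equity)
    return float(np.max(1.0 - equity / peak))
```

With 21-day horizons, rebalancing daily on overlapping forecasts would need position averaging or a holding-period adjustment that this sketch leaves out.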

## 7. Diagnostic Output

In [None]:
# Save full benchmark results to JSON
save_path = save_benchmark_results(
    results,
    path=os.path.join(ARTIFACT_DIR, "strategy_benchmark_results.json"),
)
print("Results saved to: %s" % save_path)

# Print per-strategy detail
print("\n" + "=" * 70)
print("DETAILED RESULTS")
print("=" * 70)
for r in results:
    print("\n--- %s / %s [%s] ---" % (r.name, r.horizon, r.status))
    print("  IC: %.4f +/- %.4f  ICIR: %.2f" % (r.ic_mean, r.ic_std, r.icir))
    print("  Sharpe: %.2f  MaxDD: %.4f" % (r.sharpe, r.max_drawdown))
    print("  Overfit: %.3f  Composite: %.4f" % (r.overfit_score, r.composite))
    print("  Prod IC: %.4f  Params: %d  Time: %.1fs" % (
        r.prod_ic, r.param_count, r.train_time))
    if r.fold_metrics:
        fold_ics = [f.ic for f in r.fold_metrics]
        print("  Fold ICs: %s" % [round(x, 4) for x in fold_ics])
    if r.extended:
        e = r.extended
        print("  R2: %.4f  Hit%%: %.1f%%  Cal.Slope: %.3f" % (
            e.r_squared, e.hit_ratio * 100, e.calib_slope))
        print("  L/O Sharpe: %.2f  L/S Sharpe: %.2f" % (e.lo_sharpe, e.ls_sharpe))
    if r.gate_stats:
        print("  Gate: mean=%.3f std=%.3f" % (
            r.gate_stats.get("gate_mean", 0), r.gate_stats.get("gate_std", 0)))
    if r.ensemble_weights:
        w = r.ensemble_weights
        print("  Ensemble: lstm=%.3f nlp=%.3f intercept=%.4f" % (
            w.get("lstm", 0), w.get("nlp", 0), w.get("intercept", 0)))

## 8. Experimental Integrity Checks

In [None]:
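# Named `integrity_warnings` to avoid shadowing the standard-library `warnings` module.
integrity_warnings = run_integrity_checks(results)

if integrity_warnings:
    print("INTEGRITY WARNINGS (%d):" % len(integrity_warnings))
    for w in integrity_warnings:
        print("  [!] %s" % w)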
else:
    print("All integrity checks passed.")

# Additional checks
print("\nConsistency Checks:")

for h in sorted(set(r.horizon for r in results)):
    fold_counts = [len(r.fold_metrics) for r in results if r.horizon == h]
    if len(set(fold_counts)) > 1:
        print("  [!] Inconsistent fold counts at %s: %s" % (h, fold_counts))
    else:
        print("  [OK] %s: %d folds for all strategies" % (h, fold_counts[0] if fold_counts else 0))

for r in results:
    if r.param_count > CONFIG["max_params"]:
        print("  [!] %s: %d params > %d limit" % (r.name, r.param_count, CONFIG["max_params"]))

print("\nIntegrity check complete.")

## Summary & Decision

This benchmark answers:

1. **Does sentiment add incremental alpha?** Compare A vs C/E
2. **Is hybrid destructive or additive?** Compare A vs E/E1/E2
3. **Is ensemble safer than fusion?** Compare C vs E
4. **Cross-attention vs additive fusion?** Compare E1 vs E2
5. **Transformer vs LSTM?** Compare G vs A
6. **Trees vs neural networks?** Compare H/I vs A
7. **Which structure survives production retrain?** Check Prod IC column
8. **Statistical significance?** DM test p-values
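
For question 8, the idea behind a Diebold-Mariano (DM) comparison can be sketched as follows. This is a simplified version under stated assumptions: squared-error loss, a one-step horizon, and a normal approximation with no autocorrelation correction. With overlapping multi-day targets such as the 21D horizon, a HAC (Newey-West style) variance would be needed, so p-values from this sketch would be too optimistic there. The benchmark's own DM routine is not shown in this notebook, and `dm_test_sketch` is a hypothetical name.

```python
import math

import numpy as np


def dm_test_sketch(actuals, preds_a, preds_b):
    """Two-sided DM test on squared errors; positive stat means B has lower loss than A."""
    y = np.asarray(actuals, dtype=float)
    err_a = y - np.asarray(preds_a, dtype=float)
    err_b = y - np.asarray(preds_b, dtype=float)
    d = err_a ** 2 - err_b ** 2              # loss differential per sample
    n = d.size
    if n < 2:
        raise ValueError("need at least two samples")
    se = math.sqrt(d.var(ddof=1) / n)        # naive standard error (no HAC correction)
    if se == 0.0:
        return 0.0, 1.0                      # identical losses: no evidence either way
    stat = float(d.mean() / se)
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|stat|))
    return stat, p_value
```

A small p-value only says the two loss series differ; the sign of the statistic tells which strategy is better.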

In [None]:
# Extended decision summary
print("=" * 60)
print("PRICE PREDICTION SUMMARY")
print("=" * 60)

# Best by extended metrics
ext_results_all = [r for r in results if r.extended is not None]
if ext_results_all:
    best_r2 = max(ext_results_all, key=lambda r: r.extended.mean_stock_r2)
    print("\nBest by Mean Stock R2: %s/%s (%.4f)" % (
        best_r2.name, best_r2.horizon, best_r2.extended.mean_stock_r2))

    best_hit = max(ext_results_all, key=lambda r: r.extended.hit_ratio)
    print("Best by Directional Accuracy: %s/%s (%.1f%%)" % (
        best_hit.name, best_hit.horizon, best_hit.extended.hit_ratio * 100))

    best_cal = min(ext_results_all, key=lambda r: abs(r.extended.calib_slope - 1.0))
    print("Best Calibration (slope closest to 1): %s/%s (%.3f)" % (
        best_cal.name, best_cal.horizon, best_cal.extended.calib_slope))

    best_lo = max(ext_results_all, key=lambda r: r.extended.lo_sharpe)
    print("Highest Sharpe (L/O): %s/%s (%.2f)" % (
        best_lo.name, best_lo.horizon, best_lo.extended.lo_sharpe))

    best_ls = max(ext_results_all, key=lambda r: r.extended.ls_sharpe)
    print("Highest Sharpe (L/S): %s/%s (%.2f)" % (
        best_ls.name, best_ls.horizon, best_ls.extended.ls_sharpe))

# DM significance vs baseline
if dm_results:
    baseline_name = "A_LSTM_Baseline"
    sig_improvements = [dm for dm in dm_results
                        if dm.p_value < 0.05
                        and (dm.strategy_a == baseline_name or dm.strategy_b == baseline_name)
                        and dm.better != baseline_name]
    if sig_improvements:
        print("\nDM test: significant improvements over baseline:")
        for dm in sig_improvements:
            print("  %s beats %s at %s (p=%.4f)" % (
                dm.better, baseline_name, dm.horizon, dm.p_value))
    else:
        print("\nDM test: no strategy significantly beats baseline (p<0.05)")

# Key comparisons
print("\n--- Key Comparisons ---")

# Sentiment alpha
a_21 = [r for r in results if r.name.startswith("A_") and r.horizon == "21D"]
c_21 = [r for r in results if r.name.startswith("C_") and r.horizon == "21D"]
e_21 = [r for r in results if r.name.startswith("E_") and r.horizon == "21D"]
if a_21 and c_21:
    delta = c_21[0].ic_mean - a_21[0].ic_mean
    verdict = "YES (+%.4f IC)" % delta if delta > 0.005 else "NO (delta=%.4f)" % delta
    print("1. Sentiment adds alpha? %s" % verdict)

# Hybrid vs baseline
if a_21 and e_21:
    delta = e_21[0].ic_mean - a_21[0].ic_mean
    verdict = "ADDITIVE (+%.4f)" % delta if delta > 0 else "DESTRUCTIVE (%.4f)" % delta
    print("2. Hybrid vs LSTM? %s" % verdict)

# Cross-attention vs additive
e1_21 = [r for r in results if r.name.startswith("E1") and r.horizon == "21D"]
e2_21 = [r for r in results if r.name.startswith("E2") and r.horizon == "21D"]
if e1_21 and e2_21:
    if e1_21[0].ic_mean > e2_21[0].ic_mean:
        print("3. Cross-attn vs Additive? CROSS-ATTN (IC %.4f vs %.4f)" % (
            e1_21[0].ic_mean, e2_21[0].ic_mean))
    else:
        print("3. Cross-attn vs Additive? ADDITIVE (IC %.4f vs %.4f)" % (
            e2_21[0].ic_mean, e1_21[0].ic_mean))

# Transformer vs LSTM
g_21 = [r for r in results if r.name.startswith("G_") and r.horizon == "21D"]
if a_21 and g_21:
    delta = g_21[0].ic_mean - a_21[0].ic_mean
    verdict = "TRANSFORMER (+%.4f)" % delta if delta > 0 else "LSTM (%.4f)" % delta
    print("4. Transformer vs LSTM? %s" % verdict)

# Trees vs neural
h_21 = [r for r in results if r.name.startswith("H_") and r.horizon == "21D"]
i_21 = [r for r in results if r.name.startswith("I_") and r.horizon == "21D"]
if a_21 and h_21 and i_21:
    best_tree = max([h_21[0], i_21[0]], key=lambda r: r.ic_mean)
    delta = best_tree.ic_mean - a_21[0].ic_mean
    verdict = "TREES (+%.4f)" % delta if delta > 0 else "NEURAL (%.4f)" % delta
    print("5. Trees vs Neural? %s (best tree: %s)" % (verdict, best_tree.name))

# Production survival
print("\n--- Production Survival ---")
for r in results:
    if r.ic_mean > 0 and r.prod_ic > 0:
        ratio = r.prod_ic / r.ic_mean if r.ic_mean > 1e-8 else 0
        survived = "SURVIVED" if ratio >= 0.9 else "DEGRADED"
        print("  %s/%s: WF IC=%.4f -> Prod IC=%.4f (%.0f%%) [%s]" % (
            r.name, r.horizon, r.ic_mean, r.prod_ic, ratio * 100, survived))

# Final recommendation
print("\n" + "=" * 60)
passing = [r for r in results if r.status == "PASS"]
if passing:
    # Select by composite score explicitly instead of relying on the order of `results`.
    best = max(passing, key=lambda r: r.composite)
    print("RECOMMENDATION: Deploy %s/%s" % (best.name, best.horizon))
    print("  Composite: %.4f  IC: %.4f  Prod IC: %.4f" % (
        best.composite, best.ic_mean, best.prod_ic))
    if best.extended:
        print("  R2: %.4f  Hit%%: %.1f%%  L/S Sharpe: %.2f" % (
            best.extended.r_squared, best.extended.hit_ratio * 100,
            best.extended.ls_sharpe))
else:
    warning_results = [r for r in results if r.status == "WARN"]
    if warning_results:
        best_warn = max(warning_results, key=lambda r: r.composite)
        print("RECOMMENDATION: Cautiously deploy %s/%s (WARN status)" % (
            best_warn.name, best_warn.horizon))
    else:
        print("RECOMMENDATION: No viable strategy. Review data/features.")
print("=" * 60)
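
The production-survival loop in the cell above only reports strategies whose walk-forward IC and production IC are both positive, so a strategy whose production IC turned negative is silently left out. A sketch of a classifier that labels those cases explicitly is shown below. The 0.9 threshold is taken from the cell above; the function name and the `LOST` and `NO_WF_SIGNAL` labels are hypothetical, and whether a production IC of exactly 0 means "not computed" in this benchmark is an assumption left open.

```python
def survival_label(wf_ic, prod_ic, threshold=0.9):
    """Return (label, ratio) describing how much walk-forward IC survives retraining."""
    if wf_ic <= 1e-8:
        return "NO_WF_SIGNAL", 0.0           # nothing to survive in the first place
    ratio = prod_ic / wf_ic
    if prod_ic <= 0:
        return "LOST", ratio                 # signal vanished or flipped sign
    return ("SURVIVED" if ratio >= threshold else "DEGRADED"), ratio


# Example with made-up IC values:
for wf, prod in [(0.05, 0.048), (0.05, 0.02), (0.05, -0.01)]:
    label, ratio = survival_label(wf, prod)
    print("WF IC=%.3f -> Prod IC=%.3f (%.0f%%) [%s]" % (wf, prod, ratio * 100, label))
```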

## 9. Per-Market Performance Analysis

Break down strategy performance by market (US vs KOSPI vs KOSDAQ) to identify
market-specific alpha and cross-market generalization.

In [None]:
# Per-market R2 breakdown from the per-stock R2 values in the extended metrics
def _classify_market(ticker):
    t = ticker.upper()
    if t.endswith(".KS"):
        return "KOSPI"
    elif t.endswith(".KQ"):
        return "KOSDAQ"
    return "US"

market_perf_rows = []
for r in results:
    if r.extended is None or not r.extended.stock_r2_values:
        continue
    # Group stock R2 by market
    market_r2s = {}
    for ticker, r2_val in r.extended.stock_r2_values.items():
        mkt = _classify_market(ticker)
        if mkt not in market_r2s:
            market_r2s[mkt] = []
        market_r2s[mkt].append(r2_val)

    for mkt in ["US", "KOSPI", "KOSDAQ"]:
        vals = market_r2s.get(mkt, [])
        if not vals:
            continue
        market_perf_rows.append({
            "Strategy": r.name.split("_")[0],
            "Horizon": r.horizon,
            "Market": mkt,
            "N Stocks": len(vals),
            "Mean R2": round(np.mean(vals), 4),
            "Median R2": round(np.median(vals), 4),
            "% Positive": round(sum(1 for v in vals if v > 0) / len(vals) * 100, 1),
        })

if market_perf_rows:
    df_mkt = pd.DataFrame(market_perf_rows)
    print("=== Per-Market R2 Breakdown ===")
    print("=" * 100)

    for mkt in ["US", "KOSPI", "KOSDAQ"]:
        sub = df_mkt[df_mkt["Market"] == mkt]
        if sub.empty:
            continue
        print("\n--- %s ---" % mkt)
        display(sub.drop(columns=["Market"]).reset_index(drop=True))

    # Best strategy per market
    print("\n=== Best Strategy per Market (by Mean R2) ===")
    for mkt in ["US", "KOSPI", "KOSDAQ"]:
        sub = df_mkt[df_mkt["Market"] == mkt]
        if sub.empty:
            print("  %s: No data" % mkt)
            continue
        best_idx = sub["Mean R2"].idxmax()
        best = sub.loc[best_idx]
        print("  %s: %s/%s (Mean R2=%.4f, %d stocks)" % (
            mkt, best["Strategy"], best["Horizon"],
            best["Mean R2"], best["N Stocks"]))
else:
    print("No per-market data (need extended metrics with stock R2 values).")

In [None]:
# Per-market R2 comparison visualization
if market_perf_rows:
    df_mkt = pd.DataFrame(market_perf_rows)
    markets_present = [m for m in ["US", "KOSPI", "KOSDAQ"] if m in df_mkt["Market"].values]
    n_markets = len(markets_present)

    if n_markets > 0:
        fig_mkt, axes_mkt = plt.subplots(1, n_markets, figsize=(7 * n_markets, 6))
        if n_markets == 1:
            axes_mkt = [axes_mkt]

        for ax, mkt in zip(axes_mkt, markets_present):
            sub = df_mkt[df_mkt["Market"] == mkt].copy()
            sub["Label"] = sub["Strategy"] + "/" + sub["Horizon"]
            sub = sub.sort_values("Mean R2", ascending=True)

            colors_mkt = ["#4CAF50" if v > 0 else "#F44336" for v in sub["Mean R2"]]
            ax.barh(sub["Label"], sub["Mean R2"], color=colors_mkt, edgecolor="white")
            ax.axvline(x=0, color="gray", linewidth=0.5, linestyle="--")
            ax.set_xlabel("Mean Stock R2")
            ax.set_title("%s (%d stocks)" % (mkt, sub["N Stocks"].iloc[0]), fontweight="bold")
            ax.tick_params(axis="y", labelsize=7)

        plt.suptitle("Per-Market Strategy Performance", fontsize=14, fontweight="bold")
        plt.tight_layout()
        plt.savefig(os.path.join(ARTIFACT_DIR, "per_market_performance.png"),
                    dpi=150, bbox_inches="tight")
        plt.show()
        print("Per-market performance chart saved.")

    # Cross-market generalization heatmap
    pivot = df_mkt.pivot_table(
        values="Mean R2", index=["Strategy", "Horizon"],
        columns="Market", aggfunc="first")
    if pivot.shape[1] >= 2:
        fig_gen, ax_gen = plt.subplots(figsize=(8, max(4, len(pivot) * 0.35)))
        im = ax_gen.imshow(pivot.values, cmap="RdYlGn", aspect="auto",
                           vmin=-0.05, vmax=0.1)
        ax_gen.set_xticks(range(pivot.shape[1]))
        ax_gen.set_xticklabels(pivot.columns, fontsize=10)
        ax_gen.set_yticks(range(pivot.shape[0]))
        ylabels = ["%s/%s" % (s, h) for s, h in pivot.index]
        ax_gen.set_yticklabels(ylabels, fontsize=7)
        ax_gen.set_title("Cross-Market Generalization (Mean Stock R2)", fontweight="bold")

        for i in range(pivot.shape[0]):
            for j in range(pivot.shape[1]):
                val = pivot.values[i, j]
                if not np.isnan(val):
                    ax_gen.text(j, i, "%.3f" % val, ha="center", va="center",
                                fontsize=7, color="white" if val > 0.05 else "black")

        plt.colorbar(im, ax=ax_gen, shrink=0.7, label="Mean R2")
        plt.tight_layout()
        plt.savefig(os.path.join(ARTIFACT_DIR, "cross_market_heatmap.png"),
                    dpi=150, bbox_inches="tight")
        plt.show()
        print("Cross-market heatmap saved.")