# 03 - ML Baselines (TimeSeriesSplit)

We compare several baselines on the target **score ∈ [-1,1]**.

Metrics:
- **MSE** (mean squared error)
- **Directional accuracy** (correct sign)
- **Information Coefficient** (Spearman rank correlation)
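
To make the three metrics concrete, here is a minimal sketch in plain Python. These functions are illustrative stand-ins and not the project's `metrics` module, whose implementation is not shown here. The treatment of zeros in the sign comparison and the use of average ranks for ties are assumptions.

```python
from math import sqrt


def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def directional_accuracy(y_true, y_pred):
    """Share of observations where prediction and target have the same sign.

    Assumption: an exact zero only matches another exact zero.
    """
    hits = sum(sign(t) == sign(p) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def information_coefficient(y_true, y_pred):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(y_true), average_ranks(y_pred))


# Toy values, chosen only to exercise the functions.
y_true = [0.4, -0.2, 0.1, -0.6, 0.3]
y_pred = [0.2, 0.1, 0.3, -0.4, 0.5]
print(mse(y_true, y_pred))
print(directional_accuracy(y_true, y_pred))
print(information_coefficient(y_true, y_pred))
```

MSE rewards predictions that are close in magnitude, while directional accuracy and the IC only look at sign and ordering, so the three can rank models differently.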


In [3]:
import sys
from pathlib import Path

ROOT = Path("..").resolve()
SRC = ROOT / "src"
if str(SRC) not in sys.path:
    sys.path.insert(0, str(SRC))

import numpy as np
import pandas as pd

from utils import get_logger
logger = get_logger("notebook", log_file=str(ROOT/"logs"/"run.log"))

from data import load_ohlc_from_xlsx
from features import build_features
from labels import add_target_20d_score, fit_score_scaler, apply_score
from split import time_series_splits
from models import get_baselines
from metrics import mse, directional_accuracy, information_coefficient

XLSX = str(ROOT / "dataset_train.xlsx")
df = load_ohlc_from_xlsx(XLSX, sheet_name="Gold")
df = add_target_20d_score(build_features(df), horizon=20)
df = df.dropna().reset_index(drop=True)

exclude = {"Date","Open","High","Low","Close","fut_ret_20","y_score"}
feature_cols = [c for c in df.columns if c not in exclude]

X = df[feature_cols].to_numpy()
future_ret = df["fut_ret_20"].to_numpy()

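rows = []
# Note: embargo=0 leaves no gap between folds. With a 20-row horizon, the last
# training labels depend on prices that fall inside the test window.
for split_id, (tr, te) in enumerate(time_series_splits(len(df), n_splits=5, embargo=0), start=1):
    # Fit the score scale on the training fold only, then reuse it on the test fold.
    scale = fit_score_scaler(pd.Series(future_ret[tr]), std_mult=2.0)
    y_tr = np.clip(future_ret[tr] / scale, -1.0, 1.0)
    y_te = np.clip(future_ret[te] / scale, -1.0, 1.0)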

    for spec in get_baselines():
        model = spec.model
        model.fit(X[tr], y_tr)
        pred = model.predict(X[te])

        rows.append({
            "split": split_id,
            "model": spec.name,
            "mse": mse(y_te, pred),
            "dir_acc": directional_accuracy(y_te, pred),
            "ic_spearman": information_coefficient(y_te, pred),
            "n_test": int(len(te))
        })

results = pd.DataFrame(rows)
logger.info("Done. Head:\n%s", results.head())
results


2025-12-15 16:27:19,038 | INFO | data | Loading sheet=Gold from C:\Users\fayca\Downloads\hackathon_gold_project\hackathon_gold_project\dataset_train.xlsx
2025-12-15 16:27:22,335 | INFO | data | Loaded 11340 rows, columns=['Date', 'Open', 'High', 'Low', 'Close', 'smavg_50', 'smavg_100', 'smavg_240']
2025-12-15 16:27:22,338 | INFO | features | Building features...
2025-12-15 16:27:22,387 | INFO | features | Features built. Total columns=34
2025-12-15 16:27:22,418 | INFO | labels | Fitted score scale=0.192158 (std_mult=2.00, std=0.096079)
2025-12-15 16:27:30,244 | INFO | labels | Fitted score scale=0.151396 (std_mult=2.00, std=0.075698)
2025-12-15 16:27:48,990 | INFO | labels | Fitted score scale=0.127992 (std_mult=2.00, std=0.063996)
2025-12-15 16:28:12,161 | INFO | labels | Fitted score scale=0.118446 (std_mult=2.00, std=0.059223)
2025-12-15 16:28:44,563 | INFO | labels | Fitted score scale=0.117416 (std_mult=2.00, std=0.058708)
2025-12-15 16:29:26,554 | INFO | notebook | Done. Head:
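
The `labels` log lines above report a scale equal to `std_mult` times a standard deviation (for example 0.192158 = 2.00 × 0.096079), and the notebook then divides the returns by that scale and clips to [-1, 1]. The sketch below reproduces that idea in plain Python. `fit_scale` and `to_score` are hypothetical names, not the project's `fit_score_scaler`, and the use of the sample standard deviation is an assumption.

```python
from statistics import stdev


def fit_scale(train_returns, std_mult=2.0):
    """Scale = std_mult times the standard deviation of the training returns.

    Assumption: sample standard deviation (ddof=1).
    """
    return std_mult * stdev(train_returns)


def to_score(returns, scale):
    """Divide by the scale, then clip each value into [-1, 1]."""
    return [max(-1.0, min(1.0, r / scale)) for r in returns]


# Toy returns, chosen only for illustration.
train_returns = [0.02, -0.01, 0.05, -0.03, 0.01]
test_returns = [0.10, -0.02, -0.09]

scale = fit_scale(train_returns)             # fitted on the training fold only
train_scores = to_score(train_returns, scale)
test_scores = to_score(test_returns, scale)  # test fold reuses the training scale
print(scale, test_scores)
```

Fitting the scale on the training fold and reusing it on the test fold keeps test-period volatility out of the label definition, which is why the fitted scale changes from split to split in the log.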
  

|   | split | model | mse | dir_acc | ic_spearman | n_test |
|---|-------|-------|-----|---------|-------------|--------|
| 0 | 1 | ridge | 0.059001 | 0.52221 | 0.084885 | 1846 |
| 1 | 1 | random_forest | 0.157641 | 0.419827 | 0.094134 | 1846 |
| 2 | 1 | gbrt | 0.117895 | 0.561213 | 0.187243 | 1846 |
| 3 | 2 | ridge | 0.035732 | 0.56013 | 0.114044 | 1846 |
| 4 | 2 | random_forest | 0.041974 | 0.523294 | 0.036182 | 1846 |
| 5 | 2 | gbrt | 0.058618 | 0.529252 | 0.024483 | 1846 |
| 6 | 3 | ridge | 0.084193 | 0.517876 | 0.153824 | 1846 |
| 7 | 3 | random_forest | 0.151104 | 0.44312 | 0.042797 | 1846 |
| 8 | 3 | gbrt | 0.156136 | 0.470206 | 0.018202 | 1846 |
| 9 | 4 | ridge | 0.710243 | 0.36403 | 0.004987 | 1846 |
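
Every split in the results has the same `n_test` (1846), which is consistent with an expanding training window followed by fixed-size test blocks. The sketch below shows one way such a splitter with an embargo could work. `expanding_splits` is a hypothetical function and not the project's `time_series_splits`, whose internals are not shown here. Because the target looks 20 rows ahead, an embargo of at least the horizon keeps training labels from depending on test-window prices.

```python
def expanding_splits(n_samples, n_splits=5, embargo=0):
    """Yield (train, test) index ranges with an expanding training window.

    The last `embargo` rows before each test block are dropped from training.
    """
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train_end = max(test_start - embargo, 0)
        yield range(0, train_end), range(test_start, test_start + test_size)


# Toy sizes: 600 rows, 5 splits, 20-row embargo matching a 20-row horizon.
splits = list(expanding_splits(n_samples=600, n_splits=5, embargo=20))
for train, test in splits:
    gap = test[0] - train[-1] - 1
    print(f"train=0..{train[-1]}  test={test[0]}..{test[-1]}  gap={gap}")
```

With `embargo=0` the gap is zero, so the final training rows and the first test rows share part of their forward-return window.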


In [5]:

# Aggregate by model
agg = results.groupby("model")[["mse","dir_acc","ic_spearman"]].mean().sort_values("mse")
agg


| model | mse | dir_acc | ic_spearman |
|-------|-----|---------|-------------|
| random_forest | 0.20515 | 0.441062 | 0.015276 |
| ridge | 0.210531 | 0.481798 | 0.082537 |
| gbrt | 0.2211 | 0.476706 | 0.055438 |
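
A mean across splits can be dominated by a single difficult split. As a small illustration, the sketch below compares the mean and the median of the ridge MSE values visible in the per-split results (splits 1 to 4 only, since the displayed table stops there). It is a side check under that limitation, not a replacement for the `groupby` aggregation.

```python
from statistics import mean, median

# Ridge MSE for splits 1 to 4, copied from the per-split results table.
ridge_mse = [0.059001, 0.035732, 0.084193, 0.710243]

ridge_mean = mean(ridge_mse)
ridge_median = median(ridge_mse)
print(f"mean   = {ridge_mean:.6f}")
print(f"median = {ridge_median:.6f}")
```

On these four values the mean is about 0.222 and the median about 0.072, so split 4 alone accounts for most of the average error. Reporting a median or a per-split spread next to the mean makes that visible.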


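✅ Pick the model with the best trade-off (often RF/GBRT) as the official baseline. In the aggregate above, random_forest has the lowest mean MSE (0.205), while ridge has the highest directional accuracy (0.482) and IC (0.083), so the choice depends on which metric matters most.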