# Notebook 04 — Per-Regime Modeling

In this step, we train and evaluate predictive models **separately for each detected market regime**.  
The goal is to allow model parameters to adapt to the statistical properties of each regime rather than fitting a single global model.

**Main steps:**
1. Load features and aligned price data with regime labels.
2. Build next-day direction target (`y_next`).
3. Define candidate models:
   - Random Forest (balanced-subsample class weights)
   - XGBoost (tuned for structured/tabular data)
4. Perform **rolling time series cross-validation** per regime.
5. Select the best model for each regime based on F1 (and AUC as tiebreaker).
6. Fit the chosen model on the full regime data.
7. Save:
   - Per-regime fitted models
   - Out-of-fold (OOF) predictions
   - Files for downstream backtesting.


## Setup + project paths

In [4]:

import os, sys, json
from pathlib import Path

# Detect Colab
IN_COLAB = False
try:
    import google.colab  # type: ignore
    IN_COLAB = True
except Exception:
    IN_COLAB = False

# Mount Drive & set PROJECT_ROOT
if IN_COLAB:
    from google.colab import drive  # type: ignore
    drive.mount('/content/drive')
    PROJECT_ROOT = Path("/content/drive/MyDrive/FINAL_PROJECT_MLDL")
else:
    PROJECT_ROOT = Path(".").resolve()

PROJECT_ROOT.mkdir(parents=True, exist_ok=True)
print("PROJECT_ROOT:", PROJECT_ROOT)

%cd "$PROJECT_ROOT"

# Ensure src is importable
SRC_DIR = PROJECT_ROOT / "src"
if str(SRC_DIR) not in sys.path:
    sys.path.insert(0, str(SRC_DIR))
print("SRC_DIR on sys.path:", str(SRC_DIR) in sys.path or str(SRC_DIR) == sys.path[0])

# Folders
CFG_DIR  = PROJECT_ROOT / "config"
DATA_DIR = PROJECT_ROOT / "data"
PROC_DIR = DATA_DIR / "processed"
PROC_DIR.mkdir(parents=True, exist_ok=True)
print("PROC_DIR:", PROC_DIR)

# Select Asset
ASSET_KEY = "eurusd"

# Colab deps (idempotent)
if IN_COLAB:
    try:
        import pyarrow, sklearn, yaml  # noqa: F401
    except Exception:
        !pip -q install pyarrow scikit-learn pyyaml


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
PROJECT_ROOT: /content/drive/MyDrive/FINAL_PROJECT_MLDL
/content/drive/MyDrive/FINAL_PROJECT_MLDL
SRC_DIR on sys.path: True
PROC_DIR: /content/drive/MyDrive/FINAL_PROJECT_MLDL/data/processed


## Imports

In [5]:
 !pip -q install xgboost >/dev/null

In [6]:

import json, re, glob, warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd

from joblib import dump
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

import xgboost as xgb


## Helper Functions

- **`asset_file(stem)`**: Builds a consistent path to processed files for the current asset.
- **`eval_class(y_true, prob)`**: Computes binary classification metrics. ACC, PREC, REC, and F1 use a 0.5 probability threshold; AUC is computed from the raw probabilities and is `NaN` when only one class is present.
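
To make the difference between the thresholded metrics and AUC concrete, here is a minimal sketch on toy values (the labels and probabilities are illustrative, not project data). It calls scikit-learn directly rather than the notebook's helper.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

# Toy labels and predicted probabilities (illustrative values only).
y_true = np.array([0, 0, 1, 1])
prob = np.array([0.1, 0.6, 0.4, 0.9])

# Thresholded metrics depend on the 0.5 cut-off...
pred = (prob >= 0.5).astype(int)  # -> [0, 1, 0, 1]
f1_at_half = f1_score(y_true, pred, zero_division=0)

# ...while AUC depends only on how the probabilities rank the two classes.
auc = roc_auc_score(y_true, prob)

print(f"F1@0.5 = {f1_at_half:.2f}, AUC = {auc:.2f}")  # F1@0.5 = 0.50, AUC = 0.75
```

Here one positive sits below 0.5 and one negative above it, so F1 is 0.50, yet three of the four positive/negative pairs are ranked correctly, giving an AUC of 0.75.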

In [7]:

def asset_file(stem: str) -> Path:
    """Convenience path for processed files of the current asset."""
    return PROC_DIR / f"{ASSET_KEY}_{stem}.parquet"

def eval_class(y_true, prob):
    """Binary metrics from probabilities (0.5 threshold for prec/rec/F1)."""
    pred = (prob >= 0.5).astype(int)
    out = {
        "acc":  accuracy_score(y_true, pred),
        "prec": precision_score(y_true, pred, zero_division=0),
        "rec":  recall_score(y_true, pred, zero_division=0),
        "f1":   f1_score(y_true, pred, zero_division=0),
    }
    out["auc"] = roc_auc_score(y_true, prob) if len(np.unique(y_true)) > 1 else np.nan
    return out


## Build target (y_next) from aligned prices


- Loads **aligned price data** and sorts it by date.
- Identifies the correct **target closing price** column.
- Computes:
  - `ret1`: 1-day returns.
  - `y_next`: Next-day direction (1 = up, 0 = down or unchanged).
- Creates `y_frame` with `Date` and `y_next`, dropping missing values.
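
The target construction can be sketched on a toy price series (the prices below are made up for illustration). It uses the same `pct_change` / `shift(-1)` / `np.sign` chain as the cell that follows, and shows two edge cases: an unchanged day maps to 0, and the last row has no next day, so it is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices: up, flat, down, up.
close = pd.Series([1.10, 1.12, 1.12, 1.09, 1.11])

ret1 = close.pct_change()        # return from the previous day to today
next_ret = ret1.shift(-1)        # tomorrow's return, aligned to today's row

# sign() gives +1 / 0 / -1; map -1 to 0 so the label is binary.
y_next = np.sign(next_ret).replace({-1: 0, 1: 1}).astype("Int64")

print(y_next.tolist())           # [1, 0, 0, 1, <NA>]
```

The flat day (1.12 to 1.12) produces a sign of 0 and therefore lands in the "0" class together with down days.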

In [8]:

aligned = pd.read_parquet(asset_file("aligned")).sort_values("Date").reset_index(drop=True)

tgt_close_cols = [c for c in aligned.columns if isinstance(c,str) and c.startswith("target_") and c.endswith("_Close")]
if not tgt_close_cols:
    tgt_close_cols = [c for c in aligned.columns if c == "target_Close"]
assert tgt_close_cols, "target close not found"
tgt_col = tgt_close_cols[0]

aligned["ret1"]   = aligned[tgt_col].pct_change()
aligned["y_next"] = np.sign(aligned["ret1"].shift(-1)).replace({-1:0, 1:1}).astype("Int64")
y_frame = aligned[["Date","y_next"]].dropna()

print("Aligned rows:", len(aligned), "| Target rows:", len(y_frame))


Aligned rows: 3913 | Target rows: 3912


## Detect per-regime feature frames

- Searches for **pruned feature files** for each detected regime.
- Reads and validates them (must contain `Date` and `regime_id`).
- Merges with `y_frame` to attach target labels, cleans NaNs, and sorts by date.
- Stores each regime’s dataset in `regime_frames` for model training.
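
As a small illustration of the file-discovery step, the sketch below parses the regime ID out of a file name. The helper `regime_id_from_path` and the example paths are hypothetical; unlike the cell that follows, it matches against the file name only, which avoids a false match if a parent directory happens to contain `regime<digits>`.

```python
import re
from pathlib import Path

# Hypothetical paths, mimicking the naming pattern used in this project.
paths = [
    "/tmp/eurusd_features_pruned_regime1.parquet",
    "/tmp/eurusd_features_pruned_regime0.parquet",
    "/tmp/eurusd_notes.parquet",
]

def regime_id_from_path(p):
    """Return the integer regime ID found in the file name, or None."""
    m = re.search(r"regime(\d+)", Path(p).name)
    return int(m.group(1)) if m else None

ids = {Path(p).name: regime_id_from_path(p) for p in sorted(paths)}
for name, rid in ids.items():
    print(name, "->", rid)
```

Storing frames under integer keys, as the notebook does, also keeps regimes in numeric order when iterating with `sorted(...)`, even if there are ten or more of them.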

In [9]:

pattern = str(PROC_DIR / f"{ASSET_KEY}_features_pruned_regime*.parquet")
pruned_paths = sorted(glob.glob(pattern))
regime_frames = {}

print(f"Found pruned regime files ({len(pruned_paths)}):")
for p in pruned_paths:
    rid = int(re.search(r"regime(\d+)", p).group(1))
    df_r = pd.read_parquet(p)
    assert {"Date","regime_id"}.issubset(df_r.columns), f"Missing columns in {p}"
    regime_frames[rid] = (df_r
                          .merge(y_frame, on="Date", how="inner")
                          .dropna(subset=["y_next"])
                          .sort_values("Date")
                          .reset_index(drop=True))
    print(f"  - regime {rid}: {Path(p).name} | {regime_frames[rid].shape}")

assert len(regime_frames) > 0, "No regime frames available to train."


Found pruned regime files (2):
  - regime 0: eurusd_features_pruned_regime0.parquet | (3210, 43)
  - regime 1: eurusd_features_pruned_regime1.parquet | (641, 43)


## Candidate models (RF / XGB)

In [10]:

MODELS = {
    "rf": Pipeline([
        ("scaler", StandardScaler(with_mean=True, with_std=True)),
        ("clf", RandomForestClassifier(
            n_estimators=500, min_samples_leaf=5, random_state=42,
            n_jobs=-1, class_weight="balanced_subsample"
        ))
    ]),
    "xgb": Pipeline([
        ("scaler", StandardScaler(with_mean=True, with_std=True)),
        ("clf", xgb.XGBClassifier(
            n_estimators=600, max_depth=4, learning_rate=0.03,
            subsample=0.9, colsample_bytree=0.9, reg_lambda=1.0,
            random_state=42, tree_method="hist", n_jobs=-1,
            eval_metric="logloss"
        ))
    ])
}
list(MODELS.keys())


['rf', 'xgb']

## Rolling TimeSeries CV per regime, pick winner, refit, save

- Iterates over each regime dataset (`regime_frames`).
- Skips regimes with **too few rows** for stable training.
- For each candidate model in `MODELS`:
  - Performs **TimeSeriesSplit** CV.
  - Computes Out-Of-Fold (OOF) predictions.
  - Evaluates metrics (`acc`, `prec`, `rec`, `f1`, `auc`).
- Selects the **best model per regime** using:
  - Highest F1-score  
  - AUC as tie-breaker
- Refits the winning model on **all regime data**.
- Saves the trained model to disk for downstream use.
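
One property of `TimeSeriesSplit` is worth seeing before reading the loop: every validation fold lies strictly after its training rows, so the earliest block of rows is never predicted and stays `NaN` in the OOF vector. A minimal sketch with 12 dummy rows (sizes chosen only for illustration):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_rows, n_splits = 12, 5
X_dummy = np.zeros((n_rows, 1))          # contents are irrelevant for splitting
oof = np.full(n_rows, np.nan)

for train_idx, val_idx in TimeSeriesSplit(n_splits=n_splits).split(X_dummy):
    # Training rows always precede validation rows (no look-ahead).
    assert train_idx.max() < val_idx.min()
    oof[val_idx] = 0.5                   # placeholder for predict_proba output

missing = np.flatnonzero(np.isnan(oof))
print(missing)                           # [0 1]
```

This is why the metrics in the loop are computed only on rows where the OOF value is not `NaN`.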

In [11]:

N_SPLITS = 5
MIN_ROWS = 350  # minimum rows per regime to train

tscv = TimeSeriesSplit(n_splits=N_SPLITS)

results_by_regime = {}
oof_by_regime     = {}
winner_by_regime  = {}
models_saved      = []

for rid in sorted(regime_frames):
    grp = regime_frames[rid]
    if len(grp) < max(MIN_ROWS, N_SPLITS*30):
        print(f"Regime {rid}: too few rows ({len(grp)}). Skipping.")
        continue

    y_r = grp["y_next"].astype(int).values
    X_r = grp.drop(columns=["Date","regime_id","y_next"]).select_dtypes(include=[np.number])

    res_local, oof_local = {}, {}

    for name, pipe in MODELS.items():
        oof = np.full(len(grp), np.nan, dtype=float)
        for tr, va in tscv.split(X_r):
            pipe.fit(X_r.iloc[tr], y_r[tr])
            p = pipe.predict_proba(X_r.iloc[va])[:,1]
            oof[va] = p
        m = eval_class(y_r[~np.isnan(oof)], oof[~np.isnan(oof)])
        res_local[name] = m
        oof_local[name] = oof
        print(f"[regime {rid}] {name}: " + " | ".join(f"{k}:{v:.3f}" for k,v in m.items()))

    # Winner: highest F1, tie-break by AUC
    winner = max(res_local.items(), key=lambda kv: (np.nan_to_num(kv[1]['f1']), np.nan_to_num(kv[1]['auc'])))[0]
    winner_by_regime[int(rid)] = winner
    results_by_regime[int(rid)] = res_local
    oof_by_regime[int(rid)] = oof_local[winner]

    # Refit on full regime and save
    best = MODELS[winner]
    best.fit(X_r, y_r)
    model_path = PROC_DIR / f"{ASSET_KEY}_regime{int(rid)}_{winner}.joblib"
    dump(best, model_path)
    models_saved.append(str(model_path))


[regime 0] rf: acc:0.853 | prec:0.870 | rec:0.824 | f1:0.846 | auc:0.927
[regime 0] xgb: acc:0.841 | prec:0.862 | rec:0.807 | f1:0.833 | auc:0.923
[regime 1] rf: acc:0.864 | prec:0.871 | rec:0.851 | f1:0.861 | auc:0.919
[regime 1] xgb: acc:0.862 | prec:0.859 | rec:0.863 | f1:0.861 | auc:0.925
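
The winner rule (highest F1, AUC as tie-breaker) amounts to a `max` over tuples. The sketch below applies it to the rounded regime 1 figures printed above; because the notebook compares unrounded values, its actual choice may differ when the F1 scores are only tied after rounding. The helper `nan_to_zero` is a stand-in for `np.nan_to_num`.

```python
import math

# Rounded regime 1 metrics copied from the printed output above.
metrics = {
    "rf":  {"f1": 0.861, "auc": 0.919},
    "xgb": {"f1": 0.861, "auc": 0.925},
}

def nan_to_zero(x):
    """Treat an undefined metric as 0 so it never wins a comparison."""
    return 0.0 if math.isnan(x) else x

# Tuples compare element by element: F1 first, then AUC.
winner = max(
    metrics,
    key=lambda name: (nan_to_zero(metrics[name]["f1"]),
                      nan_to_zero(metrics[name]["auc"])),
)
print(winner)  # xgb
```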


## Stitch OOF across regimes + export backtest signals

- Concatenates per-regime OOF predictions into a single **time-ordered** table.
- Saves:
  - `*_per_regime_oof.parquet` with raw OOF probabilities per date.
  - `*_per_regime_signals.parquet` with a simple **0.5 threshold** → signal = {+1, −1}.
- These files are the hand-off to **Notebook 05 (Backtest & Eval)** for equity curves, Sharpe, and drawdowns.
- If no OOF exists (e.g., regime too small), we skip exporting to avoid junk inputs.
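
One detail of the thresholding step deserves care: rows that never fell in a validation fold carry a `NaN` probability, and a plain `>= 0.5` comparison evaluates to `False` for `NaN`, which would turn them into −1 signals. A minimal sketch on toy values, showing the effect and one possible remedy (dropping those rows first):

```python
import numpy as np
import pandas as pd

# Toy OOF table: the first row was never scored.
df = pd.DataFrame({"oof_prob": [np.nan, 0.7, 0.4]})

# Naive thresholding: NaN >= 0.5 is False, so the unscored row becomes -1.
naive = np.where(df["oof_prob"] >= 0.5, 1, -1)
print(naive.tolist())                # [-1, 1, -1]

# Dropping unscored rows first keeps only genuine predictions.
scored = df.dropna(subset=["oof_prob"]).copy()
scored["signal"] = np.where(scored["oof_prob"] >= 0.5, 1, -1)
print(scored["signal"].tolist())     # [1, -1]
```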

In [12]:

stitched = []
for rid in sorted(regime_frames):
    grp = regime_frames[rid]
    if int(rid) in oof_by_regime:
        stitched.append(pd.DataFrame({
            "Date": grp["Date"].values,
            "regime_id": int(rid),
            "oof_prob": oof_by_regime[int(rid)],
            "y_next": grp["y_next"].values
        }))

if stitched:
    stitched_df = pd.concat(stitched).sort_values("Date").reset_index(drop=True)
    stitched_df.to_parquet(asset_file("per_regime_oof"), index=False)

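    # Rows that never fell in a validation fold have NaN probabilities;
    # drop them so they are not silently mapped to a -1 signal.
    out = stitched_df.dropna(subset=["oof_prob"]).copy()
    out["signal"] = np.where(out["oof_prob"] >= 0.5, 1, -1)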
    out = out[["Date","regime_id","oof_prob","signal","y_next"]]
    out.to_parquet(asset_file("per_regime_signals"), index=False)

    print("Saved:", asset_file("per_regime_oof").name)
    print("Saved:", asset_file("per_regime_signals").name)
else:
    stitched_df = None
    print("No OOF stitched — nothing to export.")


Saved: eurusd_per_regime_oof.parquet
Saved: eurusd_per_regime_signals.parquet


## Save summary + performance table

In [13]:

summary = {
    "asset": ASSET_KEY,
    "winners": winner_by_regime,
    "models_saved": models_saved,
    "regimes": sorted(list(regime_frames.keys()))
}
summary_path = PROC_DIR / f"{ASSET_KEY}_per_regime_summary.json"
summary_path.write_text(json.dumps(summary, indent=2))
print("Summary saved →", summary_path.name)

# Pretty table
rows = []
for rid, models in results_by_regime.items():
    for mname, metrics in models.items():
        rows.append({"regime": rid, "model": mname, **{k: round(float(v),3) if pd.notna(v) else np.nan for k,v in metrics.items()}})
perf_df = pd.DataFrame(rows).sort_values(["regime","f1","auc"], ascending=[True, False, False]).reset_index(drop=True)
display(perf_df)


Summary saved → eurusd_per_regime_summary.json


| regime | model | acc | prec | rec | f1 | auc |
|---|---|---|---|---|---|---|
| 0 | rf | 0.853 | 0.870 | 0.824 | 0.846 | 0.927 |
| 0 | xgb | 0.841 | 0.862 | 0.807 | 0.833 | 0.923 |
| 1 | xgb | 0.862 | 0.859 | 0.863 | 0.861 | 0.925 |
| 1 | rf | 0.864 | 0.871 | 0.851 | 0.861 | 0.919 |


## Key Takeaways

- **Strong and stable:** AUC stays high (≈ **0.919–0.927**); F1 ranges **0.833–0.861**.
- **Regime 0:** **Random Forest** leads on both metrics: F1 **0.846** vs **0.833** (XGB) and AUC **0.927** vs **0.923**.
- **Regime 1:** **XGB** edges on AUC (**0.925** vs **0.919**); **F1 is essentially tied** (~**0.861** both).
- **Small deltas:** Within each regime, model gaps are minor (ΔF1 ≤ ~0.013, ΔAUC ≤ ~0.006).