# Notebook 05 — LSTM Experiments (Full Dataset)

In this step, we explore a **sequence-based deep learning approach** using an LSTM model trained on the **full `eurusd_features_with_regimes` dataset**.  
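Unlike per-regime modeling, this notebook does not split the data by regime: a single model is trained on all rows of the engineered feature set (pruned to the Random Forest selection when that file is available), with `regime_id` kept as an ordinary input feature.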

**Objectives:**
1. Prepare time-ordered feature data with next-day direction targets.
2. Convert features into fixed-length rolling sequences (no data leakage).
3. Perform **rolling time-series cross-validation** with an LSTM:
   - Leakage-safe scaling (fit only on train folds)
   - Early stopping and class weights for imbalance
4. Evaluate **out-of-fold performance** across folds.
5. Train a **final full-dataset LSTM model** for downstream backtesting.
6. Export:
   - OOF predictions (`*_lstm_full_oof.parquet`)
   - Trained model in Keras format (`*_lstm_full_savedmodel.keras`)
   - Associated scaler (`*_lstm_full_scaler.joblib`)

## Setup + project paths

In [12]:

import os, sys, json
from pathlib import Path

# Detect Colab
IN_COLAB = False
try:
    import google.colab  # type: ignore
    IN_COLAB = True
except Exception:
    IN_COLAB = False

# Mount Drive & set PROJECT_ROOT
if IN_COLAB:
    from google.colab import drive  # type: ignore
    drive.mount('/content/drive')
    PROJECT_ROOT = Path("/content/drive/MyDrive/FINAL_PROJECT_MLDL")
else:
    PROJECT_ROOT = Path(".").resolve()

PROJECT_ROOT.mkdir(parents=True, exist_ok=True)
print("PROJECT_ROOT:", PROJECT_ROOT)

%cd "$PROJECT_ROOT"

# Ensure src is importable
SRC_DIR = PROJECT_ROOT / "src"
if str(SRC_DIR) not in sys.path:
    sys.path.insert(0, str(SRC_DIR))
print("SRC_DIR on sys.path:", str(SRC_DIR) in sys.path)

# Folders
CFG_DIR  = PROJECT_ROOT / "config"
DATA_DIR = PROJECT_ROOT / "data"
PROC_DIR = DATA_DIR / "processed"
PROC_DIR.mkdir(parents=True, exist_ok=True)
print("PROC_DIR:", PROC_DIR)

# Select Asset
ASSET_KEY = "eurusd"

# Colab deps (idempotent)
if IN_COLAB:
    try:
        import pyarrow, sklearn, yaml  # noqa: F401
    except Exception:
        !pip -q install pyarrow scikit-learn pyyaml


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
PROJECT_ROOT: /content/drive/MyDrive/FINAL_PROJECT_MLDL
/content/drive/MyDrive/FINAL_PROJECT_MLDL
SRC_DIR on sys.path: True
PROC_DIR: /content/drive/MyDrive/FINAL_PROJECT_MLDL/data/processed


## Imports

In [13]:

import os, json, re, glob, warnings
warnings.filterwarnings("ignore")

from pathlib import Path
import numpy as np
import pandas as pd

from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers


## Load data

In [14]:

def asset_file(stem: str) -> Path:
    """Convenience path for processed files of the current asset."""
    return PROC_DIR / f"{ASSET_KEY}_{stem}.parquet"

# Load the engineered features (with regime labels), sorted by date; pruning happens in the next cell
features = pd.read_parquet(asset_file("features_with_regimes")).sort_values("Date").reset_index(drop=True)

# Target
aligned = pd.read_parquet(asset_file("aligned")).sort_values("Date").reset_index(drop=True)
tgt_close_cols = [c for c in aligned.columns if isinstance(c, str) and c.startswith("target_") and c.endswith("_Close")]
if not tgt_close_cols:
    tgt_close_cols = [c for c in aligned.columns if c == "target_Close"]
assert tgt_close_cols, "target close not found"
tgt_col = tgt_close_cols[0]

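aligned["ret1"]   = aligned[tgt_col].pct_change()
# y_next[t] = 1 if the close rises from t to t+1, else 0 (flat days count as 0);
# the last row has no next-day return and is dropped below
aligned["y_next"] = np.sign(aligned["ret1"].shift(-1)).replace({-1:0, 1:1}).astype("Int64")
y_frame = aligned[["Date","y_next"]].dropna()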

# Merge & sort
df = (features.merge(y_frame, on="Date", how="inner")
               .dropna(subset=["y_next"])
               .sort_values("Date")
               .reset_index(drop=True))
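
The target construction above can be checked on a tiny example. The sketch below uses hypothetical closing prices (not the notebook's data) and the same `pct_change` / `shift(-1)` / `sign` chain, so each row's label describes the move from that day to the next.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices, for illustration only
toy = pd.DataFrame({
    "Date": pd.date_range("2024-01-01", periods=5, freq="D"),
    "target_Close": [100.0, 101.0, 100.0, 100.0, 102.0],
})

toy["ret1"] = toy["target_Close"].pct_change()   # return from t-1 to t
next_ret = toy["ret1"].shift(-1)                  # return from t to t+1
toy["y_next"] = np.sign(next_ret).replace({-1: 0, 1: 1}).astype("Int64")

# The last row has no next-day return, so it drops out here
labels = toy[["Date", "y_next"]].dropna()
print(labels["y_next"].tolist())  # [1, 0, 0, 1]
```

The third label is 0 because the price is unchanged on the following day: flat days fall into the "down" class under this encoding.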


## Pruned Feature Selection

The previous cell loaded the **features_with_regimes** dataset and merged it with the aligned price data to create the next-day direction target (`y_next`).  
To reduce noise and improve model focus, we keep the **union of pruned features** across regimes from the Random Forest feature selection step (`*_rf_kept_features.json`), plus `regime_id` when it is available.  
If no pruning file is found, all numeric features are used.

In [15]:

# Load kept features per-regime and take the UNION (fallback to all numeric if file missing)
kept_path = PROC_DIR / f"{ASSET_KEY}_rf_kept_features.json"
if kept_path.exists():
    kept_by_regime = json.loads(kept_path.read_text())
    kept_union = sorted({c for cols in kept_by_regime.values() for c in cols})
    # Keep only those present in df; also keep regime_id (numeric) if available
    pruned_cols = [c for c in kept_union if c in df.columns]
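    if "regime_id" in df.columns:
        # Put regime_id first; filter it out of the rest to avoid a duplicate column
        pruned_cols = ["regime_id"] + [c for c in pruned_cols if c != "regime_id"]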
    print(f"Using pruned feature union: {len(pruned_cols)} columns.")
else:
    print("Kept-features JSON not found — using all numeric columns.")
    pruned_cols = None

y = df["y_next"].astype(int).values
if pruned_cols:
    X = df[pruned_cols].select_dtypes(include=[np.number])
else:
    X = df.drop(columns=["Date","y_next"]).select_dtypes(include=[np.number])

print("Final X shape:", X.shape, "| features:", X.columns[:8].tolist(), "...")


Using pruned feature union: 51 columns.
Final X shape: (3851, 51) | features: ['regime_id', 'bench_ma10', 'bench_ma5', 'bench_ret1', 'bench_ret5', 'bench_ret60', 'corr_bench_20', 'corr_bench_60'] ...
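
As an illustration of the union step, the sketch below assumes the JSON file maps a regime label to a list of kept column names. That layout is inferred from the loading code above, and the feature names are made up.

```python
import json

# Hypothetical file content: regime label -> list of kept feature names
raw = '{"0": ["bench_ret1", "rsi_14"], "1": ["rsi_14", "vol_20"]}'
kept_by_regime = json.loads(raw)

# Union across regimes, sorted for a stable column order
kept_union = sorted({c for cols in kept_by_regime.values() for c in cols})

# Keep only columns that exist in the (hypothetical) merged frame
available = ["Date", "regime_id", "bench_ret1", "vol_20", "y_next"]
pruned_cols = [c for c in kept_union if c in available]
if "regime_id" in available:
    pruned_cols = ["regime_id"] + pruned_cols

print(pruned_cols)  # ['regime_id', 'bench_ret1', 'vol_20']
```

Features listed in the JSON but missing from the frame (`rsi_14` here) are silently dropped, which is why the final column count can be smaller than the size of the union.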


## Sequence helpers (no leakage)

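This cell defines the sequence builder, the class-weight helper, the model factory, and the metric helper. Design choices:
- **Two stacked LSTM layers** (64 → 32 units), each followed by dropout, to capture deeper temporal dependencies.
- An **L2 penalty** (`L2 = 1e-4`) on the LSTM kernel and recurrent weights and on the hidden dense layer to reduce overfitting.
- A **longer sequence window (60 timesteps)** to give the model more history per sample.
- A positive-class **weight boost** (`pos_boost`) on top of balanced class weights to improve recall.
- A **ReduceLROnPlateau** learning rate schedule, attached as a callback at fit time, for smoother convergence.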
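
To make the positive-class weight boost concrete, here is a minimal sketch of the same arithmetic with hypothetical class counts. The function name is illustrative and is not the notebook's helper.

```python
def balanced_weights_with_boost(n_pos, n_neg, pos_boost=1.5):
    """Balanced class weights, with the positive class multiplied by pos_boost."""
    total = n_pos + n_neg
    w_neg = total / (2.0 * n_neg)
    w_pos = total / (2.0 * n_pos) * pos_boost
    return {0: w_neg, 1: w_pos}

# Hypothetical, perfectly balanced fold
weights = balanced_weights_with_boost(n_pos=50, n_neg=50)
print(weights)  # {0: 1.0, 1: 1.5}
```

With roughly balanced classes, the boost alone makes each "up" sample count 1.5 times as much as a "down" sample in the loss, which pushes the model toward predicting the positive class.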

In [16]:

from tensorflow.keras import layers, regularizers, callbacks, optimizers

LSTM_WIN = 60          # longer sequence
EPOCHS   = 80
BATCH    = 32          # smaller batch for better minima
PATIENCE = 15          # more patience (noisy finance CV)
L2      = 1e-4         # weight decay

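def make_sequences(X_df, y_arr, win=LSTM_WIN):
    """Build rolling windows of length `win`.

    The sample for target index t uses feature rows t-win .. t-1 (row t itself is excluded).
    Returns (X_seq, y_seq, idx), where idx holds the target row positions.
    """
    X_np = X_df.values.astype("float32")
    y_np = y_arr.astype("float32")
    n = len(X_np)
    if n <= win:
        # Not enough rows for a single window: return correctly shaped empty arrays
        return np.empty((0, win, X_np.shape[1]), "float32"), np.empty((0,), "float32"), np.array([], int)
    Xs, ys, idxs = [], [], []
    for t in range(win, n):
        Xs.append(X_np[t-win:t])
        ys.append(y_np[t])
        idxs.append(t)
    return np.stack(Xs), np.array(ys), np.array(idxs)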

def class_weights_binary(y_vec, pos_boost=1.5):
    pos = (y_vec == 1).sum()
    neg = (y_vec == 0).sum()
    if pos == 0 or neg == 0:
        return {0:1.0, 1:1.0}
    tot = pos + neg
    base = {0: tot/(2.0*neg), 1: tot/(2.0*pos)}
    base[1] *= pos_boost  # ↑ positive weight to help recall
    return base

def build_lstm_model(n_features, units=(64, 32), dropout=(0.3, 0.3), l2=L2):
    inp = layers.Input(shape=(None, n_features))
    x = layers.LSTM(units[0], return_sequences=True,
                    kernel_regularizer=regularizers.l2(l2),
                    recurrent_regularizer=regularizers.l2(l2))(inp)
    x = layers.Dropout(dropout[0])(x)
    x = layers.LSTM(units[1],
                    kernel_regularizer=regularizers.l2(l2),
                    recurrent_regularizer=regularizers.l2(l2))(x)
    x = layers.Dropout(dropout[1])(x)
    x = layers.Dense(32, activation="relu", kernel_regularizer=regularizers.l2(l2))(x)
    out = layers.Dense(1, activation="sigmoid")(x)

    # Fixed initial learning rate; ReduceLROnPlateau is attached as a callback at fit time
    opt = optimizers.Adam(learning_rate=1e-3)
    model = tf.keras.Model(inp, out)
    model.compile(optimizer=opt, loss="binary_crossentropy", metrics=[tf.keras.metrics.AUC(name="auc")])
    return model

def eval_class(y_true, prob):
    """Threshold probabilities at 0.5 and compute metrics (AUC is NaN if only one class is present)."""
    pred = (prob >= 0.5).astype(int)
    out = {"acc":accuracy_score(y_true,pred),
           "prec":precision_score(y_true,pred,zero_division=0),
           "rec":recall_score(y_true,pred,zero_division=0),
           "f1":f1_score(y_true,pred,zero_division=0)}
    out["auc"] = roc_auc_score(y_true, prob) if len(np.unique(y_true))>1 else np.nan
    return out
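
The windowing in `make_sequences` can be verified on a toy array. The sketch below rebuilds the loop on hypothetical data and compares it with a vectorized variant based on NumPy's `sliding_window_view`. The vectorized form is an alternative shown for illustration, not what the notebook uses.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

win = 3
X = np.arange(12, dtype="float32").reshape(6, 2)   # 6 rows, 2 features
y = np.array([0, 1, 0, 1, 1, 0], dtype="float32")

# Loop form: the window for target t covers rows t-win .. t-1
X_loop = np.stack([X[t - win:t] for t in range(win, len(X))])
y_loop = y[win:]

# Vectorized form: sliding_window_view puts the window axis last,
# so transpose to (samples, win, features) and drop the final window,
# which has no target row after it
X_vec = sliding_window_view(X, window_shape=win, axis=0)
X_vec = X_vec.transpose(0, 2, 1)[:-1]

print(X_loop.shape, y_loop.shape)  # (3, 3, 2) (3,)
```

Both forms give identical arrays. The last row of each window is row `t-1`, so the features of the target row itself never enter its own sample.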


## Rolling TimeSeries CV (leakage-safe scaling + sequence CV)

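We apply a **5-fold rolling (expanding-window) `TimeSeriesSplit`** so that every validation fold lies strictly after its training data, which avoids lookahead bias.  
Steps per fold:
1. Fit the **standard scaler** on the training fold only, then apply it to both folds.
2. Convert each fold to fixed-length sequences (`win=60`). Windows are built inside each fold, so the first 60 rows of every validation fold receive no prediction.
3. Train the stacked LSTM with **class weights** and **early stopping** (patience=15). Early stopping monitors AUC on the validation fold itself, so the reported fold metrics are likely somewhat optimistic.
4. Use a **smaller batch size (32)**; the added gradient noise is intended to help find flatter minima.
5. Record out-of-fold predictions and evaluate accuracy, precision, recall, F1, and AUC.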
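
The split-then-scale pattern in the steps above can be sketched on a small synthetic array. The data here are random and purely illustrative; only `TimeSeriesSplit` and `StandardScaler` mirror the notebook.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(loc=5.0, scale=2.0, size=(12, 2))   # hypothetical features

folds = []
for tr_idx, va_idx in TimeSeriesSplit(n_splits=3).split(X_toy):
    scaler = StandardScaler().fit(X_toy[tr_idx])   # statistics from train rows only
    X_tr = scaler.transform(X_toy[tr_idx])
    X_va = scaler.transform(X_toy[va_idx])         # reuses the train mean and std
    folds.append((tr_idx.tolist(), va_idx.tolist()))
    print(f"train rows {tr_idx[0]}..{tr_idx[-1]} | val rows {va_idx[0]}..{va_idx[-1]}")
```

Each training set starts at row 0 and grows, and every validation block sits immediately after it. The validation rows are scaled with the training statistics, so their mean is generally not zero.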

In [17]:

N_SPLITS = 5
tscv = TimeSeriesSplit(n_splits=N_SPLITS)

oof_prob = np.full(len(df), np.nan, dtype=float)
fold_metrics = []

for fold, (tr_idx, va_idx) in enumerate(tscv.split(X), start=1):
    X_tr_raw, X_va_raw = X.iloc[tr_idx], X.iloc[va_idx]
    y_tr, y_va = y[tr_idx], y[va_idx]

    # Scale on TRAIN only
    scaler = StandardScaler()
    X_tr = pd.DataFrame(scaler.fit_transform(X_tr_raw), index=X_tr_raw.index, columns=X.columns)
    X_va = pd.DataFrame(scaler.transform(X_va_raw),    index=X_va_raw.index, columns=X.columns)

    # Sequences
    Xtr_seq, ytr_seq, idx_tr_seq = make_sequences(X_tr, y_tr, win=LSTM_WIN)
    Xva_seq, yva_seq, idx_va_seq = make_sequences(X_va, y_va, win=LSTM_WIN)
    if len(idx_tr_seq)==0 or len(idx_va_seq)==0:
        print(f"[Fold {fold}] Not enough sequence data — skipping.")
        continue

    model = build_lstm_model(n_features=X.shape[1])
    cw = class_weights_binary(ytr_seq, pos_boost=1.7)  # a bit stronger than default 1.5

    cbs = [
        callbacks.EarlyStopping(monitor="val_auc", mode="max", patience=PATIENCE, restore_best_weights=True),
        callbacks.ReduceLROnPlateau(monitor="val_auc", mode="max", factor=0.5, patience=6, min_lr=1e-5)
    ]

    model.fit(
        Xtr_seq, ytr_seq,
        validation_data=(Xva_seq, yva_seq),
        epochs=EPOCHS,
        batch_size=BATCH,
        class_weight=cw,
        verbose=0,
        callbacks=cbs
    )

    pva = model.predict(Xva_seq, verbose=0).reshape(-1)
    df_va_rows = X_va.index[idx_va_seq]
    oof_prob[df_va_rows] = pva

    m = eval_class(y[df_va_rows].astype(int), oof_prob[df_va_rows])
    fold_metrics.append({"fold": fold, **m})
    print(f"[Fold {fold}] " + " | ".join(f"{k}:{v:.3f}" for k,v in m.items()))

fold_df = pd.DataFrame(fold_metrics)
print("\nOOF mean across folds:")
display(fold_df.mean(numeric_only=True).to_frame("mean").T.round(3))


[Fold 1] acc:0.499 | prec:0.501 | rec:0.993 | f1:0.666 | auc:0.500
[Fold 2] acc:0.561 | prec:0.526 | rec:0.809 | f1:0.637 | auc:0.587
[Fold 3] acc:0.508 | prec:0.508 | rec:1.000 | f1:0.674 | auc:0.509
[Fold 4] acc:0.485 | prec:0.486 | rec:0.919 | f1:0.636 | auc:0.493
[Fold 5] acc:0.487 | prec:0.487 | rec:1.000 | f1:0.655 | auc:0.546

OOF mean across folds:


| | fold | acc | prec | rec | f1 | auc |
|---|---|---|---|---|---|---|
| mean | 3.0 | 0.508 | 0.502 | 0.944 | 0.654 | 0.527 |


## Save OOF & Train Full Model (for inference/backtests)

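After cross-validation, we refit the LSTM on the full dataset using the same architecture, with lighter dropout (0.2), the default positive-class boost (1.5), and shorter callback patience. Because no validation set is left, the callbacks monitor the training AUC.  
We save:
- Out-of-fold probabilities (`*_lstm_full_oof.parquet`).
- The trained LSTM model (`*_lstm_full_savedmodel.keras`).
- The fitted scaler (`*_lstm_full_scaler.joblib`), needed to reproduce the preprocessing at inference time.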

In [18]:

# Save OOF predictions (probabilities aligned by Date)
oof_out = pd.DataFrame({"Date": df["Date"], "prob_up_oof": oof_prob})
oof_out.to_parquet(asset_file("lstm_full_oof"), index=False)
print("Saved:", asset_file("lstm_full_oof").name)

# Train a final model on the entire (scaled + sequenced) dataset
# 1) Scale on FULL data (for deployment; OOF above already handled CV properly)
scaler_full = StandardScaler()
X_full = pd.DataFrame(scaler_full.fit_transform(X), index=X.index, columns=X.columns)
X_seq, y_seq, idx_seq = make_sequences(X_full, y, win=LSTM_WIN)

if len(idx_seq) == 0:
    print("Not enough rows for full fit sequences — skipping final model save.")
else:
    model_full = build_lstm_model(n_features=X.shape[1], units=(64, 32), dropout=(0.2, 0.2))  # lighter dropout than in CV (0.3)
    cw_full = class_weights_binary(y_seq)
    cbs_full = [
        keras.callbacks.EarlyStopping(monitor="auc", mode="max", patience=10, restore_best_weights=True),
        keras.callbacks.ReduceLROnPlateau(monitor="auc", mode="max", factor=0.5, patience=5, min_lr=1e-5)
    ]
    # No validation set here: the callbacks monitor the training AUC, so early stopping
    # only halts stalled training and does not guard against overfitting
    hist_full = model_full.fit(
        X_seq, y_seq,
        epochs=EPOCHS,
        batch_size=BATCH,
        class_weight=cw_full,
        verbose=0,
        callbacks=cbs_full
    )

    # Save model + scaler to disk (scaler needed to reproduce preprocessing)
    save_dir = PROC_DIR / f"{ASSET_KEY}_lstm_full_savedmodel.keras"  # single-file native Keras format
    keras.models.save_model(model_full, save_dir, overwrite=True)
    # Save scaler
    import joblib
    joblib.dump(scaler_full, PROC_DIR / f"{ASSET_KEY}_lstm_full_scaler.joblib")

    print("Saved model →", save_dir)
    print("Saved scaler →", (PROC_DIR / f"{ASSET_KEY}_lstm_full_scaler.joblib").name)

Saved: eurusd_lstm_full_oof.parquet
Saved model → /content/drive/MyDrive/FINAL_PROJECT_MLDL/data/processed/eurusd_lstm_full_savedmodel.keras
Saved scaler → eurusd_lstm_full_scaler.joblib
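
For downstream use, the saved scaler has to be applied before any window is passed to the model. The sketch below shows only that preprocessing step, under stated assumptions: the feature count and the data are hypothetical, and a temporary joblib file stands in for the real scaler artifact.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

WIN, N_FEATURES = 60, 4                        # N_FEATURES is a placeholder
rng = np.random.default_rng(0)
history = rng.normal(size=(200, N_FEATURES))   # hypothetical raw feature rows, oldest first

with tempfile.TemporaryDirectory() as tmp:
    scaler_path = Path(tmp) / "toy_scaler.joblib"
    joblib.dump(StandardScaler().fit(history), scaler_path)   # stands in for the saved scaler
    scaler = joblib.load(scaler_path)

window = scaler.transform(history[-WIN:])            # scale one window of WIN rows
batch = window[np.newaxis, ...].astype("float32")    # shape (1, WIN, N_FEATURES)
print(batch.shape)  # (1, 60, 4)

# With the real artifacts, the model would be loaded with keras.models.load_model(...)
# on the .keras file and called as model.predict(batch).
```

Which rows belong in the window depends on the training alignment: `make_sequences` pairs rows `t-win .. t-1` with the label of row `t`, so the window should end one row before the day whose next-day direction is being forecast.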


## Quick performance snapshot

In [19]:

mask = ~np.isnan(oof_prob)
if mask.sum() > 0:
    overall = eval_class(y[mask].astype(int), oof_prob[mask])
    print("OOF performance (overall):")
    for k, v in overall.items():
        print(f"  {k:>4}: {v:.3f}")
else:
    print("No valid OOF predictions to summarize.")


OOF performance (overall):
   acc: 0.508
  prec: 0.500
   rec: 0.945
    f1: 0.654
   auc: 0.522
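
Recall near 0.95 combined with accuracy near 0.5 suggests that most predictions land on the "up" side. For reference, the sketch below computes the same metrics for a constant "always up" predictor on hypothetical balanced labels (not the notebook's data).

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = np.array([1, 0] * 50)     # hypothetical balanced labels
y_pred = np.ones_like(y_true)      # constant "up" prediction

baseline = {
    "acc": accuracy_score(y_true, y_pred),
    "prec": precision_score(y_true, y_pred, zero_division=0),
    "rec": recall_score(y_true, y_pred, zero_division=0),
    "f1": f1_score(y_true, y_pred, zero_division=0),
}
for k, v in baseline.items():
    print(f"{k:>4}: {v:.3f}")
```

On balanced labels this baseline reaches accuracy 0.5, precision 0.5, recall 1.0, and F1 of about 0.667, so F1 and recall alone are weak evidence of skill here. AUC and accuracy are the more informative columns.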
