# Stage 10b — Modeling: Time Series & Classification

This notebook creates **lag/rolling features** on a synthetic time series and trains a **classifier** to predict next‑step direction. It follows the assignment: Data → Features → Target → Split → Pipeline → Metrics → Interpretation.

**Tip:** Replace the synthetic generator with your dataset loader to adapt this baseline.

In [None]:
# Imports
import numpy as np
import pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from src.utils import project_path, save_df, classification_report_df, plot_confusion_matrix

# Plot config
plt.rcParams['figure.figsize'] = (8,4)
plt.rcParams['axes.grid'] = True


## 1) Data
We simulate a daily return series with mild autocorrelation (AR(1) + noise).
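As a rough check on how much signal such a process carries, the sketch below simulates the same AR(1) recursion and compares the sample lag-1 autocorrelation with `phi`. It also compares how often consecutive returns share a sign against the closed-form value for a stationary Gaussian AR(1), `1/2 + arcsin(phi)/pi`. The helper name `simulate_ar1`, the seed, and the larger sample size are assumptions made only to keep the estimates stable; they are not part of the notebook.

```python
import numpy as np

def simulate_ar1(n: int, phi: float, sigma: float, seed: int) -> np.ndarray:
    """Simulate r[t] = phi * r[t-1] + eps[t], eps ~ N(0, sigma^2), with r[0] = 0."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(0.0, sigma, size=n)
    r = np.zeros(n)
    for t in range(1, n):
        r[t] = phi * r[t - 1] + eps[t]
    return r

phi = 0.15
r = simulate_ar1(n=50_000, phi=phi, sigma=0.01, seed=0)

# Sample lag-1 autocorrelation; should sit close to phi
acf1 = np.corrcoef(r[:-1], r[1:])[0, 1]

# Fraction of consecutive returns with the same sign (skip r[0], which is exactly 0)
same_sign = np.mean(np.sign(r[1:-1]) == np.sign(r[2:]))

# Closed form for a bivariate normal pair with correlation phi
same_sign_theory = 0.5 + np.arcsin(phi) / np.pi

print(f"lag-1 autocorrelation: {acf1:.3f} (phi = {phi})")
print(f"same-sign frequency:   {same_sign:.3f} (theory {same_sign_theory:.3f})")
```

With `phi = 0.15` the closed form gives roughly 0.548, so even a predictor that uses the most recent return should not be expected to reach much more than about 55% directional accuracy on this process.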

In [None]:
# Generate synthetic daily returns (AR(1))
rng = np.random.default_rng(7)
n = 1500
phi = 0.15   # AR(1) coefficient
eps = rng.normal(0, 0.01, size=n)
ret = np.zeros(n)
for t in range(1, n):
    ret[t] = phi * ret[t-1] + eps[t]

dates = pd.date_range('2015-01-01', periods=n, freq='B')  # business days
df = pd.DataFrame({'ret': ret}, index=dates)
df.head()

## 2) Features (leakage‑safe)
Every feature for row `t` is built from **past** returns only (up to `t-1`), so no feature can see the value it is aligned with.
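To see why the `shift(1)` before `rolling(...)` matters, the sketch below perturbs a single observation and checks which version of a 5-day rolling mean reacts at that same row. The helper names `safe_roll_mean` and `leaky_roll_mean` and the toy series are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.normal(0, 0.01, size=30))

def safe_roll_mean(x: pd.Series) -> pd.Series:
    # Shift first: the window for row t covers t-5 .. t-1
    return x.shift(1).rolling(5).mean()

def leaky_roll_mean(x: pd.Series) -> pd.Series:
    # No shift: the window for row t covers t-4 .. t (includes the current value)
    return x.rolling(5).mean()

t = 20
bumped = s.copy()
bumped.iloc[t] += 1.0   # perturb only the value at row t

safe_changed = not np.isclose(safe_roll_mean(s).iloc[t], safe_roll_mean(bumped).iloc[t])
leaky_changed = not np.isclose(leaky_roll_mean(s).iloc[t], leaky_roll_mean(bumped).iloc[t])

print("shifted feature reacts to ret[t]:  ", safe_changed)
print("unshifted feature reacts to ret[t]:", leaky_changed)
```

A perturbation test like this is a cheap guard to run whenever a new feature is added.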

In [None]:
def make_features(d: pd.DataFrame) -> pd.DataFrame:
    """Build features for row t from returns up to and including t-1."""
    out = d.copy()
    past = out['ret'].shift(1)   # shift(1) so row t never sees ret[t] (avoids leakage)
    # Lags
    out['lag_1'] = past
    out['lag_5'] = out['ret'].shift(5)
    # Rolling stats over the 5 most recent past returns
    out['roll_mean_5'] = past.rolling(5).mean()
    out['roll_std_5']  = past.rolling(5).std()
    # Momentum: change in the daily return across 5 days (ret[t-1] - ret[t-6])
    out['momentum_5'] = past - out['ret'].shift(6)
    # Z-score of the latest past return within its 5-day window
    out['zscore_5'] = (past - out['roll_mean_5']) / out['roll_std_5']
    return out

df_feat = make_features(df)
df_feat.tail()

## 3) Target
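Binary direction for the **next** step: `y_up = (ret.shift(-1) > 0)`, so row `t` is labeled 1 when `ret[t+1]` is positive. Note that every feature stops at `ret[t-1]`: the return at `t` itself, which would be available when predicting `t+1`, is not among the inputs, so the model effectively forecasts two steps beyond its most recent observation.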
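One detail worth checking with this definition: the last row has no next-step return, and comparing `NaN > 0` yields `False` rather than a missing value. The toy series below (an illustrative assumption) shows the effect and one way to mask it out.

```python
import pandas as pd

ret = pd.Series([0.01, -0.02, 0.03], name="ret")
next_ret = ret.shift(-1)            # last entry is NaN: there is no next step

naive = (next_ret > 0).astype(int)  # NaN > 0 evaluates to False, so the last row becomes 0
valid = next_ret.notna()            # rows that actually have a next-step return
y_up = naive[valid]

print(naive.tolist())   # includes a spurious label for the final row
print(y_up.tolist())    # final row removed
```

Because the cast to `int` hides the missing value, a later `dropna()` will not remove that row on its own.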

In [None]:
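# Next-step return; the last row has no successor, so this is NaN there
next_ret = df_feat['ret'].shift(-1)
df_feat['y_up'] = (next_ret > 0).astype(int)
# Drop warm-up rows (NaN features) and the final row (no next-step return).
# Without the mask, NaN > 0 evaluates to False and the last row gets a spurious 0 label.
df_feat = df_feat[next_ret.notna()].dropna()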
features = ['lag_1','lag_5','roll_mean_5','roll_std_5','momentum_5','zscore_5']
X = df_feat[features].values
y = df_feat['y_up'].values
df_feat[['ret']+features+['y_up']].head()

## 4) Split (Time‑aware)
We keep the last 25% of rows, in chronological order, as a **holdout** and run **TimeSeriesSplit** CV on the training portion as a robustness check.
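For readers unfamiliar with `TimeSeriesSplit`, the sketch below prints the fold indices it produces on a tiny stand-in array (the 8-row array is an assumption for illustration). Each fold trains on an expanding prefix and validates on the block that follows it.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

idx = np.arange(8)   # stand-in for 8 chronologically ordered rows
folds = list(TimeSeriesSplit(n_splits=3).split(idx))

for k, (tr, va) in enumerate(folds, start=1):
    print(f"fold {k}: train={tr.tolist()} validate={va.tolist()}")
```

Unlike shuffled K-fold, no validation row ever precedes a training row, which is the property a time-aware split needs.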

In [None]:
# Holdout split (last 25%)
split_idx = int(len(df_feat) * 0.75)
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]
dates_train = df_feat.index[:split_idx]
dates_test  = df_feat.index[split_idx:]
len(X_train), len(X_test)

## 5) Pipeline (Scaler → LogisticRegression)
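Bundling the scaler and classifier in one pipeline means the scaler's mean and variance are estimated only from the rows passed to `fit` (the training set), so test-set statistics cannot leak into preprocessing.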
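A minimal way to confirm this behavior, using made-up data whose later rows are deliberately shifted (all names ending in `_demo` are hypothetical): after fitting on the training rows, the scaler's stored mean should match the training mean and not the full-sample mean.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
X_demo[150:] += 5.0                      # make the "test" rows look very different
y_demo = (rng.random(200) > 0.5).astype(int)

X_tr_demo, y_tr_demo = X_demo[:150], y_demo[:150]

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
demo_pipe.fit(X_tr_demo, y_tr_demo)      # the scaler sees training rows only

scaler_mean = demo_pipe.named_steps["scaler"].mean_
print("scaler mean:     ", np.round(scaler_mean, 3))
print("train-only mean: ", np.round(X_tr_demo.mean(axis=0), 3))
print("full-sample mean:", np.round(X_demo.mean(axis=0), 3))
```

Scaling the full array before splitting would instead bake the shifted test rows into the training features.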

In [None]:
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000))
])

## 6) Fit, Predict, Evaluate
We report accuracy, precision, recall, F1, a confusion matrix, and predicted class over time.
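The helper `classification_report_df` comes from the project's `src.utils` and its implementation is not shown here. As a reference for what the four metrics mean, the sketch below derives them from confusion-matrix counts on a small made-up example and cross-checks against scikit-learn's own functions.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up labels for illustration only
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 1, 0, 1, 0]

# For binary labels {0, 1}, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)          # of predicted "up", how many were up
recall = tp / (tp + fn)             # of actual "up", how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print("matches sklearn:",
      abs(accuracy - accuracy_score(y_true_demo, y_pred_demo)) < 1e-12,
      abs(precision - precision_score(y_true_demo, y_pred_demo)) < 1e-12,
      abs(recall - recall_score(y_true_demo, y_pred_demo)) < 1e-12,
      abs(f1 - f1_score(y_true_demo, y_pred_demo)) < 1e-12)
```

For a near-coin-flip target like next-step direction, it helps to read accuracy against the share of the majority class rather than against 100%.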

In [None]:
# Fit on train, evaluate on holdout
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

report = classification_report_df(y_test, y_pred)
print(report)

# Save metrics
save_df(report, 'data/processed/stage10b_holdout_classification_metrics.csv')

# Confusion matrix plot
fig, ax = plt.subplots(figsize=(4,4))
plot_confusion_matrix(y_test, y_pred, ax=ax, title="Holdout Confusion Matrix")
plt.tight_layout()
plt.savefig(project_path('data/processed/stage10b_confusion_matrix.png'))
plt.show()

# Prediction-over-time plot
fig, ax = plt.subplots(figsize=(9,3.5))
ax.plot(dates_test, y_test, label='True (Up=1)', linewidth=1)
ax.plot(dates_test, y_pred, label='Pred (Up=1)', linewidth=1)
ax.set_title("Next‑Step Direction: True vs Predicted (Holdout)")
ax.legend(loc='upper right')
plt.tight_layout()
plt.savefig(project_path('data/processed/stage10b_true_vs_pred.png'))
plt.show()

### Optional: TimeSeriesSplit cross‑validation (sanity check)

In [None]:
from sklearn.base import clone

tscv = TimeSeriesSplit(n_splits=5)
scores = []
for tr_idx, va_idx in tscv.split(X_train):
    X_tr, X_va = X_train[tr_idx], X_train[va_idx]
    y_tr, y_va = y_train[tr_idx], y_train[va_idx]
    # Fit a fresh copy per fold so the holdout-fitted `pipe` is not overwritten
    fold_pipe = clone(pipe).fit(X_tr, y_tr)
    scores.append(f1_score(y_va, fold_pipe.predict(X_va)))
print("TimeSeriesSplit F1 scores:", np.round(scores, 3), " | mean =", np.round(np.mean(scores), 3))

## 7) Interpretation
- **What worked:** Lags and short rolling stats are designed to pick up the weak autocorrelation (`phi = 0.15`), so any edge the logistic model learns should be slight. Compare the holdout metrics with a majority-class baseline before reading it as real.
- **What failed:** The signal in the synthetic data is small, so performance is modest and may vary noticeably from fold to fold.
- **Assumptions at risk:** Stationarity and a stable decision boundary. Real markets shift and features may drift. Mitigate with **walk-forward validation**, periodic recalibration, and richer features (longer windows, macro/volatility states).
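The walk-forward validation mentioned above can be sketched as an expanding-window loop: refit on everything before a block, predict the block, then move forward. The function `walk_forward_predict`, the block sizes, and the synthetic `X_wf`/`y_wf` data are hypothetical stand-ins; in the notebook you would pass its own `X` and `y`.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def walk_forward_predict(model, X, y, initial: int, step: int) -> np.ndarray:
    """Refit on all rows before each block, then predict that block (expanding window)."""
    preds = []
    for start in range(initial, len(X), step):
        end = min(start + step, len(X))
        fitted = clone(model).fit(X[:start], y[:start])   # training data ends before the block
        preds.append(fitted.predict(X[start:end]))
    return np.concatenate(preds)

# Synthetic stand-in data (hypothetical; replace with the notebook's X and y)
rng = np.random.default_rng(0)
X_wf = rng.normal(size=(400, 4))
y_wf = (X_wf[:, 0] + rng.normal(scale=2.0, size=400) > 0).astype(int)

wf_model = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

initial, step = 200, 50
wf_pred = walk_forward_predict(wf_model, X_wf, y_wf, initial=initial, step=step)
wf_acc = np.mean(wf_pred == y_wf[initial:])
print(f"walk-forward accuracy on {len(wf_pred)} out-of-sample rows: {wf_acc:.3f}")
```

A smaller `step` tracks drift more closely at the cost of more refits; a fixed-length (rolling) training window is a reasonable variant when old data is suspected to be stale.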