# Geopolitical Alpha: HAR-Lasso Bipartite Network — Full Pipeline

**Math 279 — Ivan Sit**

This notebook runs the complete pipeline end-to-end:

1. **Data Loading** — CRSP energy stocks + commodity futures + SPY
2. **Feature Engineering** — HAR (daily/weekly/monthly) + derived cross-commodity features
3. **Market Residualization** — rolling-beta OLS to strip SPY co-movement
4. **LassoCV Rolling OOS** — dynamic penalty selection, no look-ahead, bipartite edge extraction
5. **Bipartite Network** — which commodities predict which stocks
6. **Backtest** — cross-sectional long/short, 5 bps TC, rolling Sharpe
7. **Sensitivity** — window length (60 / 252 d) and target (Y_idio vs Y_raw)

**Key fix vs earlier version**: `LassoCV` with `TimeSeriesSplit` dynamically selects the penalty
inside each rolling window — no fixed `alpha=0.001`.

In [None]:
import sys, warnings
sys.path.insert(0, '..')
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
import networkx as nx
from pathlib import Path

from sklearn.linear_model import LassoCV, Lasso
from sklearn.model_selection import TimeSeriesSplit

from src.pipeline import load_all
from src.strategy.backtest import compute_metrics, rolling_sharpe

plt.rcParams.update({
    'figure.dpi': 110,
    'axes.spines.top': False,
    'axes.spines.right': False,
    'font.size': 11,
})
SEED = 42
OUTPUTS = Path('../outputs')
OUTPUTS.mkdir(exist_ok=True)
print('Libraries loaded.')

---
## Section 1 — Load Data

In [None]:
data = load_all(
    source='cache',
    start='2000-01-01',
    end='2024-12-31',
    sector='energy',
    beta_window=252,
    verbose=True,
)

X            = data['X']           # feature matrix  (T × P)
Y_idio       = data['Y_idio']      # idiosyncratic returns  (T × N)
Y_raw        = data['Y_raw']       # raw stock returns  (T × N)
spy_ret      = data['spy_ret']     # SPY daily returns  (T,)
comm_prices  = data['comm_prices'] # commodity prices aligned to equity calendar

print(f'\n{" Data dimensions ":=^60}')
print(f'  Feature matrix X    : {X.shape[0]:>6} days  ×  {X.shape[1]:>3} features')
print(f'  Y_idio              : {Y_idio.shape[0]:>6} days  ×  {Y_idio.shape[1]:>3} stocks')
print(f'  Y_raw               : {Y_raw.shape[0]:>6} days  ×  {Y_raw.shape[1]:>3} stocks')
print(f'  Date range          : {X.index[0].date()} → {X.index[-1].date()}')
print(f'  Features            : {list(X.columns)}')

In [None]:
# ── Commodity prices overview ─────────────────────────────────────────────────
comm_ret = comm_prices.pct_change(fill_method=None).clip(-1, 1)

fig, axes = plt.subplots(2, 2, figsize=(16, 8))
for ax, col in zip(axes.flat, comm_prices.columns):
    (comm_prices[col] / comm_prices[col].iloc[0]).plot(ax=ax, linewidth=1.2, color='steelblue')
    ax.set_title(f'{col} — normalised price (base=1)', fontweight='bold')
    ax.set_ylabel('Relative price')

    # Mark key geopolitical events
    events = {
        'Libya 2011': '2011-02-17',
        'OPEC war 2014': '2014-11-27',
        'COVID 2020': '2020-03-09',
        'Ukraine 2022': '2022-02-24',
    }
    for label, date in events.items():
        dt = pd.Timestamp(date)
        if ax.get_xlim()[0] < mdates.date2num(dt) < ax.get_xlim()[1]:
            ax.axvline(dt, color='red', linestyle=':', linewidth=1, alpha=0.7)

plt.suptitle('Commodity prices — normalised (red lines = key geopolitical events)',
             fontsize=13, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()

---
## Section 2 — Feature Engineering

The HAR feature matrix $X \in \mathbb{R}^{T \times P}$ is built from commodity prices:

$$
X_t = \bigl[
  r^{(d)}_{t-1},\; \bar{r}^{(w)}_{t-5:t-1},\; \bar{r}^{(m)}_{t-22:t-1},\;
  \text{CrackSpread},\; \text{OilGasRatio},\; \text{RV},\; \text{BrentWTI},\; \ldots
\bigr]
$$

All features are **lagged by at least 1 day** — no look-ahead.
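
As a minimal sketch of how the three HAR lags could be built for one commodity: it assumes log returns and hypothetical column names (`WTI_d`, `WTI_w`, `WTI_m`), and the helper `har_features` is illustrative only. The actual construction lives in `src.pipeline` and may differ.

```python
import numpy as np
import pandas as pd

def har_features(prices: pd.Series, name: str) -> pd.DataFrame:
    """Daily / weekly / monthly HAR return features, each lagged by one day.

    Hypothetical helper for illustration; not the pipeline's own function.
    """
    r = np.log(prices).diff()  # daily log return (assumption: log, not simple)
    feats = pd.DataFrame({
        f"{name}_d": r,                     # return on day t
        f"{name}_w": r.rolling(5).mean(),   # mean over days t-4 .. t
        f"{name}_m": r.rolling(22).mean(),  # mean over days t-21 .. t
    })
    # Shift by one day so that row t only uses information through t-1.
    return feats.shift(1)

# Synthetic price path, for illustration only.
rng = np.random.default_rng(0)
idx = pd.bdate_range("2020-01-01", periods=60)
wti = pd.Series(50 * np.exp(np.cumsum(rng.normal(0, 0.02, len(idx)))), index=idx)

X_wti = har_features(wti, "WTI")
print(X_wti.dropna().head())
```

After the shift, row $t$ holds $r_{t-1}$, the mean of $r_{t-5},\dots,r_{t-1}$, and the mean of $r_{t-22},\dots,r_{t-1}$, which matches the index ranges in the formula above.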

In [None]:
# ── Feature correlation heatmap ───────────────────────────────────────────────
fig, ax = plt.subplots(figsize=(14, 11))
corr = X.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, cmap='RdBu_r', center=0, vmin=-1, vmax=1,
            annot=False, linewidths=0.3, ax=ax)
ax.set_title(f'Feature Correlation Matrix  ({X.shape[1]} features × {X.shape[1]} features)',
             fontweight='bold', fontsize=12)
plt.tight_layout()
plt.show()

print('Feature statistics:')
print(X.describe().T[['mean','std','min','max']].round(5).to_string())

In [None]:
# ── WTI HAR features over time ────────────────────────────────────────────────
wti_cols = [c for c in X.columns if c.startswith('WTI_d') or c.startswith('WTI_w') or c.startswith('WTI_m')]
fig, axes = plt.subplots(3, 1, figsize=(16, 9), sharex=True)
labels = ['Daily lag (t−1)', '5-day mean (t−5:t−1)', '22-day mean (t−22:t−1)']
colors = ['steelblue', 'darkorange', 'seagreen']
for ax, col, label, color in zip(axes, wti_cols[:3], labels, colors):
    ax.plot(X[col], linewidth=0.8, color=color, alpha=0.85)
    ax.set_ylabel(label, fontsize=10)
    ax.axhline(0, color='black', linewidth=0.5)
    for date in ['2011-02-17','2014-11-27','2020-03-09','2022-02-24']:
        ax.axvline(pd.Timestamp(date), color='red', linestyle=':', linewidth=1, alpha=0.6)

axes[0].set_title('WTI HAR features over time  (red = geopolitical events)', fontweight='bold')
plt.tight_layout()
plt.show()

---
## Section 3 — Market Residualization

We strip the systematic market factor using rolling-window OLS beta:

$$
Y^{\text{idio}}_t = r^{\text{stock}}_t - \hat{\beta}_t \cdot r^{\text{SPY}}_t,
\qquad
\hat{\beta}_t = \frac{\text{Cov}(r^{\text{stock}}, r^{\text{SPY}})_{252d}}{\text{Var}(r^{\text{SPY}})_{252d}}
$$

This isolates the **commodity-specific** alpha from broad market moves.
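
A minimal sketch of the rolling-beta residualization on synthetic data. The function name `residualize` and the one-day lag on $\hat{\beta}$ are assumptions made for illustration; `load_all` may use a different convention internally.

```python
import numpy as np
import pandas as pd

def residualize(stock: pd.Series, market: pd.Series, window: int = 252) -> pd.Series:
    """Subtract rolling-beta market exposure from a stock's return series."""
    cov = stock.rolling(window).cov(market)
    var = market.rolling(window).var()
    # Assumption: lag beta by one day so the day-t residual uses a beta
    # estimated from data through t-1 only.
    beta = (cov / var).shift(1)
    return stock - beta * market

# Synthetic returns: the stock has a true market beta of 1.2 plus noise.
rng = np.random.default_rng(0)
idx = pd.bdate_range("2015-01-01", periods=1500)
spy = pd.Series(rng.normal(0, 0.01, len(idx)), index=idx)
stock = 1.2 * spy + pd.Series(rng.normal(0, 0.01, len(idx)), index=idx)

idio = residualize(stock, spy)
print(f"Corr(raw, SPY)  = {stock.corr(spy):.3f}")
print(f"Corr(idio, SPY) = {idio.corr(spy):.3f}")
```

The first 252 rows of `idio` are NaN because the beta estimate needs a full window before it is defined.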

In [None]:
# ── Show residualization effect for a few sample tickers ──────────────────────
sample_stocks = [s for s in ['XOM', 'DVN', 'HAL', 'RRC'] if s in Y_raw.columns]

fig, axes = plt.subplots(len(sample_stocks), 2, figsize=(16, 3.5 * len(sample_stocks)))

for row, ticker in enumerate(sample_stocks):
    ax_raw, ax_idio = axes[row]

    r = Y_raw[ticker].dropna()
    i = Y_idio[ticker].dropna()
    common_idx = r.index.intersection(i.index)

    ax_raw.plot(r.loc[common_idx].rolling(21).mean(), color='steelblue', linewidth=1.2)
    ax_raw.set_title(f'{ticker} — Raw return (21d MA)', fontweight='bold')
    ax_raw.axhline(0, color='black', linewidth=0.5)

    ax_idio.plot(i.loc[common_idx].rolling(21).mean(), color='darkorange', linewidth=1.2)
    ax_idio.set_title(f'{ticker} — Idiosyncratic return (21d MA)', fontweight='bold')
    ax_idio.axhline(0, color='black', linewidth=0.5)

    # Annotate correlation with SPY
    spy_common = spy_ret.reindex(common_idx).dropna()
    r_spy = r.reindex(spy_common.index)
    i_spy = i.reindex(spy_common.index)
    corr_raw  = r_spy.corr(spy_common)
    corr_idio = i_spy.corr(spy_common)
    ax_raw.set_xlabel(f'Corr(raw, SPY) = {corr_raw:.3f}')
    ax_idio.set_xlabel(f'Corr(idio, SPY) = {corr_idio:.3f}  ← should be ≈0')

plt.suptitle('Market Residualization: Raw vs Idiosyncratic Returns', fontsize=13, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()

---
## Section 4 — LassoCV Rolling Out-of-Sample

**Core algorithm — strictly causal:**

```
For each rebalance day t (every Friday):
  1. Take training slice: X[t-W:t], Y[t-W:t]  (past W days only)
  2. Standardize X on training data: z = (x - μ_train) / σ_train
  3. Fit LassoCV with TimeSeriesSplit(n_splits=3) to find optimal α
  4. Predict: ŷ_t = model.predict(z_t)   (return on day t; z_t uses only data through t-1)
  5. Store: prediction, selected α, non-zero coefficients (→ bipartite edges)
```

**Why LassoCV over fixed α**: volatility regimes shift. A fixed penalty tuned on calm, low-vol data
will under-regularize during high-vol crises, while one tuned on crisis data will over-regularize during calm periods.
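
The per-window fit can be sketched on synthetic data as follows. The data-generating process (two active features out of 27) is an assumption made purely for illustration; the α grid and `TimeSeriesSplit(n_splits=3)` follow the settings described above.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(42)
n_train, n_feat = 252, 27

# Synthetic training window: only features 0 and 3 drive the target.
X_train = rng.normal(size=(n_train, n_feat))
x_next = rng.normal(size=(1, n_feat))
y_train = (0.8 * X_train[:, 0] - 0.6 * X_train[:, 3]
           + rng.normal(scale=0.3, size=n_train))

# Standardize with training-window statistics only (no look-ahead).
mu, sd = X_train.mean(axis=0), X_train.std(axis=0) + 1e-8
Z_train = (X_train - mu) / sd
z_next = (x_next - mu) / sd

# Time-ordered CV folds pick the penalty from a fixed candidate grid.
alphas = np.logspace(-4, -0.5, 15)
model = LassoCV(alphas=alphas, cv=TimeSeriesSplit(n_splits=3), max_iter=3000)
model.fit(Z_train, y_train)

selected = np.flatnonzero(model.coef_)   # indices of non-zero coefficients
y_hat = float(model.predict(z_next)[0])  # one-step-ahead prediction
print(f"alpha = {model.alpha_:.5f}, active features = {selected.tolist()}")
```

Each non-zero entry of `model.coef_` would become one bipartite edge for that stock on that rebalance date.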

In [None]:
# ── Liquid universe filter ────────────────────────────────────────────────────
# Keep stocks with ≥70% non-NaN observations over the full period.
# Sparse stocks are mostly micro-caps with very short listing windows.
coverage = Y_raw.notna().mean()
COVERAGE_THR = 0.70
liquid = coverage[coverage >= COVERAGE_THR].index.tolist()

print(f'Full universe : {Y_raw.shape[1]} stocks')
print(f'Liquid (≥{COVERAGE_THR:.0%}): {len(liquid)} stocks')
print(f'Dropped       : {Y_raw.shape[1] - len(liquid)} stocks')

Y_raw_liq  = Y_raw[liquid]
Y_idio_liq = Y_idio[liquid]

In [None]:
ALPHAS = np.logspace(-4, -0.5, 15)   # 15 candidates from 0.0001 to ~0.32
TSCV   = TimeSeriesSplit(n_splits=3)

def rolling_lasso_cv(
    X: pd.DataFrame,
    Y: pd.DataFrame,
    train_window: int = 252,
    weekly: bool = True,
    min_obs: int = 60,
) -> dict:
    """
    Rolling OOS LassoCV — strictly causal, bipartite edge extraction.

    Returns dict with:
        'predictions' : DataFrame (T × N)  — OOS predicted returns
        'alpha_hist'  : DataFrame (T × N)  — CV-selected alpha per stock per date
        'edges'       : list of (date, feature, stock, coef) tuples
        'sparsity'    : Series  — fraction of non-zero coefs per rebalance date (averaged across stocks)
    """
    n, feat_names = len(X), list(X.columns)
    predictions = pd.DataFrame(np.nan, index=X.index, columns=Y.columns)
    alpha_hist  = pd.DataFrame(np.nan, index=X.index, columns=Y.columns)
    edges_list  = []
    sparsity    = pd.Series(np.nan, index=X.index)

    # Which days to rebalance
    if weekly:
        trade_idx = [t for t in range(train_window, n) if X.index[t].weekday() == 4]
    else:
        trade_idx = list(range(train_window, n))

    print(f'LassoCV rolling OOS — window={train_window}d, '
          f'{"weekly" if weekly else "daily"}, {len(trade_idx)} rebalances')
    print(f'Universe: {Y.shape[1]} stocks | α candidates: {len(ALPHAS)}')

    X_arr = X.values

    for i, t in enumerate(trade_idx):
        if i % 100 == 0:
            print(f'  [{i+1:4d}/{len(trade_idx)}]  {X.index[t].date()} ...', end='\r')

        X_tr = X_arr[t - train_window : t]
        X_te = X_arr[t : t + 1]
        mu, sd = X_tr.mean(0), X_tr.std(0) + 1e-8
        X_tr_z = (X_tr - mu) / sd
        X_te_z = (X_te - mu) / sd

        date       = X.index[t]
        nnz_counts = []

        for stock in Y.columns:
            y_tr = Y[stock].iloc[t - train_window : t].values
            mask = ~np.isnan(y_tr)
            if mask.sum() < min_obs:
                continue

            model = LassoCV(
                alphas=ALPHAS, cv=TSCV,
                max_iter=3000, fit_intercept=True, n_jobs=1,
            )
            model.fit(X_tr_z[mask], y_tr[mask])

            predictions.loc[date, stock]  = float(model.predict(X_te_z)[0])
            alpha_hist.loc[date, stock]   = model.alpha_
            nnz = np.where(model.coef_ != 0)[0]
            nnz_counts.append(len(nnz) / len(feat_names))

            for fi in nnz:
                edges_list.append((date, feat_names[fi], stock, float(model.coef_[fi])))

        if nnz_counts:
            sparsity.loc[date] = float(np.mean(nnz_counts))

    print(f'\nDone! {predictions.notna().any(axis=1).sum()} prediction days, '
          f'{len(edges_list)} total edges.')
    return dict(
        predictions=predictions,
        alpha_hist=alpha_hist,
        edges=edges_list,
        sparsity=sparsity,
    )

In [None]:
# ── Run LassoCV on idiosyncratic returns (252-day window) ─────────────────────
# Runtime: ~5-10 min depending on CPU (weekly rebalancing × liquid universe)
result_252 = rolling_lasso_cv(X, Y_idio_liq, train_window=252, weekly=True)

preds_252   = result_252['predictions']
alpha_hist  = result_252['alpha_hist']
edges_list  = result_252['edges']
sparsity    = result_252['sparsity']

# Save for use in notebook 02
preds_252.to_pickle(OUTPUTS / 'preds_252.pkl')
pd.DataFrame(edges_list, columns=['date','feature','stock','coef']).to_pickle(OUTPUTS / 'edges_252.pkl')
print('Saved predictions and edges to outputs/')

---
## Section 5 — Feature Sparsity & Alpha Diagnostics

Before running the backtest, verify the model is actually finding signal — not zeroing everything out.

In [None]:
# ── Sparsity and alpha evolution over time ────────────────────────────────────
sp = sparsity.dropna()
alpha_med = alpha_hist.median(axis=1).dropna()

fig, axes = plt.subplots(2, 1, figsize=(16, 8), sharex=True)

ax = axes[0]
ax.plot(sp.rolling(13).mean(), color='steelblue', linewidth=1.5,
        label='Feature density (13-wk MA)')
ax.fill_between(sp.index, sp.rolling(13).mean(), alpha=0.15, color='steelblue')
ax.set_ylabel('Fraction of non-zero coefficients')
ax.set_title('Feature density over time  (higher = more features active)', fontweight='bold')
ax.legend()
for date in ['2011-02-17','2014-11-27','2020-03-09','2022-02-24']:
    ax.axvline(pd.Timestamp(date), color='red', linestyle=':', linewidth=1, alpha=0.7)

ax = axes[1]
ax.semilogy(alpha_med.rolling(13).mean(), color='darkorange', linewidth=1.5,
            label='Median CV-selected alpha (13-wk MA)')
ax.set_ylabel('Lasso alpha (log scale)')
ax.set_title('CV-selected penalty α over time  (higher α = more regularization)', fontweight='bold')
ax.legend()
for date in ['2011-02-17','2014-11-27','2020-03-09','2022-02-24']:
    ax.axvline(pd.Timestamp(date), color='red', linestyle=':', linewidth=1, alpha=0.7)

plt.tight_layout()
plt.show()

print(f'Average feature density : {sp.mean():.1%}  (fraction of features with non-zero coef)')
print(f'Average selected alpha  : {alpha_med.mean():.5f}')
print(f'Min / max selected alpha: {alpha_med.min():.5f} / {alpha_med.max():.5f}')

In [None]:
# ── Which features are selected most often? ───────────────────────────────────
if edges_list:
    edges_df = pd.DataFrame(edges_list, columns=['date','feature','stock','coef'])
    feat_usage = edges_df['feature'].value_counts(normalize=True)
    feat_sign  = edges_df.groupby('feature')['coef'].mean()  # avg direction

    fig, axes = plt.subplots(1, 2, figsize=(16, 6))

    feat_usage.head(20).sort_values().plot.barh(
        ax=axes[0], color='steelblue')
    axes[0].set_title('Top 20 most-selected features (% of all edges)', fontweight='bold')
    axes[0].set_xlabel('Selection frequency')

    feat_sign.reindex(feat_usage.head(20).index).sort_values().plot.barh(
        ax=axes[1], color=['crimson' if v < 0 else 'seagreen' for v in feat_sign.reindex(feat_usage.head(20).index).sort_values()])
    axes[1].axvline(0, color='black', linewidth=0.8)
    axes[1].set_title('Mean coefficient sign  (green=positive, red=negative)', fontweight='bold')
    axes[1].set_xlabel('Mean Lasso coefficient')

    plt.tight_layout()
    plt.show()
else:
    print('No edges found: every coefficient was shrunk to zero. Try extending ALPHAS toward smaller values.')

---
## Section 6 — Bipartite Network Extraction

The **core contribution** of this project: non-zero Lasso coefficients define a bipartite graph
$G = (U \cup V, E)$ where:
- $U$ = commodity features (27 nodes)
- $V$ = energy stocks ($N$ nodes)
- $E$ = non-zero coefficients at time $t$  (edge weight = coefficient value)

We track how this network topology shifts during geopolitical events.
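
To make the graph definition concrete, here is a small sketch that turns a coefficient matrix for a single date into a bipartite `DiGraph`. The coefficient values and the feature and ticker names are hypothetical, chosen only to show the mechanics.

```python
import networkx as nx
import pandas as pd

# Hypothetical Lasso coefficients at one rebalance date
# (rows = commodity features, columns = stocks).
coefs = pd.DataFrame(
    [[0.012, 0.000, 0.0],
     [0.000, -0.004, 0.0],
     [0.000, 0.007, 0.0]],
    index=["WTI_d", "NG_w", "CrackSpread"],
    columns=["XOM", "DVN", "HAL"],
)

G = nx.DiGraph()
G.add_nodes_from(coefs.index, bipartite="commodity")
G.add_nodes_from(coefs.columns, bipartite="stock")

# One directed edge feature -> stock per non-zero coefficient.
for feature, row in coefs.iterrows():
    for stock, w in row.items():
        if w != 0:
            G.add_edge(feature, stock, weight=float(w))

out_deg = dict(G.out_degree(list(coefs.index)))   # stocks reached per feature
in_deg = dict(G.in_degree(list(coefs.columns)))   # features feeding each stock
density = G.number_of_edges() / (coefs.shape[0] * coefs.shape[1])
print(out_deg, in_deg, round(density, 3))
```

Here `DVN` has in-degree 2 (two features predict it) while `HAL` is isolated, and the density is the share of possible feature-stock pairs that carry an edge.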

In [None]:
def build_bipartite_graph(edges_df_sub: pd.DataFrame) -> nx.DiGraph:
    """Build a weighted bipartite DiGraph from an edge subset."""
    G = nx.DiGraph()
    G.add_nodes_from(edges_df_sub['feature'].unique(), bipartite='commodity')
    G.add_nodes_from(edges_df_sub['stock'].unique(),   bipartite='stock')
    for _, row in edges_df_sub.iterrows():
        G.add_edge(row['feature'], row['stock'], weight=row['coef'])
    return G


def plot_bipartite(G: nx.DiGraph, title: str, ax=None):
    """Draw bipartite graph: commodity features on left, stocks on right."""
    if ax is None:
        fig, ax = plt.subplots(figsize=(12, 10))

    comm_nodes  = [n for n, d in G.nodes(data=True) if d.get('bipartite') == 'commodity']
    stock_nodes = [n for n, d in G.nodes(data=True) if d.get('bipartite') == 'stock']

    # Layout: two columns
    pos = {}
    for i, n in enumerate(sorted(comm_nodes)):
        pos[n] = (0, -(i / max(len(comm_nodes)-1, 1)))
    for i, n in enumerate(sorted(stock_nodes)):
        pos[n] = (1, -(i / max(len(stock_nodes)-1, 1)))

    weights  = [abs(G[u][v]['weight']) * 800 for u, v in G.edges()]
    colors_e = ['green' if G[u][v]['weight'] > 0 else 'crimson' for u, v in G.edges()]

    nx.draw_networkx_nodes(G, pos, nodelist=comm_nodes,  node_color='steelblue',
                           node_size=300, ax=ax)
    nx.draw_networkx_nodes(G, pos, nodelist=stock_nodes, node_color='darkorange',
                           node_size=200, ax=ax)
    nx.draw_networkx_edges(G, pos, edge_color=colors_e, width=[w/200 for w in weights],
                           alpha=0.5, ax=ax, arrows=True,
                           arrowstyle='-|>', arrowsize=10)
    nx.draw_networkx_labels(G, pos, {n: n for n in comm_nodes},
                            font_size=8, ax=ax, horizontalalignment='right')
    nx.draw_networkx_labels(G, pos, {n: n for n in stock_nodes},
                            font_size=7, ax=ax, horizontalalignment='left')
    ax.set_title(title, fontweight='bold', fontsize=11)
    ax.axis('off')


if edges_list:
    edges_df = pd.DataFrame(edges_list, columns=['date','feature','stock','coef'])

    # Network density over time
    density_ts = edges_df.groupby('date').size()
    density_ts.index = pd.DatetimeIndex(density_ts.index)

    fig, ax = plt.subplots(figsize=(16, 4))
    density_ts.rolling(13).mean().plot(ax=ax, color='purple', linewidth=1.5)
    ax.fill_between(density_ts.index,
                    density_ts.rolling(13).mean(), alpha=0.15, color='purple')
    ax.set_ylabel('Number of active edges (13-wk MA)')
    ax.set_title('Bipartite Network Density over Time  (# non-zero Lasso edges)',
                 fontweight='bold')
    for label, date in {'Libya 2011':'2011-02-17', 'OPEC 2014':'2014-11-27',
                         'COVID 2020':'2020-03-09', 'Ukraine 2022':'2022-02-24'}.items():
        dt = pd.Timestamp(date)
        ax.axvline(dt, color='red', linestyle=':', linewidth=1.2, alpha=0.8)
        ax.text(dt, ax.get_ylim()[1]*0.95, label, rotation=90,
                fontsize=8, color='red', va='top')
    plt.tight_layout()
    plt.show()

In [None]:
if edges_list:
    # ── Network snapshots: calm vs. crisis ───────────────────────────────────
    windows = [
        ('Calm 2013',         '2013-01-01', '2013-12-31'),
        ('OPEC crisis 2015',  '2015-01-01', '2015-12-31'),
        ('COVID 2020',        '2020-01-01', '2020-12-31'),
        ('Ukraine 2022',      '2022-02-01', '2022-12-31'),
    ]

    fig, axes = plt.subplots(1, 4, figsize=(22, 11))

    for ax, (label, s, e) in zip(axes, windows):
        sub = edges_df[
            (edges_df['date'] >= pd.Timestamp(s)) &
            (edges_df['date'] <= pd.Timestamp(e))
        ].copy()

        # Aggregate: mean coefficient per (feature, stock) over the window
        agg = sub.groupby(['feature','stock'])['coef'].mean().reset_index()
        # Keep only the most active stock connections
        # (SeriesGroupBy has no .abs(), so take absolute values before grouping)
        top_stocks = agg['coef'].abs().groupby(agg['stock']).sum().nlargest(15).index
        agg = agg[agg['stock'].isin(top_stocks)]

        G = build_bipartite_graph(agg)
        n_edges = G.number_of_edges()
        plot_bipartite(G, f'{label}\n({n_edges} edges, top-15 stocks)', ax=ax)

    plt.suptitle('Bipartite Network Snapshots — Calm vs. Geopolitical Crisis',
                 fontsize=14, fontweight='bold', y=1.02)
    plt.tight_layout()
    plt.show()

In [None]:
if edges_list:
    # ── Adjacency heatmap for the most recent year (365 calendar days) ───────
    recent_edges = edges_df[edges_df['date'] >= edges_df['date'].max() - pd.Timedelta(days=365)]
    pivot = recent_edges.pivot_table(
        index='feature', columns='stock', values='coef',
        aggfunc='mean', fill_value=0
    )

    # Show top-30 stocks by total absolute weight
    top_stocks = pivot.abs().sum(axis=0).nlargest(30).index
    pivot_sub  = pivot[top_stocks]

    fig, ax = plt.subplots(figsize=(18, 8))
    sns.heatmap(pivot_sub, cmap='RdBu_r', center=0,
                linewidths=0.3, linecolor='lightgray',
                cbar_kws={'label': 'Mean Lasso coefficient'},
                ax=ax)
    ax.set_title('Bipartite Adjacency Matrix — most recent year\n'
                 '(rows = commodity features, cols = energy stocks)',
                 fontweight='bold', fontsize=12)
    ax.set_xlabel('Energy stocks (top 30 by connection strength)')
    ax.set_ylabel('Commodity features')
    plt.tight_layout()
    plt.show()

---
## Section 7 — Backtest: Long-Short Strategy

**Strategy**:
- Each Friday: rank stocks by predicted return
- Long top 25% (highest predicted), short bottom 25% (lowest)
- Hold until next Friday (weekly rebalancing)
- 5 bps one-way transaction cost on weight changes
- PnL on **raw returns** (what you actually earn holding the stock)
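
A worked example of a single rebalance may help fix ideas. The helper `ls_weights`, the tickers, and the prediction values are hypothetical; the 25% cutoff and 5 bps cost follow the rules listed above.

```python
import pandas as pd

def ls_weights(pred: pd.Series, top_pct: float = 0.25) -> pd.Series:
    """Equal-weight long the top quantile, short the bottom quantile."""
    pred = pred.dropna()
    k = max(1, int(len(pred) * top_pct))
    ranked = pred.sort_values()
    w = pd.Series(0.0, index=pred.index)
    w.loc[ranked.index[-k:]] = 1.0 / k    # highest predicted returns
    w.loc[ranked.index[:k]] = -1.0 / k    # lowest predicted returns
    return w

# Hypothetical predictions on two consecutive Fridays (8 stocks).
tickers = list("ABCDEFGH")
pred_1 = pd.Series([0.8, 0.6, 0.1, 0.0, -0.1, -0.2, -0.5, -0.9], index=tickers)
pred_2 = pd.Series([0.8, 0.1, 0.6, 0.0, -0.1, -0.2, -0.5, -0.9], index=tickers)

w1, w2 = ls_weights(pred_1), ls_weights(pred_2)
turnover = (w2 - w1).abs().sum()   # total absolute weight traded
tc_cost = turnover * 0.0005        # 5 bps one-way
print(f"turnover = {turnover:.2f}, cost = {tc_cost * 1e4:.1f} bps")
```

With 8 stocks and a 25% cutoff, each book holds 2 names at ±0.5. Between the two Fridays only `B` leaves the long book and `C` enters it, so turnover is 1.0 and the rebalance costs 5 bps of portfolio value.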

In [None]:
def run_ls_backtest(
    predictions: pd.DataFrame,
    raw_returns: pd.DataFrame,
    top_pct: float = 0.25,
    tc: float = 0.0005,
) -> pd.Series:
    """
    Long-short cross-sectional backtest.

    - Equal-weight within long and short book
    - Forward-fill weights between prediction dates (hold position)
    - TC charged only on rebalance days
    """
    pred_days = predictions.index[predictions.notna().any(axis=1)]

    weights_sparse = pd.DataFrame(0.0, index=predictions.index, columns=predictions.columns)

    for date in pred_days:
        row = predictions.loc[date].dropna()
        if len(row) < 4:
            continue
        n        = len(row)
        n_long   = max(1, int(n * top_pct))
        n_short  = max(1, int(n * top_pct))
        ranked   = row.sort_values()
        shorts   = ranked.head(n_short).index
        longs    = ranked.tail(n_long).index
        weights_sparse.loc[date, longs]  =  1.0 / n_long
        weights_sparse.loc[date, shorts] = -1.0 / n_short

    # Forward-fill (hold position between rebalances)
    active = weights_sparse.abs().sum(axis=1) > 0
    w_ff   = weights_sparse.where(active, other=np.nan).ffill().fillna(0.0)

    ret_aligned = raw_returns.reindex(
        index=w_ff.index, columns=w_ff.columns).fillna(0.0)

    port_ret = (w_ff * ret_aligned).sum(axis=1)
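    # Charge TC on changes in the held (forward-filled) weights, so costs accrue
    # only when a rebalance actually moves the book.
    tc_cost  = w_ff.diff().abs().sum(axis=1) * tc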
    port_ret = port_ret - tc_cost

    return port_ret.loc[pred_days[0]:]


# ── Run main backtest ─────────────────────────────────────────────────────────
ret_main = run_ls_backtest(preds_252, Y_raw_liq, top_pct=0.25, tc=0.0005)

oos_start      = ret_main.dropna().index[0]
ret_bench_ew   = Y_raw_liq.mean(axis=1)
ret_bench_spy  = spy_ret

print(f'OOS period: {oos_start.date()} → {ret_main.index[-1].date()}')
print(f'Strategy trading days: {ret_main.notna().sum()}')

In [None]:
# ── Performance table ─────────────────────────────────────────────────────────
strategies = {
    'HAR-Lasso L/S (LassoCV)': ret_main,
    'Benchmark: EW Energy':    ret_bench_ew,
    'Benchmark: SPY':          ret_bench_spy,
}

perf = pd.DataFrame({
    name: compute_metrics(ret, start=oos_start)
    for name, ret in strategies.items()
}).T

print('\n' + '=' * 70)
print('PERFORMANCE  (OOS period)')
print(f'Start: {oos_start.date()}  |  End: {ret_main.index[-1].date()}')
print('=' * 70)
print(perf.to_string())
print('=' * 70)

In [None]:
# ── Cumulative returns + drawdown + rolling Sharpe ────────────────────────────
fig, axes = plt.subplots(3, 1, figsize=(16, 13), sharex=True,
                          gridspec_kw={'height_ratios': [3, 1.5, 1.5]})

palette = {'HAR-Lasso L/S (LassoCV)': ('royalblue', 2.2, '-'),
           'Benchmark: EW Energy':    ('gray',      1.2, '--'),
           'Benchmark: SPY':          ('black',     1.2, ':')}

# Panel 1: cumulative
ax = axes[0]
for name, ret in strategies.items():
    c, lw, ls = palette[name]
    cum = (1 + ret.loc[oos_start:].fillna(0)).cumprod()
    ax.plot(cum, color=c, linewidth=lw, linestyle=ls, label=name)
ax.axhline(1, color='black', linewidth=0.5, linestyle='--')
ax.set_ylabel('Growth of $1')
ax.set_title('HAR-Lasso Bipartite Strategy — Out-of-Sample Performance',
             fontsize=13, fontweight='bold')
ax.legend()
for dt in ['2011-02-17','2014-11-27','2020-03-09','2022-02-24']:
    ax.axvline(pd.Timestamp(dt), color='red', linestyle=':', linewidth=1, alpha=0.5)

# Panel 2: drawdown
ax = axes[1]
for name, ret in strategies.items():
    c, lw, ls = palette[name]
    r   = ret.loc[oos_start:].fillna(0)
    cum = (1 + r).cumprod()
    dd  = (cum - cum.cummax()) / cum.cummax()
    ax.fill_between(dd.index, dd, 0, color=c, alpha=0.25, label=name)
ax.set_ylabel('Drawdown')
ax.set_title('Drawdowns', fontsize=11)
ax.legend(fontsize=9)

# Panel 3: rolling Sharpe
ax = axes[2]
for name, ret in strategies.items():
    c, lw, ls = palette[name]
    sr = rolling_sharpe(ret.loc[oos_start:], window=252)
    ax.plot(sr, color=c, linewidth=lw, linestyle=ls, label=name)
ax.axhline(0,  color='black', linewidth=0.5, linestyle='--')
ax.axhline( 1, color='green', linewidth=0.7, linestyle=':')
ax.axhline(-1, color='red',   linewidth=0.7, linestyle=':')
ax.set_ylim(-3.5, 3.5)
ax.set_ylabel('Rolling Sharpe (252d)')
ax.set_xlabel('Date')
ax.set_title('Rolling 1-Year Sharpe', fontsize=11)
ax.legend(fontsize=9)

plt.tight_layout()
plt.savefig(OUTPUTS / 'backtest_main.png', dpi=120, bbox_inches='tight')
plt.show()

---
## Section 8 — Sensitivity: Window Length & Target

Per the proposal: *"study how the Lasso penalty and training window affect edge stability"*.

We test:
1. **Window**: 60d (responsive to shocks) vs 252d (stable estimation)
2. **Target**: `Y_idio` (residualized) vs `Y_raw` (raw returns)

> Hypothesis: shorter window captures faster-decaying geopolitical alpha;
> Y_raw may outperform if commodity-beta is itself predictable.
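
One possible way to quantify edge stability is the Jaccard overlap between the edge sets of consecutive rebalance dates. This metric is an assumption introduced here for illustration (the notebook itself compares backtest performance), and the helper `edge_jaccard` and the toy edges are hypothetical.

```python
import pandas as pd

def edge_jaccard(edges: pd.DataFrame) -> pd.Series:
    """Jaccard overlap of (feature, stock) edge sets between consecutive dates."""
    edge_sets = {
        date: set(zip(g["feature"], g["stock"]))
        for date, g in edges.groupby("date")
    }
    dates = sorted(edge_sets)
    out = {}
    for prev, curr in zip(dates[:-1], dates[1:]):
        a, b = edge_sets[prev], edge_sets[curr]
        union = a | b
        out[curr] = len(a & b) / len(union) if union else float("nan")
    return pd.Series(out, dtype=float)

# Toy edge list in the same column layout as edges_df.
toy = pd.DataFrame(
    [
        ("2022-02-18", "WTI_d", "XOM", 0.010),
        ("2022-02-18", "NG_w", "DVN", -0.004),
        ("2022-02-25", "WTI_d", "XOM", 0.015),
        ("2022-02-25", "Brent_d", "HAL", 0.006),
        ("2022-03-04", "WTI_d", "XOM", 0.012),
        ("2022-03-04", "Brent_d", "HAL", 0.005),
    ],
    columns=["date", "feature", "stock", "coef"],
)
toy["date"] = pd.to_datetime(toy["date"])

stability = edge_jaccard(toy)
print(stability)
```

A value of 1 means the edge set is unchanged from the previous rebalance, and values near 0 mean the network was almost entirely rewired. Dates with no edges at all do not appear in the edge list, so they are skipped in this sketch.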

In [None]:
# ── 60-day window ─────────────────────────────────────────────────────────────
print('Running 60-day window (Y_idio)...')
result_60 = rolling_lasso_cv(X, Y_idio_liq, train_window=60, weekly=True, min_obs=30)
ret_60_idio = run_ls_backtest(result_60['predictions'], Y_raw_liq)

# ── 60-day window on Y_raw ─────────────────────────────────────────────────
print('Running 60-day window (Y_raw)...')
result_60_raw = rolling_lasso_cv(X, Y_raw_liq, train_window=60, weekly=True, min_obs=30)
ret_60_raw = run_ls_backtest(result_60_raw['predictions'], Y_raw_liq)

# ── 252-day window on Y_raw ────────────────────────────────────────────────
print('Running 252-day window (Y_raw)...')
result_252_raw = rolling_lasso_cv(X, Y_raw_liq, train_window=252, weekly=True)
ret_252_raw = run_ls_backtest(result_252_raw['predictions'], Y_raw_liq)

In [None]:
oos_all = max(
    ret_main.dropna().index[0],
    ret_60_idio.dropna().index[0] if ret_60_idio.notna().any() else ret_main.dropna().index[0],
    ret_60_raw.dropna().index[0]  if ret_60_raw.notna().any()  else ret_main.dropna().index[0],
    ret_252_raw.dropna().index[0] if ret_252_raw.notna().any() else ret_main.dropna().index[0],
)

variants = {
    '252d window / Y_idio (main)': ret_main,
    '252d window / Y_raw':         ret_252_raw,
    ' 60d window / Y_idio':        ret_60_idio,
    ' 60d window / Y_raw':         ret_60_raw,
    'Benchmark: EW Energy':        ret_bench_ew,
    'Benchmark: SPY':               ret_bench_spy,
}

sens_table = pd.DataFrame({
    k: compute_metrics(v, start=oos_all) for k, v in variants.items()
}).T

print('\n' + '=' * 72)
print('SENSITIVITY: Window Length × Target')
print('=' * 72)
print(sens_table.to_string())
print('=' * 72)

# Cumulative chart
colors_s = ['royalblue','dodgerblue','darkorange','orangered','gray','black']
styles_s  = ['-','-','--','--',':',':']

fig, ax = plt.subplots(figsize=(16, 6))
for (name, ret), color, ls in zip(variants.items(), colors_s, styles_s):
    cum = (1 + ret.loc[oos_all:].fillna(0)).cumprod()
    lw  = 2.0 if 'main' in name or 'Benchmark' in name else 1.3
    ax.plot(cum, color=color, linewidth=lw, linestyle=ls, label=name, alpha=0.85)
ax.axhline(1, color='black', linewidth=0.5, linestyle='--')
ax.set_ylabel('Growth of $1')
ax.set_title('Sensitivity: Window Length vs Target  (252d vs 60d, Y_idio vs Y_raw)',
             fontweight='bold', fontsize=12)
ax.legend(fontsize=9)
plt.tight_layout()
plt.show()

---
## Summary

| Component | Implementation |
|---|---|
| Data | CRSP energy stocks (via cache) + 4 commodity futures + SPY |
| Features | 27 HAR + derived (daily/weekly/monthly lags, crack spread, RV, Brent-WTI spread) |
| Residualization | Rolling 252-day OLS beta vs SPY → Y_idio |
| Model | LassoCV with TimeSeriesSplit(n_splits=3) — dynamic alpha per window |
| Rebalancing | Weekly (Fridays), about 5× fewer rebalances than daily |
| Portfolio | Long top 25%, short bottom 25% by predicted return |
| TC | 5 bps one-way on weight changes |
| Bipartite graph | Non-zero Lasso coefficients = edges; tracked over time |

**Continue to:**
- `02_case_study.ipynb` — geopolitical event analysis and network topology shifts
- `03_math_walkthrough.ipynb` — full mathematical architecture with visual derivations