That’s exactly the study I’d dream of running at this stage!
Let’s set up a true **apples-to-apples benchmark** of agent architectures on our trading pipeline, using the exact same:

* Data splits (walk-forward, same stocks/timeframes)
* Features (identical input sets)
* Reward (cumulative)
* Evaluation (out-of-sample EWM, Sharpe, etc.)
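To make "walk-forward" concrete, here is a minimal sketch of how the shared splits could be generated once and reused by every agent. The function name `walk_forward_splits` and its parameters are hypothetical, not part of our pipeline; it assumes rows are already sorted by date and that the test window advances by its own length unless `step` says otherwise.

```python
from typing import Iterator, Optional, Tuple


def walk_forward_splits(
    n_samples: int,
    train_size: int,
    test_size: int,
    step: Optional[int] = None,
) -> Iterator[Tuple[range, range]]:
    """Yield (train_idx, test_idx) windows that roll forward through time.

    Each test window starts right after its training window, so the
    model is never evaluated on data older than what it was trained on.
    """
    if step is None:
        step = test_size  # non-overlapping test windows by default
    start = 0
    while start + train_size + test_size <= n_samples:
        train_idx = range(start, start + train_size)
        test_idx = range(start + train_size, start + train_size + test_size)
        yield train_idx, test_idx
        start += step


# Toy example: 10 rows, train on 4, test on the next 2, then roll forward by 2.
splits = list(walk_forward_splits(n_samples=10, train_size=4, test_size=2))
for train_idx, test_idx in splits:
    print(list(train_idx), "->", list(test_idx))
```

Computing the index windows once and passing the same list to every architecture is what keeps the comparison apples-to-apples.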

I’ll outline the practical plan below and can deliver full code scaffolding for each model if you want to go straight to implementation.

---

## **Agent Architecture Benchmark Plan**

### **1. Baseline PPO-MLP**

* **Policy:** Standard multilayer perceptron (MLP)
* **Library:** Stable Baselines3 (`PPO`)
* **Policy kwargs:** `net_arch`, e.g., `policy_kwargs=dict(net_arch=[128, 128])` or `net_arch=[256, 128]`

---

### **2. LSTM PPO (RecurrentPPO)**

* **Policy:** LSTM (single-layer or 2-layer, 128 units)
* **Library:** Stable Baselines3-Contrib (`RecurrentPPO`)
* **Policy class:** `"MlpLstmPolicy"`
* **Handles sequences natively**
* **Extra: Tune sequence/episode length for best results**

---

### **3. Single-Head Attention Transformer Policy**

* **Policy:** Transformer encoder with 1 attention head (minimalist setup)
* **Implementation:**

  * *Option 1*: Use `stable-baselines3` with a custom policy class (PyTorch).
  * *Option 2*: Use an SB3 fork or extension that supports transformer policies out of the box (these are uncommon, so custom code will probably still be needed).
* **Goal:** Test transformer’s “pattern memory” edge vs LSTM.
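One caveat for the "pattern memory" test: self-attention only has something to remember if each observation carries more than one time step. With a single-step observation the attention weights collapse to 1 on that one token. A minimal sketch of a rolling window that could sit between the environment and the policy is shown here; `ObservationWindow` is a hypothetical helper, and it assumes the environment emits one flat feature vector per step.

```python
from collections import deque
from typing import Deque, List, Sequence


class ObservationWindow:
    """Keeps the last `window` observations, zero-padded after a reset."""

    def __init__(self, window: int, obs_dim: int) -> None:
        self.window = window
        self.obs_dim = obs_dim
        self._buffer: Deque[List[float]] = deque(maxlen=window)
        self.reset()

    def reset(self) -> None:
        """Fill the buffer with zero vectors at the start of an episode."""
        self._buffer.clear()
        for _ in range(self.window):
            self._buffer.append([0.0] * self.obs_dim)

    def push(self, obs: Sequence[float]) -> List[List[float]]:
        """Append the newest observation and return the (window, obs_dim) stack."""
        if len(obs) != self.obs_dim:
            raise ValueError(f"expected {self.obs_dim} features, got {len(obs)}")
        self._buffer.append(list(obs))  # the oldest row is dropped automatically
        return [row[:] for row in self._buffer]


# Toy example: a window of 3 steps over 2 features.
buffer = ObservationWindow(window=3, obs_dim=2)
first = buffer.push([1.0, 2.0])
print(first)
```

The returned stack is ordered oldest to newest, which is the sequence axis a transformer encoder would attend over.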

---

### **4. Multi-Head Attention Transformer Policy**

* **Policy:** Transformer encoder, e.g., 4–8 heads, 1–2 layers
* **Implementation:** Same as above but with multiple heads
* **Why:** See if more heads/layers boost performance (at higher compute cost)

---

## **Benchmarking Protocol**

1. **Data**: Use our best meta-selected stocks/timeframes, identical for all runs.
2. **Feature set**: Fix features for all models, so that no architecture gets an input advantage.
3. **Hyperparameters**: Tune as fairly as possible (similar total params, same optimizer, batch size, episode length).
4. **Evaluation**:

   * Out-of-sample EWM cumulative reward
   * Sharpe ratio, drawdown, % > market
   * Policy entropy, if curious
   * 5+ random seeds per setting
5. **Logging**: Use Weights & Biases, MLflow, or simple CSVs to compare runs.
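For reference, the evaluation metrics in step 4 can be sketched in plain Python. This is an illustration under stated assumptions, not our pipeline's implementation: it assumes per-step simple returns, a zero risk-free rate, 252 periods per year, and that "EWM" means an exponentially weighted mean in its recursive form (the same recursion pandas uses with `adjust=False`). All function names are hypothetical.

```python
import math
from typing import Sequence


def sharpe_ratio(returns: Sequence[float], periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio with a zero risk-free rate (an assumption)."""
    n = len(returns)
    if n < 2:
        return 0.0
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / (n - 1)  # sample variance
    std = math.sqrt(var)
    if std == 0.0:
        return 0.0
    return mean / std * math.sqrt(periods_per_year)


def max_drawdown(returns: Sequence[float]) -> float:
    """Largest peak-to-trough drop of the compounded equity curve, as a positive fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst


def ewm_mean(values: Sequence[float], alpha: float) -> float:
    """Recursive exponentially weighted mean: m_t = alpha * x_t + (1 - alpha) * m_{t-1}."""
    if not values:
        raise ValueError("values must not be empty")
    m = values[0]
    for x in values[1:]:
        m = alpha * x + (1.0 - alpha) * m
    return m


# Toy returns, purely illustrative (not results from any run).
toy_returns = [0.1, -0.5, 0.2]
print(sharpe_ratio(toy_returns), max_drawdown(toy_returns), ewm_mean(toy_returns, alpha=0.5))
```

Whatever definitions we settle on, the same functions should score every architecture and every seed.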

---

## **Implementation Plan**

**A. Write/Adapt Custom Policies**

* For LSTM: use `RecurrentPPO` (easy).
* For Transformers: extend SB3’s `ActorCriticPolicy` using PyTorch, plug in transformer blocks.

**B. Standardized Training Loop**

* For each agent: loop over all stocks/timeframes, train, evaluate, record metrics.
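The loop in step B can be sketched as a flat grid of (ticker, agent, seed) runs that skips anything already recorded, which makes an interrupted benchmark resumable. `pending_runs` is a hypothetical helper and the tickers below are placeholders.

```python
from itertools import product
from typing import Iterable, Iterator, Sequence, Set, Tuple

RunKey = Tuple[str, str, int]


def pending_runs(
    tickers: Sequence[str],
    agent_types: Sequence[str],
    n_seeds: int,
    done_keys: Iterable[RunKey] = (),
) -> Iterator[RunKey]:
    """Yield every (ticker, agent, seed) combination that has not been completed yet."""
    done: Set[RunKey] = set(done_keys)
    for key in product(tickers, agent_types, range(n_seeds)):
        if key not in done:
            yield key


# Placeholder tickers; one run is pretended to be finished already.
todo = list(
    pending_runs(
        tickers=["AAA", "BBB"],
        agent_types=["mlp", "lstm"],
        n_seeds=2,
        done_keys=[("AAA", "mlp", 0)],
    )
)
for ticker, agent_type, seed in todo:
    # train, evaluate, and record metrics for this run here
    print(ticker, agent_type, seed)
```

Keying each run by the full tuple means a crash costs at most one run, as long as results are saved after each one.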

**C. Result Table**

| Model           | Architecture     | Params | Mean EWM Reward | Sharpe | % > Market | Notes       |
| --------------- | ---------------- | ------ | --------------- | ------ | ---------- | ----------- |
| PPO-MLP         | [256, 128] MLP   | X      | ...             | ...    | ...        | Baseline    |
| PPO-LSTM        | 1x128 LSTM       | Y      | ...             | ...    | ...        | Recurrent   |
| PPO-Transformer | 1-head, 1 layer  | Z      | ...             | ...    | ...        | Single head |
| PPO-Transformer | 4-head, 2 layers | W      | ...             | ...    | ...        | Multi-head  |
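To fill in the Params column before training anything, rough trunk sizes can be computed by hand. The sketch below uses the standard PyTorch layouts: `nn.Linear` has weights plus one bias, `nn.LSTM` has four gates with two bias vectors each, and `nn.TransformerEncoderLayer` has a default `dim_feedforward` of 2048. The observation size of 10 is a placeholder, and the counts ignore the input embedding and the action and value heads, so treat them as estimates.

```python
from typing import Sequence


def mlp_params(input_dim: int, hidden: Sequence[int]) -> int:
    """Weights + biases of a stack of fully connected layers."""
    total, prev = 0, input_dim
    for width in hidden:
        total += (prev + 1) * width  # +1 for the bias
        prev = width
    return total


def lstm_params(input_dim: int, hidden_size: int) -> int:
    """Single-layer LSTM: 4 gates, input and recurrent weights, two bias vectors."""
    return 4 * (input_dim * hidden_size + hidden_size * hidden_size + 2 * hidden_size)


def transformer_encoder_layer_params(d_model: int, dim_feedforward: int = 2048) -> int:
    """One encoder layer: attention projections, feed-forward block, two LayerNorms."""
    attention = 4 * d_model * d_model + 4 * d_model  # q, k, v, and output projections
    feed_forward = 2 * d_model * dim_feedforward + dim_feedforward + d_model
    layer_norms = 4 * d_model  # scale and shift for each of the two norms
    return attention + feed_forward + layer_norms


OBS_DIM = 10  # placeholder feature count, not our real FEATURE_COLS length

print("MLP [256, 128]:", mlp_params(OBS_DIM, [256, 128]))
print("LSTM 1x128:", lstm_params(OBS_DIM, 128))
print("Transformer layer, d_model=32:", transformer_encoder_layer_params(32))
```

Note that the head count does not appear in the transformer formula: heads split `d_model` among themselves rather than adding weights, so the 1-head and 4-head variants differ in parameter count only through the number of layers.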

---

## **Deliverables**

* **Scripts for each model type** (ready to run)
* **Unified training and eval pipeline** (for apples-to-apples comparison)
* **Benchmarking notebooks** for quick result viz
* **Markdown summary template** for documentation
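As a starting point for the summary template, a result table can be rendered from a list of per-model dicts. This is a minimal sketch with a hypothetical helper name; missing metrics are printed as `...` so the template can be generated before any numbers exist.

```python
from typing import Dict, List, Sequence


def to_markdown_table(rows: List[Dict[str, object]], columns: Sequence[str]) -> str:
    """Render rows as a markdown table, using '...' for any missing cell."""
    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    body = [
        "| " + " | ".join(str(row.get(col, "...")) for col in columns) + " |"
        for row in rows
    ]
    return "\n".join([header, divider, *body])


# Only the known, non-metric fields are filled in here.
template_rows = [
    {"Model": "PPO-MLP", "Architecture": "[256, 128] MLP", "Notes": "Baseline"},
    {"Model": "PPO-LSTM", "Architecture": "1x128 LSTM", "Notes": "Recurrent"},
]
table = to_markdown_table(template_rows, ["Model", "Architecture", "Sharpe", "Notes"])
print(table)
```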

---



In [1]:
# SETUP: Imports & Paths ===========================
import jupyter
from src.utils.system import boot, Notify

boot()
import os
import joblib
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


from tqdm import tqdm

from src.data.feature_pipeline import basic_chart_features, load_base_dataframe
from src.predictability.easiness import rolling_sharpe, rolling_r2, rolling_info_ratio, rolling_autocorr
from src.predictability.pipeline import generate_universe_easiness_report
from IPython import display

from src.experiments.experiment_tracker import ExperimentTracker
from src.config import TOP2_STOCK_BY_SECTOR


from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score
from scipy.stats import skew, kurtosis, entropy
from statsmodels.stats.diagnostic import acorr_ljungbox
from statsmodels.tsa.stattools import acf, acovf

from src.env.base_trading_env import (
    CumulativeTradingEnv,
)
import warnings
warnings.filterwarnings("ignore")




In [None]:
# ========== IMPORTS & SETUP ==========
import os
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from stable_baselines3 import PPO
from sb3_contrib import RecurrentPPO
from stable_baselines3.common.vec_env import DummyVecEnv
from tqdm import tqdm

from src.env.base_trading_env import CumulativeTradingEnv
from src.data.feature_pipeline import load_base_dataframe
from src.experiments.experiment_tracker import ExperimentTracker
from src.defaults import FEATURE_COLS, EPISODE_LENGTH, EXCLUDED_TICKERS
from src.config import TOP2_STOCK_BY_SECTOR  # used for the ticker subset below

# ========== CONFIG ==========
EXPERIENCE_NAME = "__agent_design_and_benchmark"

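N_SEEDS = 3  # the protocol above calls for 5+; 3 keeps this barebones run short
N_EVAL_EPISODES = 3
AGENT_TYPES = ['mlp', 'lstm', 'transformer_single', 'transformer_multi']

TRANSACTION_COST = 0  # frictionless for now

CONFIG = {
    "batch_size": EPISODE_LENGTH,
    "n_steps": EPISODE_LENGTH * 3,  # rollout length, a whole multiple of batch_size
    "total_timesteps": 1000,  # minimal budget, adjust as needed
}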

# ========== DATA LOAD ==========
ohlcv_df = load_base_dataframe()
ohlcv_df['date'] = pd.to_datetime(ohlcv_df['date'])
# Adjust date range as needed!
ohlcv_df['month'] = ohlcv_df['date'].dt.to_period('M')

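# Full universe minus exclusions (overridden by the subset below)
TICKERS = ohlcv_df['symbol'].unique()
TICKERS = TICKERS[~np.isin(TICKERS, EXCLUDED_TICKERS)]
# Quick-run subset: this replaces the full universe above.
# Remove this line to benchmark every non-excluded ticker.
TICKERS = TOP2_STOCK_BY_SECTOR[:2]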
# ========== TRACKER ==========
experiment_tracker = ExperimentTracker(EXPERIENCE_NAME)

# ========== ENV FACTORY ==========
def make_env(df, ticker, feature_cols, episode_length):
    df_ticker = df[df['symbol'] == ticker].copy()
    env = CumulativeTradingEnv(
        df=df_ticker,
        feature_cols=feature_cols,
        episode_length=episode_length,
        transaction_cost=TRANSACTION_COST,
        #reward_fn=None,  # use env default
    )
    return env

# ========== BAREBONES TRANSFORMER POLICY ==========
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor
from stable_baselines3.common.policies import ActorCriticPolicy

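class TransformerExtractor(BaseFeaturesExtractor):
    """Embeds a flat observation and passes it through a small transformer encoder."""

    def __init__(self, observation_space, d_model=32, nhead=1, num_layers=1):
        super().__init__(observation_space, features_dim=d_model)
        # Assumes a 1-D (flat) observation space
        self.embedding = nn.Linear(observation_space.shape[0], d_model)
        # Default tensor layout is (seq, batch, d_model), i.e. batch_first=False
        encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

    def forward(self, obs):
        x = self.embedding(obs)  # (batch, d_model)
        # NOTE: each observation becomes a one-token sequence, so attention has no
        # other time steps to look at; pass a window of past steps to test memory.
        x = x.unsqueeze(0)  # (seq=1, batch, d_model)
        x = self.transformer(x)
        x = x.squeeze(0)  # back to (batch, d_model)
        return x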

class TransformerPolicy(ActorCriticPolicy):
    def __init__(self, *args, nhead=1, num_layers=1, **kwargs):
        super().__init__(
            *args,
            features_extractor_class=TransformerExtractor,
            features_extractor_kwargs={'d_model': 32, 'nhead': nhead, 'num_layers': num_layers},
            **kwargs
        )

def is_scalar_series(series):
    return series.apply(lambda x: np.isscalar(x) or (isinstance(x, (np.floating, np.integer)))).all()

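def evaluate_agent(model, env, n_episodes=10):
    """Roll out `model` deterministically and summarize final-step infos and actions."""
    all_infos = []
    all_actions = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        done = False
        episode_actions = []
        info = {}
        # Recurrent policies (RecurrentPPO) need their hidden state carried between
        # steps; plain PPO accepts and ignores both arguments.
        state = None
        episode_start = np.ones((1,), dtype=bool)
        while not done:
            action, state = model.predict(
                obs, state=state, episode_start=episode_start, deterministic=True
            )
            obs, reward, terminated, truncated, info = env.step(action)
            done = terminated or truncated
            episode_start = np.zeros((1,), dtype=bool)
            episode_actions.append(int(action))  # assumes a discrete action space
        all_infos.append(info)  # only the last step's info dict is kept per episode
        all_actions.extend(episode_actions)

    infos_df = pd.DataFrame(all_infos)
    # Only keep scalar columns
    scalar_cols = [col for col in infos_df.columns if is_scalar_series(infos_df[col])]
    # Keys say "mean", so aggregate with the mean (the original used the median here)
    metrics = {f"mean_{col}": infos_df[col].mean() for col in scalar_cols}
    metrics.update({f"std_{col}": infos_df[col].std() for col in scalar_cols})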

    # Optionally: record the *first* (or last) value of array-valued metrics for inspection
    array_cols = [col for col in infos_df.columns if col not in scalar_cols]
    for col in array_cols:
        metrics[f"{col}_sample"] = infos_df[col].iloc[0]  # Or another summary

    # Action breakdown and entropy
    action_counts = pd.Series(all_actions).value_counts(normalize=True).to_dict()
    metrics["action_counts"] = action_counts
    metrics["action_entropy"] = -sum(p * np.log(p + 1e-8) for p in action_counts.values())
    return metrics

# --- MAIN BENCHMARK LOOP ---
RESULTS_PATH = f"data/experiments/{EXPERIENCE_NAME}_barebones_results.csv"
os.makedirs(os.path.dirname(RESULTS_PATH), exist_ok=True)

# Resume support: reload finished (ticker, agent, seed) runs so they are skipped below
if os.path.exists(RESULTS_PATH):
    results_df = pd.read_csv(RESULTS_PATH)
    done_keys = set(zip(results_df['ticker'], results_df['agent'], results_df['seed']))
    results = results_df.to_dict('records')
    print(f"Loaded {len(done_keys)} previously completed results.")
else:
    done_keys = set()
    results = []

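for ticker in tqdm(TICKERS, desc="Tickers"):
    # make_env filters ohlcv_df down to this ticker
    env = make_env(ohlcv_df, ticker, FEATURE_COLS, EPISODE_LENGTH)
    # NOTE: the same env (same data) is used for training and evaluation, so the
    # metrics below are not out-of-sample; build a held-out env for that.
    vec_env = DummyVecEnv([lambda: env])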

    for agent_type in AGENT_TYPES:
        for seed in range(N_SEEDS):
            key = (ticker, agent_type, seed)
            if key in done_keys:
                print(f"Skipping already completed: {key}")
                continue
                
            np.random.seed(seed)
            torch.manual_seed(seed)
            if agent_type == 'mlp':
                model = PPO(
                    "MlpPolicy", vec_env,
                    verbose=0, seed=seed,
                    batch_size=CONFIG["batch_size"], n_steps=CONFIG['n_steps']
                )
            elif agent_type == 'lstm':
                model = RecurrentPPO(
                    "MlpLstmPolicy", vec_env,
                    verbose=0, seed=seed,
                    batch_size=CONFIG["batch_size"], n_steps=CONFIG['n_steps']
                )
            elif agent_type == 'transformer_single':
                model = PPO(
                    TransformerPolicy, vec_env,
                    policy_kwargs={'nhead': 1, 'num_layers': 1},
                    verbose=0, seed=seed,
                    batch_size=CONFIG["batch_size"], n_steps=CONFIG['n_steps']
                )
            elif agent_type == 'transformer_multi':
                model = PPO(
                    TransformerPolicy, vec_env,
                    policy_kwargs={'nhead': 4, 'num_layers': 2},
                    verbose=0, seed=seed,
                    batch_size=CONFIG["batch_size"], n_steps=CONFIG['n_steps']
                )
            else:
                raise ValueError(f"Unknown agent type: {agent_type}")
            model.learn(total_timesteps=CONFIG["total_timesteps"])  # Minimal, adjust as needed
            metrics = evaluate_agent(model, env, n_episodes=N_EVAL_EPISODES)
            result = {"ticker": ticker, "agent": agent_type, "seed": seed}
            result.update(metrics)
            results.append(result)
            # Save after every new result
            pd.DataFrame(results).to_csv(RESULTS_PATH, index=False)
            # Print all metrics for this run (experiment_tracker is not wired in yet)
            for k, v in metrics.items():
                print(f"{agent_type}_{k}", v)
            print("--------------------------------------")
            print('')

# ========== SAVE RESULTS ==========
results_df = pd.DataFrame(results)
results_df.to_csv(RESULTS_PATH, index=False)
results_df.groupby(['agent'])[['mean_reward', 'std_reward']].median()

