Data splits (walk-forward, same stocks/timeframes)

Features (identical input sets)

Reward (cumulative)

Evaluation (out-of-sample EWM, Sharpe, etc.)



Let’s set up a true **apples-to-apples benchmark** of agent architectures on our trading pipeline, using the exact same:

* Data splits (walk-forward, same stocks/timeframes)
* Features (identical input sets)
* Reward (cumulative)
* Evaluation (out-of-sample EWM, Sharpe, etc.)



---

## **Agent Architecture Benchmark Plan**

### **1. Baseline PPO-MLP**

* **Policy:** Standard multilayer perceptron (MLP)
* **Library:** Stable Baselines3 (`PPO`)
* **Policy kwargs:** e.g., `[128, 128]` or `[256, 128]`

---

### **2. LSTM PPO (RecurrentPPO)**

* **Policy:** LSTM (single-layer or 2-layer, 128 units)
* **Library:** Stable Baselines3-Contrib (`RecurrentPPO`)
* **Policy:** `"MlpLstmPolicy"`
* **Handles sequences natively**
* **Extra: Tune sequence/episode length for best results**

---

### **3. Single-Head Attention Transformer Policy**

* **Policy:** Transformer encoder with 1 attention head (minimalist setup)
* **Implementation:**

  * *Option 1*: Use `stable-baselines3` with a custom policy class (PyTorch).
  * *Option 2*: Use SB3 fork/extensions that support transformer policies out-of-the-box (less common; will probably need custom code).
* **Goal:** Test transformer’s “pattern memory” edge vs LSTM.

---

### **4. Multi-Head Attention Transformer Policy**

* **Policy:** Transformer encoder, e.g., 4–8 heads, 1–2 layers
* **Implementation:** Same as above but with multiple heads
* **Why:** See if more heads/layers boost performance (at higher compute cost)

---

## **Benchmarking Protocol**

1. **Data**: Use our best meta-selected stocks/timeframes, identical for all runs.
2. **Feature set**: Fix features for all models (no advantage to one or another).
3. **Hyperparameters**: Tune as fairly as possible (similar total params, same optimizer, batch size, episode length).
4. **Evaluation**:

   * Out-of-sample EWM cumulative reward
   * Sharpe ratio, drawdown, % > market
   * Policy entropy, if curious
   * 5+ random seeds per setting
5. **Logging**: Use Weights & Biases, MLflow, or simple CSVs to compare runs.

---

## **Implementation Plan**

**A. Write/Adapt Custom Policies**

* For LSTM: use `RecurrentPPO` (easy).
* For Transformers: extend SB3’s `ActorCriticPolicy` using PyTorch, plug in transformer blocks.

**B. Standardized Training Loop**

* For each agent: loop over all stocks/timeframes, train, evaluate, record metrics.

**C. Result Table**

| Model           | Architecture     | Params | Mean EWM Reward | Sharpe | % > Market | Notes       |
| --------------- | ---------------- | ------ | --------------- | ------ | ---------- | ----------- |
| PPO-MLP         | \[256,128] MLP   | X      | ...             | ...    | ...        | Baseline    |
| PPO-LSTM        | 1x128 LSTM       | Y      | ...             | ...    | ...        | Recurrent   |
| PPO-Transformer | 1-head, 1 layer  | Z      | ...             | ...    | ...        | Single head |
| PPO-Transformer | 4-head, 2 layers | W      | ...             | ...    | ...        | Multi-head  |

---

## **Deliverables**

* **Scripts for each model type** (ready to run)
* **Unified training and eval pipeline** (for apples-to-apples comparison)
* **Benchmarking notebooks** for quick result viz
* **Markdown summary template** for documentation

---



In [1]:
# SETUP ===================================
import jupyter

from src.utils.system import boot, Notify

boot()

# PACKAGES ================================
import os
import joblib
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from tqdm import tqdm
from sklearn.preprocessing import  RobustScaler

# FRAMEWORK STUFF =========================
from src.config import TOP2_STOCK_BY_SECTOR
from src.data.feature_pipeline importload_base_dataframe
from src.experiments.experiment_tracker import ExperimentTracker



from src.env.base_trading_env import (
    CumulativeTradingEnv,
)
import warnings
warnings.filterwarnings("ignore")


  from pandas.core import (


In [2]:
# ========== IMPORTS & SETUP ==========
import os
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from stable_baselines3 import PPO
from sb3_contrib import RecurrentPPO
from stable_baselines3.common.vec_env import DummyVecEnv
from tqdm import tqdm

from src.env.base_trading_env import CumulativeTradingEnv
from src.data.feature_pipeline import load_base_dataframe
from src.experiments.experiment_tracker import ExperimentTracker
from src.defaults import FEATURE_COLS, EPISODE_LENGTH, EXCLUDED_TICKERS

# ========== CONFIG ==========
EXPERIENCE_NAME = "agent_design_and_benchmark"
RESULTS_PATH = f"data/experiments/{EXPERIENCE_NAME}_barebones_results.csv"
N_EPISODES = 20
N_SEEDS = 3
N_EVAL_EPISODES = 3
AGENT_TYPES = ['mlp', 'lstm', 'transformer_single', 'transformer_multi']

TRANSACTION_COST = 0

CONFIG = {
    "batch_size": 32,
    "n_steps": 128,
    "total_timesteps": 10000,   # Adjust for speed/depth
}

walk_forward_splits = [
    ("2023-01-01", "2023-07-01", "2023-12-01"),
    ("2024-01-01", "2024-07-01", "2024-12-01"),
]

# --- Load data ---
ohlcv_df = load_base_dataframe()
ohlcv_df['date'] = pd.to_datetime(ohlcv_df['date'])

# --- Experiment tracker ---
experiment_tracker = ExperimentTracker(EXPERIENCE_NAME)

# --- Transformer Policy ---
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor
from stable_baselines3.common.policies import ActorCriticPolicy

class TransformerExtractor(BaseFeaturesExtractor):
    def __init__(self, observation_space, d_model=32, nhead=1, num_layers=1):
        # observation_space.shape = (window_length, n_features)
        super().__init__(observation_space, features_dim=d_model)
        win_len, n_features = observation_space.shape
        self.embedding = nn.Linear(n_features, d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.win_len = win_len
    def forward(self, obs):
        # obs: (batch, window_length, n_features)
        # If obs has no batch, add it
        if obs.ndim == 2:
            obs = obs.unsqueeze(0)
        x = self.embedding(obs)        # (batch, win_len, d_model)
        x = x.permute(1, 0, 2)        # (win_len, batch, d_model)
        x = self.transformer(x)        # (win_len, batch, d_model)
        # Pooling: use the last output (or mean pooling)
        pooled = x[-1]                # (batch, d_model)
        return pooled

class TransformerPolicy(ActorCriticPolicy):
    def __init__(self, *args, nhead=1, num_layers=1, **kwargs):
        super().__init__(
            *args,
            features_extractor_class=TransformerExtractor,
            features_extractor_kwargs={'d_model': 32, 'nhead': nhead, 'num_layers': num_layers},
            **kwargs
        )

# --- Env factory ---
def make_env(df, ticker, feature_cols, episode_length):
    df_ticker = df[df['symbol'] == ticker].copy()
    return CumulativeTradingEnv(
        df=df_ticker,
        feature_cols=feature_cols,
        episode_length=episode_length,
        transaction_cost=TRANSACTION_COST,
    )

# --- Episode generator ---
def generate_episode_sequences(df, episode_length, n_episodes, excluded_tickers, seed=314):
    rng = np.random.default_rng(seed)
    eligible_tickers = [t for t in df['symbol'].unique() if t not in excluded_tickers]
    sequences = []
    for _ in range(n_episodes):
        ticker = rng.choice(eligible_tickers)
        stock_df = df[df['symbol'] == ticker]
        max_start = len(stock_df) - episode_length - 1
        if max_start < 1:
            continue
        start_idx = rng.integers(0, max_start)
        sequences.append((ticker, int(start_idx)))
    return sequences

# --- Evaluation: Only use scalar metrics from env info ---
def is_scalar_series(series):
    return series.apply(lambda x: np.isscalar(x) or isinstance(x, (np.floating, np.integer, float, int, np.float64, np.int64))).all()

def evaluate_agent(model, df, sequences, feature_cols, episode_length):
    all_infos = []
    all_actions = []
    for ticker, start_idx in sequences:
        env = make_env(df, ticker, feature_cols, episode_length)
        obs, _ = env.reset(start_index=start_idx)
        done = False
        info = {}
        episode_actions = []
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, info = env.step(action)
            done = terminated or truncated
            episode_actions.append(int(action))
        all_infos.append(info)
        all_actions.extend(episode_actions)
    infos_df = pd.DataFrame(all_infos)
    scalar_cols = [col for col in infos_df.columns if is_scalar_series(infos_df[col])]
    metrics = {f"mean_{col}": infos_df[col].mean() for col in scalar_cols}
    metrics.update({f"std_{col}": infos_df[col].std() for col in scalar_cols})
    action_counts = pd.Series(all_actions).value_counts(normalize=True).to_dict()
    metrics["action_counts"] = action_counts
    metrics["action_entropy"] = -sum(p * np.log(p + 1e-8) for p in action_counts.values())
    return metrics

# --- Resumability: Load existing results ---
results = []
done_keys = set()

if os.path.exists(RESULTS_PATH):
    results_df = pd.read_csv(RESULTS_PATH)
    required_cols = {'split', 'split_start', 'agent', 'seed'}
    if required_cols.issubset(results_df.columns):
        done_keys = set(zip(
            results_df['split'], results_df['split_start'], results_df['agent'], results_df['seed']
        ))
        results = results_df.to_dict('records')
        print(f"Loaded {len(done_keys)} previously completed results.")
    else:
        print(f"WARNING: Existing {RESULTS_PATH} is missing required columns or is from an old experiment.")
        backup_path = RESULTS_PATH.replace(".csv", "_backup.csv")
        os.rename(RESULTS_PATH, backup_path)
        print(f"Backed up old file to {backup_path}. Starting new results file.")

# --- Precompute episode sequences for all splits, types, and seeds ---
episode_sequences = {}  # (split_type, split_start, seed) -> list of (ticker, start_idx)

for split in walk_forward_splits:
    train_start, train_end, test_end = split
    test_start = train_end
    df_train = ohlcv_df[(ohlcv_df['date'] >= train_start) & (ohlcv_df['date'] < train_end)]
    df_test  = ohlcv_df[(ohlcv_df['date'] >= test_start) & (ohlcv_df['date'] < test_end)]
    for split_type, df, split_start in [
        ("train", df_train, train_start),
        ("test",  df_test,  test_start),
    ]:
        for seed in range(N_SEEDS):
            seqs = generate_episode_sequences(df, EPISODE_LENGTH, N_EPISODES, EXCLUDED_TICKERS, seed=seed)
            if len(seqs) == 0:
                print(f"WARNING: No episodes for {split_type} {split_start} seed={seed}")
            episode_sequences[(split_type, split_start, seed)] = seqs

# --- Main walk-forward benchmark ---
for split in tqdm(walk_forward_splits, desc="Splits"):
    train_start, train_end, test_end = split
    test_start = train_end
    df_train = ohlcv_df[(ohlcv_df['date'] >= train_start) & (ohlcv_df['date'] < train_end)]
    df_test  = ohlcv_df[(ohlcv_df['date'] >= test_start) & (ohlcv_df['date'] < test_end)]

    if len(df_train) < EPISODE_LENGTH or len(df_test) < EPISODE_LENGTH:
        print(f"Skipping split {split}: Not enough data")
        continue

    for agent_type in AGENT_TYPES:
        for seed in range(N_SEEDS):
            for split_type, df, split_start, split_end in [
                ("train", df_train, train_start, train_end),
                ("test",  df_test,  test_start,  test_end)
            ]:
                key = (split_type, split_start, agent_type, seed)
                if key in done_keys:
                    print(f"Skipping {key} (already done)")
                    continue

                sequences = episode_sequences.get((split_type, split_start, seed), [])
                if len(sequences) == 0:
                    print(f"No episodes to sample in {split_type} {split} seed={seed}")
                    continue

                # Train only on train split, eval on both
                if split_type == "train":
                    ticker = sequences[0][0]  # For env construction
                    env = make_env(df, ticker, FEATURE_COLS, EPISODE_LENGTH)
                    vec_env = DummyVecEnv([lambda: env])
                    np.random.seed(seed)
                    torch.manual_seed(seed)
                    if agent_type == 'mlp':
                        model = PPO(
                            "MlpPolicy", vec_env,
                            verbose=0, seed=seed,
                            batch_size=CONFIG["batch_size"], n_steps=CONFIG['n_steps']
                        )
                    elif agent_type == 'lstm':
                        model = RecurrentPPO(
                            "MlpLstmPolicy", vec_env,
                            verbose=0, seed=seed,
                            batch_size=CONFIG["batch_size"], n_steps=CONFIG['n_steps']
                        )
                    elif agent_type == 'transformer_single':
                        model = PPO(
                            TransformerPolicy, vec_env,
                            policy_kwargs={'nhead': 1, 'num_layers': 1},
                            verbose=0, seed=seed,
                            batch_size=CONFIG["batch_size"], n_steps=CONFIG['n_steps']
                        )
                    elif agent_type == 'transformer_multi':
                        model = PPO(
                            TransformerPolicy, vec_env,
                            policy_kwargs={'nhead': 4, 'num_layers': 2},
                            verbose=0, seed=seed,
                            batch_size=CONFIG["batch_size"], n_steps=CONFIG['n_steps']
                        )
                    model.learn(total_timesteps=CONFIG["total_timesteps"])

                # Evaluate on current split (train or test)
                metrics = evaluate_agent(model, df, sequences, FEATURE_COLS, EPISODE_LENGTH)
                result = {
                    "split": split_type,
                    "split_start": split_start,
                    "split_end": split_end,
                    "agent": agent_type,
                    "seed": seed,
                }
                result.update(metrics)
                results.append(result)
                pd.DataFrame(results).to_csv(RESULTS_PATH, index=False)
                #for k, v in metrics.items():
                #    print(f"{split_type}_{agent_type}_{k}", v)
                #print(f"Done: {result}")
                print(f"Complete {key} ")
                
print("\nFinished all splits. Final summary:")
results_df = pd.DataFrame(results)
results_df.groupby(['split', 'agent']).mean(numeric_only=True)


Splits:   0%|          | 0/2 [00:00<?, ?it/s]

Complete ('train', '2023-01-01', 'mlp', 0) 
Complete ('test', '2023-07-01', 'mlp', 0) 
Complete ('train', '2023-01-01', 'mlp', 1) 
Complete ('test', '2023-07-01', 'mlp', 1) 
Complete ('train', '2023-01-01', 'mlp', 2) 
Complete ('test', '2023-07-01', 'mlp', 2) 
Complete ('train', '2023-01-01', 'lstm', 0) 
Complete ('test', '2023-07-01', 'lstm', 0) 
Complete ('train', '2023-01-01', 'lstm', 1) 
Complete ('test', '2023-07-01', 'lstm', 1) 
Complete ('train', '2023-01-01', 'lstm', 2) 
Complete ('test', '2023-07-01', 'lstm', 2) 
Complete ('train', '2023-01-01', 'transformer_single', 0) 
Complete ('test', '2023-07-01', 'transformer_single', 0) 
Complete ('train', '2023-01-01', 'transformer_single', 1) 
Complete ('test', '2023-07-01', 'transformer_single', 1) 
Complete ('train', '2023-01-01', 'transformer_single', 2) 
Complete ('test', '2023-07-01', 'transformer_single', 2) 
Complete ('train', '2023-01-01', 'transformer_multi', 0) 
Complete ('test', '2023-07-01', 'transformer_multi', 0) 
Comple

Splits:  50%|█████     | 1/2 [29:22<29:22, 1762.40s/it]

Complete ('test', '2023-07-01', 'transformer_multi', 2) 
Complete ('train', '2024-01-01', 'mlp', 0) 
Complete ('test', '2024-07-01', 'mlp', 0) 
Complete ('train', '2024-01-01', 'mlp', 1) 
Complete ('test', '2024-07-01', 'mlp', 1) 
Complete ('train', '2024-01-01', 'mlp', 2) 
Complete ('test', '2024-07-01', 'mlp', 2) 
Complete ('train', '2024-01-01', 'lstm', 0) 
Complete ('test', '2024-07-01', 'lstm', 0) 
Complete ('train', '2024-01-01', 'lstm', 1) 
Complete ('test', '2024-07-01', 'lstm', 1) 
Complete ('train', '2024-01-01', 'lstm', 2) 
Complete ('test', '2024-07-01', 'lstm', 2) 
Complete ('train', '2024-01-01', 'transformer_single', 0) 
Complete ('test', '2024-07-01', 'transformer_single', 0) 
Complete ('train', '2024-01-01', 'transformer_single', 1) 
Complete ('test', '2024-07-01', 'transformer_single', 1) 
Complete ('train', '2024-01-01', 'transformer_single', 2) 
Complete ('test', '2024-07-01', 'transformer_single', 2) 
Complete ('train', '2024-01-01', 'transformer_multi', 0) 
Comple

Splits: 100%|██████████| 2/2 [1:01:17<00:00, 1838.60s/it]

Complete ('test', '2024-07-01', 'transformer_multi', 2) 

Finished all splits. Final summary:





Unnamed: 0_level_0,Unnamed: 1_level_0,seed,mean_episode_sharpe,mean_episode_sortino,mean_episode_total_reward,mean_cumulative_return,mean_calmar,mean_max_drawdown,mean_win_rate,mean_alpha,std_episode_sharpe,std_episode_sortino,std_episode_total_reward,std_cumulative_return,std_calmar,std_max_drawdown,std_win_rate,std_alpha,action_entropy
split,agent,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1
test,lstm,1.0,-0.022839,-0.010702,-0.014054,-0.015412,0.218293,0.209261,0.608333,-0.065035,0.097085,0.165696,0.250022,0.250322,1.524458,0.088346,0.453213,0.250322,0.391187
test,mlp,1.0,-0.009016,0.00772,-0.040333,-0.032386,0.307942,0.214944,0.595833,-0.08201,0.104307,0.162981,0.260828,0.239442,1.455544,0.122863,0.484976,0.239442,0.549098
test,transformer_multi,1.0,-0.001516,0.018376,-0.025622,-0.017662,0.516542,0.209027,0.533333,-0.067286,0.097769,0.166936,0.259647,0.238342,1.927739,0.121956,0.472206,0.238342,0.376073
test,transformer_single,1.0,-0.012857,-0.005222,-0.056951,-0.050469,0.184314,0.216646,0.508333,-0.100093,0.097675,0.147079,0.257856,0.22037,1.240141,0.130189,0.48142,0.22037,0.320755
train,lstm,1.0,-0.026309,-0.022146,-0.059703,-0.058985,0.010351,0.199566,0.541667,-0.152143,0.088233,0.135322,0.191231,0.168496,0.965169,0.102151,0.466466,0.168496,0.40841
train,mlp,1.0,0.008707,0.038354,0.026173,0.02787,0.545106,0.174483,0.5125,-0.065288,0.094118,0.168269,0.207351,0.224764,1.797854,0.07704,0.449493,0.224764,0.59439
train,transformer_multi,1.0,0.002256,0.024894,0.015968,0.016017,0.392807,0.184493,0.5,-0.077141,0.091816,0.15843,0.199319,0.210973,1.47334,0.079427,0.470304,0.210973,0.421903
train,transformer_single,1.0,0.006,0.032084,0.022707,0.022211,0.465429,0.17885,0.529167,-0.070947,0.089722,0.153589,0.194515,0.206314,1.537634,0.079801,0.458442,0.206314,0.348825


In [3]:
results_df.groupby(['split', 'agent']).mean(numeric_only=True)

Unnamed: 0_level_0,Unnamed: 1_level_0,seed,mean_episode_sharpe,mean_episode_sortino,mean_episode_total_reward,mean_cumulative_return,mean_calmar,mean_max_drawdown,mean_win_rate,mean_alpha,std_episode_sharpe,std_episode_sortino,std_episode_total_reward,std_cumulative_return,std_calmar,std_max_drawdown,std_win_rate,std_alpha,action_entropy
split,agent,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1
test,lstm,1.0,-0.022839,-0.010702,-0.014054,-0.015412,0.218293,0.209261,0.608333,-0.065035,0.097085,0.165696,0.250022,0.250322,1.524458,0.088346,0.453213,0.250322,0.391187
test,mlp,1.0,-0.009016,0.00772,-0.040333,-0.032386,0.307942,0.214944,0.595833,-0.08201,0.104307,0.162981,0.260828,0.239442,1.455544,0.122863,0.484976,0.239442,0.549098
test,transformer_multi,1.0,-0.001516,0.018376,-0.025622,-0.017662,0.516542,0.209027,0.533333,-0.067286,0.097769,0.166936,0.259647,0.238342,1.927739,0.121956,0.472206,0.238342,0.376073
test,transformer_single,1.0,-0.012857,-0.005222,-0.056951,-0.050469,0.184314,0.216646,0.508333,-0.100093,0.097675,0.147079,0.257856,0.22037,1.240141,0.130189,0.48142,0.22037,0.320755
train,lstm,1.0,-0.026309,-0.022146,-0.059703,-0.058985,0.010351,0.199566,0.541667,-0.152143,0.088233,0.135322,0.191231,0.168496,0.965169,0.102151,0.466466,0.168496,0.40841
train,mlp,1.0,0.008707,0.038354,0.026173,0.02787,0.545106,0.174483,0.5125,-0.065288,0.094118,0.168269,0.207351,0.224764,1.797854,0.07704,0.449493,0.224764,0.59439
train,transformer_multi,1.0,0.002256,0.024894,0.015968,0.016017,0.392807,0.184493,0.5,-0.077141,0.091816,0.15843,0.199319,0.210973,1.47334,0.079427,0.470304,0.210973,0.421903
train,transformer_single,1.0,0.006,0.032084,0.022707,0.022211,0.465429,0.17885,0.529167,-0.070947,0.089722,0.153589,0.194515,0.206314,1.537634,0.079801,0.458442,0.206314,0.348825


In [4]:
df

Unnamed: 0,id,symbol,timestamp,date,open,high,low,close,volume,trade_count,...,vwap_change,trade_count_change,sector_id,industry_id,return_1d,vix,vix_norm,sp500,sp500_norm,market_return_1d
624,625,MMM,2024-07-01 04:00:00,2024-07-01,102.86,103.4494,100.2050,100.61,2705605.0,47196.0,...,-0.011600,-0.015499,unknown,unknown,-0.015461,0.1222,-0.017685,54.7509,0.002676,0.002676
625,626,MMM,2024-07-02 04:00:00,2024-07-02,100.56,101.9300,100.4600,101.62,2291274.0,43717.0,...,0.001638,-0.073714,unknown,unknown,0.010039,0.1203,-0.015548,55.0901,0.006195,0.006195
626,627,MMM,2024-07-03 04:00:00,2024-07-03,101.29,102.1500,100.6800,101.62,1230776.0,24937.0,...,0.002640,-0.429581,unknown,unknown,0.000000,0.1209,0.004988,55.3702,0.005084,0.005084
627,628,MMM,2024-07-05 04:00:00,2024-07-05,101.40,101.6600,100.6400,101.32,3059577.0,40548.0,...,-0.003695,0.626018,unknown,unknown,-0.002952,0.1248,0.017945,55.6719,0.005449,0.005449
628,629,MMM,2024-07-08 04:00:00,2024-07-08,101.51,102.7400,100.6200,101.10,2338695.0,37410.0,...,0.001051,-0.077390,unknown,unknown,-0.002171,0.1237,-0.008814,55.7285,0.001017,0.001017
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
429575,429576,SPY,2024-11-22 05:00:00,2024-11-22,593.66,596.1500,593.1525,595.51,38226390.0,346477.0,...,0.004474,-0.230292,unknown,unknown,0.003099,0.1524,-0.096621,59.6934,0.003468,0.003468
429576,429577,SPY,2024-11-25 05:00:00,2024-11-25,599.52,600.8600,595.2000,597.53,42441393.0,427181.0,...,0.004801,0.232927,unknown,unknown,0.003392,0.1460,-0.041995,59.8737,0.003020,0.003020
429577,429578,SPY,2024-11-26 05:00:00,2024-11-26,598.80,601.3300,598.0700,600.65,45621288.0,383149.0,...,0.003447,-0.103076,unknown,unknown,0.005221,0.1410,-0.034247,60.2163,0.005722,0.005722
429578,429579,SPY,2024-11-27 05:00:00,2024-11-27,600.46,600.8500,597.2800,598.83,34000163.0,332766.0,...,-0.001755,-0.131497,unknown,unknown,-0.003030,0.1410,0.000000,59.9874,-0.003801,-0.003801


In [5]:
results_df

Unnamed: 0,split,split_start,split_end,agent,seed,mean_episode_sharpe,mean_episode_sortino,mean_episode_total_reward,mean_cumulative_return,mean_calmar,...,std_episode_sharpe,std_episode_sortino,std_episode_total_reward,std_cumulative_return,std_calmar,std_max_drawdown,std_win_rate,std_alpha,action_counts,action_entropy
0,train,2023-01-01,2023-07-01,mlp,0,-0.017205,0.001035,-0.038135,-0.026913,0.705519,...,0.113786,0.200175,0.240828,0.279376,3.506723,0.102423,0.470162,0.279376,"{2: 0.6909090909090909, 0: 0.18383838383838383...",0.8270308
1,test,2023-07-01,2023-12-01,mlp,0,-0.018533,-0.023882,-0.039154,-0.049955,-0.081704,...,0.074292,0.106762,0.155586,0.154698,0.722285,0.079809,0.512989,0.154698,"{2: 0.7717171717171717, 0: 0.16464646464646465...",0.6722862
2,train,2023-01-01,2023-07-01,mlp,1,0.005478,0.015858,0.018293,0.014823,0.320889,...,0.081378,0.12546,0.180932,0.184627,1.185613,0.075745,0.46169,0.184627,"{1: 0.8171717171717172, 2: 0.09494949494949495...",0.6022453
3,test,2023-07-01,2023-12-01,mlp,1,-0.015,-0.006276,-0.007444,-0.008162,0.236519,...,0.104498,0.162148,0.187967,0.189835,1.210764,0.084998,0.46169,0.189835,"{1: 0.8666666666666667, 2: 0.11717171717171718...",0.4419189
4,train,2023-01-01,2023-07-01,mlp,2,-0.004653,0.017854,-0.008262,-0.013495,0.239796,...,0.100223,0.165116,0.172785,0.168776,1.244307,0.07333,0.379577,0.168776,"{0: 0.7181818181818181, 2: 0.2737373737373737,...",0.6313273
5,test,2023-07-01,2023-12-01,mlp,2,-0.01051,-0.00568,-0.033467,-0.036752,0.056927,...,0.087689,0.151015,0.157102,0.155262,1.234758,0.071223,0.502625,0.155262,"{0: 0.6545454545454545, 2: 0.3378787878787879,...",0.6810181
6,train,2023-01-01,2023-07-01,lstm,0,-0.020974,-0.017355,-0.040627,-0.044903,0.106886,...,0.085466,0.129349,0.173594,0.172272,1.188037,0.085094,0.470162,0.172272,"{2: 0.796969696969697, 1: 0.201010101010101, 0...",0.5158983
7,test,2023-07-01,2023-12-01,lstm,0,-0.002537,0.002093,-0.012489,-0.013603,0.204021,...,0.087695,0.135185,0.20944,0.231678,1.225812,0.088389,0.510418,0.231678,"{2: 0.8747474747474747, 1: 0.1202020202020202,...",0.3984249
8,train,2023-01-01,2023-07-01,lstm,1,-0.002203,0.001115,-0.006251,-0.018041,0.120018,...,0.070781,0.096657,0.134197,0.12642,0.697176,0.075517,0.475727,0.12642,"{2: 0.6757575757575758, 1: 0.3242424242424242}",0.630026
9,test,2023-07-01,2023-12-01,lstm,1,0.007399,0.045904,0.019334,0.021351,0.676753,...,0.112514,0.19623,0.202749,0.201597,1.494856,0.082809,0.340279,0.201597,"{2: 0.751010101010101, 1: 0.24848484848484848,...",0.5648588


In [6]:
episode_sequences

{('train', '2023-01-01', 0): [('SYF', 14),
  ('INVH', 6),
  ('DHI', 0),
  ('APA', 0),
  ('CARR', 18),
  ('MS', 20),
  ('IPG', 13),
  ('WST', 16),
  ('MAA', 12),
  ('LRCX', 21),
  ('DECK', 18),
  ('NKE', 0),
  ('FI', 19),
  ('LHX', 0),
  ('PTC', 16),
  ('SYK', 4),
  ('ANET', 19),
  ('ABNB', 12),
  ('AMAT', 6),
  ('IBM', 9)],
 ('train', '2023-01-01', 1): [('HUBB', 11),
  ('PGR', 21),
  ('LNT', 3),
  ('SPG', 21),
  ('CPAY', 7),
  ('TEL', 9),
  ('DVA', 19),
  ('COST', 9),
  ('MPWR', 12),
  ('ACGL', 0),
  ('TGT', 17),
  ('SBUX', 12),
  ('NOW', 7),
  ('HES', 18),
  ('BBY', 6),
  ('BBY', 10),
  ('WSM', 3),
  ('FRT', 9),
  ('USB', 4),
  ('IPG', 6)],
 ('train', '2023-01-01', 2): [('SBUX', 6),
  ('AXON', 6),
  ('GE', 18),
  ('HSY', 2),
  ('ENPH', 13),
  ('SLB', 16),
  ('ZBRA', 4),
  ('TXT', 1),
  ('LH', 6),
  ('CMG', 15),
  ('DOW', 12),
  ('CRWD', 3),
  ('PPL', 9),
  ('NSC', 15),
  ('VMC', 9),
  ('CLX', 14),
  ('VRTX', 22),
  ('TEL', 15),
  ('FAST', 9),
  ('GOOGL', 4)],
 ('test', '2023-07-01', 0

In [7]:
!pip install plotly

Defaulting to user installation because normal site-packages is not writeable


DEPRECATION: Loading egg at c:\programdata\anaconda3\lib\site-packages\vboxapi-1.0-py3.11.egg is deprecated. pip 23.3 will enforce this behaviour change. A possible replacement is to use pip for package installation..


In [16]:
import plotly.express as px
import plotly.graph_objects as go
import pandas as pd
from IPython.display import display

def plotly_leaderboard(df, metric="mean_episode_sharpe", top_k=3):
    test_df = df[df["split"] == "test"]
    leaderboard = (
        test_df.groupby("agent")[metric]
        .mean()
        .sort_values(ascending=False)
        .reset_index()
    )
    leaderboard['Rank'] = leaderboard[metric].rank(ascending=False, method='min')
    leaderboard = leaderboard.sort_values('Rank')
    fig = px.bar(
        leaderboard.head(top_k),
        x="agent",
        y=metric,
        color="agent",
        text=metric,
        title=f"Top-{top_k} Agents by {metric} (test splits)"
    )
    fig.update_traces(texttemplate='%{text:.3f}', textposition='outside')
    fig.update_layout(yaxis_title=metric, xaxis_title="Agent", showlegend=False)
    fig.show()
    return leaderboard



def plotly_metric_by_agent_and_split(df, metric="mean_episode_sharpe"):
    df = df.copy()
    fig = px.box(
        df[df["split"] == "test"],
        x="agent",
        y=metric,
        points="all",
        color="agent",
        title=f"{metric} distribution per Agent (test splits)",
        hover_data=["seed", "split_start", "split_end"]
    )
    fig.show()



# Interactive policy sanity check (bar for action counts)
def plotly_action_distribution(df):
    import ast
    df = df[df["split"] == "test"].copy()
    all_action_data = []
    for idx, row in df.iterrows():
        counts = row["action_counts"]
        if isinstance(counts, str):
            counts = ast.literal_eval(counts)
        for action, frac in counts.items():
            all_action_data.append({
                "agent": row["agent"],
                "seed": row["seed"],
                "split_start": row["split_start"],
                "split_end": row["split_end"],
                "action": str(action),
                "fraction": frac
            })
    actions_df = pd.DataFrame(all_action_data)
    fig = px.bar(
        actions_df,
        x="action",
        y="fraction",
        color="agent",
        barmode="group",
        facet_col="agent",
        title="Agent Action Distributions (test splits)",
        category_orders={"action": sorted(actions_df["action"].unique())}
    )
    fig.show()




In [18]:
print('MEAN EPISODE SHARPE')
display(plotly_leaderboard(results_df, metric="mean_episode_sharpe", top_k=3))
print('MEAN EPISODE CUMULATIVE RETURN')
display(plotly_leaderboard(results_df, metric="mean_cumulative_return", top_k=3))
plotly_metric_by_agent_and_split(results_df, metric="mean_episode_sharpe")
plotly_metric_by_agent_and_split(results_df, metric="mean_cumulative_return")
plotly_action_distribution(results_df)

MEAN EPISODE SHARPE


Unnamed: 0,agent,mean_episode_sharpe,Rank
0,transformer_multi,-0.001516,1.0
1,mlp,-0.009016,2.0
2,transformer_single,-0.012857,3.0
3,lstm,-0.022839,4.0


MEAN EPISODE CUMULATIVE RETURN


Unnamed: 0,agent,mean_cumulative_return,Rank
0,lstm,-0.015412,1.0
1,transformer_multi,-0.017662,2.0
2,mlp,-0.032386,3.0
3,transformer_single,-0.050469,4.0


In [15]:

display(plotly_leaderboard(results_df, metric="mean_episode_sharpe", top_k=3))

Unnamed: 0,agent,mean_episode_sharpe,Rank
0,transformer_multi,-0.001516,1.0
1,mlp,-0.009016,2.0
2,transformer_single,-0.012857,3.0
3,lstm,-0.022839,4.0


In [None]:
# Bonus test - Episode Boundary learning mode

