This is gold, Pi — thank you for laying it out so clearly. What you built is already stronger than many academic meta-RL setups. Now let’s **merge the best of your past success with this current Battleground framework** and push it to the next level.

---

## 🎯 Current Goal

**Predict, with ≥75% confidence, whether a given RL training episode is transferable.**

To do that, we need a **rich and expressive feature space** that captures **structure, uncertainty, dynamics, and regime information** — just like your Stock-Month Predictability Study.
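
One way to read the ≥75% target is as a probability threshold on a binary classifier: only act on episodes where the model's top-class probability is at least 0.75, and track how often that happens. The sketch below assumes a "transferable" label of `transfer_delta > 0` and runs on synthetic data. The frame `demo_df`, the toy relationship between `hurst` and `transfer_delta`, and the helper `confident_predictions` are illustrative only and not part of the existing code.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def confident_predictions(proba, threshold=0.75):
    """Mask of rows whose top-class probability meets the threshold."""
    return proba.max(axis=1) >= threshold


# Synthetic stand-in for result_df (values invented for illustration only).
rng = np.random.default_rng(0)
n = 400
demo_df = pd.DataFrame({
    "hurst": rng.uniform(0.3, 0.7, n),
    "volatility": rng.uniform(0.005, 0.05, n),
    "sharpe": rng.normal(0.0, 1.0, n),
})
# Toy relationship so the classifier has something to learn.
demo_df["transfer_delta"] = (demo_df["hurst"] - 0.5) * 0.2 + rng.normal(0, 0.01, n)

X = demo_df[["hurst", "volatility", "sharpe"]]
y = (demo_df["transfer_delta"] > 0).astype(int)  # assumed label definition
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)
mask = confident_predictions(proba, threshold=0.75)

coverage = mask.mean()  # share of episodes the model is confident about
hits = clf.predict(X_te)[mask] == y_te.to_numpy()[mask]
accuracy_when_confident = hits.mean() if mask.any() else float("nan")
print(f"coverage={coverage:.2f}, accuracy_when_confident={accuracy_when_confident:.2f}")
```

The random split is only there to keep the example short. On real episodes the split should be ordered in time, since a random split lets the model see the future.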

---

## ✅ What We Already Have

From your current `result_df`, we already extract:

| Group                 | Features                                                                 |
| --------------------- | ------------------------------------------------------------------------ |
| Statistical Moments   | `mean_return`, `std_return`, `skew_return`, `kurtosis_return`, `entropy` |
| Price Trend           | `return_trend`, `ewm_mean_return`                                        |
| Chaos / Regime        | `hurst`, `adf_stat`, `adf_pval`                                          |
| Risk & Reward Metrics | `volatility`, `max_drawdown`, `sharpe`, `sortino`, `calmar`              |
| Agent Diagnostics     | `success_trades`, `action_hold_ratio`, `action_long_ratio`               |
| Outcome Labels        | `score_train`, `score_test`, `advantage_test`, `transfer_delta`          |

---

## 🔁 Let’s Extend: Feature Suggestions

Below are **12 additional meta-features** (11 table rows) that we can compute **either now or as a next step**, all in PyTorch/NumPy-friendly form:

| Feature Name               | Why Add It?                                                                |
| -------------------------- | -------------------------------------------------------------------------- |
| `resid_std`                | From RF prediction of t+1 returns → measures noise                         |
| `resid_skew`, `resid_kurt` | Shape of the error → asymmetry or tails                                    |
| `resid_acf1`               | Temporal memory in prediction error                                        |
| `ljung_pval`               | Statistical confirmation of noise/randomness                               |
| `cv_r2`                    | Proxy for model learnability/predictability                                |
| `garch_volatility`         | Conditional volatility ⇒ market stress estimation                          |
| `change_point_count`       | Regime switch count (e.g. via ruptures or cusum)                           |
| `rolling_adf_pval`         | Stationarity evolution over time                                           |
| `forecast_entropy`         | Entropy of predictions from RF or AE                                       |
| `price_entropy_peak`       | Local entropy spike detection before regime breaks (good for online usage) |

These build on your previous success and aim at:

* **Residual structure**
* **Volatility structure**
* **Forecast structure**
* **Regime changes**
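
As a rough sketch of the residual-structure group, the snippet below computes `resid_acf1` and `cv_r2` from a return series. It assumes a Random Forest on five lagged returns and uses ordered folds (`TimeSeriesSplit`) for `cv_r2`. The helper names `lag_matrix`, `lag1_autocorr`, and `residual_meta_features` are hypothetical, and the input is synthetic noise rather than real episode data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit, cross_val_score


def lag_matrix(returns, lags=5):
    """Build (X, y): column k-1 of X holds the return k steps before y."""
    returns = np.asarray(returns, dtype=float)
    n = len(returns)
    X = np.column_stack([returns[lags - k : n - k] for k in range(1, lags + 1)])
    y = returns[lags:]
    return X, y


def lag1_autocorr(x):
    """Lag-1 autocorrelation of a series (NaN if the series is constant)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return float(np.dot(x[:-1], x[1:]) / denom) if denom > 0 else float("nan")


def residual_meta_features(returns, lags=5, cv_folds=3, seed=0):
    """Residual and learnability features from an RF on lagged returns."""
    X, y = lag_matrix(returns, lags)
    model = RandomForestRegressor(n_estimators=100, random_state=seed)
    # Out-of-sample R^2 over ordered folds: no fold trains on later data.
    cv_r2 = cross_val_score(
        model, X, y, cv=TimeSeriesSplit(n_splits=cv_folds), scoring="r2"
    ).mean()
    # In-sample residuals; these are optimistic because the forest has seen y.
    residuals = y - model.fit(X, y).predict(X)
    return {
        "resid_std": float(residuals.std()),
        "resid_acf1": lag1_autocorr(residuals),
        "cv_r2": float(cv_r2),
    }


# Synthetic returns, for illustration only.
demo_returns = np.random.default_rng(0).normal(0.0, 0.01, 200)
features = residual_meta_features(demo_returns)
print(features)
```

On pure noise `cv_r2` should hover around or below zero, which is the behavior we want from a learnability proxy.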

---

## 🧠 Architecture Suggestion (Final Plan)

```text
[Episode -> Raw OHLCV]
         |
         v
[Feature Extractor (Meta + Residual + Chaos)]
         |
         v
[Representation Learner (AE, Transformer Encoder, etc)]
         |
         v
[Predictor (Classifier or Ranker)]
         |
         v
[Score: Learnability + Transferability + Difficulty]
```
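
The stages in the diagram can be wired together as a single scikit-learn `Pipeline`. This is only a minimal sketch under stated assumptions: PCA stands in for the representation learner (it is not an autoencoder or a Transformer encoder), logistic regression stands in for the predictor, and the meta-feature matrix is synthetic. The function name `build_transferability_pipeline` is hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler


def build_transferability_pipeline(n_components=4, seed=0):
    """Scale -> compress -> classify, mirroring the diagram's last three stages."""
    return Pipeline([
        ("scale", RobustScaler()),  # robust to heavy-tailed meta-features
        ("represent", PCA(n_components=n_components, random_state=seed)),  # stand-in for AE
        ("predict", LogisticRegression(max_iter=1000)),
    ])


# Synthetic meta-feature matrix: 200 episodes x 10 features (illustration only).
rng = np.random.default_rng(0)
meta_features = rng.normal(size=(200, 10))
labels = (meta_features[:, 0] + 0.5 * meta_features[:, 1] > 0).astype(int)

# Fit on the first 150 episodes, score the remaining 50.
pipe = build_transferability_pipeline().fit(meta_features[:150], labels[:150])
scores = pipe.predict_proba(meta_features[150:])[:, 1]  # P(transferable)
print(scores[:5])
```

Keeping every stage inside one pipeline means the scaler and the representation are fitted on training episodes only, which avoids leaking test statistics into the features.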

---

## ✅ Action Plan (Ordered)

**Phase 1 – Today**

1. ✅ Implement `EpisodeMetaFeatureExtractor` (done or in progress)
2. ✅ Implement `TradingEnvironmentBattleground` (done!)
3. 🔁 Add new diagnostics: `resid_std`, `resid_acf1`, `cv_r2`, `ljung_pval`
4. 🧠 Train `EpisodeTransferabilityPredictor` with new features

**Phase 2 – Next**

5. 🧬 Add `AutoencoderRepresentation` wrapper
6. 🔄 Train contrastive ranking model: "Is A > B?"
7. ⏱ Evaluate with out-of-time validation (e.g., train on 2023 Q1, test on 2023 Q2)
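
For the out-of-time validation step, a minimal sketch of the split is shown below. It assumes the results frame has a date-like column (called `month` here) and uses a small synthetic frame. The helper `out_of_time_split` is hypothetical.

```python
import pandas as pd


def out_of_time_split(df, date_col, train_end, test_end):
    """Train on rows before train_end, test on rows in [train_end, test_end)."""
    dates = pd.to_datetime(df[date_col])
    train_end = pd.Timestamp(train_end)
    test_end = pd.Timestamp(test_end)
    train = df[dates < train_end]
    test = df[(dates >= train_end) & (dates < test_end)]
    return train, test


# Synthetic monthly records for the first half of 2023 (illustration only).
demo = pd.DataFrame({
    "month": pd.date_range("2023-01-01", periods=6, freq="MS"),
    "transfer_delta": [0.02, -0.01, 0.03, -0.02, 0.01, 0.00],
})

# Train on Q1, test on Q2.
train_df, test_df = out_of_time_split(demo, "month", "2023-04-01", "2023-07-01")
print(len(train_df), len(test_df))
```

Because every training row predates every test row, the score reflects how the predictor would behave on months it has not seen.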

---


In [1]:
import jupyter

In [2]:
import random
import numpy as np
import pandas as pd
import seaborn as sns
import gymnasium as gym
import matplotlib.pyplot as plt


from src.utils.system import boot
from src.defaults import RANDOM_SEEDS
from src.data.feature_pipeline import load_base_dataframe
from experiments import check_if_experiment_exists, register_experiment, experiment_hash
from environments import PositionTradingEnv, PositionTradingEnvV1

# ========== SYSTEM BOOT ==========
DEVICE = boot()
EXPERIMENT_NAME = "trading_environment_development"
DEFAULT_PATH = "data/experiments/" + EXPERIMENT_NAME

# ========== CONFIG ==========
TICKER = "AAPL"
TIMESTEPS = 10_000
EVAL_EPISODES = 5
N_TIMESTEPS = 60
LOOKBACK = 0
SEEDS = [42, 52, 62]
MARKET_FEATURES = ['close']
BENCHMARK_PATH = DEFAULT_PATH+"/benchmark_episodes.json"
CHECKPOINT_DIR = DEFAULT_PATH+"/checkpoints"
SCORES_DIR = DEFAULT_PATH+"/scores"
META_PATH = DEFAULT_PATH+"/meta_df_transfer.csv"

MODEL_PATH = CHECKPOINT_DIR+"/episode_quality_model.pkl"
MARKET_FEATURES.sort()
SEEDS.sort()

OHLCV_DF = load_base_dataframe()

In [3]:
EXPERIENCE_NAME = "stock_universe_predictability_selection__MetaFeatures__MetaRlLabeling"
FEATURES_PATH = f"../data/cache/features_{EXPERIENCE_NAME}.pkl"
TARGETS_PATH = f"../data/cache/targets_{EXPERIENCE_NAME}.pkl"
META_PATH = f"../data/cache/meta_{EXPERIENCE_NAME}.pkl"
RL_LABELS_PATH = "../data/cache/meta_rl_labels_stock_universe_predictability_selection__MetaFeatures__MetaRlLabeling__6293649262173480064.pkl"

excluded_tickers=['CEG', 'GEHC', 'GEV', 'KVUE', 'SOLV']
excluded_tickers.sort()
#tickers = TOP2_STOCK_BY_SECTOR

config={
    "regressor":"RandomForestRegressor",
    "n_estimators": 200,
    "random_state":314,
    "transaction_cost":0
}
run_settings={
    "excluded_tickers": excluded_tickers,
    "min_samples": 10,
    "cv_folds": 3,
    "lags": 5,
    "start_date":"2022-01-01",
    "end_date":"2025-01-01",
    "seed":314,
    "episode_length":18,
    "noise_feature_cols": ["return_1d", "volume"]  ,

    "train_steps": 300,
    "min_ep_len" : 18
}

# Config section

In [45]:
# episode_benchmarking_engine.py

import os
import json
import hashlib
import numpy as np
import pandas as pd
from tqdm import tqdm
import matplotlib.pyplot as plt

from typing import Optional, Dict, List
from stable_baselines3 import PPO, A2C
from stable_baselines3.common.monitor import Monitor
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
from sklearn.preprocessing import RobustScaler
from sklearn.ensemble import RandomForestRegressor
from scipy.stats import skew, kurtosis, entropy
from statsmodels.stats.diagnostic import acorr_ljungbox

from src.defaults import RANDOM_SEEDS  # default seeds used by EpisodeBenchmarkingEngine
from environments import PositionTradingEnv, PositionTradingEnvV1, PositionTradingEnvV2

# Updated utility functions

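def generate_lagged_features(df, config):
    """Add 1-day return/volume changes and their lags, then drop rows with NaNs."""
    df = df.copy()
    if config.get('return'):
        df['return_1d'] = df['close'].pct_change()
    if config.get('volume'):
        df['volume_1d'] = df['volume'].pct_change()
    for lag in range(1, config.get('lags', 0) + 1):
        # Only lag columns that exist: a config without 'volume' (as used in
        # compute_additional_diagnostics) would otherwise raise KeyError.
        if 'return_1d' in df.columns:
            df[f'return_lag_{lag}'] = df['return_1d'].shift(lag)
        if 'volume_1d' in df.columns:
            df[f'volume_lag_{lag}'] = df['volume_1d'].shift(lag)
    return df.dropna()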

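def compute_sharpe_ratio(returns):
    returns = np.asarray(returns, dtype=float)
    std = np.std(returns)
    return np.mean(returns) / std if std > 0 else np.nan

def compute_sortino_ratio(returns):
    returns = np.asarray(returns, dtype=float)
    downside = returns[returns < 0]
    # np.std of an empty array is NaN (with a warning), so guard explicitly.
    downside_std = np.std(downside) if downside.size > 0 else 0.0
    return np.mean(returns) / downside_std if downside_std > 0 else np.nan

def compute_calmar_ratio(returns):
    returns = np.asarray(returns, dtype=float)
    if returns.size == 0:
        return np.nan
    cum_returns = np.cumprod(1 + returns)
    # Max drawdown is the largest peak-to-trough loss, not max(cum) - min(cum).
    max_drawdown = np.max(1 - cum_returns / np.maximum.accumulate(cum_returns))
    return np.mean(returns) / max_drawdown if max_drawdown > 0 else np.nan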



# Utility functions with real logic for agent and oracle evaluation
def compute_agent_metrics(model, env, random=False):
    """Roll out one episode (policy or random actions) and summarize the wallet curve."""
    obs, _ = env.reset()
    done, actions = False, []
    while not done:
        action = env.action_space.sample() if random else model.predict(obs, deterministic=True)[0]
        # Gymnasium's step() returns (obs, reward, terminated, truncated, info);
        # the episode is over when either flag is set.
        obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        actions.append(int(action))

    values = np.array(env.env.wallet_progress)
    if len(values) < 2:
        return {}

    # Wallet value minus 1, i.e. cumulative return if the wallet starts at 1
    # (an assumption about the env); these are not per-step returns.
    returns = values - 1  # np.diff(values) / (values[:-1] + 1e-9)
    negative_returns = returns[returns < 0] if np.any(returns < 0) else np.array([1e-9])
    action_probs = np.bincount(actions, minlength=2) / (len(actions) + 1e-9)
    drawdowns = values / np.maximum.accumulate(values)
    max_drawdown = drawdowns.min() - 1
    # Read the counter from the env. The earlier chained assignment
    # (`success_trades = env.env.success_trades = 0`) reset it to zero instead.
    success_trades = getattr(env.env, "success_trades", 0)

    sharpe = 0
    sortino = 0
    if returns.std() != 0:
        sharpe = returns.mean() / (returns.std() + 1e-9) * np.sqrt(252)
        sortino = returns.mean() / (negative_returns.std() + 1e-9) * np.sqrt(252)

    return {
        "reward": values[-1] - values[0],
        "volatility": returns.std(),
        "entropy": -np.sum(action_probs * np.log2(action_probs + 1e-9)),
        "max_drawdown": max_drawdown,
        "sharpe": sharpe,
        "sortino": sortino,
        # Add epsilon outside abs() so a zero drawdown cannot cancel it.
        "calmar": returns.mean() / (abs(max_drawdown) + 1e-9),
        "success_trades": success_trades,
        "action_hold_ratio": np.mean(np.array(actions) == 0),
        "action_long_ratio": np.mean(np.array(actions) == 1),
        "cumulative_return": values[-1] / values[0] - 1 if values[0] != 0 else 0
    }

def compute_oracle_metrics(env):
    """Roll out a look-ahead oracle: go long only when the next price is higher."""
    obs, _ = env.reset()
    done, actions = False, []

    while not done:
        curr_idx = env.env.step_idx
        next_idx = min(curr_idx + 1, len(env.env.prices) - 1)
        curr_price = env.env.prices[curr_idx]
        next_price = env.env.prices[next_idx]
        price_diff = next_price - curr_price
        action = 1 if price_diff > 0 else 0
        # Gymnasium's step() returns (obs, reward, terminated, truncated, info).
        obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        actions.append(action)

    values = np.array(env.env.wallet_progress)
    if len(values) < 2:
        return {}

    # Same convention as compute_agent_metrics (wallet value minus 1), so the
    # oracle and agent ratios are comparable.
    returns = values - 1  # np.diff(values) / (values[:-1] + 1e-9)
    negative_returns = returns[returns < 0] if np.any(returns < 0) else np.array([1e-9])
    action_probs = np.bincount(actions, minlength=2) / (len(actions) + 1e-9)
    drawdowns = values / np.maximum.accumulate(values)
    max_drawdown = drawdowns.min() - 1

    # Read the counter from the env instead of hard-coding 0.
    success_trades = getattr(env.env, "success_trades", 0)

    return {
        "oracle_volatility": returns.std(),
        "oracle_entropy": -np.sum(action_probs * np.log2(action_probs + 1e-9)),
        "oracle_max_drawdown": max_drawdown,
        "oracle_sharpe": returns.mean() / (returns.std() + 1e-9) * np.sqrt(252),
        "oracle_sortino": returns.mean() / (negative_returns.std() + 1e-9) * np.sqrt(252),
        "oracle_calmar": returns.mean() / (abs(max_drawdown) + 1e-9),
        "oracle_success_trades": success_trades,
        "oracle_action_hold_ratio": np.mean(np.array(actions) == 0),
        "oracle_action_long_ratio": np.mean(np.array(actions) == 1),
        "oracle_cumulative_return": values[-1] / values[0] - 1 if values[0] != 0 else 0
    }



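def compute_additional_diagnostics(env):
    """Fit an RF on lagged returns and describe its in-sample residuals."""
    # generate_lagged_features copies the frame and drops NaN rows itself.
    df = generate_lagged_features(env.env.episode_df, {'return': True, 'lags': 5})
    X = df[[f'return_lag_{i}' for i in range(1, 6)]].values
    y = df['return_1d'].values
    if len(y) <= 5:
        # Too few rows to fit the model or run Ljung-Box with 5 lags.
        return {'resid_std': np.nan, 'resid_skew': np.nan,
                'resid_kurtosis': np.nan, 'ljung_pval': np.nan}
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X, y)
    # In-sample residuals: optimistic, since the forest has already seen y.
    residuals = y - model.predict(X)
    ljung_pval = acorr_ljungbox(residuals, lags=[5], return_df=True).iloc[0]['lb_pvalue']
    return {
        'resid_std': residuals.std(),
        'resid_skew': skew(residuals),
        'resid_kurtosis': kurtosis(residuals),
        'ljung_pval': ljung_pval
    }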

class EpisodeBenchmarkingEngine:
    def __init__(self,
                 df: pd.DataFrame,
                 experiment_name: str = "episode_benchmark_engine",
                 agent_classes: Optional[List] = None,
                 seeds: Optional[List[int]] = None,
                 train_steps: Optional[List[int]] = None,
                 n_timesteps: int = 120,
                 lookback: int = 20,
                 min_valid_length: int = 100,
                 market_features: Optional[List[str]] = None,
                 feature_config: Optional[Dict] = None):

        self.df = df.copy()
        self.experiment_name = experiment_name
        # Defaults are built per instance to avoid shared mutable default arguments.
        self.agent_classes = agent_classes or [PPO, A2C]
        self.seeds = seeds or list(RANDOM_SEEDS)
        self.train_steps = train_steps or [10_000]  # , 50_000, 100_000, 200_000
        self.n_timesteps = n_timesteps
        self.lookback = lookback
        self.min_valid_length = min_valid_length
        self.market_features = market_features or ['close', 'volume']
        self.feature_config = feature_config or {'return': True, 'volume': True, 'lags': 5}
        self.envs = [PositionTradingEnv, PositionTradingEnvV1, PositionTradingEnvV2]
        self.base_path = f"data/experiments/{experiment_name}"
        self.checkpoint_path = os.path.join(self.base_path, "checkpoints")
        self.result_path = os.path.join(self.base_path, "meta_df_transfer.csv")
        os.makedirs(self.checkpoint_path, exist_ok=True)

    def _get_config_hash(self, config: Dict) -> str:
        raw = json.dumps(config, sort_keys=True)
        return hashlib.sha256(raw.encode()).hexdigest()

    def _save_model(self, model, config_hash, agent_name, seed):
        model_path = os.path.join(self.checkpoint_path, f"{agent_name}_{config_hash}_{seed}.zip")
        model.save(model_path)

    def _load_seen_hashes(self) -> set:
        if os.path.exists(self.result_path):
            df = pd.read_csv(self.result_path)
            return set(zip(df['config_hash'], df['agent_name'], df['seed']))
        return set()

    def train(self, ticker: str):
        df_ticker = self.df[self.df['symbol'] == ticker].sort_values("date").reset_index(drop=True)
        months = df_ticker['date'].dt.to_period("M").unique()[24:]

        records = []
        seen_hashes = self._load_seen_hashes()
        market_features = self.market_features
        for m in tqdm(months[1:-1], desc=f"Benchmarking {ticker}"):
            train_end_date = pd.Timestamp(m.start_time)
            train_start_idx = df_ticker[df_ticker['date'] < train_end_date].index.max() - self.n_timesteps
            test_start_idx = df_ticker[df_ticker['date'] < train_end_date].index.max() - self.lookback

            if train_start_idx - self.lookback < 0 or test_start_idx + self.n_timesteps >= len(df_ticker):
                continue
            _df = df_ticker.copy() #.iloc[train_start_idx - self.lookback: train_start_idx + self.n_timesteps].copy()
            #df_train = df_ticker.iloc[train_start_idx - self.lookback: train_start_idx + self.n_timesteps].copy()
            #df_test = df_ticker.iloc[test_start_idx - self.lookback: test_start_idx + self.n_timesteps].copy()

            #df_train = generate_lagged_features(df_train, self.feature_config)
            #df_test = generate_lagged_features(df_test, self.feature_config)
            _df = generate_lagged_features(_df, self.feature_config)
            df_train = _df.iloc[: train_start_idx + self.n_timesteps].copy()
            df_test = _df.iloc[: test_start_idx + self.n_timesteps].copy()
            for agent_cls in self.agent_classes:
                for seed in self.seeds:
                    for steps in self.train_steps:
                        for env_cls in self.envs:
                            env_version = 'v'+str(env_cls.__version__)
                            config = {
                                'ticker': ticker,
                                'train_idx': int(train_start_idx),
                                'test_idx': int(test_start_idx),
                                'agent_policy':'MlpPolicy',
                                'agent_name': agent_cls.__name__,
                                'env_version':env_version,
                                'timesteps': steps,
                                'seed': seed,
                                'feature_config': self.feature_config,
                                'market_features':json.dumps(market_features)
                            }
                            config_hash = self._get_config_hash(config)

                            if (config_hash, agent_cls.__name__, seed) in seen_hashes:
                                continue

                            # Build the env from env_cls so the run matches the recorded env_version
                            # (assumes the V1/V2 constructors share PositionTradingEnv's signature).
                            env_train = Monitor(env_cls(df_train, ticker, market_features=market_features, n_timesteps=self.n_timesteps, seed=seed, start_idx=train_start_idx))
                            model = agent_cls("MlpPolicy", env_train, verbose=0, seed=seed)
                            model.learn(total_timesteps=steps)
                            self._save_model(model, config_hash, agent_cls.__name__, seed)

                            train_metrics = compute_agent_metrics(model, env_train)
                            rand_metrics = compute_agent_metrics(None, env_train, random=True)

                            env_test = Monitor(env_cls(df_test, ticker, market_features=market_features, n_timesteps=self.n_timesteps, seed=seed, start_idx=test_start_idx))
                            test_metrics = compute_agent_metrics(model, env_test)
                            rand_test_metrics = compute_agent_metrics(None, env_test, random=True)

                            oracle_train = compute_oracle_metrics(env_train)
                            oracle_test = compute_oracle_metrics(env_test)

                            record = {
                                'ticker': ticker,
                                'config_hash': config_hash,
                                'agent_name': agent_cls.__name__,
                                'env_version':env_version,
                                'seed': seed,
                                'month': m,
                                #'train_reward': train_metrics['reward'],
                                #'test_reward': test_metrics['reward'],
                                #'train_random': rand_metrics['reward'],
                                #'test_random': rand_test_metrics['reward'],
                                'advantage_train': train_metrics['reward'] - rand_metrics['reward'],
                                'advantage_test': test_metrics['reward'] - rand_test_metrics['reward'],
                                'transfer_delta': test_metrics['reward'] - train_metrics['reward'],
                                'market_features':json.dumps(market_features),
                                **{f"train_{k}": v for k, v in train_metrics.items()},
                                **{f"test_{k}": v for k, v in test_metrics.items()},
                                **{f"train_random_{k}": v for k, v in rand_metrics.items()},
                                **{f"test_random_{k}": v for k, v in rand_test_metrics.items()},
                                **{f"oracle_train_{k}": v for k, v in oracle_train.items()},
                                **{f"oracle_test_{k}": v for k, v in oracle_test.items()},
                                #**oracle_train,
                                #**oracle_test,
                                #**test_metrics,
                                **compute_additional_diagnostics(env_test)
                            }
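                            records.append(record)
                            seen_hashes.add((config_hash, agent_cls.__name__, seed))

                # Flush after each agent class, then clear the buffer so the same
                # rows are not appended to the CSV again on the next flush.
                if records:
                    df_new = pd.DataFrame(records)
                    if os.path.exists(self.result_path):
                        df_existing = pd.read_csv(self.result_path)
                        df_all = pd.concat([df_existing, df_new], ignore_index=True)
                    else:
                        df_all = df_new
                    df_all.to_csv(self.result_path, index=False)
                    print(f"[✓] Results saved to {self.result_path}")
                    records.clear()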

    def evaluate(self):
        df = pd.read_csv(self.result_path)
        print("[INFO] Transfer success rate:", (df['transfer_delta'] > 0).mean())
        plt.hist(df['transfer_delta'], bins=50)
        plt.title("Transfer Delta Distribution")
        plt.show()
        return df

    def predict(self):
        print("[TODO] Implement meta-feature-based predictor")

    def report(self):
        print("[TODO] Implement full report generation with diagnostics")


In [46]:
ebm = EpisodeBenchmarkingEngine(OHLCV_DF)

In [47]:
ebm.train('AAPL')

Benchmarking AAPL:   0%|          | 0/16 [06:31<?, ?it/s]


KeyboardInterrupt: 