# 📊 LTM Test Suite: Learnability, Transferability, and Meta-Evaluation

This project establishes the **foundation for evaluating trading agents** based on their ability to:

1. **Learn effectively** in specific environments (Learnability)
2. **Generalize their knowledge** to new timeframes (Transferability)
3. **Be selected or ranked** using meta-features (Meta-Evaluation)

This evaluation system is central to our long-term goal of creating **regime-aware, introspective, intelligent agents** that know when and where they can succeed.

---

## 🚧 Status

* ✅ **Prototype phase (single-file implementation)**
* ⬜ Modular pipeline with full CLI / batch capabilities
* ⬜ Meta-learning model integration
* ⬜ Curriculum design system

---

## 🧠 Core Concepts

### 🔹 Learnability

> Can an agent learn useful trading behavior in a specific environment?

* Agent is trained on a single episode (e.g., 1 stock in 1 month)
* Evaluated by its normalized episode score (`0–100`)
* Uses fixed seeds and training steps for fair comparison
* Logged across multiple runs to assess robustness
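
Robustness across runs can be summarized per episode. The sketch below is a minimal illustration, not the project's implementation: `summarize_by_episode` is a hypothetical helper, and the sample advantages are copied from the seed 42 and seed 52 rows of the results table recorded later in this notebook.

```python
from collections import defaultdict
from statistics import mean, stdev

def summarize_by_episode(records):
    """Group run records by episode start date and summarize the advantage across seeds."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["start_date"]].append(rec["advantage"])
    summary = {}
    for start_date, values in grouped.items():
        summary[start_date] = {
            "n_runs": len(values),
            "mean_advantage": mean(values),
            # Sample standard deviation needs at least two runs
            "std_advantage": stdev(values) if len(values) > 1 else 0.0,
        }
    return summary

# Advantage = PPO score minus random-agent score, for two episodes and two seeds
runs = [
    {"start_date": "2024-06-17", "seed": 42, "advantage": -3.081588},
    {"start_date": "2024-06-17", "seed": 52, "advantage": 4.795148},
    {"start_date": "2023-06-12", "seed": 42, "advantage": 16.653742},
    {"start_date": "2023-06-12", "seed": 52, "advantage": -7.541683},
]
summary = summarize_by_episode(runs)
for start_date, stats in summary.items():
    print(start_date, stats)
```

A large standard deviation relative to the mean suggests that the advantage on that episode depends more on the seed than on anything the agent learned.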

---

### 🔹 Transferability

> Can the knowledge acquired in one environment be applied to the next one?

* Agent trained on `Month T`, evaluated (or fine-tuned) on `Month T+1`
* Compared against:

  * Random agent baseline
  * Oracle performance
* Transfer success = performance delta relative to training or baseline
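
The transfer quantities can be written down directly. This sketch mirrors the definitions used in the transferability cell of this notebook (`advantage_train`, `advantage_test`, `transfer_delta`, `transfer_success`); the function name `transfer_metrics` and the example scores are hypothetical.

```python
def transfer_metrics(score_train, score_test, rand_train, rand_test):
    """Compare the trained agent with a random baseline on both windows."""
    transfer_delta = score_test - score_train
    return {
        "advantage_train": score_train - rand_train,  # gain over random on Month T
        "advantage_test": score_test - rand_test,     # gain over random on Month T+1
        "transfer_delta": transfer_delta,             # change in score from T to T+1
        "transfer_success": int(transfer_delta > 0),  # 1 only if the test score exceeds the train score
    }

# Made-up scores on the 0-100 scale, for illustration only
result = transfer_metrics(score_train=55.0, score_test=52.0, rand_train=50.0, rand_test=49.0)
print(result)
```

Note that in this made-up example `transfer_success` is 0 even though the agent still beats the random baseline on the test window, so the two readings of "transfer success" (relative to training versus relative to baseline) can disagree.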

---

### 🔹 Meta-Evaluation

> Can we predict which environments are promising *before* training?

* Extracts meta-features from the environment:

  * Volatility, momentum, entropy, Hurst, kurtosis, etc.
* Generates `meta_df.csv` with:

  * `learnability`, `agent advantage`, `transfer_delta` labels
* Enables future predictive modeling and curriculum learning
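
As a rough illustration of what such meta-features look like, the sketch below computes a few of them from a list of closing prices using only the standard library. It is not the project's `extract_meta_features`; the function name `basic_meta_features`, the chosen feature set, and the price series are assumptions made for this example. Hurst exponent and entropy are omitted.

```python
from statistics import mean, pstdev

def basic_meta_features(closes):
    """Compute simple return statistics for one episode window."""
    # Simple period-over-period returns
    returns = [closes[i] / closes[i - 1] - 1 for i in range(1, len(closes))]
    mu = mean(returns)
    sigma = pstdev(returns)  # population standard deviation, used here as volatility
    # Excess kurtosis: fourth central moment over sigma^4, minus 3
    fourth_moment = mean([(r - mu) ** 4 for r in returns])
    kurtosis = fourth_moment / sigma ** 4 - 3 if sigma > 0 else 0.0
    return {
        "mean_return": mu,
        "std_return": sigma,
        "momentum": closes[-1] / closes[0] - 1,  # total return over the window
        "kurtosis_return": kurtosis,
    }

# Made-up closing prices, for illustration only
features = basic_meta_features([100.0, 110.0, 99.0, 108.9])
print(features)
```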

---

## 🧪 Test Protocol

### Episode Sampling

* All episodes:

  * Start on Mondays
  * Have fixed `n_timesteps`
  * Are non-overlapping **or** weekly (depending on config)
* Benchmark episodes are stored in:

  ```
  data/experiments/core_learnability_test/benchmark_episodes.json
  ```
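
The sampling rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's `sample_valid_episodes`: the helper name `monday_episode_starts`, its `non_overlapping` flag, and the synthetic calendar of weekdays are all hypothetical.

```python
from datetime import date, timedelta

def monday_episode_starts(dates, n_timesteps, non_overlapping=True):
    """Return row indices where an episode of n_timesteps rows can start on a Monday."""
    starts = []
    next_free = 0  # first row not covered by the previously accepted episode
    for i, day in enumerate(dates):
        if day.weekday() != 0:            # Monday is weekday 0
            continue
        if i + n_timesteps > len(dates):  # not enough rows left for a full episode
            break
        if non_overlapping and i < next_free:
            continue
        starts.append(i)
        next_free = i + n_timesteps
    return starts

# Synthetic calendar: 20 consecutive weekdays starting Monday 2024-01-01 (holidays ignored)
dates = []
day = date(2024, 1, 1)
while len(dates) < 20:
    if day.weekday() < 5:
        dates.append(day)
    day += timedelta(days=1)

non_overlapping_starts = monday_episode_starts(dates, n_timesteps=7)
weekly_starts = monday_episode_starts(dates, n_timesteps=7, non_overlapping=False)
print(non_overlapping_starts, weekly_starts)
```

With 7-row episodes, the weekly mode lets consecutive episodes share rows, while the non-overlapping mode skips every Monday that falls inside the previous episode.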

---

### Agent Setup

* Agents trained with PPO (via Stable-Baselines3)
* Random agent used as baseline
* Oracle score used as upper bound reference
* Configurations, seeds, and policies are logged and reproducible

---

### Logging & Outputs

| Output                    | Description                                    |
| ------------------------- | ---------------------------------------------- |
| `meta_df.csv`             | Meta-features + labels for each run            |
| `checkpoints/{id}.zip`    | Trained agents saved with unique config hashes |
| `scores/{id}_results.csv` | Per-step and final metrics                     |
| `logs/`                   | Training logs and config snapshots             |
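
The `{id}` in these paths is a hash of the run configuration. The sketch below follows the same approach as `generate_config_hash` in the notebook cells (SHA-256 over JSON with sorted keys); the name `config_id` and the example configs are illustrative. Python's built-in `hash()` is unsuitable for this purpose because string hashes are salted per interpreter process.

```python
import hashlib
import json

def config_id(config):
    """Return a stable hex digest for a JSON-serializable config dict."""
    raw = json.dumps(config, sort_keys=True)  # sorted keys make the digest order-independent
    return hashlib.sha256(raw.encode()).hexdigest()

# Same settings in a different key order give the same id
id_a = config_id({"ticker": "AAPL", "seed": 42, "timesteps": 10000})
id_b = config_id({"timesteps": 10000, "seed": 42, "ticker": "AAPL"})
# Changing any value gives a different id
id_c = config_id({"ticker": "AAPL", "seed": 52, "timesteps": 10000})
print(id_a == id_b, id_a == id_c)
```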

---

## 📁 Project Structure (Coming Soon)

```
ltm_suite/
├── configs/
│   └── benchmark_episodes.json
├── benchmarks/
│   ├── scores/
│   ├── checkpoints/
│   └── logs/
├── results/
│   └── meta_df.csv
├── runners/
│   ├── run_learnability.py
│   ├── run_transferability.py
│   └── run_meta_evaluation.py
├── utils/
│   ├── env_loader.py
│   ├── metrics.py
│   └── logger.py
└── README.md
```

---

## ✅ Success Criteria

* Episode scores consistently above 50 (on the normalized 0–100 scale) = good learning
* Transfer performance better than random baseline = generalization
* Meta-features predictive of good environments = meta-learning success
* All results are reproducible and statistically valid (multiple seeds)
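
A minimal sketch of how the first two criteria might be checked over a batch of runs. The helper `success_report` and its threshold parameter are hypothetical; the sample scores are the five seed 42 rows from the results table recorded later in this notebook, where the random-baseline comparison is made on the training episode itself.

```python
def success_report(records, score_threshold=50.0):
    """Report the share of runs above the score threshold and above the random baseline."""
    n_runs = len(records)
    above = sum(rec["score"] > score_threshold for rec in records)
    beating = sum(rec["score"] > rec["rand_score"] for rec in records)
    return {
        "n_runs": n_runs,
        "frac_above_threshold": above / n_runs,
        "frac_beating_random": beating / n_runs,
    }

runs = [
    {"score": 47.888019, "rand_score": 50.969607},
    {"score": 50.426522, "rand_score": 33.772780},
    {"score": 52.021793, "rand_score": 51.621691},
    {"score": 55.469274, "rand_score": 52.050279},
    {"score": 47.695899, "rand_score": 50.989647},
]
report = success_report(runs)
print(report)
```

On these five runs, three scores exceed 50 and three beat the random agent, which falls short of "consistently" on either criterion.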

---

## 📌 Next Milestone

We now proceed to:

* Implement a single-file prototype of the **Learnability Test**
* Store benchmark episodes
* Save logs, scores, meta-data, and model checkpoints



In [1]:
import jupyter

In [2]:
import random
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


from src.utils.system import boot
from src.data.feature_pipeline import load_base_dataframe
from experiments import check_if_experiment_exists, register_experiment, experiment_hash

# ========== SYSTEM BOOT ==========
DEVICE = boot()
EXPERIMENT_NAME = "core_learnability_test"
DEFAULT_PATH = "data/experiments/" + EXPERIMENT_NAME

# Base OHLCV data shared by every cell below
OHLCV_DF = load_base_dataframe()



In [3]:
# ltm_test_suite.py

import os
import json
import hashlib
import pandas as pd
import numpy as np
from datetime import datetime
from stable_baselines3 import PPO
from stable_baselines3.common.monitor import Monitor
from environments import PositionTradingEnv  # assumed to exist
from data import sample_valid_episodes, extract_meta_features 

# ========== CONFIG ==========
TICKER = "AAPL"
TIMESTEPS = 10_000
EVAL_EPISODES = 5
N_TIMESTEPS = 60
LOOKBACK = 0
SEEDS = [42, 52, 62]
BENCHMARK_PATH = DEFAULT_PATH+"/benchmark_episodes.json"
CHECKPOINT_DIR = DEFAULT_PATH+"/checkpoints"
SCORES_DIR = DEFAULT_PATH+"/scores"
META_PATH = DEFAULT_PATH+"/meta_df.csv"

os.makedirs(CHECKPOINT_DIR, exist_ok=True)
os.makedirs(SCORES_DIR, exist_ok=True)
os.makedirs(os.path.dirname(BENCHMARK_PATH), exist_ok=True)

# ========== UTILITIES ==========
def generate_config_hash(config):
    raw = json.dumps(config, sort_keys=True)
    return hashlib.sha256(raw.encode()).hexdigest()

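def save_model(model, config_hash, config=None):
    """Save the trained model plus a JSON sidecar that identifies the run."""
    path = os.path.join(CHECKPOINT_DIR, f"agent_{config_hash}.zip")
    model.save(path)
    # Write the hash and, when the caller supplies it, the config that produced it
    payload = {"config_hash": config_hash}
    if config is not None:
        payload["config"] = config
    with open(path.replace(".zip", "_config.json"), "w") as f:
        json.dump(payload, f, indent=2)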

# ========== STEP 1: Load Data ==========
print("[INFO] Loading data...")
# Replace this with real OHLCV loading
df = OHLCV_DF.copy()#[OHLCV_DF['symbol']==TICKER].copy()


# ========== STEP 2: Sample Benchmark Episodes ==========
if os.path.exists(BENCHMARK_PATH):
    with open(BENCHMARK_PATH) as f:
        benchmark_episodes = json.load(f)
else:
    print("[INFO] Sampling benchmark episodes...")
    np.random.seed(0)
    benchmark_episodes = sample_valid_episodes(df, TICKER, N_TIMESTEPS, LOOKBACK, EVAL_EPISODES)
    with open(BENCHMARK_PATH, "w") as f:
        json.dump(benchmark_episodes.tolist(), f)  # ← ✅ Convert to list here

# ========== STEP 3: Run Learnability Tests ==========
meta_records = []
for seed in SEEDS:
    for start_idx in benchmark_episodes:
        print(f"[INFO] Running episode from idx {start_idx} with seed {seed}")

        # Prepare Env
        env = Monitor(PositionTradingEnv(df, TICKER, N_TIMESTEPS, LOOKBACK, start_idx=start_idx, seed=seed))
        model = PPO("MlpPolicy", env, verbose=0, seed=seed)

        model.learn(total_timesteps=TIMESTEPS)

        # Evaluate PPO agent (model.predict returns an (action, state) tuple)
        obs, _ = env.reset()
        done, score = False, 0
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            score += reward

        # Evaluate random agent
        obs,_ = env.reset()
        done, rand_score = False, 0
        while not done:
            action = env.action_space.sample()
            obs, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            rand_score += reward

        # Calculate advantage
        advantage = score - rand_score

        # Log meta-data
        start_date = df.loc[start_idx, "date"]
        end_date = df.loc[start_idx + N_TIMESTEPS - 1, "date"]
        config = {
            "ticker": TICKER,
            "start_date": str(start_date),
            "end_date": str(end_date),
            "timesteps": TIMESTEPS,
            "seed": seed
        }
        config_hash = generate_config_hash(config)
        save_model(model, config_hash)

        meta = extract_meta_features(df.iloc[start_idx: start_idx + N_TIMESTEPS])
        meta.update({
            "config_hash": config_hash,
            "score": score,
            "rand_score": rand_score,
            "advantage": advantage,
            "seed": seed,
            "ticker": TICKER,
            "start_date": str(start_date),
            "end_date": str(end_date)
        })
        meta_records.append(meta)

# ========== STEP 4: Save Results ==========
pd.DataFrame(meta_records).to_csv(META_PATH, index=False)
print("[INFO] Learnability test complete. Results saved to:", META_PATH)


[INFO] Loading data...
[INFO] Running episode from idx 615 with seed 42
[INFO] Running episode from idx 360 with seed 42
[INFO] Running episode from idx 528 with seed 42
[INFO] Running episode from idx 71 with seed 42
[INFO] Running episode from idx 355 with seed 42
[INFO] Running episode from idx 615 with seed 52
[INFO] Running episode from idx 360 with seed 52
[INFO] Running episode from idx 528 with seed 52
[INFO] Running episode from idx 71 with seed 52
[INFO] Running episode from idx 355 with seed 52
[INFO] Running episode from idx 615 with seed 62
[INFO] Running episode from idx 360 with seed 62
[INFO] Running episode from idx 528 with seed 62
[INFO] Running episode from idx 71 with seed 62
[INFO] Running episode from idx 355 with seed 62
[INFO] Learnability test complete. Results saved to: data/experiments/core_learnability_test/meta_df.csv


In [4]:
env.reset()

(array([179.58], dtype=float32), {})

In [5]:
pd.DataFrame(meta_records)

Unnamed: 0,mean_return,median_return,std_return,skew_return,kurtosis_return,return_trend,ewm_mean_return,hurst,adf_stat,adf_pval,entropy,config_hash,score,rand_score,advantage,seed,ticker,start_date,end_date
0,0.004833,0.001356,0.031406,6.179051,41.760315,-4.6e-05,9.8e-05,,-1.082511,0.721996,-113.241764,0102bdeb26bc63511016644e0ba4a5c32acf5900b0be01...,47.888019,50.969607,-3.081588,42,AAPL,2024-06-17 00:00:00,2024-09-11 00:00:00
1,0.000966,-0.001028,0.01631,1.277221,2.382294,2.4e-05,0.003653,,-1.873265,0.344741,-381.633496,d7fa37ca295def7018294e1131ba7fc50bf625fcaf70eb...,50.426522,33.77278,16.653742,42,AAPL,2023-06-12 00:00:00,2023-09-06 00:00:00
2,0.0004,0.001081,0.021477,-1.934139,12.180226,-6e-06,0.00074,,-2.149887,0.224933,-169.932861,940eecb07f0d7b52ac2c338618763e931fe487689c1498...,52.021793,51.621691,0.400102,42,AAPL,2024-02-12 00:00:00,2024-05-07 00:00:00
3,-0.002073,-0.001471,0.015908,0.073364,0.387725,-0.000115,-0.00232,,-0.509868,0.890076,-317.293904,219ecf617d05e6dfed3753093cb5d05ccb7004351bb876...,55.469274,52.050279,3.418994,42,AAPL,2022-04-18 00:00:00,2022-07-13 00:00:00
4,0.0014,-0.001028,0.016507,1.203456,2.078534,-2.7e-05,0.008284,,-2.142094,0.227928,-381.657819,bf705023300def9a487a0d89651ae2dd69355637f72c82...,47.695899,50.989647,-3.293747,42,AAPL,2023-06-05 00:00:00,2023-08-29 00:00:00
5,0.004833,0.001356,0.031406,6.179051,41.760315,-4.6e-05,9.8e-05,,-1.082511,0.721996,-113.241764,282ed8c910be27b102ffcb92008a92a2d0c299f92ab5d8...,47.888019,43.092871,4.795148,52,AAPL,2024-06-17 00:00:00,2024-09-11 00:00:00
6,0.000966,-0.001028,0.01631,1.277221,2.382294,2.4e-05,0.003653,,-1.873265,0.344741,-381.633496,f662c1e5e465a291d8c239872164d5e267263f2c30b300...,50.426522,57.968205,-7.541683,52,AAPL,2023-06-12 00:00:00,2023-09-06 00:00:00
7,0.0004,0.001081,0.021477,-1.934139,12.180226,-6e-06,0.00074,,-2.149887,0.224933,-169.932861,8f3dc8c107d5e4937687bf77a9bf800ec4ddbafff34991...,52.021793,51.945177,0.076615,52,AAPL,2024-02-12 00:00:00,2024-05-07 00:00:00
8,-0.002073,-0.001471,0.015908,0.073364,0.387725,-0.000115,-0.00232,,-0.509868,0.890076,-317.293904,b74161c78ffe455dfb0b182903c3079ff091b55ef016cc...,55.469274,51.145251,4.324022,52,AAPL,2022-04-18 00:00:00,2022-07-13 00:00:00
9,0.0014,-0.001028,0.016507,1.203456,2.078534,-2.7e-05,0.008284,,-2.142094,0.227928,-381.657819,42b91762e45bb7d01c88f6adda56fd70c138e3e0589cdc...,47.695899,54.53715,-6.841251,52,AAPL,2023-06-05 00:00:00,2023-08-29 00:00:00


In [6]:


print("[INFO] Loading benchmark episodes...")
with open(BENCHMARK_PATH) as f:
    benchmark_episodes = json.load(f)

meta_records = []

for seed in SEEDS:
    for start_idx in benchmark_episodes:
        print(f"[INFO] Transferability episode: seed={seed}, train_idx={start_idx}")

        # TRAIN on Month T
        env_train = Monitor(PositionTradingEnv(df[df['symbol'] ==TICKER].reset_index(), TICKER, N_TIMESTEPS, LOOKBACK, seed=seed, start_idx=start_idx))
        model = PPO("MlpPolicy", env_train, verbose=0, seed=seed)
        model.learn(total_timesteps=TIMESTEPS)

        # Evaluate PPO on Month T (model.predict returns an (action, state) tuple)
        obs, _ = env_train.reset()
        done, score_train = False, 0
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, _ = env_train.step(action)
            done = terminated or truncated
            score_train += reward

        # Random agent on Month T
        obs, _ = env_train.reset()
        done, rand_train = False, 0
        while not done:
            action = env_train.action_space.sample()
            obs, reward, terminated, truncated, _ = env_train.step(action)
            done = terminated or truncated
            rand_train += reward

        # TEST on the following window ("Month T+1"): it starts N_TIMESTEPS rows
        # after the training start, so the two windows do not overlap
        test_idx = start_idx + N_TIMESTEPS
        ticker_df = df[df['symbol'] == TICKER].reset_index()
        if test_idx + N_TIMESTEPS >= len(ticker_df):
            print("[WARN] Skipping episode: test idx out of range")
            continue

        env_test = Monitor(PositionTradingEnv(ticker_df, TICKER, N_TIMESTEPS, LOOKBACK, seed=seed, start_idx=test_idx))

        # Evaluate the trained agent on Month T+1 without further training
        obs, _ = env_test.reset()
        done, score_test = False, 0
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, _ = env_test.step(action)
            done = terminated or truncated
            score_test += reward

        # Random agent on Month T+1
        obs, _ = env_test.reset()
        done, rand_test = False, 0
        while not done:
            action = env_test.action_space.sample()
            obs, reward, terminated, truncated, _ = env_test.step(action)
            done = terminated or truncated
            rand_test += reward

        advantage_train = score_train - rand_train
        advantage_test = score_test - rand_test
        transfer_delta = score_test - score_train

        config = {
            "ticker": TICKER,
            "train_idx": int(start_idx),
            "test_idx": int(test_idx),
            "timesteps": TIMESTEPS,
            "seed": seed
        }
        config_hash = generate_config_hash(config)
        save_model(model, config_hash)

        features = extract_meta_features(df.iloc[start_idx:start_idx + N_TIMESTEPS])
        features.update({
            "config_hash": config_hash,
            "score_train": score_train,
            "score_test": score_test,
            "advantage_train": advantage_train,
            "advantage_test": advantage_test,
            "transfer_delta": transfer_delta,
            "transfer_success": int(transfer_delta > 0),
            "ticker": TICKER,
            "seed": seed
        })
        meta_records.append(features)

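# Assumed output path: TRANSFER_META_PATH is not defined earlier in the notebook
TRANSFER_META_PATH = DEFAULT_PATH + "/transfer_meta_df.csv"
pd.DataFrame(meta_records).to_csv(TRANSFER_META_PATH, index=False)
print("[INFO] Transferability test complete. Results saved to:", TRANSFER_META_PATH)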



[INFO] Loading benchmark episodes...
[INFO] Transferability episode: seed=42, train_idx=615


ValueError: Provided start_idx 675 is not a valid episode start.
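
The error arises because the test index is computed as `start_idx + 60` (615 + 60 = 675), which does not necessarily land on a start the environment accepts. One possible workaround, sketched here under the assumption that a sorted list of all valid start indices is available (the environment's actual validity rule is not shown in this notebook), is to snap the candidate to the next valid start. The helper `next_valid_start` and the sample indices are hypothetical.

```python
from bisect import bisect_left

def next_valid_start(valid_starts, candidate):
    """Return the first valid start index >= candidate, or None if there is none.

    valid_starts must be sorted in ascending order.
    """
    pos = bisect_left(valid_starts, candidate)
    return valid_starts[pos] if pos < len(valid_starts) else None

# Made-up valid starts, for illustration only
valid_starts = [610, 615, 620, 672, 677]
snapped = next_valid_start(valid_starts, 615 + 60)
print(snapped)
```

Snapping forward rather than backward keeps the test window from overlapping the training window.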

In [None]:
benchmark_episodes


In [None]:
df[df['symbol']=="AAPL"].reset_index().iloc[672 ]

In [None]:

import os
import json
import hashlib
import numpy as np
import pandas as pd
from typing import Callable, Dict, List, Union
from stable_baselines3 import PPO, A2C
from stable_baselines3.common.monitor import Monitor
from scipy.stats import ttest_ind

# Example agent registry
AGENT_REGISTRY = {
    "ppo": PPO,
    "a2c": A2C
}

def compute_additional_metrics(env):
    values = np.array(env.values)
    rewards = np.array(env.rewards)
    actions = np.array(env.actions)

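    returns = pd.Series(values).pct_change().dropna()
    volatility = returns.std()
    # Shannon entropy (in bits) of the empirical action distribution
    action_probs = np.bincount(actions, minlength=2) / len(actions)
    entropy = -np.sum(action_probs * np.log2(action_probs + 1e-9))
    # Worst drop from a running peak, as a negative fraction
    max_drawdown = (values / np.maximum.accumulate(values)).min() - 1
    # Sharpe and Sortino are annualized assuming 252 periods per year
    sharpe = returns.mean() / (returns.std() + 1e-9) * np.sqrt(252)
    sortino = returns.mean() / (returns[returns < 0].std() + 1e-9) * np.sqrt(252)
    # Epsilon is added outside abs() so the denominator stays strictly positive
    calmar = returns.mean() / (abs(max_drawdown) + 1e-9)
    # Steps where value rose while long (1) or fell while holding (0)
    success_trades = np.sum((np.diff(values) > 0) & (actions[1:] == 1)) + np.sum((np.diff(values) < 0) & (actions[1:] == 0))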

    return {
        "volatility": volatility,
        "entropy": entropy,
        "max_drawdown": max_drawdown,
        "sharpe": sharpe,
        "sortino": sortino,
        "calmar": calmar,
        "success_trades": success_trades,
        "action_hold_ratio": np.mean(actions == 0),
        "action_long_ratio": np.mean(actions == 1)
    }

def formalized_learning_evaluation(
    df: pd.DataFrame,
    ticker: str,
    agents: List[str] = ["ppo", "a2c"],
    env_cls: Callable = None,
    env_name: str = "PositionTradingEnv",
    env_version: str = "v0",
    env_config: Dict = None,
    timesteps: int = 10_000,
    eval_episodes: int = 10,
    n_timesteps: int = 60,
    lookback: int = 0,
    seed: int = 42,
    result_path: str = "data/eval/ltm_learnability.csv"
):
    os.makedirs(os.path.dirname(result_path), exist_ok=True)

    # Load previously completed configs
    if os.path.exists(result_path):
        past_df = pd.read_csv(result_path)
        past_configs = set(past_df['config_hash'].unique())
    else:
        past_df = pd.DataFrame()
        past_configs = set()

    # Filter to one ticker and reset to a positional 0..N-1 index so that
    # index arithmetic (start_idx + n_timesteps) and df.loc lookups agree
    df = df[df['symbol'] == ticker].copy()
    df['date'] = pd.to_datetime(df['date'])
    df = df.sort_values('date').reset_index(drop=True)
    # Candidate starts: Mondays followed by a full window of n_timesteps rows
    monday_idx = df.index[df['date'].dt.weekday == 0]
    valid_starts = [int(i) for i in monday_idx if i + n_timesteps < len(df)]
    sampled_starts = np.random.default_rng(seed).choice(valid_starts, size=min(eval_episodes, len(valid_starts)), replace=False)

    all_results = []

    for agent_name in agents:
        agent_cls = AGENT_REGISTRY[agent_name]
        for start_idx in sampled_starts:
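            config = {
                "agent": agent_name,
                "ticker": ticker,
                "start_idx": int(start_idx),  # NumPy integers are not JSON-serializable
                "timesteps": timesteps,
                "n_timesteps": n_timesteps,
                "lookback": lookback,
                "seed": seed,
                "env_name": env_name,
                "env_version": env_version,
                "env_config": env_config or {}
            }
            # Use a stable digest: built-in hash() of a str changes between interpreter
            # runs, which would break the lookup against hashes stored in the CSV
            config_hash = hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()
            if config_hash in past_configs:
                continue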

            env_train = env_cls(df, ticker, n_timesteps, lookback, seed=seed, start_idx=start_idx)
            env_train = Monitor(env_train)
            model = agent_cls("MlpPolicy", env_train, seed=seed, verbose=0)
            model.learn(total_timesteps=timesteps)

            # Evaluate the trained agent (PPO or A2C) on a fresh copy of the episode
            env_eval = env_cls(df, ticker, n_timesteps, lookback, seed=seed, start_idx=start_idx)
            obs, _ = env_eval.reset()
            done, ppo_score = False, 0
            while not done:
                action, _ = model.predict(obs, deterministic=True)
                obs, reward, terminated, truncated, _ = env_eval.step(action)
                done = terminated or truncated
                ppo_score += reward

            metrics = compute_additional_metrics(env_eval)

            # Evaluate a random agent on the same episode as the baseline
            env_rand = env_cls(df, ticker, n_timesteps, lookback, seed=seed, start_idx=start_idx)
            obs, _ = env_rand.reset()
            done, rand_score = False, 0
            while not done:
                action = env_rand.action_space.sample()
                obs, reward, terminated, truncated, _ = env_rand.step(action)
                done = terminated or truncated
                rand_score += reward

            row = {
                "config_hash": config_hash,
                "agent": agent_name,
                "env_name": env_name,
                "env_version": env_version,
                "ticker": ticker,
                "seed": seed,
                "start_idx": start_idx,
                "timesteps": timesteps,
                "ppo_score": ppo_score,
                "rand_score": rand_score,
                "ppo_advantage": ppo_score - rand_score,
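                # Only one rollout per config is scored, so std is 0 and median equals the score
                "ppo_std": np.std([ppo_score]),
                "rand_std": np.std([rand_score]),
                "ppo_median": ppo_score,
                "rand_median": rand_score,
                "train_start_date": df.loc[start_idx, 'date'],
                # Last row of the window, matching the end_date convention of the learnability test
                "train_end_date": df.loc[start_idx + n_timesteps - 1, 'date'],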
                "config_json": json.dumps(config, sort_keys=True),
                **metrics
            }
            all_results.append(row)

    if all_results:
        new_df = pd.DataFrame(all_results)
        combined = pd.concat([past_df, new_df], ignore_index=True)
        combined.to_csv(result_path, index=False)

    return pd.DataFrame(all_results)
