Absolutely — here's the full **README.md** draft that documents the initial version of our **LTM Test Suite**: Learnability, Transferability, Meta-Evaluation pipeline.

---

# 📊 LTM Test Suite: Learnability, Transferability, and Meta-Evaluation

This project establishes the **foundation for evaluating trading agents** based on their ability to:

1. **Learn effectively** in specific environments (Learnability)
2. **Generalize their knowledge** to new timeframes (Transferability)
3. **Be selected or ranked** using meta-features (Meta-Evaluation)

This evaluation system is central to our long-term goal of creating **regime-aware, introspective, intelligent agents** that know when and where they can succeed.

---

## 🚧 Status

✅ **Prototype phase (single-file implementation)**
⬜ Modular pipeline with full CLI / batch capabilities
⬜ Meta-learning model integration
⬜ Curriculum design system

---

## 🧠 Core Concepts

### 🔹 Learnability

> Can an agent learn useful trading behavior in a specific environment?

* Agent is trained on a single episode (e.g., 1 stock in 1 month)
* Evaluated by its normalized episode score (`0–100`)
* Uses fixed seeds and training steps for fair comparison
* Logged across multiple runs to assess robustness

---

### 🔹 Transferability

> Can the knowledge acquired in one environment be applied to the next one?

* Agent trained on `Month T`, evaluated (or fine-tuned) on `Month T+1`
* Compared against:

  * Random agent baseline
  * Oracle performance
* Transfer success = performance delta relative to training or baseline

---

### 🔹 Meta-Evaluation

> Can we predict which environments are promising *before* training?

* Extracts meta-features from the environment:

  * Volatility, momentum, entropy, Hurst, kurtosis, etc.
* Generates `meta_df.csv` with:

  * `learnability`, `agent advantage`, `transfer_delta` labels
* Enables future predictive modeling and curriculum learning

---

## 🧪 Test Protocol

### Episode Sampling

* All episodes:

  * Start on Mondays
  * Have fixed `n_timesteps`
  * Are non-overlapping **or** weekly (depending on config)
* Benchmark episodes are stored in:

  ```
  data/experiments/learnability_test/benchmark_episodes.json
  ```

---

### Agent Setup

* Agents trained with PPO (via Stable-Baselines3)
* Random agent used as baseline
* Oracle score used as upper bound reference
* Configurations, seeds, and policies are logged and reproducible

---

### Logging & Outputs

| Output                    | Description                                    |
| ------------------------- | ---------------------------------------------- |
| `meta_df.csv`             | Meta-features + labels for each run            |
| `checkpoints/{id}.zip`    | Trained agents saved with unique config hashes |
| `scores/{id}_results.csv` | Per-step and final metrics                     |
| `logs/`                   | Training logs and config snapshots             |

---

## 📁 Project Structure (Coming Soon)

```
ltm_suite/
├── configs/
│   └── benchmark_episodes.json
├── benchmarks/
│   └── scores/
│   └── checkpoints/
│   └── logs/
├── results/
│   └── meta_df.csv
├── runners/
│   ├── run_learnability.py
│   ├── run_transferability.py
│   └── run_meta_evaluation.py
├── utils/
│   └── env_loader.py
│   └── metrics.py
│   └── logger.py
└── README.md
```

---

## ✅ Success Criteria

* Episode scores consistently above 50% = good learning
* Transfer performance better than random baseline = generalization
* Meta-features predictive of good environments = meta-learning success
* All results are reproducible and statistically valid (multiple seeds)

---

## 📌 Next Milestone

We now proceed to:

* Implement a single-file prototype for **Learnability Test**
* Store benchmark episodes
* Save logs, scores, meta-data, and model checkpoints

---

Let me know if you want to add/change anything before we start the implementation.


In [1]:
import jupyter

In [2]:
import random
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


from src.utils.system import boot
from src.data.feature_pipeline import load_base_dataframe
from experiments import check_if_experiment_exists, register_experiment ,experiment_hash

# ========== SYSTEM BOOT ==========
DEVICE = boot()
EXPERIMENT_NAME = "core_learnability_test"
DEFAULT_PATH = "data/experiments/" + EXPERIMENT_NAME



DEVICE = boot()




OHLCV_DF = load_base_dataframe()

  from pandas.core import (


In [4]:
# ltm_test_suite.py

import os
import json
import hashlib
import pandas as pd
import numpy as np
from datetime import datetime
from stable_baselines3 import PPO
from stable_baselines3.common.monitor import Monitor
from environments import PositionTradingEnv  # assumed to exist
from data import sample_valid_episodes, extract_meta_features 

# ========== CONFIG ==========
TICKER = "AAPL"
TIMESTEPS = 10_000
EVAL_EPISODES = 5
N_TIMESTEPS = 60
LOOKBACK = 0
SEEDS = [42, 52, 62]
BENCHMARK_PATH = DEFAULT_PATH+"/benchmark_episodes.json"
CHECKPOINT_DIR = DEFAULT_PATH+"/checkpoints"
SCORES_DIR = DEFAULT_PATH+"/scores"
META_PATH = DEFAULT_PATH+"/meta_df.csv"

os.makedirs(CHECKPOINT_DIR, exist_ok=True)
os.makedirs(SCORES_DIR, exist_ok=True)
os.makedirs(os.path.dirname(BENCHMARK_PATH), exist_ok=True)

# ========== UTILITIES ==========
def generate_config_hash(config):
    raw = json.dumps(config, sort_keys=True)
    return hashlib.sha256(raw.encode()).hexdigest()

def save_model(model, config_hash):
    path = os.path.join(CHECKPOINT_DIR, f"agent_{config_hash}.zip")
    model.save(path)
    with open(path.replace(".zip", "_config.json"), "w") as f:
        json.dump(config_hash, f, indent=2)

# ========== STEP 1: Load Data ==========
print("[INFO] Loading data...")
# Replace this with real OHLCV loading
df = OHLCV_DF.copy()#[OHLCV_DF['symbol']==TICKER].copy()


# ========== STEP 2: Sample Benchmark Episodes ==========
if os.path.exists(BENCHMARK_PATH):
    with open(BENCHMARK_PATH) as f:
        benchmark_episodes = json.load(f)
else:
    print("[INFO] Sampling benchmark episodes...")
    np.random.seed(0)
    benchmark_episodes = sample_valid_episodes(df, TICKER, N_TIMESTEPS, LOOKBACK, EVAL_EPISODES)
    with open(BENCHMARK_PATH, "w") as f:
        json.dump(benchmark_episodes.tolist(), f)  # ← ✅ Convert to list here

# ========== STEP 3: Run Learnability Tests ==========
meta_records = []
for seed in SEEDS:
    for start_idx in benchmark_episodes:
        print(f"[INFO] Running episode from idx {start_idx} with seed {seed}")

        # Prepare Env
        env = Monitor(PositionTradingEnv(df, TICKER, N_TIMESTEPS, LOOKBACK, start_idx=start_idx, seed=seed))
        model = PPO("MlpPolicy", env, verbose=0, seed=seed)

        model.learn(total_timesteps=TIMESTEPS)

        # Evaluate PPO agent
        obs,_ = env.reset()
        done, score = False, 0
        while not done:
            action = model.predict(obs, deterministic=True)
            obs, reward, done,terminated, info = env.step(action)
            
            score += reward

        # Evaluate random agent
        obs,_ = env.reset()
        done, rand_score = False, 0
        while not done:
            action = env.action_space.sample()
            obs, reward, done,terminated, info = env.step(action)
            rand_score += reward

        # Calculate advantage
        advantage = score - rand_score

        # Log meta-data
        start_date = df.loc[start_idx, "date"]
        end_date = df.loc[start_idx + N_TIMESTEPS - 1, "date"]
        config = {
            "ticker": TICKER,
            "start_date": str(start_date),
            "end_date": str(end_date),
            "timesteps": TIMESTEPS,
            "seed": seed
        }
        config_hash = generate_config_hash(config)
        save_model(model, config_hash)

        #meta = extract_meta_features(df.iloc[start_idx: start_idx + N_TIMESTEPS])
        meta.update({
            "config_hash": config_hash,
            "score": score,
            "rand_score": rand_score,
            "advantage": advantage,
            "seed": seed,
            "ticker": TICKER,
            "start_date": str(start_date),
            "end_date": str(end_date)
        })
        meta_records.append(meta)

# ========== STEP 4: Save Results ==========
pd.DataFrame(meta_records).to_csv(META_PATH, index=False)
print("[INFO] Learnability test complete. Results saved to:", META_PATH)


[INFO] Loading data...
[INFO] Running episode from idx 615 with seed 42


ValueError: You have passed a tuple to the predict() function instead of a Numpy array or a Dict. You are probably mixing Gym API with SB3 VecEnv API: `obs, info = env.reset()` (Gym) vs `obs = vec_env.reset()` (SB3 VecEnv). See related issue https://github.com/DLR-RM/stable-baselines3/issues/1694 and documentation for more information: https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecenv-api-vs-gym-api

In [5]:
env.reset()

(array([216.67], dtype=float32), {})