Absolutely — here's the full **README.md** draft that documents the initial version of our **LTM Test Suite**: Learnability, Transferability, Meta-Evaluation pipeline.

---

# 📊 LTM Test Suite: Learnability, Transferability, and Meta-Evaluation

This project establishes the **foundation for evaluating trading agents** based on their ability to:

1. **Learn effectively** in specific environments (Learnability)
2. **Generalize their knowledge** to new timeframes (Transferability)
3. **Be selected or ranked** using meta-features (Meta-Evaluation)

This evaluation system is central to our long-term goal of creating **regime-aware, introspective, intelligent agents** that know when and where they can succeed.

---

## 🚧 Status

* ✅ **Prototype phase (single-file implementation)**
* ⬜ Modular pipeline with full CLI / batch capabilities
* ⬜ Meta-learning model integration
* ⬜ Curriculum design system

---

## 🧠 Core Concepts

### 🔹 Learnability

> Can an agent learn useful trading behavior in a specific environment?

* Agent is trained on a single episode (e.g., 1 stock in 1 month)
* Evaluated by its normalized episode score (`0–100`)
* Uses fixed seeds and training steps for fair comparison
* Logged across multiple runs to assess robustness
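
The robustness bullet can be illustrated with a small aggregation step. This is only a sketch: it assumes each run produces one normalized score in `0–100` per seed, and `summarize_runs` and its output keys are hypothetical names, not part of the suite.

```python
from statistics import mean, stdev


def summarize_runs(scores_by_seed):
    """Summarize normalized episode scores (0-100) collected over several seeds."""
    scores = list(scores_by_seed.values())
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate the spread")
    return {
        "n_runs": len(scores),
        "mean_score": mean(scores),
        "std_score": stdev(scores),  # sample standard deviation across seeds
        "min_score": min(scores),
        "max_score": max(scores),
    }


# Hypothetical scores for one episode trained with three fixed seeds.
summary = summarize_runs({0: 62.0, 1: 58.0, 2: 60.0})
print(summary)
```

A low `std_score` relative to `mean_score` suggests the result does not hinge on one lucky seed.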

---

### 🔹 Transferability

> Can the knowledge acquired in one environment be applied to the next one?

* Agent trained on `Month T`, evaluated (or fine-tuned) on `Month T+1`
* Compared against:

  * Random agent baseline
  * Oracle performance
* Transfer success = performance delta relative to training or baseline
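
One way to turn "performance delta relative to training or baseline" into numbers is sketched below. It assumes all four scores sit on the same normalized scale; `transfer_report`, its keys, and the "share of the oracle gap closed" ratio are illustrative assumptions rather than the suite's definition.

```python
def transfer_report(train_score, next_score, random_score, oracle_score):
    """Compare an agent's score on Month T+1 against three reference scores.

    All scores are assumed to be on the same normalized 0-100 scale.
    """
    delta_vs_train = next_score - train_score
    delta_vs_random = next_score - random_score
    oracle_gap = oracle_score - random_score
    # Share of the random-to-oracle gap that the agent closes (None if there is no gap).
    gap_closed = delta_vs_random / oracle_gap if oracle_gap > 0 else None
    return {
        "delta_vs_train": delta_vs_train,
        "delta_vs_random": delta_vs_random,
        "oracle_gap_closed": gap_closed,
        "beats_random": delta_vs_random > 0,
    }


# Hypothetical scores: trained on Month T, evaluated on Month T+1.
report = transfer_report(train_score=65.0, next_score=55.0,
                         random_score=40.0, oracle_score=100.0)
print(report)
```

In this made-up example the agent loses 10 points relative to its training month but still beats the random baseline by 15, which would count as generalization under the success criteria of this README.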

---

### 🔹 Meta-Evaluation

> Can we predict which environments are promising *before* training?

* Extracts meta-features from the environment:

  * Volatility, momentum, entropy, Hurst, kurtosis, etc.
* Generates `meta_df.csv` with:

  * `learnability`, `agent_advantage`, `transfer_delta` labels
* Enables future predictive modeling and curriculum learning
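
A minimal sketch of the feature-extraction step follows, assuming the input is a series of strictly positive closing prices. `extract_meta_features` and its exact definitions (log returns, histogram-based Shannon entropy, excess kurtosis) are assumptions made for illustration; the Hurst exponent is left out because it needs a longer estimator.

```python
import numpy as np


def extract_meta_features(prices, n_bins=10):
    """Compute a few descriptive statistics of a price path before any training."""
    prices = np.asarray(prices, dtype=float)
    if prices.size < 3:
        raise ValueError("need at least three prices")
    returns = np.diff(np.log(prices))  # log returns; prices must be positive

    centered = returns - returns.mean()
    variance = np.mean(centered ** 2)
    # Excess kurtosis (0 for a normal distribution); undefined for constant returns.
    kurtosis = np.mean(centered ** 4) / variance ** 2 - 3.0 if variance > 0 else float("nan")

    # Shannon entropy (bits) of the binned return distribution.
    counts, _ = np.histogram(returns, bins=n_bins)
    probs = counts[counts > 0] / counts.sum()
    entropy = -np.sum(probs * np.log2(probs))

    return {
        "volatility": float(returns.std(ddof=1)),
        "momentum": float(prices[-1] / prices[0] - 1.0),  # total return over the window
        "kurtosis": float(kurtosis),
        "entropy": float(entropy),
    }


# Hypothetical closing prices for one short episode.
features = extract_meta_features([100, 101, 99, 102, 104, 103, 105])
print(features)
```

Each episode would contribute one such row to `meta_df.csv`, joined with its labels once training has finished.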

---

## 🧪 Test Protocol

### Episode Sampling

* All episodes:

  * Start on Mondays
  * Have fixed `n_timesteps`
  * Are non-overlapping **or** weekly (depending on config)
* Benchmark episodes are stored in:

  ```
  data/experiments/learnability_test/benchmark_episodes.json
  ```
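
The sampling rules above can be sketched as follows. This assumes a sorted list of trading days and expresses each episode as a half-open index range; `sample_episodes` and the `non_overlapping` flag are hypothetical names standing in for whatever the config exposes.

```python
from datetime import date, timedelta


def sample_episodes(trading_days, n_timesteps, non_overlapping=True):
    """Return (start, end) index pairs for episodes that begin on a Monday."""
    episodes = []
    next_free = 0  # first index not yet covered by an accepted episode
    for i, day in enumerate(trading_days):
        if day.weekday() != 0:  # Monday is weekday 0
            continue
        if i + n_timesteps > len(trading_days):
            break  # not enough data left for a full-length episode
        if non_overlapping and i < next_free:
            continue
        episodes.append((i, i + n_timesteps))
        next_free = i + n_timesteps
    return episodes


# Hypothetical calendar: six weeks of weekdays starting Monday 2024-01-01.
start = date(2024, 1, 1)
calendar = [start + timedelta(days=d) for d in range(42)]
trading_days = [d for d in calendar if d.weekday() < 5]

non_overlapping = sample_episodes(trading_days, n_timesteps=10)
weekly = sample_episodes(trading_days, n_timesteps=10, non_overlapping=False)
print(non_overlapping)
print(len(weekly))
```

With a real calendar, a week whose Monday is a market holiday simply yields no episode in this sketch.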

---

### Agent Setup

* Agents trained with PPO (via Stable-Baselines3)
* Random agent used as baseline
* Oracle score used as upper bound reference
* Configurations, seeds, and policies are logged and reproducible

---

### Logging & Outputs

| Output                    | Description                                    |
| ------------------------- | ---------------------------------------------- |
| `meta_df.csv`             | Meta-features + labels for each run            |
| `checkpoints/{id}.zip`    | Trained agents saved with unique config hashes |
| `scores/{id}_results.csv` | Per-step and final metrics                     |
| `logs/`                   | Training logs and config snapshots             |
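
The `{id}` in these paths is described as a unique config hash. One possible way to derive such an identifier is sketched below; `config_hash` and `output_paths` are hypothetical helpers and may differ from the project's own `experiment_hash`.

```python
import hashlib
import json


def config_hash(config, length=12):
    """Derive a short, deterministic identifier from a JSON-serializable config."""
    # sort_keys makes the hash independent of dictionary key order.
    payload = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:length]


def output_paths(config):
    """Map a run config to the checkpoint and score paths listed in the table."""
    run_id = config_hash(config)
    return {
        "checkpoint": f"checkpoints/{run_id}.zip",
        "scores": f"scores/{run_id}_results.csv",
    }


# Hypothetical run configuration.
config = {"ticker": "AAPL", "seed": 0, "total_timesteps": 10000}
paths = output_paths(config)
print(paths)
```

Because the identifier depends only on the config contents, rerunning the same configuration resolves to the same files, while changing the seed produces a new pair.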

---

## 📁 Project Structure (Coming Soon)

```
ltm_suite/
├── configs/
│   └── benchmark_episodes.json
├── benchmarks/
│   ├── scores/
│   ├── checkpoints/
│   └── logs/
├── results/
│   └── meta_df.csv
├── runners/
│   ├── run_learnability.py
│   ├── run_transferability.py
│   └── run_meta_evaluation.py
├── utils/
│   ├── env_loader.py
│   ├── metrics.py
│   └── logger.py
└── README.md
```

---

## ✅ Success Criteria

* Episode scores consistently above 50 (on the normalized `0–100` scale) = good learning
* Transfer performance better than random baseline = generalization
* Meta-features predictive of good environments = meta-learning success
* All results are reproducible and statistically valid (multiple seeds)

---

## 📌 Next Milestone

We now proceed to:

* Implement a single-file prototype for **Learnability Test**
* Store benchmark episodes
* Save logs, scores, meta-data, and model checkpoints

---

Let me know if you want to add/change anything before we start the implementation.


In [1]:
import jupyter

In [2]:
import os
import random
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from src.utils.system import boot
from src.data.feature_pipeline import load_base_dataframe
from experiments import check_if_experiment_exists, register_experiment, experiment_hash
from notebooks.environments import PositionTradingEnv, PositionTradingEnvV1, PositionTradingEnvV2

# ========== SYSTEM BOOT ==========
DEVICE = boot()
EXPERIMENT_NAME = "environment_test_battery"
DEFAULT_PATH = "data/experiments/" + EXPERIMENT_NAME
OHLCV_DF = load_base_dataframe()



In [3]:
"""
Compare flat vs long vs random actions.

Confirm episode reward ranges.

Plot price paths and agent decisions.

Check correlation between reward and actual trading logic.
"""

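BASE_ENV_KWARGS = {
    "full_df": OHLCV_DF,
    "ticker": "AAPL",
    "market_features": ["close"],
    "start_idx": 100,
}


class EnvironmentTestBattery:
    """Sanity checks for the reward system of a trading environment."""

    def __init__(self, env_cls, env_kwargs=None):
        self.env_cls = env_cls
        # Copy the defaults so that instances never share one mutable dict.
        self.env_kwargs = dict(BASE_ENV_KWARGS) if env_kwargs is None else env_kwargs
        # Storage
        self.results_file = os.path.join(DEFAULT_PATH, "results.csv")
        self.load_results()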
        
        # Checklist
        self.checklist = {
            "rewards":{
                "flat_long_random_actions":False,
                "oracle_weight_sum":False,
                "anti_oracle_weight_sum":False,
              #  "price_reward_correlation":False,
                "trade_reward_correlation":False
            }
        }
        

    def reset(self):
        for key in self.checklist['rewards']:
            self.checklist['rewards'][key] = False
        
    def run_validators(self, n_episodes=10):
        """
        Environment behaviour requirements:
        * Reward system
          * Oracle rewards must sum to 100 per episode (absolute step weights sum to 1)
          * Anti-oracle rewards must sum to -100 per episode
          * Agent performance must be statistically aligned with rewards
        """
        self.compare_flat_long_random_actions(n_episodes)
        self.validate_reward_weight_sums(n_episodes)
        #self.validate_price_reward_correlation(n_episodes)
        self.validate_reward_trade_logic_correlation(n_episodes)
        return self.checklist
    
    def load_results(self):
        if os.path.exists(self.results_file):
            self.results_df= pd.read_csv(self.results_file)
        else:
            self.results_df= pd.DataFrame()

    
    # HELPER METHODS ============================================
    def collect_rewards(self, policy_fn, n_episodes):
        """Roll out `policy_fn` for `n_episodes` fresh environments.

        Returns flat arrays (concatenated over episodes) of rewards, prices,
        actions, and the per-step oracle / anti-oracle reward bounds.
        """
        all_rewards = []
        all_prices = []
        all_actions = []
        oracle = []
        anti_oracle = []

        for _ in range(n_episodes):
            env = self.env_cls(**self.env_kwargs)
            obs, _ = env.reset()
            done = False
            while not done:
                action = policy_fn(env)
                obs, reward, terminated, truncated, _ = env.step(action)
                done = terminated or truncated
                all_rewards.append(reward)
                all_prices.append(env.prices[env.step_idx])
                all_actions.append(action)
                # The best action at a step earns +|reward|, the worst earns -|reward|.
                oracle.append(abs(reward))
                anti_oracle.append(-abs(reward))

        return (
            np.array(all_rewards),
            np.array(all_prices),
            np.array(all_actions),
            np.array(oracle),
            np.array(anti_oracle),
        )

    def compare_flat_long_random_actions(self, n_episodes=10):
        def always_flat(env): return 0
        def always_long(env): return 1
        def random_action(env): return np.random.choice([0, 1])

        flat_rewards, _, _, oracle, anti_oracle = self.collect_rewards(always_flat, n_episodes)
        long_rewards, _, _, _, _ = self.collect_rewards(always_long, n_episodes)
        random_rewards, _, _, _, _ = self.collect_rewards(random_action, n_episodes)

        flat_total, long_total = np.sum(flat_rewards), np.sum(long_rewards)
        random_total, oracle_total = np.sum(random_rewards), np.sum(oracle)
        # Guard against a zero oracle total before forming the ratio.
        random_ratio = random_total / oracle_total if oracle_total != 0 else np.nan

        summary = {
            'flat': flat_total,
            'long': long_total,
            'random': random_total,
            'oracle': oracle_total,
            'anti_oracle': np.sum(anti_oracle),
            'validations': {
                # Flat and long totals should cancel; isclose avoids exact float equality.
                'flat_long': bool(np.isclose(flat_total + long_total, 0.0, atol=1e-6)),
                # A random policy must stay inside the anti-oracle/oracle band.
                'between_oracle': bool(-1 < random_ratio <= 1),
            }
        }
        self.checklist['rewards']['flat_long_random_actions'] = (
            summary['validations']['flat_long'] and summary['validations']['between_oracle']
        )

        
    def validate_reward_weight_sums(self, n_episodes=5):
        oracle_sum = 0.0
        anti_sum = 0.0

        for _ in range(n_episodes):
            env = self.env_cls(**self.env_kwargs)
            env.reset()
            # The oracle earns |weight| * 100 at every step; the anti-oracle earns the negative.
            for i in range(len(env.prices) - 1):
                oracle = abs(env.step_weights[i]) * 100
                oracle_sum += oracle
                anti_sum -= oracle

        # Each episode should contribute +100 (oracle) and -100 (anti-oracle).
        oracle_ok = bool(np.isclose(oracle_sum, 100 * n_episodes, atol=1e-1))
        anti_ok = bool(np.isclose(anti_sum, -100 * n_episodes, atol=1e-1))
        self.checklist['rewards']['oracle_weight_sum'] = oracle_ok
        self.checklist['rewards']['anti_oracle_weight_sum'] = anti_ok

    def validate_price_reward_correlation(self, n_episodes=5):
        # Not called from run_validators at the moment; uses an always-flat policy.
        def always_flat(env): return 0
        rewards, prices, _, _, _ = self.collect_rewards(always_flat, n_episodes)
        if len(rewards) < 2:
            self.checklist['rewards']['price_reward_correlation'] = False
            return

        # Prices are concatenated over episodes, so the diff also spans episode boundaries.
        returns = np.diff(prices) / prices[:-1]
        rewards = rewards[:-1]  # align with returns
        corr = np.corrcoef(returns, rewards)[0, 1]
        # Stores the absolute correlation itself; no pass/fail threshold is applied.
        self.checklist['rewards']['price_reward_correlation'] = abs(corr)
        

    def validate_reward_trade_logic_correlation(self, n_episodes=5):
        def oracle_action(env):  # look one step ahead; go long only if the price rises
            curr = env.prices[env.step_idx]
            next_ = env.prices[min(env.step_idx + 1, len(env.prices) - 1)]
            return int(next_ > curr)

        rewards, _, actions, _, _ = self.collect_rewards(oracle_action, n_episodes)
        action_changes = np.abs(np.diff(actions))
        # Pass if the oracle earns a positive total and actually switches positions.
        self.checklist['rewards']['trade_reward_correlation'] = bool(
            np.sum(rewards) > 0 and np.sum(action_changes) > 0
        )

    

env_test_check = EnvironmentTestBattery(PositionTradingEnvV2)
env_test_check.run_validators()

{'rewards': {'flat_long_random_actions': True,
  'oracle_weight_sum': True,
  'anti_oracle_weight_sum': True,
  'trade_reward_correlation': True}}
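
To make the invariants behind these checks easier to see, here is a toy environment with one reward scheme that satisfies them. It is a hypothetical stand-in, not `PositionTradingEnvV2`: the class `ToyPositionEnv`, the helper `run_policy`, and the rule "long earns the signed step weight, flat earns its negative" are assumptions chosen only so that flat and long cancel and the oracle total is 100.

```python
class ToyPositionEnv:
    """Hypothetical env exposing the attributes the test battery reads."""

    def __init__(self, prices):
        self.prices = [float(p) for p in prices]
        diffs = [b - a for a, b in zip(self.prices, self.prices[1:])]
        total_move = sum(abs(d) for d in diffs)
        if total_move == 0:
            raise ValueError("prices must move at least once")
        # Signed weights whose absolute values sum to 1 over the episode.
        self.step_weights = [d / total_move for d in diffs]
        self.step_idx = 0

    def reset(self):
        self.step_idx = 0
        return self.prices[0], {}

    def step(self, action):
        weight = self.step_weights[self.step_idx]
        # Long (1) earns the signed weight; flat (0) earns its negative.
        reward = 100.0 * weight if action == 1 else -100.0 * weight
        self.step_idx += 1
        terminated = self.step_idx >= len(self.prices) - 1
        return self.prices[self.step_idx], reward, terminated, False, {}


def run_policy(env, policy_fn):
    """Play one episode and return the total reward."""
    env.reset()
    total, done = 0.0, False
    while not done:
        _, reward, terminated, truncated, _ = env.step(policy_fn(env))
        total += reward
        done = terminated or truncated
    return total


def oracle_action(env):
    # Peek at the next price and go long only if it rises.
    return int(env.prices[env.step_idx + 1] > env.prices[env.step_idx])


prices = [100, 102, 101, 105, 104]
flat_total = run_policy(ToyPositionEnv(prices), lambda env: 0)
long_total = run_policy(ToyPositionEnv(prices), lambda env: 1)
oracle_total = run_policy(ToyPositionEnv(prices), oracle_action)
print(flat_total, long_total, oracle_total)
```

Under this scheme the flat and long totals are mirror images and the oracle collects the full 100 points, which is the pattern the `flat_long_random_actions` and `oracle_weight_sum` checks look for.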