Fantastic — your answers make it clear: we're building a **free-thinking, risk-aware, market-grounded learning trader**, with the potential to become not just profitable, but deeply **adaptive** and **introspective**. This will be like training a real human trader, starting simple and layering on cognition and tools.

---

## ✅ Summary of Your Vision

| Category              | Your Decision                                                          |
| --------------------- | ---------------------------------------------------------------------- |
| **Self-Awareness**    | Track PnL, trade history, confidence per trade                         |
| **Market Context**    | Allow multi-timeframe (1D, 1W, maybe 1H later), and any usable feature |
| **External Signals**  | Include events like earnings/fed/news if available                     |
| **Risk Management**   | Wants liquidation/capital erosion + learned position sizing (v2+)      |
| **Strategy Modeling** | Enable strategy playbooks and adaptive behavior                        |
| **Meta-Learning**     | Agent should retain memory of past conditions, learn from meta-signals |
| **Limitations**       | No peeking into future — only prediction from available past           |

---

## 🎯 Now Here's the Plan: "The Trader Intelligence Stack"

We'll organize this into **four layers** that build on each other. Each layer adds trader-like qualities and improves survivability and strategy creation.

---

### **🔹 Layer 1: Survival & Orientation (v1)**

> Minimal working agent that can hold/sell one stock, one timeframe, rewarded by position-based score.

**Inputs:**

* OHLCV (daily)
* Agent’s current position
* Time since position opened
* Estimated profit/loss if selling now

**Internal features:**

* Current PnL (unrealized)
* Position duration
* Action history (last N actions — optional at this stage)

**Reward:**

* Oracle-relative reward between 0–100 per episode (✅ already implemented)

**Goal:** Learn to enter/exit positions intelligently on one stock.

---

### **🔹 Layer 2: Market Perception & Meta-Features**

> Now the agent *reads the environment*, and we open it to *multi-feature* inputs.

**Additions:**

* Volatility, momentum, kurtosis, entropy, regime label, VIX, etc.
* Optional: add price features from 3-day, 1-week trailing windows

**Goal:** Learn to recognize **conditions** that precede profitable trends.

---

### **🔹 Layer 3: Portfolio & Risk Awareness**

> The agent now becomes a risk-aware trader.

**Additions:**

* Realized volatility, trailing drawdown
* Simulated liquidation: episode ends if capital drops below X%
* Optional: reward penalty for big drawdowns

**Later upgrade:**

* Learn dynamic position sizing (0%, 25%, 50%, 100%) or continuous size

**Goal:** Survive, control risk, avoid death by bad trades.

---

### **🔹 Layer 4: Strategic Thinking & Memory**

> Agent becomes *introspective* and *adaptive* — career-trader-level.

**Additions:**

* Confidence score (learned or predicted)
* Episodic memory (compare current conditions to prior wins/losses)
* Strategy archetype detection (trend following, mean reversion, etc.)
* Meta-reward: evaluate *how well the agent acted*, not just profit

**Goal:** Develop strategic behavior that generalizes to new situations.

---

### **🔹 Layer 5: Real World Awareness (Optional Later)**

> External signals from scraping/news pipelines and macro indicators.

* Earnings day, Fed announcement, news sentiment
* Sector rotation features (sector-relative strength)
* Macro ETFs (SPY, QQQ, TLT, etc.)

---

## ✅ Next Step: We Start at Layer 1

Let’s:

1. **Solidify internal features**: PnL, duration, trade history, position status.
2. **Wrap it into the `PositionTradingEnv`** — this becomes part of the observation.
3. **Optionally**: Add rolling average reward, confidence, or simple position score.

---

### 🔧 Can I proceed to implement a Layer 1 `env.get_observation()` that includes:

* Agent's current position (0 = flat, 1 = long)
* Time since entry
* Unrealized PnL (oracle-relative)
* Normalized current price vs. entry price
* One-hot day-of-week (already done)
* Rolling average return over past N days
* Optional: last 3 actions (as one-hot vectors)

?

Once this is done, we’ll run the first agent and start logging learnability under **"Survival Mode"**.

Ready to code this?


In [1]:
import jupyter

In [2]:
import random
import numpy as np
import pandas as pd
import seaborn as sns
import gymnasium as gym
import matplotlib.pyplot as plt


from src.utils.system import boot
from src.defaults import RANDOM_SEEDS
from src.data.feature_pipeline import load_base_dataframe
from experiments import check_if_experiment_exists, register_experiment ,experiment_hash
from environments import PositionTradingEnv,PositionTradingEnvV1

# ========== SYSTEM BOOT ==========
DEVICE = boot()
EXPERIMENT_NAME = "trading_environment_development"
DEFAULT_PATH = "data/experiments/" + EXPERIMENT_NAME

# ========== CONFIG ==========
TICKER = "AAPL"
TIMESTEPS = 10_000
EVAL_EPISODES = 5
N_TIMESTEPS = 60
LOOKBACK = 0
SEEDS = RANDOM_SEEDS
MARKET_FEATURES = ['close']
BENCHMARK_PATH = DEFAULT_PATH+"/benchmark_episodes.json"
CHECKPOINT_DIR = DEFAULT_PATH+"/checkpoints"
SCORES_DIR = DEFAULT_PATH+"/scores"
META_PATH = DEFAULT_PATH+"/meta_df.csv"

MARKET_FEATURES.sort()
SEEDS.sort()

DEVICE = boot()
OHLCV_DF = load_base_dataframe()

  from pandas.core import (


In [5]:
import os
import json
import hashlib
import numpy as np
import pandas as pd
from typing import Callable
from stable_baselines3 import PPO,A2C
from stable_baselines3.common.monitor import Monitor
from environments import PositionTradingEnv
from data import extract_meta_features

def compute_additional_metrics(env):
    if hasattr(env, "env"):  # unwrap Monitor
        env = env.env
    values = np.array(env.values)
    rewards = np.array(env.rewards)
    actions = np.array(env.actions)

    returns = pd.Series(values).pct_change().dropna()
    volatility = returns.std()
    entropy = -np.sum(np.bincount(actions, minlength=2)/len(actions) * np.log2(np.bincount(actions, minlength=2)/len(actions) + 1e-9))
    max_drawdown = (values / np.maximum.accumulate(values)).min() - 1
    sharpe = returns.mean() / (returns.std() + 1e-9) * np.sqrt(252)
    sortino = returns.mean() / (returns[returns < 0].std() + 1e-9) * np.sqrt(252)
    calmar = returns.mean() / abs(max_drawdown + 1e-9)
    success_trades = np.sum((np.diff(values) > 0) & (actions[1:] == 1)) + np.sum((np.diff(values) < 0) & (actions[1:] == 0))

    return {
        "volatility": volatility,
        "entropy": entropy,
        "max_drawdown": max_drawdown,
        "sharpe": sharpe,
        "sortino": sortino,
        "calmar": calmar,
        "success_trades": success_trades,
        "action_hold_ratio": np.mean(actions == 0),
        "action_long_ratio": np.mean(actions == 1)
    }

def formalized_transferability_evaluation(
    df: pd.DataFrame,
    ticker: str,
    env_cls: Callable = PositionTradingEnv,
    benchmark_path: str = "data/experiments/learnability_test/benchmark_episodes.json",
    result_path: str = "data/experiments/learnability_test/meta_df_transfer.csv",
    timesteps: int = 10_000,
    n_timesteps: int = 60,
    lookback: int = 0,
    seeds: list = [42, 52, 62],
    checkpoint_dir: str = "data/experiments/learnability_test/checkpoints",
    agent_cls: Callable = PPO,
    
    agent_config: dict = None,
    env_config: dict = None
) -> pd.DataFrame:

    os.makedirs(os.path.dirname(result_path), exist_ok=True)
    os.makedirs(checkpoint_dir, exist_ok=True)
    agent_name: str = agent_cls.__name__
    env_version: str = f"v{env_cls.__version__}"
        
    def generate_config_hash(config):
        raw = json.dumps(config, sort_keys=True)
        return hashlib.sha256(raw.encode()).hexdigest()

    def save_model(model, config_full, config_hash):
        path = os.path.join(checkpoint_dir, f"agent_{config_hash}.zip")
        model.save(path)
        with open(path.replace(".zip", "_config.json"), "w") as f:
            json.dump(config_full, f, indent=2)

    print("[INFO] Loading benchmark episodes...")
    with open(benchmark_path) as f:
        benchmark_episodes = json.load(f)
    
    meta_records = []
    df_ticker = df[df['symbol'] == ticker].reset_index(drop=True)

    if os.path.exists(result_path):
        existing = pd.read_csv(result_path)
        seen_hashes = set(existing['config_hash'].unique())
    else:
        seen_hashes = set()
  
    for seed in seeds:
        for start_idx in benchmark_episodes:
            
            test_idx = start_idx + n_timesteps
            if test_idx + n_timesteps >= len(df_ticker):
                print("[WARN] Skipping episode — test idx out of range")
                continue

            config = {
                "ticker": ticker,
                "train_idx": int(start_idx),
                "test_idx": int(test_idx),
                "timesteps": timesteps,
                "episode_steps":n_timesteps,
                #"seed": seed,
                "env_version": env_version,
                "env_config": env_config,
                #"agent_name": agent_name,
                "agent_config": agent_config,
            }
            config_hash = generate_config_hash(config)
            if config_hash in seen_hashes:
                print(f"[INFO] Skipping previously completed run: {config_hash}")
                continue

            print(f"[INFO] Transferability: seed={seed}, start_idx={start_idx}, config_hash={config_hash}")

 
            env_train = Monitor(env_cls(df_ticker, ticker=ticker, seed=seed, start_idx=start_idx, **(env_config or {})))
            model = agent_cls("MlpPolicy", env_train, verbose=0, seed=seed, **(agent_config or {}))
            model.learn(total_timesteps=timesteps)

            obs, _ = env_train.reset()
            done, score_train = False, 0
            while not done:
                action, _ = model.predict(obs, deterministic=True)
                obs, reward, done, _, _ = env_train.step(action)
                score_train += reward

            obs, _ = env_train.reset()
            done, rand_train = False, 0
            while not done:
                action = env_train.action_space.sample()
                obs, reward, done, _, _ = env_train.step(action)
                rand_train += reward

            env_test = Monitor(env_cls(df_ticker, ticker=ticker, seed=seed, start_idx=test_idx, **(env_config or {})))
            obs, _ = env_test.reset()
            done, score_test = False, 0
            while not done:
                action, _ = model.predict(obs, deterministic=True)
                obs, reward, done, _, _ = env_test.step(action)
                score_test += reward

            obs, _ = env_test.reset()
            done, rand_test = False, 0
            while not done:
                action = env_test.action_space.sample()
                obs, reward, done, _, _ = env_test.step(action)
                rand_test += reward

            advantage_train = score_train - rand_train
            advantage_test = score_test - rand_test
            transfer_delta = score_test - score_train

            save_model(model, config, config_hash)

            meta = extract_meta_features(df_ticker.iloc[start_idx:start_idx + n_timesteps])
            diagnostics = compute_additional_metrics(env_test)

            meta.update({
                "config_hash": config_hash,
                "env_version": env_version,
                "agent_name": agent_name,
                "score_train": score_train,
                "score_test": score_test,
                "advantage_train": advantage_train,
                "advantage_test": advantage_test,
                "transfer_delta": transfer_delta,
                "transfer_success": int(transfer_delta > 0),
                "ticker": ticker,
                "config":json.dumps(config),
                "seed": seed,
                "ticker": ticker,
                "train_idx": int(start_idx),
                "test_idx": int(test_idx),
                "timesteps": timesteps,
                "episode_steps":n_timesteps,
                "seed": seed,
                **diagnostics
            })
            meta_records.append(meta)

    result_df = pd.DataFrame(meta_records)
    if os.path.exists(result_path):
        result_df = pd.concat([pd.read_csv(result_path), result_df], ignore_index=True)
    result_df.to_csv(result_path, index=False)
    print("[INFO] Transferability test complete. Results saved to:", result_path)
    return result_df

In [6]:
if os.path.exists(BENCHMARK_PATH):
    with open(BENCHMARK_PATH) as f:
        benchmark_episodes = json.load(f)
else:
    print("[INFO] Sampling benchmark episodes...")
    np.random.seed(0)
    benchmark_episodes = sample_valid_episodes(OHLCV_DF[OHLCV_DF['symbol']==TICKER], TICKER, N_TIMESTEPS, LOOKBACK, EVAL_EPISODES)
    with open(BENCHMARK_PATH, "w") as f:
        json.dump(benchmark_episodes.tolist(), f)  # ← ✅ Convert to list here

print("[INFO] Episódios de benchmark salvos em:", BENCHMARK_PATH)
for env_cls in [PositionTradingEnv,PositionTradingEnvV1]:
    for agent_cls in [PPO, A2C]:
        result_df = formalized_transferability_evaluation(
            df=OHLCV_DF.copy(),
            ticker=TICKER,
            env_cls=env_cls,
            agent_cls=agent_cls,
            benchmark_path=DEFAULT_PATH+"/benchmark_episodes.json",
            result_path=DEFAULT_PATH+"/meta_df_transfer.csv",
            timesteps=TIMESTEPS,
            n_timesteps=N_TIMESTEPS,
            lookback=LOOKBACK,
            seeds=SEEDS,  # or just [42] for quick run
            checkpoint_dir=DEFAULT_PATH+"/checkpoints",
            env_config={"market_features":MARKET_FEATURES}
)

[INFO] Episódios de benchmark salvos em: data/experiments/trading_environment_development/benchmark_episodes.json
[INFO] Loading benchmark episodes...


EmptyDataError: No columns to parse from file

In [None]:
result_df = pd.read_csv(BENCHMARK_PATH)
result_df

In [None]:
from scipy.stats import ttest_ind


def compare_environments(result_df,env_version_a="v0",env_version_b = "v1"):
    

    summary = result_df.groupby("env_version")[[
        "score_train", "score_test", "advantage_train", "advantage_test",
        "transfer_delta", "success_trades", "sharpe", "sortino", "calmar",
        "max_drawdown", "volatility", "action_hold_ratio", "action_long_ratio"
    ]].agg(["mean", "std", "median"]).T
    

    mean_df = summary.xs('mean', level=1)
    # Compute absolute difference between env_version 1 and 0
    diffs = (mean_df[env_version_a] - mean_df[env_version_b]).abs().sort_values(ascending=False)
  
    # Plot using this sorted order
    mean_df.loc[diffs.index].plot.bar(
        figsize=(14, 6),
        title=f"Env {env_version_a} vs {env_version_b} – Mean metric comparison (sorted by difference)",
        ylabel="Mean Value"
    )
    metrics = ["score_test", "advantage_test", "transfer_delta", "sharpe", "sortino"]

    for metric in metrics:
        v0 = result_df[result_df.env_version == env_version_a][metric]
        v1 = result_df[result_df.env_version == env_version_b][metric]
        stat, pval = ttest_ind(v0, v1)
        print(f"{metric}: p={pval:.4f} | {env_version_a}_mean={v0.mean():.3f}, {env_version_b}_mean={v1.mean():.3f}")

    for metric in metrics:
        sns.boxplot(data=result_df, x="env_version", y=metric)
        plt.title(f"{metric} by Environment Version")
        plt.show()
        
    result_df['composite_score'] = (
        result_df['advantage_test'] +
        result_df['transfer_delta'] +
        result_df['sharpe'] * 5 -
        result_df['max_drawdown'] * 10
    )

    return result_df,result_df.groupby("env_version")["composite_score"].mean()




In [None]:
summary = compare_environments(result_df)
summary

In [None]:
results = pd.read_csv(DEFAULT_PATH+"/meta_df_transfer.csv")
results.groupby('env_version').mean(numeric_only=True).T