
## 📘 **Notebook Summary**: Environment Upgrade and Benchmark

### 🎯 **Objective**

To evaluate whether the upgraded environment `PositionTradingEnvV1` (which includes internal features like position, holding time, PnL, etc.) improves agent performance and learnability compared to the original `PositionTradingEnv`.

---

### 🏗️ **Structure of the Notebook**

#### 1. **Imports and Setup**

* Essential packages are imported: `stable_baselines3`, `gym`, `numpy`, `pandas`, etc.
* A2C and PPO agents are prepared for training.

#### 2. **Environment Definitions**

* You define `PositionTradingEnv` and the upgraded version `PositionTradingEnvV1`.
* `PositionTradingEnvV1` includes richer observations:

  * `position`, `time_in_position`, `unrealized_pnl`, `price_vs_entry`, `rolling_return`
  * One-hot encoding for weekday
  * One-hot action history (3 last actions)

#### 3. **Training Procedure**

* Agents are trained on both v0 and v1 environments.
* You use `run_learning_evaluation()` which:

  * Trains each agent on a sample episode (fixed seed)
  * Evaluates on both training and test episodes
  * Compares PPO and A2C to random policy
  * Logs detailed metrics: reward, advantage, transferability, etc.

#### 4. **Metric Aggregation**

* Results are stored and aggregated in a DataFrame
* Metrics include:

  * `score_train`, `score_test` (normalized episode score)
  * `advantage_train`, `advantage_test` (agent - random)
  * `transfer_delta` (test - train)
  * `success_trades`, `action ratios`, Sharpe, Sortino, etc.

#### 5. **Visualization / Tables**

* Grouped comparison tables are printed
* Showing `agent_name` × `env_version` (v0, v1) across all metrics

---

## 📊 **Key Results**

| Metric              | A2C v1 vs v0   | PPO v1 vs v0 | Conclusion                                    |
| ------------------- | -------------- | ------------ | --------------------------------------------- |
| `score_train`       | +0.22          | **+0.49** ✅  | Both improved significantly on training       |
| `advantage_train`   | +0.22          | **+0.48** ✅  | Better advantage signal learning              |
| `score_test`        | −0.01          | **−0.53** ❌  | PPO overfits? Generalization worsened         |
| `advantage_test`    | −0.58          | **+0.14** ✅  | PPO still beats random more often             |
| `transfer_delta`    | −0.21          | **−1.02** ❌  | PPO v1 transfers worse — needs regularization |
| `action_hold_ratio` | \~ +0.02       | +0.02        | Agents learned to hold longer                 |
| `success_trades`    | Roughly stable | Slightly ↓   | Behaviorally still consistent                 |

---

## ✅ **Conclusion**

* `PositionTradingEnvV1` **improves learning** and agent awareness during training.
* PPO agents **learn to trade better** in v1, with higher training scores and advantage.
* However, **generalization and transferability dropped**, suggesting:

  * Overfitting to internal agent features
  * Lack of training diversity
  * No regularization (yet)

---

## 🧭 **Next Steps (Suggested)**

1. Add **dropout or entropy tuning** to avoid overfitting in PPO.
2. Add **training diversity** (tickers/months) or curriculum.
3. Consider **meta-learning memory** (comparing past episode conditions).
4. Benchmark with larger set of episodes and visualize **success vs failure cases**.



Ponto de situação..

config_hash -> deveria servir para que eu descobrisse em que condições uma run foi feita. Para poder comparar performance de ambiente vs episódio vs agemte


In [1]:
import jupyter

In [2]:
import random
import numpy as np
import pandas as pd
import seaborn as sns
import gymnasium as gym
import matplotlib.pyplot as plt


from src.utils.system import boot, Notify
from src.defaults import RANDOM_SEEDS
from src.data.feature_pipeline import load_base_dataframe
from experiments import check_if_experiment_exists, register_experiment ,experiment_hash
from environments import PositionTradingEnv,PositionTradingEnvV1,PositionTradingEnvV2

# ========== SYSTEM BOOT ==========
DEVICE = boot()
EXPERIMENT_NAME = "trading_environment_development"
DEFAULT_PATH = "data/experiments/" + EXPERIMENT_NAME

# ========== CONFIG ==========
TICKER = "AAPL"
TIMESTEPS = 100_000
EVAL_EPISODES = 5
N_TIMESTEPS = 120
LOOKBACK = 0
SEEDS = RANDOM_SEEDS
MARKET_FEATURES = ['volume','close']
BENCHMARK_PATH = DEFAULT_PATH+"/benchmark_episodes.json"
CHECKPOINT_DIR = DEFAULT_PATH+"/checkpoints"
SCORES_DIR = DEFAULT_PATH+"/scores"
META_PATH = DEFAULT_PATH+"/meta_df_transfer.csv"

MARKET_FEATURES.sort()
SEEDS.sort()

DEVICE = boot()
OHLCV_DF = load_base_dataframe()

NOTIFICATION = Notify(EXPERIMENT_NAME)

  from pandas.core import (


In [None]:


class TradingEnvironmentBenchmark:
    benchmark_path=DEFAULT_PATH+"/benchmark_episodes.json"
    result_path=DEFAULT_PATH+"/meta_df_transfer.csv"
    checkpoint_dir=DEFAULT_PATH+"/checkpoints"
    
    def __init__(self):
        self._boot()
        
    def _boot(self):
        self._load_results()
        self._load_ohlcv()
        
    def _load_results(self):
        result_df = pd.read_csv(self.result_path)
        # Parse the 'config' column into a dictionary
        result_df['config_dict'] = result_df['config'].apply(json.loads)
        # Normalize the JSON dictionaries into a DataFrame
        config_expanded_df = pd.json_normalize(result_df['config_dict'])
        # Join the expanded config to the original DataFrame
        result_df_expanded = pd.concat([result_df.drop(columns=['config', 'config_dict']), config_expanded_df], axis=1)
        self.result_df = result_df_expanded
        
    def _load_ohlcv(self):
        self.ohlcv_df = OHLCV_DF.copy()
        
        

In [3]:
import os
import json
import hashlib
import numpy as np
import pandas as pd
from typing import Callable
from stable_baselines3 import PPO,A2C
from stable_baselines3.common.monitor import Monitor
from environments import PositionTradingEnv
from data import extract_meta_features

def compute_additional_metrics(env):
    if hasattr(env, "env"):  # unwrap Monitor
        env = env.env
    values = np.array(env.values)
    rewards = np.array(env.rewards)
    actions = np.array(env.actions)

    returns = pd.Series(values).pct_change().dropna()
    volatility = returns.std()
    entropy = -np.sum(np.bincount(actions, minlength=2)/len(actions) * np.log2(np.bincount(actions, minlength=2)/len(actions) + 1e-9))
    max_drawdown = (values / np.maximum.accumulate(values)).min() - 1
    sharpe = returns.mean() / (returns.std() + 1e-9) * np.sqrt(252)
    sortino = returns.mean() / (returns[returns < 0].std() + 1e-9) * np.sqrt(252)
    calmar = returns.mean() / abs(max_drawdown + 1e-9)
    success_trades = np.sum((np.diff(values) > 0) & (actions[1:] == 1)) + np.sum((np.diff(values) < 0) & (actions[1:] == 0))

    return {
        "volatility": volatility,
        "entropy": entropy,
        "max_drawdown": max_drawdown,
        "sharpe": sharpe,
        "sortino": sortino,
        "calmar": calmar,
        "success_trades": success_trades,
        "action_hold_ratio": np.mean(actions == 0),
        "action_long_ratio": np.mean(actions == 1)
    }

def formalized_transferability_evaluation(
    df: pd.DataFrame,
    ticker: str,
    env_cls: Callable = PositionTradingEnv,
    benchmark_path: str = "data/experiments/learnability_test/benchmark_episodes.json",
    result_path: str = "data/experiments/learnability_test/meta_df_transfer.csv",
    timesteps: int = 10_000,
    n_timesteps: int = 60,
    lookback: int = 0,
    seeds: list = [42, 52, 62],
    checkpoint_dir: str = "data/experiments/learnability_test/checkpoints",
    agent_cls: Callable = PPO,
    
    agent_config: dict = None,
    env_config: dict = None
) -> pd.DataFrame:

    os.makedirs(os.path.dirname(result_path), exist_ok=True)
    os.makedirs(checkpoint_dir, exist_ok=True)
    agent_name: str = agent_cls.__name__
    env_version: str = f"v{env_cls.__version__}"
        
    def generate_config_hash(config):
        raw = json.dumps(config, sort_keys=True)
        return hashlib.sha256(raw.encode()).hexdigest()

    def save_model(model, config_full, config_hash):
        path = os.path.join(checkpoint_dir, f"agent_{config_hash}.zip")
        model.save(path)
        with open(path.replace(".zip", "_config.json"), "w") as f:
            json.dump(config_full, f, indent=2)

    print("[INFO] Loading benchmark episodes...")
    with open(benchmark_path) as f:
        benchmark_episodes = json.load(f)
    
    meta_records = []
    df_ticker = df[df['symbol'] == ticker].reset_index(drop=True)

    if os.path.exists(result_path):
        existing = pd.read_csv(result_path)
        #seen_hashes = set(existing['config_hash'].unique())
        seen_hashes = set(zip(existing['config_hash'], existing['agent_name'], existing['seed']))
    else:
        seen_hashes = set()
  
    for seed in seeds:
        for start_idx in benchmark_episodes:
            
            test_idx = start_idx + n_timesteps
            if test_idx + n_timesteps >= len(df_ticker):
                print("[WARN] Skipping episode — test idx out of range")
                continue

            config = {
                "ticker": ticker,
                "train_idx": int(start_idx),
                "test_idx": int(test_idx),
                "timesteps": timesteps,
                "episode_steps":n_timesteps,
                #"seed": seed,
                #"env_version": env_version,
                "env_config": env_config,
                #"agent_name": agent_name,
                "agent_config": agent_config,
            }
            config_hash = generate_config_hash(config)
            #if config_hash in seen_hashes:
            if (config_hash, agent_name, seed) in seen_hashes:
                print(f"[INFO] Skipping previously completed run: {config_hash} for {agent_name} and seed = {seed}")
                continue

            print(f"[INFO] Transferability: seed={seed}, start_idx={start_idx}, config_hash={config_hash}")

 
            env_train = Monitor(env_cls(df_ticker, ticker=ticker, seed=seed, start_idx=start_idx, **(env_config or {})))
            model = agent_cls("MlpPolicy", env_train, verbose=0, seed=seed, **(agent_config or {}))
            model.learn(total_timesteps=timesteps)

            obs, _ = env_train.reset()
            done, score_train = False, 0
            while not done:
                action, _ = model.predict(obs, deterministic=True)
                obs, reward, done, _, _ = env_train.step(action)
                score_train += reward

            obs, _ = env_train.reset()
            done, rand_train = False, 0
            while not done:
                action = env_train.action_space.sample()
                obs, reward, done, _, _ = env_train.step(action)
                rand_train += reward

            env_test = Monitor(env_cls(df_ticker, ticker=ticker, seed=seed, start_idx=test_idx, **(env_config or {})))
            obs, _ = env_test.reset()
            done, score_test = False, 0
            while not done:
                action, _ = model.predict(obs, deterministic=True)
                obs, reward, done, _, _ = env_test.step(action)
                score_test += reward

            obs, _ = env_test.reset()
            done, rand_test = False, 0
            while not done:
                action = env_test.action_space.sample()
                obs, reward, done, _, _ = env_test.step(action)
                rand_test += reward

            advantage_train = score_train - rand_train
            advantage_test = score_test - rand_test
            transfer_delta = score_test - score_train

            save_model(model, config, config_hash)

            meta = extract_meta_features(df_ticker.iloc[start_idx:start_idx + n_timesteps])
            diagnostics = compute_additional_metrics(env_test)

            meta.update({
                "config_hash": config_hash,
                "env_version": env_version,
                "agent_name": agent_name,
                "score_train": score_train,
                "score_test": score_test,
                "advantage_train": advantage_train,
                "advantage_test": advantage_test,
                "transfer_delta": transfer_delta,
                "transfer_success": int(transfer_delta > 0),
                "ticker": ticker,
                "config":json.dumps(config),
                "seed": seed,
                "ticker": ticker,
                "train_idx": int(start_idx),
                "test_idx": int(test_idx),
                "timesteps": timesteps,
                "episode_steps":n_timesteps,
                "seed": seed,
                **diagnostics
            })
            meta_records.append(meta)

    result_df = pd.DataFrame(meta_records)
    if os.path.exists(result_path):
        result_df = pd.concat([pd.read_csv(result_path), result_df], ignore_index=True)
    result_df.to_csv(result_path, index=False)
    print("[INFO] Transferability test complete. Results saved to:", result_path)
    return result_df

In [4]:
if os.path.exists(BENCHMARK_PATH):
    with open(BENCHMARK_PATH) as f:
        benchmark_episodes = json.load(f)
else:
    print("[INFO] Sampling benchmark episodes...")
    np.random.seed(0)
    benchmark_episodes = sample_valid_episodes(OHLCV_DF[OHLCV_DF['symbol']==TICKER], TICKER, N_TIMESTEPS, LOOKBACK, EVAL_EPISODES)
    with open(BENCHMARK_PATH, "w") as f:
        json.dump(benchmark_episodes.tolist(), f)  # ← ✅ Convert to list here

print("[INFO] Episódios de benchmark salvos em:", BENCHMARK_PATH)
for env_cls in [PositionTradingEnv,PositionTradingEnvV1,PositionTradingEnvV2]:
#for env_cls in [PositionTradingEnvV2]:
    for agent_cls in [PPO,A2C]:
        try:
            NOTIFICATION.info('Started training new agent')
            result_df = formalized_transferability_evaluation(
                df=OHLCV_DF.copy(),
                ticker=TICKER,
                env_cls=env_cls,
                agent_cls=agent_cls,
                benchmark_path=DEFAULT_PATH+"/benchmark_episodes.json",
                result_path=DEFAULT_PATH+"/meta_df_transfer.csv",
                timesteps=TIMESTEPS,
                n_timesteps=N_TIMESTEPS,
                lookback=LOOKBACK,
                seeds=SEEDS,  # or just [42] for quick run
                checkpoint_dir=DEFAULT_PATH+"/checkpoints",
                env_config={"market_features":MARKET_FEATURES}
            )
            NOTIFICATION.info('Test complete')
        except:
            NOTIFICATION.danger('Error on train')
            
NOTIFICATION.success('Done')

[INFO] Episódios de benchmark salvos em: data/experiments/trading_environment_development/benchmark_episodes.json
[INFO] Loading benchmark episodes...
[INFO] Transferability: seed=644267, start_idx=615, config_hash=cc52d36107501860df9e8d7f51e2b6b4f16135f1689074aa0ee5ead139f8990a
[INFO] Transferability: seed=644267, start_idx=360, config_hash=1f4df918d671a34057cd48263a1d934aeb72fd8b8dfefea14e90ad2193720ea8
[INFO] Transferability: seed=644267, start_idx=528, config_hash=fa5930884aba6600cc15d4e073b60278d403a0730afda6a1c04ff6aa0c514348
[INFO] Transferability: seed=644267, start_idx=71, config_hash=ae4ece6321cdb6c4233860e04786855d02c31f017aaad36c284e2403a6a6202b
[INFO] Transferability: seed=644267, start_idx=355, config_hash=bf65000dc7052e85150e5bac66d210d65265683b7ddbf1b26747c413898cf430
[INFO] Transferability: seed=1674633, start_idx=615, config_hash=cc52d36107501860df9e8d7f51e2b6b4f16135f1689074aa0ee5ead139f8990a
[INFO] Transferability: seed=1674633, start_idx=360, config_hash=1f4df918d6

[INFO] Transferability: seed=9731925, start_idx=360, config_hash=1f4df918d671a34057cd48263a1d934aeb72fd8b8dfefea14e90ad2193720ea8
[INFO] Transferability: seed=9731925, start_idx=528, config_hash=fa5930884aba6600cc15d4e073b60278d403a0730afda6a1c04ff6aa0c514348
[INFO] Transferability: seed=9731925, start_idx=71, config_hash=ae4ece6321cdb6c4233860e04786855d02c31f017aaad36c284e2403a6a6202b
[INFO] Transferability: seed=9731925, start_idx=355, config_hash=bf65000dc7052e85150e5bac66d210d65265683b7ddbf1b26747c413898cf430
[INFO] Transferability: seed=25166689, start_idx=615, config_hash=cc52d36107501860df9e8d7f51e2b6b4f16135f1689074aa0ee5ead139f8990a
[INFO] Transferability: seed=25166689, start_idx=360, config_hash=1f4df918d671a34057cd48263a1d934aeb72fd8b8dfefea14e90ad2193720ea8
[INFO] Transferability: seed=25166689, start_idx=528, config_hash=fa5930884aba6600cc15d4e073b60278d403a0730afda6a1c04ff6aa0c514348
[INFO] Transferability: seed=25166689, start_idx=71, config_hash=ae4ece6321cdb6c4233860e

[INFO] Transferability: seed=66923877, start_idx=71, config_hash=4c2911dee9842587a33dbaf7b29fd8b16debbada405955ab8d9c5101e81449e8
[INFO] Transferability: seed=66923877, start_idx=355, config_hash=16dbaa22e496604f04eee860d427ab602301e637e6fb291d3b601bd127ab9e02
[INFO] Transferability: seed=81048948, start_idx=615, config_hash=20d09c24d26e3de48d148f15b9840c3314b5c3b7c33e4d1196e6e995616e8c61
[INFO] Transferability: seed=81048948, start_idx=360, config_hash=50d2df2a37021dfa31c4117638ef8d4be2b20e19523332fb4895fe6bc7f83feb
[INFO] Transferability: seed=81048948, start_idx=528, config_hash=623d434dffca364456509bf1f4a6433c26033d59b4e57250c1bcd46d12bd8a25
[INFO] Transferability: seed=81048948, start_idx=71, config_hash=4c2911dee9842587a33dbaf7b29fd8b16debbada405955ab8d9c5101e81449e8
[INFO] Transferability: seed=81048948, start_idx=355, config_hash=16dbaa22e496604f04eee860d427ab602301e637e6fb291d3b601bd127ab9e02
[INFO] Transferability: seed=89890134, start_idx=615, config_hash=20d09c24d26e3de48d1

[INFO] Transferability: seed=118482530, start_idx=615, config_hash=20d09c24d26e3de48d148f15b9840c3314b5c3b7c33e4d1196e6e995616e8c61
[INFO] Transferability: seed=118482530, start_idx=360, config_hash=50d2df2a37021dfa31c4117638ef8d4be2b20e19523332fb4895fe6bc7f83feb
[INFO] Transferability: seed=118482530, start_idx=528, config_hash=623d434dffca364456509bf1f4a6433c26033d59b4e57250c1bcd46d12bd8a25
[INFO] Transferability: seed=118482530, start_idx=71, config_hash=4c2911dee9842587a33dbaf7b29fd8b16debbada405955ab8d9c5101e81449e8
[INFO] Transferability: seed=118482530, start_idx=355, config_hash=16dbaa22e496604f04eee860d427ab602301e637e6fb291d3b601bd127ab9e02
[INFO] Transferability: seed=135072239, start_idx=615, config_hash=20d09c24d26e3de48d148f15b9840c3314b5c3b7c33e4d1196e6e995616e8c61
[INFO] Transferability: seed=135072239, start_idx=360, config_hash=50d2df2a37021dfa31c4117638ef8d4be2b20e19523332fb4895fe6bc7f83feb
[INFO] Transferability: seed=135072239, start_idx=528, config_hash=623d434dff

[INFO] Transferability: seed=203769678, start_idx=528, config_hash=07072bc12f5dfc65ef14bfdb1f5821308299a2e7c2153678bb7e5bac0968737b
[INFO] Transferability: seed=203769678, start_idx=71, config_hash=bd388094991327b94176290b171756a7eec7fdde3e184702ea13336b5ca64554
[INFO] Transferability: seed=203769678, start_idx=355, config_hash=250c3d81c2bcd46f4406995a34c9fbdef15eef8e9735c7ff3c6a9b53fe21bd75
[INFO] Transferability test complete. Results saved to: data/experiments/trading_environment_development/meta_df_transfer.csv
[INFO] Loading benchmark episodes...
[INFO] Transferability: seed=644267, start_idx=615, config_hash=ef8a93af1a11514774832681d57bf4b274bc0545d546504ac633c5d611af3488
[INFO] Transferability: seed=644267, start_idx=360, config_hash=6ca6544e7bd5802090dd642b6bd106f2c6043931eeb48fa325eff8be6896d29b
[INFO] Transferability: seed=644267, start_idx=528, config_hash=07072bc12f5dfc65ef14bfdb1f5821308299a2e7c2153678bb7e5bac0968737b
[INFO] Transferability: seed=644267, start_idx=71, conf

In [5]:
result_df = pd.read_csv(DEFAULT_PATH+"/meta_df_transfer.csv")
result_df

Unnamed: 0,mean_return,median_return,std_return,skew_return,kurtosis_return,return_trend,ewm_mean_return,hurst,adf_stat,adf_pval,...,timesteps,episode_steps,volatility,max_drawdown,sharpe,sortino,calmar,success_trades,action_hold_ratio,action_long_ratio
0,0.001054,0.002034,0.013287,-0.611195,1.119947,0.000016,0.005045,0.339827,-1.703688,0.429213,...,100000,120,0.015863,-0.140452,-0.387645,-0.533045,-0.002758,24,0.508475,0.491525
1,0.000350,0.001797,0.012033,-0.827952,1.739689,0.000020,0.000996,0.541022,-1.731129,0.415150,...,100000,120,0.011304,-0.085559,-1.026879,-1.731934,-0.008546,29,0.474576,0.525424
2,0.001479,0.001864,0.015860,0.976569,3.816292,0.000055,-0.000652,0.739966,-0.216208,0.936562,...,100000,120,0.012540,-0.058619,2.294734,3.132475,0.030924,22,0.542373,0.457627
3,-0.000800,0.000000,0.022876,-0.365616,-0.173244,0.000014,-0.001585,0.505479,-1.906520,0.329000,...,100000,120,0.025535,-0.190702,-0.610527,-1.267244,-0.005150,29,0.542373,0.457627
4,0.000607,0.002160,0.012180,-0.828750,1.622443,0.000016,0.004138,0.356515,-1.731875,0.414769,...,100000,120,0.011093,-0.085458,-1.030301,-1.678216,-0.008425,41,0.423729,0.576271
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
295,0.001054,0.002034,0.013287,-0.611195,1.119947,0.000016,0.005045,0.339827,-1.703688,0.429213,...,100000,120,0.015863,-0.140452,-0.387645,-0.533045,-0.002758,27,0.457627,0.542373
296,0.000350,0.001797,0.012033,-0.827952,1.739689,0.000020,0.000996,0.541022,-1.731129,0.415150,...,100000,120,0.011304,-0.085559,-1.026879,-1.731934,-0.008546,22,0.474576,0.525424
297,0.001479,0.001864,0.015860,0.976569,3.816292,0.000055,-0.000652,0.739966,-0.216208,0.936562,...,100000,120,0.012540,-0.058619,2.294734,3.132475,0.030924,26,0.440678,0.559322
298,-0.000800,0.000000,0.022876,-0.365616,-0.173244,0.000014,-0.001585,0.505479,-1.906520,0.329000,...,100000,120,0.025535,-0.190702,-0.610527,-1.267244,-0.005150,33,0.457627,0.542373


In [6]:
from scipy.stats import ttest_ind


def compare_environments(result_df,env_version_a="v0",env_version_b = "v1"):
    

    summary = result_df.groupby("env_version")[[
        "score_train", "score_test", "advantage_train", "advantage_test",
        "transfer_delta", "success_trades", "sharpe", "sortino", "calmar",
        "max_drawdown", "volatility", "action_hold_ratio", "action_long_ratio"
    ]].agg(["mean", "std", "median"]).T
    

    mean_df = summary.xs('mean', level=1)
    # Compute absolute difference between env_version 1 and 0
    diffs = (mean_df[env_version_a] - mean_df[env_version_b]).abs().sort_values(ascending=False)
  
    # Plot using this sorted order
    mean_df.loc[diffs.index].plot.bar(
        figsize=(14, 6),
        title=f"Env {env_version_a} vs {env_version_b} – Mean metric comparison (sorted by difference)",
        ylabel="Mean Value"
    )
    metrics = ["score_test", "advantage_test", "transfer_delta", "sharpe", "sortino"]

    for metric in metrics:
        v0 = result_df[result_df.env_version == env_version_a][metric]
        v1 = result_df[result_df.env_version == env_version_b][metric]
        stat, pval = ttest_ind(v0, v1)
        print(f"{metric}: p={pval:.4f} | {env_version_a}_mean={v0.mean():.3f}, {env_version_b}_mean={v1.mean():.3f}")

    for metric in metrics:
        sns.boxplot(data=result_df, x="env_version", y=metric)
        plt.title(f"{metric} by Environment Version")
        plt.show()
        
    result_df['composite_score'] = (
        result_df['advantage_test'] +
        result_df['transfer_delta'] +
        result_df['sharpe'] * 5 -
        result_df['max_drawdown'] * 10
    )

    return result_df,result_df.groupby("env_version")["composite_score"].mean()




In [7]:
result_df.groupby(["agent_name","env_version"])[[
        "score_train", "score_test", "advantage_train", "advantage_test",
        "transfer_delta", "success_trades", "sharpe", "sortino", "calmar",
        "max_drawdown", "volatility", "action_hold_ratio", "action_long_ratio"
    ]].agg(["mean", "std", "median"]).T

Unnamed: 0_level_0,agent_name,A2C,A2C,A2C,PPO,PPO,PPO
Unnamed: 0_level_1,env_version,v0,v1,v2,v0,v1,v2
score_train,mean,-0.043098,1.122808,1.098672,0.596207,2.427292,2.357596
score_train,std,0.621783,0.929958,0.843257,0.160602,0.353161,0.59037
score_train,median,-0.370033,1.104908,1.198384,0.603259,2.529405,2.504854
score_test,mean,-0.079369,-0.172887,-0.137122,0.123702,-0.134487,0.015528
score_test,std,0.3108,0.422797,0.464753,0.295652,0.532109,0.652777
score_test,median,-0.081661,-0.334435,-0.275769,0.081661,-0.299921,-0.169025
advantage_train,mean,-0.151483,1.014424,0.990287,0.487822,2.318908,2.249211
advantage_train,std,1.05915,1.348975,1.281415,0.882278,0.965212,1.088831
advantage_train,median,-0.181012,0.986723,0.962501,0.419275,2.191546,2.202682
advantage_test,mean,0.020839,-0.175073,-0.186717,0.234468,-0.197812,-0.135487


In [8]:
#result_df['agent_name'].unique()
#existing = pd.read_csv(result_path)
        #seen_hashes = set(existing['config_hash'].unique())
seen_hashes = set(zip(result_df['config_hash'], result_df['agent_name']))
#seen_hashes

In [9]:
#summary = compare_environments(result_df)
#summary

In [10]:
result_df['env_version']

0      v0
1      v0
2      v0
3      v0
4      v0
       ..
295    v2
296    v2
297    v2
298    v2
299    v2
Name: env_version, Length: 300, dtype: object

In [11]:
results = pd.read_csv(DEFAULT_PATH+"/meta_df_transfer.csv")
results.groupby('env_version').mean(numeric_only=True).T

env_version,v0,v1,v2
mean_return,0.0005378421,0.0005378421,0.0005378421
median_return,0.001570828,0.001570828,0.001570828
std_return,0.01524706,0.01524706,0.01524706
skew_return,-0.3313888,-0.3313888,-0.3313888
kurtosis_return,1.625025,1.625025,1.625025
return_trend,2.421621e-05,2.421621e-05,2.421621e-05
ewm_mean_return,0.001588245,0.001588245,0.001588245
hurst,0.4965619,0.4965619,0.4965619
adf_stat,-1.457884,-1.457884,-1.457884
adf_pval,0.5049387,0.5049387,0.5049387


In [12]:
env_train = Monitor(PositionTradingEnvV2(OHLCV_DF, ticker="AAPL", seed=42, start_idx=100, market_features=["close"]))

In [13]:
env_train.reset()

(array([  0.  ,   0.  ,   0.  ,   1.  ,   0.  ,   0.  ,   0.  ,   0.  ,
          0.  ,   0.  ,   0.  ,   0.  ,   0.  ,   1.  ,   1.  ,   0.  ,
          1.  ,   0.  ,   1.  ,   0.  , 149.64], dtype=float32),
 {})

In [14]:
model = PPO("MlpPolicy", env_train, verbose=0, seed=42)
model.learn(total_timesteps=100)

<stable_baselines3.ppo.ppo.PPO at 0x18e56ca09d0>

In [15]:
env_train.observation_space

Box(-inf, inf, (21,), float32)