Here's a structured **summary and technical breakdown** based on your notebook screenshot:

---

## 🧠 **Summary of the Trading Environment POC**

This notebook implements a **custom Gym environment** tailored for a simple but insightful financial trading simulation. The environment was designed to:

* Work with **real historical stock data**.
* Focus on **directional decisions**: stay flat or go long.
* Use a **normalized oracle-relative reward system** that fairly evaluates agent behavior.

The experiment compares the learning performance of a **PPO agent** vs. a **random policy**, using matched episodes sampled with strict rules.

---

## 🏗️ **Environment Design: `PositionTradingEnv`**

### **State & Action Space**

* **State (Observation):** `Box(shape=(1,))` → current price only (can be extended later).
* **Action Space:** `Discrete(2)`

  * `0 = Flat` (no position)
  * `1 = Long` (holding stock)

### **Environment Rules**

Each episode:

* Must begin on a **Monday**.
* Must be **chronologically ordered** and contain only **one ticker**.
* Must be at least `n_timesteps` long.
* If `lookback` is used, it must also follow these rules and only include past data.
* Internally resamples valid episodes using `sample_valid_episodes()`.

---


## 🧮 Reward System — Oracle-Relative Scoring (Normalized to 100)

The reward system is the **core innovation** of this environment. It ensures that:

> 🟢 The agent's performance is **evaluated relative to the best and worst possible actions** at each step,
> 🟢 and scaled so that **total reward per episode always ranges from 0 to 100**, regardless of volatility or duration.

---

### 🧩 Step-by-Step Breakdown

For each timestep `t`:

#### 1. **Compute Price Change**

```python
price_diff = next_price - curr_price
```

#### 2. **Determine Agent’s Action-Dependent Reward**

```python
agent_reward = +price_diff if position == Long else -price_diff
```

* Going **long** is rewarded when price goes **up**
* Staying **flat** is rewarded when price goes **down**

#### 3. **Determine Oracle and Anti-Oracle Baselines**

```python
oracle_reward = max(+price_diff, -price_diff)
anti_reward   = min(+price_diff, -price_diff)
```

* The **oracle** always takes the best action in hindsight.
* The **anti-oracle** always takes the worst possible action.

#### 4. **Normalize Agent Performance**

```python
step_score = (agent_reward - anti_reward) / (oracle_reward - anti_reward)
step_score = np.clip(step_score, 0, 1)
```

This converts any action into a **score from 0 (worst) to 1 (best)** based on how it compares to the oracle range.

---

### ⚖️ Step Weighting (Optional but Enabled)

To avoid rewarding equally across flat and volatile regimes:

```python
step_weight = abs(price_diff) / total_episode_volatility
```

* Steps with more meaningful price movements contribute more to the total score.
* This prevents agents from scoring well just by being conservative in low-volatility episodes.

---

### 🏁 Final Scaled Step Reward

```python
scaled_reward = step_score * step_weight * 100
```

All rewards are summed across the episode. The environment **precomputes the oracle’s total theoretical reward**, and rescales so that:

> 🔥 `agent_total_reward ∈ [0, 100]`

This **decouples reward from episode length or price scale**, making learning signals stable and comparable across episodes, tickers, and training sessions.

---

### ✅ Benefits

* ✅ **Stable learning signal** across different stocks and episodes.
* ✅ **Fair benchmarking** against oracle and random baselines.
* ✅ **Normalized interpretability**: 100 = perfect hindsight behavior.

---

Let me know if you want me to append this into your actual notebook or package it into a Markdown cell automatically.


### Step Reward Calculation

At each step:

* Let `price_diff = next_price - curr_price`

* Agent reward:

  * If `Long`: gets `+price_diff`
  * If `Flat`: gets `-price_diff`

* Oracle (best hindsight position): `max(|price_diff|)`

* Anti-oracle: `-oracle_reward`

Then compute:

```python
step_score = (agent_reward - anti_reward) / (oracle_reward - anti_reward)
```

* This yields a score ∈ \[0, 1] indicating how close the agent was to the ideal choice at that step.
* The final **step reward is scaled**:

```python
scaled_reward = step_score * weight * 100
```

Where `weight` is precomputed to **normalize total oracle reward to 100 per episode** (variable step weighting based on price volatility).

✅ **Total possible reward per episode: 100**
✅ Ensures fair comparability across episodes of different volatility.

---

## 🤖 **Agent Training and Evaluation**

### PPO Agent

* Trained using `Stable-Baselines3 PPO` for 5,000 timesteps.
* Environment wrapped with `Monitor` for logging.
* Observed performance using total reward over multiple sampled episodes.

### Evaluation Logic

* Sample **fixed episodes** from the data for fairness.
* Run both PPO agent and random agent on the **same episodes**.
* Track and compare total normalized reward (0–100 scale).

### Outputs

* `ppo_mean`, `random_mean`: average scores
* `t_stat`, `p_val`: statistical significance (usually **very strong**)
* Histograms for score distributions (via `plot_evaluation_results()`)

---



In [1]:
import jupyter

In [2]:
import random
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


from src.utils.system import boot
from src.data.feature_pipeline import load_base_dataframe
from experiments import check_if_experiment_exists, register_experiment ,experiment_hash

# ========== SYSTEM BOOT ==========
DEVICE = boot()
EXPERIMENT_NAME = "core_rl_trading_environment"
DEFAULT_PATH = "/data/experiments/" + EXPERIMENT_NAME


# ========== CONFIG ==========
CONFIG = {
    "ticker": "AAPL",
    "start_date": "2023-01-01",
    "end_date": "2024-01-01",
    "window_length_days": 60,
    "step_size_days": 30,
    "reward_type": "path_score",
    "model_save_path": DEFAULT_PATH + "/models",
    "log_path": DEFAULT_PATH + "/logs",
    "result_path": DEFAULT_PATH + "/results"
}
config_hash = experiment_hash(CONFIG)
exists = check_if_experiment_exists(config_hash)
DEVICE = boot()




OHLCV_DF = load_base_dataframe()

  from pandas.core import (


In [3]:
from datetime import datetime

from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv



## Base Agent
* if price goes up and agent is holding, his reward = % price up / max_ep_reward
* Best possible score = 1
* Worst = 0


In [28]:
import gymnasium as gym
import numpy as np

import gymnasium as gym
import numpy as np
import pandas as pd
from datetime import timedelta

class PositionTradingEnv(gym.Env):
    def __init__(
        self,
        full_df: pd.DataFrame,
        ticker: str,
        n_timesteps: int = 60,
        lookback: int = 0,
        seed: int = 42,
    ):
        super().__init__()
        self.full_df = full_df.copy()
        self.ticker = ticker
        self.n_timesteps = n_timesteps
        self.lookback = lookback
        self.random_state = np.random.RandomState(seed)
        self.action_space = gym.spaces.Discrete(2)  # 0 = Flat, 1 = Long
        self.observation_space = gym.spaces.Box(low=0, high=np.inf, shape=(1,), dtype=np.float32)
        self.episode_df = None
        self.step_idx = 0
        self._prepare_ticker_df()
        self._resample_episode()

    def _prepare_ticker_df(self):
        self.df = self.full_df[self.full_df['symbol'] == self.ticker].copy()
        self.df = self.df.sort_values("date")
        self.df["date"] = pd.to_datetime(self.df["date"])
        self.df = self.df.reset_index(drop=True)

    def _resample_episode(self):
        mondays = self.df[self.df["date"].dt.weekday == 0].copy()
        valid_starts = []

        for date in mondays["date"]:
            start_idx = self.df.index[self.df["date"] == date][0]
            end_idx = start_idx + self.n_timesteps - 1
            if end_idx >= len(self.df):
                continue

            ep_slice = self.df.iloc[start_idx:end_idx + 1]
            if (ep_slice["symbol"].nunique() == 1) and (ep_slice["date"].is_monotonic_increasing):
                valid_starts.append(start_idx)

        if not valid_starts:
            raise ValueError("No valid episodes found with the current constraints.")

        self.start_idx = self.random_state.choice(valid_starts)
        self.end_idx = self.start_idx + self.n_timesteps - 1
        self.lookback_idx = max(0, self.start_idx - self.lookback)
        self.episode_df = self.df.iloc[self.lookback_idx:self.end_idx + 1].reset_index(drop=True)

        # Set prices used for reward logic
        self.prices = self.episode_df["close"].values
        self._precompute_step_weights()

    def _precompute_step_weights(self):
        raw_weights = [abs(self.prices[i + 1] - self.prices[i]) for i in range(len(self.prices) - 1)]
        total = sum(raw_weights)
        self.step_weights = [w / total if total > 0 else 1 / (len(raw_weights)) for w in raw_weights]

    def reset(self, *, seed=None, options=None):
        if seed is not None:
            self.random_state.seed(seed)
        self._resample_episode()
        self.step_idx = self.lookback
        self.position = 0
        self.total_reward = 0.0
        self.rewards = []
        self.actions = []
        self.values = []
        obs = np.array([self.prices[self.step_idx]], dtype=np.float32)
        return obs, {}

    def step(self, action):
        curr_idx = self.step_idx
        next_idx = min(curr_idx + 1, len(self.prices) - 1)
        curr_price = self.prices[curr_idx]
        next_price = self.prices[next_idx]
        price_diff = next_price - curr_price

        self.position = action
        agent_reward = price_diff if self.position == 1 else -price_diff
        oracle_reward = abs(price_diff)
        anti_reward = -oracle_reward

        if oracle_reward == anti_reward:
            step_score = 0.5
        else:
            step_score = (agent_reward - anti_reward) / (oracle_reward - anti_reward)

        step_score = float(np.clip(step_score, 0, 1))
        weight = self.step_weights[curr_idx - self.lookback] if curr_idx - self.lookback < len(self.step_weights) else 0
        scaled_reward = step_score * weight * 100

        self.total_reward += scaled_reward
        self.rewards.append(self.total_reward)
        self.actions.append(self.position)
        self.values.append(curr_price)

        self.step_idx += 1
        terminated = self.step_idx >= self.lookback + self.n_timesteps - 1
        truncated = False
        obs = np.array([self.prices[min(self.step_idx, len(self.prices) - 1)]], dtype=np.float32)

        return obs, scaled_reward, terminated, truncated, {}





def score_episode(agent_ret, oracle_ret, anti_ret):
    if oracle_ret == anti_ret:
        return 50
    return float(np.clip(100 * (agent_ret - anti_ret) / (oracle_ret - anti_ret), 0, 100))


In [29]:
df_raw = OHLCV_DF.copy() 
df_raw = df_raw[(df_raw['date'] >=CONFIG['start_date']) & (df_raw['date']<CONFIG['end_date'])]
df_raw.set_index('date',inplace=True)
df_raw.head(3)

Unnamed: 0_level_0,id,symbol,timestamp,open,high,low,close,volume,trade_count,vwap,...,vwap_change,trade_count_change,sector_id,industry_id,return_1d,vix,vix_norm,sp500,sp500_norm,market_return_1d
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2023-01-03,251,MMM,2023-01-03 05:00:00,121.52,122.635,120.37,122.47,2612812.0,44229.0,121.846135,...,0.019225,0.212484,8.0,unknown,0.021264,0.229,0.05676,38.2414,-0.004001,-0.004001
2023-01-04,252,MMM,2023-01-04 05:00:00,123.35,125.29,122.71,125.15,2769831.0,46771.0,124.584773,...,0.022476,0.057474,8.0,unknown,0.021883,0.2201,-0.038865,38.5297,0.007539,0.007539
2023-01-05,253,MMM,2023-01-05 05:00:00,124.21,124.57,122.46,122.96,2606564.0,41426.0,123.168428,...,-0.011369,-0.11428,8.0,unknown,-0.017499,0.2246,0.020445,38.081,-0.011646,-0.011646


In [41]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import ttest_ind
from stable_baselines3 import PPO
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.evaluation import evaluate_policy
from gym import Env


def sample_valid_episodes(df, ticker, n_timesteps=60, lookback=0, episodes=30, seed=42):
    df = df[df['symbol'] == ticker].copy()
    df = df.sort_values('date')
    df['date'] = pd.to_datetime(df['date'])

    mondays = df[df['date'].dt.weekday == 0]
    valid_starts = []

    for date in mondays['date']:
        start_idx = df.index[df['date'] == date][0]
        end_idx = start_idx + n_timesteps - 1
        if end_idx >= len(df):
            continue

        episode = df.iloc[start_idx - lookback if start_idx - lookback >= 0 else 0 : end_idx + 1]
        if episode['symbol'].nunique() == 1 and episode['date'].is_monotonic_increasing:
            valid_starts.append(start_idx)

    rng = np.random.default_rng(seed)
    sampled_starts = rng.choice(valid_starts, size=episodes, replace=False)
    return sampled_starts


def run_learning_evaluation(df, ticker="AAPL", timesteps=10_000, eval_episodes=30, n_timesteps=60, lookback=0, seed=42):
    np.random.seed(seed)

    # Sample episode start points
    sampled_starts = sample_valid_episodes(df, ticker, n_timesteps, lookback, eval_episodes, seed)

    # Train on the environment normally
    env = Monitor(PositionTradingEnv(df, ticker, n_timesteps, lookback, seed=seed))
    model = PPO("MlpPolicy", env, verbose=1, seed=seed)
    model.learn(total_timesteps=timesteps)

    # Evaluate PPO and Random with same episodes
    ppo_scores = []
    random_scores = []

    for start_idx in sampled_starts:
        # PPO agent evaluation
        env_ppo = PositionTradingEnv(df, ticker, n_timesteps, lookback, seed=seed)
        env_ppo.start_idx = start_idx  # override sampling
        env_ppo.end_idx = start_idx + n_timesteps - 1
        env_ppo.lookback_idx = max(0, start_idx - lookback)
        env_ppo.episode_df = env_ppo.df.iloc[env_ppo.lookback_idx : env_ppo.end_idx + 1].reset_index(drop=True)
        env_ppo.prices = env_ppo.episode_df["close"].values
        env_ppo._precompute_step_weights()
        obs, _ = env_ppo.reset()
        done = False
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, _ = env_ppo.step(action)
            done = terminated or truncated
        ppo_scores.append(env_ppo.total_reward)

        # Random agent evaluation
        env_rand = PositionTradingEnv(df, ticker, n_timesteps, lookback, seed=seed)
        env_rand.start_idx = start_idx
        env_rand.end_idx = start_idx + n_timesteps - 1
        env_rand.lookback_idx = max(0, start_idx - lookback)
        env_rand.episode_df = env_rand.df.iloc[env_rand.lookback_idx : env_rand.end_idx + 1].reset_index(drop=True)
        env_rand.prices = env_rand.episode_df["close"].values
        env_rand._precompute_step_weights()
        obs, _ = env_rand.reset()
        done = False
        while not done:
            action = env_rand.action_space.sample()
            obs, reward, terminated, truncated, _ = env_rand.step(action)
            done = terminated or truncated
        random_scores.append(env_rand.total_reward)

    t_stat, p_val = ttest_ind(ppo_scores, random_scores, equal_var=False)

    return {
        "ppo_mean": np.mean(ppo_scores),
        "random_mean": np.mean(random_scores),
        "t_stat": t_stat,
        "p_val": p_val,
        "ppo_scores": ppo_scores,
        "random_scores": random_scores
    }, model, env

# --- Simulated test series ---
def plot_evaluation_results(result_summary, title="Agent vs Random Performance"):
    ppo_scores = result_summary["ppo_scores"]
    random_scores = result_summary["random_scores"]
    
    plt.figure(figsize=(10, 6))
    sns.histplot(ppo_scores, color="green", label="PPO Agent", kde=True, stat="density", bins=10)
    sns.histplot(random_scores, color="red", label="Random Policy", kde=True, stat="density", bins=10)

    plt.axvline(np.mean(ppo_scores), color="green", linestyle="--", label=f"PPO Mean: {np.mean(ppo_scores):.2f}")
    plt.axvline(np.mean(random_scores), color="red", linestyle="--", label=f"Random Mean: {np.mean(random_scores):.2f}")

    plt.title(title)
    plt.xlabel("Episode Score (0–100)")
    plt.ylabel("Density")
    plt.legend()
    plt.grid(True)
    plt.tight_layout()
    plt.show()
    
result_summary = run_learning_evaluation(
    df_raw[df_raw['symbol']=="AAPL"].reset_index(),
    ticker='AAPL', 
    timesteps=5000, 
    eval_episodes=5, 
    n_timesteps=30, 
    lookback=0
)

Using cpu device
Wrapping the env in a DummyVecEnv.
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 29       |
|    ep_rew_mean     | 48.6     |
| time/              |          |
|    fps             | 860      |
|    iterations      | 1        |
|    time_elapsed    | 2        |
|    total_timesteps | 2048     |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 29          |
|    ep_rew_mean          | 49          |
| time/                   |             |
|    fps                  | 735         |
|    iterations           | 2           |
|    time_elapsed         | 5           |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.014029702 |
|    clip_fraction        | 0.0244      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.682      |
|    explained_varia

  t_stat, p_val = ttest_ind(ppo_scores, random_scores, equal_var=False)


In [40]:
result_summary

({'ppo_mean': 51.60744500846024,
  'random_mean': 66.59590832179667,
  't_stat': -4.345154929945915,
  'p_val': 0.012203413860424778,
  'ppo_scores': [51.607445008460246,
   51.607445008460246,
   51.607445008460246,
   51.607445008460246,
   51.607445008460246],
  'random_scores': [59.83694816182125,
   73.71173665589914,
   56.760498384863865,
   71.31210582987234,
   71.35825257652674]},
 <stable_baselines3.ppo.ppo.PPO at 0x1fa8e8f19d0>,
 <Monitor<PositionTradingEnv instance>>)

In [None]:

plot_evaluation_results(result_summary[0])