

# Self-Aware Transformer Agent for Trading

## Motivation

Traditional RL trading agents rely heavily on market signals. But the market is noisy, volatile, and unpredictable — especially in short windows. Instead of chasing external patterns, we **shift the focus inward**:

> Teach the agent to interpret itself, its context, and act based on internal understanding and consistent behavior.

This system mimics how **career traders** operate: not by predicting price ticks, but by reacting intelligently to their strategy performance, risk levels, and regime shifts.

---

## Agent Design Philosophy

| Principle                | Implementation                                                         |
| ------------------------ | ---------------------------------------------------------------------- |
| Internal State Awareness | Agent tracks position, time held, drawdown, unrealized PnL             |
| Compact, Dense Features  | Handcrafted candlestick + PCA-compressed signals                       |
| Regime-Driven Training   | Train using rolling market windows with real history                   |
| Transformer Context Use  | Use long context (2 months) to act intelligently in the next (1 month) |
| Reward Discipline        | Combine PnL, drawdown penalty, and Sharpe-style consistency bonus      |

---

## Feature Engineering: `FeatureCompressor`

| Feature Group        | Description                                          |
| -------------------- | ---------------------------------------------------- |
| Candlestick Shape    | `body`, `upper_shadow`, `lower_shadow`, `body_ratio` |
| Relative Behavior    | `z_close`, `rel_volume`, `price_vs_range`            |
| Signal Strength      | `rolling_sharpe`, `entropy`                          |
| Historical Embedding | `pca_1`, `pca_2` from rolling PCA of close/volume    |

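The relative-behavior and signal-strength features are computed over a rolling or exponentially weighted window (10 bars by default) to reduce noise; the candlestick features are per-bar. The compressor also appends three calendar features: `day_of_week`, `day_of_month`, and `month`.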
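As one illustration of the signal-strength group, the `entropy` feature can be read as the Shannon entropy of a histogram of recent returns. The sketch below is an assumption about the intended definition (bin probabilities rather than densities, five bins); the function name `histogram_entropy` is hypothetical.

```python
import numpy as np

def histogram_entropy(x, bins=5):
    """Shannon entropy (in nats) of the empirical bin distribution of x."""
    counts, _ = np.histogram(x, bins=bins)
    p = counts / counts.sum()      # bin probabilities, sum to 1
    p = p[p > 0]                   # skip empty bins (0 * log 0 is treated as 0)
    return float(-np.sum(p * np.log(p)))

# Returns spread evenly over all bins give the maximum, log(bins).
spread = histogram_entropy([0.0, 1.0, 2.0, 3.0, 4.0])
# Identical returns fall into a single bin and give zero entropy.
flat = histogram_entropy([0.01] * 10)
```

A higher value suggests returns scattered across many magnitudes (choppy), a lower value suggests returns concentrated in a narrow band.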

---

## Custom Environment: `SelfAwareTradingEnv`

* Builds the agent's **state** with both market and internal agent features
* Internal metrics: position, time held, cumulative reward, drawdown, unrealized PnL, pct episode completed
* **Reward Function**:

  ```python
  shaped_reward = (
      0.5 * raw_pnl
      + 0.3 * sharpe_bonus
      - 0.2 * drawdown_penalty
  )
  ```
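To get a feel for the relative scale of the three terms, the weighting can be evaluated on made-up numbers. The inputs below are illustrative only, not results from this notebook.

```python
def shaped_reward(raw_pnl, sharpe_bonus, drawdown_penalty):
    """Weighted reward using the 0.5 / 0.3 / 0.2 weights from the formula."""
    return 0.5 * raw_pnl + 0.3 * sharpe_bonus - 0.2 * drawdown_penalty

# Hypothetical step: +2% realized PnL, Sharpe-style bonus of 0.5, 10% drawdown.
r = shaped_reward(raw_pnl=0.02, sharpe_bonus=0.5, drawdown_penalty=0.10)
print(round(r, 4))  # 0.14
```

With these inputs the Sharpe term (0.15) outweighs the PnL term (0.01), so the ratio-valued bonus may need rescaling if PnL is meant to dominate.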

---

## Transformer Architecture

* Input: full feature state (market + internal), left-padded with zeros or clipped to 60 steps
* Positional encoding: learnable
* Backbone: TransformerEncoder (2 layers, 4 heads)
* Outputs a context vector per timestep → passed to actor & critic
* Integrated with `RecurrentPPO` for temporal memory
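The "padded or clipped to 60 steps" input can be sketched as a fixed-length history buffer. This is a minimal illustration assuming observations arrive one step at a time as flat vectors; `ObservationHistory` is a hypothetical helper, not a class from this notebook or from Stable-Baselines3.

```python
from collections import deque
import numpy as np

class ObservationHistory:
    """Keeps the last `seq_len` observations, left-padded with zeros."""

    def __init__(self, obs_dim, seq_len=60):
        self.obs_dim = obs_dim
        self.seq_len = seq_len
        self.buffer = deque(maxlen=seq_len)  # old steps fall off automatically

    def reset(self):
        self.buffer.clear()

    def push(self, obs):
        self.buffer.append(np.asarray(obs, dtype=np.float32))
        return self.as_array()

    def as_array(self):
        # Zero rows go first so the newest step is always the last row.
        out = np.zeros((self.seq_len, self.obs_dim), dtype=np.float32)
        if self.buffer:
            out[-len(self.buffer):] = np.stack(list(self.buffer))
        return out

history = ObservationHistory(obs_dim=3, seq_len=4)
for t in range(6):
    seq = history.push([t, t, t])  # after 6 pushes only steps 2..5 remain
```

Feeding a `(seq_len, obs_dim)` array like this to the encoder gives attention more than one real timestep to work with; a single flat observation would leave every other position as padding.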

---

## Windowed Training Design

Each training episode is built from a **single ticker**:

| Step | Description                                                                     |
| ---- | ------------------------------------------------------------------------------- |
| 1    | Select a ticker and sort its data by date                                       |
| 2    | Slice 60 days as `context` and the next 30 as `target` (code defaults: 40/20)   |
| 3    | Train on `target` only; `context` is returned but not yet fed to the agent      |
| 4    | Repeat across all valid rolling windows and tickers                             |

This creates hundreds of **coherent, regime-aware episodes** for training and evaluation.
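The slicing in steps 2 to 4 can be sketched on plain row indices. The helper name `rolling_windows` and the `stride` parameter are illustrative assumptions; a series of `n` rows admits `n - (context + target) + 1` windows at stride 1.

```python
def rolling_windows(n_rows, context=60, target=30, stride=1):
    """Yield (context_slice, target_slice) index pairs over n_rows rows."""
    total = context + target
    for start in range(0, n_rows - total + 1, stride):
        yield (slice(start, start + context),
               slice(start + context, start + total))

rows = list(range(100))          # stand-in for one ticker's sorted rows
windows = list(rolling_windows(len(rows)))
first_ctx, first_tgt = windows[0]
last_ctx, last_tgt = windows[-1]
print(len(windows))              # 11
print(rows[last_tgt][-1])        # 99, the final row is used
```

A stride larger than 1 reduces the overlap between consecutive episodes, which share all but one row at stride 1.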

---

## Full Pipeline Summary

```text
load_base_dataframe()
↓
MarketWindowBuilder → [context_df, target_df] for each ticker
↓
FeatureCompressor → Compress raw OHLCV into 14 features (11 market + 3 calendar)
↓
SelfAwareTradingEnv → Builds internal + market state, shaped rewards
↓
RecurrentPPO (TransformerPolicy) → Trains over target month
↓
EpisodeLoggerCallback → Tracks Sharpe, Win Rate, Drawdown, # Trades
↓
EvaluationRunner → Loads each agent window and scores deterministically
```

---

## Evaluation & Benchmarking

Each agent is evaluated on the same `target_df` it was trained on, using **deterministic inference**, so these scores are in-sample. Logged metrics include:

| Metric         | Description                                      |
| -------------- | ------------------------------------------------ |
| `total_reward` | Sum of all rewards in episode                    |
| `avg_reward`   | Mean reward per step                             |
| `sharpe_bonus` | Volatility-adjusted consistency score            |
| `drawdown`     | Max drawdown during episode                      |
| `win_rate`     | Share of profitable actions (not yet logged)     |

Results are saved in `evaluation_results.csv` and can be compared across:

* Tickers
* Time periods
* Volatility regimes
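A minimal sketch of how these metrics could be computed from one episode's per-step rewards follows. It assumes rewards are simple per-step returns that compound into an equity curve, and it counts only non-zero steps toward `win_rate`; `episode_metrics` is a hypothetical helper, not the notebook's `EvaluationRunner`.

```python
import numpy as np

def episode_metrics(rewards):
    """Summarize one episode from its per-step returns."""
    r = np.asarray(rewards, dtype=float)
    equity = np.cumprod(1.0 + r)                    # compounded equity curve
    peaks = np.maximum.accumulate(equity)           # running high-water mark
    traded = r[r != 0]                              # steps that realized PnL
    return {
        "total_reward": float(r.sum()),
        "avg_reward": float(r.mean()),
        "sharpe_bonus": float(r.mean() / (r.std() + 1e-8)),
        "drawdown": float(np.max((peaks - equity) / peaks)),
        "win_rate": float((traded > 0).mean()) if traded.size else 0.0,
    }

# Made-up rewards: a gain, a loss, a hold, and a small gain.
m = episode_metrics([0.10, -0.05, 0.0, 0.02])
```

Here the maximum drawdown is 5% (the drop after the first gain) and two of the three non-zero steps are profitable.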

---

## Benchmark Plan

| Baseline             | Description                                                |
| -------------------- | ---------------------------------------------------------- |
| Random Policy        | Random action sampling, same environment                   |
| Hold-only Agent      | Buy once and hold through episode                          |
| Classical PPO w/ MLP | Same pipeline, no memory or context                        |
| Market Benchmark     | Cumulative return of passive exposure during target period |

We will compute **performance deltas** against these baselines and track:

* Sharpe outperformance
* Drawdown reduction
* Adaptation to volatile periods
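The two simplest baselines can be sketched directly on a close-price series. The functions below are illustrative assumptions (a long, flat, or short position held for one step at a time, no transaction costs) and are not part of the notebook's pipeline.

```python
import numpy as np

def hold_return(prices):
    """Buy at the first close, sell at the last."""
    prices = np.asarray(prices, dtype=float)
    return float(prices[-1] / prices[0] - 1.0)

def random_policy_return(prices, seed=0):
    """Compound return of a random position in {-1, 0, +1} drawn each step."""
    prices = np.asarray(prices, dtype=float)
    rng = np.random.default_rng(seed)
    step_returns = prices[1:] / prices[:-1] - 1.0
    positions = rng.integers(-1, 2, size=step_returns.size)  # high is exclusive
    return float(np.prod(1.0 + positions * step_returns) - 1.0)

def performance_delta(agent_return, baseline_return):
    """Positive when the agent beats the baseline."""
    return agent_return - baseline_return

prices = [100.0, 102.0, 101.0, 110.0]
hold = hold_return(prices)             # 10% over the window
delta = performance_delta(0.12, hold)  # 0.12 is a made-up agent return
```

Averaging `random_policy_return` over many seeds would give a steadier baseline than a single draw.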

---

## What Makes This Unique?

* Agent **doesn’t predict**, it *reacts adaptively*
* Context is used **intelligently**, not just stacked
* Agent performance is **relative to itself**, not absolute return
* Internal state + compressed signals = **compact but expressive space**




In [1]:
# SETUP ===================================
import jupyter
import warnings

from src.utils.system import boot, Notify

boot()
warnings.filterwarnings("ignore")



# PACKAGES ================================
import os
import torch
import joblib
import numpy as np
import pandas as pd
import seaborn as sns
import torch.nn as nn
import gymnasium as gym
import matplotlib.pyplot as plt

from tqdm import tqdm
from sklearn.preprocessing import RobustScaler
from IPython.display import display

# FRAMEWORK STUFF =========================
from src.defaults import TOP2_STOCK_BY_SECTOR, FEATURE_COLS, EPISODE_LENGTH
from src.data.feature_pipeline import load_base_dataframe
from src.experiments.experiment_tracker import ExperimentTracker

from stable_baselines3.common.torch_layers import BaseFeaturesExtractor
from stable_baselines3.common.policies import ActorCriticPolicy



In [11]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

class FeatureCompressor:
    def __init__(self, window=10):
        self.window = window
        self.pca = PCA(n_components=2)

    def transform(self, df):
        df = df.copy()

        # === CANDLE FEATURES ===
        df['body'] = df['close'] - df['open']
        df['upper_shadow'] = df['high'] - df[['open', 'close']].max(axis=1)
        df['lower_shadow'] = df[['open', 'close']].min(axis=1) - df['low']
        df['range'] = df['high'] - df['low']
        df['body_ratio'] = df['body'] / (df['range'] + 1e-8)

        # === RELATIVE CONTEXT FEATURES ===
        df['z_close'] = (df['close'] - df['close'].rolling(self.window).mean()) / (df['close'].rolling(self.window).std() + 1e-8)
        df['rel_volume'] = df['volume'] / (df['volume'].rolling(self.window).mean() + 1e-8)
        df['price_vs_range'] = (df['close'] - df['low'].rolling(self.window).min()) / (
            df['high'].rolling(self.window).max() - df['low'].rolling(self.window).min() + 1e-8)

        # === SIGNAL QUALITY (EWM) ===
        df['return'] = df['close'].pct_change()
        df['ewm_return_mean'] = df['return'].ewm(span=self.window).mean()
        df['ewm_return_std'] = df['return'].ewm(span=self.window).std()
        df['rolling_sharpe'] = df['ewm_return_mean'] / (df['ewm_return_std'] + 1e-8)
        df['entropy'] = df['return'].rolling(self.window).apply(self.shannon_entropy, raw=True)

        # === PCA EMBEDDING ===
        pca_inputs = []
        raw = df[['close', 'volume']].fillna(0).values
        for i in range(self.window, len(raw)):
            window = raw[i - self.window:i].flatten()
            pca_inputs.append(window)

        if len(pca_inputs) >= 2:  # PCA(n_components=2) needs at least two samples
            compressed = self.pca.fit_transform(np.array(pca_inputs))
            df.loc[df.index[-len(compressed):], 'pca_1'] = compressed[:, 0]
            df.loc[df.index[-len(compressed):], 'pca_2'] = compressed[:, 1]
        else:
            df['pca_1'] = 0
            df['pca_2'] = 0

        # === DATE-TIME FEATURES ===
        if 'date' in df.columns:
            df['date'] = pd.to_datetime(df['date'])
            df['day_of_week'] = df['date'].dt.dayofweek
            df['day_of_month'] = df['date'].dt.day
            df['month'] = df['date'].dt.month
        else:
            df['day_of_week'] = 0
            df['day_of_month'] = 0
            df['month'] = 0

        # === FINAL FEATURE SET ===
        features = [
            'body', 'upper_shadow', 'lower_shadow', 'body_ratio',
            'z_close', 'rel_volume', 'price_vs_range',
            'rolling_sharpe', 'entropy',
            'pca_1', 'pca_2',
            'day_of_week', 'day_of_month', 'month'
        ]

        return df[features].dropna().reset_index(drop=True)

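    def shannon_entropy(self, x):
        # Use bin probabilities (counts / total); density=True would return
        # densities, which do not sum to 1 and distort the entropy.
        counts, _ = np.histogram(x, bins=5)
        p = counts / counts.sum()
        p = p[p > 0]
        return -np.sum(p * np.log(p))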


In [12]:
# self_aware_trading_env.py
import numpy as np
import gymnasium as gym
from gymnasium import spaces
import pandas as pd
#from feature_compressor import FeatureCompressor

class SelfAwareTradingEnv(gym.Env):
    def __init__(self, ohlcv_df: pd.DataFrame, episode_length=100):
        super().__init__()
        self.current_step = 0

        # === Feature Compression ===
        self.compressor = FeatureCompressor(window=10)
        self.market_data = self.compressor.transform(ohlcv_df).values.astype(np.float32)
        if len(self.market_data) < 2:
            raise ValueError("ohlcv_df is too short: fewer than 2 rows survive the feature warm-up")

        # Close prices aligned with the feature rows. transform() drops the leading
        # warm-up rows, so this assumes NaNs only occur at the start of the frame.
        self.prices = ohlcv_df['close'].to_numpy(dtype=np.float64)[-len(self.market_data):]

        # Cap the episode so that _get_obs() never indexes past the last feature row
        self.episode_length = min(episode_length, len(self.market_data) - 1)

        # Agent state
        self.current_position = 0  # -1 short, 0 flat, 1 long
        self.time_in_position = 0
        self.total_reward = 0.0
        self.position_entry_price = 0.0
        self.equity_curve = []
        self.returns_history = []

        self.market_dim = self.market_data.shape[1]
        self.internal_dim = 6

        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf,
            shape=(self.market_dim + self.internal_dim,),
            dtype=np.float32
        )

        self.action_space = spaces.Discrete(3)  # 0 = sell, 1 = hold, 2 = buy

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.current_step = 0
        self.current_position = 0
        self.time_in_position = 0
        self.total_reward = 0.0
        self.position_entry_price = 0.0
        self.equity_curve = [1.0]
        self.returns_history = []
        return self._get_obs(), {}

    def step(self, action):
        # Use the real close price; column 0 of market_data is `body`, not a price
        price = self.prices[self.current_step]

        # Process action
        raw_reward = 0.0
        if action == 0:  # SELL
            raw_reward = self._close_position(price, -1)
        elif action == 2:  # BUY
            raw_reward = self._close_position(price, 1)
        else:
            self.time_in_position += 1

        # Track and update
        self.total_reward += raw_reward
        self.returns_history.append(raw_reward)
        self.equity_curve.append(self.equity_curve[-1] * (1 + raw_reward))

        self.current_step += 1
        done = self.current_step >= self.episode_length

        # Apply reward shaping
        shaped_reward = (
            0.5 * raw_reward
            + 0.3 * self.sharpe_bonus()
            - 0.2 * self.calc_drawdown()
        )

        return self._get_obs(), shaped_reward, done, False, {}

    def _get_obs(self):
        market_features = self.market_data[self.current_step]
        internal_features = np.array([
            self.current_position,
            self.time_in_position,
            self.calc_unrealized_pnl(),
            self.total_reward,
            self.calc_drawdown(),
            self.current_step / self.episode_length
        ], dtype=np.float32)
        # Match the float32 dtype declared in observation_space
        return np.concatenate([market_features, internal_features]).astype(np.float32)

    def _close_position(self, price, new_position):
        # Realize PnL on any open position, then open `new_position` at `price`
        reward = 0.0
        if self.current_position != 0:
            change = price - self.position_entry_price
            if self.current_position == -1:
                change *= -1
            reward = change / self.position_entry_price

        self.current_position = new_position
        self.position_entry_price = price
        self.time_in_position = 1
        return reward

    def calc_unrealized_pnl(self):
        if self.current_position == 0:
            return 0.0
        current_price = self.prices[self.current_step]
        change = current_price - self.position_entry_price
        return (change / self.position_entry_price) * (1 if self.current_position == 1 else -1)

    def calc_drawdown(self):
        peak = np.max(self.equity_curve)
        current = self.equity_curve[-1]
        return (peak - current) / peak if peak > 0 else 0.0

    def sharpe_bonus(self):
        returns = np.array(self.returns_history[-20:])
        if len(returns) < 2:
            return 0.0
        mean = np.mean(returns)
        std = np.std(returns)
        sharpe = mean / (std + 1e-8)
        return sharpe
    
    def _calculate_reward(self):
        # Alternative reward; step() does not call this and uses its inline shaping instead
        reward = 0.0
        reward += self.calc_unrealized_pnl() * 0.1  # scaled pnl reward
        reward -= self.calc_drawdown() * 0.2         # penalize risk
        reward += self.sharpe_bonus()                # consistency bonus
        return reward

In [13]:
from stable_baselines3.common.callbacks import BaseCallback
class EpisodeLoggerCallback(BaseCallback):
    def __init__(self, verbose=0):
        super().__init__(verbose)
        self.episode_data = []

    def _on_step(self):
        info = self.locals.get('infos', [{}])[0]
        if 'episode' in info:
            self.episode_data.append(info['episode'])
        return True

    def _on_training_end(self):
        df = pd.DataFrame(self.episode_data)
        df.to_csv('agent_journal.csv', index=False)
        print("✅ Saved trading journal to agent_journal.csv")

In [14]:

import pandas as pd

class MarketWindowBuilder:
    def __init__(self, df: pd.DataFrame, context_days=40, target_days=20, ticker_col='symbol', date_col='date'):
        self.df = df.copy()
        self.context_days = context_days
        self.target_days = target_days
        self.ticker_col = ticker_col
        self.date_col = date_col

    def generate_windows(self):
        windows = []
        grouped = self.df.groupby(self.ticker_col)

        for symbol, group in grouped:
            group = group.sort_values(self.date_col).reset_index(drop=True)
            total_days = self.context_days + self.target_days

            for start_idx in range(0, len(group) - total_days + 1):  # +1 keeps the final full window
                context = group.iloc[start_idx : start_idx + self.context_days].copy()
                target = group.iloc[start_idx + self.context_days : start_idx + total_days].copy()

                windows.append({
                    'symbol': symbol,
                    'context': context.reset_index(drop=True),
                    'target': target.reset_index(drop=True)
                })

        return windows


In [15]:

#from market_window_builder import MarketWindowBuilder
#from self_aware_trading_env import SelfAwareTradingEnv
import pandas as pd

class WindowedTrainingLoader:
    def __init__(self, raw_ohlcv_df: pd.DataFrame, context_days=40, target_days=20):
        self.builder = MarketWindowBuilder(raw_ohlcv_df, context_days, target_days)
        self.windows = self.builder.generate_windows()

    def get_training_environments(self):
        envs = []
        for window in self.windows:
            context_df = window['context']  # Can be logged or passed to the agent's memory
            target_df = window['target']
            env = SelfAwareTradingEnv(ohlcv_df=target_df, episode_length=len(target_df))
            envs.append({
                'context': context_df,
                'target_env': env
            })
        return envs


| Component         | Purpose                                                   | Use Cases / Benefits                                    |
| ----------------- | --------------------------------------------------------- | ------------------------------------------------------- |
| `advantage_head`  | Predicts how much **edge** the agent expects from a state | Detect market opportunities or stale/noisy environments |
| `confidence_head` | Predicts how **confident** the agent is in its decision   | Suppress overtrading in noisy periods                   |
| Logging both      | Helps you **analyze** what the agent sees as risky, easy  | Visualize uncertainty over time or by ticker            |

### Dev note:
I decided to feed `confidence_score` into the observation state so the bastard can regulate itself.


In [16]:
import torch
import torch.nn as nn
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor


class TransformerFeatureExtractor(BaseFeaturesExtractor):
    def __init__(self, observation_space, seq_len=60, embed_dim=64, n_heads=4, n_layers=2):
        super().__init__(observation_space, features_dim=embed_dim + 1)  # +1 for confidence

        self.seq_len = seq_len
        self.embed_dim = embed_dim

        self.input_dim = observation_space.shape[0]
        self.embedding = nn.Linear(self.input_dim, embed_dim)
        self.pos_embedding = nn.Parameter(torch.zeros(1, seq_len, embed_dim))

        encoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        
        # Self-assessment
        self.output_layer = nn.Linear(embed_dim, embed_dim)
        self.confidence_head = nn.Sequential(
            nn.Linear(embed_dim, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
            nn.Sigmoid()  # Confidence between 0 and 1
        )

    def forward(self, obs):
        B = obs.shape[0]
        if obs.dim() == 2:
            obs = obs.unsqueeze(1)

        x = self.embedding(obs)
        if x.shape[1] < self.seq_len:
            pad_len = self.seq_len - x.shape[1]
            padding = torch.zeros((B, pad_len, self.embed_dim), device=x.device)
            x = torch.cat([padding, x], dim=1)
        elif x.shape[1] > self.seq_len:
            x = x[:, -self.seq_len:, :]

        x = x + self.pos_embedding[:, :self.seq_len, :]
        x = self.encoder(x)
        x_last = x[:, -1]

        confidence = self.confidence_head(x_last)
        return torch.cat([x_last, confidence], dim=-1)  # shape: (B, embed_dim + 1)



from stable_baselines3.ppo.policies import MlpPolicy
from stable_baselines3.common.torch_layers import MlpExtractor
from stable_baselines3.common.policies import ActorCriticPolicy
from stable_baselines3.common.callbacks import EvalCallback, CheckpointCallback
from sb3_contrib import RecurrentPPO
from sb3_contrib.common.recurrent.policies import RecurrentActorCriticPolicy

class TransformerPolicy(RecurrentActorCriticPolicy):
    def __init__(self, *args, **kwargs):
        super().__init__(
            *args,
            **kwargs,
            features_extractor_class=TransformerFeatureExtractor,
            features_extractor_kwargs=dict(seq_len=60, embed_dim=64, n_heads=4, n_layers=2)
        )


In [17]:
import os
import pandas as pd
from tqdm import tqdm
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.monitor import Monitor
from sb3_contrib import RecurrentPPO

class TrainingRunner:
    def __init__(self, raw_ohlcv_df, model_save_path="self_aware_agent", total_timesteps=10_000):
        self.raw_df = raw_ohlcv_df
        self.model_save_path = model_save_path
        self.total_timesteps = total_timesteps

    def run(self):
        loader = WindowedTrainingLoader(self.raw_df)
        episodes = loader.get_training_environments()

        # Resume logic
        journal_path = "training_journal.csv"
        if os.path.exists(journal_path):
            journal = pd.read_csv(journal_path)  # read once, reuse below
            completed_windows = set(journal['window'].unique())
            all_metrics = [journal]
        else:
            completed_windows = set()
            all_metrics = []

        for i, episode in enumerate(tqdm(episodes, desc="Training windows")):
            window_id = i + 1
            model_path = os.path.join(self.model_save_path, f"agent_window_{window_id}.zip")

            if window_id in completed_windows or os.path.exists(model_path):
                print(f"⏩ Skipping window {window_id}, already completed.")
                continue

            context_df = episode['context']
            env = DummyVecEnv([lambda: Monitor(episode['target_env'])])

            model = RecurrentPPO(
                policy=TransformerPolicy,
                env=env,
                verbose=0,
            )

            print(f"\n🧠 Training window {window_id}/{len(episodes)}")

            callback = EpisodeLoggerCallback()  # its __init__ accepts only `verbose`, not `log_interval`
            model.learn(total_timesteps=self.total_timesteps, callback=callback)

            os.makedirs(self.model_save_path, exist_ok=True)
            model.save(model_path)

            # Append and immediately persist metrics for crash resilience
            episode_df = pd.DataFrame(callback.episode_data)
            episode_df['window'] = window_id
            all_metrics.append(episode_df)

            pd.concat(all_metrics, ignore_index=True).to_csv(journal_path, index=False)

        print("\n✅ Training complete. Journal saved to 'training_journal.csv'")

In [18]:
notification = Notify('Self aware trading agent')
notification.success("Let's go")

In [None]:
raw_df = load_base_dataframe()
raw_df = raw_df.sort_values("date").reset_index(drop=True)
raw_df = raw_df[(raw_df['date']>="2024-06-01") & (raw_df['date']<'2025-06-01')]
# Replace with your actual data loading pipeline
#raw_df = pd.read_csv("data/sp500_ohlcv.csv", parse_dates=["date"])
#raw_df = raw_df.sort_values("date").reset_index(drop=True)

# === Run the self-aware transformer agent training ===
runner = TrainingRunner(
    raw_ohlcv_df=raw_df,
    model_save_path="self_aware_transformer",
    total_timesteps=10_000 
)
notification.info('Train started')
runner.run()
notification.info('Train complete')

In [None]:
raw_df = load_base_dataframe()
raw_df = raw_df.sort_values("date").reset_index(drop=True)
raw_df = raw_df[(raw_df['date']>="2024-06-01") & (raw_df['date']<'2025-06-01')]
notification.info('Test started')
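# NOTE: EvaluationRunner is not defined in this notebook; define or import it before running this cell.
eval_runner = EvaluationRunner(raw_df)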
eval_runner.run()
notification.info('Test complete')

Hell yes — we’ve built a killer POC already, but now we enter the fun zone: **tightening screws, pushing limits, and future-proofing**.

Here are my **top strategic suggestions** to improve the system:

---

## 🔁 1. **Allow Agent to See the Context Directly**

Right now the agent **only trains on `target_df`** — even though the Transformer *could* learn from the `context_df`.

### 🔧 Option:

Concatenate `context_df + target_df` into one episode:

* Train the agent only on rewards from the `target_df` portion (e.g. via masking or zero rewards during context)
* Allows the Transformer to **build internal market memory naturally**
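One way to sketch the masking idea: step through the concatenated episode but zero the reward while the step index is still inside the context. The function and its arguments are hypothetical and not tied to `SelfAwareTradingEnv`.

```python
def mask_context_rewards(rewards, context_len):
    """Zero out rewards earned during the context portion of an episode."""
    return [0.0 if t < context_len else r for t, r in enumerate(rewards)]

# 3 context steps followed by 2 target steps (made-up rewards).
raw = [0.4, -0.1, 0.2, 0.05, -0.02]
masked = mask_context_rewards(raw, context_len=3)
print(masked)  # [0.0, 0.0, 0.0, 0.05, -0.02]
```

Actions taken during the context still change the position carried into the target, so forcing a flat position until the target begins may be a cleaner variant.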

---

## 🧠 2. **Auxiliary Heads: Meta-Predictions**

Add small auxiliary outputs to the Transformer that:

* Predict future volatility
* Estimate next step reward
* Classify regime (trend, chop, revert)

Why?

> Forces the encoder to learn **richer representations** beyond just immediate actions.

---

## 📊 3. **Regime-Based Curriculum**

Instead of sampling all episodes equally:

* Stratify by volatility or trend
* Train easier episodes first
* Slowly introduce difficult / noisy environments

You can also **balance the regime mix** so the agent doesn't overfit to "easy" windows.
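A curriculum of this kind could start from a simple difficulty score. The sketch below ranks windows by the standard deviation of their close-to-close returns, assuming each window is a dict with a `'close'` list; that layout and the function names are illustrative, not the structure produced by `MarketWindowBuilder`.

```python
import statistics

def realized_volatility(closes):
    """Sample standard deviation of simple close-to-close returns."""
    returns = [b / a - 1.0 for a, b in zip(closes, closes[1:])]
    return statistics.stdev(returns)

def curriculum_order(windows):
    """Sort windows from calm to volatile."""
    return sorted(windows, key=lambda w: realized_volatility(w["close"]))

# Made-up price paths with clearly different volatility.
windows = [
    {"id": "choppy", "close": [100, 110, 95, 108, 97]},
    {"id": "calm",   "close": [100, 101, 102, 101, 102]},
    {"id": "medium", "close": [100, 103, 99, 104, 100]},
]
order = [w["id"] for w in curriculum_order(windows)]
print(order)  # ['calm', 'medium', 'choppy']
```

Bucketing the sorted list into quantiles and sampling evenly from each bucket would be one way to balance the regime mix.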

---

## 🔄 4. **Relative Advantage Evaluation**

Instead of just measuring reward:

* Compare to baseline strategies (random, hold)
* Define `agent_advantage = reward_agent - reward_baseline`

You already built the machinery — just run each `target_df` twice and subtract.

---

## 🧠 5. **RL Memory Injection (optional but powerful)**

The pipeline already trains with `RecurrentPPO`, whose policy carries **LSTM memory**, so that memory can be warmed up before trading starts:

* Let the agent process `context_df` passively (no reward, no action)
* Then start the episode normally with memory *initialized*

This mimics how traders study history before acting.

---

## 🏗️ 6. **Online Deployment Skeleton**

Later, you could wire this up for:

* Daily retraining using recent data
* Live execution of agent policy on top tickers
* Continual journaling + self-analysis

Use a rolling horizon evaluation loop with `context → target` sliding forward by 1 week or 1 day.

---

## 📦 Bonus Tactical Tips

| Area              | Suggestion                                                      |
| ----------------- | --------------------------------------------------------------- |
| Logging           | Add `num_trades`, `avg_holding_time`, `win_rate` to journal     |
| Feature Expansion | Add cross-ticker info: sector trend, market index return        |
| Reward Design     | Add `position_penalty` to discourage churning                   |
| Model Variants    | Try `Transformer + LSTM`, or `Temporal Convolution + Attention` |

---

## ⚡ Want to act on any of these now?

I recommend either:

1. Concatenating `context + target` into single episodes (with smart reward masking)
2. Adding relative performance benchmarking (agent vs. random/hold)

