# PyTorchLOB: Safe RL & FIFO Matching Demo

This notebook demonstrates:
1. **FIFO (Non-RL) Execution**: A baseline passive strategy relying on the orderbook's price-time priority.
2. **Safe RL Execution**: A PPO agent that uses **PID-Lagrangian** control to keep slippage under a cost limit.

In [1]:
import os
import sys

import numpy as np
import torch

# Add the current directory to the import path before importing the local package
sys.path.append(os.getcwd())

from torch_exchange.environment import TorchExecutionEnv

Gym has been unmaintained since 2022 and does not support NumPy 2.0 amongst other critical functionality.
Please upgrade to Gymnasium, the maintained drop-in replacement of Gym, or contact the authors of your software and request that they upgrade.
Users of this version of Gym should be able to simply replace 'import gym' with 'import gymnasium as gym' in the vast majority of cases.
See the migration guide at https://gymnasium.farama.org/introduction/migration_guide/ for additional information.


### 1. Setup Device

In [2]:
if torch.backends.mps.is_available():
    device = torch.device("mps")
    print("✅ M1/M2/M3 GPU Detected (MPS)")
else:
    device = torch.device("cpu")
    print("⚠️ Using CPU")

✅ M1/M2/M3 GPU Detected (MPS)


### 2. FIFO Execution (Non-RL Baseline)
This strategy places passive orders at the touch and waits for them to fill. Fills depend on the matching engine's First-In-First-Out (price-time) priority: at a given price, orders that arrived earlier execute first.
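
To make the price-time priority concrete, here is a minimal sketch of FIFO matching at a single price level. It is not the `torch_exchange` matching engine; `RestingOrder` and `PriceLevel` are hypothetical names used only for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class RestingOrder:
    order_id: int
    qty: int


class PriceLevel:
    """Hypothetical single price level: resting orders are served in arrival order."""

    def __init__(self):
        self.queue = deque()

    def add(self, order_id, qty):
        # New orders always join the back of the queue (time priority).
        self.queue.append(RestingOrder(order_id, qty))

    def match(self, incoming_qty):
        """Fill incoming_qty from the front of the queue; return (order_id, filled_qty) pairs."""
        fills = []
        while incoming_qty > 0 and self.queue:
            head = self.queue[0]
            traded = min(head.qty, incoming_qty)
            fills.append((head.order_id, traded))
            head.qty -= traded
            incoming_qty -= traded
            if head.qty == 0:
                self.queue.popleft()  # fully filled orders leave the book
        return fills


level = PriceLevel()
level.add(order_id=1, qty=100)  # arrived first
level.add(order_id=2, qty=50)   # a passive order queued behind order 1
fills = level.match(120)
print(fills)
```

In this sketch the later order only starts filling once everything ahead of it is exhausted, which is why a purely passive strategy can wait a long time for execution.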

In [3]:
from torch_exchange.environment import TorchExecutionEnv

env_fifo = TorchExecutionEnv(task='sell', task_size=5000, device=device, book_depth=20)
obs = env_fifo.reset()

print("Running FIFO Baseline...")
total_reward = 0
total_cost = 0
steps = 0

# Fixed policy: submit the same passive allocation at every step.
# Action space: [FT, M, NT, PP]
# FIFO/passive strategy: nothing at FT or M; split evenly between
# Near Touch (NT) and Passive Price (PP).

fifo_action = np.array([0, 0, 50, 50], dtype=np.int32)

while True:
    obs, reward, done, info = env_fifo.step(fifo_action)
    total_reward += reward
    # Falls back to 0 if the env does not report a 'cost' key in info
    total_cost += info.get('cost', 0)
    steps += 1
    if done:
        break

print(f"FIFO Result: Steps={steps}, Avg Cost (Slippage)={total_cost/steps:.4f}")

Running FIFO Baseline...
FIFO Result: Steps=100, Avg Cost (Slippage)=0.0000
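
The notebook labels `cost` as slippage but does not show how the environment computes it. One common definition for a sell order, sketched below as an assumption rather than as the `TorchExecutionEnv` formula, is the arrival price minus the volume-weighted average fill price. The function name `sell_slippage` and the sample numbers are hypothetical.

```python
import numpy as np


def sell_slippage(arrival_price, fill_prices, fill_qtys):
    """Arrival price minus fill VWAP; positive means the sell executed below arrival."""
    prices = np.asarray(fill_prices, dtype=float)
    qtys = np.asarray(fill_qtys, dtype=float)
    total_qty = qtys.sum()
    if total_qty == 0:
        return 0.0  # nothing executed, so no slippage is realized
    vwap = float((prices * qtys).sum() / total_qty)
    return arrival_price - vwap


# Hypothetical fills: 300 units at 100.0 and 200 units at 99.5, arrival price 100.0
slip = sell_slippage(100.0, [100.0, 99.5], [300, 200])
print(f"slippage per unit: {slip:.4f}")
```

Under this definition, a run with no fills also reports zero, so an average cost of exactly 0.0000 is worth cross-checking against the executed quantity.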


### 3. Safe RL Training (PID-Lagrangian)
Train PPO with a **PID-Lagrangian** multiplier that penalizes slippage above `cost_limit`.
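
As a rough illustration of the idea, the sketch below updates a Lagrange multiplier with a PID controller driven by the constraint violation (mean cost minus the cost limit). It is an assumption about the general technique, not the `PPOAgent` implementation; the class name `PIDLagrangian` and the `ki` and `kd` values are hypothetical, while `kp=2.0` and `cost_limit=0.5` mirror the arguments used in this notebook.

```python
class PIDLagrangian:
    """Hypothetical PID controller for a Lagrange multiplier (lambda >= 0)."""

    def __init__(self, cost_limit, kp=2.0, ki=0.1, kd=0.0):
        self.cost_limit = cost_limit
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_cost = 0.0

    def update(self, mean_cost):
        error = mean_cost - self.cost_limit                 # constraint violation
        self.integral = max(0.0, self.integral + error)     # accumulated violation, clipped at 0
        derivative = max(0.0, mean_cost - self.prev_cost)   # only react to rising cost
        self.prev_cost = mean_cost
        lam = self.kp * error + self.ki * self.integral + self.kd * derivative
        return max(0.0, lam)                                # multiplier is never negative


controller = PIDLagrangian(cost_limit=0.5, kp=2.0, ki=0.1)
lam_high = controller.update(59.38)  # batch cost far above the limit
lam_low = PIDLagrangian(cost_limit=0.5).update(0.1)  # cost below the limit
print(lam_high, lam_low)

# The multiplier would then weight the cost term in the policy objective,
# for example: penalized_reward = reward - lam * cost
```

In this sketch a batch cost well above the limit yields a positive multiplier, and a cost below the limit yields zero. If a training log shows costs above `cost_limit` while `lambda` stays at 0, it may be worth checking how and when the agent updates its multiplier.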

In [3]:
from torch_exchange.ppo import PPOAgent

env_rl = TorchExecutionEnv(task='sell', task_size=5000, device=device, book_depth=20)

# Initialize with a tight cost limit. No safety-mode argument is passed here,
# so PID mode is assumed to be the agent's default.
agent = PPOAgent(
    env_rl,
    device=device,
    lr=3e-5,
    cost_limit=0.5,  # Limit slippage
    pid_Kp=2.0,
)

print("Starting Safe RL Training (PID-Lagrangian)...")
agent.train(total_timesteps=1000, batch_size=32)

Starting Safe RL Training (PID-Lagrangian)...


Training: 1024it [01:37, 10.47it/s, batch_rew=-0.30, batch_cost=59.38, loss=6.427, lambda=0.0000]                    
