# PyTorchLOB: Safe RL & FIFO Matching Demo

This notebook demonstrates:
1. **FIFO (Non-RL) Execution**: A baseline passive strategy that relies on the order book's price-time priority.
2. **Safe RL Execution**: A PPO agent that uses **PID-Lagrangian** control to keep slippage under a cost limit.

In [2]:
import torch
import sys
import os
import numpy as np
from torch_exchange import TorchExecutionEnv

### 1. Set Up Device

In [3]:
if torch.backends.mps.is_available():
    device = torch.device("mps")
    print("✅ M1/M2/M3 GPU Detected (MPS)")
else:
    device = torch.device("cpu")
    print("⚠️ Using CPU")

✅ M1/M2/M3 GPU Detected (MPS)


### 1.5. Real Market Data Replay
Instead of a random walk, we can load high-frequency crypto limit order book (LOB) data (e.g., Bitcoin at 1-second resolution) to drive the simulated price path.

The environment will:
1. Read the `midpoint` price series from the CSV.
2. At each step, update the `arrival_mid_price` to the next real data point.
3. Regenerate the LOB liquidity around this new realistic price.
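
The three steps above can be sketched in plain Python. This is an illustration only, not the `TorchExecutionEnv` implementation: `MidpointReplay` and `synthetic_book` are hypothetical names, and the symmetric book with a fixed tick and size is an assumption made to keep the example short.

```python
import csv
import io


class MidpointReplay:
    """Hypothetical helper: steps through a recorded midpoint series."""

    def __init__(self, csv_file, column="midpoint", nrows=None):
        reader = csv.DictReader(csv_file)
        self.prices = []
        for i, row in enumerate(reader):
            if nrows is not None and i >= nrows:
                break
            self.prices.append(float(row[column]))
        if not self.prices:
            raise ValueError("no price rows found")
        self.t = 0

    def reset(self):
        # Rewind to the first recorded price.
        self.t = 0
        return self.prices[0]

    def step(self):
        # Advance to the next price; hold the last one once data runs out.
        self.t = min(self.t + 1, len(self.prices) - 1)
        return self.prices[self.t]


def synthetic_book(mid, depth=3, tick=0.5, size=100):
    """Rebuild symmetric bid/ask levels around `mid` (assumed layout)."""
    bids = [(mid - tick * (k + 1), size) for k in range(depth)]
    asks = [(mid + tick * (k + 1), size) for k in range(depth)]
    return bids, asks


# Tiny in-memory CSV standing in for the real data file.
sample = io.StringIO("midpoint\n100.0\n100.5\n99.5\n")
replay = MidpointReplay(sample)
mid = replay.reset()
bids, asks = synthetic_book(mid)
```

Calling `replay.step()` and then `synthetic_book(...)` on the returned price reproduces the "update the mid, regenerate liquidity" loop described in the list.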

In [5]:
data_path = 'high-frequency-crypto-limit-order-book-data/BTC_1sec.csv'

# Initialize Environment with Data
env_real = TorchExecutionEnv(
    task='sell', 
    task_size=500000, 
    device=device, 
    book_depth=10, 
    data_path=data_path, 
    nrows=20000
)

obs, info = env_real.reset()
print(f"Environment initialized with Real Data. Start Price: {env_real.arrival_mid_price:.2f}")

# Step the environment 100 times and print the replayed mid price.
# Action space: [FT, M, NT, PP]; here 10 units each go to NT and PP.
action = torch.tensor([0, 0, 10, 10], device=device, dtype=torch.int32)
for i in range(100):
    obs, reward, terminated, truncated, info = env_real.step(action)
    print(f"Step {i+1}: Price {env_real.arrival_mid_price:.2f}")

Loading data from high-frequency-crypto-limit-order-book-data/BTC_1sec.csv...
Loaded 20000 price points.
Environment initialized with Real Data. Start Price: 56253.11
Step 1: Price 56253.11
Step 2: Price 56253.11
Step 3: Price 56253.11
Step 4: Price 56253.11
Step 5: Price 56264.43
Step 6: Price 56250.00
Step 7: Price 56250.00
Step 8: Price 56250.00
Step 9: Price 56265.50
Step 10: Price 56271.71
Step 11: Price 56271.80
Step 12: Price 56273.68
Step 13: Price 56273.49
Step 14: Price 56285.64
Step 15: Price 56273.23
Step 16: Price 56261.38
Step 17: Price 56261.38
Step 18: Price 56261.45
Step 19: Price 56275.82
Step 20: Price 56266.32
Step 21: Price 56266.32
Step 22: Price 56266.32
Step 23: Price 56266.39
Step 24: Price 56281.20
Step 25: Price 56276.36
Step 26: Price 56278.83
Step 27: Price 56283.98
Step 28: Price 56301.32
Step 29: Price 56302.71
Step 30: Price 56308.34
Step 31: Price 56303.98
Step 32: Price 56307.82
Step 33: Price 56306.93
Step 34: Price 56302.59
Step 35: Price 56288.83
...

### 2. FIFO Execution (Non-RL Baseline)
This strategy places passive orders at the touch and waits for them to fill. Fills are decided by the matching engine's first-in-first-out (price-time priority) logic.
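
To make the time-priority half of price-time priority concrete, here is a minimal sketch of a single price level served in arrival order. `PriceLevel` is a hypothetical class for illustration and is not taken from `torch_exchange`. A full matching engine would also rank levels by price before applying this queue.

```python
from collections import deque


class PriceLevel:
    """Resting orders at one price, served in arrival order (FIFO)."""

    def __init__(self):
        self.queue = deque()  # entries are [order_id, remaining_qty]

    def add(self, order_id, qty):
        # New orders join the back of the queue.
        self.queue.append([order_id, qty])

    def match(self, incoming_qty):
        """Fill incoming quantity from the front; return (order_id, filled_qty) pairs."""
        fills = []
        while incoming_qty > 0 and self.queue:
            head = self.queue[0]
            traded = min(incoming_qty, head[1])
            fills.append((head[0], traded))
            head[1] -= traded
            incoming_qty -= traded
            if head[1] == 0:
                self.queue.popleft()  # fully filled orders leave the queue
        return fills


level = PriceLevel()
level.add("earlier", 30)  # arrived first
level.add("ours", 50)     # our passive order, queued behind
fills = level.match(40)   # an aggressive order for 40 units arrives
```

The earlier order is filled in full before ours receives anything, which is why a passive strategy's fill rate depends on its queue position.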

In [3]:
env_fifo = TorchExecutionEnv(task='sell', task_size=500000, device=device, book_depth=20, init_price=10000000)
obs, _ = env_fifo.reset()

print("Running FIFO Baseline...")
total_reward = 0
total_cost = 0
steps = 0

# Fixed policy: submit the same passive orders at every step
# Action space: [FT, M, NT, PP]
# Passive strategy: all quantity goes to Near Touch (NT) and Passive Price (PP),
# split evenly between the two

fifo_action = np.array([0, 0, 50, 50], dtype=np.int32)


while True:
    obs, reward, terminated, truncated, info = env_fifo.step(fifo_action)
    done = terminated or truncated
    total_reward += reward
    total_cost += info.get('cost', 0)
    steps += 1
    if done:
        break

print(f"FIFO Result: Steps={steps}, Avg Cost (Slippage)={total_cost/steps:.4f}")

Running FIFO Baseline...
FIFO Result: Steps=100, Avg Cost (Slippage)=0.0000


### 3. Safe RL Training (PID-Lagrangian)
Train PPO with a **PID-Lagrangian** constraint: a PID controller adjusts the Lagrange multiplier so that slippage cost stays under `cost_limit`.
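
As a rough sketch of the idea, the multiplier update can be written as a small PID controller on the constraint violation. This is not the `PPOAgent` internals: the class name `PIDLagrangian` and the `ki` and `kd` values are assumptions, while `cost_limit=0.5` and `kp=2.0` mirror the settings used in the cell below.

```python
class PIDLagrangian:
    """Hypothetical PID controller for a Lagrange multiplier."""

    def __init__(self, cost_limit, kp=2.0, ki=0.1, kd=0.0):
        self.cost_limit = cost_limit
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_cost = 0.0
        self.lam = 0.0

    def update(self, mean_cost):
        # Positive error means the constraint is violated.
        error = mean_cost - self.cost_limit
        # Integral term accumulates violation, clipped at zero.
        self.integral = max(0.0, self.integral + error)
        # Derivative term reacts only to rising cost.
        derivative = max(0.0, mean_cost - self.prev_cost)
        self.prev_cost = mean_cost
        # The multiplier itself must stay non-negative.
        self.lam = max(
            0.0,
            self.kp * error + self.ki * self.integral + self.kd * derivative,
        )
        return self.lam


pid = PIDLagrangian(cost_limit=0.5, kp=2.0, ki=0.1)
lam_high = pid.update(1.0)  # batch cost above the limit
lam_low = pid.update(0.2)   # batch cost back under the limit
```

One common way to use the multiplier is to train the policy on a penalized signal such as `reward - lam * cost`, so the penalty grows while the batch cost exceeds the limit and relaxes once it falls back under it.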

In [None]:
from torch_exchange.ppo import PPOAgent

env_rl = TorchExecutionEnv(
    task='sell',
    task_size=50000,
    device=device,
    book_depth=10,
    data_path=data_path,
    nrows=20000
)
obs, _ = env_rl.reset()

# PPO agent with a tight cost limit and a PID proportional gain
# for the Lagrange multiplier
agent = PPOAgent(
    env_rl,
    device=device,
    model_type='cnn',
    lr=3e-5,
    cost_limit=0.5,  # Limit slippage
    pid_Kp=2.0
)

print("Starting Safe RL Training (PID-Lagrangian)...")
agent.train(total_timesteps=1000, batch_size=128)

Loading data from high-frequency-crypto-limit-order-book-data/BTC_1sec.csv...
Loaded 20000 price points.
Starting Safe RL Training (PID-Lagrangian)...


Training:  56%|██▊  | 555/1000 [00:54<00:44,  9.96it/s, batch_rew=-0.01, batch_cost=31.36, loss=0.012, lambda=0.0000]