mr-spaghetti-code/reactor-bench

ReactorBench

A real-time continuous control benchmark for AI agents. Tests whether your LLM can keep a simulated reactor from melting down while tracking variable power demand.

What Is This?

Most AI benchmarks test question answering or task completion. ReactorBench is different. It's a simulated nuclear reactor running at 10Hz. The physics keeps ticking whether the AI responds or not.

Think plate spinning, not firefighting. The reactor drifts constantly. Sensors lie. Actuators get stuck. Power demands keep changing. If you take too long to think, the fuel temperature is already climbing by the time you respond.

A good agent anticipates and makes small corrections. A bad one panics and SCRAMs.

The goal: Maintain reactor stability for 300 seconds while tracking a variable power target. Harder than it sounds.

Why This Matters

This benchmark tests something most benchmarks miss: performance under time pressure.

We usually test model speed by measuring latency or time-to-first-token. But that's a naive view of the speed-quality tradeoff. What matters isn't just how fast the model responds in isolation, but how well it performs when the task itself is time-sensitive.

A slower model doesn't just take longer to answer. It gets worse answers because the problem has drifted while it was thinking. At 10Hz, every second of inference is 10 ticks where the reactor moves without correction.

This shows up in real-world applications: autonomous vehicles, robots, industrial control, trading agents making sub-second decisions. In all these domains, thinking for 2 seconds means the world has already changed by the time you act.

Current Results

Baseline scores (averaged over 3 runs):

| Controller     | Score | Std | Notes                       |
|----------------|-------|-----|-----------------------------|
| Simple Rules   | 76.2  | 0.2 | Manual-following heuristics |
| PID            | 65.7  | 0.4 | Classical control theory    |
| Enhanced Rules | 64.9  | 1.2 | Rules + scenario awareness  |
| No-Op          | 38.1  | 3.0 | Does literally nothing      |
| Random         | 14.6  | 1.6 | Random button mashing       |

Best LLM tested: Kimi K2 (via Groq) at 71.0, just 5.2 points below Simple Rules.

Most LLMs score between 30 and 60. Fast models (GPT-4.1-mini, Kimi K2 on Groq) do better than slower, larger models. Claude Opus scores 35.9, worse than doing nothing. For real-time control, inference speed matters more than model intelligence.
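
For reference, the PID baseline boils down to a standard discrete PID loop. The sketch below is generic: the gains and variable names are illustrative, not the tuning in baselines/:

```python
class PID:
    """Minimal discrete PID controller (illustrative gains, not the repo's tuning)."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = None

    def step(self, setpoint, measurement):
        error = setpoint - measurement
        self.integral += error * self.dt
        # No derivative term on the very first tick (no previous error yet).
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

pid = PID(kp=0.5, ki=0.1, kd=0.05, dt=0.1)  # dt matches the 10Hz tick
u = pid.step(setpoint=100.0, measurement=95.0)  # control correction for this tick
```

A controller like this runs every tick and never "thinks," which is exactly why it sets a strong floor for LLM agents that pay an inference latency on each decision.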

Quick Start

Backend

cd backend
pip install -r requirements.txt
python -m uvicorn server:app --reload --host 0.0.0.0 --port 8000

Frontend (optional, for watching)

cd frontend
npm install
npm run dev

Open http://localhost:3000 to watch the reactor in real time.

Run a Benchmark

pip install openai anthropic websockets python-dotenv scipy matplotlib google-genai

# Set your API key
export ANTHROPIC_API_KEY=your-key-here
# or: export OPENAI_API_KEY=your-key-here
# or: export GOOGLE_API_KEY=your-key-here

# Single run
python run_benchmark.py --model claude-sonnet-4-5 --duration 300

# Multi-seed (recommended for real benchmarking)
python run_benchmark.py --model gpt-4o --runs 5

# Run baselines for comparison
python run_benchmark.py --baselines --runs 5

Benchmark Modes

| Mode           | What It Does                                                  |
|----------------|---------------------------------------------------------------|
| standard       | Full challenge mix, 300s. Use this for official benchmarking. |
| endless        | Escalating difficulty until the reactor fails. Stress testing. |
| pure_control   | No scenarios, minimal noise. Tests raw control ability.       |
| pure_diagnosis | Sensor faults only. Tests epistemic reasoning.                |

python run_benchmark.py --mode standard   # default
python run_benchmark.py --mode endless    # how long can you survive?

Scoring

The score is 0-100, computed from:

Score = 40% Power Tracking + 30% Temp Stability + 20% Control Smoothness + 10% Survival - SCRAM Penalty

Power Tracking: Stay within ±5% of the target power. The target changes continuously.

Temperature: Keep fuel temp in the 680-720K optimal band (not just the 600-800K safe range).

Smoothness: Don't thrash the controls. Small corrections beat big ones.

SCRAM Penalty: Context-aware. Emergency SCRAM at 1000K? Fine, no penalty. Panic SCRAM at 700K? Minus 30 points.

Token usage and latency are tracked but don't affect the score. They're reported separately so you can compare efficiency.
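
The weighted blend above can be sketched in a few lines. The component values here are placeholders; the real component definitions live in backend/evaluator.py:

```python
def composite_score(tracking, temp_stability, smoothness, survival, scram_penalty=0.0):
    """Weighted blend per the formula above; each component is in [0, 100].

    Simplified stand-in for backend/evaluator.py, not the actual scoring code.
    """
    raw = 0.4 * tracking + 0.3 * temp_stability + 0.2 * smoothness + 0.1 * survival
    return max(0.0, min(100.0, raw - scram_penalty))

# Example: strong tracking, decent temperature control, smooth inputs, full survival.
score = composite_score(tracking=80, temp_stability=70, smoothness=90, survival=100)
```

Note how the SCRAM penalty is subtracted after the weighted sum, so a panic SCRAM can drag an otherwise good run down by a full letter grade.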

Architecture

┌─────────────────┐     WebSocket      ┌──────────────────┐
│   LLM Agent     │ ◄─────────────────► │  Physics Engine  │
│ (run_benchmark) │    JSON commands    │    (FastAPI)     │
└─────────────────┘                     └──────────────────┘
                                               │
                                               │ 10Hz tick
                                               ▼
                                        ┌──────────────────┐
                                        │  React Dashboard │
                                        │   (optional)     │
                                        └──────────────────┘
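
The agent side of the diagram is a read-decide-act loop over the WebSocket. The endpoint path, field names, and command schema below are assumptions for illustration; run_benchmark.py defines the real protocol:

```python
import asyncio
import json

def decide(state):
    """Trivial proportional policy. The field names here are guesses,
    not the real message schema."""
    error = state["power_target"] - state["power"]
    return {"control_rods": 0.01 * error}  # small corrections beat big ones

async def agent_loop(url="ws://localhost:8000/ws"):
    """Skeleton read-decide-act loop; the endpoint path is an assumption."""
    import websockets  # from the pip install list above
    async with websockets.connect(url) as ws:
        while True:
            state = json.loads(await ws.recv())   # latest sensor snapshot
            await ws.send(json.dumps(decide(state)))

# asyncio.run(agent_loop())
```

The key property is that the physics engine ticks at 10Hz regardless of how long `decide` (or an LLM call in its place) takes, so any latency inside the loop shows up as uncorrected drift.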

Key files:

  • backend/physics.py: Point reactor kinetics + thermal dynamics
  • backend/scenarios.py: The chaos monkey that breaks things
  • backend/evaluator.py: Scoring logic
  • run_benchmark.py: LLM agent harness with structured outputs
  • reactor_manual.md: What the agent reads to understand the reactor

Results

Results go in results/ with auto-generated filenames:

results/
├── benchmark_claude-haiku-4-5_300s_20251214_181501.json
├── aggregate_claude-haiku-4-5_300s_5runs_20251214_195130.json
└── plots/
    └── benchmark_claude-haiku-4-5_300s_20251214_181501.png

Multi-run benchmarks give you mean, std, and 95% CI:

{
  "score": {
    "mean": 41.9,
    "std": 14.9,
    "ci_95": [23.4, 60.4]
  }
}
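
A normal-approximation 95% CI like the one above can be computed from per-run scores roughly as follows (the repo's exact method may differ, e.g. a t-interval via scipy):

```python
import statistics

def aggregate(scores):
    """Mean, sample std, and a normal-approximation 95% CI over run scores.

    Illustrative only; the repo's aggregation code may use a different interval.
    """
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)            # sample std (n - 1 denominator)
    half = 1.96 * std / len(scores) ** 0.5    # 1.96 ~ z for 95% coverage
    return {"mean": round(mean, 1), "std": round(std, 1),
            "ci_95": [round(mean - half, 1), round(mean + half, 1)]}

aggregate([40.0, 50.0, 60.0])
```

With only 3-5 runs the interval is wide, which is why multi-seed runs are recommended before comparing models.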

What Makes This Hard?

  1. Time pressure: The simulation doesn't wait. If your LLM takes 2 seconds to respond, the reactor has drifted 20 ticks.

  2. Coupled dynamics: Increasing coolant flow drops temperature, which increases reactivity, which increases power, which increases temperature again. Everything affects everything else with delays.

  3. Sensor uncertainty: Sensors have noise. They drift. They get stuck. Sometimes they lie. Cross-reference or get burned.

  4. Variable target: The power demand changes continuously. You can't just stabilize and coast.

  5. Overlapping failures: Late game throws multiple scenarios at once. Rod stuck AND sensor drifting AND pump degraded.
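
The feedback loop in point 2 can be illustrated with a toy two-state model. Every coefficient below is invented for illustration; backend/physics.py implements the real point kinetics:

```python
def toy_step(power, temp, rod_reactivity, coolant_flow, dt=0.1):
    """One 10Hz tick of a toy power/temperature loop.

    All coefficients are invented for illustration, not from backend/physics.py.
    """
    alpha = -0.002  # negative temperature coefficient: hotter fuel -> less reactive
    reactivity = rod_reactivity + alpha * (temp - 700.0)
    power += power * reactivity * dt                            # reactivity drives power
    heating = 0.05 * power                                      # fission heats the fuel
    cooling = 0.01 * coolant_flow * (temp - 550.0)              # coolant removes heat
    temp += (heating - cooling) * dt
    return power, temp

p, t = 100.0, 700.0
for _ in range(50):  # 5 simulated seconds at 10Hz
    p, t = toy_step(p, t, rod_reactivity=0.01, coolant_flow=1.0)
```

Even in this toy version, a rod adjustment changes power, power changes temperature, and temperature feeds back into reactivity, so a single "fix" keeps echoing through the system for many ticks.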

Limitations

This is a toy benchmark, not real reactor physics:

  • Xenon dynamics are compressed (real xenon takes 6-8 hours, not minutes)
  • Simplified sensors (real instrumentation has correlated failures)
  • Single agent (real reactors have crews and procedures)
  • 5-minute runs (real operators work 8+ hour shifts)

The point is testing whether LLMs can do continuous control under time pressure, not simulating actual reactor operations.

Files

├── backend/           # FastAPI server + physics simulation
├── frontend/          # React dashboard for watching runs
├── baselines/         # Reference controllers (PID, random, rules)
├── results/           # Benchmark outputs
├── tests/             # Solvability tests
├── run_benchmark.py   # Main LLM agent harness
├── reactor_manual.md  # Operator documentation (fed to the LLM)
├── SPECIFICATION.md   # Formal scoring definitions
└── VALIDATION_GUIDE.md # How to verify the benchmark works

Contributing

PRs welcome. Interesting areas:

  • Results from models not yet tested
  • New scenario types (what breaks LLMs?)
  • Better baseline controllers
  • Human operator baselines (how do people actually do?)
  • Analysis of specific failure modes

License

MIT. Do whatever you want with it.
