A real-time continuous control benchmark for AI agents. Tests whether your LLM can keep a simulated reactor from melting down while tracking variable power demand.
Most AI benchmarks test question answering or task completion. ReactorBench is different. It's a simulated nuclear reactor running at 10Hz. The physics keeps ticking whether the AI responds or not.
Think plate spinning, not firefighting. The reactor drifts constantly. Sensors lie. Actuators get stuck. Power demands keep changing. If you take too long to think, the fuel temperature is already climbing by the time you respond.
A good agent anticipates and makes small corrections. A bad one panics and SCRAMs.
The goal: Maintain reactor stability for 300 seconds while tracking a variable power target. Harder than it sounds.
This benchmark tests something most benchmarks miss: performance under time pressure.
We usually test model speed by measuring latency or time-to-first-token. But that's a naive view of the speed-quality tradeoff. What matters isn't just how fast the model responds in isolation, but how well it performs when the task itself is time-sensitive.
A slower model doesn't just take longer to answer. It gets worse answers because the problem has drifted while it was thinking. At 10Hz, every second of inference is 10 ticks where the reactor moves without correction.
This shows up in real-world applications: autonomous vehicles, robots, industrial control, trading agents making sub-second decisions. In all these domains, thinking for 2 seconds means the world has already changed by the time you act.
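To make that concrete, here's a minimal sketch of a fixed-rate control loop that keeps ticking whether or not a new command has arrived. The `sim` and `agent` objects and their methods are hypothetical, not the benchmark's actual loop:

```python
import time

TICK_RATE_HZ = 10
DT = 1.0 / TICK_RATE_HZ

def control_loop(sim, agent, duration_s=300.0):
    """Fixed-rate loop: the physics advances every tick, with or without the agent."""
    command = {}                        # start with a neutral / no-op command
    next_tick = time.monotonic()
    while sim.elapsed() < duration_s:
        if agent.has_new_command():     # non-blocking check; slow agents simply miss ticks
            command = agent.latest_command()
        sim.step(command, DT)           # the reactor drifts regardless of what the agent did
        next_tick += DT
        time.sleep(max(0.0, next_tick - time.monotonic()))
```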
Baseline scores (averaged over 3 runs):
| Controller | Score | Std | Notes |
|---|---|---|---|
| Simple Rules | 76.2 | 0.2 | Manual-following heuristics |
| PID | 65.7 | 0.4 | Classical control theory |
| Enhanced Rules | 64.9 | 1.2 | Rules + scenario awareness |
| No-Op | 38.1 | 3.0 | Does literally nothing |
| Random | 14.6 | 1.6 | Random button mashing |
Best LLM tested: Kimi K2 (via Groq) at 71.0, just 5.2 points below Simple Rules.
Most LLMs score between 30 and 60. Fast models (GPT-4.1-mini, Kimi K2 on Groq) do better than slower, larger models. Claude Opus scores 35.9, worse than doing nothing. Inference speed matters more than model intelligence for real-time control.
```bash
cd backend
pip install -r requirements.txt
python -m uvicorn server:app --reload --host 0.0.0.0 --port 8000
```

```bash
cd frontend
npm install
npm run dev
```

Open http://localhost:3000 to watch the reactor in real time.
```bash
pip install openai anthropic websockets python-dotenv scipy matplotlib google-genai

# Set your API key
export ANTHROPIC_API_KEY=your-key-here
# or: export OPENAI_API_KEY=your-key-here
# or: export GOOGLE_API_KEY=your-key-here
```
```bash
# Single run
python run_benchmark.py --model claude-sonnet-4-5 --duration 300

# Multi-seed (recommended for real benchmarking)
python run_benchmark.py --model gpt-4o --runs 5

# Run baselines for comparison
python run_benchmark.py --baselines --runs 5
```

| Mode | What It Does |
|---|---|
| `standard` | Full challenge mix, 300s. Use this for official benchmarking. |
| `endless` | Escalating difficulty until the reactor fails. Stress testing. |
| `pure_control` | No scenarios, minimal noise. Tests raw control ability. |
| `pure_diagnosis` | Sensor faults only. Tests epistemic reasoning. |
```bash
python run_benchmark.py --mode standard   # default
python run_benchmark.py --mode endless    # how long can you survive?
```

The score is 0-100, computed from:
Score = 40% Power Tracking + 30% Temp Stability + 20% Control Smoothness + 10% Survival - SCRAM Penalty
- Power Tracking: Stay within ±5% of the target power. The target changes continuously.
- Temp Stability: Keep fuel temp in the 680-720K optimal band (not just the 600-800K safe range).
- Control Smoothness: Don't thrash the controls. Small corrections beat big ones.
- SCRAM Penalty: Context-aware. Emergency SCRAM at 1000K? Fine, no penalty. Panic SCRAM at 700K? Minus 30 points.
Token usage and latency are tracked but don't affect the score. They're reported separately so you can compare efficiency.
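For a rough picture of how those weights combine, here's a hedged sketch of the composite score. The component inputs and the SCRAM rule are simplified stand-ins; the authoritative definitions live in `backend/evaluator.py` and SPECIFICATION.md:

```python
from typing import Optional

def composite_score(power_tracking: float,   # 0-1: fraction of ticks within ±5% of target power
                    temp_stability: float,   # 0-1: fraction of ticks in the 680-720K band
                    smoothness: float,       # 0-1: higher means smaller, gentler control moves
                    survival: float,         # 0-1: fraction of the 300s run survived
                    scram_fuel_temp: Optional[float] = None) -> float:
    score = 100.0 * (0.40 * power_tracking
                     + 0.30 * temp_stability
                     + 0.20 * smoothness
                     + 0.10 * survival)
    # Context-aware SCRAM penalty (simplified): a shutdown near real danger costs nothing,
    # a panic shutdown at normal temperatures costs up to 30 points.
    if scram_fuel_temp is not None and scram_fuel_temp < 1000.0:
        score -= 30.0
    return max(0.0, score)
```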
```
┌─────────────────┐       WebSocket        ┌──────────────────┐
│    LLM Agent    │ ◄────────────────────► │  Physics Engine  │
│ (run_benchmark) │     JSON commands      │    (FastAPI)     │
└─────────────────┘                        └──────────────────┘
                                                     │
                                                     │ 10Hz tick
                                                     ▼
                                            ┌──────────────────┐
                                            │ React Dashboard  │
                                            │    (optional)    │
                                            └──────────────────┘
```
Key files:
- `backend/physics.py`: Point reactor kinetics + thermal dynamics
- `backend/scenarios.py`: The chaos monkey that breaks things
- `backend/evaluator.py`: Scoring logic
- `run_benchmark.py`: LLM agent harness with structured outputs
- `reactor_manual.md`: What the agent reads to understand the reactor
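For a feel of the agent side, here's a hedged sketch of a WebSocket control loop. The endpoint path and the observation/command schemas are assumptions for illustration; `run_benchmark.py` defines the real protocol:

```python
import asyncio
import json
import websockets

async def run_agent(decide, uri="ws://localhost:8000/ws"):  # endpoint path is a guess
    async with websockets.connect(uri) as ws:
        async for message in ws:              # one observation per tick
            state = json.loads(message)       # sensor readings, power target, alarms, ...
            command = decide(state)           # e.g. {"rod_position": 0.45, "coolant_flow": 0.8}
            await ws.send(json.dumps(command))

# A do-nothing policy, roughly the No-Op baseline:
# asyncio.run(run_agent(lambda state: {}))
```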
Results go in results/ with auto-generated filenames:
```
results/
├── benchmark_claude-haiku-4-5_300s_20251214_181501.json
├── aggregate_claude-haiku-4-5_300s_5runs_20251214_195130.json
└── plots/
    └── benchmark_claude-haiku-4-5_300s_20251214_181501.png
```
Multi-run benchmarks give you mean, std, and 95% CI:
```json
{
  "score": {
    "mean": 41.9,
    "std": 14.9,
    "ci_95": [23.4, 60.4]
  }
}
```
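If you want to reproduce that aggregation, a t-based 95% confidence interval over per-run scores looks roughly like this (the benchmark's own aggregation may differ in detail):

```python
import numpy as np
from scipy import stats

def summarize(run_scores):
    scores = np.asarray(run_scores, dtype=float)
    mean, std = scores.mean(), scores.std(ddof=1)
    # t-distribution half-width for a 95% CI on the mean
    half_width = stats.t.ppf(0.975, df=len(scores) - 1) * std / np.sqrt(len(scores))
    return {"mean": mean, "std": std, "ci_95": [mean - half_width, mean + half_width]}

# e.g. summarize(scores_from_five_runs)
```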
- Time pressure: The simulation doesn't wait. If your LLM takes 2 seconds to respond, the reactor has drifted 20 ticks.
- Coupled dynamics: Increasing coolant flow drops temperature, which increases reactivity, which increases power, which increases temperature again. Everything affects everything else, with delays. (A toy sketch of this loop follows the list.)
- Sensor uncertainty: Sensors have noise. They drift. They get stuck. Sometimes they lie. Cross-reference or get burned.
- Variable target: The power demand changes continuously. You can't just stabilize and coast.
- Overlapping failures: Late game throws multiple scenarios at once. Rod stuck AND sensor drifting AND pump degraded.
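The coupled-dynamics loop can be pictured with a toy feedback model. This is not the model in `backend/physics.py`, and the constants are made up purely to show how the variables chain together:

```python
# Toy coupling sketch (illustrative constants, arbitrary units; not the real physics).
ALPHA_FUEL = -0.002   # reactivity lost per K of fuel heating (negative feedback)
GAIN = 0.5            # power response to net reactivity
T_REF, T_COOLANT = 700.0, 560.0
HEAT_CAP, HEAT_PER_PCT, COOLING = 20.0, 0.5, 0.4

def step(power_pct, fuel_temp, rod_reactivity, coolant_flow, dt=0.1):
    # More coolant flow cools the fuel; cooler fuel adds reactivity;
    # more reactivity raises power; more power reheats the fuel.
    rho = rod_reactivity + ALPHA_FUEL * (fuel_temp - T_REF)
    power_pct += GAIN * rho * power_pct * dt
    fuel_temp += (HEAT_PER_PCT * power_pct
                  - COOLING * coolant_flow * (fuel_temp - T_COOLANT)) * dt / HEAT_CAP
    return power_pct, fuel_temp
```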
This is a toy benchmark, not real reactor physics:
- Xenon dynamics are compressed (real xenon takes 6-8 hours, not minutes)
- Simplified sensors (real instrumentation has correlated failures)
- Single agent (real reactors have crews and procedures)
- 5-minute runs (real operators work 8+ hour shifts)
The point is testing whether LLMs can do continuous control under time pressure, not simulating actual reactor operations.
```
├── backend/              # FastAPI server + physics simulation
├── frontend/             # React dashboard for watching runs
├── baselines/            # Reference controllers (PID, random, rules)
├── results/              # Benchmark outputs
├── tests/                # Solvability tests
├── run_benchmark.py      # Main LLM agent harness
├── reactor_manual.md     # Operator documentation (fed to the LLM)
├── SPECIFICATION.md      # Formal scoring definitions
└── VALIDATION_GUIDE.md   # How to verify the benchmark works
```
PRs welcome. Interesting areas:
- Results from models not yet tested
- New scenario types (what breaks LLMs?)
- Better baseline controllers
- Human operator baselines (how do people actually do?)
- Analysis of specific failure modes
MIT. Do whatever you want with it.