A real-time continuous control benchmark for AI agents. Tests whether your LLM can keep a simulated reactor from melting down while tracking variable power demand.
Most AI benchmarks test question answering or task completion. ReactorBench is different. It's a simulated nuclear reactor running at 10Hz. The physics keeps ticking whether the AI responds or not.
Think plate spinning, not firefighting. The reactor drifts constantly. Sensors lie. Actuators get stuck. Power demands keep changing. If you take too long to think, the fuel temperature is already climbing by the time you respond.
A good agent anticipates and makes small corrections. A bad one panics and SCRAMs.
The goal: Maintain reactor stability for 300 seconds while tracking a variable power target. Harder than it sounds.
This benchmark tests something most benchmarks miss: performance under time pressure.
We usually test model speed by measuring latency or time-to-first-token. But that's a naive view of the speed-quality tradeoff. What matters isn't just how fast the model responds in isolation, but how well it performs when the task itself is time-sensitive.
A slower model doesn't just take longer to answer. It gets worse answers because the problem has drifted while it was thinking. At 10Hz, every second of inference is 10 ticks where the reactor moves without correction.
This shows up in real-world applications: autonomous vehicles, robots, industrial control, trading agents making sub-second decisions. In all these domains, thinking for 2 seconds means the world has already changed by the time you act.
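To make that concrete, here's a minimal sketch of a fixed-rate control loop that keeps ticking whether or not a new command has arrived. The `sim` and `agent` objects and their methods are hypothetical, not the benchmark's actual loop:

```python
import time

TICK_RATE_HZ = 10
DT = 1.0 / TICK_RATE_HZ

def control_loop(sim, agent, duration_s=300.0):
    """Fixed-rate loop: the physics advances every tick, with or without the agent."""
    command = {}                        # start with a neutral / no-op command
    next_tick = time.monotonic()
    while sim.elapsed() < duration_s:
        if agent.has_new_command():     # non-blocking check; slow agents simply miss ticks
            command = agent.latest_command()
        sim.step(command, DT)           # the reactor drifts regardless of what the agent did
        next_tick += DT
        time.sleep(max(0.0, next_tick - time.monotonic()))
```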
Baseline scores (averaged over 3 runs):
| Controller | Score | Std | Notes |
|---|---|---|---|
| Simple Rules | 76.2 | 0.2 | Manual-following heuristics |
| PID | 65.7 | 0.4 | Classical control theory |
| Enhanced Rules | 64.9 | 1.2 | Rules + scenario awareness |
| No-Op | 38.1 | 3.0 | Does literally nothing |
| Random | 14.6 | 1.6 | Random button mashing |
Best LLM tested: Kimi K2 (via Groq) at 71.0, just 5.2 points below Simple Rules.
Most LLMs score between 30 and 60. Fast models (GPT-4.1-mini, Kimi K2 on Groq) do better than slower, larger models. Claude Opus scores 35.9, worse than doing nothing. Inference speed matters more than model intelligence for real-time control.
```bash
cd backend
pip install -r requirements.txt
python -m uvicorn server:app --reload --host 0.0.0.0 --port 8000
```

```bash
cd frontend
npm install
npm run dev
```

Open http://localhost:3000 to watch the reactor in real time.
```bash
pip install openai anthropic websockets python-dotenv scipy matplotlib google-genai

# Set your API key
export ANTHROPIC_API_KEY=your-key-here
# or: export OPENAI_API_KEY=your-key-here
# or: export GOOGLE_API_KEY=your-key-here
```
```bash
# Single run
python run_benchmark.py --model claude-sonnet-4-5 --duration 300

# Multi-seed (recommended for real benchmarking)
python run_benchmark.py --model gpt-4o --runs 5

# Run baselines for comparison
python run_benchmark.py --baselines --runs 5
```

| Mode | What It Does |
|---|---|
| `standard` | Full challenge mix, 300s. Use this for official benchmarking. |
| `endless` | Escalating difficulty until the reactor fails. Stress testing. |
| `pure_control` | No scenarios, minimal noise. Tests raw control ability. |
| `pure_diagnosis` | Sensor faults only. Tests epistemic reasoning. |
```bash
python run_benchmark.py --mode standard   # default
python run_benchmark.py --mode endless    # how long can you survive?
```

The score is 0-100, computed from:
Score = 40% Power Tracking + 30% Temp Stability + 20% Control Smoothness + 10% Survival - SCRAM Penalty
- Power Tracking: Stay within ±5% of the target power. The target changes continuously.
- Temp Stability: Keep fuel temp in the 680-720K optimal band (not just the 600-800K safe range).
- Control Smoothness: Don't thrash the controls. Small corrections beat big ones.
- SCRAM Penalty: Context-aware. Emergency SCRAM at 1000K? Fine, no penalty. Panic SCRAM at 700K? Minus 30 points.
Token usage and latency are tracked but don't affect the score. They're reported separately so you can compare efficiency.
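For a rough picture of how those weights combine, here's a hedged sketch of the composite score. The component inputs and the SCRAM rule are simplified stand-ins; the authoritative definitions live in `backend/evaluator.py` and SPECIFICATION.md:

```python
from typing import Optional

def composite_score(power_tracking: float,   # 0-1: fraction of ticks within ±5% of target power
                    temp_stability: float,   # 0-1: fraction of ticks in the 680-720K band
                    smoothness: float,       # 0-1: higher means smaller, gentler control moves
                    survival: float,         # 0-1: fraction of the 300s run survived
                    scram_fuel_temp: Optional[float] = None) -> float:
    score = 100.0 * (0.40 * power_tracking
                     + 0.30 * temp_stability
                     + 0.20 * smoothness
                     + 0.10 * survival)
    # Context-aware SCRAM penalty (simplified): a shutdown near real danger costs nothing,
    # a panic shutdown at normal temperatures costs up to 30 points.
    if scram_fuel_temp is not None and scram_fuel_temp < 1000.0:
        score -= 30.0
    return max(0.0, score)
```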
```
┌─────────────────┐       WebSocket        ┌──────────────────┐
│    LLM Agent    │ ◄────────────────────► │  Physics Engine  │
│ (run_benchmark) │     JSON commands      │    (FastAPI)     │
└─────────────────┘                        └──────────────────┘
                                                     │
                                                     │ 10Hz tick
                                                     ▼
                                            ┌──────────────────┐
                                            │ React Dashboard  │
                                            │    (optional)    │
                                            └──────────────────┘
```
Key files:
- `backend/physics.py`: Point reactor kinetics + thermal dynamics
- `backend/scenarios.py`: The chaos monkey that breaks things
- `backend/evaluator.py`: Scoring logic
- `run_benchmark.py`: LLM agent harness with structured outputs
- `reactor_manual.md`: What the agent reads to understand the reactor
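For a feel of the agent side, here's a hedged sketch of a WebSocket control loop. The endpoint path and the observation/command schemas are assumptions for illustration; `run_benchmark.py` defines the real protocol:

```python
import asyncio
import json
import websockets

async def run_agent(decide, uri="ws://localhost:8000/ws"):  # endpoint path is a guess
    async with websockets.connect(uri) as ws:
        async for message in ws:              # one observation per tick
            state = json.loads(message)       # sensor readings, power target, alarms, ...
            command = decide(state)           # e.g. {"rod_position": 0.45, "coolant_flow": 0.8}
            await ws.send(json.dumps(command))

# A do-nothing policy, roughly the No-Op baseline:
# asyncio.run(run_agent(lambda state: {}))
```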
Results go in results/ with auto-generated filenames:
```
results/
├── benchmark_claude-haiku-4-5_300s_20251214_181501.json
├── aggregate_claude-haiku-4-5_300s_5runs_20251214_195130.json
└── plots/
    └── benchmark_claude-haiku-4-5_300s_20251214_181501.png
```
Multi-run benchmarks give you mean, std, and 95% CI:
```json
{
  "score": {
    "mean": 41.9,
    "std": 14.9,
    "ci_95": [23.4, 60.4]
  }
}
```
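If you want to reproduce that aggregation, a t-based 95% confidence interval over per-run scores looks roughly like this (the benchmark's own aggregation may differ in detail):

```python
import numpy as np
from scipy import stats

def summarize(run_scores):
    scores = np.asarray(run_scores, dtype=float)
    mean, std = scores.mean(), scores.std(ddof=1)
    # t-distribution half-width for a 95% CI on the mean
    half_width = stats.t.ppf(0.975, df=len(scores) - 1) * std / np.sqrt(len(scores))
    return {"mean": mean, "std": std, "ci_95": [mean - half_width, mean + half_width]}

# e.g. summarize(scores_from_five_runs)
```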
- Time pressure: The simulation doesn't wait. If your LLM takes 2 seconds to respond, the reactor has drifted 20 ticks.
- Coupled dynamics: Increasing coolant flow drops temperature, which increases reactivity, which increases power, which increases temperature again. Everything affects everything else, with delays. (A toy sketch of this loop follows the list.)
- Sensor uncertainty: Sensors have noise. They drift. They get stuck. Sometimes they lie. Cross-reference or get burned.
- Variable target: The power demand changes continuously. You can't just stabilize and coast.
- Overlapping failures: Late game throws multiple scenarios at once. Rod stuck AND sensor drifting AND pump degraded.
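The coupled-dynamics loop can be pictured with a toy feedback model. This is not the model in `backend/physics.py`, and the constants are made up purely to show how the variables chain together:

```python
# Toy coupling sketch (illustrative constants, arbitrary units; not the real physics).
ALPHA_FUEL = -0.002   # reactivity lost per K of fuel heating (negative feedback)
GAIN = 0.5            # power response to net reactivity
T_REF, T_COOLANT = 700.0, 560.0
HEAT_CAP, HEAT_PER_PCT, COOLING = 20.0, 0.5, 0.4

def step(power_pct, fuel_temp, rod_reactivity, coolant_flow, dt=0.1):
    # More coolant flow cools the fuel; cooler fuel adds reactivity;
    # more reactivity raises power; more power reheats the fuel.
    rho = rod_reactivity + ALPHA_FUEL * (fuel_temp - T_REF)
    power_pct += GAIN * rho * power_pct * dt
    fuel_temp += (HEAT_PER_PCT * power_pct
                  - COOLING * coolant_flow * (fuel_temp - T_COOLANT)) * dt / HEAT_CAP
    return power_pct, fuel_temp
```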
This is a toy benchmark, not real reactor physics:
- Xenon dynamics are compressed (real xenon takes 6-8 hours, not minutes)
- Simplified sensors (real instrumentation has correlated failures)
- Single agent (real reactors have crews and procedures)
- 5-minute runs (real operators work 8+ hour shifts)
The point is testing whether LLMs can do continuous control under time pressure, not simulating actual reactor operations.
```
├── backend/              # FastAPI server + physics simulation
├── frontend/             # React dashboard for watching runs
├── baselines/            # Reference controllers (PID, random, rules)
├── results/              # Benchmark outputs
├── tests/                # Solvability tests
├── run_benchmark.py      # Main LLM agent harness
├── reactor_manual.md     # Operator documentation (fed to the LLM)
├── SPECIFICATION.md      # Formal scoring definitions
└── VALIDATION_GUIDE.md   # How to verify the benchmark works
```
PRs welcome. Interesting areas:
- Results from models not yet tested
- New scenario types (what breaks LLMs?)
- Better baseline controllers
- Human operator baselines (how do people actually do?)
- Analysis of specific failure modes
MIT. Do whatever you want with it.