mario-eval

A trajectory eval harness for AI agents, with a faithful Super Mario Bros. World 1-1 as the demo environment.

The interesting part is the eval harness, not the game. The game is just a controllable world with clear ground truth, so it makes a clean target for the thing that's actually hard: measuring whether an agent makes good decisions, not just whether it got a good outcome.

Why a game

Agent evals are easiest to learn on a world you fully own. The environment gives you free, unambiguous ground truth (Mario's position, alive/dead, reached-flag, score), and a run is naturally a trajectory: a sequence of action choices given observations. That is exactly what trajectory evals grade. No PII, no flaky external calls, fully reproducible.

What's here

| Piece | File | Role |
| --- | --- | --- |
| Environment | `index.html` | The game, single file. Exposes an agent API: `getState()` (observation) + `step(action)` (one deterministic frame). Same world is human-playable and agent-drivable. |
| Scenarios | `eval/scenarios.js` | Deterministic test cases. Each is a precise setup + a programmatic pass condition + a natural-language rubric. |
| Policies | `eval/policies.js` | The agents under test. A policy is `{ name, reset(), act(obs) -> action }`. An LLM agent implements the same interface. |
| Runner | `eval/harness.js` | Runs a policy against the suite, records the trajectory + event timeline, computes summary metrics, scores pass/fail. |
| LLM judge | `eval/judge.mjs` | Scores trajectory decision quality against the rubric (binary checks miss the nuance: was a pass lucky? was a fail a safe-but-incomplete choice?). |
| Report | `eval.html` | Loads the game in an iframe and renders a per-policy, per-scenario pass/fail report. |
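
To make the pieces in the table concrete, here is a minimal sketch of how a policy, a scenario, and the runner loop might fit together. Only `getState()`, `step(action)`, and the `{ name, reset(), act(obs) }` policy shape come from this README. The toy one-dimensional world, the observation fields (`x`, `alive`), the action strings, the scenario fields, and `runScenario` are hypothetical stand-ins, not the repo's actual code.

```javascript
// Toy stand-in for the game: a 1-D world with a pit at x = 3.
// Only getState()/step(action) mirror the README; everything else is assumed.
function makeToyWorld() {
  let state;
  return {
    reset() { state = { x: 0, alive: true }; },
    getState() { return { ...state }; },        // observation (a copy)
    step(action) {                               // one deterministic frame
      if (!state.alive) return;
      if (action === "right") state.x += 1;
      if (action === "jump") state.x += 2;       // a jump clears one tile
      if (state.x === 3) state.alive = false;    // landed in the pit
    },
  };
}

// Policies share the interface { name, reset(), act(obs) -> action }.
const naiveRunner = { name: "naive-runner", reset() {}, act: () => "right" };
const pitJumper = {
  name: "pit-jumper",
  reset() {},
  act: (obs) => (obs.x === 2 ? "jump" : "right"), // jump from the pit's edge
};

// Hypothetical scenario: setup + programmatic pass condition + rubric.
const crossPit = {
  id: "cross-pit",
  maxFrames: 10,
  pass: (obs) => obs.alive && obs.x >= 5,
  rubric: "Jumps the pit deliberately instead of walking into it.",
};

// Runner: drive the policy frame by frame and record the trajectory.
function runScenario(world, policy, scenario) {
  world.reset();
  policy.reset();
  const trajectory = [];
  for (let frame = 0; frame < scenario.maxFrames; frame++) {
    const obs = world.getState();
    if (!obs.alive || scenario.pass(obs)) break;
    const action = policy.act(obs);
    trajectory.push({ frame, obs, action });
    world.step(action);
  }
  return {
    policy: policy.name,
    scenario: scenario.id,
    passed: scenario.pass(world.getState()),
    trajectory,
  };
}

const results = [naiveRunner, pitJumper].map((p) =>
  runScenario(makeToyWorld(), p, crossPit));
for (const r of results) {
  console.log(`${r.policy}: ${r.passed ? "PASS" : "FAIL"} in ${r.trajectory.length} steps`);
}
```

The recorded `trajectory` is what a judge would read alongside the rubric; the `passed` flag is the programmatic check on its own.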

Quickstart

# serve the folder (any static server)
python3 -m http.server 8770

Play the game: http://localhost:8770/index.html (arrow keys to move, Z/Space to jump, X/Shift to run).
Run the eval report: http://localhost:8770/eval.html (runs every policy against every scenario).
Run the automated LLM judge: `ANTHROPIC_API_KEY=... node eval/judge.mjs results.json`

What the harness demonstrates

The point of an eval suite is to discriminate (separate better agents from worse) and localize failure, then drive the loop: measure → diagnose → improve → re-measure.

Three policies, same suite:

| Policy | Score |
| --- | --- |
| naive-runner (run right, jump at walls) | 2/13 |
| baseline-heuristic (timed jumps, gap/enemy logic) | 7/13 |
| stomper (baseline + a real stomp behavior) | 10/13 |

The stomper policy came from one pass through the harness loop: the LLM judge read the trajectories and traced two differently scored scenarios back to a single root cause ("the policy jumps over enemies, it never lands on them"). Fixing that one behavior flipped three scenarios from fail to pass with zero regressions.
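
A before/after comparison like this one can be checked mechanically. The sketch below assumes each suite run is summarized as a map from scenario id to a pass boolean; that shape, the helper names `diffRuns` and `score`, and the four-scenario data are illustrative only and are not the repo's real results format or scenario names.

```javascript
// Compare two suite runs. Each run maps scenario id -> passed (boolean).
// This result shape is an assumption made for the example.
function diffRuns(before, after) {
  const fixed = [];
  const regressed = [];
  for (const id of Object.keys(before)) {
    if (!before[id] && after[id]) fixed.push(id);      // fail -> pass
    if (before[id] && !after[id]) regressed.push(id);  // pass -> fail
  }
  return { fixed, regressed };
}

// Summarize a run as "passed/total".
function score(run) {
  const passed = Object.values(run).filter(Boolean).length;
  return `${passed}/${Object.keys(run).length}`;
}

// Made-up runs for two policy versions (not the real 13-scenario suite).
const beforeRun = { s1: true, s2: false, s3: false, s4: true };
const afterRun = { s1: true, s2: true, s3: true, s4: true };

const delta = diffRuns(beforeRun, afterRun);
console.log(score(beforeRun), "->", score(afterRun));
console.log("fixed:", delta.fixed, "regressed:", delta.regressed);
```

A change counts as a clean improvement only when `regressed` is empty; a higher total score can still hide a scenario that flipped the wrong way.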

Principles baked into this repo

Grade the trajectory, not just the outcome. Two agents can both survive; one stomped on purpose and one got lucky. Outcome-only evals reward luck.
Improve the system, never weaken the test. Editing a test to make it pass is the cardinal sin. Fixes go in the policy (or the game capability), never in the scenario's pass criteria.
Re-run the whole suite after every change. The suite is a regression ratchet, not a one-time exam. A real fix generalizes across scenarios; a hack only moves the one you targeted.
Determinism is a precondition. A reset must fully restore world state, otherwise reruns silently disagree.
Programmatic checks are the binary backbone; the LLM judge adds the nuance. You want both.

See CLAUDE.md for the full developer notes (agent API reference, physics facts, harness conventions).

Status / roadmap

Working harness with the full loop demonstrated. Next, grounded in current eval practice (Hamel Husain, Shreya Shankar's "Who Validates the Validators?", Eugene Yan, Anthropic's agent-eval guide): validate the judge against hand-labeled ground truth, move it to binary per-failure-mode scoring, add partial credit and reference "gold" trajectories, plug in an LLM policy and report pass@k vs pass^k, and source new scenarios from clustered real-trace failures.
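
For the pass@k vs pass^k item, a small sketch of the distinction may help. It assumes the commonly used definitions: pass@k is the chance that at least one of k sampled attempts passes, and pass^k is the chance that all k pass, each estimated from n recorded attempts of which c passed. Nothing here is implemented in the repo yet; the function names and the numbers in the usage lines are hypothetical.

```javascript
// Binomial coefficient C(n, k), computed iteratively to stay in small numbers.
function choose(n, k) {
  if (k < 0 || k > n) return 0;
  let result = 1;
  for (let i = 1; i <= k; i++) result = (result * (n - k + i)) / i;
  return result;
}

// pass@k: probability that at least one of k attempts (drawn without
// replacement from n recorded attempts, c of which passed) is a pass.
function passAtK(n, c, k) {
  return 1 - choose(n - c, k) / choose(n, k);
}

// pass^k: probability that all k drawn attempts are passes.
function passHatK(n, c, k) {
  return choose(c, k) / choose(n, k);
}

// Hypothetical numbers: 10 attempts at one scenario, 5 of them passed.
console.log(passAtK(10, 5, 2).toFixed(3));  // capability: can it ever pass?
console.log(passHatK(10, 5, 2).toFixed(3)); // reliability: does it always pass?
```

The gap between the two numbers is the point of reporting both: a stochastic LLM policy can look strong on pass@k while being unreliable run to run, which only pass^k exposes.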

A note on the Mario assets

This is a from-scratch, fan-made recreation of World 1-1 for educational, non-commercial use. Super Mario Bros. and all related characters are © Nintendo. No original game code or assets are used. If you're from Nintendo and want this taken down, open an issue.
