A trajectory eval harness for AI agents, with a faithful Super Mario Bros. World 1-1 as the demo environment.
The interesting part is the eval harness, not the game. The game is just a controllable world with clear ground truth, so it makes a clean target for the thing that's actually hard: measuring whether an agent makes good decisions, not just whether it got a good outcome.
Agent evals are easiest to learn on a world you fully own. The environment gives you free, unambiguous ground truth (Mario's position, alive/dead, reached-flag, score), and a run is naturally a trajectory: a sequence of action choices given observations. That is exactly what trajectory evals grade. No PII, no flaky external calls, fully reproducible.
| Piece | File | Role |
|---|---|---|
| Environment | index.html | The game, single file. Exposes an agent API: getState() (observation) + step(action) (one deterministic frame). Same world is human-playable and agent-drivable. |
| Scenarios | eval/scenarios.js | Deterministic test cases. Each is a precise setup + a programmatic pass condition + a natural-language rubric. |
| Policies | eval/policies.js | The agents under test. A policy is { name, reset(), act(obs) -> action }. An LLM agent implements the same interface. |
| Runner | eval/harness.js | Runs a policy against the suite, records the trajectory + event timeline, computes summary metrics, scores pass/fail. |
| LLM judge | eval/judge.mjs | Scores trajectory decision quality against the rubric (binary checks miss the nuance: was a pass lucky? was a fail a safe-but-incomplete choice?). |
| Report | eval.html | Loads the game in an iframe and renders a per-policy, per-scenario pass/fail report. |
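
To make the policy interface and the agent API from the table concrete, here is a minimal sketch of one episode loop. Only `getState()`, `step(action)`, and the `{ name, reset(), act(obs) -> action }` shape come from the table; the stub environment, the observation fields (`x`, `alive`, `reachedFlag`), the action shape, and `runEpisode` are assumptions made for illustration, not the actual code in eval/harness.js.

```javascript
// Stub environment standing in for the game's agent API.
// The observation fields and the action shape are hypothetical.
function makeStubEnv() {
  let x = 0;
  return {
    getState: () => ({ x, alive: true, reachedFlag: x >= 5 }),
    step: (action) => { if (action.right) x += 1; }, // one deterministic frame
  };
}

// A policy following the { name, reset(), act(obs) -> action } interface.
const alwaysRight = {
  name: "always-right",
  reset() {},
  act(obs) { return { right: true, jump: false }; },
};

// Minimal episode loop: observe, act, step, and record the trajectory.
function runEpisode(env, policy, maxFrames) {
  policy.reset();
  const trajectory = [];
  for (let frame = 0; frame < maxFrames; frame++) {
    const obs = env.getState();
    if (!obs.alive || obs.reachedFlag) break; // episode is over
    const action = policy.act(obs);
    trajectory.push({ frame, obs, action });
    env.step(action);
  }
  return { policy: policy.name, trajectory, final: env.getState() };
}

const result = runEpisode(makeStubEnv(), alwaysRight, 100);
console.log(result.policy, result.trajectory.length, result.final.reachedFlag);
// prints: always-right 5 true
```

Because the loop records every (observation, action) pair rather than only the final state, the same result can feed both a programmatic pass check and a trajectory-level judge.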

    # serve the folder (any static server)
    python3 -m http.server 8770

- Play the game: http://localhost:8770/index.html (arrows to move, Z/Space jump, X/Shift run).
- Run the eval report: http://localhost:8770/eval.html (runs every policy against every scenario).
- Automated LLM judge: ANTHROPIC_API_KEY=... node eval/judge.mjs results.json
The point of an eval suite is to discriminate (separate better agents from worse) and localize failure, then drive the loop: measure → diagnose → improve → re-measure.
Three policies, same suite:
| policy | score |
|---|---|
| naive-runner (run right, jump at walls) | 2/13 |
| baseline-heuristic (timed jumps, gap/enemy logic) | 7/13 |
| stomper (baseline + a real stomp behavior) | 10/13 |
The stomper policy came from one loop of the harness: the LLM judge read the trajectories and traced two differently scored scenarios to a single root cause ("the policy jumps over enemies, it never lands on them"). Fixing that one behavior flipped three scenarios from fail to pass (7/13 to 10/13) with zero regressions.
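
A claim like "flipped three scenarios with zero regressions" can be checked mechanically by diffing two suite runs. The sketch below assumes each run is a plain object mapping a scenario id to a boolean pass flag; that result shape, the `diffRuns` helper, and the scenario ids are made up for illustration and are not the harness's real output format.

```javascript
// Compare two suite runs keyed by scenario id.
// Only scenarios present in the earlier run are considered.
function diffRuns(before, after) {
  const fixed = [];
  const regressed = [];
  for (const id of Object.keys(before)) {
    if (!before[id] && after[id]) fixed.push(id);     // fail -> pass
    if (before[id] && !after[id]) regressed.push(id); // pass -> fail
  }
  return { fixed, regressed };
}

// Hypothetical scenario ids and results, for illustration only.
const before = { "first-gap": true, "goomba-stomp": false, "pipe-jump": true };
const after  = { "first-gap": true, "goomba-stomp": true,  "pipe-jump": true };

console.log(diffRuns(before, after));
// prints: { fixed: [ 'goomba-stomp' ], regressed: [] }
```

A non-empty `regressed` list is the signal that a change was a targeted hack rather than a fix that generalizes.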
- Grade the trajectory, not just the outcome. Two agents can both survive; one stomped on purpose and one got lucky. Outcome-only evals reward luck.
- Improve the system, never weaken the test. Editing a test to make it pass is the cardinal sin. Fixes go in the policy (or the game capability), never in the scenario's pass criteria.
- Re-run the whole suite after every change. The suite is a regression ratchet, not a one-time exam. A real fix generalizes across scenarios; a hack only moves the one you targeted.
- Determinism is a precondition. A reset must fully restore world state, otherwise reruns silently disagree.
- Programmatic checks are the binary backbone; the LLM judge adds the nuance. You want both.
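
The determinism precondition in the list above can itself be tested: replay the same action sequence twice from a reset and require identical state traces. This is a sketch under assumptions. The world's `reset()` method, its state fields, and the `leaky` flag are hypothetical stand-ins used to show what a reset that misses part of the state looks like.

```javascript
// Hypothetical stub world. With leaky=true, reset() forgets to restore the enemy.
function makeWorld({ leaky }) {
  let x = 0;
  let enemyX = 10;
  return {
    reset() { x = 0; if (!leaky) enemyX = 10; },
    getState: () => ({ x, enemyX }),
    step(action) { if (action.right) x += 1; enemyX -= 1; },
  };
}

// Replay a fixed action sequence from a reset and serialize every state.
function trace(world, actions) {
  world.reset();
  const states = [];
  for (const action of actions) {
    world.step(action);
    states.push(JSON.stringify(world.getState()));
  }
  return states.join("\n");
}

// Reset is trustworthy only if two replays produce identical traces.
function isDeterministic(world, actions) {
  return trace(world, actions) === trace(world, actions);
}

const actions = Array.from({ length: 5 }, () => ({ right: true }));
console.log(isDeterministic(makeWorld({ leaky: false }), actions)); // true
console.log(isDeterministic(makeWorld({ leaky: true }), actions));  // false
```

Running a check like this before trusting any score catches the case where reruns silently disagree.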
See CLAUDE.md for the full developer notes (agent API reference, physics facts, harness conventions).
Working harness with the full loop demonstrated. Next, grounded in current eval practice (Hamel Husain, Shreya Shankar's "Who Validates the Validators?", Eugene Yan, Anthropic's agent-eval guide): validate the judge against hand-labeled ground truth, move it to binary per-failure-mode scoring, add partial credit and reference "gold" trajectories, plug in an LLM policy and report pass@k vs pass^k, and source new scenarios from clustered real-trace failures.
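
For the pass@k vs pass^k item, a small sketch of the two metrics under their usual definitions, which are an assumption here since this README does not define them: given n recorded trials of a scenario with c passes, pass@k estimates the chance that at least one of k sampled trials passes, and pass^k the chance that all k pass. The function names are illustrative.

```javascript
// Binomial coefficient C(n, k), computed iteratively; adequate for small n.
function choose(n, k) {
  if (k < 0 || k > n) return 0;
  let result = 1;
  for (let i = 1; i <= k; i++) result = (result * (n - k + i)) / i;
  return result;
}

// pass@k: probability that at least one of k trials drawn from n passes.
function passAtK(n, c, k) {
  return 1 - choose(n - c, k) / choose(n, k);
}

// pass^k: probability that all k trials drawn from n pass.
function passHatK(n, c, k) {
  return choose(c, k) / choose(n, k);
}

// Example: 4 trials, 2 passes, k = 2.
console.log(passAtK(4, 2, 2).toFixed(3));  // 0.833
console.log(passHatK(4, 2, 2).toFixed(3)); // 0.167
```

The gap between the two numbers is the point: an agent that passes half the time looks strong on pass@k and weak on pass^k, which is the reliability view.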
This is a from-scratch, fan-made recreation of World 1-1 for educational, non-commercial use. Super Mario Bros. and all related characters are © Nintendo. No original game code or assets are used. If you're from Nintendo and want this taken down, open an issue.