mattej5/mario-eval

mario-eval

A trajectory eval harness for AI agents, with a faithful Super Mario Bros. World 1-1 as the demo environment.

The interesting part is the eval harness, not the game. The game is just a controllable world with clear ground truth, so it makes a clean target for the thing that's actually hard: measuring whether an agent makes good decisions, not just whether it got a good outcome.

Why a game

Agent evals are easiest to learn on a world you fully own. The environment gives you free, unambiguous ground truth (Mario's position, alive/dead, reached-flag, score), and a run is naturally a trajectory: a sequence of action choices given observations. That is exactly what trajectory evals grade. No PII, no flaky external calls, fully reproducible.

What's here

| Piece | File | Role |
| --- | --- | --- |
| Environment | index.html | The game, single file. Exposes an agent API: getState() (observation) + step(action) (one deterministic frame). Same world is human-playable and agent-drivable. |
| Scenarios | eval/scenarios.js | Deterministic test cases. Each is a precise setup + a programmatic pass condition + a natural-language rubric. |
| Policies | eval/policies.js | The agents under test. A policy is { name, reset(), act(obs) -> action }. An LLM agent implements the same interface. |
| Runner | eval/harness.js | Runs a policy against the suite, records the trajectory + event timeline, computes summary metrics, scores pass/fail. |
| LLM judge | eval/judge.mjs | Scores trajectory decision quality against the rubric (binary checks miss the nuance: was a pass lucky? was a fail a safe-but-incomplete choice?). |
| Report | eval.html | Loads the game in an iframe and renders a per-policy, per-scenario pass/fail report. |
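
As a rough illustration of how these pieces fit together, the sketch below drives a policy through an environment that exposes getState() and step(action). The stub environment, the observation fields (x, alive, reachedFlag), the action shape ({ right, jump }), and the runEpisode helper are all assumptions made for illustration; the real agent API is documented in CLAUDE.md.

```javascript
// Hypothetical stand-in for the game: a 1-D world where moving right
// eventually reaches a flag. The real environment lives in index.html.
function makeStubEnv(flagX = 10) {
  let state;
  return {
    reset() { state = { x: 0, alive: true, reachedFlag: false, frame: 0 }; },
    getState() { return { ...state }; },   // observation (a copy)
    step(action) {                          // one deterministic frame
      if (!state.alive || state.reachedFlag) return;
      if (action.right) state.x += 1;
      state.frame += 1;
      if (state.x >= flagX) state.reachedFlag = true;
    },
  };
}

// A policy follows the { name, reset(), act(obs) -> action } shape.
const alwaysRight = {
  name: "always-right",
  reset() {},
  act(obs) { return { right: true, jump: false }; },
};

// Run one episode and record the trajectory as (obs, action) pairs.
function runEpisode(env, policy, maxFrames = 100) {
  env.reset();
  policy.reset();
  const trajectory = [];
  for (let i = 0; i < maxFrames; i++) {
    const obs = env.getState();
    if (!obs.alive || obs.reachedFlag) break;
    const action = policy.act(obs);
    trajectory.push({ obs, action });
    env.step(action);
  }
  return { policy: policy.name, trajectory, final: env.getState() };
}

const result = runEpisode(makeStubEnv(), alwaysRight);
console.log(result.policy, result.final.reachedFlag, result.trajectory.length);
```

Because the trajectory keeps every (obs, action) pair, a judge can later ask why the agent acted as it did, not only where it ended up.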

Quickstart

# serve the folder (any static server)
python3 -m http.server 8770
  • Play the game: http://localhost:8770/index.html (arrows to move, Z/Space jump, X/Shift run).
  • Run the eval report: http://localhost:8770/eval.html (runs every policy against every scenario).
  • Automated LLM judge: ANTHROPIC_API_KEY=... node eval/judge.mjs results.json

What the harness demonstrates

The point of an eval suite is to discriminate (separate better agents from worse) and localize failure, then drive the loop: measure → diagnose → improve → re-measure.

Three policies, same suite:

| Policy | Score |
| --- | --- |
| naive-runner (run right, jump at walls) | 2/13 |
| baseline-heuristic (timed jumps, gap/enemy logic) | 7/13 |
| stomper (baseline + a real stomp behavior) | 10/13 |

The stomper policy came from one pass through that loop: the LLM judge read the trajectories and traced two differently scored scenarios to a single root cause ("the policy jumps over enemies, it never lands on them"). Fixing that one behavior flipped three scenarios from fail to pass (7/13 to 10/13) with zero regressions.
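
One way to check a claim like "flipped three scenarios with zero regressions" is to diff per-scenario results between two runs. The sketch below assumes each run is a plain map from scenario id to a boolean pass flag. The scenario ids and the results shape are made up for illustration and are not the harness's actual output format.

```javascript
// Compare two suite runs. `before` and `after` map scenario id -> passed (boolean).
function diffRuns(before, after) {
  const fixed = [];      // failed before, passes now
  const regressed = [];  // passed before, fails now
  for (const id of Object.keys(before)) {
    if (!(id in after)) continue;  // only compare scenarios present in both runs
    if (!before[id] && after[id]) fixed.push(id);
    if (before[id] && !after[id]) regressed.push(id);
  }
  return { fixed, regressed };
}

// Hypothetical scenario ids, for illustration only.
const baselineRun = { "gap-1": true, "goomba-1": false, "goomba-pair": false, "pipe-1": true };
const stomperRun  = { "gap-1": true, "goomba-1": true,  "goomba-pair": true,  "pipe-1": true };

const diff = diffRuns(baselineRun, stomperRun);
console.log("fixed:", diff.fixed, "regressed:", diff.regressed);
```

A non-empty regressed list is the signal that a change was a targeted hack and not a general fix.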

Principles baked into this repo

  • Grade the trajectory, not just the outcome. Two agents can both survive; one stomped on purpose and one got lucky. Outcome-only evals reward luck.
  • Improve the system, never weaken the test. Editing a test to make it pass is the cardinal sin. Fixes go in the policy (or the game capability), never in the scenario's pass criteria.
  • Re-run the whole suite after every change. The suite is a regression ratchet, not a one-time exam. A real fix generalizes across scenarios; a hack only moves the one you targeted.
  • Determinism is a precondition. A reset must fully restore world state, otherwise reruns silently disagree.
  • Programmatic checks are the binary backbone; the LLM judge adds the nuance. You want both.
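
The determinism principle can be checked mechanically: replay the same action sequence twice from a reset and confirm the observed states match frame for frame. The sketch below assumes an environment object with reset(), getState(), and step(action). The reset() name and the stub world are assumptions for illustration, not the game's actual API.

```javascript
// Replay a fixed action sequence and record the state after every frame.
function replay(env, actions) {
  env.reset();
  const states = [];
  for (const action of actions) {
    env.step(action);
    states.push(JSON.stringify(env.getState()));
  }
  return states;
}

// Returns the first frame index where two replays disagree, or -1 if none.
function firstDivergence(env, actions) {
  const a = replay(env, actions);
  const b = replay(env, actions);
  return a.findIndex((s, i) => s !== b[i]);
}

// Stub world (hypothetical). `leaky: true` simulates a reset that forgets
// to restore one field, which is the bug this check is meant to catch.
function makeLeakTestEnv({ leaky }) {
  let x = 0;
  let enemyX = 5;
  return {
    reset() { x = 0; if (!leaky) enemyX = 5; },
    getState() { return { x, enemyX }; },
    step(action) { if (action.right) x += 1; enemyX -= 1; },
  };
}

const replayActions = Array.from({ length: 4 }, () => ({ right: true }));
console.log(firstDivergence(makeLeakTestEnv({ leaky: false }), replayActions)); // -1
console.log(firstDivergence(makeLeakTestEnv({ leaky: true }), replayActions));  // 0
```

Running a check like this before trusting any score keeps a leaky reset from showing up later as an unexplained flaky scenario.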

See CLAUDE.md for the full developer notes (agent API reference, physics facts, harness conventions).

Status / roadmap

Working harness with the full loop demonstrated. Next, grounded in current eval practice (Hamel Husain, Shreya Shankar's "Who Validates the Validators?", Eugene Yan, Anthropic's agent-eval guide): validate the judge against hand-labeled ground truth, move it to binary per-failure-mode scoring, add partial credit and reference "gold" trajectories, plug in an LLM policy and report pass@k vs pass^k, and source new scenarios from clustered real-trace failures.
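
For the planned pass@k vs pass^k report, here is a minimal sketch of one common way to estimate both from n recorded attempts of which c passed: pass@k as the chance that at least one of k sampled attempts passes, and pass^k as the chance that all k pass. The function names and the choice of combinatorial estimators are assumptions about how this could be computed, not something the repo currently implements.

```javascript
// Binomial coefficient C(n, k), computed multiplicatively; fine for small n.
function choose(n, k) {
  if (k < 0 || k > n) return 0;
  let result = 1;
  for (let i = 1; i <= k; i++) {
    result = (result * (n - k + i)) / i;
  }
  return result;
}

// Chance that at least one of k attempts (drawn without replacement
// from n recorded attempts, c of which passed) is a pass.
function passAtK(n, c, k) {
  return 1 - choose(n - c, k) / choose(n, k);
}

// Chance that all k drawn attempts pass.
function passHatK(n, c, k) {
  return choose(c, k) / choose(n, k);
}

// Example: 10 attempts on one scenario, 6 passed.
console.log(passAtK(10, 6, 3).toFixed(3));  // 0.967
console.log(passHatK(10, 6, 3).toFixed(3)); // 0.167
```

The gap between the two numbers is the point: an agent that passes most of the time can still be far from passing every time, which matters for a stochastic LLM policy.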

A note on the Mario assets

This is a from-scratch, fan-made recreation of World 1-1 for educational, non-commercial use. Super Mario Bros. and all related characters are © Nintendo. No original game code or assets are used. If you're from Nintendo and want this taken down, open an issue.
