Tutorial Verifiable Benchmark
Use case: "How does model X actually do on benchmark Y — per problem, not just the headline number?"
You will run the complete itemeval pipeline on AIME 2025 (competition math,
integer answers) with openai/gpt-5-mini. Because the answers are
integers, grading uses the built-in numeric scorer — pure Python, no judge
model — so the only cost is generation. This exact run has been validated live:
5 problems, 5/5 correct, about $0.014 of generation and $0.00 of grading.
You need: pip install itemeval[openai] and an OPENAI_API_KEY in your
environment. Time: ~10 minutes.
One YAML file describes the whole study. Save this as aime.yaml:
study: aime_quickstart
benchmark:
  adapter: hf                        # load from the HuggingFace Hub
  datasets:
    - id: MathArena/aime_2025        # dataset revision auto-pins at first run
      split: train
  mapping:                           # dataset columns -> itemeval's Item fields
    id: problem_idx
    input: problem
    target: answer                   # integer answers -> the numeric scorer
solvers:
  models: [openai/gpt-5-mini]
  max_tokens: 8192                   # room for hidden reasoning + the "ANSWER:" line
facets:
  prompt: [builtin:minimal]          # packaged template; ends with 'state your final
                                     # answer on a line starting with "ANSWER:"'
  scorer: numeric                    # verifiable scorer: extracts the last number, $0
  model_config: [{name: low, reasoning_effort: low}]  # keep it fast and cheap
budget:
  policy: dev                        # dev = only the first dev_items items
  dev_items: 5
Three choices worth noticing:
- mapping is the only thing you change to point at a different dataset: name the columns that hold the problem id, the problem text, and the reference answer.
- scorer: numeric means grading is free. The packaged builtin:minimal prompt instructs the model to end with an ANSWER: line, which the scorer parses.
- policy: dev caps the run at the first 5 items. This is the default posture for any new config: prove the pipeline first, scale later (Tutorial 5).
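To make the mapping block concrete, here is a minimal sketch of what a column mapping does: it renames dataset columns into item fields. The Item dataclass and the apply_mapping helper are hypothetical illustrations, not itemeval's actual classes, and the sample row is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """Hypothetical stand-in for an evaluation item (not itemeval's real class)."""
    id: str
    input: str
    target: str


def apply_mapping(row: dict, mapping: dict) -> Item:
    """Build an Item by looking up each field's source column in the row."""
    missing = [col for col in mapping.values() if col not in row]
    if missing:
        raise KeyError(f"dataset row lacks mapped columns: {missing}")
    return Item(**{field: str(row[col]) for field, col in mapping.items()})


# The mapping from aime.yaml: Item field -> dataset column.
mapping = {"id": "problem_idx", "input": "problem", "target": "answer"}

# A made-up row shaped like the dataset's columns (not a real AIME problem).
row = {"problem_idx": 1, "problem": "Compute 6 * 7.", "answer": 42}

item = apply_mapping(row, mapping)
print(item)
```

Pointing the same sketch at another dataset only means changing the three column names on the right-hand side of the mapping.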
itemeval estimate aime.yaml
This makes no model API calls. It renders the actual prompts, applies a
token heuristic and the pricing table, and prints projected calls, tokens, and
dollars per stage. Expect a projection of well under $0.10 for this config
(estimates are deliberately conservative — actuals usually come in lower).
It also prints which pricing table the numbers came from; before bigger runs,
add --refresh-pricing to pull current prices.
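The projection arithmetic can be sketched as follows. This is an illustration only: the 4-characters-per-token heuristic, the per-million-token prices, and the project_cost helper are assumptions made for the example, not itemeval's actual heuristic or pricing table. The sketch is conservative in one specific way: it assumes every call spends its full max_tokens budget.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic (assumed): about 4 characters per token, rounded up."""
    return -(-len(text) // 4)


def project_cost(prompts, max_output_tokens, usd_per_m_input, usd_per_m_output):
    """Project worst-case spend: each call is assumed to use its full output budget."""
    input_tokens = sum(estimate_tokens(p) for p in prompts)
    output_tokens = max_output_tokens * len(prompts)
    usd = (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000
    return {
        "calls": len(prompts),
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "usd": usd,
    }


# Five placeholder prompts and placeholder prices (USD per million tokens).
prompts = ["Solve the problem. End with a line starting with ANSWER:"] * 5
projection = project_cost(
    prompts,
    max_output_tokens=8192,
    usd_per_m_input=0.25,
    usd_per_m_output=2.0,
)
print(projection)
```

Because real completions rarely use the whole output budget, a projection built this way sits above the actual spend.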
itemeval generate aime.yaml
This expands the design grid (here a single condition: gpt-5-mini × builtin:minimal × low)
and runs one inspect_ai task over the 5
items, with live progress. Every solution is upserted into
studies/aime_quickstart/solutions.parquet with full provenance: prompt hash,
requested vs effective sampling params, tokens, dollars, and a pointer to the
raw .eval transcript.
If it's interrupted or a provider call fails, just run the same command again — completed items skip, failed ones retry, and nothing is double-paid.
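The skip-and-retry behavior can be pictured with a small sketch. It assumes each solution is keyed by a hash of its condition (item, model, prompt) and that only successful results are stored. The names solution_key, run_pending, and flaky_model, and the in-memory store, are hypothetical and do not describe itemeval's internals.

```python
import hashlib
import json


def solution_key(item_id: str, model: str, prompt: str) -> str:
    """Stable key for one (item, model, prompt) condition."""
    payload = json.dumps([item_id, model, prompt])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_pending(conditions, store, call_model):
    """Call the model only for conditions without a stored solution.

    Failures are not stored, so the next run retries them.
    Returns the number of model calls attempted.
    """
    attempted = 0
    for item_id, model, prompt in conditions:
        key = solution_key(item_id, model, prompt)
        if key in store:  # completed earlier: skip, no second charge
            continue
        attempted += 1
        try:
            store[key] = call_model(model, prompt)  # upsert on success
        except RuntimeError:
            pass  # leave the key absent so a re-run retries it
    return attempted


# Fake provider for the demo: fails once on the second prompt, then succeeds.
failed_once = set()


def flaky_model(model, prompt):
    if prompt == "p2" and prompt not in failed_once:
        failed_once.add(prompt)
        raise RuntimeError("simulated provider error")
    return f"ANSWER: {len(prompt)}"


conditions = [("1", "demo-model", "p1"), ("2", "demo-model", "p2")]
store = {}
first = run_pending(conditions, store, flaky_model)   # both items attempted
second = run_pending(conditions, store, flaky_model)  # only the failed item retries
third = run_pending(conditions, store, flaky_model)   # nothing left to do
print(first, second, third, len(store))
```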
itemeval grade aime.yaml
The numeric scorer parses each stored solution's ANSWER: line and compares
it to the target. No LLM, no cost, instant. Results go to
studies/aime_quickstart/gradings.parquet; any solution whose answer could not
be parsed is kept and flagged (parse_ok=false), never silently dropped.
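A minimal sketch of this kind of verifiable scorer follows. The regular expressions, the grade_numeric name, and the returned dictionary are assumptions for illustration, and itemeval's own extraction rules may differ. The sketch takes the last number on the last ANSWER: line and keeps unparseable solutions, flagging them with parse_ok set to False.

```python
import re

# Matches a line such as "ANSWER: 70" and captures the text after the colon.
ANSWER_LINE = re.compile(r"^[ \t]*ANSWER:[ \t]*(.*)$", re.MULTILINE)
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def grade_numeric(solution: str, target: str) -> dict:
    """Score 1.0 when the extracted number equals the target, else 0.0."""
    answer_lines = ANSWER_LINE.findall(solution)
    numbers = NUMBER.findall(answer_lines[-1].replace(",", "")) if answer_lines else []
    if not numbers:
        # Keep the row and flag it instead of dropping it.
        return {"score": 0.0, "score_raw": None, "parse_ok": False}
    raw = numbers[-1]
    return {
        "score": float(float(raw) == float(target)),
        "score_raw": raw,
        "parse_ok": True,
    }


correct = grade_numeric("Some reasoning...\nANSWER: 70", "70")
wrong = grade_numeric("ANSWER: 12", "70")
unparsed = grade_numeric("I think it is seventy.", "70")
print(correct, wrong, unparsed)
```

Comparing as numbers rather than strings means an answer written as 042 still matches a target of 42.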
itemeval export aime.yaml
This writes the analysis-ready table and prints the spend summary:
- studies/aime_quickstart/export/gradings_long.parquet (+ a CSV mirror): one row per grading event, 45 columns.
- studies/aime_quickstart/export/ledger.csv: the cost ledger.
Open it:
import pandas as pd
df = pd.read_parquet("studies/aime_quickstart/export/gradings_long.parquet")
df[["item_id", "model", "score", "score_raw", "solution", "gen_usd"]]
Each row tells you, for one problem: the score (1.0/0.0), the raw value the
scorer extracted (score_raw), the full solution text, token counts, and what
that call cost. The aggregate accuracy is df.score.mean() — but the point of
itemeval is that you have the rows, not just the mean.
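As one way to use the rows rather than the mean, the sketch below separates wrong answers from parse failures. It uses only the standard library on a few invented rows shaped like a CSV export. The column names item_id, model, score, and score_raw come from the export described above; that parse_ok is exported alongside them is an assumption, and the values are made up for illustration rather than taken from the validated run.

```python
import csv
import io

# Invented rows shaped like the export's CSV mirror (not real results).
csv_text = """item_id,model,score,score_raw,parse_ok
1,demo-model,1.0,70,true
2,demo-model,0.0,12,true
3,demo-model,0.0,,false
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Headline number: the mean of the per-row scores.
accuracy = sum(float(r["score"]) for r in rows) / len(rows)

# Per-problem view: which items were answered wrongly, and which never parsed.
wrong = [
    r["item_id"]
    for r in rows
    if float(r["score"]) == 0.0 and r["parse_ok"] == "true"
]
unparsed = [r["item_id"] for r in rows if r["parse_ok"] == "false"]

print(f"accuracy={accuracy:.2f} wrong={wrong} unparsed={unparsed}")
```

A wrong answer and an unparsed answer both score 0.0, so the split is only visible at the row level.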
itemeval status aime.yaml
status shows the completion matrix (5/5 done), error/parse-failure counts,
and spend per stage. Run it any time; like estimate, it never calls a model.
- The dataset revision was pinned at first run (dataset_locks.json), and a manifest was written per run with template hashes, model ids, effective sampling params, and package versions, so re-running the same config reproduces the same study (Pipeline Concepts).
- Everything is resumable: re-run any command and completed work skips.
- Want more items? Raise dev_items, or switch policy; see Tutorial 5 before scaling.
- Letter answers (A/B/C/D): scorer: multiple_choice. Your dataset's input column must already contain the choices.
- String answers: scorer: exact_match.
- Free-form answers that need judgment: you need an LLM judge; that's Tutorial 2.