Tutorial Verifiable Benchmark

Tutorial 1 — Score a verifiable benchmark for ~2 cents

Use case: "How does model X actually do on benchmark Y — per problem, not just the headline number?"

You will run the complete itemeval pipeline on AIME 2025 (competition math, integer answers) with openai/gpt-5-mini. Because the answers are integers, grading uses the built-in numeric scorer — pure Python, no judge model — so the only cost is generation. This exact run has been validated live: 5 problems, 5/5 correct, about $0.014 of generation and $0.00 of grading.

You need: pip install "itemeval[openai]" (quoted so that shells such as zsh do not try to expand the brackets) and an OPENAI_API_KEY in your environment. Time: ~10 minutes.

Step 1 — Write the config

One YAML file describes the whole study. Save this as aime.yaml:

study: aime_quickstart
benchmark:
  adapter: hf                    # load from the HuggingFace Hub
  datasets:
    - id: MathArena/aime_2025    # dataset revision auto-pins at first run
      split: train
  mapping:                       # dataset columns -> itemeval's Item fields
    id: problem_idx
    input: problem
    target: answer               # integer answers -> the numeric scorer
solvers:
  models: [openai/gpt-5-mini]
  max_tokens: 8192               # room for hidden reasoning + the "ANSWER:" line
facets:
  prompt: [builtin:minimal]      # packaged template; ends with 'state your final
                                 # answer on a line starting with "ANSWER:"'
  scorer: numeric                # verifiable scorer: extracts the last number, $0
  model_config: [{name: low, reasoning_effort: low}]   # keep it fast and cheap
budget:
  policy: dev                    # dev = only the first dev_items items
  dev_items: 5

Three choices worth noticing:

mapping is the only thing you change to point at a different dataset: name the columns that hold the problem id, the problem text, and the reference answer.
scorer: numeric means grading is free. The packaged builtin:minimal prompt instructs the model to end with an ANSWER: line, which the scorer parses.
policy: dev caps the run at the first 5 items. This is the default posture for any new config — prove the pipeline first, scale later (Tutorial 5).
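
To make the mapping idea concrete, here is a minimal sketch of how a column mapping could turn one dataset row into an item. The Item dataclass and apply_mapping helper are hypothetical stand-ins written for this illustration, not itemeval's actual classes, and the row is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """Hypothetical stand-in for itemeval's Item; the real class may differ."""
    id: str
    input: str
    target: str


def apply_mapping(row: dict, mapping: dict) -> Item:
    """Look up each Item field's source column in the row and build an Item."""
    return Item(**{field: str(row[column]) for field, column in mapping.items()})


# Same field -> column pairs as the mapping block in aime.yaml.
mapping = {"id": "problem_idx", "input": "problem", "target": "answer"}

# A made-up row shaped like the dataset's columns (not a real AIME problem).
row = {"problem_idx": 1, "problem": "Compute 6 * 7.", "answer": 42}

item = apply_mapping(row, mapping)
print(item)
```

Pointing at a different dataset would, under this sketch, only change the right-hand column names in mapping.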

Step 2 — Estimate before you spend

itemeval estimate aime.yaml

This makes no model API calls. It renders the actual prompts, applies a token heuristic and the pricing table, and prints projected calls, tokens, and dollars per stage. Expect a projection of well under $0.10 for this config (estimates are deliberately conservative — actuals usually come in lower). It also prints which pricing table the numbers came from; before bigger runs, add --refresh-pricing to pull current prices.
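
The arithmetic behind such a projection can be sketched as follows. This is an illustration only: the 4-characters-per-token heuristic, the placeholder prices, and the assumption that every call uses its full max_tokens are all choices made for this example, not itemeval's documented method.

```python
# Placeholder prices in USD per million tokens; not taken from itemeval's pricing table.
USD_PER_M_INPUT = 0.25
USD_PER_M_OUTPUT = 2.00


def estimate_tokens(text: str) -> int:
    """Crude heuristic (an assumption): roughly 4 characters per token, rounded up."""
    return -(-len(text) // 4)


def project_cost(prompts: list[str], max_tokens: int) -> dict:
    """Project calls, tokens, and dollars, assuming each call uses its full output cap."""
    input_tokens = sum(estimate_tokens(p) for p in prompts)
    output_tokens = len(prompts) * max_tokens
    usd = (input_tokens * USD_PER_M_INPUT + output_tokens * USD_PER_M_OUTPUT) / 1_000_000
    return {
        "calls": len(prompts),
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "usd": usd,
    }


prompts = ["x" * 400] * 5  # five made-up prompts of 400 characters each
projection = project_cost(prompts, max_tokens=8192)
print(projection)
```

Budgeting the full output cap for every call is one way a projection ends up conservative: real responses rarely use all 8192 tokens, so the actual spend lands below the estimate.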

Step 3 — Generate solutions

itemeval generate aime.yaml

This expands the design grid — here a single condition, (gpt-5-mini × builtin:minimal × low) — and runs one inspect_ai task over the 5 items, with live progress. Every solution is upserted into studies/aime_quickstart/solutions.parquet with full provenance: prompt hash, requested vs effective sampling params, tokens, dollars, and a pointer to the raw .eval transcript.
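
Expanding a design grid amounts to taking the Cartesian product of the facet lists. A minimal sketch, assuming a hypothetical expand_grid helper (itemeval's internal representation of a condition may look different):

```python
from itertools import product


def expand_grid(models: list, prompts: list, model_configs: list) -> list:
    """Return one condition per (model, prompt, model_config) combination."""
    return [
        {"model": m, "prompt": p, "model_config": c["name"]}
        for m, p, c in product(models, prompts, model_configs)
    ]


# The facet values from aime.yaml: one of each, so the grid has a single cell.
conditions = expand_grid(
    models=["openai/gpt-5-mini"],
    prompts=["builtin:minimal"],
    model_configs=[{"name": "low", "reasoning_effort": "low"}],
)
print(len(conditions), conditions[0])
```

Adding a second entry to any facet list multiplies the number of conditions, and with it the number of generation calls.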

If it's interrupted or a provider call fails, just run the same command again — completed items skip, failed ones retry, and nothing is double-paid.
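
The skip-completed, retry-failed behaviour can be sketched with a store keyed by item id. Everything here is illustrative: run_resumable, the in-memory dict, and the flaky solver are hypothetical, whereas the real tool persists solutions to a parquet file.

```python
def run_resumable(item_ids: list, store: dict, solve) -> int:
    """Solve only items without a successful record; return the number of calls made."""
    calls = 0
    for item_id in item_ids:
        if store.get(item_id, {}).get("status") == "ok":
            continue  # already completed: skip, so nothing is paid for twice
        calls += 1
        try:
            store[item_id] = {"status": "ok", "solution": solve(item_id)}
        except Exception as exc:
            # Keep the failure on record so the next run retries this item.
            store[item_id] = {"status": "error", "error": str(exc)}
    return calls


failed_once = set()


def flaky_solve(item_id: str) -> str:
    """Simulated provider: item "3" fails on its first attempt only."""
    if item_id == "3" and item_id not in failed_once:
        failed_once.add(item_id)
        raise RuntimeError("simulated provider error")
    return f"ANSWER: {item_id}"


store = {}
first_run_calls = run_resumable(["1", "2", "3"], store, flaky_solve)
second_run_calls = run_resumable(["1", "2", "3"], store, flaky_solve)
print(first_run_calls, second_run_calls)
```

In this sketch the first run makes three calls and the second makes one, because only the failed item is retried.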

Step 4 — Grade

itemeval grade aime.yaml

The numeric scorer parses each stored solution's ANSWER: line and compares it to the target. No LLM, no cost, instant. Results go to studies/aime_quickstart/gradings.parquet; any solution whose answer could not be parsed is kept and flagged (parse_ok=false), never silently dropped.
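
A rough sketch of what a scorer of this kind does, assuming it takes the last number on the last ANSWER: line. The score_numeric function and its exact parsing rules are illustrative guesses, not itemeval's implementation; only the score, score_raw, and parse_ok names come from this tutorial.

```python
import re

# Matches integers and simple decimals, with an optional leading minus sign.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def score_numeric(solution: str, target: str) -> dict:
    """Compare the last number on the last 'ANSWER:' line with the target."""
    answer_lines = [
        line for line in solution.splitlines() if line.strip().startswith("ANSWER:")
    ]
    # Drop thousands separators before searching, so "1,234" parses as 1234.
    numbers = NUMBER.findall(answer_lines[-1].replace(",", "")) if answer_lines else []
    if not numbers:
        # Unparseable solutions are kept and flagged rather than dropped.
        return {"score": 0.0, "score_raw": None, "parse_ok": False}
    raw = numbers[-1]
    correct = float(raw) == float(target)
    return {"score": 1.0 if correct else 0.0, "score_raw": raw, "parse_ok": True}


good = score_numeric("Some reasoning.\nANSWER: 70", "70")
wrong = score_numeric("Some reasoning.\nANSWER: 7", "70")
unparsed = score_numeric("I could not finish the problem.", "70")
print(good, wrong, unparsed)
```

Because this is string parsing and a comparison, grading needs no model call, which is why the grading stage costs nothing.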

Step 5 — Export and look at your data

itemeval export aime.yaml

This writes the analysis-ready table and prints the spend summary:

studies/aime_quickstart/export/gradings_long.parquet (+ a CSV mirror) — one row per grading event, 45 columns.
studies/aime_quickstart/export/ledger.csv — the cost ledger.

Open it:

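import pandas as pd

# Load the analysis-ready export: one row per grading event.
df = pd.read_parquet("studies/aime_quickstart/export/gradings_long.parquet")
# Show the per-problem columns; print() makes this work outside a notebook too.
print(df[["item_id", "model", "score", "score_raw", "solution", "gen_usd"]])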

Each row tells you, for one problem: the score (1.0/0.0), the raw value the scorer extracted (score_raw), the full solution text, token counts, and what that call cost. The aggregate accuracy is df.score.mean() — but the point of itemeval is that you have the rows, not just the mean.
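
Having the rows means you can ask per-problem questions directly. The sketch below uses a small made-up DataFrame with the column names shown above, so it runs without the export file; the values are invented for illustration and are not the results of the validated run.

```python
import pandas as pd

# Made-up rows shaped like gradings_long.parquet (illustrative values only).
df = pd.DataFrame(
    {
        "item_id": ["1", "2", "3", "4", "5"],
        "model": ["openai/gpt-5-mini"] * 5,
        "score": [1.0, 1.0, 0.0, 1.0, 1.0],
        "score_raw": ["70", "588", "16", "279", "504"],
        "gen_usd": [0.003, 0.002, 0.004, 0.003, 0.002],
    }
)

accuracy = df["score"].mean()        # the headline number
missed = df.loc[df["score"] == 0.0, ["item_id", "score_raw", "gen_usd"]]  # which problems failed
total_usd = df["gen_usd"].sum()      # generation spend across all rows

print(f"accuracy={accuracy:.2f} total_usd={total_usd:.3f}")
print(missed)
```

With the real export, replace the hand-built DataFrame with the pd.read_parquet call from this step.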

Step 6 — Check the books

itemeval status aime.yaml

status shows the completion matrix (5/5 done), error/parse-failure counts, and spend per stage. Run it any time; like estimate, it never calls a model.

What just happened

The dataset revision was pinned at first run (dataset_locks.json), and a manifest was written per run with template hashes, model ids, effective sampling params, and package versions — re-running the same config reproduces the same study (Pipeline Concepts).
Everything is resumable: re-run any command and completed work skips.
Want more items? Raise dev_items, or switch policy — see Tutorial 5 before scaling.

Variations

Letter answers (A/B/C/D): scorer: multiple_choice — your dataset's input column must already contain the choices.
String answers: scorer: exact_match.
Free-form answers that need judgment: you need an LLM judge — that's Tutorial 2.
