# Tutorial 1 — Score a verifiable benchmark for ~2 cents

**Use case:** "How does model X actually do on benchmark Y — per problem, not
just the headline number?"

You will run the complete itemeval pipeline on
[AIME 2025](https://huggingface.co/datasets/MathArena/aime_2025) (competition
math, integer answers) with `openai/gpt-5-mini`. Because the answers are
integers, grading uses the built-in `numeric` scorer — pure Python, no judge
model — so the only cost is generation. This exact run has been validated live:
5 problems, 5/5 correct, about $0.014 of generation and $0.00 of grading.

**You need:** `pip install "itemeval[openai]"` (the quotes stop shells such as
zsh from expanding the brackets) and an `OPENAI_API_KEY` in your environment.
Time: ~10 minutes.

## Step 1 — Write the config

One YAML file describes the whole study. Save this as `aime.yaml`:

```yaml
study: aime_quickstart
benchmark:
  adapter: hf                    # load from the HuggingFace Hub
  datasets:
    - id: MathArena/aime_2025    # dataset revision auto-pins at first run
      split: train
  mapping:                       # dataset columns -> itemeval's Item fields
    id: problem_idx
    input: problem
    target: answer               # integer answers -> the numeric scorer
solvers:
  models: [openai/gpt-5-mini]
  max_tokens: 8192               # room for hidden reasoning + the "ANSWER:" line
facets:
  prompt: [builtin:minimal]      # packaged template; ends with 'state your final
                                 # answer on a line starting with "ANSWER:"'
  scorer: numeric                # verifiable scorer: extracts the last number, $0
  model_config: [{name: low, reasoning_effort: low}]   # keep it fast and cheap
budget:
  policy: dev                    # dev = only the first dev_items items
  dev_items: 5
```

Three choices worth noticing:

- **`mapping`** is the only thing you change to point at a different dataset:
  name the columns that hold the problem id, the problem text, and the
  reference answer.
- **`scorer: numeric`** means grading is free. The packaged `builtin:minimal`
  prompt instructs the model to end with an `ANSWER:` line, which the scorer
  parses.
- **`policy: dev`** caps the run at the first 5 items. This is the default
  posture for any new config — prove the pipeline first, scale later
  ([Tutorial 5](Tutorial-Budget-and-Scale.md)).
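
To make the `mapping` idea concrete, here is a minimal sketch of what a column
mapping does. The `Item` dataclass and the `apply_mapping` helper are
hypothetical stand-ins written for this tutorial, not itemeval's actual
classes, and the sample row is invented rather than taken from AIME.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """Hypothetical stand-in for itemeval's Item; the real class may differ."""
    id: str
    input: str
    target: str


def apply_mapping(row: dict, mapping: dict) -> Item:
    """Build an Item by reading the dataset column named for each Item field."""
    # Casting to str is a choice of this sketch, so ids and answers compare uniformly.
    return Item(**{field: str(row[column]) for field, column in mapping.items()})


# The mapping from aime.yaml: Item field -> dataset column.
mapping = {"id": "problem_idx", "input": "problem", "target": "answer"}

# An invented row that uses the same column names (not a real AIME problem).
row = {"problem_idx": 1, "problem": "What is 6 * 7?", "answer": 42}

item = apply_mapping(row, mapping)
print(item)
```

Pointing at a different dataset then amounts to changing the three column names
on the right-hand side of `mapping`.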

## Step 2 — Estimate before you spend

```bash
itemeval estimate aime.yaml
```

This makes **no model API calls**. It renders the actual prompts, applies a
token heuristic and the pricing table, and prints projected calls, tokens, and
dollars per stage. Expect a projection of well under $0.10 for this config
(estimates are deliberately conservative — actuals usually come in lower).
It also prints which pricing table the numbers came from; before bigger runs,
add `--refresh-pricing` to pull current prices.
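
The arithmetic behind such a projection can be sketched in a few lines. This is
an illustration only: the `estimate_usd` function, the token counts, and the
per-million-token prices below are placeholders, not itemeval's heuristic or
real provider prices. Using `max_tokens` as the output size is one assumed way
an estimate could end up conservative.

```python
def estimate_usd(n_calls, in_tokens_per_call, out_tokens_per_call,
                 usd_per_m_input, usd_per_m_output):
    """Projected dollars: calls x (input tokens x price + output tokens x price)."""
    per_call = (in_tokens_per_call * usd_per_m_input
                + out_tokens_per_call * usd_per_m_output) / 1_000_000
    return n_calls * per_call


# Placeholder prices in USD per million tokens (NOT real gpt-5-mini pricing).
PLACEHOLDER_USD_PER_M_INPUT = 0.50
PLACEHOLDER_USD_PER_M_OUTPUT = 1.00

projected = estimate_usd(
    n_calls=5,                  # dev_items from aime.yaml
    in_tokens_per_call=200,     # guessed prompt size
    out_tokens_per_call=8192,   # worst case: every call uses all of max_tokens
    usd_per_m_input=PLACEHOLDER_USD_PER_M_INPUT,
    usd_per_m_output=PLACEHOLDER_USD_PER_M_OUTPUT,
)
print(f"projected generation cost: ${projected:.4f}")
```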

## Step 3 — Generate solutions

```bash
itemeval generate aime.yaml
```

This expands the design grid, which here is a single condition
(gpt-5-mini × builtin:minimal × low), and runs one inspect_ai task over the 5
items, with live progress. Every solution is upserted into
`studies/aime_quickstart/solutions.parquet` with full provenance: prompt hash,
requested vs effective sampling params, tokens, dollars, and a pointer to the
raw `.eval` transcript.
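
The grid expansion can be pictured as a Cartesian product over the config's
lists. The sketch below is an assumption about the general idea, not
itemeval's code, and it assumes one generation call per item per condition.

```python
from itertools import product

# Values copied from aime.yaml.
models = ["openai/gpt-5-mini"]
prompts = ["builtin:minimal"]
model_configs = [{"name": "low", "reasoning_effort": "low"}]

# One condition per (model, prompt, model_config) combination.
conditions = [
    {"model": m, "prompt": p, "model_config": c["name"]}
    for m, p, c in product(models, prompts, model_configs)
]

n_items = 5  # dev_items
print(len(conditions), "condition(s),", len(conditions) * n_items, "generation call(s)")
```

Adding a second model or prompt to the config multiplies the number of
conditions, which is why the estimate step matters before scaling.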

If it's interrupted or a provider call fails, just run the same command again —
completed items skip, failed ones retry, and nothing is double-paid.
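
The skip-and-retry behaviour follows naturally from keying each solution by
item and condition. Below is a minimal sketch of that pattern with an
in-memory dict standing in for the parquet store; `plan_work`, `upsert`, and
the `status` field are hypothetical names, not itemeval's API.

```python
def plan_work(item_ids, condition, store):
    """Return the (item_id, condition) keys that still need a generation call."""
    todo = []
    for item_id in item_ids:
        key = (item_id, condition)
        record = store.get(key)
        if record is not None and record["status"] == "ok":
            continue  # completed earlier: skip, so nothing is paid for twice
        todo.append(key)  # never attempted, or failed last time: (re)try
    return todo


def upsert(store, key, record):
    """Insert or overwrite the record for a key; one row per key, never a duplicate."""
    store[key] = record


condition = ("openai/gpt-5-mini", "builtin:minimal", "low")
store = {
    ("1", condition): {"status": "ok"},
    ("2", condition): {"status": "error"},  # e.g. a provider call failed
}

todo = plan_work(["1", "2", "3"], condition, store)
print(todo)
```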

## Step 4 — Grade

```bash
itemeval grade aime.yaml
```

The `numeric` scorer parses each stored solution's `ANSWER:` line and compares
it to the target. No LLM, no cost, instant. Results go to
`studies/aime_quickstart/gradings.parquet`; any solution whose answer could not
be parsed is kept and flagged (`parse_ok=false`), never silently dropped.
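
For intuition, a verifiable numeric scorer can be as small as the sketch below.
It is not itemeval's implementation: `grade_numeric`, the regular expressions,
and the choice to take the last number on the last `ANSWER:` line are
assumptions made for illustration. The output keys mirror the `score`,
`score_raw`, and `parse_ok` names used in this tutorial.

```python
import re

# A line that starts with "ANSWER:" (optionally indented); capture the rest of it.
ANSWER_LINE = re.compile(r"^[ \t]*ANSWER:[ \t]*(.*)$", re.MULTILINE)
# An optionally negative integer or decimal.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def grade_numeric(solution: str, target: str) -> dict:
    """Compare the number on the final ANSWER: line with the target."""
    lines = ANSWER_LINE.findall(solution)
    # Drop thousands separators, then collect the numbers on the last ANSWER: line.
    numbers = NUMBER.findall(lines[-1].replace(",", "")) if lines else []
    if not numbers:
        # Keep the row and flag it instead of dropping it.
        return {"score": 0.0, "score_raw": None, "parse_ok": False}
    raw = numbers[-1]
    score = 1.0 if float(raw) == float(target) else 0.0
    return {"score": score, "score_raw": raw, "parse_ok": True}


print(grade_numeric("Some working...\nANSWER: 70", "70"))
print(grade_numeric("The model rambled and never answered.", "70"))
```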

## Step 5 — Export and look at your data

```bash
itemeval export aime.yaml
```

This writes the analysis-ready table and prints the spend summary:

- `studies/aime_quickstart/export/gradings_long.parquet` (+ a CSV mirror) —
  **one row per grading event**, 45 columns.
- `studies/aime_quickstart/export/ledger.csv` — the cost ledger.

Open it:

```python
import pandas as pd

df = pd.read_parquet("studies/aime_quickstart/export/gradings_long.parquet")
df[["item_id", "model", "score", "score_raw", "solution", "gen_usd"]]
```

Each row describes one grading event for one problem: the score (`1.0`/`0.0`),
the raw value the scorer extracted (`score_raw`), the full solution text, and
what that call cost (`gen_usd`). Token counts are among the table's other
columns. The aggregate accuracy is `df.score.mean()`, but the point of
itemeval is that you have the rows, not just the mean.
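
As one example of working with the rows, the sketch below computes per-model
accuracy, total generation spend, and the ids of missed items. It runs on
plain dicts so that it is self-contained; with pandas you could pass
`df.to_dict("records")`. The `summarize` helper is hypothetical and the three
rows are invented, not results from a real run.

```python
from collections import defaultdict


def summarize(rows):
    """Per-model accuracy, total generation spend, and ids of missed items."""
    acc = defaultdict(lambda: {"n": 0, "correct": 0.0, "usd": 0.0, "missed": []})
    for r in rows:
        s = acc[r["model"]]
        s["n"] += 1
        s["correct"] += r["score"]
        s["usd"] += r["gen_usd"]
        if r["score"] < 1.0:
            s["missed"].append(r["item_id"])
    return {
        model: {
            "accuracy": s["correct"] / s["n"],
            "gen_usd": s["usd"],
            "missed": s["missed"],
        }
        for model, s in acc.items()
    }


# Invented rows using the column names shown above.
rows = [
    {"item_id": "1", "model": "openai/gpt-5-mini", "score": 1.0, "gen_usd": 0.003},
    {"item_id": "2", "model": "openai/gpt-5-mini", "score": 1.0, "gen_usd": 0.002},
    {"item_id": "3", "model": "openai/gpt-5-mini", "score": 0.0, "gen_usd": 0.004},
]

summary = summarize(rows)
print(summary)
```

The `missed` list is the starting point for reading the corresponding
`solution` texts, which a headline accuracy number cannot give you.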

## Step 6 — Check the books

```bash
itemeval status aime.yaml
```

`status` shows the completion matrix (5/5 done), error/parse-failure counts,
and spend per stage. Run it any time; like `estimate`, it never calls a model.

## What just happened

- The dataset revision was **pinned** at first run (`dataset_locks.json`), and a
  **manifest** was written per run with template hashes, model ids, effective
  sampling params, and package versions — re-running the same config
  reproduces the same study ([Pipeline Concepts](Pipeline-Concepts.md)).
- Everything is **resumable**: re-run any command and completed work skips.
- Want more items? Raise `dev_items`, or switch `policy` —
  see [Tutorial 5](Tutorial-Budget-and-Scale.md) before scaling.
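
To see why a template hash is useful provenance, consider the sketch below. It
assumes SHA-256 over the template text, truncated to 16 hex characters;
itemeval's actual hashing scheme is not described here, and both
`template_hash` and the template strings are illustrative.

```python
import hashlib


def template_hash(template: str) -> str:
    """Short, stable fingerprint of a prompt template (assumed scheme: SHA-256)."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:16]


# Illustrative template text, not the packaged builtin:minimal prompt.
original = 'Solve the problem. State your final answer on a line starting with "ANSWER:".'
edited = original + " Show your work."

# Identical text always gives the same hash; any edit gives a different one,
# so a manifest can reveal that two runs used different prompts.
print(template_hash(original))
print(template_hash(edited))
```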

## Variations

- **Letter answers** (A/B/C/D): `scorer: multiple_choice`. The dataset column
  you map to `input` must already contain the choices.
- **String answers**: `scorer: exact_match`.
- **Free-form answers that need judgment**: you need an LLM judge —
  that's [Tutorial 2](Tutorial-LLM-Judge.md).