# Tutorial 2 — Grade open-ended answers with an LLM judge **Use case:** "My benchmark's answers are proofs / essays / explanations — no string match can grade them. I want an LLM judge to score each answer against a rubric, and I want the judge's reasoning on the record." You will solve a small set of olympiad proof problems ([MathArena/usamo_2025](https://huggingface.co/datasets/MathArena/usamo_2025) — the same public dataset the repo's free demo uses) and grade the proofs with a judge model against a rubric. Judge calls are real model calls: they get their own logs, caching, retries, and cost accounting, and they must return a structured score. **You need:** `pip install itemeval[openai]`, an `OPENAI_API_KEY`. Time: ~15 minutes; cost: a few cents at `dev` scope. ## Step 1 — A config with a grader instead of a scorer Save as `proofs.yaml`: ```yaml study: proof_judging benchmark: adapter: hf datasets: - id: MathArena/usamo_2025 split: train mapping: id: problem_idx input: problem target: sample_solution # reference solution -> {target} in the rubric grading_scheme: grading_scheme # per-item rubric text -> {grading_scheme} metadata: [points] solvers: models: [openai/gpt-5-mini] max_tokens: 16384 # proofs are long; reasoning models need headroom facets: prompt: [builtin:standard] # asks for a complete, rigorous argument grader: [judge_a] # judge grading instead of scorer rubric: [builtin:standard] # packaged rubric template graders: judge_a: model: openai/gpt-5-mini max_tokens: 4096 # the judge also needs reasoning headroom reasoning_effort: minimal budget: policy: dev # first 2 items while we validate the pipeline confirm_above_usd: 1 ``` What changed versus [Tutorial 1](Tutorial-Verifiable-Benchmark.md): - **`facets.grader` + `graders:`** replace `facets.scorer`. A grader is a judge model with its own settings; judge temperature is pinned to 0 in v0.1 for grading stability. - **`facets.rubric`** names the rubric template. The packaged `builtin:standard` rubric shows the judge the problem (`{input}`), the grading scheme (`{grading_scheme}`), the reference solution (`{target}`), and the candidate solution (`{solution}`), and asks for a score according to the scheme. - **`mapping.grading_scheme`** wires a per-item rubric column from the dataset into the template. If your dataset has no such column, omit it — write the scoring criteria into your rubric file instead (Step 5). ## Step 2 — Estimate, generate, grade ```bash itemeval estimate proofs.yaml # now shows TWO paid stages itemeval generate proofs.yaml # solve the problems (stage 1) itemeval grade proofs.yaml # judge the stored solutions (stage 2) ``` Note that `estimate` now projects costs for grading too — the judge reads the whole problem + rubric + solution, so judge input tokens often rival generation. Grading runs as its **own inspect task** whose dataset is your stored solutions; it never re-generates anything. ## Step 3 — The judge output contract itemeval appends a format instruction to every rubric: the judge must end with a fenced JSON block ```json {"score": 4, "reasoning": "..."} ``` Parsing is strict. If the judge replies without a valid numeric `score`, the row is **kept** with `parse_ok=false` and an exact failure code (`no_json_object`, `no_score_in_json`, `score_not_numeric`, `score_not_finite`) plus the raw judge text — never silently dropped, and never retried on re-runs (a parse failure is a *result*; use `grade --force` to redo). The `grade` summary line reports `parse_failures` so you see them immediately. ## Step 4 — Read the judged data ```bash itemeval export proofs.yaml ``` ```python import pandas as pd df = pd.read_parquet("studies/proof_judging/export/gradings_long.parquet") df[["item_id", "score", "reasoning", "parse_ok", "grade_usd"]] ``` Every row now carries the judge's numeric `score` **and** its `reasoning` — auditable, per item. The full judge completion is in `judge_completion`, and the raw transcript of every judge call is an `.eval` log under `studies/proof_judging/logs/grade/`. ## Step 5 — Write your own rubric The packaged rubric is a generic starting point. To customize it: ```bash itemeval init proof_study --with-templates # copies builtin templates locally ``` or create `rubrics/strict.md` next to your config: ```markdown You are grading a candidate proof. Award an integer score from 0 to 7. Problem: {input} Reference solution: {target} Candidate solution: {solution} Award 7 only for a complete and rigorous proof. Deduct points for gaps, unjustified steps, or missing cases. A correct final answer with no valid argument scores at most 1. ``` then reference it by **bare name** (bare = local file; `builtin:` = packaged): ```yaml facets: rubric: [strict] ``` Rubrics must contain `{input}` and `{solution}`; `{target}`, `{grading_scheme}`, and `{id}` are optional. Placeholders are validated before any run, and the rubric's content hash goes into the condition id — editing a rubric starts a fresh, clearly-separated grade condition ([Pipeline Concepts](Pipeline-Concepts.md)). ## Troubleshooting - **Empty solutions reported by `grade`** — a reasoning model spent the whole `max_tokens` budget on hidden reasoning. Raise `solvers.max_tokens` or lower `reasoning_effort`; see `solvers.on_empty` in [Configuration](Configuration.md). - **Many parse failures** — your judge model may be wrapping the JSON in extra prose or hitting its own `max_tokens` mid-reply. Raise the grader's `max_tokens` first; it fixes most cases. ## Next The judge model and the rubric are *facets* — you can add more of either over the same stored solutions, paying only judge tokens: [Tutorial 4 — Add a second judge at $0 generation](Tutorial-Second-Judge.md).