
Tutorial LLM Judge

github-actions[bot] edited this page Jun 11, 2026 · 1 revision

Tutorial 2 — Grade open-ended answers with an LLM judge

Use case: "My benchmark's answers are proofs / essays / explanations — no string match can grade them. I want an LLM judge to score each answer against a rubric, and I want the judge's reasoning on the record."

You will solve a small set of olympiad proof problems (MathArena/usamo_2025 — the same public dataset the repo's free demo uses) and grade the proofs with a judge model against a rubric. Judge calls are real model calls: they get their own logs, caching, retries, and cost accounting, and they must return a structured score.

You need: pip install "itemeval[openai]" (the quotes keep shells such as zsh from expanding the brackets) and an OPENAI_API_KEY. Time: ~15 minutes; cost: a few cents at dev scope.

Step 1 — A config with a grader instead of a scorer

Save as proofs.yaml:

study: proof_judging
benchmark:
  adapter: hf
  datasets:
    - id: MathArena/usamo_2025
      split: train
  mapping:
    id: problem_idx
    input: problem
    target: sample_solution        # reference solution -> {target} in the rubric
    grading_scheme: grading_scheme # per-item rubric text -> {grading_scheme}
    metadata: [points]
solvers:
  models: [openai/gpt-5-mini]
  max_tokens: 16384                # proofs are long; reasoning models need headroom
facets:
  prompt: [builtin:standard]       # asks for a complete, rigorous argument
  grader: [judge_a]                # judge grading instead of scorer
  rubric: [builtin:standard]       # packaged rubric template
graders:
  judge_a:
    model: openai/gpt-5-mini
    max_tokens: 4096               # the judge also needs reasoning headroom
    reasoning_effort: minimal
budget:
  policy: dev                      # first 2 items while we validate the pipeline
  confirm_above_usd: 1

What changed versus Tutorial 1:

  • facets.grader, together with a top-level graders: block, replaces facets.scorer. A grader is a judge model with its own settings; judge temperature is pinned to 0 in v0.1 for grading stability.
  • facets.rubric names the rubric template. The packaged builtin:standard rubric shows the judge the problem ({input}), the grading scheme ({grading_scheme}), the reference solution ({target}), and the candidate solution ({solution}), and asks for a score according to the scheme.
  • mapping.grading_scheme wires a per-item rubric column from the dataset into the template. If your dataset has no such column, omit it — write the scoring criteria into your rubric file instead (Step 5).
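
To make the placeholder wiring concrete, here is a minimal Python sketch of how a template with these four placeholders could be filled for one item. It is an illustration under assumptions, not itemeval's actual implementation: the function name render_rubric, the template text, and the sample item are hypothetical; only the placeholder names come from the description above.

```python
# Hypothetical sketch: fill a rubric template for a single item.
# The template wording and the sample values are made up for illustration.
RUBRIC_TEMPLATE = (
    "Problem:\n{input}\n\n"
    "Grading scheme:\n{grading_scheme}\n\n"
    "Reference solution:\n{target}\n\n"
    "Candidate solution:\n{solution}\n"
)


def render_rubric(template: str, item: dict, solution: str) -> str:
    """Substitute the mapped item fields and the stored solution."""
    fields = {
        "input": item["input"],
        # Optional columns fall back to an empty string in this sketch.
        "target": item.get("target", ""),
        "grading_scheme": item.get("grading_scheme", ""),
        "solution": solution,
    }
    # str.format only interprets braces in the template, so braces inside
    # the substituted values (common in LaTeX proofs) pass through untouched.
    return template.format(**fields)


item = {
    "input": "Prove the statement.",
    "target": "Reference proof text.",
    "grading_scheme": "Up to 7 points.",
}
prompt = render_rubric(RUBRIC_TEMPLATE, item, "Candidate proof with \\frac{1}{2}.")
print(prompt)
```

The point of the sketch is the data flow: mapping.* decides which dataset columns become {input}, {target}, and {grading_scheme}, while {solution} comes from the stored generation.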

Step 2 — Estimate, generate, grade

itemeval estimate proofs.yaml          # now shows TWO paid stages
itemeval generate proofs.yaml          # solve the problems (stage 1)
itemeval grade    proofs.yaml          # judge the stored solutions (stage 2)

Note that estimate now projects costs for grading too — the judge reads the whole problem + rubric + solution, so judge input tokens often rival generation. Grading runs as its own inspect task whose dataset is your stored solutions; it never re-generates anything.
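
A back-of-envelope calculation shows why the grading stage is not negligible. The sketch below is not how itemeval estimate works internally; the function stage_cost_usd, the token counts, and the prices are all placeholders chosen for illustration, and it ignores hidden reasoning tokens, caching, and retries.

```python
def stage_cost_usd(n_calls, input_tokens, output_tokens,
                   usd_per_m_input, usd_per_m_output):
    """Cost of one stage: n_calls calls with the given per-call token counts."""
    per_call = (input_tokens * usd_per_m_input
                + output_tokens * usd_per_m_output) / 1_000_000
    return n_calls * per_call


# Placeholder numbers only: NOT real prices or measured token counts.
N_ITEMS = 2                                    # dev scope: first 2 items
PROBLEM, RUBRIC, SOLUTION = 300, 1_200, 6_000  # tokens per item (made up)
PRICE_IN, PRICE_OUT = 1.0, 4.0                 # USD per million tokens (made up)

# Stage 1: the solver reads the problem and writes the solution.
generate = stage_cost_usd(N_ITEMS, PROBLEM, SOLUTION, PRICE_IN, PRICE_OUT)
# Stage 2: the judge reads problem + rubric + solution and writes a short verdict.
grade = stage_cost_usd(N_ITEMS, PROBLEM + RUBRIC + SOLUTION, 500,
                       PRICE_IN, PRICE_OUT)
print(f"generate ~ ${generate:.4f}, grade ~ ${grade:.4f}")
```

With these made-up numbers the judge's input per item is larger than the solver's output per item, which is the effect the estimate output makes visible.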

Step 3 — The judge output contract

itemeval appends a format instruction to every rubric: the judge must end its reply with a fenced JSON block of this shape:

{"score": 4, "reasoning": "..."}

Parsing is strict. If the judge replies without a valid numeric score, the row is kept with parse_ok=false, an exact failure code (no_json_object, no_score_in_json, score_not_numeric, or score_not_finite), and the raw judge text. Failed rows are never silently dropped, and they are not retried on re-runs: a parse failure is a result, so use grade --force to redo it. The grade summary line reports parse_failures, so you see them immediately.
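
The contract can be sketched in a few lines of Python. This is a hedged illustration of strict parsing, not itemeval's parser: the function parse_judge_reply, the regular expression, and the returned dictionary layout are assumptions; only the four failure codes and the parse_ok flag are taken from the text above.

```python
import json
import math
import re

TICKS = "`" * 3  # a literal triple-backtick fence, built indirectly
# Capture the JSON object inside a fenced block, with or without a "json" tag.
FENCED_JSON = re.compile(TICKS + r"(?:json)?\s*(\{.*?\})\s*" + TICKS, re.DOTALL)


def _fail(code, raw):
    """Keep the row, flag it, and preserve the raw judge text."""
    return {"parse_ok": False, "score": None, "reasoning": None,
            "failure": code, "raw": raw}


def parse_judge_reply(raw):
    blocks = FENCED_JSON.findall(raw)
    if not blocks:
        return _fail("no_json_object", raw)
    try:
        obj = json.loads(blocks[-1])  # the reply must END with the block
    except json.JSONDecodeError:
        return _fail("no_json_object", raw)
    if "score" not in obj:
        return _fail("no_score_in_json", raw)
    score = obj["score"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return _fail("score_not_numeric", raw)
    # Python's json module accepts NaN and Infinity, so check finiteness.
    if isinstance(score, float) and not math.isfinite(score):
        return _fail("score_not_finite", raw)
    return {"parse_ok": True, "score": score,
            "reasoning": obj.get("reasoning"), "failure": None, "raw": raw}


reply = "The proof has a gap.\n" + TICKS + 'json\n{"score": 4, "reasoning": "gap in case 2"}\n' + TICKS
print(parse_judge_reply(reply)["score"])
print(parse_judge_reply("I would give it a 4.")["failure"])
```

Returning a flagged row instead of raising keeps every judge call in the output table, which matches the "never silently dropped" behaviour described above.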

Step 4 — Read the judged data

itemeval export proofs.yaml

Then load the export in Python:

import pandas as pd

df = pd.read_parquet("studies/proof_judging/export/gradings_long.parquet")
df[["item_id", "score", "reasoning", "parse_ok", "grade_usd"]]

Every row now carries the judge's numeric score and its reasoning — auditable, per item. The full judge completion is in judge_completion, and the raw transcript of every judge call is an .eval log under studies/proof_judging/logs/grade/.
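
A first sanity check on the export is to count parse failures and average only the rows that parsed. The helper below is a hypothetical sketch (summarize_gradings is not part of itemeval) that works on plain row dictionaries, using the column names shown above; the sample rows are made up.

```python
from statistics import mean


def summarize_gradings(rows):
    """Summarize graded rows; each row is a dict keyed by export column name."""
    parsed = [r for r in rows if r["parse_ok"]]
    return {
        "n_rows": len(rows),
        "parse_failures": len(rows) - len(parsed),
        # Average only rows with a valid score; failed rows carry no score.
        "mean_score": mean([r["score"] for r in parsed]) if parsed else None,
        # Failed judge calls still cost money, so sum over all rows.
        "total_grade_usd": sum(r["grade_usd"] for r in rows),
    }


# Made-up rows standing in for the exported table.
rows = [
    {"item_id": "1", "score": 7, "parse_ok": True, "grade_usd": 0.5},
    {"item_id": "2", "score": 1, "parse_ok": True, "grade_usd": 0.25},
    {"item_id": "3", "score": None, "parse_ok": False, "grade_usd": 0.25},
]
print(summarize_gradings(rows))
```

With the DataFrame loaded above, the same helper can be called as summarize_gradings(df.to_dict("records")).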

Step 5 — Write your own rubric

The packaged rubric is a generic starting point. To customize it:

itemeval init proof_study --with-templates   # copies builtin templates locally

or create rubrics/strict.md next to your config:

You are grading a candidate proof. Award an integer score from 0 to 7.

Problem:
{input}

Reference solution:
{target}

Candidate solution:
{solution}

Award 7 only for a complete and rigorous proof. Deduct points for gaps,
unjustified steps, or missing cases. A correct final answer with no valid
argument scores at most 1.

then reference it by bare name (bare = local file; builtin: = packaged):

facets:
  rubric: [strict]

Rubrics must contain {input} and {solution}; {target}, {grading_scheme}, and {id} are optional. Placeholders are validated before any run, and the rubric's content hash goes into the condition id — editing a rubric starts a fresh, clearly-separated grade condition (Pipeline Concepts).
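
The placeholder rules and the content hash can be sketched with the standard library. This is an assumption-laden illustration, not itemeval's code: the helper names, the error messages, the choice of SHA-256, and the 12-character truncation are all hypothetical; only the required and optional placeholder names come from the paragraph above.

```python
import hashlib
from string import Formatter

REQUIRED = {"input", "solution"}
OPTIONAL = {"target", "grading_scheme", "id"}


def rubric_placeholders(template: str) -> set:
    """Return every {name} field used in the template."""
    # Formatter().parse yields (literal, field_name, format_spec, conversion);
    # field_name is None for chunks that contain no placeholder.
    return {name for _, name, _, _ in Formatter().parse(template)
            if name is not None}


def validate_rubric(template: str) -> list:
    """Return a sorted list of problems; an empty list means the rubric is valid."""
    found = rubric_placeholders(template)
    errors = [f"missing required placeholder: {{{name}}}"
              for name in REQUIRED - found]
    errors += [f"unknown placeholder: {{{name}}}"
               for name in found - REQUIRED - OPTIONAL]
    return sorted(errors)


def rubric_hash(template: str) -> str:
    """Short content hash: any edit to the text yields a different id."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]


strict = "Problem:\n{input}\n\nCandidate solution:\n{solution}\n"
print(validate_rubric(strict))                    # no problems reported
print(validate_rubric("Grade this: {solution}"))  # reports the missing {input}
print(rubric_hash(strict) != rubric_hash(strict + "Be strict.\n"))
```

Hashing the rubric text rather than its file name is what lets an edited rubric show up as a separate grade condition instead of silently overwriting earlier scores.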

Troubleshooting

  • Empty solutions reported by grade — a reasoning model spent the whole max_tokens budget on hidden reasoning. Raise solvers.max_tokens or lower reasoning_effort; see solvers.on_empty in Configuration.
  • Many parse failures — your judge model may be wrapping the JSON in extra prose or hitting its own max_tokens mid-reply. Raise the grader's max_tokens first; it fixes most cases.

Next

The judge model and the rubric are facets — you can add more of either over the same stored solutions, paying only judge tokens: Tutorial 4 — Add a second judge at $0 generation.
