Tutorial LLM Judge
Use case: "My benchmark's answers are proofs / essays / explanations — no string match can grade them. I want an LLM judge to score each answer against a rubric, and I want the judge's reasoning on the record."
You will solve a small set of olympiad proof problems (MathArena/usamo_2025 — the same public dataset the repo's free demo uses) and grade the proofs with a judge model against a rubric. Judge calls are real model calls: they get their own logs, caching, retries, and cost accounting, and they must return a structured score.
You need: pip install itemeval[openai], an OPENAI_API_KEY. Time: ~15
minutes; cost: a few cents at dev scope.
Save as proofs.yaml:
    study: proof_judging
    benchmark:
      adapter: hf
      datasets:
        - id: MathArena/usamo_2025
          split: train
          mapping:
            id: problem_idx
            input: problem
            target: sample_solution          # reference solution -> {target} in the rubric
            grading_scheme: grading_scheme   # per-item rubric text -> {grading_scheme}
            metadata: [points]
    solvers:
      models: [openai/gpt-5-mini]
      max_tokens: 16384                # proofs are long; reasoning models need headroom
    facets:
      prompt: [builtin:standard]       # asks for a complete, rigorous argument
      grader: [judge_a]                # judge grading instead of scorer
      rubric: [builtin:standard]       # packaged rubric template
    graders:
      judge_a:
        model: openai/gpt-5-mini
        max_tokens: 4096               # the judge also needs reasoning headroom
        reasoning_effort: minimal
    budget:
      policy: dev                      # first 2 items while we validate the pipeline
      confirm_above_usd: 1

What changed versus Tutorial 1:
- facets.grader + graders: replace facets.scorer. A grader is a judge model with its own settings; judge temperature is pinned to 0 in v0.1 for grading stability.
- facets.rubric names the rubric template. The packaged builtin:standard rubric shows the judge the problem ({input}), the grading scheme ({grading_scheme}), the reference solution ({target}), and the candidate solution ({solution}), and asks for a score according to the scheme.
- mapping.grading_scheme wires a per-item rubric column from the dataset into the template. If your dataset has no such column, omit it and write the scoring criteria into your rubric file instead (Step 5).
    itemeval estimate proofs.yaml   # now shows TWO paid stages
    itemeval generate proofs.yaml   # solve the problems (stage 1)
    itemeval grade proofs.yaml      # judge the stored solutions (stage 2)

Note that estimate now projects costs for grading too: the judge reads the
whole problem + rubric + solution, so judge input tokens often rival those of
generation. Grading runs as its own inspect task whose dataset is your
stored solutions; it never re-generates anything.
itemeval appends a format instruction to every rubric: the judge must end with a fenced JSON block:

    {"score": 4, "reasoning": "..."}

Parsing is strict. If the judge replies without a valid numeric score, the
row is kept with parse_ok=false, an exact failure code
(no_json_object, no_score_in_json, score_not_numeric,
score_not_finite), and the raw judge text. Such rows are never silently
dropped and never retried on re-runs (a parse failure is a result; use
grade --force to redo). The grade summary line reports parse_failures so you
see them immediately.
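To make the failure codes concrete, here is a minimal sketch of a strict parser in the same spirit. It is not itemeval's implementation: the function name parse_judge_reply, the choice to read the last fenced block, and the exact condition that triggers each code are assumptions made for illustration. Only the four code names and the {"score", "reasoning"} shape come from the text above.

```python
import json
import math
import re

FENCE = "`" * 3  # a triple-backtick fence, built here to keep this example readable
# Assumption: the judge wraps its verdict in a fenced block, optionally tagged "json".
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_judge_reply(text):
    """Return (score, reasoning, failure_code); failure_code is None on success."""
    blocks = FENCE_RE.findall(text)
    if not blocks:
        return None, None, "no_json_object"
    try:
        obj = json.loads(blocks[-1])  # the verdict is expected at the end of the reply
    except json.JSONDecodeError:
        return None, None, "no_json_object"
    if not isinstance(obj, dict):
        return None, None, "no_json_object"
    if "score" not in obj:
        return None, None, "no_score_in_json"
    score = obj["score"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return None, None, "score_not_numeric"
    # json.loads accepts NaN and Infinity by default, so check floats explicitly.
    if isinstance(score, float) and not math.isfinite(score):
        return None, None, "score_not_finite"
    return score, obj.get("reasoning"), None
```

A reply that states "Score: 4" in prose but never emits the JSON block would come back with no_json_object under this sketch, which matches the troubleshooting advice about judges that get cut off before the final block.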
    itemeval export proofs.yaml

Then load the export in Python:

    import pandas as pd
    df = pd.read_parquet("studies/proof_judging/export/gradings_long.parquet")
    df[["item_id", "score", "reasoning", "parse_ok", "grade_usd"]]

Every row now carries the judge's numeric score and its reasoning, auditable
per item. The full judge completion is in judge_completion, and
the raw transcript of every judge call is an .eval log under
studies/proof_judging/logs/grade/.
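Once the export is loaded, a quick summary helps confirm the grading run looks sane. The sketch below is a hypothetical helper, not part of itemeval. It assumes the column names shown above (item_id, score, parse_ok, grade_usd), one row per judge call, a single grade condition in the file, and a numeric grade_usd on every row.

```python
def summarize_gradings(rows):
    """Summarize exported grading rows, given as one dict per judge call."""
    rows = list(rows)
    parsed = [r for r in rows if r["parse_ok"]]
    failed_ids = [r["item_id"] for r in rows if not r["parse_ok"]]
    # Average only over rows whose score parsed; failed rows have no usable score.
    mean_score = sum(r["score"] for r in parsed) / len(parsed) if parsed else None
    return {
        "n_rows": len(rows),
        "parse_failures": len(failed_ids),
        "failed_item_ids": failed_ids,
        "mean_score": mean_score,
        "total_grade_usd": sum(r["grade_usd"] for r in rows),
    }
```

With the DataFrame from the export, this could be called as summarize_gradings(df.to_dict("records")). If the file holds several grade conditions, filter to one first so the mean does not mix judges or rubrics.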
The packaged rubric is a generic starting point. To customize it:

    itemeval init proof_study --with-templates   # copies builtin templates locally

or create rubrics/strict.md next to your config:

    You are grading a candidate proof. Award an integer score from 0 to 7.
    Problem:
    {input}
    Reference solution:
    {target}
    Candidate solution:
    {solution}
    Award 7 only for a complete and rigorous proof. Deduct points for gaps,
    unjustified steps, or missing cases. A correct final answer with no valid
    argument scores at most 1.

then reference it by bare name (bare = local file; builtin: = packaged):

    facets:
      rubric: [strict]

Rubrics must contain {input} and {solution}; {target},
{grading_scheme}, and {id} are optional. Placeholders are validated before
any run, and the rubric's content hash goes into the condition id, so editing a
rubric starts a fresh, clearly separated grade condition
(Pipeline Concepts).
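If you want to check a rubric file yourself before running anything, the two rules above can be sketched as follows. This is an illustration under stated assumptions, not itemeval's code: the helper names check_rubric and rubric_hash are hypothetical, treating unrecognized placeholders as errors is a guess, and SHA-256 truncated to 12 hex characters merely stands in for whatever hash the tool actually uses.

```python
import hashlib
import re

REQUIRED = {"input", "solution"}
OPTIONAL = {"target", "grading_scheme", "id"}
# Only simple {name} placeholders are recognized; other braces are left alone.
PLACEHOLDER_RE = re.compile(r"\{([A-Za-z_][A-Za-z0-9_]*)\}")


def check_rubric(text):
    """Return (missing_required, unknown) placeholder names as two sets."""
    found = set(PLACEHOLDER_RE.findall(text))
    return REQUIRED - found, found - REQUIRED - OPTIONAL


def rubric_hash(text):
    """Short content hash: any edit to the rubric text yields a different value."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
```

Because the hash depends on every byte of the file, even a whitespace-only edit would produce a new value in this sketch, which is consistent with the rule that an edited rubric starts a separate grade condition.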
- Empty solutions reported by grade: a reasoning model spent the whole max_tokens budget on hidden reasoning. Raise solvers.max_tokens or lower reasoning_effort; see solvers.on_empty in Configuration.
- Many parse failures: your judge model may be wrapping the JSON in extra prose or hitting its own max_tokens mid-reply. Raise the grader's max_tokens first; it fixes most cases.
The judge model and the rubric are facets — you can add more of either over the same stored solutions, paying only judge tokens: Tutorial 4 — Add a second judge at $0 generation.