# Tutorial 4 — Add a second judge (and rubric) at $0 generation cost **Use case:** "How much do my results depend on the judge? Would a different judge model — or a stricter rubric — change the scores?" This is the question behind every LLM-as-judge methodology study, and it is the reason itemeval's pipeline is two-stage. Solutions live in a store, independent of grading. Adding a grader or a rubric **re-uses every stored solution**: you pay judge tokens only, never generation again. You will extend [Tutorial 2](Tutorial-LLM-Judge.md)'s study with a second judge model and a stricter rubric, then measure judge agreement. **You need:** the finished `proofs.yaml` study from Tutorial 2, plus `pip install itemeval[anthropic]` and an `ANTHROPIC_API_KEY` for the second judge (any provider works). Cost: a few cents. ## Step 1 — Add the new grading facets Edit `proofs.yaml` — only the grading side changes; nothing under `solvers:` or `benchmark:` moves: ```yaml facets: prompt: [builtin:standard] grader: [judge_a, judge_b] # was: [judge_a] rubric: [builtin:standard, strict] # was: [builtin:standard] graders: judge_a: model: openai/gpt-5-mini max_tokens: 4096 reasoning_effort: minimal judge_b: # NEW: a second judge, different provider model: anthropic/claude-haiku-4-5 max_tokens: 4096 ``` And create the stricter rubric as `rubrics/strict.md` next to the config (bare name = local file; it must contain `{input}` and `{solution}`): ```markdown You are grading a candidate proof. Award an integer score from 0 to 7. Problem: {input} Reference solution: {target} Candidate solution: {solution} Award 7 only for a complete and rigorous proof with all cases handled. Any unjustified step caps the score at 3. A correct final answer with no valid argument scores 0. ``` The grade grid is now `grader × rubric` = **4 grade conditions**. The one already graded (judge_a × builtin:standard) is complete and will be skipped. ## Step 2 — See what's pending, then grade only the new cells ```bash itemeval status proofs.yaml # 3 new grade conditions at 0/N, old one complete itemeval estimate proofs.yaml # generation projects too, but it won't re-run itemeval grade proofs.yaml # grades only the 3 pending conditions ``` `grade` computes what's pending per (grader × rubric) over the stored solutions and runs just that. Generation is untouched — the solutions store is read-only to the grade stage. You can also target cells explicitly, which is handy when iterating on one rubric: ```bash itemeval grade proofs.yaml --grader judge_b # one judge, all rubrics itemeval grade proofs.yaml --grader judge_a --rubric strict ``` ## Step 3 — Measure judge agreement ```bash itemeval export proofs.yaml ``` Every grading event is a row, so agreement is a pivot away: ```python import pandas as pd df = pd.read_parquet("studies/proof_judging/export/gradings_long.parquet") ok = df[df.parse_ok] # exclude flagged parse failures from analysis # One column per (grader, rubric), one row per solution scores = ok.pivot_table( index=["item_id", "gen_condition_id", "replication"], columns=["grader_name", "rubric_name"], values="score", ) scores.corr() # inter-judge / inter-rubric correlation (scores.max(axis=1) - scores.min(axis=1)) # per-solution judge disagreement .sort_values(ascending=False).head(10) # the solutions judges fight over ``` Disagreement cases are auditable: each row's `reasoning` and `judge_completion` columns hold both judges' rationales for the *same* stored solution, and `grade_log_file` points at the raw transcripts. ## Why this matters With in-eval grading (the usual harness design), each of the 4 judge × rubric cells would have re-run generation — paying the most expensive stage 4 times to study the cheap one. Here the grading dimension scales independently: N judges × M rubrics over the same solutions costs only judge tokens, and the gradings table keeps grader and rubric as first-class design columns. That is what makes judge-sensitivity and rubric-sensitivity studies routine instead of heroic. Two notes for serious measurement work: - Judge temperature is pinned to 0 in v0.1, so re-judging the same solution under the same (grader × rubric) is not a replication design — judge replication is on the roadmap ([FUTURE.md](https://github.com/luozm/itemeval/blob/main/docs/FUTURE.md)). - Editing a rubric file changes its content hash and therefore its condition id: old gradings stay under the old id, the edited rubric grades fresh. Two rubric versions can never silently mix ([Pipeline Concepts](Pipeline-Concepts.md)). ## Next [Tutorial 5 — Scale up without surprises](Tutorial-Budget-and-Scale.md): take a validated design from `dev` scope to the full item set, batched and budget-capped.