TML-2736: bespoke LLM judge — intent correctness signal + classifiers#654
…-judge slice Record the slice-3 spike outcome (bespoke-minimal confirmed over Inspect / Braintrust / promptfoo; impedance mismatch in the eval unit; promptfoo as the escape hatch) in the project spec and design-notes, and scaffold the llm-judge (TML-2736) slice: spec, dispatch plan, and slice-scoped trace. Signed-off-by: Will Madden <madden@prisma.io>
Add a bespoke-minimal LLM judge under skills-contrib/drive-judge-harness/judge/
that grades one Drive run against a golden case's acceptance set behind an
injected, mockable JudgeModel boundary — so typecheck/test/lint stay green with
no CURSOR_API_KEY and @cursor/sdk absent.
- judge-model{,-sdk}.ts: injected boundary + live adapter with a synchronous
cross-family guard (rejects a same-family judge id) and a lazy SDK import.
- rubric-correctness.ts: requirements+intent rubric; a malformed model response
yields intent:null (never a false pass).
- classify-failure.ts / classify-operator.ts: diagnostic classifiers; the
operator-turn buckets follow the measurement model's five canonical buckets.
- emit-correctness.ts: merge-preserving emission — fills intent while preserving
any already-recorded mechanical/qa, because the scorecard is last-write-wins
on the whole triple.
- calibration.ts + calibration/labels.md: agreement tally with the >=0.80 gate;
the calibration run itself is parked (corpus-gated, operator approves spend).
43 tests (node --test) green; typecheck / lint:deps / lint:casts clean. Pins the
fail-to-null and merge-preserving invariants end-to-end against the scorecard.
Signed-off-by: Will Madden <madden@prisma.io>
Walkthrough
This PR introduces a complete LLM judge harness for Drive orchestrator evaluation. It establishes a …
Changes: Drive LLM Judge Grading & Correctness Pipeline
Estimated code review effort: 3 (Moderate), ~25 minutes
Pre-merge checks: 4 passed, 1 failed (1 warning)
Actionable comments posted: 2
🧹 Nitpick comments (1)
skills-contrib/drive-judge-harness/judge/parse-json.ts (1)
27-32: 💤 Low value. Braced span extraction may incorrectly match unrelated braces.
`extractBracedSpan` uses `indexOf('{')` and `lastIndexOf('}')` to find the first opening and last closing brace. If the model's response contains multiple independent JSON objects or braced prose (e.g., `"The result is {good}. Also here is the JSON: {\"intent\":\"pass\",...}"`), this will extract everything from the first `{` to the last `}`, producing invalid JSON.
This is mitigated by `tryParseObject` rejecting invalid JSON, but the function could be more precise by scanning for balanced braces or documenting this limitation.
Possible refinement
Add a comment documenting the limitation:
      function extractBracedSpan(raw: string): string | undefined {
    +   // Extracts from first '{' to last '}'; malformed spans are rejected by tryParseObject.
        const start = raw.indexOf('{');
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@skills-contrib/drive-judge-harness/judge/parse-json.ts` around lines 27 - 32, extractBracedSpan can grab unrelated braces because it uses indexOf/lastIndexOf; update extractBracedSpan to locate the first opening brace and then scan forward counting nested braces (increment on '{', decrement on '}') until the count returns to zero, returning that balanced slice; if no balanced span is found return undefined — reference the extractBracedSpan function and ensure the scanner handles nested objects and stops at the first balanced closing brace to avoid spanning multiple JSON objects or prose with braces.
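The balanced-brace scan suggested above could be sketched as follows. This is an illustration under assumptions, not the code in `parse-json.ts`: the names `extractBalancedSpan` and `firstJsonObject` are hypothetical, and tracking string literals (so a `}` inside a JSON string does not close the span) goes slightly beyond what the review asks for.

```typescript
// Hypothetical sketch: return the first balanced {...} span that begins at `start`.
// Assumption: braces inside double-quoted strings must not change the depth.
function extractBalancedSpan(raw: string, start: number): string | undefined {
  if (raw[start] !== "{") return undefined;
  let depth = 0;
  let inString = false;
  let escaped = false;
  for (let i = start; i < raw.length; i++) {
    const ch = raw[i];
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === "{") depth++;
    else if (ch === "}") {
      depth--;
      if (depth === 0) return raw.slice(start, i + 1);
    }
  }
  return undefined; // the span never balanced
}

// Hypothetical caller: try each opening brace in turn and keep the first span
// that parses as a JSON object; anything else yields null.
function firstJsonObject(raw: string): Record<string, unknown> | null {
  let from = raw.indexOf("{");
  while (from !== -1) {
    const span = extractBalancedSpan(raw, from);
    if (span !== undefined) {
      try {
        const parsed: unknown = JSON.parse(span);
        if (typeof parsed === "object" && parsed !== null && !Array.isArray(parsed)) {
          return parsed as Record<string, unknown>;
        }
      } catch {
        // Not valid JSON (for example braced prose); try the next opening brace.
      }
    }
    from = raw.indexOf("{", from + 1);
  }
  return null;
}
```

With the review's own example, the first balanced span is `{good}`, which is not JSON, so a scanner that stops at the first span would still need a retry loop like `firstJsonObject` to reach the real verdict object.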
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@skills-contrib/drive-judge-harness/judge/calibration/labels.md`:
- Around line 47-48: The labels.md file contains a transient project artifact
reference to `projects/drive-judge-harness/slices/llm-judge/spec.md`; remove
that reference or replace it with a link to a stable architecture or design doc
(for example a canonical README or architecture doc) and update the line
"Calibration harness — built, run deferred." if needed to point to the durable
doc; ensure no other durable docs in `labels.md` reference `projects/` artifacts
and update any link text to reflect the stable source.
In `@skills-contrib/drive-judge-harness/judge/classify-failure.ts`:
- Around line 45-50: FailureResponse currently hardcodes the failure-mode
string-union which can drift from the canonical FAILURE_MODE_CODES; change the
FailureResponse definition so its failureModes arktype is generated from
FAILURE_MODE_CODES (e.g., build the union string or enum values from
FAILURE_MODE_CODES keys/values and pass that into the existing
type(...).array()) instead of the literal string list—update the FailureResponse
declaration and any helper used to construct the union so the validator always
reflects FAILURE_MODE_CODES.
⛔ Files ignored due to path filters (5)
- `projects/drive-judge-harness/design-notes.md` is excluded by `!projects/**`
- `projects/drive-judge-harness/slices/llm-judge/plan.md` is excluded by `!projects/**`
- `projects/drive-judge-harness/slices/llm-judge/spec.md` is excluded by `!projects/**`
- `projects/drive-judge-harness/slices/llm-judge/trace.jsonl` is excluded by `!projects/**`
- `projects/drive-judge-harness/spec.md` is excluded by `!projects/**`
📒 Files selected for processing (17)
- `package.json`
- `skills-contrib/drive-judge-harness/SKILL.md`
- `skills-contrib/drive-judge-harness/judge/calibration.ts`
- `skills-contrib/drive-judge-harness/judge/calibration/labels.md`
- `skills-contrib/drive-judge-harness/judge/classify-failure.ts`
- `skills-contrib/drive-judge-harness/judge/classify-operator.ts`
- `skills-contrib/drive-judge-harness/judge/emit-correctness.ts`
- `skills-contrib/drive-judge-harness/judge/judge-model-sdk.ts`
- `skills-contrib/drive-judge-harness/judge/judge-model.ts`
- `skills-contrib/drive-judge-harness/judge/parse-json.ts`
- `skills-contrib/drive-judge-harness/judge/rubric-correctness.ts`
- `skills-contrib/drive-judge-harness/test/calibration.test.ts`
- `skills-contrib/drive-judge-harness/test/classify-failure.test.ts`
- `skills-contrib/drive-judge-harness/test/classify-operator.test.ts`
- `skills-contrib/drive-judge-harness/test/emit-correctness.test.ts`
- `skills-contrib/drive-judge-harness/test/judge-model-sdk.test.ts`
- `skills-contrib/drive-judge-harness/test/rubric-correctness.test.ts`
- derive the failure-mode arktype validator from FAILURE_MODE_CODES via
  type.enumerated(...) so the taxonomy has a single source of truth
- drop the transient projects/ spec link from the durable labels.md doc; point
  at the co-located SKILL.md instead
Signed-off-by: Will Madden <madden@prisma.io>
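The single-source-of-truth idea in this commit can be illustrated without arktype. The sketch below is a dependency-free analogue, not the PR's validator: the commit itself uses arktype's `type.enumerated(...)`, the three code values are invented placeholders (the real `FAILURE_MODE_CODES` contents are not shown here), and `isFailureModeCode` / `parseFailureModes` are hypothetical helper names.

```typescript
// Placeholder codes for illustration only; the real taxonomy lives in the PR.
const FAILURE_MODE_CODES = ["scope-drift", "missed-requirement", "broken-build"] as const;

// The union type is derived from the array, so adding a code updates both at once.
type FailureModeCode = (typeof FAILURE_MODE_CODES)[number];

// Runtime check generated from the same array, never from a hand-written union.
function isFailureModeCode(value: unknown): value is FailureModeCode {
  return FAILURE_MODE_CODES.some((code) => code === value);
}

// Accept a list only if every entry is a known code; otherwise report null.
function parseFailureModes(value: unknown): FailureModeCode[] | null {
  if (!Array.isArray(value)) return null;
  return value.every(isFailureModeCode) ? value : null;
}
```

Because both the static type and the runtime check read from one array, the drift the review flagged (a literal union falling out of step with the canonical list) has nowhere to occur.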
Linked issue
Refs TML-2736. Third slice of the Drive — Judge + live-experiment harness project; builds on the two-tier scorecard (#640) and the golden-case harness (#641).
At a glance
The judge grades one Drive run and emits the `intent` correctness signal the scorecard already reads. Two invariants do the load-bearing work. First, a malformed model response is never a silent pass. Second, the emission preserves any gate-recorded `mechanical`/`qa` rather than clobbering it, because the scorecard is last-write-wins on the whole triple.
Before this slice, the scorecard's `intent` component was always `null`, so every run was `not-computable`. This slice makes it producible.
Summary
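A minimal sketch of the two invariants named under "At a glance". Everything here is assumed for illustration: the `Correctness` shape, the helper names `parseIntent` and `mergeIntent`, and the `fail` verdict value (this description only shows `pass` and `null`). It is not the code in `rubric-correctness.ts` or `emit-correctness.ts`.

```typescript
type Verdict = "pass" | "fail"; // "fail" is an assumption; only "pass" appears in the PR text
type Signal = Verdict | null; // null means "not recorded" or "not computable"

interface Correctness {
  mechanical: Signal;
  qa: Signal;
  intent: Signal;
}

// Invariant 1 (fail-to-null): anything other than a well-formed verdict becomes null,
// so a malformed model response can never be read as a pass.
function parseIntent(raw: string): Signal {
  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed === "object" && parsed !== null) {
      const intent = (parsed as { intent?: unknown }).intent;
      if (intent === "pass" || intent === "fail") return intent;
    }
  } catch {
    // Malformed JSON falls through to null.
  }
  return null;
}

// Invariant 2 (merge-preserving): carry the latest recorded mechanical/qa forward,
// because a last-write-wins reader would otherwise see them reset to null.
function mergeIntent(latest: Correctness | undefined, intent: Signal): Correctness {
  return {
    mechanical: latest?.mechanical ?? null,
    qa: latest?.qa ?? null,
    intent,
  };
}
```

Under these assumptions, merging `intent: "pass"` into a prior `{ mechanical: "pass", qa: null, intent: null }` keeps `mechanical: "pass"`, which is the property the end-to-end test in this PR is described as asserting.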
This PR ships a bespoke-minimal LLM judge under `skills-contrib/drive-judge-harness/judge/`. It carries two substantive pieces:
- … `acceptance.md`, through a cross-family judge model, and emits the `intent` correctness component. Three prompt sets: a requirements+intent rubric, a failure-mode classifier, and an operator-turn classifier.
- … `spec.md` and `design-notes.md`; promptfoo is the recorded escape hatch.
The judge model is injected everywhere, so the whole subtree typechecks, tests, and lints with no `CURSOR_API_KEY` and `@cursor/sdk` absent; tests pass a mock. The live adapter is reached only behind the same `--live` + key gate as the harness.
How it fits together
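The injected, cross-family judge model described in the Summary could look roughly like this. Only the `JudgeModel` name and its `grade(prompt) => Promise<string>` shape come from this PR; `mockJudge`, `familyOf`, `assertCrossFamily`, the prefix-based family lookup, and the `claude-example` id are illustrative assumptions rather than the code in `judge-model.ts` or `judge-model-sdk.ts`.

```typescript
// From the PR text: a one-method boundary.
interface JudgeModel {
  grade(prompt: string): Promise<string>;
}

// Hypothetical test double: returns a canned response and records the prompts it saw,
// so tests never need an API key or the SDK package.
function mockJudge(response: string): JudgeModel & { prompts: string[] } {
  const prompts: string[] = [];
  return {
    prompts,
    grade: async (prompt: string) => {
      prompts.push(prompt);
      return response;
    },
  };
}

// Assumed family lookup by id prefix; the real rule is not shown in this PR text.
function familyOf(modelId: string): string {
  if (/^claude/i.test(modelId)) return "anthropic";
  if (/^gpt/i.test(modelId)) return "openai";
  return "unknown";
}

// Synchronous guard: throws before any network or SDK code could run.
// Two unrecognised ids are treated as the same family, which errs on the strict side.
function assertCrossFamily(judgeId: string, orchestratorId: string): void {
  if (familyOf(judgeId) === familyOf(orchestratorId)) {
    throw new Error(`judge "${judgeId}" shares a model family with "${orchestratorId}"`);
  }
}
```

A grader written against `JudgeModel` can then be exercised with `mockJudge('{"intent":"pass"}')` in tests, while a live adapter would call `assertCrossFamily("gpt-5.5", "claude-example")` in its constructor before importing anything heavier.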
- `judge/judge-model.ts`: a one-method `JudgeModel` interface (`grade(prompt) => Promise<string>`). Everything downstream takes it as a dependency; tests inject a mock and never make a real call.
- `judge/judge-model-sdk.ts`: pins a cross-family judge id (default `gpt-5.5`) and rejects a same-family judge id at construction (a Claude judge grading a Claude orchestrator throws before any SDK code runs). The `@cursor/sdk` import is lazy, so module load stays green without the package.
- `rubric-correctness.ts`, `classify-failure.ts`, `classify-operator.ts`: each renders a prompt, calls the model, and parses an arktype-validated verdict. The operator-turn classifier uses the measurement model's five canonical buckets (`docs/drive/measurement-model.md`): legitimate-design, legitimate-authorisation, illegitimate-asked, illegitimate-correction, illegitimate-rescue.
- `judge/emit-correctness.ts`: folds the rubric's `intent` into the run's latest recorded `{mechanical, qa}` and emits one `correctness-recorded` event through the deterministic emitter. The slice-1 scorecard composes it; no scorecard or schema edits.
- `judge/calibration.ts` + `judge/calibration/labels.md`: a judge-vs-human agreement tally with a ≥0.80 gate. The machinery lands; the calibration run is parked (see Reviewer notes).
Reviewer notes
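For the calibration item in "How it fits together", the agreement gate might be tallied along these lines. This is a sketch under assumptions, not `calibration.ts`: the `LabelledCase` and `Agreement` shapes and the `tallyAgreement` name are hypothetical, and so are two policy choices, namely that a `null` judge verdict counts as a disagreement and that an empty label set fails the gate. Only the 0.80 threshold comes from the PR.

```typescript
type Label = "pass" | "fail";

interface LabelledCase {
  caseId: string;
  human: Label;
  judge: Label | null; // null: the judge produced no usable verdict
}

interface Agreement {
  total: number;
  agreed: number;
  rate: number;
  passesGate: boolean;
}

const AGREEMENT_GATE = 0.8; // from the PR: agreement must be >= 0.80

function tallyAgreement(cases: readonly LabelledCase[]): Agreement {
  // A null verdict never equals a human label, so it is counted as disagreement.
  const agreed = cases.filter((c) => c.judge === c.human).length;
  const total = cases.length;
  const rate = total === 0 ? 0 : agreed / total;
  // Assumption: with no labelled cases there is nothing to trust, so the gate fails.
  return { total, agreed, rate, passesGate: total > 0 && rate >= AGREEMENT_GATE };
}
```

Counting `null` as a disagreement keeps the gate consistent with the fail-to-null discipline: a judge that often returns unusable output cannot pass calibration by being skipped.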
- `SKILL.md` and `calibration/labels.md` both record the deferral and the operator-spend gate.
- `computeScorecard` is last-write-wins on the whole `{mechanical, qa, intent}` triple; it does not merge components. A naive judge emitting `{mechanical:null, qa:null, intent:pass}` would erase a gate's recorded pass. `emit-correctness.ts` reads-merges-emits so that can't happen; the end-to-end test asserts a prior `mechanical:pass` survives.
- `judge/parse-json.ts` lifts a JSON object out of a model response (bare / fenced / embedded), factored out so the malformed → null path lives in one place rather than three copies.
Testing performed
- `node --test` over the six new judge suites: 43 cases, all green, run with `CURSOR_API_KEY` unset.
- `pnpm typecheck`: clean.
- `pnpm lint:deps`: no dependency violations.
- `pnpm lint:casts`: `delta=0` (no new bare casts).
- `pnpm test:scripts`: 545 cases green (nothing else regressed).
Skill update
`skills-contrib/drive-judge-harness/SKILL.md` documents the judge, the cross-family requirement, the `correctness-recorded` merge rule, the fail-to-null invariant, and the parked calibration.
Checklist
- … `TML-NNNN:` convention
- … (`lint:casts` delta 0)
Alternatives considered
- … (input → model output); our unit is a whole Drive run scored from trace + diff + golden acceptance set. A framework can host the tiny grading call but not the integration with our trace/scorecard/golden assets; that glue is bespoke either way. promptfoo (TS, MIT, local) is recorded as the escape hatch if the bespoke scorer grows hairy.
- … `intent` component on its own. Rejected: it would clobber the gate-recorded `mechanical`/`qa` under last-write-wins. Hence the merge-preserving helper.
- `other` operator-turn bucket + non-null fallback. Rejected: the measurement model defines exactly five buckets; a malformed response yields `bucket: null` (same fail-to-null discipline as the rubric) rather than an off-doc catch-all.
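The rejected `other` bucket can be contrasted with the chosen behaviour in a few lines. The five bucket names below are the ones listed in this PR; the `parseBucket` name and the `{ bucket }` response shape are assumptions for illustration, not the code in `classify-operator.ts`.

```typescript
// The five canonical buckets named in the PR description.
const OPERATOR_BUCKETS = [
  "legitimate-design",
  "legitimate-authorisation",
  "illegitimate-asked",
  "illegitimate-correction",
  "illegitimate-rescue",
] as const;

type OperatorBucket = (typeof OPERATOR_BUCKETS)[number];

// Hypothetical parser: a response outside the closed set yields null,
// never a sixth "other" value.
function parseBucket(raw: string): { bucket: OperatorBucket | null } {
  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed === "object" && parsed !== null) {
      const candidate = (parsed as { bucket?: unknown }).bucket;
      for (const known of OPERATOR_BUCKETS) {
        if (known === candidate) return { bucket: known };
      }
    }
  } catch {
    // Malformed JSON falls through to the null result.
  }
  return { bucket: null };
}
```

Keeping the set closed means downstream tallies only ever see the five documented values or an explicit gap, so an unparseable turn cannot be silently filed under a catch-all.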