TML-2736: bespoke LLM judge — intent correctness signal + classifiers by wmadden-electric · Pull Request #654 · prisma/prisma-next

wmadden-electric · 2026-05-31T10:29:35Z

Linked issue

Refs TML-2736. Third slice of the Drive — Judge + live-experiment harness project; builds on the two-tier scorecard (#640) and the golden-case harness (#641).

At a glance

The judge grades one Drive run and emits the intent correctness signal the scorecard already reads. Two invariants do the load-bearing work. A malformed model response is never a silent pass:

const validated = RubricResponse(parsed);
if (validated instanceof type.errors) {
  return { intent: null, reasons: [`malformed model output: ${validated.summary}`] };
}

…and the emission preserves any gate-recorded mechanical/qa rather than clobbering it, because the scorecard is last-write-wins on the whole triple:

export function mergedCorrectnessPayload(events, projectRunId, intent) {
  const prior = latestCorrectness(events, projectRunId);
  return { mechanical: prior?.mechanical ?? null, qa: prior?.qa ?? null, intent };
}

Before this slice, the scorecard's intent component was always null → every run was not-computable. This slice makes it producible.

Summary

This PR ships a bespoke-minimal LLM judge under skills-contrib/drive-judge-harness/judge/. It carries two substantive pieces:

The judge itself — grades a completed Drive run (the produced diff + the run's trace) against a golden case's acceptance.md, through a cross-family judge model, and emits the intent correctness component. Three prompt sets: a requirements+intent rubric, a failure-mode classifier, and an operator-turn classifier.
The recorded decision to build it bespoke — a time-boxed spike compared Inspect / Braintrust / promptfoo and confirmed bespoke-minimal. The rationale lands in the project spec.md and design-notes.md; promptfoo is the recorded escape hatch.

The judge model is injected everywhere, so the whole subtree typechecks, tests, and lints with no CURSOR_API_KEY and @cursor/sdk absent — tests pass a mock. The live adapter is reached only behind the same --live + key gate as the harness.
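
The injection pattern described above can be sketched as follows. This is a minimal illustration, assuming only the one-method JudgeModel shape named in this PR; `gradeIntent`, `mockJudge`, and the `"fail"` verdict value are hypothetical and not taken from the PR's code.

```typescript
// The one-method boundary described in the PR: a prompt in, raw model text out.
interface JudgeModel {
  grade(prompt: string): Promise<string>;
}

// Assumption: verdicts are "pass" / "fail"; null means "no usable verdict".
type Intent = "pass" | "fail" | null;

// Hypothetical consumer: it receives the model as a dependency and never
// constructs one, so no API key or SDK is needed to exercise it.
async function gradeIntent(model: JudgeModel, prompt: string): Promise<Intent> {
  const raw = await model.grade(prompt);
  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed === "object" && parsed !== null && "intent" in parsed) {
      const intent = (parsed as { intent: unknown }).intent;
      if (intent === "pass" || intent === "fail") return intent;
    }
  } catch {
    // Unparseable output falls through to null rather than a silent pass.
  }
  return null;
}

// Hypothetical test double: returns a canned response, makes no network call.
function mockJudge(cannedResponse: string): JudgeModel {
  return { grade: async () => cannedResponse };
}
```

A test would call `gradeIntent(mockJudge('{"intent":"pass"}'), "any prompt")` and assert on the resolved value; the live adapter would satisfy the same interface behind the `--live` gate.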

How it fits together

The model boundary (judge/judge-model.ts) — a one-method JudgeModel interface (grade(prompt) => Promise<string>). Everything downstream takes it as a dependency; tests inject a mock and never make a real call.
The live adapter (judge/judge-model-sdk.ts) — pins a cross-family judge id (default gpt-5.5) and rejects a same-family judge id at construction (a Claude judge grading a Claude orchestrator throws before any SDK code runs). The @cursor/sdk import is lazy, so module load stays green without the package.
The three prompt sets (rubric-correctness.ts, classify-failure.ts, classify-operator.ts) — each renders a prompt, calls the model, and parses an arktype-validated verdict. The operator-turn classifier uses the measurement model's five canonical buckets (docs/drive/measurement-model.md): legitimate-design, legitimate-authorisation, illegitimate-asked, illegitimate-correction, illegitimate-rescue.
The merge-preserving emission (judge/emit-correctness.ts) — folds the rubric's intent into the run's latest recorded {mechanical, qa} and emits one correctness-recorded event through the deterministic emitter. The slice-1 scorecard composes it; no scorecard or schema edits.
The calibration harness (judge/calibration.ts + judge/calibration/labels.md) — a judge-vs-human agreement tally with a ≥0.80 gate. The machinery lands; the calibration run is parked (see Reviewer notes).
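
The agreement gate in the last item can be illustrated with a short sketch. The names `CALIBRATION_THRESHOLD` and `agreementRate` appear in this PR, but the record shape, the treatment of a null judge verdict, and the empty-corpus behaviour below are assumptions made for illustration.

```typescript
// Threshold named in the PR: judge and human must agree on at least 80% of runs.
const CALIBRATION_THRESHOLD = 0.8;

type Verdict = "pass" | "fail";

// Assumed record shape: one human label and one judge verdict per run.
interface LabelledRun {
  runId: string;
  judge: Verdict | null; // null: the judge produced no usable verdict
  human: Verdict;
}

interface Agreement {
  total: number;
  agreed: number;
  rate: number;
  passes: boolean;
}

// Exact-match agreement. A null judge verdict never matches a human label,
// so it counts as a disagreement (an assumption of this sketch).
function agreementRate(runs: readonly LabelledRun[]): Agreement {
  const total = runs.length;
  const agreed = runs.filter((r) => r.judge === r.human).length;
  const rate = total === 0 ? 0 : agreed / total;
  // An empty corpus cannot pass the gate: there is nothing to calibrate on.
  return { total, agreed, rate, passes: total > 0 && rate >= CALIBRATION_THRESHOLD };
}
```

With this shape, the gate stays not-passing until a labelled corpus exists, which matches the "uncalibrated" status the PR ships with.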

Reviewer notes

The calibration run is deliberately parked, not forgotten. Calibration needs ~10–20 instrumented runs, and corpus generation is real-dollar spend the operator is holding. So this slice ships the gate machinery and an honest "uncalibrated" status; the project-DoD calibration item stays unchecked. SKILL.md and calibration/labels.md both record the deferral and the operator-spend gate.
The merge rule is the subtle part. computeScorecard is last-write-wins on the whole {mechanical, qa, intent} triple — it does not merge components. A naive judge emitting {mechanical:null, qa:null, intent:pass} would erase a gate's recorded pass. emit-correctness.ts reads-merges-emits so that can't happen; the end-to-end test asserts a prior mechanical:pass survives.
One unplanned helper. judge/parse-json.ts lifts a JSON object out of a model response (bare / fenced / embedded). It is factored out so the malformed → null path lives in one place rather than in three copies.
The planning commit rides along. The first commit scaffolds the slice (spec, plan, trace) and records the spike; the second is the implementation. They're one reviewable unit.
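
The clobbering hazard behind the merge rule can be shown in a self-contained sketch. `latestCorrectness` and `mergedCorrectnessPayload` are named in this PR; the event shape and the in-memory log below are assumptions for illustration only.

```typescript
type Verdict = "pass" | "fail" | null;

interface Correctness {
  mechanical: Verdict;
  qa: Verdict;
  intent: Verdict;
}

// Assumed event shape; the real trace schema may differ.
interface CorrectnessEvent {
  type: "correctness-recorded";
  projectRunId: string;
  correctness: Correctness;
}

// Last-write-wins reader: the newest event for the run supplies the whole triple.
function latestCorrectness(
  events: readonly CorrectnessEvent[],
  projectRunId: string,
): Correctness | undefined {
  for (let i = events.length - 1; i >= 0; i--) {
    const e = events[i];
    if (e && e.projectRunId === projectRunId) return e.correctness;
  }
  return undefined;
}

// Read, merge, emit: keep the recorded mechanical/qa and fill in intent only.
function mergedCorrectnessPayload(
  events: readonly CorrectnessEvent[],
  projectRunId: string,
  intent: Verdict,
): Correctness {
  const prior = latestCorrectness(events, projectRunId);
  return { mechanical: prior?.mechanical ?? null, qa: prior?.qa ?? null, intent };
}

const record = (correctness: Correctness): CorrectnessEvent => ({
  type: "correctness-recorded",
  projectRunId: "run-1",
  correctness,
});

// A gate has already recorded mechanical: pass.
const log = [record({ mechanical: "pass", qa: null, intent: null })];

// Naive emission erases the gate's pass under last-write-wins.
const naiveLog = [...log, record({ mechanical: null, qa: null, intent: "pass" })];

// Merge-preserving emission keeps it.
const mergedLog = [...log, record(mergedCorrectnessPayload(log, "run-1", "pass"))];
```

Reading `latestCorrectness(naiveLog, "run-1")` yields a null mechanical component, while the merged log still reports the gate's pass alongside the new intent.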

Testing performed

node --test over the six new judge suites — 43 cases, all green, run with CURSOR_API_KEY unset.
pnpm typecheck — clean.
pnpm lint:deps — no dependency violations.
pnpm lint:casts — delta=0 (no new bare casts).
pnpm test:scripts — 545 cases green (nothing else regressed).

Skill update

skills-contrib/drive-judge-harness/SKILL.md documents the judge, the cross-family requirement, the correctness-recorded merge rule, the fail-to-null invariant, and the parked calibration.

Checklist

DCO sign-off on every commit
Tests written first and passing
Title follows the TML-NNNN: convention
No new bare casts (lint:casts delta 0)

Alternatives considered

Adopt an off-the-shelf eval framework (Inspect / Braintrust / promptfoo). Confirmed-rejected by the spike. They grade (input → model output); our unit is a whole Drive run scored from trace + diff + golden acceptance set. A framework can host the tiny grading call but not the integration with our trace/scorecard/golden assets — that glue is bespoke either way. promptfoo (TS, MIT, local) is recorded as the escape hatch if the bespoke scorer grows hairy.
Emit the intent component on its own. Rejected — it would clobber the gate-recorded mechanical/qa under last-write-wins. Hence the merge-preserving helper.
An "other" operator-turn bucket plus a non-null fallback. Rejected: the measurement model defines exactly five buckets, so a malformed response yields bucket: null (the same fail-to-null discipline as the rubric) rather than an off-doc catch-all.
Run the calibration now. Rejected — corpus generation is held on cost. The judge ships uncalibrated-but-honest; the gate is computable the moment the corpus exists.
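
The five-bucket, fail-to-null choice for the operator-turn classifier can be sketched as below. The bucket names come from this PR; `toOperatorVerdict` and the reason strings are hypothetical, and the PR itself validates with arktype rather than the hand-written guard used here.

```typescript
// The five canonical buckets listed in the PR description.
const OPERATOR_BUCKETS = [
  "legitimate-design",
  "legitimate-authorisation",
  "illegitimate-asked",
  "illegitimate-correction",
  "illegitimate-rescue",
] as const;

type OperatorBucket = (typeof OPERATOR_BUCKETS)[number];

interface OperatorVerdict {
  bucket: OperatorBucket | null;
  reasons: string[];
}

function isOperatorBucket(value: unknown): value is OperatorBucket {
  return typeof value === "string" && (OPERATOR_BUCKETS as readonly string[]).includes(value);
}

// Hypothetical helper: anything outside the five buckets becomes null with a
// diagnostic reason, never a catch-all bucket.
function toOperatorVerdict(parsed: unknown): OperatorVerdict {
  if (typeof parsed !== "object" || parsed === null) {
    return { bucket: null, reasons: ["malformed model output: not an object"] };
  }
  const bucket = (parsed as { bucket?: unknown }).bucket;
  if (isOperatorBucket(bucket)) return { bucket, reasons: [] };
  return { bucket: null, reasons: [`malformed model output: unknown bucket ${String(bucket)}`] };
}
```

A response of `{"bucket":"other"}` therefore surfaces as `bucket: null` with a reason, which keeps the tally restricted to the documented buckets.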

Summary by CodeRabbit

Release Notes

New Features
- Introduced LLM-based judge system for Drive orchestrator evaluation with failure mode classification, operator turn assessment, and correctness rubric grading
- Implemented cross-family model constraint enforcement between judge and orchestrator
- Added calibration framework for judge accuracy validation with agreement-rate metrics
Documentation
- Expanded judge harness documentation with detailed module descriptions and key invariants
- Added calibration corpus specification and workflow guidance
Tests
- Added comprehensive test coverage for judge components, classifiers, and calibration logic

…-judge slice

Record the slice-3 spike outcome (bespoke-minimal confirmed over Inspect / Braintrust / promptfoo; impedance mismatch in the eval unit; promptfoo as the escape hatch) in the project spec and design-notes, and scaffold the llm-judge (TML-2736) slice: spec, dispatch plan, and slice-scoped trace.

Signed-off-by: Will Madden <madden@prisma.io>

Add a bespoke-minimal LLM judge under skills-contrib/drive-judge-harness/judge/ that grades one Drive run against a golden case's acceptance set behind an injected, mockable JudgeModel boundary — so typecheck/test/lint stay green with no CURSOR_API_KEY and @cursor/sdk absent.

- judge-model{,-sdk}.ts: injected boundary + live adapter with a synchronous cross-family guard (rejects a same-family judge id) and a lazy SDK import.
- rubric-correctness.ts: requirements+intent rubric; a malformed model response yields intent:null (never a false pass).
- classify-failure.ts / classify-operator.ts: diagnostic classifiers; the operator-turn buckets follow the measurement model's five canonical buckets.
- emit-correctness.ts: merge-preserving emission — fills intent while preserving any already-recorded mechanical/qa, because the scorecard is last-write-wins on the whole triple.
- calibration.ts + calibration/labels.md: agreement tally with the >=0.80 gate; the calibration run itself is parked (corpus-gated, operator approves spend).

43 tests (node --test) green; typecheck / lint:deps / lint:casts clean. Pins the fail-to-null and merge-preserving invariants end-to-end against the scorecard.

Signed-off-by: Will Madden <madden@prisma.io>

coderabbitai · 2026-05-31T10:29:49Z

Warning

Review limit reached

@wmadden-electric, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 24 minutes and 42 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: f54c6178-368d-4432-b3a0-2e19aa905e63

📥 Commits

Reviewing files that changed from the base of the PR and between bb9f4fb and a724fc0.

📒 Files selected for processing (2)

skills-contrib/drive-judge-harness/judge/calibration/labels.md
skills-contrib/drive-judge-harness/judge/classify-failure.ts

📝 Walkthrough

Walkthrough

This PR introduces a complete LLM judge harness for Drive orchestrator evaluation. It establishes a JudgeModel interface backed by an SDK adapter, implements classifiers for failure modes and operator turns, grades correctness rubrics, and emits merged verdicts to trace files while preserving prior mechanical and QA scores. Comprehensive tests and calibration infrastructure are included.

Changes

Drive LLM Judge Grading & Correctness Pipeline

Layer / File(s)	Summary
Judge model interface and SDK adapter with cross-family guard `skills-contrib/drive-judge-harness/judge/judge-model.ts`, `judge/judge-model-sdk.ts`, `test/judge-model-sdk.test.ts`	`JudgeModel` defines the grading contract. `createSdkJudgeModel` infers model families, enforces cross-family pairing synchronously, and returns an adapter that lazily imports `@cursor/sdk`, streams messages, and accumulates extracted assistant text.
JSON extraction and parsing utility for model responses `skills-contrib/drive-judge-harness/judge/parse-json.ts`	`parseJsonFromModel` safely extracts and parses JSON objects from model output, supporting direct JSON, fenced code blocks, and braced substrings, returning `undefined` on failure.
Failure mode classification with taxonomy validation `skills-contrib/drive-judge-harness/judge/classify-failure.ts`, `test/classify-failure.test.ts`	Fixed failure-mode taxonomy (F1–F15, scope-trap, qa-coverage-gap) is graded via prompts built from acceptance markdown and diffs. JSON responses are validated against arktype schema; malformed output yields empty `failureModes` with diagnostic reasons.
Operator turn classification into five buckets `skills-contrib/drive-judge-harness/judge/classify-operator.ts`, `test/classify-operator.test.ts`	Operator actions are classified into five allowed buckets. Prompts embed operator turn text and trace context; JSON validation returns `bucket: null` with reasons when parsing or validation fails.
Correctness rubric grading with intent verdict `skills-contrib/drive-judge-harness/judge/rubric-correctness.ts`, `test/rubric-correctness.test.ts`	Grades whether produced diffs satisfy acceptance criteria and design-quality clauses. Parses `intent` and `reasons` from model JSON; returns `intent: null` with diagnostic reasons on parse/validation failure.
Correctness event emission with mechanical/QA preservation `skills-contrib/drive-judge-harness/judge/emit-correctness.ts`, `test/emit-correctness.test.ts`	Merges the judge's new `intent` verdict with prior `mechanical` and `qa` scores from trace events, preserving older values and writing deterministic `correctness-recorded` JSONL lines to trace files.
Calibration threshold, verdict types, and agreement measurement `skills-contrib/drive-judge-harness/judge/calibration.ts`, `judge/calibration/labels.md`, `test/calibration.test.ts`	`CALIBRATION_THRESHOLD` (0.8) and verdict types support judge-vs-human agreement tracking. `agreementRate()` computes exact-match agreement and derives a `passes` boolean from the threshold.
Judge harness overview documentation and test script updates `skills-contrib/drive-judge-harness/SKILL.md`, `package.json`	`SKILL.md` describes the judge module layout, invariants (safe-fail parsing, merge rules, cross-family requirement), and parked calibration workflow. Root `test:scripts` is updated to include new test files.
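
The synchronous cross-family guard and lazy load described in the first table row can be sketched as follows. Everything here is illustrative: `inferFamily`, its prefix rules, `createGuardedJudge`, the `load` callback, and the example model ids are hypothetical, and letting unrecognised families through is an assumption rather than the PR's documented behaviour. Only the default judge id (gpt-5.5) and the Claude-on-Claude rejection come from the PR text.

```typescript
interface JudgeModel {
  grade(prompt: string): Promise<string>;
}

type ModelFamily = "anthropic" | "openai" | "unknown";

// Hypothetical prefix-based inference; the real adapter may use other rules.
function inferFamily(modelId: string): ModelFamily {
  const id = modelId.toLowerCase();
  if (id.startsWith("claude")) return "anthropic";
  if (id.startsWith("gpt")) return "openai";
  return "unknown";
}

interface GuardedJudgeOptions {
  judgeId?: string;
  orchestratorId: string;
  // Stands in for the lazy SDK import: only invoked on the first grade() call.
  load: () => Promise<JudgeModel>;
}

function createGuardedJudge(opts: GuardedJudgeOptions): JudgeModel {
  const judgeId = opts.judgeId ?? "gpt-5.5";
  const judgeFamily = inferFamily(judgeId);
  // The guard runs synchronously, before any loading happens.
  if (judgeFamily !== "unknown" && judgeFamily === inferFamily(opts.orchestratorId)) {
    throw new Error(`judge "${judgeId}" shares a model family with orchestrator "${opts.orchestratorId}"`);
  }
  let loaded: Promise<JudgeModel> | undefined;
  return {
    grade(prompt: string): Promise<string> {
      if (loaded === undefined) loaded = opts.load();
      return loaded.then((model) => model.grade(prompt));
    },
  };
}
```

Because the check throws at construction, a misconfigured pairing fails before any SDK code is reached, and module load stays independent of the optional package.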

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

prisma/prisma-next#633: Introduces the emitEvent trace-file emission mechanism in drive-record-traces/emit.ts, which is consumed by the main PR's judge/emit-correctness.ts to append correctness verdicts to trace JSONL files.
prisma/prisma-next#582: Updates the same package.json test:scripts command to expand Node test coverage in CI, overlapping with the main PR's script enhancement.
prisma/prisma-next#641: Extends the same skills-contrib/drive-judge-harness directory with live @cursor/sdk integration, sharing the underlying SDK adapter pattern.

Poem

🐰 With whiskers twitching and tail held high,
A judge is born to grade and certify!
It grids the rubric, parses what models say,
Preserves old scores while guiding the way. ✨
Correctness flows, no family conflict here—
The harness hops forward, crystal clear!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 30.43% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly and clearly references the main change: implementation of a bespoke LLM judge that emits an intent correctness signal and includes three classifier modules (rubric, failure, operator).
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


github-actions · 2026-05-31T10:31:49Z

size-limit report 📦

Path	Size
postgres / no-emit	135.37 KB (0%)
postgres / emit	125.16 KB (0%)
mongo / no-emit	73.9 KB (0%)
mongo / emit	68.89 KB (0%)

pkg-pr-new · 2026-05-31T10:31:55Z


@prisma-next/extension-author-tools

npm i https://pkg.pr.new/@prisma-next/extension-author-tools@654

@prisma-next/mongo-runtime

npm i https://pkg.pr.new/@prisma-next/mongo-runtime@654

@prisma-next/family-mongo

npm i https://pkg.pr.new/@prisma-next/family-mongo@654

@prisma-next/sql-runtime

npm i https://pkg.pr.new/@prisma-next/sql-runtime@654

@prisma-next/family-sql

npm i https://pkg.pr.new/@prisma-next/family-sql@654

@prisma-next/extension-arktype-json

npm i https://pkg.pr.new/@prisma-next/extension-arktype-json@654

@prisma-next/middleware-cache

npm i https://pkg.pr.new/@prisma-next/middleware-cache@654

@prisma-next/mongo

npm i https://pkg.pr.new/@prisma-next/mongo@654

@prisma-next/extension-paradedb

npm i https://pkg.pr.new/@prisma-next/extension-paradedb@654

@prisma-next/extension-pgvector

npm i https://pkg.pr.new/@prisma-next/extension-pgvector@654

@prisma-next/extension-postgis

npm i https://pkg.pr.new/@prisma-next/extension-postgis@654

@prisma-next/postgres

npm i https://pkg.pr.new/@prisma-next/postgres@654

@prisma-next/sql-orm-client

npm i https://pkg.pr.new/@prisma-next/sql-orm-client@654

@prisma-next/sqlite

npm i https://pkg.pr.new/@prisma-next/sqlite@654

@prisma-next/target-mongo

npm i https://pkg.pr.new/@prisma-next/target-mongo@654

@prisma-next/adapter-mongo

npm i https://pkg.pr.new/@prisma-next/adapter-mongo@654

@prisma-next/driver-mongo

npm i https://pkg.pr.new/@prisma-next/driver-mongo@654

@prisma-next/contract

npm i https://pkg.pr.new/@prisma-next/contract@654

@prisma-next/utils

npm i https://pkg.pr.new/@prisma-next/utils@654

@prisma-next/config

npm i https://pkg.pr.new/@prisma-next/config@654

@prisma-next/errors

npm i https://pkg.pr.new/@prisma-next/errors@654

@prisma-next/framework-components

npm i https://pkg.pr.new/@prisma-next/framework-components@654

@prisma-next/operations

npm i https://pkg.pr.new/@prisma-next/operations@654

@prisma-next/ts-render

npm i https://pkg.pr.new/@prisma-next/ts-render@654

@prisma-next/contract-authoring

npm i https://pkg.pr.new/@prisma-next/contract-authoring@654

@prisma-next/ids

npm i https://pkg.pr.new/@prisma-next/ids@654

@prisma-next/psl-parser

npm i https://pkg.pr.new/@prisma-next/psl-parser@654

@prisma-next/psl-printer

npm i https://pkg.pr.new/@prisma-next/psl-printer@654

@prisma-next/cli

npm i https://pkg.pr.new/@prisma-next/cli@654

@prisma-next/cli-telemetry

npm i https://pkg.pr.new/@prisma-next/cli-telemetry@654

@prisma-next/emitter

npm i https://pkg.pr.new/@prisma-next/emitter@654

@prisma-next/migration-tools

npm i https://pkg.pr.new/@prisma-next/migration-tools@654

prisma-next

npm i https://pkg.pr.new/prisma-next@654

@prisma-next/vite-plugin-contract-emit

npm i https://pkg.pr.new/@prisma-next/vite-plugin-contract-emit@654

@prisma-next/mongo-codec

npm i https://pkg.pr.new/@prisma-next/mongo-codec@654

@prisma-next/mongo-contract

npm i https://pkg.pr.new/@prisma-next/mongo-contract@654

@prisma-next/mongo-value

npm i https://pkg.pr.new/@prisma-next/mongo-value@654

@prisma-next/mongo-contract-psl

npm i https://pkg.pr.new/@prisma-next/mongo-contract-psl@654

@prisma-next/mongo-contract-ts

npm i https://pkg.pr.new/@prisma-next/mongo-contract-ts@654

@prisma-next/mongo-emitter

npm i https://pkg.pr.new/@prisma-next/mongo-emitter@654

@prisma-next/mongo-schema-ir

npm i https://pkg.pr.new/@prisma-next/mongo-schema-ir@654

@prisma-next/mongo-query-ast

npm i https://pkg.pr.new/@prisma-next/mongo-query-ast@654

@prisma-next/mongo-orm

npm i https://pkg.pr.new/@prisma-next/mongo-orm@654

@prisma-next/mongo-query-builder

npm i https://pkg.pr.new/@prisma-next/mongo-query-builder@654

@prisma-next/mongo-lowering

npm i https://pkg.pr.new/@prisma-next/mongo-lowering@654

@prisma-next/mongo-wire

npm i https://pkg.pr.new/@prisma-next/mongo-wire@654

@prisma-next/sql-contract

npm i https://pkg.pr.new/@prisma-next/sql-contract@654

@prisma-next/sql-errors

npm i https://pkg.pr.new/@prisma-next/sql-errors@654

@prisma-next/sql-operations

npm i https://pkg.pr.new/@prisma-next/sql-operations@654

@prisma-next/sql-schema-ir

npm i https://pkg.pr.new/@prisma-next/sql-schema-ir@654

@prisma-next/sql-contract-psl

npm i https://pkg.pr.new/@prisma-next/sql-contract-psl@654

@prisma-next/sql-contract-ts

npm i https://pkg.pr.new/@prisma-next/sql-contract-ts@654

@prisma-next/sql-contract-emitter

npm i https://pkg.pr.new/@prisma-next/sql-contract-emitter@654

@prisma-next/sql-lane-query-builder

npm i https://pkg.pr.new/@prisma-next/sql-lane-query-builder@654

@prisma-next/sql-relational-core

npm i https://pkg.pr.new/@prisma-next/sql-relational-core@654

@prisma-next/sql-builder

npm i https://pkg.pr.new/@prisma-next/sql-builder@654

@prisma-next/target-postgres

npm i https://pkg.pr.new/@prisma-next/target-postgres@654

@prisma-next/target-sqlite

npm i https://pkg.pr.new/@prisma-next/target-sqlite@654

@prisma-next/adapter-postgres

npm i https://pkg.pr.new/@prisma-next/adapter-postgres@654

@prisma-next/adapter-sqlite

npm i https://pkg.pr.new/@prisma-next/adapter-sqlite@654

@prisma-next/driver-postgres

npm i https://pkg.pr.new/@prisma-next/driver-postgres@654

@prisma-next/driver-sqlite

npm i https://pkg.pr.new/@prisma-next/driver-sqlite@654

commit: a724fc0

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

skills-contrib/drive-judge-harness/judge/parse-json.ts (1)
27-32: 💤 Low value

Braced span extraction may incorrectly match unrelated braces.

extractBracedSpan uses indexOf('{') and lastIndexOf('}') to find the first opening and last closing brace. If the model's response contains multiple independent JSON objects or braced prose (e.g., "The result is {good}. Also here is the JSON: {\"intent\":\"pass\",...}"), this will extract everything from the first { to the last }, producing invalid JSON.

This is mitigated by tryParseObject rejecting invalid JSON, but the function could be more precise by scanning for balanced braces or documenting this limitation.
Possible refinement

Add a comment documenting the limitation:
 function extractBracedSpan(raw: string): string | undefined {
+  // Extracts from first '{' to last '}'; malformed spans are rejected by tryParseObject.
   const start = raw.indexOf('{');
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@skills-contrib/drive-judge-harness/judge/parse-json.ts` around lines 27 - 32,
extractBracedSpan can grab unrelated braces because it uses indexOf/lastIndexOf;
update extractBracedSpan to locate the first opening brace and then scan forward
counting nested braces (increment on '{', decrement on '}') until the count
returns to zero, returning that balanced slice; if no balanced span is found
return undefined — reference the extractBracedSpan function and ensure the
scanner handles nested objects and stops at the first balanced closing brace to
avoid spanning multiple JSON objects or prose with braces.
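
A balanced-brace scan of the kind suggested above might look like the sketch below. It goes one step further than the suggestion by trying each opening brace in turn, since the first balanced span in the example (`{good}`) is not valid JSON. `balancedSpanFrom` and `firstJsonObject` are hypothetical names, not functions from parse-json.ts.

```typescript
// Returns the balanced {...} slice starting at `start`, or undefined if the
// braces never balance. Braces inside double-quoted strings are ignored.
function balancedSpanFrom(raw: string, start: number): string | undefined {
  let depth = 0;
  let inString = false;
  let escaped = false;
  for (let i = start; i < raw.length; i++) {
    const ch = raw[i];
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === "{") depth++;
    else if (ch === "}") {
      depth--;
      if (depth === 0) return raw.slice(start, i + 1);
    }
  }
  return undefined;
}

// Tries each '{' as a candidate start and returns the first span that parses
// as a JSON object; returns undefined when none does.
function firstJsonObject(raw: string): Record<string, unknown> | undefined {
  for (let start = raw.indexOf("{"); start !== -1; start = raw.indexOf("{", start + 1)) {
    const span = balancedSpanFrom(raw, start);
    if (span === undefined) continue;
    try {
      const parsed: unknown = JSON.parse(span);
      if (typeof parsed === "object" && parsed !== null && !Array.isArray(parsed)) {
        return parsed as Record<string, unknown>;
      }
    } catch {
      // Balanced but not JSON (for example "{good}"); try the next candidate.
    }
  }
  return undefined;
}
```

The trade-off is a rescan per candidate brace, which is negligible for model responses of ordinary length; the simpler first-to-last extraction remains safe as long as the parse step rejects invalid spans.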

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@skills-contrib/drive-judge-harness/judge/calibration/labels.md`:
- Around line 47-48: The labels.md file contains a transient project artifact
reference to `projects/drive-judge-harness/slices/llm-judge/spec.md`; remove
that reference or replace it with a link to a stable architecture or design doc
(for example a canonical README or architecture doc) and update the line
"Calibration harness — built, run deferred." if needed to point to the durable
doc; ensure no other durable docs in `labels.md` reference `projects/` artifacts
and update any link text to reflect the stable source.

In `@skills-contrib/drive-judge-harness/judge/classify-failure.ts`:
- Around line 45-50: FailureResponse currently hardcodes the failure-mode
string-union which can drift from the canonical FAILURE_MODE_CODES; change the
FailureResponse definition so its failureModes arktype is generated from
FAILURE_MODE_CODES (e.g., build the union string or enum values from
FAILURE_MODE_CODES keys/values and pass that into the existing
type(...).array()) instead of the literal string list—update the FailureResponse
declaration and any helper used to construct the union so the validator always
reflects FAILURE_MODE_CODES.
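
The single-source-of-truth idea in this comment can be shown without any validation library. This is a dependency-free sketch: the follow-up commit on this PR derives the validator with arktype's `type.enumerated(...)`, whereas the guard below is hand-written, and the array form and the truncated list of codes are assumptions.

```typescript
// Assumed to be an array; only a few of the taxonomy's codes are listed here.
const FAILURE_MODE_CODES = ["F1", "F2", "F3", "scope-trap", "qa-coverage-gap"] as const;

// The union type is derived from the constant, so it cannot drift from it.
type FailureModeCode = (typeof FAILURE_MODE_CODES)[number];

const FAILURE_MODE_SET: ReadonlySet<string> = new Set<string>(FAILURE_MODE_CODES);

function isFailureModeCode(value: unknown): value is FailureModeCode {
  return typeof value === "string" && FAILURE_MODE_SET.has(value);
}

// Returns the validated codes, or undefined if any entry is off-taxonomy.
function validateFailureModes(value: unknown): FailureModeCode[] | undefined {
  if (!Array.isArray(value)) return undefined;
  const codes: FailureModeCode[] = [];
  for (const item of value) {
    if (!isFailureModeCode(item)) return undefined;
    codes.push(item);
  }
  return codes;
}
```

Adding a code to `FAILURE_MODE_CODES` updates both the type and the runtime check in one edit, which is the property the review comment asks for.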



ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 42abe22c-cf4e-49c9-bf42-428fa52c14d5

📥 Commits

Reviewing files that changed from the base of the PR and between 4dd15e6 and bb9f4fb.

⛔ Files ignored due to path filters (5)

projects/drive-judge-harness/design-notes.md is excluded by !projects/**
projects/drive-judge-harness/slices/llm-judge/plan.md is excluded by !projects/**
projects/drive-judge-harness/slices/llm-judge/spec.md is excluded by !projects/**
projects/drive-judge-harness/slices/llm-judge/trace.jsonl is excluded by !projects/**
projects/drive-judge-harness/spec.md is excluded by !projects/**

📒 Files selected for processing (17)

package.json
skills-contrib/drive-judge-harness/SKILL.md
skills-contrib/drive-judge-harness/judge/calibration.ts
skills-contrib/drive-judge-harness/judge/calibration/labels.md
skills-contrib/drive-judge-harness/judge/classify-failure.ts
skills-contrib/drive-judge-harness/judge/classify-operator.ts
skills-contrib/drive-judge-harness/judge/emit-correctness.ts
skills-contrib/drive-judge-harness/judge/judge-model-sdk.ts
skills-contrib/drive-judge-harness/judge/judge-model.ts
skills-contrib/drive-judge-harness/judge/parse-json.ts
skills-contrib/drive-judge-harness/judge/rubric-correctness.ts
skills-contrib/drive-judge-harness/test/calibration.test.ts
skills-contrib/drive-judge-harness/test/classify-failure.test.ts
skills-contrib/drive-judge-harness/test/classify-operator.test.ts
skills-contrib/drive-judge-harness/test/emit-correctness.test.ts
skills-contrib/drive-judge-harness/test/judge-model-sdk.test.ts
skills-contrib/drive-judge-harness/test/rubric-correctness.test.ts

- derive the failure-mode arktype validator from FAILURE_MODE_CODES via type.enumerated(...) so the taxonomy has a single source of truth
- drop the transient projects/ spec link from the durable labels.md doc; point at the co-located SKILL.md instead

Signed-off-by: Will Madden <madden@prisma.io>

wmadden added 2 commits May 31, 2026 11:59

wmadden-electric requested a review from a team as a code owner May 31, 2026 10:29

coderabbitai Bot reviewed May 31, 2026


Comment thread skills-contrib/drive-judge-harness/judge/calibration/labels.md Outdated

Comment thread skills-contrib/drive-judge-harness/judge/classify-failure.ts

wmadden approved these changes May 31, 2026


wmadden merged commit 70bd5ed into main May 31, 2026
21 checks passed

wmadden deleted the tml-2736-llm-judge branch May 31, 2026 10:56


TML-2736: bespoke LLM judge — intent correctness signal + classifiers #654
wmadden merged 3 commits into main from tml-2736-llm-judge
