TML-2736: bespoke LLM judge — intent correctness signal + classifiers #654

Merged
wmadden merged 3 commits into main from tml-2736-llm-judge on May 31, 2026
Conversation

@wmadden-electric wmadden-electric commented May 31, 2026

Linked issue

Refs TML-2736. Third slice of the Drive — Judge + live-experiment harness project; builds on the two-tier scorecard (#640) and the golden-case harness (#641).

At a glance

The judge grades one Drive run and emits the intent correctness signal the scorecard already reads. Two invariants do the load-bearing work. A malformed model response is never a silent pass:

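// RubricResponse is the arktype schema; invalid input comes back as a type.errors instance rather than throwing.
const validated = RubricResponse(parsed);
if (validated instanceof type.errors) {
  // Fail to null: a malformed response is reported as "no verdict", never as a pass.
  return { intent: null, reasons: [`malformed model output: ${validated.summary}`] };
}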

…and the emission preserves any gate-recorded mechanical/qa rather than clobbering it, because the scorecard is last-write-wins on the whole triple:

export function mergedCorrectnessPayload(events, projectRunId, intent) {
  const prior = latestCorrectness(events, projectRunId);
  return { mechanical: prior?.mechanical ?? null, qa: prior?.qa ?? null, intent };
}
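
To make the clobbering risk concrete, here is a minimal self-contained sketch of the merge. Only `mergedCorrectnessPayload` mirrors the snippet above; the `Verdict` values, the `CorrectnessEvent` shape, and this version of `latestCorrectness` are assumptions for illustration and may not match the real trace schema.

```typescript
// Assumed shapes for illustration; the real trace-event schema may differ.
type Verdict = 'pass' | 'fail' | null;

interface CorrectnessTriple {
  mechanical: Verdict;
  qa: Verdict;
  intent: Verdict;
}

interface CorrectnessEvent {
  kind: 'correctness-recorded';
  projectRunId: string;
  payload: CorrectnessTriple;
}

// Hypothetical stand-in: the last matching event wins, as a whole triple.
function latestCorrectness(
  events: readonly CorrectnessEvent[],
  projectRunId: string,
): CorrectnessTriple | undefined {
  let latest: CorrectnessTriple | undefined;
  for (const ev of events) {
    if (ev.projectRunId === projectRunId) latest = ev.payload;
  }
  return latest;
}

// Same logic as the snippet above, with explicit (assumed) types.
function mergedCorrectnessPayload(
  events: readonly CorrectnessEvent[],
  projectRunId: string,
  intent: Verdict,
): CorrectnessTriple {
  const prior = latestCorrectness(events, projectRunId);
  return { mechanical: prior?.mechanical ?? null, qa: prior?.qa ?? null, intent };
}

// A gate has already recorded mechanical and qa for run-1.
const gateEvents: CorrectnessEvent[] = [
  {
    kind: 'correctness-recorded',
    projectRunId: 'run-1',
    payload: { mechanical: 'pass', qa: 'pass', intent: null },
  },
];

// A naive judge payload would erase the gate's verdicts under last-write-wins.
const naivePayload: CorrectnessTriple = { mechanical: null, qa: null, intent: 'pass' };

// The merged payload keeps them and only fills in intent.
const mergedPayload = mergedCorrectnessPayload(gateEvents, 'run-1', 'pass');
```

For a run with no prior event, the same helper returns `mechanical: null` and `qa: null`, so the judge never invents gate verdicts.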

Before this slice, the scorecard's intent component was always null → every run was not-computable. This slice makes it producible.

Summary

This PR ships a bespoke-minimal LLM judge under skills-contrib/drive-judge-harness/judge/. It carries two substantive pieces:

  1. The judge itself — grades a completed Drive run (the produced diff + the run's trace) against a golden case's acceptance.md, through a cross-family judge model, and emits the intent correctness component. Three prompt sets: a requirements+intent rubric, a failure-mode classifier, and an operator-turn classifier.
  2. The recorded decision to build it bespoke — a time-boxed spike compared Inspect / Braintrust / promptfoo and confirmed bespoke-minimal. The rationale lands in the project spec.md and design-notes.md; promptfoo is the recorded escape hatch.

The judge model is injected everywhere, so the whole subtree typechecks, tests, and lints with no CURSOR_API_KEY and @cursor/sdk absent — tests pass a mock. The live adapter is reached only behind the same --live + key gate as the harness.

How it fits together

  1. The model boundary (judge/judge-model.ts) — a one-method JudgeModel interface (grade(prompt) => Promise<string>). Everything downstream takes it as a dependency; tests inject a mock and never make a real call.
  2. The live adapter (judge/judge-model-sdk.ts) — pins a cross-family judge id (default gpt-5.5) and rejects a same-family judge id at construction (a Claude judge grading a Claude orchestrator throws before any SDK code runs). The @cursor/sdk import is lazy, so module load stays green without the package.
  3. The three prompt sets (rubric-correctness.ts, classify-failure.ts, classify-operator.ts) — each renders a prompt, calls the model, and parses an arktype-validated verdict. The operator-turn classifier uses the measurement model's five canonical buckets (docs/drive/measurement-model.md): legitimate-design, legitimate-authorisation, illegitimate-asked, illegitimate-correction, illegitimate-rescue.
  4. The merge-preserving emission (judge/emit-correctness.ts) — folds the rubric's intent into the run's latest recorded {mechanical, qa} and emits one correctness-recorded event through the deterministic emitter. The slice-1 scorecard composes it; no scorecard or schema edits.
  5. The calibration harness (judge/calibration.ts + judge/calibration/labels.md) — a judge-vs-human agreement tally with a ≥0.80 gate. The machinery lands; the calibration run is parked (see Reviewer notes).
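
Steps 1 and 2 can be sketched roughly as follows. The `JudgeModel` signature comes from the description above; `createMockJudge`, `inferFamily`, `assertCrossFamily`, and the prefix-based family rule are hypothetical names and assumptions, since the PR does not show how families are inferred.

```typescript
// The one-method boundary from step 1.
interface JudgeModel {
  grade(prompt: string): Promise<string>;
}

// Hypothetical test double: returns a canned response and records each prompt,
// so tests never make a real model call.
function createMockJudge(canned: string): JudgeModel & { prompts: string[] } {
  const prompts: string[] = [];
  return {
    prompts,
    async grade(prompt: string): Promise<string> {
      prompts.push(prompt);
      return canned;
    },
  };
}

// Assumption: family is inferred from the model id prefix. The real rule may differ.
type ModelFamily = 'claude' | 'gpt' | 'unknown';

function inferFamily(modelId: string): ModelFamily {
  const id = modelId.toLowerCase();
  if (id.startsWith('claude')) return 'claude';
  if (id.startsWith('gpt')) return 'gpt';
  return 'unknown';
}

// Synchronous guard: throws at construction time, before any SDK import would run.
function assertCrossFamily(judgeId: string, orchestratorId: string): void {
  const judgeFamily = inferFamily(judgeId);
  if (judgeFamily !== 'unknown' && judgeFamily === inferFamily(orchestratorId)) {
    throw new Error(
      `judge "${judgeId}" is in the same family (${judgeFamily}) as orchestrator "${orchestratorId}"`,
    );
  }
}
```

Keeping the guard synchronous and separate from the lazy `@cursor/sdk` import is what lets a same-family misconfiguration fail fast even on machines where the package is absent.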

Reviewer notes

  • The calibration run is deliberately parked, not forgotten. Calibration needs ~10–20 instrumented runs, and corpus generation is real-dollar spend the operator is holding. So this slice ships the gate machinery and an honest "uncalibrated" status; the project-DoD calibration item stays unchecked. SKILL.md and calibration/labels.md both record the deferral and the operator-spend gate.
  • The merge rule is the subtle part. computeScorecard is last-write-wins on the whole {mechanical, qa, intent} triple — it does not merge components. A naive judge emitting {mechanical:null, qa:null, intent:pass} would erase a gate's recorded pass. emit-correctness.ts reads-merges-emits so that can't happen; the end-to-end test asserts a prior mechanical:pass survives.
  • One unplanned helper. judge/parse-json.ts lifts a JSON object out of a model response (bare / fenced / embedded) — factored out so the malformed-→null path lives in one place rather than three copies.
  • The planning commit rides along. The first commit scaffolds the slice (spec, plan, trace) and records the spike; the second is the implementation. They're one reviewable unit.
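
As a rough illustration of the parse-json helper described in the notes above, the sketch below tries the three shapes in order: bare, fenced, then embedded. The function names follow the PR's review discussion, but the bodies are assumptions, not the shipped implementation.

```typescript
// Built at runtime so this example does not contain a literal code fence.
const FENCE = '`'.repeat(3);

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Returns the parsed object, or undefined for invalid JSON and non-object JSON.
function tryParseObject(text: string): Record<string, unknown> | undefined {
  try {
    const value: unknown = JSON.parse(text);
    return isRecord(value) ? value : undefined;
  } catch {
    return undefined;
  }
}

function parseJsonFromModel(raw: string): Record<string, unknown> | undefined {
  const trimmed = raw.trim();

  // 1. Bare: the whole response is a JSON object.
  const bare = tryParseObject(trimmed);
  if (bare) return bare;

  // 2. Fenced: the object sits inside a fenced block, optionally tagged json.
  const fencedMatch = new RegExp(FENCE + '(?:json)?\\s*([\\s\\S]*?)' + FENCE).exec(trimmed);
  const inner = fencedMatch?.[1];
  if (inner !== undefined) {
    const fenced = tryParseObject(inner.trim());
    if (fenced) return fenced;
  }

  // 3. Embedded: take the span from the first '{' to the last '}'.
  const start = trimmed.indexOf('{');
  const end = trimmed.lastIndexOf('}');
  if (start !== -1 && end > start) return tryParseObject(trimmed.slice(start, end + 1));

  // Nothing usable: callers map undefined to a null verdict.
  return undefined;
}
```

Because every failure collapses to `undefined`, each of the three prompt sets can translate it into its own null verdict in one place.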

Testing performed

  • node --test over the six new judge suites — 43 cases, all green, run with CURSOR_API_KEY unset.
  • pnpm typecheck — clean.
  • pnpm lint:deps — no dependency violations.
  • pnpm lint:casts, delta=0 (no new bare casts).
  • pnpm test:scripts — 545 cases green (nothing else regressed).

Skill update

skills-contrib/drive-judge-harness/SKILL.md documents the judge, the cross-family requirement, the correctness-recorded merge rule, the fail-to-null invariant, and the parked calibration.

Checklist

  • DCO sign-off on every commit
  • Tests written first and passing
  • Title follows the TML-NNNN: convention
  • No new bare casts (lint:casts delta 0)

Alternatives considered

  • Adopt an off-the-shelf eval framework (Inspect / Braintrust / promptfoo). Confirmed-rejected by the spike. They grade (input → model output); our unit is a whole Drive run scored from trace + diff + golden acceptance set. A framework can host the tiny grading call but not the integration with our trace/scorecard/golden assets — that glue is bespoke either way. promptfoo (TS, MIT, local) is recorded as the escape hatch if the bespoke scorer grows hairy.
  • Emit the intent component on its own. Rejected — it would clobber the gate-recorded mechanical/qa under last-write-wins. Hence the merge-preserving helper.
  • An "other" operator-turn bucket + non-null fallback. Rejected — the measurement model defines exactly five buckets; a malformed response yields bucket: null (same fail-to-null discipline as the rubric) rather than an off-doc catch-all.
  • Run the calibration now. Rejected — corpus generation is held on cost. The judge ships uncalibrated-but-honest; the gate is computable the moment the corpus exists.
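
The five-bucket decision above can be sketched like this. The bucket names come from the PR description; `toOperatorVerdict`, the `OperatorVerdict` shape, and the reason strings are hypothetical, and the real classifier validates with arktype rather than a hand-written guard.

```typescript
// The five canonical buckets named in the PR description.
const OPERATOR_BUCKETS = [
  'legitimate-design',
  'legitimate-authorisation',
  'illegitimate-asked',
  'illegitimate-correction',
  'illegitimate-rescue',
] as const;

type OperatorBucket = (typeof OPERATOR_BUCKETS)[number];

// Assumed verdict shape for illustration.
interface OperatorVerdict {
  bucket: OperatorBucket | null;
  reasons: string[];
}

function isOperatorBucket(value: unknown): value is OperatorBucket {
  return typeof value === 'string' && (OPERATOR_BUCKETS as readonly string[]).includes(value);
}

// Hypothetical helper: takes already-parsed model output (or undefined when
// parsing failed) and never falls back to a catch-all bucket.
function toOperatorVerdict(parsed: unknown): OperatorVerdict {
  if (typeof parsed !== 'object' || parsed === null) {
    return { bucket: null, reasons: ['malformed model output: expected a JSON object'] };
  }
  const record = parsed as Record<string, unknown>;
  const bucket = record['bucket'];
  if (!isOperatorBucket(bucket)) {
    return {
      bucket: null,
      reasons: [`malformed model output: unknown bucket ${JSON.stringify(bucket)}`],
    };
  }
  const rawReasons = record['reasons'];
  const reasons = Array.isArray(rawReasons)
    ? rawReasons.filter((r): r is string => typeof r === 'string')
    : [];
  return { bucket, reasons };
}
```

An off-list value such as `"other"` therefore surfaces as `bucket: null` with a diagnostic reason, which keeps the tallies aligned with the measurement model.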

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced LLM-based judge system for Drive orchestrator evaluation with failure mode classification, operator turn assessment, and correctness rubric grading
    • Implemented cross-family model constraint enforcement between judge and orchestrator
    • Added calibration framework for judge accuracy validation with agreement-rate metrics
  • Documentation

    • Expanded judge harness documentation with detailed module descriptions and key invariants
    • Added calibration corpus specification and workflow guidance
  • Tests

    • Added comprehensive test coverage for judge components, classifiers, and calibration logic

wmadden added 2 commits May 31, 2026 11:59
…-judge slice

Record the slice-3 spike outcome (bespoke-minimal confirmed over Inspect /
Braintrust / promptfoo; impedance mismatch in the eval unit; promptfoo as the
escape hatch) in the project spec and design-notes, and scaffold the llm-judge
(TML-2736) slice: spec, dispatch plan, and slice-scoped trace.

Signed-off-by: Will Madden <madden@prisma.io>
Add a bespoke-minimal LLM judge under skills-contrib/drive-judge-harness/judge/
that grades one Drive run against a golden case's acceptance set behind an
injected, mockable JudgeModel boundary — so typecheck/test/lint stay green with
no CURSOR_API_KEY and @cursor/sdk absent.

- judge-model{,-sdk}.ts: injected boundary + live adapter with a synchronous
  cross-family guard (rejects a same-family judge id) and a lazy SDK import.
- rubric-correctness.ts: requirements+intent rubric; a malformed model response
  yields intent:null (never a false pass).
- classify-failure.ts / classify-operator.ts: diagnostic classifiers; the
  operator-turn buckets follow the measurement model's five canonical buckets.
- emit-correctness.ts: merge-preserving emission — fills intent while preserving
  any already-recorded mechanical/qa, because the scorecard is last-write-wins
  on the whole triple.
- calibration.ts + calibration/labels.md: agreement tally with the >=0.80 gate;
  the calibration run itself is parked (corpus-gated, operator approves spend).

43 tests (node --test) green; typecheck / lint:deps / lint:casts clean. Pins the
fail-to-null and merge-preserving invariants end-to-end against the scorecard.

Signed-off-by: Will Madden <madden@prisma.io>
@wmadden-electric wmadden-electric requested a review from a team as a code owner May 31, 2026 10:29
coderabbitai Bot commented May 31, 2026

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: f54c6178-368d-4432-b3a0-2e19aa905e63

📥 Commits

Reviewing files that changed from the base of the PR and between bb9f4fb and a724fc0.

📒 Files selected for processing (2)
  • skills-contrib/drive-judge-harness/judge/calibration/labels.md
  • skills-contrib/drive-judge-harness/judge/classify-failure.ts
Walkthrough

This PR introduces a complete LLM judge harness for Drive orchestrator evaluation. It establishes a JudgeModel interface backed by an SDK adapter, implements classifiers for failure modes and operator turns, grades correctness rubrics, and emits merged verdicts to trace files while preserving prior mechanical and QA scores. Comprehensive tests and calibration infrastructure are included.

Changes

Drive LLM Judge Grading & Correctness Pipeline

| Layer | File(s) | Summary |
| --- | --- | --- |
| Judge model interface and SDK adapter with cross-family guard | skills-contrib/drive-judge-harness/judge/judge-model.ts, judge/judge-model-sdk.ts, test/judge-model-sdk.test.ts | JudgeModel defines the grading contract. createSdkJudgeModel infers model families, enforces cross-family pairing synchronously, and returns an adapter that lazily imports @cursor/sdk, streams messages, and accumulates extracted assistant text. |
| JSON extraction and parsing utility for model responses | skills-contrib/drive-judge-harness/judge/parse-json.ts | parseJsonFromModel safely extracts and parses JSON objects from model output, supporting direct JSON, fenced code blocks, and braced substrings, returning undefined on failure. |
| Failure mode classification with taxonomy validation | skills-contrib/drive-judge-harness/judge/classify-failure.ts, test/classify-failure.test.ts | Fixed failure-mode taxonomy (F1–F15, scope-trap, qa-coverage-gap) is graded via prompts built from acceptance markdown and diffs. JSON responses are validated against arktype schema; malformed output yields empty failureModes with diagnostic reasons. |
| Operator turn classification into five buckets | skills-contrib/drive-judge-harness/judge/classify-operator.ts, test/classify-operator.test.ts | Operator actions are classified into five allowed buckets. Prompts embed operator turn text and trace context; JSON validation returns bucket: null with reasons when parsing or validation fails. |
| Correctness rubric grading with intent verdict | skills-contrib/drive-judge-harness/judge/rubric-correctness.ts, test/rubric-correctness.test.ts | Grades whether produced diffs satisfy acceptance criteria and design-quality clauses. Parses intent and reasons from model JSON; returns intent: null with diagnostic reasons on parse/validation failure. |
| Correctness event emission with mechanical/QA preservation | skills-contrib/drive-judge-harness/judge/emit-correctness.ts, test/emit-correctness.test.ts | Merges the judge's new intent verdict with prior mechanical and qa scores from trace events, preserving older values and writing deterministic correctness-recorded JSONL lines to trace files. |
| Calibration threshold, verdict types, and agreement measurement | skills-contrib/drive-judge-harness/judge/calibration.ts, judge/calibration/labels.md, test/calibration.test.ts | CALIBRATION_THRESHOLD (0.8) and verdict types support judge-vs-human agreement tracking. agreementRate() computes exact-match agreement and derives a passes boolean from the threshold. |
| Judge harness overview documentation and test script updates | skills-contrib/drive-judge-harness/SKILL.md, package.json | SKILL.md describes the judge module layout, invariants (safe-fail parsing, merge rules, cross-family requirement), and parked calibration workflow. Root test:scripts is updated to include new test files. |
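
The calibration gate summarized above amounts to an exact-match tally against a threshold. A minimal sketch follows, assuming a hypothetical `LabelledPair` shape and result fields; only `CALIBRATION_THRESHOLD` (0.8) and the name `agreementRate` come from the PR, and treating an empty corpus as not passing is an assumption.

```typescript
const CALIBRATION_THRESHOLD = 0.8;

// Assumed label shape: one judge verdict paired with one human verdict.
interface LabelledPair {
  judge: 'pass' | 'fail' | null; // null = judge produced no verdict
  human: 'pass' | 'fail';
}

interface AgreementResult {
  total: number;
  agreed: number;
  rate: number;
  passes: boolean;
}

function agreementRate(pairs: readonly LabelledPair[]): AgreementResult {
  // Exact match only: a null judge verdict never agrees with a human label.
  const agreed = pairs.filter((p) => p.judge === p.human).length;
  const total = pairs.length;
  const rate = total === 0 ? 0 : agreed / total;
  // Assumption: an empty corpus cannot pass the gate ("uncalibrated").
  return { total, agreed, rate, passes: total > 0 && rate >= CALIBRATION_THRESHOLD };
}
```

With this shape, 8 agreements out of 10 labelled runs sits exactly on the gate, and 7 out of 10 falls below it.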

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • prisma/prisma-next#633: Introduces the emitEvent trace-file emission mechanism in drive-record-traces/emit.ts, which is consumed by the main PR's judge/emit-correctness.ts to append correctness verdicts to trace JSONL files.
  • prisma/prisma-next#582: Updates the same package.json test:scripts command to expand Node test coverage in CI, overlapping with the main PR's script enhancement.
  • prisma/prisma-next#641: Extends the same skills-contrib/drive-judge-harness directory with live @cursor/sdk integration, sharing the underlying SDK adapter pattern.

Poem

🐰 With whiskers twitching and tail held high,
A judge is born to grade and certify!
It grids the rubric, parses what models say,
Preserves old scores while guiding the way.
Correctness flows, no family conflict here—
The harness hops forward, crystal clear!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 30.43% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly and clearly references the main change: implementation of a bespoke LLM judge that emits an intent correctness signal and includes three classifier modules (rubric, failure, operator). |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

github-actions Bot commented May 31, 2026

size-limit report 📦

| Path | Size |
| --- | --- |
| postgres / no-emit | 135.37 KB (0%) |
| postgres / emit | 125.16 KB (0%) |
| mongo / no-emit | 73.9 KB (0%) |
| mongo / emit | 68.89 KB (0%) |

pkg-pr-new Bot commented May 31, 2026

@prisma-next/extension-author-tools

npm i https://pkg.pr.new/@prisma-next/extension-author-tools@654

@prisma-next/mongo-runtime

npm i https://pkg.pr.new/@prisma-next/mongo-runtime@654

@prisma-next/family-mongo

npm i https://pkg.pr.new/@prisma-next/family-mongo@654

@prisma-next/sql-runtime

npm i https://pkg.pr.new/@prisma-next/sql-runtime@654

@prisma-next/family-sql

npm i https://pkg.pr.new/@prisma-next/family-sql@654

@prisma-next/extension-arktype-json

npm i https://pkg.pr.new/@prisma-next/extension-arktype-json@654

@prisma-next/middleware-cache

npm i https://pkg.pr.new/@prisma-next/middleware-cache@654

@prisma-next/mongo

npm i https://pkg.pr.new/@prisma-next/mongo@654

@prisma-next/extension-paradedb

npm i https://pkg.pr.new/@prisma-next/extension-paradedb@654

@prisma-next/extension-pgvector

npm i https://pkg.pr.new/@prisma-next/extension-pgvector@654

@prisma-next/extension-postgis

npm i https://pkg.pr.new/@prisma-next/extension-postgis@654

@prisma-next/postgres

npm i https://pkg.pr.new/@prisma-next/postgres@654

@prisma-next/sql-orm-client

npm i https://pkg.pr.new/@prisma-next/sql-orm-client@654

@prisma-next/sqlite

npm i https://pkg.pr.new/@prisma-next/sqlite@654

@prisma-next/target-mongo

npm i https://pkg.pr.new/@prisma-next/target-mongo@654

@prisma-next/adapter-mongo

npm i https://pkg.pr.new/@prisma-next/adapter-mongo@654

@prisma-next/driver-mongo

npm i https://pkg.pr.new/@prisma-next/driver-mongo@654

@prisma-next/contract

npm i https://pkg.pr.new/@prisma-next/contract@654

@prisma-next/utils

npm i https://pkg.pr.new/@prisma-next/utils@654

@prisma-next/config

npm i https://pkg.pr.new/@prisma-next/config@654

@prisma-next/errors

npm i https://pkg.pr.new/@prisma-next/errors@654

@prisma-next/framework-components

npm i https://pkg.pr.new/@prisma-next/framework-components@654

@prisma-next/operations

npm i https://pkg.pr.new/@prisma-next/operations@654

@prisma-next/ts-render

npm i https://pkg.pr.new/@prisma-next/ts-render@654

@prisma-next/contract-authoring

npm i https://pkg.pr.new/@prisma-next/contract-authoring@654

@prisma-next/ids

npm i https://pkg.pr.new/@prisma-next/ids@654

@prisma-next/psl-parser

npm i https://pkg.pr.new/@prisma-next/psl-parser@654

@prisma-next/psl-printer

npm i https://pkg.pr.new/@prisma-next/psl-printer@654

@prisma-next/cli

npm i https://pkg.pr.new/@prisma-next/cli@654

@prisma-next/cli-telemetry

npm i https://pkg.pr.new/@prisma-next/cli-telemetry@654

@prisma-next/emitter

npm i https://pkg.pr.new/@prisma-next/emitter@654

@prisma-next/migration-tools

npm i https://pkg.pr.new/@prisma-next/migration-tools@654

prisma-next

npm i https://pkg.pr.new/prisma-next@654

@prisma-next/vite-plugin-contract-emit

npm i https://pkg.pr.new/@prisma-next/vite-plugin-contract-emit@654

@prisma-next/mongo-codec

npm i https://pkg.pr.new/@prisma-next/mongo-codec@654

@prisma-next/mongo-contract

npm i https://pkg.pr.new/@prisma-next/mongo-contract@654

@prisma-next/mongo-value

npm i https://pkg.pr.new/@prisma-next/mongo-value@654

@prisma-next/mongo-contract-psl

npm i https://pkg.pr.new/@prisma-next/mongo-contract-psl@654

@prisma-next/mongo-contract-ts

npm i https://pkg.pr.new/@prisma-next/mongo-contract-ts@654

@prisma-next/mongo-emitter

npm i https://pkg.pr.new/@prisma-next/mongo-emitter@654

@prisma-next/mongo-schema-ir

npm i https://pkg.pr.new/@prisma-next/mongo-schema-ir@654

@prisma-next/mongo-query-ast

npm i https://pkg.pr.new/@prisma-next/mongo-query-ast@654

@prisma-next/mongo-orm

npm i https://pkg.pr.new/@prisma-next/mongo-orm@654

@prisma-next/mongo-query-builder

npm i https://pkg.pr.new/@prisma-next/mongo-query-builder@654

@prisma-next/mongo-lowering

npm i https://pkg.pr.new/@prisma-next/mongo-lowering@654

@prisma-next/mongo-wire

npm i https://pkg.pr.new/@prisma-next/mongo-wire@654

@prisma-next/sql-contract

npm i https://pkg.pr.new/@prisma-next/sql-contract@654

@prisma-next/sql-errors

npm i https://pkg.pr.new/@prisma-next/sql-errors@654

@prisma-next/sql-operations

npm i https://pkg.pr.new/@prisma-next/sql-operations@654

@prisma-next/sql-schema-ir

npm i https://pkg.pr.new/@prisma-next/sql-schema-ir@654

@prisma-next/sql-contract-psl

npm i https://pkg.pr.new/@prisma-next/sql-contract-psl@654

@prisma-next/sql-contract-ts

npm i https://pkg.pr.new/@prisma-next/sql-contract-ts@654

@prisma-next/sql-contract-emitter

npm i https://pkg.pr.new/@prisma-next/sql-contract-emitter@654

@prisma-next/sql-lane-query-builder

npm i https://pkg.pr.new/@prisma-next/sql-lane-query-builder@654

@prisma-next/sql-relational-core

npm i https://pkg.pr.new/@prisma-next/sql-relational-core@654

@prisma-next/sql-builder

npm i https://pkg.pr.new/@prisma-next/sql-builder@654

@prisma-next/target-postgres

npm i https://pkg.pr.new/@prisma-next/target-postgres@654

@prisma-next/target-sqlite

npm i https://pkg.pr.new/@prisma-next/target-sqlite@654

@prisma-next/adapter-postgres

npm i https://pkg.pr.new/@prisma-next/adapter-postgres@654

@prisma-next/adapter-sqlite

npm i https://pkg.pr.new/@prisma-next/adapter-sqlite@654

@prisma-next/driver-postgres

npm i https://pkg.pr.new/@prisma-next/driver-postgres@654

@prisma-next/driver-sqlite

npm i https://pkg.pr.new/@prisma-next/driver-sqlite@654

commit: a724fc0

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
skills-contrib/drive-judge-harness/judge/parse-json.ts (1)

27-32: 💤 Low value

Braced span extraction may incorrectly match unrelated braces.

extractBracedSpan uses indexOf('{') and lastIndexOf('}') to find the first opening and last closing brace. If the model's response contains multiple independent JSON objects or braced prose (e.g., "The result is {good}. Also here is the JSON: {\"intent\":\"pass\",...}"), this will extract everything from the first { to the last }, producing invalid JSON.

This is mitigated by tryParseObject rejecting invalid JSON, but the function could be more precise by scanning for balanced braces or documenting this limitation.

Possible refinement

Add a comment documenting the limitation:

 function extractBracedSpan(raw: string): string | undefined {
+  // Extracts from first '{' to last '}'; malformed spans are rejected by tryParseObject.
   const start = raw.indexOf('{');
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@skills-contrib/drive-judge-harness/judge/parse-json.ts` around lines 27 - 32,
extractBracedSpan can grab unrelated braces because it uses indexOf/lastIndexOf;
update extractBracedSpan to locate the first opening brace and then scan forward
counting nested braces (increment on '{', decrement on '}') until the count
returns to zero, returning that balanced slice; if no balanced span is found
return undefined — reference the extractBracedSpan function and ensure the
scanner handles nested objects and stops at the first balanced closing brace to
avoid spanning multiple JSON objects or prose with braces.
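
A balanced-brace scanner along the lines the prompt describes might look like the sketch below. It is not the PR's code: `extractBalancedSpan` and `firstParsableObject` are hypothetical names, and two things go beyond the suggestion as written. The scanner skips braces inside JSON string literals, and the second helper retries from later braces, because the first balanced span in prose such as `{good}` is still not valid JSON.

```typescript
// Returns the first balanced {...} span at or after `from`, or undefined.
function extractBalancedSpan(raw: string, from = 0): string | undefined {
  const start = raw.indexOf('{', from);
  if (start === -1) return undefined;

  let depth = 0;
  let inString = false;
  let escaped = false;

  for (let i = start; i < raw.length; i++) {
    const ch = raw[i];
    if (inString) {
      // Inside a string literal, braces do not count; track escapes and the closing quote.
      if (escaped) escaped = false;
      else if (ch === '\\') escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === '{') depth++;
    else if (ch === '}') {
      depth--;
      if (depth === 0) return raw.slice(start, i + 1);
    }
  }
  return undefined; // never balanced
}

// Tries each opening brace in turn and returns the first span that parses as JSON.
function firstParsableObject(raw: string): unknown {
  let from = 0;
  while (from < raw.length) {
    const start = raw.indexOf('{', from);
    if (start === -1) return undefined;
    const span = extractBalancedSpan(raw, start);
    if (span !== undefined) {
      try {
        return JSON.parse(span);
      } catch {
        // Not JSON (for example braced prose); try the next opening brace.
      }
    }
    from = start + 1;
  }
  return undefined;
}
```

Whether the extra precision is worth it is a judgment call: the review itself rates the finding low value, since invalid spans are already rejected by `tryParseObject`.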
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@skills-contrib/drive-judge-harness/judge/calibration/labels.md`:
- Around line 47-48: The labels.md file contains a transient project artifact
reference to `projects/drive-judge-harness/slices/llm-judge/spec.md`; remove
that reference or replace it with a link to a stable architecture or design doc
(for example a canonical README or architecture doc) and update the line
"Calibration harness — built, run deferred." if needed to point to the durable
doc; ensure no other durable docs in `labels.md` reference `projects/` artifacts
and update any link text to reflect the stable source.

In `@skills-contrib/drive-judge-harness/judge/classify-failure.ts`:
- Around line 45-50: FailureResponse currently hardcodes the failure-mode
string-union which can drift from the canonical FAILURE_MODE_CODES; change the
FailureResponse definition so its failureModes arktype is generated from
FAILURE_MODE_CODES (e.g., build the union string or enum values from
FAILURE_MODE_CODES keys/values and pass that into the existing
type(...).array()) instead of the literal string list—update the FailureResponse
declaration and any helper used to construct the union so the validator always
reflects FAILURE_MODE_CODES.
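
The single-source-of-truth idea behind this comment can be illustrated without arktype. In the sketch below the validator and the TypeScript type are both derived from one constant, so neither can drift. It assumes `FAILURE_MODE_CODES` is a readonly array and lists only an illustrative subset of the taxonomy (the PR describes F1 through F15 plus scope-trap and qa-coverage-gap); `isFailureModeCode` and `validateFailureModes` are hypothetical names.

```typescript
// Illustrative subset only; assumed to be an array in this sketch.
const FAILURE_MODE_CODES = ['F1', 'F2', 'scope-trap', 'qa-coverage-gap'] as const;

// The type is derived from the constant, not written out by hand.
type FailureModeCode = (typeof FAILURE_MODE_CODES)[number];

// The runtime check is derived from the same constant.
const FAILURE_MODE_SET: ReadonlySet<string> = new Set<string>(FAILURE_MODE_CODES);

function isFailureModeCode(value: unknown): value is FailureModeCode {
  return typeof value === 'string' && FAILURE_MODE_SET.has(value);
}

// Returns the validated list, or undefined if any entry is off-taxonomy.
function validateFailureModes(value: unknown): FailureModeCode[] | undefined {
  if (!Array.isArray(value)) return undefined;
  const out: FailureModeCode[] = [];
  for (const item of value) {
    if (!isFailureModeCode(item)) return undefined;
    out.push(item);
  }
  return out;
}
```

Adding a code to the constant then widens both the type and the validator in one edit, which is the property the review asks for.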

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 42abe22c-cf4e-49c9-bf42-428fa52c14d5

📥 Commits

Reviewing files that changed from the base of the PR and between 4dd15e6 and bb9f4fb.

⛔ Files ignored due to path filters (5)
  • projects/drive-judge-harness/design-notes.md is excluded by !projects/**
  • projects/drive-judge-harness/slices/llm-judge/plan.md is excluded by !projects/**
  • projects/drive-judge-harness/slices/llm-judge/spec.md is excluded by !projects/**
  • projects/drive-judge-harness/slices/llm-judge/trace.jsonl is excluded by !projects/**
  • projects/drive-judge-harness/spec.md is excluded by !projects/**
📒 Files selected for processing (17)
  • package.json
  • skills-contrib/drive-judge-harness/SKILL.md
  • skills-contrib/drive-judge-harness/judge/calibration.ts
  • skills-contrib/drive-judge-harness/judge/calibration/labels.md
  • skills-contrib/drive-judge-harness/judge/classify-failure.ts
  • skills-contrib/drive-judge-harness/judge/classify-operator.ts
  • skills-contrib/drive-judge-harness/judge/emit-correctness.ts
  • skills-contrib/drive-judge-harness/judge/judge-model-sdk.ts
  • skills-contrib/drive-judge-harness/judge/judge-model.ts
  • skills-contrib/drive-judge-harness/judge/parse-json.ts
  • skills-contrib/drive-judge-harness/judge/rubric-correctness.ts
  • skills-contrib/drive-judge-harness/test/calibration.test.ts
  • skills-contrib/drive-judge-harness/test/classify-failure.test.ts
  • skills-contrib/drive-judge-harness/test/classify-operator.test.ts
  • skills-contrib/drive-judge-harness/test/emit-correctness.test.ts
  • skills-contrib/drive-judge-harness/test/judge-model-sdk.test.ts
  • skills-contrib/drive-judge-harness/test/rubric-correctness.test.ts

Comment thread skills-contrib/drive-judge-harness/judge/calibration/labels.md Outdated
Comment thread skills-contrib/drive-judge-harness/judge/classify-failure.ts
- derive the failure-mode arktype validator from FAILURE_MODE_CODES via
  type.enumerated(...) so the taxonomy has a single source of truth
- drop the transient projects/ spec link from the durable labels.md doc;
  point at the co-located SKILL.md instead

Signed-off-by: Will Madden <madden@prisma.io>
@wmadden wmadden merged commit 70bd5ed into main May 31, 2026
21 checks passed
@wmadden wmadden deleted the tml-2736-llm-judge branch May 31, 2026 10:56