tml-2720: two-tier scorecard + token/correctness trace vocabulary by wmadden · Pull Request #640 · prisma/prisma-next

wmadden · 2026-05-30T16:12:38Z

The decision: the report must refuse to imply "good" without a correctness signal

drive-diagnose-run printed twenty disaggregated metrics and a static "Not computable" caveat, but nothing bound correctness to efficiency — and the token row claimed "not instrumented". A reader skimming all-green metrics could quietly conclude the run was good. This PR makes that impossible: the report headline is now a two-tier scorecard, and when no external correctness signal is present the verdict line reads not computable and names the missing input rather than staying silent.

not computable is the correct, shippable state of the scorecard for the entire window before the judge exists — it is not a stub.

Two-tier scorecard

Tier 1 — correctness gate (binary, external). Per project_run_id, the gate reads the external correctness-recorded feed (mechanical / qa / intent). All three pass → CORRECT; any fail → INCORRECT; any null or a missing feed → not computable, naming the missing component(s) (or "external correctness signal" when the feed is absent entirely).
Tier 2 — efficiency (CORRECT runs only). Tokens (from the per-run tokens-recorded feed), wall-clock, and rework — rendered only over runs that passed Tier 1, and hidden with a one-line reason otherwise. Null/absent token figures render n/a (no signal). Scoring an incorrect or ungraded run's efficiency is meaningless, so Tier 2 is gated on Tier 1.
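
The gating described in the two tiers above can be sketched as follows. This is a minimal illustration, not the contents of scorecard.ts: the names `classifyRun`, `renderTier2`, `Verdict`, and `CorrectnessFeed` are hypothetical, and the precedence of "fail" over "null" when both occur in one feed is an assumption the PR text does not settle.

```typescript
// Hypothetical sketch of the two-tier gate; names are illustrative only.
type Component = "pass" | "fail" | null;

interface CorrectnessFeed {
  mechanical: Component;
  qa: Component;
  intent: Component;
}

type Verdict =
  | { kind: "CORRECT" }
  | { kind: "INCORRECT"; failed: string[] }
  | { kind: "not computable"; missing: string[] };

const COMPONENTS = ["mechanical", "qa", "intent"] as const;

// Names of the components whose recorded value equals `value`.
function componentsWith(feed: CorrectnessFeed, value: Component): string[] {
  return COMPONENTS.filter((c) => feed[c] === value);
}

// Tier 1: binary gate over the external correctness feed.
function classifyRun(feed: CorrectnessFeed | undefined): Verdict {
  if (feed === undefined) {
    // Feed absent entirely: name the whole signal as the missing input.
    return { kind: "not computable", missing: ["external correctness signal"] };
  }
  // Assumption: an explicit fail outranks a missing component.
  const failed = componentsWith(feed, "fail");
  if (failed.length > 0) return { kind: "INCORRECT", failed };
  const missing = componentsWith(feed, null);
  if (missing.length > 0) return { kind: "not computable", missing };
  return { kind: "CORRECT" };
}

// Tier 2: efficiency is rendered only for runs that passed Tier 1.
function renderTier2(verdict: Verdict, totalTokens: number | null): string {
  if (verdict.kind !== "CORRECT") {
    return `efficiency hidden: run is ${verdict.kind}`;
  }
  return `tokens: ${totalTokens === null ? "n/a (no signal)" : totalTokens}`;
}
```

Under these assumptions, a run with `intent: null` classifies as not computable and names `intent` as the missing component, which is the state every run is in until the judge slice lands.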

New module scorecard.ts computes the scorecard; report.ts renders it as the headline (replacing the old static verdict block) and the stale "token usage: not instrumented" operator row is gone.

Trace vocabulary additions

Two per-run event types added to the single canonical schema (skills-contrib/drive-record-traces/schema.ts), the union, and KNOWN_EVENT_TYPES; both documented in events.md:

tokens-recorded — input_tokens / output_tokens / cache_read_tokens / cache_write_tokens, each integer ≥ 0 | null, sourced from the Cursor SDK's TurnEndedUpdate.usage (snake_case to match the vocabulary; mapped 1:1 to the SDK's camelCase). Hand-runs never emit it.
correctness-recorded — the external Tier-1 verdict slot the judge will populate: mechanical / qa / intent, each "pass" | "fail" | null. This PR builds only the slot.

Both feeds are emitted after and outside the orchestrated run (the harness accumulates tokens; the judge grades correctness post-hoc), so they are discrete per-run events rather than fields on a lifecycle event whose writer isn't present when the value becomes known.
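
For orientation, the two payloads might be typed and checked roughly like this. It is a sketch under stated assumptions: only the snake_case field names, the "integer ≥ 0 | null" rule, and the "pass" | "fail" | null grades come from the description above. The camelCase `UsageLike` shape is inferred from the "mapped 1:1" remark and has not been checked against the SDK, and the helper names are hypothetical rather than taken from schema.ts.

```typescript
// Hypothetical payload shapes; the real definitions live in schema.ts.
type TokenCount = number | null;

interface TokensRecordedPayload {
  input_tokens: TokenCount;
  output_tokens: TokenCount;
  cache_read_tokens: TokenCount;
  cache_write_tokens: TokenCount;
}

type Grade = "pass" | "fail" | null;

interface CorrectnessRecordedPayload {
  mechanical: Grade;
  qa: Grade;
  intent: Grade;
}

// Assumed camelCase usage shape (inferred, not verified against the SDK).
interface UsageLike {
  inputTokens?: number;
  outputTokens?: number;
  cacheReadTokens?: number;
  cacheWriteTokens?: number;
}

const TOKEN_FIELDS = [
  "input_tokens",
  "output_tokens",
  "cache_read_tokens",
  "cache_write_tokens",
] as const;

// A token figure is either null or a non-negative integer.
function isTokenCount(value: unknown): value is TokenCount {
  return value === null || (typeof value === "number" && Number.isInteger(value) && value >= 0);
}

// Accepts only objects that carry all four fields with valid values.
function isTokensRecordedPayload(value: unknown): value is TokensRecordedPayload {
  if (typeof value !== "object" || value === null) return false;
  const record: Record<string, unknown> = { ...value };
  return TOKEN_FIELDS.every((field) => isTokenCount(record[field]));
}

// Maps the camelCase usage object 1:1 onto the snake_case vocabulary;
// a figure the SDK did not report becomes null rather than 0.
function toTokensRecorded(usage: UsageLike): TokensRecordedPayload {
  const orNull = (n: number | undefined): TokenCount => (n === undefined ? null : n);
  return {
    input_tokens: orNull(usage.inputTokens),
    output_tokens: orNull(usage.outputTokens),
    cache_read_tokens: orNull(usage.cacheReadTokens),
    cache_write_tokens: orNull(usage.cacheWriteTokens),
  };
}
```

Mapping an unreported figure to null instead of 0 keeps "no signal" distinguishable from "zero tokens", which is what lets the report print n/a (no signal).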

Scope — deliberately deferred to later slices

The LLM judge that fills correctness-recorded (slice llm-judge, TML-2736).
The k=N A/B / experiment engine, cross-run aggregation, composite ranker, dashboard, CI gate (slice experiment-engine, TML-2737).
The golden-case library + SDK harness (parallel slice golden-case-harness, TML-2735) — posthoc.ts untouched, no @cursor/sdk dependency, no golden cases.
Per-dispatch token attribution (deferred).

Tests + gates

Test-first throughout (node --test). New: scorecard.test.ts (verdict classification, missing-input naming, token aggregation over CORRECT runs); emit.test.ts cases (accept well-formed / reject malformed for both new events); report.test.ts cases (not computable + named missing signal, Tier-2 hidden for non-correct runs, n/a (no signal) for null tokens).

Green: pnpm lint:deps, pnpm lint:casts (delta 0), pnpm test:scripts (467 tests incl. the registered scorecard.test.ts), and a direct tsc --noEmit over the edited skills-contrib files (these aren't in the turbo typecheck graph). The scorecard renders the honest not computable verdict end-to-end over this slice's own trace.

Stacking

Stacked on planning PR #638 (base main). This branch carries #638's planning commits until #638 merges; review the four commits from Drive(scorecard-and-trace-inputs): slice spec + plan onward.

Linear: TML-2720.

Scaffold the "Drive — Judge + live-experiment harness" project workspace: two-tier correctness-first scorecard, an LLM judge calibrated against an accreting instrumented-run corpus, and an SDK-spawned k=N A/B harness. Four slices (TML-2720 scorecard+vocabulary, TML-2735 golden-case harness, TML-2736 judge, TML-2737 experiment engine); two foundation slices run in parallel, judge + engine stack on top. Trace.jsonl carries the first natively-instrumented project-started/spec-authored/plan-authored events, emitted via the deterministic emitter merged in PR #633.

Signed-off-by: wmadden-electric <286902546+wmadden-electric@users.noreply.github.com>

… spike frameworks

Operator steer: keep the implementation minimal. Default to a bespoke LLM-judge + held-out agreement tally; adopt a third-party eval framework (Inspect/Braintrust/promptfoo) only if a time-boxed slice-3 spike shows it reduces net complexity. Run-production harness stays bespoke regardless. Adds spec non-goal + Open Question 6, design-notes alternative, plan slice-3 spike note, and the spike to TML-2736. Trace carries spec-amended/plan-amended.

Signed-off-by: wmadden-electric <286902546+wmadden-electric@users.noreply.github.com>

Settle the six project-level open questions into decisions:

- one project (judge + harness kept together; the feed->consume loop is the project)
- judge model cross-family (hard); default GPT 5.5 vs the Claude orchestrator
- per-run token signal from the SDK TurnEndedUpdate.usage
- composed correctness gate (validation gates + QA run + judge intent); CI/merge is real-PR-only since sandboxed runs cannot use CI without an isolated fork
- QA plans pre-written in each golden case acceptance set
- baseline = previous skill version
- bespoke-minimal scorer; slice-3 spike gates any framework on a net-complexity win

Open Questions section now empty; decisions logged in spec + design-notes. Trace carries spec-amended.

Signed-off-by: wmadden-electric <286902546+wmadden-electric@users.noreply.github.com>


coderabbitai · 2026-05-30T16:12:46Z

Warning

Review limit reached

@wmadden, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 23 minutes and 37 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: abf1885a-d0ba-4688-b28a-bcd6900b8045

📥 Commits

Reviewing files that changed from the base of the PR and between a91c750 and 6966f9e.

⛔ Files ignored due to path filters (7)

projects/drive-judge-harness/design-notes.md is excluded by !projects/**
projects/drive-judge-harness/plan.md is excluded by !projects/**
projects/drive-judge-harness/slices/scorecard-and-trace-inputs/plan.md is excluded by !projects/**
projects/drive-judge-harness/slices/scorecard-and-trace-inputs/spec.md is excluded by !projects/**
projects/drive-judge-harness/slices/scorecard-and-trace-inputs/trace.jsonl is excluded by !projects/**
projects/drive-judge-harness/spec.md is excluded by !projects/**
projects/drive-judge-harness/trace.jsonl is excluded by !projects/**

📒 Files selected for processing (11)

package.json
skills-contrib/drive-diagnose-run/SKILL.md
skills-contrib/drive-diagnose-run/cli.ts
skills-contrib/drive-diagnose-run/report.ts
skills-contrib/drive-diagnose-run/scorecard.ts
skills-contrib/drive-diagnose-run/test/report.test.ts
skills-contrib/drive-diagnose-run/test/scorecard.test.ts
skills-contrib/drive-record-traces/SKILL.md
skills-contrib/drive-record-traces/events.md
skills-contrib/drive-record-traces/schema.ts
skills-contrib/drive-record-traces/test/emit.test.ts


github-actions · 2026-05-30T16:14:48Z

size-limit report 📦

Path	Size
postgres / no-emit	135.35 KB (0%)
postgres / emit	125.15 KB (0%)
mongo / no-emit	73.85 KB (0%)
mongo / emit	68.85 KB (0%)

pkg-pr-new · 2026-05-30T16:15:23Z


@prisma-next/extension-author-tools

npm i https://pkg.pr.new/@prisma-next/extension-author-tools@640

@prisma-next/mongo-runtime

npm i https://pkg.pr.new/@prisma-next/mongo-runtime@640

@prisma-next/family-mongo

npm i https://pkg.pr.new/@prisma-next/family-mongo@640

@prisma-next/sql-runtime

npm i https://pkg.pr.new/@prisma-next/sql-runtime@640

@prisma-next/family-sql

npm i https://pkg.pr.new/@prisma-next/family-sql@640

@prisma-next/extension-arktype-json

npm i https://pkg.pr.new/@prisma-next/extension-arktype-json@640

@prisma-next/extension-cipherstash

npm i https://pkg.pr.new/@prisma-next/extension-cipherstash@640

@prisma-next/middleware-cache

npm i https://pkg.pr.new/@prisma-next/middleware-cache@640

@prisma-next/mongo

npm i https://pkg.pr.new/@prisma-next/mongo@640

@prisma-next/extension-paradedb

npm i https://pkg.pr.new/@prisma-next/extension-paradedb@640

@prisma-next/extension-pgvector

npm i https://pkg.pr.new/@prisma-next/extension-pgvector@640

@prisma-next/extension-postgis

npm i https://pkg.pr.new/@prisma-next/extension-postgis@640

@prisma-next/postgres

npm i https://pkg.pr.new/@prisma-next/postgres@640

@prisma-next/sql-orm-client

npm i https://pkg.pr.new/@prisma-next/sql-orm-client@640

@prisma-next/sqlite

npm i https://pkg.pr.new/@prisma-next/sqlite@640

@prisma-next/target-mongo

npm i https://pkg.pr.new/@prisma-next/target-mongo@640

@prisma-next/adapter-mongo

npm i https://pkg.pr.new/@prisma-next/adapter-mongo@640

@prisma-next/driver-mongo

npm i https://pkg.pr.new/@prisma-next/driver-mongo@640

@prisma-next/contract

npm i https://pkg.pr.new/@prisma-next/contract@640

@prisma-next/utils

npm i https://pkg.pr.new/@prisma-next/utils@640

@prisma-next/config

npm i https://pkg.pr.new/@prisma-next/config@640

@prisma-next/errors

npm i https://pkg.pr.new/@prisma-next/errors@640

@prisma-next/framework-components

npm i https://pkg.pr.new/@prisma-next/framework-components@640

@prisma-next/operations

npm i https://pkg.pr.new/@prisma-next/operations@640

@prisma-next/ts-render

npm i https://pkg.pr.new/@prisma-next/ts-render@640

@prisma-next/contract-authoring

npm i https://pkg.pr.new/@prisma-next/contract-authoring@640

@prisma-next/ids

npm i https://pkg.pr.new/@prisma-next/ids@640

@prisma-next/psl-parser

npm i https://pkg.pr.new/@prisma-next/psl-parser@640

@prisma-next/psl-printer

npm i https://pkg.pr.new/@prisma-next/psl-printer@640

@prisma-next/cli

npm i https://pkg.pr.new/@prisma-next/cli@640

@prisma-next/cli-telemetry

npm i https://pkg.pr.new/@prisma-next/cli-telemetry@640

@prisma-next/emitter

npm i https://pkg.pr.new/@prisma-next/emitter@640

@prisma-next/migration-tools

npm i https://pkg.pr.new/@prisma-next/migration-tools@640

prisma-next

npm i https://pkg.pr.new/prisma-next@640

@prisma-next/vite-plugin-contract-emit

npm i https://pkg.pr.new/@prisma-next/vite-plugin-contract-emit@640

@prisma-next/mongo-codec

npm i https://pkg.pr.new/@prisma-next/mongo-codec@640

@prisma-next/mongo-contract

npm i https://pkg.pr.new/@prisma-next/mongo-contract@640

@prisma-next/mongo-value

npm i https://pkg.pr.new/@prisma-next/mongo-value@640

@prisma-next/mongo-contract-psl

npm i https://pkg.pr.new/@prisma-next/mongo-contract-psl@640

@prisma-next/mongo-contract-ts

npm i https://pkg.pr.new/@prisma-next/mongo-contract-ts@640

@prisma-next/mongo-emitter

npm i https://pkg.pr.new/@prisma-next/mongo-emitter@640

@prisma-next/mongo-schema-ir

npm i https://pkg.pr.new/@prisma-next/mongo-schema-ir@640

@prisma-next/mongo-query-ast

npm i https://pkg.pr.new/@prisma-next/mongo-query-ast@640

@prisma-next/mongo-orm

npm i https://pkg.pr.new/@prisma-next/mongo-orm@640

@prisma-next/mongo-query-builder

npm i https://pkg.pr.new/@prisma-next/mongo-query-builder@640

@prisma-next/mongo-lowering

npm i https://pkg.pr.new/@prisma-next/mongo-lowering@640

@prisma-next/mongo-wire

npm i https://pkg.pr.new/@prisma-next/mongo-wire@640

@prisma-next/sql-contract

npm i https://pkg.pr.new/@prisma-next/sql-contract@640

@prisma-next/sql-errors

npm i https://pkg.pr.new/@prisma-next/sql-errors@640

@prisma-next/sql-operations

npm i https://pkg.pr.new/@prisma-next/sql-operations@640

@prisma-next/sql-schema-ir

npm i https://pkg.pr.new/@prisma-next/sql-schema-ir@640

@prisma-next/sql-contract-psl

npm i https://pkg.pr.new/@prisma-next/sql-contract-psl@640

@prisma-next/sql-contract-ts

npm i https://pkg.pr.new/@prisma-next/sql-contract-ts@640

@prisma-next/sql-contract-emitter

npm i https://pkg.pr.new/@prisma-next/sql-contract-emitter@640

@prisma-next/sql-lane-query-builder

npm i https://pkg.pr.new/@prisma-next/sql-lane-query-builder@640

@prisma-next/sql-relational-core

npm i https://pkg.pr.new/@prisma-next/sql-relational-core@640

@prisma-next/sql-builder

npm i https://pkg.pr.new/@prisma-next/sql-builder@640

@prisma-next/target-postgres

npm i https://pkg.pr.new/@prisma-next/target-postgres@640

@prisma-next/target-sqlite

npm i https://pkg.pr.new/@prisma-next/target-sqlite@640

@prisma-next/adapter-postgres

npm i https://pkg.pr.new/@prisma-next/adapter-postgres@640

@prisma-next/adapter-sqlite

npm i https://pkg.pr.new/@prisma-next/adapter-sqlite@640

@prisma-next/driver-postgres

npm i https://pkg.pr.new/@prisma-next/driver-postgres@640

@prisma-next/driver-sqlite

npm i https://pkg.pr.new/@prisma-next/driver-sqlite@640

commit: 6966f9e

…#654)

## Linked issue

Refs [TML-2736](https://linear.app/prisma-company/issue/TML-2736). Third slice of the **Drive — Judge + live-experiment harness** project; builds on the two-tier scorecard ([#640](#640)) and the golden-case harness ([#641](#641)).

## At a glance

The judge grades one Drive run and emits the `intent` correctness signal the scorecard already reads. Two invariants do the load-bearing work. A malformed model response is never a silent pass:

```ts
const validated = RubricResponse(parsed);
if (validated instanceof type.errors) {
  return { intent: null, reasons: [`malformed model output: ${validated.summary}`] };
}
```

…and the emission preserves any gate-recorded `mechanical`/`qa` rather than clobbering it, because the scorecard is last-write-wins on the whole triple:

```ts
export function mergedCorrectnessPayload(events, projectRunId, intent) {
  const prior = latestCorrectness(events, projectRunId);
  return { mechanical: prior?.mechanical ?? null, qa: prior?.qa ?? null, intent };
}
```

Before this slice, the scorecard's `intent` component was always `null` → every run was `not-computable`. This slice makes it producible.

## Summary

This PR ships a **bespoke-minimal LLM judge** under `skills-contrib/drive-judge-harness/judge/`. It carries two substantive pieces:

1. **The judge itself** — grades a completed Drive run (the produced diff + the run's trace) against a golden case's `acceptance.md`, through a cross-family judge model, and emits the `intent` correctness component. Three prompt sets: a requirements+intent rubric, a failure-mode classifier, and an operator-turn classifier.
2. **The recorded decision to build it bespoke** — a time-boxed spike compared Inspect / Braintrust / promptfoo and confirmed bespoke-minimal. The rationale lands in the project `spec.md` and `design-notes.md`; promptfoo is the recorded escape hatch.

The judge model is **injected** everywhere, so the whole subtree typechecks, tests, and lints with **no `CURSOR_API_KEY`** and **`@cursor/sdk` absent** — tests pass a mock. The live adapter is reached only behind the same `--live` + key gate as the harness.

## How it fits together

1. **The model boundary** (`judge/judge-model.ts`) — a one-method `JudgeModel` interface (`grade(prompt) => Promise<string>`). Everything downstream takes it as a dependency; tests inject a mock and never make a real call.
2. **The live adapter** (`judge/judge-model-sdk.ts`) — pins a cross-family judge id (default `gpt-5.5`) and **rejects a same-family judge id at construction** (a Claude judge grading a Claude orchestrator throws before any SDK code runs). The `@cursor/sdk` import is lazy, so module load stays green without the package.
3. **The three prompt sets** (`rubric-correctness.ts`, `classify-failure.ts`, `classify-operator.ts`) — each renders a prompt, calls the model, and parses an arktype-validated verdict. The operator-turn classifier uses the measurement model's five canonical buckets (`docs/drive/measurement-model.md`): legitimate-design, legitimate-authorisation, illegitimate-asked, illegitimate-correction, illegitimate-rescue.
4. **The merge-preserving emission** (`judge/emit-correctness.ts`) — folds the rubric's `intent` into the run's latest recorded `{mechanical, qa}` and emits one `correctness-recorded` event through the deterministic emitter. The slice-1 scorecard composes it; no scorecard or schema edits.
5. **The calibration harness** (`judge/calibration.ts` + `judge/calibration/labels.md`) — a judge-vs-human agreement tally with a ≥0.80 gate. The machinery lands; the calibration *run* is parked (see Reviewer notes).

## Reviewer notes

- **The calibration run is deliberately parked, not forgotten.** Calibration needs ~10–20 instrumented runs, and corpus generation is real-dollar spend the operator is holding. So this slice ships the gate machinery and an honest "uncalibrated" status; the project-DoD calibration item stays unchecked. `SKILL.md` and `calibration/labels.md` both record the deferral and the operator-spend gate.
- **The merge rule is the subtle part.** `computeScorecard` is last-write-wins on the whole `{mechanical, qa, intent}` triple — it does not merge components. A naive judge emitting `{mechanical:null, qa:null, intent:pass}` would erase a gate's recorded pass. `emit-correctness.ts` reads-merges-emits so that can't happen; the end-to-end test asserts a prior `mechanical:pass` survives.
- **One unplanned helper.** `judge/parse-json.ts` lifts a JSON object out of a model response (bare / fenced / embedded) — factored out so the malformed-→null path lives in one place rather than three copies.
- **The planning commit rides along.** The first commit scaffolds the slice (spec, plan, trace) and records the spike; the second is the implementation. They're one reviewable unit.

## Testing performed

- `node --test` over the six new judge suites — **43 cases, all green**, run with `CURSOR_API_KEY` unset.
- `pnpm typecheck` — clean.
- `pnpm lint:deps` — no dependency violations.
- `pnpm lint:casts` — `delta=0` (no new bare casts).
- `pnpm test:scripts` — 545 cases green (nothing else regressed).

## Skill update

`skills-contrib/drive-judge-harness/SKILL.md` documents the judge, the cross-family requirement, the `correctness-recorded` merge rule, the fail-to-null invariant, and the parked calibration.

## Checklist

- [x] DCO sign-off on every commit
- [x] Tests written first and passing
- [x] Title follows the `TML-NNNN:` convention
- [x] No new bare casts (`lint:casts` delta 0)

## Alternatives considered

- **Adopt an off-the-shelf eval framework (Inspect / Braintrust / promptfoo).** Confirmed-rejected by the spike. They grade `(input → model output)`; our unit is a whole Drive run scored from trace + diff + golden acceptance set. A framework can host the tiny grading call but not the integration with our trace/scorecard/golden assets — that glue is bespoke either way. promptfoo (TS, MIT, local) is recorded as the escape hatch if the bespoke scorer grows hairy.
- **Emit the `intent` component on its own.** Rejected — it would clobber the gate-recorded `mechanical`/`qa` under last-write-wins. Hence the merge-preserving helper.
- **An `other` operator-turn bucket + non-null fallback.** Rejected — the measurement model defines exactly five buckets; a malformed response yields `bucket: null` (same fail-to-null discipline as the rubric) rather than an off-doc catch-all.
- **Run the calibration now.** Rejected — corpus generation is held on cost. The judge ships uncalibrated-but-honest; the gate is computable the moment the corpus exists.

## Summary by CodeRabbit

# Release Notes

* **New Features**
  * Introduced LLM-based judge system for Drive orchestrator evaluation with failure mode classification, operator turn assessment, and correctness rubric grading
  * Implemented cross-family model constraint enforcement between judge and orchestrator
  * Added calibration framework for judge accuracy validation with agreement-rate metrics
* **Documentation**
  * Expanded judge harness documentation with detailed module descriptions and key invariants
  * Added calibration corpus specification and workflow guidance
* **Tests**
  * Added comprehensive test coverage for judge components, classifiers, and calibration logic

---------

Signed-off-by: Will Madden <madden@prisma.io>
Co-authored-by: Will Madden <madden@prisma.io>
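
The judge-vs-human agreement tally with its ≥0.80 gate, mentioned in the #654 message above, could look something like the sketch below. Everything here is hypothetical: `LabelledRun`, `agreementRate`, and `calibrationStatus` are illustrative names, not the contents of `judge/calibration.ts`, and counting a `null` judge verdict as a disagreement is an assumption.

```typescript
// Hypothetical sketch of an agreement tally; not the actual calibration.ts.
type Intent = "pass" | "fail" | null;

interface LabelledRun {
  runId: string;
  human: "pass" | "fail"; // operator-assigned label
  judge: Intent;          // judge verdict; null when the model output was malformed
}

// Fraction of runs where the judge matches the human label.
// Returns null for an empty corpus, since no rate is computable.
function agreementRate(runs: LabelledRun[]): number | null {
  if (runs.length === 0) return null;
  // Assumption: a null judge verdict never equals a human label, so it counts against the judge.
  const agreed = runs.filter((r) => r.judge === r.human).length;
  return agreed / runs.length;
}

type CalibrationStatus = "uncalibrated" | "calibrated" | "below gate";

// Applies the gate; with no corpus the status stays "uncalibrated".
function calibrationStatus(runs: LabelledRun[], gate = 0.8): CalibrationStatus {
  const rate = agreementRate(runs);
  if (rate === null) return "uncalibrated";
  return rate >= gate ? "calibrated" : "below gate";
}
```

With an empty corpus this reports "uncalibrated" rather than a pass or a fail, mirroring the parked-calibration state the reviewer notes describe.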

wmadden-electric and others added 7 commits May 30, 2026 16:14

Drive(scorecard-and-trace-inputs): slice spec + plan

1874f60

Signed-off-by: Will Madden <madden@prisma.io>

feat(drive-record-traces): add tokens-recorded + correctness-recorded…

4280e7c

… events Signed-off-by: Will Madden <madden@prisma.io>

feat(drive-diagnose-run): two-tier scorecard with honest not-computab…

8810988

…le verdict Signed-off-by: Will Madden <madden@prisma.io>

docs(drive): document tokens/correctness vocabulary + scorecard; regi…

6966f9e

…ster test Signed-off-by: Will Madden <madden@prisma.io>

wmadden requested a review from a team as a code owner May 30, 2026 16:12

wmadden merged commit 0fcead6 into main May 30, 2026
21 checks passed

wmadden deleted the tml-2720-scorecard-and-trace-inputs branch May 30, 2026 17:25

wmadden mentioned this pull request May 30, 2026

Drive: Judge + live-experiment harness — project shaping (spec, plan, design-notes) #638

Closed

wmadden-electric mentioned this pull request May 31, 2026

TML-2736: bespoke LLM judge — intent correctness signal + classifiers #654

Merged

4 tasks

wmadden merged 7 commits into main from tml-2720-scorecard-and-trace-inputs
