FE-813: Cook harness fidelity — trustworthy per-slice completion signal by kostandinang · Pull Request #170 · hashintel/brunch

kostandinang · 2026-06-04T16:01:21Z

Stack context

Stacked on #167 (FE-800). Implements the cook-harness-fidelity frontier: a trustworthy per-slice completion signal — the evaluator observes, it does not produce.

What

Slice 1 — evaluator read-only (d2139d8c): evaluate-done ran with read,write,edit,bash, so it could fix code during evaluation and report done on the first call — collapsing the write-tests → write-code → evaluate TDD loop (2026-05-26 brownfield smoke). Adds per-action tool scoping (toolsForAction).

Slice 2 — done = real test execution (fcba8ab3, hardened in d979a0aa): evaluate-done now executes the slice's verification targets (bun test) and gates done on real pass/fail (evaluateVerificationTargets: requires ≥1 target and every target passing), replacing the LLM verdict over criterion prose. The evaluator no longer dispatches pi at all. (d979a0aa extracted the runner into test-runner.ts with stderr capture.)

Scope / limit (read this)

This closes the evaluator-trust gap: done now requires real test execution, not an LLM rubber-stamp over prose. It does not yet guarantee an integrated feature. The verification targets are agent-authored against the emitter's synthesized target paths, so a weak / integration-blind target (e.g. a test of an orphan component in isolation) can still pass. Making done require real integration needs the emitter to emit integration-demanding targets — tracked as the spec-to-cook-plan (FE-800) integration-blind-verification follow-on.

Verification

New pi-actions.test.ts (tool scoping + gate logic) + test-runner.test.ts + npm run check pass. Two failures in the full suite (build-boundary.test.ts, app.test.ts) are inherited from the fe-800 base and a parallel-run flake respectively — not from this branch's diff (proven by re-running with this branch's changes stashed); see #167.

🤖 Generated with Claude Code

`evaluate-done` ran with `read,write,edit,bash`, so the evaluator could fix code during evaluation and report `done` on the first call — collapsing the write-tests → write-code → evaluate TDD loop (2026-05-26 brownfield smoke). Add per-action tool scoping (`toolsForAction`): the evaluator is read-only; code-producing actions keep the full toolset. Threads `tools` through `runPi`. Opens the `cook-harness-fidelity` frontier (Slice 1) and records it in PLAN. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

kostandinang · 2026-06-04T16:01:56Z

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

`evaluate-done` returned an LLM verdict over criterion prose — a soft oracle a standalone component or Ladle story could satisfy without the feature working. It now executes the slice's verification targets (`bun test`) and gates `done` on real pass/fail: `evaluateVerificationTargets` requires ≥1 target and every target passing. The evaluator no longer dispatches pi, so it cannot mutate the sandbox at all. Removes the now-dead `extractJson` helper. Completes the cook-harness-fidelity frontier (Slice 2). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

…stics `bun test` writes its results (failure detail, pass/fail counts) to stderr and only the version banner to stdout. The execSync failure path returned `err.stdout`, so every failing tests-run / epic-verified report carried just "bun test v1.3.14" with zero diagnostics (observed on cook run 289c9843). Switch to spawnSync and concatenate both streams; the arg-array form also drops shell interpolation of the target path. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

cursor · 2026-06-04T16:27:23Z

PR Summary

Medium Risk
Changes core cook TDD completion semantics and pi tool wiring; wrong gating or stderr handling could mark slices done incorrectly or hide failures, though new unit/integration tests target the main paths.

Overview
Cook harness fidelity (FE-813): per-slice done is no longer an LLM judgment — it comes from running verification targets with real pass/fail, and the evaluator path can no longer mutate the sandbox during evaluation.

evaluate-done stops dispatching pi with evaluator.md. It runs evaluateVerificationTargets, which requires at least one target and all targets passing, using bun test in the slice sandbox. Reports carry { done, results } per target instead of prose reasoning.

Per-action tool scoping via toolsForAction: evaluate-done gets read only; write-tests, write-code, and verify-epic keep read,write,edit,bash. This prevents the brownfield regression where the evaluator fixed code on the first pass and skipped the TDD loop.

Writer prompts are tightened so tests are the oracle (test-writer.md, code-writer.md); evaluator.md is removed.

BunTestRunner concatenates stdout and stderr so failing runs retain diagnostics in harness reports (regression from runs that only showed the version banner).

memory/PLAN.md documents the cook-harness-fidelity frontier and marks the evaluate-done TDD-collapse follow-on resolved.

^{Reviewed by Cursor Bugbot for commit 85753f7. Bugbot is set up for automated code reviews on this repo. Configure here.}

Slice 2 made the evaluator a deterministic oracle (it executes the verification targets; `done` = all pass), which makes the test-writer the load-bearing oracle: a test that passes without exercising the slice's behavior now marks broken code DONE. Port ln-build discipline into the two live writer prompts to close that gap: - test-writer: orient first (read + match the repo's conventions); one observable behavior per test mapped to an acceptance criterion; test through the public interface; make the red meaningful; ban trivially-passing tests (a false oracle under the deterministic gate). - code-writer: orient + inside-out build; no speculative abstraction; never weaken a test to go green; pre-release deletion posture. - code-writer: drop the hardcoded "TypeScript" — derive the language from the repo (the harness is meant to be host-agnostic; the remaining `bun test` coupling in test-writer is the executor's, to be removed with the ProjectProfile/toolchain adapter). Delete evaluator.md: dead since Slice 2 stopped dispatching pi for evaluation (verified zero references repo-wide). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

…ter-prompt slice The "pi-actions evaluate-done collapses the TDD workflow" follow-on (under cook-codebase-mode) was closed by FE-813 Slices 1+2 but still listed as open. Mark it resolved with commit pointers, and extend the FE-813 status with Slice 3 (writer-prompt hardening, 9fb5af1) and the still-unowned bun→host test-runner decoupling. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 85753f7. Configure here.}

cursor · 2026-06-04T21:25:14Z

-        });
+      const { done, results } = await evaluateVerificationTargets(ctx.slice.verification, (target) =>
+        runBunTest(target, ctx.sandboxDir),
+      );


Evaluate retry loop never halts

High Severity

evaluate-done now requires every verification target to pass, but the Petri retry path still only re-runs tests for the first target. When a later target keeps failing, run-tests can pass while evaluation stays false, so the slice cycles forever without hitting retry exhaustion.

^{Reviewed by Cursor Bugbot for commit 85753f7. Configure here.}

cursor · 2026-06-04T21:25:14Z

-    return JSON.parse(match[0]) as Record<string, unknown>;
-  } catch {
-    return undefined;
+    const { stdout } = await execAsync(`bun test ${target}`, {


Shell splits verification target path

Medium Severity

New runBunTest runs bun test via a shell string built from the target path. Paths with spaces or shell metacharacters are word-split or interpreted by the shell, so the wrong files run and pass/fail can be wrong.

^{Reviewed by Cursor Bugbot for commit 85753f7. Configure here.}

kostandinang mentioned this pull request Jun 4, 2026

FE-812: Completed-spec fixture for the plan emitter #168

Open

kostandinang changed the title ~~FE-813: Scope the cook evaluator to read-only tools~~ FE-813: Cook harness fidelity — trustworthy per-slice completion signal Jun 4, 2026

kostandinang and others added 2 commits June 4, 2026 18:10

kostandinang marked this pull request as ready for review June 4, 2026 16:27

kostandinang and others added 2 commits June 4, 2026 18:43

cursor Bot reviewed Jun 4, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FE-813: Cook harness fidelity — trustworthy per-slice completion signal#170

FE-813: Cook harness fidelity — trustworthy per-slice completion signal#170
kostandinang wants to merge 5 commits into
ka/fe-800-spec-to-cook-planfrom
ka/fe-813-cook-harness-fidelity

kostandinang commented Jun 4, 2026 •

edited

Loading

Uh oh!

kostandinang commented Jun 4, 2026

Uh oh!

cursor Bot commented Jun 4, 2026 •

edited

Loading

Uh oh!

cursor Bot left a comment

Uh oh!

cursor Bot Jun 4, 2026

Uh oh!

cursor Bot Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

kostandinang commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Stack context

What

Scope / limit (read this)

Verification

Uh oh!

kostandinang commented Jun 4, 2026

Uh oh!

cursor Bot commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Summary

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot Jun 4, 2026

Choose a reason for hiding this comment

Evaluate retry loop never halts

Uh oh!

cursor Bot Jun 4, 2026

Choose a reason for hiding this comment

Shell splits verification target path

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

kostandinang commented Jun 4, 2026 •

edited

Loading

cursor Bot commented Jun 4, 2026 •

edited

Loading