FE-813: Cook harness fidelity — trustworthy per-slice completion signal#170
FE-813: Cook harness fidelity — trustworthy per-slice completion signal#170kostandinang wants to merge 5 commits into
Conversation
`evaluate-done` ran with `read,write,edit,bash`, so the evaluator could fix code during evaluation and report `done` on the first call — collapsing the write-tests → write-code → evaluate TDD loop (2026-05-26 brownfield smoke). Add per-action tool scoping (`toolsForAction`): the evaluator is read-only; code-producing actions keep the full toolset. Threads `tools` through `runPi`. Opens the `cook-harness-fidelity` frontier (Slice 1) and records it in PLAN. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
`evaluate-done` returned an LLM verdict over criterion prose — a soft oracle a standalone component or Ladle story could satisfy without the feature working. It now executes the slice's verification targets (`bun test`) and gates `done` on real pass/fail: `evaluateVerificationTargets` requires ≥1 target and every target passing. The evaluator no longer dispatches pi, so it cannot mutate the sandbox at all. Removes the now-dead `extractJson` helper. Completes the cook-harness-fidelity frontier (Slice 2). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
…stics `bun test` writes its results (failure detail, pass/fail counts) to stderr and only the version banner to stdout. The execSync failure path returned `err.stdout`, so every failing tests-run / epic-verified report carried just "bun test v1.3.14" with zero diagnostics (observed on cook run 289c9843). Switch to spawnSync and concatenate both streams; the arg-array form also drops shell interpolation of the target path. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
PR SummaryMedium Risk Overview
Per-action tool scoping via Writer prompts are tightened so tests are the oracle (
Reviewed by Cursor Bugbot for commit 85753f7. Bugbot is set up for automated code reviews on this repo. Configure here. |
Slice 2 made the evaluator a deterministic oracle (it executes the verification targets; `done` = all pass), which makes the test-writer the load-bearing oracle: a test that passes without exercising the slice's behavior now marks broken code DONE. Port ln-build discipline into the two live writer prompts to close that gap: - test-writer: orient first (read + match the repo's conventions); one observable behavior per test mapped to an acceptance criterion; test through the public interface; make the red meaningful; ban trivially-passing tests (a false oracle under the deterministic gate). - code-writer: orient + inside-out build; no speculative abstraction; never weaken a test to go green; pre-release deletion posture. - code-writer: drop the hardcoded "TypeScript" — derive the language from the repo (the harness is meant to be host-agnostic; the remaining `bun test` coupling in test-writer is the executor's, to be removed with the ProjectProfile/toolchain adapter). Delete evaluator.md: dead since Slice 2 stopped dispatching pi for evaluation (verified zero references repo-wide). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
…ter-prompt slice The "pi-actions evaluate-done collapses the TDD workflow" follow-on (under cook-codebase-mode) was closed by FE-813 Slices 1+2 but still listed as open. Mark it resolved with commit pointers, and extend the FE-813 status with Slice 3 (writer-prompt hardening, 9fb5af1) and the still-unowned bun→host test-runner decoupling. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 85753f7. Configure here.
| }); | ||
| const { done, results } = await evaluateVerificationTargets(ctx.slice.verification, (target) => | ||
| runBunTest(target, ctx.sandboxDir), | ||
| ); |
There was a problem hiding this comment.
Evaluate retry loop never halts
High Severity
evaluate-done now requires every verification target to pass, but the Petri retry path still only re-runs tests for the first target. When a later target keeps failing, run-tests can pass while evaluation stays false, so the slice cycles forever without hitting retry exhaustion.
Reviewed by Cursor Bugbot for commit 85753f7. Configure here.
| return JSON.parse(match[0]) as Record<string, unknown>; | ||
| } catch { | ||
| return undefined; | ||
| const { stdout } = await execAsync(`bun test ${target}`, { |
There was a problem hiding this comment.
Shell splits verification target path
Medium Severity
New runBunTest runs bun test via a shell string built from the target path. Paths with spaces or shell metacharacters are word-split or interpreted by the shell, so the wrong files run and pass/fail can be wrong.
Reviewed by Cursor Bugbot for commit 85753f7. Configure here.



Stack context
Stacked on #167 (FE-800). Implements the
cook-harness-fidelityfrontier: a trustworthy per-slice completion signal — the evaluator observes, it does not produce.What
Slice 1 — evaluator read-only (
d2139d8c):evaluate-doneran withread,write,edit,bash, so it could fix code during evaluation and reportdoneon the first call — collapsing the write-tests → write-code → evaluate TDD loop (2026-05-26 brownfield smoke). Adds per-action tool scoping (toolsForAction).Slice 2 — done = real test execution (
fcba8ab3, hardened ind979a0aa):evaluate-donenow executes the slice's verification targets (bun test) and gatesdoneon real pass/fail (evaluateVerificationTargets: requires ≥1 target and every target passing), replacing the LLM verdict over criterion prose. The evaluator no longer dispatchespiat all. (d979a0aaextracted the runner intotest-runner.tswith stderr capture.)Scope / limit (read this)
This closes the evaluator-trust gap:
donenow requires real test execution, not an LLM rubber-stamp over prose. It does not yet guarantee an integrated feature. The verification targets are agent-authored against the emitter's synthesized target paths, so a weak / integration-blind target (e.g. a test of an orphan component in isolation) can still pass. Makingdonerequire real integration needs the emitter to emit integration-demanding targets — tracked as thespec-to-cook-plan(FE-800) integration-blind-verification follow-on.Verification
New
pi-actions.test.ts(tool scoping + gate logic) +test-runner.test.ts+npm run checkpass. Two failures in the full suite (build-boundary.test.ts,app.test.ts) are inherited from the fe-800 base and a parallel-run flake respectively — not from this branch's diff (proven by re-running with this branch's changes stashed); see #167.🤖 Generated with Claude Code