💡 Adversarial Plan-Critic + Structural Gates for the Initiative-Planner (Bob) #593

don-petry · 2026-06-11T23:27:05Z

don-petry
Jun 11, 2026
Maintainer

Summary

Add a quality layer to the initiative-planner (Bob) so the plans it auto-generates are adversarially reviewed and structurally safe before they materialize as epics. Today the planner runs plan → validate-plan.py → apply-plan.sh, but validate-plan.py only checks DAG structure (unique ids, acyclic, no dangling edges) — nothing checks semantic quality, and apply-plan.sh has latent gaps (no idempotency, open-questions emitted as a comment while every story is stamped ready-for-dev, dependency edges that can't represent non-issue prerequisites). The proposal: an adversarial plan-critic pass against a fixed rubric before materialize, plus a handful of structural fixes (idempotency guard, plan/apply separation, open-questions-as-gate, grounding checks). This is the same eval-gated discipline Epic #581 proposes for skills — applied to the planner that produced #581.
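To make the gap concrete, here is a minimal sketch of the kind of structural check described above (unique ids, no dangling edges, acyclic). It is not the actual validate-plan.py; the field names `stories`, `id`, `depends_on`, and `acceptance_criteria` are assumptions made for illustration. The point is that a plan can pass all three checks while still carrying a contested decision inside an acceptance criterion.

```python
from collections import deque


def validate_structure(plan: dict) -> list[str]:
    """Return structural errors only: duplicate ids, dangling edges, cycles."""
    errors = []
    stories = plan.get("stories", [])  # assumed field name

    # 1. Unique ids
    seen = set()
    for story in stories:
        if story["id"] in seen:
            errors.append(f"duplicate id: {story['id']}")
        seen.add(story["id"])

    # 2. No dangling edges (every dependency must name a known story)
    indegree = {sid: 0 for sid in seen}
    dependents = {sid: [] for sid in seen}
    for story in stories:
        for dep in story.get("depends_on", []):  # assumed field name
            if dep not in seen:
                errors.append(f"dangling edge: {story['id']} -> {dep}")
            else:
                indegree[story["id"]] += 1
                dependents[dep].append(story["id"])

    # 3. Acyclic (Kahn's algorithm: a cycle leaves some nodes unvisited)
    queue = deque(sid for sid, degree in indegree.items() if degree == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for nxt in dependents[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    if visited != len(seen):
        errors.append("cycle detected")
    return errors


# Structurally valid, semantically questionable: the AC bakes in an open question.
plan = {
    "stories": [
        {"id": "s1", "depends_on": [],
         "acceptance_criteria": ["Use approach X (still an open question)"]},
        {"id": "s2", "depends_on": ["s1"]},
    ]
}
print(validate_structure(plan))  # []
```

Nothing in this check reads `acceptance_criteria`, which is the part a rubric would have to cover.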

Market Signal

Generator → critic → revise is the highest-ROI quality pattern for production agent output, and "structure-valid ≠ good" is a known trap:

Reflective / critic-driven optimization outperforms naive single-pass generation: GEPA's reflect-and-revise loop beats GRPO-style reinforcement learning while using up to 35× fewer rollouts (GEPA 2507.19457, DSPy GEPA), and the same actor-critic/skill-library lineage drove Voyager's gains (Voyager).
Eval gates belong in the CI path, not just the schema: 2026 production guidance treats an explicit quality gate (LLM-judge / rubric) as a required step before an agent artifact ships, distinct from structural validation (AI-native CI/CD eval gates).
Agent-authored artifacts need a review discipline: GitHub's own guidance on the flood of agent-authored PRs is that a structured human/automated review gate — not raw acceptance — is what keeps quality up (reviewing agent PRs). A "plan then apply" split with a review point between is the same idea infrastructure-as-code settled on years ago.

Hype filter: the shippable piece is a bounded single critic pass against a fixed rubric plus deterministic structural guards — not an open-ended self-reflection loop. The structural fixes aren't even LLM work.

User Signal

This came directly out of the planner's first real exercise. Planning "Idea: SkillOpt-style self-improving skills" into Epic #581 surfaced concrete, repeatable weaknesses in the automation itself:

No semantic gate: a hand rubber-duck review of Epic #581 ("Initiative: Eval-gated, human-reviewed self-improving skills (SkillOpt-style)") found six real issues, including a contested open question baked into a story's acceptance criteria, no initiative-level success metric, no cost cap, and a reward-hacking/overfitting hole. None is catchable by validate-plan.py; all are catchable by a rubric.
Latent duplication bug: apply-plan.sh creates the epic unconditionally, so a second dispatch for the same idea creates a duplicate epic + DAG (the concurrency group only blocks concurrent runs). Confirmed in scripts/initiative-planner/apply-plan.sh.
Open-questions are decorative: Bob raised four good questions but every story still ships ready-for-dev with the contested decisions baked into ACs, so they flow straight to dev-lead once initiative:auto is added.
Dry-run ≠ real: the real run re-plans from scratch instead of consuming the dry-run's plan.json, so the preview doesn't bind the result.
Young pipeline: the planner is new (PR #567, "feat(initiatives): idea-triage + BMAD Scrum Master initiative-planner") and already needed two infra fixes this week: the discussion-event break in issue #591 (claude-code-action rejects discussion events, so the idea:approved auto-trigger never plans) and a gather-context.sh ref bug. That is exactly when an automated quality gate pays off most.
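The duplicate-epic failure mode described above comes down to a missing lookup before creation. The sketch below shows one way such a lookup could work, assuming the open issues have already been fetched as dictionaries with `number` and `body` keys and that the epic body contains the literal back-reference text "Planned from idea discussion #N". The function name `find_existing_epic` and the sample issue numbers are hypothetical, and this is not code from apply-plan.sh.

```python
import re
from typing import Optional


def find_existing_epic(open_issues: list[dict], discussion_number: int) -> Optional[int]:
    """Return the number of an open epic already planned from this discussion, if any."""
    # \b stops "#59" from matching inside "#593".
    pattern = re.compile(rf"Planned from idea discussion #{discussion_number}\b")
    for issue in open_issues:
        if pattern.search(issue.get("body") or ""):
            return issue["number"]
    return None


# Hypothetical data: issue 20 was planned from discussion 10.
issues = [
    {"number": 20, "body": "Epic body...\n\nPlanned from idea discussion #10"},
    {"number": 21, "body": None},
]

existing = find_existing_epic(issues, 10)
if existing is not None:
    print(f"epic #{existing} already exists; skip or update instead of creating")
```

A second dispatch for the same discussion would then take the skip-or-update branch instead of creating another epic. Because the concurrency group only blocks concurrent runs, the lookup has to happen on every run.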

Technical Opportunity

All additive, mostly in this repo's scripts/initiative-planner/:

1. Adversarial critic pass (Tier 1): a second pass (a second claude-code-action step or a second turn in Bob's prompt) reviews the draft plan.json against a fixed rubric: contested-AC, missing success metric, missing cost cap, untracked prerequisites, story reviewability, and (for eval/optimization stories) overfitting + artifact-immutability. It emits structured findings, and Bob revises before apply-plan.sh.
2. Idempotency guard (Tier 1, non-LLM): before creating the epic, search open issues for the discussion back-reference the epic body already embeds (Planned from idea discussion #N) and skip-or-update.
3. Plan/apply split (Tier 2): dry-run emits the authoritative plan.json; a human may review/edit it; the apply run loads that exact artifact (apply-plan.sh already reads PLAN_PATH) instead of re-planning.
4. Open-questions-as-gate (Tier 2): extend plan.schema.json so each open-question carries affected_story_ids; apply-plan.sh stamps those stories planning:needs-input, withholds ready-for-dev, and labels the epic so a maintainer resolves them before initiative:auto.
5. Grounding check (Tier 2): verify every references / target_surface path resolves in the checkout; flag hallucinated anchors.
6. Express non-issue prerequisites: a schema field for discussion/external prereqs that renders as an explicit "untracked prerequisites" checklist on the epic instead of a buried dev-note.
7. Rubric-in-prompt (Tier 3): a near-zero-cost partial version of item 1: bake the checklist into Bob's prompt even before a separate critic exists.
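Two of the deterministic guards in the list above, open-questions-as-gate and the grounding check, are small enough to sketch. The code below is an illustration under assumptions, not the planned implementation: the plan shape (`stories`, `open_questions`, list-valued `references` and `target_surface`) is hypothetical, and the grounding function treats every entry as a repo-relative path, so it does not attempt to tell paths apart from prose or URL citations.

```python
from pathlib import Path

READY = "ready-for-dev"
NEEDS_INPUT = "planning:needs-input"


def story_labels(plan: dict) -> dict[str, str]:
    """Map each story id to the label it would be stamped with."""
    # Any story named by an open question is held back from ready-for-dev.
    blocked = {
        story_id
        for question in plan.get("open_questions", [])  # assumed field name
        for story_id in question.get("affected_story_ids", [])
    }
    return {
        story["id"]: NEEDS_INPUT if story["id"] in blocked else READY
        for story in plan.get("stories", [])
    }


def unresolved_paths(plan: dict, repo_root: Path) -> list[str]:
    """Return references / target_surface entries that do not exist under repo_root."""
    missing = []
    for story in plan.get("stories", []):
        for field in ("references", "target_surface"):
            for rel in story.get(field, []):  # assumed to be lists of relative paths
                if not (repo_root / rel).exists():
                    missing.append(f"{story['id']}: {rel}")
    return missing


# Hypothetical plan: one open question blocks story s1 only.
plan = {
    "stories": [{"id": "s1"}, {"id": "s2"}],
    "open_questions": [
        {"question": "Which approach?", "affected_story_ids": ["s1"]},
    ],
}
print(story_labels(plan))
```

Both functions are pure and need no network access, which is what makes them easy to cover in a test suite before any LLM step is involved.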

Assessment

| Dimension | Score | Rationale |
| --- | --- | --- |
| Feasibility | med | The structural fixes (#2, #4, #5) are small, deterministic, test-coverable changes to existing scripts; the critic pass (#1) is one more bounded agent step. No new infrastructure. |
| Impact | high | Raises the quality of every future initiative, at the pipeline's leverage point: one fix improves all downstream dev-lead work. #2 also closes a latent data-corruption bug. |
| Urgency | med | No fire, but the planner is young and actively used; each unreviewed plan that reaches `initiative:auto` compounds. Cheapest fixes (#2, #7) are worth doing now. |

Adversarial Review

Strongest objection: A critic pass doubles the planner's cost and latency, and a second LLM reviewing the first is unreliable — it may rubber-stamp, hallucinate findings, or spawn an unbounded self-review loop. We'd be adding AI to check AI.

Rebuttal: The critic is deliberately bounded: a single pass against a fixed rubric emitting structured findings, then one revise; no loop. Its cost is small next to that of a flawed initiative reaching dev-lead and generating wrong PRs across the org. Reliability is bounded the same way LLM-judges are in production eval gates: a fixed rubric and structured output, with the human initiative:auto gate still the final say. Crucially, the highest-value fixes here (#2 idempotency, #4 open-questions-as-gate, #5 grounding) are not LLM work at all; they are deterministic guards, and #7 (rubric-in-prompt) captures much of the critic's value at near-zero cost. So the proposal degrades gracefully: even if the critic pass is deferred, the structural fixes stand alone and are strictly safer than today.
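The "bounded" claim in the rebuttal can be expressed as control flow: one critic call, findings filtered to the fixed rubric, at most one revise call, and no loop. The sketch below assumes the critic and reviser are injected as plain callables (in practice they would wrap an LLM step). The rubric ids are hypothetical slugs derived from the rubric items named in this proposal, and `Finding` and `critique_once` are illustrative names.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Fixed rubric: findings outside this set are discarded rather than acted on.
RUBRIC = (
    "contested-ac",
    "missing-success-metric",
    "missing-cost-cap",
    "untracked-prerequisites",
    "story-reviewability",
    "overfitting",
    "artifact-immutability",
)


@dataclass(frozen=True)
class Finding:
    rubric_id: str
    story_id: Optional[str]  # None for initiative-level findings
    note: str


def critique_once(
    plan: dict,
    critic: Callable[[dict, tuple], list],
    reviser: Callable[[dict, list], dict],
) -> tuple:
    """Run exactly one critic pass and at most one revise. No loop."""
    raw = critic(plan, RUBRIC)
    # Drop anything the critic invented outside the fixed rubric.
    findings = [f for f in raw if f.rubric_id in RUBRIC]
    if not findings:
        return plan, findings
    return reviser(plan, findings), findings


# Stub critic and reviser, standing in for the LLM-backed steps.
def stub_critic(plan: dict, rubric: tuple) -> list:
    return [
        Finding("missing-cost-cap", None, "initiative has no cost cap"),
        Finding("not-in-rubric", "s1", "off-rubric opinion"),
    ]


def stub_reviser(plan: dict, findings: list) -> dict:
    return {**plan, "cost_cap": "to be set by maintainer"}


revised, kept = critique_once({"stories": []}, stub_critic, stub_reviser)
print([f.rubric_id for f in kept])  # ['missing-cost-cap']
```

Filtering on `rubric_id` is one simple way to limit hallucinated findings; it does not address rubber-stamping, which is why the human initiative:auto gate remains the final say.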

Suggested Next Step

Sequence by cost: land the two non-LLM structural fixes first — idempotency guard (#2) and open-questions-as-gate (#4) — plus the prompt rubric (#7), since they're cheap, deterministic, and independently testable against the existing tests/test_initiative_planner.bats. Then pilot the adversarial critic pass (#1), using the #581 plan as a regression fixture (it should flag the six findings the hand review found). Treat plan/apply separation (#3) as a follow-on once the critic is trusted.

2026-06-11T23:44:00Z

github-actions[bot]
Bot Jun 11, 2026

📋 Initiative planned by the BMAD Scrum Master (Bob).

Epic #597 — Initiative: Adversarial plan-critic + structural gates for the initiative-planner (Bob)

7 stories created (inert — labelled initiative, NOT initiative:auto):

#598 (M) - [Phase 1] Idempotency guard: never create a duplicate epic for the same idea
#599 (S) - [Phase 1] Bake the fixed critic rubric into Bob's planning prompt
#600 (M) - [Phase 1] Open-questions-as-gate: withhold ready-for-dev on affected stories
#601 (S) - [Phase 1] Express untracked (non-issue) prerequisites as an epic checklist
#602 (M) - [Phase 2] Grounding check: verify references/target_surface paths resolve
#603 (L) - [Phase 2] Adversarial plan-critic pass against the fixed rubric
#604 (M) - [Phase 3] Plan/apply split: apply consumes the dry-run's authoritative plan.json

Open questions for review:

Critic wiring (story 6): a second claude-code-action step vs a second turn in Bob's existing prompt — which does the team prefer? It affects cost accounting and the workflow shape.
Gate label names (story 3): confirm/create planning:needs-input (story) and initiative:needs-input (epic) in the repo label set, or reuse initiative:hold — REST issue creation fails on unknown labels.
Grounding heuristic (story 5): the exact rule for distinguishing a repo-relative file path from a prose/URL citation in references/target_surface, to avoid false failures.
Plan/apply handoff (story 7): the mechanism for feeding the reviewed dry-run plan.json into the apply run (artifact download by run id vs a committed file vs a new workflow input).
Initiative-level success metric + cost cap (rubric, stories 2/6): the rubric requires every initiative to carry a success metric and a cost cap — should these become first-class schema fields on the epic, or remain rubric-enforced prose?

Review the epic and its sub-issue DAG, adjust as needed, then add initiative:auto to epic #597 to hand it to initiative-driver for auto-implementation.
