Replies: 1 comment
📋 Initiative planned by the BMAD Scrum Master (Bob). Epic #597 — Initiative: Adversarial plan-critic + structural gates for the initiative-planner (Bob) 7 stories created (inert — labelled
Open questions for review:
Review the epic and its sub-issue DAG, adjust as needed, then add |
Summary
Add a quality layer to the initiative-planner (Bob) so the plans it auto-generates are adversarially reviewed and structurally safe before they materialize as epics. Today the planner runs plan → validate-plan.py → apply-plan.sh, but validate-plan.py only checks DAG structure (unique ids, acyclic, no dangling edges). Nothing checks semantic quality, and apply-plan.sh has latent gaps: no idempotency, open-questions emitted as a comment while every story is stamped ready-for-dev, and dependency edges that can't represent non-issue prerequisites. The proposal: an adversarial plan-critic pass against a fixed rubric before materialize, plus a handful of structural fixes (idempotency guard, plan/apply separation, open-questions-as-gate, grounding checks). This is the same eval-gated discipline Epic #581 proposes for skills, applied to the planner that produced #581.
Market Signal
Generator → critic → revise is the highest-ROI quality pattern for production agent output, and "structure-valid ≠ good" is a known trap:
Hype filter: the shippable piece is a bounded single critic pass against a fixed rubric plus deterministic structural guards — not an open-ended self-reflection loop. The structural fixes aren't even LLM work.
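The gap between "structure-valid" and "good", and the bounded shape of the critic pass, can be sketched as follows. This is a minimal illustration, not the repository's code: the plan layout (`stories`, `id`, `depends_on`, `success_metric`), the `Finding` record, and the stand-in critic are all assumptions made for the example, and in the proposal the critic and reviser would be LLM turns rather than plain functions.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Finding:
    """One structured critic finding (hypothetical shape)."""
    rubric_item: str
    story_id: str
    detail: str


def structurally_valid(plan: dict) -> bool:
    """Structure-only checks: unique ids, no dangling edges, acyclic."""
    stories = plan["stories"]
    ids = [s["id"] for s in stories]
    if len(ids) != len(set(ids)):
        return False  # duplicate story id
    deps = {s["id"]: set(s.get("depends_on", [])) for s in stories}
    known = set(deps)
    if any(d - known for d in deps.values()):
        return False  # edge points at a story that does not exist
    resolved: set[str] = set()
    while len(resolved) < len(deps):
        # A story is ready once all of its dependencies are resolved.
        ready = {i for i, d in deps.items() if i not in resolved and d <= resolved}
        if not ready:
            return False  # nothing can make progress, so there is a cycle
        resolved |= ready
    return True


def review_once(
    plan: dict,
    critic: Callable[[dict], list[Finding]],
    revise: Callable[[dict, list[Finding]], dict],
) -> tuple[dict, list[Finding]]:
    """Exactly one critic pass and at most one revise; the result is not re-critiqued."""
    findings = critic(plan)
    revised = revise(plan, findings) if findings else plan
    return revised, findings


def missing_metric_critic(plan: dict) -> list[Finding]:
    """Stand-in for one rubric item: flag stories with no success metric."""
    return [
        Finding("missing-success-metric", s["id"], "story has no success_metric")
        for s in plan["stories"]
        if not s.get("success_metric")
    ]
```

A plan whose stories form a clean DAG passes `structurally_valid` even when none of them states a success metric; only the rubric pass surfaces that. Because `review_once` contains no loop, the cost of the critic is bounded by construction.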
User Signal
This came directly out of the planner's first real exercise. Planning Idea: SkillOpt-style self-improving skills into Epic #581 surfaced concrete, repeatable weaknesses in the automation itself:
- … validate-plan.py, all catchable by a rubric.
- … apply-plan.sh creates the epic unconditionally, so a second dispatch for the same idea creates a duplicate epic + DAG (the concurrency group only blocks concurrent runs). Confirmed in scripts/initiative-planner/apply-plan.sh.
- … ready-for-dev with the contested decisions baked into ACs, so they flow straight to dev-lead once initiative:auto is added.
- … plan.json, so the preview doesn't bind the result.
- … discussion-event break in bug(initiative-planner): claude-code-action rejects discussion events — the idea:approved auto-trigger never plans #591 and a gather-context.sh ref bug), exactly when an automated quality gate pays off most.
Technical Opportunity
All additive, mostly in this repo's scripts/initiative-planner/:
1. … claude-code-action step or a second turn in Bob's prompt) reviews the draft plan.json against a fixed rubric (contested-AC, missing success metric, missing cost cap, untracked prerequisites, story reviewability, and, for eval/optimization stories, overfitting + artifact-immutability), emits structured findings, and Bob revises before apply-plan.sh.
2. … Planned from idea discussion #N) and skip-or-update.
3. … plan.json; a human may review/edit it; the apply run loads that exact artifact (apply-plan.sh already reads PLAN_PATH) instead of re-planning.
4. … plan.schema.json so each open-question carries affected_story_ids; apply-plan.sh stamps those stories planning:needs-input, withholds ready-for-dev, and labels the epic so a maintainer resolves them before initiative:auto.
5. … references/target_surface path resolves in the checkout; flag hallucinated anchors.
Assessment
… initiative:auto compounds. Cheapest fixes (#2, #7) are worth doing now.
Adversarial Review
Strongest objection: A critic pass doubles the planner's cost and latency, and a second LLM reviewing the first is unreliable — it may rubber-stamp, hallucinate findings, or spawn an unbounded self-review loop. We'd be adding AI to check AI.
Rebuttal: The critic is deliberately bounded: a single pass against a fixed rubric emitting structured findings, then one revise; no loop. Its cost is trivial compared with that of a flawed initiative reaching dev-lead and generating wrong PRs across the org. Reliability is bounded the same way LLM-judges are in production eval gates: a fixed rubric and structured output, with the human initiative:auto gate still the final say. Crucially, the highest-value fixes here (#2 idempotency, #4 open-questions-as-gate, #5 grounding) are not LLM work at all; they are deterministic guards. And #7 (rubric-in-prompt) captures much of the critic's value at near-zero cost. So the proposal degrades gracefully: even if the critic pass is deferred, the structural fixes stand alone and are strictly safer than today.
Suggested Next Step
Sequence by cost: land the two non-LLM structural fixes first, the idempotency guard (#2) and open-questions-as-gate (#4), plus the prompt rubric (#7), since they're cheap, deterministic, and independently testable against the existing tests/test_initiative_planner.bats. Then pilot the adversarial critic pass (#1), using the #581 plan as a regression fixture (it should flag the six findings the hand review found). Treat plan/apply separation (#3) as a follow-on once the critic is trusted.
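The two deterministic guards sequenced first above might look roughly like this. It is a sketch under stated assumptions rather than the actual apply-plan.sh logic: epics are modelled as in-memory dicts with a `body` field, the function names are hypothetical, and only the marker text `Planned from idea discussion #N`, the `affected_story_ids` field, and the `planning:needs-input` / `ready-for-dev` labels come from the proposal itself.

```python
import re


def find_existing_epic(epics, discussion_number):
    """Idempotency guard: return the epic already planned from this discussion, if any.

    `epics` is assumed to be a list of dicts with a `body` string.
    """
    # \b keeps discussion #58 from matching a marker that names #581.
    marker = re.compile(rf"Planned from idea discussion #{discussion_number}\b")
    for epic in epics:
        if marker.search(epic.get("body") or ""):
            return epic
    return None


def gate_stories(story_ids, open_questions):
    """Open-questions-as-gate: withhold ready-for-dev from affected stories.

    Returns (label per story id, whether the epic still needs maintainer input).
    """
    blocked = {
        sid
        for question in open_questions
        for sid in question.get("affected_story_ids", [])
    }
    labels = {
        sid: ("planning:needs-input" if sid in blocked else "ready-for-dev")
        for sid in story_ids
    }
    epic_needs_input = "planning:needs-input" in labels.values()
    return labels, epic_needs_input
```

When `find_existing_epic` returns an epic, the caller would skip or update instead of creating a duplicate. Both functions are pure, so they can be tested on fixture data without touching GitHub.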