
OBS-09: Rule audit across agents and decompose skills #2

@b-coman

Description


Origin

E2E Round 3, OBS-09 — Design Meta (see reports/e2e-003-ido4shape-cloud.md lines 407–453).

Surfaced while scoping OBS-07 and OBS-08 fixes: the proposed fixes were themselves over-prescriptive toward the agents — piling on rules, enforcement lists, detection patterns, and forbidden-term tables. The irony: OBS-07 complains that technical-spec-writer over-prescribes to implementer agents, then proposes to fix it by over-prescribing to technical-spec-writer.

Problem

Agent definitions accumulate rules until the cumulative effect is harmful:

  • Cognitive overload — every rule is a constraint the agent checks; past a threshold, agents spend more energy rule-checking than problem-solving
  • Rule conflicts — "be specific" vs. "leave room for implementation" point opposite directions without clear hierarchy
  • Literal-mindedness — agents following many rules become rule-followers, checking the letter and missing the spirit
  • Reduced judgment — more explicit rules = less agent thinking (exactly the failure mode OBS-07 identifies)
  • Diminishing returns — each new rule is marginally less effective; after ~7-10 rules per agent, marginal value turns negative

Additionally, Anthropic's Claude 4 best practices explicitly warn that MUST/NEVER/ALWAYS capitalized imperatives cause overtriggering on Opus 4.5/4.6: "The fix is to dial back any aggressive language. Where you might have said 'CRITICAL: You MUST use this tool when...', you can use more normal prompting like 'Use this tool when...'." Our agent definitions are full of this language. The rigidity observed in Round 3's tech-spec output may be partly caused by our own agent definitions.

Current rule inventory

| File | Lines | Rules / enforcement points | Shape |
| --- | --- | --- | --- |
| agents/code-analyzer.md | 227 | 7 numbered rules + 3 mode-specific instruction blocks | Mix of hard constraints (rule 3 context preservation, rule 7 Read-not-cat) and judgment calls (rule 4 "be honest", rule 5 "don't design solutions") |
| agents/technical-spec-writer.md | 218 | 6 numbered rules + Goldilocks section + Metadata sub-sections + Structure sub-section + 6-step Process with embedded rules (Step 5 alone has 7 items) | Heaviest accumulator. Unclear which rules are load-bearing |
| agents/spec-reviewer.md | 95 | Stage 1 (9 format checks) + Stage 2 (8 quality checks) + Governance Implications (4 items) + 3-tier classification | Dense enumeration. Format checks are legitimate (parser compliance). Quality checks are mostly judgment calls shaped as rules |
| skills/decompose/SKILL.md | 168 | Post-round-3 inline-execution rewrite | Watch for rule accumulation from Phase 1 fix |
| skills/decompose-tasks/SKILL.md | 136 | Same pattern | Same concern |
| skills/decompose-validate/SKILL.md | 179 | Same pattern | Same concern |

Not in scope: agents/project-manager/AGENT.md (384 lines, not tested yet — deferred to Round 5+).

Audit methodology — enforcement inventory

For each rule or rule-shaped item, classify into one of three cases:

Case A — Double-enforced (delete from prose)

Rule exists in agent/skill prose AND in a downstream validator (parser, spec-reviewer, ingest_spec).
Example: "task refs match `[A-Z]{2,5}-\d{2,3}[A-Z]?`" is in technical-spec-writer.md rule 5 AND in spec-reviewer.md Stage 1 AND in spec-parser.ts.
Action: Delete from writer prose. The parser + reviewer enforce it.
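The Case A pattern — the same constraint living in prose and in a downstream validator — can be illustrated with a minimal sketch. The regex is the one quoted above; the function name and shape are hypothetical, not spec-parser.ts's actual API:

```typescript
// Hypothetical sketch of the downstream check. Because a validator like this
// already runs on every spec, restating the pattern in writer prose is
// redundant (Case A) and can be deleted there.
const TASK_REF = /^[A-Z]{2,5}-\d{2,3}[A-Z]?$/;

function invalidTaskRefs(refs: string[]): string[] {
  // Returns the refs that fail the pattern, for the reviewer to flag.
  return refs.filter((ref) => !TASK_REF.test(ref));
}
```

For example, `invalidTaskRefs(["AUTH-02", "STOR-01A", "bad-ref"])` flags only `"bad-ref"`.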

Case B — Prose-only, genuinely qualitative (reshape as principle + example)

No parser can check this. It's a judgment call.
Example: "be honest about complexity."
Action: Keep in prose. Reshape from rule → principle + good/bad example pair.

Case C — Prose-only but enforceable (keep prose + consider enforcement)

Today it's prose-only, but a reviewer/hook/schema could catch violations.
Example: "preserve strategic context verbatim."
Action: Keep in prose (reshaped as explanation). Optionally build enforcement (see #6).
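The three cases and their default actions could be modeled as a small type for the enforcement inventory (Deliverable 1). This is a sketch; the field names and wording are assumptions, not an existing schema:

```typescript
// Illustrative model of one row in the enforcement inventory.
type EnforcementCase = "A" | "B" | "C";

interface InventoryEntry {
  file: string;             // e.g. "agents/technical-spec-writer.md"
  ruleText: string;         // the rule as currently written
  enforcementCase: EnforcementCase;
  downstreamCheck?: string; // Case A/C: the parser, reviewer, or hook that catches it
}

// Default action per case, mirroring the methodology above.
function actionFor(c: EnforcementCase): string {
  switch (c) {
    case "A": return "delete from prose (validator already enforces it)";
    case "B": return "reshape as principle + good/bad example";
    case "C": return "keep in prose, reshape language, consider building enforcement";
  }
}
```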

Per-rule audit criteria

  • Hard constraint or judgment call?
  • Redundant with another rule? → Consolidate
  • Conflicts with another rule? → Resolve with explicit priority, or remove the weaker
  • Can be replaced with an example that teaches the same thing? → Prefer example
  • Aspirational or actually affects behavior? → Remove if aspirational
  • Uses enforcement language (MUST, NEVER, ALWAYS) for something that's actually a judgment call? → Downgrade to principle

Empirical evidence informing the audit

Verbatim preservation (code-analyzer rule 3) — WORKING

Traced 5 capabilities (AUTH-02, STOR-01, PLUG-02, VIEW-05, PROJ-01) through all 3 artifacts:

  • Strategic spec → Canvas: verbatim preservation across all 5 capabilities
  • Canvas → Technical spec: first 1-2 sentences verbatim, then code context added via Per canvas: and Per D9: citations
  • Success conditions: preserved verbatim in canvas, migrated to task-level in tech spec
  • Group context: reliably included as "Part of [Group] ([priority])" in tech spec
  • Verdict: Rule 3 stays as a hard rule (Case C). Reshape language only: drop the all-caps "MUST carry forward verbatim" phrasing and add an explanation of why preservation matters.

Over-prescription in tech spec output — CONFIRMED

7 sampled tasks (PLAT-01A, PLAT-01B, PLAT-01D, STOR-01A, STOR-01B, PLUG-02A, PLUG-02B) show:

  • File paths and function signatures fully dictated
  • Directory structures enumerated
  • Config values and algorithms pinned
  • "Decisions" pre-made in parentheses
  • This correlates with the agent definitions' own over-prescriptive rule language

Deliverables

  1. Enforcement inventory table — per-rule classification (A/B/C) with rationale, returned for user sign-off before edits
  2. Rule edits — delete Case A duplicates, reshape Case B as principle + example, reshape Case C language
  3. Language pass — sweep all edited files for MUST/NEVER/ALWAYS/CRITICAL/IMPORTANT all-caps; replace with explanation + normal-case
  4. Rule reordering — identity/context at the top, process in the middle, hard rules and principles near the end
  5. Frontmatter alignment — fix format gaps discovered during Anthropic spec investigation (see Align skill and agent frontmatter with Anthropic's current spec #7)
  6. Regression safety log — for every rule removed/downgraded, document: original text, failure it prevented, which layer now catches it, whether Round 5 calibration should watch for it
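The language pass (Deliverable 3) can be sketched as a small helper that flags lines for manual rewriting. This is illustrative only — the function name and word list are assumptions, and a plain grep over the edited files would do equally well:

```typescript
// Flag all-caps imperatives so they can be rewritten by hand.
// Word list matches the sweep described in Deliverable 3.
const IMPERATIVES = /\b(MUST|NEVER|ALWAYS|CRITICAL|IMPORTANT)\b/;

function flagImperatives(markdown: string): { line: number; text: string }[] {
  return markdown
    .split("\n")
    .map((text, i) => ({ line: i + 1, text }))
    .filter(({ text }) => IMPERATIVES.test(text));
}
```

Note the matches are flagged, not auto-replaced: each hit needs a human rewrite into explanation plus normal-case phrasing, per the Claude 4 guidance quoted earlier.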

Acceptance criteria

  • Enforcement inventory table produced and user-approved before edits
  • Net rule count is reduced or equal (not increased) across all audited files
  • No MUST/NEVER/ALWAYS all-caps language remains in qualitative (Case B) rules
  • Hard rules (Case A/C) are reshaped with explanation of WHY, not just commandments
  • Every deleted rule has a regression-safety entry in the e2e report
  • `claude plugin validate .` passes after all edits

Dependencies

References

Labels

  • agent (Agent definition modification)
  • audit (Code/rule audit work)
  • e2e-obs (Traced from E2E test observation)
  • round-4 (v0.8.0 design refinement package)
