Skip to content

refactor(panel): rebuild apm-review-panel as advisory, scannable, and self-healing#1093

Merged
danielmeppiel merged 7 commits intomainfrom
refactor/panel-advisory-regime
May 1, 2026
Merged

refactor(panel): rebuild apm-review-panel as advisory, scannable, and self-healing#1093
danielmeppiel merged 7 commits intomainfrom
refactor/panel-advisory-regime

Conversation

@danielmeppiel
Copy link
Copy Markdown
Collaborator

@danielmeppiel danielmeppiel commented May 1, 2026

refactor(panel): rebuild apm-review-panel as advisory, scannable, and self-healing

TL;DR

The apm-review-panel skill posted binary panel-rejected verdicts on PRs whose real signal was a handful of recommended improvements plus polish nits. This PR rebuilds the panel as an advisory regime: the CEO returns a ship_recommendation.stance (prose, never a label), panelists rank findings on a three-bucket severity scale that the orchestrator never reads as a gate, and the comment template top-loads the busy-maintainer view that PR #1084 demonstrated by hand. Adds two new conditional panelists (doc-writer, test-coverage-expert), wires test evidence as load-bearing in CEO arbitration, lands the python-architect class diagram in the top comment, and notifies the PR author + reviewers via @-ping so a fresh pass surfaces in inboxes without re-introducing labels. Closes #1091.

Important

The panel never blocks merge anymore. The maintainer ships. Re-applying the panel-review label re-runs the panel cleanly, and the orchestrator now sweeps stale panel-approved / panel-rejected labels left behind by the previous regime.

Why

The pre-refactor panel had five coupled defects that compounded into "loud, adversarial, easy to ignore":

  • Schema forced false rejections. panelist-return-schema.json had two slots: required[] (blocks merge) XOR nits[] (skip). A real-but-not-blocking finding had nowhere safe to live; defensively it landed in required[], the deterministic gate verdict = APPROVE iff sum(required)==0 flipped to REJECT, and the maintainer got noise where they needed signal.
  • Binary verdict + auto-applied labels. A multi-persona LLM ensemble is not a deterministic gate. Pretending it was created adversarial framing (Verdict: REJECT topline, Required before merge (N) next) and hid where the lenses agreed, where they disagreed, and what was worth doing now vs later.
  • Stale verdict labels. A previously-rejected PR re-triggered with a green run never refreshed correctly; panel-rejected survived. Maintainer-confirmed.
  • Comment shape didn't help a busy maintainer. Flat lists, no per-persona summary table, no architecture diagrams unless one panelist happened to supply them, CEO arbitration buried below the verdict block. PR fix(install): allow credential helpers in --update preflight for generic hosts #1084's hand-written comment showed the shape reviewers actually want: top-loaded synthesis, single summary table, diagrams in the body, curated follow-ups, recommendation prose.
  • Two unowned failure modes. Documentation drift (code changed, docs didn't) and silent test-coverage drift (CLI surface or auth or lockfile semantics changed without a regression-trap test) had no lens. The panel could surface "this is risky" but no persona asked the one question that protects users from quiet re-regression: "if this code drifts six months from now, will any test fail before a user does?"

APM's positioning is pragmatic as npm. npm doesn't have a bot that blocks publish on stylistic LLM disagreement. Neither should we. The panel is here to inform the maintainer, not to gate them.

What

Aspect Before After
Verdict APPROVE / REJECT (binary, deterministic from sum(required)) None. CEO returns ship_recommendation.stance prose: ship_now / ship_with_followups / needs_discussion / needs_rework.
Severity required[] (blocks) + nits[] (skip) findings[] with severity: blocking | recommended | nit. Severity is signal strength, never read as a gate.
Labels written panel-approved XOR panel-rejected (auto), remove panel-review No add-labels call. remove-labels sweeps [panel-review, panel-approved, panel-rejected] on every run -- defensive cleanup of legacy verdict labels, plus re-run idempotency.
Comment shape Verdict line, Required, Nits, CEO arbitration, per-persona detail cc @author @reviewers -> stance pill -> headline -> CEO synthesis -> per-persona summary table -> top-N curated follow-ups -> architecture diagrams (class + sequence) in the top comment -> recommendation prose. Full per-persona findings collapsed at the bottom.
Personas 5 mandatory + auth-expert conditional 5 mandatory + auth-expert (cond.) + doc-writer (cond. on docs/CHANGELOG/README/skill+agent bundles) + test-coverage-expert (mandatory whenever the diff touches src/**/*.py; conditional otherwise).
Test evidence Opinion-only findings Each finding may carry an evidence { outcome, test_file, test_name, assertion_excerpt, proves, principles } block. CEO arbitration weights passed/failed/missing test outcomes above opinion, with missing on a secure-by-default/governed-by-policy/portability-by-manifest surface ranked at or near blocking-tier.
Notification Verdict labels acted as inbox cue Orchestrator resolves notify_audience = PR author + CODEOWNERS-resolved reviewers + requested teams (bots filtered, capped at 6, teams preferred over individuals when capped); template renders cc @... directly under the headline.
Architecture diagrams in comment Buried in per-persona detail (often missing) python-architect persona produces a class_diagram slot rendered FIRST under ### Architecture in the top comment, followed by sequence diagram. Schema, template, and persona contract all updated to wire this end-to-end.
Companion workflow pr-panel-label-reset.yml (deterministic verdict-label sweeper) Deleted. No verdict labels exist to reset.
Local eval harness One ad-hoc verdict harness, two pre-refactor content fixtures render_eval.py + two scenario fixtures (ship_now shape + needs_rework shape) that render the template's actual rules and lint ASCII. Both fixtures exercise all 8 panelist slots, the cc @... line (rendered in 02, skipped in 01), the class diagram slot, and evidence blocks (passed in 01, missing in 02).

How

flowchart TD
    Trigger[panel-review label] --> Orch[orchestrator]
    Orch --> PA[python-architect]
    Orch --> CL[cli-logging-expert]
    Orch --> DX[devx-ux-expert]
    Orch --> SC[supply-chain-security]
    Orch --> GH[oss-growth-hacker]
    Orch --> AE[auth-expert / cond.]
    Orch --> DW[doc-writer / cond.]
    Orch --> TC[test-coverage / mand. on src.py]
    PA --> Sch[findings array<br/>severity: blocking / recommended / nit<br/>+ optional evidence block]
    CL --> Sch
    DX --> Sch
    SC --> Sch
    GH --> Sch
    AE --> Sch
    DW --> Sch
    TC --> Sch
    PA --> Diag[class + sequence diagrams<br/>rendered in top comment]
    Sch --> CEO[apm-ceo synthesizer<br/>ship_recommendation.stance<br/>test evidence load-bearing]
    Orch --> Audience[resolve notify_audience<br/>author + reviewers + teams, bots filtered]
    CEO --> Render[render recommendation-template.md]
    Diag --> Render
    Audience --> Render
    Render --> Comment[add-comment: 1 comment<br/>cc @author @reviewers]
    Render --> Sweep[remove-labels:<br/>panel-review, panel-approved, panel-rejected]
    classDef writer fill:#0a3d2c,stroke:#1f7a4d,color:#fff
    classDef synth fill:#1a2a4a,stroke:#3d5a8a,color:#fff
    class Comment,Sweep writer
    class CEO synth
Loading

Concretely, the changes touch four layers and both .github/ + .apm/ mirrors:

  • Skill contract. SKILL.md rewritten with advisory invariants up top (no verdict, no add-labels, severity is signal not gate, single-writer interlock sweeps all three labels). 8-persona roster with explicit conditional rules. Calibrated severity contract embedded in the panelist prompt template.
  • Schemas. panelist-return-schema.json collapses required/nits into a single findings[] array with a severity enum and an optional evidence block; adds a class_diagram schema slot. ceo-return-schema.json replaces verdict with ship_recommendation { stance, prose }, adds headline, arbitration, principle_alignment, recommended_followups[], and notify_audience[].
  • Personas. agents/test-coverage-expert.agent.md (new): scope inherits north star from DevX UX; enumerates critical user-promise surfaces (CLI / error wording / install / lockfile / auth / hooks / marketplace / cross-module integration); mandates view/grep probe before any "no test exists" claim; every finding MUST carry an evidence block. agents/python-architect.agent.md updated to require both class_diagram and sequence_diagram outputs.
  • Template + workflow. recommendation-template.md (new, replaces verdict-template.md) renders sections only when their source field is non-empty -- no placeholder noise. Class diagram rendered FIRST under ### Architecture. cc @... line rendered directly under headline. pr-review-panel.md workflow drops add-labels; remove-labels.allowed lists all three labels with max: 3. pr-panel-label-reset.yml deleted. Lock recompiled with gh aw compile.
  • Eval scaffolding. evals/render_eval.py is a specification-test renderer that applies the template's rules deterministically (without an LLM) so a maintainer can eyeball offline. Two fixtures: 01-ship-now-pr1084-shape.json (the common case) and 02-needs-rework-shape.json (real regressions: path-traversal + Windows-encoding + circular import + missing regression-trap test).

Proof

The eval scaffolding shipped in this PR is the executable contract. The skill ships no Python runtime change, so unit/integration tests in the conventional pytest sense do not apply -- the rendered fixtures, the workflow lock, and the diff -rq mirror checks ARE this PR's executable evidence.

Eval harness output:

$ python3 .github/skills/apm-review-panel/evals/render_eval.py
[OK] 01-ship-now-pr1084-shape.json -> 01-ship-now-pr1084-shape.rendered.md (137 lines)
[OK] 02-needs-rework-shape.json -> 02-needs-rework-shape.rendered.md (138 lines)

Workflow lock confirms verdict-label removal:

$ grep "add_labels\|remove_labels" .github/workflows/pr-review-panel.lock.yml
"remove_labels":{"allowed":["panel-review","panel-approved","panel-rejected"],"max":3}

$ gh aw compile .github/workflows/pr-review-panel.md
Compiled 1 workflow(s): 0 error(s), 0 warning(s)

.github/ and .apm/ mirrors verified identical:

$ diff -rq .github/skills/apm-review-panel .apm/skills/apm-review-panel
$ diff -q .github/agents/test-coverage-expert.agent.md .apm/agents/test-coverage-expert.agent.md

Scenario Evidence -- promises mapped to executable proof:

# Scenario (user promise) Principle(s) Proof
1 Panel never blocks merge: a PR with real findings still ships at the maintainer's discretion DevX, Governed by policy pr-review-panel.lock.yml (add_labels absent); 02-needs-rework-shape.rendered.md (no "Verdict"/"REJECT" anywhere)
2 Re-applying panel-review on a previously panel-rejected PR self-heals the stale label Governed by policy, DevX pr-review-panel.lock.yml: remove_labels.allowed enumerates all three labels with max: 3 (sweep is unconditional)
3 A busy maintainer can read one panel comment in ~30 seconds DevX 01-ship-now-pr1084-shape.rendered.md (137 lines); 02-needs-rework-shape.rendered.md (138 lines); template top-loads cc-line + stance pill + headline + table
4 A fresh advisory pass surfaces in the PR author's and reviewers' inboxes without re-introducing labels DevX 02-needs-rework-shape.rendered.md renders cc @danielmeppiel @microsoft/apm-maintainers ...; 01-ship-now-pr1084-shape.rendered.md skips the line cleanly when notify_audience is absent
5 The python-architect class diagram lands in the top comment without expanding per-persona details DevX 02-needs-rework-shape.rendered.md renders the class diagram FIRST under ### Architecture, before the sequence diagram
6 Test-coverage-expert is mandatory on src/**/*.py changes (its evidence is load-bearing for CEO arbitration) Secure by default, OSS agents/test-coverage-expert.agent.md Activation logic; 01-ship-now-pr1084-shape.rendered.md shows TC active on a behavior-change PR
7 Test-coverage-expert never asserts "no test exists" without a tool-call probe OSS, DevX agents/test-coverage-expert.agent.md Review procedure step 5 mandates view/grep/glob; Anti-patterns section names "Asserting 'no test exists' without grepping" as disqualifying
8 Bug-fix PRs include the regression-trap test (the test that would have caught the bug had it existed) Secure by default, Governed by policy assets/scenario-evidence-rubric.md Required-mapping-shape; test-coverage-expert missing-regression-trap gap class is always at least recommended
9 "Secure by default" scenarios cannot be proven by mocking the security boundary they assert on Secure by default assets/scenario-evidence-rubric.md Anti-patterns refuses "Mocked boundary on a security scenario"; TC classifies as mocked-boundary-on-security-scenario -> blocking
10 Test evidence outweighs opinion in CEO arbitration Governed by policy agents/apm-ceo.agent.md "Treat test evidence as load-bearing" subsection; 02-needs-rework-shape.rendered.md renders verbatim assertion excerpts under each blocking finding
11 Single source of truth: changing the rubric once updates both consumers OSS, Portability diff -rq confirms .github/ and .apm/ mirrors identical; SKILL.md + section-rubric.md + test-coverage-expert.agent.md all link to the same scenario-evidence-rubric.md path

End-to-end validation happens by labeling a real PR with panel-review after merge.

Co-authored-by: Copilot 223556219+Copilot@users.noreply.github.com

…val harness; legacy-label sweep

The apm-review-panel skill posted binary REJECT verdicts on PRs whose
real signal was a handful of recommended improvements + polish nits.
Root cause: the panelist schema forced every finding into required[]
(blocks merge) XOR nits[] (skip), with no middle slot. Careful reviewers
defaulted to required defensively -> deterministic REJECT -> noise.

This PR refactors the panel from a gate regime to an advisory regime:

- New panelist schema: unified findings[] with severity enum
  (blocking | recommended | nit). Severity is signal strength,
  never read by the orchestrator to gate anything.
- New CEO schema: returns ship_recommendation { stance, prose }
  where stance is one of ship_now | ship_with_followups |
  needs_discussion | needs_rework. Prose for the human reviewer,
  never auto-applied as a label or status check.
- New comment template (assets/recommendation-template.md): top-loads
  CEO synthesis + per-persona summary table + top-N curated follow-ups
  + (optional) architecture diagrams + recommendation prose. Full
  per-persona findings collapsed at the bottom. Mirrors the manual
  comment shape on PR #1084.
- New conditional persona: doc-writer. Activates on README, CHANGELOG,
  docs/, .apm/skills|agents/, .github/skills|agents|instructions/,
  and .github/workflows/*.md changes. Also detects DRIFT (code changed
  but docs didn't), not just doc edits.
- Workflow safe-outputs: drop add-labels (no verdict to encode);
  remove-labels now sweeps panel-review (re-run idempotency) AND
  legacy panel-approved / panel-rejected (defensive cleanup of stale
  verdict labels left over from the pre-advisory regime).
- Delete pr-panel-label-reset.yml: no verdict labels to reset.
- Eval harness: render_eval.py + two fixtures (ship-now shape from
  PR #1084, needs-rework shape with blocking-severity findings).
  Specification test that confirms the rendered comment is short,
  scannable, and doesn't bury the lede.

Closes #1091

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 1, 2026 08:52
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Refactors the apm-review-panel automation from a deterministic verdict/labeling system into an advisory review panel that posts a single scannable recommendation comment, never applies verdict labels, and defensively sweeps legacy labels to make re-runs idempotent.

Changes:

  • Update the gh-aw workflow to remove legacy verdict-label writes and expand label removal to sweep panel-review, panel-approved, and panel-rejected.
  • Redesign panelist/CEO JSON schemas and the skill contract around advisory severities (blocking | recommended | nit) and a CEO ship_recommendation.stance (no APPROVE/REJECT).
  • Replace verdict comment assets and eval artifacts with a recommendation template plus a deterministic render_eval.py harness and fixtures.
Show a summary per file
File Description
.github/workflows/pr-review-panel.md Removes verdict-label write surface; expands remove-label sweep and updates workflow prose to advisory contract.
.github/workflows/pr-review-panel.lock.yml Recompiled lock reflecting removal of add_labels safe output and updated remove_labels constraints.
.github/workflows/pr-panel-label-reset.yml Deletes legacy deterministic verdict-label reset workflow (now obsolete under advisory regime).
.github/skills/apm-review-panel/SKILL.md Updates panel contract, roster (adds conditional doc-writer), and advisory rendering/sweep rules.
.github/skills/apm-review-panel/evals/trigger-evals.json Updates routing/trigger eval phrasing from “verdict” to “recommendation”.
.github/skills/apm-review-panel/evals/run-verdict-harness.py Removes verdict-era deterministic harness.
.github/skills/apm-review-panel/evals/results-trigger-2026-04-28.md Removes old trigger eval write-up (verdict-era).
.github/skills/apm-review-panel/evals/results-label-reset-2026-04-28.md Removes old label-reset workflow validation write-up.
.github/skills/apm-review-panel/evals/results-e2e-pr931-2026-04-28.md Removes verdict-era e2e eval write-up.
.github/skills/apm-review-panel/evals/render_eval.py Adds deterministic renderer for fixtures + ASCII lint to validate output shape offline.
.github/skills/apm-review-panel/evals/README.md Updates eval documentation to focus on render eval + fixtures and the trigger eval set.
.github/skills/apm-review-panel/evals/fixtures/01-ship-now-pr1084-shape.json Adds “ship_now” scenario fixture for the new advisory schema.
.github/skills/apm-review-panel/evals/fixtures/01-ship-now-pr1084-shape.rendered.md Adds rendered markdown output for the “ship_now” fixture.
.github/skills/apm-review-panel/evals/fixtures/02-needs-rework-shape.json Adds “needs_rework” scenario fixture demonstrating high-signal advisory findings.
.github/skills/apm-review-panel/evals/fixtures/02-needs-rework-shape.rendered.md Adds rendered markdown output for the “needs_rework” fixture.
.github/skills/apm-review-panel/evals/content-eval-rejected-pr.md Removes verdict-era content eval doc.
.github/skills/apm-review-panel/evals/content-eval-clean-pr.md Removes verdict-era content eval doc.
.github/skills/apm-review-panel/assets/verdict-template.md Removes verdict-era comment template asset.
.github/skills/apm-review-panel/assets/recommendation-template.md Adds new top-loaded advisory recommendation template (with per-persona table + optional diagrams).
.github/skills/apm-review-panel/assets/panelist-return-schema.json Replaces required/nits schema with findings[] + severity and adds doc-writer persona.
.github/skills/apm-review-panel/assets/ceo-return-schema.json Replaces verdict-era CEO shape with advisory fields including ship_recommendation and curated follow-ups.
.apm/skills/apm-review-panel/SKILL.md Mirrors the advisory contract changes into the .apm/ bundle.
.apm/skills/apm-review-panel/evals/trigger-evals.json Mirrors the trigger eval phrasing updates into the .apm/ bundle.
.apm/skills/apm-review-panel/evals/run-verdict-harness.py Mirrors removal of the verdict-era harness into the .apm/ bundle.
.apm/skills/apm-review-panel/evals/results-trigger-2026-04-28.md Mirrors removal of verdict-era eval write-up into the .apm/ bundle.
.apm/skills/apm-review-panel/evals/results-label-reset-2026-04-28.md Mirrors removal of verdict-era label-reset eval write-up into the .apm/ bundle.
.apm/skills/apm-review-panel/evals/results-e2e-pr931-2026-04-28.md Mirrors removal of verdict-era e2e eval write-up into the .apm/ bundle.
.apm/skills/apm-review-panel/evals/render_eval.py Mirrors addition of deterministic render harness into the .apm/ bundle.
.apm/skills/apm-review-panel/evals/README.md Mirrors updated eval documentation into the .apm/ bundle.
.apm/skills/apm-review-panel/evals/fixtures/01-ship-now-pr1084-shape.json Mirrors new fixture into the .apm/ bundle.
.apm/skills/apm-review-panel/evals/fixtures/01-ship-now-pr1084-shape.rendered.md Mirrors rendered output into the .apm/ bundle.
.apm/skills/apm-review-panel/evals/fixtures/02-needs-rework-shape.json Mirrors new fixture into the .apm/ bundle.
.apm/skills/apm-review-panel/evals/fixtures/02-needs-rework-shape.rendered.md Mirrors rendered output into the .apm/ bundle.
.apm/skills/apm-review-panel/evals/content-eval-rejected-pr.md Mirrors removal of verdict-era content eval doc into the .apm/ bundle.
.apm/skills/apm-review-panel/evals/content-eval-clean-pr.md Mirrors removal of verdict-era content eval doc into the .apm/ bundle.
.apm/skills/apm-review-panel/assets/verdict-template.md Mirrors removal of verdict-era template into the .apm/ bundle.
.apm/skills/apm-review-panel/assets/recommendation-template.md Mirrors addition of advisory template into the .apm/ bundle.
.apm/skills/apm-review-panel/assets/panelist-return-schema.json Mirrors advisory panelist schema into the .apm/ bundle.
.apm/skills/apm-review-panel/assets/ceo-return-schema.json Mirrors advisory CEO schema into the .apm/ bundle.

Copilot's findings

  • Files reviewed: 39/39 changed files
  • Comments generated: 6

Comment on lines 23 to 27
"active": {
"type": "boolean",
"description": "Set to false ONLY for auth-expert when the conditional rule does not activate it. All other personas MUST set active=true. When false, required and nits MUST be empty arrays and inactive_reason MUST be a one-sentence explanation citing the touched files.",
"default": true
"description": "Set to false ONLY for conditional personas (auth-expert, doc-writer) when their fast-path file triggers and fallback self-check both miss. All mandatory personas MUST set active=true. When false, findings MUST be empty and inactive_reason MUST be a one-sentence explanation citing the touched files."
},
"inactive_reason": {
Copy link

Copilot AI May 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inactive_reason is documented as required when active=false, and findings is supposed to be empty in that case, but the schema doesn't actually enforce either constraint. As written, the S4 schema gate could accept an inactive panelist with missing inactive_reason (or with findings), which would render badly in the template. Consider adding an if/then (or allOf with dependentRequired) to require inactive_reason and findings: [] when active is false.

Copilot uses AI. Check for mistakes.
Comment on lines +47 to +50
"description": "Reserved for python-architect. Two mermaid blocks (string each) used by the comment template. If absent, the orchestrator renders a placeholder.",
"properties": {
"component": { "type": "string", "description": "Mermaid flowchart of the components/data flow touched by the PR." },
"sequence": { "type": "string", "description": "Mermaid sequenceDiagram of the user-visible behavior change." }
Copy link

Copilot AI May 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The schema/docs here say missing diagrams should render a placeholder, but assets/recommendation-template.md explicitly says to never render placeholders and to omit the Architecture section unless diagrams are supplied. Please align the schema wording with the template/rendering rules (and with SKILL.md) to avoid contradictory guidance for the orchestrator/panelists.

Suggested change
"description": "Reserved for python-architect. Two mermaid blocks (string each) used by the comment template. If absent, the orchestrator renders a placeholder.",
"properties": {
"component": { "type": "string", "description": "Mermaid flowchart of the components/data flow touched by the PR." },
"sequence": { "type": "string", "description": "Mermaid sequenceDiagram of the user-visible behavior change." }
"description": "Reserved for python-architect. Optional mermaid blocks (string each) used by the comment template. Never render placeholders; if diagrams are absent, omit the Architecture section.",
"properties": {
"component": { "type": "string", "description": "Optional mermaid flowchart of the components or data flow touched by the PR." },
"sequence": { "type": "string", "description": "Optional mermaid sequenceDiagram of the user-visible behavior change." }

Copilot uses AI. Check for mistakes.
Comment on lines +261 to +262
If absent, render the placeholder lines from the template (do NOT
invent diagrams).
Copy link

Copilot AI May 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This step says to render placeholder lines when diagrams are absent, but the new assets/recommendation-template.md says the opposite ("NEVER render a placeholder" / only render diagrams when supplied). Please make SKILL.md match the template: either require diagrams be present, or specify that the Architecture section is omitted when missing.

Suggested change
If absent, render the placeholder lines from the template (do NOT
invent diagrams).
Render the Architecture section ONLY when diagrams are supplied.
If absent, omit that section entirely and do NOT invent diagrams or
render placeholder lines.

Copilot uses AI. Check for mistakes.
Comment on lines +79 to +83
{{#if python_architect.extras.diagrams.sequence }}
```mermaid
{{ python_architect.extras.diagrams.sequence }}
```
{{/if}}
Copy link

Copilot AI May 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As written, the ### Architecture heading is only included when the component diagram exists. If only python_architect.extras.diagrams.sequence is present, the template will emit a mermaid block without the section heading. Consider restructuring so the heading appears when either diagram exists (this also matches the behavior in evals/render_eval.py).

Copilot uses AI. Check for mistakes.
Comment on lines +191 to +195
for i, line in enumerate(text.splitlines(), 1):
for ch in line:
cp = ord(ch)
if ch == "\n" or ch == "\t":
continue
Copy link

Copilot AI May 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lint_ascii() skips tab characters (\t), but the repo encoding rule is printable ASCII U+0020–U+007E, so tabs (0x09) are non-printable and should be flagged. Dropping the special-case for \t (or explicitly reporting it) will make the eval harness properly catch accidental tabs in rendered output.

Copilot generated this review using guidance from repository custom instructions.
"description": "Shape the apm-ceo synthesizer MUST return when invoked by the apm-review-panel skill. The CEO produces ARBITRATION and SHIP-RECOMMENDATION prose. The CEO does NOT pick a binary verdict; the panel is advisory. The orchestrator renders these fields into the comment template.",
"type": "object",
"required": ["arbitration"],
"required": ["headline", "arbitration", "principle_alignment", "ship_recommendation"],
Copy link

Copilot AI May 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

recommended_followups is described as part of the CEO output and the template expects it (e.g., .length checks). Since it's not in the schema's required list, a valid CEO return could omit it entirely, forcing the orchestrator to invent defaults. Consider making recommended_followups required (allowing an empty array) to keep the schema/template contract deterministic.

Suggested change
"required": ["headline", "arbitration", "principle_alignment", "ship_recommendation"],
"required": ["headline", "arbitration", "principle_alignment", "recommended_followups", "ship_recommendation"],

Copilot uses AI. Check for mistakes.
Copilot and others added 3 commits May 1, 2026 11:38
Pairs with the DevX UX lens to assess whether behavior-change PRs are
protected by tests that catch regressions in user-facing promises.
Conditional like auth-expert and doc-writer: orchestrator always spawns
for schema-shape uniformity; persona self-reports active=false when
off-condition (pure docs / pure refactor / pure asset bumps).

Calibrated severity: blocking ONLY when a critical user-promise surface
(auth, lockfile, install, marketplace, hooks, exit codes, error wording)
changes AND no test exercises the new behavior AND the persona can name
the specific test file path. Mandatory probe: must view/grep the test
tree before emitting any 'no test exists' claim -- false positives here
destroy maintainer trust.

Boundaries documented vs python-architect (test code design),
auth-expert (auth-specific assertions), and devx-ux-expert (which user
promises exist).

- New agent: .apm/agents/test-coverage-expert.agent.md (mirrored to
  .github/agents/)
- SKILL.md: roster (5 mandatory + 3 conditional), topology, conditional
  intro count, new section, routing matrix, fan-out list, gotchas
- panelist-return-schema.json: persona enum + active description
- Eval fixtures 01 + 02: TC row added; needs-rework demonstrates the
  highest-value case (path-traversal regression where the regression-
  trap test is the test that prevents re-shipping)
- render_eval.py: PERSONA_LABELS + canonical_order extended

Verified: render_eval.py emits [OK] on both fixtures; .apm/ and
.github/ mirrors verified identical via diff -rq.

Refs #1091

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…orks

Single shared asset (scenario-evidence-rubric.md) consumed by two
primitives: pr-description-skill (as a required Validation subsection
on every behavior-change PR) AND test-coverage-expert.agent.md (as the
evaluation lens applied during panel review).

The rubric answers ONE question from the maintainer's chair: for each
user promise this PR touches, which test exercises the scenario, and
which APM principle does that scenario serve? Not pure coverage --
SCENARIO coverage tied to the principles named in the README:
Portability by manifest, Secure by default, Governed by policy,
Multi-harness support, Vendor-neutral, DevX (pragmatic as npm), OSS.

Composition (R3 EXTRACT + DUAL CONSUMER):
- Asset lives once at .github/skills/pr-description-skill/assets/
  scenario-evidence-rubric.md (mirrored to .apm/).
- pr-description-skill: pr-body-template.md adds Scenario Evidence
  subsection; section-rubric.md adds 9b acceptance + refusal criteria
  and a final-pass checkbox; SKILL.md activation contract adds
  scenario-test-mapping input; required-body-structure row 9 updated.
- test-coverage-expert: review procedure now READS the table FIRST,
  audits each row against the diff, then probes for unmapped behavior.
  New gap classes added: mocked-boundary-on-security-scenario
  (blocking; the rubric explicitly refuses tautological tests) and
  principle-mismapping (recommended).

Skip clause for the table: docs-only / asset-bump-only / pure-refactor
PRs (author states which case applies in trade-offs).

Bug-fix PRs MUST include the regression-trap test row tagged with the
issue it would have caught -- closes the loop with the
test-coverage-expert's missing-regression-trap finding class.

Verified: render_eval emits [OK] on both fixtures (no panel-side schema
change). .apm/ and .github/ mirrors verified identical via diff -rq.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Tests, when coded right, are irrefutable on a given commit. The CEO
synthesizer should weight a passed/failed test outcome above any
opinion-only finding. This commit wires that discipline into the
panel contract.

Schema: panelist-return-schema.json gains optional finding.evidence
{ outcome, test_file, test_name, assertion_excerpt, proves, principles }.
test-coverage-expert REQUIRES it on every finding (its contract is
stricter than other panelists -- the evidence IS its value-add).

apm-ceo: new 'Treat test evidence as load-bearing' subsection.
- outcome=passed: proves the user promise holds; cite test+assertion
- outcome=failed: strongest signal short of a CVE; surface in headline
- outcome=missing on secure-by-default / governed-by-policy / portability
  surface ranks at or near blocking-tier opinion findings (the absence
  of an automated guardrail IS a real defect on those surfaces)
- outcome=manual: arbitrated as missing
- outcome=unknown: discarded from arbitration weight
- two-tier rule for follow-ups: failed-test rows always rank above
  opinion-only rows of the same severity; missing-test on critical
  promise rank above any opinion 'recommended' finding

Render: render_eval.py + recommendation-template.md surface a 'Proof'
line per finding showing test ref, outcome, what it proves, and the
APM principles it covers, with the verbatim assertion excerpt below.

Fixtures 01 + 02 demonstrate both sides:
- 01: outcome=passed on the credential-helper env-var contract
- 02: outcome=missing on path-traversal regression-trap (secure-by-
  default, blocking) + integration boundary (portability, recommended)

Both fixtures still render [OK].

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@danielmeppiel danielmeppiel force-pushed the refactor/panel-advisory-regime branch from 07ca716 to e502fa2 Compare May 1, 2026 10:03
Copilot and others added 3 commits May 1, 2026 12:28
Two coupled fixes folded into PR #1093 via the genesis discipline.

1) Restore the maintainer-notification signal that the verdict-label
   sweep removed. The panel now resolves an @-ping audience from
   'gh pr view --json author,reviewRequests' (PR author +
   CODEOWNERS-resolved requested reviewers + requested teams), filters
   bots, caps at 6 handles, and renders a 'cc @author @Reviewer ...'
   line right under the headline. This is the only mechanism by which
   a fresh advisory pass surfaces in inboxes; without it, a panel
   re-run on a previously-reviewed PR is silent.

2) Make test-coverage-expert mandatory on every PR that touches
   'src/**/*.py'. The persona was conditional for symmetry with
   auth/doc-writer, but that symmetry broke when test evidence became
   load-bearing in CEO arbitration (commit e502fa2). A persona whose
   findings outweigh opinion cannot be silently skipped on a
   heuristic. Skip ONLY on documentation-only PRs (zero src/*.py
   files in the diff).

Eval fixtures updated: fixture 02 now carries a non-empty
notify_audience to prove the cc line renders; fixture 01 leaves it
empty to prove the line is skipped (not rendered as placeholder).
Both evals green.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The .apm/ rendered fixture for 01 was stale from before commit
e502fa2 (proof-evidence rendering). 'apm install' regenerates
.github/ from .apm/, so the integration drift check failed because
.github/ had the Proof line but .apm/ source did not. Verified
clean via local 'apm install'.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds extras.diagrams.class_diagram schema slot, renders it first under
### Architecture, asks the persona for it explicitly, and regenerates
both rendered.md fixtures so the panel comment shows load-bearing
visual evidence without requiring readers to expand the per-persona
findings.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@danielmeppiel danielmeppiel merged commit 6c99c5b into main May 1, 2026
16 checks passed
@danielmeppiel danielmeppiel deleted the refactor/panel-advisory-regime branch May 1, 2026 19:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Refactor apm-review-panel: VERDICT regime -> ADVISORY regime (drop binary APPROVE/REJECT, add doc-writer, restructure comment)

2 participants