refactor(panel): rebuild apm-review-panel as advisory, scannable, and self-healing#1093
Conversation
…val harness; legacy-label sweep
The apm-review-panel skill posted binary REJECT verdicts on PRs whose
real signal was a handful of recommended improvements + polish nits.
Root cause: the panelist schema forced every finding into required[]
(blocks merge) XOR nits[] (skip), with no middle slot. Careful reviewers
defaulted to required defensively -> deterministic REJECT -> noise.
This PR refactors the panel from a gate regime to an advisory regime:
- New panelist schema: unified findings[] with severity enum
(blocking | recommended | nit). Severity is signal strength,
never read by the orchestrator to gate anything.
- New CEO schema: returns ship_recommendation { stance, prose }
where stance is one of ship_now | ship_with_followups |
needs_discussion | needs_rework. Prose for the human reviewer,
never auto-applied as a label or status check.
- New comment template (assets/recommendation-template.md): top-loads
CEO synthesis + per-persona summary table + top-N curated follow-ups
+ (optional) architecture diagrams + recommendation prose. Full
per-persona findings collapsed at the bottom. Mirrors the manual
comment shape on PR #1084.
- New conditional persona: doc-writer. Activates on README, CHANGELOG,
docs/, .apm/skills|agents/, .github/skills|agents|instructions/,
and .github/workflows/*.md changes. Also detects DRIFT (code changed
but docs didn't), not just doc edits.
- Workflow safe-outputs: drop add-labels (no verdict to encode);
remove-labels now sweeps panel-review (re-run idempotency) AND
legacy panel-approved / panel-rejected (defensive cleanup of stale
verdict labels left over from the pre-advisory regime).
- Delete pr-panel-label-reset.yml: no verdict labels to reset.
- Eval harness: render_eval.py + two fixtures (ship-now shape from
PR #1084, needs-rework shape with blocking-severity findings).
Specification test that confirms the rendered comment is short,
scannable, and doesn't bury the lede.
Closes #1091
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Refactors the apm-review-panel automation from a deterministic verdict/labeling system into an advisory review panel that posts a single scannable recommendation comment, never applies verdict labels, and defensively sweeps legacy labels to make re-runs idempotent.
Changes:
- Update the gh-aw workflow to remove legacy verdict-label writes and expand label removal to sweep
panel-review,panel-approved, andpanel-rejected. - Redesign panelist/CEO JSON schemas and the skill contract around advisory severities (
blocking | recommended | nit) and a CEOship_recommendation.stance(no APPROVE/REJECT). - Replace verdict comment assets and eval artifacts with a recommendation template plus a deterministic
render_eval.pyharness and fixtures.
Show a summary per file
| File | Description |
|---|---|
| .github/workflows/pr-review-panel.md | Removes verdict-label write surface; expands remove-label sweep and updates workflow prose to advisory contract. |
| .github/workflows/pr-review-panel.lock.yml | Recompiled lock reflecting removal of add_labels safe output and updated remove_labels constraints. |
| .github/workflows/pr-panel-label-reset.yml | Deletes legacy deterministic verdict-label reset workflow (now obsolete under advisory regime). |
| .github/skills/apm-review-panel/SKILL.md | Updates panel contract, roster (adds conditional doc-writer), and advisory rendering/sweep rules. |
| .github/skills/apm-review-panel/evals/trigger-evals.json | Updates routing/trigger eval phrasing from “verdict” to “recommendation”. |
| .github/skills/apm-review-panel/evals/run-verdict-harness.py | Removes verdict-era deterministic harness. |
| .github/skills/apm-review-panel/evals/results-trigger-2026-04-28.md | Removes old trigger eval write-up (verdict-era). |
| .github/skills/apm-review-panel/evals/results-label-reset-2026-04-28.md | Removes old label-reset workflow validation write-up. |
| .github/skills/apm-review-panel/evals/results-e2e-pr931-2026-04-28.md | Removes verdict-era e2e eval write-up. |
| .github/skills/apm-review-panel/evals/render_eval.py | Adds deterministic renderer for fixtures + ASCII lint to validate output shape offline. |
| .github/skills/apm-review-panel/evals/README.md | Updates eval documentation to focus on render eval + fixtures and the trigger eval set. |
| .github/skills/apm-review-panel/evals/fixtures/01-ship-now-pr1084-shape.json | Adds “ship_now” scenario fixture for the new advisory schema. |
| .github/skills/apm-review-panel/evals/fixtures/01-ship-now-pr1084-shape.rendered.md | Adds rendered markdown output for the “ship_now” fixture. |
| .github/skills/apm-review-panel/evals/fixtures/02-needs-rework-shape.json | Adds “needs_rework” scenario fixture demonstrating high-signal advisory findings. |
| .github/skills/apm-review-panel/evals/fixtures/02-needs-rework-shape.rendered.md | Adds rendered markdown output for the “needs_rework” fixture. |
| .github/skills/apm-review-panel/evals/content-eval-rejected-pr.md | Removes verdict-era content eval doc. |
| .github/skills/apm-review-panel/evals/content-eval-clean-pr.md | Removes verdict-era content eval doc. |
| .github/skills/apm-review-panel/assets/verdict-template.md | Removes verdict-era comment template asset. |
| .github/skills/apm-review-panel/assets/recommendation-template.md | Adds new top-loaded advisory recommendation template (with per-persona table + optional diagrams). |
| .github/skills/apm-review-panel/assets/panelist-return-schema.json | Replaces required/nits schema with findings[] + severity and adds doc-writer persona. |
| .github/skills/apm-review-panel/assets/ceo-return-schema.json | Replaces verdict-era CEO shape with advisory fields including ship_recommendation and curated follow-ups. |
| .apm/skills/apm-review-panel/SKILL.md | Mirrors the advisory contract changes into the .apm/ bundle. |
| .apm/skills/apm-review-panel/evals/trigger-evals.json | Mirrors the trigger eval phrasing updates into the .apm/ bundle. |
| .apm/skills/apm-review-panel/evals/run-verdict-harness.py | Mirrors removal of the verdict-era harness into the .apm/ bundle. |
| .apm/skills/apm-review-panel/evals/results-trigger-2026-04-28.md | Mirrors removal of verdict-era eval write-up into the .apm/ bundle. |
| .apm/skills/apm-review-panel/evals/results-label-reset-2026-04-28.md | Mirrors removal of verdict-era label-reset eval write-up into the .apm/ bundle. |
| .apm/skills/apm-review-panel/evals/results-e2e-pr931-2026-04-28.md | Mirrors removal of verdict-era e2e eval write-up into the .apm/ bundle. |
| .apm/skills/apm-review-panel/evals/render_eval.py | Mirrors addition of deterministic render harness into the .apm/ bundle. |
| .apm/skills/apm-review-panel/evals/README.md | Mirrors updated eval documentation into the .apm/ bundle. |
| .apm/skills/apm-review-panel/evals/fixtures/01-ship-now-pr1084-shape.json | Mirrors new fixture into the .apm/ bundle. |
| .apm/skills/apm-review-panel/evals/fixtures/01-ship-now-pr1084-shape.rendered.md | Mirrors rendered output into the .apm/ bundle. |
| .apm/skills/apm-review-panel/evals/fixtures/02-needs-rework-shape.json | Mirrors new fixture into the .apm/ bundle. |
| .apm/skills/apm-review-panel/evals/fixtures/02-needs-rework-shape.rendered.md | Mirrors rendered output into the .apm/ bundle. |
| .apm/skills/apm-review-panel/evals/content-eval-rejected-pr.md | Mirrors removal of verdict-era content eval doc into the .apm/ bundle. |
| .apm/skills/apm-review-panel/evals/content-eval-clean-pr.md | Mirrors removal of verdict-era content eval doc into the .apm/ bundle. |
| .apm/skills/apm-review-panel/assets/verdict-template.md | Mirrors removal of verdict-era template into the .apm/ bundle. |
| .apm/skills/apm-review-panel/assets/recommendation-template.md | Mirrors addition of advisory template into the .apm/ bundle. |
| .apm/skills/apm-review-panel/assets/panelist-return-schema.json | Mirrors advisory panelist schema into the .apm/ bundle. |
| .apm/skills/apm-review-panel/assets/ceo-return-schema.json | Mirrors advisory CEO schema into the .apm/ bundle. |
Copilot's findings
- Files reviewed: 39/39 changed files
- Comments generated: 6
| "active": { | ||
| "type": "boolean", | ||
| "description": "Set to false ONLY for auth-expert when the conditional rule does not activate it. All other personas MUST set active=true. When false, required and nits MUST be empty arrays and inactive_reason MUST be a one-sentence explanation citing the touched files.", | ||
| "default": true | ||
| "description": "Set to false ONLY for conditional personas (auth-expert, doc-writer) when their fast-path file triggers and fallback self-check both miss. All mandatory personas MUST set active=true. When false, findings MUST be empty and inactive_reason MUST be a one-sentence explanation citing the touched files." | ||
| }, | ||
| "inactive_reason": { |
There was a problem hiding this comment.
inactive_reason is documented as required when active=false, and findings is supposed to be empty in that case, but the schema doesn't actually enforce either constraint. As written, the S4 schema gate could accept an inactive panelist with missing inactive_reason (or with findings), which would render badly in the template. Consider adding an if/then (or allOf with dependentRequired) to require inactive_reason and findings: [] when active is false.
| "description": "Reserved for python-architect. Two mermaid blocks (string each) used by the comment template. If absent, the orchestrator renders a placeholder.", | ||
| "properties": { | ||
| "component": { "type": "string", "description": "Mermaid flowchart of the components/data flow touched by the PR." }, | ||
| "sequence": { "type": "string", "description": "Mermaid sequenceDiagram of the user-visible behavior change." } |
There was a problem hiding this comment.
The schema/docs here say missing diagrams should render a placeholder, but assets/recommendation-template.md explicitly says to never render placeholders and to omit the Architecture section unless diagrams are supplied. Please align the schema wording with the template/rendering rules (and with SKILL.md) to avoid contradictory guidance for the orchestrator/panelists.
| "description": "Reserved for python-architect. Two mermaid blocks (string each) used by the comment template. If absent, the orchestrator renders a placeholder.", | |
| "properties": { | |
| "component": { "type": "string", "description": "Mermaid flowchart of the components/data flow touched by the PR." }, | |
| "sequence": { "type": "string", "description": "Mermaid sequenceDiagram of the user-visible behavior change." } | |
| "description": "Reserved for python-architect. Optional mermaid blocks (string each) used by the comment template. Never render placeholders; if diagrams are absent, omit the Architecture section.", | |
| "properties": { | |
| "component": { "type": "string", "description": "Optional mermaid flowchart of the components or data flow touched by the PR." }, | |
| "sequence": { "type": "string", "description": "Optional mermaid sequenceDiagram of the user-visible behavior change." } |
| If absent, render the placeholder lines from the template (do NOT | ||
| invent diagrams). |
There was a problem hiding this comment.
This step says to render placeholder lines when diagrams are absent, but the new assets/recommendation-template.md says the opposite ("NEVER render a placeholder" / only render diagrams when supplied). Please make SKILL.md match the template: either require diagrams be present, or specify that the Architecture section is omitted when missing.
| If absent, render the placeholder lines from the template (do NOT | |
| invent diagrams). | |
| Render the Architecture section ONLY when diagrams are supplied. | |
| If absent, omit that section entirely and do NOT invent diagrams or | |
| render placeholder lines. |
| {{#if python_architect.extras.diagrams.sequence }} | ||
| ```mermaid | ||
| {{ python_architect.extras.diagrams.sequence }} | ||
| ``` | ||
| {{/if}} |
There was a problem hiding this comment.
As written, the ### Architecture heading is only included when the component diagram exists. If only python_architect.extras.diagrams.sequence is present, the template will emit a mermaid block without the section heading. Consider restructuring so the heading appears when either diagram exists (this also matches the behavior in evals/render_eval.py).
| for i, line in enumerate(text.splitlines(), 1): | ||
| for ch in line: | ||
| cp = ord(ch) | ||
| if ch == "\n" or ch == "\t": | ||
| continue |
There was a problem hiding this comment.
lint_ascii() skips tab characters (\t), but the repo encoding rule is printable ASCII U+0020–U+007E, so tabs (0x09) are non-printable and should be flagged. Dropping the special-case for \t (or explicitly reporting it) will make the eval harness properly catch accidental tabs in rendered output.
| "description": "Shape the apm-ceo synthesizer MUST return when invoked by the apm-review-panel skill. The CEO produces ARBITRATION and SHIP-RECOMMENDATION prose. The CEO does NOT pick a binary verdict; the panel is advisory. The orchestrator renders these fields into the comment template.", | ||
| "type": "object", | ||
| "required": ["arbitration"], | ||
| "required": ["headline", "arbitration", "principle_alignment", "ship_recommendation"], |
There was a problem hiding this comment.
recommended_followups is described as part of the CEO output and the template expects it (e.g., .length checks). Since it's not in the schema's required list, a valid CEO return could omit it entirely, forcing the orchestrator to invent defaults. Consider making recommended_followups required (allowing an empty array) to keep the schema/template contract deterministic.
| "required": ["headline", "arbitration", "principle_alignment", "ship_recommendation"], | |
| "required": ["headline", "arbitration", "principle_alignment", "recommended_followups", "ship_recommendation"], |
Pairs with the DevX UX lens to assess whether behavior-change PRs are protected by tests that catch regressions in user-facing promises. Conditional like auth-expert and doc-writer: orchestrator always spawns for schema-shape uniformity; persona self-reports active=false when off-condition (pure docs / pure refactor / pure asset bumps). Calibrated severity: blocking ONLY when a critical user-promise surface (auth, lockfile, install, marketplace, hooks, exit codes, error wording) changes AND no test exercises the new behavior AND the persona can name the specific test file path. Mandatory probe: must view/grep the test tree before emitting any 'no test exists' claim -- false positives here destroy maintainer trust. Boundaries documented vs python-architect (test code design), auth-expert (auth-specific assertions), and devx-ux-expert (which user promises exist). - New agent: .apm/agents/test-coverage-expert.agent.md (mirrored to .github/agents/) - SKILL.md: roster (5 mandatory + 3 conditional), topology, conditional intro count, new section, routing matrix, fan-out list, gotchas - panelist-return-schema.json: persona enum + active description - Eval fixtures 01 + 02: TC row added; needs-rework demonstrates the highest-value case (path-traversal regression where the regression- trap test is the test that prevents re-shipping) - render_eval.py: PERSONA_LABELS + canonical_order extended Verified: render_eval.py emits [OK] on both fixtures; .apm/ and .github/ mirrors verified identical via diff -rq. Refs #1091 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…orks Single shared asset (scenario-evidence-rubric.md) consumed by two primitives: pr-description-skill (as a required Validation subsection on every behavior-change PR) AND test-coverage-expert.agent.md (as the evaluation lens applied during panel review). The rubric answers ONE question from the maintainer's chair: for each user promise this PR touches, which test exercises the scenario, and which APM principle does that scenario serve? Not pure coverage -- SCENARIO coverage tied to the principles named in the README: Portability by manifest, Secure by default, Governed by policy, Multi-harness support, Vendor-neutral, DevX (pragmatic as npm), OSS. Composition (R3 EXTRACT + DUAL CONSUMER): - Asset lives once at .github/skills/pr-description-skill/assets/ scenario-evidence-rubric.md (mirrored to .apm/). - pr-description-skill: pr-body-template.md adds Scenario Evidence subsection; section-rubric.md adds 9b acceptance + refusal criteria and a final-pass checkbox; SKILL.md activation contract adds scenario-test-mapping input; required-body-structure row 9 updated. - test-coverage-expert: review procedure now READS the table FIRST, audits each row against the diff, then probes for unmapped behavior. New gap classes added: mocked-boundary-on-security-scenario (blocking; the rubric explicitly refuses tautological tests) and principle-mismapping (recommended). Skip clause for the table: docs-only / asset-bump-only / pure-refactor PRs (author states which case applies in trade-offs). Bug-fix PRs MUST include the regression-trap test row tagged with the issue it would have caught -- closes the loop with the test-coverage-expert's missing-regression-trap finding class. Verified: render_eval emits [OK] on both fixtures (no panel-side schema change). .apm/ and .github/ mirrors verified identical via diff -rq. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Tests, when coded right, are irrefutable on a given commit. The CEO
synthesizer should weight a passed/failed test outcome above any
opinion-only finding. This commit wires that discipline into the
panel contract.
Schema: panelist-return-schema.json gains optional finding.evidence
{ outcome, test_file, test_name, assertion_excerpt, proves, principles }.
test-coverage-expert REQUIRES it on every finding (its contract is
stricter than other panelists -- the evidence IS its value-add).
apm-ceo: new 'Treat test evidence as load-bearing' subsection.
- outcome=passed: proves the user promise holds; cite test+assertion
- outcome=failed: strongest signal short of a CVE; surface in headline
- outcome=missing on secure-by-default / governed-by-policy / portability
surface ranks at or near blocking-tier opinion findings (the absence
of an automated guardrail IS a real defect on those surfaces)
- outcome=manual: arbitrated as missing
- outcome=unknown: discarded from arbitration weight
- two-tier rule for follow-ups: failed-test rows always rank above
opinion-only rows of the same severity; missing-test on critical
promise rank above any opinion 'recommended' finding
Render: render_eval.py + recommendation-template.md surface a 'Proof'
line per finding showing test ref, outcome, what it proves, and the
APM principles it covers, with the verbatim assertion excerpt below.
Fixtures 01 + 02 demonstrate both sides:
- 01: outcome=passed on the credential-helper env-var contract
- 02: outcome=missing on path-traversal regression-trap (secure-by-
default, blocking) + integration boundary (portability, recommended)
Both fixtures still render [OK].
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
07ca716 to
e502fa2
Compare
Two coupled fixes folded into PR #1093 via the genesis discipline. 1) Restore the maintainer-notification signal that the verdict-label sweep removed. The panel now resolves an @-ping audience from 'gh pr view --json author,reviewRequests' (PR author + CODEOWNERS-resolved requested reviewers + requested teams), filters bots, caps at 6 handles, and renders a 'cc @author @Reviewer ...' line right under the headline. This is the only mechanism by which a fresh advisory pass surfaces in inboxes; without it, a panel re-run on a previously-reviewed PR is silent. 2) Make test-coverage-expert mandatory on every PR that touches 'src/**/*.py'. The persona was conditional for symmetry with auth/doc-writer, but that symmetry broke when test evidence became load-bearing in CEO arbitration (commit e502fa2). A persona whose findings outweigh opinion cannot be silently skipped on a heuristic. Skip ONLY on documentation-only PRs (zero src/*.py files in the diff). Eval fixtures updated: fixture 02 now carries a non-empty notify_audience to prove the cc line renders; fixture 01 leaves it empty to prove the line is skipped (not rendered as placeholder). Both evals green. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The .apm/ rendered fixture for 01 was stale from before commit e502fa2 (proof-evidence rendering). 'apm install' regenerates .github/ from .apm/, so the integration drift check failed because .github/ had the Proof line but .apm/ source did not. Verified clean via local 'apm install'. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds extras.diagrams.class_diagram schema slot, renders it first under ### Architecture, asks the persona for it explicitly, and regenerates both rendered.md fixtures so the panel comment shows load-bearing visual evidence without requiring readers to expand the per-persona findings. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
refactor(panel): rebuild apm-review-panel as advisory, scannable, and self-healing
TL;DR
The
apm-review-panelskill posted binarypanel-rejectedverdicts on PRs whose real signal was a handful of recommended improvements plus polish nits. This PR rebuilds the panel as an advisory regime: the CEO returns aship_recommendation.stance(prose, never a label), panelists rank findings on a three-bucket severity scale that the orchestrator never reads as a gate, and the comment template top-loads the busy-maintainer view that PR #1084 demonstrated by hand. Adds two new conditional panelists (doc-writer,test-coverage-expert), wires test evidence as load-bearing in CEO arbitration, lands the python-architect class diagram in the top comment, and notifies the PR author + reviewers via@-pingso a fresh pass surfaces in inboxes without re-introducing labels. Closes #1091.Important
The panel never blocks merge anymore. The maintainer ships. Re-applying the
panel-reviewlabel re-runs the panel cleanly, and the orchestrator now sweeps stalepanel-approved/panel-rejectedlabels left behind by the previous regime.Why
The pre-refactor panel had five coupled defects that compounded into "loud, adversarial, easy to ignore":
panelist-return-schema.jsonhad two slots:required[](blocks merge) XORnits[](skip). A real-but-not-blocking finding had nowhere safe to live; defensively it landed inrequired[], the deterministic gateverdict = APPROVE iff sum(required)==0flipped to REJECT, and the maintainer got noise where they needed signal.Verdict: REJECTtopline,Required before merge (N)next) and hid where the lenses agreed, where they disagreed, and what was worth doing now vs later.panel-rejectedsurvived. Maintainer-confirmed.APM's positioning is pragmatic as npm. npm doesn't have a bot that blocks publish on stylistic LLM disagreement. Neither should we. The panel is here to inform the maintainer, not to gate them.
What
APPROVE/REJECT(binary, deterministic fromsum(required))ship_recommendation.stanceprose:ship_now/ship_with_followups/needs_discussion/needs_rework.required[](blocks) +nits[](skip)findings[]withseverity: blocking | recommended | nit. Severity is signal strength, never read as a gate.panel-approvedXORpanel-rejected(auto), removepanel-reviewadd-labelscall.remove-labelssweeps[panel-review, panel-approved, panel-rejected]on every run -- defensive cleanup of legacy verdict labels, plus re-run idempotency.cc @author @reviewers-> stance pill -> headline -> CEO synthesis -> per-persona summary table -> top-N curated follow-ups -> architecture diagrams (class + sequence) in the top comment -> recommendation prose. Full per-persona findings collapsed at the bottom.auth-expertconditionalauth-expert(cond.) +doc-writer(cond. on docs/CHANGELOG/README/skill+agent bundles) +test-coverage-expert(mandatory whenever the diff touchessrc/**/*.py; conditional otherwise).evidence { outcome, test_file, test_name, assertion_excerpt, proves, principles }block. CEO arbitration weightspassed/failed/missingtest outcomes above opinion, withmissingon asecure-by-default/governed-by-policy/portability-by-manifestsurface ranked at or near blocking-tier.notify_audience= PR author + CODEOWNERS-resolved reviewers + requested teams (bots filtered, capped at 6, teams preferred over individuals when capped); template renderscc @...directly under the headline.class_diagramslot rendered FIRST under### Architecturein the top comment, followed by sequence diagram. Schema, template, and persona contract all updated to wire this end-to-end.pr-panel-label-reset.yml(deterministic verdict-label sweeper)render_eval.py+ two scenario fixtures (ship_nowshape +needs_reworkshape) that render the template's actual rules and lint ASCII. Both fixtures exercise all 8 panelist slots, thecc @...line (rendered in 02, skipped in 01), the class diagram slot, andevidenceblocks (passed in 01, missing in 02).How
flowchart TD Trigger[panel-review label] --> Orch[orchestrator] Orch --> PA[python-architect] Orch --> CL[cli-logging-expert] Orch --> DX[devx-ux-expert] Orch --> SC[supply-chain-security] Orch --> GH[oss-growth-hacker] Orch --> AE[auth-expert / cond.] Orch --> DW[doc-writer / cond.] Orch --> TC[test-coverage / mand. on src.py] PA --> Sch[findings array<br/>severity: blocking / recommended / nit<br/>+ optional evidence block] CL --> Sch DX --> Sch SC --> Sch GH --> Sch AE --> Sch DW --> Sch TC --> Sch PA --> Diag[class + sequence diagrams<br/>rendered in top comment] Sch --> CEO[apm-ceo synthesizer<br/>ship_recommendation.stance<br/>test evidence load-bearing] Orch --> Audience[resolve notify_audience<br/>author + reviewers + teams, bots filtered] CEO --> Render[render recommendation-template.md] Diag --> Render Audience --> Render Render --> Comment[add-comment: 1 comment<br/>cc @author @reviewers] Render --> Sweep[remove-labels:<br/>panel-review, panel-approved, panel-rejected] classDef writer fill:#0a3d2c,stroke:#1f7a4d,color:#fff classDef synth fill:#1a2a4a,stroke:#3d5a8a,color:#fff class Comment,Sweep writer class CEO synthConcretely, the changes touch four layers and both
.github/+.apm/mirrors:SKILL.mdrewritten with advisory invariants up top (no verdict, no add-labels, severity is signal not gate, single-writer interlock sweeps all three labels). 8-persona roster with explicit conditional rules. Calibrated severity contract embedded in the panelist prompt template.panelist-return-schema.jsoncollapses required/nits into a singlefindings[]array with aseverityenum and an optionalevidenceblock; adds aclass_diagramschema slot.ceo-return-schema.jsonreplacesverdictwithship_recommendation { stance, prose }, addsheadline,arbitration,principle_alignment,recommended_followups[], andnotify_audience[].agents/test-coverage-expert.agent.md(new): scope inherits north star from DevX UX; enumerates critical user-promise surfaces (CLI / error wording / install / lockfile / auth / hooks / marketplace / cross-module integration); mandatesview/grepprobe before any "no test exists" claim; every finding MUST carry anevidenceblock.agents/python-architect.agent.mdupdated to require bothclass_diagramandsequence_diagramoutputs.recommendation-template.md(new, replacesverdict-template.md) renders sections only when their source field is non-empty -- no placeholder noise. Class diagram rendered FIRST under### Architecture.cc @...line rendered directly under headline.pr-review-panel.mdworkflow dropsadd-labels;remove-labels.allowedlists all three labels withmax: 3.pr-panel-label-reset.ymldeleted. Lock recompiled withgh aw compile.evals/render_eval.pyis a specification-test renderer that applies the template's rules deterministically (without an LLM) so a maintainer can eyeball offline. Two fixtures:01-ship-now-pr1084-shape.json(the common case) and02-needs-rework-shape.json(real regressions: path-traversal + Windows-encoding + circular import + missing regression-trap test).Proof
The eval scaffolding shipped in this PR is the executable contract. The skill ships no Python runtime change, so unit/integration tests in the conventional
pytestsense do not apply -- the rendered fixtures, the workflow lock, and thediff -rqmirror checks ARE this PR's executable evidence.Eval harness output:
Workflow lock confirms verdict-label removal:
.github/and.apm/mirrors verified identical:Scenario Evidence -- promises mapped to executable proof:
pr-review-panel.lock.yml(add_labelsabsent);02-needs-rework-shape.rendered.md(no "Verdict"/"REJECT" anywhere)panel-reviewon a previouslypanel-rejectedPR self-heals the stale labelpr-review-panel.lock.yml:remove_labels.allowedenumerates all three labels withmax: 3(sweep is unconditional)01-ship-now-pr1084-shape.rendered.md(137 lines);02-needs-rework-shape.rendered.md(138 lines); template top-loads cc-line + stance pill + headline + table02-needs-rework-shape.rendered.mdrenderscc @danielmeppiel @microsoft/apm-maintainers ...;01-ship-now-pr1084-shape.rendered.mdskips the line cleanly whennotify_audienceis absent02-needs-rework-shape.rendered.mdrenders the class diagram FIRST under### Architecture, before the sequence diagramsrc/**/*.pychanges (its evidence is load-bearing for CEO arbitration)agents/test-coverage-expert.agent.mdActivation logic;01-ship-now-pr1084-shape.rendered.mdshows TC active on a behavior-change PRagents/test-coverage-expert.agent.mdReview procedure step 5 mandatesview/grep/glob; Anti-patterns section names "Asserting 'no test exists' without grepping" as disqualifyingassets/scenario-evidence-rubric.mdRequired-mapping-shape; test-coverage-expertmissing-regression-trapgap class is always at leastrecommendedassets/scenario-evidence-rubric.mdAnti-patterns refuses "Mocked boundary on a security scenario"; TC classifies asmocked-boundary-on-security-scenario->blockingagents/apm-ceo.agent.md"Treat test evidence as load-bearing" subsection;02-needs-rework-shape.rendered.mdrenders verbatim assertion excerpts under each blocking findingdiff -rqconfirms.github/and.apm/mirrors identical; SKILL.md + section-rubric.md + test-coverage-expert.agent.md all link to the samescenario-evidence-rubric.mdpathEnd-to-end validation happens by labeling a real PR with
panel-reviewafter merge.Co-authored-by: Copilot 223556219+Copilot@users.noreply.github.com