refactor(panel): add tier axis to test-coverage evidence by danielmeppiel · Pull Request #1102 · microsoft/apm

danielmeppiel · 2026-05-02T09:50:13Z

TL;DR

Add a required tier axis to the apm-review-panel test-coverage
evidence schema and teach the test-coverage-expert persona a tier-floor
matrix + S7 probe rule, so unit-tier passed proof can no longer
silently certify a critical user-promise surface that needs
fixture-grade integration coverage.

Problem (WHY)

While running the panel ad-hoc against PR #1097, the test-coverage-expert
persona returned outcome: passed on a critical install-cascade
surface backed only by unit tests with mocked boundaries plus a ruff
lint pass. The CEO synthesizer would have weighted that the same as
fixture-grade passed evidence and rendered "ship".

Diagnosed via the genesis skill. Five primitive-design pattern
violations stack up:

- R3 EXTRACT not done. Persona body fuses two lenses (PRESENCE
  -- "does any test exist?" -- and TIER -- "is it the right
  kind?"). PRESENCE leaks into the schema; TIER stays in prose
  advice and is silently dropped at synthesis time.
- S7 DETERMINISTIC TOOL BRIDGE skipped. Persona certifies
  user-promise defence by READING test code. No required tool-call
  probe to actually RUN integration tests against real fixtures.
  Fact-that-must-be-true reduced to LLM assertion -> TOOLLESS
  ASSERTION -> HAND-ROLLED HALLUCINATION -> silent drift.
- A9 SUPERVISED EXECUTION not wrapped around critical findings.
- PROSE Reduced Scope violation. One outcome axis collapses
  cheap proof (unit, mocks at boundary) and expensive proof
  (integration with real fixtures); cheap proof always wins under
  context pressure.
- B4 PLAN MEMENTO weak in CEO synthesis. Per-persona row in the
  recommendation comment shows "N findings" -- no tier breakdown,
  so a maintainer cannot see at a glance whether evidence is
  fixture-grade.

The CEO contract (apm-ceo.agent.md "Treat test evidence as
load-bearing") says failed evidence outranks opinion. Equally,
today, a unit-tier passed outranks an opinion-only "this needs
integration coverage" finding from another panelist -- because the
schema gives the persona no way to admit "I have evidence, but only
at the wrong tier".

Approach (WHAT)

This PR ships Fix 1 + Fix 2 of a four-fix proposal (Fix 3 CEO
weighting + Fix 4 template breakdown deferred as follow-ups -- see
"Trade-offs" below):

- Schema: add a required tier enum to every evidence block,
  plus optional run_evidence for capturing pytest output when an
  integration test was actually executed in-session.
- Persona: introduce a Tier floor by surface matrix that maps
  the 8 critical-promise surfaces to a minimum acceptable tier; add
  a TWO-evidence-row discipline for sub-floor coverage; add an S7
  PROBE RULE that requires actually running integration-tier tests
  instead of reading them.
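
As a rough illustration of the schema half of this approach, the sketch below checks a single evidence block with the standard library only. The helper name `check_evidence` and the sample rows are assumptions made for this example; the actual enforcement is the JSON schema itself, validated with jsonschema.

```python
# Illustrative only: mirrors the tier enum named in this PR description.
TIERS = {"unit", "integration-with-fixtures", "e2e", "manual-only", "static"}


def check_evidence(evidence: dict) -> list[str]:
    """Return a list of problems for one evidence block (empty means OK)."""
    problems = []
    tier = evidence.get("tier")
    if tier is None:
        # tier is required on every evidence block.
        problems.append("tier is required")
    elif not isinstance(tier, str) or tier not in TIERS:
        problems.append(f"unknown tier: {tier!r}")
    run_evidence = evidence.get("run_evidence")
    if run_evidence is not None and not isinstance(run_evidence, str):
        # run_evidence is optional, but must be a string when present.
        problems.append("run_evidence must be a string")
    return problems


print(check_evidence({"outcome": "passed", "tier": "unit"}))
print(check_evidence({"outcome": "passed"}))
```

The second call reports the missing tier, which is the case the old schema accepted silently.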

Implementation (HOW)

assets/panelist-return-schema.json

- New required field evidence.tier with enum
  unit | integration-with-fixtures | e2e | manual-only | static.
- New optional field evidence.run_evidence (string) for the pytest
  invocation + pass/fail line + duration.
- evidence.outcome description updated:
  "passed at tier X certifies tier X only, not tiers above it."

agents/test-coverage-expert.agent.md

Tier floor matrix (load-bearing artifact -- the persona reads this
before emitting any outcome: passed):

| Surface | Floor tier |
|---|---|
| CLI command surface | integration-with-fixtures |
| Error wording (string shape) | unit |
| Error wording (cascade reachability) | integration-with-fixtures |
| Install pipeline | integration-with-fixtures |
| Lockfile determinism | integration-with-fixtures |
| Auth resolution (new path) | integration-with-fixtures |
| Hook execution / routing | integration-with-fixtures |
| Marketplace download + integrity | integration-with-fixtures |
| Cross-module integration | integration-with-fixtures |
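
The matrix above can be read as a lookup plus a comparison. The following is a minimal sketch of that reading, not code from the persona file: `FLOOR_BY_SURFACE`, `TIER_RANK`, and `meets_floor` are hypothetical names, and the numeric ranking (in particular placing static and manual-only below unit, and e2e above integration-with-fixtures) is an assumption of this example.

```python
# Assumed ordering for illustration. The PR only states that a pass at tier X
# does not certify tiers above it; the exact ranks here are this sketch's guess.
TIER_RANK = {
    "static": 0,
    "manual-only": 0,
    "unit": 1,
    "integration-with-fixtures": 2,
    "e2e": 3,
}

# Floor tiers copied from the matrix in this PR description.
FLOOR_BY_SURFACE = {
    "CLI command surface": "integration-with-fixtures",
    "Error wording (string shape)": "unit",
    "Error wording (cascade reachability)": "integration-with-fixtures",
    "Install pipeline": "integration-with-fixtures",
    "Lockfile determinism": "integration-with-fixtures",
    "Auth resolution (new path)": "integration-with-fixtures",
    "Hook execution / routing": "integration-with-fixtures",
    "Marketplace download + integrity": "integration-with-fixtures",
    "Cross-module integration": "integration-with-fixtures",
}


def meets_floor(surface: str, tier: str) -> bool:
    """True if evidence at `tier` reaches the floor tier for `surface`."""
    return TIER_RANK[tier] >= TIER_RANK[FLOOR_BY_SURFACE[surface]]


print(meets_floor("Install pipeline", "unit"))
print(meets_floor("Error wording (string shape)", "unit"))
```

Under these assumptions a unit-tier pass reaches the floor only for the error-wording string-shape surface.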

Two new disciplines:

- TWO-row rule (sub-floor coverage): when only unit coverage
  exists for a surface above unit floor, the persona MUST emit
  passed/unit AND missing/integration-with-fixtures rows. The
  cheap-proof row does not silence the integration-tier ask.
- S7 PROBE RULE: an outcome: passed at
  tier: integration-with-fixtures or e2e on a critical-promise
  surface MUST have actually run in this review session, with the
  pytest invocation + result line + duration recorded in
  evidence.run_evidence. Skip-condition (missing creds, etc.)
  downgrades outcome to unknown instead of certifying.
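
The two disciplines are instructions to an LLM persona, but their intended effect on the evidence rows can be sketched in plain code. The example below is hypothetical: the function names, the `surface` key, and the sample pytest line are invented for illustration, and treating an absent `run_evidence` as "not run in-session" is a simplification of the skip-condition wording.

```python
def rows_for_unit_only(surface: str, floor: str) -> list[dict]:
    """Rows to emit when only unit coverage exists for `surface`."""
    rows = [{"surface": surface, "outcome": "passed", "tier": "unit"}]
    if floor != "unit":
        # TWO-row rule: the cheap proof must not silence the floor-tier ask.
        rows.append({"surface": surface, "outcome": "missing", "tier": floor})
    return rows


def apply_probe_rule(row: dict) -> dict:
    """Downgrade an unrun integration/e2e 'passed' to 'unknown' (S7 PROBE RULE)."""
    needs_run = row["outcome"] == "passed" and row["tier"] in {
        "integration-with-fixtures",
        "e2e",
    }
    if needs_run and not row.get("run_evidence"):
        # No recorded run in this session, so the row cannot certify.
        return {**row, "outcome": "unknown"}
    return row


for row in rows_for_unit_only("Install pipeline", "integration-with-fixtures"):
    print(row["outcome"], row["tier"])

print(apply_probe_rule({"outcome": "passed", "tier": "e2e"})["outcome"])
```

Keeping the downgrade as a pure function of the row makes it easy to see that a unit-tier pass is never touched by the probe rule.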

Two new anti-patterns added: Reading a test instead of running it
and Collapsing tier under one outcome.

Trade-offs (what this does NOT fix)

- CEO synthesizer is still tier-blind (Fix 3 deferred). The
  schema-required tier field is now PRESENT in every evidence
  row, but apm-ceo.agent.md does not read it. An LLM-as-persona
  could still emit only the passed/unit row and silently drop
  the missing/integration row -- nothing in the contract
  prevents that. Persona-side discipline (TWO-row rule) is the
  load-bearing mitigation today; CEO weighting can be added when
  signal demands.
- Recommendation template has no tier-coverage breakdown
  (Fix 4 deferred). A maintainer scanning the PR comment still
  sees "N findings" per persona row instead of a per-tier
  breakdown.
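
For a sense of what the deferred per-tier breakdown might look like, here is a small sketch that summarizes evidence rows by outcome and tier. It is not part of this PR: `tier_breakdown` and the output format are hypothetical, and the real template change (Fix 4) may take a different shape.

```python
from collections import Counter


def tier_breakdown(evidence_rows: list[dict]) -> str:
    """Summarize rows as 'outcome/tier: count' pairs, sorted for stable output."""
    counts = Counter((row["outcome"], row["tier"]) for row in evidence_rows)
    return ", ".join(
        f"{outcome}/{tier}: {n}" for (outcome, tier), n in sorted(counts.items())
    )


rows = [
    {"outcome": "passed", "tier": "unit"},
    {"outcome": "missing", "tier": "integration-with-fixtures"},
    {"outcome": "passed", "tier": "unit"},
]
print(tier_breakdown(rows))
# prints: missing/integration-with-fixtures: 1, passed/unit: 2
```

A line like this in the persona row would let a maintainer see at a glance that the passes are unit-tier only.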

Both follow-ups are tracked in the diagnosis on
session-state plan.md. They are recommended once the schema
change has propagated.

Validation evidence

- jsonschema validates updated schema; both fixtures
  (01-ship-now-pr1084-shape.json, 02-needs-rework-shape.json)
  validate clean against the new contract (16 panelist returns
  total).
- render_eval.py re-renders both fixtures clean
  (01-...rendered.md 137 lines, 02-...rendered.md 138 lines).
- uv run --extra dev ruff check src/ tests/ -- silent.
- uv run --extra dev ruff format --check src/ tests/ -- silent
  (620 files already formatted).
- uv run --extra dev pytest tests/unit -k "panel or coverage or fixture or schema" -- 210 passed, 17 subtests passed in 6.52s.

How to test

Inspect the schema change:
git diff origin/main -- .apm/skills/apm-review-panel/assets/panelist-return-schema.json

Re-validate the fixtures:

uv run --extra dev python -c "
import json
from jsonschema import validate
schema = json.load(open('.apm/skills/apm-review-panel/assets/panelist-return-schema.json'))
for fp in [
    '.apm/skills/apm-review-panel/evals/fixtures/01-ship-now-pr1084-shape.json',
    '.apm/skills/apm-review-panel/evals/fixtures/02-needs-rework-shape.json',
]:
    fix = json.load(open(fp))
    for p in fix.get('panelists', []):
        validate(p, schema)
print('OK')
"

Re-render the fixtures:
uv run --extra dev python .apm/skills/apm-review-panel/evals/render_eval.py

Dogfood: run apm-review-panel against any PR touching src/
and confirm the test-coverage-expert returns either:
- outcome: passed rows annotated with tier: integration-with-fixtures
  plus a run_evidence string, OR
- paired passed/unit + missing/integration-with-fixtures rows
  when only unit coverage exists for an above-floor surface.
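
The dogfood check can also be done mechanically on the returned rows. The sketch below assumes the evidence rows for one above-floor surface have already been extracted into a list of dicts; `acceptable_shape` is a hypothetical helper, not something shipped in this PR.

```python
def acceptable_shape(rows: list[dict]) -> bool:
    """True if rows for one above-floor surface match either expected shape."""
    pairs = {(r["outcome"], r["tier"]) for r in rows}
    # Shape 1: an integration-tier pass that carries a run_evidence string.
    ran_integration = any(
        r["outcome"] == "passed"
        and r["tier"] == "integration-with-fixtures"
        and bool(r.get("run_evidence"))
        for r in rows
    )
    # Shape 2: the paired passed/unit + missing/integration-with-fixtures rows.
    paired = {
        ("passed", "unit"),
        ("missing", "integration-with-fixtures"),
    } <= pairs
    return ran_integration or paired


print(acceptable_shape([{"outcome": "passed", "tier": "unit"}]))
```

A lone passed/unit row fails this check, which is the regression the TWO-row rule is meant to prevent.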

Drafted by the genesis skill diagnosis. Diagnosis trigger:
PR #1097 ad-hoc panel run accepted unit tests + ruff lint as proof
of a critical install-cascade surface.

The panelist-return-schema's evidence.outcome had no tier dimension, so a unit-tier 'passed' and an integration-tier 'passed' were indistinguishable to the CEO synthesizer. This let the test-coverage-expert silently certify critical user-promise surfaces (install pipeline, auth resolution, lockfile determinism, hooks, marketplace download) on the basis of unit tests with mocked boundaries plus a ruff lint pass -- the exact failure mode that surfaced on PR #1097, where a unit-only 'passed' could have shipped the cascade-exit rework without any fixture-grade proof of the new dep-only escape hatch.

Diagnosed via the genesis skill. Root cause is a PROSE Reduced Scope violation: one outcome axis collapsing cheap proof (unit, mocks at boundary) with expensive proof (integration with real fixtures). Persona body fused two lenses (PRESENCE + TIER) but only PRESENCE leaked into the schema, so TIER advice was silently dropped at synthesis.

Schema:
- evidence.tier required, enum {unit, integration-with-fixtures, e2e, manual-only, static}.
- evidence.run_evidence optional, captures pytest invocation + pass/fail line + duration when an integration test was actually executed in-session.
- outcome and evidence descriptions updated to spell out tier semantics (passed at tier X certifies tier X only, not tiers above it).

Persona (test-coverage-expert):
- New 'Tier floor by surface' matrix mapping 8 critical-promise surfaces to a minimum tier floor. CLI surface, install pipeline, lockfile determinism, auth resolution, hook execution, marketplace download + integrity, and cross-module integration all require integration-with-fixtures as the floor; only error-wording string-shape stays at unit.
- TWO evidence rows on sub-floor coverage: when only unit coverage exists for a surface above unit floor, persona MUST emit 'passed/unit' AND 'missing/integration-with-fixtures' rows. Cheap proof does not silence integration-tier ask.
- S7 PROBE RULE: a 'passed' at integration-with-fixtures or e2e on a critical surface MUST have run in-session with pytest output recorded in evidence.run_evidence. Reading a test is not running it. Skip-condition (e.g. missing creds) downgrades outcome to 'unknown' instead of certifying.
- Two new anti-patterns added: 'Reading a test instead of running it' and 'Collapsing tier under one outcome'.

Fixtures (evals shape references) updated for the new contract. Schema validates with jsonschema; render_eval renders both fixtures clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Adds a required test-evidence tier axis to the apm-review-panel panelist return schema and updates the test-coverage expert persona + eval fixtures so evidence cannot be reported without stating what tier it certifies.

Changes:

Require evidence.tier in the panelist return JSON schema; add optional evidence.run_evidence.
Update the test-coverage-expert persona with a tier-floor matrix and an integration/e2e probe discipline.
Update both eval fixtures to include tier on evidence rows (mirrored under .github/ and .apm/).


| File | Description |
|---|---|
| .github/skills/apm-review-panel/assets/panelist-return-schema.json | Adds required `tier` and optional `run_evidence` to `evidence`. |
| .github/agents/test-coverage-expert.agent.md | Introduces tier-floor matrix + probe rule guidance for tiered certification. |
| .github/skills/apm-review-panel/evals/fixtures/01-ship-now-pr1084-shape.json | Updates fixture evidence to include `tier`. |
| .github/skills/apm-review-panel/evals/fixtures/02-needs-rework-shape.json | Updates fixture evidence to include `tier` on missing rows. |
| .apm/skills/apm-review-panel/assets/panelist-return-schema.json | Same schema changes mirrored under `.apm/`. |
| .apm/agents/test-coverage-expert.agent.md | Same persona contract changes mirrored under `.apm/`. |
| .apm/skills/apm-review-panel/evals/fixtures/01-ship-now-pr1084-shape.json | Same fixture update mirrored under `.apm/`. |
| .apm/skills/apm-review-panel/evals/fixtures/02-needs-rework-shape.json | Same fixture update mirrored under `.apm/`. |

Copilot's findings

Comments suppressed due to low confidence (2)

.github/agents/test-coverage-expert.agent.md:88

The rationale text for the CLI row reads "help-text rendering only manifest end-to-end"; grammatically this should be "only manifests end-to-end".

| CLI command surface | `integration-with-fixtures` | argv parsing, exit codes, help-text rendering only manifest end-to-end |
| Error wording (string shape) | `unit` | string literal assertion is sufficient |

.apm/agents/test-coverage-expert.agent.md:88

Same grammar issue as the .github copy: "help-text rendering only manifest end-to-end" should be "only manifests end-to-end".

| CLI command surface | `integration-with-fixtures` | argv parsing, exit codes, help-text rendering only manifest end-to-end |
| Error wording (string shape) | `unit` | string literal assertion is sufficient |

Files reviewed: 8/8 changed files
Comments generated: 4

+            "tier": {
+              "type": "string",
+              "enum": ["unit", "integration-with-fixtures", "e2e", "manual-only", "static"],
+              "description": "Tier of evidence. unit: function-level test with mocks at the boundary; cheap, fast, narrow. integration-with-fixtures: real I/O against real fixtures (real files, real subprocess, real network when tagged), no mocked surface for the asserted contract. e2e: full CLI invocation end-to-end, real artifacts, real exit codes. manual-only: only a manual procedure; no automated guardrail. static: lint / type-check / schema validation only. The CEO weights tier against the SURFACE FLOOR named in the test-coverage-expert tier-floor matrix: a `unit` passed evidence on a critical-promise surface (CLI / install / lockfile / auth / hooks / marketplace / cross-module) does NOT silence an opinion-finding from another panelist asking for `integration-with-fixtures` coverage. Required on every evidence block."
+            },
+            "run_evidence": {
+              "type": "string",
+              "description": "Optional. For `outcome=passed` at `tier=integration-with-fixtures` or `e2e` on a critical-promise surface, the persona SHOULD have actually run the test (not just read it) and recorded the pytest invocation + pass/fail line + duration here. Verbatim, under 240 chars. Reading code is not running code (S7 DETERMINISTIC TOOL BRIDGE: facts-that-must-be-true do not survive as LLM assertions)."
            },


+## Tier floor by surface (LOAD-BEARING; do not collapse to unit)
+
+A unit test that mocks the boundary it claims to defend is NOT proof.
+Reading test code is NOT running test code. For each critical surface
+above, the MINIMUM evidence tier required to certify
+`outcome: passed` is:
+
+| Surface | Floor tier | Rationale |
+|---|---|---|
+| CLI command surface | `integration-with-fixtures` | argv parsing, exit codes, help-text rendering only manifest end-to-end |
+| Error wording (string shape) | `unit` | string literal assertion is sufficient |
+| Error wording (cascade reachability) | `integration-with-fixtures` | the user must actually hit the message via a real failing command |
+| Install pipeline | `integration-with-fixtures` | resolution + download + integration + lockfile interplay only manifests with real packages |
+| Lockfile determinism | `integration-with-fixtures` | round-trip behavior requires real read + real write + real diff |
+| Auth resolution (new code path) | `integration-with-fixtures` | token precedence and host classification only manifest with real credential resolution paths |
+| Hook execution / routing | `integration-with-fixtures` | filename-stem matching + content integration is filesystem behavior |
+| Marketplace download + integrity | `integration-with-fixtures` | path segment + hash checks only meaningful against real downloaded content |
+| Cross-module integration | `integration-with-fixtures` | unit tests on either side do not catch contract drift across the boundary |


Copilot AI review requested due to automatic review settings May 2, 2026 09:50

danielmeppiel requested a review from sergio-sisternes-epam as a code owner May 2, 2026 09:50

Copilot started reviewing on behalf of danielmeppiel May 2, 2026 09:50

Copilot AI reviewed May 2, 2026


danielmeppiel added the area/agentic-workflows label May 2, 2026

danielmeppiel merged commit 095424f into main May 2, 2026
30 checks passed

danielmeppiel deleted the panel/test-coverage-tier-axis branch May 2, 2026 10:16

danielmeppiel mentioned this pull request May 2, 2026

[docs] Update documentation for features from 2026-05-02 #1107

Closed
