refactor(panel): add tier axis to test-coverage evidence#1102
Merged
danielmeppiel merged 1 commit intomainfrom May 2, 2026
Merged
refactor(panel): add tier axis to test-coverage evidence#1102danielmeppiel merged 1 commit intomainfrom
danielmeppiel merged 1 commit intomainfrom
Conversation
The panelist-return-schema's evidence.outcome had no tier dimension, so a unit-tier 'passed' and an integration-tier 'passed' were indistinguishable to the CEO synthesizer. This let the test-coverage-expert silently certify critical user-promise surfaces (install pipeline, auth resolution, lockfile determinism, hooks, marketplace download) on the basis of unit tests with mocked boundaries plus a ruff lint pass -- the exact failure mode that surfaced on PR #1097, where a unit-only 'passed' could have shipped the cascade-exit rework without any fixture-grade proof of the new dep-only escape hatch. Diagnosed via the genesis skill. Root cause is a PROSE Reduced Scope violation: one outcome axis collapsing cheap proof (unit, mocks at boundary) with expensive proof (integration with real fixtures). Persona body fused two lenses (PRESENCE + TIER) but only PRESENCE leaked into the schema, so TIER advice was silently dropped at synthesis. Schema: - evidence.tier required, enum {unit, integration-with-fixtures, e2e, manual-only, static}. - evidence.run_evidence optional, captures pytest invocation + pass/fail line + duration when an integration test was actually executed in-session. - outcome and evidence descriptions updated to spell out tier semantics (passed at tier X certifies tier X only, not tiers above it). Persona (test-coverage-expert): - New 'Tier floor by surface' matrix mapping 8 critical-promise surfaces to a minimum tier floor. CLI surface, install pipeline, lockfile determinism, auth resolution, hook execution, marketplace download + integrity, and cross-module integration all require integration-with-fixtures as the floor; only error-wording string-shape stays at unit. - TWO evidence rows on sub-floor coverage: when only unit coverage exists for a surface above unit floor, persona MUST emit 'passed/unit' AND 'missing/integration-with-fixtures' rows. Cheap proof does not silence integration-tier ask. - S7 PROBE RULE: a 'passed' at integration-with-fixtures or e2e on a critical surface MUST have run in-session with pytest output recorded in evidence.run_evidence. Reading a test is not running it. Skip-condition (e.g. missing creds) downgrades outcome to 'unknown' instead of certifying. - Two new anti-patterns added: 'Reading a test instead of running it' and 'Collapsing tier under one outcome'. Fixtures (evals shape references) updated for the new contract. Schema validates with jsonschema; render_eval renders both fixtures clean. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor
There was a problem hiding this comment.
Pull request overview
Adds a required test-evidence tier axis to the apm-review-panel panelist return schema and updates the test-coverage expert persona + eval fixtures so evidence cannot be reported without stating what tier it certifies.
Changes:
- Require
evidence.tierin the panelist return JSON schema; add optionalevidence.run_evidence. - Update the
test-coverage-expertpersona with a tier-floor matrix and an integration/e2e probe discipline. - Update both eval fixtures to include
tieron evidence rows (mirrored under.github/and.apm/).
Show a summary per file
| File | Description |
|---|---|
| .github/skills/apm-review-panel/assets/panelist-return-schema.json | Adds required tier and optional run_evidence to evidence. |
| .github/agents/test-coverage-expert.agent.md | Introduces tier-floor matrix + probe rule guidance for tiered certification. |
| .github/skills/apm-review-panel/evals/fixtures/01-ship-now-pr1084-shape.json | Updates fixture evidence to include tier. |
| .github/skills/apm-review-panel/evals/fixtures/02-needs-rework-shape.json | Updates fixture evidence to include tier on missing rows. |
| .apm/skills/apm-review-panel/assets/panelist-return-schema.json | Same schema changes mirrored under .apm/. |
| .apm/agents/test-coverage-expert.agent.md | Same persona contract changes mirrored under .apm/. |
| .apm/skills/apm-review-panel/evals/fixtures/01-ship-now-pr1084-shape.json | Same fixture update mirrored under .apm/. |
| .apm/skills/apm-review-panel/evals/fixtures/02-needs-rework-shape.json | Same fixture update mirrored under .apm/. |
Copilot's findings
Comments suppressed due to low confidence (2)
.github/agents/test-coverage-expert.agent.md:88
- The rationale text for the CLI row reads "help-text rendering only manifest end-to-end"; grammatically this should be "only manifests end-to-end".
| CLI command surface | `integration-with-fixtures` | argv parsing, exit codes, help-text rendering only manifest end-to-end |
| Error wording (string shape) | `unit` | string literal assertion is sufficient |
.apm/agents/test-coverage-expert.agent.md:88
- Same grammar issue as the .github copy: "help-text rendering only manifest end-to-end" should be "only manifests end-to-end".
| CLI command surface | `integration-with-fixtures` | argv parsing, exit codes, help-text rendering only manifest end-to-end |
| Error wording (string shape) | `unit` | string literal assertion is sufficient |
- Files reviewed: 8/8 changed files
- Comments generated: 4
Comment on lines
+115
to
123
| "tier": { | ||
| "type": "string", | ||
| "enum": ["unit", "integration-with-fixtures", "e2e", "manual-only", "static"], | ||
| "description": "Tier of evidence. unit: function-level test with mocks at the boundary; cheap, fast, narrow. integration-with-fixtures: real I/O against real fixtures (real files, real subprocess, real network when tagged), no mocked surface for the asserted contract. e2e: full CLI invocation end-to-end, real artifacts, real exit codes. manual-only: only a manual procedure; no automated guardrail. static: lint / type-check / schema validation only. The CEO weights tier against the SURFACE FLOOR named in the test-coverage-expert tier-floor matrix: a `unit` passed evidence on a critical-promise surface (CLI / install / lockfile / auth / hooks / marketplace / cross-module) does NOT silence an opinion-finding from another panelist asking for `integration-with-fixtures` coverage. Required on every evidence block." | ||
| }, | ||
| "run_evidence": { | ||
| "type": "string", | ||
| "description": "Optional. For `outcome=passed` at `tier=integration-with-fixtures` or `e2e` on a critical-promise surface, the persona SHOULD have actually run the test (not just read it) and recorded the pytest invocation + pass/fail line + duration here. Verbatim, under 240 chars. Reading code is not running code (S7 DETERMINISTIC TOOL BRIDGE: facts-that-must-be-true do not survive as LLM assertions)." | ||
| }, |
Comment on lines
+78
to
+95
| ## Tier floor by surface (LOAD-BEARING; do not collapse to unit) | ||
|
|
||
| A unit test that mocks the boundary it claims to defend is NOT proof. | ||
| Reading test code is NOT running test code. For each critical surface | ||
| above, the MINIMUM evidence tier required to certify | ||
| `outcome: passed` is: | ||
|
|
||
| | Surface | Floor tier | Rationale | | ||
| |---|---|---| | ||
| | CLI command surface | `integration-with-fixtures` | argv parsing, exit codes, help-text rendering only manifest end-to-end | | ||
| | Error wording (string shape) | `unit` | string literal assertion is sufficient | | ||
| | Error wording (cascade reachability) | `integration-with-fixtures` | the user must actually hit the message via a real failing command | | ||
| | Install pipeline | `integration-with-fixtures` | resolution + download + integration + lockfile interplay only manifests with real packages | | ||
| | Lockfile determinism | `integration-with-fixtures` | round-trip behavior requires real read + real write + real diff | | ||
| | Auth resolution (new code path) | `integration-with-fixtures` | token precedence and host classification only manifest with real credential resolution paths | | ||
| | Hook execution / routing | `integration-with-fixtures` | filename-stem matching + content integration is filesystem behavior | | ||
| | Marketplace download + integrity | `integration-with-fixtures` | path segment + hash checks only meaningful against real downloaded content | | ||
| | Cross-module integration | `integration-with-fixtures` | unit tests on either side do not catch contract drift across the boundary | |
Comment on lines
+115
to
123
| "tier": { | ||
| "type": "string", | ||
| "enum": ["unit", "integration-with-fixtures", "e2e", "manual-only", "static"], | ||
| "description": "Tier of evidence. unit: function-level test with mocks at the boundary; cheap, fast, narrow. integration-with-fixtures: real I/O against real fixtures (real files, real subprocess, real network when tagged), no mocked surface for the asserted contract. e2e: full CLI invocation end-to-end, real artifacts, real exit codes. manual-only: only a manual procedure; no automated guardrail. static: lint / type-check / schema validation only. The CEO weights tier against the SURFACE FLOOR named in the test-coverage-expert tier-floor matrix: a `unit` passed evidence on a critical-promise surface (CLI / install / lockfile / auth / hooks / marketplace / cross-module) does NOT silence an opinion-finding from another panelist asking for `integration-with-fixtures` coverage. Required on every evidence block." | ||
| }, | ||
| "run_evidence": { | ||
| "type": "string", | ||
| "description": "Optional. For `outcome=passed` at `tier=integration-with-fixtures` or `e2e` on a critical-promise surface, the persona SHOULD have actually run the test (not just read it) and recorded the pytest invocation + pass/fail line + duration here. Verbatim, under 240 chars. Reading code is not running code (S7 DETERMINISTIC TOOL BRIDGE: facts-that-must-be-true do not survive as LLM assertions)." | ||
| }, |
Comment on lines
+78
to
+95
| ## Tier floor by surface (LOAD-BEARING; do not collapse to unit) | ||
|
|
||
| A unit test that mocks the boundary it claims to defend is NOT proof. | ||
| Reading test code is NOT running test code. For each critical surface | ||
| above, the MINIMUM evidence tier required to certify | ||
| `outcome: passed` is: | ||
|
|
||
| | Surface | Floor tier | Rationale | | ||
| |---|---|---| | ||
| | CLI command surface | `integration-with-fixtures` | argv parsing, exit codes, help-text rendering only manifest end-to-end | | ||
| | Error wording (string shape) | `unit` | string literal assertion is sufficient | | ||
| | Error wording (cascade reachability) | `integration-with-fixtures` | the user must actually hit the message via a real failing command | | ||
| | Install pipeline | `integration-with-fixtures` | resolution + download + integration + lockfile interplay only manifests with real packages | | ||
| | Lockfile determinism | `integration-with-fixtures` | round-trip behavior requires real read + real write + real diff | | ||
| | Auth resolution (new code path) | `integration-with-fixtures` | token precedence and host classification only manifest with real credential resolution paths | | ||
| | Hook execution / routing | `integration-with-fixtures` | filename-stem matching + content integration is filesystem behavior | | ||
| | Marketplace download + integrity | `integration-with-fixtures` | path segment + hash checks only meaningful against real downloaded content | | ||
| | Cross-module integration | `integration-with-fixtures` | unit tests on either side do not catch contract drift across the boundary | |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
TL;DR
Add a required
tieraxis to theapm-review-paneltest-coverageevidence schema and teach the
test-coverage-expertpersona a tier-floormatrix + S7 probe rule, so unit-tier
passedproof can no longersilently certify a critical user-promise surface that needs
fixture-grade integration coverage.
Problem (WHY)
While running the panel ad-hoc against PR #1097, the test-coverage-expert
persona returned
outcome: passedon a critical install-cascadesurface backed only by unit tests with mocked boundaries plus a ruff
lint pass. The CEO synthesizer would have weighted that the same as
fixture-grade
passedevidence and rendered "ship".Diagnosed via the
genesisskill. Five primitive-design patternviolations stack up:
PRESENCE-- "does any test exist?" -- and
TIER-- "is it the rightkind?"). PRESENCE leaks into the schema; TIER stays in prose
advice and is silently dropped at synthesis time.
user-promise defence by READING test code. No required tool-call
probe to actually RUN integration tests against real fixtures.
Fact-that-must-be-true reduced to LLM assertion -> TOOLLESS
ASSERTION -> HAND-ROLLED HALLUCINATION -> silent drift.
outcomeaxis collapsescheap proof (unit, mocks at boundary) and expensive proof
(integration with real fixtures); cheap proof always wins under
context pressure.
recommendation comment shows "N findings" -- no tier breakdown,
so a maintainer cannot see at a glance whether evidence is
fixture-grade.
The CEO contract (
apm-ceo.agent.md"Treat test evidence asload-bearing") says
failedevidence outranks opinion. Equally,today, a unit-tier
passedoutranks an opinion-only "this needsintegration coverage" finding from another panelist -- because the
schema gives the persona no way to admit "I have evidence, but only
at the wrong tier".
Approach (WHAT)
This PR ships Fix 1 + Fix 2 of a four-fix proposal (Fix 3 CEO
weighting + Fix 4 template breakdown deferred as follow-ups -- see
"Trade-offs" below):
tierenum to everyevidenceblock,plus optional
run_evidencefor capturing pytest output when anintegration test was actually executed in-session.
Tier floor by surfacematrix that mapsthe 8 critical-promise surfaces to a minimum acceptable tier; add
a TWO-evidence-row discipline for sub-floor coverage; add an S7
PROBE RULE that requires actually running integration-tier tests
instead of reading them.
Implementation (HOW)
assets/panelist-return-schema.jsonevidence.tierwith enumunit | integration-with-fixtures | e2e | manual-only | static.evidence.run_evidence(string) for the pytestinvocation + pass/fail line + duration.
evidence.outcomedescription updated:agents/test-coverage-expert.agent.mdTier floor matrix (load-bearing artifact -- the persona reads this
before emitting any
outcome: passed):Two new disciplines:
exists for a surface above unit floor, the persona MUST emit
passed/unitANDmissing/integration-with-fixturesrows. Thecheap-proof row does not silence the integration-tier ask.
outcome: passedattier: integration-with-fixturesore2eon a critical-promisesurface MUST have actually run in this review session, with the
pytest invocation + result line + duration recorded in
evidence.run_evidence. Skip-condition (missing creds, etc.)downgrades
outcometounknowninstead of certifying.Two new anti-patterns added:
Reading a test instead of running itand
Collapsing tier under one outcome.Trade-offs (what this does NOT fix)
schema-required
tierfield is now PRESENT in every evidencerow, but
apm-ceo.agent.mddoes not read it. An LLM-as-personacould still emit only the
passed/unitrow and silently dropthe
missing/integrationrow -- nothing in the contractprevents that. Persona-side discipline (TWO-row rule) is the
load-bearing mitigation today; CEO weighting can be added when
signal demands.
(Fix 4 deferred). A maintainer scanning the PR comment still
sees "N findings" per persona row instead of a per-tier
breakdown.
Both follow-ups are tracked in the diagnosis on
session-state plan.md. They are recommended once the schema
change has propagated.
Validation evidence
jsonschemavalidates updated schema; both fixtures(
01-ship-now-pr1084-shape.json,02-needs-rework-shape.json)validate clean against the new contract (16 panelist returns
total).
render_eval.pyre-renders both fixtures clean(
01-...rendered.md137 lines,02-...rendered.md138 lines).uv run --extra dev ruff check src/ tests/-- silent.uv run --extra dev ruff format --check src/ tests/-- silent(620 files already formatted).
uv run --extra dev pytest tests/unit -k "panel or coverage or fixture or schema"-- 210 passed, 17 subtests passed in 6.52s.How to test
git diff origin/main -- .apm/skills/apm-review-panel/assets/panelist-return-schema.jsonuv run --extra dev python .apm/skills/apm-review-panel/evals/render_eval.pyapm-review-panelagainst any PR touchingsrc/and confirm the test-coverage-expert returns either:
outcome: passedrows annotated withtier: integration-with-fixturesplus a
run_evidencestring, ORpassed/unit+missing/integration-with-fixturesrowswhen only unit coverage exists for an above-floor surface.
Drafted by the
genesisskill diagnosis. Diagnosis trigger:PR #1097 ad-hoc panel run accepted unit tests + ruff lint as proof
of a critical install-cascade surface.