refactor(panel): add tier axis to test-coverage evidence #1102

Merged
danielmeppiel merged 1 commit into main from panel/test-coverage-tier-axis
May 2, 2026

@danielmeppiel
Collaborator

TL;DR

Add a required tier axis to the apm-review-panel test-coverage
evidence schema and teach the test-coverage-expert persona a tier-floor
matrix + S7 probe rule, so unit-tier passed proof can no longer
silently certify a critical user-promise surface that needs
fixture-grade integration coverage.

Problem (WHY)

While running the panel ad-hoc against PR #1097, the test-coverage-expert
persona returned outcome: passed on a critical install-cascade
surface backed only by unit tests with mocked boundaries plus a ruff
lint pass. The CEO synthesizer would have weighted that the same as
fixture-grade passed evidence and rendered "ship".

Diagnosed via the genesis skill. Five primitive-design pattern
violations stack up:

  • R3 EXTRACT not done. Persona body fuses two lenses (PRESENCE
    -- "does any test exist?" -- and TIER -- "is it the right
    kind?"). PRESENCE leaks into the schema; TIER stays in prose
    advice and is silently dropped at synthesis time.
  • S7 DETERMINISTIC TOOL BRIDGE skipped. Persona certifies
    user-promise defence by READING test code. No required tool-call
    probe to actually RUN integration tests against real fixtures.
    Fact-that-must-be-true reduced to LLM assertion -> TOOLLESS
    ASSERTION -> HAND-ROLLED HALLUCINATION -> silent drift.
  • A9 SUPERVISED EXECUTION not wrapped around critical findings.
  • PROSE Reduced Scope violation. One outcome axis collapses
    cheap proof (unit, mocks at boundary) and expensive proof
    (integration with real fixtures); cheap proof always wins under
    context pressure.
  • B4 PLAN MEMENTO weak in CEO synthesis. Per-persona row in the
    recommendation comment shows "N findings" -- no tier breakdown,
    so a maintainer cannot see at a glance whether evidence is
    fixture-grade.

The CEO contract (apm-ceo.agent.md "Treat test evidence as
load-bearing") says failed evidence outranks opinion. Equally,
today, a unit-tier passed outranks an opinion-only "this needs
integration coverage" finding from another panelist -- because the
schema gives the persona no way to admit "I have evidence, but only
at the wrong tier".

Approach (WHAT)

This PR ships Fix 1 + Fix 2 of a four-fix proposal (Fix 3 CEO
weighting + Fix 4 template breakdown deferred as follow-ups -- see
"Trade-offs" below):

  1. Schema: add a required tier enum to every evidence block,
    plus optional run_evidence for capturing pytest output when an
    integration test was actually executed in-session.
  2. Persona: introduce a Tier floor by surface matrix that maps
    the 8 critical-promise surfaces to a minimum acceptable tier; add
    a TWO-evidence-row discipline for sub-floor coverage; add an S7
    PROBE RULE that requires actually running integration-tier tests
    instead of reading them.

Implementation (HOW)

assets/panelist-return-schema.json

  • New required field evidence.tier with enum
    unit | integration-with-fixtures | e2e | manual-only | static.
  • New optional field evidence.run_evidence (string) for the pytest
    invocation + pass/fail line + duration.
  • evidence.outcome description updated:

    "passed at tier X certifies tier X only, not tiers above it."
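
A minimal Python sketch of the contract these schema bullets describe, assuming an evidence block is a plain dict. The `Evidence` dataclass and `parse_evidence` helper are hypothetical names for illustration; the real schema has other fields that are not modelled here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Tier(str, Enum):
    """The five tier values listed in the schema enum."""
    UNIT = "unit"
    INTEGRATION_WITH_FIXTURES = "integration-with-fixtures"
    E2E = "e2e"
    MANUAL_ONLY = "manual-only"
    STATIC = "static"


@dataclass(frozen=True)
class Evidence:
    # Hypothetical model: only the fields discussed in this PR are shown.
    outcome: str                        # e.g. "passed", "missing", "unknown"
    tier: Tier                          # required, so it has no default
    run_evidence: Optional[str] = None  # optional free-form string


def parse_evidence(block: dict) -> Evidence:
    """Reject blocks that omit tier or use a value outside the enum."""
    if "tier" not in block:
        raise ValueError("evidence.tier is required")
    return Evidence(
        outcome=block["outcome"],
        tier=Tier(block["tier"]),  # ValueError for values outside the enum
        run_evidence=block.get("run_evidence"),
    )
```

Under this sketch, `parse_evidence({"outcome": "passed"})` raises because `tier` is absent, which mirrors the schema change from optional prose advice to a required field.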

agents/test-coverage-expert.agent.md

Tier floor matrix (load-bearing artifact -- the persona reads this
before emitting any outcome: passed):

| Surface | Floor tier |
|---|---|
| CLI command surface | integration-with-fixtures |
| Error wording (string shape) | unit |
| Error wording (cascade reachability) | integration-with-fixtures |
| Install pipeline | integration-with-fixtures |
| Lockfile determinism | integration-with-fixtures |
| Auth resolution (new path) | integration-with-fixtures |
| Hook execution / routing | integration-with-fixtures |
| Marketplace download + integrity | integration-with-fixtures |
| Cross-module integration | integration-with-fixtures |
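
The matrix can be read as a lookup table. The sketch below encodes it as a Python dict and checks an evidence tier against the floor. The numeric ranking in `TIER_RANK` is an assumption made for illustration: the PR only states that a pass at tier X does not certify tiers above it, and does not define where `static` and `manual-only` sit relative to `unit`.

```python
# Assumed ordering, not defined by the PR: static and manual-only are
# treated as below unit; e2e is treated as above integration-with-fixtures.
TIER_RANK = {
    "static": 0,
    "manual-only": 0,
    "unit": 1,
    "integration-with-fixtures": 2,
    "e2e": 3,
}

# Floor tiers copied from the matrix above.
FLOOR_BY_SURFACE = {
    "CLI command surface": "integration-with-fixtures",
    "Error wording (string shape)": "unit",
    "Error wording (cascade reachability)": "integration-with-fixtures",
    "Install pipeline": "integration-with-fixtures",
    "Lockfile determinism": "integration-with-fixtures",
    "Auth resolution (new path)": "integration-with-fixtures",
    "Hook execution / routing": "integration-with-fixtures",
    "Marketplace download + integrity": "integration-with-fixtures",
    "Cross-module integration": "integration-with-fixtures",
}


def meets_floor(surface: str, tier: str) -> bool:
    """True when the evidence tier is at or above the surface's floor tier."""
    return TIER_RANK[tier] >= TIER_RANK[FLOOR_BY_SURFACE[surface]]
```

For example, `meets_floor("Install pipeline", "unit")` is `False`, which is the PR #1097 situation this change is meant to expose.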

Two new disciplines:

  • TWO-row rule (sub-floor coverage): when only unit coverage
    exists for a surface above unit floor, the persona MUST emit
    passed/unit AND missing/integration-with-fixtures rows. The
    cheap-proof row does not silence the integration-tier ask.
  • S7 PROBE RULE: an outcome: passed at
    tier: integration-with-fixtures or e2e on a critical-promise
    surface MUST have actually run in this review session, with the
    pytest invocation + result line + duration recorded in
    evidence.run_evidence. Skip-condition (missing creds, etc.)
    downgrades outcome to unknown instead of certifying.
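
The TWO-row rule above can be checked mechanically. This is a sketch only: it assumes each evidence row is a dict with `surface`, `outcome` and `tier` keys (the `surface` key is hypothetical), and that a row at the floor tier with any outcome satisfies the pairing.

```python
def two_row_violations(rows, floors):
    """Return surfaces with a passed/unit row but no row at the floor tier.

    rows:   list of dicts with "surface", "outcome", "tier" (assumed shape).
    floors: mapping of surface name -> floor tier.
    """
    seen = {(r["surface"], r["outcome"], r["tier"]) for r in rows}
    violations = []
    for surface, floor in floors.items():
        if floor == "unit":
            continue  # unit evidence already meets the floor
        has_unit_pass = (surface, "passed", "unit") in seen
        # Any row at the floor tier counts: "missing" when coverage is
        # absent, "passed" when fixture-grade coverage exists.
        has_floor_row = any(s == surface and t == floor for s, _o, t in seen)
        if has_unit_pass and not has_floor_row:
            violations.append(surface)
    return violations
```

A persona return that contains only `passed/unit` for the install pipeline would be reported here, while the paired `passed/unit` plus `missing/integration-with-fixtures` rows would not.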

Two new anti-patterns added: Reading a test instead of running it
and Collapsing tier under one outcome.
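
One way to honour the S7 PROBE RULE is to run the tests through a subprocess and build the `run_evidence` string from the actual output. The sketch below is an assumption-heavy illustration: the `tests/integration` path is hypothetical, and the outcome mapping is a heuristic built on pytest's documented exit codes (0 for success, 1 for test failures). A run that exits 0 without reporting any passed tests, for example when everything was skipped, is downgraded to `unknown`.

```python
import shlex
import subprocess
import sys
import time

# Hypothetical default command: the integration test path is an assumption.
DEFAULT_CMD = [sys.executable, "-m", "pytest", "tests/integration", "-q"]


def capture_run_evidence(cmd):
    """Run cmd and return (outcome, run_evidence)."""
    started = time.monotonic()
    proc = subprocess.run(cmd, capture_output=True, text=True)
    duration = time.monotonic() - started

    # The last non-empty stdout line is taken as the result line.
    lines = [ln.strip() for ln in proc.stdout.splitlines() if ln.strip()]
    result_line = lines[-1] if lines else "(no output)"

    if proc.returncode == 0 and " passed" in result_line:
        outcome = "passed"
    elif proc.returncode == 1:
        outcome = "failed"
    else:
        # Skips, collection errors, usage errors: do not certify.
        outcome = "unknown"

    # Invocation + result line + duration, as the rule asks for.
    run_evidence = f"{shlex.join(cmd)} -> {result_line} ({duration:.2f}s)"
    return outcome, run_evidence
```

Calling `capture_run_evidence(DEFAULT_CMD)` from a review session would yield a string suitable for `evidence.run_evidence`. A partially skipped run that still reports some passed tests counts as `passed` in this sketch, so a stricter check may be needed in practice.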

Trade-offs (what this does NOT fix)

  • CEO synthesizer is still tier-blind (Fix 3 deferred). The
    schema-required tier field is now PRESENT in every evidence
    row, but apm-ceo.agent.md does not read it. An LLM-as-persona
    could still emit only the passed/unit row and silently drop
    the missing/integration row -- nothing in the contract
    prevents that. Persona-side discipline (TWO-row rule) is the
    load-bearing mitigation today; CEO weighting can be added when
    signal demands.
  • Recommendation template has no tier-coverage breakdown
    (Fix 4 deferred). A maintainer scanning the PR comment still
    sees "N findings" per persona row instead of a per-tier
    breakdown.
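
To make the deferred Fix 4 concrete, here is a sketch of what a per-tier breakdown could look like. It is not part of this PR; the `tier_breakdown` helper and the row shape (dicts with `outcome` and `tier`) are assumptions for illustration.

```python
from collections import Counter


def tier_breakdown(rows):
    """Summarise evidence rows as "outcome/tier: count" pairs."""
    counts = Counter((r["outcome"], r["tier"]) for r in rows)
    return ", ".join(
        f"{outcome}/{tier}: {n}" for (outcome, tier), n in sorted(counts.items())
    )
```

A persona row in the recommendation comment could then show this string instead of a bare finding count, so sub-floor evidence is visible at a glance.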

Both follow-ups are tracked in the diagnosis in session-state
plan.md and are recommended once the schema change has propagated.

Validation evidence

  • jsonschema validates updated schema; both fixtures
    (01-ship-now-pr1084-shape.json, 02-needs-rework-shape.json)
    validate clean against the new contract (16 panelist returns
    total).
  • render_eval.py re-renders both fixtures clean
    (01-...rendered.md 137 lines, 02-...rendered.md 138 lines).
  • uv run --extra dev ruff check src/ tests/ -- silent.
  • uv run --extra dev ruff format --check src/ tests/ -- silent
    (620 files already formatted).
  • uv run --extra dev pytest tests/unit -k "panel or coverage or fixture or schema" -- 210 passed, 17 subtests passed in 6.52s.

How to test

  1. Inspect the schema change:
    git diff origin/main -- .apm/skills/apm-review-panel/assets/panelist-return-schema.json
  2. Re-validate the fixtures:
    uv run --extra dev python -c "
    import json
    from jsonschema import validate
    schema = json.load(open('.apm/skills/apm-review-panel/assets/panelist-return-schema.json'))
    for fp in [
        '.apm/skills/apm-review-panel/evals/fixtures/01-ship-now-pr1084-shape.json',
        '.apm/skills/apm-review-panel/evals/fixtures/02-needs-rework-shape.json',
    ]:
        fix = json.load(open(fp))
        for p in fix.get('panelists', []):
            validate(p, schema)
    print('OK')
    "
    
  3. Re-render the fixtures:
    uv run --extra dev python .apm/skills/apm-review-panel/evals/render_eval.py
  4. Dogfood: run apm-review-panel against any PR touching src/
    and confirm the test-coverage-expert returns either:
    • outcome: passed rows annotated with tier: integration-with-fixtures
      plus a run_evidence string, OR
    • paired passed/unit + missing/integration-with-fixtures rows
      when only unit coverage exists for an above-floor surface.

Drafted by the genesis skill diagnosis. Diagnosis trigger:
PR #1097 ad-hoc panel run accepted unit tests + ruff lint as proof
of a critical install-cascade surface.

The panelist-return-schema's evidence.outcome had no tier dimension,
so a unit-tier 'passed' and an integration-tier 'passed' were
indistinguishable to the CEO synthesizer. This let the
test-coverage-expert silently certify critical user-promise surfaces
(install pipeline, auth resolution, lockfile determinism, hooks,
marketplace download) on the basis of unit tests with mocked
boundaries plus a ruff lint pass -- the exact failure mode that
surfaced on PR #1097, where a unit-only 'passed' could have shipped
the cascade-exit rework without any fixture-grade proof of the new
dep-only escape hatch.

Diagnosed via the genesis skill. Root cause is a PROSE Reduced Scope
violation: one outcome axis collapsing cheap proof (unit, mocks at
boundary) with expensive proof (integration with real fixtures).
Persona body fused two lenses (PRESENCE + TIER) but only PRESENCE
leaked into the schema, so TIER advice was silently dropped at
synthesis.

Schema:
- evidence.tier required, enum {unit, integration-with-fixtures,
  e2e, manual-only, static}.
- evidence.run_evidence optional, captures pytest invocation +
  pass/fail line + duration when an integration test was actually
  executed in-session.
- outcome and evidence descriptions updated to spell out tier
  semantics (passed at tier X certifies tier X only, not tiers
  above it).

Persona (test-coverage-expert):
- New 'Tier floor by surface' matrix mapping 8 critical-promise
  surfaces to a minimum tier floor. CLI surface, install pipeline,
  lockfile determinism, auth resolution, hook execution, marketplace
  download + integrity, and cross-module integration all require
  integration-with-fixtures as the floor; only error-wording
  string-shape stays at unit.
- TWO evidence rows on sub-floor coverage: when only unit coverage
  exists for a surface above unit floor, persona MUST emit
  'passed/unit' AND 'missing/integration-with-fixtures' rows.
  Cheap proof does not silence integration-tier ask.
- S7 PROBE RULE: a 'passed' at integration-with-fixtures or e2e on
  a critical surface MUST have run in-session with pytest output
  recorded in evidence.run_evidence. Reading a test is not running
  it. Skip-condition (e.g. missing creds) downgrades outcome to
  'unknown' instead of certifying.
- Two new anti-patterns added: 'Reading a test instead of running
  it' and 'Collapsing tier under one outcome'.

Fixtures (evals shape references) updated for the new contract.
Schema validates with jsonschema; render_eval renders both fixtures
clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 2, 2026 09:50
Contributor

Copilot AI left a comment

Pull request overview

Adds a required test-evidence tier axis to the apm-review-panel panelist return schema and updates the test-coverage expert persona + eval fixtures so evidence cannot be reported without stating what tier it certifies.

Changes:

  • Require evidence.tier in the panelist return JSON schema; add optional evidence.run_evidence.
  • Update the test-coverage-expert persona with a tier-floor matrix and an integration/e2e probe discipline.
  • Update both eval fixtures to include tier on evidence rows (mirrored under .github/ and .apm/).

| File | Description |
|---|---|
| .github/skills/apm-review-panel/assets/panelist-return-schema.json | Adds required tier and optional run_evidence to evidence. |
| .github/agents/test-coverage-expert.agent.md | Introduces tier-floor matrix + probe rule guidance for tiered certification. |
| .github/skills/apm-review-panel/evals/fixtures/01-ship-now-pr1084-shape.json | Updates fixture evidence to include tier. |
| .github/skills/apm-review-panel/evals/fixtures/02-needs-rework-shape.json | Updates fixture evidence to include tier on missing rows. |
| .apm/skills/apm-review-panel/assets/panelist-return-schema.json | Same schema changes mirrored under .apm/. |
| .apm/agents/test-coverage-expert.agent.md | Same persona contract changes mirrored under .apm/. |
| .apm/skills/apm-review-panel/evals/fixtures/01-ship-now-pr1084-shape.json | Same fixture update mirrored under .apm/. |
| .apm/skills/apm-review-panel/evals/fixtures/02-needs-rework-shape.json | Same fixture update mirrored under .apm/. |

Copilot's findings

Comments suppressed due to low confidence (2)

.github/agents/test-coverage-expert.agent.md:88

  • The rationale text for the CLI row reads "help-text rendering only manifest end-to-end"; grammatically this should be "only manifests end-to-end".
| CLI command surface | `integration-with-fixtures` | argv parsing, exit codes, help-text rendering only manifest end-to-end |
| Error wording (string shape) | `unit` | string literal assertion is sufficient |

.apm/agents/test-coverage-expert.agent.md:88

  • Same grammar issue as the .github copy: "help-text rendering only manifest end-to-end" should be "only manifests end-to-end".
| CLI command surface | `integration-with-fixtures` | argv parsing, exit codes, help-text rendering only manifest end-to-end |
| Error wording (string shape) | `unit` | string literal assertion is sufficient |
  • Files reviewed: 8/8 changed files
  • Comments generated: 4

Comment on lines +115 to 123
"tier": {
  "type": "string",
  "enum": ["unit", "integration-with-fixtures", "e2e", "manual-only", "static"],
  "description": "Tier of evidence. unit: function-level test with mocks at the boundary; cheap, fast, narrow. integration-with-fixtures: real I/O against real fixtures (real files, real subprocess, real network when tagged), no mocked surface for the asserted contract. e2e: full CLI invocation end-to-end, real artifacts, real exit codes. manual-only: only a manual procedure; no automated guardrail. static: lint / type-check / schema validation only. The CEO weights tier against the SURFACE FLOOR named in the test-coverage-expert tier-floor matrix: a `unit` passed evidence on a critical-promise surface (CLI / install / lockfile / auth / hooks / marketplace / cross-module) does NOT silence an opinion-finding from another panelist asking for `integration-with-fixtures` coverage. Required on every evidence block."
},
"run_evidence": {
  "type": "string",
  "description": "Optional. For `outcome=passed` at `tier=integration-with-fixtures` or `e2e` on a critical-promise surface, the persona SHOULD have actually run the test (not just read it) and recorded the pytest invocation + pass/fail line + duration here. Verbatim, under 240 chars. Reading code is not running code (S7 DETERMINISTIC TOOL BRIDGE: facts-that-must-be-true do not survive as LLM assertions)."
},
Comment on lines +78 to +95
## Tier floor by surface (LOAD-BEARING; do not collapse to unit)

A unit test that mocks the boundary it claims to defend is NOT proof.
Reading test code is NOT running test code. For each critical surface
above, the MINIMUM evidence tier required to certify
`outcome: passed` is:

| Surface | Floor tier | Rationale |
|---|---|---|
| CLI command surface | `integration-with-fixtures` | argv parsing, exit codes, help-text rendering only manifest end-to-end |
| Error wording (string shape) | `unit` | string literal assertion is sufficient |
| Error wording (cascade reachability) | `integration-with-fixtures` | the user must actually hit the message via a real failing command |
| Install pipeline | `integration-with-fixtures` | resolution + download + integration + lockfile interplay only manifests with real packages |
| Lockfile determinism | `integration-with-fixtures` | round-trip behavior requires real read + real write + real diff |
| Auth resolution (new code path) | `integration-with-fixtures` | token precedence and host classification only manifest with real credential resolution paths |
| Hook execution / routing | `integration-with-fixtures` | filename-stem matching + content integration is filesystem behavior |
| Marketplace download + integrity | `integration-with-fixtures` | path segment + hash checks only meaningful against real downloaded content |
| Cross-module integration | `integration-with-fixtures` | unit tests on either side do not catch contract drift across the boundary |
@danielmeppiel danielmeppiel merged commit 095424f into main May 2, 2026
30 checks passed
@danielmeppiel danielmeppiel deleted the panel/test-coverage-tier-axis branch May 2, 2026 10:16