From f287445a6ba5cfcc6f55a6b681f9cc9bf47bec49 Mon Sep 17 00:00:00 2001 From: Copilot Date: Sat, 2 May 2026 11:47:37 +0200 Subject: [PATCH] refactor(panel): add tier axis to test-coverage evidence The panelist-return-schema's evidence.outcome had no tier dimension, so a unit-tier 'passed' and an integration-tier 'passed' were indistinguishable to the CEO synthesizer. This let the test-coverage-expert silently certify critical user-promise surfaces (install pipeline, auth resolution, lockfile determinism, hooks, marketplace download) on the basis of unit tests with mocked boundaries plus a ruff lint pass -- the exact failure mode that surfaced on PR #1097, where a unit-only 'passed' could have shipped the cascade-exit rework without any fixture-grade proof of the new dep-only escape hatch. Diagnosed via the genesis skill. Root cause is a PROSE Reduced Scope violation: one outcome axis collapsing cheap proof (unit, mocks at boundary) with expensive proof (integration with real fixtures). Persona body fused two lenses (PRESENCE + TIER) but only PRESENCE leaked into the schema, so TIER advice was silently dropped at synthesis. Schema: - evidence.tier required, enum {unit, integration-with-fixtures, e2e, manual-only, static}. - evidence.run_evidence optional, captures pytest invocation + pass/fail line + duration when an integration test was actually executed in-session. - outcome and evidence descriptions updated to spell out tier semantics (passed at tier X certifies tier X only, not tiers above it). Persona (test-coverage-expert): - New 'Tier floor by surface' matrix mapping 8 critical-promise surfaces to a minimum tier floor. CLI surface, install pipeline, lockfile determinism, auth resolution, hook execution, marketplace download + integrity, and cross-module integration all require integration-with-fixtures as the floor; only error-wording string-shape stays at unit. - TWO evidence rows on sub-floor coverage: when only unit coverage exists for a surface above unit floor, persona MUST emit 'passed/unit' AND 'missing/integration-with-fixtures' rows. Cheap proof does not silence integration-tier ask. - S7 PROBE RULE: a 'passed' at integration-with-fixtures or e2e on a critical surface MUST have run in-session with pytest output recorded in evidence.run_evidence. Reading a test is not running it. Skip-condition (e.g. missing creds) downgrades outcome to 'unknown' instead of certifying. - Two new anti-patterns added: 'Reading a test instead of running it' and 'Collapsing tier under one outcome'. Fixtures (evals shape references) updated for the new contract. Schema validates with jsonschema; render_eval renders both fixtures clean. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .apm/agents/test-coverage-expert.agent.md | 119 ++++++++++++++---- .../assets/panelist-return-schema.json | 15 ++- .../fixtures/01-ship-now-pr1084-shape.json | 1 + .../evals/fixtures/02-needs-rework-shape.json | 2 + .github/agents/test-coverage-expert.agent.md | 119 ++++++++++++++---- .../assets/panelist-return-schema.json | 15 ++- .../fixtures/01-ship-now-pr1084-shape.json | 1 + .../evals/fixtures/02-needs-rework-shape.json | 2 + 8 files changed, 218 insertions(+), 56 deletions(-) diff --git a/.apm/agents/test-coverage-expert.agent.md b/.apm/agents/test-coverage-expert.agent.md index 48d85c96..fd30e55d 100644 --- a/.apm/agents/test-coverage-expert.agent.md +++ b/.apm/agents/test-coverage-expert.agent.md @@ -75,6 +75,48 @@ specific behavior change is your highest-priority finding. command needs to cover the new path -- a unit test on each module is necessary but not sufficient. +## Tier floor by surface (LOAD-BEARING; do not collapse to unit) + +A unit test that mocks the boundary it claims to defend is NOT proof. +Reading test code is NOT running test code. For each critical surface +above, the MINIMUM evidence tier required to certify +`outcome: passed` is: + +| Surface | Floor tier | Rationale | +|---|---|---| +| CLI command surface | `integration-with-fixtures` | argv parsing, exit codes, help-text rendering only manifest end-to-end | +| Error wording (string shape) | `unit` | string literal assertion is sufficient | +| Error wording (cascade reachability) | `integration-with-fixtures` | the user must actually hit the message via a real failing command | +| Install pipeline | `integration-with-fixtures` | resolution + download + integration + lockfile interplay only manifests with real packages | +| Lockfile determinism | `integration-with-fixtures` | round-trip behavior requires real read + real write + real diff | +| Auth resolution (new code path) | `integration-with-fixtures` | token precedence and host classification only manifest with real credential resolution paths | +| Hook execution / routing | `integration-with-fixtures` | filename-stem matching + content integration is filesystem behavior | +| Marketplace download + integrity | `integration-with-fixtures` | path segment + hash checks only meaningful against real downloaded content | +| Cross-module integration | `integration-with-fixtures` | unit tests on either side do not catch contract drift across the boundary | + +Two new disciplines follow from this matrix: + +1. **Tier-floor compliance check.** When you find a unit test that + covers a critical-surface change but no test at the floor tier + exists, return TWO evidence rows: one `outcome: passed, tier: unit` + for the unit coverage you found, and one `outcome: missing, + tier: integration-with-fixtures` for the floor gap. Severity on the + missing row is `recommended` by default; promote to `blocking` only + when the surface change is a security/auth/install promise AND there + is no reasonable fixture path. Do NOT silence the unit row -- the + unit test still defends the function in isolation; you are saying + the user-promise is not yet certified end-to-end. +2. **S7 PROBE RULE on integration evidence.** When you return + `outcome: passed` at `tier: integration-with-fixtures` or `e2e` on + a critical-promise surface, you MUST have RUN the test (not just + read it) within this review. Capture the pytest invocation + the + pass/fail line + duration in `evidence.run_evidence` (verbatim, + under 240 chars). Reading test code is LLM assertion; running it + against real fixtures is irrefutable proof. Skip-condition: if the + test requires a credential you don't have (e.g. `GITHUB_APM_PAT`), + note the skip in `evidence.run_evidence` and downgrade `outcome` + to `unknown` for that row -- do NOT certify on a read. + ## Review procedure (MANDATORY -- do not skip) You are the panelist who makes claims about TEST PRESENCE. Every claim @@ -178,6 +220,18 @@ should live. "We should have more tests" is not a finding. - **Asserting "no test exists" without grepping.** You MUST verify via `view` / `grep`. A false-positive finding here destroys trust in the field. +- **Reading a test instead of running it.** When you certify + `outcome: passed` at `tier: integration-with-fixtures` or `e2e` on + a critical-promise surface, you MUST have actually run the test in + this review and recorded the invocation + result in + `evidence.run_evidence`. Reading test code is LLM assertion; + running it is irrefutable. This is the S7 PROBE RULE. +- **Collapsing tier under one outcome.** A unit test that mocks the + install pipeline at the boundary it claims to defend is NOT proof + of the install-pipeline user promise. Return TWO evidence rows when + you find sub-floor coverage: one `passed/unit` for the unit lens, + one `missing/integration-with-fixtures` for the floor gap. Do not + let the cheap proof silence the integration-tier ask. ## Boundaries @@ -249,43 +303,58 @@ that would post comments or apply labels. Your contract is STRICTER than other panelists: every finding you return MUST include the `evidence` object from -`assets/panelist-return-schema.json`. This is what makes your lens -load-bearing for the apm-ceo synthesizer -- tests, when coded right, -are irrefutable, and the CEO weights your `evidence` block above -opinion-only findings (see apm-ceo "Treat test evidence as -load-bearing"). +`assets/panelist-return-schema.json` AND the `tier` field on every +evidence row. This is what makes your lens load-bearing for the +apm-ceo synthesizer -- tests, when coded right and RUN against real +fixtures, are irrefutable, and the CEO weights your `evidence` block +above opinion-only findings (see apm-ceo "Treat test evidence as +load-bearing"). The tier field is what lets the CEO reason about +PROOF DEPTH, not just proof presence. Per outcome, the required shape: - `outcome: passed` -- `test_file` REQUIRED, `test_name` REQUIRED if the file has more than one test, `assertion_excerpt` REQUIRED (verbatim line carrying the assertion, under 240 chars), `proves` - REQUIRED (the user promise in user words), `principles` REQUIRED. - Use this shape when you affirm a scenario is covered (often a - `severity: recommended` follow-up "this test should also assert X" - or simply a body-text affirmation in the rationale). -- `outcome: failed` -- same shape as `passed` plus the failing - assertion's actual-vs-expected line in the rationale. This is the - load-bearing case for `severity: blocking`. Reproduce the failure - with the exact pytest command in `suggestion`. + REQUIRED (the user promise in user words), `principles` REQUIRED, + `tier` REQUIRED (`unit` | `integration-with-fixtures` | `e2e` | + `manual-only` | `static`). When `tier` is `integration-with-fixtures` + or `e2e` AND the surface is in the critical-promise list above, + `run_evidence` REQUIRED (per the S7 PROBE RULE: you actually ran + the test, you didn't just read it). Use this shape when you affirm + a scenario is covered (often a `severity: recommended` follow-up + "this test should also assert X" or simply a body-text affirmation + in the rationale). +- `outcome: failed` -- same shape as `passed` (including required + `tier` and `run_evidence` for integration/e2e on critical surfaces) + plus the failing assertion's actual-vs-expected line in the + rationale. This is the load-bearing case for `severity: blocking`. + Reproduce the failure with the exact pytest command in `suggestion`. - `outcome: missing` -- `test_file` REQUIRED (the path where the test SHOULD live), `test_name` REQUIRED (the name you would give it), `assertion_excerpt` REQUIRED (the line that WOULD assert, written as Python pseudocode), `proves` REQUIRED, `principles` - REQUIRED. You MUST have probed via `view` / `grep` / `glob` to - confirm absence before claiming `missing`. State the probe in the - rationale (e.g. "grep'd `tests/unit/install/` for `test_*overwrite*`, - no match"). This is the load-bearing case for regression-trap gaps - on bug-fix PRs and security-promise PRs. + REQUIRED, `tier` REQUIRED (the tier the surface FLOOR demands per + the matrix above; usually `integration-with-fixtures` for critical + surfaces). You MUST have probed via `view` / `grep` / `glob` to + confirm absence at the floor tier before claiming `missing`. State + the probe in the rationale (e.g. "grep'd `tests/integration/` for + `*install*pipeline*`, no match"). This is the load-bearing case + for regression-trap gaps on bug-fix PRs and security-promise PRs. - `outcome: manual` -- when only manual verification is referenced. CEO treats this as `missing`. Use sparingly; usually you should - emit `missing` instead and propose the test. + emit `missing` instead and propose the test. `tier` MUST be + `manual-only`. - `outcome: unknown` -- LAST RESORT. If you must return `unknown`, the rationale MUST explain WHY (e.g. "test exists but I cannot - determine if it exercises the changed branch without running it"). - CEO discards `unknown` from arbitration weight; do not lean on it. - -A finding without an `evidence` block is a malformed return from -your persona. The orchestrator may downweight it; the CEO will note -the malformation in `dissent_notes`. Your value to the panel IS the + determine if it exercises the changed branch without running it", + or "integration test exists but no GITHUB_APM_PAT in env to run + the S7 probe"). `tier` is still required (your best guess at the + tier of the test you couldn't fully verify). CEO discards `unknown` + from arbitration weight; do not lean on it. + +A finding without an `evidence` block, or with an evidence block +missing `tier`, is a malformed return from your persona. The +orchestrator may downweight it; the CEO will note the malformation +in `dissent_notes`. Your value to the panel IS the tier-aware evidence -- everyone else can argue from rules. diff --git a/.apm/skills/apm-review-panel/assets/panelist-return-schema.json b/.apm/skills/apm-review-panel/assets/panelist-return-schema.json index 0a14c5e8..cffb3cec 100644 --- a/.apm/skills/apm-review-panel/assets/panelist-return-schema.json +++ b/.apm/skills/apm-review-panel/assets/panelist-return-schema.json @@ -95,8 +95,8 @@ }, "evidence": { "type": "object", - "description": "Optional irrefutable evidence backing this finding -- a real test that exists (passed/failed) or a real test that should exist but does not (missing). When present, the apm-ceo synthesizer treats this as LOAD-BEARING: a passed test proves the asserted user promise holds; a failed test proves it does not; a missing test on a critical-promise surface proves the regression-trap gap. The CEO does not arbitrate against a passed/failed outcome except by naming a specific reason (test was wrong, environment issue, flakiness with run-count). REQUIRED on every test-coverage-expert finding (the persona's contract); STRONGLY ENCOURAGED on any finding from any other persona that points at a test (e.g. supply-chain-security citing a test proving an exploit, devx-ux citing a test proving an error wording).", - "required": ["outcome"], + "description": "Optional irrefutable evidence backing this finding -- a real test that exists (passed/failed) or a real test that should exist but does not (missing). When present, the apm-ceo synthesizer treats this as LOAD-BEARING: a passed test proves the asserted user promise holds AT THE STATED TIER; a failed test proves it does not; a missing test on a critical-promise surface proves the regression-trap gap. Tier matters: a `unit` passed does NOT certify the promise at integration-with-fixtures or e2e tier, and a critical user-promise surface (CLI command, install pipeline, lockfile, auth, hooks, marketplace, cross-module integration -- see test-coverage-expert tier-floor matrix) requires `integration-with-fixtures` or `e2e` to fully certify. The CEO does not arbitrate against a passed/failed outcome except by naming a specific reason (test was wrong, tier was below the surface floor, environment issue, flakiness with run-count). REQUIRED on every test-coverage-expert finding (the persona's contract); STRONGLY ENCOURAGED on any finding from any other persona that points at a test (e.g. supply-chain-security citing a test proving an exploit, devx-ux citing a test proving an error wording).", + "required": ["outcome", "tier"], "additionalProperties": false, "properties": { "test_file": { @@ -110,7 +110,16 @@ "outcome": { "type": "string", "enum": ["passed", "failed", "missing", "manual", "unknown"], - "description": "passed: the test exists in the diff or in main and proves the user promise on this PR's commit. failed: the test exists and does NOT pass on this PR's commit (CEO weighs heavily; this is the load-bearing case for blocking). missing: no test exists at the expected location; the persona has probed via view/grep and confirmed absence. manual: only manual verification; counts as no automated guardrail. unknown: outcome was not verifiable in this run; persona MUST explain why in the rationale." + "description": "passed: the test exists in the diff or in main and proves the user promise AT THE STATED TIER on this PR's commit. failed: the test exists and does NOT pass on this PR's commit (CEO weighs heavily; this is the load-bearing case for blocking). missing: no test exists at the expected location AT THE REQUIRED TIER for the surface; the persona has probed via view/grep and confirmed absence. Use `outcome=missing` with the floor tier when only sub-floor evidence exists for a critical surface (e.g. unit tests exist but the install-pipeline surface needs integration-with-fixtures). manual: only manual verification; counts as no automated guardrail. unknown: outcome was not verifiable in this run; persona MUST explain why in the rationale." + }, + "tier": { + "type": "string", + "enum": ["unit", "integration-with-fixtures", "e2e", "manual-only", "static"], + "description": "Tier of evidence. unit: function-level test with mocks at the boundary; cheap, fast, narrow. integration-with-fixtures: real I/O against real fixtures (real files, real subprocess, real network when tagged), no mocked surface for the asserted contract. e2e: full CLI invocation end-to-end, real artifacts, real exit codes. manual-only: only a manual procedure; no automated guardrail. static: lint / type-check / schema validation only. The CEO weights tier against the SURFACE FLOOR named in the test-coverage-expert tier-floor matrix: a `unit` passed evidence on a critical-promise surface (CLI / install / lockfile / auth / hooks / marketplace / cross-module) does NOT silence an opinion-finding from another panelist asking for `integration-with-fixtures` coverage. Required on every evidence block." + }, + "run_evidence": { + "type": "string", + "description": "Optional. For `outcome=passed` at `tier=integration-with-fixtures` or `e2e` on a critical-promise surface, the persona SHOULD have actually run the test (not just read it) and recorded the pytest invocation + pass/fail line + duration here. Verbatim, under 240 chars. Reading code is not running code (S7 DETERMINISTIC TOOL BRIDGE: facts-that-must-be-true do not survive as LLM assertions)." }, "assertion_excerpt": { "type": "string", diff --git a/.apm/skills/apm-review-panel/evals/fixtures/01-ship-now-pr1084-shape.json b/.apm/skills/apm-review-panel/evals/fixtures/01-ship-now-pr1084-shape.json index 093de5cd..499ff6ab 100644 --- a/.apm/skills/apm-review-panel/evals/fixtures/01-ship-now-pr1084-shape.json +++ b/.apm/skills/apm-review-panel/evals/fixtures/01-ship-now-pr1084-shape.json @@ -83,6 +83,7 @@ "test_file": "tests/unit/install/test_pipeline_auth_preflight.py", "test_name": "test_install_update_does_not_disable_credential_helpers_on_generic_host", "outcome": "passed", + "tier": "unit", "assertion_excerpt": "assert os.environ.get('GIT_TERMINAL_PROMPT') is None", "proves": "On non-GitHub non-ADO hosts, install --update does not block the user's system credential helpers.", "principles": ["multi-harness-support", "vendor-neutral", "devx"] diff --git a/.apm/skills/apm-review-panel/evals/fixtures/02-needs-rework-shape.json b/.apm/skills/apm-review-panel/evals/fixtures/02-needs-rework-shape.json index 9608ec81..67d13962 100644 --- a/.apm/skills/apm-review-panel/evals/fixtures/02-needs-rework-shape.json +++ b/.apm/skills/apm-review-panel/evals/fixtures/02-needs-rework-shape.json @@ -134,6 +134,7 @@ "test_file": "tests/unit/deps/test_resolver.py", "test_name": "test_resolver_rejects_path_traversal_in_dep_name", "outcome": "missing", + "tier": "unit", "assertion_excerpt": "with pytest.raises(ValueError): resolver.resolve(Dep(name='../../../etc/passwd', source='gh:...'))", "proves": "User-controlled dep.name cannot escape the install directory via traversal payloads.", "principles": [ @@ -153,6 +154,7 @@ "test_file": "tests/integration/test_install_pipeline.py", "test_name": "test_install_pipeline_resolver_to_downloader_e2e", "outcome": "missing", + "tier": "integration-with-fixtures", "assertion_excerpt": "result = install(manifest_path, scope=USER); assert (target_dir / 'expected.md').exists()", "proves": "The new Resolver -> Downloader boundary actually delivers files to disk for a real apm.yml.", "principles": [ diff --git a/.github/agents/test-coverage-expert.agent.md b/.github/agents/test-coverage-expert.agent.md index 48d85c96..fd30e55d 100644 --- a/.github/agents/test-coverage-expert.agent.md +++ b/.github/agents/test-coverage-expert.agent.md @@ -75,6 +75,48 @@ specific behavior change is your highest-priority finding. command needs to cover the new path -- a unit test on each module is necessary but not sufficient. +## Tier floor by surface (LOAD-BEARING; do not collapse to unit) + +A unit test that mocks the boundary it claims to defend is NOT proof. +Reading test code is NOT running test code. For each critical surface +above, the MINIMUM evidence tier required to certify +`outcome: passed` is: + +| Surface | Floor tier | Rationale | +|---|---|---| +| CLI command surface | `integration-with-fixtures` | argv parsing, exit codes, help-text rendering only manifest end-to-end | +| Error wording (string shape) | `unit` | string literal assertion is sufficient | +| Error wording (cascade reachability) | `integration-with-fixtures` | the user must actually hit the message via a real failing command | +| Install pipeline | `integration-with-fixtures` | resolution + download + integration + lockfile interplay only manifests with real packages | +| Lockfile determinism | `integration-with-fixtures` | round-trip behavior requires real read + real write + real diff | +| Auth resolution (new code path) | `integration-with-fixtures` | token precedence and host classification only manifest with real credential resolution paths | +| Hook execution / routing | `integration-with-fixtures` | filename-stem matching + content integration is filesystem behavior | +| Marketplace download + integrity | `integration-with-fixtures` | path segment + hash checks only meaningful against real downloaded content | +| Cross-module integration | `integration-with-fixtures` | unit tests on either side do not catch contract drift across the boundary | + +Two new disciplines follow from this matrix: + +1. **Tier-floor compliance check.** When you find a unit test that + covers a critical-surface change but no test at the floor tier + exists, return TWO evidence rows: one `outcome: passed, tier: unit` + for the unit coverage you found, and one `outcome: missing, + tier: integration-with-fixtures` for the floor gap. Severity on the + missing row is `recommended` by default; promote to `blocking` only + when the surface change is a security/auth/install promise AND there + is no reasonable fixture path. Do NOT silence the unit row -- the + unit test still defends the function in isolation; you are saying + the user-promise is not yet certified end-to-end. +2. **S7 PROBE RULE on integration evidence.** When you return + `outcome: passed` at `tier: integration-with-fixtures` or `e2e` on + a critical-promise surface, you MUST have RUN the test (not just + read it) within this review. Capture the pytest invocation + the + pass/fail line + duration in `evidence.run_evidence` (verbatim, + under 240 chars). Reading test code is LLM assertion; running it + against real fixtures is irrefutable proof. Skip-condition: if the + test requires a credential you don't have (e.g. `GITHUB_APM_PAT`), + note the skip in `evidence.run_evidence` and downgrade `outcome` + to `unknown` for that row -- do NOT certify on a read. + ## Review procedure (MANDATORY -- do not skip) You are the panelist who makes claims about TEST PRESENCE. Every claim @@ -178,6 +220,18 @@ should live. "We should have more tests" is not a finding. - **Asserting "no test exists" without grepping.** You MUST verify via `view` / `grep`. A false-positive finding here destroys trust in the field. +- **Reading a test instead of running it.** When you certify + `outcome: passed` at `tier: integration-with-fixtures` or `e2e` on + a critical-promise surface, you MUST have actually run the test in + this review and recorded the invocation + result in + `evidence.run_evidence`. Reading test code is LLM assertion; + running it is irrefutable. This is the S7 PROBE RULE. +- **Collapsing tier under one outcome.** A unit test that mocks the + install pipeline at the boundary it claims to defend is NOT proof + of the install-pipeline user promise. Return TWO evidence rows when + you find sub-floor coverage: one `passed/unit` for the unit lens, + one `missing/integration-with-fixtures` for the floor gap. Do not + let the cheap proof silence the integration-tier ask. ## Boundaries @@ -249,43 +303,58 @@ that would post comments or apply labels. Your contract is STRICTER than other panelists: every finding you return MUST include the `evidence` object from -`assets/panelist-return-schema.json`. This is what makes your lens -load-bearing for the apm-ceo synthesizer -- tests, when coded right, -are irrefutable, and the CEO weights your `evidence` block above -opinion-only findings (see apm-ceo "Treat test evidence as -load-bearing"). +`assets/panelist-return-schema.json` AND the `tier` field on every +evidence row. This is what makes your lens load-bearing for the +apm-ceo synthesizer -- tests, when coded right and RUN against real +fixtures, are irrefutable, and the CEO weights your `evidence` block +above opinion-only findings (see apm-ceo "Treat test evidence as +load-bearing"). The tier field is what lets the CEO reason about +PROOF DEPTH, not just proof presence. Per outcome, the required shape: - `outcome: passed` -- `test_file` REQUIRED, `test_name` REQUIRED if the file has more than one test, `assertion_excerpt` REQUIRED (verbatim line carrying the assertion, under 240 chars), `proves` - REQUIRED (the user promise in user words), `principles` REQUIRED. - Use this shape when you affirm a scenario is covered (often a - `severity: recommended` follow-up "this test should also assert X" - or simply a body-text affirmation in the rationale). -- `outcome: failed` -- same shape as `passed` plus the failing - assertion's actual-vs-expected line in the rationale. This is the - load-bearing case for `severity: blocking`. Reproduce the failure - with the exact pytest command in `suggestion`. + REQUIRED (the user promise in user words), `principles` REQUIRED, + `tier` REQUIRED (`unit` | `integration-with-fixtures` | `e2e` | + `manual-only` | `static`). When `tier` is `integration-with-fixtures` + or `e2e` AND the surface is in the critical-promise list above, + `run_evidence` REQUIRED (per the S7 PROBE RULE: you actually ran + the test, you didn't just read it). Use this shape when you affirm + a scenario is covered (often a `severity: recommended` follow-up + "this test should also assert X" or simply a body-text affirmation + in the rationale). +- `outcome: failed` -- same shape as `passed` (including required + `tier` and `run_evidence` for integration/e2e on critical surfaces) + plus the failing assertion's actual-vs-expected line in the + rationale. This is the load-bearing case for `severity: blocking`. + Reproduce the failure with the exact pytest command in `suggestion`. - `outcome: missing` -- `test_file` REQUIRED (the path where the test SHOULD live), `test_name` REQUIRED (the name you would give it), `assertion_excerpt` REQUIRED (the line that WOULD assert, written as Python pseudocode), `proves` REQUIRED, `principles` - REQUIRED. You MUST have probed via `view` / `grep` / `glob` to - confirm absence before claiming `missing`. State the probe in the - rationale (e.g. "grep'd `tests/unit/install/` for `test_*overwrite*`, - no match"). This is the load-bearing case for regression-trap gaps - on bug-fix PRs and security-promise PRs. + REQUIRED, `tier` REQUIRED (the tier the surface FLOOR demands per + the matrix above; usually `integration-with-fixtures` for critical + surfaces). You MUST have probed via `view` / `grep` / `glob` to + confirm absence at the floor tier before claiming `missing`. State + the probe in the rationale (e.g. "grep'd `tests/integration/` for + `*install*pipeline*`, no match"). This is the load-bearing case + for regression-trap gaps on bug-fix PRs and security-promise PRs. - `outcome: manual` -- when only manual verification is referenced. CEO treats this as `missing`. Use sparingly; usually you should - emit `missing` instead and propose the test. + emit `missing` instead and propose the test. `tier` MUST be + `manual-only`. - `outcome: unknown` -- LAST RESORT. If you must return `unknown`, the rationale MUST explain WHY (e.g. "test exists but I cannot - determine if it exercises the changed branch without running it"). - CEO discards `unknown` from arbitration weight; do not lean on it. - -A finding without an `evidence` block is a malformed return from -your persona. The orchestrator may downweight it; the CEO will note -the malformation in `dissent_notes`. Your value to the panel IS the + determine if it exercises the changed branch without running it", + or "integration test exists but no GITHUB_APM_PAT in env to run + the S7 probe"). `tier` is still required (your best guess at the + tier of the test you couldn't fully verify). CEO discards `unknown` + from arbitration weight; do not lean on it. + +A finding without an `evidence` block, or with an evidence block +missing `tier`, is a malformed return from your persona. The +orchestrator may downweight it; the CEO will note the malformation +in `dissent_notes`. Your value to the panel IS the tier-aware evidence -- everyone else can argue from rules. diff --git a/.github/skills/apm-review-panel/assets/panelist-return-schema.json b/.github/skills/apm-review-panel/assets/panelist-return-schema.json index 0a14c5e8..cffb3cec 100644 --- a/.github/skills/apm-review-panel/assets/panelist-return-schema.json +++ b/.github/skills/apm-review-panel/assets/panelist-return-schema.json @@ -95,8 +95,8 @@ }, "evidence": { "type": "object", - "description": "Optional irrefutable evidence backing this finding -- a real test that exists (passed/failed) or a real test that should exist but does not (missing). When present, the apm-ceo synthesizer treats this as LOAD-BEARING: a passed test proves the asserted user promise holds; a failed test proves it does not; a missing test on a critical-promise surface proves the regression-trap gap. The CEO does not arbitrate against a passed/failed outcome except by naming a specific reason (test was wrong, environment issue, flakiness with run-count). REQUIRED on every test-coverage-expert finding (the persona's contract); STRONGLY ENCOURAGED on any finding from any other persona that points at a test (e.g. supply-chain-security citing a test proving an exploit, devx-ux citing a test proving an error wording).", - "required": ["outcome"], + "description": "Optional irrefutable evidence backing this finding -- a real test that exists (passed/failed) or a real test that should exist but does not (missing). When present, the apm-ceo synthesizer treats this as LOAD-BEARING: a passed test proves the asserted user promise holds AT THE STATED TIER; a failed test proves it does not; a missing test on a critical-promise surface proves the regression-trap gap. Tier matters: a `unit` passed does NOT certify the promise at integration-with-fixtures or e2e tier, and a critical user-promise surface (CLI command, install pipeline, lockfile, auth, hooks, marketplace, cross-module integration -- see test-coverage-expert tier-floor matrix) requires `integration-with-fixtures` or `e2e` to fully certify. The CEO does not arbitrate against a passed/failed outcome except by naming a specific reason (test was wrong, tier was below the surface floor, environment issue, flakiness with run-count). REQUIRED on every test-coverage-expert finding (the persona's contract); STRONGLY ENCOURAGED on any finding from any other persona that points at a test (e.g. supply-chain-security citing a test proving an exploit, devx-ux citing a test proving an error wording).", + "required": ["outcome", "tier"], "additionalProperties": false, "properties": { "test_file": { @@ -110,7 +110,16 @@ "outcome": { "type": "string", "enum": ["passed", "failed", "missing", "manual", "unknown"], - "description": "passed: the test exists in the diff or in main and proves the user promise on this PR's commit. failed: the test exists and does NOT pass on this PR's commit (CEO weighs heavily; this is the load-bearing case for blocking). missing: no test exists at the expected location; the persona has probed via view/grep and confirmed absence. manual: only manual verification; counts as no automated guardrail. unknown: outcome was not verifiable in this run; persona MUST explain why in the rationale." + "description": "passed: the test exists in the diff or in main and proves the user promise AT THE STATED TIER on this PR's commit. failed: the test exists and does NOT pass on this PR's commit (CEO weighs heavily; this is the load-bearing case for blocking). missing: no test exists at the expected location AT THE REQUIRED TIER for the surface; the persona has probed via view/grep and confirmed absence. Use `outcome=missing` with the floor tier when only sub-floor evidence exists for a critical surface (e.g. unit tests exist but the install-pipeline surface needs integration-with-fixtures). manual: only manual verification; counts as no automated guardrail. unknown: outcome was not verifiable in this run; persona MUST explain why in the rationale." + }, + "tier": { + "type": "string", + "enum": ["unit", "integration-with-fixtures", "e2e", "manual-only", "static"], + "description": "Tier of evidence. unit: function-level test with mocks at the boundary; cheap, fast, narrow. integration-with-fixtures: real I/O against real fixtures (real files, real subprocess, real network when tagged), no mocked surface for the asserted contract. e2e: full CLI invocation end-to-end, real artifacts, real exit codes. manual-only: only a manual procedure; no automated guardrail. static: lint / type-check / schema validation only. The CEO weights tier against the SURFACE FLOOR named in the test-coverage-expert tier-floor matrix: a `unit` passed evidence on a critical-promise surface (CLI / install / lockfile / auth / hooks / marketplace / cross-module) does NOT silence an opinion-finding from another panelist asking for `integration-with-fixtures` coverage. Required on every evidence block." + }, + "run_evidence": { + "type": "string", + "description": "Optional. For `outcome=passed` at `tier=integration-with-fixtures` or `e2e` on a critical-promise surface, the persona SHOULD have actually run the test (not just read it) and recorded the pytest invocation + pass/fail line + duration here. Verbatim, under 240 chars. Reading code is not running code (S7 DETERMINISTIC TOOL BRIDGE: facts-that-must-be-true do not survive as LLM assertions)." }, "assertion_excerpt": { "type": "string", diff --git a/.github/skills/apm-review-panel/evals/fixtures/01-ship-now-pr1084-shape.json b/.github/skills/apm-review-panel/evals/fixtures/01-ship-now-pr1084-shape.json index 093de5cd..499ff6ab 100644 --- a/.github/skills/apm-review-panel/evals/fixtures/01-ship-now-pr1084-shape.json +++ b/.github/skills/apm-review-panel/evals/fixtures/01-ship-now-pr1084-shape.json @@ -83,6 +83,7 @@ "test_file": "tests/unit/install/test_pipeline_auth_preflight.py", "test_name": "test_install_update_does_not_disable_credential_helpers_on_generic_host", "outcome": "passed", + "tier": "unit", "assertion_excerpt": "assert os.environ.get('GIT_TERMINAL_PROMPT') is None", "proves": "On non-GitHub non-ADO hosts, install --update does not block the user's system credential helpers.", "principles": ["multi-harness-support", "vendor-neutral", "devx"] diff --git a/.github/skills/apm-review-panel/evals/fixtures/02-needs-rework-shape.json b/.github/skills/apm-review-panel/evals/fixtures/02-needs-rework-shape.json index 9608ec81..67d13962 100644 --- a/.github/skills/apm-review-panel/evals/fixtures/02-needs-rework-shape.json +++ b/.github/skills/apm-review-panel/evals/fixtures/02-needs-rework-shape.json @@ -134,6 +134,7 @@ "test_file": "tests/unit/deps/test_resolver.py", "test_name": "test_resolver_rejects_path_traversal_in_dep_name", "outcome": "missing", + "tier": "unit", "assertion_excerpt": "with pytest.raises(ValueError): resolver.resolve(Dep(name='../../../etc/passwd', source='gh:...'))", "proves": "User-controlled dep.name cannot escape the install directory via traversal payloads.", "principles": [ @@ -153,6 +154,7 @@ "test_file": "tests/integration/test_install_pipeline.py", "test_name": "test_install_pipeline_resolver_to_downloader_e2e", "outcome": "missing", + "tier": "integration-with-fixtures", "assertion_excerpt": "result = install(manifest_path, scope=USER); assert (target_dir / 'expected.md').exists()", "proves": "The new Resolver -> Downloader boundary actually delivers files to disk for a real apm.yml.", "principles": [