Skip to content

Fail mouse-latency cells below valid-rep target#1361

Merged
psaab merged 1 commit into
masterfrom
fix-1360-mouse-valid-reps
May 16, 2026
Merged

Fail mouse-latency cells below valid-rep target#1361
psaab merged 1 commit into
masterfrom
fix-1360-mouse-valid-reps

Conversation

@psaab
Copy link
Copy Markdown
Owner

@psaab psaab commented May 16, 2026

Summary

Closes #1360.

Validation

  • python3 test/incus/mouse_latency_aggregate_test.py
  • python3 -m py_compile test/incus/mouse_latency_aggregate.py
  • git diff --check
  • Replayed the surplus 100E100M artifact /tmp/xpf-100e100m-surplus-persistent-i20-20260515-235752: loaded 9/15 now reports INSUFFICIENT-VALID-REPS; final verdict is INSUFFICIENT-DATA; reducer exits 2.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Tightens the mouse-latency cell quality gate so cells that fall below the intended 10 valid reps (e.g., the observed 9/15 cell) are explicitly labeled INSUFFICIENT-VALID-REPS instead of OK, ensuring the final verdict is fail-closed as INSUFFICIENT-DATA. Plan doc references to the prior < 7 threshold are realigned with the runner's 10-valid-rep target.

Changes:

  • Bump the per-cell valid-rep threshold from 7 to 10 and rename the sub-status to INSUFFICIENT-VALID-REPS.
  • Extend the aggregate test suite with new 9/15-style cell and gate-cell cases; update the existing excludes_invalid_from_median test to the 10-valid baseline.
  • Update §4.5/§4.7/§7.2 of the #905 plan to reflect the 10-valid-rep gate-grade requirement.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
test/incus/mouse_latency_aggregate.py New REQUIRED_VALID_REPS=10 and INSUFFICIENT-VALID-REPS status; gate status check propagates this to INSUFFICIENT-DATA.
test/incus/mouse_latency_aggregate_test.py Adds tests for the 9/15 cell and gate-cell propagation; adjusts the median-exclusion test to 10 valid reps.
docs/pr/905-mouse-latency/plan.md Plan text updated from < 7 to < 10 valid-rep wording in §4.5, §4.7, §7.2.

@psaab
Copy link
Copy Markdown
Owner Author

psaab commented May 16, 2026

Claude round-1 review on eb30b32b

Verdict: MERGE-READY

3-file test-tooling PR. Trivial change. Hostile-verified at source.

What changed

  • mouse_latency_aggregate.py:39 adds named constant REQUIRED_VALID_REPS = 10
  • :126-127 threshold raised from len(valid) < 7 to len(valid) < REQUIRED_VALID_REPS
  • New status INSUFFICIENT-VALID-REPS (distinct from INSUFFICIENT-DATA which remains the final-verdict failure mode)
  • Top-of-file docstring updated to document the 10-valid-rep gate
  • Tests cover both insufficient (9 reps) and sufficient (10+ reps) paths
  • docs/pr/905-mouse-latency/plan.md text updated to match the 10-valid-rep target

Hostile-search results

  • Threshold parameterized cleanly — named constant, not magic number. If future runs need a different threshold, change one line.
  • Status distinction preserved: INSUFFICIENT-VALID-REPS for per-cell status, INSUFFICIENT-DATA for the final-verdict fail-closed signal. decide() correctly propagates cell-status into verdict.
  • Boundary correct: len(valid) < 10 means exactly 10 reps → OK, 9 → INSUFFICIENT.
  • Replay reproducible: PR body cites /tmp/xpf-100e100m-surplus-persistent-i20-20260515-235752 artifact where 9/15 → INSUFFICIENT-VALID-REPS → exit 2. Plausible given the threshold raise.
  • Backward compat: any prior run with 7-9 valid reps that USED to PASS now FAILs. This is intentional — the Measurement: mouse-latency tail under elephant load (100E100M unmeasured half) #905 plan §7.2 always specified 10 as the gate-grade target; the prior < 7 was a latent under-gate that this PR is closing. No deprecation window needed because the prior threshold was an oversight, not a contract.

Recommendation

MERGE-READY. Fail-closed gate strengthening with proper test coverage.

Awaiting Codex (`task-mp8ehjal-ovqthr`) and Gemini Pro 3 (`task-mp8ehuba-92g0kt`).

@psaab
Copy link
Copy Markdown
Owner Author

psaab commented May 16, 2026

Round-1 quad-review consolidated synthesis on eb30b32b

Reviewer Verdict
Claude MERGE-READY
Codex MERGE-NEEDS-MINOR (2 doc/rationale findings)
Gemini Pro 3 MERGE-NEEDS-MINOR (2 findings: backward-compat break + doc mismatch)
Copilot 0 inline findings

Convergence on substance

All four reviewers verified the core mechanism:

  • REQUIRED_VALID_REPS = 10 named constant (not magic number)
  • Threshold raised cleanly from < 7< 10
  • New INSUFFICIENT-VALID-REPS cell status distinct from INSUFFICIENT-DATA final-verdict
  • Fail-closed: any non-OK gate cell → INSUFFICIENT-DATA final + exit 2
  • Replay artifact xpf-100e100m-surplus-persistent-i20-20260515-235752 reproduces 9/15 → INSUFFICIENT-VALID-REPS → exit 2
  • Boundary correct (< 10 means exactly 10 OK, 11+ also OK)
  • Tests cover insufficient + sufficient paths

Codex MINOR — doc/rationale

  1. docs/pr/905-mouse-latency/plan.md:526-528 and :570-571 — still say "cells with fewer than 10 valid reps are reported INSUFFICIENT-DATA", but the implementation distinguishes cell-status INSUFFICIENT-VALID-REPS from final-verdict INSUFFICIENT-DATA. Real LOW; doc wording.
  2. Statistical rationale for 10-vs-7 is thin — the plan documents 10 reps generally via median-of-10 / IQR notes but doesn't strongly justify why 9 must be discarded. Policy decision, not a code blocker.

Gemini MINOR — same findings + parameterization suggestion

  1. Backward-compat break for old artifacts — runs that previously passed with 7-9 reps will now fail. Gemini suggests CLI parameterization (--required-valid-reps) to allow historical replay under prior threshold.

Recommendation

MERGE. Code is correct, threshold is fail-closed properly, tests cover both paths. The 2 doc-mismatch findings (status naming) are real LOW doc follow-ups; the CLI-parameterization suggestion is a future enhancement, not a blocker.

Merging per user authorization.

Codex task: `task-mp8ehjal-ovqthr`. Gemini Pro 3 task: `task-mp8ehuba-92g0kt`.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Mouse-latency matrix: cells with fewer than 10 valid reps still show status OK

2 participants