Skip to content

Skip wall-clock timing tests on CI to stop false-positive flake#336

Merged
igerber merged 1 commit intomainfrom
fix/ci-skip-walltime-tests
Apr 19, 2026
Merged

Skip wall-clock timing tests on CI to stop false-positive flake#336
igerber merged 1 commit intomainfrom
fix/ci-skip-walltime-tests

Conversation

@igerber
Copy link
Copy Markdown
Owner

@igerber igerber commented Apr 19, 2026

Summary

Follow-up to PR #330. That PR marked three wall-clock timing tests @pytest.mark.slow, which the default addopts = \"-m 'not slow'\" already excludes — so the default Python CI matrix became quiet. But the Rust-backend CI jobs at .github/workflows/rust-test.yml:155, 162, 190 explicitly override with -m '' to pull the full slow suite in (intentional — TROP parity tests need it). That's why PR #334's Windows py3.11 Rust job tripped a 0.120s vs 0.1s threshold failure that was pure runner noise.

This PR layers a skipif(os.environ.get(\"CI\") == \"true\", ...) marker on top of the existing @pytest.mark.slow. GitHub Actions sets CI=true on every runner, so the skip catches both invocation patterns. Local development (pytest, pytest -m slow, pytest -m '') is unaffected.

  • test_se_accuracy.py::TestCallawaySantAnnaSEAccuracy::test_timing_performance
  • test_se_accuracy.py::TestPerformanceRegression (class-level)
  • test_methodology_honest_did.py::TestOptimalFLCI::test_m0_short_circuit

The test_m0_short_circuit case is the one PR #330 left untouched because it uses wall-clock as a correctness proxy for the short-circuit path (fast <0.5s, slow would blow past). The existing TODO.md entry already tracks replacing it with a mock/spy; skipif(CI) is the interim guard until that refactor lands.

Methodology references (required if estimator / math changes)

  • Method name(s): N/A - no methodology changes
  • Paper / source link(s): N/A
  • Any intentional deviations from the source (and why): None

Validation

  • Tests added/updated: No new tests; existing tests gain a second skip marker.
  • Verified locally: `CI=true pytest … -m ''` reports 5 skipped (all three targets); `env -u CI pytest …` runs and passes the two cheap cases.
  • Backtest / simulation / notebook evidence (if applicable): N/A

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

PR #330 marked `test_timing_performance` and `TestPerformanceRegression`
with `@pytest.mark.slow`, which the default pytest `addopts = "-m 'not
slow'"` already excludes. That catches the default Python CI matrix but
misses the Rust-backend CI jobs at `.github/workflows/rust-test.yml:155,
162, 190`, which explicitly override the marker filter with `-m ''` so
they can exercise the full slow suite (intentional — TROP parity tests
live there). That's why our PR #334 tripped a 0.120s vs 0.1s threshold
on Windows py3.11 under the Rust backend.

Add a `skipif(os.environ.get("CI") == "true", ...)` marker in addition
to `@pytest.mark.slow` on the affected tests:
- `test_se_accuracy.py::TestCallawaySantAnnaSEAccuracy::test_timing_performance`
- `test_se_accuracy.py::TestPerformanceRegression` (class-level)
- `test_methodology_honest_did.py::TestOptimalFLCI::test_m0_short_circuit`

GitHub Actions sets `CI=true` on every runner, so the skip covers both
the default-CI and Rust-CI invocation patterns. Local development flows
(`pytest`, `pytest -m slow`, `pytest -m ''`) are unaffected — no `CI`
env var means the tests still run as on-demand performance sanity.

The `test_m0_short_circuit` case is special: it uses wall-clock time as
a proxy for "short-circuit path taken" (fast path <0.5s, slow
optimization would blow past that). The existing PR #330 TODO.md entry
already tracks replacing it with a mock/spy; the `skipif` here is the
interim guard until that refactor lands.

Verified locally: `CI=true pytest ... -m ''` reports 5 skipped (all
three targets); unset `CI` and the tests run and pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

Executive Summary

Methodology

Code Quality

  • No findings.

Performance

Maintainability

  • No findings.

Tech Debt

  • Severity: P3. Impact: tests/test_methodology_honest_did.py:L248-L275 still uses elapsed wall-clock time as a proxy for the M=0 short-circuit path and now skips that proxy on CI. This reduces automated runtime-path coverage, but it does not create a silent correctness bug, and the limitation is already tracked in TODO.md:L110-L110. Concrete fix: replace the timing proxy with a mock/spy on the optimizer call or an explicit short-circuit indicator, then remove the CI skip.

Security

  • No findings.

Documentation/Tests

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 19, 2026
@igerber igerber merged commit 97a185d into main Apr 19, 2026
22 of 23 checks passed
@igerber igerber deleted the fix/ci-skip-walltime-tests branch April 19, 2026 19:39
igerber added a commit that referenced this pull request Apr 20, 2026
…-failures audit

Packages 161 commits across 18 PRs since v3.1.3 as minor release 3.2.0. Per
project SemVer convention, minor bumps are reserved for new estimators or new
module-level public API — BusinessReport / DiagnosticReport / DiagnosticReportResults
(PR #318) add a new public API surface and drive this bump.

Headline work:
- PR #318 BusinessReport + DiagnosticReport (experimental preview) - practitioner-
  ready output layer. Plain-English narrative summaries across all 16 result types,
  with AI-legible to_dict() schemas. See docs/methodology/REPORTING.md.
- PR #327, #335 did-no-untreated foundation - kernel infrastructure, local linear
  regression, HC2/Bell-McCaffrey variance, nprobust port. Foundation for the
  upcoming HeterogeneousAdoptionDiD estimator.
- PR #323, #329, #332 dCDH survey completion - cell-period IF allocator (Class A
  contract), heterogeneity + within-group-varying PSU under Binder TSL, and
  PSU-level Hall-Mammen wild bootstrap at cell granularity.
- PR #333 performance review - docs/performance-scenarios.md documents 5-7
  realistic practitioner workflows; benchmark harness extended.

Silent-failures audit closeouts (PRs #324, #326, #328, #331, #334, #337, #339)
continue the reliability work started in v3.1.2-3.1.3 across axes A/C/E/G/J.

CI infrastructure: PRs #330 and #336 exclude wall-clock timing tests from default
CI after false-positive flakes; perf-review harness is the principled replacement.

Version strings bumped in diff_diff/__init__.py, pyproject.toml, rust/Cargo.toml,
diff_diff/guides/llms-full.txt, and CITATION.cff (version: 3.2.0, date-released:
2026-04-19). CHANGELOG populated with Added / Changed / Fixed sections and the
comparison-link footer. CITATION.cff retains v3.1.3 versioned DOI in identifiers;
the v3.2.0 versioned DOI will be minted by Zenodo on GitHub Release and added in
a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-ci Triggers CI test workflows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant