Skip wall-clock timing tests on CI to stop false-positive flake by igerber · Pull Request #336 · igerber/diff-diff

igerber · 2026-04-19T18:26:56Z

Summary

Follow-up to PR #330. That PR marked three wall-clock timing tests @pytest.mark.slow, which the default addopts = \"-m 'not slow'\" already excludes — so the default Python CI matrix became quiet. But the Rust-backend CI jobs at .github/workflows/rust-test.yml:155, 162, 190 explicitly override with -m '' to pull the full slow suite in (intentional — TROP parity tests need it). That's why PR #334's Windows py3.11 Rust job tripped a 0.120s vs 0.1s threshold failure that was pure runner noise.

This PR layers a skipif(os.environ.get(\"CI\") == \"true\", ...) marker on top of the existing @pytest.mark.slow. GitHub Actions sets CI=true on every runner, so the skip catches both invocation patterns. Local development (pytest, pytest -m slow, pytest -m '') is unaffected.

test_se_accuracy.py::TestCallawaySantAnnaSEAccuracy::test_timing_performance
test_se_accuracy.py::TestPerformanceRegression (class-level)
test_methodology_honest_did.py::TestOptimalFLCI::test_m0_short_circuit

The test_m0_short_circuit case is the one PR #330 left untouched because it uses wall-clock as a correctness proxy for the short-circuit path (fast <0.5s, slow would blow past). The existing TODO.md entry already tracks replacing it with a mock/spy; skipif(CI) is the interim guard until that refactor lands.

Methodology references (required if estimator / math changes)

Method name(s): N/A - no methodology changes
Paper / source link(s): N/A
Any intentional deviations from the source (and why): None

Validation

Tests added/updated: No new tests; existing tests gain a second skip marker.
Verified locally: `CI=true pytest … -m ''` reports 5 skipped (all three targets); `env -u CI pytest …` runs and passes the two cheap cases.
Backtest / simulation / notebook evidence (if applicable): N/A

Security / privacy

Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

PR #330 marked `test_timing_performance` and `TestPerformanceRegression` with `@pytest.mark.slow`, which the default pytest `addopts = "-m 'not slow'"` already excludes. That catches the default Python CI matrix but misses the Rust-backend CI jobs at `.github/workflows/rust-test.yml:155, 162, 190`, which explicitly override the marker filter with `-m ''` so they can exercise the full slow suite (intentional — TROP parity tests live there). That's why our PR #334 tripped a 0.120s vs 0.1s threshold on Windows py3.11 under the Rust backend. Add a `skipif(os.environ.get("CI") == "true", ...)` marker in addition to `@pytest.mark.slow` on the affected tests: - `test_se_accuracy.py::TestCallawaySantAnnaSEAccuracy::test_timing_performance` - `test_se_accuracy.py::TestPerformanceRegression` (class-level) - `test_methodology_honest_did.py::TestOptimalFLCI::test_m0_short_circuit` GitHub Actions sets `CI=true` on every runner, so the skip covers both the default-CI and Rust-CI invocation patterns. Local development flows (`pytest`, `pytest -m slow`, `pytest -m ''`) are unaffected — no `CI` env var means the tests still run as on-demand performance sanity. The `test_m0_short_circuit` case is special: it uses wall-clock time as a proxy for "short-circuit path taken" (fast path <0.5s, slow optimization would blow past that). The existing PR #330 TODO.md entry already tracks replacing it with a mock/spy; the `skipif` here is the interim guard until that refactor lands. Verified locally: `CI=true pytest ... -m ''` reports 5 skipped (all three targets); unset `CI` and the tests run and pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-04-19T18:30:16Z

Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

Executive Summary

No methodology-impacting code changed. The diff only adds CI-only skip markers and explanatory comments in test files covering CallawaySantAnna and HonestDiD; it does not modify estimator code, assumptions, weighting, variance/SE, or defaults. Cross-check against docs/methodology/REGISTRY.md:L294-L456 and docs/methodology/REGISTRY.md:L2406-L2455 found no methodology defect.
Pattern-consistency check passed: the repo’s actual wall-clock threshold assertions are the ones updated here, so this is not a partial fix.
P3: the PR intentionally removes PR-time CI enforcement of CallawaySantAnna wall-clock budgets by skipping those tests whenever CI=="true" in tests/test_se_accuracy.py:L266-L286 and tests/test_se_accuracy.py:L422-L457.
P3 (tracked): HonestDiD’s M=0 short-circuit check remains a timing-proxy test, now skipped on CI in tests/test_methodology_honest_did.py:L248-L275, but that limitation is already tracked in TODO.md:L110-L110 and non-timing M=0 correctness coverage still exists in tests/test_methodology_honest_did.py:L85-L100.

Methodology

No findings. Cross-checked the affected method families against docs/methodology/REGISTRY.md:L294-L456 and docs/methodology/REGISTRY.md:L2406-L2455; this PR is test-only and does not introduce any undocumented deviation in estimator math, identification, weighting, or inference.

Code Quality

No findings.

Performance

Severity: P3. Impact: default pytest already excludes slow in pyproject.toml:L100-L107, but Rust CI explicitly re-includes the full suite with -m '' in .github/workflows/rust-test.yml:L142-L190. The new skipif(CI=="true") markers in tests/test_se_accuracy.py:L266-L286 and tests/test_se_accuracy.py:L422-L457 stop flaky failures, but they also mean CI no longer detects performance regressions for these wall-clock budgets. Concrete fix: keep the CI skip, but move these checks to a dedicated perf job or replace wall-clock assertions with a deterministic signal.

Maintainability

No findings.

Tech Debt

Severity: P3. Impact: tests/test_methodology_honest_did.py:L248-L275 still uses elapsed wall-clock time as a proxy for the M=0 short-circuit path and now skips that proxy on CI. This reduces automated runtime-path coverage, but it does not create a silent correctness bug, and the limitation is already tracked in TODO.md:L110-L110. Concrete fix: replace the timing proxy with a mock/spy on the optimizer call or an explicit short-circuit indicator, then remove the CI skip.

Security

No findings.

Documentation/Tests

No additional findings beyond the P3 items above. The new comments accurately match the current test-selection behavior documented in pyproject.toml:L100-L107 and the Rust CI workflow in .github/workflows/rust-test.yml:L142-L190.

…-failures audit Packages 161 commits across 18 PRs since v3.1.3 as minor release 3.2.0. Per project SemVer convention, minor bumps are reserved for new estimators or new module-level public API — BusinessReport / DiagnosticReport / DiagnosticReportResults (PR #318) add a new public API surface and drive this bump. Headline work: - PR #318 BusinessReport + DiagnosticReport (experimental preview) - practitioner- ready output layer. Plain-English narrative summaries across all 16 result types, with AI-legible to_dict() schemas. See docs/methodology/REPORTING.md. - PR #327, #335 did-no-untreated foundation - kernel infrastructure, local linear regression, HC2/Bell-McCaffrey variance, nprobust port. Foundation for the upcoming HeterogeneousAdoptionDiD estimator. - PR #323, #329, #332 dCDH survey completion - cell-period IF allocator (Class A contract), heterogeneity + within-group-varying PSU under Binder TSL, and PSU-level Hall-Mammen wild bootstrap at cell granularity. - PR #333 performance review - docs/performance-scenarios.md documents 5-7 realistic practitioner workflows; benchmark harness extended. Silent-failures audit closeouts (PRs #324, #326, #328, #331, #334, #337, #339) continue the reliability work started in v3.1.2-3.1.3 across axes A/C/E/G/J. CI infrastructure: PRs #330 and #336 exclude wall-clock timing tests from default CI after false-positive flakes; perf-review harness is the principled replacement. Version strings bumped in diff_diff/__init__.py, pyproject.toml, rust/Cargo.toml, diff_diff/guides/llms-full.txt, and CITATION.cff (version: 3.2.0, date-released: 2026-04-19). CHANGELOG populated with Added / Changed / Fixed sections and the comparison-link footer. CITATION.cff retains v3.1.3 versioned DOI in identifiers; the v3.2.0 versioned DOI will be minted by Zenodo on GitHub Release and added in a follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber added the ready-for-ci Triggers CI test workflows label Apr 19, 2026

igerber merged commit 97a185d into main Apr 19, 2026
22 of 23 checks passed

igerber deleted the fix/ci-skip-walltime-tests branch April 19, 2026 19:39

igerber mentioned this pull request Apr 20, 2026

Release 3.2.0: BusinessReport preview, dCDH survey completion, silent-failures audit #342

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Skip wall-clock timing tests on CI to stop false-positive flake#336

Skip wall-clock timing tests on CI to stop false-positive flake#336
igerber merged 1 commit intomainfrom
fix/ci-skip-walltime-tests

igerber commented Apr 19, 2026

Uh oh!

github-actions Bot commented Apr 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

igerber commented Apr 19, 2026

Summary

Methodology references (required if estimator / math changes)

Validation

Security / privacy

Uh oh!

github-actions Bot commented Apr 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant