
Exclude flaky wall-clock timing tests from default CI#330

Merged
igerber merged 1 commit into main from fix/flaky-timing-tests
Apr 19, 2026

Conversation

@igerber (Owner) commented Apr 19, 2026

Summary

  • CI was hitting false-positive failures on TestCallawaySantAnnaSEAccuracy::test_timing_performance, most recently on PR #328 (Extend PR #312 Y-normalization contract into SDID diagnostic methods): "Estimation took 0.103s, expected <0.1s", a 3% overshoot that reflects runner noise. The 20× margin the test comment already mentions is not enough to absorb CI wall-clock variance (BLAS path variation, neighbor-VM contention, cold caches).
  • Mark the two CS timing tests @pytest.mark.slow so they're excluded from default CI (addopts already uses -m 'not slow'). They remain runnable on demand via pytest -m slow — same pattern TROP uses per CLAUDE.md.
  • Tests marked:
    • tests/test_se_accuracy.py::TestCallawaySantAnnaSEAccuracy::test_timing_performance (method-level marker — the rest of the SE-correctness class keeps running in default CI)
    • tests/test_se_accuracy.py::TestPerformanceRegression (class-level marker — all three parametrized timing cases move out of default CI)
  • Intentionally left unchanged: tests/test_methodology_honest_did.py::test_m0_short_circuit uses wall-clock as a proxy for "short-circuit path taken" — that's a correctness signal, not a performance signal. Marking it slow would remove the check entirely. Added a TODO.md entry to replace the timing proxy with a mock/spy in a follow-up.
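The marker pattern described above can be sketched as follows; the test bodies are placeholders, and only the decorator wiring mirrors the PR:

```python
import pytest

# With addopts = "-m 'not slow'" in pyproject.toml, this test is deselected
# in default CI but still runs under `pytest -m slow`.
@pytest.mark.slow
def test_timing_performance():
    ...  # wall-clock assertion lives here in the real suite

# Class-level marker: every test (and every parametrization) in the class
# inherits it, so all timing cases move out of default CI together.
@pytest.mark.slow
class TestPerformanceRegression:
    @pytest.mark.parametrize("n", [1_000, 10_000, 100_000])
    def test_scaling(self, n):
        ...
```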

Methodology references

  • N/A — test infrastructure only; no estimator, math, or inference changes.

Validation

  • pytest tests/test_se_accuracy.py -m "not slow" --collect-only verifies 4 timing tests now deselected (1 + 3 parametrizations), 8 SE-correctness tests still collected in default CI.
  • No test logic changed — only marker additions and docstring updates.

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

@github-actions

Overall Assessment

⚠️ Needs changes

Executive Summary

  • simulate_power() is being reverted from a narrow failure whitelist to except Exception while the surfaced failure counter is removed. That reintroduces silent fallback behavior for estimator/result-extractor bugs and can change returned power results without an explicit signal.
  • The current registry explicitly documents the narrower contract and the exposed n_simulation_failures surface; this patch deletes both the implementation and the tests that enforce it.
  • The dCDH survey rewrite is substantial: it removes the cell-period IF allocator and narrows survey support to within-group-constant PSU/strata. The patch also rewrites REGISTRY.md Notes and tests to match, so I’m treating that as documented/informational rather than a blocker.
  • The timing-test changes themselves look reasonable, and the new HonestDiD timing TODO is properly tracked.

Methodology

  • P1 Impact: The patch changes simulate_power() to blanket-catch Exception in diff_diff/power.py:L2245-L2258, reversing the current registry contract in docs/methodology/REGISTRY.md:L2351-L2357. Today, only estimation/data-path failures are skipped; programming errors (TypeError, AttributeError, NameError, IndexError, etc.) are supposed to propagate, and partial failures are surfaced on SimulationPowerResults.n_simulation_failures. With the proposed change, those bugs can be absorbed as skipped simulations, altering power/coverage summaries while emitting no warning whenever failures stay at or below 10%. Concrete fix: restore the narrow exception whitelist and keep explicit failed-replicate metadata on the result surface.
  • P3 Impact: The dCDH survey path is intentionally narrowed from the current cell-period allocator support in diff_diff/chaisemartin_dhaultfoeuille.py:L719-L864 and diff_diff/chaisemartin_dhaultfoeuille.py:L5629-L5765 to a group-constant PSU/strata contract, with matching registry-note updates around docs/methodology/REGISTRY.md:L592-L653. Under the review rubric this is informational, not a defect, because the deviation is documented. Concrete fix: no blocker if intentional; otherwise split it into its own methodology PR/release-note item.
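The narrow-whitelist contract the review asks to restore can be sketched in isolation; the exception tuple and helper name here are illustrative stand-ins, not the package's real API:

```python
# Illustrative: only known estimation/data-path failures are skipped and
# counted; programming errors (TypeError, NameError, ...) propagate.
ESTIMATION_FAILURES = (ValueError, ArithmeticError, RuntimeError)

def run_replicates(replicates, estimate):
    results, n_failures = [], 0
    for rep in replicates:
        try:
            results.append(estimate(rep))
        except ESTIMATION_FAILURES:
            n_failures += 1  # surfaced on the result object, never silent
    return results, n_failures
```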

Code Quality
No additional findings beyond the exception-handling regression above.

Performance
No findings.

Maintainability

  • P2 Impact: The patch removes the public SimulationPowerResults.n_simulation_failures field and its summary()/to_dict() serialization surface in diff_diff/power.py:L997-L1004 and diff_diff/power.py:L1065-L1148. That is a backward-incompatible API contraction for callers that inspect partial-failure behavior, and it makes n_simulations ambiguous once failed replicates are skipped. Concrete fix: retain the field, or introduce a backward-compatible replacement such as n_requested plus n_failed with a deprecation window.
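One backward-compatible shape for the suggested n_requested/n_failed split, sketched with a deprecated alias; the field names beyond n_simulation_failures are hypothetical:

```python
import warnings
from dataclasses import dataclass

@dataclass
class SimulationPowerResults:  # illustrative subset of the real result class
    n_requested: int  # hypothetical replacement fields
    n_failed: int

    @property
    def n_simulation_failures(self) -> int:
        """Deprecated alias kept for a release so existing callers keep working."""
        warnings.warn(
            "n_simulation_failures is deprecated; use n_failed",
            DeprecationWarning, stacklevel=2,
        )
        return self.n_failed
```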

Tech Debt

  • P3 Impact: The new HonestDiD timing-proxy follow-up is properly tracked in TODO.md:L101. Concrete fix: none.

Security
No findings.

Documentation/Tests

  • P2 Impact: The patch deletes the exact regression tests that enforced the safer power-simulation contract: failure-counter exposure, serialization, and propagation of programming errors in tests/test_power.py:L559-L679. Removing those tests while weakening the implementation leaves the regression unguarded. Concrete fix: restore these tests or replace them with equivalent coverage for the intended contract.

Path to Approval

  1. Restore the narrow exception whitelist in simulate_power() so programming errors still propagate.
  2. Keep explicit failed-simulation metadata on SimulationPowerResults or add a backward-compatible replacement/deprecation path.
  3. Reinstate regression tests for TypeError propagation, failure-count serialization, and partial-failure warning behavior.

CI was hitting false-positive failures on tests that assert wall-clock
elapsed time, e.g. TestCallawaySantAnnaSEAccuracy.test_timing_performance
failing on PR #328 with "Estimation took 0.103s, expected <0.1s" — a 3%
overshoot that reflects runner noise (BLAS path variation, neighbor VM
contention, cold caches), not a real regression. The existing "20x margin"
comment acknowledged the problem but no fixed threshold can absorb CI
wall-clock variance.

Mark the two CS timing tests @pytest.mark.slow so they're excluded from
default CI (existing addopts already uses `-m 'not slow'`). Tests remain
runnable on-demand via `pytest -m slow` for local benchmarking — same
pattern the TROP suite uses per CLAUDE.md.

Tests marked:
- test_se_accuracy.py::TestCallawaySantAnnaSEAccuracy::test_timing_performance
  (method-level marker — rest of the SE-correctness class still runs)
- test_se_accuracy.py::TestPerformanceRegression (class-level marker — all
  three parametrized timing cases move out of default CI together)

Left unchanged: test_methodology_honest_did.py::test_m0_short_circuit uses
wall-clock as a proxy for "short-circuit path taken" — that's a correctness
signal, not a performance signal. Marking it slow would remove the check.
Added a TODO.md entry to replace it with a mock/spy in a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber igerber force-pushed the fix/flaky-timing-tests branch from b436a3f to 888b5d1 on April 19, 2026 12:04
@igerber (Owner, Author) commented Apr 19, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 888b5d1f9f275932411e0ecfae443c3289834f08


Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings in the current diff.

Executive Summary

Methodology

  • No findings.

Code Quality

  • No findings.

Performance

  • Severity: P3 (informational). Impact: In the default pytest lane, these four wall-clock assertions will no longer run, so automatic performance-regression detection for this file now depends on the project’s slow-test execution policy. This is an intentional flake-reduction tradeoff and is consistent with pyproject.toml:L100-L106. Concrete fix: Optional only — add a dedicated scheduled or manual pytest -m slow lane if you want automated perf coverage without reintroducing noise to default CI.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3 (informational, mitigated). Impact: The pre-existing HonestDiD test_m0_short_circuit still uses elapsed time as a correctness proxy in tests/test_methodology_honest_did.py:L246-L258, but the PR now tracks that limitation in TODO.md:L101-L101. Under the stated rubric, that makes it non-blocking. Concrete fix: None required in this PR; follow the TODO by replacing the timing assertion with a mock/spy or explicit short-circuit signal.
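The spy-based replacement that the TODO tracks could look like this sketch; run_flci and its solver argument are hypothetical stand-ins, not the package's real HonestDiD entry point:

```python
from unittest.mock import MagicMock

def run_flci(m, solver):
    # hypothetical short-circuit: M = 0 has a closed form, so the
    # expensive optimizer should never be entered
    if m == 0:
        return "closed-form"
    return solver(m)

def test_m0_short_circuit_via_spy():
    solver_spy = MagicMock(return_value="optimized")
    assert run_flci(0, solver_spy) == "closed-form"
    solver_spy.assert_not_called()  # structural proof, no wall clock needed
```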

Security

  • No findings.

Documentation/Tests

  • No findings.

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 19, 2026
@igerber igerber merged commit 328dc33 into main Apr 19, 2026
21 of 22 checks passed
@igerber igerber deleted the fix/flaky-timing-tests branch April 19, 2026 13:29
igerber added a commit that referenced this pull request Apr 19, 2026
PR #330 marked `test_timing_performance` and `TestPerformanceRegression`
with `@pytest.mark.slow`, which the default pytest `addopts = "-m 'not
slow'"` already excludes. That catches the default Python CI matrix but
misses the Rust-backend CI jobs at `.github/workflows/rust-test.yml:155,
162, 190`, which explicitly override the marker filter with `-m ''` so
they can exercise the full slow suite (intentional — TROP parity tests
live there). That's why our PR #334 tripped a 0.120s vs 0.1s threshold
on Windows py3.11 under the Rust backend.

Add a `skipif(os.environ.get("CI") == "true", ...)` marker in addition
to `@pytest.mark.slow` on the affected tests:
- `test_se_accuracy.py::TestCallawaySantAnnaSEAccuracy::test_timing_performance`
- `test_se_accuracy.py::TestPerformanceRegression` (class-level)
- `test_methodology_honest_did.py::TestOptimalFLCI::test_m0_short_circuit`

GitHub Actions sets `CI=true` on every runner, so the skip covers both
the default-CI and Rust-CI invocation patterns. Local development flows
(`pytest`, `pytest -m slow`, `pytest -m ''`) are unaffected — no `CI`
env var means the tests still run as on-demand performance sanity.

The `test_m0_short_circuit` case is special: it uses wall-clock time as
a proxy for "short-circuit path taken" (fast path <0.5s, slow
optimization would blow past that). The existing PR #330 TODO.md entry
already tracks replacing it with a mock/spy; the `skipif` here is the
interim guard until that refactor lands.

Verified locally: `CI=true pytest ... -m ''` reports 5 skipped (all
three targets); unset `CI` and the tests run and pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
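The combined guard described in the commit above can be sketched as follows; the reason string and test body are illustrative:

```python
import os
import pytest

# Skip on any GitHub Actions runner (CI=true is set on every job), while the
# slow marker still keeps the test out of the default `-m 'not slow'` lane.
@pytest.mark.skipif(
    os.environ.get("CI") == "true",
    reason="wall-clock thresholds are unreliable on shared CI runners",
)
@pytest.mark.slow
def test_timing_performance():
    ...  # placeholder for the real wall-clock assertion
```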
igerber added a commit that referenced this pull request Apr 20, 2026
…-failures audit

Packages 161 commits across 18 PRs since v3.1.3 as minor release 3.2.0. Per
project SemVer convention, minor bumps are reserved for new estimators or new
module-level public API — BusinessReport / DiagnosticReport / DiagnosticReportResults
(PR #318) add a new public API surface and drive this bump.

Headline work:
- PR #318 BusinessReport + DiagnosticReport (experimental preview) - practitioner-
  ready output layer. Plain-English narrative summaries across all 16 result types,
  with AI-legible to_dict() schemas. See docs/methodology/REPORTING.md.
- PR #327, #335 did-no-untreated foundation - kernel infrastructure, local linear
  regression, HC2/Bell-McCaffrey variance, nprobust port. Foundation for the
  upcoming HeterogeneousAdoptionDiD estimator.
- PR #323, #329, #332 dCDH survey completion - cell-period IF allocator (Class A
  contract), heterogeneity + within-group-varying PSU under Binder TSL, and
  PSU-level Hall-Mammen wild bootstrap at cell granularity.
- PR #333 performance review - docs/performance-scenarios.md documents 5-7
  realistic practitioner workflows; benchmark harness extended.

Silent-failures audit closeouts (PRs #324, #326, #328, #331, #334, #337, #339)
continue the reliability work started in v3.1.2-3.1.3 across axes A/C/E/G/J.

CI infrastructure: PRs #330 and #336 exclude wall-clock timing tests from default
CI after false-positive flakes; perf-review harness is the principled replacement.

Version strings bumped in diff_diff/__init__.py, pyproject.toml, rust/Cargo.toml,
diff_diff/guides/llms-full.txt, and CITATION.cff (version: 3.2.0, date-released:
2026-04-19). CHANGELOG populated with Added / Changed / Fixed sections and the
comparison-link footer. CITATION.cff retains v3.1.3 versioned DOI in identifiers;
the v3.2.0 versioned DOI will be minted by Zenodo on GitHub Release and added in
a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
