
Exclude flaky wall-clock timing tests from default CI#330

Merged
igerber merged 1 commit into main from fix/flaky-timing-tests
Apr 19, 2026

Conversation

@igerber (Owner) commented Apr 19, 2026

Summary

  • CI was hitting false-positive failures on TestCallawaySantAnnaSEAccuracy::test_timing_performance, most recently on PR #328 (Extend PR #312 Y-normalization contract into SDID diagnostic methods): "Estimation took 0.103s, expected <0.1s", a 3% overshoot that reflects runner noise. The 20× margin the test comment already mentions is not enough to absorb CI wall-clock variance (BLAS path variation, neighbor-VM contention, cold caches).
  • Mark the two CS timing tests @pytest.mark.slow so they're excluded from default CI (addopts already uses -m 'not slow'). They remain runnable on demand via pytest -m slow — same pattern TROP uses per CLAUDE.md.
  • Tests marked:
    • tests/test_se_accuracy.py::TestCallawaySantAnnaSEAccuracy::test_timing_performance (method-level marker — the rest of the SE-correctness class keeps running in default CI)
    • tests/test_se_accuracy.py::TestPerformanceRegression (class-level marker — all three parametrized timing cases move out of default CI)
  • Intentionally left unchanged: tests/test_methodology_honest_did.py::test_m0_short_circuit uses wall-clock as a proxy for "short-circuit path taken" — that's a correctness signal, not a performance signal. Marking it slow would remove the check entirely. Added a TODO.md entry to replace the timing proxy with a mock/spy in a follow-up.
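The marker pattern described above can be sketched as follows; the test bodies are placeholders, and only the decorator wiring mirrors the PR:

```python
import pytest

# With addopts = "-m 'not slow'" in pyproject.toml, this test is deselected
# in default CI but still runs under `pytest -m slow`.
@pytest.mark.slow
def test_timing_performance():
    ...  # wall-clock assertion lives here in the real suite

# Class-level marker: every test (and every parametrization) in the class
# inherits it, so all timing cases move out of default CI together.
@pytest.mark.slow
class TestPerformanceRegression:
    @pytest.mark.parametrize("n", [1_000, 10_000, 100_000])
    def test_scaling(self, n):
        ...
```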

Methodology references

  • N/A — test infrastructure only; no estimator, math, or inference changes.

Validation

  • pytest tests/test_se_accuracy.py -m "not slow" --collect-only verifies 4 timing tests now deselected (1 + 3 parametrizations), 8 SE-correctness tests still collected in default CI.
  • No test logic changed — only marker additions and docstring updates.

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

@github-actions

Overall Assessment

⚠️ Needs changes

Executive Summary

  • simulate_power() is being reverted from a narrow failure whitelist to except Exception while the surfaced failure counter is removed. That reintroduces silent fallback behavior for estimator/result-extractor bugs and can change returned power results without an explicit signal.
  • The current registry explicitly documents the narrower contract and the exposed n_simulation_failures surface; this patch deletes both the implementation and the tests that enforce it.
  • The dCDH survey rewrite is substantial: it removes the cell-period IF allocator and narrows survey support to within-group-constant PSU/strata. The patch also rewrites REGISTRY.md Notes and tests to match, so I’m treating that as documented/informational rather than a blocker.
  • The timing-test changes themselves look reasonable, and the new HonestDiD timing TODO is properly tracked.

Methodology

  • P1 Impact: The patch changes simulate_power() to blanket-catch Exception in diff_diff/power.py:L2245-L2258, reversing the current registry contract in docs/methodology/REGISTRY.md:L2351-L2357. Today, only estimation/data-path failures are skipped; programming errors (TypeError, AttributeError, NameError, IndexError, etc.) are supposed to propagate, and partial failures are surfaced on SimulationPowerResults.n_simulation_failures. With the proposed change, those bugs can be absorbed as skipped simulations, altering power/coverage summaries while emitting no warning whenever failures stay at or below 10%. Concrete fix: restore the narrow exception whitelist and keep explicit failed-replicate metadata on the result surface.
  • P3 Impact: The dCDH survey path is intentionally narrowed from the current cell-period allocator support in diff_diff/chaisemartin_dhaultfoeuille.py:L719-L864 and diff_diff/chaisemartin_dhaultfoeuille.py:L5629-L5765 to a group-constant PSU/strata contract, with matching registry-note updates around docs/methodology/REGISTRY.md:L592-L653. Under the review rubric this is informational, not a defect, because the deviation is documented. Concrete fix: no blocker if intentional; otherwise split it into its own methodology PR/release-note item.
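The narrow-whitelist contract the review asks to restore can be sketched in isolation; the exception tuple and helper name here are illustrative stand-ins, not the package's real API:

```python
# Illustrative: only known estimation/data-path failures are skipped and
# counted; programming errors (TypeError, NameError, ...) propagate.
ESTIMATION_FAILURES = (ValueError, ArithmeticError, RuntimeError)

def run_replicates(replicates, estimate):
    results, n_failures = [], 0
    for rep in replicates:
        try:
            results.append(estimate(rep))
        except ESTIMATION_FAILURES:
            n_failures += 1  # surfaced on the result object, never silent
    return results, n_failures
```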

Code Quality
No additional findings beyond the exception-handling regression above.

Performance
No findings.

Maintainability

  • P2 Impact: The patch removes the public SimulationPowerResults.n_simulation_failures field and its summary()/to_dict() serialization surface in diff_diff/power.py:L997-L1004 and diff_diff/power.py:L1065-L1148. That is a backward-incompatible API contraction for callers that inspect partial-failure behavior, and it makes n_simulations ambiguous once failed replicates are skipped. Concrete fix: retain the field, or introduce a backward-compatible replacement such as n_requested plus n_failed with a deprecation window.
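One backward-compatible shape for the suggested n_requested/n_failed split, sketched with a deprecated alias; the field names beyond n_simulation_failures are hypothetical:

```python
import warnings
from dataclasses import dataclass

@dataclass
class SimulationPowerResults:  # illustrative subset of the real result class
    n_requested: int  # hypothetical replacement fields
    n_failed: int

    @property
    def n_simulation_failures(self) -> int:
        """Deprecated alias kept for a release so existing callers keep working."""
        warnings.warn(
            "n_simulation_failures is deprecated; use n_failed",
            DeprecationWarning, stacklevel=2,
        )
        return self.n_failed
```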

Tech Debt

  • P3 Impact: The new HonestDiD timing-proxy follow-up is properly tracked in TODO.md:L101. Concrete fix: none.

Security
No findings.

Documentation/Tests

  • P2 Impact: The patch deletes the exact regression tests that enforced the safer power-simulation contract: failure-counter exposure, serialization, and propagation of programming errors in tests/test_power.py:L559-L679. Removing those tests while weakening the implementation leaves the regression unguarded. Concrete fix: restore these tests or replace them with equivalent coverage for the intended contract.

Path to Approval

  1. Restore the narrow exception whitelist in simulate_power() so programming errors still propagate.
  2. Keep explicit failed-simulation metadata on SimulationPowerResults or add a backward-compatible replacement/deprecation path.
  3. Reinstate regression tests for TypeError propagation, failure-count serialization, and partial-failure warning behavior.

CI was hitting false-positive failures on tests that assert wall-clock
elapsed time, e.g. TestCallawaySantAnnaSEAccuracy.test_timing_performance
failing on PR #328 with "Estimation took 0.103s, expected <0.1s" — a 3%
overshoot that reflects runner noise (BLAS path variation, neighbor VM
contention, cold caches), not a real regression. The existing "20x margin"
comment acknowledged the problem but no fixed threshold can absorb CI
wall-clock variance.

Mark the two CS timing tests @pytest.mark.slow so they're excluded from
default CI (existing addopts already uses `-m 'not slow'`). Tests remain
runnable on-demand via `pytest -m slow` for local benchmarking — same
pattern the TROP suite uses per CLAUDE.md.

Tests marked:
- test_se_accuracy.py::TestCallawaySantAnnaSEAccuracy::test_timing_performance
  (method-level marker — rest of the SE-correctness class still runs)
- test_se_accuracy.py::TestPerformanceRegression (class-level marker — all
  three parametrized timing cases move out of default CI together)

Left unchanged: test_methodology_honest_did.py::test_m0_short_circuit uses
wall-clock as a proxy for "short-circuit path taken" — that's a correctness
signal, not a performance signal. Marking it slow would remove the check.
Added a TODO.md entry to replace it with a mock/spy in a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber igerber force-pushed the fix/flaky-timing-tests branch from b436a3f to 888b5d1 on April 19, 2026 12:04
@igerber (Owner, Author) commented Apr 19, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 888b5d1f9f275932411e0ecfae443c3289834f08


Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings in the current diff.

Executive Summary

Methodology

  • No findings.

Code Quality

  • No findings.

Performance

  • Severity: P3 (informational). Impact: In the default pytest lane, these four wall-clock assertions will no longer run, so automatic performance-regression detection for this file now depends on the project’s slow-test execution policy. This is an intentional flake-reduction tradeoff and is consistent with pyproject.toml:L100-L106. Concrete fix: Optional only — add a dedicated scheduled or manual pytest -m slow lane if you want automated perf coverage without reintroducing noise to default CI.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3 (informational, mitigated). Impact: The pre-existing HonestDiD test_m0_short_circuit still uses elapsed time as a correctness proxy in tests/test_methodology_honest_did.py:L246-L258, but the PR now tracks that limitation in TODO.md:L101-L101. Under the stated rubric, that makes it non-blocking. Concrete fix: None required in this PR; follow the TODO by replacing the timing assertion with a mock/spy or explicit short-circuit signal.
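The spy-based replacement that the TODO tracks could look like this sketch; run_flci and its solver argument are hypothetical stand-ins, not the package's real HonestDiD entry point:

```python
from unittest.mock import MagicMock

def run_flci(m, solver):
    # hypothetical short-circuit: M = 0 has a closed form, so the
    # expensive optimizer should never be entered
    if m == 0:
        return "closed-form"
    return solver(m)

def test_m0_short_circuit_via_spy():
    solver_spy = MagicMock(return_value="optimized")
    assert run_flci(0, solver_spy) == "closed-form"
    solver_spy.assert_not_called()  # structural proof, no wall clock needed
```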

Security

  • No findings.

Documentation/Tests

  • No findings.

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 19, 2026
@igerber igerber merged commit 328dc33 into main Apr 19, 2026
21 of 22 checks passed
@igerber igerber deleted the fix/flaky-timing-tests branch April 19, 2026 13:29
igerber added a commit that referenced this pull request Apr 19, 2026
PR #330 marked `test_timing_performance` and `TestPerformanceRegression`
with `@pytest.mark.slow`, which the default pytest `addopts = "-m 'not
slow'"` already excludes. That catches the default Python CI matrix but
misses the Rust-backend CI jobs at `.github/workflows/rust-test.yml:155,
162, 190`, which explicitly override the marker filter with `-m ''` so
they can exercise the full slow suite (intentional — TROP parity tests
live there). That's why our PR #334 tripped a 0.120s vs 0.1s threshold
on Windows py3.11 under the Rust backend.

Add a `skipif(os.environ.get("CI") == "true", ...)` marker in addition
to `@pytest.mark.slow` on the affected tests:
- `test_se_accuracy.py::TestCallawaySantAnnaSEAccuracy::test_timing_performance`
- `test_se_accuracy.py::TestPerformanceRegression` (class-level)
- `test_methodology_honest_did.py::TestOptimalFLCI::test_m0_short_circuit`

GitHub Actions sets `CI=true` on every runner, so the skip covers both
the default-CI and Rust-CI invocation patterns. Local development flows
(`pytest`, `pytest -m slow`, `pytest -m ''`) are unaffected — no `CI`
env var means the tests still run as on-demand performance sanity.

The `test_m0_short_circuit` case is special: it uses wall-clock time as
a proxy for "short-circuit path taken" (fast path <0.5s, slow
optimization would blow past that). The existing PR #330 TODO.md entry
already tracks replacing it with a mock/spy; the `skipif` here is the
interim guard until that refactor lands.

Verified locally: `CI=true pytest ... -m ''` reports 5 skipped (all
three targets); unset `CI` and the tests run and pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
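The combined guard described in the commit above can be sketched as follows; the reason string and test body are illustrative:

```python
import os
import pytest

# Skip on any GitHub Actions runner (CI=true is set on every job), while the
# slow marker still keeps the test out of the default `-m 'not slow'` lane.
@pytest.mark.skipif(
    os.environ.get("CI") == "true",
    reason="wall-clock thresholds are unreliable on shared CI runners",
)
@pytest.mark.slow
def test_timing_performance():
    ...  # placeholder for the real wall-clock assertion
```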
igerber added a commit that referenced this pull request Apr 20, 2026
…-failures audit

Packages 161 commits across 18 PRs since v3.1.3 as minor release 3.2.0. Per
project SemVer convention, minor bumps are reserved for new estimators or new
module-level public API — BusinessReport / DiagnosticReport / DiagnosticReportResults
(PR #318) add a new public API surface and drive this bump.

Headline work:
- PR #318 BusinessReport + DiagnosticReport (experimental preview) - practitioner-
  ready output layer. Plain-English narrative summaries across all 16 result types,
  with AI-legible to_dict() schemas. See docs/methodology/REPORTING.md.
- PR #327, #335 did-no-untreated foundation - kernel infrastructure, local linear
  regression, HC2/Bell-McCaffrey variance, nprobust port. Foundation for the
  upcoming HeterogeneousAdoptionDiD estimator.
- PR #323, #329, #332 dCDH survey completion - cell-period IF allocator (Class A
  contract), heterogeneity + within-group-varying PSU under Binder TSL, and
  PSU-level Hall-Mammen wild bootstrap at cell granularity.
- PR #333 performance review - docs/performance-scenarios.md documents 5-7
  realistic practitioner workflows; benchmark harness extended.

Silent-failures audit closeouts (PRs #324, #326, #328, #331, #334, #337, #339)
continue the reliability work started in v3.1.2-3.1.3 across axes A/C/E/G/J.

CI infrastructure: PRs #330 and #336 exclude wall-clock timing tests from default
CI after false-positive flakes; perf-review harness is the principled replacement.

Version strings bumped in diff_diff/__init__.py, pyproject.toml, rust/Cargo.toml,
diff_diff/guides/llms-full.txt, and CITATION.cff (version: 3.2.0, date-released:
2026-04-19). CHANGELOG populated with Added / Changed / Fixed sections and the
comparison-link footer. CITATION.cff retains v3.1.3 versioned DOI in identifiers;
the v3.2.0 versioned DOI will be minted by Zenodo on GitHub Release and added in
a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
