Skip wall-clock timing tests on CI to stop false-positive flake #336
Merged
PR #330 marked `test_timing_performance` and `TestPerformanceRegression` with `@pytest.mark.slow`, which the default pytest `addopts = "-m 'not slow'"` already excludes. That catches the default Python CI matrix but misses the Rust-backend CI jobs at `.github/workflows/rust-test.yml:155, 162, 190`, which explicitly override the marker filter with `-m ''` so they can exercise the full slow suite (intentional: TROP parity tests live there). That's why our PR #334 tripped a 0.120s vs 0.1s threshold on Windows py3.11 under the Rust backend.

Add a `skipif(os.environ.get("CI") == "true", ...)` marker in addition to `@pytest.mark.slow` on the affected tests:

- `test_se_accuracy.py::TestCallawaySantAnnaSEAccuracy::test_timing_performance`
- `test_se_accuracy.py::TestPerformanceRegression` (class-level)
- `test_methodology_honest_did.py::TestOptimalFLCI::test_m0_short_circuit`

GitHub Actions sets `CI=true` on every runner, so the skip covers both the default-CI and Rust-CI invocation patterns. Local development flows (`pytest`, `pytest -m slow`, `pytest -m ''`) are unaffected: no `CI` env var means the tests still run as on-demand performance sanity.

The `test_m0_short_circuit` case is special: it uses wall-clock time as a proxy for "short-circuit path taken" (fast path <0.5s, slow optimization would blow past that). The existing PR #330 TODO.md entry already tracks replacing it with a mock/spy; the `skipif` here is the interim guard until that refactor lands.

Verified locally: `CI=true pytest ... -m ''` reports 5 skipped (all three targets); unset `CI` and the tests run and pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
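The marker layering described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual test code: the helper `running_on_ci`, the marker name `skip_wall_clock_on_ci`, and the test bodies are hypothetical stand-ins. Only the `skipif(os.environ.get("CI") == "true", ...)` condition and the `slow` marker come from the PR text.

```python
import os
import time

import pytest


def running_on_ci(environ=os.environ):
    """Return True only when CI is exactly "true", the value GitHub Actions sets."""
    return environ.get("CI") == "true"


# Hypothetical shared marker. The condition is evaluated once, at import time,
# so it reflects the environment in which pytest collected the module.
skip_wall_clock_on_ci = pytest.mark.skipif(
    running_on_ci(),
    reason="wall-clock thresholds are unreliable on shared CI runners",
)


@pytest.mark.slow  # still excluded by the default -m 'not slow' filter
@skip_wall_clock_on_ci  # also skipped when a job overrides the filter with -m ''
def test_timing_performance():
    start = time.perf_counter()
    sum(range(10_000))  # stand-in workload, not the real estimator
    assert time.perf_counter() - start < 1.0


@pytest.mark.slow
@skip_wall_clock_on_ci  # class-level: applies to every test method in the class
class TestPerformanceRegression:
    def test_placeholder(self):
        assert True
```

Because the two markers are independent, `-m ''` still selects the tests but the `skipif` condition reports them as skipped when `CI=true`; with `CI` unset they run as before.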
Overall Assessment
✅ Looks good: no unmitigated P0/P1 findings.
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
igerber added a commit that referenced this pull request Apr 20, 2026
…-failures audit

Packages 161 commits across 18 PRs since v3.1.3 as minor release 3.2.0. Per project SemVer convention, minor bumps are reserved for new estimators or new module-level public API; BusinessReport / DiagnosticReport / DiagnosticReportResults (PR #318) add a new public API surface and drive this bump.

Headline work:
- PR #318 BusinessReport + DiagnosticReport (experimental preview) - practitioner-ready output layer. Plain-English narrative summaries across all 16 result types, with AI-legible to_dict() schemas. See docs/methodology/REPORTING.md.
- PR #327, #335 did-no-untreated foundation - kernel infrastructure, local linear regression, HC2/Bell-McCaffrey variance, nprobust port. Foundation for the upcoming HeterogeneousAdoptionDiD estimator.
- PR #323, #329, #332 dCDH survey completion - cell-period IF allocator (Class A contract), heterogeneity + within-group-varying PSU under Binder TSL, and PSU-level Hall-Mammen wild bootstrap at cell granularity.
- PR #333 performance review - docs/performance-scenarios.md documents 5-7 realistic practitioner workflows; benchmark harness extended.

Silent-failures audit closeouts (PRs #324, #326, #328, #331, #334, #337, #339) continue the reliability work started in v3.1.2-3.1.3 across axes A/C/E/G/J.

CI infrastructure: PRs #330 and #336 exclude wall-clock timing tests from default CI after false-positive flakes; perf-review harness is the principled replacement.

Version strings bumped in diff_diff/__init__.py, pyproject.toml, rust/Cargo.toml, diff_diff/guides/llms-full.txt, and CITATION.cff (version: 3.2.0, date-released: 2026-04-19). CHANGELOG populated with Added / Changed / Fixed sections and the comparison-link footer. CITATION.cff retains v3.1.3 versioned DOI in identifiers; the v3.2.0 versioned DOI will be minted by Zenodo on GitHub Release and added in a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Follow-up to PR #330. That PR marked three wall-clock timing tests `@pytest.mark.slow`, which the default `addopts = "-m 'not slow'"` already excludes, so the default Python CI matrix became quiet. But the Rust-backend CI jobs at `.github/workflows/rust-test.yml:155, 162, 190` explicitly override with `-m ''` to pull the full slow suite in (intentional: TROP parity tests need it). That's why PR #334's Windows py3.11 Rust job tripped a `0.120s vs 0.1s` threshold failure that was pure runner noise.

This PR layers a `skipif(os.environ.get("CI") == "true", ...)` marker on top of the existing `@pytest.mark.slow`. GitHub Actions sets `CI=true` on every runner, so the skip catches both invocation patterns. Local development (`pytest`, `pytest -m slow`, `pytest -m ''`) is unaffected.

- `test_se_accuracy.py::TestCallawaySantAnnaSEAccuracy::test_timing_performance`
- `test_se_accuracy.py::TestPerformanceRegression` (class-level)
- `test_methodology_honest_did.py::TestOptimalFLCI::test_m0_short_circuit`

The `test_m0_short_circuit` case is the one PR #330 left untouched because it uses wall-clock as a correctness proxy for the short-circuit path (fast <0.5s, slow would blow past). The existing `TODO.md` entry already tracks replacing it with a mock/spy; `skipif(CI)` is the interim guard until that refactor lands.

Methodology references (required if estimator / math changes)
Validation
Security / privacy
Generated with Claude Code