Guard TROP bootstrap loops against silent high-failure-rate runs#324
Merged
Replace the hard-coded "< 10 successes" warning threshold in the four TROP bootstrap sites (local unit-resample, local Rao-Wu, global unit-resample, global Rao-Wu) plus the Rust global happy path with a proportional 5% failure-rate guard, matching the existing SyntheticDiD bootstrap and placebo convention. A shared helper `bootstrap_utils.warn_bootstrap_failure_rate` centralizes the check so future axis-D work (e.g. PowerAnalysis simulation counter) can reuse it.

Before this change, a run with `n_bootstrap=200` and 11 successes (94.5% failure rate) passed silently because 11 >= 10. Now any run with failure rate > 5% emits a `UserWarning` surfacing the success count, total attempts, and failure rate.

SDID bootstrap paths (`synthetic_did.py:1036-1070` and `:1229-1251`) were verified during this work to already have the same 5% proportional guard, so D-2/D-3 in the audit are marked resolved rather than bundled.

Covered by audit axis D (degenerate-replicate handling). Findings #13-#16 from `docs/audits/silent-failures-findings.md`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
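The proportional guard described in this commit can be sketched as follows. This is a minimal illustration, not the code in `diff_diff/bootstrap_utils.py`: the parameter names (`n_success`, `n_attempted`, `context`, `threshold`) and the message wording are assumptions. Only the helper name, the 5% threshold, the `UserWarning` category, and the reported fields come from the commit message. `old_guard_warns` is a hypothetical stand-in for the replaced rule.

```python
import warnings


def warn_bootstrap_failure_rate(n_success, n_attempted, context, threshold=0.05):
    """Warn when the share of failed bootstrap replicates exceeds `threshold`.

    Hypothetical signature for illustration; the real helper may differ.
    """
    if n_attempted <= 0:
        # Nothing was attempted: stay silent and let the caller decide.
        return
    n_failed = n_attempted - n_success
    failure_rate = n_failed / n_attempted
    if failure_rate > threshold:
        warnings.warn(
            f"{context}: only {n_success} of {n_attempted} bootstrap replicates "
            f"succeeded ({failure_rate:.1%} failure rate).",
            UserWarning,
            stacklevel=2,
        )


def old_guard_warns(n_success):
    """The replaced rule: warn only when fewer than 10 replicates succeed."""
    return n_success < 10


# 11 of 200 successes slips past the old absolute rule but trips the new one.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_bootstrap_failure_rate(11, 200, context="TROP local bootstrap")

print(old_guard_warns(11))  # the old guard stays silent for this run
print(len(caught))          # number of warnings the proportional guard emitted
```

Computing the rate as failures over attempts, rather than one minus the success share, keeps the comparison against the threshold free of avoidable floating-point drift at the boundary.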
Two P2 items from local AI review:

1. The local Rust bootstrap happy path (`trop_local.py:988`) previously returned silently whenever it received `>= 10` samples and otherwise fell back to Python. That left the same silent-failure window the main PR was closing: Rust returning 11 of 200 samples (94.5% failure rate) would return its SE with no warning. Replaced the `len >= 10` path-switch with the proportional `warn_bootstrap_failure_rate` guard; only a zero-success Rust result now triggers the Python fallback.
2. The helper docstring was ambiguous about which zero case was silent. Revised to distinguish `n_attempted == 0` (silent; the caller handles it) from `n_success == 0` with `n_attempted > 0` (warning fires).

Added a regression test covering the Rust local path via a mocked `_rust_bootstrap_trop_variance` returning 11/200 samples.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
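The revised control flow for the Rust path in item 1 can be pictured as below. This is a sketch under assumptions, not the code in `trop_local.py`: `se_from_rust_samples`, `python_fallback`, and the inline guard are hypothetical stand-ins, and the standard error is taken as the sample standard deviation of the replicate estimates purely for illustration.

```python
import statistics
import warnings


def se_from_rust_samples(rust_samples, n_bootstrap, python_fallback, threshold=0.05):
    """Return an SE from Rust replicates, falling back only on zero successes.

    All names here are hypothetical; assumes n_bootstrap >= len(rust_samples).
    """
    n_success = len(rust_samples)
    if n_success == 0:
        # Only a zero-success Rust result hands control to the Python path.
        return python_fallback()

    # Proportional guard replaces the old `len(samples) >= 10` path-switch.
    failure_rate = (n_bootstrap - n_success) / n_bootstrap
    if failure_rate > threshold:
        warnings.warn(
            f"Rust local bootstrap: {n_success} of {n_bootstrap} replicates "
            f"succeeded ({failure_rate:.1%} failure rate).",
            UserWarning,
            stacklevel=2,
        )

    if n_success < 2:
        # A single replicate has no sample standard deviation.
        return float("nan")
    return statistics.stdev(rust_samples)


# 11 of 200 replicates: the SE is still returned, but no longer silently.
samples = [0.1 * i for i in range(11)]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    se = se_from_rust_samples(samples, 200, python_fallback=lambda: float("nan"))
print(se, len(caught))
```

The point of the design is that a partially failed Rust run keeps its result and gains a warning, instead of either passing silently or being discarded in favor of a second bootstrap in Python.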
Overall Assessment: ✅ Looks good

Review sections: Executive Summary, Methodology, Code Quality, Performance, Maintainability, Tech Debt, Security, Documentation/Tests
CI AI review on PR #324 flagged (P3) that the delta changed the Rao-Wu local, Rao-Wu global, and global-Rust warning branches without direct assertions on those paths. Add three targeted tests mirroring the pattern of the existing four (mocked inner-fit side effects, direct method invocation, `pytest.warns` with the context string in the match regex):

- `_bootstrap_variance_global` Rust happy path
- `_bootstrap_rao_wu_local` (survey design with per-unit PSU)
- `_bootstrap_rao_wu_global` (same survey setup)

All six changed warning sites now have direct regression coverage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
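The test pattern named in this commit (mocked inner-fit side effects, direct method invocation, a regex match on the warning text) might look roughly like the sketch below. `ToyEstimator`, `succeed_then_fail`, and `run_mocked` are hypothetical and stand in for the TROP classes and test helpers, which are not shown in this PR page. The sketch uses `warnings.catch_warnings` and `re.search` so it runs with the standard library alone.

```python
import re
import warnings
from unittest import mock


class ToyEstimator:
    """Hypothetical stand-in for an estimator with a bootstrap loop."""

    def _fit_once(self, replicate):
        return 1.0

    def _bootstrap_variance(self, n_bootstrap):
        estimates = []
        for b in range(n_bootstrap):
            try:
                estimates.append(self._fit_once(b))
            except ValueError:
                continue  # a failed replicate is dropped, not fatal
        n_failed = n_bootstrap - len(estimates)
        if n_bootstrap > 0 and n_failed / n_bootstrap > 0.05:
            warnings.warn(
                f"toy bootstrap: {len(estimates)} of {n_bootstrap} replicates succeeded",
                UserWarning,
            )
        return estimates


def succeed_then_fail(n_ok):
    """Side effect that returns a value n_ok times, then raises on every call."""
    calls = {"n": 0}

    def _side_effect(replicate):
        calls["n"] += 1
        if calls["n"] > n_ok:
            raise ValueError("simulated replicate failure")
        return 1.0

    return _side_effect


def run_mocked(n_ok, n_bootstrap):
    """Call the bootstrap method directly with the inner fit mocked out."""
    with mock.patch.object(ToyEstimator, "_fit_once", side_effect=succeed_then_fail(n_ok)):
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")
            estimates = ToyEstimator()._bootstrap_variance(n_bootstrap)
    messages = [str(w.message) for w in caught if issubclass(w.category, UserWarning)]
    return estimates, messages


estimates, messages = run_mocked(n_ok=11, n_bootstrap=200)
assert len(estimates) == 11
assert any(re.search(r"toy bootstrap.*11 of 200", m) for m in messages)
```

Under pytest, the two `warnings` lines and the final `re.search` check collapse into a single `with pytest.warns(UserWarning, match=r"toy bootstrap.*11 of 200"):` block around the method call.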
/ai-review
🔁 AI review rerun (requested by @igerber)

Head SHA:

Overall Assessment: ✅ Looks good

No unmitigated P0/P1 findings in the changed TROP bootstrap paths.

Review sections: Executive Summary, Methodology, Code Quality, Performance, Maintainability, Tech Debt, Security, Documentation/Tests
igerber added a commit that referenced this pull request on Apr 20, 2026
…-failures audit

Packages 161 commits across 18 PRs since v3.1.3 as minor release 3.2.0. Per project SemVer convention, minor bumps are reserved for new estimators or new module-level public API; BusinessReport / DiagnosticReport / DiagnosticReportResults (PR #318) add a new public API surface and drive this bump.

Headline work:
- PR #318 BusinessReport + DiagnosticReport (experimental preview) - practitioner-ready output layer. Plain-English narrative summaries across all 16 result types, with AI-legible to_dict() schemas. See docs/methodology/REPORTING.md.
- PR #327, #335 did-no-untreated foundation - kernel infrastructure, local linear regression, HC2/Bell-McCaffrey variance, nprobust port. Foundation for the upcoming HeterogeneousAdoptionDiD estimator.
- PR #323, #329, #332 dCDH survey completion - cell-period IF allocator (Class A contract), heterogeneity + within-group-varying PSU under Binder TSL, and PSU-level Hall-Mammen wild bootstrap at cell granularity.
- PR #333 performance review - docs/performance-scenarios.md documents 5-7 realistic practitioner workflows; benchmark harness extended.

Silent-failures audit closeouts (PRs #324, #326, #328, #331, #334, #337, #339) continue the reliability work started in v3.1.2-3.1.3 across axes A/C/E/G/J.

CI infrastructure: PRs #330 and #336 exclude wall-clock timing tests from default CI after false-positive flakes; perf-review harness is the principled replacement.

Version strings bumped in diff_diff/__init__.py, pyproject.toml, rust/Cargo.toml, diff_diff/guides/llms-full.txt, and CITATION.cff (version: 3.2.0, date-released: 2026-04-19). CHANGELOG populated with Added / Changed / Fixed sections and the comparison-link footer. CITATION.cff retains v3.1.3 versioned DOI in identifiers; the v3.2.0 versioned DOI will be minted by Zenodo on GitHub Release and added in a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary

- Shared `warn_bootstrap_failure_rate` helper in `diff_diff/bootstrap_utils.py` that emits a proportional `UserWarning` when the bootstrap replicate failure rate exceeds 5%, matching the existing SyntheticDiD bootstrap and placebo convention.
- Replaces the hard-coded `< 10 successes` threshold in all four TROP bootstrap sites (local unit-resample, local Rao-Wu, global unit-resample, global Rao-Wu) plus both Rust happy paths (local and global) with the proportional guard. Previously, a run with `n_bootstrap=200` and 11 successes (94.5% failure rate) passed silently; now any run with failure rate > 5% warns.
- `docs/methodology/REGISTRY.md` (TROP edge-cases section).

Methodology references (required if estimator / math changes)

- … `SE = NaN` contract is unchanged.
- … (`synthetic_did.py:1060` and `:1245`), not a deviation from any cited reference.

Validation

- `tests/test_bootstrap_utils.py`: 7 new tests for the shared helper (above-threshold, below-threshold, full-success, zero-attempts, zero-success, custom threshold, context string in message).
- `tests/test_trop.py::TestTROPBootstrapFailureRateGuard`: 4 integration tests covering the local Python, global Python, local Rust, and full-success-silence paths via mocked inner-fit side effects.

Security / privacy

Audit context: this PR resolves findings #13–#16 from the axis-D (degenerate-replicate handling) slice of the in-flight silent-failures audit. SDID bootstrap/placebo (D-2/D-3) were verified as already-guarded during prep, so no SDID code change was bundled.