Guard TROP bootstrap loops against silent high-failure-rate runs by igerber · Pull Request #324 · igerber/diff-diff

igerber · 2026-04-18T22:07:16Z

Summary

Add shared warn_bootstrap_failure_rate helper in diff_diff/bootstrap_utils.py that emits a proportional UserWarning when the bootstrap replicate failure rate exceeds 5% — matching the existing SyntheticDiD bootstrap and placebo convention.
Replace the hard-coded < 10 successes threshold in all four TROP bootstrap sites (local unit-resample, local Rao-Wu, global unit-resample, global Rao-Wu) plus both Rust happy paths (local and global) with the proportional guard. Previously, a run with n_bootstrap=200 and 11 successes (94.5% failure rate) passed silently; now any run with failure rate > 5% warns.
Document the change in docs/methodology/REGISTRY.md (TROP edge-cases section).

Methodology references (required if estimator / math changes)

Method name(s): TROP bootstrap standard-error estimation (local and global, including Rao-Wu survey variant and Rust happy paths).
Paper / source link(s): No estimator math changed. This PR adjusts diagnostic/warning control flow only. The zero-success SE = NaN contract is unchanged.
Any intentional deviations from the source (and why): None. The 5% threshold is an internal diagnostic convention chosen to match the existing SyntheticDiD bootstrap and placebo guards (synthetic_did.py:1060 and :1245), not a deviation from any cited reference.

Validation

Tests added/updated:
- tests/test_bootstrap_utils.py — 7 new tests for the shared helper (above-threshold, below-threshold, full-success, zero-attempts, zero-success, custom threshold, context string in message).
- tests/test_trop.py::TestTROPBootstrapFailureRateGuard — 4 integration tests covering the local Python, global Python, local Rust, and full-success-silence paths via mocked inner-fit side effects.
Backtest / simulation / notebook evidence (if applicable): N/A — warning-behavior change, no numeric-contract change.

Security / privacy

Confirm no secrets/PII in this PR: Yes

Audit context: this PR resolves findings #13–#16 from the axis-D (degenerate-replicate handling) slice of the in-flight silent-failures audit. SDID bootstrap/placebo (D-2/D-3) were verified as already-guarded during prep, so no SDID code change was bundled.

Replace the hard-coded "< 10 successes" warning threshold in the four TROP bootstrap sites (local unit-resample, local Rao-Wu, global unit-resample, global Rao-Wu) plus the Rust global happy path with a proportional 5% failure-rate guard, matching the existing SyntheticDiD bootstrap and placebo convention. A shared helper `bootstrap_utils.warn_bootstrap_failure_rate` centralizes the check so future axis-D work (e.g. PowerAnalysis simulation counter) can reuse it. Before this change, a run with `n_bootstrap=200` and 11 successes (94.5% failure rate) passed silently because 11 >= 10. Now any run with failure rate > 5% emits a `UserWarning` surfacing the success count, total attempts, and failure rate. SDID bootstrap paths (`synthetic_did.py:1036-1070` and `:1229-1251`) were verified during this work to already have the same 5% proportional guard — D-2/D-3 in the audit are marked resolved rather than bundled. Covered by audit axis D (degenerate-replicate handling). Findings #13-#16 from `docs/audits/silent-failures-findings.md`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two P2 items from local AI review: 1. The local Rust bootstrap happy path (`trop_local.py:988`) previously returned silently whenever it received `>= 10` samples and otherwise fell back to Python. That left the same silent-failure window the main PR was closing — Rust returning 11 of 200 samples (94.5% failure rate) would return its SE with no warning. Replaced the `len >= 10` path-switch with the proportional `warn_bootstrap_failure_rate` guard; only a zero-success Rust result now triggers the Python fallback. 2. The helper docstring was ambiguous about which zero case was silent. Revised to distinguish `n_attempted == 0` (silent — caller handles) from `n_success == 0` with `n_attempted > 0` (warning fires). Added a regression test covering the Rust local path via a mocked `_rust_bootstrap_trop_variance` returning 11/200 samples. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-04-18T22:13:37Z

Overall Assessment

✅ Looks good

Executive Summary

Affected methods: TROP bootstrap SE diagnostics for local/global block bootstrap, Rao-Wu survey bootstrap, and both Rust happy paths.
No estimator math, weighting, variance formula, or identification logic changed. The diff only centralizes the post-loop warning logic in diff_diff/bootstrap_utils.py:L30-L70.
The new >5% failure-rate guard is explicitly documented in docs/methodology/REGISTRY.md:L1973-L1976 and matches the existing SyntheticDiD convention in diff_diff/synthetic_did.py:L1060-L1067, so there is no undocumented methodology deviation.
Pattern consistency is good: all touched TROP bootstrap sites now use the shared helper in diff_diff/trop_local.py:L988-L995, diff_diff/trop_local.py:L1075-L1084, diff_diff/trop_local.py:L1242-L1251, diff_diff/trop_global.py:L998-L1006, diff_diff/trop_global.py:L1083-L1092, and diff_diff/trop_global.py:L1274-L1283.
I could not execute the targeted pytest suites here because pytest is not installed in this environment.

Methodology

Severity: P3. Impact: The only substantive behavior shift beyond warning text is that the local Rust path now keeps any non-zero Rust bootstrap result instead of rerunning Python when successes are 1..9. That change is explicitly documented in docs/methodology/REGISTRY.md:L1973-L1976, and the Rust backend already returns SE=NaN when fewer than two draws succeed in rust/src/trop.rs:L1085-L1101 (global has the same contract in rust/src/trop.rs:L1825-L1841). Concrete fix: None required.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No new findings. The broader TROP duplication debt remains pre-existing and is already tracked in TODO.md:L69, so it is mitigated under the review policy.

Security

No findings.

Documentation/Tests

Severity: P3. Impact: The helper itself is well covered in tests/test_bootstrap_utils.py:L136-L195, and the integration tests added here cover local Python, global Python, and local Rust warning paths in tests/test_trop.py:L3730-L3893. But this diff also changes the global Rust warning branch and both Rao-Wu warning branches without a targeted assertion on those paths: diff_diff/trop_global.py:L998-L1006, diff_diff/trop_local.py:L1242-L1251, diff_diff/trop_global.py:L1274-L1283. That is a coverage-depth gap, not a correctness defect. Concrete fix: Add one mocked global-Rust warning test and one survey-design warning test each for the local/global Rao-Wu branches that forces >5% failures and asserts the warning context string.

CI AI review on PR #324 flagged (P3) that the delta changed the Rao-Wu local, Rao-Wu global, and global-Rust warning branches without direct assertions on those paths. Add three targeted tests mirroring the pattern of the existing four (mocked inner-fit side effects, direct method invocation, pytest.warns with the context string in the match regex): - `_bootstrap_variance_global` Rust happy path - `_bootstrap_rao_wu_local` (survey design with per-unit PSU) - `_bootstrap_rao_wu_global` (same survey setup) All six changed warning sites now have direct regression coverage. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-18T22:24:52Z

/ai-review

github-actions · 2026-04-18T22:29:06Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 356e3ecf5e9d87daf280cf2f9782b88f8c007934

Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings in the changed TROP bootstrap paths.

Executive Summary

Affected methods: TROP bootstrap SE diagnostics for local/global unit-resample bootstrap, local/global Rao-Wu bootstrap, and both Rust-return paths in diff_diff/trop_local.py:L988-L995, diff_diff/trop_local.py:L1075-L1083, diff_diff/trop_local.py:L1242-L1250, diff_diff/trop_global.py:L998-L1006, diff_diff/trop_global.py:L1083-L1091, and diff_diff/trop_global.py:L1274-L1282.
No estimator math, weighting rule, variance formula, or identification assumption changed. TROP still routes SEs through safe_inference() in diff_diff/trop.py:L834-L851 and diff_diff/trop_global.py:L842-L857, so NaN-consistent inference is preserved.
The new proportional warning guard is explicitly documented in docs/methodology/REGISTRY.md:L1973-L1976, so the warning-policy change and the local Rust nonzero-result behavior are documented deviations rather than defects.
Pattern consistency is good: the old < 10 successes guard is removed from the touched TROP code, and all six in-scope sites now use diff_diff.bootstrap_utils.warn_bootstrap_failure_rate() in diff_diff/bootstrap_utils.py:L30-L70.
The prior re-review coverage gap is addressed by new helper tests in tests/test_bootstrap_utils.py:L136-L195 and new TROP integration coverage in tests/test_trop.py:L3706-L4044.
I could not execute the targeted pytest suites here because pytest is not installed in this environment.

Methodology

Severity: P3 (informational). Impact: The PR changes only post-loop bootstrap failure diagnostics for TROP’s block-bootstrap and Rao-Wu SE paths; it does not change the estimator formulas summarized in docs/methodology/REGISTRY.md:L1947-L1976. The >5% threshold matches the existing SyntheticDiD bootstrap/placebo convention in diff_diff/synthetic_did.py:L1060-L1067 and diff_diff/synthetic_did.py:L1244-L1251, and the local Rust path’s new “accept any non-zero result” behavior is explicitly documented in docs/methodology/REGISTRY.md:L1973-L1976. The Rust backend already returns SE=NaN when fewer than two draws succeed in rust/src/trop.rs:L1085-L1101 and rust/src/trop.rs:L1825-L1841, so there is no new NaN/inference inconsistency. Concrete fix: None required.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. The broader TROP duplication debt remains pre-existing and tracked in TODO.md:L69.

Security

No findings.

Documentation/Tests

Severity: P3 (informational). Impact: The previously noted coverage gap is now closed: the PR adds direct assertions for the global Rust warning path and both Rao-Wu warning branches in tests/test_trop.py:L3895-L4044, plus helper edge cases in tests/test_bootstrap_utils.py:L136-L195, and it updates the registry note in docs/methodology/REGISTRY.md:L1973-L1976. Concrete fix: None required.

…-failures audit Packages 161 commits across 18 PRs since v3.1.3 as minor release 3.2.0. Per project SemVer convention, minor bumps are reserved for new estimators or new module-level public API — BusinessReport / DiagnosticReport / DiagnosticReportResults (PR #318) add a new public API surface and drive this bump. Headline work: - PR #318 BusinessReport + DiagnosticReport (experimental preview) - practitioner- ready output layer. Plain-English narrative summaries across all 16 result types, with AI-legible to_dict() schemas. See docs/methodology/REPORTING.md. - PR #327, #335 did-no-untreated foundation - kernel infrastructure, local linear regression, HC2/Bell-McCaffrey variance, nprobust port. Foundation for the upcoming HeterogeneousAdoptionDiD estimator. - PR #323, #329, #332 dCDH survey completion - cell-period IF allocator (Class A contract), heterogeneity + within-group-varying PSU under Binder TSL, and PSU-level Hall-Mammen wild bootstrap at cell granularity. - PR #333 performance review - docs/performance-scenarios.md documents 5-7 realistic practitioner workflows; benchmark harness extended. Silent-failures audit closeouts (PRs #324, #326, #328, #331, #334, #337, #339) continue the reliability work started in v3.1.2-3.1.3 across axes A/C/E/G/J. CI infrastructure: PRs #330 and #336 exclude wall-clock timing tests from default CI after false-positive flakes; perf-review harness is the principled replacement. Version strings bumped in diff_diff/__init__.py, pyproject.toml, rust/Cargo.toml, diff_diff/guides/llms-full.txt, and CITATION.cff (version: 3.2.0, date-released: 2026-04-19). CHANGELOG populated with Added / Changed / Fixed sections and the comparison-link footer. CITATION.cff retains v3.1.3 versioned DOI in identifiers; the v3.2.0 versioned DOI will be minted by Zenodo on GitHub Release and added in a follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber and others added 2 commits April 18, 2026 18:06

igerber added the ready-for-ci Triggers CI test workflows label Apr 18, 2026

igerber merged commit 4387707 into main Apr 18, 2026
22 of 24 checks passed

igerber deleted the fix/trop-bootstrap-failure-rate-guards branch April 18, 2026 23:28

igerber mentioned this pull request Apr 20, 2026

Release 3.2.0: BusinessReport preview, dCDH survey completion, silent-failures audit #342

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Guard TROP bootstrap loops against silent high-failure-rate runs#324

Guard TROP bootstrap loops against silent high-failure-rate runs#324
igerber merged 3 commits intomainfrom
fix/trop-bootstrap-failure-rate-guards

igerber commented Apr 18, 2026

Uh oh!

github-actions Bot commented Apr 18, 2026

Uh oh!

igerber commented Apr 18, 2026

Uh oh!

github-actions Bot commented Apr 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

igerber commented Apr 18, 2026

Summary

Methodology references (required if estimator / math changes)

Validation

Security / privacy

Uh oh!

github-actions Bot commented Apr 18, 2026

Uh oh!

igerber commented Apr 18, 2026

Uh oh!

github-actions Bot commented Apr 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant