Skip to content

Guard TROP bootstrap loops against silent high-failure-rate runs#324

Merged
igerber merged 3 commits intomainfrom
fix/trop-bootstrap-failure-rate-guards
Apr 18, 2026
Merged

Guard TROP bootstrap loops against silent high-failure-rate runs#324
igerber merged 3 commits intomainfrom
fix/trop-bootstrap-failure-rate-guards

Conversation

@igerber
Copy link
Copy Markdown
Owner

@igerber igerber commented Apr 18, 2026

Summary

  • Add shared warn_bootstrap_failure_rate helper in diff_diff/bootstrap_utils.py that emits a proportional UserWarning when the bootstrap replicate failure rate exceeds 5% — matching the existing SyntheticDiD bootstrap and placebo convention.
  • Replace the hard-coded < 10 successes threshold in all four TROP bootstrap sites (local unit-resample, local Rao-Wu, global unit-resample, global Rao-Wu) plus both Rust happy paths (local and global) with the proportional guard. Previously, a run with n_bootstrap=200 and 11 successes (94.5% failure rate) passed silently; now any run with failure rate > 5% warns.
  • Document the change in docs/methodology/REGISTRY.md (TROP edge-cases section).

Methodology references (required if estimator / math changes)

  • Method name(s): TROP bootstrap standard-error estimation (local and global, including Rao-Wu survey variant and Rust happy paths).
  • Paper / source link(s): No estimator math changed. This PR adjusts diagnostic/warning control flow only. The zero-success SE = NaN contract is unchanged.
  • Any intentional deviations from the source (and why): None. The 5% threshold is an internal diagnostic convention chosen to match the existing SyntheticDiD bootstrap and placebo guards (synthetic_did.py:1060 and :1245), not a deviation from any cited reference.

Validation

  • Tests added/updated:
    • tests/test_bootstrap_utils.py — 7 new tests for the shared helper (above-threshold, below-threshold, full-success, zero-attempts, zero-success, custom threshold, context string in message).
    • tests/test_trop.py::TestTROPBootstrapFailureRateGuard — 4 integration tests covering the local Python, global Python, local Rust, and full-success-silence paths via mocked inner-fit side effects.
  • Backtest / simulation / notebook evidence (if applicable): N/A — warning-behavior change, no numeric-contract change.

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Audit context: this PR resolves findings #13#16 from the axis-D (degenerate-replicate handling) slice of the in-flight silent-failures audit. SDID bootstrap/placebo (D-2/D-3) were verified as already-guarded during prep, so no SDID code change was bundled.

igerber and others added 2 commits April 18, 2026 18:06
Replace the hard-coded "< 10 successes" warning threshold in the four
TROP bootstrap sites (local unit-resample, local Rao-Wu, global
unit-resample, global Rao-Wu) plus the Rust global happy path with a
proportional 5% failure-rate guard, matching the existing SyntheticDiD
bootstrap and placebo convention. A shared helper
`bootstrap_utils.warn_bootstrap_failure_rate` centralizes the check so
future axis-D work (e.g. PowerAnalysis simulation counter) can reuse it.

Before this change, a run with `n_bootstrap=200` and 11 successes
(94.5% failure rate) passed silently because 11 >= 10. Now any run
with failure rate > 5% emits a `UserWarning` surfacing the success
count, total attempts, and failure rate.

SDID bootstrap paths (`synthetic_did.py:1036-1070` and `:1229-1251`)
were verified during this work to already have the same 5% proportional
guard — D-2/D-3 in the audit are marked resolved rather than bundled.

Covered by audit axis D (degenerate-replicate handling). Findings
#13-#16 from `docs/audits/silent-failures-findings.md`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two P2 items from local AI review:

1. The local Rust bootstrap happy path (`trop_local.py:988`) previously
   returned silently whenever it received `>= 10` samples and otherwise
   fell back to Python. That left the same silent-failure window the
   main PR was closing — Rust returning 11 of 200 samples (94.5%
   failure rate) would return its SE with no warning. Replaced the
   `len >= 10` path-switch with the proportional
   `warn_bootstrap_failure_rate` guard; only a zero-success Rust result
   now triggers the Python fallback.

2. The helper docstring was ambiguous about which zero case was silent.
   Revised to distinguish `n_attempted == 0` (silent — caller handles)
   from `n_success == 0` with `n_attempted > 0` (warning fires).

Added a regression test covering the Rust local path via a mocked
`_rust_bootstrap_trop_variance` returning 11/200 samples.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

Overall Assessment

✅ Looks good

Executive Summary

  • Affected methods: TROP bootstrap SE diagnostics for local/global block bootstrap, Rao-Wu survey bootstrap, and both Rust happy paths.
  • No estimator math, weighting, variance formula, or identification logic changed. The diff only centralizes the post-loop warning logic in diff_diff/bootstrap_utils.py:L30-L70.
  • The new >5% failure-rate guard is explicitly documented in docs/methodology/REGISTRY.md:L1973-L1976 and matches the existing SyntheticDiD convention in diff_diff/synthetic_did.py:L1060-L1067, so there is no undocumented methodology deviation.
  • Pattern consistency is good: all touched TROP bootstrap sites now use the shared helper in diff_diff/trop_local.py:L988-L995, diff_diff/trop_local.py:L1075-L1084, diff_diff/trop_local.py:L1242-L1251, diff_diff/trop_global.py:L998-L1006, diff_diff/trop_global.py:L1083-L1092, and diff_diff/trop_global.py:L1274-L1283.
  • I could not execute the targeted pytest suites here because pytest is not installed in this environment.

Methodology

  • Severity: P3. Impact: The only substantive behavior shift beyond warning text is that the local Rust path now keeps any non-zero Rust bootstrap result instead of rerunning Python when successes are 1..9. That change is explicitly documented in docs/methodology/REGISTRY.md:L1973-L1976, and the Rust backend already returns SE=NaN when fewer than two draws succeed in rust/src/trop.rs:L1085-L1101 (global has the same contract in rust/src/trop.rs:L1825-L1841). Concrete fix: None required.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No new findings. The broader TROP duplication debt remains pre-existing and is already tracked in TODO.md:L69, so it is mitigated under the review policy.

Security

  • No findings.

Documentation/Tests

  • Severity: P3. Impact: The helper itself is well covered in tests/test_bootstrap_utils.py:L136-L195, and the integration tests added here cover local Python, global Python, and local Rust warning paths in tests/test_trop.py:L3730-L3893. But this diff also changes the global Rust warning branch and both Rao-Wu warning branches without a targeted assertion on those paths: diff_diff/trop_global.py:L998-L1006, diff_diff/trop_local.py:L1242-L1251, diff_diff/trop_global.py:L1274-L1283. That is a coverage-depth gap, not a correctness defect. Concrete fix: Add one mocked global-Rust warning test and one survey-design warning test each for the local/global Rao-Wu branches that forces >5% failures and asserts the warning context string.

CI AI review on PR #324 flagged (P3) that the delta changed the Rao-Wu
local, Rao-Wu global, and global-Rust warning branches without direct
assertions on those paths. Add three targeted tests mirroring the
pattern of the existing four (mocked inner-fit side effects, direct
method invocation, pytest.warns with the context string in the match
regex):

- `_bootstrap_variance_global` Rust happy path
- `_bootstrap_rao_wu_local` (survey design with per-unit PSU)
- `_bootstrap_rao_wu_global` (same survey setup)

All six changed warning sites now have direct regression coverage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 18, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 356e3ecf5e9d87daf280cf2f9782b88f8c007934


Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings in the changed TROP bootstrap paths.

Executive Summary

  • Affected methods: TROP bootstrap SE diagnostics for local/global unit-resample bootstrap, local/global Rao-Wu bootstrap, and both Rust-return paths in diff_diff/trop_local.py:L988-L995, diff_diff/trop_local.py:L1075-L1083, diff_diff/trop_local.py:L1242-L1250, diff_diff/trop_global.py:L998-L1006, diff_diff/trop_global.py:L1083-L1091, and diff_diff/trop_global.py:L1274-L1282.
  • No estimator math, weighting rule, variance formula, or identification assumption changed. TROP still routes SEs through safe_inference() in diff_diff/trop.py:L834-L851 and diff_diff/trop_global.py:L842-L857, so NaN-consistent inference is preserved.
  • The new proportional warning guard is explicitly documented in docs/methodology/REGISTRY.md:L1973-L1976, so the warning-policy change and the local Rust nonzero-result behavior are documented deviations rather than defects.
  • Pattern consistency is good: the old < 10 successes guard is removed from the touched TROP code, and all six in-scope sites now use diff_diff.bootstrap_utils.warn_bootstrap_failure_rate() in diff_diff/bootstrap_utils.py:L30-L70.
  • The prior re-review coverage gap is addressed by new helper tests in tests/test_bootstrap_utils.py:L136-L195 and new TROP integration coverage in tests/test_trop.py:L3706-L4044.
  • I could not execute the targeted pytest suites here because pytest is not installed in this environment.

Methodology

  • Severity: P3 (informational). Impact: The PR changes only post-loop bootstrap failure diagnostics for TROP’s block-bootstrap and Rao-Wu SE paths; it does not change the estimator formulas summarized in docs/methodology/REGISTRY.md:L1947-L1976. The >5% threshold matches the existing SyntheticDiD bootstrap/placebo convention in diff_diff/synthetic_did.py:L1060-L1067 and diff_diff/synthetic_did.py:L1244-L1251, and the local Rust path’s new “accept any non-zero result” behavior is explicitly documented in docs/methodology/REGISTRY.md:L1973-L1976. The Rust backend already returns SE=NaN when fewer than two draws succeed in rust/src/trop.rs:L1085-L1101 and rust/src/trop.rs:L1825-L1841, so there is no new NaN/inference inconsistency. Concrete fix: None required.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings. The broader TROP duplication debt remains pre-existing and tracked in TODO.md:L69.

Security

  • No findings.

Documentation/Tests

  • Severity: P3 (informational). Impact: The previously noted coverage gap is now closed: the PR adds direct assertions for the global Rust warning path and both Rao-Wu warning branches in tests/test_trop.py:L3895-L4044, plus helper edge cases in tests/test_bootstrap_utils.py:L136-L195, and it updates the registry note in docs/methodology/REGISTRY.md:L1973-L1976. Concrete fix: None required.

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 18, 2026
@igerber igerber merged commit 4387707 into main Apr 18, 2026
22 of 24 checks passed
@igerber igerber deleted the fix/trop-bootstrap-failure-rate-guards branch April 18, 2026 23:28
igerber added a commit that referenced this pull request Apr 20, 2026
…-failures audit

Packages 161 commits across 18 PRs since v3.1.3 as minor release 3.2.0. Per
project SemVer convention, minor bumps are reserved for new estimators or new
module-level public API — BusinessReport / DiagnosticReport / DiagnosticReportResults
(PR #318) add a new public API surface and drive this bump.

Headline work:
- PR #318 BusinessReport + DiagnosticReport (experimental preview) - practitioner-
  ready output layer. Plain-English narrative summaries across all 16 result types,
  with AI-legible to_dict() schemas. See docs/methodology/REPORTING.md.
- PR #327, #335 did-no-untreated foundation - kernel infrastructure, local linear
  regression, HC2/Bell-McCaffrey variance, nprobust port. Foundation for the
  upcoming HeterogeneousAdoptionDiD estimator.
- PR #323, #329, #332 dCDH survey completion - cell-period IF allocator (Class A
  contract), heterogeneity + within-group-varying PSU under Binder TSL, and
  PSU-level Hall-Mammen wild bootstrap at cell granularity.
- PR #333 performance review - docs/performance-scenarios.md documents 5-7
  realistic practitioner workflows; benchmark harness extended.

Silent-failures audit closeouts (PRs #324, #326, #328, #331, #334, #337, #339)
continue the reliability work started in v3.1.2-3.1.3 across axes A/C/E/G/J.

CI infrastructure: PRs #330 and #336 exclude wall-clock timing tests from default
CI after false-positive flakes; perf-review harness is the principled replacement.

Version strings bumped in diff_diff/__init__.py, pyproject.toml, rust/Cargo.toml,
diff_diff/guides/llms-full.txt, and CITATION.cff (version: 3.2.0, date-released:
2026-04-19). CHANGELOG populated with Added / Changed / Fixed sections and the
comparison-link footer. CITATION.cff retains v3.1.3 versioned DOI in identifiers;
the v3.2.0 versioned DOI will be minted by Zenodo on GitHub Release and added in
a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-ci Triggers CI test workflows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant