Fix SyntheticDiD bootstrap p-value dispatch and SE formula by igerber · Pull Request #349 · igerber/diff-diff

igerber · 2026-04-22T01:47:42Z

Summary

Narrow the empirical null p-value formula to variance_method="placebo" only. Bootstrap draws are centered on τ̂ (sampling distribution) and jackknife pseudo-values are not null draws; both now use the analytical normal-theory p-value from the SE via safe_inference, matching R's synthdid::vcov() convention. Pre-fix, bootstrap p-values clustered near 0.5 regardless of effect size.
Apply the sqrt((r-1)/r) correction to the bootstrap SE formula, matching R's synthdid and the existing placebo formula (approximately 0.25% SE change at n_bootstrap=200).
Add a hidden _bootstrap_indices test seam on _bootstrap_se for R-parity testing; generator script under benchmarks/R/ produces the pinned-indices JSON fixture.

Methodology references (required if estimator / math changes)

Method name(s): SyntheticDiD (variance_method="bootstrap", "placebo", "jackknife").
Paper / source link(s): Arkhangelsky, Athey, Hirshberg, Imbens, & Wager (2021) "Synthetic Difference-in-Differences", AER 111(12); R synthdid reference package (vcov.R formula, update.omega = is.null(weights$omega) contract).
Any intentional deviations from the source (and why): The bootstrap still holds ω and λ fixed (ω renormalized over resampled controls) rather than re-estimating per draw as Arkhangelsky et al. Algorithm 2 step 2 specifies. This matches R's synthdid::vcov(method="bootstrap") and is now explicitly labeled as a Note in REGISTRY.md §SyntheticDiD. A paper-faithful refit option is tracked in TODO.md as a follow-up.

Validation

Tests added/updated:
- tests/test_methodology_sdid.py::TestPValueSemantics — 4 new tests (bootstrap/placebo p-value formula guards, large-effect detection at z > 6, slow-marked null calibration).
- tests/test_methodology_sdid.py::TestScaleEquivariance._BASELINE — bootstrap row updated for post-fix SE / p-value.
- tests/test_methodology_sdid.py::TestScaleEquivariance::test_detects_true_effect_at_extreme_scale — variance_method != "bootstrap" carve-out removed; assertion now runs unconditionally.
- tests/test_methodology_sdid.py::TestJackknifeSERParity::test_bootstrap_se_matches_r — R-parity test skeleton (skips without the JSON fixture).
Backtest / simulation / notebook evidence (if applicable): docs/tutorials/18_geo_experiments.ipynb re-executed so its placebo vs bootstrap comparison table and summary outputs reflect the corrected bootstrap p-values.

Security / privacy

Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

github-actions · 2026-04-22T01:56:23Z

Overall Assessment
✅ Looks good

Executive Summary

Methods affected: SyntheticDiD bootstrap SE computation and the placebo/bootstrap/jackknife p-value dispatch.
The changed inference math in diff_diff/synthetic_did.py:618 and diff_diff/synthetic_did.py:1067 is consistent with the cited methodology and the R reference package. (docs.iza.org)
The remaining fixed-weight bootstrap shortcut versus Arkhangelsky et al. Algorithm 2 is now documented/tracked in docs/methodology/REGISTRY.md:1572 and TODO.md:92, so it is mitigated rather than blocking. (docs.iza.org)
No new NaN/Inf propagation regression is introduced; inference still gates through diff_diff/utils.py:174 before any placebo-only override in diff_diff/synthetic_did.py:610.
One P3 validation issue remains: the new R-parity test at tests/test_methodology_sdid.py:884 skips without a JSON fixture that is not present under tests/data, and its skip message points to the wrong script instead of benchmarks/R/generate_sdid_bootstrap_parity_fixture.R:1.
I could not run the targeted tests here because pytest is not installed in this environment.

Methodology

Severity: P3-informational. Impact: No unmitigated methodology defect in the changed code. Algorithm A2 in the cited SDID paper recomputes the SDID estimate on each bootstrap resample, the R synthdid reference implementation computes bootstrap SE as sqrt((B-1)/B) * sd(...), and the package README forms confidence intervals from that SE. Against that reference set, the new sqrt((r-1)/r) factor and the placebo-only empirical p-value branch in diff_diff/synthetic_did.py:618 and diff_diff/synthetic_did.py:1067 look correct. Concrete fix: None. (docs.iza.org)
Severity: P3-informational. Impact: The library still differs from the paper’s full-refit bootstrap because it follows R’s fixed-weight shortcut, but this PR now makes that limitation explicit in docs/methodology/REGISTRY.md:1572 and tracks the paper-faithful follow-up in TODO.md:92. Concrete fix: None required for approval. (docs.iza.org)

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. The PR improves tracking by adding TODO.md:92, TODO.md:118, and TODO.md:119.

Security

No findings.

Documentation/Tests

Severity: P3. Impact: The new R-parity assertion is not exercised in repo state because tests/data/sdid_bootstrap_indices_r.json is absent from tests/data, and the skip guidance in tests/test_methodology_sdid.py:886 points to benchmarks/R/benchmark_synthdid.R rather than the actual fixture generator benchmarks/R/generate_sdid_bootstrap_parity_fixture.R:1. That weakens the PR’s main R-parity validation story. Concrete fix: Commit the JSON fixture or generate it in CI, and update the skip message to the new generator script.
Verification note: I could not run pytest here because pytest is unavailable.

Ran `Rscript benchmarks/R/generate_sdid_bootstrap_parity_fixture.R` locally against R 4.5.2 + synthdid 0.0.9; the generated fixture (B=200 pinned bootstrap indices plus R-computed SE=0.498347838909041 against the TestJackknifeSERParity.Y_FLAT panel) is committed at tests/data/sdid_bootstrap_indices_r.json. The R-parity test `test_bootstrap_se_matches_r` now runs (rather than skipping) and passes at the full 1e-10 tolerance shared with the jackknife and ATT parity tests, giving the PR a live R-parity validation surface. Two fixes required to get there: - The hardcoded Y_flat in the generator had only 58 of the 184 panel values; R silently recycled them, producing a garbage fixture. Full 184-value Y_flat block restored (reshape check `stopifnot(length == 184L)` added). - The skip message in `test_bootstrap_se_matches_r` pointed to `benchmarks/R/benchmark_synthdid.R` instead of the actual generator `benchmarks/R/generate_sdid_bootstrap_parity_fixture.R`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-22T10:27:32Z

/ai-review

github-actions · 2026-04-22T10:33:12Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 56c2647a2c91868fdbfb6273adedcc6da023c460

Overall Assessment

✅ Looks good

Executive Summary

No unmitigated findings. The estimator changes in diff_diff/synthetic_did.py:L610-L624 and diff_diff/synthetic_did.py:L1047-L1071 are consistent with the documented SyntheticDiD methodology.
Methods affected: SyntheticDiD bootstrap p-value dispatch and bootstrap SE. Placebo and jackknife behavior is clarified/documented; only bootstrap inference is materially changed.
The bootstrap p-value fix is methodology-correct: only placebo draws are treated as an empirical null, while bootstrap and jackknife now use safe_inference(), preserving all-or-nothing NaN inference gating.
The bootstrap SE correction matches the documented sqrt((r-1)/r) * sd(ddof=1) formula and the R-matching registry contract.
The remaining fixed-weight bootstrap vs paper-refit bootstrap gap is explicitly documented in docs/methodology/REGISTRY.md:L1497-L1545 and docs/methodology/REGISTRY.md:L1570-L1572, and tracked in TODO.md:L92, so it is mitigated rather than blocking.
Review note: I reviewed the supplied diff plus surrounding source. I did not execute the tests in this checkout.

Methodology

Severity: P3-informational. Impact: The only remaining paper/R mismatch in the changed area is the already-documented fixed-weight bootstrap shortcut. This PR explicitly documents that deviation and tracks the paper-faithful refit-bootstrap follow-up, so there is no unmitigated methodology defect. Concrete fix: None required for approval. See diff_diff/synthetic_did.py:L610-L624, diff_diff/synthetic_did.py:L1047-L1071, docs/methodology/REGISTRY.md:L1497-L1545, and docs/methodology/REGISTRY.md:L1570-L1572.
No other findings.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity: P3-informational. Impact: The remaining SDID follow-ups are properly tracked instead of left implicit, including the refit-bootstrap option and coverage study. Concrete fix: None required for approval. See TODO.md:L92 and TODO.md:L118-L119.
No other findings.

Security

No findings.

Documentation/Tests

No findings. The prior re-review concern about missing bootstrap-parity validation is addressed in the supplied diff by adding the parity assets and extending SDID tests around bootstrap p-value semantics and R parity in tests/test_methodology_sdid.py:L875-L916 and tests/test_methodology_sdid.py:L2344-L2432.
Residual testing gap: I did not run the test suite here.

Bootstrap p-values were computed with the empirical null formula `np.mean(|draws| >= |att|)` that is only valid for placebo draws (control-index permutations, centered on 0). Bootstrap draws from unit resampling are centered on the point estimate τ̂ (sampling distribution, not null), so the same formula returns ~0.5 by construction regardless of the true effect size. Narrow the empirical-p branch at synthetic_did.py:626-634 to `inference_method == "placebo"`; bootstrap and jackknife now both use the analytical normal-theory p-value from the SE via `safe_inference`, matching R's `synthdid::vcov()` convention. Also apply the `sqrt((r-1)/r)` correction to the bootstrap SE formula at synthetic_did.py:1044-1080, matching R's synthdid and the existing placebo formula (approximately 0.25% SE change at n_bootstrap=200). Add a hidden `_bootstrap_indices` kwarg on `_bootstrap_se` as a test seam for the R-parity test (pinned R-generated indices via JSON fixture, generator in benchmarks/R/generate_sdid_bootstrap_parity_fixture.R; test skips without the fixture). Tests: - New `TestPValueSemantics` class with regression guards (bootstrap/ placebo p-value formulas, large-effect detection, null calibration). - Drop the `variance_method != "bootstrap"` carve-out in `test_detects_true_effect_at_extreme_scale`; the assertion now runs unconditionally for all three methods. - Update `TestScaleEquivariance._BASELINE` bootstrap row for the post-fix SE / p-value literals. - R-parity bootstrap SE test skeleton (`TestJackknifeSERParity`). REGISTRY.md documents the p-value dispatch semantics, the SE formula, and labels the fixed-weight bootstrap as a documented shortcut relative to Arkhangelsky et al. (2021) Algorithm 2 (matching R). Tutorial 18 re-executed so stored outputs reflect the corrected bootstrap values. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Ran `Rscript benchmarks/R/generate_sdid_bootstrap_parity_fixture.R` locally against R 4.5.2 + synthdid 0.0.9; the generated fixture (B=200 pinned bootstrap indices plus R-computed SE=0.498347838909041 against the TestJackknifeSERParity.Y_FLAT panel) is committed at tests/data/sdid_bootstrap_indices_r.json. The R-parity test `test_bootstrap_se_matches_r` now runs (rather than skipping) and passes at the full 1e-10 tolerance shared with the jackknife and ATT parity tests, giving the PR a live R-parity validation surface. Two fixes required to get there: - The hardcoded Y_flat in the generator had only 58 of the 184 panel values; R silently recycled them, producing a garbage fixture. Full 184-value Y_flat block restored (reshape check `stopifnot(length == 184L)` added). - The skip message in `test_bootstrap_se_matches_r` pointed to `benchmarks/R/benchmark_synthdid.R` instead of the actual generator `benchmarks/R/generate_sdid_bootstrap_parity_fixture.R`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ity scope docs The slow `test_bootstrap_p_value_null_calibration` thresholds were hand-guessed (rejection rate in [0.02, 0.12] at nominal α=0.05). At 500 seeds the observed rejection rate is 0.184 ± 0.017 — fixed-weight bootstrap over-rejects at ~3.7x nominal because it ignores weight- estimation uncertainty (documented Algorithm 2 deviation). Bounds replaced with characterization assertions that reflect what the test actually needs to detect: - rejection rate > α = 0.05: catches the pre-fix dispatch bug where p clustered at ~0.5 on every seed (rejection rate → 0). - rejection rate < 0.5: upper sanity bound for new catastrophic miscalibration regressions. The empirical calibration finding (median ~0.38; rejection rates at α ∈ {0.01, 0.05, 0.10, 0.20} from 500 seeds) is captured in the SyntheticDiD REGISTRY.md fixed-weight calibration Note so future readers can trace the test bounds to real data. Also added a second REGISTRY.md Note on R-parity scope: RNG streams differ across languages (Python PCG64 vs R Mersenne Twister), so the `test_bootstrap_se_matches_r` parity is via pinned-index JSON fixture, isolating the deterministic math (per-draw estimator, weight renormalization, SE aggregation) from cross-language RNG drift. The test docstring is expanded to the same effect. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-22T10:58:27Z

/ai-review

github-actions · 2026-04-22T11:07:08Z

🔁 AI review rerun (requested by @igerber)

Head SHA: e603fbfe4b8e68c4b4983cd08c0927333676b9f8

Overall Assessment

⚠️ Needs changes

Executive Summary

The advertised bootstrap p-value dispatch change and the sqrt((r-1)/r) SE correction themselves look addressed.
Affected methods: SyntheticDiD with variance_method="bootstrap" for inference and SE computation; placebo/jackknife are only touched through shared dispatch logic.
Severity P1 [Newly identified]: the live pairs-bootstrap path still counts n_bootstrap attempts rather than n_bootstrap successful draws, so the implementation does not yet match Algorithm 2 / synthdid on ordinary fits.
The repo’s own baseline still expects only 191 retained bootstrap draws for n_bootstrap=200, so this is not just a contrived edge case.
The fixed-weight bootstrap vs paper-refit-bootstrap gap is explicitly documented in the registry and tracked in TODO.md, so that part is mitigated P3 rather than blocking.
I did not execute the test suite in this environment.

Methodology

Severity: P1 [Newly identified]. Impact: the PR now documents the bootstrap SE as “matching R” and adds an R-parity harness, but the production pairs-bootstrap loop still runs for b in range(self.n_bootstrap) and continues when a draw has no controls or no treated units, so ordinary fits can return fewer than the requested number of bootstrap estimates without retrying. That is a methodology mismatch with both the paper’s Algorithm 2 and the official synthdid bootstrap loop, which discard degenerate draws and resample until B valid replicates are produced. The mismatch is visible in the repo’s own baseline, which hard-codes 191 retained bootstrap draws for a n_bootstrap=200 fit. Concrete fix: in the live pairs-bootstrap path, keep drawing until self.n_bootstrap valid estimates are accumulated (with a bounded-attempt failure path), or explicitly relabel the current behavior as a Note (deviation from R) everywhere the code/docs currently say “matches R.” References: diff_diff/synthetic_did.py:867, diff_diff/synthetic_did.py:911, diff_diff/synthetic_did.py:973, diff_diff/synthetic_did.py:1067, docs/methodology/REGISTRY.md:1503, tests/test_methodology_sdid.py:2229, benchmarks/R/generate_sdid_bootstrap_parity_fixture.R:100. (arxiv.org)
Severity: P3-informational. Impact: the remaining fixed-weight bootstrap shortcut relative to the paper-faithful refit bootstrap is explicitly documented and tracked, so I am not treating it as a defect for this PR. Concrete fix: none required for approval. References: docs/methodology/REGISTRY.md:1546, docs/methodology/REGISTRY.md:1574, TODO.md:102.

Code Quality

No findings.

Performance

No findings.

Maintainability

Severity: P3-informational. Impact: the internal placebo_effects container now carries placebo, bootstrap, and jackknife draws, which is semantically misleading but already tracked as follow-up tech debt. Concrete fix: none required for approval. References: TODO.md:129.

Tech Debt

Severity: P3-informational. Impact: the PR properly tracks the remaining SDID follow-ups for paper-faithful refit bootstrap and coverage calibration, so those deferred items are visible rather than implicit. Concrete fix: none required for approval. References: TODO.md:102, TODO.md:128.

Security

No findings.

Documentation/Tests

No additional findings beyond the Methodology item. The new parity fixture/test materially improve coverage of the deterministic downstream math, but by design they do not exercise the live degenerate-draw retry contract that the production bootstrap path still lacks. References: tests/test_methodology_sdid.py:875, docs/methodology/REGISTRY.md:1547.

Path to Approval

Update the pairs-bootstrap branch in diff_diff/synthetic_did.py:911 so it retries degenerate all-control/all-treated resamples until it accumulates self.n_bootstrap valid estimates, matching Algorithm 2 / synthdid.
Add a regression test on the ordinary public bootstrap path that forces at least one degenerate resample and verifies the method still returns exactly n_bootstrap successful bootstrap draws.
If the project intentionally wants attempts-based behavior instead, remove the current “matches R” wording and add a labeled Note (deviation from R) in the registry/docstring/changelog that states Python counts attempts, not successful draws.

The pairs-bootstrap loop in `_bootstrap_se` counted *attempts*, not *valid draws*: `for b in range(self.n_bootstrap)` with `continue` on degenerate (all-control or all-treated) or non-finite resamples. Our own TestScaleEquivariance baseline recorded `n_successful = 191` for `n_bootstrap = 200`, i.e. 9 degenerate draws silently dropped and never replaced. That's a methodology deviation from both R's `synthdid::bootstrap_sample` (`while (count < replications) { ...; if (!is.na(est)) count = count + 1 }`) and Arkhangelsky et al. (2021) Algorithm 2 (B bootstrap replicates). Fix: switch pairs-bootstrap and Rao-Wu branches to `while len(bootstrap_estimates) < self.n_bootstrap`, incrementing only on a finite `τ_b` from a non-degenerate resample. Bounded attempt guard (`20 × n_bootstrap`) prevents pathological-input hangs while matching R's semantics on every realistic input. R-parity contract preserved: the fixture at `tests/data/sdid_bootstrap_indices_r.json` is generated by `benchmarks/R/generate_sdid_bootstrap_parity_fixture.R` under the same retry-to-B semantic, so all 200 rows are pre-validated and the `_bootstrap_indices` seam iterates through them deterministically. `test_bootstrap_se_matches_r` still passes at 1e-10. Updated baselines in TestScaleEquivariance._BASELINE bootstrap row (`se` 0.1585… → 0.1627…, `n` 191 → 200; `att` unchanged). `test_nonfinite_tau_filtered_in_bootstrap` updated to assert the retry contract (accumulates exactly B finite draws; estimator called more than B times). REGISTRY.md edge-case Note describes the retry loop and the attempt-budget guard. CHANGELOG has a third bullet under the SDID bootstrap fixes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-22T11:27:04Z

/ai-review

github-actions · 2026-04-22T11:34:40Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 17de06865bad0e3bc36cb693ddeb603897d1e03a

Overall Assessment

✅ Looks good

Executive Summary

The prior P1 is resolved: the live bootstrap loop in diff_diff/synthetic_did.py:L911 now retries degenerate all-control/all-treated and non-finite draws until it accumulates n_bootstrap valid replicates, which matches both Algorithm 2’s discard-and-resample contract and synthdid’s while(count < replications) bootstrap loop. (raw.githubusercontent.com)
The bootstrap SE correction in diff_diff/synthetic_did.py:L1090 is methodologically correct: sqrt((r-1)/r) * sd(...) matches the paper’s variance definition and the R package’s implementation. The new parity test at tests/test_methodology_sdid.py:L875 is scoped appropriately to deterministic downstream math. (raw.githubusercontent.com)
The p-value dispatch change in diff_diff/synthetic_did.py:L610 is the right fix: bootstrap and jackknife now use safe_inference(...), while placebo keeps the documented empirical-null path in docs/methodology/REGISTRY.md:L1545. That is consistent with variance-based large-sample inference rather than treating bootstrap sampling draws as null draws. (researchgate.net)
The remaining fixed-weight bootstrap shortcut versus paper-faithful refitting is explicitly disclosed in docs/methodology/REGISTRY.md:L1546 and tracked in TODO.md:L102, so it is mitigated P3 rather than a blocker. The R package uses the same fixed-weight shortcut when non-null weights are passed to synthdid_estimate(). (raw.githubusercontent.com)
I did not identify any new unmitigated P1+ findings in the changed files.
I could not execute the test suite here because this environment lacks pytest and numpy, so this is a static re-review.

Methodology

Severity: P3-informational (mitigated). Impact: the only substantive methodology gap still present is the already-documented fixed-weight bootstrap shortcut relative to paper-faithful re-estimation; this PR correctly fixes the retry contract, SE scaling, and bootstrap/jackknife inference dispatch for the method it claims to implement. Concrete fix: none required for approval; the follow-up is already tracked in TODO.md:L102. (raw.githubusercontent.com)

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity: P3-informational (tracked). Impact: the PR properly records the remaining SDID follow-ups, including the paper-faithful refit bootstrap and the bootstrap coverage study, in TODO.md:L102 and TODO.md:L128. Concrete fix: none required for approval.

Security

No findings.

Documentation/Tests

No findings in the PR itself. The updated registry and tests cover the corrected bootstrap SE formula, retry-to-n_bootstrap semantics, and the R-parity scope in docs/methodology/REGISTRY.md:L1528, docs/methodology/REGISTRY.md:L1547, tests/test_methodology_sdid.py:L998, and tests/test_methodology_sdid.py:L2380. I could not run them locally in this environment.

… deviation Tracing R's source (vcov.R::bootstrap_sample and synthdid.R) shows that R's default synthdid::vcov(method="bootstrap") rebinds attr(estimate, "opts") — which includes update.omega=TRUE from the original fit — back into synthdid_estimate inside its do.call, so the renormalized ω is used only as Frank-Wolfe initialization and ω and λ are re-estimated per draw. R's default bootstrap is refit, not fixed- weight. The sum_normalize helper in R's source explicitly comments that the supplied weights "are used only for initialization" in bootstrap and placebo SEs. Our variance_method="bootstrap" holds the renormalized ω exactly (no FW re-run). It is therefore a deliberate deviation from R's default. Our PR #349 fixture generator at benchmarks/R/... is a manual fixed-weight invocation — it omits the opts rebind, which defaults update.omega to FALSE given non-null weights. The 1e-10 parity test anchors our fixed-weight path to that manual R invocation, not to R's real vcov behavior. Documentation-only fix across all claim sites; no methodology or code behavior changes: - REGISTRY.md §SyntheticDiD: label the fixed-weight bootstrap as "Alternative: Bootstrap at unit level — fixed-weight shortcut"; add explicit **Note (deviation from R)** citing the vcov.R / synthdid.R opts-rebind mechanism; call out bootstrap_refit as matching R's default vcov. Requirements checklist entries and R-parity test scope Note rewritten to match. - diff_diff/synthetic_did.py: __init__ docstring and _bootstrap_se method docstring drop the "matching R" framing on the fixed-weight path; bootstrap_refit is flagged as matching R's default. - diff_diff/results.py: SyntheticDiDResults.variance_method field doc fixed (I introduced the "R-compatible fixed-weight shortcut" misphrasing in round 1; it was wrong). - CHANGELOG.md Unreleased/Added: Bundle A entry clarifies that bootstrap_refit matches R's default and the existing fixed-weight bootstrap is now explicitly documented as a deviation. - benchmarks/R/generate_sdid_bootstrap_parity_fixture.R: loop comment calls out the non-default invocation shape (no opts rebind → runs fixed-weight); references the Python test that consumes this fixture. - tests/test_methodology_sdid.py::test_bootstrap_se_matches_r docstring: rewritten to scope the parity check correctly (manual R fixed-weight, not R's default vcov). - TODO.md: add a new row for the refit cross-language parity anchor (Julia Synthdid.jl or R via the real vcov path) to make the missing anchor explicit. All 57 targeted tests pass; no methodology change, no numerical output change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

variance_method="bootstrap" now means refit (Arkhangelsky et al. 2021 Algorithm 2 step 2; also R's default synthdid::vcov(method="bootstrap") behavior, which rebinds attr(estimate, "opts") with update.omega=TRUE so the renormalized ω serves only as Frank-Wolfe initialization). The previously-shipped fixed-weight shortcut is removed entirely; the "bootstrap_refit" enum value briefly added in earlier commits of this PR is folded back into "bootstrap". Why this is a correctness fix, not just a relabel: the old fixed-weight "bootstrap" matched neither the paper (which prescribes refit) nor R's default vcov (also refit). The 1e-10 R-parity test from PR #349 anchored fixed-weight Python against a manual R invocation that omitted the opts rebind — both sides were wrong in the same direction. Coverage MC at benchmarks/data/sdid_coverage.json (500 seeds × B=200) confirms the new "bootstrap" tracks placebo near-nominal across the three representative DGPs; the old fixed-weight column over-rejected at α=0.05 at rates 0.16 / 0.098 / 0.092 (1.8-3.2× nominal). Capability regression: SDID + survey designs (pweight-only AND strata/PSU/FPC) now raises NotImplementedError. The removed fixed-weight bootstrap was the only SDID variance method that supported strata/PSU/FPC (via the Rao-Wu rescaled bootstrap branch inside _bootstrap_se). Pweight-only users can switch to variance_method="placebo" or "jackknife"; strata/PSU/FPC users have no SDID variance option on this release. Rao-Wu rescaled weights composed with paper-faithful Frank-Wolfe re-estimation needs a weighted-FW derivation; sketch and reusable scaffolding pointers live in REGISTRY.md §SyntheticDiD's "Note (deferred survey + bootstrap composition)" and TODO.md. The deleted Rao-Wu code (≈48 lines of _bootstrap_se) is recoverable via `git show <THIS_COMMIT>^:diff_diff/synthetic_did.py` near the pre-rewrite _bootstrap_se body. Cross-surface allow-list reverts: the additive "bootstrap_refit" enum shipped in earlier commits of this PR rippled through results.py:960 summary gating, business_report.py:602 inference-label allow-list, power.py SDID guidance strings, llms-full.txt enums, and SyntheticDiDResults field docstrings. All of those are now back to a 3-value surface ("bootstrap", "jackknife", "placebo"). Tests: - TestBootstrapRefitSE class deleted; 4 unique tests folded into TestBootstrapSE (tracks-placebo-exchangeable, raises-pweight-survey, raises-full-design-survey, summary-shows-replications). - test_bootstrap_se_matches_r deleted along with its fixture (tests/data/sdid_bootstrap_indices_r.json) and generator (benchmarks/R/generate_sdid_bootstrap_parity_fixture.R) — they anchored the now-removed fixed-weight path. - TestPValueSemantics::test_refit_p_value_matches_analytical deleted as duplicate of test_bootstrap_p_value_matches_analytical. - TestScaleEquivariance._BASELINE: "bootstrap" row updated to the refit values (4.6033, 0.21424970..., 2.10890881e-102, 200) — bit- identical to the captured "bootstrap_refit" baseline since the new bootstrap path is the same code as the old refit path. Tolerance tightened from rel=1e-8 to rel=1e-14 to enforce bit-identity. - TestGetSetParams: variance_method literals rebound to "bootstrap"; test_set_params_accepts_bootstrap_refit deleted (redundant with constructor tests). - TestCoverageMCArtifact: expected methods list set exact-equal to ("placebo", "bootstrap", "jackknife"). - test_business_report.py inference-label test class + method renamed to drop "refit" suffix; assertion checks for "bootstrap variance". The benchmarks/data/sdid_coverage.json artifact is updated transitionally in this commit (fixed-weight column dropped; refit column renamed to bootstrap) so the schema test stays green; a follow-up commit regenerates from a fresh 500-seed MC re-run with the new code path. The REGISTRY coverage table cells are TBD pending that re-run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber and others added 3 commits April 22, 2026 06:51

igerber force-pushed the sdid-bootstrap branch from 56c2647 to e603fbf Compare April 22, 2026 10:58

igerber added the ready-for-ci Triggers CI test workflows label Apr 22, 2026

igerber merged commit aa63ed4 into main Apr 22, 2026
23 of 24 checks passed

igerber deleted the sdid-bootstrap branch April 22, 2026 13:15

igerber mentioned this pull request Apr 22, 2026

Add SyntheticDiD variance_method='bootstrap_refit' and coverage MC study #351

Merged

igerber mentioned this pull request Apr 24, 2026

Fix CHANGELOG: rename survey-bootstrap PR placeholder #352 to #355 #361

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix SyntheticDiD bootstrap p-value dispatch and SE formula#349

Fix SyntheticDiD bootstrap p-value dispatch and SE formula#349
igerber merged 4 commits intomainfrom
sdid-bootstrap

igerber commented Apr 22, 2026

Uh oh!

github-actions Bot commented Apr 22, 2026

Uh oh!

igerber commented Apr 22, 2026

Uh oh!

github-actions Bot commented Apr 22, 2026

Uh oh!

igerber commented Apr 22, 2026

Uh oh!

github-actions Bot commented Apr 22, 2026

Uh oh!

igerber commented Apr 22, 2026

Uh oh!

github-actions Bot commented Apr 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

igerber commented Apr 22, 2026

Summary

Methodology references (required if estimator / math changes)

Validation

Security / privacy

Uh oh!

github-actions Bot commented Apr 22, 2026

Uh oh!

igerber commented Apr 22, 2026

Uh oh!

github-actions Bot commented Apr 22, 2026

Uh oh!

igerber commented Apr 22, 2026

Uh oh!

github-actions Bot commented Apr 22, 2026

Uh oh!

igerber commented Apr 22, 2026

Uh oh!

github-actions Bot commented Apr 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant