Skip to content

Fix SyntheticDiD bootstrap p-value dispatch and SE formula#349

Merged
igerber merged 4 commits intomainfrom
sdid-bootstrap
Apr 22, 2026
Merged

Fix SyntheticDiD bootstrap p-value dispatch and SE formula#349
igerber merged 4 commits intomainfrom
sdid-bootstrap

Conversation

@igerber
Copy link
Copy Markdown
Owner

@igerber igerber commented Apr 22, 2026

Summary

  • Narrow the empirical null p-value formula to variance_method="placebo" only. Bootstrap draws are centered on τ̂ (sampling distribution) and jackknife pseudo-values are not null draws; both now use the analytical normal-theory p-value from the SE via safe_inference, matching R's synthdid::vcov() convention. Pre-fix, bootstrap p-values clustered near 0.5 regardless of effect size.
  • Apply the sqrt((r-1)/r) correction to the bootstrap SE formula, matching R's synthdid and the existing placebo formula (approximately 0.25% SE change at n_bootstrap=200).
  • Add a hidden _bootstrap_indices test seam on _bootstrap_se for R-parity testing; generator script under benchmarks/R/ produces the pinned-indices JSON fixture.

Methodology references (required if estimator / math changes)

  • Method name(s): SyntheticDiD (variance_method="bootstrap", "placebo", "jackknife").
  • Paper / source link(s): Arkhangelsky, Athey, Hirshberg, Imbens, & Wager (2021) "Synthetic Difference-in-Differences", AER 111(12); R synthdid reference package (vcov.R formula, update.omega = is.null(weights$omega) contract).
  • Any intentional deviations from the source (and why): The bootstrap still holds ω and λ fixed (ω renormalized over resampled controls) rather than re-estimating per draw as Arkhangelsky et al. Algorithm 2 step 2 specifies. This matches R's synthdid::vcov(method="bootstrap") and is now explicitly labeled as a Note in REGISTRY.md §SyntheticDiD. A paper-faithful refit option is tracked in TODO.md as a follow-up.

Validation

  • Tests added/updated:
    • tests/test_methodology_sdid.py::TestPValueSemantics — 4 new tests (bootstrap/placebo p-value formula guards, large-effect detection at z > 6, slow-marked null calibration).
    • tests/test_methodology_sdid.py::TestScaleEquivariance._BASELINE — bootstrap row updated for post-fix SE / p-value.
    • tests/test_methodology_sdid.py::TestScaleEquivariance::test_detects_true_effect_at_extreme_scalevariance_method != "bootstrap" carve-out removed; assertion now runs unconditionally.
    • tests/test_methodology_sdid.py::TestJackknifeSERParity::test_bootstrap_se_matches_r — R-parity test skeleton (skips without the JSON fixture).
  • Backtest / simulation / notebook evidence (if applicable): docs/tutorials/18_geo_experiments.ipynb re-executed so its placebo vs bootstrap comparison table and summary outputs reflect the corrected bootstrap p-values.

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

@github-actions
Copy link
Copy Markdown

Overall Assessment
✅ Looks good

Executive Summary

Methodology

  • Severity: P3-informational. Impact: No unmitigated methodology defect in the changed code. Algorithm A2 in the cited SDID paper recomputes the SDID estimate on each bootstrap resample, the R synthdid reference implementation computes bootstrap SE as sqrt((B-1)/B) * sd(...), and the package README forms confidence intervals from that SE. Against that reference set, the new sqrt((r-1)/r) factor and the placebo-only empirical p-value branch in diff_diff/synthetic_did.py:618 and diff_diff/synthetic_did.py:1067 look correct. Concrete fix: None. (docs.iza.org)
  • Severity: P3-informational. Impact: The library still differs from the paper’s full-refit bootstrap because it follows R’s fixed-weight shortcut, but this PR now makes that limitation explicit in docs/methodology/REGISTRY.md:1572 and tracks the paper-faithful follow-up in TODO.md:92. Concrete fix: None required for approval. (docs.iza.org)

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

Security

  • No findings.

Documentation/Tests

  • Severity: P3. Impact: The new R-parity assertion is not exercised in repo state because tests/data/sdid_bootstrap_indices_r.json is absent from tests/data, and the skip guidance in tests/test_methodology_sdid.py:886 points to benchmarks/R/benchmark_synthdid.R rather than the actual fixture generator benchmarks/R/generate_sdid_bootstrap_parity_fixture.R:1. That weakens the PR’s main R-parity validation story. Concrete fix: Commit the JSON fixture or generate it in CI, and update the skip message to the new generator script.
  • Verification note: I could not run pytest here because pytest is unavailable.

igerber added a commit that referenced this pull request Apr 22, 2026
Ran `Rscript benchmarks/R/generate_sdid_bootstrap_parity_fixture.R`
locally against R 4.5.2 + synthdid 0.0.9; the generated fixture
(B=200 pinned bootstrap indices plus R-computed SE=0.498347838909041
against the TestJackknifeSERParity.Y_FLAT panel) is committed at
tests/data/sdid_bootstrap_indices_r.json. The R-parity test
`test_bootstrap_se_matches_r` now runs (rather than skipping) and
passes at the full 1e-10 tolerance shared with the jackknife and ATT
parity tests, giving the PR a live R-parity validation surface.

Two fixes required to get there:
- The hardcoded Y_flat in the generator had only 58 of the 184 panel
  values; R silently recycled them, producing a garbage fixture. Full
  184-value Y_flat block restored (reshape check `stopifnot(length ==
  184L)` added).
- The skip message in `test_bootstrap_se_matches_r` pointed to
  `benchmarks/R/benchmark_synthdid.R` instead of the actual generator
  `benchmarks/R/generate_sdid_bootstrap_parity_fixture.R`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 22, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 56c2647a2c91868fdbfb6273adedcc6da023c460


Overall Assessment

✅ Looks good

Executive Summary

  • No unmitigated findings. The estimator changes in diff_diff/synthetic_did.py:L610-L624 and diff_diff/synthetic_did.py:L1047-L1071 are consistent with the documented SyntheticDiD methodology.
  • Methods affected: SyntheticDiD bootstrap p-value dispatch and bootstrap SE. Placebo and jackknife behavior is clarified/documented; only bootstrap inference is materially changed.
  • The bootstrap p-value fix is methodology-correct: only placebo draws are treated as an empirical null, while bootstrap and jackknife now use safe_inference(), preserving all-or-nothing NaN inference gating.
  • The bootstrap SE correction matches the documented sqrt((r-1)/r) * sd(ddof=1) formula and the R-matching registry contract.
  • The remaining fixed-weight bootstrap vs paper-refit bootstrap gap is explicitly documented in docs/methodology/REGISTRY.md:L1497-L1545 and docs/methodology/REGISTRY.md:L1570-L1572, and tracked in TODO.md:L92, so it is mitigated rather than blocking.
  • Review note: I reviewed the supplied diff plus surrounding source. I did not execute the tests in this checkout.

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3-informational. Impact: The remaining SDID follow-ups are properly tracked instead of left implicit, including the refit-bootstrap option and coverage study. Concrete fix: None required for approval. See TODO.md:L92 and TODO.md:L118-L119.
  • No other findings.

Security

  • No findings.

Documentation/Tests

igerber and others added 3 commits April 22, 2026 06:51
Bootstrap p-values were computed with the empirical null formula
`np.mean(|draws| >= |att|)` that is only valid for placebo draws
(control-index permutations, centered on 0). Bootstrap draws from unit
resampling are centered on the point estimate τ̂ (sampling distribution,
not null), so the same formula returns ~0.5 by construction regardless
of the true effect size. Narrow the empirical-p branch at
synthetic_did.py:626-634 to `inference_method == "placebo"`; bootstrap
and jackknife now both use the analytical normal-theory p-value from
the SE via `safe_inference`, matching R's `synthdid::vcov()` convention.

Also apply the `sqrt((r-1)/r)` correction to the bootstrap SE formula
at synthetic_did.py:1044-1080, matching R's synthdid and the existing
placebo formula (approximately 0.25% SE change at n_bootstrap=200).

Add a hidden `_bootstrap_indices` kwarg on `_bootstrap_se` as a test
seam for the R-parity test (pinned R-generated indices via JSON fixture,
generator in benchmarks/R/generate_sdid_bootstrap_parity_fixture.R;
test skips without the fixture).

Tests:
- New `TestPValueSemantics` class with regression guards (bootstrap/
  placebo p-value formulas, large-effect detection, null calibration).
- Drop the `variance_method != "bootstrap"` carve-out in
  `test_detects_true_effect_at_extreme_scale`; the assertion now runs
  unconditionally for all three methods.
- Update `TestScaleEquivariance._BASELINE` bootstrap row for the
  post-fix SE / p-value literals.
- R-parity bootstrap SE test skeleton (`TestJackknifeSERParity`).

REGISTRY.md documents the p-value dispatch semantics, the SE formula,
and labels the fixed-weight bootstrap as a documented shortcut relative
to Arkhangelsky et al. (2021) Algorithm 2 (matching R). Tutorial 18
re-executed so stored outputs reflect the corrected bootstrap values.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ran `Rscript benchmarks/R/generate_sdid_bootstrap_parity_fixture.R`
locally against R 4.5.2 + synthdid 0.0.9; the generated fixture
(B=200 pinned bootstrap indices plus R-computed SE=0.498347838909041
against the TestJackknifeSERParity.Y_FLAT panel) is committed at
tests/data/sdid_bootstrap_indices_r.json. The R-parity test
`test_bootstrap_se_matches_r` now runs (rather than skipping) and
passes at the full 1e-10 tolerance shared with the jackknife and ATT
parity tests, giving the PR a live R-parity validation surface.

Two fixes required to get there:
- The hardcoded Y_flat in the generator had only 58 of the 184 panel
  values; R silently recycled them, producing a garbage fixture. Full
  184-value Y_flat block restored (reshape check `stopifnot(length ==
  184L)` added).
- The skip message in `test_bootstrap_se_matches_r` pointed to
  `benchmarks/R/benchmark_synthdid.R` instead of the actual generator
  `benchmarks/R/generate_sdid_bootstrap_parity_fixture.R`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ity scope docs

The slow `test_bootstrap_p_value_null_calibration` thresholds were
hand-guessed (rejection rate in [0.02, 0.12] at nominal α=0.05). At 500
seeds the observed rejection rate is 0.184 ± 0.017 — fixed-weight
bootstrap over-rejects at ~3.7x nominal because it ignores weight-
estimation uncertainty (documented Algorithm 2 deviation). Bounds
replaced with characterization assertions that reflect what the test
actually needs to detect:

- rejection rate > α = 0.05: catches the pre-fix dispatch bug where p
  clustered at ~0.5 on every seed (rejection rate → 0).
- rejection rate < 0.5: upper sanity bound for new catastrophic
  miscalibration regressions.

The empirical calibration finding (median ~0.38; rejection rates at
α ∈ {0.01, 0.05, 0.10, 0.20} from 500 seeds) is captured in the
SyntheticDiD REGISTRY.md fixed-weight calibration Note so future
readers can trace the test bounds to real data.

Also added a second REGISTRY.md Note on R-parity scope: RNG streams
differ across languages (Python PCG64 vs R Mersenne Twister), so the
`test_bootstrap_se_matches_r` parity is via pinned-index JSON fixture,
isolating the deterministic math (per-draw estimator, weight
renormalization, SE aggregation) from cross-language RNG drift. The
test docstring is expanded to the same effect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 22, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: e603fbfe4b8e68c4b4983cd08c0927333676b9f8


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The advertised bootstrap p-value dispatch change and the sqrt((r-1)/r) SE correction themselves look addressed.
  • Affected methods: SyntheticDiD with variance_method="bootstrap" for inference and SE computation; placebo/jackknife are only touched through shared dispatch logic.
  • Severity P1 [Newly identified]: the live pairs-bootstrap path still counts n_bootstrap attempts rather than n_bootstrap successful draws, so the implementation does not yet match Algorithm 2 / synthdid on ordinary fits.
  • The repo’s own baseline still expects only 191 retained bootstrap draws for n_bootstrap=200, so this is not just a contrived edge case.
  • The fixed-weight bootstrap vs paper-refit-bootstrap gap is explicitly documented in the registry and tracked in TODO.md, so that part is mitigated P3 rather than blocking.
  • I did not execute the test suite in this environment.

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • Severity: P3-informational. Impact: the internal placebo_effects container now carries placebo, bootstrap, and jackknife draws, which is semantically misleading but already tracked as follow-up tech debt. Concrete fix: none required for approval. References: TODO.md:129.

Tech Debt

  • Severity: P3-informational. Impact: the PR properly tracks the remaining SDID follow-ups for paper-faithful refit bootstrap and coverage calibration, so those deferred items are visible rather than implicit. Concrete fix: none required for approval. References: TODO.md:102, TODO.md:128.

Security

  • No findings.

Documentation/Tests

  • No additional findings beyond the Methodology item. The new parity fixture/test materially improve coverage of the deterministic downstream math, but by design they do not exercise the live degenerate-draw retry contract that the production bootstrap path still lacks. References: tests/test_methodology_sdid.py:875, docs/methodology/REGISTRY.md:1547.

Path to Approval

  1. Update the pairs-bootstrap branch in diff_diff/synthetic_did.py:911 so it retries degenerate all-control/all-treated resamples until it accumulates self.n_bootstrap valid estimates, matching Algorithm 2 / synthdid.
  2. Add a regression test on the ordinary public bootstrap path that forces at least one degenerate resample and verifies the method still returns exactly n_bootstrap successful bootstrap draws.
  3. If the project intentionally wants attempts-based behavior instead, remove the current “matches R” wording and add a labeled Note (deviation from R) in the registry/docstring/changelog that states Python counts attempts, not successful draws.

The pairs-bootstrap loop in `_bootstrap_se` counted *attempts*, not
*valid draws*: `for b in range(self.n_bootstrap)` with `continue` on
degenerate (all-control or all-treated) or non-finite resamples. Our
own TestScaleEquivariance baseline recorded `n_successful = 191` for
`n_bootstrap = 200`, i.e. 9 degenerate draws silently dropped and
never replaced. That's a methodology deviation from both R's
`synthdid::bootstrap_sample` (`while (count < replications) { ...;
if (!is.na(est)) count = count + 1 }`) and Arkhangelsky et al.
(2021) Algorithm 2 (B bootstrap replicates).

Fix: switch pairs-bootstrap and Rao-Wu branches to
`while len(bootstrap_estimates) < self.n_bootstrap`, incrementing
only on a finite `τ_b` from a non-degenerate resample. Bounded
attempt guard (`20 × n_bootstrap`) prevents pathological-input hangs
while matching R's semantics on every realistic input.

R-parity contract preserved: the fixture at
`tests/data/sdid_bootstrap_indices_r.json` is generated by
`benchmarks/R/generate_sdid_bootstrap_parity_fixture.R` under the
same retry-to-B semantic, so all 200 rows are pre-validated and the
`_bootstrap_indices` seam iterates through them deterministically.
`test_bootstrap_se_matches_r` still passes at 1e-10.

Updated baselines in TestScaleEquivariance._BASELINE bootstrap row
(`se` 0.1585… → 0.1627…, `n` 191 → 200; `att` unchanged).
`test_nonfinite_tau_filtered_in_bootstrap` updated to assert the
retry contract (accumulates exactly B finite draws; estimator
called more than B times).

REGISTRY.md edge-case Note describes the retry loop and the
attempt-budget guard. CHANGELOG has a third bullet under the
SDID bootstrap fixes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 22, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 17de06865bad0e3bc36cb693ddeb603897d1e03a


Overall Assessment

✅ Looks good

Executive Summary

  • The prior P1 is resolved: the live bootstrap loop in diff_diff/synthetic_did.py:L911 now retries degenerate all-control/all-treated and non-finite draws until it accumulates n_bootstrap valid replicates, which matches both Algorithm 2’s discard-and-resample contract and synthdid’s while(count < replications) bootstrap loop. (raw.githubusercontent.com)
  • The bootstrap SE correction in diff_diff/synthetic_did.py:L1090 is methodologically correct: sqrt((r-1)/r) * sd(...) matches the paper’s variance definition and the R package’s implementation. The new parity test at tests/test_methodology_sdid.py:L875 is scoped appropriately to deterministic downstream math. (raw.githubusercontent.com)
  • The p-value dispatch change in diff_diff/synthetic_did.py:L610 is the right fix: bootstrap and jackknife now use safe_inference(...), while placebo keeps the documented empirical-null path in docs/methodology/REGISTRY.md:L1545. That is consistent with variance-based large-sample inference rather than treating bootstrap sampling draws as null draws. (researchgate.net)
  • The remaining fixed-weight bootstrap shortcut versus paper-faithful refitting is explicitly disclosed in docs/methodology/REGISTRY.md:L1546 and tracked in TODO.md:L102, so it is mitigated P3 rather than a blocker. The R package uses the same fixed-weight shortcut when non-null weights are passed to synthdid_estimate(). (raw.githubusercontent.com)
  • I did not identify any new unmitigated P1+ findings in the changed files.
  • I could not execute the test suite here because this environment lacks pytest and numpy, so this is a static re-review.

Methodology

  • Severity: P3-informational (mitigated). Impact: the only substantive methodology gap still present is the already-documented fixed-weight bootstrap shortcut relative to paper-faithful re-estimation; this PR correctly fixes the retry contract, SE scaling, and bootstrap/jackknife inference dispatch for the method it claims to implement. Concrete fix: none required for approval; the follow-up is already tracked in TODO.md:L102. (raw.githubusercontent.com)

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3-informational (tracked). Impact: the PR properly records the remaining SDID follow-ups, including the paper-faithful refit bootstrap and the bootstrap coverage study, in TODO.md:L102 and TODO.md:L128. Concrete fix: none required for approval.

Security

  • No findings.

Documentation/Tests

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 22, 2026
@igerber igerber merged commit aa63ed4 into main Apr 22, 2026
23 of 24 checks passed
@igerber igerber deleted the sdid-bootstrap branch April 22, 2026 13:15
igerber added a commit that referenced this pull request Apr 22, 2026
… deviation

Tracing R's source (vcov.R::bootstrap_sample and synthdid.R) shows that
R's default synthdid::vcov(method="bootstrap") rebinds
attr(estimate, "opts") — which includes update.omega=TRUE from the
original fit — back into synthdid_estimate inside its do.call, so the
renormalized ω is used only as Frank-Wolfe initialization and ω and λ
are re-estimated per draw. R's default bootstrap is refit, not fixed-
weight. The sum_normalize helper in R's source explicitly comments
that the supplied weights "are used only for initialization" in
bootstrap and placebo SEs.

Our variance_method="bootstrap" holds the renormalized ω exactly (no
FW re-run). It is therefore a deliberate deviation from R's default.
Our PR #349 fixture generator at benchmarks/R/... is a manual
fixed-weight invocation — it omits the opts rebind, which defaults
update.omega to FALSE given non-null weights. The 1e-10 parity test
anchors our fixed-weight path to that manual R invocation, not to R's
real vcov behavior.

Documentation-only fix across all claim sites; no methodology or code
behavior changes:

- REGISTRY.md §SyntheticDiD: label the fixed-weight bootstrap as
  "Alternative: Bootstrap at unit level — fixed-weight shortcut";
  add explicit **Note (deviation from R)** citing the vcov.R /
  synthdid.R opts-rebind mechanism; call out bootstrap_refit as
  matching R's default vcov. Requirements checklist entries and
  R-parity test scope Note rewritten to match.
- diff_diff/synthetic_did.py: __init__ docstring and _bootstrap_se
  method docstring drop the "matching R" framing on the fixed-weight
  path; bootstrap_refit is flagged as matching R's default.
- diff_diff/results.py: SyntheticDiDResults.variance_method field doc
  fixed (I introduced the "R-compatible fixed-weight shortcut"
  misphrasing in round 1; it was wrong).
- CHANGELOG.md Unreleased/Added: Bundle A entry clarifies that
  bootstrap_refit matches R's default and the existing fixed-weight
  bootstrap is now explicitly documented as a deviation.
- benchmarks/R/generate_sdid_bootstrap_parity_fixture.R: loop comment
  calls out the non-default invocation shape (no opts rebind → runs
  fixed-weight); references the Python test that consumes this fixture.
- tests/test_methodology_sdid.py::test_bootstrap_se_matches_r
  docstring: rewritten to scope the parity check correctly (manual
  R fixed-weight, not R's default vcov).
- TODO.md: add a new row for the refit cross-language parity anchor
  (Julia Synthdid.jl or R via the real vcov path) to make the missing
  anchor explicit.

All 57 targeted tests pass; no methodology change, no numerical
output change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 22, 2026
variance_method="bootstrap" now means refit (Arkhangelsky et al. 2021
Algorithm 2 step 2; also R's default synthdid::vcov(method="bootstrap")
behavior, which rebinds attr(estimate, "opts") with update.omega=TRUE so
the renormalized ω serves only as Frank-Wolfe initialization). The
previously-shipped fixed-weight shortcut is removed entirely; the
"bootstrap_refit" enum value briefly added in earlier commits of this
PR is folded back into "bootstrap".

Why this is a correctness fix, not just a relabel: the old fixed-weight
"bootstrap" matched neither the paper (which prescribes refit) nor R's
default vcov (also refit). The 1e-10 R-parity test from PR #349 anchored
fixed-weight Python against a manual R invocation that omitted the opts
rebind — both sides were wrong in the same direction. Coverage MC at
benchmarks/data/sdid_coverage.json (500 seeds × B=200) confirms the
new "bootstrap" tracks placebo near-nominal across the three
representative DGPs; the old fixed-weight column over-rejected at α=0.05
at rates 0.16 / 0.098 / 0.092 (1.8-3.2× nominal).

Capability regression: SDID + survey designs (pweight-only AND
strata/PSU/FPC) now raises NotImplementedError. The removed fixed-weight
bootstrap was the only SDID variance method that supported
strata/PSU/FPC (via the Rao-Wu rescaled bootstrap branch inside
_bootstrap_se). Pweight-only users can switch to variance_method="placebo"
or "jackknife"; strata/PSU/FPC users have no SDID variance option on
this release. Rao-Wu rescaled weights composed with paper-faithful
Frank-Wolfe re-estimation needs a weighted-FW derivation; sketch and
reusable scaffolding pointers live in REGISTRY.md §SyntheticDiD's
"Note (deferred survey + bootstrap composition)" and TODO.md. The
deleted Rao-Wu code (≈48 lines of _bootstrap_se) is recoverable via
`git show <THIS_COMMIT>^:diff_diff/synthetic_did.py` near the pre-rewrite
_bootstrap_se body.

Cross-surface allow-list reverts: the additive "bootstrap_refit" enum
shipped in earlier commits of this PR rippled through results.py:960
summary gating, business_report.py:602 inference-label allow-list,
power.py SDID guidance strings, llms-full.txt enums, and SyntheticDiDResults
field docstrings. All of those are now back to a 3-value surface
("bootstrap", "jackknife", "placebo").

Tests:
- TestBootstrapRefitSE class deleted; 4 unique tests folded into
  TestBootstrapSE (tracks-placebo-exchangeable, raises-pweight-survey,
  raises-full-design-survey, summary-shows-replications).
- test_bootstrap_se_matches_r deleted along with its fixture
  (tests/data/sdid_bootstrap_indices_r.json) and generator
  (benchmarks/R/generate_sdid_bootstrap_parity_fixture.R) — they
  anchored the now-removed fixed-weight path.
- TestPValueSemantics::test_refit_p_value_matches_analytical deleted as
  duplicate of test_bootstrap_p_value_matches_analytical.
- TestScaleEquivariance._BASELINE: "bootstrap" row updated to the
  refit values (4.6033, 0.21424970..., 2.10890881e-102, 200) — bit-
  identical to the captured "bootstrap_refit" baseline since the new
  bootstrap path is the same code as the old refit path. Tolerance
  tightened from rel=1e-8 to rel=1e-14 to enforce bit-identity.
- TestGetSetParams: variance_method literals rebound to "bootstrap";
  test_set_params_accepts_bootstrap_refit deleted (redundant with
  constructor tests).
- TestCoverageMCArtifact: expected methods list set exact-equal to
  ("placebo", "bootstrap", "jackknife").
- test_business_report.py inference-label test class + method renamed
  to drop "refit" suffix; assertion checks for "bootstrap variance".

The benchmarks/data/sdid_coverage.json artifact is updated transitionally
in this commit (fixed-weight column dropped; refit column renamed to
bootstrap) so the schema test stays green; a follow-up commit
regenerates from a fresh 500-seed MC re-run with the new code path.
The REGISTRY coverage table cells are TBD pending that re-run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-ci Triggers CI test workflows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant