
Restore SDID survey-bootstrap via weighted Frank-Wolfe + Rao-Wu composition #355

Merged
igerber merged 16 commits into main from sdid-bootstrap-survey
Apr 24, 2026
Conversation

@igerber
Owner

@igerber igerber commented Apr 24, 2026

Summary

  • Follow-up to PR #351 (Add SyntheticDiD variance_method='bootstrap_refit' and coverage MC study), restoring the SDID survey-bootstrap capability that PR #351 front-door rejected as a known regression.
  • Adds a weighted Frank-Wolfe Rust kernel (sc_weight_fw_weighted + _with_convergence sibling) accepting per-coordinate reg_weights, plus a matching pure-Python fallback. The new Python helpers compute_sdid_unit_weights_survey / compute_time_weights_survey thread the weighted objective through the two-pass sparsify-refit dispatcher.
  • SyntheticDiD._bootstrap_se reintroduces the Rao-Wu branch: per-draw rw = w_control[boot_idx] for pweight-only or rw = generate_rao_wu_weights(resolved_survey_unit, rng) for full design, composed with the weighted-FW helpers; post-FW ω_eff = rw·ω / Σ(rw·ω) feeds compute_sdid_estimator (sketched below).
  • Removes the PR #351 bootstrap × any-survey and unconditional strata/PSU/FPC × any-method guards; replaces them with a method-gated version that still rejects strata/PSU/FPC for placebo / jackknife (separate methodology gap tracked in TODO.md).
  • Coverage MC extended with a 4th stratified_survey DGP; regenerated benchmarks/data/sdid_coverage.json — the new row's α=0.05 rejection is 0.042 (inside the [0.02, 0.10] calibration gate), mean SE / true SD = 1.25 (slightly conservative, the safer direction).
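
For orientation, a minimal runnable sketch of the per-draw composition described above. The helper names come from this PR, but the signatures and data here are illustrative placeholders, not the library's exact internals:

```python
import numpy as np

def compose_effective_weights(rw: np.ndarray, omega: np.ndarray) -> np.ndarray:
    """Post-FW composition from the summary: omega_eff = rw*omega / sum(rw*omega)."""
    eff = rw * omega
    return eff / eff.sum()

# Per-draw replicate weights (pweight-only branch): slice the user's
# pweights by the bootstrap resample indices.
rng = np.random.default_rng(0)
w_control = np.array([1.0, 0.5, 2.0, 1.5])
boot_idx = rng.choice(len(w_control), size=len(w_control), replace=True)
rw = w_control[boot_idx]
# omega would come from re-running the weighted-FW helper on the resampled
# columns; a uniform placeholder keeps this sketch self-contained.
omega = np.full(len(rw), 1.0 / len(rw))
print(compose_effective_weights(rw, omega))  # sums to 1 by construction
```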

Methodology references (required if estimator / math changes)

  • Method name(s): SyntheticDiD survey-weighted bootstrap (weighted Frank-Wolfe + Rao-Wu composition)
  • Paper / source link(s): Arkhangelsky et al. (2021) Synthetic Difference-in-Differences, AER 111(12), 4088-4118 (Algorithm 2 step 2); Rao & Wu (1988) Resampling Inference with Complex Survey Data, JASA 83(401); Rao, Wu & Yue (1992) Some Recent Work on Resampling Methods for Complex Surveys, Survey Methodology 18(2). No external R/Julia anchor exists for survey-weighted SDID FW (neither package defines it); validation rests on the coverage-MC calibration row.
  • Intentional deviations: the weighted-FW objective min ||A·diag(rw)·ω - b||² + ζ²·Σ rw_i ω_i² returns ω on the simplex and downstream composes ω_eff = rw·ω / Σ(rw·ω). This is NOT the same as directly minimizing the standard SDID loss on ω_eff — the argmin sets differ because of the non-constant scaling factor in the ω↔ω_eff reparameterization. Intentional design mirroring the spirit of the pre-PR-#351 Rao-Wu composition but with ω re-estimated per draw under the weighted objective (so weight-estimation uncertainty propagates). See REGISTRY.md §SyntheticDiD Note (survey + bootstrap composition) for the full derivation.
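
In display form, the deviation above is

```math
\hat{\omega} = \arg\min_{\omega \in \Delta}\;
\bigl\lVert A\,\operatorname{diag}(rw)\,\omega - b \bigr\rVert^2
+ \zeta^2 \sum_i rw_i\,\omega_i^2,
\qquad
\omega^{\mathrm{eff}}_i = \frac{rw_i\,\hat{\omega}_i}{\sum_j rw_j\,\hat{\omega}_j},
```

which, as noted, is not the argmin of the standard SDID loss evaluated at ω_eff, because the ω ↦ ω_eff map rescales coordinates non-uniformly.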

Validation

  • Tests added/updated: tests/test_weighted_fw.py (new; 15 tests covering the Rust kernel + Python wrappers + survey helpers + backend parity), tests/test_methodology_sdid.py::TestBootstrapSE (two PR #351 raises-tests rewritten as succeeds-tests + new test_bootstrap_single_psu_returns_nan short-circuit regression + coverage-MC schema updated to include stratified_survey row with explicit calibration-gate assertion), tests/test_survey_phase5.py::TestSyntheticDiDSurvey (three PR #351 raises-tests rewritten as succeeds-tests with explicit SE assertions + new test_bootstrap_full_design_se_differs_from_pweight_only contract).
  • Backtest / simulation / notebook evidence: regenerated benchmarks/data/sdid_coverage.json (500 seeds × B=200, 40 min wall-clock on M-series Mac with Rust backend). Stratified-survey bootstrap rejection @ α=0.05 = 0.042 (inside [0.02, 0.10] gate).

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

🤖 Generated with Claude Code

@github-actions

Overall Assessment

⚠️ Needs changes — I found two unmitigated P1 issues in the new SDID survey-bootstrap path.

Executive Summary

  • The new ω-side weighted Frank-Wolfe composition and the single-PSU NaN short-circuit are documented in REGISTRY.md, so I did not treat those deliberate deviations as defects.
  • compute_time_weights_survey() does not actually solve the weighted λ objective stated in the registry once intercept=True; the helper row-scales by sqrt(rw) and then unweighted-centers the transformed matrix, which changes the problem for non-uniform survey weights.
  • The newly supported full-design survey path still assumes SurveyDesign.weights names a real column. Because SurveyDesign.resolve() allows weights=None and synthesizes ones, valid SurveyDesign(strata=..., psu=...) inputs can still fail before bootstrap starts.
  • I did not find a new inline-inference / partial-NaN anti-pattern; final inference still flows through safe_inference().
  • Several user-facing docs/messages still say SDID bootstrap rejects all survey designs, and the new tests do not cover either blocker above.

Methodology

  • P1 Impact: compute_time_weights_survey() is documented as solving Σ_u rw_u (Σ_t λ_t Y_pre[t,u] - Y_post_mean[u])² + ζ²||λ||², but the implementation row-scales Y_time by sqrt(rw) and then passes intercept=True into _sc_weight_fw(), whose centering is an unweighted column mean. That is not equivalent to the stated weighted-loss problem once weights vary, so every non-uniform survey bootstrap draw refits λ on the wrong objective and can bias the reported survey bootstrap SEs. Concrete fix: weighted-center Y_time by rw_control / rw_control.sum() and call _sc_weight_fw(..., intercept=False), or extend _sc_weight_fw to support weighted centering directly; then add a small direct-optimizer regression for non-uniform rw_control. References: diff_diff/utils.py:L1977-L2042, diff_diff/utils.py:L1440-L1443, diff_diff/synthetic_did.py:L1051-L1065, docs/methodology/REGISTRY.md:L1563-L1567
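
A minimal numpy sketch of the proposed fix, assuming the λ design matrix is laid out units × time (the orientation and names here are illustrative, not the library's exact internals). The equivalence it relies on: with an intercept, the rw-weighted least-squares optimum drives the rw-weighted mean residual to zero, so rw-weighted-centering both sides and solving with intercept=False solves the same problem:

```python
import numpy as np

def weighted_center_then_scale(A: np.ndarray, b: np.ndarray, rw: np.ndarray):
    """rw-weighted centering of a units-by-time design matrix and target,
    so a subsequent intercept-free solve matches the weighted objective."""
    p = rw / rw.sum()                      # normalized unit weights
    A_c = A - p @ A                        # subtract rw-weighted column means
    b_c = b - p @ b
    # WLS row-scaling: minimizing sum_u rw_u * r_u^2 equals OLS on
    # sqrt(rw)-scaled rows of the centered system.
    return np.sqrt(rw)[:, None] * A_c, np.sqrt(rw) * b_c

# The pre-fix bug was the reverse order: sqrt(rw) row-scaling first, then
# an UNWEIGHTED column-centering inside the solver (intercept=True).
```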

Code Quality

  • P1 Impact: the PR newly front-doors full-design survey bootstrap, but SyntheticDiD.fit() still extracts unit weights through _extract_unit_survey_weights(data, ..., survey_design, ...), and that helper directly indexes survey_design.weights. SurveyDesign.resolve() explicitly supports weights=None by synthesizing unit weights of 1, so a valid SurveyDesign(strata=..., psu=...) input now reaches this path and can fail before the advertised bootstrap branch runs. Concrete fix: derive w_control / w_treated from resolved_survey.weights or resolved_survey_unit.weights when survey_design.weights is None, or reject weights=None up front with a clear error; add a regression test for a weight-less full design. References: diff_diff/synthetic_did.py:L415-L437, diff_diff/survey.py:L160-L207, diff_diff/survey.py:L1051-L1071
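
A hedged sketch of the first suggested fix. The helper is named in the review (`_extract_unit_survey_weights`); the body, `unit_col` argument, and groupby strategy below are assumptions for illustration:

```python
import numpy as np
import pandas as pd

def extract_unit_survey_weights(data: pd.DataFrame, unit_col: str,
                                unit_order, survey_design) -> np.ndarray:
    # Mirror SurveyDesign.resolve(): weights=None means implicit unit
    # weights of 1, so never index a nonexistent weights column.
    if survey_design.weights is None:
        return np.ones(len(unit_order))
    per_unit = data.groupby(unit_col)[survey_design.weights].first()
    return per_unit.reindex(unit_order).to_numpy(dtype=float)
```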

Performance

  • No material findings in the changed code.

Maintainability

  • No material findings beyond the blocking issues above.

Tech Debt

  • No separately tracked TODO.md item mitigates the two P1 findings above.

Security

  • No material findings.

Documentation/Tests

Path to Approval

  1. Fix compute_time_weights_survey() so the survey bootstrap λ refit matches the documented weighted objective under intercept=True, and add a non-uniform-weight regression that checks the helper against a direct constrained optimizer.
  2. Handle SurveyDesign(weights=None, strata=..., psu=...) explicitly in SyntheticDiD.fit() by either supporting implicit unit weights of 1 from the resolved survey object or rejecting that configuration with a clear front-door error; add a regression test for the chosen behavior.

igerber added a commit that referenced this pull request Apr 24, 2026
Fixes two P1 issues flagged by the CI reviewer on the initial submission
of PR #352.

P1 Methodology — `compute_time_weights_survey` was documented as solving
the WLS-style weighted λ objective
  min Σ_u rw_u·(Σ_t λ_t·Y_u,pre[t] - Y_u,post_mean)² + ζ²·||λ||²
but row-scaled Y by sqrt(rw) and then handed the scaled matrix to
`_sc_weight_fw(intercept=True)`, whose column-centering uses an
UNWEIGHTED mean across controls. That is not the weighted objective
once rw varies, so non-uniform survey bootstrap draws were refitting λ
on the wrong objective and could bias the reported SE.

Fix: weighted-center `Y_time` BEFORE the sqrt(rw) row-scaling, using
`col_weighted_mean = (Y_time * rw).sum(0) / rw.sum()`, and pass
`intercept=False` to the kernel so no additional unweighted centering
happens on the scaled matrix. Both two-pass calls updated.
`compute_sdid_unit_weights_survey` is unchanged — its column-centering
is PER-UNIT (time means within each control column), which is
independent of rw.

P1 Code Quality — `SurveyDesign(weights=None, strata=..., psu=...)` is
a valid configuration (`SurveyDesign.resolve()` synthesizes ones when
weights is None), but `_extract_unit_survey_weights` indexed
`survey_design.weights` as if it were always a column name, so the
groupby would fail with a KeyError before the bootstrap branch could
run. Fix: `_extract_unit_survey_weights` now short-circuits to a vector
of ones of length `len(unit_order)` when `survey_design.weights is
None`, matching `SurveyDesign.resolve()`'s semantics.

Regression tests:
- `test_non_uniform_rw_beats_unweighted_centering_variant`
  (test_weighted_fw.py): reproduces the pre-fix buggy variant (row-
  scale Y by sqrt(rw), then call `_sc_weight_fw(intercept=True)`) and
  asserts the fixed path's weighted SSR is strictly ≤ the buggy
  variant's weighted SSR. If a future revert reintroduces
  intercept=True after the row-scaling, this test fails.
- `test_bootstrap_full_design_without_explicit_weights`
  (test_methodology_sdid.py): `SurveyDesign(strata=..., psu=...)` with
  no explicit `weights` column now succeeds on the bootstrap path;
  survey_metadata populated with n_strata / n_psu.

P3 Documentation:
- `SyntheticDiD.fit()` docstring (survey_design parameter + Raises
  block): replace "bootstrap rejects all survey designs" language with
  the PR #352 support-matrix truth-table (bootstrap ✓ for both pweight-
  only and full design; placebo/jackknife ✓ pweight-only, ✗ full
  design).
- `_placebo_variance_se` fallback-guidance messages (two sites): drop
  the "strata/PSU/FPC not yet supported by any SDID variance method"
  framing; recommend bootstrap for full-design survey fallback,
  jackknife for pweight-only, adding controls as the universal
  fallback.
- `docs/survey-roadmap.md` Current Limitations table: collapse the two
  SDID bootstrap-rejection rows into a single row for placebo/
  jackknife + full design (the bootstrap + full design row no longer
  applies).

Verified: 75 targeted tests pass (test_weighted_fw + TestBootstrapSE +
TestScaleEquivariance + TestCoverageMCArtifact + test_survey_phase5).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 099797e7edeff956ac39c586e3924fbd57d0a11c


Overall Assessment

Blocker — one unmitigated P0 remains in the new pweight-only SDID survey-bootstrap path.

Executive Summary

Methodology

Arkhangelsky et al.’s Algorithm 2 bootstraps SDID by resampling rows with replacement and discarding resamples with no treated or no control units; Rao & Wu (1988) and Rao, Wu & Yue (1992) are the cited complex-survey resampling references for the survey branch. The PR’s weighted-FW survey composition is a documented registry deviation, so I treated it as methodology-acceptable for this review.

  • P0 Impact: fit-time survey SDID always uses survey-weighted treated means when w_treated is present, but _bootstrap_se() only retries zero-mass treated draws for the Rao-Wu/full-design branch. In the new pweight-only branch, rw_treated_draw.sum() == 0 falls through to np.mean(...), silently turning a zero-weight treated resample into an unweighted one. Because zero survey weights are allowed as long as not all weights are zero, this is a reachable supported input, and it directly corrupts the bootstrap draw distribution used for se, p_value, and conf_int. Concrete fix: when survey weights are active, treat rw_treated_draw.sum() == 0 as a degenerate draw and continue/retry, just as the full-design Rao-Wu branch already does; do not ever fall back to the unweighted treated mean while w_treated is in force. References: diff_diff/synthetic_did.py:L535-L567, diff_diff/synthetic_did.py:L988-L1038, diff_diff/survey.py:L160-L176, docs/methodology/REGISTRY.md:L1557-L1574
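
A sketch of the retry contract the fix asks for, with illustrative names. The only point being made: a zero-mass treated (or control) draw is skipped and redrawn, never averaged unweighted:

```python
import numpy as np

def draw_valid_survey_resample(rng, n_total, treated_mask, rw,
                               max_attempts=1000) -> np.ndarray:
    """Return resample indices whose treated and control survey mass are
    both positive; degenerate draws are retried, not silently downgraded."""
    for _ in range(max_attempts):
        idx = rng.choice(n_total, size=n_total, replace=True)
        rw_draw = rw[idx]
        is_t = treated_mask[idx]
        if rw_draw[is_t].sum() > 0 and rw_draw[~is_t].sum() > 0:
            return idx
    raise RuntimeError("Could not produce a valid survey bootstrap draw.")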

Code Quality

Performance

  • No material findings in the changed path.

Maintainability

  • No material findings beyond the blocker above.

Tech Debt

  • No blocking findings. The remaining full-design placebo/jackknife gap is explicitly tracked in TODO.md, so I treated it as mitigated/non-blocking for this review. Reference: TODO.md:L107-L108

Security

  • No material findings.

Documentation/Tests

Path to Approval

  1. In SyntheticDiD._bootstrap_se(), make zero-sum treated survey mass a degenerate survey bootstrap draw in the pweight-only branch as well; retry/skip it instead of falling back to np.mean(...).
  2. Add a regression test for pweight-only bootstrap with a zero-weight treated unit to prove that zero-mass treated draws are skipped and cannot contaminate the reported bootstrap SE/inference.

Execution note: I did not run the test suite in this environment because the available python lacks numpy; the review above is from diff and source inspection.

@igerber
Owner Author

igerber commented Apr 24, 2026

/ai-review

igerber added a commit that referenced this pull request Apr 24, 2026
…path

CI review flagged a silent-failure P0: when a bootstrap draw's treated
units all had survey weight 0 under a pweight-only survey design, the
code fell through to ``np.mean(Y_boot_pre_t, axis=1)`` (unweighted
mean). That silently dropped survey weighting for that draw while the
fit-time ATT uses the survey-weighted treated mean — corrupting the
bootstrap distribution used for SE, p-value, and CI. Reachable with
any `SurveyDesign(weights=...)` input where at least one treated unit
has pweight 0 (a valid configuration per
``SurveyDesign.resolve()``'s spec) and the bootstrap resample picks
only those units.

Fix: extend the Rao-Wu degenerate-retry in ``_bootstrap_se`` to the
pweight-only branch as well. When ANY survey path is active and
``rw_treated_draw.sum() == 0`` OR ``rw_control_draw.sum() == 0``, the
draw is retried. The pre-fix behavior (unweighted treated mean
fallback) is unreachable from any code path now.

Regression: ``test_bootstrap_pweight_only_retries_zero_treated_mass_draws``
in ``tests/test_methodology_sdid.py::TestBootstrapSE``. Constructs a
panel with 3 treated units where one has `wt=0`; bootstrap at B=100
still returns finite SE > 0. The test would fail if the code reverted
to the unweighted-mean fallback because zero-mass draws would either
produce degenerate SDiD estimators (NaN τ → gets filtered out by the
finite-τ gate, possibly leaving no valid draws) or produce
systematically different τ values that inflate the SE.

P3 (same review) also addressed:
- ``diff_diff/synthetic_did.py`` replicate-weight rejection message
  (line ~303) now reflects the PR #352 support matrix instead of the
  pre-PR #352 "only pweight-only placebo/jackknife" framing.
  Analytical survey designs are now fully supported per the truth
  table; only replicate-weight variance remains unsupported.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 2b2e8f0824511607468418768386f4634978af6f


Overall Assessment

Looks good — no unmitigated P0 or P1 findings.

Executive Summary

  • The prior blocker is fixed: the pweight-only bootstrap path now retries zero-treated-mass survey draws instead of silently falling back to an unweighted treated mean. Refs: diff_diff/synthetic_did.py:L1013-L1045, tests/test_methodology_sdid.py:L674-L710.
  • The prior SurveyDesign(weights=None, strata=..., psu=...) regression is fixed. Refs: diff_diff/survey.py:L1051-L1084, tests/test_methodology_sdid.py:L714-L741.
  • I did not find any new unmitigated methodology or inference bugs in the changed code. The weighted-FW + Rao-Wu full-design branch remains a documented Registry deviation, so I treated it as informational rather than a defect.
  • Two P3 items remain: secondary docs do not describe that hybrid composition as precisely as REGISTRY.md, and the full-design branch still lacks a tight deterministic regression around the boot_idx + Rao-Wu interaction.
  • I could not execute tests here because this environment does not have numpy or pytest installed.

Methodology

  • No unmitigated findings. The full-design survey bootstrap is not the literal SDID Algorithm 2 pairs bootstrap or a literal standalone Rao-Wu bootstrap, but REGISTRY.md explicitly documents the shipped hybrid (rw sliced over resampled units), so under the review rules this is informational rather than blocking. Refs: diff_diff/synthetic_did.py:L978-L1001, docs/methodology/REGISTRY.md:L1563-L1574. Arkhangelsky et al. describe Algorithm 2 as unit-level resampling, while Rao-Wu is PSU-within-stratum resampling/weight rescaling.

Code Quality

  • No material findings. The previous weights=None crash is resolved in diff_diff/survey.py:L1051-L1084 and covered in tests/test_methodology_sdid.py:L714-L741.

Performance

  • No material findings in the changed path.

Maintainability

  • No material findings. The Rust/NumPy weighted-FW split is coherent and has dedicated parity/invariant coverage in tests/test_weighted_fw.py:L1-L383.

Tech Debt

  • No blocking findings. The remaining full-design placebo/jackknife gap is explicitly tracked in TODO.md:L107-L108, so it is mitigated/non-blocking.

Security

  • No material findings.

Documentation/Tests

  • Severity P3. Impact: survey-theory.md and CHANGELOG.md describe the restored full-design path as “Rao-Wu rescaled bootstrap” / “Rao-Wu rescaled weights per draw”, but only REGISTRY.md surfaces the material caveat that SDID still keeps the unit-level boot_idx resampling and then applies Rao-Wu weights. That can mislead readers about the exact methodology that shipped. Concrete fix: mirror REGISTRY.md’s “sliced over the resampled units” wording in the secondary docs, or name it explicitly as a hybrid pairs-bootstrap + Rao-Wu composition. Refs: diff_diff/synthetic_did.py:L978-L1001, docs/methodology/REGISTRY.md:L1563-L1574, docs/methodology/survey-theory.md:L725-L732, CHANGELOG.md:L23-L27.
  • Severity P3. Impact: the new tests cover kernel parity, the prior zero-treated-mass bug, and survey happy paths, but they do not pin the methodology-critical boot_idx + generate_rao_wu_weights() interaction with a small deterministic regression; that behavior is still guarded mainly by the slow coverage artifact. Concrete fix: add a targeted test that monkeypatches generate_rao_wu_weights() on a multi-unit-per-PSU panel and asserts the exact per-draw control/treated weights fed into _bootstrap_se(). Refs: diff_diff/synthetic_did.py:L978-L1045, tests/test_survey_phase5.py:L179-L209, tests/test_methodology_sdid.py:L621-L650, tests/test_methodology_sdid.py:L3055-L3143.
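
A hedged pytest sketch of the second fix. The module path, panel builder, fit driver, and helper signatures are all assumptions; only the monkeypatch-capture-assert pattern is the point:

```python
import numpy as np

def test_rao_wu_weights_flow_into_weighted_fw(monkeypatch):
    import diff_diff.synthetic_did as sdid_mod   # assumed module path

    data, n_total = make_small_survey_panel()    # hypothetical builder
    known_rw = np.arange(1.0, n_total + 1.0)     # distinct per-unit values
    captured = []

    # Pin the Rao-Wu draw to a known vector so per-draw rw is predictable.
    monkeypatch.setattr(sdid_mod, "generate_rao_wu_weights",
                        lambda design, rng: known_rw)

    real = sdid_mod.compute_sdid_unit_weights_survey
    def capturing(*args, reg_weights=None, **kwargs):
        captured.append(np.asarray(reg_weights).copy())
        return real(*args, reg_weights=reg_weights, **kwargs)
    monkeypatch.setattr(sdid_mod, "compute_sdid_unit_weights_survey", capturing)

    fit_sdid_full_design_bootstrap(data)         # hypothetical fit driver
    # Every captured per-draw vector must draw only from known_rw, pinning
    # the boot_idx x generate_rao_wu_weights interaction deterministically.
    assert captured and all(np.isin(rw, known_rw).all() for rw in captured)
```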

@igerber
Owner Author

igerber commented Apr 24, 2026

/ai-review

igerber added a commit that referenced this pull request Apr 24, 2026
…lice

Two P3s from R3; PR was already ✅ Looks good — these are close-out
polish.

P3 docs/tests — secondary surfaces described the full-design path as
"Rao-Wu rescaled bootstrap" but only REGISTRY.md surfaced the material
caveat that SDID still uses unit-level pairs-bootstrap resampling
(``boot_idx = rng.choice(n_total)``) and then applies Rao-Wu rescaled
weights on top — a hybrid composition, not a standalone Rao-Wu
bootstrap. Update survey-theory.md (splits SunAbraham/TROP's standalone
Rao-Wu bullet from SDID's hybrid bullet) and CHANGELOG.md's PR #352
Added entry to use the hybrid-composition wording mirroring REGISTRY.

P3 tests — the methodology-critical ``boot_idx`` × ``generate_rao_wu_weights``
interaction was only guarded by the slow coverage MC. Add
``test_bootstrap_full_design_rao_wu_boot_idx_slice`` (in
``TestBootstrapSE``) which monkeypatches ``generate_rao_wu_weights`` to
return a known vector of distinct per-unit values
(``arange(1, n_total+1)``), captures the ``rw_control_draw`` vectors
fed into the weighted FW helper via a capturing wrapper on
``compute_sdid_unit_weights_survey``, and asserts every captured vector
lies within ``known_rw[:n_control]`` (positions 1..n_control). This
catches two bug classes:
- slice-order regression: if someone swaps rw-then-slice for
  slice-then-rw, the captured vectors would include values from the
  treated slice ``known_rw[n_control:]`` and the assertion fires.
- rw-drift regression: if the Rao-Wu call site bypasses
  ``generate_rao_wu_weights`` (e.g., a refactor silently uses the
  pweight-only branch for full-design fits), the captured vector would
  be the user's w_control (all 1.0 in this test) instead of the known
  Rao-Wu output.

Verified: 294 targeted tests pass across test_methodology_sdid /
test_survey_phase5 / test_weighted_fw / test_guides / test_rust_backend.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 2bf3f93deb9a7b6d42a180721586730f905440af


Overall Assessment

Blocker — prior review items around documentation consistency and weights=None handling look addressed, but I found one new unmitigated P0 in the newly restored pweight-only SDID bootstrap path.

Executive Summary

  • The prior documentation concern is fixed: the shipped hybrid pairs-bootstrap + Rao-Wu composition is now described consistently in docs/methodology/REGISTRY.md:L1549-L1576, docs/methodology/survey-theory.md:L728-L741, and CHANGELOG.md:L20-L27.
  • The prior SurveyDesign(weights=None, strata=..., psu=...) regression is fixed in diff_diff/survey.py:L1051-L1084 and covered by tests/test_methodology_sdid.py:L704-L741.
  • [Newly identified] P0: the new pweight-only bootstrap branch bypasses SurveyDesign.resolve() normalization and feeds raw survey weights into the weighted Frank-Wolfe objective, so bootstrap SE / p-value / CI can change when all pweights are multiplied by a constant. Refs: diff_diff/survey.py:L160-L203, diff_diff/synthetic_did.py:L427-L444, diff_diff/synthetic_did.py:L632-L653, diff_diff/synthetic_did.py:L1002-L1082, diff_diff/utils.py:L1845-L1862, diff_diff/utils.py:L1979-L1993.
  • The new Rao-Wu × boot_idx regression test is helpful but weaker than its docstring claims: it only guards against treated-slice contamination, not exact boot_idx-ordered equality. Refs: tests/test_methodology_sdid.py:L743-L821.
  • I could not execute tests here because this environment is missing numpy, so importing the package fails.

Methodology

  • Severity P0 [Newly identified]. Impact: SurveyDesign.resolve() normalizes pweights/aweights to mean 1 (diff_diff/survey.py:L160-L203), which is the library’s scale-invariance contract. The new pweight-only bootstrap path ignores those resolved weights, extracts raw unit weights from the input frame (diff_diff/synthetic_did.py:L427-L430, diff_diff/survey.py:L1051-L1084), and passes them as rw into the weighted-FW bootstrap objective (diff_diff/synthetic_did.py:L640-L653, diff_diff/synthetic_did.py:L1002-L1082, diff_diff/utils.py:L1845-L1862, diff_diff/utils.py:L1979-L1993). Because that objective is not invariant to a global rescaling of rw, two equivalent designs such as w and 2*w can now produce different bootstrap SEs and downstream inference with no warning. Concrete fix: for the pweight-only bootstrap branch, source control/treated unit weights from the resolved unit-level design (resolved_survey_unit.weights, or equivalent normalized arrays) rather than the raw survey_design column, and add a regression asserting identical bootstrap SE / p-value / CI under w vs c*w. A numeric sketch of the non-invariance follows this list.
  • Severity P3. Impact: the full-design survey bootstrap is a documented hybrid, not a paper-exact reproduction of either source procedure. Arkhangelsky et al. describe Algorithm 2 as a resampling-and-refit bootstrap for SDID, while Rao-Wu resampling perturbs PSU weights within strata; REGISTRY.md now explicitly labels the shipped SDID survey path as that hybrid composition, so under the review rules this is informational rather than a defect. Concrete fix: none required. Refs: docs/methodology/REGISTRY.md:L1549-L1576, docs/methodology/survey-theory.md:L728-L741.
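
A tiny numeric check of the non-invariance claim in the P0 above, using the objective form quoted from the registry (synthetic data; nothing here is library code):

```python
import numpy as np

def weighted_fw_loss(omega, A, b, rw, zeta=1.0):
    # ||A diag(rw) omega - b||^2 + zeta^2 * sum_i rw_i * omega_i^2
    return ((A @ (rw * omega) - b) ** 2).sum() + zeta**2 * (rw * omega**2).sum()

rng = np.random.default_rng(0)
A, b = rng.normal(size=(6, 4)), rng.normal(size=6)
rw = rng.uniform(0.5, 2.0, size=4)
o1 = np.array([1.0, 0.0, 0.0, 0.0])   # two simplex vertices
o2 = np.array([0.0, 0.0, 0.0, 1.0])
# Under rw -> 2*rw the data term scales quadratically in rw while the
# ridge term scales linearly, so the loss ordering of candidate omegas
# can flip: the argmin over the simplex is scale-dependent.
print(weighted_fw_loss(o1, A, b, rw) < weighted_fw_loss(o2, A, b, rw),
      weighted_fw_loss(o1, A, b, 2 * rw) < weighted_fw_loss(o2, A, b, 2 * rw))
```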

Code Quality

  • No additional findings beyond the raw-vs-resolved weight bug above.

Performance

  • No material findings in the changed path.

Maintainability

  • No material findings. The weighted Rust and NumPy FW paths are coherently covered by tests/test_weighted_fw.py.

Tech Debt

  • No blocker. The remaining placebo / jackknife + strata/PSU/FPC limitation is explicitly tracked in TODO.md:L107-L108, so that gap is properly mitigated for review purposes.

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: test_bootstrap_full_design_rao_wu_boot_idx_slice says it checks the exact expected slice known_rw[:n_control][boot_idx[boot_is_control]], but the actual assertion only checks that captured values stay within [1, 15] and never pull from the treated slice (tests/test_methodology_sdid.py:L743-L821). A regression that permuted, deduplicated, or otherwise mis-ordered the control-side boot_idx slice would still pass. Concrete fix: make one bootstrap draw deterministic and assert exact equality of the captured rw_control vector to the expected slice.

Path to Approval

  1. Normalize the pweight-only bootstrap branch onto resolved unit-level survey weights before passing rw into compute_sdid_unit_weights_survey() / compute_time_weights_survey().
  2. Add a regression test that SurveyDesign(weights="w") and SurveyDesign(weights="c*w") produce identical bootstrap SE, p-value, and CI on the same SDID fit.
  3. Tighten the Rao-Wu × boot_idx deterministic test to assert the exact captured slice, not just the value range.

igerber added a commit that referenced this pull request Apr 24, 2026
…tighten boot_idx slice test

R4 P0 (scale-invariance): the pweight-only bootstrap branch was sourcing
w_control / w_treated from raw panel-column weights via
_extract_unit_survey_weights. The weighted-FW bootstrap objective is not
scale-invariant in rw (loss scales as rw^2 via A·diag(rw), reg scales as
rw), so two equivalent designs w and c*w could produce different
bootstrap SE / p-value / CI with no warning. Fix: source w_control /
w_treated from resolved_survey_unit.weights, which SurveyDesign.resolve()
normalizes to mean=1 (survey.py L189-L203). Placebo / jackknife paths
also consume the same w_control / w_treated but are scale-invariant, so
their numerics are unchanged.

R4 P3 (test tightening): the boot_idx × Rao-Wu regression test asserted
captured rw values stayed within the known_rw[1, 15] range — too weak to
catch permutation / deduplication regressions in the slice order. Tighten
by reproducing the bootstrap RNG stream externally (fake_rao_wu doesn't
consume rng) and asserting exact-equality between the captured rw_control
vector and known_rw[:n_control][boot_idx[boot_is_control]].

New regression test: test_bootstrap_scale_invariance_under_pweight_rescaling
fits the same panel under SurveyDesign("wt") vs SurveyDesign("wt_scaled")
(10x rescale) and asserts SE, p-value, CI match to machine-epsilon tolerance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 676d8317de7374ce7ff9aa3bef830ac24b3fd91b


Overall Assessment

Looks good — no unmitigated P0/P1 findings. The prior rerun’s raw-vs-resolved pweight blocker is fixed in diff_diff/synthetic_did.py:L421-L458, and the Rao-Wu-by-boot_idx regression is now checked exactly in tests/test_methodology_sdid.py:L743-L843.

Executive Summary

  • Previous blocker resolved: survey bootstrap now sources normalized unit weights from resolved_survey_unit.weights, restoring pweight scale invariance, with regression coverage at tests/test_methodology_sdid.py:L845-L905.
  • Previous test weakness resolved: the full-design Rao-Wu slice test now asserts exact known_rw[:n_control][boot_idx[boot_is_control]] equality in tests/test_methodology_sdid.py:L743-L843.
  • The weighted time-weight helper now uses weighted centering before sqrt(rw) row-scaling, matching the documented objective and covering the prior centering bug via diff_diff/utils.py:L2037-L2057 and tests/test_weighted_fw.py:L244-L307.
  • Severity P2: the new Rust weighted-FW dispatcher silently falls back to the unweighted kernel on reg_weights length mismatch, unlike the NumPy path which raises; see rust/src/weights.rs:L694-L699 versus diff_diff/utils.py:L1453-L1459.
  • Severity P3 [Newly identified]: docs/choosing_estimator.rst:L817-L823 still says SDID bootstrap rejects all survey designs, contradicting the updated support matrix in docs/methodology/REGISTRY.md:L1549-L1576.
  • I could not execute tests here because this environment lacks numpy and pytest.

Methodology

  • No unmitigated P0/P1 findings.
  • Severity P3. Impact: the restored survey path is a deliberate hybrid pairs-bootstrap + Rao-Wu composition rather than a paper-exact survey bootstrap, but that deviation is now explicitly documented in docs/methodology/REGISTRY.md:L1563-L1574 and mirrored in diff_diff/synthetic_did.py:L61-L73, so under the review rules it is informational only. Concrete fix: none.

Code Quality

  • Severity P2. Impact: sc_weight_fw_weighted_internal() silently delegates to the unweighted solver when reg_weights.len() != t0, so Rust and NumPy backends disagree on invalid input and a direct caller can get the wrong objective with no error. Refs: rust/src/weights.rs:L670-L699, diff_diff/utils.py:L1380-L1406, diff_diff/utils.py:L1453-L1459. Concrete fix: validate reg_weights.shape == (Y.shape[1] - 1,) before Rust dispatch, or have the Rust entry point raise ValueError on mismatch instead of falling back.
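
A sketch of the validation suggested as the concrete fix, on the Python-dispatcher side (function name and placement assumed; the shape rule is taken from the finding itself):

```python
import numpy as np

def validate_reg_weights(Y: np.ndarray, reg_weights) -> None:
    """Raise on shape mismatch instead of silently using the unweighted kernel."""
    expected = Y.shape[1] - 1          # one weight per control coordinate
    rw = np.asarray(reg_weights, dtype=float)
    if rw.shape != (expected,):
        raise ValueError(
            f"reg_weights must have shape ({expected},), got {rw.shape}; "
            "refusing silent fallback to the unweighted solver."
        )
```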

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3. Impact: the remaining placebo/jackknife + strata/PSU/FPC gap is explicitly tracked in TODO.md:L107-L108, so it is mitigated and non-blocking for this PR. Concrete fix: none in this PR.

Security

  • No findings.

Documentation/Tests

  • Severity P3 [Newly identified]. Impact: the compatibility note in docs/choosing_estimator.rst:L817-L823 still advertises the pre-PR limitation, which can steer users away from the restored SDID survey-bootstrap path now described in docs/methodology/REGISTRY.md:L1549-L1576 and diff_diff/guides/llms-full.txt:L1673-L1679. Concrete fix: update docs/choosing_estimator.rst to match the new support matrix.
  • Severity P3. Impact: the module docstring in benchmarks/python/coverage_sdid.py:L2-L12 still says the study covers three DGPs and runs each method on each DGP, but the script now includes a fourth DGP and intentionally skips unsupported methods via survey_design_factory; that makes the calibration harness easier to misread. Concrete fix: refresh the docstring to describe four DGPs and note that stratified_survey is bootstrap-only (benchmarks/python/coverage_sdid.py:L66-L82, L178-L222, L348-L357).

@igerber
Owner Author

igerber commented Apr 24, 2026

/ai-review

igerber added a commit that referenced this pull request Apr 24, 2026
R5 P2: Rust ``sc_weight_fw_weighted_internal`` silently fell back to the
unweighted kernel on ``reg_weights.len() != Y.shape[1] - 1`` while the
NumPy path raised ``ValueError``. Two direct callers of the Rust
pyfunctions (``sc_weight_fw_weighted`` / ``_with_convergence``) could
therefore get the wrong objective with no error. Validate shape both at
the Python dispatcher (``diff_diff.utils._sc_weight_fw``) and at each
Rust pyfunction entry point so both backends raise ``ValueError``
consistently. The internal's fallback is kept as defensive for Rust-
side callers but becomes unreachable from the Python API; comment
updated to reflect that. Regression test:
``test_reg_weights_length_mismatch_raises`` in tests/test_weighted_fw.py.

R5 P3 (docs/choosing_estimator.rst): the SDID row in the survey support
matrix and the trailing note still said survey-bootstrap was rejected.
Updated to describe the landed hybrid pairs-bootstrap + Rao-Wu rescaled
composition and to point users at the REGISTRY note for the objective
form and argmin-set caveat.

R5 P3 (benchmarks/python/coverage_sdid.py): the module docstring said
the study covers three DGPs; it now covers four
(``stratified_survey`` added) and bootstrap-only on that DGP via
``survey_design_factory``. Refresh the docstring to describe all four
DGPs and call out the bootstrap-only coverage for stratified_survey.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 88ad1cc1a017f6b940989b0fb78a9e1c6724df22


Overall Assessment

Looks good — no unmitigated P0/P1 findings. The restored SyntheticDiD survey-bootstrap path is consistent with the new PR #352 methodology note, and the prior rerun’s substantive issues appear addressed. One P3 documentation inconsistency remains.

Executive Summary

  • The affected method is SyntheticDiD survey bootstrap. The new weighted-FW + Rao-Wu composition in diff_diff/synthetic_did.py:L633-L671 and diff_diff/synthetic_did.py:L924-L1108 matches the new registry note at docs/methodology/REGISTRY.md:L1549-L1575.
  • I did not find unmitigated P0/P1 issues in weighting, SE computation, NaN/inference handling, or control/treated slicing on the changed path.
  • The previous rerun’s substantive findings appear fixed in code: invalid reg_weights shapes now raise before Rust fallback (diff_diff/utils.py:L1380-L1393, rust/src/weights.rs:L894-L953), docs/choosing_estimator.rst now reflects restored support (docs/choosing_estimator.rst:L785-L827), and the coverage harness docstring was updated for the fourth DGP (benchmarks/python/coverage_sdid.py:L1-L20).
  • Regression coverage is well targeted for the new path: Rao-Wu slicing, zero-treated-mass retries, pweight rescaling invariance, single-PSU NaN, and weighted-centering/backend parity are covered in tests/test_methodology_sdid.py:L714-L926, tests/test_survey_phase5.py:L179-L392, and tests/test_weighted_fw.py:L137-L388.
  • Severity P3 [Newly identified]: docs/methodology/REGISTRY.md still contains three stale pre-PR352 statements claiming SDID survey bootstrap raises NotImplementedError (docs/methodology/REGISTRY.md:L1497-L1508, docs/methodology/REGISTRY.md:L1624-L1624, docs/methodology/REGISTRY.md:L2933-L2939), contradicting the new support matrix at docs/methodology/REGISTRY.md:L1549-L1575.
  • I could not execute tests here because this environment lacks numpy, pandas, and pytest.

Methodology

Code Quality

  • No findings. The prior Rust/NumPy invalid-input divergence on reg_weights length is now closed by shared validation in diff_diff/utils.py:L1380-L1393 and matching Rust Python-entry guards in rust/src/weights.rs:L894-L953.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings. The remaining placebo/jackknife + strata/PSU/FPC gap is explicitly tracked and therefore mitigated in TODO.md:L107-L107.

Security

  • No findings.

Documentation/Tests

  • No additional findings beyond the stale REGISTRY.md text above. The newly added/updated tests cover the highest-risk branches of the changed path in tests/test_methodology_sdid.py:L714-L926, tests/test_survey_phase5.py:L179-L392, and tests/test_weighted_fw.py:L137-L388.
  • Tests were not executed in this environment because numpy, pandas, and pytest are unavailable.

@igerber
Owner Author

igerber commented Apr 24, 2026

/ai-review

igerber added a commit that referenced this pull request Apr 24, 2026


Three places in REGISTRY.md still described the pre-PR-#352 behavior
where SDID survey bootstrap raised NotImplementedError, contradicting
the new support matrix and survey-composition note landed earlier in
this PR:

- §SyntheticDiD bootstrap bullet (paper-faithful refit text): the
  "Composed with any survey design ... raises NotImplementedError"
  trailer is updated to describe weighted-FW dispatch under the
  survey + bootstrap composition note.
- §SyntheticDiD requirements checklist bullet for "Bootstrap: paper-
  faithful Algorithm 2 step 2": "Survey designs raise
  NotImplementedError" trailer is updated to describe the hybrid
  pairs-bootstrap + Rao-Wu rescaling composition.
- §Rao-Wu Rescaled Bootstrap "intentionally excluded" note: rewritten
  to add SDID to the list via its hybrid composition, with the
  precise distinction from standalone Rao-Wu (SunAbraham / TROP) and
  a pointer to §SyntheticDiD for the objective form and argmin-set
  caveat.

Remaining NotImplementedError mentions in REGISTRY.md §SyntheticDiD
(L1554-L1555) correctly describe the placebo / jackknife + strata/PSU/
FPC methodology gap that is out of scope for this PR and tracked in
TODO.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 60b0e0c6d1a7c8eef615aa94ed5332e24d80b9ee


Overall Assessment

⚠️ Needs changes — one unmitigated P1 finding.

Executive Summary

  • Affected method: SyntheticDiD survey bootstrap. The new weighted-Frank-Wolfe + Rao-Wu composition is documented in the Methodology Registry, and I did not find an unmitigated paper-to-code mismatch in the resample/refit logic itself.
  • The previous rerun’s lone P3 about stale REGISTRY.md text appears resolved; the SyntheticDiD support matrix and Rao-Wu section are now internally consistent at docs/methodology/REGISTRY.md:L1549-L1576 and docs/methodology/REGISTRY.md:L2933-L2941.
  • P1 [Newly identified]: the PR fixes zero-mass bootstrap draws, but fit() still has no fit-time positive-mass guard for survey-weighted treated/control arms. A valid SurveyDesign with all treated mass or all control mass equal to zero can still hit zero-sum weighted means or 0/0 normalization instead of failing cleanly. diff_diff/synthetic_did.py:L421-L458, diff_diff/synthetic_did.py:L551-L582, diff_diff/survey.py:L166-L176, tests/test_methodology_sdid.py:L674-L712
  • P3: docs/choosing_estimator.rst now says SyntheticDiD has “full (bootstrap)” support in the Weights column, but the code still hard-rejects non-pweight survey designs. docs/choosing_estimator.rst:L785-L811, diff_diff/synthetic_did.py:L317-L322
  • I could not execute the test suite here because this environment does not have numpy, pandas, or pytest.

Methodology

  • Severity P1 [Newly identified]. Impact: the PR added the right retry for zero-mass survey bootstrap draws, but the same positive-support check is still missing at fit time. After splitting w_control / w_treated, the code immediately calls np.average(..., weights=w_treated) and later normalizes omega_eff = unit_weights * w_control by its sum. If either arm has zero total survey mass, the estimator can throw or propagate NaNs instead of front-door rejecting an unidentified target population. That is a partial fix of the exact zero-mass pattern the PR already addressed inside _bootstrap_se. Concrete fix: right after w_control / w_treated are constructed, add an explicit guard if w_control.sum() <= 0 or w_treated.sum() <= 0: raise ValueError(...), and cover both pweight-only and full-design variants in tests. diff_diff/synthetic_did.py:L421-L458, diff_diff/synthetic_did.py:L551-L582, diff_diff/survey.py:L166-L176, tests/test_methodology_sdid.py:L674-L712
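
A sketch of the guard proposed above (placement right after the w_control / w_treated split; the function wrapper and message wording are assumptions):

```python
import numpy as np

def check_positive_survey_mass(w_treated: np.ndarray,
                               w_control: np.ndarray) -> None:
    """Fit-time guard: fail cleanly instead of letting 0/0 normalizations
    propagate NaN through the bootstrap / placebo / jackknife dispatchers."""
    if w_treated.sum() <= 0 or w_control.sum() <= 0:
        arm = "treated" if w_treated.sum() <= 0 else "control"
        raise ValueError(
            f"Survey design assigns zero total weight to the {arm} arm; "
            "the survey-weighted ATT target is unidentified."
        )
```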

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. The remaining placebo / jackknife + strata/PSU/FPC gap is properly tracked and therefore mitigated. TODO.md:L108-L109

Security

No findings.

Documentation/Tests

  • Severity P3. Impact: the SyntheticDiD row in docs/choosing_estimator.rst uses “full (bootstrap)” in the Weights column, but SyntheticDiD.fit() still enforces weight_type='pweight' for every survey configuration. That wording conflates full design-element support with full weight-type support and can mislead users into expecting fweight / aweight bootstrap support that does not exist. Concrete fix: change that cell to pweight only (or pweight only; full design elements via bootstrap) and leave Via bootstrap in the Strata/PSU/FPC column. docs/choosing_estimator.rst:L785-L811, diff_diff/synthetic_did.py:L317-L322
  • No additional test-coverage findings. The new tests are well targeted around the restored survey-bootstrap path, especially Rao-Wu slicing, zero-mass draw retry, pweight rescaling invariance, and single-PSU handling. tests/test_methodology_sdid.py:L598-L928, tests/test_survey_phase5.py:L179-L390, tests/test_weighted_fw.py:L1-L407
  • Tests were not executed here because numpy, pandas, and pytest are unavailable.

Path to Approval

  1. Add a fit-time survey positive-mass guard after w_control / w_treated are split, and fail cleanly when either arm’s total survey weight is zero.
  2. Add regression tests for both pweight-only and full-design survey inputs where all treated mass is zero and where all control mass is zero; assert the chosen front-door error contract.
  3. Fix the SyntheticDiD weights-cell wording in docs/choosing_estimator.rst so it still says pweight only while keeping full design-element support in the bootstrap column.

@igerber
Owner Author

igerber commented Apr 24, 2026

/ai-review

igerber added a commit that referenced this pull request Apr 24, 2026
…g fix

R7 P1: the per-draw zero-mass retry in ``_bootstrap_se`` (PR #355 R2 P0)
only covers bootstrap draws, not the fit-time ATT. Survey weights are
non-negative post-resolve() but all-zero mass on either arm is a valid
input that encodes an unidentified target population. Without a fit-
time guard the downstream ``np.average(Y, weights=w_treated)`` and
``omega_eff = unit_weights * w_control`` normalizations would hit 0/0
and silently propagate NaN through the bootstrap / placebo / jackknife
dispatchers. Front-door the case: after ``w_control`` / ``w_treated``
are sourced from the resolved unit-level design, raise ``ValueError``
if either arm's total mass is <= 0. Covers both pweight-only and
strata/PSU/FPC paths. Three regression tests added:
``test_fit_raises_on_zero_total_treated_survey_mass``,
``test_fit_raises_on_zero_total_control_survey_mass``, and
``test_fit_raises_on_zero_treated_mass_under_full_design``.

R7 P3: the SDID row in ``docs/choosing_estimator.rst`` said "pweight
only (placebo / jackknife); full (bootstrap)" in the **Weights** column,
conflating weight-type support (fweight / aweight / pweight) with
design-element support (strata / PSU / FPC). The code still hard-
rejects non-pweight survey designs on every variance method. Narrow
the wording to "pweight only" and leave "Via bootstrap" in the
Strata/PSU/FPC column to describe design-element support.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: d43487305c496e01ae0032325767777a8f5b205a


Overall Assessment

⚠️ Needs changes — one unmitigated P1 finding.

Executive Summary

  • Affected method: SyntheticDiD survey bootstrap, specifically the new weighted Frank-Wolfe + Rao-Wu composition.
  • The previous re-review’s fit-time zero-mass finding is resolved: fit() now front-door rejects zero treated/control survey mass before estimation. diff_diff/synthetic_did.py:L421-L486
  • I did not find an unmitigated paper-to-code mismatch in the weighted-FW/Rao-Wu composition itself; that deviation from plain SDID refit bootstrap is explicitly documented in the Methodology Registry and therefore mitigated. docs/methodology/REGISTRY.md:L1549-L1576
  • P1: invalid fpc full-design inputs that rely on implicit PSUs are not validated on the new SDID bootstrap path, so the Rao-Wu helper throws inside every draw, _bootstrap_se() swallows the ValueError, and users eventually get a generic “Could not produce any valid bootstrap draw” after exhausting retries instead of a targeted FPC/design error. diff_diff/synthetic_did.py:L421-L486, diff_diff/synthetic_did.py:L661-L699, diff_diff/synthetic_did.py:L1020-L1192, diff_diff/bootstrap_utils.py:L695-L750
  • The prior documentation mismatch in docs/choosing_estimator.rst is fixed. docs/choosing_estimator.rst:L785-L823
  • I could not execute the test suite here because this environment lacks numpy and pytest.

Methodology

  • Severity P1. Impact: the new strata/PSU/FPC bootstrap path advertises FPC support, but SyntheticDiD.fit() never validates the unit-collapsed implicit-PSU FPC contract before entering the Rao-Wu retry loop. For SurveyDesign(..., fpc=...) without an explicit PSU (with or without strata), generate_rao_wu_weights() raises on every draw when FPC < effective unit-level PSU count; _bootstrap_se() catches that ValueError and keeps retrying until it emits a generic bootstrap-exhaustion failure. That is a missing assumption check on the newly opened survey-bootstrap path, not just a cosmetic error message issue. Concrete fix: after collapse_survey_to_unit_level(), validate resolved_survey_unit.fpc against the effective unit-level PSU count used by Rao-Wu when psu is None (all units if unstratified; per-stratum unit counts if stratified), and raise a targeted ValueError before calling _bootstrap_se() (sketched after this list). diff_diff/synthetic_did.py:L421-L486, diff_diff/synthetic_did.py:L661-L699, diff_diff/synthetic_did.py:L1020-L1192, diff_diff/bootstrap_utils.py:L695-L750
  • No other methodology findings. The weighted-FW + Rao-Wu composition, its argmin-set caveat, and the remaining placebo/jackknife full-design gap are all documented in docs/methodology/REGISTRY.md. docs/methodology/REGISTRY.md:L1549-L1576
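
A sketch of the front-door check described in the P1 above. The per-stratum rule comes from the finding; the attribute names on the resolved design object are assumptions:

```python
import numpy as np

def validate_implicit_psu_fpc(resolved) -> None:
    """With psu=None each unit is its own PSU, so FPC must cover the unit
    count overall, or within each stratum when stratified."""
    if getattr(resolved, "psu", None) is not None or resolved.fpc is None:
        return
    fpc = np.asarray(resolved.fpc, dtype=float)
    if getattr(resolved, "strata", None) is None:
        if fpc.min() < len(fpc):
            raise ValueError(
                "fpc is below the unit count and no explicit psu= was given; "
                "declare a psu column or raise the FPC."
            )
        return
    strata = np.asarray(resolved.strata)
    for h in np.unique(strata):
        in_h = strata == h
        if fpc[in_h].min() < in_h.sum():
            raise ValueError(
                f"stratum {h!r}: fpc is below the stratum unit count and no "
                "explicit psu= was given; declare a psu column or raise the FPC."
            )
```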

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings. The remaining placebo/jackknife + strata/PSU/FPC gap is properly tracked in TODO.md, so it is mitigated. TODO.md:L108-L108

Security

  • No findings.

Documentation/Tests

  • No additional findings. The docs/support matrix are now internally consistent, and the added tests cover the main survey-bootstrap regressions (zero-mass retries, resolved-weight scale invariance, Rao-Wu slicing, weights=None). I could not execute them here because numpy and pytest are unavailable. docs/choosing_estimator.rst:L785-L823, tests/test_methodology_sdid.py:L598-L995, tests/test_survey_phase5.py:L179-L405

Path to Approval

  1. Add a front-door FPC validation on the unit-collapsed survey design for implicit-PSU cases (SurveyDesign(..., fpc=...) with no PSU, with and without strata) before entering _bootstrap_se().

  2. Add regression tests for:

    • SurveyDesign(weights="wt", fpc="pop") with pop < n_units
    • SurveyDesign(weights="wt", strata="stratum", fpc="pop") with pop_h < n_units_in_stratum

    Both should assert a targeted FPC/design ValueError, not the generic bootstrap-attempt-exhaustion error.

@igerber
Owner Author

igerber commented Apr 24, 2026

/ai-review

igerber added a commit that referenced this pull request Apr 24, 2026
…D bootstrap

When ``SurveyDesign(fpc=...)`` is declared without an explicit
``psu=``, ``bootstrap_utils.generate_rao_wu_weights`` (L654-L655)
treats each unit as its own PSU. The helper rejects
``FPC < n_PSU`` mid-draw (L684-L688), so if FPC is set lower than the
unit count (per stratum if stratified), every bootstrap draw raises
ValueError; ``_bootstrap_se`` swallows the error in its retry loop and
the user eventually sees a generic bootstrap-exhaustion message
instead of a targeted FPC/design error.

Add a front-door validation on ``resolved_survey_unit`` after
``collapse_survey_to_unit_level``:
  - unstratified: fpc >= total unit count;
  - stratified: fpc_h >= per-stratum unit count.

Error messages point at the two actionable fixes (declare an
explicit psu= column, or raise FPC).

Two regression tests added:
``test_fit_raises_on_implicit_psu_fpc_below_unit_count_unstratified``
and ``test_fit_raises_on_implicit_psu_fpc_below_stratum_unit_count``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 7072816bdf3e7bffe7676f4d7dc084a9640ceef7


Overall Assessment

⚠️ Needs changes — one prior P1 finding remains unmitigated.

Executive Summary

  • Re-review result: the earlier zero-mass, pweight scale-invariance, and weighted-centering fixes appear to be present in the current tree, but the previous implicit-PSU FPC blocker is still open. diff_diff/synthetic_did.py:L442-L486, diff_diff/utils.py:L1992-L2115, tests/test_methodology_sdid.py:L845-L972, tests/test_weighted_fw.py:L268-L346
  • Affected method: SyntheticDiD survey bootstrap, i.e. the PR’s weighted-Frank-Wolfe + Rao-Wu composition layered onto SDID and complex-survey resampling.
  • I did not find a new undocumented paper-to-code mismatch in the weighted-FW/Rao-Wu composition itself; that extension and its argmin-set caveat are explicitly documented in the Methodology Registry, so they are informational rather than blocking. docs/methodology/REGISTRY.md:L1549-L1574
  • P1: implicit-PSU fpc full-design inputs (psu=None, fpc set, with or without strata) are still not front-door validated on the reopened SDID bootstrap path. fit() collapses to unit level and enters _bootstrap_se(), generate_rao_wu_weights() raises inside each draw when FPC < n_units (or < n_units_in_stratum), and _bootstrap_se() swallows that ValueError until it eventually emits the generic bootstrap-exhaustion failure. diff_diff/synthetic_did.py:L426-L486, diff_diff/synthetic_did.py:L1038-L1186, diff_diff/bootstrap_utils.py:L723-L729
  • The remaining placebo/jackknife + strata/PSU/FPC gap is properly tracked in TODO.md, so that part is mitigated and not a blocker. TODO.md:L108-L108
  • I could not run the test suite here because this environment is missing numpy, pytest, and pandas.

Methodology

  • Severity P1. Impact: the previous review’s assumption-check bug is still present. SurveyDesign.resolve() only validates FPC >= n_PSU when PSU is explicit, while the reopened SDID bootstrap path collapses to unit level and uses implicit unit-level PSUs when psu=None. On that path, invalid fpc values are rejected only inside generate_rao_wu_weights(), but _bootstrap_se() catches the resulting ValueError and retries until it fails generically. That is still a missing front-door design check on a methodology-critical input path. Concrete fix: after collapse_survey_to_unit_level(), add explicit validation for resolved_survey_unit.psu is None and resolved_survey_unit.fpc is not None: compare fpc against total unit count when unstratified, or per-stratum unit counts when stratified, and raise a targeted FPC/design ValueError before entering _bootstrap_se(). diff_diff/survey.py:L334-L369, diff_diff/synthetic_did.py:L426-L486, diff_diff/synthetic_did.py:L1038-L1172, diff_diff/bootstrap_utils.py:L723-L729
  • Severity P3-informational. Impact: the weighted-FW + Rao-Wu composition itself is documented in REGISTRY.md as a deliberate library extension, including the support matrix, objective form, argmin-set caveat, and single-PSU behavior, so I am not treating that documented deviation as a defect. Concrete fix: none. docs/methodology/REGISTRY.md:L1549-L1574

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3-informational. Impact: full-design survey support for placebo and jackknife remains missing, but it is explicitly tracked in TODO.md, so it is mitigated under this project’s review rules. Concrete fix: none in this PR. TODO.md:L108-L108

Security

  • No findings.

Documentation/Tests

  • Severity P3-informational. Impact: the public docs are now internally aligned with restored survey-bootstrap support for SyntheticDiD; I did not find the earlier support-matrix mismatch. Concrete fix: none. docs/choosing_estimator.rst:L785-L827, docs/methodology/REGISTRY.md:L1549-L1576
  • No additional documentation/test-only findings beyond the missing regression coverage needed to close the Methodology P1.

Path to Approval

  1. Add a front-door implicit-PSU fpc validation in SyntheticDiD.fit() immediately after collapse_survey_to_unit_level(), using unit counts overall or within strata as the effective PSU count when psu=None.

  2. Add regression tests for:

    1. SurveyDesign(weights="wt", fpc="fpc_pop") with fpc_pop < n_units
    2. SurveyDesign(weights="wt", strata="stratum", fpc="fpc_pop") with fpc_pop < n_units_in_stratum

    Both should assert a targeted FPC/design ValueError, not the generic “Could not produce any valid bootstrap draw” error.

@igerber
Owner Author

igerber commented Apr 24, 2026

/ai-review

igerber added a commit that referenced this pull request Apr 24, 2026
The ``Raises`` section of ``SyntheticDiD.fit()`` previously only listed
"non-pweight survey design" among its ``ValueError`` conditions. The
two fit-time guards added in PR #355 R7 and R8 are not reflected in
the docstring:

- R7 P1: zero total survey mass on either arm.
- R8 P1: ``fpc`` declared without explicit ``psu=`` where ``fpc``
  is below the (per-stratum) unit count.

Both raise ``ValueError`` before the bootstrap loop is dispatched.
Expanding the ``Raises`` section makes the contract explicit to
readers (and AI reviewers) cross-referencing the documented behavior
against the implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: bdebecc324e155426d638050a7a071dcaccaf9df


Overall Assessment

⚠️ Needs changes — one prior P1 finding remains unmitigated.

Executive Summary

  • Re-review result: the earlier zero-mass, pweight scale-invariance, and weighted-centering issues appear addressed in the reopened SDID survey-bootstrap path.
  • Affected method(s): SyntheticDiD survey bootstrap (variance_method="bootstrap") and the new weighted Frank-Wolfe survey helpers / Rust kernel used on bootstrap refits.
  • I did not find a new undocumented methodology mismatch in the weighted-FW + Rao-Wu composition itself; the extension, support matrix, and argmin-set caveat are explicitly documented in docs/methodology/REGISTRY.md:L1549-L1574.
  • P1 remains: implicit-PSU fpc inputs (psu=None, fpc set, with or without strata) still are not front-door validated. Invalid FPC is rejected only inside generate_rao_wu_weights(), then swallowed by _bootstrap_se()’s retry loop until the user gets a generic bootstrap-exhaustion error instead of a targeted survey-design error.
  • The remaining placebo/jackknife + strata/PSU/FPC limitation is properly tracked in TODO.md:L108-L108, so that gap is mitigated and not blocking.

Methodology

  • Severity P1. Impact: the prior implicit-PSU FPC assumption-check bug is still open. SurveyDesign.resolve() validates FPC >= n_PSU only when PSU is explicit; SyntheticDiD.fit() then forwards fpc-only full-design inputs into the survey bootstrap; generate_rao_wu_weights() treats each unit as its own PSU when psu is None and raises only mid-draw if FPC < n_units (or < n_units_in_stratum); _bootstrap_se() catches that ValueError and retries until it eventually raises the generic “Could not produce any valid bootstrap draw” failure. That is still a missing methodology-critical front-door design check. Concrete fix: after collapse_survey_to_unit_level(), if resolved_survey_unit.psu is None and resolved_survey_unit.fpc is not None, validate FPC against total unit count when unstratified and per-stratum unit counts when stratified, and raise a targeted ValueError before entering _bootstrap_se(). diff_diff/survey.py:L334-L369, diff_diff/synthetic_did.py:L421-L486, diff_diff/synthetic_did.py:L674-L699, diff_diff/synthetic_did.py:L983-L1185, diff_diff/bootstrap_utils.py:L695-L729
  • Severity P3-informational. Impact: the weighted-FW + Rao-Wu survey-bootstrap composition itself is documented in the Methodology Registry, including the support matrix and the deliberate argmin-set caveat, so I am not treating that documented deviation as a defect. Concrete fix: none. docs/methodology/REGISTRY.md:L1549-L1574

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3-informational. Impact: full-design survey support for placebo and jackknife remains missing, but it is explicitly tracked in TODO.md, so it is mitigated under this project’s review rules. Concrete fix: none in this PR. TODO.md:L108-L108

Security

  • No findings.

Documentation/Tests

  • No independent findings beyond the open Methodology P1. The changed docs are aligned with restored survey-bootstrap support, and the new tests cover the resolved zero-mass, scale-invariance, slice-ordering, and single-PSU cases; they still do not regression-cover the implicit-PSU fpc path. Concrete fix: add the two targeted implicit-PSU FPC tests listed below. tests/test_methodology_sdid.py:L621-L995

Path to Approval

  1. Add front-door implicit-PSU fpc validation in SyntheticDiD.fit() immediately after collapse_survey_to_unit_level(), using total unit count overall or within strata as the effective PSU count when psu=None.

  2. Add regression tests for:

    1. SurveyDesign(weights="wt", fpc="fpc_pop") with fpc_pop < n_units
    2. SurveyDesign(weights="wt", strata="stratum", fpc="fpc_pop") with fpc_pop < n_units_in_stratum

    Both should assert a targeted FPC/design ValueError, not the generic bootstrap-exhaustion error.
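
To make the requested distinction concrete, here is a minimal standalone sketch of the retry-swallow failure mode this finding describes; all names are hypothetical and this is not the library's code:

```python
import numpy as np

def bootstrap_se_with_retry(draw_fn, n_boot=200, max_retries=2000, seed=0):
    """Sketch of a retry loop that swallows per-draw ValueErrors. A
    deterministic design error (e.g. an invalid FPC rejected deep inside
    the weight generator) fails on every attempt, so the user only ever
    sees the generic exhaustion message below."""
    rng = np.random.default_rng(seed)
    draws, attempts = [], 0
    while len(draws) < n_boot and attempts < max_retries:
        attempts += 1
        try:
            draws.append(draw_fn(rng))
        except ValueError:
            continue  # swallowed: retried, never reported to the user
    if not draws:
        raise RuntimeError("Could not produce any valid bootstrap draw")
    return float(np.std(draws, ddof=1))

def bad_draw(rng):
    # Stands in for a draw whose FPC check always fails when psu=None.
    raise ValueError("FPC must be >= number of PSUs")

# bootstrap_se_with_retry(bad_draw) -> generic RuntimeError, not the
# targeted FPC/design ValueError the review asks fit() to raise up front.
```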

igerber and others added 2 commits April 24, 2026 07:46
Foundation for restoring SDID survey-bootstrap support (PR #352, follow-up
to #351 which front-door rejected all survey designs). This commit adds the
weighted-FW kernel + Python wrappers; the bootstrap integration lands in
the next commit.

Rust (rust/src/weights.rs, rust/src/lib.rs):
- New `sc_weight_fw_gram_weighted` and `sc_weight_fw_standard_weighted`
  loop variants. Identical to the unweighted loops except for the
  regularization term: `half_grad[j]` picks up `eta*reg_w[j]*lam[j]` in
  place of `eta*lam[j]`, and the FW step-size denominator uses the
  diag(reg_w)-weighted simplex direction norm
  `Σ_j reg_w[j]*d[j]²` (which simplifies to
  `Σ_j reg_w[j]*lam[j]² + reg_w[i] - 2*reg_w[i]*lam[i]` for d = e_i - lam).
- New `sc_weight_fw_weighted_internal` dispatcher that delegates to the
  unweighted internal when reg_weights is None (preserves the legacy
  numeric contract for any future caller that wants the generic shape).
- Two new pyfunctions: `sc_weight_fw_weighted` and
  `sc_weight_fw_weighted_with_convergence`. Same call shape as the
  existing unweighted siblings plus a trailing `reg_weights` kwarg.
  Registered in lib.rs.
- 3 new Rust unit tests in rust/src/weights.rs:
    * test_weighted_fw_reg_weights_none_delegates — bit-identity at
      rel=1e-14 against the unweighted internal.
    * test_weighted_fw_uniform_reg_weights_matches_unweighted — uniform
      rw=1 collapses to uniform regularization (rel=1e-12, allowing for
      ULP-scale drift from different float reduction orders).
    * test_weighted_fw_simplex_invariants — for arbitrary positive rw
      and both gram (T0<N) and standard (T0>=N) paths, returned ω sums to
      1 and is non-negative.

Python (diff_diff/utils.py, diff_diff/_backend.py):
- Export _rust_sc_weight_fw_weighted and _with_convergence from _backend
  (mirrors the shape added for _rust_sc_weight_fw_with_convergence in
  PR #351 c0d089b).
- Extend `_sc_weight_fw` and `_sc_weight_fw_numpy` with a
  `reg_weights: Optional[np.ndarray] = None` kwarg. When set on the Rust
  path, dispatches to the new weighted pyfunctions; on the pure-Python
  path, runs a weighted FW loop mirroring the Rust derivation.
- New helper `compute_sdid_unit_weights_survey(Y_pre_control,
  Y_pre_treated_mean, rw_control, ...)`: column-scales Y_pre_control by
  rw_control and passes rw_control as reg_weights so the FW solves the
  unit-weight survey-bootstrap objective
    min_{ω simplex} Σ_t (Σ_i rw_i·ω_i·Y_i,pre[t] - treated_pre[t])² +
                    ζ²·Σ_i rw_i·ω_i²
  Two-pass sparsify-refit structure mirrors compute_sdid_unit_weights.
  Returns ω on the standard simplex (caller composes ω_eff downstream);
  a standalone sketch of this weighted-FW loop follows this commit message.
- New helper `compute_time_weights_survey(Y_pre_control, Y_post_control,
  rw_control, ...)`: row-scales Y_time by sqrt(rw_control) and passes
  no reg_weights (uniform reg on λ — λ is per-period, rw is per-control,
  no alignment for per-λ weighting). Two-pass structure unchanged.
- Both new helpers expose `return_convergence=True` returning the AND of
  the two pass convergence flags, mirroring the contract added in
  PR #351 c0d089b.

Tests (tests/test_weighted_fw.py — new, 15 tests):
- _sc_weight_fw weighted-reg path: reg_weights=None matches unweighted at
  bit-identity; uniform reg matches unweighted at rel=1e-12;
  Rust/numpy parity at rel=1e-9; simplex invariants under arbitrary rw;
  return_convergence tuple shape.
- compute_sdid_unit_weights_survey: uniform-rw equivalence to unweighted
  helper, simplex invariants under arbitrary rw, shape-mismatch raises,
  return_convergence AND.
- compute_time_weights_survey: same coverage matrix, plus a zero-rw
  subset test (Rao-Wu-style undrawn PSU yields valid simplex λ).
- Backend parity: pure-Python vs Rust weighted-helper output at rel=1e-7
  for both unit and time helpers (monkeypatches HAS_RUST_BACKEND).

ABI preservation: existing unweighted callers of _sc_weight_fw,
compute_sdid_unit_weights, compute_time_weights are unaffected — the new
kwarg defaults to None and dispatches to the legacy code path. The
bit-identity check on
TestScaleEquivariance::test_baseline_parity_small_scale[bootstrap] still
passes at rel=1e-14 (verified in the next commit when the bootstrap
integration lands).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
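
Since the commit message above fully states the weighted objective, the gradient term, and the step-size denominator, a standalone sketch of the iteration may help. Shapes and defaults are assumptions; this is an illustrative reconstruction, not the Rust kernel or the library's Python fallback:

```python
import numpy as np

def weighted_fw(Y_pre_control, treated_pre, rw, zeta, n_iter=10_000):
    """Frank-Wolfe on the simplex for the survey-weighted unit-weight
    objective quoted in the commit message:
        min_w  sum_t (sum_i rw_i * w_i * Y[i, t] - treated_pre[t])**2
               + zeta**2 * sum_i rw_i * w_i**2
    Assumed shapes: Y_pre_control is (T0, N) with controls as columns,
    treated_pre is (T0,), rw is (N,) and positive."""
    A = Y_pre_control * rw            # column i scaled by rw_i
    eta = zeta ** 2
    n = A.shape[1]
    w = np.full(n, 1.0 / n)           # start at the simplex barycenter
    for _ in range(n_iter):
        resid = A @ w - treated_pre
        half_grad = A.T @ resid + eta * rw * w  # reg term carries rw_j
        d = -w.copy()
        d[np.argmin(half_grad)] += 1.0          # FW direction e_i - w
        denom = np.dot(A @ d, A @ d) + eta * np.dot(rw, d * d)
        if denom <= 0.0:
            break                               # w already sits at the vertex
        gamma = np.clip(-np.dot(half_grad, d) / denom, 0.0, 1.0)
        if gamma == 0.0:
            break                               # FW gap closed
        w += gamma * d                          # convex step stays on simplex
    return w
```

With rw set to all ones, both the loss scaling and the regularization collapse to the unweighted objective, which matches the uniform-reg equivalence the tests above pin at rel=1e-12.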
…sition

PR #352 restores the SDID survey-bootstrap capability that PR #351 front-
door rejected as a known regression. Pweight-only and full-design surveys
now both succeed; placebo / jackknife continue to reject full designs (a
separate methodology gap tracked in TODO.md).

`diff_diff/synthetic_did.py::fit` (guards):
- Replace the unconditional strata/PSU/FPC NotImpl guard with a method-
  gated version that fires only for placebo / jackknife. Rationale +
  truth-table in REGISTRY.md §SyntheticDiD survey-support matrix:

      method     pweight-only      strata/PSU/FPC
      bootstrap  ✓ (this PR)       ✓ Rao-Wu (this PR)
      placebo    ✓ unchanged       ✗ NotImpl (separate gap)
      jackknife  ✓ unchanged       ✗ NotImpl (separate gap)

- Delete the unconditional `bootstrap + any-survey` guard added in #351.
  Keep the `weight_type != "pweight"` validation (fweight/aweight still
  rejected).

`diff_diff/synthetic_did.py::fit` (survey resolution):
- After validating the per-unit survey weights (`w_treated`, `w_control`),
  also collapse the observation-level `resolved_survey` to a unit-level
  view via `collapse_survey_to_unit_level(...)` ordered as
  `[*control_units, *treated_units]`. The resulting
  `resolved_survey_unit` is what `_bootstrap_se` slices via
  `boot_rw[:n_control]` / `boot_rw[n_control:]` per Rao-Wu draw.

`diff_diff/synthetic_did.py::fit` (dispatcher):
- Branch the bootstrap call on whether the design is pweight-only or
  full design (strata/PSU/FPC). Pass `w_control`/`w_treated` for
  pweight-only, `resolved_survey=resolved_survey_unit` for full design,
  None/None for non-survey.

`diff_diff/synthetic_did.py::_bootstrap_se`:
- New kwargs: `w_control`, `w_treated`, `resolved_survey` (all keyword-
  only, default None — preserves the legacy signature).
- Single-PSU short-circuit: unstratified survey with <2 PSUs returns
  (NaN, []) since the bootstrap distribution is unidentified
  (resampling one PSU yields the same subset every draw). Recovered from
  the pre-PR-#351 fixed-weight Rao-Wu branch (commit 91082e5).
- Per-draw Rao-Wu rescaling for full designs:
  ``rw = generate_rao_wu_weights(resolved_survey, rng)`` sliced over the
  resampled units. Pweight-only path uses ``rw = w_control[boot_idx]``
  (constant per draw, no rescaling).
- Survey-weighted treated-unit means: ``np.average(..., weights=rw_treated_draw)``
  when survey weights are present.
- Warm-start: the simplex init scales by rw before sum_normalize when on
  the survey path, matching the per-draw weighted-FW geometry.
- Per-draw FW dispatch: survey paths call the new
  ``compute_sdid_unit_weights_survey`` / ``compute_time_weights_survey``
  helpers (PR #352 commit 1) which run the weighted-FW kernel; non-
  survey paths continue to call the unweighted helpers (bit-identity
  preserved on the non-survey refit path).
- Post-FW composition: ``ω_eff = rw·ω / Σ(rw·ω)`` for the SDID
  estimator (which expects simplex weights). Degenerate-retry if
  ``Σ(rw·ω) <= 0`` (all mass on rw=0 controls). A per-draw sketch of
  this composition follows this commit message.
- Aggregate FW non-convergence warning: tally is the AND of the two
  helpers' convergence flags per draw, fires above 5% (PR #351 c0d089b
  shape preserved, no copy change).

Tests:
- ``tests/test_survey_phase5.py``: rewrite three PR #351 raises-tests as
  succeeds-tests with explicit SE assertions —
    * ``test_full_design_bootstrap_succeeds`` (was ``_raises``):
      finite SE, populated survey_metadata.n_strata/n_psu, summary()
      includes Survey Design + Bootstrap replications blocks.
    * ``test_bootstrap_with_pweight_only_succeeds`` (was ``_raises``):
      finite SE, variance_method preserved (cross-surface guard).
    * New ``test_bootstrap_full_design_se_differs_from_pweight_only``
      resurrects the PR #351 R3-deleted differs-from contract: ATT
      matches between paths (both compose ω_eff post-fit) but SE
      differs (Rao-Wu adds PSU clustering variance).
- ``tests/test_methodology_sdid.py::TestBootstrapSE``: rewrite two PR #351
  raises-tests as succeeds-tests, plus add the
  ``test_bootstrap_single_psu_returns_nan`` short-circuit regression.

Verified: 308 tests pass across test_methodology_sdid /
test_business_report SDID subset / test_rust_backend / test_survey_phase5
/ test_weighted_fw / test_guides.

Bit-identity check: the non-survey refit path goes through the
unweighted helpers (no weighted-FW dispatch), so
``TestScaleEquivariance::test_baseline_parity_small_scale[bootstrap]``
remains at rel=1e-14 — verified passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
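
A per-draw sketch of the composition described above, reusing the weighted_fw sketch from the previous commit message; rao_wu_fn stands in for a fresh generate_rao_wu_weights(resolved_survey_unit, rng) call, and all shapes and names are assumptions rather than the library's signatures:

```python
import numpy as np

def one_survey_bootstrap_draw(Y_pre_control, treated_pre, w_control, zeta,
                              rng, rao_wu_fn=None):
    """One draw of the survey bootstrap loop: pairs-resample the controls,
    pick the per-draw weights, refit omega under the weighted objective,
    then compose omega_eff = rw * omega / sum(rw * omega)."""
    n_control = Y_pre_control.shape[1]
    boot_idx = rng.integers(0, n_control, size=n_control)  # pairs resample
    if rao_wu_fn is not None:
        rw = rao_wu_fn(rng)[boot_idx]   # full design: per-draw Rao-Wu factors
    else:
        rw = w_control[boot_idx]        # pweight-only: constant per draw
    omega = weighted_fw(Y_pre_control[:, boot_idx], treated_pre, rw, zeta)
    omega_scaled = rw * omega
    if omega_scaled.sum() <= 0:         # all FW mass on rw == 0 controls
        return None                     # caller treats this as a retried draw
    return omega_scaled / omega_scaled.sum()   # omega_eff on the simplex
```

Per the commit, only the full-design branch re-draws Rao-Wu factors each iteration; the pweight-only branch keeps the resampled base weights fixed.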
igerber and others added 3 commits April 24, 2026 07:48
…g fix

R7 P1: the per-draw zero-mass retry in ``_bootstrap_se`` (PR #355 R2 P0)
only covers bootstrap draws, not the fit-time ATT. Survey weights are
non-negative post-resolve() but all-zero mass on either arm is a valid
input that encodes an unidentified target population. Without a fit-
time guard the downstream ``np.average(Y, weights=w_treated)`` and
``omega_eff = unit_weights * w_control`` normalizations would hit 0/0
and silently propagate NaN through the bootstrap / placebo / jackknife
dispatchers. Front-door the case: after ``w_control`` / ``w_treated``
are sourced from the resolved unit-level design, raise ``ValueError``
if either arm's total mass is <= 0. Covers both pweight-only and
strata/PSU/FPC paths. Three regression tests added:
``test_fit_raises_on_zero_total_treated_survey_mass``,
``test_fit_raises_on_zero_total_control_survey_mass``, and
``test_fit_raises_on_zero_treated_mass_under_full_design``.

R7 P3: the SDID row in ``docs/choosing_estimator.rst`` said "pweight
only (placebo / jackknife); full (bootstrap)" in the **Weights** column,
conflating weight-type support (fweight / aweight / pweight) with
design-element support (strata / PSU / FPC). The code still hard-
rejects non-pweight survey designs on every variance method. Narrow
the wording to "pweight only" and leave "Via bootstrap" in the
Strata/PSU/FPC column to describe design-element support.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
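
A minimal sketch of the R7 P1 fit-time guard described above; the helper name is hypothetical (the library raises inside fit()):

```python
import numpy as np

def check_arm_mass(w_treated, w_control):
    """Front-door the zero-mass case. Survey weights are non-negative
    after resolve(), but all-zero mass on an arm encodes an unidentified
    target population; failing here keeps the 0/0 NaN out of np.average
    and the omega_eff normalization downstream."""
    for arm, w in (("treated", np.asarray(w_treated, dtype=float)),
                   ("control", np.asarray(w_control, dtype=float))):
        if w.sum() <= 0:
            raise ValueError(
                f"Total survey weight on the {arm} arm is zero; the "
                "survey-weighted estimand is unidentified."
            )
```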
…D bootstrap

When ``SurveyDesign(fpc=...)`` is declared without an explicit
``psu=``, ``bootstrap_utils.generate_rao_wu_weights`` (L654-L655)
treats each unit as its own PSU. The helper rejects
``FPC < n_PSU`` mid-draw (L684-L688), so if FPC is set lower than the
unit count (per stratum if stratified), every bootstrap draw raises
ValueError; ``_bootstrap_se`` swallows the error in its retry loop and
the user eventually sees a generic bootstrap-exhaustion message
instead of a targeted FPC/design error.

Add a front-door validation on ``resolved_survey_unit`` after
``collapse_survey_to_unit_level``:
  - unstratified: fpc >= total unit count;
  - stratified: fpc_h >= per-stratum unit count.

Error messages point at the two actionable fixes (declare an
explicit psu= column, or raise FPC).

Two regression tests added:
``test_fit_raises_on_implicit_psu_fpc_below_unit_count_unstratified``
and ``test_fit_raises_on_implicit_psu_fpc_below_stratum_unit_count``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
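
A sketch of the front-door validation this commit describes, over assumed array-based inputs (the library validates the resolved unit-level design object instead):

```python
import numpy as np

def validate_implicit_psu_fpc(fpc, strata=None):
    """With psu=None, generate_rao_wu_weights treats each unit as its own
    PSU, so an FPC below the (per-stratum) unit count would raise on
    every draw and be swallowed by the retry loop. Validate once, up
    front. Assumed inputs: fpc is a per-unit array of population counts,
    strata an optional per-unit array of stratum labels."""
    fpc = np.asarray(fpc, dtype=float)
    if strata is None:
        groups = [np.arange(len(fpc))]
    else:
        strata = np.asarray(strata)
        groups = [np.flatnonzero(strata == h) for h in np.unique(strata)]
    for idx in groups:
        if np.any(fpc[idx] < len(idx)):
            raise ValueError(
                "fpc is below the unit count it applies to but no psu= "
                "column was declared; declare an explicit psu= column or "
                "raise the FPC."
            )
```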
The ``Raises`` section of ``SyntheticDiD.fit()`` previously only listed
"non-pweight survey design" among its ``ValueError`` conditions. The
two fit-time guards added in PR #355 R7 and R8 were not reflected in
the docstring:

- R7 P1: zero total survey mass on either arm.
- R8 P1: ``fpc`` declared without explicit ``psu=`` where ``fpc``
  is below the (per-stratum) unit count.

Both raise ``ValueError`` before the bootstrap loop is dispatched.
Expanding the ``Raises`` section makes the contract explicit to
readers (and AI reviewers) cross-referencing the documented behavior
against the implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber igerber force-pushed the sdid-bootstrap-survey branch from bdebecc to 1a20c16 Compare April 24, 2026 11:50
@igerber
Owner Author

igerber commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 1a20c16f181723b61576d80f0850f579e057d79e


Overall Assessment

Looks good — no unmitigated P0/P1 findings.

Executive Summary

  • Re-review result: the prior implicit-PSU fpc blocker is addressed. SyntheticDiD.fit() now front-door validates implicit-PSU fpc before the bootstrap retry loop, which closes the earlier generic bootstrap-exhaustion failure mode. diff_diff/synthetic_did.py:L451-L496, tests/test_methodology_sdid.py:L912-L961
  • I did not find a new undocumented methodology mismatch in the restored SDID survey-bootstrap path. The weighted-FW + Rao-Wu composition, support matrix, and argmin-set caveat are explicitly documented in the Methodology Registry. docs/methodology/REGISTRY.md:L1549-L1574
  • The reopened path now has solid regression coverage for the previously risky edges: zero treated/control survey mass, pweight scale invariance, weights=None, Rao-Wu slice ordering, and full-design success cases. tests/test_methodology_sdid.py:L674-L1010, tests/test_weighted_fw.py:L49-L258, tests/test_survey_phase5.py:L179-L405
  • The remaining full-design gap is limited to variance_method='placebo' / 'jackknife', and that limitation is properly tracked in TODO.md, so it is not blocking under this repo’s review rules. TODO.md:L107-L108
  • I could not execute the test suite in this review environment because pytest and numpy are unavailable.

Methodology

No unmitigated P0/P1 findings. The previous fit-time implicit-PSU fpc validation issue appears fixed by the new upstream checks in fit(). diff_diff/synthetic_did.py:L451-L496

  • Severity P3-informational. Impact: the restored survey-bootstrap path uses a weighted Frank-Wolfe + Rao-Wu composition that is not paper/R-native, but that deviation is explicitly documented, including the support matrix and argmin-set caveat, so I am not treating it as a defect. Concrete fix: none. docs/methodology/REGISTRY.md:L1549-L1574

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3-informational. Impact: full-design survey support is still missing for placebo and jackknife, but the limitation is explicitly tracked as deferred follow-up work, so it is mitigated and non-blocking. Concrete fix: none in this PR. TODO.md:L107-L108, diff_diff/synthetic_did.py:L345-L363

Security

  • No findings.

Documentation/Tests

  • Severity P3-informational. Impact: benchmarks/data/sdid_coverage.json now contains bare NaN tokens for unsupported stratified_survey cells, which Python will parse but strict JSON parsers will reject. That makes the committed artifact non-standard JSON even though current repo-local consumers tolerate it. Concrete fix: emit null (or omit unsupported cells) from _summarize()/serialization and keep the schema test/docs aligned. benchmarks/python/coverage_sdid.py:L299-L314, benchmarks/python/coverage_sdid.py:L473-L476, benchmarks/data/sdid_coverage.json:L135-L168
  • No other findings. The added regression coverage materially improves protection around the restored survey-bootstrap path. tests/test_methodology_sdid.py:L674-L1010, tests/test_weighted_fw.py:L49-L258, tests/test_survey_phase5.py:L179-L405

The SDID coverage MC artifact carried bare ``NaN`` tokens for the
``stratified_survey`` × ``placebo`` / ``jackknife`` cells (unsupported
by design — strata/PSU/FPC raises at fit-time). Python's ``json``
module tolerates those tokens on read, but strict JSON parsers reject
them, making the committed artifact non-strict.

``_summarize`` now returns ``None`` (serializes as ``null``) instead
of ``float('nan')`` on the all-failed branch, and ``json.dump`` is
called with ``allow_nan=False`` so any stray non-finite value fails
loudly instead of serializing as a bare ``NaN`` / ``Infinity`` token.
The committed artifact has been patched in place (bare ``NaN`` →
``null``) and strict-loader-verified; no regen needed since the
numeric content on the previously-NaN cells was definitionally
absent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
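
The serialization pattern this commit describes, as a simplified stand-in (the real _summarize returns a richer structure):

```python
import json

def summarize_cell(values):
    """All-failed cells return None (serialized as null) rather than
    float('nan')."""
    return sum(values) / len(values) if values else None

def write_artifact(rows, path):
    """allow_nan=False makes json.dump raise ValueError on any stray
    non-finite float instead of silently emitting a bare NaN / Infinity
    token that strict JSON parsers reject."""
    with open(path, "w") as fh:
        json.dump(rows, fh, indent=2, allow_nan=False)
```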
@igerber
Owner Author

igerber commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: c29febf67b0c6c0f717b89ddafe9da97277606d0


Overall Assessment

⚠️ Needs changes — one unmitigated P1 finding.

Executive Summary

  • Previous re-review items look addressed: the implicit-PSU fpc validation now front-doors before the bootstrap retry loop, and the coverage artifact is back to strict JSON (null + allow_nan=False). diff_diff/synthetic_did.py:L456-L496, tests/test_methodology_sdid.py:L912-L960, benchmarks/python/coverage_sdid.py:L299-L335, benchmarks/python/coverage_sdid.py:L482-L488, benchmarks/data/sdid_coverage.json:L135-L168
  • I did not find a new undocumented mismatch in the restored weighted-FW + Rao-Wu bootstrap branch itself; the survey support matrix, hybrid composition, and argmin-set caveat are documented and the code matches that design. docs/methodology/REGISTRY.md:L1549-L1574, diff_diff/synthetic_did.py:L716-L754, diff_diff/synthetic_did.py:L980-L1246
  • Severity P1 [Newly identified]. The new fit-time survey “positive-mass” guard only checks raw control-arm mass, not the composed control support actually used by the estimator. A survey fit can still hit omega_eff.sum() == 0 when zero-weight controls absorb all Frank-Wolfe mass, which yields NaN output instead of a targeted identification error. diff_diff/survey.py:L171-L176, diff_diff/utils.py:L1525-L1546, diff_diff/utils.py:L1812-L1835, diff_diff/synthetic_did.py:L514-L541, diff_diff/synthetic_did.py:L654-L660
  • The remaining placebo/jackknife + strata/PSU/FPC gap is explicitly tracked in TODO.md, so it is non-blocking. TODO.md:L107-L108, diff_diff/synthetic_did.py:L345-L363
  • I could not execute the tests here because numpy, pandas, and pytest are unavailable in this environment.

Methodology

  • Severity P1 [Newly identified]. Impact: the restored survey-support path still lacks a fit-time effective-support check for controls. Zero survey weights are valid input, compute_sdid_unit_weights() intentionally sparsifies to exact zeros, and fit() then normalizes omega_eff = unit_weights * w_control without checking that the composed vector has positive total mass. If all nonzero FW mass lands on zero-weight controls, the point estimate becomes undefined and the fit returns NaNs instead of a targeted identification error. The code already recognizes the analogous issue in the bootstrap loop (omega_scaled.sum() <= 0 retry) and in jackknife (effective_control > 0 guard), so this looks like a missing assumption check rather than an intentional methodology choice. Concrete fix: after composing omega_eff, raise ValueError when omega_eff.sum() <= 0 (or no positive effective controls remain), before normalizing; add a regression that forces the unweighted FW solution onto zero-weight controls. diff_diff/survey.py:L171-L176, diff_diff/utils.py:L1525-L1546, diff_diff/utils.py:L1812-L1835, diff_diff/synthetic_did.py:L514-L541, diff_diff/synthetic_did.py:L654-L660, diff_diff/synthetic_did.py:L1186-L1190, diff_diff/synthetic_did.py:L1579-L1590, tests/test_methodology_sdid.py:L845-L910

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3-informational. Impact: full-design survey support is still missing for variance_method="placebo" and "jackknife", but that limitation is explicitly tracked in TODO.md, so it is mitigated and non-blocking under this repo’s review rules. Concrete fix: none required in this PR. TODO.md:L107-L108, diff_diff/synthetic_did.py:L345-L363

Security

  • No findings.

Documentation/Tests

  • No findings. The prior non-standard JSON artifact issue appears fixed, and the added tests materially improve coverage around the restored survey-bootstrap path: weighted-FW kernel parity, Rao-Wu slicing, weights=None, zero-treated-mass retries, pweight rescaling invariance, and single-PSU NaN behavior. benchmarks/python/coverage_sdid.py:L299-L335, benchmarks/python/coverage_sdid.py:L482-L488, tests/test_weighted_fw.py:L49-L407, tests/test_methodology_sdid.py:L598-L1046, tests/test_survey_phase5.py:L179-L405
  • I could not run the test suite in this environment because numpy, pandas, and pytest are not installed.

Path to Approval

  1. Add a fit-time effective-control guard in SyntheticDiD.fit() after omega_eff = unit_weights * w_control and before normalization; if the composed vector has zero total mass, raise a targeted ValueError instead of returning NaNs. diff_diff/synthetic_did.py:L654-L660
  2. Add regression tests for both pweight-only and full-design survey fits where a zero-weight control is engineered to absorb all fit-time FW mass, and assert the new targeted error. tests/test_methodology_sdid.py:L845-L910

The R7 P1 positive-mass guard checks only the raw survey mass
(``w_control.sum() > 0``). That's not sufficient: Frank-Wolfe
sparsifies ``unit_weights`` to exact zeros by design, so even when
at least one control has positive survey weight, the FW solution may
concentrate all mass on controls whose survey weights are 0. The
composed ``omega_eff = unit_weights * w_control`` then sums to 0 and
the normalization step (``omega_eff / omega_eff.sum()``) hits 0/0,
silently propagating NaN into the ATT, SE, and CI.

Front-door the case: after composing ``omega_eff``, raise
``ValueError`` before the normalization when
``omega_eff.sum() <= 0``. The analogous guards already exist in the
bootstrap loop (``omega_scaled.sum() <= 0`` retry) and jackknife
(``effective_control > 0`` support gate); this restores the contract
at fit time.

Two regression tests cover both dispatch branches (pweight-only and
strata/PSU/FPC). Both monkeypatch ``compute_sdid_unit_weights`` to
return a canonical sparse unit-weight vector concentrated on a
zero-survey-weight control — more reliable than wrestling with FW
convergence dynamics on synthetic data. The ``fit()`` docstring
``Raises`` section is updated to list the new condition.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
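
A sketch of the composition-plus-guard this commit lands; the helper name is hypothetical (the library raises inside SyntheticDiD.fit()):

```python
import numpy as np

def compose_omega_eff(unit_weights, w_control):
    """Frank-Wolfe sparsifies unit_weights to exact zeros, so even with
    positive raw survey mass the composed vector can be all-zero when
    every nonzero FW weight sits on a zero-weight control. Guard before
    normalizing rather than letting 0/0 propagate NaN."""
    omega_eff = (np.asarray(unit_weights, dtype=float)
                 * np.asarray(w_control, dtype=float))
    total = omega_eff.sum()
    if total <= 0:
        raise ValueError(
            "All synthetic-control mass fell on controls with zero survey "
            "weight; the survey-weighted estimand has no effective support."
        )
    return omega_eff / total
```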
@igerber
Owner Author

igerber commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: fb2dd9049c7d08efe38dde2300dab5084becc9ec


Overall Assessment

⚠️ Needs changes — one unmitigated P1 finding.

Executive Summary

  • The prior re-review P1 is addressed: SyntheticDiD.fit() now guards zero effective control support before normalizing omega_eff, with regressions for both pweight-only and full-design survey fits. diff_diff/synthetic_did.py:L664-L691, tests/test_methodology_sdid.py:L912-L1005
  • I did not find a new undocumented mismatch in the core weighted-FW + Rao-Wu bootstrap implementation; the changed estimator path matches the new registry notes and support matrix. docs/methodology/REGISTRY.md:L1549-L1575, diff_diff/synthetic_did.py:L748-L786, diff_diff/synthetic_did.py:L1012-L1325, diff_diff/utils.py:L1846-L2115
  • Severity P1. The new stratified_survey calibration DGP returns post_periods=range(7, 12) even though generate_survey_did_data() is 1-indexed and cohort 7 implies post periods 7..12. Because fit() treats every omitted period as pre, period 12 is silently included in the pre window, contaminating the committed survey-bootstrap calibration row and the REGISTRY text that relies on it. benchmarks/python/coverage_sdid.py:L186-L217, diff_diff/prep_dgp.py:L1211-L1215, diff_diff/prep_dgp.py:L1557-L1558, diff_diff/synthetic_did.py:L381-L388, benchmarks/data/sdid_coverage.json:L135-L170, docs/methodology/REGISTRY.md:L1576-L1595
  • The remaining placebo/jackknife + strata/PSU/FPC limitation is explicitly tracked in TODO.md, so it is non-blocking. TODO.md:L107-L108, diff_diff/synthetic_did.py:L345-L370
  • I could not run tests here because numpy, pandas, and pytest are unavailable in this environment.

Methodology

  • Severity P1. Impact: the PR’s new full-design SDID calibration evidence is generated from a mis-specified post window. With n_periods=12 and 1-indexed periods, omitting period 12 from post_periods means the last treated period is treated as pre-treatment by SyntheticDiD.fit(), so the reported stratified_survey × bootstrap rejection rates are not validating the stated 6-pre/6-post survey DGP. Concrete fix: change _stratified_survey_dgp() to return the full post tail ([7, 8, 9, 10, 11, 12], or derive it from df["period"].max()), then regenerate benchmarks/data/sdid_coverage.json and the REGISTRY table/commentary that transcribe from it. benchmarks/python/coverage_sdid.py:L186-L217, diff_diff/prep_dgp.py:L1211-L1215, diff_diff/prep_dgp.py:L1557-L1558, docs/methodology/REGISTRY.md:L1576-L1595

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3-informational. Impact: full-design survey support is still missing for variance_method="placebo" and "jackknife", but that gap is explicitly tracked and documented, so it is non-blocking under this repo’s review rules. Concrete fix: none required in this PR. TODO.md:L107-L108, docs/methodology/REGISTRY.md:L1549-L1561

Security

  • No findings.

Documentation/Tests

  • Severity P2. Impact: the updated coverage-artifact smoke test verifies schema and the calibration gate, but not that the new stratified_survey DGP’s post_periods form the intended contiguous post-treatment tail; that is why the off-by-one DGP bug can ship with a passing artifact check. Concrete fix: add a small regression test or harness assertion that _stratified_survey_dgp() returns post_periods == [7, 8, 9, 10, 11, 12] (or equivalently max(post_periods) == df["period"].max() and no omitted periods after onset). benchmarks/python/coverage_sdid.py:L186-L217, tests/test_methodology_sdid.py:L3432-L3505

Path to Approval

  1. Fix the stratified_survey DGP to include all post-treatment periods for the 1-indexed survey generator. benchmarks/python/coverage_sdid.py:L186-L217, diff_diff/prep_dgp.py:L1211-L1215, diff_diff/prep_dgp.py:L1557-L1558
  2. Regenerate benchmarks/data/sdid_coverage.json and update the REGISTRY calibration table/text so they reflect the corrected survey-bootstrap benchmark. benchmarks/data/sdid_coverage.json:L135-L170, docs/methodology/REGISTRY.md:L1576-L1595
  3. Add a regression test that fails if the survey coverage DGP drops any period after treatment onset from post_periods. tests/test_methodology_sdid.py:L3432-L3505

``generate_survey_did_data`` is 1-indexed (prep_dgp.py L1211-L1212),
so ``n_periods=12`` with ``cohort_periods=[7]`` emits periods 1..12
with post = [7, 8, 9, 10, 11, 12]. The coverage harness'
``_stratified_survey_dgp`` returned ``list(range(7, 12))`` =
[7, 8, 9, 10, 11], silently dropping period 12 into the pre window.
SDID therefore fit the panel as 7-pre/5-post instead of the
documented 6-pre/6-post, and every rejection / mean SE cell in the
survey-bootstrap calibration row (plus the REGISTRY narrative
transcribed from it) was derived from the mis-specified window.

Fix: derive post_periods from ``df["period"].max()`` so any change
to ``n_periods`` propagates. Regression test
``test_stratified_survey_dgp_post_periods_cover_full_post_tail``
fails fast if a future refactor reintroduces the off-by-one (checks
unique / sorted / contiguous / max == df.period.max() plus the
explicit [7, 8, 9, 10, 11, 12] shape).

Regenerated only the stratified_survey block and spliced it into
the main artifact (other DGPs unaffected — their seeds and DGP code
are unchanged). New rejection rates at α = {0.01, 0.05, 0.10}:
{0.024, 0.058, 0.094}; mean SE / true SD drops from 1.25 to 1.13.
Rejection at α=0.05 remains well inside the calibration gate
[0.02, 0.10]. REGISTRY table row and narrative updated to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
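
The shape of the fix, as a sketch; the helper name is hypothetical (the harness function is _stratified_survey_dgp) and the column name 'period' is taken from the commit message:

```python
import pandas as pd

def post_periods_for(df, onset=7):
    """Derive the post window from the panel itself instead of hardcoding
    a range, so any change to n_periods propagates."""
    last = int(df["period"].max())
    return list(range(onset, last + 1))

# On the 1-indexed 12-period survey DGP this restores the full post tail:
df = pd.DataFrame({"period": list(range(1, 13)) * 2})
assert post_periods_for(df) == [7, 8, 9, 10, 11, 12]
```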
@igerber
Owner Author

igerber commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 08056e482fa9edbc915277cb647efa5675be8333


Overall Assessment

Looks good — the prior P1 on the stratified_survey post window is fixed, and I did not find a new unmitigated P0/P1 in the weighted-FW + Rao-Wu SDID bootstrap changes.

Executive Summary

  • Previous blocker resolved: _stratified_survey_dgp() now derives the full post tail through period 12, and a regression test pins the [7, 8, 9, 10, 11, 12] contract. benchmarks/python/coverage_sdid.py:L186-L224, tests/test_methodology_sdid.py:L3522-L3564
  • The changed SDID survey-bootstrap path matches the documented registry contract: resolved unit-level survey weights, weighted FW unit/time refits, Rao-Wu slicing over boot_idx, and zero-mass / zero-support guards. docs/methodology/REGISTRY.md:L1549-L1597, diff_diff/synthetic_did.py:L442-L826, diff_diff/utils.py:L1303-L2075, diff_diff/bootstrap_utils.py:L657-L785
  • Remaining full-design placebo / jackknife limits are explicitly tracked, so they are non-blocking in this PR. TODO.md:L107-L108, docs/methodology/REGISTRY.md:L1549-L1561
  • Added tests cover the main survey-bootstrap regressions from the prior review: pweight rescaling invariance, zero-mass draws, FPC front-door validation, Rao-Wu slicing, single-PSU NaN handling, and the coverage-DGP post-tail fix. tests/test_methodology_sdid.py:L674-L1141, tests/test_methodology_sdid.py:L3522-L3564
  • I could not execute the test suite here because pytest is not installed in this environment.

Methodology

  • No findings.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3-informational
  • Impact: Full-design survey support still stops at variance_method="bootstrap"; placebo and jackknife with strata/PSU/FPC remain deferred. This is explicitly tracked and documented, so it is not a blocker for this PR. TODO.md:L107-L108, docs/methodology/REGISTRY.md:L1549-L1561
  • Concrete fix: None required in this PR.

Security

  • No findings.

Documentation/Tests

  • No findings. The documentation and tests were updated consistently with the restored survey-bootstrap support, and the prior calibration-artifact defect now has direct regression coverage. docs/methodology/REGISTRY.md:L1576-L1597, tests/test_methodology_sdid.py:L3497-L3564

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 24, 2026
The R13 P1 regression test ``test_stratified_survey_dgp_post_periods_cover_full_post_tail``
imports ``benchmarks.python.coverage_sdid`` directly to exercise the
private ``_stratified_survey_dgp`` helper. CI's isolated-install job
deliberately copies only ``tests/``, not ``benchmarks/``, so the
module import failed with ``ModuleNotFoundError`` on CI runners that
install the package into a fresh site-packages and then run the test
suite against that install.

The target is a benchmarking harness helper, not shipped package
code, so the natural home is ``benchmarks/python/``. Moving it there
keeps the test runnable locally (developer invokes explicitly before
regenerating the coverage MC artifact) and out of CI's collection
(``pyproject.toml testpaths = ["tests"]`` scopes default discovery to
``tests/`` only, so the new file never interferes).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026
Rebased onto current main (resolved CHANGELOG.md conflict: the by_path
bullet from PR #355 and the profile_panel/autonomous-guide bullet from
this PR now live side-by-side under [Unreleased]).

has_always_treated now has binary-only semantics:
- For binary treatment (absorbing or non-absorbing): unit_min == 1
  means the unit is treated in every observed period (no
  pre-treatment information in the DiD sense).
- For continuous treatment: always False. Pre-treatment periods on
  continuous DiD are determined by the separate `first_treat` column
  supplied to `ContinuousDiD.fit`, not by whether the dose is
  positive. A unit with a constant positive dose can still have
  well-defined pre-treatment periods, so flagging it as
  "always-treated / no pre-treatment information" was factually wrong
  and triggered the misleading `has_always_treated_units` alert on
  valid continuous panels.
- Categorical: False by construction.

Guide §2 has_always_treated field doc updated to state the
binary-only semantics explicitly, with a note about `first_treat`.

Tests:
- New: test_continuous_positive_dose_does_not_fire_has_always_treated
  asserts has_always_treated=False AND the alert does not fire on a
  constant-positive-dose continuous panel.
- Existing test_continuous_zero_dose_controls_flag_has_never_treated
  updated: has_always_treated expected to be False (was True under
  the old semantics).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
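
The binary-only semantics above reduce to a small predicate; a sketch with a hypothetical name and signature:

```python
def has_always_treated(treatment_type: str, unit_min: float) -> bool:
    """Only binary panels infer always-treated from unit_min == 1;
    continuous pre-periods come from the separate first_treat column,
    and the categorical case is False by construction."""
    if treatment_type == "binary":
        return unit_min == 1
    return False  # continuous and categorical panels are never flagged
```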
@igerber igerber merged commit 4590632 into main Apr 24, 2026
19 checks passed
@igerber igerber deleted the sdid-bootstrap-survey branch April 24, 2026 15:16
igerber added a commit that referenced this pull request Apr 24, 2026
Rebased onto current main (17 commits clean — PR #355, #358, #359 all
merged since last rebase).

StaggeredTripleDifference corrected as panel-only + balance-enforced.
The earlier §4.10 RCS wording paired TripleDifference /
StaggeredTripleDifference together in the Explicit RCS support list,
but REGISTRY.md §StaggeredTripleDifference requires a balanced panel
and staggered_triple_diff.py:93-109 has no panel=False mode — fit()
rejects unbalanced/duplicate (unit, time) structure at
staggered_triple_diff.py:846-864.

- §4.10 Explicit RCS support: TripleDifference (two-period) only;
  StaggeredTripleDifference removed from the supported set.
- §4.10 Explicitly rejected for RCS: StaggeredTripleDifference added
  with a concrete "no panel=False mode" + "use TripleDifference for
  cross-sectional DDD" pointer.
- §3 Balanced-panel eligibility: StaggeredTripleDifference added to
  the balance-sensitive gate.

Regression tests extended:
- Balanced-panel proximity check now covers StaggeredTripleDifference.
- §4.10 section test asserts StaggeredTripleDifference appears in the
  Explicitly rejected block and NOT in the Explicit RCS support block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026
…er-fix

Fix CHANGELOG: rename survey-bootstrap PR placeholder #352 to #355
igerber added a commit that referenced this pull request Apr 24, 2026
Closes the last SDID survey gap (TODO.md row 107). PR #355 restored
variance_method="bootstrap" for strata/PSU/FPC via hybrid pairs-
bootstrap + Rao-Wu + weighted-FW. This commit extends the same full-
design capability to variance_method="placebo" and "jackknife".

Placebo allocator — stratified permutation (Pesarin 2001).
Pseudo-treated indices drawn within each stratum containing actual
treated units; weighted-FW re-estimates ω and λ per draw with per-
control survey weights threaded into both loss and regularization
(reuses compute_sdid_unit_weights_survey + compute_time_weights_survey
from PR #355). New private method _placebo_variance_se_survey.

Fit-time front-door guards (per feedback_front_door_over_retry_swallow.md)
distinguish two infeasible permutation configurations with targeted
ValueError messages: Case B (stratum with treated units has zero
controls) and Case C (stratum with treated units has fewer controls
than treated). Partial-permutation fallback rejected — it silently
changes the null-distribution semantics.

Jackknife allocator — PSU-level leave-one-out with stratum
aggregation (Rust & Rao 1996). SE² = Σ_h (1-f_h)·(n_h-1)/n_h·
Σ_{j∈h}(τ̂_{(h,j)} - τ̄_h)². FPC form: f_h = n_h_sampled / fpc[h]
(population-count form from survey.py::SurveyDesign.resolve;
confirmed via survey.py:338-356 where fpc_h < n_psu_h is the
validation constraint). λ held fixed across LOOs; ω subset + rw-
composed-renormalized (matches Arkhangelsky Algorithm 3 non-survey
semantics — jackknife is variance-approximation, not refit-variance).
Strata with n_h < 2 skip silently; total-zero-variance → NaN +
UserWarning. Unstratified designs with PSU treated as single-stratum
JK1. New private method _jackknife_se_survey.

Gate relaxation — deletes the placebo+jackknife+strata/PSU/FPC raise
at synthetic_did.py:352-369. Replicate-weight gate at L329-337
unchanged (separate methodology; closed-form replicate variance
double-counts with Rao-Wu-like rescaling). fit() dispatcher adds
_placebo_use_survey_path / _jackknife_use_survey_path flags routing
to the new methods when appropriate; non-survey and pweight-only
paths bit-identical by construction (guarded by the same branch
isolation pattern used in PR #355 _bootstrap_se).

Allocator asymmetry — placebo ignores PSU axis; jackknife respects
it. Intentional: placebo is a null-distribution test (stratified unit-
level permutation is classical — PSU-level permutation on few PSUs is
near-degenerate), while jackknife is a design-based variance
approximation (PSU-level LOO is canonical per Rust & Rao). Both
respect strata. Rationale documented in method docstrings and
REGISTRY (follow-up commit).

Tests — tests/test_survey_phase5.py:
- TestSyntheticDiDSurvey: flip test_full_design_placebo_raises and
  test_full_design_jackknife_raises from NotImplementedError→succeeds;
  assert finite SE > 0, populated survey_metadata, .summary() round-trip.
- TestSDIDSurveyPlaceboFullDesign (new class): pseudo-treated-stays-
  within-treated-strata (monkeypatched recorder), Case B / Case C
  front-door guards (targeted ValueError match), se-differs-from-
  pweight-only, deterministic dispatch.
- TestSDIDSurveyJackknifeFullDesign (new class): stratum-aggregation
  self-consistency, fpc-reduces-se magnitude (SE_fpc = SE_nofpc/sqrt(2)
  at f=0.5, rtol=1e-10), se-differs-from-pweight-only, single-PSU-
  stratum silently skipped, unstratified short-circuit, all-strata-
  skipped UserWarning + NaN, deterministic dispatch.

Non-survey and pweight-only regressions — all 32 tests in
TestBootstrapSE + TestPlaceboSE + TestJackknifeSE pass unchanged;
bit-identity preserved by the new-path-gating pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
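
Since the commit states both allocators' algorithms explicitly, two standalone sketches may help. Both are reconstructions over assumed array-based inputs, not the new private methods:

```python
import numpy as np

def stratified_jackknife_se(loo_estimates, strata, fpc=None):
    """Stratum-aggregated PSU-level leave-one-out SE from the formula in
    the commit: SE^2 = sum_h (1-f_h) * (n_h-1)/n_h * sum_{j in h}
    (tau_hj - taubar_h)^2, with f_h = n_h / fpc[h]. loo_estimates[j] is
    the estimate with PSU j left out; fpc maps stratum -> population PSU
    count. Degenerate strata (n_h < 2) skip silently, as documented."""
    loo = np.asarray(loo_estimates, dtype=float)
    strata = np.asarray(strata)
    var = 0.0
    for h in np.unique(strata):
        tau_h = loo[strata == h]
        n_h = tau_h.size
        if n_h < 2:
            continue
        f_h = n_h / float(fpc[h]) if fpc is not None else 0.0
        var += (1.0 - f_h) * (n_h - 1) / n_h * np.sum(
            (tau_h - tau_h.mean()) ** 2)
    return float(np.sqrt(var))

def stratified_pseudo_treated(strata, is_treated, rng):
    """Stratified-permutation placebo allocator: within each stratum that
    contains treated units, draw that many pseudo-treated indices from
    the stratum's controls. Case B (no controls) and Case C (fewer
    controls than treated) are assumed rejected front-door at fit time."""
    strata = np.asarray(strata)
    is_treated = np.asarray(is_treated, dtype=bool)
    pseudo = []
    for h in np.unique(strata):
        in_h = strata == h
        k = int(is_treated[in_h].sum())
        if k == 0:
            continue                 # strata with no treated units skipped
        controls = np.flatnonzero(in_h & ~is_treated)
        pseudo.extend(rng.choice(controls, size=k, replace=False))
    return np.asarray(pseudo)
```

The asymmetry the commit documents is visible here: the jackknife sketch iterates over PSU-level leave-one-out estimates within strata, while the placebo sketch permutes at the unit level and never touches the PSU axis.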
igerber added a commit that referenced this pull request Apr 24, 2026
…placebo, jackknife)

Second commit for the SDID survey-placebo/jackknife PR. Extends the
coverage Monte Carlo artifact with jackknife on the stratified_survey
DGP (bootstrap calibration unchanged); promotes the deferred REGISTRY
§SyntheticDiD gap bullets to two landed Notes; updates user-facing
docs to reflect restored capability.

Coverage MC changes
-------------------
* benchmarks/python/coverage_sdid.py: _stratified_survey_design now
  returns ("bootstrap", "jackknife") on the methods tuple. Placebo is
  omitted because the DGP's cohort packs into a single stratum with 0
  never-treated units — stratified-permutation placebo is structurally
  infeasible on this DGP (raises Case C at fit-time). Module docstring
  explains the exclusion and the jackknife anti-conservatism caveat.
* benchmarks/data/sdid_coverage.json: regenerated stratified_survey
  block at n_seeds=500, n_bootstrap=200. Bootstrap validates near-
  nominal (α=0.05 rejection = 0.058, SE/trueSD = 1.13). Jackknife row
  reports α=0.05 rejection = 0.45, SE/trueSD = 0.46 — documented anti-
  conservatism from the stratified jackknife formula with 2 PSUs per
  stratum (1 effective DoF per stratum, Rust & Rao 1996 limitation).

REGISTRY.md §SyntheticDiD
-------------------------
* Survey support matrix updated: all three variance methods now
  support strata/PSU/FPC (not just bootstrap).
* Two new landed Notes:
  - "Note (survey + placebo composition)": stratified-permutation
    allocator, weighted-FW refit, ω_eff composition, fit-time
    feasibility guards (Case B / Case C), scope note on what is NOT
    randomized (within-stratum PSU axis). Cites Pesarin (2001) /
    Pesarin & Salmaso (2010).
  - "Note (survey + jackknife composition)": PSU-level LOO algorithm,
    explicit stratum-aggregation SE² formula, FPC handling (population-
    count form from survey.py:338-356), fixed-weights rationale,
    degenerate-LOO skip semantics, scope note, known anti-conservatism
    with few PSUs per stratum. Cites Rust & Rao (1996).
* "Allocator asymmetry" paragraph in the survey support matrix
  documents the intentional asymmetry (placebo ignores PSU, jackknife
  respects it) with rationale rooted in each method's role (null-
  distribution test vs design-based variance approximation).
* Coverage MC table adds the stratified_survey × jackknife row with
  anti-conservatism narrative; placebo row explicitly marked N/A-on-
  this-DGP (with pointer to the unit-test coverage).
* Requirements checklist entries updated to describe full-design
  support for placebo and jackknife.

Docs sweep
----------
* docs/methodology/survey-theory.md: new bullets describing the
  stratified-permutation placebo allocator and the PSU-level LOO
  jackknife, parallel to the existing hybrid-bootstrap bullet.
* docs/tutorials/16_survey_did.ipynb cell 35: support matrix SDID
  row updated from "bootstrap only (PR #352)" to "Full (all three
  variance methods)"; legend amended; "Note on SyntheticDiD" block
  rewritten to describe all three allocators with the jackknife
  few-PSU caveat.
* docs/survey-roadmap.md: Phase 5 matrix row closes the placebo/
  jackknife gap; Phase 6 bullet updated to describe all three
  allocators; Current Limitations table entry removed (only replicate-
  weight limitation remains, merged into one row).
* CHANGELOG.md: "### Added" entry for placebo + jackknife full-design
  support (no new section header — folded into existing Unreleased
  block); "### Changed (PR #355)" tweaked to note the separate
  follow-up for placebo/jackknife.
* TODO.md row 107 deleted (capability gap closed).
* diff_diff/synthetic_did.py __init__ docstring: survey_design
  parameter description rewritten to describe all three methods.
  Placebo fallback-guidance comment updated to remove stale "placebo
  and jackknife reject strata/PSU/FPC" line.
* diff_diff/guides/llms-full.txt: Phase 5 bootstrap bullet updated
  to describe all three survey allocators (UTF-8 fingerprint
  preserved — `D'Haultfœuille` still appears throughout).
* tests/test_methodology_sdid.py::TestCoverageMCArtifact: narrative
  and assertions updated to reflect that placebo=0-fits is expected
  structurally on stratified_survey (documented Case C), while
  jackknife now runs successfully with the known anti-conservatism
  caveat intentionally unasserted at the calibration-gate level.

Verification
------------
* pytest tests/test_survey_phase5.py::TestSDIDSurveyPlaceboFullDesign
  tests/test_survey_phase5.py::TestSDIDSurveyJackknifeFullDesign
  tests/test_survey_phase5.py::TestSyntheticDiDSurvey
  tests/test_methodology_sdid.py::{TestBootstrapSE,TestPlaceboSE,TestJackknifeSE,TestCoverageMCArtifact}
  tests/test_guides.py → 82 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
…trap/jackknife only

P1 (Methodology — implicit-PSU FPC validator leaked into placebo):
PR #355 R8 P1 added a fit-time validator that rejects ``psu=None``
+ ``fpc < n_units`` designs, because Rao-Wu bootstrap treats each
unit as its own PSU and would fail mid-draw with the bootstrap loop
swallowing the error as a generic exhaustion message. The validator
ran unconditionally on every survey fit.

After R8 documented FPC as a placebo no-op (Pesarin 2001 §1.5 —
permutation tests condition on the observed sample), this validator
became inconsistent: a placebo fit with low FPC and no explicit
``psu`` would still raise a "FPC must be ≥ n_units" error for a
constraint that doesn't apply to the placebo math.

Fix: gate the implicit-PSU FPC validator on
``self.variance_method in ("bootstrap", "jackknife")``. Both methods
genuinely consume FPC (Rao-Wu rescaling for bootstrap, Rust & Rao
``(1 - f_h)`` factor for jackknife). Placebo proceeds to the
documented no-op warning path regardless of FPC value.

New regression
``test_placebo_low_fpc_no_psu_warns_no_validator_block``:
sets ``fpc_col = 5`` (well below n_units=30) with no PSU. Asserts
(a) placebo fit succeeds, (b) emits the documented FPC-no-op
``UserWarning``, (c) SE matches the no-FPC pweight-only fit at
``rel=1e-12``, AND (d) bootstrap on the same low-FPC design still
raises the validator error (gating preserves bootstrap/jackknife
behavior — only placebo's FPC contract changes).

Verification: 97 passed (1 new low-FPC placebo regression; existing
bootstrap/jackknife FPC validation regressions still fire on their
fixtures).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
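
The gating condition reduces to a small predicate; a sketch with a hypothetical name:

```python
def fpc_validator_applies(variance_method, psu, fpc):
    """Only bootstrap (Rao-Wu rescaling) and jackknife (the (1 - f_h)
    factor) actually consume FPC; placebo conditions on the observed
    sample, so the implicit-PSU FPC validator must not fire for it."""
    return (
        variance_method in ("bootstrap", "jackknife")
        and psu is None
        and fpc is not None
    )
```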