
Restore SDID survey-bootstrap via weighted Frank-Wolfe + Rao-Wu composition #355

Merged
igerber merged 16 commits into main from sdid-bootstrap-survey
Apr 24, 2026
Conversation

@igerber
Owner

@igerber igerber commented Apr 24, 2026

Summary

  • Follow-up to PR #351 (Add SyntheticDiD variance_method='bootstrap_refit' and coverage MC study), restoring the SDID survey-bootstrap capability that PR #351 front-door rejected as a known regression.
  • Adds a weighted Frank-Wolfe Rust kernel (sc_weight_fw_weighted + _with_convergence sibling) accepting per-coordinate reg_weights, plus a matching pure-Python fallback. The new Python helpers compute_sdid_unit_weights_survey / compute_time_weights_survey thread the weighted objective through the two-pass sparsify-refit dispatcher.
  • SyntheticDiD._bootstrap_se reintroduces the Rao-Wu branch: per-draw rw = w_control[boot_idx] for pweight-only or rw = generate_rao_wu_weights(resolved_survey_unit, rng) for full design, composed with the weighted-FW helpers; post-FW ω_eff = rw·ω / Σ(rw·ω) feeds compute_sdid_estimator (sketched below).
  • Removes the PR #351 bootstrap × any-survey and unconditional strata/PSU/FPC × any-method guards; replaces them with a method-gated version that still rejects strata/PSU/FPC for placebo / jackknife (separate methodology gap tracked in TODO.md).
  • Coverage MC extended with a 4th stratified_survey DGP; regenerated benchmarks/data/sdid_coverage.json — the new row's α=0.05 rejection is 0.042 (inside the [0.02, 0.10] calibration gate), mean SE / true SD = 1.25 (slightly conservative, the safer direction).
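
For orientation, a minimal runnable sketch of the per-draw composition described above. The helper names come from this PR, but the signatures and data here are illustrative placeholders, not the library's exact internals:

```python
import numpy as np

def compose_effective_weights(rw: np.ndarray, omega: np.ndarray) -> np.ndarray:
    """Post-FW composition from the summary: omega_eff = rw*omega / sum(rw*omega)."""
    eff = rw * omega
    return eff / eff.sum()

# Per-draw replicate weights (pweight-only branch): slice the user's
# pweights by the bootstrap resample indices.
rng = np.random.default_rng(0)
w_control = np.array([1.0, 0.5, 2.0, 1.5])
boot_idx = rng.choice(len(w_control), size=len(w_control), replace=True)
rw = w_control[boot_idx]
# omega would come from re-running the weighted-FW helper on the resampled
# columns; a uniform placeholder keeps this sketch self-contained.
omega = np.full(len(rw), 1.0 / len(rw))
print(compose_effective_weights(rw, omega))  # sums to 1 by construction
```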

Methodology references (required if estimator / math changes)

  • Method name(s): SyntheticDiD survey-weighted bootstrap (weighted Frank-Wolfe + Rao-Wu composition)
  • Paper / source link(s): Arkhangelsky et al. (2021) Synthetic Difference-in-Differences, AER 111(12), 4088-4118 (Algorithm 2 step 2); Rao & Wu (1988) Resampling Inference with Complex Survey Data, JASA 83(401); Rao, Wu & Yue (1992) Some Recent Work on Resampling Methods for Complex Surveys, Survey Methodology 18(2). No external R/Julia anchor exists for survey-weighted SDID FW (neither package defines it); validation rests on the coverage-MC calibration row.
  • Intentional deviations: the weighted-FW objective min ||A·diag(rw)·ω - b||² + ζ²·Σ rw_i ω_i² returns ω on the simplex and downstream composes ω_eff = rw·ω / Σ(rw·ω). This is NOT the same as directly minimizing the standard SDID loss on ω_eff — the argmin sets differ because of the non-constant scaling factor in the ω↔ω_eff reparameterization. Intentional design mirroring the spirit of the pre-PR-#351 Rao-Wu composition but with ω re-estimated per draw under the weighted objective (so weight-estimation uncertainty propagates). See REGISTRY.md §SyntheticDiD Note (survey + bootstrap composition) for the full derivation.
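
In display form, the deviation above is

```math
\hat{\omega} = \arg\min_{\omega \in \Delta}\;
\bigl\lVert A\,\operatorname{diag}(rw)\,\omega - b \bigr\rVert^2
+ \zeta^2 \sum_i rw_i\,\omega_i^2,
\qquad
\omega^{\mathrm{eff}}_i = \frac{rw_i\,\hat{\omega}_i}{\sum_j rw_j\,\hat{\omega}_j},
```

which, as noted, is not the argmin of the standard SDID loss evaluated at ω_eff, because the ω ↦ ω_eff map rescales coordinates non-uniformly.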

Validation

  • Tests added/updated: tests/test_weighted_fw.py (new; 15 tests covering the Rust kernel + Python wrappers + survey helpers + backend parity), tests/test_methodology_sdid.py::TestBootstrapSE (two PR #351 raises-tests rewritten as succeeds-tests + new test_bootstrap_single_psu_returns_nan short-circuit regression + coverage-MC schema updated to include stratified_survey row with explicit calibration-gate assertion), tests/test_survey_phase5.py::TestSyntheticDiDSurvey (three PR #351 raises-tests rewritten as succeeds-tests with explicit SE assertions + new test_bootstrap_full_design_se_differs_from_pweight_only contract).
  • Backtest / simulation / notebook evidence: regenerated benchmarks/data/sdid_coverage.json (500 seeds × B=200, 40 min wall-clock on M-series Mac with Rust backend). Stratified-survey bootstrap rejection @ α=0.05 = 0.042 (inside [0.02, 0.10] gate).

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

🤖 Generated with Claude Code

@github-actions

Overall Assessment

⚠️ Needs changes — I found two unmitigated P1 issues in the new SDID survey-bootstrap path.

Executive Summary

  • The new ω-side weighted Frank-Wolfe composition and the single-PSU NaN short-circuit are documented in REGISTRY.md, so I did not treat those deliberate deviations as defects.
  • compute_time_weights_survey() does not actually solve the weighted λ objective stated in the registry once intercept=True; the helper row-scales by sqrt(rw) and then unweighted-centers the transformed matrix, which changes the problem for non-uniform survey weights.
  • The newly supported full-design survey path still assumes SurveyDesign.weights names a real column. Because SurveyDesign.resolve() allows weights=None and synthesizes ones, valid SurveyDesign(strata=..., psu=...) inputs can still fail before bootstrap starts.
  • I did not find a new inline-inference / partial-NaN anti-pattern; final inference still flows through safe_inference().
  • Several user-facing docs/messages still say SDID bootstrap rejects all survey designs, and the new tests do not cover either blocker above.

Methodology

  • P1 Impact: compute_time_weights_survey() is documented as solving Σ_u rw_u (Σ_t λ_t Y_pre[t,u] - Y_post_mean[u])² + ζ²||λ||², but the implementation row-scales Y_time by sqrt(rw) and then passes intercept=True into _sc_weight_fw(), whose centering is an unweighted column mean. That is not equivalent to the stated weighted-loss problem once weights vary, so every non-uniform survey bootstrap draw refits λ on the wrong objective and can bias the reported survey bootstrap SEs. Concrete fix: weighted-center Y_time by rw_control / rw_control.sum() and call _sc_weight_fw(..., intercept=False), or extend _sc_weight_fw to support weighted centering directly; then add a small direct-optimizer regression for non-uniform rw_control. References: diff_diff/utils.py:L1977-L2042, diff_diff/utils.py:L1440-L1443, diff_diff/synthetic_did.py:L1051-L1065, docs/methodology/REGISTRY.md:L1563-L1567
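
A minimal numpy sketch of the proposed fix, assuming the λ design matrix is laid out units × time (the orientation and names here are illustrative, not the library's exact internals). The equivalence it relies on: with an intercept, the rw-weighted least-squares optimum drives the rw-weighted mean residual to zero, so rw-weighted-centering both sides and solving with intercept=False solves the same problem:

```python
import numpy as np

def weighted_center_then_scale(A: np.ndarray, b: np.ndarray, rw: np.ndarray):
    """rw-weighted centering of a units-by-time design matrix and target,
    so a subsequent intercept-free solve matches the weighted objective."""
    p = rw / rw.sum()                      # normalized unit weights
    A_c = A - p @ A                        # subtract rw-weighted column means
    b_c = b - p @ b
    # WLS row-scaling: minimizing sum_u rw_u * r_u^2 equals OLS on
    # sqrt(rw)-scaled rows of the centered system.
    return np.sqrt(rw)[:, None] * A_c, np.sqrt(rw) * b_c

# The pre-fix bug was the reverse order: sqrt(rw) row-scaling first, then
# an UNWEIGHTED column-centering inside the solver (intercept=True).
```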

Code Quality

  • P1 Impact: the PR newly front-doors full-design survey bootstrap, but SyntheticDiD.fit() still extracts unit weights through _extract_unit_survey_weights(data, ..., survey_design, ...), and that helper directly indexes survey_design.weights. SurveyDesign.resolve() explicitly supports weights=None by synthesizing unit weights of 1, so a valid SurveyDesign(strata=..., psu=...) input now reaches this path and can fail before the advertised bootstrap branch runs. Concrete fix: derive w_control / w_treated from resolved_survey.weights or resolved_survey_unit.weights when survey_design.weights is None, or reject weights=None up front with a clear error; add a regression test for a weight-less full design. References: diff_diff/synthetic_did.py:L415-L437, diff_diff/survey.py:L160-L207, diff_diff/survey.py:L1051-L1071
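
A hedged sketch of the first suggested fix. The helper is named in the review (`_extract_unit_survey_weights`); the body, `unit_col` argument, and groupby strategy below are assumptions for illustration:

```python
import numpy as np
import pandas as pd

def extract_unit_survey_weights(data: pd.DataFrame, unit_col: str,
                                unit_order, survey_design) -> np.ndarray:
    # Mirror SurveyDesign.resolve(): weights=None means implicit unit
    # weights of 1, so never index a nonexistent weights column.
    if survey_design.weights is None:
        return np.ones(len(unit_order))
    per_unit = data.groupby(unit_col)[survey_design.weights].first()
    return per_unit.reindex(unit_order).to_numpy(dtype=float)
```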

Performance

  • No material findings in the changed code.

Maintainability

  • No material findings beyond the blocking issues above.

Tech Debt

  • No separately tracked TODO.md item mitigates the two P1 findings above.

Security

  • No material findings.

Documentation/Tests

Path to Approval

  1. Fix compute_time_weights_survey() so the survey bootstrap λ refit matches the documented weighted objective under intercept=True, and add a non-uniform-weight regression that checks the helper against a direct constrained optimizer.
  2. Handle SurveyDesign(weights=None, strata=..., psu=...) explicitly in SyntheticDiD.fit() by either supporting implicit unit weights of 1 from the resolved survey object or rejecting that configuration with a clear front-door error; add a regression test for the chosen behavior.

igerber added a commit that referenced this pull request Apr 24, 2026
Fixes two P1 issues flagged by the CI reviewer on the initial submission
of PR #352.

P1 Methodology — `compute_time_weights_survey` was documented as solving
the WLS-style weighted λ objective
  min Σ_u rw_u·(Σ_t λ_t·Y_u,pre[t] - Y_u,post_mean)² + ζ²·||λ||²
but row-scaled Y by sqrt(rw) and then handed the scaled matrix to
`_sc_weight_fw(intercept=True)`, whose column-centering uses an
UNWEIGHTED mean across controls. That is not the weighted objective
once rw varies, so non-uniform survey bootstrap draws were refitting λ
on the wrong objective and could bias the reported SE.

Fix: weighted-center `Y_time` BEFORE the sqrt(rw) row-scaling, using
`col_weighted_mean = (Y_time * rw).sum(0) / rw.sum()`, and pass
`intercept=False` to the kernel so no additional unweighted centering
happens on the scaled matrix. Both two-pass calls updated.
`compute_sdid_unit_weights_survey` is unchanged — its column-centering
is PER-UNIT (time means within each control column), which is
independent of rw.

P1 Code Quality — `SurveyDesign(weights=None, strata=..., psu=...)` is
a valid configuration (`SurveyDesign.resolve()` synthesizes ones when
weights is None), but `_extract_unit_survey_weights` indexed
`survey_design.weights` as if it were always a column name, so the
groupby would fail with a KeyError before the bootstrap branch could
run. Fix: `_extract_unit_survey_weights` now short-circuits to a vector
of ones of length `len(unit_order)` when `survey_design.weights is
None`, matching `SurveyDesign.resolve()`'s semantics.

Regression tests:
- `test_non_uniform_rw_beats_unweighted_centering_variant`
  (test_weighted_fw.py): reproduces the pre-fix buggy variant (row-
  scale Y by sqrt(rw), then call `_sc_weight_fw(intercept=True)`) and
  asserts the fixed path's weighted SSR is strictly ≤ the buggy
  variant's weighted SSR. If a future revert reintroduces
  intercept=True after the row-scaling, this test fails.
- `test_bootstrap_full_design_without_explicit_weights`
  (test_methodology_sdid.py): `SurveyDesign(strata=..., psu=...)` with
  no explicit `weights` column now succeeds on the bootstrap path;
  survey_metadata populated with n_strata / n_psu.

P3 Documentation:
- `SyntheticDiD.fit()` docstring (survey_design parameter + Raises
  block): replace "bootstrap rejects all survey designs" language with
  the PR #352 support-matrix truth-table (bootstrap ✓ for both pweight-
  only and full design; placebo/jackknife ✓ pweight-only, ✗ full
  design).
- `_placebo_variance_se` fallback-guidance messages (two sites): drop
  the "strata/PSU/FPC not yet supported by any SDID variance method"
  framing; recommend bootstrap for full-design survey fallback,
  jackknife for pweight-only, adding controls as the universal
  fallback.
- `docs/survey-roadmap.md` Current Limitations table: collapse the two
  SDID bootstrap-rejection rows into a single row for placebo/
  jackknife + full design (the bootstrap + full design row no longer
  applies).

Verified: 75 targeted tests pass (test_weighted_fw + TestBootstrapSE +
TestScaleEquivariance + TestCoverageMCArtifact + test_survey_phase5).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 099797e7edeff956ac39c586e3924fbd57d0a11c


Overall Assessment

Blocker — one unmitigated P0 remains in the new pweight-only SDID survey-bootstrap path.

Executive Summary

Methodology

Arkhangelsky et al.’s Algorithm 2 bootstraps SDID by resampling rows with replacement and discarding resamples with no treated or no control units; Rao & Wu (1988) and Rao, Wu & Yue (1992) are the cited complex-survey resampling references for the survey branch. The PR’s weighted-FW survey composition is a documented registry deviation, so I treated it as methodology-acceptable for this review.

  • P0 Impact: fit-time survey SDID always uses survey-weighted treated means when w_treated is present, but _bootstrap_se() only retries zero-mass treated draws for the Rao-Wu/full-design branch. In the new pweight-only branch, rw_treated_draw.sum() == 0 falls through to np.mean(...), silently turning a zero-weight treated resample into an unweighted one. Because zero survey weights are allowed as long as not all weights are zero, this is a reachable supported input, and it directly corrupts the bootstrap draw distribution used for se, p_value, and conf_int. Concrete fix: when survey weights are active, treat rw_treated_draw.sum() == 0 as a degenerate draw and continue/retry, just as the full-design Rao-Wu branch already does; do not ever fall back to the unweighted treated mean while w_treated is in force. References: diff_diff/synthetic_did.py:L535-L567, diff_diff/synthetic_did.py:L988-L1038, diff_diff/survey.py:L160-L176, docs/methodology/REGISTRY.md:L1557-L1574
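
A sketch of the retry contract the fix asks for, with illustrative names. The only point being made: a zero-mass treated (or control) draw is skipped and redrawn, never averaged unweighted:

```python
import numpy as np

def draw_valid_survey_resample(rng, n_total, treated_mask, rw,
                               max_attempts=1000) -> np.ndarray:
    """Return resample indices whose treated and control survey mass are
    both positive; degenerate draws are retried, not silently downgraded."""
    for _ in range(max_attempts):
        idx = rng.choice(n_total, size=n_total, replace=True)
        rw_draw = rw[idx]
        is_t = treated_mask[idx]
        if rw_draw[is_t].sum() > 0 and rw_draw[~is_t].sum() > 0:
            return idx
    raise RuntimeError("Could not produce a valid survey bootstrap draw.")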

Code Quality

Performance

  • No material findings in the changed path.

Maintainability

  • No material findings beyond the blocker above.

Tech Debt

  • No blocking findings. The remaining full-design placebo/jackknife gap is explicitly tracked in TODO.md, so I treated it as mitigated/non-blocking for this review. Reference: TODO.md:L107-L108

Security

  • No material findings.

Documentation/Tests

Path to Approval

  1. In SyntheticDiD._bootstrap_se(), make zero-sum treated survey mass a degenerate survey bootstrap draw in the pweight-only branch as well; retry/skip it instead of falling back to np.mean(...).
  2. Add a regression test for pweight-only bootstrap with a zero-weight treated unit to prove that zero-mass treated draws are skipped and cannot contaminate the reported bootstrap SE/inference.

Execution note: I did not run the test suite in this environment because the available python lacks numpy; the review above is from diff and source inspection.

@igerber
Owner Author

igerber commented Apr 24, 2026

/ai-review

igerber added a commit that referenced this pull request Apr 24, 2026
…path

CI review flagged a silent-failure P0: when a bootstrap draw's treated
units all had survey weight 0 under a pweight-only survey design, the
code fell through to ``np.mean(Y_boot_pre_t, axis=1)`` (unweighted
mean). That silently dropped survey weighting for that draw while the
fit-time ATT uses the survey-weighted treated mean — corrupting the
bootstrap distribution used for SE, p-value, and CI. Reachable with
any `SurveyDesign(weights=...)` input where at least one treated unit
has pweight 0 (a valid configuration per
``SurveyDesign.resolve()``'s spec) and the bootstrap resample picks
only those units.

Fix: extend the Rao-Wu degenerate-retry in ``_bootstrap_se`` to the
pweight-only branch as well. When ANY survey path is active and
``rw_treated_draw.sum() == 0`` OR ``rw_control_draw.sum() == 0``, the
draw is retried. The pre-fix behavior (unweighted treated mean
fallback) is unreachable from any code path now.

Regression: ``test_bootstrap_pweight_only_retries_zero_treated_mass_draws``
in ``tests/test_methodology_sdid.py::TestBootstrapSE``. Constructs a
panel with 3 treated units where one has `wt=0`; bootstrap at B=100
still returns finite SE > 0. The test would fail if the code reverted
to the unweighted-mean fallback because zero-mass draws would either
produce degenerate SDiD estimators (NaN τ → gets filtered out by the
finite-τ gate, possibly leaving no valid draws) or produce
systematically different τ values that inflate the SE.

P3 (same review) also addressed:
- ``diff_diff/synthetic_did.py`` replicate-weight rejection message
  (line ~303) now reflects the PR #352 support matrix instead of the
  pre-PR #352 "only pweight-only placebo/jackknife" framing.
  Analytical survey designs are now fully supported per the truth
  table; only replicate-weight variance remains unsupported.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 2b2e8f0824511607468418768386f4634978af6f


Overall Assessment

Looks good — no unmitigated P0 or P1 findings.

Executive Summary

  • The prior blocker is fixed: the pweight-only bootstrap path now retries zero-treated-mass survey draws instead of silently falling back to an unweighted treated mean. Refs: diff_diff/synthetic_did.py:L1013-L1045, tests/test_methodology_sdid.py:L674-L710.
  • The prior SurveyDesign(weights=None, strata=..., psu=...) regression is fixed. Refs: diff_diff/survey.py:L1051-L1084, tests/test_methodology_sdid.py:L714-L741.
  • I did not find any new unmitigated methodology or inference bugs in the changed code. The weighted-FW + Rao-Wu full-design branch remains a documented Registry deviation, so I treated it as informational rather than a defect.
  • Two P3 items remain: secondary docs do not describe that hybrid composition as precisely as REGISTRY.md, and the full-design branch still lacks a tight deterministic regression around the boot_idx + Rao-Wu interaction.
  • I could not execute tests here because this environment does not have numpy or pytest installed.

Methodology

  • No unmitigated findings. The full-design survey bootstrap is not the literal SDID Algorithm 2 pairs bootstrap or a literal standalone Rao-Wu bootstrap, but REGISTRY.md explicitly documents the shipped hybrid (rw sliced over resampled units), so under the review rules this is informational rather than blocking. Refs: diff_diff/synthetic_did.py:L978-L1001, docs/methodology/REGISTRY.md:L1563-L1574. Arkhangelsky et al. describe Algorithm 2 as unit-level resampling, while Rao-Wu is PSU-within-stratum resampling/weight rescaling.

Code Quality

  • No material findings. The previous weights=None crash is resolved in diff_diff/survey.py:L1051-L1084 and covered in tests/test_methodology_sdid.py:L714-L741.

Performance

  • No material findings in the changed path.

Maintainability

  • No material findings. The Rust/NumPy weighted-FW split is coherent and has dedicated parity/invariant coverage in tests/test_weighted_fw.py:L1-L383.

Tech Debt

  • No blocking findings. The remaining full-design placebo/jackknife gap is explicitly tracked in TODO.md:L107-L108, so it is mitigated/non-blocking.

Security

  • No material findings.

Documentation/Tests

  • Severity P3. Impact: survey-theory.md and CHANGELOG.md describe the restored full-design path as “Rao-Wu rescaled bootstrap” / “Rao-Wu rescaled weights per draw”, but only REGISTRY.md surfaces the material caveat that SDID still keeps the unit-level boot_idx resampling and then applies Rao-Wu weights. That can mislead readers about the exact methodology that shipped. Concrete fix: mirror REGISTRY.md’s “sliced over the resampled units” wording in the secondary docs, or name it explicitly as a hybrid pairs-bootstrap + Rao-Wu composition. Refs: diff_diff/synthetic_did.py:L978-L1001, docs/methodology/REGISTRY.md:L1563-L1574, docs/methodology/survey-theory.md:L725-L732, CHANGELOG.md:L23-L27.
  • Severity P3. Impact: the new tests cover kernel parity, the prior zero-treated-mass bug, and survey happy paths, but they do not pin the methodology-critical boot_idx + generate_rao_wu_weights() interaction with a small deterministic regression; that behavior is still guarded mainly by the slow coverage artifact. Concrete fix: add a targeted test that monkeypatches generate_rao_wu_weights() on a multi-unit-per-PSU panel and asserts the exact per-draw control/treated weights fed into _bootstrap_se(). Refs: diff_diff/synthetic_did.py:L978-L1045, tests/test_survey_phase5.py:L179-L209, tests/test_methodology_sdid.py:L621-L650, tests/test_methodology_sdid.py:L3055-L3143.
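
A hedged pytest sketch of the second fix. The module path, panel builder, fit driver, and helper signatures are all assumptions; only the monkeypatch-capture-assert pattern is the point:

```python
import numpy as np

def test_rao_wu_weights_flow_into_weighted_fw(monkeypatch):
    import diff_diff.synthetic_did as sdid_mod   # assumed module path

    data, n_total = make_small_survey_panel()    # hypothetical builder
    known_rw = np.arange(1.0, n_total + 1.0)     # distinct per-unit values
    captured = []

    # Pin the Rao-Wu draw to a known vector so per-draw rw is predictable.
    monkeypatch.setattr(sdid_mod, "generate_rao_wu_weights",
                        lambda design, rng: known_rw)

    real = sdid_mod.compute_sdid_unit_weights_survey
    def capturing(*args, reg_weights=None, **kwargs):
        captured.append(np.asarray(reg_weights).copy())
        return real(*args, reg_weights=reg_weights, **kwargs)
    monkeypatch.setattr(sdid_mod, "compute_sdid_unit_weights_survey", capturing)

    fit_sdid_full_design_bootstrap(data)         # hypothetical fit driver
    # Every captured per-draw vector must draw only from known_rw, pinning
    # the boot_idx x generate_rao_wu_weights interaction deterministically.
    assert captured and all(np.isin(rw, known_rw).all() for rw in captured)
```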

@igerber
Owner Author

igerber commented Apr 24, 2026

/ai-review

igerber added a commit that referenced this pull request Apr 24, 2026
…lice

Two P3s from R3; PR was already ✅ Looks good — these are close-out
polish.

P3 docs/tests — secondary surfaces described the full-design path as
"Rao-Wu rescaled bootstrap" but only REGISTRY.md surfaced the material
caveat that SDID still uses unit-level pairs-bootstrap resampling
(``boot_idx = rng.choice(n_total)``) and then applies Rao-Wu rescaled
weights on top — a hybrid composition, not a standalone Rao-Wu
bootstrap. Update survey-theory.md (splits SunAbraham/TROP's standalone
Rao-Wu bullet from SDID's hybrid bullet) and CHANGELOG.md's PR #352
Added entry to use the hybrid-composition wording mirroring REGISTRY.

P3 tests — the methodology-critical ``boot_idx`` × ``generate_rao_wu_weights``
interaction was only guarded by the slow coverage MC. Add
``test_bootstrap_full_design_rao_wu_boot_idx_slice`` (in
``TestBootstrapSE``) which monkeypatches ``generate_rao_wu_weights`` to
return a known vector of distinct per-unit values
(``arange(1, n_total+1)``), captures the ``rw_control_draw`` vectors
fed into the weighted FW helper via a capturing wrapper on
``compute_sdid_unit_weights_survey``, and asserts every captured vector
lies within ``known_rw[:n_control]`` (positions 1..n_control). This
catches two bug classes:
- slice-order regression: if someone swaps rw-then-slice for
  slice-then-rw, the captured vectors would include values from the
  treated slice ``known_rw[n_control:]`` and the assertion fires.
- rw-drift regression: if the Rao-Wu call site bypasses
  ``generate_rao_wu_weights`` (e.g., a refactor silently uses the
  pweight-only branch for full-design fits), the captured vector would
  be the user's w_control (all 1.0 in this test) instead of the known
  Rao-Wu output.

Verified: 294 targeted tests pass across test_methodology_sdid /
test_survey_phase5 / test_weighted_fw / test_guides / test_rust_backend.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 2bf3f93deb9a7b6d42a180721586730f905440af


Overall Assessment

Blocker — prior review items around documentation consistency and weights=None handling look addressed, but I found one new unmitigated P0 in the newly restored pweight-only SDID bootstrap path.

Executive Summary

  • The prior documentation concern is fixed: the shipped hybrid pairs-bootstrap + Rao-Wu composition is now described consistently in docs/methodology/REGISTRY.md:L1549-L1576, docs/methodology/survey-theory.md:L728-L741, and CHANGELOG.md:L20-L27.
  • The prior SurveyDesign(weights=None, strata=..., psu=...) regression is fixed in diff_diff/survey.py:L1051-L1084 and covered by tests/test_methodology_sdid.py:L704-L741.
  • [Newly identified] P0: the new pweight-only bootstrap branch bypasses SurveyDesign.resolve() normalization and feeds raw survey weights into the weighted Frank-Wolfe objective, so bootstrap SE / p-value / CI can change when all pweights are multiplied by a constant. Refs: diff_diff/survey.py:L160-L203, diff_diff/synthetic_did.py:L427-L444, diff_diff/synthetic_did.py:L632-L653, diff_diff/synthetic_did.py:L1002-L1082, diff_diff/utils.py:L1845-L1862, diff_diff/utils.py:L1979-L1993.
  • The new Rao-Wu × boot_idx regression test is helpful but weaker than its docstring claims: it only guards against treated-slice contamination, not exact boot_idx-ordered equality. Refs: tests/test_methodology_sdid.py:L743-L821.
  • I could not execute tests here because this environment is missing numpy, so importing the package fails.

Methodology

  • Severity P0 [Newly identified]. Impact: SurveyDesign.resolve() normalizes pweights/aweights to mean 1 (diff_diff/survey.py:L160-L203), which is the library’s scale-invariance contract. The new pweight-only bootstrap path ignores those resolved weights, extracts raw unit weights from the input frame (diff_diff/synthetic_did.py:L427-L430, diff_diff/survey.py:L1051-L1084), and passes them as rw into the weighted-FW bootstrap objective (diff_diff/synthetic_did.py:L640-L653, diff_diff/synthetic_did.py:L1002-L1082, diff_diff/utils.py:L1845-L1862, diff_diff/utils.py:L1979-L1993). Because that objective is not invariant to a global rescaling of rw, two equivalent designs such as w and 2*w can now produce different bootstrap SEs and downstream inference with no warning. Concrete fix: for the pweight-only bootstrap branch, source control/treated unit weights from the resolved unit-level design (resolved_survey_unit.weights, or equivalent normalized arrays) rather than the raw survey_design column, and add a regression asserting identical bootstrap SE / p-value / CI under w vs c*w. A numeric sketch of the non-invariance follows this list.
  • Severity P3. Impact: the full-design survey bootstrap is a documented hybrid, not a paper-exact reproduction of either source procedure. Arkhangelsky et al. describe Algorithm 2 as a resampling-and-refit bootstrap for SDID, while Rao-Wu resampling perturbs PSU weights within strata; REGISTRY.md now explicitly labels the shipped SDID survey path as that hybrid composition, so under the review rules this is informational rather than a defect. Concrete fix: none required. Refs: docs/methodology/REGISTRY.md:L1549-L1576, docs/methodology/survey-theory.md:L728-L741.
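
A tiny numeric check of the non-invariance claim in the P0 above, using the objective form quoted from the registry (synthetic data; nothing here is library code):

```python
import numpy as np

def weighted_fw_loss(omega, A, b, rw, zeta=1.0):
    # ||A diag(rw) omega - b||^2 + zeta^2 * sum_i rw_i * omega_i^2
    return ((A @ (rw * omega) - b) ** 2).sum() + zeta**2 * (rw * omega**2).sum()

rng = np.random.default_rng(0)
A, b = rng.normal(size=(6, 4)), rng.normal(size=6)
rw = rng.uniform(0.5, 2.0, size=4)
o1 = np.array([1.0, 0.0, 0.0, 0.0])   # two simplex vertices
o2 = np.array([0.0, 0.0, 0.0, 1.0])
# Under rw -> 2*rw the data term scales quadratically in rw while the
# ridge term scales linearly, so the loss ordering of candidate omegas
# can flip: the argmin over the simplex is scale-dependent.
print(weighted_fw_loss(o1, A, b, rw) < weighted_fw_loss(o2, A, b, rw),
      weighted_fw_loss(o1, A, b, 2 * rw) < weighted_fw_loss(o2, A, b, 2 * rw))
```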

Code Quality

  • No additional findings beyond the raw-vs-resolved weight bug above.

Performance

  • No material findings in the changed path.

Maintainability

  • No material findings. The weighted Rust and NumPy FW paths are coherently covered by tests/test_weighted_fw.py.

Tech Debt

  • No blocker. The remaining placebo / jackknife + strata/PSU/FPC limitation is explicitly tracked in TODO.md:L107-L108, so that gap is properly mitigated for review purposes.

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: test_bootstrap_full_design_rao_wu_boot_idx_slice says it checks the exact expected slice known_rw[:n_control][boot_idx[boot_is_control]], but the actual assertion only checks that captured values stay within [1, 15] and never pull from the treated slice (tests/test_methodology_sdid.py:L743-L821). A regression that permuted, deduplicated, or otherwise mis-ordered the control-side boot_idx slice would still pass. Concrete fix: make one bootstrap draw deterministic and assert exact equality of the captured rw_control vector to the expected slice.

Path to Approval

  1. Normalize the pweight-only bootstrap branch onto resolved unit-level survey weights before passing rw into compute_sdid_unit_weights_survey() / compute_time_weights_survey().
  2. Add a regression test that SurveyDesign(weights="w") and SurveyDesign(weights="c*w") produce identical bootstrap SE, p-value, and CI on the same SDID fit.
  3. Tighten the Rao-Wu × boot_idx deterministic test to assert the exact captured slice, not just the value range.

igerber added a commit that referenced this pull request Apr 24, 2026
…tighten boot_idx slice test

R4 P0 (scale-invariance): the pweight-only bootstrap branch was sourcing
w_control / w_treated from raw panel-column weights via
_extract_unit_survey_weights. The weighted-FW bootstrap objective is not
scale-invariant in rw (loss scales as rw^2 via A·diag(rw), reg scales as
rw), so two equivalent designs w and c*w could produce different
bootstrap SE / p-value / CI with no warning. Fix: source w_control /
w_treated from resolved_survey_unit.weights, which SurveyDesign.resolve()
normalizes to mean=1 (survey.py L189-L203). Placebo / jackknife paths
also consume the same w_control / w_treated but are scale-invariant, so
their numerics are unchanged.

R4 P3 (test tightening): the boot_idx × Rao-Wu regression test asserted
captured rw values stayed within the known_rw[1, 15] range — too weak to
catch permutation / deduplication regressions in the slice order. Tighten
by reproducing the bootstrap RNG stream externally (fake_rao_wu doesn't
consume rng) and asserting exact-equality between the captured rw_control
vector and known_rw[:n_control][boot_idx[boot_is_control]].

New regression test: test_bootstrap_scale_invariance_under_pweight_rescaling
fits the same panel under SurveyDesign("wt") vs SurveyDesign("wt_scaled")
(10x rescale) and asserts SE, p-value, CI match to machine-epsilon tolerance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 676d8317de7374ce7ff9aa3bef830ac24b3fd91b


Overall Assessment

Looks good — no unmitigated P0/P1 findings. The prior rerun’s raw-vs-resolved pweight blocker is fixed in diff_diff/synthetic_did.py:L421-L458, and the Rao-Wu-by-boot_idx regression is now checked exactly in tests/test_methodology_sdid.py:L743-L843.

Executive Summary

  • Previous blocker resolved: survey bootstrap now sources normalized unit weights from resolved_survey_unit.weights, restoring pweight scale invariance, with regression coverage at tests/test_methodology_sdid.py:L845-L905.
  • Previous test weakness resolved: the full-design Rao-Wu slice test now asserts exact known_rw[:n_control][boot_idx[boot_is_control]] equality in tests/test_methodology_sdid.py:L743-L843.
  • The weighted time-weight helper now uses weighted centering before sqrt(rw) row-scaling, matching the documented objective and covering the prior centering bug via diff_diff/utils.py:L2037-L2057 and tests/test_weighted_fw.py:L244-L307.
  • Severity P2: the new Rust weighted-FW dispatcher silently falls back to the unweighted kernel on reg_weights length mismatch, unlike the NumPy path which raises; see rust/src/weights.rs:L694-L699 versus diff_diff/utils.py:L1453-L1459.
  • Severity P3 [Newly identified]: docs/choosing_estimator.rst:L817-L823 still says SDID bootstrap rejects all survey designs, contradicting the updated support matrix in docs/methodology/REGISTRY.md:L1549-L1576.
  • I could not execute tests here because this environment lacks numpy and pytest.

Methodology

  • No unmitigated P0/P1 findings.
  • Severity P3. Impact: the restored survey path is a deliberate hybrid pairs-bootstrap + Rao-Wu composition rather than a paper-exact survey bootstrap, but that deviation is now explicitly documented in docs/methodology/REGISTRY.md:L1563-L1574 and mirrored in diff_diff/synthetic_did.py:L61-L73, so under the review rules it is informational only. Concrete fix: none.

Code Quality

  • Severity P2. Impact: sc_weight_fw_weighted_internal() silently delegates to the unweighted solver when reg_weights.len() != t0, so Rust and NumPy backends disagree on invalid input and a direct caller can get the wrong objective with no error. Refs: rust/src/weights.rs:L670-L699, diff_diff/utils.py:L1380-L1406, diff_diff/utils.py:L1453-L1459. Concrete fix: validate reg_weights.shape == (Y.shape[1] - 1,) before Rust dispatch, or have the Rust entry point raise ValueError on mismatch instead of falling back.
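
A sketch of the validation suggested as the concrete fix, on the Python-dispatcher side (function name and placement assumed; the shape rule is taken from the finding itself):

```python
import numpy as np

def validate_reg_weights(Y: np.ndarray, reg_weights) -> None:
    """Raise on shape mismatch instead of silently using the unweighted kernel."""
    expected = Y.shape[1] - 1          # one weight per control coordinate
    rw = np.asarray(reg_weights, dtype=float)
    if rw.shape != (expected,):
        raise ValueError(
            f"reg_weights must have shape ({expected},), got {rw.shape}; "
            "refusing silent fallback to the unweighted solver."
        )
```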

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3. Impact: the remaining placebo/jackknife + strata/PSU/FPC gap is explicitly tracked in TODO.md:L107-L108, so it is mitigated and non-blocking for this PR. Concrete fix: none in this PR.

Security

  • No findings.

Documentation/Tests

  • Severity P3 [Newly identified]. Impact: the compatibility note in docs/choosing_estimator.rst:L817-L823 still advertises the pre-PR limitation, which can steer users away from the restored SDID survey-bootstrap path now described in docs/methodology/REGISTRY.md:L1549-L1576 and diff_diff/guides/llms-full.txt:L1673-L1679. Concrete fix: update docs/choosing_estimator.rst to match the new support matrix.
  • Severity P3. Impact: the module docstring in benchmarks/python/coverage_sdid.py:L2-L12 still says the study covers three DGPs and runs each method on each DGP, but the script now includes a fourth DGP and intentionally skips unsupported methods via survey_design_factory; that makes the calibration harness easier to misread. Concrete fix: refresh the docstring to describe four DGPs and note that stratified_survey is bootstrap-only (benchmarks/python/coverage_sdid.py:L66-L82, L178-L222, L348-L357).

@igerber
Owner Author

igerber commented Apr 24, 2026

/ai-review

igerber added a commit that referenced this pull request Apr 24, 2026
R5 P2: Rust ``sc_weight_fw_weighted_internal`` silently fell back to the
unweighted kernel on ``reg_weights.len() != Y.shape[1] - 1`` while the
NumPy path raised ``ValueError``. Two direct callers of the Rust
pyfunctions (``sc_weight_fw_weighted`` / ``_with_convergence``) could
therefore get the wrong objective with no error. Validate shape both at
the Python dispatcher (``diff_diff.utils._sc_weight_fw``) and at each
Rust pyfunction entry point so both backends raise ``ValueError``
consistently. The internal's fallback is kept as defensive for Rust-
side callers but becomes unreachable from the Python API; comment
updated to reflect that. Regression test:
``test_reg_weights_length_mismatch_raises`` in tests/test_weighted_fw.py.

R5 P3 (docs/choosing_estimator.rst): the SDID row in the survey support
matrix and the trailing note still said survey-bootstrap was rejected.
Updated to describe the landed hybrid pairs-bootstrap + Rao-Wu rescaled
composition and to point users at the REGISTRY note for the objective
form and argmin-set caveat.

R5 P3 (benchmarks/python/coverage_sdid.py): the module docstring said
the study covers three DGPs; it now covers four
(``stratified_survey`` added) and bootstrap-only on that DGP via
``survey_design_factory``. Refresh the docstring to describe all four
DGPs and call out the bootstrap-only coverage for stratified_survey.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 88ad1cc1a017f6b940989b0fb78a9e1c6724df22


Overall Assessment

Looks good — no unmitigated P0/P1 findings. The restored SyntheticDiD survey-bootstrap path is consistent with the new PR #352 methodology note, and the prior rerun’s substantive issues appear addressed. One P3 documentation inconsistency remains.

Executive Summary

  • The affected method is SyntheticDiD survey bootstrap. The new weighted-FW + Rao-Wu composition in diff_diff/synthetic_did.py:L633-L671 and diff_diff/synthetic_did.py:L924-L1108 matches the new registry note at docs/methodology/REGISTRY.md:L1549-L1575.
  • I did not find unmitigated P0/P1 issues in weighting, SE computation, NaN/inference handling, or control/treated slicing on the changed path.
  • The previous rerun’s substantive findings appear fixed in code: invalid reg_weights shapes now raise before Rust fallback (diff_diff/utils.py:L1380-L1393, rust/src/weights.rs:L894-L953), docs/choosing_estimator.rst now reflects restored support (docs/choosing_estimator.rst:L785-L827), and the coverage harness docstring was updated for the fourth DGP (benchmarks/python/coverage_sdid.py:L1-L20).
  • Regression coverage is well targeted for the new path: Rao-Wu slicing, zero-treated-mass retries, pweight rescaling invariance, single-PSU NaN, and weighted-centering/backend parity are covered in tests/test_methodology_sdid.py:L714-L926, tests/test_survey_phase5.py:L179-L392, and tests/test_weighted_fw.py:L137-L388.
  • Severity P3 [Newly identified]: docs/methodology/REGISTRY.md still contains three stale pre-PR352 statements claiming SDID survey bootstrap raises NotImplementedError (docs/methodology/REGISTRY.md:L1497-L1508, docs/methodology/REGISTRY.md:L1624-L1624, docs/methodology/REGISTRY.md:L2933-L2939), contradicting the new support matrix at docs/methodology/REGISTRY.md:L1549-L1575.
  • I could not execute tests here because this environment lacks numpy, pandas, and pytest.

Methodology

Code Quality

  • No findings. The prior Rust/NumPy invalid-input divergence on reg_weights length is now closed by shared validation in diff_diff/utils.py:L1380-L1393 and matching Rust Python-entry guards in rust/src/weights.rs:L894-L953.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings. The remaining placebo/jackknife + strata/PSU/FPC gap is explicitly tracked and therefore mitigated in TODO.md:L107-L107.

Security

  • No findings.

Documentation/Tests

  • No additional findings beyond the stale REGISTRY.md text above. The newly added/updated tests cover the highest-risk branches of the changed path in tests/test_methodology_sdid.py:L714-L926, tests/test_survey_phase5.py:L179-L392, and tests/test_weighted_fw.py:L137-L388.
  • Tests were not executed in this environment because numpy, pandas, and pytest are unavailable.

@igerber
Owner Author

igerber commented Apr 24, 2026

/ai-review

igerber added a commit that referenced this pull request Apr 24, 2026


Three places in REGISTRY.md still described the pre-PR-#352 behavior
where SDID survey bootstrap raised NotImplementedError, contradicting
the new support matrix and survey-composition note landed earlier in
this PR:

- §SyntheticDiD bootstrap bullet (paper-faithful refit text): the
  "Composed with any survey design ... raises NotImplementedError"
  trailer is updated to describe weighted-FW dispatch under the
  survey + bootstrap composition note.
- §SyntheticDiD requirements checklist bullet for "Bootstrap: paper-
  faithful Algorithm 2 step 2": "Survey designs raise
  NotImplementedError" trailer is updated to describe the hybrid
  pairs-bootstrap + Rao-Wu rescaling composition.
- §Rao-Wu Rescaled Bootstrap "intentionally excluded" note: rewritten
  to add SDID to the list via its hybrid composition, with the
  precise distinction from standalone Rao-Wu (SunAbraham / TROP) and
  a pointer to §SyntheticDiD for the objective form and argmin-set
  caveat.

Remaining NotImplementedError mentions in REGISTRY.md §SyntheticDiD
(L1554-L1555) correctly describe the placebo / jackknife + strata/PSU/
FPC methodology gap that is out of scope for this PR and tracked in
TODO.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 60b0e0c6d1a7c8eef615aa94ed5332e24d80b9ee


Overall Assessment

⚠️ Needs changes — one unmitigated P1 finding.

Executive Summary

  • Affected method: SyntheticDiD survey bootstrap. The new weighted-Frank-Wolfe + Rao-Wu composition is documented in the Methodology Registry, and I did not find an unmitigated paper-to-code mismatch in the resample/refit logic itself.
  • The previous rerun’s lone P3 about stale REGISTRY.md text appears resolved; the SyntheticDiD support matrix and Rao-Wu section are now internally consistent at docs/methodology/REGISTRY.md:L1549-L1576 and docs/methodology/REGISTRY.md:L2933-L2941.
  • P1 [Newly identified]: the PR fixes zero-mass bootstrap draws, but fit() still has no fit-time positive-mass guard for survey-weighted treated/control arms. A valid SurveyDesign with all treated mass or all control mass equal to zero can still hit zero-sum weighted means or 0/0 normalization instead of failing cleanly. diff_diff/synthetic_did.py:L421-L458, diff_diff/synthetic_did.py:L551-L582, diff_diff/survey.py:L166-L176, tests/test_methodology_sdid.py:L674-L712
  • P3: docs/choosing_estimator.rst now says SyntheticDiD has “full (bootstrap)” support in the Weights column, but the code still hard-rejects non-pweight survey designs. docs/choosing_estimator.rst:L785-L811, diff_diff/synthetic_did.py:L317-L322
  • I could not execute the test suite here because this environment does not have numpy, pandas, or pytest.

Methodology

  • Severity P1 [Newly identified]. Impact: the PR added the right retry for zero-mass survey bootstrap draws, but the same positive-support check is still missing at fit time. After splitting w_control / w_treated, the code immediately calls np.average(..., weights=w_treated) and later normalizes omega_eff = unit_weights * w_control by its sum. If either arm has zero total survey mass, the estimator can throw or propagate NaNs instead of front-door rejecting an unidentified target population. That is a partial fix of the exact zero-mass pattern the PR already addressed inside _bootstrap_se. Concrete fix: right after w_control / w_treated are constructed, add an explicit guard if w_control.sum() <= 0 or w_treated.sum() <= 0: raise ValueError(...), and cover both pweight-only and full-design variants in tests. diff_diff/synthetic_did.py:L421-L458, diff_diff/synthetic_did.py:L551-L582, diff_diff/survey.py:L166-L176, tests/test_methodology_sdid.py:L674-L712
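
A sketch of the guard proposed above (placement right after the w_control / w_treated split; the function wrapper and message wording are assumptions):

```python
import numpy as np

def check_positive_survey_mass(w_treated: np.ndarray,
                               w_control: np.ndarray) -> None:
    """Fit-time guard: fail cleanly instead of letting 0/0 normalizations
    propagate NaN through the bootstrap / placebo / jackknife dispatchers."""
    if w_treated.sum() <= 0 or w_control.sum() <= 0:
        arm = "treated" if w_treated.sum() <= 0 else "control"
        raise ValueError(
            f"Survey design assigns zero total weight to the {arm} arm; "
            "the survey-weighted ATT target is unidentified."
        )
```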

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. The remaining placebo / jackknife + strata/PSU/FPC gap is properly tracked and therefore mitigated. TODO.md:L108-L109

Security

No findings.

Documentation/Tests

  • Severity P3. Impact: the SyntheticDiD row in docs/choosing_estimator.rst uses “full (bootstrap)” in the Weights column, but SyntheticDiD.fit() still enforces weight_type='pweight' for every survey configuration. That wording conflates full design-element support with full weight-type support and can mislead users into expecting fweight / aweight bootstrap support that does not exist. Concrete fix: change that cell to pweight only (or pweight only; full design elements via bootstrap) and leave Via bootstrap in the Strata/PSU/FPC column. docs/choosing_estimator.rst:L785-L811, diff_diff/synthetic_did.py:L317-L322
  • No additional test-coverage findings. The new tests are well targeted around the restored survey-bootstrap path, especially Rao-Wu slicing, zero-mass draw retry, pweight rescaling invariance, and single-PSU handling. tests/test_methodology_sdid.py:L598-L928, tests/test_survey_phase5.py:L179-L390, tests/test_weighted_fw.py:L1-L407
  • Tests were not executed here because numpy, pandas, and pytest are unavailable.

Path to Approval

  1. Add a fit-time survey positive-mass guard after w_control / w_treated are split, and fail cleanly when either arm’s total survey weight is zero.
  2. Add regression tests for both pweight-only and full-design survey inputs where all treated mass is zero and where all control mass is zero; assert the chosen front-door error contract.
  3. Fix the SyntheticDiD weights-cell wording in docs/choosing_estimator.rst so it still says pweight only while keeping full design-element support in the bootstrap column.

@igerber
Owner Author

igerber commented Apr 24, 2026

/ai-review

igerber added a commit that referenced this pull request Apr 24, 2026
…g fix

R7 P1: the per-draw zero-mass retry in ``_bootstrap_se`` (PR #355 R2 P0)
only covers bootstrap draws, not the fit-time ATT. Survey weights are
non-negative post-resolve() but all-zero mass on either arm is a valid
input that encodes an unidentified target population. Without a fit-
time guard the downstream ``np.average(Y, weights=w_treated)`` and
``omega_eff = unit_weights * w_control`` normalizations would hit 0/0
and silently propagate NaN through the bootstrap / placebo / jackknife
dispatchers. Front-door the case: after ``w_control`` / ``w_treated``
are sourced from the resolved unit-level design, raise ``ValueError``
if either arm's total mass is <= 0. Covers both pweight-only and
strata/PSU/FPC paths. Three regression tests added:
``test_fit_raises_on_zero_total_treated_survey_mass``,
``test_fit_raises_on_zero_total_control_survey_mass``, and
``test_fit_raises_on_zero_treated_mass_under_full_design``.

R7 P3: the SDID row in ``docs/choosing_estimator.rst`` said "pweight
only (placebo / jackknife); full (bootstrap)" in the **Weights** column,
conflating weight-type support (fweight / aweight / pweight) with
design-element support (strata / PSU / FPC). The code still hard-
rejects non-pweight survey designs on every variance method. Narrow
the wording to "pweight only" and leave "Via bootstrap" in the
Strata/PSU/FPC column to describe design-element support.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: d43487305c496e01ae0032325767777a8f5b205a


Overall Assessment

⚠️ Needs changes — one unmitigated P1 finding.

Executive Summary

  • Affected method: SyntheticDiD survey bootstrap, specifically the new weighted Frank-Wolfe + Rao-Wu composition.
  • The previous re-review’s fit-time zero-mass finding is resolved: fit() now front-door rejects zero treated/control survey mass before estimation. diff_diff/synthetic_did.py:L421-L486
  • I did not find an unmitigated paper-to-code mismatch in the weighted-FW/Rao-Wu composition itself; that deviation from plain SDID refit bootstrap is explicitly documented in the Methodology Registry and therefore mitigated. docs/methodology/REGISTRY.md:L1549-L1576
  • P1: invalid fpc full-design inputs that rely on implicit PSUs are not validated on the new SDID bootstrap path, so the Rao-Wu helper throws inside every draw, _bootstrap_se() swallows the ValueError, and users eventually get a generic “Could not produce any valid bootstrap draw” after exhausting retries instead of a targeted FPC/design error. diff_diff/synthetic_did.py:L421-L486, diff_diff/synthetic_did.py:L661-L699, diff_diff/synthetic_did.py:L1020-L1192, diff_diff/bootstrap_utils.py:L695-L750
  • The prior documentation mismatch in docs/choosing_estimator.rst is fixed. docs/choosing_estimator.rst:L785-L823
  • I could not execute the test suite here because this environment lacks numpy and pytest.

Methodology

  • Severity P1. Impact: the new strata/PSU/FPC bootstrap path advertises FPC support, but SyntheticDiD.fit() never validates the unit-collapsed implicit-PSU FPC contract before entering the Rao-Wu retry loop. For SurveyDesign(..., fpc=...) without an explicit PSU (with or without strata), generate_rao_wu_weights() raises on every draw when FPC < effective unit-level PSU count; _bootstrap_se() catches that ValueError and keeps retrying until it emits a generic bootstrap-exhaustion failure. That is a missing assumption check on the newly opened survey-bootstrap path, not just a cosmetic error message issue. Concrete fix: after collapse_survey_to_unit_level(), validate resolved_survey_unit.fpc against the effective unit-level PSU count used by Rao-Wu when psu is None (all units if unstratified; per-stratum unit counts if stratified), and raise a targeted ValueError before calling _bootstrap_se() (sketched after this list). diff_diff/synthetic_did.py:L421-L486, diff_diff/synthetic_did.py:L661-L699, diff_diff/synthetic_did.py:L1020-L1192, diff_diff/bootstrap_utils.py:L695-L750
  • No other methodology findings. The weighted-FW + Rao-Wu composition, its argmin-set caveat, and the remaining placebo/jackknife full-design gap are all documented in docs/methodology/REGISTRY.md. docs/methodology/REGISTRY.md:L1549-L1576
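
A sketch of the front-door check described in the P1 above. The per-stratum rule comes from the finding; the attribute names on the resolved design object are assumptions:

```python
import numpy as np

def validate_implicit_psu_fpc(resolved) -> None:
    """With psu=None each unit is its own PSU, so FPC must cover the unit
    count overall, or within each stratum when stratified."""
    if getattr(resolved, "psu", None) is not None or resolved.fpc is None:
        return
    fpc = np.asarray(resolved.fpc, dtype=float)
    if getattr(resolved, "strata", None) is None:
        if fpc.min() < len(fpc):
            raise ValueError(
                "fpc is below the unit count and no explicit psu= was given; "
                "declare a psu column or raise the FPC."
            )
        return
    strata = np.asarray(resolved.strata)
    for h in np.unique(strata):
        in_h = strata == h
        if fpc[in_h].min() < in_h.sum():
            raise ValueError(
                f"stratum {h!r}: fpc is below the stratum unit count and no "
                "explicit psu= was given; declare a psu column or raise the FPC."
            )
```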

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings. The remaining placebo/jackknife + strata/PSU/FPC gap is properly tracked in TODO.md, so it is mitigated. TODO.md:L108-L108

Security

  • No findings.

Documentation/Tests

  • No additional findings. The docs/support matrix are now internally consistent, and the added tests cover the main survey-bootstrap regressions (zero-mass retries, resolved-weight scale invariance, Rao-Wu slicing, weights=None). I could not execute them here because numpy and pytest are unavailable. docs/choosing_estimator.rst:L785-L823, tests/test_methodology_sdid.py:L598-L995, tests/test_survey_phase5.py:L179-L405

Path to Approval

  1. Add a front-door FPC validation on the unit-collapsed survey design for implicit-PSU cases (SurveyDesign(..., fpc=...) with no PSU, with and without strata) before entering _bootstrap_se().

  2. Add regression tests for:

    • SurveyDesign(weights="wt", fpc="pop") with pop < n_units
    • SurveyDesign(weights="wt", strata="stratum", fpc="pop") with pop_h < n_units_in_stratum

    Both should assert a targeted FPC/design ValueError, not the generic bootstrap-attempt-exhaustion error.

@igerber
Owner Author

igerber commented Apr 24, 2026

/ai-review

igerber added a commit that referenced this pull request Apr 24, 2026
…D bootstrap

When ``SurveyDesign(fpc=...)`` is declared without an explicit
``psu=``, ``bootstrap_utils.generate_rao_wu_weights`` (L654-L655)
treats each unit as its own PSU. The helper rejects
``FPC < n_PSU`` mid-draw (L684-L688), so if FPC is set lower than the
unit count (per stratum if stratified), every bootstrap draw raises
ValueError; ``_bootstrap_se`` swallows the error in its retry loop and
the user eventually sees a generic bootstrap-exhaustion message
instead of a targeted FPC/design error.

Add a front-door validation on ``resolved_survey_unit`` after
``collapse_survey_to_unit_level``:
  - unstratified: fpc >= total unit count;
  - stratified: fpc_h >= per-stratum unit count.

Error messages point at the two actionable fixes (declare an
explicit psu= column, or raise FPC).

Two regression tests added:
``test_fit_raises_on_implicit_psu_fpc_below_unit_count_unstratified``
and ``test_fit_raises_on_implicit_psu_fpc_below_stratum_unit_count``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 7072816bdf3e7bffe7676f4d7dc084a9640ceef7


Overall Assessment

⚠️ Needs changes — one prior P1 finding remains unmitigated.

Executive Summary

  • Re-review result: the earlier zero-mass, pweight scale-invariance, and weighted-centering fixes appear to be present in the current tree, but the previous implicit-PSU FPC blocker is still open. diff_diff/synthetic_did.py:L442-L486, diff_diff/utils.py:L1992-L2115, tests/test_methodology_sdid.py:L845-L972, tests/test_weighted_fw.py:L268-L346
  • Affected method: SyntheticDiD survey bootstrap, i.e. the PR’s weighted-Frank-Wolfe + Rao-Wu composition layered onto SDID and complex-survey resampling.
  • I did not find a new undocumented paper-to-code mismatch in the weighted-FW/Rao-Wu composition itself; that extension and its argmin-set caveat are explicitly documented in the Methodology Registry, so they are informational rather than blocking. docs/methodology/REGISTRY.md:L1549-L1574
  • P1: implicit-PSU fpc full-design inputs (psu=None, fpc set, with or without strata) are still not front-door validated on the reopened SDID bootstrap path. fit() collapses to unit level and enters _bootstrap_se(), generate_rao_wu_weights() raises inside each draw when FPC < n_units (or < n_units_in_stratum), and _bootstrap_se() swallows that ValueError until it eventually emits the generic bootstrap-exhaustion failure. diff_diff/synthetic_did.py:L426-L486, diff_diff/synthetic_did.py:L1038-L1186, diff_diff/bootstrap_utils.py:L723-L729
  • The remaining placebo/jackknife + strata/PSU/FPC gap is properly tracked in TODO.md, so that part is mitigated and not a blocker. TODO.md:L108-L108
  • I could not run the test suite here because this environment is missing numpy, pytest, and pandas.

Methodology

  • Severity P1. Impact: the previous review’s assumption-check bug is still present. SurveyDesign.resolve() only validates FPC >= n_PSU when PSU is explicit, while the reopened SDID bootstrap path collapses to unit level and uses implicit unit-level PSUs when psu=None. On that path, invalid fpc values are rejected only inside generate_rao_wu_weights(), but _bootstrap_se() catches the resulting ValueError and retries until it fails generically. That is still a missing front-door design check on a methodology-critical input path. Concrete fix: after collapse_survey_to_unit_level(), add explicit validation for resolved_survey_unit.psu is None and resolved_survey_unit.fpc is not None: compare fpc against total unit count when unstratified, or per-stratum unit counts when stratified, and raise a targeted FPC/design ValueError before entering _bootstrap_se(). diff_diff/survey.py:L334-L369, diff_diff/synthetic_did.py:L426-L486, diff_diff/synthetic_did.py:L1038-L1172, diff_diff/bootstrap_utils.py:L723-L729
  • Severity P3-informational. Impact: the weighted-FW + Rao-Wu composition itself is documented in REGISTRY.md as a deliberate library extension, including the support matrix, objective form, argmin-set caveat, and single-PSU behavior, so I am not treating that documented deviation as a defect. Concrete fix: none. docs/methodology/REGISTRY.md:L1549-L1574

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3-informational. Impact: full-design survey support for placebo and jackknife remains missing, but it is explicitly tracked in TODO.md, so it is mitigated under this project’s review rules. Concrete fix: none in this PR. TODO.md:L108-L108

Security

  • No findings.

Documentation/Tests

  • Severity P3-informational. Impact: the public docs are now internally aligned with restored survey-bootstrap support for SyntheticDiD; I did not find the earlier support-matrix mismatch. Concrete fix: none. docs/choosing_estimator.rst:L785-L827, docs/methodology/REGISTRY.md:L1549-L1576
  • No additional documentation/test-only findings beyond the missing regression coverage needed to close the Methodology P1.

Path to Approval

  1. Add a front-door implicit-PSU fpc validation in SyntheticDiD.fit() immediately after collapse_survey_to_unit_level(), using unit counts overall or within strata as the effective PSU count when psu=None.

  2. Add regression tests for:

    1. SurveyDesign(weights="wt", fpc="fpc_pop") with fpc_pop < n_units
    2. SurveyDesign(weights="wt", strata="stratum", fpc="fpc_pop") with fpc_pop < n_units_in_stratum

    Both should assert a targeted FPC/design ValueError, not the generic “Could not produce any valid bootstrap draw” error.

@igerber
Owner Author

igerber commented Apr 24, 2026

/ai-review

igerber added a commit that referenced this pull request Apr 24, 2026
The ``Raises`` section of ``SyntheticDiD.fit()`` previously only listed
"non-pweight survey design" among its ``ValueError`` conditions. The
two fit-time guards added in PR #355 R7 and R8 are not reflected in
the docstring:

- R7 P1: zero total survey mass on either arm.
- R8 P1: ``fpc`` declared without explicit ``psu=`` where ``fpc``
  is below the (per-stratum) unit count.

Both raise ``ValueError`` before the bootstrap loop is dispatched.
Expanding the ``Raises`` section makes the contract explicit to
readers (and AI reviewers) cross-referencing the documented behavior
against the implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: bdebecc324e155426d638050a7a071dcaccaf9df


Overall Assessment

⚠️ Needs changes — one prior P1 finding remains unmitigated.

Executive Summary

  • Re-review result: the earlier zero-mass, pweight scale-invariance, and weighted-centering issues appear addressed in the reopened SDID survey-bootstrap path.
  • Affected method(s): SyntheticDiD survey bootstrap (variance_method="bootstrap") and the new weighted Frank-Wolfe survey helpers / Rust kernel used on bootstrap refits.
  • I did not find a new undocumented methodology mismatch in the weighted-FW + Rao-Wu composition itself; the extension, support matrix, and argmin-set caveat are explicitly documented in docs/methodology/REGISTRY.md:L1549-L1574.
  • P1 remains: implicit-PSU fpc inputs (psu=None, fpc set, with or without strata) still are not front-door validated. Invalid FPC is rejected only inside generate_rao_wu_weights(), then swallowed by _bootstrap_se()’s retry loop until the user gets a generic bootstrap-exhaustion error instead of a targeted survey-design error.
  • The remaining placebo/jackknife + strata/PSU/FPC limitation is properly tracked in TODO.md:L108-L108, so that gap is mitigated and not blocking.

Methodology

  • Severity P1. Impact: the prior implicit-PSU FPC assumption-check bug is still open. SurveyDesign.resolve() validates FPC >= n_PSU only when PSU is explicit; SyntheticDiD.fit() then forwards fpc-only full-design inputs into the survey bootstrap; generate_rao_wu_weights() treats each unit as its own PSU when psu is None and raises only mid-draw if FPC < n_units (or < n_units_in_stratum); _bootstrap_se() catches that ValueError and retries until it eventually raises the generic “Could not produce any valid bootstrap draw” failure. That is still a missing methodology-critical front-door design check. Concrete fix: after collapse_survey_to_unit_level(), if resolved_survey_unit.psu is None and resolved_survey_unit.fpc is not None, validate FPC against total unit count when unstratified and per-stratum unit counts when stratified, and raise a targeted ValueError before entering _bootstrap_se(). diff_diff/survey.py:L334-L369, diff_diff/synthetic_did.py:L421-L486, diff_diff/synthetic_did.py:L674-L699, diff_diff/synthetic_did.py:L983-L1185, diff_diff/bootstrap_utils.py:L695-L729
  • Severity P3-informational. Impact: the weighted-FW + Rao-Wu survey-bootstrap composition itself is documented in the Methodology Registry, including the support matrix and the deliberate argmin-set caveat, so I am not treating that documented deviation as a defect. Concrete fix: none. docs/methodology/REGISTRY.md:L1549-L1574

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3-informational. Impact: full-design survey support for placebo and jackknife remains missing, but it is explicitly tracked in TODO.md, so it is mitigated under this project’s review rules. Concrete fix: none in this PR. TODO.md:L108-L108

Security

  • No findings.

Documentation/Tests

  • No independent findings beyond the open Methodology P1. The changed docs are aligned with restored survey-bootstrap support, and the new tests cover the resolved zero-mass, scale-invariance, slice-ordering, and single-PSU cases; they still do not regression-cover the implicit-PSU fpc path. Concrete fix: add the two targeted implicit-PSU FPC tests listed below. tests/test_methodology_sdid.py:L621-L995

Path to Approval

  1. Add front-door implicit-PSU fpc validation in SyntheticDiD.fit() immediately after collapse_survey_to_unit_level(), using total unit count overall or within strata as the effective PSU count when psu=None.

  2. Add regression tests for:

    1. SurveyDesign(weights="wt", fpc="fpc_pop") with fpc_pop < n_units
    2. SurveyDesign(weights="wt", strata="stratum", fpc="fpc_pop") with fpc_pop < n_units_in_stratum

    Both should assert a targeted FPC/design ValueError, not the generic bootstrap-exhaustion error.
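
To make the requested distinction concrete, here is a minimal standalone sketch of the retry-swallow failure mode this finding describes; all names are hypothetical and this is not the library's code:

```python
import numpy as np

def bootstrap_se_with_retry(draw_fn, n_boot=200, max_retries=2000, seed=0):
    """Sketch of a retry loop that swallows per-draw ValueErrors. A
    deterministic design error (e.g. an invalid FPC rejected deep inside
    the weight generator) fails on every attempt, so the user only ever
    sees the generic exhaustion message below."""
    rng = np.random.default_rng(seed)
    draws, attempts = [], 0
    while len(draws) < n_boot and attempts < max_retries:
        attempts += 1
        try:
            draws.append(draw_fn(rng))
        except ValueError:
            continue  # swallowed: retried, never reported to the user
    if not draws:
        raise RuntimeError("Could not produce any valid bootstrap draw")
    return float(np.std(draws, ddof=1))

def bad_draw(rng):
    # Stands in for a draw whose FPC check always fails when psu=None.
    raise ValueError("FPC must be >= number of PSUs")

# bootstrap_se_with_retry(bad_draw) -> generic RuntimeError, not the
# targeted FPC/design ValueError the review asks fit() to raise up front.
```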

igerber and others added 2 commits April 24, 2026 07:46
Foundation for restoring SDID survey-bootstrap support (PR #352, follow-up
to #351 which front-door rejected all survey designs). This commit adds the
weighted-FW kernel + Python wrappers; the bootstrap integration lands in
the next commit.

Rust (rust/src/weights.rs, rust/src/lib.rs):
- New `sc_weight_fw_gram_weighted` and `sc_weight_fw_standard_weighted`
  loop variants. Identical to the unweighted loops except for the
  regularization term: `half_grad[j]` picks up `eta*reg_w[j]*lam[j]` in
  place of `eta*lam[j]`, and the FW step-size denominator uses the
  diag(reg_w)-weighted simplex direction norm
  `Σ_j reg_w[j]*d[j]²` (which simplifies to
  `Σ_j reg_w[j]*lam[j]² + reg_w[i] - 2*reg_w[i]*lam[i]` for d = e_i - lam).
- New `sc_weight_fw_weighted_internal` dispatcher that delegates to the
  unweighted internal when reg_weights is None (preserves the legacy
  numeric contract for any future caller that wants the generic shape).
- Two new pyfunctions: `sc_weight_fw_weighted` and
  `sc_weight_fw_weighted_with_convergence`. Same call shape as the
  existing unweighted siblings plus a trailing `reg_weights` kwarg.
  Registered in lib.rs.
- 3 new Rust unit tests in rust/src/weights.rs:
    * test_weighted_fw_reg_weights_none_delegates — bit-identity at
      rel=1e-14 against the unweighted internal.
    * test_weighted_fw_uniform_reg_weights_matches_unweighted — uniform
      rw=1 collapses to uniform regularization (rel=1e-12, allowing for
      ULP-scale drift from different float reduction orders).
    * test_weighted_fw_simplex_invariants — for arbitrary positive rw
      and both gram (T0<N) and standard (T0>=N) paths, returned ω sums to
      1 and is non-negative.

Python (diff_diff/utils.py, diff_diff/_backend.py):
- Export _rust_sc_weight_fw_weighted and _with_convergence from _backend
  (mirrors the shape added for _rust_sc_weight_fw_with_convergence in
  PR #351 c0d089b).
- Extend `_sc_weight_fw` and `_sc_weight_fw_numpy` with a
  `reg_weights: Optional[np.ndarray] = None` kwarg. When set on the Rust
  path, dispatches to the new weighted pyfunctions; on the pure-Python
  path, runs a weighted FW loop mirroring the Rust derivation.
- New helper `compute_sdid_unit_weights_survey(Y_pre_control,
  Y_pre_treated_mean, rw_control, ...)`: column-scales Y_pre_control by
  rw_control and passes rw_control as reg_weights so the FW solves the
  unit-weight survey-bootstrap objective
    min_{ω simplex} Σ_t (Σ_i rw_i·ω_i·Y_i,pre[t] - treated_pre[t])² +
                    ζ²·Σ_i rw_i·ω_i²
  Two-pass sparsify-refit structure mirrors compute_sdid_unit_weights.
  Returns ω on the standard simplex (caller composes ω_eff downstream);
  a standalone sketch of this weighted-FW loop follows this commit message.
- New helper `compute_time_weights_survey(Y_pre_control, Y_post_control,
  rw_control, ...)`: row-scales Y_time by sqrt(rw_control) and passes
  no reg_weights (uniform reg on λ — λ is per-period, rw is per-control,
  no alignment for per-λ weighting). Two-pass structure unchanged.
- Both new helpers expose `return_convergence=True` returning the AND of
  the two pass convergence flags, mirroring the contract added in
  PR #351 c0d089b.

Tests (tests/test_weighted_fw.py — new, 15 tests):
- _sc_weight_fw weighted-reg path: reg_weights=None matches unweighted at
  bit-identity; uniform reg matches unweighted at rel=1e-12;
  Rust/numpy parity at rel=1e-9; simplex invariants under arbitrary rw;
  return_convergence tuple shape.
- compute_sdid_unit_weights_survey: uniform-rw equivalence to unweighted
  helper, simplex invariants under arbitrary rw, shape-mismatch raises,
  return_convergence AND.
- compute_time_weights_survey: same coverage matrix, plus a zero-rw
  subset test (Rao-Wu-style undrawn PSU yields valid simplex λ).
- Backend parity: pure-Python vs Rust weighted-helper output at rel=1e-7
  for both unit and time helpers (monkeypatches HAS_RUST_BACKEND).

ABI preservation: existing unweighted callers of _sc_weight_fw,
compute_sdid_unit_weights, compute_time_weights are unaffected — the new
kwarg defaults to None and dispatches to the legacy code path. The
bit-identity check on
TestScaleEquivariance::test_baseline_parity_small_scale[bootstrap] still
passes at rel=1e-14 (verified in the next commit when the bootstrap
integration lands).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
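
Since the commit message above fully states the weighted objective, the gradient term, and the step-size denominator, a standalone sketch of the iteration may help. Shapes and defaults are assumptions; this is an illustrative reconstruction, not the Rust kernel or the library's Python fallback:

```python
import numpy as np

def weighted_fw(Y_pre_control, treated_pre, rw, zeta, n_iter=10_000):
    """Frank-Wolfe on the simplex for the survey-weighted unit-weight
    objective quoted in the commit message:
        min_w  sum_t (sum_i rw_i * w_i * Y[i, t] - treated_pre[t])**2
               + zeta**2 * sum_i rw_i * w_i**2
    Assumed shapes: Y_pre_control is (T0, N) with controls as columns,
    treated_pre is (T0,), rw is (N,) and positive."""
    A = Y_pre_control * rw            # column i scaled by rw_i
    eta = zeta ** 2
    n = A.shape[1]
    w = np.full(n, 1.0 / n)           # start at the simplex barycenter
    for _ in range(n_iter):
        resid = A @ w - treated_pre
        half_grad = A.T @ resid + eta * rw * w  # reg term carries rw_j
        d = -w.copy()
        d[np.argmin(half_grad)] += 1.0          # FW direction e_i - w
        denom = np.dot(A @ d, A @ d) + eta * np.dot(rw, d * d)
        if denom <= 0.0:
            break                               # w already sits at the vertex
        gamma = np.clip(-np.dot(half_grad, d) / denom, 0.0, 1.0)
        if gamma == 0.0:
            break                               # FW gap closed
        w += gamma * d                          # convex step stays on simplex
    return w
```

With rw set to all ones, both the loss scaling and the regularization collapse to the unweighted objective, which matches the uniform-reg equivalence the tests above pin at rel=1e-12.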
…sition

PR #352 restores the SDID survey-bootstrap capability that PR #351 front-
door rejected as a known regression. Pweight-only and full-design surveys
now both succeed; placebo / jackknife continue to reject full designs (a
separate methodology gap tracked in TODO.md).

`diff_diff/synthetic_did.py::fit` (guards):
- Replace the unconditional strata/PSU/FPC NotImpl guard with a method-
  gated version that fires only for placebo / jackknife. Rationale +
  truth-table in REGISTRY.md §SyntheticDiD survey-support matrix:

      method     pweight-only      strata/PSU/FPC
      bootstrap  ✓ (this PR)       ✓ Rao-Wu (this PR)
      placebo    ✓ unchanged       ✗ NotImpl (separate gap)
      jackknife  ✓ unchanged       ✗ NotImpl (separate gap)

- Delete the unconditional `bootstrap + any-survey` guard added in #351.
  Keep the `weight_type != "pweight"` validation (fweight/aweight still
  rejected).

`diff_diff/synthetic_did.py::fit` (survey resolution):
- After validating the per-unit survey weights (`w_treated`, `w_control`),
  also collapse the observation-level `resolved_survey` to a unit-level
  view via `collapse_survey_to_unit_level(...)` ordered as
  `[*control_units, *treated_units]`. The resulting
  `resolved_survey_unit` is what `_bootstrap_se` slices via
  `boot_rw[:n_control]` / `boot_rw[n_control:]` per Rao-Wu draw.

`diff_diff/synthetic_did.py::fit` (dispatcher):
- Branch the bootstrap call on whether the design is pweight-only or
  full design (strata/PSU/FPC). Pass `w_control`/`w_treated` for
  pweight-only, `resolved_survey=resolved_survey_unit` for full design,
  None/None for non-survey.

`diff_diff/synthetic_did.py::_bootstrap_se`:
- New kwargs: `w_control`, `w_treated`, `resolved_survey` (all keyword-
  only, default None — preserves the legacy signature).
- Single-PSU short-circuit: unstratified survey with <2 PSUs returns
  (NaN, []) since the bootstrap distribution is unidentified
  (resampling one PSU yields the same subset every draw). Recovered from
  the pre-PR-#351 fixed-weight Rao-Wu branch (commit 91082e5).
- Per-draw Rao-Wu rescaling for full designs:
  ``rw = generate_rao_wu_weights(resolved_survey, rng)`` sliced over the
  resampled units. Pweight-only path uses ``rw = w_control[boot_idx]``
  (constant per draw, no rescaling).
- Survey-weighted treated-unit means: ``np.average(..., weights=rw_treated_draw)``
  when survey weights are present.
- Warm-start: the simplex init scales by rw before sum_normalize when on
  the survey path, matching the per-draw weighted-FW geometry.
- Per-draw FW dispatch: survey paths call the new
  ``compute_sdid_unit_weights_survey`` / ``compute_time_weights_survey``
  helpers (PR #352 commit 1) which run the weighted-FW kernel; non-
  survey paths continue to call the unweighted helpers (bit-identity
  preserved on the non-survey refit path).
- Post-FW composition: ``ω_eff = rw·ω / Σ(rw·ω)`` for the SDID
  estimator (which expects simplex weights). Degenerate-retry if
  ``Σ(rw·ω) <= 0`` (all mass on rw=0 controls). A per-draw sketch of
  this composition follows this commit message.
- Aggregate FW non-convergence warning: tally is the AND of the two
  helpers' convergence flags per draw, fires above 5% (PR #351 c0d089b
  shape preserved, no copy change).

Tests:
- ``tests/test_survey_phase5.py``: rewrite three PR #351 raises-tests as
  succeeds-tests with explicit SE assertions —
    * ``test_full_design_bootstrap_succeeds`` (was ``_raises``):
      finite SE, populated survey_metadata.n_strata/n_psu, summary()
      includes Survey Design + Bootstrap replications blocks.
    * ``test_bootstrap_with_pweight_only_succeeds`` (was ``_raises``):
      finite SE, variance_method preserved (cross-surface guard).
    * New ``test_bootstrap_full_design_se_differs_from_pweight_only``
      resurrects the PR #351 R3-deleted differs-from contract: ATT
      matches between paths (both compose ω_eff post-fit) but SE
      differs (Rao-Wu adds PSU clustering variance).
- ``tests/test_methodology_sdid.py::TestBootstrapSE``: rewrite two PR #351
  raises-tests as succeeds-tests, plus add the
  ``test_bootstrap_single_psu_returns_nan`` short-circuit regression.

Verified: 308 tests pass across test_methodology_sdid /
test_business_report SDID subset / test_rust_backend / test_survey_phase5
/ test_weighted_fw / test_guides.

Bit-identity check: the non-survey refit path goes through the
unweighted helpers (no weighted-FW dispatch), so
``TestScaleEquivariance::test_baseline_parity_small_scale[bootstrap]``
remains at rel=1e-14 — verified passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
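
A per-draw sketch of the composition described above, reusing the weighted_fw sketch from the previous commit message; rao_wu_fn stands in for a fresh generate_rao_wu_weights(resolved_survey_unit, rng) call, and all shapes and names are assumptions rather than the library's signatures:

```python
import numpy as np

def one_survey_bootstrap_draw(Y_pre_control, treated_pre, w_control, zeta,
                              rng, rao_wu_fn=None):
    """One draw of the survey bootstrap loop: pairs-resample the controls,
    pick the per-draw weights, refit omega under the weighted objective,
    then compose omega_eff = rw * omega / sum(rw * omega)."""
    n_control = Y_pre_control.shape[1]
    boot_idx = rng.integers(0, n_control, size=n_control)  # pairs resample
    if rao_wu_fn is not None:
        rw = rao_wu_fn(rng)[boot_idx]   # full design: per-draw Rao-Wu factors
    else:
        rw = w_control[boot_idx]        # pweight-only: constant per draw
    omega = weighted_fw(Y_pre_control[:, boot_idx], treated_pre, rw, zeta)
    omega_scaled = rw * omega
    if omega_scaled.sum() <= 0:         # all FW mass on rw == 0 controls
        return None                     # caller treats this as a retried draw
    return omega_scaled / omega_scaled.sum()   # omega_eff on the simplex
```

Per the commit, only the full-design branch re-draws Rao-Wu factors each iteration; the pweight-only branch keeps the resampled base weights fixed.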
igerber and others added 3 commits April 24, 2026 07:48
…g fix

R7 P1: the per-draw zero-mass retry in ``_bootstrap_se`` (PR #355 R2 P0)
only covers bootstrap draws, not the fit-time ATT. Survey weights are
non-negative post-resolve() but all-zero mass on either arm is a valid
input that encodes an unidentified target population. Without a fit-
time guard the downstream ``np.average(Y, weights=w_treated)`` and
``omega_eff = unit_weights * w_control`` normalizations would hit 0/0
and silently propagate NaN through the bootstrap / placebo / jackknife
dispatchers. Front-door the case: after ``w_control`` / ``w_treated``
are sourced from the resolved unit-level design, raise ``ValueError``
if either arm's total mass is <= 0. Covers both pweight-only and
strata/PSU/FPC paths. Three regression tests added:
``test_fit_raises_on_zero_total_treated_survey_mass``,
``test_fit_raises_on_zero_total_control_survey_mass``, and
``test_fit_raises_on_zero_treated_mass_under_full_design``.

R7 P3: the SDID row in ``docs/choosing_estimator.rst`` said "pweight
only (placebo / jackknife); full (bootstrap)" in the **Weights** column,
conflating weight-type support (fweight / aweight / pweight) with
design-element support (strata / PSU / FPC). The code still hard-
rejects non-pweight survey designs on every variance method. Narrow
the wording to "pweight only" and leave "Via bootstrap" in the
Strata/PSU/FPC column to describe design-element support.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
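
A minimal sketch of the R7 P1 fit-time guard described above; the helper name is hypothetical (the library raises inside fit()):

```python
import numpy as np

def check_arm_mass(w_treated, w_control):
    """Front-door the zero-mass case. Survey weights are non-negative
    after resolve(), but all-zero mass on an arm encodes an unidentified
    target population; failing here keeps the 0/0 NaN out of np.average
    and the omega_eff normalization downstream."""
    for arm, w in (("treated", np.asarray(w_treated, dtype=float)),
                   ("control", np.asarray(w_control, dtype=float))):
        if w.sum() <= 0:
            raise ValueError(
                f"Total survey weight on the {arm} arm is zero; the "
                "survey-weighted estimand is unidentified."
            )
```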
…D bootstrap

When ``SurveyDesign(fpc=...)`` is declared without an explicit
``psu=``, ``bootstrap_utils.generate_rao_wu_weights`` (L654-L655)
treats each unit as its own PSU. The helper rejects
``FPC < n_PSU`` mid-draw (L684-L688), so if FPC is set lower than the
unit count (per stratum if stratified), every bootstrap draw raises
ValueError; ``_bootstrap_se`` swallows the error in its retry loop and
the user eventually sees a generic bootstrap-exhaustion message
instead of a targeted FPC/design error.

Add a front-door validation on ``resolved_survey_unit`` after
``collapse_survey_to_unit_level``:
  - unstratified: fpc >= total unit count;
  - stratified: fpc_h >= per-stratum unit count.

Error messages point at the two actionable fixes (declare an
explicit psu= column, or raise FPC).

Two regression tests added:
``test_fit_raises_on_implicit_psu_fpc_below_unit_count_unstratified``
and ``test_fit_raises_on_implicit_psu_fpc_below_stratum_unit_count``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
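
A sketch of the front-door validation this commit describes, over assumed array-based inputs (the library validates the resolved unit-level design object instead):

```python
import numpy as np

def validate_implicit_psu_fpc(fpc, strata=None):
    """With psu=None, generate_rao_wu_weights treats each unit as its own
    PSU, so an FPC below the (per-stratum) unit count would raise on
    every draw and be swallowed by the retry loop. Validate once, up
    front. Assumed inputs: fpc is a per-unit array of population counts,
    strata an optional per-unit array of stratum labels."""
    fpc = np.asarray(fpc, dtype=float)
    if strata is None:
        groups = [np.arange(len(fpc))]
    else:
        strata = np.asarray(strata)
        groups = [np.flatnonzero(strata == h) for h in np.unique(strata)]
    for idx in groups:
        if np.any(fpc[idx] < len(idx)):
            raise ValueError(
                "fpc is below the unit count it applies to but no psu= "
                "column was declared; declare an explicit psu= column or "
                "raise the FPC."
            )
```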
The ``Raises`` section of ``SyntheticDiD.fit()`` previously only listed
"non-pweight survey design" among its ``ValueError`` conditions. The
two fit-time guards added in PR #355 R7 and R8 were not reflected in
the docstring:

- R7 P1: zero total survey mass on either arm.
- R8 P1: ``fpc`` declared without explicit ``psu=`` where ``fpc``
  is below the (per-stratum) unit count.

Both raise ``ValueError`` before the bootstrap loop is dispatched.
Expanding the ``Raises`` section makes the contract explicit to
readers (and AI reviewers) cross-referencing the documented behavior
against the implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber igerber force-pushed the sdid-bootstrap-survey branch from bdebecc to 1a20c16 Compare April 24, 2026 11:50
@igerber
Owner Author

igerber commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 1a20c16f181723b61576d80f0850f579e057d79e


Overall Assessment

Looks good — no unmitigated P0/P1 findings.

Executive Summary

  • Re-review result: the prior implicit-PSU fpc blocker is addressed. SyntheticDiD.fit() now front-door validates implicit-PSU fpc before the bootstrap retry loop, which closes the earlier generic bootstrap-exhaustion failure mode. diff_diff/synthetic_did.py:L451-L496, tests/test_methodology_sdid.py:L912-L961
  • I did not find a new undocumented methodology mismatch in the restored SDID survey-bootstrap path. The weighted-FW + Rao-Wu composition, support matrix, and argmin-set caveat are explicitly documented in the Methodology Registry. docs/methodology/REGISTRY.md:L1549-L1574
  • The reopened path now has solid regression coverage for the previously risky edges: zero treated/control survey mass, pweight scale invariance, weights=None, Rao-Wu slice ordering, and full-design success cases. tests/test_methodology_sdid.py:L674-L1010, tests/test_weighted_fw.py:L49-L258, tests/test_survey_phase5.py:L179-L405
  • The remaining full-design gap is limited to variance_method='placebo' / 'jackknife', and that limitation is properly tracked in TODO.md, so it is not blocking under this repo’s review rules. TODO.md:L107-L108
  • I could not execute the test suite in this review environment because pytest and numpy are unavailable.

Methodology

No unmitigated P0/P1 findings. The previous fit-time implicit-PSU fpc validation issue appears fixed by the new upstream checks in fit(). diff_diff/synthetic_did.py:L451-L496

  • Severity P3-informational. Impact: the restored survey-bootstrap path uses a weighted Frank-Wolfe + Rao-Wu composition that is not paper/R-native, but that deviation is explicitly documented, including the support matrix and argmin-set caveat, so I am not treating it as a defect. Concrete fix: none. docs/methodology/REGISTRY.md:L1549-L1574

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3-informational. Impact: full-design survey support is still missing for placebo and jackknife, but the limitation is explicitly tracked as deferred follow-up work, so it is mitigated and non-blocking. Concrete fix: none in this PR. TODO.md:L107-L108, diff_diff/synthetic_did.py:L345-L363

Security

  • No findings.

Documentation/Tests

  • Severity P3-informational. Impact: benchmarks/data/sdid_coverage.json now contains bare NaN tokens for unsupported stratified_survey cells, which Python will parse but strict JSON parsers will reject. That makes the committed artifact non-standard JSON even though current repo-local consumers tolerate it. Concrete fix: emit null (or omit unsupported cells) from _summarize()/serialization and keep the schema test/docs aligned. benchmarks/python/coverage_sdid.py:L299-L314, benchmarks/python/coverage_sdid.py:L473-L476, benchmarks/data/sdid_coverage.json:L135-L168
  • No other findings. The added regression coverage materially improves protection around the restored survey-bootstrap path. tests/test_methodology_sdid.py:L674-L1010, tests/test_weighted_fw.py:L49-L258, tests/test_survey_phase5.py:L179-L405

The SDID coverage MC artifact carried bare ``NaN`` tokens for the
``stratified_survey`` × ``placebo`` / ``jackknife`` cells (unsupported
by design — strata/PSU/FPC raises at fit-time). Python's ``json``
module tolerates those tokens on read, but strict JSON parsers reject
them, making the committed artifact non-strict.

``_summarize`` now returns ``None`` (serializes as ``null``) instead
of ``float('nan')`` on the all-failed branch, and ``json.dump`` is
called with ``allow_nan=False`` so any stray non-finite value fails
loudly instead of serializing as a bare ``NaN`` / ``Infinity`` token.
The committed artifact has been patched in place (bare ``NaN`` →
``null``) and strict-loader-verified; no regen needed since the
numeric content on the previously-NaN cells was definitionally
absent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
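
The serialization pattern this commit describes, as a simplified stand-in (the real _summarize returns a richer structure):

```python
import json

def summarize_cell(values):
    """All-failed cells return None (serialized as null) rather than
    float('nan')."""
    return sum(values) / len(values) if values else None

def write_artifact(rows, path):
    """allow_nan=False makes json.dump raise ValueError on any stray
    non-finite float instead of silently emitting a bare NaN / Infinity
    token that strict JSON parsers reject."""
    with open(path, "w") as fh:
        json.dump(rows, fh, indent=2, allow_nan=False)
```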
@igerber
Owner Author

igerber commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: c29febf67b0c6c0f717b89ddafe9da97277606d0


Overall Assessment

⚠️ Needs changes — one unmitigated P1 finding.

Executive Summary

  • Previous re-review items look addressed: the implicit-PSU fpc validation now front-doors before the bootstrap retry loop, and the coverage artifact is back to strict JSON (null + allow_nan=False). diff_diff/synthetic_did.py:L456-L496, tests/test_methodology_sdid.py:L912-L960, benchmarks/python/coverage_sdid.py:L299-L335, benchmarks/python/coverage_sdid.py:L482-L488, benchmarks/data/sdid_coverage.json:L135-L168
  • I did not find a new undocumented mismatch in the restored weighted-FW + Rao-Wu bootstrap branch itself; the survey support matrix, hybrid composition, and argmin-set caveat are documented and the code matches that design. docs/methodology/REGISTRY.md:L1549-L1574, diff_diff/synthetic_did.py:L716-L754, diff_diff/synthetic_did.py:L980-L1246
  • Severity P1 [Newly identified]. The new fit-time survey “positive-mass” guard only checks raw control-arm mass, not the composed control support actually used by the estimator. A survey fit can still hit omega_eff.sum() == 0 when zero-weight controls absorb all Frank-Wolfe mass, which yields NaN output instead of a targeted identification error. diff_diff/survey.py:L171-L176, diff_diff/utils.py:L1525-L1546, diff_diff/utils.py:L1812-L1835, diff_diff/synthetic_did.py:L514-L541, diff_diff/synthetic_did.py:L654-L660
  • The remaining placebo/jackknife + strata/PSU/FPC gap is explicitly tracked in TODO.md, so it is non-blocking. TODO.md:L107-L108, diff_diff/synthetic_did.py:L345-L363
  • I could not execute the tests here because numpy, pandas, and pytest are unavailable in this environment.

Methodology

  • Severity P1 [Newly identified]. Impact: the restored survey-support path still lacks a fit-time effective-support check for controls. Zero survey weights are valid input, compute_sdid_unit_weights() intentionally sparsifies to exact zeros, and fit() then normalizes omega_eff = unit_weights * w_control without checking that the composed vector has positive total mass. If all nonzero FW mass lands on zero-weight controls, the point estimate becomes undefined and the fit returns NaNs instead of a targeted identification error. The code already recognizes the analogous issue in the bootstrap loop (omega_scaled.sum() <= 0 retry) and in jackknife (effective_control > 0 guard), so this looks like a missing assumption check rather than an intentional methodology choice. Concrete fix: after composing omega_eff, raise ValueError when omega_eff.sum() <= 0 (or no positive effective controls remain), before normalizing; add a regression that forces the unweighted FW solution onto zero-weight controls. diff_diff/survey.py:L171-L176, diff_diff/utils.py:L1525-L1546, diff_diff/utils.py:L1812-L1835, diff_diff/synthetic_did.py:L514-L541, diff_diff/synthetic_did.py:L654-L660, diff_diff/synthetic_did.py:L1186-L1190, diff_diff/synthetic_did.py:L1579-L1590, tests/test_methodology_sdid.py:L845-L910

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3-informational. Impact: full-design survey support is still missing for variance_method="placebo" and "jackknife", but that limitation is explicitly tracked in TODO.md, so it is mitigated and non-blocking under this repo’s review rules. Concrete fix: none required in this PR. TODO.md:L107-L108, diff_diff/synthetic_did.py:L345-L363

Security

  • No findings.

Documentation/Tests

  • No findings. The prior non-standard JSON artifact issue appears fixed, and the added tests materially improve coverage around the restored survey-bootstrap path: weighted-FW kernel parity, Rao-Wu slicing, weights=None, zero-treated-mass retries, pweight rescaling invariance, and single-PSU NaN behavior. benchmarks/python/coverage_sdid.py:L299-L335, benchmarks/python/coverage_sdid.py:L482-L488, tests/test_weighted_fw.py:L49-L407, tests/test_methodology_sdid.py:L598-L1046, tests/test_survey_phase5.py:L179-L405
  • I could not run the test suite in this environment because numpy, pandas, and pytest are not installed.

Path to Approval

  1. Add a fit-time effective-control guard in SyntheticDiD.fit() after omega_eff = unit_weights * w_control and before normalization; if the composed vector has zero total mass, raise a targeted ValueError instead of returning NaNs. diff_diff/synthetic_did.py:L654-L660
  2. Add regression tests for both pweight-only and full-design survey fits where a zero-weight control is engineered to absorb all fit-time FW mass, and assert the new targeted error. tests/test_methodology_sdid.py:L845-L910

The R7 P1 positive-mass guard checks only the raw survey mass
(``w_control.sum() > 0``). That's not sufficient: Frank-Wolfe
sparsifies ``unit_weights`` to exact zeros by design, so even when
at least one control has positive survey weight, the FW solution may
concentrate all mass on controls whose survey weights are 0. The
composed ``omega_eff = unit_weights * w_control`` then sums to 0 and
the normalization step (``omega_eff / omega_eff.sum()``) hits 0/0,
silently propagating NaN into the ATT, SE, and CI.

Front-door the case: after composing ``omega_eff``, raise
``ValueError`` before the normalization when
``omega_eff.sum() <= 0``. The analogous guards already exist in the
bootstrap loop (``omega_scaled.sum() <= 0`` retry) and jackknife
(``effective_control > 0`` support gate); this restores the contract
at fit time.

Two regression tests cover both dispatch branches (pweight-only and
strata/PSU/FPC). Both monkeypatch ``compute_sdid_unit_weights`` to
return a canonical sparse unit-weight vector concentrated on a
zero-survey-weight control — more reliable than wrestling with FW
convergence dynamics on synthetic data. The ``fit()`` docstring
``Raises`` section is updated to list the new condition.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
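
A sketch of the composition-plus-guard this commit lands; the helper name is hypothetical (the library raises inside SyntheticDiD.fit()):

```python
import numpy as np

def compose_omega_eff(unit_weights, w_control):
    """Frank-Wolfe sparsifies unit_weights to exact zeros, so even with
    positive raw survey mass the composed vector can be all-zero when
    every nonzero FW weight sits on a zero-weight control. Guard before
    normalizing rather than letting 0/0 propagate NaN."""
    omega_eff = (np.asarray(unit_weights, dtype=float)
                 * np.asarray(w_control, dtype=float))
    total = omega_eff.sum()
    if total <= 0:
        raise ValueError(
            "All synthetic-control mass fell on controls with zero survey "
            "weight; the survey-weighted estimand has no effective support."
        )
    return omega_eff / total
```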
@igerber
Owner Author

igerber commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: fb2dd9049c7d08efe38dde2300dab5084becc9ec


Overall Assessment

⚠️ Needs changes — one unmitigated P1 finding.

Executive Summary

  • The prior re-review P1 is addressed: SyntheticDiD.fit() now guards zero effective control support before normalizing omega_eff, with regressions for both pweight-only and full-design survey fits. diff_diff/synthetic_did.py:L664-L691, tests/test_methodology_sdid.py:L912-L1005
  • I did not find a new undocumented mismatch in the core weighted-FW + Rao-Wu bootstrap implementation; the changed estimator path matches the new registry notes and support matrix. docs/methodology/REGISTRY.md:L1549-L1575, diff_diff/synthetic_did.py:L748-L786, diff_diff/synthetic_did.py:L1012-L1325, diff_diff/utils.py:L1846-L2115
  • Severity P1. The new stratified_survey calibration DGP returns post_periods=range(7, 12) even though generate_survey_did_data() is 1-indexed and cohort 7 implies post periods 7..12. Because fit() treats every omitted period as pre, period 12 is silently included in the pre window, contaminating the committed survey-bootstrap calibration row and the REGISTRY text that relies on it. benchmarks/python/coverage_sdid.py:L186-L217, diff_diff/prep_dgp.py:L1211-L1215, diff_diff/prep_dgp.py:L1557-L1558, diff_diff/synthetic_did.py:L381-L388, benchmarks/data/sdid_coverage.json:L135-L170, docs/methodology/REGISTRY.md:L1576-L1595
  • The remaining placebo/jackknife + strata/PSU/FPC limitation is explicitly tracked in TODO.md, so it is non-blocking. TODO.md:L107-L108, diff_diff/synthetic_did.py:L345-L370
  • I could not run tests here because numpy, pandas, and pytest are unavailable in this environment.

Methodology

  • Severity P1. Impact: the PR’s new full-design SDID calibration evidence is generated from a mis-specified post window. With n_periods=12 and 1-indexed periods, omitting period 12 from post_periods means the last treated period is treated as pre-treatment by SyntheticDiD.fit(), so the reported stratified_survey × bootstrap rejection rates are not validating the stated 6-pre/6-post survey DGP. Concrete fix: change _stratified_survey_dgp() to return the full post tail ([7, 8, 9, 10, 11, 12], or derive it from df["period"].max()), then regenerate benchmarks/data/sdid_coverage.json and the REGISTRY table/commentary that transcribe from it. benchmarks/python/coverage_sdid.py:L186-L217, diff_diff/prep_dgp.py:L1211-L1215, diff_diff/prep_dgp.py:L1557-L1558, docs/methodology/REGISTRY.md:L1576-L1595

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3-informational. Impact: full-design survey support is still missing for variance_method="placebo" and "jackknife", but that gap is explicitly tracked and documented, so it is non-blocking under this repo’s review rules. Concrete fix: none required in this PR. TODO.md:L107-L108, docs/methodology/REGISTRY.md:L1549-L1561

Security

  • No findings.

Documentation/Tests

  • Severity P2. Impact: the updated coverage-artifact smoke test verifies schema and the calibration gate, but not that the new stratified_survey DGP’s post_periods form the intended contiguous post-treatment tail; that is why the off-by-one DGP bug can ship with a passing artifact check. Concrete fix: add a small regression test or harness assertion that _stratified_survey_dgp() returns post_periods == [7, 8, 9, 10, 11, 12] (or equivalently max(post_periods) == df["period"].max() and no omitted periods after onset). benchmarks/python/coverage_sdid.py:L186-L217, tests/test_methodology_sdid.py:L3432-L3505

Path to Approval

  1. Fix the stratified_survey DGP to include all post-treatment periods for the 1-indexed survey generator. benchmarks/python/coverage_sdid.py:L186-L217, diff_diff/prep_dgp.py:L1211-L1215, diff_diff/prep_dgp.py:L1557-L1558
  2. Regenerate benchmarks/data/sdid_coverage.json and update the REGISTRY calibration table/text so they reflect the corrected survey-bootstrap benchmark. benchmarks/data/sdid_coverage.json:L135-L170, docs/methodology/REGISTRY.md:L1576-L1595
  3. Add a regression test that fails if the survey coverage DGP drops any period after treatment onset from post_periods. tests/test_methodology_sdid.py:L3432-L3505

``generate_survey_did_data`` is 1-indexed (prep_dgp.py L1211-L1212),
so ``n_periods=12`` with ``cohort_periods=[7]`` emits periods 1..12
with post = [7, 8, 9, 10, 11, 12]. The coverage harness'
``_stratified_survey_dgp`` returned ``list(range(7, 12))`` =
[7, 8, 9, 10, 11], silently dropping period 12 into the pre window.
SDID therefore fit the panel as 7-pre/5-post instead of the
documented 6-pre/6-post, and every rejection / mean SE cell in the
survey-bootstrap calibration row (plus the REGISTRY narrative
transcribed from it) was derived from the mis-specified window.

Fix: derive post_periods from ``df["period"].max()`` so any change
to ``n_periods`` propagates. Regression test
``test_stratified_survey_dgp_post_periods_cover_full_post_tail``
fails fast if a future refactor reintroduces the off-by-one (checks
unique / sorted / contiguous / max == df.period.max() plus the
explicit [7, 8, 9, 10, 11, 12] shape).

Regenerated only the stratified_survey block and spliced it into
the main artifact (other DGPs unaffected — their seeds and DGP code
are unchanged). New rejection rates at α = {0.01, 0.05, 0.10}:
{0.024, 0.058, 0.094}; mean SE / true SD drops from 1.25 to 1.13.
Rejection at α=0.05 remains well inside the calibration gate
[0.02, 0.10]. REGISTRY table row and narrative updated to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
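
The shape of the fix, as a sketch; the helper name is hypothetical (the harness function is _stratified_survey_dgp) and the column name 'period' is taken from the commit message:

```python
import pandas as pd

def post_periods_for(df, onset=7):
    """Derive the post window from the panel itself instead of hardcoding
    a range, so any change to n_periods propagates."""
    last = int(df["period"].max())
    return list(range(onset, last + 1))

# On the 1-indexed 12-period survey DGP this restores the full post tail:
df = pd.DataFrame({"period": list(range(1, 13)) * 2})
assert post_periods_for(df) == [7, 8, 9, 10, 11, 12]
```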
@igerber
Owner Author

igerber commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 08056e482fa9edbc915277cb647efa5675be8333


Overall Assessment

Looks good — the prior P1 on the stratified_survey post window is fixed, and I did not find a new unmitigated P0/P1 in the weighted-FW + Rao-Wu SDID bootstrap changes.

Executive Summary

  • Previous blocker resolved: _stratified_survey_dgp() now derives the full post tail through period 12, and a regression test pins the [7, 8, 9, 10, 11, 12] contract. benchmarks/python/coverage_sdid.py:L186-L224, tests/test_methodology_sdid.py:L3522-L3564
  • The changed SDID survey-bootstrap path matches the documented registry contract: resolved unit-level survey weights, weighted FW unit/time refits, Rao-Wu slicing over boot_idx, and zero-mass / zero-support guards. docs/methodology/REGISTRY.md:L1549-L1597, diff_diff/synthetic_did.py:L442-L826, diff_diff/utils.py:L1303-L2075, diff_diff/bootstrap_utils.py:L657-L785
  • Remaining full-design placebo / jackknife limits are explicitly tracked, so they are non-blocking in this PR. TODO.md:L107-L108, docs/methodology/REGISTRY.md:L1549-L1561
  • Added tests cover the main survey-bootstrap regressions from the prior review: pweight rescaling invariance, zero-mass draws, FPC front-door validation, Rao-Wu slicing, single-PSU NaN handling, and the coverage-DGP post-tail fix. tests/test_methodology_sdid.py:L674-L1141, tests/test_methodology_sdid.py:L3522-L3564
  • I could not execute the test suite here because pytest is not installed in this environment.

Methodology

  • No findings.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3-informational
  • Impact: Full-design survey support still stops at variance_method="bootstrap"; placebo and jackknife with strata/PSU/FPC remain deferred. This is explicitly tracked and documented, so it is not a blocker for this PR. TODO.md:L107-L108, docs/methodology/REGISTRY.md:L1549-L1561
  • Concrete fix: None required in this PR.

Security

  • No findings.

Documentation/Tests

  • No findings. The documentation and tests were updated consistently with the restored survey-bootstrap support, and the prior calibration-artifact defect now has direct regression coverage. docs/methodology/REGISTRY.md:L1576-L1597, tests/test_methodology_sdid.py:L3497-L3564

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 24, 2026
The R13 P1 regression test ``test_stratified_survey_dgp_post_periods_cover_full_post_tail``
imports ``benchmarks.python.coverage_sdid`` directly to exercise the
private ``_stratified_survey_dgp`` helper. CI's isolated-install job
deliberately copies only ``tests/``, not ``benchmarks/``, so the
module import failed with ``ModuleNotFoundError`` on CI runners that
install the package into a fresh site-packages and then run the test
suite against that install.

The target is a benchmarking harness helper, not shipped package
code, so the natural home is ``benchmarks/python/``. Moving it there
keeps the test runnable locally (developer invokes explicitly before
regenerating the coverage MC artifact) and out of CI's collection
(``pyproject.toml testpaths = ["tests"]`` scopes default discovery to
``tests/`` only, so the new file never interferes).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026
Rebased onto current main (resolved CHANGELOG.md conflict: the by_path
bullet from PR #355 and the profile_panel/autonomous-guide bullet from
this PR now live side-by-side under [Unreleased]).

has_always_treated now has binary-only semantics:
- For binary treatment (absorbing or non-absorbing): unit_min == 1
  means the unit is treated in every observed period (no
  pre-treatment information in the DiD sense).
- For continuous treatment: always False. Pre-treatment periods on
  continuous DiD are determined by the separate `first_treat` column
  supplied to `ContinuousDiD.fit`, not by whether the dose is
  positive. A unit with a constant positive dose can still have
  well-defined pre-treatment periods, so flagging it as
  "always-treated / no pre-treatment information" was factually wrong
  and triggered the misleading `has_always_treated_units` alert on
  valid continuous panels.
- Categorical: False by construction.

Guide §2 has_always_treated field doc updated to state the
binary-only semantics explicitly, with a note about `first_treat`.

Tests:
- New: test_continuous_positive_dose_does_not_fire_has_always_treated
  asserts has_always_treated=False AND the alert does not fire on a
  constant-positive-dose continuous panel.
- Existing test_continuous_zero_dose_controls_flag_has_never_treated
  updated: has_always_treated expected to be False (was True under
  the old semantics).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
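
The binary-only semantics above reduce to a small predicate; a sketch with a hypothetical name and signature:

```python
def has_always_treated(treatment_type: str, unit_min: float) -> bool:
    """Only binary panels infer always-treated from unit_min == 1;
    continuous pre-periods come from the separate first_treat column,
    and the categorical case is False by construction."""
    if treatment_type == "binary":
        return unit_min == 1
    return False  # continuous and categorical panels are never flagged
```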
@igerber igerber merged commit 4590632 into main Apr 24, 2026
19 checks passed
@igerber igerber deleted the sdid-bootstrap-survey branch April 24, 2026 15:16
igerber added a commit that referenced this pull request Apr 24, 2026
Rebased onto current main (17 commits clean — PR #355, #358, #359 all
merged since last rebase).

StaggeredTripleDifference corrected as panel-only + balance-enforced.
The earlier §4.10 RCS wording paired TripleDifference /
StaggeredTripleDifference together in the Explicit RCS support list,
but REGISTRY.md §StaggeredTripleDifference requires a balanced panel
and staggered_triple_diff.py:93-109 has no panel=False mode — fit()
rejects unbalanced/duplicate (unit, time) structure at
staggered_triple_diff.py:846-864.

- §4.10 Explicit RCS support: TripleDifference (two-period) only;
  StaggeredTripleDifference removed from the supported set.
- §4.10 Explicitly rejected for RCS: StaggeredTripleDifference added
  with a concrete "no panel=False mode" + "use TripleDifference for
  cross-sectional DDD" pointer.
- §3 Balanced-panel eligibility: StaggeredTripleDifference added to
  the balance-sensitive gate.

Regression tests extended:
- Balanced-panel proximity check now covers StaggeredTripleDifference.
- §4.10 section test asserts StaggeredTripleDifference appears in the
  Explicitly rejected block and NOT in the Explicit RCS support block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026
…er-fix

Fix CHANGELOG: rename survey-bootstrap PR placeholder #352 to #355
igerber added a commit that referenced this pull request Apr 24, 2026
Closes the last SDID survey gap (TODO.md row 107). PR #355 restored
variance_method="bootstrap" for strata/PSU/FPC via hybrid pairs-
bootstrap + Rao-Wu + weighted-FW. This commit extends the same full-
design capability to variance_method="placebo" and "jackknife".

Placebo allocator — stratified permutation (Pesarin 2001).
Pseudo-treated indices drawn within each stratum containing actual
treated units; weighted-FW re-estimates ω and λ per draw with per-
control survey weights threaded into both loss and regularization
(reuses compute_sdid_unit_weights_survey + compute_time_weights_survey
from PR #355). New private method _placebo_variance_se_survey.

Fit-time front-door guards (per feedback_front_door_over_retry_swallow.md)
distinguish two infeasible permutation configurations with targeted
ValueError messages: Case B (stratum with treated units has zero
controls) and Case C (stratum with treated units has fewer controls
than treated). Partial-permutation fallback rejected — it silently
changes the null-distribution semantics.

Jackknife allocator — PSU-level leave-one-out with stratum
aggregation (Rust & Rao 1996). SE² = Σ_h (1-f_h)·(n_h-1)/n_h·
Σ_{j∈h}(τ̂_{(h,j)} - τ̄_h)². FPC form: f_h = n_h_sampled / fpc[h]
(population-count form from survey.py::SurveyDesign.resolve;
confirmed via survey.py:338-356 where fpc_h < n_psu_h is the
validation constraint). λ held fixed across LOOs; ω subset + rw-
composed-renormalized (matches Arkhangelsky Algorithm 3 non-survey
semantics — jackknife is variance-approximation, not refit-variance).
Strata with n_h < 2 skip silently; total-zero-variance → NaN +
UserWarning. Unstratified designs with PSU treated as single-stratum
JK1. New private method _jackknife_se_survey.

Gate relaxation — deletes the placebo+jackknife+strata/PSU/FPC raise
at synthetic_did.py:352-369. Replicate-weight gate at L329-337
unchanged (separate methodology; closed-form replicate variance
double-counts with Rao-Wu-like rescaling). fit() dispatcher adds
_placebo_use_survey_path / _jackknife_use_survey_path flags routing
to the new methods when appropriate; non-survey and pweight-only
paths bit-identical by construction (guarded by the same branch
isolation pattern used in PR #355 _bootstrap_se).

Allocator asymmetry — placebo ignores PSU axis; jackknife respects
it. Intentional: placebo is a null-distribution test (stratified unit-
level permutation is classical — PSU-level permutation on few PSUs is
near-degenerate), while jackknife is a design-based variance
approximation (PSU-level LOO is canonical per Rust & Rao). Both
respect strata. Rationale documented in method docstrings and
REGISTRY (follow-up commit).

Tests — tests/test_survey_phase5.py:
- TestSyntheticDiDSurvey: flip test_full_design_placebo_raises and
  test_full_design_jackknife_raises from NotImplementedError→succeeds;
  assert finite SE > 0, populated survey_metadata, .summary() round-trip.
- TestSDIDSurveyPlaceboFullDesign (new class): pseudo-treated-stays-
  within-treated-strata (monkeypatched recorder), Case B / Case C
  front-door guards (targeted ValueError match), se-differs-from-
  pweight-only, deterministic dispatch.
- TestSDIDSurveyJackknifeFullDesign (new class): stratum-aggregation
  self-consistency, fpc-reduces-se magnitude (SE_fpc = SE_nofpc/sqrt(2)
  at f=0.5, rtol=1e-10), se-differs-from-pweight-only, single-PSU-
  stratum silently skipped, unstratified short-circuit, all-strata-
  skipped UserWarning + NaN, deterministic dispatch.

Non-survey and pweight-only regressions — all 32 tests in
TestBootstrapSE + TestPlaceboSE + TestJackknifeSE pass unchanged;
bit-identity preserved by the new-path-gating pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
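
Since the commit states both allocators' algorithms explicitly, two standalone sketches may help. Both are reconstructions over assumed array-based inputs, not the new private methods:

```python
import numpy as np

def stratified_jackknife_se(loo_estimates, strata, fpc=None):
    """Stratum-aggregated PSU-level leave-one-out SE from the formula in
    the commit: SE^2 = sum_h (1-f_h) * (n_h-1)/n_h * sum_{j in h}
    (tau_hj - taubar_h)^2, with f_h = n_h / fpc[h]. loo_estimates[j] is
    the estimate with PSU j left out; fpc maps stratum -> population PSU
    count. Degenerate strata (n_h < 2) skip silently, as documented."""
    loo = np.asarray(loo_estimates, dtype=float)
    strata = np.asarray(strata)
    var = 0.0
    for h in np.unique(strata):
        tau_h = loo[strata == h]
        n_h = tau_h.size
        if n_h < 2:
            continue
        f_h = n_h / float(fpc[h]) if fpc is not None else 0.0
        var += (1.0 - f_h) * (n_h - 1) / n_h * np.sum(
            (tau_h - tau_h.mean()) ** 2)
    return float(np.sqrt(var))

def stratified_pseudo_treated(strata, is_treated, rng):
    """Stratified-permutation placebo allocator: within each stratum that
    contains treated units, draw that many pseudo-treated indices from
    the stratum's controls. Case B (no controls) and Case C (fewer
    controls than treated) are assumed rejected front-door at fit time."""
    strata = np.asarray(strata)
    is_treated = np.asarray(is_treated, dtype=bool)
    pseudo = []
    for h in np.unique(strata):
        in_h = strata == h
        k = int(is_treated[in_h].sum())
        if k == 0:
            continue                 # strata with no treated units skipped
        controls = np.flatnonzero(in_h & ~is_treated)
        pseudo.extend(rng.choice(controls, size=k, replace=False))
    return np.asarray(pseudo)
```

The asymmetry the commit documents is visible here: the jackknife sketch iterates over PSU-level leave-one-out estimates within strata, while the placebo sketch permutes at the unit level and never touches the PSU axis.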
igerber added a commit that referenced this pull request Apr 24, 2026
…placebo, jackknife)

Second commit for the SDID survey-placebo/jackknife PR. Extends the
coverage Monte Carlo artifact with jackknife on the stratified_survey
DGP (bootstrap calibration unchanged); promotes the deferred REGISTRY
§SyntheticDiD gap bullets to two landed Notes; updates user-facing
docs to reflect restored capability.

Coverage MC changes
-------------------
* benchmarks/python/coverage_sdid.py: _stratified_survey_design now
  returns ("bootstrap", "jackknife") on the methods tuple. Placebo is
  omitted because the DGP's cohort packs into a single stratum with 0
  never-treated units — stratified-permutation placebo is structurally
  infeasible on this DGP (raises Case C at fit-time). Module docstring
  explains the exclusion and the jackknife anti-conservatism caveat.
* benchmarks/data/sdid_coverage.json: regenerated stratified_survey
  block at n_seeds=500, n_bootstrap=200. Bootstrap validates near-
  nominal (α=0.05 rejection = 0.058, SE/trueSD = 1.13). Jackknife row
  reports α=0.05 rejection = 0.45, SE/trueSD = 0.46 — documented anti-
  conservatism from the stratified jackknife formula with 2 PSUs per
  stratum (1 effective DoF per stratum, Rust & Rao 1996 limitation).

REGISTRY.md §SyntheticDiD
-------------------------
* Survey support matrix updated: all three variance methods now
  support strata/PSU/FPC (not just bootstrap).
* Two new landed Notes:
  - "Note (survey + placebo composition)": stratified-permutation
    allocator, weighted-FW refit, ω_eff composition, fit-time
    feasibility guards (Case B / Case C), scope note on what is NOT
    randomized (within-stratum PSU axis). Cites Pesarin (2001) /
    Pesarin & Salmaso (2010).
  - "Note (survey + jackknife composition)": PSU-level LOO algorithm,
    explicit stratum-aggregation SE² formula, FPC handling (population-
    count form from survey.py:338-356), fixed-weights rationale,
    degenerate-LOO skip semantics, scope note, known anti-conservatism
    with few PSUs per stratum. Cites Rust & Rao (1996).
* "Allocator asymmetry" paragraph in the survey support matrix
  documents the intentional asymmetry (placebo ignores PSU, jackknife
  respects it) with rationale rooted in each method's role (null-
  distribution test vs design-based variance approximation).
* Coverage MC table adds the stratified_survey × jackknife row with
  anti-conservatism narrative; placebo row explicitly marked N/A-on-
  this-DGP (with pointer to the unit-test coverage).
* Requirements checklist entries updated to describe full-design
  support for placebo and jackknife.

Docs sweep
----------
* docs/methodology/survey-theory.md: new bullets describing the
  stratified-permutation placebo allocator and the PSU-level LOO
  jackknife, parallel to the existing hybrid-bootstrap bullet.
* docs/tutorials/16_survey_did.ipynb cell 35: support matrix SDID
  row updated from "bootstrap only (PR #352)" to "Full (all three
  variance methods)"; legend amended; "Note on SyntheticDiD" block
  rewritten to describe all three allocators with the jackknife
  few-PSU caveat.
* docs/survey-roadmap.md: Phase 5 matrix row closes the placebo/
  jackknife gap; Phase 6 bullet updated to describe all three
  allocators; Current Limitations table entry removed (only replicate-
  weight limitation remains, merged into one row).
* CHANGELOG.md: "### Added" entry for placebo + jackknife full-design
  support (no new section header — folded into existing Unreleased
  block); "### Changed (PR #355)" tweaked to note the separate
  follow-up for placebo/jackknife.
* TODO.md row 107 deleted (capability gap closed).
* diff_diff/synthetic_did.py __init__ docstring: survey_design
  parameter description rewritten to describe all three methods.
  Placebo fallback-guidance comment updated to remove stale "placebo
  and jackknife reject strata/PSU/FPC" line.
* diff_diff/guides/llms-full.txt: Phase 5 bootstrap bullet updated
  to describe all three survey allocators (UTF-8 fingerprint
  preserved — `D'Haultfœuille` still appears throughout).
* tests/test_methodology_sdid.py::TestCoverageMCArtifact: narrative
  and assertions updated to reflect that placebo=0-fits is expected
  structurally on stratified_survey (documented Case C), while
  jackknife now runs successfully with the known anti-conservatism
  caveat intentionally unasserted at the calibration-gate level.

Verification
------------
* pytest tests/test_survey_phase5.py::TestSDIDSurveyPlaceboFullDesign
  tests/test_survey_phase5.py::TestSDIDSurveyJackknifeFullDesign
  tests/test_survey_phase5.py::TestSyntheticDiDSurvey
  tests/test_methodology_sdid.py::{TestBootstrapSE,TestPlaceboSE,TestJackknifeSE,TestCoverageMCArtifact}
  tests/test_guides.py → 82 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
…trap/jackknife only

P1 (Methodology — implicit-PSU FPC validator leaked into placebo):
PR #355 R8 P1 added a fit-time validator that rejects ``psu=None``
+ ``fpc < n_units`` designs, because Rao-Wu bootstrap treats each
unit as its own PSU and would fail mid-draw with the bootstrap loop
swallowing the error as a generic exhaustion message. The validator
ran unconditionally on every survey fit.

After R8 documented FPC as a placebo no-op (Pesarin 2001 §1.5 —
permutation tests condition on the observed sample), this validator
became inconsistent: a placebo fit with low FPC and no explicit
``psu`` would still raise a "FPC must be ≥ n_units" error for a
constraint that doesn't apply to the placebo math.

Fix: gate the implicit-PSU FPC validator on
``self.variance_method in ("bootstrap", "jackknife")``. Both methods
genuinely consume FPC (Rao-Wu rescaling for bootstrap, Rust & Rao
``(1 - f_h)`` factor for jackknife). Placebo proceeds to the
documented no-op warning path regardless of FPC value.

New regression
``test_placebo_low_fpc_no_psu_warns_no_validator_block``:
sets ``fpc_col = 5`` (well below n_units=30) with no PSU. Asserts
(a) placebo fit succeeds, (b) emits the documented FPC-no-op
``UserWarning``, (c) SE matches the no-FPC pweight-only fit at
``rel=1e-12``, AND (d) bootstrap on the same low-FPC design still
raises the validator error (gating preserves bootstrap/jackknife
behavior — only placebo's FPC contract changes).

Verification: 97 passed (1 new low-FPC placebo regression; existing
bootstrap/jackknife FPC validation regressions still fire on their
fixtures).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
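
The gating condition reduces to a small predicate; a sketch with a hypothetical name:

```python
def fpc_validator_applies(variance_method, psu, fpc):
    """Only bootstrap (Rao-Wu rescaling) and jackknife (the (1 - f_h)
    factor) actually consume FPC; placebo conditions on the observed
    sample, so the implicit-PSU FPC validator must not fire for it."""
    return (
        variance_method in ("bootstrap", "jackknife")
        and psu is None
        and fpc is not None
    )
```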