Restore SDID survey-bootstrap via weighted Frank-Wolfe + Rao-Wu composition#355
Conversation
Overall Assessment
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
Fixes two P1 issues flagged by the CI reviewer on the initial submission of PR #352.

P1 Methodology — `compute_time_weights_survey` was documented as solving the WLS-style weighted λ objective min Σ_u rw_u·(Σ_t λ_t·Y_u,pre[t] - Y_u,post_mean)² + ζ²·||λ||² but row-scaled Y by sqrt(rw) and then handed the scaled matrix to `_sc_weight_fw(intercept=True)`, whose column-centering uses an UNWEIGHTED mean across controls. That is not the weighted objective once rw varies, so non-uniform survey-bootstrap draws were refitting λ on the wrong objective and could bias the reported SE. Fix: weighted-center `Y_time` BEFORE the sqrt(rw) row-scaling, using `col_weighted_mean = (Y_time * rw).sum(0) / rw.sum()`, and pass `intercept=False` to the kernel so no additional unweighted centering happens on the scaled matrix. Both two-pass calls updated. `compute_sdid_unit_weights_survey` is unchanged — its column-centering is PER-UNIT (time means within each control column), which is independent of rw.

P1 Code Quality — `SurveyDesign(weights=None, strata=..., psu=...)` is a valid configuration (`SurveyDesign.resolve()` synthesizes ones when weights is None), but `_extract_unit_survey_weights` indexed `survey_design.weights` as if it were always a column name, so the groupby would fail with a KeyError before the bootstrap branch could run. Fix: `_extract_unit_survey_weights` now short-circuits to a vector of ones of length `len(unit_order)` when `survey_design.weights is None`, matching `SurveyDesign.resolve()`'s semantics.

Regression tests:
- `test_non_uniform_rw_beats_unweighted_centering_variant` (test_weighted_fw.py): reproduces the pre-fix buggy variant (row-scale Y by sqrt(rw), then call `_sc_weight_fw(intercept=True)`) and asserts the fixed path's weighted SSR is strictly ≤ the buggy variant's weighted SSR. If a future revert reintroduces intercept=True after the row-scaling, this test fails.
- `test_bootstrap_full_design_without_explicit_weights` (test_methodology_sdid.py): `SurveyDesign(strata=..., psu=...)` with no explicit `weights` column now succeeds on the bootstrap path; survey_metadata is populated with n_strata / n_psu.

P3 Documentation:
- `SyntheticDiD.fit()` docstring (survey_design parameter + Raises block): replace the "bootstrap rejects all survey designs" language with the PR #352 support-matrix truth table (bootstrap ✓ for both pweight-only and full design; placebo/jackknife ✓ pweight-only, ✗ full design).
- `_placebo_variance_se` fallback-guidance messages (two sites): drop the "strata/PSU/FPC not yet supported by any SDID variance method" framing; recommend bootstrap for full-design survey fallback, jackknife for pweight-only, and adding controls as the universal fallback.
- `docs/survey-roadmap.md` Current Limitations table: collapse the two SDID bootstrap-rejection rows into a single row for placebo/jackknife + full design (the bootstrap + full design row no longer applies).

Verified: 75 targeted tests pass (test_weighted_fw + TestBootstrapSE + TestScaleEquivariance + TestCoverageMCArtifact + test_survey_phase5).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
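The centering-order fix described above can be sketched in a few lines. This is an illustrative NumPy sketch, not the library implementation; the function name and the (n_controls, T) array layout are assumptions:

```python
import numpy as np

def weighted_center_then_scale(Y_time, rw):
    """Sketch: Y_time is (n_controls, T), rw is (n_controls,) survey weights."""
    rw = np.asarray(rw, dtype=float)
    # Weighted column mean across controls -- the fix centers here, BEFORE scaling.
    col_weighted_mean = (Y_time * rw[:, None]).sum(axis=0) / rw.sum()
    Y_centered = Y_time - col_weighted_mean
    # Row-scale by sqrt(rw): the plain SSR of the scaled matrix equals the
    # rw-weighted SSR of the centered matrix, so the downstream kernel can run
    # with intercept=False (no further unweighted centering).
    return np.sqrt(rw)[:, None] * Y_centered
```

The buggy variant did the sqrt(rw) scaling first and then let the kernel subtract an unweighted column mean, which only coincides with the weighted objective when rw is constant.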
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ⛔ Blocker — one unmitigated
Methodology: Arkhangelsky et al.'s Algorithm 2 bootstraps SDID by resampling rows with replacement and discarding resamples with no treated or no control units; Rao & Wu (1988) and Rao, Wu & Yue (1992) are the cited complex-survey resampling references for the survey branch. The PR's weighted-FW survey composition is a documented registry deviation, so I treated it as methodology-acceptable for this review. (arxiv.org)
Execution note: I did not run the test suite in this environment because the available
/ai-review
…path CI review flagged a silent-failure P0: when a bootstrap draw's treated units all had survey weight 0 under a pweight-only survey design, the code fell through to ``np.mean(Y_boot_pre_t, axis=1)`` (unweighted mean). That silently dropped survey weighting for that draw while the fit-time ATT uses the survey-weighted treated mean — corrupting the bootstrap distribution used for SE, p-value, and CI. Reachable with any `SurveyDesign(weights=...)` input where at least one treated unit has pweight 0 (a valid configuration per ``SurveyDesign.resolve()``'s spec) and the bootstrap resample picks only those units.

Fix: extend the Rao-Wu degenerate-retry in ``_bootstrap_se`` to the pweight-only branch as well. When ANY survey path is active and ``rw_treated_draw.sum() == 0`` OR ``rw_control_draw.sum() == 0``, the draw is retried. The pre-fix behavior (unweighted treated-mean fallback) is now unreachable from any code path.

Regression: ``test_bootstrap_pweight_only_retries_zero_treated_mass_draws`` in ``tests/test_methodology_sdid.py::TestBootstrapSE``. Constructs a panel with 3 treated units where one has `wt=0`; bootstrap at B=100 still returns finite SE > 0. The test would fail if the code reverted to the unweighted-mean fallback because zero-mass draws would either produce degenerate SDID estimators (NaN τ → gets filtered out by the finite-τ gate, possibly leaving no valid draws) or produce systematically different τ values that inflate the SE.

P3 (same review) also addressed:
- ``diff_diff/synthetic_did.py`` replicate-weight rejection message (line ~303) now reflects the PR #352 support matrix instead of the pre-PR-#352 "only pweight-only placebo/jackknife" framing. Analytical survey designs are now fully supported per the truth table; only replicate-weight variance remains unsupported.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
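The degenerate-retry described above can be sketched as a small resampling loop. A hedged sketch under assumed names (`draw_nondegenerate`, `max_retries`, the controls-first unit ordering); the library's actual retry lives inside `_bootstrap_se`:

```python
import numpy as np

def draw_nondegenerate(rng, w_unit, n_control, max_retries=100):
    """Redraw until both arms carry positive survey mass in the resample."""
    n_total = len(w_unit)
    for _ in range(max_retries):
        boot_idx = rng.choice(n_total, size=n_total, replace=True)
        rw_draw = w_unit[boot_idx]
        is_control = boot_idx < n_control  # units ordered [controls..., treated...]
        if rw_draw[is_control].sum() > 0 and rw_draw[~is_control].sum() > 0:
            return boot_idx, rw_draw
    raise RuntimeError("bootstrap retries exhausted: all draws degenerate")
```

A draw that lands only on zero-weight treated units is retried instead of silently degrading to an unweighted mean.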
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good — no unmitigated
/ai-review
…lice Two P3s from R3; PR was already ✅ Looks good — these are close-out polish.

P3 docs — secondary surfaces described the full-design path as "Rao-Wu rescaled bootstrap", but only REGISTRY.md surfaced the material caveat that SDID still uses unit-level pairs-bootstrap resampling (``boot_idx = rng.choice(n_total)``) and then applies Rao-Wu rescaled weights on top — a hybrid composition, not a standalone Rao-Wu bootstrap. Update survey-theory.md (splitting SunAbraham/TROP's standalone Rao-Wu bullet from SDID's hybrid bullet) and CHANGELOG.md's PR #352 Added entry to use the hybrid-composition wording mirroring REGISTRY.

P3 tests — the methodology-critical ``boot_idx`` × ``generate_rao_wu_weights`` interaction was only guarded by the slow coverage MC. Add ``test_bootstrap_full_design_rao_wu_boot_idx_slice`` (in ``TestBootstrapSE``), which monkeypatches ``generate_rao_wu_weights`` to return a known vector of distinct per-unit values (``arange(1, n_total+1)``), captures the ``rw_control_draw`` vectors fed into the weighted FW helper via a capturing wrapper on ``compute_sdid_unit_weights_survey``, and asserts every captured vector lies within ``known_rw[:n_control]`` (positions 1..n_control). This catches two bug classes:
- slice-order regression: if someone swaps rw-then-slice for slice-then-rw, the captured vectors would include values from the treated slice ``known_rw[n_control:]`` and the assertion fires.
- rw-drift regression: if the Rao-Wu call site bypasses ``generate_rao_wu_weights`` (e.g., a refactor silently uses the pweight-only branch for full-design fits), the captured vector would be the user's w_control (all 1.0 in this test) instead of the known Rao-Wu output.

Verified: 294 targeted tests pass across test_methodology_sdid / test_survey_phase5 / test_weighted_fw / test_guides / test_rust_backend.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
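The rw-then-slice contract the test guards can be shown in a few lines. An illustrative sketch; names like `known_rw` and `boot_idx` mirror the test's conventions, not the library API:

```python
import numpy as np

n_control, n_treated = 4, 2
n_total = n_control + n_treated
known_rw = np.arange(1, n_total + 1, dtype=float)  # distinct value per unit
rng = np.random.default_rng(42)
boot_idx = rng.choice(n_total, size=n_total, replace=True)
boot_is_control = boot_idx < n_control  # units ordered [controls..., treated...]

# rw-then-slice: index the full rw vector first, THEN split by arm.
rw_control_draw = known_rw[boot_idx][boot_is_control]

# Range form (the R3 test): control draws only ever see control-slice values.
assert np.all(np.isin(rw_control_draw, known_rw[:n_control]))
# Exact form (the R4 tightening): elementwise equality against the reproduced stream.
assert np.array_equal(rw_control_draw,
                      known_rw[:n_control][boot_idx[boot_is_control]])
```

Swapping to slice-then-rw would let values from `known_rw[n_control:]` leak into the control draw, which is exactly what the range assertion catches.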
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ⛔ Blocker — prior review items around documentation consistency and
…tighten boot_idx slice test
R4 P0 (scale-invariance): the pweight-only bootstrap branch was sourcing
w_control / w_treated from raw panel-column weights via
_extract_unit_survey_weights. The weighted-FW bootstrap objective is not
scale-invariant in rw (loss scales as rw^2 via A·diag(rw), reg scales as
rw), so two equivalent designs w and c*w could produce different
bootstrap SE / p-value / CI with no warning. Fix: source w_control /
w_treated from resolved_survey_unit.weights, which SurveyDesign.resolve()
normalizes to mean=1 (survey.py L189-L203). Placebo / jackknife paths
also consume the same w_control / w_treated but are scale-invariant, so
their numerics are unchanged.
R4 P3 (test tightening): the boot_idx × Rao-Wu regression test asserted
captured rw values stayed within the known_rw range [1, 15] — too weak to
catch permutation / deduplication regressions in the slice order. Tighten
by reproducing the bootstrap RNG stream externally (fake_rao_wu doesn't
consume rng) and asserting exact-equality between the captured rw_control
vector and known_rw[:n_control][boot_idx[boot_is_control]].
New regression test: test_bootstrap_scale_invariance_under_pweight_rescaling
fits the same panel under SurveyDesign("wt") vs SurveyDesign("wt_scaled")
(10x rescale) and asserts SE, p-value, CI match to machine-epsilon tolerance.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
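The scale-invariance fix above reduces to normalizing pweights at resolve time. A minimal sketch, assuming the mean-1 normalization described for `SurveyDesign.resolve()` (the function name here is illustrative):

```python
import numpy as np

def resolve_pweights(w):
    """Normalize survey pweights to mean 1, so w and c*w are equivalent designs."""
    w = np.asarray(w, dtype=float)
    return w / w.mean()
```

Because the weighted-FW bootstrap objective scales as rw² in the loss and rw in the regularization, feeding raw column weights would make `SurveyDesign("wt")` and a 10x-rescaled `SurveyDesign("wt_scaled")` disagree; sourcing from the resolved, mean-1 weights removes that dependence.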
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good — no unmitigated P0/P1 findings. The prior rerun's raw-vs-resolved pweight blocker is fixed in
/ai-review
R5 P2: Rust ``sc_weight_fw_weighted_internal`` silently fell back to the unweighted kernel on ``reg_weights.len() != Y.shape[1] - 1`` while the NumPy path raised ``ValueError``. Two direct callers of the Rust pyfunctions (``sc_weight_fw_weighted`` / ``_with_convergence``) could therefore get the wrong objective with no error. Validate the shape both at the Python dispatcher (``diff_diff.utils._sc_weight_fw``) and at each Rust pyfunction entry point so both backends raise ``ValueError`` consistently. The internal's fallback is kept as defensive for Rust-side callers but becomes unreachable from the Python API; the comment is updated to reflect that. Regression test: ``test_reg_weights_length_mismatch_raises`` in tests/test_weighted_fw.py.

R5 P3 (docs/choosing_estimator.rst): the SDID row in the survey support matrix and the trailing note still said survey-bootstrap was rejected. Updated to describe the landed hybrid pairs-bootstrap + Rao-Wu rescaled composition and to point users at the REGISTRY note for the objective form and argmin-set caveat.

R5 P3 (benchmarks/python/coverage_sdid.py): the module docstring said the study covers three DGPs; it now covers four (``stratified_survey`` added), with bootstrap-only coverage on that DGP via ``survey_design_factory``. Refresh the docstring to describe all four DGPs and call out the bootstrap-only coverage for stratified_survey.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
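The dispatcher-side check can be sketched as follows. A hedged Python sketch: the last-column-is-target layout follows the ``Y.shape[1] - 1`` condition quoted above, and the function name is illustrative:

```python
import numpy as np

def validate_reg_weights(Y, reg_weights):
    """Raise ValueError on a reg_weights length mismatch instead of
    silently falling back to the unweighted kernel."""
    if reg_weights is None:
        return None
    reg_weights = np.asarray(reg_weights, dtype=float)
    n_donors = Y.shape[1] - 1  # last column assumed to hold the target
    if reg_weights.shape != (n_donors,):
        raise ValueError(
            f"reg_weights has length {reg_weights.shape[0]}, "
            f"expected {n_donors} (= Y.shape[1] - 1)"
        )
    return reg_weights
```

Running the same validation at both the Python dispatcher and the Rust pyfunction entry points keeps the two backends' error contracts aligned.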
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good — no unmitigated P0/P1 findings. The restored
/ai-review
Three places in REGISTRY.md still described the pre-PR-#352 behavior where SDID survey bootstrap raised NotImplementedError, contradicting the new support matrix and survey-composition note landed earlier in this PR:
- §SyntheticDiD bootstrap bullet (paper-faithful refit text): the "Composed with any survey design ... raises NotImplementedError" trailer is updated to describe weighted-FW dispatch under the survey + bootstrap composition note.
- §SyntheticDiD requirements checklist bullet for "Bootstrap: paper-faithful Algorithm 2 step 2": the "Survey designs raise NotImplementedError" trailer is updated to describe the hybrid pairs-bootstrap + Rao-Wu rescaling composition.
- §Rao-Wu Rescaled Bootstrap "intentionally excluded" note: rewritten to add SDID to the list via its hybrid composition, with the precise distinction from standalone Rao-Wu (SunAbraham / TROP) and a pointer to §SyntheticDiD for the objective form and argmin-set caveat.

Remaining NotImplementedError mentions in REGISTRY.md §SyntheticDiD (L1554-L1555) correctly describe the placebo / jackknife + strata/PSU/FPC methodology gap that is out of scope for this PR and tracked in TODO.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🔁 AI review rerun (requested by @igerber) Head SHA:
Code Quality: No findings. Performance: No findings. Maintainability: No findings. Tech Debt: No findings. The remaining
Security: No findings.
/ai-review
…g fix R7 P1: the per-draw zero-mass retry in ``_bootstrap_se`` (PR #355 R2 P0) only covers bootstrap draws, not the fit-time ATT. Survey weights are non-negative post-resolve(), but all-zero mass on either arm is a valid input that encodes an unidentified target population. Without a fit-time guard, the downstream ``np.average(Y, weights=w_treated)`` and ``omega_eff = unit_weights * w_control`` normalizations would hit 0/0 and silently propagate NaN through the bootstrap / placebo / jackknife dispatchers. Front-door the case: after ``w_control`` / ``w_treated`` are sourced from the resolved unit-level design, raise ``ValueError`` if either arm's total mass is <= 0. Covers both pweight-only and strata/PSU/FPC paths. Three regression tests added: ``test_fit_raises_on_zero_total_treated_survey_mass``, ``test_fit_raises_on_zero_total_control_survey_mass``, and ``test_fit_raises_on_zero_treated_mass_under_full_design``.

R7 P3: the SDID row in ``docs/choosing_estimator.rst`` said "pweight only (placebo / jackknife); full (bootstrap)" in the **Weights** column, conflating weight-type support (fweight / aweight / pweight) with design-element support (strata / PSU / FPC). The code still hard-rejects non-pweight survey designs on every variance method. Narrow the wording to "pweight only" and leave "Via bootstrap" in the Strata/PSU/FPC column to describe design-element support.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
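The fit-time front-door guard amounts to a few lines. A sketch under assumed names (`check_arm_mass` is illustrative; the real guard runs inside `fit()` after the unit-level weights are resolved):

```python
import numpy as np

def check_arm_mass(w_control, w_treated):
    """Raise before any variance dispatch if an arm has no survey mass."""
    for name, w in (("control", w_control), ("treated", w_treated)):
        if np.asarray(w, dtype=float).sum() <= 0:
            raise ValueError(
                f"total survey weight on the {name} arm is zero; "
                "the target population is unidentified"
            )
```

Failing here converts a silent downstream 0/0 (NaN ATT, SE, CI) into an immediate, actionable error.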
🔁 AI review rerun (requested by @igerber) Head SHA:
/ai-review
…D bootstrap When ``SurveyDesign(fpc=...)`` is declared without an explicit ``psu=``, ``bootstrap_utils.generate_rao_wu_weights`` (L654-L655) treats each unit as its own PSU. The helper rejects ``FPC < n_PSU`` mid-draw (L684-L688), so if FPC is set lower than the unit count (per stratum if stratified), every bootstrap draw raises ValueError; ``_bootstrap_se`` swallows the error in its retry loop and the user eventually sees a generic bootstrap-exhaustion message instead of a targeted FPC/design error.

Add a front-door validation on ``resolved_survey_unit`` after ``collapse_survey_to_unit_level``:
- unstratified: fpc >= total unit count;
- stratified: fpc_h >= per-stratum unit count.

Error messages point at the two actionable fixes (declare an explicit psu= column, or raise FPC). Two regression tests added: ``test_fit_raises_on_implicit_psu_fpc_below_unit_count_unstratified`` and ``test_fit_raises_on_implicit_psu_fpc_below_stratum_unit_count``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
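The per-stratum check can be sketched as follows. An illustrative sketch, assuming per-unit `strata` and `fpc` arrays after the unit-level collapse (names are not the library API):

```python
import numpy as np

def validate_implicit_psu_fpc(strata, fpc):
    """With implicit PSUs (each unit its own PSU), fpc must cover the
    unit count, per stratum if stratified."""
    strata = np.asarray(strata)
    fpc = np.asarray(fpc, dtype=float)
    for h in np.unique(strata):
        in_h = strata == h
        n_h = int(in_h.sum())
        fpc_h = fpc[in_h][0]  # assumed constant within a stratum
        if fpc_h < n_h:
            raise ValueError(
                f"stratum {h!r}: fpc={fpc_h} < {n_h} units; declare an "
                "explicit psu= column or raise the FPC"
            )
```

Raising once at fit time replaces B identical mid-draw failures swallowed by the retry loop.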
🔁 AI review rerun (requested by @igerber) Head SHA:
/ai-review
The ``Raises`` section of ``SyntheticDiD.fit()`` previously only listed "non-pweight survey design" among its ``ValueError`` conditions. The two fit-time guards added in PR #355 R7 and R8 are not reflected in the docstring:
- R7 P1: zero total survey mass on either arm.
- R8 P1: ``fpc`` declared without explicit ``psu=`` where ``fpc`` is below the (per-stratum) unit count.

Both raise ``ValueError`` before the bootstrap loop is dispatched. Expanding the ``Raises`` section makes the contract explicit to readers (and AI reviewers) cross-referencing the documented behavior against the implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🔁 AI review rerun (requested by @igerber) Head SHA:
Foundation for restoring SDID survey-bootstrap support (PR #352, follow-up to #351, which front-door rejected all survey designs). This commit adds the weighted-FW kernel + Python wrappers; the bootstrap integration lands in the next commit.

Rust (rust/src/weights.rs, rust/src/lib.rs):
- New `sc_weight_fw_gram_weighted` and `sc_weight_fw_standard_weighted` loop variants. Identical to the unweighted loops except for the regularization term: `half_grad[j]` picks up `eta*reg_w[j]*lam[j]` in place of `eta*lam[j]`, and the FW step-size denominator uses the diag(reg_w)-weighted simplex direction norm `Σ_j reg_w[j]*d[j]²` (which simplifies to `Σ_j reg_w[j]*lam[j]² + reg_w[i] - 2*reg_w[i]*lam[i]` for d = e_i - lam).
- New `sc_weight_fw_weighted_internal` dispatcher that delegates to the unweighted internal when reg_weights is None (preserves the legacy numeric contract for any future caller that wants the generic shape).
- Two new pyfunctions: `sc_weight_fw_weighted` and `sc_weight_fw_weighted_with_convergence`. Same call shape as the existing unweighted siblings plus a trailing `reg_weights` kwarg. Registered in lib.rs.
- 3 new Rust unit tests in rust/src/weights.rs:
  * test_weighted_fw_reg_weights_none_delegates — bit-identity at rel=1e-14 against the unweighted internal.
  * test_weighted_fw_uniform_reg_weights_matches_unweighted — uniform rw=1 collapses to uniform regularization (rel=1e-12, allowing for ULP-scale drift from different float reduction orders).
  * test_weighted_fw_simplex_invariants — for arbitrary positive rw and both gram (T0<N) and standard (T0>=N) paths, the returned ω sums to 1 and is non-negative.

Python (diff_diff/utils.py, diff_diff/_backend.py):
- Export _rust_sc_weight_fw_weighted and _with_convergence from _backend (mirrors the shape added for _rust_sc_weight_fw_with_convergence in PR #351 c0d089b).
- Extend `_sc_weight_fw` and `_sc_weight_fw_numpy` with a `reg_weights: Optional[np.ndarray] = None` kwarg. When set on the Rust path, it dispatches to the new weighted pyfunctions; on the pure-Python path, it runs a weighted FW loop mirroring the Rust derivation.
- New helper `compute_sdid_unit_weights_survey(Y_pre_control, Y_pre_treated_mean, rw_control, ...)`: column-scales Y_pre_control by rw_control and passes rw_control as reg_weights so the FW solves the unit-weight survey-bootstrap objective min_{ω simplex} Σ_t (Σ_i rw_i·ω_i·Y_i,pre[t] - treated_pre[t])² + ζ²·Σ_i rw_i·ω_i². The two-pass sparsify-refit structure mirrors compute_sdid_unit_weights. Returns ω on the standard simplex (caller composes ω_eff downstream).
- New helper `compute_time_weights_survey(Y_pre_control, Y_post_control, rw_control, ...)`: row-scales Y_time by sqrt(rw_control) and passes no reg_weights (uniform reg on λ — λ is per-period, rw is per-control, no alignment for per-λ weighting). Two-pass structure unchanged.
- Both new helpers expose `return_convergence=True`, returning the AND of the two pass convergence flags, mirroring the contract added in PR #351 c0d089b.

Tests (tests/test_weighted_fw.py — new, 15 tests):
- _sc_weight_fw weighted-reg path: reg_weights=None matches unweighted at bit-identity; uniform reg matches unweighted at rel=1e-12; Rust/numpy parity at rel=1e-9; simplex invariants under arbitrary rw; return_convergence tuple shape.
- compute_sdid_unit_weights_survey: uniform-rw equivalence to the unweighted helper, simplex invariants under arbitrary rw, shape-mismatch raises, return_convergence AND.
- compute_time_weights_survey: same coverage matrix, plus a zero-rw subset test (a Rao-Wu-style undrawn PSU yields valid simplex λ).
- Backend parity: pure-Python vs Rust weighted-helper output at rel=1e-7 for both unit and time helpers (monkeypatches HAS_RUST_BACKEND).

ABI preservation: existing unweighted callers of _sc_weight_fw, compute_sdid_unit_weights, and compute_time_weights are unaffected — the new kwarg defaults to None and dispatches to the legacy code path. The bit-identity check on TestScaleEquivariance::test_baseline_parity_small_scale[bootstrap] still passes at rel=1e-14 (verified in the next commit when the bootstrap integration lands).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
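The weighted-regularization FW step described in the Rust bullet can be sketched in NumPy. A hedged sketch of one iteration of the same derivation (names are illustrative, not the library API); `A` is the (T0, N) donor matrix, `b` the target, `eta = ζ²`, `reg_w` the per-donor regularization weights:

```python
import numpy as np

def fw_weighted_step(A, b, lam, eta, reg_w):
    """One Frank-Wolfe step on min ||A@lam - b||^2 + eta*sum(reg_w*lam^2)
    over the simplex, with exact line search."""
    resid = A @ lam - b
    half_grad = A.T @ resid + eta * reg_w * lam   # weighted ridge term
    i = int(np.argmin(half_grad))                 # best simplex vertex e_i
    d = -lam.copy()
    d[i] += 1.0                                   # direction d = e_i - lam
    Ad = A @ d
    # Step-size denominator: ||Ad||^2 + eta * diag(reg_w)-weighted ||d||^2
    # (the latter expands to sum(reg_w*lam^2) + reg_w[i] - 2*reg_w[i]*lam[i]).
    denom = Ad @ Ad + eta * (reg_w @ d**2)
    if denom <= 0:
        return lam
    step = np.clip(-(resid @ Ad + eta * (reg_w * lam) @ d) / denom, 0.0, 1.0)
    return lam + step * d
```

With `reg_w` all ones this collapses to the uniform-regularization step, which is the equivalence the `test_weighted_fw_uniform_reg_weights_matches_unweighted` test checks in Rust.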
…sition PR #352 restores the SDID survey-bootstrap capability that PR #351 front-door rejected as a known regression. Pweight-only and full-design surveys now both succeed; placebo / jackknife continue to reject full designs (a separate methodology gap tracked in TODO.md).

`diff_diff/synthetic_did.py::fit` (guards):
- Replace the unconditional strata/PSU/FPC NotImpl guard with a method-gated version that fires only for placebo / jackknife. Rationale + truth table in REGISTRY.md §SyntheticDiD survey-support matrix:

    method     pweight-only   strata/PSU/FPC
    bootstrap  ✓ (this PR)    ✓ Rao-Wu (this PR)
    placebo    ✓ unchanged    ✗ NotImpl (separate gap)
    jackknife  ✓ unchanged    ✗ NotImpl (separate gap)

- Delete the unconditional `bootstrap + any-survey` guard added in #351. Keep the `weight_type != "pweight"` validation (fweight/aweight still rejected).

`diff_diff/synthetic_did.py::fit` (survey resolution):
- After validating the per-unit survey weights (`w_treated`, `w_control`), also collapse the observation-level `resolved_survey` to a unit-level view via `collapse_survey_to_unit_level(...)` ordered as `[*control_units, *treated_units]`. The resulting `resolved_survey_unit` is what `_bootstrap_se` slices via `boot_rw[:n_control]` / `boot_rw[n_control:]` per Rao-Wu draw.

`diff_diff/synthetic_did.py::fit` (dispatcher):
- Branch the bootstrap call on whether the design is pweight-only or full design (strata/PSU/FPC). Pass `w_control`/`w_treated` for pweight-only, `resolved_survey=resolved_survey_unit` for full design, None/None for non-survey.

`diff_diff/synthetic_did.py::_bootstrap_se`:
- New kwargs: `w_control`, `w_treated`, `resolved_survey` (all keyword-only, default None — preserves the legacy signature).
- Single-PSU short-circuit: an unstratified survey with <2 PSUs returns (NaN, []) since the bootstrap distribution is unidentified (resampling one PSU yields the same subset every draw). Recovered from the pre-PR-#351 fixed-weight Rao-Wu branch (commit 91082e5).
- Per-draw Rao-Wu rescaling for full designs: ``rw = generate_rao_wu_weights(resolved_survey, rng)`` sliced over the resampled units. The pweight-only path uses ``rw = w_control[boot_idx]`` (constant per draw, no rescaling).
- Survey-weighted treated-unit means: ``np.average(..., weights=rw_treated_draw)`` when survey weights are present.
- Warm-start: the simplex init scales by rw before sum_normalize when on the survey path, matching the per-draw weighted-FW geometry.
- Per-draw FW dispatch: survey paths call the new ``compute_sdid_unit_weights_survey`` / ``compute_time_weights_survey`` helpers (PR #352 commit 1), which run the weighted-FW kernel; non-survey paths continue to call the unweighted helpers (bit-identity preserved on the non-survey refit path).
- Post-FW composition: ``ω_eff = rw·ω / Σ(rw·ω)`` for the SDID estimator (which expects simplex weights). Degenerate-retry if ``Σ(rw·ω) <= 0`` (all mass on rw=0 controls).
- Aggregate FW non-convergence warning: the tally is the AND of the two helpers' convergence flags per draw and fires above 5% (PR #351 c0d089b shape preserved, no copy change).

Tests:
- ``tests/test_survey_phase5.py``: rewrite three PR #351 raises-tests as succeeds-tests with explicit SE assertions —
  * ``test_full_design_bootstrap_succeeds`` (was ``_raises``): finite SE, populated survey_metadata.n_strata/n_psu, summary() includes Survey Design + Bootstrap replications blocks.
  * ``test_bootstrap_with_pweight_only_succeeds`` (was ``_raises``): finite SE, variance_method preserved (cross-surface guard).
  * New ``test_bootstrap_full_design_se_differs_from_pweight_only`` resurrects the PR #351 R3-deleted differs-from contract: ATT matches between paths (both compose ω_eff post-fit) but SE differs (Rao-Wu adds PSU clustering variance).
- ``tests/test_methodology_sdid.py::TestBootstrapSE``: rewrite two PR #351 raises-tests as succeeds-tests, plus add the ``test_bootstrap_single_psu_returns_nan`` short-circuit regression.

Verified: 308 tests pass across test_methodology_sdid / test_business_report SDID subset / test_rust_backend / test_survey_phase5 / test_weighted_fw / test_guides. Bit-identity check: the non-survey refit path goes through the unweighted helpers (no weighted-FW dispatch), so ``TestScaleEquivariance::test_baseline_parity_small_scale[bootstrap]`` remains at rel=1e-14 — verified passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
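The post-FW composition step described above can be sketched in isolation. An illustrative sketch (function name assumed; in the library the zero-sum case triggers a draw retry rather than a return value):

```python
import numpy as np

def compose_omega_eff(omega, rw_control):
    """Compose simplex FW output omega with per-draw survey weights rw:
    omega_eff = rw*omega renormalized to the simplex."""
    omega_eff = omega * rw_control
    total = omega_eff.sum()
    if total <= 0:
        return None  # degenerate: all FW mass landed on rw == 0 controls
    return omega_eff / total
```

The guard matters because FW sparsifies omega to exact zeros, so a positive total survey mass does not by itself guarantee a positive composed sum.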
Force-pushed from bdebecc to 1a20c16.
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good — no unmitigated P0/P1 findings.
Methodology: No unmitigated P0/P1 findings. The previous fit-time implicit-PSU
The SDID coverage MC artifact carried bare ``NaN`` tokens for the
``stratified_survey`` × ``placebo`` / ``jackknife`` cells (unsupported
by design — strata/PSU/FPC raises at fit-time). Python's ``json``
module tolerates those tokens on read, but strict JSON parsers reject
them, making the committed artifact non-strict.
``_summarize`` now returns ``None`` (serializes as ``null``) instead
of ``float('nan')`` on the all-failed branch, and ``json.dump`` is
called with ``allow_nan=False`` so any stray non-finite value fails
loudly instead of serializing as a bare ``NaN`` / ``Infinity`` token.
The committed artifact has been patched in place (bare ``NaN`` →
``null``) and strict-loader-verified; no regen needed since the
numeric content on the previously-NaN cells was definitionally
absent.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
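The strict-JSON discipline above is small enough to show directly. A sketch (function name illustrative) of the two halves: `None` instead of `float('nan')` on the all-failed branch, and `allow_nan=False` so stray non-finite values fail loudly:

```python
import json
import math

def summarize(values):
    """Mean over finite values; None (serialized as null) if none remain."""
    finite = [v for v in values if math.isfinite(v)]
    return sum(finite) / len(finite) if finite else None

payload = {"cell": summarize([float("nan"), float("nan")])}
text = json.dumps(payload, allow_nan=False)  # raises ValueError on a bare NaN
```

With the default `allow_nan=True`, Python would emit the non-standard `NaN` token that strict JSON parsers reject, which is exactly what the committed artifact was carrying.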
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
The R7 P1 positive-mass guard checks only the raw survey mass (``w_control.sum() > 0``). That is not sufficient: Frank-Wolfe sparsifies ``unit_weights`` to exact zeros by design, so even when at least one control has positive survey weight, the FW solution may concentrate all mass on controls whose survey weights are 0. The composed ``omega_eff = unit_weights * w_control`` then sums to 0 and the normalization step (``omega_eff / omega_eff.sum()``) hits 0/0, silently propagating NaN into the ATT, SE, and CI.

Front-door the case: after composing ``omega_eff``, raise ``ValueError`` before the normalization when ``omega_eff.sum() <= 0``. The analogous guards already exist in the bootstrap loop (``omega_scaled.sum() <= 0`` retry) and jackknife (``effective_control > 0`` support gate); this restores the contract at fit time.

Two regression tests cover both dispatch branches (pweight-only and strata/PSU/FPC). Both monkeypatch ``compute_sdid_unit_weights`` to return a canonical sparse unit-weight vector concentrated on a zero-survey-weight control — more reliable than wrestling with FW convergence dynamics on synthetic data. The ``fit()`` docstring ``Raises`` section is updated to list the new condition.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
``generate_survey_did_data`` is 1-indexed (prep_dgp.py L1211-L1212),
so ``n_periods=12`` with ``cohort_periods=[7]`` emits periods 1..12
with post = [7, 8, 9, 10, 11, 12]. The coverage harness'
``_stratified_survey_dgp`` returned ``list(range(7, 12))`` =
[7, 8, 9, 10, 11], silently dropping period 12 into the pre window.
SDID therefore fit the panel as 7-pre/5-post instead of the
documented 6-pre/6-post, and every rejection / mean SE cell in the
survey-bootstrap calibration row (plus the REGISTRY narrative
transcribed from it) was derived from the mis-specified window.
Fix: derive post_periods from ``df["period"].max()`` so any change
to ``n_periods`` propagates. Regression test
``test_stratified_survey_dgp_post_periods_cover_full_post_tail``
fails fast if a future refactor reintroduces the off-by-one (checks
unique / sorted / contiguous / max == df.period.max() plus the
explicit [7, 8, 9, 10, 11, 12] shape).
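The fix above can be sketched in a few lines; `post_periods_for` is an illustrative name (the real helper reads the maximum from `df["period"]`):

```python
def post_periods_for(periods, cohort):
    # Post window = [cohort, max period], derived from the data itself
    # so a change to n_periods propagates automatically. The buggy
    # variant hard-coded list(range(7, 12)) and silently dropped the
    # final period into the pre window.
    return list(range(cohort, max(periods) + 1))
```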
Regenerated only the stratified_survey block and spliced it into
the main artifact (other DGPs unaffected — their seeds and DGP code
are unchanged). New rejection rates at α = {0.01, 0.05, 0.10}:
{0.024, 0.058, 0.094}; mean SE / true SD drops from 1.25 to 1.13.
Rejection at α=0.05 remains well inside the calibration gate
[0.02, 0.10]. REGISTRY table row and narrative updated to match.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA: ✅ Looks good — the prior P1 on the
The R13 P1 regression test ``test_stratified_survey_dgp_post_periods_cover_full_post_tail`` imports ``benchmarks.python.coverage_sdid`` directly to exercise the private ``_stratified_survey_dgp`` helper. CI's isolated-install job deliberately copies only ``tests/``, not ``benchmarks/``, so the module import failed with ``ModuleNotFoundError`` on CI runners that install the package into a fresh site-packages and then run the test suite against that install. The target is a benchmarking harness helper, not shipped package code, so the natural home is ``benchmarks/python/``. Moving it there keeps the test runnable locally (developer invokes explicitly before regenerating the coverage MC artifact) and out of CI's collection (``pyproject.toml testpaths = ["tests"]`` scopes default discovery to ``tests/`` only, so the new file never interferes). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rebased onto current main (resolved CHANGELOG.md conflict: the by_path bullet from PR #355 and the profile_panel/autonomous-guide bullet from this PR now live side-by-side under [Unreleased]).
has_always_treated now has binary-only semantics:
- For binary treatment (absorbing or non-absorbing): unit_min == 1 means the unit is treated in every observed period (no pre-treatment information in the DiD sense).
- For continuous treatment: always False. Pre-treatment periods in continuous DiD are determined by the separate `first_treat` column supplied to `ContinuousDiD.fit`, not by whether the dose is positive. A unit with a constant positive dose can still have well-defined pre-treatment periods, so flagging it as "always-treated / no pre-treatment information" was factually wrong and triggered the misleading `has_always_treated_units` alert on valid continuous panels.
- Categorical: False by construction.
Guide §2 has_always_treated field doc updated to state the binary-only semantics explicitly, with a note about `first_treat`.
Tests:
- New: test_continuous_positive_dose_does_not_fire_has_always_treated asserts has_always_treated=False AND the alert does not fire on a constant-positive-dose continuous panel.
- Existing test_continuous_zero_dose_controls_flag_has_never_treated updated: has_always_treated expected to be False (was True under the old semantics).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rebased onto current main (17 commits clean — PR #355, #358, #359 all merged since last rebase).
StaggeredTripleDifference corrected as panel-only + balance-enforced. The earlier §4.10 RCS wording paired TripleDifference / StaggeredTripleDifference together in the Explicit RCS support list, but REGISTRY.md §StaggeredTripleDifference requires a balanced panel and staggered_triple_diff.py:93-109 has no panel=False mode — fit() rejects unbalanced/duplicate (unit, time) structure at staggered_triple_diff.py:846-864.
- §4.10 Explicit RCS support: TripleDifference (two-period) only; StaggeredTripleDifference removed from the supported set.
- §4.10 Explicitly rejected for RCS: StaggeredTripleDifference added with a concrete "no panel=False mode" + "use TripleDifference for cross-sectional DDD" pointer.
- §3 Balanced-panel eligibility: StaggeredTripleDifference added to the balance-sensitive gate.
Regression tests extended:
- Balanced-panel proximity check now covers StaggeredTripleDifference.
- §4.10 section test asserts StaggeredTripleDifference appears in the Explicitly rejected block and NOT in the Explicit RCS support block.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
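A sketch of the kind of balance gate described above, under assumed inputs (hypothetical helper name and argument shape; the real check lives in `staggered_triple_diff.py`'s `fit()`):

```python
from collections import Counter

def check_balanced_panel(unit_time_pairs):
    # Reject duplicate (unit, time) rows and unbalanced panels:
    # every unit must be observed exactly once in every period.
    counts = Counter(unit_time_pairs)
    if any(c > 1 for c in counts.values()):
        raise ValueError("duplicate (unit, time) rows")
    units = {u for u, _ in counts}
    times = {t for _, t in counts}
    if len(counts) != len(units) * len(times):
        raise ValueError("unbalanced panel: missing (unit, time) cells")
```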
Closes the last SDID survey gap (TODO.md row 107). PR #355 restored variance_method="bootstrap" for strata/PSU/FPC via hybrid pairs-bootstrap + Rao-Wu + weighted-FW. This commit extends the same full-design capability to variance_method="placebo" and "jackknife".
Placebo allocator — stratified permutation (Pesarin 2001). Pseudo-treated indices drawn within each stratum containing actual treated units; weighted-FW re-estimates ω and λ per draw with per-control survey weights threaded into both loss and regularization (reuses compute_sdid_unit_weights_survey + compute_time_weights_survey from PR #355). New private method _placebo_variance_se_survey. Fit-time front-door guards (per feedback_front_door_over_retry_swallow.md) distinguish two infeasible permutation configurations with targeted ValueError messages: Case B (stratum with treated units has zero controls) and Case C (stratum with treated units has fewer controls than treated). Partial-permutation fallback rejected — it silently changes the null-distribution semantics.
Jackknife allocator — PSU-level leave-one-out with stratum aggregation (Rust & Rao 1996). SE² = Σ_h (1-f_h)·(n_h-1)/n_h · Σ_{j∈h} (τ̂_{(h,j)} - τ̄_h)². FPC form: f_h = n_h_sampled / fpc[h] (population-count form from survey.py::SurveyDesign.resolve; confirmed via survey.py:338-356, where fpc_h < n_psu_h is the validation constraint). λ held fixed across LOOs; ω subset + rw-composed-renormalized (matches Arkhangelsky Algorithm 3 non-survey semantics — jackknife is variance-approximation, not refit-variance). Strata with n_h < 2 skip silently; total-zero-variance → NaN + UserWarning. Unstratified designs with PSU treated as single-stratum JK1. New private method _jackknife_se_survey.
Gate relaxation — deletes the placebo+jackknife+strata/PSU/FPC raise at synthetic_did.py:352-369. Replicate-weight gate at L329-337 unchanged (separate methodology; closed-form replicate variance double-counts with Rao-Wu-like rescaling).
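The Case B / Case C feasibility guards can be sketched under assumed data structures (function name and argument shapes are illustrative, not the PR's actual signatures):

```python
def check_placebo_feasibility(strata, is_treated):
    # Stratified-permutation placebo needs, in every stratum that
    # contains treated units, at least as many controls as treated;
    # otherwise the permutation is structurally infeasible.
    for h in sorted(set(strata)):
        idx = [i for i, s in enumerate(strata) if s == h]
        n_treated = sum(1 for i in idx if is_treated[i])
        n_control = len(idx) - n_treated
        if n_treated == 0:
            continue  # stratum contributes no pseudo-treated draws
        if n_control == 0:
            raise ValueError(f"Case B: stratum {h!r} has treated units but no controls")
        if n_control < n_treated:
            raise ValueError(f"Case C: stratum {h!r} has fewer controls than treated units")
```

Raising at fit time with these targeted messages, rather than falling back to a partial permutation, keeps the null-distribution semantics intact.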
fit() dispatcher adds _placebo_use_survey_path / _jackknife_use_survey_path flags routing to the new methods when appropriate; non-survey and pweight-only paths are bit-identical by construction (guarded by the same branch-isolation pattern used in PR #355 _bootstrap_se).
Allocator asymmetry — placebo ignores the PSU axis; jackknife respects it. Intentional: placebo is a null-distribution test (stratified unit-level permutation is classical — PSU-level permutation on few PSUs is near-degenerate), while jackknife is a design-based variance approximation (PSU-level LOO is canonical per Rust & Rao). Both respect strata. Rationale documented in method docstrings and REGISTRY (follow-up commit).
Tests — tests/test_survey_phase5.py:
- TestSyntheticDiDSurvey: flip test_full_design_placebo_raises and test_full_design_jackknife_raises from NotImplementedError→succeeds; assert finite SE > 0, populated survey_metadata, .summary() round-trip.
- TestSDIDSurveyPlaceboFullDesign (new class): pseudo-treated-stays-within-treated-strata (monkeypatched recorder), Case B / Case C front-door guards (targeted ValueError match), se-differs-from-pweight-only, deterministic dispatch.
- TestSDIDSurveyJackknifeFullDesign (new class): stratum-aggregation self-consistency, fpc-reduces-se magnitude (SE_fpc = SE_nofpc/sqrt(2) at f=0.5, rtol=1e-10), se-differs-from-pweight-only, single-PSU-stratum silently skipped, unstratified short-circuit, all-strata-skipped UserWarning + NaN, deterministic dispatch.
Non-survey and pweight-only regressions — all 32 tests in TestBootstrapSE + TestPlaceboSE + TestJackknifeSE pass unchanged; bit-identity preserved by the new-path-gating pattern.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
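The stratum-aggregated jackknife SE formula above can be written out directly. Names are illustrative; `f_by_stratum` carries the f_h = n_h_sampled / fpc[h] sampling fractions:

```python
import math

def stratified_jackknife_se(tau_loo_by_stratum, f_by_stratum):
    # SE^2 = sum_h (1 - f_h) * (n_h - 1)/n_h * sum_{j in h} (tau_(h,j) - taubar_h)^2
    var = 0.0
    for h, taus in tau_loo_by_stratum.items():
        n_h = len(taus)
        if n_h < 2:
            continue  # single-PSU stratum: skipped silently, as above
        tau_bar = sum(taus) / n_h
        f_h = f_by_stratum.get(h, 0.0)
        var += (1.0 - f_h) * (n_h - 1) / n_h * sum((t - tau_bar) ** 2 for t in taus)
    return math.sqrt(var)
```

At f_h = 0.5 the (1 - f_h) factor halves the variance, which is the SE_fpc = SE_nofpc/sqrt(2) relationship the test class asserts.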
…placebo, jackknife)
Second commit for the SDID survey-placebo/jackknife PR. Extends the
coverage Monte Carlo artifact with jackknife on the stratified_survey
DGP (bootstrap calibration unchanged); promotes the deferred REGISTRY
§SyntheticDiD gap bullets to two landed Notes; updates user-facing
docs to reflect restored capability.
Coverage MC changes
-------------------
* benchmarks/python/coverage_sdid.py: _stratified_survey_design now
returns ("bootstrap", "jackknife") on the methods tuple. Placebo is
omitted because the DGP's cohort packs into a single stratum with 0
never-treated units — stratified-permutation placebo is structurally
infeasible on this DGP (raises Case C at fit-time). Module docstring
explains the exclusion and the jackknife anti-conservatism caveat.
* benchmarks/data/sdid_coverage.json: regenerated stratified_survey
block at n_seeds=500, n_bootstrap=200. Bootstrap validates near-
nominal (α=0.05 rejection = 0.058, SE/trueSD = 1.13). Jackknife row
reports α=0.05 rejection = 0.45, SE/trueSD = 0.46 — documented anti-
conservatism from the stratified jackknife formula with 2 PSUs per
stratum (1 effective DoF per stratum, Rust & Rao 1996 limitation).
REGISTRY.md §SyntheticDiD
-------------------------
* Survey support matrix updated: all three variance methods now
support strata/PSU/FPC (not just bootstrap).
* Two new landed Notes:
- "Note (survey + placebo composition)": stratified-permutation
allocator, weighted-FW refit, ω_eff composition, fit-time
feasibility guards (Case B / Case C), scope note on what is NOT
randomized (within-stratum PSU axis). Cites Pesarin (2001) /
Pesarin & Salmaso (2010).
- "Note (survey + jackknife composition)": PSU-level LOO algorithm,
explicit stratum-aggregation SE² formula, FPC handling (population-
count form from survey.py:338-356), fixed-weights rationale,
degenerate-LOO skip semantics, scope note, known anti-conservatism
with few PSUs per stratum. Cites Rust & Rao (1996).
* "Allocator asymmetry" paragraph in the survey support matrix
documents the intentional asymmetry (placebo ignores PSU, jackknife
respects it) with rationale rooted in each method's role (null-
distribution test vs design-based variance approximation).
* Coverage MC table adds the stratified_survey × jackknife row with
anti-conservatism narrative; placebo row explicitly marked N/A-on-
this-DGP (with pointer to the unit-test coverage).
* Requirements checklist entries updated to describe full-design
support for placebo and jackknife.
Docs sweep
----------
* docs/methodology/survey-theory.md: new bullets describing the
stratified-permutation placebo allocator and the PSU-level LOO
jackknife, parallel to the existing hybrid-bootstrap bullet.
* docs/tutorials/16_survey_did.ipynb cell 35: support matrix SDID
row updated from "bootstrap only (PR #352)" to "Full (all three
variance methods)"; legend amended; "Note on SyntheticDiD" block
rewritten to describe all three allocators with the jackknife
few-PSU caveat.
* docs/survey-roadmap.md: Phase 5 matrix row closes the placebo/
jackknife gap; Phase 6 bullet updated to describe all three
allocators; Current Limitations table entry removed (only replicate-
weight limitation remains, merged into one row).
* CHANGELOG.md: "### Added" entry for placebo + jackknife full-design
support (no new section header — folded into existing Unreleased
block); "### Changed (PR #355)" tweaked to note the separate
follow-up for placebo/jackknife.
* TODO.md row 107 deleted (capability gap closed).
* diff_diff/synthetic_did.py __init__ docstring: survey_design
parameter description rewritten to describe all three methods.
Placebo fallback-guidance comment updated to remove stale "placebo
and jackknife reject strata/PSU/FPC" line.
* diff_diff/guides/llms-full.txt: Phase 5 bootstrap bullet updated
to describe all three survey allocators (UTF-8 fingerprint
preserved — `D'Haultfœuille` still appears throughout).
* tests/test_methodology_sdid.py::TestCoverageMCArtifact: narrative
and assertions updated to reflect that placebo=0-fits is expected
structurally on stratified_survey (documented Case C), while
jackknife now runs successfully with the known anti-conservatism
caveat intentionally unasserted at the calibration-gate level.
Verification
------------
* pytest tests/test_survey_phase5.py::TestSDIDSurveyPlaceboFullDesign
tests/test_survey_phase5.py::TestSDIDSurveyJackknifeFullDesign
tests/test_survey_phase5.py::TestSyntheticDiDSurvey
tests/test_methodology_sdid.py::{TestBootstrapSE,TestPlaceboSE,TestJackknifeSE,TestCoverageMCArtifact}
tests/test_guides.py → 82 passed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…trap/jackknife only
P1 (Methodology — implicit-PSU FPC validator leaked into placebo): PR #355 R8 P1 added a fit-time validator that rejects ``psu=None`` + ``fpc < n_units`` designs, because Rao-Wu bootstrap treats each unit as its own PSU and would fail mid-draw, with the bootstrap loop swallowing the error as a generic exhaustion message. The validator ran unconditionally on every survey fit. After R8 documented FPC as a placebo no-op (Pesarin 2001 §1.5 — permutation tests condition on the observed sample), this validator became inconsistent: a placebo fit with low FPC and no explicit ``psu`` would still raise a "FPC must be ≥ n_units" error for a constraint that doesn't apply to the placebo math.
Fix: gate the implicit-PSU FPC validator on ``self.variance_method in ("bootstrap", "jackknife")``. Both methods genuinely consume FPC (Rao-Wu rescaling for bootstrap, the Rust & Rao ``(1 - f_h)`` factor for jackknife). Placebo proceeds to the documented no-op warning path regardless of FPC value.
New regression ``test_placebo_low_fpc_no_psu_warns_no_validator_block``: sets ``fpc_col = 5`` (well below n_units=30) with no PSU. Asserts (a) the placebo fit succeeds, (b) it emits the documented FPC-no-op ``UserWarning``, (c) SE matches the no-FPC pweight-only fit at ``rel=1e-12``, AND (d) bootstrap on the same low-FPC design still raises the validator error (gating preserves bootstrap/jackknife behavior — only placebo's FPC contract changes).
Verification: 97 passed (1 new low-FPC placebo regression; existing bootstrap/jackknife FPC validation regressions still fire on their fixtures).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
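A minimal sketch of the gated validator (hypothetical free-function form; in the PR it is a method consulting `self.variance_method`):

```python
def validate_implicit_psu_fpc(variance_method, psu, fpc, n_units):
    # Only bootstrap (Rao-Wu rescaling) and jackknife (the (1 - f_h)
    # factor) consume FPC. Placebo conditions on the observed sample,
    # so the implicit-PSU constraint must not fire there.
    if variance_method not in ("bootstrap", "jackknife"):
        return
    if psu is None and fpc is not None and fpc < n_units:
        raise ValueError(
            "With psu=None each unit is its own PSU; fpc must be >= n_units"
        )
```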
Summary
- New weighted Rust kernel `sc_weight_fw_weighted` (+ `_with_convergence` sibling) accepting per-coordinate `reg_weights`, plus a matching pure-Python fallback. New Python helpers `compute_sdid_unit_weights_survey` / `compute_time_weights_survey` thread the weighted objective through the two-pass sparsify-refit dispatcher.
- `SyntheticDiD._bootstrap_se` reintroduces the Rao-Wu branch: per-draw `rw = w_control[boot_idx]` for pweight-only, or `rw = generate_rao_wu_weights(resolved_survey_unit, rng)` for full design, composed with the weighted-FW helpers; post-FW, `ω_eff = rw·ω / Σ(rw·ω)` feeds `compute_sdid_estimator`.
- Deletes the `bootstrap × any-survey` and unconditional `strata/PSU/FPC × any-method` guards; replaces them with a method-gated version that still rejects strata/PSU/FPC for placebo / jackknife (separate methodology gap tracked in TODO.md).
- Extends the coverage MC with the `stratified_survey` DGP; regenerated `benchmarks/data/sdid_coverage.json` — the new row's α=0.05 rejection is 0.042 (inside the [0.02, 0.10] calibration gate), `mean SE / true SD = 1.25` (slightly conservative, the safer direction).
Methodology references (required if estimator / math changes)
The weighted FW objective `min ||A·diag(rw)·ω − b||² + ζ²·Σ rw_i·ω_i²` returns ω on the simplex, and downstream composes `ω_eff = rw·ω / Σ(rw·ω)`. This is NOT the same as directly minimizing the standard SDID loss on `ω_eff` — the argmin sets differ because of the non-constant scaling factor in the ω ↔ ω_eff reparameterization. Intentional design mirroring the spirit of the pre-PR #351 ("Add SyntheticDiD variance_method='bootstrap_refit' and coverage MC study") Rao-Wu composition, but with ω re-estimated per draw under the weighted objective (so weight-estimation uncertainty propagates). See `REGISTRY.md §SyntheticDiD Note (survey + bootstrap composition)` for the full derivation.
Validation
- `tests/test_weighted_fw.py` (new; 15 tests covering the Rust kernel + Python wrappers + survey helpers + backend parity).
- `tests/test_methodology_sdid.py::TestBootstrapSE` (two PR #351 raises-tests rewritten as succeeds-tests, plus a new `test_bootstrap_single_psu_returns_nan` short-circuit regression, plus the coverage-MC schema updated to include the `stratified_survey` row with an explicit calibration-gate assertion).
- `tests/test_survey_phase5.py::TestSyntheticDiDSurvey` (three PR #351 raises-tests rewritten as succeeds-tests with explicit SE assertions, plus a new `test_bootstrap_full_design_se_differs_from_pweight_only` contract).
- Regenerated `benchmarks/data/sdid_coverage.json` (500 seeds × B=200, 40 min wall-clock on an M-series Mac with the Rust backend). Stratified-survey bootstrap rejection @ α=0.05 = 0.042 (inside the [0.02, 0.10] gate).
Security / privacy
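The weighted objective and the ω_eff composition from the methodology note can be written out as a small pure-Python sketch (illustrative only; the real solver is the Frank-Wolfe kernel, not a loss evaluator):

```python
def weighted_fw_loss(omega, A, b, rw, zeta):
    # ||A . diag(rw) . omega - b||^2 + zeta^2 * sum_i rw_i * omega_i^2,
    # with omega constrained to the simplex by the FW solver.
    n = len(omega)
    resid = [sum(A[r][i] * rw[i] * omega[i] for i in range(n)) - b[r]
             for r in range(len(b))]
    return sum(x * x for x in resid) + zeta ** 2 * sum(
        rw[i] * omega[i] ** 2 for i in range(n)
    )

def compose_omega_eff(omega, rw):
    # omega_eff = rw . omega / sum(rw . omega): the reparameterization
    # composed downstream; minimizing the weighted loss in omega is NOT
    # the same as minimizing the standard SDID loss in omega_eff.
    raw = [r * w for r, w in zip(rw, omega)]
    total = sum(raw)
    return [x / total for x in raw]
```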
🤖 Generated with Claude Code