HAD Phase 3: pre-test diagnostics (qug_test, stute_test, yatchew_hr_test, composite workflow) #352
Conversation
…est, workflow)

Ships paper Section 4 (de Chaisemartin et al. 2026, arXiv:2405.04465v6) pre-test diagnostics for the HeterogeneousAdoptionDiD estimator in a new module `diff_diff/had_pretests.py`:

- `qug_test()` — order-statistic ratio test of H_0: d_lower = 0 (Theorem 4). Closed-form T = D_(1)/(D_(2)-D_(1)); asymptotic p-value 1/(1+T) under the Exp(1)/Exp(1) limit law; reject if T > 1/alpha - 1. Zero-filter with warning; tie-break returns all-NaN inference.
- `stute_test()` — Cramer-von Mises cusum test with Mammen wild bootstrap (Appendix D). Statistic S = (1/G^2) sum(cumsum^2); literal per-iteration OLS refit matching the paper's Appendix D algorithm. n_bootstrap >= 99 front-door validated; G < 10 returns NaN; G > 100_000 warns and points to yatchew_hr_test.
- `yatchew_hr_test()` — heteroskedasticity-robust variance-ratio linearity test (Theorem 7 / Equation 29). sigma2_diff normalizes by 2G paper-literal (NOT 2(G-1)); tight hand-computed parity at atol=1e-12.
- `did_had_pretest_workflow()` — composite that runs all three on a two-period panel and returns HADPretestReport with a priority-ordered verdict string.

Deferred to Phase 3 follow-up patches:
- Joint Equation 18 cross-horizon Stute pre-trend test (paper formula needs PDF extraction; tracked in TODO.md)
- Appendix D vectorized matrix form for the Stute bootstrap (functionally identical, ~2x faster; shipped literal-refit to match the paper text)
- Tight R-package numerical parity (bootstrap seed semantics differ across numpy/R)

47 tests across 9 classes cover rejection rates, closed-form parity, tie-breaks, NaN propagation, JSON serialization, seed reproducibility, sample-size gates, and verdict logic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
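For orientation, a minimal sketch of the Theorem 4 closed form described above (illustrative only, not the library's `qug_test`; the real function also returns a result object and handles warnings and sample-size gates):

```python
import numpy as np

def qug_statistic_sketch(d, alpha=0.05):
    """Order-statistic ratio test sketch: T = D_(1) / (D_(2) - D_(1))."""
    d_nz = np.asarray(d, dtype=float)
    d_nz = d_nz[d_nz > 0]                      # zero-filter (the real code warns)
    if d_nz.size < 2:
        return np.nan, np.nan, False           # sample-size gate
    d1, d2 = np.partition(d_nz, 1)[:2]         # two smallest positive doses, d1 <= d2
    if d1 == d2:
        return np.nan, np.nan, False           # tie-break: all-NaN inference
    T = d1 / (d2 - d1)
    p_value = 1.0 / (1.0 + T)                  # Exp(1)/Exp(1) limit law
    reject = T > 1.0 / alpha - 1.0             # same event as p_value < alpha
    return float(T), float(p_value), bool(reject)
```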
AI review found that did_had_pretest_workflow's fail-to-reject verdict
("TWFE safe under Section 4 assumptions") was a methodology overclaim:
Phase 3 ships steps 1 + 3 of the paper's four-step workflow but not
step 2 (Assumption 7 pre-trends test via Equation 18), so the verdict
could mislead users into believing the full paper-prescribed workflow
had passed.
Fix:
- _compose_verdict fail-to-reject branch now returns
"QUG and linearity diagnostics fail-to-reject; Assumption 7
pre-trends test NOT run (paper step 2 deferred to Phase 3 follow-up)"
- HADPretestReport docstring adds a prominent partial-workflow caveat
on the verdict and all_pass fields.
- did_had_pretest_workflow docstring renames intent from "run the
paper's workflow" to "run a PARTIAL workflow (steps 1 and 3 only)".
- Tests: renamed + updated to assert the Assumption 7 caveat appears
in the verdict and that "TWFE safe" does NOT appear.
- REGISTRY.md updated to document the partial-workflow semantic and
revised verdict string in both the Note block and checkbox row.
Also addresses P2 from the review: replaced positional `None` in the
_aggregate_first_difference call with keyword argument cluster_col=None
plus an inline comment naming the intent.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…string + string concat

Re-review came back ✅ after the P1 fix; remaining findings were all P3 documentation/cleanup:
- Module docstring now explicitly frames Phase 3 as shipping steps 1 + 3 of the paper's four-step workflow (not "all three fail-to-reject -> TWFE valid", which could read as unconditional certification).
- `_compose_verdict` docstring synchronized with the partial-workflow verdict text (previously said "TWFE safe under Section 4 assumptions"; now says "QUG and linearity diagnostics fail-to-reject; Assumption 7 pre-trends test NOT run").
- Collapsed an adjacent-string-literal in the `yatchew_hr_test` sigma4_W <= 0 warning into a single string.

No behavioral changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Overall Assessment: Three unmitigated P1 issues in the new HAD pretest path would make me block approval.
Maintainability: No new maintainability findings beyond the API-contract issues above.
Security: No findings.
P1 #1 — Stute tie-safe CvM: The paper defines c_G(d) = Σ 1{D ≤ d}·eps with c_G(D_g) evaluated AT each observation's dose, so tied observations share the post-tie cumulative sum. My naive cumsum over sorted residuals produced partial within-tie sums that were row-order-dependent. Fix: after the cumsum, replace within-tie-block values with the block's last cumulative sum via np.unique + np.repeat. `_cvm_statistic` now accepts `d_sorted` and collapses tie blocks before squaring. Regression test `test_cvm_statistic_tie_safe_order_invariance` pins order-invariance on duplicate doses at atol=1e-14; `test_stute_order_invariance_with_duplicate_doses` validates the end-to-end stute_test contract.

P1 #2 — Exact-linear fit must fail-to-reject (not return NaN): For dy = a + b·d exactly, Assumption 8 holds exactly and the correct outcome is p=1, reject=False. My previous var(eps) <= 0 check routed this to NaN. Fix: dropped the var(eps) degeneracy branch from stute_test (the bootstrap naturally produces p=1 when eps=0 exactly). Added a scale-relative short-circuit (sum(eps²) ≤ 1e-24 · sum(dy²)) in both stute_test and yatchew_hr_test so FP noise (eps ~ 1e-16 from IEEE arithmetic on dy = 1 + 2*d) doesn't defeat the short-circuit by producing non-zero but tiny OLS residuals. Yatchew exact-linear now returns (t_stat_hr=-inf, p=1, reject=False) rather than NaN. Regressions: TestStuteTest.test_exact_linear_returns_p1_not_nan, TestYatchewHRTest.test_exact_linear_returns_p1_not_nan.

P1 #3 — HADPretestReport.all_pass contract: Previously `all_pass = not (reject or reject or reject)` could be True while `verdict` said "inconclusive - X NaN". Fix: gate all_pass on every constituent p-value being finite AND no test rejecting. Updated docstring. Regression: TestCompositeWorkflow.test_all_pass_false_when_any_test_nan.

P2 #1 — QUG negative-dose guard: HAD doses must be non-negative (paper Section 2). The raw qug_test API was silently folding d < 0 rows into the n_excluded_zero counter (the filter was `d > 0`). Fix: front-door ValueError on any d < 0. Regression: TestQUGTest.test_negative_dose_raises.

P3 #1 — QUG np.partition: REGISTRY claims O(G) via np.partition; the code was using np.sort. Switched qug_test to np.partition(d_nz, 1), which guarantees partitioned[0] ≤ partitioned[1] = D_{(2)}, i.e., partitioned[0] = D_{(1)}. Tight closed-form parity at atol=1e-12 still holds.

P3 #2 — REGISTRY n_bootstrap default: REGISTRY said "Default n_bootstrap = 499" but the code ships 999. Updated REGISTRY to match the code and added a note about the n_bootstrap >= 99 front-door validation.

Test count: 47 -> 53.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
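A minimal sketch of the tie-block collapse described in P1 #1 above (illustrative; names and signature are assumptions, not the library's `_cvm_statistic`):

```python
import numpy as np

def cvm_statistic_tie_safe_sketch(d_sorted, eps_sorted):
    """Tie-safe CvM statistic S = (1/G^2) * sum(c^2), where c is the cumulative
    residual sum evaluated AT each dose, so every observation in a block of
    tied doses shares the block's final cumulative sum."""
    G = d_sorted.size
    c = np.cumsum(eps_sorted)
    # Tie blocks are contiguous because d_sorted is sorted ascending.
    _, first_idx, counts = np.unique(d_sorted, return_index=True, return_counts=True)
    last_idx = first_idx + counts - 1
    c_tie_safe = np.repeat(c[last_idx], counts)   # collapse each block to its last cumsum
    return float(np.sum(c_tie_safe ** 2)) / G ** 2
```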
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: Blocker. This re-review is narrower than the previous one: the earlier P1s on Stute tie handling, exact-linear …
Code Quality: No separate findings beyond the methodology issues above.
Maintainability: No separate findings.
Security: No findings.
P0 — Exact-linear shortcut not translation-invariant: R1's scale-relative shortcut compared sum(eps^2) against sum(dy^2), but sum(dy^2) is inflated by additive shifts in dy (e.g. dy = 1e12 + 2*d + noise gives sum(dy^2) ~ 5e25, so the threshold becomes ~50, which is easily tripped by ordinary OLS SSE on unit-scale noise). This silently forced p=1.0 on genuinely noisy data. Fix: compare against the CENTERED total sum of squares sum((dy - dybar)^2); since SSE = (1 - R^2) * TSS, the comparison reduces to a threshold on 1 - R^2 and is translation-invariant by construction (the mean subtraction absorbs any additive shift). Applied to both stute_test and yatchew_hr_test. The module-level constant docstring is updated to explain why centering is required. Regressions: TestStuteTest.test_exact_linear_shortcut_does_not_fire_on_shifted_noisy_data, TestYatchewHRTest.test_exact_linear_shortcut_does_not_fire_on_shifted_noisy_data. Exact numerical equivalence under dy -> dy + 1e12 is not attainable (FP cancellation at 1e12 scale costs ~4 digits), but the SHORTCUT behavior and decision MUST match — the tests assert both.

P1 — Yatchew missing tie-handling guard: R1's tie-safe CvM fixed Stute, but yatchew_hr_test still silently computed adjacent differences on stable-sorted duplicates. Under ties, within-tie row ordering changes sigma2_diff and sigma4_W, so the statistic depends on arbitrary input row order rather than the data. Fix: front-door UserWarning + all-NaN result when `np.unique(d).size < G`. Users with tied doses (mass-point designs, discretised registers) are directed to stute_test, whose tie-safe CvM handles ties correctly. This also subsumes the constant-d case (all tied) without needing a separate check. Regressions: TestYatchewHRTest.test_duplicate_doses_return_nan, TestYatchewHRTest.test_constant_d_returns_nan.

P2 — REGISTRY documentation: Added two Note blocks to the HAD section:
- **Note (Phase 3 Yatchew tie policy)**: justifies the NaN-on-duplicate policy and directs users to Stute for tied-dose designs.
- **Note (Phase 3 exact-linear short-circuit)**: documents the translation-invariance requirement and the centered-TSS scaling.

Test count: 53 -> 57.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
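A sketch of the duplicate-dose guard described in the P1 above (hypothetical standalone helper; the shipped code folds this check into `yatchew_hr_test` and returns an all-NaN result object):

```python
import warnings
import numpy as np

def has_tied_doses(d):
    """Return True (after warning) when adjacent-difference variance estimators
    would be order-dependent, i.e. when any doses are tied."""
    d = np.asarray(d, dtype=float)
    if np.unique(d).size < d.size:
        warnings.warn(
            "yatchew_hr_test: tied doses detected; the adjacent-difference "
            "statistic is not order-invariant under ties. Use stute_test instead.",
            UserWarning,
        )
        return True
    return False
```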
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ⛔ Blocker. Focused re-review of the PR diff. The prior blockers from the last AI review look addressed: the exact-linear shortcuts are now translation-invariant in diff_diff/had_pretests.py:L904-L920 and diff_diff/had_pretests.py:L1068-L1092, and the Yatchew duplicate/constant-dose guard is present in diff_diff/had_pretests.py:L1032-L1063 …
P0 — Yatchew sigma4_W=0 branch was not sign-aware: R2's safety branch unconditionally returned (t_stat_hr=-inf, p=1, reject=False) whenever sigma4_W <= 0, which is mathematically wrong for non-exact-linear inputs where the adjacent-residual-product sum happens to vanish. Reviewer's counterexample: d=[1,2,3,4,5], dy=[1,0,-2,0,1]. OLS slope=0, residuals=dy itself. sigma2_lin=1.2, sigma2_diff=1.0, sigma4_W=0 (each adjacent residual product has a zero factor). The formal statistic is sqrt(5)*(1.2-1.0)/0 = +inf against the one-sided critical value, so the correct decision is reject=True. Previously the branch mapped this to p=1, silently flipping the rejection.

Fix: after the exact-linear short-circuit has already caught the residuals-are-zero case, the sigma4_W=0 branch is now sign-aware on the numerator:
- numerator > 0 (sigma2_lin > sigma2_diff) -> T_hr=+inf, p=0, reject=True
- numerator < 0 -> T_hr=-inf, p=1, reject=False
- numerator = 0 -> T_hr=NaN, p=NaN, reject=False (genuinely indeterminate)

All three branches now emit a UserWarning so users see the degenerate denominator rather than silently consuming an infinite statistic. Regression: TestYatchewHRTest.test_sigma4_W_zero_with_positive_numerator_rejects pins the reviewer's counterexample: asserts t_stat_hr=+inf, p=0, reject=True (NOT p=1, reject=False), with sigma2_lin=1.2 and sigma2_diff=1.0 preserved for inspection.

P3 — yatchew_hr_test docstring stale on tie policy: R2 added the duplicate-dose NaN guard in code and REGISTRY, but the public docstring still said "stable sort preserves input order among tied d values." Updated the "Notes" section to describe: (a) duplicate-dose rejection with a redirect to stute_test, (b) the translation-invariant exact-linear short-circuit, and (c) the sign-aware sigma4_W=0 branch.

Test count: 57 -> 58.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
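A sketch of the sign-aware degenerate-denominator branch described above (hypothetical standalone helper; the statistic's exact normalization is omitted and the exact-linear short-circuit is assumed to have run first):

```python
import numpy as np

def sigma4w_zero_branch_sketch(sigma2_lin, sigma2_diff):
    """Limit of the one-sided HR statistic as the sigma4_W denominator -> 0+.
    Returns (t_stat_hr, p_value, reject)."""
    numerator = sigma2_lin - sigma2_diff
    if numerator > 0:
        return np.inf, 0.0, True       # statistic diverges to +inf: reject
    if numerator < 0:
        return -np.inf, 1.0, False     # diverges to -inf: fail to reject
    return np.nan, np.nan, False       # 0/0: genuinely indeterminate

# Reviewer's counterexample above: sigma2_lin=1.2, sigma2_diff=1.0 -> reject.
assert sigma4w_zero_branch_sketch(1.2, 1.0) == (np.inf, 0.0, True)
```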
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: Looks good. The prior re-review blocker appears resolved: the exact-linear guards are now translation-invariant, … Static review only: this environment is missing …
R4 verdict was "Looks good" overall — no P0/P1 blockers remain. This fixes the P3 + P2 proactively to avoid tied-dose workflow confusion.

R4 P3 / P2 — Workflow must follow the paper's "Stute OR Yatchew" step 3: R3's fix made yatchew_hr_test return NaN on duplicate doses (per paper methodology). But the composite workflow then forced "inconclusive" on any NaN child result, so common QUG-style panels with repeated d=0 (the reviewer cites an application with 12 zero-dose units) were classified too conservatively. Per paper Section 4.2-4.3 step 3, linearity can be tested with Stute OR Yatchew; a conclusive Stute alone suffices.

Fix:
- _compose_verdict now treats "QUG NaN" and "BOTH Stute + Yatchew NaN" as the only inconclusive cases. A conclusive Stute + NaN Yatchew (or vice versa) proceeds to the fail-to-reject / rejection logic with a "(Yatchew NaN - skipped)" suffix on the fail-to-reject verdict.
- all_pass gating: requires QUG conclusive AND at least ONE of Stute / Yatchew conclusive, plus no conclusive test rejects. Follows the paper's "OR" wording.
- HADPretestReport docstring updated to reflect the paper-faithful all_pass semantic.
- REGISTRY Yatchew tie-policy Note extended with a paragraph describing the workflow fallback.

Tests:
- Split TestComposeVerdictLogic.test_nan_in_any_test_inconclusive into test_qug_nan_is_inconclusive + test_both_linearity_nan_is_inconclusive, plus two new tests covering "Yatchew NaN alone does not force inconclusive" and "Stute NaN alone does not force inconclusive".
- New TestCompositeWorkflow.test_workflow_handles_tied_zero_doses_via_stute_fallback regression: 60-unit panel with 20 never-treated (d=0 ties) plus 40 treated with continuous doses. Yatchew NaN, Stute + QUG conclusive; the composite verdict must be conclusive (not "inconclusive") with a "Yatchew NaN - skipped" note and all_pass=True.

Test count: 58 -> 62.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: Blocker. The prior re-review findings around tied-dose fallback, sign-aware …
Methodology: No other unmitigated methodology findings in the diff. The tied-dose workflow fallback, sign-aware infinite-limit handling, and conclusive-only …
Code Quality: No findings.
Maintainability: No findings beyond the documentation drift noted below.
Security: No findings.
P0 — Exact-linear shortcut not scale-invariant:
R2's fix changed the denominator from sum(dy^2) to centered TSS, which
made the guard TRANSLATION-invariant. But the code used
`max(centered_TSS, 1.0)` as a defensive floor, which BROKE scale
invariance: multiplying dy by a small constant (e.g. dy * 1e-12) made
centered_TSS ~ 1e-24 but the floor pinned the comparison's denominator at 1.0
(collapsing the relative threshold into an absolute one), firing
the shortcut on noisy data that should NOT trigger it.
Fix: remove the `max(..., 1.0)` floor and handle the `centered_TSS == 0`
case (constant dy, trivially linear) in a dedicated branch above the
relative comparison. The remaining path is purely `eps^2 <= tol *
dy_centered^2`, which is both translation- AND scale-invariant by
construction (R^2 is dimensionless).
Applied to both `stute_test` and `yatchew_hr_test`. Module-level
constant docstring updated to document the invariance and the
zero-TSS split.
Regressions:
- TestStuteTest.test_exact_linear_shortcut_does_not_fire_on_rescaled_noisy_data
- TestYatchewHRTest.test_exact_linear_shortcut_does_not_fire_on_rescaled_noisy_data
Both multiply a noisy `dy` by 1e-12 and assert the shortcut does NOT
fire and decisions match the unscaled input.
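A sketch of the short-circuit in its final form (assumed shape only; the constant name and function are illustrative, not the module's actual identifiers):

```python
import numpy as np

EXACT_LINEAR_RTOL = 1e-24   # relative tolerance quoted in the R1 fix above

def exact_linear_shortcut_fires(d, dy, rtol=EXACT_LINEAR_RTOL):
    """True when dy is (numerically) an exact linear function of d."""
    d = np.asarray(d, dtype=float)
    dy = np.asarray(dy, dtype=float)
    X = np.column_stack([np.ones_like(d), d])
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    sse = float(np.sum((dy - X @ beta) ** 2))
    tss_centered = float(np.sum((dy - dy.mean()) ** 2))
    if tss_centered == 0.0:
        return True                       # constant dy: trivially linear
    # Pure relative comparison (equivalent to 1 - R^2 <= rtol): translation-
    # and scale-invariant by construction, with no absolute floor.
    return sse <= rtol * tss_centered
```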
P3 — Stale verdict docs:
R4's Stute-OR-Yatchew linearity-step fallback changed `_compose_verdict`
semantics, but the `HADPretestReport.verdict` docstring still listed
"Any NaN p-value -> inconclusive" as rule 1, and the REGISTRY Phase 3
checkbox for the workflow still said "`inconclusive - {tests} NaN`
when any statistic is NaN".
Fix: both docs updated to reflect the paper-faithful semantics:
inconclusive only when QUG is NaN or when BOTH Stute and Yatchew are
NaN; otherwise the fail-to-reject verdict carries a
"(Yatchew NaN - skipped)" (or Stute) suffix when one linearity test
was NaN but the other was conclusive. REGISTRY note also documents
the `all_pass` gating (QUG conclusive AND at least one of
Stute/Yatchew conclusive AND no conclusive test rejects).
Test count: 62 -> 64.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: Static review only: this environment does not have …
R6 P1 #1 — _compose_verdict hides conclusive rejections behind "inconclusive": The R4 logic returned "inconclusive - QUG NaN" or "inconclusive - both Stute and Yatchew linearity tests NaN" BEFORE checking whether any conclusive test had rejected. The reviewer's example: G=2 with QUG rejecting at alpha=0.05 and Stute/Yatchew NaN by sample-size gates — the workflow emitted "inconclusive - both linearity NaN", hiding a real assumption failure. The paper's rule is one-way: TWFE is admissible only if NO test rejects. A conclusive rejection therefore dominates unresolved-step notes.

Fix: reorder _compose_verdict:
1. Collect rejections from conclusive tests first. If any, that is the primary verdict, and unresolved-step notes are APPENDED via "; additional steps unresolved: ..." rather than replacing the rejection.
2. Only when NO conclusive rejection exists AND a required step is unresolved do we return a pure "inconclusive - ..." verdict.
3. Otherwise fall through to the partial-workflow fail-to-reject verdict (with a "(Yatchew NaN - skipped)" suffix if applicable).

Regressions:
- TestComposeVerdictLogic.test_qug_reject_with_both_linearity_nan_surfaces_rejection
- TestComposeVerdictLogic.test_linearity_reject_with_qug_nan_surfaces_rejection
- TestComposeVerdictLogic.test_all_three_reject_with_qug_nan_keeps_conclusive_rejections

R6 P1 #2 — Raw stute_test / yatchew_hr_test accept negative doses: qug_test and _validate_had_panel both front-door-reject d < 0 (paper Section 2 HAD support restriction), but the new linearity helpers only validated shape + NaN. Negative doses are outside the method's stated scope and could silently produce conclusive-looking output. Fix: mirror the negative-dose guard. Both stute_test and yatchew_hr_test now raise ValueError on any d < 0 with a message directing users to pre-process or check the dose column. Docstrings updated to list the new contract in the Raises section.

Regressions:
- TestNegativeDoseGuardsOnLinearityTests.test_stute_negative_dose_raises
- TestNegativeDoseGuardsOnLinearityTests.test_yatchew_negative_dose_raises

R6 P2 — Docstrings / REGISTRY sync: HADPretestReport.verdict docstring rewritten to describe the new "rejection-first, unresolved-suffix" priority. REGISTRY Phase 3 workflow checkbox updated to document the conclusive-rejection-not-hidden semantics plus the non-negative-dose contract.

Test count: 64 -> 69.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
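A condensed sketch of the rejection-first priority described above (hypothetical helper mirroring the commit's description of `_compose_verdict`, not its actual code):

```python
import numpy as np

def compose_verdict_sketch(qug, stute, yatchew):
    """Each argument is a (p_value, reject) pair; a NaN p-value means that
    test was inconclusive (sample-size gate, ties, ...)."""
    def conclusive(t):
        return np.isfinite(t[0])

    named = [("QUG", qug), ("Stute", stute), ("Yatchew", yatchew)]
    rejections = [name for name, t in named if conclusive(t) and t[1]]

    unresolved = []
    if not conclusive(qug):
        unresolved.append("QUG NaN")
    if not (conclusive(stute) or conclusive(yatchew)):
        unresolved.append("both Stute and Yatchew linearity tests NaN")

    # 1. Conclusive rejections dominate; unresolved steps only annotate.
    if rejections:
        verdict = "reject: " + ", ".join(rejections)
        if unresolved:
            verdict += "; additional steps unresolved: " + ", ".join(unresolved)
        return verdict
    # 2. No rejection but a required step unresolved -> inconclusive.
    if unresolved:
        return "inconclusive - " + ", ".join(unresolved)
    # 3. Partial-workflow fail-to-reject, with a skipped-test suffix if needed.
    suffix = ""
    if not conclusive(yatchew):
        suffix = " (Yatchew NaN - skipped)"
    elif not conclusive(stute):
        suffix = " (Stute NaN - skipped)"
    return ("QUG and linearity diagnostics fail-to-reject; "
            "Assumption 7 pre-trends test NOT run" + suffix)
```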
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment: ✅ Looks good. Static review only: this environment lacks …
Methodology: Previous methodology P1s are addressed in code, tests, and registry text. Remaining methodology items are documented P3s rather than blockers.
Code Quality: No findings.
Maintainability: No findings.
Security: No findings.
Documentation/Tests: No findings. The re-review gaps are now covered by direct regression tests and matching registry prose. Locations: tests/test_had_pretests.py:L175-L189, tests/test_had_pretests.py:L597-L668, tests/test_had_pretests.py:L920-L1006, docs/methodology/REGISTRY.md:L2343-L2351.
Fixes two P1 issues flagged by the CI reviewer on the initial submission of PR #352.

P1 Methodology — `compute_time_weights_survey` was documented as solving the WLS-style weighted λ objective

    min Σ_u rw_u·(Σ_t λ_t·Y_u,pre[t] - Y_u,post_mean)² + ζ²·||λ||²

but row-scaled Y by sqrt(rw) and then handed the scaled matrix to `_sc_weight_fw(intercept=True)`, whose column-centering uses an UNWEIGHTED mean across controls. That is not the weighted objective once rw varies, so non-uniform survey bootstrap draws were refitting λ on the wrong objective and could bias the reported SE. Fix: weighted-center `Y_time` BEFORE the sqrt(rw) row-scaling, using `col_weighted_mean = (Y_time * rw).sum(0) / rw.sum()`, and pass `intercept=False` to the kernel so no additional unweighted centering happens on the scaled matrix. Both two-pass calls updated. `compute_sdid_unit_weights_survey` is unchanged — its column-centering is PER-UNIT (time means within each control column), which is independent of rw.

P1 Code Quality — `SurveyDesign(weights=None, strata=..., psu=...)` is a valid configuration (`SurveyDesign.resolve()` synthesizes ones when weights is None), but `_extract_unit_survey_weights` indexed `survey_design.weights` as if it were always a column name, so the groupby would fail with a KeyError before the bootstrap branch could run. Fix: `_extract_unit_survey_weights` now short-circuits to a vector of ones of length `len(unit_order)` when `survey_design.weights is None`, matching `SurveyDesign.resolve()`'s semantics.

Regression tests:
- `test_non_uniform_rw_beats_unweighted_centering_variant` (test_weighted_fw.py): reproduces the pre-fix buggy variant (row-scale Y by sqrt(rw), then call `_sc_weight_fw(intercept=True)`) and asserts the fixed path's weighted SSR is strictly ≤ the buggy variant's weighted SSR. If a future revert reintroduces intercept=True after the row-scaling, this test fails.
- `test_bootstrap_full_design_without_explicit_weights` (test_methodology_sdid.py): `SurveyDesign(strata=..., psu=...)` with no explicit `weights` column now succeeds on the bootstrap path; survey_metadata populated with n_strata / n_psu.

P3 Documentation:
- `SyntheticDiD.fit()` docstring (survey_design parameter + Raises block): replace the "bootstrap rejects all survey designs" language with the PR #352 support-matrix truth table (bootstrap ✓ for both pweight-only and full design; placebo/jackknife ✓ pweight-only, ✗ full design).
- `_placebo_variance_se` fallback-guidance messages (two sites): drop the "strata/PSU/FPC not yet supported by any SDID variance method" framing; recommend bootstrap for the full-design survey fallback, jackknife for pweight-only, and adding controls as the universal fallback.
- `docs/survey-roadmap.md` Current Limitations table: collapse the two SDID bootstrap-rejection rows into a single row for placebo/jackknife + full design (the bootstrap + full design row no longer applies).

Verified: 75 targeted tests pass (test_weighted_fw + TestBootstrapSE + TestScaleEquivariance + TestCoverageMCArtifact + test_survey_phase5).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
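A sketch of the corrected ordering (weighted centering, then sqrt(rw) row-scaling) under assumed shapes (Y_time with one row per control unit, rw with one weight per control); illustrative only, not the library's helper:

```python
import numpy as np

def weighted_time_design_sketch(Y_time, rw):
    """Weighted-center the columns of Y_time BEFORE sqrt(rw) row-scaling, so a
    downstream kernel called with intercept=False solves the weighted objective
    instead of mixing weighted rows with an unweighted column mean."""
    Y_time = np.asarray(Y_time, dtype=float)      # (n_control, n_cols)
    rw = np.asarray(rw, dtype=float)              # (n_control,)
    col_weighted_mean = (Y_time * rw[:, None]).sum(axis=0) / rw.sum()
    Y_centered = Y_time - col_weighted_mean       # weighted centering first
    return np.sqrt(rw)[:, None] * Y_centered      # then sqrt(rw) row-scaling
```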
…path

CI review flagged a silent-failure P0: when a bootstrap draw's treated units all had survey weight 0 under a pweight-only survey design, the code fell through to ``np.mean(Y_boot_pre_t, axis=1)`` (an unweighted mean). That silently dropped survey weighting for that draw while the fit-time ATT uses the survey-weighted treated mean — corrupting the bootstrap distribution used for SE, p-value, and CI. Reachable with any `SurveyDesign(weights=...)` input where at least one treated unit has pweight 0 (a valid configuration per ``SurveyDesign.resolve()``'s spec) and the bootstrap resample picks only those units.

Fix: extend the Rao-Wu degenerate-retry in ``_bootstrap_se`` to the pweight-only branch as well. When ANY survey path is active and ``rw_treated_draw.sum() == 0`` OR ``rw_control_draw.sum() == 0``, the draw is retried. The pre-fix behavior (unweighted treated-mean fallback) is now unreachable from any code path.

Regression: ``test_bootstrap_pweight_only_retries_zero_treated_mass_draws`` in ``tests/test_methodology_sdid.py::TestBootstrapSE``. Constructs a panel with 3 treated units where one has `wt=0`; bootstrap at B=100 still returns finite SE > 0. The test would fail if the code reverted to the unweighted-mean fallback because zero-mass draws would either produce degenerate SDiD estimators (NaN τ → filtered out by the finite-τ gate, possibly leaving no valid draws) or produce systematically different τ values that inflate the SE.

P3 (same review) also addressed:
- ``diff_diff/synthetic_did.py`` replicate-weight rejection message (line ~303) now reflects the PR #352 support matrix instead of the pre-PR-#352 "only pweight-only placebo/jackknife" framing. Analytical survey designs are now fully supported per the truth table; only replicate-weight variance remains unsupported.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lice

Two P3s from R3; the PR was already ✅ Looks good — these are close-out polish.

P3 docs/tests — secondary surfaces described the full-design path as a "Rao-Wu rescaled bootstrap", but only REGISTRY.md surfaced the material caveat that SDID still uses unit-level pairs-bootstrap resampling (``boot_idx = rng.choice(n_total)``) and then applies Rao-Wu rescaled weights on top — a hybrid composition, not a standalone Rao-Wu bootstrap. Update survey-theory.md (splits SunAbraham/TROP's standalone Rao-Wu bullet from SDID's hybrid bullet) and CHANGELOG.md's PR #352 Added entry to use the hybrid-composition wording mirroring REGISTRY.

P3 tests — the methodology-critical ``boot_idx`` × ``generate_rao_wu_weights`` interaction was only guarded by the slow coverage MC. Add ``test_bootstrap_full_design_rao_wu_boot_idx_slice`` (in ``TestBootstrapSE``), which monkeypatches ``generate_rao_wu_weights`` to return a known vector of distinct per-unit values (``arange(1, n_total+1)``), captures the ``rw_control_draw`` vectors fed into the weighted FW helper via a capturing wrapper on ``compute_sdid_unit_weights_survey``, and asserts every captured vector lies within ``known_rw[:n_control]`` (positions 1..n_control). This catches two bug classes:
- slice-order regression: if someone swaps rw-then-slice for slice-then-rw, the captured vectors would include values from the treated slice ``known_rw[n_control:]`` and the assertion fires.
- rw-drift regression: if the Rao-Wu call site bypasses ``generate_rao_wu_weights`` (e.g., a refactor silently uses the pweight-only branch for full-design fits), the captured vector would be the user's w_control (all 1.0 in this test) instead of the known Rao-Wu output.

Verified: 294 targeted tests pass across test_methodology_sdid / test_survey_phase5 / test_weighted_fw / test_guides / test_rust_backend.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three places in REGISTRY.md still described the pre-PR-#352 behavior where SDID survey bootstrap raised NotImplementedError, contradicting the new support matrix and survey-composition note landed earlier in this PR:
- §SyntheticDiD bootstrap bullet (paper-faithful refit text): the "Composed with any survey design ... raises NotImplementedError" trailer is updated to describe weighted-FW dispatch under the survey + bootstrap composition note.
- §SyntheticDiD requirements checklist bullet for "Bootstrap: paper-faithful Algorithm 2 step 2": the "Survey designs raise NotImplementedError" trailer is updated to describe the hybrid pairs-bootstrap + Rao-Wu rescaling composition.
- §Rao-Wu Rescaled Bootstrap "intentionally excluded" note: rewritten to add SDID to the list via its hybrid composition, with the precise distinction from standalone Rao-Wu (SunAbraham / TROP) and a pointer to §SyntheticDiD for the objective form and argmin-set caveat.

The remaining NotImplementedError mentions in REGISTRY.md §SyntheticDiD (L1554-L1555) correctly describe the placebo / jackknife + strata/PSU/FPC methodology gap that is out of scope for this PR and tracked in TODO.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Foundation for restoring SDID survey-bootstrap support (PR #352, follow-up to #351, which front-door rejected all survey designs). This commit adds the weighted-FW kernel + Python wrappers; the bootstrap integration lands in the next commit.

Rust (rust/src/weights.rs, rust/src/lib.rs):
- New `sc_weight_fw_gram_weighted` and `sc_weight_fw_standard_weighted` loop variants. Identical to the unweighted loops except for the regularization term: `half_grad[j]` picks up `eta*reg_w[j]*lam[j]` in place of `eta*lam[j]`, and the FW step-size denominator uses the diag(reg_w)-weighted simplex direction norm `Σ_j reg_w[j]*d[j]²` (which simplifies to `Σ_j reg_w[j]*lam[j]² + reg_w[i] - 2*reg_w[i]*lam[i]` for d = e_i - lam).
- New `sc_weight_fw_weighted_internal` dispatcher that delegates to the unweighted internal when reg_weights is None (preserves the legacy numeric contract for any future caller that wants the generic shape).
- Two new pyfunctions: `sc_weight_fw_weighted` and `sc_weight_fw_weighted_with_convergence`. Same call shape as the existing unweighted siblings plus a trailing `reg_weights` kwarg. Registered in lib.rs.
- 3 new Rust unit tests in rust/src/weights.rs:
  * test_weighted_fw_reg_weights_none_delegates — bit-identity at rel=1e-14 against the unweighted internal.
  * test_weighted_fw_uniform_reg_weights_matches_unweighted — uniform rw=1 collapses to uniform regularization (rel=1e-12, allowing for ULP-scale drift from different float reduction orders).
  * test_weighted_fw_simplex_invariants — for arbitrary positive rw and both gram (T0<N) and standard (T0>=N) paths, the returned ω sums to 1 and is non-negative.

Python (diff_diff/utils.py, diff_diff/_backend.py):
- Export _rust_sc_weight_fw_weighted and _with_convergence from _backend (mirrors the shape added for _rust_sc_weight_fw_with_convergence in PR #351 c0d089b).
- Extend `_sc_weight_fw` and `_sc_weight_fw_numpy` with a `reg_weights: Optional[np.ndarray] = None` kwarg. When set on the Rust path, dispatches to the new weighted pyfunctions; on the pure-Python path, runs a weighted FW loop mirroring the Rust derivation.
- New helper `compute_sdid_unit_weights_survey(Y_pre_control, Y_pre_treated_mean, rw_control, ...)`: column-scales Y_pre_control by rw_control and passes rw_control as reg_weights so the FW solves the unit-weight survey-bootstrap objective

      min_{ω simplex} Σ_t (Σ_i rw_i·ω_i·Y_i,pre[t] - treated_pre[t])² + ζ²·Σ_i rw_i·ω_i²

  The two-pass sparsify-refit structure mirrors compute_sdid_unit_weights. Returns ω on the standard simplex (the caller composes ω_eff downstream).
- New helper `compute_time_weights_survey(Y_pre_control, Y_post_control, rw_control, ...)`: row-scales Y_time by sqrt(rw_control) and passes no reg_weights (uniform reg on λ — λ is per-period, rw is per-control, so there is no alignment for per-λ weighting). Two-pass structure unchanged.
- Both new helpers expose `return_convergence=True`, returning the AND of the two passes' convergence flags, mirroring the contract added in PR #351 c0d089b.

Tests (tests/test_weighted_fw.py — new, 15 tests):
- _sc_weight_fw weighted-reg path: reg_weights=None matches unweighted at bit-identity; uniform reg matches unweighted at rel=1e-12; Rust/numpy parity at rel=1e-9; simplex invariants under arbitrary rw; return_convergence tuple shape.
- compute_sdid_unit_weights_survey: uniform-rw equivalence to the unweighted helper, simplex invariants under arbitrary rw, shape-mismatch raises, return_convergence AND.
- compute_time_weights_survey: same coverage matrix, plus a zero-rw subset test (a Rao-Wu-style undrawn PSU yields a valid simplex λ).
- Backend parity: pure-Python vs Rust weighted-helper output at rel=1e-7 for both unit and time helpers (monkeypatches HAS_RUST_BACKEND).

ABI preservation: existing unweighted callers of _sc_weight_fw, compute_sdid_unit_weights, compute_time_weights are unaffected — the new kwarg defaults to None and dispatches to the legacy code path. The bit-identity check on TestScaleEquivariance::test_baseline_parity_small_scale[bootstrap] still passes at rel=1e-14 (verified in the next commit when the bootstrap integration lands).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
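A quick numerical check of the step-size identity quoted in the Rust bullet above (plain numpy, independent of the actual kernel code):

```python
import numpy as np

# For the Frank-Wolfe direction d = e_i - lam, the diag(reg_w)-weighted norm
# sum_j reg_w[j]*d[j]^2 equals sum_j reg_w[j]*lam[j]^2 + reg_w[i] - 2*reg_w[i]*lam[i].
rng = np.random.default_rng(0)
n, i = 6, 3
lam = rng.dirichlet(np.ones(n))            # a point on the simplex
reg_w = rng.uniform(0.5, 2.0, size=n)      # arbitrary positive reg weights
d = -lam.copy()
d[i] += 1.0                                # d = e_i - lam
lhs = np.sum(reg_w * d ** 2)
rhs = np.sum(reg_w * lam ** 2) + reg_w[i] - 2.0 * reg_w[i] * lam[i]
assert np.isclose(lhs, rhs, rtol=1e-12)
```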
…sition

PR #352 restores the SDID survey-bootstrap capability that PR #351 front-door rejected as a known regression. Pweight-only and full-design surveys now both succeed; placebo / jackknife continue to reject full designs (a separate methodology gap tracked in TODO.md).

`diff_diff/synthetic_did.py::fit` (guards):
- Replace the unconditional strata/PSU/FPC NotImpl guard with a method-gated version that fires only for placebo / jackknife. Rationale + truth table in REGISTRY.md §SyntheticDiD survey-support matrix:

      method      pweight-only    strata/PSU/FPC
      bootstrap   ✓ (this PR)     ✓ Rao-Wu (this PR)
      placebo     ✓ unchanged     ✗ NotImpl (separate gap)
      jackknife   ✓ unchanged     ✗ NotImpl (separate gap)

- Delete the unconditional `bootstrap + any-survey` guard added in #351. Keep the `weight_type != "pweight"` validation (fweight/aweight still rejected).

`diff_diff/synthetic_did.py::fit` (survey resolution):
- After validating the per-unit survey weights (`w_treated`, `w_control`), also collapse the observation-level `resolved_survey` to a unit-level view via `collapse_survey_to_unit_level(...)` ordered as `[*control_units, *treated_units]`. The resulting `resolved_survey_unit` is what `_bootstrap_se` slices via `boot_rw[:n_control]` / `boot_rw[n_control:]` per Rao-Wu draw.

`diff_diff/synthetic_did.py::fit` (dispatcher):
- Branch the bootstrap call on whether the design is pweight-only or full design (strata/PSU/FPC). Pass `w_control`/`w_treated` for pweight-only, `resolved_survey=resolved_survey_unit` for full design, None/None for non-survey.

`diff_diff/synthetic_did.py::_bootstrap_se`:
- New kwargs: `w_control`, `w_treated`, `resolved_survey` (all keyword-only, default None — preserves the legacy signature).
- Single-PSU short-circuit: an unstratified survey with <2 PSUs returns (NaN, []) since the bootstrap distribution is unidentified (resampling one PSU yields the same subset every draw). Recovered from the pre-PR-#351 fixed-weight Rao-Wu branch (commit 91082e5).
- Per-draw Rao-Wu rescaling for full designs: ``rw = generate_rao_wu_weights(resolved_survey, rng)`` sliced over the resampled units. The pweight-only path uses ``rw = w_control[boot_idx]`` (constant per draw, no rescaling).
- Survey-weighted treated-unit means: ``np.average(..., weights=rw_treated_draw)`` when survey weights are present.
- Warm-start: the simplex init scales by rw before sum_normalize when on the survey path, matching the per-draw weighted-FW geometry.
- Per-draw FW dispatch: survey paths call the new ``compute_sdid_unit_weights_survey`` / ``compute_time_weights_survey`` helpers (PR #352 commit 1), which run the weighted-FW kernel; non-survey paths continue to call the unweighted helpers (bit-identity preserved on the non-survey refit path).
- Post-FW composition: ``ω_eff = rw·ω / Σ(rw·ω)`` for the SDID estimator (which expects simplex weights). Degenerate-retry if ``Σ(rw·ω) <= 0`` (all mass on rw=0 controls).
- Aggregate FW non-convergence warning: the tally is the AND of the two helpers' convergence flags per draw, and fires above 5% (PR #351 c0d089b shape preserved, no copy change).

Tests:
- ``tests/test_survey_phase5.py``: rewrite three PR #351 raises-tests as succeeds-tests with explicit SE assertions —
  * ``test_full_design_bootstrap_succeeds`` (was ``_raises``): finite SE, populated survey_metadata.n_strata/n_psu, summary() includes Survey Design + Bootstrap replications blocks.
  * ``test_bootstrap_with_pweight_only_succeeds`` (was ``_raises``): finite SE, variance_method preserved (cross-surface guard).
  * New ``test_bootstrap_full_design_se_differs_from_pweight_only`` resurrects the PR #351 R3-deleted differs-from contract: ATT matches between paths (both compose ω_eff post-fit) but SE differs (Rao-Wu adds PSU clustering variance).
- ``tests/test_methodology_sdid.py::TestBootstrapSE``: rewrite two PR #351 raises-tests as succeeds-tests, plus add the ``test_bootstrap_single_psu_returns_nan`` short-circuit regression.

Verified: 308 tests pass across test_methodology_sdid / test_business_report SDID subset / test_rust_backend / test_survey_phase5 / test_weighted_fw / test_guides. Bit-identity check: the non-survey refit path goes through the unweighted helpers (no weighted-FW dispatch), so ``TestScaleEquivariance::test_baseline_parity_small_scale[bootstrap]`` remains at rel=1e-14 — verified passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
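A sketch of the per-draw composition and degenerate-retry signal described above (hypothetical helper; the real logic lives inside `_bootstrap_se`):

```python
import numpy as np

def compose_omega_eff_sketch(omega, rw):
    """omega: simplex weights from the weighted-FW solve; rw: the draw's
    Rao-Wu (or pweight) factors. Returns omega_eff = rw*omega / sum(rw*omega),
    or None to signal that the draw is degenerate and should be retried."""
    mass = np.asarray(rw, dtype=float) * np.asarray(omega, dtype=float)
    total = mass.sum()
    if total <= 0.0:                 # all mass on rw == 0 controls
        return None
    return mass / total
```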
Capstone of PR #352. Validates the new weighted-FW + Rao-Wu bootstrap composition and propagates the landed capability across the documentation surfaces.

Coverage MC harness (benchmarks/python/coverage_sdid.py):
- Add ``stratified_survey`` as a 4th DGP in ``ALL_DGPS``. Uses ``generate_survey_did_data`` to produce an N=40 (strata=2, PSU=2/stratum) null-treatment panel with moderate weight variation and modest ICC (``psu_re_sd=1.5``). Cohort 7 → post = 7..11 (5 post periods). Converts per-observation ``treated`` to a unit-level ever-treated indicator (SDID's block-treatment requirement).
- Extend ``DGPSpec`` with an optional ``survey_design_factory`` callable that returns ``(SurveyDesign, supported_methods_tuple)``. For ``stratified_survey``: bootstrap only — placebo / jackknife reject strata/PSU/FPC at fit-time, so the harness skips them rather than catching the NotImplementedError inside ``_fit_one``.
- ``_fit_one`` gains an optional ``survey_design`` kwarg routed through ``SyntheticDiD.fit(survey_design=)``. ``_run_dgp`` calls the factory once per seed (DataFrame contents don't affect columns) and gates methods on the supported set.

Regenerated ``benchmarks/data/sdid_coverage.json`` via ``python benchmarks/python/coverage_sdid.py --n-seeds 500 --n-bootstrap 200``. Total wall-clock 2421 s (~40 min on an M-series Mac, Rust backend); aer63 remains the long tail at 2237 s, while stratified_survey adds only 33 s.

Calibration gate (plan §2.7): ``stratified_survey × bootstrap`` at α=0.05 returns 0.042 (500 seeds × B=200), inside the calibration band [0.02, 0.10]. ``mean SE / true SD = 1.25`` indicates the bootstrap is slightly conservative (overestimates the empirical sampling SD by ~25%) — the safer direction under Rao-Wu rescaling with only 4 PSUs total. Validates the weighted-FW + Rao-Wu composition end-to-end.

REGISTRY.md §SyntheticDiD:
- Add a ``stratified_survey`` row to the coverage MC table and a paragraph under it documenting the calibration verdict, the conservatism direction, and why the placebo/jackknife rows are NaN.
- Replace the survey-support bullet with a truth-table matrix (PR #352 shape); add a ``Note (survey + bootstrap composition)`` documenting the weighted-FW objective (unit and time forms), the ω_eff composition, the argmin-set caveat, the per-draw rw dispatch (pweight-only vs Rao-Wu), and the single-PSU short-circuit.
- Update the ``Note (default variance_method deviation from R)`` to drop the "bootstrap rejects surveys" framing (no longer accurate).
- Update the ``Note (coverage Monte Carlo calibration)`` header to say "4 representative null-panel DGPs" and flag stratified_survey as bootstrap-only.

User-facing docs:
- ``docs/methodology/survey-theory.md``: restore SDID in the Rao-Wu Rescaled Bootstrap list; describe the weighted-FW composition.
- ``docs/survey-roadmap.md``: Phase 5 SDID row updated to reflect full-design bootstrap support via PR #352; Phase 6 Rao-Wu bullet restores SDID.
- ``docs/tutorials/16_survey_did.ipynb`` cell 35: the support-matrix table row for SyntheticDiD switches from "pweight only (placebo/jackknife)" to "bootstrap only (PR #352) for strata/PSU/FPC"; the "Note on SyntheticDiD" block is rewritten for the landed contract.
- ``diff_diff/synthetic_did.py`` ``__init__`` docstring: the bootstrap bullet now describes survey support and the ω_eff composition.
- ``diff_diff/guides/llms-full.txt``: the survey-aware bootstrap bullet includes SDID in the Rao-Wu list with the weighted-FW formula.

CHANGELOG.md:
- Retain the PR #351 regression Changed entry but annotate it as "restored in PR #352"; add new Added/Changed PR #352 entries documenting the weighted-FW kernel, the survey helpers, the _bootstrap_se Rao-Wu composition, and the new coverage MC row.

TODO.md:
- Row 103 (SDID + survey designs) → closed by PR #352; replaced with a narrower follow-up for placebo/jackknife + strata/PSU/FPC (Low priority, no concrete sketch yet).

Tests:
- ``TestCoverageMCArtifact`` extended: 4 DGPs asserted (including ``stratified_survey``); new explicit assertions that the stratified_survey bootstrap row has ≥100 successful fits and the α=0.05 rejection ∈ [0.02, 0.10]; placebo/jackknife rows have n_successful_fits == 0 (strata/PSU/FPC rejection contract).

Verified: TestCoverageMCArtifact passes against the regenerated artifact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
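An illustrative sketch of the factory-based method gating described above (names are hypothetical; the real harness lives in benchmarks/python/coverage_sdid.py):

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

@dataclass
class DGPSpecSketch:
    name: str
    # Returns (survey design object, tuple of supported variance methods).
    survey_design_factory: Optional[Callable[[], Tuple[object, Tuple[str, ...]]]] = None

def methods_to_run(spec, all_methods=("bootstrap", "placebo", "jackknife")):
    """Skip unsupported variance methods up front instead of catching
    NotImplementedError inside the per-fit worker."""
    if spec.survey_design_factory is None:
        return None, list(all_methods)        # non-survey DGP: run everything
    design, supported = spec.survey_design_factory()
    return design, [m for m in all_methods if m in supported]
```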
…placebo, jackknife)
Second commit for the SDID survey-placebo/jackknife PR. Extends the
coverage Monte Carlo artifact with jackknife on the stratified_survey
DGP (bootstrap calibration unchanged); promotes the deferred REGISTRY
§SyntheticDiD gap bullets to two landed Notes; updates user-facing
docs to reflect restored capability.
Coverage MC changes
-------------------
* benchmarks/python/coverage_sdid.py: _stratified_survey_design now
returns ("bootstrap", "jackknife") on the methods tuple. Placebo is
omitted because the DGP's cohort packs into a single stratum with 0
never-treated units — stratified-permutation placebo is structurally
infeasible on this DGP (raises Case C at fit-time). Module docstring
explains the exclusion and the jackknife anti-conservatism caveat.
* benchmarks/data/sdid_coverage.json: regenerated stratified_survey
block at n_seeds=500, n_bootstrap=200. Bootstrap validates near-
nominal (α=0.05 rejection = 0.058, SE/trueSD = 1.13). Jackknife row
reports α=0.05 rejection = 0.45, SE/trueSD = 0.46 — documented anti-
conservatism from the stratified jackknife formula with 2 PSUs per
stratum (1 effective DoF per stratum, Rust & Rao 1996 limitation).
REGISTRY.md §SyntheticDiD
-------------------------
* Survey support matrix updated: all three variance methods now
support strata/PSU/FPC (not just bootstrap).
* Two new landed Notes:
- "Note (survey + placebo composition)": stratified-permutation
allocator, weighted-FW refit, ω_eff composition, fit-time
feasibility guards (Case B / Case C), scope note on what is NOT
randomized (within-stratum PSU axis). Cites Pesarin (2001) /
Pesarin & Salmaso (2010).
- "Note (survey + jackknife composition)": PSU-level LOO algorithm,
explicit stratum-aggregation SE² formula, FPC handling (population-
count form from survey.py:338-356), fixed-weights rationale,
degenerate-LOO skip semantics, scope note, known anti-conservatism
with few PSUs per stratum. Cites Rust & Rao (1996).
* "Allocator asymmetry" paragraph in the survey support matrix
documents the intentional asymmetry (placebo ignores PSU, jackknife
respects it) with rationale rooted in each method's role (null-
distribution test vs design-based variance approximation).
* Coverage MC table adds the stratified_survey × jackknife row with
anti-conservatism narrative; placebo row explicitly marked N/A-on-
this-DGP (with pointer to the unit-test coverage).
* Requirements checklist entries updated to describe full-design
support for placebo and jackknife.
Docs sweep
----------
* docs/methodology/survey-theory.md: new bullets describing the
stratified-permutation placebo allocator and the PSU-level LOO
jackknife, parallel to the existing hybrid-bootstrap bullet.
* docs/tutorials/16_survey_did.ipynb cell 35: support matrix SDID
row updated from "bootstrap only (PR #352)" to "Full (all three
variance methods)"; legend amended; "Note on SyntheticDiD" block
rewritten to describe all three allocators with the jackknife
few-PSU caveat.
* docs/survey-roadmap.md: Phase 5 matrix row closes the placebo/
jackknife gap; Phase 6 bullet updated to describe all three
allocators; Current Limitations table entry removed (only replicate-
weight limitation remains, merged into one row).
* CHANGELOG.md: "### Added" entry for placebo + jackknife full-design
support (no new section header — folded into existing Unreleased
block); "### Changed (PR #355)" tweaked to note the separate
follow-up for placebo/jackknife.
* TODO.md row 107 deleted (capability gap closed).
* diff_diff/synthetic_did.py __init__ docstring: survey_design
parameter description rewritten to describe all three methods.
Placebo fallback-guidance comment updated to remove stale "placebo
and jackknife reject strata/PSU/FPC" line.
* diff_diff/guides/llms-full.txt: Phase 5 bootstrap bullet updated
to describe all three survey allocators (UTF-8 fingerprint
preserved — `D'Haultfœuille` still appears throughout).
* tests/test_methodology_sdid.py::TestCoverageMCArtifact: narrative
and assertions updated to reflect that placebo=0-fits is expected
structurally on stratified_survey (documented Case C), while
jackknife now runs successfully with the known anti-conservatism
caveat intentionally unasserted at the calibration-gate level.
Verification
------------
* pytest tests/test_survey_phase5.py::TestSDIDSurveyPlaceboFullDesign
tests/test_survey_phase5.py::TestSDIDSurveyJackknifeFullDesign
tests/test_survey_phase5.py::TestSyntheticDiDSurvey
tests/test_methodology_sdid.py::{TestBootstrapSE,TestPlaceboSE,TestJackknifeSE,TestCoverageMCArtifact}
tests/test_guides.py → 82 passed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…h harness docstring

P3 (Maintainability — survey jackknife still populated _loo_unit_ids): ``fit()`` was unconditionally setting ``_loo_unit_ids`` / ``_loo_roles`` on every jackknife fit, including full-design survey fits where the underlying replicates are PSU-level and ``get_loo_effects_df()`` now raises ``NotImplementedError``. Internal / canned guidance keyed off ``_loo_unit_ids is not None`` as the availability check (e.g., ``practitioner.py``) would still call the accessor on a survey fit and hit the new raise.

Fix: only populate ``_loo_unit_ids`` / ``_loo_roles`` when ``_loo_granularity == "unit"``; leave them ``None`` on the PSU path so ``_loo_unit_ids is not None`` correctly reports availability. ``_loo_granularity`` is the authoritative accessor gate; the legacy ``_loo_unit_ids`` sentinel now agrees with it.

P3 (Documentation — harness docstring stale): the ``coverage_sdid.py::_fit_one`` docstring said "fit() routes [survey designs] through the bootstrap survey path (PR #352) when method=='bootstrap'" — stale after the placebo + jackknife full-design paths landed. Rewrote it to describe the three method-specific survey variance paths (weighted-FW + Rao-Wu bootstrap; stratified-permutation + weighted-FW placebo; PSU-LOO + stratum-aggregation jackknife) and to mention the Case B-D ValueError failure modes alongside NotImplementedError.

Verification: 94 passed (no behavior change on the gating fix — it is a state-gating tightening, not a correctness change).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- Pre-test diagnostics for `HeterogeneousAdoptionDiD` in a new module `diff_diff/had_pretests.py`.
- `qug_test` (Theorem 4), `stute_test` (Appendix D with Mammen wild bootstrap), `yatchew_hr_test` (Theorem 7 / Equation 29), and a composite `did_had_pretest_workflow` that runs all three on a two-period HAD panel.
- The `HADPretestReport` verdict explicitly flags the Assumption 7 gap when all implemented tests fail-to-reject, so callers do not receive a misleading "TWFE safe" signal.

Methodology references (required if estimator / math changes)
- R reference implementations: `chaisemartin::qug_test`, `stute_test`, `yatchew_test`.
- Joint Equation 18 cross-horizon pre-trend test: the paper formula needs `dechaisemartin-2026-review.md` and extraction from the paper PDF, which is out of scope for this PR. Tracked in `TODO.md`; documented in REGISTRY.md.
- Appendix D vectorized Stute bootstrap (precompute `M = I - X(X'X)^{-1}X'` once) is functionally identical and ~2x faster; deferred as a performance follow-up to keep the reviewer surface narrow. Tracked in `TODO.md`.
- No tight numerical parity against `chaisemartin::stute_test` / `yatchew_test` (bootstrap seed semantics + B differ across numpy/R). Tight parity deferred.

Validation
- `tests/test_had_pretests.py` (new, 47 tests, 9 test classes):
  - `TestQUGTest`: closed-form tight parity at atol=1e-12, tie-break NaN, zero-filter warning, asymptotic-size upper bound on the Uniform null, single-realization rejection/non-rejection on shifted-Beta / Uniform, invalid-alpha + NaN-input + 2D-input raises.
  - `TestStuteTest`: single-realization no-reject on linear / reject on quadratic, `_cvm_statistic` algebraic equivalence to the paper's stated form, n_bootstrap<99 raise, mismatched-length raise, NaN-input raise.
  - `TestYatchewHRTest`: single-realization linear / quadratic, paper-literal `sigma2_diff` normalizer (divides by `2G`, NOT `2(G-1)`) at atol=1e-12, invalid-alpha raise.
  - `TestCompositeWorkflow`: integration tests for (a) all-pass on the linear DGP flags the Assumption 7 gap, (b) the quadratic-plus-shifted-support DGP triggers the bundled rejection verdict, (c) the workflow seed forwards to Stute only.
  - `TestNaNPropagation`: constant-dy trivial linearity, QUG-tie propagates to the composite verdict, Stute constant-d NaN.
  - `TestJSONSerialization`: `to_dict()` JSON-safe for all 4 result classes, `HADPretestReport.to_dataframe()` schema `[test, statistic_name, statistic_value, p_value, reject, alpha, n_obs]`, per-test single-row frames.
  - `TestSeedReproducibility`: bitwise reproducibility at seed=42, different seeds produce different p-values.
  - `TestSampleSizeGates`: QUG G<2 NaN, stute G<10 NaN, yatchew G<3 NaN (all via UserWarning, not ValueError).
  - `TestComposeVerdictLogic`: 5 priority-order tests on `_compose_verdict` covering (a) NaN inconclusive, (b) none-reject Assumption 7 gap, (c) only-QUG, (d) only-linearity, (e) all-reject bundled.
  - `TestSummary`: non-empty summary strings for all 4 result classes.
- `tests/test_had.py` (Phase 1-2) — 214 passed, 2 skipped, no regressions.
- … (`practitioner_next_steps` + tutorials).

Security / privacy
Generated with Claude Code