
HAD Phase 3: pre-test diagnostics (qug_test, stute_test, yatchew_hr_test, composite workflow)#352

Merged
igerber merged 9 commits into main from had-phase-3-pretests
Apr 22, 2026
Conversation


@igerber igerber commented Apr 22, 2026

Summary

  • Ship paper Section 4 (de Chaisemartin, Ciccia, D'Haultfoeuille, Knau 2026, arXiv:2405.04465v6) pre-test diagnostics for HeterogeneousAdoptionDiD in a new module diff_diff/had_pretests.py.
  • qug_test (Theorem 4), stute_test (Appendix D with Mammen wild bootstrap), yatchew_hr_test (Theorem 7 / Equation 29), and a composite did_had_pretest_workflow that runs all three on a two-period HAD panel.
  • Partial-workflow semantics: Phase 3 ships paper steps 1 (QUG) and 3 (linearity) only; step 2 (Assumption 7 pre-trends test via Equation 18) is deferred to a follow-up patch. The HADPretestReport verdict explicitly flags the Assumption 7 gap when all implemented tests fail-to-reject, so callers do not receive a misleading "TWFE safe" signal.

Methodology references (required if estimator / math changes)

  • Method name(s): QUG null test for support infimum; Cramer-von Mises linearity test with Mammen (1993) wild bootstrap; Yatchew heteroskedasticity-robust variance-ratio linearity test; four-step HAD pre-testing workflow.
  • Paper / source link(s):
    • Primary: de Chaisemartin, Ciccia, D'Haultfoeuille, Knau (2026). arXiv:2405.04465v6. Theorem 4 (QUG), Appendix D + Equation 17 (Stute), Theorem 7 + Equation 29 (Yatchew-HR), Section 4.2-4.3 (composite workflow).
    • Stute, W. (1997). Nonparametric model checks for regression. Annals of Statistics 25, 613-641.
    • Mammen, E. (1993). Bootstrap and wild bootstrap for high-dimensional linear models. Annals of Statistics 21, 255-285.
    • Yatchew, A. (1997). An elementary estimator of the partial linear model. Economics Letters 57, 135-143.
    • R reference implementations: chaisemartin::qug_test, stute_test, yatchew_test.
  • Any intentional deviations from the source (and why):
    • Partial workflow: step 2 (joint cross-horizon Stute test via paper Equation 18 on pre-period placebos) deferred to a Phase 3 follow-up patch. The paper's exact stacked-residual formula is not reproduced in dechaisemartin-2026-review.md and extraction from the paper PDF is out of scope for this PR. Tracked in TODO.md; documented in REGISTRY.md.
    • Stute bootstrap form: literal per-iteration OLS refit matching paper Appendix D Algorithm text. Appendix D vectorized matrix form (precomputing M = I - X(X'X)^{-1}X' once) is functionally identical and ~2x faster; deferred as a performance follow-up to keep reviewer surface narrow. Tracked in TODO.md.
    • No multiple-testing adjustment: paper does not prescribe Bonferroni/Holm; per-test alpha is used as-is. Documented in REGISTRY.md Phase 3 Note block.
    • R-parity scope: QUG tight closed-form parity at atol=1e-12; Stute / Yatchew-HR validated via synthetic-DGP coverage rates rather than tight numerical parity against chaisemartin::stute_test / yatchew_test (bootstrap seed semantics + B differ across numpy/R). Tight parity deferred.

Validation

  • Tests added/updated: tests/test_had_pretests.py (new, 47 tests, 9 test classes):
    • TestQUGTest: closed-form tight parity at atol=1e-12, tie-break NaN, zero-filter warning, asymptotic-size upper bound on Uniform null, single-realization rejection/non-rejection on shifted-Beta / Uniform, invalid-alpha + NaN-input + 2D-input raises.
    • TestStuteTest: single-realization no-reject on linear / reject on quadratic, _cvm_statistic algebraic equivalence to paper's stated form, n_bootstrap<99 raise, mismatched-length raise, NaN-input raise.
    • TestYatchewHRTest: single-realization linear / quadratic, paper-literal sigma2_diff normalizer (divides by 2G, NOT 2(G-1)) at atol=1e-12, invalid-alpha raise.
    • TestCompositeWorkflow: integration tests for (a) all-pass on linear DGP flags Assumption 7 gap, (b) quadratic-plus-shifted-support DGP triggers bundled rejection verdict, (c) workflow seed forwards to Stute only.
    • TestNaNPropagation: constant-dy trivial linearity, QUG-tie propagates to composite verdict, Stute constant-d NaN.
    • TestJSONSerialization: to_dict() JSON-safe for all 4 result classes, HADPretestReport.to_dataframe() schema [test, statistic_name, statistic_value, p_value, reject, alpha, n_obs], per-test single-row frames.
    • TestSeedReproducibility: bitwise reproducibility at seed=42, different seeds produce different p-values.
    • TestSampleSizeGates: QUG G<2 NaN, stute G<10 NaN, yatchew G<3 NaN (all via UserWarning, not ValueError).
    • TestComposeVerdictLogic: 5 priority-order tests on _compose_verdict covering (a) NaN inconclusive, (b) none-reject Assumption 7 gap, (c) only-QUG, (d) only-linearity, (e) all-reject bundled.
    • TestSummary: non-empty summary strings for all 4 result classes.
  • Regression: tests/test_had.py (Phase 1-2) — 214 passed, 2 skipped, no regressions.
  • Backtest / simulation / notebook evidence (if applicable): N/A (no tutorial changes in this PR; Phase 5 integrates practitioner_next_steps + tutorials).

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

igerber and others added 3 commits April 22, 2026 13:35
…est, workflow)

Ships paper Section 4 (de Chaisemartin et al. 2026, arXiv:2405.04465v6) pre-test
diagnostics for the HeterogeneousAdoptionDiD estimator in a new module
`diff_diff/had_pretests.py`:

- `qug_test()` — order-statistic ratio test of H_0: d_lower = 0 (Theorem 4).
  Closed-form T = D_(1)/(D_(2)-D_(1)); asymptotic p-value 1/(1+T) under the
  Exp(1)/Exp(1) limit law; reject if T > 1/alpha - 1. Zero-filter with warning;
  tie-break returns all-NaN inference.
- `stute_test()` — Cramer-von Mises cusum test with Mammen wild bootstrap
  (Appendix D). Statistic S = (1/G^2) sum(cumsum^2); literal per-iteration OLS
  refit matching paper Appendix D Algorithm. n_bootstrap >= 99 front-door
  validated; G < 10 returns NaN; G > 100_000 warns and points to yatchew_hr_test.
- `yatchew_hr_test()` — heteroskedasticity-robust variance-ratio linearity test
  (Theorem 7 / Equation 29). sigma2_diff normalizes by 2G paper-literal (NOT
  2(G-1)); tight hand-computed parity at atol=1e-12.
- `did_had_pretest_workflow()` — composite runs all three on a two-period panel
  and returns HADPretestReport with priority-ordered verdict string.
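
The qug_test closed form above can be sketched in a few lines. This is a minimal illustration under the Exp(1)/Exp(1) limit law; `qug_statistic` is a hypothetical helper name, not the shipped `qug_test` signature, and the guards shown are reduced to the tie-break case:

```python
import numpy as np

def qug_statistic(d, alpha=0.05):
    """Sketch of the closed-form QUG order-statistic test (Theorem 4).

    Assumes at least two positive doses. T = D_(1) / (D_(2) - D_(1));
    asymptotic p-value 1/(1+T); reject when T > 1/alpha - 1 (which is
    equivalent to p < alpha).
    """
    d = np.asarray(d, dtype=float)
    d_nz = d[d > 0]                       # zero-dose rows are filtered
    part = np.partition(d_nz, 1)          # partial selection of two smallest
    d1, d2 = part[0], part[1]
    if d1 == d2:                          # tie-break: inference undefined
        return np.nan, np.nan, False
    t_stat = d1 / (d2 - d1)
    p_value = 1.0 / (1.0 + t_stat)        # Exp(1)/Exp(1) limit law
    reject = t_stat > 1.0 / alpha - 1.0
    return t_stat, p_value, reject
```

The rejection rule and the p-value agree by construction: T > 1/alpha - 1 holds exactly when 1/(1+T) < alpha.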

Deferred to Phase 3 follow-up patches:
- Joint Equation 18 cross-horizon Stute pre-trend test (paper formula needs PDF
  extraction; tracked in TODO.md)
- Appendix D vectorized matrix form for Stute bootstrap (functionally identical,
  ~2x faster; shipped literal-refit to match paper text)
- Tight R-package numerical parity (bootstrap seed semantics differ across numpy/R)

47 tests across 9 classes cover rejection rates, closed-form parity, tie-breaks,
NaN propagation, JSON serialization, seed reproducibility, sample-size gates,
and verdict logic.
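
The stute_test recipe above (OLS residual cusum plus Mammen wild bootstrap with a literal per-iteration refit) can be sketched as follows. This assumes untied doses and omits the shipped front-door guards; the helper name and the +1/+1 p-value convention are illustrative:

```python
import numpy as np

def stute_test_sketch(d, dy, n_bootstrap=999, seed=None):
    """Sketch of the CvM cusum linearity test with Mammen wild bootstrap.

    S = (1/G^2) * sum of squared cumulative sums of OLS residuals,
    residuals ordered by dose. Untied doses assumed (no tie collapse).
    """
    rng = np.random.default_rng(seed)
    order = np.argsort(d)
    d, dy = d[order], dy[order]
    G = d.size
    X = np.column_stack([np.ones(G), d])

    def cvm(y):
        # Literal per-iteration OLS refit (Appendix D algorithm form).
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ coef
        return np.sum(np.cumsum(resid) ** 2) / G**2

    s_obs = cvm(dy)
    coef_hat, *_ = np.linalg.lstsq(X, dy, rcond=None)
    fitted, eps = X @ coef_hat, dy - X @ coef_hat
    # Mammen (1993) two-point weights: mean 0, variance 1, third moment 1.
    a, b = -(np.sqrt(5) - 1) / 2, (np.sqrt(5) + 1) / 2
    p_a = (np.sqrt(5) + 1) / (2 * np.sqrt(5))
    s_boot = np.empty(n_bootstrap)
    for i in range(n_bootstrap):
        v = rng.choice([a, b], size=G, p=[p_a, 1 - p_a])
        s_boot[i] = cvm(fitted + eps * v)
    p_value = (1 + np.sum(s_boot >= s_obs)) / (1 + n_bootstrap)
    return s_obs, p_value
```

Refitting inside every bootstrap iteration matches the literal algorithm text; the precomputed-projection form mentioned above would avoid the repeated lstsq calls.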

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AI review found that did_had_pretest_workflow's fail-to-reject verdict
("TWFE safe under Section 4 assumptions") was a methodology overclaim:
Phase 3 ships steps 1 + 3 of the paper's four-step workflow but not
step 2 (Assumption 7 pre-trends test via Equation 18), so the verdict
could mislead users into believing the full paper-prescribed workflow
had passed.

Fix:
- _compose_verdict fail-to-reject branch now returns
  "QUG and linearity diagnostics fail-to-reject; Assumption 7
  pre-trends test NOT run (paper step 2 deferred to Phase 3 follow-up)"
- HADPretestReport docstring adds a prominent partial-workflow caveat
  on the verdict and all_pass fields.
- did_had_pretest_workflow docstring renames intent from "run the
  paper's workflow" to "run a PARTIAL workflow (steps 1 and 3 only)".
- Tests: renamed + updated to assert the Assumption 7 caveat appears
  in the verdict and that "TWFE safe" does NOT appear.
- REGISTRY.md updated to document the partial-workflow semantics and
  revised verdict string in both the Note block and checkbox row.

Also addresses P2 from the review: replaced positional `None` in the
_aggregate_first_difference call with keyword argument cluster_col=None
plus an inline comment naming the intent.
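
The priority-ordered verdict logic this fix produces can be sketched as follows (hypothetical signature; the verdict strings are abbreviated from the ones quoted in this thread):

```python
import math

def compose_verdict_sketch(results):
    """Sketch of the verdict priority order: NaN inconclusive first,
    then any rejections, then the Assumption 7 caveat on fail-to-reject.

    `results` maps test name -> (p_value, reject); illustrative only.
    """
    nan_tests = [n for n, (p, _) in results.items() if not math.isfinite(p)]
    if nan_tests:
        return "inconclusive - " + ", ".join(nan_tests) + " NaN"
    rejected = [n for n, (_, r) in results.items() if r]
    if rejected:
        return "reject: " + ", ".join(rejected)
    return ("QUG and linearity diagnostics fail-to-reject; Assumption 7 "
            "pre-trends test NOT run (paper step 2 deferred)")
```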

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…string + string concat

Re-review came back ✅ after the P1 fix; remaining findings were all P3
documentation/cleanup:

- Module docstring now explicitly frames Phase 3 as shipping steps 1 + 3
  of the paper's four-step workflow (not "all three fail-to-reject ->
  TWFE valid", which could read as unconditional certification).
- `_compose_verdict` docstring synchronized with the partial-workflow
  verdict text (previously said "TWFE safe under Section 4 assumptions";
  now says "QUG and linearity diagnostics fail-to-reject; Assumption 7
  pre-trends test NOT run").
- Collapsed adjacent-string-literal in `yatchew_hr_test` sigma4_W<=0
  warning into a single string.

No behavioral changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Overall Assessment

⚠️ Needs changes

Three unmitigated P1 issues in the new HAD pretest path would make me block approval.

Executive Summary

  • stute_test is not tie-safe: with duplicated doses, the implemented CvM statistic becomes row-order-dependent and no longer matches the registry/paper definition.
  • Exact linear null cases are routed to NaN/inconclusive by the new degeneracy branches, even though Assumption 8 holds exactly there.
  • HADPretestReport.all_pass can be True while verdict says the report is inconclusive, so downstream callers can silently mis-gate on the boolean.
  • The exported raw qug_test API still needs a negative-dose guard; today it can silently drop invalid negatives and report them as excluded zeros.
  • The documented Phase 3 deferrals in TODO.md:L98-L102 and REGISTRY.md are properly tracked and are not blockers.
  • Static review only: I could not execute the tests in this environment because the reviewer runtime is missing Python test dependencies (numpy, pytest).

Methodology

  • Severity P1. Impact: Duplicate dose values are handled incorrectly in stute_test. The registry defines c_G(d)=G^{-1/2}\sum 1{D\le d}\hat\varepsilon and S=(1/G)\sum c_G^2(D_g) in docs/methodology/REGISTRY.md:L2280-L2288. The implementation instead sorts once and feeds observation-by-observation cumulative sums into _cvm_statistic in diff_diff/had_pretests.py:L525-L547 and diff_diff/had_pretests.py:L850-L860. When d has ties, all observations in the tie block should use the same post-tie cumulative sum; the current code uses partial within-tie sums, so the statistic becomes row-order-dependent and can be wrong on rounded/discrete doses. Concrete fix: collapse equal-dose blocks before squaring in both the observed and bootstrap statistics, or explicitly reject duplicate d values with a front-door error/warning if the implementation is intended only for continuous support; add an order-invariance regression.
  • Severity P1. Impact: Exact-linear inputs are misclassified as inconclusive. stute_test returns all-NaN when var(eps)==0 in diff_diff/had_pretests.py:L829-L848, and yatchew_hr_test does the same when sigma4_W<=0 in diff_diff/had_pretests.py:L959-L991. For an exact affine relation ΔY=a+bD, Assumption 8 holds exactly; under the Stute bootstrap, S=0 and every S*=0, so the correct outcome is a non-rejection (p=1), not NaN. _compose_verdict then upgrades any NaN to "inconclusive" in diff_diff/had_pretests.py:L550-L575, so an exactly linear panel can fail the new composite diagnostic. The intended contract in tests/test_had_pretests.py:L370-L391 points the same way. Concrete fix: return a non-rejecting result for exact-linear fits in the Stute path, and either define the same continuity-limit behavior for Yatchew or keep the workflow from poisoning the whole report when the alternative linearity test is well-defined; add regressions for dy = a + b*d.
  • Severity P1. Impact: HADPretestReport.all_pass can be True on inconclusive reports. The dataclass presents all_pass as the summary boolean in diff_diff/had_pretests.py:L368-L376, but the workflow computes it from not reject only in diff_diff/had_pretests.py:L1092-L1097. Any NaN case therefore yields all_pass=True while verdict says "inconclusive - ... NaN". That is a semantic contract violation for a new public result object and can silently mislead downstream callers. Concrete fix: require all constituent p-values/statistics to be finite before setting all_pass=True, or rename the field to something explicit like no_rejections and add a separate conclusive boolean; add a regression on a NaN case such as D_(1)=D_(2).

Code Quality

  • Severity P2. Impact: The exported raw qug_test API silently treats all d <= 0 observations as excluded but tells the user only that d == 0 rows were dropped in diff_diff/had_pretests.py:L646-L656. Negative post-period doses violate the HAD support restriction documented for the estimator in docs/methodology/REGISTRY.md:L2330-L2330; today the panel workflow rejects them, but the raw public API computes a statistic on malformed input and misreports the exclusion metadata. Concrete fix: add a front-door ValueError on any d < 0 in qug_test (and, if intended, the other raw pretests), leaving the current warning only for true zero-dose observations.

Performance

  • Severity P3. Impact: The registry says the shipped QUG implementation is O(G) via np.partition in docs/methodology/REGISTRY.md:L2276-L2276, but the code currently performs a full sort in diff_diff/had_pretests.py:L681-L683. This is not a methodology blocker, but the published complexity claim is no longer true and the implementation does extra work on large G. Concrete fix: either switch to np.partition to select the two smallest positive doses or update the registry note to reflect the actual np.sort implementation.

Maintainability

No new maintainability findings beyond the API-contract issues above.

Tech Debt

  • Severity P3 informational. Impact: The deferred Phase 3 items called out by the PR, including the Equation 18 joint pre-trend test, vectorized Stute bootstrap, R-parity scope, and multi-period workflow dispatch, are explicitly tracked in TODO.md:L98-L102. I did not count those tracked deferrals against the assessment. Concrete fix: none required for this PR.

Security

No findings.

Documentation/Tests

  • Severity P2. Impact: The new test module does not cover the two most failure-prone new paths I found: duplicate-dose handling in stute_test and negative-dose validation for the raw QUG API. Separately, the constant-dy test contract in tests/test_had_pretests.py:L370-L391 currently disagrees with the submitted degeneracy branches in diff_diff/had_pretests.py:L829-L848 and diff_diff/had_pretests.py:L975-L991, so the intended behavior is not actually protected by the implementation. Concrete fix: add regressions for duplicate-dose order invariance, negative-dose rejection, and the NaN/all_pass contract; if exact-linear inputs are supposed to fail-to-reject, make that behavior explicit in both code and tests.
  • Severity P3. Impact: REGISTRY.md still says the library default for stute_test is n_bootstrap = 499 in docs/methodology/REGISTRY.md:L2291-L2291, but the code ships 999 in diff_diff/had_pretests.py:L724-L725. Concrete fix: align the registry text with the code, or revert the code default if 499 is the intended public API.

Path to Approval

  1. Fix stute_test tie handling so duplicate doses either produce the correct block-collapsed CvM statistic or are rejected/warned explicitly, and add an order-invariance regression.
  2. Change the exact-linear degeneracy behavior so the linearly satisfied null does not return an inconclusive report, with tests for dy = a + b*d.
  3. Make all_pass impossible on inconclusive reports by gating it on finite constituent results, or rename/split the boolean so callers cannot mistake “no rejection” for “valid pass.”
  4. Add a negative-dose guard to the raw QUG API and reconcile the registry with the shipped implementation (np.partition vs np.sort, 499 vs 999).

P1 #1 — Stute tie-safe CvM:
Paper defines c_G(d) = Σ 1{D ≤ d} · eps with c_G(D_g) evaluated AT each
observation's dose, so tied observations share the post-tie cumulative sum.
My naive cumsum over sorted residuals produced partial within-tie sums that
were row-order-dependent. Fix: after cumsum, replace within-tie-block values
with the block's last cumsum via np.unique + np.repeat. `_cvm_statistic` now
accepts `d_sorted` and collapses tie blocks before squaring. Regression
test `test_cvm_statistic_tie_safe_order_invariance` pins order-invariance
on duplicate doses at atol=1e-14; `test_stute_order_invariance_with_duplicate_doses`
validates the end-to-end stute_test contract.
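
The np.unique + np.repeat collapse described above can be sketched as follows (hypothetical helper name; the shipped `_cvm_statistic` wires this into the squared-cusum statistic):

```python
import numpy as np

def tie_safe_cumsum(d_sorted, eps_sorted):
    """Sketch of the tie-block collapse: every observation in a tied
    dose block shares the block's post-tie cumulative residual sum,
    so c_G(D_g) is evaluated AT each dose and is order-invariant
    within ties.
    """
    cs = np.cumsum(eps_sorted)
    # First index and multiplicity of each distinct dose block.
    _, first_idx, counts = np.unique(
        d_sorted, return_index=True, return_counts=True
    )
    last_idx = first_idx + counts - 1   # last row of each tie block
    block_last = cs[last_idx]           # post-tie cumulative sums
    return np.repeat(block_last, counts)  # broadcast back to all rows
```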

P1 #2 — Exact-linear fit must fail-to-reject (not return NaN):
For dy = a + b·d exact, Assumption 8 holds exactly and the correct outcome
is p=1, reject=False. My previous var(eps)<=0 check routed this to NaN. Fix:
dropped var(eps) degeneracy branch from stute_test (the bootstrap naturally
produces p=1 when eps=0 exactly). Added a scale-relative short-circuit
(sum(eps²) ≤ 1e-24 · sum(dy²)) in both stute_test and yatchew_hr_test so
FP noise (eps ~ 1e-16 from IEEE arithmetic on dy = 1 + 2*d) doesn't defeat
the short-circuit by producing non-zero but tiny OLS residuals. Yatchew
exact-linear now returns (t_stat_hr=-inf, p=1, reject=False) rather than
NaN. Regressions: TestStuteTest.test_exact_linear_returns_p1_not_nan,
TestYatchewHRTest.test_exact_linear_returns_p1_not_nan.

P1 #3 — HADPretestReport.all_pass contract:
Previously `all_pass = not (reject_qug or reject_stute or reject_yatchew)` could
be True while `verdict` said "inconclusive - X NaN". Fix: gate all_pass on every
constituent p-value being finite AND no test rejecting. Updated docstring.
Regression: TestCompositeWorkflow.test_all_pass_false_when_any_test_nan.
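
A sketch of the conclusive-only gate (illustrative names, not the dataclass internals):

```python
import math

def compose_all_pass(p_values, rejects):
    """Sketch: all_pass requires every constituent p-value to be finite
    (conclusive) AND no test to reject; a NaN anywhere forces False."""
    conclusive = all(math.isfinite(p) for p in p_values)
    return conclusive and not any(rejects)
```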

P2 #1 — QUG negative-dose guard:
HAD doses must be non-negative (paper Section 2). The raw qug_test API
was silently folding d < 0 rows into the n_excluded_zero counter (filter
was `d > 0`). Fix: front-door ValueError on any d < 0. Regression:
TestQUGTest.test_negative_dose_raises.

P3 #1 — QUG np.partition:
REGISTRY claims O(G) via np.partition. Code was using np.sort. Switched
qug_test to np.partition(d_nz, 1), which guarantees partitioned[0] ≤
partitioned[1] = D_{(2)}, i.e., partitioned[0] = D_{(1)}. Tight
closed-form parity at atol=1e-12 still holds.

P3 #2 — REGISTRY n_bootstrap default:
REGISTRY said "Default n_bootstrap = 499" but code ships 999. Updated
REGISTRY to match code and added a note about the n_bootstrap >= 99
front-door validation.

Test count: 47 -> 53.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 10a10792675f9171e8083335b6bfd9662755d15e


Overall Assessment

Blocker

This re-review is narrower than the previous one: the earlier P1s on Stute tie handling, exact-linear NaN behavior, all_pass, raw QUG negative-dose validation, and the registry 499/999 + sort/partition drift all look addressed in diff_diff/had_pretests.py:L539-L582, diff_diff/had_pretests.py:L684-L693, diff_diff/had_pretests.py:L1191-L1202, and docs/methodology/REGISTRY.md:L2270-L2291. I still see one unmitigated P0 and one unmitigated P1 in the updated linearity-test paths.

Executive Summary

  • Prior re-review blockers appear fixed: Stute now uses a tie-safe CvM implementation, all_pass is conclusive-only, raw qug_test rejects negative doses, and the registry now matches the shipped np.partition / n_bootstrap=999 behavior.
  • Severity P0 [Newly identified]: the new “exact linear” fast path is not translation-invariant and can silently force p_value=1.0 on ordinary noisy data after adding a large constant to dy; this affects both stute_test and yatchew_hr_test.
  • Severity P1 [Newly identified]: yatchew_hr_test still has no duplicate/constant-dose guard, so tied doses make the statistic depend on arbitrary row or unit ordering.
  • The documented Phase 3 deferrals are properly tracked in TODO.md:L98-L102 and docs/methodology/REGISTRY.md:L2303-L2310, docs/methodology/REGISTRY.md:L2343-L2345, so I did not count them against approval.
  • Static review only: I could not execute the new tests in this reviewer environment because the local Python runtime is missing numpy / pytest.

Methodology

  • Severity P0 [Newly identified]. Impact: the exact-linear shortcut in both linearity tests is provably wrong under additive outcome shifts. The paper/registry definitions for Stute and Yatchew depend on OLS residuals and sorted differences after a regression with an intercept, so adding a constant to dy must not change the test result: docs/methodology/REGISTRY.md:L2280-L2300, docs/methodology/papers/dechaisemartin-2026-review.md:L153-L173. But the new shortcut scales against raw sum(dy^2) in diff_diff/had_pretests.py:L64-L71, diff_diff/had_pretests.py:L904-L915, and diff_diff/had_pretests.py:L1036-L1053. That makes the branch depend on the arbitrary level of the outcome, not the regression fit. Concretely, with G=50 and dy = 1e12 + 2*d + ε with unit-scale noise, sum(dy^2) is about 5e25, so the threshold is about 50; ordinary OLS SSE is then enough to trip the “exact linear” branch and return Stute(cvm_stat=0, p=1) / Yatchew(t=-inf, p=1, sigma2_W=0) on non-degenerate data. That is silent wrong statistical output. Concrete fix: replace the shortcut with a translation-invariant exact-fit check based on centered or fitted-response scale, or remove the shortcut and only special-case truly zero residuals under a centered tolerance; add a regression asserting identical results for (d, dy) and (d, dy + C) on the same near-linear data.

  • Severity P1 [Newly identified]. Impact: yatchew_hr_test still silently accepts duplicate or constant doses and uses stable sort as the tie policy in diff_diff/had_pretests.py:L984-L986, then computes adjacent differences and adjacent residual products on that arbitrary order in diff_diff/had_pretests.py:L1041-L1042 and diff_diff/had_pretests.py:L1055-L1067. Unlike stute_test, there is no front-door constant-dose guard (diff_diff/had_pretests.py:L880-L898). For tied doses, permuting rows within a tie block can change sigma2_diff, sigma4_W, and the final p-value without changing the data. In the composite workflow this arbitrary order becomes unit-id order via diff_diff/had.py:L1351-L1353, which is deterministic but still unrelated to the statistic’s methodology. The registry describes Yatchew only as “sort by D” and has no **Note:** documenting a tie-policy deviation: docs/methodology/REGISTRY.md:L2295-L2299. Concrete fix: reject or return NaN when d has duplicates or zero variance before computing the statistic, or document a justified tie-resolution rule in docs/methodology/REGISTRY.md and back it with permutation tests if you intend to support ties.

Code Quality

No separate findings beyond the methodology issues above.

Performance

  • Severity P3 informational. Impact: the slower literal-refit Stute bootstrap is already explicitly tracked as deferred work in TODO.md:L99-L99 and documented in docs/methodology/REGISTRY.md:L2344-L2347, so I did not count it against this PR. Concrete fix: none for this PR.

Maintainability

No separate findings.

Tech Debt

  • Severity P3 informational. Impact: the missing Equation 18 joint pre-trends step, tighter R parity, nprobust/Stute enhancement ideas, and multi-period workflow dispatch are all explicitly tracked in TODO.md:L98-L102 and reflected in docs/methodology/REGISTRY.md:L2303-L2310, docs/methodology/REGISTRY.md:L2343-L2345. These are properly mitigated deferrals under the project rules. Concrete fix: none for this PR.

Security

No findings.

Documentation/Tests

  • Severity P2. Impact: the added tests cover small-scale exact-linear cases and Stute duplicate-dose handling, but they do not protect the two problematic paths above. There is no regression that the exact-linear shortcut is invariant to adding a large constant (tests/test_had_pretests.py:L269-L279, tests/test_had_pretests.py:L344-L357), and there is no standalone Yatchew duplicate/constant-dose test at all in tests/test_had_pretests.py:L298-L357. Concrete fix: add (1) a large-offset near-linear regression for both stute_test and yatchew_hr_test, and (2) raw yatchew_hr_test cases for constant d and duplicated d under within-tie permutations.

Path to Approval

  1. Replace the exact-linear fast path in both stute_test and yatchew_hr_test with a translation-invariant criterion, or remove it entirely; add a regression showing the outputs are unchanged when the same dy vector is shifted by a large constant.
  2. Add a front-door duplicate/constant-dose guard to yatchew_hr_test before any exact-linear shortcut, and add tests for d = np.full(...) plus duplicated-dose permutations.
  3. If you intentionally keep any Yatchew tie policy instead of a hard guard, document it in docs/methodology/REGISTRY.md with a **Note:** and justify why it preserves the intended statistic.

P0 — Exact-linear shortcut not translation-invariant:
R1's scale-relative shortcut compared sum(eps^2) against sum(dy^2), but
sum(dy^2) is inflated by additive shifts in dy (e.g. dy = 1e12 + 2*d +
noise gives sum(dy^2) ~ 5e25, and threshold becomes ~50, which is easily
tripped by ordinary OLS SSE on unit-scale noise). This silently forced
p=1.0 on genuinely noisy data.

Fix: compare against CENTERED total sum of squares sum((dy - dybar)^2),
which equals (1 - R^2) * TSS and is translation-invariant by construction
(the mean subtraction absorbs any additive shift). Applied to both
stute_test and yatchew_hr_test. Module-level constant docstring updated to
explain why centering is required.

Regressions: TestStuteTest.test_exact_linear_shortcut_does_not_fire_on_shifted_noisy_data,
TestYatchewHRTest.test_exact_linear_shortcut_does_not_fire_on_shifted_noisy_data.
Exact numerical equivalence under dy -> dy + 1e12 is not attainable (FP
cancellation at 1e12 scale costs ~4 digits), but the SHORTCUT behavior
and decision MUST match — the tests assert both.
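
The centered-TSS criterion can be sketched as follows (hypothetical helper; the 1e-24 relative tolerance is illustrative, carried over from the earlier uncentered version):

```python
import numpy as np

def exact_linear_shortcut_fires(d, dy, rel_tol=1e-24):
    """Sketch of the translation-invariant exact-fit check: compare the
    OLS SSE against the CENTERED total sum of squares, so adding a
    constant to dy cannot inflate the comparison scale (the mean
    subtraction absorbs any additive shift)."""
    X = np.column_stack([np.ones_like(d), d])
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    eps = dy - X @ beta
    tss_centered = np.sum((dy - dy.mean()) ** 2)  # invariant to dy + C
    return np.sum(eps**2) <= rel_tol * tss_centered
```

On the counterexample scale above, dy = 1e12 + 2*d + unit noise now yields an SSE many orders of magnitude above rel_tol times the centered TSS, so the shortcut cannot fire on noisy data.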

P1 — Yatchew missing tie-handling guard:
R1's tie-safe CvM fixed Stute, but yatchew_hr_test still silently computed
adjacent differences on stable-sorted duplicates. Under ties, within-tie
row ordering changes sigma2_diff and sigma4_W, so the statistic depends
on arbitrary input row order rather than the data.

Fix: front-door UserWarning + all-NaN result when `np.unique(d).size < G`.
Users with tied doses (mass-point designs, discretised registers) are
directed to stute_test, whose tie-safe CvM handles ties correctly.
This also subsumes the constant-d case (all tied) without needing a
separate check.

Regressions: TestYatchewHRTest.test_duplicate_doses_return_nan,
TestYatchewHRTest.test_constant_d_returns_nan.

P2 — REGISTRY documentation:
Added two Note blocks to the HAD section:
- **Note (Phase 3 Yatchew tie policy)**: justifies the NaN-on-duplicate
  policy and directs users to Stute for tied-dose designs.
- **Note (Phase 3 exact-linear short-circuit)**: documents the
  translation-invariance requirement and the centered-TSS scaling.

Test count: 53 -> 57.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 6cd4805354428d25c826eee6f7831f1d2fba65f1


Overall Assessment

⛔ Blocker

Focused re-review of the PR diff. The prior blockers from the last AI review look addressed: the exact-linear shortcuts are now translation-invariant in diff_diff/had_pretests.py:L904-L920 and diff_diff/had_pretests.py:L1068-L1092, the Yatchew duplicate/constant-dose guard is present in diff_diff/had_pretests.py:L1032-L1063, all_pass is now conclusive-only in diff_diff/had_pretests.py:L1230-L1242, and the workflow/registry/TODO notes now line up in docs/methodology/REGISTRY.md:L2343-L2347 and TODO.md:L98-L102. One unmitigated P0 remains in yatchew_hr_test().

Executive Summary

  • Prior re-review findings appear fixed: the translation-invariance bug, duplicate-dose guard, all_pass gating, and registry/TODO documentation all look resolved.
  • Severity P0 [Newly identified]: yatchew_hr_test() still treats every sigma2_W == 0 case as “exact linearity” and returns t_stat_hr=-inf, p_value=1.0, but sigma2_W == 0 can occur with nonzero residuals and a positive numerator, which flips the correct decision.
  • The deferred Phase 3 items are properly tracked: Equation 18 pre-trends, vectorized Stute bootstrap, tighter R parity, and multi-period workflow dispatch are all documented in TODO.md:L98-L102 and docs/methodology/REGISTRY.md:L2310-L2351, so I did not count them against approval.
  • Static review only: I could not run the full test suite here because the environment is missing numpy, pandas, scipy, and pytest.

Methodology

  • Severity P0 [Newly identified]. Impact: in diff_diff/had_pretests.py:L1103-L1125, the code assumes sigma4_W == 0 implies “OLS residuals are zero” and hard-codes t_stat_hr=-inf, p_value=1.0, reject=False. That is not true under the paper’s Equation 29 in docs/methodology/papers/dechaisemartin-2026-review.md:L165-L173: sigma4_W = 0 only means every adjacent residual-product is zero. A concrete counterexample is d=[1,2,3,4,5], dy=[1,0,-2,0,1], which yields OLS residuals [1,0,-2,0,1], sigma2_lin=1.2, sigma2_diff=1.0, and sigma2_W=0. The formal statistic therefore tends to +inf, not -inf, so the current branch silently turns a rejection into a fail-to-reject. Because diff_diff/had_pretests.py:L1226-L1242 feeds that result straight into did_had_pretest_workflow(), the composite verdict can also be wrong with no warning. Concrete fix: keep the earlier exact-linear shortcut for true zero-residual fits, but do not map every later sigma4_W == 0 case to p=1. Distinguish by the numerator sign (+inf reject, -inf fail-to-reject, near-zero numerator -> warning/NaN), or conservatively return NaN unless the exact-linear condition already fired.

Code Quality

  • No separate findings beyond the methodology blocker above.

Performance

  • Severity P3 informational. Impact: the slower literal-refit Stute bootstrap remains, but it is explicitly documented and tracked in TODO.md:L99 and docs/methodology/REGISTRY.md:L2344, so I did not count it against approval. Concrete fix: none for this PR.

Maintainability

  • No separate findings.

Tech Debt

  • Severity P3 informational. Impact: the deferred Equation 18 joint pre-trends test, tighter R parity, nprobust enhancement idea, and multi-period workflow dispatch are all newly tracked in TODO.md:L98-L102. These are properly mitigated deferrals under the project rules. Concrete fix: none for this PR.

Security

  • No findings.

Documentation/Tests

  • Severity P2. Impact: the Yatchew tests in tests/test_had_pretests.py:L377-L440 cover exact-linear, shifted-noisy, and duplicate-dose paths, but they do not cover the blocker case where sigma2_W == 0 despite nonzero residuals. That leaves the silent wrong-output branch above unpinned. Concrete fix: add a regression using a unique-dose input such as d=[1,2,3,4,5], dy=[1,0,-2,0,1] and assert the function does not return p_value=1.0, reject=False.
  • Severity P3 informational. Impact: the public docstring in diff_diff/had_pretests.py:L983-L991 still says tied doses are handled by stable sort, but the shipped implementation now warns and returns NaN on duplicates in diff_diff/had_pretests.py:L1032-L1063. The registry note is correct in docs/methodology/REGISTRY.md:L2346-L2347, so this is informational only, but the API docs are currently misleading. Concrete fix: update the yatchew_hr_test() docstring to describe the duplicate-dose guard and direct tied-dose users to stute_test().

Path to Approval

  1. Fix the sigma4_W == 0 branch in yatchew_hr_test() so it does not unconditionally map zero denominator to t_stat_hr=-inf, p_value=1.0. Only true exact-linear fits should take that path.
  2. Add a regression test for a unique-dose/non-exact-linear case with sigma2_W == 0 and assert the function does not silently fail-to-reject.
  3. Update the yatchew_hr_test() docstring so the in-code API docs match the documented registry tie policy.

P0 — Yatchew sigma4_W=0 branch was not sign-aware:
R2's safety branch unconditionally returned (t_stat_hr=-inf, p=1,
reject=False) whenever sigma4_W <= 0, which is mathematically wrong for
non-exact-linear inputs where the adjacent-residual-product sum happens
to vanish.

Reviewer's counterexample: d=[1,2,3,4,5], dy=[1,0,-2,0,1].
OLS slope=0, residuals=dy itself. sigma2_lin=1.2, sigma2_diff=1.0,
sigma4_W=0 (each adjacent residual product has a zero factor).
The formal statistic sqrt(5)*(1.2-1.0)/0 diverges to +inf, so against
the one-sided critical value the correct decision is reject=True.
Previously the branch mapped this case to p=1, silently flipping the
rejection.

Fix: after the exact-linear short-circuit has already caught the
residuals-are-zero case, the sigma4_W=0 branch is now sign-aware on the
numerator:
- numerator > 0 (sigma2_lin > sigma2_diff) -> T_hr=+inf, p=0, reject=True
- numerator < 0 -> T_hr=-inf, p=1, reject=False
- numerator = 0 -> T_hr=NaN, p=NaN, reject=False (genuinely indeterminate)
All three branches now emit a UserWarning so users see the degenerate
denominator rather than silently consuming an infinite statistic.
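A minimal sketch of that sign-aware branch (a hypothetical standalone helper for illustration; in the PR the logic is inlined in yatchew_hr_test()):

```python
import math
import warnings

def degenerate_sigma4_branch(sigma2_lin, sigma2_diff, n):
    """Sign-aware handling once sigma4_W == 0 survives the exact-linear
    short-circuit (sketch, not the diff_diff API)."""
    warnings.warn("sigma4_W == 0: degenerate variance denominator",
                  UserWarning, stacklevel=2)
    numerator = math.sqrt(n) * (sigma2_lin - sigma2_diff)
    if numerator > 0:
        return math.inf, 0.0, True        # t_stat_hr, p_value, reject
    if numerator < 0:
        return -math.inf, 1.0, False
    return math.nan, math.nan, False      # genuinely indeterminate
```

On the reviewer's counterexample (sigma2_lin=1.2, sigma2_diff=1.0, n=5) this returns (+inf, 0.0, True) rather than the old (-inf, 1.0, False).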

Regression: TestYatchewHRTest.test_sigma4_W_zero_with_positive_numerator_rejects
pins the reviewer's counterexample: asserts t_stat_hr=+inf, p=0,
reject=True (NOT p=1, reject=False), with sigma2_lin=1.2 and
sigma2_diff=1.0 preserved for inspection.

P3 — yatchew_hr_test docstring stale on tie policy:
R2 added the duplicate-dose NaN guard in code and REGISTRY, but the
public docstring still said "stable sort preserves input order among
tied d values." Updated "Notes" section to describe: (a) duplicate-dose
rejection with redirect to stute_test, (b) the translation-invariant
exact-linear short-circuit, and (c) the sign-aware sigma4_W=0 branch.

Test count: 57 -> 58.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: de20d51f122c2b271b88b167fa30321c7717f009


Overall Assessment

Looks good.

The prior re-review blocker appears resolved: the exact-linear guards are now translation-invariant, yatchew_hr_test() no longer hard-codes every sigma4_W == 0 case to p=1, all_pass now requires finite p-values, and the registry/TODO/docstring updates line up (diff_diff/had_pretests.py:904, diff_diff/had_pretests.py:1135, diff_diff/had_pretests.py:1288, docs/methodology/REGISTRY.md:2343, TODO.md:98). I did not find an unmitigated P0/P1 in the changed code.

Static review only: this environment is missing numpy and pytest, so I could not run the new suite or import the package dynamically.

Executive Summary

Methodology

  • Severity P3 [Newly identified]. Impact: the composite workflow is stricter than the paper’s step-3 “Stute or Yatchew” wording. yatchew_hr_test() returns NaN on any duplicated dose (diff_diff/had_pretests.py:1068), and _compose_verdict() plus all_pass then classify the whole workflow as inconclusive on any NaN child result (diff_diff/had_pretests.py:604, diff_diff/had_pretests.py:1288). That matters for common QUG-style panels with repeated d == 0; the paper review itself cites an application with 12 zero-dose units (docs/methodology/papers/dechaisemartin-2026-review.md:298). This is conservative rather than silently wrong, so I would not block on it. Concrete fix: in did_had_pretest_workflow(), skip Yatchew from conclusive gating when duplicate doses are present, or document the helper’s tied-dose limitation explicitly in the registry/docstring.

Code Quality

  • No findings.

Performance

  • Severity P3 informational. Impact: the literal Stute bootstrap refit remains slower than Appendix D’s vectorized matrix form, but that is explicitly documented and tracked, not a correctness defect (TODO.md:99, docs/methodology/REGISTRY.md:2344). Concrete fix: none for this PR.

Maintainability

  • No findings.

Tech Debt

  • Severity P3 informational. Impact: the deferred Equation 18 joint pre-trends test, tighter R parity, nprobust-related enhancement, and multi-period workflow dispatch are all tracked in TODO.md, so these are properly mitigated deferrals under the project rules (TODO.md:98). Concrete fix: none for this PR.

Security

  • No findings.

Documentation/Tests

  • Severity P2. Impact: the new tests cover standalone zero filtering in QUG and standalone duplicate-dose NaN behavior in Yatchew, but they do not pin the workflow-on-tied-dose path above. The workflow tests only exercise continuous-dose pass/fail and QUG-tie inconclusive cases, so the common “multiple zero-dose units + valid Stute test” workflow path is still untested (tests/test_had_pretests.py:95, tests/test_had_pretests.py:414, tests/test_had_pretests.py:489, tests/test_had_pretests.py:611). Concrete fix: add a did_had_pretest_workflow() regression with repeated d == 0 observations and otherwise linear dy, and assert either a Stute-only fallback verdict or the explicitly documented inconclusive behavior.

R4's verdict was "Looks good" overall, with no P0/P1 blockers remaining.
This commit fixes the P3 and P2 proactively to avoid tied-dose workflow
confusion.

R4 P3 / P2 — Workflow must follow paper's "Stute OR Yatchew" step 3:
R3's fix made yatchew_hr_test return NaN on duplicate doses (per paper
methodology). But the composite workflow then forced "inconclusive" on
any NaN child result, so common QUG-style panels with repeated d=0
(reviewer cites an application with 12 zero-dose units) were too
conservatively classified. Per paper Section 4.2-4.3 step 3, linearity
can be tested with Stute OR Yatchew; a conclusive Stute alone
suffices.

Fix:
- _compose_verdict now treats "QUG NaN" and "BOTH Stute + Yatchew NaN"
  as the only inconclusive cases. A conclusive Stute + NaN Yatchew (or
  vice versa) proceeds to the fail-to-reject / rejection logic with a
  "(Yatchew NaN - skipped)" suffix on the fail-to-reject verdict.
- all_pass gating: requires QUG conclusive AND at least ONE of Stute /
  Yatchew conclusive, plus no conclusive test rejects. Follows the
  paper's "OR" wording.
- HADPretestReport docstring updated to reflect the paper-faithful
  all_pass semantic.
- REGISTRY Yatchew tie-policy Note extended with a paragraph describing
  the workflow fallback.

Tests:
- Split TestComposeVerdictLogic.test_nan_in_any_test_inconclusive into
  test_qug_nan_is_inconclusive + test_both_linearity_nan_is_inconclusive,
  plus two new tests covering "Yatchew NaN alone does not force
  inconclusive" and "Stute NaN alone does not force inconclusive".
- New TestCompositeWorkflow.test_workflow_handles_tied_zero_doses_via_stute_fallback
  regression: 60-unit panel with 20 never-treated (d=0 ties) plus 40
  treated with continuous doses. Yatchew NaN, Stute + QUG conclusive,
  composite verdict must be conclusive (not "inconclusive") with a
  "Yatchew NaN - skipped" note and all_pass=True.

Test count: 58 -> 62.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 85a43f4b3d5c9c686cd6155fc35dba732f51edbc


Overall Assessment

Blocker.

The prior re-review findings around tied-dose fallback, sign-aware sigma4_W == 0, and conclusive all_pass gating look addressed, but there is still one silent numerical-correctness bug in the new exact-linear short-circuit. Static review only: this environment does not have numpy, pandas, scipy, or pytest, so I could not execute the new test suite.

Executive Summary

Methodology

  • Severity P0. Impact: the exact-linear short-circuit in both linearity tests is not actually the pure relative sum(eps^2) <= tol * sum((dy - dybar)^2) rule described in the registry. Because the code uses max(dy_centered_sq, 1.0), the branch stops being scale-invariant once centered_TSS < 1: for any noisy dataset, multiplying dy by a small enough positive constant will eventually satisfy the guard and silently convert a non-exact sample into stute_test(...).p_value = 1.0 and yatchew_hr_test(...).t_stat_hr = -inf, p_value = 1.0. That is wrong statistical output with no warning, and it is also an undocumented mismatch with the Phase 3 registry note. Concrete fix: split the zero-TSS / constant-dy case into its own branch, and otherwise use a purely relative criterion without the hard 1.0 floor in both functions. diff_diff/had_pretests.py:928 diff_diff/had_pretests.py:1122 docs/methodology/REGISTRY.md:2347

No other unmitigated methodology findings in the diff. The tied-dose workflow fallback, sign-aware infinite-limit handling, and conclusive-only all_pass logic now line up with the documented Phase 3 note. diff_diff/had_pretests.py:617 diff_diff/had_pretests.py:1152 diff_diff/had_pretests.py:1311 docs/methodology/REGISTRY.md:2346

Code Quality

No findings.

Performance

  • Severity P3 informational. Impact: stute_test() still uses the literal per-bootstrap OLS refit instead of the Appendix D projector form, so large-G runs remain slower than necessary; this is explicitly documented and tracked, not a defect. Concrete fix: none required for approval of this PR. TODO.md:99 docs/methodology/REGISTRY.md:2344

Maintainability

No findings beyond the documentation drift noted below.

Tech Debt

  • Severity P3 informational. Impact: the deferred Equation 18 joint pre-trends step, tighter R parity, optional bandwidth work, and multi-period workflow dispatch are all tracked in TODO.md, so these are properly mitigated deferrals under the repo rules. Concrete fix: none required for this PR. TODO.md:98 TODO.md:100 TODO.md:101 TODO.md:102

Security

No findings.

Documentation/Tests

  • Severity P2. Impact: the new regression coverage pins additive translation invariance of the exact-linear guard, but it does not pin multiplicative rescaling. That left the unit-scale bug above untested. Concrete fix: add one Stute and one Yatchew regression that rescale the same noisy dy by a very small positive constant (for example 1e-12) and assert that the exact-linear branch does not fire and the decision matches the unscaled input. tests/test_had_pretests.py:281 tests/test_had_pretests.py:392
  • Severity P3 informational. Impact: the verdict docs are still out of sync with the implemented tied-dose fallback. HADPretestReport and the registry checklist text still describe “any NaN p-value => inconclusive,” while _compose_verdict() now only treats QUG NaN or both linearity tests NaN as inconclusive. Concrete fix: align the docstring and registry checklist wording with _compose_verdict(). diff_diff/had_pretests.py:394 diff_diff/had_pretests.py:617 docs/methodology/REGISTRY.md:2351

Path to Approval

  1. Remove the hard 1.0 floor from the exact-linear guards in both stute_test() and yatchew_hr_test(), handling the zero-centered-TSS case separately so the branch remains unit-scale invariant.
  2. Add regression tests that rescale a noisy dy by a very small positive constant and assert that both linearity tests preserve the unscaled decision instead of short-circuiting to the exact-linear output.

P0 — Exact-linear shortcut not scale-invariant:
R2's fix changed the denominator from sum(dy^2) to centered TSS, which
made the guard TRANSLATION-invariant. But the code used
`max(centered_TSS, 1.0)` as a defensive floor, which BROKE scale
invariance: multiplying dy by a small constant (e.g. dy * 1e-12) made
centered_TSS ~ 1e-24 but the floor held the threshold at 1.0, firing
the shortcut on noisy data that should NOT trigger it.

Fix: remove the `max(..., 1.0)` floor and handle the `centered_TSS == 0`
case (constant dy, trivially linear) in a dedicated branch above the
relative comparison. The remaining path is purely
`sum(eps^2) <= tol * sum(dy_centered^2)`, which is both translation- and
scale-invariant by construction (R^2 is dimensionless).

Applied to both `stute_test` and `yatchew_hr_test`. Module-level
constant docstring updated to document the invariance and the
zero-TSS split.
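The invariance failure and the fix can be demonstrated side by side (a sketch; the TOL value, helper name, and use_floor switch are illustrative, not the module's actual constants):

```python
import numpy as np

TOL = 1e-12  # illustrative tolerance, not the module constant

def exact_linear_shortcut_fires(d, dy, use_floor):
    """Sketch of the guard; use_floor=True reproduces the buggy
    max(centered_TSS, 1.0) variant, False the fixed relative rule."""
    slope, intercept = np.polyfit(d, dy, 1)
    ssr = np.sum((dy - (slope * d + intercept)) ** 2)
    tss = np.sum((dy - dy.mean()) ** 2)
    if tss == 0.0:            # constant dy: trivially linear, own branch
        return True
    denom = max(tss, 1.0) if use_floor else tss
    return bool(ssr <= TOL * denom)

rng = np.random.default_rng(0)
d = np.linspace(0.0, 1.0, 50)
dy = 2.0 * d + rng.normal(scale=0.5, size=50)  # genuinely noisy

fires_raw = exact_linear_shortcut_fires(d, dy, use_floor=True)
fires_scaled_buggy = exact_linear_shortcut_fires(d, dy * 1e-12, use_floor=True)
fires_scaled_fixed = exact_linear_shortcut_fires(d, dy * 1e-12, use_floor=False)
print(fires_raw, fires_scaled_buggy, fires_scaled_fixed)  # False True False
```

Only the floored variant fires on the rescaled noisy sample; the purely relative rule preserves the unscaled decision.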

Regressions:
- TestStuteTest.test_exact_linear_shortcut_does_not_fire_on_rescaled_noisy_data
- TestYatchewHRTest.test_exact_linear_shortcut_does_not_fire_on_rescaled_noisy_data
Both multiply a noisy `dy` by 1e-12 and assert the shortcut does NOT
fire and decisions match the unscaled input.

P3 — Stale verdict docs:
R4's Stute-OR-Yatchew linearity-step fallback changed `_compose_verdict`
semantics, but the `HADPretestReport.verdict` docstring still listed
"Any NaN p-value -> inconclusive" as rule 1, and the REGISTRY Phase 3
checkbox for the workflow still said "`inconclusive - {tests} NaN`
when any statistic is NaN".

Fix: both docs updated to reflect the paper-faithful semantics:
inconclusive only when QUG is NaN or when BOTH Stute and Yatchew are
NaN; otherwise the fail-to-reject verdict carries a
"(Yatchew NaN - skipped)" (or Stute) suffix when one linearity test
was NaN but the other was conclusive. REGISTRY note also documents
the `all_pass` gating (QUG conclusive AND at least one of
Stute/Yatchew conclusive AND no conclusive test rejects).

Test count: 62 -> 64.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: ad19c0208afc2e530894b6bb1247c009ee96becf


Overall Assessment

⚠️ Needs changes

Static review only: this environment does not have numpy or pytest, so I could not execute tests/test_had_pretests.py.

Executive Summary

  • The prior blocker-level issues called out in the last review look addressed: the exact-linear shortcut is now scale-invariant, the sigma4_W == 0 branch is sign-aware, and the workflow now treats a single conclusive linearity test as sufficient.
  • Severity P1: the composite verdict can still report a bare "inconclusive" outcome even when one implemented test has already produced a conclusive rejection, which understates the paper’s four-step logic.
  • Severity P1: stute_test() and yatchew_hr_test() still accept negative dose arrays silently, even though the HAD method scope and the panel workflow both require D_{g,2} >= 0.
  • The deferred Equation 18 pre-trends step, Stute matrix-form optimization, R-parity work, and multi-period workflow dispatch are all properly tracked in TODO.md / REGISTRY.md and are not blockers.
  • The current tests/docs do not pin the two remaining methodology issues above.

Methodology

  • Severity P1. Impact: _compose_verdict() short-circuits to "inconclusive - QUG NaN" or "inconclusive - both Stute and Yatchew linearity tests NaN" before it checks whether any conclusive test already rejected. That means the composite workflow can hide a real assumption failure. Example: with G=2, qug_test([0.2, 0.21]) rejects at alpha=0.05, while stute_test and yatchew_hr_test are both NaN by their sample-size gates; the current logic would still emit only "inconclusive - both Stute and Yatchew linearity tests NaN". The paper’s rule is one-way: TWFE is admissible only if none of the tests rejects, so a conclusive rejection should never be downgraded to a pure inconclusive verdict. Concrete fix: build rejection reasons from all conclusive tests first, and only return a pure inconclusive verdict when there are no conclusive rejections and a required step is unresolved; if desired, append an inconclusive suffix rather than replacing the rejection. Locations: diff_diff/had_pretests.py:L619-L665, docs/methodology/papers/dechaisemartin-2026-review.md:L269-L273, docs/methodology/REGISTRY.md:L2351.
  • Severity P1. Impact: the new public raw-array linearity helpers do not enforce the paper’s nonnegative-dose domain restriction. qug_test() rejects d < 0, and the panel workflow rejects negative post-period doses, but stute_test() and yatchew_hr_test() currently proceed after _validate_1d_numeric() with no d >= 0 check. Because these functions are now exposed as HAD diagnostics from the top-level package, callers can obtain conclusive pretest outputs on inputs that are outside the method’s stated scope. Concrete fix: mirror the front-door negative-dose guard from qug_test() in both linearity helpers immediately after d validation, and document that contract in their docstrings. Locations: docs/methodology/REGISTRY.md:L2148-L2154, diff_diff/had.py:L816-L834, diff_diff/had_pretests.py:L723-L732, diff_diff/had_pretests.py:L882-L918, diff_diff/had_pretests.py:L1079-L1143.

Code Quality

  • No findings.

Performance

  • Severity P3. Impact: stute_test() still uses literal per-bootstrap refits instead of the Appendix D projector form, so large-G runs remain slower than necessary; this is explicitly documented and tracked, so it should not block approval. Concrete fix: none required for this PR. Locations: TODO.md:L99-L102, docs/methodology/REGISTRY.md:L2344.

Maintainability

  • No findings beyond the documentation/test updates needed for the methodology fixes above.

Tech Debt

  • Severity P3. Impact: the deferred Equation 18 joint pre-trends test, tighter R parity, optional Stute bandwidth work, and multi-period workflow dispatch are all tracked in TODO.md, so these are properly mitigated deferrals under the repo rules. Concrete fix: none required for this PR. Locations: TODO.md:L98-L102.

Security

  • No findings.

Documentation/Tests

  • Severity P2. Impact: the current coverage does not pin either P1 issue above. TestComposeVerdictLogic only asserts the pure inconclusive branches and the no-reject/reject bundles; it does not cover mixed cases like QUG reject + both linearity tests NaN or QUG NaN + Stute reject. Likewise, only qug_test() has a negative-dose regression. That means both methodology bugs can survive without a failing test, and the registry/docstrings currently codify the existing verdict precedence. Concrete fix: add regression tests for both mixed-status verdict cases and direct negative-dose raises for stute_test() / yatchew_hr_test(), then update HADPretestReport and REGISTRY.md to match the corrected behavior. Locations: tests/test_had_pretests.py:L168-L172, tests/test_had_pretests.py:L906-L921, diff_diff/had_pretests.py:L404-L417, docs/methodology/REGISTRY.md:L2343-L2351.

Path to Approval

  1. Change _compose_verdict() so a conclusive rejection is never hidden by a generic inconclusive verdict; if another step is unresolved, append that fact instead of replacing the rejection.
  2. Add the same D_{g,2} >= 0 front-door guard to stute_test() and yatchew_hr_test() that already exists in qug_test() and the panel validator.
  3. Add regression tests for QUG reject + both linearity tests NaN, QUG NaN + linearity reject, and negative-dose raises in both linearity helpers, then sync the docstring / REGISTRY.md wording to the fixed semantics.

R6 P1 #1 — _compose_verdict hides conclusive rejections behind "inconclusive":
The R4 logic returned "inconclusive - QUG NaN" or "inconclusive - both
Stute and Yatchew linearity tests NaN" BEFORE checking whether any
conclusive test had rejected. The reviewer's example: G=2 with QUG
rejecting at alpha=0.05 and Stute/Yatchew NaN by sample-size gates —
the workflow emitted "inconclusive - both linearity NaN", hiding a
real assumption failure.

The paper's rule is one-way: TWFE is admissible only if NO test rejects.
A conclusive rejection therefore dominates unresolved-step notes.

Fix: reorder _compose_verdict:
  1. Collect rejections from conclusive tests first. If any, that is the
     primary verdict, and unresolved-step notes are APPENDED via
     "; additional steps unresolved: ..." rather than replacing the
     rejection.
  2. Only when NO conclusive rejection exists AND a required step is
     unresolved do we return a pure "inconclusive - ..." verdict.
  3. Otherwise fall through to the partial-workflow fail-to-reject
     verdict (with "(Yatchew NaN - skipped)" suffix if applicable).
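The reordering in steps 1-3 above can be sketched as (a simplified, hypothetical helper; the strings and signature are illustrative, not the diff_diff _compose_verdict API):

```python
import math

def compose_verdict_sketch(p_qug, p_stute, p_yatchew, alpha=0.05):
    """Rejection-first verdict ordering (sketch of the semantics
    described above, not the shipped implementation)."""
    def is_nan(p):
        return math.isnan(p)

    tests = (("QUG", p_qug), ("Stute", p_stute), ("Yatchew", p_yatchew))
    # 1. Conclusive rejections dominate: never hidden behind "inconclusive".
    rejections = [name for name, p in tests if not is_nan(p) and p < alpha]
    unresolved = is_nan(p_qug) or (is_nan(p_stute) and is_nan(p_yatchew))
    if rejections:
        verdict = "reject: " + ", ".join(rejections)
        return verdict + ("; additional steps unresolved" if unresolved else "")
    # 2. No rejection but a required step unresolved -> pure inconclusive.
    if unresolved:
        return "inconclusive"
    # 3. Partial-workflow fail-to-reject, noting a skipped linearity test.
    skipped = [name for name, p in tests[1:] if is_nan(p)]
    suffix = f" ({skipped[0]} NaN - skipped)" if skipped else ""
    return "fail-to-reject" + suffix
```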

Regressions:
- TestComposeVerdictLogic.test_qug_reject_with_both_linearity_nan_surfaces_rejection
- TestComposeVerdictLogic.test_linearity_reject_with_qug_nan_surfaces_rejection
- TestComposeVerdictLogic.test_all_three_reject_with_qug_nan_keeps_conclusive_rejections

R6 P1 #2 — Raw stute_test / yatchew_hr_test accept negative doses:
qug_test and _validate_had_panel both front-door-reject d < 0 (paper
Section 2 HAD support restriction), but the new linearity helpers only
validated shape + NaN. Negative doses are outside the method's stated
scope and could silently produce conclusive-looking output.

Fix: mirror the negative-dose guard. Both stute_test and yatchew_hr_test
now raise ValueError on any d < 0 with a message directing users to
pre-process or check the dose column. Docstrings updated to list the
new contract in the Raises section.
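The mirrored guard can be sketched as (a hypothetical helper; in the PR the check is inlined in each function right after d validation):

```python
import numpy as np

def require_nonnegative_doses(d, caller="stute_test"):
    """Front-door guard mirroring qug_test()'s behavior as described
    above (sketch, not the diff_diff helper)."""
    d = np.asarray(d, dtype=float)
    if np.any(d < 0):
        raise ValueError(
            f"{caller}: doses must satisfy d >= 0 under the HAD support "
            "restriction; pre-process or check the dose column."
        )
    return d
```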

Regressions:
- TestNegativeDoseGuardsOnLinearityTests.test_stute_negative_dose_raises
- TestNegativeDoseGuardsOnLinearityTests.test_yatchew_negative_dose_raises

R6 P2 — Docstrings / REGISTRY sync:
HADPretestReport.verdict docstring rewritten to describe the new
"rejection-first, unresolved-suffix" priority. REGISTRY Phase 3
workflow checkbox updated to document the conclusive-rejection-not-
hidden semantics plus the non-negative-dose contract.

Test count: 64 -> 69.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: b7e0d8270bbe4db943cb487c49d251cbf1e322d0


Overall Assessment

✅ Looks good

Static review only: this environment lacks numpy and pytest, so I could not execute the new Phase 3 test file here.

Executive Summary

Methodology

Previous methodology P1s are addressed in code, tests, and registry text. Remaining methodology items are documented P3s rather than blockers.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

  • Severity P3. Impact: the remaining Phase 3 debt is explicitly tracked rather than hidden: Equation 18 joint pre-trends, R parity, nprobust-bandwidth follow-up, and multi-period workflow dispatch all have TODO entries. Concrete fix: none required in this PR. Locations: TODO.md:L98-L102.

Security

No findings.

Documentation/Tests

No findings. The re-review gaps are now covered by direct regression tests and matching registry prose. Locations: tests/test_had_pretests.py:L175-L189, tests/test_had_pretests.py:L597-L668, tests/test_had_pretests.py:L920-L1006, docs/methodology/REGISTRY.md:L2343-L2351.

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 22, 2026
@igerber igerber merged commit 5dd8d0e into main Apr 22, 2026
23 of 24 checks passed
@igerber igerber deleted the had-phase-3-pretests branch April 22, 2026 22:33
igerber added a commit that referenced this pull request Apr 24, 2026
Fixes two P1 issues flagged by the CI reviewer on the initial submission
of PR #352.

P1 Methodology — `compute_time_weights_survey` was documented as solving
the WLS-style weighted λ objective
  min Σ_u rw_u·(Σ_t λ_t·Y_u,pre[t] - Y_u,post_mean)² + ζ²·||λ||²
but row-scaled Y by sqrt(rw) and then handed the scaled matrix to
`_sc_weight_fw(intercept=True)`, whose column-centering uses an
UNWEIGHTED mean across controls. That is not the weighted objective
once rw varies, so non-uniform survey bootstrap draws were refitting λ
on the wrong objective and could bias the reported SE.

Fix: weighted-center `Y_time` BEFORE the sqrt(rw) row-scaling, using
`col_weighted_mean = (Y_time * rw).sum(0) / rw.sum()`, and pass
`intercept=False` to the kernel so no additional unweighted centering
happens on the scaled matrix. Both two-pass calls updated.
`compute_sdid_unit_weights_survey` is unchanged — its column-centering
is PER-UNIT (time means within each control column), which is
independent of rw.
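The corrected ordering can be sketched as (assumed shapes: Y_time is (n_units, T0) and rw is a length-n_units vector of survey weights; the helper name is illustrative):

```python
import numpy as np

def weighted_center_then_scale(Y_time, rw):
    """Weighted-center columns BEFORE sqrt(rw) row-scaling (sketch of
    the fix described above, not the diff_diff helper itself)."""
    rw = np.asarray(rw, dtype=float)
    # Weighted column mean -- the centering the WLS objective implies.
    col_wmean = (Y_time * rw[:, None]).sum(axis=0) / rw.sum()
    Y_centered = Y_time - col_wmean
    # After sqrt(rw) row-scaling, the kernel's unweighted SSR equals the
    # rw-weighted SSR of the centered data, so intercept=False is correct.
    return Y_centered * np.sqrt(rw)[:, None]

rng = np.random.default_rng(0)
Y = rng.normal(size=(6, 4))
rw = rng.uniform(0.5, 2.0, size=6)
Z = weighted_center_then_scale(Y, rw)
# Recover the centered matrix and check its rw-weighted column means are 0.
Yc = Z / np.sqrt(rw)[:, None]
print(np.allclose((Yc * rw[:, None]).sum(axis=0), 0.0))
```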

P1 Code Quality — `SurveyDesign(weights=None, strata=..., psu=...)` is
a valid configuration (`SurveyDesign.resolve()` synthesizes ones when
weights is None), but `_extract_unit_survey_weights` indexed
`survey_design.weights` as if it were always a column name, so the
groupby would fail with a KeyError before the bootstrap branch could
run. Fix: `_extract_unit_survey_weights` now short-circuits to a vector
of ones of length `len(unit_order)` when `survey_design.weights is
None`, matching `SurveyDesign.resolve()`'s semantics.
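A sketch of that short-circuit (hypothetical signature; the real helper reads weights from the panel via groupby):

```python
import numpy as np

def extract_unit_survey_weights_sketch(weights_col, unit_order, unit_weights):
    """Return per-unit survey weights, synthesizing ones when the design
    has no weights column (sketch, not the diff_diff helper)."""
    if weights_col is None:
        # Mirrors SurveyDesign.resolve(): weights=None means unit ones.
        return np.ones(len(unit_order))
    return np.asarray([unit_weights[u] for u in unit_order], dtype=float)
```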

Regression tests:
- `test_non_uniform_rw_beats_unweighted_centering_variant`
  (test_weighted_fw.py): reproduces the pre-fix buggy variant (row-
  scale Y by sqrt(rw), then call `_sc_weight_fw(intercept=True)`) and
  asserts the fixed path's weighted SSR is strictly ≤ the buggy
  variant's weighted SSR. If a future revert reintroduces
  intercept=True after the row-scaling, this test fails.
- `test_bootstrap_full_design_without_explicit_weights`
  (test_methodology_sdid.py): `SurveyDesign(strata=..., psu=...)` with
  no explicit `weights` column now succeeds on the bootstrap path;
  survey_metadata populated with n_strata / n_psu.

P3 Documentation:
- `SyntheticDiD.fit()` docstring (survey_design parameter + Raises
  block): replace "bootstrap rejects all survey designs" language with
  the PR #352 support-matrix truth-table (bootstrap ✓ for both pweight-
  only and full design; placebo/jackknife ✓ pweight-only, ✗ full
  design).
- `_placebo_variance_se` fallback-guidance messages (two sites): drop
  the "strata/PSU/FPC not yet supported by any SDID variance method"
  framing; recommend bootstrap for full-design survey fallback,
  jackknife for pweight-only, adding controls as the universal
  fallback.
- `docs/survey-roadmap.md` Current Limitations table: collapse the two
  SDID bootstrap-rejection rows into a single row for placebo/
  jackknife + full design (the bootstrap + full design row no longer
  applies).

Verified: 75 targeted tests pass (test_weighted_fw + TestBootstrapSE +
TestScaleEquivariance + TestCoverageMCArtifact + test_survey_phase5).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026
…path

CI review flagged a silent-failure P0: when a bootstrap draw's treated
units all had survey weight 0 under a pweight-only survey design, the
code fell through to ``np.mean(Y_boot_pre_t, axis=1)`` (unweighted
mean). That silently dropped survey weighting for that draw while the
fit-time ATT uses the survey-weighted treated mean — corrupting the
bootstrap distribution used for SE, p-value, and CI. Reachable with
any `SurveyDesign(weights=...)` input where at least one treated unit
has pweight 0 (a valid configuration per
``SurveyDesign.resolve()``'s spec) and the bootstrap resample picks
only those units.

Fix: extend the Rao-Wu degenerate-retry in ``_bootstrap_se`` to the
pweight-only branch as well. When ANY survey path is active and
``rw_treated_draw.sum() == 0`` OR ``rw_control_draw.sum() == 0``, the
draw is retried. The pre-fix behavior (unweighted treated mean
fallback) is unreachable from any code path now.
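The extended retry can be sketched as (a hypothetical helper; the shipped logic is inlined in _bootstrap_se):

```python
import numpy as np

def resample_with_retry(rng, rw, is_treated, max_retries=1000):
    """Redraw bootstrap indices until both sides carry positive survey
    mass (sketch of the retry described above)."""
    for _ in range(max_retries):
        idx = rng.choice(len(rw), size=len(rw), replace=True)
        rw_draw, treated_draw = rw[idx], is_treated[idx]
        # Retry whenever either side of the draw has zero survey mass;
        # the old unweighted-mean fallback is never reached.
        if rw_draw[treated_draw].sum() > 0 and rw_draw[~treated_draw].sum() > 0:
            return idx
    raise RuntimeError("no non-degenerate bootstrap draw found")
```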

Regression: ``test_bootstrap_pweight_only_retries_zero_treated_mass_draws``
in ``tests/test_methodology_sdid.py::TestBootstrapSE``. Constructs a
panel with 3 treated units where one has `wt=0`; bootstrap at B=100
still returns finite SE > 0. The test would fail if the code reverted
to the unweighted-mean fallback because zero-mass draws would either
produce degenerate SDiD estimators (NaN τ → gets filtered out by the
finite-τ gate, possibly leaving no valid draws) or produce
systematically different τ values that inflate the SE.

P3 (same review) also addressed:
- ``diff_diff/synthetic_did.py`` replicate-weight rejection message
  (line ~303) now reflects the PR #352 support matrix instead of the
  pre-PR #352 "only pweight-only placebo/jackknife" framing.
  Analytical survey designs are now fully supported per the truth
  table; only replicate-weight variance remains unsupported.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026
…lice

Two P3s from R3; PR was already ✅ Looks good — these are close-out
polish.

P3 docs/tests — secondary surfaces described the full-design path as
"Rao-Wu rescaled bootstrap" but only REGISTRY.md surfaced the material
caveat that SDID still uses unit-level pairs-bootstrap resampling
(``boot_idx = rng.choice(n_total)``) and then applies Rao-Wu rescaled
weights on top — a hybrid composition, not a standalone Rao-Wu
bootstrap. Update survey-theory.md (splits SunAbraham/TROP's standalone
Rao-Wu bullet from SDID's hybrid bullet) and CHANGELOG.md's PR #352
Added entry to use the hybrid-composition wording mirroring REGISTRY.

P3 tests — the methodology-critical ``boot_idx`` × ``generate_rao_wu_weights``
interaction was only guarded by the slow coverage MC. Add
``test_bootstrap_full_design_rao_wu_boot_idx_slice`` (in
``TestBootstrapSE``) which monkeypatches ``generate_rao_wu_weights`` to
return a known vector of distinct per-unit values
(``arange(1, n_total+1)``), captures the ``rw_control_draw`` vectors
fed into the weighted FW helper via a capturing wrapper on
``compute_sdid_unit_weights_survey``, and asserts every captured vector
lies within ``known_rw[:n_control]`` (positions 1..n_control). This
catches two bug classes:
- slice-order regression: if someone swaps rw-then-slice for
  slice-then-rw, the captured vectors would include values from the
  treated slice ``known_rw[n_control:]`` and the assertion fires.
- rw-drift regression: if the Rao-Wu call site bypasses
  ``generate_rao_wu_weights`` (e.g., a refactor silently uses the
  pweight-only branch for full-design fits), the captured vector would
  be the user's w_control (all 1.0 in this test) instead of the known
  Rao-Wu output.

Verified: 294 targeted tests pass across test_methodology_sdid /
test_survey_phase5 / test_weighted_fw / test_guides / test_rust_backend.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026


Three places in REGISTRY.md still described the pre-PR-#352 behavior
where SDID survey bootstrap raised NotImplementedError, contradicting
the new support matrix and survey-composition note landed earlier in
this PR:

- §SyntheticDiD bootstrap bullet (paper-faithful refit text): the
  "Composed with any survey design ... raises NotImplementedError"
  trailer is updated to describe weighted-FW dispatch under the
  survey + bootstrap composition note.
- §SyntheticDiD requirements checklist bullet for "Bootstrap: paper-
  faithful Algorithm 2 step 2": "Survey designs raise
  NotImplementedError" trailer is updated to describe the hybrid
  pairs-bootstrap + Rao-Wu rescaling composition.
- §Rao-Wu Rescaled Bootstrap "intentionally excluded" note: rewritten
  to add SDID to the list via its hybrid composition, with the
  precise distinction from standalone Rao-Wu (SunAbraham / TROP) and
  a pointer to §SyntheticDiD for the objective form and argmin-set
  caveat.

Remaining NotImplementedError mentions in REGISTRY.md §SyntheticDiD
(L1554-L1555) correctly describe the placebo / jackknife + strata/PSU/
FPC methodology gap that is out of scope for this PR and tracked in
TODO.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026
Foundation for restoring SDID survey-bootstrap support (PR #352, follow-up
to #351 which front-door rejected all survey designs). This commit adds the
weighted-FW kernel + Python wrappers; the bootstrap integration lands in
the next commit.

Rust (rust/src/weights.rs, rust/src/lib.rs):
- New `sc_weight_fw_gram_weighted` and `sc_weight_fw_standard_weighted`
  loop variants. Identical to the unweighted loops except for the
  regularization term: `half_grad[j]` picks up `eta*reg_w[j]*lam[j]` in
  place of `eta*lam[j]`, and the FW step-size denominator uses the
  diag(reg_w)-weighted simplex direction norm
  `Σ_j reg_w[j]*d[j]²` (which simplifies to
  `Σ_j reg_w[j]*lam[j]² + reg_w[i] - 2*reg_w[i]*lam[i]` for d = e_i - lam).
- New `sc_weight_fw_weighted_internal` dispatcher that delegates to the
  unweighted internal when reg_weights is None (preserves the legacy
  numeric contract for any future caller that wants the generic shape).
- Two new pyfunctions: `sc_weight_fw_weighted` and
  `sc_weight_fw_weighted_with_convergence`. Same call shape as the
  existing unweighted siblings plus a trailing `reg_weights` kwarg.
  Registered in lib.rs.
- 3 new Rust unit tests in rust/src/weights.rs:
    * test_weighted_fw_reg_weights_none_delegates — bit-identity at
      rel=1e-14 against the unweighted internal.
    * test_weighted_fw_uniform_reg_weights_matches_unweighted — uniform
      rw=1 collapses to uniform regularization (rel=1e-12, allowing for
      ULP-scale drift from different float reduction orders).
    * test_weighted_fw_simplex_invariants — for arbitrary positive rw
      and both gram (T0<N) and standard (T0>=N) paths, returned ω sums to
      1 and is non-negative.
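The step-size denominator simplification quoted above can be sanity-checked numerically; a minimal numpy sketch (illustrative only, not the Rust loop):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
reg_w = rng.uniform(0.1, 3.0, n)      # arbitrary positive regularization weights
lam = rng.dirichlet(np.ones(n))       # a point on the simplex
i = 2                                 # FW vertex index
d = -lam.copy()
d[i] += 1.0                           # simplex direction d = e_i - lam

direct = np.sum(reg_w * d ** 2)
simplified = np.sum(reg_w * lam ** 2) + reg_w[i] - 2.0 * reg_w[i] * lam[i]
assert np.isclose(direct, simplified, rtol=1e-12)
```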

Python (diff_diff/utils.py, diff_diff/_backend.py):
- Export _rust_sc_weight_fw_weighted and _with_convergence from _backend
  (mirrors the shape added for _rust_sc_weight_fw_with_convergence in
  PR #351 c0d089b).
- Extend `_sc_weight_fw` and `_sc_weight_fw_numpy` with a
  `reg_weights: Optional[np.ndarray] = None` kwarg. When set on the Rust
  path, dispatches to the new weighted pyfunctions; on the pure-Python
  path, runs a weighted FW loop mirroring the Rust derivation.
- New helper `compute_sdid_unit_weights_survey(Y_pre_control,
  Y_pre_treated_mean, rw_control, ...)`: column-scales Y_pre_control by
  rw_control and passes rw_control as reg_weights so the FW solves the
  unit-weight survey-bootstrap objective
    min_{ω simplex} Σ_t (Σ_i rw_i·ω_i·Y_i,pre[t] - treated_pre[t])² +
                    ζ²·Σ_i rw_i·ω_i²
  Two-pass sparsify-refit structure mirrors compute_sdid_unit_weights.
  Returns ω on the standard simplex (caller composes ω_eff downstream).
- New helper `compute_time_weights_survey(Y_pre_control, Y_post_control,
  rw_control, ...)`: row-scales Y_time by sqrt(rw_control) and passes
  no reg_weights (uniform reg on λ — λ is per-period, rw is per-control,
  no alignment for per-λ weighting). Two-pass structure unchanged.
- Both new helpers expose `return_convergence=True` returning the AND of
  the two pass convergence flags, mirroring the contract added in
  PR #351 c0d089b.
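As a rough illustration of the unit-weight survey objective above, a toy Frank-Wolfe loop over the simplex (pure numpy; fixed 2/(k+2) steps, no line search, and none of the kernel's two-pass sparsify-refit structure; all names here are hypothetical):

```python
import numpy as np

def weighted_fw_simplex(Y, b, rw, zeta, n_iter=500):
    """Toy Frank-Wolfe for
        min_{omega on simplex}  sum_t (sum_i rw_i*omega_i*Y[i,t] - b_t)^2
                                + zeta^2 * sum_i rw_i*omega_i^2
    Illustrative sketch only, not the library kernel."""
    n = Y.shape[0]
    omega = np.full(n, 1.0 / n)
    for k in range(n_iter):
        resid = (rw * omega) @ Y - b                    # shape (T,)
        grad = 2.0 * rw * (Y @ resid) + 2.0 * zeta ** 2 * rw * omega
        i = int(np.argmin(grad))                        # best FW vertex e_i
        gamma = 2.0 / (k + 2.0)                         # classic FW step
        omega = (1.0 - gamma) * omega                   # convex combination
        omega[i] += gamma                               # ... stays on simplex
    return omega

rng = np.random.default_rng(1)
Y = rng.normal(size=(5, 8))                 # controls x pre-periods
b = rng.normal(size=8)                      # treated pre-period means
rw = rng.uniform(0.2, 2.0, 5)               # per-control survey weights
omega = weighted_fw_simplex(Y, b, rw, zeta=0.5)
assert omega.min() >= 0.0 and np.isclose(omega.sum(), 1.0)
```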

Tests (tests/test_weighted_fw.py — new, 15 tests):
- _sc_weight_fw weighted-reg path: reg_weights=None matches unweighted at
  bit-identity; uniform reg matches unweighted at rel=1e-12;
  Rust/numpy parity at rel=1e-9; simplex invariants under arbitrary rw;
  return_convergence tuple shape.
- compute_sdid_unit_weights_survey: uniform-rw equivalence to unweighted
  helper, simplex invariants under arbitrary rw, shape-mismatch raises,
  return_convergence AND.
- compute_time_weights_survey: same coverage matrix, plus a zero-rw
  subset test (Rao-Wu-style undrawn PSU yields valid simplex λ).
- Backend parity: pure-Python vs Rust weighted-helper output at rel=1e-7
  for both unit and time helpers (monkeypatches HAS_RUST_BACKEND).

ABI preservation: existing unweighted callers of _sc_weight_fw,
compute_sdid_unit_weights, compute_time_weights are unaffected — the new
kwarg defaults to None and dispatches to the legacy code path. The
bit-identity check on TestScaleEquivariance::test_baseline_parity_small
_scale[bootstrap] still passes at rel=1e-14 (verified in the next commit
when the bootstrap integration lands).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026
…sition

PR #352 restores the SDID survey-bootstrap capability that PR #351 front-
door rejected as a known regression. Pweight-only and full-design surveys
now both succeed; placebo / jackknife continue to reject full designs (a
separate methodology gap tracked in TODO.md).

`diff_diff/synthetic_did.py::fit` (guards):
- Replace the unconditional strata/PSU/FPC NotImpl guard with a method-
  gated version that fires only for placebo / jackknife. Rationale +
  truth-table in REGISTRY.md §SyntheticDiD survey-support matrix:

      method     pweight-only      strata/PSU/FPC
      bootstrap  ✓ (this PR)       ✓ Rao-Wu (this PR)
      placebo    ✓ unchanged       ✗ NotImpl (separate gap)
      jackknife  ✓ unchanged       ✗ NotImpl (separate gap)

- Delete the unconditional `bootstrap + any-survey` guard added in #351.
  Keep the `weight_type != "pweight"` validation (fweight/aweight still
  rejected).

`diff_diff/synthetic_did.py::fit` (survey resolution):
- After validating the per-unit survey weights (`w_treated`, `w_control`),
  also collapse the observation-level `resolved_survey` to a unit-level
  view via `collapse_survey_to_unit_level(...)` ordered as
  `[*control_units, *treated_units]`. The resulting
  `resolved_survey_unit` is what `_bootstrap_se` slices via
  `boot_rw[:n_control]` / `boot_rw[n_control:]` per Rao-Wu draw.

`diff_diff/synthetic_did.py::fit` (dispatcher):
- Branch the bootstrap call on whether the design is pweight-only or
  full design (strata/PSU/FPC). Pass `w_control`/`w_treated` for
  pweight-only, `resolved_survey=resolved_survey_unit` for full design,
  None/None for non-survey.

`diff_diff/synthetic_did.py::_bootstrap_se`:
- New kwargs: `w_control`, `w_treated`, `resolved_survey` (all keyword-
  only, default None — preserves the legacy signature).
- Single-PSU short-circuit: unstratified survey with <2 PSUs returns
  (NaN, []) since the bootstrap distribution is unidentified
  (resampling one PSU yields the same subset every draw). Recovered from
  the pre-PR-#351 fixed-weight Rao-Wu branch (commit 91082e5).
- Per-draw Rao-Wu rescaling for full designs:
  ``rw = generate_rao_wu_weights(resolved_survey, rng)`` sliced over the
  resampled units. Pweight-only path uses ``rw = w_control[boot_idx]``
  (constant per draw, no rescaling).
- Survey-weighted treated-unit means: ``np.average(..., weights=rw_treated_draw)``
  when survey weights are present.
- Warm-start: the simplex init scales by rw before sum_normalize when on
  the survey path, matching the per-draw weighted-FW geometry.
- Per-draw FW dispatch: survey paths call the new
  ``compute_sdid_unit_weights_survey`` / ``compute_time_weights_survey``
  helpers (PR #352 commit 1) which run the weighted-FW kernel; non-
  survey paths continue to call the unweighted helpers (bit-identity
  preserved on the non-survey refit path).
- Post-FW composition: ``ω_eff = rw·ω / Σ(rw·ω)`` for the SDID
  estimator (which expects simplex weights). Degenerate-retry if
  ``Σ(rw·ω) <= 0`` (all mass on rw=0 controls).
- Aggregate FW non-convergence warning: tally is the AND of the two
  helpers' convergence flags per draw, fires above 5% (PR #351 c0d089b
  shape preserved, no copy change).
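A schematic of the post-FW ω_eff composition and degenerate-retry check (hypothetical helper name; the real logic is inline in ``_bootstrap_se`` and ``generate_rao_wu_weights``'s internals are not reproduced):

```python
import numpy as np

def compose_effective_weights(rw, omega):
    """omega_eff = rw*omega / sum(rw*omega); None signals a degenerate
    draw (all simplex mass on rw == 0 controls) that the caller retries
    with a fresh boot_idx / Rao-Wu draw. Illustrative sketch only."""
    mass = rw * omega
    total = mass.sum()
    if total <= 0.0:
        return None
    return mass / total

rw = np.array([0.0, 0.0, 2.0, 1.0])     # e.g. a Rao-Wu draw zeroing two units
omega = np.full(4, 0.25)                # FW output on the standard simplex
omega_eff = compose_effective_weights(rw, omega)
assert np.isclose(omega_eff.sum(), 1.0) and omega_eff.min() >= 0.0

assert compose_effective_weights(np.zeros(4), omega) is None  # retried draw
```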

Tests:
- ``tests/test_survey_phase5.py``: rewrite three PR #351 raises-tests as
  succeeds-tests with explicit SE assertions —
    * ``test_full_design_bootstrap_succeeds`` (was ``_raises``):
      finite SE, populated survey_metadata.n_strata/n_psu, summary()
      includes Survey Design + Bootstrap replications blocks.
    * ``test_bootstrap_with_pweight_only_succeeds`` (was ``_raises``):
      finite SE, variance_method preserved (cross-surface guard).
    * New ``test_bootstrap_full_design_se_differs_from_pweight_only``
      resurrects the PR #351 R3-deleted differs-from contract: ATT
      matches between paths (both compose ω_eff post-fit) but SE
      differs (Rao-Wu adds PSU clustering variance).
- ``tests/test_methodology_sdid.py::TestBootstrapSE``: rewrite two PR #351
  raises-tests as succeeds-tests, plus add the
  ``test_bootstrap_single_psu_returns_nan`` short-circuit regression.

Verified: 308 tests pass across test_methodology_sdid /
test_business_report SDID subset / test_rust_backend / test_survey_phase5
/ test_weighted_fw / test_guides.

Bit-identity check: the non-survey refit path goes through the
unweighted helpers (no weighted-FW dispatch), so
``TestScaleEquivariance::test_baseline_parity_small_scale[bootstrap]``
remains at rel=1e-14 — verified passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026
Capstone of PR #352. Validates the new weighted-FW + Rao-Wu bootstrap
composition and propagates the landed capability across the
documentation surfaces.

Coverage MC harness (benchmarks/python/coverage_sdid.py):
- Add ``stratified_survey`` as a 4th DGP in ``ALL_DGPS``. Uses
  ``generate_survey_did_data`` to produce an N=40 (strata=2, PSU=2/
  stratum) null-treatment panel with moderate weight variation and
  modest ICC (``psu_re_sd=1.5``). Cohort 7 → post = 7..11 (5 post
  periods). Converts per-observation ``treated`` to a unit-level
  ever-treated indicator (SDID's block-treatment requirement).
- Extend ``DGPSpec`` with an optional ``survey_design_factory``
  callable that returns ``(SurveyDesign, supported_methods_tuple)``.
  For ``stratified_survey``: bootstrap only — placebo / jackknife
  reject strata/PSU/FPC at fit-time, so the harness skips them
  rather than catching the NotImplementedError inside ``_fit_one``.
- ``_fit_one`` gains an optional ``survey_design`` kwarg routed
  through ``SyntheticDiD.fit(survey_design=)``. ``_run_dgp`` calls
  the factory once per seed (DataFrame contents don't affect
  columns) and gates methods on the supported set.

Regenerated ``benchmarks/data/sdid_coverage.json`` via
``python benchmarks/python/coverage_sdid.py --n-seeds 500 --n-bootstrap
200``. Total wall-clock 2421 s (~40 min on M-series Mac, Rust backend);
aer63 remains the long tail at 2237 s, stratified_survey adds only
33 s.

Calibration gate (plan §2.7): ``stratified_survey × bootstrap`` at
α=0.05 returns 0.042 (500 seeds × B=200), inside the calibration
band [0.02, 0.10]. ``mean SE / true SD = 1.25`` indicates the
bootstrap is slightly conservative (overestimates empirical
sampling SD by ~25%) — the safer direction under Rao-Wu rescaling
with only 4 PSUs total. Validates the weighted-FW + Rao-Wu
composition end-to-end.
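For orientation, the two calibration metrics quoted above reduce to simple aggregates over per-seed MC outputs; a schematic computation on synthetic arrays (not the harness code, and the numbers are made up):

```python
import numpy as np

# Hypothetical per-seed outputs under a null DGP: estimated ATT and SE.
rng = np.random.default_rng(3)
n_seeds = 500
att_hat = rng.normal(0.0, 1.0, n_seeds)           # true ATT = 0 under the null
se_hat = np.abs(rng.normal(1.1, 0.05, n_seeds))   # slightly conservative SEs

z = att_hat / se_hat
rejection_rate = np.mean(np.abs(z) > 1.96)        # calibration gate at alpha=0.05
se_to_sd = se_hat.mean() / att_hat.std(ddof=1)    # > 1 => conservative bootstrap

assert 0.0 <= rejection_rate <= 1.0 and se_to_sd > 0.0
```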

REGISTRY.md §SyntheticDiD:
- Add ``stratified_survey`` row to the coverage MC table and a
  paragraph under it documenting the calibration verdict, the
  conservatism direction, and why placebo/jackknife rows are NaN.
- Replace the survey-support bullet with a truth-table matrix (PR
  #352 shape); add a ``Note (survey + bootstrap composition)``
  documenting the weighted-FW objective (unit and time forms), the
  ω_eff composition, the argmin-set caveat, the per-draw rw
  dispatch (pweight-only vs Rao-Wu), and the single-PSU
  short-circuit.
- Update the ``Note (default variance_method deviation from R)`` to
  drop the "bootstrap rejects surveys" framing (no longer accurate).
- Update the ``Note (coverage Monte Carlo calibration)`` header to
  say "4 representative null-panel DGPs" and flag stratified_survey
  as bootstrap-only.

User-facing docs:
- ``docs/methodology/survey-theory.md``: restore SDID in the Rao-Wu
  Rescaled Bootstrap list; describe the weighted-FW composition.
- ``docs/survey-roadmap.md``: Phase 5 SDID row updated to reflect
  full-design bootstrap support via PR #352; Phase 6 Rao-Wu bullet
  restores SDID.
- ``docs/tutorials/16_survey_did.ipynb`` cell-35: support matrix
  table row for SyntheticDiD switches from "pweight only (placebo/
  jackknife)" to "bootstrap only (PR #352) for strata/PSU/FPC";
  "Note on SyntheticDiD" block rewritten for the landed contract.
- ``diff_diff/synthetic_did.py`` ``__init__`` docstring: bootstrap
  bullet now describes survey support and the ω_eff composition.
- ``diff_diff/guides/llms-full.txt``: survey-aware bootstrap bullet
  includes SDID in the Rao-Wu list with the weighted-FW formula.

CHANGELOG.md:
- Retain the PR #351 regression Changed entry but annotate it as
  "restored in PR #352"; add new Added/Changed PR #352 entries
  documenting the weighted-FW kernel, survey helpers, _bootstrap_se
  Rao-Wu composition, and the new coverage MC row.

TODO.md:
- Row 103 (SDID + survey designs) → closed by PR #352; replaced
  with a narrower follow-up for placebo/jackknife + strata/PSU/FPC
  (Low priority, no concrete sketch yet).

Tests:
- ``TestCoverageMCArtifact`` extended: 4 DGPs asserted (including
  ``stratified_survey``); new explicit assertions that the
  stratified_survey bootstrap row has ≥100 successful fits and
  α=0.05 rejection ∈ [0.02, 0.10]; placebo/jackknife rows
  n_successful_fits == 0 (strata/PSU/FPC rejection contract).

Verified: TestCoverageMCArtifact passes against the regenerated
artifact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026
Fixes two P1 issues flagged by the CI reviewer on the initial submission
of PR #352.

P1 Methodology — `compute_time_weights_survey` was documented as solving
the WLS-style weighted λ objective
  min Σ_u rw_u·(Σ_t λ_t·Y_u,pre[t] - Y_u,post_mean)² + ζ²·||λ||²
but row-scaled Y by sqrt(rw) and then handed the scaled matrix to
`_sc_weight_fw(intercept=True)`, whose column-centering uses an
UNWEIGHTED mean across controls. That is not the weighted objective
once rw varies, so non-uniform survey bootstrap draws were refitting λ
on the wrong objective and could bias the reported SE.

Fix: weighted-center `Y_time` BEFORE the sqrt(rw) row-scaling, using
`col_weighted_mean = (Y_time * rw).sum(0) / rw.sum()`, and pass
`intercept=False` to the kernel so no additional unweighted centering
happens on the scaled matrix. Both two-pass calls updated.
`compute_sdid_unit_weights_survey` is unchanged — its column-centering
is PER-UNIT (time means within each control column), which is
independent of rw.
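The centering-order point can be verified in the unconstrained case (ignoring the ζ² penalty and the simplex constraint): weighted-centering before the sqrt(rw) row-scaling reproduces the WLS-with-intercept λ, by the Frisch-Waugh logic under the rw inner product. A minimal check with hypothetical arrays:

```python
import numpy as np

rng = np.random.default_rng(2)
n_u, n_t = 12, 4
Y = rng.normal(size=(n_u, n_t))          # controls x pre-periods
y = rng.normal(size=n_u)                 # post-period means
rw = rng.uniform(0.3, 2.5, n_u)          # non-uniform survey weights

# Direct WLS with an explicit intercept column
X1 = np.column_stack([np.ones(n_u), Y])
s = np.sqrt(rw)
beta = np.linalg.lstsq(s[:, None] * X1, s * y, rcond=None)[0]
lam_direct = beta[1:]

# Fixed path: weighted-center FIRST, then sqrt(rw) row-scale, no intercept
col_wmean = (Y * rw[:, None]).sum(0) / rw.sum()
y_wmean = (y * rw).sum() / rw.sum()
Yc = s[:, None] * (Y - col_wmean)
yc = s * (y - y_wmean)
lam_centered = np.linalg.lstsq(Yc, yc, rcond=None)[0]

assert np.allclose(lam_direct, lam_centered)
```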

P1 Code Quality — `SurveyDesign(weights=None, strata=..., psu=...)` is
a valid configuration (`SurveyDesign.resolve()` synthesizes ones when
weights is None), but `_extract_unit_survey_weights` indexed
`survey_design.weights` as if it were always a column name, so the
groupby would fail with a KeyError before the bootstrap branch could
run. Fix: `_extract_unit_survey_weights` now short-circuits to a vector
of ones of length `len(unit_order)` when `survey_design.weights is
None`, matching `SurveyDesign.resolve()`'s semantics.

Regression tests:
- `test_non_uniform_rw_beats_unweighted_centering_variant`
  (test_weighted_fw.py): reproduces the pre-fix buggy variant (row-
  scale Y by sqrt(rw), then call `_sc_weight_fw(intercept=True)`) and
  asserts the fixed path's weighted SSR is strictly ≤ the buggy
  variant's weighted SSR. If a future revert reintroduces
  intercept=True after the row-scaling, this test fails.
- `test_bootstrap_full_design_without_explicit_weights`
  (test_methodology_sdid.py): `SurveyDesign(strata=..., psu=...)` with
  no explicit `weights` column now succeeds on the bootstrap path;
  survey_metadata populated with n_strata / n_psu.

P3 Documentation:
- `SyntheticDiD.fit()` docstring (survey_design parameter + Raises
  block): replace "bootstrap rejects all survey designs" language with
  the PR #352 support-matrix truth-table (bootstrap ✓ for both pweight-
  only and full design; placebo/jackknife ✓ pweight-only, ✗ full
  design).
- `_placebo_variance_se` fallback-guidance messages (two sites): drop
  the "strata/PSU/FPC not yet supported by any SDID variance method"
  framing; recommend bootstrap for full-design survey fallback,
  jackknife for pweight-only, adding controls as the universal
  fallback.
- `docs/survey-roadmap.md` Current Limitations table: collapse the two
  SDID bootstrap-rejection rows into a single row for placebo/
  jackknife + full design (the bootstrap + full design row no longer
  applies).

Verified: 75 targeted tests pass (test_weighted_fw + TestBootstrapSE +
TestScaleEquivariance + TestCoverageMCArtifact + test_survey_phase5).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026
…path

CI review flagged a silent-failure P0: when a bootstrap draw's treated
units all had survey weight 0 under a pweight-only survey design, the
code fell through to ``np.mean(Y_boot_pre_t, axis=1)`` (unweighted
mean). That silently dropped survey weighting for that draw while the
fit-time ATT uses the survey-weighted treated mean — corrupting the
bootstrap distribution used for SE, p-value, and CI. Reachable with
any `SurveyDesign(weights=...)` input where at least one treated unit
has pweight 0 (a valid configuration per
``SurveyDesign.resolve()``'s spec) and the bootstrap resample picks
only those units.

Fix: extend the Rao-Wu degenerate-retry in ``_bootstrap_se`` to the
pweight-only branch as well. When ANY survey path is active and
``rw_treated_draw.sum() == 0`` OR ``rw_control_draw.sum() == 0``, the
draw is retried. The pre-fix behavior (unweighted treated mean
fallback) is unreachable from any code path now.
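A sketch of the extended retry gate (illustrative function name; the real check is inline in ``_bootstrap_se``):

```python
import numpy as np

def draw_is_degenerate(rw_treated_draw, rw_control_draw):
    """Redraw whenever either side of the resample carries zero total
    survey mass, so the old unweighted-mean fallback can never be
    reached. Illustrative sketch only."""
    return rw_treated_draw.sum() == 0.0 or rw_control_draw.sum() == 0.0

assert draw_is_degenerate(np.zeros(3), np.ones(5))       # zero treated mass
assert draw_is_degenerate(np.ones(3), np.zeros(5))       # zero control mass
assert not draw_is_degenerate(np.array([0.0, 1.0]), np.ones(5))
```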

Regression: ``test_bootstrap_pweight_only_retries_zero_treated_mass_draws``
in ``tests/test_methodology_sdid.py::TestBootstrapSE``. Constructs a
panel with 3 treated units where one has `wt=0`; bootstrap at B=100
still returns finite SE > 0. The test would fail if the code reverted
to the unweighted-mean fallback because zero-mass draws would either
produce degenerate SDiD estimators (NaN τ → gets filtered out by the
finite-τ gate, possibly leaving no valid draws) or produce
systematically different τ values that inflate the SE.

P3 (same review) also addressed:
- ``diff_diff/synthetic_did.py`` replicate-weight rejection message
  (line ~303) now reflects the PR #352 support matrix instead of the
  pre-PR #352 "only pweight-only placebo/jackknife" framing.
  Analytical survey designs are now fully supported per the truth
  table; only replicate-weight variance remains unsupported.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026
…er-fix

Fix CHANGELOG: rename survey-bootstrap PR placeholder #352 to #355
igerber added a commit that referenced this pull request Apr 24, 2026
…placebo, jackknife)

Second commit for the SDID survey-placebo/jackknife PR. Extends the
coverage Monte Carlo artifact with jackknife on the stratified_survey
DGP (bootstrap calibration unchanged); promotes the deferred REGISTRY
§SyntheticDiD gap bullets to two landed Notes; updates user-facing
docs to reflect restored capability.

Coverage MC changes
-------------------
* benchmarks/python/coverage_sdid.py: _stratified_survey_design now
  returns ("bootstrap", "jackknife") on the methods tuple. Placebo is
  omitted because the DGP's cohort packs into a single stratum with 0
  never-treated units — stratified-permutation placebo is structurally
  infeasible on this DGP (raises Case C at fit-time). Module docstring
  explains the exclusion and the jackknife anti-conservatism caveat.
* benchmarks/data/sdid_coverage.json: regenerated stratified_survey
  block at n_seeds=500, n_bootstrap=200. Bootstrap validates near-
  nominal (α=0.05 rejection = 0.058, SE/trueSD = 1.13). Jackknife row
  reports α=0.05 rejection = 0.45, SE/trueSD = 0.46 — documented anti-
  conservatism from the stratified jackknife formula with 2 PSUs per
  stratum (1 effective DoF per stratum, Rust & Rao 1996 limitation).

REGISTRY.md §SyntheticDiD
-------------------------
* Survey support matrix updated: all three variance methods now
  support strata/PSU/FPC (not just bootstrap).
* Two new landed Notes:
  - "Note (survey + placebo composition)": stratified-permutation
    allocator, weighted-FW refit, ω_eff composition, fit-time
    feasibility guards (Case B / Case C), scope note on what is NOT
    randomized (within-stratum PSU axis). Cites Pesarin (2001) /
    Pesarin & Salmaso (2010).
  - "Note (survey + jackknife composition)": PSU-level LOO algorithm,
    explicit stratum-aggregation SE² formula, FPC handling (population-
    count form from survey.py:338-356), fixed-weights rationale,
    degenerate-LOO skip semantics, scope note, known anti-conservatism
    with few PSUs per stratum. Cites Rust & Rao (1996).
* "Allocator asymmetry" paragraph in the survey support matrix
  documents the intentional asymmetry (placebo ignores PSU, jackknife
  respects it) with rationale rooted in each method's role (null-
  distribution test vs design-based variance approximation).
* Coverage MC table adds the stratified_survey × jackknife row with
  anti-conservatism narrative; placebo row explicitly marked N/A-on-
  this-DGP (with pointer to the unit-test coverage).
* Requirements checklist entries updated to describe full-design
  support for placebo and jackknife.
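For orientation, the stratum-aggregation SE² referenced in the jackknife Note follows the standard stratified delete-one-PSU jackknife shape (Rust & Rao 1996); a sketch of that textbook form, with f_h the stratum sampling fraction supplying the FPC (the repo's exact population-count FPC variant lives in survey.py:338-356 and may differ in detail):

```latex
\widehat{\mathrm{SE}}^{2}
  \;=\; \sum_{h=1}^{H} (1 - f_h)\, \frac{n_h - 1}{n_h}
        \sum_{j=1}^{n_h} \left( \hat{\tau}_{(hj)} - \bar{\hat{\tau}}_{h} \right)^{2}
```

where n_h is the number of PSUs in stratum h, \hat{\tau}_{(hj)} the re-estimate with PSU j of stratum h deleted, and \bar{\hat{\tau}}_{h} the stratum mean of those re-estimates. With n_h = 2 each stratum contributes a single squared contrast, which is the one-effective-DoF limitation behind the documented anti-conservatism.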

Docs sweep
----------
* docs/methodology/survey-theory.md: new bullets describing the
  stratified-permutation placebo allocator and the PSU-level LOO
  jackknife, parallel to the existing hybrid-bootstrap bullet.
* docs/tutorials/16_survey_did.ipynb cell 35: support matrix SDID
  row updated from "bootstrap only (PR #352)" to "Full (all three
  variance methods)"; legend amended; "Note on SyntheticDiD" block
  rewritten to describe all three allocators with the jackknife
  few-PSU caveat.
* docs/survey-roadmap.md: Phase 5 matrix row closes the placebo/
  jackknife gap; Phase 6 bullet updated to describe all three
  allocators; Current Limitations table entry removed (only replicate-
  weight limitation remains, merged into one row).
* CHANGELOG.md: "### Added" entry for placebo + jackknife full-design
  support (no new section header — folded into existing Unreleased
  block); "### Changed (PR #355)" tweaked to note the separate
  follow-up for placebo/jackknife.
* TODO.md row 107 deleted (capability gap closed).
* diff_diff/synthetic_did.py __init__ docstring: survey_design
  parameter description rewritten to describe all three methods.
  Placebo fallback-guidance comment updated to remove stale "placebo
  and jackknife reject strata/PSU/FPC" line.
* diff_diff/guides/llms-full.txt: Phase 5 bootstrap bullet updated
  to describe all three survey allocators (UTF-8 fingerprint
  preserved — `D'Haultfœuille` still appears throughout).
* tests/test_methodology_sdid.py::TestCoverageMCArtifact: narrative
  and assertions updated to reflect that placebo=0-fits is expected
  structurally on stratified_survey (documented Case C), while
  jackknife now runs successfully with the known anti-conservatism
  caveat intentionally unasserted at the calibration-gate level.

Verification
------------
* pytest tests/test_survey_phase5.py::TestSDIDSurveyPlaceboFullDesign
  tests/test_survey_phase5.py::TestSDIDSurveyJackknifeFullDesign
  tests/test_survey_phase5.py::TestSyntheticDiDSurvey
  tests/test_methodology_sdid.py::{TestBootstrapSE,TestPlaceboSE,TestJackknifeSE,TestCoverageMCArtifact}
  tests/test_guides.py → 82 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026
…h harness docstring

P3 (Maintainability — survey jackknife still populated _loo_unit_ids):
``fit()`` was unconditionally setting ``_loo_unit_ids`` / ``_loo_roles``
on every jackknife fit, including full-design survey fits where the
underlying replicates are PSU-level and ``get_loo_effects_df()`` now
raises ``NotImplementedError``. Internal / canned guidance keyed off
``_loo_unit_ids is not None`` as the availability check (e.g.,
``practitioner.py``) would still call the accessor on a survey fit and
hit the new raise.

Fix: only populate ``_loo_unit_ids`` / ``_loo_roles`` when
``_loo_granularity == "unit"``; leave them ``None`` on the PSU path so
``_loo_unit_ids is not None`` correctly reports availability.
``_loo_granularity`` is the authoritative accessor gate; the legacy
``_loo_unit_ids`` sentinel now agrees with it.
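A minimal sketch of the tightened gating (attribute names mirror the commit message; the class itself is hypothetical, not the estimator):

```python
class FitState:
    """Illustrates the gate: LOO unit metadata is populated only for
    unit-granularity jackknife fits, left None on the PSU path."""

    def __init__(self, loo_granularity, unit_ids, roles):
        self._loo_granularity = loo_granularity     # authoritative gate
        if loo_granularity == "unit":
            self._loo_unit_ids = unit_ids
            self._loo_roles = roles
        else:                                       # PSU-level survey jackknife
            self._loo_unit_ids = None
            self._loo_roles = None

    def loo_effects_available(self):
        # Legacy sentinel now agrees with the granularity gate.
        return self._loo_unit_ids is not None

unit_fit = FitState("unit", ["a", "b"], ["control", "treated"])
psu_fit = FitState("psu", ["a", "b"], ["control", "treated"])
assert unit_fit.loo_effects_available()
assert not psu_fit.loo_effects_available()
```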

P3 (Documentation — harness docstring stale):
``coverage_sdid.py::_fit_one`` docstring said "fit() routes [survey
designs] through the bootstrap survey path (PR #352) when
method=='bootstrap'" — stale after the placebo + jackknife full-
design paths landed. Rewrote to describe the three method-specific
survey variance paths (weighted-FW + Rao-Wu bootstrap; stratified-
permutation + weighted-FW placebo; PSU-LOO + stratum-aggregation
jackknife) and mention the Case B-D ValueError failure modes alongside
NotImplementedError.

Verification: 94 passed (no behavior change on the gating fix — it's
a state-gating tightening, not a correctness change).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Labels

ready-for-ci Triggers CI test workflows
