
HAD Phase 3: pre-test diagnostics (qug_test, stute_test, yatchew_hr_test, composite workflow)#352

Merged
igerber merged 9 commits into main from had-phase-3-pretests
Apr 22, 2026
Conversation


@igerber igerber commented Apr 22, 2026

Summary

  • Ship paper Section 4 (de Chaisemartin, Ciccia, D'Haultfoeuille, Knau 2026, arXiv:2405.04465v6) pre-test diagnostics for HeterogeneousAdoptionDiD in a new module diff_diff/had_pretests.py.
  • qug_test (Theorem 4), stute_test (Appendix D with Mammen wild bootstrap), yatchew_hr_test (Theorem 7 / Equation 29), and a composite did_had_pretest_workflow that runs all three on a two-period HAD panel.
  • Partial-workflow semantics: Phase 3 ships paper steps 1 (QUG) and 3 (linearity) only; step 2 (Assumption 7 pre-trends test via Equation 18) is deferred to a follow-up patch. The HADPretestReport verdict explicitly flags the Assumption 7 gap when all implemented tests fail-to-reject, so callers do not receive a misleading "TWFE safe" signal.

Methodology references (required if estimator / math changes)

  • Method name(s): QUG null test for support infimum; Cramer-von Mises linearity test with Mammen (1993) wild bootstrap; Yatchew heteroskedasticity-robust variance-ratio linearity test; four-step HAD pre-testing workflow.
  • Paper / source link(s):
    • Primary: de Chaisemartin, Ciccia, D'Haultfoeuille, Knau (2026). arXiv:2405.04465v6. Theorem 4 (QUG), Appendix D + Equation 17 (Stute), Theorem 7 + Equation 29 (Yatchew-HR), Section 4.2-4.3 (composite workflow).
    • Stute, W. (1997). Nonparametric model checks for regression. Annals of Statistics 25, 613-641.
    • Mammen, E. (1993). Bootstrap and wild bootstrap for high-dimensional linear models. Annals of Statistics 21, 255-285.
    • Yatchew, A. (1997). An elementary estimator of the partial linear model. Economics Letters 57, 135-143.
    • R reference implementations: chaisemartin::qug_test, stute_test, yatchew_test.
  • Any intentional deviations from the source (and why):
    • Partial workflow: step 2 (joint cross-horizon Stute test via paper Equation 18 on pre-period placebos) deferred to a Phase 3 follow-up patch. The paper's exact stacked-residual formula is not reproduced in dechaisemartin-2026-review.md and extraction from the paper PDF is out of scope for this PR. Tracked in TODO.md; documented in REGISTRY.md.
    • Stute bootstrap form: literal per-iteration OLS refit matching paper Appendix D Algorithm text. Appendix D vectorized matrix form (precomputing M = I - X(X'X)^{-1}X' once) is functionally identical and ~2x faster; deferred as a performance follow-up to keep reviewer surface narrow. Tracked in TODO.md.
    • No multiple-testing adjustment: paper does not prescribe Bonferroni/Holm; per-test alpha is used as-is. Documented in REGISTRY.md Phase 3 Note block.
    • R-parity scope: QUG tight closed-form parity at atol=1e-12; Stute / Yatchew-HR validated via synthetic-DGP coverage rates rather than tight numerical parity against chaisemartin::stute_test / yatchew_test (bootstrap seed semantics + B differ across numpy/R). Tight parity deferred.

Validation

  • Tests added/updated: tests/test_had_pretests.py (new, 47 tests, 9 test classes):
    • TestQUGTest: closed-form tight parity at atol=1e-12, tie-break NaN, zero-filter warning, asymptotic-size upper bound on Uniform null, single-realization rejection/non-rejection on shifted-Beta / Uniform, invalid-alpha + NaN-input + 2D-input raises.
    • TestStuteTest: single-realization no-reject on linear / reject on quadratic, _cvm_statistic algebraic equivalence to paper's stated form, n_bootstrap<99 raise, mismatched-length raise, NaN-input raise.
    • TestYatchewHRTest: single-realization linear / quadratic, paper-literal sigma2_diff normalizer (divides by 2G, NOT 2(G-1)) at atol=1e-12, invalid-alpha raise.
    • TestCompositeWorkflow: integration tests for (a) all-pass on linear DGP flags Assumption 7 gap, (b) quadratic-plus-shifted-support DGP triggers bundled rejection verdict, (c) workflow seed forwards to Stute only.
    • TestNaNPropagation: constant-dy trivial linearity, QUG-tie propagates to composite verdict, Stute constant-d NaN.
    • TestJSONSerialization: to_dict() JSON-safe for all 4 result classes, HADPretestReport.to_dataframe() schema [test, statistic_name, statistic_value, p_value, reject, alpha, n_obs], per-test single-row frames.
    • TestSeedReproducibility: bitwise reproducibility at seed=42, different seeds produce different p-values.
    • TestSampleSizeGates: QUG G<2 NaN, stute G<10 NaN, yatchew G<3 NaN (all via UserWarning, not ValueError).
    • TestComposeVerdictLogic: 5 priority-order tests on _compose_verdict covering (a) NaN inconclusive, (b) none-reject Assumption 7 gap, (c) only-QUG, (d) only-linearity, (e) all-reject bundled.
    • TestSummary: non-empty summary strings for all 4 result classes.
  • Regression: tests/test_had.py (Phase 1-2) — 214 passed, 2 skipped, no regressions.
  • Backtest / simulation / notebook evidence (if applicable): N/A (no tutorial changes in this PR; Phase 5 integrates practitioner_next_steps + tutorials).

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

igerber and others added 3 commits April 22, 2026 13:35
…est, workflow)

Ships paper Section 4 (de Chaisemartin et al. 2026, arXiv:2405.04465v6) pre-test
diagnostics for the HeterogeneousAdoptionDiD estimator in a new module
`diff_diff/had_pretests.py`:

- `qug_test()` — order-statistic ratio test of H_0: d_lower = 0 (Theorem 4).
  Closed-form T = D_(1)/(D_(2)-D_(1)); asymptotic p-value 1/(1+T) under the
  Exp(1)/Exp(1) limit law; reject if T > 1/alpha - 1. Zero-filter with warning;
  tie-break returns all-NaN inference.
- `stute_test()` — Cramer-von Mises cusum test with Mammen wild bootstrap
  (Appendix D). Statistic S = (1/G^2) sum(cumsum^2); literal per-iteration OLS
  refit matching paper Appendix D Algorithm. n_bootstrap >= 99 front-door
  validated; G < 10 returns NaN; G > 100_000 warns and points to yatchew_hr_test.
- `yatchew_hr_test()` — heteroskedasticity-robust variance-ratio linearity test
  (Theorem 7 / Equation 29). sigma2_diff normalizes by 2G paper-literal (NOT
  2(G-1)); tight hand-computed parity at atol=1e-12.
- `did_had_pretest_workflow()` — composite runs all three on a two-period panel
  and returns HADPretestReport with priority-ordered verdict string.
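
The qug_test closed form above can be sketched in a few lines. This is a minimal illustration under the Exp(1)/Exp(1) limit law; `qug_statistic` is a hypothetical helper name, not the shipped `qug_test` signature, and the guards shown are reduced to the tie-break case:

```python
import numpy as np

def qug_statistic(d, alpha=0.05):
    """Sketch of the closed-form QUG order-statistic test (Theorem 4).

    Assumes at least two positive doses. T = D_(1) / (D_(2) - D_(1));
    asymptotic p-value 1/(1+T); reject when T > 1/alpha - 1 (which is
    equivalent to p < alpha).
    """
    d = np.asarray(d, dtype=float)
    d_nz = d[d > 0]                       # zero-dose rows are filtered
    part = np.partition(d_nz, 1)          # partial selection of two smallest
    d1, d2 = part[0], part[1]
    if d1 == d2:                          # tie-break: inference undefined
        return np.nan, np.nan, False
    t_stat = d1 / (d2 - d1)
    p_value = 1.0 / (1.0 + t_stat)        # Exp(1)/Exp(1) limit law
    reject = t_stat > 1.0 / alpha - 1.0
    return t_stat, p_value, reject
```

The rejection rule and the p-value agree by construction: T > 1/alpha - 1 holds exactly when 1/(1+T) < alpha.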

Deferred to Phase 3 follow-up patches:
- Joint Equation 18 cross-horizon Stute pre-trend test (paper formula needs PDF
  extraction; tracked in TODO.md)
- Appendix D vectorized matrix form for Stute bootstrap (functionally identical,
  ~2x faster; shipped literal-refit to match paper text)
- Tight R-package numerical parity (bootstrap seed semantics differ across numpy/R)

47 tests across 9 classes cover rejection rates, closed-form parity, tie-breaks,
NaN propagation, JSON serialization, seed reproducibility, sample-size gates,
and verdict logic.
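
The stute_test recipe above (OLS residual cusum plus Mammen wild bootstrap with a literal per-iteration refit) can be sketched as follows. This assumes untied doses and omits the shipped front-door guards; the helper name and the +1/+1 p-value convention are illustrative:

```python
import numpy as np

def stute_test_sketch(d, dy, n_bootstrap=999, seed=None):
    """Sketch of the CvM cusum linearity test with Mammen wild bootstrap.

    S = (1/G^2) * sum of squared cumulative sums of OLS residuals,
    residuals ordered by dose. Untied doses assumed (no tie collapse).
    """
    rng = np.random.default_rng(seed)
    order = np.argsort(d)
    d, dy = d[order], dy[order]
    G = d.size
    X = np.column_stack([np.ones(G), d])

    def cvm(y):
        # Literal per-iteration OLS refit (Appendix D algorithm form).
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ coef
        return np.sum(np.cumsum(resid) ** 2) / G**2

    s_obs = cvm(dy)
    coef_hat, *_ = np.linalg.lstsq(X, dy, rcond=None)
    fitted, eps = X @ coef_hat, dy - X @ coef_hat
    # Mammen (1993) two-point weights: mean 0, variance 1, third moment 1.
    a, b = -(np.sqrt(5) - 1) / 2, (np.sqrt(5) + 1) / 2
    p_a = (np.sqrt(5) + 1) / (2 * np.sqrt(5))
    s_boot = np.empty(n_bootstrap)
    for i in range(n_bootstrap):
        v = rng.choice([a, b], size=G, p=[p_a, 1 - p_a])
        s_boot[i] = cvm(fitted + eps * v)
    p_value = (1 + np.sum(s_boot >= s_obs)) / (1 + n_bootstrap)
    return s_obs, p_value
```

Refitting inside every bootstrap iteration matches the literal algorithm text; the precomputed-projection form mentioned above would avoid the repeated lstsq calls.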

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AI review found that did_had_pretest_workflow's fail-to-reject verdict
("TWFE safe under Section 4 assumptions") was a methodology overclaim:
Phase 3 ships steps 1 + 3 of the paper's four-step workflow but not
step 2 (Assumption 7 pre-trends test via Equation 18), so the verdict
could mislead users into believing the full paper-prescribed workflow
had passed.

Fix:
- _compose_verdict fail-to-reject branch now returns
  "QUG and linearity diagnostics fail-to-reject; Assumption 7
  pre-trends test NOT run (paper step 2 deferred to Phase 3 follow-up)"
- HADPretestReport docstring adds a prominent partial-workflow caveat
  on the verdict and all_pass fields.
- did_had_pretest_workflow docstring renames intent from "run the
  paper's workflow" to "run a PARTIAL workflow (steps 1 and 3 only)".
- Tests: renamed + updated to assert the Assumption 7 caveat appears
  in the verdict and that "TWFE safe" does NOT appear.
- REGISTRY.md updated to document the partial-workflow semantics and
  revised verdict string in both the Note block and checkbox row.

Also addresses P2 from the review: replaced positional `None` in the
_aggregate_first_difference call with keyword argument cluster_col=None
plus an inline comment naming the intent.
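
The priority-ordered verdict logic this fix produces can be sketched as follows (hypothetical signature; the verdict strings are abbreviated from the ones quoted in this thread):

```python
import math

def compose_verdict_sketch(results):
    """Sketch of the verdict priority order: NaN inconclusive first,
    then any rejections, then the Assumption 7 caveat on fail-to-reject.

    `results` maps test name -> (p_value, reject); illustrative only.
    """
    nan_tests = [n for n, (p, _) in results.items() if not math.isfinite(p)]
    if nan_tests:
        return "inconclusive - " + ", ".join(nan_tests) + " NaN"
    rejected = [n for n, (_, r) in results.items() if r]
    if rejected:
        return "reject: " + ", ".join(rejected)
    return ("QUG and linearity diagnostics fail-to-reject; Assumption 7 "
            "pre-trends test NOT run (paper step 2 deferred)")
```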

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…string + string concat

Re-review came back ✅ after the P1 fix; remaining findings were all P3
documentation/cleanup:

- Module docstring now explicitly frames Phase 3 as shipping steps 1 + 3
  of the paper's four-step workflow (not "all three fail-to-reject ->
  TWFE valid", which could read as unconditional certification).
- `_compose_verdict` docstring synchronized with the partial-workflow
  verdict text (previously said "TWFE safe under Section 4 assumptions";
  now says "QUG and linearity diagnostics fail-to-reject; Assumption 7
  pre-trends test NOT run").
- Collapsed adjacent-string-literal in `yatchew_hr_test` sigma4_W<=0
  warning into a single string.

No behavioral changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Overall Assessment

⚠️ Needs changes

Three unmitigated P1 issues in the new HAD pretest path would make me block approval.

Executive Summary

  • stute_test is not tie-safe: with duplicated doses, the implemented CvM statistic becomes row-order-dependent and no longer matches the registry/paper definition.
  • Exact linear null cases are routed to NaN/inconclusive by the new degeneracy branches, even though Assumption 8 holds exactly there.
  • HADPretestReport.all_pass can be True while verdict says the report is inconclusive, so downstream callers can silently mis-gate on the boolean.
  • The exported raw qug_test API still needs a negative-dose guard; today it can silently drop invalid negatives and report them as excluded zeros.
  • The documented Phase 3 deferrals in TODO.md:L98-L102 and REGISTRY.md are properly tracked and are not blockers.
  • Static review only: I could not execute the tests in this environment because the reviewer runtime is missing Python test dependencies (numpy, pytest).

Methodology

  • Severity P1. Impact: Duplicate dose values are handled incorrectly in stute_test. The registry defines c_G(d)=G^{-1/2}\sum 1{D\le d}\hat\varepsilon and S=(1/G)\sum c_G^2(D_g) in docs/methodology/REGISTRY.md:L2280-L2288. The implementation instead sorts once and feeds observation-by-observation cumulative sums into _cvm_statistic in diff_diff/had_pretests.py:L525-L547 and diff_diff/had_pretests.py:L850-L860. When d has ties, all observations in the tie block should use the same post-tie cumulative sum; the current code uses partial within-tie sums, so the statistic becomes row-order-dependent and can be wrong on rounded/discrete doses. Concrete fix: collapse equal-dose blocks before squaring in both the observed and bootstrap statistics, or explicitly reject duplicate d values with a front-door error/warning if the implementation is intended only for continuous support; add an order-invariance regression.
  • Severity P1. Impact: Exact-linear inputs are misclassified as inconclusive. stute_test returns all-NaN when var(eps)==0 in diff_diff/had_pretests.py:L829-L848, and yatchew_hr_test does the same when sigma4_W<=0 in diff_diff/had_pretests.py:L959-L991. For an exact affine relation ΔY=a+bD, Assumption 8 holds exactly; under the Stute bootstrap, S=0 and every S*=0, so the correct outcome is a non-rejection (p=1), not NaN. _compose_verdict then upgrades any NaN to "inconclusive" in diff_diff/had_pretests.py:L550-L575, so an exactly linear panel can fail the new composite diagnostic. The intended contract in tests/test_had_pretests.py:L370-L391 points the same way. Concrete fix: return a non-rejecting result for exact-linear fits in the Stute path, and either define the same continuity-limit behavior for Yatchew or keep the workflow from poisoning the whole report when the alternative linearity test is well-defined; add regressions for dy = a + b*d.
  • Severity P1. Impact: HADPretestReport.all_pass can be True on inconclusive reports. The dataclass presents all_pass as the summary boolean in diff_diff/had_pretests.py:L368-L376, but the workflow computes it from not reject only in diff_diff/had_pretests.py:L1092-L1097. Any NaN case therefore yields all_pass=True while verdict says "inconclusive - ... NaN". That is a semantic contract violation for a new public result object and can silently mislead downstream callers. Concrete fix: require all constituent p-values/statistics to be finite before setting all_pass=True, or rename the field to something explicit like no_rejections and add a separate conclusive boolean; add a regression on a NaN case such as D_(1)=D_(2).

Code Quality

  • Severity P2. Impact: The exported raw qug_test API silently treats all d <= 0 observations as excluded but tells the user only that d == 0 rows were dropped in diff_diff/had_pretests.py:L646-L656. Negative post-period doses violate the HAD support restriction documented for the estimator in docs/methodology/REGISTRY.md:L2330-L2330; today the panel workflow rejects them, but the raw public API computes a statistic on malformed input and misreports the exclusion metadata. Concrete fix: add a front-door ValueError on any d < 0 in qug_test (and, if intended, the other raw pretests), leaving the current warning only for true zero-dose observations.

Performance

  • Severity P3. Impact: The registry says the shipped QUG implementation is O(G) via np.partition in docs/methodology/REGISTRY.md:L2276-L2276, but the code currently performs a full sort in diff_diff/had_pretests.py:L681-L683. This is not a methodology blocker, but the published complexity claim is no longer true and the implementation does extra work on large G. Concrete fix: either switch to np.partition to select the two smallest positive doses or update the registry note to reflect the actual np.sort implementation.

Maintainability

No new maintainability findings beyond the API-contract issues above.

Tech Debt

  • Severity P3 informational. Impact: The deferred Phase 3 items called out by the PR, including the Equation 18 joint pre-trend test, vectorized Stute bootstrap, R-parity scope, and multi-period workflow dispatch, are explicitly tracked in TODO.md:L98-L102. I did not count those tracked deferrals against the assessment. Concrete fix: none required for this PR.

Security

No findings.

Documentation/Tests

  • Severity P2. Impact: The new test module does not cover the two most failure-prone new paths I found: duplicate-dose handling in stute_test and negative-dose validation for the raw QUG API. Separately, the constant-dy test contract in tests/test_had_pretests.py:L370-L391 currently disagrees with the submitted degeneracy branches in diff_diff/had_pretests.py:L829-L848 and diff_diff/had_pretests.py:L975-L991, so the intended behavior is not actually protected by the implementation. Concrete fix: add regressions for duplicate-dose order invariance, negative-dose rejection, and the NaN/all_pass contract; if exact-linear inputs are supposed to fail-to-reject, make that behavior explicit in both code and tests.
  • Severity P3. Impact: REGISTRY.md still says the library default for stute_test is n_bootstrap = 499 in docs/methodology/REGISTRY.md:L2291-L2291, but the code ships 999 in diff_diff/had_pretests.py:L724-L725. Concrete fix: align the registry text with the code, or revert the code default if 499 is the intended public API.

Path to Approval

  1. Fix stute_test tie handling so duplicate doses either produce the correct block-collapsed CvM statistic or are rejected/warned explicitly, and add an order-invariance regression.
  2. Change the exact-linear degeneracy behavior so the linearly satisfied null does not return an inconclusive report, with tests for dy = a + b*d.
  3. Make all_pass impossible on inconclusive reports by gating it on finite constituent results, or rename/split the boolean so callers cannot mistake “no rejection” for “valid pass.”
  4. Add a negative-dose guard to the raw QUG API and reconcile the registry with the shipped implementation (np.partition vs np.sort, 499 vs 999).

P1 #1 — Stute tie-safe CvM:
Paper defines c_G(d) = Σ 1{D ≤ d} · eps with c_G(D_g) evaluated AT each
observation's dose, so tied observations share the post-tie cumulative sum.
My naive cumsum over sorted residuals produced partial within-tie sums that
were row-order-dependent. Fix: after cumsum, replace within-tie-block values
with the block's last cumsum via np.unique + np.repeat. `_cvm_statistic` now
accepts `d_sorted` and collapses tie blocks before squaring. Regression
test `test_cvm_statistic_tie_safe_order_invariance` pins order-invariance
on duplicate doses at atol=1e-14; `test_stute_order_invariance_with_duplicate_doses`
validates the end-to-end stute_test contract.
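
The np.unique + np.repeat collapse described above can be sketched as follows (hypothetical helper name; the shipped `_cvm_statistic` wires this into the squared-cusum statistic):

```python
import numpy as np

def tie_safe_cumsum(d_sorted, eps_sorted):
    """Sketch of the tie-block collapse: every observation in a tied
    dose block shares the block's post-tie cumulative residual sum,
    so c_G(D_g) is evaluated AT each dose and is order-invariant
    within ties.
    """
    cs = np.cumsum(eps_sorted)
    # First index and multiplicity of each distinct dose block.
    _, first_idx, counts = np.unique(
        d_sorted, return_index=True, return_counts=True
    )
    last_idx = first_idx + counts - 1   # last row of each tie block
    block_last = cs[last_idx]           # post-tie cumulative sums
    return np.repeat(block_last, counts)  # broadcast back to all rows
```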

P1 #2 — Exact-linear fit must fail-to-reject (not return NaN):
For dy = a + b·d exact, Assumption 8 holds exactly and the correct outcome
is p=1, reject=False. My previous var(eps)<=0 check routed this to NaN. Fix:
dropped var(eps) degeneracy branch from stute_test (the bootstrap naturally
produces p=1 when eps=0 exactly). Added a scale-relative short-circuit
(sum(eps²) ≤ 1e-24 · sum(dy²)) in both stute_test and yatchew_hr_test so
FP noise (eps ~ 1e-16 from IEEE arithmetic on dy = 1 + 2*d) doesn't defeat
the short-circuit by producing non-zero but tiny OLS residuals. Yatchew
exact-linear now returns (t_stat_hr=-inf, p=1, reject=False) rather than
NaN. Regressions: TestStuteTest.test_exact_linear_returns_p1_not_nan,
TestYatchewHRTest.test_exact_linear_returns_p1_not_nan.

P1 #3 — HADPretestReport.all_pass contract:
Previously `all_pass = not (reject_qug or reject_stute or reject_yatchew)` could
be True while `verdict` said "inconclusive - X NaN". Fix: gate all_pass on every
constituent p-value being finite AND no test rejecting. Updated docstring.
Regression: TestCompositeWorkflow.test_all_pass_false_when_any_test_nan.
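
A sketch of the conclusive-only gate (illustrative names, not the dataclass internals):

```python
import math

def compose_all_pass(p_values, rejects):
    """Sketch: all_pass requires every constituent p-value to be finite
    (conclusive) AND no test to reject; a NaN anywhere forces False."""
    conclusive = all(math.isfinite(p) for p in p_values)
    return conclusive and not any(rejects)
```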

P2 #1 — QUG negative-dose guard:
HAD doses must be non-negative (paper Section 2). The raw qug_test API
was silently folding d < 0 rows into the n_excluded_zero counter (filter
was `d > 0`). Fix: front-door ValueError on any d < 0. Regression:
TestQUGTest.test_negative_dose_raises.

P3 #1 — QUG np.partition:
REGISTRY claims O(G) via np.partition. Code was using np.sort. Switched
qug_test to np.partition(d_nz, 1), which guarantees partitioned[0] ≤
partitioned[1] = D_{(2)}, i.e., partitioned[0] = D_{(1)}. Tight
closed-form parity at atol=1e-12 still holds.

P3 #2 — REGISTRY n_bootstrap default:
REGISTRY said "Default n_bootstrap = 499" but code ships 999. Updated
REGISTRY to match code and added a note about the n_bootstrap >= 99
front-door validation.

Test count: 47 -> 53.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 10a10792675f9171e8083335b6bfd9662755d15e


Overall Assessment

Blocker

This re-review is narrower than the previous one: the earlier P1s on Stute tie handling, exact-linear NaN behavior, all_pass, raw QUG negative-dose validation, and the registry 499/999 + sort/partition drift all look addressed in diff_diff/had_pretests.py:L539-L582, diff_diff/had_pretests.py:L684-L693, diff_diff/had_pretests.py:L1191-L1202, and docs/methodology/REGISTRY.md:L2270-L2291. I still see one unmitigated P0 and one unmitigated P1 in the updated linearity-test paths.

Executive Summary

  • Prior re-review blockers appear fixed: Stute now uses a tie-safe CvM implementation, all_pass is conclusive-only, raw qug_test rejects negative doses, and the registry now matches the shipped np.partition / n_bootstrap=999 behavior.
  • Severity P0 [Newly identified]: the new “exact linear” fast path is not translation-invariant and can silently force p_value=1.0 on ordinary noisy data after adding a large constant to dy; this affects both stute_test and yatchew_hr_test.
  • Severity P1 [Newly identified]: yatchew_hr_test still has no duplicate/constant-dose guard, so tied doses make the statistic depend on arbitrary row or unit ordering.
  • The documented Phase 3 deferrals are properly tracked in TODO.md:L98-L102 and docs/methodology/REGISTRY.md:L2303-L2310, docs/methodology/REGISTRY.md:L2343-L2345, so I did not count them against approval.
  • Static review only: I could not execute the new tests in this reviewer environment because the local Python runtime is missing numpy / pytest.

Methodology

  • Severity P0 [Newly identified]. Impact: the exact-linear shortcut in both linearity tests is provably wrong under additive outcome shifts. The paper/registry definitions for Stute and Yatchew depend on OLS residuals and sorted differences after a regression with an intercept, so adding a constant to dy must not change the test result: docs/methodology/REGISTRY.md:L2280-L2300, docs/methodology/papers/dechaisemartin-2026-review.md:L153-L173. But the new shortcut scales against raw sum(dy^2) in diff_diff/had_pretests.py:L64-L71, diff_diff/had_pretests.py:L904-L915, and diff_diff/had_pretests.py:L1036-L1053. That makes the branch depend on the arbitrary level of the outcome, not the regression fit. Concretely, with G=50 and dy = 1e12 + 2*d + ε with unit-scale noise, sum(dy^2) is about 5e25, so the threshold is about 50; ordinary OLS SSE is then enough to trip the “exact linear” branch and return Stute(cvm_stat=0, p=1) / Yatchew(t=-inf, p=1, sigma2_W=0) on non-degenerate data. That is silent wrong statistical output. Concrete fix: replace the shortcut with a translation-invariant exact-fit check based on centered or fitted-response scale, or remove the shortcut and only special-case truly zero residuals under a centered tolerance; add a regression asserting identical results for (d, dy) and (d, dy + C) on the same near-linear data.

  • Severity P1 [Newly identified]. Impact: yatchew_hr_test still silently accepts duplicate or constant doses and uses stable sort as the tie policy in diff_diff/had_pretests.py:L984-L986, then computes adjacent differences and adjacent residual products on that arbitrary order in diff_diff/had_pretests.py:L1041-L1042 and diff_diff/had_pretests.py:L1055-L1067. Unlike stute_test, there is no front-door constant-dose guard (diff_diff/had_pretests.py:L880-L898). For tied doses, permuting rows within a tie block can change sigma2_diff, sigma4_W, and the final p-value without changing the data. In the composite workflow this arbitrary order becomes unit-id order via diff_diff/had.py:L1351-L1353, which is deterministic but still unrelated to the statistic’s methodology. The registry describes Yatchew only as “sort by D” and has no **Note:** documenting a tie-policy deviation: docs/methodology/REGISTRY.md:L2295-L2299. Concrete fix: reject or return NaN when d has duplicates or zero variance before computing the statistic, or document a justified tie-resolution rule in docs/methodology/REGISTRY.md and back it with permutation tests if you intend to support ties.

Code Quality

No separate findings beyond the methodology issues above.

Performance

  • Severity P3 informational. Impact: the slower literal-refit Stute bootstrap is already explicitly tracked as deferred work in TODO.md:L99-L99 and documented in docs/methodology/REGISTRY.md:L2344-L2347, so I did not count it against this PR. Concrete fix: none for this PR.

Maintainability

No separate findings.

Tech Debt

  • Severity P3 informational. Impact: the missing Equation 18 joint pre-trends step, tighter R parity, nprobust/Stute enhancement ideas, and multi-period workflow dispatch are all explicitly tracked in TODO.md:L98-L102 and reflected in docs/methodology/REGISTRY.md:L2303-L2310, docs/methodology/REGISTRY.md:L2343-L2345. These are properly mitigated deferrals under the project rules. Concrete fix: none for this PR.

Security

No findings.

Documentation/Tests

  • Severity P2. Impact: the added tests cover small-scale exact-linear cases and Stute duplicate-dose handling, but they do not protect the two problematic paths above. There is no regression that the exact-linear shortcut is invariant to adding a large constant (tests/test_had_pretests.py:L269-L279, tests/test_had_pretests.py:L344-L357), and there is no standalone Yatchew duplicate/constant-dose test at all in tests/test_had_pretests.py:L298-L357. Concrete fix: add (1) a large-offset near-linear regression for both stute_test and yatchew_hr_test, and (2) raw yatchew_hr_test cases for constant d and duplicated d under within-tie permutations.

Path to Approval

  1. Replace the exact-linear fast path in both stute_test and yatchew_hr_test with a translation-invariant criterion, or remove it entirely; add a regression showing the outputs are unchanged when the same dy vector is shifted by a large constant.
  2. Add a front-door duplicate/constant-dose guard to yatchew_hr_test before any exact-linear shortcut, and add tests for d = np.full(...) plus duplicated-dose permutations.
  3. If you intentionally keep any Yatchew tie policy instead of a hard guard, document it in docs/methodology/REGISTRY.md with a **Note:** and justify why it preserves the intended statistic.

P0 — Exact-linear shortcut not translation-invariant:
R1's scale-relative shortcut compared sum(eps^2) against sum(dy^2), but
sum(dy^2) is inflated by additive shifts in dy (e.g. dy = 1e12 + 2*d +
noise gives sum(dy^2) ~ 5e25, and threshold becomes ~50, which is easily
tripped by ordinary OLS SSE on unit-scale noise). This silently forced
p=1.0 on genuinely noisy data.

Fix: compare against CENTERED total sum of squares sum((dy - dybar)^2),
which equals (1 - R^2) * TSS and is translation-invariant by construction
(the mean subtraction absorbs any additive shift). Applied to both
stute_test and yatchew_hr_test. Module-level constant docstring updated to
explain why centering is required.

Regressions: TestStuteTest.test_exact_linear_shortcut_does_not_fire_on_shifted_noisy_data,
TestYatchewHRTest.test_exact_linear_shortcut_does_not_fire_on_shifted_noisy_data.
Exact numerical equivalence under dy -> dy + 1e12 is not attainable (FP
cancellation at 1e12 scale costs ~4 digits), but the SHORTCUT behavior
and decision MUST match — the tests assert both.
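
The centered-TSS criterion can be sketched as follows (hypothetical helper; the 1e-24 relative tolerance is illustrative, carried over from the earlier uncentered version):

```python
import numpy as np

def exact_linear_shortcut_fires(d, dy, rel_tol=1e-24):
    """Sketch of the translation-invariant exact-fit check: compare the
    OLS SSE against the CENTERED total sum of squares, so adding a
    constant to dy cannot inflate the comparison scale (the mean
    subtraction absorbs any additive shift)."""
    X = np.column_stack([np.ones_like(d), d])
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    eps = dy - X @ beta
    tss_centered = np.sum((dy - dy.mean()) ** 2)  # invariant to dy + C
    return np.sum(eps**2) <= rel_tol * tss_centered
```

On the counterexample scale above, dy = 1e12 + 2*d + unit noise now yields an SSE many orders of magnitude above rel_tol times the centered TSS, so the shortcut cannot fire on noisy data.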

P1 — Yatchew missing tie-handling guard:
R1's tie-safe CvM fixed Stute, but yatchew_hr_test still silently computed
adjacent differences on stable-sorted duplicates. Under ties, within-tie
row ordering changes sigma2_diff and sigma4_W, so the statistic depends
on arbitrary input row order rather than the data.

Fix: front-door UserWarning + all-NaN result when `np.unique(d).size < G`.
Users with tied doses (mass-point designs, discretised registers) are
directed to stute_test, whose tie-safe CvM handles ties correctly.
This also subsumes the constant-d case (all tied) without needing a
separate check.

Regressions: TestYatchewHRTest.test_duplicate_doses_return_nan,
TestYatchewHRTest.test_constant_d_returns_nan.

P2 — REGISTRY documentation:
Added two Note blocks to the HAD section:
- **Note (Phase 3 Yatchew tie policy)**: justifies the NaN-on-duplicate
  policy and directs users to Stute for tied-dose designs.
- **Note (Phase 3 exact-linear short-circuit)**: documents the
  translation-invariance requirement and the centered-TSS scaling.

Test count: 53 -> 57.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 6cd4805354428d25c826eee6f7831f1d2fba65f1


Overall Assessment

⛔ Blocker

Focused re-review of the PR diff. The prior blockers from the last AI review look addressed: the exact-linear shortcuts are now translation-invariant in diff_diff/had_pretests.py:L904-L920 and diff_diff/had_pretests.py:L1068-L1092, the Yatchew duplicate/constant-dose guard is present in diff_diff/had_pretests.py:L1032-L1063, all_pass is now conclusive-only in diff_diff/had_pretests.py:L1230-L1242, and the workflow/registry/TODO notes now line up in docs/methodology/REGISTRY.md:L2343-L2347 and TODO.md:L98-L102. One unmitigated P0 remains in yatchew_hr_test().

Executive Summary

  • Prior re-review findings appear fixed: the translation-invariance bug, duplicate-dose guard, all_pass gating, and registry/TODO documentation all look resolved.
  • Severity P0 [Newly identified]: yatchew_hr_test() still treats every sigma2_W == 0 case as “exact linearity” and returns t_stat_hr=-inf, p_value=1.0, but sigma2_W == 0 can occur with nonzero residuals and a positive numerator, which flips the correct decision.
  • The deferred Phase 3 items are properly tracked: Equation 18 pre-trends, vectorized Stute bootstrap, tighter R parity, and multi-period workflow dispatch are all documented in TODO.md:L98-L102 and docs/methodology/REGISTRY.md:L2310-L2351, so I did not count them against approval.
  • Static review only: I could not run the full test suite here because the environment is missing numpy, pandas, scipy, and pytest.

Methodology

  • Severity P0 [Newly identified]. Impact: in diff_diff/had_pretests.py:L1103-L1125, the code assumes sigma4_W == 0 implies “OLS residuals are zero” and hard-codes t_stat_hr=-inf, p_value=1.0, reject=False. That is not true under the paper’s Equation 29 in docs/methodology/papers/dechaisemartin-2026-review.md:L165-L173: sigma4_W = 0 only means every adjacent residual-product is zero. A concrete counterexample is d=[1,2,3,4,5], dy=[1,0,-2,0,1], which yields OLS residuals [1,0,-2,0,1], sigma2_lin=1.2, sigma2_diff=1.0, and sigma2_W=0. The formal statistic therefore tends to +inf, not -inf, so the current branch silently turns a rejection into a fail-to-reject. Because diff_diff/had_pretests.py:L1226-L1242 feeds that result straight into did_had_pretest_workflow(), the composite verdict can also be wrong with no warning. Concrete fix: keep the earlier exact-linear shortcut for true zero-residual fits, but do not map every later sigma4_W == 0 case to p=1. Distinguish by the numerator sign (+inf reject, -inf fail-to-reject, near-zero numerator -> warning/NaN), or conservatively return NaN unless the exact-linear condition already fired.

Code Quality

  • No separate findings beyond the methodology blocker above.

Performance

  • Severity P3 informational. Impact: the slower literal-refit Stute bootstrap remains, but it is explicitly documented and tracked in TODO.md:L99 and docs/methodology/REGISTRY.md:L2344, so I did not count it against approval. Concrete fix: none for this PR.

Maintainability

  • No separate findings.

Tech Debt

  • Severity P3 informational. Impact: the deferred Equation 18 joint pre-trends test, tighter R parity, nprobust enhancement idea, and multi-period workflow dispatch are all newly tracked in TODO.md:L98-L102. These are properly mitigated deferrals under the project rules. Concrete fix: none for this PR.

Security

  • No findings.

Documentation/Tests

  • Severity P2. Impact: the Yatchew tests in tests/test_had_pretests.py:L377-L440 cover exact-linear, shifted-noisy, and duplicate-dose paths, but they do not cover the blocker case where sigma2_W == 0 despite nonzero residuals. That leaves the silent wrong-output branch above unpinned. Concrete fix: add a regression using a unique-dose input such as d=[1,2,3,4,5], dy=[1,0,-2,0,1] and assert the function does not return p_value=1.0, reject=False.
  • Severity P3 informational. Impact: the public docstring in diff_diff/had_pretests.py:L983-L991 still says tied doses are handled by stable sort, but the shipped implementation now warns and returns NaN on duplicates in diff_diff/had_pretests.py:L1032-L1063. The registry note is correct in docs/methodology/REGISTRY.md:L2346-L2347, so this is informational only, but the API docs are currently misleading. Concrete fix: update the yatchew_hr_test() docstring to describe the duplicate-dose guard and direct tied-dose users to stute_test().

Path to Approval

  1. Fix the sigma4_W == 0 branch in yatchew_hr_test() so it does not unconditionally map zero denominator to t_stat_hr=-inf, p_value=1.0. Only true exact-linear fits should take that path.
  2. Add a regression test for a unique-dose/non-exact-linear case with sigma2_W == 0 and assert the function does not silently fail-to-reject.
  3. Update the yatchew_hr_test() docstring so the in-code API docs match the documented registry tie policy.

P0 — Yatchew sigma4_W=0 branch was not sign-aware:
R2's safety branch unconditionally returned (t_stat_hr=-inf, p=1,
reject=False) whenever sigma4_W <= 0, which is mathematically wrong for
non-exact-linear inputs where the adjacent-residual-product sum happens
to vanish.

Reviewer's counterexample: d=[1,2,3,4,5], dy=[1,0,-2,0,1].
OLS slope=0, residuals=dy itself. sigma2_lin=1.2, sigma2_diff=1.0,
sigma4_W=0 (each adjacent residual product has a zero factor).
The formal statistic sqrt(5)*(1.2-1.0)/0 diverges to +inf, so against
the one-sided critical value the correct decision is reject=True.
Previously the branch mapped this case to p=1, silently flipping the
rejection.

Fix: after the exact-linear short-circuit has already caught the
residuals-are-zero case, the sigma4_W=0 branch is now sign-aware on the
numerator:
- numerator > 0 (sigma2_lin > sigma2_diff) -> T_hr=+inf, p=0, reject=True
- numerator < 0 -> T_hr=-inf, p=1, reject=False
- numerator = 0 -> T_hr=NaN, p=NaN, reject=False (genuinely indeterminate)
All three branches now emit a UserWarning so users see the degenerate
denominator rather than silently consuming an infinite statistic.
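A minimal sketch of that sign-aware branch (a hypothetical standalone helper for illustration; in the PR the logic is inlined in yatchew_hr_test()):

```python
import math
import warnings

def degenerate_sigma4_branch(sigma2_lin, sigma2_diff, n):
    """Sign-aware handling once sigma4_W == 0 survives the exact-linear
    short-circuit (sketch, not the diff_diff API)."""
    warnings.warn("sigma4_W == 0: degenerate variance denominator",
                  UserWarning, stacklevel=2)
    numerator = math.sqrt(n) * (sigma2_lin - sigma2_diff)
    if numerator > 0:
        return math.inf, 0.0, True        # t_stat_hr, p_value, reject
    if numerator < 0:
        return -math.inf, 1.0, False
    return math.nan, math.nan, False      # genuinely indeterminate
```

On the reviewer's counterexample (sigma2_lin=1.2, sigma2_diff=1.0, n=5) this returns (+inf, 0.0, True) rather than the old (-inf, 1.0, False).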

Regression: TestYatchewHRTest.test_sigma4_W_zero_with_positive_numerator_rejects
pins the reviewer's counterexample: asserts t_stat_hr=+inf, p=0,
reject=True (NOT p=1, reject=False), with sigma2_lin=1.2 and
sigma2_diff=1.0 preserved for inspection.

P3 — yatchew_hr_test docstring stale on tie policy:
R2 added the duplicate-dose NaN guard in code and REGISTRY, but the
public docstring still said "stable sort preserves input order among
tied d values." Updated "Notes" section to describe: (a) duplicate-dose
rejection with redirect to stute_test, (b) the translation-invariant
exact-linear short-circuit, and (c) the sign-aware sigma4_W=0 branch.

Test count: 57 -> 58.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: de20d51f122c2b271b88b167fa30321c7717f009


Overall Assessment

Looks good.

The prior re-review blocker appears resolved: the exact-linear guards are now translation-invariant, yatchew_hr_test() no longer hard-codes every sigma4_W == 0 case to p=1, all_pass now requires finite p-values, and the registry/TODO/docstring updates line up (diff_diff/had_pretests.py:904, diff_diff/had_pretests.py:1135, diff_diff/had_pretests.py:1288, docs/methodology/REGISTRY.md:2343, TODO.md:98). I did not find an unmitigated P0/P1 in the changed code.

Static review only: this environment is missing numpy and pytest, so I could not run the new suite or import the package dynamically.

Executive Summary

Methodology

  • Severity P3 [Newly identified]. Impact: the composite workflow is stricter than the paper’s step-3 “Stute or Yatchew” wording. yatchew_hr_test() returns NaN on any duplicated dose (diff_diff/had_pretests.py:1068), and _compose_verdict() plus all_pass then classify the whole workflow as inconclusive on any NaN child result (diff_diff/had_pretests.py:604, diff_diff/had_pretests.py:1288). That matters for common QUG-style panels with repeated d == 0; the paper review itself cites an application with 12 zero-dose units (docs/methodology/papers/dechaisemartin-2026-review.md:298). This is conservative rather than silently wrong, so I would not block on it. Concrete fix: in did_had_pretest_workflow(), skip Yatchew from conclusive gating when duplicate doses are present, or document the helper’s tied-dose limitation explicitly in the registry/docstring.

Code Quality

  • No findings.

Performance

  • Severity P3 informational. Impact: the literal Stute bootstrap refit remains slower than Appendix D’s vectorized matrix form, but that is explicitly documented and tracked, not a correctness defect (TODO.md:99, docs/methodology/REGISTRY.md:2344). Concrete fix: none for this PR.

Maintainability

  • No findings.

Tech Debt

  • Severity P3 informational. Impact: the deferred Equation 18 joint pre-trends test, tighter R parity, nprobust-related enhancement, and multi-period workflow dispatch are all tracked in TODO.md, so these are properly mitigated deferrals under the project rules (TODO.md:98). Concrete fix: none for this PR.

Security

  • No findings.

Documentation/Tests

  • Severity P2. Impact: the new tests cover standalone zero filtering in QUG and standalone duplicate-dose NaN behavior in Yatchew, but they do not pin the workflow-on-tied-dose path above. The workflow tests only exercise continuous-dose pass/fail and QUG-tie inconclusive cases, so the common “multiple zero-dose units + valid Stute test” workflow path is still untested (tests/test_had_pretests.py:95, tests/test_had_pretests.py:414, tests/test_had_pretests.py:489, tests/test_had_pretests.py:611). Concrete fix: add a did_had_pretest_workflow() regression with repeated d == 0 observations and otherwise linear dy, and assert either a Stute-only fallback verdict or the explicitly documented inconclusive behavior.

R4's verdict was "Looks good" overall, with no P0/P1 blockers remaining.
This commit fixes the P3 and P2 proactively to avoid tied-dose workflow
confusion.

R4 P3 / P2 — Workflow must follow paper's "Stute OR Yatchew" step 3:
R3's fix made yatchew_hr_test return NaN on duplicate doses (per paper
methodology). But the composite workflow then forced "inconclusive" on
any NaN child result, so common QUG-style panels with repeated d=0
(reviewer cites an application with 12 zero-dose units) were too
conservatively classified. Per paper Section 4.2-4.3 step 3, linearity
can be tested with Stute OR Yatchew; a conclusive Stute alone
suffices.

Fix:
- _compose_verdict now treats "QUG NaN" and "BOTH Stute + Yatchew NaN"
  as the only inconclusive cases. A conclusive Stute + NaN Yatchew (or
  vice versa) proceeds to the fail-to-reject / rejection logic with a
  "(Yatchew NaN - skipped)" suffix on the fail-to-reject verdict.
- all_pass gating: requires QUG conclusive AND at least ONE of Stute /
  Yatchew conclusive, plus no conclusive test rejects. Follows the
  paper's "OR" wording.
- HADPretestReport docstring updated to reflect the paper-faithful
  all_pass semantic.
- REGISTRY Yatchew tie-policy Note extended with a paragraph describing
  the workflow fallback.

Tests:
- Split TestComposeVerdictLogic.test_nan_in_any_test_inconclusive into
  test_qug_nan_is_inconclusive + test_both_linearity_nan_is_inconclusive,
  plus two new tests covering "Yatchew NaN alone does not force
  inconclusive" and "Stute NaN alone does not force inconclusive".
- New TestCompositeWorkflow.test_workflow_handles_tied_zero_doses_via_stute_fallback
  regression: 60-unit panel with 20 never-treated (d=0 ties) plus 40
  treated with continuous doses. Yatchew NaN, Stute + QUG conclusive,
  composite verdict must be conclusive (not "inconclusive") with a
  "Yatchew NaN - skipped" note and all_pass=True.

Test count: 58 -> 62.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 85a43f4b3d5c9c686cd6155fc35dba732f51edbc


Overall Assessment

Blocker.

The prior re-review findings around tied-dose fallback, sign-aware sigma4_W == 0, and conclusive all_pass gating look addressed, but there is still one silent numerical-correctness bug in the new exact-linear short-circuit. Static review only: this environment does not have numpy, pandas, scipy, or pytest, so I could not execute the new test suite.

Executive Summary

Methodology

  • Severity P0. Impact: the exact-linear short-circuit in both linearity tests is not actually the pure relative sum(eps^2) <= tol * sum((dy - dybar)^2) rule described in the registry. Because the code uses max(dy_centered_sq, 1.0), the branch stops being scale-invariant once centered_TSS < 1: for any noisy dataset, multiplying dy by a small enough positive constant will eventually satisfy the guard and silently convert a non-exact sample into stute_test(...).p_value = 1.0 and yatchew_hr_test(...).t_stat_hr = -inf, p_value = 1.0. That is wrong statistical output with no warning, and it is also an undocumented mismatch with the Phase 3 registry note. Concrete fix: split the zero-TSS / constant-dy case into its own branch, and otherwise use a purely relative criterion without the hard 1.0 floor in both functions. diff_diff/had_pretests.py:928 diff_diff/had_pretests.py:1122 docs/methodology/REGISTRY.md:2347

No other unmitigated methodology findings in the diff. The tied-dose workflow fallback, sign-aware infinite-limit handling, and conclusive-only all_pass logic now line up with the documented Phase 3 note. diff_diff/had_pretests.py:617 diff_diff/had_pretests.py:1152 diff_diff/had_pretests.py:1311 docs/methodology/REGISTRY.md:2346

Code Quality

No findings.

Performance

  • Severity P3 informational. Impact: stute_test() still uses the literal per-bootstrap OLS refit instead of the Appendix D projector form, so large-G runs remain slower than necessary; this is explicitly documented and tracked, not a defect. Concrete fix: none required for approval of this PR. TODO.md:99 docs/methodology/REGISTRY.md:2344

Maintainability

No findings beyond the documentation drift noted below.

Tech Debt

  • Severity P3 informational. Impact: the deferred Equation 18 joint pre-trends step, tighter R parity, optional bandwidth work, and multi-period workflow dispatch are all tracked in TODO.md, so these are properly mitigated deferrals under the repo rules. Concrete fix: none required for this PR. TODO.md:98 TODO.md:100 TODO.md:101 TODO.md:102

Security

No findings.

Documentation/Tests

  • Severity P2. Impact: the new regression coverage pins additive translation invariance of the exact-linear guard, but it does not pin multiplicative rescaling. That left the unit-scale bug above untested. Concrete fix: add one Stute and one Yatchew regression that rescale the same noisy dy by a very small positive constant (for example 1e-12) and assert that the exact-linear branch does not fire and the decision matches the unscaled input. tests/test_had_pretests.py:281 tests/test_had_pretests.py:392
  • Severity P3 informational. Impact: the verdict docs are still out of sync with the implemented tied-dose fallback. HADPretestReport and the registry checklist text still describe “any NaN p-value => inconclusive,” while _compose_verdict() now only treats QUG NaN or both linearity tests NaN as inconclusive. Concrete fix: align the docstring and registry checklist wording with _compose_verdict(). diff_diff/had_pretests.py:394 diff_diff/had_pretests.py:617 docs/methodology/REGISTRY.md:2351

Path to Approval

  1. Remove the hard 1.0 floor from the exact-linear guards in both stute_test() and yatchew_hr_test(), handling the zero-centered-TSS case separately so the branch remains unit-scale invariant.
  2. Add regression tests that rescale a noisy dy by a very small positive constant and assert that both linearity tests preserve the unscaled decision instead of short-circuiting to the exact-linear output.

P0 — Exact-linear shortcut not scale-invariant:
R2's fix changed the denominator from sum(dy^2) to centered TSS, which
made the guard TRANSLATION-invariant. But the code used
`max(centered_TSS, 1.0)` as a defensive floor, which BROKE scale
invariance: multiplying dy by a small constant (e.g. dy * 1e-12) made
centered_TSS ~ 1e-24 but the floor held the threshold at 1.0, firing
the shortcut on noisy data that should NOT trigger it.

Fix: remove the `max(..., 1.0)` floor and handle the `centered_TSS == 0`
case (constant dy, trivially linear) in a dedicated branch above the
relative comparison. The remaining path is purely
`sum(eps^2) <= tol * sum(dy_centered^2)`, which is both translation- and
scale-invariant by construction (R^2 is dimensionless).

Applied to both `stute_test` and `yatchew_hr_test`. Module-level
constant docstring updated to document the invariance and the
zero-TSS split.
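The invariance failure and the fix can be demonstrated side by side (a sketch; the TOL value, helper name, and use_floor switch are illustrative, not the module's actual constants):

```python
import numpy as np

TOL = 1e-12  # illustrative tolerance, not the module constant

def exact_linear_shortcut_fires(d, dy, use_floor):
    """Sketch of the guard; use_floor=True reproduces the buggy
    max(centered_TSS, 1.0) variant, False the fixed relative rule."""
    slope, intercept = np.polyfit(d, dy, 1)
    ssr = np.sum((dy - (slope * d + intercept)) ** 2)
    tss = np.sum((dy - dy.mean()) ** 2)
    if tss == 0.0:            # constant dy: trivially linear, own branch
        return True
    denom = max(tss, 1.0) if use_floor else tss
    return bool(ssr <= TOL * denom)

rng = np.random.default_rng(0)
d = np.linspace(0.0, 1.0, 50)
dy = 2.0 * d + rng.normal(scale=0.5, size=50)  # genuinely noisy

fires_raw = exact_linear_shortcut_fires(d, dy, use_floor=True)
fires_scaled_buggy = exact_linear_shortcut_fires(d, dy * 1e-12, use_floor=True)
fires_scaled_fixed = exact_linear_shortcut_fires(d, dy * 1e-12, use_floor=False)
print(fires_raw, fires_scaled_buggy, fires_scaled_fixed)  # False True False
```

Only the floored variant fires on the rescaled noisy sample; the purely relative rule preserves the unscaled decision.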

Regressions:
- TestStuteTest.test_exact_linear_shortcut_does_not_fire_on_rescaled_noisy_data
- TestYatchewHRTest.test_exact_linear_shortcut_does_not_fire_on_rescaled_noisy_data
Both multiply a noisy `dy` by 1e-12 and assert the shortcut does NOT
fire and decisions match the unscaled input.

P3 — Stale verdict docs:
R4's Stute-OR-Yatchew linearity-step fallback changed `_compose_verdict`
semantics, but the `HADPretestReport.verdict` docstring still listed
"Any NaN p-value -> inconclusive" as rule 1, and the REGISTRY Phase 3
checkbox for the workflow still said "`inconclusive - {tests} NaN`
when any statistic is NaN".

Fix: both docs updated to reflect the paper-faithful semantics:
inconclusive only when QUG is NaN or when BOTH Stute and Yatchew are
NaN; otherwise the fail-to-reject verdict carries a
"(Yatchew NaN - skipped)" (or Stute) suffix when one linearity test
was NaN but the other was conclusive. REGISTRY note also documents
the `all_pass` gating (QUG conclusive AND at least one of
Stute/Yatchew conclusive AND no conclusive test rejects).

Test count: 62 -> 64.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: ad19c0208afc2e530894b6bb1247c009ee96becf


Overall Assessment

⚠️ Needs changes

Static review only: this environment does not have numpy or pytest, so I could not execute tests/test_had_pretests.py.

Executive Summary

  • The prior blocker-level issues called out in the last review look addressed: the exact-linear shortcut is now scale-invariant, the sigma4_W == 0 branch is sign-aware, and the workflow now treats a single conclusive linearity test as sufficient.
  • Severity P1: the composite verdict can still report a bare "inconclusive" outcome even when one implemented test has already produced a conclusive rejection, which understates the paper’s four-step logic.
  • Severity P1: stute_test() and yatchew_hr_test() still accept negative dose arrays silently, even though the HAD method scope and the panel workflow both require D_{g,2} >= 0.
  • The deferred Equation 18 pre-trends step, Stute matrix-form optimization, R-parity work, and multi-period workflow dispatch are all properly tracked in TODO.md / REGISTRY.md and are not blockers.
  • The current tests/docs do not pin the two remaining methodology issues above.

Methodology

  • Severity P1. Impact: _compose_verdict() short-circuits to "inconclusive - QUG NaN" or "inconclusive - both Stute and Yatchew linearity tests NaN" before it checks whether any conclusive test already rejected. That means the composite workflow can hide a real assumption failure. Example: with G=2, qug_test([0.2, 0.21]) rejects at alpha=0.05, while stute_test and yatchew_hr_test are both NaN by their sample-size gates; the current logic would still emit only "inconclusive - both Stute and Yatchew linearity tests NaN". The paper’s rule is one-way: TWFE is admissible only if none of the tests rejects, so a conclusive rejection should never be downgraded to a pure inconclusive verdict. Concrete fix: build rejection reasons from all conclusive tests first, and only return a pure inconclusive verdict when there are no conclusive rejections and a required step is unresolved; if desired, append an inconclusive suffix rather than replacing the rejection. Locations: diff_diff/had_pretests.py:L619-L665, docs/methodology/papers/dechaisemartin-2026-review.md:L269-L273, docs/methodology/REGISTRY.md:L2351.
  • Severity P1. Impact: the new public raw-array linearity helpers do not enforce the paper’s nonnegative-dose domain restriction. qug_test() rejects d < 0, and the panel workflow rejects negative post-period doses, but stute_test() and yatchew_hr_test() currently proceed after _validate_1d_numeric() with no d >= 0 check. Because these functions are now exposed as HAD diagnostics from the top-level package, callers can obtain conclusive pretest outputs on inputs that are outside the method’s stated scope. Concrete fix: mirror the front-door negative-dose guard from qug_test() in both linearity helpers immediately after d validation, and document that contract in their docstrings. Locations: docs/methodology/REGISTRY.md:L2148-L2154, diff_diff/had.py:L816-L834, diff_diff/had_pretests.py:L723-L732, diff_diff/had_pretests.py:L882-L918, diff_diff/had_pretests.py:L1079-L1143.

Code Quality

  • No findings.

Performance

  • Severity P3. Impact: stute_test() still uses literal per-bootstrap refits instead of the Appendix D projector form, so large-G runs remain slower than necessary; this is explicitly documented and tracked, so it should not block approval. Concrete fix: none required for this PR. Locations: TODO.md:L99-L102, docs/methodology/REGISTRY.md:L2344.

Maintainability

  • No findings beyond the documentation/test updates needed for the methodology fixes above.

Tech Debt

  • Severity P3. Impact: the deferred Equation 18 joint pre-trends test, tighter R parity, optional Stute bandwidth work, and multi-period workflow dispatch are all tracked in TODO.md, so these are properly mitigated deferrals under the repo rules. Concrete fix: none required for this PR. Locations: TODO.md:L98-L102.

Security

  • No findings.

Documentation/Tests

  • Severity P2. Impact: the current coverage does not pin either P1 issue above. TestComposeVerdictLogic only asserts the pure inconclusive branches and the no-reject/reject bundles; it does not cover mixed cases like QUG reject + both linearity tests NaN or QUG NaN + Stute reject. Likewise, only qug_test() has a negative-dose regression. That means both methodology bugs can survive without a failing test, and the registry/docstrings currently codify the existing verdict precedence. Concrete fix: add regression tests for both mixed-status verdict cases and direct negative-dose raises for stute_test() / yatchew_hr_test(), then update HADPretestReport and REGISTRY.md to match the corrected behavior. Locations: tests/test_had_pretests.py:L168-L172, tests/test_had_pretests.py:L906-L921, diff_diff/had_pretests.py:L404-L417, docs/methodology/REGISTRY.md:L2343-L2351.

Path to Approval

  1. Change _compose_verdict() so a conclusive rejection is never hidden by a generic inconclusive verdict; if another step is unresolved, append that fact instead of replacing the rejection.
  2. Add the same D_{g,2} >= 0 front-door guard to stute_test() and yatchew_hr_test() that already exists in qug_test() and the panel validator.
  3. Add regression tests for QUG reject + both linearity tests NaN, QUG NaN + linearity reject, and negative-dose raises in both linearity helpers, then sync the docstring / REGISTRY.md wording to the fixed semantics.

R6 P1 #1 — _compose_verdict hides conclusive rejections behind "inconclusive":
The R4 logic returned "inconclusive - QUG NaN" or "inconclusive - both
Stute and Yatchew linearity tests NaN" BEFORE checking whether any
conclusive test had rejected. The reviewer's example: G=2 with QUG
rejecting at alpha=0.05 and Stute/Yatchew NaN by sample-size gates —
the workflow emitted "inconclusive - both linearity NaN", hiding a
real assumption failure.

The paper's rule is one-way: TWFE is admissible only if NO test rejects.
A conclusive rejection therefore dominates unresolved-step notes.

Fix: reorder _compose_verdict:
  1. Collect rejections from conclusive tests first. If any, that is the
     primary verdict, and unresolved-step notes are APPENDED via
     "; additional steps unresolved: ..." rather than replacing the
     rejection.
  2. Only when NO conclusive rejection exists AND a required step is
     unresolved do we return a pure "inconclusive - ..." verdict.
  3. Otherwise fall through to the partial-workflow fail-to-reject
     verdict (with "(Yatchew NaN - skipped)" suffix if applicable).
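The reordering in steps 1-3 above can be sketched as (a simplified, hypothetical helper; the strings and signature are illustrative, not the diff_diff _compose_verdict API):

```python
import math

def compose_verdict_sketch(p_qug, p_stute, p_yatchew, alpha=0.05):
    """Rejection-first verdict ordering (sketch of the semantics
    described above, not the shipped implementation)."""
    def is_nan(p):
        return math.isnan(p)

    tests = (("QUG", p_qug), ("Stute", p_stute), ("Yatchew", p_yatchew))
    # 1. Conclusive rejections dominate: never hidden behind "inconclusive".
    rejections = [name for name, p in tests if not is_nan(p) and p < alpha]
    unresolved = is_nan(p_qug) or (is_nan(p_stute) and is_nan(p_yatchew))
    if rejections:
        verdict = "reject: " + ", ".join(rejections)
        return verdict + ("; additional steps unresolved" if unresolved else "")
    # 2. No rejection but a required step unresolved -> pure inconclusive.
    if unresolved:
        return "inconclusive"
    # 3. Partial-workflow fail-to-reject, noting a skipped linearity test.
    skipped = [name for name, p in tests[1:] if is_nan(p)]
    suffix = f" ({skipped[0]} NaN - skipped)" if skipped else ""
    return "fail-to-reject" + suffix
```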

Regressions:
- TestComposeVerdictLogic.test_qug_reject_with_both_linearity_nan_surfaces_rejection
- TestComposeVerdictLogic.test_linearity_reject_with_qug_nan_surfaces_rejection
- TestComposeVerdictLogic.test_all_three_reject_with_qug_nan_keeps_conclusive_rejections

R6 P1 #2 — Raw stute_test / yatchew_hr_test accept negative doses:
qug_test and _validate_had_panel both front-door-reject d < 0 (paper
Section 2 HAD support restriction), but the new linearity helpers only
validated shape + NaN. Negative doses are outside the method's stated
scope and could silently produce conclusive-looking output.

Fix: mirror the negative-dose guard. Both stute_test and yatchew_hr_test
now raise ValueError on any d < 0 with a message directing users to
pre-process or check the dose column. Docstrings updated to list the
new contract in the Raises section.
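The mirrored guard can be sketched as (a hypothetical helper; in the PR the check is inlined in each function right after d validation):

```python
import numpy as np

def require_nonnegative_doses(d, caller="stute_test"):
    """Front-door guard mirroring qug_test()'s behavior as described
    above (sketch, not the diff_diff helper)."""
    d = np.asarray(d, dtype=float)
    if np.any(d < 0):
        raise ValueError(
            f"{caller}: doses must satisfy d >= 0 under the HAD support "
            "restriction; pre-process or check the dose column."
        )
    return d
```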

Regressions:
- TestNegativeDoseGuardsOnLinearityTests.test_stute_negative_dose_raises
- TestNegativeDoseGuardsOnLinearityTests.test_yatchew_negative_dose_raises

R6 P2 — Docstrings / REGISTRY sync:
HADPretestReport.verdict docstring rewritten to describe the new
"rejection-first, unresolved-suffix" priority. REGISTRY Phase 3
workflow checkbox updated to document the conclusive-rejection-not-
hidden semantics plus the non-negative-dose contract.

Test count: 64 -> 69.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: b7e0d8270bbe4db943cb487c49d251cbf1e322d0


Overall Assessment

✅ Looks good

Static review only: this environment lacks numpy and pytest, so I could not execute the new Phase 3 test file here.

Executive Summary

Methodology

Previous methodology P1s are addressed in code, tests, and registry text. Remaining methodology items are documented P3s rather than blockers.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

  • Severity P3. Impact: the remaining Phase 3 debt is explicitly tracked rather than hidden: Equation 18 joint pre-trends, R parity, nprobust-bandwidth follow-up, and multi-period workflow dispatch all have TODO entries. Concrete fix: none required in this PR. Locations: TODO.md:L98-L102.

Security

No findings.

Documentation/Tests

No findings. The re-review gaps are now covered by direct regression tests and matching registry prose. Locations: tests/test_had_pretests.py:L175-L189, tests/test_had_pretests.py:L597-L668, tests/test_had_pretests.py:L920-L1006, docs/methodology/REGISTRY.md:L2343-L2351.

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 22, 2026
@igerber igerber merged commit 5dd8d0e into main Apr 22, 2026
23 of 24 checks passed
@igerber igerber deleted the had-phase-3-pretests branch April 22, 2026 22:33
igerber added a commit that referenced this pull request Apr 24, 2026
Fixes two P1 issues flagged by the CI reviewer on the initial submission
of PR #352.

P1 Methodology — `compute_time_weights_survey` was documented as solving
the WLS-style weighted λ objective
  min Σ_u rw_u·(Σ_t λ_t·Y_u,pre[t] - Y_u,post_mean)² + ζ²·||λ||²
but row-scaled Y by sqrt(rw) and then handed the scaled matrix to
`_sc_weight_fw(intercept=True)`, whose column-centering uses an
UNWEIGHTED mean across controls. That is not the weighted objective
once rw varies, so non-uniform survey bootstrap draws were refitting λ
on the wrong objective and could bias the reported SE.

Fix: weighted-center `Y_time` BEFORE the sqrt(rw) row-scaling, using
`col_weighted_mean = (Y_time * rw).sum(0) / rw.sum()`, and pass
`intercept=False` to the kernel so no additional unweighted centering
happens on the scaled matrix. Both two-pass calls updated.
`compute_sdid_unit_weights_survey` is unchanged — its column-centering
is PER-UNIT (time means within each control column), which is
independent of rw.
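The corrected ordering can be sketched as (assumed shapes: Y_time is (n_units, T0) and rw is a length-n_units vector of survey weights; the helper name is illustrative):

```python
import numpy as np

def weighted_center_then_scale(Y_time, rw):
    """Weighted-center columns BEFORE sqrt(rw) row-scaling (sketch of
    the fix described above, not the diff_diff helper itself)."""
    rw = np.asarray(rw, dtype=float)
    # Weighted column mean -- the centering the WLS objective implies.
    col_wmean = (Y_time * rw[:, None]).sum(axis=0) / rw.sum()
    Y_centered = Y_time - col_wmean
    # After sqrt(rw) row-scaling, the kernel's unweighted SSR equals the
    # rw-weighted SSR of the centered data, so intercept=False is correct.
    return Y_centered * np.sqrt(rw)[:, None]

rng = np.random.default_rng(0)
Y = rng.normal(size=(6, 4))
rw = rng.uniform(0.5, 2.0, size=6)
Z = weighted_center_then_scale(Y, rw)
# Recover the centered matrix and check its rw-weighted column means are 0.
Yc = Z / np.sqrt(rw)[:, None]
print(np.allclose((Yc * rw[:, None]).sum(axis=0), 0.0))
```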

P1 Code Quality — `SurveyDesign(weights=None, strata=..., psu=...)` is
a valid configuration (`SurveyDesign.resolve()` synthesizes ones when
weights is None), but `_extract_unit_survey_weights` indexed
`survey_design.weights` as if it were always a column name, so the
groupby would fail with a KeyError before the bootstrap branch could
run. Fix: `_extract_unit_survey_weights` now short-circuits to a vector
of ones of length `len(unit_order)` when `survey_design.weights is
None`, matching `SurveyDesign.resolve()`'s semantics.
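A sketch of that short-circuit (hypothetical signature; the real helper reads weights from the panel via groupby):

```python
import numpy as np

def extract_unit_survey_weights_sketch(weights_col, unit_order, unit_weights):
    """Return per-unit survey weights, synthesizing ones when the design
    has no weights column (sketch, not the diff_diff helper)."""
    if weights_col is None:
        # Mirrors SurveyDesign.resolve(): weights=None means unit ones.
        return np.ones(len(unit_order))
    return np.asarray([unit_weights[u] for u in unit_order], dtype=float)
```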

Regression tests:
- `test_non_uniform_rw_beats_unweighted_centering_variant`
  (test_weighted_fw.py): reproduces the pre-fix buggy variant (row-
  scale Y by sqrt(rw), then call `_sc_weight_fw(intercept=True)`) and
  asserts the fixed path's weighted SSR is strictly ≤ the buggy
  variant's weighted SSR. If a future revert reintroduces
  intercept=True after the row-scaling, this test fails.
- `test_bootstrap_full_design_without_explicit_weights`
  (test_methodology_sdid.py): `SurveyDesign(strata=..., psu=...)` with
  no explicit `weights` column now succeeds on the bootstrap path;
  survey_metadata populated with n_strata / n_psu.

P3 Documentation:
- `SyntheticDiD.fit()` docstring (survey_design parameter + Raises
  block): replace "bootstrap rejects all survey designs" language with
  the PR #352 support-matrix truth-table (bootstrap ✓ for both pweight-
  only and full design; placebo/jackknife ✓ pweight-only, ✗ full
  design).
- `_placebo_variance_se` fallback-guidance messages (two sites): drop
  the "strata/PSU/FPC not yet supported by any SDID variance method"
  framing; recommend bootstrap for full-design survey fallback,
  jackknife for pweight-only, adding controls as the universal
  fallback.
- `docs/survey-roadmap.md` Current Limitations table: collapse the two
  SDID bootstrap-rejection rows into a single row for placebo/
  jackknife + full design (the bootstrap + full design row no longer
  applies).

Verified: 75 targeted tests pass (test_weighted_fw + TestBootstrapSE +
TestScaleEquivariance + TestCoverageMCArtifact + test_survey_phase5).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026
…path

CI review flagged a silent-failure P0: when a bootstrap draw's treated
units all had survey weight 0 under a pweight-only survey design, the
code fell through to ``np.mean(Y_boot_pre_t, axis=1)`` (unweighted
mean). That silently dropped survey weighting for that draw while the
fit-time ATT uses the survey-weighted treated mean — corrupting the
bootstrap distribution used for SE, p-value, and CI. Reachable with
any `SurveyDesign(weights=...)` input where at least one treated unit
has pweight 0 (a valid configuration per
``SurveyDesign.resolve()``'s spec) and the bootstrap resample picks
only those units.

Fix: extend the Rao-Wu degenerate-retry in ``_bootstrap_se`` to the
pweight-only branch as well. When ANY survey path is active and
``rw_treated_draw.sum() == 0`` OR ``rw_control_draw.sum() == 0``, the
draw is retried. The pre-fix behavior (unweighted treated mean
fallback) is unreachable from any code path now.
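The extended retry can be sketched as (a hypothetical helper; the shipped logic is inlined in _bootstrap_se):

```python
import numpy as np

def resample_with_retry(rng, rw, is_treated, max_retries=1000):
    """Redraw bootstrap indices until both sides carry positive survey
    mass (sketch of the retry described above)."""
    for _ in range(max_retries):
        idx = rng.choice(len(rw), size=len(rw), replace=True)
        rw_draw, treated_draw = rw[idx], is_treated[idx]
        # Retry whenever either side of the draw has zero survey mass;
        # the old unweighted-mean fallback is never reached.
        if rw_draw[treated_draw].sum() > 0 and rw_draw[~treated_draw].sum() > 0:
            return idx
    raise RuntimeError("no non-degenerate bootstrap draw found")
```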

Regression: ``test_bootstrap_pweight_only_retries_zero_treated_mass_draws``
in ``tests/test_methodology_sdid.py::TestBootstrapSE``. Constructs a
panel with 3 treated units where one has `wt=0`; bootstrap at B=100
still returns finite SE > 0. The test would fail if the code reverted
to the unweighted-mean fallback because zero-mass draws would either
produce degenerate SDiD estimators (NaN τ → gets filtered out by the
finite-τ gate, possibly leaving no valid draws) or produce
systematically different τ values that inflate the SE.

P3 (same review) also addressed:
- ``diff_diff/synthetic_did.py`` replicate-weight rejection message
  (line ~303) now reflects the PR #352 support matrix instead of the
  pre-PR #352 "only pweight-only placebo/jackknife" framing.
  Analytical survey designs are now fully supported per the truth
  table; only replicate-weight variance remains unsupported.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026
…lice

Two P3s from R3; PR was already ✅ Looks good — these are close-out
polish.

P3 docs/tests — secondary surfaces described the full-design path as
"Rao-Wu rescaled bootstrap" but only REGISTRY.md surfaced the material
caveat that SDID still uses unit-level pairs-bootstrap resampling
(``boot_idx = rng.choice(n_total)``) and then applies Rao-Wu rescaled
weights on top — a hybrid composition, not a standalone Rao-Wu
bootstrap. Update survey-theory.md (splits SunAbraham/TROP's standalone
Rao-Wu bullet from SDID's hybrid bullet) and CHANGELOG.md's PR #352
Added entry to use the hybrid-composition wording mirroring REGISTRY.

P3 tests — the methodology-critical ``boot_idx`` × ``generate_rao_wu_weights``
interaction was only guarded by the slow coverage MC. Add
``test_bootstrap_full_design_rao_wu_boot_idx_slice`` (in
``TestBootstrapSE``) which monkeypatches ``generate_rao_wu_weights`` to
return a known vector of distinct per-unit values
(``arange(1, n_total+1)``), captures the ``rw_control_draw`` vectors
fed into the weighted FW helper via a capturing wrapper on
``compute_sdid_unit_weights_survey``, and asserts every captured vector
lies within ``known_rw[:n_control]`` (positions 1..n_control). This
catches two bug classes:
- slice-order regression: if someone swaps rw-then-slice for
  slice-then-rw, the captured vectors would include values from the
  treated slice ``known_rw[n_control:]`` and the assertion fires.
- rw-drift regression: if the Rao-Wu call site bypasses
  ``generate_rao_wu_weights`` (e.g., a refactor silently uses the
  pweight-only branch for full-design fits), the captured vector would
  be the user's w_control (all 1.0 in this test) instead of the known
  Rao-Wu output.

Verified: 294 targeted tests pass across test_methodology_sdid /
test_survey_phase5 / test_weighted_fw / test_guides / test_rust_backend.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026


Three places in REGISTRY.md still described the pre-PR-#352 behavior
where SDID survey bootstrap raised NotImplementedError, contradicting
the new support matrix and survey-composition note landed earlier in
this PR:

- §SyntheticDiD bootstrap bullet (paper-faithful refit text): the
  "Composed with any survey design ... raises NotImplementedError"
  trailer is updated to describe weighted-FW dispatch under the
  survey + bootstrap composition note.
- §SyntheticDiD requirements checklist bullet for "Bootstrap: paper-
  faithful Algorithm 2 step 2": "Survey designs raise
  NotImplementedError" trailer is updated to describe the hybrid
  pairs-bootstrap + Rao-Wu rescaling composition.
- §Rao-Wu Rescaled Bootstrap "intentionally excluded" note: rewritten
  to add SDID to the list via its hybrid composition, with the
  precise distinction from standalone Rao-Wu (SunAbraham / TROP) and
  a pointer to §SyntheticDiD for the objective form and argmin-set
  caveat.

Remaining NotImplementedError mentions in REGISTRY.md §SyntheticDiD
(L1554-L1555) correctly describe the placebo / jackknife + strata/PSU/
FPC methodology gap that is out of scope for this PR and tracked in
TODO.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026
Foundation for restoring SDID survey-bootstrap support (PR #352, follow-up
to #351 which front-door rejected all survey designs). This commit adds the
weighted-FW kernel + Python wrappers; the bootstrap integration lands in
the next commit.

Rust (rust/src/weights.rs, rust/src/lib.rs):
- New `sc_weight_fw_gram_weighted` and `sc_weight_fw_standard_weighted`
  loop variants. Identical to the unweighted loops except for the
  regularization term: `half_grad[j]` picks up `eta*reg_w[j]*lam[j]` in
  place of `eta*lam[j]`, and the FW step-size denominator uses the
  diag(reg_w)-weighted simplex direction norm
  `Σ_j reg_w[j]*d[j]²` (which simplifies to
  `Σ_j reg_w[j]*lam[j]² + reg_w[i] - 2*reg_w[i]*lam[i]` for d = e_i - lam).
- New `sc_weight_fw_weighted_internal` dispatcher that delegates to the
  unweighted internal when reg_weights is None (preserves the legacy
  numeric contract for any future caller that wants the generic shape).
- Two new pyfunctions: `sc_weight_fw_weighted` and
  `sc_weight_fw_weighted_with_convergence`. Same call shape as the
  existing unweighted siblings plus a trailing `reg_weights` kwarg.
  Registered in lib.rs.
- 3 new Rust unit tests in rust/src/weights.rs:
    * test_weighted_fw_reg_weights_none_delegates — bit-identity at
      rel=1e-14 against the unweighted internal.
    * test_weighted_fw_uniform_reg_weights_matches_unweighted — uniform
      rw=1 collapses to uniform regularization (rel=1e-12, allowing for
      ULP-scale drift from different float reduction orders).
    * test_weighted_fw_simplex_invariants — for arbitrary positive rw
      and both gram (T0<N) and standard (T0>=N) paths, returned ω sums to
      1 and is non-negative.
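The step-size denominator simplification quoted above can be sanity-checked numerically; a minimal numpy sketch (illustrative only, not the Rust loop):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
reg_w = rng.uniform(0.1, 3.0, n)      # arbitrary positive regularization weights
lam = rng.dirichlet(np.ones(n))       # a point on the simplex
i = 2                                 # FW vertex index
d = -lam.copy()
d[i] += 1.0                           # simplex direction d = e_i - lam

direct = np.sum(reg_w * d ** 2)
simplified = np.sum(reg_w * lam ** 2) + reg_w[i] - 2.0 * reg_w[i] * lam[i]
assert np.isclose(direct, simplified, rtol=1e-12)
```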

Python (diff_diff/utils.py, diff_diff/_backend.py):
- Export _rust_sc_weight_fw_weighted and _with_convergence from _backend
  (mirrors the shape added for _rust_sc_weight_fw_with_convergence in
  PR #351 c0d089b).
- Extend `_sc_weight_fw` and `_sc_weight_fw_numpy` with a
  `reg_weights: Optional[np.ndarray] = None` kwarg. When set on the Rust
  path, dispatches to the new weighted pyfunctions; on the pure-Python
  path, runs a weighted FW loop mirroring the Rust derivation.
- New helper `compute_sdid_unit_weights_survey(Y_pre_control,
  Y_pre_treated_mean, rw_control, ...)`: column-scales Y_pre_control by
  rw_control and passes rw_control as reg_weights so the FW solves the
  unit-weight survey-bootstrap objective
    min_{ω simplex} Σ_t (Σ_i rw_i·ω_i·Y_i,pre[t] - treated_pre[t])² +
                    ζ²·Σ_i rw_i·ω_i²
  Two-pass sparsify-refit structure mirrors compute_sdid_unit_weights.
  Returns ω on the standard simplex (caller composes ω_eff downstream).
- New helper `compute_time_weights_survey(Y_pre_control, Y_post_control,
  rw_control, ...)`: row-scales Y_time by sqrt(rw_control) and passes
  no reg_weights (uniform reg on λ — λ is per-period, rw is per-control,
  no alignment for per-λ weighting). Two-pass structure unchanged.
- Both new helpers expose `return_convergence=True` returning the AND of
  the two pass convergence flags, mirroring the contract added in
  PR #351 c0d089b.
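As a rough illustration of the unit-weight survey objective above, a toy Frank-Wolfe loop over the simplex (pure numpy; fixed 2/(k+2) steps, no line search, and none of the kernel's two-pass sparsify-refit structure; all names here are hypothetical):

```python
import numpy as np

def weighted_fw_simplex(Y, b, rw, zeta, n_iter=500):
    """Toy Frank-Wolfe for
        min_{omega on simplex}  sum_t (sum_i rw_i*omega_i*Y[i,t] - b_t)^2
                                + zeta^2 * sum_i rw_i*omega_i^2
    Illustrative sketch only, not the library kernel."""
    n = Y.shape[0]
    omega = np.full(n, 1.0 / n)
    for k in range(n_iter):
        resid = (rw * omega) @ Y - b                    # shape (T,)
        grad = 2.0 * rw * (Y @ resid) + 2.0 * zeta ** 2 * rw * omega
        i = int(np.argmin(grad))                        # best FW vertex e_i
        gamma = 2.0 / (k + 2.0)                         # classic FW step
        omega = (1.0 - gamma) * omega                   # convex combination
        omega[i] += gamma                               # ... stays on simplex
    return omega

rng = np.random.default_rng(1)
Y = rng.normal(size=(5, 8))                 # controls x pre-periods
b = rng.normal(size=8)                      # treated pre-period means
rw = rng.uniform(0.2, 2.0, 5)               # per-control survey weights
omega = weighted_fw_simplex(Y, b, rw, zeta=0.5)
assert omega.min() >= 0.0 and np.isclose(omega.sum(), 1.0)
```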

Tests (tests/test_weighted_fw.py — new, 15 tests):
- _sc_weight_fw weighted-reg path: reg_weights=None matches unweighted at
  bit-identity; uniform reg matches unweighted at rel=1e-12;
  Rust/numpy parity at rel=1e-9; simplex invariants under arbitrary rw;
  return_convergence tuple shape.
- compute_sdid_unit_weights_survey: uniform-rw equivalence to unweighted
  helper, simplex invariants under arbitrary rw, shape-mismatch raises,
  return_convergence AND.
- compute_time_weights_survey: same coverage matrix, plus a zero-rw
  subset test (Rao-Wu-style undrawn PSU yields valid simplex λ).
- Backend parity: pure-Python vs Rust weighted-helper output at rel=1e-7
  for both unit and time helpers (monkeypatches HAS_RUST_BACKEND).

ABI preservation: existing unweighted callers of _sc_weight_fw,
compute_sdid_unit_weights, compute_time_weights are unaffected — the new
kwarg defaults to None and dispatches to the legacy code path. The
bit-identity check on TestScaleEquivariance::test_baseline_parity_small
_scale[bootstrap] still passes at rel=1e-14 (verified in the next commit
when the bootstrap integration lands).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026
…sition

PR #352 restores the SDID survey-bootstrap capability that PR #351 front-
door rejected as a known regression. Pweight-only and full-design surveys
now both succeed; placebo / jackknife continue to reject full designs (a
separate methodology gap tracked in TODO.md).

`diff_diff/synthetic_did.py::fit` (guards):
- Replace the unconditional strata/PSU/FPC NotImpl guard with a method-
  gated version that fires only for placebo / jackknife. Rationale +
  truth-table in REGISTRY.md §SyntheticDiD survey-support matrix:

      method     pweight-only      strata/PSU/FPC
      bootstrap  ✓ (this PR)       ✓ Rao-Wu (this PR)
      placebo    ✓ unchanged       ✗ NotImpl (separate gap)
      jackknife  ✓ unchanged       ✗ NotImpl (separate gap)

- Delete the unconditional `bootstrap + any-survey` guard added in #351.
  Keep the `weight_type != "pweight"` validation (fweight/aweight still
  rejected).

`diff_diff/synthetic_did.py::fit` (survey resolution):
- After validating the per-unit survey weights (`w_treated`, `w_control`),
  also collapse the observation-level `resolved_survey` to a unit-level
  view via `collapse_survey_to_unit_level(...)` ordered as
  `[*control_units, *treated_units]`. The resulting
  `resolved_survey_unit` is what `_bootstrap_se` slices via
  `boot_rw[:n_control]` / `boot_rw[n_control:]` per Rao-Wu draw.

`diff_diff/synthetic_did.py::fit` (dispatcher):
- Branch the bootstrap call on whether the design is pweight-only or
  full design (strata/PSU/FPC). Pass `w_control`/`w_treated` for
  pweight-only, `resolved_survey=resolved_survey_unit` for full design,
  None/None for non-survey.

`diff_diff/synthetic_did.py::_bootstrap_se`:
- New kwargs: `w_control`, `w_treated`, `resolved_survey` (all keyword-
  only, default None — preserves the legacy signature).
- Single-PSU short-circuit: unstratified survey with <2 PSUs returns
  (NaN, []) since the bootstrap distribution is unidentified
  (resampling one PSU yields the same subset every draw). Recovered from
  the pre-PR-#351 fixed-weight Rao-Wu branch (commit 91082e5).
- Per-draw Rao-Wu rescaling for full designs:
  ``rw = generate_rao_wu_weights(resolved_survey, rng)`` sliced over the
  resampled units. Pweight-only path uses ``rw = w_control[boot_idx]``
  (constant per draw, no rescaling).
- Survey-weighted treated-unit means: ``np.average(..., weights=rw_treated_draw)``
  when survey weights are present.
- Warm-start: the simplex init scales by rw before sum_normalize when on
  the survey path, matching the per-draw weighted-FW geometry.
- Per-draw FW dispatch: survey paths call the new
  ``compute_sdid_unit_weights_survey`` / ``compute_time_weights_survey``
  helpers (PR #352 commit 1) which run the weighted-FW kernel; non-
  survey paths continue to call the unweighted helpers (bit-identity
  preserved on the non-survey refit path).
- Post-FW composition: ``ω_eff = rw·ω / Σ(rw·ω)`` for the SDID
  estimator (which expects simplex weights). Degenerate-retry if
  ``Σ(rw·ω) <= 0`` (all mass on rw=0 controls).
- Aggregate FW non-convergence warning: tally is the AND of the two
  helpers' convergence flags per draw, fires above 5% (PR #351 c0d089b
  shape preserved, no copy change).
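A schematic of the post-FW ω_eff composition and degenerate-retry check (hypothetical helper name; the real logic is inline in ``_bootstrap_se`` and ``generate_rao_wu_weights``'s internals are not reproduced):

```python
import numpy as np

def compose_effective_weights(rw, omega):
    """omega_eff = rw*omega / sum(rw*omega); None signals a degenerate
    draw (all simplex mass on rw == 0 controls) that the caller retries
    with a fresh boot_idx / Rao-Wu draw. Illustrative sketch only."""
    mass = rw * omega
    total = mass.sum()
    if total <= 0.0:
        return None
    return mass / total

rw = np.array([0.0, 0.0, 2.0, 1.0])     # e.g. a Rao-Wu draw zeroing two units
omega = np.full(4, 0.25)                # FW output on the standard simplex
omega_eff = compose_effective_weights(rw, omega)
assert np.isclose(omega_eff.sum(), 1.0) and omega_eff.min() >= 0.0

assert compose_effective_weights(np.zeros(4), omega) is None  # retried draw
```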

Tests:
- ``tests/test_survey_phase5.py``: rewrite three PR #351 raises-tests as
  succeeds-tests with explicit SE assertions —
    * ``test_full_design_bootstrap_succeeds`` (was ``_raises``):
      finite SE, populated survey_metadata.n_strata/n_psu, summary()
      includes Survey Design + Bootstrap replications blocks.
    * ``test_bootstrap_with_pweight_only_succeeds`` (was ``_raises``):
      finite SE, variance_method preserved (cross-surface guard).
    * New ``test_bootstrap_full_design_se_differs_from_pweight_only``
      resurrects the PR #351 R3-deleted differs-from contract: ATT
      matches between paths (both compose ω_eff post-fit) but SE
      differs (Rao-Wu adds PSU clustering variance).
- ``tests/test_methodology_sdid.py::TestBootstrapSE``: rewrite two PR #351
  raises-tests as succeeds-tests, plus add the
  ``test_bootstrap_single_psu_returns_nan`` short-circuit regression.

Verified: 308 tests pass across test_methodology_sdid /
test_business_report SDID subset / test_rust_backend / test_survey_phase5
/ test_weighted_fw / test_guides.

Bit-identity check: the non-survey refit path goes through the
unweighted helpers (no weighted-FW dispatch), so
``TestScaleEquivariance::test_baseline_parity_small_scale[bootstrap]``
remains at rel=1e-14 — verified passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026
Capstone of PR #352. Validates the new weighted-FW + Rao-Wu bootstrap
composition and propagates the landed capability across the
documentation surfaces.

Coverage MC harness (benchmarks/python/coverage_sdid.py):
- Add ``stratified_survey`` as a 4th DGP in ``ALL_DGPS``. Uses
  ``generate_survey_did_data`` to produce an N=40 (strata=2, PSU=2/
  stratum) null-treatment panel with moderate weight variation and
  modest ICC (``psu_re_sd=1.5``). Cohort 7 → post = 7..11 (5 post
  periods). Converts per-observation ``treated`` to a unit-level
  ever-treated indicator (SDID's block-treatment requirement).
- Extend ``DGPSpec`` with an optional ``survey_design_factory``
  callable that returns ``(SurveyDesign, supported_methods_tuple)``.
  For ``stratified_survey``: bootstrap only — placebo / jackknife
  reject strata/PSU/FPC at fit-time, so the harness skips them
  rather than catching the NotImplementedError inside ``_fit_one``.
- ``_fit_one`` gains an optional ``survey_design`` kwarg routed
  through ``SyntheticDiD.fit(survey_design=)``. ``_run_dgp`` calls
  the factory once per seed (DataFrame contents don't affect
  columns) and gates methods on the supported set.

Regenerated ``benchmarks/data/sdid_coverage.json`` via
``python benchmarks/python/coverage_sdid.py --n-seeds 500 --n-bootstrap
200``. Total wall-clock 2421 s (~40 min on M-series Mac, Rust backend);
aer63 remains the long tail at 2237 s, stratified_survey adds only
33 s.

Calibration gate (plan §2.7): ``stratified_survey × bootstrap`` at
α=0.05 returns 0.042 (500 seeds × B=200), inside the calibration
band [0.02, 0.10]. ``mean SE / true SD = 1.25`` indicates the
bootstrap is slightly conservative (overestimates empirical
sampling SD by ~25%) — the safer direction under Rao-Wu rescaling
with only 4 PSUs total. Validates the weighted-FW + Rao-Wu
composition end-to-end.
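For orientation, the two calibration metrics quoted above reduce to simple aggregates over per-seed MC outputs; a schematic computation on synthetic arrays (not the harness code, and the numbers are made up):

```python
import numpy as np

# Hypothetical per-seed outputs under a null DGP: estimated ATT and SE.
rng = np.random.default_rng(3)
n_seeds = 500
att_hat = rng.normal(0.0, 1.0, n_seeds)           # true ATT = 0 under the null
se_hat = np.abs(rng.normal(1.1, 0.05, n_seeds))   # slightly conservative SEs

z = att_hat / se_hat
rejection_rate = np.mean(np.abs(z) > 1.96)        # calibration gate at alpha=0.05
se_to_sd = se_hat.mean() / att_hat.std(ddof=1)    # > 1 => conservative bootstrap

assert 0.0 <= rejection_rate <= 1.0 and se_to_sd > 0.0
```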

REGISTRY.md §SyntheticDiD:
- Add ``stratified_survey`` row to the coverage MC table and a
  paragraph under it documenting the calibration verdict, the
  conservatism direction, and why placebo/jackknife rows are NaN.
- Replace the survey-support bullet with a truth-table matrix (PR
  #352 shape); add a ``Note (survey + bootstrap composition)``
  documenting the weighted-FW objective (unit and time forms), the
  ω_eff composition, the argmin-set caveat, the per-draw rw
  dispatch (pweight-only vs Rao-Wu), and the single-PSU
  short-circuit.
- Update the ``Note (default variance_method deviation from R)`` to
  drop the "bootstrap rejects surveys" framing (no longer accurate).
- Update the ``Note (coverage Monte Carlo calibration)`` header to
  say "4 representative null-panel DGPs" and flag stratified_survey
  as bootstrap-only.

User-facing docs:
- ``docs/methodology/survey-theory.md``: restore SDID in the Rao-Wu
  Rescaled Bootstrap list; describe the weighted-FW composition.
- ``docs/survey-roadmap.md``: Phase 5 SDID row updated to reflect
  full-design bootstrap support via PR #352; Phase 6 Rao-Wu bullet
  restores SDID.
- ``docs/tutorials/16_survey_did.ipynb`` cell-35: support matrix
  table row for SyntheticDiD switches from "pweight only (placebo/
  jackknife)" to "bootstrap only (PR #352) for strata/PSU/FPC";
  "Note on SyntheticDiD" block rewritten for the landed contract.
- ``diff_diff/synthetic_did.py`` ``__init__`` docstring: bootstrap
  bullet now describes survey support and the ω_eff composition.
- ``diff_diff/guides/llms-full.txt``: survey-aware bootstrap bullet
  includes SDID in the Rao-Wu list with the weighted-FW formula.

CHANGELOG.md:
- Retain the PR #351 regression Changed entry but annotate it as
  "restored in PR #352"; add new Added/Changed PR #352 entries
  documenting the weighted-FW kernel, survey helpers, _bootstrap_se
  Rao-Wu composition, and the new coverage MC row.

TODO.md:
- Row 103 (SDID + survey designs) → closed by PR #352; replaced
  with a narrower follow-up for placebo/jackknife + strata/PSU/FPC
  (Low priority, no concrete sketch yet).

Tests:
- ``TestCoverageMCArtifact`` extended: 4 DGPs asserted (including
  ``stratified_survey``); new explicit assertions that the
  stratified_survey bootstrap row has ≥100 successful fits and
  α=0.05 rejection ∈ [0.02, 0.10]; placebo/jackknife rows
  n_successful_fits == 0 (strata/PSU/FPC rejection contract).

Verified: TestCoverageMCArtifact passes against the regenerated
artifact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026
Fixes two P1 issues flagged by the CI reviewer on the initial submission
of PR #352.

P1 Methodology — `compute_time_weights_survey` was documented as solving
the WLS-style weighted λ objective
  min Σ_u rw_u·(Σ_t λ_t·Y_u,pre[t] - Y_u,post_mean)² + ζ²·||λ||²
but row-scaled Y by sqrt(rw) and then handed the scaled matrix to
`_sc_weight_fw(intercept=True)`, whose column-centering uses an
UNWEIGHTED mean across controls. That is not the weighted objective
once rw varies, so non-uniform survey bootstrap draws were refitting λ
on the wrong objective and could bias the reported SE.

Fix: weighted-center `Y_time` BEFORE the sqrt(rw) row-scaling, using
`col_weighted_mean = (Y_time * rw).sum(0) / rw.sum()`, and pass
`intercept=False` to the kernel so no additional unweighted centering
happens on the scaled matrix. Both two-pass calls updated.
`compute_sdid_unit_weights_survey` is unchanged — its column-centering
is PER-UNIT (time means within each control column), which is
independent of rw.
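The centering-order point can be verified in the unconstrained case (ignoring the ζ² penalty and the simplex constraint): weighted-centering before the sqrt(rw) row-scaling reproduces the WLS-with-intercept λ, by the Frisch-Waugh logic under the rw inner product. A minimal check with hypothetical arrays:

```python
import numpy as np

rng = np.random.default_rng(2)
n_u, n_t = 12, 4
Y = rng.normal(size=(n_u, n_t))          # controls x pre-periods
y = rng.normal(size=n_u)                 # post-period means
rw = rng.uniform(0.3, 2.5, n_u)          # non-uniform survey weights

# Direct WLS with an explicit intercept column
X1 = np.column_stack([np.ones(n_u), Y])
s = np.sqrt(rw)
beta = np.linalg.lstsq(s[:, None] * X1, s * y, rcond=None)[0]
lam_direct = beta[1:]

# Fixed path: weighted-center FIRST, then sqrt(rw) row-scale, no intercept
col_wmean = (Y * rw[:, None]).sum(0) / rw.sum()
y_wmean = (y * rw).sum() / rw.sum()
Yc = s[:, None] * (Y - col_wmean)
yc = s * (y - y_wmean)
lam_centered = np.linalg.lstsq(Yc, yc, rcond=None)[0]

assert np.allclose(lam_direct, lam_centered)
```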

P1 Code Quality — `SurveyDesign(weights=None, strata=..., psu=...)` is
a valid configuration (`SurveyDesign.resolve()` synthesizes ones when
weights is None), but `_extract_unit_survey_weights` indexed
`survey_design.weights` as if it were always a column name, so the
groupby would fail with a KeyError before the bootstrap branch could
run. Fix: `_extract_unit_survey_weights` now short-circuits to a vector
of ones of length `len(unit_order)` when `survey_design.weights is
None`, matching `SurveyDesign.resolve()`'s semantics.

Regression tests:
- `test_non_uniform_rw_beats_unweighted_centering_variant`
  (test_weighted_fw.py): reproduces the pre-fix buggy variant (row-
  scale Y by sqrt(rw), then call `_sc_weight_fw(intercept=True)`) and
  asserts the fixed path's weighted SSR is strictly ≤ the buggy
  variant's weighted SSR. If a future revert reintroduces
  intercept=True after the row-scaling, this test fails.
- `test_bootstrap_full_design_without_explicit_weights`
  (test_methodology_sdid.py): `SurveyDesign(strata=..., psu=...)` with
  no explicit `weights` column now succeeds on the bootstrap path;
  survey_metadata populated with n_strata / n_psu.

P3 Documentation:
- `SyntheticDiD.fit()` docstring (survey_design parameter + Raises
  block): replace "bootstrap rejects all survey designs" language with
  the PR #352 support-matrix truth-table (bootstrap ✓ for both pweight-
  only and full design; placebo/jackknife ✓ pweight-only, ✗ full
  design).
- `_placebo_variance_se` fallback-guidance messages (two sites): drop
  the "strata/PSU/FPC not yet supported by any SDID variance method"
  framing; recommend bootstrap for full-design survey fallback,
  jackknife for pweight-only, adding controls as the universal
  fallback.
- `docs/survey-roadmap.md` Current Limitations table: collapse the two
  SDID bootstrap-rejection rows into a single row for placebo/
  jackknife + full design (the bootstrap + full design row no longer
  applies).

Verified: 75 targeted tests pass (test_weighted_fw + TestBootstrapSE +
TestScaleEquivariance + TestCoverageMCArtifact + test_survey_phase5).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026
…path

CI review flagged a silent-failure P0: when a bootstrap draw's treated
units all had survey weight 0 under a pweight-only survey design, the
code fell through to ``np.mean(Y_boot_pre_t, axis=1)`` (unweighted
mean). That silently dropped survey weighting for that draw while the
fit-time ATT uses the survey-weighted treated mean — corrupting the
bootstrap distribution used for SE, p-value, and CI. Reachable with
any `SurveyDesign(weights=...)` input where at least one treated unit
has pweight 0 (a valid configuration per
``SurveyDesign.resolve()``'s spec) and the bootstrap resample picks
only those units.

Fix: extend the Rao-Wu degenerate-retry in ``_bootstrap_se`` to the
pweight-only branch as well. When ANY survey path is active and
``rw_treated_draw.sum() == 0`` OR ``rw_control_draw.sum() == 0``, the
draw is retried. The pre-fix behavior (unweighted treated mean
fallback) is unreachable from any code path now.
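A sketch of the extended retry gate (illustrative function name; the real check is inline in ``_bootstrap_se``):

```python
import numpy as np

def draw_is_degenerate(rw_treated_draw, rw_control_draw):
    """Redraw whenever either side of the resample carries zero total
    survey mass, so the old unweighted-mean fallback can never be
    reached. Illustrative sketch only."""
    return rw_treated_draw.sum() == 0.0 or rw_control_draw.sum() == 0.0

assert draw_is_degenerate(np.zeros(3), np.ones(5))       # zero treated mass
assert draw_is_degenerate(np.ones(3), np.zeros(5))       # zero control mass
assert not draw_is_degenerate(np.array([0.0, 1.0]), np.ones(5))
```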

Regression: ``test_bootstrap_pweight_only_retries_zero_treated_mass_draws``
in ``tests/test_methodology_sdid.py::TestBootstrapSE``. Constructs a
panel with 3 treated units where one has `wt=0`; bootstrap at B=100
still returns finite SE > 0. The test would fail if the code reverted
to the unweighted-mean fallback because zero-mass draws would either
produce degenerate SDiD estimators (NaN τ → gets filtered out by the
finite-τ gate, possibly leaving no valid draws) or produce
systematically different τ values that inflate the SE.

P3 (same review) also addressed:
- ``diff_diff/synthetic_did.py`` replicate-weight rejection message
  (line ~303) now reflects the PR #352 support matrix instead of the
  pre-PR #352 "only pweight-only placebo/jackknife" framing.
  Analytical survey designs are now fully supported per the truth
  table; only replicate-weight variance remains unsupported.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026
…er-fix

Fix CHANGELOG: rename survey-bootstrap PR placeholder #352 to #355
igerber added a commit that referenced this pull request Apr 24, 2026
…placebo, jackknife)

Second commit for the SDID survey-placebo/jackknife PR. Extends the
coverage Monte Carlo artifact with jackknife on the stratified_survey
DGP (bootstrap calibration unchanged); promotes the deferred REGISTRY
§SyntheticDiD gap bullets to two landed Notes; updates user-facing
docs to reflect restored capability.

Coverage MC changes
-------------------
* benchmarks/python/coverage_sdid.py: _stratified_survey_design now
  returns ("bootstrap", "jackknife") on the methods tuple. Placebo is
  omitted because the DGP's cohort packs into a single stratum with 0
  never-treated units — stratified-permutation placebo is structurally
  infeasible on this DGP (raises Case C at fit-time). Module docstring
  explains the exclusion and the jackknife anti-conservatism caveat.
* benchmarks/data/sdid_coverage.json: regenerated stratified_survey
  block at n_seeds=500, n_bootstrap=200. Bootstrap validates near-
  nominal (α=0.05 rejection = 0.058, SE/trueSD = 1.13). Jackknife row
  reports α=0.05 rejection = 0.45, SE/trueSD = 0.46 — documented anti-
  conservatism from the stratified jackknife formula with 2 PSUs per
  stratum (1 effective DoF per stratum, Rust & Rao 1996 limitation).

REGISTRY.md §SyntheticDiD
-------------------------
* Survey support matrix updated: all three variance methods now
  support strata/PSU/FPC (not just bootstrap).
* Two new landed Notes:
  - "Note (survey + placebo composition)": stratified-permutation
    allocator, weighted-FW refit, ω_eff composition, fit-time
    feasibility guards (Case B / Case C), scope note on what is NOT
    randomized (within-stratum PSU axis). Cites Pesarin (2001) /
    Pesarin & Salmaso (2010).
  - "Note (survey + jackknife composition)": PSU-level LOO algorithm,
    explicit stratum-aggregation SE² formula, FPC handling (population-
    count form from survey.py:338-356), fixed-weights rationale,
    degenerate-LOO skip semantics, scope note, known anti-conservatism
    with few PSUs per stratum. Cites Rust & Rao (1996).
* "Allocator asymmetry" paragraph in the survey support matrix
  documents the intentional asymmetry (placebo ignores PSU, jackknife
  respects it) with rationale rooted in each method's role (null-
  distribution test vs design-based variance approximation).
* Coverage MC table adds the stratified_survey × jackknife row with
  anti-conservatism narrative; placebo row explicitly marked N/A-on-
  this-DGP (with pointer to the unit-test coverage).
* Requirements checklist entries updated to describe full-design
  support for placebo and jackknife.
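For orientation, the stratum-aggregation SE² referenced in the jackknife Note follows the standard stratified delete-one-PSU jackknife shape (Rust & Rao 1996); a sketch of that textbook form, with f_h the stratum sampling fraction supplying the FPC (the repo's exact population-count FPC variant lives in survey.py:338-356 and may differ in detail):

```latex
\widehat{\mathrm{SE}}^{2}
  \;=\; \sum_{h=1}^{H} (1 - f_h)\, \frac{n_h - 1}{n_h}
        \sum_{j=1}^{n_h} \left( \hat{\tau}_{(hj)} - \bar{\hat{\tau}}_{h} \right)^{2}
```

where n_h is the number of PSUs in stratum h, \hat{\tau}_{(hj)} the re-estimate with PSU j of stratum h deleted, and \bar{\hat{\tau}}_{h} the stratum mean of those re-estimates. With n_h = 2 each stratum contributes a single squared contrast, which is the one-effective-DoF limitation behind the documented anti-conservatism.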

Docs sweep
----------
* docs/methodology/survey-theory.md: new bullets describing the
  stratified-permutation placebo allocator and the PSU-level LOO
  jackknife, parallel to the existing hybrid-bootstrap bullet.
* docs/tutorials/16_survey_did.ipynb cell 35: support matrix SDID
  row updated from "bootstrap only (PR #352)" to "Full (all three
  variance methods)"; legend amended; "Note on SyntheticDiD" block
  rewritten to describe all three allocators with the jackknife
  few-PSU caveat.
* docs/survey-roadmap.md: Phase 5 matrix row closes the placebo/
  jackknife gap; Phase 6 bullet updated to describe all three
  allocators; Current Limitations table entry removed (only replicate-
  weight limitation remains, merged into one row).
* CHANGELOG.md: "### Added" entry for placebo + jackknife full-design
  support (no new section header — folded into existing Unreleased
  block); "### Changed (PR #355)" tweaked to note the separate
  follow-up for placebo/jackknife.
* TODO.md row 107 deleted (capability gap closed).
* diff_diff/synthetic_did.py __init__ docstring: survey_design
  parameter description rewritten to describe all three methods.
  Placebo fallback-guidance comment updated to remove stale "placebo
  and jackknife reject strata/PSU/FPC" line.
* diff_diff/guides/llms-full.txt: Phase 5 bootstrap bullet updated
  to describe all three survey allocators (UTF-8 fingerprint
  preserved — `D'Haultfœuille` still appears throughout).
* tests/test_methodology_sdid.py::TestCoverageMCArtifact: narrative
  and assertions updated to reflect that placebo=0-fits is expected
  structurally on stratified_survey (documented Case C), while
  jackknife now runs successfully with the known anti-conservatism
  caveat intentionally unasserted at the calibration-gate level.

Verification
------------
* pytest tests/test_survey_phase5.py::TestSDIDSurveyPlaceboFullDesign
  tests/test_survey_phase5.py::TestSDIDSurveyJackknifeFullDesign
  tests/test_survey_phase5.py::TestSyntheticDiDSurvey
  tests/test_methodology_sdid.py::{TestBootstrapSE,TestPlaceboSE,TestJackknifeSE,TestCoverageMCArtifact}
  tests/test_guides.py → 82 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026
…h harness docstring

P3 (Maintainability — survey jackknife still populated _loo_unit_ids):
``fit()`` was unconditionally setting ``_loo_unit_ids`` / ``_loo_roles``
on every jackknife fit, including full-design survey fits where the
underlying replicates are PSU-level and ``get_loo_effects_df()`` now
raises ``NotImplementedError``. Internal / canned guidance keyed off
``_loo_unit_ids is not None`` as the availability check (e.g.,
``practitioner.py``) would still call the accessor on a survey fit and
hit the new raise.

Fix: only populate ``_loo_unit_ids`` / ``_loo_roles`` when
``_loo_granularity == "unit"``; leave them ``None`` on the PSU path so
``_loo_unit_ids is not None`` correctly reports availability.
``_loo_granularity`` is the authoritative accessor gate; the legacy
``_loo_unit_ids`` sentinel now agrees with it.
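A minimal sketch of the tightened gating (attribute names mirror the commit message; the class itself is hypothetical, not the estimator):

```python
class FitState:
    """Illustrates the gate: LOO unit metadata is populated only for
    unit-granularity jackknife fits, left None on the PSU path."""

    def __init__(self, loo_granularity, unit_ids, roles):
        self._loo_granularity = loo_granularity     # authoritative gate
        if loo_granularity == "unit":
            self._loo_unit_ids = unit_ids
            self._loo_roles = roles
        else:                                       # PSU-level survey jackknife
            self._loo_unit_ids = None
            self._loo_roles = None

    def loo_effects_available(self):
        # Legacy sentinel now agrees with the granularity gate.
        return self._loo_unit_ids is not None

unit_fit = FitState("unit", ["a", "b"], ["control", "treated"])
psu_fit = FitState("psu", ["a", "b"], ["control", "treated"])
assert unit_fit.loo_effects_available()
assert not psu_fit.loo_effects_available()
```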

P3 (Documentation — harness docstring stale):
``coverage_sdid.py::_fit_one`` docstring said "fit() routes [survey
designs] through the bootstrap survey path (PR #352) when
method=='bootstrap'" — stale after the placebo + jackknife full-
design paths landed. Rewrote to describe the three method-specific
survey variance paths (weighted-FW + Rao-Wu bootstrap; stratified-
permutation + weighted-FW placebo; PSU-LOO + stratum-aggregation
jackknife) and mention the Case B-D ValueError failure modes alongside
NotImplementedError.

Verification: 94 passed (no behavior change on the gating fix — it's
a state-gating tightening, not a correctness change).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Labels

ready-for-ci Triggers CI test workflows
