Skip to content

HAD Phase 4.5 C: linearity-family pretests under survey#370

Merged
igerber merged 12 commits intomainfrom
had-phase-4.5-c
Apr 25, 2026
Merged

HAD Phase 4.5 C: linearity-family pretests under survey#370
igerber merged 12 commits intomainfrom
had-phase-4.5-c

Conversation

@igerber
Copy link
Copy Markdown
Owner

@igerber igerber commented Apr 25, 2026

Summary

  • Closes the Phase 4.5 C0 promise (PR HAD Phase 4.5 C0: QUG-under-survey decision gate #367 commit 29f8b12). Linearity-family pretests now accept weights= / survey= kwargs.
  • Stute family: PSU-level Mammen multiplier bootstrap via bootstrap_utils.generate_survey_multiplier_weights_batch (same kernel as PR HAD Phase 4.5 B: weighted mass-point 2SLS + event-study survey composition + sup-t bootstrap #363 sup-t). Joint Stute SHARES the multiplier matrix across horizons, preserving vector-valued empirical-process unit-level dependence + PSU clustering.
  • Yatchew: closed-form weighted OLS + pweight-sandwich variance components (no bootstrap). All three components reduce bit-exactly to existing unweighted formulas at w=ones(G) (locked at atol=1e-14).
  • Workflow: skips QUG with UserWarning per C0 deferral, sets qug=None on report, dispatches survey-aware sub-tests, appends "linearity-conditional verdict; QUG-under-survey deferred per Phase 4.5 C0" suffix.
  • HADPretestReport.qug retyped from QUGTestResults to Optional[QUGTestResults]; summary / to_dict / to_dataframe updated to None-tolerant rendering.
  • Replicate-weight survey designs (BRR/Fay/JK1/JKn/SDR) raise NotImplementedError at every entry point (defense in depth) — parallel follow-up.
  • 20 new tests; 158 pretest tests pass.

Methodology

Stute calibration (locked decision via plan-mode AskUserQuestion): PSU-level Mammen multipliers, NOT Rao-Wu rescaling — different mechanism. Reuses the kernel from PR #363's HAD event-study sup-t bootstrap. Per-obs perturbation eta_obs[g] = eta_psu[psu(g)], weighted OLS refit, weighted CvM via new _cvm_statistic_weighted. Joint Stute shares the (B, n_psu) matrix across horizons within each replicate.

Yatchew weighted variance components (Krieger-Pfeffermann 1997 §3 pair-weight convention):

  • sigma2_lin = sum(w·eps²) / sum(w)
  • sigma2_diff = sum(w_avg·diff²) / (2·sum(w)) with w_avg_g = (w_g+w_{g-1})/2. Divisor uses sum(w) (=G at w=1), NOT sum(w_avg), to match the existing (1/(2G)) unweighted formula at had_pretests.py:1635 (Reviewer CRITICAL Add fixed effects and absorb parameters to DifferenceInDifferences #2).
  • sigma4_W = sum(w_avg·prod) / sum(w_avg) reduces to (1/(G-1))·sum(prod) at w=1.
  • T_hr = sqrt(sum(w))·(sigma2_lin-sigma2_diff)/sigma2_W (effective-sample-size convention).

Trivial-survey-equivalence is DISTRIBUTIONAL (not bit-exact at atol=1e-10) — the survey path uses batched generate_survey_multiplier_weights_batch while unweighted uses per-iteration _generate_mammen_weights. Different RNG consumption ordering means same-seed produces different multiplier matrices but the bootstrap p-value distributions agree at large B (Reviewer CRITICAL #3 reframe).

Stability invariants preserved

  • Unweighted code paths bit-exact pre-PR (separate if branch for survey/weights). All 138 existing pretest tests pass unchanged.
  • Yatchew weighted variance components reduce to unweighted at w=1 at atol=1e-14 (locked by TestYatchewHRTestSurvey::test_weighted_reduces_to_unweighted_at_uniform_weights).
  • HADPretestReport schema bit-exact on the unweighted path; qug=None triggers None-tolerant rendering only on the survey path.

Files

File Change
diff_diff/had_pretests.py survey/weights kwargs + survey-aware bootstrap on 5 helpers + workflow dispatcher rewrite + _fit_weighted_ols_intercept_slope / _cvm_statistic_weighted / _resolve_pretest_unit_weights helpers + HADPretestReport.qug retyped to Optional
diff_diff/survey.py _make_trivial_resolved helper for pweight-shortcut routing
docs/methodology/REGISTRY.md Note (Phase 4.5 C) under QUG Null Test entry with full methodology
tests/test_had_pretests.py 20 new tests (TestHADPretestWorkflowSurveyGuards revised, TestStuteTestSurvey, TestYatchewHRTestSurvey, TestJointStuteSurvey)
CHANGELOG.md [Unreleased] Patch entry
TODO.md Replace Phase 4.5 C row with replicate-weight follow-up row

Test plan

  • pytest tests/test_had_pretests.py -v — 158/158 green (138 pre-PR + 20 new)
  • black diff_diff tests — clean
  • ruff check diff_diff tests — clean (auto-fixed import ordering)
  • Manual REPL smoke: stute_test(weights=), yatchew_hr_test(weights=), did_had_pretest_workflow(weights=) all functional with QUG-skip warning
  • CI

🤖 Generated with Claude Code

@igerber igerber added ready-for-ci Triggers CI test workflows and removed ready-for-ci Triggers CI test workflows labels Apr 25, 2026
@github-actions
Copy link
Copy Markdown

Overall Assessment
⛔ Blocker

Executive Summary

  • The new survey-aware Stute family has a silent zero-weight bug on the survey= path: zero-weight/domain-excluded units are neither dropped nor rejected, so they can create effective dose variation and contribute to the weighted CvM statistic. That can produce wrong pretest decisions with no warning.
  • The new pretest survey path silently accepts SurveyDesign(weight_type="aweight"|"fweight") even though the registry note and the existing HAD estimator code are pweight-only.
  • weights= does not interact correctly with the existing last-cohort auto-filter on the event-study path: staggered panels will hit a length-mismatch crash after filtering.
  • Replicate-weight deferral and Yatchew’s lack of PSU-cluster propagation are documented/tracked, so I did not count those as blockers.

Methodology
Affected methods: stute_test, stute_joint_pretest, joint_pretrends_test, joint_homogeneity_test, yatchew_hr_test, and did_had_pretest_workflow.

  • P0 Impact: the Stute survey path allows zero survey weights, but the implementation still checks dose variation on the full unweighted d vector and _cvm_statistic_weighted() still sums over every row. That means zero-weight/domain-excluded units can create variation and CvM mass in what should be the excluded part of the sample. In the extreme case where all positive-weight units share the same dose and only zero-weight units vary, the code can return a conclusive Stute result instead of an inconclusive/degenerate one. This is silent wrong statistical output on supported survey= inputs. References: diff_diff/had_pretests.py:L1016-L1052, diff_diff/had_pretests.py:L1548-L1552, diff_diff/had_pretests.py:L1562-L1655, diff_diff/had_pretests.py:L2598-L2741. Concrete fix: mask to w > 0 before every Stute-family dose-variation check, sort/CvM computation, and bootstrap refit; alternatively reject zero survey weights consistently at every survey-aware Stute/workflow entry point until true domain-weight handling is implemented.

  • P1 Impact: the new survey-aware pretests never check resolved.weight_type, so aweight and fweight survey designs will run through pweight formulas with no warning. That is an undocumented methodology deviation from the new registry note and from the existing HAD survey guard behavior. References: docs/methodology/REGISTRY.md:L2434-L2441, diff_diff/had_pretests.py:L1535-L1541, diff_diff/had_pretests.py:L1834-L1846, diff_diff/had_pretests.py:L2600-L2606, diff_diff/had_pretests.py:L2763-L2815, diff_diff/had.py:L2976-L2994. Concrete fix: reject weight_type != "pweight" in _resolve_pretest_unit_weights and in the direct-helper survey= branches.

Code Quality

  • P1 Impact: the new weights= event-study path is broken on staggered panels that trigger last-cohort filtering. joint_pretrends_test, joint_homogeneity_test, and did_had_pretest_workflow(..., aggregate="event_study") validate/filter to data_filtered, then call _aggregate_unit_weights on the original row-level weights array, so the length check fails after filtering. This is exactly the new-parameter interaction the edge-case checklist calls out. References: diff_diff/had_pretests.py:L2925-L2991, diff_diff/had_pretests.py:L3124-L3192, diff_diff/had_pretests.py:L3414-L3467, diff_diff/had.py:L1604-L1645. Concrete fix: subset the row-aligned weights array to data_filtered before re-aggregating, using the same positional-index pattern already used in HeterogeneousAdoptionDiD.fit().

Performance
No material findings in the changed lines.

Maintainability
No additional maintainability findings beyond the bugs above.

Tech Debt

  • P3 Impact: replicate-weight pretests are deferred, but that limitation is explicitly tracked and guarded, so it is not a blocker. References: TODO.md:L98-L100, diff_diff/had_pretests.py:L1478-L1485, diff_diff/had_pretests.py:L1804-L1809, diff_diff/had_pretests.py:L2418-L2423. Concrete fix: none required in this PR.

Security
No findings.

Documentation/Tests

  • P3 Impact: REGISTRY.md now contains conflicting methodology statements: the old Phase 4.5 C0 note still says the workflow surface is gated and that Phase 4.5 C uses Rao-Wu rescaling, while the new Phase 4.5 C note and code say the workflow is enabled and Stute uses PSU-level multiplier bootstrap. The new tests also never exercise the broken staggered-weights= path or the zero-weight survey Stute path. References: docs/methodology/REGISTRY.md:L2428-L2445, diff_diff/had_pretests.py:L609-L618, diff_diff/had_pretests.py:L2263-L2269, tests/test_had_pretests.py:L2909-L2955, tests/test_had_pretests.py:L3241-L3356. Concrete fix: rewrite the C0 note as historical/narrow it to qug_test, update the stale docstrings, and add regressions for zero-weight survey designs and staggered event-study weights=.

Path to Approval

  1. Make the survey-aware Stute family domain-consistent: drop w==0 units before variation checks/CvM/bootstrap, or reject zero survey weights everywhere until that handling exists; add a regression where only zero-weight units carry dose variation.
  2. Add pweight-only guards to all pretest survey entry points and add aweight/fweight rejection tests.
  3. Subset row-level weights= after event-study last-cohort filtering before calling the joint wrappers/workflow, and add staggered-panel regressions for both the direct wrappers and did_had_pretest_workflow(..., aggregate="event_study", weights=...).

Static review only: I couldn’t run pytest in this sandbox because the available interpreter does not have project dependencies installed (numpy is missing).

igerber added a commit that referenced this pull request Apr 25, 2026
R1 P0 — Stute survey path silently accepted zero-weight units, which
leak into the dose-variation check + CvM cusum + bootstrap refit while
contributing zero population mass. Extreme case: only zero-weight units
carry dose variation -> spurious finite test statistic with no warning.
Fix: strictly-positive guards on every survey-aware Stute / Yatchew /
workflow entry point (the weights= shortcut already had this; survey=
branch was the gap).

R1 P1 #1 — aweight/fweight survey designs slipped through pweight-only
formulas silently (the variance components are derived assuming pweight
sandwich semantics). Fix: weight_type='pweight' guards added in
_resolve_pretest_unit_weights and on every direct-helper survey= branch
(stute_test, yatchew_hr_test, stute_joint_pretest). Mirrors HAD.fit
guard at had.py:2976 + survey._resolve_pweight_only at survey.py:914.

R1 P1 #2 — workflow's row-level weights= crashed on staggered event-
study panels because _validate_multi_period_panel filters to last
cohort but the joint wrappers re-aggregate with the original full-
panel weights array. Fix: subset joint_weights to data_filtered's
rows via data.index.get_indexer(data_filtered.index) BEFORE passing
to the wrappers. Mirrors HeterogeneousAdoptionDiD.fit positional-
index pattern. Survey= path is unaffected (column references resolve
internally on data_filtered).

R1 P3 — REGISTRY C0 note still said "the same gate applies to
did_had_pretest_workflow" and "Phase 4.5 C uses Rao-Wu rescaling"; both
are stale post-C. Updated to clarify (a) workflow gate was temporary
and is now closed by C, (b) qug_test direct-helper gate remains
permanent, (c) C uses PSU-level Mammen multiplier bootstrap (NOT
Rao-Wu rescaling).

7 new tests in TestPhase45CR1Regressions covering: zero-weight survey
on stute_test / stute_joint_pretest / workflow; aweight rejection on
stute_test / workflow; fweight rejection on yatchew_hr_test; staggered
event-study workflow with weights= (catches the length-mismatch crash).
165 pretest tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 25, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: fb032672790cddb0cbc4d7df7a283fdfe54c7ecd


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The prior zero-weight-survey and non-pweight guard findings look addressed on the direct-helper and workflow paths.
  • One prior P1 remains: the new weights= parameter still breaks on staggered panels for the direct public wrappers joint_pretrends_test() and joint_homogeneity_test().
  • There is also a new methodology-level inconsistency: the Phase 4.5 C registry note and the Stute survey implementation do not agree on the bootstrap perturbation law.
  • Replicate-weight deferral is properly guarded and tracked in TODO.md, so I did not count it as a blocker.
  • In-code docstrings/tests are not fully synced to the shipped Phase 4.5 C behavior.

Methodology

  • P1 — Affected methods: stute_test, stute_joint_pretest, joint_pretrends_test, joint_homogeneity_test, and the survey/weights branch of did_had_pretest_workflow. Impact: the new registry note documents the survey Stute bootstrap as dy_b = fitted + eps * w * eta_obs, but the implementation in both Stute paths uses dy_b = fitted + eps * eta_obs. Because that perturbation defines the bootstrap law, this is not a cosmetic wording issue: either the registry note or the code is wrong, and the PR does not document which. The underlying HAD paper’s Appendix D bootstrap perturbs residuals with a wild multiplier and then recomputes the statistic, while Theorem 7 keeps the unweighted Yatchew formulas as 1/(2G), 1/(G-1), and sqrt(G), so the unweighted branches still look paper-faithful; the ambiguity is specific to the new weighted Stute extension. Refs: docs/methodology/REGISTRY.md:L2434-L2435, diff_diff/had_pretests.py:L1656-L1678, diff_diff/had_pretests.py:L2760-L2785. Concrete fix: reconcile code and REGISTRY.md to one derivation-backed perturbation rule and lock it with a targeted regression test. (researchgate.net)

Code Quality

  • P1 — Impact: the direct public wrappers joint_pretrends_test(..., weights=...) and joint_homogeneity_test(..., weights=...) still crash on staggered event-study panels. Both wrappers call _validate_had_panel_event_study() and may filter to data_filtered, but then pass the original row-aligned weights array into _resolve_pretest_unit_weights(data_filtered, ...), which expects alignment with the filtered rows. The workflow path now subsets weights correctly before calling the wrappers, but the wrappers themselves still do not, so the new parameter remains broken on one public code path. Refs: diff_diff/had_pretests.py:L3003-L3013, diff_diff/had_pretests.py:L3064-L3099, diff_diff/had_pretests.py:L3202-L3211, diff_diff/had_pretests.py:L3267-L3297, compare the workflow fix at diff_diff/had_pretests.py:L3492-L3515. Concrete fix: in both wrappers, subset row-level weights to data_filtered.index via data.index.get_indexer(data_filtered.index) before calling _resolve_pretest_unit_weights, mirroring did_had_pretest_workflow() / HeterogeneousAdoptionDiD.fit().

Performance

  • No findings.

Maintainability

  • No additional findings beyond the code/registry/docstring inconsistencies above.

Tech Debt

  • P3 — Impact: replicate-weight survey pretests are deferred, but that limitation is explicitly tracked and consistently guarded, so it is not a blocker. Refs: TODO.md:L98-L100, diff_diff/had_pretests.py:L1478-L1485, diff_diff/had_pretests.py:L1827-L1832, diff_diff/had_pretests.py:L2448-L2453, diff_diff/had_pretests.py:L2863-L2868. Concrete fix: none required in this PR.

Security

  • No findings.

Documentation/Tests

  • P3 — Impact: the in-code docs are not fully synchronized with Phase 4.5 C. qug_test still tells readers that survey-aware Stute “admits a Rao-Wu rescaled bootstrap,” and HADPretestReport still documents the old event-study all_pass contract that requires finite QUG output, both of which contradict the new registry note/code. The new tests also cover the workflow-level staggered-weights= regression, but not the direct-wrapper staggered path that still fails. Refs: diff_diff/had_pretests.py:L609-L618, diff_diff/had_pretests.py:L1201-L1215, tests/test_had_pretests.py:L3241-L3278, tests/test_had_pretests.py:L3526-L3549. Concrete fix: sync the docstrings to the shipped Phase 4.5 C contract and add staggered-panel weights= regressions for the two direct wrappers.

Path to Approval

  1. Fix joint_pretrends_test() and joint_homogeneity_test() to subset row-aligned weights= after last-cohort filtering, before _resolve_pretest_unit_weights.
  2. Reconcile the survey Stute bootstrap methodology end-to-end: either update the implementation to the registry’s documented perturbation rule or correct REGISTRY.md (and surrounding docstrings) to the implemented rule, then add a regression that pins the intended law.
  3. Add direct-wrapper staggered-panel tests for joint_pretrends_test(..., weights=...) and joint_homogeneity_test(..., weights=...), plus doc updates for the final approved survey-bootstrap contract.

Static review only: I could not run the test suite in this sandbox because the available Python environment is missing project dependencies (numpy, pandas).

@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 25, 2026

/ai-review

igerber added a commit that referenced this pull request Apr 25, 2026
R2 P1 #1 (Code Quality) -- joint_pretrends_test and joint_homogeneity_test
direct calls still crashed on staggered panels because the staggered-
weights subset fix from R1 was only applied at the workflow level. The
wrappers run their own _validate_had_panel_event_study() and may filter
to data_filtered, then passed the original full-panel weights array to
_resolve_pretest_unit_weights(data_filtered, ...) which expects the
filtered row count. Fix: subset row-level weights to data_filtered.index
positions (via data.index.get_indexer) BEFORE _resolve_pretest_unit_weights,
mirroring the workflow fix.

R2 P1 #2 (Methodology) -- REGISTRY note documented the bootstrap
perturbation as `dy_b = fitted + eps * w * eta_obs`, but the code does
`dy_b = fitted + eps * eta_obs` (no `* w`). Code is correct: paper
Appendix D wild-bootstrap perturbs UNWEIGHTED residuals; weighting flows
through the OLS refit and the weighted CvM, not through the perturbation.
Adding `* w` would over-weight by w². Fix: update REGISTRY note to
remove the spurious `* w` and clarify the canonical form. Add a
regression that pins (a) bit-exact cvm_stat reduction at uniform weights,
(b) bootstrap p-value distributional agreement within Monte-Carlo noise.

R2 P3 -- in-code docstrings still referenced the pre-Phase-4.5-C contract:
- qug_test docstring said survey-aware Stute "admits a Rao-Wu rescaled
  bootstrap" (PSU-level Mammen multiplier bootstrap is what shipped).
  Updated to reflect the correct mechanism.
- HADPretestReport.all_pass docstring described the unweighted contract
  only; survey/weights path drops the QUG-conclusiveness gate
  (linearity-conditional admissibility per C0 deferral). Updated.

3 new regression tests in TestPhase45CR1Regressions:
- test_joint_pretrends_test_staggered_weights_subset
- test_joint_homogeneity_test_staggered_weights_subset
- test_stute_survey_perturbation_does_not_double_weight (locks the
  perturbation form via cvm_stat bit-exact reduction + p-value MC bound)

168 pretest tests pass (was 165 after R1).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 085d8ebfa23667930bed2cd292717f1a3b38dfcb


Overall Assessment

Blocker

Executive Summary

  • The two prior re-review P1s appear fixed: the registry/code Stute perturbation law is now consistent, and the staggered-panel weights= alignment bug is patched in both direct joint wrappers and the workflow.
  • The new survey Stute implementation still has a blocker on variance-unidentified survey designs: an unstratified single-PSU design can yield a finite near-zero p-value instead of NaN/inconclusive.
  • That failure propagates into both direct Stute helpers and did_had_pretest_workflow, so the workflow can falsely mark TWFE inadmissible on a survey edge case.
  • Replicate-weight deferral is consistently guarded and properly tracked in TODO.md, so I did not count it as a blocker.
  • Some in-code docs remain stale: one qug_test error path still says Phase 4.5 C is “planned” and names Rao-Wu, and the direct joint-helper docstrings still omit their new weights/survey kwargs.

Methodology

  • P0 Affected methods: stute_test, stute_joint_pretest, joint_pretrends_test, joint_homogeneity_test, and both workflow branches that call them. Impact: on variance-unidentified survey designs, the new bootstrap path can emit materially wrong statistical output instead of an inconclusive result. The clearest case is an unstratified single-PSU design: the survey layer already documents that this should yield NaN variance, and the multiplier helper explicitly treats it as “variance unidentified” by returning an all-zero multiplier matrix. But the new Stute code then treats that zero matrix as a valid bootstrap law and computes p_value = (1 + sum(S_b >= S)) / (B + 1), which collapses to p ≈ 1/(B+1) whenever the observed CvM statistic is positive. That is a false rejection surface, not just a conservative choice. Refs: docs/methodology/REGISTRY.md:L2939-L2941, diff_diff/bootstrap_utils.py:L548-L561, diff_diff/had_pretests.py:L1664-L1685, diff_diff/had_pretests.py:L2770-L2793, diff_diff/had_pretests.py:L3565-L3598, diff_diff/had_pretests.py:L3690-L3749. Concrete fix: add the same NaN/inconclusive gate the rest of the survey stack uses for variance-unidentified designs before converting survey multipliers into a p-value. At minimum, unstratified n_psu < 2 must return p_value=NaN, reject=False, and an explicit warning; more generally, the Stute survey path should distinguish “unidentified variance” from “legitimate zero variance” instead of blindly accepting an all-zero multiplier matrix.

Code Quality

No findings. The prior staggered-panel weights= propagation bug looks fixed in the direct wrappers and the workflow.

Performance

No findings.

Maintainability

No findings beyond the documentation drift below.

Tech Debt

  • P3 Replicate-weight survey pretests are still deferred, but that limitation is explicitly tracked and guarded at every entry point, so it is not blocking. Refs: TODO.md:L98-L100, diff_diff/had_pretests.py:L1482-L1488, diff_diff/had_pretests.py:L1831-L1835, diff_diff/had_pretests.py:L2452-L2456, diff_diff/had_pretests.py:L2867-L2871. Concrete fix: none required in this PR.

Security

No findings.

Documentation/Tests

  • P3 The in-code documentation is still not fully synchronized with the shipped Phase 4.5 C contract. qug_test’s NotImplementedError text still says survey-aware Stute is “planned” and “admits a Rao-Wu rescaled bootstrap,” while stute_joint_pretest, joint_pretrends_test, and joint_homogeneity_test do not document their newly added weights/survey kwargs at all. Refs: diff_diff/had_pretests.py:L1247-L1268, diff_diff/had_pretests.py:L2399-L2443, diff_diff/had_pretests.py:L2931-L2954, diff_diff/had_pretests.py:L3155-L3168. Concrete fix: update the stale qug_test message and add parameter/notes coverage for the joint helper survey contract.
  • Validation note: static review only. I could not run the test suite in this sandbox because pytest is not installed.

Path to Approval

  1. In the survey Stute paths, detect variance-unidentified survey designs before p-value computation and return an inconclusive result (p_value=NaN, reject=False, warning) instead of calibrating against an all-zero multiplier matrix.
  2. Add regressions for stute_test, stute_joint_pretest, and did_had_pretest_workflow covering at least one unstratified single-PSU survey design and asserting the NaN/inconclusive surface.
  3. Sync the remaining stale direct-helper docs (qug_test error text and the joint-helper docstrings) to the final Phase 4.5 C methodology.

@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 25, 2026

/ai-review

igerber added a commit that referenced this pull request Apr 25, 2026
R3 P0 (Methodology) -- variance-unidentified survey-design guard.
generate_survey_multiplier_weights_batch returns an all-zero (B, n_psu)
multiplier matrix when n_psu - n_strata <= 0 (e.g. unstratified
single-PSU design where df_survey = n_psu - 1 = 0). Without a guard,
the Stute survey path treated zero perturbations as a valid bootstrap
law and emitted p_value = 1/(B+1) for any positive observed CvM
(spurious rejection on a survey edge case).

Fix: detect df_survey <= 0 BEFORE calling the multiplier helper and
return a NaN p_value with a UserWarning, matching the broader package
convention for variance-unidentified survey designs (mirrors
compute_survey_vcov NaN treatment, ResolvedSurveyDesign.df_survey
semantics). Applied in stute_test (single-horizon) and
stute_joint_pretest (joint variant); the data-in wrappers and the
workflow propagate via the same calls.

R3 P3 -- in-code docstrings still drifted on three surfaces:
- qug_test NotImplementedError text said survey-aware Stute is
  "planned" and "admits a Rao-Wu rescaled bootstrap"; both are stale
  post-Phase-4.5-C. Updated to reflect the shipped reality (PSU-level
  Mammen multiplier bootstrap; QUG step is skipped under survey/weights
  in the workflow).
- joint_pretrends_test, joint_homogeneity_test, stute_joint_pretest
  Parameters sections did not document the new weights= / survey=
  kwargs at all. Added with the per-test contract (per-unit
  aggregation, staggered subsetting, replicate/aweight/fweight
  rejection, variance-unidentified guard).

3 new regression tests in TestPhase45CR1Regressions:
- test_stute_test_single_psu_unstratified_returns_nan
- test_stute_joint_pretest_single_psu_unstratified_returns_nan
- test_workflow_single_psu_propagates_nan_through_stute (also documents
  Yatchew's PSU-agnostic-by-design behavior; verdict carries the
  linearity-conditional suffix)

171 pretest tests pass (was 168 after R2).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 39b29a97d31633f538805b2db957e163c3160c96


Overall Assessment

Blocker

Executive Summary

  • The prior blocker on variance-unidentified survey Stute looks fixed: both stute_test and stute_joint_pretest now gate df_survey <= 0 and return NaN/inconclusive instead of spuriously rejecting.
  • The stale Phase 4.5 C docs called out in the last review also appear updated.
  • New blocker: the weighted Yatchew path is not pweight-scale invariant. SurveyDesign.resolve() normalizes pweights to mean 1, but the new weights= shortcut feeds raw weights into sqrt(sum(w)), so the same design can yield different Yatchew p-values depending on whether the caller uses weights=, survey=SurveyDesign(weights=...), or just rescales weights by a constant.
  • New P1: the new direct-helper weights= entries do not enforce 1-D input. A common (G, 1) array from df[["w"]].to_numpy() will broadcast through the weighted moments/CvM calculations and silently corrupt results instead of raising.
  • Replicate-weight deferral is consistently guarded and properly tracked in TODO.md; I did not count it against the assessment.
  • Static review only; I could not run the suite in this sandbox because numpy/pytest are unavailable.

Methodology

  • P0 docs/methodology/REGISTRY.md:L2436-L2440, diff_diff/survey.py:L189-L205, diff_diff/had_pretests.py:L1911-L1925, diff_diff/had_pretests.py:L2097-L2101, diff_diff/had_pretests.py:L2916-L2930, diff_diff/had_pretests.py:L3790-L3795.
    Impact: the new yatchew_hr_test(weights=...) and the workflow’s overall-path weights= branch use raw weights in sqrt(sum(w)), while the survey= path uses SurveyDesign.resolve()’s mean-1-normalized pweights. That makes the test statistic depend on arbitrary weight scale: w, 100*w, and survey=SurveyDesign(weights=...) can produce materially different p-values/verdicts for the same design. Because the workflow treats any conclusive Yatchew rejection as disqualifying, this can silently flip the overall pretest verdict.
    Concrete fix: normalize all pweight inputs to the same mean-1 convention before weighted pretest arithmetic, or otherwise use a scale-invariant effective-sample-size rule and apply it identically on both weights= and survey= paths. Add regressions for weights=w vs weights=100*w and for weights= vs survey=SurveyDesign(weights="w") on both yatchew_hr_test and did_had_pretest_workflow(..., aggregate="overall").

Code Quality

  • P1 diff_diff/had_pretests.py:L1573-L1584, diff_diff/had_pretests.py:L1912-L1923, diff_diff/had_pretests.py:L2701-L2712, diff_diff/survey.py:L706-L716.
    Impact: the new direct-helper weights= paths accept non-1-D arrays. If a caller passes a column vector such as df[["w"]].to_numpy(), NumPy broadcasting changes the weighted OLS moments / CvM sums instead of raising, so stute_test and yatchew_hr_test can return silently wrong numbers and stute_joint_pretest can miscompute or fail unpredictably.
    Concrete fix: validate weights (and any direct survey.weights) as strictly 1-D before use, following the existing _validate_1d_numeric / .ravel() pattern used elsewhere. Add regressions with (G, 1) inputs for all three direct helpers.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • P3 TODO.md:L98-L100, diff_diff/had_pretests.py:L1485-L1492, diff_diff/had_pretests.py:L1861-L1866, diff_diff/had_pretests.py:L2497-L2502, diff_diff/had_pretests.py:L2938-L2943.
    Impact: replicate-weight pretests remain deferred, but the limitation is explicitly tracked and guarded at every entry point.
    Concrete fix: none required in this PR.

Security

  • No findings.

Documentation/Tests

  • P2 tests/test_had_pretests.py:L3162-L3175, tests/test_had_pretests.py:L3052-L3068.
    Impact: the new coverage only smoke-tests weights= and survey=_make_trivial_resolved(w) separately; it does not exercise the real SurveyDesign.resolve() normalization path or weight-scale invariance, so the blocker above slips through. It also does not cover (G,1) weight rejection.
    Concrete fix: add tests for weights= vs survey=SurveyDesign(weights="w"), w vs c*w, and (G,1) weight arrays on the new public helper paths.

Path to Approval

  1. Make weighted Yatchew invariant to arbitrary global pweight scaling and identical across weights= and survey= entry paths.
  2. Add regressions covering weights=w vs weights=100*w and weights= vs survey=SurveyDesign(weights="w") for yatchew_hr_test and the overall workflow.
  3. Reject non-1-D weights / survey.weights at all new direct-helper entry points and add (G,1) regression tests.

@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 25, 2026

/ai-review

igerber added a commit that referenced this pull request Apr 25, 2026
R4 P0 (Methodology) -- Yatchew test statistic was not invariant to
uniform pweight rescaling. The formula `T_hr = sqrt(sum(w)) * (...)`
makes T_hr scale as sqrt(c) under weights -> w * c, so weights=w and
weights=100*w produced different p-values for the same design. Worse,
SurveyDesign.resolve() normalizes pweights to mean=1 internally, so
the survey= entry path and the weights= shortcut disagreed numerically.

Fix: normalize per-unit pweights to mean=1 at every helper entry
(stute_test, yatchew_hr_test, stute_joint_pretest) and at the workflow
resolution helper. Matches SurveyDesign.resolve() convention; makes
the Yatchew statistic scale-invariant; ensures weights=w and
survey=SurveyDesign(weights="w") produce identical results for the
same design. Stute is internally scale-invariant in functional form
but normalization is required so the bootstrap helper sees the same
weight vector under both entry paths (cross-path numerical agreement).

R4 P1 (Code Quality) -- column-vector weights (e.g. `df[["w"]].to_numpy()`
producing (G, 1)) silently broadcast through weighted moments / CvM
sums instead of raising. Fix: validate via `_validate_1d_numeric` on
all `weights=` arrays in stute_test, yatchew_hr_test, stute_joint_pretest;
add explicit ndim check in `_resolve_pretest_unit_weights` with a
hint about the common df[["w"]].to_numpy() mistake.

6 new regression tests in TestPhase45CR1Regressions:
- test_yatchew_weights_scale_invariant (weights=w vs weights=100*w)
- test_stute_weights_scale_invariant (mirror for Stute)
- test_workflow_weights_eq_survey_at_overall_path (weights= shortcut
  and survey=SurveyDesign(...) produce identical Yatchew + Stute
  results, atol=1e-10)
- test_stute_test_rejects_2d_weights / test_yatchew_hr_test_rejects_2d_weights
  / test_workflow_rejects_2d_weights (column-vector rejection at all
  three direct-helper / workflow entry points)

177 pretest tests pass (was 171 after R3).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: f8e959cb5cd242e8959401631cecc48644e0ff07


Overall Assessment

⚠️ Needs changes — one unmitigated P1 remains in the new survey Stute calibration.

Executive Summary

  • The prior blockers on raw-weight scale dependence and 1-D weights= validation look addressed in this rerun.
  • New P1: the Stute-family survey guard treats every df_survey <= 0 design as bootstrap-degenerate, but the shared survey multiplier kernel has a nonzero pooled-singleton branch for lonely_psu="adjust". That makes some singleton-strata survey designs spuriously inconclusive, or at minimum undocumentedly unsupported.
  • Replicate-weight deferral is consistently guarded and tracked in TODO.md, so I did not count it against the assessment.
  • Static review only: this sandbox does not have numpy / pytest, so I could not execute the added tests.

Methodology

  • Severity: P1. Impact: stute_test and stute_joint_pretest now return NaN for every df_survey <= 0 design (diff_diff/had_pretests.py:L1674-L1703, diff_diff/had_pretests.py:L2842-L2868). That is correct for truly degenerate all-zero-multiplier cases, but it is too broad for lonely_psu="adjust": the shared bootstrap helper explicitly pools singleton strata and can generate nonzero multipliers there (diff_diff/bootstrap_utils.py:L587-L651), and the survey variance scaffolding separately treats singleton-adjust designs as variance-computable (diff_diff/survey.py:L1566-L1572). The Phase 4.5 C registry note advertises Stute-family survey support without documenting a singleton-adjust carve-out (docs/methodology/REGISTRY.md:L2434-L2445). As written, those designs become inconclusive on the overall and event-study workflow paths even though the helper contract is nondegenerate. Concrete fix: either (1) replace the df_survey <= 0 proxy with a guard on actual bootstrap degeneracy and allow nonzero pooled-singleton cases, or (2) reject lonely_psu="adjust" singleton-strata Stute calibrations explicitly with NotImplementedError and document that carve-out, mirroring the explicit lonely-PSU handling already used for the sup-t bootstrap (diff_diff/had.py:L2081-L2118).

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

  • Severity: P3. Impact: replicate-weight survey pretests remain deferred, but the limitation is explicitly tracked and the new entry points reject those designs up front (TODO.md:L98-L100). Concrete fix: none required in this PR.

Security

No findings.

Documentation/Tests

  • Severity: P2. Impact: the new survey regressions cover lonely_psu="remove" and the single-PSU degenerate case, but they do not pin the pooled-singleton lonely_psu="adjust" branch in the shared bootstrap helper (tests/test_had_pretests.py::TestStuteTestSurvey, ::TestJointStuteSurvey, ::TestPhase45CR1Regressions). That is how the P1 above can slip through without an explicit contract decision. Concrete fix: add lonely_psu="adjust" singleton-strata regressions for stute_test, stute_joint_pretest, and did_had_pretest_workflow, asserting the intended behavior explicitly (finite p-value if supported, or NotImplementedError if intentionally unsupported).

Path to Approval

  1. Decide the contract for lonely_psu="adjust" singleton-strata Stute calibrations, and implement it explicitly: either support the nondegenerate pooled-singleton bootstrap path or reject it with a targeted NotImplementedError.
  2. Update docs/methodology/REGISTRY.md and CHANGELOG.md to match that contract.
  3. Add focused regressions for lonely_psu="adjust" singleton-strata survey designs on the single-horizon Stute, joint Stute, and workflow surfaces.

@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 25, 2026

/ai-review

igerber added a commit that referenced this pull request Apr 25, 2026
R5 P1 (Methodology) -- Stute survey df_survey<=0 guard was too broad.
generate_survey_multiplier_weights_batch's pooled-singleton branch for
lonely_psu='adjust' produces NONZERO multipliers, so the helper contract
is non-degenerate even when df_survey=0 (one PSU per stratum).
The previous guard rejected those designs as variance-unidentified
when actually the issue is methodological: the analytical variance
target requires a pseudo-stratum centering transform that has not
been derived for the Stute CvM.

Fix: mirror HAD sup-t bootstrap's explicit lonely_psu='adjust'
rejection at had.py:2081-2118. Add new helper
_has_lonely_psu_adjust_singletons in had_pretests.py. stute_test and
stute_joint_pretest now reject lonely_psu='adjust' with singleton
strata via NotImplementedError BEFORE the df_survey<=0 guard.
The guard remains for genuinely degenerate designs (single-PSU
unstratified, or one-PSU-per-stratum under remove/certainty where
the helper returns all-zero multipliers).

R5 P2 -- new regressions cover lonely_psu='adjust' rejection on both
single-horizon and joint-Stute helpers; also a positive control showing
lonely_psu='remove' singletons still flow through the existing
df_survey-based NaN path. 3 new tests in TestPhase45CR1Regressions.

REGISTRY note clarifies the carve-out: 'adjust' with singletons raises
NotImplementedError; 'remove' / 'certainty' produce all-zero singleton
multipliers and return NaN via the variance-unidentified guard.

180 pretest tests pass (was 177 after R4).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 646e207f8b5f7401953c1c71e90b7724c043f228


Overall Assessment

⚠️ Needs changes — one P1 edge-case bug remains in the new survey/event-study workflow, and one P1 coverage gap leaves key survey branches unpinned.

Executive Summary

  • The prior lonely_psu='adjust' singleton-strata blocker looks resolved: the Stute paths now reject it explicitly, and the carve-out is documented in docs/methodology/REGISTRY.md:L2444-L2444 and enforced in diff_diff/had_pretests.py:L1712-L1723, diff_diff/had_pretests.py:L2899-L2911.
  • I did not find an unmitigated paper/registry mismatch in the new survey methodology itself; the major departures from the unweighted paper path are documented in docs/methodology/REGISTRY.md:L2428-L2446.
  • P1: did_had_pretest_workflow(..., aggregate="event_study", survey=/weights=) resolves and validates weights on the full panel before the last-cohort filter, so units that would be dropped can still cause hard failures. diff_diff/had_pretests.py:L2978-L3079, diff_diff/had_pretests.py:L3676-L3751
  • P1: the new survey= parameter is not tested on the main successful event-study/post-treatment survey branches. Current survey smokes are mostly overall-path or trivial/no-PSU designs, so the new PSU-aware calibration surface is still under-covered. tests/test_had_pretests.py:L2936-L2955, tests/test_had_pretests.py:L3060-L3069, tests/test_had_pretests.py:L3169-L3175, tests/test_had_pretests.py:L3261-L3298, tests/test_had_pretests.py:L3526-L3595
  • Replicate-weight deferral is consistently documented and tracked, so I did not count it against approval. docs/methodology/REGISTRY.md:L2442-L2444, TODO.md:L98-L100
  • Static review only: this sandbox does not have pytest or numpy, so I could not execute the added tests.

Methodology

  • Severity: P3. Impact: The survey extension is a documented library-specific adaptation rather than a paper-literal survey recipe: QUG is skipped under survey, Stute uses PSU-level Mammen multipliers, Yatchew uses weighted variance components without PSU-cluster correction, replicate-weight designs are deferred, and lonely_psu='adjust' singleton strata are explicitly unsupported. Those choices are all recorded in docs/methodology/REGISTRY.md:L2428-L2446, so I do not count them as defects. Concrete fix: None required for this PR.

Code Quality

  • Severity: P1. Impact: The event-study workflow resolves weights/survey on the full input panel before _validate_multi_period_panel() applies the staggered last-cohort filter. Because _resolve_pretest_unit_weights() enforces strictly positive per-unit weights on whatever data it sees, zero-weight or otherwise invalid units that would be dropped by the filter can still abort a valid event-study run. This is inconsistent with the wrapper-level fix that already subsets row-level weights= after filtering. diff_diff/had_pretests.py:L2978-L3079, diff_diff/had_pretests.py:L3676-L3751 Concrete fix: For aggregate="event_study", do not call _resolve_pretest_unit_weights() on the full panel. Set use_survey_path = (survey is not None or weights is not None), run _validate_multi_period_panel() first, then resolve on data_filtered only, or let joint_pretrends_test() / joint_homogeneity_test() own resolution entirely.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3. Impact: replicate-weight survey pretests remain deferred, but the limitation is explicitly tracked and every entry point rejects those designs consistently. docs/methodology/REGISTRY.md:L2442-L2444, TODO.md:L98-L100 Concrete fix: None required in this PR.

Security

  • No findings.

Documentation/Tests

  • Severity: P1. Impact: The new survey= parameter is not exercised across all of the new public code paths. The added tests cover overall-workflow survey support and trivial/no-PSU survey smokes (tests/test_had_pretests.py:L2936-L2955, tests/test_had_pretests.py:L3060-L3069, tests/test_had_pretests.py:L3169-L3175), and they cover event-study only on weights= plus survey pretrends (tests/test_had_pretests.py:L3261-L3298, tests/test_had_pretests.py:L3526-L3595). There is still no successful survey= coverage for joint_homogeneity_test(...) or did_had_pretest_workflow(..., aggregate="event_study"), and no positive test of a non-trivial PSU/strata survey design. Given that the core new feature is PSU-aware bootstrap calibration, this is the exact parameter-interaction gap that can hide regressions. Concrete fix: Add successful survey= regressions for joint_homogeneity_test(...) and did_had_pretest_workflow(..., aggregate="event_study"), using a design with at least 2 PSUs per stratum and more than 1 unit per PSU; include one staggered last-cohort case so filtering and survey resolution are exercised together.

Path to Approval

  1. Move event-study weights/survey resolution in did_had_pretest_workflow() so it happens only after _validate_multi_period_panel() has produced data_filtered, or remove the eager full-panel resolution from that branch entirely.
  2. Add a regression where the staggered last-cohort filter drops units with zero/invalid weights, and assert the survey-backed event-study workflow succeeds on the surviving valid subset.
  3. Add positive non-trivial survey= tests for joint_homogeneity_test(...) and did_had_pretest_workflow(..., aggregate="event_study"); use actual psu/strata structure so the shared-PSU multiplier path is covered, not just the trivial weight-only path.

@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 25, 2026

/ai-review

igerber added a commit that referenced this pull request Apr 25, 2026
R6 P1 #1 (Code Quality) -- did_had_pretest_workflow eagerly resolved
weights/survey on the FULL panel before _validate_multi_period_panel
applied the staggered last-cohort filter. Because
_resolve_pretest_unit_weights enforces strictly-positive per-unit
weights / pweight type / etc. on whatever data it sees, zero or
otherwise-invalid weights on the soon-to-be-dropped cohort would abort
an otherwise-valid event-study run.

Fix: defer resolution to per-aggregate branches.
- Top-level: only the survey/weights mutex check + use_survey_path
  presence detection (no resolution).
- Overall path: resolve weights/survey AFTER _validate_had_panel
  (no cohort filtering on this path; original data IS the panel).
- Event-study path: do NOT resolve at the workflow level. The joint
  wrappers (joint_pretrends_test / joint_homogeneity_test) own
  resolution and already see data_filtered (post staggered filter).
  Row-level weights= passed through with the existing positional
  subsetting (R1 P1 fix preserved).

R6 P1 #2 (Documentation/Tests) -- positive PSU/strata survey coverage
gap. Existing tests covered overall-workflow + trivial/no-PSU smokes;
the PSU-aware multiplier-bootstrap path (the core new methodology)
was unpinned for joint_homogeneity_test and the event-study workflow.

3 new regression tests in TestPhase45CR1Regressions:
- test_joint_homogeneity_test_psu_strata_survey_smoke (non-trivial
  SurveyDesign(weights=, strata=, psu=) on the linearity wrapper).
- test_workflow_event_study_psu_strata_survey_smoke (full event-study
  dispatch under PSU/strata clustering: validate_multi_period_panel +
  resolve on data_filtered + pretrends_joint + homogeneity_joint).
- test_workflow_event_study_zero_weights_on_dropped_cohort (R6 P1 #1
  fix regression: panel where the dropped early cohort has zero
  weights succeeds on the surviving last cohort; pre-fix this crashed
  with "weights must be strictly positive").

183 pretest tests pass (was 180 after R5).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 20f158ac5402459cf3f9ee0a6c654e3f5fed9de5


Overall Assessment

Blocker - one new P0 methodology bug in the survey-weighted Stute statistic, plus one P1 report-contract regression. The two P1s from the previous review appear fixed.

Executive Summary

  • The prior event-study filter/weight-resolution bug looks fixed: weight resolution is now deferred until after event-study filtering, and the direct joint wrappers subset row weights before re-aggregation. diff_diff/had_pretests.py:L3262-L3283, diff_diff/had_pretests.py:L3487-L3506, diff_diff/had_pretests.py:L3676-L3757
  • The prior positive-survey coverage gap also looks fixed: new tests now cover joint_homogeneity_test(..., survey=...) and did_had_pretest_workflow(..., aggregate="event_study", survey=...) on non-trivial PSU/strata designs. tests/test_had_pretests.py:L3936-L3987
  • P0: the new weighted Stute implementation weights the inner cusum but not the outer integration measure. That does not match the survey-weighted plug-in functional the registry/survey docs describe, so survey Stute statistics and p-values are silently wrong for non-uniform weights. diff_diff/had_pretests.py:L1049-L1085, docs/methodology/survey-theory.md:L221-L225, docs/methodology/survey-theory.md:L383-L405
  • P1: survey workflow pass cases can set all_pass=True while the human verdict still begins with inconclusive, reintroducing the report-contract inconsistency the unweighted path already had a regression against. diff_diff/had_pretests.py:L3815-L3849, diff_diff/had_pretests.py:L3934-L3958, tests/test_had_pretests.py:L709-L728
  • Replicate-weight deferral, QUG-under-survey deferral, and lonely_psu='adjust' singleton rejection are documented/tracked and should not block approval. docs/methodology/REGISTRY.md:L2428-L2446, TODO.md:L98-L100
  • Static review only: pytest is not installed in this sandbox, and the available Python environment also lacks numpy, so I could not execute the new tests.

Methodology

  • Severity: P0. Impact: Affected methods: stute_test, stute_joint_pretest, joint_pretrends_test, joint_homogeneity_test, and both survey/weights branches of did_had_pretest_workflow. The new _cvm_statistic_weighted() implements S_w = W^-2 Σ_g C_g^2 with C_g = Σ_{h: D_h <= D_g} w_h ε_h. But the survey methodology note frames the extension as a plug-in of the survey-weighted empirical distribution F_hat_w = Σ_i w_i δ_{x_i} / Σ_i w_i. Starting from the original Stute functional S = (1/G) Σ_g c_G(D_g)^2, the weighted plug-in requires integrating c_w(D_g)^2 against F_hat_w, i.e. the outer measure must also be weighted: S_w = W^-2 Σ_g w_g C_g^2. Omitting that outer w_g silently turns the statistic into a count-weighted functional of the sample rather than the documented survey-weighted functional. Every non-uniform-weight survey Stute result is therefore off with no warning. diff_diff/had_pretests.py:L974-L1017, diff_diff/had_pretests.py:L1049-L1085, diff_diff/had_pretests.py:L1682-L1774, diff_diff/had_pretests.py:L2802-L2956, docs/methodology/survey-theory.md:L221-L225, docs/methodology/survey-theory.md:L383-L405 Concrete fix: change _cvm_statistic_weighted() to apply the outer w_sorted factor as well, then propagate that correction through the single- and joint-Stute survey paths and document the exact weighted CvM formula in docs/methodology/REGISTRY.md.
  • Severity: P3. Impact: The QUG skip under survey, replicate-weight rejection, and lonely_psu='adjust' singleton rejection are explicitly documented and/or tracked, so I did not count them against approval. docs/methodology/REGISTRY.md:L2428-L2446, TODO.md:L98-L100 Concrete fix: None required for this PR.

Code Quality

  • Severity: P1. Impact: On both survey workflow branches, the code synthesizes a NaN QUG result, calls the existing verdict composer, and then string-replaces "QUG NaN". For pass cases that still leaves verdict starting with inconclusive even when all_pass=True. That violates the report’s semantic contract and recreates the exact inconsistency the existing unweighted regression guards against. diff_diff/had_pretests.py:L3815-L3849, diff_diff/had_pretests.py:L3934-L3958, tests/test_had_pretests.py:L709-L728 Concrete fix: stop post-processing the old verdict strings. Add explicit survey-aware success/inconclusive verdict builders for the overall and event-study paths, and add regressions asserting that survey pass cases never emit verdict.startswith("inconclusive").

Performance

  • No findings.

Maintainability

  • No findings beyond the verdict-string issue above.

Tech Debt

  • Severity: P3. Impact: replicate-weight survey pretests remain deferred, but the limitation is explicitly tracked in TODO.md and guarded consistently at each entry point. TODO.md:L98-L100, docs/methodology/REGISTRY.md:L2443-L2444 Concrete fix: None required for this PR.

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: The new tests mostly cover smoke paths, w=ones(G) reductions, and branch guards. They still do not pin the non-uniform weighted-CvM formula or the survey all_pass/verdict consistency, which is why both bugs above slip through. tests/test_had_pretests.py:L3036-L3215, tests/test_had_pretests.py:L3936-L4016 Concrete fix: add one oracle test for _cvm_statistic_weighted() or stute_test() with a non-uniform weight vector, and add workflow regressions asserting that all_pass=True never coexists with an inconclusive survey verdict on either aggregate path.

Path to Approval

  1. Fix _cvm_statistic_weighted() so the Stute survey statistic integrates against the weighted empirical measure, not the unweighted observation count, and rerun/update the single- and joint-Stute survey paths accordingly.
  2. Replace the current survey verdict string-rewrite with explicit survey-aware verdict composition, then add regressions for both aggregate="overall" and aggregate="event_study" survey pass cases.
  3. Add a non-uniform-weight Stute regression that distinguishes Σ w_g C_g^2 from Σ C_g^2; the current w=ones(G) reductions cannot catch the methodology bug.

@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 25, 2026

/ai-review

igerber added a commit that referenced this pull request Apr 25, 2026
R7 P0 (Methodology) -- weighted CvM was missing the outer measure
factor. Survey-weighted plug-in of Stute's CvM functional integrates
the squared cusum process against the survey-weighted EDF
F_hat_w = (1/W) sum_i w_i delta_{D_i}, which weights BOTH the inner
cusum AND the outer integration measure:
    C_g = sum_{h <= g} w_h * eps_h
    S_w = (1/W^2) * sum_g w_g * (C_g)^2

Earlier revisions used (1/W^2) * sum_g C_g^2 (no outer w_g) which is
a count-weighted-cusum / uniform-outer-measure functional and silently
misreports survey-weighted Stute statistics for non-uniform weights.
At w=ones(G) both forms reduce to (1/G^2) sum_g C_g^2; only non-uniform
weights distinguish them, which is why the prior reduction tests
didn't catch this.

Fix: add the outer w_sorted factor to _cvm_statistic_weighted. New
oracle test pins the formula on a hand-computed non-uniform-weight
example (w=[1,2,3], eps=[1,-2,3] -> 127/36 outer-weighted, NOT 46/36
count-weighted-cusum). Reduction at w=1 still bit-exact.

R7 P1 (Code Quality) -- survey verdict could leave pass cases starting
with "inconclusive". The previous approach composed the unweighted
verdict against a synthetic NaN QUG and string-replaced "QUG NaN" out;
when no rejections fired and linearity was conclusive, the underlying
"inconclusive - QUG NaN" template would collapse to "inconclusive"
even for all_pass=True paths.

Fix: explicit survey-aware verdict composers
_compose_verdict_overall_survey and _compose_verdict_event_study_survey
that drop QUG from consideration entirely and emit linearity-only
priority text. Both append the
"(linearity-conditional verdict; QUG-under-survey deferred per Phase 4.5 C0)"
suffix in every branch (rejections, inconclusive, fail-to-reject).

4 new regression tests:
- test_cvm_statistic_weighted_outer_measure_oracle (R7 P0 hand-computed)
- test_cvm_statistic_weighted_reduces_at_uniform_weights (R7 P0 reduction)
- test_workflow_overall_survey_pass_does_not_say_inconclusive (R7 P1)
- test_workflow_event_study_survey_pass_does_not_say_inconclusive (R7 P1)

187 pretest tests pass (was 183 after R6).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 43230619ebed25a84af60c4300f2ac69ef8e9ab4


Overall Assessment

Looks good — the prior P0 weighted-CvM issue and prior P1 survey-verdict regression both appear fixed. I did not find any unmitigated P0/P1 issues in the changed code. Static review only: I could not run pytest here because the sandbox Python lacks numpy.

Executive Summary

  • The previous Stute methodology blocker appears resolved: _cvm_statistic_weighted() now applies the outer survey-weight measure, and the new non-uniform oracle test would fail on the old count-weighted form. diff_diff/had_pretests.py:L1049-L1097, tests/test_had_pretests.py:L4020-L4053
  • The previous report-contract regression also appears resolved: survey verdicts now use dedicated composers instead of post-processing the unweighted string output, with regressions locking all_pass=True against "inconclusive" verdicts. diff_diff/had_pretests.py:L3554-L3655, diff_diff/had_pretests.py:L3928-L4041, tests/test_had_pretests.py:L4057-L4100
  • Weight/survey propagation looks consistent across direct helpers, data-in wrappers, and workflow dispatch, including staggered-panel subsetting and variance-unidentified survey guards. diff_diff/had_pretests.py:L2990-L3534, diff_diff/had_pretests.py:L3822-L4017
  • The remaining methodology limitations I saw are documented/tracked rather than silent defects: replicate-weight pretests are deferred, lonely_psu='adjust' singleton strata are rejected, and survey Yatchew remains PSU-agnostic by design. docs/methodology/REGISTRY.md:L2436-L2444, TODO.md:L98-L100
  • One P3 documentation mismatch remains: the new survey all_pass prose is broader than the implemented event-study behavior. docs/methodology/REGISTRY.md:L2442-L2442, diff_diff/had_pretests.py:L609-L620, diff_diff/had_pretests.py:L3759-L3776, diff_diff/had_pretests.py:L3933-L3942

Methodology

  • Severity: P3-informational. Impact: The remaining survey-path limitations are explicitly documented/tracked: survey Yatchew does not propagate PSU clustering, replicate-weight pretests remain unsupported, and lonely_psu='adjust' singleton strata are rejected. These are disclosed in the Phase 4.5 C registry note and TODO, so they should not block approval. docs/methodology/REGISTRY.md:L2436-L2444, TODO.md:L98-L100 Concrete fix: None in this PR.

Code Quality

  • Severity: None. Impact: No unmitigated code-quality findings in the modified paths. The previous survey verdict/report inconsistency is addressed by the new survey-specific composers. diff_diff/had_pretests.py:L3554-L3655 Concrete fix: None.

Performance

  • Severity: None. Impact: No performance regressions stood out in the diff. The joint Stute path still precomputes the projection operator once per call, and the unweighted fast path remains separate. diff_diff/had_pretests.py:L2859-L2874 Concrete fix: None.

Maintainability

  • Severity: None. Impact: No maintainability findings. _resolve_pretest_unit_weights() centralizes the new weight/survey validation, normalization, and aggregation logic across workflow and wrapper entry points. diff_diff/had_pretests.py:L2990-L3091 Concrete fix: None.

Tech Debt

  • Severity: P3-informational. Impact: The new replicate-weight pretest deferral is properly tracked in TODO.md, so it is not a blocker. TODO.md:L98-L100 Concrete fix: None in this PR.

Security

  • Severity: None. Impact: No security findings in the changed files. Concrete fix: None.

Documentation/Tests

  • Severity: P3. Impact: The new survey all_pass documentation is broader than the implemented event-study behavior. The registry/docstrings describe the survey workflow generically as passing when “at least one linearity test” is conclusive, but aggregate="event_study" still requires both pretrends_joint and homogeneity_joint to be conclusive and non-rejecting. The code is fine; the published contract is looser than the implementation. docs/methodology/REGISTRY.md:L2442-L2442, diff_diff/had_pretests.py:L609-L620, diff_diff/had_pretests.py:L3759-L3776, diff_diff/had_pretests.py:L3933-L3942 Concrete fix: Split the survey all_pass description by aggregate and mirror the actual event-study condition in the registry and docstrings.

igerber added a commit that referenced this pull request Apr 25, 2026
R8 P3 -- survey all_pass docstring/REGISTRY described the contract too
generically ("at least one linearity test conclusive"), which matches
the overall path but is looser than the implemented event-study path
(which requires BOTH pretrends_joint AND homogeneity_joint conclusive
+ non-rejecting).

Fix: split the survey all_pass description by aggregate in both
HADPretestReport docstring and REGISTRY note:
- overall: "at least one of Stute/Yatchew conclusive + no rejection"
  (mirrors paper Section 4 step-3 'Stute OR Yatchew' wording).
- event_study: "both joint variants conclusive + neither rejects"
  (same step-2 + step-3 closure as the unweighted aggregate, minus the
  QUG step).

Code unchanged; only documentation. 187 pretest tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 25, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 194668274d1ce6f3c2e9892815980cff50cd5bff


Overall Assessment

⚠️ Needs changes — the prior weighted-CvM blocker and the prior survey-verdict regression both look fixed, and the new survey methodology is now documented in docs/methodology/REGISTRY.md. I did find one unmitigated P1 in the new staggered weights= path: row-level weights are subset before the normal front-door length check, so malformed arrays can be silently truncated or fail with raw indexing errors instead of the package’s usual ValueError. Static review only; I could not run pytest here because the sandbox Python is missing numpy/scipy/pytest.

Executive Summary

  • Affected methods are stute_test, yatchew_hr_test, stute_joint_pretest, joint_pretrends_test, joint_homogeneity_test, and did_had_pretest_workflow.
  • The previous Stute methodology blocker appears fixed: _cvm_statistic_weighted() now weights both the inner cusum and the outer integration measure, matching the new Phase 4.5 C registry note. diff_diff/had_pretests.py:L1058-L1106, docs/methodology/REGISTRY.md:L2434-L2438
  • The previous survey-verdict/report inconsistency also appears fixed: the workflow now uses dedicated survey-path verdict composers instead of post-processing the unweighted string output. diff_diff/had_pretests.py:L3563-L3664, diff_diff/had_pretests.py:L3942-L4045
  • QUG skip under survey, Yatchew’s PSU-agnostic variance, and replicate-weight deferral are all explicitly documented/tracked, so I am not treating those as blockers. docs/methodology/REGISTRY.md:L2428-L2448, TODO.md:L97-L100
  • [New] P1: the staggered weights= subsetting fix bypasses the standard len(weights) == len(data) guard on event-study paths; malformed arrays can be silently accepted or raise raw indexing errors. diff_diff/had_pretests.py:L3292-L3304, diff_diff/had_pretests.py:L3514-L3526, diff_diff/had_pretests.py:L3874-L3884, compare diff_diff/had_pretests.py:L3033-L3045
  • One residual P3 remains in the docs: did_had_pretest_workflow’s survey all_pass note is still broader than the implemented aggregate="event_study" rule. diff_diff/had_pretests.py:L3768-L3785, diff_diff/had_pretests.py:L3942-L3956, docs/methodology/REGISTRY.md:L2442-L2444

Methodology

  • Severity: P3-informational. Impact: The major survey-path deviations from the original unweighted paper workflow are now explicitly documented: QUG is skipped under survey=/weights=, Yatchew does not propagate PSU clustering, replicate-weight designs are deferred, and lonely_psu='adjust' singleton strata are rejected. Under the review rubric, these are documented deviations/tracked limitations, not defects. docs/methodology/REGISTRY.md:L2428-L2448, TODO.md:L98-L100 Concrete fix: None in this PR.

Code Quality

  • Severity: P1. Impact: On staggered event-study paths, row-level weights are sliced with np.asarray(weights)[pos_idx] before any validation against the original data length. That means oversized arrays can be silently accepted with their tail ignored, and undersized arrays can surface raw NumPy indexing errors instead of the library’s standard front-door ValueError. This affects did_had_pretest_workflow, joint_pretrends_test, and joint_homogeneity_test. diff_diff/had_pretests.py:L3292-L3304, diff_diff/had_pretests.py:L3514-L3526, diff_diff/had_pretests.py:L3874-L3884, compare diff_diff/had_pretests.py:L3033-L3045 Concrete fix: validate weights as 1D and len(weights) == len(data) before any staggered-panel subsetting, then subset the already-validated array.

Performance

  • No unmitigated findings in the changed code.

Maintainability

  • No unmitigated findings in the changed code.

Tech Debt

  • Severity: P3-informational. Impact: Replicate-weight survey pretests are explicitly deferred and tracked in TODO.md, so that follow-up is properly recorded and should not block this PR. TODO.md:L98-L100 Concrete fix: None in this PR.

Security

  • No findings.

Documentation/Tests

  • Severity: P3. Impact: did_had_pretest_workflow’s Notes section still says the survey-path all_pass rule is “at least one linearity test is conclusive,” but the implemented aggregate="event_study" logic requires both pretrends_joint and homogeneity_joint to be conclusive and non-rejecting. The registry and HADPretestReport docstring were updated, but this workflow docstring is still looser than the code. diff_diff/had_pretests.py:L3768-L3785, diff_diff/had_pretests.py:L3942-L3956, docs/methodology/REGISTRY.md:L2442-L2444 Concrete fix: split the workflow docstring’s survey all_pass note by aggregate, mirroring the registry text and actual event-study logic.

Path to Approval

  1. Add a front-door validator for staggered weights= paths so weights must be 1D and length-matched to the original data before pos_idx subsetting in did_had_pretest_workflow, joint_pretrends_test, and joint_homogeneity_test.
  2. Add regressions for too-short and too-long row-level weights arrays on staggered aggregate="event_study" workflow and both direct joint wrappers, asserting clean ValueErrors rather than silent truncation or raw indexing failures.

@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 25, 2026

/ai-review

igerber added a commit that referenced this pull request Apr 25, 2026
R9 P1 (Code Quality) -- staggered weights= subsetting bypassed the
standard length check. The R6 P1 fix added pos_idx-based subsetting
to align row-level weights with data_filtered, but the subsetting
ran without first validating that weights had length matching the
ORIGINAL data:
- Oversized arrays were silently truncated (pos_idx slice keeps only
  matched rows; tail of weights was discarded with no warning).
- Undersized arrays surfaced raw NumPy IndexError instead of the
  package's standard front-door ValueError.

Fix: validate weights as 1D AND len(weights) == len(data) BEFORE the
pos_idx slice, on all three staggered-weights entry points
(did_had_pretest_workflow, joint_pretrends_test,
joint_homogeneity_test). Subset only after validation passes.

R9 P3 -- did_had_pretest_workflow's Notes section still had the
un-split survey all_pass description ("at least one linearity test
conclusive"). The HADPretestReport docstring and REGISTRY were
updated in R8; this third location was missed. Now matches the
event-study contract that requires BOTH joint variants conclusive +
non-rejecting.

4 new regression tests in TestPhase45CR1Regressions:
- test_workflow_event_study_oversized_weights_raises
- test_workflow_event_study_undersized_weights_raises
- test_joint_pretrends_test_oversized_weights_raises
- test_joint_homogeneity_test_undersized_weights_raises

All assert the front-door ValueError contract instead of silent
truncation / raw IndexError. 191 pretest tests pass (was 187 after R8).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 6878f98e0c5c0fd27bce8411149d346b6c7732c1


Overall Assessment

⚠️ Needs changes — the prior weighted-CvM and survey-verdict/report issues look fixed, and the documented survey-path deviations are properly recorded, but one unmitigated P1 remains in the new survey Stute calibration on stratified designs.

Executive Summary

  • Affected methods: stute_test, yatchew_hr_test, stute_joint_pretest, joint_pretrends_test, joint_homogeneity_test, did_had_pretest_workflow, plus HADPretestReport serialization.
  • The previous weighted outer-measure CvM blocker appears fixed by _cvm_statistic_weighted in diff_diff/had_pretests.py.
  • The previous survey-path verdict / qug=None inconsistency also appears fixed by the explicit qug-optional rendering and dedicated survey verdict composers in diff_diff/had_pretests.py and diff_diff/had_pretests.py.
  • QUG skip under survey, Yatchew’s PSU-agnostic variance, and replicate-weight deferral are documented/tracked, so I am not treating those as blockers in this review: docs/methodology/REGISTRY.md and TODO.md.
  • P1 remains: the new survey Stute paths use raw PSU multipliers directly on the residual perturbation, but the same generator only provides the multiplier draw/FPC scaling; elsewhere in the codebase the stratified survey target requires an additional PSU aggregation + stratum-demeaning + sqrt(n_h/(n_h-1)) correction after that call. See diff_diff/had_pretests.py, diff_diff/had_pretests.py, compare diff_diff/bootstrap_utils.py and diff_diff/had.py.
  • Static review only: I could not run pytest here because the review environment does not have it installed.

Methodology

  • Severity: P1. Impact: REGISTRY.md says Phase 4.5 C’s Stute family uses a calibrated survey-aware multiplier bootstrap and reuses the same kernel as the weighted event-study bootstrap, but the implementation in the new Stute paths only calls generate_survey_multiplier_weights_batch(...) and then applies those raw PSU draws directly to eps in the residual perturbation loop. In this codebase, that generator is not the full stratified-survey calibration by itself: the weighted event-study bootstrap has to aggregate to PSU, demean within strata, and apply the sqrt(n_h/(n_h-1)) small-sample correction after the same generator call to match the survey target. On stratified survey designs, the new stute_test / stute_joint_pretest p-values can therefore be miscalibrated silently. References: diff_diff/had_pretests.py, diff_diff/had_pretests.py, diff_diff/bootstrap_utils.py, diff_diff/had.py, docs/methodology/REGISTRY.md. Concrete fix: either derive and implement the analogous Stute-specific stratified correction, or narrow Phase 4.5 C support by rejecting stratified survey designs on the Stute family until that derivation exists and document that narrower support in REGISTRY.md.

Code Quality

  • No unmitigated findings.

Performance

  • No unmitigated findings.

Maintainability

  • No unmitigated findings.

Tech Debt

  • Severity: P3-informational. Impact: replicate-weight survey pretests are explicitly deferred and tracked, so that limitation is properly recorded and does not block this PR. Reference: TODO.md. Concrete fix: none in this PR.

Security

  • No findings.

Documentation/Tests

  • Severity: P3. Impact: the new nontrivial survey-design coverage in tests/test_had_pretests.py is smoke-only (np.isfinite(...), 0 <= p <= 1) and would not catch the stratified-bootstrap calibration issue above. The closest existing oracle in the repo is the weighted event-study sup-t calibration lock in tests/test_had.py. References: tests/test_had_pretests.py, tests/test_had_pretests.py. Concrete fix: add a stratified SurveyDesign(weights=..., strata=..., psu=...) regression for both stute_test and stute_joint_pretest that locks calibration/behavior against a known target, not just finiteness.

Path to Approval

  1. Make the survey Stute bootstrap match a derived/calibrated stratified-design target: either implement the missing PSU aggregation + within-stratum centering + sqrt(n_h/(n_h-1)) correction (or another fully derived equivalent) in stute_test and stute_joint_pretest, or reject stratified survey designs on those paths and document the narrower support in docs/methodology/REGISTRY.md.
  2. Add regression tests that fail on the current implementation for a nontrivial stratified survey design, covering both single-horizon and joint Stute paths.

@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 25, 2026

/ai-review

igerber added a commit that referenced this pull request Apr 25, 2026
R10 P1 (Methodology) -- Stute survey bootstrap was silently miscalibrated
on stratified designs. The HAD sup-t bootstrap (had.py:2120+) applies a
within-stratum demean + sqrt(n_h/(n_h-1)) small-sample correction AFTER
generate_survey_multiplier_weights_batch returns, to make the bootstrap
variance match the Binder-TSL stratified target. The same correction
has NOT been derived for the Stute CvM functional, so applying the
helper's raw multipliers directly to residual perturbations on stratified
designs left the bootstrap p-value silently miscalibrated.

Per the reviewer's offered "narrow support" path: Phase 4.5 C now
explicitly rejects stratified designs on the Stute family with
NotImplementedError. Pweight-only, PSU-only, and FPC-only designs
remain supported (the helper's output is appropriately scaled for those
without further correction). Stratified is a follow-up after the
matching Stute-CvM stratified-correction derivation lands.

Mirrors the lonely_psu='adjust' rejection pattern (R5 P1) — both are
methodology-gap-driven explicit NotImplementedErrors with documented
follow-up. The strata guard supersedes the lonely_psu='adjust'
singleton-strata guard for any stratified design (the latter is now
defense-in-depth for the unstratified residual case).

R10 P3 -- Added "stratified-design rejection" entry to REGISTRY's
Note (Phase 4.5 C). Also updated CHANGELOG to narrow the documented
survey contract.

Tests updated:
- test_stute_test_lonely_psu_adjust_singletons_raises ->
  test_stute_test_stratified_design_raises (the strata guard fires
  first; the test is still meaningful but on a strata key match).
- Same renaming for stute_joint_pretest variant.
- test_stute_test_lonely_psu_remove_singletons_returns_nan REMOVED
  (singleton strata under lonely_psu='remove' now hits the strata
  guard instead of the df_survey<=0 guard).
- test_joint_homogeneity_test_psu_strata_survey_smoke ->
  test_joint_homogeneity_test_psu_only_survey_smoke (positive coverage
  on PSU-only design) + new test_joint_homogeneity_test_stratified_raises.
- test_workflow_event_study_psu_strata_survey_smoke ->
  test_workflow_event_study_psu_only_survey_smoke.
- test_workflow_event_study_survey_pass_does_not_say_inconclusive
  switched from strata to PSU-only.

191 pretest tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: ad44e39b634a4b8b63804beec38fb1f488b85266


Overall Assessment

⚠️ Needs changes — the prior weighted-CvM and qug=None/survey-verdict issues look fixed, but the earlier P1 on stratified survey Stute calibration is still unmitigated, and there is one additional P1 on the new staggered weights= routing.

Executive Summary

  • Affected methods: stute_test, yatchew_hr_test, stute_joint_pretest, joint_pretrends_test, joint_homogeneity_test, did_had_pretest_workflow, plus HADPretestReport serialization/rendering.
  • The previous weighted outer-measure CvM issue appears fixed by _cvm_statistic_weighted in diff_diff/had_pretests.py:L1058-L1106.
  • The previous survey-path qug=None rendering / verdict inconsistency also appears fixed by the HADPretestReport changes and explicit survey verdict composers in diff_diff/had_pretests.py:L676-L809 and diff_diff/had_pretests.py:L3563-L3664.
  • Replicate-weight deferral and Yatchew’s PSU-agnostic survey path are documented in docs/methodology/REGISTRY.md:L2436-L2446 and tracked in TODO.md:L98, so I am not treating those as blockers.
  • P1 remains: the new Stute survey paths still advertise and run on stratified designs, but they apply raw PSU multipliers directly after generate_survey_multiplier_weights_batch(...) instead of the post-draw PSU aggregation + within-stratum demeaning + sqrt(n_h/(n_h-1)) correction that the existing HAD sup-t stratified calibration uses. On stratified survey designs, the Stute p-values can therefore be miscalibrated silently.
  • [Newly identified] P1: the new staggered weights= routing slices weights by pos_idx before validating shape/length. Oversized arrays are silently truncated; undersized arrays can fail with raw indexing errors instead of the package’s front-door ValueError.
  • Static review only: I could not run the test suite here because the review environment is missing project dependencies.

Methodology

  • Severity: P1. Impact: stute_test and stute_joint_pretest still treat stratified survey designs as supported, and the registry/changelog say Phase 4.5 C supports PSU/strata/FPC survey designs. But the implementation in diff_diff/had_pretests.py:L1724-L1793 and diff_diff/had_pretests.py:L2916-L2978 only draws PSU multipliers from generate_survey_multiplier_weights_batch(...) and applies them directly to the residual perturbation. In this codebase, that helper is not the full stratified calibration by itself: the existing weighted event-study sup-t path has to aggregate to PSU level, demean within strata, and apply the sqrt(n_h/(n_h-1)) correction after the same helper call to match the stratified Binder target, see diff_diff/had.py:L2120-L2175. The registry still documents stratified survey support rather than a deviation or limitation, docs/methodology/REGISTRY.md:L2434-L2446, and the new tests actively smoke-test stratified designs as supported, tests/test_had_pretests.py:L3936-L3987. Concrete fix: either implement the missing Stute-specific stratified correction after the multiplier draw, or reject survey.strata is not None everywhere the Stute family is reachable and document the narrower support surface in REGISTRY.md/CHANGELOG.md.

  • No other unmitigated methodology findings. The replicate-weight deferral is tracked in TODO.md:L98, and Yatchew’s decision not to propagate PSU clustering is explicitly documented in docs/methodology/REGISTRY.md:L2436-L2441.

Code Quality

  • Severity: P1 [Newly identified]. Impact: on the new staggered aggregate="event_study" weights= path, the code slices weights with np.asarray(weights)[pos_idx] before validating that weights is 1-D and has length len(data), in diff_diff/had_pretests.py:L3292-L3303, diff_diff/had_pretests.py:L3514-L3525, and diff_diff/had_pretests.py:L3874-L3883. Because _resolve_pretest_unit_weights(...) only validates after this slicing (diff_diff/had_pretests.py:L3033-L3062), oversized arrays are silently truncated to the filtered panel and undersized arrays can surface raw indexing failures instead of a front-door ValueError. That is a real correctness/edge-case problem on the new parameter path. Concrete fix: validate weights.ndim == 1 and len(weights) == len(data) before any pos_idx slicing in all three sites, then subset.

Performance

  • No unmitigated findings.

Maintainability

  • No unmitigated findings.

Tech Debt

  • Severity: P3-informational. Impact: replicate-weight survey pretests are explicitly deferred and tracked in TODO.md:L98, so that limitation is properly recorded and does not block this PR. Concrete fix: none in this PR.

Security

  • No findings.

Documentation/Tests

  • Severity: P3. Impact: the new stratified survey tests only assert finiteness / 0 <= p <= 1, so they would not catch the calibration problem above and currently enshrine the unsupported behavior as a passing path, tests/test_had_pretests.py:L3936-L3987. The docs also overstate the supported surface by claiming Phase 4.5 C supports PSU/strata/FPC Stute survey designs, docs/methodology/REGISTRY.md:L2434-L2446 and CHANGELOG.md:L8-L11. Concrete fix: if stratified Stute remains unsupported, replace these tests with NotImplementedError regressions and narrow the docs accordingly; if stratified Stute is implemented, add a calibration/parity oracle that would fail on the current raw-multiplier implementation.

Path to Approval

  1. Make the Stute survey bootstrap match a derived stratified target, or explicitly reject stratified survey designs on stute_test, stute_joint_pretest, joint_pretrends_test, joint_homogeneity_test, and did_had_pretest_workflow, then update docs/methodology/REGISTRY.md and CHANGELOG.md to match.
  2. Add front-door validation for staggered weights= inputs before any pos_idx slicing in joint_pretrends_test, joint_homogeneity_test, and did_had_pretest_workflow, so wrong-length arrays raise clean ValueErrors instead of being silently truncated or failing deep in NumPy.
  3. Replace the current stratified smoke tests with either rejection tests for the narrowed support surface or a real stratified calibration/parity regression, and add oversized/undersized staggered-weights= regression tests.

igerber and others added 11 commits April 25, 2026 12:59
Closes the Phase 4.5 C0 promise (PR #367 commit 29f8b12). Linearity-
family pretests now accept survey=/weights= keyword-only kwargs:

- stute_test, yatchew_hr_test, stute_joint_pretest, joint_pretrends_test,
  joint_homogeneity_test, did_had_pretest_workflow.

Stute family: PSU-level Mammen multiplier bootstrap via
generate_survey_multiplier_weights_batch. Each replicate draws
(B, n_psu) Mammen multipliers, broadcast to per-obs perturbation
eta_obs[g] = eta_psu[psu(g)], weighted OLS refit, weighted CvM via new
_cvm_statistic_weighted helper. Joint Stute SHARES the multiplier matrix
across horizons within each replicate, preserving both vector-valued
empirical-process unit-level dependence (Delgado 1993; Escanciano 2006)
AND PSU clustering (Krieger-Pfeffermann 1997). NOT Rao-Wu rescaling --
multiplier bootstrap is a different mechanism.

Yatchew: closed-form weighted OLS + pweight-sandwich variance components
(no bootstrap):
  sigma2_lin  = sum(w * eps^2) / sum(w)
  sigma2_diff = sum(w_avg * diff^2) / (2 * sum(w))   [Reviewer CRITICAL #2]
  sigma4_W    = sum(w_avg * eps_g^2 * eps_{g-1}^2) / sum(w_avg)
  T_hr        = sqrt(sum(w)) * (sigma2_lin - sigma2_diff) / sigma2_W
where w_avg_g = (w_g + w_{g-1}) / 2 (Krieger-Pfeffermann 1997 Section 3).
All three components reduce bit-exactly to existing unweighted formulas
at w=ones(G); locked at atol=1e-14 by direct helper test.

Workflow under survey/weights: skips the QUG step with UserWarning (per
C0 deferral), sets qug=None on the report, dispatches the linearity
family with survey-aware mechanism, appends "linearity-conditional
verdict; QUG-under-survey deferred per Phase 4.5 C0" suffix to the
verdict. all_pass drops the QUG-conclusiveness gate (one less
precondition). HADPretestReport.qug retyped from QUGTestResults to
Optional[QUGTestResults]; summary/to_dict/to_dataframe updated to
None-tolerant rendering.

Pweight shortcut routing: weights= passes through a synthetic trivial
ResolvedSurveyDesign (new survey._make_trivial_resolved helper) so the
same kernel handles both entry paths -- mirrors PR #363's R7 fix pattern
on HAD sup-t.

Replicate-weight survey designs (BRR/Fay/JK1/JKn/SDR) raise
NotImplementedError at every entry point (defense in depth, reciprocal-
guard discipline). The per-replicate weight-ratio rescaling for the
OLS-on-residuals refit step is not covered by the multiplier-bootstrap
composition; deferred to a parallel follow-up.

Per-row weights= / survey=col aggregated to per-unit via existing HAD
helpers (_aggregate_unit_weights, _aggregate_unit_resolved_survey;
constant-within-unit invariant enforced) through new
_resolve_pretest_unit_weights helper. Strictly-positive weights required
on Yatchew (the adjacent-difference variance is undefined under
contiguous-zero blocks).

Stability invariants preserved:
- Unweighted code paths bit-exact pre-PR (the new survey/weights branch
  is a separate if arm; existing 138 pretest tests pass unchanged).
- Yatchew weighted variance components reduce to unweighted at w=1 at
  atol=1e-14 (locked by TestYatchewHRTestSurvey).
- HADPretestReport schema bit-exact on the unweighted path; qug=None
  triggers the new None-tolerant rendering only on the survey path.

20 new tests across TestHADPretestWorkflowSurveyGuards (revised from
C0 rejection-only to C functional + 2 mutex/replicate-weight retained),
TestStuteTestSurvey (7), TestYatchewHRTestSurvey (7), TestJointStuteSurvey
(5). Full pretest suite: 158 tests pass.

Patch-level addition (additive on stable surfaces). See
docs/methodology/REGISTRY.md "QUG Null Test" -- Note (Phase 4.5 C) for
the full methodology.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R1 P0 — Stute survey path silently accepted zero-weight units, which
leak into the dose-variation check + CvM cusum + bootstrap refit while
contributing zero population mass. Extreme case: only zero-weight units
carry dose variation -> spurious finite test statistic with no warning.
Fix: strictly-positive guards on every survey-aware Stute / Yatchew /
workflow entry point (the weights= shortcut already had this; survey=
branch was the gap).

R1 P1 #1 — aweight/fweight survey designs slipped through pweight-only
formulas silently (the variance components are derived assuming pweight
sandwich semantics). Fix: weight_type='pweight' guards added in
_resolve_pretest_unit_weights and on every direct-helper survey= branch
(stute_test, yatchew_hr_test, stute_joint_pretest). Mirrors HAD.fit
guard at had.py:2976 + survey._resolve_pweight_only at survey.py:914.

R1 P1 #2 — workflow's row-level weights= crashed on staggered event-
study panels because _validate_multi_period_panel filters to last
cohort but the joint wrappers re-aggregate with the original full-
panel weights array. Fix: subset joint_weights to data_filtered's
rows via data.index.get_indexer(data_filtered.index) BEFORE passing
to the wrappers. Mirrors HeterogeneousAdoptionDiD.fit positional-
index pattern. Survey= path is unaffected (column references resolve
internally on data_filtered).

R1 P3 — REGISTRY C0 note still said "the same gate applies to
did_had_pretest_workflow" and "Phase 4.5 C uses Rao-Wu rescaling"; both
are stale post-C. Updated to clarify (a) workflow gate was temporary
and is now closed by C, (b) qug_test direct-helper gate remains
permanent, (c) C uses PSU-level Mammen multiplier bootstrap (NOT
Rao-Wu rescaling).

7 new tests in TestPhase45CR1Regressions covering: zero-weight survey
on stute_test / stute_joint_pretest / workflow; aweight rejection on
stute_test / workflow; fweight rejection on yatchew_hr_test; staggered
event-study workflow with weights= (catches the length-mismatch crash).
165 pretest tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R2 P1 #1 (Code Quality) -- joint_pretrends_test and joint_homogeneity_test
direct calls still crashed on staggered panels because the staggered-
weights subset fix from R1 was only applied at the workflow level. The
wrappers run their own _validate_had_panel_event_study() and may filter
to data_filtered, then passed the original full-panel weights array to
_resolve_pretest_unit_weights(data_filtered, ...) which expects the
filtered row count. Fix: subset row-level weights to data_filtered.index
positions (via data.index.get_indexer) BEFORE _resolve_pretest_unit_weights,
mirroring the workflow fix.

R2 P1 #2 (Methodology) -- REGISTRY note documented the bootstrap
perturbation as `dy_b = fitted + eps * w * eta_obs`, but the code does
`dy_b = fitted + eps * eta_obs` (no `* w`). Code is correct: paper
Appendix D wild-bootstrap perturbs UNWEIGHTED residuals; weighting flows
through the OLS refit and the weighted CvM, not through the perturbation.
Adding `* w` would over-weight by w². Fix: update REGISTRY note to
remove the spurious `* w` and clarify the canonical form. Add a
regression that pins (a) bit-exact cvm_stat reduction at uniform weights,
(b) bootstrap p-value distributional agreement within Monte-Carlo noise.

R2 P3 -- in-code docstrings still referenced the pre-Phase-4.5-C contract:
- qug_test docstring said survey-aware Stute "admits a Rao-Wu rescaled
  bootstrap" (PSU-level Mammen multiplier bootstrap is what shipped).
  Updated to reflect the correct mechanism.
- HADPretestReport.all_pass docstring described the unweighted contract
  only; survey/weights path drops the QUG-conclusiveness gate
  (linearity-conditional admissibility per C0 deferral). Updated.

3 new regression tests in TestPhase45CR1Regressions:
- test_joint_pretrends_test_staggered_weights_subset
- test_joint_homogeneity_test_staggered_weights_subset
- test_stute_survey_perturbation_does_not_double_weight (locks the
  perturbation form via cvm_stat bit-exact reduction + p-value MC bound)

168 pretest tests pass (was 165 after R1).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R3 P0 (Methodology) -- variance-unidentified survey-design guard.
generate_survey_multiplier_weights_batch returns an all-zero (B, n_psu)
multiplier matrix when n_psu - n_strata <= 0 (e.g. unstratified
single-PSU design where df_survey = n_psu - 1 = 0). Without a guard,
the Stute survey path treated zero perturbations as a valid bootstrap
law and emitted p_value = 1/(B+1) for any positive observed CvM
(spurious rejection on a survey edge case).

Fix: detect df_survey <= 0 BEFORE calling the multiplier helper and
return a NaN p_value with a UserWarning, matching the broader package
convention for variance-unidentified survey designs (mirrors
compute_survey_vcov NaN treatment, ResolvedSurveyDesign.df_survey
semantics). Applied in stute_test (single-horizon) and
stute_joint_pretest (joint variant); the data-in wrappers and the
workflow propagate via the same calls.

R3 P3 -- in-code docstrings still drifted on three surfaces:
- qug_test NotImplementedError text said survey-aware Stute is
  "planned" and "admits a Rao-Wu rescaled bootstrap"; both are stale
  post-Phase-4.5-C. Updated to reflect the shipped reality (PSU-level
  Mammen multiplier bootstrap; QUG step is skipped under survey/weights
  in the workflow).
- joint_pretrends_test, joint_homogeneity_test, stute_joint_pretest
  Parameters sections did not document the new weights= / survey=
  kwargs at all. Added with the per-test contract (per-unit
  aggregation, staggered subsetting, replicate/aweight/fweight
  rejection, variance-unidentified guard).

3 new regression tests in TestPhase45CR1Regressions:
- test_stute_test_single_psu_unstratified_returns_nan
- test_stute_joint_pretest_single_psu_unstratified_returns_nan
- test_workflow_single_psu_propagates_nan_through_stute (also documents
  Yatchew's PSU-agnostic-by-design behavior; verdict carries the
  linearity-conditional suffix)

171 pretest tests pass (was 168 after R2).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R4 P0 (Methodology) -- Yatchew test statistic was not invariant to
uniform pweight rescaling. The formula `T_hr = sqrt(sum(w)) * (...)`
makes T_hr scale as sqrt(c) under weights -> w * c, so weights=w and
weights=100*w produced different p-values for the same design. Worse,
SurveyDesign.resolve() normalizes pweights to mean=1 internally, so
the survey= entry path and the weights= shortcut disagreed numerically.

Fix: normalize per-unit pweights to mean=1 at every helper entry
(stute_test, yatchew_hr_test, stute_joint_pretest) and at the workflow
resolution helper. Matches SurveyDesign.resolve() convention; makes
the Yatchew statistic scale-invariant; ensures weights=w and
survey=SurveyDesign(weights="w") produce identical results for the
same design. Stute is internally scale-invariant in functional form
but normalization is required so the bootstrap helper sees the same
weight vector under both entry paths (cross-path numerical agreement).

R4 P1 (Code Quality) -- column-vector weights (e.g. `df[["w"]].to_numpy()`
producing (G, 1)) silently broadcast through weighted moments / CvM
sums instead of raising. Fix: validate via `_validate_1d_numeric` on
all `weights=` arrays in stute_test, yatchew_hr_test, stute_joint_pretest;
add explicit ndim check in `_resolve_pretest_unit_weights` with a
hint about the common df[["w"]].to_numpy() mistake.

6 new regression tests in TestPhase45CR1Regressions:
- test_yatchew_weights_scale_invariant (weights=w vs weights=100*w)
- test_stute_weights_scale_invariant (mirror for Stute)
- test_workflow_weights_eq_survey_at_overall_path (weights= shortcut
  and survey=SurveyDesign(...) produce identical Yatchew + Stute
  results, atol=1e-10)
- test_stute_test_rejects_2d_weights / test_yatchew_hr_test_rejects_2d_weights
  / test_workflow_rejects_2d_weights (column-vector rejection at all
  three direct-helper / workflow entry points)

177 pretest tests pass (was 171 after R3).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R5 P1 (Methodology) -- Stute survey df_survey<=0 guard was too broad.
generate_survey_multiplier_weights_batch's pooled-singleton branch for
lonely_psu='adjust' produces NONZERO multipliers, so the helper contract
is non-degenerate even when df_survey=0 (one PSU per stratum).
The previous guard rejected those designs as variance-unidentified
when actually the issue is methodological: the analytical variance
target requires a pseudo-stratum centering transform that has not
been derived for the Stute CvM.

Fix: mirror HAD sup-t bootstrap's explicit lonely_psu='adjust'
rejection at had.py:2081-2118. Add new helper
_has_lonely_psu_adjust_singletons in had_pretests.py. stute_test and
stute_joint_pretest now reject lonely_psu='adjust' with singleton
strata via NotImplementedError BEFORE the df_survey<=0 guard.
The guard remains for genuinely degenerate designs (single-PSU
unstratified, or one-PSU-per-stratum under remove/certainty where
the helper returns all-zero multipliers).

R5 P2 -- new regressions cover lonely_psu='adjust' rejection on both
single-horizon and joint-Stute helpers; also a positive control showing
lonely_psu='remove' singletons still flow through the existing
df_survey-based NaN path. 3 new tests in TestPhase45CR1Regressions.

REGISTRY note clarifies the carve-out: 'adjust' with singletons raises
NotImplementedError; 'remove' / 'certainty' produce all-zero singleton
multipliers and return NaN via the variance-unidentified guard.

180 pretest tests pass (was 177 after R4).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R6 P1 #1 (Code Quality) -- did_had_pretest_workflow eagerly resolved
weights/survey on the FULL panel before _validate_multi_period_panel
applied the staggered last-cohort filter. Because
_resolve_pretest_unit_weights enforces strictly-positive per-unit
weights / pweight type / etc. on whatever data it sees, zero or
otherwise-invalid weights on the soon-to-be-dropped cohort would abort
an otherwise-valid event-study run.

Fix: defer resolution to per-aggregate branches.
- Top-level: only the survey/weights mutex check + use_survey_path
  presence detection (no resolution).
- Overall path: resolve weights/survey AFTER _validate_had_panel
  (no cohort filtering on this path; original data IS the panel).
- Event-study path: do NOT resolve at the workflow level. The joint
  wrappers (joint_pretrends_test / joint_homogeneity_test) own
  resolution and already see data_filtered (post staggered filter).
  Row-level weights= passed through with the existing positional
  subsetting (R1 P1 fix preserved).

R6 P1 #2 (Documentation/Tests) -- positive PSU/strata survey coverage
gap. Existing tests covered overall-workflow + trivial/no-PSU smokes;
the PSU-aware multiplier-bootstrap path (the core new methodology)
was unpinned for joint_homogeneity_test and the event-study workflow.

3 new regression tests in TestPhase45CR1Regressions:
- test_joint_homogeneity_test_psu_strata_survey_smoke (non-trivial
  SurveyDesign(weights=, strata=, psu=) on the linearity wrapper).
- test_workflow_event_study_psu_strata_survey_smoke (full event-study
  dispatch under PSU/strata clustering: validate_multi_period_panel +
  resolve on data_filtered + pretrends_joint + homogeneity_joint).
- test_workflow_event_study_zero_weights_on_dropped_cohort (R6 P1 #1
  fix regression: panel where the dropped early cohort has zero
  weights succeeds on the surviving last cohort; pre-fix this crashed
  with "weights must be strictly positive").

183 pretest tests pass (was 180 after R5).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R7 P0 (Methodology) -- weighted CvM was missing the outer measure
factor. Survey-weighted plug-in of Stute's CvM functional integrates
the squared cusum process against the survey-weighted EDF
F_hat_w = (1/W) sum_i w_i delta_{D_i}, which weights BOTH the inner
cusum AND the outer integration measure:
    C_g = sum_{h <= g} w_h * eps_h
    S_w = (1/W^2) * sum_g w_g * (C_g)^2

Earlier revisions used (1/W^2) * sum_g C_g^2 (no outer w_g) which is
a count-weighted-cusum / uniform-outer-measure functional and silently
misreports survey-weighted Stute statistics for non-uniform weights.
At w=ones(G) both forms reduce to (1/G^2) sum_g C_g^2; only non-uniform
weights distinguish them, which is why the prior reduction tests
didn't catch this.

Fix: add the outer w_sorted factor to _cvm_statistic_weighted. New
oracle test pins the formula on a hand-computed non-uniform-weight
example (w=[1,2,3], eps=[1,-2,3] -> 127/36 outer-weighted, NOT 46/36
count-weighted-cusum). Reduction at w=1 still bit-exact.

R7 P1 (Code Quality) -- survey verdict could leave pass cases starting
with "inconclusive". The previous approach composed the unweighted
verdict against a synthetic NaN QUG and string-replaced "QUG NaN" out;
when no rejections fired and linearity was conclusive, the underlying
"inconclusive - QUG NaN" template would collapse to "inconclusive"
even for all_pass=True paths.

Fix: explicit survey-aware verdict composers
_compose_verdict_overall_survey and _compose_verdict_event_study_survey
that drop QUG from consideration entirely and emit linearity-only
priority text. Both append the
"(linearity-conditional verdict; QUG-under-survey deferred per Phase 4.5 C0)"
suffix in every branch (rejections, inconclusive, fail-to-reject).

4 new regression tests:
- test_cvm_statistic_weighted_outer_measure_oracle (R7 P0 hand-computed)
- test_cvm_statistic_weighted_reduces_at_uniform_weights (R7 P0 reduction)
- test_workflow_overall_survey_pass_does_not_say_inconclusive (R7 P1)
- test_workflow_event_study_survey_pass_does_not_say_inconclusive (R7 P1)

187 pretest tests pass (was 183 after R6).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R8 P3 -- survey all_pass docstring/REGISTRY described the contract too
generically ("at least one linearity test conclusive"), which matches
the overall path but is looser than the implemented event-study path
(which requires BOTH pretrends_joint AND homogeneity_joint conclusive
+ non-rejecting).

Fix: split the survey all_pass description by aggregate in both
HADPretestReport docstring and REGISTRY note:
- overall: "at least one of Stute/Yatchew conclusive + no rejection"
  (mirrors paper Section 4 step-3 'Stute OR Yatchew' wording).
- event_study: "both joint variants conclusive + neither rejects"
  (same step-2 + step-3 closure as the unweighted aggregate, minus the
  QUG step).

Code unchanged; only documentation. 187 pretest tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R9 P1 (Code Quality) -- staggered weights= subsetting bypassed the
standard length check. The R6 P1 fix added pos_idx-based subsetting
to align row-level weights with data_filtered, but the subsetting
ran without first validating that weights had length matching the
ORIGINAL data:
- Oversized arrays were silently truncated (pos_idx slice keeps only
  matched rows; tail of weights was discarded with no warning).
- Undersized arrays surfaced raw NumPy IndexError instead of the
  package's standard front-door ValueError.

Fix: validate weights as 1D AND len(weights) == len(data) BEFORE the
pos_idx slice, on all three staggered-weights entry points
(did_had_pretest_workflow, joint_pretrends_test,
joint_homogeneity_test). Subset only after validation passes.

R9 P3 -- did_had_pretest_workflow's Notes section still had the
un-split survey all_pass description ("at least one linearity test
conclusive"). The HADPretestReport docstring and REGISTRY were
updated in R8; this third location was missed. Now matches the
event-study contract that requires BOTH joint variants conclusive +
non-rejecting.

4 new regression tests in TestPhase45CR1Regressions:
- test_workflow_event_study_oversized_weights_raises
- test_workflow_event_study_undersized_weights_raises
- test_joint_pretrends_test_oversized_weights_raises
- test_joint_homogeneity_test_undersized_weights_raises

All assert the front-door ValueError contract instead of silent
truncation / raw IndexError. 191 pretest tests pass (was 187 after R8).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R10 P1 (Methodology) -- Stute survey bootstrap was silently miscalibrated
on stratified designs. The HAD sup-t bootstrap (had.py:2120+) applies a
within-stratum demean + sqrt(n_h/(n_h-1)) small-sample correction AFTER
generate_survey_multiplier_weights_batch returns, to make the bootstrap
variance match the Binder-TSL stratified target. The same correction
has NOT been derived for the Stute CvM functional, so applying the
helper's raw multipliers directly to residual perturbations on stratified
designs left the bootstrap p-value silently miscalibrated.

Per the reviewer's offered "narrow support" path: Phase 4.5 C now
explicitly rejects stratified designs on the Stute family with
NotImplementedError. Pweight-only, PSU-only, and FPC-only designs
remain supported (the helper's output is appropriately scaled for those
without further correction). Stratified is a follow-up after the
matching Stute-CvM stratified-correction derivation lands.

Mirrors the lonely_psu='adjust' rejection pattern (R5 P1) — both are
methodology-gap-driven explicit NotImplementedErrors with documented
follow-up. The strata guard supersedes the lonely_psu='adjust'
singleton-strata guard for any stratified design (the latter is now
defense-in-depth for the unstratified residual case).

R10 P3 -- Added "stratified-design rejection" entry to REGISTRY's
Note (Phase 4.5 C). Also updated CHANGELOG to narrow the documented
survey contract.

Tests updated:
- test_stute_test_lonely_psu_adjust_singletons_raises ->
  test_stute_test_stratified_design_raises (the strata guard fires
  first; the test is still meaningful but on a strata key match).
- Same renaming for stute_joint_pretest variant.
- test_stute_test_lonely_psu_remove_singletons_returns_nan REMOVED
  (singleton strata under lonely_psu='remove' now hits the strata
  guard instead of the df_survey<=0 guard).
- test_joint_homogeneity_test_psu_strata_survey_smoke ->
  test_joint_homogeneity_test_psu_only_survey_smoke (positive coverage
  on PSU-only design) + new test_joint_homogeneity_test_stratified_raises.
- test_workflow_event_study_psu_strata_survey_smoke ->
  test_workflow_event_study_psu_only_survey_smoke.
- test_workflow_event_study_survey_pass_does_not_say_inconclusive
  switched from strata to PSU-only.

191 pretest tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 25, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: d8a7353dd8eaf0821566f3176bb2fff2079cbace


Overall Assessment

Looks good — the prior P1s are addressed, and I did not find any new unmitigated P0/P1 issues in the changed methodology or inference code.

Executive Summary

  • Affected methods: stute_test, yatchew_hr_test, stute_joint_pretest, joint_pretrends_test, joint_homogeneity_test, did_had_pretest_workflow, plus HADPretestReport serialization/rendering.
  • The prior stratified-Stute methodology issue is resolved by explicitly rejecting stratified designs in the Stute family and documenting that narrowed support surface in the registry/changelog, rather than silently running a miscalibrated bootstrap (diff_diff/had_pretests.py:L1724-L1750, diff_diff/had_pretests.py:L2938-L2954, docs/methodology/REGISTRY.md:L2445-L2449, CHANGELOG.md:L11-L11).
  • The prior staggered-weights= routing issue is resolved: the workflow and both joint wrappers now validate shape/length before any pos_idx slicing, so oversized arrays are no longer silently truncated and undersized arrays no longer fail deep in NumPy (diff_diff/had_pretests.py:L3331-L3363, diff_diff/had_pretests.py:L3574-L3600, diff_diff/had_pretests.py:L3957-L3985).
  • The weighted-CvM outer-measure fix and the qug=None survey-report rendering/verdict logic remain correctly wired and tested in the changed code (diff_diff/had_pretests.py:L1058-L1106, diff_diff/had_pretests.py:L677-L808, tests/test_had_pretests.py:L2978-L2999).
  • Two minor follow-ups remain: TODO.md still overstates stratified support, and the new tests do not include a positive FPC-only survey case.
  • Static review only: I could not run the test suite here because this environment is missing basic Python test/runtime dependencies (pytest, numpy).

Methodology

No unmitigated findings.

The main prior methodology blocker is closed: the PR now rejects stratified Stute paths up front instead of applying raw survey multipliers on a target that the codebase itself says still lacks the matching Stute-specific correction (diff_diff/had_pretests.py:L1724-L1750, diff_diff/had_pretests.py:L2938-L2954). That narrowing is documented in the methodology registry and release notes, so it is no longer an undocumented deviation (docs/methodology/REGISTRY.md:L2445-L2449, CHANGELOG.md:L11-L11).

Code Quality

No unmitigated findings.

The previously flagged shape/length bug on staggered weights= routing is fixed by front-door validation before subsetting in all three affected entry points (diff_diff/had_pretests.py:L3331-L3363, diff_diff/had_pretests.py:L3574-L3600, diff_diff/had_pretests.py:L3957-L3985).

Performance

No findings.

Maintainability

No unmitigated findings.

Tech Debt

  • Severity: P3
    Impact: TODO.md still says Phase 4.5 C ships PSU/strata/FPC support, but the actual changed code/docs now reject stratified Stute designs. That makes the debt ledger disagree with the real support surface and weakens future review tracking. (TODO.md:L98, docs/methodology/REGISTRY.md:L2447-L2449, CHANGELOG.md:L11-L11)
    Concrete fix: Update TODO.md:L98 to say Phase 4.5 C ships pweight + PSU + FPC support only, and either add stratified-Stute derivation as its own follow-up item or fold it into that row.

Security

No findings.

Documentation/Tests

  • Severity: P3
    Impact: The PR now advertises FPC-only support in the changelog/registry, but the added survey test matrix does not include a positive fpc= case. That leaves one newly supported inference branch unpinned by regression coverage. (CHANGELOG.md:L11-L11, docs/methodology/REGISTRY.md:L2447-L2449, tests/test_had_pretests.py:L2854-L4190)
    Concrete fix: Add at least one positive smoke/regression test for SurveyDesign(weights="w", fpc="fpc") on the Stute family, ideally one overall-path test and one event-study or direct-helper test.

Runtime verification unavailable here because the environment does not have pytest or numpy.

R12 P3 #1 -- TODO row 98 said Phase 4.5 C ships "PSU/strata/FPC" but
R10 narrowed Stute-family support to pweight + PSU + FPC only
(stratified rejected with NotImplementedError pending derivation).
Updated to reflect the actual support surface and consolidated the
stratified-Stute follow-up alongside replicate-weight pretests as the
two known Phase 4.5 C follow-ups.

R12 P3 #2 -- the new survey test matrix covered pweight-only and
PSU-only smokes but no FPC-only case. The bootstrap helper applies
sqrt(1 - f) FPC scaling to multipliers under FPC, which was unpinned
by direct regression. 2 new positive smokes:
- test_stute_test_fpc_only_survey_smoke: direct helper with
  ResolvedSurveyDesign(fpc=...) populated.
- test_workflow_overall_fpc_only_survey_smoke: workflow path with
  SurveyDesign(weights=, fpc=) column reference.

193 pretest tests pass (was 191).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 25, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: a4fc64e2ea68434bcbee9455535bb62dff56224e


Overall Assessment

Looks good — the prior P1s appear addressed, and I did not find any new unmitigated P0/P1 issues in the changed survey-aware HAD pretest paths.

Executive Summary

  • Affected methods: stute_test, yatchew_hr_test, stute_joint_pretest, joint_pretrends_test, joint_homogeneity_test, did_had_pretest_workflow, plus HADPretestReport serialization/rendering (diff_diff/had_pretests.py:L1414-L1818, diff_diff/had_pretests.py:L1831-L2228, diff_diff/had_pretests.py:L2509-L3034, diff_diff/had_pretests.py:L568-L830).
  • The changed survey methodology matches the new registry note: workflow skips QUG under survey, Stute uses PSU-level Mammen multipliers, and Yatchew uses the documented weighted closed-form components (docs/methodology/REGISTRY.md:L2434-L2449).
  • The prior stratified-Stute blocker is resolved by explicit rejection plus documentation/tracking, instead of silently running an unsupported calibration (diff_diff/had_pretests.py:L1724-L1766, diff_diff/had_pretests.py:L2938-L2970, docs/methodology/REGISTRY.md:L2445-L2447, TODO.md:L98-L98).
  • The prior staggered weights= routing issue is resolved: row-level weights are validated and subset before re-aggregation in both joint wrappers and the workflow (diff_diff/had_pretests.py:L3331-L3363, diff_diff/had_pretests.py:L3574-L3600, diff_diff/had_pretests.py:L3957-L3985).
  • The all-zero survey-bootstrap edge case is now guarded: variance-unidentified survey designs return NaN instead of calibrating against a zero multiplier law (diff_diff/had_pretests.py:L1767-L1797, diff_diff/had_pretests.py:L2971-L2996).
  • Added tests now pin the earlier gaps around qug=None rendering, uniform-weight Yatchew reduction, staggered weight subsetting, stratified rejection, and FPC-only support (tests/test_had_pretests.py:L2977-L3005, tests/test_had_pretests.py:L3141-L3160, tests/test_had_pretests.py:L3526-L3548, tests/test_had_pretests.py:L3852-L3885, tests/test_had_pretests.py:L4195-L4242). I could not rerun pytest here because pytest is not installed in this environment.

Methodology

  • Severity: P3. Impact: The remaining survey-path caveats in this PR are documented rather than silent: Yatchew under survey remains pweight-based rather than PSU-cluster-adjusted, and replicate-weight / stratified Stute designs are explicitly out of scope. That matches the registry note and the code guards, so I am treating them as documented deviations, not defects (docs/methodology/REGISTRY.md:L2436-L2447, diff_diff/had_pretests.py:L1535-L1551, diff_diff/had_pretests.py:L1724-L1766, diff_diff/had_pretests.py:L1964-L1976, diff_diff/had_pretests.py:L2608-L2621). Concrete fix: None required for approval.

Code Quality

No findings. The changed code also avoids the specific anti-patterns called out in the review rubric: I did not find new inline generic SE/t-stat inference code or partial NaN-guard handling in the modified paths, and the new front-door weight validation/subsetting is consistent across the affected wrappers (diff_diff/had_pretests.py:L3331-L3363, diff_diff/had_pretests.py:L3574-L3600, diff_diff/had_pretests.py:L3957-L3985).

Performance

No findings. The PR keeps the existing literal refit/bootstrap structure; I did not see a new asymptotic regression beyond the already-tracked Stute refit debt in TODO.md:L104-L104.

Maintainability

No findings. The support surface is materially clearer now because unsupported designs fail at the entry points with method-specific messages, and the survey-only verdict composition was split into dedicated helpers instead of string post-processing (diff_diff/had_pretests.py:L3632-L3738).

Tech Debt

  • Severity: P3. Impact: The remaining deferred work on this surface is now correctly tracked in TODO.md: replicate-weight pretests and stratified Stute derivation are explicitly listed as follow-ups, so there is no untracked technical debt blocking merge (TODO.md:L98-L98). Concrete fix: None required for this PR.

Security

No findings.

Documentation/Tests

No findings. The registry, changelog, and TODO now agree on the shipped support surface, and the earlier missing FPC-only coverage is present (CHANGELOG.md:L11-L11, TODO.md:L98-L98, docs/methodology/REGISTRY.md:L2434-L2449, tests/test_had_pretests.py:L4195-L4242). Runtime verification was not possible here because pytest is unavailable locally.

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 25, 2026
@igerber igerber merged commit 435dfc2 into main Apr 25, 2026
23 of 24 checks passed
@igerber igerber deleted the had-phase-4.5-c branch April 25, 2026 18:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-ci Triggers CI test workflows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant