HAD Phase 4.5 C: linearity-family pretests under survey by igerber · Pull Request #370 · igerber/diff-diff

igerber · 2026-04-25T12:40:38Z

Summary

Closes the Phase 4.5 C0 promise (PR HAD Phase 4.5 C0: QUG-under-survey decision gate #367 commit 29f8b12). Linearity-family pretests now accept weights= / survey= kwargs.
Stute family: PSU-level Mammen multiplier bootstrap via bootstrap_utils.generate_survey_multiplier_weights_batch (same kernel as PR HAD Phase 4.5 B: weighted mass-point 2SLS + event-study survey composition + sup-t bootstrap #363 sup-t). Joint Stute SHARES the multiplier matrix across horizons, preserving vector-valued empirical-process unit-level dependence + PSU clustering.
Yatchew: closed-form weighted OLS + pweight-sandwich variance components (no bootstrap). All three components reduce bit-exactly to existing unweighted formulas at w=ones(G) (locked at atol=1e-14).
Workflow: skips QUG with UserWarning per C0 deferral, sets qug=None on report, dispatches survey-aware sub-tests, appends "linearity-conditional verdict; QUG-under-survey deferred per Phase 4.5 C0" suffix.
HADPretestReport.qug retyped from QUGTestResults to Optional[QUGTestResults]; summary / to_dict / to_dataframe updated to None-tolerant rendering.
Replicate-weight survey designs (BRR/Fay/JK1/JKn/SDR) raise NotImplementedError at every entry point (defense in depth) — parallel follow-up.
20 new tests; 158 pretest tests pass.

Methodology

Stute calibration (locked decision via plan-mode AskUserQuestion): PSU-level Mammen multipliers, NOT Rao-Wu rescaling — different mechanism. Reuses the kernel from PR #363's HAD event-study sup-t bootstrap. Per-obs perturbation eta_obs[g] = eta_psu[psu(g)], weighted OLS refit, weighted CvM via new _cvm_statistic_weighted. Joint Stute shares the (B, n_psu) matrix across horizons within each replicate.

Yatchew weighted variance components (Krieger-Pfeffermann 1997 §3 pair-weight convention):

sigma2_lin = sum(w·eps²) / sum(w)
sigma2_diff = sum(w_avg·diff²) / (2·sum(w)) with w_avg_g = (w_g+w_{g-1})/2. Divisor uses sum(w) (=G at w=1), NOT sum(w_avg), to match the existing (1/(2G)) unweighted formula at had_pretests.py:1635 (Reviewer CRITICAL Add fixed effects and absorb parameters to DifferenceInDifferences #2).
sigma4_W = sum(w_avg·prod) / sum(w_avg) reduces to (1/(G-1))·sum(prod) at w=1.
T_hr = sqrt(sum(w))·(sigma2_lin-sigma2_diff)/sigma2_W (effective-sample-size convention).

Trivial-survey-equivalence is DISTRIBUTIONAL (not bit-exact at atol=1e-10) — the survey path uses batched generate_survey_multiplier_weights_batch while unweighted uses per-iteration _generate_mammen_weights. Different RNG consumption ordering means same-seed produces different multiplier matrices but the bootstrap p-value distributions agree at large B (Reviewer CRITICAL #3 reframe).

Stability invariants preserved

Unweighted code paths bit-exact pre-PR (separate if branch for survey/weights). All 138 existing pretest tests pass unchanged.
Yatchew weighted variance components reduce to unweighted at w=1 at atol=1e-14 (locked by TestYatchewHRTestSurvey::test_weighted_reduces_to_unweighted_at_uniform_weights).
HADPretestReport schema bit-exact on the unweighted path; qug=None triggers None-tolerant rendering only on the survey path.

Files

File	Change
`diff_diff/had_pretests.py`	survey/weights kwargs + survey-aware bootstrap on 5 helpers + workflow dispatcher rewrite + `_fit_weighted_ols_intercept_slope` / `_cvm_statistic_weighted` / `_resolve_pretest_unit_weights` helpers + `HADPretestReport.qug` retyped to Optional
`diff_diff/survey.py`	`_make_trivial_resolved` helper for pweight-shortcut routing
`docs/methodology/REGISTRY.md`	Note (Phase 4.5 C) under QUG Null Test entry with full methodology
`tests/test_had_pretests.py`	20 new tests (TestHADPretestWorkflowSurveyGuards revised, TestStuteTestSurvey, TestYatchewHRTestSurvey, TestJointStuteSurvey)
`CHANGELOG.md`	[Unreleased] Patch entry
`TODO.md`	Replace Phase 4.5 C row with replicate-weight follow-up row

Test plan

pytest tests/test_had_pretests.py -v — 158/158 green (138 pre-PR + 20 new)
black diff_diff tests — clean
ruff check diff_diff tests — clean (auto-fixed import ordering)
Manual REPL smoke: stute_test(weights=), yatchew_hr_test(weights=), did_had_pretest_workflow(weights=) all functional with QUG-skip warning
CI

🤖 Generated with Claude Code

github-actions · 2026-04-25T12:53:27Z

Overall Assessment
⛔ Blocker

Executive Summary

The new survey-aware Stute family has a silent zero-weight bug on the survey= path: zero-weight/domain-excluded units are neither dropped nor rejected, so they can create effective dose variation and contribute to the weighted CvM statistic. That can produce wrong pretest decisions with no warning.
The new pretest survey path silently accepts SurveyDesign(weight_type="aweight"|"fweight") even though the registry note and the existing HAD estimator code are pweight-only.
weights= does not interact correctly with the existing last-cohort auto-filter on the event-study path: staggered panels will hit a length-mismatch crash after filtering.
Replicate-weight deferral and Yatchew’s lack of PSU-cluster propagation are documented/tracked, so I did not count those as blockers.

Methodology
Affected methods: stute_test, stute_joint_pretest, joint_pretrends_test, joint_homogeneity_test, yatchew_hr_test, and did_had_pretest_workflow.

P0 Impact: the Stute survey path allows zero survey weights, but the implementation still checks dose variation on the full unweighted d vector and _cvm_statistic_weighted() still sums over every row. That means zero-weight/domain-excluded units can create variation and CvM mass in what should be the excluded part of the sample. In the extreme case where all positive-weight units share the same dose and only zero-weight units vary, the code can return a conclusive Stute result instead of an inconclusive/degenerate one. This is silent wrong statistical output on supported survey= inputs. References: diff_diff/had_pretests.py:L1016-L1052, diff_diff/had_pretests.py:L1548-L1552, diff_diff/had_pretests.py:L1562-L1655, diff_diff/had_pretests.py:L2598-L2741. Concrete fix: mask to w > 0 before every Stute-family dose-variation check, sort/CvM computation, and bootstrap refit; alternatively reject zero survey weights consistently at every survey-aware Stute/workflow entry point until true domain-weight handling is implemented.
P1 Impact: the new survey-aware pretests never check resolved.weight_type, so aweight and fweight survey designs will run through pweight formulas with no warning. That is an undocumented methodology deviation from the new registry note and from the existing HAD survey guard behavior. References: docs/methodology/REGISTRY.md:L2434-L2441, diff_diff/had_pretests.py:L1535-L1541, diff_diff/had_pretests.py:L1834-L1846, diff_diff/had_pretests.py:L2600-L2606, diff_diff/had_pretests.py:L2763-L2815, diff_diff/had.py:L2976-L2994. Concrete fix: reject weight_type != "pweight" in _resolve_pretest_unit_weights and in the direct-helper survey= branches.

Code Quality

P1 Impact: the new weights= event-study path is broken on staggered panels that trigger last-cohort filtering. joint_pretrends_test, joint_homogeneity_test, and did_had_pretest_workflow(..., aggregate="event_study") validate/filter to data_filtered, then call _aggregate_unit_weights on the original row-level weights array, so the length check fails after filtering. This is exactly the new-parameter interaction the edge-case checklist calls out. References: diff_diff/had_pretests.py:L2925-L2991, diff_diff/had_pretests.py:L3124-L3192, diff_diff/had_pretests.py:L3414-L3467, diff_diff/had.py:L1604-L1645. Concrete fix: subset the row-aligned weights array to data_filtered before re-aggregating, using the same positional-index pattern already used in HeterogeneousAdoptionDiD.fit().

Performance
No material findings in the changed lines.

Maintainability
No additional maintainability findings beyond the bugs above.

Tech Debt

P3 Impact: replicate-weight pretests are deferred, but that limitation is explicitly tracked and guarded, so it is not a blocker. References: TODO.md:L98-L100, diff_diff/had_pretests.py:L1478-L1485, diff_diff/had_pretests.py:L1804-L1809, diff_diff/had_pretests.py:L2418-L2423. Concrete fix: none required in this PR.

Security
No findings.

Documentation/Tests

P3 Impact: REGISTRY.md now contains conflicting methodology statements: the old Phase 4.5 C0 note still says the workflow surface is gated and that Phase 4.5 C uses Rao-Wu rescaling, while the new Phase 4.5 C note and code say the workflow is enabled and Stute uses PSU-level multiplier bootstrap. The new tests also never exercise the broken staggered-weights= path or the zero-weight survey Stute path. References: docs/methodology/REGISTRY.md:L2428-L2445, diff_diff/had_pretests.py:L609-L618, diff_diff/had_pretests.py:L2263-L2269, tests/test_had_pretests.py:L2909-L2955, tests/test_had_pretests.py:L3241-L3356. Concrete fix: rewrite the C0 note as historical/narrow it to qug_test, update the stale docstrings, and add regressions for zero-weight survey designs and staggered event-study weights=.

Path to Approval

Make the survey-aware Stute family domain-consistent: drop w==0 units before variation checks/CvM/bootstrap, or reject zero survey weights everywhere until that handling exists; add a regression where only zero-weight units carry dose variation.
Add pweight-only guards to all pretest survey entry points and add aweight/fweight rejection tests.
Subset row-level weights= after event-study last-cohort filtering before calling the joint wrappers/workflow, and add staggered-panel regressions for both the direct wrappers and did_had_pretest_workflow(..., aggregate="event_study", weights=...).

Static review only: I couldn’t run pytest in this sandbox because the available interpreter does not have project dependencies installed (numpy is missing).

R1 P0 — Stute survey path silently accepted zero-weight units, which leak into the dose-variation check + CvM cusum + bootstrap refit while contributing zero population mass. Extreme case: only zero-weight units carry dose variation -> spurious finite test statistic with no warning. Fix: strictly-positive guards on every survey-aware Stute / Yatchew / workflow entry point (the weights= shortcut already had this; survey= branch was the gap). R1 P1 #1 — aweight/fweight survey designs slipped through pweight-only formulas silently (the variance components are derived assuming pweight sandwich semantics). Fix: weight_type='pweight' guards added in _resolve_pretest_unit_weights and on every direct-helper survey= branch (stute_test, yatchew_hr_test, stute_joint_pretest). Mirrors HAD.fit guard at had.py:2976 + survey._resolve_pweight_only at survey.py:914. R1 P1 #2 — workflow's row-level weights= crashed on staggered event- study panels because _validate_multi_period_panel filters to last cohort but the joint wrappers re-aggregate with the original full- panel weights array. Fix: subset joint_weights to data_filtered's rows via data.index.get_indexer(data_filtered.index) BEFORE passing to the wrappers. Mirrors HeterogeneousAdoptionDiD.fit positional- index pattern. Survey= path is unaffected (column references resolve internally on data_filtered). R1 P3 — REGISTRY C0 note still said "the same gate applies to did_had_pretest_workflow" and "Phase 4.5 C uses Rao-Wu rescaling"; both are stale post-C. Updated to clarify (a) workflow gate was temporary and is now closed by C, (b) qug_test direct-helper gate remains permanent, (c) C uses PSU-level Mammen multiplier bootstrap (NOT Rao-Wu rescaling). 7 new tests in TestPhase45CR1Regressions covering: zero-weight survey on stute_test / stute_joint_pretest / workflow; aweight rejection on stute_test / workflow; fweight rejection on yatchew_hr_test; staggered event-study workflow with weights= (catches the length-mismatch crash). 165 pretest tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-25T13:07:50Z

/ai-review

github-actions · 2026-04-25T13:23:24Z

🔁 AI review rerun (requested by @igerber)

Head SHA: fb032672790cddb0cbc4d7df7a283fdfe54c7ecd

Overall Assessment

⚠️ Needs changes

Executive Summary

The prior zero-weight-survey and non-pweight guard findings look addressed on the direct-helper and workflow paths.
One prior P1 remains: the new weights= parameter still breaks on staggered panels for the direct public wrappers joint_pretrends_test() and joint_homogeneity_test().
There is also a new methodology-level inconsistency: the Phase 4.5 C registry note and the Stute survey implementation do not agree on the bootstrap perturbation law.
Replicate-weight deferral is properly guarded and tracked in TODO.md, so I did not count it as a blocker.
In-code docstrings/tests are not fully synced to the shipped Phase 4.5 C behavior.

Methodology

P1 — Affected methods: stute_test, stute_joint_pretest, joint_pretrends_test, joint_homogeneity_test, and the survey/weights branch of did_had_pretest_workflow. Impact: the new registry note documents the survey Stute bootstrap as dy_b = fitted + eps * w * eta_obs, but the implementation in both Stute paths uses dy_b = fitted + eps * eta_obs. Because that perturbation defines the bootstrap law, this is not a cosmetic wording issue: either the registry note or the code is wrong, and the PR does not document which. The underlying HAD paper’s Appendix D bootstrap perturbs residuals with a wild multiplier and then recomputes the statistic, while Theorem 7 keeps the unweighted Yatchew formulas as 1/(2G), 1/(G-1), and sqrt(G), so the unweighted branches still look paper-faithful; the ambiguity is specific to the new weighted Stute extension. Refs: docs/methodology/REGISTRY.md:L2434-L2435, diff_diff/had_pretests.py:L1656-L1678, diff_diff/had_pretests.py:L2760-L2785. Concrete fix: reconcile code and REGISTRY.md to one derivation-backed perturbation rule and lock it with a targeted regression test. (researchgate.net)

Code Quality

P1 — Impact: the direct public wrappers joint_pretrends_test(..., weights=...) and joint_homogeneity_test(..., weights=...) still crash on staggered event-study panels. Both wrappers call _validate_had_panel_event_study() and may filter to data_filtered, but then pass the original row-aligned weights array into _resolve_pretest_unit_weights(data_filtered, ...), which expects alignment with the filtered rows. The workflow path now subsets weights correctly before calling the wrappers, but the wrappers themselves still do not, so the new parameter remains broken on one public code path. Refs: diff_diff/had_pretests.py:L3003-L3013, diff_diff/had_pretests.py:L3064-L3099, diff_diff/had_pretests.py:L3202-L3211, diff_diff/had_pretests.py:L3267-L3297, compare the workflow fix at diff_diff/had_pretests.py:L3492-L3515. Concrete fix: in both wrappers, subset row-level weights to data_filtered.index via data.index.get_indexer(data_filtered.index) before calling _resolve_pretest_unit_weights, mirroring did_had_pretest_workflow() / HeterogeneousAdoptionDiD.fit().

Performance

No findings.

Maintainability

No additional findings beyond the code/registry/docstring inconsistencies above.

Tech Debt

P3 — Impact: replicate-weight survey pretests are deferred, but that limitation is explicitly tracked and consistently guarded, so it is not a blocker. Refs: TODO.md:L98-L100, diff_diff/had_pretests.py:L1478-L1485, diff_diff/had_pretests.py:L1827-L1832, diff_diff/had_pretests.py:L2448-L2453, diff_diff/had_pretests.py:L2863-L2868. Concrete fix: none required in this PR.

Security

No findings.

Documentation/Tests

P3 — Impact: the in-code docs are not fully synchronized with Phase 4.5 C. qug_test still tells readers that survey-aware Stute “admits a Rao-Wu rescaled bootstrap,” and HADPretestReport still documents the old event-study all_pass contract that requires finite QUG output, both of which contradict the new registry note/code. The new tests also cover the workflow-level staggered-weights= regression, but not the direct-wrapper staggered path that still fails. Refs: diff_diff/had_pretests.py:L609-L618, diff_diff/had_pretests.py:L1201-L1215, tests/test_had_pretests.py:L3241-L3278, tests/test_had_pretests.py:L3526-L3549. Concrete fix: sync the docstrings to the shipped Phase 4.5 C contract and add staggered-panel weights= regressions for the two direct wrappers.

Path to Approval

Fix joint_pretrends_test() and joint_homogeneity_test() to subset row-aligned weights= after last-cohort filtering, before _resolve_pretest_unit_weights.
Reconcile the survey Stute bootstrap methodology end-to-end: either update the implementation to the registry’s documented perturbation rule or correct REGISTRY.md (and surrounding docstrings) to the implemented rule, then add a regression that pins the intended law.
Add direct-wrapper staggered-panel tests for joint_pretrends_test(..., weights=...) and joint_homogeneity_test(..., weights=...), plus doc updates for the final approved survey-bootstrap contract.

Static review only: I could not run the test suite in this sandbox because the available Python environment is missing project dependencies (numpy, pandas).

igerber · 2026-04-25T13:29:05Z

/ai-review

R2 P1 #1 (Code Quality) -- joint_pretrends_test and joint_homogeneity_test direct calls still crashed on staggered panels because the staggered- weights subset fix from R1 was only applied at the workflow level. The wrappers run their own _validate_had_panel_event_study() and may filter to data_filtered, then passed the original full-panel weights array to _resolve_pretest_unit_weights(data_filtered, ...) which expects the filtered row count. Fix: subset row-level weights to data_filtered.index positions (via data.index.get_indexer) BEFORE _resolve_pretest_unit_weights, mirroring the workflow fix. R2 P1 #2 (Methodology) -- REGISTRY note documented the bootstrap perturbation as `dy_b = fitted + eps * w * eta_obs`, but the code does `dy_b = fitted + eps * eta_obs` (no `* w`). Code is correct: paper Appendix D wild-bootstrap perturbs UNWEIGHTED residuals; weighting flows through the OLS refit and the weighted CvM, not through the perturbation. Adding `* w` would over-weight by w². Fix: update REGISTRY note to remove the spurious `* w` and clarify the canonical form. Add a regression that pins (a) bit-exact cvm_stat reduction at uniform weights, (b) bootstrap p-value distributional agreement within Monte-Carlo noise. R2 P3 -- in-code docstrings still referenced the pre-Phase-4.5-C contract: - qug_test docstring said survey-aware Stute "admits a Rao-Wu rescaled bootstrap" (PSU-level Mammen multiplier bootstrap is what shipped). Updated to reflect the correct mechanism. - HADPretestReport.all_pass docstring described the unweighted contract only; survey/weights path drops the QUG-conclusiveness gate (linearity-conditional admissibility per C0 deferral). Updated. 3 new regression tests in TestPhase45CR1Regressions: - test_joint_pretrends_test_staggered_weights_subset - test_joint_homogeneity_test_staggered_weights_subset - test_stute_survey_perturbation_does_not_double_weight (locks the perturbation form via cvm_stat bit-exact reduction + p-value MC bound) 168 pretest tests pass (was 165 after R1). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-04-25T13:37:07Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 085d8ebfa23667930bed2cd292717f1a3b38dfcb

Overall Assessment

⛔ Blocker

Executive Summary

The two prior re-review P1s appear fixed: the registry/code Stute perturbation law is now consistent, and the staggered-panel weights= alignment bug is patched in both direct joint wrappers and the workflow.
The new survey Stute implementation still has a blocker on variance-unidentified survey designs: an unstratified single-PSU design can yield a finite near-zero p-value instead of NaN/inconclusive.
That failure propagates into both direct Stute helpers and did_had_pretest_workflow, so the workflow can falsely mark TWFE inadmissible on a survey edge case.
Replicate-weight deferral is consistently guarded and properly tracked in TODO.md, so I did not count it as a blocker.
Some in-code docs remain stale: one qug_test error path still says Phase 4.5 C is “planned” and names Rao-Wu, and the direct joint-helper docstrings still omit their new weights/survey kwargs.

Methodology

P0 Affected methods: stute_test, stute_joint_pretest, joint_pretrends_test, joint_homogeneity_test, and both workflow branches that call them. Impact: on variance-unidentified survey designs, the new bootstrap path can emit materially wrong statistical output instead of an inconclusive result. The clearest case is an unstratified single-PSU design: the survey layer already documents that this should yield NaN variance, and the multiplier helper explicitly treats it as “variance unidentified” by returning an all-zero multiplier matrix. But the new Stute code then treats that zero matrix as a valid bootstrap law and computes p_value = (1 + sum(S_b >= S)) / (B + 1), which collapses to p ≈ 1/(B+1) whenever the observed CvM statistic is positive. That is a false rejection surface, not just a conservative choice. Refs: docs/methodology/REGISTRY.md:L2939-L2941, diff_diff/bootstrap_utils.py:L548-L561, diff_diff/had_pretests.py:L1664-L1685, diff_diff/had_pretests.py:L2770-L2793, diff_diff/had_pretests.py:L3565-L3598, diff_diff/had_pretests.py:L3690-L3749. Concrete fix: add the same NaN/inconclusive gate the rest of the survey stack uses for variance-unidentified designs before converting survey multipliers into a p-value. At minimum, unstratified n_psu < 2 must return p_value=NaN, reject=False, and an explicit warning; more generally, the Stute survey path should distinguish “unidentified variance” from “legitimate zero variance” instead of blindly accepting an all-zero multiplier matrix.

Code Quality

No findings. The prior staggered-panel weights= propagation bug looks fixed in the direct wrappers and the workflow.

Performance

No findings.

Maintainability

No findings beyond the documentation drift below.

Tech Debt

P3 Replicate-weight survey pretests are still deferred, but that limitation is explicitly tracked and guarded at every entry point, so it is not blocking. Refs: TODO.md:L98-L100, diff_diff/had_pretests.py:L1482-L1488, diff_diff/had_pretests.py:L1831-L1835, diff_diff/had_pretests.py:L2452-L2456, diff_diff/had_pretests.py:L2867-L2871. Concrete fix: none required in this PR.

Security

No findings.

Documentation/Tests

P3 The in-code documentation is still not fully synchronized with the shipped Phase 4.5 C contract. qug_test’s NotImplementedError text still says survey-aware Stute is “planned” and “admits a Rao-Wu rescaled bootstrap,” while stute_joint_pretest, joint_pretrends_test, and joint_homogeneity_test do not document their newly added weights/survey kwargs at all. Refs: diff_diff/had_pretests.py:L1247-L1268, diff_diff/had_pretests.py:L2399-L2443, diff_diff/had_pretests.py:L2931-L2954, diff_diff/had_pretests.py:L3155-L3168. Concrete fix: update the stale qug_test message and add parameter/notes coverage for the joint helper survey contract.
Validation note: static review only. I could not run the test suite in this sandbox because pytest is not installed.

Path to Approval

In the survey Stute paths, detect variance-unidentified survey designs before p-value computation and return an inconclusive result (p_value=NaN, reject=False, warning) instead of calibrating against an all-zero multiplier matrix.
Add regressions for stute_test, stute_joint_pretest, and did_had_pretest_workflow covering at least one unstratified single-PSU survey design and asserting the NaN/inconclusive surface.
Sync the remaining stale direct-helper docs (qug_test error text and the joint-helper docstrings) to the final Phase 4.5 C methodology.

igerber · 2026-04-25T13:42:37Z

/ai-review

R3 P0 (Methodology) -- variance-unidentified survey-design guard. generate_survey_multiplier_weights_batch returns an all-zero (B, n_psu) multiplier matrix when n_psu - n_strata <= 0 (e.g. unstratified single-PSU design where df_survey = n_psu - 1 = 0). Without a guard, the Stute survey path treated zero perturbations as a valid bootstrap law and emitted p_value = 1/(B+1) for any positive observed CvM (spurious rejection on a survey edge case). Fix: detect df_survey <= 0 BEFORE calling the multiplier helper and return a NaN p_value with a UserWarning, matching the broader package convention for variance-unidentified survey designs (mirrors compute_survey_vcov NaN treatment, ResolvedSurveyDesign.df_survey semantics). Applied in stute_test (single-horizon) and stute_joint_pretest (joint variant); the data-in wrappers and the workflow propagate via the same calls. R3 P3 -- in-code docstrings still drifted on three surfaces: - qug_test NotImplementedError text said survey-aware Stute is "planned" and "admits a Rao-Wu rescaled bootstrap"; both are stale post-Phase-4.5-C. Updated to reflect the shipped reality (PSU-level Mammen multiplier bootstrap; QUG step is skipped under survey/weights in the workflow). - joint_pretrends_test, joint_homogeneity_test, stute_joint_pretest Parameters sections did not document the new weights= / survey= kwargs at all. Added with the per-test contract (per-unit aggregation, staggered subsetting, replicate/aweight/fweight rejection, variance-unidentified guard). 3 new regression tests in TestPhase45CR1Regressions: - test_stute_test_single_psu_unstratified_returns_nan - test_stute_joint_pretest_single_psu_unstratified_returns_nan - test_workflow_single_psu_propagates_nan_through_stute (also documents Yatchew's PSU-agnostic-by-design behavior; verdict carries the linearity-conditional suffix) 171 pretest tests pass (was 168 after R2). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-04-25T13:53:19Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 39b29a97d31633f538805b2db957e163c3160c96

Overall Assessment

⛔ Blocker

Executive Summary

The prior blocker on variance-unidentified survey Stute looks fixed: both stute_test and stute_joint_pretest now gate df_survey <= 0 and return NaN/inconclusive instead of spuriously rejecting.
The stale Phase 4.5 C docs called out in the last review also appear updated.
New blocker: the weighted Yatchew path is not pweight-scale invariant. SurveyDesign.resolve() normalizes pweights to mean 1, but the new weights= shortcut feeds raw weights into sqrt(sum(w)), so the same design can yield different Yatchew p-values depending on whether the caller uses weights=, survey=SurveyDesign(weights=...), or just rescales weights by a constant.
New P1: the new direct-helper weights= entries do not enforce 1-D input. A common (G, 1) array from df[["w"]].to_numpy() will broadcast through the weighted moments/CvM calculations and silently corrupt results instead of raising.
Replicate-weight deferral is consistently guarded and properly tracked in TODO.md; I did not count it against the assessment.
Static review only; I could not run the suite in this sandbox because numpy/pytest are unavailable.

Methodology

P0 docs/methodology/REGISTRY.md:L2436-L2440, diff_diff/survey.py:L189-L205, diff_diff/had_pretests.py:L1911-L1925, diff_diff/had_pretests.py:L2097-L2101, diff_diff/had_pretests.py:L2916-L2930, diff_diff/had_pretests.py:L3790-L3795.
Impact: the new yatchew_hr_test(weights=...) and the workflow’s overall-path weights= branch use raw weights in sqrt(sum(w)), while the survey= path uses SurveyDesign.resolve()’s mean-1-normalized pweights. That makes the test statistic depend on arbitrary weight scale: w, 100*w, and survey=SurveyDesign(weights=...) can produce materially different p-values/verdicts for the same design. Because the workflow treats any conclusive Yatchew rejection as disqualifying, this can silently flip the overall pretest verdict.
Concrete fix: normalize all pweight inputs to the same mean-1 convention before weighted pretest arithmetic, or otherwise use a scale-invariant effective-sample-size rule and apply it identically on both weights= and survey= paths. Add regressions for weights=w vs weights=100*w and for weights= vs survey=SurveyDesign(weights="w") on both yatchew_hr_test and did_had_pretest_workflow(..., aggregate="overall").

Code Quality

P1 diff_diff/had_pretests.py:L1573-L1584, diff_diff/had_pretests.py:L1912-L1923, diff_diff/had_pretests.py:L2701-L2712, diff_diff/survey.py:L706-L716.
Impact: the new direct-helper weights= paths accept non-1-D arrays. If a caller passes a column vector such as df[["w"]].to_numpy(), NumPy broadcasting changes the weighted OLS moments / CvM sums instead of raising, so stute_test and yatchew_hr_test can return silently wrong numbers and stute_joint_pretest can miscompute or fail unpredictably.
Concrete fix: validate weights (and any direct survey.weights) as strictly 1-D before use, following the existing _validate_1d_numeric / .ravel() pattern used elsewhere. Add regressions with (G, 1) inputs for all three direct helpers.

Performance

No findings.

Maintainability

No findings.

Tech Debt

P3 TODO.md:L98-L100, diff_diff/had_pretests.py:L1485-L1492, diff_diff/had_pretests.py:L1861-L1866, diff_diff/had_pretests.py:L2497-L2502, diff_diff/had_pretests.py:L2938-L2943.
Impact: replicate-weight pretests remain deferred, but the limitation is explicitly tracked and guarded at every entry point.
Concrete fix: none required in this PR.

Security

No findings.

Documentation/Tests

P2 tests/test_had_pretests.py:L3162-L3175, tests/test_had_pretests.py:L3052-L3068.
Impact: the new coverage only smoke-tests weights= and survey=_make_trivial_resolved(w) separately; it does not exercise the real SurveyDesign.resolve() normalization path or weight-scale invariance, so the blocker above slips through. It also does not cover (G,1) weight rejection.
Concrete fix: add tests for weights= vs survey=SurveyDesign(weights="w"), w vs c*w, and (G,1) weight arrays on the new public helper paths.

Path to Approval

Make weighted Yatchew invariant to arbitrary global pweight scaling and identical across weights= and survey= entry paths.
Add regressions covering weights=w vs weights=100*w and weights= vs survey=SurveyDesign(weights="w") for yatchew_hr_test and the overall workflow.
Reject non-1-D weights / survey.weights at all new direct-helper entry points and add (G,1) regression tests.

igerber · 2026-04-25T14:00:54Z

/ai-review

R4 P0 (Methodology) -- Yatchew test statistic was not invariant to uniform pweight rescaling. The formula `T_hr = sqrt(sum(w)) * (...)` makes T_hr scale as sqrt(c) under weights -> w * c, so weights=w and weights=100*w produced different p-values for the same design. Worse, SurveyDesign.resolve() normalizes pweights to mean=1 internally, so the survey= entry path and the weights= shortcut disagreed numerically. Fix: normalize per-unit pweights to mean=1 at every helper entry (stute_test, yatchew_hr_test, stute_joint_pretest) and at the workflow resolution helper. Matches SurveyDesign.resolve() convention; makes the Yatchew statistic scale-invariant; ensures weights=w and survey=SurveyDesign(weights="w") produce identical results for the same design. Stute is internally scale-invariant in functional form but normalization is required so the bootstrap helper sees the same weight vector under both entry paths (cross-path numerical agreement). R4 P1 (Code Quality) -- column-vector weights (e.g. `df[["w"]].to_numpy()` producing (G, 1)) silently broadcast through weighted moments / CvM sums instead of raising. Fix: validate via `_validate_1d_numeric` on all `weights=` arrays in stute_test, yatchew_hr_test, stute_joint_pretest; add explicit ndim check in `_resolve_pretest_unit_weights` with a hint about the common df[["w"]].to_numpy() mistake. 6 new regression tests in TestPhase45CR1Regressions: - test_yatchew_weights_scale_invariant (weights=w vs weights=100*w) - test_stute_weights_scale_invariant (mirror for Stute) - test_workflow_weights_eq_survey_at_overall_path (weights= shortcut and survey=SurveyDesign(...) produce identical Yatchew + Stute results, atol=1e-10) - test_stute_test_rejects_2d_weights / test_yatchew_hr_test_rejects_2d_weights / test_workflow_rejects_2d_weights (column-vector rejection at all three direct-helper / workflow entry points) 177 pretest tests pass (was 171 after R3). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-04-25T14:09:21Z

🔁 AI review rerun (requested by @igerber)

Head SHA: f8e959cb5cd242e8959401631cecc48644e0ff07

Overall Assessment

⚠️ Needs changes — one unmitigated P1 remains in the new survey Stute calibration.

Executive Summary

The prior blockers on raw-weight scale dependence and 1-D weights= validation look addressed in this rerun.
New P1: the Stute-family survey guard treats every df_survey <= 0 design as bootstrap-degenerate, but the shared survey multiplier kernel has a nonzero pooled-singleton branch for lonely_psu="adjust". That makes some singleton-strata survey designs spuriously inconclusive, or at minimum undocumentedly unsupported.
Replicate-weight deferral is consistently guarded and tracked in TODO.md, so I did not count it against the assessment.
Static review only: this sandbox does not have numpy / pytest, so I could not execute the added tests.

Methodology

Severity: P1. Impact: stute_test and stute_joint_pretest now return NaN for every df_survey <= 0 design (diff_diff/had_pretests.py:L1674-L1703, diff_diff/had_pretests.py:L2842-L2868). That is correct for truly degenerate all-zero-multiplier cases, but it is too broad for lonely_psu="adjust": the shared bootstrap helper explicitly pools singleton strata and can generate nonzero multipliers there (diff_diff/bootstrap_utils.py:L587-L651), and the survey variance scaffolding separately treats singleton-adjust designs as variance-computable (diff_diff/survey.py:L1566-L1572). The Phase 4.5 C registry note advertises Stute-family survey support without documenting a singleton-adjust carve-out (docs/methodology/REGISTRY.md:L2434-L2445). As written, those designs become inconclusive on the overall and event-study workflow paths even though the helper contract is nondegenerate. Concrete fix: either (1) replace the df_survey <= 0 proxy with a guard on actual bootstrap degeneracy and allow nonzero pooled-singleton cases, or (2) reject lonely_psu="adjust" singleton-strata Stute calibrations explicitly with NotImplementedError and document that carve-out, mirroring the explicit lonely-PSU handling already used for the sup-t bootstrap (diff_diff/had.py:L2081-L2118).

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity: P3. Impact: replicate-weight survey pretests remain deferred, but the limitation is explicitly tracked and the new entry points reject those designs up front (TODO.md:L98-L100). Concrete fix: none required in this PR.

Security

No findings.

Documentation/Tests

Severity: P2. Impact: the new survey regressions cover lonely_psu="remove" and the single-PSU degenerate case, but they do not pin the pooled-singleton lonely_psu="adjust" branch in the shared bootstrap helper (tests/test_had_pretests.py::TestStuteTestSurvey, ::TestJointStuteSurvey, ::TestPhase45CR1Regressions). That is how the P1 above can slip through without an explicit contract decision. Concrete fix: add lonely_psu="adjust" singleton-strata regressions for stute_test, stute_joint_pretest, and did_had_pretest_workflow, asserting the intended behavior explicitly (finite p-value if supported, or NotImplementedError if intentionally unsupported).

Path to Approval

Decide the contract for lonely_psu="adjust" singleton-strata Stute calibrations, and implement it explicitly: either support the nondegenerate pooled-singleton bootstrap path or reject it with a targeted NotImplementedError.
Update docs/methodology/REGISTRY.md and CHANGELOG.md to match that contract.
Add focused regressions for lonely_psu="adjust" singleton-strata survey designs on the single-horizon Stute, joint Stute, and workflow surfaces.

igerber · 2026-04-25T14:16:30Z

/ai-review

R5 P1 (Methodology) -- Stute survey df_survey<=0 guard was too broad. generate_survey_multiplier_weights_batch's pooled-singleton branch for lonely_psu='adjust' produces NONZERO multipliers, so the helper contract is non-degenerate even when df_survey=0 (one PSU per stratum). The previous guard rejected those designs as variance-unidentified when actually the issue is methodological: the analytical variance target requires a pseudo-stratum centering transform that has not been derived for the Stute CvM. Fix: mirror HAD sup-t bootstrap's explicit lonely_psu='adjust' rejection at had.py:2081-2118. Add new helper _has_lonely_psu_adjust_singletons in had_pretests.py. stute_test and stute_joint_pretest now reject lonely_psu='adjust' with singleton strata via NotImplementedError BEFORE the df_survey<=0 guard. The guard remains for genuinely degenerate designs (single-PSU unstratified, or one-PSU-per-stratum under remove/certainty where the helper returns all-zero multipliers). R5 P2 -- new regressions cover lonely_psu='adjust' rejection on both single-horizon and joint-Stute helpers; also a positive control showing lonely_psu='remove' singletons still flow through the existing df_survey-based NaN path. 3 new tests in TestPhase45CR1Regressions. REGISTRY note clarifies the carve-out: 'adjust' with singletons raises NotImplementedError; 'remove' / 'certainty' produce all-zero singleton multipliers and return NaN via the variance-unidentified guard. 180 pretest tests pass (was 177 after R4). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-04-25T14:24:59Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 646e207f8b5f7401953c1c71e90b7724c043f228

Overall Assessment

⚠️ Needs changes — one P1 edge-case bug remains in the new survey/event-study workflow, and one P1 coverage gap leaves key survey branches unpinned.

Executive Summary

The prior lonely_psu='adjust' singleton-strata blocker looks resolved: the Stute paths now reject it explicitly, and the carve-out is documented in docs/methodology/REGISTRY.md:L2444-L2444 and enforced in diff_diff/had_pretests.py:L1712-L1723, diff_diff/had_pretests.py:L2899-L2911.
I did not find an unmitigated paper/registry mismatch in the new survey methodology itself; the major departures from the unweighted paper path are documented in docs/methodology/REGISTRY.md:L2428-L2446.
P1: did_had_pretest_workflow(..., aggregate="event_study", survey=/weights=) resolves and validates weights on the full panel before the last-cohort filter, so units that would be dropped can still cause hard failures. diff_diff/had_pretests.py:L2978-L3079, diff_diff/had_pretests.py:L3676-L3751
P1: the new survey= parameter is not tested on the main successful event-study/post-treatment survey branches. Current survey smokes are mostly overall-path or trivial/no-PSU designs, so the new PSU-aware calibration surface is still under-covered. tests/test_had_pretests.py:L2936-L2955, tests/test_had_pretests.py:L3060-L3069, tests/test_had_pretests.py:L3169-L3175, tests/test_had_pretests.py:L3261-L3298, tests/test_had_pretests.py:L3526-L3595
Replicate-weight deferral is consistently documented and tracked, so I did not count it against approval. docs/methodology/REGISTRY.md:L2442-L2444, TODO.md:L98-L100
Static review only: this sandbox does not have pytest or numpy, so I could not execute the added tests.

Methodology

Severity: P3. Impact: The survey extension is a documented library-specific adaptation rather than a paper-literal survey recipe: QUG is skipped under survey, Stute uses PSU-level Mammen multipliers, Yatchew uses weighted variance components without PSU-cluster correction, replicate-weight designs are deferred, and lonely_psu='adjust' singleton strata are explicitly unsupported. Those choices are all recorded in docs/methodology/REGISTRY.md:L2428-L2446, so I do not count them as defects. Concrete fix: None required for this PR.

Code Quality

Severity: P1. Impact: The event-study workflow resolves weights/survey on the full input panel before _validate_multi_period_panel() applies the staggered last-cohort filter. Because _resolve_pretest_unit_weights() enforces strictly positive per-unit weights on whatever data it sees, zero-weight or otherwise invalid units that would be dropped by the filter can still abort a valid event-study run. This is inconsistent with the wrapper-level fix that already subsets row-level weights= after filtering. diff_diff/had_pretests.py:L2978-L3079, diff_diff/had_pretests.py:L3676-L3751 Concrete fix: For aggregate="event_study", do not call _resolve_pretest_unit_weights() on the full panel. Set use_survey_path = (survey is not None or weights is not None), run _validate_multi_period_panel() first, then resolve on data_filtered only, or let joint_pretrends_test() / joint_homogeneity_test() own resolution entirely.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity: P3. Impact: replicate-weight survey pretests remain deferred, but the limitation is explicitly tracked and every entry point rejects those designs consistently. docs/methodology/REGISTRY.md:L2442-L2444, TODO.md:L98-L100 Concrete fix: None required in this PR.

Security

No findings.

Documentation/Tests

Severity: P1. Impact: The new survey= parameter is not exercised across all of the new public code paths. The added tests cover overall-workflow survey support and trivial/no-PSU survey smokes (tests/test_had_pretests.py:L2936-L2955, tests/test_had_pretests.py:L3060-L3069, tests/test_had_pretests.py:L3169-L3175), and they cover event-study only on weights= plus survey pretrends (tests/test_had_pretests.py:L3261-L3298, tests/test_had_pretests.py:L3526-L3595). There is still no successful survey= coverage for joint_homogeneity_test(...) or did_had_pretest_workflow(..., aggregate="event_study"), and no positive test of a non-trivial PSU/strata survey design. Given that the core new feature is PSU-aware bootstrap calibration, this is the exact parameter-interaction gap that can hide regressions. Concrete fix: Add successful survey= regressions for joint_homogeneity_test(...) and did_had_pretest_workflow(..., aggregate="event_study"), using a design with at least 2 PSUs per stratum and more than 1 unit per PSU; include one staggered last-cohort case so filtering and survey resolution are exercised together.

Path to Approval

Move event-study weights/survey resolution in did_had_pretest_workflow() so it happens only after _validate_multi_period_panel() has produced data_filtered, or remove the eager full-panel resolution from that branch entirely.
Add a regression where the staggered last-cohort filter drops units with zero/invalid weights, and assert the survey-backed event-study workflow succeeds on the surviving valid subset.
Add positive non-trivial survey= tests for joint_homogeneity_test(...) and did_had_pretest_workflow(..., aggregate="event_study"); use actual psu/strata structure so the shared-PSU multiplier path is covered, not just the trivial weight-only path.

igerber · 2026-04-25T14:30:03Z

/ai-review

R6 P1 #1 (Code Quality) -- did_had_pretest_workflow eagerly resolved weights/survey on the FULL panel before _validate_multi_period_panel applied the staggered last-cohort filter. Because _resolve_pretest_unit_weights enforces strictly-positive per-unit weights / pweight type / etc. on whatever data it sees, zero or otherwise-invalid weights on the soon-to-be-dropped cohort would abort an otherwise-valid event-study run. Fix: defer resolution to per-aggregate branches. - Top-level: only the survey/weights mutex check + use_survey_path presence detection (no resolution). - Overall path: resolve weights/survey AFTER _validate_had_panel (no cohort filtering on this path; original data IS the panel). - Event-study path: do NOT resolve at the workflow level. The joint wrappers (joint_pretrends_test / joint_homogeneity_test) own resolution and already see data_filtered (post staggered filter). Row-level weights= passed through with the existing positional subsetting (R1 P1 fix preserved). R6 P1 #2 (Documentation/Tests) -- positive PSU/strata survey coverage gap. Existing tests covered overall-workflow + trivial/no-PSU smokes; the PSU-aware multiplier-bootstrap path (the core new methodology) was unpinned for joint_homogeneity_test and the event-study workflow. 3 new regression tests in TestPhase45CR1Regressions: - test_joint_homogeneity_test_psu_strata_survey_smoke (non-trivial SurveyDesign(weights=, strata=, psu=) on the linearity wrapper). - test_workflow_event_study_psu_strata_survey_smoke (full event-study dispatch under PSU/strata clustering: validate_multi_period_panel + resolve on data_filtered + pretrends_joint + homogeneity_joint). - test_workflow_event_study_zero_weights_on_dropped_cohort (R6 P1 #1 fix regression: panel where the dropped early cohort has zero weights succeeds on the surviving last cohort; pre-fix this crashed with "weights must be strictly positive"). 183 pretest tests pass (was 180 after R5). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-04-25T14:37:55Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 20f158ac5402459cf3f9ee0a6c654e3f5fed9de5

Overall Assessment

⛔ Blocker - one new P0 methodology bug in the survey-weighted Stute statistic, plus one P1 report-contract regression. The two P1s from the previous review appear fixed.

Executive Summary

The prior event-study filter/weight-resolution bug looks fixed: weight resolution is now deferred until after event-study filtering, and the direct joint wrappers subset row weights before re-aggregation. diff_diff/had_pretests.py:L3262-L3283, diff_diff/had_pretests.py:L3487-L3506, diff_diff/had_pretests.py:L3676-L3757
The prior positive-survey coverage gap also looks fixed: new tests now cover joint_homogeneity_test(..., survey=...) and did_had_pretest_workflow(..., aggregate="event_study", survey=...) on non-trivial PSU/strata designs. tests/test_had_pretests.py:L3936-L3987
P0: the new weighted Stute implementation weights the inner cusum but not the outer integration measure. That does not match the survey-weighted plug-in functional the registry/survey docs describe, so survey Stute statistics and p-values are silently wrong for non-uniform weights. diff_diff/had_pretests.py:L1049-L1085, docs/methodology/survey-theory.md:L221-L225, docs/methodology/survey-theory.md:L383-L405
P1: survey workflow pass cases can set all_pass=True while the human verdict still begins with inconclusive, reintroducing the report-contract inconsistency the unweighted path already had a regression against. diff_diff/had_pretests.py:L3815-L3849, diff_diff/had_pretests.py:L3934-L3958, tests/test_had_pretests.py:L709-L728
Replicate-weight deferral, QUG-under-survey deferral, and lonely_psu='adjust' singleton rejection are documented/tracked and should not block approval. docs/methodology/REGISTRY.md:L2428-L2446, TODO.md:L98-L100
Static review only: pytest is not installed in this sandbox, and the available Python environment also lacks numpy, so I could not execute the new tests.

Methodology

Severity: P0. Impact: Affected methods: stute_test, stute_joint_pretest, joint_pretrends_test, joint_homogeneity_test, and both survey/weights branches of did_had_pretest_workflow. The new _cvm_statistic_weighted() implements S_w = W^-2 Σ_g C_g^2 with C_g = Σ_{h: D_h <= D_g} w_h ε_h. But the survey methodology note frames the extension as a plug-in of the survey-weighted empirical distribution F_hat_w = Σ_i w_i δ_{x_i} / Σ_i w_i. Starting from the original Stute functional S = (1/G) Σ_g c_G(D_g)^2, the weighted plug-in requires integrating c_w(D_g)^2 against F_hat_w, i.e. the outer measure must also be weighted: S_w = W^-2 Σ_g w_g C_g^2. Omitting that outer w_g silently turns the statistic into a count-weighted functional of the sample rather than the documented survey-weighted functional. Every non-uniform-weight survey Stute result is therefore off with no warning. diff_diff/had_pretests.py:L974-L1017, diff_diff/had_pretests.py:L1049-L1085, diff_diff/had_pretests.py:L1682-L1774, diff_diff/had_pretests.py:L2802-L2956, docs/methodology/survey-theory.md:L221-L225, docs/methodology/survey-theory.md:L383-L405 Concrete fix: change _cvm_statistic_weighted() to apply the outer w_sorted factor as well, then propagate that correction through the single- and joint-Stute survey paths and document the exact weighted CvM formula in docs/methodology/REGISTRY.md.
Severity: P3. Impact: The QUG skip under survey, replicate-weight rejection, and lonely_psu='adjust' singleton rejection are explicitly documented and/or tracked, so I did not count them against approval. docs/methodology/REGISTRY.md:L2428-L2446, TODO.md:L98-L100 Concrete fix: None required for this PR.

Code Quality

Severity: P1. Impact: On both survey workflow branches, the code synthesizes a NaN QUG result, calls the existing verdict composer, and then string-replaces "QUG NaN". For pass cases that still leaves verdict starting with inconclusive even when all_pass=True. That violates the report’s semantic contract and recreates the exact inconsistency the existing unweighted regression guards against. diff_diff/had_pretests.py:L3815-L3849, diff_diff/had_pretests.py:L3934-L3958, tests/test_had_pretests.py:L709-L728 Concrete fix: stop post-processing the old verdict strings. Add explicit survey-aware success/inconclusive verdict builders for the overall and event-study paths, and add regressions asserting that survey pass cases never emit verdict.startswith("inconclusive").

Performance

No findings.

Maintainability

No findings beyond the verdict-string issue above.

Tech Debt

Severity: P3. Impact: replicate-weight survey pretests remain deferred, but the limitation is explicitly tracked in TODO.md and guarded consistently at each entry point. TODO.md:L98-L100, docs/methodology/REGISTRY.md:L2443-L2444 Concrete fix: None required for this PR.

Security

No findings.

Documentation/Tests

Severity: P2. Impact: The new tests mostly cover smoke paths, w=ones(G) reductions, and branch guards. They still do not pin the non-uniform weighted-CvM formula or the survey all_pass/verdict consistency, which is why both bugs above slip through. tests/test_had_pretests.py:L3036-L3215, tests/test_had_pretests.py:L3936-L4016 Concrete fix: add one oracle test for _cvm_statistic_weighted() or stute_test() with a non-uniform weight vector, and add workflow regressions asserting that all_pass=True never coexists with an inconclusive survey verdict on either aggregate path.

Path to Approval

Fix _cvm_statistic_weighted() so the Stute survey statistic integrates against the weighted empirical measure, not the unweighted observation count, and rerun/update the single- and joint-Stute survey paths accordingly.
Replace the current survey verdict string-rewrite with explicit survey-aware verdict composition, then add regressions for both aggregate="overall" and aggregate="event_study" survey pass cases.
Add a non-uniform-weight Stute regression that distinguishes Σ w_g C_g^2 from Σ C_g^2; the current w=ones(G) reductions cannot catch the methodology bug.

igerber · 2026-04-25T14:46:22Z

/ai-review

R7 P0 (Methodology) -- weighted CvM was missing the outer measure factor. Survey-weighted plug-in of Stute's CvM functional integrates the squared cusum process against the survey-weighted EDF F_hat_w = (1/W) sum_i w_i delta_{D_i}, which weights BOTH the inner cusum AND the outer integration measure: C_g = sum_{h <= g} w_h * eps_h S_w = (1/W^2) * sum_g w_g * (C_g)^2 Earlier revisions used (1/W^2) * sum_g C_g^2 (no outer w_g) which is a count-weighted-cusum / uniform-outer-measure functional and silently misreports survey-weighted Stute statistics for non-uniform weights. At w=ones(G) both forms reduce to (1/G^2) sum_g C_g^2; only non-uniform weights distinguish them, which is why the prior reduction tests didn't catch this. Fix: add the outer w_sorted factor to _cvm_statistic_weighted. New oracle test pins the formula on a hand-computed non-uniform-weight example (w=[1,2,3], eps=[1,-2,3] -> 127/36 outer-weighted, NOT 46/36 count-weighted-cusum). Reduction at w=1 still bit-exact. R7 P1 (Code Quality) -- survey verdict could leave pass cases starting with "inconclusive". The previous approach composed the unweighted verdict against a synthetic NaN QUG and string-replaced "QUG NaN" out; when no rejections fired and linearity was conclusive, the underlying "inconclusive - QUG NaN" template would collapse to "inconclusive" even for all_pass=True paths. Fix: explicit survey-aware verdict composers _compose_verdict_overall_survey and _compose_verdict_event_study_survey that drop QUG from consideration entirely and emit linearity-only priority text. Both append the "(linearity-conditional verdict; QUG-under-survey deferred per Phase 4.5 C0)" suffix in every branch (rejections, inconclusive, fail-to-reject). 4 new regression tests: - test_cvm_statistic_weighted_outer_measure_oracle (R7 P0 hand-computed) - test_cvm_statistic_weighted_reduces_at_uniform_weights (R7 P0 reduction) - test_workflow_overall_survey_pass_does_not_say_inconclusive (R7 P1) - test_workflow_event_study_survey_pass_does_not_say_inconclusive (R7 P1) 187 pretest tests pass (was 183 after R6). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-04-25T14:53:22Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 43230619ebed25a84af60c4300f2ac69ef8e9ab4

Overall Assessment

✅ Looks good — the prior P0 weighted-CvM issue and prior P1 survey-verdict regression both appear fixed. I did not find any unmitigated P0/P1 issues in the changed code. Static review only: I could not run pytest here because the sandbox Python lacks numpy.

Executive Summary

The previous Stute methodology blocker appears resolved: _cvm_statistic_weighted() now applies the outer survey-weight measure, and the new non-uniform oracle test would fail on the old count-weighted form. diff_diff/had_pretests.py:L1049-L1097, tests/test_had_pretests.py:L4020-L4053
The previous report-contract regression also appears resolved: survey verdicts now use dedicated composers instead of post-processing the unweighted string output, with regressions locking all_pass=True against "inconclusive" verdicts. diff_diff/had_pretests.py:L3554-L3655, diff_diff/had_pretests.py:L3928-L4041, tests/test_had_pretests.py:L4057-L4100
Weight/survey propagation looks consistent across direct helpers, data-in wrappers, and workflow dispatch, including staggered-panel subsetting and variance-unidentified survey guards. diff_diff/had_pretests.py:L2990-L3534, diff_diff/had_pretests.py:L3822-L4017
The remaining methodology limitations I saw are documented/tracked rather than silent defects: replicate-weight pretests are deferred, lonely_psu='adjust' singleton strata are rejected, and survey Yatchew remains PSU-agnostic by design. docs/methodology/REGISTRY.md:L2436-L2444, TODO.md:L98-L100
One P3 documentation mismatch remains: the new survey all_pass prose is broader than the implemented event-study behavior. docs/methodology/REGISTRY.md:L2442-L2442, diff_diff/had_pretests.py:L609-L620, diff_diff/had_pretests.py:L3759-L3776, diff_diff/had_pretests.py:L3933-L3942

Methodology

Severity: P3-informational. Impact: The remaining survey-path limitations are explicitly documented/tracked: survey Yatchew does not propagate PSU clustering, replicate-weight pretests remain unsupported, and lonely_psu='adjust' singleton strata are rejected. These are disclosed in the Phase 4.5 C registry note and TODO, so they should not block approval. docs/methodology/REGISTRY.md:L2436-L2444, TODO.md:L98-L100 Concrete fix: None in this PR.

Code Quality

Severity: None. Impact: No unmitigated code-quality findings in the modified paths. The previous survey verdict/report inconsistency is addressed by the new survey-specific composers. diff_diff/had_pretests.py:L3554-L3655 Concrete fix: None.

Performance

Severity: None. Impact: No performance regressions stood out in the diff. The joint Stute path still precomputes the projection operator once per call, and the unweighted fast path remains separate. diff_diff/had_pretests.py:L2859-L2874 Concrete fix: None.

Maintainability

Severity: None. Impact: No maintainability findings. _resolve_pretest_unit_weights() centralizes the new weight/survey validation, normalization, and aggregation logic across workflow and wrapper entry points. diff_diff/had_pretests.py:L2990-L3091 Concrete fix: None.

Tech Debt

Severity: P3-informational. Impact: The new replicate-weight pretest deferral is properly tracked in TODO.md, so it is not a blocker. TODO.md:L98-L100 Concrete fix: None in this PR.

Security

Severity: None. Impact: No security findings in the changed files. Concrete fix: None.

Documentation/Tests

Severity: P3. Impact: The new survey all_pass documentation is broader than the implemented event-study behavior. The registry/docstrings describe the survey workflow generically as passing when “at least one linearity test” is conclusive, but aggregate="event_study" still requires both pretrends_joint and homogeneity_joint to be conclusive and non-rejecting. The code is fine; the published contract is looser than the implementation. docs/methodology/REGISTRY.md:L2442-L2442, diff_diff/had_pretests.py:L609-L620, diff_diff/had_pretests.py:L3759-L3776, diff_diff/had_pretests.py:L3933-L3942 Concrete fix: Split the survey all_pass description by aggregate and mirror the actual event-study condition in the registry and docstrings.

R8 P3 -- survey all_pass docstring/REGISTRY described the contract too generically ("at least one linearity test conclusive"), which matches the overall path but is looser than the implemented event-study path (which requires BOTH pretrends_joint AND homogeneity_joint conclusive + non-rejecting). Fix: split the survey all_pass description by aggregate in both HADPretestReport docstring and REGISTRY note: - overall: "at least one of Stute/Yatchew conclusive + no rejection" (mirrors paper Section 4 step-3 'Stute OR Yatchew' wording). - event_study: "both joint variants conclusive + neither rejects" (same step-2 + step-3 closure as the unweighted aggregate, minus the QUG step). Code unchanged; only documentation. 187 pretest tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-25T15:16:08Z

/ai-review

github-actions · 2026-04-25T15:25:01Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 194668274d1ce6f3c2e9892815980cff50cd5bff

Overall Assessment

⚠️ Needs changes — the prior weighted-CvM blocker and the prior survey-verdict regression both look fixed, and the new survey methodology is now documented in docs/methodology/REGISTRY.md. I did find one unmitigated P1 in the new staggered weights= path: row-level weights are subset before the normal front-door length check, so malformed arrays can be silently truncated or fail with raw indexing errors instead of the package’s usual ValueError. Static review only; I could not run pytest here because the sandbox Python is missing numpy/scipy/pytest.

Executive Summary

Affected methods are stute_test, yatchew_hr_test, stute_joint_pretest, joint_pretrends_test, joint_homogeneity_test, and did_had_pretest_workflow.
The previous Stute methodology blocker appears fixed: _cvm_statistic_weighted() now weights both the inner cusum and the outer integration measure, matching the new Phase 4.5 C registry note. diff_diff/had_pretests.py:L1058-L1106, docs/methodology/REGISTRY.md:L2434-L2438
The previous survey-verdict/report inconsistency also appears fixed: the workflow now uses dedicated survey-path verdict composers instead of post-processing the unweighted string output. diff_diff/had_pretests.py:L3563-L3664, diff_diff/had_pretests.py:L3942-L4045
QUG skip under survey, Yatchew’s PSU-agnostic variance, and replicate-weight deferral are all explicitly documented/tracked, so I am not treating those as blockers. docs/methodology/REGISTRY.md:L2428-L2448, TODO.md:L97-L100
[New] P1: the staggered weights= subsetting fix bypasses the standard len(weights) == len(data) guard on event-study paths; malformed arrays can be silently accepted or raise raw indexing errors. diff_diff/had_pretests.py:L3292-L3304, diff_diff/had_pretests.py:L3514-L3526, diff_diff/had_pretests.py:L3874-L3884, compare diff_diff/had_pretests.py:L3033-L3045
One residual P3 remains in the docs: did_had_pretest_workflow’s survey all_pass note is still broader than the implemented aggregate="event_study" rule. diff_diff/had_pretests.py:L3768-L3785, diff_diff/had_pretests.py:L3942-L3956, docs/methodology/REGISTRY.md:L2442-L2444

Methodology

Severity: P3-informational. Impact: The major survey-path deviations from the original unweighted paper workflow are now explicitly documented: QUG is skipped under survey=/weights=, Yatchew does not propagate PSU clustering, replicate-weight designs are deferred, and lonely_psu='adjust' singleton strata are rejected. Under the review rubric, these are documented deviations/tracked limitations, not defects. docs/methodology/REGISTRY.md:L2428-L2448, TODO.md:L98-L100 Concrete fix: None in this PR.

Code Quality

Severity: P1. Impact: On staggered event-study paths, row-level weights are sliced with np.asarray(weights)[pos_idx] before any validation against the original data length. That means oversized arrays can be silently accepted with their tail ignored, and undersized arrays can surface raw NumPy indexing errors instead of the library’s standard front-door ValueError. This affects did_had_pretest_workflow, joint_pretrends_test, and joint_homogeneity_test. diff_diff/had_pretests.py:L3292-L3304, diff_diff/had_pretests.py:L3514-L3526, diff_diff/had_pretests.py:L3874-L3884, compare diff_diff/had_pretests.py:L3033-L3045 Concrete fix: validate weights as 1D and len(weights) == len(data) before any staggered-panel subsetting, then subset the already-validated array.

Performance

No unmitigated findings in the changed code.

Maintainability

No unmitigated findings in the changed code.

Tech Debt

Severity: P3-informational. Impact: Replicate-weight survey pretests are explicitly deferred and tracked in TODO.md, so that follow-up is properly recorded and should not block this PR. TODO.md:L98-L100 Concrete fix: None in this PR.

Security

No findings.

Documentation/Tests

Severity: P3. Impact: did_had_pretest_workflow’s Notes section still says the survey-path all_pass rule is “at least one linearity test is conclusive,” but the implemented aggregate="event_study" logic requires both pretrends_joint and homogeneity_joint to be conclusive and non-rejecting. The registry and HADPretestReport docstring were updated, but this workflow docstring is still looser than the code. diff_diff/had_pretests.py:L3768-L3785, diff_diff/had_pretests.py:L3942-L3956, docs/methodology/REGISTRY.md:L2442-L2444 Concrete fix: split the workflow docstring’s survey all_pass note by aggregate, mirroring the registry text and actual event-study logic.

Path to Approval

Add a front-door validator for staggered weights= paths so weights must be 1D and length-matched to the original data before pos_idx subsetting in did_had_pretest_workflow, joint_pretrends_test, and joint_homogeneity_test.
Add regressions for too-short and too-long row-level weights arrays on staggered aggregate="event_study" workflow and both direct joint wrappers, asserting clean ValueErrors rather than silent truncation or raw indexing failures.

igerber · 2026-04-25T15:44:26Z

/ai-review

R9 P1 (Code Quality) -- staggered weights= subsetting bypassed the standard length check. The R6 P1 fix added pos_idx-based subsetting to align row-level weights with data_filtered, but the subsetting ran without first validating that weights had length matching the ORIGINAL data: - Oversized arrays were silently truncated (pos_idx slice keeps only matched rows; tail of weights was discarded with no warning). - Undersized arrays surfaced raw NumPy IndexError instead of the package's standard front-door ValueError. Fix: validate weights as 1D AND len(weights) == len(data) BEFORE the pos_idx slice, on all three staggered-weights entry points (did_had_pretest_workflow, joint_pretrends_test, joint_homogeneity_test). Subset only after validation passes. R9 P3 -- did_had_pretest_workflow's Notes section still had the un-split survey all_pass description ("at least one linearity test conclusive"). The HADPretestReport docstring and REGISTRY were updated in R8; this third location was missed. Now matches the event-study contract that requires BOTH joint variants conclusive + non-rejecting. 4 new regression tests in TestPhase45CR1Regressions: - test_workflow_event_study_oversized_weights_raises - test_workflow_event_study_undersized_weights_raises - test_joint_pretrends_test_oversized_weights_raises - test_joint_homogeneity_test_undersized_weights_raises All assert the front-door ValueError contract instead of silent truncation / raw IndexError. 191 pretest tests pass (was 187 after R8). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-04-25T15:56:26Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 6878f98e0c5c0fd27bce8411149d346b6c7732c1

Overall Assessment

⚠️ Needs changes — the prior weighted-CvM and survey-verdict/report issues look fixed, and the documented survey-path deviations are properly recorded, but one unmitigated P1 remains in the new survey Stute calibration on stratified designs.

Executive Summary

Affected methods: stute_test, yatchew_hr_test, stute_joint_pretest, joint_pretrends_test, joint_homogeneity_test, did_had_pretest_workflow, plus HADPretestReport serialization.
The previous weighted outer-measure CvM blocker appears fixed by _cvm_statistic_weighted in diff_diff/had_pretests.py.
The previous survey-path verdict / qug=None inconsistency also appears fixed by the explicit qug-optional rendering and dedicated survey verdict composers in diff_diff/had_pretests.py and diff_diff/had_pretests.py.
QUG skip under survey, Yatchew’s PSU-agnostic variance, and replicate-weight deferral are documented/tracked, so I am not treating those as blockers in this review: docs/methodology/REGISTRY.md and TODO.md.
P1 remains: the new survey Stute paths use raw PSU multipliers directly on the residual perturbation, but the same generator only provides the multiplier draw/FPC scaling; elsewhere in the codebase the stratified survey target requires an additional PSU aggregation + stratum-demeaning + sqrt(n_h/(n_h-1)) correction after that call. See diff_diff/had_pretests.py, diff_diff/had_pretests.py, compare diff_diff/bootstrap_utils.py and diff_diff/had.py.
Static review only: I could not run pytest here because the review environment does not have it installed.

Methodology

Severity: P1. Impact: REGISTRY.md says Phase 4.5 C’s Stute family uses a calibrated survey-aware multiplier bootstrap and reuses the same kernel as the weighted event-study bootstrap, but the implementation in the new Stute paths only calls generate_survey_multiplier_weights_batch(...) and then applies those raw PSU draws directly to eps in the residual perturbation loop. In this codebase, that generator is not the full stratified-survey calibration by itself: the weighted event-study bootstrap has to aggregate to PSU, demean within strata, and apply the sqrt(n_h/(n_h-1)) small-sample correction after the same generator call to match the survey target. On stratified survey designs, the new stute_test / stute_joint_pretest p-values can therefore be miscalibrated silently. References: diff_diff/had_pretests.py, diff_diff/had_pretests.py, diff_diff/bootstrap_utils.py, diff_diff/had.py, docs/methodology/REGISTRY.md. Concrete fix: either derive and implement the analogous Stute-specific stratified correction, or narrow Phase 4.5 C support by rejecting stratified survey designs on the Stute family until that derivation exists and document that narrower support in REGISTRY.md.

Code Quality

No unmitigated findings.

Performance

No unmitigated findings.

Maintainability

No unmitigated findings.

Tech Debt

Severity: P3-informational. Impact: replicate-weight survey pretests are explicitly deferred and tracked, so that limitation is properly recorded and does not block this PR. Reference: TODO.md. Concrete fix: none in this PR.

Security

No findings.

Documentation/Tests

Severity: P3. Impact: the new nontrivial survey-design coverage in tests/test_had_pretests.py is smoke-only (np.isfinite(...), 0 <= p <= 1) and would not catch the stratified-bootstrap calibration issue above. The closest existing oracle in the repo is the weighted event-study sup-t calibration lock in tests/test_had.py. References: tests/test_had_pretests.py, tests/test_had_pretests.py. Concrete fix: add a stratified SurveyDesign(weights=..., strata=..., psu=...) regression for both stute_test and stute_joint_pretest that locks calibration/behavior against a known target, not just finiteness.

Path to Approval

Make the survey Stute bootstrap match a derived/calibrated stratified-design target: either implement the missing PSU aggregation + within-stratum centering + sqrt(n_h/(n_h-1)) correction (or another fully derived equivalent) in stute_test and stute_joint_pretest, or reject stratified survey designs on those paths and document the narrower support in docs/methodology/REGISTRY.md.
Add regression tests that fail on the current implementation for a nontrivial stratified survey design, covering both single-horizon and joint Stute paths.

igerber · 2026-04-25T16:25:22Z

/ai-review

R10 P1 (Methodology) -- Stute survey bootstrap was silently miscalibrated on stratified designs. The HAD sup-t bootstrap (had.py:2120+) applies a within-stratum demean + sqrt(n_h/(n_h-1)) small-sample correction AFTER generate_survey_multiplier_weights_batch returns, to make the bootstrap variance match the Binder-TSL stratified target. The same correction has NOT been derived for the Stute CvM functional, so applying the helper's raw multipliers directly to residual perturbations on stratified designs left the bootstrap p-value silently miscalibrated. Per the reviewer's offered "narrow support" path: Phase 4.5 C now explicitly rejects stratified designs on the Stute family with NotImplementedError. Pweight-only, PSU-only, and FPC-only designs remain supported (the helper's output is appropriately scaled for those without further correction). Stratified is a follow-up after the matching Stute-CvM stratified-correction derivation lands. Mirrors the lonely_psu='adjust' rejection pattern (R5 P1) — both are methodology-gap-driven explicit NotImplementedErrors with documented follow-up. The strata guard supersedes the lonely_psu='adjust' singleton-strata guard for any stratified design (the latter is now defense-in-depth for the unstratified residual case). R10 P3 -- Added "stratified-design rejection" entry to REGISTRY's Note (Phase 4.5 C). Also updated CHANGELOG to narrow the documented survey contract. Tests updated: - test_stute_test_lonely_psu_adjust_singletons_raises -> test_stute_test_stratified_design_raises (the strata guard fires first; the test is still meaningful but on a strata key match). - Same renaming for stute_joint_pretest variant. - test_stute_test_lonely_psu_remove_singletons_returns_nan REMOVED (singleton strata under lonely_psu='remove' now hits the strata guard instead of the df_survey<=0 guard). - test_joint_homogeneity_test_psu_strata_survey_smoke -> test_joint_homogeneity_test_psu_only_survey_smoke (positive coverage on PSU-only design) + new test_joint_homogeneity_test_stratified_raises. - test_workflow_event_study_psu_strata_survey_smoke -> test_workflow_event_study_psu_only_survey_smoke. - test_workflow_event_study_survey_pass_does_not_say_inconclusive switched from strata to PSU-only. 191 pretest tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-04-25T16:31:25Z

🔁 AI review rerun (requested by @igerber)

Head SHA: ad44e39b634a4b8b63804beec38fb1f488b85266

Overall Assessment

⚠️ Needs changes — the prior weighted-CvM and qug=None/survey-verdict issues look fixed, but the earlier P1 on stratified survey Stute calibration is still unmitigated, and there is one additional P1 on the new staggered weights= routing.

Executive Summary

Affected methods: stute_test, yatchew_hr_test, stute_joint_pretest, joint_pretrends_test, joint_homogeneity_test, did_had_pretest_workflow, plus HADPretestReport serialization/rendering.
The previous weighted outer-measure CvM issue appears fixed by _cvm_statistic_weighted in diff_diff/had_pretests.py:L1058-L1106.
The previous survey-path qug=None rendering / verdict inconsistency also appears fixed by the HADPretestReport changes and explicit survey verdict composers in diff_diff/had_pretests.py:L676-L809 and diff_diff/had_pretests.py:L3563-L3664.
Replicate-weight deferral and Yatchew’s PSU-agnostic survey path are documented in docs/methodology/REGISTRY.md:L2436-L2446 and tracked in TODO.md:L98, so I am not treating those as blockers.
P1 remains: the new Stute survey paths still advertise and run on stratified designs, but they apply raw PSU multipliers directly after generate_survey_multiplier_weights_batch(...) instead of the post-draw PSU aggregation + within-stratum demeaning + sqrt(n_h/(n_h-1)) correction that the existing HAD sup-t stratified calibration uses. On stratified survey designs, the Stute p-values can therefore be miscalibrated silently.
[Newly identified] P1: the new staggered weights= routing slices weights by pos_idx before validating shape/length. Oversized arrays are silently truncated; undersized arrays can fail with raw indexing errors instead of the package’s front-door ValueError.
Static review only: I could not run the test suite here because the review environment is missing project dependencies.

Methodology

Severity: P1. Impact: stute_test and stute_joint_pretest still treat stratified survey designs as supported, and the registry/changelog say Phase 4.5 C supports PSU/strata/FPC survey designs. But the implementation in diff_diff/had_pretests.py:L1724-L1793 and diff_diff/had_pretests.py:L2916-L2978 only draws PSU multipliers from generate_survey_multiplier_weights_batch(...) and applies them directly to the residual perturbation. In this codebase, that helper is not the full stratified calibration by itself: the existing weighted event-study sup-t path has to aggregate to PSU level, demean within strata, and apply the sqrt(n_h/(n_h-1)) correction after the same helper call to match the stratified Binder target, see diff_diff/had.py:L2120-L2175. The registry still documents stratified survey support rather than a deviation or limitation, docs/methodology/REGISTRY.md:L2434-L2446, and the new tests actively smoke-test stratified designs as supported, tests/test_had_pretests.py:L3936-L3987. Concrete fix: either implement the missing Stute-specific stratified correction after the multiplier draw, or reject survey.strata is not None everywhere the Stute family is reachable and document the narrower support surface in REGISTRY.md/CHANGELOG.md.
No other unmitigated methodology findings. The replicate-weight deferral is tracked in TODO.md:L98, and Yatchew’s decision not to propagate PSU clustering is explicitly documented in docs/methodology/REGISTRY.md:L2436-L2441.

Code Quality

Severity: P1 [Newly identified]. Impact: on the new staggered aggregate="event_study" weights= path, the code slices weights with np.asarray(weights)[pos_idx] before validating that weights is 1-D and has length len(data), in diff_diff/had_pretests.py:L3292-L3303, diff_diff/had_pretests.py:L3514-L3525, and diff_diff/had_pretests.py:L3874-L3883. Because _resolve_pretest_unit_weights(...) only validates after this slicing (diff_diff/had_pretests.py:L3033-L3062), oversized arrays are silently truncated to the filtered panel and undersized arrays can surface raw indexing failures instead of a front-door ValueError. That is a real correctness/edge-case problem on the new parameter path. Concrete fix: validate weights.ndim == 1 and len(weights) == len(data) before any pos_idx slicing in all three sites, then subset.

Performance

No unmitigated findings.

Maintainability

No unmitigated findings.

Tech Debt

Severity: P3-informational. Impact: replicate-weight survey pretests are explicitly deferred and tracked in TODO.md:L98, so that limitation is properly recorded and does not block this PR. Concrete fix: none in this PR.

Security

No findings.

Documentation/Tests

Severity: P3. Impact: the new stratified survey tests only assert finiteness / 0 <= p <= 1, so they would not catch the calibration problem above and currently enshrine the unsupported behavior as a passing path, tests/test_had_pretests.py:L3936-L3987. The docs also overstate the supported surface by claiming Phase 4.5 C supports PSU/strata/FPC Stute survey designs, docs/methodology/REGISTRY.md:L2434-L2446 and CHANGELOG.md:L8-L11. Concrete fix: if stratified Stute remains unsupported, replace these tests with NotImplementedError regressions and narrow the docs accordingly; if stratified Stute is implemented, add a calibration/parity oracle that would fail on the current raw-multiplier implementation.

Path to Approval

Make the Stute survey bootstrap match a derived stratified target, or explicitly reject stratified survey designs on stute_test, stute_joint_pretest, joint_pretrends_test, joint_homogeneity_test, and did_had_pretest_workflow, then update docs/methodology/REGISTRY.md and CHANGELOG.md to match.
Add front-door validation for staggered weights= inputs before any pos_idx slicing in joint_pretrends_test, joint_homogeneity_test, and did_had_pretest_workflow, so wrong-length arrays raise clean ValueErrors instead of being silently truncated or failing deep in NumPy.
Replace the current stratified smoke tests with either rejection tests for the narrowed support surface or a real stratified calibration/parity regression, and add oversized/undersized staggered-weights= regression tests.

Closes the Phase 4.5 C0 promise (PR #367 commit 29f8b12). Linearity- family pretests now accept survey=/weights= keyword-only kwargs: - stute_test, yatchew_hr_test, stute_joint_pretest, joint_pretrends_test, joint_homogeneity_test, did_had_pretest_workflow. Stute family: PSU-level Mammen multiplier bootstrap via generate_survey_multiplier_weights_batch. Each replicate draws (B, n_psu) Mammen multipliers, broadcast to per-obs perturbation eta_obs[g] = eta_psu[psu(g)], weighted OLS refit, weighted CvM via new _cvm_statistic_weighted helper. Joint Stute SHARES the multiplier matrix across horizons within each replicate, preserving both vector-valued empirical-process unit-level dependence (Delgado 1993; Escanciano 2006) AND PSU clustering (Krieger-Pfeffermann 1997). NOT Rao-Wu rescaling -- multiplier bootstrap is a different mechanism. Yatchew: closed-form weighted OLS + pweight-sandwich variance components (no bootstrap): sigma2_lin = sum(w * eps^2) / sum(w) sigma2_diff = sum(w_avg * diff^2) / (2 * sum(w)) [Reviewer CRITICAL #2] sigma4_W = sum(w_avg * eps_g^2 * eps_{g-1}^2) / sum(w_avg) T_hr = sqrt(sum(w)) * (sigma2_lin - sigma2_diff) / sigma2_W where w_avg_g = (w_g + w_{g-1}) / 2 (Krieger-Pfeffermann 1997 Section 3). All three components reduce bit-exactly to existing unweighted formulas at w=ones(G); locked at atol=1e-14 by direct helper test. Workflow under survey/weights: skips the QUG step with UserWarning (per C0 deferral), sets qug=None on the report, dispatches the linearity family with survey-aware mechanism, appends "linearity-conditional verdict; QUG-under-survey deferred per Phase 4.5 C0" suffix to the verdict. all_pass drops the QUG-conclusiveness gate (one less precondition). HADPretestReport.qug retyped from QUGTestResults to Optional[QUGTestResults]; summary/to_dict/to_dataframe updated to None-tolerant rendering. Pweight shortcut routing: weights= passes through a synthetic trivial ResolvedSurveyDesign (new survey._make_trivial_resolved helper) so the same kernel handles both entry paths -- mirrors PR #363's R7 fix pattern on HAD sup-t. Replicate-weight survey designs (BRR/Fay/JK1/JKn/SDR) raise NotImplementedError at every entry point (defense in depth, reciprocal- guard discipline). The per-replicate weight-ratio rescaling for the OLS-on-residuals refit step is not covered by the multiplier-bootstrap composition; deferred to a parallel follow-up. Per-row weights= / survey=col aggregated to per-unit via existing HAD helpers (_aggregate_unit_weights, _aggregate_unit_resolved_survey; constant-within-unit invariant enforced) through new _resolve_pretest_unit_weights helper. Strictly-positive weights required on Yatchew (the adjacent-difference variance is undefined under contiguous-zero blocks). Stability invariants preserved: - Unweighted code paths bit-exact pre-PR (the new survey/weights branch is a separate if arm; existing 138 pretest tests pass unchanged). - Yatchew weighted variance components reduce to unweighted at w=1 at atol=1e-14 (locked by TestYatchewHRTestSurvey). - HADPretestReport schema bit-exact on the unweighted path; qug=None triggers the new None-tolerant rendering only on the survey path. 20 new tests across TestHADPretestWorkflowSurveyGuards (revised from C0 rejection-only to C functional + 2 mutex/replicate-weight retained), TestStuteTestSurvey (7), TestYatchewHRTestSurvey (7), TestJointStuteSurvey (5). Full pretest suite: 158 tests pass. Patch-level addition (additive on stable surfaces). See docs/methodology/REGISTRY.md "QUG Null Test" -- Note (Phase 4.5 C) for the full methodology. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>