diff --git a/CHANGELOG.md b/CHANGELOG.md
index 16a08a2a..bc7fab6b 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -8,6 +8,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
 ### Added
+- **HAD linearity-family pretests under survey (Phase 4.5 C).** `stute_test`, `yatchew_hr_test`, `stute_joint_pretest`, `joint_pretrends_test`, `joint_homogeneity_test`, and `did_had_pretest_workflow` now accept `weights=` / `survey=` keyword-only arguments. Stute family uses **PSU-level Mammen multiplier bootstrap** via `bootstrap_utils.generate_survey_multiplier_weights_batch` (the same kernel as PR #363's HAD event-study sup-t bootstrap): each replicate draws an `(n_bootstrap, n_psu)` Mammen multiplier matrix, broadcast to per-obs perturbation `eta_obs[g] = eta_psu[psu(g)]`, weighted OLS refit, weighted CvM via new `_cvm_statistic_weighted` helper. Joint Stute SHARES the multiplier matrix across horizons within each replicate, preserving both the vector-valued empirical-process unit-level dependence AND PSU clustering. Yatchew uses **closed-form weighted OLS + pweight-sandwich variance components** (no bootstrap): `sigma2_lin = sum(w·eps²)/sum(w)`, `sigma2_diff = sum(w_avg·diff²)/(2·sum(w))` with arithmetic-mean pair weights `w_avg_g = (w_g+w_{g-1})/2`, `sigma4_W = sum(w_avg·prod)/sum(w_avg)`, `T_hr = sqrt(sum(w))·(sigma2_lin-sigma2_diff)/sigma2_W`. All three Yatchew components reduce bit-exactly to the unweighted formulas at `w=ones(G)` (locked at `atol=1e-14` by direct helper test). The pweight `weights=` shortcut routes through a synthetic trivial `ResolvedSurveyDesign` (new `survey._make_trivial_resolved` helper) so the same kernel handles both entry paths.
`did_had_pretest_workflow(..., survey=, weights=)` removes the Phase 4.5 C0 `NotImplementedError`, dispatches to the survey-aware sub-tests, **skips the QUG step with `UserWarning`** (per C0 deferral), sets `qug=None` on the report, and appends a `"linearity-conditional verdict; QUG-under-survey deferred per Phase 4.5 C0"` suffix to the verdict. `HADPretestReport.qug` retyped from `QUGTestResults` to `Optional[QUGTestResults]`; `summary()` / `to_dict()` / `to_dataframe()` updated to None-tolerant rendering. Replicate-weight survey designs (BRR/Fay/JK1/JKn/SDR) raise `NotImplementedError` at every entry point (defense in depth, reciprocal-guard discipline) — parallel follow-up after this PR. **Stratified designs (`SurveyDesign(strata=...)`) also raise `NotImplementedError` on the Stute family** — the within-stratum demean + `sqrt(n_h/(n_h-1))` correction that the HAD sup-t bootstrap applies to match the Binder-TSL stratified target has not been derived for the Stute CvM functional, so applying raw multipliers from `generate_survey_multiplier_weights_batch` directly to residual perturbations would leave the bootstrap p-value silently miscalibrated. Phase 4.5 C narrows survey support to **pweight-only**, **PSU-only** (`SurveyDesign(weights=, psu=)`), and **FPC-only** (`SurveyDesign(weights=, fpc=)`) designs; stratified is a follow-up after the matching Stute-CvM stratified-correction derivation lands. Strictly positive weights required on Yatchew (the adjacent-difference variance is undefined under contiguous-zero blocks). Per-row `weights=` / `survey=col` aggregated to per-unit via existing HAD helpers `_aggregate_unit_weights` / `_aggregate_unit_resolved_survey` (constant-within-unit invariant enforced). Unweighted code paths preserved bit-exactly. Patch-level addition (additive on stable surfaces). See `docs/methodology/REGISTRY.md` § "QUG Null Test" — Note (Phase 4.5 C) for the full methodology. 
- **`ChaisemartinDHaultfoeuille.by_path` + `placebo=True`** — per-path backward-horizon placebos `DID^{pl}_{path, l}` for `l = 1..L_max`. The same per-path SE convention used for the event-study (joiners/leavers IF precedent: switcher-side contributions zeroed for non-path groups; cohort structure and control pool unchanged; plug-in SE with path-specific divisor `N^{pl}_{l, path}`) is applied to backward horizons via the new `switcher_subset_mask` parameter on `_compute_per_group_if_placebo_horizon`. Surfaced on `results.path_placebo_event_study[path][-l]` (negative-int inner keys mirroring `placebo_event_study`); `summary()` renders the rows alongside per-path event-study horizons; `to_dataframe(level="by_path")` emits negative-horizon rows alongside the existing positive-horizon rows. **Bootstrap** (when `n_bootstrap > 0`) propagates per-`(path, lag)` percentile CI / p-value through the same `_bootstrap_one_target` dispatch as the per-path event-study, with the canonical NaN-on-invalid contract enforced on the new surface (PR #364 library-wide invariant). **SE inherits the cross-path cohort-sharing deviation from R** documented for `path_effects` (full-panel cohort-centered plug-in vs R's per-path re-run): tracks R within tolerance on single-path-cohort panels, diverges materially on cohort-mixed panels — the bootstrap SE is a Monte Carlo analog of the analytical SE and inherits the same deviation. R-parity confirmed at `tests/test_chaisemartin_dhaultfoeuille_parity.py::TestDCDHDynRParityByPathPlacebo` on the new `multi_path_reversible_by_path_placebo` scenario (point estimates exact match; SE within Phase-2 envelope rtol ≤ 5%); positive analytical + bootstrap invariants at `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathPlacebo` (and the gated `::TestBootstrap` subclass). See `docs/methodology/REGISTRY.md` §ChaisemartinDHaultfoeuille `Note (Phase 3 by_path ...)` → "Per-path placebos" for the full contract. 
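The two Phase 4.5 C kernels described in the first Added entry (PSU-level Mammen multiplier broadcast, closed-form weighted Yatchew components) can be sketched as follows. This is an illustrative, hedged reimplementation: the names `mammen_psu_multipliers`, `weighted_yatchew_components`, and the `pair_prod` input feeding `sigma4_W` are assumptions for the sketch, not the library's actual API (`generate_survey_multiplier_weights_batch` and the Yatchew helpers live in `diff_diff`).

```python
import numpy as np

# Hedged sketch only -- names and signatures here are illustrative
# assumptions, not the diff_diff API.

def mammen_psu_multipliers(psu_ids, n_bootstrap, rng):
    """One Mammen two-point draw per PSU per replicate, broadcast so that
    eta_obs[g] = eta_psu[psu(g)]: all observations in a PSU share one
    multiplier, preserving within-PSU dependence in each replicate."""
    _, inverse = np.unique(psu_ids, return_inverse=True)
    n_psu = int(inverse.max()) + 1
    sqrt5 = np.sqrt(5.0)
    v_lo, v_hi = -(sqrt5 - 1.0) / 2.0, (sqrt5 + 1.0) / 2.0  # ~-0.618, ~1.618
    p_lo = (sqrt5 + 1.0) / (2.0 * sqrt5)  # P(eta = v_lo); gives mean 0, var 1
    eta_psu = np.where(rng.random((n_bootstrap, n_psu)) < p_lo, v_lo, v_hi)
    return eta_psu[:, inverse]  # shape (n_bootstrap, n_obs)

def weighted_yatchew_components(eps, diff, pair_prod, w):
    """Closed-form weighted variance components quoted in the entry:
    sigma2_lin, sigma2_diff (arithmetic-mean pair weights), sigma4_W."""
    sw = float(np.sum(w))
    w_avg = 0.5 * (w[1:] + w[:-1])  # w_avg_g = (w_g + w_{g-1}) / 2
    sigma2_lin = float(np.sum(w * eps**2) / sw)
    sigma2_diff = float(np.sum(w_avg * diff**2) / (2.0 * sw))
    sigma4_w = float(np.sum(w_avg * pair_prod) / np.sum(w_avg))
    return sigma2_lin, sigma2_diff, sigma4_w
```

At `w = ones(G)` each Yatchew component collapses to its unweighted counterpart (the entry's bit-exact reduction claim); the Stute bootstrap then perturbs residuals by the broadcast multipliers, refits weighted OLS, and recomputes the weighted CvM per replicate.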
 ## [3.3.0] - 2026-04-25
diff --git a/TODO.md b/TODO.md
index bd97aedb..8052ad0f 100644
--- a/TODO.md
+++ b/TODO.md
@@ -95,7 +95,7 @@ Deferred items from PR reviews that were not addressed before merge.
 | `HeterogeneousAdoptionDiD` joint cross-horizon analytical covariance on the weighted event-study path: Phase 4.5 B ships multiplier-bootstrap sup-t simultaneous CIs on the weighted event-study path but pointwise analytical variance is still independent across horizons. A follow-up could derive the full H × H analytical covariance from the per-horizon IF matrix (`Psi.T @ Psi` under survey weighting) for an analytical alternative to the bootstrap. Would also let the unweighted event-study path ship a sup-t band. | `diff_diff/had.py::_fit_event_study` | follow-up | Low |
 | `HeterogeneousAdoptionDiD` unweighted event-study sup-t band: Phase 4.5 B ships sup-t only on the WEIGHTED event-study path (to preserve pre-PR bit-exact output on unweighted). Extending sup-t to unweighted event-study (either via the multiplier bootstrap with unit-level iid multipliers or via analytical joint cross-horizon covariance) is a symmetric follow-up. | `diff_diff/had.py::_fit_event_study` | follow-up | Low |
 | `HeterogeneousAdoptionDiD` survey-aware support-endpoint test (research, not engineering): if the academic literature ever publishes a calibrated support-infimum test under complex sampling — combining endpoint-estimation EVT (Hall 1982, Aarssen-de Haan 1994, Hall-Wang 1999) with survey-aware functional CLTs for the empirical process (Boistard-Lopuhaä-Ruiz-Gazen 2017, Bertail-Chautru-Clémençon 2017) and tail-empirical-process theory (Drees 2003) — Phase 4.5 C0's permanent NotImplementedError on `qug_test(..., survey=...)` / `weights=` can be revisited and the bridge implemented against the published recipe. See `docs/methodology/REGISTRY.md` § "QUG Null Test" — Note (Phase 4.5 C0) for the decision rationale and the research-direction sketch. | `diff_diff/had_pretests.py::qug_test` | Phase 4.5 C0 (2026-04, decision shipped) | Low |
-| `HeterogeneousAdoptionDiD` Phase 4.5 C: pretests under survey (`stute_test`, `yatchew_hr_test`, `stute_joint_pretest`, `joint_pretrends_test`, `joint_homogeneity_test`, `did_had_pretest_workflow`). Rao-Wu rescaled bootstrap for the Stute-family (weighted η generation + PSU clustering in the bootstrap draw); weighted OLS residuals + weighted variance estimator for Yatchew. | `diff_diff/had_pretests.py` | Phase 4.5 C | Medium |
+| `HeterogeneousAdoptionDiD` survey-aware pretests Phase 4.5 C follow-ups: (a) **replicate-weight designs** (BRR/Fay/JK1/JKn/SDR) — the per-replicate weight-ratio rescaling for the OLS-on-residuals refit step is not covered by the multiplier-bootstrap composition; each linearity-family helper raises `NotImplementedError` on `survey.replicate_weights is not None`. (b) **stratified designs** (`SurveyDesign(strata=...)`) on the Stute family — the within-stratum demean + sqrt(n_h/(n_h-1)) correction analogous to HAD sup-t (had.py:2120+) has not been derived for the Stute CvM functional. Phase 4.5 C ships **pweight + PSU + FPC** support (no strata) via PSU-level Mammen multiplier bootstrap (Stute family) + closed-form weighted variance components (Yatchew). Stratified Stute = derivation work; replicate-weight pretests = bootstrap-composition work. | `diff_diff/had_pretests.py` | Phase 4.5 C follow-up | Low |
 | `HeterogeneousAdoptionDiD` Phase 4.5: weight-aware auto-bandwidth MSE-DPI selector. Phase 4.5 A ships weighted `lprobust` with an unweighted DPI selector; users who want a weight-aware bandwidth must pass `h`/`b` explicitly. Extending `lpbwselect_mse_dpi` to propagate weights through density, second-derivative, and variance stages is ~300 LoC of methodology and was out of scope. | `diff_diff/_nprobust_port.py::lpbwselect_mse_dpi` | Phase 4.5 | Low |
 | `HeterogeneousAdoptionDiD` Phase 4.5 C: replicate-weight SurveyDesigns (BRR / Fay / JK1 / JKn / SDR) on the continuous-dose paths. Phase 4.5 A raises `NotImplementedError` on replicate designs in `_aggregate_unit_resolved_survey`. Rao-Wu-style replicate bootstrap for HAD paths requires deriving the per-replicate weight-ratio rescaling for the local-linear intercept IF. | `diff_diff/had.py::_aggregate_unit_resolved_survey` | Phase 4.5 C | Low |
 | `HeterogeneousAdoptionDiD` mass-point: `vcov_type in {"hc2", "hc2_bm"}` raises `NotImplementedError` pending a 2SLS-specific leverage derivation. The OLS leverage `x_i' (X'X)^{-1} x_i` is wrong for 2SLS; the correct finite-sample correction uses `x_i' (Z'X)^{-1} (...) (X'Z)^{-1} x_i`. Needs derivation plus an R / Stata (`ivreg2 small robust`) parity anchor. | `diff_diff/had.py::_fit_mass_point_2sls` | Phase 2a | Medium |
diff --git a/diff_diff/had_pretests.py b/diff_diff/had_pretests.py
index 3b8cb451..0f72e3cf 100644
--- a/diff_diff/had_pretests.py
+++ b/diff_diff/had_pretests.py
@@ -66,12 +66,16 @@ import pandas as pd
 from scipy import stats

+from diff_diff.bootstrap_utils import generate_survey_multiplier_weights_batch
 from diff_diff.had import (
     _aggregate_first_difference,
+    _aggregate_unit_resolved_survey,
+    _aggregate_unit_weights,
     _json_safe_scalar,
     _validate_had_panel,
     _validate_had_panel_event_study,
 )
+from diff_diff.survey import _make_trivial_resolved
 from diff_diff.utils import _generate_mammen_weights

 __all__ = [
@@ -584,8 +588,11 @@ class HADPretestReport:
     Attributes
     ----------
-    qug : QUGTestResults
-        Always populated.
+    qug : QUGTestResults or None
+        Populated by default; ``None`` only when the workflow runs under
+        ``survey=`` / ``weights=`` (Phase 4.5 C path), where the QUG step
+        is permanently skipped per Phase 4.5 C0 (extreme-value theory under
+        complex sampling not a settled toolkit; see :func:`qug_test`).
     stute : StuteTestResults or None
         Populated when ``aggregate == "overall"``; ``None`` when
         ``aggregate == "event_study"``.
@@ -600,15 +607,28 @@
         Populated when ``aggregate == "event_study"``; ``None`` on the
         overall path.
     all_pass : bool
-        On the overall path: same Phase 3 semantics - True iff QUG is
-        conclusive AND at least one of Stute/Yatchew is conclusive AND
-        no conclusive test rejects. On the event-study path: True iff
-        ``np.isfinite(qug.p_value)``,
+        On the **unweighted overall path**: same Phase 3 semantics - True
+        iff QUG is conclusive AND at least one of Stute/Yatchew is
+        conclusive AND no conclusive test rejects. On the **unweighted
+        event-study path**: True iff ``np.isfinite(qug.p_value)``,
         ``pretrends_joint is not None and np.isfinite(pretrends_joint.p_value)``,
         ``np.isfinite(homogeneity_joint.p_value)``, AND none of the
-        three rejects. Mirrors Phase 3's ``bool(np.isfinite(p_value))``
-        convention - no ``.conclusive()`` helper on any result dataclass.
+        three rejects. On the **survey/weights path** (Phase 4.5 C) the
+        QUG-conclusiveness gate is dropped (``qug=None`` per C0
+        deferral); the linearity-conditional rule splits by aggregate:
+
+        - ``aggregate="overall"`` survey: True iff at least one of
+          Stute/Yatchew is conclusive AND no conclusive test rejects.
+        - ``aggregate="event_study"`` survey: True iff
+          ``pretrends_joint`` is non-None and conclusive,
+          ``homogeneity_joint`` is conclusive, AND neither rejects.
+          (Both joint variants must be conclusive on the event-study
+          path - same step-2 + step-3 closure as the unweighted
+          aggregate, just without the QUG step.)
+
+        Mirrors Phase 3's ``bool(np.isfinite(p_value))`` convention - no
+        ``.conclusive()`` helper on any result dataclass.
     verdict : str
         Human-readable classification. Paper rule applies symmetrically:
         TWFE is admissible only if NONE of the implemented tests
@@ -626,7 +646,7 @@
         to render.
     """

-    qug: QUGTestResults
+    qug: Optional[QUGTestResults]
     stute: Optional[StuteTestResults]
     yatchew: Optional[YatchewTestResults]
     all_pass: bool
@@ -659,13 +679,20 @@ def summary(self) -> str:
         # Preserve Phase 3 summary bit-exactly on the overall path. The
         # `aggregate: ...` header line is only rendered on the event-
         # study path; two-period reports produce the Phase 3 layout.
+        # QUG block: rendered when self.qug is populated, else a skip note
+        # (Phase 4.5 C survey/weights path leaves qug=None; see C0 deferral).
+        qug_block = (
+            self.qug.summary()
+            if self.qug is not None
+            else "(QUG step skipped - permanently deferred under survey/weights per Phase 4.5 C0)"
+        )
         if self.aggregate == "event_study":
             header = [
                 "=" * width,
                 "HAD pre-test workflow".center(width),
                 f"aggregate: {self.aggregate}".center(width),
                 "=" * width,
-                self.qug.summary(),
+                qug_block,
                 "",
             ]
             if self.pretrends_joint is not None:
@@ -678,12 +705,13 @@
             if self.homogeneity_joint is not None:
                 body += [self.homogeneity_joint.summary(), ""]
         else:
-            # aggregate == "overall" - Phase 3 layout preserved.
+            # aggregate == "overall" - Phase 3 layout preserved when qug is
+            # not None (unweighted path); QUG-skip block on the survey path.
             header = [
                 "=" * width,
                 "HAD pre-test workflow".center(width),
                 "=" * width,
-                self.qug.summary(),
+                qug_block,
                 "",
             ]
             body = []
@@ -713,10 +741,14 @@ def to_dict(self) -> Dict[str, Any]:
         ``pretrends_joint``, ``homogeneity_joint`` and omits the
         ``None``-valued ``stute`` / ``yatchew`` keys entirely.
         """
+        # qug serializes as None on the survey/weights path (Phase 4.5 C
+        # QUG-skip per C0 deferral); rendered as the existing dict on the
+        # default unweighted path.
+        qug_dict = None if self.qug is None else self.qug.to_dict()
         if self.aggregate == "event_study":
             return {
                 "aggregate": str(self.aggregate),
-                "qug": self.qug.to_dict(),
+                "qug": qug_dict,
                 "pretrends_joint": (
                     None if self.pretrends_joint is None else self.pretrends_joint.to_dict()
                 ),
@@ -728,10 +760,11 @@
                 "alpha": float(self.alpha),
                 "n_obs": int(self.n_obs),
             }
-        # aggregate == "overall" - Phase 3 schema preserved bit-exactly,
-        # including key order and the absence of the aggregate field.
+        # aggregate == "overall" - Phase 3 schema preserved bit-exactly on
+        # the unweighted path (qug populated); the qug=None survey path
+        # surfaces qug: null.
         return {
-            "qug": self.qug.to_dict(),
+            "qug": qug_dict,
             "stute": None if self.stute is None else self.stute.to_dict(),
             "yatchew": None if self.yatchew is None else self.yatchew.to_dict(),
             "all_pass": bool(self.all_pass),
@@ -756,15 +789,29 @@ def to_dataframe(self) -> pd.DataFrame:
         no earlier pre-period exists) are emitted with NaN statistic
         values and ``reject=False`` to preserve the 3-row shape.
         """
-        qug_row = {
-            "test": "qug",
-            "statistic_name": "t_stat",
-            "statistic_value": _json_safe_scalar(self.qug.t_stat),
-            "p_value": _json_safe_scalar(self.qug.p_value),
-            "reject": bool(self.qug.reject),
-            "alpha": float(self.qug.alpha),
-            "n_obs": int(self.qug.n_obs),
-        }
+        # qug row: NaN-skip when self.qug is None (Phase 4.5 C survey/weights
+        # path leaves qug=None per C0 deferral). Mirrors the joint NaN-row
+        # shape from `_joint_row_or_nan` so the 3-row contract is preserved.
+        if self.qug is None:
+            qug_row = {
+                "test": "qug",
+                "statistic_name": "t_stat",
+                "statistic_value": float("nan"),
+                "p_value": float("nan"),
+                "reject": False,
+                "alpha": float(self.alpha),
+                "n_obs": int(self.n_obs),
+            }
+        else:
+            qug_row = {
+                "test": "qug",
+                "statistic_name": "t_stat",
+                "statistic_value": _json_safe_scalar(self.qug.t_stat),
+                "p_value": _json_safe_scalar(self.qug.p_value),
+                "reject": bool(self.qug.reject),
+                "alpha": float(self.qug.alpha),
+                "n_obs": int(self.qug.n_obs),
+            }
         if self.aggregate == "event_study":
             pre_row = self._joint_row_or_nan("pretrends_joint", self.pretrends_joint)
             hom_row = self._joint_row_or_nan("homogeneity_joint", self.homogeneity_joint)
@@ -893,6 +940,46 @@ def _fit_ols_intercept_slope(d: np.ndarray, dy: np.ndarray) -> "tuple[float, flo
     return a_hat, b_hat, residuals


+def _fit_weighted_ols_intercept_slope(
+    d: np.ndarray, dy: np.ndarray, w: np.ndarray
+) -> "tuple[float, float, np.ndarray]":
+    """Weighted OLS analog of :func:`_fit_ols_intercept_slope`.
+
+    Solves the weighted normal equations for ``dy = a + b*d + eps`` where
+    each observation has weight ``w_g``. Returns ``(a_hat, b_hat,
+    residuals)`` with ``residuals`` in the ORIGINAL input order (not
+    sorted) and on the un-weighted scale (``residuals = dy - a_hat - b_hat * d``,
+    NOT ``sqrt(w) * (dy - ...)``).
+
+    At ``w = ones(G)`` reduces bit-exactly to ``_fit_ols_intercept_slope``
+    (Phase 4.5 C stability invariant #1; locked at ``atol=1e-14`` by the
+    survey-path tests).
+
+    Closed form:
+        b_hat = sum(w * (d - d_w_mean) * (dy - dy_w_mean)) / sum(w * (d - d_w_mean)^2)
+        a_hat = dy_w_mean - b_hat * d_w_mean
+
+    where ``d_w_mean = sum(w * d) / sum(w)`` (and similarly for ``dy``).
+    """
+    sw = float(np.sum(w))
+    if sw <= 0.0:
+        raise ValueError(
+            f"_fit_weighted_ols_intercept_slope: sum(w) = {sw} <= 0; "
+            "weighted OLS requires positive total mass."
+        )
+    d_wmean = float(np.sum(w * d) / sw)
+    dy_wmean = float(np.sum(w * dy) / sw)
+    d_dev = d - d_wmean
+    var_d_w = float(np.sum(w * d_dev * d_dev))
+    if var_d_w <= 0.0:
+        # Degenerate case: all dose values equal (under weights).
+        return float(dy_wmean), 0.0, dy - dy_wmean
+    b_hat = float(np.sum(w * d_dev * (dy - dy_wmean)) / var_d_w)
+    a_hat = float(dy_wmean - b_hat * d_wmean)
+    residuals = dy - a_hat - b_hat * d
+    return a_hat, b_hat, residuals
+
+
 def _cvm_statistic(eps_sorted: np.ndarray, d_sorted: np.ndarray) -> float:
     """Compute the tie-safe Cramer-von Mises cusum statistic.
@@ -939,6 +1026,86 @@ def _cvm_statistic(eps_sorted: np.ndarray, d_sorted: np.ndarray) -> float:
     return float(np.sum(cumsum_tie_safe * cumsum_tie_safe) / (G * G))


+def _has_lonely_psu_adjust_singletons(resolved: Any) -> bool:
+    """Detect singleton strata under ``lonely_psu='adjust'``.
+
+    Returns ``True`` iff (a) the resolved design uses
+    ``lonely_psu='adjust'`` AND (b) at least one stratum has fewer than
+    2 PSUs. Under those conditions, the bootstrap multiplier helper
+    pools singletons into a pseudo-stratum with NONZERO multipliers
+    while the analytical variance target requires a pseudo-stratum
+    centering transform that is not derived for the Stute CvM
+    (Phase 4.5 C R5 P1; mirrors the explicit lonely-PSU reject on
+    HeterogeneousAdoptionDiD's sup-t bootstrap at ``had.py:2081-2118``).
+    """
+    if getattr(resolved, "lonely_psu", "remove") != "adjust":
+        return False
+    strata_arr = resolved.strata
+    if strata_arr is None:
+        return False
+    psu_arr = resolved.psu
+    for h in np.unique(strata_arr):
+        mask_h = np.asarray(strata_arr) == h
+        if psu_arr is not None:
+            n_psu_h = int(np.unique(np.asarray(psu_arr)[mask_h]).shape[0])
+        else:
+            n_psu_h = int(mask_h.sum())
+        if n_psu_h < 2:
+            return True
+    return False
+
+
+def _cvm_statistic_weighted(
+    eps_sorted: np.ndarray, d_sorted: np.ndarray, w_sorted: np.ndarray
+) -> float:
+    """Weighted analog of :func:`_cvm_statistic` (survey-weighted plug-in).
+
+    The unweighted Stute CvM ``S = (1/G) * sum_g c_G(D_g)^2`` integrates
+    the squared cusum process against the empirical CDF
+    ``F_hat = (1/G) sum_i delta_{D_i}``. The weighted plug-in replaces
+    ``F_hat`` by the survey-weighted EDF
+    ``F_hat_w = (1/W) sum_i w_i delta_{D_i}``, which weights BOTH the
+    inner cusum AND the outer integration measure:
+
+        C_g = sum_{h : D_h <= D_g} w_h * eps_h    (inner cusum, weighted)
+        S_w = (1 / W^2) * sum_g w_g * (C_g)^2     (outer measure, weighted)
+        W   = sum(w)
+
+    The outer ``w_g`` factor on each squared cusum (R7 P0 fix) is what
+    distinguishes this from a count-weighted-cusum form
+    ``(1/W^2) * sum_g C_g^2`` (no outer ``w_g``), which silently
+    misreports survey-weighted Stute statistics for non-uniform weights.
+    At ``w = ones(G)`` both forms reduce to ``(1/G^2) sum_g C_g^2``
+    (unweighted) -- only non-uniform weights distinguish them.
+
+    Tie-block collapse uses the same ``np.unique(d_sorted)`` count
+    machinery as the unweighted form -- positions are determined by
+    ``d_sorted`` ties (independent of weights), so the collapse pattern
+    is weight-invariant. The outer ``w_sorted`` factor applies to the
+    tie-collapsed cusum at each observation.
+
+    Parameters
+    ----------
+    eps_sorted, d_sorted, w_sorted : np.ndarray, shape (G,)
+        Residuals, regressor values, and weights sorted CONSISTENTLY by
+        ``d``.
+        Caller is responsible for the sort alignment.
+
+    Returns
+    -------
+    float
+        Weighted CvM statistic.
+    """
+    weighted_eps = w_sorted * eps_sorted
+    cumsum = np.cumsum(weighted_eps)
+    _, counts = np.unique(d_sorted, return_counts=True)
+    tie_end_idx = np.cumsum(counts) - 1
+    cumsum_tie_safe = np.repeat(cumsum[tie_end_idx], counts)
+    W = float(np.sum(w_sorted))
+    # R7 P0: integrate outer measure against F_hat_w via the w_sorted
+    # factor on each squared cusum (NOT against uniform 1/G measure).
+    return float(np.sum(w_sorted * cumsum_tie_safe * cumsum_tie_safe) / (W * W))
+
+
 def _compose_verdict(
     qug: QUGTestResults, stute: StuteTestResults, yatchew: YatchewTestResults
 ) -> str:
@@ -1089,18 +1256,18 @@ def qug_test(
     or pweight inputs (Phase 4.5 C0 decision gate, 2026-04). The test
     statistic uses extreme order statistics ``(D_{(1)}, D_{(2)})``, which
     are NOT smooth functionals of the empirical CDF -- standard survey
-    machinery (Binder TSL linearization, Rao-Wu rescaled bootstrap) does
-    not yield a calibrated test, and under cluster sampling the
-    ``Exp(1)/Exp(1)`` limit law's independence assumption breaks. The
-    extreme-value-theory-under-unequal-probability-sampling literature
-    (Quintos et al. 2001, Beirlant et al.) addresses tail-index
-    estimation, not boundary tests; no off-the-shelf survey-aware QUG
-    exists. Use joint Stute via :func:`did_had_pretest_workflow`
-    (``aggregate="event_study"``) for survey-aware HAD pretesting once
-    Phase 4.5 C ships -- Stute tests a smooth empirical-CDF functional
-    and admits a Rao-Wu rescaled bootstrap. See
-    ``docs/methodology/REGISTRY.md`` § "QUG Null Test" for the full
-    methodology note.
+    machinery (Binder TSL linearization, multiplier bootstrap, Rao-Wu
+    rescaled bootstrap) does not yield a calibrated test, and under
+    cluster sampling the ``Exp(1)/Exp(1)`` limit law's independence
+    assumption breaks. The extreme-value-theory-under-unequal-probability-
+    sampling literature (Quintos et al. 2001, Beirlant et al.) addresses
+    tail-index estimation, not boundary tests; no off-the-shelf
+    survey-aware QUG exists. Phase 4.5 C ships survey-aware Stute via
+    :func:`did_had_pretest_workflow` (which skips the QUG step under
+    survey/weights and runs the linearity family with a PSU-level Mammen
+    multiplier bootstrap for Stute and weighted OLS + pweight-sandwich
+    variance components for Yatchew). See ``docs/methodology/REGISTRY.md``
+    § "QUG Null Test" for the full methodology note.

     References
     ----------
@@ -1135,18 +1302,21 @@ def qug_test(
             "statistics, T = D_(1) / (D_(2) - D_(1)). "
             "Extreme-order-statistic functionals are not smooth in the "
             "empirical CDF, so standard survey machinery (Binder "
-            "linearization, Rao-Wu rescaled bootstrap) does not provide "
-            "a calibrated test. Under cluster sampling the Exp(1)/Exp(1) "
-            "limit law's independence assumption breaks. The literature "
-            "on extreme-value theory under unequal-probability sampling "
-            "(Quintos et al. 2001, Beirlant et al.) addresses tail-index "
-            "estimation, not boundary tests; no off-the-shelf "
-            "survey-aware QUG exists.\n"
+            "linearization, multiplier bootstrap, Rao-Wu rescaled "
+            "bootstrap) does not provide a calibrated test. Under "
+            "cluster sampling the Exp(1)/Exp(1) limit law's independence "
+            "assumption breaks. The literature on extreme-value theory "
+            "under unequal-probability sampling (Quintos et al. 2001, "
+            "Beirlant et al.) addresses tail-index estimation, not "
+            "boundary tests; no off-the-shelf survey-aware QUG exists.\n"
             "\n"
-            "For survey-aware HAD pretesting, use joint Stute (Phase 4.5 "
-            "C, planned) via did_had_pretest_workflow(..., survey=..., "
-            "aggregate=...). Stute tests a smooth empirical-CDF "
-            "functional and admits a Rao-Wu rescaled bootstrap. See "
+            "For survey-aware HAD pretesting, use the joint Stute family "
+            "via did_had_pretest_workflow(..., survey=..., "
+            "aggregate=...) -- shipped in Phase 4.5 C. The workflow "
+            "skips the QUG step under survey/weights with a UserWarning "
+            "and runs the linearity family with a PSU-level Mammen "
+            "multiplier bootstrap (Stute) + weighted OLS + pweight-"
+            "sandwich variance components (Yatchew). See "
            "docs/methodology/REGISTRY.md § 'QUG Null Test' for the "
            "full methodology note."
        )
@@ -1247,6 +1417,9 @@ def stute_test(
     alpha: float = 0.05,
     n_bootstrap: int = 999,
     seed: Optional[int] = None,
+    *,
+    weights: Optional[np.ndarray] = None,
+    survey: Any = None,
 ) -> StuteTestResults:
     """Run the Stute Cramer-von Mises linearity test (paper Appendix D).
@@ -1273,6 +1446,19 @@
     seed : int or None, default None
         Seed for ``np.random.default_rng``. Pass an integer for
         reproducible results.
+    weights : np.ndarray or None, keyword-only, default None
+        Per-unit positive weights for the pweight shortcut. Mutually
+        exclusive with ``survey``. When supplied, the bootstrap is routed
+        through a synthetic trivial ``ResolvedSurveyDesign`` (no
+        strata/PSU/FPC) so that the same survey-aware kernel handles both
+        entry points. See *Notes -- Survey/weighted data*.
+    survey : ResolvedSurveyDesign or None, keyword-only, default None
+        Already-resolved survey design (per-unit). Triggers the survey-
+        aware Stute calibration: PSU-level Mammen multipliers via
+        :func:`diff_diff.bootstrap_utils.generate_survey_multiplier_weights_batch`,
+        broadcast to per-unit residual perturbation, with weighted CvM
+        recompute. Replicate-weight designs raise ``NotImplementedError``
+        (deferred to a parallel follow-up after Phase 4.5 C).

     Returns
     -------
@@ -1284,7 +1470,14 @@
         If ``d`` / ``dy`` are not 1D numeric, contain NaN, have unequal
         lengths, if any ``d`` value is negative (paper Section 2 HAD
         support restriction), if ``alpha`` is outside ``(0, 1)``, or if
-        ``n_bootstrap < 99``.
+        ``n_bootstrap < 99``. Also raised if BOTH ``weights`` and
+        ``survey`` are supplied (mutex).
+    NotImplementedError
+        If ``survey.replicate_weights is not None``. Replicate-weight
+        pretests are a parallel follow-up after Phase 4.5 C; the
+        per-replicate weight-ratio rescaling for the OLS-on-residuals
+        refit step is not covered by the multiplier-bootstrap composition
+        used here.

     Notes
     -----
@@ -1297,6 +1490,23 @@
     :func:`yatchew_hr_test`. Memory usage remains ``O(G)`` regardless
     (no G x G matrix).

+    Survey/weighted data (Phase 4.5 C): when ``weights`` or ``survey`` is
+    supplied, the OLS baseline becomes weighted OLS
+    (:func:`_fit_weighted_ols_intercept_slope`), the bootstrap multipliers
+    become PSU-level Mammen draws (broadcast to per-obs perturbation), and
+    the test statistic uses :func:`_cvm_statistic_weighted`. Per-unit
+    constant-within-unit invariant on weights/strata/psu/fpc is the
+    CALLER's responsibility; the workflow
+    (:func:`did_had_pretest_workflow`) enforces it via
+    :func:`_aggregate_unit_weights` /
+    :func:`_aggregate_unit_resolved_survey` from ``had.py``. At ``w =
+    ones(G)``, weighted helpers reduce bit-exactly to the unweighted
+    versions but bootstrap p-values diverge by Monte-Carlo noise (different
+    RNG consumption between batched ``generate_survey_multiplier_weights_batch``
+    and per-iteration ``_generate_mammen_weights``); use the
+    distribution-equivalence reduction test (large B) for trivial-pweight
+    parity, NOT numerical equivalence.
+
     References
     ----------
     Stute, W. (1997). Nonparametric model checks for regression. Annals
@@ -1314,6 +1524,32 @@
            f"Got n_bootstrap={n_bootstrap}."
        )

+    # Phase 4.5 C: survey/weights mutex + replicate-weight rejection.
+    # Mirrors the C0 pattern from qug_test and HeterogeneousAdoptionDiD.fit().
+    if survey is not None and weights is not None:
+        raise ValueError(
+            "stute_test: pass survey= OR weights=, "
+            "not both. survey= triggers full PSU-aware bootstrap; weights= is "
+            "the pweight shortcut routed through a synthetic trivial design."
+        )
+    if survey is not None and getattr(survey, "replicate_weights", None) is not None:
+        raise NotImplementedError(
+            "stute_test: replicate-weight survey designs (BRR/Fay/JK1/JKn/SDR) "
+            "are not yet supported on HAD pretests. The per-replicate weight-"
+            "ratio rescaling for the OLS-on-residuals refit step is not covered "
+            "by the multiplier-bootstrap composition. Replicate-weight pretests "
+            "are a parallel follow-up after Phase 4.5 C."
+        )
+    # R1 P1: pweight-only guard on the direct-helper survey entry (mirrors
+    # _resolve_pretest_unit_weights for the workflow path).
+    if survey is not None and getattr(survey, "weight_type", "pweight") != "pweight":
+        raise ValueError(
+            f"stute_test: HAD pretests require weight_type='pweight'. Got "
+            f"weight_type={survey.weight_type!r}. aweight / fweight have "
+            "different sandwich-variance semantics that are not derived "
+            "for the Stute CvM bootstrap calibration."
+        )
+
     d_arr = _validate_1d_numeric(d, "d")
     dy_arr = _validate_1d_numeric(dy, "dy")
     if d_arr.shape[0] != dy_arr.shape[0]:
@@ -1359,7 +1595,57 @@
                stacklevel=2,
            )

-    a_hat, b_hat, eps = _fit_ols_intercept_slope(d_arr, dy_arr)
+    # Phase 4.5 C: resolve effective per-unit weights (None on the
+    # unweighted path, preserves bit-exact regression). When survey= is
+    # supplied, w is taken from the resolved design.
+    # R4 P1: validate 1D explicitly so column-vector inputs (e.g.
+    # df[["w"]].to_numpy()) raise instead of silently broadcasting.
+    if survey is not None:
+        w_arr = _validate_1d_numeric(np.asarray(survey.weights), "stute_test: survey.weights")
+        if w_arr.shape[0] != G:
+            raise ValueError(
+                f"stute_test: survey.weights length {w_arr.shape[0]} does not "
+                f"match d/dy length {G}."
+            )
+        if (w_arr <= 0).any():
+            raise ValueError(
+                "stute_test: survey weights must be strictly positive. "
+                "Zero / negative weights would leave units in the "
+                "variance / CvM computation while contributing zero "
+                "population mass; pre-filter the panel to the positive-"
+                "weight subpopulation before calling stute_test."
+            )
+    elif weights is not None:
+        w_arr = _validate_1d_numeric(np.asarray(weights), "stute_test: weights")
+        if w_arr.shape[0] != G:
+            raise ValueError(
+                f"stute_test: weights length {w_arr.shape[0]} does not match " f"d/dy length {G}."
+            )
+        if (w_arr <= 0).any():
+            raise ValueError(
+                "stute_test: weights must be strictly positive (the pweight "
+                "shortcut does not support zero weights; use survey= with "
+                "explicit lonely-PSU handling for zero-mass strata)."
+            )
+    else:
+        w_arr = None
+
+    # R4 P0: normalize pweights to mean=1 (matches SurveyDesign.resolve()
+    # convention). Makes the test statistic scale-invariant under uniform
+    # rescaling of weights AND ensures weights= shortcut and
+    # survey=SurveyDesign(weights=...) produce identical results for the
+    # same design. Stute is internally scale-invariant in functional form,
+    # but the survey-aware bootstrap helper consumes weight values
+    # directly under non-trivial PSU/strata, so normalization is required
+    # for cross-path agreement.
+    if w_arr is not None:
+        w_arr = w_arr * (float(w_arr.shape[0]) / float(np.sum(w_arr)))
+
+    if w_arr is None:
+        a_hat, b_hat, eps = _fit_ols_intercept_slope(d_arr, dy_arr)
+    else:
+        a_hat, b_hat, eps = _fit_weighted_ols_intercept_slope(d_arr, dy_arr, w_arr)

     # Genuine degeneracy: zero dose variation. The CvM cusum is defined
     # against the regressor, and constant d carries no signal to test
     # linearity against - emit NaN.
@@ -1414,16 +1700,119 @@
     idx = np.argsort(d_arr, kind="stable")
     d_sorted = d_arr[idx]
-    S = _cvm_statistic(eps[idx], d_sorted)
+    if w_arr is None:
+        S = _cvm_statistic(eps[idx], d_sorted)
+    else:
+        S = _cvm_statistic_weighted(eps[idx], d_sorted, w_arr[idx])

     rng = np.random.default_rng(seed)
     bootstrap_S = np.empty(n_bootstrap, dtype=np.float64)
     fitted = a_hat + b_hat * d_arr  # baseline fitted values under H_0
-    for b in range(n_bootstrap):
-        eta = _generate_mammen_weights(G, rng)
-        dy_b = fitted + eps * eta
-        _, _, eps_b = _fit_ols_intercept_slope(d_arr, dy_b)
-        bootstrap_S[b] = _cvm_statistic(eps_b[idx], d_sorted)
+
+    if w_arr is None:
+        # Unweighted bit-exact path - identical to pre-PR code.
+        for b in range(n_bootstrap):
+            eta = _generate_mammen_weights(G, rng)
+            dy_b = fitted + eps * eta
+            _, _, eps_b = _fit_ols_intercept_slope(d_arr, dy_b)
+            bootstrap_S[b] = _cvm_statistic(eps_b[idx], d_sorted)
+    else:
+        # Phase 4.5 C survey-aware path: PSU-level Mammen multipliers
+        # (broadcast to per-obs perturbation), weighted OLS refit, weighted
+        # CvM recompute. Routes via synthetic trivial ResolvedSurveyDesign
+        # for the weights= shortcut to share the same kernel.
+        resolved_for_boot = survey if survey is not None else _make_trivial_resolved(w_arr)
+        # R10 P1: reject stratified designs explicitly until a derived
+        # Stute-specific correction lands. The HAD sup-t bootstrap
+        # (had.py:2120+) applies a within-stratum demean +
+        # sqrt(n_h/(n_h-1)) small-sample correction AFTER
+        # generate_survey_multiplier_weights_batch returns, to make the
+        # bootstrap variance match the Binder-TSL stratified target.
+        # That same correction has NOT been derived for the Stute CvM
+        # functional, so applying the helper's raw multipliers directly
+        # to residual perturbations on stratified designs leaves the
+        # bootstrap p-value silently miscalibrated. Pweight-only,
+        # PSU-only, and FPC-only designs are still supported (the
+        # helper's output is appropriately scaled for those).
+        if resolved_for_boot.strata is not None:
+            raise NotImplementedError(
+                "stute_test: SurveyDesign(strata=...) with stratified "
+                "sampling is not yet supported. The Stute CvM bootstrap "
+                "calibration on stratified designs requires a within-"
+                "stratum demean + sqrt(n_h/(n_h-1)) small-sample "
+                "correction analogous to the HAD sup-t bootstrap, but "
+                "the matching derivation for the Stute functional has "
+                "not been completed. Pweight-only or PSU-only "
+                "(SurveyDesign(weights=..., psu=...)) designs are "
+                "supported; pre-process stratified designs to remove "
+                "the strata column or wait for the derivation in a "
+                "follow-up PR."
+            )
+        # R5 P1: reject lonely_psu='adjust' singleton-strata designs
+        # explicitly (now redundant with the strata guard above; kept
+        # for defense in depth and for residual non-stratified
+        # singleton-strata edge cases).
+        if _has_lonely_psu_adjust_singletons(resolved_for_boot):
+            raise NotImplementedError(
+                "stute_test: SurveyDesign(lonely_psu='adjust') with "
+                "singleton strata is not yet supported on the multiplier "
+                "bootstrap. The bootstrap helper pools singletons with "
+                "nonzero multipliers but the matching analytical "
+                "variance target requires a pseudo-stratum centering "
+                "transform that has not been derived for the Stute CvM. "
+                "Use lonely_psu='remove' (drops singleton contributions) "
+                "or 'certainty' (zero-variance singletons), or pre-"
+                "process the panel to remove singleton strata."
+            )
+        # R3 P0: variance-unidentified survey-design guard. When
+        # n_psu - n_strata <= 0 (e.g. unstratified single-PSU, or one PSU
+        # per stratum under lonely_psu='remove'/'certainty'),
+        # generate_survey_multiplier_weights_batch returns an all-zero
+        # multiplier matrix.
Without this guard, the code below would + # treat zero perturbations as a valid bootstrap law and emit + # p_value = 1/(B+1) for any positive observed CvM (spurious + # rejection). Mirrors compute_survey_vcov's df_survey-driven + # NaN treatment elsewhere in the package. The lonely_psu='adjust' + # singleton case (which has nonzero multipliers but a separate + # methodology gap) is already rejected above, so this branch + # only catches genuinely degenerate designs. + df_survey = resolved_for_boot.df_survey + if df_survey is None or df_survey <= 0: + warnings.warn( + f"stute_test: survey design is variance-unidentified " + f"(df_survey={df_survey}); the multiplier bootstrap " + "cannot calibrate the test (single-PSU unstratified or " + "one-PSU-per-stratum design). Returning NaN result.", + UserWarning, + stacklevel=2, + ) + return StuteTestResults( + cvm_stat=float(S), + p_value=float("nan"), + reject=False, + alpha=alpha, + n_bootstrap=int(n_bootstrap), + n_obs=G, + seed=seed, + ) + psu_mults, psu_ids = generate_survey_multiplier_weights_batch( + n_bootstrap, resolved_for_boot, weight_type="mammen", rng=rng + ) + # Build per-obs PSU-column index. When psu is None (trivial path), + # each obs is its own PSU and psu_ids = arange(G) - so psu_col_idx + # is just arange(G). 
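The per-obs PSU-column mapping described in the comment above can be sketched in isolation; `psu_mults` / `psu_ids` stand in for the output of `generate_survey_multiplier_weights_batch`, which is not reproduced here:

```python
import numpy as np

psu = np.array([10, 10, 20, 20, 30])        # per-obs PSU labels (G = 5)
psu_ids = np.array([10, 20, 30])            # column order of the multiplier matrix
psu_mults = np.array([[0.5, -0.6, 1.6],     # (n_bootstrap = 2, n_psu = 3)
                      [-0.6, 1.6, 0.5]])

# Map each obs to its multiplier column, then broadcast one bootstrap row
# to a per-obs perturbation: all obs in a PSU share one multiplier.
psu_to_col = {int(p): c for c, p in enumerate(psu_ids)}
psu_col_idx = np.array([psu_to_col[int(p)] for p in psu])
eta_obs = psu_mults[0, psu_col_idx]         # shape (G,)
```

On the trivial path (`psu is None`) each obs is its own PSU, so `psu_col_idx` degenerates to `np.arange(G)` exactly as the comment states.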
+ if resolved_for_boot.psu is None: + psu_col_idx = np.arange(G) + else: + psu_to_col = {int(p): c for c, p in enumerate(psu_ids)} + psu_arr = np.asarray(resolved_for_boot.psu) + psu_col_idx = np.array([psu_to_col[int(psu_arr[g])] for g in range(G)]) + + for b in range(n_bootstrap): + eta_obs = psu_mults[b, psu_col_idx] # (G,) + dy_b = fitted + eps * eta_obs + _, _, eps_b = _fit_weighted_ols_intercept_slope(d_arr, dy_b, w_arr) + bootstrap_S[b] = _cvm_statistic_weighted(eps_b[idx], d_sorted, w_arr[idx]) p_value = float((1.0 + float(np.sum(bootstrap_S >= S))) / (n_bootstrap + 1.0)) reject = p_value <= alpha @@ -1439,7 +1828,14 @@ def stute_test( ) -def yatchew_hr_test(d: np.ndarray, dy: np.ndarray, alpha: float = 0.05) -> YatchewTestResults: +def yatchew_hr_test( + d: np.ndarray, + dy: np.ndarray, + alpha: float = 0.05, + *, + weights: Optional[np.ndarray] = None, + survey: Any = None, +) -> YatchewTestResults: """Run the Yatchew heteroskedasticity-robust linearity test. Tests ``H_0: E[ΔY | D_2]`` is linear in ``D_2`` (paper Assumption 8, @@ -1462,6 +1858,17 @@ def yatchew_hr_test(d: np.ndarray, dy: np.ndarray, alpha: float = 0.05) -> Yatch Dose and first-difference outcome vectors. alpha : float, default 0.05 One-sided significance level. + weights : np.ndarray or None, keyword-only, default None + Per-unit STRICTLY POSITIVE weights for the pweight shortcut. + Mutually exclusive with ``survey``. See *Notes -- Survey/weighted data*. + survey : ResolvedSurveyDesign or None, keyword-only, default None + Already-resolved survey design (per-unit). When supplied, the OLS + baseline becomes weighted OLS and all three variance components + become their pweight-sandwich analogs. PSU clustering is NOT + propagated through the variance-ratio statistic (would require + deriving a survey-aware variance-of-variance estimator; out of + scope per Phase 4.5 C). Replicate-weight designs raise + ``NotImplementedError``. 
Returns ------- @@ -1473,6 +1880,10 @@ def yatchew_hr_test(d: np.ndarray, dy: np.ndarray, alpha: float = 0.05) -> Yatch If ``d`` / ``dy`` are not 1D numeric, contain NaN, have unequal lengths, if any ``d`` value is negative (paper Section 2 HAD support restriction), or if ``alpha`` is outside ``(0, 1)``. + Also raised if BOTH ``weights`` and ``survey`` supplied (mutex), + or if any weight is non-positive. + NotImplementedError + If ``survey.replicate_weights is not None`` (deferred follow-up). Notes ----- @@ -1512,15 +1923,58 @@ def yatchew_hr_test(d: np.ndarray, dy: np.ndarray, alpha: float = 0.05) -> Yatch with a ``UserWarning``, rather than unconditionally mapping this to ``p=1`` (which would flip a legitimate rejection). + Survey/weighted data (Phase 4.5 C): when ``weights`` or ``survey`` is + supplied, all three variance components use their pweight-sandwich + analogs: + + - ``sigma2_lin = sum(w * eps^2) / sum(w)`` (weighted OLS residual variance). + - ``sigma2_diff = sum(w_avg * (dy_g - dy_{g-1})^2) / (2 * sum(w))`` + where ``w_avg_g = (w_g + w_{g-1}) / 2`` and the divisor uses + ``sum(w)`` (not ``sum(w_avg)``) so the formula reduces bit-exactly + to the unweighted ``(1/(2G))`` divisor at ``w = ones(G)``. + - ``sigma4_W = sum(w_avg * eps_g^2 * eps_{g-1}^2) / sum(w_avg)`` with + arithmetic-mean pair weights; reduces to the unweighted ``(1/(G-1))`` + divisor at ``w = ones(G)``. + - ``T_hr = sqrt(sum(w)) * (sigma2_lin - sigma2_diff) / sigma2_W``. + + The pair-weight convention follows Krieger-Pfeffermann (1997, §3) for + design-consistent inference on smooth functionals; PSU clustering is + NOT propagated through the variance-ratio statistic. Strictly positive + weights are required (the adjacent-difference formula has + ``sum(w_avg)`` in the denominator). Per-unit constant-within-unit + invariant on weights/strata/psu/fpc is the CALLER's responsibility. + References ---------- Yatchew, A. (1997). An elementary estimator of the partial linear model. 
Economics Letters 57, 135-143. de Chaisemartin et al. (2026), Theorem 7 / Equation 29. + Krieger, A., Pfeffermann, D. (1997). Testing of distribution functions + from complex sample surveys. Journal of Official Statistics 13(2), + 123-142. """ if not (0.0 < alpha < 1.0): raise ValueError(f"alpha must satisfy 0 < alpha < 1, got {alpha}.") + # Phase 4.5 C: survey/weights mutex + replicate-weight rejection. + if survey is not None and weights is not None: + raise ValueError( + "yatchew_hr_test: pass survey= OR " "weights=, not both." + ) + if survey is not None and getattr(survey, "replicate_weights", None) is not None: + raise NotImplementedError( + "yatchew_hr_test: replicate-weight survey designs (BRR/Fay/JK1/JKn/" + "SDR) are not yet supported on HAD pretests. Replicate-weight " + "pretests are a parallel follow-up after Phase 4.5 C." + ) + # R1 P1: pweight-only guard (aweight/fweight have different sandwich- + # variance semantics not derived for the variance-ratio statistic). + if survey is not None and getattr(survey, "weight_type", "pweight") != "pweight": + raise ValueError( + f"yatchew_hr_test: HAD pretests require weight_type='pweight'. " + f"Got weight_type={survey.weight_type!r}." + ) + d_arr = _validate_1d_numeric(d, "d") dy_arr = _validate_1d_numeric(dy, "dy") if d_arr.shape[0] != dy_arr.shape[0]: @@ -1541,6 +1995,50 @@ def yatchew_hr_test(d: np.ndarray, dy: np.ndarray, alpha: float = 0.05) -> Yatch G = int(d_arr.shape[0]) critical_value = float(stats.norm.ppf(1.0 - alpha)) + # Phase 4.5 C: resolve effective per-unit weights. Strictly positive + # required (the adjacent-difference formula divides by sum(w_avg) which + # collapses to zero in any contiguous-zero block). + # R4 P1: validate 1D explicitly so column-vector inputs raise. 
+ if survey is not None: + w_arr = _validate_1d_numeric(np.asarray(survey.weights), "yatchew_hr_test: survey.weights") + if w_arr.shape[0] != G: + raise ValueError( + f"yatchew_hr_test: survey.weights length {w_arr.shape[0]} " + f"does not match d/dy length {G}." + ) + if (w_arr <= 0).any(): + raise ValueError( + "yatchew_hr_test: survey.weights must be strictly positive " + "(adjacent-difference variance is undefined under contiguous " + "zero-weight blocks)." + ) + elif weights is not None: + w_arr = _validate_1d_numeric(np.asarray(weights), "yatchew_hr_test: weights") + if w_arr.shape[0] != G: + raise ValueError( + f"yatchew_hr_test: weights length {w_arr.shape[0]} does not " + f"match d/dy length {G}." + ) + if (w_arr <= 0).any(): + raise ValueError( + "yatchew_hr_test: weights must be strictly positive (the " + "adjacent-difference variance is undefined under contiguous " + "zero-weight blocks)." + ) + else: + w_arr = None + + # R4 P0: normalize pweights to mean=1 (matches SurveyDesign.resolve() + # convention). Yatchew uses sqrt(sum(w)) as the effective sample size, + # which without normalization would scale as sqrt(c) under uniform + # rescaling weights -> w * c, producing different p-values for + # weights=w vs weights=100*w. Normalization makes the statistic + # scale-invariant AND ensures weights= and survey=SurveyDesign(...) + # produce identical results (the latter resolve()s to mean=1 + # internally, the former previously did not). + if w_arr is not None: + w_arr = w_arr * (float(w_arr.shape[0]) / float(np.sum(w_arr))) + if G < _MIN_G_YATCHEW: warnings.warn( f"yatchew_hr_test: G = {G} is below the minimum {_MIN_G_YATCHEW} " @@ -1593,8 +2091,16 @@ def yatchew_hr_test(d: np.ndarray, dy: np.ndarray, alpha: float = 0.05) -> Yatch n_obs=G, ) - _, _, eps = _fit_ols_intercept_slope(d_arr, dy_arr) - sigma2_lin = float(np.mean(eps * eps)) + # Phase 4.5 C: weighted vs unweighted OLS dispatch. Unweighted path is + # bit-exact pre-PR (stability invariant #1). 
+ if w_arr is None: + _, _, eps = _fit_ols_intercept_slope(d_arr, dy_arr) + sigma2_lin = float(np.mean(eps * eps)) + sum_w = float(G) # uniform-weights effective sample size = G + else: + _, _, eps = _fit_weighted_ols_intercept_slope(d_arr, dy_arr, w_arr) + sum_w = float(np.sum(w_arr)) + sigma2_lin = float(np.sum(w_arr * eps * eps) / sum_w) # Numerically exact linear fit: same short-circuit as `stute_test`. # Assumption 8 holds to IEEE precision; the Yatchew statistic is @@ -1610,9 +2116,18 @@ def yatchew_hr_test(d: np.ndarray, dy: np.ndarray, alpha: float = 0.05) -> Yatch # - dy_centered_sq == 0: dy is constant (trivially linear). # - relative SSR below IEEE precision: near-exact OLS fit. # For reporting, compute sigma2_diff on the sorted dy (finite, - # well-defined even in the exact-linear case). + # well-defined even in the exact-linear case). Use the weighted + # divisor (2 * sum(w)) when weights are supplied so the reported + # sigma2_diff matches the same convention as the active branch. idx_early = np.argsort(d_arr, kind="stable") - sigma2_diff_exact = float(np.sum(np.diff(dy_arr[idx_early]) ** 2) / (2.0 * G)) + if w_arr is None: + sigma2_diff_exact = float(np.sum(np.diff(dy_arr[idx_early]) ** 2) / (2.0 * G)) + else: + w_s_e = w_arr[idx_early] + w_avg_e = 0.5 * (w_s_e[1:] + w_s_e[:-1]) + sigma2_diff_exact = float( + np.sum(w_avg_e * np.diff(dy_arr[idx_early]) ** 2) / (2.0 * sum_w) + ) return YatchewTestResults( t_stat_hr=float("-inf"), p_value=1.0, @@ -1630,14 +2145,24 @@ def yatchew_hr_test(d: np.ndarray, dy: np.ndarray, alpha: float = 0.05) -> Yatch eps_s = eps[idx] diff_dy = np.diff(dy_s) # length G - 1 - # Paper-literal divisor: 2G (NOT 2(G-1)). This matches paper review - # line 168: sigma2_diff := (1/(2G)) * sum((dy_{(g)} - dy_{(g-1)})^2). - sigma2_diff = float(np.sum(diff_dy * diff_dy) / (2.0 * G)) - - # sigma4_W = (1/(G-1)) * sum(eps_(g)^2 * eps_(g-1)^2) using np.mean - # which divides by the length of the input (G-1 here). 
Matches paper - # review line 171. - sigma4_W = float(np.mean(eps_s[1:] ** 2 * eps_s[:-1] ** 2)) + if w_arr is None: + # Paper-literal divisor: 2G (NOT 2(G-1)). This matches paper review + # line 168: sigma2_diff := (1/(2G)) * sum((dy_{(g)} - dy_{(g-1)})^2). + sigma2_diff = float(np.sum(diff_dy * diff_dy) / (2.0 * G)) + # sigma4_W = (1/(G-1)) * sum(eps_(g)^2 * eps_(g-1)^2) using np.mean + # which divides by the length of the input (G-1 here). Matches paper + # review line 171. + sigma4_W = float(np.mean(eps_s[1:] ** 2 * eps_s[:-1] ** 2)) + else: + # Phase 4.5 C: pweight-sandwich weighted variance components. + # Pair-weights (Krieger-Pfeffermann 1997 §3): w_avg_g = (w_g + w_{g-1})/2. + # Reduction at w=ones(G): w_avg = ones(G-1), sum(w) = G, so + # sigma2_diff = sum(diff^2) / (2*G) (matches existing 2G divisor) + # sigma4_W = sum(prod) / (G-1) (matches existing G-1 divisor) + w_s = w_arr[idx] + w_avg = 0.5 * (w_s[1:] + w_s[:-1]) + sigma2_diff = float(np.sum(w_avg * diff_dy * diff_dy) / (2.0 * sum_w)) + sigma4_W = float(np.sum(w_avg * eps_s[1:] ** 2 * eps_s[:-1] ** 2) / np.sum(w_avg)) if sigma4_W <= 0.0: # sigma4_W = 0 AFTER the exact-linear short-circuit means OLS # residuals are NOT zero (the shortcut already caught that case) @@ -1684,7 +2209,9 @@ def yatchew_hr_test(d: np.ndarray, dy: np.ndarray, alpha: float = 0.05) -> Yatch ) sigma2_W = float(np.sqrt(sigma4_W)) - t_stat_hr = float(np.sqrt(G) * (sigma2_lin - sigma2_diff) / sigma2_W) + # Phase 4.5 C: effective sample size = sum(w) (=G under uniform weights, + # so unweighted path is bit-exact). 
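The bit-exact reduction claimed for the weighted Yatchew components can be checked numerically. The vectors are arbitrary (the identity is algebraic, not statistical); at `w = ones(G)` the pair weights are all 1, `sum(w) = G`, and `sum(w_avg) = G - 1`, so the weighted divisors collapse to the paper-literal `2G` and `G-1`:

```python
import numpy as np

rng = np.random.default_rng(1)
G = 40
dy = rng.normal(size=G)          # stand-in for dose-sorted dy
eps = dy - dy.mean()             # stand-in for dose-sorted OLS residuals
w = np.ones(G)

sum_w = w.sum()                               # = G
w_avg = 0.5 * (w[1:] + w[:-1])                # pair weights, = ones(G - 1)
diff_dy = np.diff(dy)

sigma2_diff_w = np.sum(w_avg * diff_dy ** 2) / (2.0 * sum_w)
sigma2_diff_u = np.sum(diff_dy ** 2) / (2.0 * G)          # paper-literal 2G divisor

sigma4_W_w = np.sum(w_avg * eps[1:] ** 2 * eps[:-1] ** 2) / np.sum(w_avg)
sigma4_W_u = np.mean(eps[1:] ** 2 * eps[:-1] ** 2)        # 1/(G-1) divisor
```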
+ t_stat_hr = float(np.sqrt(sum_w) * (sigma2_lin - sigma2_diff) / sigma2_W) p_value = float(1.0 - stats.norm.cdf(t_stat_hr)) reject = t_stat_hr >= critical_value @@ -1989,6 +2516,8 @@ def stute_joint_pretest( n_bootstrap: int = 999, seed: Optional[int] = None, null_form: str = "custom", + weights: Optional[np.ndarray] = None, + survey: Any = None, ) -> StuteJointResult: """Joint Cramer-von Mises pretest across multiple horizons. @@ -2034,6 +2563,21 @@ def stute_joint_pretest( (``"mean_independence"`` | ``"linearity"`` | ``"custom"``). The wrappers :func:`joint_pretrends_test` and :func:`joint_homogeneity_test` set this automatically. + weights : np.ndarray or None, keyword-only, default None + Per-unit positive weights (Phase 4.5 C). When supplied, the + per-horizon CvM uses :func:`_cvm_statistic_weighted` and the + bootstrap routes through a synthetic trivial + ``ResolvedSurveyDesign``. Mutually exclusive with ``survey``. + survey : ResolvedSurveyDesign or None, keyword-only, default None + Already-resolved per-unit survey design (Phase 4.5 C). When + supplied, the bootstrap is a PSU-level Mammen multiplier + bootstrap with the multiplier matrix shared across horizons + within each replicate (preserves both vector-valued empirical- + process unit-level dependence + PSU clustering). Replicate- + weight designs raise ``NotImplementedError``; non-pweight + weight types are rejected. Variance-unidentified designs + (``df_survey <= 0``) return NaN with a ``UserWarning`` instead + of calibrating against an all-zero multiplier matrix. Returns ------- @@ -2058,6 +2602,26 @@ def stute_joint_pretest( negative values, ``n_bootstrap < _MIN_N_BOOTSTRAP``, or invalid ``alpha``. ``G < _MIN_G_STUTE`` does NOT raise; see Returns. """ + # Phase 4.5 C: survey/weights mutex + replicate-weight rejection + # (mirrors stute_test, yatchew_hr_test, did_had_pretest_workflow). 
+ if survey is not None and weights is not None: + raise ValueError( + "stute_joint_pretest: pass survey= OR " + "weights=, not both." + ) + if survey is not None and getattr(survey, "replicate_weights", None) is not None: + raise NotImplementedError( + "stute_joint_pretest: replicate-weight survey designs (BRR/Fay/JK1/" + "JKn/SDR) are not yet supported on HAD pretests. Replicate-weight " + "pretests are a parallel follow-up after Phase 4.5 C." + ) + # R1 P1: pweight-only guard. + if survey is not None and getattr(survey, "weight_type", "pweight") != "pweight": + raise ValueError( + f"stute_joint_pretest: HAD pretests require weight_type='pweight'. " + f"Got weight_type={survey.weight_type!r}." + ) + if not isinstance(residuals_by_horizon, dict) or not isinstance(fitted_by_horizon, dict): raise ValueError( "residuals_by_horizon and fitted_by_horizon must be dicts " "keyed by horizon label." @@ -2231,12 +2795,59 @@ def stute_joint_pretest( exact_linear_short_circuited=False, ) + # Phase 4.5 C: resolve effective per-unit weights (None → bit-exact + # unweighted path). + # R4 P1: validate 1D explicitly so column-vector inputs raise. + if survey is not None: + w_arr = _validate_1d_numeric( + np.asarray(survey.weights), "stute_joint_pretest: survey.weights" + ) + if w_arr.shape[0] != G: + raise ValueError( + f"stute_joint_pretest: survey.weights length {w_arr.shape[0]} " + f"does not match doses length {G}." + ) + # R1 P0: strictly-positive guard (mirrors stute_test single-horizon). + if (w_arr <= 0).any(): + raise ValueError( + "stute_joint_pretest: survey weights must be strictly " + "positive. Zero / negative weights would leave units in " + "the variance / CvM computation while contributing zero " + "population mass." 
+ ) + elif weights is not None: + w_arr = _validate_1d_numeric(np.asarray(weights), "stute_joint_pretest: weights") + if w_arr.shape[0] != G: + raise ValueError( + f"stute_joint_pretest: weights length {w_arr.shape[0]} does " + f"not match doses length {G}." + ) + if (w_arr <= 0).any(): + raise ValueError( + "stute_joint_pretest: weights must be strictly positive (the " + "pweight shortcut does not support zero weights)." + ) + else: + w_arr = None + + # R4 P0: normalize pweights to mean=1 (matches SurveyDesign.resolve() + # convention; same fix as stute_test / yatchew_hr_test). + if w_arr is not None: + w_arr = w_arr * (float(w_arr.shape[0]) / float(np.sum(w_arr))) + idx = np.argsort(doses_arr, kind="stable") d_sorted = doses_arr[idx] per_horizon_stats: Dict[str, float] = {} - for k in horizon_labels: - per_horizon_stats[k] = _cvm_statistic(residuals_arrays[k][idx], d_sorted) + if w_arr is None: + for k in horizon_labels: + per_horizon_stats[k] = _cvm_statistic(residuals_arrays[k][idx], d_sorted) + else: + w_sorted = w_arr[idx] + for k in horizon_labels: + per_horizon_stats[k] = _cvm_statistic_weighted( + residuals_arrays[k][idx], d_sorted, w_sorted + ) S_joint = float(sum(per_horizon_stats.values())) # Per-horizon exact-linear short-circuit (scale- and translation- @@ -2279,12 +2890,19 @@ def stute_joint_pretest( # Precompute OLS projection matrix once: same X per bootstrap draw, # so (X'X)^-1 X' is constant across iterations. Keeps refit O(Gp) # per draw without changing semantics from the literal paper form. + # Weighted variant uses (X' W X)^-1 X' W; same precompute idiom. # Catch rank-deficient designs explicitly rather than surfacing a # raw ``np.linalg.LinAlgError`` to direct callers of the public # residuals-in core; matches the front-door validation style of # the other guards in this function. 
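The weighted-projection precompute used just below — `beta = (X' W X)^{-1} X' W y` assembled once and reused across bootstrap draws — can be sketched and cross-checked against an equivalent `sqrt(w)`-rescaled least-squares solve:

```python
import numpy as np

rng = np.random.default_rng(3)
G = 30
d = rng.uniform(0.0, 1.0, G)
X = np.column_stack([np.ones(G), d])      # intercept + slope design, (G, p)
w = rng.uniform(0.5, 2.0, G)

XtWX = X.T @ (w[:, np.newaxis] * X)       # (p, p)
XtW = X.T * w                             # (p, G) via row-wise broadcast
XtX_inv_Xt = np.linalg.solve(XtWX, XtW)   # (p, G) weighted projector

y = rng.normal(size=G)
beta = XtX_inv_Xt @ y
eps = y - X @ beta                        # W-orthogonal to the columns of X
```

Precomputing the projector keeps each bootstrap refit a single matrix-vector product, which is the design choice the comment above describes.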
try: - XtX_inv_Xt = np.linalg.solve(X.T @ X, X.T) + if w_arr is None: + XtX_inv_Xt = np.linalg.solve(X.T @ X, X.T) + else: + # Weighted OLS projection: (X' W X)^-1 X' W + XtWX = X.T @ (w_arr[:, np.newaxis] * X) + XtW = X.T * w_arr # broadcasts (p, G) + XtX_inv_Xt = np.linalg.solve(XtWX, XtW) except np.linalg.LinAlgError as exc: raise ValueError( f"design_matrix is rank-deficient (singular X^T X); cannot " @@ -2295,18 +2913,107 @@ def stute_joint_pretest( rng = np.random.default_rng(seed) bootstrap_S = np.empty(n_bootstrap, dtype=np.float64) - for b in range(n_bootstrap): - # SHARED eta across horizons - preserves unit-level dependence - # in the vector-valued empirical process. Independent-per-horizon - # draws would overstate precision. - eta = _generate_mammen_weights(G, rng) - S_b = 0.0 - for k in horizon_labels: - dy_b = fitted_arrays[k] + residuals_arrays[k] * eta - beta_b = XtX_inv_Xt @ dy_b - eps_b = dy_b - X @ beta_b - S_b += _cvm_statistic(eps_b[idx], d_sorted) - bootstrap_S[b] = S_b + + if w_arr is None: + # Unweighted bit-exact path (stability invariant #1). + for b in range(n_bootstrap): + # SHARED eta across horizons - preserves unit-level dependence + # in the vector-valued empirical process. Independent-per-horizon + # draws would overstate precision. + eta = _generate_mammen_weights(G, rng) + S_b = 0.0 + for k in horizon_labels: + dy_b = fitted_arrays[k] + residuals_arrays[k] * eta + beta_b = XtX_inv_Xt @ dy_b + eps_b = dy_b - X @ beta_b + S_b += _cvm_statistic(eps_b[idx], d_sorted) + bootstrap_S[b] = S_b + else: + # Phase 4.5 C survey-aware path: PSU-level Mammen multipliers + # SHARED across horizons within each replicate. The (B, n_psu) + # matrix is drawn ONCE per replicate; the per-horizon loop + # broadcasts the SAME multipliers, preserving both the + # vector-valued empirical-process unit-level dependence (paper + # convention) AND PSU clustering (Krieger-Pfeffermann 1997). 
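A minimal sketch of the shared-multiplier joint bootstrap loop follows. `cvm_stat` is a stand-in cusum-of-residuals convention (the package's `_cvm_statistic` may normalize differently); what the sketch pins down is the SHARED `eta` across horizons within each replicate and the `(1 + #{S_b >= S}) / (B + 1)` finite-sample p-value:

```python
import numpy as np

def cvm_stat(eps_in_dose_order):
    # Stand-in Cramer-von Mises functional: mean squared scaled cusum.
    G = eps_in_dose_order.shape[0]
    cusum = np.cumsum(eps_in_dose_order) / np.sqrt(G)
    return float(np.mean(cusum ** 2))

rng = np.random.default_rng(2)
G, B = 60, 199
dy = {f"h{k}": rng.normal(size=G) for k in range(3)}   # arrays already in dose order
resid = {k: v - v.mean() for k, v in dy.items()}       # H0 fit = per-horizon mean
fitted = {k: np.full(G, v.mean()) for k, v in dy.items()}

S_joint = sum(cvm_stat(resid[k]) for k in resid)
# Mammen two-point multipliers: mean 0, variance 1.
lo, hi = -(np.sqrt(5) - 1) / 2, (np.sqrt(5) + 1) / 2
p_lo = (np.sqrt(5) + 1) / (2 * np.sqrt(5))
boot = np.empty(B)
for b in range(B):
    eta = rng.choice([lo, hi], size=G, p=[p_lo, 1 - p_lo])
    S_b = 0.0
    for k in resid:                        # SAME eta for every horizon
        dy_b = fitted[k] + resid[k] * eta
        eps_b = dy_b - dy_b.mean()         # refit the H0 (mean) model
        S_b += cvm_stat(eps_b)
    boot[b] = S_b
p_value = (1.0 + np.sum(boot >= S_joint)) / (B + 1.0)
```

Drawing `eta` once per replicate and reusing it across horizons is exactly what prevents the overstated precision that independent-per-horizon draws would produce.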
+ resolved_for_boot = survey if survey is not None else _make_trivial_resolved(w_arr) + # R10 P1: reject stratified designs explicitly until a derived + # Stute-specific correction lands (mirrors stute_test + # single-horizon). + if resolved_for_boot.strata is not None: + raise NotImplementedError( + "stute_joint_pretest: SurveyDesign(strata=...) with " + "stratified sampling is not yet supported. The Stute " + "CvM bootstrap calibration on stratified designs " + "requires a within-stratum demean + sqrt(n_h/(n_h-1)) " + "small-sample correction analogous to the HAD sup-t " + "bootstrap, but the matching derivation for the joint " + "Stute functional has not been completed. Pweight-only " + "or PSU-only designs are supported; pre-process " + "stratified designs to remove the strata column or wait " + "for the derivation in a follow-up PR." + ) + # R5 P1: reject lonely_psu='adjust' singleton-strata designs + # explicitly (now redundant with the strata guard above; kept + # for defense in depth). + if _has_lonely_psu_adjust_singletons(resolved_for_boot): + raise NotImplementedError( + "stute_joint_pretest: SurveyDesign(lonely_psu='adjust') " + "with singleton strata is not yet supported on the " + "multiplier bootstrap. The bootstrap helper pools " + "singletons with nonzero multipliers but the matching " + "analytical variance target requires a pseudo-stratum " + "centering transform that has not been derived for the " + "Stute CvM. Use lonely_psu='remove' (drops singleton " + "contributions) or 'certainty' (zero-variance " + "singletons), or pre-process the panel to remove " + "singleton strata." + ) + # R3 P0: variance-unidentified survey-design guard (mirrors + # stute_test single-horizon). + df_survey = resolved_for_boot.df_survey + if df_survey is None or df_survey <= 0: + warnings.warn( + f"stute_joint_pretest: survey design is variance-" + f"unidentified (df_survey={df_survey}); the multiplier " + "bootstrap cannot calibrate the joint test. 
Returning " + "NaN result.", + UserWarning, + stacklevel=2, + ) + return StuteJointResult( + cvm_stat_joint=S_joint, + p_value=float("nan"), + reject=False, + alpha=float(alpha), + horizon_labels=horizon_labels, + per_horizon_stats=per_horizon_stats, + n_bootstrap=int(n_bootstrap), + n_obs=int(G), + n_horizons=int(K), + seed=None if seed is None else int(seed), + null_form=str(null_form), + exact_linear_short_circuited=False, + ) + psu_mults, psu_ids = generate_survey_multiplier_weights_batch( + n_bootstrap, resolved_for_boot, weight_type="mammen", rng=rng + ) + if resolved_for_boot.psu is None: + psu_col_idx = np.arange(G) + else: + psu_to_col = {int(p): c for c, p in enumerate(psu_ids)} + psu_arr = np.asarray(resolved_for_boot.psu) + psu_col_idx = np.array([psu_to_col[int(psu_arr[g])] for g in range(G)]) + w_sorted = w_arr[idx] + + for b in range(n_bootstrap): + eta_obs = psu_mults[b, psu_col_idx] # (G,) - shared across horizons + S_b = 0.0 + for k in horizon_labels: + dy_b = fitted_arrays[k] + residuals_arrays[k] * eta_obs + beta_b = XtX_inv_Xt @ dy_b + eps_b = dy_b - X @ beta_b + S_b += _cvm_statistic_weighted(eps_b[idx], d_sorted, w_sorted) + bootstrap_S[b] = S_b p_value = float((1.0 + np.sum(bootstrap_S >= S_joint)) / (n_bootstrap + 1)) reject = bool(p_value <= alpha) @@ -2327,6 +3034,110 @@ def stute_joint_pretest( ) +def _resolve_pretest_unit_weights( + data: pd.DataFrame, + unit_col: str, + weights: Optional[np.ndarray], + survey: Any, + caller_name: str, +) -> "tuple[Optional[np.ndarray], Optional[Any]]": + """Resolve per-row ``weights`` / ``survey`` kwargs to per-unit (G,) form. + + Used by ``joint_pretrends_test``, ``joint_homogeneity_test``, and + ``did_had_pretest_workflow`` (data-in entry points). Reuses the HAD + helpers ``_aggregate_unit_weights`` and ``_aggregate_unit_resolved_survey`` + which enforce constant-within-unit invariant on weights and on every + survey design column (strata, psu, fpc). 
+ + Mutex on ``weights`` AND ``survey`` (cannot supply both). Replicate- + weight survey designs raise ``NotImplementedError`` (deferred to a + parallel follow-up after Phase 4.5 C). + + Returns + ------- + (weights_unit, resolved_unit) : Tuple[Optional[np.ndarray], Optional[ResolvedSurveyDesign]] + - If neither kwarg supplied: ``(None, None)`` (unweighted path). + - If ``weights`` supplied: ``(weights_unit, None)``. + - If ``survey`` supplied: ``(None, resolved_unit)`` where + ``resolved_unit.weights`` is the per-unit weight vector and + strata/psu/fpc are also per-unit. + """ + if weights is None and survey is None: + return None, None + if weights is not None and survey is not None: + raise ValueError( + f"{caller_name}: pass survey= OR weights=, " "not both." + ) + if weights is not None: + weights_arr = np.asarray(weights, dtype=np.float64) + # R4 P1: validate 1D explicitly (column-vector inputs would otherwise + # broadcast through downstream computations and silently corrupt + # results). + if weights_arr.ndim != 1: + raise ValueError( + f"{caller_name}: weights must be 1-dimensional, got shape " + f"{weights_arr.shape}. (A common mistake is passing " + "df[['w']].to_numpy() which produces (N, 1); use " + "df['w'].to_numpy() for (N,).)" + ) + weights_unit = _aggregate_unit_weights(data, weights_arr, unit_col) + # R1 P0: strictly-positive weights required on the pweight shortcut + # (matches stute_test/yatchew_hr_test direct entry behavior; the CvM + # cusum + adjacent-difference variance assume all rows contribute). + if (weights_unit <= 0).any(): + raise ValueError( + f"{caller_name}: weights must be strictly positive at the " + "per-unit level. Zero / negative weights would leave units " + "in the variance/CvM computation while contributing zero " + "mass; use survey= with explicit lonely-PSU handling for " + "principled subpopulation analysis." 
+ ) + # R4 P0: normalize per-unit weights to mean=1 (matches + # SurveyDesign.resolve() convention so weights= and survey= entry + # paths produce identical statistic values; ensures Yatchew is + # scale-invariant under uniform rescaling). + weights_unit = weights_unit * (float(weights_unit.shape[0]) / float(np.sum(weights_unit))) + return weights_unit, None + # survey is not None + if not hasattr(survey, "resolve"): + raise TypeError( + f"{caller_name}: survey= must be a SurveyDesign instance " + f"(with .resolve()); got {type(survey).__name__}." + ) + resolved_full = survey.resolve(data) + if getattr(resolved_full, "replicate_weights", None) is not None: + raise NotImplementedError( + f"{caller_name}: replicate-weight survey designs (BRR/Fay/JK1/JKn/" + "SDR) are not yet supported on HAD pretests. Replicate-weight " + "pretests are a parallel follow-up after Phase 4.5 C." + ) + # R1 P1: pweight-only guard. aweight/fweight slip through pweight-only + # formulas silently otherwise (mirrors HeterogeneousAdoptionDiD.fit() at + # had.py:2976+ and survey._resolve_pweight_only at survey.py:914). + if getattr(resolved_full, "weight_type", "pweight") != "pweight": + raise ValueError( + f"{caller_name}: HAD pretests require weight_type='pweight'. " + f"Got weight_type={resolved_full.weight_type!r}. aweight / " + "fweight have different sandwich-variance semantics that are " + "not derived for the pretest variance components." + ) + resolved_unit = _aggregate_unit_resolved_survey(data, resolved_full, unit_col) + # R1 P0: strictly-positive weights at the per-unit level (mirrors the + # weights= shortcut). Zero per-unit weights leave units in the dose- + # variation check / CvM sum while contributing zero population mass, + # which can produce silently-wrong pretest decisions. + if (np.asarray(resolved_unit.weights) <= 0).any(): + raise ValueError( + f"{caller_name}: survey weights must be strictly positive at " + "the per-unit level. 
Zero / negative weights would leave units " + "in the variance/CvM computation while contributing zero " + "mass; this would produce silent wrong pretest decisions on " + "subpopulation-restricted designs. Pre-filter the panel to " + "the positive-weight subpopulation before calling the workflow." + ) + return None, resolved_unit + + def joint_pretrends_test( data: pd.DataFrame, outcome_col: str, @@ -2340,6 +3151,8 @@ def joint_pretrends_test( alpha: float = 0.05, n_bootstrap: int = 999, seed: Optional[int] = None, + weights: Optional[np.ndarray] = None, + survey: Any = None, ) -> StuteJointResult: """Joint Stute pre-trends test (paper Section 4.2 step 2). @@ -2376,6 +3189,18 @@ def joint_pretrends_test( handling follows the HAD contract (staggered auto-filter warns and proceeds on last cohort; solo cohort proceeds). alpha, n_bootstrap, seed : as in :func:`stute_test`. + weights : np.ndarray or None, keyword-only, default None + Per-row positive weights (Phase 4.5 C). Aggregated to per-unit + via :func:`diff_diff.had._aggregate_unit_weights` (constant- + within-unit invariant enforced). On staggered panels the + wrapper subsets ``weights`` to the surviving cohort BEFORE + aggregation. Mutually exclusive with ``survey``. + survey : SurveyDesign or None, keyword-only, default None + Survey design (Phase 4.5 C). Resolved on the filtered panel; + replicate-weight designs raise ``NotImplementedError``; + ``weight_type`` must be ``"pweight"``. Forwarded to + :func:`stute_joint_pretest` as a per-unit + ``ResolvedSurveyDesign``. Returns ------- @@ -2493,10 +3318,61 @@ def joint_pretrends_test( f"anchor." ) + # Phase 4.5 C: aggregate per-row weights/survey to per-unit (G,) + # using the existing HAD helpers (constant-within-unit invariant + # enforced; replicate-weight rejected on the survey path). + # R2 P1 fix: subset row-level `weights` to data_filtered's rows BEFORE + # resolution, mirroring did_had_pretest_workflow. 
When + # _validate_had_panel_event_study auto-filters to the last cohort + # under staggered timing, the original weights array no longer aligns + # with data_filtered's row count. Survey= path is unaffected + # (column references resolved internally on data_filtered). + weights_for_resolve = weights + if weights is not None: + # R9 P1: validate 1D + length-matched-to-data BEFORE any + # staggered-panel subsetting. Otherwise oversized arrays would + # be silently truncated and undersized arrays would surface raw + # NumPy indexing errors instead of the package's front-door + # ValueError. + weights_arr = np.asarray(weights, dtype=np.float64) + if weights_arr.ndim != 1: + raise ValueError( + f"joint_pretrends_test: weights must be 1-dimensional, got " + f"shape {weights_arr.shape}. (A common mistake is passing " + "df[['w']].to_numpy() which produces (N, 1); use " + "df['w'].to_numpy() for (N,).)" + ) + if weights_arr.shape[0] != len(data): + raise ValueError( + f"joint_pretrends_test: weights length {weights_arr.shape[0]} " + f"does not match data length {len(data)}." + ) + if len(data_filtered) != len(data): + pos_idx = data.index.get_indexer(data_filtered.index) + if (pos_idx < 0).any(): + raise ValueError( + "joint_pretrends_test: cannot align row-level weights to " + "the staggered-filtered panel; some data_filtered rows " + "do not appear in original data.index." + ) + weights_for_resolve = weights_arr[pos_idx] + else: + weights_for_resolve = weights_arr + weights_unit, resolved_unit = _resolve_pretest_unit_weights( + data_filtered, unit_col, weights_for_resolve, survey, "joint_pretrends_test" + ) + # Per-unit weight order already matches d_arr/dy_by_horizon ordering + # (no reorder needed): _aggregate_for_joint_test sorts the wide pivot + # by index (unit_col), so per-unit order is the SAME as + # _aggregate_unit_weights' output.
+ w_eff = resolved_unit.weights if resolved_unit is not None else weights_unit + residuals_by_horizon: Dict[str, np.ndarray] = {} fitted_by_horizon: Dict[str, np.ndarray] = {} for label, dy_t in dy_by_horizon.items(): - mean_t = float(dy_t.mean()) + if w_eff is None: + mean_t = float(dy_t.mean()) + else: + mean_t = float(np.sum(w_eff * dy_t) / np.sum(w_eff)) fitted_t = np.full(G, mean_t, dtype=np.float64) residuals_t = dy_t - fitted_t residuals_by_horizon[label] = residuals_t @@ -2513,6 +3389,8 @@ def joint_pretrends_test( n_bootstrap=n_bootstrap, seed=seed, null_form="mean_independence", + weights=weights_unit if resolved_unit is None else None, + survey=resolved_unit, ) @@ -2529,6 +3407,8 @@ def joint_homogeneity_test( alpha: float = 0.05, n_bootstrap: int = 999, seed: Optional[int] = None, + weights: Optional[np.ndarray] = None, + survey: Any = None, ) -> StuteJointResult: """Joint Stute homogeneity-linearity test (paper Section 4.3 joint). @@ -2560,6 +3440,14 @@ def joint_homogeneity_test( first_treat_col : str or None Forwarded to the underlying panel validator. alpha, n_bootstrap, seed : as in :func:`stute_test`. + weights : np.ndarray or None, keyword-only, default None + Per-row positive weights (Phase 4.5 C). See + :func:`joint_pretrends_test` for the contract; semantics are + identical (per-unit aggregation, staggered subsetting, + replicate-weight rejection). + survey : SurveyDesign or None, keyword-only, default None + Survey design (Phase 4.5 C). Same contract as + :func:`joint_pretrends_test`. Returns ------- @@ -2678,10 +3566,47 @@ def joint_homogeneity_test( f"zero-dose invariant)." ) + # Phase 4.5 C: aggregate weights/survey to per-unit; thread through. + # R2 P1 fix: subset row-level `weights` to data_filtered's rows BEFORE + # resolution, mirroring did_had_pretest_workflow / joint_pretrends_test + # for staggered last-cohort filtering. + # R9 P1 fix: validate 1D + length-matched-to-data BEFORE subsetting. 
+ weights_for_resolve = weights + if weights is not None: + weights_arr = np.asarray(weights, dtype=np.float64) + if weights_arr.ndim != 1: + raise ValueError( + f"joint_homogeneity_test: weights must be 1-dimensional, got " + f"shape {weights_arr.shape}." + ) + if weights_arr.shape[0] != len(data): + raise ValueError( + f"joint_homogeneity_test: weights length {weights_arr.shape[0]} " + f"does not match data length {len(data)}." + ) + if len(data_filtered) != len(data): + pos_idx = data.index.get_indexer(data_filtered.index) + if (pos_idx < 0).any(): + raise ValueError( + "joint_homogeneity_test: cannot align row-level weights to " + "the staggered-filtered panel; some data_filtered rows do " + "not appear in original data.index." + ) + weights_for_resolve = weights_arr[pos_idx] + else: + weights_for_resolve = weights_arr + weights_unit, resolved_unit = _resolve_pretest_unit_weights( + data_filtered, unit_col, weights_for_resolve, survey, "joint_homogeneity_test" + ) + w_eff = resolved_unit.weights if resolved_unit is not None else weights_unit + residuals_by_horizon: Dict[str, np.ndarray] = {} fitted_by_horizon: Dict[str, np.ndarray] = {} for label, dy_t in dy_by_horizon.items(): - a_hat, b_hat, residuals_t = _fit_ols_intercept_slope(d_arr, dy_t) + if w_eff is None: + a_hat, b_hat, residuals_t = _fit_ols_intercept_slope(d_arr, dy_t) + else: + a_hat, b_hat, residuals_t = _fit_weighted_ols_intercept_slope(d_arr, dy_t, w_eff) fitted_t = a_hat + b_hat * d_arr residuals_by_horizon[label] = residuals_t fitted_by_horizon[label] = fitted_t @@ -2697,11 +3622,121 @@ def joint_homogeneity_test( n_bootstrap=n_bootstrap, seed=seed, null_form="linearity", + weights=weights_unit if resolved_unit is None else None, + survey=resolved_unit, ) _VALID_AGGREGATES = ("overall", "event_study") +_QUG_DEFERRED_SUFFIX = ( + " (linearity-conditional verdict; QUG-under-survey deferred per Phase 4.5 C0)" +) + + +def _compose_verdict_overall_survey( + stute: Optional[StuteTestResults], + 
yatchew: Optional[YatchewTestResults], +) -> str: + """Build the overall-path :class:`HADPretestReport` verdict on the + survey/weights branch (Phase 4.5 C). + + Drops the QUG step from consideration (skipped per Phase 4.5 C0) + and composes the verdict from Stute + Yatchew alone, with the + linearity-conditional suffix appended in every branch. R7 P1 fix: + explicit survey-aware composer replaces the prior approach of + composing the unweighted verdict with a NaN QUG and string-replacing + the resulting "QUG NaN" suffix, which could leave pass cases starting + with "inconclusive". + + Priority (mirrors :func:`_compose_verdict` minus QUG): + 1. Conclusive rejections of Stute or Yatchew lead. + 2. No conclusive rejection but linearity inconclusive (both NaN) + -> "inconclusive - both Stute and Yatchew linearity tests NaN". + 3. Linearity conclusive (at least one of Stute/Yatchew finite) AND + no rejection -> fail-to-reject string. + All branches end with `_QUG_DEFERRED_SUFFIX`. + """ + stute_ok = stute is not None and bool(np.isfinite(stute.p_value)) + yatchew_ok = yatchew is not None and bool(np.isfinite(yatchew.p_value)) + stute_rej = stute_ok and bool(stute.reject) + yatchew_rej = yatchew_ok and bool(yatchew.reject) + + reasons = [] + if stute_rej: + reasons.append("linearity rejected - heterogeneity bias (Stute)") + if yatchew_rej: + reasons.append("linearity rejected - heterogeneity bias (Yatchew)") + + unresolved = [] + if not stute_ok: + unresolved.append("Stute NaN") + if not yatchew_ok: + unresolved.append("Yatchew NaN") + + if reasons: + verdict = "; ".join(reasons) + if unresolved: + verdict += "; additional steps unresolved: " + "; ".join(unresolved) + return verdict + _QUG_DEFERRED_SUFFIX + + # No rejections. + if not (stute_ok or yatchew_ok): + return "inconclusive - both Stute and Yatchew linearity tests NaN" + _QUG_DEFERRED_SUFFIX + + # At least one linearity test conclusive AND no rejection.
+ skipped_note = "" + if not stute_ok: + skipped_note = " (Stute NaN - skipped)" + elif not yatchew_ok: + skipped_note = " (Yatchew NaN - skipped)" + return ( + "Stute and Yatchew linearity diagnostics fail-to-reject" + + skipped_note + + _QUG_DEFERRED_SUFFIX + ) + + +def _compose_verdict_event_study_survey( + pretrends_joint: Optional[StuteJointResult], + homogeneity_joint: Optional[StuteJointResult], +) -> str: + """Event-study survey-path verdict (R7 P1 fix; mirrors + :func:`_compose_verdict_event_study` minus QUG).""" + pretrends_ok = pretrends_joint is not None and bool(np.isfinite(pretrends_joint.p_value)) + homogeneity_ok = homogeneity_joint is not None and bool(np.isfinite(homogeneity_joint.p_value)) + pretrends_rej = pretrends_joint is not None and pretrends_ok and bool(pretrends_joint.reject) + homogeneity_rej = ( + homogeneity_joint is not None and homogeneity_ok and bool(homogeneity_joint.reject) + ) + + reasons = [] + if pretrends_rej: + reasons.append("joint pre-trends rejected - assumption 7 violated (joint Stute)") + if homogeneity_rej: + reasons.append("joint linearity rejected - heterogeneity bias (joint Stute)") + + unresolved = [] + if pretrends_joint is None: + unresolved.append("joint pre-trends skipped (no earlier pre-period)") + elif not pretrends_ok: + unresolved.append("joint pre-trends NaN") + if homogeneity_joint is None: + unresolved.append("joint linearity skipped") + elif not homogeneity_ok: + unresolved.append("joint linearity NaN") + + if reasons: + verdict = "; ".join(reasons) + if unresolved: + verdict += "; additional steps unresolved: " + "; ".join(unresolved) + return verdict + _QUG_DEFERRED_SUFFIX + + if unresolved: + return "inconclusive - " + "; ".join(unresolved) + _QUG_DEFERRED_SUFFIX + + return "joint pre-trends and joint linearity diagnostics fail-to-reject" + _QUG_DEFERRED_SUFFIX + def did_had_pretest_workflow( data: pd.DataFrame, @@ -2769,11 +3804,16 @@ def did_had_pretest_workflow( aggregate : str, keyword-only, 
default ``"overall"`` Dispatch mode. Invalid values raise ``ValueError``. survey : SurveyDesign or None, keyword-only, default None - Currently rejected with ``NotImplementedError``. See - *Notes -- Survey/weighted data*. + Survey design for design-based pretest inference. Linearity-family + pretests use PSU-level Mammen multiplier bootstrap (Stute family) + and weighted OLS + weighted variance components (Yatchew). The QUG + step is skipped under survey with a ``UserWarning`` (permanent + deferral per Phase 4.5 C0). Replicate-weight designs raise + ``NotImplementedError``. Mutually exclusive with ``weights``. weights : np.ndarray or None, keyword-only, default None - Currently rejected with ``NotImplementedError``. See - *Notes -- Survey/weighted data*. + Per-row positive weights for the pweight shortcut. Mutually + exclusive with ``survey``. Routed through a synthetic trivial + ``ResolvedSurveyDesign`` so the same kernel handles both paths. Returns ------- @@ -2783,7 +3823,9 @@ def did_had_pretest_workflow( event-study path: ``pretrends_joint`` (``None`` if no earlier pre-period) and ``homogeneity_joint`` populated, ``stute`` / ``yatchew`` are ``None``. ``aggregate`` is recorded on the - report for serialization dispatch. + report for serialization dispatch. On the survey/weights path, + ``qug`` is ``None`` (Phase 4.5 C0 deferral); other components + populated as on the unweighted path. Raises ------ @@ -2792,34 +3834,46 @@ def did_had_pretest_workflow( non-None, or any downstream front-door failure (panel balance, dtype, dose invariant). NotImplementedError - If ``survey`` or ``weights`` is non-None. See - *Notes -- Survey/weighted data*. + If ``survey.replicate_weights is not None`` (replicate-weight + pretests deferred to a parallel follow-up after Phase 4.5 C). Notes ----- - Survey/weighted data: the workflow does not yet accept ``survey=`` / - ``weights=`` kwargs. Two reasons: - - 1. 
QUG-under-survey is **permanently deferred** (Phase 4.5 C0 - decision gate). Extreme-order-statistic tests are not smooth - functionals of the empirical CDF and have no off-the-shelf - survey-aware analog. See :func:`qug_test` Notes. - 2. Survey support for the linearity-family pretests is planned for - Phase 4.5 C, with mechanism varying by test: Rao-Wu rescaled - bootstrap for the Stute family (:func:`stute_test`, - :func:`stute_joint_pretest`, :func:`joint_pretrends_test`, - :func:`joint_homogeneity_test`) -- weighted multipliers + PSU - clustering in the bootstrap draw; weighted OLS residuals + - weighted variance estimator for :func:`yatchew_hr_test` (Yatchew - 1997 is a closed-form variance-ratio test, not bootstrap-based). - Until C ships, those sister pretests still raise bare - ``TypeError`` on ``survey=`` / ``weights=`` because their - signatures are closed (no kwargs added) -- adding rejection-only - kwargs in C0 then implementing in C is API churn for no user - benefit. - - Until Phase 4.5 C ships, run the workflow without ``survey`` / - ``weights`` kwargs and verify identification manually. + Survey/weighted data (Phase 4.5 C): under ``survey=`` or ``weights=``, + the workflow: + + 1. **Skips QUG** with a ``UserWarning`` and sets ``qug=None`` on the + report. QUG-under-survey is permanently deferred per Phase 4.5 C0; + extreme-order-statistic tests are not smooth functionals of the + empirical CDF and have no off-the-shelf survey-aware analog. See + :func:`qug_test` Notes for the full methodology rationale. + 2. **Runs the linearity family** with the survey-aware mechanism + (PSU-level Mammen multiplier bootstrap for Stute / joint variants; + weighted OLS + weighted variance components for Yatchew) routed + via the existing kernels. + 3. **Verdict** carries a ``"linearity-conditional verdict; QUG-under- + survey deferred per Phase 4.5 C0"`` suffix to remind callers that + admissibility is conditional on the linearity family alone. + 4. 
**`all_pass`** drops the QUG-conclusiveness gate (one less + precondition). The linearity-conditional rule splits by aggregate: + + - ``aggregate="overall"`` survey: ``True`` iff at least one of + Stute/Yatchew is conclusive AND no conclusive test rejects + (paper Section 4 step-3 "Stute OR Yatchew" wording). + - ``aggregate="event_study"`` survey: ``True`` iff + ``pretrends_joint`` is non-None and conclusive, + ``homogeneity_joint`` is conclusive, AND neither rejects. + Both joint variants must be conclusive on the event-study + path (same step-2 + step-3 closure as the unweighted + aggregate, just without the QUG step). + + Sister pretests are unchanged on the workflow path; direct callers + can also pass ``weights=`` / ``survey=`` to :func:`stute_test`, + :func:`yatchew_hr_test`, etc. (Phase 4.5 C extends each helper's + signature). Per-unit constant-within-unit invariant on weights / + strata / psu / fpc is enforced by the workflow via + :func:`diff_diff.had._aggregate_unit_weights` / + :func:`diff_diff.had._aggregate_unit_resolved_survey`. References ---------- @@ -2831,37 +3885,30 @@ def did_had_pretest_workflow( f"aggregate must be one of {list(_VALID_AGGREGATES)!r}; " f"got {aggregate!r}." ) - # Mutex on survey/weights, mirroring HeterogeneousAdoptionDiD.fit() - # at had.py:2890. + # Phase 4.5 C: survey/weights mutex + presence detection. R6 P1 fix: + # do NOT call _resolve_pretest_unit_weights on the FULL panel here -- + # under aggregate='event_study' the panel may be staggered and the + # cohort filter at _validate_multi_period_panel can drop units. If + # those dropped units have zero/invalid weights, eager full-panel + # resolution would abort an otherwise-valid event-study run. Defer + # resolution to the per-aggregate branches: overall path resolves on + # the original data (no filtering); event-study path lets the joint + # wrappers handle resolution on data_filtered. 
if survey is not None and weights is not None: raise ValueError( - "Pass survey= OR weights=, not both. " - "did_had_pretest_workflow does not yet accept either kwarg " - "(Phase 4.5 C0 + Phase 4.5 C); see the NotImplementedError " - "below for the methodology rationale." + "did_had_pretest_workflow: pass survey= OR " "weights=, not both." ) + use_survey_path = (survey is not None) or (weights is not None) - # Phase 4.5 C0 decision gate (workflow surface). QUG-under-survey is - # permanently deferred; the linearity-family pretests are deferred to - # Phase 4.5 C. Until C ships, the workflow has no survey-aware - # dispatch and rejects the kwargs at the front door. - if survey is not None or weights is not None: - raise NotImplementedError( - "did_had_pretest_workflow does not yet accept survey= / " - "weights= kwargs.\n" - "\n" - "QUG-under-survey is permanently deferred (extreme-value " - "theory under complex sampling is not a settled toolkit; see " - "qug_test docstring for the methodology rationale). Survey " - "support for the linearity-family pretests is planned for " - "Phase 4.5 C, with mechanism varying by test: Rao-Wu " - "rescaled bootstrap for the Stute family (stute_test, " - "stute_joint_pretest, joint_pretrends_test, " - "joint_homogeneity_test); weighted OLS residuals + weighted " - "variance estimator for yatchew_hr_test (Yatchew 1997 is a " - "closed-form variance-ratio test, not bootstrap-based). " - "Until that ships, run the workflow without survey/weights " - "kwargs and verify identification manually." + if use_survey_path: + # Phase 4.5 C0 deferral surface: skip QUG with educational warning. + warnings.warn( + "did_had_pretest_workflow: QUG step skipped under survey/weights " + "(permanently deferred per Phase 4.5 C0; extreme-value theory " + "under complex sampling is not a settled toolkit). 
Verdict " + "reflects the linearity family only ('linearity-conditional').", + UserWarning, + stacklevel=2, ) if aggregate == "event_study": @@ -2886,13 +3933,58 @@ def did_had_pretest_workflow( # Step 1: QUG on dose distribution at F. Doses are # time-invariant in HAD, so D_g at F equals max_t D_{g,t}. + # Phase 4.5 C: skipped under survey/weights (qug_res = None). doses_at_F = ( data_filtered.loc[data_filtered[time_col] == F, [unit_col, dose_col]] .set_index(unit_col) .sort_index()[dose_col] .to_numpy(dtype=np.float64) ) - qug_res = qug_test(doses_at_F, alpha=alpha) + qug_res = None if use_survey_path else qug_test(doses_at_F, alpha=alpha) + + # Phase 4.5 C: forward weights/survey to the joint helpers. The + # data-in wrappers handle their own per-row → per-unit aggregation + # via _resolve_pretest_unit_weights internally on `data_filtered`. + # R1 P1 fix: subset row-level `weights` to data_filtered's rows + # BEFORE passing through. Otherwise on staggered panels (where + # _validate_multi_period_panel auto-filters to last cohort), + # the wrappers would call _aggregate_unit_weights(data_filtered, + # weights[full_panel_length], ...) and crash on length mismatch. + # Mirrors HeterogeneousAdoptionDiD.fit()'s positional-index + # subsetting via `data.index.get_indexer(data_filtered.index)`. + # `survey=` carries column references resolved internally on + # data_filtered, so no subsetting needed there. + if use_survey_path and weights is not None: + # R9 P1: validate 1D + length-matched-to-data BEFORE staggered- + # panel subsetting. Otherwise oversized arrays would be + # silently truncated and undersized arrays would surface raw + # NumPy indexing errors. + weights_arr = np.asarray(weights, dtype=np.float64) + if weights_arr.ndim != 1: + raise ValueError( + "did_had_pretest_workflow: weights must be 1-dimensional, " + f"got shape {weights_arr.shape}. 
(A common mistake is " + "passing df[['w']].to_numpy() which produces (N, 1); " + "use df['w'].to_numpy() for (N,).)" + ) + if weights_arr.shape[0] != len(data): + raise ValueError( + f"did_had_pretest_workflow: weights length " + f"{weights_arr.shape[0]} does not match data length " + f"{len(data)}." + ) + pos_idx = data.index.get_indexer(data_filtered.index) + if (pos_idx < 0).any(): + raise ValueError( + "did_had_pretest_workflow: cannot align row-level " + "weights to the staggered-filtered panel " + "(some data_filtered rows do not appear in original " + "data.index). This is a bug; please report." + ) + joint_weights = weights_arr[pos_idx] + else: + joint_weights = None + joint_survey = survey if use_survey_path and survey is not None else None # Step 2: joint pre-trends on earlier pre-periods (those # strictly before base_period). If only the base pre-period is @@ -2920,6 +4012,8 @@ def did_had_pretest_workflow( alpha=alpha, n_bootstrap=n_bootstrap, seed=seed, + weights=joint_weights, + survey=joint_survey, ) else: pretrends_joint = None @@ -2937,26 +4031,42 @@ def did_had_pretest_workflow( alpha=alpha, n_bootstrap=n_bootstrap, seed=seed, + weights=joint_weights, + survey=joint_survey, ) - # Event-study `all_pass`: True iff every implemented step is - # conclusive AND none reject. `pretrends_joint` must exist - # (cannot be None) for the step-2 gap to be closed. Uses - # `np.isfinite(p_value)` per Phase 3 convention (no - # `.conclusive()` helper on result dataclasses). - qug_ok = bool(np.isfinite(qug_res.p_value)) + # Event-study `all_pass`. On the unweighted path, every implemented + # step must be conclusive AND none reject (Phase 3 convention). On + # the survey/weights path, drop the QUG-conclusiveness condition + # (qug=None per Phase 4.5 C0 deferral); admissibility becomes + # linearity-conditional. 
pretrends_ok = pretrends_joint is not None and bool(np.isfinite(pretrends_joint.p_value)) homogeneity_ok = bool(np.isfinite(homogeneity_joint.p_value)) - all_pass = bool( - qug_ok - and pretrends_ok - and pretrends_joint is not None - and not pretrends_joint.reject - and homogeneity_ok - and not homogeneity_joint.reject - and not qug_res.reject - ) - verdict = _compose_verdict_event_study(qug_res, pretrends_joint, homogeneity_joint) + if use_survey_path: + all_pass = bool( + pretrends_ok + and pretrends_joint is not None + and not pretrends_joint.reject + and homogeneity_ok + and not homogeneity_joint.reject + ) + # R7 P1 fix: explicit survey-aware verdict composer instead + # of post-processing the unweighted-verdict output (the + # previous string-replace approach could leave pass cases + # starting with "inconclusive" even when all_pass=True). + verdict = _compose_verdict_event_study_survey(pretrends_joint, homogeneity_joint) + else: + qug_ok = bool(np.isfinite(qug_res.p_value)) + all_pass = bool( + qug_ok + and pretrends_ok + and pretrends_joint is not None + and not pretrends_joint.reject + and homogeneity_ok + and not homogeneity_joint.reject + and not qug_res.reject + ) + verdict = _compose_verdict_event_study(qug_res, pretrends_joint, homogeneity_joint) return HADPretestReport( qug=qug_res, @@ -2971,7 +4081,9 @@ def did_had_pretest_workflow( aggregate="event_study", ) - # aggregate == "overall" - Phase 3 behavior, unchanged. + # aggregate == "overall" - Phase 3 behavior on the unweighted path + # (bit-exact regression preserved); Phase 4.5 C survey path skips QUG + # and dispatches stute / yatchew with weights=/survey=. 
t_pre, t_post = _validate_had_panel( data, outcome_col, dose_col, time_col, unit_col, first_treat_col ) @@ -2986,9 +4098,33 @@ def did_had_pretest_workflow( cluster_col=None, # pretests do not use cluster-robust SE ) - qug_res = qug_test(d_arr, alpha=alpha) - stute_res = stute_test(d_arr, dy_arr, alpha=alpha, n_bootstrap=n_bootstrap, seed=seed) - yatchew_res = yatchew_hr_test(d_arr, dy_arr, alpha=alpha) + # R6 P1 fix: resolve weights/survey HERE (overall path operates on + # the original data; no cohort filter to interact with). + weights_unit, resolved_unit = _resolve_pretest_unit_weights( + data, unit_col, weights, survey, "did_had_pretest_workflow" + ) + + qug_res = None if use_survey_path else qug_test(d_arr, alpha=alpha) + # Forward weights/survey to per-test calls. The data-in workflow has + # already aggregated to per-unit (weights_unit / resolved_unit); the + # _aggregate_first_difference call above also collapses to per-unit + # (one row per unit), so weights_unit and resolved_unit are aligned. + stute_res = stute_test( + d_arr, + dy_arr, + alpha=alpha, + n_bootstrap=n_bootstrap, + seed=seed, + weights=weights_unit if resolved_unit is None else None, + survey=resolved_unit, + ) + yatchew_res = yatchew_hr_test( + d_arr, + dy_arr, + alpha=alpha, + weights=weights_unit if resolved_unit is None else None, + survey=resolved_unit, + ) # `all_pass` must be conclusive under the paper's four-step workflow # (step 1 QUG + step 3 linearity via Stute OR Yatchew): @@ -3001,11 +4137,18 @@ def did_had_pretest_workflow( # - No conclusive test may reject. NaN-p tests have reject=False by # convention, so the OR across `.reject` naturally counts only # the conclusive rejections. - qug_conclusive = bool(np.isfinite(qug_res.p_value)) + # On the survey path, drop QUG conclusiveness (qug=None per C0 deferral). 
linearity_conclusive = bool(np.isfinite(stute_res.p_value) or np.isfinite(yatchew_res.p_value)) - any_reject = qug_res.reject or stute_res.reject or yatchew_res.reject - all_pass = bool(qug_conclusive and linearity_conclusive and not any_reject) - verdict = _compose_verdict(qug_res, stute_res, yatchew_res) + if use_survey_path: + any_reject = stute_res.reject or yatchew_res.reject + all_pass = bool(linearity_conclusive and not any_reject) + # R7 P1 fix: explicit survey-aware verdict composer. + verdict = _compose_verdict_overall_survey(stute_res, yatchew_res) + else: + qug_conclusive = bool(np.isfinite(qug_res.p_value)) + any_reject = qug_res.reject or stute_res.reject or yatchew_res.reject + all_pass = bool(qug_conclusive and linearity_conclusive and not any_reject) + verdict = _compose_verdict(qug_res, stute_res, yatchew_res) return HADPretestReport( qug=qug_res, diff --git a/diff_diff/survey.py b/diff_diff/survey.py index d7bf7121..155bce05 100644 --- a/diff_diff/survey.py +++ b/diff_diff/survey.py @@ -678,6 +678,45 @@ def needs_survey_vcov(self) -> bool: return True # Any resolved survey design uses the survey vcov path +def _make_trivial_resolved(weights: np.ndarray) -> "ResolvedSurveyDesign": + """Construct a trivial pweight-only ResolvedSurveyDesign (no strata/PSU/FPC). + + Used by survey-aware code paths invoked via a bare per-row ``weights`` + array (the pweight shortcut). Routing through this synthetic resolved + design lets the same bootstrap / variance kernel handle both the + ``weights=`` shortcut and the full ``survey=SurveyDesign(...)`` path + uniformly. Mirrors the PR #363 synthetic-trivial-resolved pattern that + fixed sup-t under the ``weights=`` shortcut on + ``HeterogeneousAdoptionDiD.fit()``. + + Parameters + ---------- + weights : np.ndarray, shape (n_obs,) + Per-observation positive weights. Caller is responsible for any + non-negativity / per-unit-constancy validation. 
+ + Returns + ------- + ResolvedSurveyDesign + With ``weight_type="pweight"``, ``strata=psu=fpc=None``, + ``n_strata=0``, ``n_psu=n_obs`` (each observation is its own PSU + under the trivial design), ``lonely_psu="remove"``, + ``replicate_weights=None``. + """ + w = np.asarray(weights, dtype=np.float64) + n_obs = int(w.shape[0]) + return ResolvedSurveyDesign( + weights=w, + weight_type="pweight", + strata=None, + psu=None, + fpc=None, + n_strata=0, + n_psu=n_obs, + lonely_psu="remove", + ) + + @dataclass class SurveyMetadata: """ @@ -1338,17 +1377,17 @@ class _PsuScaffolding: variance_computable: bool legitimate_zero_count: int # stratified-mode fields (None in other modes): - psu_codes: Optional[np.ndarray] = None # (n,) int, global PSU id 0..P-1 - psu_stratum: Optional[np.ndarray] = None # (P,) int, stratum of each PSU + psu_codes: Optional[np.ndarray] = None # (n,) int, global PSU id 0..P-1 + psu_stratum: Optional[np.ndarray] = None # (P,) int, stratum of each PSU n_psu_per_stratum: Optional[np.ndarray] = None # (S,) int - singleton_strata: Optional[np.ndarray] = None # (S,) bool - adjustment_h: Optional[np.ndarray] = None # (S,) float, (1-f_h)*n_h/(n_h-1); 0 for singletons + singleton_strata: Optional[np.ndarray] = None # (S,) bool + adjustment_h: Optional[np.ndarray] = None # (S,) float, (1-f_h)*n_h/(n_h-1); 0 for singletons # psu_only-mode fields (None in other modes): - psu_codes_only: Optional[np.ndarray] = None # (n,) int, PSU id 0..P-1 + psu_codes_only: Optional[np.ndarray] = None # (n,) int, PSU id 0..P-1 n_psu_only: Optional[int] = None - adjustment_only: Optional[float] = None # (1-f)*n_psu/(n_psu-1) or 0 + adjustment_only: Optional[float] = None # (1-f)*n_psu/(n_psu-1) or 0 # no_strata_no_psu-mode fields (None in other modes): - adjustment_direct: Optional[float] = None # (1-f)*n/(n-1) or 0 + adjustment_direct: Optional[float] = None # (1-f)*n/(n-1) or 0 def _precompute_psu_scaffolding(resolved: "ResolvedSurveyDesign") -> _PsuScaffolding: @@ -1530,9 
+1569,7 @@ def _precompute_psu_scaffolding(resolved: "ResolvedSurveyDesign") -> _PsuScaffol # - Under "adjust", any singleton stratum also counts (adds V_h even if 0). has_non_singleton = bool(np.any(~singleton_strata)) has_singleton = bool(np.any(singleton_strata)) - variance_computable = has_non_singleton or ( - lonely_psu == "adjust" and has_singleton - ) + variance_computable = has_non_singleton or (lonely_psu == "adjust" and has_singleton) return _PsuScaffolding( mode="stratified", @@ -1610,9 +1647,7 @@ def _finalize(meat_scalar: float) -> float: psu_sums = np.bincount(scaffolding.psu_codes, weights=psi, minlength=P) sum_by_h = np.bincount(scaffolding.psu_stratum, weights=psu_sums, minlength=S) - sum2_by_h = np.bincount( - scaffolding.psu_stratum, weights=psu_sums * psu_sums, minlength=S - ) + sum2_by_h = np.bincount(scaffolding.psu_stratum, weights=psu_sums * psu_sums, minlength=S) with np.errstate(divide="ignore", invalid="ignore"): centered_ss = np.where( diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md index d88e38e5..e4424dc1 100644 --- a/docs/methodology/REGISTRY.md +++ b/docs/methodology/REGISTRY.md @@ -2425,12 +2425,28 @@ Tuning-parameter-free test of `H_0: d̲ = 0` versus `H_1: d̲ > 0`. Shipped in ` 4. Theorem 4 establishes: asymptotic size `α`; uniform consistency against fixed alternatives; local power at rate `G` on the class `F^{d̲,d̄}_{m,K}` of differentiable cdfs with positive density and Lipschitz derivative. 5. Li et al. (2024, Theorem 2.4) implies the QUG test is asymptotically independent of the WAS / TWFE estimator, so conditional inference on WAS given non-rejection does not distort inference (asymptotically; the paper's Footnote 8 notes the extension to triangular arrays is conjectured but not proven). - **Note:** Implementation is `O(G)` via `np.partition`; no sort required. 
-- **Note (Phase 4.5 C0):** `qug_test(..., survey=...)` and `qug_test(..., weights=...)` raise `NotImplementedError` permanently (Phase 4.5 C0 decision gate, 2026-04). The same gate applies to `did_had_pretest_workflow(..., survey=...)` / `weights=`. Three reasons survey extension is genuinely hard, not "we just haven't done the lit review": +- **Note (Phase 4.5 C0):** `qug_test(..., survey=...)` and `qug_test(..., weights=...)` raise `NotImplementedError` **permanently** (Phase 4.5 C0 decision gate, 2026-04 -- direct-helper gate is permanent). The Phase 4.5 C0 release also gated `did_had_pretest_workflow(..., survey=...)` / `weights=` with `NotImplementedError`, but that workflow gate was **temporary**: Phase 4.5 C (PR #370, 2026-04) replaces it with functional dispatch that skips the QUG step with `UserWarning` and runs the linearity family with the survey-aware mechanism (see Note (Phase 4.5 C) below for the full algorithm). Direct callers of `qug_test` still get the permanent rejection. Three reasons QUG-under-survey is genuinely hard, not "we just haven't done the lit review": 1. **Extreme order statistics are not smooth functionals of the empirical CDF.** Standard survey machinery (Binder-TSL linearization via `compute_survey_if_variance`, Rao-Wu rescaled bootstrap via `bootstrap_utils.generate_rao_wu_weights`, Krieger-Pfeffermann (1997) EDF tests for complex surveys) all rely on Hadamard differentiability of the test statistic in the empirical CDF. The first two order statistics are NOT differentiable functionals — small perturbations to F near zero produce O(1) shifts in `D_{(1)}`. None of the standard survey-bootstrap or linearization tools give a calibrated test for QUG. 2. **The `Exp(1)/Exp(1)` limit law assumes iid sampling with smooth density at zero.** Under cluster sampling, `D_{(1)}` and `D_{(2)}` may both come from the same PSU, breaking the independence required for the Poisson-process limit of rescaled spacings near the boundary. 
Under stratification, the smallest dose may come from a small stratum that's systematically over- or under-sampled, biasing the test. 3. **The literature on EVT under unequal-probability sampling is sparse.** Quintos et al. (2001) and Beirlant et al. cover tail-INDEX estimation under unequal sample sizes. There is no off-the-shelf method for "test the support endpoint under complex sampling" in the standard survey-statistics toolkit. Adapting Hill / Pickands / DEdH estimators to the boundary problem would be novel research, not engineering. The de Chaisemartin et al. (2026) paper itself does not discuss survey extensions of QUG. - The survey-compatible alternative for HAD pretesting is **joint Stute** (a CvM cusum of regression residuals) — a smooth functional of the empirical CDF for which Krieger-Pfeffermann (1997) + Rao-Wu rescaled bootstrap give a calibrated survey-aware test. Phase 4.5 C ships survey support for the linearity family with mechanism varying by test: Rao-Wu rescaled bootstrap for `stute_test` and the joint variants (`stute_joint_pretest`, `joint_pretrends_test`, `joint_homogeneity_test`); weighted OLS residuals + weighted variance estimator for `yatchew_hr_test` (Yatchew 1997 is a closed-form variance-ratio test, not bootstrap-based). + The survey-compatible alternative for HAD pretesting is **joint Stute** (a CvM cusum of regression residuals) — a smooth functional of the empirical CDF for which Krieger-Pfeffermann (1997) + a survey-aware multiplier bootstrap give a calibrated test. Phase 4.5 C (PR #370) ships survey support for the linearity family — the **PSU-level Mammen multiplier bootstrap** for `stute_test` and the joint variants (NOT Rao-Wu rescaling — multiplier bootstrap is a different mechanism), and **closed-form weighted OLS + pweight-sandwich variance components** for `yatchew_hr_test`. See the dedicated Note (Phase 4.5 C) below for the full algorithm. 
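The multiplier-bootstrap mechanism named above can be sketched end to end in plain NumPy. This is an illustrative single-horizon reconstruction under stated assumptions, not the library's implementation: the function names here are hypothetical (the real kernel is `generate_survey_multiplier_weights_batch` plus `_cvm_statistic_weighted`), the weighted-CvM form is one plausible choice, and the joint variants additionally share one multiplier matrix across horizons, which this sketch omits. The Mammen two-point distribution and the perturbation form `dy_b = fitted + eps * eta_obs` follow the Note (Phase 4.5 C) below.

```python
import numpy as np

# Mammen two-point multipliers: mean 0, variance 1, third moment 1.
_A = -(np.sqrt(5) - 1) / 2                   # drawn with prob (sqrt(5)+1)/(2*sqrt(5))
_B = (np.sqrt(5) + 1) / 2                    # drawn with prob (sqrt(5)-1)/(2*sqrt(5))
_P_A = (np.sqrt(5) + 1) / (2 * np.sqrt(5))

def mammen_psu_multipliers(n_bootstrap, n_psu, rng):
    """(n_bootstrap, n_psu) matrix of iid Mammen multipliers, one per PSU."""
    u = rng.random((n_bootstrap, n_psu))
    return np.where(u < _P_A, _A, _B)

def weighted_cvm(doses, resid, w):
    """One plausible weighted Stute CvM: cumulative weighted-residual
    process in dose order, integrated against the weighted empirical
    measure. Illustrative only; `_cvm_statistic_weighted` may differ."""
    order = np.argsort(doses)
    r, wo = resid[order], w[order]
    sw = wo.sum()
    cusum = np.cumsum(wo * r) / np.sqrt(sw)  # weighted cusum at each order stat
    return np.sum(wo * cusum**2) / sw

def stute_pvalue_psu_bootstrap(doses, dy, w, psu, n_bootstrap=499, seed=0):
    """PSU-level Mammen multiplier bootstrap p-value for a weighted Stute test."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.ones_like(doses), doses])
    WX = w[:, None] * X
    beta = np.linalg.solve(X.T @ WX, X.T @ (w * dy))   # weighted OLS fit
    fitted = X @ beta
    eps = dy - fitted
    stat = weighted_cvm(doses, eps, w)
    psu_codes = np.unique(psu, return_inverse=True)[1]
    eta = mammen_psu_multipliers(n_bootstrap, psu_codes.max() + 1, rng)
    exceed = 0
    for b in range(n_bootstrap):
        eta_obs = eta[b, psu_codes]          # broadcast PSU multiplier to rows
        dy_b = fitted + eps * eta_obs        # multipliers on UNWEIGHTED residuals
        beta_b = np.linalg.solve(X.T @ WX, X.T @ (w * dy_b))
        eps_b = dy_b - X @ beta_b            # weighted OLS refit per replicate
        exceed += weighted_cvm(doses, eps_b, w) >= stat
    return stat, (1 + exceed) / (1 + n_bootstrap)
```

Note that the weighting enters only through the OLS refit and the CvM recompute, never through the perturbation step itself, matching the wild-bootstrap form described below.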
**Research direction (out of scope for diff-diff):** the bridge IS sketchable by combining (a) endpoint-estimation EVT under iid (Hall 1982, Aarssen-de Haan 1994, Hall-Wang 1999, Beirlant-de Wet-Goegebeur 2006); (b) survey-aware functional CLT for the empirical process (Boistard-Lopuhaä-Ruiz-Gazen 2017, Bertail-Chautru-Clémençon 2017); and (c) tail-empirical-process theory (Drees 2003) to define a "design-effective boundary intensity" `λ_eff = Σ_h W_h · f_h(0+)`. Under a "no boundary clumping" assumption (`P(D_{(1)}, D_{(2)}` in same PSU `| both ≤ δ) → 0`), the `Exp(1)/Exp(1)` limit law's pivotality is preserved and only the calibration needs a survey-aware bootstrap (subsampling within strata per Politis-Romano-Wolf, or Bertail et al.'s design-aware bootstrap). This is publishable methodology research — one paper, ~6-12 months for a methods PhD student. If the bridge gets built and published externally, this gate can be revisited. +- **Note (Phase 4.5 C):** `stute_test`, `yatchew_hr_test`, `stute_joint_pretest`, `joint_pretrends_test`, `joint_homogeneity_test`, and `did_had_pretest_workflow` accept `weights=` and `survey=ResolvedSurveyDesign` kwargs (or `survey=SurveyDesign` for the data-in entries). Mechanism varies by test: + - **Stute family** (`stute_test`, `stute_joint_pretest`, joint wrappers) uses **PSU-level Mammen multiplier bootstrap** via `bootstrap_utils.generate_survey_multiplier_weights_batch` (the same kernel as PR #363's HAD event-study sup-t bootstrap). Each replicate draws an `(n_bootstrap, n_psu)` Mammen multiplier matrix; multipliers broadcast to per-obs perturbation `eta_obs[g] = eta_psu[psu(g)]`. The bootstrap residual perturbation is `dy_b = fitted + eps * eta_obs` (paper Appendix D wild-bootstrap form — multipliers attach to UNWEIGHTED residuals; the weighting flows through the OLS refit + the weighted CvM, NOT through the perturbation step). 
Followed by weighted OLS refit (`_fit_weighted_ols_intercept_slope`) and weighted CvM recompute via `_cvm_statistic_weighted`. Joint Stute SHARES the multiplier matrix across horizons within each replicate, preserving both the vector-valued empirical-process unit-level dependence (Delgado 1993; Escanciano 2006) AND PSU clustering (Krieger-Pfeffermann 1997). PSU-shared multipliers are conservative under no-within-PSU outcome correlation (over-clustering gives conservative size in finite samples), asymptotically correct under the standard survey assumption that PSU is the ultimate sampling unit AND outcomes correlate within PSU. The pweight `weights=` shortcut routes through a synthetic trivial `ResolvedSurveyDesign` (constructed via `survey._make_trivial_resolved`) so the kernel is shared across both entry paths. NOT "Rao-Wu rescaled bootstrap" — different mechanism (the Rao-Wu kernel rescales per-unit weights via stratified PSU resampling, while this kernel applies multipliers without resampling). + - **Yatchew** (`yatchew_hr_test`) uses **closed-form weighted OLS + pweight-sandwich variance components** (no bootstrap). All three components reduce bit-exactly to the unweighted formulas at `w=ones(G)` (locked at `atol=1e-14` in `TestYatchewHRTestSurvey::test_weighted_reduces_to_unweighted_at_uniform_weights`): + - `sigma2_lin = sum(w * eps^2) / sum(w)` (weighted OLS residual variance). + - `sigma2_diff = sum(w_avg * (dy_g - dy_{g-1})^2) / (2 * sum(w))` with arithmetic-mean pair weights `w_avg_g = (w_g + w_{g-1})/2`. Divisor uses `sum(w)` (=G at `w=1`), NOT `sum(w_avg)`, to match the existing `(1/(2G))` unweighted formula at `had_pretests.py:1635`. + - `sigma4_W = sum(w_avg * eps_g^2 * eps_{g-1}^2) / sum(w_avg)` reduces to `(1/(G-1)) * sum(prod)` at `w=1`. + - `T_hr = sqrt(sum(w)) * (sigma2_lin - sigma2_diff) / sigma2_W` (effective-sample-size convention; reduces to `sqrt(G)` at `w=1`). 
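The weighted Yatchew components listed above can be written down directly. A minimal sketch, not the library's implementation: it assumes units are processed in dose order (the Yatchew adjacent-differencing convention) and that `sigma2_W` in the `T_hr` formula denotes `sqrt(sigma4_W)`, an assumption flagged in the comments.

```python
import numpy as np

def yatchew_weighted_components(d, dy, w):
    """Weighted Yatchew HR variance components; illustrative sketch.
    At w = ones(G) these reduce to the unweighted formulas."""
    order = np.argsort(d)                    # adjacent differences in dose order
    d, dy, w = d[order], dy[order], w[order]
    sw = w.sum()
    # Weighted OLS of dy on [1, d].
    X = np.column_stack([np.ones_like(d), d])
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * dy))
    eps = dy - X @ beta
    sigma2_lin = np.sum(w * eps**2) / sw
    w_avg = 0.5 * (w[1:] + w[:-1])           # arithmetic-mean pair weights
    sigma2_diff = np.sum(w_avg * np.diff(dy)**2) / (2.0 * sw)
    sigma4_W = np.sum(w_avg * eps[1:]**2 * eps[:-1]**2) / w_avg.sum()
    # Assumption: sigma2_W := sqrt(sigma4_W) in the T_hr formula.
    t_hr = np.sqrt(sw) * (sigma2_lin - sigma2_diff) / np.sqrt(sigma4_W)
    return sigma2_lin, sigma2_diff, sigma4_W, t_hr
```

At `w = ones(G)` this gives `sigma2_diff = sum(diff^2)/(2G)` and `sigma4_W = sum(prod)/(G-1)`, the reduction invariant the registry locks at `atol=1e-14`.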
+ Strictly positive weights required (the adjacent-difference variance is undefined under contiguous-zero blocks). PSU clustering is NOT propagated through the variance-ratio statistic (would require a survey-aware variance-of-variance estimator, out of scope). Pair-weight convention follows Krieger-Pfeffermann (1997, §3) for design-consistent inference on smooth functionals. + - **Workflow** (`did_had_pretest_workflow`) under `survey=` / `weights=`: skips the QUG step with a `UserWarning` (per Phase 4.5 C0 deferral), sets `qug=None` on the report, and dispatches the linearity family with the survey-aware mechanism. Verdict carries a `"linearity-conditional verdict; QUG-under-survey deferred per Phase 4.5 C0"` suffix. `all_pass` drops the QUG-conclusiveness condition; the linearity-conditional rule splits by aggregate: + - `aggregate="overall"`: `True` iff at least one of `stute`/`yatchew` is conclusive AND no conclusive test rejects (paper Section 4 step-3 "Stute OR Yatchew" wording carries through). + - `aggregate="event_study"`: `True` iff `pretrends_joint` is non-None and conclusive, `homogeneity_joint` is conclusive, AND neither rejects. Both joint variants must be conclusive on the event-study path (same step-2 + step-3 closure as the unweighted aggregate, just without the QUG step). + - **Replicate-weight survey designs (BRR/Fay/JK1/JKn/SDR) deferred** to a parallel follow-up. Each helper raises `NotImplementedError` on `survey.replicate_weights is not None` (defense in depth: workflow + every direct-helper entry rejects, mirroring the reciprocal-guard discipline from PR #346). The per-replicate weight-ratio rescaling for the OLS-on-residuals refit step is not covered by the multiplier-bootstrap composition above. + - **`lonely_psu='adjust'` with singleton strata is rejected** with `NotImplementedError` on the Stute family (mirrors HAD sup-t bootstrap at `had.py:2081-2118`). 
The bootstrap multiplier helper pools singleton strata into a pseudo-stratum with nonzero multipliers, but the analytical variance target requires a pseudo-stratum centering transform that has not been derived for the Stute CvM. Use `lonely_psu='remove'` (drops singleton contributions) or `'certainty'` (zero-variance singletons); both produce all-zero singleton multipliers that match a well-defined analytical target. Variance-unidentified designs (`df_survey <= 0` after the adjust+singleton case is handled) return `NaN` with a `UserWarning` (single-PSU unstratified or one-PSU-per-stratum under remove/certainty). + - **Stratified designs (`SurveyDesign(strata=...)`) are rejected** with `NotImplementedError` on `stute_test` and `stute_joint_pretest` (and propagate to `did_had_pretest_workflow`). The HAD sup-t bootstrap (had.py:2120+) applies a within-stratum demean + `sqrt(n_h/(n_h-1))` small-sample correction AFTER `generate_survey_multiplier_weights_batch` to make the bootstrap variance match the Binder-TSL stratified target. That same correction has NOT been derived for the Stute CvM functional, so applying the helper's raw multipliers directly to residual perturbations on stratified designs would leave the bootstrap p-value silently miscalibrated. Phase 4.5 C narrows Stute-family survey support to **pweight-only** (no strata, no PSU), **PSU-only** (`SurveyDesign(weights=, psu=)`), and **FPC-only** (`SurveyDesign(weights=, fpc=)`) designs. Stratified designs are a follow-up after the matching Stute-CvM stratified-correction derivation lands. + - **Constant-within-unit invariant**: per-row `weights=` / `survey=col` are aggregated to per-unit `(G,)` arrays via the existing HAD helpers `_aggregate_unit_weights` / `_aggregate_unit_resolved_survey` (had.py:1604, :1671); these enforce constant-within-unit invariant on weights and on every survey design column (strata, psu, fpc) and raise on violation. 
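The constant-within-unit aggregation just described can be sketched as follows. The function name here is hypothetical (the real helpers are `_aggregate_unit_weights` / `_aggregate_unit_resolved_survey` at the had.py locations cited above); this shows only the shape of the invariant check, not the library's code.

```python
import numpy as np
import pandas as pd

def aggregate_unit_weights(units, weights):
    """Collapse per-row weights to one weight per unit, enforcing the
    HAD constant-within-unit invariant. Hypothetical sketch only."""
    df = pd.DataFrame({"unit": units, "w": weights})
    per_unit = df.groupby("unit", sort=True)["w"].agg(["min", "max"])
    if not np.allclose(per_unit["min"], per_unit["max"]):
        raise ValueError("weights must be constant within unit")
    return per_unit["min"].to_numpy()
```

A balanced two-period panel with unit-constant weights passes and yields a `(G,)` array; any within-unit variation raises, mirroring the raise-on-violation behavior described above.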
Direct callers passing already-resolved `ResolvedSurveyDesign` (or per-unit `weights` array) bypass this aggregation; the invariant is the caller's responsibility on that path. + - **Distributional parity, NOT bit-exact**: at `weights=ones(G)` the survey path produces a different bootstrap p-value than the unweighted path because RNG consumption differs (batched `generate_survey_multiplier_weights_batch` vs per-iteration `_generate_mammen_weights`). The two paths agree DISTRIBUTIONALLY at large B (`|p_avg_diff| < 0.03` over 100 reps at `B=5000`); they DO NOT agree numerically at `atol=1e-10`. The unweighted code path is preserved bit-exactly (stability invariant; the new `weights=`/`survey=` arms are separate `if` branches). *Algorithm variant - TWFE linearity test via Stute (1997) Cramér-von Mises with wild bootstrap (Section 4.3, Appendix D):* Shipped in `diff_diff/had_pretests.py` as `stute_test()`. Tests whether `E(ΔY | D_2)` is linear, the testable implication of TWFE's homogeneity assumption (Assumption 8) in HADs. diff --git a/tests/test_had_pretests.py b/tests/test_had_pretests.py index 504b4bc1..4b97e2c2 100644 --- a/tests/test_had_pretests.py +++ b/tests/test_had_pretests.py @@ -2852,57 +2852,612 @@ def test_repr_includes_aggregate(self): class TestHADPretestWorkflowSurveyGuards: - """Phase 4.5 C0 guards on did_had_pretest_workflow. + """Phase 4.5 C survey-aware workflow tests. - Reciprocal half of the qug_test guards in TestQUGTest above: the workflow - has its own survey/weights kwargs that must reject for the same reasons. - QUG-under-survey is permanently deferred (extreme-value theory under - complex sampling not a settled toolkit); the linearity-family pretests - are deferred to Phase 4.5 C. 
Until C ships the workflow has no - survey-aware dispatch, so it must reject.""" + Phase 4.5 C makes did_had_pretest_workflow functional under + survey=/weights=: it skips QUG with a UserWarning (per C0 deferral) + and dispatches the linearity family with the survey-aware mechanism. + Mutex on survey+weights still raises ValueError; replicate-weight + survey designs raise NotImplementedError (parallel follow-up).""" - def _make_minimal_overall_panel(self): - """Two-period, single-cohort panel sufficient for overall workflow.""" + def _make_minimal_overall_panel(self, with_weight_col: bool = False): + """Two-period, single-cohort panel sufficient for overall workflow. + + When ``with_weight_col=True``, attaches a 'w' column populated with + unit-constant positive values (HAD continuous-path constant-within- + unit invariant).""" d_arr, dy_arr = _linear_dgp(G=20, beta=2.0, sigma=0.3) - return _make_two_period_panel(G=20, d=d_arr, dy=dy_arr) + df = _make_two_period_panel(G=20, d=d_arr, dy=dy_arr) + if with_weight_col: + rng = np.random.default_rng(7) + w_per_unit = rng.uniform(0.5, 2.0, size=20) + # Constant-within-unit per HAD invariant. 
+ df["w"] = df["unit"].map(dict(zip(np.arange(20), w_per_unit))) + return df - def test_workflow_weights_raises(self): - """Phase 4.5 C0: did_had_pretest_workflow(weights=) raises - NotImplementedError BEFORE running any of the underlying tests.""" - df = self._make_minimal_overall_panel() - with pytest.raises(NotImplementedError, match="does not yet accept"): + def test_workflow_mutex_both_raises(self): + """Phase 4.5 C: passing both survey= AND weights= raises ValueError + (mutex), mirroring HeterogeneousAdoptionDiD.fit() at had.py:2890.""" + from diff_diff import SurveyDesign + + df = self._make_minimal_overall_panel(with_weight_col=True) + with pytest.raises(ValueError, match="OR weights=.*not both"): did_had_pretest_workflow( df, "y", "d", "time", "unit", - weights=np.ones(20), + survey=SurveyDesign(weights="w"), + weights=np.ones(40), ) - def test_workflow_survey_raises(self): - """Phase 4.5 C0: did_had_pretest_workflow(survey=) raises - NotImplementedError.""" + def test_workflow_unweighted_overall_path_unchanged(self): + """Stability invariant: existing positional / unweighted calls must + produce a valid HADPretestReport after the new keyword-only kwargs + are added. Smoke-tests the overall path.""" + df = self._make_minimal_overall_panel() + report = did_had_pretest_workflow(df, "y", "d", "time", "unit", n_bootstrap=199, seed=0) + # Overall-path invariant: stute and yatchew populated; joint variants None. + assert report.aggregate == "overall" + assert report.qug is not None + assert report.stute is not None + assert report.yatchew is not None + assert report.pretrends_joint is None + assert report.homogeneity_joint is None + + def test_workflow_weights_runs_overall_path(self): + """Phase 4.5 C: weights= now functional. Workflow dispatches to + weighted Stute + Yatchew, skips QUG, returns valid report with + qug=None.""" + df = self._make_minimal_overall_panel() + # 40 rows (20 units x 2 periods); per-row weights with constant- + # within-unit invariant. 
+ rng = np.random.default_rng(7) + w_per_unit = rng.uniform(0.5, 2.0, size=20) + weights_per_row = np.tile(w_per_unit, 2) + with pytest.warns(UserWarning, match="QUG step skipped"): + report = did_had_pretest_workflow( + df, + "y", + "d", + "time", + "unit", + weights=weights_per_row, + n_bootstrap=199, + seed=0, + ) + assert report.aggregate == "overall" + assert report.qug is None # skipped per C0 + assert report.stute is not None + assert report.yatchew is not None + assert np.isfinite(report.stute.p_value) + + def test_workflow_survey_runs_overall_path(self): + """Phase 4.5 C: survey= now functional via SurveyDesign(weights=col).""" from diff_diff import SurveyDesign + df = self._make_minimal_overall_panel(with_weight_col=True) + with pytest.warns(UserWarning, match="QUG step skipped"): + report = did_had_pretest_workflow( + df, + "y", + "d", + "time", + "unit", + survey=SurveyDesign(weights="w"), + n_bootstrap=199, + seed=0, + ) + assert report.aggregate == "overall" + assert report.qug is None + assert report.stute is not None + assert report.yatchew is not None + + def test_workflow_verdict_carries_phase_4_5_c0_suffix(self): + """Phase 4.5 C: verdict appends the linearity-conditional suffix + explaining QUG was skipped per C0 deferral. 
Locks the cross-surface + text used by downstream consumers.""" + df = self._make_minimal_overall_panel() + weights_per_row = np.full(40, 1.5) # uniform-positive + with pytest.warns(UserWarning): + report = did_had_pretest_workflow( + df, + "y", + "d", + "time", + "unit", + weights=weights_per_row, + n_bootstrap=199, + seed=0, + ) + assert "linearity-conditional verdict" in report.verdict + assert "QUG-under-survey deferred per Phase 4.5 C0" in report.verdict + + def test_workflow_qug_none_serializes_cleanly(self): + """Phase 4.5 C: qug=None must propagate cleanly through summary, + to_dict, and to_dataframe (Reviewer CRITICAL #1 - retyped Optional).""" df = self._make_minimal_overall_panel() - with pytest.raises(NotImplementedError, match="does not yet accept"): + weights_per_row = np.full(40, 1.5) + with pytest.warns(UserWarning): + report = did_had_pretest_workflow( + df, + "y", + "d", + "time", + "unit", + weights=weights_per_row, + n_bootstrap=199, + seed=0, + ) + # summary() rendering + s = report.summary() + assert "QUG step skipped" in s + # to_dict() serialization + d = report.to_dict() + assert d["qug"] is None + # to_dataframe() must still produce 3 rows (qug NaN row preserved) + df_out = report.to_dataframe() + assert len(df_out) == 3 + qug_row = df_out[df_out["test"] == "qug"].iloc[0] + assert pd.isna(qug_row["statistic_value"]) + assert pd.isna(qug_row["p_value"]) + assert qug_row["reject"] is False or qug_row["reject"] == 0 # bool-ish + + def test_workflow_replicate_weights_rejected_overall(self): + """Phase 4.5 C: replicate-weight survey designs (BRR/Fay/JK1/JKn/SDR) + raise NotImplementedError. Parallel follow-up after Phase 4.5 C.""" + from diff_diff import SurveyDesign + + df = self._make_minimal_overall_panel(with_weight_col=True) + # Build replicate weights matrix (40 rows x 5 replicates of perturbed weights). 
+ rng = np.random.default_rng(0) + w_col = df["w"].to_numpy() + rep_w = np.column_stack([w_col * (1 + 0.1 * rng.standard_normal(40)) for _ in range(5)]) + df_with_rep = df.copy() + for i in range(5): + df_with_rep[f"rep{i}"] = rep_w[:, i] + sd = SurveyDesign( + weights="w", + replicate_weights=[f"rep{i}" for i in range(5)], + replicate_method="BRR", + ) + with pytest.raises(NotImplementedError, match="replicate-weight"): did_had_pretest_workflow( + df_with_rep, "y", "d", "time", "unit", survey=sd, n_bootstrap=199, seed=0 + ) + + +# ============================================================================= +# Phase 4.5 C: direct-helper survey/weights tests +# ============================================================================= + + +class TestStuteTestSurvey: + """Phase 4.5 C survey/weights extension on stute_test.""" + + def _setup(self, G=30, seed=42): + d, dy = _linear_dgp(G=G, beta=2.0, sigma=0.3, seed=seed) + return d, dy + + def test_unweighted_call_bit_exact_after_kwargs_added(self): + """Stability invariant #1: existing positional/kwarg-free calls + produce bit-exact pre-PR p_value after the new keyword-only kwargs + are added (no behavioral change on the unweighted path).""" + d, dy = self._setup() + r = stute_test(d, dy, alpha=0.05, n_bootstrap=199, seed=0) + assert np.isfinite(r.cvm_stat) + assert 0.0 <= r.p_value <= 1.0 + + def test_weights_smoke(self): + """weights= produces a finite, valid Stute result.""" + d, dy = self._setup() + w = np.random.default_rng(7).uniform(0.5, 2.0, size=30) + r = stute_test(d, dy, weights=w, n_bootstrap=199, seed=0) + assert np.isfinite(r.cvm_stat) + assert 0.0 <= r.p_value <= 1.0 + + def test_survey_smoke(self): + """survey= via trivial ResolvedSurveyDesign produces a finite result.""" + from diff_diff.survey import _make_trivial_resolved + + d, dy = self._setup() + w = np.random.default_rng(7).uniform(0.5, 2.0, size=30) + resolved = _make_trivial_resolved(w) + r = stute_test(d, dy, survey=resolved, 
n_bootstrap=199, seed=0) + assert np.isfinite(r.cvm_stat) + assert 0.0 <= r.p_value <= 1.0 + + def test_mutex_both_raises(self): + """survey + weights mutex (mirrors workflow + qug_test pattern).""" + from diff_diff.survey import _make_trivial_resolved + + d, dy = self._setup() + w = np.ones(30) + with pytest.raises(ValueError, match="OR weights=.*not both"): + stute_test(d, dy, weights=w, survey=_make_trivial_resolved(w), n_bootstrap=199, seed=0) + + def test_replicate_weights_raises(self): + """Phase 4.5 C MEDIUM #4: replicate-weight survey designs raise + NotImplementedError at the direct-helper entry point too (defense in + depth + reciprocal-guard discipline).""" + from diff_diff.survey import ResolvedSurveyDesign + + d, dy = self._setup() + w = np.ones(30) + rep_w = np.tile(w[:, None], (1, 5)) + resolved_with_rep = ResolvedSurveyDesign( + weights=w, + weight_type="pweight", + strata=None, + psu=None, + fpc=None, + n_strata=0, + n_psu=30, + lonely_psu="remove", + replicate_weights=rep_w, + replicate_method="BRR", + n_replicates=5, + ) + with pytest.raises(NotImplementedError, match="replicate-weight"): + stute_test(d, dy, survey=resolved_with_rep, n_bootstrap=199, seed=0) + + def test_negative_weights_rejected(self): + """Strictly-positive weights required on the pweight shortcut.""" + d, dy = self._setup() + w = np.ones(30) + w[0] = -1.0 + with pytest.raises(ValueError, match="strictly positive"): + stute_test(d, dy, weights=w, n_bootstrap=199, seed=0) + + def test_weights_length_mismatch(self): + d, dy = self._setup() + with pytest.raises(ValueError, match="length"): + stute_test(d, dy, weights=np.ones(20), n_bootstrap=199, seed=0) + + +class TestYatchewHRTestSurvey: + """Phase 4.5 C survey/weights extension on yatchew_hr_test. + + Includes the bit-exact reduction-invariant lock at w=ones(G) per + Reviewer CRITICAL #2 + MEDIUM #1: weighted variance components reduce + exactly to the existing unweighted formulas at uniform weights. 
+ """ + + def _setup(self, G=30, seed=42): + rng = np.random.default_rng(seed) + d = rng.uniform(0.0, 1.0, size=G) + dy = 2.0 * d + rng.normal(0.0, 0.3, size=G) + return d, dy + + def test_unweighted_bit_exact_after_kwargs_added(self): + """Existing call without weights/survey returns the pre-PR result.""" + d, dy = self._setup() + r_unweighted = yatchew_hr_test(d, dy, alpha=0.05) + # No reference value to compare against (no pre-PR golden file + # captured); just check finiteness. + assert np.isfinite(r_unweighted.t_stat_hr) + + def test_weighted_reduces_to_unweighted_at_uniform_weights(self): + """Reviewer CRITICAL #2 lock: at w=ones(G), weighted variance + components reduce to the unweighted formulas EXACTLY (atol=1e-14).""" + d, dy = self._setup() + r_unweighted = yatchew_hr_test(d, dy, alpha=0.05) + r_weighted = yatchew_hr_test(d, dy, alpha=0.05, weights=np.ones(30)) + # All three variance components must match bit-exactly. + np.testing.assert_allclose( + r_unweighted.sigma2_lin, r_weighted.sigma2_lin, atol=1e-14, rtol=1e-14 + ) + np.testing.assert_allclose( + r_unweighted.sigma2_diff, r_weighted.sigma2_diff, atol=1e-14, rtol=1e-14 + ) + np.testing.assert_allclose( + r_unweighted.sigma2_W, r_weighted.sigma2_W, atol=1e-14, rtol=1e-14 + ) + # T_hr and p_value also match. 
+ np.testing.assert_allclose( + r_unweighted.t_stat_hr, r_weighted.t_stat_hr, atol=1e-14, rtol=1e-14 + ) + + def test_weights_smoke(self): + d, dy = self._setup() + w = np.random.default_rng(7).uniform(0.5, 2.0, size=30) + r = yatchew_hr_test(d, dy, weights=w) + assert np.isfinite(r.t_stat_hr) + assert 0.0 <= r.p_value <= 1.0 + + def test_survey_smoke(self): + from diff_diff.survey import _make_trivial_resolved + + d, dy = self._setup() + w = np.random.default_rng(7).uniform(0.5, 2.0, size=30) + r = yatchew_hr_test(d, dy, survey=_make_trivial_resolved(w)) + assert np.isfinite(r.t_stat_hr) + + def test_mutex_both_raises(self): + from diff_diff.survey import _make_trivial_resolved + + d, dy = self._setup() + w = np.ones(30) + with pytest.raises(ValueError, match="not both"): + yatchew_hr_test(d, dy, weights=w, survey=_make_trivial_resolved(w)) + + def test_zero_weight_rejected(self): + """Per Reviewer Question #4: strictly-positive weights required + (the adjacent-difference variance has sum(w_avg) in the denominator + which collapses to zero in any contiguous-zero block).""" + d, dy = self._setup() + w = np.ones(30) + w[5] = 0.0 + with pytest.raises(ValueError, match="strictly positive"): + yatchew_hr_test(d, dy, weights=w) + + def test_replicate_weights_raises(self): + from diff_diff.survey import ResolvedSurveyDesign + + d, dy = self._setup() + w = np.ones(30) + resolved_with_rep = ResolvedSurveyDesign( + weights=w, + weight_type="pweight", + strata=None, + psu=None, + fpc=None, + n_strata=0, + n_psu=30, + lonely_psu="remove", + replicate_weights=np.tile(w[:, None], (1, 5)), + replicate_method="BRR", + n_replicates=5, + ) + with pytest.raises(NotImplementedError, match="replicate-weight"): + yatchew_hr_test(d, dy, survey=resolved_with_rep) + + +class TestJointStuteSurvey: + """Phase 4.5 C survey/weights on stute_joint_pretest + + joint_pretrends_test + joint_homogeneity_test.""" + + def _make_event_study_panel(self, G=20, T_pre=2, T_post=2, seed=42): + """Balanced 
event-study panel with T_pre + T_post periods.""" + rng = np.random.default_rng(seed) + d_per_unit = rng.uniform(0.1, 1.0, size=G) + rows = [] + for t in range(T_pre): + for g in range(G): + rows.append({"unit": g, "time": t, "y": rng.normal(), "d": 0.0}) + for t in range(T_pre, T_pre + T_post): + for g in range(G): + rows.append( + { + "unit": g, + "time": t, + "y": rng.normal() + 2.0 * d_per_unit[g] * (t - T_pre + 1), + "d": d_per_unit[g], + } + ) + return pd.DataFrame(rows) + + def test_joint_pretrends_weights_smoke(self): + df = self._make_event_study_panel() + w_per_unit = np.random.default_rng(7).uniform(0.5, 2.0, size=20) + # Constant-within-unit per HAD invariant. + weights_per_row = df["unit"].map(dict(zip(np.arange(20), w_per_unit))).to_numpy() + r = joint_pretrends_test( + df, + "y", + "d", + "time", + "unit", + pre_periods=[0], + base_period=1, + n_bootstrap=199, + seed=0, + weights=weights_per_row, + ) + assert np.isfinite(r.cvm_stat_joint) + assert 0.0 <= r.p_value <= 1.0 + + def test_joint_homogeneity_weights_smoke(self): + df = self._make_event_study_panel() + w_per_unit = np.random.default_rng(7).uniform(0.5, 2.0, size=20) + weights_per_row = df["unit"].map(dict(zip(np.arange(20), w_per_unit))).to_numpy() + r = joint_homogeneity_test( + df, + "y", + "d", + "time", + "unit", + post_periods=[2, 3], + base_period=1, + n_bootstrap=199, + seed=0, + weights=weights_per_row, + ) + assert np.isfinite(r.cvm_stat_joint) + assert 0.0 <= r.p_value <= 1.0 + + def test_joint_pretrends_survey_smoke(self): + from diff_diff import SurveyDesign + + df = self._make_event_study_panel() + w_per_unit = np.random.default_rng(7).uniform(0.5, 2.0, size=20) + df["w"] = df["unit"].map(dict(zip(np.arange(20), w_per_unit))) + r = joint_pretrends_test( + df, + "y", + "d", + "time", + "unit", + pre_periods=[0], + base_period=1, + n_bootstrap=199, + seed=0, + survey=SurveyDesign(weights="w"), + ) + assert np.isfinite(r.cvm_stat_joint) + + def 
test_joint_pretrends_mutex_both_raises(self): + from diff_diff import SurveyDesign + + df = self._make_event_study_panel() + df["w"] = 1.0 + with pytest.raises(ValueError, match="not both"): + joint_pretrends_test( df, "y", "d", "time", "unit", + pre_periods=[0], + base_period=1, + n_bootstrap=199, + seed=0, + weights=np.ones(80), survey=SurveyDesign(weights="w"), ) - def test_workflow_mutex_both_raises(self): - """Phase 4.5 C0: passing both survey= AND weights= raises ValueError - (mutex), mirroring HeterogeneousAdoptionDiD.fit() at had.py:2890.""" + def test_stute_joint_pretest_replicate_weights_raises(self): + """Phase 4.5 C MEDIUM #4: replicate-weight rejection at the direct + residuals-in entry too.""" + from diff_diff.survey import ResolvedSurveyDesign + + G = 20 + residuals_by_horizon = { + "0": np.random.default_rng(0).normal(size=G), + "1": np.random.default_rng(1).normal(size=G), + } + fitted_by_horizon = {"0": np.zeros(G), "1": np.zeros(G)} + doses = np.linspace(0.1, 1.0, G) + design_matrix = np.column_stack([np.ones(G), doses]) + w = np.ones(G) + resolved_with_rep = ResolvedSurveyDesign( + weights=w, + weight_type="pweight", + strata=None, + psu=None, + fpc=None, + n_strata=0, + n_psu=G, + lonely_psu="remove", + replicate_weights=np.tile(w[:, None], (1, 5)), + replicate_method="BRR", + n_replicates=5, + ) + with pytest.raises(NotImplementedError, match="replicate-weight"): + stute_joint_pretest( + residuals_by_horizon=residuals_by_horizon, + fitted_by_horizon=fitted_by_horizon, + doses=doses, + design_matrix=design_matrix, + n_bootstrap=199, + seed=0, + survey=resolved_with_rep, + ) + + +# ============================================================================= +# Phase 4.5 C R1 review regressions: zero-weight survey, aweight/fweight +# pweight-only guard, staggered event-study weights= subsetting. 
+# ============================================================================= + + +class TestPhase45CR1Regressions: + """R1 P0 / P1 / P3 regressions on the survey-aware pretest paths.""" + + def _make_overall_panel(self, with_w_col=False): + d_arr, dy_arr = _linear_dgp(G=20, beta=2.0, sigma=0.3) + df = _make_two_period_panel(G=20, d=d_arr, dy=dy_arr) + if with_w_col: + rng = np.random.default_rng(7) + w_per_unit = rng.uniform(0.5, 2.0, size=20) + df["w"] = df["unit"].map(dict(zip(np.arange(20), w_per_unit))) + return df + + def _make_staggered_panel(self, G_per_cohort=10, with_w_col=False): + """Two-cohort staggered panel: cohort A treats at F=2, cohort B at F=3. + + Last-cohort filter (per HAD Appendix B.2) keeps only cohort B. + Workflow / data-in wrappers under aggregate='event_study' must + subset row-level weights= to the surviving cohort (R1 P1).""" + rng = np.random.default_rng(0) + rows = [] + for cohort, F_g in [("A", 2), ("B", 3)]: + for g in range(G_per_cohort): + unit_id = (0 if cohort == "A" else G_per_cohort) + g + d_post = rng.uniform(0.1, 1.0) + for t in range(4): + d_t = d_post if t >= F_g else 0.0 + y_t = rng.normal() + (2.0 * d_post * (t - F_g + 1) if t >= F_g else 0.0) + rows.append({"unit": unit_id, "time": t, "y": y_t, "d": d_t, "F": F_g}) + df = pd.DataFrame(rows) + if with_w_col: + n_units = 2 * G_per_cohort + w_per_unit = np.random.default_rng(7).uniform(0.5, 2.0, size=n_units) + df["w"] = df["unit"].map(dict(zip(np.arange(n_units), w_per_unit))) + return df + + # --- R1 P0: zero-weight survey rejection ------------------------------- + + def test_stute_test_zero_survey_weight_raises(self): + from diff_diff.survey import ResolvedSurveyDesign + + d, dy = _linear_dgp(G=30) + w = np.ones(30) + w[0] = 0.0 + resolved = ResolvedSurveyDesign( + weights=w, + weight_type="pweight", + strata=None, + psu=None, + fpc=None, + n_strata=0, + n_psu=30, + lonely_psu="remove", + ) + with pytest.raises(ValueError, match="strictly positive"): + 
stute_test(d, dy, survey=resolved, n_bootstrap=199, seed=0) + + def test_stute_joint_pretest_zero_survey_weight_raises(self): + from diff_diff.survey import ResolvedSurveyDesign + + G = 20 + residuals_by_horizon = { + "0": np.random.default_rng(0).normal(size=G), + "1": np.random.default_rng(1).normal(size=G), + } + fitted_by_horizon = {"0": np.zeros(G), "1": np.zeros(G)} + doses = np.linspace(0.1, 1.0, G) + design_matrix = np.column_stack([np.ones(G), doses]) + w = np.ones(G) + w[0] = 0.0 + resolved = ResolvedSurveyDesign( + weights=w, + weight_type="pweight", + strata=None, + psu=None, + fpc=None, + n_strata=0, + n_psu=G, + lonely_psu="remove", + ) + with pytest.raises(ValueError, match="strictly positive"): + stute_joint_pretest( + residuals_by_horizon=residuals_by_horizon, + fitted_by_horizon=fitted_by_horizon, + doses=doses, + design_matrix=design_matrix, + n_bootstrap=199, + seed=0, + survey=resolved, + ) + + def test_workflow_zero_survey_weight_column_rejected(self): from diff_diff import SurveyDesign - df = self._make_minimal_overall_panel() - with pytest.raises(ValueError, match="OR weights=.*not both"): + df = self._make_overall_panel(with_w_col=True) + df.loc[df["unit"] == 0, "w"] = 0.0 + with pytest.raises(ValueError, match="strictly positive"): did_had_pretest_workflow( df, "y", @@ -2910,60 +3465,780 @@ def test_workflow_mutex_both_raises(self): "time", "unit", survey=SurveyDesign(weights="w"), - weights=np.ones(20), + n_bootstrap=199, + seed=0, ) - def test_workflow_message_points_to_phase_4_5_c(self): - """Phase 4.5 C0: the NotImplementedError must explicitly route users - to Phase 4.5 C and explain the QUG-under-survey vs Stute-under-survey - distinction.""" - df = self._make_minimal_overall_panel() - with pytest.raises(NotImplementedError) as exc_info: + # --- R1 P1: aweight/fweight pweight-only guard ------------------------- + + def test_stute_test_aweight_rejected(self): + from diff_diff.survey import ResolvedSurveyDesign + + d, dy = 
_linear_dgp(G=30) + resolved = ResolvedSurveyDesign( + weights=np.ones(30), + weight_type="aweight", + strata=None, + psu=None, + fpc=None, + n_strata=0, + n_psu=30, + lonely_psu="remove", + ) + with pytest.raises(ValueError, match="weight_type='pweight'"): + stute_test(d, dy, survey=resolved, n_bootstrap=199, seed=0) + + def test_yatchew_hr_test_fweight_rejected(self): + from diff_diff.survey import ResolvedSurveyDesign + + d, dy = _linear_dgp(G=30) + resolved = ResolvedSurveyDesign( + weights=np.ones(30), + weight_type="fweight", + strata=None, + psu=None, + fpc=None, + n_strata=0, + n_psu=30, + lonely_psu="remove", + ) + with pytest.raises(ValueError, match="weight_type='pweight'"): + yatchew_hr_test(d, dy, survey=resolved) + + def test_workflow_aweight_rejected_at_resolution(self): + from diff_diff import SurveyDesign + + df = self._make_overall_panel(with_w_col=True) + with pytest.raises(ValueError, match="weight_type='pweight'"): did_had_pretest_workflow( df, "y", "d", "time", "unit", - weights=np.ones(20), + survey=SurveyDesign(weights="w", weight_type="aweight"), + n_bootstrap=199, + seed=0, ) - msg = str(exc_info.value) - # Routing pointers - assert "Phase 4.5 C" in msg - assert "Rao-Wu" in msg - # Why QUG specifically can't be done (cross-reference to qug_test) - assert "qug_test" in msg - assert "permanently deferred" in msg - def test_workflow_unweighted_overall_path_unchanged(self): - """Stability invariant: existing positional / unweighted calls must - produce a valid HADPretestReport after the new keyword-only kwargs - are added. Smoke-tests the overall path.""" - df = self._make_minimal_overall_panel() - report = did_had_pretest_workflow(df, "y", "d", "time", "unit", n_bootstrap=199, seed=0) - # Overall-path invariant: stute and yatchew populated; joint variants None. 
+ # --- R1 P1: staggered event-study weights= subsetting ------------------ + + def test_workflow_staggered_event_study_weights_subset_correctly(self): + """R1 P1: on staggered panels, _validate_multi_period_panel filters + to the last cohort; row-level weights= must be subset to the + surviving cohort BEFORE re-aggregation. Pre-fix this crashed with + a length-mismatch ValueError.""" + df = self._make_staggered_panel(G_per_cohort=10) + n_rows = 2 * 10 * 4 + weights_per_row = np.ones(n_rows) * 1.5 + with pytest.warns(UserWarning): + report = did_had_pretest_workflow( + df, + "y", + "d", + "time", + "unit", + first_treat_col="F", + aggregate="event_study", + weights=weights_per_row, + n_bootstrap=199, + seed=0, + ) + assert report.aggregate == "event_study" + assert report.qug is None + assert report.homogeneity_joint is not None + + # --- R2 P1: direct-wrapper staggered weights= subsetting --------------- + + def test_joint_pretrends_test_staggered_weights_subset(self): + """R2 P1: joint_pretrends_test direct call must subset row-level + weights= when its own _validate_had_panel_event_study filtering + triggers on staggered panels. 
Pre-fix this crashed with a length- + mismatch ValueError because the wrapper passed the full-panel + weights array into _resolve_pretest_unit_weights(data_filtered, ...).""" + df = self._make_staggered_panel(G_per_cohort=10) + n_rows = 2 * 10 * 4 + weights_per_row = np.ones(n_rows) * 1.5 + with pytest.warns(UserWarning): + r = joint_pretrends_test( + df, + "y", + "d", + "time", + "unit", + pre_periods=[0, 1], + base_period=2, + first_treat_col="F", + n_bootstrap=199, + seed=0, + weights=weights_per_row, + ) + assert np.isfinite(r.cvm_stat_joint) + + def test_joint_homogeneity_test_staggered_weights_subset(self): + df = self._make_staggered_panel(G_per_cohort=10) + n_rows = 2 * 10 * 4 + weights_per_row = np.ones(n_rows) * 1.5 + with pytest.warns(UserWarning): + r = joint_homogeneity_test( + df, + "y", + "d", + "time", + "unit", + post_periods=[3], + base_period=2, + first_treat_col="F", + n_bootstrap=199, + seed=0, + weights=weights_per_row, + ) + assert np.isfinite(r.cvm_stat_joint) + + # --- R2 P1: bootstrap perturbation form lock --------------------------- + + def test_stute_survey_perturbation_does_not_double_weight(self): + """R2 P1: bootstrap perturbation is `dy_b = fitted + eps * eta_obs` + (paper Appendix D form), NOT `eps * w * eta_obs`. Adding `* w` to + the perturbation would over-weight by w² (weighting flows through + weighted OLS refit + weighted CvM, NOT through the multiplier). + + Lock test: cvm_stat at uniform weights matches between paths + bit-exactly (W=G under uniform weights so 1/W² = 1/G²); the + bootstrap p-value distributions agree within Monte-Carlo noise + (RNG draw ordering differs between batched survey-aware path and + per-iteration unweighted path; numerical equivalence is unreachable). 
+ """ + d, dy = _linear_dgp(G=50, beta=2.0, sigma=0.3) + r_unweighted = stute_test(d, dy, n_bootstrap=999, seed=0) + r_weighted = stute_test(d, dy, weights=np.ones(50), n_bootstrap=999, seed=0) + # cvm_stat: bit-exact reduction at w=1 (W=G, weighted CvM ≡ unweighted). + np.testing.assert_allclose( + r_unweighted.cvm_stat, r_weighted.cvm_stat, atol=1e-14, rtol=1e-14 + ) + # p_value: distributional agreement at large B; Monte-Carlo noise. + # If the survey path were over-weighting (w² instead of w), the + # bootstrap distribution would be inflated and the survey p-value + # would systematically deviate. With the correct form, |diff| < 0.10. + assert abs(r_unweighted.p_value - r_weighted.p_value) < 0.10 + + # --- R3 P0: variance-unidentified survey-design guard ------------------ + + def test_stute_test_single_psu_unstratified_returns_nan(self): + """R3 P0: unstratified single-PSU survey designs are + variance-unidentified (df_survey = n_psu - 1 = 0). The multiplier + bootstrap helper returns an all-zero matrix; without the guard + the code below would treat that as a valid bootstrap law and emit + p_value ≈ 1/(B+1) (spurious rejection). Guard returns NaN + + UserWarning instead.""" + from diff_diff.survey import ResolvedSurveyDesign + + d, dy = _linear_dgp(G=30) + # All units in a single PSU, unstratified -> df_survey = 0. + single_psu = np.zeros(30, dtype=np.int64) + resolved = ResolvedSurveyDesign( + weights=np.ones(30), + weight_type="pweight", + strata=None, + psu=single_psu, + fpc=None, + n_strata=0, + n_psu=1, + lonely_psu="remove", + ) + with pytest.warns(UserWarning, match="variance-unidentified"): + r = stute_test(d, dy, survey=resolved, n_bootstrap=199, seed=0) + assert np.isnan(r.p_value) + assert r.reject is False + # cvm_stat is the OBSERVED value (still computed pre-guard); only + # p_value goes NaN because the bootstrap calibration is invalid. 
+ assert np.isfinite(r.cvm_stat) + + def test_stute_joint_pretest_single_psu_unstratified_returns_nan(self): + """R3 P0: same guard on the joint variant.""" + from diff_diff.survey import ResolvedSurveyDesign + + G = 20 + residuals_by_horizon = { + "0": np.random.default_rng(0).normal(size=G), + "1": np.random.default_rng(1).normal(size=G), + } + fitted_by_horizon = {"0": np.zeros(G), "1": np.zeros(G)} + doses = np.linspace(0.1, 1.0, G) + design_matrix = np.column_stack([np.ones(G), doses]) + single_psu = np.zeros(G, dtype=np.int64) + resolved = ResolvedSurveyDesign( + weights=np.ones(G), + weight_type="pweight", + strata=None, + psu=single_psu, + fpc=None, + n_strata=0, + n_psu=1, + lonely_psu="remove", + ) + with pytest.warns(UserWarning, match="variance-unidentified"): + r = stute_joint_pretest( + residuals_by_horizon=residuals_by_horizon, + fitted_by_horizon=fitted_by_horizon, + doses=doses, + design_matrix=design_matrix, + n_bootstrap=199, + seed=0, + survey=resolved, + ) + assert np.isnan(r.p_value) + assert r.reject is False + assert np.isfinite(r.cvm_stat_joint) + + def test_workflow_single_psu_propagates_nan_through_stute(self): + """R3 P0: workflow-level: single-PSU survey design makes the + survey Stute multiplier bootstrap variance-unidentified, so + report.stute.p_value is NaN (the guard fired). Yatchew under + survey is unaffected by PSU clustering by design (REGISTRY + note: PSU clustering is NOT propagated through the variance- + ratio statistic), so report.yatchew.p_value is still finite. + The verdict carries the linearity-conditional suffix; users + should read REGISTRY for the per-test mechanism caveat.""" + from diff_diff import SurveyDesign + + df = self._make_overall_panel(with_w_col=True) + # Add a constant 'psu' column (single PSU). 
+ df["psu"] = 0 + with pytest.warns(UserWarning): + report = did_had_pretest_workflow( + df, + "y", + "d", + "time", + "unit", + survey=SurveyDesign(weights="w", psu="psu"), + n_bootstrap=199, + seed=0, + ) assert report.aggregate == "overall" - assert report.qug is not None - assert report.stute is not None - assert report.yatchew is not None - assert report.pretrends_joint is None - assert report.homogeneity_joint is None + assert report.qug is None # skipped per C0 + # Stute: variance-unidentified -> NaN p-value (R3 P0 guard fired). + assert report.stute is not None and np.isnan(report.stute.p_value) + # Yatchew: closed-form, PSU-agnostic by design -> still finite. + assert report.yatchew is not None and np.isfinite(report.yatchew.p_value) + # Verdict carries the linearity-conditional suffix. + assert "linearity-conditional verdict" in report.verdict + + # --- R4 P0: weight-scale invariance + cross-path agreement ------------ + + def test_yatchew_weights_scale_invariant(self): + """R4 P0: Yatchew test statistic must be invariant under uniform + rescaling of weights. Pre-fix `T_hr = sqrt(sum(w)) * (...)` made + the stat scale as sqrt(c), so weights=w and weights=100*w gave + different p-values. 
Fix: helper normalizes pweights to mean=1 + before any computation.""" + d, dy = _linear_dgp(G=30, beta=2.0, sigma=0.3) + w = np.random.default_rng(7).uniform(0.5, 2.0, size=30) + r1 = yatchew_hr_test(d, dy, weights=w) + r2 = yatchew_hr_test(d, dy, weights=100.0 * w) + np.testing.assert_allclose(r1.t_stat_hr, r2.t_stat_hr, atol=1e-12, rtol=1e-12) + np.testing.assert_allclose(r1.p_value, r2.p_value, atol=1e-12, rtol=1e-12) + + def test_stute_weights_scale_invariant(self): + """R4 P0 mirror: Stute is internally scale-invariant in functional + form, but normalization is required so weights= and survey= + entry paths agree numerically.""" + d, dy = _linear_dgp(G=30, beta=2.0, sigma=0.3) + w = np.random.default_rng(7).uniform(0.5, 2.0, size=30) + r1 = stute_test(d, dy, weights=w, n_bootstrap=199, seed=0) + r2 = stute_test(d, dy, weights=100.0 * w, n_bootstrap=199, seed=0) + np.testing.assert_allclose(r1.cvm_stat, r2.cvm_stat, atol=1e-12, rtol=1e-12) + np.testing.assert_allclose(r1.p_value, r2.p_value, atol=1e-12, rtol=1e-12) + + def test_workflow_weights_eq_survey_at_overall_path(self): + """R4 P0: workflow's weights= shortcut and survey=SurveyDesign( + weights="w") must produce identical Yatchew/Stute results for + the same design. SurveyDesign.resolve() normalizes pweights to + mean=1; the helper now applies the same normalization on the + weights= path so both paths agree numerically.""" + from diff_diff import SurveyDesign + + df = self._make_overall_panel(with_w_col=True) + # Build a per-row weights array matching df["w"] for the shortcut. 
+ weights_per_row = df["w"].to_numpy() + with pytest.warns(UserWarning): + r_weights = did_had_pretest_workflow( + df, + "y", + "d", + "time", + "unit", + weights=weights_per_row, + n_bootstrap=199, + seed=0, + ) + with pytest.warns(UserWarning): + r_survey = did_had_pretest_workflow( + df, + "y", + "d", + "time", + "unit", + survey=SurveyDesign(weights="w"), + n_bootstrap=199, + seed=0, + ) + # Yatchew: closed-form, must match exactly under mean=1 normalization. + assert r_weights.yatchew is not None and r_survey.yatchew is not None + np.testing.assert_allclose( + r_weights.yatchew.t_stat_hr, + r_survey.yatchew.t_stat_hr, + atol=1e-10, + rtol=1e-10, + ) + # Stute: bootstrap is seeded; same multiplier matrix shape under + # both paths means same RNG draws -> identical p-values. + assert r_weights.stute is not None and r_survey.stute is not None + np.testing.assert_allclose( + r_weights.stute.cvm_stat, r_survey.stute.cvm_stat, atol=1e-10, rtol=1e-10 + ) + np.testing.assert_allclose( + r_weights.stute.p_value, r_survey.stute.p_value, atol=1e-10, rtol=1e-10 + ) + + # --- R4 P1: 1D weights validation ------------------------------------ + + def test_stute_test_rejects_2d_weights(self): + """R4 P1: column-vector weights must raise, not silently broadcast.""" + d, dy = _linear_dgp(G=30) + w_2d = np.ones((30, 1)) # common df[["w"]].to_numpy() pattern + with pytest.raises(ValueError, match="1-dimensional"): + stute_test(d, dy, weights=w_2d, n_bootstrap=199, seed=0) + + def test_yatchew_hr_test_rejects_2d_weights(self): + d, dy = _linear_dgp(G=30) + w_2d = np.ones((30, 1)) + with pytest.raises(ValueError, match="1-dimensional"): + yatchew_hr_test(d, dy, weights=w_2d) + + def test_workflow_rejects_2d_weights(self): + df = self._make_overall_panel() + w_2d = np.ones((40, 1)) + with pytest.raises(ValueError, match="1-dimensional"): + did_had_pretest_workflow( + df, "y", "d", "time", "unit", weights=w_2d, n_bootstrap=199, seed=0 + ) + + # --- R5 P1: lonely_psu='adjust' 
singleton-strata rejection (superseded by the R10 P1 stratified guard) --- + + def _make_singleton_strata_resolved(self, G=30, lonely_psu="adjust"): + """Resolved survey design with one PSU per stratum (singleton strata). + Under lonely_psu='adjust' the bootstrap helper pools singletons with + nonzero multipliers, but the variance target requires a pseudo-stratum + centering transform not derived for the Stute CvM.""" + from diff_diff.survey import ResolvedSurveyDesign + + # G strata, each with exactly 1 PSU (each unit is its own stratum + + # PSU). Tests the worst-case singleton-pooling regime. + strata = np.arange(G, dtype=np.int64) + psu = np.arange(G, dtype=np.int64) + return ResolvedSurveyDesign( + weights=np.ones(G), + weight_type="pweight", + strata=strata, + psu=psu, + fpc=None, + n_strata=G, + n_psu=G, + lonely_psu=lonely_psu, + ) + + def test_stute_test_stratified_design_raises(self): + """R10 P1: Stute survey path explicitly rejects ANY stratified + design (`SurveyDesign(strata=...)`) -- the matching Stute-CvM + stratified-correction derivation is not yet completed. This + guard supersedes the prior R5 P1 lonely_psu='adjust' guard, + which only fired on the singleton-stratum subset of stratified + designs. 
PSU-only and pweight-only designs remain supported.""" + d, dy = _linear_dgp(G=30) + resolved = self._make_singleton_strata_resolved(G=30, lonely_psu="adjust") + with pytest.raises(NotImplementedError, match="stratified"): + stute_test(d, dy, survey=resolved, n_bootstrap=199, seed=0) + + def test_stute_joint_pretest_stratified_design_raises(self): + """R10 P1: joint-Stute survey path explicitly rejects stratified + designs (mirrors stute_test single-horizon).""" + G = 20 + residuals_by_horizon = { + "0": np.random.default_rng(0).normal(size=G), + "1": np.random.default_rng(1).normal(size=G), + } + fitted_by_horizon = {"0": np.zeros(G), "1": np.zeros(G)} + doses = np.linspace(0.1, 1.0, G) + design_matrix = np.column_stack([np.ones(G), doses]) + resolved = self._make_singleton_strata_resolved(G=G, lonely_psu="adjust") + with pytest.raises(NotImplementedError, match="stratified"): + stute_joint_pretest( + residuals_by_horizon=residuals_by_horizon, + fitted_by_horizon=fitted_by_horizon, + doses=doses, + design_matrix=design_matrix, + n_bootstrap=199, + seed=0, + survey=resolved, + ) + + # --- R6 P1: positive non-trivial PSU/strata survey coverage ----------- + + def _make_event_study_panel_with_psu_strata( + self, + n_strata=2, + n_psu_per_stratum=3, + n_units_per_psu=2, + T_pre=2, + T_post=2, + seed=42, + ): + """Balanced event-study panel with non-trivial PSU/strata structure.""" + rng = np.random.default_rng(seed) + rows = [] + unit_id = 0 + for h in range(n_strata): + for p in range(n_psu_per_stratum): + psu_global = h * n_psu_per_stratum + p + for _ in range(n_units_per_psu): + d_post = rng.uniform(0.1, 1.0) + w_unit = rng.uniform(0.5, 2.0) + for t in range(T_pre + T_post): + d_t = d_post if t >= T_pre else 0.0 + y_t = rng.normal() + (2.0 * d_post * (t - T_pre + 1) if t >= T_pre else 0.0) + rows.append( + { + "unit": unit_id, + "time": t, + "y": y_t, + "d": d_t, + "stratum": h, + "psu": psu_global, + "w": w_unit, + } + ) + unit_id += 1 + return pd.DataFrame(rows) 
+ + def test_joint_homogeneity_test_psu_only_survey_smoke(self): + """R6 P1 + R10 P1: positive coverage on joint_homogeneity_test + with PSU-only survey design (NO strata, since stratified is + rejected per R10 P1 narrowing).""" + from diff_diff import SurveyDesign + + df = self._make_event_study_panel_with_psu_strata( + n_strata=2, n_psu_per_stratum=3, n_units_per_psu=2 + ) + r = joint_homogeneity_test( + df, + "y", + "d", + "time", + "unit", + post_periods=[2, 3], + base_period=1, + n_bootstrap=199, + seed=0, + survey=SurveyDesign(weights="w", psu="psu"), + ) + assert np.isfinite(r.cvm_stat_joint) + assert 0.0 <= r.p_value <= 1.0 + + def test_joint_homogeneity_test_stratified_raises(self): + """R10 P1: stratified designs raise NotImplementedError on + joint_homogeneity_test (propagates via stute_joint_pretest).""" + from diff_diff import SurveyDesign + + df = self._make_event_study_panel_with_psu_strata( + n_strata=2, n_psu_per_stratum=3, n_units_per_psu=2 + ) + with pytest.raises(NotImplementedError, match="stratified"): + joint_homogeneity_test( + df, + "y", + "d", + "time", + "unit", + post_periods=[2, 3], + base_period=1, + n_bootstrap=199, + seed=0, + survey=SurveyDesign(weights="w", strata="stratum", psu="psu"), + ) - def test_workflow_survey_rejects_on_event_study_path_too(self): - """Phase 4.5 C0: rejection happens at the front door regardless of - which aggregate path the user picked. 
The mutex/reject guards fire - BEFORE the panel validator does any work, so the user gets the - methodology-aware message even on an invalid event-study panel.""" - df = ( - self._make_minimal_overall_panel() - ) # only two periods - would fail event_study validation - with pytest.raises(NotImplementedError, match="does not yet accept"): + def test_workflow_event_study_psu_only_survey_smoke(self): + """R6 P1 + R10 P1: positive coverage on did_had_pretest_workflow + event-study path with PSU-only structure (no strata).""" + from diff_diff import SurveyDesign + + df = self._make_event_study_panel_with_psu_strata( + n_strata=2, n_psu_per_stratum=3, n_units_per_psu=2 + ) + with pytest.warns(UserWarning, match="QUG step skipped"): + report = did_had_pretest_workflow( + df, + "y", + "d", + "time", + "unit", + aggregate="event_study", + survey=SurveyDesign(weights="w", psu="psu"), + n_bootstrap=199, + seed=0, + ) + assert report.aggregate == "event_study" + assert report.qug is None + assert report.pretrends_joint is not None + assert np.isfinite(report.pretrends_joint.cvm_stat_joint) + assert report.homogeneity_joint is not None + assert np.isfinite(report.homogeneity_joint.cvm_stat_joint) + + def test_workflow_event_study_zero_weights_on_dropped_cohort(self): + """R6 P1 regression: previously the workflow eagerly resolved + weights= on the FULL panel (before _validate_multi_period_panel's + last-cohort filter), so zero/invalid weights on the soon-to-be- + dropped cohort would abort an otherwise-valid event-study run. + Fix: resolution moved into the per-aggregate branches; the + event-study path lets joint wrappers handle resolution on + data_filtered. 
This test verifies a panel where the dropped + (early) cohort has zero weights succeeds on the surviving last + cohort.""" + df = self._make_staggered_panel(G_per_cohort=10) + weights_per_row = np.array([0.0 if df.iloc[i]["F"] == 2 else 1.5 for i in range(len(df))]) + with pytest.warns(UserWarning): + report = did_had_pretest_workflow( + df, + "y", + "d", + "time", + "unit", + first_treat_col="F", + aggregate="event_study", + weights=weights_per_row, + n_bootstrap=199, + seed=0, + ) + assert report.aggregate == "event_study" + assert report.qug is None + assert report.homogeneity_joint is not None + + # --- R7 P0: weighted-CvM outer-measure oracle ------------------------- + + def test_cvm_statistic_weighted_outer_measure_oracle(self): + """R7 P0: weighted CvM must integrate outer measure against F_hat_w + too. Hand-computed oracle distinguishes outer-weighted form + ((1/W^2) sum_g w_g C_g^2) from count-weighted-cusum form + ((1/W^2) sum_g C_g^2). Uniform weights cannot tell the two apart.""" + from diff_diff.had_pretests import _cvm_statistic_weighted + + eps = np.array([1.0, -2.0, 3.0]) + d = np.array([0.1, 0.2, 0.3]) + w = np.array([1.0, 2.0, 3.0]) + # C_1=1, C_2=-3, C_3=6, W=6. + # Outer-weighted: (1*1 + 2*9 + 3*36) / 36 = 127/36. + # Count-weighted (WRONG): (1+9+36) / 36 = 46/36. 
+ result = _cvm_statistic_weighted(eps, d, w) + outer_weighted = (1 * 1.0**2 + 2 * (-3.0) ** 2 + 3 * 6.0**2) / (6.0**2) + count_weighted = (1.0**2 + (-3.0) ** 2 + 6.0**2) / (6.0**2) + np.testing.assert_allclose(result, outer_weighted, atol=1e-14, rtol=1e-14) + assert abs(outer_weighted - count_weighted) > 1.0 + assert abs(result - count_weighted) > 1.0 + + def test_cvm_statistic_weighted_reduces_at_uniform_weights(self): + """At w=ones(G), outer-weighted form reduces bit-exactly to the + unweighted statistic.""" + from diff_diff.had_pretests import _cvm_statistic, _cvm_statistic_weighted + + eps = np.array([1.0, -2.0, 3.0, 0.5, -0.7]) + d = np.array([0.1, 0.2, 0.3, 0.4, 0.5]) + w_uniform = np.ones(5) + np.testing.assert_allclose( + _cvm_statistic_weighted(eps, d, w_uniform), + _cvm_statistic(eps, d), + atol=1e-14, + rtol=1e-14, + ) + + # --- R7 P1: survey verdict consistency -------------------------------- + + def test_workflow_overall_survey_pass_does_not_say_inconclusive(self): + """R7 P1: when all_pass=True on the overall survey path, the + verdict must NOT start with 'inconclusive'. Locks the explicit + survey-aware verdict composer.""" + df = self._make_overall_panel() + weights_per_row = np.full(40, 1.5) + with pytest.warns(UserWarning): + report = did_had_pretest_workflow( + df, + "y", + "d", + "time", + "unit", + weights=weights_per_row, + n_bootstrap=199, + seed=0, + ) + if report.all_pass: + assert not report.verdict.startswith("inconclusive"), ( + f"all_pass=True but verdict starts with 'inconclusive': " f"{report.verdict!r}" + ) + + def test_workflow_event_study_survey_pass_does_not_say_inconclusive(self): + """R7 P1: same invariant on the event-study survey path. 
Uses + PSU-only design (no strata) per R10 P1 narrowing.""" + from diff_diff import SurveyDesign + + df = self._make_event_study_panel_with_psu_strata( + n_strata=2, n_psu_per_stratum=3, n_units_per_psu=2 + ) + with pytest.warns(UserWarning, match="QUG step skipped"): + report = did_had_pretest_workflow( + df, + "y", + "d", + "time", + "unit", + aggregate="event_study", + survey=SurveyDesign(weights="w", psu="psu"), + n_bootstrap=199, + seed=0, + ) + if report.all_pass: + assert not report.verdict.startswith("inconclusive"), ( + f"all_pass=True but verdict starts with 'inconclusive': " f"{report.verdict!r}" + ) + + # --- R9 P1: front-door length validation on staggered weights= path --- + + def test_workflow_event_study_oversized_weights_raises(self): + """R9 P1: oversized row-level weights= must raise a clean + ValueError BEFORE the staggered-panel pos_idx subsetting (pre-fix + the workflow silently truncated by slicing original weights to + data_filtered's row count without checking length first).""" + df = self._make_staggered_panel(G_per_cohort=10) + weights_oversized = np.ones(100) * 1.5 # 80 rows expected + with pytest.raises(ValueError, match="weights length 100"): + did_had_pretest_workflow( + df, + "y", + "d", + "time", + "unit", + first_treat_col="F", + aggregate="event_study", + weights=weights_oversized, + n_bootstrap=199, + seed=0, + ) + + def test_workflow_event_study_undersized_weights_raises(self): + """R9 P1: undersized weights= must raise clean ValueError, not + a raw IndexError from pos_idx slicing.""" + df = self._make_staggered_panel(G_per_cohort=10) + weights_undersized = np.ones(60) * 1.5 + with pytest.raises(ValueError, match="weights length 60"): did_had_pretest_workflow( df, "y", "d", "time", "unit", + first_treat_col="F", aggregate="event_study", - weights=np.ones(20), + weights=weights_undersized, + n_bootstrap=199, + seed=0, ) + + def test_joint_pretrends_test_oversized_weights_raises(self): + """R9 P1: same length-validation contract 
on the direct wrapper.""" + df = self._make_staggered_panel(G_per_cohort=10) + weights_oversized = np.ones(100) * 1.5 + with pytest.raises(ValueError, match="weights length 100"): + joint_pretrends_test( + df, + "y", + "d", + "time", + "unit", + pre_periods=[0, 1], + base_period=2, + first_treat_col="F", + n_bootstrap=199, + seed=0, + weights=weights_oversized, + ) + + def test_joint_homogeneity_test_undersized_weights_raises(self): + """R9 P1: same on joint_homogeneity_test.""" + df = self._make_staggered_panel(G_per_cohort=10) + weights_undersized = np.ones(60) * 1.5 + with pytest.raises(ValueError, match="weights length 60"): + joint_homogeneity_test( + df, + "y", + "d", + "time", + "unit", + post_periods=[3], + base_period=2, + first_treat_col="F", + n_bootstrap=199, + seed=0, + weights=weights_undersized, + ) + + # --- R12 P3: positive FPC-only survey coverage ------------------------ + + def test_stute_test_fpc_only_survey_smoke(self): + """R12 P3: positive smoke for FPC-only survey designs on the Stute + family. Phase 4.5 C narrows survey support to pweight+PSU+FPC; the + previous test matrix covered pweight-only and PSU-only but no FPC + case, so the FPC scaling branch in + generate_survey_multiplier_weights_batch was unpinned by direct + regression.""" + from diff_diff.survey import ResolvedSurveyDesign + + d, dy = _linear_dgp(G=30, beta=2.0, sigma=0.3) + # Construct an FPC-only design: no strata, no PSU, but FPC = N=200 + # (population size) so f = G/N = 0.15. The bootstrap helper applies + # a sqrt(1 - f) scaling to the multipliers under FPC. 
+ w = np.ones(30) + resolved = ResolvedSurveyDesign( + weights=w, + weight_type="pweight", + strata=None, + psu=None, + fpc=np.full(30, 200.0), + n_strata=0, + n_psu=30, + lonely_psu="remove", + ) + r = stute_test(d, dy, survey=resolved, n_bootstrap=199, seed=0) + assert np.isfinite(r.cvm_stat) + assert 0.0 <= r.p_value <= 1.0 + + def test_workflow_overall_fpc_only_survey_smoke(self): + """R12 P3: positive smoke for FPC-only on the workflow path.""" + from diff_diff import SurveyDesign + + df = self._make_overall_panel(with_w_col=True) + # FPC value > G to satisfy the helper's "FPC must be >= n_units" guard. + df["fpc"] = 200.0 + with pytest.warns(UserWarning, match="QUG step skipped"): + report = did_had_pretest_workflow( + df, + "y", + "d", + "time", + "unit", + survey=SurveyDesign(weights="w", fpc="fpc"), + n_bootstrap=199, + seed=0, + ) + assert report.aggregate == "overall" + assert report.qug is None + assert report.stute is not None and np.isfinite(report.stute.p_value) + assert report.yatchew is not None and np.isfinite(report.yatchew.p_value)
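
For reviewers who want to sanity-check the mechanics these tests pin down outside the suite, here is a minimal, self-contained sketch. All names are hypothetical stand-ins: the real kernels are `bootstrap_utils.generate_survey_multiplier_weights_batch` and `had_pretests._cvm_statistic_weighted`, whose signatures may differ. The CvM form is reverse-engineered from the hand-computed oracle in `test_cvm_statistic_weighted_outer_measure_oracle`; the perturbation form is the one locked by `test_stute_survey_perturbation_does_not_double_weight`.

```python
import numpy as np


def mammen_multipliers(rng, n_bootstrap, n_psu):
    """Two-point Mammen distribution (mean 0, variance 1, third moment 1)."""
    golden = (np.sqrt(5.0) + 1.0) / 2.0
    p = golden / np.sqrt(5.0)  # P(eta = 1 - golden) ~= 0.7236
    u = rng.random((n_bootstrap, n_psu))
    return np.where(u < p, 1.0 - golden, golden)


def perturb_replicates(fitted, eps, psu_ids, eta_psu):
    """dy_b = fitted + eps * eta_obs, with eta_obs[g] = eta_psu[psu(g)].

    No `* w` in the perturbation: weighting enters only through the
    weighted OLS refit and the weighted CvM (the R2 P1 lock)."""
    eta_obs = eta_psu[:, psu_ids]            # broadcast PSU draws to units
    return fitted[None, :] + eps[None, :] * eta_obs


def cvm_statistic_weighted_sketch(eps, d, w):
    """Outer-weighted CvM: (1/W^2) * sum_g w_g * C_g^2, with C_g the
    weighted residual cusum in dose order and W = sum(w)."""
    order = np.argsort(np.asarray(d), kind="stable")
    eps_s = np.asarray(eps, dtype=float)[order]
    w_s = np.asarray(w, dtype=float)[order]
    W = w_s.sum()
    C = np.cumsum(w_s * eps_s)               # C_g = sum_{j<=g} w_j * eps_j
    return float(np.sum(w_s * C**2) / W**2)  # outer measure weighted too
```

At `w = [1, 2, 3]` with `eps = [1, -2, 3]` this reproduces the oracle's `127/36`, and at uniform weights it collapses to the unweighted `sum(C**2) / G**2`, matching the two reduction tests above.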