From a214168eb144b9c63ba65c68fff74222c64dcfcb Mon Sep 17 00:00:00 2001 From: igerber Date: Sun, 31 May 2026 16:47:37 -0400 Subject: [PATCH] WooldridgeDiD: outcome-fit hint for OLS on binary/count outcomes Add a non-fatal UserWarning when WooldridgeDiD(method="ols") is used on a binary ({0,1}) or non-negative integer-count outcome. Following Wooldridge (2023), the warning notes that a matching nonlinear model (logit / Poisson) is often the MORE APPROPRIATE specification for such outcomes: it imposes parallel trends on the link/index scale rather than in levels (level-PT is only valid for continuous/unbounded outcomes), and the paper's Section 5 simulations show the linear model both biased and less precise where the nonlinear mean holds. It is a different identifying assumption than linear OLS -- which fits depends on which parallel-trends restriction holds -- so the hint is a recommended comparison, not an automatic switch or a free efficiency upgrade. OLS remains a valid QMLE for any response (Table 1). - _suggest_nonlinear_method: pure, non-fatal detector (binary -> logit, non-negative count with >2 distinct -> poisson; fractional/continuous/ negative/non-numeric -> None). Bounded binomial-style integers are not separately distinguished from unbounded counts (documented heuristic). - fit() gate 0g (OLS-only, stacklevel=2) emits the hint on the full outcome column before sample filtering; never alters the fit or raises - TestOutcomeFitHint: detector units (incl. the bounded-support case) + gate behavior + suppression + paper-faithful framing guard (more-appropriate / biased / different assumption / recommended-comparison) - Docs: REGISTRY Note (LPT-vs-IPT + Section 5 evidence + two-sided framing provenance), method docstring, llms-full.txt, wooldridge_etwfe.rst, CHANGELOG - TODO: remove the Tier A + Methodology/Correctness rows (#216) Co-Authored-By: Claude Opus 4.8 (1M context) --- CHANGELOG.md | 1 + TODO.md | 3 +- diff_diff/guides/llms-full.txt | 5 +- diff_diff/wooldridge.py | 92 ++++++++++++++++++++++- docs/api/wooldridge_etwfe.rst | 16 ++++ docs/methodology/REGISTRY.md | 2 + tests/test_wooldridge.py | 130 +++++++++++++++++++++++++++++++++ 7 files changed, 244 insertions(+), 5 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index fe5c90c1..7bd213d8 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -9,6 +9,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### Added - **PowerAnalysis methodology-review-tracker promotion: In Progress → Complete, with a panel-variance correction (behavior change).** Closes the Bloom (1995) + Burlig, Preonas & Woerman (2020) source audits on the tracker (PR-A #506 added both paper reviews + under-review Notes; this PR validates the source against the code and reconciles the discrepancies). **Behavior change:** the analytical *panel* DiD variance was the Moulton design-effect factor `(1+(T−1)·rho)/T`, wrong two ways versus the source — wrong period-scaling (~4× too small at `rho=0`, `m=r=5` versus the iid DiD benchmark) and the **opposite `rho`-sign** (it *raised* the MDE as within-unit correlation grew). It is replaced by the within-unit equicorrelated special case of Burlig et al. Eq. 2, `Var(ATT) = sigma² · (1/n_T + 1/n_C) · (1/n_pre + 1/n_post) · (1 − rho)`, in which within-unit (serial) correlation *lowers* the MDE because the difference-in-differences cancels the shared within-unit component. So `PowerAnalysis.mde` / `power` / `sample_size` (and the `compute_*` wrappers) now return a **smaller** MDE / required N as `rho` rises for **all** designs; the 2×2 path matches Bloom's `2σ²` at the default `rho = 0` and is continuous with the panel form at `n_pre = n_post = 1`. New input validation, enforced for **all** designs *before* the 2×2-vs-panel router: `n_pre >= 1`, `n_post >= 1`, `rho ∈ [−1/(T−1), 1)` (`T = n_pre + n_post`), finite `sigma >= 0`, positive group counts, and `treat_frac ∈ (0, 1)` now raise `ValueError` (previously invalid two-period shapes and out-of-range `rho` fell through to `basic_did` silently). The `(1 − rho)` factor applies at `T = 2` too — the 2×2 path is Burlig's `m = r = 1` special case (footnote 11), so a nonzero `rho` is no longer silently ignored there, while `rho = 0` still recovers Bloom's `2σ²`. The MDE multiplier stays the **normal (z)** Bloom multiplier (a deliberate large-sample approximation to Burlig's t, documented as `**Deviation from R:**`) — unchanged. New `tests/test_methodology_power.py` (Bloom Table 1 multipliers; 2×2 + panel closed forms; a literal-equicorrelated Monte-Carlo validation of the panel variance; `sample_size`↔`mde` round-trip; input-guard + `rho`-at-`T=2` + `compute_*` wrapper validation; base-R `qnorm` parity at `benchmarks/data/r_power_golden.json`, generator `benchmarks/R/generate_power_golden.R`); the two `tests/test_power.py` ICC-direction tests were inverted to Burlig's sign. REGISTRY `## PowerAnalysis` equation block rewritten (z not t; corrected 2×2 / panel SE + sample-size; removed the cluster-`m` and inverted-`R²` terms that matched neither code nor source); `docs/references.rst` adds Frison & Pocock (1992) + McKenzie (2012) as the equicorrelated lineage; tutorial `06_power_analysis.ipynb` corrected. `METHODOLOGY_REVIEW.md` row promoted to **Complete** (`Last Review = 2026-05-31`); priority queue pruned; the PR-A under-review Notes removed across REGISTRY / `power.py` / `references.rst`. +- **`WooldridgeDiD` outcome-fit hint:** `WooldridgeDiD(method="ols")` now emits a `UserWarning` when the outcome is binary (`{0, 1}`) or a non-negative integer count, noting that a matching nonlinear model (`method="logit"` / `method="poisson"`) is often the **more appropriate specification** for such outcomes. Following Wooldridge (2023): the nonlinear paths impose parallel trends on the link/index scale rather than in levels (level-PT is only valid for continuous/unbounded outcomes), and the paper's Section 5 simulations show the linear model both biased and less precise where the nonlinear mean holds. It is a **different identifying assumption** than linear OLS — which one fits depends on which parallel-trends restriction holds — so the warning frames it as a recommended comparison, not an automatic switch or free efficiency upgrade. OLS remains a valid QMLE for *any* response (Table 1). Always-on (suppress via `warnings.filterwarnings`); detection is high-signal (binary requires exactly `{0, 1}`; the count branch suggests Poisson — the natural unbounded-count model — for *any* non-negative integers with >2 distinct values, so bounded binomial / known-upper-bound integer outcomes are not separately distinguished from unbounded counts; fractional / continuous outcomes are not flagged). - **New tutorial: `docs/tutorials/24_staggered_vs_collapsed_power.ipynb` — "Staggered Rollout or a Simple 2×2? A Power-Analysis Decision Guide".** A practitioner walkthrough for geo experiments (framed on a 50-state staggered rollout) on when to reach for Callaway-Sant'Anna vs collapsing to a familiar pre/post 2×2. Shows, with live paired Monte Carlo on `generate_staggered_data`, that the collapsed 2×2 silently targets a *diluted* estimand (reports ~60–94% of the true effect-on-treated as the rollout staggers, with near-zero CI coverage of the truth under a slow rollout), and that CS's minimum-detectable-lift penalty is a *fast-rollout* phenomenon that shrinks to parity as the rollout becomes more staggered. Fully self-contained (runs live, no committed data files); ends with a CS-vs-2×2 decision guide. - **`SyntheticControl` in-space placebo permutation inference + reporting-stack integration (ADH 2010 §2.4).** New `SyntheticControlResults.in_space_placebo()` provides the significance test classic SCM lacks an analytical SE for: it reassigns treatment to each donor, refits a synthetic control for that pseudo-treated donor against the **other `J−1` donors** (the real treated unit is excluded from every placebo pool — its post-period is treatment-contaminated; matches `SCtools::generate.placebos`), and ranks the treated unit's post/pre **RMSPE ratio** among the `J+1` units. New fields `placebo_p_value` (`= rank/(n_placebos+1)`, an upper-tail rank test on the unsigned RMSPE ratio — direction-agnostic, so it detects an effect of *either* sign rather than a signed/one-directional hypothesis; ties counted via `≥`), `rmspe_ratio` (the treated statistic, set at fit), and `n_placebos`/`n_failed` (effective reference-set sizes; non-converged placebos are excluded from BOTH numerator and denominator, never penalized into the rank). `placebo_p_value` is a **separate field** from the (always-NaN) `p_value` — it is a permutation p-value with no SE/t-stat and does not flow through `safe_inference`; `is_significant` stays bound to `p_value`. Edge cases fail closed: scale-aware RMSPE-ratio floor (a perfect pre-fit gives a finite ratio, not `inf`), `J<2` → NaN+warn, `J==2` → degenerate+coarse warn, deterministic given `seed`. New `get_placebo_df()` returns the per-unit RMSPE-ratio summary table (incl. the treated row and any failed donors) used for the rank. The design keeps the placebo *compute* opt-in — the per-donor refit loop runs only on the explicit `in_space_placebo()` call. To support that opt-in call, every fit retains a `_SyntheticControlFitSnapshot` of the pivoted panel (memory O(units × periods × predictor-vars), like `SyntheticDiD`'s snapshot for `in_time_placebo`; excluded from pickling). A compact/lazy snapshot representation is tracked as a follow-up in `TODO.md`. **Reporting-stack integration:** `SyntheticControlResults` is now routed through `DiagnosticReport` (fit-based `scm_fit` parallel-trends analogue → verdict `design_enforced_pt` reading `pre_rmspe`; `_scm_native` surfaces `pre_rmspe` + donor-weight concentration + the placebo p-value when already computed — never triggering the refit loop implicitly), `practitioner_next_steps` (`_handle_synthetic_control` with the placebo as the headline significance step), and `BusinessReport` (fit-based assumption block, ADH 2010 attribution, robustness via `estimator_native_diagnostics`; HonestDiD passthrough rejected like SDiD/TROP). Also fixes a latent BR bug where the headline `is_significant` was a non-JSON-serializable numpy `bool_` when `p_value` is a numpy `NaN`. Documented in `docs/methodology/REGISTRY.md` §SyntheticControl (new `**Note:**` labels for the donor-pool construction, failure handling, RMSPE-ratio floor, and the non-analytical-p-value split), `docs/methodology/REPORTING.md`, `docs/api/synthetic_control.rst`, the LLM guides, and `README.md`. - **New estimator: `SyntheticControl` — classic Synthetic Control Method (Abadie, Diamond & Hainmueller 2010; Abadie & Gardeazabal 2003).** Standalone estimator (`diff_diff/synthetic_control.py`) + `SyntheticControlResults` (`diff_diff/synthetic_control_results.py`) + `synthetic_control()` convenience function, exported from `diff_diff`. Builds a single treated unit's counterfactual as a convex combination of never-treated donor units — **donor (unit) weights only**, no time weights or ridge, distinct from `SyntheticDiD`. The inner simplex-constrained weighted-LS solve `W*(V)` reuses `utils._sc_weight_fw` (folding `V^½` into the predictor matrix, `intercept=False`, `zeta=0`); the diagonal predictor-importance matrix `V` is selected data-driven by minimizing pre-period outcome MSPE (`v_method="nested"`, softmax-on-simplex multistart Nelder-Mead + Powell polish) or supplied by the user (`v_method="custom"`). Predictors are built from `predictors`/`predictor_window`/`predictors_op`, `special_predictors`, and per-period outcome lags (`pre_period_outcomes`), in the R `Synth::dataprep` row order; per-row standardization (SD over donors+treated, ddof=1) matches the R `Synth::synth` source. Reports the gap path (`α̂_1t = Y_1t − Σ_j w_j Y_jt`), `att` (mean post-period gap), `pre_rmspe`, donor weights, `v_weights`, and a predictor-balance table. **No analytical standard error** — `se`/`t_stat`/`p_value`/`conf_int` are NaN; significance comes from in-space placebo permutation inference via `in_space_placebo()` (see the dedicated entry below). Ten validation gates baked in: predictor-period leakage, absorbing post-period suffix + no-anticipation cross-check against the treatment column, post-period canonicalization, donor-pool filtering before period derivation, empty-window rejection, poor-pre-fit `UserWarning` (RMSPE > SD of treated pre-outcomes), duplicate-predictor-label rejection, inner-solve non-convergence warning, order-independent gap-path rebuild, and the `standardize="none"` deviation; plus fail-closed `custom_v` cross-field rules and degenerate single-donor / single-pre-period handling. **R-`Synth` parity** (`tests/test_methodology_synthetic_control.py`, fixtures generated by `benchmarks/R/generate_synth_basque_golden.R` into `tests/data/`): two-tier on the Basque Country study — Tier-1 feeds R's `solution.v` via `custom_v` and reproduces the published donor weights (region 10 Cataluña 0.851 + region 14 Madrid 0.149) to `atol=1e-3` deterministically; Tier-2 (`@pytest.mark.slow`) checks the data-driven nested fit lands in a tolerance band (the nested `V` legitimately differs because the outer objective uses all pre periods, not R's `time.optimize.ssr` window). Documented in `docs/methodology/REGISTRY.md` §SyntheticControl (with `**Deviation from R:** standardize="none"` and `**Note:**` labels for the standardization formula, objective window, softmax `V` parametrization, and 1×SD poor-fit threshold), `docs/api/synthetic_control.rst`, the LLM guides, and `README.md`. diff --git a/TODO.md b/TODO.md index 013513b4..17fed4c2 100644 --- a/TODO.md +++ b/TODO.md @@ -100,7 +100,6 @@ Deferred items from PR reviews that were not addressed before merge. | WooldridgeDiD: unconditional inference for `aggregate(weights="cohort_share")` accounting for sampling uncertainty in the cohort shares ω̂_g / ω̂_{ge} (paper W2025 Section 7.5). Current impl fail-closes the t-stat / p-value / conf-int fields to NaN under cohort-share aggregation because the analytical SE is conditional-on-shares. Proper APE/GMM-style aggregate inference (Wooldridge 2023 Section 4 framework) re-enables full inference. | `wooldridge_results.py::aggregate` cohort_share inference branch | PR-B follow-up | Medium | | WooldridgeDiD: `cohort_trends=True` + `survey_design` composition. PR-B Stage E fail-closes the cross-product with `NotImplementedError` at `fit()` because the full-dummy `dg_i · t` design composed with the survey TSL variance hasn't been validated against R-parity goldens. Follow-up: validate the composition (or implement a survey-aware alternative) and re-enable the surface. | `wooldridge.py` fit guard, `wooldridge_results.py::aggregate` (if survey-aware cohort_trends variance plumbing is added) | PR-B follow-up | Low | | WooldridgeDiD: `cohort_trends=True` + `control_group="never_treated"` composition. PR-B Stage E (codex R9 P1 fix) fail-closes the cross-product with `NotImplementedError` at `fit()` because the OLS + never_treated branch emits ALL `(g, t)` cells as treatment-cell dummies (paper Section 4.4 placebo coverage); the appended `dg_i · t` trend columns are linearly spanned by the per-cohort sum of those cell dummies, so the Section 8 trend specification is unidentified. Follow-up: implement a separate design-matrix branch that drops the pre-treatment placebo dummies (or restricts the trend interaction to post-treatment cells) under the trend specification, then re-enable the combination. | `wooldridge.py` fit guard + `_build_interaction_matrix` redesign for the cohort_trends path | PR-B follow-up | Low | -| WooldridgeDiD: optional *efficiency hint* (NOT a canonical-link violation per W2023 Prop 3.1) when method/outcome pairing is sub-optimal — e.g., `method="ols"` on binary data is consistent under QMLE, but `method="logit"` is typically more efficient. The original framing in this row as a "canonical link requirement" tied to Prop 3.1 was incorrect: Wooldridge (2023) Table 1 lists Gaussian/OLS for "any response" and logistic-Bernoulli for "binary OR fractional". A useful hint exists (efficiency), but should not be framed as a methodology violation. See PR #453 R1 review for the corrected reading. | `wooldridge.py` | #216 | Low | | WooldridgeDiD: Stata `jwdid` golden value tests — add R/Stata reference script and `TestReferenceValues` class. | `tests/test_wooldridge.py` | #216 | Medium | | PreTrendsPower: CS/SA `anticipation=1` R-parity fixture. The PR-C R-parity goldens cover NIS power + γ_p MDV at `atol=1e-4` on four shifted-grid / regular / irregular / K=1 fixtures, but R `pretrends` has no anticipation parameter so the Python-side `_extract_pre_period_params` anticipation filter (`if t < _pre_cutoff` in `pretrends.py` lines 1138-1150 for CS; mirror in SA branch) is not R-parity-locked. Build a synthetic `CallawaySantAnnaResults` (or `SunAbrahamResults`) with `anticipation=1` and a t=-1 event-study entry that should be filtered before reaching `_compute_power_nis`, then assert the resulting γ_p matches R's `slope_for_power()` on the K=4 shifted-grid fixture. Existing PR-B MC-based tests (`TestPretrendsPropositions`) and full-VCV tests (`TestPretrendsCovarianceSource`) already cover the filter mechanically; this would close the loop against R. | `tests/test_methodology_pretrends.py::TestPretrendsParityR`, `benchmarks/R/generate_pretrends_golden.R` | PR-C follow-up | Low | | `StackedDiD` `vcov_type="conley"` — deferred for a **methodology** reason, NOT plumbing (unlike the now-shipped SunAbraham / WooldridgeDiD-OLS conley threading): the stacked design replicates each control unit across every sub-experiment it qualifies for (`_build_sub_experiment`), so one geographic unit occupies many stacked rows. Conley's pairwise distance matrix would see those same-unit copies at distance 0 (`K(0)=1`, perfectly correlated), conflating the stacking-replication device with real spatial correlation, and there is no `conleyreg` analogue for stacked DiD to anchor parity. A correct treatment needs a per-stack spatial identifier and is **paper-gated**. | `diff_diff/stacked_did.py` | follow-up | Low | @@ -186,7 +185,7 @@ Ordered paydown view across the tables above. Tier A → D is by effort × risk, #### Tier A — Quick wins (≤1 day, ≤3 CI rounds expected) -- WooldridgeDiD: optional efficiency hint when method/outcome pairing is sub-optimal (NOT a canonical-link violation per W2023 Prop 3.1 — see Methodology/Correctness row for the corrected framing) +_(No active items. The sole prior entry — the WooldridgeDiD method/outcome efficiency hint — has shipped; see CHANGELOG `## [Unreleased]` and REGISTRY §WooldridgeDiD "Nonlinear extensions".)_ (SyntheticDiD `placebo_effects` → `variance_effects` rename moved to Tier B — the user-facing field rename + one-release deprecation alias is too large for ≤1 day / ≤3 CI rounds.) diff --git a/diff_diff/guides/llms-full.txt b/diff_diff/guides/llms-full.txt index b9a54e21..c0e2caf9 100644 --- a/diff_diff/guides/llms-full.txt +++ b/diff_diff/guides/llms-full.txt @@ -1084,7 +1084,10 @@ results.aggregate("event") # ATT by event time (relative to treatment) ```python from diff_diff import WooldridgeDiD -# OLS (linear outcomes) +# OLS (linear outcomes). On a binary/count outcome this emits a UserWarning that +# logit/poisson is often the more appropriate specification -- link-scale (not +# level) parallel trends, and less biased/more precise in Wooldridge (2023) sims; +# a different identifying assumption, so a recommended comparison not a free switch. etwfe = WooldridgeDiD(method='ols') results = etwfe.fit(data, outcome='y', unit='id', time='t', cohort='first_treat') print(results.aggregate("simple")) diff --git a/diff_diff/wooldridge.py b/diff_diff/wooldridge.py index fd977bc2..d6597f5d 100644 --- a/diff_diff/wooldridge.py +++ b/diff_diff/wooldridge.py @@ -131,6 +131,45 @@ def _warn_and_fill_nan_cohort(df: pd.DataFrame, cohort: str, stacklevel: int) -> return df +def _suggest_nonlinear_method(outcome_values) -> Optional[str]: + """Outcome-fit hint for ``WooldridgeDiD(method="ols")``. + + Return ``"logit"`` when the outcome support is binary (exactly ``{0, 1}``) + or ``"poisson"`` when it is a non-negative integer count (>2 distinct + values), else ``None``. + + OLS/Gaussian QMLE is a valid method for any response (Wooldridge 2023 + Table 1), but for binary / count outcomes a matching nonlinear model + (logit / Poisson) is often the *more appropriate* specification: it imposes + parallel trends on the link/index scale rather than in levels (level-PT is + implausible for such limited outcomes), and where that nonlinear mean holds + Wooldridge's (2023) simulations show the linear model is both biased and + less precise. It rests on a *different identifying assumption* — which one + fits depends on which parallel-trends restriction holds — so this is a + recommended comparison, not a free upgrade and not a canonical-link + requirement. + + Pure and non-fatal: non-numeric / all-non-finite / empty inputs return + ``None`` rather than raising, so a fit that would otherwise succeed is + never broken by the hint. + """ + try: + y = np.asarray(outcome_values, dtype=float) + except (TypeError, ValueError): + return None + y = y[np.isfinite(y)] + if y.size == 0: + return None + uniq = np.unique(y) + # Binary: support is exactly {0, 1} (both present) -> Bernoulli QMLE (logit). + if uniq.size == 2 and np.all((uniq == 0) | (uniq == 1)): + return "logit" + # Count: non-negative integers with >2 distinct values -> Poisson QMLE. + if uniq.size > 2 and np.all(y >= 0) and np.all(y == np.floor(y)): + return "poisson" + return None + + def _filter_sample( data: pd.DataFrame, unit: str, @@ -275,8 +314,18 @@ class WooldridgeDiD: Parameters ---------- method : {"ols", "logit", "poisson"} - Estimation method. "ols" for continuous outcomes; "logit" for binary - or fractional outcomes; "poisson" for count data. + Estimation method. ``"ols"`` is the linear baseline — valid for any + response (Wooldridge 2023) and the usual choice for continuous outcomes; + ``"logit"`` for binary or fractional outcomes; ``"poisson"`` for count + data. When ``method="ols"`` is used on a binary (``{0, 1}``) or + non-negative integer-count outcome, a ``UserWarning`` notes that a + matching nonlinear model (logit / Poisson) is often the *more + appropriate* specification — it imposes parallel trends on the link + scale rather than in levels, and Wooldridge's (2023) simulations show + the linear model both biased and less precise for such outcomes when + the nonlinear mean holds. It rests on a different identifying assumption + than linear OLS, so it is a recommended comparison, not an automatic + switch; suppress via ``warnings.filterwarnings``. control_group : {"not_yet_treated", "never_treated"} Which units serve as the comparison group. "not_yet_treated" (jwdid default) uses all untreated observations at each time period; @@ -710,6 +759,45 @@ def fit( "analytical Conley SE, or vcov_type='hc1' with the bootstrap." ) + # 0g. Outcome-fit hint (non-fatal): method="ols" on a binary or + # non-negative-count outcome. For such outcomes a matching nonlinear + # model (logit / Poisson) is often the MORE APPROPRIATE specification — + # it imposes parallel trends on the link/index scale, not in levels + # (Wooldridge 2023: level-PT is implausible for limited outcomes), and + # where the nonlinear mean holds its sims show the linear model both + # biased and less precise. A DIFFERENT identifying assumption than + # linear OLS (which fits depends on which PT holds), so the hint is a + # recommended comparison, not a free upgrade (see REGISTRY.md + # §WooldridgeDiD "Nonlinear extensions"). Detection reads the FULL + # outcome column: outcome *type* is a property of the variable, so the + # named column is the right signal. This is also identical to the + # estimation sample here — _filter_sample expresses the control-group + # choice through the design matrix, NOT by dropping rows, so the column + # and the fitted sample share the same outcome support (invariant pinned + # by test_filter_sample_preserves_outcome_support); classifying on the + # variable stays correct should that ever change. + if self.method == "ols": + suggested = _suggest_nonlinear_method(df[outcome]) + if suggested is not None: + kind = "binary" if suggested == "logit" else "count" + warnings.warn( + f"WooldridgeDiD(method='ols'): outcome {outcome!r} looks " + f"{kind}. For a {kind} outcome a matching nonlinear model " + f"(method={suggested!r}) is often the more appropriate " + f"specification — it imposes parallel trends on the " + f"link/index scale, not in levels (Wooldridge 2023: " + f"level-PT is implausible for limited outcomes), and where " + f"that nonlinear mean holds its simulations show the linear " + f"model both biased and less precise. It is a different " + f"identifying assumption than linear OLS (which fits depends " + f"on which parallel-trends restriction holds), so treat " + f"method={suggested!r} as a recommended comparison, not an " + f"automatic switch; suppress this hint via " + f"warnings.filterwarnings.", + UserWarning, + stacklevel=2, + ) + # 1. Filter to analysis sample sample = _filter_sample(df, unit, time, cohort, self.control_group, self.anticipation) diff --git a/docs/api/wooldridge_etwfe.rst b/docs/api/wooldridge_etwfe.rst index 6142f185..cd066fa3 100644 --- a/docs/api/wooldridge_etwfe.rst +++ b/docs/api/wooldridge_etwfe.rst @@ -105,6 +105,22 @@ Basic OLS (follows Stata ``jwdid y, ivar(unit) tvar(time) gvar(cohort)``):: print(r.summary('group')) print(r.summary('simple')) +.. note:: + + When ``method="ols"`` is applied to a binary (``{0, 1}``) or non-negative + integer-count outcome, ``fit()`` emits a ``UserWarning`` noting that a + matching nonlinear model (``method="logit"`` / ``method="poisson"``) is often + the *more appropriate* specification for such outcomes — it imposes parallel + trends on the link/index scale rather than in levels (Wooldridge 2023 notes + level-PT is only valid for continuous/unbounded outcomes), and in that + paper's simulations the linear model is both biased and less precise where + the nonlinear mean holds. It rests on a *different identifying assumption* + than linear OLS, so treat it as a recommended comparison, not an automatic + switch. OLS remains a valid QMLE for *any* response (Wooldridge 2023); + suppress the hint via ``warnings.filterwarnings``. The check is heuristic: + bounded discrete (binomial-style) outcomes with a known upper bound are not + separately detected from unbounded counts. + View cohort×time cell estimates (post-treatment):: for (g, t), v in sorted(r.group_time_effects.items()): diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md index a6256831..38c8022f 100644 --- a/docs/methodology/REGISTRY.md +++ b/docs/methodology/REGISTRY.md @@ -1579,6 +1579,8 @@ Average Structural Function (ASF) approach. For each treated cell (g, t): where `g(·)` is the link inverse (logistic or exp), `η_i` is the individual linear predictor (fixed effects + controls), and `δ_{g,t}` is the interaction coefficient from the nonlinear model. +- **Note (outcome-fit hint):** When `method="ols"` is used on a **binary** (`{0,1}`) or a **non-negative integer count** outcome, `WooldridgeDiD.fit()` emits a `UserWarning` noting that a matching nonlinear model (`method="logit"` / `method="poisson"`) is often the **more appropriate specification** for such outcomes. The framing is the paper's, not an efficiency heuristic: the nonlinear paths impose parallel trends on the **link/index scale** (logit index / Poisson log-mean) rather than in **levels**, and Wooldridge (2023) states the **linear-PT assumption is only valid for continuous/unbounded outcomes** (Eq. 2.5 vs the Index-PT Eq. 2.6–2.7). In the paper's **Section 5 simulations** the linear model is both **biased** (POLS −0.15 to −0.29 for binary; OLS >30%, >50% in places, for counts) **and less precise** (nonlinear SEs ~30–70% smaller) where the nonlinear mean holds, and the paper notes pre-trends tests often **fail to detect** the linear model's misspecification. This is a **different identifying assumption** than linear OLS — which one fits depends on which parallel-trends restriction holds — so the warning frames the nonlinear model as a **recommended comparison, not an automatic switch or a free upgrade**. Per **Table 1**, Gaussian/OLS remains a valid QMLE for *any* response, so the hint never asserts OLS is mechanically wrong. **Framing provenance (two-sided):** the earlier "canonical link requirement tied to Prop 3.1" reading was **incorrect** (corrected in **PR #453 R1**); the warning is equally careful *not* to over-correct into framing the nonlinear path as a pure efficiency add-on, since the paper's case is primarily about **bias / appropriateness** (codex R1, reconciled against `docs/methodology/papers/wooldridge-2023-review.md` §5). Detection is deliberately high-signal — binary requires support exactly `{0,1}`; the count branch suggests Poisson for non-negative integers with `>2` distinct values (Poisson is the natural *unbounded* nonnegative/count model per Table 1; bounded-support / **binomial** outcomes with a known upper bound are not separately detected — that routing would need user-supplied metadata). Fractional / bounded-continuous outcomes are **not flagged by this heuristic** (fractional responses are themselves a logistic/Bernoulli case in Table 1; the heuristic simply does not attempt to detect them — a coverage limit, not a claim that OLS is preferred for such outcomes). **Detection sample:** the hint reads the *full* outcome column passed to `fit()` — outcome *type* is a property of the variable, so the named column is the right signal. In this estimator that column is also identical to the estimation sample: `WooldridgeDiD._filter_sample` expresses the control-group choice through the design matrix (`_build_interaction_matrix`), **not** by dropping rows, so the full column and the fitted sample always share the same outcome support (invariant pinned by `tests/test_wooldridge.py::TestOutcomeFitHint::test_filter_sample_preserves_outcome_support`). Classifying on the variable keeps the suggestion methodologically correct (a count is a count regardless of which rows enter estimation) should that selection ever change. Always-on (suppress via `warnings.filterwarnings`), implemented as the pure helper `_suggest_nonlinear_method` + the OLS-only fit-time gate `0g`; it never alters the fit or raises. + *Standard errors:* - OLS: Cluster-robust sandwich estimator at the unit level (default) - Logit/Poisson: QMLE sandwich `(X'WX)^{-1} meat (X'WX)^{-1}` via `compute_robust_vcov(..., weights=w, weight_type="aweight")` where `w = p_i(1-p_i)` for logit or `w = μ_i` for Poisson diff --git a/tests/test_wooldridge.py b/tests/test_wooldridge.py index 554d4147..e5264a45 100644 --- a/tests/test_wooldridge.py +++ b/tests/test_wooldridge.py @@ -11,6 +11,7 @@ _build_interaction_matrix, _filter_sample, _prepare_covariates, + _suggest_nonlinear_method, ) from diff_diff.wooldridge_results import WooldridgeDiDResults @@ -2221,3 +2222,132 @@ def _raise(X, cluster_ids, bread, contrasts): assert np.isnan(eff["p_value"]) assert np.isnan(eff["conf_int"][0]) assert np.isnan(eff["conf_int"][1]) + + +class TestOutcomeFitHint: + """method='ols' outcome-fit hint: binary -> logit, count -> poisson. + + Per Wooldridge (2023), a matching nonlinear model is often the *more + appropriate* specification for binary/count outcomes -- link-scale (not + level) parallel trends, and less biased / more precise in the paper's + Section 5 simulations. A different identifying assumption, so a recommended + comparison, never a free efficiency upgrade or a canonical-link / validity + requirement. See REGISTRY.md WooldridgeDiD "Nonlinear extensions". + """ + + # ---- detector unit tests ---- + @pytest.mark.parametrize( + "values, expected", + [ + ([0, 1, 1, 0, 1], "logit"), # binary {0, 1} + ([0.0, 1.0, 0.0], "logit"), # binary as floats + (pd.Series([True, False, True]), "logit"), # bool dtype + ([0, 1, 2, 3, 4], "poisson"), # count + ([1, 2, 3], "poisson"), # count without zero + ([0, 1, 2, 3], "poisson"), # bounded-support integer routes to poisson too + # ^ documented heuristic limit: a known-upper-bound (binomial-style) + # integer outcome is NOT separately distinguished from an unbounded + # count -- both take the poisson branch (Wooldridge 2023 Table 1). + ([0.1, 0.5, 1.7, 2.3], None), # continuous + ([0.1, 0.4, 0.9], None), # fractional in [0, 1] + ([-1, 0, 1, 2], None), # has a negative + ([np.nan, np.nan], None), # all non-finite + ([1.0, 1.0, 1.0], None), # constant / single value + (pd.Series(["a", "b", "c"], dtype=object), None), # non-numeric + ], + ) + def test_detector(self, values, expected): + assert _suggest_nonlinear_method(values) == expected + + # ---- gate behavior ---- + def _binary_panel(self): + df = _make_panel(seed=1) + rng = np.random.default_rng(1) + df["y"] = rng.integers(0, 2, len(df)).astype(float) + return df + + def _count_panel(self): + df = _make_panel(seed=2) + rng = np.random.default_rng(2) + df["y"] = rng.integers(0, 6, len(df)).astype(float) + return df + + @staticmethod + def _fit(df, **kwargs): + return WooldridgeDiD(**kwargs).fit(df, "y", "unit", "time", "cohort") + + @staticmethod + def _hint_msgs(rec): + return [str(w.message) for w in rec if "matching nonlinear model" in str(w.message)] + + def test_ols_binary_warns_logit(self): + df = self._binary_panel() + with pytest.warns(UserWarning, match=r"method='logit'.*more appropriate"): + res = self._fit(df, method="ols") + assert np.isfinite(res.overall_att) + + def test_ols_count_warns_poisson(self): + df = self._count_panel() + with pytest.warns(UserWarning, match=r"method='poisson'.*more appropriate"): + self._fit(df, method="ols") + + def test_ols_continuous_silent(self): + df = _make_panel(seed=3) # default y is standard-normal continuous + with warnings.catch_warnings(record=True) as rec: + warnings.simplefilter("always") + self._fit(df, method="ols") + assert not self._hint_msgs(rec) + + def test_logit_binary_no_hint(self): + # The hint is OLS-only; a logit fit on binary data must not emit it. + df = self._binary_panel() + with warnings.catch_warnings(record=True) as rec: + warnings.simplefilter("always") + self._fit(df, method="logit") + assert not self._hint_msgs(rec) + + def test_hint_fires_with_cohort_trends(self): + # Param-interaction smoke: hint still fires on the cohort_trends path. + df = self._binary_panel() + with pytest.warns(UserWarning, match=r"more appropriate"): + self._fit(df, method="ols", cohort_trends=True) + + def test_suppression_via_filterwarnings(self): + df = self._binary_panel() + with warnings.catch_warnings(): + warnings.filterwarnings("ignore", category=UserWarning) + res = self._fit(df, method="ols") # must not raise + assert np.isfinite(res.overall_att) + + def test_framing_paper_faithful(self): + # Lock the paper-faithful framing (Wooldridge 2023: LPT vs IPT + the + # Section 5 simulations). The nonlinear model is the *more appropriate* + # specification for binary/count outcomes — it reduces bias (not only + # variance) and rests on a *different identifying assumption* (link-scale + # vs level parallel trends): a recommended comparison, never an + # unconditional efficiency upgrade and never a "violation" of OLS. + df = self._binary_panel() + with warnings.catch_warnings(record=True) as rec: + warnings.simplefilter("always") + self._fit(df, method="ols") + msgs = self._hint_msgs(rec) + assert msgs + text = msgs[0].lower() + assert "more appropriate" in text # appropriateness, not only efficiency + assert "biased" in text # the paper's bias finding, not just precision + assert "assumption" in text # a different identifying assumption + assert "recommended comparison" in text # a comparison, not a switch + # Never frame OLS as wrong / a link requirement or an automatic upgrade. + for forbidden in ("canonical", "violation", "required", "must use"): + assert forbidden not in text + + def test_filter_sample_preserves_outcome_support(self): + # Invariant behind full-column detection: _filter_sample selects the + # control group via the design matrix, NOT by dropping rows, so the + # full outcome column and the estimation sample always share the same + # support -- the pre/post-filter detection distinction is therefore + # moot. A future refactor that drops rows would surface here. + df = self._count_panel() + for cg in ("not_yet_treated", "never_treated"): + sample = _filter_sample(df, "unit", "time", "cohort", cg, 0) + assert sorted(sample["y"].unique()) == sorted(df["y"].unique())