igerber · igerber · Jun 1, 2026 · May 31, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -9,6 +9,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Added
 - **PowerAnalysis methodology-review-tracker promotion: In Progress → Complete, with a panel-variance correction (behavior change).** Closes the Bloom (1995) + Burlig, Preonas & Woerman (2020) source audits on the tracker (PR-A #506 added both paper reviews + under-review Notes; this PR validates the source against the code and reconciles the discrepancies). **Behavior change:** the analytical *panel* DiD variance was the Moulton design-effect factor `(1+(T−1)·rho)/T`, wrong two ways versus the source — wrong period-scaling (~4× too small at `rho=0`, `m=r=5` versus the iid DiD benchmark) and the **opposite `rho`-sign** (it *raised* the MDE as within-unit correlation grew). It is replaced by the within-unit equicorrelated special case of Burlig et al. Eq. 2, `Var(ATT) = sigma² · (1/n_T + 1/n_C) · (1/n_pre + 1/n_post) · (1 − rho)`, in which within-unit (serial) correlation *lowers* the MDE because the difference-in-differences cancels the shared within-unit component. So `PowerAnalysis.mde` / `power` / `sample_size` (and the `compute_*` wrappers) now return a **smaller** MDE / required N as `rho` rises for **all** designs; the 2×2 path matches Bloom's `2σ²` at the default `rho = 0` and is continuous with the panel form at `n_pre = n_post = 1`. New input validation, enforced for **all** designs *before* the 2×2-vs-panel router: `n_pre >= 1`, `n_post >= 1`, `rho ∈ [−1/(T−1), 1)` (`T = n_pre + n_post`), finite `sigma >= 0`, positive group counts, and `treat_frac ∈ (0, 1)` now raise `ValueError` (previously invalid two-period shapes and out-of-range `rho` fell through to `basic_did` silently). The `(1 − rho)` factor applies at `T = 2` too — the 2×2 path is Burlig's `m = r = 1` special case (footnote 11), so a nonzero `rho` is no longer silently ignored there, while `rho = 0` still recovers Bloom's `2σ²`. The MDE multiplier stays the **normal (z)** Bloom multiplier (a deliberate large-sample approximation to Burlig's t, documented as `**Deviation from R:**`) — unchanged. New `tests/test_methodology_power.py` (Bloom Table 1 multipliers; 2×2 + panel closed forms; a literal-equicorrelated Monte-Carlo validation of the panel variance; `sample_size`↔`mde` round-trip; input-guard + `rho`-at-`T=2` + `compute_*` wrapper validation; base-R `qnorm` parity at `benchmarks/data/r_power_golden.json`, generator `benchmarks/R/generate_power_golden.R`); the two `tests/test_power.py` ICC-direction tests were inverted to Burlig's sign. REGISTRY `## PowerAnalysis` equation block rewritten (z not t; corrected 2×2 / panel SE + sample-size; removed the cluster-`m` and inverted-`R²` terms that matched neither code nor source); `docs/references.rst` adds Frison & Pocock (1992) + McKenzie (2012) as the equicorrelated lineage; tutorial `06_power_analysis.ipynb` corrected. `METHODOLOGY_REVIEW.md` row promoted to **Complete** (`Last Review = 2026-05-31`); priority queue pruned; the PR-A under-review Notes removed across REGISTRY / `power.py` / `references.rst`.
+- **`WooldridgeDiD` outcome-fit hint:** `WooldridgeDiD(method="ols")` now emits a `UserWarning` when the outcome is binary (`{0, 1}`) or a non-negative integer count, noting that a matching nonlinear model (`method="logit"` / `method="poisson"`) is often the **more appropriate specification** for such outcomes. Following Wooldridge (2023): the nonlinear paths impose parallel trends on the link/index scale rather than in levels (level-PT is only valid for continuous/unbounded outcomes), and the paper's Section 5 simulations show the linear model both biased and less precise where the nonlinear mean holds. It is a **different identifying assumption** than linear OLS — which one fits depends on which parallel-trends restriction holds — so the warning frames it as a recommended comparison, not an automatic switch or free efficiency upgrade. OLS remains a valid QMLE for *any* response (Table 1). Always-on (suppress via `warnings.filterwarnings`); detection is high-signal (binary requires exactly `{0, 1}`; the count branch suggests Poisson — the natural unbounded-count model — for *any* non-negative integers with >2 distinct values, so bounded binomial / known-upper-bound integer outcomes are not separately distinguished from unbounded counts; fractional / continuous outcomes are not flagged).
 - **New tutorial: `docs/tutorials/24_staggered_vs_collapsed_power.ipynb` — "Staggered Rollout or a Simple 2×2? A Power-Analysis Decision Guide".** A practitioner walkthrough for geo experiments (framed on a 50-state staggered rollout) on when to reach for Callaway-Sant'Anna vs collapsing to a familiar pre/post 2×2. Shows, with live paired Monte Carlo on `generate_staggered_data`, that the collapsed 2×2 silently targets a *diluted* estimand (reports ~60–94% of the true effect-on-treated as the rollout staggers, with near-zero CI coverage of the truth under a slow rollout), and that CS's minimum-detectable-lift penalty is a *fast-rollout* phenomenon that shrinks to parity as the rollout becomes more staggered. Fully self-contained (runs live, no committed data files); ends with a CS-vs-2×2 decision guide.
 - **`SyntheticControl` in-space placebo permutation inference + reporting-stack integration (ADH 2010 §2.4).** New `SyntheticControlResults.in_space_placebo()` provides the significance test classic SCM lacks an analytical SE for: it reassigns treatment to each donor, refits a synthetic control for that pseudo-treated donor against the **other `J−1` donors** (the real treated unit is excluded from every placebo pool — its post-period is treatment-contaminated; matches `SCtools::generate.placebos`), and ranks the treated unit's post/pre **RMSPE ratio** among the `J+1` units. New fields `placebo_p_value` (`= rank/(n_placebos+1)`, an upper-tail rank test on the unsigned RMSPE ratio — direction-agnostic, so it detects an effect of *either* sign rather than a signed/one-directional hypothesis; ties counted via `≥`), `rmspe_ratio` (the treated statistic, set at fit), and `n_placebos`/`n_failed` (effective reference-set sizes; non-converged placebos are excluded from BOTH numerator and denominator, never penalized into the rank). `placebo_p_value` is a **separate field** from the (always-NaN) `p_value` — it is a permutation p-value with no SE/t-stat and does not flow through `safe_inference`; `is_significant` stays bound to `p_value`. Edge cases fail closed: scale-aware RMSPE-ratio floor (a perfect pre-fit gives a finite ratio, not `inf`), `J<2` → NaN+warn, `J==2` → degenerate+coarse warn, deterministic given `seed`. New `get_placebo_df()` returns the per-unit RMSPE-ratio summary table (incl. the treated row and any failed donors) used for the rank. The design keeps the placebo *compute* opt-in — the per-donor refit loop runs only on the explicit `in_space_placebo()` call. To support that opt-in call, every fit retains a `_SyntheticControlFitSnapshot` of the pivoted panel (memory O(units × periods × predictor-vars), like `SyntheticDiD`'s snapshot for `in_time_placebo`; excluded from pickling). A compact/lazy snapshot representation is tracked as a follow-up in `TODO.md`. **Reporting-stack integration:** `SyntheticControlResults` is now routed through `DiagnosticReport` (fit-based `scm_fit` parallel-trends analogue → verdict `design_enforced_pt` reading `pre_rmspe`; `_scm_native` surfaces `pre_rmspe` + donor-weight concentration + the placebo p-value when already computed — never triggering the refit loop implicitly), `practitioner_next_steps` (`_handle_synthetic_control` with the placebo as the headline significance step), and `BusinessReport` (fit-based assumption block, ADH 2010 attribution, robustness via `estimator_native_diagnostics`; HonestDiD passthrough rejected like SDiD/TROP). Also fixes a latent BR bug where the headline `is_significant` was a non-JSON-serializable numpy `bool_` when `p_value` is a numpy `NaN`. Documented in `docs/methodology/REGISTRY.md` §SyntheticControl (new `**Note:**` labels for the donor-pool construction, failure handling, RMSPE-ratio floor, and the non-analytical-p-value split), `docs/methodology/REPORTING.md`, `docs/api/synthetic_control.rst`, the LLM guides, and `README.md`.
 - **New estimator: `SyntheticControl` — classic Synthetic Control Method (Abadie, Diamond & Hainmueller 2010; Abadie & Gardeazabal 2003).** Standalone estimator (`diff_diff/synthetic_control.py`) + `SyntheticControlResults` (`diff_diff/synthetic_control_results.py`) + `synthetic_control()` convenience function, exported from `diff_diff`. Builds a single treated unit's counterfactual as a convex combination of never-treated donor units — **donor (unit) weights only**, no time weights or ridge, distinct from `SyntheticDiD`. The inner simplex-constrained weighted-LS solve `W*(V)` reuses `utils._sc_weight_fw` (folding `V^½` into the predictor matrix, `intercept=False`, `zeta=0`); the diagonal predictor-importance matrix `V` is selected data-driven by minimizing pre-period outcome MSPE (`v_method="nested"`, softmax-on-simplex multistart Nelder-Mead + Powell polish) or supplied by the user (`v_method="custom"`). Predictors are built from `predictors`/`predictor_window`/`predictors_op`, `special_predictors`, and per-period outcome lags (`pre_period_outcomes`), in the R `Synth::dataprep` row order; per-row standardization (SD over donors+treated, ddof=1) matches the R `Synth::synth` source. Reports the gap path (`α̂_1t = Y_1t − Σ_j w_j Y_jt`), `att` (mean post-period gap), `pre_rmspe`, donor weights, `v_weights`, and a predictor-balance table. **No analytical standard error** — `se`/`t_stat`/`p_value`/`conf_int` are NaN; significance comes from in-space placebo permutation inference via `in_space_placebo()` (see the dedicated entry below). Ten validation gates baked in: predictor-period leakage, absorbing post-period suffix + no-anticipation cross-check against the treatment column, post-period canonicalization, donor-pool filtering before period derivation, empty-window rejection, poor-pre-fit `UserWarning` (RMSPE > SD of treated pre-outcomes), duplicate-predictor-label rejection, inner-solve non-convergence warning, order-independent gap-path rebuild, and the `standardize="none"` deviation; plus fail-closed `custom_v` cross-field rules and degenerate single-donor / single-pre-period handling. **R-`Synth` parity** (`tests/test_methodology_synthetic_control.py`, fixtures generated by `benchmarks/R/generate_synth_basque_golden.R` into `tests/data/`): two-tier on the Basque Country study — Tier-1 feeds R's `solution.v` via `custom_v` and reproduces the published donor weights (region 10 Cataluña 0.851 + region 14 Madrid 0.149) to `atol=1e-3` deterministically; Tier-2 (`@pytest.mark.slow`) checks the data-driven nested fit lands in a tolerance band (the nested `V` legitimately differs because the outer objective uses all pre periods, not R's `time.optimize.ssr` window). Documented in `docs/methodology/REGISTRY.md` §SyntheticControl (with `**Deviation from R:** standardize="none"` and `**Note:**` labels for the standardization formula, objective window, softmax `V` parametrization, and 1×SD poor-fit threshold), `docs/api/synthetic_control.rst`, the LLM guides, and `README.md`.

diff --git a/TODO.md b/TODO.md
@@ -100,7 +100,6 @@ Deferred items from PR reviews that were not addressed before merge.
 | WooldridgeDiD: unconditional inference for `aggregate(weights="cohort_share")` accounting for sampling uncertainty in the cohort shares ω̂_g / ω̂_{ge} (paper W2025 Section 7.5). Current impl fail-closes the t-stat / p-value / conf-int fields to NaN under cohort-share aggregation because the analytical SE is conditional-on-shares. Proper APE/GMM-style aggregate inference (Wooldridge 2023 Section 4 framework) re-enables full inference. | `wooldridge_results.py::aggregate` cohort_share inference branch | PR-B follow-up | Medium |
 | WooldridgeDiD: `cohort_trends=True` + `survey_design` composition. PR-B Stage E fail-closes the cross-product with `NotImplementedError` at `fit()` because the full-dummy `dg_i · t` design composed with the survey TSL variance hasn't been validated against R-parity goldens. Follow-up: validate the composition (or implement a survey-aware alternative) and re-enable the surface. | `wooldridge.py` fit guard, `wooldridge_results.py::aggregate` (if survey-aware cohort_trends variance plumbing is added) | PR-B follow-up | Low |
 | WooldridgeDiD: `cohort_trends=True` + `control_group="never_treated"` composition. PR-B Stage E (codex R9 P1 fix) fail-closes the cross-product with `NotImplementedError` at `fit()` because the OLS + never_treated branch emits ALL `(g, t)` cells as treatment-cell dummies (paper Section 4.4 placebo coverage); the appended `dg_i · t` trend columns are linearly spanned by the per-cohort sum of those cell dummies, so the Section 8 trend specification is unidentified. Follow-up: implement a separate design-matrix branch that drops the pre-treatment placebo dummies (or restricts the trend interaction to post-treatment cells) under the trend specification, then re-enable the combination. | `wooldridge.py` fit guard + `_build_interaction_matrix` redesign for the cohort_trends path | PR-B follow-up | Low |
-| WooldridgeDiD: optional *efficiency hint* (NOT a canonical-link violation per W2023 Prop 3.1) when method/outcome pairing is sub-optimal — e.g., `method="ols"` on binary data is consistent under QMLE, but `method="logit"` is typically more efficient. The original framing in this row as a "canonical link requirement" tied to Prop 3.1 was incorrect: Wooldridge (2023) Table 1 lists Gaussian/OLS for "any response" and logistic-Bernoulli for "binary OR fractional". A useful hint exists (efficiency), but should not be framed as a methodology violation. See PR #453 R1 review for the corrected reading. | `wooldridge.py` | #216 | Low |
 | WooldridgeDiD: Stata `jwdid` golden value tests — add R/Stata reference script and `TestReferenceValues` class. | `tests/test_wooldridge.py` | #216 | Medium |
 | PreTrendsPower: CS/SA `anticipation=1` R-parity fixture. The PR-C R-parity goldens cover NIS power + γ_p MDV at `atol=1e-4` on four shifted-grid / regular / irregular / K=1 fixtures, but R `pretrends` has no anticipation parameter so the Python-side `_extract_pre_period_params` anticipation filter (`if t < _pre_cutoff` in `pretrends.py` lines 1138-1150 for CS; mirror in SA branch) is not R-parity-locked. Build a synthetic `CallawaySantAnnaResults` (or `SunAbrahamResults`) with `anticipation=1` and a t=-1 event-study entry that should be filtered before reaching `_compute_power_nis`, then assert the resulting γ_p matches R's `slope_for_power()` on the K=4 shifted-grid fixture. Existing PR-B MC-based tests (`TestPretrendsPropositions`) and full-VCV tests (`TestPretrendsCovarianceSource`) already cover the filter mechanically; this would close the loop against R. | `tests/test_methodology_pretrends.py::TestPretrendsParityR`, `benchmarks/R/generate_pretrends_golden.R` | PR-C follow-up | Low |
 | `StackedDiD` `vcov_type="conley"` — deferred for a **methodology** reason, NOT plumbing (unlike the now-shipped SunAbraham / WooldridgeDiD-OLS conley threading): the stacked design replicates each control unit across every sub-experiment it qualifies for (`_build_sub_experiment`), so one geographic unit occupies many stacked rows. Conley's pairwise distance matrix would see those same-unit copies at distance 0 (`K(0)=1`, perfectly correlated), conflating the stacking-replication device with real spatial correlation, and there is no `conleyreg` analogue for stacked DiD to anchor parity. A correct treatment needs a per-stack spatial identifier and is **paper-gated**. | `diff_diff/stacked_did.py` | follow-up | Low |
@@ -186,7 +185,7 @@ Ordered paydown view across the tables above. Tier A → D is by effort × risk,
 
 #### Tier A — Quick wins (≤1 day, ≤3 CI rounds expected)
 
-- WooldridgeDiD: optional efficiency hint when method/outcome pairing is sub-optimal (NOT a canonical-link violation per W2023 Prop 3.1 — see Methodology/Correctness row for the corrected framing)
+_(No active items. The sole prior entry — the WooldridgeDiD method/outcome efficiency hint — has shipped; see CHANGELOG `## [Unreleased]` and REGISTRY §WooldridgeDiD "Nonlinear extensions".)_
 
 (SyntheticDiD `placebo_effects` → `variance_effects` rename moved to Tier B — the user-facing field rename + one-release deprecation alias is too large for ≤1 day / ≤3 CI rounds.)
 

diff --git a/diff_diff/guides/llms-full.txt b/diff_diff/guides/llms-full.txt
@@ -1084,7 +1084,10 @@ results.aggregate("event")     # ATT by event time (relative to treatment)
 ```python
 from diff_diff import WooldridgeDiD
 
-# OLS (linear outcomes)
+# OLS (linear outcomes). On a binary/count outcome this emits a UserWarning that
+# logit/poisson is often the more appropriate specification -- link-scale (not
+# level) parallel trends, and less biased/more precise in Wooldridge (2023) sims;
+# a different identifying assumption, so a recommended comparison not a free switch.
 etwfe = WooldridgeDiD(method='ols')
 results = etwfe.fit(data, outcome='y', unit='id', time='t', cohort='first_treat')
 print(results.aggregate("simple"))