diff --git a/CHANGELOG.md b/CHANGELOG.md index 7bd213d8..789da6f4 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -10,6 +10,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### Added - **PowerAnalysis methodology-review-tracker promotion: In Progress → Complete, with a panel-variance correction (behavior change).** Closes the Bloom (1995) + Burlig, Preonas & Woerman (2020) source audits on the tracker (PR-A #506 added both paper reviews + under-review Notes; this PR validates the source against the code and reconciles the discrepancies). **Behavior change:** the analytical *panel* DiD variance was the Moulton design-effect factor `(1+(T−1)·rho)/T`, wrong two ways versus the source — wrong period-scaling (~4× too small at `rho=0`, `m=r=5` versus the iid DiD benchmark) and the **opposite `rho`-sign** (it *raised* the MDE as within-unit correlation grew). It is replaced by the within-unit equicorrelated special case of Burlig et al. Eq. 2, `Var(ATT) = sigma² · (1/n_T + 1/n_C) · (1/n_pre + 1/n_post) · (1 − rho)`, in which within-unit (serial) correlation *lowers* the MDE because the difference-in-differences cancels the shared within-unit component. So `PowerAnalysis.mde` / `power` / `sample_size` (and the `compute_*` wrappers) now return a **smaller** MDE / required N as `rho` rises for **all** designs; the 2×2 path matches Bloom's `2σ²` at the default `rho = 0` and is continuous with the panel form at `n_pre = n_post = 1`. New input validation, enforced for **all** designs *before* the 2×2-vs-panel router: `n_pre >= 1`, `n_post >= 1`, `rho ∈ [−1/(T−1), 1)` (`T = n_pre + n_post`), finite `sigma >= 0`, positive group counts, and `treat_frac ∈ (0, 1)` now raise `ValueError` (previously invalid two-period shapes and out-of-range `rho` fell through to `basic_did` silently). The `(1 − rho)` factor applies at `T = 2` too — the 2×2 path is Burlig's `m = r = 1` special case (footnote 11), so a nonzero `rho` is no longer silently ignored there, while `rho = 0` still recovers Bloom's `2σ²`. The MDE multiplier stays the **normal (z)** Bloom multiplier (a deliberate large-sample approximation to Burlig's t, documented as `**Deviation from R:**`) — unchanged. New `tests/test_methodology_power.py` (Bloom Table 1 multipliers; 2×2 + panel closed forms; a literal-equicorrelated Monte-Carlo validation of the panel variance; `sample_size`↔`mde` round-trip; input-guard + `rho`-at-`T=2` + `compute_*` wrapper validation; base-R `qnorm` parity at `benchmarks/data/r_power_golden.json`, generator `benchmarks/R/generate_power_golden.R`); the two `tests/test_power.py` ICC-direction tests were inverted to Burlig's sign. REGISTRY `## PowerAnalysis` equation block rewritten (z not t; corrected 2×2 / panel SE + sample-size; removed the cluster-`m` and inverted-`R²` terms that matched neither code nor source); `docs/references.rst` adds Frison & Pocock (1992) + McKenzie (2012) as the equicorrelated lineage; tutorial `06_power_analysis.ipynb` corrected. `METHODOLOGY_REVIEW.md` row promoted to **Complete** (`Last Review = 2026-05-31`); priority queue pruned; the PR-A under-review Notes removed across REGISTRY / `power.py` / `references.rst`. - **`WooldridgeDiD` outcome-fit hint:** `WooldridgeDiD(method="ols")` now emits a `UserWarning` when the outcome is binary (`{0, 1}`) or a non-negative integer count, noting that a matching nonlinear model (`method="logit"` / `method="poisson"`) is often the **more appropriate specification** for such outcomes. Following Wooldridge (2023): the nonlinear paths impose parallel trends on the link/index scale rather than in levels (level-PT is only valid for continuous/unbounded outcomes), and the paper's Section 5 simulations show the linear model both biased and less precise where the nonlinear mean holds. It is a **different identifying assumption** than linear OLS — which one fits depends on which parallel-trends restriction holds — so the warning frames it as a recommended comparison, not an automatic switch or free efficiency upgrade. OLS remains a valid QMLE for *any* response (Table 1). Always-on (suppress via `warnings.filterwarnings`); detection is high-signal (binary requires exactly `{0, 1}`; the count branch suggests Poisson — the natural unbounded-count model — for *any* non-negative integers with >2 distinct values, so bounded binomial / known-upper-bound integer outcomes are not separately distinguished from unbounded counts; fractional / continuous outcomes are not flagged). +- **`SyntheticControl` leave-one-out + in-time placebo robustness diagnostics (ADH 2015 §4).** Two opt-in `SyntheticControlResults` methods, each a thin re-run of the validated solver (analytical `se`/`t_stat`/`p_value`/`conf_int`/`is_significant` stay bound to the NaN analytical `p_value`). **`leave_one_out()`** drops each reportably-weighted donor (weight above the 1e-6 floor — the donors in `donor_weights`) in turn and re-fits the treated unit against the reduced pool, returning a per-drop ATT / `delta_att` table (a `status="baseline"` row first, then one row per dropped donor sorted by `|delta_att|`; non-converged refits → `status="failed"` with NaN metrics); a large `delta_att` flags single-donor dependence (a single *dominant* donor is still dropped — the others absorb its mass — and its large `delta_att` is the intended signal). **`in_time_placebo()`** reassigns the intervention to an earlier pre-date `t_f`, re-fits using only pre-`t_f` information, and reports the placebo "effect" over the held-out window `[t_f, T0)` — ~0 when there is no real pre-period effect (ADH 2015 Fig. 4). It sweeps every feasible interior pre-date by default (≥2 pre-fake + ≥1 post-fake); an explicit post-period / non-pre date raises, a dimensionally-infeasible valid date yields a `status="infeasible"` row. **Windowing = TRUNCATE** (documented `**Note:**` in REGISTRY): predictor specs are re-cut to the pre-`t_f` window (pre-period-outcome predictors become the pre-`t_f` outcomes; covariate/special windows are intersected), a window lying entirely in the held-out region is **dropped** (surfaced in `n_dropped_specs` + an aggregated warning) and `custom_v` is subset in lockstep with the surviving specs; the true post-treatment periods are excluded from the placebo fit entirely (no peeking). Both fail closed on a non-converged treated fit (and `leave_one_out` on `<2` donors). New accessors `get_leave_one_out_df()` / `get_in_time_placebo_df()` (survive pickling) and long-form `get_leave_one_out_gaps()` / `get_in_time_placebo_gaps()` for the overlay/backdating plots (panel-derived, dropped on pickle). **Validation:** R `Synth` has no in-time/LOO function (verified against its full CRAN function index), so — beyond the solver's existing Basque R parity — the diagnostics are anchored by deterministic self-consistency tests proving each equals a from-scratch `synthetic_control()` fit on the equivalent sub-problem (reduced donor pool / backdated panel) to 1e-7. **Reporting-stack integration:** `_scm_native` surfaces opt-in `leave_one_out` + `in_time_placebo` blocks (`status="not_run"` stub until run), `BusinessReport` lifts them into the SCM native robustness block, and `practitioner_next_steps` emits both as steps (non-`STEPS` tags so a caller's `completed_steps` cannot suppress them). The remaining ADH-2015 items (CV `V`-selection, `W^reg` extrapolation diagnostic, sparse-SC) are tracked in `TODO.md`. Documented in `docs/methodology/REGISTRY.md` §SyntheticControl, `docs/methodology/REPORTING.md`, `docs/api/synthetic_control.rst`, the LLM guides, and `README.md`. - **New tutorial: `docs/tutorials/24_staggered_vs_collapsed_power.ipynb` — "Staggered Rollout or a Simple 2×2? A Power-Analysis Decision Guide".** A practitioner walkthrough for geo experiments (framed on a 50-state staggered rollout) on when to reach for Callaway-Sant'Anna vs collapsing to a familiar pre/post 2×2. Shows, with live paired Monte Carlo on `generate_staggered_data`, that the collapsed 2×2 silently targets a *diluted* estimand (reports ~60–94% of the true effect-on-treated as the rollout staggers, with near-zero CI coverage of the truth under a slow rollout), and that CS's minimum-detectable-lift penalty is a *fast-rollout* phenomenon that shrinks to parity as the rollout becomes more staggered. Fully self-contained (runs live, no committed data files); ends with a CS-vs-2×2 decision guide. - **`SyntheticControl` in-space placebo permutation inference + reporting-stack integration (ADH 2010 §2.4).** New `SyntheticControlResults.in_space_placebo()` provides the significance test classic SCM lacks an analytical SE for: it reassigns treatment to each donor, refits a synthetic control for that pseudo-treated donor against the **other `J−1` donors** (the real treated unit is excluded from every placebo pool — its post-period is treatment-contaminated; matches `SCtools::generate.placebos`), and ranks the treated unit's post/pre **RMSPE ratio** among the `J+1` units. New fields `placebo_p_value` (`= rank/(n_placebos+1)`, an upper-tail rank test on the unsigned RMSPE ratio — direction-agnostic, so it detects an effect of *either* sign rather than a signed/one-directional hypothesis; ties counted via `≥`), `rmspe_ratio` (the treated statistic, set at fit), and `n_placebos`/`n_failed` (effective reference-set sizes; non-converged placebos are excluded from BOTH numerator and denominator, never penalized into the rank). `placebo_p_value` is a **separate field** from the (always-NaN) `p_value` — it is a permutation p-value with no SE/t-stat and does not flow through `safe_inference`; `is_significant` stays bound to `p_value`. Edge cases fail closed: scale-aware RMSPE-ratio floor (a perfect pre-fit gives a finite ratio, not `inf`), `J<2` → NaN+warn, `J==2` → degenerate+coarse warn, deterministic given `seed`. New `get_placebo_df()` returns the per-unit RMSPE-ratio summary table (incl. the treated row and any failed donors) used for the rank. The design keeps the placebo *compute* opt-in — the per-donor refit loop runs only on the explicit `in_space_placebo()` call. To support that opt-in call, every fit retains a `_SyntheticControlFitSnapshot` of the pivoted panel (memory O(units × periods × predictor-vars), like `SyntheticDiD`'s snapshot for `in_time_placebo`; excluded from pickling). A compact/lazy snapshot representation is tracked as a follow-up in `TODO.md`. **Reporting-stack integration:** `SyntheticControlResults` is now routed through `DiagnosticReport` (fit-based `scm_fit` parallel-trends analogue → verdict `design_enforced_pt` reading `pre_rmspe`; `_scm_native` surfaces `pre_rmspe` + donor-weight concentration + the placebo p-value when already computed — never triggering the refit loop implicitly), `practitioner_next_steps` (`_handle_synthetic_control` with the placebo as the headline significance step), and `BusinessReport` (fit-based assumption block, ADH 2010 attribution, robustness via `estimator_native_diagnostics`; HonestDiD passthrough rejected like SDiD/TROP). Also fixes a latent BR bug where the headline `is_significant` was a non-JSON-serializable numpy `bool_` when `p_value` is a numpy `NaN`. Documented in `docs/methodology/REGISTRY.md` §SyntheticControl (new `**Note:**` labels for the donor-pool construction, failure handling, RMSPE-ratio floor, and the non-analytical-p-value split), `docs/methodology/REPORTING.md`, `docs/api/synthetic_control.rst`, the LLM guides, and `README.md`. - **New estimator: `SyntheticControl` — classic Synthetic Control Method (Abadie, Diamond & Hainmueller 2010; Abadie & Gardeazabal 2003).** Standalone estimator (`diff_diff/synthetic_control.py`) + `SyntheticControlResults` (`diff_diff/synthetic_control_results.py`) + `synthetic_control()` convenience function, exported from `diff_diff`. Builds a single treated unit's counterfactual as a convex combination of never-treated donor units — **donor (unit) weights only**, no time weights or ridge, distinct from `SyntheticDiD`. The inner simplex-constrained weighted-LS solve `W*(V)` reuses `utils._sc_weight_fw` (folding `V^½` into the predictor matrix, `intercept=False`, `zeta=0`); the diagonal predictor-importance matrix `V` is selected data-driven by minimizing pre-period outcome MSPE (`v_method="nested"`, softmax-on-simplex multistart Nelder-Mead + Powell polish) or supplied by the user (`v_method="custom"`). Predictors are built from `predictors`/`predictor_window`/`predictors_op`, `special_predictors`, and per-period outcome lags (`pre_period_outcomes`), in the R `Synth::dataprep` row order; per-row standardization (SD over donors+treated, ddof=1) matches the R `Synth::synth` source. Reports the gap path (`α̂_1t = Y_1t − Σ_j w_j Y_jt`), `att` (mean post-period gap), `pre_rmspe`, donor weights, `v_weights`, and a predictor-balance table. **No analytical standard error** — `se`/`t_stat`/`p_value`/`conf_int` are NaN; significance comes from in-space placebo permutation inference via `in_space_placebo()` (see the dedicated entry below). Ten validation gates baked in: predictor-period leakage, absorbing post-period suffix + no-anticipation cross-check against the treatment column, post-period canonicalization, donor-pool filtering before period derivation, empty-window rejection, poor-pre-fit `UserWarning` (RMSPE > SD of treated pre-outcomes), duplicate-predictor-label rejection, inner-solve non-convergence warning, order-independent gap-path rebuild, and the `standardize="none"` deviation; plus fail-closed `custom_v` cross-field rules and degenerate single-donor / single-pre-period handling. **R-`Synth` parity** (`tests/test_methodology_synthetic_control.py`, fixtures generated by `benchmarks/R/generate_synth_basque_golden.R` into `tests/data/`): two-tier on the Basque Country study — Tier-1 feeds R's `solution.v` via `custom_v` and reproduces the published donor weights (region 10 Cataluña 0.851 + region 14 Madrid 0.149) to `atol=1e-3` deterministically; Tier-2 (`@pytest.mark.slow`) checks the data-driven nested fit lands in a tolerance band (the nested `V` legitimately differs because the outer objective uses all pre periods, not R's `time.optimize.ssr` window). Documented in `docs/methodology/REGISTRY.md` §SyntheticControl (with `**Deviation from R:** standardize="none"` and `**Note:**` labels for the standardization formula, objective window, softmax `V` parametrization, and 1×SD poor-fit threshold), `docs/api/synthetic_control.rst`, the LLM guides, and `README.md`. diff --git a/README.md b/README.md index 1e738591..8330f441 100644 --- a/README.md +++ b/README.md @@ -108,7 +108,7 @@ Full guide: `diff_diff.get_llm_guide("practitioner")`. - [TwoStageDiD](https://diff-diff.readthedocs.io/en/stable/api/two_stage.html) - Gardner (2022) two-stage estimator with GMM sandwich variance - [SpilloverDiD](https://diff-diff.readthedocs.io/en/stable/api/spillover.html) - Butts (2021) ring-indicator spillover-aware DiD identifying direct effect on treated + per-ring spillover on near-control units; handles non-staggered and staggered timing; supports survey-design variance under `survey_design=` for HC1 / CR1 (Wave E.1 Binder TSL) and Conley (Wave E.2 panel-aware stratified-Conley sandwich on per-period PSU totals; extended in Wave E.2 follow-up to `conley_lag_cutoff > 0` via panel-block composition with within-PSU serial Bartlett HAC — `lag>0` requires an effective PSU via explicit `survey_design.psu` or injected `cluster=`); `SurveyDesign.subpopulation()` preserves full-design `n_psu` / `df_survey` via zero-padded scores (Wave E.3, R `svyrecvar(subset())` form) - [SyntheticDiD](https://diff-diff.readthedocs.io/en/stable/api/estimators.html) - Synthetic DiD combining standard DiD and synthetic control for few treated units -- [SyntheticControl](https://diff-diff.readthedocs.io/en/stable/api/synthetic_control.html) - Abadie, Diamond & Hainmueller (2010) classic synthetic control for a single treated unit (donor-weight counterfactual, nested/custom V; in-space placebo permutation inference via `in_space_placebo()`) +- [SyntheticControl](https://diff-diff.readthedocs.io/en/stable/api/synthetic_control.html) - Abadie, Diamond & Hainmueller (2010) classic synthetic control for a single treated unit (donor-weight counterfactual, nested/custom V; in-space placebo permutation inference via `in_space_placebo()`, plus ADH-2015 `leave_one_out()` + `in_time_placebo()` robustness) - [TripleDifference](https://diff-diff.readthedocs.io/en/stable/api/triple_diff.html) - triple difference (DDD) estimator for designs requiring two criteria for treatment eligibility - [ContinuousDiD](https://diff-diff.readthedocs.io/en/stable/api/continuous_did.html) - Callaway, Goodman-Bacon & Sant'Anna (2024) continuous treatment DiD with dose-response curves - [HeterogeneousAdoptionDiD](https://diff-diff.readthedocs.io/en/stable/api/had.html) - de Chaisemartin, Ciccia, D'Haultfœuille & Knau (2026) for designs where **no unit remains untreated**; local-linear estimator at the dose support boundary returning Weighted Average Slope (WAS) on Design 1' (`d̲ = 0` / QUG) or `WAS_{d̲}` on Design 1 (`d̲ > 0`, continuous-near-d̲ or mass-point), with a multi-period event-study extension (last-treatment cohort, pointwise CIs). **Panel-only** in this release - repeated cross-sections rejected by the validator. Alias `HAD`. diff --git a/TODO.md b/TODO.md index 17fed4c2..a1b6f545 100644 --- a/TODO.md +++ b/TODO.md @@ -84,7 +84,7 @@ Deferred items from PR reviews that were not addressed before merge. | ImputationDiD dense `(A0'A0).toarray()` scales O((U+T+K)^2), OOM risk on large panels | `imputation.py` | #141 | Medium (deferred — only triggers when sparse solver fails) | | Multi-absorb weighted demeaning needs iterative alternating projections for N > 1 absorbed FE with survey weights; unweighted multi-absorb also uses single-pass (pre-existing, exact only for balanced panels) | `estimators.py` | #218 | Medium | | Survey design resolution/collapse patterns are inconsistent across panel estimators — ContinuousDiD rebuilds unit-level design in SE code, EfficientDiD builds once in fit(), StackedDiD re-resolves on stacked data; extract shared helpers for panel-to-unit collapse, post-filter re-resolution, and metadata recomputation | `continuous_did.py`, `efficient_did.py`, `stacked_did.py` | #226 | Low | -| SyntheticControl: in-time placebo + leave-one-out donor-robustness diagnostics are not implemented (ADH 2015, not the ADH 2010 scope of the current estimator), so `_scm_native` surfaces only pre-fit + in-space placebo. The practitioner / DiagnosticReport / BusinessReport routing and the in-space placebo permutation layer landed in PR-2; this remaining row covers adding the two ADH-2015 diagnostics (and surfacing them under `estimator_native_diagnostics`) in a later 2015-sourced PR. | `synthetic_control.py`, `diagnostic_report.py` | ADH-2015 follow-up | Low | +| SyntheticControl: the remaining ADH-2015 §4 items are out of scope of the leave-one-out + in-time-placebo PR — out-of-sample cross-validation `V`-selection (training/validation split), the regression-weight `W^reg = X_0'(X_0 X_0')^{-1} X_1` extrapolation diagnostic (flag implied OLS weights outside `[0,1]`), and sparse-SC subset search (`l < J`, holding `V` fixed). Leave-one-out (`leave_one_out()`) and the in-time placebo (`in_time_placebo()`) landed and are surfaced under `estimator_native_diagnostics`; these three are the deferred tail. | `synthetic_control.py`, `synthetic_control_results.py` | ADH-2015 follow-up | Low | | ContinuousDiD deferred CGBS 2024 extensions: (a) `covariates=` kwarg not implemented (matches R `contdid` v0.1.0); (b) discrete-treatment saturated regression deferred (integer-valued dose currently warned, not routed to per-level coefficients); (c) lowest-dose-as-control per CGBS 2024 Remark 3.1 (when `P(D=0) = 0`) not implemented — estimator requires never-treated controls. REGISTRY `## ContinuousDiD` → Implementation Checklist marks these as deferred `[ ]` items. | `diff_diff/continuous_did.py` | — | Low | | Survey-weighted Silverman bandwidth in EfficientDiD conditional Omega* — `_silverman_bandwidth()` uses unweighted mean/std for bandwidth selection; survey-weighted statistics would better reflect the population distribution but is a second-order refinement | `efficient_did_covariates.py` | — | Low | | TROP: extend Wave 4's `_setup_trop_data` helper to also cover the duplicated bootstrap resampling loop in `_bootstrap_variance` / `_bootstrap_variance_global` (~40 LoC dedup; mirrors the data-setup helper pattern with a `fit_callable` parameter for the per-draw refit step). | `trop_local.py`, `trop_global.py` | follow-up | Low | diff --git a/benchmarks/R/generate_synth_basque_golden.R b/benchmarks/R/generate_synth_basque_golden.R index 3f524ff1..8bd8473b 100644 --- a/benchmarks/R/generate_synth_basque_golden.R +++ b/benchmarks/R/generate_synth_basque_golden.R @@ -79,6 +79,37 @@ synthetic_path <- as.numeric(dp$Y0plot %*% so$solution.w) treated_path <- as.numeric(dp$Y1plot) years <- as.integer(rownames(dp$Y1plot)) +# --- Leave-one-out golden (ADH 2015 §4 donor robustness) --------------------- +# Drop the highest-weight donor (region 10, Cataluna) and re-fit with the +# ORIGINAL solution.v held fixed (custom.v), so the reduced-pool W-solve is +# deterministic and directly comparable to SyntheticControlResults.leave_one_out() +# on a v_method="custom" fit (which likewise reuses the original custom_v on the +# donor pool minus the dropped unit — specs/V are unchanged, only the donors shrink). +loo_drop <- 10L +controls_loo <- controls[controls != loo_drop] +invisible(capture.output({ + dp_loo <- dataprep( + foo = basque, + predictors = predictors, + predictors.op = "mean", + time.predictors.prior = 1964:1969, + special.predictors = special, + dependent = "gdpcap", + unit.variable = "regionno", + unit.names.variable = "regionname", + time.variable = "year", + treatment.identifier = 17, + controls.identifier = controls_loo, + time.optimize.ssr = 1960:1969, + time.plot = 1955:1997 + ) + so_loo <- synth(dp_loo, custom.v = as.numeric(so$solution.v)) +})) +w_loo <- as.numeric(so_loo$solution.w) +synthetic_path_loo <- as.numeric(dp_loo$Y0plot %*% so_loo$solution.w) +gap_loo <- as.numeric(dp_loo$Y1plot) - synthetic_path_loo +att_loo <- mean(gap_loo[years >= 1970]) # mean post-period gap (treatment year 1970) + golden <- list( config = list( treated_regionno = 17, @@ -104,7 +135,13 @@ golden <- list( years = years, treated_path = treated_path, synthetic_path = synthetic_path, - gap = treated_path - synthetic_path + gap = treated_path - synthetic_path, + leave_one_out = list( + dropped_regionno = loo_drop, + solution_w = as.list(setNames(w_loo, colnames(dp_loo$X0))), + att = att_loo, + gap = gap_loo + ) ) dir.create("tests/data", showWarnings = FALSE, recursive = TRUE) diff --git a/diff_diff/business_report.py b/diff_diff/business_report.py index f6edb275..b3160a6e 100644 --- a/diff_diff/business_report.py +++ b/diff_diff/business_report.py @@ -1019,6 +1019,9 @@ def _lift_robustness(dr: Optional[Dict[str, Any]]) -> Dict[str, Any]: native_block["pre_rmspe"] = native.get("pre_rmspe") native_block["weight_concentration"] = native.get("weight_concentration") native_block["in_space_placebo"] = native.get("in_space_placebo") + # ADH-2015 robustness diagnostics (opt-in; "not_run" stub until run). + native_block["leave_one_out"] = native.get("leave_one_out") + native_block["in_time_placebo"] = native.get("in_time_placebo") return { "bacon": { "status": bacon.get("status"), diff --git a/diff_diff/diagnostic_report.py b/diff_diff/diagnostic_report.py index f474e91b..fd65ea61 100644 --- a/diff_diff/diagnostic_report.py +++ b/diff_diff/diagnostic_report.py @@ -2152,8 +2152,10 @@ def _check_estimator_native(self) -> Dict[str, Any]: selected ``lambda_*``). SyntheticControl: pre-treatment fit (``pre_rmspe``), donor-weight - concentration, and — when already computed — the in-space placebo - permutation p-value (``in_space_placebo``). + concentration, and — each surfaced only when already computed — the + in-space placebo permutation p-value (``in_space_placebo``), the ADH-2015 + leave-one-out donor robustness (``leave_one_out``), and the in-time + backdating placebo (``in_time_placebo``). """ r = self._results name = type(r).__name__ @@ -2251,8 +2253,11 @@ def _scm_native(self, r: Any) -> Dict[str, Any]: DR never triggers it implicitly, because it refits one synthetic control per donor (potentially many nested V searches) and the placebo layer is opt-in by design. (This differs from SDiD's cheaper in-time-placebo sweep, - which ``_sdid_native`` runs inline.) Only the in-space placebo is exposed; - in-time placebo and leave-one-out are ADH 2015 (not implemented). + which ``_sdid_native`` runs inline.) The ADH-2015 §4 diagnostics — + leave-one-out donor robustness (``leave_one_out()``) and the in-time + (backdating) placebo (``in_time_placebo()``) — are surfaced the same way: + opt-in, reported only once the user has run them (each refits the synthetic + control one or more times), else a ``status="not_run"`` stub. """ out: Dict[str, Any] = {"status": "ran", "estimator": "SyntheticControl"} out["pre_rmspe"] = _to_python_float(getattr(r, "pre_rmspe", None)) @@ -2327,6 +2332,132 @@ def _scm_native(self, r: Any) -> Dict[str, Any]: "per donor)." ), } + + # Leave-one-out donor robustness (ADH 2015 §4): opt-in, surfaced once run. + if getattr(r, "_loo_df", None) is not None: + loo_status = getattr(r, "_loo_status", None) + if loo_status == "ran": + att_range = getattr(r, "_loo_att_range", None) + out["leave_one_out"] = { + "status": "ran", + # Headline single-donor-dependence metric: the largest baseline- + # relative swing (max |delta_att|). Preferred over att_range, which + # can look narrow even when every drop shifts the ATT far from the + # full-fit baseline in the same direction. + "max_abs_delta_att": _to_python_float( + getattr(r, "_loo_max_abs_delta_att", None) + ), + "att_range": ( + [_to_python_float(att_range[0]), _to_python_float(att_range[1])] + if att_range is not None + else None + ), + "n_failed": _to_python_scalar(getattr(r, "_loo_n_failed", None)), + } + else: + _loo_reasons = { + "treated_fit_nonconverged": ( + "leave_one_out() was run but the treated unit's own SCM fit " + "did not converge at fit time, so the baseline ATT is not a " + "valid reference for the leave-one-out deltas." + ), + "too_few_donors": ( + "leave_one_out() was run but fewer than 2 donors are available " + "(dropping one must leave a non-empty pool)." + ), + "all_refits_failed": ( + "leave_one_out() was run but every donor-drop refit failed to " + "converge, so no valid leave-one-out estimate was produced " + "(see the status='failed' rows); raise n_starts or loosen the " + "optimizer tolerances." + ), + } + out["leave_one_out"] = { + "status": "infeasible", + # Machine-readable code so consumers can distinguish a numerical + # convergence failure ("all_refits_failed") from structural + # infeasibility ("too_few_donors") without parsing `reason`. + "reason_code": loo_status, + "reason": _loo_reasons.get( + loo_status, "leave_one_out() produced no valid refits." + ), + } + else: + out["leave_one_out"] = { + "status": "not_run", + "reason": ( + "Call results.leave_one_out() to run leave-one-out donor " + "robustness (opt-in; refits once per reportably-weighted donor)." + ), + } + + # In-time (backdating) placebo (ADH 2015 §4): opt-in, surfaced once run. + if getattr(r, "_in_time_df", None) is not None: + in_time_status = getattr(r, "_in_time_status", None) + if in_time_status == "ran": + itp = r._in_time_df + ran = itp[itp["status"] == "ran"] if "status" in itp else itp + max_abs_att = float(ran["placebo_att"].abs().max()) if len(ran) else None + out["in_time_placebo"] = { + "status": "ran", + # Full coverage breakdown so a partially-usable sweep is not + # overstated: n_dates is the requested grid; n_ran are the usable + # placebos; n_failed / n_infeasible are the dropped remainder. + "n_dates": _to_python_scalar(int(len(itp))), + "n_ran": _to_python_scalar(int(len(ran))), + "n_failed": _to_python_scalar(getattr(r, "_in_time_n_failed", None)), + "n_infeasible": _to_python_scalar(getattr(r, "_in_time_n_infeasible", None)), + "max_abs_placebo_att": _to_python_float(max_abs_att), + } + else: + _it_reasons = { + "treated_fit_nonconverged": ( + "in_time_placebo() was run but the treated unit's own SCM fit " + "did not converge at fit time." + ), + "too_few_pre_periods": ( + "in_time_placebo() was run but there are too few pre-treatment " + "periods for any feasible placebo date (need >=3)." + ), + "all_dates_infeasible": ( + "in_time_placebo() was run but every placebo date was " + "infeasible (no pre-fake period, all predictors dropped, or " + "the supplied custom_v had zero mass on the surviving " + "predictors after truncation)." + ), + "all_dates_failed": ( + "in_time_placebo() was run but every placebo refit failed to " + "converge (none was dimensionally infeasible); raise n_starts " + "or loosen the optimizer tolerances." + ), + "all_dates_unusable": ( + "in_time_placebo() was run but no placebo date produced a usable " + "result: some refits failed to converge AND some dates were " + "dimensionally infeasible (see n_failed / n_infeasible)." + ), + } + out["in_time_placebo"] = { + "status": "infeasible", + # Machine-readable code distinguishing a numerical convergence + # failure ("all_dates_failed") from structural infeasibility + # ("all_dates_infeasible" / "too_few_pre_periods") or a mix + # ("all_dates_unusable"), without parsing `reason`. The n_failed / + # n_infeasible counts give the exact breakdown. + "reason_code": in_time_status, + "n_failed": _to_python_scalar(getattr(r, "_in_time_n_failed", None)), + "n_infeasible": _to_python_scalar(getattr(r, "_in_time_n_infeasible", None)), + "reason": _it_reasons.get( + in_time_status, "in_time_placebo() produced no valid refits." + ), + } + else: + out["in_time_placebo"] = { + "status": "not_run", + "reason": ( + "Call results.in_time_placebo() to run the in-time (backdating) " + "placebo (opt-in; refits per backdated date)." + ), + } return out # -- Heterogeneity helpers -------------------------------------------- diff --git a/diff_diff/guides/llms-full.txt b/diff_diff/guides/llms-full.txt index c0e2caf9..ff1b1ad2 100644 --- a/diff_diff/guides/llms-full.txt +++ b/diff_diff/guides/llms-full.txt @@ -616,7 +616,7 @@ scm.fit( ) -> SyntheticControlResults ``` -**Inference:** NONE analytical — `se`/`t_stat`/`p_value`/`conf_int` are always NaN. `att` is the mean post-period gap. Significance via in-space placebo permutation inference: `results.in_space_placebo()` reassigns treatment to each donor, refits against the other J-1 donors (the real treated unit is excluded from every placebo pool), and sets `placebo_p_value = rank/(n_placebos+1)` from the post/pre RMSPE-ratio. The permutation `placebo_p_value` is a SEPARATE field from the (NaN) `p_value`; `is_significant` stays bound to `p_value`. Predictor periods must lie within the pre window; `post_periods` must be a contiguous suffix cross-checked against `D` (no anticipation). +**Inference:** NONE analytical — `se`/`t_stat`/`p_value`/`conf_int` are always NaN. `att` is the mean post-period gap. Significance via in-space placebo permutation inference: `results.in_space_placebo()` reassigns treatment to each donor, refits against the other J-1 donors (the real treated unit is excluded from every placebo pool), and sets `placebo_p_value = rank/(n_placebos+1)` from the post/pre RMSPE-ratio. The permutation `placebo_p_value` is a SEPARATE field from the (NaN) `p_value`; `is_significant` stays bound to `p_value`. **ADH-2015 §4 robustness (opt-in, analytical inference unchanged):** `results.leave_one_out()` drops each reportably-weighted donor (weight > 1e-6) and re-fits (per-drop ATT/`delta_att` — large `delta_att` ⇒ single-donor dependence); `results.in_time_placebo()` backdates the intervention and checks for a spurious pre-period gap (TRUNCATE windowing — predictor windows in the held-out region are dropped). Predictor periods must lie within the pre window; `post_periods` must be a contiguous suffix cross-checked against `D` (no anticipation). **Usage:** @@ -1298,7 +1298,7 @@ Returned by `SyntheticControl.fit()`. | `pre_periods`, `post_periods` | `list` | Calendar-sorted periods | | `v_method`, `standardize` | `str` | Echoed configuration | -**Methods:** `in_space_placebo()` (opt-in permutation inference; refits one synthetic control per donor), `get_placebo_df()` (per-unit RMSPE-ratio table incl. the treated row), `summary()`, `print_summary()`, `to_dict()`, `to_dataframe()`, `get_gap_df()`, `get_weights_df()` +**Methods:** `in_space_placebo()` (opt-in permutation inference; refits one synthetic control per donor), `get_placebo_df()` (per-unit RMSPE-ratio table incl. the treated row), `leave_one_out()` (ADH-2015 §4 donor robustness; drops each reportably-weighted donor (weight > 1e-6) → per-drop ATT/`delta_att` table) + `get_leave_one_out_df()`/`get_leave_one_out_gaps()`, `in_time_placebo()` (ADH-2015 §4 backdating placebo; reassigns the intervention earlier, TRUNCATE windowing, placebo ATT ~0 if no real pre-effect) + `get_in_time_placebo_df()`/`get_in_time_placebo_gaps()`, `summary()`, `print_summary()`, `to_dict()`, `to_dataframe()`, `get_gap_df()`, `get_weights_df()` ### TripleDifferenceResults diff --git a/diff_diff/guides/llms.txt b/diff_diff/guides/llms.txt index 5261b046..7fb2922b 100644 --- a/diff_diff/guides/llms.txt +++ b/diff_diff/guides/llms.txt @@ -60,7 +60,7 @@ Full practitioner guide: call `diff_diff.get_llm_guide("practitioner")` - [TwoStageDiD](https://diff-diff.readthedocs.io/en/stable/api/two_stage.html): Gardner (2022) two-stage estimator with GMM sandwich variance - [SpilloverDiD](https://diff-diff.readthedocs.io/en/stable/api/spillover.html): Butts (2021) ring-indicator spillover-aware DiD identifying direct effect on treated + per-ring spillover-on-control; reuses `conley_coords` for ring construction; handles non-staggered and staggered timing; supports `SurveyDesign(weights, strata, psu, fpc)` under `vcov_type="hc1"` with optional `cluster=` for CR1 via Gerber (2026) Binder TSL (Wave E.1) and under `vcov_type="conley"` via a panel-aware stratified-Conley sandwich on per-period PSU totals (Wave E.2 cross-sectional `conley_lag_cutoff=0`) extended in Wave E.2 follow-up to `conley_lag_cutoff > 0` via panel-block composition with within-PSU serial Bartlett HAC (Newey-West 1987 separable form; `lag>0` requires an effective PSU via explicit `survey_design.psu` or injected `cluster=`), both composed with the Wave D Gardner GMM correction; `SurveyDesign.subpopulation()` preserves full-design `n_psu` / `df_survey` via zero-padded scores at the meat-helper boundary (Wave E.3, R `svyrecvar(subset())` form) (replicate weights queued as follow-up) - [SyntheticDiD](https://diff-diff.readthedocs.io/en/stable/api/estimators.html): Synthetic DiD combining standard DiD and synthetic control methods for few treated units -- [SyntheticControl](https://diff-diff.readthedocs.io/en/stable/api/synthetic_control.html): Abadie, Diamond & Hainmueller (2010) classic synthetic control for ONE treated unit — donor-weight counterfactual, nested or custom predictor-importance V, gap path + pre-RMSPE; no analytical SE (inference fields NaN), significance via in-space placebo permutation inference (`in_space_placebo()`, post/pre RMSPE-ratio, p = rank/(n_placebos+1)) +- [SyntheticControl](https://diff-diff.readthedocs.io/en/stable/api/synthetic_control.html): Abadie, Diamond & Hainmueller (2010) classic synthetic control for ONE treated unit — donor-weight counterfactual, nested or custom predictor-importance V, gap path + pre-RMSPE; no analytical SE (inference fields NaN), significance via in-space placebo permutation inference (`in_space_placebo()`, post/pre RMSPE-ratio, p = rank/(n_placebos+1)); ADH-2015 §4 robustness: `leave_one_out()` donor-robustness + `in_time_placebo()` backdating placebo - [TripleDifference](https://diff-diff.readthedocs.io/en/stable/api/triple_diff.html): Triple difference (DDD) estimator for designs requiring two criteria for treatment eligibility - [ContinuousDiD](https://diff-diff.readthedocs.io/en/stable/api/continuous_did.html): Callaway, Goodman-Bacon & Sant'Anna (2024) continuous treatment DiD with dose-response curves - [HeterogeneousAdoptionDiD](https://diff-diff.readthedocs.io/en/stable/api/had.html): de Chaisemartin, Ciccia, D'Haultfœuille & Knau (2026) for designs where **no unit remains untreated**; local-linear estimator at the dose support boundary returning Weighted Average Slope (WAS) on Design 1' (`d̲=0` / QUG) or `WAS_{d̲}` on Design 1 (`d̲>0`, continuous-near-d̲ or mass-point), with multi-period event-study extension (last-treatment cohort, pointwise CIs). **Panel-only** in this release (repeated cross-sections rejected by the validator). Alias `HAD`. diff --git a/diff_diff/practitioner.py b/diff_diff/practitioner.py index b0d6811f..3a636e3c 100644 --- a/diff_diff/practitioner.py +++ b/diff_diff/practitioner.py @@ -718,6 +718,42 @@ def _handle_synthetic_control(results: Any): priority="medium", step_name="estimator_selection", ), + _step( + baker_step=6, + label="Leave-one-out donor robustness (ADH 2015)", + why=( + "Re-fit dropping each reportably-weighted donor (weight above the 1e-6 " + "floor) in turn to confirm the " + "estimate is not driven by a single donor (Abadie-Diamond-Hainmueller " + "2015, Section 4); a large delta_att when one donor is removed flags " + "single-donor dependence." + ), + code=( + "loo_df = results.leave_one_out()\n" + "print(loo_df) # baseline + per-dropped-donor ATT and delta_att" + ), + priority="medium", + # Not a standard STEPS tag, so a caller's completed_steps (validated + # against STEPS) can never auto-suppress this opt-in recommendation. + step_name="loo_jackknife", + ), + _step( + baker_step=6, + label="In-time (backdating) placebo (ADH 2015)", + why=( + "Reassign the intervention to an earlier pre-period and confirm no " + "spurious gap appears before the true treatment date (Abadie-Diamond-" + "Hainmueller 2015, Section 4, Figure 4)." + ), + code=( + "itp_df = results.in_time_placebo()\n" + "print(itp_df) # per-backdated-date placebo ATT (should be ~0)" + ), + priority="medium", + # Non-standard tag (not in STEPS) -> never auto-suppressed; deliberately + # NOT "sensitivity" (a caller could mark that done and drop this step). + step_name="in_time_placebo", + ), _robustness_compare_step("SyntheticDiD or CS"), ] warnings = _check_nan_att(results) diff --git a/diff_diff/synthetic_control.py b/diff_diff/synthetic_control.py index 164568a2..4004446c 100644 --- a/diff_diff/synthetic_control.py +++ b/diff_diff/synthetic_control.py @@ -30,7 +30,7 @@ """ import warnings -from dataclasses import dataclass +from dataclasses import dataclass, replace from typing import Any, Dict, List, Optional, Tuple, cast import numpy as np @@ -523,6 +523,9 @@ def fit( pre_periods=list(pre_periods), post_periods=list(post_periods), donor_ids=list(donor_ids), + # Freeze the reportably-weighted support (donor_weights keys, in donor_ids + # order) so leave_one_out() is immune to post-fit mutation of donor_weights. + weighted_donor_ids=[d for d in donor_ids if d in donor_weights], treated_id=treated_id, standardize=self.standardize, v_method=self.v_method, @@ -1310,3 +1313,88 @@ def _placebo_fit_unit( post_gaps = np.array([gap_path[p] for p in snap.post_periods], dtype=float) scale = float(np.max(np.abs(Z1))) if Z1.size else 0.0 return gap_path, _rmspe_ratio(pre_gaps, post_gaps, scale) + + +# ============================================================================= +# in-time (backdating) placebo (ADH 2015 §4) — used by +# SyntheticControlResults.in_time_placebo() via function-level import +# ============================================================================= + + +def _truncate_snapshot_in_time( + snap: _SyntheticControlFitSnapshot, + t_f: Any, +) -> Tuple[Optional[_SyntheticControlFitSnapshot], List[str]]: + """Build a snapshot for an in-time placebo reassigning the intervention to ``t_f``. + + Re-cuts the panel so the **pre-fake** window is the pre-periods strictly before + ``t_f`` and the **post-fake** window is the held-out pre-periods from ``t_f`` + onward (``t_f`` is the first post-fake period). The true post-treatment periods + are EXCLUDED entirely — ``all_periods`` is set to ``pre_fake + post_fake`` so the + placebo refit (which fits V/W on ``pre_periods`` and measures the gap over + ``all_periods``) never sees treatment-contaminated data. + + Predictor specs are TRUNCATED to the pre-fake window (ADH 2015 §4 "lag the + predictors accordingly"; for our absolute-period specs this means intersecting + each spec's ``periods`` with the pre-fake window). A spec whose window lies + ENTIRELY in the held-out region is DROPPED (its label is collected for an + aggregated warning), and ``custom_v`` is subset in lockstep with the surviving + specs so its length stays ``== len(kept_specs)``. + + Returns ``(modified_snapshot, dropped_spec_labels)``, or ``(None, dropped)`` when + the date is infeasible: fewer than 2 pre-fake periods (the weight / V solve needs + at least 2), no post-fake period, or every predictor dropped. ``t_f`` must be an + element of ``snap.pre_periods`` (the caller validates membership). + """ + pre = snap.pre_periods + idx = pre.index(t_f) # positional split (period labels may be non-numeric) + new_pre = list(pre[:idx]) + new_post = list(pre[idx:]) # t_f is the first post-fake period + pre_set = set(new_pre) + + kept_specs: List[_PredictorSpec] = [] + keep_mask: List[int] = [] + dropped: List[str] = [] + for i, spec in enumerate(snap.specs): + # Intersect with the pre-fake window (preserves the spec's canonical order). + kept_periods = [p for p in spec.periods if p in pre_set] + if kept_periods: + # NEW spec object — never mutate the shared snapshot specs in place. + kept_specs.append(replace(spec, periods=kept_periods)) + keep_mask.append(i) + else: + dropped.append(spec.label) + + # Feasibility: >=2 pre-fake periods, >=1 post-fake period (to measure the placebo + # gap), and at least one surviving predictor. The >=2 pre-fake rule is DELIBERATELY + # stricter than the base estimator's T0>=1 allowance (which warns that single-pre + # nested-V selection is unreliable): an auto-swept placebo date with a single + # pre-fake period is a trivially-matchable, non-credible pre-fit, so it is dropped + # as infeasible rather than surfaced as a "ran" placebo. Documented as a Note in + # docs/methodology/REGISTRY.md §SyntheticControl. + if len(new_pre) < 2 or len(new_post) < 1 or not kept_specs: + return None, dropped + + # Subset custom_v IN LOCKSTEP with the surviving specs (custom_v length must stay + # == k = len(specs); a desync silently misaligns V on the custom path). RAVEL first: + # fit() accepts array-like custom_v (e.g. a (1, k) row vector) and the snapshot keeps + # its original shape, so `[keep_mask]` would otherwise index the wrong axis of a 2D + # custom_v and raise (matches _placebo_fit_unit, which also ravels at use). + new_custom_v = ( + None if snap.custom_v is None else np.asarray(snap.custom_v, dtype=float).ravel()[keep_mask] + ) + # A custom V whose surviving entries sum to ~0 (all mass was on dropped specs) + # cannot fit — _placebo_fit_unit's `v / v.sum()` would be 0/0. This is the date + # being INFEASIBLE UNDER THE SUPPLIED custom_v, not a solver-convergence failure, + # so return None (→ status="infeasible") rather than letting the refit "fail". + if new_custom_v is not None and float(np.sum(new_custom_v)) <= 0.0: + return None, dropped + snap_mod = replace( + snap, + specs=kept_specs, + all_periods=new_pre + new_post, + pre_periods=new_pre, + post_periods=new_post, + custom_v=new_custom_v, + ) + return snap_mod, dropped diff --git a/diff_diff/synthetic_control_results.py b/diff_diff/synthetic_control_results.py index 8e412057..6dfa10b5 100644 --- a/diff_diff/synthetic_control_results.py +++ b/diff_diff/synthetic_control_results.py @@ -48,6 +48,12 @@ class _SyntheticControlFitSnapshot: pre_periods: List[Any] post_periods: List[Any] donor_ids: List[Any] + # The treated unit's reportably-weighted donor support (donor ids with weight above + # the 1e-6 interpretability floor), FROZEN at fit time and ordered by donor_ids. + # leave_one_out() iterates this immutable list — NOT the mutable, presentation-level + # results.donor_weights dict — so post-fit mutation cannot change which donors are + # dropped, and the robustness result depends only on the fit. + weighted_donor_ids: List[Any] treated_id: Any standardize: str v_method: str @@ -195,6 +201,39 @@ def __post_init__(self) -> None: # "all_placebos_failed". A small string, so it survives pickling. self._placebo_status: Optional[str] = None + # --- ADH 2015 §4 robustness diagnostics (opt-in, populated by --- + # --- leave_one_out() / in_time_placebo()). Same panel-vs-scalar split as --- + # --- the in-space placebo: the small per-row tables (_loo_df / _in_time_df), --- + # --- scalar summaries and status strings survive pickling; the per-refit --- + # --- gap-path dicts (_loo_gaps / _in_time_gaps) are panel-derived and nulled --- + # --- by __getstate__. analytical se/t/p/ci stay NaN throughout. + self._loo_df: Optional[pd.DataFrame] = None + self._loo_gaps: Optional[Dict[Any, Dict[Any, float]]] = None + # Reason a leave-one-out run was infeasible/absent. Values: None (not run), + # "ran", "treated_fit_nonconverged", "too_few_donors", "all_refits_failed". + self._loo_status: Optional[str] = None + # (min, max) ATT across the successful leave-one-out refits (the absolute + # spread of counterfactual ATTs); None until run. + self._loo_att_range: Optional[Tuple[float, float]] = None + # The headline single-donor-dependence number: max |att_loo - baseline_att| + # over the successful drops. Baseline-RELATIVE, so a uniform shift of every + # drop away from the baseline is NOT masked the way a narrow raw att_range + # would be. None until run. + self._loo_max_abs_delta_att: Optional[float] = None + self._loo_n_failed: int = 0 + self._in_time_df: Optional[pd.DataFrame] = None + self._in_time_gaps: Optional[Dict[Any, Dict[Any, float]]] = None + # Reason an in-time placebo run was infeasible/absent. Values: None (not run), + # "ran", "treated_fit_nonconverged", "too_few_pre_periods", + # "all_dates_infeasible", "all_dates_failed", "all_dates_unusable" (a mix of + # failed + infeasible dates with none usable). + self._in_time_status: Optional[str] = None + self._in_time_n_failed: int = 0 + # Number of placebo dates that were dimensionally infeasible (too few pre-fake + # periods, all predictors dropped, or a zero-mass surviving custom_v). Surfaced + # alongside _in_time_n_failed so a mixed no-success run reports an accurate mix. + self._in_time_n_infeasible: int = 0 + def __getstate__(self) -> Dict[str, Any]: """Exclude panel-derived internal state from pickling. @@ -209,6 +248,12 @@ def __getstate__(self) -> Dict[str, Any]: state = self.__dict__.copy() state["_fit_snapshot"] = None state["_placebo_gaps"] = None + # ADH-2015 diagnostic gap paths are panel-derived (same hazard as + # _placebo_gaps); the small _loo_df / _in_time_df tables + scalar summaries + # survive so a round-tripped result still reports the diagnostic, but the + # overlay gap accessors raise (re-fit to recompute). + state["_loo_gaps"] = None + state["_in_time_gaps"] = None return state def __repr__(self) -> str: @@ -727,3 +772,583 @@ def in_space_placebo( self._placebo_status = "ran" if n_placebos > 0 else "all_placebos_failed" self._placebo_df = pd.DataFrame(rows, columns=self._PLACEBO_COLS) return self._placebo_df.copy() + + _LOO_COLS = [ + "dropped_unit", + "att", + "pre_rmspe", + "post_rmspe", + "rmspe_ratio", + "delta_att", + "status", + ] + + def leave_one_out(self, n_starts: Optional[int] = None) -> pd.DataFrame: + """ + Leave-one-out donor robustness (Abadie-Diamond-Hainmueller 2015, Section 4). + + Drops each **reportably-weighted** donor, one at a time, and re-fits the + treated unit's synthetic control against the remaining donor pool. The + per-drop ATTs reveal whether the estimated effect is driven by any single + donor (ADH 2015 overlay the leave-one-out counterfactual trajectories for + this purpose; :meth:`get_leave_one_out_gaps` returns those paths). This is a + thin re-run of the validated SCM solver — it has **no analytical standard + error**; ``se``/``t_stat``/``p_value``/``conf_int`` and ``is_significant`` + are unaffected (still bound to the NaN analytical ``p_value``). + + The drop set is exactly the donors in ``donor_weights`` — those above the + ``1e-6`` interpretability floor (``synthetic_control._MIN_REPORT_WEIGHT``). + A donor with negligible weight ``0 < w ≤ 1e-6`` is excluded (its removal + moves the ATT by ~the weight, so its ``delta_att`` would be ~0 — an + uninformative row), keeping the LOO table aligned with the reported support; + a zero-weight donor's removal leaves the synthetic unchanged. (This `1e-6` + approximation of "positive weight" is documented in REGISTRY §SyntheticControl.) + A donor that carries ALL the weight is still dropped (the others absorb its + mass on re-fit); its large ``delta_att`` is exactly the single-donor-dependence + signal this diagnostic exists to surface, NOT a failure. + + Parameters + ---------- + n_starts : int, optional + Override the multistart count for each leave-one-out refit's nested V + search. Default None inherits the original fit's ``n_starts``. + + Returns + ------- + pandas.DataFrame + One ``status="baseline"`` row (the full fit, ``delta_att=0``) followed by + one row per dropped donor (``status="loo"``, or ``"failed"`` with NaN + metrics when its refit did not converge), sorted by ``|delta_att|`` + descending (failed rows last). Columns: ``dropped_unit``, ``att``, + ``pre_rmspe``, ``post_rmspe``, ``rmspe_ratio``, ``delta_att`` + (``att_loo - full_att``), ``status``. + + Raises + ------ + ValueError + If the fit snapshot is unavailable (e.g. this result was unpickled). + """ + if self._fit_snapshot is None: + raise ValueError( + "leave_one_out() requires the fit snapshot on the results object. " + "This result appears to have been loaded from serialization (which " + "excludes the snapshot) or produced by an older estimator version. " + "Re-fit to enable leave-one-out donor robustness." + ) + from diff_diff.synthetic_control import _mspe, _placebo_fit_unit + + snap = self._fit_snapshot + if n_starts is None: + n_starts_eff = snap.n_starts + else: + # Mirror the estimator constructor's validation so a bad override fails + # fast instead of silently coercing into a degenerate refit (cf. + # in_space_placebo()). + if not isinstance(n_starts, (int, np.integer)) or n_starts < 1: + raise ValueError(f"n_starts override must be a positive integer, got {n_starts!r}") + n_starts_eff = int(n_starts) + + # Baseline row: read DIRECTLY from the full fit (do NOT re-fit), so the + # reference ATT — and therefore delta_att=0.0 — is exact. + baseline_row = { + "dropped_unit": None, + "att": float(self.att), + "pre_rmspe": float(self.pre_rmspe), + "post_rmspe": float(np.sqrt(_mspe(self.gap_path, snap.post_periods))), + "rmspe_ratio": float(self.rmspe_ratio), + "delta_att": 0.0, + "status": "baseline", + } + + # Fail closed when the treated unit's own fit did not converge: a truncated / + # under-optimized baseline ATT makes every leave-one-out delta meaningless. + if not self._fit_converged: + warnings.warn( + "Leave-one-out skipped: the treated unit's own SCM fit did not " + "converge at fit time (inner Frank-Wolfe weight solve and/or outer V " + "search), so the baseline ATT is not a valid optimum to compare " + "leave-one-out refits against. Re-fit with a larger inner_max_iter / " + "looser inner_min_decrease (inner) and/or a larger " + "optimizer_options['maxiter'] / more n_starts (outer V search).", + UserWarning, + stacklevel=2, + ) + self._loo_status = "treated_fit_nonconverged" + self._loo_att_range = None + self._loo_n_failed = 0 + self._loo_gaps = {} + self._loo_df = pd.DataFrame([baseline_row], columns=self._LOO_COLS) + return self._loo_df.copy() + + # Dropping any donor requires at least one donor left in the pool. + if len(snap.donor_ids) < 2: + warnings.warn( + "Leave-one-out donor robustness requires at least 2 donors (dropping " + f"one must leave a non-empty pool); only {len(snap.donor_ids)} " + "available. Returning the baseline fit only.", + UserWarning, + stacklevel=2, + ) + self._loo_status = "too_few_donors" + self._loo_att_range = None + self._loo_n_failed = 0 + self._loo_gaps = {} + self._loo_df = pd.DataFrame([baseline_row], columns=self._LOO_COLS) + return self._loo_df.copy() + + # Drop the FROZEN reportably-weighted support captured at fit time (donor ids + # with weight above the 1e-6 floor, in donor_ids order). Reading the snapshot — + # NOT the mutable presentation-level self.donor_weights — makes the result + # depend only on the fit and immune to post-fit mutation of donor_weights. + pos_donors = list(snap.weighted_donor_ids) + loo_gaps: Dict[Any, Dict[Any, float]] = {} + loo_rows: List[Dict[str, Any]] = [] + atts: List[float] = [] + n_failed = 0 + + for d in pos_donors: + pool = [x for x in snap.donor_ids if x != d] + fitted = _placebo_fit_unit(snap, snap.treated_id, pool, n_starts_eff) + if fitted is None: + n_failed += 1 + loo_rows.append( + { + "dropped_unit": d, + "att": np.nan, + "pre_rmspe": np.nan, + "post_rmspe": np.nan, + "rmspe_ratio": np.nan, + "delta_att": np.nan, + "status": "failed", + } + ) + continue + gap_path_d, ratio_d = fitted + loo_gaps[d] = gap_path_d + att_d = float(np.mean([gap_path_d[p] for p in snap.post_periods])) + atts.append(att_d) + loo_rows.append( + { + "dropped_unit": d, + "att": att_d, + "pre_rmspe": float(np.sqrt(_mspe(gap_path_d, snap.pre_periods))), + "post_rmspe": float(np.sqrt(_mspe(gap_path_d, snap.post_periods))), + "rmspe_ratio": ratio_d, + "delta_att": att_d - float(self.att), + "status": "loo", + } + ) + + # Sort successful drops by |delta_att| desc (most influential donor first); + # non-converged drops sort last. + finite_rows = sorted( + (r for r in loo_rows if r["status"] == "loo"), + key=lambda r: abs(r["delta_att"]), + reverse=True, + ) + failed_rows = [r for r in loo_rows if r["status"] == "failed"] + ordered = [baseline_row] + finite_rows + failed_rows + + if n_failed > 0: + warnings.warn( + f"{n_failed} of {len(pos_donors)} leave-one-out refits failed to " + "converge and are reported with NaN metrics (status='failed'); the " + "ATT range uses the remaining refits.", + UserWarning, + stacklevel=2, + ) + + self._loo_gaps = loo_gaps + self._loo_n_failed = int(n_failed) + self._loo_att_range = (min(atts), max(atts)) if atts else None + # Baseline-relative headline: the largest swing of any single donor-drop from + # the full-fit ATT (max |delta_att|). Robust to a uniform shift that a raw + # att_range would understate. + self._loo_max_abs_delta_att = max(abs(a - float(self.att)) for a in atts) if atts else None + # Distinguish a real run from "every donor-drop refit failed to converge" + # (no valid leave-one-out estimate produced) so DR/BR do not report an empty + # diagnostic as completed. (pos_donors empty — a converged fit always has >=1 + # positive weight — falls through to "ran": baseline-only, benign.) + self._loo_status = "all_refits_failed" if (pos_donors and not atts) else "ran" + self._loo_df = pd.DataFrame(ordered, columns=self._LOO_COLS) + return self._loo_df.copy() + + def get_leave_one_out_df(self) -> pd.DataFrame: + """ + Get the leave-one-out donor-robustness table (see :meth:`leave_one_out`). + + Survives pickling. Raises if :meth:`leave_one_out` has not been run. + + Returns + ------- + pandas.DataFrame + """ + if self._loo_df is None: + raise ValueError("No leave-one-out results yet; call leave_one_out() first.") + return self._loo_df.copy() + + def get_leave_one_out_gaps(self) -> pd.DataFrame: + """ + Long-form leave-one-out gap paths, for the overlay ("spaghetti") plot. + + One row per (dropped donor, period) for every converged leave-one-out refit. + Columns: ``dropped_unit``, ``period``, ``gap``, ``phase`` (``"pre"``/ + ``"post"``) — mirroring :meth:`get_gap_df`. These per-period paths are + panel-derived and are NOT retained after pickling. + + Returns + ------- + pandas.DataFrame + + Raises + ------ + ValueError + If :meth:`leave_one_out` has not been run, or if the gap paths were + dropped on pickling (re-fit and re-run to recompute them). + """ + if self._loo_df is None: + raise ValueError("No leave-one-out results yet; call leave_one_out() first.") + if self._loo_gaps is None: + raise ValueError( + "Leave-one-out gap paths are not retained after pickling " + "(panel-derived); re-run leave_one_out() on a freshly fitted result " + "to recompute them." + ) + rows: List[Dict[str, Any]] = [] + for unit, gap_path in self._loo_gaps.items(): + for period in list(self.pre_periods) + list(self.post_periods): + if period in gap_path: + phase = "post" if period in self.post_periods else "pre" + rows.append( + { + "dropped_unit": unit, + "period": period, + "gap": gap_path[period], + "phase": phase, + } + ) + return pd.DataFrame(rows, columns=["dropped_unit", "period", "gap", "phase"]) + + _IN_TIME_COLS = [ + "placebo_period", + "placebo_att", + "pre_fit_rmspe", + "rmspe_ratio", + "n_pre_fake", + "n_post_fake", + "n_dropped_specs", + "status", + ] + + def in_time_placebo( + self, + placebo_periods: Optional[Any] = None, + n_starts: Optional[int] = None, + ) -> pd.DataFrame: + """ + In-time (backdating) placebo (Abadie-Diamond-Hainmueller 2015, Section 4). + + Reassigns the intervention to an earlier pre-treatment date ``t_f`` and re-fits + the synthetic control using ONLY pre-``t_f`` information, then measures the + "effect" over the held-out window ``[t_f, T0)``. A credible synthetic control + should show **no spurious gap** there (ADH 2015 Figure 4, German reunification + backdated to 1975). This is a thin re-run of the validated SCM solver — it has + **no analytical standard error**; ``se``/``t_stat``/``p_value``/``conf_int`` and + ``is_significant`` are unaffected. + + **Windowing convention (TRUNCATE).** The placebo fit uses only periods strictly + before ``t_f``: pre-period-outcome predictors become the pre-``t_f`` outcomes, + and covariate / special predictor windows are intersected with the pre-``t_f`` + window. A predictor window lying ENTIRELY in the held-out region ``[t_f, T0)`` + is dropped (surfaced in ``n_dropped_specs`` + an aggregated warning). For + outcome-predictor fits this equals the literal "lag the predictors" re-run of a + manual ``Synth::synth`` (R has no in-time-placebo function); see + ``docs/methodology/REGISTRY.md`` for the recognized deviation note. + + Parameters + ---------- + placebo_periods : period value or list of period values, optional + The pseudo-intervention date(s), each a member of ``pre_periods``. Default + None sweeps every feasible interior pre-date (at least 2 pre-fake periods to + fit + at least 1 post-fake period to measure the gap). A date that is a true + post-treatment period, or not a pre-period at all, raises ``ValueError``; a + valid pre-date that is dimensionally infeasible (too few pre-fake periods, or + all predictors dropped) yields a ``status="infeasible"`` row (no raise). + n_starts : int, optional + Override the multistart count for each placebo refit's nested V search. + Default None inherits the original fit's ``n_starts``. + + Returns + ------- + pandas.DataFrame + One row per placebo date. Columns: ``placebo_period``, ``placebo_att`` (mean + gap over the held-out window — should be ~0 if no real pre-period effect), + ``pre_fit_rmspe``, ``rmspe_ratio`` (post-fake/pre-fake), ``n_pre_fake``, + ``n_post_fake``, ``n_dropped_specs``, ``status`` (``"ran"`` / ``"infeasible"`` + / ``"failed"``). + + Raises + ------ + ValueError + If the fit snapshot is unavailable (e.g. this result was unpickled), or an + explicit ``placebo_periods`` entry is a post-treatment period / not a + pre-period. + """ + if self._fit_snapshot is None: + raise ValueError( + "in_time_placebo() requires the fit snapshot on the results object. " + "This result appears to have been loaded from serialization (which " + "excludes the snapshot) or produced by an older estimator version. " + "Re-fit to enable the in-time placebo." + ) + from diff_diff.synthetic_control import ( + _mspe, + _placebo_fit_unit, + _truncate_snapshot_in_time, + ) + + snap = self._fit_snapshot + if n_starts is None: + n_starts_eff = snap.n_starts + else: + if not isinstance(n_starts, (int, np.integer)) or n_starts < 1: + raise ValueError(f"n_starts override must be a positive integer, got {n_starts!r}") + n_starts_eff = int(n_starts) + + pre = list(snap.pre_periods) + empty = pd.DataFrame([], columns=self._IN_TIME_COLS) + + # Fail closed when the treated unit's own fit did not converge: a truncated / + # under-optimized baseline makes the placebo comparison meaningless. + if not self._fit_converged: + warnings.warn( + "In-time placebo skipped: the treated unit's own SCM fit did not " + "converge at fit time (inner Frank-Wolfe weight solve and/or outer V " + "search). Re-fit with a larger inner_max_iter / looser " + "inner_min_decrease (inner) and/or a larger optimizer_options['maxiter'] " + "/ more n_starts (outer V search).", + UserWarning, + stacklevel=2, + ) + self._in_time_status = "treated_fit_nonconverged" + self._in_time_n_failed = 0 + self._in_time_gaps = {} + self._in_time_df = empty + return empty.copy() + + # A feasible date needs >=2 pre-fake + >=1 post-fake period -> >=3 pre periods. + # The >=2 pre-fake rule is a deliberate Note-documented restriction (an auto- + # swept single-pre-fake placebo is a non-credible pre-fit; see REGISTRY). + if len(pre) < 3: + warnings.warn( + "In-time placebo requires at least 3 pre-treatment periods (a feasible " + "placebo date needs >=2 pre-fake periods to fit and >=1 post-fake period " + f"to measure the gap); only {len(pre)} available.", + UserWarning, + stacklevel=2, + ) + self._in_time_status = "too_few_pre_periods" + self._in_time_n_failed = 0 + self._in_time_gaps = {} + self._in_time_df = empty + return empty.copy() + + if placebo_periods is None: + # Sweep every feasible pre-date (positional: idx>=2 gives >=2 pre-fake + + # >=1 post-fake; idx<2 would leave fewer than 2 pre-fake periods). + dates: List[Any] = [pre[i] for i in range(2, len(pre))] + else: + if isinstance(placebo_periods, (list, tuple, set, np.ndarray, pd.Index, pd.Series)): + dates = list(placebo_periods) + else: + dates = [placebo_periods] + # An explicit but EMPTY container is a malformed request (NOT "every date + # was infeasible") — fail fast, consistent with the post-date / non-pre + # date raises below. Pass None to sweep all feasible pre-dates. + if not dates: + raise ValueError( + "placebo_periods is empty; pass None to sweep all feasible " + "pre-dates, or a non-empty list of pre-period date(s)." + ) + pre_set = set(pre) + post_set = set(snap.post_periods) + for d in dates: + if d in post_set: + raise ValueError( + f"placebo_period {d!r} is a true post-treatment period; an " + "in-time placebo date must lie in the pre-treatment window." + ) + if d not in pre_set: + raise ValueError( + f"placebo_period {d!r} is not a pre-treatment period " + f"(pre_periods = {pre})." + ) + # De-duplicate + canonicalize to pre-period order (mirrors _resolve_periods): + # duplicate / unordered explicit dates must not trigger duplicate refits or + # inflate n_dates. + _requested = set(dates) + dates = [p for p in pre if p in _requested] + + in_time_gaps: Dict[Any, Dict[Any, float]] = {} + rows: List[Dict[str, Any]] = [] + dropped_all: set = set() + n_failed = 0 + n_infeasible = 0 + n_ran = 0 + + for t_f in dates: + idx = pre.index(t_f) + n_pre_fake = idx + n_post_fake = len(pre) - idx + snap_mod, dropped = _truncate_snapshot_in_time(snap, t_f) + dropped_all.update(dropped) + if snap_mod is None: + n_infeasible += 1 + rows.append( + { + "placebo_period": t_f, + "placebo_att": np.nan, + "pre_fit_rmspe": np.nan, + "rmspe_ratio": np.nan, + "n_pre_fake": n_pre_fake, + "n_post_fake": n_post_fake, + "n_dropped_specs": len(dropped), + "status": "infeasible", + } + ) + continue + fitted = _placebo_fit_unit(snap_mod, snap.treated_id, snap.donor_ids, n_starts_eff) + if fitted is None: + n_failed += 1 + rows.append( + { + "placebo_period": t_f, + "placebo_att": np.nan, + "pre_fit_rmspe": np.nan, + "rmspe_ratio": np.nan, + "n_pre_fake": n_pre_fake, + "n_post_fake": n_post_fake, + "n_dropped_specs": len(dropped), + "status": "failed", + } + ) + continue + gap_path, ratio = fitted + in_time_gaps[t_f] = gap_path + placebo_att = float(np.mean([gap_path[p] for p in snap_mod.post_periods])) + rows.append( + { + "placebo_period": t_f, + "placebo_att": placebo_att, + "pre_fit_rmspe": float(np.sqrt(_mspe(gap_path, snap_mod.pre_periods))), + "rmspe_ratio": ratio, + "n_pre_fake": n_pre_fake, + "n_post_fake": n_post_fake, + "n_dropped_specs": len(dropped), + "status": "ran", + } + ) + n_ran += 1 + + if dropped_all: + warnings.warn( + "In-time placebo (TRUNCATE convention): predictor(s) " + f"{sorted(map(str, dropped_all))} fell entirely in the held-out " + "post-fake window for some placebo date(s) and were dropped from those " + "refits (see the n_dropped_specs column).", + UserWarning, + stacklevel=2, + ) + if n_infeasible > 0: + warnings.warn( + f"{n_infeasible} in-time placebo date(s) were dimensionally infeasible " + "(too few pre-fake periods or all predictors dropped) and are reported " + "with status='infeasible' (NaN metrics).", + UserWarning, + stacklevel=2, + ) + if n_failed > 0: + warnings.warn( + f"{n_failed} in-time placebo refit(s) failed to converge and are " + "reported with status='failed' (NaN metrics).", + UserWarning, + stacklevel=2, + ) + + self._in_time_gaps = in_time_gaps + self._in_time_n_failed = int(n_failed) + self._in_time_n_infeasible = int(n_infeasible) + # When no date ran, classify the cause precisely so the downstream reason text + # is never false: a pure convergence failure ("all_dates_failed", actionable — + # raise n_starts / loosen tolerances) and pure dimensional infeasibility + # ("all_dates_infeasible", structural) are distinct; a MIX of both gets its own + # "all_dates_unusable" code (both counters are surfaced) rather than being + # mislabeled as exclusively one or the other. + if n_ran > 0: + self._in_time_status = "ran" + elif n_failed > 0 and n_infeasible > 0: + self._in_time_status = "all_dates_unusable" + elif n_failed > 0: + self._in_time_status = "all_dates_failed" + else: + self._in_time_status = "all_dates_infeasible" + self._in_time_df = pd.DataFrame(rows, columns=self._IN_TIME_COLS) + return self._in_time_df.copy() + + def get_in_time_placebo_df(self) -> pd.DataFrame: + """ + Get the in-time placebo table (see :meth:`in_time_placebo`). + + Survives pickling. Raises if :meth:`in_time_placebo` has not been run. + + Returns + ------- + pandas.DataFrame + """ + if self._in_time_df is None: + raise ValueError("No in-time placebo results yet; call in_time_placebo() first.") + return self._in_time_df.copy() + + def get_in_time_placebo_gaps(self) -> pd.DataFrame: + """ + Long-form in-time placebo gap paths, for the backdating overlay plot. + + One row per (placebo date, period) for every converged in-time refit. Columns: + ``placebo_period``, ``period``, ``gap``, ``phase`` (``"pre_fake"`` for periods + before the placebo date, ``"post_fake"`` for the held-out window from it on). + These per-period paths are panel-derived and are NOT retained after pickling. + + Returns + ------- + pandas.DataFrame + + Raises + ------ + ValueError + If :meth:`in_time_placebo` has not been run, or if the gap paths were + dropped on pickling (re-fit and re-run to recompute them). + """ + if self._in_time_df is None: + raise ValueError("No in-time placebo results yet; call in_time_placebo() first.") + if self._in_time_gaps is None: + raise ValueError( + "In-time placebo gap paths are not retained after pickling " + "(panel-derived); re-run in_time_placebo() on a freshly fitted result " + "to recompute them." + ) + pre = list(self.pre_periods) + rows: List[Dict[str, Any]] = [] + for t_f, gap_path in self._in_time_gaps.items(): + split = pre.index(t_f) + for period in pre: + if period in gap_path: + phase = "post_fake" if pre.index(period) >= split else "pre_fake" + rows.append( + { + "placebo_period": t_f, + "period": period, + "gap": gap_path[period], + "phase": phase, + } + ) + return pd.DataFrame(rows, columns=["placebo_period", "period", "gap", "phase"]) diff --git a/docs/api/synthetic_control.rst b/docs/api/synthetic_control.rst index 506c7422..b727315d 100644 --- a/docs/api/synthetic_control.rst +++ b/docs/api/synthetic_control.rst @@ -25,6 +25,15 @@ reported estimate. Significance comes from **in-space placebo permutation infere ``placebo_p_value = rank/(n_placebos+1)``). This permutation p-value is a separate field from the (NaN) ``p_value``; ``is_significant`` stays bound to ``p_value``. +**Robustness diagnostics (ADH 2015 §4, opt-in):** +:meth:`~diff_diff.SyntheticControlResults.leave_one_out` drops each reportably-weighted (weight > 1e-6) +donor and re-fits (per-drop ATT / ``delta_att`` table — a large ``delta_att`` flags +single-donor dependence), and +:meth:`~diff_diff.SyntheticControlResults.in_time_placebo` reassigns the intervention to an +earlier pre-date and checks for a spurious gap before the true treatment date (the +backdating placebo; ``placebo_att`` should be ~0). Both re-run the validated solver and +leave the analytical inference fields NaN. + **Distinct from** :class:`~diff_diff.SyntheticDiD` (Arkhangelsky et al. 2021), which adds time weights and ridge regularization; classic SCM uses **donor weights only** plus the outer ``V`` search. @@ -71,6 +80,12 @@ Results container for synthetic control estimation. ~SyntheticControlResults.in_space_placebo ~SyntheticControlResults.get_placebo_df + ~SyntheticControlResults.leave_one_out + ~SyntheticControlResults.get_leave_one_out_df + ~SyntheticControlResults.get_leave_one_out_gaps + ~SyntheticControlResults.in_time_placebo + ~SyntheticControlResults.get_in_time_placebo_df + ~SyntheticControlResults.get_in_time_placebo_gaps ~SyntheticControlResults.summary ~SyntheticControlResults.print_summary ~SyntheticControlResults.to_dict diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md index 38c8022f..f33f263b 100644 --- a/docs/methodology/REGISTRY.md +++ b/docs/methodology/REGISTRY.md @@ -1988,6 +1988,10 @@ Classic synthetic control (donor/unit weights only) for a single treated unit, d **Inference:** **No analytical standard error** (Section 2.4) — `se`/`t_stat`/`p_value`/`conf_int` are always NaN. Significance comes from **in-space placebo permutation inference** via `SyntheticControlResults.in_space_placebo()`: reassign treatment to each donor, refit a synthetic control for it, and rank the treated unit's post/pre RMSPE ratio (`rmspe_ratio` = `RMSPE_post / RMSPE_pre` = `sqrt(MSPE_post / MSPE_pre)`) among all units; `placebo_p_value = rank / (n_placebos + 1)`, where `rank = 1 + #{placebos with ratio ≥ treated ratio}` — an **upper-tail rank test on the (unsigned) RMSPE-ratio statistic**, ties counted conservatively via `≥`. Because the ratio squares the gaps it is direction-agnostic: a large ratio signals an effect of *either* sign, so this is NOT a signed/one-directional ("one-sided") hypothesis test on the treatment-effect direction. ADH 2010 §3.4 reports the *MSPE* ratio (the square of `rmspe_ratio`); the two are monotone-equivalent, so the rank and p-value are identical — only the reported statistic's scale differs. `rmspe_ratio` (the treated statistic) is computed at fit time; `placebo_p_value` / `n_placebos` / `n_failed` are populated by the opt-in `in_space_placebo()` call. `get_placebo_df()` returns the per-unit ratio table used for the rank (the per-period placebo gap paths for a "spaghetti" plot are retained internally, not in this summary). +**ADH-2015 §4 robustness diagnostics (opt-in):** Two further diagnostics from Abadie-Diamond-Hainmueller (2015, §4), each a thin re-run of the validated solver — populated only when called and surfaced under `estimator_native_diagnostics` (the analytical inference contract is unchanged; `se`/`t_stat`/`p_value`/`conf_int`/`is_significant` stay bound to the NaN analytical `p_value`): +- **Leave-one-out donor robustness** (`leave_one_out()`): drops each **reportably-weighted** donor and re-fits the treated unit against the reduced pool, returning a per-drop ATT / `delta_att` table (a `status="baseline"` row first, then one row per dropped donor sorted by `|delta_att|`). A large `delta_att` flags single-donor dependence (a single *dominant* donor is still dropped — the others absorb its mass — and its large `delta_att` is the intended signal, not a failure). The reporting stack's headline donor-sensitivity number is `max_abs_delta_att` = `max |delta_att|` over the drops (baseline-relative, so a uniform shift of every drop away from the full-fit ATT is not masked the way a raw ATT range would be). `get_leave_one_out_gaps()` returns the per-drop trajectories for the overlay plot. Fails closed on a non-converged treated fit or `< 2` donors. +- **In-time (backdating) placebo** (`in_time_placebo()`): reassigns the intervention to an earlier pre-date `t_f`, re-fits using ONLY pre-`t_f` information (TRUNCATE convention — see Note), and reports the placebo "effect" over the held-out window `[t_f, T0)` — ~0 if there is no real pre-period effect (ADH 2015 Fig. 4, German reunification backdated to 1975). Sweeps every feasible interior pre-date by default (≥2 pre-fake + ≥1 post-fake); an explicit post-period / non-pre date raises, a valid-but-dimensionally-infeasible date yields a `status="infeasible"` row (no raise). + **Notes / deviations:** - **Note:** The standardization divisor `divisor = sqrt(apply(cbind(X0,X1), 1, var))` (per-predictor SD over donors+treated, ddof=1) and the inner/outer optimizer are **not specified in ADH 2010** (which defers these numerics to Abadie & Gardeazabal 2003 App. B / the `Synth` software). The divisor is pinned from the R `Synth::synth` source; `solution.v` lives in this scaled predictor space, so the deterministic R-parity test feeds `custom_v` in the same scaled space. - **Note:** The outer objective minimizes the pre-period outcome MSPE over **all** pre periods, whereas R `Synth` uses a `time.optimize.ssr` window (1960–1969 in the Basque example). The nested `V` therefore differs from R by an efficiency-only choice (the paper notes inferential validity holds for *any* `V`), so end-to-end nested parity is a tolerance band, not equality. @@ -2000,6 +2004,10 @@ Classic synthetic control (donor/unit weights only) for a single treated unit, d - **Note (placebo failure handling):** a placebo is **excluded from both the numerator and the denominator** of the rank (never penalized into it) and tallied in `n_failed` when its fit is not a valid optimum — EITHER its **inner Frank-Wolfe weight solve** did not converge (a truncated `W` is unusable) OR its **outer `V` search** did not converge (an under-optimized `V` fits the pre-period worse, shrinking the RMSPE ratio and biasing the p-value anti-conservatively, so it must not silently enter the rank). The reported p-value uses the **effective** count `rank / (n_placebos + 1)`, where `n_placebos` is the number of placebos that entered the reference set. Failed donors still appear in `get_placebo_df()` (`status="failed"`, NaN metrics), so once a reference set is produced the table is the full treated + every-donor unit set (`n_donors + 1` rows). In the fail-closed cases the placebo loop does not run and only the treated row is returned: `J < 2` → `placebo_p_value` is NaN with a warning (no placebo distribution; `J == 2` warns the distribution is coarse), and a treated fit whose own **inner OR outer** search did not converge also fails closed (ranking a truncated / under-optimized treated statistic would not be a valid permutation). **Caveat:** each placebo refit inherits the original fit's `optimizer_options` / `n_starts`, so valid inference requires settings adequate for the outer `V` search to converge to a comparable-quality synthetic (production defaults do; cheap settings under-optimize placebo `V` and those placebos are dropped as failed — raise `n_starts` on `in_space_placebo()` or re-fit with a larger `optimizer_options['maxiter']`). - **Note (RMSPE-ratio floor):** the reported `rmspe_ratio = sqrt(MSPE_post / MSPE_pre)` floors the pre-period MSPE denominator at a scale-aware `1e-8 · max(|pre-outcomes|, 1)²` (before the square root) so a (near-)perfect pre-fit (`pre-MSPE → 0`) yields a large-but-FINITE ratio rather than `inf`/`nan` (which would corrupt the rank). Ties (`ratio_j ≥ treated_ratio`) are counted, making the p-value conservative. Mirrors the `_fit_tol` poor-fit guard. - **Note (placebo p-value is non-analytical):** `placebo_p_value` is deliberately a SEPARATE field from `p_value` (which stays NaN) — it is a permutation p-value with no SE / t-stat, so it does not flow through `safe_inference`. `is_significant` likewise stays bound to the (NaN) `p_value`, NOT `placebo_p_value`; a tool gating on `is_significant` will see `False` even when `placebo_p_value` is small. The reporting stack surfaces the placebo p-value through `estimator_native_diagnostics`, never the analytical headline. +- **Note (in-time placebo windowing — TRUNCATE):** ADH 2015 §4 says to re-estimate the in-time placebo "with the same predictors lagged accordingly." Because `diff_diff`'s predictor specs reference **absolute** periods, the in-time placebo re-cuts them by TRUNCATION: pre-period-outcome predictors become the pre-`t_f` outcomes, and covariate / special-predictor windows are intersected with the pre-`t_f` window; a window lying ENTIRELY in the held-out region `[t_f, T0)` is **dropped** (surfaced in the `n_dropped_specs` column + an aggregated warning), and `custom_v` is subset in lockstep with the surviving specs. For an outcome-predictor fit (the R-anchorable case) TRUNCATE is identical to ADH's "lag" — both equal a manual `Synth::synth` re-run with `time.optimize.ssr` cut at `t_f`. The held-out window never enters the fit (the placebo's `all_periods` is the pre-fake + post-fake span; the true post-treatment periods are excluded entirely), so there is no "peeking." This concrete convention is NOT spelled out in ADH 2015 (which gives only the qualitative "lag accordingly"). +- **Note (in-time placebo requires ≥2 pre-fake periods):** the in-time placebo treats a date with fewer than 2 pre-fake periods as `status="infeasible"` (the default sweep starts at the 3rd pre-period). This is DELIBERATELY stricter than the base estimator's `T0 ≥ 1` allowance (which permits a single-pre-period fit but warns that nested-`V` selection is unreliable): an auto-swept placebo date with a single pre-fake period is a trivially-matchable, non-credible pre-fit, so it is dropped rather than surfaced as a `ran` placebo (mirrors `SyntheticDiD.in_time_placebo`'s `i ≥ 2` rule). A date whose surviving `custom_v` has zero mass after truncation is likewise infeasible (not a convergence failure). +- **Note (leave-one-out weight floor):** ADH 2015 §4 leave-one-out omits "each donor that received positive weight." This implementation drops each donor with **reportable** weight — above the `1e-6` interpretability floor (`synthetic_control._MIN_REPORT_WEIGHT`), i.e. exactly the donors in `donor_weights` — rather than every strictly-positive weight. A donor with `0 < w ≤ 1e-6` is numerical dust whose removal moves the ATT by ~its weight (its `delta_att` would be ~0, an uninformative row), and the floor keeps the LOO table aligned with the reported donor support. The drop-set is **frozen at fit time** on the fit snapshot (`weighted_donor_ids`), so `leave_one_out()` is immune to post-fit mutation of the presentation-level `donor_weights` dict. +- **Note (ADH-2015 diagnostics validation):** R `Synth` has **no** in-time-placebo or leave-one-out function (verified against its full CRAN function index; `SCtools` adds only the *in-space* placebo battery, `scpi` only prediction-interval uncertainty), so there is no canonical R *output* to match for these diagnostics — in R they are hand-rolled by re-running `dataprep()`+`synth()`. They are validated instead by (a) the solver's existing Basque R parity (above), and (b) deterministic **self-consistency** tests proving each diagnostic equals a from-scratch `synthetic_control()` fit on the equivalent sub-problem — `leave_one_out()` drop-`d` == a fit on the donor pool minus `d`; `in_time_placebo([t_f])` == a fit on the backdated/truncated panel — both via a fixed `custom_v` (match to 1e-7). The deferred ADH-2015 items (out-of-sample CV `V`-selection, regression-weight `W^reg` extrapolation diagnostic, sparse-SC subset search) are tracked in `TODO.md`. **Reference implementation:** authors' `Synth` package for R/MATLAB/Stata (`Synth::synth`); in-space placebo construction follows `SCtools::generate.placebos`. **R-parity anchor:** the Basque Country study (Abadie-Gardeazabal 2003, `data("basque")`) — published synthetic = region 10 (Cataluña) 0.851 + region 14 (Madrid) 0.149, `loss.v` 0.0089. Two-tier test (`tests/test_methodology_synthetic_control.py`): Tier-1 feeds R's `solution.v` via `custom_v` → donor weights match to atol 1e-3 (deterministic); Tier-2 checks the nested fit in a band. @@ -2010,6 +2018,9 @@ Classic synthetic control (donor/unit weights only) for a single treated unit, d - [x] Outer nested `V` (pre-period outcome MSPE) + user-supplied `custom_v`. - [x] Gap path + pre-period RMSPE + predictor-balance table. - [x] No analytical SE (NaN inference); in-space placebo permutation inference (`in_space_placebo()`, `rank/(n_placebos+1)`) with the real treated unit excluded from every placebo pool, effective-count denominator, and a scale-aware RMSPE-ratio floor. +- [x] Leave-one-out donor robustness (`leave_one_out()`, ADH 2015 §4): per-drop ATT / `delta_att` table + overlay gaps; fail-closed. +- [x] In-time (backdating) placebo (`in_time_placebo()`, ADH 2015 §4): TRUNCATE windowing (drop held-out-window predictors + lockstep `custom_v` subset), feasible-date sweep, fail-closed. +- [ ] *Deferred (ADH 2015):* out-of-sample CV `V`-selection, regression-weight `W^reg` extrapolation diagnostic, sparse-SC subset search (see `TODO.md`). - [x] Predictor-leakage, absorbing-suffix/no-anticipation, empty-window, duplicate-label, and inner-non-convergence validation gates. --- diff --git a/docs/methodology/REPORTING.md b/docs/methodology/REPORTING.md index 58058312..17af4cd9 100644 --- a/docs/methodology/REPORTING.md +++ b/docs/methodology/REPORTING.md @@ -266,11 +266,13 @@ a library setting. under `estimator_native_diagnostics`. `SyntheticControlResults` routes parallel-trends to the `scm_fit` analogue (`pre_rmspe`, verdict `design_enforced_pt`) and surfaces `pre_rmspe`, donor-weight - concentration, and the in-space placebo permutation p-value under - `estimator_native_diagnostics` — the placebo block is populated only - when the caller has already run `in_space_placebo()` (opt-in; DR never - triggers the per-donor refit loop implicitly), and it omits - HonestDiD-style `sensitivity` (significance IS the placebo). + concentration, the in-space placebo permutation p-value, and the + ADH-2015 leave-one-out (`leave_one_out`) and in-time placebo + (`in_time_placebo`) blocks under `estimator_native_diagnostics` — each + is populated only when the caller has already run the corresponding + opt-in method (DR never triggers a refit loop implicitly; otherwise a + `status="not_run"` stub), and it omits HonestDiD-style `sensitivity` + (significance IS the placebo). `EfficientDiDResults` PT runs through `EfficientDiD.hausman_pretest` (the estimator's native PT-All vs PT-Post check). diff --git a/tests/data/synth_basque_golden.json b/tests/data/synth_basque_golden.json index c9694aa1..ef52a8b4 100644 --- a/tests/data/synth_basque_golden.json +++ b/tests/data/synth_basque_golden.json @@ -332,5 +332,27 @@ "years": [1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997], "treated_path": [3.853184630005, 3.945658296151, 4.033561734873, 4.023421896897, 4.013781968405, 4.285918396223, 4.574336095797, 4.898957353563, 5.197014981629, 5.338902978753, 5.465153005252, 5.545915627064, 5.614895726639, 5.852184933072, 6.08140541737, 6.17009424135, 6.283633404546, 6.555555398653, 6.810768561103, 7.105184302811, 7.377891682176, 7.232933621923, 7.089831372119, 6.786703607145, 6.639817386857, 6.56283917137, 6.500785454993, 6.545058607, 6.595329801139, 6.761496750091, 6.937160671728, 7.332191151301, 7.742788123594, 8.120536640759, 8.509711162324, 8.776777889074, 9.025278666196, 8.873892824706, 8.718223539089, 9.018137849286, 9.440873861653, 9.686518137675, 10.170665872809], "synthetic_path": [3.702941626236, 3.85396941359, 3.996388245904, 4.029401685261, 4.059672772757, 4.378924693221, 4.733052176452, 4.987634927316, 5.222027456975, 5.298503665995, 5.362138883475, 5.448530450656, 5.523111080498, 5.760711870916, 5.993100448641, 6.137916329668, 6.294338685153, 6.620780377198, 6.933000322789, 7.087051801242, 7.228035388368, 7.220668043084, 7.211129382309, 7.07464338848, 7.057308667347, 7.129300695567, 7.234425280207, 7.325333530431, 7.421847797791, 7.516347610321, 7.610133692242, 8.117951258459, 8.623595409787, 9.086804616765, 9.545547512504, 9.788247373834, 10.037692891258, 9.838212341386, 9.639052163101, 9.987880532374, 10.303885350264, 10.538434998253, 10.998744701142], - "gap": [0.1502430037689, 0.09168888256074, 0.03717348896852, -0.005979788364376, -0.04589080435155, -0.09300629699853, -0.1587160806545, -0.08867757375288, -0.02501247534587, 0.0403993127577, 0.1030141217766, 0.09738517640774, 0.0917846461413, 0.09147306215556, 0.08830496872896, 0.03217791168141, -0.01070528060644, -0.06522497854548, -0.1222317616855, 0.01813250156919, 0.1498562938071, 0.01226557883842, -0.1212980101894, -0.2879397813353, -0.4174912804894, -0.5664615241969, -0.7336398252139, -0.7802749234314, -0.8265179966512, -0.7548508602291, -0.6729730205141, -0.7857601071586, -0.8808072861931, -0.9662679760057, -1.035836350179, -1.01146948476, -1.012414225062, -0.96431951668, -0.9208286240117, -0.969742683088, -0.863011488611, -0.851916860578, -0.8280788283333] + "gap": [0.1502430037689, 0.09168888256074, 0.03717348896852, -0.005979788364376, -0.04589080435155, -0.09300629699853, -0.1587160806545, -0.08867757375288, -0.02501247534587, 0.0403993127577, 0.1030141217766, 0.09738517640774, 0.0917846461413, 0.09147306215556, 0.08830496872896, 0.03217791168141, -0.01070528060644, -0.06522497854548, -0.1222317616855, 0.01813250156919, 0.1498562938071, 0.01226557883842, -0.1212980101894, -0.2879397813353, -0.4174912804894, -0.5664615241969, -0.7336398252139, -0.7802749234314, -0.8265179966512, -0.7548508602291, -0.6729730205141, -0.7857601071586, -0.8808072861931, -0.9662679760057, -1.035836350179, -1.01146948476, -1.012414225062, -0.96431951668, -0.9208286240117, -0.969742683088, -0.863011488611, -0.851916860578, -0.8280788283333], + "leave_one_out": { + "dropped_regionno": 10, + "solution_w": { + "2": 9.808425765417e-09, + "3": 2.549786470331e-08, + "4": 1.480664733111e-06, + "5": 1.139816545885e-07, + "6": 1.578269127035e-08, + "7": 0.7023493162417, + "8": 1.197763079811e-08, + "9": 1.711227888294e-08, + "11": 4.911054589331e-07, + "12": 6.004338842587e-09, + "13": 1.948545433164e-08, + "14": 0.2976484098422, + "15": 1.699497289433e-08, + "16": 3.294816157091e-08, + "18": 3.255185607848e-08 + }, + "att": 0.6138519220436, + "gap": [0.6880403421591, 0.628878774828, 0.5753339248836, 0.54101400783, 0.5037249840412, 0.546426742406, 0.5606408216439, 0.6633437692491, 0.7549698385142, 0.8268369239666, 0.8905574520592, 0.8951691657033, 0.8929960283248, 0.9175321635515, 0.9274417393066, 0.8572191113726, 0.8064665804174, 0.8252573922895, 0.8361572960057, 0.9576844174138, 1.065360164047, 0.8680617841655, 0.6817614133823, 0.431542742956, 0.3121824139876, 0.2118891193835, 0.1045310836943, 0.1442119442113, 0.172403844142, 0.2354616807978, 0.2925811556038, 0.3381223905461, 0.3936699900311, 0.4330380719963, 0.4842571460249, 0.6248622785575, 0.7242896438824, 0.7039633742619, 0.6787159280904, 0.7734776694026, 0.9728259048368, 1.059949514994, 1.197909760725] + } } diff --git a/tests/test_business_report.py b/tests/test_business_report.py index d5668e95..80b04fdf 100644 --- a/tests/test_business_report.py +++ b/tests/test_business_report.py @@ -825,6 +825,19 @@ def test_scm_robustness_block_surfaces_native_fields(self, scm_fit): assert "weight_concentration" in native assert native["in_space_placebo"]["n_placebos"] == res.n_placebos + def test_scm_robustness_block_surfaces_adh2015_diagnostics(self, scm_fit): + # After running the opt-in ADH-2015 diagnostics, the robustness block must + # carry their native sub-blocks (lifted from estimator_native_diagnostics). + res, _ = scm_fit + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res.leave_one_out() + res.in_time_placebo() + rob = BusinessReport(res, auto_diagnostics=True).to_dict()["robustness"] + native = rob["estimator_native"] + assert native["leave_one_out"]["status"] == "ran" + assert native["in_time_placebo"]["status"] == "ran" + def test_staggered_triple_diff_assumption_uses_ddd_not_generic_pt(self): class StaggeredTripleDiffResults: pass diff --git a/tests/test_diagnostic_report.py b/tests/test_diagnostic_report.py index 66e594a5..a0a78fcd 100644 --- a/tests/test_diagnostic_report.py +++ b/tests/test_diagnostic_report.py @@ -2077,8 +2077,10 @@ def test_scm_native_section_populated(self, scm_fit): assert "herfindahl" in native["weight_concentration"] # Placebo is opt-in: NOT auto-run inside the report. assert native["in_space_placebo"]["status"] == "not_run" - # In-time placebo / leave-one-out are ADH 2015 (not implemented here). - assert "in_time_placebo" not in native and "leave_one_out" not in native + # The ADH-2015 diagnostics are also opt-in: surfaced as "not_run" stubs until + # the user calls leave_one_out() / in_time_placebo(). + assert native["leave_one_out"]["status"] == "not_run" + assert native["in_time_placebo"]["status"] == "not_run" def test_scm_native_surfaces_placebo_after_optin_run(self, scm_fit): res, _ = scm_fit @@ -2090,6 +2092,25 @@ def test_scm_native_surfaces_placebo_after_optin_run(self, scm_fit): assert block["n_placebos"] == res.n_placebos assert block["placebo_p_value"] == pytest.approx(res.placebo_p_value) + def test_scm_native_surfaces_leave_one_out_after_optin_run(self, scm_fit): + res, _ = scm_fit + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res.leave_one_out() + native = DiagnosticReport(res).to_dict()["estimator_native_diagnostics"] + block = native["leave_one_out"] + assert block["status"] == "ran" + assert block["att_range"] is not None and len(block["att_range"]) == 2 + + def test_scm_native_surfaces_in_time_placebo_after_optin_run(self, scm_fit): + res, _ = scm_fit + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res.in_time_placebo() + native = DiagnosticReport(res).to_dict()["estimator_native_diagnostics"] + block = native["in_time_placebo"] + assert block["status"] == "ran" and block["n_dates"] >= 1 + def test_scm_does_not_call_honest_did(self, scm_fit): """HonestDiD sensitivity should NOT run on SCM (fit-based / native path).""" res, _ = scm_fit @@ -2108,18 +2129,22 @@ def test_scm_significance_not_marked_done_until_placebo_run(self, scm_fit): res, _ = scm_fit # the fixture does NOT run the placebo schema = DiagnosticReport(res).to_dict() labels = " ".join(s.get("label", "") for s in schema.get("next_steps", [])).lower() - assert "placebo" in labels # the significance recommendation still surfaces + # Target the IN-SPACE step specifically (the in-time-placebo step's label also + # contains "placebo" but is a different, always-on ADH-2015 recommendation). + assert "in-space placebo" in labels # the significance recommendation surfaces def test_scm_placebo_step_completes_after_run(self, scm_fit): - """Once the opt-in placebo has been run, DR stops recommending it.""" + """Once the opt-in in-space placebo has been run, DR stops recommending it.""" res, _ = scm_fit before = DiagnosticReport(res).to_dict()["next_steps"] - assert any("placebo" in s.get("label", "").lower() for s in before) + assert any("in-space placebo" in s.get("label", "").lower() for s in before) with warnings.catch_warnings(): warnings.simplefilter("ignore") res.in_space_placebo() # opt-in significance procedure now done after = DiagnosticReport(res).to_dict()["next_steps"] - assert not any("placebo" in s.get("label", "").lower() for s in after) + # The in-space step is suppressed; the (differently-tagged) in-time-placebo + # and leave-one-out recommendations are unaffected. + assert not any("in-space placebo" in s.get("label", "").lower() for s in after) def test_scm_rejects_precomputed_parallel_trends_and_sensitivity(self, scm_fit): # Like SDiD/TROP, SCM computes its PT verdict internally (the scm_fit diff --git a/tests/test_methodology_synthetic_control.py b/tests/test_methodology_synthetic_control.py index ae563d01..ee85ea78 100644 --- a/tests/test_methodology_synthetic_control.py +++ b/tests/test_methodology_synthetic_control.py @@ -32,7 +32,12 @@ import pandas as pd import pytest -from diff_diff import SyntheticControl, SyntheticControlResults, synthetic_control +from diff_diff import ( + DiagnosticReport, + SyntheticControl, + SyntheticControlResults, + synthetic_control, +) from tests.conftest import assert_nan_inference DATA_DIR = Path(__file__).parent / "data" @@ -845,6 +850,35 @@ def test_basque_tier2_nested_band(): assert res.donor_weights.get(10, 0) + res.donor_weights.get(14, 0) > 0.7 +def test_basque_tier1_leave_one_out_parity(): + """Tier-1 LOO (deterministic): dropping the dominant donor (region 10) with R's + ``solution.v`` held fixed, the reduced-pool refit's ATT and gap path match R's + drop-donor ``synth`` exactly (a direct R anchor on the reduced-pool W-solve; + ``leave_one_out()`` on a custom-V fit reuses that fixed V on the donor pool minus + the dropped unit). Region 10 carries ~85% of the full-pool weight, so dropping it + swings the synthetic onto regions 7+14 — the single-donor-dependence signal LOO + exists to surface.""" + golden, df = _load_golden() + if "leave_one_out" not in golden: + pytest.skip("LOO golden missing — regenerate via the R script.") + loo_g = golden["leave_one_out"] + dropped = int(loo_g["dropped_regionno"]) + custom_v = np.asarray(golden["solution_v"], dtype=float) + res = SyntheticControl(v_method="custom", custom_v=custom_v).fit( + df, "gdpcap", "treated", "regionno", "year", **_basque_kwargs(golden) + ) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + loo = res.leave_one_out() + row = loo[(loo["status"] == "loo") & (loo["dropped_unit"] == dropped)] + assert len(row) == 1 + assert float(row["att"].iloc[0]) == pytest.approx(float(loo_g["att"]), abs=1e-2) + # Full reduced-pool gap trajectory (1955-1997) matches R's drop-donor synth. + gaps = res.get_leave_one_out_gaps() + gap_py = gaps[gaps["dropped_unit"] == dropped].sort_values("period")["gap"].to_numpy() + np.testing.assert_allclose(gap_py, np.asarray(loo_g["gap"], dtype=float), atol=2e-2) + + # --------------------------------------------------------------------------- # In-space placebo permutation inference (Abadie-Diamond-Hainmueller 2010 §2.4) # --------------------------------------------------------------------------- @@ -1328,4 +1362,845 @@ def test_rmspe_ratio_is_root_scale(): assert _rmspe_ratio(pre, post, scale=10.0) == pytest.approx(1.5) # Zero post-effect -> ratio 0; perfect pre-fit -> finite (floored), not inf. assert _rmspe_ratio(pre, np.zeros(2), scale=10.0) == pytest.approx(0.0) + # Perfect pre-fit (zero pre-gaps) -> floored denominator -> finite, not inf. assert np.isfinite(_rmspe_ratio(np.zeros(2), post, scale=10.0)) + + +# --------------------------------------------------------------------------- +# Leave-one-out donor robustness (ADH 2015 §4) +# --------------------------------------------------------------------------- + + +def _equal_mix_panel(n_donors=5, T=8, T0=6, effect=3.0, seed=1): + """Near-identical donors -> equal-ish weights -> dropping any one barely moves + the synthetic (the LOO-stable regime).""" + rng = np.random.default_rng(seed) + years = list(range(2000, 2000 + T)) + base = rng.normal(10, 0.4, n_donors) + common = np.cumsum(rng.normal(0, 0.2, T)) # shared trend + donors = {j: base[j] + common + rng.normal(0, 0.08, T) for j in range(n_donors)} + treated = np.mean([donors[j] for j in range(n_donors)], axis=0) + rng.normal(0, 0.04, T) + treated = treated.copy() + treated[T0:] += effect + rows = [] + for j in range(n_donors): + for t in range(T): + rows.append({"unit": f"d{j}", "year": years[t], "y": donors[j][t], "treated": 0}) + for t in range(T): + rows.append({"unit": "treated", "year": years[t], "y": treated[t], "treated": int(t >= T0)}) + return pd.DataFrame(rows) + + +def _single_donor_panel(n_donors=4, T=8, T0=6, effect=3.0, seed=2): + """One donor (d0) tracks the treated unit; the rest are far away -> weight + concentrates on d0 -> dropping d0 swings the result (the LOO-fragile regime).""" + rng = np.random.default_rng(seed) + years = list(range(2000, 2000 + T)) + d0_path = 10 + np.cumsum(rng.normal(0, 0.3, T)) + donors = {0: d0_path + rng.normal(0, 0.03, T)} + for j in range(1, n_donors): + donors[j] = (25.0 + 6.0 * j) + np.cumsum(rng.normal(0, 0.3, T)) # far from treated + treated = d0_path + rng.normal(0, 0.03, T) + treated = treated.copy() + treated[T0:] += effect + rows = [] + for j in range(n_donors): + for t in range(T): + rows.append({"unit": f"d{j}", "year": years[t], "y": donors[j][t], "treated": 0}) + for t in range(T): + rows.append({"unit": "treated", "year": years[t], "y": treated[t], "treated": int(t >= T0)}) + return pd.DataFrame(rows) + + +def _fit_cheap(df): + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + return synthetic_control(df, "y", "treated", "unit", "year", seed=0, **_FAST) + + +_LOO_COLS = ["dropped_unit", "att", "pre_rmspe", "post_rmspe", "rmspe_ratio", "delta_att", "status"] + + +def test_leave_one_out_baseline_row_and_structure(): + res = _fit_for_placebo(n_donors=4) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + loo = res.leave_one_out() + assert list(loo.columns) == _LOO_COLS + # Exactly one baseline row, first, reading directly from the full fit. + base = loo.iloc[0] + # dropped_unit is "not applicable" for the baseline row (pandas renders the + # None as NA in the donor-id column). + assert base["status"] == "baseline" and pd.isna(base["dropped_unit"]) + assert base["att"] == pytest.approx(res.att) and base["delta_att"] == 0.0 + assert base["pre_rmspe"] == pytest.approx(res.pre_rmspe) + assert base["rmspe_ratio"] == pytest.approx(res.rmspe_ratio) + # One LOO row per positively-weighted donor (no failures on this clean panel). + pos = [d for d in res._fit_snapshot.donor_ids if d in res.donor_weights] + loo_rows = loo[loo["status"] == "loo"] + assert set(loo_rows["dropped_unit"]) == set(pos) + assert res._loo_n_failed == 0 and res._loo_status == "ran" + # delta_att == att - full att, exactly. + for _, r in loo_rows.iterrows(): + assert r["delta_att"] == pytest.approx(r["att"] - res.att) + # Sorted by |delta_att| descending. + deltas = loo_rows["delta_att"].abs().to_numpy() + assert np.all(np.diff(deltas) <= 1e-12) + # att_range spans the LOO refits. + lo, hi = res._loo_att_range + assert lo <= hi and lo == pytest.approx(loo_rows["att"].min()) + assert hi == pytest.approx(loo_rows["att"].max()) + + +def test_leave_one_out_stable_when_no_donor_dominates(): + res = _fit_cheap(_equal_mix_panel(n_donors=5, effect=3.0)) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + loo = res.leave_one_out() + loo_rows = loo[loo["status"] == "loo"] + # Near-identical donors -> dropping any one barely moves the ATT (well under the + # 3.0 effect). att_range is correspondingly tight. + assert loo_rows["delta_att"].abs().max() < 1.0 + lo, hi = res._loo_att_range + assert (hi - lo) < 1.0 + + +def test_leave_one_out_swings_when_one_donor_dominates(): + res = _fit_cheap(_single_donor_panel(n_donors=4, effect=3.0)) + # Weight concentrates on d0. + assert res.donor_weights.get("d0", 0.0) > 0.5 + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + loo = res.leave_one_out() + loo_rows = loo[loo["status"] == "loo"] + # Dropping the dominant donor is the most influential drop (top finite row) and + # moves the ATT by a non-trivial amount. + top = loo_rows.iloc[0] + assert top["dropped_unit"] == "d0" + assert abs(top["delta_att"]) > 0.2 + + +def test_leave_one_out_deterministic(): + res = _fit_for_placebo(n_donors=4) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + loo1 = res.leave_one_out() + loo2 = res.leave_one_out() + pd.testing.assert_frame_equal(loo1, loo2) + + +def test_leave_one_out_requires_two_donors(): + res = _fit_for_placebo(n_donors=1) + with pytest.warns(UserWarning, match="at least 2 donors"): + loo = res.leave_one_out() + assert len(loo) == 1 and loo.iloc[0]["status"] == "baseline" + assert res._loo_status == "too_few_donors" and res._loo_att_range is None + + +def test_leave_one_out_fails_closed_on_nonconverged_treated_fit(): + df, _, _ = _make_panel(n_donors=4, effect=3.0) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res = synthetic_control( + df, "y", "treated", "unit", "year", seed=0, inner_max_iter=1, **_FAST_CHURN + ) + assert res._fit_converged is False + with pytest.warns(UserWarning, match="did not converge at fit time"): + loo = res.leave_one_out() + assert len(loo) == 1 and loo.iloc[0]["status"] == "baseline" + assert res._loo_status == "treated_fit_nonconverged" + + +def test_leave_one_out_refit_failure_tallied(monkeypatch): + import importlib + + sc = importlib.import_module("diff_diff.synthetic_control") + res = _fit_for_placebo(n_donors=4) + real_fit_unit = sc._placebo_fit_unit + calls = {"n": 0} + + def flaky_fit_unit(snap, unit, donor_pool, n_starts): + calls["n"] += 1 + if calls["n"] == 1: # first leave-one-out refit "fails" + return None + return real_fit_unit(snap, unit, donor_pool, n_starts) + + monkeypatch.setattr(sc, "_placebo_fit_unit", flaky_fit_unit) + with pytest.warns(UserWarning, match="failed to converge"): + loo = res.leave_one_out() + assert res._loo_n_failed == 1 + failed = loo[loo["status"] == "failed"] + assert len(failed) == 1 + assert failed[["att", "pre_rmspe", "rmspe_ratio", "delta_att"]].isna().all().all() + # Failed rows sort last (after the baseline + the converged LOO rows). + assert loo.iloc[-1]["status"] == "failed" + + +def test_leave_one_out_pickle_drops_gaps_keeps_table(): + res = _fit_for_placebo(n_donors=4) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res.leave_one_out() + restored = pickle.loads(pickle.dumps(res)) + # The summary table + scalars survive; panel-derived gap paths do not. + pd.testing.assert_frame_equal(restored.get_leave_one_out_df(), res.get_leave_one_out_df()) + assert restored._loo_gaps is None + assert restored._loo_att_range == res._loo_att_range + with pytest.raises(ValueError, match="not retained after pickling"): + restored.get_leave_one_out_gaps() + + +def test_leave_one_out_gaps_long_form(): + res = _fit_for_placebo(n_donors=4) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res.leave_one_out() + gaps = res.get_leave_one_out_gaps() + assert list(gaps.columns) == ["dropped_unit", "period", "gap", "phase"] + pos = [d for d in res._fit_snapshot.donor_ids if d in res.donor_weights] + assert set(gaps["dropped_unit"]) == set(pos) + # Every dropped donor has a full pre+post trajectory. + n_periods = len(res.pre_periods) + len(res.post_periods) + assert (gaps.groupby("dropped_unit").size() == n_periods).all() + assert set(gaps["phase"]) == {"pre", "post"} + + +def test_leave_one_out_accessor_before_run_raises(): + res = _fit_for_placebo(n_donors=4) + with pytest.raises(ValueError, match="call leave_one_out"): + res.get_leave_one_out_df() + with pytest.raises(ValueError, match="call leave_one_out"): + res.get_leave_one_out_gaps() + + +def test_leave_one_out_does_not_touch_analytical_inference(): + res = _fit_for_placebo(n_donors=4) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res.leave_one_out() + assert_nan_inference( + {"se": res.se, "t_stat": res.t_stat, "p_value": res.p_value, "conf_int": res.conf_int} + ) + assert res.is_significant is False + + +def test_leave_one_out_requires_snapshot(): + res = _fit_for_placebo(n_donors=4) + restored = pickle.loads(pickle.dumps(res)) + with pytest.raises(ValueError, match="requires the fit snapshot"): + restored.leave_one_out() + + +# --------------------------------------------------------------------------- +# In-time placebo: snapshot-truncation helper (ADH 2015 §4) +# --------------------------------------------------------------------------- + + +def _snap_for_in_time(**kw): + return _fit_for_placebo(n_donors=4, **kw)._fit_snapshot + + +def test_truncate_snapshot_positional_split(): + from diff_diff.synthetic_control import _truncate_snapshot_in_time + + snap = _snap_for_in_time() + assert list(snap.pre_periods) == [2000, 2001, 2002, 2003, 2004, 2005] + mod, _ = _truncate_snapshot_in_time(snap, 2003) + assert mod is not None + assert mod.pre_periods == [2000, 2001, 2002] # pre-fake = strictly before t_f + assert mod.post_periods == [2003, 2004, 2005] # post-fake = held-out pre, t_f first + # all_periods EXCLUDES the true post periods (2006, 2007) -> airtight no-peeking. + assert mod.all_periods == [2000, 2001, 2002, 2003, 2004, 2005] + assert 2006 not in mod.all_periods and 2007 not in mod.all_periods + + +def test_truncate_snapshot_drops_specs_in_held_out_window(): + from diff_diff.synthetic_control import _truncate_snapshot_in_time + + snap = _snap_for_in_time() # default pre_period_outcomes="all": one lag per pre period + mod, dropped = _truncate_snapshot_in_time(snap, 2003) + for spec in mod.specs: # surviving specs reference only pre-fake periods + assert all(p < 2003 for p in spec.periods) + assert len(dropped) == 3 # lags at 2003/2004/2005 dropped + assert len(mod.specs) == len(snap.specs) - 3 + + +def test_truncate_snapshot_custom_v_lockstep(): + from diff_diff.synthetic_control import _truncate_snapshot_in_time + + df, _, _ = _make_panel(n_donors=4) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res = synthetic_control( + df, + "y", + "treated", + "unit", + "year", + v_method="custom", + custom_v=np.arange(1.0, 7.0), # distinct entries to verify the subset + inner_min_decrease=1e-3, + ) + snap = res._fit_snapshot + mod, _ = _truncate_snapshot_in_time(snap, 2003) + # custom_v subset IN LOCKSTEP with the surviving specs (the default lag specs are + # ordered by ascending pre period, so the first three entries survive). + assert mod.custom_v is not None and len(mod.custom_v) == len(mod.specs) + np.testing.assert_array_equal(mod.custom_v, np.array([1.0, 2.0, 3.0])) + + +def test_truncate_snapshot_straddling_window_partial_keep(): + from diff_diff.synthetic_control import _truncate_snapshot_in_time + + df, _, _ = _make_panel(n_donors=4) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res = synthetic_control( + df, + "y", + "treated", + "unit", + "year", + special_predictors=[("y", [2002, 2003, 2004], "mean")], + pre_period_outcomes=[2000, 2001], + inner_min_decrease=1e-3, + ) + snap = res._fit_snapshot + mod, _ = _truncate_snapshot_in_time(snap, 2003) + # The special predictor straddles t_f -> truncated to its pre-fake part [2002]. + special = [s for s in mod.specs if s.kind == "special"] + assert len(special) == 1 and special[0].periods == [2002] + + +def test_truncate_snapshot_infeasible_too_few_pre_fake(): + from diff_diff.synthetic_control import _truncate_snapshot_in_time + + snap = _snap_for_in_time() + # Fewer than 2 pre-fake periods -> infeasible (the deliberate >=2 rule; an + # auto-swept single-pre-fake placebo is a non-credible pre-fit — documented Note). + assert _truncate_snapshot_in_time(snap, 2000)[0] is None # 0 pre-fake + assert _truncate_snapshot_in_time(snap, 2001)[0] is None # 1 pre-fake + + +def test_truncate_snapshot_infeasible_all_specs_dropped(): + from diff_diff.synthetic_control import _truncate_snapshot_in_time + + df, _, _ = _make_panel(n_donors=4) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res = synthetic_control( + df, + "y", + "treated", + "unit", + "year", + special_predictors=[("y", [2004, 2005], "mean")], + pre_period_outcomes=[2004, 2005], + inner_min_decrease=1e-3, + ) + snap = res._fit_snapshot + # t_f=2003 leaves >=2 pre-fake periods, but every spec lives in [2004, 2005] + # -> all dropped -> infeasible (cannot fit with zero predictors). + mod, dropped = _truncate_snapshot_in_time(snap, 2003) + assert mod is None and len(dropped) == len(snap.specs) + + +def test_truncate_snapshot_does_not_mutate_original(): + from diff_diff.synthetic_control import _truncate_snapshot_in_time + + snap = _snap_for_in_time() + before = [list(s.periods) for s in snap.specs] + _truncate_snapshot_in_time(snap, 2003) + after = [list(s.periods) for s in snap.specs] + assert before == after # shared spec objects are never mutated in place + + +# --------------------------------------------------------------------------- +# In-time placebo: end-to-end (ADH 2015 §4) +# --------------------------------------------------------------------------- + +_IN_TIME_COLS = [ + "placebo_period", + "placebo_att", + "pre_fit_rmspe", + "rmspe_ratio", + "n_pre_fake", + "n_post_fake", + "n_dropped_specs", + "status", +] + + +def test_in_time_placebo_near_zero_when_effect_post_only(): + # The effect is only in the TRUE post window (>=2006); every backdated placebo + # falls in the clean pre window, so the placebo "effect" should be ~0. + res = _fit_for_placebo(n_donors=4, effect=3.0) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + itp = res.in_time_placebo() + assert list(itp.columns) == _IN_TIME_COLS + ran = itp[itp["status"] == "ran"] + assert len(ran) > 0 + assert ran["placebo_att"].abs().max() < 1.0 # well below the 3.0 true effect + + +def test_in_time_placebo_sweep_feasibility(): + res = _fit_for_placebo(n_donors=4) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + itp = res.in_time_placebo() + # pre = [2000..2005] -> feasible dates = pre[2:] = [2002, 2003, 2004, 2005] + # (>=2 pre-fake periods — the deliberate Note-documented restriction). + assert list(itp["placebo_period"]) == [2002, 2003, 2004, 2005] + assert (itp["status"] == "ran").all() + # n_pre_fake + n_post_fake == n_pre for every row, with >=2 pre-fake + >=1 post-fake. + assert ((itp["n_pre_fake"] + itp["n_post_fake"]) == len(res.pre_periods)).all() + assert (itp["n_pre_fake"] >= 2).all() and (itp["n_post_fake"] >= 1).all() + + +def test_in_time_placebo_explicit_post_date_raises(): + res = _fit_for_placebo(n_donors=4) + with pytest.raises(ValueError, match="true post-treatment period"): + res.in_time_placebo([2006]) + + +def test_in_time_placebo_date_not_in_pre_raises(): + res = _fit_for_placebo(n_donors=4) + with pytest.raises(ValueError, match="not a pre-treatment period"): + res.in_time_placebo([1999]) + + +def test_in_time_placebo_empty_explicit_input_raises(): + # An explicit but EMPTY container is malformed (NOT "every date infeasible") -> raise + # (codex R6 P1). None still means "sweep all feasible dates". + res = _fit_for_placebo(n_donors=4) + for empty in ([], (), pd.Index([]), np.array([])): + with pytest.raises(ValueError, match="placebo_periods is empty"): + res.in_time_placebo(empty) + # The malformed call must not leave any in-time state behind. + assert res._in_time_df is None and res._in_time_status is None + + +def test_in_time_placebo_dedups_and_canonicalizes_explicit_dates(): + # Duplicate / unordered explicit dates -> de-duplicated + pre-period-ordered, so no + # duplicate refits and n_dates is not inflated (codex R7 P3). + res = _fit_for_placebo(n_donors=4) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + itp = res.in_time_placebo([2004, 2002, 2004]) # duplicate 2004, unordered + assert list(itp["placebo_period"]) == [2002, 2004] # unique, canonical pre-period order + + +def test_in_time_placebo_ran_block_reports_partial_coverage(): + # CI codex P2: a sweep where SOME dates ran and SOME were infeasible must surface + # n_ran / n_infeasible on the status="ran" block so coverage is not overstated. + res = _fit_for_placebo(n_donors=4) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res.in_time_placebo([2001, 2003]) # 2001 infeasible (1 pre-fake), 2003 runs + assert res._in_time_status == "ran" # at least one date ran + block = DiagnosticReport(res).to_dict()["estimator_native_diagnostics"]["in_time_placebo"] + assert block["status"] == "ran" + assert block["n_dates"] == 2 and block["n_ran"] == 1 + assert block["n_infeasible"] == 1 and block["n_failed"] == 0 + + +def test_leave_one_out_immune_to_donor_weights_mutation(): + # Codex R8 P1: the LOO drop-set is FROZEN at fit time (snap.weighted_donor_ids = + # the >1e-6 reportable support), NOT read from the mutable presentation-level + # donor_weights dict. So mutating donor_weights after the fit must NOT change which + # donors are dropped — the robustness result depends only on the fit. + res = _fit_for_placebo(n_donors=4) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + before = set(res.leave_one_out()[lambda d: d["status"] != "baseline"]["dropped_unit"]) + assert before == set(res._fit_snapshot.weighted_donor_ids) # drops the frozen support + # Mutate the public dict: drop a real donor, inject a bogus one. + victim = next(iter(res.donor_weights)) + res.donor_weights = {k: v for k, v in res.donor_weights.items() if k != victim} + res.donor_weights["bogus_donor"] = 0.99 + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + after = set(res.leave_one_out()[lambda d: d["status"] != "baseline"]["dropped_unit"]) + assert after == before # unchanged by the mutation + assert "bogus_donor" not in after # a donor not in the fit is never dropped + assert victim in after # still dropped despite removal from donor_weights + + +def test_in_time_placebo_early_date_infeasible_no_raise(): + res = _fit_for_placebo(n_donors=4) + # A valid pre-date with too few (<2) pre-fake periods -> NaN infeasible row + + # warning, NOT a raise. + with pytest.warns(UserWarning, match="infeasible"): + itp = res.in_time_placebo([2001]) # 1 pre-fake period + assert len(itp) == 1 and itp.iloc[0]["status"] == "infeasible" + assert np.isnan(itp.iloc[0]["placebo_att"]) + + +def test_in_time_placebo_custom_v_zero_mass_is_infeasible_not_failed(): + # A custom_v whose mass lies entirely on specs that TRUNCATE drops leaves a + # zero-mass surviving V -> the date is INFEASIBLE under the supplied custom_v, + # NOT a convergence failure (codex R2 P1b: v/v.sum() would be 0/0). + df, _, _ = _make_panel(n_donors=4) # default: 6 lag specs (2000..2005) + v = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0]) # all mass on the 2003/2004/2005 lags + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res = synthetic_control( + df, + "y", + "treated", + "unit", + "year", + v_method="custom", + custom_v=v, + inner_min_decrease=1e-3, + ) + itp = res.in_time_placebo([2003]) # keeps lags 2000/2001/2002 -> all zero weight + row = itp[itp["placebo_period"] == 2003] + assert len(row) == 1 and row.iloc[0]["status"] == "infeasible" # NOT "failed" + assert res._in_time_status == "all_dates_infeasible" + + +def test_leave_one_out_uniform_shift_surfaced_by_delta_not_range(monkeypatch): + # Codex R3 P1b: when every donor-drop shifts the ATT the SAME way, the raw + # att_range has ~zero width (looks stable) but the donor dependence is large. + # The headline metric must be baseline-relative (max |delta_att|), not the range. + import importlib + + sc = importlib.import_module("diff_diff.synthetic_control") + res = _fit_for_placebo(n_donors=4) + baseline = float(res.att) + snap = res._fit_snapshot + shift = 5.0 # same large shift for EVERY drop -> uniform + + def uniform_shift(snap_arg, unit, pool, n_starts): + gp = {p: 0.0 for p in snap.pre_periods} + gp.update({p: baseline + shift for p in snap.post_periods}) + return gp, 1.0 + + monkeypatch.setattr(sc, "_placebo_fit_unit", uniform_shift) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res.leave_one_out() + lo, hi = res._loo_att_range + assert (hi - lo) == pytest.approx(0.0, abs=1e-9) # raw range would hide the shift + assert res._loo_max_abs_delta_att == pytest.approx(shift, abs=1e-9) # delta reveals it + native = DiagnosticReport(res).to_dict()["estimator_native_diagnostics"] + assert native["leave_one_out"]["max_abs_delta_att"] == pytest.approx(shift, abs=1e-9) + + +def test_in_time_placebo_windowed_covariate_dropped_and_warns(): + # A special predictor measured over [2004, 2005] falls entirely in the held-out + # window for t_f=2003 -> dropped (TRUNCATE) + warning + n_dropped_specs reflects it. + df, _, _ = _make_panel(n_donors=4) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res = synthetic_control( + df, + "y", + "treated", + "unit", + "year", + special_predictors=[("y", [2004, 2005], "mean")], + pre_period_outcomes=[2000, 2001, 2002, 2003], + inner_min_decrease=1e-3, + ) + with pytest.warns(UserWarning, match="dropped"): + itp = res.in_time_placebo([2003]) + row = itp.iloc[0] + # The special predictor (and the lag at 2003) lie in [2003, 2005] -> dropped. + assert row["n_dropped_specs"] >= 1 and row["status"] == "ran" + + +def test_in_time_placebo_all_specs_dropped_infeasible(): + df, _, _ = _make_panel(n_donors=4) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res = synthetic_control( + df, + "y", + "treated", + "unit", + "year", + special_predictors=[("y", [2004, 2005], "mean")], + pre_period_outcomes=[2004, 2005], + inner_min_decrease=1e-3, + ) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + itp = res.in_time_placebo([2003]) # every predictor is at 2004/2005 + assert itp.iloc[0]["status"] == "infeasible" + + +def test_in_time_placebo_custom_v_runs_without_shape_error(): + # End-to-end guard for the custom_v lockstep subset: without it the custom path + # would raise a shape mismatch once specs are dropped. + df, _, _ = _make_panel(n_donors=4) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res = synthetic_control( + df, + "y", + "treated", + "unit", + "year", + v_method="custom", + custom_v=np.ones(6), + inner_min_decrease=1e-3, + ) + itp = res.in_time_placebo() + assert (itp["status"] == "ran").any() + + +def test_in_time_placebo_accepts_2d_custom_v(): + # fit() accepts an array-like custom_v (e.g. a (1, k) row vector, raveled during + # validation); the in-time TRUNCATE subset must ravel before indexing or a 2D + # custom_v raises IndexError (codex R5 P1). Must match the 1D result exactly. + df, _, _ = _make_panel(n_donors=4) + v1d = np.arange(1.0, 7.0) + v2d = v1d.reshape(1, 6) # row-vector form accepted at fit time + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res1 = synthetic_control( + df, + "y", + "treated", + "unit", + "year", + v_method="custom", + custom_v=v1d, + inner_min_decrease=1e-3, + ) + res2 = synthetic_control( + df, + "y", + "treated", + "unit", + "year", + v_method="custom", + custom_v=v2d, + inner_min_decrease=1e-3, + ) + itp1 = res1.in_time_placebo([2003]) + itp2 = res2.in_time_placebo([2003]) # would IndexError before the ravel fix + pd.testing.assert_frame_equal(itp1, itp2) + + +def test_in_time_placebo_deterministic(): + res = _fit_for_placebo(n_donors=4) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + itp1 = res.in_time_placebo() + itp2 = res.in_time_placebo() + pd.testing.assert_frame_equal(itp1, itp2) + + +def test_in_time_placebo_fails_closed_on_nonconverged_treated_fit(): + df, _, _ = _make_panel(n_donors=4, effect=3.0) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res = synthetic_control( + df, "y", "treated", "unit", "year", seed=0, inner_max_iter=1, **_FAST_CHURN + ) + assert res._fit_converged is False + with pytest.warns(UserWarning, match="did not converge"): + itp = res.in_time_placebo() + assert len(itp) == 0 and res._in_time_status == "treated_fit_nonconverged" + + +def test_in_time_placebo_pickle_drops_gaps_keeps_table(): + res = _fit_for_placebo(n_donors=4) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res.in_time_placebo() + restored = pickle.loads(pickle.dumps(res)) + pd.testing.assert_frame_equal(restored.get_in_time_placebo_df(), res.get_in_time_placebo_df()) + assert restored._in_time_gaps is None + with pytest.raises(ValueError, match="not retained after pickling"): + restored.get_in_time_placebo_gaps() + with pytest.raises(ValueError, match="requires the fit snapshot"): + restored.in_time_placebo() + + +def test_in_time_placebo_gaps_long_form(): + res = _fit_for_placebo(n_donors=4) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res.in_time_placebo([2003]) + gaps = res.get_in_time_placebo_gaps() + assert list(gaps.columns) == ["placebo_period", "period", "gap", "phase"] + assert set(gaps["phase"]) == {"pre_fake", "post_fake"} + # Periods before t_f=2003 are pre_fake; 2003+ are post_fake. + assert set(gaps.loc[gaps["phase"] == "pre_fake", "period"]) == {2000, 2001, 2002} + assert set(gaps.loc[gaps["phase"] == "post_fake", "period"]) == {2003, 2004, 2005} + + +def test_in_time_placebo_accessor_before_run_raises(): + res = _fit_for_placebo(n_donors=4) + with pytest.raises(ValueError, match="call in_time_placebo"): + res.get_in_time_placebo_df() + with pytest.raises(ValueError, match="call in_time_placebo"): + res.get_in_time_placebo_gaps() + + +def test_in_time_placebo_does_not_touch_analytical_inference(): + res = _fit_for_placebo(n_donors=4) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res.in_time_placebo() + assert_nan_inference( + {"se": res.se, "t_stat": res.t_stat, "p_value": res.p_value, "conf_int": res.conf_int} + ) + assert res.is_significant is False + + +# --------------------------------------------------------------------------- +# Self-consistency parity: the ADH-2015 diagnostics are EXACT re-runs of the +# validated solver on the equivalent sub-problem. +# +# R `Synth` has NO in-time-placebo or leave-one-out function (verified against its +# full CRAN function index), so there is no canonical R *output* to match for these +# diagnostics specifically. Instead we prove (deterministically, via a fixed custom +# V) that leave_one_out() equals a from-scratch fit on the reduced donor pool, and +# in_time_placebo() equals a from-scratch fit on the backdated/truncated panel. +# Because the custom-V solver is itself R-anchored on Basque +# (test_basque_tier1_custom_v_parity), this transitively anchors the diagnostics to +# R while directly validating that the re-run mechanism is exact (not approximate). +# --------------------------------------------------------------------------- + + +def test_leave_one_out_matches_fresh_reduced_pool_fit(): + df, _, _ = _make_panel(n_donors=4) + v = np.arange(1.0, 7.0) # k = 6 default lag predictors; fixed V -> deterministic + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res = synthetic_control( + df, + "y", + "treated", + "unit", + "year", + v_method="custom", + custom_v=v, + inner_min_decrease=1e-3, + ) + loo = res.leave_one_out() + donor_ids = list(res._fit_snapshot.donor_ids) + d = [x for x in donor_ids if x in res.donor_weights][0] # a positively-weighted donor + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + fresh = synthetic_control( + df, + "y", + "treated", + "unit", + "year", + v_method="custom", + custom_v=v, + inner_min_decrease=1e-3, + donor_pool=[x for x in donor_ids if x != d], + ) + loo_att = loo.loc[loo["dropped_unit"] == d, "att"].iloc[0] + assert loo_att == pytest.approx(fresh.att, abs=1e-7) + + +def test_in_time_placebo_matches_fresh_backdated_fit(): + df, _, _ = _make_panel(n_donors=4) # years 2000-2007, T0=6 -> pre = 2000..2005 + v = np.arange(1.0, 7.0) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + res = synthetic_control( + df, + "y", + "treated", + "unit", + "year", + v_method="custom", + custom_v=v, + inner_min_decrease=1e-3, + ) + itp = res.in_time_placebo([2003]) + placebo_att = itp.loc[itp["placebo_period"] == 2003, "placebo_att"].iloc[0] + # Fresh backdated fit: drop the true post periods, treat 2003 as the intervention, + # feed the pre-fake-subset V (lags at 2000/2001/2002 -> v[:3]). + back = df[df["year"] <= 2005].copy() + back["treated"] = ((back["unit"] == "treated") & (back["year"] >= 2003)).astype(int) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + fresh = synthetic_control( + back, + "y", + "treated", + "unit", + "year", + v_method="custom", + custom_v=v[:3], + inner_min_decrease=1e-3, + ) + assert placebo_att == pytest.approx(fresh.att, abs=1e-7) + + +# --------------------------------------------------------------------------- +# All-refits-failed branches (codex R1 P1): when EVERY refit fails to converge, +# the status must NOT be reported as "ran" / mislabeled as dimensional infeasibility. +# --------------------------------------------------------------------------- + + +def test_leave_one_out_all_refits_failed_status(monkeypatch): + import importlib + + sc = importlib.import_module("diff_diff.synthetic_control") + res = _fit_for_placebo(n_donors=4) + monkeypatch.setattr(sc, "_placebo_fit_unit", lambda *a, **k: None) # every drop fails + with pytest.warns(UserWarning, match="failed to converge"): + loo = res.leave_one_out() + # Distinct status (NOT "ran"); att_range is None; baseline + only failed rows. + assert res._loo_status == "all_refits_failed" + assert res._loo_att_range is None + assert (loo["status"] != "loo").all() # no successful drop + assert (loo.iloc[1:]["status"] == "failed").all() + # DiagnosticReport must surface it as NOT "ran", with the convergence reason. + native = DiagnosticReport(res).to_dict()["estimator_native_diagnostics"] + assert native["leave_one_out"]["status"] != "ran" + # Machine-readable code distinguishes numerical failure from structural infeasibility. + assert native["leave_one_out"]["reason_code"] == "all_refits_failed" + assert "failed to converge" in native["leave_one_out"]["reason"] + + +def test_in_time_placebo_all_dates_failed_status(monkeypatch): + import importlib + + sc = importlib.import_module("diff_diff.synthetic_control") + res = _fit_for_placebo(n_donors=4) + monkeypatch.setattr(sc, "_placebo_fit_unit", lambda *a, **k: None) # every refit fails + with pytest.warns(UserWarning, match="failed to converge"): + itp = res.in_time_placebo() + # Convergence failure must NOT be mislabeled as dimensional infeasibility. + assert res._in_time_status == "all_dates_failed" + assert (itp["status"] == "failed").all() and len(itp) > 0 + native = DiagnosticReport(res).to_dict()["estimator_native_diagnostics"] + assert native["in_time_placebo"]["status"] != "ran" + assert native["in_time_placebo"]["reason_code"] == "all_dates_failed" + assert "failed to converge" in native["in_time_placebo"]["reason"] + + +def test_in_time_placebo_mixed_failed_and_infeasible_status(monkeypatch): + # Codex R8 P2: a no-success run with BOTH a dimensionally-infeasible date AND a + # convergence-failed date must report the mixed "all_dates_unusable" status with + # both counts — NOT be mislabeled as exclusively failed (which would falsely claim + # "none was dimensionally infeasible"). + import importlib + + sc = importlib.import_module("diff_diff.synthetic_control") + res = _fit_for_placebo(n_donors=4) + # Feasible dates "fail" to converge; 2001 (1 pre-fake) is dimensionally infeasible. + monkeypatch.setattr(sc, "_placebo_fit_unit", lambda *a, **k: None) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + itp = res.in_time_placebo([2001, 2003]) # 2001 infeasible, 2003 fails + assert res._in_time_status == "all_dates_unusable" + assert res._in_time_n_failed == 1 and res._in_time_n_infeasible == 1 + assert set(itp["status"]) == {"infeasible", "failed"} + block = DiagnosticReport(res).to_dict()["estimator_native_diagnostics"]["in_time_placebo"] + assert block["reason_code"] == "all_dates_unusable" + assert block["n_failed"] == 1 and block["n_infeasible"] == 1 diff --git a/tests/test_practitioner.py b/tests/test_practitioner.py index 4153d840..7cb7d734 100644 --- a/tests/test_practitioner.py +++ b/tests/test_practitioner.py @@ -356,6 +356,9 @@ def test_synthetic_control_results(self, mock_scm_results): all_labels = " ".join(s.get("label", "") for s in output["next_steps"]).lower() assert "in_space_placebo" in all_code assert "placebo" in all_labels + # The ADH-2015 robustness steps also surface (opt-in diagnostics, non-STEPS + # tags so a caller's completed_steps can never suppress them). + assert "leave_one_out" in all_code and "in_time_placebo" in all_code # SCM is not a staggered DiD: no control-group / anticipation knobs. handler_steps = [s for s in output["next_steps"] if s["baker_step"] > 2] handler_code = " ".join(s.get("code", "") for s in handler_steps)