diff --git a/CHANGELOG.md b/CHANGELOG.md
index 7bd213d8..789da6f4 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -10,6 +10,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Added
- **PowerAnalysis methodology-review-tracker promotion: In Progress → Complete, with a panel-variance correction (behavior change).** Closes the Bloom (1995) + Burlig, Preonas & Woerman (2020) source audits on the tracker (PR-A #506 added both paper reviews + under-review Notes; this PR validates the source against the code and reconciles the discrepancies). **Behavior change:** the analytical *panel* DiD variance was the Moulton design-effect factor `(1+(T−1)·rho)/T`, wrong two ways versus the source — wrong period-scaling (~4× too small at `rho=0`, `m=r=5` versus the iid DiD benchmark) and the **opposite `rho`-sign** (it *raised* the MDE as within-unit correlation grew). It is replaced by the within-unit equicorrelated special case of Burlig et al. Eq. 2, `Var(ATT) = sigma² · (1/n_T + 1/n_C) · (1/n_pre + 1/n_post) · (1 − rho)`, in which within-unit (serial) correlation *lowers* the MDE because the difference-in-differences cancels the shared within-unit component. So `PowerAnalysis.mde` / `power` / `sample_size` (and the `compute_*` wrappers) now return a **smaller** MDE / required N as `rho` rises for **all** designs; the 2×2 path matches Bloom's `2σ²` at the default `rho = 0` and is continuous with the panel form at `n_pre = n_post = 1`. New input validation, enforced for **all** designs *before* the 2×2-vs-panel router: `n_pre >= 1`, `n_post >= 1`, `rho ∈ [−1/(T−1), 1)` (`T = n_pre + n_post`), finite `sigma >= 0`, positive group counts, and `treat_frac ∈ (0, 1)` now raise `ValueError` (previously invalid two-period shapes and out-of-range `rho` fell through to `basic_did` silently). The `(1 − rho)` factor applies at `T = 2` too — the 2×2 path is Burlig's `m = r = 1` special case (footnote 11), so a nonzero `rho` is no longer silently ignored there, while `rho = 0` still recovers Bloom's `2σ²`. The MDE multiplier stays the **normal (z)** Bloom multiplier (a deliberate large-sample approximation to Burlig's t, documented as `**Deviation from R:**`) — unchanged. New `tests/test_methodology_power.py` (Bloom Table 1 multipliers; 2×2 + panel closed forms; a literal-equicorrelated Monte-Carlo validation of the panel variance; `sample_size`↔`mde` round-trip; input-guard + `rho`-at-`T=2` + `compute_*` wrapper validation; base-R `qnorm` parity at `benchmarks/data/r_power_golden.json`, generator `benchmarks/R/generate_power_golden.R`); the two `tests/test_power.py` ICC-direction tests were inverted to Burlig's sign. REGISTRY `## PowerAnalysis` equation block rewritten (z not t; corrected 2×2 / panel SE + sample-size; removed the cluster-`m` and inverted-`R²` terms that matched neither code nor source); `docs/references.rst` adds Frison & Pocock (1992) + McKenzie (2012) as the equicorrelated lineage; tutorial `06_power_analysis.ipynb` corrected. `METHODOLOGY_REVIEW.md` row promoted to **Complete** (`Last Review = 2026-05-31`); priority queue pruned; the PR-A under-review Notes removed across REGISTRY / `power.py` / `references.rst`.
- **`WooldridgeDiD` outcome-fit hint:** `WooldridgeDiD(method="ols")` now emits a `UserWarning` when the outcome is binary (`{0, 1}`) or a non-negative integer count, noting that a matching nonlinear model (`method="logit"` / `method="poisson"`) is often the **more appropriate specification** for such outcomes. Following Wooldridge (2023): the nonlinear paths impose parallel trends on the link/index scale rather than in levels (level-PT is only valid for continuous/unbounded outcomes), and the paper's Section 5 simulations show the linear model both biased and less precise where the nonlinear mean holds. It is a **different identifying assumption** than linear OLS — which one fits depends on which parallel-trends restriction holds — so the warning frames it as a recommended comparison, not an automatic switch or free efficiency upgrade. OLS remains a valid QMLE for *any* response (Table 1). Always-on (suppress via `warnings.filterwarnings`); detection is high-signal (binary requires exactly `{0, 1}`; the count branch suggests Poisson — the natural unbounded-count model — for *any* non-negative integers with >2 distinct values, so bounded binomial / known-upper-bound integer outcomes are not separately distinguished from unbounded counts; fractional / continuous outcomes are not flagged).
+- **`SyntheticControl` leave-one-out + in-time placebo robustness diagnostics (ADH 2015 §4).** Two opt-in `SyntheticControlResults` methods, each a thin re-run of the validated solver (analytical `se`/`t_stat`/`p_value`/`conf_int`/`is_significant` stay bound to the NaN analytical `p_value`). **`leave_one_out()`** drops each reportably-weighted donor (weight above the 1e-6 floor — the donors in `donor_weights`) in turn and re-fits the treated unit against the reduced pool, returning a per-drop ATT / `delta_att` table (a `status="baseline"` row first, then one row per dropped donor sorted by `|delta_att|`; non-converged refits → `status="failed"` with NaN metrics); a large `delta_att` flags single-donor dependence (a single *dominant* donor is still dropped — the others absorb its mass — and its large `delta_att` is the intended signal). **`in_time_placebo()`** reassigns the intervention to an earlier pre-date `t_f`, re-fits using only pre-`t_f` information, and reports the placebo "effect" over the held-out window `[t_f, T0)` — ~0 when there is no real pre-period effect (ADH 2015 Fig. 4). It sweeps every feasible interior pre-date by default (≥2 pre-fake + ≥1 post-fake); an explicit post-period / non-pre date raises, a dimensionally-infeasible valid date yields a `status="infeasible"` row. **Windowing = TRUNCATE** (documented `**Note:**` in REGISTRY): predictor specs are re-cut to the pre-`t_f` window (pre-period-outcome predictors become the pre-`t_f` outcomes; covariate/special windows are intersected), a window lying entirely in the held-out region is **dropped** (surfaced in `n_dropped_specs` + an aggregated warning) and `custom_v` is subset in lockstep with the surviving specs; the true post-treatment periods are excluded from the placebo fit entirely (no peeking). Both fail closed on a non-converged treated fit (and `leave_one_out` on `<2` donors). New accessors `get_leave_one_out_df()` / `get_in_time_placebo_df()` (survive pickling) and long-form `get_leave_one_out_gaps()` / `get_in_time_placebo_gaps()` for the overlay/backdating plots (panel-derived, dropped on pickle). **Validation:** R `Synth` has no in-time/LOO function (verified against its full CRAN function index), so — beyond the solver's existing Basque R parity — the diagnostics are anchored by deterministic self-consistency tests proving each equals a from-scratch `synthetic_control()` fit on the equivalent sub-problem (reduced donor pool / backdated panel) to 1e-7. **Reporting-stack integration:** `_scm_native` surfaces opt-in `leave_one_out` + `in_time_placebo` blocks (`status="not_run"` stub until run), `BusinessReport` lifts them into the SCM native robustness block, and `practitioner_next_steps` emits both as steps (non-`STEPS` tags so a caller's `completed_steps` cannot suppress them). The remaining ADH-2015 items (CV `V`-selection, `W^reg` extrapolation diagnostic, sparse-SC) are tracked in `TODO.md`. Documented in `docs/methodology/REGISTRY.md` §SyntheticControl, `docs/methodology/REPORTING.md`, `docs/api/synthetic_control.rst`, the LLM guides, and `README.md`.
- **New tutorial: `docs/tutorials/24_staggered_vs_collapsed_power.ipynb` — "Staggered Rollout or a Simple 2×2? A Power-Analysis Decision Guide".** A practitioner walkthrough for geo experiments (framed on a 50-state staggered rollout) on when to reach for Callaway-Sant'Anna vs collapsing to a familiar pre/post 2×2. Shows, with live paired Monte Carlo on `generate_staggered_data`, that the collapsed 2×2 silently targets a *diluted* estimand (reports ~60–94% of the true effect-on-treated as the rollout staggers, with near-zero CI coverage of the truth under a slow rollout), and that CS's minimum-detectable-lift penalty is a *fast-rollout* phenomenon that shrinks to parity as the rollout becomes more staggered. Fully self-contained (runs live, no committed data files); ends with a CS-vs-2×2 decision guide.
- **`SyntheticControl` in-space placebo permutation inference + reporting-stack integration (ADH 2010 §2.4).** New `SyntheticControlResults.in_space_placebo()` provides the significance test classic SCM lacks an analytical SE for: it reassigns treatment to each donor, refits a synthetic control for that pseudo-treated donor against the **other `J−1` donors** (the real treated unit is excluded from every placebo pool — its post-period is treatment-contaminated; matches `SCtools::generate.placebos`), and ranks the treated unit's post/pre **RMSPE ratio** among the `J+1` units. New fields `placebo_p_value` (`= rank/(n_placebos+1)`, an upper-tail rank test on the unsigned RMSPE ratio — direction-agnostic, so it detects an effect of *either* sign rather than a signed/one-directional hypothesis; ties counted via `≥`), `rmspe_ratio` (the treated statistic, set at fit), and `n_placebos`/`n_failed` (effective reference-set sizes; non-converged placebos are excluded from BOTH numerator and denominator, never penalized into the rank). `placebo_p_value` is a **separate field** from the (always-NaN) `p_value` — it is a permutation p-value with no SE/t-stat and does not flow through `safe_inference`; `is_significant` stays bound to `p_value`. Edge cases fail closed: scale-aware RMSPE-ratio floor (a perfect pre-fit gives a finite ratio, not `inf`), `J<2` → NaN+warn, `J==2` → degenerate+coarse warn, deterministic given `seed`. New `get_placebo_df()` returns the per-unit RMSPE-ratio summary table (incl. the treated row and any failed donors) used for the rank. The design keeps the placebo *compute* opt-in — the per-donor refit loop runs only on the explicit `in_space_placebo()` call. To support that opt-in call, every fit retains a `_SyntheticControlFitSnapshot` of the pivoted panel (memory O(units × periods × predictor-vars), like `SyntheticDiD`'s snapshot for `in_time_placebo`; excluded from pickling). A compact/lazy snapshot representation is tracked as a follow-up in `TODO.md`. **Reporting-stack integration:** `SyntheticControlResults` is now routed through `DiagnosticReport` (fit-based `scm_fit` parallel-trends analogue → verdict `design_enforced_pt` reading `pre_rmspe`; `_scm_native` surfaces `pre_rmspe` + donor-weight concentration + the placebo p-value when already computed — never triggering the refit loop implicitly), `practitioner_next_steps` (`_handle_synthetic_control` with the placebo as the headline significance step), and `BusinessReport` (fit-based assumption block, ADH 2010 attribution, robustness via `estimator_native_diagnostics`; HonestDiD passthrough rejected like SDiD/TROP). Also fixes a latent BR bug where the headline `is_significant` was a non-JSON-serializable numpy `bool_` when `p_value` is a numpy `NaN`. Documented in `docs/methodology/REGISTRY.md` §SyntheticControl (new `**Note:**` labels for the donor-pool construction, failure handling, RMSPE-ratio floor, and the non-analytical-p-value split), `docs/methodology/REPORTING.md`, `docs/api/synthetic_control.rst`, the LLM guides, and `README.md`.
- **New estimator: `SyntheticControl` — classic Synthetic Control Method (Abadie, Diamond & Hainmueller 2010; Abadie & Gardeazabal 2003).** Standalone estimator (`diff_diff/synthetic_control.py`) + `SyntheticControlResults` (`diff_diff/synthetic_control_results.py`) + `synthetic_control()` convenience function, exported from `diff_diff`. Builds a single treated unit's counterfactual as a convex combination of never-treated donor units — **donor (unit) weights only**, no time weights or ridge, distinct from `SyntheticDiD`. The inner simplex-constrained weighted-LS solve `W*(V)` reuses `utils._sc_weight_fw` (folding `V^½` into the predictor matrix, `intercept=False`, `zeta=0`); the diagonal predictor-importance matrix `V` is selected data-driven by minimizing pre-period outcome MSPE (`v_method="nested"`, softmax-on-simplex multistart Nelder-Mead + Powell polish) or supplied by the user (`v_method="custom"`). Predictors are built from `predictors`/`predictor_window`/`predictors_op`, `special_predictors`, and per-period outcome lags (`pre_period_outcomes`), in the R `Synth::dataprep` row order; per-row standardization (SD over donors+treated, ddof=1) matches the R `Synth::synth` source. Reports the gap path (`α̂_1t = Y_1t − Σ_j w_j Y_jt`), `att` (mean post-period gap), `pre_rmspe`, donor weights, `v_weights`, and a predictor-balance table. **No analytical standard error** — `se`/`t_stat`/`p_value`/`conf_int` are NaN; significance comes from in-space placebo permutation inference via `in_space_placebo()` (see the dedicated entry below). Ten validation gates baked in: predictor-period leakage, absorbing post-period suffix + no-anticipation cross-check against the treatment column, post-period canonicalization, donor-pool filtering before period derivation, empty-window rejection, poor-pre-fit `UserWarning` (RMSPE > SD of treated pre-outcomes), duplicate-predictor-label rejection, inner-solve non-convergence warning, order-independent gap-path rebuild, and the `standardize="none"` deviation; plus fail-closed `custom_v` cross-field rules and degenerate single-donor / single-pre-period handling. **R-`Synth` parity** (`tests/test_methodology_synthetic_control.py`, fixtures generated by `benchmarks/R/generate_synth_basque_golden.R` into `tests/data/`): two-tier on the Basque Country study — Tier-1 feeds R's `solution.v` via `custom_v` and reproduces the published donor weights (region 10 Cataluña 0.851 + region 14 Madrid 0.149) to `atol=1e-3` deterministically; Tier-2 (`@pytest.mark.slow`) checks the data-driven nested fit lands in a tolerance band (the nested `V` legitimately differs because the outer objective uses all pre periods, not R's `time.optimize.ssr` window). Documented in `docs/methodology/REGISTRY.md` §SyntheticControl (with `**Deviation from R:** standardize="none"` and `**Note:**` labels for the standardization formula, objective window, softmax `V` parametrization, and 1×SD poor-fit threshold), `docs/api/synthetic_control.rst`, the LLM guides, and `README.md`.
diff --git a/README.md b/README.md
index 1e738591..8330f441 100644
--- a/README.md
+++ b/README.md
@@ -108,7 +108,7 @@ Full guide: `diff_diff.get_llm_guide("practitioner")`.
- [TwoStageDiD](https://diff-diff.readthedocs.io/en/stable/api/two_stage.html) - Gardner (2022) two-stage estimator with GMM sandwich variance
- [SpilloverDiD](https://diff-diff.readthedocs.io/en/stable/api/spillover.html) - Butts (2021) ring-indicator spillover-aware DiD identifying direct effect on treated + per-ring spillover on near-control units; handles non-staggered and staggered timing; supports survey-design variance under `survey_design=` for HC1 / CR1 (Wave E.1 Binder TSL) and Conley (Wave E.2 panel-aware stratified-Conley sandwich on per-period PSU totals; extended in Wave E.2 follow-up to `conley_lag_cutoff > 0` via panel-block composition with within-PSU serial Bartlett HAC — `lag>0` requires an effective PSU via explicit `survey_design.psu` or injected `cluster=
`); `SurveyDesign.subpopulation()` preserves full-design `n_psu` / `df_survey` via zero-padded scores (Wave E.3, R `svyrecvar(subset())` form)
- [SyntheticDiD](https://diff-diff.readthedocs.io/en/stable/api/estimators.html) - Synthetic DiD combining standard DiD and synthetic control for few treated units
-- [SyntheticControl](https://diff-diff.readthedocs.io/en/stable/api/synthetic_control.html) - Abadie, Diamond & Hainmueller (2010) classic synthetic control for a single treated unit (donor-weight counterfactual, nested/custom V; in-space placebo permutation inference via `in_space_placebo()`)
+- [SyntheticControl](https://diff-diff.readthedocs.io/en/stable/api/synthetic_control.html) - Abadie, Diamond & Hainmueller (2010) classic synthetic control for a single treated unit (donor-weight counterfactual, nested/custom V; in-space placebo permutation inference via `in_space_placebo()`, plus ADH-2015 `leave_one_out()` + `in_time_placebo()` robustness)
- [TripleDifference](https://diff-diff.readthedocs.io/en/stable/api/triple_diff.html) - triple difference (DDD) estimator for designs requiring two criteria for treatment eligibility
- [ContinuousDiD](https://diff-diff.readthedocs.io/en/stable/api/continuous_did.html) - Callaway, Goodman-Bacon & Sant'Anna (2024) continuous treatment DiD with dose-response curves
- [HeterogeneousAdoptionDiD](https://diff-diff.readthedocs.io/en/stable/api/had.html) - de Chaisemartin, Ciccia, D'Haultfœuille & Knau (2026) for designs where **no unit remains untreated**; local-linear estimator at the dose support boundary returning Weighted Average Slope (WAS) on Design 1' (`d̲ = 0` / QUG) or `WAS_{d̲}` on Design 1 (`d̲ > 0`, continuous-near-d̲ or mass-point), with a multi-period event-study extension (last-treatment cohort, pointwise CIs). **Panel-only** in this release - repeated cross-sections rejected by the validator. Alias `HAD`.
diff --git a/TODO.md b/TODO.md
index 17fed4c2..a1b6f545 100644
--- a/TODO.md
+++ b/TODO.md
@@ -84,7 +84,7 @@ Deferred items from PR reviews that were not addressed before merge.
| ImputationDiD dense `(A0'A0).toarray()` scales O((U+T+K)^2), OOM risk on large panels | `imputation.py` | #141 | Medium (deferred — only triggers when sparse solver fails) |
| Multi-absorb weighted demeaning needs iterative alternating projections for N > 1 absorbed FE with survey weights; unweighted multi-absorb also uses single-pass (pre-existing, exact only for balanced panels) | `estimators.py` | #218 | Medium |
| Survey design resolution/collapse patterns are inconsistent across panel estimators — ContinuousDiD rebuilds unit-level design in SE code, EfficientDiD builds once in fit(), StackedDiD re-resolves on stacked data; extract shared helpers for panel-to-unit collapse, post-filter re-resolution, and metadata recomputation | `continuous_did.py`, `efficient_did.py`, `stacked_did.py` | #226 | Low |
-| SyntheticControl: in-time placebo + leave-one-out donor-robustness diagnostics are not implemented (ADH 2015, not the ADH 2010 scope of the current estimator), so `_scm_native` surfaces only pre-fit + in-space placebo. The practitioner / DiagnosticReport / BusinessReport routing and the in-space placebo permutation layer landed in PR-2; this remaining row covers adding the two ADH-2015 diagnostics (and surfacing them under `estimator_native_diagnostics`) in a later 2015-sourced PR. | `synthetic_control.py`, `diagnostic_report.py` | ADH-2015 follow-up | Low |
+| SyntheticControl: the remaining ADH-2015 §4 items are out of scope of the leave-one-out + in-time-placebo PR — out-of-sample cross-validation `V`-selection (training/validation split), the regression-weight `W^reg = X_0'(X_0 X_0')^{-1} X_1` extrapolation diagnostic (flag implied OLS weights outside `[0,1]`), and sparse-SC subset search (`l < J`, holding `V` fixed). Leave-one-out (`leave_one_out()`) and the in-time placebo (`in_time_placebo()`) landed and are surfaced under `estimator_native_diagnostics`; these three are the deferred tail. | `synthetic_control.py`, `synthetic_control_results.py` | ADH-2015 follow-up | Low |
| ContinuousDiD deferred CGBS 2024 extensions: (a) `covariates=` kwarg not implemented (matches R `contdid` v0.1.0); (b) discrete-treatment saturated regression deferred (integer-valued dose currently warned, not routed to per-level coefficients); (c) lowest-dose-as-control per CGBS 2024 Remark 3.1 (when `P(D=0) = 0`) not implemented — estimator requires never-treated controls. REGISTRY `## ContinuousDiD` → Implementation Checklist marks these as deferred `[ ]` items. | `diff_diff/continuous_did.py` | — | Low |
| Survey-weighted Silverman bandwidth in EfficientDiD conditional Omega* — `_silverman_bandwidth()` uses unweighted mean/std for bandwidth selection; survey-weighted statistics would better reflect the population distribution but is a second-order refinement | `efficient_did_covariates.py` | — | Low |
| TROP: extend Wave 4's `_setup_trop_data` helper to also cover the duplicated bootstrap resampling loop in `_bootstrap_variance` / `_bootstrap_variance_global` (~40 LoC dedup; mirrors the data-setup helper pattern with a `fit_callable` parameter for the per-draw refit step). | `trop_local.py`, `trop_global.py` | follow-up | Low |
diff --git a/benchmarks/R/generate_synth_basque_golden.R b/benchmarks/R/generate_synth_basque_golden.R
index 3f524ff1..8bd8473b 100644
--- a/benchmarks/R/generate_synth_basque_golden.R
+++ b/benchmarks/R/generate_synth_basque_golden.R
@@ -79,6 +79,37 @@ synthetic_path <- as.numeric(dp$Y0plot %*% so$solution.w)
treated_path <- as.numeric(dp$Y1plot)
years <- as.integer(rownames(dp$Y1plot))
+# --- Leave-one-out golden (ADH 2015 §4 donor robustness) ---------------------
+# Drop the highest-weight donor (region 10, Cataluna) and re-fit with the
+# ORIGINAL solution.v held fixed (custom.v), so the reduced-pool W-solve is
+# deterministic and directly comparable to SyntheticControlResults.leave_one_out()
+# on a v_method="custom" fit (which likewise reuses the original custom_v on the
+# donor pool minus the dropped unit — specs/V are unchanged, only the donors shrink).
+loo_drop <- 10L
+controls_loo <- controls[controls != loo_drop]
+invisible(capture.output({
+ dp_loo <- dataprep(
+ foo = basque,
+ predictors = predictors,
+ predictors.op = "mean",
+ time.predictors.prior = 1964:1969,
+ special.predictors = special,
+ dependent = "gdpcap",
+ unit.variable = "regionno",
+ unit.names.variable = "regionname",
+ time.variable = "year",
+ treatment.identifier = 17,
+ controls.identifier = controls_loo,
+ time.optimize.ssr = 1960:1969,
+ time.plot = 1955:1997
+ )
+ so_loo <- synth(dp_loo, custom.v = as.numeric(so$solution.v))
+}))
+w_loo <- as.numeric(so_loo$solution.w)
+synthetic_path_loo <- as.numeric(dp_loo$Y0plot %*% so_loo$solution.w)
+gap_loo <- as.numeric(dp_loo$Y1plot) - synthetic_path_loo
+att_loo <- mean(gap_loo[years >= 1970]) # mean post-period gap (treatment year 1970)
+
golden <- list(
config = list(
treated_regionno = 17,
@@ -104,7 +135,13 @@ golden <- list(
years = years,
treated_path = treated_path,
synthetic_path = synthetic_path,
- gap = treated_path - synthetic_path
+ gap = treated_path - synthetic_path,
+ leave_one_out = list(
+ dropped_regionno = loo_drop,
+ solution_w = as.list(setNames(w_loo, colnames(dp_loo$X0))),
+ att = att_loo,
+ gap = gap_loo
+ )
)
dir.create("tests/data", showWarnings = FALSE, recursive = TRUE)
diff --git a/diff_diff/business_report.py b/diff_diff/business_report.py
index f6edb275..b3160a6e 100644
--- a/diff_diff/business_report.py
+++ b/diff_diff/business_report.py
@@ -1019,6 +1019,9 @@ def _lift_robustness(dr: Optional[Dict[str, Any]]) -> Dict[str, Any]:
native_block["pre_rmspe"] = native.get("pre_rmspe")
native_block["weight_concentration"] = native.get("weight_concentration")
native_block["in_space_placebo"] = native.get("in_space_placebo")
+ # ADH-2015 robustness diagnostics (opt-in; "not_run" stub until run).
+ native_block["leave_one_out"] = native.get("leave_one_out")
+ native_block["in_time_placebo"] = native.get("in_time_placebo")
return {
"bacon": {
"status": bacon.get("status"),
diff --git a/diff_diff/diagnostic_report.py b/diff_diff/diagnostic_report.py
index f474e91b..fd65ea61 100644
--- a/diff_diff/diagnostic_report.py
+++ b/diff_diff/diagnostic_report.py
@@ -2152,8 +2152,10 @@ def _check_estimator_native(self) -> Dict[str, Any]:
selected ``lambda_*``).
SyntheticControl: pre-treatment fit (``pre_rmspe``), donor-weight
- concentration, and — when already computed — the in-space placebo
- permutation p-value (``in_space_placebo``).
+ concentration, and — each surfaced only when already computed — the
+ in-space placebo permutation p-value (``in_space_placebo``), the ADH-2015
+ leave-one-out donor robustness (``leave_one_out``), and the in-time
+ backdating placebo (``in_time_placebo``).
"""
r = self._results
name = type(r).__name__
@@ -2251,8 +2253,11 @@ def _scm_native(self, r: Any) -> Dict[str, Any]:
DR never triggers it implicitly, because it refits one synthetic control
per donor (potentially many nested V searches) and the placebo layer is
opt-in by design. (This differs from SDiD's cheaper in-time-placebo sweep,
- which ``_sdid_native`` runs inline.) Only the in-space placebo is exposed;
- in-time placebo and leave-one-out are ADH 2015 (not implemented).
+ which ``_sdid_native`` runs inline.) The ADH-2015 §4 diagnostics —
+ leave-one-out donor robustness (``leave_one_out()``) and the in-time
+ (backdating) placebo (``in_time_placebo()``) — are surfaced the same way:
+ opt-in, reported only once the user has run them (each refits the synthetic
+ control one or more times), else a ``status="not_run"`` stub.
"""
out: Dict[str, Any] = {"status": "ran", "estimator": "SyntheticControl"}
out["pre_rmspe"] = _to_python_float(getattr(r, "pre_rmspe", None))
@@ -2327,6 +2332,132 @@ def _scm_native(self, r: Any) -> Dict[str, Any]:
"per donor)."
),
}
+
+ # Leave-one-out donor robustness (ADH 2015 §4): opt-in, surfaced once run.
+ if getattr(r, "_loo_df", None) is not None:
+ loo_status = getattr(r, "_loo_status", None)
+ if loo_status == "ran":
+ att_range = getattr(r, "_loo_att_range", None)
+ out["leave_one_out"] = {
+ "status": "ran",
+ # Headline single-donor-dependence metric: the largest baseline-
+ # relative swing (max |delta_att|). Preferred over att_range, which
+ # can look narrow even when every drop shifts the ATT far from the
+ # full-fit baseline in the same direction.
+ "max_abs_delta_att": _to_python_float(
+ getattr(r, "_loo_max_abs_delta_att", None)
+ ),
+ "att_range": (
+ [_to_python_float(att_range[0]), _to_python_float(att_range[1])]
+ if att_range is not None
+ else None
+ ),
+ "n_failed": _to_python_scalar(getattr(r, "_loo_n_failed", None)),
+ }
+ else:
+ _loo_reasons = {
+ "treated_fit_nonconverged": (
+ "leave_one_out() was run but the treated unit's own SCM fit "
+ "did not converge at fit time, so the baseline ATT is not a "
+ "valid reference for the leave-one-out deltas."
+ ),
+ "too_few_donors": (
+ "leave_one_out() was run but fewer than 2 donors are available "
+ "(dropping one must leave a non-empty pool)."
+ ),
+ "all_refits_failed": (
+ "leave_one_out() was run but every donor-drop refit failed to "
+ "converge, so no valid leave-one-out estimate was produced "
+ "(see the status='failed' rows); raise n_starts or loosen the "
+ "optimizer tolerances."
+ ),
+ }
+ out["leave_one_out"] = {
+ "status": "infeasible",
+ # Machine-readable code so consumers can distinguish a numerical
+ # convergence failure ("all_refits_failed") from structural
+ # infeasibility ("too_few_donors") without parsing `reason`.
+ "reason_code": loo_status,
+ "reason": _loo_reasons.get(
+ loo_status, "leave_one_out() produced no valid refits."
+ ),
+ }
+ else:
+ out["leave_one_out"] = {
+ "status": "not_run",
+ "reason": (
+ "Call results.leave_one_out() to run leave-one-out donor "
+ "robustness (opt-in; refits once per reportably-weighted donor)."
+ ),
+ }
+
+ # In-time (backdating) placebo (ADH 2015 §4): opt-in, surfaced once run.
+ if getattr(r, "_in_time_df", None) is not None:
+ in_time_status = getattr(r, "_in_time_status", None)
+ if in_time_status == "ran":
+ itp = r._in_time_df
+ ran = itp[itp["status"] == "ran"] if "status" in itp else itp
+ max_abs_att = float(ran["placebo_att"].abs().max()) if len(ran) else None
+ out["in_time_placebo"] = {
+ "status": "ran",
+ # Full coverage breakdown so a partially-usable sweep is not
+ # overstated: n_dates is the requested grid; n_ran are the usable
+ # placebos; n_failed / n_infeasible are the dropped remainder.
+ "n_dates": _to_python_scalar(int(len(itp))),
+ "n_ran": _to_python_scalar(int(len(ran))),
+ "n_failed": _to_python_scalar(getattr(r, "_in_time_n_failed", None)),
+ "n_infeasible": _to_python_scalar(getattr(r, "_in_time_n_infeasible", None)),
+ "max_abs_placebo_att": _to_python_float(max_abs_att),
+ }
+ else:
+ _it_reasons = {
+ "treated_fit_nonconverged": (
+ "in_time_placebo() was run but the treated unit's own SCM fit "
+ "did not converge at fit time."
+ ),
+ "too_few_pre_periods": (
+ "in_time_placebo() was run but there are too few pre-treatment "
+ "periods for any feasible placebo date (need >=3)."
+ ),
+ "all_dates_infeasible": (
+ "in_time_placebo() was run but every placebo date was "
+ "infeasible (no pre-fake period, all predictors dropped, or "
+ "the supplied custom_v had zero mass on the surviving "
+ "predictors after truncation)."
+ ),
+ "all_dates_failed": (
+ "in_time_placebo() was run but every placebo refit failed to "
+ "converge (none was dimensionally infeasible); raise n_starts "
+ "or loosen the optimizer tolerances."
+ ),
+ "all_dates_unusable": (
+ "in_time_placebo() was run but no placebo date produced a usable "
+ "result: some refits failed to converge AND some dates were "
+ "dimensionally infeasible (see n_failed / n_infeasible)."
+ ),
+ }
+ out["in_time_placebo"] = {
+ "status": "infeasible",
+ # Machine-readable code distinguishing a numerical convergence
+ # failure ("all_dates_failed") from structural infeasibility
+ # ("all_dates_infeasible" / "too_few_pre_periods") or a mix
+ # ("all_dates_unusable"), without parsing `reason`. The n_failed /
+ # n_infeasible counts give the exact breakdown.
+ "reason_code": in_time_status,
+ "n_failed": _to_python_scalar(getattr(r, "_in_time_n_failed", None)),
+ "n_infeasible": _to_python_scalar(getattr(r, "_in_time_n_infeasible", None)),
+ "reason": _it_reasons.get(
+ in_time_status, "in_time_placebo() produced no valid refits."
+ ),
+ }
+ else:
+ out["in_time_placebo"] = {
+ "status": "not_run",
+ "reason": (
+ "Call results.in_time_placebo() to run the in-time (backdating) "
+ "placebo (opt-in; refits per backdated date)."
+ ),
+ }
return out
# -- Heterogeneity helpers --------------------------------------------
diff --git a/diff_diff/guides/llms-full.txt b/diff_diff/guides/llms-full.txt
index c0e2caf9..ff1b1ad2 100644
--- a/diff_diff/guides/llms-full.txt
+++ b/diff_diff/guides/llms-full.txt
@@ -616,7 +616,7 @@ scm.fit(
) -> SyntheticControlResults
```
-**Inference:** NONE analytical — `se`/`t_stat`/`p_value`/`conf_int` are always NaN. `att` is the mean post-period gap. Significance via in-space placebo permutation inference: `results.in_space_placebo()` reassigns treatment to each donor, refits against the other J-1 donors (the real treated unit is excluded from every placebo pool), and sets `placebo_p_value = rank/(n_placebos+1)` from the post/pre RMSPE-ratio. The permutation `placebo_p_value` is a SEPARATE field from the (NaN) `p_value`; `is_significant` stays bound to `p_value`. Predictor periods must lie within the pre window; `post_periods` must be a contiguous suffix cross-checked against `D` (no anticipation).
+**Inference:** NONE analytical — `se`/`t_stat`/`p_value`/`conf_int` are always NaN. `att` is the mean post-period gap. Significance via in-space placebo permutation inference: `results.in_space_placebo()` reassigns treatment to each donor, refits against the other J-1 donors (the real treated unit is excluded from every placebo pool), and sets `placebo_p_value = rank/(n_placebos+1)` from the post/pre RMSPE-ratio. The permutation `placebo_p_value` is a SEPARATE field from the (NaN) `p_value`; `is_significant` stays bound to `p_value`. **ADH-2015 §4 robustness (opt-in, analytical inference unchanged):** `results.leave_one_out()` drops each reportably-weighted donor (weight > 1e-6) and re-fits (per-drop ATT/`delta_att` — large `delta_att` ⇒ single-donor dependence); `results.in_time_placebo()` backdates the intervention and checks for a spurious pre-period gap (TRUNCATE windowing — predictor windows in the held-out region are dropped). Predictor periods must lie within the pre window; `post_periods` must be a contiguous suffix cross-checked against `D` (no anticipation).
**Usage:**
@@ -1298,7 +1298,7 @@ Returned by `SyntheticControl.fit()`.
| `pre_periods`, `post_periods` | `list` | Calendar-sorted periods |
| `v_method`, `standardize` | `str` | Echoed configuration |
-**Methods:** `in_space_placebo()` (opt-in permutation inference; refits one synthetic control per donor), `get_placebo_df()` (per-unit RMSPE-ratio table incl. the treated row), `summary()`, `print_summary()`, `to_dict()`, `to_dataframe()`, `get_gap_df()`, `get_weights_df()`
+**Methods:** `in_space_placebo()` (opt-in permutation inference; refits one synthetic control per donor), `get_placebo_df()` (per-unit RMSPE-ratio table incl. the treated row), `leave_one_out()` (ADH-2015 §4 donor robustness; drops each reportably-weighted donor (weight > 1e-6) → per-drop ATT/`delta_att` table) + `get_leave_one_out_df()`/`get_leave_one_out_gaps()`, `in_time_placebo()` (ADH-2015 §4 backdating placebo; reassigns the intervention earlier, TRUNCATE windowing, placebo ATT ~0 if no real pre-effect) + `get_in_time_placebo_df()`/`get_in_time_placebo_gaps()`, `summary()`, `print_summary()`, `to_dict()`, `to_dataframe()`, `get_gap_df()`, `get_weights_df()`
### TripleDifferenceResults
diff --git a/diff_diff/guides/llms.txt b/diff_diff/guides/llms.txt
index 5261b046..7fb2922b 100644
--- a/diff_diff/guides/llms.txt
+++ b/diff_diff/guides/llms.txt
@@ -60,7 +60,7 @@ Full practitioner guide: call `diff_diff.get_llm_guide("practitioner")`
- [TwoStageDiD](https://diff-diff.readthedocs.io/en/stable/api/two_stage.html): Gardner (2022) two-stage estimator with GMM sandwich variance
- [SpilloverDiD](https://diff-diff.readthedocs.io/en/stable/api/spillover.html): Butts (2021) ring-indicator spillover-aware DiD identifying direct effect on treated + per-ring spillover-on-control; reuses `conley_coords` for ring construction; handles non-staggered and staggered timing; supports `SurveyDesign(weights, strata, psu, fpc)` under `vcov_type="hc1"` with optional `cluster=` for CR1 via Gerber (2026) Binder TSL (Wave E.1) and under `vcov_type="conley"` via a panel-aware stratified-Conley sandwich on per-period PSU totals (Wave E.2 cross-sectional `conley_lag_cutoff=0`) extended in Wave E.2 follow-up to `conley_lag_cutoff > 0` via panel-block composition with within-PSU serial Bartlett HAC (Newey-West 1987 separable form; `lag>0` requires an effective PSU via explicit `survey_design.psu` or injected `cluster=`), both composed with the Wave D Gardner GMM correction; `SurveyDesign.subpopulation()` preserves full-design `n_psu` / `df_survey` via zero-padded scores at the meat-helper boundary (Wave E.3, R `svyrecvar(subset())` form) (replicate weights queued as follow-up)
- [SyntheticDiD](https://diff-diff.readthedocs.io/en/stable/api/estimators.html): Synthetic DiD combining standard DiD and synthetic control methods for few treated units
-- [SyntheticControl](https://diff-diff.readthedocs.io/en/stable/api/synthetic_control.html): Abadie, Diamond & Hainmueller (2010) classic synthetic control for ONE treated unit — donor-weight counterfactual, nested or custom predictor-importance V, gap path + pre-RMSPE; no analytical SE (inference fields NaN), significance via in-space placebo permutation inference (`in_space_placebo()`, post/pre RMSPE-ratio, p = rank/(n_placebos+1))
+- [SyntheticControl](https://diff-diff.readthedocs.io/en/stable/api/synthetic_control.html): Abadie, Diamond & Hainmueller (2010) classic synthetic control for ONE treated unit — donor-weight counterfactual, nested or custom predictor-importance V, gap path + pre-RMSPE; no analytical SE (inference fields NaN), significance via in-space placebo permutation inference (`in_space_placebo()`, post/pre RMSPE-ratio, p = rank/(n_placebos+1)); ADH-2015 §4 robustness: `leave_one_out()` donor-robustness + `in_time_placebo()` backdating placebo
- [TripleDifference](https://diff-diff.readthedocs.io/en/stable/api/triple_diff.html): Triple difference (DDD) estimator for designs requiring two criteria for treatment eligibility
- [ContinuousDiD](https://diff-diff.readthedocs.io/en/stable/api/continuous_did.html): Callaway, Goodman-Bacon & Sant'Anna (2024) continuous treatment DiD with dose-response curves
- [HeterogeneousAdoptionDiD](https://diff-diff.readthedocs.io/en/stable/api/had.html): de Chaisemartin, Ciccia, D'Haultfœuille & Knau (2026) for designs where **no unit remains untreated**; local-linear estimator at the dose support boundary returning Weighted Average Slope (WAS) on Design 1' (`d̲=0` / QUG) or `WAS_{d̲}` on Design 1 (`d̲>0`, continuous-near-d̲ or mass-point), with multi-period event-study extension (last-treatment cohort, pointwise CIs). **Panel-only** in this release (repeated cross-sections rejected by the validator). Alias `HAD`.
diff --git a/diff_diff/practitioner.py b/diff_diff/practitioner.py
index b0d6811f..3a636e3c 100644
--- a/diff_diff/practitioner.py
+++ b/diff_diff/practitioner.py
@@ -718,6 +718,42 @@ def _handle_synthetic_control(results: Any):
priority="medium",
step_name="estimator_selection",
),
+ _step(
+ baker_step=6,
+ label="Leave-one-out donor robustness (ADH 2015)",
+ why=(
+ "Re-fit dropping each reportably-weighted donor (weight above the 1e-6 "
+ "floor) in turn to confirm the "
+ "estimate is not driven by a single donor (Abadie-Diamond-Hainmueller "
+ "2015, Section 4); a large delta_att when one donor is removed flags "
+ "single-donor dependence."
+ ),
+ code=(
+ "loo_df = results.leave_one_out()\n"
+ "print(loo_df) # baseline + per-dropped-donor ATT and delta_att"
+ ),
+ priority="medium",
+ # Not a standard STEPS tag, so a caller's completed_steps (validated
+ # against STEPS) can never auto-suppress this opt-in recommendation.
+ step_name="loo_jackknife",
+ ),
+ _step(
+ baker_step=6,
+ label="In-time (backdating) placebo (ADH 2015)",
+ why=(
+ "Reassign the intervention to an earlier pre-period and confirm no "
+ "spurious gap appears before the true treatment date (Abadie-Diamond-"
+ "Hainmueller 2015, Section 4, Figure 4)."
+ ),
+ code=(
+ "itp_df = results.in_time_placebo()\n"
+ "print(itp_df) # per-backdated-date placebo ATT (should be ~0)"
+ ),
+ priority="medium",
+ # Non-standard tag (not in STEPS) -> never auto-suppressed; deliberately
+ # NOT "sensitivity" (a caller could mark that done and drop this step).
+ step_name="in_time_placebo",
+ ),
_robustness_compare_step("SyntheticDiD or CS"),
]
warnings = _check_nan_att(results)
diff --git a/diff_diff/synthetic_control.py b/diff_diff/synthetic_control.py
index 164568a2..4004446c 100644
--- a/diff_diff/synthetic_control.py
+++ b/diff_diff/synthetic_control.py
@@ -30,7 +30,7 @@
"""
import warnings
-from dataclasses import dataclass
+from dataclasses import dataclass, replace
from typing import Any, Dict, List, Optional, Tuple, cast
import numpy as np
@@ -523,6 +523,9 @@ def fit(
pre_periods=list(pre_periods),
post_periods=list(post_periods),
donor_ids=list(donor_ids),
+ # Freeze the reportably-weighted support (donor_weights keys, in donor_ids
+ # order) so leave_one_out() is immune to post-fit mutation of donor_weights.
+ weighted_donor_ids=[d for d in donor_ids if d in donor_weights],
treated_id=treated_id,
standardize=self.standardize,
v_method=self.v_method,
@@ -1310,3 +1313,88 @@ def _placebo_fit_unit(
post_gaps = np.array([gap_path[p] for p in snap.post_periods], dtype=float)
scale = float(np.max(np.abs(Z1))) if Z1.size else 0.0
return gap_path, _rmspe_ratio(pre_gaps, post_gaps, scale)
+
+
+# =============================================================================
+# in-time (backdating) placebo (ADH 2015 §4) — used by
+# SyntheticControlResults.in_time_placebo() via function-level import
+# =============================================================================
+
+
+def _truncate_snapshot_in_time(
+ snap: _SyntheticControlFitSnapshot,
+ t_f: Any,
+) -> Tuple[Optional[_SyntheticControlFitSnapshot], List[str]]:
+ """Build a snapshot for an in-time placebo reassigning the intervention to ``t_f``.
+
+ Re-cuts the panel so the **pre-fake** window is the pre-periods strictly before
+ ``t_f`` and the **post-fake** window is the held-out pre-periods from ``t_f``
+ onward (``t_f`` is the first post-fake period). The true post-treatment periods
+ are EXCLUDED entirely — ``all_periods`` is set to ``pre_fake + post_fake`` so the
+ placebo refit (which fits V/W on ``pre_periods`` and measures the gap over
+ ``all_periods``) never sees treatment-contaminated data.
+
+ Predictor specs are TRUNCATED to the pre-fake window (ADH 2015 §4 "lag the
+ predictors accordingly"; for our absolute-period specs this means intersecting
+ each spec's ``periods`` with the pre-fake window). A spec whose window lies
+ ENTIRELY in the held-out region is DROPPED (its label is collected for an
+ aggregated warning), and ``custom_v`` is subset in lockstep with the surviving
+ specs so its length stays ``== len(kept_specs)``.
+
+ Returns ``(modified_snapshot, dropped_spec_labels)``, or ``(None, dropped)`` when
+ the date is infeasible: fewer than 2 pre-fake periods (the weight / V solve needs
+ at least 2), no post-fake period, or every predictor dropped. ``t_f`` must be an
+ element of ``snap.pre_periods`` (the caller validates membership).
+ """
+ pre = snap.pre_periods
+ idx = pre.index(t_f) # positional split (period labels may be non-numeric)
+ new_pre = list(pre[:idx])
+ new_post = list(pre[idx:]) # t_f is the first post-fake period
+ pre_set = set(new_pre)
+
+ kept_specs: List[_PredictorSpec] = []
+ keep_mask: List[int] = []
+ dropped: List[str] = []
+ for i, spec in enumerate(snap.specs):
+ # Intersect with the pre-fake window (preserves the spec's canonical order).
+ kept_periods = [p for p in spec.periods if p in pre_set]
+ if kept_periods:
+ # NEW spec object — never mutate the shared snapshot specs in place.
+ kept_specs.append(replace(spec, periods=kept_periods))
+ keep_mask.append(i)
+ else:
+ dropped.append(spec.label)
+
+ # Feasibility: >=2 pre-fake periods, >=1 post-fake period (to measure the placebo
+ # gap), and at least one surviving predictor. The >=2 pre-fake rule is DELIBERATELY
+ # stricter than the base estimator's T0>=1 allowance (which warns that single-pre
+ # nested-V selection is unreliable): an auto-swept placebo date with a single
+ # pre-fake period is a trivially-matchable, non-credible pre-fit, so it is dropped
+ # as infeasible rather than surfaced as a "ran" placebo. Documented as a Note in
+ # docs/methodology/REGISTRY.md §SyntheticControl.
+ if len(new_pre) < 2 or len(new_post) < 1 or not kept_specs:
+ return None, dropped
+
+ # Subset custom_v IN LOCKSTEP with the surviving specs (custom_v length must stay
+ # == k = len(specs); a desync silently misaligns V on the custom path). RAVEL first:
+ # fit() accepts array-like custom_v (e.g. a (1, k) row vector) and the snapshot keeps
+ # its original shape, so `[keep_mask]` would otherwise index the wrong axis of a 2D
+ # custom_v and raise (matches _placebo_fit_unit, which also ravels at use).
+ new_custom_v = (
+ None if snap.custom_v is None else np.asarray(snap.custom_v, dtype=float).ravel()[keep_mask]
+ )
+ # A custom V whose surviving entries sum to ~0 (all mass was on dropped specs)
+ # cannot fit — _placebo_fit_unit's `v / v.sum()` would be 0/0. This is the date
+ # being INFEASIBLE UNDER THE SUPPLIED custom_v, not a solver-convergence failure,
+ # so return None (→ status="infeasible") rather than letting the refit "fail".
+ if new_custom_v is not None and float(np.sum(new_custom_v)) <= 0.0:
+ return None, dropped
+ snap_mod = replace(
+ snap,
+ specs=kept_specs,
+ all_periods=new_pre + new_post,
+ pre_periods=new_pre,
+ post_periods=new_post,
+ custom_v=new_custom_v,
+ )
+ return snap_mod, dropped
diff --git a/diff_diff/synthetic_control_results.py b/diff_diff/synthetic_control_results.py
index 8e412057..6dfa10b5 100644
--- a/diff_diff/synthetic_control_results.py
+++ b/diff_diff/synthetic_control_results.py
@@ -48,6 +48,12 @@ class _SyntheticControlFitSnapshot:
pre_periods: List[Any]
post_periods: List[Any]
donor_ids: List[Any]
+ # The treated unit's reportably-weighted donor support (donor ids with weight above
+ # the 1e-6 interpretability floor), FROZEN at fit time and ordered by donor_ids.
+ # leave_one_out() iterates this immutable list — NOT the mutable, presentation-level
+ # results.donor_weights dict — so post-fit mutation cannot change which donors are
+ # dropped, and the robustness result depends only on the fit.
+ weighted_donor_ids: List[Any]
treated_id: Any
standardize: str
v_method: str
@@ -195,6 +201,39 @@ def __post_init__(self) -> None:
# "all_placebos_failed". A small string, so it survives pickling.
self._placebo_status: Optional[str] = None
+ # --- ADH 2015 §4 robustness diagnostics (opt-in, populated by ---
+ # --- leave_one_out() / in_time_placebo()). Same panel-vs-scalar split as ---
+ # --- the in-space placebo: the small per-row tables (_loo_df / _in_time_df), ---
+ # --- scalar summaries and status strings survive pickling; the per-refit ---
+ # --- gap-path dicts (_loo_gaps / _in_time_gaps) are panel-derived and nulled ---
+ # --- by __getstate__. analytical se/t/p/ci stay NaN throughout.
+ self._loo_df: Optional[pd.DataFrame] = None
+ self._loo_gaps: Optional[Dict[Any, Dict[Any, float]]] = None
+ # Reason a leave-one-out run was infeasible/absent. Values: None (not run),
+ # "ran", "treated_fit_nonconverged", "too_few_donors", "all_refits_failed".
+ self._loo_status: Optional[str] = None
+ # (min, max) ATT across the successful leave-one-out refits (the absolute
+ # spread of counterfactual ATTs); None until run.
+ self._loo_att_range: Optional[Tuple[float, float]] = None
+ # The headline single-donor-dependence number: max |att_loo - baseline_att|
+ # over the successful drops. Baseline-RELATIVE, so a uniform shift of every
+ # drop away from the baseline is NOT masked the way a narrow raw att_range
+ # would be. None until run.
+ self._loo_max_abs_delta_att: Optional[float] = None
+ self._loo_n_failed: int = 0
+ self._in_time_df: Optional[pd.DataFrame] = None
+ self._in_time_gaps: Optional[Dict[Any, Dict[Any, float]]] = None
+ # Reason an in-time placebo run was infeasible/absent. Values: None (not run),
+ # "ran", "treated_fit_nonconverged", "too_few_pre_periods",
+ # "all_dates_infeasible", "all_dates_failed", "all_dates_unusable" (a mix of
+ # failed + infeasible dates with none usable).
+ self._in_time_status: Optional[str] = None
+ self._in_time_n_failed: int = 0
+ # Number of placebo dates that were dimensionally infeasible (too few pre-fake
+ # periods, all predictors dropped, or a zero-mass surviving custom_v). Surfaced
+ # alongside _in_time_n_failed so a mixed no-success run reports an accurate mix.
+ self._in_time_n_infeasible: int = 0
+
def __getstate__(self) -> Dict[str, Any]:
"""Exclude panel-derived internal state from pickling.
@@ -209,6 +248,12 @@ def __getstate__(self) -> Dict[str, Any]:
state = self.__dict__.copy()
state["_fit_snapshot"] = None
state["_placebo_gaps"] = None
+ # ADH-2015 diagnostic gap paths are panel-derived (same hazard as
+ # _placebo_gaps); the small _loo_df / _in_time_df tables + scalar summaries
+ # survive so a round-tripped result still reports the diagnostic, but the
+ # overlay gap accessors raise (re-fit to recompute).
+ state["_loo_gaps"] = None
+ state["_in_time_gaps"] = None
return state
def __repr__(self) -> str:
@@ -727,3 +772,583 @@ def in_space_placebo(
self._placebo_status = "ran" if n_placebos > 0 else "all_placebos_failed"
self._placebo_df = pd.DataFrame(rows, columns=self._PLACEBO_COLS)
return self._placebo_df.copy()
+
+ _LOO_COLS = [
+ "dropped_unit",
+ "att",
+ "pre_rmspe",
+ "post_rmspe",
+ "rmspe_ratio",
+ "delta_att",
+ "status",
+ ]
+
+ def leave_one_out(self, n_starts: Optional[int] = None) -> pd.DataFrame:
+ """
+ Leave-one-out donor robustness (Abadie-Diamond-Hainmueller 2015, Section 4).
+
+ Drops each **reportably-weighted** donor, one at a time, and re-fits the
+ treated unit's synthetic control against the remaining donor pool. The
+ per-drop ATTs reveal whether the estimated effect is driven by any single
+ donor (ADH 2015 overlay the leave-one-out counterfactual trajectories for
+ this purpose; :meth:`get_leave_one_out_gaps` returns those paths). This is a
+ thin re-run of the validated SCM solver — it has **no analytical standard
+ error**; ``se``/``t_stat``/``p_value``/``conf_int`` and ``is_significant``
+ are unaffected (still bound to the NaN analytical ``p_value``).
+
+ The drop set is exactly the donors in ``donor_weights`` — those above the
+ ``1e-6`` interpretability floor (``synthetic_control._MIN_REPORT_WEIGHT``).
+ A donor with negligible weight ``0 < w ≤ 1e-6`` is excluded (its removal
+ moves the ATT by ~the weight, so its ``delta_att`` would be ~0 — an
+ uninformative row), keeping the LOO table aligned with the reported support;
+ a zero-weight donor's removal leaves the synthetic unchanged. (This `1e-6`
+ approximation of "positive weight" is documented in REGISTRY §SyntheticControl.)
+ A donor that carries ALL the weight is still dropped (the others absorb its
+ mass on re-fit); its large ``delta_att`` is exactly the single-donor-dependence
+ signal this diagnostic exists to surface, NOT a failure.
+
+ Parameters
+ ----------
+ n_starts : int, optional
+ Override the multistart count for each leave-one-out refit's nested V
+ search. Default None inherits the original fit's ``n_starts``.
+
+ Returns
+ -------
+ pandas.DataFrame
+ One ``status="baseline"`` row (the full fit, ``delta_att=0``) followed by
+ one row per dropped donor (``status="loo"``, or ``"failed"`` with NaN
+ metrics when its refit did not converge), sorted by ``|delta_att|``
+ descending (failed rows last). Columns: ``dropped_unit``, ``att``,
+ ``pre_rmspe``, ``post_rmspe``, ``rmspe_ratio``, ``delta_att``
+ (``att_loo - full_att``), ``status``.
+
+ Raises
+ ------
+ ValueError
+ If the fit snapshot is unavailable (e.g. this result was unpickled).
+ """
+ if self._fit_snapshot is None:
+ raise ValueError(
+ "leave_one_out() requires the fit snapshot on the results object. "
+ "This result appears to have been loaded from serialization (which "
+ "excludes the snapshot) or produced by an older estimator version. "
+ "Re-fit to enable leave-one-out donor robustness."
+ )
+ from diff_diff.synthetic_control import _mspe, _placebo_fit_unit
+
+ snap = self._fit_snapshot
+ if n_starts is None:
+ n_starts_eff = snap.n_starts
+ else:
+ # Mirror the estimator constructor's validation so a bad override fails
+ # fast instead of silently coercing into a degenerate refit (cf.
+ # in_space_placebo()).
+ if not isinstance(n_starts, (int, np.integer)) or n_starts < 1:
+ raise ValueError(f"n_starts override must be a positive integer, got {n_starts!r}")
+ n_starts_eff = int(n_starts)
+
+ # Baseline row: read DIRECTLY from the full fit (do NOT re-fit), so the
+ # reference ATT — and therefore delta_att=0.0 — is exact.
+ baseline_row = {
+ "dropped_unit": None,
+ "att": float(self.att),
+ "pre_rmspe": float(self.pre_rmspe),
+ "post_rmspe": float(np.sqrt(_mspe(self.gap_path, snap.post_periods))),
+ "rmspe_ratio": float(self.rmspe_ratio),
+ "delta_att": 0.0,
+ "status": "baseline",
+ }
+
+ # Fail closed when the treated unit's own fit did not converge: a truncated /
+ # under-optimized baseline ATT makes every leave-one-out delta meaningless.
+ if not self._fit_converged:
+ warnings.warn(
+ "Leave-one-out skipped: the treated unit's own SCM fit did not "
+ "converge at fit time (inner Frank-Wolfe weight solve and/or outer V "
+ "search), so the baseline ATT is not a valid optimum to compare "
+ "leave-one-out refits against. Re-fit with a larger inner_max_iter / "
+ "looser inner_min_decrease (inner) and/or a larger "
+ "optimizer_options['maxiter'] / more n_starts (outer V search).",
+ UserWarning,
+ stacklevel=2,
+ )
+ self._loo_status = "treated_fit_nonconverged"
+ self._loo_att_range = None
+ self._loo_n_failed = 0
+ self._loo_gaps = {}
+ self._loo_df = pd.DataFrame([baseline_row], columns=self._LOO_COLS)
+ return self._loo_df.copy()
+
+ # Dropping any donor requires at least one donor left in the pool.
+ if len(snap.donor_ids) < 2:
+ warnings.warn(
+ "Leave-one-out donor robustness requires at least 2 donors (dropping "
+ f"one must leave a non-empty pool); only {len(snap.donor_ids)} "
+ "available. Returning the baseline fit only.",
+ UserWarning,
+ stacklevel=2,
+ )
+ self._loo_status = "too_few_donors"
+ self._loo_att_range = None
+ self._loo_n_failed = 0
+ self._loo_gaps = {}
+ self._loo_df = pd.DataFrame([baseline_row], columns=self._LOO_COLS)
+ return self._loo_df.copy()
+
+ # Drop the FROZEN reportably-weighted support captured at fit time (donor ids
+ # with weight above the 1e-6 floor, in donor_ids order). Reading the snapshot —
+ # NOT the mutable presentation-level self.donor_weights — makes the result
+ # depend only on the fit and immune to post-fit mutation of donor_weights.
+ pos_donors = list(snap.weighted_donor_ids)
+ loo_gaps: Dict[Any, Dict[Any, float]] = {}
+ loo_rows: List[Dict[str, Any]] = []
+ atts: List[float] = []
+ n_failed = 0
+
+ for d in pos_donors:
+ pool = [x for x in snap.donor_ids if x != d]
+ fitted = _placebo_fit_unit(snap, snap.treated_id, pool, n_starts_eff)
+ if fitted is None:
+ n_failed += 1
+ loo_rows.append(
+ {
+ "dropped_unit": d,
+ "att": np.nan,
+ "pre_rmspe": np.nan,
+ "post_rmspe": np.nan,
+ "rmspe_ratio": np.nan,
+ "delta_att": np.nan,
+ "status": "failed",
+ }
+ )
+ continue
+ gap_path_d, ratio_d = fitted
+ loo_gaps[d] = gap_path_d
+ att_d = float(np.mean([gap_path_d[p] for p in snap.post_periods]))
+ atts.append(att_d)
+ loo_rows.append(
+ {
+ "dropped_unit": d,
+ "att": att_d,
+ "pre_rmspe": float(np.sqrt(_mspe(gap_path_d, snap.pre_periods))),
+ "post_rmspe": float(np.sqrt(_mspe(gap_path_d, snap.post_periods))),
+ "rmspe_ratio": ratio_d,
+ "delta_att": att_d - float(self.att),
+ "status": "loo",
+ }
+ )
+
+ # Sort successful drops by |delta_att| desc (most influential donor first);
+ # non-converged drops sort last.
+ finite_rows = sorted(
+ (r for r in loo_rows if r["status"] == "loo"),
+ key=lambda r: abs(r["delta_att"]),
+ reverse=True,
+ )
+ failed_rows = [r for r in loo_rows if r["status"] == "failed"]
+ ordered = [baseline_row] + finite_rows + failed_rows
+
+ if n_failed > 0:
+ warnings.warn(
+ f"{n_failed} of {len(pos_donors)} leave-one-out refits failed to "
+ "converge and are reported with NaN metrics (status='failed'); the "
+ "ATT range uses the remaining refits.",
+ UserWarning,
+ stacklevel=2,
+ )
+
+ self._loo_gaps = loo_gaps
+ self._loo_n_failed = int(n_failed)
+ self._loo_att_range = (min(atts), max(atts)) if atts else None
+ # Baseline-relative headline: the largest swing of any single donor-drop from
+ # the full-fit ATT (max |delta_att|). Robust to a uniform shift that a raw
+ # att_range would understate.
+ self._loo_max_abs_delta_att = max(abs(a - float(self.att)) for a in atts) if atts else None
+ # Distinguish a real run from "every donor-drop refit failed to converge"
+ # (no valid leave-one-out estimate produced) so DR/BR do not report an empty
+ # diagnostic as completed. (pos_donors empty — a converged fit always has >=1
+ # positive weight — falls through to "ran": baseline-only, benign.)
+ self._loo_status = "all_refits_failed" if (pos_donors and not atts) else "ran"
+ self._loo_df = pd.DataFrame(ordered, columns=self._LOO_COLS)
+ return self._loo_df.copy()
+
+ def get_leave_one_out_df(self) -> pd.DataFrame:
+ """
+ Get the leave-one-out donor-robustness table (see :meth:`leave_one_out`).
+
+ Survives pickling. Raises if :meth:`leave_one_out` has not been run.
+
+ Returns
+ -------
+ pandas.DataFrame
+ """
+ if self._loo_df is None:
+ raise ValueError("No leave-one-out results yet; call leave_one_out() first.")
+ return self._loo_df.copy()
+
+ def get_leave_one_out_gaps(self) -> pd.DataFrame:
+ """
+ Long-form leave-one-out gap paths, for the overlay ("spaghetti") plot.
+
+ One row per (dropped donor, period) for every converged leave-one-out refit.
+ Columns: ``dropped_unit``, ``period``, ``gap``, ``phase`` (``"pre"``/
+ ``"post"``) — mirroring :meth:`get_gap_df`. These per-period paths are
+ panel-derived and are NOT retained after pickling.
+
+ Returns
+ -------
+ pandas.DataFrame
+
+ Raises
+ ------
+ ValueError
+ If :meth:`leave_one_out` has not been run, or if the gap paths were
+ dropped on pickling (re-fit and re-run to recompute them).
+ """
+ if self._loo_df is None:
+ raise ValueError("No leave-one-out results yet; call leave_one_out() first.")
+ if self._loo_gaps is None:
+ raise ValueError(
+ "Leave-one-out gap paths are not retained after pickling "
+ "(panel-derived); re-run leave_one_out() on a freshly fitted result "
+ "to recompute them."
+ )
+ rows: List[Dict[str, Any]] = []
+ for unit, gap_path in self._loo_gaps.items():
+ for period in list(self.pre_periods) + list(self.post_periods):
+ if period in gap_path:
+ phase = "post" if period in self.post_periods else "pre"
+ rows.append(
+ {
+ "dropped_unit": unit,
+ "period": period,
+ "gap": gap_path[period],
+ "phase": phase,
+ }
+ )
+ return pd.DataFrame(rows, columns=["dropped_unit", "period", "gap", "phase"])
+
+ _IN_TIME_COLS = [
+ "placebo_period",
+ "placebo_att",
+ "pre_fit_rmspe",
+ "rmspe_ratio",
+ "n_pre_fake",
+ "n_post_fake",
+ "n_dropped_specs",
+ "status",
+ ]
+
+ def in_time_placebo(
+ self,
+ placebo_periods: Optional[Any] = None,
+ n_starts: Optional[int] = None,
+ ) -> pd.DataFrame:
+ """
+ In-time (backdating) placebo (Abadie-Diamond-Hainmueller 2015, Section 4).
+
+ Reassigns the intervention to an earlier pre-treatment date ``t_f`` and re-fits
+ the synthetic control using ONLY pre-``t_f`` information, then measures the
+ "effect" over the held-out window ``[t_f, T0)``. A credible synthetic control
+ should show **no spurious gap** there (ADH 2015 Figure 4, German reunification
+ backdated to 1975). This is a thin re-run of the validated SCM solver — it has
+ **no analytical standard error**; ``se``/``t_stat``/``p_value``/``conf_int`` and
+ ``is_significant`` are unaffected.
+
+ **Windowing convention (TRUNCATE).** The placebo fit uses only periods strictly
+ before ``t_f``: pre-period-outcome predictors become the pre-``t_f`` outcomes,
+ and covariate / special predictor windows are intersected with the pre-``t_f``
+ window. A predictor window lying ENTIRELY in the held-out region ``[t_f, T0)``
+ is dropped (surfaced in ``n_dropped_specs`` + an aggregated warning). For
+ outcome-predictor fits this equals the literal "lag the predictors" re-run of a
+ manual ``Synth::synth`` (R has no in-time-placebo function); see
+ ``docs/methodology/REGISTRY.md`` for the recognized deviation note.
+
+ Parameters
+ ----------
+ placebo_periods : period value or list of period values, optional
+ The pseudo-intervention date(s), each a member of ``pre_periods``. Default
+ None sweeps every feasible interior pre-date (at least 2 pre-fake periods to
+ fit + at least 1 post-fake period to measure the gap). A date that is a true
+ post-treatment period, or not a pre-period at all, raises ``ValueError``; a
+ valid pre-date that is dimensionally infeasible (too few pre-fake periods, or
+ all predictors dropped) yields a ``status="infeasible"`` row (no raise).
+ n_starts : int, optional
+ Override the multistart count for each placebo refit's nested V search.
+ Default None inherits the original fit's ``n_starts``.
+
+ Returns
+ -------
+ pandas.DataFrame
+ One row per placebo date. Columns: ``placebo_period``, ``placebo_att`` (mean
+ gap over the held-out window — should be ~0 if no real pre-period effect),
+ ``pre_fit_rmspe``, ``rmspe_ratio`` (post-fake/pre-fake), ``n_pre_fake``,
+ ``n_post_fake``, ``n_dropped_specs``, ``status`` (``"ran"`` / ``"infeasible"``
+ / ``"failed"``).
+
+ Raises
+ ------
+ ValueError
+ If the fit snapshot is unavailable (e.g. this result was unpickled), or an
+ explicit ``placebo_periods`` entry is a post-treatment period / not a
+ pre-period.
+ """
+ if self._fit_snapshot is None:
+ raise ValueError(
+ "in_time_placebo() requires the fit snapshot on the results object. "
+ "This result appears to have been loaded from serialization (which "
+ "excludes the snapshot) or produced by an older estimator version. "
+ "Re-fit to enable the in-time placebo."
+ )
+ from diff_diff.synthetic_control import (
+ _mspe,
+ _placebo_fit_unit,
+ _truncate_snapshot_in_time,
+ )
+
+ snap = self._fit_snapshot
+ if n_starts is None:
+ n_starts_eff = snap.n_starts
+ else:
+ if not isinstance(n_starts, (int, np.integer)) or n_starts < 1:
+ raise ValueError(f"n_starts override must be a positive integer, got {n_starts!r}")
+ n_starts_eff = int(n_starts)
+
+ pre = list(snap.pre_periods)
+ empty = pd.DataFrame([], columns=self._IN_TIME_COLS)
+
+ # Fail closed when the treated unit's own fit did not converge: a truncated /
+ # under-optimized baseline makes the placebo comparison meaningless.
+ if not self._fit_converged:
+ warnings.warn(
+ "In-time placebo skipped: the treated unit's own SCM fit did not "
+ "converge at fit time (inner Frank-Wolfe weight solve and/or outer V "
+ "search). Re-fit with a larger inner_max_iter / looser "
+ "inner_min_decrease (inner) and/or a larger optimizer_options['maxiter'] "
+ "/ more n_starts (outer V search).",
+ UserWarning,
+ stacklevel=2,
+ )
+ self._in_time_status = "treated_fit_nonconverged"
+ self._in_time_n_failed = 0
+ self._in_time_gaps = {}
+ self._in_time_df = empty
+ return empty.copy()
+
+ # A feasible date needs >=2 pre-fake + >=1 post-fake period -> >=3 pre periods.
+ # The >=2 pre-fake rule is a deliberate Note-documented restriction (an auto-
+ # swept single-pre-fake placebo is a non-credible pre-fit; see REGISTRY).
+ if len(pre) < 3:
+ warnings.warn(
+ "In-time placebo requires at least 3 pre-treatment periods (a feasible "
+ "placebo date needs >=2 pre-fake periods to fit and >=1 post-fake period "
+ f"to measure the gap); only {len(pre)} available.",
+ UserWarning,
+ stacklevel=2,
+ )
+ self._in_time_status = "too_few_pre_periods"
+ self._in_time_n_failed = 0
+ self._in_time_gaps = {}
+ self._in_time_df = empty
+ return empty.copy()
+
+ if placebo_periods is None:
+ # Sweep every feasible pre-date (positional: idx>=2 gives >=2 pre-fake +
+ # >=1 post-fake; idx<2 would leave fewer than 2 pre-fake periods).
+ dates: List[Any] = [pre[i] for i in range(2, len(pre))]
+ else:
+ if isinstance(placebo_periods, (list, tuple, set, np.ndarray, pd.Index, pd.Series)):
+ dates = list(placebo_periods)
+ else:
+ dates = [placebo_periods]
+ # An explicit but EMPTY container is a malformed request (NOT "every date
+ # was infeasible") — fail fast, consistent with the post-date / non-pre
+ # date raises below. Pass None to sweep all feasible pre-dates.
+ if not dates:
+ raise ValueError(
+ "placebo_periods is empty; pass None to sweep all feasible "
+ "pre-dates, or a non-empty list of pre-period date(s)."
+ )
+ pre_set = set(pre)
+ post_set = set(snap.post_periods)
+ for d in dates:
+ if d in post_set:
+ raise ValueError(
+ f"placebo_period {d!r} is a true post-treatment period; an "
+ "in-time placebo date must lie in the pre-treatment window."
+ )
+ if d not in pre_set:
+ raise ValueError(
+ f"placebo_period {d!r} is not a pre-treatment period "
+ f"(pre_periods = {pre})."
+ )
+ # De-duplicate + canonicalize to pre-period order (mirrors _resolve_periods):
+ # duplicate / unordered explicit dates must not trigger duplicate refits or
+ # inflate n_dates.
+ _requested = set(dates)
+ dates = [p for p in pre if p in _requested]
+
+ in_time_gaps: Dict[Any, Dict[Any, float]] = {}
+ rows: List[Dict[str, Any]] = []
+ dropped_all: set = set()
+ n_failed = 0
+ n_infeasible = 0
+ n_ran = 0
+
+ for t_f in dates:
+ idx = pre.index(t_f)
+ n_pre_fake = idx
+ n_post_fake = len(pre) - idx
+ snap_mod, dropped = _truncate_snapshot_in_time(snap, t_f)
+ dropped_all.update(dropped)
+ if snap_mod is None:
+ n_infeasible += 1
+ rows.append(
+ {
+ "placebo_period": t_f,
+ "placebo_att": np.nan,
+ "pre_fit_rmspe": np.nan,
+ "rmspe_ratio": np.nan,
+ "n_pre_fake": n_pre_fake,
+ "n_post_fake": n_post_fake,
+ "n_dropped_specs": len(dropped),
+ "status": "infeasible",
+ }
+ )
+ continue
+ fitted = _placebo_fit_unit(snap_mod, snap.treated_id, snap.donor_ids, n_starts_eff)
+ if fitted is None:
+ n_failed += 1
+ rows.append(
+ {
+ "placebo_period": t_f,
+ "placebo_att": np.nan,
+ "pre_fit_rmspe": np.nan,
+ "rmspe_ratio": np.nan,
+ "n_pre_fake": n_pre_fake,
+ "n_post_fake": n_post_fake,
+ "n_dropped_specs": len(dropped),
+ "status": "failed",
+ }
+ )
+ continue
+ gap_path, ratio = fitted
+ in_time_gaps[t_f] = gap_path
+ placebo_att = float(np.mean([gap_path[p] for p in snap_mod.post_periods]))
+ rows.append(
+ {
+ "placebo_period": t_f,
+ "placebo_att": placebo_att,
+ "pre_fit_rmspe": float(np.sqrt(_mspe(gap_path, snap_mod.pre_periods))),
+ "rmspe_ratio": ratio,
+ "n_pre_fake": n_pre_fake,
+ "n_post_fake": n_post_fake,
+ "n_dropped_specs": len(dropped),
+ "status": "ran",
+ }
+ )
+ n_ran += 1
+
+ if dropped_all:
+ warnings.warn(
+ "In-time placebo (TRUNCATE convention): predictor(s) "
+ f"{sorted(map(str, dropped_all))} fell entirely in the held-out "
+ "post-fake window for some placebo date(s) and were dropped from those "
+ "refits (see the n_dropped_specs column).",
+ UserWarning,
+ stacklevel=2,
+ )
+ if n_infeasible > 0:
+ warnings.warn(
+ f"{n_infeasible} in-time placebo date(s) were dimensionally infeasible "
+ "(too few pre-fake periods or all predictors dropped) and are reported "
+ "with status='infeasible' (NaN metrics).",
+ UserWarning,
+ stacklevel=2,
+ )
+ if n_failed > 0:
+ warnings.warn(
+ f"{n_failed} in-time placebo refit(s) failed to converge and are "
+ "reported with status='failed' (NaN metrics).",
+ UserWarning,
+ stacklevel=2,
+ )
+
+ self._in_time_gaps = in_time_gaps
+ self._in_time_n_failed = int(n_failed)
+ self._in_time_n_infeasible = int(n_infeasible)
+ # When no date ran, classify the cause precisely so the downstream reason text
+ # is never false: a pure convergence failure ("all_dates_failed", actionable —
+ # raise n_starts / loosen tolerances) and pure dimensional infeasibility
+ # ("all_dates_infeasible", structural) are distinct; a MIX of both gets its own
+ # "all_dates_unusable" code (both counters are surfaced) rather than being
+ # mislabeled as exclusively one or the other.
+ if n_ran > 0:
+ self._in_time_status = "ran"
+ elif n_failed > 0 and n_infeasible > 0:
+ self._in_time_status = "all_dates_unusable"
+ elif n_failed > 0:
+ self._in_time_status = "all_dates_failed"
+ else:
+ self._in_time_status = "all_dates_infeasible"
+ self._in_time_df = pd.DataFrame(rows, columns=self._IN_TIME_COLS)
+ return self._in_time_df.copy()
+
+ def get_in_time_placebo_df(self) -> pd.DataFrame:
+ """
+ Get the in-time placebo table (see :meth:`in_time_placebo`).
+
+ Survives pickling. Raises if :meth:`in_time_placebo` has not been run.
+
+ Returns
+ -------
+ pandas.DataFrame
+ """
+ if self._in_time_df is None:
+ raise ValueError("No in-time placebo results yet; call in_time_placebo() first.")
+ return self._in_time_df.copy()
+
+ def get_in_time_placebo_gaps(self) -> pd.DataFrame:
+ """
+ Long-form in-time placebo gap paths, for the backdating overlay plot.
+
+ One row per (placebo date, period) for every converged in-time refit. Columns:
+ ``placebo_period``, ``period``, ``gap``, ``phase`` (``"pre_fake"`` for periods
+ before the placebo date, ``"post_fake"`` for the held-out window from it on).
+ These per-period paths are panel-derived and are NOT retained after pickling.
+
+ Returns
+ -------
+ pandas.DataFrame
+
+ Raises
+ ------
+ ValueError
+ If :meth:`in_time_placebo` has not been run, or if the gap paths were
+ dropped on pickling (re-fit and re-run to recompute them).
+ """
+ if self._in_time_df is None:
+ raise ValueError("No in-time placebo results yet; call in_time_placebo() first.")
+ if self._in_time_gaps is None:
+ raise ValueError(
+ "In-time placebo gap paths are not retained after pickling "
+ "(panel-derived); re-run in_time_placebo() on a freshly fitted result "
+ "to recompute them."
+ )
+ pre = list(self.pre_periods)
+ rows: List[Dict[str, Any]] = []
+ for t_f, gap_path in self._in_time_gaps.items():
+ split = pre.index(t_f)
+ for period in pre:
+ if period in gap_path:
+ phase = "post_fake" if pre.index(period) >= split else "pre_fake"
+ rows.append(
+ {
+ "placebo_period": t_f,
+ "period": period,
+ "gap": gap_path[period],
+ "phase": phase,
+ }
+ )
+ return pd.DataFrame(rows, columns=["placebo_period", "period", "gap", "phase"])
diff --git a/docs/api/synthetic_control.rst b/docs/api/synthetic_control.rst
index 506c7422..b727315d 100644
--- a/docs/api/synthetic_control.rst
+++ b/docs/api/synthetic_control.rst
@@ -25,6 +25,15 @@ reported estimate. Significance comes from **in-space placebo permutation infere
``placebo_p_value = rank/(n_placebos+1)``). This permutation p-value is a separate field
from the (NaN) ``p_value``; ``is_significant`` stays bound to ``p_value``.
+**Robustness diagnostics (ADH 2015 §4, opt-in):**
+:meth:`~diff_diff.SyntheticControlResults.leave_one_out` drops each reportably-weighted (weight > 1e-6)
+donor and re-fits (per-drop ATT / ``delta_att`` table — a large ``delta_att`` flags
+single-donor dependence), and
+:meth:`~diff_diff.SyntheticControlResults.in_time_placebo` reassigns the intervention to an
+earlier pre-date and checks for a spurious gap before the true treatment date (the
+backdating placebo; ``placebo_att`` should be ~0). Both re-run the validated solver and
+leave the analytical inference fields NaN.
+
**Distinct from** :class:`~diff_diff.SyntheticDiD` (Arkhangelsky et al. 2021), which adds
time weights and ridge regularization; classic SCM uses **donor weights only** plus the
outer ``V`` search.
@@ -71,6 +80,12 @@ Results container for synthetic control estimation.
~SyntheticControlResults.in_space_placebo
~SyntheticControlResults.get_placebo_df
+ ~SyntheticControlResults.leave_one_out
+ ~SyntheticControlResults.get_leave_one_out_df
+ ~SyntheticControlResults.get_leave_one_out_gaps
+ ~SyntheticControlResults.in_time_placebo
+ ~SyntheticControlResults.get_in_time_placebo_df
+ ~SyntheticControlResults.get_in_time_placebo_gaps
~SyntheticControlResults.summary
~SyntheticControlResults.print_summary
~SyntheticControlResults.to_dict
diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md
index 38c8022f..f33f263b 100644
--- a/docs/methodology/REGISTRY.md
+++ b/docs/methodology/REGISTRY.md
@@ -1988,6 +1988,10 @@ Classic synthetic control (donor/unit weights only) for a single treated unit, d
**Inference:** **No analytical standard error** (Section 2.4) — `se`/`t_stat`/`p_value`/`conf_int` are always NaN. Significance comes from **in-space placebo permutation inference** via `SyntheticControlResults.in_space_placebo()`: reassign treatment to each donor, refit a synthetic control for it, and rank the treated unit's post/pre RMSPE ratio (`rmspe_ratio` = `RMSPE_post / RMSPE_pre` = `sqrt(MSPE_post / MSPE_pre)`) among all units; `placebo_p_value = rank / (n_placebos + 1)`, where `rank = 1 + #{placebos with ratio ≥ treated ratio}` — an **upper-tail rank test on the (unsigned) RMSPE-ratio statistic**, ties counted conservatively via `≥`. Because the ratio squares the gaps it is direction-agnostic: a large ratio signals an effect of *either* sign, so this is NOT a signed/one-directional ("one-sided") hypothesis test on the treatment-effect direction. ADH 2010 §3.4 reports the *MSPE* ratio (the square of `rmspe_ratio`); the two are monotone-equivalent, so the rank and p-value are identical — only the reported statistic's scale differs. `rmspe_ratio` (the treated statistic) is computed at fit time; `placebo_p_value` / `n_placebos` / `n_failed` are populated by the opt-in `in_space_placebo()` call. `get_placebo_df()` returns the per-unit ratio table used for the rank (the per-period placebo gap paths for a "spaghetti" plot are retained internally, not in this summary).
+**ADH-2015 §4 robustness diagnostics (opt-in):** Two further diagnostics from Abadie-Diamond-Hainmueller (2015, §4), each a thin re-run of the validated solver — populated only when called and surfaced under `estimator_native_diagnostics` (the analytical inference contract is unchanged; `se`/`t_stat`/`p_value`/`conf_int`/`is_significant` stay bound to the NaN analytical `p_value`):
+- **Leave-one-out donor robustness** (`leave_one_out()`): drops each **reportably-weighted** donor and re-fits the treated unit against the reduced pool, returning a per-drop ATT / `delta_att` table (a `status="baseline"` row first, then one row per dropped donor sorted by `|delta_att|`). A large `delta_att` flags single-donor dependence (a single *dominant* donor is still dropped — the others absorb its mass — and its large `delta_att` is the intended signal, not a failure). The reporting stack's headline donor-sensitivity number is `max_abs_delta_att` = `max |delta_att|` over the drops (baseline-relative, so a uniform shift of every drop away from the full-fit ATT is not masked the way a raw ATT range would be). `get_leave_one_out_gaps()` returns the per-drop trajectories for the overlay plot. Fails closed on a non-converged treated fit or `< 2` donors.
+- **In-time (backdating) placebo** (`in_time_placebo()`): reassigns the intervention to an earlier pre-date `t_f`, re-fits using ONLY pre-`t_f` information (TRUNCATE convention — see Note), and reports the placebo "effect" over the held-out window `[t_f, T0)` — ~0 if there is no real pre-period effect (ADH 2015 Fig. 4, German reunification backdated to 1975). Sweeps every feasible interior pre-date by default (≥2 pre-fake + ≥1 post-fake); an explicit post-period / non-pre date raises, a valid-but-dimensionally-infeasible date yields a `status="infeasible"` row (no raise).
+
**Notes / deviations:**
- **Note:** The standardization divisor `divisor = sqrt(apply(cbind(X0,X1), 1, var))` (per-predictor SD over donors+treated, ddof=1) and the inner/outer optimizer are **not specified in ADH 2010** (which defers these numerics to Abadie & Gardeazabal 2003 App. B / the `Synth` software). The divisor is pinned from the R `Synth::synth` source; `solution.v` lives in this scaled predictor space, so the deterministic R-parity test feeds `custom_v` in the same scaled space.
- **Note:** The outer objective minimizes the pre-period outcome MSPE over **all** pre periods, whereas R `Synth` uses a `time.optimize.ssr` window (1960–1969 in the Basque example). The nested `V` therefore differs from R by an efficiency-only choice (the paper notes inferential validity holds for *any* `V`), so end-to-end nested parity is a tolerance band, not equality.
@@ -2000,6 +2004,10 @@ Classic synthetic control (donor/unit weights only) for a single treated unit, d
- **Note (placebo failure handling):** a placebo is **excluded from both the numerator and the denominator** of the rank (never penalized into it) and tallied in `n_failed` when its fit is not a valid optimum — EITHER its **inner Frank-Wolfe weight solve** did not converge (a truncated `W` is unusable) OR its **outer `V` search** did not converge (an under-optimized `V` fits the pre-period worse, shrinking the RMSPE ratio and biasing the p-value anti-conservatively, so it must not silently enter the rank). The reported p-value uses the **effective** count `rank / (n_placebos + 1)`, where `n_placebos` is the number of placebos that entered the reference set. Failed donors still appear in `get_placebo_df()` (`status="failed"`, NaN metrics), so once a reference set is produced the table is the full treated + every-donor unit set (`n_donors + 1` rows). In the fail-closed cases the placebo loop does not run and only the treated row is returned: `J < 2` → `placebo_p_value` is NaN with a warning (no placebo distribution; `J == 2` warns the distribution is coarse), and a treated fit whose own **inner OR outer** search did not converge also fails closed (ranking a truncated / under-optimized treated statistic would not be a valid permutation). **Caveat:** each placebo refit inherits the original fit's `optimizer_options` / `n_starts`, so valid inference requires settings adequate for the outer `V` search to converge to a comparable-quality synthetic (production defaults do; cheap settings under-optimize placebo `V` and those placebos are dropped as failed — raise `n_starts` on `in_space_placebo()` or re-fit with a larger `optimizer_options['maxiter']`).
- **Note (RMSPE-ratio floor):** the reported `rmspe_ratio = sqrt(MSPE_post / MSPE_pre)` floors the pre-period MSPE denominator at a scale-aware `1e-8 · max(|pre-outcomes|, 1)²` (before the square root) so a (near-)perfect pre-fit (`pre-MSPE → 0`) yields a large-but-FINITE ratio rather than `inf`/`nan` (which would corrupt the rank). Ties (`ratio_j ≥ treated_ratio`) are counted, making the p-value conservative. Mirrors the `_fit_tol` poor-fit guard.
- **Note (placebo p-value is non-analytical):** `placebo_p_value` is deliberately a SEPARATE field from `p_value` (which stays NaN) — it is a permutation p-value with no SE / t-stat, so it does not flow through `safe_inference`. `is_significant` likewise stays bound to the (NaN) `p_value`, NOT `placebo_p_value`; a tool gating on `is_significant` will see `False` even when `placebo_p_value` is small. The reporting stack surfaces the placebo p-value through `estimator_native_diagnostics`, never the analytical headline.
+- **Note (in-time placebo windowing — TRUNCATE):** ADH 2015 §4 says to re-estimate the in-time placebo "with the same predictors lagged accordingly." Because `diff_diff`'s predictor specs reference **absolute** periods, the in-time placebo re-cuts them by TRUNCATION: pre-period-outcome predictors become the pre-`t_f` outcomes, and covariate / special-predictor windows are intersected with the pre-`t_f` window; a window lying ENTIRELY in the held-out region `[t_f, T0)` is **dropped** (surfaced in the `n_dropped_specs` column + an aggregated warning), and `custom_v` is subset in lockstep with the surviving specs. For an outcome-predictor fit (the R-anchorable case) TRUNCATE is identical to ADH's "lag" — both equal a manual `Synth::synth` re-run with `time.optimize.ssr` cut at `t_f`. The held-out window never enters the fit (the placebo's `all_periods` is the pre-fake + post-fake span; the true post-treatment periods are excluded entirely), so there is no "peeking." This concrete convention is NOT spelled out in ADH 2015 (which gives only the qualitative "lag accordingly").
+- **Note (in-time placebo requires ≥2 pre-fake periods):** the in-time placebo treats a date with fewer than 2 pre-fake periods as `status="infeasible"` (the default sweep starts at the 3rd pre-period). This is DELIBERATELY stricter than the base estimator's `T0 ≥ 1` allowance (which permits a single-pre-period fit but warns that nested-`V` selection is unreliable): an auto-swept placebo date with a single pre-fake period is a trivially-matchable, non-credible pre-fit, so it is dropped rather than surfaced as a `ran` placebo (mirrors `SyntheticDiD.in_time_placebo`'s `i ≥ 2` rule). A date whose surviving `custom_v` has zero mass after truncation is likewise infeasible (not a convergence failure).
+- **Note (leave-one-out weight floor):** ADH 2015 §4 leave-one-out omits "each donor that received positive weight." This implementation drops each donor with **reportable** weight — above the `1e-6` interpretability floor (`synthetic_control._MIN_REPORT_WEIGHT`), i.e. exactly the donors in `donor_weights` — rather than every strictly-positive weight. A donor with `0 < w ≤ 1e-6` is numerical dust whose removal moves the ATT by ~its weight (its `delta_att` would be ~0, an uninformative row), and the floor keeps the LOO table aligned with the reported donor support. The drop-set is **frozen at fit time** on the fit snapshot (`weighted_donor_ids`), so `leave_one_out()` is immune to post-fit mutation of the presentation-level `donor_weights` dict.
+- **Note (ADH-2015 diagnostics validation):** R `Synth` has **no** in-time-placebo or leave-one-out function (verified against its full CRAN function index; `SCtools` adds only the *in-space* placebo battery, `scpi` only prediction-interval uncertainty), so there is no canonical R *output* to match for these diagnostics — in R they are hand-rolled by re-running `dataprep()`+`synth()`. They are validated instead by (a) the solver's existing Basque R parity (above), and (b) deterministic **self-consistency** tests proving each diagnostic equals a from-scratch `synthetic_control()` fit on the equivalent sub-problem — `leave_one_out()` drop-`d` == a fit on the donor pool minus `d`; `in_time_placebo([t_f])` == a fit on the backdated/truncated panel — both via a fixed `custom_v` (match to 1e-7). The deferred ADH-2015 items (out-of-sample CV `V`-selection, regression-weight `W^reg` extrapolation diagnostic, sparse-SC subset search) are tracked in `TODO.md`.
**Reference implementation:** authors' `Synth` package for R/MATLAB/Stata (`Synth::synth`); in-space placebo construction follows `SCtools::generate.placebos`. **R-parity anchor:** the Basque Country study (Abadie-Gardeazabal 2003, `data("basque")`) — published synthetic = region 10 (Cataluña) 0.851 + region 14 (Madrid) 0.149, `loss.v` 0.0089. Two-tier test (`tests/test_methodology_synthetic_control.py`): Tier-1 feeds R's `solution.v` via `custom_v` → donor weights match to atol 1e-3 (deterministic); Tier-2 checks the nested fit in a band.
@@ -2010,6 +2018,9 @@ Classic synthetic control (donor/unit weights only) for a single treated unit, d
- [x] Outer nested `V` (pre-period outcome MSPE) + user-supplied `custom_v`.
- [x] Gap path + pre-period RMSPE + predictor-balance table.
- [x] No analytical SE (NaN inference); in-space placebo permutation inference (`in_space_placebo()`, `rank/(n_placebos+1)`) with the real treated unit excluded from every placebo pool, effective-count denominator, and a scale-aware RMSPE-ratio floor.
+- [x] Leave-one-out donor robustness (`leave_one_out()`, ADH 2015 §4): per-drop ATT / `delta_att` table + overlay gaps; fail-closed.
+- [x] In-time (backdating) placebo (`in_time_placebo()`, ADH 2015 §4): TRUNCATE windowing (drop held-out-window predictors + lockstep `custom_v` subset), feasible-date sweep, fail-closed.
+- [ ] *Deferred (ADH 2015):* out-of-sample CV `V`-selection, regression-weight `W^reg` extrapolation diagnostic, sparse-SC subset search (see `TODO.md`).
- [x] Predictor-leakage, absorbing-suffix/no-anticipation, empty-window, duplicate-label, and inner-non-convergence validation gates.
---
diff --git a/docs/methodology/REPORTING.md b/docs/methodology/REPORTING.md
index 58058312..17af4cd9 100644
--- a/docs/methodology/REPORTING.md
+++ b/docs/methodology/REPORTING.md
@@ -266,11 +266,13 @@ a library setting.
under `estimator_native_diagnostics`. `SyntheticControlResults`
routes parallel-trends to the `scm_fit` analogue (`pre_rmspe`,
verdict `design_enforced_pt`) and surfaces `pre_rmspe`, donor-weight
- concentration, and the in-space placebo permutation p-value under
- `estimator_native_diagnostics` — the placebo block is populated only
- when the caller has already run `in_space_placebo()` (opt-in; DR never
- triggers the per-donor refit loop implicitly), and it omits
- HonestDiD-style `sensitivity` (significance IS the placebo).
+ concentration, the in-space placebo permutation p-value, and the
+ ADH-2015 leave-one-out (`leave_one_out`) and in-time placebo
+ (`in_time_placebo`) blocks under `estimator_native_diagnostics` — each
+ is populated only when the caller has already run the corresponding
+ opt-in method (DR never triggers a refit loop implicitly; otherwise a
+ `status="not_run"` stub), and it omits HonestDiD-style `sensitivity`
+ (significance IS the placebo).
`EfficientDiDResults` PT runs through `EfficientDiD.hausman_pretest`
(the estimator's native PT-All vs PT-Post check).
diff --git a/tests/data/synth_basque_golden.json b/tests/data/synth_basque_golden.json
index c9694aa1..ef52a8b4 100644
--- a/tests/data/synth_basque_golden.json
+++ b/tests/data/synth_basque_golden.json
@@ -332,5 +332,27 @@
"years": [1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997],
"treated_path": [3.853184630005, 3.945658296151, 4.033561734873, 4.023421896897, 4.013781968405, 4.285918396223, 4.574336095797, 4.898957353563, 5.197014981629, 5.338902978753, 5.465153005252, 5.545915627064, 5.614895726639, 5.852184933072, 6.08140541737, 6.17009424135, 6.283633404546, 6.555555398653, 6.810768561103, 7.105184302811, 7.377891682176, 7.232933621923, 7.089831372119, 6.786703607145, 6.639817386857, 6.56283917137, 6.500785454993, 6.545058607, 6.595329801139, 6.761496750091, 6.937160671728, 7.332191151301, 7.742788123594, 8.120536640759, 8.509711162324, 8.776777889074, 9.025278666196, 8.873892824706, 8.718223539089, 9.018137849286, 9.440873861653, 9.686518137675, 10.170665872809],
"synthetic_path": [3.702941626236, 3.85396941359, 3.996388245904, 4.029401685261, 4.059672772757, 4.378924693221, 4.733052176452, 4.987634927316, 5.222027456975, 5.298503665995, 5.362138883475, 5.448530450656, 5.523111080498, 5.760711870916, 5.993100448641, 6.137916329668, 6.294338685153, 6.620780377198, 6.933000322789, 7.087051801242, 7.228035388368, 7.220668043084, 7.211129382309, 7.07464338848, 7.057308667347, 7.129300695567, 7.234425280207, 7.325333530431, 7.421847797791, 7.516347610321, 7.610133692242, 8.117951258459, 8.623595409787, 9.086804616765, 9.545547512504, 9.788247373834, 10.037692891258, 9.838212341386, 9.639052163101, 9.987880532374, 10.303885350264, 10.538434998253, 10.998744701142],
- "gap": [0.1502430037689, 0.09168888256074, 0.03717348896852, -0.005979788364376, -0.04589080435155, -0.09300629699853, -0.1587160806545, -0.08867757375288, -0.02501247534587, 0.0403993127577, 0.1030141217766, 0.09738517640774, 0.0917846461413, 0.09147306215556, 0.08830496872896, 0.03217791168141, -0.01070528060644, -0.06522497854548, -0.1222317616855, 0.01813250156919, 0.1498562938071, 0.01226557883842, -0.1212980101894, -0.2879397813353, -0.4174912804894, -0.5664615241969, -0.7336398252139, -0.7802749234314, -0.8265179966512, -0.7548508602291, -0.6729730205141, -0.7857601071586, -0.8808072861931, -0.9662679760057, -1.035836350179, -1.01146948476, -1.012414225062, -0.96431951668, -0.9208286240117, -0.969742683088, -0.863011488611, -0.851916860578, -0.8280788283333]
+ "gap": [0.1502430037689, 0.09168888256074, 0.03717348896852, -0.005979788364376, -0.04589080435155, -0.09300629699853, -0.1587160806545, -0.08867757375288, -0.02501247534587, 0.0403993127577, 0.1030141217766, 0.09738517640774, 0.0917846461413, 0.09147306215556, 0.08830496872896, 0.03217791168141, -0.01070528060644, -0.06522497854548, -0.1222317616855, 0.01813250156919, 0.1498562938071, 0.01226557883842, -0.1212980101894, -0.2879397813353, -0.4174912804894, -0.5664615241969, -0.7336398252139, -0.7802749234314, -0.8265179966512, -0.7548508602291, -0.6729730205141, -0.7857601071586, -0.8808072861931, -0.9662679760057, -1.035836350179, -1.01146948476, -1.012414225062, -0.96431951668, -0.9208286240117, -0.969742683088, -0.863011488611, -0.851916860578, -0.8280788283333],
+ "leave_one_out": {
+ "dropped_regionno": 10,
+ "solution_w": {
+ "2": 9.808425765417e-09,
+ "3": 2.549786470331e-08,
+ "4": 1.480664733111e-06,
+ "5": 1.139816545885e-07,
+ "6": 1.578269127035e-08,
+ "7": 0.7023493162417,
+ "8": 1.197763079811e-08,
+ "9": 1.711227888294e-08,
+ "11": 4.911054589331e-07,
+ "12": 6.004338842587e-09,
+ "13": 1.948545433164e-08,
+ "14": 0.2976484098422,
+ "15": 1.699497289433e-08,
+ "16": 3.294816157091e-08,
+ "18": 3.255185607848e-08
+ },
+ "att": 0.6138519220436,
+ "gap": [0.6880403421591, 0.628878774828, 0.5753339248836, 0.54101400783, 0.5037249840412, 0.546426742406, 0.5606408216439, 0.6633437692491, 0.7549698385142, 0.8268369239666, 0.8905574520592, 0.8951691657033, 0.8929960283248, 0.9175321635515, 0.9274417393066, 0.8572191113726, 0.8064665804174, 0.8252573922895, 0.8361572960057, 0.9576844174138, 1.065360164047, 0.8680617841655, 0.6817614133823, 0.431542742956, 0.3121824139876, 0.2118891193835, 0.1045310836943, 0.1442119442113, 0.172403844142, 0.2354616807978, 0.2925811556038, 0.3381223905461, 0.3936699900311, 0.4330380719963, 0.4842571460249, 0.6248622785575, 0.7242896438824, 0.7039633742619, 0.6787159280904, 0.7734776694026, 0.9728259048368, 1.059949514994, 1.197909760725]
+ }
}
diff --git a/tests/test_business_report.py b/tests/test_business_report.py
index d5668e95..80b04fdf 100644
--- a/tests/test_business_report.py
+++ b/tests/test_business_report.py
@@ -825,6 +825,19 @@ def test_scm_robustness_block_surfaces_native_fields(self, scm_fit):
assert "weight_concentration" in native
assert native["in_space_placebo"]["n_placebos"] == res.n_placebos
+ def test_scm_robustness_block_surfaces_adh2015_diagnostics(self, scm_fit):
+ # After running the opt-in ADH-2015 diagnostics, the robustness block must
+ # carry their native sub-blocks (lifted from estimator_native_diagnostics).
+ res, _ = scm_fit
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ res.leave_one_out()
+ res.in_time_placebo()
+ rob = BusinessReport(res, auto_diagnostics=True).to_dict()["robustness"]
+ native = rob["estimator_native"]
+ assert native["leave_one_out"]["status"] == "ran"
+ assert native["in_time_placebo"]["status"] == "ran"
+
def test_staggered_triple_diff_assumption_uses_ddd_not_generic_pt(self):
class StaggeredTripleDiffResults:
pass
diff --git a/tests/test_diagnostic_report.py b/tests/test_diagnostic_report.py
index 66e594a5..a0a78fcd 100644
--- a/tests/test_diagnostic_report.py
+++ b/tests/test_diagnostic_report.py
@@ -2077,8 +2077,10 @@ def test_scm_native_section_populated(self, scm_fit):
assert "herfindahl" in native["weight_concentration"]
# Placebo is opt-in: NOT auto-run inside the report.
assert native["in_space_placebo"]["status"] == "not_run"
- # In-time placebo / leave-one-out are ADH 2015 (not implemented here).
- assert "in_time_placebo" not in native and "leave_one_out" not in native
+ # The ADH-2015 diagnostics are also opt-in: surfaced as "not_run" stubs until
+ # the user calls leave_one_out() / in_time_placebo().
+ assert native["leave_one_out"]["status"] == "not_run"
+ assert native["in_time_placebo"]["status"] == "not_run"
def test_scm_native_surfaces_placebo_after_optin_run(self, scm_fit):
res, _ = scm_fit
@@ -2090,6 +2092,25 @@ def test_scm_native_surfaces_placebo_after_optin_run(self, scm_fit):
assert block["n_placebos"] == res.n_placebos
assert block["placebo_p_value"] == pytest.approx(res.placebo_p_value)
+ def test_scm_native_surfaces_leave_one_out_after_optin_run(self, scm_fit):
+ res, _ = scm_fit
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ res.leave_one_out()
+ native = DiagnosticReport(res).to_dict()["estimator_native_diagnostics"]
+ block = native["leave_one_out"]
+ assert block["status"] == "ran"
+ assert block["att_range"] is not None and len(block["att_range"]) == 2
+
+ def test_scm_native_surfaces_in_time_placebo_after_optin_run(self, scm_fit):
+ res, _ = scm_fit
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ res.in_time_placebo()
+ native = DiagnosticReport(res).to_dict()["estimator_native_diagnostics"]
+ block = native["in_time_placebo"]
+ assert block["status"] == "ran" and block["n_dates"] >= 1
+
def test_scm_does_not_call_honest_did(self, scm_fit):
"""HonestDiD sensitivity should NOT run on SCM (fit-based / native path)."""
res, _ = scm_fit
@@ -2108,18 +2129,22 @@ def test_scm_significance_not_marked_done_until_placebo_run(self, scm_fit):
res, _ = scm_fit # the fixture does NOT run the placebo
schema = DiagnosticReport(res).to_dict()
labels = " ".join(s.get("label", "") for s in schema.get("next_steps", [])).lower()
- assert "placebo" in labels # the significance recommendation still surfaces
+ # Target the IN-SPACE step specifically (the in-time-placebo step's label also
+ # contains "placebo" but is a different, always-on ADH-2015 recommendation).
+ assert "in-space placebo" in labels # the significance recommendation surfaces
def test_scm_placebo_step_completes_after_run(self, scm_fit):
- """Once the opt-in placebo has been run, DR stops recommending it."""
+ """Once the opt-in in-space placebo has been run, DR stops recommending it."""
res, _ = scm_fit
before = DiagnosticReport(res).to_dict()["next_steps"]
- assert any("placebo" in s.get("label", "").lower() for s in before)
+ assert any("in-space placebo" in s.get("label", "").lower() for s in before)
with warnings.catch_warnings():
warnings.simplefilter("ignore")
res.in_space_placebo() # opt-in significance procedure now done
after = DiagnosticReport(res).to_dict()["next_steps"]
- assert not any("placebo" in s.get("label", "").lower() for s in after)
+ # The in-space step is suppressed; the (differently-tagged) in-time-placebo
+ # and leave-one-out recommendations are unaffected.
+ assert not any("in-space placebo" in s.get("label", "").lower() for s in after)
def test_scm_rejects_precomputed_parallel_trends_and_sensitivity(self, scm_fit):
# Like SDiD/TROP, SCM computes its PT verdict internally (the scm_fit
diff --git a/tests/test_methodology_synthetic_control.py b/tests/test_methodology_synthetic_control.py
index ae563d01..ee85ea78 100644
--- a/tests/test_methodology_synthetic_control.py
+++ b/tests/test_methodology_synthetic_control.py
@@ -32,7 +32,12 @@
import pandas as pd
import pytest
-from diff_diff import SyntheticControl, SyntheticControlResults, synthetic_control
+from diff_diff import (
+ DiagnosticReport,
+ SyntheticControl,
+ SyntheticControlResults,
+ synthetic_control,
+)
from tests.conftest import assert_nan_inference
DATA_DIR = Path(__file__).parent / "data"
@@ -845,6 +850,35 @@ def test_basque_tier2_nested_band():
assert res.donor_weights.get(10, 0) + res.donor_weights.get(14, 0) > 0.7
+def test_basque_tier1_leave_one_out_parity():
+ """Tier-1 LOO (deterministic): dropping the dominant donor (region 10) with R's
+ ``solution.v`` held fixed, the reduced-pool refit's ATT and gap path match R's
+ drop-donor ``synth`` exactly (a direct R anchor on the reduced-pool W-solve;
+ ``leave_one_out()`` on a custom-V fit reuses that fixed V on the donor pool minus
+ the dropped unit). Region 10 carries ~85% of the full-pool weight, so dropping it
+ swings the synthetic onto regions 7+14 — the single-donor-dependence signal LOO
+ exists to surface."""
+ golden, df = _load_golden()
+ if "leave_one_out" not in golden:
+ pytest.skip("LOO golden missing — regenerate via the R script.")
+ loo_g = golden["leave_one_out"]
+ dropped = int(loo_g["dropped_regionno"])
+ custom_v = np.asarray(golden["solution_v"], dtype=float)
+ res = SyntheticControl(v_method="custom", custom_v=custom_v).fit(
+ df, "gdpcap", "treated", "regionno", "year", **_basque_kwargs(golden)
+ )
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ loo = res.leave_one_out()
+ row = loo[(loo["status"] == "loo") & (loo["dropped_unit"] == dropped)]
+ assert len(row) == 1
+ assert float(row["att"].iloc[0]) == pytest.approx(float(loo_g["att"]), abs=1e-2)
+ # Full reduced-pool gap trajectory (1955-1997) matches R's drop-donor synth.
+ gaps = res.get_leave_one_out_gaps()
+ gap_py = gaps[gaps["dropped_unit"] == dropped].sort_values("period")["gap"].to_numpy()
+ np.testing.assert_allclose(gap_py, np.asarray(loo_g["gap"], dtype=float), atol=2e-2)
+
+
# ---------------------------------------------------------------------------
# In-space placebo permutation inference (Abadie-Diamond-Hainmueller 2010 §2.4)
# ---------------------------------------------------------------------------
@@ -1328,4 +1362,845 @@ def test_rmspe_ratio_is_root_scale():
assert _rmspe_ratio(pre, post, scale=10.0) == pytest.approx(1.5)
# Zero post-effect -> ratio 0; perfect pre-fit -> finite (floored), not inf.
assert _rmspe_ratio(pre, np.zeros(2), scale=10.0) == pytest.approx(0.0)
+ # Perfect pre-fit (zero pre-gaps) -> floored denominator -> finite, not inf.
assert np.isfinite(_rmspe_ratio(np.zeros(2), post, scale=10.0))
+
+
+# ---------------------------------------------------------------------------
+# Leave-one-out donor robustness (ADH 2015 §4)
+# ---------------------------------------------------------------------------
+
+
+def _equal_mix_panel(n_donors=5, T=8, T0=6, effect=3.0, seed=1):
+ """Near-identical donors -> equal-ish weights -> dropping any one barely moves
+ the synthetic (the LOO-stable regime)."""
+ rng = np.random.default_rng(seed)
+ years = list(range(2000, 2000 + T))
+ base = rng.normal(10, 0.4, n_donors)
+ common = np.cumsum(rng.normal(0, 0.2, T)) # shared trend
+ donors = {j: base[j] + common + rng.normal(0, 0.08, T) for j in range(n_donors)}
+ treated = np.mean([donors[j] for j in range(n_donors)], axis=0) + rng.normal(0, 0.04, T)
+ treated = treated.copy()
+ treated[T0:] += effect
+ rows = []
+ for j in range(n_donors):
+ for t in range(T):
+ rows.append({"unit": f"d{j}", "year": years[t], "y": donors[j][t], "treated": 0})
+ for t in range(T):
+ rows.append({"unit": "treated", "year": years[t], "y": treated[t], "treated": int(t >= T0)})
+ return pd.DataFrame(rows)
+
+
+def _single_donor_panel(n_donors=4, T=8, T0=6, effect=3.0, seed=2):
+ """One donor (d0) tracks the treated unit; the rest are far away -> weight
+ concentrates on d0 -> dropping d0 swings the result (the LOO-fragile regime)."""
+ rng = np.random.default_rng(seed)
+ years = list(range(2000, 2000 + T))
+ d0_path = 10 + np.cumsum(rng.normal(0, 0.3, T))
+ donors = {0: d0_path + rng.normal(0, 0.03, T)}
+ for j in range(1, n_donors):
+ donors[j] = (25.0 + 6.0 * j) + np.cumsum(rng.normal(0, 0.3, T)) # far from treated
+ treated = d0_path + rng.normal(0, 0.03, T)
+ treated = treated.copy()
+ treated[T0:] += effect
+ rows = []
+ for j in range(n_donors):
+ for t in range(T):
+ rows.append({"unit": f"d{j}", "year": years[t], "y": donors[j][t], "treated": 0})
+ for t in range(T):
+ rows.append({"unit": "treated", "year": years[t], "y": treated[t], "treated": int(t >= T0)})
+ return pd.DataFrame(rows)
+
+
+def _fit_cheap(df):
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ return synthetic_control(df, "y", "treated", "unit", "year", seed=0, **_FAST)
+
+
+_LOO_COLS = ["dropped_unit", "att", "pre_rmspe", "post_rmspe", "rmspe_ratio", "delta_att", "status"]
+
+
+def test_leave_one_out_baseline_row_and_structure():
+ res = _fit_for_placebo(n_donors=4)
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ loo = res.leave_one_out()
+ assert list(loo.columns) == _LOO_COLS
+ # Exactly one baseline row, first, reading directly from the full fit.
+ base = loo.iloc[0]
+ # dropped_unit is "not applicable" for the baseline row (pandas renders the
+ # None as NA in the donor-id column).
+ assert base["status"] == "baseline" and pd.isna(base["dropped_unit"])
+ assert base["att"] == pytest.approx(res.att) and base["delta_att"] == 0.0
+ assert base["pre_rmspe"] == pytest.approx(res.pre_rmspe)
+ assert base["rmspe_ratio"] == pytest.approx(res.rmspe_ratio)
+ # One LOO row per positively-weighted donor (no failures on this clean panel).
+ pos = [d for d in res._fit_snapshot.donor_ids if d in res.donor_weights]
+ loo_rows = loo[loo["status"] == "loo"]
+ assert set(loo_rows["dropped_unit"]) == set(pos)
+ assert res._loo_n_failed == 0 and res._loo_status == "ran"
+ # delta_att == att - full att, exactly.
+ for _, r in loo_rows.iterrows():
+ assert r["delta_att"] == pytest.approx(r["att"] - res.att)
+ # Sorted by |delta_att| descending.
+ deltas = loo_rows["delta_att"].abs().to_numpy()
+ assert np.all(np.diff(deltas) <= 1e-12)
+ # att_range spans the LOO refits.
+ lo, hi = res._loo_att_range
+ assert lo <= hi and lo == pytest.approx(loo_rows["att"].min())
+ assert hi == pytest.approx(loo_rows["att"].max())
+
+
+def test_leave_one_out_stable_when_no_donor_dominates():
+ res = _fit_cheap(_equal_mix_panel(n_donors=5, effect=3.0))
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ loo = res.leave_one_out()
+ loo_rows = loo[loo["status"] == "loo"]
+ # Near-identical donors -> dropping any one barely moves the ATT (well under the
+ # 3.0 effect). att_range is correspondingly tight.
+ assert loo_rows["delta_att"].abs().max() < 1.0
+ lo, hi = res._loo_att_range
+ assert (hi - lo) < 1.0
+
+
+def test_leave_one_out_swings_when_one_donor_dominates():
+ res = _fit_cheap(_single_donor_panel(n_donors=4, effect=3.0))
+ # Weight concentrates on d0.
+ assert res.donor_weights.get("d0", 0.0) > 0.5
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ loo = res.leave_one_out()
+ loo_rows = loo[loo["status"] == "loo"]
+ # Dropping the dominant donor is the most influential drop (top finite row) and
+ # moves the ATT by a non-trivial amount.
+ top = loo_rows.iloc[0]
+ assert top["dropped_unit"] == "d0"
+ assert abs(top["delta_att"]) > 0.2
+
+
+def test_leave_one_out_deterministic():
+ res = _fit_for_placebo(n_donors=4)
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ loo1 = res.leave_one_out()
+ loo2 = res.leave_one_out()
+ pd.testing.assert_frame_equal(loo1, loo2)
+
+
+def test_leave_one_out_requires_two_donors():
+ res = _fit_for_placebo(n_donors=1)
+ with pytest.warns(UserWarning, match="at least 2 donors"):
+ loo = res.leave_one_out()
+ assert len(loo) == 1 and loo.iloc[0]["status"] == "baseline"
+ assert res._loo_status == "too_few_donors" and res._loo_att_range is None
+
+
+def test_leave_one_out_fails_closed_on_nonconverged_treated_fit():
+ df, _, _ = _make_panel(n_donors=4, effect=3.0)
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ res = synthetic_control(
+ df, "y", "treated", "unit", "year", seed=0, inner_max_iter=1, **_FAST_CHURN
+ )
+ assert res._fit_converged is False
+ with pytest.warns(UserWarning, match="did not converge at fit time"):
+ loo = res.leave_one_out()
+ assert len(loo) == 1 and loo.iloc[0]["status"] == "baseline"
+ assert res._loo_status == "treated_fit_nonconverged"
+
+
+def test_leave_one_out_refit_failure_tallied(monkeypatch):
+ import importlib
+
+ sc = importlib.import_module("diff_diff.synthetic_control")
+ res = _fit_for_placebo(n_donors=4)
+ real_fit_unit = sc._placebo_fit_unit
+ calls = {"n": 0}
+
+ def flaky_fit_unit(snap, unit, donor_pool, n_starts):
+ calls["n"] += 1
+ if calls["n"] == 1: # first leave-one-out refit "fails"
+ return None
+ return real_fit_unit(snap, unit, donor_pool, n_starts)
+
+ monkeypatch.setattr(sc, "_placebo_fit_unit", flaky_fit_unit)
+ with pytest.warns(UserWarning, match="failed to converge"):
+ loo = res.leave_one_out()
+ assert res._loo_n_failed == 1
+ failed = loo[loo["status"] == "failed"]
+ assert len(failed) == 1
+ assert failed[["att", "pre_rmspe", "rmspe_ratio", "delta_att"]].isna().all().all()
+ # Failed rows sort last (after the baseline + the converged LOO rows).
+ assert loo.iloc[-1]["status"] == "failed"
+
+
+def test_leave_one_out_pickle_drops_gaps_keeps_table():
+ res = _fit_for_placebo(n_donors=4)
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ res.leave_one_out()
+ restored = pickle.loads(pickle.dumps(res))
+ # The summary table + scalars survive; panel-derived gap paths do not.
+ pd.testing.assert_frame_equal(restored.get_leave_one_out_df(), res.get_leave_one_out_df())
+ assert restored._loo_gaps is None
+ assert restored._loo_att_range == res._loo_att_range
+ with pytest.raises(ValueError, match="not retained after pickling"):
+ restored.get_leave_one_out_gaps()
+
+
+def test_leave_one_out_gaps_long_form():
+ res = _fit_for_placebo(n_donors=4)
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ res.leave_one_out()
+ gaps = res.get_leave_one_out_gaps()
+ assert list(gaps.columns) == ["dropped_unit", "period", "gap", "phase"]
+ pos = [d for d in res._fit_snapshot.donor_ids if d in res.donor_weights]
+ assert set(gaps["dropped_unit"]) == set(pos)
+ # Every dropped donor has a full pre+post trajectory.
+ n_periods = len(res.pre_periods) + len(res.post_periods)
+ assert (gaps.groupby("dropped_unit").size() == n_periods).all()
+ assert set(gaps["phase"]) == {"pre", "post"}
+
+
+def test_leave_one_out_accessor_before_run_raises():
+ res = _fit_for_placebo(n_donors=4)
+ with pytest.raises(ValueError, match="call leave_one_out"):
+ res.get_leave_one_out_df()
+ with pytest.raises(ValueError, match="call leave_one_out"):
+ res.get_leave_one_out_gaps()
+
+
+def test_leave_one_out_does_not_touch_analytical_inference():
+ res = _fit_for_placebo(n_donors=4)
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ res.leave_one_out()
+ assert_nan_inference(
+ {"se": res.se, "t_stat": res.t_stat, "p_value": res.p_value, "conf_int": res.conf_int}
+ )
+ assert res.is_significant is False
+
+
+def test_leave_one_out_requires_snapshot():
+ res = _fit_for_placebo(n_donors=4)
+ restored = pickle.loads(pickle.dumps(res))
+ with pytest.raises(ValueError, match="requires the fit snapshot"):
+ restored.leave_one_out()
+
+
+# ---------------------------------------------------------------------------
+# In-time placebo: snapshot-truncation helper (ADH 2015 §4)
+# ---------------------------------------------------------------------------
+
+
+def _snap_for_in_time(**kw):
+ return _fit_for_placebo(n_donors=4, **kw)._fit_snapshot
+
+
+def test_truncate_snapshot_positional_split():
+ from diff_diff.synthetic_control import _truncate_snapshot_in_time
+
+ snap = _snap_for_in_time()
+ assert list(snap.pre_periods) == [2000, 2001, 2002, 2003, 2004, 2005]
+ mod, _ = _truncate_snapshot_in_time(snap, 2003)
+ assert mod is not None
+ assert mod.pre_periods == [2000, 2001, 2002] # pre-fake = strictly before t_f
+ assert mod.post_periods == [2003, 2004, 2005] # post-fake = held-out pre, t_f first
+ # all_periods EXCLUDES the true post periods (2006, 2007) -> airtight no-peeking.
+ assert mod.all_periods == [2000, 2001, 2002, 2003, 2004, 2005]
+ assert 2006 not in mod.all_periods and 2007 not in mod.all_periods
+
+
+def test_truncate_snapshot_drops_specs_in_held_out_window():
+ from diff_diff.synthetic_control import _truncate_snapshot_in_time
+
+ snap = _snap_for_in_time() # default pre_period_outcomes="all": one lag per pre period
+ mod, dropped = _truncate_snapshot_in_time(snap, 2003)
+ for spec in mod.specs: # surviving specs reference only pre-fake periods
+ assert all(p < 2003 for p in spec.periods)
+ assert len(dropped) == 3 # lags at 2003/2004/2005 dropped
+ assert len(mod.specs) == len(snap.specs) - 3
+
+
+def test_truncate_snapshot_custom_v_lockstep():
+ from diff_diff.synthetic_control import _truncate_snapshot_in_time
+
+ df, _, _ = _make_panel(n_donors=4)
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ res = synthetic_control(
+ df,
+ "y",
+ "treated",
+ "unit",
+ "year",
+ v_method="custom",
+ custom_v=np.arange(1.0, 7.0), # distinct entries to verify the subset
+ inner_min_decrease=1e-3,
+ )
+ snap = res._fit_snapshot
+ mod, _ = _truncate_snapshot_in_time(snap, 2003)
+ # custom_v subset IN LOCKSTEP with the surviving specs (the default lag specs are
+ # ordered by ascending pre period, so the first three entries survive).
+ assert mod.custom_v is not None and len(mod.custom_v) == len(mod.specs)
+ np.testing.assert_array_equal(mod.custom_v, np.array([1.0, 2.0, 3.0]))
+
+
+def test_truncate_snapshot_straddling_window_partial_keep():
+ from diff_diff.synthetic_control import _truncate_snapshot_in_time
+
+ df, _, _ = _make_panel(n_donors=4)
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ res = synthetic_control(
+ df,
+ "y",
+ "treated",
+ "unit",
+ "year",
+ special_predictors=[("y", [2002, 2003, 2004], "mean")],
+ pre_period_outcomes=[2000, 2001],
+ inner_min_decrease=1e-3,
+ )
+ snap = res._fit_snapshot
+ mod, _ = _truncate_snapshot_in_time(snap, 2003)
+ # The special predictor straddles t_f -> truncated to its pre-fake part [2002].
+ special = [s for s in mod.specs if s.kind == "special"]
+ assert len(special) == 1 and special[0].periods == [2002]
+
+
+def test_truncate_snapshot_infeasible_too_few_pre_fake():
+ from diff_diff.synthetic_control import _truncate_snapshot_in_time
+
+ snap = _snap_for_in_time()
+ # Fewer than 2 pre-fake periods -> infeasible (the deliberate >=2 rule; an
+ # auto-swept single-pre-fake placebo is a non-credible pre-fit — documented Note).
+ assert _truncate_snapshot_in_time(snap, 2000)[0] is None # 0 pre-fake
+ assert _truncate_snapshot_in_time(snap, 2001)[0] is None # 1 pre-fake
+
+
+def test_truncate_snapshot_infeasible_all_specs_dropped():
+ from diff_diff.synthetic_control import _truncate_snapshot_in_time
+
+ df, _, _ = _make_panel(n_donors=4)
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ res = synthetic_control(
+ df,
+ "y",
+ "treated",
+ "unit",
+ "year",
+ special_predictors=[("y", [2004, 2005], "mean")],
+ pre_period_outcomes=[2004, 2005],
+ inner_min_decrease=1e-3,
+ )
+ snap = res._fit_snapshot
+ # t_f=2003 leaves >=2 pre-fake periods, but every spec lives in [2004, 2005]
+ # -> all dropped -> infeasible (cannot fit with zero predictors).
+ mod, dropped = _truncate_snapshot_in_time(snap, 2003)
+ assert mod is None and len(dropped) == len(snap.specs)
+
+
+def test_truncate_snapshot_does_not_mutate_original():
+ from diff_diff.synthetic_control import _truncate_snapshot_in_time
+
+ snap = _snap_for_in_time()
+ before = [list(s.periods) for s in snap.specs]
+ _truncate_snapshot_in_time(snap, 2003)
+ after = [list(s.periods) for s in snap.specs]
+ assert before == after # shared spec objects are never mutated in place
+
+
+# ---------------------------------------------------------------------------
+# In-time placebo: end-to-end (ADH 2015 §4)
+# ---------------------------------------------------------------------------
+
+_IN_TIME_COLS = [
+ "placebo_period",
+ "placebo_att",
+ "pre_fit_rmspe",
+ "rmspe_ratio",
+ "n_pre_fake",
+ "n_post_fake",
+ "n_dropped_specs",
+ "status",
+]
+
+
+def test_in_time_placebo_near_zero_when_effect_post_only():
+ # The effect is only in the TRUE post window (>=2006); every backdated placebo
+ # falls in the clean pre window, so the placebo "effect" should be ~0.
+ res = _fit_for_placebo(n_donors=4, effect=3.0)
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ itp = res.in_time_placebo()
+ assert list(itp.columns) == _IN_TIME_COLS
+ ran = itp[itp["status"] == "ran"]
+ assert len(ran) > 0
+ assert ran["placebo_att"].abs().max() < 1.0 # well below the 3.0 true effect
+
+
+def test_in_time_placebo_sweep_feasibility():
+ res = _fit_for_placebo(n_donors=4)
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ itp = res.in_time_placebo()
+ # pre = [2000..2005] -> feasible dates = pre[2:] = [2002, 2003, 2004, 2005]
+ # (>=2 pre-fake periods — the deliberate Note-documented restriction).
+ assert list(itp["placebo_period"]) == [2002, 2003, 2004, 2005]
+ assert (itp["status"] == "ran").all()
+ # n_pre_fake + n_post_fake == n_pre for every row, with >=2 pre-fake + >=1 post-fake.
+ assert ((itp["n_pre_fake"] + itp["n_post_fake"]) == len(res.pre_periods)).all()
+ assert (itp["n_pre_fake"] >= 2).all() and (itp["n_post_fake"] >= 1).all()
+
+
+def test_in_time_placebo_explicit_post_date_raises():
+ res = _fit_for_placebo(n_donors=4)
+ with pytest.raises(ValueError, match="true post-treatment period"):
+ res.in_time_placebo([2006])
+
+
+def test_in_time_placebo_date_not_in_pre_raises():
+ res = _fit_for_placebo(n_donors=4)
+ with pytest.raises(ValueError, match="not a pre-treatment period"):
+ res.in_time_placebo([1999])
+
+
+def test_in_time_placebo_empty_explicit_input_raises():
+ # An explicit but EMPTY container is malformed (NOT "every date infeasible") -> raise
+ # (codex R6 P1). None still means "sweep all feasible dates".
+ res = _fit_for_placebo(n_donors=4)
+ for empty in ([], (), pd.Index([]), np.array([])):
+ with pytest.raises(ValueError, match="placebo_periods is empty"):
+ res.in_time_placebo(empty)
+ # The malformed call must not leave any in-time state behind.
+ assert res._in_time_df is None and res._in_time_status is None
+
+
+def test_in_time_placebo_dedups_and_canonicalizes_explicit_dates():
+ # Duplicate / unordered explicit dates -> de-duplicated + pre-period-ordered, so no
+ # duplicate refits and n_dates is not inflated (codex R7 P3).
+ res = _fit_for_placebo(n_donors=4)
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ itp = res.in_time_placebo([2004, 2002, 2004]) # duplicate 2004, unordered
+ assert list(itp["placebo_period"]) == [2002, 2004] # unique, canonical pre-period order
+
+
+def test_in_time_placebo_ran_block_reports_partial_coverage():
+ # CI codex P2: a sweep where SOME dates ran and SOME were infeasible must surface
+ # n_ran / n_infeasible on the status="ran" block so coverage is not overstated.
+ res = _fit_for_placebo(n_donors=4)
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ res.in_time_placebo([2001, 2003]) # 2001 infeasible (1 pre-fake), 2003 runs
+ assert res._in_time_status == "ran" # at least one date ran
+ block = DiagnosticReport(res).to_dict()["estimator_native_diagnostics"]["in_time_placebo"]
+ assert block["status"] == "ran"
+ assert block["n_dates"] == 2 and block["n_ran"] == 1
+ assert block["n_infeasible"] == 1 and block["n_failed"] == 0
+
+
+def test_leave_one_out_immune_to_donor_weights_mutation():
+ # Codex R8 P1: the LOO drop-set is FROZEN at fit time (snap.weighted_donor_ids =
+ # the >1e-6 reportable support), NOT read from the mutable presentation-level
+ # donor_weights dict. So mutating donor_weights after the fit must NOT change which
+ # donors are dropped — the robustness result depends only on the fit.
+ res = _fit_for_placebo(n_donors=4)
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ before = set(res.leave_one_out()[lambda d: d["status"] != "baseline"]["dropped_unit"])
+ assert before == set(res._fit_snapshot.weighted_donor_ids) # drops the frozen support
+ # Mutate the public dict: drop a real donor, inject a bogus one.
+ victim = next(iter(res.donor_weights))
+ res.donor_weights = {k: v for k, v in res.donor_weights.items() if k != victim}
+ res.donor_weights["bogus_donor"] = 0.99
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ after = set(res.leave_one_out()[lambda d: d["status"] != "baseline"]["dropped_unit"])
+ assert after == before # unchanged by the mutation
+ assert "bogus_donor" not in after # a donor not in the fit is never dropped
+ assert victim in after # still dropped despite removal from donor_weights
+
+
+def test_in_time_placebo_early_date_infeasible_no_raise():
+ res = _fit_for_placebo(n_donors=4)
+ # A valid pre-date with too few (<2) pre-fake periods -> NaN infeasible row +
+ # warning, NOT a raise.
+ with pytest.warns(UserWarning, match="infeasible"):
+ itp = res.in_time_placebo([2001]) # 1 pre-fake period
+ assert len(itp) == 1 and itp.iloc[0]["status"] == "infeasible"
+ assert np.isnan(itp.iloc[0]["placebo_att"])
+
+
+def test_in_time_placebo_custom_v_zero_mass_is_infeasible_not_failed():
+ # A custom_v whose mass lies entirely on specs that TRUNCATE drops leaves a
+ # zero-mass surviving V -> the date is INFEASIBLE under the supplied custom_v,
+ # NOT a convergence failure (codex R2 P1b: v/v.sum() would be 0/0).
+ df, _, _ = _make_panel(n_donors=4) # default: 6 lag specs (2000..2005)
+ v = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0]) # all mass on the 2003/2004/2005 lags
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ res = synthetic_control(
+ df,
+ "y",
+ "treated",
+ "unit",
+ "year",
+ v_method="custom",
+ custom_v=v,
+ inner_min_decrease=1e-3,
+ )
+ itp = res.in_time_placebo([2003]) # keeps lags 2000/2001/2002 -> all zero weight
+ row = itp[itp["placebo_period"] == 2003]
+ assert len(row) == 1 and row.iloc[0]["status"] == "infeasible" # NOT "failed"
+ assert res._in_time_status == "all_dates_infeasible"
+
+
+def test_leave_one_out_uniform_shift_surfaced_by_delta_not_range(monkeypatch):
+ # Codex R3 P1b: when every donor-drop shifts the ATT the SAME way, the raw
+ # att_range has ~zero width (looks stable) but the donor dependence is large.
+ # The headline metric must be baseline-relative (max |delta_att|), not the range.
+ import importlib
+
+ sc = importlib.import_module("diff_diff.synthetic_control")
+ res = _fit_for_placebo(n_donors=4)
+ baseline = float(res.att)
+ snap = res._fit_snapshot
+ shift = 5.0 # same large shift for EVERY drop -> uniform
+
+ def uniform_shift(snap_arg, unit, pool, n_starts):
+ gp = {p: 0.0 for p in snap.pre_periods}
+ gp.update({p: baseline + shift for p in snap.post_periods})
+ return gp, 1.0
+
+ monkeypatch.setattr(sc, "_placebo_fit_unit", uniform_shift)
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ res.leave_one_out()
+ lo, hi = res._loo_att_range
+ assert (hi - lo) == pytest.approx(0.0, abs=1e-9) # raw range would hide the shift
+ assert res._loo_max_abs_delta_att == pytest.approx(shift, abs=1e-9) # delta reveals it
+ native = DiagnosticReport(res).to_dict()["estimator_native_diagnostics"]
+ assert native["leave_one_out"]["max_abs_delta_att"] == pytest.approx(shift, abs=1e-9)
+
+
+def test_in_time_placebo_windowed_covariate_dropped_and_warns():
+ # A special predictor measured over [2004, 2005] falls entirely in the held-out
+ # window for t_f=2003 -> dropped (TRUNCATE) + warning + n_dropped_specs reflects it.
+ df, _, _ = _make_panel(n_donors=4)
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ res = synthetic_control(
+ df,
+ "y",
+ "treated",
+ "unit",
+ "year",
+ special_predictors=[("y", [2004, 2005], "mean")],
+ pre_period_outcomes=[2000, 2001, 2002, 2003],
+ inner_min_decrease=1e-3,
+ )
+ with pytest.warns(UserWarning, match="dropped"):
+ itp = res.in_time_placebo([2003])
+ row = itp.iloc[0]
+ # The special predictor (and the lag at 2003) lie in [2003, 2005] -> dropped.
+ assert row["n_dropped_specs"] >= 1 and row["status"] == "ran"
+
+
+def test_in_time_placebo_all_specs_dropped_infeasible():
+ df, _, _ = _make_panel(n_donors=4)
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ res = synthetic_control(
+ df,
+ "y",
+ "treated",
+ "unit",
+ "year",
+ special_predictors=[("y", [2004, 2005], "mean")],
+ pre_period_outcomes=[2004, 2005],
+ inner_min_decrease=1e-3,
+ )
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ itp = res.in_time_placebo([2003]) # every predictor is at 2004/2005
+ assert itp.iloc[0]["status"] == "infeasible"
+
+
+def test_in_time_placebo_custom_v_runs_without_shape_error():
+ # End-to-end guard for the custom_v lockstep subset: without it the custom path
+ # would raise a shape mismatch once specs are dropped.
+ df, _, _ = _make_panel(n_donors=4)
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ res = synthetic_control(
+ df,
+ "y",
+ "treated",
+ "unit",
+ "year",
+ v_method="custom",
+ custom_v=np.ones(6),
+ inner_min_decrease=1e-3,
+ )
+ itp = res.in_time_placebo()
+ assert (itp["status"] == "ran").any()
+
+
+def test_in_time_placebo_accepts_2d_custom_v():
+ # fit() accepts an array-like custom_v (e.g. a (1, k) row vector, raveled during
+ # validation); the in-time TRUNCATE subset must ravel before indexing or a 2D
+ # custom_v raises IndexError (codex R5 P1). Must match the 1D result exactly.
+ df, _, _ = _make_panel(n_donors=4)
+ v1d = np.arange(1.0, 7.0)
+ v2d = v1d.reshape(1, 6) # row-vector form accepted at fit time
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ res1 = synthetic_control(
+ df,
+ "y",
+ "treated",
+ "unit",
+ "year",
+ v_method="custom",
+ custom_v=v1d,
+ inner_min_decrease=1e-3,
+ )
+ res2 = synthetic_control(
+ df,
+ "y",
+ "treated",
+ "unit",
+ "year",
+ v_method="custom",
+ custom_v=v2d,
+ inner_min_decrease=1e-3,
+ )
+ itp1 = res1.in_time_placebo([2003])
+ itp2 = res2.in_time_placebo([2003]) # would IndexError before the ravel fix
+ pd.testing.assert_frame_equal(itp1, itp2)
+
+
+def test_in_time_placebo_deterministic():
+ res = _fit_for_placebo(n_donors=4)
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ itp1 = res.in_time_placebo()
+ itp2 = res.in_time_placebo()
+ pd.testing.assert_frame_equal(itp1, itp2)
+
+
+def test_in_time_placebo_fails_closed_on_nonconverged_treated_fit():
+ df, _, _ = _make_panel(n_donors=4, effect=3.0)
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ res = synthetic_control(
+ df, "y", "treated", "unit", "year", seed=0, inner_max_iter=1, **_FAST_CHURN
+ )
+ assert res._fit_converged is False
+ with pytest.warns(UserWarning, match="did not converge"):
+ itp = res.in_time_placebo()
+ assert len(itp) == 0 and res._in_time_status == "treated_fit_nonconverged"
+
+
+def test_in_time_placebo_pickle_drops_gaps_keeps_table():
+ res = _fit_for_placebo(n_donors=4)
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ res.in_time_placebo()
+ restored = pickle.loads(pickle.dumps(res))
+ pd.testing.assert_frame_equal(restored.get_in_time_placebo_df(), res.get_in_time_placebo_df())
+ assert restored._in_time_gaps is None
+ with pytest.raises(ValueError, match="not retained after pickling"):
+ restored.get_in_time_placebo_gaps()
+ with pytest.raises(ValueError, match="requires the fit snapshot"):
+ restored.in_time_placebo()
+
+
+def test_in_time_placebo_gaps_long_form():
+ res = _fit_for_placebo(n_donors=4)
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ res.in_time_placebo([2003])
+ gaps = res.get_in_time_placebo_gaps()
+ assert list(gaps.columns) == ["placebo_period", "period", "gap", "phase"]
+ assert set(gaps["phase"]) == {"pre_fake", "post_fake"}
+ # Periods before t_f=2003 are pre_fake; 2003+ are post_fake.
+ assert set(gaps.loc[gaps["phase"] == "pre_fake", "period"]) == {2000, 2001, 2002}
+ assert set(gaps.loc[gaps["phase"] == "post_fake", "period"]) == {2003, 2004, 2005}
+
+
+def test_in_time_placebo_accessor_before_run_raises():
+ res = _fit_for_placebo(n_donors=4)
+ with pytest.raises(ValueError, match="call in_time_placebo"):
+ res.get_in_time_placebo_df()
+ with pytest.raises(ValueError, match="call in_time_placebo"):
+ res.get_in_time_placebo_gaps()
+
+
+def test_in_time_placebo_does_not_touch_analytical_inference():
+ res = _fit_for_placebo(n_donors=4)
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ res.in_time_placebo()
+ assert_nan_inference(
+ {"se": res.se, "t_stat": res.t_stat, "p_value": res.p_value, "conf_int": res.conf_int}
+ )
+ assert res.is_significant is False
+
+
+# ---------------------------------------------------------------------------
+# Self-consistency parity: the ADH-2015 diagnostics are EXACT re-runs of the
+# validated solver on the equivalent sub-problem.
+#
+# R `Synth` has NO in-time-placebo or leave-one-out function (verified against its
+# full CRAN function index), so there is no canonical R *output* to match for these
+# diagnostics specifically. Instead we prove (deterministically, via a fixed custom
+# V) that leave_one_out() equals a from-scratch fit on the reduced donor pool, and
+# in_time_placebo() equals a from-scratch fit on the backdated/truncated panel.
+# Because the custom-V solver is itself R-anchored on Basque
+# (test_basque_tier1_custom_v_parity), this transitively anchors the diagnostics to
+# R while directly validating that the re-run mechanism is exact (not approximate).
+# ---------------------------------------------------------------------------
+
+
+def test_leave_one_out_matches_fresh_reduced_pool_fit():
+ df, _, _ = _make_panel(n_donors=4)
+ v = np.arange(1.0, 7.0) # k = 6 default lag predictors; fixed V -> deterministic
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ res = synthetic_control(
+ df,
+ "y",
+ "treated",
+ "unit",
+ "year",
+ v_method="custom",
+ custom_v=v,
+ inner_min_decrease=1e-3,
+ )
+ loo = res.leave_one_out()
+ donor_ids = list(res._fit_snapshot.donor_ids)
+ d = [x for x in donor_ids if x in res.donor_weights][0] # a positively-weighted donor
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ fresh = synthetic_control(
+ df,
+ "y",
+ "treated",
+ "unit",
+ "year",
+ v_method="custom",
+ custom_v=v,
+ inner_min_decrease=1e-3,
+ donor_pool=[x for x in donor_ids if x != d],
+ )
+ loo_att = loo.loc[loo["dropped_unit"] == d, "att"].iloc[0]
+ assert loo_att == pytest.approx(fresh.att, abs=1e-7)
+
+
+def test_in_time_placebo_matches_fresh_backdated_fit():
+ df, _, _ = _make_panel(n_donors=4) # years 2000-2007, T0=6 -> pre = 2000..2005
+ v = np.arange(1.0, 7.0)
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ res = synthetic_control(
+ df,
+ "y",
+ "treated",
+ "unit",
+ "year",
+ v_method="custom",
+ custom_v=v,
+ inner_min_decrease=1e-3,
+ )
+ itp = res.in_time_placebo([2003])
+ placebo_att = itp.loc[itp["placebo_period"] == 2003, "placebo_att"].iloc[0]
+ # Fresh backdated fit: drop the true post periods, treat 2003 as the intervention,
+ # feed the pre-fake-subset V (lags at 2000/2001/2002 -> v[:3]).
+ back = df[df["year"] <= 2005].copy()
+ back["treated"] = ((back["unit"] == "treated") & (back["year"] >= 2003)).astype(int)
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ fresh = synthetic_control(
+ back,
+ "y",
+ "treated",
+ "unit",
+ "year",
+ v_method="custom",
+ custom_v=v[:3],
+ inner_min_decrease=1e-3,
+ )
+ assert placebo_att == pytest.approx(fresh.att, abs=1e-7)
+
+
+# ---------------------------------------------------------------------------
+# All-refits-failed branches (codex R1 P1): when EVERY refit fails to converge,
+# the status must NOT be reported as "ran" / mislabeled as dimensional infeasibility.
+# ---------------------------------------------------------------------------
+
+
+def test_leave_one_out_all_refits_failed_status(monkeypatch):
+ import importlib
+
+ sc = importlib.import_module("diff_diff.synthetic_control")
+ res = _fit_for_placebo(n_donors=4)
+ monkeypatch.setattr(sc, "_placebo_fit_unit", lambda *a, **k: None) # every drop fails
+ with pytest.warns(UserWarning, match="failed to converge"):
+ loo = res.leave_one_out()
+ # Distinct status (NOT "ran"); att_range is None; baseline + only failed rows.
+ assert res._loo_status == "all_refits_failed"
+ assert res._loo_att_range is None
+ assert (loo["status"] != "loo").all() # no successful drop
+ assert (loo.iloc[1:]["status"] == "failed").all()
+ # DiagnosticReport must surface it as NOT "ran", with the convergence reason.
+ native = DiagnosticReport(res).to_dict()["estimator_native_diagnostics"]
+ assert native["leave_one_out"]["status"] != "ran"
+ # Machine-readable code distinguishes numerical failure from structural infeasibility.
+ assert native["leave_one_out"]["reason_code"] == "all_refits_failed"
+ assert "failed to converge" in native["leave_one_out"]["reason"]
+
+
+def test_in_time_placebo_all_dates_failed_status(monkeypatch):
+ import importlib
+
+ sc = importlib.import_module("diff_diff.synthetic_control")
+ res = _fit_for_placebo(n_donors=4)
+ monkeypatch.setattr(sc, "_placebo_fit_unit", lambda *a, **k: None) # every refit fails
+ with pytest.warns(UserWarning, match="failed to converge"):
+ itp = res.in_time_placebo()
+ # Convergence failure must NOT be mislabeled as dimensional infeasibility.
+ assert res._in_time_status == "all_dates_failed"
+ assert (itp["status"] == "failed").all() and len(itp) > 0
+ native = DiagnosticReport(res).to_dict()["estimator_native_diagnostics"]
+ assert native["in_time_placebo"]["status"] != "ran"
+ assert native["in_time_placebo"]["reason_code"] == "all_dates_failed"
+ assert "failed to converge" in native["in_time_placebo"]["reason"]
+
+
+def test_in_time_placebo_mixed_failed_and_infeasible_status(monkeypatch):
+ # Codex R8 P2: a no-success run with BOTH a dimensionally-infeasible date AND a
+ # convergence-failed date must report the mixed "all_dates_unusable" status with
+ # both counts — NOT be mislabeled as exclusively failed (which would falsely claim
+ # "none was dimensionally infeasible").
+ import importlib
+
+ sc = importlib.import_module("diff_diff.synthetic_control")
+ res = _fit_for_placebo(n_donors=4)
+ # Feasible dates "fail" to converge; 2001 (1 pre-fake) is dimensionally infeasible.
+ monkeypatch.setattr(sc, "_placebo_fit_unit", lambda *a, **k: None)
+ with warnings.catch_warnings():
+ warnings.simplefilter("ignore")
+ itp = res.in_time_placebo([2001, 2003]) # 2001 infeasible, 2003 fails
+ assert res._in_time_status == "all_dates_unusable"
+ assert res._in_time_n_failed == 1 and res._in_time_n_infeasible == 1
+ assert set(itp["status"]) == {"infeasible", "failed"}
+ block = DiagnosticReport(res).to_dict()["estimator_native_diagnostics"]["in_time_placebo"]
+ assert block["reason_code"] == "all_dates_unusable"
+ assert block["n_failed"] == 1 and block["n_infeasible"] == 1
diff --git a/tests/test_practitioner.py b/tests/test_practitioner.py
index 4153d840..7cb7d734 100644
--- a/tests/test_practitioner.py
+++ b/tests/test_practitioner.py
@@ -356,6 +356,9 @@ def test_synthetic_control_results(self, mock_scm_results):
all_labels = " ".join(s.get("label", "") for s in output["next_steps"]).lower()
assert "in_space_placebo" in all_code
assert "placebo" in all_labels
+ # The ADH-2015 robustness steps also surface (opt-in diagnostics, non-STEPS
+ # tags so a caller's completed_steps can never suppress them).
+ assert "leave_one_out" in all_code and "in_time_placebo" in all_code
# SCM is not a staggered DiD: no control-group / anticipation knobs.
handler_steps = [s for s in output["next_steps"] if s["baker_step"] > 2]
handler_code = " ".join(s.get("code", "") for s in handler_steps)