diff --git a/CHANGELOG.md b/CHANGELOG.md index 7763c178..6c909c83 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -11,7 +11,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 - **`CallawaySantAnna.cluster=` silent no-op (Phase 1b interstitial).** `CallawaySantAnna(cluster="state").fit(...)` previously accepted the argument, stored it, returned it from `get_params()`, but never consumed it anywhere in the fit / aggregator / bootstrap pipeline (`staggered.py:154-156` docstring claimed "Defaults to unit-level clustering" — but for bare `cluster=X`, the aggregator at `staggered_aggregation.py:193-213` computed per-unit IF variance regardless, and the bootstrap at `staggered_bootstrap.py:323-347` drew per-unit multiplier weights regardless). Users who explicitly set `cluster="state"` got per-unit inference with no warning — typically SE too small under intra-cluster correlation. **Survey-PSU clustering via `survey_design=SurveyDesign(psu="state")` was NOT affected** and continued to cluster correctly via `_compute_stratified_psu_meat`. The fix synthesizes a minimal `SurveyDesign(psu=self.cluster, weight_type="pweight")` when bare `cluster=` is set without an explicit survey design, threading the synthesized PSU through the existing survey-PSU machinery (aggregator + bootstrap). A new dedicated `df_inference` field on `CallawaySantAnnaResults` carries the cluster-level df for the bare-cluster-synthesize path ONLY (where `survey_metadata` is intentionally `None` to preserve the `DiagnosticReport.survey_metadata is not None` skip at `diagnostic_report.py:848-856` + `:1150-1158` for "Original fit used a survey design" reasoning, and the `summary()` survey block render at `staggered_results.py:235-238`). `HonestDiD` at `honest_did.py` prefers `survey_metadata.df_survey` first (the actual CS-internal df, which may be tightened post-resolve for replicate designs) and falls back to `df_inference` for bare-cluster fits — so downstream consumers always see the cluster df without overriding the post-recompute survey df. When `survey_design=SurveyDesign(weights=Y)` without PSU is provided AND `cluster=X` is also set, `_inject_cluster_as_psu` injects the bare cluster as the effective PSU AND an `effective_survey_design = replace(survey_design, psu=self.cluster)` is constructed so the downstream `_validate_unit_constant_survey` catches movers (units crossing clusters across periods) on panel data via the now-PSU-bearing design; `survey_metadata` is recomputed to reflect the injected PSU. When both `cluster=X` AND `survey_design.psu=Y` are set, the explicit PSU wins via `_resolve_effective_cluster` (emits `UserWarning` if partitions differ). **`cluster= + SurveyDesign(replicate_weights=[...])` raises `NotImplementedError`**: replicate-weight variance is computed by replicate reweighting (BRR / Fay / JK1 / JKn / SDR) and ignores PSU/cluster entirely (`survey.py:104-109` enforces replicate_weights are mutually exclusive with strata/psu/fpc); honoring bare `cluster=` would silently have no effect while populating `cluster_name`/`n_clusters` on Results dishonestly. Assertive regression tests pin the fix on both panel and repeated-cross-section paths plus the survey/non-survey contract boundaries: `test_cluster_robust_ses_differ_from_unit_level`, `test_bare_cluster_works_with_panel_false_rcs`, `test_bare_cluster_synthesizes_survey_design`, `test_inject_branch_panel_mover_raises`, `test_replicate_weight_plus_cluster_rejected`, `test_bare_cluster_populates_df_inference` (asserts the dedicated cluster-df carrier is set), `test_bare_cluster_does_not_set_survey_metadata` (asserts the survey/non-survey contract is preserved — DiagnosticReport / summary() must not treat a bare-cluster fit as survey-backed), `test_explicit_survey_design_does_populate_survey_metadata` (asserts the inject-branch path still populates survey_metadata for legitimate user-provided SurveyDesign), and `test_bare_cluster_honest_did_uses_df_inference` (end-to-end: HonestDiD threads df_inference into HonestDiDResults.df_survey, preventing silent normal-theory regression on a future refactor). When `cluster=None` (default), behavior is bit-equal to pre-PR (wiring guarded by `if self.cluster is not None:`). Audit verified the no-op was CS-specific — the other 7 Phase 1b estimators (SunAbraham, StackedDiD, WooldridgeDiD, ImputationDiD, TripleDifference, TwoStageDiD, EfficientDiD) handle bare `cluster=` correctly. ### Added -- **TROP methodology-review-tracker promotion: In Progress → Complete.** Closes the Athey, Imbens, Qu & Viviano (2025) *Triply Robust Panel Estimators* (arXiv:2508.21536) primary-source review on the methodology-review tracker. PR-A (the paper review on file at `docs/methodology/papers/athey-2025-review.md`) was previously merged as #443; this PR is the F.L.I.P. consolidation — new `tests/test_methodology_trop.py` with paper-equation-numbered Verified Components walk-through (10 classes, 36 tests covering Eq. 2 soft-threshold SVD prox / plain prox-gradient monotonicity on a toy setup / weighted-prox solver (the shipped accelerated FISTA outer loop is NOT directly tested for per-step monotonicity because Nesterov momentum does not guarantee it), Eq. 3 unit + time weights, Eqs. 4-5 + Algorithm 1 LOOCV with two-stage cycling, Corollary 1 three-condition unbiasedness, Theorem 5.1 MC-ranking realisation of the triply-robust bias bound, Section 2.2 DID + MC reductions, Eq. 13 + Algorithm 2 per-(i, t) estimation, Algorithm 3 stratified pairs bootstrap, Section 3 / Eq. 6 factor-DGP recovery, plus a `TestTROPDeviations` class locking 11 documented library deviations). Migrated from `tests/test_trop.py`: `TestMethodologyVerification` (5 tests → `TestTROPEquation6FactorDGPRecovery`), four paper-conformance tests + one weighted-solver convergence test from `TestPaperConformanceFixes` (→ `TestTROPEquation3Weights` / `TestTROPAlgorithm1LOOCV` / `TestTROPNuclearNormProx` / `TestTROPAlgorithm3Bootstrap`), three prox / plain prox-gradient monotonicity / weighted-objective tests from `TestTROPNuclearNormSolver` (→ `TestTROPNuclearNormProx`), plus a cycling-convergence test from `TestCyclingSearch` and the factor-DGP smoke from `TestTROPvsSDID`; the `TestPaperConformanceFixes` and `TestTROPvsSDID` shells are deleted. `TestTROPNuclearNormSolver` retains its single defensive `test_zero_weights_no_division_error`. `METHODOLOGY_REVIEW.md` TROP row promoted with merge date 2026-05-24, full Verified Components / Test Coverage / Deviations / Outstanding Concerns / R Parity structure mirroring HAD (PR #473) / ContinuousDiD (PR #476) / DCDH (PR #481) / WooldridgeDiD (PR #486) precedents. **Documented deviations:** Gap #5 (unnormalised weights match Eq. 2, not Section 5 sum-to-one), Gap #9 (unbalanced panels supported beyond paper's balanced-panel assumption), rank selection is implicit via nuclear-norm soft-thresholding with no discrete `rank_selection` constructor parameter (matches paper Section 5.3 + Appendix; corrects an earlier REGISTRY overclaim that listed cv / ic / elbow methods), `λ_nn = ∞` → 1e10 internal sentinel with original-value storage on results. **Outstanding Concerns (deferred):** Equation 14 covariate extension (`TROP.fit()` does not accept a `covariates` kwarg; non-support locked by `TestTROPDeviations::test_covariates_not_supported` via `inspect.signature` to guard against future `**kwargs`) and Theorem 8.1 (covariate triple robustness) deferred until use cases motivate; SC / SDID reductions paper-claimed under "specific (omega, theta) weight choices" not provided in the paper text — cross-language anchor deferred. **R parity:** deferred until paper-author reference implementation is released ("forthcoming" per the paper). REGISTRY.md `## TROP` section gains a "Verified Components" expansion: 10 ticked requirements + four `**Note:**` / `**Note (library-side choice):**` / `**Note (deferral):**` annotations consolidating the deviation surface (Eq. 10 balancing-decomposition pointer, Gap #5 weight-normalisation library-side choice, Eq. 14 covariate deferral). No source-code changes to `diff_diff/trop*.py`. Methodology sign-off scope: paper-aligned identification ingredients (Eq. 2 prox, Eq. 3 weights, Eqs. 4-5 LOOCV, Algorithms 1-3, Corollary 1 unbiasedness, Eq. 6 simulation recovery, DID reduction, documented deviations) are directly locked by the new tests. Theorem 5.1 is verified as a simulation sanity check (TROP RMSE < DID RMSE under LOOCV-tuned weights), NOT as a direct fixed-weight conditional-bias-bound lock; the Matrix Completion reduction is verified as code-path activation (effective_rank > 0 + beats DID baseline), NOT as equivalence against an independent MC reference. The Eq. 14 covariate extension is documented as deferred (TROP.fit() has no `covariates` kwarg). +- **New tutorial: SpilloverDiD on synthetic TVA-style spillover panel (Butts 2021 §4 analogue) with spillover-bandwidth sensitivity and Conley spatial-HAC inference.** `docs/tutorials/23_spillover_tva.ipynb` walks a practitioner through the SpilloverDiD workflow on a 4-period 200-unit panel laid out as treated cluster + near-control band + far-control band. The DGP is tuned at locked seed 23 (`n_treated=25, n_near=120, n_far=55, tau_total=-7.4, delta_1=-4.5, d_bar=100 km`) so naive multi-period TWFE on the full sample understates the direct effect by ~42% — matching the Butts (2021) §4 Table 1 Panel A bias-correction direction documented at `docs/methodology/papers/butts-2021-review.md:257`. SpilloverDiD with `rings=[0.0, 100.0]` cleanly recovers both `tau_total = -7.34` and `delta_1 = -4.53`. The tutorial covers (1) problem framing with TVA / Kline-Moretti (2014) citation, (2) panel construction with the DGP equation inline, (3) naive headline and the bias mechanism, (4) `rings` sensitivity grid at outer edges 50/100/150/200 km (estimates stabilize once `d_bar` covers the true spillover horizon), (5) the headline `SpilloverDiD` fit, (6) Conley spatial-HAC variance via `vcov_type="conley", conley_cutoff_km=100, conley_lag_cutoff∈{0,1}` — the cutoff = `d_bar` choice follows Butts §3.1, while the `conley_lag_cutoff` serial term is the library's documented Wave E.2 follow-up synthesis with Newey-West-style serial Bartlett HAC (per REGISTRY "Variance (Wave E.2 follow-up)") — showcasing the Spillover-Conley work shipped through PRs #468 / #474 / #477 / #482 / #485 / #489. Drift detection in `tests/test_t23_spillover_tva_drift.py` (21 function-level tests, T19 pattern) pins panel composition, geographic bands, true parameters, naive coefficient endpoint (round-to-2 — well-conditioned MultiPeriodDiD fit, BLAS-stable to better than 0.005), recovery endpoints (round-to-2 on `tau_total = -7.34` and `delta_1 = -4.53`), seed-specific geometry numbers (max distance from origin, band diameters, cross-band max pair, within-band median + within-100km pair fractions), sensitivity grid `rings=[0, 50]` endpoint at round-to-1 (per reviewer guidance for BLAS safety on the borderline-rank-deficient point), notebook §2 constant-sync + AST-body sync against the test fixture, exact `tau_total` AND `delta_1` identity across the d_bar ∈ {100, 150, 200} grid (on THIS synthetic DGP only, because no units lie in the 80-200 km band — once `d_bar` covers the true 100 km spillover horizon, the additional ring bins are empty and contribute zero observations; this is NOT a generic Butts §4 result and the registry frames `d_bar` as a real bias/variance tradeoff in the general case), Conley SE divergence from HC1 (direction-pinned: Conley < HC1 on this DGP), and the platform-agnostic post-filter warning surface (T19 pattern: mirrors the notebook's narrow `.*encountered in matmul` `RuntimeWarning` filter inside the capture block and asserts no warnings remain; on Apple Silicon M4 + numpy<2.3 the three known Accelerate BLAS matmul warnings — documented at `TODO.md` "RuntimeWarnings in Linear Algebra Operations" — fire and are filtered, on M3 / Intel / Linux or numpy>=2.3 the filter is a no-op, EITHER WAY any UserWarning / FutureWarning / non-matmul RuntimeWarning surfaces immediately). Per `feedback_t19_drift_guards_test_file_only`, ZERO in-notebook asserts — all numerical guards live in the test file. Per `feedback_notebook_workflow`, the DGP was developed and locked in a temporary `_scratch/` script (gitignored) before being pasted into the notebook §2 cell and duplicated into the drift-test `panel` fixture. `docs/index.rst` toctree updated; `docs/references.rst` gains the Kline-Moretti (2014) entry; `docs/methodology/papers/butts-2021-review.md:257` cross-reference updated from "T22 tutorial" to "T23 tutorial" (slot 22 was occupied by `22_had_survey_design.ipynb`). The `SpilloverDiD T22 TVA tutorial` row in `TODO.md` (renumbered to T23 at delivery) is dropped. - **TripleDifference `vcov_type` input contract (Phase 1b interstitial #2, permanently narrow).** `TripleDifference(vcov_type=...)` now accepts `{"hc1"}` only (default). The analytical-sandwich families `{classical, hc2, hc2_bm}` and `conley` spatial-HAC are REJECTED at `__init__` with methodology-rooted messages mirroring the CS interstitial. The rejection is **library-architectural, not paper-prescribed**: TripleDifference uses influence-function-based variance per Ortiz-Villavicencio & Sant'Anna (2025) arXiv:2505.09942 — the 3-pairwise-DiD decomposition `inf = w3·IF_3 + w2·IF_2 - w1·IF_1` has no single design matrix to compute hat-matrix leverage `1/(1-h_ii)` or Bell-McCaffrey Satterthwaite DOF on. The narrow contract is permanent and applies to the remaining IF-based estimators (`ImputationDiD`, `EfficientDiD`) when their `vcov_type` threading PRs land. `hc1` with `cluster=None` ≡ per-unit IF variance (`std(inf)/sqrt(n)`); `hc1` with `cluster=X` ≡ CR1 Liang-Zeger on the combined IF (`(G/(G-1)) · Σ_c (Σ_{i∈c} ψ_i)² / n²`, plain CR1 — no Stata-style `(n-1)/(n-p)` finite-sample factor because the IF has no design-matrix `p` in the OLS sense); `hc1` with `survey_design=` ≡ TSL on the combined IF (analytical or replicate). All three paths are unchanged at machine precision (default behavior bit-equal across all 3 estimation methods `{dr, reg, ipw}`). `vcov_type` and `cluster_name` fields added to `TripleDifferenceResults`, threaded through `to_dict()`. `summary()` routes the variance-family label through the shared `_format_vcov_label` (`results.py:49-89`): bare fits render `"HC1 heteroskedasticity-robust"`, clustered fits render `"CR1 cluster-robust at , G="` (since the actual algebra is Liang-Zeger CR1 on the combined IF), and survey-backed fits suppress the variance-estimator line entirely (the Survey Design block already names design + n_psu + df, and the analytical SE is TSL on the combined IF — a raw "hc1" label would misstate the inference path). **`cluster= + SurveyDesign(replicate_weights=[...])` raises `NotImplementedError`** at `fit()`: replicate-weight variance is computed by replicate reweighting (BRR / Fay / JK1 / JKn / SDR) and ignores PSU/cluster entirely; honoring bare `cluster=` would silently have no effect on the variance estimate while populating `cluster_name`/`n_clusters` on Results dishonestly. Mirrors the `CallawaySantAnna` guard from PR #487. Under `survey_design.psu` (non-replicate path) `cluster_name`/`n_clusters` on Results are suppressed (set to None) so they can't misreport the raw cluster argument when the resolver picks the survey PSU instead. `set_params(vcov_type=...)` mirrors CS pattern (mutate-then-validate-at-use, no atomic validation); `fit()` re-validates `vcov_type` at use time so a `set_params(vcov_type="hc4")` mutation surfaces a clear error at fit-time rather than silently propagating to Results metadata. **Interstitial PR #2** (after CS PR #487) rather than full Phase 1b PR 4/8 vcov_type threading — the narrow surface is methodologically dictated by TripleDifference's IF-based variance, not a deferral. New `TestTripleDifferenceVcovType` class in `tests/test_triple_diff.py` covers the 5-surface contract (default/cluster/survey bit-equal, `__init__` rejection per family, `fit()`-time revalidation) plus 8 introspection / convenience-function tests. REGISTRY.md "IF-based variance estimators vs analytical-sandwich estimators" cross-reference section updated to list `TripleDifference` alongside `CallawaySantAnna` in the "Enforced today" tier. Phase 1b PR 4/8 (full `{classical, hc1, hc2, hc2_bm}` threading) resumes on a different estimator (TwoStageDiD) post-merge; the two remaining IF-based estimators (`ImputationDiD`, `EfficientDiD`) follow the same narrow-contract template. - **CallawaySantAnna `vcov_type` input contract (Phase 1b interstitial, permanently narrow).** `CallawaySantAnna(vcov_type=...)` now accepts `{"hc1"}` only (default). The analytical-sandwich families `{classical, hc2, hc2_bm}` and `conley` spatial-HAC are REJECTED at `__init__` with methodology-rooted messages. The rejection is **library-architectural, not paper-prescribed**: CS uses influence-function-based variance per Callaway & Sant'Anna (2021) — per-(g,t) doubly-robust / IPW / outcome-regression structure — and has no single design matrix to compute hat-matrix leverage `1/(1-h_ii)` or Bell-McCaffrey Satterthwaite DOF on. The narrow contract is permanent and applies to other IF-based estimators (ImputationDiD, EfficientDiD) when their `vcov_type` threading PRs land. `hc1` with `cluster=None` ≡ per-unit IF variance (Williams 2000 form); `hc1` with `cluster=X` ≡ CR1 Liang-Zeger on the IF activated via the cluster= wiring fix above. Documentation in `docs/methodology/REGISTRY.md` "IF-based variance estimators vs analytical-sandwich estimators" subsection. `vcov_type`, `cluster_name`, `n_clusters`, `df_inference` added to `CallawaySantAnnaResults` (the canonical PSU column wins for `cluster_name` reporting — `survey_design.psu` when explicit PSU is provided, `self.cluster` when bare cluster synthesizes/injects). `set_params(vcov_type=...)` mirrors SA pattern (mutate-then-refresh `_vcov_type_explicit`, no atomic validation); `fit()` re-validates `vcov_type` at use time so a `set_params(vcov_type="hc4")` mutation surfaces a clear error at fit-time rather than silently propagating to Results metadata. **Interstitial PR** rather than full Phase 1b PR 4/8 vcov_type threading — the narrow surface is methodologically dictated by CS's IF-based variance, not a deferral. Phase 1b PR 4/8 (full {classical, hc1, hc2, hc2_bm} threading) resumes on a different estimator post-merge. - **TripleDifference cluster-changes-SE defensive regression test.** Added `tests/test_triple_diff.py::TestTripleDifferenceClusterDefensive::test_cluster_changes_ses` asserting that `TripleDifference(cluster="state")` produces SE differing from `cluster=None` SE by `>1e-6` on a fixed-seed panel with state-level random effects. Defensive coverage closes a test gap identified during the Phase 1b cluster-wiring audit; TripleDifference's bare-cluster code path (`triple_diff.py:1245-1259`) was already correct but lacked a positive regression test. Mirrors `tests/test_two_stage.py::test_cluster_changes_ses`. diff --git a/TODO.md b/TODO.md index 169968a5..54cd3d69 100644 --- a/TODO.md +++ b/TODO.md @@ -159,7 +159,6 @@ Deferred items from PR reviews that were not addressed before merge. | `SpilloverDiD(vcov_type="conley", conley_lag_cutoff > 0, survey_design=...)` no-effective-PSU serial Bartlett HAC. Wave E.2 follow-up ships the panel-block composition when an effective PSU exists (explicit `survey_design.psu` OR injected via `cluster=` per `_inject_cluster_as_psu`). Weights-only / strata-only survey designs WITHOUT a cluster fallback raise `NotImplementedError` at `SpilloverDiD.fit` post-resolution because under the pseudo-PSU = obs-index fallback each pseudo-PSU appears in exactly one period — the per-PSU serial cross-period loop would silently contribute zero. Fix would either derive a unit-level serial fallback for no-PSU designs (mixes IF allocators with the pseudo-PSU spatial term — needs methodology work) or route the serial loop through `conley_unit` with explicit documentation of the IF-allocator asymmetry. Regression goldens vs the effective-PSU shipped path. | `spillover.py::SpilloverDiD.fit`, `two_stage.py::_compute_stratified_serial_bartlett_meat` | follow-up (Wave E.2 follow-up tail) | Low | | `SpilloverDiD(ring_method="count")` extension. Currently only the nearest-treated-ring specification is exposed. Count-of-treated-in-ring (paper Section 3.2 end) is methodologically supported by Butts but re-introduces functional-form dependence; expose with an explicit kwarg gate and documentation warning. | `spillover.py::SpilloverDiD.fit` | follow-up | Low | | `SpilloverDiD` data-driven `d_bar` selection (Butts 2021b / Butts 2023 JUE Insight cross-validation). | `spillover.py::SpilloverDiD` | follow-up | Low | -| `SpilloverDiD` T22 TVA tutorial (`docs/tutorials/22_spillover_did.ipynb`): synthetic TVA-style DGP reproducing Butts (2021) Section 4 Table 1 Panel A bias-correction direction (~40% understatement). Split from the methodology PR per user-confirmed scope split (2026-05-15). | `docs/tutorials/`, `tests/test_t22_*_drift.py` | follow-up (Wave B) | Medium | | Extend `TwoStageDiD` with Conley vcov as a first-class feature (mirrors Wave A's TWFE/MPD/DiD extension). Currently `TwoStageDiD.__init__` lacks `vcov_type` / `conley_*` kwargs; `SpilloverDiD` works around this by threading Conley directly via `solve_ols` at stage 2. Promoting Conley to TwoStageDiD's API removes the workaround and lets non-spillover users access Conley + Gardner two-stage. | `diff_diff/two_stage.py` | follow-up | Medium | | `SpilloverDiD` sparse cKDTree path for the staggered nearest-treated-distance helper (mirrors the static helper's sparse branch). Currently `_compute_nearest_treated_distance_staggered` always builds dense `(n_units, n_treated_by_onset)` pairwise distance matrices per cohort; on large staggered panels with many cohorts this is avoidable memory/runtime. Add a sparse k-d-tree branch analogous to `_compute_nearest_treated_distance_sparse`, gated on `n > _CONLEY_SPARSE_N_THRESHOLD`. | `spillover.py::_compute_nearest_treated_distance_staggered` | follow-up (Wave B) | Low | | `SpilloverDiDResults` in `DiagnosticReport` dispatch tables. Wave C event-study emits a TwoStageDiD-compatible `event_study_effects: Dict[int, Dict]` alias that `plot_event_study` consumes via the new `reference_period` attribute fallback in `_extract_plot_data`, but `SpilloverDiDResults` is NOT registered in `DiagnosticReport`'s `_APPLICABILITY` / `_PT_METHOD` tables — so `DiagnosticReport(spillover_result)` doesn't currently route to event-study diagnostics. Registering requires (a) deciding which diagnostics apply (parallel trends, pre-trends power, heterogeneity, design-effect) AND (b) adding an end-to-end test. | `diff_diff/diagnostic_report.py::_APPLICABILITY`, `_PT_METHOD` | follow-up (Wave C) | Low | diff --git a/docs/index.rst b/docs/index.rst index 5c500530..f4994db4 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -83,6 +83,7 @@ Quick Links tutorials/20_had_brand_campaign tutorials/21_had_pretest_workflow tutorials/22_had_survey_design + tutorials/23_spillover_tva .. toctree:: :maxdepth: 1 diff --git a/docs/methodology/papers/butts-2021-review.md b/docs/methodology/papers/butts-2021-review.md index 2850fff9..290531fd 100644 --- a/docs/methodology/papers/butts-2021-review.md +++ b/docs/methodology/papers/butts-2021-review.md @@ -254,7 +254,7 @@ where `D^k_{it} := D_i * 1{K_it = k}` and `K_it` is years since treatment turned - Agriculture employment: standard DiD = -5.1% per decade. With spillovers controlled, total effect = -7.4% per decade. **Spillover bias = +2.3 pp; canonical DiD UNDERESTIMATED the agricultural decline by ~40%.** Spillover-on-control coefficients are negative (-3.7%, -1.6%, -3.0%, -1.6% per decade across the 4 distance bins) — consistent with farm-worker out-migration to higher-paying TVA manufacturing jobs. - Manufacturing employment: standard DiD = +5.6% per decade. With spillovers controlled, total effect = +3.5% per decade. **Spillover bias = +2.1 pp; canonical DiD OVERESTIMATED the manufacturing gain by ~40%.** Spillover coefficients are negative (-2.0%, -2.5%, -3.3%, -3.0%), consistent with "urban shadow" effects whereby firms relocate INTO the TVA from neighboring areas. - Interpretation (page 21): "the long-run spillovers cause the original estimates to be about 40 percent too small for agriculture employment and 40 percent too large for manufacturing employment." -- Useful design choice for diff-diff T22 tutorial DGP: a TVA-style bias-correction percentage of ~40% is large enough to be visible without being implausible. +- Useful design choice for diff-diff T23 tutorial DGP: a TVA-style bias-correction percentage of ~40% is large enough to be visible without being implausible. (Tutorial slot 22 went to `22_had_survey_design.ipynb`; SpilloverDiD landed in slot 23 — `docs/tutorials/23_spillover_tva.ipynb`.) ### Empirical illustration — Opportunity Zones (Appendix B) - Application: revisits Chen, Glaeser, and Wessel (2021) on the 2017 federal Opportunity Zone program's effect on home prices. diff --git a/docs/references.rst b/docs/references.rst index 423587b8..fa54071d 100644 --- a/docs/references.rst +++ b/docs/references.rst @@ -188,7 +188,11 @@ Multi-Period and Staggered Adoption - **Butts, K. (2023).** "Difference-in-Differences with Spatial Spillovers." *arXiv:2105.03737v3* (originally posted 2021). https://arxiv.org/abs/2105.03737 - Identifies the ring-indicator estimator implemented in our ``SpilloverDiD`` class. Section 2-3 covers non-staggered timing (Equations 5/6/8); Section 5 covers staggered timing via two-stage Gardner (Table 2). Section 3.1 (page 13) recommends Conley spatial-HAC for inference with cutoff = ``d_bar``. + Identifies the ring-indicator estimator implemented in our ``SpilloverDiD`` class. Section 2-3 covers non-staggered timing (Equations 5/6/8); Section 5 covers staggered timing via two-stage Gardner (Table 2). Section 3.1 (page 13) recommends Conley spatial-HAC for inference with cutoff = ``d_bar``. Section 4 Table 1 Panel A (TVA agriculture) is the empirical anchor for tutorial ``docs/tutorials/23_spillover_tva.ipynb`` (~40% understatement direction). + +- **Kline, P., & Moretti, E. (2014).** "Local Economic Development, Agglomeration Economies, and the Big Push: 100 Years of Evidence from the Tennessee Valley Authority." *The Quarterly Journal of Economics*, 129(1), 275-331. https://doi.org/10.1093/qje/qjt034 + + Original Tennessee Valley Authority application that Butts (2021) §4 revisits to illustrate spillover bias in the canonical place-based-intervention DiD. Cited as the motivating real-world example in ``docs/tutorials/23_spillover_tva.ipynb``. - **Conley, T. G. (1999).** "GMM Estimation with Cross Sectional Dependence." *Journal of Econometrics*, 92(1), 1-45. https://doi.org/10.1016/S0304-4076(98)00084-0 diff --git a/docs/tutorials/23_spillover_tva.ipynb b/docs/tutorials/23_spillover_tva.ipynb new file mode 100644 index 00000000..bda2cbb5 --- /dev/null +++ b/docs/tutorials/23_spillover_tva.ipynb @@ -0,0 +1,583 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "a5f3d36b", + "metadata": {}, + "source": [ + "# Spillover-aware DiD with `SpilloverDiD` \u2014 a TVA-style worked example\n", + "\n", + "**Use this notebook when:** your treatment plausibly affects nearby\n", + "*untreated* units through geographic externalities, and your standard\n", + "DiD specification treats those neighbors as part of the control group.\n", + "\n", + "Place-based interventions (industrial policy, public-works investment,\n", + "vaccination campaigns, retail rollouts, technology subsidies) routinely\n", + "fit this pattern. The canonical reference is Kline & Moretti (2014)'s\n", + "study of the Tennessee Valley Authority (TVA): treated counties received\n", + "federal investment, but adjacent untreated counties absorbed positive\n", + "spillovers via labor flows, supply chains, and population movement.\n", + "Naive DiD that lumps adjacent counties into the control group then\n", + "*underestimates* the direct effect on treated counties, because the\n", + "controls are contaminated by spillover.\n", + "\n", + "Butts (2021) formalizes this with a single-stage regression that\n", + "explicitly models per-distance-band spillover effects on near-controls\n", + "while keeping far-controls as the clean comparison group:\n", + "\n", + "$$ y_{it} = \\mu_i + \\lambda_t + \\tau\\,D_{it} + \\sum_j \\delta_j\\,(1 - D_{it})\\,\\text{Ring}_{it,j} + \\varepsilon_{it} $$\n", + "\n", + "The diff-diff library's `SpilloverDiD` estimator (Butts 2021 + Gardner\n", + "2022 two-stage) implements this on panels with one or more spatial\n", + "rings, recovers the direct effect $\\tau$ and the spillover coefficients\n", + "$\\delta_j$, and supports Conley (1999) spatial-HAC standard errors out\n", + "of the box.\n", + "\n", + "This tutorial walks through the practitioner workflow:\n", + "\n", + "1. Build a synthetic TVA-style panel with known $\\tau$ and $\\delta_1$.\n", + "2. Fit a naive multi-period TWFE and observe the bias.\n", + "3. Choose a spillover bandwidth via theory + a sensitivity grid.\n", + "4. Fit `SpilloverDiD`; verify recovery of $\\tau$ and $\\delta_1$.\n", + "5. Add Conley spatial-HAC variance for robust inference.\n", + "\n", + "**References:** Butts (2021) \u2014 *arXiv:2105.03737*; Kline & Moretti\n", + "(2014) \u2014 *QJE* 129(1); Conley (1999) \u2014 *Journal of Econometrics* 92(1)." + ] + }, + { + "cell_type": "markdown", + "id": "a16e76a4", + "metadata": {}, + "source": [ + "## 2. The synthetic panel\n", + "\n", + "We build a 4-period, 200-unit panel laid out as three geographic bands:\n", + "\n", + "| Band | Count | Location | Treatment |\n", + "|--------------|-------|-----------------------------------------------------|-----------|\n", + "| Treated | 25 | clustered around (0,0); max ~12 km from origin, cluster diameter ~22 km at seed 23 | Yes, from period 3 |\n", + "| Near-control | 120 | ~12-82 km north (0.1\u00b0-0.7\u00b0 latitude) | No (but absorbs spillover) |\n", + "| Far-control | 55 | ~224-331 km north (2.0\u00b0-3.0\u00b0 latitude) | No (clean control) |\n", + "\n", + "True parameters: $\\tau_{\\text{total}} = -7.4$ (the direct effect on\n", + "treated, matching the Butts \u00a74 agriculture-employment magnitude),\n", + "$\\delta_1 = -4.5$ (the per-period spillover on near-controls), and a\n", + "locked random seed of `23`. The data-generating equation is\n", + "\n", + "$$ y_{it} = \\alpha_i + \\lambda_t + \\tau_{\\text{total}}\\,D_{it} + \\delta_1\\,(1 - D_{it})\\,\\text{Ring}_{it,1} + \\mathcal{N}(0, 0.5) $$\n", + "\n", + "We tune the band sizes and $\\delta_1$ so the naive TWFE coefficient\n", + "lands at ~58% of $\\tau_{\\text{total}}$ \u2014 i.e., the ~40% understatement\n", + "direction documented in Butts (2021) \u00a74 Table 1 Panel A." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1712f2bb", + "metadata": {}, + "outputs": [], + "source": [ + "import warnings\n", + "\n", + "import numpy as np\n", + "import pandas as pd\n", + "\n", + "from diff_diff import MultiPeriodDiD, SpilloverDiD\n", + "\n", + "MAIN_SEED = 23\n", + "N_TREATED = 25\n", + "N_NEAR = 120\n", + "N_FAR = 55\n", + "T_PERIODS = 4\n", + "FIRST_TREAT = 3\n", + "TAU_TOTAL = -7.4\n", + "DELTA_1 = -4.5\n", + "D_BAR_KM = 100.0\n", + "NOISE_SD = 0.5\n", + "\n", + "\n", + "def build_t23_panel(seed: int = MAIN_SEED) -> pd.DataFrame:\n", + " \"\"\"4-period TVA-style synthetic panel with known direct + spillover.\n", + "\n", + " Layout: treated cluster + near-control band (within D_BAR_KM) + far-control\n", + " band (clean). Outcomes: y_it = alpha_i + lambda_t + TAU_TOTAL * D_it +\n", + " DELTA_1 * Ring1_it * (1 - D_it) + N(0, NOISE_SD).\n", + " \"\"\"\n", + " rng = np.random.default_rng(seed)\n", + " n_units = N_TREATED + N_NEAR + N_FAR\n", + " units = [f\"u{i:04d}\" for i in range(n_units)]\n", + " alpha = rng.normal(0.0, 1.0, size=n_units)\n", + " lambda_t = np.array([0.0, 0.5, 1.0, 1.5])[:T_PERIODS]\n", + "\n", + " coords = np.empty((n_units, 2))\n", + " is_treated_unit = np.zeros(n_units, dtype=bool)\n", + " is_near_unit = np.zeros(n_units, dtype=bool)\n", + " for i in range(N_TREATED):\n", + " coords[i] = (rng.normal(0, 0.05), rng.normal(0, 0.05))\n", + " is_treated_unit[i] = True\n", + " for i in range(N_TREATED, N_TREATED + N_NEAR):\n", + " coords[i] = (rng.uniform(0.1, 0.7), rng.uniform(-0.3, 0.3))\n", + " is_near_unit[i] = True\n", + " for i in range(N_TREATED + N_NEAR, n_units):\n", + " coords[i] = (rng.uniform(2.0, 3.0), rng.uniform(-0.5, 0.5))\n", + "\n", + " rows = []\n", + " for i, u in enumerate(units):\n", + " for t in range(1, T_PERIODS + 1):\n", + " D_it = int(is_treated_unit[i] and t >= FIRST_TREAT)\n", + " Ring1_it = int(is_near_unit[i] and t >= FIRST_TREAT)\n", + " y = (\n", + " alpha[i]\n", + " + lambda_t[t - 1]\n", + " + TAU_TOTAL * D_it\n", + " + DELTA_1 * Ring1_it * (1 - D_it)\n", + " + rng.normal(0, NOISE_SD)\n", + " )\n", + " rows.append(\n", + " {\n", + " \"unit\": u,\n", + " \"time\": t,\n", + " \"lat\": coords[i, 0],\n", + " \"lon\": coords[i, 1],\n", + " \"ever_treated\": int(is_treated_unit[i]),\n", + " \"D\": D_it,\n", + " \"y\": y,\n", + " }\n", + " )\n", + " return pd.DataFrame(rows)\n", + "\n", + "\n", + "df = build_t23_panel()\n", + "print(f\"Panel: {len(df)} rows, {df['unit'].nunique()} units, T={df['time'].nunique()}\")\n", + "df.head()\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "21aa12c0", + "metadata": {}, + "outputs": [], + "source": [ + "# Quick geographic check: treated cluster, near-control band, far-control band\n", + "band_counts = (\n", + " df.drop_duplicates(\"unit\")\n", + " .assign(\n", + " band=lambda d: np.select(\n", + " [d[\"ever_treated\"] == 1, d[\"lat\"] <= 1.0],\n", + " [\"treated\", \"near\"],\n", + " default=\"far\",\n", + " )\n", + " )\n", + " .groupby(\"band\")\n", + " .size()\n", + " .reindex([\"treated\", \"near\", \"far\"])\n", + ")\n", + "print(band_counts)\n" + ] + }, + { + "cell_type": "markdown", + "id": "8be4e7c9", + "metadata": {}, + "source": [ + "## 3. The naive headline \u2014 multi-period TWFE on the full sample\n", + "\n", + "A practitioner reaching for a default DiD specification fits a\n", + "multi-period TWFE on the full panel, treating ALL non-treated-cluster\n", + "units as controls \u2014 both the near-controls (which absorb spillover)\n", + "and the far-controls (which don't).\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0c563f94", + "metadata": {}, + "outputs": [], + "source": [ + "naive = MultiPeriodDiD()\n", + "with warnings.catch_warnings():\n", + " # absorb=['unit'] makes the unit-invariant 'ever_treated' indicator\n", + " # perfectly collinear with unit FE; MultiPeriodDiD drops it (with a\n", + " # UserWarning) and identifies the ATT through the ever_treated \u00d7 post\n", + " # interaction columns. This is the expected TWFE specification.\n", + " warnings.filterwarnings(\"ignore\", category=UserWarning, message=\"Rank-deficient design matrix\")\n", + " naive_res = naive.fit(\n", + " df,\n", + " outcome=\"y\",\n", + " treatment=\"ever_treated\",\n", + " time=\"time\",\n", + " post_periods=[3, 4],\n", + " unit=\"unit\",\n", + " absorb=[\"unit\"],\n", + " reference_period=2, # explicit pre-period; matches the current MPD default\n", + " )\n", + "\n", + "print(f\"Naive TWFE ATT: {naive_res.att:.4f} (true tau_total = {TAU_TOTAL})\")\n", + "print(f\" SE: {naive_res.se:.4f}\")\n", + "print(f\" Ratio to true: {naive_res.att / TAU_TOTAL:.3f}\")\n", + "print(f\" Understatement: {(1 - naive_res.att / TAU_TOTAL) * 100:.1f}%\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "7253fe4c", + "metadata": {}, + "source": [ + "The naive estimate is roughly **-4.29**, about 58% of the true\n", + "$\\tau_{\\text{total}} = -7.4$ \u2014 a **~42% understatement** that matches\n", + "the Butts (2021) \u00a74 agriculture direction. The mechanism is\n", + "straightforward: with $D_{i,t} = 1$ only for the treated cluster, the\n", + "near-control band contributes negative outcomes in the post-period (via\n", + "$\\delta_1 = -4.5$) into the *control* arm. The TWFE within-comparison\n", + "then sees a smaller pre/post gap on the treatment side relative to the\n", + "contaminated control side, and reports a deflated direct effect.\n", + "\n", + "This is exactly the bias `SpilloverDiD` is designed to remove." + ] + }, + { + "cell_type": "markdown", + "id": "cb77d87a", + "metadata": {}, + "source": [ + "## 4. Choosing the spillover bandwidth\n", + "\n", + "`SpilloverDiD` is parameterized by `rings=[r_0, r_1, ..., r_K]` where\n", + "the $j$-th ring covers distance band $[r_{j-1}, r_j]$ km. The\n", + "outermost ring edge `max(rings)` defines the spillover horizon\n", + "(`d_bar`); units outside that horizon are treated as the clean control\n", + "group. For a single near-control band, the simplest spec is\n", + "`rings=[0.0, d_bar]` \u2014 one ring, one $\\delta_1$ coefficient.\n", + "\n", + "Choosing `d_bar` is a design decision driven by (i) theory about the\n", + "plausible spillover range and (ii) a sensitivity check that you have\n", + "not under- or over-stated it. Below we vary the outer edge across 50,\n", + "100, 150, 200 km and report the recovered $\\tau$ and $\\delta_1$ at\n", + "each.\n", + "\n", + "> **API note:** `SpilloverDiD` accepts an optional explicit `d_bar`\n", + "> kwarg, but if set it must equal `max(rings)`. We omit it and let the\n", + "> default (`d_bar = max(rings)`) apply." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4804334d", + "metadata": {}, + "outputs": [], + "source": [ + "sensitivity = []\n", + "for outer in (50.0, 100.0, 150.0, 200.0):\n", + " est = SpilloverDiD(rings=[0.0, outer], conley_coords=(\"lat\", \"lon\"))\n", + " with warnings.catch_warnings():\n", + " # Narrow filter for the Apple Silicon M4 + numpy<2.3 Accelerate\n", + " # BLAS RuntimeWarnings (\"divide by zero\" / \"overflow\" / \"invalid\"\n", + " # in matmul) documented at TODO.md \"RuntimeWarnings in Linear\n", + " # Algebra Operations\". These do NOT fire on M3 / Intel / Linux\n", + " # or numpy>=2.3, so this filter is a no-op there. Values are\n", + " # recovered correctly on every platform.\n", + " warnings.filterwarnings(\n", + " \"ignore\", category=RuntimeWarning, message=\".*encountered in matmul\"\n", + " )\n", + " res = est.fit(df, outcome=\"y\", unit=\"unit\", time=\"time\", treatment=\"D\")\n", + " delta1 = float(res.spillover_effects.iloc[0][\"coef\"])\n", + " sensitivity.append({\"d_bar_km\": outer, \"tau_total\": res.att, \"delta_1\": delta1})\n", + "\n", + "sens_df = pd.DataFrame(sensitivity)\n", + "print(sens_df.to_string(index=False, float_format=lambda x: f\"{x:>8.4f}\"))\n" + ] + }, + { + "cell_type": "markdown", + "id": "787428d1", + "metadata": {}, + "source": [ + "At `d_bar = 50` km the ring is too narrow: near-controls in the\n", + "50-78 km band ARE exposed in the DGP (they're within the true 100 km\n", + "spillover horizon and carry $\\delta_1 = -4.5$), but the ring spec\n", + "misclassifies them as far-away clean controls. Both estimates suffer\n", + "from the misspecification \u2014 $\\tau$ deflates to ~-5.4 because the\n", + "\"clean control\" arm now contains genuinely-affected units, and the\n", + "spillover coefficient $\\delta_1$ attenuates to ~-2.6 because the\n", + "$S = 1$ ring averages 50-78 km exposure into the cleaner\n", + "$S = 0$ comparison. This is the registry-documented failure mode for\n", + "undershooting `d_bar`. (The exact values are quoted to one decimal\n", + "here rather than two because the `d_bar = 50` design is borderline-\n", + "rank-deficient \u2014 the two-decimal value can shift slightly across BLAS\n", + "paths even at the same locked seed.)\n", + "\n", + "At `d_bar = 100` km the ring covers the entire DGP near-control band\n", + "(0.1\u00b0-0.7\u00b0 latitude \u2248 11-78 km). $\\tau$ recovers to -7.34 and\n", + "$\\delta_1$ to -4.53 \u2014 both within 1% of the truth.\n", + "\n", + "At `d_bar = 150` km and `d_bar = 200` km, the estimates are *identical*\n", + "to `d_bar = 100`. This is by DGP design: there are no units in the\n", + "80-200 km band, so widening the ring adds zero observations and\n", + "zero new information. Once `d_bar` covers the true spillover horizon,\n", + "further widening is benign on this panel.\n", + "\n", + "**Verdict:** `d_bar = 100` km is the right choice here, balancing\n", + "\"capture all spillover\" against \"don't sweep clean controls into the\n", + "spillover bin\". On real data, a similar sensitivity grid plus paper /\n", + "theory guidance is the practitioner workflow Butts (2021) \u00a73.1\n", + "recommends." + ] + }, + { + "cell_type": "markdown", + "id": "ce1cb17e", + "metadata": {}, + "source": [ + "## 5. Fit `SpilloverDiD` and interpret\n", + "\n", + "With `d_bar = 100`, the headline fit recovers both the direct effect\n", + "and the spillover. We compare side-by-side with the naive TWFE from \u00a73.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4558bcfe", + "metadata": {}, + "outputs": [], + "source": [ + "est = SpilloverDiD(rings=[0.0, D_BAR_KM], conley_coords=(\"lat\", \"lon\"))\n", + "with warnings.catch_warnings():\n", + " # See \u00a74 for the rationale on this narrow matmul-RuntimeWarning filter.\n", + " warnings.filterwarnings(\n", + " \"ignore\", category=RuntimeWarning, message=\".*encountered in matmul\"\n", + " )\n", + " spill = est.fit(df, outcome=\"y\", unit=\"unit\", time=\"time\", treatment=\"D\")\n", + "\n", + "delta_1_row = spill.spillover_effects.iloc[0]\n", + "\n", + "comparison = pd.DataFrame(\n", + " {\n", + " \"estimate\": [naive_res.att, spill.att, delta_1_row[\"coef\"]],\n", + " \"se\": [naive_res.se, spill.se, delta_1_row[\"se\"]],\n", + " \"true value\": [TAU_TOTAL, TAU_TOTAL, DELTA_1],\n", + " },\n", + " index=[\"naive TWFE (direct, biased)\", \"SpilloverDiD tau_total\", \"SpilloverDiD delta_1\"],\n", + ")\n", + "print(comparison.to_string(float_format=lambda x: f\"{x:>8.4f}\"))\n" + ] + }, + { + "cell_type": "markdown", + "id": "e28652ea", + "metadata": {}, + "source": [ + "With the spillover term in the regression, `SpilloverDiD` cleanly\n", + "separates the direct effect (-7.34, very close to true -7.4) from the\n", + "spillover on near-controls (-4.53, very close to true -4.5). The\n", + "spillover-aware estimator removes the bias the naive TWFE suffered.\n", + "\n", + "The interpretation is straightforward: every period in the post window,\n", + "treated units lose 7.4 units of $y$ on average from the policy, AND\n", + "every near-control unit (within 100 km of any treated unit) loses 4.5\n", + "units of $y$ from the spillover. A standard DiD that ignored the\n", + "spillover would report only the contaminated -4.3 and miss the\n", + "spillover-on-controls entirely." + ] + }, + { + "cell_type": "markdown", + "id": "2e005043", + "metadata": {}, + "source": [ + "## 6. Robust inference with Conley spatial-HAC\n", + "\n", + "The HC1 standard error in \u00a75 treats each observation's score as\n", + "independent of every other. On spatially-correlated panels \u2014 and this\n", + "is the canonical case for place-based interventions \u2014 that\n", + "independence assumption is violated. Conley (1999) provides a\n", + "spatial-HAC variance estimator that downweights pairs by their\n", + "geographic distance through a Bartlett kernel out to a `conley_cutoff_km`\n", + "bandwidth. Newey-West (1987) provides the analogous serial term via\n", + "`conley_lag_cutoff`.\n", + "\n", + "Per Butts (2021) \u00a73.1, the Conley bandwidth is typically chosen to\n", + "match the spillover horizon `d_bar`. Below we compare the \u00a75 HC1 SE\n", + "against Conley with `conley_cutoff_km = d_bar = 100`, both with\n", + "`conley_lag_cutoff = 0` (spatial only) and `conley_lag_cutoff = 1` (add\n", + "within-unit serial correlation across consecutive periods).\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "25d20f8c", + "metadata": {}, + "outputs": [], + "source": [ + "def fit_with(vcov_type, lag=None):\n", + " kwargs = dict(rings=[0.0, D_BAR_KM], conley_coords=(\"lat\", \"lon\"), vcov_type=vcov_type)\n", + " if vcov_type == \"conley\":\n", + " kwargs[\"conley_cutoff_km\"] = D_BAR_KM\n", + " kwargs[\"conley_lag_cutoff\"] = lag\n", + " est = SpilloverDiD(**kwargs)\n", + " with warnings.catch_warnings():\n", + " # See \u00a74 for the rationale on this narrow matmul-RuntimeWarning filter.\n", + " warnings.filterwarnings(\n", + " \"ignore\", category=RuntimeWarning, message=\".*encountered in matmul\"\n", + " )\n", + " return est.fit(df, outcome=\"y\", unit=\"unit\", time=\"time\", treatment=\"D\")\n", + "\n", + "\n", + "res_hc1 = fit_with(\"hc1\")\n", + "res_c0 = fit_with(\"conley\", lag=0)\n", + "res_c1 = fit_with(\"conley\", lag=1)\n", + "\n", + "se_table = pd.DataFrame(\n", + " {\n", + " \"tau_total\": [res_hc1.att, res_c0.att, res_c1.att],\n", + " \"SE\": [res_hc1.se, res_c0.se, res_c1.se],\n", + " },\n", + " index=[\"HC1\", \"Conley (lag=0, spatial only)\", \"Conley (lag=1, spatial + serial)\"],\n", + ")\n", + "print(se_table.to_string(float_format=lambda x: f\"{x:>8.4f}\"))\n" + ] + }, + { + "cell_type": "markdown", + "id": "e6d5ee1b", + "metadata": {}, + "source": [ + "Point estimates are identical across all three rows \u2014 the variance\n", + "choice doesn't move the coefficients. But the standard errors differ:\n", + "on this DGP, the Conley spatial-HAC SE comes in *lower* than HC1.\n", + "\n", + "The mechanism is the pairwise covariance structure of the\n", + "influence-function score matrix. With `conley_cutoff_km = 100`, the\n", + "spatial term in the no-survey Conley meat sums Bartlett-weighted\n", + "score cross-products over every within-period observation pair whose\n", + "units lie within 100 km **of each other** (support is pairwise, not\n", + "relative to the treated cluster \u2014 no survey-design PSU aggregation\n", + "here, the no-survey path operates on the observation-level Wave D\n", + "Gardner GMM-corrected influence rows $\\psi_i$). On this DGP that\n", + "means non-zero contribution from:\n", + "\n", + "- treated \u00d7 treated pairs (cluster diameter ~22 km at seed 23),\n", + "- treated \u00d7 near-control pairs (max pairwise separation ~90 km),\n", + "- near \u00d7 near pairs (100% of within-band pairs are within 100 km), and\n", + "- far \u00d7 far pairs **within** the far-control band (max within-band\n", + " pairwise distance is ~131 km, but ~95% of within-band pair distances\n", + " are within 100 km of each other \u2014 median far/far pairwise distance\n", + " is ~56 km).\n", + "\n", + "Cross-band pairs at long distance \u2014 treated \u00d7 far and near \u00d7 far \u2014\n", + "sit outside the 100 km support and contribute zero kernel weight.\n", + "The Conley meat is the sum of the observation-level diagonal\n", + "self-product terms $\\psi_i \\psi_i'$ PLUS the Bartlett-weighted\n", + "off-diagonal score cross-products $K(d_{ij}/h)\\,\\psi_i \\psi_j'$\n", + "from all the within-support pair classes enumerated above; the ATT\n", + "variance is then the corresponding sandwich entry. (Note: the\n", + "no-survey Conley path does NOT apply an HC1 `n/(n-p)` finite-sample\n", + "multiplier \u2014 that lives on the HC1 path only. Whether the resulting\n", + "sandwich ends up above or below HC1 depends on both the missing\n", + "multiplier AND the net sign and magnitude of the weighted\n", + "off-diagonal cross-products under the specific DGP and design.) On\n", + "this seed they net out to a smaller ATT variance, so Conley < HC1,\n", + "but the direction is a feature of the data and the sign mechanism\n", + "is not easily attributable to any one pair class. Adding the serial\n", + "term (`lag = 1`) further reshapes the SE through a separate\n", + "within-unit cross-period score covariance sum.\n", + "\n", + "The takeaway: Conley vcov is a *correction*. It can move the SE in\n", + "either direction relative to HC1 depending on the residual covariance\n", + "structure within the kernel's support, but it's the methodologically\n", + "correct choice when scores are spatially or serially correlated. On\n", + "real applications the correction often inflates the SE (and the\n", + "t-stat shrinks), but the direction is a feature of the data, not a\n", + "guarantee. See the `ConleySpatialHAC` and `SpilloverDiD` sections of\n", + "`docs/methodology/REGISTRY.md` for the full variance formula." + ] + }, + { + "cell_type": "markdown", + "id": "5cc89608", + "metadata": {}, + "source": [ + "## 7. Practitioner takeaways and where to go next\n", + "\n", + "**When to reach for `SpilloverDiD`.** Whenever your treatment has a\n", + "plausible geographic externality \u2014 labor-market spillovers,\n", + "agglomeration effects, disease transmission, supply-chain ripples,\n", + "peer effects in retail or technology adoption \u2014 a naive DiD that\n", + "treats nearby units as controls will give you a biased estimate of the\n", + "direct effect. The magnitude of the bias is roughly\n", + "$(\\text{near share}) \\times \\delta_1$ in absolute value (here:\n", + "$(120/175) \\times 4.5 \\approx 3.1$, matching the observed\n", + "-4.3 vs -7.4 gap).\n", + "\n", + "**How to choose `d_bar`.** Start from theory: what's the plausible\n", + "spatial range of the spillover mechanism? Triangulate with a\n", + "sensitivity grid (\u00a74 above). The estimate stabilizes once `d_bar`\n", + "covers the true horizon; further widening is benign unless it sweeps\n", + "clean controls into the spillover bin.\n", + "\n", + "**When Conley matters.** Whenever your residuals have spatial or\n", + "serial correlation. On clustered-treatment designs (treatment radius\n", + "$\\ll$ panel extent) the spatial-HAC adjustment can move the SE in\n", + "either direction relative to HC1; the direction is a feature of the\n", + "DGP. Per Butts (2021) \u00a73.1, set `conley_cutoff_km = d_bar` as the\n", + "default and run a sensitivity grid on the cutoff if the SE matters\n", + "for your inference.\n", + "\n", + "**Extensions not covered here.**\n", + "- *Multiple rings* \u2014 pass `rings=[0, 50, 100]` for separate near-vs-mid\n", + " spillover coefficients $\\delta_1, \\delta_2$ when theory or\n", + " diagnostics suggest distance-graded externalities.\n", + "- *Staggered treatment* \u2014 `SpilloverDiD` automatically handles\n", + " unit-specific treatment onsets via the `first_treat` column.\n", + "- *Event study* \u2014 pass `event_study=True` and `horizon_max` for\n", + " per-event-time direct and spillover effects (Wave C; see\n", + " `tutorials/12_two_stage_did.ipynb` for the TwoStageDiD event-study\n", + " pattern this mirrors).\n", + "- *Survey weights* \u2014 `survey_design=SurveyDesign(...)` supports\n", + " H\u00e1jek-weighted point estimates with Binder TSL variance. Combined\n", + " with `vcov_type=\"conley\"` and `conley_lag_cutoff=0` (cross-\n", + " sectional spatial only) this is the Wave E.2 stratified-Conley\n", + " sandwich on PSU totals. The panel-block extension\n", + " `conley_lag_cutoff > 0` (spatial + within-PSU serial Bartlett HAC)\n", + " is the Wave E.2 follow-up library synthesis, and it requires an\n", + " effective PSU on the survey design (either explicit\n", + " `survey_design.psu=` or `cluster=` injected per Wave E.1's\n", + " warn-and-use-PSU pattern). A dedicated `SpilloverDiD`\n", + " survey-design tutorial is on the roadmap.\n", + "\n", + "**References:**\n", + "- Butts, K. (2021). *Difference-in-Differences with Spatial\n", + " Spillovers*. arXiv:2105.03737.\n", + "- Kline, P., & Moretti, E. (2014). *Local Economic Development,\n", + " Agglomeration Economies, and the Big Push: 100 Years of Evidence\n", + " from the Tennessee Valley Authority*. Quarterly Journal of\n", + " Economics 129(1), 275-331.\n", + "- Conley, T. G. (1999). *GMM Estimation with Cross-Sectional\n", + " Dependence*. Journal of Econometrics 92(1), 1-45.\n", + "- Newey, W. K., & West, K. D. (1987). *A Simple, Positive\n", + " Semi-definite, Heteroskedasticity and Autocorrelation Consistent\n", + " Covariance Matrix*. Econometrica 55(3), 703-708.\n", + "- Gerber, I. (2026). *Design-Based Standard Errors for Two-Stage\n", + " Difference-in-Differences*. arXiv:2605.04124." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/tutorials/README.md b/docs/tutorials/README.md index 73f965c9..5c900b7a 100644 --- a/docs/tutorials/README.md +++ b/docs/tutorials/README.md @@ -119,6 +119,14 @@ End-to-end HAD walkthrough on a BRFSS-shape stratified survey design (5 strata x - `did_had_pretest_workflow` on both overall and event-study paths under `survey_design=`, walking the Phase 4.5 C0 QUG-deferred verdict suffix and the stratified-clustered Stute multiplier bootstrap - Companion drift-test file (`tests/test_t22_had_survey_design_drift.py`) +### 23. SpilloverDiD on a TVA-style Spillover Panel (`23_spillover_tva.ipynb`) +Practitioner workflow for `SpilloverDiD` (Butts 2021 ring-indicator estimator + Gardner 2022 two-stage residualize-then-fit) on a synthetic TVA-style panel (4 periods, 200 units = 25 treated + 120 near-control + 55 far-control) reproducing the Butts §4 Table 1 Panel A ~40% understatement direction: +- When place-based interventions cause geographic spillovers, naive multi-period TWFE on the full sample understates the direct effect because near-controls absorb the spillover (here: ATT recovers as -4.29 vs true `tau_total = -7.4`, a 42% understatement) +- `SpilloverDiD(rings=[0.0, 100.0], conley_coords=("lat", "lon"))` cleanly recovers both `tau_total` (-7.34) and the near-band spillover coefficient `delta_1` (-4.53) +- Choosing the spillover bandwidth via a `rings` sensitivity grid at outer edges 50 / 100 / 150 / 200 km, with the documented "undershooting `d_bar`" failure mode at 50 km +- Conley spatial-HAC variance under `vcov_type="conley", conley_cutoff_km=100, conley_lag_cutoff in {0, 1}` — the cutoff = `d_bar` choice follows Butts §3.1, while the `conley_lag_cutoff` serial extension is the library's documented Wave E.2 follow-up synthesis with Newey-West-style serial Bartlett HAC (per REGISTRY "Variance (Wave E.2 follow-up)") +- Companion drift-test file (`tests/test_t23_spillover_tva_drift.py`) + ## Running the Notebooks 1. Install diff-diff with dependencies: diff --git a/tests/test_t23_spillover_tva_drift.py b/tests/test_t23_spillover_tva_drift.py new file mode 100644 index 00000000..153df6ba --- /dev/null +++ b/tests/test_t23_spillover_tva_drift.py @@ -0,0 +1,684 @@ +"""Drift detection for Tutorial 23 (``docs/tutorials/23_spillover_tva.ipynb``). + +The tutorial narrative quotes seed-specific numbers (naive vs. +SpilloverDiD comparison, sensitivity-grid endpoints, HC1 vs. Conley +SE). If library numerics drift (estimator changes, RNG path changes, +BLAS path changes), the prose can go stale silently while +``pytest --nbmake`` still passes — it only checks that the cells +execute without error. + +These asserts re-derive the same numbers using the locked T23 DGP +duplicated below (verbatim from the notebook §2 code cell), then check +them against the values quoted in the tutorial markdown. If a future +change moves any number outside its tolerance band, this test fails +and a maintainer is forced to either update the prose or investigate +the methodology shift before merge. + +T23 is the first SpilloverDiD tutorial. It demonstrates the +``SpilloverDiD`` estimator on a TVA-style synthetic panel reproducing +the Butts (2021) §4 Table 1 Panel A ~40% understatement direction +(naive multi-period TWFE significantly understates the direct effect +when near-controls absorb spillover). The DGP-builder constants below +MUST stay in sync with the corresponding constants in the notebook §2 +code cell; the ``test_dgp_true_parameters_match_quoted`` test catches +silent drift on those values. +""" + +from __future__ import annotations + +import warnings + +import numpy as np +import pandas as pd +import pytest + +from diff_diff import MultiPeriodDiD, SpilloverDiD + +# Locked DGP parameters — must stay in sync with the notebook §2 cell. +MAIN_SEED = 23 +N_TREATED = 25 +N_NEAR = 120 +N_FAR = 55 +T_PERIODS = 4 +FIRST_TREAT = 3 +TAU_TOTAL = -7.4 +DELTA_1 = -4.5 +D_BAR_KM = 100.0 +NOISE_SD = 0.5 + + +def _build_t23_panel(seed: int = MAIN_SEED) -> pd.DataFrame: + """Duplicated verbatim from the notebook §2 code cell. Keep in sync.""" + rng = np.random.default_rng(seed) + n_units = N_TREATED + N_NEAR + N_FAR + units = [f"u{i:04d}" for i in range(n_units)] + alpha = rng.normal(0.0, 1.0, size=n_units) + lambda_t = np.array([0.0, 0.5, 1.0, 1.5])[:T_PERIODS] + + coords = np.empty((n_units, 2)) + is_treated_unit = np.zeros(n_units, dtype=bool) + is_near_unit = np.zeros(n_units, dtype=bool) + for i in range(N_TREATED): + coords[i] = (rng.normal(0, 0.05), rng.normal(0, 0.05)) + is_treated_unit[i] = True + for i in range(N_TREATED, N_TREATED + N_NEAR): + coords[i] = (rng.uniform(0.1, 0.7), rng.uniform(-0.3, 0.3)) + is_near_unit[i] = True + for i in range(N_TREATED + N_NEAR, n_units): + coords[i] = (rng.uniform(2.0, 3.0), rng.uniform(-0.5, 0.5)) + + rows = [] + for i, u in enumerate(units): + for t in range(1, T_PERIODS + 1): + D_it = int(is_treated_unit[i] and t >= FIRST_TREAT) + Ring1_it = int(is_near_unit[i] and t >= FIRST_TREAT) + y = ( + alpha[i] + + lambda_t[t - 1] + + TAU_TOTAL * D_it + + DELTA_1 * Ring1_it * (1 - D_it) + + rng.normal(0, NOISE_SD) + ) + rows.append( + { + "unit": u, + "time": t, + "lat": coords[i, 0], + "lon": coords[i, 1], + "ever_treated": int(is_treated_unit[i]), + "D": D_it, + "y": y, + } + ) + return pd.DataFrame(rows) + + +@pytest.fixture(scope="module") +def panel() -> pd.DataFrame: + return _build_t23_panel() + + +@pytest.fixture(scope="module") +def naive_fit(panel): + est = MultiPeriodDiD() + with warnings.catch_warnings(): + # absorb=['unit'] makes the unit-invariant 'ever_treated' indicator + # perfectly collinear with the unit FE; MultiPeriodDiD drops it + # (with a UserWarning) and identifies the ATT through the + # ever_treated x post interaction columns. This is the expected + # TWFE specification; the rank-deficient drop is benign. + warnings.filterwarnings( + "ignore", category=UserWarning, message="Rank-deficient design matrix" + ) + return est.fit( + panel, + outcome="y", + treatment="ever_treated", + time="time", + post_periods=[3, 4], + unit="unit", + absorb=["unit"], + reference_period=2, # explicit pre-period; matches the current MPD default + ) + + +def _silence_spillover_matmul_warnings(): + """Apply the notebook's narrow ``.*encountered in matmul`` + ``RuntimeWarning`` filter. The three matmul warnings ("divide by + zero" / "overflow" / "invalid value") are an Apple Silicon M4 + + macOS Sequoia + numpy<2.3 Accelerate BLAS artifact documented at + ``TODO.md`` under "RuntimeWarnings in Linear Algebra Operations" + (root cause: Apple BLAS SME kernels corrupt the FP status register; + tracked as numpy#28687, fixed in numpy>=2.3). They DO NOT fire on + M3 / Intel / Linux or numpy>=2.3 — so this filter is a no-op there, + and any platform-specific noise it does silence does not affect + result correctness. + + The post-filter warning surface (zero remaining warnings on the + T23 DGP) is pinned by ``test_spillover_fit_warning_policy_post_filter_clean`` + and ``test_spillover_conley_fit_warning_policy_post_filter_clean``. + A new RuntimeWarning with a different message, or any UserWarning / + FutureWarning, fails those tests.""" + warnings.filterwarnings("ignore", category=RuntimeWarning, message=r".*encountered in matmul") + + +@pytest.fixture(scope="module") +def spillover_fit(panel): + est = SpilloverDiD(rings=[0.0, D_BAR_KM], conley_coords=("lat", "lon")) + with warnings.catch_warnings(): + _silence_spillover_matmul_warnings() + return est.fit(panel, outcome="y", unit="unit", time="time", treatment="D") + + +@pytest.fixture(scope="module") +def spillover_conley_lag0_fit(panel): + est = SpilloverDiD( + rings=[0.0, D_BAR_KM], + conley_coords=("lat", "lon"), + vcov_type="conley", + conley_cutoff_km=D_BAR_KM, + conley_lag_cutoff=0, + ) + with warnings.catch_warnings(): + _silence_spillover_matmul_warnings() + return est.fit(panel, outcome="y", unit="unit", time="time", treatment="D") + + +@pytest.fixture(scope="module") +def spillover_conley_lag1_fit(panel): + est = SpilloverDiD( + rings=[0.0, D_BAR_KM], + conley_coords=("lat", "lon"), + vcov_type="conley", + conley_cutoff_km=D_BAR_KM, + conley_lag_cutoff=1, + ) + with warnings.catch_warnings(): + _silence_spillover_matmul_warnings() + return est.fit(panel, outcome="y", unit="unit", time="time", treatment="D") + + +# ---------------------------------------------------------------------------- +# Tests +# ---------------------------------------------------------------------------- + + +def test_panel_composition(panel): + """800 rows = 200 units × 4 periods; 25 treated, 120 near, 55 far.""" + assert len(panel) == (N_TREATED + N_NEAR + N_FAR) * T_PERIODS == 800 + assert panel["unit"].nunique() == N_TREATED + N_NEAR + N_FAR == 200 + assert panel["time"].nunique() == T_PERIODS == 4 + treated_units = panel.loc[panel["ever_treated"] == 1, "unit"].nunique() + assert treated_units == N_TREATED == 25 + + +def test_panel_geographic_bands(panel): + """Treated cluster within ~10 km of origin (5 sigma * 2 lat/lon), + near-controls in [0.1, 0.7] lat degrees, far-controls in [2.0, 3.0].""" + units = panel.drop_duplicates("unit") + treated = units[units["ever_treated"] == 1] + untreated = units[units["ever_treated"] == 0] + near = untreated[untreated["lat"] <= 1.0] + far = untreated[untreated["lat"] > 1.0] + + assert len(near) == N_NEAR == 120 + assert len(far) == N_FAR == 55 + + # Treated cluster geometry + assert treated["lat"].abs().max() < 0.3 + assert treated["lon"].abs().max() < 0.3 + # Near-control band geometry + assert near["lat"].min() >= 0.1 + assert near["lat"].max() <= 0.7 + # Far-control band geometry + assert far["lat"].min() >= 2.0 + assert far["lat"].max() <= 3.0 + + +def test_dgp_true_parameters_match_quoted(): + """True parameters quoted in the tutorial narrative (§2).""" + assert TAU_TOTAL == -7.4 + assert DELTA_1 == -4.5 + assert D_BAR_KM == 100.0 + assert NOISE_SD == 0.5 + assert MAIN_SEED == 23 + + +def test_estimator_construction_matches_quoted(): + """The §5 fit instantiation parameters must match the docstring narrative.""" + est = SpilloverDiD(rings=[0.0, D_BAR_KM], conley_coords=("lat", "lon")) + params = est.get_params() + assert params["rings"] == [0.0, 100.0] + assert params["d_bar"] is None # auto-default to max(rings) + assert params["conley_coords"] == ("lat", "lon") + assert params["vcov_type"] == "hc1" + + +def test_naive_twfe_understates_tau_total(naive_fit): + """§3 quoted: naive ATT ≈ -4.29, ~58% of true tau_total (~42% understatement).""" + ratio = naive_fit.att / TAU_TOTAL + assert 0.55 <= ratio <= 0.62, f"naive ratio={ratio:.3f} outside [0.55, 0.62] band" + + +def test_naive_att_endpoint_matches_quoted(naive_fit): + """§3 quoted endpoint: 2-decimal pin matching the published `-4.29` + in the notebook, README, and CHANGELOG. The well-conditioned naive + MultiPeriodDiD fit is stable across BLAS paths to better than 0.005, + so 2-decimal pinning is safe (in contrast to the borderline-rank- + deficient `rings=[0,50]` sensitivity point, which we keep at + round-to-1).""" + assert round(naive_fit.att, 2) == -4.29 + + +def test_spillover_did_recovers_tau_total(spillover_fit): + """§5 quoted: SpilloverDiD tau_total = -7.34, recovers true -7.4 + (within 0.5 tolerance bound). Endpoint pinned to 2 decimals + matching the published `-7.34` in the notebook, README, and + CHANGELOG — well-conditioned fit, BLAS-stable at 2 decimals.""" + assert abs(spillover_fit.att - TAU_TOTAL) < 0.5 + assert round(spillover_fit.att, 2) == -7.34 + + +def test_spillover_did_recovers_delta_1(spillover_fit): + """§5 quoted: SpilloverDiD delta_1 = -4.53, recovers true -4.5 + (within 0.5 tolerance bound). Endpoint pinned to 2 decimals + matching the published `-4.53` in the notebook, README, and + CHANGELOG — well-conditioned fit, BLAS-stable at 2 decimals.""" + delta_1 = float(spillover_fit.spillover_effects.iloc[0]["coef"]) + assert abs(delta_1 - DELTA_1) < 0.5 + assert round(delta_1, 2) == -4.53 + + +def test_rings_sensitivity_grid_endpoints(panel): + """§4 quoted: d_bar=50 → tau=-5.4, others (100/150/200) → tau=-7.3. + + Per the plan and reviewer guidance, round-to-1 tolerance is safer + than round-to-2 against BLAS divergence on the borderline-rank-deficient + smallest grid point. + """ + expected_tau = {50.0: -5.4, 100.0: -7.3, 150.0: -7.3, 200.0: -7.3} + expected_delta = {50.0: -2.6, 100.0: -4.5, 150.0: -4.5, 200.0: -4.5} + for outer in (50.0, 100.0, 150.0, 200.0): + est = SpilloverDiD(rings=[0.0, outer], conley_coords=("lat", "lon")) + with warnings.catch_warnings(): + _silence_spillover_matmul_warnings() + res = est.fit(panel, outcome="y", unit="unit", time="time", treatment="D") + assert res.spillover_effects is not None + delta_1 = float(res.spillover_effects.iloc[0]["coef"]) + assert round(res.att, 1) == expected_tau[outer], ( + f"d_bar={outer}: tau={res.att:.4f} (rounded {round(res.att, 1)}) " + f"vs expected {expected_tau[outer]}" + ) + assert round(delta_1, 1) == expected_delta[outer], ( + f"d_bar={outer}: delta_1={delta_1:.4f} (rounded {round(delta_1, 1)}) " + f"vs expected {expected_delta[outer]}" + ) + + +def test_rings_grid_d_bar_100_to_200_identical_delta_1(panel): + """§4 narrative claim covers BOTH coefficients: `d_bar in {100, 150, + 200}` produces identical tau_total AND delta_1 (the test of + tau_total identity lives in `test_rings_grid_d_bar_100_to_200_identical`; + this companion test pins the delta_1 identity so a future drift + that affects only the spillover coefficient can't leave the + 'identical' notebook claim stale).""" + deltas = [] + for outer in (100.0, 150.0, 200.0): + est = SpilloverDiD(rings=[0.0, outer], conley_coords=("lat", "lon")) + with warnings.catch_warnings(): + _silence_spillover_matmul_warnings() + res = est.fit(panel, outcome="y", unit="unit", time="time", treatment="D") + assert res.spillover_effects is not None + deltas.append(float(res.spillover_effects.iloc[0]["coef"])) + np.testing.assert_allclose(deltas, deltas[0] * np.ones(3), atol=1e-10) + + +def test_rings_grid_d_bar_100_to_200_identical(panel): + """§4 narrative claim: once d_bar covers the true spillover horizon + (which here ends at ~78 km), widening past 100 km adds zero + observations to the ring and the estimates are identical.""" + results = [] + for outer in (100.0, 150.0, 200.0): + est = SpilloverDiD(rings=[0.0, outer], conley_coords=("lat", "lon")) + with warnings.catch_warnings(): + _silence_spillover_matmul_warnings() + res = est.fit(panel, outcome="y", unit="unit", time="time", treatment="D") + results.append(res.att) + np.testing.assert_allclose(results, results[0] * np.ones(3), atol=1e-10) + + +def test_conley_se_differs_from_hc1(spillover_fit, spillover_conley_lag0_fit): + """§6 sanity: Conley vcov produces a different SE than HC1 by more than + floating-point noise. Pairs with `test_conley_se_less_than_hc1` which + pins the direction of the difference for this specific DGP.""" + assert abs(spillover_conley_lag0_fit.se - spillover_fit.se) > 1e-6 + + +def test_conley_se_less_than_hc1(spillover_fit, spillover_conley_lag0_fit): + """§6 prose claim: 'on this DGP, the Conley spatial-HAC SE comes in + *lower* than HC1'. Pin the direction so the narrative doesn't go + stale if a future library change flips the sign of the per-pair + score covariance and reverses the inequality.""" + assert spillover_conley_lag0_fit.se < spillover_fit.se + + +def test_conley_se_point_estimates_invariant( + spillover_fit, spillover_conley_lag0_fit, spillover_conley_lag1_fit +): + """§6 narrative claim: variance-type choice doesn't move the point + estimates. tau_total is bit-identical across HC1 / Conley lag=0 / + Conley lag=1 (all paths use the same OLS solve; only the meat + differs).""" + np.testing.assert_allclose( + [spillover_conley_lag0_fit.att, spillover_conley_lag1_fit.att], + spillover_fit.att, + atol=1e-10, + ) + + +def test_conley_lag_cutoff_changes_se_vs_lag_zero( + spillover_conley_lag0_fit, spillover_conley_lag1_fit +): + """§6 sanity: adding the serial term (lag=1) changes the SE relative + to spatial-only (lag=0). Direction-agnostic — on this DGP it + shrinks, on others it can grow.""" + assert abs(spillover_conley_lag1_fit.se - spillover_conley_lag0_fit.se) > 1e-6 + + +def test_summary_renders_without_warning(spillover_fit): + """§5 smoke: the summary() call runs clean on the headline fit.""" + with warnings.catch_warnings(): + warnings.simplefilter("error") + out = spillover_fit.summary() + assert isinstance(out, str) + assert len(out) > 0 + + +def test_notebook_dgp_constants_match_test_module_constants(): + """P2 sync guard: codex R6 caught that `test_notebook_dgp_ast_matches_test_fixture` + only compares the function body, and `test_dgp_true_parameters_match_quoted` + only reasserts the test module's own constants. A notebook-only edit + to `MAIN_SEED`, `N_TREATED`, `N_NEAR`, `N_FAR`, `T_PERIODS`, + `FIRST_TREAT`, `TAU_TOTAL`, `DELTA_1`, `D_BAR_KM`, or `NOISE_SD` + would change tutorial behavior without failing either of those + tests. + + Parses the notebook §2 cell, walks the top-level + ``Assign`` nodes, and asserts the value of each expected constant + matches the test module's value. Any notebook-only constant edit + now fails this guard. + + **CI isolation note:** the project's pure-Python and Rust CI jobs + copy ``tests/`` to an isolated location WITHOUT the ``docs/`` + tree (to verify the installed package doesn't depend on the + source tree). When the notebook source isn't reachable, this + test skips gracefully — the sync-guard contract is meaningful + in local dev (where edits happen) and the nbmake job separately + verifies the notebook executes end-to-end.""" + import ast + import json + from pathlib import Path + + nb_path = Path(__file__).resolve().parents[1] / "docs" / "tutorials" / "23_spillover_tva.ipynb" + if not nb_path.exists(): + pytest.skip( + f"Notebook source not found at {nb_path}; this drift guard " + f"requires source-tree access and is meaningful only in local " + f"dev. The nbmake CI job separately verifies the notebook " + f"executes end-to-end." + ) + with nb_path.open() as f: + nb = json.load(f) + + matches = [ + c + for c in nb["cells"] + if c["cell_type"] == "code" and any("def build_t23_panel" in s for s in c["source"]) + ] + assert len(matches) == 1, ( + f"Expected exactly one notebook code cell defining `build_t23_panel`; " + f"found {len(matches)}." + ) + nb_cell_src = "".join(matches[0]["source"]) + + # Walk top-level Assigns in the cell and collect a constant -> value dict. + nb_consts: dict = {} + tree = ast.parse(nb_cell_src) + for node in tree.body: + if isinstance(node, ast.Assign) and len(node.targets) == 1: + target = node.targets[0] + if isinstance(target, ast.Name): + try: + nb_consts[target.id] = ast.literal_eval(node.value) + except (ValueError, SyntaxError): + pass # non-literal RHS; skip + + expected = { + "MAIN_SEED": MAIN_SEED, + "N_TREATED": N_TREATED, + "N_NEAR": N_NEAR, + "N_FAR": N_FAR, + "T_PERIODS": T_PERIODS, + "FIRST_TREAT": FIRST_TREAT, + "TAU_TOTAL": TAU_TOTAL, + "DELTA_1": DELTA_1, + "D_BAR_KM": D_BAR_KM, + "NOISE_SD": NOISE_SD, + } + missing = [k for k in expected if k not in nb_consts] + assert not missing, ( + f"Notebook §2 cell missing expected constant assignments: {missing}. " + f"If a constant was renamed or moved, update both notebook and test." + ) + mismatched = {k: (nb_consts[k], expected[k]) for k in expected if nb_consts[k] != expected[k]} + assert not mismatched, ( + f"Notebook §2 constants drifted from test module constants: {mismatched}. " + f"Each entry is (notebook_value, test_value). Update both to match." + ) + + +def test_notebook_dgp_ast_matches_test_fixture(): + """P2 sync guard: enforces the "verbatim" duplication claim by + parsing the notebook's §2 ``build_t23_panel`` definition and + asserting that, modulo function name (notebook: ``build_t23_panel``; + test: ``_build_t23_panel``) and docstring, its AST matches the test + fixture's. Catches silent drift in non-constant DGP logic (coordinate + ranges, lambda_t, row construction) that the numerical-value pins + don't see — codex R4 P2 flagged this gap. + + Uses ``ast.dump`` for a whitespace-/comment-agnostic comparison: + semantically identical code matches, cosmetic edits don't trigger + spurious failures. + + **CI isolation note:** like + ``test_notebook_dgp_constants_match_test_module_constants``, this + test skips gracefully when CI's isolated-tests-copy step strips + the ``docs/`` tree. The sync-guard contract is meaningful in + local dev where edits happen.""" + import ast + import inspect + import json + from pathlib import Path + + nb_path = Path(__file__).resolve().parents[1] / "docs" / "tutorials" / "23_spillover_tva.ipynb" + if not nb_path.exists(): + pytest.skip( + f"Notebook source not found at {nb_path}; this drift guard " + f"requires source-tree access and is meaningful only in local " + f"dev. The nbmake CI job separately verifies the notebook " + f"executes end-to-end." + ) + with nb_path.open() as f: + nb = json.load(f) + + matches = [ + c + for c in nb["cells"] + if c["cell_type"] == "code" and any("def build_t23_panel" in s for s in c["source"]) + ] + assert len(matches) == 1, ( + f"Expected exactly one notebook code cell defining `build_t23_panel`; " + f"found {len(matches)}. If you renamed or split the §2 DGP cell, " + f"update this test's cell-locator." + ) + nb_cell_src = "".join(matches[0]["source"]) + + def _extract_normalized_fn(src: str, fn_name: str) -> str: + """Parse `src`, find FunctionDef `fn_name`, strip its docstring, + rename it to the canonical `build_t23_panel`, and return the + normalized AST dump.""" + tree = ast.parse(src) + fn = next( + (n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef) and n.name == fn_name), + None, + ) + assert fn is not None, f"Could not find FunctionDef `{fn_name}` in source" + if ( + fn.body + and isinstance(fn.body[0], ast.Expr) + and isinstance(fn.body[0].value, ast.Constant) + and isinstance(fn.body[0].value.value, str) + ): + fn.body = fn.body[1:] + fn.name = "build_t23_panel" + return ast.dump(fn, annotate_fields=True, include_attributes=False) + + nb_norm = _extract_normalized_fn(nb_cell_src, "build_t23_panel") + test_norm = _extract_normalized_fn(inspect.getsource(_build_t23_panel), "_build_t23_panel") + + assert nb_norm == test_norm, ( + f"Notebook §2 DGP cell drifted from test fixture `_build_t23_panel`.\n" + f"--- notebook AST ---\n{nb_norm[:400]}...\n" + f"--- test AST ---\n{test_norm[:400]}...\n" + f"Update one or both so the function bodies match modulo name + docstring." + ) + + +def test_seed_specific_geometry_pins_match_quoted(panel): + """P3 sync guard: the §2 panel-layout table and §6 within-cutoff + enumeration quote seed-specific geometry numbers (max distance from + origin, cluster diameter, band extents, far×far / near×near + pair-within-100km percentages). The drift test pins all the values + quoted in the notebook so prose can't go stale even if the headline + estimates remain within tolerance — codex R4 P3 flagged this gap.""" + treated = panel[panel["ever_treated"] == 1].drop_duplicates("unit") + near = panel[(panel["ever_treated"] == 0) & (panel["lat"] <= 1.0)].drop_duplicates("unit") + far = panel[(panel["ever_treated"] == 0) & (panel["lat"] > 1.0)].drop_duplicates("unit") + deg_to_km = 111.0 + + def _max_dist_from_origin_km(d): + return float(np.sqrt(d["lat"] ** 2 + d["lon"] ** 2).max() * deg_to_km) + + def _min_dist_from_origin_km(d): + return float(np.sqrt(d["lat"] ** 2 + d["lon"] ** 2).min() * deg_to_km) + + def _band_diameter_km(d): + lats = d["lat"].values + lons = d["lon"].values + diffs = np.sqrt((lats[:, None] - lats[None, :]) ** 2 + (lons[:, None] - lons[None, :]) ** 2) + return float(diffs.max() * deg_to_km) + + def _pct_pairs_within_100km(d): + lats = d["lat"].values + lons = d["lon"].values + n = len(lats) + dist = ( + np.sqrt((lats[:, None] - lats[None, :]) ** 2 + (lons[:, None] - lons[None, :]) ** 2) + * deg_to_km + ) + triu = np.triu(np.ones((n, n), dtype=bool), k=1) + pair_d = dist[triu] + return float((pair_d <= 100.0).sum() / len(pair_d) * 100.0) + + def _cross_band_max_pair_km(d1, d2): + lats1, lons1 = d1["lat"].values, d1["lon"].values + lats2, lons2 = d2["lat"].values, d2["lon"].values + cross = ( + np.sqrt((lats1[:, None] - lats2[None, :]) ** 2 + (lons1[:, None] - lons2[None, :]) ** 2) + * deg_to_km + ) + return float(cross.max()) + + def _within_band_median_pair_km(d): + lats = d["lat"].values + lons = d["lon"].values + n = len(lats) + dist = ( + np.sqrt((lats[:, None] - lats[None, :]) ** 2 + (lons[:, None] - lons[None, :]) ** 2) + * deg_to_km + ) + triu = np.triu(np.ones((n, n), dtype=bool), k=1) + return float(np.median(dist[triu])) + + # §2 quoted: "clustered around (0,0); max ~12 km from origin, cluster diameter ~22 km at seed 23" + assert round(_max_dist_from_origin_km(treated)) == 12, _max_dist_from_origin_km(treated) + assert round(_band_diameter_km(treated)) == 22, _band_diameter_km(treated) + # §2 quoted: "~12-82 km north" + assert round(_min_dist_from_origin_km(near)) == 12, _min_dist_from_origin_km(near) + assert round(_max_dist_from_origin_km(near)) == 82, _max_dist_from_origin_km(near) + # §2 quoted: "~224-331 km north" + assert round(_min_dist_from_origin_km(far)) == 224, _min_dist_from_origin_km(far) + assert round(_max_dist_from_origin_km(far)) == 331, _max_dist_from_origin_km(far) + # §6 quoted: "max within-band pairwise distance is ~131 km" for far band + # (NOT "lat extent" — geometrically the lat extent is ~111 km; 131 km is + # the pairwise band diameter accounting for lon dispersion too) + assert round(_band_diameter_km(far)) == 131, _band_diameter_km(far) + # §6 quoted: "100% of within-band pairs are within 100 km" for near band + assert round(_pct_pairs_within_100km(near)) == 100, _pct_pairs_within_100km(near) + # §6 quoted: "~95% of within-band pair distances are within 100 km" for far band + assert round(_pct_pairs_within_100km(far)) == 95, _pct_pairs_within_100km(far) + # §6 quoted: "treated × near-control pairs (max pairwise separation ~90 km)" + assert round(_cross_band_max_pair_km(treated, near)) == 90, _cross_band_max_pair_km( + treated, near + ) + # §6 quoted: "median far/far pairwise distance is ~56 km" + assert round(_within_band_median_pair_km(far)) == 56, _within_band_median_pair_km(far) + + +def _assert_post_filter_warning_surface_is_clean(captured) -> None: + """Shared T19-style platform-agnostic warning-policy assertion. + + The notebook's narrow ``.*encountered in matmul`` filter (see + `_silence_spillover_matmul_warnings`) silences three Apple Silicon + M4 + numpy<2.3 Accelerate BLAS warnings that are emitted on the + affected platform but DO NOT fire on M3 / Intel / Linux or + numpy>=2.3 (per ``TODO.md`` "RuntimeWarnings in Linear Algebra + Operations"). The drift contract this assertion locks is + platform-agnostic: + + - on platforms where the matmul warnings fire, they get filtered + and never reach the captured list; + - on platforms where they don't fire, the filter is a no-op; + + EITHER WAY the post-filter captured list must be empty. Any + UserWarning, FutureWarning, DeprecationWarning, or RuntimeWarning + with a non-matmul message will fail this assertion and force the + maintainer to either update the notebook narrative or fix the + underlying cause.""" + if not captured: + return + details = [(msg.category.__name__, str(msg.message)) for msg in captured] + assert False, ( + f"Unexpected post-filter warnings on the T23 DGP: {details}. " + f"If a new warning is genuinely expected, broaden " + f"`_silence_spillover_matmul_warnings()` and update the §5/§6 " + f"notebook narrative accordingly." + ) + + +def test_spillover_fit_warning_policy_post_filter_clean(panel): + """§5 warning-policy guard (T19-pattern, platform-agnostic). + + Mirrors the notebook's narrow ``.*encountered in matmul`` filter + inside the capture block, then asserts the post-filter warning + surface is empty on the T23 DGP. On Apple Silicon M4 + numpy<2.3 + the three known BLAS matmul warnings fire and are filtered; on + M3 / Intel / Linux or numpy>=2.3 the filter is a no-op. EITHER + WAY a fresh ``UserWarning`` / ``FutureWarning`` or any non-matmul + ``RuntimeWarning`` will fail this guard.""" + est = SpilloverDiD(rings=[0.0, D_BAR_KM], conley_coords=("lat", "lon")) + with warnings.catch_warnings(record=True) as w: + warnings.simplefilter("always") + _silence_spillover_matmul_warnings() # mirror notebook §5 filter + est.fit(panel, outcome="y", unit="unit", time="time", treatment="D") + _assert_post_filter_warning_surface_is_clean(w) + + +def test_spillover_conley_fit_warning_policy_post_filter_clean(panel): + """§6 warning-policy guard, parallel to §5 but on the Conley path + (vcov_type="conley", conley_cutoff_km=d_bar, conley_lag_cutoff in {0, 1}). + Same T19-style platform-agnostic contract: mirror the notebook + filter inside the capture, assert no remaining warning escaped.""" + for lag in (0, 1): + est = SpilloverDiD( + rings=[0.0, D_BAR_KM], + conley_coords=("lat", "lon"), + vcov_type="conley", + conley_cutoff_km=D_BAR_KM, + conley_lag_cutoff=lag, + ) + with warnings.catch_warnings(record=True) as w: + warnings.simplefilter("always") + _silence_spillover_matmul_warnings() # mirror notebook §6 filter + est.fit(panel, outcome="y", unit="unit", time="time", treatment="D") + _assert_post_filter_warning_surface_is_clean(w)