Skip to content

T23 SpilloverDiD tutorial: TVA-style ~40% understatement walkthrough#493

Merged
igerber merged 22 commits into
mainfrom
spillover-tva-tutorial-t23
May 25, 2026
Merged

T23 SpilloverDiD tutorial: TVA-style ~40% understatement walkthrough#493
igerber merged 22 commits into
mainfrom
spillover-tva-tutorial-t23

Conversation

@igerber
Copy link
Copy Markdown
Owner

@igerber igerber commented May 25, 2026

Summary

Methodology references

  • Method name(s): SpilloverDiD (Butts 2021 ring-indicator estimator + Gardner 2022 two-stage residualize-then-fit + Wave A/E.2 Conley spatial-HAC composition).
  • Paper / source link(s):
    • Butts, K. (2021). Difference-in-Differences with Spatial Spillovers. arXiv:2105.03737 — §4 Table 1 Panel A is the tutorial's empirical anchor (paper review on disk at docs/methodology/papers/butts-2021-review.md:257 endorses the ~40% bias-correction direction as a tutorial-friendly DGP target).
    • Kline, P. & Moretti, E. (2014). Local Economic Development...TVA. QJE 129(1) — original TVA application that Butts (2021) §4 revisits. Added to docs/references.rst in this PR.
    • Conley, T. G. (1999). JoE 92(1) + Newey, W. K. & West, K. D. (1987). Econometrica 55(3) — spatial + serial HAC variance demonstrated in §6.
  • Any intentional deviations from the source: None. This is a tutorial-only PR; no estimator or variance-formula changes. The narrative claims (Conley < HC1 on this DGP, sensitivity-grid identity at d_bar≥100, recovery within tolerance) are tested with exact-pin drift guards; the §6 sign-direction explanation is deliberately agnostic about per-pair score covariance mechanism.

Validation

  • Tests added/updated: tests/test_t23_spillover_tva_drift.py (17 function-level tests, T19 pattern).
  • Notebook evidence: docs/tutorials/23_spillover_tva.ipynb executes end-to-end under pytest --nbmake in ~2 seconds. The DGP-builder is duplicated verbatim into the drift-test panel fixture per the T19 pattern (no _scratch/ shared module — _scratch/ is gitignored); the test_dgp_true_parameters_match_quoted pin catches silent drift between the two copies.
  • Sphinx HTML build with -W warnings as errors passes; the toctree entry under Tutorials: Business Applications (after tutorials/22_had_survey_design) is wired correctly.
  • Local codex review: 10 rounds, ✅ clean on R10 with zero findings across all 8 dimensions. Round-by-round trace lives in the commit messages — each commit names the codex finding it addresses (R1 P1 TWFE absorb, R2 P3 terminology, R3 P2/P3 warnings + d_bar wording, R4 P1/P2 pairwise Conley + narrowing, R5/R6/R7/R8 successive refinements on the warning-policy guard down to Counter-multiset equality, R9 P3 seed-23 geometry numbers + CHANGELOG count).
  • Cross-references: docs/index.rst toctree, docs/methodology/papers/butts-2021-review.md:257 (T22→T23), docs/references.rst (Kline-Moretti 2014 added), TODO.md (drops the T22-tutorial row), CHANGELOG.md (one-liner under [Unreleased] ### Added).

Security / privacy

  • Confirm no secrets/PII in this PR: Yes. Content + filename secret scans clean.

Generated with Claude Code

@github-actions
Copy link
Copy Markdown

Overall Assessment

⚠️ Needs changes

Executive Summary

  • I found no estimator-methodology defects in the changed SpilloverDiD narrative itself. The tutorial’s use of nearest-treated rings, d_bar = max(rings), and Conley as the recommended inference path matches docs/methodology/REGISTRY.md:L3264-L3486, diff_diff/spillover.py:L1651-L1740, diff_diff/spillover.py:L2150-L2189, and docs/methodology/papers/butts-2021-review.md:L249-L257.
  • P1: the new T23 drift tests hard-code that three exact np.matmul RuntimeWarnings must occur on every fit (tests/test_t23_spillover_tva_drift.py:L126-L145, L350-L422), but the repo documents those warnings as an Apple Silicon M4 + numpy < 2.3 artifact that is not reproducible on Linux (TODO.md:L291-L297).
  • That conflicts with the repo’s actual CI surface, which installs latest numpy and runs the full Python test suite on Ubuntu (.github/workflows/rust-test.yml:L205-L219). On that matrix, the warning set should be empty, so these tests are a merge blocker.
  • T23’s notebook comments also misattribute the warnings to a model-specific “per-unit FE projection edge case” (docs/tutorials/23_spillover_tva.ipynb:L278-L283) instead of the known Accelerate/BLAS issue. T19 already documents and tests the correct contract (docs/tutorials/19_dcdh_marketing_pulse.ipynb:L226-L240, tests/test_t19_marketing_pulse_drift.py:L221-L260).
  • Execution note: this was a static review only; the sandbox here does not have pytest/numpy installed.

Methodology

  • No findings. The changed tutorial stays within the documented SpilloverDiD contract and does not introduce an undocumented estimator, weighting, or variance deviation.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings. The blocking warning-contract issue is not a good TODO.md deferral candidate because it would break the current CI/test surface.

Security

  • No findings.

Documentation/Tests

  • P1 Impact: tests/test_t23_spillover_tva_drift.py:L126-L145 and L350-L422 require three exact raw RuntimeWarnings to appear once per fit, but TODO.md:L291-L297 says those warnings only occur on Apple Silicon M4 with numpy < 2.3 and are not reproducible on Linux. Since the pure-Python CI job installs latest numpy on Ubuntu (.github/workflows/rust-test.yml:L205-L219), this test contract is inconsistent with the supported environment and is likely to fail even when the fit is correct. The notebook prose/comment at docs/tutorials/23_spillover_tva.ipynb:L278-L283 also teaches the wrong cause for the warnings. Concrete fix: mirror T19’s pattern. Document the matmul warnings as platform-specific Accelerate noise if you want to mention them at all, keep the scoped .*encountered in matmul filter in the notebook if desired, and change the drift test to assert the post-filter visible warning policy (or gate raw-warning assertions behind an affected-platform + numpy < 2.3 condition) instead of asserting that three raw warnings must always occur.

Path to Approval

  1. Replace the T23 raw-warning assertions with the same contract used in T19: after applying the notebook’s .*encountered in matmul filter, assert that no unexpected warnings remain, rather than asserting that three raw runtime warnings must appear.
  2. Update T23 notebook comments/prose to describe the matmul warnings as optional platform-specific Accelerate/BLAS noise, not as a universal SpilloverDiD FE edge case. Reusing the T19 wording would be the simplest fix.

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 00d72e2364a06647db00aee72c832324b0e53435


Overall Assessment

Looks good

Executive Summary

  • Re-review outcome: the prior P1 warning-surface blocker is resolved. The notebook now describes the np.matmul warnings as the platform-specific Apple Silicon M4 + numpy<2.3 BLAS issue documented in TODO.md, and the new drift guards use the T19-style post-filter contract instead of requiring raw warnings on every platform (docs/tutorials/23_spillover_tva.ipynb:L278-L285, tests/test_t23_spillover_tva_drift.py:L348-L414, TODO.md:L289-L297).
  • No estimator implementation changed. The tutorial’s SpilloverDiD spec, d_bar = max(rings) usage, far-away control interpretation, and Conley guidance are consistent with the Methodology Registry and the class docstring (docs/methodology/REGISTRY.md:L3276-L3298, L3341-L3362, diff_diff/spillover.py:L1652-L1690, L2150-L2189).
  • The supporting paper references are wired correctly: Butts §4 Table 1 Panel A is used as the empirical anchor, and Kline-Moretti (2014) is added as the motivating TVA citation (docs/methodology/papers/butts-2021-review.md:L253-L257, docs/references.rst:L189-L195, docs/index.rst:L80-L86).
  • One minor P3 remains: the notebook says the d_bar >= 100 estimates are identical, but the exact identity drift guard only pins tau_total; delta_1 is only checked to one decimal place (docs/tutorials/23_spillover_tva.ipynb:L315-L319, tests/test_t23_spillover_tva_drift.py:L268-L298).
  • Execution note: static review only; this sandbox does not have pytest or numpy, so I could not run the new drift file locally.

Methodology

  • No findings. Affected method surface is SpilloverDiD tutorial/test collateral only (plus a MultiPeriodDiD benchmark example); I found no undocumented deviation from the registered Butts/Gardner/Conley methodology.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings. Removing the old tutorial placeholder from TODO.md is appropriate now that the T23 tutorial has landed.

Security

  • No findings.

Documentation/Tests

  • P3 Impact: the prose says the d_bar = 150 and 200 km estimates are identical to 100, but test_rings_grid_d_bar_100_to_200_identical asserts exact identity only for att. delta_1 is only pinned through rounded endpoints, so a future drift in the spillover coefficient could leave that notebook claim stale without failing tests (docs/tutorials/23_spillover_tva.ipynb:L315-L319, tests/test_t23_spillover_tva_drift.py:L268-L298). Concrete fix: extend the exact-identity test to compare the delta_1 coefficient across d_bar in {100, 150, 200} at tight tolerance, or soften the notebook wording from “identical” to a looser stability claim.
  • No other findings. The previous warning-contract issue from the earlier AI review is addressed by the new post-filter clean-surface tests and corrected notebook explanation (docs/tutorials/23_spillover_tva.ipynb:L278-L285, tests/test_t23_spillover_tva_drift.py:L348-L414, TODO.md:L289-L297).

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 1248c44a6fb9a778c1f7cc8844882d7eabeaf574


Overall Assessment

Looks good

Executive Summary

  • Re-review outcome: the prior drift-guard gap is resolved. The notebook’s d_bar >= 100 “identical estimates” claim now has exact tests for both tau_total and delta_1 in tests/test_t23_spillover_tva_drift.py:L287-L316, matching docs/tutorials/23_spillover_tva.ipynb:L315-L319.
  • Affected methods are SpilloverDiD tutorial/test collateral plus a MultiPeriodDiD benchmark only; I found no methodology deviation from the registered Butts/Gardner/Conley contract in docs/methodology/REGISTRY.md:L3264-L3489 and diff_diff/spillover.py:L1651-L2189.
  • The notebook’s use of rings=[0.0, d_bar], implicit d_bar=max(rings), far-away controls, and Conley bandwidth guidance is consistent with the registry and class docstring.
  • Two minor documentation issues remain: CHANGELOG.md says the drift file has 17 tests, but the file defines 18; and Tutorial 23 is in the Sphinx toctree but not in docs/tutorials/README.md.
  • Static review only: this environment does not have numpy or pytest, so I could not execute the notebook or the new drift test.

Methodology

  • No findings. The changed surface is documentation/drift coverage for SpilloverDiD, not estimator logic. The tutorial’s identification and variance descriptions match the committed registry/docstring contract at docs/methodology/REGISTRY.md:L3276-L3289, docs/methodology/REGISTRY.md:L3332-L3485, diff_diff/spillover.py:L1664-L1693, and diff_diff/spillover.py:L2150-L2189.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings. Removing the delivered T23 TODO row is appropriate, and the Apple Silicon matmul-warning limitation remains separately tracked in TODO.md:L289-L305.

Security

  • No findings.

Documentation/Tests

  • Severity: P3
    Impact: CHANGELOG.md says the new drift file contains “17 function-level tests,” but tests/test_t23_spillover_tva_drift.py now defines 18 test_* functions. That makes the release note/audit trail stale.
    Concrete fix: update the changelog count to 18, or remove the explicit number entirely.
    References: CHANGELOG.md:L13-L14, tests/test_t23_spillover_tva_drift.py:L186-L432

  • Severity: P3 [Newly identified]
    Impact: Tutorial 23 is discoverable in Sphinx via docs/index.rst, but the repo-local tutorial catalog still stops at Tutorial 22, so contributors browsing docs/tutorials/README.md will miss the new notebook.
    Concrete fix: add a ### 23 entry to docs/tutorials/README.md in the same format as Tutorials 20-22, including the notebook name and companion drift test.
    References: docs/index.rst:L76-L86, docs/tutorials/README.md:L1-L120

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: e1e0ba59624ce6dec84e4933b1e654f7957c23a5


Overall Assessment

Looks good

Executive Summary

  • Prior re-review P3s are resolved: CHANGELOG.md now reports 18 tests, and Tutorial 23 is listed in both the Sphinx toctree and the tutorial catalog at CHANGELOG.md:L13-L14, docs/index.rst:L80-L86, and docs/tutorials/README.md:L122-L128.
  • Methodology cross-check passed. The PR only adds tutorial/test collateral around SpilloverDiD plus a MultiPeriodDiD benchmark; I found no undocumented deviation from the committed Butts/Gardner/Conley contract in docs/methodology/REGISTRY.md:L3264-L3492, diff_diff/spillover.py:L1651-L2189, docs/references.rst:L189-L199, and docs/methodology/papers/butts-2021-review.md:L248-L257.
  • Severity P2: the new drift suite does not actually enforce the claimed notebook/test “verbatim” DGP sync; only a small constant block is synchronized, so non-constant notebook edits can drift from the fixture silently.
  • Severity P3: several new seed-specific geometry/Conley-support numbers in the notebook are not pinned by the drift tests, so that prose can still go stale even if the new test file stays green.
  • Static review only: this environment does not have pytest or numpy, so I could not execute tests/test_t23_spillover_tva_drift.py or the notebook.

Methodology

  • No findings. Affected methods are SpilloverDiD tutorial/test collateral and a MultiPeriodDiD benchmark only. The notebook’s single-ring nearest-treated specification, d_bar = max(rings) behavior, underspecified-d_bar sensitivity discussion, and Conley conley_cutoff_km = d_bar guidance all match the registry and class docstrings at docs/methodology/REGISTRY.md:L3276-L3289, docs/methodology/REGISTRY.md:L3332-L3489, diff_diff/spillover.py:L1664-L1690, diff_diff/spillover.py:L1811-L1879, diff_diff/spillover.py:L2150-L2189, and diff_diff/estimators.py:L1204-L1205, diff_diff/estimators.py:L1250-L1252, diff_diff/estimators.py:L1395-L1408.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • Severity: P2. Impact: tests/test_t23_spillover_tva_drift.py says the DGP builder is duplicated “verbatim” from notebook §2, but the only explicit sync check is test_dgp_true_parameters_match_quoted(), which asserts a few constants. Non-constant notebook edits such as coordinate ranges, lambda_t, or the row-construction logic can diverge from the fixture without CI noticing if the headline estimates remain within the existing tolerance bands. That weakens the stated purpose of the drift suite. Concrete fix: either extract the DGP builder into a shared helper used by both notebook and test, or add a test that loads docs/tutorials/23_spillover_tva.ipynb and compares the normalized §2 code cell source against _build_t23_panel(). References: tests/test_t23_spillover_tva_drift.py:L10-L24, tests/test_t23_spillover_tva_drift.py:L50-L93, tests/test_t23_spillover_tva_drift.py:L218-L224, docs/tutorials/23_spillover_tva.ipynb:L81-L149.

Tech Debt

  • No findings. Dropping the delivered tutorial row from TODO.md is appropriate, and the M4 BLAS warning limitation remains separately tracked at TODO.md:L289-L306.

Security

  • No findings.

Documentation/Tests

  • Severity: P3. Impact: the notebook now includes several seed-specific quantitative claims beyond the headline ATT/SE numbers, including the geometry table in §2 and the far×far pair-support percentages in the Conley discussion. The new drift suite does not pin those values; it only checks band counts/lat bounds and the main estimator outputs. Those prose details can therefore drift silently while tests/test_t23_spillover_tva_drift.py still passes. Concrete fix: add haversine-based assertions for the §2 geometry and §6 pair-support claims, or soften the notebook prose to the coarser bounds already enforced by the tests. References: docs/tutorials/23_spillover_tva.ipynb:L57-L61, docs/tutorials/23_spillover_tva.ipynb:L311-L318, docs/tutorials/23_spillover_tva.ipynb:L452-L466, tests/test_t23_spillover_tva_drift.py:L195-L215, tests/test_t23_spillover_tva_drift.py:L237-L354.
  • Execution note: I could not run the new drift suite here because pytest and numpy are unavailable in this environment.

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 3d18f7e0bbb0b493767f423ce45ad9d0af33b751


Overall Assessment

Looks good

Executive Summary

  • Prior re-review gaps are addressed: the drift suite now includes an AST-level notebook/fixture sync guard and additional seed-geometry assertions.
  • Methodology cross-check passed. The PR only adds tutorial/test/docs collateral around existing SpilloverDiD and MultiPeriodDiD behavior, and I found no undocumented deviation from the committed Butts/Gardner/Conley contract.
  • Severity P3: section 6 mislabels the far-band ~131 km figure as a latitude extent when the corresponding test computes full pairwise band diameter.
  • Severity P3: the new geometry guard still overstates its own coverage; a couple of quoted pairwise-distance numbers in the notebook remain unpinned.
  • Severity P3: CHANGELOG.md understates the size of the new drift suite (18 vs 20 test_* functions).
  • Static review only: pytest and numpy are unavailable in this environment, so I could not execute the notebook or tests/test_t23_spillover_tva_drift.py.

Methodology

No findings. The tutorial’s ring-regressor form, d_bar = max(rings) API note, undershooting-d_bar sensitivity explanation, and Conley-at-d_bar guidance all match the committed registry/docstring contract at docs/methodology/REGISTRY.md:L3276-L3292, docs/methodology/REGISTRY.md:L3341-L3359, docs/methodology/REGISTRY.md:L3472-L3486, diff_diff/spillover.py:L1664-L1690, diff_diff/spillover.py:L1848-L1878, diff_diff/estimators.py:L1195-L1208, diff_diff/estimators.py:L1395-L1408, docs/references.rst:L189-L195, and docs/methodology/papers/butts-2021-review.md:L252-L257.

Code Quality

No findings.

Performance

No findings.

Maintainability

  • Severity P3. Impact: test_seed_specific_geometry_pins_match_quoted() says it pins “all the values quoted in the notebook,” but section 6’s treated × near <= ~94 km and median far/far pairwise distance ~56 km claims are not asserted. Those two seed-specific prose values can still drift silently while the drift suite stays green. Concrete fix: add explicit assertions for the max treated-to-near pair distance and far×far median distance, or narrow the test docstring claim to only the values actually pinned. References: docs/tutorials/23_spillover_tva.ipynb:L461-L466, tests/test_t23_spillover_tva_drift.py:L430-L480.

Tech Debt

No findings. Dropping the delivered tutorial row from TODO.md is appropriate, and the matmul-warning limitation remains separately tracked at TODO.md:L289-L306.

Security

No findings.

Documentation/Tests

  • Severity P3. Impact: section 6 says the far-control band’s “lat extent” is ~131 km, but the DGP only samples far-unit latitudes on [2.0, 3.0], so the actual latitude span is about 111 km; the matching test computes full pairwise band diameter, not latitude extent. The prose/comment pair is geometrically inaccurate even though the underlying Conley-support intuition is fine. Concrete fix: rename that quantity to “band diameter” / “max within-band pairwise distance,” or recompute the true latitude extent and update the quoted value. References: docs/tutorials/23_spillover_tva.ipynb:L123-L124, docs/tutorials/23_spillover_tva.ipynb:L464-L466, tests/test_t23_spillover_tva_drift.py:L448-L452, tests/test_t23_spillover_tva_drift.py:L475-L476.
  • Severity P3. Impact: CHANGELOG.md says the new drift suite has 18 function-level tests, but tests/test_t23_spillover_tva_drift.py currently defines 20 test_* functions, so the release note understates the shipped coverage. Concrete fix: update the changelog count to 20, or trim the extra two tests if 18 was intentional. References: CHANGELOG.md:L14, tests/test_t23_spillover_tva_drift.py:L186-L549.

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 4311a21661ad6f67b9fc9fe3baca45711a42aff9


Overall Assessment

Looks good

Executive Summary

  • Prior re-review P3 items appear addressed: the ~131 km text now correctly describes pairwise band diameter, the geometry guard now pins the previously missing quoted distances, and CHANGELOG.md now matches the file’s 20 test_* functions.
  • Methodology cross-check passed. The affected methods are SpilloverDiD, MultiPeriodDiD, and the Conley variance narrative, but this PR changes only tutorial/docs/tests and stays within the existing SpilloverDiD contract in docs/methodology/REGISTRY.md:L3472-L3492 and diff_diff/spillover.py:L1651-L1692.
  • Severity P2: the new notebook/fixture sync guard does not compare the notebook’s DGP constants, so future notebook-only edits to the seed, band sizes, timing, or effect-size constants can drift the executed tutorial without failing the drift suite.
  • Static review only: I could not run the notebook or tests/test_t23_spillover_tva_drift.py because pytest and numpy are unavailable in this environment.

Methodology

  • No findings. The tutorial’s d_bar = max(rings) explanation, undershooting-d_bar failure mode, far-away-control requirement, and conley_cutoff_km = d_bar guidance are consistent with docs/methodology/REGISTRY.md:L3472-L3492, docs/references.rst:L189-L199, diff_diff/spillover.py:L1651-L1692, diff_diff/spillover.py:L1858-L1878, and diff_diff/spillover.py:L2115-L2145.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings. The previous wording/coverage issues are fixed in docs/tutorials/23_spillover_tva.ipynb:L464-L467, tests/test_t23_spillover_tva_drift.py:L495-L508, and CHANGELOG.md:L14.

Tech Debt

  • No findings. Removing the delivered tutorial row from TODO.md is appropriate, and the remaining spillover follow-ups remain tracked at TODO.md:L157-L164.

Security

  • No findings.

Documentation/Tests

  • Severity P2. Impact: test_notebook_dgp_ast_matches_test_fixture() only compares the AST of build_t23_panel(), but the notebook’s executed DGP also depends on the cell-level constant assignments above that function. test_dgp_true_parameters_match_quoted() only reasserts the test module’s own constants, not the notebook assignments. A future notebook-only edit to MAIN_SEED, N_TREATED, N_NEAR, N_FAR, T_PERIODS, FIRST_TREAT, TAU_TOTAL, DELTA_1, D_BAR_KM, or NOISE_SD can therefore change tutorial behavior while the advertised sync guard stays green. Concrete fix: parse the notebook code cell and assert those assignments match the test module constants, or move the parameters into the compared function/shared helper so the existing AST guard covers them. References: docs/tutorials/23_spillover_tva.ipynb:L89-L98, tests/test_t23_spillover_tva_drift.py:L37-L47, tests/test_t23_spillover_tva_drift.py:L218-L224, tests/test_t23_spillover_tva_drift.py:L366-L427.

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 948627bdefbce6a692dd58e1c4b453ed6b5fa494


Overall Assessment

Looks good

Executive Summary

  • The prior re-review drift-guard gap is fixed: the new constant-sync test now checks the notebook’s top-level DGP assignments, and the existing AST guard still covers the function body in tests/test_t23_spillover_tva_drift.py:L366-L495.
  • Methodology cross-check passed for the affected methods: SpilloverDiD, the naive MultiPeriodDiD comparison, and the no-survey Conley variance narrative. The tutorial’s d_bar/rings contract, far-away control logic, and Conley guidance match docs/methodology/REGISTRY.md:L3341-L3492, diff_diff/spillover.py:L1651-L1878, diff_diff/spillover.py:L2110-L2147, diff_diff/two_stage.py:L63-L145, and docs/references.rst:L189-L199.
  • The TVA empirical anchor is correctly wired back to the paper review and references in docs/methodology/papers/butts-2021-review.md:L252-L257 and docs/references.rst:L189-L195.
  • Severity P2: the published notebook/README exact numbers are still only guarded to one decimal in the drift suite, so small future numeric drift can stale the prose without failing tests.
  • Static review only: I could not run pytest or execute the notebook here because the environment is missing pytest and numpy.

Methodology

  • No findings. Affected methods are SpilloverDiD, MultiPeriodDiD, and ConleySpatialHAC. The tutorial’s treatment of d_bar = max(rings), the need for far-away controls, and the recommendation to pair vcov_type="conley" with conley_cutoff_km = d_bar are consistent with docs/methodology/REGISTRY.md:L3341-L3492, diff_diff/spillover.py:L1651-L1878, and diff_diff/spillover.py:L2110-L2147.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings. The previous notebook/test sync weakness is now addressed by the new constant parser plus the AST-body equivalence guard in tests/test_t23_spillover_tva_drift.py:L366-L495.

Tech Debt

  • No findings. Removing the delivered tutorial row from TODO.md is appropriate, and the remaining spillover follow-ups stay tracked in TODO.md:L157-L164.

Security

  • No findings.

Documentation/Tests

  • Severity P2. Impact: the notebook, README, and changelog publish two-decimal claims such as -4.29, -7.34, -4.53, -5.38, and -2.59, but the new drift tests generally only enforce one-decimal endpoints (round(..., 1)) in tests/test_t23_spillover_tva_drift.py:L243-L258 and L261-L284. That means future drift smaller than 0.05 can leave docs/tutorials/23_spillover_tva.ipynb:L230-L239, L300-L319, L372-L382, docs/tutorials/README.md:L122-L128, and CHANGELOG.md:L14 stale while the suite still passes. Concrete fix: either coarsen those published numbers to one decimal everywhere, or tighten the stable assertions to the quoted precision and keep only the intentionally loose rings=[0, 50] path at one decimal.

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 99c0428319b7a370c471094bc46a40fe3cee41aa


Overall Assessment

Looks good

Executive Summary

  • The prior re-review drift-guard concern is addressed: the new suite now pins the quoted two-decimal headline values and adds notebook/test sync guards for both constants and DGP logic in tests/test_t23_spillover_tva_drift.py:L243-L269 and L377-L506.
  • Methodology cross-check passed for the affected methods SpilloverDiD, MultiPeriodDiD, and the Conley no-survey path. The tutorial’s d_bar = max(rings) contract, nearest-treated-ring semantics, Assumption 5(ii) far-away control requirement, and Butts §4 tutorial anchor align with docs/methodology/REGISTRY.md:L3341-L3492, diff_diff/spillover.py:L1651-L1878, diff_diff/spillover.py:L2110-L2147, diff_diff/estimators.py:L1094-L1258, and docs/methodology/papers/butts-2021-review.md:L252-L257.
  • Severity P3: the notebook’s Conley variance explanation slightly overstates the decomposition by referring to “raw scores” and “HC1 diagonal terms,” while the documented implementation uses observation-level Wave D Psi rows with no HC1 finite-sample multiplier on the Conley path (docs/tutorials/23_spillover_tva.ipynb:L455-L476, docs/methodology/REGISTRY.md:L3341-L3363, diff_diff/two_stage.py:L83-L120).
  • Severity P3: CHANGELOG.md still says the naive endpoint is pinned at round-to-1 even though the new drift guard now fixes -4.29 at two decimals (CHANGELOG.md:L14, tests/test_t23_spillover_tva_drift.py:L243-L250).
  • Static review only: I could not execute the new test file or notebook here because the environment is missing pytest and numpy.

Methodology

Affected methods reviewed: SpilloverDiD, MultiPeriodDiD, ConleySpatialHAC. No estimator-path code changed in this diff; the methodology review here is about whether the new tutorial and drift guards describe the shipped contract correctly.

  • Severity P3. Impact: the Conley explanation in docs/tutorials/23_spillover_tva.ipynb:L455-L476 says the no-survey path “operates on raw scores” and that the meat is the sum of “HC1 diagonal terms” plus off-diagonal cross-products. The documented implementation uses observation-level Wave D GMM-corrected influence rows Psi and explicitly preserves the no-HC1-multiplier Conley convention in docs/methodology/REGISTRY.md:L3341-L3363 and diff_diff/two_stage.py:L83-L120. This does not change results, but it slightly misstates the variance decomposition and could confuse readers about the finite-sample scaling on the Conley path. Concrete fix: rephrase that passage to say the no-survey path uses observation-level GMM-corrected influence rows and that Conley adds within-support off-diagonal pair terms to the observation-level diagonal terms, without calling them “HC1 diagonal terms.”

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings. The previous drift-guard gap is resolved by the stronger endpoint pins and the notebook constant/AST sync checks in tests/test_t23_spillover_tva_drift.py:L243-L269 and L377-L506.

Tech Debt

  • No findings. Removing the delivered tutorial row from TODO.md is appropriate; the remaining spillover follow-ups remain tracked in TODO.md:L157-L164.

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: CHANGELOG.md:L14 still describes the naive headline guard as “round-to-1,” but the actual regression test now pins -4.29 to two decimals in tests/test_t23_spillover_tva_drift.py:L243-L250. This is minor, but it leaves the release note understating the actual test strictness and can mislead future maintainers comparing the changelog to the drift suite. Concrete fix: update the changelog entry so only the borderline-rank-deficient rings=[0, 50] sensitivity point is described as round-to-1, while the naive headline is described as two-decimal-pinned.

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 7207bea97d8bf6e20a57d4f2ccf32fb9a388bdc8


Overall Assessment

✅ Looks good

Executive Summary

  • Prior re-review issues are resolved: the notebook’s Conley explanation now matches the documented no-survey Wave D/Wave E.2 path using observation-level GMM-corrected influence rows and no HC1 multiplier (docs/tutorials/23_spillover_tva.ipynb:L451-L498, docs/methodology/REGISTRY.md:L3341-L3363, diff_diff/two_stage.py:L63-L118).
  • The tutorial’s d_bar = max(rings) note, underspecified-d_bar failure mode, and far-away-control identification all align with the shipped SpilloverDiD contract (diff_diff/spillover.py:L1861-L1876, diff_diff/spillover.py:L2136-L2146, docs/methodology/REGISTRY.md:L3472-L3489).
  • The new drift suite materially closes the earlier drift gaps: it now pins the headline endpoints, exact 100/150/200 identity for both tau_total and delta_1, notebook constant/body sync, geometry claims, and the post-filter warning surface (tests/test_t23_spillover_tva_drift.py:L243-L356, tests/test_t23_spillover_tva_drift.py:L377-L656).
  • Severity P3: the condensed README/changelog summaries slightly over-attribute the conley_lag_cutoff ∈ {0,1} demo to Butts §3.1. The notebook body itself is precise; the summary docs should credit Butts for the spatial cutoff choice and the registry/Newey-West synthesis for the serial extension (docs/tutorials/README.md:L127-L127, CHANGELOG.md:L14-L14, docs/tutorials/23_spillover_tva.ipynb:L404-L408, docs/methodology/REGISTRY.md:L3411-L3437).
  • Static review only: I could not execute tests/test_t23_spillover_tva_drift.py here because pytest is not installed in this environment.

Methodology

  • Severity P3. Impact: docs/tutorials/README.md:L127-L127 and CHANGELOG.md:L14-L14 describe the conley_lag_cutoff ∈ {0,1} tutorial demo as “per Butts §3.1. That overstates the paper support slightly: Butts §3.1 supports the Conley spatial cutoff choice (cutoff = d_bar), while the lagged serial term is the library’s documented Wave E.2 follow-up synthesis with Newey-West-style serial Bartlett HAC. The notebook text already makes this distinction correctly at docs/tutorials/23_spillover_tva.ipynb:L404-L408, and the registry documents the serial extension at docs/methodology/REGISTRY.md:L3411-L3437. Concrete fix: rephrase the README/changelog summary to say the cutoff choice follows Butts §3.1, while conley_lag_cutoff` demonstrates the library’s documented panel-block/Newey-West extension.
  • No other findings. The affected methods reviewed were SpilloverDiD, MultiPeriodDiD, and the no-survey Conley path; the tutorial’s numerical and identification claims stay within the documented contracts (diff_diff/estimators.py:L1211-L1258, diff_diff/spillover.py:L1861-L1876, diff_diff/spillover.py:L2136-L2146).

Code Quality

  • No findings.

Performance

  • No findings. This PR is docs/tests only and does not change estimator runtime paths.

Maintainability

  • No findings. The duplicated notebook DGP is now defended by both constant-sync and AST-body sync guards, which is the right anti-drift pattern for this kind of tutorial fixture (tests/test_t23_spillover_tva_drift.py:L377-L506).

Tech Debt

  • No findings. Removing the delivered TVA tutorial row from TODO.md is appropriate; the remaining spillover follow-ups remain tracked separately.

Security

  • No findings.

Documentation/Tests

  • No findings beyond the P3 attribution nit above. The prior changelog strictness mismatch is fixed: CHANGELOG.md:L14-L14 now matches the two-decimal endpoint pins in tests/test_t23_spillover_tva_drift.py:L243-L269.
  • Residual risk, not a PR finding: I could not run the new drift file locally because pytest is unavailable in this environment.

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 8eff5d8ccc789b7d860eb75a0ad206aa1a3b8b65


Overall Assessment

✅ Looks good

Executive Summary

  • The prior re-review attribution issue is resolved: the README now correctly distinguishes Butts §3.1’s support for the spatial cutoff from the library’s documented Wave E.2 follow-up serial extension (docs/tutorials/README.md:L122-L128, docs/methodology/REGISTRY.md:L3411-L3439).
  • No estimator code changed here. The affected methodological surface is the new SpilloverDiD tutorial, its MultiPeriodDiD naive baseline, and the no-survey Conley explanation; those all match the shipped SpilloverDiD contract and registry (diff_diff/spillover.py:L1667-L1679, diff_diff/spillover.py:L1861-L1878, diff_diff/spillover.py:L2115-L2147).
  • The new drift suite is strong for a docs/tutorial PR: it pins the headline endpoints, the d_bar sensitivity grid, notebook/test DGP sync, and the warning surface (tests/test_t23_spillover_tva_drift.py:L272-L365, tests/test_t23_spillover_tva_drift.py:L377-L506, tests/test_t23_spillover_tva_drift.py:L621-L656).
  • Severity P3: CHANGELOG.md still slightly over-attributes the exact {100,150,200} identity claim to Butts §4. In the notebook itself, that identity is correctly described as a DGP-specific consequence of having no units in the 80–200 km band (CHANGELOG.md:L14-L14, docs/tutorials/23_spillover_tva.ipynb:L318-L322, docs/methodology/REGISTRY.md:L3472-L3475).
  • Static review only: I could not execute tests/test_t23_spillover_tva_drift.py because pytest is not installed in this environment.

Methodology

  • Severity P3. Impact: CHANGELOG.md:L14-L14 phrases the exact d_bar ∈ {100,150,200} identity as “per Butts §4 ‘once d_bar covers the true horizon, widening is benign.’” That is broader than the source material supports. The shipped registry frames d_bar as a user-chosen bias/variance tradeoff, with too-small and too-large cases both having methodological consequences; the exact identity here is a property of this synthetic geometry, not a generic Butts result (docs/methodology/REGISTRY.md:L3472-L3475). Concrete fix: reword the changelog to attribute the identity to this tutorial DGP, e.g. “on this synthetic panel, widening past 100 km is benign because no units lie in the 80–200 km band.”
  • No other findings. The tutorial’s d_bar=max(rings) note, far-away-control requirement, and Conley/serial-lag explanation match the current implementation and registry (diff_diff/spillover.py:L1667-L1679, diff_diff/spillover.py:L1861-L1878, diff_diff/spillover.py:L2115-L2147, docs/methodology/REGISTRY.md:L3411-L3439).

Code Quality

  • No findings.

Performance

  • No findings. This PR is docs/tests only and does not change estimator runtime paths.

Maintainability

  • No findings. The duplicated notebook DGP is defended by both constant-sync and AST-body sync guards, which is the right anti-drift pattern for this kind of tutorial fixture (tests/test_t23_spillover_tva_drift.py:L377-L506).

Tech Debt

  • No findings. Removing the delivered tutorial row from TODO.md is appropriate; the remaining spillover follow-ups stay tracked (TODO.md:L157-L164).

Security

  • No findings.

Documentation/Tests

  • No findings beyond the P3 methodology wording above. The new tutorial is wired into the docs index and references cleanly (docs/index.rst:L80-L86, docs/references.rst:L189-L199, docs/methodology/papers/butts-2021-review.md:L248-L257).
  • Residual risk: this is a static review only; I could not run tests/test_t23_spillover_tva_drift.py locally because pytest is unavailable.

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: e105dc538cf8397455fd8e4f754a15a1bec38e83


Overall Assessment

✅ Looks good

Executive Summary

  • The prior rerun’s methodology wording issue is resolved: CHANGELOG.md:L14-L14 now scopes the {100,150,200} estimate identity to this synthetic DGP instead of attributing it generically to Butts.
  • No estimator or variance code changed. The tutorial stays within the shipped SpilloverDiD contract for d_bar=max(rings), far-away-control identification, and the documented conley_lag_cutoff follow-up synthesis.
  • The new drift suite is strong for a tutorial PR: 21 tests pin the headline endpoints, sensitivity grid, notebook/test DGP sync, geometry values, and the post-filter warning surface.
  • One minor documentation issue remains in the notebook’s survey-weight bullet: it blurs Wave E.2 vs Wave E.2 follow-up and overstates the generic survey+Conley surface.
  • Static review only: I could not execute the new test file here because the environment lacks pytest and numpy.

Methodology

  • Previous finding addressed: the changelog no longer over-attributes the d_bar ∈ {100,150,200} identity claim to Butts; the wording now correctly limits that claim to this tutorial DGP at CHANGELOG.md:L14-L14.
  • Severity P3. Impact: docs/tutorials/23_spillover_tva.ipynb:L542-L545 says survey_design + vcov_type="conley" is a “panel-block sandwich (Wave E.2)” and implies a generic SurveyDesign(...) surface. The documented contract is narrower: cross-sectional survey+Conley is Wave E.2, panel-block spatial+serial composition is Wave E.2 follow-up, and the conley_lag_cutoff > 0 survey path requires an effective PSU or injected cluster fallback (docs/methodology/REGISTRY.md:L3388-L3437, diff_diff/spillover.py:L3219-L3260). This is informational only because unsupported cases fail closed, but it can misdirect readers about what is paper-backed vs library follow-up and what survey layouts are supported. Concrete fix: reword the bullet to distinguish conley_lag_cutoff=0 Wave E.2 from the >0 Wave E.2 follow-up path, and mention the effective-PSU requirement for the panel-block serial case. If the roadmap note is meant to be spillover-specific, say “dedicated SpilloverDiD survey-design tutorial.”
  • No other methodology findings. The tutorial’s d_bar=max(rings) API note, far-away-control discussion, and framing of the serial Conley term as a documented library synthesis are consistent with diff_diff/spillover.py:L1667-L1690, diff_diff/spillover.py:L1861-L1878, and docs/methodology/REGISTRY.md:L3411-L3482.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings. The notebook/test DGP duplication is adequately defended by the constant-sync and AST-sync guards in tests/test_t23_spillover_tva_drift.py:L377-L507.

Tech Debt

No findings. Removing the delivered tutorial row from TODO.md is appropriate.

Security

No findings.

Documentation/Tests

No additional findings. The docs wiring and references are consistent across docs/index.rst:L80-L86, docs/references.rst:L189-L195, and docs/methodology/papers/butts-2021-review.md:L248-L257.

Residual risk: this was a static review only; I could not run tests/test_t23_spillover_tva_drift.py in this environment because pytest and numpy are unavailable.

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 239eeb57e1329de001614be6af8410c29fdf2e5f


Overall Assessment

✅ Looks good

Executive Summary

  • The prior documentation issue is resolved: the notebook now correctly distinguishes Wave E.2 survey+Conley with conley_lag_cutoff=0 from the Wave E.2 follow-up conley_lag_cutoff > 0 path and its effective-PSU requirement at docs/tutorials/23_spillover_tva.ipynb:L542-L552, matching docs/methodology/REGISTRY.md:L3404-L3433 and diff_diff/spillover.py:L3219-L3260.
  • No estimator, weighting, or variance implementation changed in this PR. The new tutorial stays within the documented SpilloverDiD contract for d_bar = max(rings) and the Butts/Conley framing at diff_diff/spillover.py:L1667-L1674, diff_diff/spillover.py:L1861-L1878, and docs/references.rst:L189-L195.
  • The drift suite is strong for a tutorial PR: it pins the headline ATTs, d_bar sensitivity behavior, notebook/test DGP sync, geometry claims, and the post-filter warning surface at tests/test_t23_spillover_tva_drift.py:L237-L365 and tests/test_t23_spillover_tva_drift.py:L377-L656.
  • Docs wiring is consistent across the index, README, references, methodology paper note, and changelog at docs/index.rst:L80-L86, docs/tutorials/README.md:L122-L128, docs/methodology/papers/butts-2021-review.md:L253-L257, and CHANGELOG.md:L14-L14.
  • Static review only: I could not execute tests/test_t23_spillover_tva_drift.py here because the environment lacks pytest and numpy.

Methodology

No findings. The tutorial’s API note for d_bar = max(rings) is consistent with the class docstring and validator at docs/tutorials/23_spillover_tva.ipynb:L249-L264, diff_diff/spillover.py:L1667-L1674, and diff_diff/spillover.py:L1861-L1878. The sensitivity-grid discussion is correctly scoped to this synthetic DGP rather than presented as a generic Butts result at docs/tutorials/23_spillover_tva.ipynb:L318-L327, CHANGELOG.md:L14-L14, and docs/methodology/papers/butts-2021-review.md:L253-L257.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings. The notebook/test DGP duplication is adequately protected by constant-sync and AST-sync guards at tests/test_t23_spillover_tva_drift.py:L377-L506.

Tech Debt

No findings. Removing the delivered tutorial row is appropriate, and the remaining SpilloverDiD follow-ups remain tracked at TODO.md:L157-L161.

Security

No findings.

Documentation/Tests

No findings. The documentation surfaces are internally consistent, and the new test file covers the main narrative claims well. Residual risk: I could not run the drift suite locally because pytest and numpy are unavailable in this environment.

@igerber igerber added the ready-for-ci Triggers CI test workflows label May 25, 2026
igerber and others added 17 commits May 25, 2026 16:14
New tutorial docs/tutorials/23_spillover_tva.ipynb (slot 23 since
22_had_survey_design.ipynb occupies slot 22) walks practitioners
through SpilloverDiD on a TVA-style synthetic panel reproducing the
Butts (2021) §4 Table 1 Panel A ~40% understatement direction.

DGP locked at MAIN_SEED=23: 4-period, 200-unit panel laid out as 25
treated + 120 near-control + 55 far-control. With tau_total=-7.4 and
delta_1=-4.5, naive multi-period TWFE understates by 42% (att=-4.29);
SpilloverDiD with rings=[0.0, 100.0] cleanly recovers tau_total=-7.34
and delta_1=-4.53. Section 6 demonstrates Conley spatial-HAC variance
with conley_cutoff_km=100 per Butts §3.1.

Drift tests at tests/test_t23_spillover_tva_drift.py: 14 function-
level tests mirroring the T19 pattern. Module-scoped panel /
naive_fit / spillover_fit / spillover_conley_lag0_fit /
spillover_conley_lag1_fit fixtures. Per
feedback_t19_drift_guards_test_file_only, the notebook contains zero
assert statements — all numerical pins live in the test file. The
DGP-builder function is duplicated verbatim between the notebook §2
cell and the test fixture; test_dgp_true_parameters_match_quoted
pins the constants both copies share.

Cross-references:
- docs/index.rst: toctree entry under Tutorials: Business Applications.
- docs/methodology/papers/butts-2021-review.md:257: T22→T23 (slot
  conflict callout).
- docs/references.rst: adds Kline & Moretti (2014) — cited in §1.
- TODO.md: drops the resolved T22-tutorial row.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
P1 fix: the §3 "naive multi-period TWFE" baseline was fit without unit
FE absorption — only the implicit time-interaction was applied. ATT was
correct (balanced panel makes pooled and TWFE coincide on point
estimates) but the SE was the pooled SE (~0.28) not the TWFE SE
(~0.14). Adds `absorb=["unit"]` to both the notebook §3 fit and the
drift-test `naive_fit` fixture, making the regression a true two-way
fixed-effects panel TWFE per the MultiPeriodDiD contract documented at
`docs/methodology/REGISTRY.md:L119-L141`. All drift tests pass
unchanged (ATT pin is invariant under unit-FE absorption on this
balanced DGP).

P2 fix: §6 attributed the lower Conley SE to "negative cross-covariance
between treated and far-control PSUs at long distances" — but the far
controls sit at 200-330 km from origin while `conley_cutoff_km=100`,
so those pairs are OUTSIDE kernel support and cannot drive the
variance. Rewrites the explanation to refer to the within-cutoff
covariance structure (treated cluster + near-control band, both
pairwise within 100 km) and notes that the sign of the net effect on
SE depends on the per-pair score covariances under this specific DGP.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex R2 caught a terminology mismatch in §6: the explanation said
"only PSU pairs within 100 km contribute" but the fit shown is
`SpilloverDiD(vcov_type="conley")` with no `survey_design`, so the
no-survey Conley path runs an observation-level pairwise sum over
within-period score pairs — PSU aggregation only kicks in on the
survey-design path per
`docs/methodology/REGISTRY.md:L3353-L3357` and `:L3549-L3555`.

Rewrites the §6 mechanism paragraph to use "score pairs" /
"within-period observation pairs" language and explicitly notes the
no-survey scope. Also clarifies that the spatial term's outside-
support statement applies to the spatial sum specifically; the
serial term (`lag=1`) is a separate within-unit cross-period
covariance sum.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
P2 fixes (warning suppression / direction-pinning gap):
- Naive MultiPeriodDiD fit now passes `reference_period=2` explicitly,
  silencing the FutureWarning about the reference-period default
  change at the source. Bit-identical to the prior fit (the new
  default already matches the explicit value).
- Replaces blanket `simplefilter("ignore")` on the MPD fit with a
  message-scoped filter limited to the expected
  "Rank-deficient design matrix" UserWarning. That warning is benign
  here: `absorb=["unit"]` makes the unit-invariant `ever_treated`
  indicator perfectly collinear with the unit FE, so MPD drops it and
  identifies the ATT through the ever_treated × post interactions —
  the standard TWFE specification. Now any unexpected warning would
  surface.
- Adds `test_conley_se_less_than_hc1` to pin the §6 prose claim
  "on this DGP, the Conley spatial-HAC SE comes in *lower* than HC1"
  directionally. The pre-existing `test_conley_se_differs_from_hc1`
  is retained as a direction-agnostic sanity check.

P3 fix (§4 d_bar=50 mechanism):
The previous wording said `delta_1` attenuates because the ring
"averages over fewer units" — incorrect under this constant-`delta_1`
DGP. The real mechanism is misspecification: near-controls at 50-78 km
ARE exposed (within the true 100 km spillover horizon) but the
`d_bar=50` ring misclassifies them as far-away clean controls, so
both `tau` deflates (contaminated control arm) and `delta_1`
attenuates (cleaner near-controls now mixed into the `S=0` comparison).
Rewrites the §4 paragraph accordingly and references the registry's
documented failure mode for undershooting `d_bar`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ings

P1 fix (§6 pairwise Conley support):
The previous §6 rewrite said far-control band is "outside kernel
support" and "contributes nothing beyond own diagonal" — wrong.
Conley kernel support is determined PAIRWISE between observations,
not relative to the treated cluster. The far-control band lat extent
is ~111 km, so most far/far pairs ARE within 100 km of each other
and DO contribute non-zero Bartlett weight to the spatial Conley meat.

Rewrites §6 to enumerate which pair classes contribute under this
DGP's geometry (treated×treated, treated×near, near×near, AND far×far
within-band) vs which are outside the 100 km support (treated×far,
near×far cross-band). The Conley < HC1 result on this DGP comes from
the net sign of ALL the within-100-km cross-covariance terms, not
just the treated-near block.

P2 fix (narrow warning suppression + warning-policy guard):
Replaces blanket `warnings.simplefilter("ignore")` around every
SpilloverDiD fit with a message-scoped filter for the three known
benign `*encountered in matmul` RuntimeWarnings the stage-2 sandwich
emits from a per-unit FE projection edge case. Notebook §4/§5/§6 fits
and the corresponding drift fixtures all route through
`_silence_spillover_matmul_warnings()` for a single source of truth.

Adds `test_spillover_fit_emits_only_known_matmul_warnings` (T19-pattern
warning-policy guard) that runs the headline fit under
`warnings.simplefilter("always")` and asserts ONLY the three known
matmul warnings are emitted — any new warning category, or new
RuntimeWarning message, surfaces immediately instead of being silently
masked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Completes the warning-suppression narrowing from R4. R5 caught three
spots still using blanket `warnings.simplefilter("ignore")`:

- Notebook §6 `fit_with()` helper (the Conley-vs-HC1 comparison cell).
- Drift test `test_rings_sensitivity_grid_endpoints` loop body.
- Drift test `test_rings_grid_d_bar_100_to_200_identical` loop body.

All three now use either the notebook-side narrow `RuntimeWarning`
filter for `.*encountered in matmul` or the drift-test helper
`_silence_spillover_matmul_warnings()`, matching the §5 fit fixture
pattern. Single source of truth for the suppression contract.

Adds `test_spillover_conley_fit_emits_only_known_matmul_warnings`
parallel to the §5 warning-policy guard, locking the warning surface
of the §6 Conley-path fits (lag=0 spatial-only and lag=1
spatial+serial). Refactors the assertion body into a shared helper
`_assert_only_known_matmul_warnings()` used by both §5 and §6 guards.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R6 caught a methodologically-misleading phrasing in §6: the previous
paragraph attributed the Conley<HC1 result on this DGP to "positive
autocorrelation of scores within the far-control block", but in the
Conley sandwich formula positive off-diagonal score covariances
ADD to the variance (they make the meat larger), not shrink it.

Replaces that sentence with a sign-agnostic explanation: the Conley
meat is HC1 diagonal terms PLUS Bartlett-weighted off-diagonal score
cross-products from within-support pair classes, and the sandwich
ATT variance depends on the NET sign of those terms — on this seed
they net out below HC1, but the direction is a feature of the
data/design and not attributable to any single pair class.

No behavior or numerical change; prose-only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R7 caught two asymmetries in the warning-policy guard:

1. The assertion only rejected "unexpected" warnings but didn't
   pin presence of the three known matmul flavors. If upstream
   fixed one (so e.g. "overflow encountered in matmul" stopped
   firing), the test would still pass and the notebook prose /
   comments referencing "three known warnings" would silently go
   stale.

2. The silence filter used `.*encountered in matmul` as a regex
   pattern, which would also silence a hypothetical 4th flavor
   ("underflow encountered in matmul") without surfacing it.

Renames the helper to `_assert_exact_matmul_warning_surface` and
extracts the three exact expected messages into a module-level
constant `_EXPECTED_MATMUL_MESSAGES`. Both halves of the contract
are now asserted:

- (a) every captured warning matches one of the three EXACT
  expected messages (rejects new flavors / categories);
- (b) ALL THREE messages are present at least once (rejects
  silent disappearance of any of them).

`_silence_spillover_matmul_warnings()` is also tightened to install
one `filterwarnings("ignore", category=RuntimeWarning, message=m)`
per exact expected message string, so a new matmul flavor is no
longer auto-silenced at fixture level — it surfaces immediately in
the guard test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R8 noted that the R7 fix asserted presence/absence of the three known
matmul warnings but didn't pin their multiplicity. If upstream starts
firing one of the same messages twice (e.g. a stage-2 loop refactor
emits "divide by zero" 2x instead of 1x), the assertion still passes
silently, while the notebook's repeated "three benign matmul
warnings" claim becomes stale.

Tightens the guard helper to assert EXACT multiset equality via
`Counter(observed) == Counter(_EXPECTED_MATMUL_MESSAGES)` — each
expected message must fire exactly once. Adds a separate
`non_runtime` precheck that surfaces any unexpected warning category
(UserWarning, FutureWarning, etc.) with a category-named error.

Verified the current multiplicity is 1x each across all three fits
(HC1, Conley lag=0, Conley lag=1) before tightening; if a future
library change shifts that count legitimately, the failure message
lists observed vs. expected counts so a maintainer can update both
`_EXPECTED_MATMUL_MESSAGES` / the multiplicity expectation AND the
§5/§6 notebook narrative.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R9 caught two P3 cosmetic accuracy issues:

1. The §2 panel-layout table and §6 pair-enumeration both quoted
   geometry numbers that don't match the seeded DGP. The "within
   ~5 km of (0,0)" treated-cluster phrasing is the expectation of a
   single `rng.normal(0, 0.05)` draw, but at seed 23 the treated
   cluster max distance from origin is 11.6 km and the pairwise
   diameter is 22.3 km. Similarly the near-band extent (11-78 km)
   and far-band extent (~111 km lat) were stylized round numbers
   rather than the seed-23 actuals.

   Updates the §2 table and the §6 pair enumeration to the seed-23
   measured values:
     - treated cluster: max ~12 km, diameter ~22 km
     - near band: ~12-82 km
     - far band: ~224-331 km (lat extent ~131 km)
     - far x far within 100 km: ~95% (median pair dist ~56 km)
     - near x near within 100 km: 100%

   The notebook's substantive claims (recovery / sensitivity grid /
   Conley < HC1) are unchanged; only the geometric descriptions.

2. CHANGELOG entry says the drift file has "14 function-level
   tests" — the file now has 17 (after the R3/R4/R5 additions of the
   `test_conley_se_less_than_hc1` direction pin and the §5/§6
   warning-policy guards). Updates the count and lists the new
   coverage surface (direction pin + exact warning multiset).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI codex flagged a merge-blocker P1: my warning-policy guards required
the three `*encountered in matmul` RuntimeWarnings to fire on every
SpilloverDiD fit (Counter-multiset equality of an exact expected
multiset). But per TODO.md "RuntimeWarnings in Linear Algebra
Operations", those warnings are an Apple Silicon M4 + macOS Sequoia +
numpy<2.3 Accelerate BLAS artifact (numpy#28687, fixed in numpy>=2.3)
and DO NOT fire on M3 / Intel / Linux. The pure-Python CI job runs
Ubuntu with latest numpy, so my multiplicity assertion would fail
even on a fully-correct fit.

Switches to the T19 platform-agnostic pattern (mirrored from
tests/test_t19_marketing_pulse_drift.py:221-260):

- Drift test installs the narrow `.*encountered in matmul` filter
  INSIDE the warning-capture block, then asserts the post-filter
  captured list is empty. Renamed both guards to
  `test_*_warning_policy_post_filter_clean` to reflect the contract.
- New helper `_assert_post_filter_warning_surface_is_clean()` replaces
  the old `_assert_exact_matmul_warning_surface` (multiplicity-pin)
  + `_EXPECTED_MATMUL_MESSAGES` constant.
- `_silence_spillover_matmul_warnings()` reverts to a single broad
  `r".*encountered in matmul"` filter (no message-exact-per-flavor
  list — narrowing wasn't needed once the assertion stopped requiring
  presence). Docstring now cites TODO.md and the Apple BLAS root cause
  instead of the misattributed "per-unit FE projection edge case".

On Apple Silicon M4 + numpy<2.3: warnings fire, filter catches them,
assertion passes. On M3 / Intel / Linux or numpy>=2.3: warnings don't
fire, filter is a no-op, assertion still passes. Either way, any
unexpected UserWarning / FutureWarning / non-matmul RuntimeWarning
fails the guard.

Notebook §4 comment updated to attribute the matmul filter to the
Apple Silicon BLAS issue (with TODO.md pointer). §5 and §6 inherit
the cross-reference via "See §4 for the rationale ...".

CHANGELOG entry tightened: drops the (now incorrect) "multiplicity
shift surfaces immediately" claim and explains the platform-agnostic
post-filter contract instead.

Local: 17 drift tests + nbmake still pass on Apple Silicon (warnings
fire and get filtered). CI Linux: the filter is a no-op and the
assertion still passes because the SpilloverDiD fit emits no warnings
when the BLAS bug isn't triggered.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…≥100 grid

CI codex R2 noted the §4 narrative claim that `d_bar in {100, 150,
200}` produces identical estimates covers BOTH `tau_total` AND
`delta_1`, but the exact-identity drift guard
`test_rings_grid_d_bar_100_to_200_identical` only pinned `tau_total`.
`delta_1` was checked via the round-to-1 grid endpoints
(`test_rings_sensitivity_grid_endpoints`), which would still pass if
a future change introduced a small (<0.05) delta_1 drift between the
three d_bar values while keeping the rounded endpoints unchanged.

Adds companion test `test_rings_grid_d_bar_100_to_200_identical_delta_1`
that asserts `delta_1` is bit-equal (atol=1e-10) across d_bar ∈ {100,
150, 200} on the locked panel, mirroring the tau_total identity test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two cosmetic P3s from CI codex R3:

- docs/tutorials/README.md: catalog stopped at T22; added a T23 entry
  in the same format (overview + companion drift-test pointer).
- CHANGELOG.md: said "17 function-level tests" — file now has 18
  after the d_bar≥100 delta_1 identity companion test landed in
  commit 1248c44. Bumped to 18 and also surfaced the new exact-
  identity claim ("tau_total AND delta_1 ... per Butts §4 'once
  d_bar covers the true horizon, widening is benign'").

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI codex R4 caught two real gaps in the drift coverage:

P2 (Maintainability): the test file's docstring claims the DGP-builder
is duplicated "verbatim" from the notebook §2 cell, but only the
parameter CONSTANTS are pinned (`test_dgp_true_parameters_match_quoted`).
Non-constant edits (coordinate ranges, lambda_t, row construction)
could drift silently if the headline numbers stay within tolerance.

Adds `test_notebook_dgp_ast_matches_test_fixture` — parses the
notebook JSON, extracts the §2 `build_t23_panel` FunctionDef, and
compares its AST (with docstring stripped, function name normalized)
against `_build_t23_panel`'s. Uses `ast.dump` for whitespace- and
comment-agnostic semantic equality. Any DGP-logic divergence between
the two copies now fails loudly; cosmetic-only edits (whitespace,
comments) don't trigger spurious failures.

P3 (Documentation/Tests): §2 quotes seed-specific geometry numbers
(max ~12 km, cluster diameter ~22 km, near 12-82 km, far 224-331 km)
and §6 quotes pair-support percentages (far×far ~95% within 100 km,
near×near 100%). Drift tests only pinned the band counts and lat
bounds, so those prose details could drift silently.

Adds `test_seed_specific_geometry_pins_match_quoted` — recomputes
each quoted value from the seed-23 panel using haversine-deg-to-km
arithmetic and asserts they match the notebook narrative integers
(rounded). If a future RNG/geometry change shifts any number outside
the rounded value, the test fails and the maintainer must update
either the prose or the layout parameters.

20 drift tests pass (16 → 20: +AST sync, +geometry pin, +
`test_rings_grid_d_bar_100_to_200_identical_delta_1` from R3, +
the R3 §6 warning-policy guard).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three P3 refinements from R5:

1. (Doc) §6 said the far-band "lat extent" is ~131 km. That's wrong:
   the latitude span is `[2.0, 3.0]` * 111 km ≈ 107 km of TRUE lat
   extent; the 131 km figure is the PAIRWISE BAND DIAMETER (the
   `np.sqrt(dlat² + dlon²).max()` quantity the test actually computes).
   Renames the prose to "max within-band pairwise distance is ~131 km"
   and adds an explanatory comment in the test where the geometry was
   computed.

2. (Maintainability) `test_seed_specific_geometry_pins_match_quoted`
   claimed it pinned "all the values quoted in the notebook", but two
   §6 values were unpinned: "treated × near-control pairs ≤ ~94 km
   separation" and "median far/far pairwise distance ~56 km". Also
   the §6 prose claimed ≤ ~94 km but the actual max treated × near
   distance at seed 23 is ~89.7 km, so I updated the prose to
   "≤ ~90 km" to match (more accurate) and added two new helper
   functions + assertions:
     - `_cross_band_max_pair_km(treated, near) == 90`
     - `_within_band_median_pair_km(far) == 56`

3. (Doc) `CHANGELOG.md` understated coverage at 18 tests; the file
   now defines 20 (R4 added AST sync + geometry pins). Bumped to 20.

Drift suite: still 20 tests; numerical content unchanged. Notebook
prose: §6 wording reflects geometric reality.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R6 noted that my AST sync test only compared the `build_t23_panel`
function body, but the notebook §2 cell also has 10 cell-level
constant assignments above the function (MAIN_SEED, N_TREATED,
N_NEAR, N_FAR, T_PERIODS, FIRST_TREAT, TAU_TOTAL, DELTA_1, D_BAR_KM,
NOISE_SD). A notebook-only edit to any of those would change tutorial
behavior without failing either `test_notebook_dgp_ast_matches_test_fixture`
(only compares function body) or `test_dgp_true_parameters_match_quoted`
(only reasserts the test module's own constants).

Adds `test_notebook_dgp_constants_match_test_module_constants`: parses
the notebook §2 cell, walks the top-level `ast.Assign` nodes,
literal-evaluates each RHS, and asserts the 10 expected constants
have values matching the test module. Any notebook-only constant
drift now fails the guard with a (notebook_value, test_value) diff
in the error message.

CHANGELOG count bumped 20 → 21.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R7 noted that the notebook, README, and CHANGELOG publish 2-decimal
numbers (-4.29, -7.34, -4.53, -5.38, -2.59) but the drift tests only
enforce round-to-1 precision. Future drift smaller than 0.05 could
leave the prose stale without failing any test.

Mixed fix:

- WELL-CONDITIONED headline pins tightened to round-to-2 (matching
  what the prose actually quotes):
    * test_naive_att_endpoint_matches_quoted: -4.3 → -4.29
    * test_spillover_did_recovers_tau_total: -7.3 → -7.34
    * test_spillover_did_recovers_delta_1: -4.5 → -4.53

- BORDERLINE rings=[0,50] grid point left at round-to-1 (per the R5
  reviewer's BLAS-safety guidance). Notebook §4 prose coarsened to
  match: "-5.38" → "~-5.4" and "-2.59" → "~-2.6", with an explicit
  parenthetical explaining WHY this point uses 1-decimal precision
  while §5 uses 2 (borderline-rank-deficient design under d_bar=50
  can shift across BLAS paths).

Headline values are stable across BLAS paths to better than 0.005
(verified on Apple Silicon; cross-platform variance on well-
conditioned fits is sub-ULP). The drift-test docstrings now
explicitly document the 2-decimal vs 1-decimal contract for future
maintainers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber and others added 5 commits May 25, 2026 16:15
…cision

R8 caught two P3 prose accuracy issues:

1. §6 Conley narrative said the no-survey path "operates on raw
   scores" and the meat is "HC1 diagonal terms PLUS off-diagonal
   cross-products". Both phrasings misstate the documented Wave D
   path:
   - The no-survey Conley path uses observation-level Wave D Gardner
     GMM-corrected influence rows `psi_i` (per `diff_diff/two_stage.py:L83-L120`
     and `docs/methodology/REGISTRY.md:L3341-L3363`), not raw scores.
   - The Conley meat decomposition is diagonal self-product terms
     (`psi_i psi_i'`) plus off-diagonal Bartlett-weighted pair terms
     (`K(d_{ij}/h) psi_i psi_j'`). It does NOT apply an HC1 `n/(n-p)`
     finite-sample multiplier — that lives on the HC1 path only.

   Rewrites the §6 mechanism paragraph + cross-band paragraph to:
   - Refer to "observation-level Wave D Gardner GMM-corrected
     influence rows $\psi_i$" instead of "raw scores".
   - Use the explicit decomposition $\psi_i \psi_i'$ + Bartlett-
     weighted $\psi_i \psi_j'$ and add a parenthetical noting the
     missing HC1 multiplier as a separate contributor to the
     Conley < HC1 direction on this DGP.

2. CHANGELOG said "naive coefficient endpoint (round-to-1)" but the
   actual drift test now pins -4.29 at 2 decimals. Updates the
   CHANGELOG line to reflect the current contract — naive coefficient
   at 2 decimals, headline tau/delta_1 at 2 decimals, borderline
   `rings=[0,50]` at 1 decimal, and lists the additional sync /
   geometry / constant pins added during the R4-R8 iteration cycle.

No code/test changes; prose-only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…attribution

R9 noted that README + CHANGELOG over-attributed the §6 demo
configuration `vcov_type="conley", conley_cutoff_km=100,
conley_lag_cutoff∈{0,1}` to "Butts §3.1". The paper §3.1 only
supports the spatial cutoff choice (`conley_cutoff_km = d_bar`); the
`conley_lag_cutoff` serial-Bartlett extension is the library's
documented Wave E.2 follow-up synthesis (Newey-West 1987 form on the
within-PSU/within-unit time axis, per REGISTRY "Variance (Wave E.2
follow-up)"). The notebook body §6 itself already drew this
distinction correctly; only the condensed summaries were imprecise.

Rephrases both summary sentences to:
"the cutoff = `d_bar` choice follows Butts §3.1, while the
`conley_lag_cutoff` serial extension is the library's documented
Wave E.2 follow-up synthesis with Newey-West-style serial Bartlett
HAC (per REGISTRY 'Variance (Wave E.2 follow-up)')".

No code/test changes; documentation attribution-only refinement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…claim

R10 noted that CHANGELOG attributed the `d_bar ∈ {100, 150, 200}`
exact identity to "Butts §4 'once d_bar covers the true horizon,
widening is benign'", but that overstates the paper. The Butts paper
and the registry frame `d_bar` as a real bias/variance tradeoff; the
exact identity at THIS grid is a DGP-specific consequence of the
synthetic panel having no units in the 80-200 km band (the
near-control band tops out at ~78 km, and the far-control band starts
at ~224 km, so widening past 100 km adds zero observations to any
ring bin).

Rephrases the CHANGELOG entry to attribute the identity to the
tutorial DGP geometry and explicitly note that it is NOT a generic
result.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R11 noted that §7's survey-weights bullet conflated Wave E.2
(cross-sectional Conley + survey via stratified-Conley sandwich on
PSU totals) with the Wave E.2 follow-up (panel-block spatial + serial
Bartlett HAC; requires an effective PSU on the survey design — either
explicit `survey_design.psu=` or `cluster=<col>` injected per the
Wave E.1 warn-and-use-PSU pattern).

Rewrites the bullet to distinguish:
- `conley_lag_cutoff=0`: Wave E.2 stratified-Conley sandwich on PSU
  totals (cross-sectional, spatial only)
- `conley_lag_cutoff>0`: Wave E.2 follow-up panel-block extension
  (library synthesis with within-PSU serial Bartlett HAC, requires
  effective PSU on the survey design)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rce absent

CI's pure-Python and Rust matrix jobs copy `tests/` to an isolated
location (per the "Copy tests to isolated location (Unix)" workflow
step) WITHOUT the `docs/` tree, to verify the installed package
doesn't depend on the source tree. Two of my new drift tests need
to load `docs/tutorials/23_spillover_tva.ipynb` and fail with
`FileNotFoundError: '/private/tmp/docs/tutorials/23_spillover_tva.ipynb'`
on every non-source-tree job.

Failing jobs (from run 26416449494):
- Python Tests (macos-latest, py3.11/3.13/3.14)
- Python Tests (windows-latest, py3.13)
- Python Tests (ubuntu-24.04-arm, py3.14)

Fix: both `test_notebook_dgp_constants_match_test_module_constants`
and `test_notebook_dgp_ast_matches_test_fixture` now `pytest.skip()`
gracefully when the notebook source isn't reachable, with a docstring
note explaining that the sync-guard contract is meaningful in local
dev (where edits happen) and the nbmake CI job separately verifies
the notebook executes end-to-end.

This preserves the codex R4/R6 P2 fixes (notebook source IS available
in local dev runs and in the iterative codex-review workflow, so the
guards still trigger on real edits) while making the suite robust to
the project's isolated-tests CI pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber igerber force-pushed the spillover-tva-tutorial-t23 branch from 239eeb5 to c3fd1de Compare May 25, 2026 20:16
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: c3fd1de2397df20622f9992fed7f413993c18908


Overall Assessment

✅ Looks good

Executive Summary

  • Re-review remains clean. The earlier survey-Conley documentation concern stays resolved: the notebook now correctly scopes conley_lag_cutoff=0 vs >0 under survey_design= at docs/tutorials/23_spillover_tva.ipynb:L542-L552, matching docs/methodology/REGISTRY.md:L3417-L3445 and the fit-time gate in diff_diff/spillover.py:L3219-L3260.
  • This PR is docs/tests only. I found no estimator, weighting, variance, identification, or default-behavior changes, and the tutorial’s methodological claims stay within the documented SpilloverDiD contract in diff_diff/spillover.py:L1664-L1690, diff_diff/spillover.py:L2110-L2148, and docs/methodology/REGISTRY.md:L3338-L3498.
  • The drift suite is strong for a tutorial PR: it pins the notebook DGP, headline coefficients, sensitivity-grid behavior, geometry claims, Conley-vs-HC1 behavior, and warning policy in tests/test_t23_spillover_tva_drift.py:L37-L178, tests/test_t23_spillover_tva_drift.py:L237-L365, tests/test_t23_spillover_tva_drift.py:L377-L615, and tests/test_t23_spillover_tva_drift.py:L649-L684.
  • Docs wiring is internally consistent across the tutorial index, README, references, paper review, and changelog at docs/index.rst:L80-L86, docs/tutorials/README.md:L122-L128, docs/references.rst:L189-L199, docs/methodology/papers/butts-2021-review.md:L248-L257, and CHANGELOG.md:L13-L14.
  • Static review only: I could not execute tests/test_t23_spillover_tva_drift.py here because this environment does not have numpy, pandas, or pytest.

Methodology

  • No findings. The tutorial’s statements on the ring-indicator spec, d_bar = max(rings), the d_bar sensitivity workflow, Butts’ far-away control requirement, Conley cutoff choice, and the Wave E.2 follow-up serial extension all align with the registry and implementation: docs/tutorials/23_spillover_tva.ipynb:L249-L328, docs/tutorials/23_spillover_tva.ipynb:L393-L552, docs/methodology/REGISTRY.md:L3347-L3445, docs/methodology/papers/butts-2021-review.md:L161-L175, diff_diff/spillover.py:L1664-L1690, diff_diff/spillover.py:L1861-L1878, diff_diff/conley.py:L823-L1043, diff_diff/two_stage.py:L83-L152.

Code Quality

  • No findings. I did not find any new inline inference anti-patterns; the relevant library paths continue to use safe_inference() rather than ad hoc t_stat/p_value/CI logic in diff_diff/spillover.py:L829-L843, diff_diff/spillover.py:L1023-L1023, and diff_diff/spillover.py:L3489-L3503.

Performance

  • No findings. The new drift checks do some quadratic geometry work, but only on a fixed 200-unit tutorial panel, so the cost is bounded and appropriate for this test surface at tests/test_t23_spillover_tva_drift.py:L537-L615.

Maintainability

  • No findings. The notebook/test duplication is well-defended: constant-sync and AST-body-sync guards make the copied DGP hard to drift silently at tests/test_t23_spillover_tva_drift.py:L377-L534.

Tech Debt

  • No findings. Dropping the delivered Spillover tutorial row from TODO.md is correct, and the remaining relevant SpilloverDiD follow-ups are still explicitly tracked at TODO.md:L157-L164.

Security

  • No findings. The PR adds documentation and tests only; I saw no secrets, untrusted execution surface, or unsafe notebook behavior in the changed files.

Documentation/Tests

  • No findings. The tutorial prose, README summary, references entry, paper-review note, and toctree entry are consistent with each other and with the code contracts at docs/tutorials/23_spillover_tva.ipynb:L8-L45, docs/tutorials/23_spillover_tva.ipynb:L230-L239, docs/tutorials/23_spillover_tva.ipynb:L300-L328, docs/tutorials/23_spillover_tva.ipynb:L451-L552, docs/tutorials/README.md:L122-L128, docs/references.rst:L189-L199, and docs/index.rst:L80-L86. Residual risk is limited to the static-only nature of this review because the local environment lacks test dependencies.

@igerber igerber merged commit 75e98e9 into main May 25, 2026
26 checks passed
@igerber igerber deleted the spillover-tva-tutorial-t23 branch May 25, 2026 22:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-ci Triggers CI test workflows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant