Add ContinuousDiD estimator (Callaway, Goodman-Bacon & Sant'Anna 2024)#177

Open
igerber wants to merge 7 commits into main from worktree-continuous-did

Conversation

@igerber igerber commented Feb 21, 2026

Summary

  • Add ContinuousDiD estimator implementing Callaway, Goodman-Bacon & Sant'Anna (2024) "Difference-in-Differences with a Continuous Treatment" (NBER WP 32117)
  • B-spline dose-response curves for ATT(d) and ACRT(d) derivative estimation
  • Full staggered adoption support with (g,t) cell iteration, dose aggregation, and event-study aggregation
  • Multiplier bootstrap inference with analytical SE fallback
  • Extract shared bootstrap utilities from staggered_bootstrap.py into bootstrap_utils.py
  • Fix plt.show() blocking test suite by setting non-interactive matplotlib backend in conftest.py

New files

  • diff_diff/continuous_did.py — Main ContinuousDiD estimator class
  • diff_diff/continuous_did_results.py — ContinuousDiDResults and DoseResponseCurve dataclasses
  • diff_diff/continuous_did_bspline.py — B-spline basis construction, evaluation, derivatives
  • diff_diff/bootstrap_utils.py — Shared bootstrap weight generation, CI, p-value helpers
  • docs/methodology/continuous-did.md — Full methodology documentation
  • tests/test_continuous_did.py — 46 unit/integration tests
  • tests/test_methodology_continuous_did.py — 15 equation verification + 6 R benchmark tests

Methodology references (required if estimator / math changes)

  • Method name(s): Continuous Difference-in-Differences
  • Paper / source link(s): Callaway, Goodman-Bacon & Sant'Anna (2024), NBER WP 32117 — https://www.nber.org/papers/w32117
  • Reference implementation: R contdid v0.1.0 (https://bcallaway11.github.io/contdid/)
  • Any intentional deviations from the source (and why):
    • Use training boundary knots for evaluation grid (R's contdid v0.1.0 has a boundary knot inconsistency where splines2::bSpline(dvals) uses range(dvals) instead of range(dose))
    • Properly propagate control_group to overall ATT computation (R's contdid v0.1.0 always uses notyettreated internally regardless of user setting)

Validation

  • Tests added/updated:
    • tests/test_continuous_did.py — 46 tests covering init, data validation, fit, results, B-splines, dose grid, aggregation, bootstrap, edge cases
    • tests/test_methodology_continuous_did.py — 15 tests: hand-calculable equation verification plus 6 R contdid benchmarks (ATT(d), ACRT(d), and overall ATT/ACRT all agree within 1–2% relative tolerance)
    • tests/conftest.py — Added MPLBACKEND=Agg to prevent plt.show() blocking
  • Full test suite: 1485 passed, 64 skipped, 0 failures (excluding TROP)
  • Bootstrap refactor safe: all 95 CallawaySantAnna tests pass

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

Implement Callaway, Goodman-Bacon & Sant'Anna (2024) continuous
treatment DiD estimator with B-spline dose-response curves, ACRT
derivatives, staggered adoption support, and multiplier bootstrap
inference. Validated against R contdid v0.1.0 across 6 benchmarks.

Also extract shared bootstrap utilities to bootstrap_utils.py and
fix plt.show() blocking in test suite via non-interactive backend.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions

Overall assessment: ⛔ Blocker

Executive summary

  • P0: control_group="not_yet_treated" incorrectly includes the treated cohort itself in pre-period control sets, biasing pre-trend/event-study results and violating the Methodology Registry definition.
  • P1: New inline inference in DoseResponseCurve.to_dataframe() violates the required safe_inference() pattern.
  • P1: No validation that treated doses are strictly positive; negative/zero treated doses are silently excluded and can break spline setup or bias estimates.

Methodology

  • Severity: P0. Impact: For control_group="not_yet_treated", the control set includes the treated cohort itself when t < g because unit_cohorts > t is true for g. This contaminates the control mean, attenuates pre-trends, and violates the registry’s “not‑yet‑treated excludes the treated cohort” requirement. Location: diff_diff/continuous_did.py in _compute_dose_response_gt where control_mask = never_treated_mask | (unit_cohorts > t). Concrete fix: exclude the treated cohort explicitly, e.g. control_mask = never_treated_mask | ((unit_cohorts > t) & (unit_cohorts != g)), or build control mask as (unit_cohorts == 0) | (unit_cohorts > t) and then control_mask &= (unit_cohorts != g).
  • Severity: P1. Impact: The estimator assumes D_+ strictly positive (see Methodology Registry), but the code only drops units with zero dose; negative treated doses are treated as “untreated” in treated_mask = (unit_cohorts == g) & (dose_vector > 0), which can silently drop treated units and lead to empty dose grids or biased estimates. Location: diff_diff/continuous_did.py in fit() and _compute_dose_response_gt. Concrete fix: validate that all treated units have dose > 0 (or explicitly support negative doses by adjusting masks/grid construction), and raise a clear error if violated.
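The P0 mask fix can be sketched with a toy cohort vector (a minimal numpy illustration of the review's suggested fix, not the project's actual code; the array values are invented):

```python
import numpy as np

# Toy panel: cohort 0 = never treated; other values = first treatment period.
unit_cohorts = np.array([0, 0, 3, 3, 5])
g, t = 3, 2  # evaluating cohort g=3 at pre-period t=2

never_treated_mask = unit_cohorts == 0

# Buggy mask: for t < g, (unit_cohorts > t) is True for cohort g itself,
# so the treated cohort leaks into its own control set.
buggy_mask = never_treated_mask | (unit_cohorts > t)

# Fixed mask: not-yet-treated explicitly excludes cohort g.
fixed_mask = never_treated_mask | ((unit_cohorts > t) & (unit_cohorts != g))
```

With these values, `buggy_mask` marks every unit (including the cohort-3 units) as a control, while `fixed_mask` keeps only the never-treated and cohort-5 units.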

Code Quality

  • Severity: P1. Impact: Inline inference in DoseResponseCurve.to_dataframe() violates the “use safe_inference()” rule and risks inconsistent NaN propagation or distributional assumptions. Location: diff_diff/continuous_did_results.py in DoseResponseCurve.to_dataframe (manual t_stat/p_value). Concrete fix: compute t_stat, p_value, and CI via safe_inference() in a vectorized loop or helper to ensure consistent NaN handling.

Performance

  • No issues found in the diff.

Maintainability

  • No issues found in the diff.

Tech Debt

  • No issues found in the diff.

Security

  • No issues found in the diff.

Documentation/Tests

  • Severity: P2. Impact: The not‑yet‑treated control bug is not covered; current tests only assert that the estimator runs. Without a pre‑period validation test, regressions to control‑group logic won’t be caught. Location: tests/test_continuous_did.py. Concrete fix: add a test that checks pre‑period event‑study effects with control_group="not_yet_treated" and asserts the control group excludes the treated cohort (e.g., by constructing a small panel where contamination would produce a known bias).

Suggested next steps

  1. Fix the control group logic for not_yet_treated in _compute_dose_response_gt and add a targeted test.
  2. Replace inline inference in DoseResponseCurve.to_dataframe() with safe_inference() and add a small test for NaN propagation.
  3. Add explicit validation for treated doses (dose > 0) and a test for negative/zero dose handling.

…lidation

- Fix not_yet_treated control group to exclude cohort g from its own control
  set (matches staggered.py behavior)
- Replace inline t_stat/p_value computation in DoseResponseCurve.to_dataframe()
  with safe_inference() loop per project convention
- Add validation rejecting negative doses among treated units
- Fix test_inf_first_treat_normalization CI failure (cast to float before inf)
- Add test for not_yet_treated control group correctness and negative dose

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

igerber commented Feb 21, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: b9e97f057b84657c0ba9ae6fbdfb87967f1a3eec


Overall assessment: ⛔ Blocker

Executive summary

  • Affected methods: new ContinuousDiD estimator (Callaway, Goodman-Bacon & Sant’Anna 2024) and shared bootstrap utilities used by CallawaySantAnna (refactor only).
  • P0 Analytical SEs from influence functions appear mis-scaled, likely understating SEs and overstating significance in the default n_bootstrap=0 path.
  • P1 Registry-required discrete-dose detection/warning is missing; discrete treatments are silently fit with splines.
  • P1 Event-study aggregation ignores anticipation in relative-time indexing; post-periods can be labeled as pre.
  • P1 No guard for empty post-treatment cells; aggregates default to 0.0 rather than error/NaN.

Methodology

  • P0 diff_diff/continuous_did.py:L747-L788 Impact: Influence-function SEs are scaled by / n_t and / n_c and then sqrt(mean(IF^2)), which underestimates SEs by roughly sqrt(n) (default inference is too optimistic). Concrete fix: define IFs without the 1/n_t/1/n_c scaling and compute se = sqrt(mean(IF^2) / n_units), or keep current IFs but multiply by sqrt(n_units) at the end; validate against bootstrap for parity.
  • P1 diff_diff/continuous_did.py:L170-L210 Impact: Registry says discrete dose should be detected and warned; current validation never checks for integer-valued or low-unique dose. Users may get misleading smooth curves without warning. Concrete fix: add a discrete-dose detection check on treated doses (e.g., all-close-to-integers or small unique count) and warn/raise with guidance that saturated regression is not yet implemented.
  • P1 diff_diff/continuous_did.py:L670-L690 Impact: Event-study relative time uses e = t - g even when anticipation > 0; this mislabels post-treatment periods as pre-treatment in the event-study output. Concrete fix: set e = t - (g - self.anticipation) or adjust the event-study aggregation to shift by anticipation consistently with the post-treatment filter.
  • P1 diff_diff/continuous_did.py:L269-L307 Impact: If post_gt is empty (e.g., all treatments occur after the last observed period), overall_att/overall_acrt stay at 0.0 with NaN inference, which is misleading rather than a hard failure. Concrete fix: after building post_gt, error or return NaNs with a clear warning when no post-treatment cells exist.
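The P0 scaling issue is easiest to see with the simplest influence function, that of a sample mean (an illustrative sketch, not the project's code; `inf_fn` stands in for the per-unit influence contributions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
n = len(x)

# Influence function of the sample mean: IF_i = x_i - mean(x).
inf_fn = x - x.mean()

# Conventional IF-based SE: sqrt(mean(IF^2) / n) == std(x) / sqrt(n).
good_se = np.sqrt(np.mean(inf_fn**2) / n)

# Mis-scaled variant from the finding: pre-dividing each IF by n and then
# taking sqrt(mean(IF^2)) yields std(x) / n -- too small by sqrt(n).
bad_se = np.sqrt(np.mean((inf_fn / n) ** 2))
```

Here `bad_se * sqrt(n)` recovers `good_se` exactly, which is the sqrt(n) understatement the finding describes.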

Code Quality

  • P2 diff_diff/continuous_did.py:L520-L551 Impact: Invalid control_group values silently fall into the not_yet_treated branch, which can mis-specify the estimator without any error. Concrete fix: validate control_group and base_period against allowed values in __init__ or fit and raise ValueError on unknown inputs.
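A guard along the lines of this fix might look like the following (a sketch; the `never_treated`/`not_yet_treated` values appear elsewhere in this thread, while the `base_period` options are assumed from the R did convention and may differ in the actual code):

```python
VALID_CONTROL_GROUPS = {"never_treated", "not_yet_treated"}
VALID_BASE_PERIODS = {"varying", "universal"}  # assumed option names

def validate_params(control_group: str, base_period: str) -> None:
    """Raise early on unknown values instead of silently falling through."""
    if control_group not in VALID_CONTROL_GROUPS:
        raise ValueError(
            f"control_group must be one of {sorted(VALID_CONTROL_GROUPS)}, "
            f"got {control_group!r}"
        )
    if base_period not in VALID_BASE_PERIODS:
        raise ValueError(
            f"base_period must be one of {sorted(VALID_BASE_PERIODS)}, "
            f"got {base_period!r}"
        )
```

A typo such as `control_group="nevertreated"` then fails loudly at construction time rather than silently selecting the not_yet_treated branch.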

Performance

  • No issues found.

Maintainability

  • No issues found.

Tech Debt

  • No issues found.

Security

  • No issues found.

Documentation/Tests

  • P2 tests/test_continuous_did.py Impact: No tests covering anticipation with event-study aggregation or empty post_gt handling; discrete-dose warning not tested. Concrete fix: add tests for anticipation>0 event-study labels, no-post-treatment error/NaN behavior, and discrete-dose warning path.

Tests not run (not requested).

- Fix analytical SE: use sqrt(sum(IF^2)) instead of sqrt(mean(IF^2)) to
  match CallawaySantAnna's influence function convention (P0)
- Add discrete dose detection warning for integer-valued treatments (P1)
- Guard empty post-treatment cells: warn and return NaN instead of 0.0 (P1)
- Validate control_group and base_period params in __init__ and set_params (P2)
- Add 7 new tests: SE parity, discrete dose, anticipation event study,
  empty post_gt, and parameter validation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

igerber commented Feb 21, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 8c3980b413a2aa20ecf5747d17bcb3ee091e4bcd


Overall assessment: ⚠️ Needs changes

Executive summary

  • P1: ContinuousDiD allows estimation without any D=0 units under control_group="not_yet_treated", but the Methodology Registry requires P(D=0)>0 (Remark 3.1 not implemented).
  • P1: Bootstrap inference is used only for SEs; overall/event‑study p‑values and CIs are normal‑approx, and dose‑response p‑values are normal while CIs are bootstrap percentiles, which is inconsistent with the Registry’s “multiplier bootstrap for inference.”
  • P2: Event‑study inference is left as NaN when n_bootstrap=0, despite “analytical SE fallback” being advertised.
  • Otherwise the core B‑spline construction, (g,t) iteration, and weighting logic look aligned with the new methodology docs.

Methodology

  • Severity: P1 | Impact: Violates the Registry’s requirement of an untreated group (P(D=0)>0), so estimates can be produced in unidentified settings without warning when control_group="not_yet_treated". | Concrete fix: Enforce n_control > 0 regardless of control_group, or implement the “lowest‑dose‑as‑control” Remark 3.1 logic and document it. | diff_diff/continuous_did.py:L262-L267
  • Severity: P1 | Impact: When n_bootstrap>0, overall ATT/ACRT and event‑study p‑values/CIs are computed via normal approximation (safe_inference), while dose‑response CIs are percentile bootstrap; this is inconsistent with the Registry’s stated “multiplier bootstrap for inference” and with existing CS/SA bootstrap behavior. | Concrete fix: Use compute_effect_bootstrap_stats outputs for overall and event‑study (and optionally per‑dose) CIs/p‑values; if you want bootstrap‑based ACRT intervals later, add acrt_glob into bootstrap_info and shift boot_acrt_glob by the original effect before computing percentiles. | diff_diff/continuous_did.py:L378-L406, diff_diff/continuous_did.py:L1012-L1032, diff_diff/continuous_did.py:L688-L705, diff_diff/continuous_did.py:L954-L959, diff_diff/continuous_did_results.py:L47-L66
  • Severity: P2 | Impact: aggregate="eventstudy" with n_bootstrap=0 yields NaN SE/t/p/CI for event‑study effects, which undercuts the “analytical SE fallback” claim. | Concrete fix: Add an analytical event‑study SE path (re‑use influence functions + event‑time weights), or explicitly document that event‑study inference requires bootstrap. | diff_diff/continuous_did.py:L328-L406
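One way to make overall and event-study inference consistent with the bootstrap, as the P1 item suggests, is to recenter the perturbation draws at the point estimate before taking percentiles (an illustrative sketch; `boot_deltas` stands for draws of theta_b minus theta_hat, and the p-value construction shown is one common convention, not necessarily the project's):

```python
import numpy as np

def percentile_bootstrap_ci(estimate, boot_deltas, alpha=0.05):
    """CI and p-value from multiplier-bootstrap perturbation draws.

    boot_deltas holds draws of (theta_b - theta_hat); adding the point
    estimate recenters the distribution before taking percentiles.
    """
    boot_deltas = np.asarray(boot_deltas)
    draws = estimate + boot_deltas
    lo, hi = np.quantile(draws, [alpha / 2, 1 - alpha / 2])
    # Two-sided p-value: how often a draw deviates from the estimate by
    # more than the estimate deviates from zero.
    p = float(np.mean(np.abs(boot_deltas) >= abs(estimate)))
    return lo, hi, p
```

With this convention the interval brackets the estimate rather than zero whenever the perturbations are small relative to the effect.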

Code Quality

  • No issues found.

Performance

  • No issues found.

Maintainability

  • No issues found.

Tech Debt

  • No issues found.

Security

  • No issues found.

Documentation/Tests

  • No issues found.

…l event-study SEs

- Use percentile CI/p-value from bootstrap (not normal approx) for overall
  ATT/ACRT and event-study effects, matching CallawaySantAnna convention
- Add P(D=0)>0 warning when control_group='not_yet_treated' has no
  never-treated units (Remark 3.1 in Callaway et al.)
- Compute IF-based analytical SEs for event-study bins when n_bootstrap=0
  (previously yielded NaN)
- Add tests for all three fixes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

igerber commented Feb 21, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 40d22d4530fb45aed6315b92a27d7a33f7d27e93


Overall assessment: ⚠️ Needs changes

Executive Summary

  • ContinuousDiD is well-scaffolded (estimator, B-splines, results, tests, registry), and bootstrap helpers were cleanly extracted.
  • P1 methodology gap: estimation proceeds without any D=0 units under control_group="not_yet_treated", but the registry says Remark 3.1 is not implemented and P(D=0)>0 is required.
  • P1 methodology gap: balanced-panel check only compares per-unit counts, not identical time sets, so some unbalanced panels slip through.
  • P3 test reliability: R benchmark tests write fixed /tmp/r_bench*.csv paths, which can collide under parallel pytest.

Methodology

  • P1 — Impact: Allows estimation when P(D=0)=0 despite registry stating lowest‑dose‑as‑control is not implemented and P(D=0)>0 is required, so ATT/ACRT are not identified. | Fix: raise a ValueError when control_group="not_yet_treated" and n_control==0, or implement Remark 3.1 and document it. diff_diff/continuous_did.py:L269-L277 and docs/methodology/REGISTRY.md:L425-L429
  • P1 — Impact: “Balanced panel required” is enforced only by equal counts per unit; panels with different missing periods can pass and propagate NaNs into ΔY and OLS, violating the registry requirement. | Fix: verify each unit has the same set of time values as the global set (e.g., compare sets or pivot and check for any NaNs) and raise on mismatch. diff_diff/continuous_did.py:L247-L252 and docs/methodology/REGISTRY.md:L429-L429
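The stricter check in the second P1 item could be sketched like this (a hypothetical helper with assumed column names; the estimator's actual column handling may differ):

```python
import pandas as pd

def check_balanced(df, unit_col="unit", time_col="period"):
    """Require every unit to observe the identical set of time values.

    Equal counts per unit are not enough: two units can each have three
    periods yet cover different calendar times.
    """
    all_times = set(df[time_col].unique())
    bad = df.groupby(unit_col)[time_col].agg(lambda s: set(s) != all_times)
    if bad.any():
        raise ValueError(
            f"Unbalanced panel: units {list(bad[bad].index)} "
            f"do not cover all periods"
        )
```

A panel where unit 1 observes periods {1, 2} and unit 2 observes {1, 3} passes a counts-only check but is rejected here.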

Code Quality

  • P3 — Impact: Invalid aggregate values silently fall back to dose aggregation, making typos hard to detect. | Fix: validate aggregate in {None, "dose", "eventstudy"} at the start of fit() and raise on invalid input. diff_diff/continuous_did.py:L338-L343

Performance

  • No issues found.

Maintainability

  • No issues found.

Tech Debt

  • No issues found.

Security

  • No issues found.

Documentation/Tests

  • P3 — Impact: Fixed /tmp/r_bench*.csv paths can collide in parallel test runs or leave stale files, causing flaky R benchmarks. | Fix: use tempfile.NamedTemporaryFile (or pass a temp path into the R script) and read back that same path in Python. tests/test_methodology_continuous_did.py:L477-L493, tests/test_methodology_continuous_did.py:L551-L568, tests/test_methodology_continuous_did.py:L621-L636
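The collision-free pattern this fix describes can be as small as the following (a sketch; `bench_csv_path` is a hypothetical helper, and the real tests would pass the returned path into the R script and read it back on the Python side):

```python
import os
import tempfile
from pathlib import Path

def bench_csv_path() -> Path:
    """Unique, pre-created CSV path for one R benchmark invocation.

    mkstemp creates the file atomically with a random name, so two
    pytest-xdist workers can never race to the same /tmp path, and
    stale files from earlier runs are never re-read.
    """
    fd, name = tempfile.mkstemp(prefix="r_bench_", suffix=".csv")
    os.close(fd)  # the R process will open it by name later
    return Path(name)
```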

- Upgrade P(D=0)=0 warning to ValueError for not_yet_treated (P1)
- Strengthen balanced-panel check to verify identical time sets (P1)
- Add aggregate parameter validation at fit() entry (P3)
- Replace hardcoded /tmp paths with tempfile in R benchmarks (P3)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

igerber commented Feb 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 1e01fd45b73f49bd7b04939b293c7a6f86b4ae72


Overall assessment: ⚠️ Needs changes

  • P1: Bootstrap ACRT^{glob} distribution omits the baseline term, so percentile CI/p-values for ACRT^{glob} are centered at 0 instead of the estimate when n_bootstrap > 0.
  • Methodology registry alignment looks solid for ATT(d)/ACRT(d) construction, control-group requirements, and edge-case handling.
  • Refactor to shared bootstrap utilities appears consistent with prior behavior; no NaN-guard regressions observed.
  • Docs/tests coverage is strong for the new estimator and edge cases.

Methodology

  • P1: Bootstrap ACRT^{glob} uses only the perturbation term (delta_beta @ dpsi_mean) and never adds the baseline acrt_glob, because _bootstrap_info doesn’t include acrt_glob. This yields a bootstrap distribution centered at 0, making percentile CI and p-values for ACRT^{glob} incorrect whenever bootstrap is enabled. Impact: misleading inference for ACRT^{glob} (CI/p-values wrong; SEs unaffected but still paired with wrong CI/p). Concrete fix: either store acrt_glob in _bootstrap_info and add it when building boot_acrt_glob, or compute acrt_glob_b directly as beta_b @ dpsi_mean and aggregate without adding a separate baseline. diff_diff/continuous_did.py:L741-L788, diff_diff/continuous_did.py:L987-L1042
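The centering bug can be reproduced in a few lines (toy numbers; `dpsi_mean` and `delta_beta` mirror the names in the finding, but the values are invented):

```python
import numpy as np

rng = np.random.default_rng(1)
dpsi_mean = rng.normal(size=4)     # averaged spline-derivative basis
beta_hat = rng.normal(size=4)      # fitted spline coefficients
acrt_glob = beta_hat @ dpsi_mean   # point estimate

# Multiplier-bootstrap coefficient perturbations delta_beta_b.
delta_beta = rng.normal(size=(500, 4)) * 0.05

# Buggy: draws use only the perturbation term -> centered at 0.
boot_bad = delta_beta @ dpsi_mean

# Fix: add the baseline (equivalently, compute beta_b @ dpsi_mean with
# beta_b = beta_hat + delta_beta_b).
boot_good = acrt_glob + delta_beta @ dpsi_mean
```

`boot_bad` averages near zero regardless of the effect size, so its percentiles bracket 0; `boot_good` averages near `acrt_glob`, which is what percentile CIs and p-values require.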

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

  • No findings.

Store acrt_glob in _bootstrap_info so the bootstrap distribution is
centered at the point estimate instead of 0.  Add test that verifies
bootstrap ACRT CI brackets the estimate rather than zero.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

igerber commented Feb 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 0f2849bf7b0091802a6c1053c6cd2ed56564cbc1


Overall assessment: ⚠️ Needs changes

Executive Summary

  • ContinuousDiD aligns with the registry on core estimation flow and the test coverage is thorough.
  • P1: B‑spline evaluation uses training boundary knots, but the methodology registry claims a strict match to R contdid; this deviation is not documented.
  • P1: Several __init__ parameters are not persisted into results objects, violating the project rule and losing provenance.
  • P3: Per‑cell OLS computes vcov that is never used.

Methodology

  • P1 | Impact: ATT(d)/ACRT(d) near tails will differ from the stated R reference because evaluation is clamped to training boundary knots, while the registry claims “matching R’s splines2::bSpline”; this is an undocumented deviation from the reference implementation. | Fix: Document the boundary‑knot choice in the methodology docs and registry (or add a switch to reproduce R’s dvals‑boundary behavior for benchmarking). | Location: diff_diff/continuous_did_bspline.py:L89-L100, docs/methodology/REGISTRY.md:L433-L436, docs/methodology/continuous-did.md:L357-L364

Code Quality

  • P1 | Impact: Results omit base_period, anticipation, n_bootstrap, bootstrap_weights, seed, rank_deficient_action, so outputs are not fully reproducible and violate the “new init param” rule. | Fix: Add these fields to ContinuousDiDResults and pass them in ContinuousDiD.fit (optionally include in summary). | Location: diff_diff/continuous_did.py:L84-L109, diff_diff/continuous_did.py:L547-L571, diff_diff/continuous_did_results.py:L91-L114

Performance

  • P3 | Impact: solve_ols(..., return_vcov=True) computes a covariance matrix per (g,t) cell that is never used, adding avoidable cost. | Fix: Drop return_vcov=True (or use vcov if intended). | Location: diff_diff/continuous_did.py:L727-L731

Maintainability

  • No findings.

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

  • No additional findings beyond the Methodology documentation gap noted above.

…e, drop unused vcov

- Document B-spline boundary knot deviation in REGISTRY.md and continuous-did.md
- Add base_period, anticipation, n_bootstrap, bootstrap_weights, seed,
  rank_deficient_action fields to ContinuousDiDResults with passthrough from fit()
- Switch per-cell OLS to return_vcov=False to skip unused covariance computation
- Add test_results_contain_init_params verifying param roundtrip

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

igerber commented Feb 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: f737ab6b99eb4ad6fbc2a001586bc59ec87c465a


Overall assessment: ⚠️ Needs changes

Executive Summary

  • The B‑spline knots are built once from all treated doses, but the Methodology Registry requires per‑cell (g,t) training‑dose boundary knots; this can induce extrapolation and change ATT/ACRT when dose supports differ by cohort.
  • Dose‑response aggregation weights are group‑averaged across post periods, while the Registry specifies n_treated weights per (g,t) cell; this changes the estimand and needs to be reconciled or documented.
  • Minor doc mismatch: ContinuousDiDResults docstring omits the supported "webb" bootstrap weight type.

Methodology

  • Severity: P1 | Impact: Violates the Registry requirement that boundary knots use the training‑dose range per (g,t) cell; using global knots can extrapolate outside cell‑specific supports and distort ATT(d)/ACRT(d), especially when cohorts have different dose ranges. | Fix: Build knots inside _compute_dose_response_gt from treated_doses for that cell (or cache per‑group), and store cell‑specific knots, Psi_eval, and dPsi_eval in _bootstrap_info. | Location: diff_diff/continuous_did.py:L293-L313 and _compute_dose_response_gt uses knots passed in, diff_diff/continuous_did.py:L593-L736; Registry requirement at docs/methodology/REGISTRY.md:L430-L436.
  • Severity: P1 | Impact: Aggregation weights for dose‑response/overall ATT/ACRT are group‑averaged across post periods (pg / n_post_cells), which conflicts with the Registry’s “n_treated weights” requirement and changes the estimand relative to documented methodology. | Fix: If n_treated weighting is intended, set cell_weights[(g,t)] = r["n_treated"] / total_treated for each post cell (no division by number of post cells). If group‑averaged weighting is intentional to match contdid, update the Registry/continuous‑did docs and tests to reflect that explicitly. | Location: diff_diff/continuous_did.py:L367-L395, Registry at docs/methodology/REGISTRY.md:L434-L436.
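Under the registry's n_treated convention described in the second item, the cell weights would be a straight share of treated units per post cell (toy numbers; the record layout with an `n_treated` key is assumed from the finding's description):

```python
# Toy post-treatment (g, t) cells with their treated-unit counts.
post_cells = [
    {"g": 2, "t": 2, "n_treated": 30},
    {"g": 2, "t": 3, "n_treated": 30},
    {"g": 3, "t": 3, "n_treated": 20},
]

total_treated = sum(r["n_treated"] for r in post_cells)

# n_treated weighting: each cell's share of all treated observations,
# with no extra division by the number of post periods per group.
cell_weights = {
    (r["g"], r["t"]): r["n_treated"] / total_treated for r in post_cells
}
```

The group-averaged variant the review flags would instead split each group's total share evenly across its post cells, giving (3, 3) more weight than its treated count alone warrants.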

Code Quality

  • No issues found.

Performance

  • No issues found.

Maintainability

  • No issues found.

Tech Debt

  • No issues found.

Security

  • No issues found.

Documentation/Tests

  • Severity: P3 | Impact: Results docstring omits "webb" as a supported bootstrap weight type, which can mislead users and reviewers. | Fix: Update the docstring to include "webb". | Location: diff_diff/continuous_did_results.py:L95-L100.
