Fix CS event study SEs (missing WIF) + simultaneous confidence bands by igerber · Pull Request #189 · igerber/diff-diff

igerber · 2026-03-08T14:35:09Z

Summary

Fix event study aggregation SEs to include weight influence function (WIF) adjustment, matching R's did::aggte(..., type="dynamic")
Fix bootstrap to perturb combined influence functions (standard IF + WIF) instead of fixed-weight re-aggregation, for both event study and overall ATT
Add simultaneous confidence bands (cband=True, default) using sup-t bootstrap critical values, matching R's did::aggte(..., cband=TRUE)
Add cband_crit_value to results, cband_lower/cband_upper columns in to_dataframe(), and simultaneous CI display in summary()
Update plot_event_study() to use simultaneous CIs when available

Methodology references (required if estimator / math changes)

Method name(s): Callaway-Sant'Anna event study aggregation, simultaneous confidence bands
Paper / source link(s): Callaway, B., & Sant'Anna, P.H.C. (2021). Difference-in-Differences with multiple time periods. Journal of Econometrics, 225(2), 200-230. https://doi.org/10.1016/j.jeconom.2020.12.001
R reference: did::aggte() — WIF formula in AGGTEobj.R, sup-t cband in compute.aggte()
Any intentional deviations from the source (and why): None — matches R's did package behavior

Validation

Tests added/updated:
- tests/test_methodology_callaway.py: 10 new tests in TestEventStudySEWithWIF (4) and TestSimultaneousConfidenceBands (6)
- All 155 CS-specific tests pass (59 methodology + 96 staggered)
- Full suite: 1496 passed, 94 skipped, 0 failures
Backtest / simulation / notebook evidence: End-to-end verification shows cband critical value > 1.96, simultaneous CIs wider than pointwise

Security / privacy

Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

…bands Event study aggregation SEs were missing the weight influence function (WIF) adjustment that R's did::aggte(..., type="dynamic") includes. This caused systematic SE underestimation because uncertainty in group-size weight estimation was unaccounted for. The bootstrap path also used fixed-weight re-aggregation instead of perturbing combined IFs. Both analytical and bootstrap paths now include WIF for event study and overall ATT. Additionally adds simultaneous confidence bands (cband=True, default) using sup-t bootstrap critical values, matching R's did::aggte(..., cband=TRUE). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions · 2026-03-08T14:42:24Z

Overall Assessment: ⚠️ Needs changes

Executive Summary

P1: Analytical WIF uses a local unit set instead of all units, which can mis-scale event-study/overall SEs and diverge from the Methodology Registry/R behavior.
P3: plot_event_study now silently overrides pointwise CIs with cband CIs and ignores alpha without an opt-out.
P3: cband and simultaneous-band behavior aren’t documented in the estimator/plot docs; plotting coverage for cband is missing.

Methodology

Severity: P1 | Impact: Callaway-Sant'Anna event-study and overall ATT analytical SEs can be biased because _compute_combined_influence_function computes pg using n_units from only the units appearing in the selected (g,t) pairs, while the registry/docstring define pg = n_g / n_all (all units). This can drop late-treated cohorts at long event times, altering WIF variance and breaking alignment with R’s did::aggte. | Fix: pass a global unit mapping/total N (from precomputed['all_units'] or df[unit].nunique()) into _compute_combined_influence_function for analytical SEs (same as the bootstrap path) so WIF is based on the full sample. | Location: diff_diff/staggered_aggregation.py:L200-L340, diff_diff/staggered_aggregation.py:L420-L425, docs/methodology/REGISTRY.md:L311-L316

Code Quality

Severity: P3 | Impact: plot_event_study ignores the user-specified alpha when cband_conf_int exists and there’s no opt-out to force pointwise CIs; this is a behavioral change that can surprise callers. | Fix: add a use_cband (or ci_mode) flag or emit a warning when overriding, and honor alpha for pointwise mode. | Location: diff_diff/visualization.py:L185-L238

Performance

No issues found.

Maintainability

No issues found.

Tech Debt

No issues found.

Security

No issues found.

Documentation/Tests

Severity: P3 | Impact: The new cband parameter and simultaneous CIs are not documented in the CallawaySantAnna parameter docs, and plotting docs don’t mention cband-driven CI overrides, making the new default behavior easy to miss. | Fix: update docstrings to describe cband, its bootstrap requirement, and plotting behavior. | Location: diff_diff/staggered.py:L162-L217, diff_diff/visualization.py:L130-L170
Severity: P3 | Impact: No test covers plot_event_study consuming cband_conf_int, so regressions in the override path could slip. | Fix: add a small plotting/extraction test that asserts cband bounds are used when present. | Location: diff_diff/visualization.py:L185-L238

Fix critical bug where _compute_aggregated_se_with_wif() used local unit set instead of global panel size N for pg computation, causing SE overestimation at extreme event times. Add use_cband parameter to plot_event_study() for opting out of simultaneous confidence bands. Add cband docstrings and plot tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-03-08T15:09:04Z

/ai-review

github-actions · 2026-03-08T15:12:52Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 9758a19118e248d19a61d040730f97f4d63c6fce

Overall Assessment: ⚠️ Needs changes

Executive Summary

Methods affected: Callaway‑Sant’Anna overall ATT and event‑study aggregation SEs, bootstrap inference, and simultaneous confidence bands; intent matches updates in docs/methodology/REGISTRY.md.
P1 methodology risk: bootstrap WIF uses a local unit set/denominator, which can diverge from registry expectations (pg = n_g / N_total) and from analytical SEs when units drop out of influence functions.
cband wiring looks consistent across params/results/plot and tests, but edge coverage for unbalanced panels is missing.
Tests not run by reviewer.

Methodology

Severity: P1. Impact: Bootstrap combined IF and sup‑t critical values can be mis‑scaled in unbalanced panels because n_units is derived from units appearing in influence_func_info, while the registry defines pg = n_g / N_total and the analytical path uses all units. This can bias bootstrap SEs and cband widths and break consistency with R. Concrete fix: build bootstrap unit_to_idx/all_bootstrap_weights from precomputed['all_units'] (global) and pass those to _compute_combined_influence_function; or change _compute_combined_influence_function to use n_global_units for pg_by_group and WIF scaling even when local indexing is used, zero‑padding combined_if to global length. References: diff_diff/staggered_bootstrap.py:169-227, diff_diff/staggered_aggregation.py:198-232.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

Severity: P2. Impact: Tests do not cover the unbalanced/missing‑unit scenario where influence functions omit some units; this is where the bootstrap/global‑N mismatch would surface. Concrete fix: add a test with units that never appear in any (g,t) IF (e.g., all‑NaN outcomes or missing base/post) and assert bootstrap SEs/cband align with analytical or expected R behavior. Reference: tests/test_methodology_callaway.py:1224-1367.
Severity: P3. Impact: Tests not run in this review. Concrete fix: run pytest tests/test_methodology_callaway.py tests/test_visualization.py.

The multiplier bootstrap in _run_multiplier_bootstrap() was building the unit set from influence_func_info (local units only), causing pg = n_g / n_local to overestimate group shares in unbalanced panels where some units don't appear in any influence function. Now uses precomputed['all_units'] (or df fallback) matching the analytical path fix from 9758a19. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

igerber · 2026-03-08T16:30:55Z

/ai-review

github-actions · 2026-03-08T16:37:12Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 05eabac559cdd44c1c3a890dab67c68d2bff33d7

Overall assessment: ✅ Looks good

Executive summary

Methods affected: Callaway–Sant’Anna overall ATT aggregation, event‑study aggregation, multiplier bootstrap, and simultaneous confidence bands.
Registry alignment: WIF-adjusted SEs for overall/event‑study and combined‑IF bootstrap are consistent with docs/methodology/REGISTRY.md; cband default and sup‑t workflow are wired through results/summary/plotting.
Edge‑case safety: No new inline t‑stat computations or partial NaN guards introduced; safe_inference() remains the inference gate.
Minor API gap: DataFrame plotting path ignores new cband columns.

Methodology

No findings. Code changes align with the updated Methodology Registry for WIF‑adjusted aggregation SEs and sup‑t simultaneous bands.

Code Quality

No findings.

Performance

No findings.

Maintainability

Severity: P3 | Impact: plot_event_study() ignores cband_lower/cband_upper when input is a DataFrame, so simultaneous bands produced by CallawaySantAnnaResults.to_dataframe() (or user DataFrames) are silently dropped. | Fix: In _extract_plot_data, detect cband_lower/cband_upper columns for DataFrame inputs and return them as CI overrides (and document this in plot_event_study docstring). diff_diff/visualization.py:L386-L402, diff_diff/staggered_results.py:L261-L277

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

No findings.
Tests not run (not requested).

The DataFrame branch of _extract_plot_data() always returned None for cband overrides, silently dropping cband_lower/cband_upper columns. Now detects and extracts these columns when present. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

igerber · 2026-03-08T16:55:53Z

/ai-review

github-actions · 2026-03-08T17:02:40Z

🔁 AI review rerun (requested by @igerber)

Head SHA: c086e85afc329d5dbd3deba0ad7e5bbeae04fd43

Overall assessment: ⚠️ Needs changes

Executive summary

P1: The new sup‑t cband critical value ignores the bootstrap-validity threshold stated in the Methodology Registry, so it can be computed from a tiny set of finite draws and yield misleading bands.
P2: plot_event_study() treats NaN cband_lower/upper columns as overrides, suppressing pointwise CIs for common to_dataframe() workflows when cband isn’t computed.
WIF + combined IF integration is otherwise aligned with the updated registry, and the added tests cover core paths, but the above edge cases aren’t exercised.

Methodology

P1 — Impact: The Methodology Registry says bootstrap inference should return NaN when <50% of samples are valid, but cband critical values are computed as long as any finite sup‑t draw exists. This can silently produce overconfident or unstable bands in degenerate bootstrap runs (small samples, NaN IFs). Fix: apply the same valid-sample threshold used in bootstrap inference (e.g., require finite_mask.sum() >= 0.5 * n_bootstrap or reuse the bootstrap_utils threshold) and set cband_crit_value=None (plus warning) otherwise. diff_diff/staggered_bootstrap.py:L401-L425

Code Quality

P2 — Impact: When users pass a DataFrame from results.to_dataframe(level="event_study") and cband wasn’t computed, cband_lower/upper columns are present but NaN. _extract_plot_data() treats those columns as overrides, so plot_event_study() suppresses pointwise CIs entirely (NaN error bars) despite finite SEs. Fix: only return cband overrides when bounds are finite (e.g., filter rows with finite cband_lower/upper), or in plot_event_study() fall back to pointwise CIs when override values are NaN. diff_diff/visualization.py:L241-L245, diff_diff/visualization.py:L404-L406

Performance

No issues found.

Maintainability

No issues found.

Tech Debt

No issues found.

Security

No issues found.

Documentation/Tests

P3 — Impact: The two edge cases above aren’t covered; regressions could slip through. Fix: add tests that (1) DataFrame inputs with NaN cband_lower/upper fall back to pointwise CIs (or use_cband=False), and (2) cband critical value returns None when too few valid sup‑t draws exist. tests/test_visualization.py, tests/test_methodology_callaway.py

…shold P2: DataFrame cband extraction now filters NaN values, falling back to pointwise CIs when cband columns are all-NaN (common when cband=False). P1: sup-t cband critical value now requires ≥50% finite bootstrap draws, matching the validity threshold used in bootstrap_utils. Warns and returns None when threshold not met. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

igerber · 2026-03-08T17:06:52Z

/ai-review

github-actions · 2026-03-08T17:15:25Z

🔁 AI review rerun (requested by @igerber)

Head SHA: fa421bde5fcfffa5a707cf030f9f45082c0b7835

Overall assessment: ✅ Looks good

Executive summary:

Event-study and overall aggregation now include WIF in SEs and bootstrap perturbations via combined IFs, aligning with the Methodology Registry and the stated R did::aggte() behavior.
Simultaneous confidence bands are wired end-to-end (bootstrap critical value → results → plots) with sensible fallbacks when bootstrap is disabled.
One test-gap: no direct R comparison for dynamic/event-study SEs or cband critical values, so parity claims rely on internal consistency rather than external validation.

Methodology

No findings.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

Severity: P2
Impact: The PR claims matching R’s did::aggte(..., type="dynamic", cband=TRUE) behavior, but the current R comparison helper only validates type="simple" and forces cband=FALSE, so regressions in dynamic WIF or sup‑t cband could slip.
Concrete fix: Extend the R comparison harness to run aggte(type="dynamic", cband=TRUE/FALSE) and assert event‑study SEs and cband critical values (or at least ordering vs pointwise) against R for a fixed dataset.
Location: tests/test_methodology_callaway.py:L368-L393

Extend _get_r_mpdta_and_results helper to also run aggte(type="dynamic") and aggte(type="dynamic", cband=TRUE) in the same R process. Add two new tests: test_mpdta_event_study_ses_match_r validates ATT/SE per event time, test_mpdta_cband_crit_value_vs_r validates sup-t critical values match R. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

igerber · 2026-03-08T17:43:41Z

/ai-review

github-actions · 2026-03-08T17:54:48Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 090d15981ee637f883078c95bbc498ce7e731887

Overall assessment: ✅ Looks good

Executive summary

Methods affected: Callaway-Sant’Anna event-study aggregation, bootstrap inference, and simultaneous confidence bands (cband).
Methodology cross-check: WIF adjustment for event-study SEs and combined-IF bootstrap align with the Methodology Registry and in-code docstrings; no mismatches found.
Edge-case checklist: NaN/Inf propagation is handled via safe_inference/bootstrap stats gating; sup‑t cband logic drops invalid draws.
Tests/docs were updated to cover WIF + cband paths; nothing blocking noted.

Methodology

No issues found.

Code Quality

No issues found.

Performance

No issues found.

Maintainability

No issues found.

Tech Debt

No issues found.

Security

No issues found.

Documentation/Tests

No issues found.

If you want, I can do a deeper dive comparing the WIF/cband formulae against the exact R did::aggte() implementation (line-by-line) next.

igerber merged commit d3137f7 into main Mar 8, 2026
9 of 10 checks passed

igerber deleted the ca-event-studies-study branch March 8, 2026 18:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix CS event study SEs (missing WIF) + simultaneous confidence bands#189

Fix CS event study SEs (missing WIF) + simultaneous confidence bands#189
igerber merged 6 commits intomainfrom
ca-event-studies-study

igerber commented Mar 8, 2026

Uh oh!

github-actions bot commented Mar 8, 2026

Uh oh!

igerber commented Mar 8, 2026

Uh oh!

github-actions bot commented Mar 8, 2026

Uh oh!

igerber commented Mar 8, 2026

Uh oh!

github-actions bot commented Mar 8, 2026

Uh oh!

igerber commented Mar 8, 2026

Uh oh!

github-actions bot commented Mar 8, 2026

Uh oh!

igerber commented Mar 8, 2026

Uh oh!

github-actions bot commented Mar 8, 2026

Uh oh!

igerber commented Mar 8, 2026

Uh oh!

github-actions bot commented Mar 8, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

igerber commented Mar 8, 2026

Summary

Methodology references (required if estimator / math changes)

Validation

Security / privacy

Uh oh!

github-actions bot commented Mar 8, 2026

Uh oh!

igerber commented Mar 8, 2026

Uh oh!

github-actions bot commented Mar 8, 2026

Uh oh!

igerber commented Mar 8, 2026

Uh oh!

github-actions bot commented Mar 8, 2026

Uh oh!

igerber commented Mar 8, 2026

Uh oh!

github-actions bot commented Mar 8, 2026

Uh oh!

igerber commented Mar 8, 2026

Uh oh!

github-actions bot commented Mar 8, 2026

Uh oh!

igerber commented Mar 8, 2026

Uh oh!

github-actions bot commented Mar 8, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant