Skip to content

Fix CS event study SEs (missing WIF) + simultaneous confidence bands#189

Merged
igerber merged 6 commits intomainfrom
ca-event-studies-study
Mar 8, 2026
Merged

Fix CS event study SEs (missing WIF) + simultaneous confidence bands#189
igerber merged 6 commits intomainfrom
ca-event-studies-study

Conversation

@igerber
Copy link
Owner

@igerber igerber commented Mar 8, 2026

Summary

  • Fix event study aggregation SEs to include weight influence function (WIF) adjustment, matching R's did::aggte(..., type="dynamic")
  • Fix bootstrap to perturb combined influence functions (standard IF + WIF) instead of fixed-weight re-aggregation, for both event study and overall ATT
  • Add simultaneous confidence bands (cband=True, default) using sup-t bootstrap critical values, matching R's did::aggte(..., cband=TRUE)
  • Add cband_crit_value to results, cband_lower/cband_upper columns in to_dataframe(), and simultaneous CI display in summary()
  • Update plot_event_study() to use simultaneous CIs when available

Methodology references (required if estimator / math changes)

  • Method name(s): Callaway-Sant'Anna event study aggregation, simultaneous confidence bands
  • Paper / source link(s): Callaway, B., & Sant'Anna, P.H.C. (2021). Difference-in-Differences with multiple time periods. Journal of Econometrics, 225(2), 200-230. https://doi.org/10.1016/j.jeconom.2020.12.001
  • R reference: did::aggte() — WIF formula in AGGTEobj.R, sup-t cband in compute.aggte()
  • Any intentional deviations from the source (and why): None — matches R's did package behavior

Validation

  • Tests added/updated:
    • tests/test_methodology_callaway.py: 10 new tests in TestEventStudySEWithWIF (4) and TestSimultaneousConfidenceBands (6)
    • All 155 CS-specific tests pass (59 methodology + 96 staggered)
    • Full suite: 1496 passed, 94 skipped, 0 failures
  • Backtest / simulation / notebook evidence: End-to-end verification shows cband critical value > 1.96, simultaneous CIs wider than pointwise

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

…bands

Event study aggregation SEs were missing the weight influence function (WIF)
adjustment that R's did::aggte(..., type="dynamic") includes. This caused
systematic SE underestimation because uncertainty in group-size weight
estimation was unaccounted for. The bootstrap path also used fixed-weight
re-aggregation instead of perturbing combined IFs. Both analytical and
bootstrap paths now include WIF for event study and overall ATT.

Additionally adds simultaneous confidence bands (cband=True, default) using
sup-t bootstrap critical values, matching R's did::aggte(..., cband=TRUE).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Copy link

github-actions bot commented Mar 8, 2026

Overall Assessment: ⚠️ Needs changes

Executive Summary

  • P1: Analytical WIF uses a local unit set instead of all units, which can mis-scale event-study/overall SEs and diverge from the Methodology Registry/R behavior.
  • P3: plot_event_study now silently overrides pointwise CIs with cband CIs and ignores alpha without an opt-out.
  • P3: cband and simultaneous-band behavior aren’t documented in the estimator/plot docs; plotting coverage for cband is missing.

Methodology

  • Severity: P1 | Impact: Callaway-Sant'Anna event-study and overall ATT analytical SEs can be biased because _compute_combined_influence_function computes pg using n_units from only the units appearing in the selected (g,t) pairs, while the registry/docstring define pg = n_g / n_all (all units). This can drop late-treated cohorts at long event times, altering WIF variance and breaking alignment with R’s did::aggte. | Fix: pass a global unit mapping/total N (from precomputed['all_units'] or df[unit].nunique()) into _compute_combined_influence_function for analytical SEs (same as the bootstrap path) so WIF is based on the full sample. | Location: diff_diff/staggered_aggregation.py:L200-L340, diff_diff/staggered_aggregation.py:L420-L425, docs/methodology/REGISTRY.md:L311-L316

Code Quality

  • Severity: P3 | Impact: plot_event_study ignores the user-specified alpha when cband_conf_int exists and there’s no opt-out to force pointwise CIs; this is a behavioral change that can surprise callers. | Fix: add a use_cband (or ci_mode) flag or emit a warning when overriding, and honor alpha for pointwise mode. | Location: diff_diff/visualization.py:L185-L238

Performance

  • No issues found.

Maintainability

  • No issues found.

Tech Debt

  • No issues found.

Security

  • No issues found.

Documentation/Tests

  • Severity: P3 | Impact: The new cband parameter and simultaneous CIs are not documented in the CallawaySantAnna parameter docs, and plotting docs don’t mention cband-driven CI overrides, making the new default behavior easy to miss. | Fix: update docstrings to describe cband, its bootstrap requirement, and plotting behavior. | Location: diff_diff/staggered.py:L162-L217, diff_diff/visualization.py:L130-L170
  • Severity: P3 | Impact: No test covers plot_event_study consuming cband_conf_int, so regressions in the override path could slip. | Fix: add a small plotting/extraction test that asserts cband bounds are used when present. | Location: diff_diff/visualization.py:L185-L238

Fix critical bug where _compute_aggregated_se_with_wif() used local unit
set instead of global panel size N for pg computation, causing SE
overestimation at extreme event times. Add use_cband parameter to
plot_event_study() for opting out of simultaneous confidence bands.
Add cband docstrings and plot tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Owner Author

igerber commented Mar 8, 2026

/ai-review

@github-actions
Copy link

github-actions bot commented Mar 8, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 9758a19118e248d19a61d040730f97f4d63c6fce


Overall Assessment: ⚠️ Needs changes

Executive Summary

  • Methods affected: Callaway‑Sant’Anna overall ATT and event‑study aggregation SEs, bootstrap inference, and simultaneous confidence bands; intent matches updates in docs/methodology/REGISTRY.md.
  • P1 methodology risk: bootstrap WIF uses a local unit set/denominator, which can diverge from registry expectations (pg = n_g / N_total) and from analytical SEs when units drop out of influence functions.
  • cband wiring looks consistent across params/results/plot and tests, but edge coverage for unbalanced panels is missing.
  • Tests not run by reviewer.

Methodology

  • Severity: P1. Impact: Bootstrap combined IF and sup‑t critical values can be mis‑scaled in unbalanced panels because n_units is derived from units appearing in influence_func_info, while the registry defines pg = n_g / N_total and the analytical path uses all units. This can bias bootstrap SEs and cband widths and break consistency with R. Concrete fix: build bootstrap unit_to_idx/all_bootstrap_weights from precomputed['all_units'] (global) and pass those to _compute_combined_influence_function; or change _compute_combined_influence_function to use n_global_units for pg_by_group and WIF scaling even when local indexing is used, zero‑padding combined_if to global length. References: diff_diff/staggered_bootstrap.py:169-227, diff_diff/staggered_aggregation.py:198-232.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: Tests do not cover the unbalanced/missing‑unit scenario where influence functions omit some units; this is where the bootstrap/global‑N mismatch would surface. Concrete fix: add a test with units that never appear in any (g,t) IF (e.g., all‑NaN outcomes or missing base/post) and assert bootstrap SEs/cband align with analytical or expected R behavior. Reference: tests/test_methodology_callaway.py:1224-1367.
  • Severity: P3. Impact: Tests not run in this review. Concrete fix: run pytest tests/test_methodology_callaway.py tests/test_visualization.py.

The multiplier bootstrap in _run_multiplier_bootstrap() was building
the unit set from influence_func_info (local units only), causing
pg = n_g / n_local to overestimate group shares in unbalanced panels
where some units don't appear in any influence function. Now uses
precomputed['all_units'] (or df fallback) matching the analytical path
fix from 9758a19.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@igerber
Copy link
Owner Author

igerber commented Mar 8, 2026

/ai-review

@github-actions
Copy link

github-actions bot commented Mar 8, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 05eabac559cdd44c1c3a890dab67c68d2bff33d7


Overall assessment: ✅ Looks good

Executive summary

  • Methods affected: Callaway–Sant’Anna overall ATT aggregation, event‑study aggregation, multiplier bootstrap, and simultaneous confidence bands.
  • Registry alignment: WIF-adjusted SEs for overall/event‑study and combined‑IF bootstrap are consistent with docs/methodology/REGISTRY.md; cband default and sup‑t workflow are wired through results/summary/plotting.
  • Edge‑case safety: No new inline t‑stat computations or partial NaN guards introduced; safe_inference() remains the inference gate.
  • Minor API gap: DataFrame plotting path ignores new cband columns.

Methodology

  • No findings. Code changes align with the updated Methodology Registry for WIF‑adjusted aggregation SEs and sup‑t simultaneous bands.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • Severity: P3 | Impact: plot_event_study() ignores cband_lower/cband_upper when input is a DataFrame, so simultaneous bands produced by CallawaySantAnnaResults.to_dataframe() (or user DataFrames) are silently dropped. | Fix: In _extract_plot_data, detect cband_lower/cband_upper columns for DataFrame inputs and return them as CI overrides (and document this in plot_event_study docstring). diff_diff/visualization.py:L386-L402, diff_diff/staggered_results.py:L261-L277

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

  • No findings.
  • Tests not run (not requested).

The DataFrame branch of _extract_plot_data() always returned None for
cband overrides, silently dropping cband_lower/cband_upper columns.
Now detects and extracts these columns when present.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@igerber
Copy link
Owner Author

igerber commented Mar 8, 2026

/ai-review

@github-actions
Copy link

github-actions bot commented Mar 8, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: c086e85afc329d5dbd3deba0ad7e5bbeae04fd43


Overall assessment: ⚠️ Needs changes

Executive summary

  • P1: The new sup‑t cband critical value ignores the bootstrap-validity threshold stated in the Methodology Registry, so it can be computed from a tiny set of finite draws and yield misleading bands.
  • P2: plot_event_study() treats NaN cband_lower/upper columns as overrides, suppressing pointwise CIs for common to_dataframe() workflows when cband isn’t computed.
  • WIF + combined IF integration is otherwise aligned with the updated registry, and the added tests cover core paths, but the above edge cases aren’t exercised.

Methodology

  • P1 — Impact: The Methodology Registry says bootstrap inference should return NaN when <50% of samples are valid, but cband critical values are computed as long as any finite sup‑t draw exists. This can silently produce overconfident or unstable bands in degenerate bootstrap runs (small samples, NaN IFs). Fix: apply the same valid-sample threshold used in bootstrap inference (e.g., require finite_mask.sum() >= 0.5 * n_bootstrap or reuse the bootstrap_utils threshold) and set cband_crit_value=None (plus warning) otherwise. diff_diff/staggered_bootstrap.py:L401-L425

Code Quality

  • P2 — Impact: When users pass a DataFrame from results.to_dataframe(level="event_study") and cband wasn’t computed, cband_lower/upper columns are present but NaN. _extract_plot_data() treats those columns as overrides, so plot_event_study() suppresses pointwise CIs entirely (NaN error bars) despite finite SEs. Fix: only return cband overrides when bounds are finite (e.g., filter rows with finite cband_lower/upper), or in plot_event_study() fall back to pointwise CIs when override values are NaN. diff_diff/visualization.py:L241-L245, diff_diff/visualization.py:L404-L406

Performance

  • No issues found.

Maintainability

  • No issues found.

Tech Debt

  • No issues found.

Security

  • No issues found.

Documentation/Tests

  • P3 — Impact: The two edge cases above aren’t covered; regressions could slip through. Fix: add tests that (1) DataFrame inputs with NaN cband_lower/upper fall back to pointwise CIs (or use_cband=False), and (2) cband critical value returns None when too few valid sup‑t draws exist. tests/test_visualization.py, tests/test_methodology_callaway.py

…shold

P2: DataFrame cband extraction now filters NaN values, falling back to
pointwise CIs when cband columns are all-NaN (common when cband=False).

P1: sup-t cband critical value now requires ≥50% finite bootstrap draws,
matching the validity threshold used in bootstrap_utils. Warns and returns
None when threshold not met.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@igerber
Copy link
Owner Author

igerber commented Mar 8, 2026

/ai-review

@github-actions
Copy link

github-actions bot commented Mar 8, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: fa421bde5fcfffa5a707cf030f9f45082c0b7835


Overall assessment: ✅ Looks good

Executive summary:

  • Event-study and overall aggregation now include WIF in SEs and bootstrap perturbations via combined IFs, aligning with the Methodology Registry and the stated R did::aggte() behavior.
  • Simultaneous confidence bands are wired end-to-end (bootstrap critical value → results → plots) with sensible fallbacks when bootstrap is disabled.
  • One test-gap: no direct R comparison for dynamic/event-study SEs or cband critical values, so parity claims rely on internal consistency rather than external validation.

Methodology

  • No findings.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

  • Severity: P2
    Impact: The PR claims matching R’s did::aggte(..., type="dynamic", cband=TRUE) behavior, but the current R comparison helper only validates type="simple" and forces cband=FALSE, so regressions in dynamic WIF or sup‑t cband could slip.
    Concrete fix: Extend the R comparison harness to run aggte(type="dynamic", cband=TRUE/FALSE) and assert event‑study SEs and cband critical values (or at least ordering vs pointwise) against R for a fixed dataset.
    Location: tests/test_methodology_callaway.py:L368-L393

Extend _get_r_mpdta_and_results helper to also run aggte(type="dynamic")
and aggte(type="dynamic", cband=TRUE) in the same R process. Add two new
tests: test_mpdta_event_study_ses_match_r validates ATT/SE per event time,
test_mpdta_cband_crit_value_vs_r validates sup-t critical values match R.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@igerber
Copy link
Owner Author

igerber commented Mar 8, 2026

/ai-review

@github-actions
Copy link

github-actions bot commented Mar 8, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 090d15981ee637f883078c95bbc498ce7e731887


Overall assessment: ✅ Looks good

Executive summary

  • Methods affected: Callaway-Sant’Anna event-study aggregation, bootstrap inference, and simultaneous confidence bands (cband).
  • Methodology cross-check: WIF adjustment for event-study SEs and combined-IF bootstrap align with the Methodology Registry and in-code docstrings; no mismatches found.
  • Edge-case checklist: NaN/Inf propagation is handled via safe_inference/bootstrap stats gating; sup‑t cband logic drops invalid draws.
  • Tests/docs were updated to cover WIF + cband paths; nothing blocking noted.

Methodology

  • No issues found.

Code Quality

  • No issues found.

Performance

  • No issues found.

Maintainability

  • No issues found.

Tech Debt

  • No issues found.

Security

  • No issues found.

Documentation/Tests

  • No issues found.

If you want, I can do a deeper dive comparing the WIF/cband formulae against the exact R did::aggte() implementation (line-by-line) next.

@igerber igerber merged commit d3137f7 into main Mar 8, 2026
9 of 10 checks passed
@igerber igerber deleted the ca-event-studies-study branch March 8, 2026 18:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant