
HAD Phase 2b: multi-period event-study extension (Appendix B.2)#350

Merged
igerber merged 7 commits into main from had-phase-2b-event-study on Apr 22, 2026

Conversation


@igerber (Owner) commented Apr 22, 2026

Summary

  • Fills in the aggregate="event_study" scaffolding left by Phase 2a. HeterogeneousAdoptionDiD.fit(..., aggregate="event_study") now returns a new HeterogeneousAdoptionDiDEventStudyResults dataclass with per-event-time WAS estimates on multi-period panels (paper Appendix B.2).
  • All three Phase 2a design paths (continuous_at_zero, continuous_near_d_lower, mass_point) are reused verbatim on per-horizon first differences anchored at Y_{g, F-1} (uniform baseline, consistent with Garrett et al. application).
  • Staggered-timing panels auto-filter to the last-treatment cohort PLUS never-treated units, per the paper's "there must be an untreated group, at least till the period where the last cohort gets treated" prescription. UserWarning surfaces kept/dropped counts and dropped cohort labels.
  • Per-horizon SEs use INDEPENDENT sandwiches (continuous: CCT-2014 robust / |den|; mass-point: Phase 2a structural-residual 2SLS). Pointwise CIs match paper's Pierce-Schott application. Joint cross-horizon covariance deferred to follow-up.
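The anchored-first-difference construction in the bullets above can be sketched in a few lines of pandas. This is an illustrative reimplementation, not the library's internal code; `horizon_first_differences` and the column names are hypothetical.

```python
import pandas as pd

# Sketch of the per-horizon first differences described above: every event
# time e != -1 is anchored at the uniform baseline Y_{g, F-1}. Names are
# illustrative, not the library's internals.
def horizon_first_differences(df, unit_col, time_col, outcome_col, F):
    wide = df.pivot(index=unit_col, columns=time_col, values=outcome_col)
    baseline = wide[F - 1]  # uniform Y_{g, F-1} anchor
    diffs = {}
    for t in sorted(wide.columns):
        e = t - F
        if e == -1:
            continue  # anchor period: difference is 0 by construction
        diffs[e] = wide[t] - baseline
    return pd.DataFrame(diffs)

panel = pd.DataFrame({
    "unit": [1, 1, 1, 1, 2, 2, 2, 2],
    "period": [1, 2, 3, 4] * 2,
    "y": [1.0, 2.0, 5.0, 6.0, 0.0, 1.0, 2.0, 3.0],
})
fd = horizon_first_differences(panel, "unit", "period", "y", F=3)
# columns are event times (-2, 0, 1); each column is one horizon's outcome
```

Each column of `fd` would then be fed through the two-period design paths with D_{g,F} as the single dose regressor.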

Methodology references (required if estimator / math changes)

  • Method name(s): HeterogeneousAdoptionDiD (event-study mode)
  • Paper / source link(s): de Chaisemartin, Ciccia, D'Haultfœuille & Knau (2026), arXiv:2405.04465v6, Appendix B.2
  • Any intentional deviations from the source (and why):
    • Uniform F-1 baseline for all horizons (paper review line 232 suggests asymmetric Y_{g,t} - Y_{g,1} for pre-periods; uniform baseline matches the paper's Garrett Section 5.1 application and simplifies event-time indexing). Documented in REGISTRY Note (Phase 2b baseline convention).
    • Per-horizon INDEPENDENT sandwiches, not joint cross-horizon covariance. Paper reports pointwise CIs; joint covariance is not derived in Appendix B.2. Deferred to follow-up PR (tracked in TODO.md). Documented in REGISTRY Note (Phase 2b per-horizon SE).
    • D_{g,F} used as single dose regressor across all horizons (paper convention assumes "once treated, stay treated with same dose"). Time-varying post-period dose deferred. Documented in TODO.md.
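As a notation sketch of the first deviation (consistent with the bullets above, not an equation copied from the paper): every horizon is built from the same anchored first difference,

```latex
\Delta Y_{g,e} = Y_{g,\,F+e} - Y_{g,\,F-1}, \qquad e \neq -1,
```

and each \Delta Y_{g,e} is passed through the Phase 2a two-period machinery with D_{g,F} as the single dose regressor; the anchor e = -1 is skipped since \Delta Y_{g,-1} = 0 by construction.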

Validation

  • Tests added/updated: tests/test_had.py (+41 event-study tests, all passing)
    • Smoke per design path (continuous_at_zero, continuous_near_d_lower, mass_point) on 5-period panels
    • Staggered auto-filter: warning, filter_info populated, sample composition (last-cohort + never-treated retained, earlier cohorts dropped), explicit "retain never-treated" pin test
    • Per-horizon SE independence: matches Phase 2a SE on (F-1, t) two-period subset at atol=1e-12
    • Baseline convention: e = -1 not in event_times; event-time indexing symmetric around F
    • Pre-period placebos under no-pre-trend DGP concentrate around 0
    • 2x2 aggregate/period matrix: (T=2, event_study) raises, (T>2, overall) raises
    • Panel contracts preserved: NaN rejection, RCS rejection, non-contiguous dose rejection
    • Phase 2a guards preserved: reciprocal regime partition, d_lower contracts, NaN triple via safe_inference
    • sklearn compat: clone round-trip, fit idempotence
    • HeterogeneousAdoptionDiDEventStudyResults API: summary(), to_dict(), to_dataframe(), repr
  • Backtest / simulation / notebook evidence (if applicable): N/A — Phase 2b is an analytical extension; numerical verification is via per-horizon parity with Phase 2a (commit criterion #2).

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

igerber and others added 2 commits April 22, 2026 07:03
Lifts `aggregate="event_study"` scaffolding left by Phase 2a.
`HeterogeneousAdoptionDiD.fit(..., aggregate="event_study")` now returns
a new `HeterogeneousAdoptionDiDEventStudyResults` dataclass with per-
event-time WAS estimates on multi-period panels. All three Phase 2a
design paths (continuous_at_zero, continuous_near_d_lower, mass_point)
are reused verbatim on per-horizon first differences anchored at
`Y_{g, F-1}` (uniform F-1 baseline, consistent with paper Garrett
application).

Staggered-timing panels are auto-filtered to the last-treatment cohort
with a `UserWarning` per paper Appendix B.2 prescription ("did_had may
be used only for the LAST treatment cohort in a staggered design").
Pre-period placebos included for `e <= -2`; the anchor `e = -1` is
skipped since `ΔY = 0` there by construction.

Per-horizon SEs use INDEPENDENT sandwiches: continuous paths use the
CCT-2014 robust SE from Phase 1c divided by `|den|`; mass-point path
uses the Phase 2a structural-residual 2SLS sandwich computed on each
horizon's first differences. Pointwise CIs match the paper's own
Pierce-Schott application (Figure 2); joint cross-horizon covariance
is deferred to a follow-up PR (tracked in TODO.md).

All Phase 2a policy guards (reciprocal regime partition, d_lower
contracts, NaN contract via safe_inference, sklearn get/set_params
atomicity) are preserved identically on the event-study path. The
multi-period panel validator adds dose-contiguity enforcement (pre-
periods < post-periods in natural ordering) and rejects non-monotonic
dose sequences with a pointer to ChaisemartinDHaultfoeuille.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Paper Appendix B.2: "in designs with variation in treatment timing,
there must be an untreated group, at least till the period where the
last cohort gets treated." Previously the staggered last-cohort filter
dropped BOTH earlier cohorts AND never-treated units (first_treat=0),
which was undocumented and changed the comparison set.

Fixed: `_validate_had_panel_event_study` now keeps `first_treat == F_last`
AND `first_treat == 0` (never-treated), dropping only earlier cohorts
(`first_treat` in `dropped_cohorts`). Never-treated units satisfy the
dose invariant at every period (D=0 throughout) and serve as the
untreated-group comparison per the paper's requirement. Matches
CallawaySantAnna's never-treated handling convention.
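The corrected filter contract can be sketched as follows. This is a hypothetical standalone version (`filter_last_cohort` is not the shipped validator); it only illustrates "keep chronologically last cohort plus never-treated, warn about the rest".

```python
import warnings
import pandas as pd

# Sketch of the last-cohort-plus-never-treated filter described above:
# first_treat == 0 marks never-treated units, which are retained as the
# untreated comparison group; only earlier treated cohorts are dropped.
def filter_last_cohort(df, first_treat_col="first_treat"):
    treated = df.loc[df[first_treat_col] != 0, first_treat_col]
    F_last = treated.max()
    keep = df[first_treat_col].isin([0, F_last])
    dropped_cohorts = sorted(set(treated) - {F_last})
    if dropped_cohorts:
        warnings.warn(
            f"Staggered timing: keeping cohort {F_last} plus never-treated "
            f"units ({int(keep.sum())} rows); dropping cohorts "
            f"{dropped_cohorts} ({int((~keep).sum())} rows).",
            UserWarning,
        )
    return df[keep], {"F_last": F_last, "dropped_cohorts": dropped_cohorts}

units = pd.DataFrame({"unit": range(6),
                      "first_treat": [0, 0, 2, 2, 3, 3]})
kept, info = filter_last_cohort(units)
# kept first_treat values are exactly {0, 3}; cohort 2 is dropped
```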

Updated:
- Warning message names the "plus never-treated" composition.
- REGISTRY.md Note (Phase 2b last-cohort filter) documents the new
  contract explicitly.
- Function docstring clarifies `data_filtered` and `filter_info` shape.
- New test `test_staggered_filter_retains_never_treated_units` pins
  the contract: kept first_treat values are exactly {0, F_last}.
- Existing tests updated to reflect new n_kept / n_dropped counts.

Also addresses AI review P3s:
- Clarify sort-key comment (no false "str fallback" claim).
- Drop misleading NaN-rendering comment on event-study summary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Overall Assessment

⛔ Blocker

Executive Summary

  • HeterogeneousAdoptionDiD.fit(..., aggregate="event_study") silently accepts panels where treated units’ doses change after F, but the implementation still hard-codes D_{g,F} as the regressor for every horizon. That can return wrong later-horizon estimates on accepted input.
  • Staggered timing is only reduced to the last cohort when first_treat_col is provided. If users omit that column, mixed-cohort panels can still be fit as if they had a common treatment date, which violates Appendix B.2 identification.
  • I did not treat the uniform F-1 baseline or the independent per-horizon sandwiches as defects, because both are explicitly documented in REGISTRY.md.
  • The methodology docs still need cleanup: REGISTRY.md now contains contradictory statements about whether staggered timing hard-fails or auto-filters.
  • The new tests are broad, but they do not cover the two methodology-breaking input cases above.

Methodology

  • Severity: P0. Impact: _aggregate_multi_period_first_differences() always reuses D_{g,F} for every event-time horizon, and there is no front-door validation that post-treatment dose stays constant within unit. The docstring explicitly says time-varying post-period dose is treated as the realized F-period dose, so accepted panels with changing D_{g,t} can produce wrong later-horizon ATT/SE values with no warning. The TODO entry does not mitigate this because it is a silent correctness bug. Concrete fix: reject any panel where a unit has D_{g,t} != D_{g,F} for any t >= F, or implement the time-varying-dose estimator and document it in REGISTRY.md. References: diff_diff/had.py:L1197-L1255, TODO.md:L101-L101.
  • Severity: P1. Impact: the last-cohort reduction only runs inside if first_treat_col is not None. When a staggered panel is passed without that column, the validator skips cohort detection, infers F from the first period with any nonzero dose, and proceeds as if the panel had a common adoption date. Later-treated cohorts then enter d_arr as zero-dose units at the inferred F, which changes the control composition and violates the paper’s “last cohort only” Appendix B.2 contract. The public fit() docstring even says first_treat_col is required for staggered timing, but the code never enforces that requirement. Concrete fix: when first_treat_col is absent, infer each unit’s first positive-dose period from the observed dose path; if multiple cohorts exist, either auto-filter to F_last or raise a front-door error directing users to pass first_treat_col / use ChaisemartinDHaultfoeuille. References: diff_diff/had.py:L843-L855, diff_diff/had.py:L919-L1066, diff_diff/had.py:L1803-L1812, docs/methodology/REGISTRY.md:L2332-L2334.
  • Severity: P3. Impact: the uniform F-1 baseline and independent per-horizon sandwich SEs are documented in the Methodology Registry, so I did not count them as methodology defects. Concrete fix: none required for approval. References: docs/methodology/REGISTRY.md:L2332-L2335.

Code Quality

No additional findings.

Performance

No additional findings.

Maintainability

No additional findings beyond the documentation/test issues below.

Tech Debt

  • Severity: P3. Impact: joint cross-horizon covariance is explicitly tracked as deferred work and matches the paper’s pointwise-CI presentation, so it is not a blocker for this PR. Concrete fix: none required for approval; follow-up can add IF stacking or block bootstrap. References: docs/methodology/REGISTRY.md:L2334-L2334, TODO.md:L92-L92.

Security

No findings.

Documentation/Tests

  • Severity: P2. Impact: REGISTRY.md now gives conflicting guidance on staggered timing. One note still says the implementation hard-fails and redirects on variation in treatment timing, while the new Phase 2b note says the estimator auto-filters to the last cohort and can infer F without first_treat_col. That ambiguity makes the shipped methodology contract hard to trust. Concrete fix: reconcile the older note with the new Phase 2b note and state explicitly whether omitted first_treat_col on staggered panels is supported, rejected, or auto-detected. References: docs/methodology/REGISTRY.md:L2251-L2251, docs/methodology/REGISTRY.md:L2333-L2333.
  • Severity: P2. Impact: the new tests exercise only the explicit first_treat_col staggered path and the choice to reuse D_{g,F}. There is no negative test for either blocker condition: staggered multi-cohort input without first_treat_col, or time-varying post-F dose. That leaves both failures unguarded. Concrete fix: add regression tests for both cases in tests/test_had.py. References: tests/test_had.py:L2271-L2397, tests/test_had.py:L2850-L2872, tests/test_had.py:L2917-L2930.

Path to Approval

  1. Add a hard validation that event-study panels satisfy the constant-dose-after-treatment assumption behind D_{g,F} reuse, or implement the time-varying-dose method and document it in REGISTRY.md.
  2. Detect staggered timing even when first_treat_col is omitted; if multiple cohorts are present, auto-filter to the last cohort or raise a front-door error requiring cohort metadata / redirecting to ChaisemartinDHaultfoeuille.
  3. Add regression tests for both conditions above and reconcile the contradictory staggered-timing notes in REGISTRY.md.

Execution note: I did not run the test suite in this environment because the available Python on PATH does not have the project dependencies (numpy, pandas, scipy).

…tection

**P0 (blocker):** `_aggregate_multi_period_first_differences` reuses
`D_{g, F}` as the single regressor for every event-time horizon. Without
validation, panels where a unit's dose varies across post-treatment
periods silently misattribute later-horizon effects to the period-F dose.

Fix: `_validate_had_panel_event_study` now rejects panels where any unit
has time-varying dose across post-periods (within-unit spread beyond
float tolerance), with a `ValueError` redirecting to
ChaisemartinDHaultfoeuille for genuinely time-varying regimes.
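A minimal sketch of such a guard, assuming the panel is long-format with one row per unit-period (names are illustrative, not the shipped validator):

```python
import pandas as pd

# Sketch of the constant-dose guard described above: reject any unit whose
# dose at some post-treatment period differs from its period-F dose beyond
# float tolerance.
def check_constant_post_dose(df, unit_col, time_col, dose_col, F, atol=1e-12):
    post = df[df[time_col] >= F]
    spread = post.groupby(unit_col)[dose_col].agg(lambda d: d.max() - d.min())
    bad = sorted(spread[spread > atol].index)
    if bad:
        raise ValueError(
            f"Time-varying post-treatment dose for units {bad}; use "
            "ChaisemartinDHaultfoeuille for genuinely time-varying regimes."
        )

panel = pd.DataFrame({
    "unit": [1, 1, 1, 2, 2, 2],
    "period": [1, 2, 3] * 2,
    "dose": [0.0, 0.5, 0.5, 0.0, 0.5, 0.9],  # unit 2 drifts after F = 2
})
check_constant_post_dose(panel[panel["unit"] == 1], "unit", "period",
                         "dose", F=2)  # constant post-F dose: passes
try:
    check_constant_post_dose(panel, "unit", "period", "dose", F=2)
    rejected = False
except ValueError:
    rejected = True  # unit 2's drifting dose is rejected front-door
```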

**P1:** Staggered-timing auto-filter previously only ran inside
`if first_treat_col is not None`. Multi-cohort panels without cohort
metadata slipped through, treating later-cohort units as zero-dose
"controls" at the inferred F, violating Appendix B.2's last-cohort-only
contract.

Fix: When `first_treat_col is None`, the validator computes per-unit
first-positive-dose period from the dose path. If multiple distinct
cohorts are detected, it raises a `ValueError` directing users to
pass `first_treat_col` (which activates the last-cohort auto-filter)
or use ChaisemartinDHaultfoeuille for full staggered support.
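The cohort-detection logic being described can be sketched like this (hypothetical helper; the real validator lives in `_validate_had_panel_event_study`):

```python
import pandas as pd

# Sketch of cohort detection from the dose path alone, as described above:
# a unit's cohort is its first period with positive dose, and multiple
# distinct cohorts without first_treat metadata are rejected front-door.
def detect_common_adoption(df, unit_col, time_col, dose_col):
    pos = df[df[dose_col] > 0]
    first_pos = pos.groupby(unit_col)[time_col].min()
    cohorts = sorted(first_pos.unique())
    if len(cohorts) > 1:
        raise ValueError(
            f"Multiple adoption cohorts detected ({cohorts}); pass "
            "first_treat_col to enable the last-cohort auto-filter, or use "
            "ChaisemartinDHaultfoeuille for full staggered support."
        )
    return cohorts[0] if cohorts else None

common = pd.DataFrame({"unit": [1, 1, 2, 2], "period": [1, 2, 1, 2],
                       "dose": [0.0, 0.3, 0.0, 0.7]})
F_common = detect_common_adoption(common, "unit", "period", "dose")
staggered = common.assign(dose=[0.0, 0.3, 0.5, 0.7])  # unit 2 adopts earlier
try:
    detect_common_adoption(staggered, "unit", "period", "dose")
    staggered_rejected = False
except ValueError:
    staggered_rejected = True
```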

**P2 (docs):** Reconciled contradictory REGISTRY guidance between the
legacy edge-case note (line ~2251) and the new Phase 2b last-cohort
filter note. Both now describe the auto-filter + front-door rejection
of un-annotated staggered panels consistently.

**P2 (tests):** Added regression tests for both blockers:
- `test_time_varying_post_F_dose_rejected`
- `test_staggered_without_first_treat_col_rejected`

Also added a **Note (Phase 2b constant-dose requirement)** block to
REGISTRY documenting the new validator guard. TODO.md entry updated
to reflect front-door rejection of time-varying doses (not silent
reuse as before).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 1bfec37582c1e2c8f612d93da090637e06042b62


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The two prior re-review blockers are addressed: the event-study validator now rejects time-varying post-F dose paths and no longer silently accepts multi-cohort staggered panels when first_treat_col is omitted; the earlier REGISTRY.md contradiction on staggered timing is also resolved (diff_diff/had.py#L1130, diff_diff/had.py#L1153, tests/test_had.py#L2693, tests/test_had.py#L2730, docs/methodology/REGISTRY.md#L2251).
  • I did not count the uniform F-1 anchor or per-horizon independent SEs as defects: they are explicitly documented in REGISTRY.md, and the DIDHAD reference docs also define dynamic effects/placebos relative to F-1 and direct staggered-timing users to did_multiplegt_dyn. (ideas.repec.org)
  • P1: when first_treat_col is present, the new last-cohort filter trusts that column without validating it against the observed dose path. A swapped or mistyped cohort label can silently keep the wrong cohort and change the estimand/sample composition (diff_diff/had.py#L919, diff_diff/had.py#L967, diff_diff/had.py#L1130).
  • P1: the event-study path derives chronology by sorting raw period labels. That reintroduces the label-order problem the two-period validator explicitly avoided, so valid multi-period panels with semantic or non-ordinal string periods can be rejected or misindexed (diff_diff/had.py#L732, diff_diff/had.py#L1065, diff_diff/had.py#L1321).
  • No new inline-inference / partial-NaN anti-patterns or security issues showed up in the changed estimator path.

Methodology

  • Severity: P1. Impact: _validate_had_panel_event_study() uses first_treat_col as the authoritative cohort selector, but it never checks that those labels equal each unit’s actual first positive-dose period. In this PR, that column now decides which cohort is treated as the Appendix B.2 “last cohort,” so a mislabeled metadata column can silently retain the wrong units and return event-study estimates for the wrong cohort. Concrete fix: infer each unit’s actual first positive-dose period from dose_col for all units, require equality with first_treat_col (or 0 for never-treated), and reject off-support/mismatched labels before building cohorts and F_last (diff_diff/had.py#L919, diff_diff/had.py#L967, diff_diff/had.py#L1130).
  • Severity: P1. Impact: Appendix B.2 event-time effects depend on chronological ordering, but the new validator/aggregator recover that order with raw sorted(...) on period labels. For plain string/object labels, that is lexicographic rather than chronological; valid panels such as month names or "pre1"/"pre2"/"post1"/"post2" can fail the contiguity check or get the wrong anchor/horizon order. Concrete fix: require an ordered time type on the event-study path (numeric, datetime, ordered categorical, or an explicit supplied order) and front-door reject unordered object/string labels instead of silently using lexicographic sort (diff_diff/had.py#L1065, diff_diff/had.py#L1080, diff_diff/had.py#L1321).
  • Severity: P3-informational. Impact: the uniform F-1 baseline and independent per-horizon sandwiches are documented deviations/choices, so they are not approval blockers. Concrete fix: none required (docs/methodology/REGISTRY.md#L2332, docs/methodology/REGISTRY.md#L2335).

Code Quality

  • No findings. The new per-horizon inference path still routes through safe_inference(), so I did not find a new inline t_stat = effect / se or partial-NaN guard regression (diff_diff/had.py#L2605).

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3-informational. Impact: joint cross-horizon covariance and the broader staggered/time-varying-dose follow-ups are now explicitly tracked in TODO.md, so I did not count them against approval. Concrete fix: none required in this PR (TODO.md#L92, TODO.md#L93, TODO.md#L101).

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: the new tests close the prior blocker cases, but there is still no regression coverage for the two unmitigated issues above: event-study with unordered string/object period labels, and event-study with first_treat_col disagreeing with the observed dose path. The only explicit string-period coverage remains two-period-only. Concrete fix: add one regression test for semantic string periods on aggregate="event_study" and one for a swapped/mistyped first_treat_col that must raise (tests/test_had.py#L1760, tests/test_had.py#L2593, tests/test_had.py#L2730).
  • Severity: P3. Impact: the HeterogeneousAdoptionDiD class docstring still says the event-study extension is “queued for Phase 2b,” which now contradicts runtime behavior and the updated registry/docs. Concrete fix: update the class docstring so help(HeterogeneousAdoptionDiD) matches fit(..., aggregate="event_study") (diff_diff/had.py#L1607, diff_diff/had.py#L1925).

Path to Approval

  1. Validate first_treat_col against each unit’s actual first positive-dose period before using it for last-cohort selection, and raise on any mismatch or off-support label.
  2. Stop inferring event-study chronology from raw lexicographic sort of period labels; require an ordered time representation or explicit order and reject unordered object/string labels.
  3. Add regression tests for both cases above so the fixes stay pinned.

…P1 ordered time

**P1 (first_treat_col vs dose mismatch):** The last-cohort filter trusted
`first_treat_col` without validating it against the observed dose path.
A swapped or mistyped cohort label could silently retain the wrong
cohort as F_last.

Fix: `_validate_had_panel_event_study` now cross-validates each unit's
declared first_treat against their actual first-positive-dose period:
- declared == 0: unit must have D=0 at every period
- declared == F_g > 0: unit's first period with D>0 must equal F_g
Any mismatch raises `ValueError` with an example unit, declared value,
and actual first-positive period.
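The cross-validation contract can be sketched as a standalone check (illustrative names; not the shipped code):

```python
import pandas as pd

# Sketch of the declared-vs-actual cross-validation described above:
# declared 0 must mean zero dose at every period; declared F_g > 0 must
# equal the unit's first positive-dose period.
def cross_validate_first_treat(df, unit_col, time_col, dose_col,
                               first_treat_col):
    for unit, g in df.sort_values(time_col).groupby(unit_col):
        declared = g[first_treat_col].iloc[0]
        pos = g.loc[g[dose_col] > 0, time_col]
        actual = pos.iloc[0] if len(pos) else 0
        if declared != actual:
            raise ValueError(
                f"Unit {unit}: declared first_treat={declared!r}, but its "
                f"first positive-dose period is {actual!r}."
            )

good = pd.DataFrame({"unit": [1, 1, 1], "period": [1, 2, 3],
                     "dose": [0.0, 0.4, 0.4], "first_treat": [2, 2, 2]})
cross_validate_first_treat(good, "unit", "period", "dose", "first_treat")
try:  # mistyped cohort label: declared 3, actual first positive dose at 2
    cross_validate_first_treat(good.assign(first_treat=3),
                               "unit", "period", "dose", "first_treat")
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
```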

**P1 (unordered time labels):** Event-study chronology was inferred via
raw `sorted()` on period labels. For object/string dtypes that falls
back to lexicographic sort, which silently misorders panels like
"pre1"/"pre2"/"post1"/"post2" or month-name labels.

Fix: Event-study path now requires a numeric, datetime, or ordered-
categorical time column. Object/string dtypes raise a front-door
`ValueError` directing users to convert. Ordered categoricals are
sorted by their declared category order (not the underlying string),
via a dtype-aware `_sort_key` reused by both the validator and the
multi-period aggregator.
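The dtype-aware sort key can be sketched as follows (hypothetical `make_sort_key`; the in-tree helper may differ):

```python
import pandas as pd

# Sketch of a dtype-aware sort key as described above: ordered categoricals
# sort by their declared category index rather than lexicographically;
# numeric/datetime labels already sort naturally.
def make_sort_key(series):
    dtype = series.dtype
    if isinstance(dtype, pd.CategoricalDtype) and dtype.ordered:
        rank = {cat: i for i, cat in enumerate(dtype.categories)}
        return lambda label: rank[label]
    return lambda label: label

t = pd.Series(pd.Categorical(
    ["q10", "q2", "q1"], categories=["q1", "q2", "q3", "q10"], ordered=True))
lexicographic = sorted(t.unique())                     # wrong: q10 < q2
chronological = sorted(t.unique(), key=make_sort_key(t))  # category order
```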

**P3 (docstring):** Class docstring no longer says the event-study
extension is "queued for Phase 2b"; now documents both aggregation
modes with pointers to the respective result classes.

**Tests added:**
- `test_first_treat_col_mismatch_with_dose_raises` pins the cross-
  validation contract.
- `test_unordered_string_time_col_rejected` pins front-door rejection
  of object dtypes.
- `test_ordered_categorical_time_col_accepted` confirms ordered
  categoricals sort by category order and fit successfully.

Minor: added `observed=False` to the categorical-groupby in the balance
check to silence the pandas FutureWarning while preserving behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 6e321ad74ed0f463ccdc56fb6803a94f8ace91c2


Overall Assessment

⛔ Blocker

Executive Summary

Methodology

  • Severity: P0. Impact: HeterogeneousAdoptionDiD.fit(..., aggregate="event_study") now explicitly supports ordered categorical time labels, but _validate_had_panel_event_study() still determines the “last” treated cohort with sorted(..., key=lambda x: (x is None, x)). That is lexicographic/raw-Python ordering, not the chronological order already defined for time_col. On valid staggered inputs such as ordered month/quarter labels where lexical and chronological order differ, F_last can be wrong, so the auto-filter keeps the wrong cohort and changes the treated/comparison sample without error. This is a silent correctness bug in the Appendix B.2 last-cohort identification step, not a documented deviation. Concrete fix: derive cohorts/F_last from the validated time_col ordering (or sort actual_first_pos.unique() with the same time-order key used later for t_pre_list / t_post_list) before building keep_mask. diff_diff/had.py:L1011 diff_diff/had.py:L1023 diff_diff/had.py:L1129 diff_diff/had.py:L921 docs/methodology/REGISTRY.md:L2333
  • Severity: P3-informational. Impact: the uniform F-1 anchor, independent per-horizon SEs, and rejection of time-varying post-treatment dose are documented in the methodology registry / TODO tracking, so I did not count them against approval. Concrete fix: none. docs/methodology/REGISTRY.md:L2334 docs/methodology/REGISTRY.md:L2335 docs/methodology/REGISTRY.md:L2336 TODO.md:L92 TODO.md:L101

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • Severity: P3. Impact: a couple of new docstrings are still stale after Phase 2b. fit() still opens with “two-period panel,” and the event-study result docstring says n_units contains only last-cohort units after the staggered filter, even though never-treated units are retained. Concrete fix: update the docstrings to describe the multi-period/event-study behavior accurately. diff_diff/had.py:L1943 diff_diff/had.py:L447

Tech Debt

  • No findings. Deferred joint cross-horizon covariance, broader staggered follow-up work, and time-varying-dose extensions are explicitly tracked, so I did not count them against approval. TODO.md:L92 TODO.md:L93 TODO.md:L101

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: the new ordered-categorical coverage only exercises the common-adoption path, while the staggered-filter tests use numeric cohort labels. That leaves the blocker above unpinned. Concrete fix: add a staggered event-study regression with ordered-categorical period / first_treat labels whose lexicographic order differs from chronological order, and assert that filter_info["F_last"], the retained units, and result.F follow chronological order. tests/test_had.py:L2274 tests/test_had.py:L2804

Path to Approval

  1. Fix _validate_had_panel_event_study() so F_last is selected using the validated chronological order from time_col, not raw Python sorting of first_treat_col.
  2. Add a regression test for staggered ordered-categorical panels where chronological last != lexicographic max, and assert the kept cohort/sample matches the chronological last cohort.

…docstrings

**P0 (cohort sort key):** `_validate_had_panel_event_study` sorted
first_treat_col values with raw Python `(x is None, x)` while `time_col`
was already required to be ordered (numeric/datetime/ordered
categorical). On ordered-categorical staggered panels where
chronological order differs from lexicographic order, `F_last =
cohorts[-1]` silently picked the lexicographically latest cohort,
not the chronologically latest. That keeps the wrong cohort and
returns event-study estimates for the wrong estimand.

Fix: Promoted the dtype-aware `_sort_key` (ordered-categorical uses
declared category index; numeric/datetime use natural order) to the
top of the validator, just after the time-dtype check. Cohort
sorting, pre/post period sorting, contiguity check, and the
staggered-without-first_treat detection all now share this single
`_sort_key`. Removed the duplicate `_sort_key` definition that was
sitting further down in the same function.

**P3 (stale docstrings):**
- `fit()` no longer opens with "two-period panel"; now describes both
  aggregation modes with links to the respective result classes.
- `HeterogeneousAdoptionDiDEventStudyResults.n_units` docstring no
  longer says "only last-cohort units"; now accurately reports
  last-cohort PLUS never-treated retained.

**Test added:** `test_staggered_ordered_categorical_chooses_chronological_last`
uses categories `["q1", "q2", "q3", "q10"]` where lex max of the two
cohorts (`"q2", "q10"`) is `"q2"` but chronological last is `"q10"`;
asserts the fix picks `"q10"` as `F_last` and retains only the
q10-cohort units.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: cb11a1fdee026ec7d5d0fcd40160457d47a1f50c


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The prior re-review blocker is addressed: last-cohort selection now uses the validated chronological sort key, and there is a targeted ordered-categorical staggered regression test that pins the non-lexicographic case. diff_diff/had.py:L946-L1050 tests/test_had.py:L2730-L2792
  • No unmitigated Appendix B.2 methodology deviations found. The uniform F-1 baseline, last-cohort auto-filter, independent per-horizon sandwiches / pointwise CIs, and front-door rejection of time-varying post-F dose are all documented in the Methodology Registry / TODO.md, so they do not count against approval. docs/methodology/REGISTRY.md:L2332-L2337 TODO.md:L92-L101
  • P1: valid ordered-categorical event-study panels with extra unused category levels will be falsely rejected as unbalanced because the new balance check groups with observed=False, which materializes zero-count unit-period cells for unused categories. diff_diff/had.py:L908-L936 diff_diff/had.py:L1112-L1124
  • The new tests cover the earlier chronological-last bug, first_treat/dose-path cross-validation, unordered-string rejection, and time-varying-dose rejection, but they do not pin the unused-categorical-level case above. tests/test_had.py:L2794-L2833 tests/test_had.py:L2835-L2901
  • I did not find new inline-inference / partial-NaN regressions, performance issues, or security concerns in the changed estimator path.

Methodology

Code Quality

  • Severity: P1. Impact: _validate_had_panel_event_study() now explicitly supports ordered categorical time_col values, but its balance check uses data_filtered.groupby([unit_col, time_col], observed=False).size(). On categorical groupers, observed=False expands unused category levels into zero-count unit-period cells. That means a panel that is balanced on every observed period can still raise "Unbalanced panel" if the categorical dtype contains any extra unused levels, which is a realistic edge case for the new ordered-categorical event-study path. Concrete fix: run the balance check on observed period levels only, e.g. use observed=True or strip unused categories before grouping, and keep the rest of the validator keyed to the observed periods_list. diff_diff/had.py:L908-L936 diff_diff/had.py:L1112-L1124

Performance

  • No findings.

Maintainability

  • Severity: P3-informational. Impact: fit() is still annotated as returning HeterogeneousAdoptionDiDResults, even though aggregate="event_study" returns HeterogeneousAdoptionDiDEventStudyResults. This is documented and worked around in tests with typing.cast, so it is not blocker-worthy, but the public typing surface still mismatches runtime behavior. Concrete fix: add overloads keyed on aggregate or annotate the return as a union if you want the API typing to match the new result class. diff_diff/had.py:L1942-L2054 docs/methodology/REGISTRY.md:L2337-L2337 tests/test_had.py:L2135-L2146

Tech Debt

  • Severity: P3-informational. Impact: Deferred joint cross-horizon covariance, broader staggered follow-up work, and time-varying-dose extensions are properly tracked in TODO.md, so I did not count them against approval. Concrete fix: none. TODO.md:L92-L101

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: ordered-categorical coverage only exercises cases where the declared categories exactly match the observed periods, so it does not protect against the false-unbalance case above. Concrete fix: add a regression with an ordered categorical time_col whose categories are a strict superset of the observed periods and assert that fit(..., aggregate="event_study") succeeds with the expected F, event_times, and sample size. tests/test_had.py:L2868-L2901

Path to Approval

  1. Change the event-study balanced-panel check to ignore unused categorical time levels so valid ordered-categorical panels are not rejected as unbalanced.
  2. Add a regression test for ordered-categorical event-study inputs with unused category levels and assert the returned F, event_times, and sample size match the observed-period case.

Review basis: static diff review only; I did not run pytest here because the local runtime is missing the pandas test dependency.

…balance check

**P1:** Round 3 added `observed=False` to the balance-check groupby to
silence a pandas FutureWarning, but that creates a false-unbalance bug:
on ordered-categorical `time_col` with extra category levels beyond the
observed periods, `observed=False` materializes zero-count unit-period
cells for the unused levels, and the balance check rejects the panel.

Fix: switched to `observed=True`. This tells categorical groupby to
count only OBSERVED unit-period cells, matching the `periods_list`
(observed uniques) that the rest of the validator is keyed to. No
change for numeric / datetime time columns.

**Test added:** `test_ordered_categorical_with_unused_levels_accepted`
declares categories `["pre0", "pre1", "pre2", "post1", "post2", "post3"]`
but only observes `{"pre1", "pre2", "post1", "post2"}`; asserts the
fit succeeds with `F="post1"` and `event_times=[-2, 0, 1]`.
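The false-unbalance mechanism can be reproduced in a few lines of pandas (a minimal sketch using the same category set as the test above, with made-up unit/period data):

```python
import pandas as pd

# Ordered-categorical time column whose declared categories are a
# strict superset of the observed periods.
cats = ["pre0", "pre1", "pre2", "post1", "post2", "post3"]
df = pd.DataFrame({
    "unit": ["a"] * 4 + ["b"] * 4,
    "time": pd.Categorical(
        ["pre1", "pre2", "post1", "post2"] * 2,
        categories=cats, ordered=True),
})

# observed=False materializes zero-count unit-period cells for the
# unused levels, so a "every unit has every period" check fails.
counts_false = df.groupby(["unit", "time"], observed=False).size()
print((counts_false == 0).sum())  # 4 zero-count cells (2 units x 2 unused levels)

# observed=True counts only observed unit-period cells, matching the
# observed-uniques period list: the panel is balanced.
counts_true = df.groupby(["unit", "time"], observed=True).size()
print(len(counts_true), int(counts_true.min()))  # 8 cells, min count 1
```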

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 1f044e7c417c3d9fca572d11461ddacfb8beac08


Overall Assessment

✅ Looks good

Executive Summary

Methodology

  • Severity: P3-informational. Impact: The affected method is HeterogeneousAdoptionDiD.fit(..., aggregate="event_study"). The uniform F-1 baseline, last-cohort auto-filter plus never-treated retention, independent per-horizon sandwiches / pointwise CIs, and front-door rejection of time-varying post-treatment dose are all explicitly documented in REGISTRY.md / TODO.md, so there is no unmitigated methodology blocker here. Concrete fix: none. docs/methodology/REGISTRY.md:L2254 docs/methodology/REGISTRY.md:L2335 TODO.md:L92
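The uniform F-1 baseline referenced here can be written out as a short sketch (notation from the PR description: \(F\) is the cohort's first treated period, \(\ell\) the event time):

```latex
\Delta Y_{g}^{(\ell)} \;=\; Y_{g,\,F-1+\ell} \;-\; Y_{g,\,F-1},
```

so every horizon, pre and post, is differenced against the same period \(F-1\); the \(\ell = -1\) contrast is identically zero and is therefore omitted from the reported event times.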

Code Quality

  • Severity: P2. Impact: HeterogeneousAdoptionDiDEventStudyResults.to_dict() says its per-horizon arrays are returned in a JSON-friendly form, but list(self.event_times), list(self.att), etc. will still contain NumPy scalar elements. Downstream json.dumps(result.to_dict()) will fail on the new result object despite the public docstring promise. Concrete fix: use .tolist() or explicit int/float coercion for all ndarray fields, and add a regression that json.dumps(result.to_dict()) succeeds for an event-study result. diff_diff/had.py:L591 tests/test_had.py:L2569

Performance

  • No findings.

Maintainability

  • Severity: P3-informational. Impact: fit() still advertises HeterogeneousAdoptionDiDResults even though the event-study branch returns HeterogeneousAdoptionDiDEventStudyResults; the PR works around that with type: ignore in the implementation and typing.cast in tests. This mismatch is already documented in REGISTRY.md, so it is not approval-blocking. Concrete fix: none required; optional overloads or a union return type if you want the static surface to match runtime behavior. diff_diff/had.py:L1948 docs/methodology/REGISTRY.md:L2340 tests/test_had.py:L2135

Tech Debt

  • Severity: P3-informational. Impact: Deferred joint cross-horizon covariance, staggered support beyond the last cohort, and time-varying post-treatment dose estimation are now explicitly tracked in TODO.md, so they are properly mitigated under the stated review policy. Concrete fix: none. TODO.md:L92

Security

  • No findings.

Documentation/Tests

Review basis: static diff review only; I could not run pytest here because it is not installed in this environment.

`HeterogeneousAdoptionDiDEventStudyResults.to_dict()` promised JSON-
serializable output but previously returned raw numpy scalars via
`list(ndarray)`, which `json.dumps` can't serialize. The `F` field and
`filter_info.F_last` could also hold numpy scalars or pandas
Timestamps that break serialization.

Fix:
- Per-horizon arrays use `.tolist()` (unwraps numpy scalars to native
  Python).
- New `_json_safe_scalar` helper coerces numpy scalars via `.item()`
  and pandas Timestamp/Timedelta via `.isoformat()`; everything else
  passes through.
- New `_json_safe_filter_info` helper applies `_json_safe_scalar` to
  `F_last` and each element of `dropped_cohorts`, and casts counts to
  native `int`.
- `to_dict()` now applies these helpers consistently.

**Test added:** `test_to_dict_json_serializable` asserts
`json.dumps(result.to_dict())` succeeds and the round-trip values
parse back as native Python types (int, float, list).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 736fdeaff8556af39f86bcee42d52ec4f85a8afb


Overall Assessment

✅ Looks good

No unmitigated P0 or P1 findings in the changed diff.

Executive Summary

Methodology

Code Quality

Performance

  • No findings from static review.

Maintainability

Tech Debt

  • Severity: P3-informational. Impact: Deferred joint cross-horizon covariance, last-cohort-only staggered support, and time-varying-dose estimation are explicitly tracked in TODO.md, so they are mitigated under the project’s review policy. Concrete fix: none for this PR. TODO.md:L92-L101

Security

  • No findings in the changed files.

Documentation/Tests

  • No findings from static review. The added tests cover the main correctness risks for this feature: per-horizon parity against Phase 2a, time-varying post-treatment dose rejection, chronological cohort ordering for ordered categoricals, first_treat_col/dose cross-validation, unused categorical levels, staggered panels without first_treat_col, and NaN-coupled inference. tests/test_had.py:L2400-L2475 tests/test_had.py:L2613-L2995 tests/test_had.py:L3047-L3070
  • I could not execute pytest here because pytest is not installed in this environment.

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 22, 2026
@igerber igerber merged commit 7f51fdf into main Apr 22, 2026
23 of 24 checks passed
@igerber igerber deleted the had-phase-2b-event-study branch April 22, 2026 15:35