
HAD Phase 2b: multi-period event-study extension (Appendix B.2)#350

Merged
igerber merged 7 commits into main from had-phase-2b-event-study on Apr 22, 2026

Conversation


@igerber (Owner) commented Apr 22, 2026

Summary

  • Fills in the aggregate="event_study" scaffolding left by Phase 2a. HeterogeneousAdoptionDiD.fit(..., aggregate="event_study") now returns a new HeterogeneousAdoptionDiDEventStudyResults dataclass with per-event-time WAS estimates on multi-period panels (paper Appendix B.2).
  • All three Phase 2a design paths (continuous_at_zero, continuous_near_d_lower, mass_point) are reused verbatim on per-horizon first differences anchored at Y_{g, F-1} (uniform baseline, consistent with Garrett et al. application).
  • Staggered-timing panels auto-filter to the last-treatment cohort PLUS never-treated units, per the paper's "there must be an untreated group, at least till the period where the last cohort gets treated" prescription. UserWarning surfaces kept/dropped counts and dropped cohort labels.
  • Per-horizon SEs use INDEPENDENT sandwiches (continuous: CCT-2014 robust / |den|; mass-point: Phase 2a structural-residual 2SLS). Pointwise CIs match paper's Pierce-Schott application. Joint cross-horizon covariance deferred to follow-up.
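The anchored-first-difference construction in the bullets above can be sketched in a few lines of pandas. This is an illustrative reimplementation, not the library's internal code; `horizon_first_differences` and the column names are hypothetical.

```python
import pandas as pd

# Sketch of the per-horizon first differences described above: every event
# time e != -1 is anchored at the uniform baseline Y_{g, F-1}. Names are
# illustrative, not the library's internals.
def horizon_first_differences(df, unit_col, time_col, outcome_col, F):
    wide = df.pivot(index=unit_col, columns=time_col, values=outcome_col)
    baseline = wide[F - 1]  # uniform Y_{g, F-1} anchor
    diffs = {}
    for t in sorted(wide.columns):
        e = t - F
        if e == -1:
            continue  # anchor period: difference is 0 by construction
        diffs[e] = wide[t] - baseline
    return pd.DataFrame(diffs)

panel = pd.DataFrame({
    "unit": [1, 1, 1, 1, 2, 2, 2, 2],
    "period": [1, 2, 3, 4] * 2,
    "y": [1.0, 2.0, 5.0, 6.0, 0.0, 1.0, 2.0, 3.0],
})
fd = horizon_first_differences(panel, "unit", "period", "y", F=3)
# columns are event times (-2, 0, 1); each column is one horizon's outcome
```

Each column of `fd` would then be fed through the two-period design paths with D_{g,F} as the single dose regressor.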

Methodology references (required if estimator / math changes)

  • Method name(s): HeterogeneousAdoptionDiD (event-study mode)
  • Paper / source link(s): de Chaisemartin, Ciccia, D'Haultfœuille & Knau (2026), arXiv:2405.04465v6, Appendix B.2
  • Any intentional deviations from the source (and why):
    • Uniform F-1 baseline for all horizons (paper review line 232 suggests asymmetric Y_{g,t} - Y_{g,1} for pre-periods; uniform baseline matches the paper's Garrett Section 5.1 application and simplifies event-time indexing). Documented in REGISTRY Note (Phase 2b baseline convention).
    • Per-horizon INDEPENDENT sandwiches, not joint cross-horizon covariance. Paper reports pointwise CIs; joint covariance is not derived in Appendix B.2. Deferred to follow-up PR (tracked in TODO.md). Documented in REGISTRY Note (Phase 2b per-horizon SE).
    • D_{g,F} used as single dose regressor across all horizons (paper convention assumes "once treated, stay treated with same dose"). Time-varying post-period dose deferred. Documented in TODO.md.
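As a notation sketch of the first deviation (consistent with the bullets above, not an equation copied from the paper): every horizon is built from the same anchored first difference,

```latex
\Delta Y_{g,e} = Y_{g,\,F+e} - Y_{g,\,F-1}, \qquad e \neq -1,
```

and each \Delta Y_{g,e} is passed through the Phase 2a two-period machinery with D_{g,F} as the single dose regressor; the anchor e = -1 is skipped since \Delta Y_{g,-1} = 0 by construction.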

Validation

  • Tests added/updated: tests/test_had.py (+41 event-study tests, all passing)
    • Smoke per design path (continuous_at_zero, continuous_near_d_lower, mass_point) on 5-period panels
    • Staggered auto-filter: warning, filter_info populated, sample composition (last-cohort + never-treated retained, earlier cohorts dropped), explicit "retain never-treated" pin test
    • Per-horizon SE independence: matches Phase 2a SE on (F-1, t) two-period subset at atol=1e-12
    • Baseline convention: e = -1 not in event_times; event-time indexing symmetric around F
    • Pre-period placebos under no-pre-trend DGP concentrate around 0
    • 2x2 aggregate/period matrix: (T=2, event_study) raises, (T>2, overall) raises
    • Panel contracts preserved: NaN rejection, RCS rejection, non-contiguous dose rejection
    • Phase 2a guards preserved: reciprocal regime partition, d_lower contracts, NaN triple via safe_inference
    • sklearn compat: clone round-trip, fit idempotence
    • HeterogeneousAdoptionDiDEventStudyResults API: summary(), to_dict(), to_dataframe(), repr
  • Backtest / simulation / notebook evidence (if applicable): N/A — Phase 2b is an analytical extension; numerical verification is via per-horizon parity with Phase 2a (commit criterion #2).

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

igerber and others added 2 commits April 22, 2026 07:03
Lifts `aggregate="event_study"` scaffolding left by Phase 2a.
`HeterogeneousAdoptionDiD.fit(..., aggregate="event_study")` now returns
a new `HeterogeneousAdoptionDiDEventStudyResults` dataclass with per-
event-time WAS estimates on multi-period panels. All three Phase 2a
design paths (continuous_at_zero, continuous_near_d_lower, mass_point)
are reused verbatim on per-horizon first differences anchored at
`Y_{g, F-1}` (uniform F-1 baseline, consistent with paper Garrett
application).

Staggered-timing panels are auto-filtered to the last-treatment cohort
with a `UserWarning` per paper Appendix B.2 prescription ("did_had may
be used only for the LAST treatment cohort in a staggered design").
Pre-period placebos included for `e <= -2`; the anchor `e = -1` is
skipped since `ΔY = 0` there by construction.

Per-horizon SEs use INDEPENDENT sandwiches: continuous paths use the
CCT-2014 robust SE from Phase 1c divided by `|den|`; mass-point path
uses the Phase 2a structural-residual 2SLS sandwich computed on each
horizon's first differences. Pointwise CIs match the paper's own
Pierce-Schott application (Figure 2); joint cross-horizon covariance
is deferred to a follow-up PR (tracked in TODO.md).

All Phase 2a policy guards (reciprocal regime partition, d_lower
contracts, NaN contract via safe_inference, sklearn get/set_params
atomicity) are preserved identically on the event-study path. The
multi-period panel validator adds dose-contiguity enforcement (pre-
periods < post-periods in natural ordering) and rejects non-monotonic
dose sequences with a pointer to ChaisemartinDHaultfoeuille.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Paper Appendix B.2: "in designs with variation in treatment timing,
there must be an untreated group, at least till the period where the
last cohort gets treated." Previously the staggered last-cohort filter
dropped BOTH earlier cohorts AND never-treated units (first_treat=0),
which was undocumented and changed the comparison set.

Fixed: `_validate_had_panel_event_study` now keeps `first_treat == F_last`
AND `first_treat == 0` (never-treated), dropping only earlier cohorts
(`first_treat` in `dropped_cohorts`). Never-treated units satisfy the
dose invariant at every period (D=0 throughout) and serve as the
untreated-group comparison per the paper's requirement. Matches
CallawaySantAnna's never-treated handling convention.
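The corrected filter contract can be sketched as follows. This is a hypothetical standalone version (`filter_last_cohort` is not the shipped validator); it only illustrates "keep chronologically last cohort plus never-treated, warn about the rest".

```python
import warnings
import pandas as pd

# Sketch of the last-cohort-plus-never-treated filter described above:
# first_treat == 0 marks never-treated units, which are retained as the
# untreated comparison group; only earlier treated cohorts are dropped.
def filter_last_cohort(df, first_treat_col="first_treat"):
    treated = df.loc[df[first_treat_col] != 0, first_treat_col]
    F_last = treated.max()
    keep = df[first_treat_col].isin([0, F_last])
    dropped_cohorts = sorted(set(treated) - {F_last})
    if dropped_cohorts:
        warnings.warn(
            f"Staggered timing: keeping cohort {F_last} plus never-treated "
            f"units ({int(keep.sum())} rows); dropping cohorts "
            f"{dropped_cohorts} ({int((~keep).sum())} rows).",
            UserWarning,
        )
    return df[keep], {"F_last": F_last, "dropped_cohorts": dropped_cohorts}

units = pd.DataFrame({"unit": range(6),
                      "first_treat": [0, 0, 2, 2, 3, 3]})
kept, info = filter_last_cohort(units)
# kept first_treat values are exactly {0, 3}; cohort 2 is dropped
```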

Updated:
- Warning message names the "plus never-treated" composition.
- REGISTRY.md Note (Phase 2b last-cohort filter) documents the new
  contract explicitly.
- Function docstring clarifies `data_filtered` and `filter_info` shape.
- New test `test_staggered_filter_retains_never_treated_units` pins
  the contract: kept first_treat values are exactly {0, F_last}.
- Existing tests updated to reflect new n_kept / n_dropped counts.

Also addresses AI review P3s:
- Clarify sort-key comment (no false "str fallback" claim).
- Drop misleading NaN-rendering comment on event-study summary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Overall Assessment

⛔ Blocker

Executive Summary

  • HeterogeneousAdoptionDiD.fit(..., aggregate="event_study") silently accepts panels where treated units’ doses change after F, but the implementation still hard-codes D_{g,F} as the regressor for every horizon. That can return wrong later-horizon estimates on accepted input.
  • Staggered timing is only reduced to the last cohort when first_treat_col is provided. If users omit that column, mixed-cohort panels can still be fit as if they had a common treatment date, which violates Appendix B.2 identification.
  • I did not treat the uniform F-1 baseline or the independent per-horizon sandwiches as defects, because both are explicitly documented in REGISTRY.md.
  • The methodology docs still need cleanup: REGISTRY.md now contains contradictory statements about whether staggered timing hard-fails or auto-filters.
  • The new tests are broad, but they do not cover the two methodology-breaking input cases above.

Methodology

  • Severity: P0. Impact: _aggregate_multi_period_first_differences() always reuses D_{g,F} for every event-time horizon, and there is no front-door validation that post-treatment dose stays constant within unit. The docstring explicitly says time-varying post-period dose is treated as the realized F-period dose, so accepted panels with changing D_{g,t} can produce wrong later-horizon ATT/SE values with no warning. The TODO entry does not mitigate this because it is a silent correctness bug. Concrete fix: reject any panel where a unit has D_{g,t} != D_{g,F} for any t >= F, or implement the time-varying-dose estimator and document it in REGISTRY.md. References: diff_diff/had.py:L1197-L1255, TODO.md:L101-L101.
  • Severity: P1. Impact: the last-cohort reduction only runs inside if first_treat_col is not None. When a staggered panel is passed without that column, the validator skips cohort detection, infers F from the first period with any nonzero dose, and proceeds as if the panel had a common adoption date. Later-treated cohorts then enter d_arr as zero-dose units at the inferred F, which changes the control composition and violates the paper’s “last cohort only” Appendix B.2 contract. The public fit() docstring even says first_treat_col is required for staggered timing, but the code never enforces that requirement. Concrete fix: when first_treat_col is absent, infer each unit’s first positive-dose period from the observed dose path; if multiple cohorts exist, either auto-filter to F_last or raise a front-door error directing users to pass first_treat_col / use ChaisemartinDHaultfoeuille. References: diff_diff/had.py:L843-L855, diff_diff/had.py:L919-L1066, diff_diff/had.py:L1803-L1812, docs/methodology/REGISTRY.md:L2332-L2334.
  • Severity: P3. Impact: the uniform F-1 baseline and independent per-horizon sandwich SEs are documented in the Methodology Registry, so I did not count them as methodology defects. Concrete fix: none required for approval. References: docs/methodology/REGISTRY.md:L2332-L2335.

Code Quality

No additional findings.

Performance

No additional findings.

Maintainability

No additional findings beyond the documentation/test issues below.

Tech Debt

  • Severity: P3. Impact: joint cross-horizon covariance is explicitly tracked as deferred work and matches the paper’s pointwise-CI presentation, so it is not a blocker for this PR. Concrete fix: none required for approval; follow-up can add IF stacking or block bootstrap. References: docs/methodology/REGISTRY.md:L2334-L2334, TODO.md:L92-L92.

Security

No findings.

Documentation/Tests

  • Severity: P2. Impact: REGISTRY.md now gives conflicting guidance on staggered timing. One note still says the implementation hard-fails and redirects on variation in treatment timing, while the new Phase 2b note says the estimator auto-filters to the last cohort and can infer F without first_treat_col. That ambiguity makes the shipped methodology contract hard to trust. Concrete fix: reconcile the older note with the new Phase 2b note and state explicitly whether omitted first_treat_col on staggered panels is supported, rejected, or auto-detected. References: docs/methodology/REGISTRY.md:L2251-L2251, docs/methodology/REGISTRY.md:L2333-L2333.
  • Severity: P2. Impact: the new tests exercise only the explicit first_treat_col staggered path and the choice to reuse D_{g,F}. There is no negative test for either blocker condition: staggered multi-cohort input without first_treat_col, or time-varying post-F dose. That leaves both failures unguarded. Concrete fix: add regression tests for both cases in tests/test_had.py. References: tests/test_had.py:L2271-L2397, tests/test_had.py:L2850-L2872, tests/test_had.py:L2917-L2930.

Path to Approval

  1. Add a hard validation that event-study panels satisfy the constant-dose-after-treatment assumption behind D_{g,F} reuse, or implement the time-varying-dose method and document it in REGISTRY.md.
  2. Detect staggered timing even when first_treat_col is omitted; if multiple cohorts are present, auto-filter to the last cohort or raise a front-door error requiring cohort metadata / redirecting to ChaisemartinDHaultfoeuille.
  3. Add regression tests for both conditions above and reconcile the contradictory staggered-timing notes in REGISTRY.md.

Execution note: I did not run the test suite in this environment because the available Python on PATH does not have the project dependencies (numpy, pandas, scipy).

…tection

**P0 (blocker):** `_aggregate_multi_period_first_differences` reuses
`D_{g, F}` as the single regressor for every event-time horizon. Without
validation, panels where a unit's dose varies across post-treatment
periods silently misattribute later-horizon effects to the period-F dose.

Fix: `_validate_had_panel_event_study` now rejects panels where any unit
has time-varying dose across post-periods (within-unit spread beyond
float tolerance), with a `ValueError` redirecting to
ChaisemartinDHaultfoeuille for genuinely time-varying regimes.
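A minimal sketch of such a guard, assuming the panel is long-format with one row per unit-period (names are illustrative, not the shipped validator):

```python
import pandas as pd

# Sketch of the constant-dose guard described above: reject any unit whose
# dose at some post-treatment period differs from its period-F dose beyond
# float tolerance.
def check_constant_post_dose(df, unit_col, time_col, dose_col, F, atol=1e-12):
    post = df[df[time_col] >= F]
    spread = post.groupby(unit_col)[dose_col].agg(lambda d: d.max() - d.min())
    bad = sorted(spread[spread > atol].index)
    if bad:
        raise ValueError(
            f"Time-varying post-treatment dose for units {bad}; use "
            "ChaisemartinDHaultfoeuille for genuinely time-varying regimes."
        )

panel = pd.DataFrame({
    "unit": [1, 1, 1, 2, 2, 2],
    "period": [1, 2, 3] * 2,
    "dose": [0.0, 0.5, 0.5, 0.0, 0.5, 0.9],  # unit 2 drifts after F = 2
})
check_constant_post_dose(panel[panel["unit"] == 1], "unit", "period",
                         "dose", F=2)  # constant post-F dose: passes
try:
    check_constant_post_dose(panel, "unit", "period", "dose", F=2)
    rejected = False
except ValueError:
    rejected = True  # unit 2's drifting dose is rejected front-door
```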

**P1:** Staggered-timing auto-filter previously only ran inside
`if first_treat_col is not None`. Multi-cohort panels without cohort
metadata slipped through, treating later-cohort units as zero-dose
"controls" at the inferred F, violating Appendix B.2's last-cohort-only
contract.

Fix: When `first_treat_col is None`, the validator computes per-unit
first-positive-dose period from the dose path. If multiple distinct
cohorts are detected, it raises a `ValueError` directing users to
pass `first_treat_col` (which activates the last-cohort auto-filter)
or use ChaisemartinDHaultfoeuille for full staggered support.
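The cohort-detection logic being described can be sketched like this (hypothetical helper; the real validator lives in `_validate_had_panel_event_study`):

```python
import pandas as pd

# Sketch of cohort detection from the dose path alone, as described above:
# a unit's cohort is its first period with positive dose, and multiple
# distinct cohorts without first_treat metadata are rejected front-door.
def detect_common_adoption(df, unit_col, time_col, dose_col):
    pos = df[df[dose_col] > 0]
    first_pos = pos.groupby(unit_col)[time_col].min()
    cohorts = sorted(first_pos.unique())
    if len(cohorts) > 1:
        raise ValueError(
            f"Multiple adoption cohorts detected ({cohorts}); pass "
            "first_treat_col to enable the last-cohort auto-filter, or use "
            "ChaisemartinDHaultfoeuille for full staggered support."
        )
    return cohorts[0] if cohorts else None

common = pd.DataFrame({"unit": [1, 1, 2, 2], "period": [1, 2, 1, 2],
                       "dose": [0.0, 0.3, 0.0, 0.7]})
F_common = detect_common_adoption(common, "unit", "period", "dose")
staggered = common.assign(dose=[0.0, 0.3, 0.5, 0.7])  # unit 2 adopts earlier
try:
    detect_common_adoption(staggered, "unit", "period", "dose")
    staggered_rejected = False
except ValueError:
    staggered_rejected = True
```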

**P2 (docs):** Reconciled contradictory REGISTRY guidance between the
legacy edge-case note (line ~2251) and the new Phase 2b last-cohort
filter note. Both now describe the auto-filter + front-door rejection
of un-annotated staggered panels consistently.

**P2 (tests):** Added regression tests for both blockers:
- `test_time_varying_post_F_dose_rejected`
- `test_staggered_without_first_treat_col_rejected`

Also added a **Note (Phase 2b constant-dose requirement)** block to
REGISTRY documenting the new validator guard. TODO.md entry updated
to reflect front-door rejection of time-varying doses (not silent
reuse as before).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 1bfec37582c1e2c8f612d93da090637e06042b62


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The two prior re-review blockers are addressed: the event-study validator now rejects time-varying post-F dose paths and no longer silently accepts multi-cohort staggered panels when first_treat_col is omitted; the earlier REGISTRY.md contradiction on staggered timing is also resolved (diff_diff/had.py#L1130, diff_diff/had.py#L1153, tests/test_had.py#L2693, tests/test_had.py#L2730, docs/methodology/REGISTRY.md#L2251).
  • I did not count the uniform F-1 anchor or per-horizon independent SEs as defects: they are explicitly documented in REGISTRY.md, and the DIDHAD reference docs also define dynamic effects/placebos relative to F-1 and direct staggered-timing users to did_multiplegt_dyn. (ideas.repec.org)
  • P1: when first_treat_col is present, the new last-cohort filter trusts that column without validating it against the observed dose path. A swapped or mistyped cohort label can silently keep the wrong cohort and change the estimand/sample composition (diff_diff/had.py#L919, diff_diff/had.py#L967, diff_diff/had.py#L1130).
  • P1: the event-study path derives chronology by sorting raw period labels. That reintroduces the label-order problem the two-period validator explicitly avoided, so valid multi-period panels with semantic or non-ordinal string periods can be rejected or misindexed (diff_diff/had.py#L732, diff_diff/had.py#L1065, diff_diff/had.py#L1321).
  • No new inline-inference / partial-NaN anti-patterns or security issues showed up in the changed estimator path.

Methodology

  • Severity: P1. Impact: _validate_had_panel_event_study() uses first_treat_col as the authoritative cohort selector, but it never checks that those labels equal each unit’s actual first positive-dose period. In this PR, that column now decides which cohort is treated as the Appendix B.2 “last cohort,” so a mislabeled metadata column can silently retain the wrong units and return event-study estimates for the wrong cohort. Concrete fix: infer each unit’s actual first positive-dose period from dose_col for all units, require equality with first_treat_col (or 0 for never-treated), and reject off-support/mismatched labels before building cohorts and F_last (diff_diff/had.py#L919, diff_diff/had.py#L967, diff_diff/had.py#L1130).
  • Severity: P1. Impact: Appendix B.2 event-time effects depend on chronological ordering, but the new validator/aggregator recover that order with raw sorted(...) on period labels. For plain string/object labels, that is lexicographic rather than chronological; valid panels such as month names or "pre1"/"pre2"/"post1"/"post2" can fail the contiguity check or get the wrong anchor/horizon order. Concrete fix: require an ordered time type on the event-study path (numeric, datetime, ordered categorical, or an explicit supplied order) and front-door reject unordered object/string labels instead of silently using lexicographic sort (diff_diff/had.py#L1065, diff_diff/had.py#L1080, diff_diff/had.py#L1321).
  • Severity: P3-informational. Impact: the uniform F-1 baseline and independent per-horizon sandwiches are documented deviations/choices, so they are not approval blockers. Concrete fix: none required (docs/methodology/REGISTRY.md#L2332, docs/methodology/REGISTRY.md#L2335).

Code Quality

  • No findings. The new per-horizon inference path still routes through safe_inference(), so I did not find a new inline t_stat = effect / se or partial-NaN guard regression (diff_diff/had.py#L2605).

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3-informational. Impact: joint cross-horizon covariance and the broader staggered/time-varying-dose follow-ups are now explicitly tracked in TODO.md, so I did not count them against approval. Concrete fix: none required in this PR (TODO.md#L92, TODO.md#L93, TODO.md#L101).

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: the new tests close the prior blocker cases, but there is still no regression coverage for the two unmitigated issues above: event-study with unordered string/object period labels, and event-study with first_treat_col disagreeing with the observed dose path. The only explicit string-period coverage remains two-period-only. Concrete fix: add one regression test for semantic string periods on aggregate="event_study" and one for a swapped/mistyped first_treat_col that must raise (tests/test_had.py#L1760, tests/test_had.py#L2593, tests/test_had.py#L2730).
  • Severity: P3. Impact: the HeterogeneousAdoptionDiD class docstring still says the event-study extension is “queued for Phase 2b,” which now contradicts runtime behavior and the updated registry/docs. Concrete fix: update the class docstring so help(HeterogeneousAdoptionDiD) matches fit(..., aggregate="event_study") (diff_diff/had.py#L1607, diff_diff/had.py#L1925).

Path to Approval

  1. Validate first_treat_col against each unit’s actual first positive-dose period before using it for last-cohort selection, and raise on any mismatch or off-support label.
  2. Stop inferring event-study chronology from raw lexicographic sort of period labels; require an ordered time representation or explicit order and reject unordered object/string labels.
  3. Add regression tests for both cases above so the fixes stay pinned.

…P1 ordered time

**P1 (first_treat_col vs dose mismatch):** The last-cohort filter trusted
`first_treat_col` without validating it against the observed dose path.
A swapped or mistyped cohort label could silently retain the wrong
cohort as F_last.

Fix: `_validate_had_panel_event_study` now cross-validates each unit's
declared first_treat against their actual first-positive-dose period:
- declared == 0: unit must have D=0 at every period
- declared == F_g > 0: unit's first period with D>0 must equal F_g
Any mismatch raises `ValueError` with an example unit, declared value,
and actual first-positive period.
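The cross-validation contract can be sketched as a standalone check (illustrative names; not the shipped code):

```python
import pandas as pd

# Sketch of the declared-vs-actual cross-validation described above:
# declared 0 must mean zero dose at every period; declared F_g > 0 must
# equal the unit's first positive-dose period.
def cross_validate_first_treat(df, unit_col, time_col, dose_col,
                               first_treat_col):
    for unit, g in df.sort_values(time_col).groupby(unit_col):
        declared = g[first_treat_col].iloc[0]
        pos = g.loc[g[dose_col] > 0, time_col]
        actual = pos.iloc[0] if len(pos) else 0
        if declared != actual:
            raise ValueError(
                f"Unit {unit}: declared first_treat={declared!r}, but its "
                f"first positive-dose period is {actual!r}."
            )

good = pd.DataFrame({"unit": [1, 1, 1], "period": [1, 2, 3],
                     "dose": [0.0, 0.4, 0.4], "first_treat": [2, 2, 2]})
cross_validate_first_treat(good, "unit", "period", "dose", "first_treat")
try:  # mistyped cohort label: declared 3, actual first positive dose at 2
    cross_validate_first_treat(good.assign(first_treat=3),
                               "unit", "period", "dose", "first_treat")
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
```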

**P1 (unordered time labels):** Event-study chronology was inferred via
raw `sorted()` on period labels. For object/string dtypes that falls
back to lexicographic sort, which silently misorders panels like
"pre1"/"pre2"/"post1"/"post2" or month-name labels.

Fix: Event-study path now requires a numeric, datetime, or ordered-
categorical time column. Object/string dtypes raise a front-door
`ValueError` directing users to convert. Ordered categoricals are
sorted by their declared category order (not the underlying string),
via a dtype-aware `_sort_key` reused by both the validator and the
multi-period aggregator.
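The dtype-aware sort key can be sketched as follows (hypothetical `make_sort_key`; the in-tree helper may differ):

```python
import pandas as pd

# Sketch of a dtype-aware sort key as described above: ordered categoricals
# sort by their declared category index rather than lexicographically;
# numeric/datetime labels already sort naturally.
def make_sort_key(series):
    dtype = series.dtype
    if isinstance(dtype, pd.CategoricalDtype) and dtype.ordered:
        rank = {cat: i for i, cat in enumerate(dtype.categories)}
        return lambda label: rank[label]
    return lambda label: label

t = pd.Series(pd.Categorical(
    ["q10", "q2", "q1"], categories=["q1", "q2", "q3", "q10"], ordered=True))
lexicographic = sorted(t.unique())                     # wrong: q10 < q2
chronological = sorted(t.unique(), key=make_sort_key(t))  # category order
```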

**P3 (docstring):** Class docstring no longer says the event-study
extension is "queued for Phase 2b"; now documents both aggregation
modes with pointers to the respective result classes.

**Tests added:**
- `test_first_treat_col_mismatch_with_dose_raises` pins the cross-
  validation contract.
- `test_unordered_string_time_col_rejected` pins front-door rejection
  of object dtypes.
- `test_ordered_categorical_time_col_accepted` confirms ordered
  categoricals sort by category order and fit successfully.

Minor: added `observed=False` to the categorical-groupby in the balance
check to silence the pandas FutureWarning while preserving behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 6e321ad74ed0f463ccdc56fb6803a94f8ace91c2


Overall Assessment

⛔ Blocker

Executive Summary

Methodology

  • Severity: P0. Impact: HeterogeneousAdoptionDiD.fit(..., aggregate="event_study") now explicitly supports ordered categorical time labels, but _validate_had_panel_event_study() still determines the “last” treated cohort with sorted(..., key=lambda x: (x is None, x)). That is lexicographic/raw-Python ordering, not the chronological order already defined for time_col. On valid staggered inputs such as ordered month/quarter labels where lexical and chronological order differ, F_last can be wrong, so the auto-filter keeps the wrong cohort and changes the treated/comparison sample without error. This is a silent correctness bug in the Appendix B.2 last-cohort identification step, not a documented deviation. Concrete fix: derive cohorts/F_last from the validated time_col ordering (or sort actual_first_pos.unique() with the same time-order key used later for t_pre_list / t_post_list) before building keep_mask. diff_diff/had.py:L1011 diff_diff/had.py:L1023 diff_diff/had.py:L1129 diff_diff/had.py:L921 docs/methodology/REGISTRY.md:L2333
  • Severity: P3-informational. Impact: the uniform F-1 anchor, independent per-horizon SEs, and rejection of time-varying post-treatment dose are documented in the methodology registry / TODO tracking, so I did not count them against approval. Concrete fix: none. docs/methodology/REGISTRY.md:L2334 docs/methodology/REGISTRY.md:L2335 docs/methodology/REGISTRY.md:L2336 TODO.md:L92 TODO.md:L101

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • Severity: P3. Impact: a couple of new docstrings are still stale after Phase 2b. fit() still opens with “two-period panel,” and the event-study result docstring says n_units contains only last-cohort units after the staggered filter, even though never-treated units are retained. Concrete fix: update the docstrings to describe the multi-period/event-study behavior accurately. diff_diff/had.py:L1943 diff_diff/had.py:L447

Tech Debt

  • No findings. Deferred joint cross-horizon covariance, broader staggered follow-up work, and time-varying-dose extensions are explicitly tracked, so I did not count them against approval. TODO.md:L92 TODO.md:L93 TODO.md:L101

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: the new ordered-categorical coverage only exercises the common-adoption path, while the staggered-filter tests use numeric cohort labels. That leaves the blocker above unpinned. Concrete fix: add a staggered event-study regression with ordered-categorical period / first_treat labels whose lexicographic order differs from chronological order, and assert that filter_info["F_last"], the retained units, and result.F follow chronological order. tests/test_had.py:L2274 tests/test_had.py:L2804

Path to Approval

  1. Fix _validate_had_panel_event_study() so F_last is selected using the validated chronological order from time_col, not raw Python sorting of first_treat_col.
  2. Add a regression test for staggered ordered-categorical panels where chronological last != lexicographic max, and assert the kept cohort/sample matches the chronological last cohort.

…docstrings

**P0 (cohort sort key):** `_validate_had_panel_event_study` sorted
first_treat_col values with raw Python `(x is None, x)` while `time_col`
was already required to be ordered (numeric/datetime/ordered
categorical). On ordered-categorical staggered panels where
chronological order differs from lexicographic order, `F_last =
cohorts[-1]` silently picked the lexicographically latest cohort,
not the chronologically latest. That keeps the wrong cohort and
returns event-study estimates for the wrong estimand.

Fix: Promoted the dtype-aware `_sort_key` (ordered-categorical uses
declared category index; numeric/datetime use natural order) to the
top of the validator, just after the time-dtype check. Cohort
sorting, pre/post period sorting, contiguity check, and the
staggered-without-first_treat detection all now share this single
`_sort_key`. Removed the duplicate `_sort_key` definition that was
sitting further down in the same function.

**P3 (stale docstrings):**
- `fit()` no longer opens with "two-period panel"; now describes both
  aggregation modes with links to the respective result classes.
- `HeterogeneousAdoptionDiDEventStudyResults.n_units` docstring no
  longer says "only last-cohort units"; now accurately reports
  last-cohort PLUS never-treated retained.

**Test added:** `test_staggered_ordered_categorical_chooses_chronological_last`
uses categories `["q1", "q2", "q3", "q10"]` where lex max of the two
cohorts (`"q2", "q10"`) is `"q2"` but chronological last is `"q10"`;
asserts the fix picks `"q10"` as `F_last` and retains only the
q10-cohort units.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: cb11a1fdee026ec7d5d0fcd40160457d47a1f50c


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The prior re-review blocker is addressed: last-cohort selection now uses the validated chronological sort key, and there is a targeted ordered-categorical staggered regression test that pins the non-lexicographic case. diff_diff/had.py:L946-L1050 tests/test_had.py:L2730-L2792
  • No unmitigated Appendix B.2 methodology deviations found. The uniform F-1 baseline, last-cohort auto-filter, independent per-horizon sandwiches / pointwise CIs, and front-door rejection of time-varying post-F dose are all documented in the Methodology Registry / TODO.md, so they do not count against approval. docs/methodology/REGISTRY.md:L2332-L2337 TODO.md:L92-L101
  • P1: valid ordered-categorical event-study panels with extra unused category levels will be falsely rejected as unbalanced because the new balance check groups with observed=False, which materializes zero-count unit-period cells for unused categories. diff_diff/had.py:L908-L936 diff_diff/had.py:L1112-L1124
  • The new tests cover the earlier chronological-last bug, first_treat/dose-path cross-validation, unordered-string rejection, and time-varying-dose rejection, but they do not pin the unused-categorical-level case above. tests/test_had.py:L2794-L2833 tests/test_had.py:L2835-L2901
  • I did not find new inline-inference / partial-NaN regressions, performance issues, or security concerns in the changed estimator path.

Methodology

Code Quality

  • Severity: P1. Impact: _validate_had_panel_event_study() now explicitly supports ordered categorical time_col values, but its balance check uses data_filtered.groupby([unit_col, time_col], observed=False).size(). On categorical groupers, observed=False expands unused category levels into zero-count unit-period cells. That means a panel that is balanced on every observed period can still raise "Unbalanced panel" if the categorical dtype contains any extra unused levels, which is a realistic edge case for the new ordered-categorical event-study path. Concrete fix: run the balance check on observed period levels only, e.g. use observed=True or strip unused categories before grouping, and keep the rest of the validator keyed to the observed periods_list. diff_diff/had.py:L908-L936 diff_diff/had.py:L1112-L1124

Performance

  • No findings.

Maintainability

  • Severity: P3-informational. Impact: fit() is still annotated as returning HeterogeneousAdoptionDiDResults, even though aggregate="event_study" returns HeterogeneousAdoptionDiDEventStudyResults. This is documented and worked around in tests with typing.cast, so it is not blocker-worthy, but the public typing surface still mismatches runtime behavior. Concrete fix: add overloads keyed on aggregate or annotate the return as a union if you want the API typing to match the new result class. diff_diff/had.py:L1942-L2054 docs/methodology/REGISTRY.md:L2337-L2337 tests/test_had.py:L2135-L2146

Tech Debt

  • Severity: P3-informational. Impact: Deferred joint cross-horizon covariance, broader staggered follow-up work, and time-varying-dose extensions are properly tracked in TODO.md, so I did not count them against approval. Concrete fix: none. TODO.md:L92-L101

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: ordered-categorical coverage only exercises cases where the declared categories exactly match the observed periods, so it does not protect against the false-unbalance case above. Concrete fix: add a regression with an ordered categorical time_col whose categories are a strict superset of the observed periods and assert that fit(..., aggregate="event_study") succeeds with the expected F, event_times, and sample size. tests/test_had.py:L2868-L2901

Path to Approval

  1. Change the event-study balanced-panel check to ignore unused categorical time levels so valid ordered-categorical panels are not rejected as unbalanced.
  2. Add a regression test for ordered-categorical event-study inputs with unused category levels and assert the returned F, event_times, and sample size match the observed-period case.

Review basis: static diff review only; I did not run pytest here because the local runtime is missing the pandas test dependency.

…balance check

**P1:** Round 3 added `observed=False` to the balance-check groupby to
silence a pandas FutureWarning, but that creates a false-unbalance bug:
on ordered-categorical `time_col` with extra category levels beyond the
observed periods, `observed=False` materializes zero-count unit-period
cells for the unused levels, and the balance check rejects the panel.

Fix: switched to `observed=True`. This tells categorical groupby to
count only OBSERVED unit-period cells, matching the `periods_list`
(observed uniques) that the rest of the validator is keyed to. No
change for numeric / datetime time columns.

**Test added:** `test_ordered_categorical_with_unused_levels_accepted`
declares categories `["pre0", "pre1", "pre2", "post1", "post2", "post3"]`
but only observes `{"pre1", "pre2", "post1", "post2"}`; asserts the
fit succeeds with `F="post1"` and `event_times=[-2, 0, 1]`.
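The false-unbalance mechanism can be reproduced in a few lines of pandas (a minimal sketch using the same category set as the test above, with made-up unit/period data):

```python
import pandas as pd

# Ordered-categorical time column whose declared categories are a
# strict superset of the observed periods.
cats = ["pre0", "pre1", "pre2", "post1", "post2", "post3"]
df = pd.DataFrame({
    "unit": ["a"] * 4 + ["b"] * 4,
    "time": pd.Categorical(
        ["pre1", "pre2", "post1", "post2"] * 2,
        categories=cats, ordered=True),
})

# observed=False materializes zero-count unit-period cells for the
# unused levels, so a "every unit has every period" check fails.
counts_false = df.groupby(["unit", "time"], observed=False).size()
print((counts_false == 0).sum())  # 4 zero-count cells (2 units x 2 unused levels)

# observed=True counts only observed unit-period cells, matching the
# observed-uniques period list: the panel is balanced.
counts_true = df.groupby(["unit", "time"], observed=True).size()
print(len(counts_true), int(counts_true.min()))  # 8 cells, min count 1
```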

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 1f044e7c417c3d9fca572d11461ddacfb8beac08


Overall Assessment

✅ Looks good

Executive Summary

Methodology

  • Severity: P3-informational. Impact: The affected method is HeterogeneousAdoptionDiD.fit(..., aggregate="event_study"). The uniform F-1 baseline, last-cohort auto-filter plus never-treated retention, independent per-horizon sandwiches / pointwise CIs, and front-door rejection of time-varying post-treatment dose are all explicitly documented in REGISTRY.md / TODO.md, so there is no unmitigated methodology blocker here. Concrete fix: none. docs/methodology/REGISTRY.md:L2254 docs/methodology/REGISTRY.md:L2335 TODO.md:L92
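The uniform F-1 baseline referenced here can be written out as a short sketch (notation from the PR description: \(F\) is the cohort's first treated period, \(\ell\) the event time):

```latex
\Delta Y_{g}^{(\ell)} \;=\; Y_{g,\,F-1+\ell} \;-\; Y_{g,\,F-1},
```

so every horizon, pre and post, is differenced against the same period \(F-1\); the \(\ell = -1\) contrast is identically zero and is therefore omitted from the reported event times.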

Code Quality

  • Severity: P2. Impact: HeterogeneousAdoptionDiDEventStudyResults.to_dict() says its per-horizon arrays are returned in a JSON-friendly form, but list(self.event_times), list(self.att), etc. will still contain NumPy scalar elements. Downstream json.dumps(result.to_dict()) will fail on the new result object despite the public docstring promise. Concrete fix: use .tolist() or explicit int/float coercion for all ndarray fields, and add a regression that json.dumps(result.to_dict()) succeeds for an event-study result. diff_diff/had.py:L591 tests/test_had.py:L2569

Performance

  • No findings.

Maintainability

  • Severity: P3-informational. Impact: fit() still advertises HeterogeneousAdoptionDiDResults even though the event-study branch returns HeterogeneousAdoptionDiDEventStudyResults; the PR works around that with type: ignore in the implementation and typing.cast in tests. This mismatch is already documented in REGISTRY.md, so it is not approval-blocking. Concrete fix: none required; optional overloads or a union return type if you want the static surface to match runtime behavior. diff_diff/had.py:L1948 docs/methodology/REGISTRY.md:L2340 tests/test_had.py:L2135

Tech Debt

  • Severity: P3-informational. Impact: Deferred joint cross-horizon covariance, staggered support beyond the last cohort, and time-varying post-treatment dose estimation are now explicitly tracked in TODO.md, so they are properly mitigated under the stated review policy. Concrete fix: none. TODO.md:L92

Security

  • No findings.

Documentation/Tests

Review basis: static diff review only; I could not run pytest here because it is not installed in this environment.

`HeterogeneousAdoptionDiDEventStudyResults.to_dict()` promised JSON-
serializable output but previously returned raw numpy scalars via
`list(ndarray)`, which `json.dumps` can't serialize. The `F` field and
`filter_info.F_last` could also hold numpy scalars or pandas
Timestamps that break serialization.

Fix:
- Per-horizon arrays use `.tolist()` (unwraps numpy scalars to native
  Python).
- New `_json_safe_scalar` helper coerces numpy scalars via `.item()`
  and pandas Timestamp/Timedelta via `.isoformat()`; everything else
  passes through.
- New `_json_safe_filter_info` helper applies `_json_safe_scalar` to
  `F_last` and each element of `dropped_cohorts`, and casts counts to
  native `int`.
- `to_dict()` now applies these helpers consistently.

**Test added:** `test_to_dict_json_serializable` asserts
`json.dumps(result.to_dict())` succeeds and the round-trip values
parse back as native Python types (int, float, list).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 736fdeaff8556af39f86bcee42d52ec4f85a8afb


Overall Assessment

✅ Looks good

No unmitigated P0 or P1 findings in the changed diff.

Executive Summary

Methodology

Code Quality

Performance

  • No findings from static review.

Maintainability

Tech Debt

  • Severity: P3-informational. Impact: Deferred joint cross-horizon covariance, last-cohort-only staggered support, and time-varying-dose estimation are explicitly tracked in TODO.md, so they are mitigated under the project’s review policy. Concrete fix: none for this PR. TODO.md:L92-L101

Security

  • No findings in the changed files.

Documentation/Tests

  • No findings from static review. The added tests cover the main correctness risks for this feature: per-horizon parity against Phase 2a, time-varying post-treatment dose rejection, chronological cohort ordering for ordered categoricals, first_treat_col/dose cross-validation, unused categorical levels, staggered panels without first_treat_col, and NaN-coupled inference. tests/test_had.py:L2400-L2475 tests/test_had.py:L2613-L2995 tests/test_had.py:L3047-L3070
  • I could not execute pytest here because pytest is not installed in this environment.

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 22, 2026
@igerber igerber merged commit 7f51fdf into main Apr 22, 2026
23 of 24 checks passed
@igerber igerber deleted the had-phase-2b-event-study branch April 22, 2026 15:35