
HAD Phase 3 follow-up: joint Stute pretest + event-study workflow #353

Merged
igerber merged 8 commits into main from had-joint-stute-pretest
Apr 24, 2026

@igerber
Owner

@igerber igerber commented Apr 23, 2026

Summary

  • Adds joint Stute pretest (stute_joint_pretest + joint_pretrends_test + joint_homogeneity_test) to close the paper step-2 gap that Phase 3 did_had_pretest_workflow flagged with an "Assumption 7 pre-trends test NOT run" caveat.
  • Extends did_had_pretest_workflow with aggregate="event_study" multi-period dispatch (QUG + joint pre-trends + joint homogeneity-linearity); aggregate="overall" preserves Phase 3 behavior bit-exactly.
  • Sum-of-CvMs aggregation with shared-η Mammen wild bootstrap across horizons per unit (Delgado-Manteiga 2001); per-horizon scale-invariant exact-linear short-circuit; reciprocal front-door guards on both data-in wrappers.
  • Eq (18) linear-trend detrending (paper Section 5.2 Pierce-Schott p=0.51) DEFERRED to Phase 4 replication harness where the published value serves as parity anchor.

Paper: de Chaisemartin, Ciccia, D'Haultfœuille, Knau (2026, arXiv:2405.04465v6). Sections 4.2-4.3 + 5.2.

Test plan

  • CI: pytest tests/test_had_pretests.py (115 tests total — 69 existing Phase 3 + 46 new covering core, wrappers, workflow, and serialization).
  • CI: black, ruff, mypy on diff_diff/had_pretests.py, diff_diff/__init__.py, tests/test_had_pretests.py.
  • Verify aggregate="overall" path is bit-exact with Phase 3 (TestMultiPeriodWorkflow::test_overall_aggregate_unchanged).
  • Verify event-study verdict does not emit the "paper step 2 deferred" string (TestMultiPeriodWorkflow::test_no_paper_step_2_deferred_string_on_event_study).
  • Verify shared-η bootstrap structure (TestStuteJointPretest::test_shared_eta_across_horizons_white_box).
  • Verify reciprocal validator twin parity (D=0 in pre vs D>0 in post; base-period ordering).

Notes

  • HADPretestReport.stute and .yatchew are now Optional because the event-study path emits None on those fields. aggregate="overall" always populates them, so Phase 3 consumers are unaffected. A handful of existing tests add explicit is not None narrowing assertions.
  • Event-study step 3 uses joint Stute only; no joint Yatchew variant exists because the paper does not derive one. Users needing Yatchew-style adjacent-difference robustness under multi-period data can call yatchew_hr_test on each (base, post) pair manually. REGISTRY.md documents this asymmetry.
  • Phase 3 follow-up TODO rows 98 (joint Eq 18) and 102 (multi-period workflow dispatch) are deleted; a new row tracks Eq 18 linear-trend detrending deferred to Phase 4.

🤖 Generated with Claude Code

…patch

Adds `stute_joint_pretest` (residuals-in core) plus `joint_pretrends_test`
and `joint_homogeneity_test` data-in wrappers for the paper's step-2
(mean-independence pre-trends) and step-3 (linearity joint extension)
nulls. Extends `did_had_pretest_workflow` with `aggregate="event_study"`
multi-period dispatch that closes the "paper step 2 deferred" gap
previously flagged on two-period reports.

Implementation highlights:
- Sum-of-CvMs aggregation (Delgado 1993; Escanciano 2006) with shared
  Mammen wild bootstrap multiplier across horizons per unit to preserve
  vector-valued empirical-process unit-level dependence (Delgado-Manteiga
  2001; Hlavka-Huskova 2020).
- Per-horizon scale- and translation-invariant exact-linear short-circuit
  (a single degenerate horizon does not collapse the joint test).
- Reciprocal front-door guards on both wrappers: non-empty horizon list,
  base_period ordering, D=0 invariant (pre-trends) and D>0 existence
  (post-homogeneity).
- Backward-compatible HADPretestReport extension: new fields
  pretrends_joint, homogeneity_joint, aggregate with defaults; stute
  and yatchew become Optional. summary, to_dict, to_dataframe, and
  __repr__ branch on aggregate and preserve Phase 3 schemas bit-exactly
  on the aggregate="overall" path.
- Eq (18) linear-trend detrending (paper Section 5.2 Pierce-Schott
  p=0.51) deferred to Phase 4 replication harness where the published
  value serves as parity anchor; TODO row migrated accordingly.
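
The aggregation and shared-multiplier idea in the first highlight can be sketched as follows. This is a minimal illustration, not the code in `diff_diff/had_pretests.py`: the names `cvm_stat`, `mammen_weights`, and `joint_stute_sketch` are hypothetical, the p-value convention `(1 + exceed) / (1 + n_boot)` is an assumption, and the sketch omits the tie collapsing and the exact-linear short-circuit described above.

```python
import numpy as np


def cvm_stat(resid, dose):
    """Cramer-von Mises statistic of the residual cusum process ordered by dose."""
    order = np.argsort(dose, kind="stable")
    cusum = np.cumsum(resid[order])
    return float(np.sum(cusum**2) / resid.size**2)


def mammen_weights(rng, n):
    """Mammen two-point multipliers (mean 0, variance 1)."""
    s5 = np.sqrt(5.0)
    p_low = (s5 + 1.0) / (2.0 * s5)
    return np.where(rng.random(n) < p_low, (1.0 - s5) / 2.0, (1.0 + s5) / 2.0)


def joint_stute_sketch(outcomes, dose, design, n_boot=199, seed=0):
    """outcomes: dict horizon -> length-G array; design: (G, p) null-model regressors."""
    rng = np.random.default_rng(seed)

    def fit(y):
        # Least-squares fit of the null model; returns fitted values and residuals.
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        fitted = design @ beta
        return fitted, y - fitted

    fits = {k: fit(y) for k, y in outcomes.items()}
    per_horizon = {k: cvm_stat(res, dose) for k, (_, res) in fits.items()}
    s_joint = sum(per_horizon.values())  # sum-of-CvMs aggregation

    exceed = 0
    for _ in range(n_boot):
        # ONE multiplier per unit, reused for every horizon, so the bootstrap
        # keeps the within-unit dependence across horizons.
        eta = mammen_weights(rng, dose.size)
        s_boot = 0.0
        for fitted, res in fits.values():
            _, res_star = fit(fitted + res * eta)  # refit on the perturbed outcome
            s_boot += cvm_stat(res_star, dose)
        exceed += s_boot >= s_joint
    p_value = (1 + exceed) / (1 + n_boot)
    return s_joint, per_horizon, p_value


# Synthetic two-horizon example with a linear null model [1, d].
rng = np.random.default_rng(1)
dose = rng.uniform(0.0, 1.0, 50)
design = np.column_stack([np.ones(50), dose])
outcomes = {
    "h1": 2.0 * dose + rng.normal(size=50),
    "h2": -dose + rng.normal(size=50),
}
s_joint, per_horizon, p_value = joint_stute_sketch(outcomes, dose, design)
```

Drawing `eta` inside the horizon loop instead would treat horizons as independent, which is the dependence structure the shared draw is meant to avoid discarding.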

46 new tests (115 total in tests/test_had_pretests.py) covering:
K=1 parity with stute_test, shared-eta white-box, per-horizon short-
circuit independence, full reciprocal-validator matrix, event-study
verdict priority, serialization round-trip across aggregates. Includes
regression tests asserting the "paper step 2 deferred" string is absent
from any event-study verdict.

Closes TODO.md Phase 3 rows for joint Eq 18 and multi-period workflow
dispatch. See REGISTRY.md HeterogeneousAdoptionDiD "Joint Stute tests"
for algorithm, invariants, and the no-joint-Yatchew acknowledgment
(the paper does not derive one; multi-period Yatchew remains available
per-horizon via yatchew_hr_test).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Overall Assessment

Needs changes. The highest unmitigated issues are P1s in the new joint Stute APIs.

Executive Summary

  • Documented methodology choices are fine: sum-of-CvMs aggregation, shared-η bootstrap, no joint Yatchew variant, and the Eq. 18 detrending deferral are all recorded in docs/methodology/REGISTRY.md:L2338-L2351 and TODO.md:L98-L98.
  • P1: the new public data-in wrappers joint_pretrends_test and joint_homogeneity_test accept first_treat_col but never run the event-study validator, so direct calls bypass the Appendix B.2 last-cohort filter and constant-post-dose guard.
  • P1: the joint Stute core dropped the constant-dose degeneracy guard that stute_test already has; constant-d inputs can yield a bogus fail-to-reject in joint_pretrends_test or a singular-matrix crash in joint_homogeneity_test.
  • P2: the PR says the overall-path report outputs remain bit-exact, but summary(), to_dict(), and __repr__ now change on the aggregate="overall" path.
  • P3: docs/tests still overstate full step-2 closure for any >=3-period panel and do not cover the two edge cases above.

Methodology

  • P1: joint_pretrends_test and joint_homogeneity_test never use first_treat_col or call _validate_multi_period_panel / _validate_had_panel_event_study (diff_diff/had_pretests.py:L1571-L1604, diff_diff/had_pretests.py:L2030-L2164, diff_diff/had_pretests.py:L2167-L2298). Impact: the new public wrappers can silently run on staggered panels, time-varying post-dose panels, or otherwise unvalidated data even though Appendix B.2 requires last-cohort filtering and constant post-treatment dose (diff_diff/had.py:L889-L1324). Concrete fix: validate/filter inside each wrapper, operate on data_filtered, and verify that base_period / pre_periods / post_periods are subsets of the validated pre/post sets.
  • P1: stute_joint_pretest has no zero-variation-in-D guard, unlike stute_test (diff_diff/had_pretests.py:L1229-L1248 vs diff_diff/had_pretests.py:L1885-L1995). Impact: with constant d, joint_pretrends_test builds mean-zero intercept-only residuals (diff_diff/had_pretests.py:L2144-L2155) and the tie-collapsed CvM becomes mechanically zero (diff_diff/had_pretests.py:L837-L880), so the test can report a misleading fail-to-reject; joint_homogeneity_test can instead hit a singular [1, d] solve (diff_diff/had_pretests.py:L1991-L1995, diff_diff/had_pretests.py:L2287-L2298). Concrete fix: mirror the single-horizon behavior in the joint core: when np.var(doses_arr) <= 0, warn and return all-NaN/inconclusive output instead of computing a statistic or attempting the refit.

Code Quality

  • P2: the PR claims the aggregate="overall" path remains bit-exact, but HADPretestReport.__repr__, summary(), and to_dict() now always inject aggregate or aggregate headers on the legacy path (diff_diff/had_pretests.py:L610-L665), which contradicts CHANGELOG.md:L12-L12. Impact: downstream consumers comparing exact dict keys or formatted strings will see a breaking change on the supposedly unchanged path. Concrete fix: either keep the extra aggregate fields/header event-study-only or update the compatibility claim and versioning/docs to acknowledge the output change.

Performance

No findings.

Maintainability

No additional findings beyond the dead-parameter/validator-bypass issue above.

Tech Debt

No blocker. The Eq. 18 linear-trend detrending gap is explicitly tracked in TODO.md:L98-L98, so I did not count that deferral as a defect.

Security

No findings.

Documentation/Tests

  • P3: the registry/changelog still read as though aggregate="event_study" closes step 2 for any balanced >=3-period panel (docs/methodology/REGISTRY.md:L2328-L2336, CHANGELOG.md:L11-L12), but the implementation intentionally leaves pretrends_joint=None when there is only one pre-period (diff_diff/had_pretests.py:L2432-L2452). Impact: documentation overpromises workflow completeness, although the code correctly returns an inconclusive report. Concrete fix: document that the step-2 gap is only closed when there are at least two pre-periods.
  • P3: the new tests cover shared-η bootstrap and serialization, but they do not exercise direct-wrapper validation/first_treat_col propagation or constant-d joint inputs (tests/test_had_pretests.py:L1555-L1906, tests/test_had_pretests.py:L1906-L2330). Impact: both P1 issues above shipped uncovered. Concrete fix: add regression tests for direct wrapper calls on staggered panels and on all-treated same-dose panels.

Path to Approval

  1. Make joint_pretrends_test and joint_homogeneity_test run the same event-study validation/filtering contract that did_had_pretest_workflow(..., aggregate="event_study") uses, with first_treat_col actually wired through.
  2. Add a constant-d guard to the joint Stute core that returns NaN/inconclusive instead of a spurious p=1 or a singular-matrix failure.
  3. Add regression tests covering those two cases.
  4. Either restore truly bit-exact overall-path report outputs or correct the compatibility documentation.

Assumptions

  • I am treating joint_pretrends_test and joint_homogeneity_test as public API because they are exported from diff_diff/__init__.py:L468-L472 and documented in the registry. If they are meant to be internal-only, the safer fix is to de-export them and document the prevalidated-data requirement instead.

P1 - wrapper validator + first_treat_col wiring (had_pretests.py):
`joint_pretrends_test` and `joint_homogeneity_test` now route through
`_validate_had_panel_event_study` when the panel has >= 3 periods,
so direct wrapper calls inherit the Appendix B.2 last-cohort filter,
constant-post-dose invariant, and staggered/no-first_treat_col raise
contract. `first_treat_col` is actually wired through instead of
accepted-but-ignored. Subset checks (base_period in validated
t_pre_list; pre_periods / post_periods subsets of the corresponding
validated set) run after the validator, so callers get crisp errors
on mistyped horizons rather than silent miscomputation.

P1 - constant-d degeneracy guard in `stute_joint_pretest`:
When `ptp(doses) <= 0` (all units share identical dose), warn and
return all-NaN inference instead of computing a mechanically-zero
CvM (mean-independence null - bogus fail-to-reject) or attempting
a singular `[1, d]` refit (linearity null - matrix solve crash).
Uses `np.ptp` rather than `np.var` because var-of-constant yields
~1e-32 rounding noise that would slip past a `<= 0` comparison.
Mirrors stute_test's intent at single-horizon scale.
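
A minimal sketch of such a guard, assuming a hypothetical helper named `constant_dose_guard` (the warning text is also illustrative, not the library's message):

```python
import warnings

import numpy as np


def constant_dose_guard(doses):
    """Return True (and warn) when every unit shares the same dose."""
    doses_arr = np.asarray(doses, dtype=float)
    # ptp is max - min, which is exactly 0.0 for identical values; a
    # variance-based check can pick up rounding noise from the mean.
    if np.ptp(doses_arr) <= 0:
        warnings.warn(
            "dose has no variation; joint Stute test is inconclusive",
            UserWarning,
        )
        return True
    return False
```

A caller would return its all-NaN result whenever the helper reports `True`, before any statistic or refit is attempted.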

P2 - bit-exact overall-path serialization:
`HADPretestReport.__repr__`, `summary()`, and `to_dict()` now
produce Phase 3-identical output when `aggregate="overall"` - no
`aggregate` key in the dict, no header line in the summary, no new
kwarg in the repr. The `aggregate` field remains on the dataclass
internally and is surfaced in these serializations only on
`aggregate="event_study"`. Restores the CHANGELOG's bit-exact
compatibility claim.
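
The branching can be pictured with a stripped-down stand-in. `ReportSketch` and its two payload fields are hypothetical; the real `HADPretestReport` carries more fields than shown here.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class ReportSketch:
    all_pass: bool
    verdict: str
    aggregate: str = "overall"  # kept on the dataclass in both modes

    def to_dict(self) -> Dict[str, Any]:
        out: Dict[str, Any] = {"all_pass": self.all_pass, "verdict": self.verdict}
        # Surface the new key only off the legacy path so that the
        # aggregate="overall" schema keeps exactly its old key set.
        if self.aggregate != "overall":
            out["aggregate"] = self.aggregate
        return out
```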

P3 - regression tests + docs:
Four new tests cover the P1 edge cases: constant-d core path,
direct-wrapper staggered panel (with and without first_treat_col),
and wrapper-level constant-d propagation. REGISTRY.md and
CHANGELOG.md document that step-2 closure requires >=2 pre-periods
(the base `F-1` plus at least one earlier placebo); on single-pre-
period panels the workflow emits `pretrends_joint=None` with a skip
note in the verdict and `all_pass=False`.

Existing tests updated for the new validator path: the pre-period
D=0 and all-zero post-period checks now fire via the event-study
validator's staggered-cohort or contiguous-dose guards before the
wrapper's local reciprocal guards can run; regex matchers widened to
accept either error surface. `test_to_dict_overall_preserves_phase3_schema`
now asserts the ABSENCE of the `aggregate` key on the overall path
to match the restored bit-exact schema.

119 tests pass (115 + 4 new R1 regressions); black/ruff/mypy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 8da8e43699e3a1260ad2e9af528713d3c2f47c07


Overall Assessment

⚠️ Needs changes

Executive Summary

  • Previous review blockers appear addressed: the direct wrappers now route through the event-study validator, constant-dose degeneracy now returns NaN + warning, overall-path serialization is restored, and the docs/TODO entries were updated.
  • One unmitigated P1 remains in the new joint Stute/event-study path: ordered-categorical time labels are validated chronologically in had.py, but the new pretest code reintroduces raw Python </> comparisons on labels.
  • That can silently drop valid pre-period horizons, emit a false "joint pre-trends skipped" verdict, or raise on valid post periods, so the joint tests may run on the wrong horizons.
  • The methodology choices called out in the PR body are documented and therefore not defects: sum-of-CvMs aggregation, shared-η bootstrap, no joint Yatchew variant, and Eq. 18 detrending deferred to Phase 4.

Methodology

  • Severity: P1. Impact: The new joint-pretest workflow breaks the existing ordered-categorical chronology contract. joint_pretrends_test and joint_homogeneity_test check period order with raw t < base_period / t > base_period, and did_had_pretest_workflow(..., aggregate="event_study") builds earlier_pre the same way. On supported ordered-categorical panels where lexical and chronological order differ (for example "q2" vs "q10" or month-name labels), this can silently exclude valid placebo horizons, produce a spurious "joint pre-trends skipped" path, or reject valid post-period inputs. That changes which horizons enter the joint CvM test, so the methodology is not being applied to the intended sample. Concrete fix: Compare periods by validated chronological rank, not raw label comparison. In the workflow, earlier_pre should be t_pre_list[:-1]. In both wrappers, either derive the same _sort_key/rank map used by the validator or rely on membership against validator-produced t_pre_list / t_post_list instead of raw </> checks. References: diff_diff/had_pretests.py:L2160-L2178, diff_diff/had_pretests.py:L2344-L2358, diff_diff/had_pretests.py:L2594-L2600, diff_diff/had.py:L960-L1007, diff_diff/had.py:L1214-L1229, tests/test_had.py:L2751-L2965.

Code Quality

  • No additional findings.

Performance

  • No findings.

Maintainability

  • No additional findings.

Tech Debt

  • Severity: P3. Impact: The Eq. 18 linear-trend detrending gap is properly tracked, so I did not count that deferral as a blocker. Concrete fix: None required in this PR unless you want to pull Phase 4 work forward. References: TODO.md:L98-L98, docs/methodology/REGISTRY.md:L2349-L2349.

Security

  • No findings.

Documentation/Tests

  • Severity: P3. Impact: The new had_pretests suite exercises numeric-year panels only, so it missed the ordered-categorical chronology contract that the existing HAD event-study suite already protects. That is how the P1 above slipped through. Concrete fix: Add regression tests for joint_pretrends_test, joint_homogeneity_test, and did_had_pretest_workflow(..., aggregate="event_study") using ordered categorical labels whose lexical and chronological order disagree. References: tests/test_had_pretests.py:L1579-L2474, tests/test_had.py:L2751-L2965.

Verification note: I did not execute the test suite in this shell because the available Python environment is missing project dependencies such as pandas.

Path to Approval

  1. Replace all raw period-label </> comparisons in the new joint-pretest code with chronology-aware rank comparisons from the validated time order. Minimum touchpoints: joint_pretrends_test, joint_homogeneity_test, and did_had_pretest_workflow(..., aggregate="event_study").
  2. Add ordered-categorical regression tests. A concrete fixture would use categories like ["q1", "q2", "q10", "post"] and assert that pretrends_joint includes both earlier pre-periods, plus a second fixture proving the homogeneity/workflow path accepts valid ordered-categorical post periods without false ordering errors.

P1 - ordered-categorical chronology: raw `t < base_period` /
`t > base_period` comparisons in `joint_pretrends_test`,
`joint_homogeneity_test`, and `did_had_pretest_workflow(aggregate=
"event_study")` silently misorder ordered-categorical time columns
whose lexical and chronological order disagree (e.g. categories
["q1", "q2", "q10"] sort lexically as "q1" < "q10" < "q2"). On
such panels the raw comparison could (a) silently drop valid
pre-period horizons via the raw `<` check, (b) emit a spurious
"joint pre-trends skipped" verdict from the workflow's `earlier_pre`
filter, or (c) raise on valid post-period inputs.

Fix: new private helper `_build_period_rank` returns a
{period_label: chronological_rank} map using the ordered-
categorical category order when applicable, natural sort on
numeric / datetime otherwise. Both wrappers compare period labels
via rank (`rank[t1] < rank[t2]`) instead of raw Python `<`/`>`.
The workflow's `earlier_pre` replaces the raw-< filter with
`list(t_pre_list[:-1])` - `t_pre_list` is already chronologically
sorted by the validator (via its `_sort_key`), so excluding the
last element yields the earlier pre-periods regardless of dtype.
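
A dependency-free sketch of the rank-map idea, under the assumption that the category order is passed in as a plain list (with pandas it would come from the ordered column's `.cat.categories`). `build_period_rank` is a hypothetical stand-in, not the private helper itself.

```python
def build_period_rank(periods, category_order=None):
    """Map each period label to its chronological rank.

    category_order: explicit chronological order (ordered-categorical case);
    when omitted, fall back to the natural sort of the labels.
    """
    present = set(periods)
    ordered = list(category_order) if category_order is not None else sorted(present)
    kept = [label for label in ordered if label in present]
    return {label: rank for rank, label in enumerate(kept)}


categories = ["q1", "q2", "q10", "post"]
rank = build_period_rank(categories, category_order=categories)

lexical_says_earlier = "q10" < "q2"           # raw string comparison
rank_says_earlier = rank["q10"] < rank["q2"]  # chronology-aware comparison

# With a chronologically sorted pre-period list, the earlier placebos are
# everything except the last element (the base period).
t_pre_list = ["q1", "q2", "q10"]
earlier_pre = list(t_pre_list[:-1])
```

The two comparisons disagree on this fixture, which is the case the raw `<` checks handled incorrectly.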

P3 - ordered-categorical regression tests: new
`TestOrderedCategoricalChronology` class (4 tests) with a fixture
using categories `["q1", "q2", "q10", "post"]`. Covers (a) direct
pretrends wrapper picks up both earlier placebos, (b) pretrends
wrapper rejects lexically-ordered-but-chrono-invalid input (e.g.
pre=["q10"], base="q2"), (c) homogeneity wrapper accepts valid
post-period input, (d) workflow event-study dispatch surfaces both
earlier placebos in `pretrends_joint.horizon_labels` without the
false skip note.

123 tests pass (119 + 4 new); black/ruff/mypy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 84835defc69e7cdb36598f7c9b51425ce0ea2e5e


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The prior ordered-categorical chronology blocker appears resolved: the new code now compares periods by validated rank instead of raw label order, and the added regression coverage matches that fix.
  • Severity: P1 [Newly identified]. The new direct joint_homogeneity_test() API can still return finite results on invalid negative-dose data when called on a 2-period panel, because its n_periods < 3 fallback skips the event-study validator and then collapses unit dose with max().
  • The methodology choices called out in the registry are documented and not defects: sum-of-CvMs aggregation, shared-η bootstrap, no joint Yatchew variant, and Eq. 18 detrending deferred to Phase 4.
  • I did not find additional P1+ issues in the changed event-study workflow, chronology handling, or report-serialization paths.

Methodology

  • Severity: P1 [Newly identified]. Impact: joint_homogeneity_test() only routes through _validate_had_panel_event_study() when data[time_col].nunique() >= 3, then otherwise falls through to _aggregate_for_joint_test(), which defines unit dose as groupby(unit_col)[dose_col].max(). On a 2-period direct call, an invalid negative post dose is therefore silently converted into 0 before stute_joint_pretest() sees it (max(0, -d) = 0), so the wrapper can produce a finite joint-Stute result on data that violates the HAD support restriction D_{g,t} >= 0. That is a missing assumption check on a new public method, and it can change treated/control composition with no warning. Concrete fix: add a row-level non-negative-dose guard before the max() collapse, ideally in _aggregate_for_joint_test() so both data-in wrappers inherit it consistently; keep the existing row-level guard in _validate_had_panel_event_study() for the multi-period path. References: diff_diff/had_pretests.py:L1667-L1775, diff_diff/had_pretests.py:L2409-L2496, diff_diff/had_pretests.py:L1220-L1227, diff_diff/had.py:L1244-L1256.

Code Quality

  • Severity: none. Impact: no additional code-quality issues identified in the changed lines. Concrete fix: none.

Performance

  • Severity: none. Impact: no performance regressions stood out beyond the documented joint-bootstrap cost. Concrete fix: none.

Maintainability

  • Severity: none. Impact: no additional maintainability issues identified that should block this PR. Concrete fix: none.

Tech Debt

  • Severity: P3. Impact: Eq. 18 linear-trend detrending remains deferred, but it is explicitly tracked in both the methodology registry and TODO.md, so it is not a blocker under the project’s review rules. Concrete fix: none required in this PR unless Phase 4 work is being pulled forward. References: docs/methodology/REGISTRY.md:L2336-L2349, TODO.md:L98-L98.

Security

  • Severity: none. Impact: no security issues identified in the changed files. Concrete fix: none.

Documentation/Tests

  • Severity: P3. Impact: the new joint_homogeneity_test coverage exercises the validated multi-period path, but not the n_periods < 3 fallback where the missing negative-dose guard lives, which is how the P1 above slipped through. Concrete fix: add a direct-wrapper regression using a 2-period panel with mixed negative and positive post doses and assert that the call raises ValueError instead of returning a finite result. References: tests/test_had_pretests.py:L1862-L2036.

Path to Approval

  1. Add a front-door row-level dose_col >= 0 validation on the direct wrapper path before groupby(...).max() collapses unit dose, so joint_homogeneity_test() cannot silently recode negative post doses to zero.
  2. Add a regression test covering direct joint_homogeneity_test() on a 2-period panel with at least one negative post dose; if 2-period direct wrapper calls are intentionally unsupported, reject them explicitly and document that restriction.

Verification note: I did not execute the test suite in this shell because the available Python environment is missing project dependencies such as pandas.

P1 - row-level non-negative-dose guard in `_aggregate_for_joint_test`:
On a 2-period direct call to `joint_pretrends_test` or
`joint_homogeneity_test`, the n_periods < 3 path skips
`_validate_had_panel_event_study` (which requires >= 3 periods) and
falls through to `_aggregate_for_joint_test`. That helper collapsed
unit dose via `groupby(unit_col)[dose_col].max()`, which silently
recodes a negative post dose -d to 0 (`max(0, -d) = 0` for d > 0,
because the pre-period dose is 0), allowing finite joint-Stute output on data that
violates the HAD support restriction `D_{g,t} >= 0` (paper Section 2).

Fix: add a row-level `dose_col >= 0` check in
`_aggregate_for_joint_test` BEFORE the groupby/max collapse.
Centralizes the guard so both data-in wrappers inherit it on the
n_periods < 3 fallback path. The multi-period path already enforces
the same invariant via `_validate_had_panel_event_study`, so the
contract is consistent across both wrapper dispatch modes.
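
The ordering of the check matters more than its form. A dependency-free sketch, where `aggregate_unit_dose` is a hypothetical stand-in for the groupby/max collapse and the error wording is illustrative:

```python
def aggregate_unit_dose(rows):
    """rows: iterable of (unit, period, dose). Returns {unit: max dose}."""
    rows = list(rows)
    # Row-level guard FIRST: once doses are collapsed with max(), a negative
    # post dose next to a zero pre dose is no longer visible.
    bad = [(u, t, d) for u, t, d in rows if d < 0]
    if bad:
        raise ValueError(
            f"negative dose value in {len(bad)} row(s), e.g. {bad[0]!r}"
        )
    unit_dose = {}
    for unit, _, dose in rows:
        unit_dose[unit] = max(unit_dose.get(unit, dose), dose)
    return unit_dose


valid = aggregate_unit_dose([(1, "pre", 0.0), (1, "post", 0.7), (2, "pre", 0.0), (2, "post", 0.0)])

try:
    aggregate_unit_dose([(1, "pre", 0.0), (1, "post", -1.5)])
    raised = False
except ValueError:
    raised = True
```

Without the guard, the second call would report a dose of 0 for unit 1 and the unit would be treated as untreated.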

P3 - regression test: new
`TestJointHomogeneityTest::test_two_period_negative_post_dose_raises`
constructs a 2-period panel with a single unit carrying a negative
post dose and asserts the wrapper raises `ValueError` with the
"negative dose value" substring rather than producing a finite
statistic via the groupby-max collapse.

124 tests pass (123 + 1 new R3 regression); black/ruff/mypy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 0040bad7b4679a369ec5b404389bccdec94cf25c


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The prior direct-wrapper negative-dose blocker appears resolved: _aggregate_for_joint_test() now rejects row-level negative doses before the groupby(...).max() collapse, and a regression covers the 2-period path at diff_diff/had_pretests.py:L1740-L1758 and tests/test_had_pretests.py:L2038-L2075.
  • Severity: P1 [Newly identified]. The new public stute_joint_pretest() API silently collapses distinct horizon keys whose str() representations collide, which can overwrite one horizon and double-count another in S_joint / p_value.
  • The methodology choices called out in the registry are documented and therefore not defects: sum-of-CvMs aggregation, shared-η wild bootstrap, no joint Yatchew variant, and Eq. 18 detrending deferred to Phase 4.
  • I did not execute the test suite in this shell; the available Python environment is missing pandas, so this is a source-only review.

Methodology

  • Severity: P1 [Newly identified]. Impact: stute_joint_pretest() accepts non-string horizon keys, but it re-keys them with str(k) before computing the joint statistic. If two distinct keys collide after stringification, one horizon is overwritten in residuals_arrays / fitted_arrays, then the surviving horizon is counted twice when per_horizon_stats and S_joint are formed. That produces wrong statistical output with no warning and an inconsistent result surface (n_horizons can exceed the number of distinct diagnostics). Concrete fix: reject non-unique str() label mappings up front, or keep raw keys throughout the computation and stringify only when materializing output. References: diff_diff/had_pretests.py:L1979-L2001, diff_diff/had_pretests.py:L2058-L2061, tests/test_had_pretests.py:L1538-L1551.
  • Severity: P3-informational. Impact: The remaining methodology deviations are documented rather than silent: event-study step 3 uses joint Stute only, and Eq. 18 linear-trend detrending is still deferred to Phase 4 / TODO.md. Under the project rules these are not blockers. Concrete fix: none required in this PR. References: docs/methodology/REGISTRY.md:L2347-L2349, TODO.md:L98-L98.

Code Quality

  • Severity: P3. Impact: the staggered-panel auto-filter warning is emitted twice on the new paths, because _validate_had_panel_event_study() already warns and the wrappers / workflow warn again off filter_info. That adds noise without new information. Concrete fix: emit the warning in exactly one layer, ideally the validator, and let wrappers/workflow consume filter_info silently. References: diff_diff/had.py:L1133-L1148, diff_diff/had_pretests.py:L2255-L2264, diff_diff/had_pretests.py:L2442-L2451, diff_diff/had_pretests.py:L2626-L2635.

Performance

No findings.

Maintainability

No findings beyond the duplicate-warning issue above.

Tech Debt

No untracked tech-debt blockers. The Eq. 18 detrending deferment is already tracked in TODO.md:L98-L98.

Security

No findings.

Documentation/Tests

  • Severity: P3. Impact: the new test coverage proves that non-string horizon keys are part of the supported surface, but it does not cover the string-collision edge case that triggers the P1 above. Concrete fix: add a regression using colliding labels such as {1: ..., "1": ...} and assert an explicit failure. References: tests/test_had_pretests.py:L1538-L1551.

Path to Approval

  1. Add a uniqueness guard for stringified horizon labels in stute_joint_pretest() so distinct horizons cannot be collapsed before S_joint is computed.
  2. Add a regression test covering colliding horizon identifiers (for example {1: ..., "1": ...}) and assert that the call fails explicitly rather than silently overwriting one horizon and double-counting the other.

P1 - stringified-label collision guard in stute_joint_pretest:
The core indexed residuals_arrays / fitted_arrays by `str(k)` with
no uniqueness check on the stringified keys. Two distinct raw keys
whose str() forms collide (e.g. {1: ..., "1": ...} both stringify
to "1", or custom objects with identical __str__) would silently
overwrite one entry and then be double-counted in S_joint =
sum(S_k) because the surviving horizon's statistic gets summed
twice while n_horizons still reports K=2. That produces wrong
methodology output with no diagnostic.

Fix: compute the stringified labels once up front and reject any
collision explicitly with a ValueError listing which raw keys
collide to which stringified form. Centralizes the check before
any residual/fitted array is dropped. Replaces the ad-hoc post-hoc
re-keying with a reuse of the pre-computed collision-free list.
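
A minimal sketch of an up-front collision check. The function name `stringify_horizon_keys` is hypothetical and the message text is only modeled on the "collision after str" substring mentioned in this commit.

```python
def stringify_horizon_keys(keys):
    """Return str(k) for each key, rejecting keys whose str() forms collide."""
    by_label = {}
    for key in keys:
        by_label.setdefault(str(key), []).append(key)
    collisions = {label: raws for label, raws in by_label.items() if len(raws) > 1}
    if collisions:
        raise ValueError(f"horizon key collision after str(): {collisions!r}")
    return [str(key) for key in keys]


labels = stringify_horizon_keys([2001, 2002, "post"])

try:
    # 1 and "1" are distinct dict keys but share the same str() form.
    stringify_horizon_keys({1: [0.1], "1": [0.2]})
    collided = False
except ValueError:
    collided = True
```

Computing the labels once and reusing the returned list avoids a second, unchecked `str(k)` pass later in the computation.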

P3 - dedupe staggered-filter UserWarning:
`_validate_had_panel_event_study` already warns on the staggered
auto-filter path; both joint-pretest wrappers and the event-study
workflow were re-emitting the same information with a
wrapper-prefixed message. Each staggered call therefore surfaced
two warnings to the user. Removes the secondary emissions;
wrappers now consume `_filter_info` silently. Existing tests still
pass because the validator's own `"Staggered-timing panel
detected"` message satisfies the regex matchers.

P3 - collision regression test: new
`TestStuteJointPretest::test_stringified_key_collision_raises`
exercises (a) the int 1 + str "1" case and (b) a pair of custom
objects with identical __str__ but distinct hash; both must raise
`ValueError` with "collision after str" in the message.

125 tests pass (124 + 1 new R4 collision regression); black/ruff/
mypy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: db170bd62366c83ddaab794294c69363679c9a17


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The prior re-review P1 around horizon-label string collisions appears resolved: stute_joint_pretest() now rejects non-unique str() mappings before re-keying, with regression coverage.
  • I did not find an unmitigated methodology mismatch in the new joint/event-study implementation. The remaining deviations called out in the PR are documented in the Methodology Registry and/or TODO.md, so they are not defects.
  • Severity P1: the new joint Stute path raises on G < 10, so did_had_pretest_workflow(aggregate="event_study") can now error on otherwise valid small panels, or after last-cohort auto-filtering, instead of returning the NaN/inconclusive diagnostic surface used by the existing single-horizon workflow.
  • There is still documentation drift in-code: the top-level had_pretests.py module docstring says step 2 is deferred and the workflow is two-period-only, which no longer matches the shipped API.
  • This was a source-only review. I could not execute pytest here because the available shell is missing numpy and pytest.

Methodology

  • Severity: P3-informational. Impact: I did not find an unmitigated methodology defect in the new joint/event-study logic. The sum-of-CvMs aggregation, shared-η wild bootstrap, joint-Stute-only event-study step 3, and Eq. 18 detrending deferment are all explicitly documented in docs/methodology/REGISTRY.md:L2336-L2351; the Eq. 18 deferment is also tracked in TODO.md:L98-L98. The previously reported label-collision issue is now guarded at diff_diff/had_pretests.py:L1982-L2003, with regression coverage at tests/test_had_pretests.py:L1620-L1634.
  • Concrete fix: none required.

Code Quality

  • Severity: P1. Impact: stute_joint_pretest() hard-raises on G < 10 at diff_diff/had_pretests.py:L1966-L1967, while the existing single-horizon stute_test() explicitly treats the same condition as a warning + all-NaN result at diff_diff/had_pretests.py:L1230-L1246. Because the new event-study workflow calls the joint test directly for both step 2 and step 3 (diff_diff/had_pretests.py:L2673-L2703), a valid small multi-period panel, or a staggered panel whose last-cohort filter leaves fewer than 10 units, now crashes instead of producing an inconclusive report. That is a new edge-case regression on the public workflow path.
  • Concrete fix: mirror stute_test() here: emit UserWarning and return a StuteJointResult with cvm_stat_joint=np.nan, p_value=np.nan, reject=False, and full-NaN per_horizon_stats when G < _MIN_G_STUTE. Update the current small-G regression at tests/test_had_pretests.py:L1410-L1421 and add a workflow-level regression covering aggregate="event_study" after last-cohort filtering.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3-informational. Impact: Eq. 18 linear-trend detrending remains deferred, but it is explicitly documented and tracked (docs/methodology/REGISTRY.md:L2347-L2349, TODO.md:L98-L98), so it is not a blocker under the project’s deferred-work rules.
  • Concrete fix: none required in this PR.

Security

  • No findings.

Documentation/Tests

  • Severity: P3. Impact: the top-level module docstring still says Phase 3 only ships steps 1 and 3, that step 2 is deferred, and that did_had_pretest_workflow() is a two-period-only entry point (diff_diff/had_pretests.py:L1-L26). That now contradicts the new aggregate="event_study" API and the updated registry entry, so readers relying on in-code docs will get stale methodology guidance.
  • Concrete fix: update the module docstring to describe both aggregate="overall" and aggregate="event_study", and narrow the remaining deferred scope to Eq. 18 detrending.

Path to Approval

  1. Change stute_joint_pretest() to follow the existing small-sample contract used by stute_test(): warning + NaN result, not ValueError, when G < 10.
  2. Add/update regressions so did_had_pretest_workflow(aggregate="event_study") is exercised on a valid small panel, including a staggered panel whose last-cohort filter leaves <10 units, and assert an inconclusive report rather than an exception.

P1 - stute_joint_pretest G<_MIN_G_STUTE warn+NaN contract:
The joint core raised `ValueError` on G < 10, while single-horizon
`stute_test` emits a `UserWarning` and returns a NaN result on the
same condition. Because the event-study workflow dispatches into
the joint core for both step-2 pre-trends and step-3 homogeneity,
a staggered panel whose last-cohort auto-filter leaves fewer than
10 units would now crash the workflow instead of surfacing an
inconclusive report - a regression versus Phase 3's two-period
behavior.

Fix: mirror the single-horizon contract. Emit `UserWarning`
("below the minimum ... Returning NaN result") and return a
`StuteJointResult` with `cvm_stat_joint=nan`, `p_value=nan`,
`reject=False`, and a full-NaN `per_horizon_stats` dict keyed by
the validated horizon labels (so the diagnostic surface is
consistent with the NaN-propagation branch). `n_bootstrap <
_MIN_N_BOOTSTRAP` and non-numeric `alpha` still raise; only the
small-G branch relaxes.
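
A rough sketch of the warn-and-return-NaN pattern described above,
under stated assumptions: `JointResultSketch` is a hypothetical
stand-in for `StuteJointResult` carrying only the fields named in
this message, `MIN_G` stands in for `_MIN_G_STUTE` (value 10 taken
from the review text), and the statistic itself is left out.

```python
import math
import warnings
from dataclasses import dataclass

MIN_G = 10  # stand-in for _MIN_G_STUTE


@dataclass
class JointResultSketch:
    """Hypothetical result container (not the real StuteJointResult)."""

    cvm_stat_joint: float
    p_value: float
    reject: bool
    per_horizon_stats: dict


def joint_pretest_sketch(residuals_by_horizon):
    """Warn and return an all-NaN result when there are too few units."""
    labels = list(residuals_by_horizon)
    n_units = len(next(iter(residuals_by_horizon.values())))
    if n_units < MIN_G:
        warnings.warn(
            f"G={n_units} is below the minimum {MIN_G}. Returning NaN result.",
            UserWarning,
            stacklevel=2,
        )
        # Keep the per-horizon keys so downstream reports have a stable shape.
        return JointResultSketch(
            cvm_stat_joint=math.nan,
            p_value=math.nan,
            reject=False,
            per_horizon_stats={label: math.nan for label in labels},
        )
    raise NotImplementedError("test statistic is out of scope for this sketch")
```

Returning a fully keyed NaN result rather than raising lets a caller
such as a workflow function report "inconclusive" without special
exception handling.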

Test updates:
- `test_small_G_raises` renamed to `test_small_G_warns_returns_nan`
  and rewritten to assert the new contract.
- New
  `test_event_study_small_panel_after_filter_inconclusive_not_crash`
  covers the workflow-level regression: a staggered fixture with 40
  early-cohort + 6 late-cohort units filters to G=6 after the
  validator's last-cohort auto-filter;
  `did_had_pretest_workflow(aggregate="event_study")` now completes
  without exception, emits the "below the minimum" warning, and
  surfaces a NaN joint-Stute report with `all_pass=False`.

P3 - module docstring refresh:
`had_pretests.py` top-level docstring still said Phase 3 shipped
steps 1 + 3 only, that step 2 was deferred, and that
`did_had_pretest_workflow` was a two-period-only entry point. That
drifted after the joint-pretest follow-up landed. Rewrote the
docstring to describe: (a) the three single-horizon tests, (b) the
three new joint helpers (`stute_joint_pretest`,
`joint_pretrends_test`, `joint_homogeneity_test`), (c) both
workflow dispatch modes (`aggregate="overall"` two-period and
`aggregate="event_study"` multi-period), and (d) the narrowed
deferment - only Eq. 18 linear-trend detrending remains, tracked
in TODO for Phase 4 alongside the Pierce-Schott replication.

126 tests pass (125 + 1 new R5 workflow regression; no tests removed,
1 converted from raise to warn); black/ruff/mypy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: e3f7450022fc0ca781a1ecdd59e0fabd33641b27


Overall Assessment

✅ Looks good

Executive Summary

  • The prior P1 is resolved: stute_joint_pretest() now mirrors stute_test() on G < 10 by warning and returning NaN instead of raising, and the workflow regression is covered.
  • I did not find an unmitigated methodology mismatch in the new joint/event-study HAD workflow.
  • The main non-paper-literal choices in this PR are documented in the Methodology Registry, so they are non-blocking under the review rules.
  • Ordered-categorical time handling, constant-dose degeneracy, negative-dose rejection, staggered last-cohort filtering, and stringified horizon-label collisions are now explicitly guarded.
  • One minor in-code docs drift remains: stute_joint_pretest() still documents G < _MIN_G_STUTE as a ValueError path.
  • This was a source-only review; I could not run pytest because numpy, pandas, pytest, and scipy are not installed in this environment.

Methodology

  • Severity P3. Impact: No unmitigated methodology defect found. The non-paper-literal choices in this PR — sum-of-CvMs aggregation, shared-η bootstrap across horizons, joint-Stute-only event-study step 3, and Eq. 18 detrending deferral — are explicitly documented in docs/methodology/REGISTRY.md:L2336-L2351 and match the companion paper notes in docs/methodology/papers/dechaisemartin-2026-review.md:L189-L206. Concrete fix: none required.

Code Quality

  • No findings. The prior small-G crash is resolved by the warning + NaN path in diff_diff/had_pretests.py:L2070-L2091, with direct and workflow regressions at tests/test_had_pretests.py:L1410-L1432 and tests/test_had_pretests.py:L2467-L2519.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3. Impact: Eq. 18 linear-trend detrending remains deferred, but it is explicitly tracked in TODO.md:L98-L98 and documented as a deliberate Phase 4 deferral in docs/methodology/REGISTRY.md:L2347-L2349. Concrete fix: none required in this PR.

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: stute_joint_pretest() still says G < _MIN_G_STUTE raises ValueError in diff_diff/had_pretests.py:L1965-L1970, but the implementation now warns and returns an all-NaN StuteJointResult in diff_diff/had_pretests.py:L2070-L2091. Callers relying on the in-code contract could code against the wrong failure mode. Concrete fix: update the Raises/Notes text to describe the warning + NaN behavior.
  • Verification note: I could not execute pytest tests/test_had_pretests.py here because the environment is missing numpy, pandas, pytest, and scipy.

P3 - stute_joint_pretest docstring drift:
The Raises block still listed `G < _MIN_G_STUTE` as a ValueError
condition, but R5 converted that branch to a UserWarning + full-NaN
StuteJointResult return to match single-horizon stute_test and
keep the event-study workflow from crashing on staggered-filtered
small panels.

Fix: rewrote the Returns and Raises docstring blocks to describe
the actual contract. Returns now enumerates the three NaN-result
branches (small G, constant dose, any-NaN residuals / fitted) with
their warning behavior. Raises is narrowed to the genuinely-raising
conditions: empty input, key-mismatch, str-label collision, shape
mismatch, negative doses, too-few bootstrap replicates, invalid
alpha. Explicitly notes that small-G does NOT raise.

No code changes; docstring-only edit. 126 tests still pass;
black/ruff/mypy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: f381ed53d6392cd0c05525828ab5484e44fb43e2


Overall Assessment

✅ Looks good

Executive Summary

  • Affected methods reviewed: stute_joint_pretest(), joint_pretrends_test(), joint_homogeneity_test(), did_had_pretest_workflow(..., aggregate="event_study"), and the new HADPretestReport/StuteJointResult serialization surfaces.
  • The prior re-review docstring issue is resolved: stute_joint_pretest() now documents the G < 10 warn-and-return-NaN contract correctly at diff_diff/had_pretests.py:L1963-L1982.
  • I did not find an unmitigated P0/P1 methodology defect in the new joint-Stute/event-study workflow. The paper-facing behavior matches the multi-period joint-test / last-cohort structure, and the non-paper-literal choices are documented in docs/methodology/REGISTRY.md:L2336-L2351.
  • One P2 code-quality issue remains for direct callers of the new public residuals-in API: singular custom design_matrix input falls through to a raw LinAlgError.
  • I could not execute pytest tests/test_had_pretests.py locally because pytest, numpy, pandas, and scipy are not installed in this environment.

Methodology

  • Severity P3. Impact: No unmitigated methodology defect found in diff_diff/had_pretests.py:L2241-L2812. The paper supports a pre-trends Stute test built from pre-period placebo residuals, a joint linearity test across post-treatment horizons in multi-period designs, and last-cohort handling under staggered timing; the PR’s departures from paper-literal implementation are explicitly documented in docs/methodology/REGISTRY.md:L2336-L2351, so they are informational rather than defects. Concrete fix: none required.

Code Quality

  • Severity P2. Impact: stute_joint_pretest() is now a public API, but a rank-deficient custom design_matrix will raise a raw LinAlgError at diff_diff/had_pretests.py:L2202-L2205 instead of following the function’s otherwise front-door validation style. This does not affect the built-in wrappers, but it makes direct use of the new residuals-in core harder to handle predictably. Concrete fix: catch np.linalg.LinAlgError around the np.linalg.solve(X.T @ X, X.T) call, raise a clear ValueError for singular/rank-deficient designs, and add a regression test with duplicate columns in design_matrix.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3. Impact: Eq. 18 linear-trend detrending remains deferred, but it is explicitly documented in docs/methodology/REGISTRY.md:L2347-L2349 and tracked in TODO.md:L98-L98, so this is properly recorded deferred work rather than an approval blocker. Concrete fix: none required in this PR.

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: Two new docstrings still describe Yatchew as “step 4” in the workflow (diff_diff/had_pretests.py:L44-L46 and L1849-L1850), but docs/methodology/REGISTRY.md:L2328-L2332 defines step 4 as the final “use TWFE if none of the tests rejects” decision; Yatchew is the alternative linearity test within step 3. This is documentation drift only, not a behavioral bug. Concrete fix: rename those references to “step 3 alternative (Yatchew)”; while there, tighten joint_homogeneity_test()’s post_periods docstring at diff_diff/had_pretests.py:L2464-L2466 from >= base_period to > base_period to match the actual check.
  • Verification note: this was a source-only re-review because the local environment is missing the scientific Python test stack.

P2 - explicit ValueError on singular design_matrix:
`stute_joint_pretest` previously surfaced a raw
`np.linalg.LinAlgError` to direct callers when `design_matrix` was
rank-deficient (e.g. duplicate columns), breaking the front-door
validation style of the rest of the function. Wrap the
`np.linalg.solve(X.T @ X, X.T)` precompute in a try/except and
re-raise as `ValueError` with a message naming the likely cause
(linearly-dependent columns) and the shape.

Regression: new
`TestStuteJointPretest::test_singular_design_matrix_raises_valueerror`
constructs a (G, 2) design with two identical columns and asserts
the explicit `ValueError("rank-deficient")`.
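
A minimal sketch of the wrap-and-re-raise pattern, assuming the
precompute is the `np.linalg.solve(X.T @ X, X.T)` call named above.
`ols_projection_sketch` is a hypothetical helper name and the message
wording is illustrative only.

```python
import numpy as np


def ols_projection_sketch(design_matrix):
    """Precompute (X'X)^{-1} X', turning a singular design into ValueError."""
    X = np.asarray(design_matrix, dtype=float)
    try:
        return np.linalg.solve(X.T @ X, X.T)
    except np.linalg.LinAlgError as exc:
        # Re-raise with the shape and the likely cause for direct callers.
        raise ValueError(
            f"design_matrix with shape {X.shape} is rank-deficient "
            "(likely linearly-dependent columns)."
        ) from exc
```

Note that `np.linalg.solve` only raises when the factorization is
exactly singular; a nearly collinear design may pass through with a
badly conditioned result, so a rank or condition-number check would be
a stricter alternative.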

P3 - Yatchew "step 4" -> "step 3 alternative" docstring drift:
Two docstrings (module header and `_compose_verdict_event_study`)
referred to the Yatchew-HR test as "step 4". Paper
Section 4.2-4.3 defines step 4 as the final admissibility decision
("use TWFE if none of the tests rejects"), not a separate
diagnostic; Yatchew is the alternative linearity test within step 3.
Updated both docstrings to describe Yatchew as the step-3
alternative (subsumed by joint Stute on the event-study path) and
clarified that paper step 4 has no separate code path.

P3 - `joint_homogeneity_test` post_periods docstring:
Text said `>= base_period` but the actual guard is strict `>
base_period` in chronological order. Tightened the Parameters
block to match.

127 tests pass (126 + 1 new R7 regression); black/ruff/mypy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 7c7d5cdbffb5e7a8e6df3ccae9a788047d6e03d1


Overall Assessment

✅ Looks good

Executive Summary

  • Reviewed the new methodology surface in diff_diff/had_pretests.py:L1908-L2874: stute_joint_pretest(), joint_pretrends_test(), joint_homogeneity_test(), and did_had_pretest_workflow(..., aggregate="event_study").
  • I did not find an unmitigated P0/P1 methodology defect. The implementation matches the project’s documented interpretation of the paper: intercept-only residualization for joint pre-trends, [1, D] residualization for joint post-linearity, shared-η wild bootstrap, and event-study validator reuse. See docs/methodology/REGISTRY.md:L2336-L2351 and docs/methodology/papers/dechaisemartin-2026-review.md:L189-L206.
  • The prior re-review P2 is resolved: rank-deficient custom design_matrix input now raises a clear ValueError instead of a raw LinAlgError, diff_diff/had_pretests.py:L2205-L2220, with regression coverage at tests/test_had_pretests.py:L1589-L1609.
  • The prior documentation drift is also resolved: the module/report/workflow docs now describe the event-study path and Yatchew’s role correctly, diff_diff/had_pretests.py:L1-L56, L565-L627, L2645-L2715.
  • Edge-case coverage is materially better: small-G NaN handling, constant-dose and negative-dose guards, ordered-categorical chronology, staggered-wrapper parity, and serialization branching are all covered in tests/test_had_pretests.py:L1672-L1749, L1849-L1916, L2128-L2157, L2192-L2529, L2574-L2779.
  • I could not run pytest here because pytest, numpy, pandas, and scipy are not installed in this environment.

Methodology

  • Severity P3. Impact: No unmitigated methodology finding. The new joint-Stute/event-study workflow matches the Methodology Registry and in-repo paper review for the shipped scope, diff_diff/had_pretests.py:L1908-L2874, docs/methodology/REGISTRY.md:L2336-L2351, docs/methodology/papers/dechaisemartin-2026-review.md:L189-L206. Concrete fix: none.
  • Severity P3. Impact: The two paper-nonliteral choices are explicitly documented, so they are informational rather than defects: sum-of-CvMs aggregation and using joint Stute only on the event-study step-3 path (no joint Yatchew variant), docs/methodology/REGISTRY.md:L2341-L2349. Concrete fix: none.

Code Quality

  • Severity P3. Impact: Previous finding resolved. Public direct callers now get a front-door ValueError on singular/rank-deficient design_matrix input instead of a raw linear-algebra exception, diff_diff/had_pretests.py:L2205-L2220; covered at tests/test_had_pretests.py:L1589-L1609. Concrete fix: none.
  • No additional findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3. Impact: Eq. 18 linear-trend detrending remains deferred, but it is explicitly documented in the registry and tracked in TODO.md, so it is properly mitigated deferred work rather than a blocker, docs/methodology/REGISTRY.md:L2347-L2349, TODO.md:L98-L98. Concrete fix: none in this PR.

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: Previous docstring drift is resolved; the updated docs now align with the shipped workflow semantics and the event-study branch, diff_diff/had_pretests.py:L1-L56, L565-L627, L2645-L2715. Concrete fix: none.
  • Severity P3. Impact: Coverage on the new high-risk branches is strong, especially for chronology/order handling, staggered-wrapper behavior, small-sample NaN behavior, and event-study serialization/export surfaces, tests/test_had_pretests.py:L1672-L1749, L1849-L1916, L2128-L2157, L2192-L2529, L2574-L2779, diff_diff/__init__.py:L63-L76, L459-L473. Concrete fix: none.
  • Verification note: this was a source-only re-review because the local environment is missing pytest and the scientific Python stack.

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 24, 2026
@igerber igerber merged commit 869c19a into main Apr 24, 2026
23 of 24 checks passed
@igerber igerber deleted the had-joint-stute-pretest branch April 24, 2026 11:17
igerber added a commit that referenced this pull request Apr 24, 2026
…ression

Closes the second P1 from the review. Python `_compute_observation_weights`
had an extra `valid_control_at_t = D[t, :] == 0` gate that zeroed ω_j for
units treated at the target period (other than the target unit itself).
Rust's `compute_weight_matrix` has no such gate — per the paper's Eq. 2/3
and `docs/methodology/REGISTRY.md` TROP section, `ω_j = exp(-λ_unit ×
dist(j, i))` is distance-based for all `j ≠ i` and the treated-cell
exclusion is the `(1 - W_{js})` factor applied inside `_estimate_model`
via the control mask, not an extra target-period unit-weight gate.

The empirical impact of removing the gate is zero on the ATT point
estimate: same-cohort donors' pre-treatment rows are exactly absorbed
by their own unit fixed effect `alpha_j` without propagating into
`mu`, `beta`, or other units' parameters — adding them to the fit
changes which rows are scored but not the solution the fit converges
to. Verified: the flipped bootstrap-seed parity test, the main-fit
parity test at `lambda_nn=inf` (`atol=1e-14`) and at `lambda_nn=0.1`
(`atol=1e-10`), and the new same-cohort regression test (below) all
pass before and after the gate removal. The change is structural
alignment with the paper and Rust, not a numerical behavior shift.
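
As an illustration of the ungated weighting only, here is a small
sketch that assumes distances to the target unit are already computed
(how `dist` is defined is not restated in this message).
`unit_weights_sketch` and its arguments are hypothetical names, not
the Python or Rust implementations.

```python
import math


def unit_weights_sketch(dist_to_target, lambda_unit, target):
    """omega_j = exp(-lambda_unit * dist(j, i)) for every unit j != i.

    No unit is dropped for being treated at the target period; per the
    text above, treated cells are excluded later by the control mask
    inside the model fit.
    """
    return {
        unit: math.exp(-lambda_unit * dist)
        for unit, dist in dist_to_target.items()
        if unit != target
    }
```

With this shape, a same-cohort donor keeps a positive weight that
depends only on its distance to the target unit.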

Test addition
-------------
`TestTROPRustEdgeCaseParity::test_local_method_same_cohort_donor_parity`
isolates the scenario the gate used to handle differently from Rust: a
fixture with three treated units sharing one cohort (all treated at
`t=5`) and three controls. Before the gate was removed, Python's and
Rust's same-target-period donors diverged in which rows contributed to
the fit; the tests prove the ATT point estimate was never affected
(pre-treatment rows absorbed by `alpha_j`), and now both backends also
agree structurally. Parametrized over the same regime split as the
main-fit parity test (`lambda_nn=inf` → `atol=1e-14`, `lambda_nn=0.1`
→ `atol=1e-10`).

Note on the other P1 in the review (HAD rollback claim): that finding
was a phantom caused by a stale branch base — PR #353 (HAD joint Stute
pretest) landed on `origin/main` between this branch's cut and the
review run, so the PR diff against current `origin/main` appeared to
"delete" the PR #353 additions. Resolved by rebasing onto the updated
`origin/main` before this push.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>