HAD Phase 3 follow-up: joint Stute pretest + event-study workflow #353
Conversation
…patch

Adds `stute_joint_pretest` (residuals-in core) plus `joint_pretrends_test` and `joint_homogeneity_test` data-in wrappers for the paper's step-2 (mean-independence pre-trends) and step-3 (linearity joint extension) nulls. Extends `did_had_pretest_workflow` with `aggregate="event_study"` multi-period dispatch that closes the "paper step 2 deferred" gap previously flagged on two-period reports.

Implementation highlights:

- Sum-of-CvMs aggregation (Delgado 1993; Escanciano 2006) with a shared Mammen wild-bootstrap multiplier per unit across horizons, preserving the unit-level dependence of the vector-valued empirical process (Delgado-Manteiga 2001; Hlavka-Huskova 2020).
- Per-horizon scale- and translation-invariant exact-linear short-circuit (a single degenerate horizon does not collapse the joint test).
- Reciprocal front-door guards on both wrappers: non-empty horizon list, base_period ordering, D=0 invariant (pre-trends), and D>0 existence (post-homogeneity).
- Backward-compatible `HADPretestReport` extension: new fields `pretrends_joint`, `homogeneity_joint`, and `aggregate` with defaults; `stute` and `yatchew` become Optional. `summary`, `to_dict`, `to_dataframe`, and `__repr__` branch on `aggregate` and preserve Phase 3 schemas bit-exactly on the `aggregate="overall"` path.
- Eq. (18) linear-trend detrending (paper Section 5.2, Pierce-Schott p=0.51) deferred to the Phase 4 replication harness, where the published value serves as a parity anchor; TODO row migrated accordingly.

46 new tests (115 total in tests/test_had_pretests.py) covering: K=1 parity with `stute_test`, shared-eta white-box, per-horizon short-circuit independence, the full reciprocal-validator matrix, event-study verdict priority, and serialization round-trips across aggregates. Includes regression tests asserting the "paper step 2 deferred" string is absent from any event-study verdict.

Closes TODO.md Phase 3 rows for joint Eq. 18 and multi-period workflow dispatch.
See REGISTRY.md HeterogeneousAdoptionDiD "Joint Stute tests" for algorithm, invariants, and the no-joint-Yatchew acknowledgment (the paper does not derive one; multi-period Yatchew remains available per-horizon via yatchew_hr_test). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
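The sum-of-CvMs aggregation with a shared Mammen multiplier described above can be sketched as follows. This is a minimal illustration of the technique under its cited references, not the package implementation; every name here (`cvm_stat`, `mammen_multipliers`, `joint_stute_pvalue`) is hypothetical.

```python
import numpy as np

def cvm_stat(doses, residuals):
    """Stute-type Cramer-von Mises statistic: mean of squared scaled
    partial sums of residuals ordered by dose."""
    order = np.argsort(doses)
    cusum = np.cumsum(residuals[order]) / np.sqrt(len(doses))
    return float(np.mean(cusum ** 2))

def mammen_multipliers(n, rng):
    """Two-point Mammen distribution: mean 0, variance 1."""
    p = (np.sqrt(5) + 1) / (2 * np.sqrt(5))
    a, b = -(np.sqrt(5) - 1) / 2, (np.sqrt(5) + 1) / 2
    return np.where(rng.random(n) < p, a, b)

def joint_stute_pvalue(doses, residual_arrays, n_boot=199, seed=0):
    """Sum CvM statistics over horizons; ONE multiplier vector per draw
    is shared across all horizons so cross-horizon dependence at the
    unit level is carried into the bootstrap null distribution."""
    rng = np.random.default_rng(seed)
    g = len(doses)
    s_joint = sum(cvm_stat(doses, r) for r in residual_arrays)
    boot = np.empty(n_boot)
    for b in range(n_boot):
        eta = mammen_multipliers(g, rng)  # shared across horizons
        boot[b] = sum(cvm_stat(doses, eta * r) for r in residual_arrays)
    return s_joint, float(np.mean(boot >= s_joint))
```

The shared `eta` per draw is the load-bearing design choice: drawing fresh multipliers per horizon would treat the horizons as independent and distort the joint null.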
Overall Assessment: Needs changes. The highest unmitigated issues are P1s in the new joint Stute APIs.
- Performance: No findings.
- Maintainability: No additional findings beyond the dead-parameter/validator-bypass issue above.
- Tech Debt: No blockers. The Eq. 18 linear-trend detrending gap is explicitly tracked in …
- Security: No findings.
P1 - wrapper validator + first_treat_col wiring (had_pretests.py):
`joint_pretrends_test` and `joint_homogeneity_test` now route through `_validate_had_panel_event_study` when the panel has >= 3 periods, so direct wrapper calls inherit the Appendix B.2 last-cohort filter, the constant-post-dose invariant, and the staggered/no-first_treat_col raise contract. `first_treat_col` is actually wired through instead of accepted-but-ignored. Subset checks (base_period in the validated t_pre_list; pre_periods / post_periods subsets of the corresponding validated set) run after the validator, so callers get crisp errors on mistyped horizons rather than silent miscomputation.

P1 - constant-d degeneracy guard in `stute_joint_pretest`:
When `ptp(doses) <= 0` (all units share an identical dose), warn and return all-NaN inference instead of computing a mechanically zero CvM (mean-independence null - a bogus fail-to-reject) or attempting a singular `[1, d]` refit (linearity null - matrix-solve crash). Uses `np.ptp` rather than `np.var` because the variance of a constant yields ~1e-32 rounding noise that would slip past a `<= 0` comparison. Mirrors `stute_test`'s intent at single-horizon scale.

P2 - bit-exact overall-path serialization:
`HADPretestReport.__repr__`, `summary()`, and `to_dict()` now produce Phase 3-identical output when `aggregate="overall"` - no `aggregate` key in the dict, no header line in the summary, no new kwarg in the repr. The `aggregate` field remains on the dataclass internally and is surfaced in these serializations only on `aggregate="event_study"`. Restores the CHANGELOG's bit-exact compatibility claim.

P3 - regression tests + docs:
Four new tests cover the P1 edge cases: constant-d core path, direct-wrapper staggered panel (with and without `first_treat_col`), and wrapper-level constant-d propagation. REGISTRY.md and CHANGELOG.md document that step-2 closure requires >= 2 pre-periods (the base `F-1` plus at least one earlier placebo); on single-pre-period panels the workflow emits `pretrends_joint=None` with a skip note in the verdict and `all_pass=False`.

Existing tests updated for the new validator path: the pre-period D=0 and all-zero post-period checks now fire via the event-study validator's staggered-cohort or contiguous-dose guards before the wrapper's local reciprocal guards can run; regex matchers were widened to accept either error surface. `test_to_dict_overall_preserves_phase3_schema` now asserts the ABSENCE of the `aggregate` key on the overall path to match the restored bit-exact schema.

119 tests pass (115 + 4 new R1 regressions); black/ruff/mypy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber)

Verification note: I did not execute the test suite in this shell because the available Python environment is missing project dependencies such as …
P1 - ordered-categorical chronology: raw `t < base_period` /
`t > base_period` comparisons in `joint_pretrends_test`,
`joint_homogeneity_test`, and `did_had_pretest_workflow(aggregate=
"event_study")` silently misorder ordered-categorical time columns
whose lexical and chronological order disagree (e.g. categories
["q1", "q2", "q10"] sort lexically as "q1" < "q10" < "q2"). On
such panels the raw comparison could (a) silently drop valid
pre-period horizons via the raw `<` check, (b) emit a spurious
"joint pre-trends skipped" verdict from the workflow's `earlier_pre`
filter, or (c) raise on valid post-period inputs.
Fix: new private helper `_build_period_rank` returns a
{period_label: chronological_rank} map using the ordered-
categorical category order when applicable, natural sort on
numeric / datetime otherwise. Both wrappers compare period labels
via rank (`rank[t1] < rank[t2]`) instead of raw Python `<`/`>`.
The workflow's `earlier_pre` replaces the raw-< filter with
`list(t_pre_list[:-1])` - `t_pre_list` is already chronologically
sorted by the validator (via its `_sort_key`), so excluding the
last element yields the earlier pre-periods regardless of dtype.
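The rank-map idea can be sketched as follows, assuming the helper's behavior matches the description above. `build_period_rank` here is illustrative (the real helper is private), and the non-categorical fallback uses plain `sorted` rather than the full natural sort the commit describes.

```python
import pandas as pd

def build_period_rank(time_values):
    """Map each period label present in `time_values` to its
    chronological rank.

    Ordered-categorical columns use their declared category order, so
    "q2" ranks before "q10" even though it sorts after lexically.
    Numeric / datetime columns fall back to sorted order.
    """
    dtype = time_values.dtype
    if isinstance(dtype, pd.CategoricalDtype) and dtype.ordered:
        present = set(time_values)
        labels = [c for c in dtype.categories if c in present]
    else:
        labels = sorted(pd.unique(time_values))
    return {label: rank for rank, label in enumerate(labels)}
```

With such a map, wrappers compare `rank[t1] < rank[t2]` instead of raw `t1 < t2`, which is exactly what defuses the `"q1" < "q10" < "q2"` lexical trap.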
P3 - ordered-categorical regression tests: new
`TestOrderedCategoricalChronology` class (4 tests) with a fixture
using categories `["q1", "q2", "q10", "post"]`. Covers (a) direct
pretrends wrapper picks up both earlier placebos, (b) pretrends
wrapper rejects lexically-ordered-but-chrono-invalid input (e.g.
pre=["q10"], base="q2"), (c) homogeneity wrapper accepts valid
post-period input, (d) workflow event-study dispatch surfaces both
earlier placebos in `pretrends_joint.horizon_labels` without the
false skip note.
123 tests pass (119 + 4 new); black/ruff/mypy clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber)

Verification note: I did not execute the test suite in this shell because the available Python environment is missing project dependencies such as …
P1 - row-level non-negative-dose guard in `_aggregate_for_joint_test`:
On a 2-period direct call to `joint_pretrends_test` or
`joint_homogeneity_test`, the n_periods < 3 path skips
`_validate_had_panel_event_study` (which requires >= 3 periods) and
falls through to `_aggregate_for_joint_test`. That helper collapsed
unit dose via `groupby(unit_col)[dose_col].max()`, which silently
recodes a negative post dose to 0 (`max(0, -d) = 0` for positive
pre-period d), allowing finite joint-Stute output on data that
violates the HAD support restriction `D_{g,t} >= 0` (paper Section 2).
Fix: add a row-level `dose_col >= 0` check in
`_aggregate_for_joint_test` BEFORE the groupby/max collapse.
Centralizes the guard so both data-in wrappers inherit it on the
n_periods < 3 fallback path. The multi-period path already enforces
the same invariant via `_validate_had_panel_event_study`, so the
contract is consistent across both wrapper dispatch modes.
P3 - regression test: new
`TestJointHomogeneityTest::test_two_period_negative_post_dose_raises`
constructs a 2-period panel with a single unit carrying a negative
post dose and asserts the wrapper raises `ValueError` with the
"negative dose value" substring rather than producing a finite
statistic via the groupby-max collapse.
124 tests pass (123 + 1 new R3 regression); black/ruff/mypy clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber)

- Performance: No findings.
- Maintainability: No findings beyond the duplicate-warning issue above.
- Tech Debt: No untracked tech-debt blockers. The Eq. 18 detrending deferment is already tracked in …
- Security: No findings.
P1 - stringified-label collision guard in stute_joint_pretest:
The core indexed residuals_arrays / fitted_arrays by `str(k)` with
no uniqueness check on the stringified keys. Two distinct raw keys
whose str() forms collide (e.g. {1: ..., "1": ...} both stringify
to "1", or custom objects with identical __str__) would silently
overwrite one entry and then be double-counted in S_joint =
sum(S_k) because the surviving horizon's statistic gets summed
twice while n_horizons still reports K=2. That produces wrong
methodology output with no diagnostic.
Fix: compute the stringified labels once up front and reject any
collision explicitly with a ValueError listing which raw keys
collide to which stringified form. Centralizes the check before
any residual/fitted array is dropped. Replaces the ad-hoc post-hoc
re-keying with a reuse of the pre-computed collision-free list.
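The collision check can be sketched like this; the function name and exact message wording are illustrative.

```python
def stringify_horizon_labels(raw_keys):
    """str() each horizon key once, rejecting collisions up front.

    Without this, {1: ..., "1": ...} would silently overwrite one
    residual array under the shared label "1" and the survivor's CvM
    statistic would be double-counted in S_joint = sum(S_k) while
    n_horizons still reported K=2.
    """
    labels = [str(k) for k in raw_keys]
    seen = {}
    for raw, label in zip(raw_keys, labels):
        if label in seen:
            raise ValueError(
                f"horizon label collision after str(): {seen[label]!r} "
                f"and {raw!r} both map to {label!r}"
            )
        seen[label] = raw
    return labels
```

Computing the labels once and validating them before any array is re-keyed keeps the residual and fitted dicts aligned with `n_horizons` by construction.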
P3 - dedupe staggered-filter UserWarning:
`_validate_had_panel_event_study` already warns on the staggered
auto-filter path; both joint-pretest wrappers and the event-study
workflow were re-emitting the same information with a
wrapper-prefixed message. Each staggered call therefore surfaced
two warnings to the user. Removes the secondary emissions;
wrappers now consume `_filter_info` silently. Existing tests still
pass because the validator's own `"Staggered-timing panel
detected"` message satisfies the regex matchers.
P3 - collision regression test: new
`TestStuteJointPretest::test_stringified_key_collision_raises`
exercises (a) the int 1 + str "1" case and (b) a pair of custom
objects with identical __str__ but distinct hash; both must raise
`ValueError` with "collision after str" in the message.
125 tests pass (124 + 1 new R4 collision regression); black/ruff/
mypy clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber)
P1 - stute_joint_pretest G<_MIN_G_STUTE warn+NaN contract:
The joint core raised `ValueError` on G < 10, while single-horizon
`stute_test` emits a `UserWarning` and returns a NaN result on the
same condition. Because the event-study workflow dispatches into
the joint core for both step-2 pre-trends and step-3 homogeneity,
a staggered panel whose last-cohort auto-filter leaves fewer than
10 units would now crash the workflow instead of surfacing an
inconclusive report - a regression versus Phase 3's two-period
behavior.
Fix: mirror the single-horizon contract. Emit `UserWarning`
("below the minimum ... Returning NaN result") and return a
`StuteJointResult` with `cvm_stat_joint=nan`, `p_value=nan`,
`reject=False`, and a full-NaN `per_horizon_stats` dict keyed by
the validated horizon labels (so the diagnostic surface is
consistent with the NaN-propagation branch). `n_bootstrap <
_MIN_N_BOOTSTRAP` and non-numeric `alpha` still raise; only the
small-G branch relaxes.
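The warn-and-NaN contract can be sketched as follows; the result shape mirrors the fields named above, but the helper itself (`check_min_units`) is illustrative.

```python
import math
import warnings

_MIN_G_STUTE = 10  # minimum units, mirroring the single-horizon test

def check_min_units(n_units, horizon_labels):
    """Return None when G is sufficient; otherwise warn and return a
    full-NaN result dict so a caller (e.g. an event-study workflow)
    surfaces an inconclusive report instead of crashing."""
    if n_units >= _MIN_G_STUTE:
        return None
    warnings.warn(
        f"G={n_units} is below the minimum of {_MIN_G_STUTE}. "
        "Returning NaN result.",
        UserWarning,
    )
    return {
        "cvm_stat_joint": math.nan,
        "p_value": math.nan,
        "reject": False,
        "per_horizon_stats": {k: math.nan for k in horizon_labels},
    }
```

Keying `per_horizon_stats` by the validated horizon labels keeps the diagnostic surface identical to the NaN-propagation branch, as the fix describes.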
Test updates:
- `test_small_G_raises` renamed to `test_small_G_warns_returns_nan`
and rewritten to assert the new contract.
- New `test_event_study_small_panel_after_filter_inconclusive_not_
crash` covers the workflow-level regression: a staggered fixture
with 40 early-cohort + 6 late-cohort units filters to G=6 after
the validator's last-cohort auto-filter; `did_had_pretest_
workflow(aggregate="event_study")` now completes without
exception, emits the "below the minimum" warning, and surfaces a
NaN joint-Stute report with `all_pass=False`.
P3 - module docstring refresh:
`had_pretests.py` top-level docstring still said Phase 3 shipped
steps 1 + 3 only, that step 2 was deferred, and that
`did_had_pretest_workflow` was a two-period-only entry point. That
drifted after the joint-pretest follow-up landed. Rewrote the
docstring to describe: (a) the three single-horizon tests, (b) the
three new joint helpers (`stute_joint_pretest`,
`joint_pretrends_test`, `joint_homogeneity_test`), (c) both
workflow dispatch modes (`aggregate="overall"` two-period and
`aggregate="event_study"` multi-period), and (d) the narrowed
deferment - only Eq. 18 linear-trend detrending remains, tracked
in TODO for Phase 4 alongside the Pierce-Schott replication.
126 tests pass (125 + 1 new R5 workflow regression; one existing test
converted from raise to warn); black/ruff/mypy clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber)

Overall Assessment: ✅ Looks good
P3 - stute_joint_pretest docstring drift:
The Raises block still listed `G < _MIN_G_STUTE` as a ValueError condition, but R5 converted that branch to a UserWarning + full-NaN `StuteJointResult` return to match single-horizon `stute_test` and keep the event-study workflow from crashing on staggered-filtered small panels.

Fix: rewrote the Returns and Raises docstring blocks to describe the actual contract. Returns now enumerates the three NaN-result branches (small G, constant dose, any-NaN residuals / fitted) with their warning behavior. Raises is narrowed to the genuinely raising conditions: empty input, key mismatch, str-label collision, shape mismatch, negative doses, too few bootstrap replicates, invalid alpha. Explicitly notes that small G does NOT raise.

No code changes; docstring-only edit. 126 tests still pass; black/ruff/mypy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber)

Overall Assessment: ✅ Looks good
P2 - explicit ValueError on singular design_matrix:
`stute_joint_pretest` previously surfaced a raw
`np.linalg.LinAlgError` to direct callers when `design_matrix` was
rank-deficient (e.g. duplicate columns), breaking the front-door
validation style of the rest of the function. Wrap the
`np.linalg.solve(X.T @ X, X.T)` precompute in a try/except and
re-raise as `ValueError` with a message naming the likely cause
(linearly-dependent columns) and the shape.
Regression: new
`TestStuteJointPretest::test_singular_design_matrix_raises_valueerror`
constructs a (G, 2) design with two identical columns and asserts
the explicit `ValueError("rank-deficient")`.
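The wrap-and-re-raise pattern can be sketched as follows; `precompute_projector` is an illustrative name for the precompute step, not the package's internal.

```python
import numpy as np

def precompute_projector(design_matrix):
    """Solve (X'X) Z = X' for the least-squares projector Z, naming
    the likely cause and the offending shape when X'X is singular."""
    X = np.asarray(design_matrix, dtype=float)
    try:
        return np.linalg.solve(X.T @ X, X.T)
    except np.linalg.LinAlgError as err:
        raise ValueError(
            f"design_matrix of shape {X.shape} is rank-deficient "
            "(likely linearly-dependent columns); cannot invert X'X."
        ) from err
```

Chaining with `from err` keeps the original `LinAlgError` in the traceback while giving direct callers the same `ValueError` front door as the rest of the function's validation.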
P3 - Yatchew "step 4" -> "step 3 alternative" docstring drift:
Two docstrings (module header and `_compose_verdict_event_study`)
referred to the Yatchew-HR test as "step 4". Paper
Section 4.2-4.3 defines step 4 as the final admissibility decision
("use TWFE if none of the tests rejects"), not a separate
diagnostic; Yatchew is the alternative linearity test within step 3.
Updated both docstrings to describe Yatchew as the step-3
alternative (subsumed by joint Stute on the event-study path) and
clarified that paper step 4 has no separate code path.
P3 - `joint_homogeneity_test` post_periods docstring:
Text said `>= base_period` but the actual guard is strict `>
base_period` in chronological order. Tightened the Parameters
block to match.
127 tests pass (126 + 1 new R7 regression); black/ruff/mypy clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber)

Overall Assessment: ✅ Looks good
…ression
Closes the second P1 from the review. Python `_compute_observation_weights`
had an extra `valid_control_at_t = D[t, :] == 0` gate that zeroed ω_j for
units treated at the target period (other than the target unit itself).
Rust's `compute_weight_matrix` has no such gate — per the paper's Eq. 2/3
and `docs/methodology/REGISTRY.md` TROP section, `ω_j = exp(-λ_unit ×
dist(j, i))` is distance-based for all `j ≠ i` and the treated-cell
exclusion is the `(1 - W_{js})` factor applied inside `_estimate_model`
via the control mask, not an extra target-period unit-weight gate.
The empirical impact of removing the gate is zero on the ATT point
estimate: same-cohort donors' pre-treatment rows are exactly absorbed
by their own unit fixed effect `alpha_j` without propagating into
`mu`, `beta`, or other units' parameters — adding them to the fit
changes which rows are scored but not the solution the fit converges
to. Verified: the flipped bootstrap-seed parity test, the main-fit
parity test at `lambda_nn=inf` (`atol=1e-14`) and at `lambda_nn=0.1`
(`atol=1e-10`), and the new same-cohort regression test (below) all
pass before and after the gate removal. The change is structural
alignment with the paper and Rust, not a numerical behavior shift.
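Stripped of the removed gate, the retained weight rule reduces to a one-liner; this sketch assumes a precomputed distance vector and is not the repository's `_compute_observation_weights`.

```python
import numpy as np

def observation_weights(dist_to_target, lambda_unit):
    """omega_j = exp(-lambda_unit * dist(j, i)) for every donor j.

    No `D[t, :] == 0` gate here: per Eq. 2/3, treated-cell exclusion
    is handled downstream by the (1 - W_js) control mask inside the
    model fit, not by zeroing unit weights at the target period.
    """
    return np.exp(-lambda_unit * np.asarray(dist_to_target, dtype=float))
```

Note that `dist(i, i) = 0` gives the target unit weight 1 mechanically; which rows of the target actually enter the fit is again a matter for the control mask, not these weights.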
Test addition
-------------
`TestTROPRustEdgeCaseParity::test_local_method_same_cohort_donor_parity`
isolates the scenario the gate used to handle differently from Rust: a
fixture with three treated units sharing one cohort (all treated at
`t=5`) and three controls. Before the gate was removed, Python's and
Rust's same-target-period donors diverged in which rows contributed to
the fit; the tests prove the ATT point estimate was never affected
(pre-treatment rows absorbed by `alpha_j`), and now both backends also
agree structurally. Parametrized over the same regime split as the
main-fit parity test (`lambda_nn=inf` → `atol=1e-14`, `lambda_nn=0.1`
→ `atol=1e-10`).
Note on the other P1 in the review (HAD rollback claim): that finding
was a phantom caused by a stale branch base — PR #353 (HAD joint Stute
pretest) landed on `origin/main` between this branch's cut and the
review run, so the PR diff against current `origin/main` appeared to
"delete" the PR #353 additions. Resolved by rebasing onto the updated
`origin/main` before this push.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- Adds joint Stute helpers (`stute_joint_pretest` + `joint_pretrends_test` + `joint_homogeneity_test`) to close the paper step-2 gap that the Phase 3 `did_had_pretest_workflow` flagged with an "Assumption 7 pre-trends test NOT run" caveat.
- Extends `did_had_pretest_workflow` with `aggregate="event_study"` multi-period dispatch (QUG + joint pre-trends + joint homogeneity-linearity); `aggregate="overall"` preserves Phase 3 behavior bit-exactly.
- Eq. (18) linear-trend detrending (paper Section 5.2, Pierce-Schott `p=0.51`) DEFERRED to the Phase 4 replication harness, where the published value serves as a parity anchor.
- Paper: de Chaisemartin, Ciccia, D'Haultfœuille, Knau (2026, arXiv:2405.04465v6), Sections 4.2-4.3 + 5.2.
Test plan
- `pytest tests/test_had_pretests.py` (115 tests total — 69 existing Phase 3 + 46 new covering core, wrappers, workflow, and serialization).
- `black`, `ruff`, `mypy` on `diff_diff/had_pretests.py`, `diff_diff/__init__.py`, `tests/test_had_pretests.py`.
- `aggregate="overall"` path is bit-exact with Phase 3 (`TestMultiPeriodWorkflow::test_overall_aggregate_unchanged`).
- Event-study verdicts no longer contain the "paper step 2 deferred" string (`TestMultiPeriodWorkflow::test_no_paper_step_2_deferred_string_on_event_study`).
- Shared bootstrap multipliers across horizons verified white-box (`TestStuteJointPretest::test_shared_eta_across_horizons_white_box`).

Notes
- `HADPretestReport.stute` and `.yatchew` are now `Optional` because the event-study path emits `None` on those fields. `aggregate="overall"` always populates them, so Phase 3 consumers are unaffected. A handful of existing tests add explicit `is not None` narrowing assertions.
- The paper derives no joint Yatchew test, so multi-period Yatchew remains available by running `yatchew_hr_test` on each `(base, post)` pair manually. REGISTRY.md documents this asymmetry.

🤖 Generated with Claude Code