Phase 2a: HeterogeneousAdoptionDiD class (single-period, 3 design paths) by igerber · Pull Request #346 · igerber/diff-diff

igerber · 2026-04-20T21:22:29Z

Summary

Ships HeterogeneousAdoptionDiD + HeterogeneousAdoptionDiDResults as the user-facing estimator for de Chaisemartin, Ciccia, D'Haultfoeuille, and Knau (2026). Three design-dispatch paths: Design 1' (continuous_at_zero, boundary=0), Design 1 continuous-near-d_lower (regressor shift D - d_lower, boundary=d.min()), and Design 1 mass-point (Wald-IV / 2SLS with instrument 1{D > d_lower}).
Continuous paths compose Phase 1c's bias_corrected_local_linear and apply Equation 8's β-scale rescaling tau_bc / D_bar where D_bar = (1/G) * sum(D_{g,2}). Mass-point path uses the structural-residual 2SLS sandwich [Z'X]^-1 Ω [Z'X]^-T for classical/hc1 and cluster-robust CR1. HC2 and HC2-BM raise NotImplementedError pending a 2SLS-specific leverage derivation.
design="auto" resolves via the REGISTRY line-2320 rule: d.min() == 0 or d.min()/median(|d|) < 0.01 → continuous_at_zero; modal fraction at d.min() > 2% → mass_point; else continuous_near_d_lower. Mass-point path enforces d_lower == float(d.min()) within float tolerance (paper Section 3.2.4 contract); mismatched overrides raise with a clear pointer to the unsupported estimand.
Panel validator enforces balanced two-period panel, D_{g,1} = 0 for all units, and first_treat_col value-domain validation. >2 periods raises with a Phase 2b pointer. aggregate="event_study", survey=, and weights= all raise NotImplementedError pointing at follow-up PRs.
All inference fields (att, se, t_stat, p_value, conf_int) routed through safe_inference for NaN-safe CI gating. Raw design preserved on self so get_params() / sklearn.clone() round-trip reproduces "auto" even after fit.

Methodology references (required if estimator / math changes)

Method name(s): Heterogeneous Adoption Difference-in-Differences (HAD) / Weighted-Average-Slope (WAS) estimator; paper Section 3 (Designs 1 and 1'); paper Section 3.2.4 (mass-point 2SLS); paper Equation 8 (β-scale CI).
Paper / source link(s): de Chaisemartin, C., Ciccia, D., D'Haultfœuille, X., & Knau, F. (2026). Difference-in-Differences Estimators When No Unit Remains Untreated. arXiv:2405.04465v6.
Any intentional deviations from the source (and why): The paper specifies "standard 2SLS inference" for the mass-point path without naming a specific variance. Phase 2a implements the structural-residual 2SLS sandwich ([Z'X]^-1 built from structural residuals u = ΔY - α̂ - β̂·D, not the reduced-form Wald-IV-scaled OLS shortcut) to match canonical 2SLS (e.g., AER::ivreg, ivreg2). Documented in docs/methodology/REGISTRY.md under the HAD Phase 2a Note block. HC2 and HC2-BM require a 2SLS-specific leverage derivation and are queued for follow-up (tracked in TODO.md).

Validation

Tests added/updated: tests/test_had.py (NEW, 110 tests across 16 test classes covering the 12 plan commit criteria: smoke on all 3 designs, auto-detect correctness + 2 edge cases, β-scale rescaling at atol=1e-14, mass-point Wald-IV at atol=1e-14, mass-point sandwich SE parity at atol=1e-12 (classical/hc1/CR1), NotImplementedError contracts for hc2/hc2_bm/event_study/survey/weights, panel-contract violations, NaN propagation, sklearn clone round-trip, get_params signature enumeration, mass-point d_lower equality contract).
Backtest / simulation / notebook evidence (if applicable): N/A; the mass-point SE parity test uses a hand-coded textbook 2SLS sandwich as its reference (no R dependency). The β-scale rescaling parity is algebraic post-Phase-1c and is verified at atol=1e-14 against direct bias_corrected_local_linear output.

Security / privacy

Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

Ships the user-facing HAD estimator with three design-dispatch paths: - continuous_at_zero (Design 1'): bias-corrected local-linear at boundary=0, rescaled to beta-scale via Equation 8 divisor D_bar = (1/G) * sum(D_{g,2}). - continuous_near_d_lower (Design 1): same path with regressor shift D' = D - d_lower and boundary = float(d.min()). - mass_point (Design 1 Section 3.2.4): sample-average 2SLS with instrument 1{D_{g,2} > d_lower}. Point estimate is the Wald-IV ratio; SE is the structural-residual 2SLS sandwich [Z'X]^-1 Omega [Z'X]^-T for classical, hc1, and CR1 (cluster-robust). hc2/hc2_bm raise NotImplementedError pending a 2SLS-specific leverage derivation. design="auto" resolves via the REGISTRY auto-detect rule (d.min() < 0.01 * median(|d|) -> continuous_at_zero; modal fraction > 2% -> mass_point; else continuous_near_d_lower), with d.min() == 0 as an unconditional tie-break. Panel validator enforces balanced two-period panel, D_{g,1} = 0 for all units, no NaN in key columns. >2 periods raises pointing to Phase 2b. aggregate="event_study" (Appendix B.2 multi-period) raises NotImplementedError; survey= and weights= also raise (queued for follow-up survey-integration PR). All inference fields (att, se, t_stat, p_value, conf_int) routed through safe_inference() for NaN-safe CI gating. Raw design kept on self so get_params()/clone() round-trip preserves "auto" even after fit. 105 new tests cover the 12 plan commit criteria: three design paths finite, design="auto" correctness + edge cases, beta-scale rescaling at atol=1e-14, mass-point Wald-IV at atol=1e-14, mass-point sandwich SE parity at atol=1e-12 (classical/hc1/CR1), NotImplementedError contracts, panel contract violations, NaN propagation, sklearn clone round-trip, get_params signature enumeration. REGISTRY.md ticks four Phase 2a checkboxes and adds a Note block documenting the structural-residual 2SLS sandwich choice. TODO.md queues Phase 2b (multi-period), survey integration, weights, hc2/hc2_bm leverage derivation, pre-test diagnostics (Phase 3), Pierce-Schott replication (Phase 4), and tutorial integration (Phase 5). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Paper Section 3.2.4 defines the Design 1 mass-point estimator with instrument Z = 1{D_{g,2} > d_lower} at d_lower = float(d.min()) (the lower-support mass point). Phase 2a previously accepted user-supplied d_lower overrides on the mass-point path without validation, which would silently redefine the instrument/control split and identify a different (LATE-like) estimand outside Phase 2a's documented scope. Changes: - diff_diff/had.py: fit() now raises ValueError on the mass-point path when an explicit d_lower differs from float(d.min()) beyond float tolerance. d_lower=None (auto-resolve) and d_lower == d.min() (within tolerance) remain supported. Continuous paths are unaffected - they already reject off-support d_lower via Phase 1c's _validate_had_inputs (negative-dose check after the regressor shift). - tests/test_had.py: new TestMassPointDLowerContract class adds 6 rejection/acceptance tests for the contract. The prior test_force_mass_point_on_continuous_data test is renamed and clarified to document that d_lower=0.0 matches d.min()==0.0 in that DGP. The stale all-above-d_lower NaN-propagation test is removed (the input is now unreachable from the public API; the helper-level NaN guard is still tested via test_helper_returns_nan_on_empty_z_zero). Smaller fixes: - _validate_had_panel: first_treat_col comment no longer overstates the cross-validation (Phase 2a does value-domain only, not dose consistency). - _fit_continuous dispatch: "Warn once" comment revised to "Warn" (per-fit warnings are not suppressed across calls in Phase 2a). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Rename test_d_lower_contract_is_mass_point_only -> test_mass_point_equality_guard_does_not_fire_on_continuous and rewrite the docstring so it accurately describes what's exercised. Previously claimed "arbitrary d_lower" but only tests d_lower=d.min(); the renamed test now narrates that the mass-point-specific ValueError does not fire on the continuous path (the continuous path has its own upstream negative-dose guard after the regressor shift). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-04-20T21:31:21Z

Overall Assessment

⛔ Blocker

Executive Summary

The continuous-design implementation is mathematically off: diff_diff/had.py#L1177 computes att = tau_bc / mean(D), but the HAD paper’s continuous estimators are based on E[ΔY] - lim E[ΔY | D_2 ≤ d], and the no-QUG Design 1 path also uses E[D_2 - d̲] in the denominator, not E[D_2]. The new tests at tests/test_had.py#L234 and tests/test_had.py#L279 lock that wrong formula in. citeturn1view1turn0view1
The original dose-support contract is incomplete. diff_diff/had.py#L404 never checks post-period D_{g,2} >= 0, and diff_diff/had.py#L984 lets continuous_near_d_lower accept user d_lower values that are not the support infimum.
Design 1 paths fit silently without surfacing the extra Assumption 5/6 identification burden, despite the registry requiring that warning at docs/methodology/REGISTRY.md#L2155.
The mass-point 2SLS path itself looks directionally aligned; the unsupported hc2/hc2_bm, event_study, survey, weights, and continuous-path cluster= gaps are documented in TODO.md#L94 and docs/methodology/REGISTRY.md#L2323, so I would not block on those tracked scope cuts.

Methodology

P0 The continuous estimator is not implementing the paper’s estimand. diff_diff/had.py#L1137, diff_diff/had.py#L1177, diff_diff/had.py#L121, and docs/methodology/REGISTRY.md#L2319 all treat the Phase 1 local-linear boundary estimate tau_bc as the final numerator and then divide by mean(D). Impact: for Design 1' the paper’s estimator is (E[ΔY] - lim_{d↓0} E[ΔY | D_2 ≤ d]) / E[D_2], and for Design 1 / WAS_{d̲} it is (E[ΔY] - lim_{d↓d̲} E[ΔY | D_2 ≤ d]) / E[D_2 - d̲]; the current code therefore returns the wrong point estimate and CI on both continuous paths with no warning. Concrete fix: in _fit_continuous(), compute dy_mean = dy_arr.mean(), use den = d_arr.mean() for continuous_at_zero and den = (d_arr - d_lower_val).mean() for continuous_near_d_lower, set att = (dy_mean - bc_fit.estimate_bias_corrected) / den, se = bc_fit.se_robust / den, and update the CI/docs/tests accordingly; if you transform the stored Phase 1 CI directly, the endpoints must reverse under subtraction. citeturn1view1turn0view1
P1 The continuous-path support checks are incomplete on the original dose scale. diff_diff/had.py#L404 validates NaNs and D_{g,1}=0 but never enforces post-period D_{g,2} >= 0; diff_diff/had.py#L1147 then shifts d_arr before calling the Phase 1 validator, which means negative original doses can be hidden and some d_lower < min(D) overrides can slip through the 5% heuristic in diff_diff/local_linear.py#L652. Impact: explicit design="continuous_near_d_lower" can fit out-of-scope data or an off-support threshold, silently changing the estimand. Concrete fix: validate aggregated d_arr before any shift, reject any d_arr < 0, and require d_lower == float(d_arr.min()) within tolerance on continuous_near_d_lower, mirroring the existing mass-point guard at diff_diff/had.py#L992. citeturn1view1turn0view1
P1 Design 1 fits do not surface the extra identification assumptions the registry requires. After resolving the design, diff_diff/had.py#L1035 only warns about ignored vcov_type/cluster knobs and then fits continuous_near_d_lower / mass_point silently. Impact: the paper and registry say that when d̲ > 0, WAS is not point-identified under Assumptions 1-3 alone; users need Assumption 5 for sign identification or Assumption 6 for WAS_{d̲}, and those assumptions are not pre-trend-testable. Concrete fix: emit an explicit warning whenever resolved_design is continuous_near_d_lower or mass_point, or require an assumption={"5","6"} parameter and make target_parameter reflect that choice instead of always reporting WAS_d_lower. citeturn1view1turn0view1

Code Quality

P2 The mass-point variance-family contract is internally inconsistent. diff_diff/had.py#L768 says vcov_type=None defaults to "hc1", but diff_diff/had.py#L1069 maps the default constructor state (robust=False) to "classical". With cluster=, diff_diff/had.py#L1075 computes CR1 anyway, while diff_diff/had.py#L1085 still stores the pre-cluster label. Impact: users cannot rely on result.vcov_type or the docstring to know which SE family they actually got. Concrete fix: resolve and store the effective family explicitly, and make the docstring/defaults consistent with the implementation.

Performance

No findings.

Maintainability

No independent finding beyond the methodology/contract issues above.

Tech Debt

P3 The unsupported aggregate="event_study", survey=, weights=, mass-point hc2/hc2_bm, and continuous-path cluster= gaps are already tracked in TODO.md#L94 through TODO.md#L98 and documented in docs/methodology/REGISTRY.md#L2323. No action required for this PR on those items.

Security

No findings.

Documentation/Tests

P2 The new continuous-path tests encode the wrong estimator formula. tests/test_had.py#L234 through tests/test_had.py#L290 assert att == tau_bc / D̄ and CI == ci_tau / D̄, so they will pass even while the estimator is returning the wrong quantity. Impact: the test suite now protects the bug instead of the paper. Concrete fix: replace those assertions with Equation 7 / Theorem 3 checks using (dy.mean() - bc_fit.estimate_bias_corrected) / den and the corresponding reversed CI endpoints, and add regression coverage for negative post-period doses plus d_lower < min(D) on continuous_near_d_lower.

Path to Approval

Fix the continuous estimator in diff_diff/had.py#L1177 so it uses dy_arr.mean() - tau_bc and the correct denominator for each design, then update the matching docs at diff_diff/had.py#L121 and docs/methodology/REGISTRY.md#L2319.
Add front-door validation on the original post-period dose: reject any D_{g,2} < 0, and require d_lower == min(D_{g,2}) on continuous_near_d_lower, not just on mass_point.
Surface the Design 1 identification choice explicitly by warning on Assumption 5/6 or by requiring an explicit assumption argument and storing that choice in results metadata.
Rewrite the new continuous-path tests at tests/test_had.py#L234 and tests/test_had.py#L279, and add regression tests for negative doses and off-support d_lower.
Align the mass-point vcov_type docs and stored metadata with the actual effective variance family.

Address CI AI review on PR #346. **P0 (Methodology — estimator formula was wrong):** The continuous paths computed `att = tau_bc / D_bar` instead of the paper's `att = (mean(ΔY) - tau_bc) / den`. On a known DGP with β = 0.3 the old code returned ~0 (since tau_bc ≈ lim_{d↓d̲} E[ΔY|D≤d] ≈ 0); the fix recovers the true β at n=4000. - Design 1' (continuous_at_zero), paper Equation 7 / Theorem 3: β = (E[ΔY] - lim_{d↓0} E[ΔY | D_2 ≤ d]) / E[D_2] Implementation: `att = (dy_arr.mean() - tau_bc) / d_arr.mean()`, `se = se_robust / |d_arr.mean()|`. - Design 1 (continuous_near_d_lower), paper Theorem 4 / `WAS_{d_lower}` under Assumption 6: β = (E[ΔY] - lim_{d↓d_lower} E[ΔY | D_2 ≤ d]) / E[D_2 - d_lower] Implementation: regressor shift `D - d_lower`, evaluate local-linear fit at boundary=0 on the shifted scale; `att = (dy_arr.mean() - tau_bc) / (d_arr - d_lower).mean()`. CI endpoints reverse under the subtraction `(ΔȲ - tau_bc)`; safe_inference computes `att ± z · se` from scratch so reversal is automatic. **P1 (Validator — incomplete contract on continuous paths):** - `_validate_had_panel` now rejects negative post-period doses (`D_{g,2} < 0`) front-door on the ORIGINAL unshifted scale so errors reference the user's dose column, not shifted values. - `continuous_near_d_lower` now enforces `d_lower == float(d.min())` within float tolerance, mirroring the mass-point guard. Off-support `d_lower` would otherwise pass through Phase 1c's 5% plausibility heuristic and silently identify a different estimand. **P1 (Identification — Assumption 5/6 not surfaced):** Design 1 paths (continuous_near_d_lower + mass_point) now emit a UserWarning stating that `WAS_{d_lower}` identification requires Assumption 6 (or Assumption 5 for sign identification only) beyond parallel trends, and that neither is testable via pre-trends. continuous_at_zero (Design 1', Assumption 3 only) does not emit the warning. **P2 (vcov_type docstring mismatch):** Corrected the constructor docstring to reflect actual behavior: `vcov_type=None` falls back to the `robust` flag (`robust=True -> hc1`, `robust=False -> classical`, the default); explicit `vcov_type` takes precedence over `robust`. **Tests rewritten + new regressions:** - `TestBetaScaleRescaling` now pins the corrected formula at atol=1e-14 (both designs), pins the CI endpoint reversal, and adds two "recover true β" asymptotic sanity tests. - New `TestDesign1DLowerContract` covers mass-point AND continuous_near_d_lower d_lower equality enforcement. - New `TestPostPeriodDoseContract` covers negative-dose rejection. - New `TestAssumptionFiveSixWarning` verifies the Design 1 warning is emitted on continuous_near_d_lower + mass_point and NOT on continuous_at_zero. **Docs:** REGISTRY.md HAD Phase 2a section updated with the corrected estimator formula, the d_lower contract on both Design 1 paths, the Assumption 5/6 warning note, and a CI-endpoint-reversal note. had.py module and result-class docstrings updated in kind. Targeted regression: 118 HAD tests + 497 total across Phase 1 and adjacent surfaces, all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-20T22:44:45Z

/ai-review

github-actions · 2026-04-20T22:53:16Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 8c38f072c0e6d2a8fb31715aa51cff9db5fa1cec

Overall Assessment

⚠️ Needs changes

Executive Summary

By inspection, the prior formula/support/assumption blockers look resolved: the continuous path now matches the paper’s Design 1′ and Design 1 estimands, and the PR now surfaces the extra Design 1 identification burden. citeturn0view2turn0view1
Severity P1 [Newly identified]: design="continuous_near_d_lower" can still be forced on a true mass-point sample, which bypasses the paper’s Section 3.2.4 mass-point/2SLS branch after the code shifts D - d_lower. citeturn0view1
Severity P1 [Newly identified]: the new first_treat_col validator assumes numeric period labels, so it will fail on otherwise supported string-labelled two-period panels.
The remaining scope cuts (event_study, survey, weights, continuous-path cluster, mass-point hc2/hc2_bm) are documented in REGISTRY.md / TODO.md, so I am not counting them as blockers.
I could only review the tests statically here because pytest is not installed in this environment.

Methodology

Severity P1 [Newly identified]. Impact: HAD / WAS Design 1 dispatch is still methodologically unsafe on explicit overrides. fit() dispatches straight into the continuous path at diff_diff/had.py:L1115, _fit_continuous() shifts to d_arr - d_lower at diff_diff/had.py:L1252, and the upstream mass-point rejection only fires when d.min() > 0 at diff_diff/local_linear.py:L620. On a sample with a real mass point at the original d_lower, that shift turns the support minimum into zero and suppresses the Section 3.2.4 branch that should use the sample-average / 2SLS estimator. Concrete fix: before shifting, re-run the modal-fraction check on the original d_arr; if the fraction at d.min() exceeds 2%, either force mass_point or raise on explicit continuous_near_d_lower. Add a regression using the existing mass-point DGP. citeturn0view1
Severity P3. Impact: the structural-residual 2SLS sandwich on the mass-point path is documented in the registry note, so under the review rules I treat it as informational rather than a defect. Code: diff_diff/had.py:L651, docs/methodology/REGISTRY.md:L2323. Concrete fix: none for this PR.

Code Quality

Severity P1 [Newly identified]. Impact: first_treat_col validation is not dtype-agnostic. At diff_diff/had.py:L482 the code coerces grouped first_treat values to float, and at diff_diff/had.py:L488 it coerces them to int. That breaks the supported string-period path exercised at tests/test_had.py:L1399: a valid two-period panel with period in {"A","B"} and first_treat_col in {0,"B"} will error before the intended contract check. Concrete fix: use pd.isna(ft) for missingness and raw equality against {0, t_post} without numeric coercion; add a regression with string period labels and first_treat_col="ft".
Severity P2. Impact: clustered mass-point results still store the requested vcov_type instead of the effective CR1 family. The computation switches to CR1 whenever cluster is not None at diff_diff/had.py:L739, but the result object stores vcov_effective unchanged at diff_diff/had.py:L1165. The text summary patches over this at diff_diff/had.py:L272, but to_dict() and downstream consumers still see "classical" / "hc1" rather than the actual clustered family. Concrete fix: store "cr1" when cluster is supplied, or split requested vs effective variance labels.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity P3. Impact: the deferred scope cuts are properly tracked, so they are not blockers for this PR: aggregate="event_study", survey=, weights=, continuous-path cluster=, and mass-point hc2 / hc2_bm. Code: TODO.md:L94, TODO.md:L98, docs/methodology/REGISTRY.md:L2323, docs/methodology/REGISTRY.md:L2326. Concrete fix: none for this PR.

Security

No findings.

Documentation/Tests

Severity P3. Impact: tests/test_had.py covers the earlier blockers, but it does not add regressions for the two remaining edge cases above. The explicit-override coverage at tests/test_had.py:L1064 misses continuous_near_d_lower on a mass-point sample, and the panel-validator coverage at tests/test_had.py:L1399 does not exercise first_treat_col with nonnumeric periods. Concrete fix: add those two regressions. I could not execute the suite here because pytest is unavailable.

Path to Approval

Add an original-scale modal-fraction guard before the continuous_near_d_lower path shifts d_arr. If modal_fraction(d.min()) > 0.02, fail fast or redirect to mass_point, and add a regression using the existing mass-point DGP.
Rewrite first_treat_col validation to be dtype-agnostic: use pd.isna for missingness and raw-value membership against {0, t_post}. Add a regression with period=("A","B") and first_treat_col in {0,"B"}.

**P1 #1 (Methodology): continuous_near_d_lower on mass-point samples** When a user explicitly forced design="continuous_near_d_lower" on a sample that actually satisfies the >2% modal-fraction mass-point criterion, the downstream regressor shift (D - d_lower) would move the support minimum to zero on the shifted scale. Phase 1c's mass-point rejection guard only fires when d.min() > 0 (_validate_had_inputs), so the silent coercion ran the nonparametric local-linear estimator on a sample the paper (Section 3.2.4) requires to use the 2SLS branch, producing the wrong estimand. Fix: `HeterogeneousAdoptionDiD.fit()` now runs the modal-fraction check on the ORIGINAL (unshifted) d_arr when the user explicitly selects design="continuous_near_d_lower". If the fraction at d.min() exceeds 2%, the fit raises ValueError pointing to design="mass_point" or design="auto". design="auto" is unaffected (_detect_design already correctly resolves such samples to mass_point). **P1 #2 (Code Quality): first_treat_col validator not dtype-agnostic** The previous validator called `.astype(np.float64)` and `int(v)` on grouped first_treat values, which crashed on otherwise-supported string-labelled two-period panels (period in {"A","B"}, first_treat in {0, "B"}). Rewrote using `pd.isna()` for missingness and raw-value set-membership against `{0, t_post}` with no numeric coercion. **P2 (Maintainability): cluster-applied mass-point stored wrong vcov_type** When cluster was supplied, `_fit_mass_point_2sls` unconditionally switches to the CR1 cluster-robust sandwich, but the result object stored the REQUESTED family ("hc1" or "classical") as `vcov_type`. `summary()` rendered correctly via the cluster_name branch, but `to_dict()` and downstream programmatic consumers saw the stale requested label. Fixed: when cluster is supplied, `vcov_type` is stored as `"cr1"` regardless of the requested family. Renamed the local variable from `vcov_effective` to `vcov_requested` to separate the input from the effective family. Updated the `HeterogeneousAdoptionDiDResults.summary()` branch so the cluster rendering still works with the new stored value. **Tests added (+8 regression):** - TestValidateHadPanel.test_first_treat_col_with_string_periods - TestValidateHadPanel.test_first_treat_col_dtype_agnostic_rejects_invalid_string - TestContinuousPathRejectsMassPoint (2 tests) - TestMassPointClusterLabel (4 tests: cr1 stored when clustered, base family when unclustered, classical+cluster collapses to cr1, to_dict shows effective family) Targeted regression: 126 HAD tests + 505 total across Phase 1 and adjacent surfaces, all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-20T22:58:13Z

/ai-review

github-actions · 2026-04-20T23:04:08Z

🔁 AI review rerun (requested by @igerber)

Head SHA: c320ddeb6bb6a8da191667a034a108734e66d8d5

Overall Assessment

⚠️ Needs changes

Executive Summary

The prior blockers from the last review look addressed: the original-scale mass-point guard is now before the regressor shift, first_treat_col validation is now dtype-agnostic, and clustered mass-point results now surface the effective cr1 label. See diff_diff/had.py:L481-L507, diff_diff/had.py:L1057-L1083, and diff_diff/had.py:L1204-L1210.
Severity P1: _validate_had_panel() still infers pre/post by lexicographically sorting the two time labels, so valid string-labeled panels such as "pre"/"post" or "before"/"after" will be reversed and then fail the D_{g,1}=0 check. See diff_diff/had.py:L398-L420 and diff_diff/had.py:L449-L459.
Severity P2: HeterogeneousAdoptionDiDResults.summary() still hardcodes the coefficient row label as WAS, even when the estimator correctly stores target_parameter="WAS_d_lower". See diff_diff/had.py:L292-L303.
The structural-residual 2SLS sandwich and the Phase 2a scope cuts (hc2/hc2_bm, event_study, survey, weights, continuous-path cluster) are documented in the registry/TODO, so I am not counting them as defects. See docs/methodology/REGISTRY.md:L2319-L2326 and TODO.md:L94-L98.
I could not execute the test suite here: pytest is not installed in this environment, and the base Python environment also lacks project deps such as numpy.

Methodology

Severity P1. Impact: the panel validator’s period-order inference is not actually safe for supported string labels. _validate_had_panel() sorts unique() period labels lexicographically at diff_diff/had.py:L398-L420, then applies the HAD assumption check D_{g,1}=0 using that inferred t_pre at diff_diff/had.py:L449-L459. For a valid two-period panel labeled ("pre", "post"), the sort order is ("post", "pre"), so the code checks the treated period as if it were pre-treatment and raises on a valid design. This is a real production edge-case on a changed assumption-check path, not just a test artifact. Concrete fix: infer t_pre as the unique period whose dose is zero for all units and t_post as the other period, or require explicit pre/post ordering from the caller; do not use lexicographic sort as a proxy for chronology.
Severity P3. Impact: the mass-point structural-residual 2SLS sandwich and the hc2/hc2_bm fail-fast behavior are documented deviations/deferrals, so they are informational only. See diff_diff/had.py:L649-L791, docs/methodology/REGISTRY.md:L2323-L2326, and TODO.md:L97-L98. Concrete fix: none for this PR.

Code Quality

Severity P2. Impact: the result summary can misstate the estimand on Design 1 paths. The result object correctly stores target_parameter="WAS_d_lower" for continuous_near_d_lower and mass_point, but the printed coefficient row is still hardcoded to WAS at diff_diff/had.py:L292-L303. That is a user-facing contract mismatch in the estimator’s main output surface. Concrete fix: render self.target_parameter in the summary row instead of a hardcoded label.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity P3. Impact: the remaining Phase 2a gaps are properly tracked, so they are not blockers for this review: aggregate="event_study", survey=, weights=, continuous-path cluster=, and mass-point hc2 / hc2_bm. See TODO.md:L94-L98 and docs/methodology/REGISTRY.md:L2323-L2326. Concrete fix: none for this PR.

Security

No findings.

Documentation/Tests

Severity P3. Impact: the new tests cover only lexically ordered string labels ("A", "B"), so they miss the new period-order bug; they also do not assert that Design 1 summaries print WAS_d_lower. See tests/test_had.py:L1399-L1421 and tests/test_had.py:L894-L941. Concrete fix: add a regression with period=("pre","post") and first_treat_col in {0, "post"}, and add summary-label assertions for continuous_near_d_lower and mass_point.

Path to Approval

Replace lexicographic period ordering in _validate_had_panel() with HAD-consistent pre/post inference. The safest implementation is to identify the pre period as the unique period where dose == 0 for all units, and the post period as the other period.
Add a regression test that fits a valid two-period panel with string labels like ("pre", "post") and first_treat_col values {0, "post"} so this validator bug cannot regress again.

…abel **P1 (Methodology): _validate_had_panel inferred pre/post by lexicographic sort** Previously the validator sorted the two period labels alphabetically and assigned `t_pre=periods[0]`, `t_post=periods[1]`. On supported string-labelled panels like `("pre", "post")` the alphabetic order is ["post", "pre"], so the code flipped pre and post and then raised on the treated-period D>0 check for a valid design. Same bug for `("before", "after")` and any non-alphabetic-chronological label pair. Fix: identify `t_pre` as the unique period where dose == 0 for ALL units (HAD paper Section 2 no-unit-untreated convention); `t_post` is the other period. This is a DGP-consistent invariant, not a string ordering. If neither period has all-zero dose, raise with the contract message and per-period nonzero-count diagnostics. If both periods have all-zero dose, raise (no treatment variation to estimate). The existing pre-period D=0 check is now tautological and has been removed since the inference itself enforces the invariant. Behavior on valid numeric panels (e.g., 2020/2021) is unchanged. **P2 (Code Quality): summary() hardcoded 'WAS' row label** `HeterogeneousAdoptionDiDResults.summary()` printed "WAS" as the parameter label regardless of the resolved design. For Design 1 paths (continuous_near_d_lower, mass_point) the stored `target_parameter` is "WAS_d_lower" per paper Sections 3.2.2-3.2.4, so the user-facing output misrepresented the estimand. Fix: render `self.target_parameter` in the summary row. Now Design 1' prints "WAS", Design 1 prints "WAS_d_lower", matching the stored result metadata. **Tests (+7 regression):** - TestValidateHadPanel.test_semantic_pre_post_labels_not_lexicographic - TestValidateHadPanel.test_semantic_pre_post_with_first_treat_col - TestValidateHadPanel.test_semantic_pre_post_fit_end_to_end - TestValidateHadPanel.test_before_after_labels - TestValidateHadPanel.test_no_all_zero_period_raises - TestValidateHadPanel.test_both_all_zero_periods_raises - TestResultMethods.test_summary_uses_target_parameter_for_row_label Targeted regression: 133 HAD tests + 512 total across Phase 1 and adjacent surfaces, all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-20T23:11:45Z

/ai-review

github-actions · 2026-04-20T23:19:25Z

🔁 AI review rerun (requested by @igerber)

Head SHA: ddc09e4341f528862de4a62505032d314fa8deea

Overall Assessment

⚠️ Needs changes

Executive Summary

The two issues from the previous AI review are resolved: period ordering is now inferred from the all-zero pre-period instead of lexicographic sort, and the summary row now renders target_parameter rather than hardcoding WAS. See diff_diff/had.py:L292-L304, diff_diff/had.py:L447-L475, tests/test_had.py:L904-L936, and tests/test_had.py:L1493-L1508.
Severity P1 [Newly identified]: the new explicit design= override contract still lets paper-incompatible Design 1 / mass-point paths run on d.min()==0 samples, even though the registry and method description reserve that regime for Design 1'. See docs/methodology/REGISTRY.md:L2147-L2156, docs/methodology/REGISTRY.md:L2239-L2241, diff_diff/had.py:L828-L837, diff_diff/had.py:L1065-L1168, and tests/test_had.py:L1111-L1124.
The structural-residual 2SLS sandwich, hc2/hc2_bm fail-fast behavior, and the Phase 2a scope cuts (event_study, survey, weights, continuous-path cluster) are documented in the registry/TODO, so I am not counting them as defects. See docs/methodology/REGISTRY.md:L2323-L2326 and TODO.md:L94-L98.
The new tests now cover the prior regressions, but one test currently codifies the unsupported design="mass_point", d_lower=0.0 override as expected behavior. See tests/test_had.py:L1111-L1124.
I could not execute the suite here because pytest, numpy, and pandas are not installed in this environment.

Methodology

Severity P1 [Newly identified]. Impact: the Phase 2a registry text is regime-based, not user-override-based: when d̲=0, the documented path is Design 1' (continuous_at_zero), while continuous_near_d_lower and mass_point are the d̲>0 cases. The new estimator intentionally advertises “explicit overrides run the chosen path without auto-reject,” and there is no fit-time guard enforcing d_lower > 0 for continuous_near_d_lower or mass_point. As shipped, design="mass_point", d_lower=0.0 on a d.min()==0 sample returns a finite 2SLS result, and design="continuous_near_d_lower" on the same regime would run the Design 1' algebra while labeling the target WAS_d_lower and emitting the Assumption 5/6 warning. That is an undocumented methodology deviation from the source-material dispatch rules. See docs/methodology/REGISTRY.md:L2147-L2156, docs/methodology/REGISTRY.md:L2239-L2241, docs/methodology/REGISTRY.md:L2320-L2324, diff_diff/had.py:L109-L115, diff_diff/had.py:L831-L834, diff_diff/had.py:L1065-L1168, diff_diff/had.py:L1307-L1314, and tests/test_had.py:L1111-L1124. Concrete fix: reject design="mass_point" and design="continuous_near_d_lower" whenever the resolved d_lower is zero within tolerance, with an error pointing users to design="continuous_at_zero" or design="auto".

Code Quality

No findings. The prior summary-label mismatch is fixed in diff_diff/had.py:L292-L304.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity P3. Impact: the remaining Phase 2a scope cuts and deferred variance work are properly tracked, so they are informational rather than blocking: aggregate="event_study", survey=, weights=, continuous-path cluster=, and mass-point vcov_type in {"hc2","hc2_bm"}. See docs/methodology/REGISTRY.md:L2323-L2326 and TODO.md:L94-L98. Concrete fix: none for this PR.

Security

No findings.

Documentation/Tests

Severity P3. Impact: the override contract is now internally inconsistent. The public docstring says explicit overrides run “without auto-reject,” while the tests encode one override that raises (continuous_at_zero on mass-point data) and another that succeeds (mass_point on d.min()==0 data). That ambiguity will confuse users and will regress once the methodology guard above is fixed. See diff_diff/had.py:L828-L834 and tests/test_had.py:L1101-L1124. Concrete fix: after tightening the design guard, rewrite the design docstring to describe the supported override behavior precisely, and invert tests/test_had.py:L1111-L1124 into a rejection regression.
Review limitation: I could not run tests/test_had.py here because pytest, numpy, and pandas are not installed.

Path to Approval

Add a fit-time guard in HeterogeneousAdoptionDiD.fit() that rejects design="mass_point" and design="continuous_near_d_lower" when the resolved d_lower is zero within the same tolerance family used by _detect_design, and direct users to design="continuous_at_zero" / design="auto".
Replace tests/test_had.py:L1111-L1124 with a regression that expects ValueError, and add the parallel rejection test for design="continuous_near_d_lower" on a d.min()==0 sample.
Update the design parameter docstring in diff_diff/had.py:L828-L837 so the supported override contract matches the guarded behavior.

**P1 (Methodology): Design 1 paths must reject d_lower = 0** Paper Section 3.2 partitions HAD by regime: `d_lower = 0` is Design 1' (`continuous_at_zero`); `d_lower > 0` is Design 1 (`continuous_near_ d_lower` or `mass_point`). The auto-detect rule already respects this partition, but explicit overrides previously allowed `design="mass_point", d_lower=0` and `design="continuous_near_d_lower"` on a `d.min()==0` sample to run silently, returning a paper-incompatible estimand (2SLS with degenerate single-unit mass for mass_point; Design 1' algebra relabeled as `WAS_d_lower` with a spurious Assumption 5/6 warning for continuous_near_d_lower). Fix: add a fit-time guard that raises `ValueError` when `resolved_design in ("mass_point", "continuous_near_d_lower")` and the resolved `d_lower_val` is within float tolerance of zero (same tolerance family as `_detect_design`'s d.min()==0 tie-break). The error message points users to `continuous_at_zero` or `auto` for samples with support infimum at zero. **Docstring + test updates:** - Rewrote the `design` parameter docstring to document the regime-partition contract precisely: each explicit override is now described with its d_lower precondition and mass-point compatibility. - Rewrote the `d_lower` parameter docstring to note the Design-1-requires-positive contract. - Inverted the prior `test_force_mass_point_on_continuous_data_at_ support_infimum` test (which incorrectly codified the unsupported behavior) into three rejection regressions: `test_force_mass_point_on_d_lower_zero_sample_raises`, `test_force_continuous_near_d_lower_on_d_lower_zero_sample_raises`, `test_force_mass_point_d_lower_none_on_zero_sample_raises`. Targeted regression: 135 HAD tests + 514 total across Phase 1 and adjacent surfaces, all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-20T23:23:39Z

/ai-review

github-actions · 2026-04-20T23:31:09Z

🔁 AI review rerun (requested by @igerber)

Head SHA: e622d60482fa06544f2ca0a51c3e06584c4e0b83

Overall Assessment

⚠️ Needs changes

Executive Summary

The previous re-review issues look resolved: _validate_had_panel() now infers pre/post from the all-zero period, summary() now renders target_parameter, and the zero-boundary override problem for Design 1 / mass-point paths is now rejected. diff_diff/had.py:L292-L304 diff_diff/had.py:L447-L475 diff_diff/had.py:L1096-L1117 tests/test_had.py:L1111-L1140
Severity P1 [Newly identified]: explicit design="mass_point" still has no reciprocal regime check against the continuous-near-d̲ case. The PR guards continuous_near_d_lower against mass-point samples, but it does not guard mass_point against samples that only satisfy the continuous boundary-limit design, so Section 3.2.4’s 2SLS path can still run outside the documented mass-point regime. docs/methodology/REGISTRY.md:L2240-L2241 docs/methodology/REGISTRY.md:L2319-L2324 diff_diff/had.py:L602-L657 diff_diff/had.py:L833-L853 diff_diff/had.py:L1129-L1262
The structural-residual 2SLS sandwich, CR1 labeling, and safe_inference gating are aligned with the new registry notes and do not look methodologically problematic. diff_diff/had.py:L679-L806 diff_diff/had.py:L1276-L1300 docs/methodology/REGISTRY.md:L2323-L2325
The Phase 2a scope cuts (event_study, survey, weights, continuous-path cluster, mass-point hc2/hc2_bm) are explicitly tracked in TODO.md / REGISTRY.md, so I did not count them as blockers. TODO.md:L94-L102 docs/methodology/REGISTRY.md:L2323-L2326
I could not run tests/test_had.py here because this environment does not have numpy or pytest installed.

Methodology

Severity P1 [Newly identified]. Impact: the paper/registry split Design 1 into two different estimators: 2SLS only for the mass-point-at-d̲ case, and the shifted local-linear boundary-limit estimator for the continuous-near-d̲ case. The new code enforces one side of that split by rejecting continuous_near_d_lower on obvious mass-point samples, but there is no reciprocal guard before the mass_point dispatch. As written, any explicit design="mass_point" call with d_lower > 0 and d_lower == d.min() goes straight into _fit_mass_point_2sls(), even when the sample only fits the continuous-near-d̲ regime. That silently changes the control object from the boundary limit lim_{d \downarrow d̲} E[ΔY | D_2 \le d] to the exact-d.min() cell, which is not the documented Design 1 continuous estimator. docs/methodology/REGISTRY.md:L2240-L2241 docs/methodology/REGISTRY.md:L2319-L2324 diff_diff/had.py:L833-L853 diff_diff/had.py:L1129-L1145 diff_diff/had.py:L1250-L1262
Concrete fix: either add the reciprocal front-door rejection for design="mass_point" when the sample does not satisfy the same modal-min mass-point rule used by _detect_design(), or explicitly document in REGISTRY.md that design="mass_point" is an expert override that intentionally bypasses the heuristic. In either case, codify that contract in tests.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity P3. Impact: the remaining Phase 2a scope cuts and deferred variance work are properly tracked, so they are informational rather than blocking. TODO.md:L94-L102 docs/methodology/REGISTRY.md:L2323-L2326
Concrete fix: none for this PR.

Security

No findings.

Documentation/Tests

Severity P3. Impact: the new tests now pin the fixed zero-boundary rejection and the continuous_near_d_lower-on-mass-point rejection, but they still do not pin the reciprocal mass_point-on-continuous contract. That leaves the methodology gap above unguarded in the suite. tests/test_had.py:L1111-L1140 tests/test_had.py:L1556-L1577
Concrete fix: add a regression using _dgp_continuous_near_d_lower() that either expects ValueError or asserts the documented expert-override behavior, whichever contract you choose.

Path to Approval

Decide whether explicit design="mass_point" is supposed to obey the same modal-min mass-point regime check as design="auto". If yes, add the fit-time rejection before _fit_mass_point_2sls(). If no, add a **Note:** block in docs/methodology/REGISTRY.md and align the design docstring so the bypass is explicit and intentional.
Add a regression in tests/test_had.py on _dgp_continuous_near_d_lower() that codifies that decision.
Re-run tests/test_had.py in an environment with numpy and pytest installed.

**P1 (Methodology): reciprocal regime check for design="mass_point"** Round 2 added a guard that rejects `design="continuous_near_d_lower"` on samples with modal-min fraction > 2% (mass-point samples). The reciprocal direction was left unguarded: `design="mass_point"` on a continuous-near-d_lower sample (modal fraction <= 2%) would silently run 2SLS, but the instrument Z = 1{D > d.min()} degenerates because the "mass" side has almost no observations. The resulting point estimate identifies the exact-d.min() cell rather than the paper's boundary-limit estimand (Section 3.2.4 2SLS is only the documented Design 1 estimator when the modal mass exists). Fix: extend the existing pre-shift regime check so it fires symmetrically on both explicit overrides. When `resolved_design == "mass_point"` and the modal fraction at d.min() does NOT exceed 2%, raise ValueError pointing users to `continuous_near_d_lower` or `auto`. Same 2% threshold used by `_detect_design()` and Phase 1c's `_validate_had_inputs()` so the three dispatch paths share one regime rule. **Docs:** Updated the `design` parameter docstring to document the reciprocal mass-point contract. **Tests (+4 regression):** - TestMassPointPathRejectsContinuousSample.test_mass_point_on_continuous_near_sample_raises - TestMassPointPathRejectsContinuousSample.test_mass_point_on_true_mass_point_sample_runs - TestMassPointPathRejectsContinuousSample.test_mass_point_modal_at_threshold_runs (2.5%) - TestMassPointPathRejectsContinuousSample.test_mass_point_modal_exactly_two_percent_raises Targeted regression: 139 HAD tests + 518 total across Phase 1 and adjacent surfaces, all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-20T23:41:33Z

/ai-review

github-actions · 2026-04-20T23:47:59Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 2884c1d81adf576edc6f14397b8cb97880a28cbe

Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1. The prior re-review blocker about forcing design="mass_point" on continuous-near-d̲ samples appears resolved, but the new estimator still ships as panel-only while the project’s HAD methodology docs still describe the method as applicable to panel or repeated cross-section data, and that restriction is not documented as a registry **Note:** / **Deviation**. See diff_diff/had.py:L1123, tests/test_had.py:L1580, docs/methodology/REGISTRY.md:L2146, and docs/methodology/papers/dechaisemartin-2026-review.md:L23.

Executive Summary

The previous P1 from the last AI review looks addressed: explicit design="mass_point" now rejects continuous-near-d̲ samples, and the regression tests cover that path. See diff_diff/had.py:L1154 and tests/test_had.py:L1587.
The structural-residual 2SLS sandwich, CR1 labeling, and safe_inference routing are aligned with the new registry notes; I did not find a new variance/NaN regression in the changed estimator code. See diff_diff/had.py:L665, diff_diff/had.py:L1297, docs/methodology/REGISTRY.md:L2323, and docs/methodology/REGISTRY.md:L2325.
Severity P1 [Newly identified]: HeterogeneousAdoptionDiD is currently panel-only, but the HAD registry entry and paper review still state panel-or-repeated-cross-section scope. That is an undocumented methodology-surface mismatch. See docs/methodology/REGISTRY.md:L2146, docs/methodology/papers/dechaisemartin-2026-review.md:L23, diff_diff/had.py:L995, and diff_diff/had.py:L421.
Severity P3 [Newly identified]: several theorem/equation references in had.py do not match the registry’s source map, which makes later methodology audits harder even though the computations themselves look right. See diff_diff/had.py:L12, diff_diff/had.py:L22, docs/methodology/REGISTRY.md:L2173, docs/methodology/REGISTRY.md:L2196, and docs/methodology/REGISTRY.md:L2262.
The Phase 2a scope cuts (event_study, survey, weights, hc2/hc2_bm, continuous-path cluster) are properly tracked in TODO.md / REGISTRY.md, so I did not count them as blockers. See TODO.md:L93, TODO.md:L101, and docs/methodology/REGISTRY.md:L2323.
I could not run tests/test_had.py here because this environment does not have numpy, pandas, or pytest installed.

Methodology

Severity P1 [Newly identified]. Impact: the project’s HAD methodology docs still say the estimator applies to panel or repeated cross-section data, but the shipped API is panel-only: fit() is documented as operating on a “two-period panel,” requires unit_col, and _validate_had_panel() hard-enforces a balanced two-period panel. The new HAD TODO rows track other Phase 2a deferrals, but not repeated-cross-section support. That makes the shipped surface narrower than the documented method without a registry **Note:** / **Deviation** recording the restriction. Concrete fix: either implement a repeated-cross-section path, or add an explicit HAD registry **Note:** that Phase 2a is panel-only, mirror that in the paper review and had.py docstrings, and add a regression that asserts the intended rejection on repeated-cross-section-shaped input. See docs/methodology/REGISTRY.md:L2146, docs/methodology/papers/dechaisemartin-2026-review.md:L23, diff_diff/had.py:L1007, diff_diff/had.py:L421, tests/test_had.py:L56, and TODO.md:L93.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity P3. Impact: the remaining Phase 2a scope cuts are explicitly tracked, so they are informational rather than blocking for this PR. Concrete fix: none for this PR. See TODO.md:L93, TODO.md:L101, and docs/methodology/REGISTRY.md:L2326.

Security

No findings.

Documentation/Tests

Severity P3 [Newly identified]. Impact: had.py cites Design 1' as “Equation 7 / Theorem 3” and the continuous Design 1 path as “Proposition 3 / Theorem 4,” but the registry maps those source points to Theorem 1 / Equation 3 for Design 1' identification and Theorem 3 / Equation 11 for WAS_{d̲}; Theorem 4 in the registry is the QUG test. This does not change the computation, but it leaves the in-code references inconsistent with the source material. Concrete fix: update the module docstring, result docstring, and _fit_continuous() docstring to cite Theorem 1 / Equation 3 for Design 1' identification, Equation 7 for the sample estimator, and Theorem 3 / Equation 11 for WAS_{d̲}. See diff_diff/had.py:L12, diff_diff/had.py:L22, diff_diff/had.py:L138, diff_diff/had.py:L143, docs/methodology/REGISTRY.md:L2173, docs/methodology/REGISTRY.md:L2196, and docs/methodology/REGISTRY.md:L2262.
I could not execute tests/test_had.py here because numpy, pandas, and pytest are unavailable in this environment.

Path to Approval

Add an explicit HAD registry **Note:** / **Deviation** that Phase 2a is panel-only and does not yet support repeated cross-sections, and mirror that restriction in docs/methodology/papers/dechaisemartin-2026-review.md and the HeterogeneousAdoptionDiD.fit() docstring.
Add a regression test that constructs a repeated-cross-section-shaped input and asserts the intended panel-only error message. If the intent is to support repeated cross-sections now, implement that path instead.

**P1 (Methodology): Phase 2a panel-only not documented as deviation** Paper Section 2 defines HAD on panel OR repeated cross-section data, but Phase 2a ships a panel-only implementation. The existing balanced-panel validator rejects RCS inputs (disjoint unit IDs between periods) with a generic "unit(s) do not appear in both periods" error, but the scope restriction was not documented in REGISTRY.md as a `**Note:**` deviation from the paper's documented surface. Fix: - REGISTRY.md: new `**Note (Phase 2a panel-only):**` block under the HAD Phase 2a entry, explaining the restriction and pointing at the follow-up RCS PR. - `docs/methodology/papers/dechaisemartin-2026-review.md`: mirrored implementation note on the panel-or-RCS scope line. - `HeterogeneousAdoptionDiD.fit()` docstring: new preamble paragraph stating the panel-only restriction and why (unit-level first differences required). - `TODO.md`: new row queuing RCS identification-path work. - `tests/test_had.py`: two regression tests (`test_repeated_cross_section_raises` and `test_repeated_cross_section_fit_raises`) that construct RCS-shaped input and assert the validator + fit() raise with the expected error. **P3 (Docs/Tests): theorem/equation references were off** Module docstring, result-class `att` docstring, and `_fit_continuous()` docstring cited the paper's theorems/equations inconsistently with the registry's source map: - Design 1' identification: said "Theorem 3" -> should be Theorem 1 with Equation 3; Equation 7 is the sample estimator. - Design 1 continuous-near-d_lower: said "Proposition 3 / Theorem 4" -> should be Theorem 3 / Equation 11 (Theorem 4 is the QUG null test, not this estimand). Fix: updated all three docstring sites to cite Theorem 1 / Equation 3 (Design 1' identification) + Equation 7 (sample estimator), and Theorem 3 / Equation 11 (Design 1 continuous-near-d_lower / WAS_d_lower under Assumption 6). Targeted regression: 141 HAD tests + 520 total across Phase 1 and adjacent surfaces, all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-20T23:54:30Z

/ai-review

github-actions · 2026-04-21T00:02:06Z

🔁 AI review rerun (requested by @igerber)

Head SHA: afa0f6c68586a1e15a09d1dfd19badc0402e52d5

Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1. The previous re-review blocker about Phase 2a being panel-only looks resolved via the new registry note, paper-review note, TODO tracking, and repeated-cross-section tests. The remaining blocker is a new parameter-interaction bug: cluster= is documented as ignored on continuous designs, but the implementation validates/extracts cluster data before design dispatch, so a missing or malformed cluster column can still abort an otherwise valid continuous fit. See diff_diff/had.py:L889, diff_diff/had.py:L1081, diff_diff/had.py:L1252, and TODO.md:L97.

Executive Summary

The prior panel-only mismatch is now mitigated in docs/methodology/REGISTRY.md:L2327, docs/methodology/papers/dechaisemartin-2026-review.md:L23, and TODO.md:L102, with regression coverage in tests/test_had.py:L1549.
I did not find a new unmitigated math/inference defect in the estimator core.
Severity P1: continuous-path fits do not truly ignore cluster=; early cluster validation can raise before the warning-only ignore branch is reached. See diff_diff/had.py:L1081, diff_diff/had.py:L573, and diff_diff/had.py:L1252.
Severity P3: the new Phase 2a checklist line in the Methodology Registry still cites the wrong theorem numbers, even though had.py and the paper review were corrected. See docs/methodology/REGISTRY.md:L2319, diff_diff/had.py:L12, and docs/methodology/papers/dechaisemartin-2026-review.md:L50.
I could not execute tests/test_had.py here because numpy, pandas, pytest, and scipy are not installed in this environment.

Methodology

No unmitigated P0/P1 findings. The prior panel-only scope mismatch is now explicitly documented and tracked in docs/methodology/REGISTRY.md:L2327, docs/methodology/papers/dechaisemartin-2026-review.md:L23, and TODO.md:L102, with rejection coverage in tests/test_had.py:L1549.
Severity P3. Impact: the new Phase 2a checklist entry in the Methodology Registry still maps Design 1' to “Equation 7 / Theorem 3” and the continuous-near-d̲ path to “Theorem 4”, while the main HAD registry section, the updated estimator docstrings, and the paper review all correctly use Theorem 1 / Equation 3 for Design 1' and Theorem 3 / Equation 11 for WAS_{d̲}. This does not change the implementation, but it leaves the project’s source-of-truth methodology map internally inconsistent. Concrete fix: update docs/methodology/REGISTRY.md:L2319 to match diff_diff/had.py:L12, diff_diff/had.py:L23, and docs/methodology/papers/dechaisemartin-2026-review.md:L50.

Code Quality

Severity P1. Impact: cluster is a new public parameter, and the public contract says it is ignored with a warning on continuous designs. In practice, fit() always passes self.cluster into _aggregate_first_difference() before design resolution, and that helper immediately raises if the cluster column is missing, non-constant within unit, or contains NaNs. So a valid continuous fit can fail just because a shared config supplied an irrelevant cluster=. That is a parameter-interaction bug on a new code path, and the current test coverage only exercises the happy path with a valid cluster column. Concrete fix: resolve the design before cluster extraction, or call _aggregate_first_difference(..., cluster_col=None) unless the resolved design is mass_point; then add regressions for design="continuous_at_zero" and design="auto" on continuous samples where cluster is missing/malformed and should only trigger the documented warning. See diff_diff/had.py:L889, diff_diff/had.py:L1081, diff_diff/had.py:L573, diff_diff/had.py:L1252, TODO.md:L97, and tests/test_had.py:L1332.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity P3. Impact: the remaining scoped deferments for event_study, survey, weights, hc2/hc2_bm, and repeated-cross-section support are now explicitly tracked in TODO.md / REGISTRY.md, so I am not counting those limitations as blockers for this PR. Concrete fix: none for this PR. See TODO.md:L93, TODO.md:L96, TODO.md:L102, and docs/methodology/REGISTRY.md:L2323.

Security

No findings.

Documentation/Tests

No additional findings. tests/test_had.py is broad and appears to cover the prior review blockers, but I could not execute it here because numpy, pandas, pytest, and scipy are unavailable in this environment.

Path to Approval

Make continuous designs truly ignore cluster= by deferring cluster extraction/validation until after design resolution, or by skipping cluster aggregation entirely unless resolved_design == "mass_point".
Add regressions that a continuous fit (design="continuous_at_zero" and design="auto" resolving to a continuous path) still succeeds with only a warning when cluster is irrelevant, including cases where the named cluster column is missing, contains NaNs, or varies within unit.

…egistry refs **P1 (Code Quality): cluster= must truly be ignored on continuous paths** `HeterogeneousAdoptionDiD.fit()` previously passed `self.cluster` into `_aggregate_first_difference()` before the design was resolved. The aggregator validates the cluster column eagerly (missing column, within-unit variance, NaN ID), so a valid continuous fit could abort just because a shared config supplied an irrelevant `cluster=`. This contradicted the documented "ignored with a warning on continuous paths" contract. Fix: defer cluster extraction until after design resolution. The first aggregation call now passes `cluster_col=None` unconditionally; a second aggregation pass with `cluster_col=cluster_arg` runs only when `resolved_design == "mass_point"`, which is the only path that consumes the extracted cluster array. Continuous paths emit the existing `UserWarning` and proceed to fit without touching the cluster column at all. **P3 (Methodology): registry checklist theorem references were stale** Round 6 fixed the theorem citations in `had.py` and the paper review doc but missed the Phase 2a checklist line in `REGISTRY.md`, which still said "Equation 7 / Theorem 3" for Design 1' identification and "Theorem 4, WAS_{d̲} under Assumption 6" for the continuous-near-d_lower path. Updated the checklist line to match: Theorem 1 / Equation 3 (identification) + Equation 7 (sample estimator) for Design 1'; Theorem 3 / Equation 11 for WAS_{d̲}. **Tests (+4 regression):** - test_missing_cluster_column_on_continuous_only_warns: continuous_at_zero + cluster='does_not_exist' -> warn + fit succeeds. - test_nan_cluster_on_continuous_only_warns: NaN cluster IDs on continuous path -> warn + fit succeeds. - test_within_unit_varying_cluster_on_continuous_only_warns: within-unit- varying cluster IDs on continuous -> warn + fit succeeds. - test_auto_design_ignores_irrelevant_cluster_on_continuous: design='auto' resolving to continuous_at_zero also ignores cluster gracefully. Targeted regression: 145 HAD tests + 524 total across Phase 1 and adjacent surfaces, all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-21T00:32:30Z

/ai-review

github-actions · 2026-04-21T00:40:14Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 792997dd2c40b5d844aae52b0420d9212333a73d

Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1.

Executive Summary

The prior rerun blocker about continuous designs eagerly validating cluster= looks fixed: fit() now resolves the design before any cluster extraction, and the new regressions cover both explicit continuous fits and design="auto" resolving to a continuous path. diff_diff/had.py:L1081-L1128 tests/test_had.py:L1359-L1409
The earlier panel-only scope mismatch also appears properly mitigated: the registry, paper review, and TODO.md now document Phase 2a as panel-only, and there are repeated-cross-section rejection tests. docs/methodology/REGISTRY.md:L2326-L2327 docs/methodology/papers/dechaisemartin-2026-review.md:L22-L24 TODO.md:L101-L102 tests/test_had.py:L1601-L1637
I did not find a new unmitigated variance/SE defect in the estimator core: the mass-point path uses the documented structural-residual 2SLS sandwich, and all exported inference fields still go through safe_inference(). diff_diff/had.py:L668-L810 diff_diff/had.py:L1335-L1359 docs/methodology/REGISTRY.md:L2323-L2325
Severity P1: the newly documented “d_lower == float(d.min()) within float tolerance” contract is not propagated downstream. Accepted tolerance-sized deviations are used verbatim in the strict mass-point instrument and in the continuous-path regressor shift, so accepted inputs can either collapse the mass-point control group or trigger a negative-shift rejection on the continuous path. diff_diff/had.py:L861-L867 diff_diff/had.py:L1220-L1235 diff_diff/had.py:L1237-L1245 diff_diff/had.py:L1417-L1445 diff_diff/local_linear.py:L564-L581
Severity P2: the new per-unit validators for first_treat_col and mass-point cluster= use pandas reductions that skip NaNs / later bad rows, so mixed-row malformed inputs can bypass the claimed front-door checks. Current tests only cover the fully repeated invalid cases. diff_diff/had.py:L510-L526 diff_diff/had.py:L573-L595 tests/test_had.py:L645-L660 tests/test_had.py:L1322-L1330
I could not execute tests/test_had.py here because numpy, pandas, pytest, and scipy are not installed in this environment.

Methodology

Severity P1. Impact: the registry and class docstring both advertise that Design 1 paths accept d_lower == float(d.min()) within float tolerance, but once that tolerance check passes the implementation keeps the user-supplied value instead of snapping it back to d.min(). On the mass-point path, a tolerated d_lower = d.min() - ε changes the instrument 1{D > d_lower} so the mass point itself moves into Z=1, which can empty the control group while the reported counts still treat those units as “at d_lower”. On the continuous-near-d_lower path, a tolerated d_lower = d.min() + ε makes d_arr - d_lower slightly negative and then trips Phase 1c’s negative-dose guard. This is a contract violation on a methodology-defining threshold, not just a cosmetic tolerance issue. Concrete fix: after the tolerance check passes, normalize d_lower_val = float(d_arr.min()) before any strict > comparisons, count construction, or regressor shifting; then add regressions for both directions of the tolerated perturbation. docs/methodology/REGISTRY.md:L2321-L2323 diff_diff/had.py:L861-L867 diff_diff/had.py:L1220-L1235 diff_diff/had.py:L1237-L1245 diff_diff/had.py:L1417-L1420 diff_diff/local_linear.py:L564-L581 tests/test_had.py:L1184-L1190 tests/test_had.py:L1208-L1214

Code Quality

Severity P2. Impact: _validate_had_panel() reduces first_treat_col with groupby(...).first(), and _aggregate_first_difference() uses nunique() plus first() for cluster. Those operations drop NaNs and ignore later bad rows, so units like [valid, NaN] or [valid, invalid] can pass the new validators even though the implementation comments and error messages say NaNs / out-of-domain values are rejected front-door. This does not change the core estimator formula, but it weakens the advertised input-contract checks on new public parameters. Concrete fix: validate raw per-row values before collapsing to one value per unit (nunique(dropna=False), explicit per-unit isna().any(), and explicit per-row domain checks), then collapse only after validation passes. diff_diff/had.py:L510-L526 diff_diff/had.py:L573-L595

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. The deferred Phase 2a scope items that remain out of scope here (event_study, survey, weights, hc2 / hc2_bm, repeated cross-section support, continuous-path cluster= threading) are now explicitly tracked in TODO.md / REGISTRY.md, so I am not counting them against approval. TODO.md:L93-L102 docs/methodology/REGISTRY.md:L2323-L2328

Security

No findings.

Documentation/Tests

No additional committed-docs defect beyond the issues above. The new registry and paper-review notes are materially better aligned with the implementation than in the previous rerun.
Runtime verification is still pending on my side because this environment does not have numpy, pandas, pytest, or scipy, so I could not execute tests/test_had.py.
Severity P2. Impact: the new test suite misses the two acceptance-band edge cases created by the d_lower tolerance contract and the mixed-row validator cases for first_treat_col / cluster=. The current tolerance coverage only exercises mass_point with d_lower = d.min() + 1e-15, which is the harmless side of the band. Concrete fix: add regressions for mass_point with d_lower = d.min() - ε, continuous_near_d_lower with d_lower = d.min() + ε, and mixed valid/NaN or valid/invalid per-unit rows for both first_treat_col and cluster=. tests/test_had.py:L1184-L1190 tests/test_had.py:L1208-L1214 tests/test_had.py:L645-L660 tests/test_had.py:L1322-L1330

Path to Approval

Snap any tolerance-accepted d_lower override to float(d_arr.min()) before constructing the mass-point instrument, cohort counts, or the continuous-path shifted regressor.
Add regressions for both sides of the tolerance band:
design="mass_point", d_lower=float(d.min()) - 1e-15 should behave identically to the exact-minimum case, and design="continuous_near_d_lower", d_lower=float(d.min()) + 1e-15 should not trip the shifted-dose rejection.
Strengthen first_treat_col and cluster= validation to check raw per-row values before groupby().first(), and add regressions where one row in a unit is valid and the other is NaN or out of domain.

…idation **P1 (Methodology): snap tolerance-accepted d_lower to float(d.min())** The `d_lower == float(d.min())` within-tolerance contract accepted user-supplied values that differed from the support infimum by float rounding noise (1e-15). Downstream algebra is not tolerance-aware: - `mass_point` + `d_lower = d.min() - ε`: `Z = d > d_lower` puts every mass-point unit (at exactly d.min()) into Z=1 because they are strictly > d_lower. The Z=0 control group empties and the Wald-IV ratio becomes undefined. - `continuous_near_d_lower` + `d_lower = d.min() + ε`: the regressor shift `d - d_lower` produces a minimum of `-ε` on the shifted scale. Phase 1c's `_validate_had_inputs` then raises on the negative-dose guard, aborting an otherwise acceptable fit. Fix: after the tolerance check passes on Design 1 paths, snap `d_lower_val = float(d_arr.min())` before the mass-point instrument construction, cohort counts, and the continuous-path regressor shift. This preserves the advertised tolerance contract while keeping downstream algebra exact. **P2 (Code Quality): row-level NaN/domain validation** The previous `first_treat_col` and `cluster=` validators collapsed each unit to a single value via `groupby().first()` before checking for NaN or out-of-domain values. pandas's `.first()` silently skips NaN, so a unit with rows `[valid, NaN]` would pass a unit-level check even though the raw NaN on the bad row should be rejected. Fix: for both `first_treat_col` (in `_validate_had_panel`) and `cluster=` (in `_aggregate_first_difference`), validate raw per-row values BEFORE the per-unit collapse: - Row-level NaN check on the raw column. - Row-level domain check on raw values (for first_treat_col). - Per-unit-constancy check using `nunique(dropna=False)` so a `[value, NaN]` within a unit registers as 2 distinct values. **Tests (+5 regression):** - test_mass_point_d_lower_below_min_within_tolerance_snaps: asserts d_lower = d.min() - 1e-15 collapses ULP-identically to the exact case across att, se, n_mass_point, n_above_d_lower. - test_continuous_near_d_lower_above_within_tolerance_snaps: asserts d_lower = d.min() + 1e-15 collapses to the exact case (no negative-shift rejection). - test_first_treat_col_mixed_row_nan_raises: unit with [valid, NaN] rows raises. - test_first_treat_col_mixed_row_invalid_value_raises: unit with [valid, 999] rows raises with the 999 in the error message. - test_mixed_row_nan_cluster_raises_on_mass_point: mass-point with cluster column holding [valid, NaN] per unit raises. Targeted regression: 150 HAD tests + 529 total across Phase 1 and adjacent surfaces, all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-21T00:51:04Z

/ai-review

github-actions · 2026-04-21T00:59:37Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 1eda6148f38b7391b4ca7c481e8829d4651e4a27

Overall Assessment

✅ Looks good

Highest unmitigated severity: P2.

Executive Summary

The prior rerun’s d_lower-tolerance blocker appears fixed: tolerance-accepted overrides are now snapped back to d.min() before either the mass-point instrument or the continuous shifted regressor is constructed, and both sides of that tolerance band are covered by tests. diff_diff/had.py:L1248 tests/test_had.py:L1222 tests/test_had.py:L1243
The prior row-level validation gap also appears fixed: first_treat_col and mass-point cluster are now validated before any groupby().first() collapse, with mixed valid/NaN and valid/invalid cases covered in the new suite. diff_diff/had.py:L510 diff_diff/had.py:L595 tests/test_had.py:L662 tests/test_had.py:L1403
I did not find a remaining methodology defect in the estimator core: the continuous paths implement the registry’s beta-scale transform, the mass-point path uses the documented structural-residual 2SLS sandwich, and exported inference still routes through safe_inference(). diff_diff/had.py:L696 diff_diff/had.py:L1374 diff_diff/had.py:L1411 docs/methodology/REGISTRY.md:L2319 docs/methodology/REGISTRY.md:L2323
[Newly identified] The new robust knob is documented as ignored on continuous paths with a warning, but the continuous dispatch only warns for vcov_type and cluster, leaving one variance-related control silently ignored. diff_diff/had.py:L910 diff_diff/had.py:L1314
I could not execute tests/test_had.py here because numpy, pandas, scipy, and pytest are not installed.

Methodology

No findings. The PR’s estimator math matches the Phase 2a registry entry, and the previous rerun’s methodology-critical d_lower bug is resolved by normalizing accepted overrides to the support infimum before downstream use. diff_diff/had.py:L1248 diff_diff/had.py:L1411 docs/methodology/REGISTRY.md:L2319

Code Quality

Severity P2 [Newly identified]. Impact: the public robust parameter is documented as a mass-point-only alias whose use on continuous paths triggers a warning, but the continuous-path dispatch only warns on vcov_type and cluster. A caller who treats robust as an active inference knob on continuous_at_zero or continuous_near_d_lower gets the same CCT robust SE with no signal that the flag had no effect. Concrete fix: either emit an ignore warning when robust=True on a continuous path, or tighten the API/docs to say that only vcov_type and cluster generate warnings while robust is ignored off the mass-point path. diff_diff/had.py:L901 diff_diff/had.py:L910 diff_diff/had.py:L1318

Performance

No findings.

Maintainability

Severity P3. Impact: _validate_had_panel()’s return doc still says the helper returns the two period identifiers “min then max,” but the implementation now returns identified pre/post periods using the all-zero-dose invariant. That is misleading for future callers, especially because the new tests explicitly rely on semantic labels like "pre" / "post". Concrete fix: update the Returns text to say “pre then post” (or equivalent). diff_diff/had.py:L366 diff_diff/had.py:L389 tests/test_had.py:L1623

Tech Debt

No findings. The remaining Phase 2a scope limitations (event_study, survey, weights, mass-point hc2 / hc2_bm, continuous-path cluster, panel-only / repeated-cross-section gap) are explicitly tracked in TODO.md and the Methodology Registry, so I am not counting them against approval. TODO.md:L93 TODO.md:L96 TODO.md:L102 docs/methodology/REGISTRY.md:L2323 docs/methodology/REGISTRY.md:L2326 docs/methodology/REGISTRY.md:L2327

Security

No findings.

Documentation/Tests

Severity P3. Impact: the new suite covers mass-point handling of robust and the continuous-path warning behavior for vcov_type/cluster, but not the continuous-path robust=True case behind the P2 above, so that mismatch can regress unnoticed. Concrete fix: add a regression that a continuous fit with robust=True follows the chosen contract. tests/test_had.py:L462 tests/test_had.py:L521 tests/test_had.py:L1418
Runtime verification pending: I could not run the HAD tests in this environment because numpy, pandas, scipy, and pytest are unavailable.

Review 9 is overall ✅ Looks good; this PR addresses the remaining P2 and two P3s to close the loop cleanly. **P2 (Code Quality): `robust=True` silently ignored on continuous paths** The continuous dispatch block warns on `vcov_type` and `cluster` when they are supplied on a path that ignores them, but `robust=True` slipped through the same check. Since `robust=True` on the mass-point path is the backward-compat alias for `vcov_type="hc1"`, a user passing `robust=True` on `continuous_at_zero` / `continuous_near_d_lower` would reasonably expect it to do something; in reality the continuous paths always use the CCT-2014 robust SE from Phase 1c and the flag has no effect. Fix: extend the ignore-warning block to emit a `UserWarning` when `robust=True` and `resolved_design` is continuous. **P3 (Maintainability): _validate_had_panel return docstring stale** The Returns section said ``(t_pre, t_post)`` is returned in "min then max" order, but the implementation now uses the HAD dose invariant (the all-zero-dose period is `t_pre` regardless of label order). Updated the docstring to describe the identified semantic order and the arbitrary-dtype-label support (int, str, datetime). **Tests (+2 regression):** - test_robust_true_ignored_on_continuous_warns: `robust=True` on continuous path emits a warning containing "robust". - test_robust_false_silent_on_continuous: `robust=False` (the default) on continuous path does NOT emit the `robust=True is ignored` warning, keeping the default construction silent. Targeted regression: 152 HAD tests + 531 total across Phase 1 and adjacent surfaces, all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-21T09:53:45Z

/ai-review

github-actions · 2026-04-21T10:05:36Z

🔁 AI review rerun (requested by @igerber)

Head SHA: a3553ed287cf8695404aa90368c0839f4c9e6622

Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1.

Executive Summary

The prior rerun’s continuous-path robust=True warning gap appears fixed: the warning is now emitted in diff_diff/had.py:L1331 and covered in tests/test_had.py:L532.
I did not find a remaining defect in the core HAD estimator math. The continuous paths still implement the paper’s boundary-limit rescaling logic, and the mass-point path still matches the Section 3.2.4 Wald/2SLS reduction that the registry documents. diff_diff/had.py:L700 diff_diff/had.py:L1425 docs/methodology/REGISTRY.md:L2323. (cdn.arenafi.org)
P1 [Newly identified]: the new sklearn-compatibility surface is not actually sklearn-compatible. get_params() omits the deep argument, set_params() accepts any existing attribute name, and the tests only manual-reconstruct the estimator instead of exercising the documented clone contract. diff_diff/had.py:L997 diff_diff/had.py:L1014 tests/test_had.py:L825. (scikit-learn.org)
P3 informational: the estimator-level continuous_at_zero docs overstate how strict the regime check is. In practice this path still inherits the already-documented Phase 1c 5% boundary plausibility heuristic, so I am not counting it as a blocker. diff_diff/had.py:L1163 diff_diff/local_linear.py:L639 docs/methodology/REGISTRY.md:L2320. (cdn.arenafi.org)
Deferred scope cuts (event_study, survey, weights, mass-point hc2/hc2_bm, continuous-path cluster, repeated-cross-section support) are tracked in TODO.md:L92 and docs/methodology/REGISTRY.md:L2327, so I am not counting them against approval.
I could not run tests/test_had.py here because numpy, pandas, scipy, and pytest are not installed.

Methodology

P3 Impact: diff_diff/had.py:L1163 says continuous_at_zero is the paper’s Design 1' path, but there is no HAD-level guard on that branch; it falls straight through to the looser Phase 1c boundary ~ 0 heuristic in diff_diff/local_linear.py:L639. Because that heuristic is already documented in the registry note, I am treating this as an estimator-contract/documentation mismatch rather than an unmitigated methodology defect. Concrete fix: either add a HAD-level guard consistent with the Phase 2a rule, or explicitly document in the HAD entry/docstring that design="continuous_at_zero" reuses the Phase 1c plausibility heuristic and add a regression for that case. tests/test_had.py:L1162. (cdn.arenafi.org)
No other methodology findings.

Code Quality

P1 Impact: the class advertises sklearn.base.clone(est) round-trips in diff_diff/had.py:L997, but scikit-learn’s estimator API exposes get_params(deep=True) / set_params(**params), while this implementation defines get_params() without deep and validates set_params() via hasattr. That means the documented sklearn-style surface is incomplete, and non-parameter names like fit would be silently accepted and overwrite methods. The new test only checks HeterogeneousAdoptionDiD(**est.get_params()) at tests/test_had.py:L825, so the broken contract is currently untested. Concrete fix: implement get_params(self, deep: bool = True), restrict set_params() to keys from self.get_params(), and add a regression for get_params(deep=False) plus rejection of non-constructor keys (or a direct clone(est) regression if sklearn is in the test matrix). diff_diff/had.py:L1014. (scikit-learn.org)

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. The remaining Phase 2a scope limitations are explicitly tracked in TODO.md:L92 through TODO.md:L101 and mirrored in docs/methodology/REGISTRY.md:L2327 through docs/methodology/REGISTRY.md:L2331, so I am not counting them against approval.

Security

No findings.

Documentation/Tests

P3 Impact: runtime verification is pending because this environment lacks numpy, pandas, scipy, and pytest, so this review is source-only. Concrete fix: rerun the HAD suite in CI or a dev environment with the numerical stack installed; when fixing the P1 above, add a regression that exercises the real sklearn parameter contract rather than manual reconstruction.

Path to Approval

Make the estimator’s sklearn parameter surface actually compatible: implement get_params(deep=True), tighten set_params() to constructor keys only, and add a regression that exercises the documented clone/parameter contract.

**P1 (Code Quality): sklearn parameter contract** The class docstring advertises `sklearn.base.clone(est)` compatibility, but the actual methods did not match sklearn's `BaseEstimator.get_params(deep=True)` / `set_params(**params)` surface: - `get_params()` did not accept a `deep` keyword. sklearn's `clone` calls `get_params(deep=False)`, so any integration would have failed with a TypeError on the missing kwarg. - `set_params()` validated keys with `hasattr(self, key)`. That would silently accept non-constructor attribute names like `fit`, and a typo or malicious kwargs dict could overwrite estimator methods. Fix: - `get_params(self, deep: bool = True)` matches sklearn's signature. `deep` is accepted for compat; this estimator has no nested sub-estimators, so `deep=True` and `deep=False` return the same dict. `del deep` documents the no-op explicitly and silences unused-arg linters. - `set_params(**params)` now restricts to keys from `get_params()`. Non-constructor attribute names (including method names like `fit` and dunder/private attrs) raise `ValueError`. **Tests (+4 regression):** - test_set_params_rejects_method_names: `set_params(fit=...)` raises and `est.fit` stays callable. - test_set_params_rejects_private_attrs: `_internal=42` raises. - test_get_params_accepts_deep_keyword: `deep=True`, `deep=False`, and no-arg all return the same dict. - test_sklearn_clone_round_trip_if_available: `sklearn.base.clone` round-trips the estimator; gated on `pytest.importorskip("sklearn")` so it skips cleanly when sklearn is not in the test matrix. Targeted regression: 155 HAD tests (+ 1 skipped) + 534 total (+1 skipped) across Phase 1 and adjacent surfaces, all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-21T23:35:23Z

/ai-review

github-actions · 2026-04-21T23:43:49Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 98b82017c58f0b32bbcb29aad4a9b47e4d49c4f2

Overall Assessment

✅ Looks good

Highest unmitigated severity: P2.

Executive Summary

The prior sklearn-compatibility blocker appears resolved: get_params(deep=...) now matches sklearn’s signature, set_params() rejects non-constructor keys, and there is a clone regression when sklearn is available. diff_diff/had.py#L997 tests/test_had.py#L869 tests/test_had.py#L882
I did not find an unmitigated methodology defect in the core HAD estimator math. The continuous paths match the registry’s documented Equation 7 / Equation 11 rescaling, and the mass-point structural-residual 2SLS sandwich is explicitly documented in the Methodology Registry, so I am not counting that implementation choice as a defect. docs/methodology/REGISTRY.md#L2323 docs/methodology/REGISTRY.md#L2327
P2 [Newly identified]: HeterogeneousAdoptionDiD.set_params() still mutates self before validation, so a rejected call can leave the estimator partially updated and invalid for later reuse. diff_diff/had.py#L1024
P3 [Newly identified]: the new HAD docstring example is not runnable as written because it uses generate_continuous_did_data() defaults (n_periods=4) with a Phase 2a estimator that requires exactly two periods. diff_diff/had.py#L944 diff_diff/prep_dgp.py#L845
Runtime verification was not possible here because pytest, numpy, pandas, and scipy are unavailable in this environment.

Methodology

No findings. The estimator’s three design paths, d_lower guards, NaN-safe inference routing, and documented mass-point variance choice all line up with the local methodology sources I checked. The remaining unsupported scope cuts are documented rather than silent. docs/methodology/REGISTRY.md#L2323 docs/methodology/REGISTRY.md#L2331

Code Quality

Severity: P2. Impact: HeterogeneousAdoptionDiD.set_params() applies setattr() inside the validation loop and only then calls _validate_constructor_args(), so a failing call like est.set_params(alpha=0.1, design="made_up") leaves the estimator half-mutated. That is a real reuse hazard for callers that catch ValueError and continue with the same object. The current HAD tests verify rejection and round-trip behavior, but not rollback-on-failure. diff_diff/had.py#L1032 tests/test_had.py#L847 tests/test_had.py#L892 tests/test_had.py#L974 Concrete fix: validate all incoming keys and derived constructor constraints on local variables first, then apply mutations atomically, following the established pattern already used in DifferenceInDifferences.set_params(). Add a regression asserting get_params() is unchanged after a failing HAD set_params() call. diff_diff/estimators.py#L824

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. The Phase 2a scope cuts I checked (event_study, survey, weights, mass-point hc2/hc2_bm, continuous-path cluster, repeated-cross-section support) are explicitly tracked in TODO.md, so I am not counting them against approval. TODO.md#L92 TODO.md#L101

Security

No findings.

Documentation/Tests

Severity: P3. Impact: the new class example in had.py will fail if a user copies it literally, because it constructs data with generate_continuous_did_data() defaults and that generator defaults to n_periods=4, while HeterogeneousAdoptionDiD.fit() is Phase-2a two-period-only. diff_diff/had.py#L944 diff_diff/had.py#L1058 diff_diff/prep_dgp.py#L845 Concrete fix: change the example to a true HAD-compatible 2-period dataset, or explicitly pass n_periods=2 and note the Phase 2a panel restriction; ideally add a smoke-style doc regression so the example stays executable.
Review limitation: this was a source-only review. I could not run tests/test_had.py here because pytest, numpy, pandas, and scipy are not installed.

…mple Round 11 is ✅ Looks good overall; this PR closes the remaining P2 and P3 for cleanliness. **P2 (Code Quality): set_params must be atomic** The previous implementation applied `setattr()` inside the validation loop and called `_validate_constructor_args()` only afterward. A multi-key call like `set_params(alpha=0.1, design="made_up")` would mutate `self.alpha` before raising on the invalid design, leaving the estimator half-updated. Callers that catch `ValueError` and retry would see the wrong state. Fix: dry-run validation via `type(self)(**merged)` before any attribute write. If the constructor raises on the combined state, `self` is not mutated. Mutations are applied only after validation passes, so rejection leaves the estimator fully rollback-safe. **P3 (Documentation/Tests): docstring example was non-runnable** The class docstring example constructed data via `generate_continuous_did_data()` which defaults to `n_periods=4`, while `HeterogeneousAdoptionDiD.fit()` requires exactly two periods in Phase 2a. A user copying the example verbatim would hit the two-period rejection. Fix: rewrote the example to build a two-period HAD panel by hand with an explicit `D_{g,1} = 0` pre-period column, a unit-level `dose_post`, and `delta_y` so the shape matches the estimator's contract exactly. Kept `doctest: +SKIP` markers on the RNG and fit lines. **Tests (+3 rollback regressions):** - test_set_params_rollback_on_failure: multi-key call with invalid design keeps estimator unchanged. - test_set_params_rollback_on_invalid_key: unknown kwarg keeps estimator unchanged. - test_set_params_rollback_on_invalid_alpha: out-of-range alpha keeps estimator unchanged. Targeted regression: 158 HAD tests (+1 sklearn-gated skip) + 537 total (+1 skipped) across Phase 1 and adjacent surfaces, all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-21T23:48:29Z

/ai-review

github-actions · 2026-04-21T23:54:45Z

🔁 AI review rerun (requested by @igerber)

Head SHA: c67505156b94fb557a69441349a43ee3d9b21c6b

Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1.

Executive Summary

The prior set_params() atomicity issue appears resolved. HeterogeneousAdoptionDiD.set_params() now validates a merged parameter dict before mutating self. diff_diff/had.py:1040
The earlier doc-example issue also appears resolved. The class docstring now builds a true two-period HAD panel directly instead of relying on a generator with four-period defaults. diff_diff/had.py:943
I did not find an unmitigated defect in the core HAD math. The beta-scale continuous estimators, the Design 1 warning, and the structural-residual 2SLS sandwich all line up with the local methodology sources, and the 2SLS variance choice is explicitly documented in the registry. docs/methodology/REGISTRY.md:2323 docs/methodology/REGISTRY.md:2327
One P1 remains: an explicit nonzero d_lower is silently ignored whenever the resolved design is continuous_at_zero, even though the class documentation says explicit overrides are checked against the Section 3.2 regime contract and mismatches raise. diff_diff/had.py:868 diff_diff/had.py:1207
Runtime verification was not possible here because this environment does not have pytest or numpy installed.

Methodology

Severity: P1. Impact: HeterogeneousAdoptionDiD.fit() unconditionally sets d_lower_val = 0.0 on the continuous_at_zero path, so calls like HeterogeneousAdoptionDiD(design="continuous_at_zero", d_lower=0.5) silently fit Design 1' at zero anyway. That contradicts both the paper’s Design 1' contract (d_lower = 0) and the class docstring’s promise that mismatched overrides raise. It also creates a parameter-contract mismatch: get_params() can still report the user’s nonzero d_lower, while the fitted result reports d_lower=0.0. This is a silent propagation failure on a new public parameter. diff_diff/had.py:868 diff_diff/had.py:1030 diff_diff/had.py:1207 Concrete fix: after resolved_design is known, reject any explicit d_lower that is not approximately zero when resolved_design == "continuous_at_zero"; if you want tolerance behavior, snap only near-zero values to exact 0.0.
Severity: P3. Impact: The mass-point path’s structural-residual 2SLS sandwich is a documented deviation from the paper’s underspecified “standard 2SLS inference,” so I am not counting it as a defect. docs/methodology/REGISTRY.md:2327 Concrete fix: none required.

Code Quality

No additional findings. The previous partial-mutation set_params() issue appears fixed. diff_diff/had.py:1040

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. The unsupported Phase 2a scope cuts I checked (event_study, survey, weights, mass-point hc2/hc2_bm, continuous-path cluster, staggered timing, repeated cross-section) are explicitly tracked in TODO.md, so I am not counting them against approval. TODO.md:92 TODO.md:101

Security

No findings.

Documentation/Tests

No separate findings. The earlier doc example problem appears resolved. The current test file is broad, but it does not cover the continuous_at_zero plus explicit nonzero d_lower case that triggers the P1 above. diff_diff/had.py:943 tests/test_had.py:1140
Review limitation: I could not run the HAD tests here because pytest and numpy are not installed in this environment.

Path to Approval

In HeterogeneousAdoptionDiD.fit(), validate explicit d_lower against the Design 1' contract after resolved_design is computed. continuous_at_zero should accept only d_lower=None or an approximately zero value; any materially nonzero override should raise ValueError.
Add regression coverage for both design="continuous_at_zero" and design="auto" on a zero-support sample with d_lower > 0, and assert that the call fails instead of silently coercing to Design 1' at zero.

…inuous_at_zero **P1 (Methodology): continuous_at_zero silently ignored nonzero d_lower** `HeterogeneousAdoptionDiD.fit()` unconditionally set `d_lower_val = 0.0` on the `continuous_at_zero` path, so calls like `HeterogeneousAdoptionDiD(design="continuous_at_zero", d_lower=0.5)` silently fit Design 1' at zero. This contradicted paper Section 3.2's Design 1' regime contract (`d_lower = 0`) and the class docstring's promise that mismatched overrides raise. It also created a parameter- contract mismatch: `get_params()` reported the user's nonzero `d_lower` while the fitted result reported `d_lower=0.0`. Fix: after `resolved_design == "continuous_at_zero"` is known, check whether the user supplied an explicit `d_lower`. If the absolute value exceeds the float-tolerance band (same family as the Design 1 d_lower guards below), raise `ValueError` pointing at `continuous_near_d_lower` / `mass_point` / `auto`. `d_lower=None` (auto-resolve) and `d_lower=0.0` (redundant but benign) continue to succeed. **Tests (+4 regression):** - test_continuous_at_zero_with_nonzero_d_lower_raises: d_lower=0.5 raises with "d_lower == 0" / "Design 1'" pointer. - test_continuous_at_zero_with_small_d_lower_raises: d_lower=0.01 also raises (not just large overrides). - test_continuous_at_zero_with_zero_d_lower_succeeds: d_lower=0.0 exactly is accepted (redundant-but-valid case). - test_auto_on_zero_sample_ignores_user_d_lower: design='auto' resolving to continuous_at_zero must ALSO reject explicit nonzero d_lower, not silently drop it. Targeted regression: 162 HAD tests (+1 sklearn-gated skip) + 541 total (+1 skipped) across Phase 1 and adjacent surfaces, all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-21T23:59:00Z

/ai-review

github-actions · 2026-04-22T00:07:50Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 07ab30f0b736d3a5fea6b1df97273412984e4ba5

Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1.

Executive Summary

The prior finite-d_lower contract issue is mostly fixed: continuous_at_zero now rejects explicit nonzero thresholds, and the positive-support Design 1 branches enforce d_lower > 0, the mass-point/continuous split, and d_lower == d.min() snapping. diff_diff/had.py:1207 diff_diff/had.py:1243 diff_diff/had.py:1314 tests/test_had.py:1275
The core HAD formulas match the local source material: continuous paths implement the Equation 7 / Equation 11 beta-scale rescaling, and the mass-point path uses the paper’s Wald-IV / 2SLS construction. diff_diff/had.py:1480 diff_diff/had.py:700 docs/methodology/papers/dechaisemartin-2026-review.md:50 docs/methodology/papers/dechaisemartin-2026-review.md:60 docs/methodology/papers/dechaisemartin-2026-review.md:73
One P1 remains: d_lower=np.nan bypasses the new comparison-based guards and is silently coerced to 0.0 or d.min() instead of raising. On zero-support samples, that can reopen the same paper-incompatible mass_point / continuous_near_d_lower paths this PR otherwise fixed. diff_diff/had.py:969 diff_diff/had.py:1207 diff_diff/had.py:1243 diff_diff/had.py:1314 diff_diff/had.py:1013
The structural-residual 2SLS sandwich choice and the Phase 2a scope cuts are documented in REGISTRY.md / TODO.md, so I am not counting those as defects. docs/methodology/REGISTRY.md:2327 docs/methodology/REGISTRY.md:2330 TODO.md:92
Runtime verification was not possible here because this environment does not have pytest or numpy installed.

Methodology

Severity: P1. Impact: The new d_lower contract is still incomplete for non-finite inputs. fit() only uses comparison-based guards (>, <=, and abs(x - d_min) > tol), and all of those return False for NaN. As a result, d_lower=np.nan is silently accepted: continuous_at_zero falls through to d_lower_val = 0.0, while continuous_near_d_lower / mass_point fall through the regime checks and then snap to d.min(). On zero-support samples, that can even bypass the restored d_lower > 0 protection and let the wrong Design 1 path run at d_lower=0. That is a silent parameter-contract mismatch on a new public estimator argument, and it contradicts the documented “mismatched overrides raise” behavior. diff_diff/had.py:868 diff_diff/had.py:969 diff_diff/had.py:1207 diff_diff/had.py:1243 diff_diff/had.py:1314
Concrete fix: Reject non-finite d_lower values up front, ideally in _validate_constructor_args() so __init__, set_params(), and fit() all share the same contract. d_lower should be either None or a finite scalar; NaN / ±inf should raise ValueError before any design dispatch or snapping logic.
Severity: P3. Impact: The mass-point structural-residual 2SLS sandwich is a documented deviation from the paper’s underspecified “standard 2SLS inference,” and the registry explicitly records that choice plus the deferred hc2 / hc2_bm work. I am not counting it as a defect. diff_diff/had.py:700 docs/methodology/REGISTRY.md:2327
Concrete fix: None required.

Code Quality

No additional findings. The previous set_params() atomicity issue appears resolved by the validate-then-mutate flow. diff_diff/had.py:1040

Performance

No findings.

Maintainability

No additional findings beyond the d_lower validation gap above.

Tech Debt

Severity: P3. Impact: The unsupported Phase 2a items I checked (event_study, survey, weights, continuous-path cluster, mass-point hc2 / hc2_bm, repeated cross-section) are explicitly tracked in TODO.md and mirrored in the registry notes, so they are mitigated deferred work rather than blockers. TODO.md:92 TODO.md:95 TODO.md:96 TODO.md:101 docs/methodology/REGISTRY.md:2330 docs/methodology/REGISTRY.md:2331
Concrete fix: None required in this PR.

Security

No findings.

Documentation/Tests

Severity: P3. Impact: The new result docstring and registry entry overstate the NaN contract by saying att is part of the safe_inference-gated inference bundle / “all 5 inference fields.” The implementation only NaN-gates t_stat, p_value, and conf_int; att and se are passed through from the estimator path. That does not change the estimator math, but it leaves the public contract out of sync with the code. diff_diff/had.py:130 diff_diff/had.py:1449 docs/methodology/REGISTRY.md:2326 tests/test_had.py:729 tests/test_had.py:785
Concrete fix: Update the wording to say that undefined SEs force t_stat, p_value, and conf_int to NaN, while att / se remain raw estimator outputs. If full all-field NaN propagation is actually intended, implement that explicitly and add regression tests.

Path to Approval

Reject non-finite d_lower values (NaN, ±inf) before any branch-specific comparison or snapping, preferably in _validate_constructor_args() so set_params(d_lower=...) also fails atomically.
Add regression tests for d_lower=np.nan on design="continuous_at_zero", design="continuous_near_d_lower", design="mass_point", and design="auto" on both zero-support and positive-support samples, asserting ValueError instead of silent coercion.

… docstring **P1 (Methodology): d_lower=NaN/inf bypassed comparison guards** The Design 1 regime guards (`d_lower > 0`, `d_lower == d.min()` within tolerance, `d_lower <= tol` for continuous_at_zero) all use comparison operators that return False for NaN. A user passing `d_lower=np.nan` therefore slipped through every check and ended up at `0.0` (continuous_at_zero branch) or `d.min()` (the snap). On a zero-support sample, that could silently re-enter the paper- incompatible continuous_near_d_lower / mass_point path the earlier review rounds had fixed. Fix: reject non-finite d_lower front-door in `_validate_constructor_args()`. This makes `__init__`, `set_params()`, and the atomic dry-run validation all share the same contract: `d_lower` must be None or a finite scalar. NaN and +/-inf raise `ValueError`. Atomic `set_params()` confirms that a failing `d_lower=NaN` call leaves the estimator unchanged. **P3 (Documentation/Tests): safe_inference coverage overstated** The result-class docstring and REGISTRY said "all 5 inference fields are routed through safe_inference". In reality, `safe_inference` only gates the downstream triple (t_stat, p_value, conf_int) - `att` and `se` are raw estimator outputs. The fit paths themselves return `(nan, nan)` on degenerate configurations, which combined with the safe_inference gate produces the advertised "all five fields NaN together" behavior, but the docstring wording was misleading. Fix: rewrote the Results docstring preamble and the REGISTRY.md Phase 2a NaN-propagation line to describe the two-layer NaN flow (fit-path NaN on att/se + safe_inference on the downstream triple) precisely. **Tests (+4 regression):** - test_d_lower_nan_raises: __init__ rejects NaN. - test_d_lower_posinf_raises / test_d_lower_neginf_raises: +/-inf rejected. - test_d_lower_nan_via_set_params_raises: set_params(d_lower=NaN) raises AND leaves the estimator unchanged (atomic rollback). Targeted regression: 166 HAD tests (+1 sklearn-gated skip) + 545 total (+1 skipped) across Phase 1 and adjacent surfaces, all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-22T00:21:08Z

/ai-review

github-actions · 2026-04-22T00:26:48Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 2f6e1d0b3677ad0e5dd936818497bcd79358e24b

Overall Assessment

Looks good.

Highest unmitigated severity: P3.

Executive Summary

The prior P1 from the last review is resolved: non-finite d_lower is now rejected in _validate_constructor_args(), and both constructor and set_params() paths have regression coverage at diff_diff/had.py:1002, tests/test_had.py:1227, and tests/test_had.py:1240.
The design partition now matches the registry/paper: Design 1' is confined to d_lower=0, Design 1 paths require d_lower>0 and d_lower=float(d.min()), and mass-point vs continuous-near-d_lower mismatches fail fast before estimation at diff_diff/had.py:1228, diff_diff/had.py:1254, diff_diff/had.py:1277, and diff_diff/had.py:1322.
I did not find any new unmitigated P0/P1 methodology, weighting, or variance defects in the changed estimator code.
The mass-point structural-residual 2SLS sandwich remains a documented registry deviation from the paper’s underspecified “standard 2SLS inference,” so it is P3 informational only, not a blocker, at docs/methodology/REGISTRY.md:2327 and diff_diff/had.py:721.
One residual P3 remains in the docs/tests: the NaN-contract wording still overstates that all five fields move to NaN together, but the single-cluster CR1 path already leaves a finite att with se=np.nan at diff_diff/had.py:130, diff_diff/had.py:822, tests/test_had.py:785, and docs/methodology/REGISTRY.md:2326.
Runtime verification was not possible here because this environment lacks pytest, numpy, pandas, and scipy.

Methodology

Severity: P3. Impact: No unmitigated methodology defect found. The mass-point path’s structural-residual 2SLS sandwich is explicitly documented in the registry as a deviation from the paper’s underspecified inference language, so it should be treated as informational rather than blocking at docs/methodology/REGISTRY.md:2327 and diff_diff/had.py:721. Concrete fix: None.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity: P3. Impact: The main Phase 2a scope cuts are properly tracked in TODO.md and mirrored in the registry, including event_study, survey, weights, mass-point hc2/hc2_bm, continuous-path cluster, and repeated-cross-section support, so I am treating them as accepted deferred work rather than blockers at TODO.md:92, TODO.md:101, docs/methodology/REGISTRY.md:2330, and docs/methodology/REGISTRY.md:2331. Concrete fix: None in this PR.

Security

No findings.

Documentation/Tests

Severity: P3. Impact: The NaN-contract wording is still too strong. diff_diff/had.py:130 and docs/methodology/REGISTRY.md:2326 say users can expect all five fields to be finite or all NaN together, but the clustered mass-point path returns (att=beta_hat, se=np.nan) when CR1 is undefined with a single cluster at diff_diff/had.py:822, and the new regression only checks that se is NaN in that case at tests/test_had.py:785. This is a documentation mismatch, not a numerical bug, because t_stat, p_value, and conf_int are still correctly NaN-gated via safe_inference() at diff_diff/had.py:1471. Concrete fix: Reword the docs/registry to say the guaranteed NaN coupling is on the downstream inference triple (t_stat, p_value, conf_int), while att and se remain raw estimator outputs and may stay finite when inference is undefined.

Round 14 is ✅ Looks good; this closes the final residual P3 for wording accuracy. **P3 (Docs): NaN-contract wording overstated** The prior wording said users can expect all five result fields to be finite or all NaN together. That is true on most degenerate paths (constant-y, no-dose-variation) but NOT on the single-cluster CR1 configuration of the mass-point path: `_fit_mass_point_2sls` returns `(att=beta_hat, se=nan)` there - `att` is finite because the Wald-IV ratio is well defined, while CR1 requires G >= 2 clusters so `se` is NaN. `safe_inference` then gates the downstream triple correctly, but the doc claim that `att` and `se` are coupled with the triple was inaccurate. Fix: rewrote the result-class docstring preamble and the REGISTRY Phase 2a NaN-propagation line to describe the invariant precisely: - GUARANTEED NaN coupling is on the downstream triple (`t_stat`, `p_value`, `conf_int`) via `safe_inference`. - `att` and `se` are RAW estimator outputs from the chosen fit path and are NOT gated by safe_inference. - On most degenerate paths, fit paths return `(nan, nan)` so all five move together, but on the single-cluster CR1 edge the fit path returns `(att=beta_hat, se=nan)` and only the downstream triple becomes NaN via the gate. No code behavior changes; the `assert_nan_inference` fixture already only checks the downstream triple, matching the restated contract. Targeted regression: 166 HAD tests (+1 sklearn-gated skip) + 545 total (+1 skipped) across Phase 1 and adjacent surfaces, all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-22T00:33:35Z

/ai-review

github-actions · 2026-04-22T00:40:16Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 22d7f65f1ea126702ea97fbef2d03f1f8e5658db

Overall Assessment

✅ Looks good

Highest unmitigated severity: P3.

Executive Summary

Re-review result: the prior informational NaN-contract wording issue is fixed. The result docstring and registry now both state that the guaranteed NaN coupling is on t_stat / p_value / conf_int, while att may remain finite when single-cluster CR1 is undefined at diff_diff/had.py:L130-L153, docs/methodology/REGISTRY.md:L2326-L2329, and tests/test_had.py:L785-L792.
I did not find any new unmitigated P0/P1 issues in the changed estimator math, design dispatch, support-infimum enforcement, control-group construction, or NaN-safe inference routing at diff_diff/had.py:L657-L862, diff_diff/had.py:L1241-L1485, and tests/test_had.py:L1387-L1526.
The mass-point path’s structural-residual 2SLS sandwich remains a documented methodology note, not a defect: the paper leaves “standard 2SLS inference” underspecified, and the registry explicitly records this implementation choice at diff_diff/had.py:L720-L862 and docs/methodology/REGISTRY.md:L2327-L2329.
Phase 2a scope cuts are explicitly tracked, so I’m treating them as accepted deferred work rather than blockers: event_study, survey, weights, mass-point hc2 / hc2_bm, continuous-path cluster=, staggered reduction, and repeated-cross-section support at TODO.md:L92-L101 and docs/methodology/REGISTRY.md:L2330-L2332.
Runtime verification was not possible here because this environment lacks pytest, numpy, pandas, and scipy.

Methodology

Severity: P3. Impact: The mass-point SE implementation uses the structural-residual 2SLS sandwich rather than an OLS-on-indicator shortcut, but that choice is explicitly documented in the methodology registry and is backed by dedicated parity tests for classical, hc1, and CR1 at diff_diff/had.py:L720-L862, docs/methodology/REGISTRY.md:L2327-L2329, and tests/test_had.py:L434-L456. Concrete fix: None.
No other findings. The Design 1 / Design 1' partition, d_lower == d.min() enforcement, wrong-design fail-fast guards, Assumption 5/6 warning, and safe_inference() routing align with the registry and paper review at diff_diff/had.py:L1241-L1485 and docs/methodology/REGISTRY.md:L2323-L2329.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity: P3. Impact: Remaining Phase 2a limitations are properly tracked in TODO.md, so they are not blockers for this PR: aggregate="event_study", survey=, weights=, mass-point hc2 / hc2_bm, continuous-path cluster=, staggered reduction, and panel-only / RCS scope at TODO.md:L92-L101 and docs/methodology/REGISTRY.md:L2330-L2332. Concrete fix: None in this PR.

Security

No findings.

Documentation/Tests

No findings. The prior informational NaN-contract mismatch is resolved, and the new test file covers the corrected downstream-triple NaN behavior, mass-point SE parity, design auto-detection, support-infimum snapping, and sklearn-compat paths at diff_diff/had.py:L130-L153, tests/test_had.py:L434-L456, tests/test_had.py:L785-L792, and tests/test_had.py:L897-L999.
Review limitation: I could not execute the new tests locally because pytest, numpy, pandas, and scipy are not installed in this environment.

igerber and others added 3 commits April 20, 2026 17:21

igerber added the ready-for-ci Triggers CI test workflows label Apr 22, 2026

igerber merged commit 7c9f7f1 into main Apr 22, 2026
23 of 24 checks passed

igerber deleted the did-no-untreated branch April 22, 2026 01:49

Conversation

igerber commented Apr 20, 2026

Summary

Methodology references (required if estimator / math changes)

Validation

Security / privacy

Uh oh!

github-actions Bot commented Apr 20, 2026

Uh oh!

igerber commented Apr 20, 2026

Uh oh!

github-actions Bot commented Apr 20, 2026

Uh oh!

igerber commented Apr 20, 2026

Uh oh!

github-actions Bot commented Apr 20, 2026

Uh oh!

igerber commented Apr 20, 2026

Uh oh!

github-actions Bot commented Apr 20, 2026

Uh oh!

igerber commented Apr 20, 2026

Uh oh!

github-actions Bot commented Apr 20, 2026

Uh oh!

igerber commented Apr 20, 2026

Uh oh!

github-actions Bot commented Apr 20, 2026

Uh oh!

igerber commented Apr 20, 2026

Uh oh!

github-actions Bot commented Apr 21, 2026

Uh oh!

igerber commented Apr 21, 2026

Uh oh!

github-actions Bot commented Apr 21, 2026

Uh oh!

igerber commented Apr 21, 2026

Uh oh!

github-actions Bot commented Apr 21, 2026

Uh oh!

igerber commented Apr 21, 2026

Uh oh!

github-actions Bot commented Apr 21, 2026

Uh oh!

igerber commented Apr 21, 2026

Uh oh!

github-actions Bot commented Apr 21, 2026

Uh oh!

igerber commented Apr 21, 2026

Uh oh!

github-actions Bot commented Apr 21, 2026

Uh oh!

igerber commented Apr 21, 2026

Uh oh!

github-actions Bot commented Apr 22, 2026

Overall Assessment

Executive Summary

Methodology

Code Quality

Performance

Maintainability

Tech Debt

Security

Documentation/Tests

Path to Approval

Uh oh!

igerber commented Apr 22, 2026

Uh oh!

github-actions Bot commented Apr 22, 2026

Uh oh!

igerber commented Apr 22, 2026

Uh oh!

github-actions Bot commented Apr 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects