
Phase 2a: HeterogeneousAdoptionDiD class (single-period, 3 design paths)#346

Merged
igerber merged 17 commits into main from did-no-untreated
Apr 22, 2026

Conversation

@igerber (Owner) commented Apr 20, 2026

Summary

  • Ships HeterogeneousAdoptionDiD + HeterogeneousAdoptionDiDResults as the user-facing estimator for de Chaisemartin, Ciccia, D'Haultfœuille, and Knau (2026). Three design-dispatch paths: Design 1' (continuous_at_zero, boundary=0), Design 1 continuous-near-d_lower (regressor shift D - d_lower, boundary=d.min()), and Design 1 mass-point (Wald-IV / 2SLS with instrument 1{D > d_lower}).
  • Continuous paths compose Phase 1c's bias_corrected_local_linear and apply Equation 8's β-scale rescaling tau_bc / D_bar where D_bar = (1/G) * sum(D_{g,2}). Mass-point path uses the structural-residual 2SLS sandwich [Z'X]^-1 Ω [Z'X]^-T for classical/hc1 and cluster-robust CR1. HC2 and HC2-BM raise NotImplementedError pending a 2SLS-specific leverage derivation.
  • design="auto" resolves via the REGISTRY line-2320 rule: d.min() == 0 or d.min()/median(|d|) < 0.01 → continuous_at_zero; modal fraction at d.min() > 2% → mass_point; else continuous_near_d_lower. Mass-point path enforces d_lower == float(d.min()) within float tolerance (paper Section 3.2.4 contract); mismatched overrides raise with a clear pointer to the unsupported estimand.
  • Panel validator enforces balanced two-period panel, D_{g,1} = 0 for all units, and first_treat_col value-domain validation. >2 periods raises with a Phase 2b pointer. aggregate="event_study", survey=, and weights= all raise NotImplementedError pointing at follow-up PRs.
  • All inference fields (att, se, t_stat, p_value, conf_int) routed through safe_inference for NaN-safe CI gating. Raw design preserved on self so get_params() / sklearn.clone() round-trip reproduces "auto" even after fit.
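The REGISTRY auto-detect rule summarized in the bullets above can be sketched as follows. This is a minimal illustration of the stated thresholds only; the function name and exact tolerances are illustrative, not the actual REGISTRY implementation:

```python
import numpy as np

def detect_design(d, zero_ratio_tol=0.01, mass_frac_tol=0.02):
    """Illustrative sketch of the auto-detect rule described above:
    - d.min() == 0, or d.min() tiny relative to median(|d|) -> continuous_at_zero
    - modal fraction at d.min() above 2% -> mass_point
    - otherwise -> continuous_near_d_lower
    """
    d = np.asarray(d, dtype=float)
    d_min = d.min()
    if d_min == 0.0 or d_min / np.median(np.abs(d)) < zero_ratio_tol:
        return "continuous_at_zero"
    if np.mean(d == d_min) > mass_frac_tol:  # modal fraction at the support minimum
        return "mass_point"
    return "continuous_near_d_lower"
```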

Methodology references (required if estimator / math changes)

  • Method name(s): Heterogeneous Adoption Difference-in-Differences (HAD) / Weighted-Average-Slope (WAS) estimator; paper Section 3 (Designs 1 and 1'); paper Section 3.2.4 (mass-point 2SLS); paper Equation 8 (β-scale CI).
  • Paper / source link(s): de Chaisemartin, C., Ciccia, D., D'Haultfœuille, X., & Knau, F. (2026). Difference-in-Differences Estimators When No Unit Remains Untreated. arXiv:2405.04465v6.
  • Any intentional deviations from the source (and why): The paper specifies "standard 2SLS inference" for the mass-point path without naming a specific variance. Phase 2a implements the structural-residual 2SLS sandwich ([Z'X]^-1 built from structural residuals u = ΔY - α̂ - β̂·D, not the reduced-form Wald-IV-scaled OLS shortcut) to match canonical 2SLS (e.g., AER::ivreg, ivreg2). Documented in docs/methodology/REGISTRY.md under the HAD Phase 2a Note block. HC2 and HC2-BM require a 2SLS-specific leverage derivation and are queued for follow-up (tracked in TODO.md).
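The structural-residual 2SLS sandwich named in the deviation note can be sketched in a few lines for the just-identified mass-point case. This is a hand-rolled textbook illustration (function name and HC1 scaling are assumptions for this sketch), not the diff_diff internals:

```python
import numpy as np

def mass_point_2sls(dy, d, d_lower):
    """Sketch of just-identified 2SLS with X = [1, D] and instrument
    matrix Z = [1, 1{D > d_lower}]. Returns (slope, hc1_se) using the
    structural-residual sandwich [Z'X]^-1 Omega [Z'X]^-T."""
    dy = np.asarray(dy, float)
    d = np.asarray(d, float)
    n = len(d)
    X = np.column_stack([np.ones(n), d])
    Z = np.column_stack([np.ones(n), (d > d_lower).astype(float)])
    ZX_inv = np.linalg.inv(Z.T @ X)         # just-identified: (Z'X)^-1
    beta = ZX_inv @ (Z.T @ dy)              # slope equals the Wald-IV ratio
    u = dy - X @ beta                       # structural residuals, not reduced-form
    omega = (Z * (u**2)[:, None]).T @ Z     # meat: sum of u_i^2 * z_i z_i'
    omega *= n / (n - X.shape[1])           # HC1 small-sample scaling
    V = ZX_inv @ omega @ ZX_inv.T
    return float(beta[1]), float(np.sqrt(V[1, 1]))
```

For a binary instrument, the recovered slope coincides with the Wald ratio (difference in mean ΔY over difference in mean D across the two instrument groups), which is the parity the tests in this PR check.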

Validation

  • Tests added/updated: tests/test_had.py (NEW, 110 tests across 16 test classes covering the 12 plan commit criteria: smoke on all 3 designs, auto-detect correctness + 2 edge cases, β-scale rescaling at atol=1e-14, mass-point Wald-IV at atol=1e-14, mass-point sandwich SE parity at atol=1e-12 (classical/hc1/CR1), NotImplementedError contracts for hc2/hc2_bm/event_study/survey/weights, panel-contract violations, NaN propagation, sklearn clone round-trip, get_params signature enumeration, mass-point d_lower equality contract).
  • Backtest / simulation / notebook evidence (if applicable): N/A; the mass-point SE parity test uses a hand-coded textbook 2SLS sandwich as its reference (no R dependency). The β-scale rescaling parity is algebraic post-Phase-1c and is verified at atol=1e-14 against direct bias_corrected_local_linear output.

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

igerber and others added 3 commits April 20, 2026 17:21
Ships the user-facing HAD estimator with three design-dispatch paths:

- continuous_at_zero (Design 1'): bias-corrected local-linear at boundary=0,
  rescaled to beta-scale via Equation 8 divisor D_bar = (1/G) * sum(D_{g,2}).
- continuous_near_d_lower (Design 1): same path with regressor shift
  D' = D - d_lower and boundary = float(d.min()).
- mass_point (Design 1 Section 3.2.4): sample-average 2SLS with instrument
  1{D_{g,2} > d_lower}. Point estimate is the Wald-IV ratio; SE is the
  structural-residual 2SLS sandwich [Z'X]^-1 Omega [Z'X]^-T for classical,
  hc1, and CR1 (cluster-robust). hc2/hc2_bm raise NotImplementedError
  pending a 2SLS-specific leverage derivation.

design="auto" resolves via the REGISTRY auto-detect rule (d.min() < 0.01 *
median(|d|) -> continuous_at_zero; modal fraction > 2% -> mass_point; else
continuous_near_d_lower), with d.min() == 0 as an unconditional tie-break.

Panel validator enforces balanced two-period panel, D_{g,1} = 0 for all
units, no NaN in key columns. >2 periods raises pointing to Phase 2b.

aggregate="event_study" (Appendix B.2 multi-period) raises
NotImplementedError; survey= and weights= also raise (queued for follow-up
survey-integration PR).

All inference fields (att, se, t_stat, p_value, conf_int) routed through
safe_inference() for NaN-safe CI gating. Raw design kept on self so
get_params()/clone() round-trip preserves "auto" even after fit.

105 new tests cover the 12 plan commit criteria: three design paths finite,
design="auto" correctness + edge cases, beta-scale rescaling at atol=1e-14,
mass-point Wald-IV at atol=1e-14, mass-point sandwich SE parity at
atol=1e-12 (classical/hc1/CR1), NotImplementedError contracts, panel
contract violations, NaN propagation, sklearn clone round-trip,
get_params signature enumeration.

REGISTRY.md ticks four Phase 2a checkboxes and adds a Note block
documenting the structural-residual 2SLS sandwich choice. TODO.md queues
Phase 2b (multi-period), survey integration, weights, hc2/hc2_bm
leverage derivation, pre-test diagnostics (Phase 3), Pierce-Schott
replication (Phase 4), and tutorial integration (Phase 5).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Paper Section 3.2.4 defines the Design 1 mass-point estimator with
instrument Z = 1{D_{g,2} > d_lower} at d_lower = float(d.min()) (the
lower-support mass point). Phase 2a previously accepted user-supplied
d_lower overrides on the mass-point path without validation, which would
silently redefine the instrument/control split and identify a different
(LATE-like) estimand outside Phase 2a's documented scope.

Changes:
- diff_diff/had.py: fit() now raises ValueError on the mass-point path
  when an explicit d_lower differs from float(d.min()) beyond float
  tolerance. d_lower=None (auto-resolve) and d_lower == d.min() (within
  tolerance) remain supported. Continuous paths are unaffected - they
  already reject off-support d_lower via Phase 1c's _validate_had_inputs
  (negative-dose check after the regressor shift).
- tests/test_had.py: new TestMassPointDLowerContract class adds 6
  rejection/acceptance tests for the contract. The prior
  test_force_mass_point_on_continuous_data test is renamed and clarified
  to document that d_lower=0.0 matches d.min()==0.0 in that DGP. The
  stale all-above-d_lower NaN-propagation test is removed (the input is
  now unreachable from the public API; the helper-level NaN guard is
  still tested via test_helper_returns_nan_on_empty_z_zero).

Smaller fixes:
- _validate_had_panel: first_treat_col comment no longer overstates the
  cross-validation (Phase 2a does value-domain only, not dose
  consistency).
- _fit_continuous dispatch: "Warn once" comment revised to "Warn"
  (per-fit warnings are not suppressed across calls in Phase 2a).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rename test_d_lower_contract_is_mass_point_only ->
test_mass_point_equality_guard_does_not_fire_on_continuous and rewrite
the docstring so it accurately describes what's exercised. Previously
claimed "arbitrary d_lower" but only tests d_lower=d.min(); the renamed
test now narrates that the mass-point-specific ValueError does not fire
on the continuous path (the continuous path has its own upstream
negative-dose guard after the regressor shift).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Overall Assessment

⛔ Blocker

Executive Summary

  • The continuous-design implementation is mathematically off: diff_diff/had.py#L1177 computes att = tau_bc / mean(D), but the HAD paper’s continuous estimators are based on E[ΔY] - lim E[ΔY | D_2 ≤ d], and the no-QUG Design 1 path also uses E[D_2 - d̲] in the denominator, not E[D_2]. The new tests at tests/test_had.py#L234 and tests/test_had.py#L279 lock that wrong formula in.
  • The original dose-support contract is incomplete. diff_diff/had.py#L404 never checks post-period D_{g,2} >= 0, and diff_diff/had.py#L984 lets continuous_near_d_lower accept user d_lower values that are not the support infimum.
  • Design 1 paths fit silently without surfacing the extra Assumption 5/6 identification burden, despite the registry requiring that warning at docs/methodology/REGISTRY.md#L2155.
  • The mass-point 2SLS path itself looks directionally aligned; the unsupported hc2/hc2_bm, event_study, survey, weights, and continuous-path cluster= gaps are documented in TODO.md#L94 and docs/methodology/REGISTRY.md#L2323, so I would not block on those tracked scope cuts.

Methodology

  • P0 The continuous estimator is not implementing the paper’s estimand. diff_diff/had.py#L1137, diff_diff/had.py#L1177, diff_diff/had.py#L121, and docs/methodology/REGISTRY.md#L2319 all treat the Phase 1 local-linear boundary estimate tau_bc as the final numerator and then divide by mean(D). Impact: for Design 1' the paper’s estimator is (E[ΔY] - lim_{d↓0} E[ΔY | D_2 ≤ d]) / E[D_2], and for Design 1 / WAS_{d̲} it is (E[ΔY] - lim_{d↓d̲} E[ΔY | D_2 ≤ d]) / E[D_2 - d̲]; the current code therefore returns the wrong point estimate and CI on both continuous paths with no warning. Concrete fix: in _fit_continuous(), compute dy_mean = dy_arr.mean(), use den = d_arr.mean() for continuous_at_zero and den = (d_arr - d_lower_val).mean() for continuous_near_d_lower, set att = (dy_mean - bc_fit.estimate_bias_corrected) / den, se = bc_fit.se_robust / den, and update the CI/docs/tests accordingly; if you transform the stored Phase 1 CI directly, the endpoints must reverse under subtraction.
  • P1 The continuous-path support checks are incomplete on the original dose scale. diff_diff/had.py#L404 validates NaNs and D_{g,1}=0 but never enforces post-period D_{g,2} >= 0; diff_diff/had.py#L1147 then shifts d_arr before calling the Phase 1 validator, which means negative original doses can be hidden and some d_lower < min(D) overrides can slip through the 5% heuristic in diff_diff/local_linear.py#L652. Impact: explicit design="continuous_near_d_lower" can fit out-of-scope data or an off-support threshold, silently changing the estimand. Concrete fix: validate aggregated d_arr before any shift, reject any d_arr < 0, and require d_lower == float(d_arr.min()) within tolerance on continuous_near_d_lower, mirroring the existing mass-point guard at diff_diff/had.py#L992.
  • P1 Design 1 fits do not surface the extra identification assumptions the registry requires. After resolving the design, diff_diff/had.py#L1035 only warns about ignored vcov_type/cluster knobs and then fits continuous_near_d_lower / mass_point silently. Impact: the paper and registry say that when d̲ > 0, WAS is not point-identified under Assumptions 1-3 alone; users need Assumption 5 for sign identification or Assumption 6 for WAS_{d̲}, and those assumptions are not pre-trend-testable. Concrete fix: emit an explicit warning whenever resolved_design is continuous_near_d_lower or mass_point, or require an assumption={"5","6"} parameter and make target_parameter reflect that choice instead of always reporting WAS_d_lower.

Code Quality

  • P2 The mass-point variance-family contract is internally inconsistent. diff_diff/had.py#L768 says vcov_type=None defaults to "hc1", but diff_diff/had.py#L1069 maps the default constructor state (robust=False) to "classical". With cluster=, diff_diff/had.py#L1075 computes CR1 anyway, while diff_diff/had.py#L1085 still stores the pre-cluster label. Impact: users cannot rely on result.vcov_type or the docstring to know which SE family they actually got. Concrete fix: resolve and store the effective family explicitly, and make the docstring/defaults consistent with the implementation.

Performance

No findings.

Maintainability

No independent finding beyond the methodology/contract issues above.

Tech Debt

No findings.
Security

No findings.

Documentation/Tests

  • P2 The new continuous-path tests encode the wrong estimator formula. tests/test_had.py#L234 through tests/test_had.py#L290 assert att == tau_bc / D̄ and CI == ci_tau / D̄, so they will pass even while the estimator is returning the wrong quantity. Impact: the test suite now protects the bug instead of the paper. Concrete fix: replace those assertions with Equation 7 / Theorem 3 checks using (dy.mean() - bc_fit.estimate_bias_corrected) / den and the corresponding reversed CI endpoints, and add regression coverage for negative post-period doses plus d_lower < min(D) on continuous_near_d_lower.

Path to Approval

  1. Fix the continuous estimator in diff_diff/had.py#L1177 so it uses dy_arr.mean() - tau_bc and the correct denominator for each design, then update the matching docs at diff_diff/had.py#L121 and docs/methodology/REGISTRY.md#L2319.
  2. Add front-door validation on the original post-period dose: reject any D_{g,2} < 0, and require d_lower == min(D_{g,2}) on continuous_near_d_lower, not just on mass_point.
  3. Surface the Design 1 identification choice explicitly by warning on Assumption 5/6 or by requiring an explicit assumption argument and storing that choice in results metadata.
  4. Rewrite the new continuous-path tests at tests/test_had.py#L234 and tests/test_had.py#L279, and add regression tests for negative doses and off-support d_lower.
  5. Align the mass-point vcov_type docs and stored metadata with the actual effective variance family.

Address CI AI review on PR #346.

**P0 (Methodology — estimator formula was wrong):** The continuous paths
computed `att = tau_bc / D_bar` instead of the paper's
`att = (mean(ΔY) - tau_bc) / den`. On a known DGP with β = 0.3 the old
code returned ~0 (since tau_bc ≈ lim_{d↓d̲} E[ΔY|D≤d] ≈ 0); the fix
recovers the true β at n=4000.

- Design 1' (continuous_at_zero), paper Equation 7 / Theorem 3:
    β = (E[ΔY] - lim_{d↓0} E[ΔY | D_2 ≤ d]) / E[D_2]
  Implementation: `att = (dy_arr.mean() - tau_bc) / d_arr.mean()`,
  `se = se_robust / |d_arr.mean()|`.
- Design 1 (continuous_near_d_lower), paper Theorem 4 /
  `WAS_{d_lower}` under Assumption 6:
    β = (E[ΔY] - lim_{d↓d_lower} E[ΔY | D_2 ≤ d]) / E[D_2 - d_lower]
  Implementation: regressor shift `D - d_lower`, evaluate local-linear
  fit at boundary=0 on the shifted scale; `att = (dy_arr.mean() -
  tau_bc) / (d_arr - d_lower).mean()`.

CI endpoints reverse under the subtraction `(ΔȲ - tau_bc)`; safe_inference
computes `att ± z · se` from scratch so reversal is automatic.
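The corrected β-scale algebra described above can be sketched as a standalone helper. Here tau_bc and se_robust are assumed to come from the boundary local-linear fit; the function name and signature are illustrative, and only the rescaling step from Equation 7 / Theorem 4 (as quoted in this commit) is shown:

```python
import numpy as np

Z_95 = 1.959963984540054  # two-sided 95% normal critical value

def beta_scale(dy, d, tau_bc, se_robust, design, d_lower=None):
    """Sketch of the corrected rescaling: att = (mean(dY) - tau_bc) / den,
    with den = E[D_2] on Design 1' and E[D_2 - d_lower] on Design 1."""
    dy = np.asarray(dy, float)
    d = np.asarray(d, float)
    if design == "continuous_at_zero":
        den = d.mean()                  # E[D_2]
    else:                               # continuous_near_d_lower
        den = (d - d_lower).mean()      # E[D_2 - d_lower]
    att = (dy.mean() - tau_bc) / den
    se = se_robust / abs(den)
    # CI rebuilt from scratch as att +/- z*se, so endpoint reversal
    # under the subtraction is automatic, as the commit message notes.
    return att, se, (att - Z_95 * se, att + Z_95 * se)
```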

**P1 (Validator — incomplete contract on continuous paths):**
- `_validate_had_panel` now rejects negative post-period doses
  (`D_{g,2} < 0`) front-door on the ORIGINAL unshifted scale so errors
  reference the user's dose column, not shifted values.
- `continuous_near_d_lower` now enforces `d_lower == float(d.min())`
  within float tolerance, mirroring the mass-point guard. Off-support
  `d_lower` would otherwise pass through Phase 1c's 5% plausibility
  heuristic and silently identify a different estimand.
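The float-tolerance d_lower equality guard described in the bullets above can be sketched like this (helper name and tolerances are assumptions for illustration; the actual guard lives inside fit()):

```python
import math

def check_d_lower(d_min, d_lower, rel_tol=1e-9, abs_tol=1e-12):
    """Sketch of the d_lower contract: None auto-resolves to the support
    minimum; an explicit value must equal d.min() within float tolerance,
    since an off-support d_lower identifies a different estimand."""
    if d_lower is None:
        return d_min  # auto-resolve to the support minimum
    if not math.isclose(d_lower, d_min, rel_tol=rel_tol, abs_tol=abs_tol):
        raise ValueError(
            f"d_lower={d_lower!r} must equal the support minimum {d_min!r}; "
            "an off-support d_lower identifies a different estimand."
        )
    return d_lower
```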

**P1 (Identification — Assumption 5/6 not surfaced):** Design 1 paths
(continuous_near_d_lower + mass_point) now emit a UserWarning stating
that `WAS_{d_lower}` identification requires Assumption 6 (or
Assumption 5 for sign identification only) beyond parallel trends, and
that neither is testable via pre-trends. continuous_at_zero (Design 1',
Assumption 3 only) does not emit the warning.

**P2 (vcov_type docstring mismatch):** Corrected the constructor
docstring to reflect actual behavior: `vcov_type=None` falls back to
the `robust` flag (`robust=True -> hc1`, `robust=False -> classical`,
the default); explicit `vcov_type` takes precedence over `robust`.

**Tests rewritten + new regressions:**
- `TestBetaScaleRescaling` now pins the corrected formula at atol=1e-14
  (both designs), pins the CI endpoint reversal, and adds two
  "recover true β" asymptotic sanity tests.
- New `TestDesign1DLowerContract` covers mass-point AND
  continuous_near_d_lower d_lower equality enforcement.
- New `TestPostPeriodDoseContract` covers negative-dose rejection.
- New `TestAssumptionFiveSixWarning` verifies the Design 1 warning is
  emitted on continuous_near_d_lower + mass_point and NOT on
  continuous_at_zero.

**Docs:** REGISTRY.md HAD Phase 2a section updated with the corrected
estimator formula, the d_lower contract on both Design 1 paths, the
Assumption 5/6 warning note, and a CI-endpoint-reversal note. had.py
module and result-class docstrings updated in kind.

Targeted regression: 118 HAD tests + 497 total across Phase 1 and
adjacent surfaces, all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Apr 20, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 8c38f072c0e6d2a8fb31715aa51cff9db5fa1cec


Overall Assessment

⚠️ Needs changes

Executive Summary

  • By inspection, the prior formula/support/assumption blockers look resolved: the continuous path now matches the paper’s Design 1′ and Design 1 estimands, and the PR now surfaces the extra Design 1 identification burden.
  • Severity P1 [Newly identified]: design="continuous_near_d_lower" can still be forced on a true mass-point sample, which bypasses the paper’s Section 3.2.4 mass-point/2SLS branch after the code shifts D - d_lower.
  • Severity P1 [Newly identified]: the new first_treat_col validator assumes numeric period labels, so it will fail on otherwise supported string-labelled two-period panels.
  • The remaining scope cuts (event_study, survey, weights, continuous-path cluster, mass-point hc2/hc2_bm) are documented in REGISTRY.md / TODO.md, so I am not counting them as blockers.
  • I could only review the tests statically here because pytest is not installed in this environment.

Methodology

  • Severity P1 [Newly identified]. Impact: HAD / WAS Design 1 dispatch is still methodologically unsafe on explicit overrides. fit() dispatches straight into the continuous path at diff_diff/had.py:L1115, _fit_continuous() shifts to d_arr - d_lower at diff_diff/had.py:L1252, and the upstream mass-point rejection only fires when d.min() > 0 at diff_diff/local_linear.py:L620. On a sample with a real mass point at the original d_lower, that shift turns the support minimum into zero and suppresses the Section 3.2.4 branch that should use the sample-average / 2SLS estimator. Concrete fix: before shifting, re-run the modal-fraction check on the original d_arr; if the fraction at d.min() exceeds 2%, either force mass_point or raise on explicit continuous_near_d_lower. Add a regression using the existing mass-point DGP.
  • Severity P3. Impact: the structural-residual 2SLS sandwich on the mass-point path is documented in the registry note, so under the review rules I treat it as informational rather than a defect. Code: diff_diff/had.py:L651, docs/methodology/REGISTRY.md:L2323. Concrete fix: none for this PR.

Code Quality

  • Severity P1 [Newly identified]. Impact: first_treat_col validation is not dtype-agnostic. At diff_diff/had.py:L482 the code coerces grouped first_treat values to float, and at diff_diff/had.py:L488 it coerces them to int. That breaks the supported string-period path exercised at tests/test_had.py:L1399: a valid two-period panel with period in {"A","B"} and first_treat_col in {0,"B"} will error before the intended contract check. Concrete fix: use pd.isna(ft) for missingness and raw equality against {0, t_post} without numeric coercion; add a regression with string period labels and first_treat_col="ft".
  • Severity P2. Impact: clustered mass-point results still store the requested vcov_type instead of the effective CR1 family. The computation switches to CR1 whenever cluster is not None at diff_diff/had.py:L739, but the result object stores vcov_effective unchanged at diff_diff/had.py:L1165. The text summary patches over this at diff_diff/had.py:L272, but to_dict() and downstream consumers still see "classical" / "hc1" rather than the actual clustered family. Concrete fix: store "cr1" when cluster is supplied, or split requested vs effective variance labels.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings.
Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: tests/test_had.py covers the earlier blockers, but it does not add regressions for the two remaining edge cases above. The explicit-override coverage at tests/test_had.py:L1064 misses continuous_near_d_lower on a mass-point sample, and the panel-validator coverage at tests/test_had.py:L1399 does not exercise first_treat_col with nonnumeric periods. Concrete fix: add those two regressions. I could not execute the suite here because pytest is unavailable.

Path to Approval

  1. Add an original-scale modal-fraction guard before the continuous_near_d_lower path shifts d_arr. If modal_fraction(d.min()) > 0.02, fail fast or redirect to mass_point, and add a regression using the existing mass-point DGP.
  2. Rewrite first_treat_col validation to be dtype-agnostic: use pd.isna for missingness and raw-value membership against {0, t_post}. Add a regression with period=("A","B") and first_treat_col in {0,"B"}.

**P1 #1 (Methodology): continuous_near_d_lower on mass-point samples**

When a user explicitly forced design="continuous_near_d_lower" on a
sample that actually satisfies the >2% modal-fraction mass-point
criterion, the downstream regressor shift (D - d_lower) would move the
support minimum to zero on the shifted scale. Phase 1c's mass-point
rejection guard only fires when d.min() > 0 (_validate_had_inputs), so
the silent coercion ran the nonparametric local-linear estimator on a
sample the paper (Section 3.2.4) requires to use the 2SLS branch,
producing the wrong estimand.

Fix: `HeterogeneousAdoptionDiD.fit()` now runs the modal-fraction
check on the ORIGINAL (unshifted) d_arr when the user explicitly
selects design="continuous_near_d_lower". If the fraction at d.min()
exceeds 2%, the fit raises ValueError pointing to design="mass_point"
or design="auto". design="auto" is unaffected (_detect_design already
correctly resolves such samples to mass_point).

**P1 #2 (Code Quality): first_treat_col validator not dtype-agnostic**

The previous validator called `.astype(np.float64)` and `int(v)` on
grouped first_treat values, which crashed on otherwise-supported
string-labelled two-period panels (period in {"A","B"}, first_treat
in {0, "B"}). Rewrote using `pd.isna()` for missingness and raw-value
set-membership against `{0, t_post}` with no numeric coercion.
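The dtype-agnostic rewrite described above can be sketched as follows (function name is illustrative; the real validator operates on grouped panel values):

```python
import pandas as pd

def validate_first_treat(values, t_post):
    """Sketch of the dtype-agnostic value-domain check: each first_treat
    value must be 0 (treated at t_post per the HAD convention) or the
    post-period label itself, with no numeric coercion."""
    allowed = {0, t_post}
    for v in values:
        if pd.isna(v):  # dtype-agnostic missingness check
            raise ValueError("first_treat_col contains missing values")
        if v not in allowed:  # raw-value set membership, no float()/int()
            raise ValueError(f"first_treat value {v!r} not in {{0, {t_post!r}}}")
```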

**P2 (Maintainability): cluster-applied mass-point stored wrong vcov_type**

When cluster was supplied, `_fit_mass_point_2sls` unconditionally
switches to the CR1 cluster-robust sandwich, but the result object
stored the REQUESTED family ("hc1" or "classical") as `vcov_type`.
`summary()` rendered correctly via the cluster_name branch, but
`to_dict()` and downstream programmatic consumers saw the stale
requested label. Fixed: when cluster is supplied, `vcov_type` is
stored as `"cr1"` regardless of the requested family. Renamed the
local variable from `vcov_effective` to `vcov_requested` to separate
the input from the effective family. Updated the
`HeterogeneousAdoptionDiDResults.summary()` branch so the cluster
rendering still works with the new stored value.
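The requested-vs-effective split described above reduces to a small resolution rule, sketched here with illustrative names (the diff_diff internals are not shown in this PR):

```python
def resolve_vcov(vcov_type, robust, cluster):
    """Sketch of the variance-family resolution: vcov_type=None falls back
    to the robust flag; supplying cluster always yields effective CR1."""
    requested = vcov_type if vcov_type is not None else ("hc1" if robust else "classical")
    effective = "cr1" if cluster is not None else requested
    return requested, effective
```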

**Tests added (+8 regression):**
- TestValidateHadPanel.test_first_treat_col_with_string_periods
- TestValidateHadPanel.test_first_treat_col_dtype_agnostic_rejects_invalid_string
- TestContinuousPathRejectsMassPoint (2 tests)
- TestMassPointClusterLabel (4 tests: cr1 stored when clustered, base
  family when unclustered, classical+cluster collapses to cr1,
  to_dict shows effective family)

Targeted regression: 126 HAD tests + 505 total across Phase 1 and
adjacent surfaces, all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Apr 20, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: c320ddeb6bb6a8da191667a034a108734e66d8d5


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The prior blockers from the last review look addressed: the original-scale mass-point guard is now before the regressor shift, first_treat_col validation is now dtype-agnostic, and clustered mass-point results now surface the effective cr1 label. See diff_diff/had.py:L481-L507, diff_diff/had.py:L1057-L1083, and diff_diff/had.py:L1204-L1210.
  • Severity P1: _validate_had_panel() still infers pre/post by lexicographically sorting the two time labels, so valid string-labeled panels such as "pre"/"post" or "before"/"after" will be reversed and then fail the D_{g,1}=0 check. See diff_diff/had.py:L398-L420 and diff_diff/had.py:L449-L459.
  • Severity P2: HeterogeneousAdoptionDiDResults.summary() still hardcodes the coefficient row label as WAS, even when the estimator correctly stores target_parameter="WAS_d_lower". See diff_diff/had.py:L292-L303.
  • The structural-residual 2SLS sandwich and the Phase 2a scope cuts (hc2/hc2_bm, event_study, survey, weights, continuous-path cluster) are documented in the registry/TODO, so I am not counting them as defects. See docs/methodology/REGISTRY.md:L2319-L2326 and TODO.md:L94-L98.
  • I could not execute the test suite here: pytest is not installed in this environment, and the base Python environment also lacks project deps such as numpy.

Methodology

  • Severity P1. Impact: the panel validator’s period-order inference is not actually safe for supported string labels. _validate_had_panel() sorts unique() period labels lexicographically at diff_diff/had.py:L398-L420, then applies the HAD assumption check D_{g,1}=0 using that inferred t_pre at diff_diff/had.py:L449-L459. For a valid two-period panel labeled ("pre", "post"), the sort order is ("post", "pre"), so the code checks the treated period as if it were pre-treatment and raises on a valid design. This is a real production edge-case on a changed assumption-check path, not just a test artifact. Concrete fix: infer t_pre as the unique period whose dose is zero for all units and t_post as the other period, or require explicit pre/post ordering from the caller; do not use lexicographic sort as a proxy for chronology.
  • Severity P3. Impact: the mass-point structural-residual 2SLS sandwich and the hc2/hc2_bm fail-fast behavior are documented deviations/deferrals, so they are informational only. See diff_diff/had.py:L649-L791, docs/methodology/REGISTRY.md:L2323-L2326, and TODO.md:L97-L98. Concrete fix: none for this PR.

Code Quality

  • Severity P2. Impact: the result summary can misstate the estimand on Design 1 paths. The result object correctly stores target_parameter="WAS_d_lower" for continuous_near_d_lower and mass_point, but the printed coefficient row is still hardcoded to WAS at diff_diff/had.py:L292-L303. That is a user-facing contract mismatch in the estimator’s main output surface. Concrete fix: render self.target_parameter in the summary row instead of a hardcoded label.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3. Impact: the remaining Phase 2a gaps are properly tracked, so they are not blockers for this review: aggregate="event_study", survey=, weights=, continuous-path cluster=, and mass-point hc2 / hc2_bm. See TODO.md:L94-L98 and docs/methodology/REGISTRY.md:L2323-L2326. Concrete fix: none for this PR.

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: the new tests cover only lexically ordered string labels ("A", "B"), so they miss the new period-order bug; they also do not assert that Design 1 summaries print WAS_d_lower. See tests/test_had.py:L1399-L1421 and tests/test_had.py:L894-L941. Concrete fix: add a regression with period=("pre","post") and first_treat_col in {0, "post"}, and add summary-label assertions for continuous_near_d_lower and mass_point.

Path to Approval

  1. Replace lexicographic period ordering in _validate_had_panel() with HAD-consistent pre/post inference. The safest implementation is to identify the pre period as the unique period where dose == 0 for all units, and the post period as the other period.
  2. Add a regression test that fits a valid two-period panel with string labels like ("pre", "post") and first_treat_col values {0, "post"} so this validator bug cannot regress again.


**P1 (Methodology): _validate_had_panel inferred pre/post by lexicographic sort**

Previously the validator sorted the two period labels alphabetically
and assigned `t_pre=periods[0]`, `t_post=periods[1]`. On supported
string-labelled panels like `("pre", "post")` the alphabetic order is
["post", "pre"], so the code flipped pre and post and then raised on
the treated-period D>0 check for a valid design. Same bug for
`("before", "after")` and any non-alphabetic-chronological label pair.

Fix: identify `t_pre` as the unique period where dose == 0 for ALL
units (HAD paper Section 2 no-unit-untreated convention); `t_post` is
the other period. This is a DGP-consistent invariant, not a string
ordering. If neither period has all-zero dose, raise with the
contract message and per-period nonzero-count diagnostics. If both
periods have all-zero dose, raise (no treatment variation to estimate).

The existing pre-period D=0 check is now tautological and has been
removed since the inference itself enforces the invariant. Behavior
on valid numeric panels (e.g., 2020/2021) is unchanged.
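The DGP-consistent inference described above can be sketched as follows. Names are illustrative; the point is that pre/post is identified from the all-zero-dose invariant, never from label ordering:

```python
import pandas as pd

def infer_pre_post(df, period_col, dose_col):
    """Sketch: t_pre is the unique period where dose == 0 for ALL units
    (HAD no-unit-untreated convention); t_post is the other period."""
    periods = list(df[period_col].unique())
    if len(periods) != 2:
        raise ValueError("expected exactly two periods")
    all_zero = [p for p in periods
                if (df.loc[df[period_col] == p, dose_col] == 0).all()]
    if len(all_zero) == 0:
        raise ValueError("no period with dose == 0 for all units; "
                         "D_{g,1}=0 contract violated")
    if len(all_zero) == 2:
        raise ValueError("both periods have zero dose; "
                         "no treatment variation to estimate")
    t_pre = all_zero[0]
    t_post = periods[0] if periods[1] == t_pre else periods[1]
    return t_pre, t_post
```

Note that ("pre", "post") and ("before", "after") both resolve correctly here even though their lexicographic order is reversed.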

**P2 (Code Quality): summary() hardcoded 'WAS' row label**

`HeterogeneousAdoptionDiDResults.summary()` printed "WAS" as the
parameter label regardless of the resolved design. For Design 1
paths (continuous_near_d_lower, mass_point) the stored
`target_parameter` is "WAS_d_lower" per paper Sections 3.2.2-3.2.4,
so the user-facing output misrepresented the estimand.

Fix: render `self.target_parameter` in the summary row. Now Design 1'
prints "WAS", Design 1 prints "WAS_d_lower", matching the stored
result metadata.

**Tests (+7 regression):**
- TestValidateHadPanel.test_semantic_pre_post_labels_not_lexicographic
- TestValidateHadPanel.test_semantic_pre_post_with_first_treat_col
- TestValidateHadPanel.test_semantic_pre_post_fit_end_to_end
- TestValidateHadPanel.test_before_after_labels
- TestValidateHadPanel.test_no_all_zero_period_raises
- TestValidateHadPanel.test_both_all_zero_periods_raises
- TestResultMethods.test_summary_uses_target_parameter_for_row_label

Targeted regression: 133 HAD tests + 512 total across Phase 1 and
adjacent surfaces, all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Apr 20, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: ddc09e4341f528862de4a62505032d314fa8deea


Overall Assessment

⚠️ Needs changes

Executive Summary

Methodology

  • Severity P1 [Newly identified]. Impact: the Phase 2a registry text is regime-based, not user-override-based: when d̲=0, the documented path is Design 1' (continuous_at_zero), while continuous_near_d_lower and mass_point are the d̲>0 cases. The new estimator intentionally advertises “explicit overrides run the chosen path without auto-reject,” and there is no fit-time guard enforcing d_lower > 0 for continuous_near_d_lower or mass_point. As shipped, design="mass_point", d_lower=0.0 on a d.min()==0 sample returns a finite 2SLS result, and design="continuous_near_d_lower" on the same regime would run the Design 1' algebra while labeling the target WAS_d_lower and emitting the Assumption 5/6 warning. That is an undocumented methodology deviation from the source-material dispatch rules. See docs/methodology/REGISTRY.md:L2147-L2156, docs/methodology/REGISTRY.md:L2239-L2241, docs/methodology/REGISTRY.md:L2320-L2324, diff_diff/had.py:L109-L115, diff_diff/had.py:L831-L834, diff_diff/had.py:L1065-L1168, diff_diff/had.py:L1307-L1314, and tests/test_had.py:L1111-L1124. Concrete fix: reject design="mass_point" and design="continuous_near_d_lower" whenever the resolved d_lower is zero within tolerance, with an error pointing users to design="continuous_at_zero" or design="auto".

Code Quality

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3. Impact: the remaining Phase 2a scope cuts and deferred variance work are properly tracked, so they are informational rather than blocking: aggregate="event_study", survey=, weights=, continuous-path cluster=, and mass-point vcov_type in {"hc2","hc2_bm"}. See docs/methodology/REGISTRY.md:L2323-L2326 and TODO.md:L94-L98. Concrete fix: none for this PR.

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: the override contract is now internally inconsistent. The public docstring says explicit overrides run “without auto-reject,” while the tests encode one override that raises (continuous_at_zero on mass-point data) and another that succeeds (mass_point on d.min()==0 data). That ambiguity will confuse users and will regress once the methodology guard above is fixed. See diff_diff/had.py:L828-L834 and tests/test_had.py:L1101-L1124. Concrete fix: after tightening the design guard, rewrite the design docstring to describe the supported override behavior precisely, and invert tests/test_had.py:L1111-L1124 into a rejection regression.
  • Review limitation: I could not run tests/test_had.py here because pytest, numpy, and pandas are not installed.

Path to Approval

  1. Add a fit-time guard in HeterogeneousAdoptionDiD.fit() that rejects design="mass_point" and design="continuous_near_d_lower" when the resolved d_lower is zero within the same tolerance family used by _detect_design, and direct users to design="continuous_at_zero" / design="auto".
  2. Replace tests/test_had.py:L1111-L1124 with a regression that expects ValueError, and add the parallel rejection test for design="continuous_near_d_lower" on a d.min()==0 sample.
  3. Update the design parameter docstring in diff_diff/had.py:L828-L837 so the supported override contract matches the guarded behavior.

**P1 (Methodology): Design 1 paths must reject d_lower = 0**

Paper Section 3.2 partitions HAD by regime: `d_lower = 0` is Design 1'
(`continuous_at_zero`); `d_lower > 0` is Design 1
(`continuous_near_d_lower` or `mass_point`). The auto-detect rule
already respects this
partition, but explicit overrides previously allowed
`design="mass_point", d_lower=0` and `design="continuous_near_d_lower"`
on a `d.min()==0` sample to run silently, returning a
paper-incompatible estimand (2SLS with degenerate single-unit mass for
mass_point; Design 1' algebra relabeled as `WAS_d_lower` with a
spurious Assumption 5/6 warning for continuous_near_d_lower).

Fix: add a fit-time guard that raises `ValueError` when
`resolved_design in ("mass_point", "continuous_near_d_lower")` and the
resolved `d_lower_val` is within float tolerance of zero (same
tolerance family as `_detect_design`'s d.min()==0 tie-break). The
error message points users to `continuous_at_zero` or `auto` for
samples with support infimum at zero.
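The guard can be sketched roughly as below. `check_design_regime` is a hypothetical standalone form (the shipped check runs inside `HeterogeneousAdoptionDiD.fit()`, and the exact tolerances are assumptions):

```python
import numpy as np

def check_design_regime(resolved_design, d_lower_val, rtol=1e-9, atol=1e-12):
    """Reject Design 1 overrides when the resolved d_lower is zero within
    float tolerance; Design 1 targets WAS_d_lower and requires d_lower > 0."""
    if resolved_design in ("mass_point", "continuous_near_d_lower"):
        if np.isclose(d_lower_val, 0.0, rtol=rtol, atol=atol):
            raise ValueError(
                f"design={resolved_design!r} requires d_lower > 0; this sample "
                "has support infimum 0. Use design='continuous_at_zero' or "
                "design='auto'."
            )

d = np.array([0.0, 0.3, 0.7, 1.1])
try:
    check_design_regime("mass_point", float(d.min()))
except ValueError as e:
    print("rejected:", e)
```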

**Docstring + test updates:**

- Rewrote the `design` parameter docstring to document the
  regime-partition contract precisely: each explicit override is
  now described with its d_lower precondition and mass-point
  compatibility.
- Rewrote the `d_lower` parameter docstring to note the
  Design-1-requires-positive contract.
- Inverted the prior
  `test_force_mass_point_on_continuous_data_at_support_infimum` test
  (which incorrectly codified the unsupported behavior) into three
  rejection regressions:
  `test_force_mass_point_on_d_lower_zero_sample_raises`,
  `test_force_continuous_near_d_lower_on_d_lower_zero_sample_raises`,
  `test_force_mass_point_d_lower_none_on_zero_sample_raises`.

Targeted regression: 135 HAD tests + 514 total across Phase 1 and
adjacent surfaces, all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 20, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: e622d60482fa06544f2ca0a51c3e06584c4e0b83


Overall Assessment

⚠️ Needs changes

Executive Summary

Methodology

  • Severity P1 [Newly identified]. Impact: the paper/registry split Design 1 into two different estimators: 2SLS only for the mass-point-at-d̲ case, and the shifted local-linear boundary-limit estimator for the continuous-near-d̲ case. The new code enforces one side of that split by rejecting continuous_near_d_lower on obvious mass-point samples, but there is no reciprocal guard before the mass_point dispatch. As written, any explicit design="mass_point" call with d_lower > 0 and d_lower == d.min() goes straight into _fit_mass_point_2sls(), even when the sample only fits the continuous-near-d̲ regime. That silently changes the control object from the boundary limit lim_{d ↓ d̲} E[ΔY | D_2 ≤ d] to the exact-d.min() cell, which is not the documented Design 1 continuous estimator. docs/methodology/REGISTRY.md:L2240-L2241 docs/methodology/REGISTRY.md:L2319-L2324 diff_diff/had.py:L833-L853 diff_diff/had.py:L1129-L1145 diff_diff/had.py:L1250-L1262
    Concrete fix: either add the reciprocal front-door rejection for design="mass_point" when the sample does not satisfy the same modal-min mass-point rule used by _detect_design(), or explicitly document in REGISTRY.md that design="mass_point" is an expert override that intentionally bypasses the heuristic. In either case, codify that contract in tests.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: the new tests now pin the fixed zero-boundary rejection and the continuous_near_d_lower-on-mass-point rejection, but they still do not pin the reciprocal mass_point-on-continuous contract. That leaves the methodology gap above unguarded in the suite. tests/test_had.py:L1111-L1140 tests/test_had.py:L1556-L1577
    Concrete fix: add a regression using _dgp_continuous_near_d_lower() that either expects ValueError or asserts the documented expert-override behavior, whichever contract you choose.

Path to Approval

  1. Decide whether explicit design="mass_point" is supposed to obey the same modal-min mass-point regime check as design="auto". If yes, add the fit-time rejection before _fit_mass_point_2sls(). If no, add a **Note:** block in docs/methodology/REGISTRY.md and align the design docstring so the bypass is explicit and intentional.
  2. Add a regression in tests/test_had.py on _dgp_continuous_near_d_lower() that codifies that decision.
  3. Re-run tests/test_had.py in an environment with numpy and pytest installed.

**P1 (Methodology): reciprocal regime check for design="mass_point"**

Round 2 added a guard that rejects `design="continuous_near_d_lower"`
on samples with modal-min fraction > 2% (mass-point samples). The
reciprocal direction was left unguarded: `design="mass_point"` on a
continuous-near-d_lower sample (modal fraction <= 2%) would silently
run 2SLS, but the instrument Z = 1{D > d.min()} degenerates because
the "mass" side has almost no observations. The resulting point
estimate identifies the exact-d.min() cell rather than the paper's
boundary-limit estimand (Section 3.2.4 2SLS is only the documented
Design 1 estimator when the modal mass exists).

Fix: extend the existing pre-shift regime check so it fires
symmetrically on both explicit overrides. When
`resolved_design == "mass_point"` and the modal fraction at d.min()
does NOT exceed 2%, raise ValueError pointing users to
`continuous_near_d_lower` or `auto`. Same 2% threshold used by
`_detect_design()` and Phase 1c's `_validate_had_inputs()` so the
three dispatch paths share one regime rule.

**Docs:** Updated the `design` parameter docstring to document the
reciprocal mass-point contract.

**Tests (+4 regression):**
- TestMassPointPathRejectsContinuousSample.test_mass_point_on_continuous_near_sample_raises
- TestMassPointPathRejectsContinuousSample.test_mass_point_on_true_mass_point_sample_runs
- TestMassPointPathRejectsContinuousSample.test_mass_point_modal_at_threshold_runs (2.5%)
- TestMassPointPathRejectsContinuousSample.test_mass_point_modal_exactly_two_percent_raises

Targeted regression: 139 HAD tests + 518 total across Phase 1 and
adjacent surfaces, all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 20, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 2884c1d81adf576edc6f14397b8cb97880a28cbe


Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1. The prior re-review blocker about forcing design="mass_point" on continuous-near-d̲ samples appears resolved, but the new estimator still ships as panel-only while the project’s HAD methodology docs still describe the method as applicable to panel or repeated cross-section data, and that restriction is not documented as a registry **Note:** / **Deviation**. See diff_diff/had.py:L1123, tests/test_had.py:L1580, docs/methodology/REGISTRY.md:L2146, and docs/methodology/papers/dechaisemartin-2026-review.md:L23.

Executive Summary

Methodology

  • Severity P1 [Newly identified]. Impact: the project’s HAD methodology docs still say the estimator applies to panel or repeated cross-section data, but the shipped API is panel-only: fit() is documented as operating on a “two-period panel,” requires unit_col, and _validate_had_panel() hard-enforces a balanced two-period panel. The new HAD TODO rows track other Phase 2a deferrals, but not repeated-cross-section support. That makes the shipped surface narrower than the documented method without a registry **Note:** / **Deviation** recording the restriction. Concrete fix: either implement a repeated-cross-section path, or add an explicit HAD registry **Note:** that Phase 2a is panel-only, mirror that in the paper review and had.py docstrings, and add a regression that asserts the intended rejection on repeated-cross-section-shaped input. See docs/methodology/REGISTRY.md:L2146, docs/methodology/papers/dechaisemartin-2026-review.md:L23, diff_diff/had.py:L1007, diff_diff/had.py:L421, tests/test_had.py:L56, and TODO.md:L93.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Security

No findings.

Documentation/Tests

  • Severity P3 [Newly identified]. Impact: had.py cites Design 1' as “Equation 7 / Theorem 3” and the continuous Design 1 path as “Proposition 3 / Theorem 4,” but the registry maps those source points to Theorem 1 / Equation 3 for Design 1' identification and Theorem 3 / Equation 11 for WAS_{d̲}; Theorem 4 in the registry is the QUG test. This does not change the computation, but it leaves the in-code references inconsistent with the source material. Concrete fix: update the module docstring, result docstring, and _fit_continuous() docstring to cite Theorem 1 / Equation 3 for Design 1' identification, Equation 7 for the sample estimator, and Theorem 3 / Equation 11 for WAS_{d̲}. See diff_diff/had.py:L12, diff_diff/had.py:L22, diff_diff/had.py:L138, diff_diff/had.py:L143, docs/methodology/REGISTRY.md:L2173, docs/methodology/REGISTRY.md:L2196, and docs/methodology/REGISTRY.md:L2262.
  • I could not execute tests/test_had.py here because numpy, pandas, and pytest are unavailable in this environment.

Path to Approval

  1. Add an explicit HAD registry **Note:** / **Deviation** that Phase 2a is panel-only and does not yet support repeated cross-sections, and mirror that restriction in docs/methodology/papers/dechaisemartin-2026-review.md and the HeterogeneousAdoptionDiD.fit() docstring.
  2. Add a regression test that constructs a repeated-cross-section-shaped input and asserts the intended panel-only error message. If the intent is to support repeated cross-sections now, implement that path instead.

**P1 (Methodology): Phase 2a panel-only not documented as deviation**

Paper Section 2 defines HAD on panel OR repeated cross-section data,
but Phase 2a ships a panel-only implementation. The existing
balanced-panel validator rejects RCS inputs (disjoint unit IDs
between periods) with a generic "unit(s) do not appear in both
periods" error, but the scope restriction was not documented in
REGISTRY.md as a `**Note:**` deviation from the paper's documented
surface.

Fix:
- REGISTRY.md: new `**Note (Phase 2a panel-only):**` block under the
  HAD Phase 2a entry, explaining the restriction and pointing at the
  follow-up RCS PR.
- `docs/methodology/papers/dechaisemartin-2026-review.md`: mirrored
  implementation note on the panel-or-RCS scope line.
- `HeterogeneousAdoptionDiD.fit()` docstring: new preamble paragraph
  stating the panel-only restriction and why (unit-level first
  differences required).
- `TODO.md`: new row queuing RCS identification-path work.
- `tests/test_had.py`: two regression tests
  (`test_repeated_cross_section_raises` and
  `test_repeated_cross_section_fit_raises`) that construct RCS-shaped
  input and assert the validator + fit() raise with the expected
  error.
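The balanced-panel contract that rejects RCS-shaped input can be sketched as follows. This is a hypothetical simplification (`assert_balanced_two_period_panel` is an illustrative name; `_validate_had_panel()` does more, including the dose invariant):

```python
import pandas as pd

def assert_balanced_two_period_panel(df, unit_col, time_col):
    """Panel-only contract: every unit must appear in both periods.
    Repeated-cross-section input (disjoint unit IDs across periods)
    is rejected, because unit-level first differences are required."""
    periods = sorted(df[time_col].unique())
    if len(periods) != 2:
        raise ValueError(f"expected 2 periods, got {len(periods)}")
    per_unit = df.groupby(unit_col)[time_col].nunique()
    incomplete = per_unit[per_unit < 2]
    if len(incomplete) > 0:
        raise ValueError(
            f"{len(incomplete)} unit(s) do not appear in both periods; "
            "Phase 2a is panel-only and does not support repeated "
            "cross-sections."
        )

# RCS-shaped input: different units observed in each period.
rcs = pd.DataFrame({"unit": [1, 2, 3, 4],
                    "time": [0, 0, 1, 1],
                    "y": [1.0, 2.0, 3.0, 4.0]})
try:
    assert_balanced_two_period_panel(rcs, "unit", "time")
except ValueError as e:
    print("rejected:", e)
```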

**P3 (Docs/Tests): theorem/equation references were off**

Module docstring, result-class `att` docstring, and
`_fit_continuous()` docstring cited the paper's theorems/equations
inconsistently with the registry's source map:
- Design 1' identification: said "Theorem 3" -> should be Theorem 1
  with Equation 3; Equation 7 is the sample estimator.
- Design 1 continuous-near-d_lower: said "Proposition 3 / Theorem 4"
  -> should be Theorem 3 / Equation 11 (Theorem 4 is the QUG null
  test, not this estimand).

Fix: updated all three docstring sites to cite Theorem 1 / Equation 3
(Design 1' identification) + Equation 7 (sample estimator), and
Theorem 3 / Equation 11 (Design 1 continuous-near-d_lower / WAS_d_lower
under Assumption 6).

Targeted regression: 141 HAD tests + 520 total across Phase 1 and
adjacent surfaces, all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 20, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: afa0f6c68586a1e15a09d1dfd19badc0402e52d5


Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1. The previous re-review blocker about Phase 2a being panel-only looks resolved via the new registry note, paper-review note, TODO tracking, and repeated-cross-section tests. The remaining blocker is a new parameter-interaction bug: cluster= is documented as ignored on continuous designs, but the implementation validates/extracts cluster data before design dispatch, so a missing or malformed cluster column can still abort an otherwise valid continuous fit. See diff_diff/had.py:L889, diff_diff/had.py:L1081, diff_diff/had.py:L1252, and TODO.md:L97.

Executive Summary

Methodology

Code Quality

  • Severity P1. Impact: cluster is a new public parameter, and the public contract says it is ignored with a warning on continuous designs. In practice, fit() always passes self.cluster into _aggregate_first_difference() before design resolution, and that helper immediately raises if the cluster column is missing, non-constant within unit, or contains NaNs. So a valid continuous fit can fail just because a shared config supplied an irrelevant cluster=. That is a parameter-interaction bug on a new code path, and the current test coverage only exercises the happy path with a valid cluster column. Concrete fix: resolve the design before cluster extraction, or call _aggregate_first_difference(..., cluster_col=None) unless the resolved design is mass_point; then add regressions for design="continuous_at_zero" and design="auto" on continuous samples where cluster is missing/malformed and should only trigger the documented warning. See diff_diff/had.py:L889, diff_diff/had.py:L1081, diff_diff/had.py:L573, diff_diff/had.py:L1252, TODO.md:L97, and tests/test_had.py:L1332.

Performance

No findings.

Maintainability

No findings.

Tech Debt

  • Severity P3. Impact: the remaining scoped deferments for event_study, survey, weights, hc2/hc2_bm, and repeated-cross-section support are now explicitly tracked in TODO.md / REGISTRY.md, so I am not counting those limitations as blockers for this PR. Concrete fix: none for this PR. See TODO.md:L93, TODO.md:L96, TODO.md:L102, and docs/methodology/REGISTRY.md:L2323.

Security

No findings.

Documentation/Tests

No additional findings. tests/test_had.py is broad and appears to cover the prior review blockers, but I could not execute it here because numpy, pandas, pytest, and scipy are unavailable in this environment.

Path to Approval

  1. Make continuous designs truly ignore cluster= by deferring cluster extraction/validation until after design resolution, or by skipping cluster aggregation entirely unless resolved_design == "mass_point".
  2. Add regressions that a continuous fit (design="continuous_at_zero" and design="auto" resolving to a continuous path) still succeeds with only a warning when cluster is irrelevant, including cases where the named cluster column is missing, contains NaNs, or varies within unit.

…egistry refs

**P1 (Code Quality): cluster= must truly be ignored on continuous paths**

`HeterogeneousAdoptionDiD.fit()` previously passed `self.cluster` into
`_aggregate_first_difference()` before the design was resolved. The
aggregator validates the cluster column eagerly (missing column,
within-unit variance, NaN ID), so a valid continuous fit could abort
just because a shared config supplied an irrelevant `cluster=`. This
contradicted the documented "ignored with a warning on continuous
paths" contract.

Fix: defer cluster extraction until after design resolution. The
first aggregation call now passes `cluster_col=None` unconditionally;
a second aggregation pass with `cluster_col=cluster_arg` runs only
when `resolved_design == "mass_point"`, which is the only path that
consumes the extracted cluster array. Continuous paths emit the
existing `UserWarning` and proceed to fit without touching the
cluster column at all.
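The deferred-cluster control flow can be sketched as below. `fit_dispatch` and `fake_aggregate` are hypothetical stand-ins (the real aggregator is `_aggregate_first_difference()`, which validates the cluster column eagerly when asked):

```python
import warnings

def fit_dispatch(resolved_design, cluster_arg, aggregate):
    """Illustrative control flow: the first aggregation pass never touches
    the cluster column; only the mass-point path re-aggregates with it."""
    data = aggregate(cluster_col=None)  # always safe, no cluster validation
    cluster_ids = None
    if resolved_design == "mass_point" and cluster_arg is not None:
        cluster_ids = aggregate(cluster_col=cluster_arg)  # validated here only
    elif cluster_arg is not None:
        warnings.warn("cluster= is ignored on continuous designs", UserWarning)
    return data, cluster_ids

def fake_aggregate(cluster_col=None):
    # mimics the aggregator's eager validation: raises only when a
    # cluster column is actually requested
    if cluster_col == "does_not_exist":
        raise KeyError(cluster_col)
    return "ok"

# A continuous fit with an irrelevant, even malformed, cluster= now
# only warns instead of aborting:
with warnings.catch_warnings(record=True) as w:
    warnings.simplefilter("always")
    data, ids = fit_dispatch("continuous_at_zero", "does_not_exist",
                             fake_aggregate)
print(data, ids, len(w))  # ok None 1
```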

**P3 (Methodology): registry checklist theorem references were stale**

Round 6 fixed the theorem citations in `had.py` and the paper review
doc but missed the Phase 2a checklist line in `REGISTRY.md`, which
still said "Equation 7 / Theorem 3" for Design 1' identification and
"Theorem 4, WAS_{d̲} under Assumption 6" for the continuous-near-d_lower
path. Updated the checklist line to match: Theorem 1 / Equation 3
(identification) + Equation 7 (sample estimator) for Design 1'; Theorem
3 / Equation 11 for WAS_{d̲}.

**Tests (+4 regression):**
- test_missing_cluster_column_on_continuous_only_warns:
  continuous_at_zero + cluster='does_not_exist' -> warn + fit succeeds.
- test_nan_cluster_on_continuous_only_warns: NaN cluster IDs on
  continuous path -> warn + fit succeeds.
- test_within_unit_varying_cluster_on_continuous_only_warns:
  within-unit-varying cluster IDs on continuous -> warn + fit succeeds.
- test_auto_design_ignores_irrelevant_cluster_on_continuous: design='auto'
  resolving to continuous_at_zero also ignores cluster gracefully.

Targeted regression: 145 HAD tests + 524 total across Phase 1 and
adjacent surfaces, all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 21, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 792997dd2c40b5d844aae52b0420d9212333a73d


Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1.

Executive Summary

Methodology

  • Severity P1. Impact: the registry and class docstring both advertise that Design 1 paths accept d_lower == float(d.min()) within float tolerance, but once that tolerance check passes the implementation keeps the user-supplied value instead of snapping it back to d.min(). On the mass-point path, a tolerated d_lower = d.min() - ε changes the instrument 1{D > d_lower} so the mass point itself moves into Z=1, which can empty the control group while the reported counts still treat those units as “at d_lower”. On the continuous-near-d_lower path, a tolerated d_lower = d.min() + ε makes d_arr - d_lower slightly negative and then trips Phase 1c’s negative-dose guard. This is a contract violation on a methodology-defining threshold, not just a cosmetic tolerance issue. Concrete fix: after the tolerance check passes, normalize d_lower_val = float(d_arr.min()) before any strict > comparisons, count construction, or regressor shifting; then add regressions for both directions of the tolerated perturbation. docs/methodology/REGISTRY.md:L2321-L2323 diff_diff/had.py:L861-L867 diff_diff/had.py:L1220-L1235 diff_diff/had.py:L1237-L1245 diff_diff/had.py:L1417-L1420 diff_diff/local_linear.py:L564-L581 tests/test_had.py:L1184-L1190 tests/test_had.py:L1208-L1214

Code Quality

  • Severity P2. Impact: _validate_had_panel() reduces first_treat_col with groupby(...).first(), and _aggregate_first_difference() uses nunique() plus first() for cluster. Those operations drop NaNs and ignore later bad rows, so units like [valid, NaN] or [valid, invalid] can pass the new validators even though the implementation comments and error messages say NaNs / out-of-domain values are rejected front-door. This does not change the core estimator formula, but it weakens the advertised input-contract checks on new public parameters. Concrete fix: validate raw per-row values before collapsing to one value per unit (nunique(dropna=False), explicit per-unit isna().any(), and explicit per-row domain checks), then collapse only after validation passes. diff_diff/had.py:L510-L526 diff_diff/had.py:L573-L595

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings. The deferred Phase 2a scope items that remain out of scope here (event_study, survey, weights, hc2 / hc2_bm, repeated cross-section support, continuous-path cluster= threading) are now explicitly tracked in TODO.md / REGISTRY.md, so I am not counting them against approval. TODO.md:L93-L102 docs/methodology/REGISTRY.md:L2323-L2328

Security

  • No findings.

Documentation/Tests

  • No additional committed-docs defect beyond the issues above. The new registry and paper-review notes are materially better aligned with the implementation than in the previous rerun.
  • Runtime verification is still pending on my side because this environment does not have numpy, pandas, pytest, or scipy, so I could not execute tests/test_had.py.
  • Severity P2. Impact: the new test suite misses the two acceptance-band edge cases created by the d_lower tolerance contract and the mixed-row validator cases for first_treat_col / cluster=. The current tolerance coverage only exercises mass_point with d_lower = d.min() + 1e-15, which is the harmless side of the band. Concrete fix: add regressions for mass_point with d_lower = d.min() - ε, continuous_near_d_lower with d_lower = d.min() + ε, and mixed valid/NaN or valid/invalid per-unit rows for both first_treat_col and cluster=. tests/test_had.py:L1184-L1190 tests/test_had.py:L1208-L1214 tests/test_had.py:L645-L660 tests/test_had.py:L1322-L1330

Path to Approval

  1. Snap any tolerance-accepted d_lower override to float(d_arr.min()) before constructing the mass-point instrument, cohort counts, or the continuous-path shifted regressor.
  2. Add regressions for both sides of the tolerance band:
    design="mass_point", d_lower=float(d.min()) - 1e-15 should behave identically to the exact-minimum case, and design="continuous_near_d_lower", d_lower=float(d.min()) + 1e-15 should not trip the shifted-dose rejection.
  3. Strengthen first_treat_col and cluster= validation to check raw per-row values before groupby().first(), and add regressions where one row in a unit is valid and the other is NaN or out of domain.

…idation

**P1 (Methodology): snap tolerance-accepted d_lower to float(d.min())**

The `d_lower == float(d.min())` within-tolerance contract accepted
user-supplied values that differed from the support infimum by float
rounding noise (1e-15). Downstream algebra is not tolerance-aware:

- `mass_point` + `d_lower = d.min() - ε`: `Z = d > d_lower` puts every
  mass-point unit (at exactly d.min()) into Z=1 because they are
  strictly > d_lower. The Z=0 control group empties and the Wald-IV
  ratio becomes undefined.
- `continuous_near_d_lower` + `d_lower = d.min() + ε`: the regressor
  shift `d - d_lower` produces a minimum of `-ε` on the shifted
  scale. Phase 1c's `_validate_had_inputs` then raises on the
  negative-dose guard, aborting an otherwise acceptable fit.

Fix: after the tolerance check passes on Design 1 paths, snap
`d_lower_val = float(d_arr.min())` before the mass-point instrument
construction, cohort counts, and the continuous-path regressor shift.
This preserves the advertised tolerance contract while keeping
downstream algebra exact.
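The snap can be sketched as below. `resolve_d_lower` is a hypothetical helper mirroring the shipped fix (exact tolerances are assumptions); the usage example shows how the un-snapped value corrupts the instrument:

```python
import numpy as np

def resolve_d_lower(d_lower, d_arr, rtol=1e-9, atol=1e-12):
    """Accept a d_lower override within float tolerance of d.min(), then
    snap it to exactly float(d.min()) so the strict comparison
    Z = (D > d_lower) and the shift D - d_lower stay exact."""
    d_min = float(np.asarray(d_arr).min())
    if not np.isclose(d_lower, d_min, rtol=rtol, atol=atol):
        raise ValueError(f"d_lower={d_lower} is not the support infimum {d_min}")
    return d_min  # snapped

d = np.array([0.5, 0.5, 0.9, 1.3])
eps = 1e-15
snapped = resolve_d_lower(0.5 - eps, d)
assert snapped == 0.5
# Without snapping, the mass-point units at exactly 0.5 are strictly
# greater than the perturbed threshold, so they all land in Z=1 and
# the Z=0 control group empties:
z_bad = d > (0.5 - eps)
z_good = d > snapped
print(int(z_bad.sum()), int(z_good.sum()))  # 4 2
```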

**P2 (Code Quality): row-level NaN/domain validation**

The previous `first_treat_col` and `cluster=` validators collapsed
each unit to a single value via `groupby().first()` before checking
for NaN or out-of-domain values. pandas's `.first()` silently skips
NaN, so a unit with rows `[valid, NaN]` would pass a unit-level check
even though the raw NaN on the bad row should be rejected.

Fix: for both `first_treat_col` (in `_validate_had_panel`) and
`cluster=` (in `_aggregate_first_difference`), validate raw per-row
values BEFORE the per-unit collapse:
- Row-level NaN check on the raw column.
- Row-level domain check on raw values (for first_treat_col).
- Per-unit-constancy check using `nunique(dropna=False)` so a
  `[value, NaN]` within a unit registers as 2 distinct values.
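The row-level check can be sketched as below. `validate_unit_constant` is a hypothetical helper (the shipped checks live in `_validate_had_panel` and `_aggregate_first_difference`); the example demonstrates the `groupby().first()` NaN-skipping trap the fix closes:

```python
import numpy as np
import pandas as pd

def validate_unit_constant(df, unit_col, value_col):
    """Row-level validation BEFORE the per-unit collapse: reject raw NaNs,
    then require one distinct value per unit, counting NaN as distinct
    via nunique(dropna=False)."""
    if df[value_col].isna().any():
        bad = df.loc[df[value_col].isna(), unit_col].unique()
        raise ValueError(f"NaN in {value_col!r} for unit(s): {list(bad)}")
    nuniq = df.groupby(unit_col)[value_col].nunique(dropna=False)
    varying = nuniq[nuniq > 1]
    if len(varying) > 0:
        raise ValueError(
            f"{value_col!r} varies within unit(s): {list(varying.index)}")

# A [valid, NaN] unit slips past a groupby().first() check, because
# pandas's first() silently skips NaN -- but fails the row-level check:
df = pd.DataFrame({"unit": [1, 1, 2, 2],
                   "first_treat": [2021.0, np.nan, 2021.0, 2021.0]})
assert df.groupby("unit")["first_treat"].first().notna().all()  # old check passes
try:
    validate_unit_constant(df, "unit", "first_treat")
except ValueError as e:
    print("rejected:", e)
```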

**Tests (+5 regression):**
- test_mass_point_d_lower_below_min_within_tolerance_snaps: asserts
  d_lower = d.min() - 1e-15 collapses ULP-identically to the exact
  case across att, se, n_mass_point, n_above_d_lower.
- test_continuous_near_d_lower_above_within_tolerance_snaps: asserts
  d_lower = d.min() + 1e-15 collapses to the exact case (no
  negative-shift rejection).
- test_first_treat_col_mixed_row_nan_raises: unit with [valid, NaN]
  rows raises.
- test_first_treat_col_mixed_row_invalid_value_raises: unit with
  [valid, 999] rows raises with the 999 in the error message.
- test_mixed_row_nan_cluster_raises_on_mass_point: mass-point with
  cluster column holding [valid, NaN] per unit raises.

Targeted regression: 150 HAD tests + 529 total across Phase 1 and
adjacent surfaces, all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 21, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 1eda6148f38b7391b4ca7c481e8829d4651e4a27


Overall Assessment

✅ Looks good

Highest unmitigated severity: P2.

Executive Summary

Methodology

Code Quality

  • Severity P2 [Newly identified]. Impact: the public robust parameter is documented as a mass-point-only alias whose use on continuous paths triggers a warning, but the continuous-path dispatch only warns on vcov_type and cluster. A caller who treats robust as an active inference knob on continuous_at_zero or continuous_near_d_lower gets the same CCT robust SE with no signal that the flag had no effect. Concrete fix: either emit an ignore warning when robust=True on a continuous path, or tighten the API/docs to say that only vcov_type and cluster generate warnings while robust is ignored off the mass-point path. diff_diff/had.py:L901 diff_diff/had.py:L910 diff_diff/had.py:L1318

Performance

  • No findings.

Maintainability

  • Severity P3. Impact: _validate_had_panel()’s return doc still says the helper returns the two period identifiers “min then max,” but the implementation now returns identified pre/post periods using the all-zero-dose invariant. That is misleading for future callers, especially because the new tests explicitly rely on semantic labels like "pre" / "post". Concrete fix: update the Returns text to say “pre then post” (or equivalent). diff_diff/had.py:L366 diff_diff/had.py:L389 tests/test_had.py:L1623

Tech Debt

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: the new suite covers mass-point handling of robust and the continuous-path warning behavior for vcov_type/cluster, but not the continuous-path robust=True case behind the P2 above, so that mismatch can regress unnoticed. Concrete fix: add a regression that a continuous fit with robust=True follows the chosen contract. tests/test_had.py:L462 tests/test_had.py:L521 tests/test_had.py:L1418
  • Runtime verification pending: I could not run the HAD tests in this environment because numpy, pandas, scipy, and pytest are unavailable.

Review 9 is overall ✅ Looks good; this PR addresses the remaining
P2 and two P3s to close the loop cleanly.

**P2 (Code Quality): `robust=True` silently ignored on continuous paths**

The continuous dispatch block warns on `vcov_type` and `cluster` when
they are supplied on a path that ignores them, but `robust=True`
slipped through the same check. Since `robust=True` on the mass-point
path is the backward-compat alias for `vcov_type="hc1"`, a user
passing `robust=True` on `continuous_at_zero` / `continuous_near_d_lower`
would reasonably expect it to do something; in reality the continuous
paths always use the CCT-2014 robust SE from Phase 1c and the flag
has no effect.

Fix: extend the ignore-warning block to emit a `UserWarning` when
`robust=True` and `resolved_design` is continuous.
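The extended ignore-warning block can be sketched as below, in a hypothetical standalone form (`warn_ignored_inference_knobs` is an illustrative name; the shipped logic sits inline in the continuous dispatch):

```python
import warnings

def warn_ignored_inference_knobs(resolved_design, robust, vcov_type, cluster):
    """On continuous paths, robust=True now warns alongside vcov_type and
    cluster; the continuous paths always use the CCT-2014 robust SE."""
    if resolved_design in ("continuous_at_zero", "continuous_near_d_lower"):
        ignored = []
        if robust:
            ignored.append("robust=True")
        if vcov_type is not None:
            ignored.append(f"vcov_type={vcov_type!r}")
        if cluster is not None:
            ignored.append(f"cluster={cluster!r}")
        if ignored:
            warnings.warn(
                f"{', '.join(ignored)} ignored on design={resolved_design!r}; "
                "continuous paths use the CCT-2014 robust SE.",
                UserWarning,
            )

with warnings.catch_warnings(record=True) as w:
    warnings.simplefilter("always")
    warn_ignored_inference_knobs("continuous_at_zero", True, None, None)
    warn_ignored_inference_knobs("continuous_at_zero", False, None, None)
print(len(w), "robust" in str(w[0].message))  # 1 True
```

Note the default `robust=False` construction stays silent, matching the second regression test above.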

**P3 (Maintainability): _validate_had_panel return docstring stale**

The Returns section said ``(t_pre, t_post)`` is returned in "min then
max" order, but the implementation now uses the HAD dose invariant
(the all-zero-dose period is `t_pre` regardless of label order).
Updated the docstring to describe the identified semantic order and
the arbitrary-dtype-label support (int, str, datetime).

**Tests (+2 regression):**
- test_robust_true_ignored_on_continuous_warns: `robust=True` on
  continuous path emits a warning containing "robust".
- test_robust_false_silent_on_continuous: `robust=False` (the default)
  on continuous path does NOT emit the `robust=True is ignored`
  warning, keeping the default construction silent.

Targeted regression: 152 HAD tests + 531 total across Phase 1 and
adjacent surfaces, all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 21, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: a3553ed287cf8695404aa90368c0839f4c9e6622


Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1.

Executive Summary

Methodology

  • P3 Impact: diff_diff/had.py:L1163 says continuous_at_zero is the paper’s Design 1' path, but there is no HAD-level guard on that branch; it falls straight through to the looser Phase 1c boundary ~ 0 heuristic in diff_diff/local_linear.py:L639. Because that heuristic is already documented in the registry note, I am treating this as an estimator-contract/documentation mismatch rather than an unmitigated methodology defect. Concrete fix: either add a HAD-level guard consistent with the Phase 2a rule, or explicitly document in the HAD entry/docstring that design="continuous_at_zero" reuses the Phase 1c plausibility heuristic and add a regression for that case. tests/test_had.py:L1162. (cdn.arenafi.org)
  • No other methodology findings.

Code Quality

  • P1 Impact: the class advertises sklearn.base.clone(est) round-trips in diff_diff/had.py:L997, but scikit-learn’s estimator API exposes get_params(deep=True) / set_params(**params), while this implementation defines get_params() without deep and validates set_params() via hasattr. That means the documented sklearn-style surface is incomplete, and non-parameter names like fit would be silently accepted and overwrite methods. The new test only checks HeterogeneousAdoptionDiD(**est.get_params()) at tests/test_had.py:L825, so the broken contract is currently untested. Concrete fix: implement get_params(self, deep: bool = True), restrict set_params() to keys from self.get_params(), and add a regression for get_params(deep=False) plus rejection of non-constructor keys (or a direct clone(est) regression if sklearn is in the test matrix). diff_diff/had.py:L1014. (scikit-learn.org)

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

Security

  • No findings.

Documentation/Tests

  • P3 Impact: runtime verification is pending because this environment lacks numpy, pandas, scipy, and pytest, so this review is source-only. Concrete fix: rerun the HAD suite in CI or a dev environment with the numerical stack installed; when fixing the P1 above, add a regression that exercises the real sklearn parameter contract rather than manual reconstruction.

Path to Approval

  1. Make the estimator’s sklearn parameter surface actually compatible: implement get_params(deep=True), tighten set_params() to constructor keys only, and add a regression that exercises the documented clone/parameter contract.

**P1 (Code Quality): sklearn parameter contract**

The class docstring advertises `sklearn.base.clone(est)` compatibility,
but the actual methods did not match sklearn's
`BaseEstimator.get_params(deep=True)` / `set_params(**params)` surface:

- `get_params()` did not accept a `deep` keyword. sklearn's `clone`
  calls `get_params(deep=False)`, so any integration would have
  failed with a TypeError on the missing kwarg.
- `set_params()` validated keys with `hasattr(self, key)`. That would
  silently accept non-constructor attribute names like `fit`, and a
  typo or malicious kwargs dict could overwrite estimator methods.

Fix:
- `get_params(self, deep: bool = True)` matches sklearn's signature.
  `deep` is accepted for compat; this estimator has no nested
  sub-estimators, so `deep=True` and `deep=False` return the same
  dict. `del deep` documents the no-op explicitly and silences
  unused-arg linters.
- `set_params(**params)` now restricts to keys from `get_params()`.
  Non-constructor attribute names (including method names like `fit`
  and dunder/private attrs) raise `ValueError`.
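
Under those two fixes, the parameter surface looks roughly like this (class name and exact parameter set are illustrative, not the real estimator):

```python
class ParamContractSketch:
    """Sketch of the sklearn-compatible parameter surface, assuming a
    three-parameter constructor for illustration."""

    def __init__(self, design="auto", alpha=0.05, d_lower=None):
        self.design = design
        self.alpha = alpha
        self.d_lower = d_lower

    def get_params(self, deep: bool = True) -> dict:
        # `deep` accepted for sklearn compatibility; no nested
        # sub-estimators, so deep=True and deep=False return the same dict.
        del deep
        return {"design": self.design, "alpha": self.alpha,
                "d_lower": self.d_lower}

    def set_params(self, **params):
        valid = self.get_params()
        for key in params:
            if key not in valid:  # rejects method names, dunders, typos
                raise ValueError(
                    f"Invalid parameter {key!r}; valid: {sorted(valid)}")
        for key, value in params.items():
            setattr(self, key, value)
        return self
```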

**Tests (+4 regression):**
- test_set_params_rejects_method_names: `set_params(fit=...)` raises
  and `est.fit` stays callable.
- test_set_params_rejects_private_attrs: `_internal=42` raises.
- test_get_params_accepts_deep_keyword: `deep=True`, `deep=False`, and
  no-arg all return the same dict.
- test_sklearn_clone_round_trip_if_available: `sklearn.base.clone`
  round-trips the estimator; gated on `pytest.importorskip("sklearn")`
  so it skips cleanly when sklearn is not in the test matrix.

Targeted regression: 155 HAD tests (+ 1 skipped) + 534 total (+1
skipped) across Phase 1 and adjacent surfaces, all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 21, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 98b82017c58f0b32bbcb29aad4a9b47e4d49c4f2


Overall Assessment

✅ Looks good

Highest unmitigated severity: P2.

Executive Summary

  • The prior sklearn-compatibility blocker appears resolved: get_params(deep=...) now matches sklearn’s signature, set_params() rejects non-constructor keys, and there is a clone regression when sklearn is available. diff_diff/had.py#L997 tests/test_had.py#L869 tests/test_had.py#L882
  • I did not find an unmitigated methodology defect in the core HAD estimator math. The continuous paths match the registry’s documented Equation 7 / Equation 11 rescaling, and the mass-point structural-residual 2SLS sandwich is explicitly documented in the Methodology Registry, so I am not counting that implementation choice as a defect. docs/methodology/REGISTRY.md#L2323 docs/methodology/REGISTRY.md#L2327
  • P2 [Newly identified]: HeterogeneousAdoptionDiD.set_params() still mutates self before validation, so a rejected call can leave the estimator partially updated and invalid for later reuse. diff_diff/had.py#L1024
  • P3 [Newly identified]: the new HAD docstring example is not runnable as written because it uses generate_continuous_did_data() defaults (n_periods=4) with a Phase 2a estimator that requires exactly two periods. diff_diff/had.py#L944 diff_diff/prep_dgp.py#L845
  • Runtime verification was not possible here because pytest, numpy, pandas, and scipy are unavailable in this environment.
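
For context, the mass-point Wald-IV path the registry documents can be sketched standalone as follows (function name, data layout, and the HC1 small-sample factor are illustrative assumptions, not the library's API):

```python
import numpy as np

def wald_iv_2sls(delta_y, dose, d_lower, tol=1e-12):
    """Just-identified 2SLS with instrument Z = 1{D > d_lower} and
    structural-residual sandwich [Z'X]^-1 Omega [Z'X]^-T (HC1-style)."""
    n = len(delta_y)
    z = (dose > d_lower + tol).astype(float)
    Z = np.column_stack([np.ones(n), z])      # instruments
    X = np.column_stack([np.ones(n), dose])   # regressors
    ZX = Z.T @ X
    beta = np.linalg.solve(ZX, Z.T @ delta_y)
    u = delta_y - X @ beta                    # structural residuals
    Omega = (Z * (u ** 2)[:, None]).T @ Z * n / (n - 2)  # HC1 meat
    ZX_inv = np.linalg.inv(ZX)
    V = ZX_inv @ Omega @ ZX_inv.T
    return beta[1], float(np.sqrt(V[1, 1]))
```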

Methodology

No findings. The estimator’s three design paths, d_lower guards, NaN-safe inference routing, and documented mass-point variance choice all line up with the local methodology sources I checked. The remaining unsupported scope cuts are documented rather than silent. docs/methodology/REGISTRY.md#L2323 docs/methodology/REGISTRY.md#L2331

Code Quality

  • Severity: P2. Impact: HeterogeneousAdoptionDiD.set_params() applies setattr() inside the validation loop and only then calls _validate_constructor_args(), so a failing call like est.set_params(alpha=0.1, design="made_up") leaves the estimator half-mutated. That is a real reuse hazard for callers that catch ValueError and continue with the same object. The current HAD tests verify rejection and round-trip behavior, but not rollback-on-failure. diff_diff/had.py#L1032 tests/test_had.py#L847 tests/test_had.py#L892 tests/test_had.py#L974 Concrete fix: validate all incoming keys and derived constructor constraints on local variables first, then apply mutations atomically, following the established pattern already used in DifferenceInDifferences.set_params(). Add a regression asserting get_params() is unchanged after a failing HAD set_params() call. diff_diff/estimators.py#L824

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. The Phase 2a scope cuts I checked (event_study, survey, weights, mass-point hc2/hc2_bm, continuous-path cluster, repeated-cross-section support) are explicitly tracked in TODO.md, so I am not counting them against approval. TODO.md#L92 TODO.md#L101

Security

No findings.

Documentation/Tests

  • Severity: P3. Impact: the new class example in had.py will fail if a user copies it literally, because it constructs data with generate_continuous_did_data() defaults and that generator defaults to n_periods=4, while HeterogeneousAdoptionDiD.fit() is Phase-2a two-period-only. diff_diff/had.py#L944 diff_diff/had.py#L1058 diff_diff/prep_dgp.py#L845 Concrete fix: change the example to a true HAD-compatible 2-period dataset, or explicitly pass n_periods=2 and note the Phase 2a panel restriction; ideally add a smoke-style doc regression so the example stays executable.
  • Review limitation: this was a source-only review. I could not run tests/test_had.py here because pytest, numpy, pandas, and scipy are not installed.

…mple

Round 11 is ✅ Looks good overall; this PR closes the remaining P2 and
P3 for cleanliness.

**P2 (Code Quality): set_params must be atomic**

The previous implementation applied `setattr()` inside the validation
loop and called `_validate_constructor_args()` only afterward. A
multi-key call like `set_params(alpha=0.1, design="made_up")` would
mutate `self.alpha` before raising on the invalid design, leaving
the estimator half-updated. Callers that catch `ValueError` and
retry would see the wrong state.

Fix: dry-run validation via `type(self)(**merged)` before any
attribute write. If the constructor raises on the combined state,
`self` is not mutated. Mutations are applied only after validation
passes, so rejection leaves the estimator fully rollback-safe.
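
A minimal sketch of the validate-then-mutate flow, assuming a toy two-parameter constructor (only the dry-run pattern mirrors the actual fix):

```python
class AtomicParamsSketch:
    """Illustrative stand-in: the constructor owns all validation, so a
    dry-run construction checks the merged state before any mutation."""

    _VALID_DESIGNS = ("auto", "continuous_at_zero",
                      "continuous_near_d_lower", "mass_point")

    def __init__(self, design="auto", alpha=0.05):
        if design not in self._VALID_DESIGNS:
            raise ValueError(f"unknown design {design!r}")
        if not 0.0 < alpha < 1.0:
            raise ValueError("alpha must be in (0, 1)")
        self.design = design
        self.alpha = alpha

    def get_params(self, deep: bool = True) -> dict:
        del deep
        return {"design": self.design, "alpha": self.alpha}

    def set_params(self, **params):
        valid = self.get_params()
        unknown = set(params) - set(valid)
        if unknown:
            raise ValueError(f"invalid parameter(s): {sorted(unknown)}")
        merged = {**valid, **params}
        type(self)(**merged)  # dry-run: validates combined state, no mutation
        for key, value in params.items():  # apply only after validation
            setattr(self, key, value)
        return self
```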

**P3 (Documentation/Tests): docstring example was non-runnable**

The class docstring example constructed data via
`generate_continuous_did_data()` which defaults to `n_periods=4`,
while `HeterogeneousAdoptionDiD.fit()` requires exactly two periods
in Phase 2a. A user copying the example verbatim would hit the
two-period rejection.

Fix: rewrote the example to build a two-period HAD panel by hand
with an explicit `D_{g,1} = 0` pre-period column, a unit-level
`dose_post`, and `delta_y` so the shape matches the estimator's
contract exactly. Kept `doctest: +SKIP` markers on the RNG and fit
lines.
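
A hand-built two-period panel of that shape might look like this (column names and the outcome model are illustrative; the real docstring example defines the exact schema):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_units = 200
dose_post = rng.uniform(0.0, 2.0, n_units)  # D_{g,2}; D_{g,1} = 0 by design

panel = pd.DataFrame({
    "unit": np.repeat(np.arange(n_units), 2),
    "period": np.tile([1, 2], n_units),
    # per-unit (pre, post) dose pairs: pre-period dose is all zeros
    "dose": np.column_stack([np.zeros(n_units), dose_post]).ravel(),
})
effect = 0.5  # assumed constant dose effect for the toy DGP
panel["y"] = 1.0 + effect * panel["dose"] + rng.normal(0.0, 0.1, len(panel))
```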

**Tests (+3 rollback regressions):**
- test_set_params_rollback_on_failure: multi-key call with invalid
  design keeps estimator unchanged.
- test_set_params_rollback_on_invalid_key: unknown kwarg keeps
  estimator unchanged.
- test_set_params_rollback_on_invalid_alpha: out-of-range alpha
  keeps estimator unchanged.

Targeted regression: 158 HAD tests (+1 sklearn-gated skip) + 537
total (+1 skipped) across Phase 1 and adjacent surfaces, all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 21, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: c67505156b94fb557a69441349a43ee3d9b21c6b


Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1.

Executive Summary

  • The prior set_params() atomicity issue appears resolved. HeterogeneousAdoptionDiD.set_params() now validates a merged parameter dict before mutating self. diff_diff/had.py:1040
  • The earlier doc-example issue also appears resolved. The class docstring now builds a true two-period HAD panel directly instead of relying on a generator with four-period defaults. diff_diff/had.py:943
  • I did not find an unmitigated defect in the core HAD math. The beta-scale continuous estimators, the Design 1 warning, and the structural-residual 2SLS sandwich all line up with the local methodology sources, and the 2SLS variance choice is explicitly documented in the registry. docs/methodology/REGISTRY.md:2323 docs/methodology/REGISTRY.md:2327
  • One P1 remains: an explicit nonzero d_lower is silently ignored whenever the resolved design is continuous_at_zero, even though the class documentation says explicit overrides are checked against the Section 3.2 regime contract and mismatches raise. diff_diff/had.py:868 diff_diff/had.py:1207
  • Runtime verification was not possible here because this environment does not have pytest or numpy installed.

Methodology

  • Severity: P1. Impact: HeterogeneousAdoptionDiD.fit() unconditionally sets d_lower_val = 0.0 on the continuous_at_zero path, so calls like HeterogeneousAdoptionDiD(design="continuous_at_zero", d_lower=0.5) silently fit Design 1' at zero anyway. That contradicts both the paper’s Design 1' contract (d_lower = 0) and the class docstring’s promise that mismatched overrides raise. It also creates a parameter-contract mismatch: get_params() can still report the user’s nonzero d_lower, while the fitted result reports d_lower=0.0. This is a silent propagation failure on a new public parameter. diff_diff/had.py:868 diff_diff/had.py:1030 diff_diff/had.py:1207 Concrete fix: after resolved_design is known, reject any explicit d_lower that is not approximately zero when resolved_design == "continuous_at_zero"; if you want tolerance behavior, snap only near-zero values to exact 0.0.
  • Severity: P3. Impact: The mass-point path’s structural-residual 2SLS sandwich is a documented deviation from the paper’s underspecified “standard 2SLS inference,” so I am not counting it as a defect. docs/methodology/REGISTRY.md:2327 Concrete fix: none required.

Code Quality

  • No additional findings. The previous partial-mutation set_params() issue appears fixed. diff_diff/had.py:1040

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings. The unsupported Phase 2a scope cuts I checked (event_study, survey, weights, mass-point hc2/hc2_bm, continuous-path cluster, staggered timing, repeated cross-section) are explicitly tracked in TODO.md, so I am not counting them against approval. TODO.md:92 TODO.md:101

Security

  • No findings.

Documentation/Tests

  • No separate findings. The earlier doc example problem appears resolved. The current test file is broad, but it does not cover the continuous_at_zero plus explicit nonzero d_lower case that triggers the P1 above. diff_diff/had.py:943 tests/test_had.py:1140
  • Review limitation: I could not run the HAD tests here because pytest and numpy are not installed in this environment.

Path to Approval

  1. In HeterogeneousAdoptionDiD.fit(), validate explicit d_lower against the Design 1' contract after resolved_design is computed. continuous_at_zero should accept only d_lower=None or an approximately zero value; any materially nonzero override should raise ValueError.
  2. Add regression coverage for both design="continuous_at_zero" and design="auto" on a zero-support sample with d_lower > 0, and assert that the call fails instead of silently coercing to Design 1' at zero.

…inuous_at_zero

**P1 (Methodology): continuous_at_zero silently ignored nonzero d_lower**

`HeterogeneousAdoptionDiD.fit()` unconditionally set `d_lower_val = 0.0`
on the `continuous_at_zero` path, so calls like
`HeterogeneousAdoptionDiD(design="continuous_at_zero", d_lower=0.5)`
silently fit Design 1' at zero. This contradicted paper Section 3.2's
Design 1' regime contract (`d_lower = 0`) and the class docstring's
promise that mismatched overrides raise. It also created a parameter-
contract mismatch: `get_params()` reported the user's nonzero
`d_lower` while the fitted result reported `d_lower=0.0`.

Fix: after `resolved_design == "continuous_at_zero"` is known, check
whether the user supplied an explicit `d_lower`. If the absolute value
exceeds the float-tolerance band (same family as the Design 1
d_lower guards below), raise `ValueError` pointing at
`continuous_near_d_lower` / `mass_point` / `auto`. `d_lower=None`
(auto-resolve) and `d_lower=0.0` (redundant but benign) continue to
succeed.

**Tests (+4 regression):**
- test_continuous_at_zero_with_nonzero_d_lower_raises: d_lower=0.5
  raises with "d_lower == 0" / "Design 1'" pointer.
- test_continuous_at_zero_with_small_d_lower_raises: d_lower=0.01 also
  raises (not just large overrides).
- test_continuous_at_zero_with_zero_d_lower_succeeds: d_lower=0.0
  exactly is accepted (redundant-but-valid case).
- test_auto_on_zero_sample_ignores_user_d_lower: design='auto'
  resolving to continuous_at_zero must ALSO reject explicit nonzero
  d_lower, not silently drop it.

Targeted regression: 162 HAD tests (+1 sklearn-gated skip) + 541
total (+1 skipped) across Phase 1 and adjacent surfaces, all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 21, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 07ab30f0b736d3a5fea6b1df97273412984e4ba5


Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1.

Executive Summary

Methodology

  • Severity: P1. Impact: The new d_lower contract is still incomplete for non-finite inputs. fit() only uses comparison-based guards (>, <=, and abs(x - d_min) > tol), and all of those return False for NaN. As a result, d_lower=np.nan is silently accepted: continuous_at_zero falls through to d_lower_val = 0.0, while continuous_near_d_lower / mass_point fall through the regime checks and then snap to d.min(). On zero-support samples, that can even bypass the restored d_lower > 0 protection and let the wrong Design 1 path run at d_lower=0. That is a silent parameter-contract mismatch on a new public estimator argument, and it contradicts the documented “mismatched overrides raise” behavior. diff_diff/had.py:868 diff_diff/had.py:969 diff_diff/had.py:1207 diff_diff/had.py:1243 diff_diff/had.py:1314
    Concrete fix: Reject non-finite d_lower values up front, ideally in _validate_constructor_args() so __init__, set_params(), and fit() all share the same contract. d_lower should be either None or a finite scalar; NaN / ±inf should raise ValueError before any design dispatch or snapping logic.

  • Severity: P3. Impact: The mass-point structural-residual 2SLS sandwich is a documented deviation from the paper’s underspecified “standard 2SLS inference,” and the registry explicitly records that choice plus the deferred hc2 / hc2_bm work. I am not counting it as a defect. diff_diff/had.py:700 docs/methodology/REGISTRY.md:2327
    Concrete fix: None required.

Code Quality

No additional findings. The previous set_params() atomicity issue appears resolved by the validate-then-mutate flow. diff_diff/had.py:1040

Performance

No findings.

Maintainability

No additional findings beyond the d_lower validation gap above.

Tech Debt

Security

No findings.

Documentation/Tests

  • Severity: P3. Impact: The new result docstring and registry entry overstate the NaN contract by saying att is part of the safe_inference-gated inference bundle / “all 5 inference fields.” The implementation only NaN-gates t_stat, p_value, and conf_int; att and se are passed through from the estimator path. That does not change the estimator math, but it leaves the public contract out of sync with the code. diff_diff/had.py:130 diff_diff/had.py:1449 docs/methodology/REGISTRY.md:2326 tests/test_had.py:729 tests/test_had.py:785
    Concrete fix: Update the wording to say that undefined SEs force t_stat, p_value, and conf_int to NaN, while att / se remain raw estimator outputs. If full all-field NaN propagation is actually intended, implement that explicitly and add regression tests.

Path to Approval

  1. Reject non-finite d_lower values (NaN, ±inf) before any branch-specific comparison or snapping, preferably in _validate_constructor_args() so set_params(d_lower=...) also fails atomically.
  2. Add regression tests for d_lower=np.nan on design="continuous_at_zero", design="continuous_near_d_lower", design="mass_point", and design="auto" on both zero-support and positive-support samples, asserting ValueError instead of silent coercion.

… docstring

**P1 (Methodology): d_lower=NaN/inf bypassed comparison guards**

The Design 1 regime guards (`d_lower > 0`, `d_lower == d.min()` within
tolerance, `d_lower <= tol` for continuous_at_zero) all use
comparison operators that return False for NaN. A user passing
`d_lower=np.nan` therefore slipped through every check and ended up
at `0.0` (continuous_at_zero branch) or `d.min()` (the snap). On a
zero-support sample, that could silently re-enter the paper-
incompatible continuous_near_d_lower / mass_point path the earlier
review rounds had fixed.

Fix: reject non-finite d_lower front-door in
`_validate_constructor_args()`. This makes `__init__`, `set_params()`,
and the atomic dry-run validation all share the same contract:
`d_lower` must be None or a finite scalar. NaN and +/-inf raise
`ValueError`. Atomic `set_params()` confirms that a failing
`d_lower=NaN` call leaves the estimator unchanged.

**P3 (Documentation/Tests): safe_inference coverage overstated**

The result-class docstring and REGISTRY said "all 5 inference fields
are routed through safe_inference". In reality, `safe_inference`
only gates the downstream triple (t_stat, p_value, conf_int) - `att`
and `se` are raw estimator outputs. The fit paths themselves return
`(nan, nan)` on degenerate configurations, which combined with the
safe_inference gate produces the advertised "all five fields NaN
together" behavior, but the docstring wording was misleading.

Fix: rewrote the Results docstring preamble and the REGISTRY.md
Phase 2a NaN-propagation line to describe the two-layer NaN flow
(fit-path NaN on att/se + safe_inference on the downstream triple)
precisely.
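
A sketch of the downstream gate under that restated contract (the real safe_inference signature may differ; this assumes a normal-approximation p-value and 95% CI):

```python
import math

def safe_inference_sketch(att, se):
    """att/se pass through raw; the downstream triple (t, p, CI) is
    NaN-gated whenever se is undefined or nonpositive."""
    if not (math.isfinite(se) and se > 0):
        nan = float("nan")
        return att, se, nan, nan, (nan, nan)
    t = att / se
    p = math.erfc(abs(t) / math.sqrt(2.0))  # two-sided normal p-value
    z = 1.959963984540054                   # N(0,1) 97.5% quantile
    return att, se, t, p, (att - z * se, att + z * se)
```

On the single-cluster CR1 edge this reproduces the documented behavior: a finite att passes through while t, p, and the CI come back NaN.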

**Tests (+4 regression):**
- test_d_lower_nan_raises: __init__ rejects NaN.
- test_d_lower_posinf_raises / test_d_lower_neginf_raises: +/-inf
  rejected.
- test_d_lower_nan_via_set_params_raises: set_params(d_lower=NaN)
  raises AND leaves the estimator unchanged (atomic rollback).

Targeted regression: 166 HAD tests (+1 sklearn-gated skip) + 545
total (+1 skipped) across Phase 1 and adjacent surfaces, all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 2f6e1d0b3677ad0e5dd936818497bcd79358e24b


Overall Assessment

✅ Looks good

Highest unmitigated severity: P3.

Executive Summary

Methodology

  • Severity: P3. Impact: No unmitigated methodology defect found. The mass-point path’s structural-residual 2SLS sandwich is explicitly documented in the registry as a deviation from the paper’s underspecified inference language, so it should be treated as informational rather than blocking at docs/methodology/REGISTRY.md:2327 and diff_diff/had.py:721. Concrete fix: None.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3. Impact: The main Phase 2a scope cuts are properly tracked in TODO.md and mirrored in the registry, including event_study, survey, weights, mass-point hc2/hc2_bm, continuous-path cluster, and repeated-cross-section support, so I am treating them as accepted deferred work rather than blockers at TODO.md:92, TODO.md:101, docs/methodology/REGISTRY.md:2330, and docs/methodology/REGISTRY.md:2331. Concrete fix: None in this PR.

Security

  • No findings.

Documentation/Tests

  • Severity: P3. Impact: The NaN-contract wording is still too strong. diff_diff/had.py:130 and docs/methodology/REGISTRY.md:2326 say users can expect all five fields to be finite or all NaN together, but the clustered mass-point path returns (att=beta_hat, se=np.nan) when CR1 is undefined with a single cluster at diff_diff/had.py:822, and the new regression only checks that se is NaN in that case at tests/test_had.py:785. This is a documentation mismatch, not a numerical bug, because t_stat, p_value, and conf_int are still correctly NaN-gated via safe_inference() at diff_diff/had.py:1471. Concrete fix: Reword the docs/registry to say the guaranteed NaN coupling is on the downstream inference triple (t_stat, p_value, conf_int), while att and se remain raw estimator outputs and may stay finite when inference is undefined.

Round 14 is ✅ Looks good; this closes the final residual P3 for
wording accuracy.

**P3 (Docs): NaN-contract wording overstated**

The prior wording said users can expect all five result fields to be
finite or all NaN together. That is true on most degenerate paths
(constant-y, no-dose-variation) but NOT on the single-cluster CR1
configuration of the mass-point path: `_fit_mass_point_2sls` returns
`(att=beta_hat, se=nan)` there - `att` is finite because the Wald-IV
ratio is well defined, while CR1 requires G >= 2 clusters so `se`
is NaN. `safe_inference` then gates the downstream triple correctly,
but the doc claim that `att` and `se` are coupled with the triple
was inaccurate.

Fix: rewrote the result-class docstring preamble and the REGISTRY
Phase 2a NaN-propagation line to describe the invariant precisely:

- GUARANTEED NaN coupling is on the downstream triple (`t_stat`,
  `p_value`, `conf_int`) via `safe_inference`.
- `att` and `se` are RAW estimator outputs from the chosen fit path
  and are NOT gated by safe_inference.
- On most degenerate paths, fit paths return `(nan, nan)` so all
  five move together, but on the single-cluster CR1 edge the fit
  path returns `(att=beta_hat, se=nan)` and only the downstream
  triple becomes NaN via the gate.

No code behavior changes; the `assert_nan_inference` fixture already
only checks the downstream triple, matching the restated contract.

Targeted regression: 166 HAD tests (+1 sklearn-gated skip) + 545
total (+1 skipped) across Phase 1 and adjacent surfaces, all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 22d7f65f1ea126702ea97fbef2d03f1f8e5658db


Overall Assessment

✅ Looks good

Highest unmitigated severity: P3.

Executive Summary

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3. Impact: Remaining Phase 2a limitations are properly tracked in TODO.md, so they are not blockers for this PR: aggregate="event_study", survey=, weights=, mass-point hc2 / hc2_bm, continuous-path cluster=, staggered reduction, and panel-only / RCS scope at TODO.md:L92-L101 and docs/methodology/REGISTRY.md:L2330-L2332. Concrete fix: None in this PR.

Security

  • No findings.

Documentation/Tests

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 22, 2026
@igerber igerber merged commit 7c9f7f1 into main Apr 22, 2026
23 of 24 checks passed
@igerber igerber deleted the did-no-untreated branch April 22, 2026 01:49