Phase 1a: Kernel infrastructure + HC2/Bell-McCaffrey variance by igerber · Pull Request #327 · igerber/diff-diff

igerber · 2026-04-19T00:37:43Z

Summary

First of seven phased PRs implementing HeterogeneousAdoptionDiD (de Chaisemartin, Ciccia, D'Haultfœuille & Knau 2026, arXiv:2405.04465v6).
New module diff_diff/local_linear.py: Epanechnikov / triangular / uniform kernels on [0, 1] with closed-form moment constants verified to 1e-12 vs numerical integration; univariate local-linear regression at a boundary via kernel-weighted OLS.
diff_diff/linalg.py: new vcov_type enum (classical / hc1 / hc2 / hc2_bm) with return_dof kwarg on compute_robust_vcov. HC2 one-way uses leverage-corrected meat with weighted-hat convention; HC2+Bell-McCaffrey one-way computes Imbens-Kolesar (2016) per-coefficient Satterthwaite DOF. CR2 Bell-McCaffrey cluster-robust uses symmetric matrix square root via eigendecomposition with Moore-Penrose pseudoinverse for singleton clusters and absorbed cluster fixed effects. Weighted cluster CR2 raises NotImplementedError (Phase 2+).
DifferenceInDifferences (and by inheritance MultiPeriodDiD, TwoWayFixedEffects) grows a vcov_type parameter with robust=True/False as backward-compat aliases. set_params re-validates the robust/vcov_type pair via the new resolve_vcov_type() helper. DiDResults.summary() prints a human-readable Variance: line.
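
As an illustration of the kernel-moment check described in the summary, the sketch below compares closed-form moments against Gauss-Legendre quadrature on [0, 1]. It assumes the kernels are the usual two-sided forms restricted to [0, 1] without renormalization (so each integrates to 1/2); the names `KERNELS`, `closed_form_moment`, and `numerical_moment` are hypothetical and are not taken from diff_diff/local_linear.py.

```python
import numpy as np

# Assumed convention: two-sided kernels restricted to [0, 1], not renormalized.
KERNELS = {
    "epanechnikov": lambda u: 0.75 * (1.0 - u**2),
    "triangular": lambda u: 1.0 - u,
    "uniform": lambda u: 0.5 * np.ones_like(u),
}


def closed_form_moment(name, j):
    """Closed-form value of the j-th moment, integral of u**j * k(u) over [0, 1]."""
    if name == "epanechnikov":
        return 0.75 * (1.0 / (j + 1) - 1.0 / (j + 3))
    if name == "triangular":
        return 1.0 / (j + 1) - 1.0 / (j + 2)
    if name == "uniform":
        return 0.5 / (j + 1)
    raise ValueError(f"unknown kernel: {name!r}")


def numerical_moment(name, j, n_nodes=16):
    """Same moment by Gauss-Legendre quadrature mapped from [-1, 1] to [0, 1]."""
    x, w = np.polynomial.legendre.leggauss(n_nodes)
    u = 0.5 * (x + 1.0)  # change of variables; the Jacobian is 0.5
    return 0.5 * np.sum(w * u**j * KERNELS[name](u))
```

Because every integrand here is a low-degree polynomial, the quadrature is exact up to rounding, which is what makes a 1e-12 tolerance realistic for this kind of check.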

Methodology references (required if estimator / math changes)

Method name(s): HeterogeneousAdoptionDiD (Phase 1a infrastructure only — the estimator itself lands in Phase 2).
Paper / source link(s):
- de Chaisemartin, Ciccia, D'Haultfœuille & Knau (2026). Difference-in-Differences Estimators When No Unit Remains Untreated. arXiv:2405.04465v6.
- Imbens & Kolesar (2016). Robust Standard Errors in Small Samples: Some Practical Advice. Review of Economics and Statistics, 98(4), 701-712.
- Pustejovsky & Tipton (2018). Small-Sample Methods for Cluster-Robust Variance Estimation and Hypothesis Testing in Fixed Effects Models. Journal of Business & Economic Statistics, 36(4), 672-683.
- Calonico, Cattaneo & Farrell (2018) — kernel moment conventions.
Any intentional deviations from the source (and why):
- Committed benchmarks/data/clubsandwich_cr2_golden.json has "source": "python_self_reference" as a regression anchor until benchmarks/R/generate_clubsandwich_golden.R is run against R clubSandwich. Documented with a Note: in docs/methodology/REGISTRY.md and tracked in TODO.md.
- Weighted CR2 Bell-McCaffrey is not implemented in this PR (raises NotImplementedError); the paper's applications don't use weighted cluster-robust and this is Phase 2+ per plan. Tracked in TODO.md.
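
For context on what the CR2 golden values pin down, the summary describes the adjustment matrices as symmetric square roots computed by eigendecomposition with a Moore-Penrose pseudoinverse. A minimal sketch of that idea for unweighted OLS follows; `sym_pinv_sqrt` and `cr2_adjustment` are hypothetical names, and the absolute tolerance is an assumption rather than the value used in diff_diff/linalg.py.

```python
import numpy as np


def sym_pinv_sqrt(a, tol=1e-10):
    """Symmetric pseudo-inverse square root of a symmetric PSD matrix.

    Eigenvalues at or below `tol` (an assumed absolute cutoff) are treated as
    zero, so singular directions map to zero instead of blowing up.
    """
    a = 0.5 * (a + a.T)  # enforce exact symmetry before eigh
    evals, evecs = np.linalg.eigh(a)
    inv_sqrt = np.zeros_like(evals)
    keep = evals > tol
    inv_sqrt[keep] = 1.0 / np.sqrt(evals[keep])
    return (evecs * inv_sqrt) @ evecs.T


def cr2_adjustment(X, cluster_rows):
    """A_g = (I - H_gg)^{+1/2} for one cluster in an unweighted OLS fit."""
    bread = np.linalg.pinv(X.T @ X)
    Xg = X[cluster_rows]
    H_gg = Xg @ bread @ Xg.T
    return sym_pinv_sqrt(np.eye(len(cluster_rows)) - H_gg)
```

A singleton cluster that is fully absorbed by its own dummy has leverage 1, so I - H_gg is zero and the pseudoinverse route returns a zero adjustment rather than dividing by zero.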

Validation

Tests added/updated:
- tests/test_local_linear.py (57 tests): kernel closed-form moments, local_linear_fit parity vs manual WLS, NaN/Inf input validation, error paths.
- tests/test_linalg_hc2_bm.py (31 tests): HC2 hand-formula parity, Bell-McCaffrey DOF Satterthwaite derivation, CR2 adjustment matrix edge cases (singleton, rank-deficient, identity), CR2 parity harness with clubSandwich golden JSON, regression anchors for the unchanged HC1/CR1 paths.
- tests/test_estimators_vcov_type.py (22 tests): robust ⇔ vcov_type alias resolution, set_params conflict detection and re-derivation, MultiPeriodDiD / TwoWayFixedEffects inheritance, wild-bootstrap coexistence, summary variance-family label.
- All existing tests/test_linalg.py (97) and tests/test_estimators.py (145) pass with no regressions.
Backtest / simulation / notebook evidence (if applicable): N/A — Phase 1a is infrastructure. Simulation coverage against Table 1 of the paper and Pierce-Schott Figure 2 replication are Phase 2 and Phase 4 criteria respectively.
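
The "HC2 hand-formula parity" tests listed above compare the library output against the textbook sandwich. A minimal unweighted sketch of that reference formula is shown here; `hc_vcov` is a hypothetical helper, not the signature of compute_robust_vcov, and it covers only the one-way classical, HC1, and HC2 cases.

```python
import numpy as np


def hc_vcov(X, resid, kind="hc2"):
    """One-way OLS variance estimators for an unweighted fit (reference sketch)."""
    n, k = X.shape
    bread = np.linalg.inv(X.T @ X)
    if kind == "classical":
        return bread * (resid @ resid) / (n - k)
    # Leverage h_ii = x_i' (X'X)^{-1} x_i
    h = np.einsum("ij,jk,ik->i", X, bread, X)
    if kind == "hc1":
        scale = resid**2 * n / (n - k)
    elif kind == "hc2":
        scale = resid**2 / (1.0 - h)  # leverage-corrected squared residuals
    else:
        raise ValueError(f"unknown kind: {kind!r}")
    meat = X.T @ (X * scale[:, None])
    return bread @ meat @ bread
```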

Security / privacy

Confirm no secrets/PII in this PR: Yes

Pre-merge state: Local AI review (round 2) → ✅ looks good, two P1 blockers from round 1 resolved, all P2 items addressed or documented. Paper review file at docs/methodology/papers/dechaisemartin-2026-review.md is gitignored (per the existing papers/ pattern at .gitignore:91) so it is not in the PR; the Phase 0 REGISTRY stub in this PR supersedes it as the authoritative reference.

🤖 Generated with Claude Code

First of seven phased PRs implementing the HeterogeneousAdoptionDiD estimator from de Chaisemartin, Ciccia, D'Haultfoeuille & Knau (2026, arXiv:2405.04465v6). Ships the foundational RDD and small-sample variance infrastructure that Phases 1b, 1c, 2, 3 all compose.
- diff_diff/local_linear.py (new): Epanechnikov, triangular, and uniform kernels on [0, 1] with closed-form moment constants matching numerical integration to 1e-12; univariate local-linear regression at a boundary via kernel-weighted OLS through solve_ols.
- diff_diff/linalg.py: new vcov_type enum (classical, hc1, hc2, hc2_bm) with return_dof kwarg on compute_robust_vcov. HC2 one-way uses leverage-corrected meat with weighted-hat convention; HC2+Bell-McCaffrey one-way computes the Imbens-Kolesar (2016) Satterthwaite DOF per coefficient. CR2 Bell-McCaffrey cluster-robust uses symmetric matrix square root via eigendecomposition with Moore-Penrose pseudoinverse for singleton clusters and absorbed cluster fixed effects. Weighted cluster CR2 raises NotImplementedError (Phase 2+). Rust backend guards skip non-hc1 paths.
- diff_diff/estimators.py: vcov_type threaded through DifferenceInDifferences (MultiPeriodDiD and TwoWayFixedEffects inherit via the base class). robust=True aliases vcov_type="hc1"; robust=False aliases "classical". Conflict detection at __init__. LinearRegression stores per-coefficient Bell-McCaffrey DOF and consumes it in get_inference.
- diff_diff/results.py: DiDResults gains vcov_type and cluster_name fields; summary() prints a human-readable Variance family line.
- benchmarks: R clubSandwich parity script plus JSON anchor (python_self_reference until R is run) for CR2 BM parity tests.
- Tests: three new focused suites (test_local_linear.py, test_linalg_hc2_bm.py, test_estimators_vcov_type.py, 104 new tests total). All 145 existing estimator tests plus 97 existing linalg tests pass with no regressions.
- Docs: REGISTRY.md HeterogeneousAdoptionDiD section with Phase 1a requirements checklist; ROADMAP.md entry updated with status line; TODO.md deferrals for weighted CR2, standalone-estimator threading, bread_inv perf kwarg, Rust HC2 backend, scores-based DOF.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- diff_diff/linalg.py: fix compute_robust_vcov docstring to reflect that vcov_type="hc2_bm" supports both one-way and CR2 cluster-robust paths (the earlier "queued as a follow-up" language was stale). Extract resolve_vcov_type(robust, vcov_type) as the single source of truth for alias resolution and conflict detection; DifferenceInDifferences and LinearRegression both consume it.
- diff_diff/estimators.py: DifferenceInDifferences.set_params re-validates the robust/vcov_type pair via resolve_vcov_type after mutation so invalid combinations (e.g. robust=False + vcov_type="hc2") raise instead of leaving the estimator in an inconsistent state.
- diff_diff/local_linear.py: local_linear_fit now validates d/y/weights for NaN and Inf at the API boundary, returning targeted ValueErrors rather than relying on downstream solve_ols failures. Removed a stale inline comment about missing solve_ols overload stubs (the stubs now include weights/weight_type).
- docs/methodology/REGISTRY.md: reframe the CR2 golden-JSON checkbox so it accurately reflects that the committed JSON is a python_self_reference stability anchor until the R script is run; authoritative clubSandwich regeneration is tracked in TODO.md.
- Tests: set_params conflict tests (robust=False + vcov_type="hc2" raises; robust=True restores hc1; invalid vcov_type rejected) and local_linear_fit NaN/Inf validation tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
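
To make the boundary-validation idea concrete, here is a minimal sketch of a kernel-weighted local-linear fit at d = 0 that rejects NaN/Inf inputs up front. It is not the library's local_linear_fit: the function name, the fixed triangular kernel, the return value, and the use of np.linalg.lstsq in place of solve_ols are all assumptions made for illustration.

```python
import numpy as np


def local_linear_boundary_fit(d, y, bandwidth):
    """Kernel-weighted OLS of y on (1, d) using observations with 0 <= d <= bandwidth.

    Returns (intercept, slope); the intercept estimates E[y | d = 0].
    """
    d = np.asarray(d, dtype=float)
    y = np.asarray(y, dtype=float)
    if d.ndim != 1 or d.shape != y.shape:
        raise ValueError("d and y must be 1-D arrays of equal length")
    # Fail early with a targeted message instead of a downstream solver error.
    for name, arr in (("d", d), ("y", y)):
        if not np.all(np.isfinite(arr)):
            raise ValueError(f"{name} contains NaN or Inf")
    if not bandwidth > 0:
        raise ValueError("bandwidth must be positive")

    u = d / bandwidth
    w = np.where((u >= 0) & (u <= 1), 1.0 - u, 0.0)  # triangular kernel on [0, 1]
    if np.count_nonzero(w) < 2:
        raise ValueError("fewer than two observations inside the bandwidth")

    # Weighted least squares via the sqrt-weight transform.
    sw = np.sqrt(w)
    X = np.column_stack([np.ones_like(d), d])
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return coef[0], coef[1]
```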

Class-level docstrings now fully describe the vcov_type enum (classical, hc1, hc2, hc2_bm) on DifferenceInDifferences and MultiPeriodDiD, and clarify that robust is a legacy alias. Renamed test_set_params_rejects_conflict_on_robust_only to test_set_params_robust_only_rederives_vcov_type so the name matches the asserted behavior (robust-only mutation re-derives vcov_type from the alias rather than raising). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
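
The alias behavior described in these commits (robust=True maps to "hc1", robust=False maps to "classical", and robust=False with any other vcov_type is a conflict) can be sketched as below. This is an assumption-laden illustration: `resolve_vcov_type_sketch` is a hypothetical stand-in, and the real resolve_vcov_type may differ in defaults, error types, and messages.

```python
VALID_VCOV_TYPES = ("classical", "hc1", "hc2", "hc2_bm")


def resolve_vcov_type_sketch(robust=True, vcov_type=None):
    """Resolve the legacy `robust` flag and the `vcov_type` enum to one value."""
    if vcov_type is None:
        # No explicit request: derive the variance family from the alias.
        return "hc1" if robust else "classical"
    if vcov_type not in VALID_VCOV_TYPES:
        raise ValueError(f"vcov_type must be one of {VALID_VCOV_TYPES}")
    if not robust and vcov_type != "classical":
        # robust=False asks for homoskedastic SEs, which contradicts hc1/hc2/hc2_bm.
        raise ValueError(f"robust=False conflicts with vcov_type={vcov_type!r}")
    return vcov_type
```

Calling one resolver from both the constructor and set_params is what keeps a later mutation from leaving the pair in a state the constructor would have rejected.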

github-actions · 2026-04-19T00:48:47Z

Overall Assessment

⚠️ Needs changes. Highest unmitigated severity is P1: the new vcov_type API is not propagated end-to-end to all estimators the PR says now support it, so users can request one variance method and silently receive another.

Executive Summary

The low-level HC2 / Bell-McCaffrey implementation in linalg.py looks plausible, but the new vcov_type option is not actually wired through all inherited estimators.
MultiPeriodDiD accepts vcov_type, yet its analytical fit path still uses the legacy solve_ols() default and a single generic df, so classical, hc2, and hc2_bm requests do not produce the advertised inference.
TwoWayFixedEffects also exposes the new option via inheritance, but its fit path hardcodes robust inference and never forwards vcov_type.
summary() can mislabel wild-bootstrap output with an analytical variance family, and the new tests do not exercise the affected MultiPeriodDiD, TwoWayFixedEffects, or clustered bootstrap branches.

Methodology

The new low-level math is directionally aligned with the cited source material: the HAD paper uses local-linear boundary regression plus Calonico-style bias-corrected inference for the nonparametric branch, and for the small-G TWFE application it explicitly switches to HC2 with Bell-McCaffrey/Imbens-Kolesar corrections; Imbens and Kolesar also recommend BM intervals routinely in small and moderate samples. (arxiv.org)

Severity: P1 [Newly identified]. Impact: MultiPeriodDiD now claims inherited vcov_type support in the registry, but its analytical path still calls solve_ols() without vcov_type at diff_diff/estimators.py:L1296, only recomputes homoskedastic SEs when not self.robust at diff_diff/estimators.py:L1428, and then applies one shared df to every period effect and the post-period average at diff_diff/estimators.py:L1456 and diff_diff/estimators.py:L1500. Explicit vcov_type="classical" is therefore ignored unless the caller also sets robust=False, and hc2 / hc2_bm silently fall back to legacy HC1-style inference instead of the requested variance/DOF method. The completeness claim is at docs/methodology/REGISTRY.md:L2287. Concrete fix: route the analytical fit through LinearRegression(..., vcov_type=self.vcov_type) or extend solve_ols() to return BM DOF; key off self.vcov_type == "classical" rather than self.robust; and either compute the correct contrast-specific DOF for the averaged post-treatment effect or reject vcov_type="hc2_bm" for that summary until it is implemented.
Severity: P1 [Newly identified]. Impact: TwoWayFixedEffects inherits the new public parameter, but the fit path hardcodes robust=True and never forwards vcov_type to LinearRegression at diff_diff/twfe.py:L217 and diff_diff/twfe.py:L231. Analytical TWFE inference therefore stays on the legacy robust/unit-cluster path regardless of the requested variance family, and the results object created at diff_diff/twfe.py:L365 does not record the requested variance metadata either. Concrete fix: pass robust=self.robust and vcov_type=self.vcov_type through the TWFE analytical path, then explicitly reject or document classical / hc2 when TWFE’s default unit clustering remains in force.

Code Quality

Severity: P2. Impact: DiDResults.summary() always prints a Variance: label when vcov_type is set at diff_diff/results.py:L195, but the wild-bootstrap path overwrites the reported SE/p-value/CI with bootstrap inference at diff_diff/estimators.py:L463 while storing a default CR1 analytical vcov at diff_diff/estimators.py:L601. A bootstrap run can therefore report bootstrap inference while simultaneously labeling it as "CR2 Bell-McCaffrey" or "Classical". Concrete fix: suppress the variance-family line whenever inference_method != "analytical", or relabel it as auxiliary analytical metadata rather than the source of the displayed inference.

Performance

No new untracked findings. The main new cost centers are already tracked at TODO.md:L91, TODO.md:L92, and TODO.md:L93.

Maintainability

No separate findings beyond the propagation issues above.

Tech Debt

Severity: P3 informational. Impact: the explicit deferrals in this PR are properly tracked rather than hidden: weighted clustered hc2_bm is documented at docs/methodology/REGISTRY.md:L2286 and TODO.md:L81, and the committed clubSandwich golden JSON is explicitly marked as a self-reference anchor at docs/methodology/REGISTRY.md:L2288, TODO.md:L82, and benchmarks/data/clubsandwich_cr2_golden.json:L477. That is acceptable deferred work, not a blocker. The weighted omission is also methodologically defensible because the later corrigendum narrows the fixed-effects shortcut from weighted/GLS settings to OLS. Concrete fix: none required for approval; keep these items in the planned follow-up. (tandfonline.com)

Security

No findings.

Documentation/Tests

Severity: P2. Impact: the new coverage only checks MultiPeriodDiD / TwoWayFixedEffects inheritance through get_params() at tests/test_estimators_vcov_type.py:L133 and tests/test_estimators_vcov_type.py:L138; it never fits either estimator with non-default vcov_type. The “wild bootstrap coexistence” test at tests/test_estimators_vcov_type.py:L224 also omits cluster=, so it never enters the bootstrap branch that mislabels Variance:. Concrete fix: add end-to-end fit tests for MultiPeriodDiD and TwoWayFixedEffects with vcov_type="classical" and vcov_type="hc2_bm", plus a real clustered bootstrap test that asserts the reported inference source and summary label stay consistent.

Path to Approval

Make MultiPeriodDiD honor vcov_type end-to-end, including correct Bell-McCaffrey DOF handling for period-specific effects and either a correct contrast-specific DOF or an explicit rejection for the post-period average.
Make TwoWayFixedEffects forward robust / vcov_type into LinearRegression, and add explicit validation for variance types that are incompatible with its default unit-clustering behavior.
Add end-to-end tests that fit MultiPeriodDiD and TwoWayFixedEffects under non-default vcov_type settings and exercise the actual clustered wild-bootstrap branch.

CI review caught that Phase 1a wired vcov_type into DifferenceInDifferences __init__/get_params but not into the overridden fit() paths on MultiPeriodDiD and TwoWayFixedEffects, so `vcov_type="hc2_bm"` on either silently produced HC1 inference. Summary output also mislabeled wild-bootstrap inference with the analytical variance family.
- diff_diff/estimators.py MultiPeriodDiD.fit: pass vcov_type=self.vcov_type into the analytical solve_ols call; remove the `not self.robust` homoskedastic fallback (subsumed by compute_robust_vcov's classical branch). When vcov_type="hc2_bm" and no survey design, compute Bell-McCaffrey Satterthwaite DOF via _compute_bm_dof_from_contrasts for both per-coefficient period effects AND the post-period-average contrast; fall back to the shared analytical df otherwise. Store vcov_type and cluster_name on MultiPeriodDiDResults.
- diff_diff/twfe.py: forward self.robust and self.vcov_type into the two LinearRegression instantiations; store vcov_type and the TWFE auto-cluster label (or explicit self.cluster) on DiDResults.
- diff_diff/linalg.py: split _compute_bm_dof_oneway into a contrast-aware helper _compute_bm_dof_from_contrasts(X, bread, h_diag, contrasts) so MultiPeriodDiD can request BM DOF for the avg_att linear combination. The per-coefficient wrapper now delegates to the shared helper with contrasts=I_k.
- diff_diff/results.py DiDResults.summary and MultiPeriodDiDResults: gate the Variance family label on inference_method == "analytical" so wild-bootstrap output is no longer mislabeled; add vcov_type, cluster_name, inference_method, n_bootstrap, n_clusters fields to MultiPeriodDiDResults for symmetry with DiDResults and to drive the summary label.
- tests/test_estimators_vcov_type.py: add five end-to-end tests exercising the previously-untested paths: MultiPeriodDiD classical vs hc1 SE differ; MultiPeriodDiD hc2_bm CI is finite; TWFE hc1 vs hc2_bm SE differ (CR1 vs CR2); TWFE records the unit auto-cluster label in summary; wild-bootstrap with cluster suppresses the Variance line. All 209 Phase 1a suites plus 145 estimator regression tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
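
A contrast-aware one-way Bell-McCaffrey degrees-of-freedom calculation, in the spirit of the helper named in this commit, can be sketched as follows. It follows the Imbens-Kolesar (2016) construction under homoskedasticity for an unweighted fit; the function name and the simplified signature (it recomputes the bread and leverages itself) are assumptions and do not match _compute_bm_dof_from_contrasts(X, bread, h_diag, contrasts).

```python
import numpy as np


def bm_dof_from_contrasts(X, contrasts):
    """Satterthwaite DOF for the HC2 variance of c'beta, one value per contrast row.

    For each contrast c, build G with column i equal to
    (e_i - X (X'X)^{-1} x_i) * x_i'(X'X)^{-1} c / sqrt(1 - h_ii),
    then return (sum of eigenvalues)^2 / (sum of squared eigenvalues) of G'G.
    """
    n, _ = X.shape
    bread = np.linalg.inv(X.T @ X)
    H = X @ bread @ X.T
    M = np.eye(n) - H  # residual-maker matrix
    h = np.diag(H)
    contrasts = np.atleast_2d(np.asarray(contrasts, dtype=float))
    dof = np.empty(contrasts.shape[0])
    for r, c in enumerate(contrasts):
        a = (X @ bread @ c) / np.sqrt(1.0 - h)
        G = M * a[None, :]  # scale column i of M by a_i
        B = G.T @ G
        # trace(B) = sum of eigenvalues; sum(B*B) = trace(B @ B) for symmetric B
        dof[r] = np.trace(B) ** 2 / np.sum(B * B)
    return dof
```

Passing the identity matrix as `contrasts` reproduces the per-coefficient case, and a row such as an equal-weight average over post-period coefficients gives the DOF for that linear combination.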

igerber · 2026-04-19T01:07:00Z

/ai-review

github-actions · 2026-04-19T01:16:17Z

🔁 AI review rerun (requested by @igerber)

Head SHA: d907eca4b034c0aac4fb4ca97cc7c56f7d32c3e9

Overall Assessment

⛔ Blocker. Highest unmitigated severity is P0: unsupported cluster/vcov_type combinations still reach user-facing fit paths and can silently produce wrong or undocumented inference.

Executive Summary

The prior re-review issues around TwoWayFixedEffects propagation, unclustered MultiPeriodDiD propagation, and bootstrap summary() labeling look resolved in this diff (diff_diff/twfe.py:L217-L244, diff_diff/estimators.py:L1296-L1307, diff_diff/results.py:L195-L206).
P0: estimator-facing OLS paths bypass the public compute_robust_vcov() validation. Unsupported requests such as cluster + vcov_type="classical", cluster + vcov_type="hc2", and cluster + weights + vcov_type="hc2_bm" do not reliably raise and can silently return invalid inference.
P1: MultiPeriodDiD(cluster=..., vcov_type="hc2_bm") still uses one-way Bell-McCaffrey DOF for period-level and average-post inference instead of CR2 cluster DOF.
The self-reference clubSandwich golden JSON and the follow-up work on standalone estimators are properly documented in REGISTRY/TODO and are not blockers by themselves.
I did not run the suite in this container because the available Python environment is missing numpy; the review below is from source inspection.

Methodology

Severity: P0. Impact: the public contract says classical and hc2 are one-way only, and hc2_bm + cluster + weights must raise NotImplementedError (diff_diff/linalg.py:L1027-L1030, diff_diff/linalg.py:L1079-L1099, docs/methodology/REGISTRY.md:L2286-L2288, TODO.md:L81-L82). But estimator-facing paths go through LinearRegression.fit() / solve_ols() and call _compute_robust_vcov_numpy() directly (diff_diff/linalg.py:L2314-L2327, diff_diff/linalg.py:L899-L920, diff_diff/linalg.py:L702-L724, diff_diff/estimators.py:L1296-L1307). Inside that internal dispatcher, the "classical" branch returns before looking at cluster_ids, the "hc2" branch ignores cluster_ids, and clustered "hc2_bm" accepts weights and calls _compute_cr2_bm() anyway (diff_diff/linalg.py:L1471-L1515). That means unsupported combinations can silently yield one-way SEs or an undocumented weighted-CR2 hybrid instead of the documented error path. Concrete fix: centralize the vcov_type/cluster validation in a helper that is called by both compute_robust_vcov() and every internal vcov entry point before dispatch.
Severity: P1. Impact: MultiPeriodDiD now passes cluster_ids into solve_ols() (diff_diff/estimators.py:L1296-L1306), but its new Bell-McCaffrey DOF block always computes _compute_bm_dof_from_contrasts() from observation-level hat values with no cluster input (diff_diff/estimators.py:L1441-L1490). Those one-way DOF are then used for each period effect and the average post-treatment ATT (diff_diff/estimators.py:L1504-L1509, diff_diff/estimators.py:L1553-L1557). For clustered hc2_bm, this mixes CR2 cluster-robust SEs with one-way Bell-McCaffrey p-values/CIs, which is not the Pustejovsky-Tipton / Imbens-Kolesar correction described in the registry (docs/methodology/REGISTRY.md:L2199-L2200, docs/methodology/REGISTRY.md:L2286-L2287). Concrete fix: either implement CR2 Satterthwaite DOF for the per-period and post-average contrasts, or reject cluster + vcov_type="hc2_bm" in MultiPeriodDiD until that clustered contrast-DOF path exists.
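
The centralization recommended above can be illustrated with a single validator that every entry point calls before dispatch. This is a hypothetical sketch: the function names are invented, the dispatch is reduced to returning a label, and using ValueError for the one-way-only cases is an assumption (the review only specifies NotImplementedError for the weighted clustered hc2_bm case).

```python
VALID_VCOV_TYPES = ("classical", "hc1", "hc2", "hc2_bm")


def validate_vcov_args_sketch(vcov_type, cluster_ids=None, weights=None):
    """Reject unsupported vcov_type / cluster / weights combinations."""
    if vcov_type not in VALID_VCOV_TYPES:
        raise ValueError(f"unknown vcov_type: {vcov_type!r}")
    if cluster_ids is not None and vcov_type in ("classical", "hc2"):
        raise ValueError(f"vcov_type={vcov_type!r} is one-way only; drop cluster_ids")
    if vcov_type == "hc2_bm" and cluster_ids is not None and weights is not None:
        raise NotImplementedError("weighted CR2 Bell-McCaffrey is not implemented")


def _dispatch_sketch(vcov_type, cluster_ids):
    """Stand-in for the real variance computation; returns only a label."""
    return f"{vcov_type}:{'cluster' if cluster_ids is not None else 'oneway'}"


def public_vcov_sketch(vcov_type, cluster_ids=None, weights=None):
    """Public wrapper: validates, then dispatches."""
    validate_vcov_args_sketch(vcov_type, cluster_ids, weights)
    return _dispatch_sketch(vcov_type, cluster_ids)


def internal_solver_sketch(vcov_type, cluster_ids=None, weights=None):
    """Internal path: calls the same validator so no caller can bypass it."""
    validate_vcov_args_sketch(vcov_type, cluster_ids, weights)
    return _dispatch_sketch(vcov_type, cluster_ids)
```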

Code Quality

No separate unmitigated findings beyond the validation split already captured under Methodology.

Performance

No unmitigated findings. The main new costs are already tracked in TODO.md:L91-L93.

Maintainability

No separate unmitigated findings beyond the duplicated public-vs-internal vcov validation logic already noted above.

Tech Debt

Severity: P3 informational. Impact: the committed clubSandwich JSON is explicitly marked as a self-reference anchor, and the remaining standalone-estimator vcov_type threading is tracked as deferred work (docs/methodology/REGISTRY.md:L2288, TODO.md:L80-L82, benchmarks/data/clubsandwich_cr2_golden.json:L476-L478). That tracking is acceptable. Concrete fix: none for approval. The only caveat is that the weighted-CR2 deferral still needs to be enforced in code, which is why it remains a blocker under Methodology.

Security

No findings.

Documentation/Tests

Severity: P2. Impact: the new negative-path tests stop at the public compute_robust_vcov() wrapper (tests/test_linalg_hc2_bm.py:L78-L85, tests/test_linalg_hc2_bm.py:L140-L147, tests/test_linalg_hc2_bm.py:L295-L309), while the estimator tests exercise only valid fit paths (tests/test_estimators_vcov_type.py:L148-L252, tests/test_estimators_vcov_type.py:L260-L309). That gap is exactly why the user-facing solve_ols() / estimator regressions above survived. Concrete fix: add end-to-end tests that assert solve_ols(), LinearRegression, and at least one estimator raise on cluster + classical, cluster + hc2, and cluster + weights + hc2_bm, plus a clustered MultiPeriodDiD(vcov_type="hc2_bm") regression or rejection test.

Path to Approval

Make the vcov_type/cluster validation run inside solve_ols() / _compute_robust_vcov_numpy() so unsupported combinations raise everywhere, not just through compute_robust_vcov().
For MultiPeriodDiD, either implement cluster-aware CR2 Bell-McCaffrey DOF for each interaction coefficient and the average-post contrast, or explicitly reject cluster + vcov_type="hc2_bm" until that DOF path exists.
Add regression tests covering the invalid estimator paths and the clustered MultiPeriodDiD(hc2_bm) behavior.

…2_bm

CI re-review flagged two unmitigated issues on top of the Phase 1a diff.

P0 - validation bypass: the `vcov_type`/`cluster`/`weights` raise logic lived only in the public `compute_robust_vcov()` wrapper. `solve_ols` and `_solve_ols_numpy` called `_compute_robust_vcov_numpy` directly and reached the dispatch table unvalidated, so `cluster + classical`, `cluster + hc2`, and `cluster + weights + hc2_bm` silently produced one-way SEs or a hybrid weighted-CR2 result instead of raising. Extract the checks into a shared `_validate_vcov_args()` helper and call it from both entry points so the raise is universal.

P1 - MultiPeriodDiD cluster + hc2_bm: when `cluster_ids` is set, vcov comes from `_compute_cr2_bm` (CR2 cluster-robust) but the new per-period and post-average DOF block was still using `_compute_bm_dof_from_contrasts`, which builds the one-way residual-maker and ignores clusters. Pairing CR2 SEs with one-way BM DOF is a broken hybrid. A contrast-aware CR2 BM DOF helper is real work (Pustejovsky-Tipton per-cluster adjustment matrices applied to arbitrary contrast vectors) and not in Phase 1a scope. For now, reject the combination in `MultiPeriodDiD.fit` with a clear error that points to the workarounds: drop `cluster` for one-way HC2+BM, or drop `vcov_type="hc2_bm"` for CR1 (Liang-Zeger) cluster-robust. Track the full implementation as follow-up.

Tests: four new negative-path tests exercising the previously-bypassed paths (solve_ols rejects cluster+classical, cluster+hc2, cluster+weights+hc2_bm; LinearRegression rejects cluster+hc2), plus a MultiPeriodDiD cluster+hc2_bm rejection test. All 299 Phase 1a + estimator regression tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-19T10:12:24Z

/ai-review

github-actions · 2026-04-19T10:20:35Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 7ba6d5838e816be12693634ae6a0a730ced1b3a7

Overall Assessment

⚠️ Needs changes. The prior numerical blockers look resolved, but the highest remaining unmitigated issue is P1: MultiPeriodDiD now intentionally rejects clustered hc2_bm, while the public methodology/API docs still say that combination is supported and there is no matching REGISTRY.md Note/Deviation documenting the limitation.

Executive Summary

The previous re-review blockers around internal vcov validation and the clustered MultiPeriodDiD(hc2_bm) hybrid appear fixed: unsupported vcov_type combinations are now validated in the shared internal dispatcher, and MultiPeriodDiD explicitly rejects the unsupported clustered Bell-McCaffrey path instead of mixing CR2 SEs with one-way DOF. diff_diff/linalg.py:L933-L982 diff_diff/linalg.py:L1475-L1480 diff_diff/estimators.py:L1294-L1312
P1: MultiPeriodDiD’s docstring still advertises clustered vcov_type="hc2_bm" support, but the implementation and tests now deliberately raise NotImplementedError; there is no corresponding REGISTRY.md Note/Deviation for that limitation. diff_diff/estimators.py:L856-L870 diff_diff/estimators.py:L1294-L1312 docs/methodology/REGISTRY.md:L2283-L2288
P2: the new Variance: line can mislabel survey-based inference as HC1/HC2/CR2 Bell-McCaffrey, even when the actual vcov came from survey Taylor linearization or replicate weights. diff_diff/linalg.py:L2307-L2360 diff_diff/linalg.py:L2475-L2524 diff_diff/results.py:L182-L206 diff_diff/results.py:L501-L515
The self-reference clubSandwich JSON and weighted clustered CR2 deferral are properly documented/tracked, so they are not blockers. docs/methodology/REGISTRY.md:L2286-L2288 TODO.md:L80-L82 benchmarks/data/clubsandwich_cr2_golden.json:L476-L479

Methodology

Severity: P1. Impact: The affected method is the new HC2/Bell-McCaffrey variance surface as exposed through MultiPeriodDiD. The class docstring says cluster + vcov_type="hc2_bm" dispatches to CR2 Bell-McCaffrey, but fit() now intentionally raises because the clustered post-average contrast DOF is not implemented. Failing fast is the correct numerical choice, but under the review policy this is still an undocumented mismatch until it is called out in the Methodology Registry with a **Note:**/deviation entry. diff_diff/estimators.py:L856-L870 diff_diff/estimators.py:L1294-L1312 docs/methodology/REGISTRY.md:L2283-L2288
Concrete fix: Either implement clustered contrast-specific BM DOF for the post-period average, or add an explicit **Note:** / deviation note in REGISTRY.md and the MultiPeriodDiD docstring stating that clustered hc2_bm is currently unsupported and intentionally raises. If the implementation is deferred, add a matching TODO.md row.
Severity: P2. Impact: Survey fits use survey vcov, not the analytical vcov_type, but both summary methods still print the analytical variance-family label whenever inference_method == "analytical". That can tell users they got “CR2 Bell-McCaffrey” or “HC1” when the reported SEs actually came from survey Taylor linearization or replicate weights. diff_diff/linalg.py:L2307-L2360 diff_diff/linalg.py:L2475-L2524 diff_diff/results.py:L182-L206 diff_diff/results.py:L501-L515
Concrete fix: Gate the Variance: line on the actual inference source, not just inference_method; suppress it when survey_metadata is present or replace it with a survey-specific label.
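
One way to express the suggested gating is a small function that returns the label only when the displayed inference really comes from the analytical sandwich. The sketch below is hypothetical throughout: the function name, the argument names, and the label strings for the non-clustered families are assumptions, not the text printed by DiDResults.summary().

```python
def variance_line_sketch(vcov_type, inference_method="analytical",
                         survey_metadata=None, cluster_name=None):
    """Return the 'Variance:' summary line, or None when it would mislead."""
    if vcov_type is None:
        return None
    # Bootstrap or survey inference overrides the analytical vcov, so stay silent.
    if inference_method != "analytical" or survey_metadata is not None:
        return None
    if cluster_name is not None:
        family = "CR2 Bell-McCaffrey" if vcov_type == "hc2_bm" else "CR1"
        return f"Variance: {family} (cluster: {cluster_name})"
    labels = {
        "classical": "Classical",
        "hc1": "HC1",
        "hc2": "HC2",
        "hc2_bm": "HC2 + Bell-McCaffrey",
    }
    return f"Variance: {labels[vcov_type]}"
```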

Code Quality

No separate unmitigated findings beyond the methodology/reporting issues above.

Performance

No unmitigated findings. The main new costs (bread_inv reuse and the naive O(n^2 k) CR2 DOF loop) are already tracked as deferred work.

Maintainability

No separate unmitigated findings beyond the public-contract mismatch already noted.

Tech Debt

Severity: P3 informational. Impact: The committed CR2 golden JSON is still a self-reference anchor rather than an authoritative R export, and weighted clustered CR2 remains deferred, but both are explicitly documented/tracked, so they should not block this PR. docs/methodology/REGISTRY.md:L2286-L2288 TODO.md:L81-L82 benchmarks/R/generate_clubsandwich_golden.R:L1-L82 benchmarks/data/clubsandwich_cr2_golden.json:L476-L479
Concrete fix: None for approval.

Security

No findings.

Documentation/Tests

Severity: P2. Impact: The new tests cover the explicit MultiPeriodDiD(cluster, vcov_type="hc2_bm") rejection and bootstrap-summary suppression, but they do not exercise survey-fit summaries. That leaves the new survey-label mismatch unguarded. tests/test_estimators_vcov_type.py:L254-L273 tests/test_estimators_vcov_type.py:L352-L382
Concrete fix: Add end-to-end summary tests for Taylor-linearized and replicate-weight survey fits asserting that the analytical Variance: line is suppressed or replaced with a survey-specific label.

Path to Approval

Document the unsupported MultiPeriodDiD(cluster=..., vcov_type="hc2_bm") combination as an explicit **Note:** / deviation in docs/methodology/REGISTRY.md and in the MultiPeriodDiD docstring, or implement clustered contrast-specific BM DOF for the post-period average.
Fix the new summary labeling so survey-based fits do not print analytical vcov_type labels, and add regression tests for survey summaries in both the Taylor-linearization and replicate-weight paths.

…ov label

Addresses CI AI review on PR #327 (head 7ba6d58):
- P1: MultiPeriodDiD(cluster=..., vcov_type="hc2_bm") now has a matching Note in docs/methodology/REGISTRY.md (both under MultiPeriodDiD standard-errors block and under the HeterogeneousAdoptionDiD Phase 1a requirements checklist) plus an explicit call-out in the MultiPeriodDiD docstring. Also clarifies the DifferenceInDifferences docstring that the limitation only applies to the multi-period subclass, since the scalar-coefficient base class handles cluster + CR2 Bell-McCaffrey correctly.
- P2: DiDResults.summary() and MultiPeriodDiDResults.summary() now suppress the analytical "Variance:" line when survey_metadata is present. Survey fits use Taylor linearization or replicate-weight variance, not the analytical HC/CR sandwich, so printing "HC1"/"CR2 Bell-McCaffrey" alongside survey-produced SEs was misleading. The survey design block already surfaces the actual inference source (weight type, strata/PSU counts, replicate method), so dropping the parallel label is the cleanest fix.
- P2-Tests: Four new tests in TestSummarySurveyLabeling pin the survey-fit suppression in both the Taylor-linearization path (SurveyDesign with PSU/strata) and the replicate-weight path (BRR), for both DiDResults and MultiPeriodDiDResults. A regression guard asserts the non-survey path still prints the analytical label.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-19T10:32:23Z

/ai-review

github-actions · 2026-04-19T10:44:44Z

🔁 AI review rerun (requested by @igerber)

Head SHA: dca1fce4a704f0c8c7bb91aeaeaa65591bcdcf0d

Overall Assessment

⚠️ Needs changes

Executive Summary

The previous re-review blocker around MultiPeriodDiD(cluster=..., vcov_type="hc2_bm") being undocumented is resolved: the limitation is now called out in both the class docstring and REGISTRY.md.
The earlier survey-summary labeling issue is also resolved: survey-backed summaries now suppress the analytical Variance: line, and the new tests cover both Taylor-linearized and replicate-weight paths.
P1: TwoWayFixedEffects still auto-clusters at unit level, so the inherited one-way variance families (robust=False/vcov_type="classical" and vcov_type="hc2") are not actually usable on the default non-survey TWFE path even though the new registry/API surface says vcov_type is threaded through TWFE.
P1: the new weighted one-way hc2_bm DOF path is internally inconsistent with the module’s own WLS transform, so weighted LinearRegression(..., vcov_type="hc2_bm") / compute_robust_vcov(..., return_dof=True) can emit wrong p-values/CIs with no warning.
The documented deferrals for weighted clustered CR2 and the self-reference clubSandwich JSON are properly tracked, so they are informational only.

Methodology

Severity: P1. Impact: TwoWayFixedEffects now inherits the new vcov_type surface, and the registry says the inheritance chain includes TWFE, but fit() still forces unit-level clustering when cluster=None. That means the low-level validator immediately rejects the inherited one-way families (classical, hc2) on the default non-survey TWFE path, so robust=False / vcov_type="classical" and vcov_type="hc2" are exposed but not actually usable. The new tests only check that TWFE can be constructed with vcov_type="hc2" and only fit hc1/hc2_bm, so this interaction slipped through. Concrete fix: either make TwoWayFixedEffects drop auto-clustering when the user explicitly requests a one-way family, or reject those combinations explicitly in the TWFE layer and document that deviation in the TWFE docs and REGISTRY.md; add fit-level regression tests for robust=False, vcov_type="classical", and vcov_type="hc2". diff_diff/twfe.py:L141 diff_diff/twfe.py:L217 diff_diff/linalg.py:L961 docs/methodology/REGISTRY.md:L2297 tests/test_estimators_vcov_type.py:L139 tests/test_estimators_vcov_type.py:L305
Severity: P1. Impact: the new weighted one-way Bell-McCaffrey DOF path is internally inconsistent with the solver’s own WLS definition. solve_ols() defines weighted estimation on the transformed design X* = sqrt(w) X, y* = sqrt(w) y, but _compute_bm_dof_from_contrasts() still builds q from the unscaled X and builds H as X (X'WX)^{-1} X' W, not from the same transformed design. LinearRegression.fit() then feeds that DOF vector directly into inference whenever weights is not None, cluster_ids is None, and vcov_type="hc2_bm". So the new weighted hc2_bm surface can silently report wrong small-sample p-values/CIs. Concrete fix: either re-derive the BM DOF helper on the transformed WLS design and validate it with dedicated weighted parity tests, or temporarily raise NotImplementedError for weights is not None and vcov_type="hc2_bm" until that derivation is implemented. diff_diff/linalg.py:L583 diff_diff/linalg.py:L594 diff_diff/linalg.py:L1424 diff_diff/linalg.py:L1427 diff_diff/linalg.py:L2367 TODO.md:L81 tests/test_linalg_hc2_bm.py:L560
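The second finding can be checked numerically. The following is a minimal NumPy sketch, not the library's code: the data are random and every variable name is illustrative. It builds the hat matrix on the transformed design X* = sqrt(w) X and the mixed form X (X'WX)^{-1} X' W, then compares them.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 12, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
w = rng.uniform(0.5, 2.0, size=n)  # illustrative positive weights

bread = np.linalg.inv(X.T @ (w[:, None] * X))  # (X'WX)^{-1}

# Hat matrix on the transformed WLS design X* = sqrt(w) X.
Xs = np.sqrt(w)[:, None] * X
H_star = Xs @ bread @ Xs.T

# Mixed form X (X'WX)^{-1} X' W, built from the unscaled design.
H_mixed = X @ bread @ (X.T * w)

sym_star = np.allclose(H_star, H_star.T)        # symmetric
sym_mixed = np.allclose(H_mixed, H_mixed.T)     # not symmetric for unequal weights
idem_star = np.allclose(H_star @ H_star, H_star)
same_diag = np.allclose(np.diag(H_star), np.diag(H_mixed))
print(sym_star, sym_mixed, idem_star, same_diag)
```

The two matrices share a diagonal, so a leverage-only check would not separate them; the difference shows up in the off-diagonal terms, which is where a trace-ratio DOF formula that uses the full residual maker would be affected.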

Code Quality

No separate unmitigated findings.

Performance

No separate unmitigated findings. The new CR2/BM performance debt is already tracked in TODO.md, so it is non-blocking.

Maintainability

Severity: P2. Impact: set_params() mutates attributes before it validates the robust/vcov_type pair, so a failing call can leave the estimator in exactly the inconsistent state the new docstring says it avoids. That will not usually produce silent wrong numbers, but it does make downstream failures harder to reason about if caller code catches the ValueError. Concrete fix: validate on local variables before mutating self, or roll back on exception; add a regression test that a rejected set_params() call leaves the estimator unchanged. diff_diff/estimators.py:L799 diff_diff/estimators.py:L816
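A minimal sketch of the validate-before-mutate pattern suggested above. The `Estimator` class and `check_pair` helper are hypothetical stand-ins, not the diff_diff implementation, and the conflict rule inside `check_pair` is an assumption made only so the example has something to reject.

```python
def check_pair(robust, vcov_type):
    # Assumed conflict rule, for illustration only: robust=False
    # cannot be combined with a non-classical vcov_type.
    if robust is False and vcov_type not in (None, "classical"):
        raise ValueError(f"robust=False conflicts with vcov_type={vcov_type!r}")


class Estimator:
    """Hypothetical stand-in; models only the parameter bookkeeping."""

    _PARAMS = ("robust", "vcov_type", "cluster")

    def __init__(self, robust=True, vcov_type=None, cluster=None):
        check_pair(robust, vcov_type)
        self.robust = robust
        self.vcov_type = vcov_type
        self.cluster = cluster

    def set_params(self, **params):
        # 1. Reject unknown keys before touching any attribute.
        unknown = set(params) - set(self._PARAMS)
        if unknown:
            raise ValueError(f"Unknown parameter(s): {sorted(unknown)}")
        # 2. Build the proposed state on a local dict and validate it.
        proposed = {name: getattr(self, name) for name in self._PARAMS}
        proposed.update(params)
        check_pair(proposed["robust"], proposed["vcov_type"])
        # 3. Only now mutate self; nothing above can leave it half-updated.
        for name, value in proposed.items():
            setattr(self, name, value)
        return self
```

With this ordering, a caller that catches the ValueError can keep using the object, because a rejected call never reaches step 3.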

Tech Debt

Severity: P3 informational. Impact: the remaining documented deferrals are properly tracked: weighted clustered CR2 Bell-McCaffrey still raises NotImplementedError, and the committed clubSandwich golden JSON is still a self-reference anchor pending an authoritative R regeneration. Those are explicitly noted in TODO.md / REGISTRY.md, so they are not blockers. Concrete fix: none for approval. TODO.md:L81 TODO.md:L82 docs/methodology/REGISTRY.md:L2296 docs/methodology/REGISTRY.md:L2299

Security

No findings.

Documentation/Tests

No separate unmitigated findings beyond the issues above. The prior documentation blocker for MultiPeriodDiD(cluster=..., vcov_type="hc2_bm") is fixed, and the new survey-summary tests cover the earlier labeling problem. I could not execute the test suite in this sandbox because the Python environment is missing numpy, so this review is based on static inspection.

Path to Approval

Make the TWFE vcov_type contract real: either allow classical / one-way hc2 by bypassing auto unit clustering when those families are requested, or reject them explicitly in TwoWayFixedEffects and document that limitation in the TWFE docs and REGISTRY.md. Add fit-level regression tests for robust=False, vcov_type="classical", and vcov_type="hc2" on TWFE.
Fix weighted one-way hc2_bm inference: either derive the BM DOF on the same transformed WLS design used by solve_ols() and add weighted parity tests, or raise NotImplementedError for any weighted one-way hc2_bm fit until that derivation is in place.

…_params

Addresses CI AI review on PR #327 head dca1fce:

P1 (TWFE auto-cluster vs. one-way vcov families): TwoWayFixedEffects.fit() forced cluster_var = unit when self.cluster was None, which collided with the validator: vcov_type in {"classical", "hc2"} is one-way-only and raises ValueError when cluster_ids is non-None. The inheritance surface advertised those families as usable on TWFE but they were not. Fix: when cluster is None AND vcov_type is a one-way family, drop the auto-cluster. The explicit choice of a one-way family wins over the TWFE default. cluster_name in DiDResults is now None on that path so summary() labels the one-way family (not "CR1 cluster-robust at unit"). Further docstring and REGISTRY.md updates are left for the next doc pass if a follow-up edge case appears; the TWFE docstring already documents the exception.

P1 (weighted one-way hc2_bm silent math mismatch): _compute_bm_dof_from_contrasts builds its hat matrix from the unscaled design as X (X'WX)^{-1} X' W, but solve_ols solves weighted regression by transforming to X* = sqrt(w) X, y* = sqrt(w) y. The symmetric-idempotent residual maker M* = I - H* with H* = sqrt(W) X (X'WX)^{-1} X' sqrt(W) is the correct one for the Satterthwaite (trG)^2 / tr(G^2) ratio; the asymmetric X (X'WX)^{-1} X' W is neither transformed nor original-scale. Rather than ship silently-inconsistent small-sample p-values, extend the existing weighted-cluster CR2 deferral to cover weighted one-way as well: _validate_vcov_args now raises NotImplementedError for vcov_type="hc2_bm" + weights (with OR without cluster). Tracked in TODO.md under Methodology/Correctness (rederive on transformed design + add weighted parity tests).

P2 (set_params atomic validation): Previously set_params applied all setattr mutations BEFORE re-validating the robust/vcov_type pair. A failing call left the estimator in the half-configured state the alias/conflict check is designed to prevent, defeating callers that catch ValueError and keep using the object. Fix: validate unknown-key rejection + resolve_vcov_type on locals first, then apply mutations atomically.

Tests:
- TestFitBehavior.test_twfe_honors_classical_without_autocluster + test_twfe_honors_robust_false_without_autocluster + test_twfe_honors_hc2_one_way: all three one-way entry points now succeed on TWFE (and cluster_name is None).
- TestFitBehavior.test_twfe_explicit_cluster_still_clusters_under_hc2_bm: regression guard that explicit cluster= keeps the auto-bypass off.
- TestHC2BMCluster.test_hc2_bm_weighted_one_way_not_implemented: locks the NotImplementedError at both public and internal entry points.
- TestParamsRoundTrip.test_set_params_conflict_leaves_estimator_unchanged + test_set_params_unknown_key_leaves_estimator_unchanged: atomicity regression guards.

All 133 Phase 1a tests pass; 405 tests across estimators / survey / Phase 1a neighbours pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-19T10:54:01Z

/ai-review

github-actions · 2026-04-19T11:06:56Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 6836836778786a4a7c23621f0016e9adfe76ece9

Overall Assessment

⛔ Blocker

Executive Summary

The previous re-review blockers around weighted one-way hc2_bm, set_params() atomicity, survey summary labeling, and TWFE’s unusable classical/hc2 analytical paths were addressed at the API surface.
P0: the new HC2 / Bell-McCaffrey leverage corrections are computed on the residualized FE design, not the full fixed-effects projection. That makes the new vcov_type="hc2" / "hc2_bm" output methodologically wrong on TwoWayFixedEffects and on any absorb= fit that reaches the new leverage-based code.
P1: TwoWayFixedEffects(inference="wild_bootstrap", vcov_type="classical"|"hc2", cluster=None) now disables the estimator’s default unit-level clustering before the bootstrap branch, so a previously supported auto-clustered bootstrap fit now errors instead of bootstrapping.
The remaining weighted-hc2_bm deferrals and the self-reference clubSandwich golden JSON are properly documented in TODO.md / REGISTRY.md, so they are informational only.
I could not execute the test suite in this sandbox because the Python environment is missing numpy; this review is based on static inspection.

Methodology

Severity: P0. Impact: the new HC2 / CR2 / Bell-McCaffrey corrections are being applied after within-transformation, not on the full FE model. TwoWayFixedEffects always demeans first, and the absorb= paths in the base estimators do the same, then pass only the reduced regressor matrix into LinearRegression; _compute_hat_diagonals() / _compute_cr2_bm() therefore never see the absorbed FE projection. That is not methodology-equivalent: FWL preserves coefficient estimates, but leverage corrections depend on the full hat matrix / cluster block hat matrix. So every absorbed-FE vcov_type="hc2" / "hc2_bm" result is silently using the wrong leverage adjustment and wrong Bell-McCaffrey DOF. Concrete fix: reject leverage-based vcov_type values on absorbed-FE fits until a FE-aware hat-matrix implementation exists, or compute HC2/CR2 from the full absorbed projection and validate it against a full-dummy / fixest / clubSandwich reference before re-enabling. Refs: diff_diff/twfe.py:172, diff_diff/twfe.py:242, diff_diff/estimators.py:303, diff_diff/estimators.py:390, diff_diff/estimators.py:1203, diff_diff/estimators.py:1331, diff_diff/linalg.py:1206, diff_diff/linalg.py:1261, docs/methodology/REGISTRY.md:246, docs/methodology/REGISTRY.md:2209.
Severity: P1. Impact: the new one-way-family bypass in TwoWayFixedEffects.fit() clears cluster_var before the bootstrap branch. For inference="wild_bootstrap" with no explicit cluster=, vcov_type="classical" or "hc2" now removes the estimator’s default unit-level clustering and then passes None into _run_wild_bootstrap_inference(). vcov_type should not disable clustering once inference has switched to wild bootstrap. Concrete fix: keep cluster_var = unit whenever self.inference == "wild_bootstrap" and cluster is None, or reject the combination early with a targeted error; add regression tests for both vcov_type="classical" and "hc2". Refs: diff_diff/twfe.py:148, diff_diff/twfe.py:205, diff_diff/twfe.py:373, tests/test_methodology_twfe.py:1008, tests/test_estimators_vcov_type.py:398, tests/test_estimators_vcov_type.py:433.
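The P0 claim that FWL preserves coefficients but not leverage can be illustrated on a small balanced panel. This is a hedged NumPy sketch on simulated data with one regressor and unit fixed effects only; it does not call any diff_diff code, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n_units, n_periods = 6, 4
unit = np.repeat(np.arange(n_units), n_periods)
x = rng.normal(size=unit.size) + 0.5 * unit
y = 2.0 * x + unit + rng.normal(size=unit.size)

# Full-dummy design: the regressor plus one indicator per unit.
D = (unit[:, None] == np.arange(n_units)).astype(float)
X_full = np.column_stack([x, D])
beta_full = np.linalg.lstsq(X_full, y, rcond=None)[0]
resid_full = y - X_full @ beta_full
h_full = np.diag(X_full @ np.linalg.inv(X_full.T @ X_full) @ X_full.T)


def demean(v):
    # Within transformation: subtract each unit's mean (balanced panel).
    return v - D @ (D.T @ v) / n_periods


# Residualized design: what the solver sees after absorbing the unit effects.
x_w, y_w = demean(x), demean(y)
beta_within = (x_w @ y_w) / (x_w @ x_w)
resid_within = y_w - beta_within * x_w
h_within = x_w**2 / (x_w @ x_w)

print(np.isclose(beta_full[0], beta_within))   # coefficient preserved
print(np.allclose(resid_full, resid_within))   # residuals preserved
print(np.round(h_full - h_within, 6))          # leverage gap from the absorbed dummies
```

In this balanced case the gap is the constant 1 / n_periods contributed by the absorbed unit dummies, so an HC2 factor of 1 / (1 - h_ii) computed from the residualized design understates the correction for every observation.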

Code Quality

No separate unmitigated findings beyond the methodology issues above.

Performance

No separate unmitigated findings.

Maintainability

No separate unmitigated findings.

Tech Debt

Severity: P3. Impact: the remaining weighted-hc2_bm gaps and the self-reference clubSandwich JSON are explicitly tracked, so they are non-blocking. Concrete fix: none for approval; keep the tracked follow-ups open until the weighted derivations and R regeneration land. Refs: TODO.md:80, TODO.md:81, TODO.md:82, TODO.md:83, docs/methodology/REGISTRY.md:2296, docs/methodology/REGISTRY.md:2299.

Security

No findings.

Documentation/Tests

Severity: P2. Impact: the new tests never validate HC2 / HC2+BM on absorbed-FE regressions against a methodology reference. The low-level linalg tests are plain-OLS only, existing TWFE reference tests are still HC1-only, and the new TWFE vcov_type tests only assert that SEs differ from HC1. That gap is why the absorbed-FE leverage bug above slipped through. Concrete fix: add parity tests for TwoWayFixedEffects(vcov_type="hc2") and vcov_type="hc2_bm" plus DifferenceInDifferences(absorb=...), using a full-dummy regression or R fixest / clubSandwich reference. Refs: tests/test_linalg_hc2_bm.py:58, tests/test_methodology_twfe.py:426, tests/test_estimators_vcov_type.py:350.

Path to Approval

Block or correctly re-derive HC2 / HC2+Bell-McCaffrey on absorbed-FE fits. The safe short-term fix is to raise on TwoWayFixedEffects(vcov_type in {"hc2","hc2_bm"}) and on DifferenceInDifferences / MultiPeriodDiD with absorb= plus those vcov_type values.
Restore TWFE’s no-cluster wild-bootstrap behavior for one-way vcov_type values, or reject that combination explicitly before entering the bootstrap path.
Add absorbed-FE parity tests for the new HC2 / HC2+BM surfaces and regression tests for TwoWayFixedEffects(inference="wild_bootstrap", vcov_type="classical"|"hc2", cluster=None).

…luster

Addresses CI AI review on PR #327 head 6836836 (⛔ Blocker).

P0 (HC2/CR2-BM applied to demeaned design produces wrong hat matrix): TWFE unconditionally demeans via within-transformation, and both DifferenceInDifferences(absorb=...) and MultiPeriodDiD(absorb=...) do the same before solving OLS on the reduced design. The HC2 leverage correction `h_ii = x_i' (X'X)^{-1} x_i` and the CR2 Bell-McCaffrey block adjustment `A_g = (I - H_gg)^{-1/2}` both depend on the FULL FE hat matrix, not the residualized one. FWL preserves coefficients and residuals but not the hat matrix, so applying HC2/CR2-BM to the demeaned regressors silently mis-states small-sample SEs and Satterthwaite DOF.

Short-term fix: raise NotImplementedError in three places:
- TwoWayFixedEffects.fit() unconditionally for vcov_type in {hc2, hc2_bm}
- DifferenceInDifferences.fit() with absorb= and vcov_type in {hc2, hc2_bm}
- MultiPeriodDiD.fit() with absorb= and vcov_type in {hc2, hc2_bm}

HC1 and CR1 are unaffected (no leverage term; meat uses only the residuals, which FWL preserves). Workarounds documented in the error message: use vcov_type='hc1' with absorb=/TWFE, or switch to fixed_effects= dummies for a full-dummy design where the hat matrix is computed on the full projection. Lifting the guard requires computing HC2/CR2-BM from the full absorbed projection and validating against a full-dummy OLS or fixest/clubSandwich reference. Tracked in TODO.md. REGISTRY.md gets a matching Note under the Phase 1a checklist.

P1 (TWFE wild_bootstrap + one-way family dropped the auto-cluster): The prior commit's one-way-family auto-cluster bypass in TWFE (classical/hc2, cluster=None → cluster_var=None) applied even when inference="wild_bootstrap". That silently dropped the unit cluster the bootstrap path needed to resample residuals. Fix: gate the bypass on inference=="analytical", so wild-bootstrap fits keep the unit auto-cluster. Since hc2/hc2_bm now raise earlier, only "classical" reaches the bypass branch; cleaned up accordingly.

Tests:
- test_twfe_rejects_hc2_and_hc2_bm: both combinations raise with the expected message.
- test_did_absorb_rejects_hc2_and_hc2_bm: absorb= + hc2/hc2_bm rejected.
- test_did_fixed_effects_dummies_still_accept_hc2_and_hc2_bm: dummy expansion path is unaffected (regression guard).
- test_multi_period_absorb_rejects_hc2_and_hc2_bm: MultiPeriodDiD absorb= + hc2/hc2_bm rejected.
- test_twfe_wild_bootstrap_preserves_auto_cluster: classical + wild_bootstrap + cluster=None keeps the unit auto-cluster (n_clusters == n_units).

Removed/replaced: test_twfe_fit_honors_vcov_type (tested HC2+BM on TWFE), test_twfe_honors_hc2_one_way, test_twfe_explicit_cluster_still_clusters_under_hc2_bm. Those paths now raise, so their replacements are the negative-path tests.

All 135 Phase 1a tests pass; 448 tests across estimators / survey / TWFE methodology / Phase 1a neighbours pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
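For reference, the block adjustment `A_g = (I - H_gg)^{-1/2}` named in the commit message can be sketched as a symmetric inverse square root taken through an eigendecomposition, with near-zero eigenvalues mapped to zero (the pseudoinverse convention). The helper name `sym_inv_sqrt`, the tolerance, and the toy data are assumptions for illustration, not the code in diff_diff/linalg.py.

```python
import numpy as np


def sym_inv_sqrt(M, tol=1e-10):
    """Symmetric (pseudo)inverse square root of a symmetric PSD matrix."""
    M = 0.5 * (M + M.T)                 # guard against round-off asymmetry
    vals, vecs = np.linalg.eigh(M)
    inv_sqrt = np.zeros_like(vals)
    keep = vals > tol                   # drop (near-)zero eigenvalues
    inv_sqrt[keep] = 1.0 / np.sqrt(vals[keep])
    return (vecs * inv_sqrt) @ vecs.T


rng = np.random.default_rng(2)
n, cluster_size = 20, 4
X = np.column_stack([np.ones(n), rng.normal(size=n)])
XtX_inv = np.linalg.inv(X.T @ X)

# One cluster's block of the full hat matrix: H_gg = X_g (X'X)^{-1} X_g'.
X_g = X[:cluster_size]
I_g = np.eye(cluster_size)
H_gg = X_g @ XtX_inv @ X_g.T
A_g = sym_inv_sqrt(I_g - H_gg)

# Non-singular block: A_g (I - H_gg) A_g recovers the identity.
roundtrip = A_g @ (I_g - H_gg) @ A_g

# Singular block: the zero eigenvalue stays zero instead of blowing up.
A_singular = sym_inv_sqrt(np.diag([4.0, 0.0]))
```

The zeroing step is what keeps a singular block finite; a plain matrix inverse would fail or return infinities there.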

igerber · 2026-04-19T11:44:20Z

/ai-review

github-actions · 2026-04-19T11:51:53Z

🔁 AI review rerun (requested by @igerber)

Head SHA: e113549a3f1bd4f8367221f5b9a2f6a36660ac79

Overall Assessment

✅ Looks good

Executive Summary

The prior re-review blockers appear resolved. HC2 / HC2+Bell-McCaffrey are now explicitly rejected on within-transformed FE fits in diff_diff/estimators.py:303, diff_diff/estimators.py:1219, and diff_diff/twfe.py:117, matching the new registry note at docs/methodology/REGISTRY.md:2296.
TWFE’s classical analytical path now drops only the auto-cluster, while the wild-bootstrap path preserves unit clustering for resampling in diff_diff/twfe.py:190 and diff_diff/twfe.py:420, with regression coverage in tests/test_estimators_vcov_type.py:428.
vcov_type is threaded consistently through the DifferenceInDifferences inheritance chain and into results summaries, and the new inference paths continue to use safe_inference() rather than inline t-stat logic. See diff_diff/estimators.py:139, diff_diff/linalg.py:1059, and diff_diff/results.py:195.
Remaining weighted-hc2_bm limitations and the self-reference clubSandwich golden are fail-fast or explicitly documented/tracked, so they are informational rather than blockers. See diff_diff/linalg.py:974, TODO.md:81, and benchmarks/data/clubsandwich_cr2_golden.json:477.
Static review only: I could not execute the suite in this sandbox because numpy and pytest are unavailable.

Methodology

No unmitigated P0/P1 findings. The absorbed-FE HC2/BM mismatch from the prior review is now blocked instead of silently miscomputed, and MultiPeriodDiD(cluster=..., vcov_type="hc2_bm") now fails fast rather than mixing CR2 SEs with one-way BM DOF. See diff_diff/estimators.py:1350, docs/methodology/REGISTRY.md:166, and tests/test_estimators_vcov_type.py:300.
Severity: P3. Impact: the committed CR2 golden remains a self-reference regression anchor, so it stabilizes implementation changes but does not yet provide external clubSandwich parity. This is explicitly documented in the registry/TODO and is non-blocking. Concrete fix: none for approval; regenerate the JSON with the committed R script in the tracked follow-up. Refs: benchmarks/data/clubsandwich_cr2_golden.json:477, benchmarks/R/generate_clubsandwich_golden.R:1, docs/methodology/REGISTRY.md:2300, TODO.md:84.

Code Quality

No unmitigated findings.

Performance

No unmitigated findings.

Maintainability

No unmitigated findings.

Tech Debt

Severity: P3. Impact: weighted one-way hc2_bm, weighted clustered CR2/BM, and the remaining standalone-estimator vcov_type threading are still deferred, but they now fail fast where needed and are explicitly tracked. Concrete fix: none for approval; keep the queued follow-ups open. Refs: diff_diff/linalg.py:974, TODO.md:80, TODO.md:81, TODO.md:83.

Security

No findings.

Documentation/Tests

Severity: P3. Impact: the public helper docs lag the API change. solve_ols() and LinearRegression now accept vcov_type, but their parameter documentation does not describe that knob or its unsupported combinations, so the code/tests are ahead of the helper docs. Concrete fix: add vcov_type parameter docs and the new fail-fast limitation notes to those two docstrings. Refs: diff_diff/linalg.py:429, diff_diff/linalg.py:453, diff_diff/linalg.py:2225, diff_diff/linalg.py:2151.
No unmitigated test-coverage finding in the diff. The new tests cover the prior blockers and the new variance paths, including manual HC2/BM checks, absorb/TWFE guard rails, and summary labeling. See tests/test_linalg_hc2_bm.py:192, tests/test_linalg_hc2_bm.py:474, tests/test_estimators_vcov_type.py:350, tests/test_estimators_vcov_type.py:450, tests/test_estimators_vcov_type.py:546, and tests/test_estimators_vcov_type.py:643.

Addresses the P3 docs gap flagged by CI AI review on PR #327 head e113549: both public helpers accept vcov_type but the parameter docs didn't list it or its unsupported combinations. - solve_ols: Parameters block now lists vcov_type with the four enum values and notes ``cluster_ids + {classical, hc2}`` and weighted hc2_bm raise. - LinearRegression.__init__: same threading plus a note that the class stores ``self._bm_dof`` and threads it into get_inference. No behavior change; purely docstring updates. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-19T11:54:19Z

/ai-review

github-actions · 2026-04-19T12:02:53Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 90a93c9887f150b2d1e1106dd8a49b74d12a12dc

Overall Assessment

⚠️ Needs changes

Executive Summary

Prior blockers from the earlier re-review look resolved: absorbed-FE hc2/hc2_bm paths now fail fast, and MultiPeriodDiD(cluster=..., vcov_type="hc2_bm") is explicitly rejected in the same way the Methodology Registry now documents.
Severity P1: the new legacy alias logic hard-resolves robust=False to vcov_type="classical" before cluster handling, so clustered fits that previously worked now fail validation instead of producing clustered inference.
I did not find a new inline-inference or partial-NaN anti-pattern in the modified estimator/linalg paths; the changed analytical inference sites still route through safe_inference().
The committed CR2 golden is still a self-reference anchor, but that limitation is explicitly documented in REGISTRY.md and tracked in TODO.md, so it is informational only.
Static review only: I could not execute the new tests in this sandbox because numpy is unavailable.

Methodology

Affected methods in this PR: DifferenceInDifferences, MultiPeriodDiD, TwoWayFixedEffects, LinearRegression, and the shared HC2 / HC2+Bell-McCaffrey variance infrastructure.

Severity: P1. Impact: robust=False now maps to vcov_type="classical" unconditionally, and the new validator rejects any cluster/cluster_ids combination with classical. That means formerly valid clustered calls such as DifferenceInDifferences(robust=False, cluster=...), MultiPeriodDiD(robust=False, cluster=...), TwoWayFixedEffects(robust=False, cluster=...), and LinearRegression(robust=False, cluster_ids=...) now fail at fit time instead of preserving the old “cluster overrides robust” behavior. This is an undocumented default-behavior regression in the inference contract, not a documented registry deviation. Concrete fix: preserve backward compatibility when vcov_type is omitted by delaying alias resolution until cluster context is known, or by tracking whether vcov_type was explicit and remapping the implicit robust=False + cluster case back to CR1/hc1. Refs: diff_diff/linalg.py:L1021-L1071, diff_diff/linalg.py:L978-L988, diff_diff/linalg.py:L2170-L2175, diff_diff/linalg.py:L2374-L2387, diff_diff/estimators.py:L139-L157, diff_diff/estimators.py:L816-L846.
Severity: P3. Impact: the committed CR2 parity JSON is still a Python self-reference anchor rather than an authoritative clubSandwich export, so it only protects against internal drift for now. Because this is explicitly disclosed in the registry and tracked in TODO.md, it is informational and not a blocker. Concrete fix: none for approval; regenerate the golden from R in the tracked follow-up. Refs: docs/methodology/REGISTRY.md:L2296-L2300, TODO.md:L81-L84, benchmarks/data/clubsandwich_cr2_golden.json:L476-L479.
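The explicit-vs-implicit remap proposed in the P1 fix can be sketched as a small pure function. The name `resolve_effective_vcov_type` and its signature here are hypothetical and chosen for this example; the rule it encodes is the one described above: an implicit "classical" (from robust=False) falls back to hc1/CR1 once a cluster is present, while an explicit "classical" with a cluster is still an error.

```python
def resolve_effective_vcov_type(vcov_type, explicit, has_cluster):
    """Return the variance family to use once cluster context is known.

    vcov_type   : family resolved from the robust/vcov_type pair
    explicit    : True if the caller passed vcov_type themselves
    has_cluster : True if cluster ids are in effect for this fit
    """
    if vcov_type == "classical" and has_cluster:
        if explicit:
            # The caller asked for classical on purpose; do not override.
            raise ValueError("classical SEs are one-way only; got cluster ids")
        # Legacy robust=False + cluster: keep the old cluster-robust behaviour.
        return "hc1"
    return vcov_type


# Legacy call shape: robust=False with a cluster keeps CR1.
print(resolve_effective_vcov_type("classical", explicit=False, has_cluster=True))
# No cluster: classical stays classical.
print(resolve_effective_vcov_type("classical", explicit=False, has_cluster=False))
```

Keeping the decision in one function that takes the cluster flag as an argument means it can only run after the cluster context is known, which is the ordering the review asks for.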

Code Quality

No unmitigated findings in the modified code paths.

Performance

No unmitigated findings. The new O(n^2 k) CR2/BM DOF cost is already tracked in TODO.md and is non-blocking for this PR.

Maintainability

No unmitigated findings.

Tech Debt

Severity: P3. Impact: weighted one-way BM, weighted clustered CR2/BM, and absorbed-FE HC2/BM remain deferred, but the PR now fails fast on those unsupported combinations and records them in TODO.md/REGISTRY.md, which is the right non-blocking treatment under the project rules. Concrete fix: none for approval; keep the tracked follow-ups open. Refs: TODO.md:L81-L84, docs/methodology/REGISTRY.md:L2296-L2300.

Security

No findings.

Documentation/Tests

Severity: P3. Impact: the new status text points readers to docs/methodology/papers/dechaisemartin-2026-review.md, but that file is not present in this checkout, so the new methodology breadcrumb is dead. Concrete fix: either commit that review artifact or replace the pointer with the in-tree registry section that is actually authoritative. Refs: ROADMAP.md:L103-L103, docs/methodology/REGISTRY.md:L2122-L2122.

Static-review note: I could not run pytest here because numpy is missing from the sandbox, so the assessment is based on diff inspection and surrounding source only.

Path to Approval

Restore backward-compatible clustered behavior for the legacy alias path: when vcov_type is omitted, robust=False must not force classical once cluster/cluster_ids is present. The cleanest fix is to delay alias resolution until cluster context is known, or to remember whether vcov_type was explicit and remap the implicit robust=False + cluster case to hc1/CR1.
Add regression tests for DifferenceInDifferences(robust=False, cluster=...), MultiPeriodDiD(robust=False, cluster=...), TwoWayFixedEffects(robust=False, cluster=...), and LinearRegression(robust=False, cluster_ids=...) so this legacy-cluster interaction is pinned end to end.

Addresses CI AI review on PR #327 head 90a93c9:

P1 (legacy alias broke clustered calls): The new `robust=False → vcov_type="classical"` alias was too eager. Clustered calls like `DifferenceInDifferences(robust=False, cluster="unit")` (and the TWFE/MultiPeriod/LinearRegression equivalents) used to produce CR1 cluster-robust SEs; the cluster structure silently overrode the non-robust flag. Phase 1a made them fail validation (classical is one-way only).

Fix: track `_vcov_type_explicit` at __init__/set_params. At fit time, a new `_resolve_effective_vcov_type(effective_cluster_ids)` remaps implicit `"classical"` to `"hc1"` when a cluster structure is present, preserving CR1 behavior and emitting a UserWarning. Explicit `vcov_type="classical"` + cluster still raises (user made the choice deliberately).
- DifferenceInDifferences.fit: remap at solve site; report remapped type on the result.
- MultiPeriodDiD.fit: same pattern, both analytical and absorb paths.
- TwoWayFixedEffects.fit: same pattern + the auto-cluster bypass now gates on `_vcov_type_explicit` so implicit classical keeps the unit auto-cluster (which feeds the remap). Wild-bootstrap behavior unchanged (already kept the auto-cluster).
- LinearRegression.__init__: mirrors the remap for direct callers so the behavior is consistent across the library surface.

All four LinearRegression call sites (DiD fit, MultiPeriod fit, TWFE two fit branches) drop the `robust=self.robust` forwarding when the remap could fire, since `robust=False + vcov_type="hc1"` would otherwise trip `resolve_vcov_type`'s conflict check. The resolved vcov_type becomes the single source of truth for the LR call.

P3 (dead pointer): force-add the paper review file. `docs/methodology/papers/dechaisemartin-2026-review.md` was gitignored by the `.gitignore:91` `papers/` pattern. ROADMAP.md:103 and REGISTRY.md:2122 referenced it, so the breadcrumb was dead. Force-added now, same treatment as the existing `rambachan-roth-2023-review.md`.

Tests: 7 new regression guards in TestFitBehavior covering DifferenceInDifferences / MultiPeriodDiD / TwoWayFixedEffects / LinearRegression `robust=False + cluster` round-trips, plus the explicit-vs-implicit distinction (`test_explicit_classical_with_cluster_still_raises` pins that deliberate classical + cluster still raises).

All 141 Phase 1a tests pass; 454 tests across estimators / survey / TWFE methodology / Phase 1a neighbours pass (one flaky test-ordering failure in `test_hc1_cluster_unchanged` passes standalone, unrelated to this fix).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-19T12:14:32Z

/ai-review

github-actions · 2026-04-19T12:21:20Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 3c4a393732dc58df436b5f382ce76411df73fef1

Overall Assessment

⚠️ Needs changes

Executive Summary

Prior methodology blockers around absorbed-FE HC2/Bell-McCaffrey support and MultiPeriodDiD(cluster=..., vcov_type="hc2_bm") now look properly fail-fast and documented in the Methodology Registry/TODO.
One P1 remains: the legacy robust=False compatibility fix is incomplete on the public LinearRegression.fit(..., cluster_ids=...) override path. Constructor-time clustering is handled, but fit-time clustering still falls into the new classical + cluster_ids validator and errors.
The estimator-layer remap added to DifferenceInDifferences, MultiPeriodDiD, and TwoWayFixedEffects is the right pattern; it just was not carried through to the direct LinearRegression.fit() entry point.
The self-referential CR2 golden JSON is still only a regression anchor, but that limitation is explicitly documented in REGISTRY.md/TODO.md, so it is informational rather than blocking.
Static review only: I could not run pytest here because pytest and import-time dependencies like pandas are unavailable in the sandbox.

Methodology

Affected methods: DifferenceInDifferences, MultiPeriodDiD, TwoWayFixedEffects, LinearRegression, and the shared HC2 / HC2+Bell-McCaffrey variance infrastructure.

Severity: P1. Impact: the PR’s backward-compatibility repair for the robust → vcov_type alias is still incomplete. The new estimator helpers correctly remap implicit robust=False + cluster to CR1/hc1, but the public LinearRegression API only performs that remap when cluster_ids are supplied to __init__. If a caller instead uses the documented fit-time override, LinearRegression(robust=False).fit(..., cluster_ids=...), self.vcov_type remains "classical" and solve_ols() hits the new "classical SEs are one-way only" validation path. That leaves a public clustered inference entry point regressed even though the PR positions robust=True/False as a backward-compat alias. Refs: diff_diff/linalg.py:2170, diff_diff/linalg.py:2252, diff_diff/linalg.py:2291, diff_diff/linalg.py:978. Concrete fix: give LinearRegression the same implicit-vs-explicit vcov_type tracking the estimator classes now use, and resolve the effective vcov_type after effective_cluster_ids is known inside fit(), not only in __init__.
Severity: P3. Impact: the methodology-side deviations I checked are now explicitly documented and therefore non-blocking: absorbed-FE HC2 / HC2+Bell-McCaffrey rejection, MultiPeriodDiD(cluster=..., vcov_type="hc2_bm") rejection pending contrast-aware CR2 DOF, weighted BM NotImplementedError, and the still-self-referential CR2 golden file. Refs: docs/methodology/REGISTRY.md:166, docs/methodology/REGISTRY.md:2297, docs/methodology/REGISTRY.md:2300, TODO.md:81, TODO.md:84, benchmarks/data/clubsandwich_cr2_golden.json:477. Concrete fix: none for approval.
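
The P1 above comes down to when the legacy alias is resolved. A minimal sketch, assuming a toy stand-in class rather than the real LinearRegression (`ToyRegression` and `resolve_fit_vcov_type` are hypothetical names, and the error text only mirrors the message quoted in this review): resolving the remap inside fit(), once the effective cluster context is known, lets the constructor-time and fit-time entry points reach the same variance family.

```python
from typing import Optional, Sequence


def resolve_fit_vcov_type(
    vcov_type_arg: Optional[str],
    robust: bool,
    has_clusters: bool,
) -> str:
    """Hypothetical resolver: pick the variance family for a single fit.

    vcov_type_arg is the raw user input (None when only `robust` was given).
    """
    if vcov_type_arg is not None:
        # Explicit request: honour it, but classical SEs cannot be clustered.
        if vcov_type_arg == "classical" and has_clusters:
            raise ValueError("classical SEs are one-way only")
        return vcov_type_arg
    if robust:
        return "hc1"
    # Implicit robust=False: legacy behaviour keeps CR1 when clusters exist.
    return "hc1" if has_clusters else "classical"


class ToyRegression:
    """Stand-in for an estimator that accepts clusters at __init__ or fit()."""

    def __init__(self, robust=True, vcov_type=None, cluster_ids=None):
        self.robust = robust
        self.vcov_type_arg = vcov_type
        self.cluster_ids = cluster_ids

    def fit(self, cluster_ids: Optional[Sequence[int]] = None) -> str:
        # Fit-time clusters take precedence over constructor-time clusters.
        effective = cluster_ids if cluster_ids is not None else self.cluster_ids
        # The remap happens here, after the effective cluster context is known.
        return resolve_fit_vcov_type(
            self.vcov_type_arg, self.robust, has_clusters=effective is not None
        )


ctor_path = ToyRegression(robust=False, cluster_ids=[0, 0, 1, 1]).fit()
fit_path = ToyRegression(robust=False).fit(cluster_ids=[0, 0, 1, 1])
unclustered = ToyRegression(robust=False).fit()
```

In this sketch both clustered paths resolve to "hc1", the unclustered fit stays "classical", and only an explicit vcov_type="classical" combined with clusters raises.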

Code Quality

No additional unmitigated findings beyond the incomplete LinearRegression.fit(..., cluster_ids=...) propagation above.

Performance

No unmitigated findings. The O(n^2 k) CR2 Bell-McCaffrey DOF cost is already tracked in TODO.md:95. Concrete fix: none for approval.

Maintainability

No additional unmitigated findings.

Tech Debt

Severity: P3. Impact: the PR correctly tracks deferred follow-ups for weighted BM, absorbed-FE HC2/BM, the remaining standalone estimators that still do not expose vcov_type, and regeneration of the authoritative clubSandwich golden. Refs: TODO.md:80, TODO.md:81, TODO.md:83, TODO.md:84. Concrete fix: none for approval.

Security

No findings.

Documentation/Tests

Severity: P3. Impact: the new test file covers constructor-time cluster remapping, but it does not exercise the remaining broken public entry point, LinearRegression(robust=False).fit(..., cluster_ids=...), so the incomplete propagation escaped review. Refs: tests/test_estimators_vcov_type.py:1, diff_diff/linalg.py:2291. Concrete fix: add one direct regression test for fit-time cluster_ids override, not just constructor-time cluster_ids.

Path to Approval

Move the implicit robust=False remap for LinearRegression from constructor-time only to fit-time resolution based on effective_cluster_ids, so LinearRegression(robust=False).fit(..., cluster_ids=...) preserves CR1/hc1 just like the estimator classes now do.
Add a regression test that instantiates LinearRegression(robust=False) without constructor-time clusters, passes cluster_ids= to fit(), and asserts finite clustered inference instead of a "classical SEs are one-way only" failure.

Addresses CI AI review P1 on PR #327 head 3c4a393: the previous commit only remapped `robust=False + cluster_ids=...` → `"hc1"` at `LinearRegression.__init__`, so the documented `LinearRegression(robust=False).fit(..., cluster_ids=...)` override path still fell into `classical + cluster_ids` validation and errored.

Fix: track `_vcov_type_explicit` at `__init__`; relocate the remap to `fit()`, where we already compute `effective_cluster_ids` (the union of constructor-time and fit-time cluster context). Both entry points now preserve CR1 behavior identically. Users who want non-robust SEs can still pass `vcov_type="classical"` explicitly (and no cluster).

Tests: add `test_linear_regression_robust_false_fit_time_cluster_preserves_cr1` for the fit-time override path. Existing constructor-time test retained.

All 143 Phase 1a tests pass; 313 tests in estimators / survey / methodology TWFE regression pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-19T12:26:19Z

/ai-review

github-actions · 2026-04-19T12:33:41Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 254660df7a413d9adae8460588f443507ab9390a

Overall Assessment

⚠️ Needs changes

Executive Summary

The previous blocker around LinearRegression.fit(..., cluster_ids=...) looks addressed on the direct first-fit path.
The methodology-side deviations I checked are documented in the Registry/TODO and are non-blocking: absorbed-FE HC2/Bell-McCaffrey rejection, MultiPeriodDiD(cluster=..., vcov_type="hc2_bm") rejection, weighted Bell-McCaffrey NotImplementedError, and the self-referential CR2 golden file.
One new P1 remains: the PR does not preserve the distinction between the user-configured variance family and the fit-time effective variance family.
That shows up in two ways: get_params() serializes alias-derived "classical" as if it were explicit, and LinearRegression.fit() remaps by mutating instance state. Both can select the wrong SE family after clone/reuse.
The added tests cover first-fit remap behavior, but not clone/re-fit state preservation.

Methodology

Severity: P1. Impact: the new backward-compatibility logic for robust/vcov_type is not stable outside the first fit of the original object. get_params() in diff_diff/estimators.py:L776 returns the resolved self.vcov_type, so an estimator created with implicit robust=False state will clone as explicit vcov_type="classical" and lose the clustered remap on the next fit. Separately, LinearRegression.fit() in diff_diff/linalg.py:L2291 applies the legacy clustered remap by changing self.vcov_type on the instance, so a later unclustered fit can silently use HC1 instead of classical SEs. That is a wrong-variance-family bug, not just a cosmetic state mismatch. Concrete fix: keep raw user input and fit-time effective vcov_type separate; have get_params() return the raw constructor value (None when alias-derived); never mutate self.vcov_type inside fit().
Severity: P3. Impact: the remaining methodology limitations I checked are explicitly documented and tracked: one-way/cluster weighted Bell-McCaffrey not implemented, HC2/CR2-BM rejected on absorbed-FE fits, MultiPeriodDiD(cluster=..., vcov_type="hc2_bm") rejected pending contrast-aware CR2 DOF, and the committed CR2 JSON still being a self-reference anchor. See docs/methodology/REGISTRY.md:L164, docs/methodology/REGISTRY.md:L2293, and TODO.md:L80. Concrete fix: none for approval.
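
To make the clone failure mode concrete, here is a minimal sketch with a toy estimator. All names are hypothetical stand-ins, and `clone` is a hand-rolled rebuild from get_params(), not scikit-learn's function. It contrasts a get_params() that returns the raw constructor input with one that returns the resolved value.

```python
class ToyEstimator:
    """Toy estimator tracking raw vs resolved vcov_type (hypothetical)."""

    def __init__(self, robust=True, vcov_type=None, cluster=None):
        self.robust = robust
        self.cluster = cluster
        # Raw user input: None means "derived from the robust alias".
        self._vcov_type_arg = vcov_type
        # Resolved value, kept only for display purposes in this sketch.
        if vcov_type is not None:
            self.vcov_type = vcov_type
        else:
            self.vcov_type = "hc1" if robust else "classical"

    def get_params(self, raw=True):
        # raw=True round-trips the constructor input; raw=False reproduces
        # the behaviour flagged in the review (serializing the resolved value).
        vt = self._vcov_type_arg if raw else self.vcov_type
        return {"robust": self.robust, "vcov_type": vt, "cluster": self.cluster}

    def effective_vcov_type(self):
        if self._vcov_type_arg is not None:
            return self._vcov_type_arg  # explicit input always wins
        if not self.robust and self.cluster is not None:
            return "hc1"  # legacy remap: implicit classical + cluster -> CR1
        return self.vcov_type


def clone(est, raw=True):
    """Rebuild an estimator from its parameters, sklearn-style."""
    return ToyEstimator(**est.get_params(raw=raw))


orig = ToyEstimator(robust=False, cluster="unit")
good_clone = clone(orig, raw=True)   # vcov_type stays None
bad_clone = clone(orig, raw=False)   # vcov_type becomes explicit "classical"
```

Here `orig` and `good_clone` both resolve to "hc1" at fit time, while `bad_clone` treats "classical" as an explicit request and loses the remap.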

Code Quality

No additional unmitigated findings beyond the vcov-state bug above.

Performance

No unmitigated findings. The expensive CR2 BM DOF loop and repeated bread solves are already tracked in TODO.md:L93. Concrete fix: none for approval.

Maintainability

No additional unmitigated findings beyond the configured-vs-effective vcov_type state bug above.

Tech Debt

Severity: P3. Impact: deferred follow-ups are properly tracked for standalone-estimator vcov_type threading, weighted Bell-McCaffrey/CR2 support, absorbed-FE HC2/BM parity work, and regeneration of the authoritative clubSandwich golden file in TODO.md:L80. Concrete fix: none for approval.

Security

No findings.

Documentation/Tests

Severity: P3. Impact: the new tests cover the direct fit-time cluster override, but they do not pin the two state-preservation paths that now matter: get_params()/clone round-trips and repeated LinearRegression.fit() calls that alternate clustered and unclustered usage. That gap is why the P1 above can slip through despite the added suite. See tests/test_estimators_vcov_type.py:L88 and tests/test_estimators_vcov_type.py:L194. Concrete fix: add one clone round-trip test for DifferenceInDifferences(robust=False, cluster="unit"), and one repeat-fit test for LinearRegression(robust=False) that clusters once and then refits without clusters.

Path to Approval

Preserve configured vs effective vcov_type state separately. Do not mutate self.vcov_type during fit(); compute a local _fit_vcov_type and, if needed, expose the actually-used family on a separate fitted/results attribute.
Make get_params() round-trip the raw constructor input for vcov_type (None when alias-derived) so cloned estimators keep the same backward-compat remap behavior as the original object.
Add regression tests for sklearn-style clone/get_params() round-trips and for repeated LinearRegression.fit() calls that switch between clustered and unclustered fits.

Addresses the pattern underlying repeated CI review P1s on PR #327: `fit()` was mutating configuration state (`self.vcov_type`, `self.weights`, `self.weight_type`) to apply per-fit remaps (legacy alias, survey canonicalization), which silently contaminated subsequent fits and broke sklearn-style clone round-trips.

This commit establishes a single invariant across the whole inference surface: fit() is idempotent on configuration. It computes all effective fit-time values as locals, stores them on fitted attributes (`_` suffix), and never mutates the user-configured state.

LinearRegression changes:
- `__init__` stores raw constructor `vcov_type` on `self._vcov_type_arg` alongside the resolved `self.vcov_type` and the existing `_vcov_type_explicit` flag.
- `fit()` resolves `_fit_vcov_type`, `_fit_weights`, `_fit_weight_type` as locals at the top, based on:
  * effective cluster context (constructor OR fit-time override)
  * survey design canonicalization
  * legacy robust=False + cluster -> CR1 remap
  The configured fields on `self` are never written during fit.
- The effective fit-time values are stored on fitted attributes `self._fit_vcov_type_`, `self._fit_weights_`, `self._fit_weight_type_` for downstream helpers (compute_deff). `compute_deff` now reads from those attrs (fallback to configured state for backward compat).
- All ~15 read sites inside `fit()` switched from `self.X` to the corresponding `_fit_X` local.

DifferenceInDifferences (and inherited classes) changes:
- `__init__` stores `self._vcov_type_arg` (raw, possibly None).
- `get_params()` returns the raw arg so sklearn clones preserve the implicit-vs-explicit distinction (and therefore the backward-compat remap).
- `set_params()` updates `_vcov_type_arg` and `_vcov_type_explicit` consistently: explicit `vcov_type=X` sets both; `robust=` alone clears to None / False.
- The existing `_resolve_effective_vcov_type(effective_cluster_ids)` already returned a local; confirmed no site mutates self post-init.

Tests:
- `test_get_params_round_trip_preserves_implicit_classical`: clone round-trip of `DifferenceInDifferences(robust=False, cluster="unit")`. Both orig and clone remap to CR1 at fit time (pinning that get_params returns None for alias path).
- `test_get_params_round_trip_preserves_explicit_vcov_type`: round-trip for explicitly-set vcov_type.
- `test_linear_regression_repeat_fit_clustered_then_unclustered`: repeat-fit idempotence: first fit with cluster remaps to hc1, second fit without cluster uses classical (not stale hc1 from prior fit).
- Existing LinearRegression tests updated to assert `_fit_vcov_type_` (the fitted attr) is the remapped value, and `self.vcov_type` (configured) stays unchanged.
- Survey test updated to assert `_fit_weights_` (fitted) is populated while `self.weights` (configured) stays at user's None.
- `test_get_params_default_vcov_type` updated: default construction returns None for raw vcov_type, resolved is hc1.

Why this sets up Phase 1b+: Future additions (bandwidth selector, HeterogeneousAdoptionDiD class, vcov_type threading on the 8 standalone estimators, weighted BM DOF rework) all hit the same configured-vs-effective shape. The single invariant above is the place to hang them: each new remap becomes a local variable in fit(), never a write to self.

All 145 Phase 1a tests pass; 459 tests across estimators / survey / methodology / Phase 1a neighbours pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
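
The "fit() is idempotent on configuration" invariant from the commit message above can be illustrated with a small sketch. `ToyFit` and `MutatingFit` are hypothetical toy classes, not the library's code; only the `_fit_vcov_type_` attribute name is borrowed from the commit message. The first class keeps the per-fit remap in a local and a fitted attribute; the second writes it back to configuration and shows the stale-state symptom.

```python
class ToyFit:
    """Keeps configured state and fit-time effective state separate."""

    def __init__(self, robust=True, vcov_type=None):
        self.robust = robust
        self._vcov_type_arg = vcov_type  # raw input, None when alias-derived
        self.vcov_type = vcov_type or ("hc1" if robust else "classical")

    def fit(self, cluster_ids=None):
        # Effective family is computed as a local for this fit only.
        fit_vcov_type = self.vcov_type
        implicit_classical = self._vcov_type_arg is None and not self.robust
        if implicit_classical and cluster_ids is not None:
            fit_vcov_type = "hc1"  # legacy remap to CR1
        # Stored on a fitted attribute (trailing underscore), never on config.
        self._fit_vcov_type_ = fit_vcov_type
        return self


class MutatingFit(ToyFit):
    """Anti-pattern: the remap is written back to configuration."""

    def fit(self, cluster_ids=None):
        if not self.robust and cluster_ids is not None:
            self.vcov_type = "hc1"  # contaminates every later fit
        self._fit_vcov_type_ = self.vcov_type
        return self


clusters = [0, 0, 1, 1]

model = ToyFit(robust=False)
first = model.fit(cluster_ids=clusters)._fit_vcov_type_
configured_after_first = model.vcov_type
second = model.fit()._fit_vcov_type_

stale_model = MutatingFit(robust=False)
stale_model.fit(cluster_ids=clusters)
stale_second = stale_model.fit()._fit_vcov_type_
```

With the local-variable version the second, unclustered fit returns to "classical"; the mutating version silently keeps "hc1" from the earlier clustered fit.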

# Conflicts: # diff_diff/results.py

igerber · 2026-04-19T12:50:22Z

/ai-review

github-actions · 2026-04-19T13:01:32Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 697571c4e1fb1012e4d42d486b7df7eb23b02f9d

Overall Assessment

✅ Looks good

Executive Summary

The prior P1 around configured vs effective vcov_type state appears resolved. The estimator layer now preserves the raw constructor value for clone round-trips, computes the fit-time effective family locally, and avoids mutating configured state during fit. See diff_diff/estimators.py:L157-L165, diff_diff/estimators.py:L417-L433, diff_diff/estimators.py:L793-L883, and diff_diff/linalg.py:L2343-L2480.
The re-review regression tests now directly cover the previously problematic clone/re-fit paths and the implicit-vs-explicit "classical" distinction. See tests/test_estimators_vcov_type.py:L202-L380.
The remaining methodology deviations I checked are explicitly documented and tracked, so they are non-blocking under the review policy: absorbed-FE HC2/Bell-McCaffrey rejection, MultiPeriodDiD(cluster=..., vcov_type="hc2_bm") rejection, weighted Bell-McCaffrey NotImplementedError, and the self-reference CR2 golden file. See docs/methodology/REGISTRY.md:L2296-L2300 and TODO.md:L80-L84.
I did not find a new unmitigated P0/P1 in the changed methodology, inference, or parameter-threading code.
I could not execute the test suite here because pytest is not installed in this environment; this assessment is based on static review of the diff and the added tests.

Methodology

Severity: P3. Impact: Previous blocker resolved. The raw/effective variance-family split now preserves backward-compatible behavior across clone and repeat-fit flows instead of silently drifting the estimator into the wrong SE family. Concrete fix: none. See diff_diff/estimators.py:L157-L165, diff_diff/estimators.py:L886-L923, diff_diff/linalg.py:L2351-L2369, and tests/test_estimators_vcov_type.py:L319-L380.
Severity: P3. Impact: The remaining methodology limitations are documented and tracked, so they are not blockers: absorbed-FE HC2/BM rejection, MultiPeriodDiD(cluster=..., vcov_type="hc2_bm") rejection, weighted one-way/clustered Bell-McCaffrey NotImplementedError, and the non-authoritative CR2 golden JSON. Concrete fix: none for approval; follow the tracked TODOs. See diff_diff/twfe.py:L117-L144, diff_diff/estimators.py:L1295-L1308, diff_diff/estimators.py:L1418-L1437, diff_diff/linalg.py:L989-L1018, docs/methodology/REGISTRY.md:L164-L173, and TODO.md:L80-L84.

Code Quality

No findings.

Performance

Severity: P3. Impact: The expensive parts of the new HC2/CR2-Bell-McCaffrey implementation remain tracked, not unmitigated regressions: repeated bread inversion, lack of Rust HC2 support, and the naive O(n^2 k) CR2 DOF loop. Concrete fix: none for approval. See TODO.md:L93-L95.

Maintainability

No findings.

Tech Debt

Severity: P3. Impact: Remaining rollout gaps are explicitly tracked in TODO.md, especially threading vcov_type through the standalone estimators outside the DiD inheritance chain and regenerating the authoritative R CR2 golden file. Concrete fix: none for approval. See TODO.md:L80-L84.

Security

No findings.

Documentation/Tests

Severity: P3. Impact: The added tests cover the re-review-specific regression cases that mattered here: clone round-trips, repeat-fit state preservation, explicit-vs-implicit "classical" handling, and summary labeling. Concrete fix: none. See tests/test_estimators_vcov_type.py:L202-L380 and tests/test_linalg_hc2_bm.py:L419-L570.
Residual risk: I could not run the suite locally because pytest is unavailable in this environment.

CI on PR #327 failed on `test_hc1_cluster_unchanged` across macOS py3.11 and Linux-arm py3.11/3.13.

Root cause: the test asserted `assert_array_equal` on two `compute_robust_vcov` call paths that reach the same math but accumulate sub-machine-epsilon ordering differences (5e-18 on macOS, 1.2e-17 on Linux arm), likely BLAS reduction ordering depending on which validator branch runs first. Both failures showed `Max absolute difference among violations: ~1e-17`, well below float64 machine epsilon (~2e-16).

Fix: switch both tests in `TestHC1Unchanged` to `np.testing.assert_allclose(..., atol=1e-14, rtol=1e-14)`. The tolerance sits about three orders of magnitude above the observed ~1e-17 differences and is still only roughly 45 times float64 machine epsilon, so the test still catches any real regression in HC1/CR1 semantics while tolerating NumPy BLAS reduction-order non-determinism across platforms.

Applies to:
- TestHC1Unchanged.test_default_path_unchanged (one-way HC1)
- TestHC1Unchanged.test_hc1_cluster_unchanged (CR1 cluster-robust)

Both tests pass locally in the combined suite (previously flaky on cross-test ordering, which is the same symptom as the CI failure).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
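
The reduction-order effect described in this commit message can be reproduced without NumPy. The sketch below is illustrative only: `allclose` is a hypothetical pure-Python helper that applies the same criterion documented for `np.testing.assert_allclose`, namely |actual - desired| <= atol + rtol * |desired|, and the three-term sums stand in for a BLAS reduction performed in two different orders.

```python
def allclose(actual, desired, rtol=1e-14, atol=1e-14):
    """Scalar version of the assert_allclose tolerance criterion."""
    return abs(actual - desired) <= atol + rtol * abs(desired)


# The same three terms summed in two different orders.
left = (0.1 + 0.2) + 0.3
right = (0.3 + 0.2) + 0.1

exact_equal = left == right          # bitwise comparison, as assert_array_equal does
gap = abs(left - right)              # size of the ordering artefact
tolerant_equal = allclose(left, right)

# A genuine change in the result is still rejected at this tolerance.
real_regression_passes = allclose(1.0 + 1e-10, 1.0)
```

Exact equality fails on a gap of about one unit in the last place, the tolerance-based check passes, and a 1e-10 discrepancy is still flagged.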

…-failures audit

Packages 161 commits across 18 PRs since v3.1.3 as minor release 3.2.0. Per project SemVer convention, minor bumps are reserved for new estimators or new module-level public API; BusinessReport / DiagnosticReport / DiagnosticReportResults (PR #318) add a new public API surface and drive this bump.

Headline work:
- PR #318 BusinessReport + DiagnosticReport (experimental preview) - practitioner-ready output layer. Plain-English narrative summaries across all 16 result types, with AI-legible to_dict() schemas. See docs/methodology/REPORTING.md.
- PR #327, #335 did-no-untreated foundation - kernel infrastructure, local linear regression, HC2/Bell-McCaffrey variance, nprobust port. Foundation for the upcoming HeterogeneousAdoptionDiD estimator.
- PR #323, #329, #332 dCDH survey completion - cell-period IF allocator (Class A contract), heterogeneity + within-group-varying PSU under Binder TSL, and PSU-level Hall-Mammen wild bootstrap at cell granularity.
- PR #333 performance review - docs/performance-scenarios.md documents 5-7 realistic practitioner workflows; benchmark harness extended.

Silent-failures audit closeouts (PRs #324, #326, #328, #331, #334, #337, #339) continue the reliability work started in v3.1.2-3.1.3 across axes A/C/E/G/J.

CI infrastructure: PRs #330 and #336 exclude wall-clock timing tests from default CI after false-positive flakes; perf-review harness is the principled replacement.

Version strings bumped in diff_diff/__init__.py, pyproject.toml, rust/Cargo.toml, diff_diff/guides/llms-full.txt, and CITATION.cff (version: 3.2.0, date-released: 2026-04-19). CHANGELOG populated with Added / Changed / Fixed sections and the comparison-link footer. CITATION.cff retains v3.1.3 versioned DOI in identifiers; the v3.2.0 versioned DOI will be minted by Zenodo on GitHub Release and added in a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber and others added 3 commits April 18, 2026 20:35

igerber and others added 2 commits April 19, 2026 08:47

Merge remote-tracking branch 'origin/main' into did-no-untreated

697571c

igerber added the ready-for-ci label (Triggers CI test workflows) Apr 19, 2026

igerber merged commit d9aaf86 into main Apr 19, 2026
18 of 19 checks passed

igerber deleted the did-no-untreated branch April 19, 2026 14:42

igerber mentioned this pull request Apr 20, 2026

Release 3.2.0: BusinessReport preview, dCDH survey completion, silent-failures audit #342

Merged
