Lift hc2_bm + weights gates via clubSandwich WLS-CR2 port #475
`vcov_type="hc2_bm" + weights` (both one-way and cluster-robust) is now supported, matching `clubSandwich::vcovCR(..., type="CR2") + coef_test(test="Satterthwaite")$df_Satt` and `Wald_test(test="HTZ")$df_denom` at atol=1e-10 on six new weighted scenarios in clubsandwich_cr2_golden.json.
Immediate UX benefit: DifferenceInDifferences, MultiPeriodDiD, and TwoWayFixedEffects now accept `vcov_type="hc2_bm" + survey_design=SurveyDesign(weights=...)` for analytical weights. Closes TODO.md rows 104-105 (open weighted-CR2 gates).
Algorithm note: the diff-diff form matches clubSandwich's specific algebra (W not sqrt(W) in hat matrix, W² in bias term, unweighted residuals in score), NOT a textbook Pustejovsky-Tipton (2018) §3.3 transform-once derivation - the two diverge by 0.5-30% on weighted designs per feedback_wls_cr2_clubsandwich_parity. Satterthwaite DOF uses the full H1/H2/H3 array construction (clubSandwich get_arrays.R::get_GH), not the simpler (tr B)²/tr(B²) form (which is exact unweighted but diverges from clubSandwich on weighted designs by ~6%).
Step 0 R smoke test validated the algorithm at atol=1e-15 before source edits per feedback_r_source_smoke_test_before_implementing.
Unweighted CR2-BM is bit-equal to prior at atol=1e-14 (regression-safe via TestUnweightedRegressionStillBitEqual + TestDOFFormulaDualPathEquivalence asserting the simple and P_array DOF formulas agree at the unweighted limit).
clubSandwich version pin: >= 0.7.0.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
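The "simpler (tr B)²/tr(B²) form" mentioned above can be sketched in a few lines. This is only an illustration of that ratio for a generic symmetric matrix; the function name `satterthwaite_dof_simple` is hypothetical and is not the project's helper, and how B is built from the design is not shown here.

```python
import numpy as np


def satterthwaite_dof_simple(B: np.ndarray) -> float:
    """Return (tr B)^2 / tr(B^2) for a symmetric matrix B (illustrative only)."""
    B = np.asarray(B, dtype=float)
    trace_b = np.trace(B)
    # For symmetric B, tr(B @ B) equals the sum of squared entries.
    trace_b2 = np.sum(B * B)
    return float(trace_b**2 / trace_b2)


# The ratio is scale-invariant: rescaling B leaves the DOF unchanged.
dof_identity = satterthwaite_dof_simple(np.eye(4))
dof_scaled = satterthwaite_dof_simple(3.0 * np.eye(4))
```

Because the ratio does not depend on the overall scale of B, only the relative structure of its entries matters, which is also why roundoff in near-zero entries can dominate the result.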
…usters, P2 docstrings)
P0 (weight_type contract gap): The clubSandwich WLS-CR2 port matches the
`pweight` (sampling-weight) convention only. The dispatcher now rejects
`vcov_type="hc2_bm" + weights + weight_type in {"aweight", "fweight"}`
with NotImplementedError pointing to `weight_type="pweight"` or
`vcov_type="hc1"` (CR1 supports all three weight types) as workarounds.
P1 (zero-total-weight clusters): `_compute_cr2_bm` and
`_compute_cr2_bm_contrast_dof` now drop zero-total-weight clusters before
the G>=2 check, raising ValueError when fewer than 2 effective clusters
remain. Mirrors the CR1 zero-cluster handling. Three-cluster fits where
one cluster has all-zero weight silently drop it (its scores contribute
zero anyway).
P2 (stale docstrings): Updated `_validate_vcov_args` and `solve_ols`
docstrings to reflect the lifted-gate + pweight-only scope. Removed the
contradictory "Not supported with weights" claim that survived the gate
lift.
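The P0 weight-type contract above can be pictured as a small guard. The sketch below is an assumption about the shape of such a check, not the dispatcher's actual code; `check_hc2_bm_weight_type` and `SUPPORTED_HC2_BM_WEIGHT_TYPES` are hypothetical names.

```python
from typing import Optional, Sequence

# Assumption: only the sampling-weight convention is supported for hc2_bm.
SUPPORTED_HC2_BM_WEIGHT_TYPES = {"pweight"}


def check_hc2_bm_weight_type(
    vcov_type: str,
    weights: Optional[Sequence[float]],
    weight_type: str = "pweight",
) -> None:
    """Raise NotImplementedError for weighted hc2_bm with a non-pweight type."""
    if vcov_type != "hc2_bm" or weights is None:
        return  # unweighted fits and other vcov types are not restricted here
    if weight_type not in SUPPORTED_HC2_BM_WEIGHT_TYPES:
        raise NotImplementedError(
            f'vcov_type="hc2_bm" with weight_type="{weight_type}" is not '
            'supported; use weight_type="pweight" or vcov_type="hc1".'
        )


def rejects(vcov_type: str, weight_type: str) -> bool:
    """Return True when the guard raises for a weighted fit."""
    try:
        check_hc2_bm_weight_type(vcov_type, [1.0, 2.0], weight_type)
    except NotImplementedError:
        return True
    return False
```

Naming both workarounds in the error message keeps the rejection actionable for callers who only need one of them.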
New tests in `tests/test_methodology_wls_cr2.py`:
- TestWLSCR2WeightTypeRejection: 4 tests (aweight/fweight rejections on
cluster and one-way paths; pweight smoke acceptance test).
- TestWLSCR2ZeroWeightClusterRejection: 3 tests (one-zero-cluster reject
on both `_compute_cr2_bm` and `_compute_cr2_bm_contrast_dof`; multi-
cluster with one zero-weight silent drop).
All 53 linalg+methodology tests pass; broader 336-test regression suite
across estimators / TWFE / MPD / SA / vcov_type also clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
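The P1 handling of zero-total-weight clusters in this commit could look roughly like the following. It is a minimal sketch assuming non-negative weights; `effective_cluster_mask` is a hypothetical helper, not the code inside `_compute_cr2_bm`.

```python
import numpy as np


def effective_cluster_mask(cluster_ids, weights, min_clusters: int = 2) -> np.ndarray:
    """Return a row mask that drops clusters whose total weight is zero.

    Raises ValueError when fewer than `min_clusters` clusters keep positive
    total weight. Assumes weights are non-negative.
    """
    cluster_ids = np.asarray(cluster_ids).ravel()
    weights = np.asarray(weights, dtype=float).ravel()
    labels, inverse = np.unique(cluster_ids, return_inverse=True)
    # Sum the weights within each cluster.
    totals = np.bincount(inverse, weights=weights, minlength=labels.size)
    keep_cluster = totals > 0
    if int(keep_cluster.sum()) < min_clusters:
        raise ValueError(
            f"need at least {min_clusters} clusters with positive total weight"
        )
    return keep_cluster[inverse]


ids = np.array([0, 0, 1, 1, 2, 2])
mask = effective_cluster_mask(ids, [1.0, 2.0, 0.0, 0.0, 3.0, 1.0])
```

In this example the middle cluster has total weight zero, so its two rows are masked out and two effective clusters remain.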
…-cluster + scope tightening)
R2 P0: LinearRegression.fit() previously skipped populating self._bm_dof when both effective_cluster_ids AND _fit_weights were present (the path the R1 clubSandwich port lifted), so get_inference() fell back to df = n - k and produced anti-conservative p-values / CIs on the weighted-cluster hc2_bm surface. The dispatcher already guards non-pweight weighted hc2_bm at the linalg validator level, so reaching the _bm_dof branch guarantees a finite Satterthwaite DOF. Drop the weighted-cluster skip and populate _bm_dof from compute_robust_vcov(..., return_dof=True) like the other hc2_bm paths.
R2 P2: new regression tests TestLinearRegressionWeightedClusterHC2BM (2 tests) verify LinearRegression._bm_dof matches compute_robust_vcov-level Satterthwaite DOF and that get_inference(index=i).df threads correctly per coefficient. Sanity check: cluster-driven DOF << n-k (catches future regressions where the fallback would otherwise re-emerge).
R2 P3: stale docstrings at solve_ols (linalg.py:1260) and LinearRegression class docstring (linalg.py:2852) updated to reflect the lifted hc2_bm + pweight surface and the documented aweight/fweight restriction.
R2 P3: CHANGELOG and REGISTRY entries reworded to scope the lift to the analytical surface (compute_robust_vcov / solve_ols / LinearRegression direct callers + analytical CR2 contrast DOF in MPD). Removed the incorrect claim that survey_design= callers benefit directly: survey designs route through the Taylor-series linearization (TSL) survey variance path, which takes precedence over the analytical CR2 sandwich (unchanged).
All 194 linalg/methodology regression tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e comments)
R3 was ✅ "Looks good" with only 2 P3 informational items; addressing both for cleanliness.
P3 #1 (dead-code thread, estimators.py:1893): The R1 patch threaded `weights=survey_weights` into `_compute_cr2_bm_contrast_dof` on the `not _use_survey_vcov` branch, but that branch only fires when `survey_design=` is unset, in which case `survey_weights` is always None (survey designs always route through the TSL `_use_survey_vcov=True` path). The threading was a no-op and made the surface look like it supported weighted MPD avg_att via survey_design, which it doesn't. Removed the kwarg and updated the comment to reflect the de facto contract on the analytical branch.
P3 #2 (R-script comment scope, generate_clubsandwich_golden.R): comments on `weighted_did_absorbed_fe` and `weighted_mpd_avg_att_dof` said the fixtures pin `DiD/MPD(survey_design=SurveyDesign(weights="w"))` paths. Reworded to say these are analytical-CR2 design-matrix parity fixtures on DiD/MPD-shaped designs (the public surface they actually pin is `compute_robust_vcov` / `solve_ols` / `LinearRegression` / the analytical CR2 contrast-DOF helper).
All 144 linalg + methodology + estimators-vcov-type tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R4 was ✅ "Looks good" with 2 more P3 informational items; addressing for final cleanliness.
P3 #1 (registry overstatement, REGISTRY.md:L2646): The Gate 4-5 lift entry still claimed coverage of "the analytical CR2 contrast DOF used by MultiPeriodDiD.fit() when survey_design= is NOT set and weights are passed via another mechanism", but MPD has no non-survey weighted public entry point. Reworded to scope the lift to compute_robust_vcov / solve_ols / LinearRegression direct callers, with a separate note that `_compute_cr2_bm_contrast_dof(weights=)` is helper-ready but not exercised by public MPD.
P3 #2 (docstring drift, _validate_vcov_args): The Raises block claimed `_validate_vcov_args` itself rejects non-pweight hc2_bm + weights, but the function has no weight_type parameter and the actual enforcement lives in `_compute_robust_vcov_numpy` (which has weight_type in scope). Narrowed the docstring to describe what `_validate_vcov_args` actually validates (conley + weights), with a pointer to where the pweight enforcement happens.
All 53 linalg + methodology tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… drift)
R5 was ✅ "Looks good" with 1 P3 informational item: the docstring of _compute_bm_dof_from_contrasts still described only the unweighted (tr B)^2 / tr(B^2) formula, but the function body now dispatches the weighted case to the clubSandwich singleton-cluster CR2 P_array form. Split the docstring into Unweighted and Weighted sections matching the two code paths. No code change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…asts (R6 P1)
R6 codex review surfaced a P1: the weighted CR2-BM per-coefficient Satterthwaite DOF disagreed with clubSandwich by 15-30% on `weighted_did_absorbed_fe`'s treated-unit dummies (unit2/3/4), even though vcov matched at machine precision.
Root cause (after instrumentation against R's get_GH and CR2 source): the contrast vectors for high-leverage FE-dummy coefficients project to near-zero on the design (e.g., unit2's dummy column has XW_g0 row 2 = exact zeros for the unit-1 cluster). The resulting per-cluster H_array slices and P_array entries land at the float64 noise floor (~1e-30 for typical matmul-product roundoff at ~1e-16 per entry). The DOF formula `(tr P)² / sum(P²)` is scale-invariant, but R and NumPy use different BLAS reduction orders, producing 1-bit-different roundoff that propagates into 30% DOF disagreement. This is not a fixable algebra bug; it is a fundamental FP precision limit for high-leverage contrasts.
Mitigation: detect the noise floor (per-contrast `max(|P|)` below `1e-10 ×` the largest contrast's `max(|P|)`) and return NaN with a `UserWarning`. This is an honest signal that the DOF cannot be reliably computed instead of silently shipping BLAS-implementation-dependent inference. The coefficient SEs remain valid; only the affected DOF (and any t-test or CI that depends on it) is suppressed. Documented as a precision limit in REGISTRY.md and CHANGELOG.md.
New regression test `TestWLSCR2FEDoFNoiseGuard` pins the NaN-guard behavior on the weighted_did_absorbed_fe scenario (unit2/3/4 expected NaN; all other 9 coefficients still match clubSandwich at atol=1e-10).
All 339 linalg + estimators + methodology tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
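The batch-relative mitigation described in this commit can be sketched as below. This is a simplified illustration assuming one square P matrix per contrast; `dof_with_relative_guard` is a hypothetical name, and the real `_cr2_bm_dof_inner_weighted` builds P from the H1/H2/H3 arrays, which is not reproduced here.

```python
import warnings

import numpy as np


def dof_with_relative_guard(p_arrays, rel_tol: float = 1e-10) -> np.ndarray:
    """Compute (tr P)^2 / sum(P^2) per contrast, NaN-guarding noise-floor cases."""
    peaks = np.array([np.max(np.abs(P)) for P in p_arrays])
    overall_peak = peaks.max()
    dofs = np.empty(len(p_arrays))
    for i, P in enumerate(p_arrays):
        if peaks[i] < rel_tol * overall_peak:
            # Entries this small are dominated by roundoff; do not report a DOF.
            warnings.warn(
                f"contrast {i}: P is at the float64 noise floor; DOF set to NaN",
                UserWarning,
            )
            dofs[i] = np.nan
        else:
            dofs[i] = np.trace(P) ** 2 / np.sum(P**2)
    return dofs


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    guarded = dof_with_relative_guard([np.eye(3), 1e-30 * np.eye(3)])
```

Note that a batch containing a single contrast can never trip this relative rule, since its own peak is also the batch peak.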
…R7 P0)
R7 codex flagged a P0: the noise-floor NaN-guard in _cr2_bm_dof_inner_weighted correctly returns NaN DOF, but LinearRegression.get_inference() converted non-finite _bm_dof to df=None, which safe_inference() then treated as normal-theory inference, producing huge t-stats, p≈0, and zero-width CIs for the guarded coefficients instead of suppression.
Fix: in get_inference(), when _bm_dof[index] is non-finite (NaN), return InferenceResult with NaN t_stat/p_value/conf_int and df=None directly, short-circuiting the normal-theory fallback. SE and coefficient remain valid (vcov matched at machine precision); only the affected coef's small-sample inference is suppressed.
New end-to-end regression test TestLinearRegressionFENanGuardEndToEnd fits the public LinearRegression(vcov_type="hc2_bm", weights=, cluster_ids=) on weighted_did_absorbed_fe and asserts: NaN inference for the 3 treated-unit dummies (the noise-floor cases) AND finite inference for the other 9 coefficients. This catches the exact failure mode R7 surfaced.
Also tightens CHANGELOG/REGISTRY wording (R7 P3): explicitly call out that "vcov + non-noise-floor DOF + compound-contrast DOF match clubSandwich"; high-leverage FE-dummy coefficients are suppressed to NaN.
All 339 linalg/estimators/methodology tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
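The short-circuit described in this fix can be illustrated with a small wrapper. This is a sketch under stated assumptions: `guarded_inference` and the dict-shaped result are hypothetical stand-ins for `get_inference()` and `InferenceResult`, and the finite-DOF path is passed in as a callable rather than implemented.

```python
import math
from typing import Callable, Dict, Optional


def guarded_inference(
    coef: float,
    se: float,
    dof: Optional[float],
    finite_path: Callable[[float, float, Optional[float]], Dict],
) -> Dict:
    """Suppress inference when the DOF is non-finite instead of falling back."""
    if dof is not None and not math.isfinite(dof):
        nan = float("nan")
        # Coefficient and SE stay valid; only DOF-dependent quantities are NaN.
        return {
            "coef": coef,
            "se": se,
            "t_stat": nan,
            "p_value": nan,
            "conf_int": (nan, nan),
            "df": None,
        }
    return finite_path(coef, se, dof)


def stub_finite_path(coef: float, se: float, dof: Optional[float]) -> Dict:
    """Placeholder for the usual t-based inference (p-value and CI omitted)."""
    return {"coef": coef, "se": se, "t_stat": coef / se, "df": dof}


suppressed = guarded_inference(1.0, 0.5, float("nan"), stub_finite_path)
regular = guarded_inference(1.0, 0.5, 7.0, stub_finite_path)
```

The key design point is that a NaN DOF is handled before any fallback logic sees it, so it cannot be reinterpreted as "no DOF available, use normal theory".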
Overall Assessment
✅ Looks good
Executive Summary
Methodology
Code Quality
No findings.
Performance
Maintainability
No separate findings beyond the duplicated CR2 work above.
Tech Debt
No findings.
Security
No findings.
Documentation/Tests
Assumption
CI codex on PR #475 (✅ verdict) flagged a real P2: the noise-floor NaN-guard in `_cr2_bm_dof_inner_weighted` was batch-relative only. For a single-contrast call to `_compute_cr2_bm_contrast_dof`, `max|P|_overall` equals the contrast's own max|P|, so the `1e-10 × max|P|_overall` rule could never classify it as degenerate. That left direct single-contrast weighted callers (e.g., MPD avg_att) unprotected: they could still emit BLAS-implementation-dependent finite DOF on noise-floor contrasts even though the registry/changelog said the helper was guarded.
Fix: union the batch-relative criterion with an absolute floor scaled to the bread matrix's magnitude: `(EPS × n × k × max(bread_inv_scale, 1))²`. This covers the worst-case dgemm accumulation roundoff floor for `H1/H2/H3 @ contrast` products. A single-contrast call now correctly fires the NaN-guard on a high-leverage FE-dummy contrast.
New regression tests in `tests/test_methodology_wls_cr2.py::TestWLSCR2SingleContrastNoiseFloor` (2 tests): single weighted FE-dummy contrast triggers NaN-guard + warning; single non-noise contrast still returns finite DOF matching clubSandwich at atol=1e-10.
CI codex P3 (perf): LinearRegression.fit() pays CR2 twice on the new weighted hc2_bm path (solve_ols + compute_robust_vcov). Added as a TODO follow-up row (PR #475 follow-up, Low priority).
All 198 linalg+methodology+estimators-vcov-type+TWFE+SA tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
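The union of the two criteria can be sketched as follows. The floor expression is taken from the commit text; which quantity is compared against it (here the contrast's `max|P|`) is an assumption, and `is_noise_floor` is a hypothetical name.

```python
import numpy as np

EPS = np.finfo(np.float64).eps


def is_noise_floor(
    peak: float,
    batch_peak: float,
    n: int,
    k: int,
    bread_inv_scale: float,
    rel_tol: float = 1e-10,
) -> bool:
    """Flag a contrast as degenerate under either criterion.

    peak        -- this contrast's max|P| (assumed comparison quantity)
    batch_peak  -- the largest max|P| across the batch
    """
    relative = peak < rel_tol * batch_peak
    # Absolute floor from the commit message: (EPS * n * k * max(scale, 1))^2.
    absolute_floor = (EPS * n * k * max(bread_inv_scale, 1.0)) ** 2
    absolute = peak < absolute_floor
    return bool(relative or absolute)


# Single-contrast call: the batch peak is the contrast's own peak, so only the
# absolute floor can fire.
single_noise = is_noise_floor(1e-30, 1e-30, n=100, k=5, bread_inv_scale=1.0)
single_regular = is_noise_floor(1.0, 1.0, n=100, k=5, bread_inv_scale=1.0)
```

Squaring the scaled epsilon reflects that the P entries are products of two roundoff-affected factors, so the floor sits near the square of the per-entry error.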
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
⛔ Blocker
Executive Summary
Methodology
Code Quality
No findings.
Performance
Maintainability
No findings.
Tech Debt
Security
No findings.
Documentation/Tests
Path to Approval
… P0 R2)
CI codex round 2 on PR #475 flagged a P0: weighted clustered CR2 wasn't subpopulation-invariant on mixed-zero clusters. The earlier "drop zero-total-weight clusters" guard handled all-zero clusters but missed mixed-zero clusters (positive total weight, some zero-weight rows inside). In those clusters, zero-weight rows still entered the CR2 adjustment matrices (H_gg, G_g, A_g, bias_term) on the row side, silently changing SE/DOF and contradicting the linalg contract that zero-weight rows are inert.
Fix: physically filter `weights > 0` rows before all per-cluster computations in both `_compute_cr2_bm` and `_compute_cr2_bm_contrast_dof`. The caller's `bread_matrix = X.T @ (X * w[:, None])` is invariant to zero-weight row removal (those rows contribute 0 to the sum), so no bread rebuild is needed. Effective-cluster filter still applies on the filtered view.
New regression tests `TestWLSCR2SubpopulationInvariance` (2 tests):
- `test_per_coefficient_dof_invariant_to_zero_weight_padding`: pin vcov + per-coefficient DOF at atol=1e-12 between (a) computing on the full design with zero-weight padding rows interleaved and (b) computing on the physically dropped positive-weight subset.
- `test_contrast_dof_invariant_to_zero_weight_padding`: same invariance for compound-contrast DOF.
Also addresses two P3 doc items from the same review:
- REGISTRY + CHANGELOG noise-floor description: now explicitly mentions both criteria (batch-relative AND absolute single-contrast safe) instead of only the batch-relative one.
- TODO.md Tier C self-contradiction: marked the WLS-CR2 line as LIFTED with a back-reference to closed rows 109-110, removing the contradiction between the rows table (lifted) and the prose Tier C list (still outstanding).
P3 performance (LinearRegression pays CR2 twice on weighted hc2_bm path) remains tracked in TODO.md row 111 from the previous round.
All 200 linalg+methodology+TWFE+SA+estimators-vcov-type tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
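The bread-invariance claim in this commit is easy to check numerically. The snippet below is a standalone demonstration on made-up data (the design `X` and weights `w` are illustrative, not from the test fixtures).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
# Weights with some zero-weight rows interleaved.
w = np.array([1.0, 0.0, 2.0, 0.5, 0.0, 1.5, 0.0, 3.0])

# Bread on the full design: zero-weight rows add nothing to the sum.
bread_full = X.T @ (X * w[:, None])

# Bread after physically dropping the zero-weight rows.
keep = w > 0
bread_filtered = X[keep].T @ (X[keep] * w[keep][:, None])

bread_matches = np.allclose(bread_full, bread_filtered, rtol=0.0, atol=1e-12)
```

The same argument does not automatically hold for per-cluster row-side quantities, which is why the fix filters rows before building them.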
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
✅ Looks good
Executive Summary
Methodology
No P0/P1 findings. The registry now documents the load-bearing methodology choices and precision-limit behavior in
Code Quality
No findings.
Performance
Maintainability
No findings.
Tech Debt
No additional findings beyond the tracked weighted-CR2 double-compute follow-up in
Security
No findings.
Documentation/Tests
No blocking findings. The new methodology harness covers parity, weight-type rejection, effective-cluster rejection, mixed-zero-row subpopulation invariance, single-contrast noise-floor guarding, and I could not execute the suite here because the sandbox lacks NumPy/pytest.
CI codex R3 on PR #475 (✅ verdict) flagged a single actionable P3 docs item: the Phase 1a registry row said "Weighted Bell-McCaffrey is supported" but didn't record the load-bearing `pweight`-only restriction. The code correctly rejects `aweight`/`fweight` at the linalg validator, but the registry overstated the supported surface. Fix: added a sentence to the Phase 1a row stating the clubSandwich WLS-CR2 port matches the `pweight` convention only; `aweight`/`fweight` remain intentionally unsupported (CR1 / `vcov_type="hc1"` still supports all three). CI codex R3 P3 perf (LinearRegression pays CR2 twice on weighted hc2_bm path) is already tracked in `TODO.md` row 111 from R2; no new action. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
✅ Looks good
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Summary
- `NotImplementedError` gates in `_validate_vcov_args` blocking `vcov_type="hc2_bm" + weights` (TODO.md rows 104-105, "Gates 4 and 5")
- `_compute_cr2_bm` / `_compute_cr2_bm_contrast_dof` / `_compute_bm_dof_from_contrasts` (W not √W in hat matrix, W² in bias-correction term, unweighted residuals in score, full H1/H2/H3 array Satterthwaite DOF)
- `LinearRegression._bm_dof` and `LinearRegression.get_inference()`
- `benchmarks/data/clubsandwich_cr2_golden.json` pin Python vs R parity at atol=1e-10 (vcov + non-noise-floor DOF + compound-contrast DOF); existing 6 unweighted scenarios unchanged
- `weight_type ∈ {"aweight", "fweight"}` + `hc2_bm + weights` raises `NotImplementedError` (port matches `pweight` only); zero-total-weight clusters dropped with effective-cluster ≥ 2 guard
Methodology references
- `clubSandwich` v0.7.0 (Pustejovsky 2024) R source: `R/CR-adjustments.R::CR2`, `R/clubSandwich.R::vcov_CR`, `R/coef_test.R::Satterthwaite_df`, `R/get_arrays.R::get_GH`. Foundational papers: Bell & McCaffrey (2002), Pustejovsky & Tipton (2018) JBES, Imbens & Kolesar (2016) ReStat.
- `feedback_wls_cr2_clubsandwich_parity`), the textbook reading diverges from clubSandwich by 0.5-30% on weighted designs. clubSandwich uses W (not √W) in the hat matrix, W² in the bias term, and unweighted residuals in the score construction. Documented in `docs/methodology/REGISTRY.md` Phase 1a section.
- `UserWarning` rather than ship BLAS-implementation-dependent values.
Validation
- `tests/test_methodology_wls_cr2.py` (19 tests: clubSandwich parity at atol=1e-10 across 6 weighted scenarios + compound-contrast DOF + unweighted regression safety + dual-path equivalence + weight-type rejection + zero-weight cluster rejection + LinearRegression `_bm_dof` threading + LinearRegression NaN inference end-to-end + FE-dummy noise-floor guard)
- `tests/test_linalg_hc2_bm.py` flipped two "gate raises NotImplementedError" tests to "gate lifted, produces finite vcov+DOF" smoke tests
- `feedback_r_source_smoke_test_before_implementing`)
- `linalg + methodology + estimators + TWFE + SA + estimators-vcov-type`
Security / privacy
Test plan
- `pytest tests/test_methodology_wls_cr2.py`
- `pytest tests/test_linalg_hc2_bm.py`
- `pytest tests/test_estimators_vcov_type.py tests/test_methodology_twfe.py tests/test_methodology_sun_abraham.py tests/test_estimators.py`
- `feedback_local_codex_vs_ci_codex_divergence`)
🤖 Generated with Claude Code