Skip to content

Lift hc2_bm + weights gates via clubSandwich WLS-CR2 port#475

Merged
igerber merged 11 commits into
mainfrom
wls-cr2-clubsandwich-port
May 21, 2026
Merged

Lift hc2_bm + weights gates via clubSandwich WLS-CR2 port#475
igerber merged 11 commits into
mainfrom
wls-cr2-clubsandwich-port

Conversation

@igerber
Copy link
Copy Markdown
Owner

@igerber igerber commented May 20, 2026

Summary

  • Lifts the two NotImplementedError gates in _validate_vcov_args blocking vcov_type="hc2_bm" + weights (TODO.md rows 104-105, "Gates 4 and 5")
  • Ports clubSandwich's specific WLS-CR2 algebra into _compute_cr2_bm / _compute_cr2_bm_contrast_dof / _compute_bm_dof_from_contrasts (W not √W in hat matrix, W² in bias-correction term, unweighted residuals in score, full H1/H2/H3 array Satterthwaite DOF)
  • Threads weighted Bell-McCaffrey DOF through LinearRegression._bm_dof and LinearRegression.get_inference()
  • Adds NaN-guard for FP-noise-floor high-leverage FE-dummy contrasts (where BLAS reduction order differences between NumPy and R produce 15-30% DOF disagreement despite vcov matching at machine precision)
  • 6 new weighted scenarios in benchmarks/data/clubsandwich_cr2_golden.json pin Python vs R parity at atol=1e-10 (vcov + non-noise-floor DOF + compound-contrast DOF); existing 6 unweighted scenarios unchanged
  • Public surface restrictions: weight_type ∈ {"aweight", "fweight"} + hc2_bm + weights raises NotImplementedError (port matches pweight only); zero-total-weight clusters dropped with effective-cluster ≥ 2 guard

Methodology references

  • Method: WLS-CR2 Bell-McCaffrey cluster-robust variance + Satterthwaite DOF
  • Source: clubSandwich v0.7.0 (Pustejovsky 2024) R source — R/CR-adjustments.R::CR2, R/clubSandwich.R::vcov_CR, R/coef_test.R::Satterthwaite_df, R/get_arrays.R::get_GH. Foundational papers: Bell & McCaffrey (2002), Pustejovsky & Tipton (2018) JBES, Imbens & Kolesar (2016) ReStat.
  • Intentional deviation from textbook: The diff-diff port matches clubSandwich's specific algebra, NOT a textbook PT2018 §3.3 transform-once derivation. Per prior empirical finding (feedback_wls_cr2_clubsandwich_parity), the textbook reading diverges from clubSandwich by 0.5-30% on weighted designs. clubSandwich uses W (not √W) in the hat matrix, W² in the bias term, and unweighted residuals in the score construction. Documented in docs/methodology/REGISTRY.md Phase 1a section.
  • Documented precision limit: high-leverage FE-dummy contrasts (e.g., unit dummies in a balanced DiD) have Satterthwaite DOF at the float64 noise floor; the helper returns NaN with a UserWarning rather than ship BLAS-implementation-dependent values.

Validation

  • Tests added: tests/test_methodology_wls_cr2.py (19 tests: clubSandwich parity at atol=1e-10 across 6 weighted scenarios + compound-contrast DOF + unweighted regression safety + dual-path equivalence + weight-type rejection + zero-weight cluster rejection + LinearRegression _bm_dof threading + LinearRegression NaN inference end-to-end + FE-dummy noise-floor guard)
  • Tests updated: tests/test_linalg_hc2_bm.py flipped two "gate raises NotImplementedError" tests to "gate lifted, produces finite vcov+DOF" smoke tests
  • Step 0 R/Python smoke parity verified at atol=1e-15 before source edits (per feedback_r_source_smoke_test_before_implementing)
  • Local codex review iterated R1-R8 to ✅ clean (8 rounds)
  • 340 tests pass across linalg + methodology + estimators + TWFE + SA + estimators-vcov-type

Security / privacy

  • Confirm no secrets/PII in this PR: Yes (canonical secret-pattern scan clean)

Test plan

  • Verify R-parity tests pass: pytest tests/test_methodology_wls_cr2.py
  • Verify unweighted regression safety: pytest tests/test_linalg_hc2_bm.py
  • Verify broader regression: pytest tests/test_estimators_vcov_type.py tests/test_methodology_twfe.py tests/test_methodology_sun_abraham.py tests/test_estimators.py
  • CI codex review (local convergence is necessary but not sufficient per feedback_local_codex_vs_ci_codex_divergence)

🤖 Generated with Claude Code

igerber and others added 8 commits May 20, 2026 14:23
`vcov_type="hc2_bm" + weights` (both one-way and cluster-robust) is now
supported, matching `clubSandwich::vcovCR(..., type="CR2") + coef_test(test=
"Satterthwaite")$df_Satt` and `Wald_test(test="HTZ")$df_denom` at atol=1e-10
on six new weighted scenarios in clubsandwich_cr2_golden.json.

Immediate UX benefit: DifferenceInDifferences, MultiPeriodDiD, and
TwoWayFixedEffects now accept `vcov_type="hc2_bm" + survey_design=
SurveyDesign(weights=...)` for analytical weights. Closes TODO.md rows
104-105 (open weighted-CR2 gates).

Algorithm note: the diff-diff form matches clubSandwich's specific algebra
(W not sqrt(W) in hat matrix, W² in bias term, unweighted residuals in
score), NOT a textbook Pustejovsky-Tipton (2018) §3.3 transform-once
derivation - the two diverge by 0.5-30% on weighted designs per
feedback_wls_cr2_clubsandwich_parity. Satterthwaite DOF uses the full
H1/H2/H3 array construction (clubSandwich get_arrays.R::get_GH), not the
simpler (tr B)²/tr(B²) form (which is exact unweighted but diverges from
clubSandwich on weighted designs by ~6%).

Step 0 R smoke test validated the algorithm at atol=1e-15 before source
edits per feedback_r_source_smoke_test_before_implementing.

Unweighted CR2-BM is bit-equal to prior at atol=1e-14 (regression-safe via
TestUnweightedRegressionStillBitEqual + TestDOFFormulaDualPathEquivalence
asserting the simple and P_array DOF formulas agree at the unweighted
limit).

clubSandwich version pin: >= 0.7.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…usters, P2 docstrings)

P0 (weight_type contract gap): The clubSandwich WLS-CR2 port matches the
`pweight` (sampling-weight) convention only. The dispatcher now rejects
`vcov_type="hc2_bm" + weights + weight_type in {"aweight", "fweight"}`
with NotImplementedError pointing to `weight_type="pweight"` or
`vcov_type="hc1"` (CR1 supports all three weight types) as workarounds.

P1 (zero-total-weight clusters): `_compute_cr2_bm` and
`_compute_cr2_bm_contrast_dof` now drop zero-total-weight clusters before
the G>=2 check, raising ValueError when fewer than 2 effective clusters
remain. Mirrors the CR1 zero-cluster handling. Three-cluster fits where
one cluster has all-zero weight silently drop it (its scores contribute
zero anyway).

P2 (stale docstrings): Updated `_validate_vcov_args` and `solve_ols`
docstrings to reflect the lifted-gate + pweight-only scope. Removed the
contradictory "Not supported with weights" claim that survived the gate
lift.

New tests in `tests/test_methodology_wls_cr2.py`:
- TestWLSCR2WeightTypeRejection: 4 tests (aweight/fweight rejections on
  cluster and one-way paths; pweight smoke acceptance test).
- TestWLSCR2ZeroWeightClusterRejection: 3 tests (one-zero-cluster reject
  on both `_compute_cr2_bm` and `_compute_cr2_bm_contrast_dof`; multi-
  cluster with one zero-weight silent drop).

All 53 linalg+methodology tests pass; broader 336-test regression suite
across estimators / TWFE / MPD / SA / vcov_type also clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-cluster + scope tightening)

R2 P0: LinearRegression.fit() previously skipped populating self._bm_dof when
both effective_cluster_ids AND _fit_weights were present (the path the R1
clubSandwich port lifted), so get_inference() fell back to df = n - k and
produced anti-conservative p-values / CIs on the weighted-cluster hc2_bm
surface. The dispatcher already guards non-pweight weighted hc2_bm at the
linalg validator level, so reaching the _bm_dof branch guarantees a finite
Satterthwaite DOF. Drop the weighted-cluster skip and populate _bm_dof from
compute_robust_vcov(..., return_dof=True) like the other hc2_bm paths.

R2 P2: new regression tests TestLinearRegressionWeightedClusterHC2BM (2 tests)
verify LinearRegression._bm_dof matches compute_robust_vcov-level Satterthwaite
DOF and that get_inference(index=i).df threads correctly per coefficient.
Sanity check: cluster-driven DOF << n-k (catches future regressions where the
fallback would otherwise re-emerge).

R2 P3: stale docstrings at solve_ols (linalg.py:1260) and LinearRegression class
docstring (linalg.py:2852) updated to reflect the lifted hc2_bm + pweight
surface and the documented aweight/fweight restriction.

R2 P3: CHANGELOG and REGISTRY entries reworded to scope the lift to the
analytical surface (compute_robust_vcov / solve_ols / LinearRegression direct
callers + analytical CR2 contrast DOF in MPD). Removed the incorrect claim
that survey_design= callers benefit directly — survey designs route through
the Taylor-series linearization (TSL) survey variance path, which takes
precedence over the analytical CR2 sandwich (unchanged).

All 194 linalg/methodology regression tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e comments)

R3 was ✅ "Looks good" with only 2 P3 informational items; addressing both
for cleanliness.

P3 #1 (dead-code thread, estimators.py:1893): The R1 patch threaded
`weights=survey_weights` into `_compute_cr2_bm_contrast_dof` on the
`not _use_survey_vcov` branch, but that branch only fires when
`survey_design=` is unset, in which case `survey_weights` is always None
(survey designs always route through the TSL `_use_survey_vcov=True` path).
The threading was a no-op and made the surface look like it supported
weighted MPD avg_att via survey_design — which it doesn't. Removed the
kwarg and updated the comment to reflect the de facto contract on the
analytical branch.

P3 #2 (R-script comment scope, generate_clubsandwich_golden.R): comments
on `weighted_did_absorbed_fe` and `weighted_mpd_avg_att_dof` said the
fixtures pin `DiD/MPD(survey_design=SurveyDesign(weights="w"))` paths.
Reworded to say these are analytical-CR2 design-matrix parity fixtures
on DiD/MPD-shaped designs (the public surface they actually pin is
`compute_robust_vcov` / `solve_ols` / `LinearRegression` / the analytical
CR2 contrast-DOF helper).

All 144 linalg + methodology + estimators-vcov-type tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R4 was ✅ "Looks good" with 2 more P3 informational items; addressing for
final cleanliness.

P3 #1 (registry overstatement, REGISTRY.md:L2646): The Gate 4-5 lift entry
still claimed coverage of "the analytical CR2 contrast DOF used by
MultiPeriodDiD.fit() when survey_design= is NOT set and weights are passed
via another mechanism" — but MPD has no non-survey weighted public entry
point. Reworded to scope the lift to compute_robust_vcov / solve_ols /
LinearRegression direct callers, with a separate note that
`_compute_cr2_bm_contrast_dof(weights=)` is helper-ready but not exercised
by public MPD.

P3 #2 (docstring drift, _validate_vcov_args): The Raises block claimed
`_validate_vcov_args` itself rejects non-pweight hc2_bm + weights, but
the function has no weight_type parameter and the actual enforcement lives
in `_compute_robust_vcov_numpy` (which has weight_type in scope). Narrowed
the docstring to describe what `_validate_vcov_args` actually validates
(conley + weights), with a pointer to where the pweight enforcement
happens.

All 53 linalg + methodology tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… drift)

R5 was ✅ "Looks good" with 1 P3 informational item: the docstring of
_compute_bm_dof_from_contrasts still described only the unweighted
(tr B)^2 / tr(B^2) formula, but the function body now dispatches the
weighted case to the clubSandwich singleton-cluster CR2 P_array form.
Split the docstring into Unweighted and Weighted sections matching the
two code paths.

No code change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…asts (R6 P1)

R6 codex review surfaced a P1: the weighted CR2-BM per-coefficient Satterthwaite
DOF disagreed with clubSandwich by 15-30% on `weighted_did_absorbed_fe`'s
treated-unit dummies (unit2/3/4), even though vcov matched at machine precision.

Root cause (after instrumentation against R's get_GH and CR2 source):
the contrast vectors for high-leverage FE-dummy coefficients project to
near-zero on the design (e.g., unit2's dummy column has XW_g0 row 2 = exact
zeros for the unit-1 cluster). The resulting per-cluster H_array slices and
P_array entries land at the float64 noise floor (~1e-30 for typical
matmul-product roundoff at ~1e-16 per entry). The DOF formula
`(tr P)² / sum(P²)` is scale-invariant, but R and NumPy use different BLAS
reduction orders, producing 1-bit-different roundoff that propagates into
30% DOF disagreement. Not a fixable algebra bug — fundamental FP precision
limit for high-leverage contrasts.

Mitigation: detect the noise floor (per-contrast `max(|P|)` below `1e-10 ×`
the largest contrast's `max(|P|)`) and return NaN with a `UserWarning`.
Honest signal that the DOF cannot be reliably computed instead of silently
shipping BLAS-implementation-dependent inference. The coefficient SEs remain
valid; only the affected DOF (and any t-test or CI that depends on it) is
suppressed.

Documented as a precision limit in REGISTRY.md and CHANGELOG.md. New
regression test `TestWLSCR2FEDoFNoiseGuard` pins the NaN-guard behavior on
the weighted_did_absorbed_fe scenario (unit2/3/4 expected NaN; all other
9 coefficients still match clubSandwich at atol=1e-10).

All 339 linalg + estimators + methodology tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…R7 P0)

R7 codex flagged a P0: the noise-floor NaN-guard in
_cr2_bm_dof_inner_weighted correctly returns NaN DOF, but
LinearRegression.get_inference() converted non-finite _bm_dof to df=None,
which safe_inference() then treated as normal-theory inference — producing
huge t-stats, p≈0, and zero-width CIs for the guarded coefficients
instead of suppression.

Fix: in get_inference(), when _bm_dof[index] is non-finite (NaN), return
InferenceResult with NaN t_stat/p_value/conf_int and df=None directly,
short-circuiting the normal-theory fallback. SE and coefficient remain
valid (vcov matched at machine precision); only the affected coef's
small-sample inference is suppressed.

New end-to-end regression test TestLinearRegressionFENanGuardEndToEnd
fits the public LinearRegression(vcov_type="hc2_bm", weights=, cluster_ids=)
on weighted_did_absorbed_fe and asserts: NaN inference for the 3 treated-
unit dummies (the noise-floor cases) AND finite inference for the other 9
coefficients. This catches the exact failure mode R7 surfaced.

Also tightens CHANGELOG/REGISTRY wording (R7 P3): explicitly call out that
"vcov + non-noise-floor DOF + compound-contrast DOF match clubSandwich";
high-leverage FE-dummy coefficients are suppressed to NaN.

All 339 linalg/estimators/methodology tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

Overall Assessment

✅ Looks good

Executive Summary

  • The core methodology is aligned with the Methodology Registry: this is a documented clubSandwich WLS-CR2 port, not a textbook PT2018 transform-once implementation, and that deviation is explicitly documented in docs/methodology/REGISTRY.md:L2651-L2656, so it is not a defect.
  • The main public-surface risks called out in the PR are covered in code and tests: pweight-only gating, zero-total-weight cluster handling, weighted one-way singleton-cluster reduction, and LinearRegression BM-DOF threading.
  • I found no unmitigated P0/P1 issues in the changed public surfaces (compute_robust_vcov, solve_ols, LinearRegression).
  • P2 The new noise-floor NaN guard is batch-relative, so weighted single-contrast calls to _compute_cr2_bm_contrast_dof can still emit BLAS-dependent finite DOF even though the Registry/CHANGELOG describe the helper as guarded. diff_diff/linalg.py:L1852-L1882, diff_diff/linalg.py:L1887-L2005, docs/methodology/REGISTRY.md:L2651-L2656, CHANGELOG.md:L13, tests/test_methodology_wls_cr2.py:L169-L215, tests/test_methodology_wls_cr2.py:L476-L533.
  • I could not run pytest here because the sandbox is missing the Python numerical stack.

Methodology

  • P2 Impact: _cr2_bm_dof_inner_weighted only suppresses a contrast when its max|P| is tiny relative to a larger contrast in the same batch. For a single weighted contrast, max|P|_overall is the contrast’s own scale, so the 1e-10 × max|P|_overall rule can never classify it as degenerate. That leaves direct single-contrast weighted calls to _compute_cr2_bm_contrast_dof able to return BLAS-dependent finite DOF on the exact noise-floor case the docs say should be suppressed. Current public estimator surfaces do not appear to rely on that path, so this is non-blocking. Concrete fix: add a single-contrast instability criterion (absolute/self-scaled, not batch-relative), or narrow the docs to say the NaN guard applies only to batched per-coefficient calls; add a regression test with a weighted FE-dummy single contrast. diff_diff/linalg.py:L1852-L1882, diff_diff/linalg.py:L1887-L2005, docs/methodology/REGISTRY.md:L2651-L2656
  • P3 Impact: the documented deviation from textbook PT2018 to clubSandwich-specific WLS algebra is clearly disclosed and matches the implementation, so I am treating it as an accepted implementation choice, not a defect. Concrete fix: none. diff_diff/linalg.py:L1495-L1654, docs/methodology/REGISTRY.md:L2651-L2656, CHANGELOG.md:L13

Code Quality

No findings.

Performance

  • P3 Impact: LinearRegression.fit() now pays for CR2 twice on the new weighted hc2_bm path: once inside solve_ols(..., return_vcov=True) and again via compute_robust_vcov(..., return_dof=True) to fill _bm_dof. Correctness is fine, but it adds a second full CR2 pass on the heaviest new path. Concrete fix: let solve_ols optionally return DOF from the same CR2 computation, or cache/share the CR2 precomputes. diff_diff/linalg.py:L3235-L3276

Maintainability

No separate findings beyond the duplicated CR2 work above.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

  • No documentation defect: the Registry, TODO lift, and changelog all match the implemented pweight-only contract and the clubSandwich-specific algebra. docs/methodology/REGISTRY.md:L2651-L2656, TODO.md:L104-L110, CHANGELOG.md:L13
  • Test coverage is strong for parity, weight-type rejection, zero-weight clusters, unweighted regression safety, and LinearRegression NaN propagation; the remaining gap is the single-contrast noise-floor case noted above. tests/test_methodology_wls_cr2.py:L73-L215, tests/test_methodology_wls_cr2.py:L218-L600
  • I did not execute the suite here because the sandbox lacks NumPy/pytest.

Assumption

  • I treated _compute_cr2_bm_contrast_dof as an internal helper. If it is considered supported API rather than internal implementation detail, I would raise the single-contrast noise-floor issue one severity level.

CI codex on PR #475 (✅ verdict) flagged a real P2: the noise-floor NaN-
guard in `_cr2_bm_dof_inner_weighted` was batch-relative only — for a
single-contrast call to `_compute_cr2_bm_contrast_dof`, `max|P|_overall`
equals the contrast's own max|P|, so the `1e-10 × max|P|_overall` rule
could never classify it as degenerate. That left direct single-contrast
weighted callers (e.g., MPD avg_att) unprotected: they could still emit
BLAS-implementation-dependent finite DOF on noise-floor contrasts even
though the registry/changelog said the helper was guarded.

Fix: union the batch-relative criterion with an absolute floor scaled to
the bread matrix's magnitude: `(EPS × n × k × max(bread_inv_scale, 1))²`.
This covers the worst-case dgemm accumulation roundoff floor for
`H1/H2/H3 @ contrast` products. A single-contrast call now correctly
fires the NaN-guard on a high-leverage FE-dummy contrast.

New regression tests in `tests/test_methodology_wls_cr2.py::
TestWLSCR2SingleContrastNoiseFloor` (2 tests): single weighted FE-dummy
contrast triggers NaN-guard + warning; single non-noise contrast still
returns finite DOF matching clubSandwich at atol=1e-10.

CI codex P3 (perf): LinearRegression.fit() pays CR2 twice on the new
weighted hc2_bm path (solve_ols + compute_robust_vcov). Added as a TODO
follow-up row (PR #475 follow-up, Low priority).

All 198 linalg+methodology+estimators-vcov-type+TWFE+SA tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 8f00e976d33dcd3600239bebcdfaf8ce5d72960b


Overall Assessment

⛔ Blocker

Executive Summary

  • The prior non-blocking single-contrast noise-floor concern appears resolved: _cr2_bm_dof_inner_weighted() now has an absolute single-contrast floor, and the new regression tests cover both the guarded and non-guarded single-contrast cases. diff_diff/linalg.py:L1852-L1903, tests/test_methodology_wls_cr2.py:L544-L623
  • P0 The new weighted clustered CR2 path is still wrong on supported pweight inputs when a positive-weight cluster contains some zero-weight padding rows. Those zero-weight rows still enter the CR2 adjustment blocks and can silently change SE/DOF instead of being inert. diff_diff/linalg.py:L1548-L1654, diff_diff/linalg.py:L1951-L2024, diff_diff/linalg.py:L2179-L2188
  • The documented methodology deviation from a textbook PT2018 transform-once reading is properly recorded in the Methodology Registry, so I do not treat that clubSandwich-specific algebra choice as a defect. docs/methodology/REGISTRY.md:L2651-L2656
  • Weight-type rejection (aweight/fweight) and LinearRegression._bm_dof threading look correctly implemented and covered on the changed public paths. diff_diff/linalg.py:L2263-L2317, diff_diff/linalg.py:L3272-L3315, tests/test_methodology_wls_cr2.py:L214-L298, tests/test_methodology_wls_cr2.py:L358-L457
  • I could not run pytest here because the sandbox does not have NumPy/pytest.

Methodology

  • Severity: P0. Impact: weighted clustered CR2 is not subpopulation-invariant on supported pweight inputs with mixed zero/nonzero weights inside a cluster. The new code drops only zero-total-weight clusters, but it still builds S_W, H_gg, G_g, and A_g on X_g for all rows in a positive-weight cluster. Because zero-weight rows remain in the row side of H_gg and in bias_term = X_g @ MUWTWUM @ X_g.T, excluded observations can still alter the adjustment matrix and therefore the reported vcov/DOF. That is silent wrong statistical output on an allowed input, and it contradicts the surrounding linalg contract that zero-weight rows “contribute nothing to the sandwich.” The new tests only cover all-zero clusters, not this mixed-zero case. Concrete fix: before weighted clustered CR2/contrast-DOF construction, filter to weights > 0 (globally or per cluster) so mixed-zero rows are fully removed from X, residuals, cluster_ids, and weights; then add an invariance test against the physically dropped positive-weight subset. Refs: diff_diff/linalg.py:L1548-L1654, diff_diff/linalg.py:L1951-L2024, diff_diff/linalg.py:L2179-L2188, tests/test_methodology_wls_cr2.py:L301-L355, docs/methodology/REGISTRY.md:L2651-L2656
  • Severity: P3 informational. Impact: the clubSandwich-specific WLS algebra (W, not sqrt(W); W^2 bias term; unweighted residual score) is documented in the Registry, so that deviation from a textbook reading is acceptable and not a defect. Concrete fix: none. Refs: docs/methodology/REGISTRY.md:L2651-L2656, diff_diff/linalg.py:L1504-L1527

Code Quality

No findings.

Performance

  • Severity: P3 informational, tracked in TODO.md. Impact: LinearRegression.fit() still pays for CR2 twice on the weighted hc2_bm path: once via solve_ols(..., return_vcov=True) and again via compute_robust_vcov(..., return_dof=True) to populate _bm_dof. Concrete fix: thread return_dof through solve_ols or cache/share the CR2 precomputes. Refs: diff_diff/linalg.py:L3251-L3295, TODO.md:L111-L111

Maintainability

No findings.

Tech Debt

  • Severity: P3. Impact: TODO.md now says rows 104-105 are lifted, but Tier C still separately lists “Weighted one-way Bell-McCaffrey + weighted CR2 Bell-McCaffrey” as outstanding, so the deferred-work inventory is self-contradictory. Concrete fix: remove or rewrite the stale Tier C bullet. Refs: TODO.md:L104-L111, TODO.md:L210-L210

Security

No findings.

Documentation/Tests

  • Severity: P3. Impact: the Registry and changelog still describe the noise-floor guard only as the batch-relative 1e-10 × max|P| rule, but the shipped code now also applies an absolute single-contrast floor to fix the previous review item. The docs therefore underdescribe when NaN suppression can fire on direct single-contrast calls. Concrete fix: update REGISTRY.md and CHANGELOG.md to mention the absolute single-contrast floor (or describe the full max(relative_floor, abs_noise_floor) rule). Refs: diff_diff/linalg.py:L1862-L1885, docs/methodology/REGISTRY.md:L2651-L2651, CHANGELOG.md:L13-L13, tests/test_methodology_wls_cr2.py:L544-L623
  • I did not execute the suite here because NumPy/pytest are unavailable in the sandbox.

Path to Approval

  1. Make weighted clustered CR2 ignore zero-weight rows inside otherwise positive-weight clusters, not just zero-total-weight clusters.
  2. Add regression tests that compare clustered vcov_type="hc2_bm" + weight_type="pweight" on mixed-zero clusters to the physically dropped positive-weight subset, covering both per-coefficient DOF and contrast DOF on at least one public entry point.

… P0 R2)

CI codex round 2 on PR #475 flagged a P0: weighted clustered CR2 wasn't
subpopulation-invariant on mixed-zero clusters. The earlier "drop zero-
total-weight clusters" guard handled all-zero clusters but missed mixed-
zero clusters (positive total weight, some zero-weight rows inside). In
those clusters, zero-weight rows still entered the CR2 adjustment matrices
(H_gg, G_g, A_g, bias_term) on the row side, silently changing SE/DOF —
contradicting the linalg contract that zero-weight rows are inert.

Fix: physically filter `weights > 0` rows before all per-cluster
computations in both `_compute_cr2_bm` and `_compute_cr2_bm_contrast_dof`.
The caller's `bread_matrix = X.T @ (X * w[:, None])` is invariant to
zero-weight row removal (those rows contribute 0 to the sum), so no bread
rebuild is needed. Effective-cluster filter still applies on the filtered
view.

New regression tests `TestWLSCR2SubpopulationInvariance` (2 tests):
- `test_per_coefficient_dof_invariant_to_zero_weight_padding`: pin
  vcov + per-coefficient DOF at atol=1e-12 between (a) computing on the
  full design with zero-weight padding rows interleaved and (b) computing
  on the physically dropped positive-weight subset.
- `test_contrast_dof_invariant_to_zero_weight_padding`: same invariance
  for compound-contrast DOF.

Also addresses two P3 doc items from the same review:
- REGISTRY + CHANGELOG noise-floor description: now explicitly mentions
  both criteria (batch-relative AND absolute single-contrast safe) instead
  of only the batch-relative one.
- TODO.md Tier C self-contradiction: marked the WLS-CR2 line as LIFTED
  with a back-reference to closed rows 109-110, removing the contradiction
  between the rows table (lifted) and the prose Tier C list (still
  outstanding).

P3 performance (LinearRegression pays CR2 twice on weighted hc2_bm path)
remains tracked in TODO.md row 111 from the previous round.

All 200 linalg+methodology+TWFE+SA+estimators-vcov-type tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: a2fa8016660437d5972b394a32ca4965f4bd6c88


Overall Assessment

✅ Looks good

Executive Summary

  • The prior blocker from the last review appears resolved: both weighted CR2 builders now physically drop weights == 0 rows before any cluster-block construction, and the new invariance tests cover the mixed-zero case.
  • The clubSandwich-specific WLS-CR2 algebra and the FE-dummy noise-floor NaN guard are documented in the Methodology Registry, so I do not treat those deviations from a textbook PT2018 reading as defects.
  • LinearRegression now threads weighted Bell-McCaffrey DOF into get_inference() and returns all-NaN inference when the noise-floor guard fires, which fixes the earlier fallback-to-n-k / normal-theory problem.
  • Remaining items are non-blocking: the Registry row should explicitly say the lift is pweight-only, and weighted hc2_bm still recomputes CR2 twice in LinearRegression.fit() (already tracked in TODO.md).
  • I could not run the test suite here because the sandbox lacks NumPy/pytest.

Methodology

No P0/P1 findings. The registry now documents the load-bearing methodology choices and precision-limit behavior in docs/methodology/REGISTRY.md:L2651-L2656, and the prior mixed-zero-weight blocker appears resolved by the new weights > 0 filtering in both weighted CR2 builders diff_diff/linalg.py:L1548-L1583 and diff_diff/linalg.py:L1969-L1993, with regression coverage in tests/test_methodology_wls_cr2.py:L536-L724.

  • Severity: P3. Impact: The new Phase 1a registry row says weighted hc2_bm + weights is supported on the analytical linalg surface, but unlike the in-code docs it does not record the load-bearing pweight-only restriction. That overstates the supported surface in the project’s methodology registry even though the code correctly rejects aweight and fweight. Refs: docs/methodology/REGISTRY.md:L2651-L2656, diff_diff/linalg.py:L1259-L1267. Concrete fix: Add one sentence to the Phase 1a registry row stating that the clubSandwich WLS-CR2 port matches weight_type="pweight" only and that aweight / fweight remain intentionally unsupported.

Code Quality

No findings.

Performance

  • Severity: P3. Impact: LinearRegression.fit() still pays for the weighted hc2_bm CR2 path twice: once via solve_ols(..., return_vcov=True) and again via compute_robust_vcov(..., return_dof=True) to populate _bm_dof. This is correct but redundant on the newly opened path. Refs: diff_diff/linalg.py:L3306-L3336, TODO.md:L111-L111. Concrete fix: Thread return_dof through solve_ols() or cache/share the weighted-CR2 precomputes (A_g, MUWTWUM) between the vcov and DOF calls.

Maintainability

No findings.

Tech Debt

No additional findings beyond the tracked weighted-CR2 double-compute follow-up in TODO.md:L111-L111.

Security

No findings.

Documentation/Tests

No blocking findings. The new methodology harness covers parity, weight-type rejection, effective-cluster rejection, mixed-zero-row subpopulation invariance, single-contrast noise-floor guarding, and LinearRegression end-to-end NaN inference propagation in tests/test_methodology_wls_cr2.py:L218-L808.

I could not execute the suite here because the sandbox lacks NumPy/pytest.

CI codex R3 on PR #475 (✅ verdict) flagged a single actionable P3 docs item:
the Phase 1a registry row said "Weighted Bell-McCaffrey is supported" but
didn't record the load-bearing `pweight`-only restriction. The code
correctly rejects `aweight`/`fweight` at the linalg validator, but the
registry overstated the supported surface.

Fix: added a sentence to the Phase 1a row stating the clubSandwich WLS-CR2
port matches the `pweight` convention only; `aweight`/`fweight` remain
intentionally unsupported (CR1 / `vcov_type="hc1"` still supports all
three).

CI codex R3 P3 perf (LinearRegression pays CR2 twice on weighted hc2_bm
path) is already tracked in `TODO.md` row 111 from R2; no new action.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 4fe217e8d2cbb1cf9e4b3347ff0025fd602a71a6


Overall Assessment

✅ Looks good

Executive Summary

  • The prior re-review’s remaining documentation nit is resolved: the Phase 1a registry row now explicitly scopes weighted hc2_bm to the analytical linalg surface and weight_type="pweight" only, matching the new code paths.
  • I did not find any unmitigated P0/P1 issues in the weighted CR2 lift. The two earlier correctness risks this PR family had been carrying are addressed in the changed code: zero-weight padded rows are dropped before CR2 block algebra, and LinearRegression.get_inference() now suppresses all inference fields when BM DOF is NaN instead of silently falling back to normal-theory inference.
  • The methodology change is consistent with the cited source material and the Methodology Registry; I do not treat the documented clubSandwich-specific algebra or the documented noise-floor NaN behavior as defects.
  • Test coverage is strong for the new surface: R-golden parity, weight-type rejection, effective-cluster rejection, zero-weight subpopulation invariance, noise-floor guarding, and LinearRegression end-to-end threading.
  • The only remaining issue I saw is the already-tracked double-compute on weighted hc2_bm inside LinearRegression.fit(), which is P3/non-blocking.

Methodology

  • No P0/P1 findings. The weighted CR2 lift matches the clubSandwich source path the PR says it ports: weighted lm objects default to the non-inverse_var branch, CR2 builds the weighted H_jj / MUWTWUM adjustment, and get_GH + get_P_array + Satterthwaite_df define the H-array/P-array Satterthwaite DOF used by coef_test. The local implementation in diff_diff/linalg.py:L1495-L2052 and the registry note in docs/methodology/REGISTRY.md:L2651-L2656 are consistent with that contract. (rdrr.io)

Code Quality

  • No findings.

Performance

  • P3 Impact: LinearRegression.fit() still computes weighted CR2 twice on the opened hc2_bm path, once through solve_ols(..., return_vcov=True) and again through compute_robust_vcov(..., return_dof=True). This is already tracked in TODO.md:L111-L111, so it is mitigated and non-blocking. Concrete fix: thread return_dof through solve_ols() or cache the CR2 precomputes reused by both calls. Refs: diff_diff/linalg.py:L3300-L3339, TODO.md:L111-L111.

Maintainability

  • No findings.

Tech Debt

  • No new untracked findings.

Security

  • No findings.

Documentation/Tests

@igerber igerber added the ready-for-ci Triggers CI test workflows label May 20, 2026
@igerber igerber merged commit 955aa4b into main May 21, 2026
33 of 34 checks passed
@igerber igerber deleted the wls-cr2-clubsandwich-port branch May 21, 2026 00:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-ci Triggers CI test workflows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant