igerber · May 16, 2026
diff --git a/TODO.md b/TODO.md
@@ -101,7 +101,6 @@ Deferred items from PR reviews that were not addressed before merge.
 | Weighted one-way Bell-McCaffrey (`vcov_type="hc2_bm"` + `weights`, no cluster) currently raises `NotImplementedError`. `_compute_bm_dof_from_contrasts` builds its hat matrix from the unscaled design via `X (X'WX)^{-1} X' W`, but `solve_ols` solves the WLS problem by transforming to `X* = sqrt(w) X`, so the correct symmetric idempotent residual-maker is `M* = I - sqrt(W) X (X'WX)^{-1} X' sqrt(W)`. Rederive the Satterthwaite `(tr G)^2 / tr(G^2)` ratio on the transformed design and add weighted parity tests before lifting the guard. | `linalg.py::_compute_bm_dof_from_contrasts`, `linalg.py::_validate_vcov_args` | Phase 1a | Medium |
 | HC2 / HC2 + Bell-McCaffrey on absorbed-FE fits currently raises `NotImplementedError` in three places: `TwoWayFixedEffects` unconditionally; `DifferenceInDifferences(absorb=..., vcov_type in {"hc2","hc2_bm"})`; `MultiPeriodDiD(absorb=..., vcov_type in {"hc2","hc2_bm"})`. Within-transformation preserves coefficients and residuals under FWL but not the hat matrix, so the reduced-design `h_ii` is not the diagonal of the full FE projection and CR2's block adjustment `A_g = (I - H_gg)^{-1/2}` is likewise wrong on absorbed cluster blocks. Lifting the guard needs HC2/CR2-BM computed from the full absorbed projection (unit/time FE dummies reconstructed internally, or a FE-aware hat-matrix formulation) and a parity harness against a full-dummy OLS run or R `fixest`/`clubSandwich`. HC1/CR1 are unaffected by this because they have no leverage term. | `twfe.py::fit`, `estimators.py::DifferenceInDifferences.fit`, `estimators.py::MultiPeriodDiD.fit` | Phase 1a | Medium |
 | Weighted CR2 Bell-McCaffrey cluster-robust (`vcov_type="hc2_bm"` + `cluster_ids` + `weights`) currently raises `NotImplementedError`. Weighted hat matrix and residual rebalancing need threading per clubSandwich WLS handling. | `linalg.py::_compute_cr2_bm` | Phase 1a | Medium |
-| Regenerate `benchmarks/data/clubsandwich_cr2_golden.json` from R (`Rscript benchmarks/R/generate_clubsandwich_golden.R`). Current JSON has `source: python_self_reference` as a stability anchor until an authoritative R run. | `benchmarks/R/generate_clubsandwich_golden.R` | Phase 1a | Medium |
 | `honest_did.py:1907` `np.linalg.solve(A_sys, b_sys) / except LinAlgError: continue` is a silent basis-rejection in the vertex-enumeration loop that is algorithmically intentional (try the next basis). Consider surfacing a count of rejected bases as a diagnostic when ARP enumeration exhausts, so users see when the vertex search was heavily constrained. Not a silent failure in the sense of the Phase 2 audit (the algorithm is supposed to skip), but the diagnostic would help debug borderline cases. | `honest_did.py` | #334 | Low |
 | Unify Rust local-method `estimate_model` solver path to `solve_wls_svd` (the same SVD helper used by the global-method since PR #348) for sub-1e-14 bootstrap SE parity. Current local-method bootstrap parity test (`tests/test_rust_backend.py::TestTROPRustEdgeCaseParity::test_bootstrap_seed_reproducibility_local`) passes at `atol=1e-5` — the residual ~1e-7 gap is roundoff between Rust's `estimate_model` matrix factorization and numpy's `lstsq`, which accumulates differently across per-replicate bootstrap fits. Main-fit ATT parity is regime-dependent (`atol=1e-14` for `lambda_nn=inf`, `atol=1e-10` for finite `lambda_nn` — see `test_local_method_main_fit_parity`); the bootstrap gap is a same-solver-path roundoff concern and not a user-visible correctness bug. | `rust/src/trop.rs::estimate_model`, `rust/src/linalg.rs::solve_wls_svd` | follow-up | Low |
 | Rust multiplier-bootstrap weight RNG (`generate_bootstrap_weights_batch` in `rust/src/bootstrap.rs:9-10, 57-75`) uses `Xoshiro256PlusPlus::seed_from_u64(seed + i)` per row for Rademacher/Mammen/Webb generation. If any Python caller (SDID / efficient-DiD multiplier bootstrap) has a numpy-canonical equivalent, the two backends likely diverge under the same seed. Audit Python callers (`diff_diff/sdid.py`, `diff_diff/efficient_did_bootstrap.py`, `diff_diff/bootstrap_utils.py::generate_bootstrap_weights_batch_numpy`) for parity-test gaps. Same fix shape as TROP RNG parity (PR #354): pre-generate weights in Python via numpy and pass them to Rust through PyO3. | `rust/src/bootstrap.rs`, `diff_diff/bootstrap_utils.py` | follow-up | Medium |
@@ -166,7 +165,6 @@ Ordered paydown view across the tables above. Tier A → D is by effort × risk,
 
 #### Tier A — Quick wins (≤1 day, ≤3 CI rounds expected)
 
-- Regenerate `benchmarks/data/clubsandwich_cr2_golden.json` from R via `benchmarks/R/generate_clubsandwich_golden.R` (currently `source: python_self_reference`)
 - HonestDiD `test_m0_short_circuit`: replace wall-clock `elapsed < 0.5s` proxy with a state flag (`tests/test_methodology_honest_did.py:246`)
 - EfficientDiD `control_group="last_cohort"` REGISTRY-vs-code alignment with `anticipation>0` (`efficient_did.py`, one design decision)
 - TripleDifference: add `generate_ddd_panel_data` for panel DDD power analysis (`prep_dgp.py`, `power.py`)

diff --git a/benchmarks/R/generate_clubsandwich_golden.R b/benchmarks/R/generate_clubsandwich_golden.R
@@ -7,7 +7,7 @@
 #   Rscript benchmarks/R/generate_clubsandwich_golden.R
 #
 # Requirements:
-#   clubSandwich (CRAN), jsonlite, readr
+#   clubSandwich (CRAN), jsonlite
 #
 # Output:
 #   benchmarks/data/clubsandwich_cr2_golden.json
@@ -50,12 +50,12 @@ for (nm in names(datasets)) {
   d <- datasets[[nm]]
   fit <- lm(y ~ x, data = d)
   vcov_cr2 <- vcovCR(fit, cluster = d$cluster, type = "CR2")
-  # Per-contrast Bell-McCaffrey DOF: one per coefficient via a unit contrast.
+  # Per-coefficient Bell-McCaffrey Satterthwaite DOF via coef_test()$df_Satt.
+  # (clubSandwich 0.7+ removed `Wald_test(..., test="Satterthwaite")`; the
+  # `df_Satt` column from coef_test() is the idiomatic per-coefficient form
+  # and is numerically identical to the old per-unit-contrast path.)
+  ct <- coef_test(fit, vcov = vcov_cr2)
   coef_names <- names(coef(fit))
-  dof_vec <- sapply(coef_names, function(nm_coef) {
-    ctr <- setNames(as.numeric(names(coef(fit)) == nm_coef), names(coef(fit)))
-    Wald_test(fit, constraints = matrix(ctr, 1), vcov = vcov_cr2, test = "Satterthwaite")$df
-  })
   output[[nm]] <- list(
     x = d$x,
     y = d$y,
@@ -64,7 +64,7 @@ for (nm in names(datasets)) {
     coef_names = coef_names,
     vcov_cr2 = as.numeric(vcov_cr2),
     vcov_shape = dim(vcov_cr2),
-    dof_bm = as.numeric(dof_vec),
+    dof_bm = as.numeric(ct$df_Satt),
     cluster_sizes = as.numeric(table(d$cluster))
   )
 }
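
The weighted one-way Bell-McCaffrey entry in the TODO.md hunk above contrasts the hat matrix built from the unscaled design, `X (X'WX)^{-1} X' W`, with the one implied by the transformed design `X* = sqrt(w) X`. The following is a minimal pure-Python sketch of that contrast, not the `linalg.py` implementation. It assumes a single regressor with no intercept so that `X'WX` is a scalar, and every helper name in it is hypothetical. It checks that both matrices are idempotent but only the transformed-design one is symmetric.

```python
import math


def weighted_hat_matrices(x, w):
    """Return (H_unscaled, H_star) for a one-regressor, no-intercept WLS fit.

    H_unscaled = X (X'WX)^{-1} X' W           (built from the unscaled design)
    H_star     = sqrt(W) X (X'WX)^{-1} X' sqrt(W)  (built from X* = sqrt(w) X)

    Illustrative assumption: X is a single column, so X'WX is a scalar.
    """
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    if any(wi <= 0 for wi in w):
        raise ValueError("weights must be strictly positive")
    s = sum(wi * xi * xi for xi, wi in zip(x, w))  # scalar X'WX
    if s == 0:
        raise ValueError("X'WX is zero; the design is degenerate")
    n = len(x)
    h_unscaled = [[x[i] * x[j] * w[j] / s for j in range(n)] for i in range(n)]
    h_star = [
        [math.sqrt(w[i]) * x[i] * x[j] * math.sqrt(w[j]) / s for j in range(n)]
        for i in range(n)
    ]
    return h_unscaled, h_star


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def is_symmetric(a, tol=1e-12):
    n = len(a)
    return all(abs(a[i][j] - a[j][i]) <= tol for i in range(n) for j in range(n))


def is_idempotent(a, tol=1e-12):
    aa = matmul(a, a)
    n = len(a)
    return all(abs(aa[i][j] - a[i][j]) <= tol for i in range(n) for j in range(n))


def trace(a):
    return sum(a[i][i] for i in range(len(a)))


# Toy data chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0]
w = [0.5, 1.0, 2.0, 4.0]
h_unscaled, h_star = weighted_hat_matrices(x, w)

print("H_unscaled symmetric:", is_symmetric(h_unscaled))
print("H_star symmetric:", is_symmetric(h_star))
print("both idempotent:", is_idempotent(h_unscaled) and is_idempotent(h_star))
```

Under unequal weights the unscaled form is an oblique projection: idempotent but not symmetric. That is consistent with the TODO entry's requirement to rederive the Satterthwaite ratio on the transformed design before lifting the guard. The sketch does not compute the degrees of freedom themselves.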