Regenerate clubsandwich_cr2_golden.json from authoritative R clubSandwich #450
🤖 AI review: Overall Assessment ✅ Looks good. No unmitigated P0/P1 findings; the substantive dCDH changes are documented rollbacks. Tech Debt: no untracked findings; the two dCDH rollbacks introduced here are explicitly tracked in TODO.md:L81-L82. No findings in Methodology, Code Quality, Performance, Maintainability, or Security.
Wave 2 of multi-wave tech-debt paydown. Tier A row in the post-Wave-1
backlog: convert the CR2 Bell-McCaffrey parity test from a
"Python matches Python" self-reference regression test into a real
"Python matches clubSandwich" parity test.
Changes:
- benchmarks/R/generate_clubsandwich_golden.R: replace the broken
`Wald_test(..., test="Satterthwaite")` block with `coef_test(fit,
vcov=vcov_cr2)$df_Satt`. clubSandwich 0.7+ removed the "Satterthwaite"
test name from `Wald_test`; the `df_Satt` column from `coef_test()`
is the idiomatic per-coefficient Bell-McCaffrey Satterthwaite DOF and
is numerically identical to the old per-unit-contrast path.
- benchmarks/R/generate_clubsandwich_golden.R: drop stale `readr`
requirement from the header comment (never `library(readr)`'d).
- benchmarks/data/clubsandwich_cr2_golden.json: regenerated end-to-end.
meta.source flips from "python_self_reference" → "clubSandwich"; also
captures clubSandwich version (0.7.0), R version (4.5.2), and
generated_at timestamp for forensic traceability.
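The new provenance fields make it cheap to guard the parity test against a stale fixture. A minimal sketch of such a guard (the field names follow the bullets above; the exact schema of the committed JSON, and the `assert_authoritative` helper itself, are assumptions for illustration):

```python
# Hypothetical guard: refuse to run the parity comparison against a
# fixture that is still a self-reference anchor. Field names mirror the
# commit message (meta.source, clubSandwich_version, R_version); the
# exact layout of the committed JSON is assumed, not quoted.
def assert_authoritative(meta: dict) -> None:
    source = meta.get("source")
    if source != "clubSandwich":
        raise AssertionError(
            f"golden JSON has meta.source={source!r}; expected the "
            "authoritative clubSandwich run. Regenerate via "
            "benchmarks/R/generate_clubsandwich_golden.R"
        )

# A meta block shaped like the regenerated fixture:
meta = {
    "source": "clubSandwich",
    "clubSandwich_version": "0.7.0",
    "R_version": "4.5.2",
}
assert_authoritative(meta)  # passes silently
```

Run as a first assertion in the test, this turns a silently stale fixture into a loud, self-explanatory failure.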
Parity:
Python `_compute_cr2_bm` matches clubSandwich `vcovCR(..., "CR2")` +
`coef_test()$df_Satt` at machine precision on the new dataset:
Dataset max|Δ coef| max|Δ vcov| max|Δ dof_bm|
balanced_small 2.8e-16 4.6e-16 4.0e-15
unbalanced_medium 1.8e-15 1.8e-15 4.4e-15
singletons_present 5.0e-16 1.4e-16 7.1e-15
Well under the 1e-6 test tolerance.
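The comparison behind the table reduces to a max-absolute-difference check per field at a loose tolerance. A hedged sketch of that pattern (the values here are illustrative stand-ins, not the committed fixture's numbers, and this is not the actual code in tests/test_linalg_hc2_bm.py):

```python
# Sketch of the parity-check pattern: elementwise max |Python - golden|
# per field, asserted against the 1e-6 test tolerance. Values below are
# illustrative stand-ins, not the committed fixture's contents.
TOL = 1e-6

def max_abs_delta(xs, ys):
    """Max elementwise absolute difference of two equal-length sequences."""
    return max(abs(x - y) for x, y in zip(xs, ys))

golden = {"coef": [0.52, -1.31], "dof_bm": [7.4, 6.8]}
python_out = {"coef": [0.52 + 3e-16, -1.31], "dof_bm": [7.4, 6.8 + 5e-15]}

for field in ("coef", "dof_bm"):
    delta = max_abs_delta(golden[field], python_out[field])
    assert delta <= TOL, f"{field}: {delta:.2e} exceeds tolerance"
```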
The dataset itself (x, y values) shifts because R 4.5.2's RNG produces
different output than whatever R version originally bootstrapped the
`python_self_reference` values into the JSON. Same seeds, different
RNG streams. coef/vcov/dof_bm differ by order 1e-1 to 1e+0 between
old and new JSON — that's the dataset change, not a methodology change.
The `meta.R_version` field locks today's RNG behavior so future
regenerations can be diagnosed if R bumps its RNG defaults again.
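The same-seed/different-stream failure mode is easy to reproduce in any language, since a seed only pins the stream for a fixed RNG algorithm. A small Python illustration, with NumPy bit generators standing in for R's RNG defaults:

```python
import numpy as np

# Same seed, two different bit-generator algorithms: different streams.
# This is the R situation in miniature; a seed pins the stream only for
# a fixed RNG algorithm, which is why recording the interpreter/RNG
# version next to generated data makes future drift diagnosable.
SEED = 12345
mt = np.random.Generator(np.random.MT19937(SEED))
pcg = np.random.Generator(np.random.PCG64(SEED))

draw_mt = mt.random(3)
draw_pcg = pcg.random(3)
assert not np.allclose(draw_mt, draw_pcg)  # same seed, different streams

# Same seed AND same algorithm reproduces exactly:
mt_again = np.random.Generator(np.random.MT19937(SEED))
assert np.array_equal(mt_again.random(3), draw_mt)
```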
TODO.md: remove the row from the Methodology/Correctness table AND the
Tier A bullet under Prioritized Tech-Debt Backlog (both reference the
now-resolved regen).
No `diff_diff/*.py` files touched — Python was already correct.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed 22a5c71 to 6c8eee4
🔁 AI review rerun (requested by @igerber): Overall Assessment ✅ Looks good. Methodology: no P0/P1 findings. No findings in Code Quality, Performance, Maintainability, Tech Debt, or Security.
P2 — REGISTRY.md L2543 still described the committed JSON as a "python_self_reference" regression anchor with the authoritative R run tracked in TODO.md. After this PR the JSON IS the authoritative R run. Rewrote the bullet to reflect the new state: `meta.source = clubSandwich`, version + R_version + generated_at captured, parity verified at ≤ 7.1e-15 across all three datasets.

P3 — the `tests/test_linalg_hc2_bm.py::test_cr2_parity_with_golden` docstring (L525-L531) still said "until then the JSON is a self-reference anchor". Rewrote it to describe the JSON as the authoritative clubSandwich fixture with the empirical parity margin.

Both are factual corrections; no methodology surface changes. Test still passes at 1e-6 tolerance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🔁 AI review rerun (requested by @igerber): Overall Assessment ✅ Looks good.
Summary

Wave 2 of multi-wave tech-debt paydown (Tier A row in the post-Wave-1 backlog). Converts the CR2 Bell-McCaffrey parity test from a "Python matches Python" self-reference regression test into a real "Python matches R clubSandwich" parity test.

- `benchmarks/R/generate_clubsandwich_golden.R`: replace the broken `Wald_test(..., test="Satterthwaite")` block with `coef_test(fit, vcov=vcov_cr2)$df_Satt`. clubSandwich 0.7+ removed `"Satterthwaite"` from the `Wald_test` test-name set; the `df_Satt` column from `coef_test()` is the idiomatic per-coefficient Bell-McCaffrey Satterthwaite DOF and is numerically identical to the old per-unit-contrast path. Also drops the stale `readr` requirement from the header (never `library(readr)`'d).
- `benchmarks/data/clubsandwich_cr2_golden.json`: regenerated end-to-end. `meta.source` flips `python_self_reference` → `clubSandwich`; also captures `clubSandwich_version` (0.7.0), `R_version` (4.5.2), and a `generated_at` timestamp for forensic traceability.
- `TODO.md`: remove the Methodology/Correctness table row AND the Tier A bullet — both reference the now-resolved regen.
- No `diff_diff/*.py` files touched — Python was already correct (verified at machine precision; see parity table below).

Parity demonstration

Python `_compute_cr2_bm` vs clubSandwich `vcovCR(..., "CR2")` + `coef_test()$df_Satt` on the new dataset — max abs Δ per dataset × field:

| Dataset | coef | vcov_cr2 | dof_bm |
|---|---|---|---|
| balanced_small | 2.8e-16 | 4.6e-16 | 4.0e-15 |
| unbalanced_medium | 1.8e-15 | 1.8e-15 | 4.4e-15 |
| singletons_present | 5.0e-16 | 1.4e-16 | 7.1e-15 |

All ≤ 1e-14. Well under the 1e-6 test tolerance.

Dataset shift caveat

The dataset itself (x, y values) shifts under regeneration because R 4.5.2's RNG produces different output than whatever R version originally bootstrapped the `python_self_reference` values into the JSON. Same seeds, different RNG streams. Old (`python_self_reference`) vs new (`clubSandwich`) JSON: coef/vcov_cr2/dof_bm differ by order 1e-1 to 1e+0 across the three datasets. That's a dataset change, not a methodology change — confirmed by the machine-precision Python↔clubSandwich agreement above on the new dataset. The new `meta.R_version` field locks today's RNG behavior so future regenerations can be diagnosed if R bumps RNG defaults again.

Test plan

- `Rscript -e 'packageVersion("clubSandwich")'` returns `'0.7.0'`
- `Rscript benchmarks/R/generate_clubsandwich_golden.R` runs to completion (writes the JSON)
- `jq '.meta.source' benchmarks/data/clubsandwich_cr2_golden.json` returns `"clubSandwich"`
- `pytest tests/test_linalg_hc2_bm.py -v` — 34/34 pass, including `TestCR2BMCluster::test_cr2_parity_with_golden`
- `grep -nE "Regenerate.*clubsandwich_cr2_golden" TODO.md` returns no matches (both rows removed)

🤖 Generated with Claude Code