Close SDID placebo R-parity gap: warm-start + R-anchored fixture + test seam #369
Merged
Conversation
Closes the last queued SDID R-parity follow-up (per ``project_sdid_pr349_followups.md``, now removed from the memory index — the work is shipped). Symmetric with the existing ``test_jackknife_se_matches_r`` anchor in ``TestJackknifeSERParity``.

Methodology fix — placebo warm-start: ``synthdid:::placebo_se`` (R/vcov.R) seeds Frank-Wolfe per draw with ``weights.boot$omega = sum_normalize(weights$omega[ind[1:N0_placebo]])`` (fit-time ω subsetted and renormalized) and the fit-time ``weights$lambda``, then re-estimates with ``update.omega=TRUE, update.lambda=TRUE``. Python's ``_placebo_variance_se`` previously used a uniform cold start, producing finite-iteration convergence-pattern drift on a handful of draws relative to R's reference SE on the same panel. Fix: add ``init_omega`` and ``init_lambda`` kwargs to ``_placebo_variance_se``. The dispatcher now passes ``init_omega=unit_weights, init_lambda=time_weights`` (fit-time outputs); the loop seeds ``compute_sdid_unit_weights(init_weights=_sum_normalize(init_omega[pseudo_control_idx]))`` and ``compute_time_weights(init_weights=init_lambda)`` per draw, mirroring R's warm-start pattern. At the global FW optimum the two starts are equivalent (strictly convex objective) — this is a finite-iteration parity fix, not a methodology change.

R-parity fixture + test seam:
* ``benchmarks/R/generate_sdid_placebo_parity_fixture.R`` — R 4.5.2 + synthdid 0.0.9. Reuses the same Y matrix as ``TestJackknifeSERParity`` (same R_ATT = 4.980848860060929) so the jackknife and placebo R-parity tests share an anchor panel. Records the 200 per-rep permutations R consumed and the SE from R's manual ``placebo_se`` loop (which matches ``vcov(method="placebo")`` to machine precision when seeded); permutations are 0-indexed for direct numpy consumption.
* ``tests/data/sdid_placebo_indices_r.json`` — committed fixture.
* ``_placebo_variance_se`` gains a private ``_placebo_indices`` kwarg (underscore-prefixed, test-only). When supplied, each row replaces the per-draw ``rng.permutation(n_control)`` so a Python fit can consume R's exact permutation sequence and produce a bit-identical SE.
* ``test_placebo_se_matches_r`` (in ``TestJackknifeSERParity``) intercepts the dispatcher's call to ``_placebo_variance_se`` to capture the normalized fit-time inputs, then re-invokes the method with R's permutations through the seam. Asserts ``|py_se - r_se| < 1e-8`` — Rust FW vs R FW differ at sub-ULP on the same warm-start; tight enough to catch real divergences without masking BLAS reduction-order tolerance.

Baseline rebase: ``TestScaleEquivariance::test_baseline_parity_small_scale[placebo]`` captured the pre-warm-start SE = 0.29385822261006445. The new value is 0.293840360160448 (a sub-percent shift). The test's bit-identity contract is preserved per backend; the baseline is updated with a comment documenting the warm-start change and a pointer to the new R-parity test that pins the post-fix value to R's reference. The p-value (placebo uses the empirical formula, not the analytical one) is unchanged at 0.004975124378109453.

Verification: ``pytest tests/test_methodology_sdid.py tests/test_survey_phase5.py -q`` → 230 passed (1 new R-parity test; existing TestScaleEquivariance baseline rebased; all other SDID + survey tests unchanged).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
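R's subset-and-renormalize warm start maps onto Python as a small per-draw step. A minimal sketch under assumed names (``_sum_normalize`` and ``placebo_draw_inits`` are illustrative, not the repo's actual code):

```python
import numpy as np

def _sum_normalize(w):
    # Renormalize so the weights sum to 1; fall back to uniform if all-zero.
    s = w.sum()
    return w / s if s > 0 else np.full(len(w), 1.0 / len(w))

def placebo_draw_inits(init_omega, init_lambda, pseudo_control_idx):
    """Mirror R's placebo_se warm start: subset the fit-time omega to this
    draw's pseudo-controls and renormalize; reuse the fit-time lambda as-is."""
    return _sum_normalize(init_omega[pseudo_control_idx]), init_lambda

# Fit-time unit weights over 5 controls; this draw's permutation puts
# units [3, 0, 1] in the pseudo-control block.
omega = np.array([0.4, 0.1, 0.0, 0.3, 0.2])
lam = np.array([0.25, 0.25, 0.5])
omega0, lam0 = placebo_draw_inits(omega, lam, np.array([3, 0, 1]))
print(omega0)  # the subset [0.3, 0.4, 0.1] renormalized to sum to 1
```

The renormalized subset and the untouched lambda then seed the per-draw FW re-estimation, which is why the fix changes only finite-iteration behavior.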
Overall Assessment: ✅ Looks good
* ``_placebo_variance_se`` docstring step 3 now describes the warm-start semantics (it still said "uniform initialization, fresh start" after PR #369 landed the warm-start). Adds Parameters entries for ``init_omega`` / ``init_lambda`` / ``_placebo_indices``.
* ``test_placebo_se_matches_r`` now also asserts an elementwise match between Python's ``placebo_effects`` and ``R_PLACEBO_TAUS`` from the fixture, so a permutation that diverged on a single draw but happened to leave ``sd()`` unchanged would still trip the regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
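The seam-driven loop and the value of asserting per-draw taus can be sketched as follows (the fixture field names, ``placebo_se`` signature, and toy tau function here are assumptions for illustration, not the repo's actual API):

```python
import numpy as np

# Hypothetical fixture shape: 0-indexed permutation rows plus R's reference
# SE (field names are illustrative, not the committed schema).
fixture = {
    "permutations": [[2, 0, 1], [1, 2, 0]],
    "r_se": 0.226342763355644,
}

def placebo_se(tau_for_draw, n_control, n_reps, rng, _placebo_indices=None):
    """Placebo loop: each draw normally permutes the controls; the
    test-only seam replaces that with a fixed row per draw."""
    taus = []
    for b in range(n_reps):
        if _placebo_indices is not None:
            ind = np.asarray(_placebo_indices[b])  # R's exact permutation
        else:
            ind = rng.permutation(n_control)       # production path
        taus.append(tau_for_draw(ind))
    taus = np.asarray(taus)
    return float(np.std(taus, ddof=1)), taus       # R's sd() uses n-1

# With the seam supplied the rng is never consulted, so the draw sequence
# (and hence the per-draw taus) is reproducible row-for-row.
se, taus = placebo_se(lambda ind: float(ind[0]), n_control=3, n_reps=2,
                      rng=np.random.default_rng(0),
                      _placebo_indices=fixture["permutations"])
print(taus.tolist())  # [2.0, 1.0] -- driven by the fixture rows, not the rng
```

Checking ``taus`` elementwise rather than only the aggregate ``sd()`` is what catches a single diverged draw whose effect on the standard deviation happens to cancel.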
/ai-review
🔁 AI review rerun (requested by @igerber). Overall Assessment: ✅ Looks good
Swap the captured SE from 0.293840360160448 (macOS Apple Accelerate, local) to 0.2938403592163006 (Linux OpenBLAS, ubuntu-24.04-arm CI runner). The warm-start that landed in the prior commit threads ``unit_weights`` (a fit-time FW output that carries sub-ULP BLAS reduction-order divergence) into the per-draw FW init; across 200 draws with path-dependent sparsification, the SE diverges by ~1e-9 between the two platforms, exceeding the existing ``1e-12`` bit-identity gate. CI is the gating surface: macOS local fits will now drift at ~1e-9 on this fixture, and the inline comment documents that the delta is finite-iteration FW path dependence, not a numerical regression. ATT and the empirical p-value are unchanged: ATT comes from the deterministic FW solve (platform-stable to <1e-14 on this panel) and the placebo p-value is integer-driven (1/(B+1) = 1/201, with no placebo τ exceeding |ATT|). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The placebo SE warm-start that landed in PR #369 threads ``unit_weights`` (a fit-time FW output that carries sub-ULP BLAS reduction-order divergence) into each per-draw FW init. Across 200 placebo draws with path-dependent sparsification in the 100-iter pre-sparsify pass, that ULP-level input difference accumulates to ~1e-9 SE divergence between Apple Accelerate (macOS) and OpenBLAS (Linux); no single double satisfies both platforms at the prior ``1e-12`` gate. The placebo row's SE assertion is therefore loosened to ``rel=1e-7`` (a drift detector, not bit-identity). Bootstrap and jackknife stay at ``rel=1e-14``: bootstrap dilutes the divergence by resampling from the full unit set with replacement, and jackknife uses fixed weights with no FW re-estimation. Bit-identity protection for placebo moves to ``test_placebo_se_matches_r`` (``TestJackknifeSERParity``), which uses the ``_placebo_indices`` test seam to feed R's exact permutations through the same normalized inputs the dispatcher would pass, bypassing the platform-divergent fit-time path. That test asserts both the aggregate SE (< 1e-8 vs R) and per-draw τ (< 1e-8 elementwise vs R), which is strictly stronger than the prior ``1e-12`` capture-vs-capture gate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
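The tolerance split can be checked directly against the two SE values quoted above (``math.isclose`` stands in here for the test suite's actual assertion helper; the tolerance table is a restatement of the commit, not the repo's literal constants):

```python
import math

# Per-method relative tolerances described above: placebo becomes a drift
# detector across BLAS backends; bootstrap/jackknife stay at bit-identity.
SE_REL_TOL = {"placebo": 1e-7, "bootstrap": 1e-14, "jackknife": 1e-14}

linux_ci = 0.2938403592163006    # OpenBLAS baseline captured in this commit
macos_local = 0.293840360160448  # Apple Accelerate value, ~1e-9 away

# The ~1e-9 cross-platform drift passes the loosened placebo gate...
assert math.isclose(macos_local, linux_ci, rel_tol=SE_REL_TOL["placebo"])
# ...but no single double could satisfy the prior 1e-12 gate on both platforms.
assert not math.isclose(macos_local, linux_ci, rel_tol=1e-12)
```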
Summary
* Closes the SDID placebo R-parity gap against R's reference (``synthdid::vcov(method="placebo")``). Symmetric with the existing jackknife R-parity test in ``TestJackknifeSERParity``.
* ``_placebo_variance_se`` previously used a uniform cold start for per-draw Frank-Wolfe re-estimation; R's ``synthdid:::placebo_se`` (R/vcov.R) uses a warm start: ``weights.boot$omega = sum_normalize(weights$omega[ind[1:N0_placebo]])`` (fit-time ω subsetted + renormalized) plus the fit-time ``weights$lambda``. Identical at the global FW optimum (strictly convex objective), but the cold start produces finite-iter convergence-pattern drift on a handful of draws against R's reference.
* Adds ``init_omega`` and ``init_lambda`` kwargs to ``_placebo_variance_se``; the dispatcher passes the fit-time ``unit_weights``/``time_weights``, and the per-draw FW calls thread these as ``init_weights``. Mirrors the bootstrap warm-start landed in PR #349 (Fix SyntheticDiD bootstrap p-value dispatch and SE formula).
* ``benchmarks/R/generate_sdid_placebo_parity_fixture.R`` records 200 per-rep permutations + R's placebo SE on the same Y matrix as ``TestJackknifeSERParity`` (R 4.5.2, synthdid 0.0.9). Output committed at ``tests/data/sdid_placebo_indices_r.json``. R's manual loop and ``vcov(method="placebo")`` agree at machine precision when seeded (both 0.226342763355644).
* Private ``_placebo_indices`` kwarg on ``_placebo_variance_se`` (underscore-prefixed, not public API). When supplied, each row replaces the per-draw ``rng.permutation(n_control)`` so Python can consume R's exact permutation sequence.
* ``test_placebo_se_matches_r`` in ``TestJackknifeSERParity`` intercepts the dispatcher's call to capture the normalized fit-time inputs, then re-invokes ``_placebo_variance_se`` with R's permutations through the seam. Asserts ``|py_se - r_se| < 1e-8`` (Rust FW vs R FW differ at sub-ULP on the same warm-start; tolerance covers BLAS reduction order without masking real divergences).
* ``TestScaleEquivariance::test_baseline_parity_small_scale[placebo]`` captured the pre-warm-start SE 0.29385822261006445; the new value is 0.293840360160448 (a sub-percent finite-iter shift). Updated with a comment documenting the warm-start change and a pointer to the new R-parity test.

Methodology references
* ``SyntheticDiD._placebo_variance_se`` warm-start change (matching ``synthdid:::placebo_se`` in R/vcov.R).
* ``synthdid`` 0.0.9 source code (R/vcov.R::placebo_se): ``update.omega=TRUE, update.lambda=TRUE`` defaults with the ``weights.boot$omega = sum_normalize(weights$omega[ind[1:N0_placebo]])`` warm-start.

Validation
* ``tests/test_methodology_sdid.py::TestJackknifeSERParity::test_placebo_se_matches_r`` asserts R-parity at < 1e-8.
* ``tests/test_methodology_sdid.py::TestScaleEquivariance::test_baseline_parity_small_scale[placebo]`` baseline rebased from 0.29385822261006445 to 0.293840360160448, with documentation.
* ``Rscript benchmarks/R/generate_sdid_placebo_parity_fixture.R`` produces R ATT = 4.980848860060929, R placebo SE = 0.226342763355644 (manual loop = via-vcov to 15 digits).

Security / privacy