
Close SDID placebo R-parity gap: warm-start + R-anchored fixture + test seam #369

Merged

igerber merged 4 commits into main from sdid-placebo-r-parity on Apr 25, 2026

Conversation

@igerber igerber (Owner) commented Apr 25, 2026

Summary

  • Closes the last queued SDID R-parity follow-up (placebo SE matching synthdid::vcov(method="placebo")). Symmetric with the existing jackknife R-parity test in TestJackknifeSERParity.
  • Methodology fix: Python's _placebo_variance_se previously used uniform cold-start for per-draw Frank-Wolfe re-estimation; R's synthdid:::placebo_se (R/vcov.R) uses a warm-start weights.boot$omega = sum_normalize(weights$omega[ind[1:N0_placebo]]) (fit-time ω subsetted + renormalized) plus the fit-time weights$lambda. The two starts are identical at the global FW optimum (strictly convex objective), but the cold start produces finite-iter convergence-pattern drift on a handful of draws against R's reference.
  • Fix: add init_omega and init_lambda kwargs to _placebo_variance_se; the dispatcher passes fit-time unit_weights / time_weights; the per-draw FW calls thread these as init_weights (see the sketch after this list). Mirrors the bootstrap warm-start landed in PR #349 (Fix SyntheticDiD bootstrap p-value dispatch and SE formula).
  • R fixture: benchmarks/R/generate_sdid_placebo_parity_fixture.R records 200 per-rep permutations + R's placebo SE on the same Y matrix as TestJackknifeSERParity (R 4.5.2, synthdid 0.0.9). Output committed at tests/data/sdid_placebo_indices_r.json. R's manual loop and vcov(method="placebo") agree at machine precision when seeded (both 0.226342763355644).
  • Test seam: private _placebo_indices kwarg on _placebo_variance_se (underscore-prefixed, not public API). When supplied, each row replaces the per-draw rng.permutation(n_control) so Python can consume R's exact permutation sequence.
  • R-parity test: new test_placebo_se_matches_r in TestJackknifeSERParity intercepts the dispatcher's call to capture the normalized fit-time inputs, then re-invokes _placebo_variance_se with R's permutations through the seam. Asserts |py_se - r_se| < 1e-8 (Rust FW vs R FW differ at sub-ULP on the same warm-start; the tolerance covers BLAS reduction-order differences without masking real divergences).
  • Baseline rebase: TestScaleEquivariance::test_baseline_parity_small_scale[placebo] captured pre-warm-start SE 0.29385822261006445. New value 0.293840360160448 (sub-percent finite-iter shift). Updated with a comment documenting the warm-start change and pointer to the new R-parity test.
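
For concreteness, here is a minimal numpy sketch of the warm-start pattern described above. `placebo_draw_inits` is a hypothetical helper invented for illustration; per the commit message below, the actual plumbing goes through `compute_sdid_unit_weights` / `compute_time_weights` via their `init_weights` arguments.

```python
import numpy as np

def _sum_normalize(w: np.ndarray) -> np.ndarray:
    # R's sum_normalize: rescale non-negative weights to sum to 1.
    return w / w.sum()

def placebo_draw_inits(omega_fit, lambda_fit, perm, n0_placebo):
    # Hypothetical helper mirroring R's
    # weights.boot$omega = sum_normalize(weights$omega[ind[1:N0_placebo]]):
    # subset the fit-time unit weights to this draw's pseudo-controls,
    # renormalize, and reuse the fit-time time weights unchanged.
    init_omega = _sum_normalize(omega_fit[perm[:n0_placebo]])
    init_lambda = lambda_fit
    return init_omega, init_lambda

# Toy panel: 6 control units, 2 pseudo-treated per draw (N0_placebo = 4).
omega_fit = np.array([0.40, 0.30, 0.10, 0.10, 0.05, 0.05])
perm = np.array([2, 0, 5, 1, 4, 3])  # one 0-indexed permutation of the controls
init_omega, init_lambda = placebo_draw_inits(omega_fit, np.array([0.5, 0.5]), perm, 4)
print(init_omega)  # [0.1176 0.4706 0.0588 0.3529] -- renormalized, sums to 1
```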

Methodology references

  • Method name(s): SyntheticDiD._placebo_variance_se warm-start change (matching synthdid:::placebo_se in R/vcov.R).
  • Paper / source link(s):
    • Arkhangelsky, Athey, Hirshberg, Imbens & Wager (2021), American Economic Review 111(12) — Algorithm 4.
    • R reference: synthdid 0.0.9 source code (R/vcov.R::placebo_se), update.omega=TRUE, update.lambda=TRUE defaults with weights.boot$omega = sum_normalize(weights$omega[ind[1:N0_placebo]]) warm-start.
  • Any intentional deviations from the source: None on the warm-start contract — Python now matches R. The placebo survey-extension paths (PR #365, Restore SDID survey support for placebo and jackknife variance methods) are unchanged; the warm-start fix applies to the non-survey + pweight-only path that has an R reference.

Validation

  • Tests added/updated:
    • New: tests/test_methodology_sdid.py::TestJackknifeSERParity::test_placebo_se_matches_r — R-parity at < 1e-8.
    • Updated: tests/test_methodology_sdid.py::TestScaleEquivariance::test_baseline_parity_small_scale[placebo] baseline rebased from 0.29385822261006445 to 0.293840360160448 with documentation.
    • Existing 229 SDID + survey tests pass unchanged.
  • Backtest / simulation / notebook evidence:
    • R fixture generator: Rscript benchmarks/R/generate_sdid_placebo_parity_fixture.R produces R ATT = 4.980848860060929, R placebo SE = 0.226342763355644 (manual loop = via_vcov to 15 digits).

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Close SDID placebo R-parity gap: warm-start + R-anchored fixture + test seam

Closes the last queued SDID R-parity follow-up (per
``project_sdid_pr349_followups.md``, now removed from the memory
index — the work is shipped). Symmetric with the existing
``test_jackknife_se_matches_r`` anchor in TestJackknifeSERParity.

Methodology fix — placebo warm-start:
``synthdid:::placebo_se`` (R/vcov.R) seeds Frank-Wolfe per draw with
``weights.boot$omega = sum_normalize(weights$omega[ind[1:N0_placebo]])``
(fit-time ω subsetted + renormalized) and the fit-time
``weights$lambda``, then re-estimates with ``update.omega=TRUE,
update.lambda=TRUE``. Python's ``_placebo_variance_se`` previously
used uniform cold-start, producing finite-iter convergence-pattern
drift on a handful of draws relative to R's reference SE on the
same panel.

Fix: add ``init_omega`` and ``init_lambda`` kwargs to
``_placebo_variance_se``. The dispatcher now passes ``init_omega=
unit_weights, init_lambda=time_weights`` (fit-time outputs); the
loop seeds ``compute_sdid_unit_weights(init_weights=
_sum_normalize(init_omega[pseudo_control_idx]))`` and
``compute_time_weights(init_weights=init_lambda)`` per draw,
mirroring R's warm-start pattern. At the global FW optimum the
two starts are equivalent (strictly convex objective) — this is
a finite-iter parity fix, not a methodology change.

R-parity fixture + test seam:
* ``benchmarks/R/generate_sdid_placebo_parity_fixture.R`` — R 4.5.2
  + synthdid 0.0.9. Reuses the same Y matrix as
  ``TestJackknifeSERParity`` (same R_ATT = 4.980848860060929) so
  jackknife and placebo R-parity tests share an anchor panel.
  Records the 200 per-rep permutations R consumed and the SE from
  R's manual ``placebo_se`` loop (which matches
  ``vcov(method="placebo")`` to machine precision when seeded);
  permutations are 0-indexed for direct numpy consumption.
* ``tests/data/sdid_placebo_indices_r.json`` — committed fixture.
* ``_placebo_variance_se`` gains a private ``_placebo_indices``
  kwarg (underscore-prefixed, test-only). When supplied, each row
  replaces the per-draw ``rng.permutation(n_control)`` so a
  Python fit can consume R's exact permutation sequence and
  produce a bit-identical SE (see the sketch after this list).
* ``test_placebo_se_matches_r`` (in ``TestJackknifeSERParity``)
  intercepts the dispatcher's call to ``_placebo_variance_se`` to
  capture the normalized fit-time inputs, then re-invokes the
  method with R's permutations through the seam. Asserts
  ``|py_se - r_se| < 1e-8`` — Rust FW vs R FW differ at sub-ULP
  on the same warm-start; loose enough to absorb BLAS
  reduction-order noise, tight enough to catch real divergences.
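
A sketch of how a test can consume the committed fixture through the seam. The JSON key names below are assumptions for illustration; the text above guarantees only that 200 zero-indexed permutations and R's reference SE are recorded.

```python
import json
import numpy as np

with open("tests/data/sdid_placebo_indices_r.json") as fh:
    fixture = json.load(fh)

# Assumed key names; the fixture's real schema may differ.
r_indices = np.asarray(fixture["placebo_indices"])  # shape (200, n_control), 0-indexed
r_se = fixture["placebo_se"]                        # 0.226342763355644 per this PR

# Hypothetical re-invocation through the seam: each row of _placebo_indices
# stands in for one rng.permutation(n_control) call inside the helper.
#   se, taus = est._placebo_variance_se(..., init_omega=unit_weights,
#                                       init_lambda=time_weights,
#                                       _placebo_indices=r_indices)
#   assert abs(se - r_se) < 1e-8
```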

Baseline rebase:
``TestScaleEquivariance::test_baseline_parity_small_scale[placebo]``
captured pre-warm-start SE = 0.29385822261006445. New value is
0.293840360160448 (sub-percent shift). The test's bit-identity
contract is preserved per backend; baseline updated with a comment
documenting the warm-start change and pointer to the new R-parity
test that pins the post-fix value to R's reference. p-value (placebo
uses empirical formula, not analytical) is unchanged at
0.004975124378109453.

Verification: ``pytest tests/test_methodology_sdid.py
tests/test_survey_phase5.py -q`` → 230 passed (1 new R-parity
test; existing TestScaleEquivariance baseline rebased; all other
SDID + survey tests unchanged).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Overall Assessment

✅ Looks good

Executive Summary

  • Affected method: SyntheticDiD placebo variance path in fit() / _placebo_variance_se.
  • Methodology cross-check is favorable: official synthdid treats placebo variance as Algorithm 4 and, in placebo_se, copies the fit-time weights, renormalizes omega on the pseudo-control subset, and re-calls synthdid_estimate with that weight object, so the new init_omega / init_lambda plumbing matches the source implementation.
  • No unmitigated P0/P1 issues found in estimator math, weighting, SE computation, or inference wiring.
  • Remaining items are P3 only: one stale in-code methodology docstring and one gap in the new R-parity regression.
  • Validation note: static review only; this environment does not have pytest or numpy, so I could not execute the targeted tests locally.

Methodology

  • Severity P3. Impact: _placebo_variance_se still documents step 3 as re-estimating omega/lambda “from uniform initialization, fresh start,” which now contradicts both the implementation and the updated Registry entry. Because this method is an in-code methodology reference, the mismatch can mislead future reviewers about the actual placebo SE contract. Concrete fix: update the step-3 docstring and parameter docs to describe the warm-started omega / lambda behavior and note the private _placebo_indices seam. diff_diff/synthetic_did.py:L1705-L1738, docs/methodology/REGISTRY.md:L1680-L1681

Code Quality

  • No findings.

Performance

  • No findings from static review.

Maintainability

  • No findings beyond the stale methodology docstring above.

Tech Debt

  • No findings. TODO.md does not currently track a limitation introduced by this diff.

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: the new fixture already ships per-draw placebo outputs (R_PLACEBO_TAUS), but test_placebo_se_matches_r only asserts equality of the aggregate SE. A draw-level regression that preserves the sample SD would escape, even though the full placebo draw vector is user-visible through placebo_effects and feeds the empirical placebo p-value path. Concrete fix: in test_placebo_se_matches_r, compare the returned placebo_effects against R_PLACEBO_TAUS elementwise (or at minimum compare the vector-derived empirical p-value) in addition to the SE check; a minimal illustration follows below. tests/test_methodology_sdid.py:L1541-L1618, tests/data/sdid_placebo_indices_r.json:1, diff_diff/synthetic_did.py:L1164-L1170
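
To see why the SE-only gate is weak, note that the sample SD is invariant to reordering the draws; a minimal sketch with hypothetical tau vectors:

```python
import numpy as np

# Two placebo-tau vectors that differ draw-by-draw yet share the same
# sample SD: an SE-only assertion passes, an elementwise one fails.
py_taus = np.array([0.10, -0.20, 0.30, -0.40])
r_taus = np.array([-0.40, 0.30, -0.20, 0.10])  # same values, shuffled across draws

print(np.isclose(py_taus.std(ddof=1), r_taus.std(ddof=1)))  # True  (SE gate passes)
print(np.allclose(py_taus, r_taus, atol=1e-8))              # False (elementwise catches it)
```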

- _placebo_variance_se docstring step 3 now describes the warm-start
  semantics (was still saying "uniform initialization, fresh start"
  after PR #369 landed the warm-start). Adds Parameters entries for
  init_omega / init_lambda / _placebo_indices.
- test_placebo_se_matches_r now also asserts elementwise match between
  Python's placebo_effects and R_PLACEBO_TAUS from the fixture, so a
  permutation that diverged on a single draw but happened to leave
  sd() unchanged would still trip the regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber igerber (Owner, Author) commented Apr 25, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 09a3ef44472a5f069c083bfc2b674ff694830e04


Overall Assessment

✅ Looks good

Executive Summary

  • Re-review scope was the SyntheticDiD placebo SE path, specifically the new warm-start plumbing in SyntheticDiD.fit() / _placebo_variance_se.
  • The prior methodology-docstring gap is resolved: the in-code docstring now matches the implementation and the updated Registry entry. diff_diff/synthetic_did.py:L1705-L1761, docs/methodology/REGISTRY.md:L1680-L1683
  • The prior test gap is resolved: the new R-parity regression checks both the aggregate placebo SE and the full per-draw placebo effect vector, not just the SE. tests/test_methodology_sdid.py:L1537-L1636, tests/data/sdid_placebo_indices_r.json:1
  • No unmitigated P0/P1 issues found in estimator math, weighting, variance/SE, identification assumptions, or default behavior in the changed code.
  • Static review only: I could not run the targeted tests locally because this environment does not have pytest or numpy installed.

Methodology

  • No findings. The affected method is SyntheticDiD._placebo_variance_se, and the dispatcher now threads fit-time unit_weights / time_weights as Frank-Wolfe initializations while still re-estimating ω and λ per draw, which is consistent with the updated Registry entry and the committed R fixture generator. diff_diff/synthetic_did.py:L1145-L1158, diff_diff/synthetic_did.py:L1688-L1959, docs/methodology/REGISTRY.md:L1680-L1683, benchmarks/R/generate_sdid_placebo_parity_fixture.R:L42-L108

Code Quality

  • No findings. The new kwargs are internal to the private placebo helper and do not create a public-surface propagation problem.

Performance

  • No findings. The change reuses the existing FW helpers; I do not see a new asymptotic performance regression in the modified path. diff_diff/utils.py:L1617-L1694, diff_diff/utils.py:L1769-L1837

Maintainability

  • No findings. The stale methodology wording noted in the prior review has been corrected in the helper docstring, and the rebased baseline is documented inline. diff_diff/synthetic_did.py:L1705-L1761, tests/test_methodology_sdid.py:L2990-L3004

Tech Debt

  • No findings. I did not find a new limitation here that needed mitigation via TODO.md. TODO.md:L51-L112

Security

  • No findings.

Documentation/Tests

  • No findings. The earlier aggregate-only parity gap is closed by asserting elementwise parity against R_PLACEBO_TAUS, and the baseline pin was updated with an explanatory comment. tests/test_methodology_sdid.py:L1537-L1636, tests/test_methodology_sdid.py:L2990-L3004, tests/data/sdid_placebo_indices_r.json:1
  • Validation note: local execution was not possible in this environment because pytest and numpy are unavailable.

@igerber igerber added the ready-for-ci label (Triggers CI test workflows) Apr 25, 2026
igerber and others added 2 commits April 25, 2026 09:31
Swap the captured SE from 0.293840360160448 (macOS Apple Accelerate
local) to 0.2938403592163006 (Linux OpenBLAS, ubuntu-24.04-arm CI
runner). The warm-start that landed in the prior commit threads
``unit_weights`` (a fit-time FW output that carries sub-ULP BLAS
reduction-order divergence) into per-draw FW init; across 200 draws
with path-dependent sparsification the SE diverges ~1e-9 between the
two platforms — exceeding the existing ``1e-12`` bit-identity gate.

CI is the gating surface. macOS local fits will now drift at ~1e-9
on this fixture; the inline comment documents that the delta is
finite-iter FW path-dependence, not a numerical regression.

ATT and empirical p-value are unchanged: ATT comes from the
deterministic FW solve (platform-stable to <1e-14 on this panel) and
the placebo p-value is integer-driven (1/(B+1) = 1/201 with no
placebo τ exceeding |ATT|). See the p-value sketch below.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
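
The integer-driven p-value above follows the standard permutation convention; a minimal sketch assuming the add-one form, which is consistent with the quoted 1/(B+1) = 1/201:

```python
import numpy as np

def placebo_p_value(att: float, placebo_taus: np.ndarray) -> float:
    # (1 + #{ |tau_b| >= |ATT| }) / (B + 1); with B = 200 draws and no
    # placebo tau exceeding |ATT|, this is 1/201 = 0.004975124378109453.
    b = len(placebo_taus)
    return float(1 + np.sum(np.abs(placebo_taus) >= abs(att))) / (b + 1)

rng = np.random.default_rng(0)
taus = rng.normal(0.0, 0.3, size=200)  # toy placebo draws, all far below |ATT| = 4.98
print(placebo_p_value(4.98, taus))     # 0.004975124378109453
```
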
The placebo SE warm-start that landed in PR #369 threads ``unit_weights``
(a fit-time FW output that carries sub-ULP BLAS reduction-order
divergence) into each per-draw FW init. Across 200 placebo draws
with path-dependent sparsification in the 100-iter pre-sparsify
pass, that ULP-level input difference accumulates to ~1e-9 SE
divergence between Apple Accelerate (macOS) and OpenBLAS (Linux).
No single double satisfies both at the prior ``1e-12`` gate.

The placebo row's SE assertion is loosened to ``rel=1e-7`` (a
drift detector, not bit-identity; see the sketch below).
Bootstrap and jackknife stay at
``rel=1e-14``: bootstrap dilutes the divergence by resampling from
the full unit set with replacement; jackknife uses fixed weights
and no FW re-estimation.

Bit-identity protection for placebo moves to ``test_placebo_se_matches_r``
(``TestJackknifeSERParity``), which uses the ``_placebo_indices``
test seam to feed R's exact permutations through the same normalized
inputs the dispatcher would, bypassing the platform-divergent
fit-time path. That test asserts both aggregate SE (< 1e-8 vs R) and
per-draw τ (< 1e-8 elementwise vs R), which is strictly stronger
than the prior ``1e-12`` capture-vs-capture gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
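
A minimal sketch of the split gate described above, using the two platform values quoted in these commit messages (the helper name is hypothetical):

```python
import pytest

def check_se_baseline(se: float, method: str, baseline: float) -> None:
    # Placebo is a drift detector across BLAS backends: warm-started FW
    # amplifies sub-ULP input differences over 200 draws. Bootstrap and
    # jackknife keep the near-bit-identity gate.
    rel = 1e-7 if method == "placebo" else 1e-14
    assert se == pytest.approx(baseline, rel=rel)

# Linux/OpenBLAS CI baseline vs macOS/Accelerate local value: ~3e-9 apart
# relatively, inside the placebo gate but far outside the old 1e-12 one.
check_se_baseline(0.293840360160448, "placebo", 0.2938403592163006)
```
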
@igerber igerber merged commit 8cdf75e into main Apr 25, 2026
19 checks passed
@igerber igerber deleted the sdid-placebo-r-parity branch April 25, 2026 15:14