
Close SDID placebo R-parity gap: warm-start + R-anchored fixture + test seam #369

Merged

igerber merged 4 commits into main from sdid-placebo-r-parity on Apr 25, 2026

Conversation

@igerber igerber (Owner) commented Apr 25, 2026

Summary

  • Closes the last queued SDID R-parity follow-up (placebo SE matching synthdid::vcov(method="placebo")). Symmetric with the existing jackknife R-parity test in TestJackknifeSERParity.
  • Methodology fix: Python's _placebo_variance_se previously used uniform cold-start for per-draw Frank-Wolfe re-estimation; R's synthdid:::placebo_se (R/vcov.R) uses a warm-start weights.boot$omega = sum_normalize(weights$omega[ind[1:N0_placebo]]) (fit-time ω subsetted + renormalized) plus the fit-time weights$lambda. The two starts are identical at the global FW optimum (strictly convex objective), but the cold start produces finite-iter convergence-pattern drift on a handful of draws against R's reference.
  • Fix: add init_omega and init_lambda kwargs to _placebo_variance_se; the dispatcher passes fit-time unit_weights / time_weights; the per-draw FW calls thread these as init_weights (see the sketch after this list). Mirrors the bootstrap warm-start landed in PR #349 (Fix SyntheticDiD bootstrap p-value dispatch and SE formula).
  • R fixture: benchmarks/R/generate_sdid_placebo_parity_fixture.R records 200 per-rep permutations + R's placebo SE on the same Y matrix as TestJackknifeSERParity (R 4.5.2, synthdid 0.0.9). Output committed at tests/data/sdid_placebo_indices_r.json. R's manual loop and vcov(method="placebo") agree at machine precision when seeded (both 0.226342763355644).
  • Test seam: private _placebo_indices kwarg on _placebo_variance_se (underscore-prefixed, not public API). When supplied, each row replaces the per-draw rng.permutation(n_control) so Python can consume R's exact permutation sequence.
  • R-parity test: new test_placebo_se_matches_r in TestJackknifeSERParity intercepts the dispatcher's call to capture the normalized fit-time inputs, then re-invokes _placebo_variance_se with R's permutations through the seam. Asserts |py_se - r_se| < 1e-8 (Rust FW vs R FW differ at sub-ULP on the same warm-start; the tolerance covers BLAS reduction-order differences without masking real divergences).
  • Baseline rebase: TestScaleEquivariance::test_baseline_parity_small_scale[placebo] captured pre-warm-start SE 0.29385822261006445. New value 0.293840360160448 (sub-percent finite-iter shift). Updated with a comment documenting the warm-start change and pointer to the new R-parity test.
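
For concreteness, here is a minimal numpy sketch of the warm-start pattern described above. `placebo_draw_inits` is a hypothetical helper invented for illustration; per the commit message below, the actual plumbing goes through `compute_sdid_unit_weights` / `compute_time_weights` via their `init_weights` arguments.

```python
import numpy as np

def _sum_normalize(w: np.ndarray) -> np.ndarray:
    # R's sum_normalize: rescale non-negative weights to sum to 1.
    return w / w.sum()

def placebo_draw_inits(omega_fit, lambda_fit, perm, n0_placebo):
    # Hypothetical helper mirroring R's
    # weights.boot$omega = sum_normalize(weights$omega[ind[1:N0_placebo]]):
    # subset the fit-time unit weights to this draw's pseudo-controls,
    # renormalize, and reuse the fit-time time weights unchanged.
    init_omega = _sum_normalize(omega_fit[perm[:n0_placebo]])
    init_lambda = lambda_fit
    return init_omega, init_lambda

# Toy panel: 6 control units, 2 pseudo-treated per draw (N0_placebo = 4).
omega_fit = np.array([0.40, 0.30, 0.10, 0.10, 0.05, 0.05])
perm = np.array([2, 0, 5, 1, 4, 3])  # one 0-indexed permutation of the controls
init_omega, init_lambda = placebo_draw_inits(omega_fit, np.array([0.5, 0.5]), perm, 4)
print(init_omega)  # [0.1176 0.4706 0.0588 0.3529] -- renormalized, sums to 1
```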

Methodology references

  • Method name(s): SyntheticDiD._placebo_variance_se warm-start change (matching synthdid:::placebo_se in R/vcov.R).
  • Paper / source link(s):
    • Arkhangelsky, Athey, Hirshberg, Imbens & Wager (2021), American Economic Review 111(12) — Algorithm 4.
    • R reference: synthdid 0.0.9 source code (R/vcov.R::placebo_se), update.omega=TRUE, update.lambda=TRUE defaults with weights.boot$omega = sum_normalize(weights$omega[ind[1:N0_placebo]]) warm-start.
  • Any intentional deviations from the source: None on the warm-start contract — Python now matches R. The placebo survey-extension paths (PR #365, Restore SDID survey support for placebo and jackknife variance methods) are unchanged; the warm-start fix applies to the non-survey + pweight-only path that has an R reference.

Validation

  • Tests added/updated:
    • New: tests/test_methodology_sdid.py::TestJackknifeSERParity::test_placebo_se_matches_r — R-parity at < 1e-8.
    • Updated: tests/test_methodology_sdid.py::TestScaleEquivariance::test_baseline_parity_small_scale[placebo] baseline rebased from 0.29385822261006445 to 0.293840360160448 with documentation.
    • Existing 229 SDID + survey tests pass unchanged.
  • Backtest / simulation / notebook evidence:
    • R fixture generator: Rscript benchmarks/R/generate_sdid_placebo_parity_fixture.R produces R ATT = 4.980848860060929, R placebo SE = 0.226342763355644 (manual loop = via_vcov to 15 digits).

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Close SDID placebo R-parity gap: warm-start + R-anchored fixture + test seam

Closes the last queued SDID R-parity follow-up (per
``project_sdid_pr349_followups.md``, now removed from the memory
index — the work is shipped). Symmetric with the existing
``test_jackknife_se_matches_r`` anchor in TestJackknifeSERParity.

Methodology fix — placebo warm-start:
``synthdid:::placebo_se`` (R/vcov.R) seeds Frank-Wolfe per draw with
``weights.boot$omega = sum_normalize(weights$omega[ind[1:N0_placebo]])``
(fit-time ω subsetted + renormalized) and the fit-time
``weights$lambda``, then re-estimates with ``update.omega=TRUE,
update.lambda=TRUE``. Python's ``_placebo_variance_se`` previously
used uniform cold-start, producing finite-iter convergence-pattern
drift on a handful of draws relative to R's reference SE on the
same panel.

Fix: add ``init_omega`` and ``init_lambda`` kwargs to
``_placebo_variance_se``. The dispatcher now passes ``init_omega=
unit_weights, init_lambda=time_weights`` (fit-time outputs); the
loop seeds ``compute_sdid_unit_weights(init_weights=
_sum_normalize(init_omega[pseudo_control_idx]))`` and
``compute_time_weights(init_weights=init_lambda)`` per draw,
mirroring R's warm-start pattern. At the global FW optimum the
two starts are equivalent (strictly convex objective) — this is
a finite-iter parity fix, not a methodology change.

R-parity fixture + test seam:
* ``benchmarks/R/generate_sdid_placebo_parity_fixture.R`` — R 4.5.2
  + synthdid 0.0.9. Reuses the same Y matrix as
  ``TestJackknifeSERParity`` (same R_ATT = 4.980848860060929) so
  jackknife and placebo R-parity tests share an anchor panel.
  Records the 200 per-rep permutations R consumed and the SE from
  R's manual ``placebo_se`` loop (which matches
  ``vcov(method="placebo")`` to machine precision when seeded);
  permutations are 0-indexed for direct numpy consumption.
* ``tests/data/sdid_placebo_indices_r.json`` — committed fixture.
* ``_placebo_variance_se`` gains a private ``_placebo_indices``
  kwarg (underscore-prefixed, test-only). When supplied, each row
  replaces the per-draw ``rng.permutation(n_control)`` so a
  Python fit can consume R's exact permutation sequence and
  produce a bit-identical SE (see the sketch after this list).
* ``test_placebo_se_matches_r`` (in ``TestJackknifeSERParity``)
  intercepts the dispatcher's call to ``_placebo_variance_se`` to
  capture the normalized fit-time inputs, then re-invokes the
  method with R's permutations through the seam. Asserts
  ``|py_se - r_se| < 1e-8`` — Rust FW vs R FW differ at sub-ULP
  on the same warm-start; loose enough to absorb BLAS
  reduction-order noise, tight enough to catch real divergences.
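
A sketch of how a test can consume the committed fixture through the seam. The JSON key names below are assumptions for illustration; the text above guarantees only that 200 zero-indexed permutations and R's reference SE are recorded.

```python
import json
import numpy as np

with open("tests/data/sdid_placebo_indices_r.json") as fh:
    fixture = json.load(fh)

# Assumed key names; the fixture's real schema may differ.
r_indices = np.asarray(fixture["placebo_indices"])  # shape (200, n_control), 0-indexed
r_se = fixture["placebo_se"]                        # 0.226342763355644 per this PR

# Hypothetical re-invocation through the seam: each row of _placebo_indices
# stands in for one rng.permutation(n_control) call inside the helper.
#   se, taus = est._placebo_variance_se(..., init_omega=unit_weights,
#                                       init_lambda=time_weights,
#                                       _placebo_indices=r_indices)
#   assert abs(se - r_se) < 1e-8
```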

Baseline rebase:
``TestScaleEquivariance::test_baseline_parity_small_scale[placebo]``
captured pre-warm-start SE = 0.29385822261006445. New value is
0.293840360160448 (sub-percent shift). The test's bit-identity
contract is preserved per backend; baseline updated with a comment
documenting the warm-start change and pointer to the new R-parity
test that pins the post-fix value to R's reference. p-value (placebo
uses empirical formula, not analytical) is unchanged at
0.004975124378109453.

Verification: ``pytest tests/test_methodology_sdid.py
tests/test_survey_phase5.py -q`` → 230 passed (1 new R-parity
test; existing TestScaleEquivariance baseline rebased; all other
SDID + survey tests unchanged).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Overall Assessment

✅ Looks good

Executive Summary

  • Affected method: SyntheticDiD placebo variance path in fit() / _placebo_variance_se.
  • Methodology cross-check is favorable: official synthdid treats placebo variance as Algorithm 4 and, in placebo_se, copies the fit-time weights, renormalizes omega on the pseudo-control subset, and re-calls synthdid_estimate with that weight object, so the new init_omega / init_lambda plumbing matches the source implementation.
  • No unmitigated P0/P1 issues found in estimator math, weighting, SE computation, or inference wiring.
  • Remaining items are P3 only: one stale in-code methodology docstring and one gap in the new R-parity regression.
  • Validation note: static review only; this environment does not have pytest or numpy, so I could not execute the targeted tests locally.

Methodology

  • Severity P3. Impact: _placebo_variance_se still documents step 3 as re-estimating omega/lambda “from uniform initialization, fresh start,” which now contradicts both the implementation and the updated Registry entry. Because this method is an in-code methodology reference, the mismatch can mislead future reviewers about the actual placebo SE contract. Concrete fix: update the step-3 docstring and parameter docs to describe the warm-started omega / lambda behavior and note the private _placebo_indices seam. diff_diff/synthetic_did.py:L1705-L1738, docs/methodology/REGISTRY.md:L1680-L1681

Code Quality

  • No findings.

Performance

  • No findings from static review.

Maintainability

  • No findings beyond the stale methodology docstring above.

Tech Debt

  • No findings. TODO.md does not currently track a limitation introduced by this diff.

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: the new fixture already ships per-draw placebo outputs (R_PLACEBO_TAUS), but test_placebo_se_matches_r only asserts equality of the aggregate SE. A draw-level regression that preserves the sample SD would escape, even though the full placebo draw vector is user-visible through placebo_effects and feeds the empirical placebo p-value path. Concrete fix: in test_placebo_se_matches_r, compare the returned placebo_effects against R_PLACEBO_TAUS elementwise (or at minimum compare the vector-derived empirical p-value) in addition to the SE check; a minimal illustration follows below. tests/test_methodology_sdid.py:L1541-L1618, tests/data/sdid_placebo_indices_r.json:1, diff_diff/synthetic_did.py:L1164-L1170
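
To see why the SE-only gate is weak, note that the sample SD is invariant to reordering the draws; a minimal sketch with hypothetical tau vectors:

```python
import numpy as np

# Two placebo-tau vectors that differ draw-by-draw yet share the same
# sample SD: an SE-only assertion passes, an elementwise one fails.
py_taus = np.array([0.10, -0.20, 0.30, -0.40])
r_taus = np.array([-0.40, 0.30, -0.20, 0.10])  # same values, shuffled across draws

print(np.isclose(py_taus.std(ddof=1), r_taus.std(ddof=1)))  # True  (SE gate passes)
print(np.allclose(py_taus, r_taus, atol=1e-8))              # False (elementwise catches it)
```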

- _placebo_variance_se docstring step 3 now describes the warm-start
  semantics (was still saying "uniform initialization, fresh start"
  after PR #369 landed the warm-start). Adds Parameters entries for
  init_omega / init_lambda / _placebo_indices.
- test_placebo_se_matches_r now also asserts elementwise match between
  Python's placebo_effects and R_PLACEBO_TAUS from the fixture, so a
  permutation that diverged on a single draw but happened to leave
  sd() unchanged would still trip the regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber igerber (Owner, Author) commented Apr 25, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 09a3ef44472a5f069c083bfc2b674ff694830e04


Overall Assessment

✅ Looks good

Executive Summary

  • Re-review scope was the SyntheticDiD placebo SE path, specifically the new warm-start plumbing in SyntheticDiD.fit() / _placebo_variance_se.
  • The prior methodology-docstring gap is resolved: the in-code docstring now matches the implementation and the updated Registry entry. diff_diff/synthetic_did.py:L1705-L1761, docs/methodology/REGISTRY.md:L1680-L1683
  • The prior test gap is resolved: the new R-parity regression checks both the aggregate placebo SE and the full per-draw placebo effect vector, not just the SE. tests/test_methodology_sdid.py:L1537-L1636, tests/data/sdid_placebo_indices_r.json:1
  • No unmitigated P0/P1 issues found in estimator math, weighting, variance/SE, identification assumptions, or default behavior in the changed code.
  • Static review only: I could not run the targeted tests locally because this environment does not have pytest or numpy installed.

Methodology

  • No findings. The affected method is SyntheticDiD._placebo_variance_se, and the dispatcher now threads fit-time unit_weights / time_weights as Frank-Wolfe initializations while still re-estimating ω and λ per draw, which is consistent with the updated Registry entry and the committed R fixture generator. diff_diff/synthetic_did.py:L1145-L1158, diff_diff/synthetic_did.py:L1688-L1959, docs/methodology/REGISTRY.md:L1680-L1683, benchmarks/R/generate_sdid_placebo_parity_fixture.R:L42-L108

Code Quality

  • No findings. The new kwargs are internal to the private placebo helper and do not create a public-surface propagation problem.

Performance

  • No findings. The change reuses the existing FW helpers; I do not see a new asymptotic performance regression in the modified path. diff_diff/utils.py:L1617-L1694, diff_diff/utils.py:L1769-L1837

Maintainability

  • No findings. The stale methodology wording noted in the prior review has been corrected in the helper docstring, and the rebased baseline is documented inline. diff_diff/synthetic_did.py:L1705-L1761, tests/test_methodology_sdid.py:L2990-L3004

Tech Debt

  • No findings. I did not find a new limitation here that needed mitigation via TODO.md. TODO.md:L51-L112

Security

  • No findings.

Documentation/Tests

  • No findings. The earlier aggregate-only parity gap is closed by asserting elementwise parity against R_PLACEBO_TAUS, and the baseline pin was updated with an explanatory comment. tests/test_methodology_sdid.py:L1537-L1636, tests/test_methodology_sdid.py:L2990-L3004, tests/data/sdid_placebo_indices_r.json:1
  • Validation note: local execution was not possible in this environment because pytest and numpy are unavailable.

@igerber igerber added the ready-for-ci label (Triggers CI test workflows) Apr 25, 2026
igerber and others added 2 commits April 25, 2026 09:31
Swap the captured SE from 0.293840360160448 (macOS Apple Accelerate
local) to 0.2938403592163006 (Linux OpenBLAS, ubuntu-24.04-arm CI
runner). The warm-start that landed in the prior commit threads
``unit_weights`` (a fit-time FW output that carries sub-ULP BLAS
reduction-order divergence) into per-draw FW init; across 200 draws
with path-dependent sparsification the SE diverges ~1e-9 between the
two platforms — exceeding the existing ``1e-12`` bit-identity gate.

CI is the gating surface. macOS local fits will now drift at ~1e-9
on this fixture; the inline comment documents that the delta is
finite-iter FW path-dependence, not a numerical regression.

ATT and empirical p-value are unchanged: ATT comes from the
deterministic FW solve (platform-stable to <1e-14 on this panel) and
the placebo p-value is integer-driven (1/(B+1) = 1/201 with no
placebo τ exceeding |ATT|). See the p-value sketch below.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
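
The integer-driven p-value above follows the standard permutation convention; a minimal sketch assuming the add-one form, which is consistent with the quoted 1/(B+1) = 1/201:

```python
import numpy as np

def placebo_p_value(att: float, placebo_taus: np.ndarray) -> float:
    # (1 + #{ |tau_b| >= |ATT| }) / (B + 1); with B = 200 draws and no
    # placebo tau exceeding |ATT|, this is 1/201 = 0.004975124378109453.
    b = len(placebo_taus)
    return float(1 + np.sum(np.abs(placebo_taus) >= abs(att))) / (b + 1)

rng = np.random.default_rng(0)
taus = rng.normal(0.0, 0.3, size=200)  # toy placebo draws, all far below |ATT| = 4.98
print(placebo_p_value(4.98, taus))     # 0.004975124378109453
```
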
The placebo SE warm-start that landed in PR #369 threads ``unit_weights``
(a fit-time FW output that carries sub-ULP BLAS reduction-order
divergence) into each per-draw FW init. Across 200 placebo draws
with path-dependent sparsification in the 100-iter pre-sparsify
pass, that ULP-level input difference accumulates to ~1e-9 SE
divergence between Apple Accelerate (macOS) and OpenBLAS (Linux).
No single double satisfies both at the prior ``1e-12`` gate.

The placebo row's SE assertion is loosened to ``rel=1e-7`` (a
drift detector, not bit-identity; see the sketch below).
Bootstrap and jackknife stay at
``rel=1e-14``: bootstrap dilutes the divergence by resampling from
the full unit set with replacement; jackknife uses fixed weights
and no FW re-estimation.

Bit-identity protection for placebo moves to ``test_placebo_se_matches_r``
(``TestJackknifeSERParity``), which uses the ``_placebo_indices``
test seam to feed R's exact permutations through the same normalized
inputs the dispatcher would, bypassing the platform-divergent
fit-time path. That test asserts both aggregate SE (< 1e-8 vs R) and
per-draw τ (< 1e-8 elementwise vs R), which is strictly stronger
than the prior ``1e-12`` capture-vs-capture gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
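
A minimal sketch of the split gate described above, using the two platform values quoted in these commit messages (the helper name is hypothetical):

```python
import pytest

def check_se_baseline(se: float, method: str, baseline: float) -> None:
    # Placebo is a drift detector across BLAS backends: warm-started FW
    # amplifies sub-ULP input differences over 200 draws. Bootstrap and
    # jackknife keep the near-bit-identity gate.
    rel = 1e-7 if method == "placebo" else 1e-14
    assert se == pytest.approx(baseline, rel=rel)

# Linux/OpenBLAS CI baseline vs macOS/Accelerate local value: ~3e-9 apart
# relatively, inside the placebo gate but far outside the old 1e-12 one.
check_se_baseline(0.293840360160448, "placebo", 0.2938403592163006)
```
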
@igerber igerber merged commit 8cdf75e into main Apr 25, 2026
19 checks passed
@igerber igerber deleted the sdid-placebo-r-parity branch April 25, 2026 15:14