3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [3.3.0] - 2026-04-25

### Fixed
- **`SyntheticDiD(variance_method="placebo")` SE now uses R-default warm-start** matching `synthdid:::placebo_se`. R's placebo loop seeds Frank-Wolfe per draw with `weights.boot$omega = sum_normalize(weights$omega[ind[1:N0_placebo]])` (fit-time ω subsetted + renormalized) and the fit-time `weights$lambda` — Python previously used a uniform cold start, producing finite-iter convergence-pattern drift on a handful of draws relative to R's reference SE. New `_placebo_variance_se` kwargs `init_omega` / `init_lambda` thread fit-time weights through the existing two-pass FW dispatcher; at the global FW optimum the result is init-independent (strictly convex objective), so the change is a finite-iter parity fix, not a methodology change. Existing placebo SE values shift by sub-percent on most panels; the bit-identity baseline pin in `TestScaleEquivariance::test_baseline_parity_small_scale[placebo]` was rebased from `0.29385822261006445` to `0.293840360160448`. New R-parity test `tests/test_methodology_sdid.py::TestJackknifeSERParity::test_placebo_se_matches_r` asserts the SE matches R's `vcov(method="placebo")` to within `1e-8` using R's exact permutation sequence (recorded by `benchmarks/R/generate_sdid_placebo_parity_fixture.R` into `tests/data/sdid_placebo_indices_r.json`). The `_placebo_indices` kwarg on `_placebo_variance_se` is the test seam; it is not part of the public API.
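
A minimal sketch of the warm-start pattern, assuming a generic `fit_fn` callable as a stand-in for the Frank-Wolfe SDiD fit (the library's real seam is the `init_omega` / `init_lambda` kwargs on `_placebo_variance_se`; names and signature here are illustrative only):

```python
import numpy as np

def placebo_se_warm_start(fit_fn, Y, N0, N1, T0, fit_omega, fit_lambda,
                          rng, replications=200):
    # Mirrors synthdid:::placebo_se with the R-default warm start.
    # fit_fn(Y, N0, T0, init_omega, init_lambda) -> tau is a hypothetical
    # stand-in for the two-pass Frank-Wolfe dispatcher.
    taus = np.empty(replications)
    for r in range(replications):
        ind = rng.permutation(N0)     # permute controls only (R: sample(1:N0, N0))
        n0_placebo = N0 - N1
        # R-default warm start: subset fit-time omega to the pseudo-control
        # rows, renormalize to sum to 1, and reuse fit-time lambda unchanged.
        omega0 = fit_omega[ind[:n0_placebo]]
        omega0 = omega0 / omega0.sum()
        taus[r] = fit_fn(Y[ind], n0_placebo, T0, omega0, fit_lambda)
    # sqrt((R-1)/R) * sd(ddof=1) is the population SD, matching R's
    # sqrt((replications - 1) / replications) * sd(taus).
    return float(np.sqrt((replications - 1) / replications) * np.std(taus, ddof=1))
```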

### Added
- **`qug_test` and `did_had_pretest_workflow` survey-aware NotImplementedError gates (Phase 4.5 C0 decision gate).** `qug_test(d, *, survey=None, weights=None)` and `did_had_pretest_workflow(..., *, survey=None, weights=None)` now accept the two kwargs as keyword-only with default `None`. Passing either as non-`None` raises `NotImplementedError` with an educational message naming the methodology rationale and pointing users to joint Stute (Phase 4.5 C, planned) as the survey-compatible alternative. Mutex guard on `survey=` + `weights=` mirrors `HeterogeneousAdoptionDiD.fit()` at `had.py:2890`. **QUG-under-survey is permanently deferred** — the test statistic uses extreme order statistics `D_{(1)}, D_{(2)}`, which are NOT smooth functionals of the empirical CDF, so standard survey machinery (Binder-TSL linearization, Rao-Wu rescaled bootstrap, Krieger-Pfeffermann (1997) EDF tests) does not yield a calibrated test; under cluster sampling the `Exp(1)/Exp(1)` limit law's independence assumption breaks; and the EVT-under-unequal-probability-sampling literature (Quintos et al. 2001, Beirlant et al.) addresses tail-index estimation, not boundary tests. The workflow's gate is **temporary** — Phase 4.5 C will close it for the linearity-family pretests with mechanism varying by test: Rao-Wu rescaled bootstrap for `stute_test` and the joint variants (`stute_joint_pretest`, `joint_pretrends_test`, `joint_homogeneity_test`); weighted OLS residuals + weighted variance estimator for `yatchew_hr_test` (Yatchew 1997 is a closed-form variance-ratio test, not bootstrap-based). Sister pretests (`stute_test`, `yatchew_hr_test`, `stute_joint_pretest`, `joint_pretrends_test`, `joint_homogeneity_test`) keep their closed signatures in this release — Phase 4.5 C will add kwargs and implementation together to avoid API churn. Unweighted `qug_test(d)` and `did_had_pretest_workflow(...)` calls are bit-exact pre-PR (kwargs are keyword-only after `*`; positional path unchanged). New tests at `tests/test_had_pretests.py::TestQUGTest` (5 rejection / mutex / message / regression tests) and the new `TestHADPretestWorkflowSurveyGuards` class (6 tests covering both kwarg paths, mutex, methodology pointer, both aggregate paths, and unweighted regression). See `docs/methodology/REGISTRY.md` § "QUG Null Test" — Note (Phase 4.5 C0) for the full methodology rationale plus a sketch of the (out-of-scope) theoretical bridge that combines endpoint-estimation EVT (Hall 1982, Aarssen-de Haan 1994, Hall-Wang 1999, Beirlant-de Wet-Goegebeur 2006), survey-aware functional CLTs (Boistard-Lopuhaä-Ruiz-Gazen 2017, Bertail-Chautru-Clémençon 2017), and tail-empirical-process theory (Drees 2003) — publishable methodology research, not engineering work. The gate shape is sketched after this list.
- **`HeterogeneousAdoptionDiD` mass-point `survey=` / `weights=` + event-study `aggregate="event_study"` survey composition + multiplier-bootstrap sup-t simultaneous confidence band (Phase 4.5 B).** Closes the two Phase 4.5 A `NotImplementedError` gates: `design="mass_point" + weights/survey` and `aggregate="event_study" + weights/survey`. Weighted 2SLS sandwich in `_fit_mass_point_2sls` follows the Wooldridge 2010 Ch. 12 pweight convention (`w²` in the HC1 meat, `w·u` in the CR1 cluster score, weighted bread `Z'WX`); HC1 and CR1 ("stata" `se_type`) bit-parity with `estimatr::iv_robust(..., weights=, clusters=)` at `atol=1e-10` (new cross-language golden at `benchmarks/data/estimatr_iv_robust_golden.json`, generated by `benchmarks/R/generate_estimatr_iv_robust_golden.R`; `estimatr` added to `benchmarks/R/requirements.R`). `_fit_mass_point_2sls` gains `weights=` + `return_influence=` kwargs and now always returns a 3-tuple `(beta, se, psi)` — `psi` is the per-unit IF on the β̂-scale scaled so `compute_survey_if_variance(psi, trivial_resolved) ≈ V_HC1[1,1]` at `atol=1e-10` (PR #359 IF scale convention applied uniformly; no `sum(psi²)` claims). Event-study per-horizon variance: `survey=` path composes Binder-TSL via `compute_survey_if_variance`; `weights=` shortcut uses the analytical weighted-robust SE (continuous: CCT-2014 `bc_fit.se_robust / |den|`; mass-point: weighted 2SLS pweight sandwich from `_fit_mass_point_2sls` — HC1 / classical / CR1). `survey_metadata` / `variance_formula` / `effective_dose_mean` populated in both regimes (previously hardcoded `None` at `had.py:3366`). New multiplier-bootstrap sup-t: `_sup_t_multiplier_bootstrap` reuses `diff_diff.bootstrap_utils.generate_survey_multiplier_weights_batch` for PSU-level draws with stratum centering + sqrt(n_h/(n_h-1)) small-sample correction + FPC scaling + lonely-PSU handling. On the `weights=` shortcut, sup-t calibration is routed through a synthetic trivial `ResolvedSurveyDesign` so the centered + small-sample-corrected branch fires uniformly — targets the analytical HC1 variance family (`compute_survey_if_variance(IF, trivial) ≈ V_HC1` per the PR #359 IF scale invariant) rather than the raw `sum(ψ²) = ((n-1)/n) · V_HC1` that unit-level Rademacher multipliers would produce on the HC1-scaled IF. Perturbations: `delta = weights @ IF` with NO `(1/n)` prefactor (matching `staggered_bootstrap.py:373` idiom), normalized by per-horizon analytical SE, `(1-alpha)`-quantile of the sup-t distribution. At H=1 the quantile reduces to `Φ⁻¹(1 − alpha/2) ≈ 1.96` up to MC noise (regression-locked by `TestSupTReducesToNormalAtH1`). `HeterogeneousAdoptionDiD.__init__` gains `n_bootstrap: int = 999` and `seed: Optional[int] = None` (CS-parity singular seed); `fit()` gains `cband: bool = True` (only consulted on weighted event-study). `HeterogeneousAdoptionDiDEventStudyResults` extended with `variance_formula`, `effective_dose_mean`, `cband_low`, `cband_high`, `cband_crit_value`, `cband_method`, `cband_n_bootstrap` (all `None` on unweighted fits); surfaced in `to_dict`, `to_dataframe`, `summary`, `__repr__`. Unweighted event-study with `cband=False` preserves pre-Phase 4.5 B numerical output bit-exactly (stability invariant, locked by regression tests). Zero-weight subpopulation convention carries over from PR #359 (filter for design decisions; preserve full ResolvedSurveyDesign for variance). Non-pweight SurveyDesigns (`aweight`, `fweight`, replicate designs) raise `NotImplementedError` on both new paths (reciprocal-guard discipline). 
Pretest surfaces (`qug_test`, `stute_test`, `yatchew_hr_test`, joint variants, `did_had_pretest_workflow`) remain unweighted in this release — Phase 4.5 C / C0. See `docs/methodology/REGISTRY.md` §HeterogeneousAdoptionDiD "Weighted 2SLS (Phase 4.5 B)", "Event-study survey composition", and "Sup-t multiplier bootstrap" for derivations and invariants.
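
A minimal sketch of the C0 gate shape from the `qug_test` entry above. The exact error messages and the mutex error type are assumptions, not the shipped text; only the keyword-only signature is confirmed by the entry:

```python
from typing import Any, Optional

def qug_test(d, *, survey: Optional[Any] = None, weights: Optional[Any] = None):
    # Mutex guard first, mirroring HeterogeneousAdoptionDiD.fit().
    if survey is not None and weights is not None:
        raise ValueError("Pass either `survey=` or `weights=`, not both.")
    # Phase 4.5 C0 decision gate: QUG-under-survey is permanently deferred.
    if survey is not None or weights is not None:
        raise NotImplementedError(
            "qug_test has no survey-calibrated form: its statistic is built "
            "from extreme order statistics D_(1), D_(2), which are not smooth "
            "functionals of the empirical CDF. Use the joint Stute pretest "
            "(Phase 4.5 C, planned) as the survey-compatible alternative."
        )
    ...  # unweighted statistic unchanged; positional path bit-exact pre-PR
```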
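And a hedged sketch of the sup-t calibration described in the event-study entry. `sup_t_critical_value` and `draw_weights_fn` are illustrative names, with `draw_weights_fn` standing in for `generate_survey_multiplier_weights_batch`'s PSU-level draws (stratum centering, the small-sample correction, FPC scaling, and lonely-PSU handling live inside that helper, not here):

```python
import numpy as np

def sup_t_critical_value(IF, se, draw_weights_fn, rng,
                         n_bootstrap=999, alpha=0.05):
    # IF: (n_units, H) per-unit influence functions on the beta-hat scale.
    # se: (H,) per-horizon analytical SEs (the HC1-family targets).
    sups = np.empty(n_bootstrap)
    for b in range(n_bootstrap):
        w = draw_weights_fn(rng)        # (n_units,) centered multiplier weights
        delta = w @ IF                  # NO (1/n) prefactor
        sups[b] = np.max(np.abs(delta / se))
    # (1 - alpha)-quantile of the sup-t distribution; at H = 1 this approaches
    # the two-sided normal critical value (~1.96 for alpha = 0.05) up to MC noise.
    return float(np.quantile(sups, 1.0 - alpha))

# Band per horizon h: cband_low/high = att_h - crit * se_h, att_h + crit * se_h.
```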
142 changes: 142 additions & 0 deletions benchmarks/R/generate_sdid_placebo_parity_fixture.R
@@ -0,0 +1,142 @@
#!/usr/bin/env Rscript
# Generate a fixture pinning R's `synthdid::vcov(method="placebo")` SE plus
# the per-replication permutations R consumed, so the Python R-parity test
# can feed those exact permutations through `_placebo_variance_se` and
# assert the SE matches at machine precision.
#
# Usage:
#   Rscript benchmarks/R/generate_sdid_placebo_parity_fixture.R
#
# Output:
#   tests/data/sdid_placebo_indices_r.json
#
# Symmetric with the existing jackknife R-parity test
# (TestJackknifeSERParity in tests/test_methodology_sdid.py:1410). Reuses
# the same Y matrix and (N0, N1, T0, T1) shape so the placebo + jackknife
# parity tests share an anchor panel.
#
# R version: 4.5.2; synthdid version: 0.0.9.

library(synthdid)
library(jsonlite)

# Reconstruct R's panel exactly as TestJackknifeSERParity does (set.seed(42),
# 23 units × 8 periods, treated = i > N0 with effect 5 in t > T0).
set.seed(42)
N0 <- 20
N1 <- 3
T0 <- 5
T1 <- 3
N <- N0 + N1
T <- T0 + T1
Y <- matrix(0, nrow = N, ncol = T)
for (i in 1:N) {
  unit_fe <- rnorm(1, sd = 2)
  for (t in 1:T) {
    Y[i, t] <- 10 + unit_fe + (t - 1) * 0.3 + rnorm(1, sd = 0.5)
    if (i > N0 && t > T0) Y[i, t] <- Y[i, t] + 5.0
  }
}

# Fit-time ATT (sanity check — must match TestJackknifeSERParity.R_ATT).
tau_hat <- synthdid_estimate(Y, N0, T0)
r_att <- as.numeric(tau_hat)

# Reproduce R's placebo_se loop exactly so we can record permutations and
# the per-rep tau alongside the resulting SE. Mirrors `synthdid:::placebo_se`
# (R/vcov.R), including the warm-start weights pass-through:
#
# theta = function(ind) {
#   N0 = length(ind) - N1
#   weights.boot = weights
#   weights.boot$omega = sum_normalize(weights$omega[ind[1:N0]])
#   do.call(synthdid_estimate, c(list(Y = setup$Y[ind, ],
#     N0 = N0, T0 = setup$T0, X = setup$X[ind, , ],
#     weights = weights.boot), opts))
# }
#
# The warm-start `weights.boot$omega` differs from a fresh uniform init
# at finite FW iterations and is what `vcov(method="placebo")` actually
# consumes — so reproducing it here is required for bit-identical SE.
opts_used <- attr(tau_hat, "opts")
fit_weights <- attr(tau_hat, "weights")
fit_setup <- attr(tau_hat, "setup")
replications <- 200

# Use a fresh seed for the placebo loop so the recorded permutations are
# independent of the fit-time RNG state. Python consumes the recorded
# permutations directly (no RNG-state matching needed).
set.seed(42)
perms <- vector("list", replications)
taus <- numeric(replications)

for (r in 1:replications) {
  ind <- sample(1:N0, N0)
  perms[[r]] <- ind
  N0_placebo <- N0 - N1
  weights_boot <- fit_weights
  weights_boot$omega <- synthdid:::sum_normalize(fit_weights$omega[ind[1:N0_placebo]])
  # IMPORTANT: R's `placebo_se` uses ONLY the N0 controls (subdivided into
  # N0-N1 pseudo-controls + N1 pseudo-treated). Real treated rows are NOT
  # included in the placebo Y matrix — that's what makes the placebo a
  # null-distribution test. ``Y = setup$Y[ind, ]`` is N0 rows; appending
  # the real treated rows (i.e., ``setup$Y[c(ind, (N0+1):N), ]``) would
  # change the test entirely (and produces SE ~0.132 instead of R's 0.226
  # — a 2× drift on this fixture).
  est_placebo <- do.call(
    synthdid_estimate,
    c(list(
      Y = fit_setup$Y[ind, ],
      N0 = N0_placebo,
      T0 = T0,
      X = fit_setup$X[ind, , ],
      weights = weights_boot
    ), opts_used)
  )
  taus[r] <- as.numeric(est_placebo)
}

r_placebo_se <- sqrt((replications - 1) / replications) * sd(taus)
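# (`sd()` divides by replications - 1; the sqrt((R - 1) / R) factor above
# rescales the sample SD to the population SD of the placebo taus, the
# scaling used inside `synthdid:::placebo_se`.)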

# Sanity check against R's vcov() entry point. With the warm-start pattern
# applied explicitly above, the manual loop and `vcov()` should produce
# the same SE up to MC noise on the seed sequence. Match isn't required
# for the parity test (we use `r_placebo_se` from our recorded
# permutations); both values are kept for transparency.
set.seed(42)
r_placebo_se_via_vcov <- sqrt(vcov(tau_hat, method = "placebo", replications = replications)[1, 1])

cat(sprintf("R ATT: %.15f\n", r_att))
cat(sprintf("R placebo SE (manual loop): %.15f\n", r_placebo_se))
cat(sprintf("R placebo SE (via vcov): %.15f\n", r_placebo_se_via_vcov))
cat(sprintf("Replications: %d\n", replications))

# Convert permutations to 0-indexed for Python (R uses 1-indexed).
perms_0indexed <- lapply(perms, function(p) as.integer(p - 1L))

payload <- list(
  metadata = list(
    R_version = paste(R.version$major, R.version$minor, sep = "."),
    synthdid_version = as.character(packageVersion("synthdid")),
    seed = 42L,
    replications = as.integer(replications),
    note = paste(
      "Permutations are 0-indexed for direct numpy consumption.",
      "R ATT, R placebo SE (manual loop), and per-rep taus are pinned",
      "for downstream Python parity assertion."
    )
  ),
  N0 = as.integer(N0),
  N1 = as.integer(N1),
  T0 = as.integer(T0),
  T1 = as.integer(T1),
  R_ATT = r_att,
  R_PLACEBO_SE = r_placebo_se,
  R_PLACEBO_SE_VIA_VCOV = r_placebo_se_via_vcov,
  R_PLACEBO_TAUS = as.numeric(taus),
  R_PERMUTATIONS = perms_0indexed
)

out_path <- "tests/data/sdid_placebo_indices_r.json"
write_json(payload, out_path, auto_unbox = TRUE, digits = 17)
cat(sprintf("\nWrote %s\n", out_path))