1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -8,6 +8,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
## [Unreleased]

### Added
- **`HeterogeneousAdoptionDiD` mass-point `survey=` / `weights=` + event-study `aggregate="event_study"` survey composition + multiplier-bootstrap sup-t simultaneous confidence band (Phase 4.5 B).** Closes the two Phase 4.5 A `NotImplementedError` gates: `design="mass_point" + weights/survey` and `aggregate="event_study" + weights/survey`. Weighted 2SLS sandwich in `_fit_mass_point_2sls` follows the Wooldridge 2010 Ch. 12 pweight convention (`w²` in the HC1 meat, `w·u` in the CR1 cluster score, weighted bread `Z'WX`); HC1 and CR1 ("stata" `se_type`) bit-parity with `estimatr::iv_robust(..., weights=, clusters=)` at `atol=1e-10` (new cross-language golden at `benchmarks/data/estimatr_iv_robust_golden.json`, generated by `benchmarks/R/generate_estimatr_iv_robust_golden.R`; `estimatr` added to `benchmarks/R/requirements.R`). `_fit_mass_point_2sls` gains `weights=` + `return_influence=` kwargs and now always returns a 3-tuple `(beta, se, psi)` — `psi` is the per-unit IF on the β̂-scale scaled so `compute_survey_if_variance(psi, trivial_resolved) ≈ V_HC1[1,1]` at `atol=1e-10` (PR #359 IF scale convention applied uniformly; no `sum(psi²)` claims). Event-study per-horizon variance: `survey=` path composes Binder-TSL via `compute_survey_if_variance`; `weights=` shortcut uses the analytical weighted-robust SE (continuous: CCT-2014 `bc_fit.se_robust / |den|`; mass-point: weighted 2SLS pweight sandwich from `_fit_mass_point_2sls` — HC1 / classical / CR1). `survey_metadata` / `variance_formula` / `effective_dose_mean` populated in both regimes (previously hardcoded `None` at `had.py:3366`). New multiplier-bootstrap sup-t: `_sup_t_multiplier_bootstrap` reuses `diff_diff.bootstrap_utils.generate_survey_multiplier_weights_batch` for PSU-level draws with stratum centering + sqrt(n_h/(n_h-1)) small-sample correction + FPC scaling + lonely-PSU handling. 
On the `weights=` shortcut, sup-t calibration is routed through a synthetic trivial `ResolvedSurveyDesign` so the centered + small-sample-corrected branch fires uniformly — targets the analytical HC1 variance family (`compute_survey_if_variance(IF, trivial) ≈ V_HC1` per the PR #359 IF scale invariant) rather than the raw `sum(ψ²) = ((n-1)/n) · V_HC1` that unit-level Rademacher multipliers would produce on the HC1-scaled IF. Perturbations: `delta = weights @ IF` with NO `(1/n)` prefactor (matching `staggered_bootstrap.py:373` idiom), normalized by per-horizon analytical SE, `(1-alpha)`-quantile of the sup-t distribution. At H=1 the quantile reduces to `Φ⁻¹(1 − alpha/2) ≈ 1.96` up to MC noise (regression-locked by `TestSupTReducesToNormalAtH1`). `HeterogeneousAdoptionDiD.__init__` gains `n_bootstrap: int = 999` and `seed: Optional[int] = None` (CS-parity singular seed); `fit()` gains `cband: bool = True` (only consulted on weighted event-study). `HeterogeneousAdoptionDiDEventStudyResults` extended with `variance_formula`, `effective_dose_mean`, `cband_low`, `cband_high`, `cband_crit_value`, `cband_method`, `cband_n_bootstrap` (all `None` on unweighted fits); surfaced in `to_dict`, `to_dataframe`, `summary`, `__repr__`. Unweighted event-study with `cband=False` preserves pre-Phase 4.5 B numerical output bit-exactly (stability invariant, locked by regression tests). Zero-weight subpopulation convention carries over from PR #359 (filter for design decisions; preserve full ResolvedSurveyDesign for variance). Non-pweight SurveyDesigns (`aweight`, `fweight`, replicate designs) raise `NotImplementedError` on both new paths (reciprocal-guard discipline). Pretest surfaces (`qug_test`, `stute_test`, `yatchew_hr_test`, joint variants, `did_had_pretest_workflow`) remain unweighted in this release — Phase 4.5 C / C0. 
See `docs/methodology/REGISTRY.md` §HeterogeneousAdoptionDiD "Weighted 2SLS (Phase 4.5 B)", "Event-study survey composition", and "Sup-t multiplier bootstrap" for derivations and invariants.
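The sup-t calibration above can be sketched with unit-level multipliers. This is an illustrative simplification (function name `sup_t_critical_value` and iid standard-normal multipliers are assumptions of the sketch — the shipped `_sup_t_multiplier_bootstrap` draws PSU-level, stratum-centered survey multipliers with small-sample correction and FPC scaling):

```python
import numpy as np

def sup_t_critical_value(IF, se, alpha=0.05, n_boot=20000, seed=0):
    """Sup-t critical value from a multiplier bootstrap.

    IF : (n, H) per-unit influence functions, one column per horizon.
    se : (H,) analytical standard errors used to studentize.
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_boot, IF.shape[0]))
    delta = W @ IF                            # perturbations, no 1/n prefactor
    sup_t = np.abs(delta / se).max(axis=1)    # studentize, then sup over horizons
    return float(np.quantile(sup_t, 1.0 - alpha))
```

At H=1 each studentized draw is |N(0,1)| up to Monte Carlo noise, so the critical value collapses to roughly 1.96 — the reduction `TestSupTReducesToNormalAtH1` regression-locks.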
- **`HeterogeneousAdoptionDiD.fit(survey=..., weights=...)` on continuous-dose paths (Phase 4.5 survey support).** The `continuous_at_zero` (paper Design 1') and `continuous_near_d_lower` (Design 1 continuous-near-d̲) designs accept survey weights through two interchangeable kwargs: `weights=<array>` (pweight shortcut, weighted-robust SE from the CCT-2014 lprobust port) and `survey=SurveyDesign(weights, strata, psu, fpc)` (design-based inference via Binder-TSL variance using the existing `compute_survey_if_variance` helper at `diff_diff/survey.py:1802`). Point estimates match across both entry paths; SE diverges by design (pweight-only vs PSU-aggregated). `HeterogeneousAdoptionDiDResults.survey_metadata` is a repo-standard `SurveyMetadata` dataclass (weight_type / effective_n / design_effect / sum_weights / weight_range / n_strata / n_psu / df_survey); HAD-specific extras (`variance_formula` label, `effective_dose_mean`) are separate top-level result fields. `to_dict()` surfaces the full `SurveyMetadata` object plus `variance_formula` + `effective_dose_mean`; `summary()` renders `variance_formula`, `effective_n`, `effective_dose_mean`, and (when the survey= path is used) `df_survey`; `__repr__` surfaces `variance_formula` + `effective_dose_mean` when present. The HAD `mass_point` design and `aggregate="event_study"` path raise `NotImplementedError` under survey/weights (deferred to Phase 4.5 B: weighted 2SLS + event-study survey composition); the HAD pretests stay unweighted in this release (Phase 4.5 C). Parity ceiling acknowledged — no public weighted-CCF bias-corrected local-linear reference exists in any language; methodology confidence comes from (1) uniform-weights bit-parity at `atol=1e-14` on the full lprobust output struct, (2) cross-language weighted-OLS parity (manual R reference) at `atol=1e-12`, and (3) Monte Carlo oracle consistency on known-τ DGPs. 
`_nprobust_port.lprobust` gains `weights=` and `return_influence=` (used internally by the Binder-TSL path); `bias_corrected_local_linear` removes the Phase 1c `NotImplementedError` on `weights=` and forwards. Auto-bandwidth selection remains unweighted in this release — pass `h`/`b` explicitly for weight-aware bandwidths. See `docs/methodology/REGISTRY.md` §HeterogeneousAdoptionDiD "Weighted extension (Phase 4.5 survey support)".
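For orientation, the Binder-TSL composition reduces to the textbook linearization variance: aggregate per-unit influence-function contributions to PSU totals, then take the between-PSU variance within each stratum. A minimal sketch under that assumption (illustrative `tsl_if_variance`; the repo's `compute_survey_if_variance` additionally handles FPC, lonely PSUs, and the zero-weight-subpopulation convention):

```python
import numpy as np

def tsl_if_variance(psi, strata, psu):
    """Taylor-series-linearization variance of an IF-defined estimator.

    psi : (n,) per-unit influence-function contributions.
    strata, psu : (n,) integer stratum and PSU labels.
    """
    var = 0.0
    for s in np.unique(strata):
        in_s = strata == s
        ids = psu[in_s]
        # PSU totals of the IF within this stratum.
        totals = np.array([psi[in_s][ids == j].sum() for j in np.unique(ids)])
        n_h = totals.size
        # Between-PSU variance with the n_h / (n_h - 1) small-sample factor.
        var += n_h / (n_h - 1.0) * ((totals - totals.mean()) ** 2).sum()
    return var
```

With one stratum and each unit its own PSU this reduces to `n/(n-1)` times the centered sum of squares of `psi` — the trivial-design limit the PR #359 IF scale convention calibrates against.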
- **`stute_joint_pretest`, `joint_pretrends_test`, `joint_homogeneity_test` + `StuteJointResult`** (HeterogeneousAdoptionDiD Phase 3 follow-up). Joint Cramér-von Mises pretests across K horizons with shared-η Mammen wild bootstrap (preserves vector-valued empirical-process unit-level dependence per Delgado-Manteiga 2001 / Hlávka-Hušková 2020). The core `stute_joint_pretest` is residuals-in; two thin data-in wrappers construct per-horizon residuals for the two nulls the paper spells out: mean-independence (step 2 pre-trends, `OLS(Y_t − Y_base ~ 1)` per pre-period) and linearity (step 3 joint, `OLS(Y_t − Y_base ~ 1 + D)` per post-period). Sum-of-CvMs aggregation (`S_joint = Σ_k S_k`); per-horizon scale-invariant exact-linear short-circuit. Closes the paper Section 4.2 step-2 gap that Phase 3 `did_had_pretest_workflow` previously flagged with an "Assumption 7 pre-trends test NOT run" caveat. See `docs/methodology/REGISTRY.md` §HeterogeneousAdoptionDiD "Joint Stute tests" for algorithm, invariants, and scope exclusion of Eq 18 linear-trend detrending (deferred to Phase 4 Pierce-Schott replication).
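The shared-η construction can be sketched as follows (illustrative `joint_cvm_test` with a simplified CvM norm of the dose-ordered cumulative residual process; the shipped `stute_joint_pretest` is residuals-in and adds the per-horizon scale-invariant exact-linear short-circuit):

```python
import numpy as np

def joint_cvm_test(resid, dose, n_boot=499, seed=0):
    """Joint CvM test with a shared-eta Mammen wild bootstrap.

    resid : (n, K) per-horizon residuals; dose : (n,) dose variable.
    S_joint = sum_k S_k, where S_k is a CvM norm of the dose-ordered
    cumulative (marked) residual process for horizon k.
    """
    rng = np.random.default_rng(seed)
    n, K = resid.shape
    R = resid[np.argsort(dose)]            # order marks by dose

    def s_joint(M):
        cum = np.cumsum(M, axis=0) / np.sqrt(n)
        return float((cum ** 2).mean(axis=0).sum())

    stat = s_joint(R)
    # Mammen two-point multipliers: ONE eta per unit, reused across all
    # K horizons, preserving cross-horizon unit-level dependence.
    p = (np.sqrt(5) + 1) / (2 * np.sqrt(5))
    lo, hi = (1 - np.sqrt(5)) / 2, (1 + np.sqrt(5)) / 2
    boot = np.empty(n_boot)
    for b in range(n_boot):
        eta = np.where(rng.random(n) < p, lo, hi)
        boot[b] = s_joint(R * eta[:, None])
    return stat, float((boot >= stat).mean())
```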
- **`did_had_pretest_workflow(aggregate="event_study")`**: multi-period dispatch on balanced ≥3-period panels. Runs QUG at `F` + joint pre-trends Stute across earlier pre-periods + joint homogeneity-linearity Stute across post-periods. Step 2 closure requires ≥2 pre-periods; with only a single pre-period (the base `F-1`) `pretrends_joint=None` and the verdict flags the skip. Reuses the Phase 2b event-study panel validator (last-cohort auto-filter under staggered timing with `UserWarning`; `ValueError` when `first_treat_col=None` and the panel is staggered). The data-in wrappers `joint_pretrends_test` and `joint_homogeneity_test` also route through that same validator internally, so direct wrapper calls inherit the last-cohort filter and constant-post-dose invariant. `HADPretestReport` extended with `pretrends_joint`, `homogeneity_joint`, and `aggregate` fields; serialization methods (`summary`, `to_dict`, `to_dataframe`, `__repr__`) preserve the Phase 3 output bit-exactly on `aggregate="overall"` — no `aggregate` key, no header row, no schema drift — and only surface the new fields on `aggregate="event_study"`.
3 changes: 2 additions & 1 deletion TODO.md
@@ -92,7 +92,8 @@ Deferred items from PR reviews that were not addressed before merge.
| Clustered-DGP parity: Phase 1c's DGP 4 uses manual `h=b=0.3` to sidestep an nprobust-internal singleton-cluster bug in `lpbwselect.mse.dpi`'s pilot fits. Once nprobust ships a fix (or we derive one independently), add a clustered-auto-bandwidth parity test. | `benchmarks/R/generate_nprobust_lprobust_golden.R` | Phase 1c | Low |
| `HeterogeneousAdoptionDiD` joint cross-horizon covariance on event study: per-horizon SEs use INDEPENDENT sandwiches in Phase 2b (paper-faithful pointwise CIs per Pierce-Schott Figure 2). A follow-up could derive an IF-based stacking of per-horizon scores for joint cross-horizon inference (needed for joint hypothesis tests across event-time horizons). Block-bootstrap is a reasonable alternative. | `diff_diff/had.py::_fit_event_study` | Phase 2b | Low |
| `HeterogeneousAdoptionDiD` event-study staggered-timing beyond last cohort: Phase 2b auto-filters staggered panels to the last cohort per paper Appendix B.2. Earlier-cohort treatment effects are not identified by HAD; redirecting to `ChaisemartinDHaultfoeuille` / `did_multiplegt_dyn` is the paper's prescription. A full staggered HAD would require a different identification path (out of paper scope). | `diff_diff/had.py::_validate_had_panel_event_study` | Phase 2b | Low |
| `HeterogeneousAdoptionDiD` Phase 4.5 B: `survey=` / `weights=` on `design="mass_point"` (weighted 2SLS + weighted-sandwich variance; the Wooldridge 2010 Ch. 12 weighted-IV sandwich has a Stata `ivregress ... [pweight=...]` + R `AER::ivreg(weights=...)` parity anchor). Also ships `aggregate="event_study"` + survey/weights via per-horizon IPW + shared PSU multiplier bootstrap across horizons. This PR (Phase 4.5 A) raises `NotImplementedError` on both paths. | `diff_diff/had.py::_fit_mass_point_2sls`, `diff_diff/had.py::_fit_event_study` | Phase 4.5 B | Medium |
| `HeterogeneousAdoptionDiD` joint cross-horizon analytical covariance on the weighted event-study path: Phase 4.5 B ships multiplier-bootstrap sup-t simultaneous CIs on the weighted event-study path but pointwise analytical variance is still independent across horizons. A follow-up could derive the full H × H analytical covariance from the per-horizon IF matrix (`Psi.T @ Psi` under survey weighting) for an analytical alternative to the bootstrap. Would also let the unweighted event-study path ship a sup-t band. | `diff_diff/had.py::_fit_event_study` | follow-up | Low |
| `HeterogeneousAdoptionDiD` unweighted event-study sup-t band: Phase 4.5 B ships sup-t only on the WEIGHTED event-study path (to preserve pre-PR bit-exact output on unweighted). Extending sup-t to unweighted event-study (either via the multiplier bootstrap with unit-level iid multipliers or via analytical joint cross-horizon covariance) is a symmetric follow-up. | `diff_diff/had.py::_fit_event_study` | follow-up | Low |
| `HeterogeneousAdoptionDiD` Phase 4.5 C0: QUG-under-survey decision gate. `qug_test` uses a ratio of extreme order statistics `D_{(1)} / (D_{(2)} - D_{(1)})` — extreme-value theory under inverse-probability weighting is a research area, not a standard toolkit. Lit-review Guillou-Hall (2001), Chen-Chen (2004); likely outcome is `NotImplementedError` on `qug_test(..., weights=...)` with a clear pointer to the Stute/Yatchew/joint pretests as the survey-supported alternatives. | `diff_diff/had_pretests.py::qug_test` | Phase 4.5 C0 | Low |
| `HeterogeneousAdoptionDiD` Phase 4.5 C: pretests under survey (`stute_test`, `yatchew_hr_test`, `stute_joint_pretest`, `joint_pretrends_test`, `joint_homogeneity_test`, `did_had_pretest_workflow`). Rao-Wu rescaled bootstrap for the Stute-family (weighted η generation + PSU clustering in the bootstrap draw); weighted OLS residuals + weighted variance estimator for Yatchew. | `diff_diff/had_pretests.py` | Phase 4.5 C | Medium |
| `HeterogeneousAdoptionDiD` Phase 4.5: weight-aware auto-bandwidth MSE-DPI selector. Phase 4.5 A ships weighted `lprobust` with an unweighted DPI selector; users who want a weight-aware bandwidth must pass `h`/`b` explicitly. Extending `lpbwselect_mse_dpi` to propagate weights through density, second-derivative, and variance stages is ~300 LoC of methodology and was out of scope. | `diff_diff/_nprobust_port.py::lpbwselect_mse_dpi` | Phase 4.5 | Low |
182 changes: 182 additions & 0 deletions benchmarks/R/generate_estimatr_iv_robust_golden.R
@@ -0,0 +1,182 @@
# Generate cross-language weighted-2SLS parity fixture for HAD Phase 4.5 B
# (mass-point + weights).
#
# Purpose: validate ``_fit_mass_point_2sls(..., weights=...)`` against
# ``estimatr::iv_robust(y ~ d | Z, weights=w, se_type=...)`` bit-exactly.
# estimatr's HC1 sandwich and Stata-style CR1 under pweights match the
# Wooldridge 2010 Ch. 12 / Angrist-Pischke 4.1.3 pweight convention
# that the Python port implements (w² in the HC1 meat, w·u in the CR1
# cluster score, weighted bread Z'WX). estimatr's classical SE uses a
# different DOF / projection convention and is skipped in parity tests
# (documented deviation; diverges by O(1/n) at non-uniform weights).
#
# Usage:
# Rscript benchmarks/R/generate_estimatr_iv_robust_golden.R
#
# Output:
# benchmarks/data/estimatr_iv_robust_golden.json
#
# Phase 4.5 B of HeterogeneousAdoptionDiD (de Chaisemartin et al. 2026).
# Python test loader: tests/test_estimatr_iv_robust_parity.py.

library(jsonlite)
library(estimatr)

stopifnot(packageVersion("estimatr") >= "1.0")

# -------------------------------------------------------------------------
# DGP builders
# -------------------------------------------------------------------------

dgp_mass_point <- function(n, seed, weight_pattern = "uniform",
                           include_cluster = FALSE, d_lower = 0.3) {
  # Mass-point dose: a fraction at d_lower, rest uniform(d_lower, 1).
  set.seed(seed)
  n_mass <- round(0.15 * n)
  d_mass <- rep(d_lower, n_mass)
  d_cont <- runif(n - n_mass, d_lower, 1.0)
  d <- c(d_mass, d_cont)
  # Reshuffle to avoid ordered-by-dose artifacts.
  perm <- sample.int(n)
  d <- d[perm]

  # True DGP: dy = 2 * d + 0.3 * d^2 + eps
  dy <- 2.0 * d + 0.3 * d^2 + rnorm(n, sd = 0.4)

  # Weights
  w <- switch(weight_pattern,
    "uniform" = rep(1.0, n),
    "mild" = 1.0 + 0.3 * rnorm(n),
    "informative" = 1.0 + 2.0 * abs(d - 0.5) + runif(n, 0, 0.2),
    "heavy_tailed" = pmax(0.1, rlnorm(n, meanlog = 0, sdlog = 0.5))
  )
  # Clip to positive.
  w <- pmax(w, 0.01)

  cluster <- if (include_cluster) sample.int(max(4, n %/% 20), n, replace = TRUE) else NULL

  list(d = d, dy = dy, w = w, cluster = cluster, d_lower = d_lower)
}

# -------------------------------------------------------------------------
# Fit weighted 2SLS with estimatr at specified se_type.
# -------------------------------------------------------------------------

fit_iv_robust <- function(dgp, se_type, use_cluster = FALSE) {
  d <- dgp$d
  dy <- dgp$dy
  w <- dgp$w
  Z <- as.integer(d > dgp$d_lower)
  df <- data.frame(d = d, dy = dy, Z = Z, w = w)
  if (use_cluster) df$cluster <- dgp$cluster

  fit <- if (use_cluster) {
    iv_robust(dy ~ d | Z, data = df, weights = w, clusters = cluster,
              se_type = se_type)
  } else {
    iv_robust(dy ~ d | Z, data = df, weights = w, se_type = se_type)
  }

  list(
    beta = as.numeric(coef(fit)["d"]),
    se = as.numeric(fit$std.error["d"]),
    # Intercept for manual sandwich verification.
    alpha = as.numeric(coef(fit)["(Intercept)"]),
    se_intercept = as.numeric(fit$std.error["(Intercept)"]),
    n = as.integer(nobs(fit)),
    se_type = se_type
  )
}

# -------------------------------------------------------------------------
# Build the DGP × se_type fixture grid.
# -------------------------------------------------------------------------

# Each DGP × se_type combination becomes one fixture entry. DGPs vary
# sample size, weight informativeness, and cluster structure so the
# Python test exercises all three sandwich variants (HC1, classical, CR1).
fixtures <- list()

dgps <- list(
  list(n = 200, seed = 42, weight = "uniform", cluster = FALSE, name = "uniform_n200"),
  list(n = 500, seed = 123, weight = "mild", cluster = FALSE, name = "mild_n500"),
  list(n = 500, seed = 7, weight = "informative", cluster = FALSE, name = "informative_n500"),
  list(n = 1000, seed = 321, weight = "heavy_tailed", cluster = FALSE, name = "heavy_n1000"),
  list(n = 600, seed = 99, weight = "informative", cluster = TRUE, name = "informative_cluster_n600")
)

# For the non-clustered DGPs, emit HC1 + classical entries (Python
# parity tests target HC1; classical deviates by O(1/n) and is recorded
# as a reference only). For the clustered DGP, emit the Stata-style CR1
# entry (matches `diff_diff/had.py::_fit_mass_point_2sls` CR1 convention
# bit-exactly; see Gate #0 verification in the Phase 4.5 B plan).
for (dgp_spec in dgps) {
  dgp <- dgp_mass_point(
    n = dgp_spec$n,
    seed = dgp_spec$seed,
    weight_pattern = dgp_spec$weight,
    include_cluster = dgp_spec$cluster
  )

  if (dgp_spec$cluster) {
    entry <- list(
      name = dgp_spec$name,
      n = dgp_spec$n,
      d_lower = dgp$d_lower,
      weight_pattern = dgp_spec$weight,
      seed = dgp_spec$seed,
      d = dgp$d,
      dy = dgp$dy,
      w = dgp$w,
      cluster = dgp$cluster,
      cr1 = fit_iv_robust(dgp, se_type = "stata", use_cluster = TRUE)
    )
  } else {
    entry <- list(
      name = dgp_spec$name,
      n = dgp_spec$n,
      d_lower = dgp$d_lower,
      weight_pattern = dgp_spec$weight,
      seed = dgp_spec$seed,
      d = dgp$d,
      dy = dgp$dy,
      w = dgp$w,
      cluster = NULL,
      hc1 = fit_iv_robust(dgp, se_type = "HC1", use_cluster = FALSE),
      classical = fit_iv_robust(dgp, se_type = "classical", use_cluster = FALSE)
    )
  }
  fixtures[[dgp_spec$name]] <- entry
}

# -------------------------------------------------------------------------
# Serialize
# -------------------------------------------------------------------------

out <- list(
  metadata = list(
    description = "estimatr::iv_robust weighted 2SLS parity fixture for HAD Phase 4.5 B",
    estimatr_version = as.character(packageVersion("estimatr")),
    r_version = as.character(getRversion()),
    n_dgps = length(fixtures),
    hc1_atol = 1e-10,
    cr1_atol = 1e-10,
    notes = paste(
      "HC1 (se_type='HC1') and CR1 (se_type='stata') under pweights match",
      "Python's _fit_mass_point_2sls bit-exactly at atol=1e-10. Classical",
      "(se_type='classical') uses estimatr's projection-form DOF convention",
      "(n-k + X_hat'WX_hat bread) which differs from Python's sandwich form",
      "(sum(w)-k + Z'W^2Z meat); included as a reference without parity",
      "assertion.",
      sep = " "
    )
  ),
  fixtures = fixtures
)

# Ensure output directory exists.
out_dir <- "benchmarks/data"
if (!dir.exists(out_dir)) dir.create(out_dir, recursive = TRUE)
out_path <- file.path(out_dir, "estimatr_iv_robust_golden.json")
write_json(out, path = out_path, digits = 17, auto_unbox = TRUE, null = "null")
message(sprintf("Wrote %d DGP fixtures to %s", length(fixtures), out_path))
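A Python-side sketch of the pweight sandwich this fixture targets — a simplified just-identified 2SLS assuming the convention stated in the header comment (weighted bread Z'WX, w² in the HC1 meat, n/(n-k) correction); `_fit_mass_point_2sls` remains the authoritative implementation:

```python
import numpy as np

def weighted_2sls_hc1(y, d, z, w):
    """Just-identified weighted 2SLS with a pweight HC1 sandwich."""
    n = y.size
    X = np.column_stack([np.ones(n), d])   # regressors [1, dose]
    Z = np.column_stack([np.ones(n), z])   # instruments [1, 1(d > d_lower)]
    A = Z.T @ (w[:, None] * X)             # weighted bread: Z'WX
    beta = np.linalg.solve(A, Z.T @ (w * y))
    u = y - X @ beta
    # HC1 meat with w^2: sum_i (w_i u_i)^2 z_i z_i'
    meat = (Z * ((w * u) ** 2)[:, None]).T @ Z
    Ainv = np.linalg.inv(A)
    V = n / (n - X.shape[1]) * (Ainv @ meat @ Ainv.T)
    return beta, np.sqrt(np.diag(V))
```

Note that both the point estimate and the HC1 SE are invariant to rescaling the weights by a constant (the bread scales by c, the meat by c², and the two inverse-bread factors cancel the c²) — a quick sanity check the golden-fixture parity tests can exploit.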
1 change: 1 addition & 0 deletions benchmarks/R/requirements.R
@@ -12,6 +12,7 @@ required_packages <- c(
"fixest", # Fast TWFE and basic DiD
"triplediff", # Ortiz-Villavicencio & Sant'Anna (2025) triple difference
"survey", # Lumley (2004) complex survey analysis
"estimatr", # Blair et al. (2019) weighted robust / IV SE (HAD mass-point parity)

# Utilities
"jsonlite", # JSON output for Python interop
1 change: 1 addition & 0 deletions benchmarks/data/estimatr_iv_robust_golden.json

Large diffs are not rendered by default.
