1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -8,6 +8,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
## [Unreleased]

### Added
- **`HeterogeneousAdoptionDiD` mass-point `survey=` / `weights=` + event-study `aggregate="event_study"` survey composition + multiplier-bootstrap sup-t simultaneous confidence band (Phase 4.5 B).** Closes the two Phase 4.5 A `NotImplementedError` gates: `design="mass_point" + weights/survey` and `aggregate="event_study" + weights/survey`. Weighted 2SLS sandwich in `_fit_mass_point_2sls` follows the Wooldridge 2010 Ch. 12 pweight convention (`w²` in the HC1 meat, `w·u` in the CR1 cluster score, weighted bread `Z'WX`); HC1 and CR1 ("stata" `se_type`) bit-parity with `estimatr::iv_robust(..., weights=, clusters=)` at `atol=1e-10` (new cross-language golden at `benchmarks/data/estimatr_iv_robust_golden.json`, generated by `benchmarks/R/generate_estimatr_iv_robust_golden.R`; `estimatr` added to `benchmarks/R/requirements.R`). `_fit_mass_point_2sls` gains `weights=` + `return_influence=` kwargs and now always returns a 3-tuple `(beta, se, psi)` — `psi` is the per-unit IF on the β̂-scale scaled so `compute_survey_if_variance(psi, trivial_resolved) ≈ V_HC1[1,1]` at `atol=1e-10` (PR #359 IF scale convention applied uniformly; no `sum(psi²)` claims). Event-study per-horizon variance: `survey=` path composes Binder-TSL via `compute_survey_if_variance`; `weights=` shortcut uses the analytical weighted-robust SE (continuous: CCT-2014 `bc_fit.se_robust / |den|`; mass-point: weighted 2SLS pweight sandwich from `_fit_mass_point_2sls` — HC1 / classical / CR1). `survey_metadata` / `variance_formula` / `effective_dose_mean` populated in both regimes (previously hardcoded `None` at `had.py:3366`). New multiplier-bootstrap sup-t: `_sup_t_multiplier_bootstrap` reuses `diff_diff.bootstrap_utils.generate_survey_multiplier_weights_batch` for PSU-level draws with stratum centering + sqrt(n_h/(n_h-1)) small-sample correction + FPC scaling + lonely-PSU handling. 
On the `weights=` shortcut, sup-t calibration is routed through a synthetic trivial `ResolvedSurveyDesign` so the centered + small-sample-corrected branch fires uniformly — targets the analytical HC1 variance family (`compute_survey_if_variance(IF, trivial) ≈ V_HC1` per the PR #359 IF scale invariant) rather than the raw `sum(ψ²) = ((n-1)/n) · V_HC1` that unit-level Rademacher multipliers would produce on the HC1-scaled IF. Perturbations: `delta = weights @ IF` with NO `(1/n)` prefactor (matching `staggered_bootstrap.py:373` idiom), normalized by per-horizon analytical SE, `(1-alpha)`-quantile of the sup-t distribution. At H=1 the quantile reduces to `Φ⁻¹(1 − alpha/2) ≈ 1.96` up to MC noise (regression-locked by `TestSupTReducesToNormalAtH1`). `HeterogeneousAdoptionDiD.__init__` gains `n_bootstrap: int = 999` and `seed: Optional[int] = None` (CS-parity singular seed); `fit()` gains `cband: bool = True` (only consulted on weighted event-study). `HeterogeneousAdoptionDiDEventStudyResults` extended with `variance_formula`, `effective_dose_mean`, `cband_low`, `cband_high`, `cband_crit_value`, `cband_method`, `cband_n_bootstrap` (all `None` on unweighted fits); surfaced in `to_dict`, `to_dataframe`, `summary`, `__repr__`. Unweighted event-study with `cband=False` preserves pre-Phase 4.5 B numerical output bit-exactly (stability invariant, locked by regression tests). Zero-weight subpopulation convention carries over from PR #359 (filter for design decisions; preserve full ResolvedSurveyDesign for variance). Non-pweight SurveyDesigns (`aweight`, `fweight`, replicate designs) raise `NotImplementedError` on both new paths (reciprocal-guard discipline). Pretest surfaces (`qug_test`, `stute_test`, `yatchew_hr_test`, joint variants, `did_had_pretest_workflow`) remain unweighted in this release — Phase 4.5 C / C0. 
See `docs/methodology/REGISTRY.md` §HeterogeneousAdoptionDiD "Weighted 2SLS (Phase 4.5 B)", "Event-study survey composition", and "Sup-t multiplier bootstrap" for derivations and invariants.
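The sup-t calibration above can be sketched with unit-level multipliers. This is an illustrative simplification (function name `sup_t_critical_value` and iid standard-normal multipliers are assumptions of the sketch — the shipped `_sup_t_multiplier_bootstrap` draws PSU-level, stratum-centered survey multipliers with small-sample correction and FPC scaling):

```python
import numpy as np

def sup_t_critical_value(IF, se, alpha=0.05, n_boot=20000, seed=0):
    """Sup-t critical value from a multiplier bootstrap.

    IF : (n, H) per-unit influence functions, one column per horizon.
    se : (H,) analytical standard errors used to studentize.
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_boot, IF.shape[0]))
    delta = W @ IF                            # perturbations, no 1/n prefactor
    sup_t = np.abs(delta / se).max(axis=1)    # studentize, then sup over horizons
    return float(np.quantile(sup_t, 1.0 - alpha))
```

At H=1 each studentized draw is |N(0,1)| up to Monte Carlo noise, so the critical value collapses to roughly 1.96 — the reduction `TestSupTReducesToNormalAtH1` regression-locks.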
- **`HeterogeneousAdoptionDiD.fit(survey=..., weights=...)` on continuous-dose paths (Phase 4.5 survey support).** The `continuous_at_zero` (paper Design 1') and `continuous_near_d_lower` (Design 1 continuous-near-d̲) designs accept survey weights through two interchangeable kwargs: `weights=<array>` (pweight shortcut, weighted-robust SE from the CCT-2014 lprobust port) and `survey=SurveyDesign(weights, strata, psu, fpc)` (design-based inference via Binder-TSL variance using the existing `compute_survey_if_variance` helper at `diff_diff/survey.py:1802`). Point estimates match across both entry paths; SE diverges by design (pweight-only vs PSU-aggregated). `HeterogeneousAdoptionDiDResults.survey_metadata` is a repo-standard `SurveyMetadata` dataclass (weight_type / effective_n / design_effect / sum_weights / weight_range / n_strata / n_psu / df_survey); HAD-specific extras (`variance_formula` label, `effective_dose_mean`) are separate top-level result fields. `to_dict()` surfaces the full `SurveyMetadata` object plus `variance_formula` + `effective_dose_mean`; `summary()` renders `variance_formula`, `effective_n`, `effective_dose_mean`, and (when the survey= path is used) `df_survey`; `__repr__` surfaces `variance_formula` + `effective_dose_mean` when present. The HAD `mass_point` design and `aggregate="event_study"` path raise `NotImplementedError` under survey/weights (deferred to Phase 4.5 B: weighted 2SLS + event-study survey composition); the HAD pretests stay unweighted in this release (Phase 4.5 C). Parity ceiling acknowledged — no public weighted-CCF bias-corrected local-linear reference exists in any language; methodology confidence comes from (1) uniform-weights bit-parity at `atol=1e-14` on the full lprobust output struct, (2) cross-language weighted-OLS parity (manual R reference) at `atol=1e-12`, and (3) Monte Carlo oracle consistency on known-τ DGPs. 
`_nprobust_port.lprobust` gains `weights=` and `return_influence=` (used internally by the Binder-TSL path); `bias_corrected_local_linear` removes the Phase 1c `NotImplementedError` on `weights=` and forwards. Auto-bandwidth selection remains unweighted in this release — pass `h`/`b` explicitly for weight-aware bandwidths. See `docs/methodology/REGISTRY.md` §HeterogeneousAdoptionDiD "Weighted extension (Phase 4.5 survey support)".
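For orientation, the Binder-TSL composition reduces to the textbook linearization variance: aggregate per-unit influence-function contributions to PSU totals, then take the between-PSU variance within each stratum. A minimal sketch under that assumption (illustrative `tsl_if_variance`; the repo's `compute_survey_if_variance` additionally handles FPC, lonely PSUs, and the zero-weight-subpopulation convention):

```python
import numpy as np

def tsl_if_variance(psi, strata, psu):
    """Taylor-series-linearization variance of an IF-defined estimator.

    psi : (n,) per-unit influence-function contributions.
    strata, psu : (n,) integer stratum and PSU labels.
    """
    var = 0.0
    for s in np.unique(strata):
        in_s = strata == s
        ids = psu[in_s]
        # PSU totals of the IF within this stratum.
        totals = np.array([psi[in_s][ids == j].sum() for j in np.unique(ids)])
        n_h = totals.size
        # Between-PSU variance with the n_h / (n_h - 1) small-sample factor.
        var += n_h / (n_h - 1.0) * ((totals - totals.mean()) ** 2).sum()
    return var
```

With one stratum and each unit its own PSU this reduces to `n/(n-1)` times the centered sum of squares of `psi` — the trivial-design limit the PR #359 IF scale convention calibrates against.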
- **`stute_joint_pretest`, `joint_pretrends_test`, `joint_homogeneity_test` + `StuteJointResult`** (HeterogeneousAdoptionDiD Phase 3 follow-up). Joint Cramér-von Mises pretests across K horizons with shared-η Mammen wild bootstrap (preserves vector-valued empirical-process unit-level dependence per Delgado-Manteiga 2001 / Hlávka-Hušková 2020). The core `stute_joint_pretest` is residuals-in; two thin data-in wrappers construct per-horizon residuals for the two nulls the paper spells out: mean-independence (step 2 pre-trends, `OLS(Y_t − Y_base ~ 1)` per pre-period) and linearity (step 3 joint, `OLS(Y_t − Y_base ~ 1 + D)` per post-period). Sum-of-CvMs aggregation (`S_joint = Σ_k S_k`); per-horizon scale-invariant exact-linear short-circuit. Closes the paper Section 4.2 step-2 gap that Phase 3 `did_had_pretest_workflow` previously flagged with an "Assumption 7 pre-trends test NOT run" caveat. See `docs/methodology/REGISTRY.md` §HeterogeneousAdoptionDiD "Joint Stute tests" for algorithm, invariants, and scope exclusion of Eq 18 linear-trend detrending (deferred to Phase 4 Pierce-Schott replication).
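The shared-η construction can be sketched as follows (illustrative `joint_cvm_test` with a simplified CvM norm of the dose-ordered cumulative residual process; the shipped `stute_joint_pretest` is residuals-in and adds the per-horizon scale-invariant exact-linear short-circuit):

```python
import numpy as np

def joint_cvm_test(resid, dose, n_boot=499, seed=0):
    """Joint CvM test with a shared-eta Mammen wild bootstrap.

    resid : (n, K) per-horizon residuals; dose : (n,) dose variable.
    S_joint = sum_k S_k, where S_k is a CvM norm of the dose-ordered
    cumulative (marked) residual process for horizon k.
    """
    rng = np.random.default_rng(seed)
    n, K = resid.shape
    R = resid[np.argsort(dose)]            # order marks by dose

    def s_joint(M):
        cum = np.cumsum(M, axis=0) / np.sqrt(n)
        return float((cum ** 2).mean(axis=0).sum())

    stat = s_joint(R)
    # Mammen two-point multipliers: ONE eta per unit, reused across all
    # K horizons, preserving cross-horizon unit-level dependence.
    p = (np.sqrt(5) + 1) / (2 * np.sqrt(5))
    lo, hi = (1 - np.sqrt(5)) / 2, (1 + np.sqrt(5)) / 2
    boot = np.empty(n_boot)
    for b in range(n_boot):
        eta = np.where(rng.random(n) < p, lo, hi)
        boot[b] = s_joint(R * eta[:, None])
    return stat, float((boot >= stat).mean())
```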
- **`did_had_pretest_workflow(aggregate="event_study")`**: multi-period dispatch on balanced ≥3-period panels. Runs QUG at `F` + joint pre-trends Stute across earlier pre-periods + joint homogeneity-linearity Stute across post-periods. Step 2 closure requires ≥2 pre-periods; with only a single pre-period (the base `F-1`) `pretrends_joint=None` and the verdict flags the skip. Reuses the Phase 2b event-study panel validator (last-cohort auto-filter under staggered timing with `UserWarning`; `ValueError` when `first_treat_col=None` and the panel is staggered). The data-in wrappers `joint_pretrends_test` and `joint_homogeneity_test` also route through that same validator internally, so direct wrapper calls inherit the last-cohort filter and constant-post-dose invariant. `HADPretestReport` extended with `pretrends_joint`, `homogeneity_joint`, and `aggregate` fields; serialization methods (`summary`, `to_dict`, `to_dataframe`, `__repr__`) preserve the Phase 3 output bit-exactly on `aggregate="overall"` — no `aggregate` key, no header row, no schema drift — and only surface the new fields on `aggregate="event_study"`.
3 changes: 2 additions & 1 deletion TODO.md
@@ -92,7 +92,8 @@ Deferred items from PR reviews that were not addressed before merge.
| Clustered-DGP parity: Phase 1c's DGP 4 uses manual `h=b=0.3` to sidestep an nprobust-internal singleton-cluster bug in `lpbwselect.mse.dpi`'s pilot fits. Once nprobust ships a fix (or we derive one independently), add a clustered-auto-bandwidth parity test. | `benchmarks/R/generate_nprobust_lprobust_golden.R` | Phase 1c | Low |
| `HeterogeneousAdoptionDiD` joint cross-horizon covariance on event study: per-horizon SEs use INDEPENDENT sandwiches in Phase 2b (paper-faithful pointwise CIs per Pierce-Schott Figure 2). A follow-up could derive an IF-based stacking of per-horizon scores for joint cross-horizon inference (needed for joint hypothesis tests across event-time horizons). Block-bootstrap is a reasonable alternative. | `diff_diff/had.py::_fit_event_study` | Phase 2b | Low |
| `HeterogeneousAdoptionDiD` event-study staggered-timing beyond last cohort: Phase 2b auto-filters staggered panels to the last cohort per paper Appendix B.2. Earlier-cohort treatment effects are not identified by HAD; redirecting to `ChaisemartinDHaultfoeuille` / `did_multiplegt_dyn` is the paper's prescription. A full staggered HAD would require a different identification path (out of paper scope). | `diff_diff/had.py::_validate_had_panel_event_study` | Phase 2b | Low |
| `HeterogeneousAdoptionDiD` Phase 4.5 B: `survey=` / `weights=` on `design="mass_point"` (weighted 2SLS + weighted-sandwich variance; the Wooldridge 2010 Ch. 12 weighted-IV sandwich has a Stata `ivregress ... [pweight=...]` + R `AER::ivreg(weights=...)` parity anchor). Also ships `aggregate="event_study"` + survey/weights via per-horizon IPW + shared PSU multiplier bootstrap across horizons. This PR (Phase 4.5 A) raises `NotImplementedError` on both paths. | `diff_diff/had.py::_fit_mass_point_2sls`, `diff_diff/had.py::_fit_event_study` | Phase 4.5 B | Medium |
| `HeterogeneousAdoptionDiD` joint cross-horizon analytical covariance on the weighted event-study path: Phase 4.5 B ships multiplier-bootstrap sup-t simultaneous CIs on the weighted event-study path but pointwise analytical variance is still independent across horizons. A follow-up could derive the full H × H analytical covariance from the per-horizon IF matrix (`Psi.T @ Psi` under survey weighting) for an analytical alternative to the bootstrap. Would also let the unweighted event-study path ship a sup-t band. | `diff_diff/had.py::_fit_event_study` | follow-up | Low |
| `HeterogeneousAdoptionDiD` unweighted event-study sup-t band: Phase 4.5 B ships sup-t only on the WEIGHTED event-study path (to preserve pre-PR bit-exact output on unweighted). Extending sup-t to unweighted event-study (either via the multiplier bootstrap with unit-level iid multipliers or via analytical joint cross-horizon covariance) is a symmetric follow-up. | `diff_diff/had.py::_fit_event_study` | follow-up | Low |
| `HeterogeneousAdoptionDiD` Phase 4.5 C0: QUG-under-survey decision gate. `qug_test` uses a ratio of extreme order statistics `D_{(1)} / (D_{(2)} - D_{(1)})` — extreme-value theory under inverse-probability weighting is a research area, not a standard toolkit. Lit-review Guillou-Hall (2001), Chen-Chen (2004); likely outcome is `NotImplementedError` on `qug_test(..., weights=...)` with a clear pointer to the Stute/Yatchew/joint pretests as the survey-supported alternatives. | `diff_diff/had_pretests.py::qug_test` | Phase 4.5 C0 | Low |
| `HeterogeneousAdoptionDiD` Phase 4.5 C: pretests under survey (`stute_test`, `yatchew_hr_test`, `stute_joint_pretest`, `joint_pretrends_test`, `joint_homogeneity_test`, `did_had_pretest_workflow`). Rao-Wu rescaled bootstrap for the Stute-family (weighted η generation + PSU clustering in the bootstrap draw); weighted OLS residuals + weighted variance estimator for Yatchew. | `diff_diff/had_pretests.py` | Phase 4.5 C | Medium |
| `HeterogeneousAdoptionDiD` Phase 4.5: weight-aware auto-bandwidth MSE-DPI selector. Phase 4.5 A ships weighted `lprobust` with an unweighted DPI selector; users who want a weight-aware bandwidth must pass `h`/`b` explicitly. Extending `lpbwselect_mse_dpi` to propagate weights through density, second-derivative, and variance stages is ~300 LoC of methodology and was out of scope. | `diff_diff/_nprobust_port.py::lpbwselect_mse_dpi` | Phase 4.5 | Low |
182 changes: 182 additions & 0 deletions benchmarks/R/generate_estimatr_iv_robust_golden.R
@@ -0,0 +1,182 @@
# Generate cross-language weighted-2SLS parity fixture for HAD Phase 4.5 B
# (mass-point + weights).
#
# Purpose: validate ``_fit_mass_point_2sls(..., weights=...)`` against
# ``estimatr::iv_robust(y ~ d | Z, weights=w, se_type=...)`` bit-exactly.
# estimatr's HC1 sandwich and Stata-style CR1 under pweights match the
# Wooldridge 2010 Ch. 12 / Angrist-Pischke 4.1.3 pweight convention
# that the Python port implements (w² in the HC1 meat, w·u in the CR1
# cluster score, weighted bread Z'WX). estimatr's classical SE uses a
# different DOF / projection convention and is skipped in parity tests
# (documented deviation; diverges by O(1/n) at non-uniform weights).
#
# Usage:
# Rscript benchmarks/R/generate_estimatr_iv_robust_golden.R
#
# Output:
# benchmarks/data/estimatr_iv_robust_golden.json
#
# Phase 4.5 B of HeterogeneousAdoptionDiD (de Chaisemartin et al. 2026).
# Python test loader: tests/test_estimatr_iv_robust_parity.py.

library(jsonlite)
library(estimatr)

stopifnot(packageVersion("estimatr") >= "1.0")

# -------------------------------------------------------------------------
# DGP builders
# -------------------------------------------------------------------------

dgp_mass_point <- function(n, seed, weight_pattern = "uniform",
                           include_cluster = FALSE, d_lower = 0.3) {
  # Mass-point dose: a fraction at d_lower, rest uniform(d_lower, 1).
  set.seed(seed)
  n_mass <- round(0.15 * n)
  d_mass <- rep(d_lower, n_mass)
  d_cont <- runif(n - n_mass, d_lower, 1.0)
  d <- c(d_mass, d_cont)
  # Reshuffle to avoid ordered-by-dose artifacts.
  perm <- sample.int(n)
  d <- d[perm]

  # True DGP: dy = 2 * d + 0.3 * d^2 + eps
  dy <- 2.0 * d + 0.3 * d^2 + rnorm(n, sd = 0.4)

  # Weights
  w <- switch(weight_pattern,
    "uniform" = rep(1.0, n),
    "mild" = 1.0 + 0.3 * rnorm(n),
    "informative" = 1.0 + 2.0 * abs(d - 0.5) + runif(n, 0, 0.2),
    "heavy_tailed" = pmax(0.1, rlnorm(n, meanlog = 0, sdlog = 0.5))
  )
  # Clip to positive.
  w <- pmax(w, 0.01)

  cluster <- if (include_cluster) sample.int(max(4, n %/% 20), n, replace = TRUE) else NULL

  list(d = d, dy = dy, w = w, cluster = cluster, d_lower = d_lower)
}

# -------------------------------------------------------------------------
# Fit weighted 2SLS with estimatr at specified se_type.
# -------------------------------------------------------------------------

fit_iv_robust <- function(dgp, se_type, use_cluster = FALSE) {
  d <- dgp$d
  dy <- dgp$dy
  w <- dgp$w
  Z <- as.integer(d > dgp$d_lower)
  df <- data.frame(d = d, dy = dy, Z = Z, w = w)
  if (use_cluster) df$cluster <- dgp$cluster

  fit <- if (use_cluster) {
    iv_robust(dy ~ d | Z, data = df, weights = w, clusters = cluster,
              se_type = se_type)
  } else {
    iv_robust(dy ~ d | Z, data = df, weights = w, se_type = se_type)
  }

  list(
    beta = as.numeric(coef(fit)["d"]),
    se = as.numeric(fit$std.error["d"]),
    # Intercept for manual sandwich verification.
    alpha = as.numeric(coef(fit)["(Intercept)"]),
    se_intercept = as.numeric(fit$std.error["(Intercept)"]),
    n = as.integer(nobs(fit)),
    se_type = se_type
  )
}

# -------------------------------------------------------------------------
# Build the DGP × se_type fixture grid.
# -------------------------------------------------------------------------

# Each DGP × se_type combination becomes one fixture entry. DGPs vary
# sample size, weight informativeness, and cluster structure so the
# Python test exercises all three sandwich variants (HC1, classical, CR1).
fixtures <- list()

dgps <- list(
  list(n = 200, seed = 42, weight = "uniform", cluster = FALSE, name = "uniform_n200"),
  list(n = 500, seed = 123, weight = "mild", cluster = FALSE, name = "mild_n500"),
  list(n = 500, seed = 7, weight = "informative", cluster = FALSE, name = "informative_n500"),
  list(n = 1000, seed = 321, weight = "heavy_tailed", cluster = FALSE, name = "heavy_n1000"),
  list(n = 600, seed = 99, weight = "informative", cluster = TRUE, name = "informative_cluster_n600")
)

# For the non-clustered DGPs, emit HC1 + classical entries (Python
# parity tests target HC1; classical deviates by O(1/n) and is recorded
# as a reference only). For the clustered DGP, emit the Stata-style CR1
# entry (matches `diff_diff/had.py::_fit_mass_point_2sls` CR1 convention
# bit-exactly; see Gate #0 verification in the Phase 4.5 B plan).
for (dgp_spec in dgps) {
  dgp <- dgp_mass_point(
    n = dgp_spec$n,
    seed = dgp_spec$seed,
    weight_pattern = dgp_spec$weight,
    include_cluster = dgp_spec$cluster
  )

  if (dgp_spec$cluster) {
    entry <- list(
      name = dgp_spec$name,
      n = dgp_spec$n,
      d_lower = dgp$d_lower,
      weight_pattern = dgp_spec$weight,
      seed = dgp_spec$seed,
      d = dgp$d,
      dy = dgp$dy,
      w = dgp$w,
      cluster = dgp$cluster,
      cr1 = fit_iv_robust(dgp, se_type = "stata", use_cluster = TRUE)
    )
  } else {
    entry <- list(
      name = dgp_spec$name,
      n = dgp_spec$n,
      d_lower = dgp$d_lower,
      weight_pattern = dgp_spec$weight,
      seed = dgp_spec$seed,
      d = dgp$d,
      dy = dgp$dy,
      w = dgp$w,
      cluster = NULL,
      hc1 = fit_iv_robust(dgp, se_type = "HC1", use_cluster = FALSE),
      classical = fit_iv_robust(dgp, se_type = "classical", use_cluster = FALSE)
    )
  }
  fixtures[[dgp_spec$name]] <- entry
}

# -------------------------------------------------------------------------
# Serialize
# -------------------------------------------------------------------------

out <- list(
  metadata = list(
    description = "estimatr::iv_robust weighted 2SLS parity fixture for HAD Phase 4.5 B",
    estimatr_version = as.character(packageVersion("estimatr")),
    r_version = as.character(getRversion()),
    n_dgps = length(fixtures),
    hc1_atol = 1e-10,
    cr1_atol = 1e-10,
    notes = paste(
      "HC1 (se_type='HC1') and CR1 (se_type='stata') under pweights match",
      "Python's _fit_mass_point_2sls bit-exactly at atol=1e-10. Classical",
      "(se_type='classical') uses estimatr's projection-form DOF convention",
      "(n-k + X_hat'WX_hat bread) which differs from Python's sandwich form",
      "(sum(w)-k + Z'W^2Z meat); included as a reference without parity",
      "assertion.",
      sep = " "
    )
  ),
  fixtures = fixtures
)

# Ensure output directory exists.
out_dir <- "benchmarks/data"
if (!dir.exists(out_dir)) dir.create(out_dir, recursive = TRUE)
out_path <- file.path(out_dir, "estimatr_iv_robust_golden.json")
write_json(out, path = out_path, digits = 17, auto_unbox = TRUE, null = "null")
message(sprintf("Wrote %d DGP fixtures to %s", length(fixtures), out_path))
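A Python-side sketch of the pweight sandwich this fixture targets — a simplified just-identified 2SLS assuming the convention stated in the header comment (weighted bread Z'WX, w² in the HC1 meat, n/(n-k) correction); `_fit_mass_point_2sls` remains the authoritative implementation:

```python
import numpy as np

def weighted_2sls_hc1(y, d, z, w):
    """Just-identified weighted 2SLS with a pweight HC1 sandwich."""
    n = y.size
    X = np.column_stack([np.ones(n), d])   # regressors [1, dose]
    Z = np.column_stack([np.ones(n), z])   # instruments [1, 1(d > d_lower)]
    A = Z.T @ (w[:, None] * X)             # weighted bread: Z'WX
    beta = np.linalg.solve(A, Z.T @ (w * y))
    u = y - X @ beta
    # HC1 meat with w^2: sum_i (w_i u_i)^2 z_i z_i'
    meat = (Z * ((w * u) ** 2)[:, None]).T @ Z
    Ainv = np.linalg.inv(A)
    V = n / (n - X.shape[1]) * (Ainv @ meat @ Ainv.T)
    return beta, np.sqrt(np.diag(V))
```

Note that both the point estimate and the HC1 SE are invariant to rescaling the weights by a constant (the bread scales by c, the meat by c², and the two inverse-bread factors cancel the c²) — a quick sanity check the golden-fixture parity tests can exploit.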
1 change: 1 addition & 0 deletions benchmarks/R/requirements.R
@@ -12,6 +12,7 @@ required_packages <- c(
"fixest", # Fast TWFE and basic DiD
"triplediff", # Ortiz-Villavicencio & Sant'Anna (2025) triple difference
"survey", # Lumley (2004) complex survey analysis
"estimatr", # Blair et al. (2019) weighted robust / IV SE (HAD mass-point parity)

# Utilities
"jsonlite", # JSON output for Python interop
1 change: 1 addition & 0 deletions benchmarks/data/estimatr_iv_robust_golden.json

Large diffs are not rendered by default.
