Skip to content

TwoStageDiD: thread vcov_type as narrow {hc1} contract (Phase 1b interstitial #5, final)#498

Merged
igerber merged 4 commits into
mainfrom
feature/two-stage-did-vcov-type-phase1b
May 30, 2026
Merged

TwoStageDiD: thread vcov_type as narrow {hc1} contract (Phase 1b interstitial #5, final)#498
igerber merged 4 commits into
mainfrom
feature/two-stage-did-vcov-type-phase1b

Conversation

@igerber
Copy link
Copy Markdown
Owner

@igerber igerber commented May 30, 2026

Summary

  • Thread vcov_type="hc1" through TwoStageDiD (Phase 1b interstitial Add comprehensive code review for diff-diff library #5 — the final standalone estimator, completing the initiative across all 8). Accept only {"hc1"}; reject {classical, hc2, hc2_bm, conley} with GMM-meat-specific messages. Adds vcov_type / cluster_name / n_clusters + to_dict() to TwoStageDiDResults; summary() renders the unit-cluster CR1 label with bootstrap + survey suppression gates. Defensive bootstrap n_clusters<2 / n_psu<2 NaN guard; cluster= + replicate-weights raises NotImplementedError.
  • Report the post-drop effective cluster count in metadata (always-treated units are excluded before estimation), counted via np.unique(df[cluster_var]) so it matches the variance exactly — including always-treated drops and NaN cluster IDs.
  • Expose vcov_type explicitly on the two_stage_did() convenience wrapper (was hidden behind **kwargs), matching the sibling wrappers.
  • Docs: REGISTRY IF-vs-sandwich taxonomy → N=5 + TwoStageDiD **Note**; llms-full.txt signature; both checked-in autosummary RSTs; CHANGELOG; TODO (initiative marked complete + conley follow-up row).

Methodology references

  • Method name(s): TwoStageDiD — Gardner (2022) two-stage DiD, GMM cluster-sandwich (did2s).
  • Paper / source link(s): Gardner (2022) arXiv:2207.05943; Butts & Gardner (2022) did2s, R Journal 14(1).
  • Intentional deviations: vcov_type permanently narrow to {"hc1"} — the GMM-corrected meat S_g = γ̂'c_g − X_2g'ε_2g folds first-stage FE uncertainty into the score, so no single hat matrix spans both stages on which HC2 leverage / Bell-McCaffrey DOF can be defined (documented in REGISTRY.md IF-vs-sandwich taxonomy + TwoStageDiD section, mirroring the SpilloverDiD classical rejection). did2s no-FSA convention: the CR1 family label carries no (n-1)/(n-p) factor (**Note (deviation from R):**). vcov_type="conley" deferred (TODO.md).

Validation

  • Tests added/updated: tests/test_two_stage.py::TestTwoStageDiDVcovType (36 tests) — invalid-vcov rejection, hc1 no-op bit-equality across analytical / cluster / TSL-survey / replicate-survey / bootstrap paths (parametrized over aggregate modes), degenerate-bootstrap full public NaN propagation, post-drop + NaN-cluster metadata parity, summary labels, set_params fit-time revalidation, cluster=+replicate rejection, convenience-fn threading.
  • Evidence: ATT/SE bit-identical vs pre-PR baseline across default / cluster / bootstrap (incl. per-horizon + per-group SE). 4 local Codex rounds (full standard mode) converged clean (no actionable findings).

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

🤖 Generated with Claude Code

igerber and others added 4 commits May 30, 2026 07:23
…terstitial #5, final)

TwoStageDiD's variance is the Gardner/did2s two-stage GMM cluster-sandwich
(always clusters; default at unit) — a structural twin of ImputationDiD, NOT
the GMM×HC2-BM beast the tracker described (that was SpilloverDiD's helper).

Add vcov_type="hc1" accepting only {hc1}; reject {classical,hc2,hc2_bm} (the
GMM-corrected meat S_g = gamma_hat' c_g - X_2g' eps_2g folds in first-stage
uncertainty, so no single hat matrix spans both stages for HC2 leverage /
BM-DOF) and conley (deferred). Results gains vcov_type/cluster_name/n_clusters
+ to_dict(); summary() renders the unit-cluster CR1 label with bootstrap +
survey suppression gates. Bootstrap n_clusters<2 NaN guard (load-bearing,
post-drop perturbation count) + survey n_psu<2 defense. cluster= + replicate
weights raises NotImplementedError.

Docs: REGISTRY taxonomy -> N=5 + TwoStageDiD Note, llms-full signature, both
autosummary RSTs, CHANGELOG, TODO (initiative complete + conley follow-up).
34 new tests; ATT/SE bit-identical vs baseline across default/cluster/bootstrap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… P2)

n_clusters / cluster_name were derived from the full input `data`, but the GMM
sandwich computes variance over `cluster_ids=df[cluster_var]` on the POST-DROP
fit sample (always-treated units are removed before estimation). When an
always-treated unit/cluster is excluded, reporting the full-input count
overstates the effective G the SE is based on. Count clusters on `df` instead,
matching the variance. Survey suppression (cluster_name=None) is unchanged;
the Wave E.3 full-domain survey accounting is a separate, intentional path.

Adds a regression test asserting n_clusters equals the post-drop count when an
always-treated cohort is dropped.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ct tests (codex P3)

- two_stage_did(): expose vcov_type="hc1" explicitly (was hidden behind **kwargs)
  and forward it, matching the imputation_did/efficient_did sibling wrappers — the
  convenience API surface, generated signature, and IDE help now show the param.
- Degenerate-bootstrap tests now assert the FULL public NaN-propagation contract
  (overall t_stat/p_value/conf_int + every event-study/group inference field) via
  a shared _assert_full_bootstrap_nan helper, not just overall_se, so a partial
  regression in _build_nan_bootstrap_results can't slip through.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…(codex P2)

n_clusters used Series.nunique() (drops NaN), but the GMM sandwich counts
np.unique(cluster_ids) (keeps a single NaN group). A non-survey cluster= column
with missing IDs would make the reported G undercount the SE's actual cluster
count. Count clusters the same way the variance does — np.unique(df[cluster_var])
— which also consolidates the two non-survey branches and still excludes
always-treated-dropped units (df, not data). Adds a NaN-cluster regression test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

Overall assessment

✅ Looks good — I did not find any unmitigated P0/P1 issues in the changed TwoStageDiD estimator, bootstrap, or result-reporting paths.

Executive summary

  • Affected method: TwoStageDiD (Gardner 2022 / Butts & Gardner 2022). The source references in the estimator docstring and the Methodology Registry are aligned. diff_diff/two_stage.py L1297-L1308; docs/methodology/REGISTRY.md L1390-L1408
  • The new vcov_type={"hc1"} restriction is a documented methodology choice, not an undocumented deviation. Enforcement at __init__ and fit() matches the registry’s Gardner-GMM rationale. diff_diff/two_stage.py L3261-L3308; docs/methodology/REGISTRY.md L348-L357
  • The new bootstrap degeneracy guards correctly fail closed to all-NaN inference for n_clusters<2 / n_psu<2, and the NaNs are propagated through overall, event-study, and group outputs. diff_diff/two_stage_bootstrap.py L219-L403; diff_diff/two_stage.py L2019-L2065
  • Result metadata/reporting stays aligned with the actual inference path: default non-survey fits expose unit-cluster CR1 metadata, survey fits suppress cluster metadata, and bootstrap suppresses the analytical variance label. diff_diff/two_stage.py L2067-L2113; diff_diff/two_stage_results.py L231-L260; diff_diff/two_stage_results.py L456-L503
  • The remaining vcov_type="conley" gap is explicitly tracked as deferred work, so it is non-blocking for this PR. TODO.md L116-L116

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3 informational. Impact: vcov_type="conley" remains intentionally deferred, and that follow-up is already tracked in TODO.md, so this is not blocker-level debt for the current PR. Concrete fix: none in this PR; handle the tracked follow-up separately. TODO.md L116-L116; diff_diff/two_stage.py L3296-L3303

Security

  • No findings.

Documentation/Tests

  • Severity: P3 informational. Impact: The added tests cover the new public contract, summary/reporting behavior, replicate-weight rejection, degenerate-bootstrap NaN propagation, and metadata parity checks. I could not independently execute them in this sandbox because pytest and numpy are not installed here. Concrete fix: rely on CI or rerun tests/test_two_stage.py in the normal project environment. tests/test_two_stage.py L2019-L2280

@igerber
Copy link
Copy Markdown
Owner Author

igerber commented May 30, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 9f3368a7d96af6ff1dee5f22456996d442cb3147


Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

Executive Summary

  • Re-review scope: the prior AI review had no open P1+ items, and I did not find newly introduced P1+ regressions in the changed TwoStageDiD vcov-threading work.
  • Affected method: TwoStageDiD. The PR constrains the public vcov_type API to the already-implemented Gardner GMM sandwich, without changing point-estimation or analytical variance arithmetic. diff_diff/two_stage.py:L1259-L1268, diff_diff/two_stage.py:L3261-L3308, docs/methodology/REGISTRY.md:L348-L357, docs/methodology/REGISTRY.md:L1403-L1408
  • The new bootstrap degeneracy guards fail closed correctly: n_clusters<2 / n_psu<2 now return all-NaN bootstrap inference, and the downstream fit path propagates that to overall, event-study, and group outputs. diff_diff/two_stage_bootstrap.py:L219-L267, diff_diff/two_stage_bootstrap.py:L354-L403, diff_diff/two_stage.py:L2019-L2065, tests/test_two_stage.py:L1982-L2221
  • Results metadata now matches the actual variance path, including post-drop effective cluster counts and NaN-cluster parity. diff_diff/two_stage.py:L2067-L2088, diff_diff/two_stage_results.py:L231-L261, diff_diff/two_stage_results.py:L456-L503, tests/test_two_stage.py:L2225-L2280
  • Minor P3 only: the shared LLM guide still describes the variance-metadata surface as if TwoStageDiDResults were excluded. diff_diff/guides/llms-full.txt:L1286-L1292

Methodology

  • Severity: P3 informational. Impact: The affected method is TwoStageDiD; Gardner frames it as a two-stage procedure that estimates group/time effects from untreated observations in the first stage, then estimates ATT after removing those effects, and the did2s R Journal article presents the corresponding implementation with custom SE handling. This PR’s permanent vcov_type={"hc1"} restriction, CR1-at-unit labeling, and cluster= + replicate-weight rejection are all explicitly documented in the Methodology Registry, so I am not treating them as defects. Concrete fix: none. diff_diff/two_stage.py:L1297-L1308, diff_diff/two_stage.py:L1432-L1447, diff_diff/two_stage.py:L3261-L3308, docs/methodology/REGISTRY.md:L1403-L1408. citeturn1view0turn0view0

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3 informational. Impact: vcov_type="conley" is still deferred for TwoStageDiD, but this PR tracks that follow-up explicitly in TODO.md, so it is not blocker-level debt for the current change. Concrete fix: none in this PR; handle the tracked follow-up separately. diff_diff/two_stage.py:L3296-L3303, TODO.md:L116-L116

Security

  • No findings.

Documentation/Tests

  • Severity: P3. Impact: diff_diff/guides/llms-full.txt:L1286-L1292 still says only ImputationDiDResults and EfficientDiDResults carry the shared variance metadata, and its vcov_type description omits TwoStageDiD’s new permanently narrow contract. That leaves the internal guide partially stale after this PR’s public-surface update. Concrete fix: update that guide block to list TwoStageDiDResults among the results carrying vcov_type / cluster_name / n_clusters, and mention the Gardner-GMM {"hc1"} restriction.
  • No findings on changed-path test coverage. The added tests cover invalid-vcov rejection, wrapper threading, aggregate/cluster/survey/replicate/bootstrap parity, degenerate-bootstrap NaN propagation, and metadata parity in tests/test_two_stage.py:L2019-L2280. I could not execute pytest here because pytest is not installed in this environment.

@igerber igerber added the ready-for-ci Triggers CI test workflows label May 30, 2026
@igerber igerber merged commit b3dc8d0 into main May 30, 2026
33 of 34 checks passed
igerber added a commit that referenced this pull request May 30, 2026
…ed stale rows

Cleanup of TODO.md now that the vcov_type threading initiative is complete (all 8
standalone estimators merged, TwoStageDiD #498 last). TODO.md only — no methodology
or source changes.

Compact prune of the completed initiative's leftovers:
- the `| Done |` umbrella row + its orphan blank lines (rejoins the methodology
  table to its header), the Tier B threading bullet, and the stale duplicate
  TwoStageDiD-Conley row whose "`__init__` lacks `vcov_type`" premise is false post-#498
- the "Rows 104-105 LIFTED" comment block, the two ~~LIFTED~~ weighted-BM rows, and
  the Tier C LIFTED bullet (clubSandwich WLS-CR2 port, #475)
- two resolved-marker HTML comments (WooldridgeDiD cohort_share; PreTrendsPower)
- rewrote the Standard Error Consistency prose to "complete" and repointed its
  weighted-CR2 gate at the open multi-absorb row

Staleness audit of the ~50 remaining follow-up rows (5 subagents; every finding
re-verified against current source before acting -- the vast majority are genuine
open deferrals):
- removed the `bias_corrected_local_linear: weights=` row (shipped; residual
  unweighted-DPI gap already tracked by the sibling row) and narrowed the Tier D
  lprobust bullet's stale `weights` -> `weight-aware auto-bandwidth DPI`
- removed the `compute_survey_metadata`/`raw_w_for_meta` dedup row (done via the
  shared `survey._resolve_survey_for_fit` helper)
- tightened the HAD Phase-4.5-C survey-aware-pretests row: dropped the shipped
  pweight+PSU+FPC+strata narration, kept the two open items (replicate-weight
  designs; lonely_psu='adjust'+singleton on the Stute family)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request May 30, 2026
…ed stale rows

Cleanup of TODO.md now that the vcov_type threading initiative is complete (all 8
standalone estimators merged, TwoStageDiD #498 last). TODO.md only — no methodology
or source changes.

Compact prune of the completed initiative's leftovers:
- the `| Done |` umbrella row + its orphan blank lines (rejoins the methodology
  table to its header), the Tier B threading bullet, and the stale duplicate
  TwoStageDiD-Conley row whose "`__init__` lacks `vcov_type`" premise is false post-#498
- the "Rows 104-105 LIFTED" comment block, the two ~~LIFTED~~ weighted-BM rows, and
  the Tier C LIFTED bullet (clubSandwich WLS-CR2 port, #475)
- two resolved-marker HTML comments (WooldridgeDiD cohort_share; PreTrendsPower)
- rewrote the Standard Error Consistency prose to "complete" and repointed its
  weighted-CR2 gate at the open multi-absorb row

Staleness audit of the ~50 remaining follow-up rows (5 subagents; every finding
re-verified against current source before acting -- the vast majority are genuine
open deferrals):
- removed the `bias_corrected_local_linear: weights=` row (shipped; residual
  unweighted-DPI gap already tracked by the sibling row) and narrowed the Tier D
  lprobust bullet's stale `weights` -> `weight-aware auto-bandwidth DPI`
- removed the `compute_survey_metadata`/`raw_w_for_meta` dedup row (done via the
  shared `survey._resolve_survey_for_fit` helper)
- tightened the HAD Phase-4.5-C survey-aware-pretests row: dropped the shipped
  pweight+PSU+FPC+strata narration, kept the two open items (replicate-weight
  designs; lonely_psu='adjust'+singleton on the Stute family)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@igerber igerber deleted the feature/two-stage-did-vcov-type-phase1b branch May 31, 2026 18:42
@igerber igerber mentioned this pull request Jun 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-ci Triggers CI test workflows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant