From a6c6aba685c2a21174c50818aeb36e44a56dceab Mon Sep 17 00:00:00 2001
From: igerber <isaac.gerber@gmail.com>
Date: Sat, 30 May 2026 06:38:50 -0400
Subject: [PATCH 1/4] two-stage-did: thread vcov_type as narrow {hc1} contract
 (Phase 1b interstitial #5, final)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

TwoStageDiD's variance is the Gardner/did2s two-stage GMM cluster-sandwich
(always clustered; unit-level by default), a structural twin of ImputationDiD,
NOT the GMM×HC2-BM composition the tracker described (that description fits
SpilloverDiD's helper).

Add vcov_type="hc1" accepting only {hc1}; reject {classical,hc2,hc2_bm} (the
GMM-corrected meat S_g = gamma_hat' c_g - X_2g' eps_2g folds in first-stage
uncertainty, so no single hat matrix spans both stages for HC2 leverage /
BM-DOF) and conley (deferred). TwoStageDiDResults gains vcov_type,
cluster_name, n_clusters, and to_dict(); summary() renders the unit-cluster
CR1 label, suppressed under bootstrap and under any survey design. The
bootstrap path gains an n_clusters<2 NaN guard (load-bearing; n_clusters is
the post-drop count of perturbed clusters) plus a survey n_psu<2
defense-in-depth guard. cluster= combined with replicate weights now raises
NotImplementedError.

Docs: REGISTRY taxonomy -> N=5 + TwoStageDiD Note, llms-full signature, both
autosummary RSTs, CHANGELOG, TODO (initiative complete + conley follow-up).
34 new tests; ATT/SE bit-identical vs baseline across default/cluster/bootstrap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
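Reviewer note (ignored by `git am`): a minimal, self-contained sketch of the validate-at-`__init__`-and-again-at-`fit()` pattern that this patch applies to `vcov_type`. `NarrowVcovEstimator`, its `set_params`, and `raises_value_error` are hypothetical stand-ins for illustration, not diff_diff's API, and the error messages are shortened. Only the three value sets mirror `_validate_vcov_type` in the patch.

```python
class NarrowVcovEstimator:
    """Hypothetical stand-in for an estimator with a narrow vcov_type contract."""

    _ACCEPTED = {"hc1"}
    _SANDWICH_INCOMPATIBLE = {"classical", "hc2", "hc2_bm"}
    _DEFERRED = {"conley"}

    def __init__(self, vcov_type="hc1"):
        # Eager check: bad constructor arguments fail immediately.
        self._validate_vcov_type(vcov_type)
        self.vcov_type = vcov_type

    @classmethod
    def _validate_vcov_type(cls, vcov_type):
        # Most specific rejection first, generic fallback last.
        if vcov_type in cls._SANDWICH_INCOMPATIBLE:
            raise ValueError(f"vcov_type={vcov_type!r} is rejected")
        if vcov_type in cls._DEFERRED:
            raise ValueError(f"vcov_type={vcov_type!r} is not yet supported")
        if vcov_type not in cls._ACCEPTED:
            raise ValueError(
                f"vcov_type={vcov_type!r} is invalid. Accepted: {sorted(cls._ACCEPTED)}."
            )

    def set_params(self, **params):
        # sklearn-style mutate-without-validating: the setter accepts anything.
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self):
        # Re-check at use time so a set_params mutation cannot slip through.
        self._validate_vcov_type(self.vcov_type)
        return self


def raises_value_error(fn):
    """Return True if calling fn() raises ValueError, else False."""
    try:
        fn()
    except ValueError:
        return True
    return False
```

Under this sketch, `NarrowVcovEstimator(vcov_type="hc2")` fails at construction, while `NarrowVcovEstimator().set_params(vcov_type="classical")` succeeds and only fails once `fit()` is called.
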
 CHANGELOG.md                                  |   1 +
 TODO.md                                       |   3 +-
 diff_diff/guides/llms-full.txt                |   1 +
 diff_diff/two_stage.py                        | 100 +++++++
 diff_diff/two_stage_bootstrap.py              | 108 ++++++-
 diff_diff/two_stage_results.py                |  91 ++++++
 .../_autosummary/diff_diff.TwoStageDiD.rst    |   1 +
 .../diff_diff.TwoStageDiDResults.rst          |   4 +
 docs/api/two_stage.rst                        |   1 +
 docs/methodology/REGISTRY.md                  |  11 +-
 tests/test_two_stage.py                       | 280 ++++++++++++++++++
 11 files changed, 591 insertions(+), 10 deletions(-)
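
Reviewer note (ignored by `git am`): a simplified sketch of why the degenerate-design bootstrap path returns NaN entries keyed like the analytical results rather than `None`. The function names and the nested-dict shape are assumptions made for illustration. The patch itself stores SEs, CIs, and p-values in separate dicts on `TwoStageBootstrapResults`, and the override loop lives in `TwoStageDiD.fit`.

```python
import math


def nan_inference_like(original):
    """Return NaN se/ci/p_value entries keyed like `original`, or None if empty."""
    if not original:
        return None
    nan = float("nan")
    return {key: {"se": nan, "ci": (nan, nan), "p_value": nan} for key in original}


def apply_bootstrap_override(effects, boot):
    """Overwrite analytical inference with bootstrap inference, key by key."""
    if boot is None:
        # Nothing to iterate over: the analytical values survive unchanged.
        return effects
    for key, entry in boot.items():
        effects[key].update(entry)
    return effects


# Hypothetical analytical event-study results for horizons 0 and 1.
analytical = {
    0: {"effect": 1.2, "se": 0.3},
    1: {"effect": 0.8, "se": 0.4},
}

# Keyed NaN results overwrite every horizon's SE but keep the point estimates.
failed_closed = apply_bootstrap_override(
    {h: dict(v) for h, v in analytical.items()},
    nan_inference_like(analytical),
)

# A bare None leaves the analytical SEs in place.
silently_kept = apply_bootstrap_override(
    {h: dict(v) for h, v in analytical.items()},
    None,
)
```

In this sketch `failed_closed` carries NaN standard errors for every horizon, whereas `silently_kept` still reports the analytical ones, which is the silent no-op the keyed-NaN builder is meant to avoid.
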
diff --git a/CHANGELOG.md b/CHANGELOG.md
index a4f5a2b2..3783708a 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -12,6 +12,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Added / Changed
 - **EfficientDiD `vcov_type` threading + Results metadata harmonization (Phase 1b interstitial #4, permanently narrow).** `EfficientDiD(vcov_type=...)` now accepts `{"hc1"}` only (default). Analytical-sandwich families `{classical, hc2, hc2_bm}` and `conley` are REJECTED at `__init__` / `set_params` with methodology-rooted messages — EfficientDiD uses influence-function-based variance per Chen-Sant'Anna-Xie (2025) achieving the semiparametric efficiency bound; the per-unit EIF aggregation has no single design matrix on which hat-matrix leverage or Bell-McCaffrey Satterthwaite DOF can be defined. `cluster=` (Liang-Zeger CR1 on cluster-aggregated EIF) and `survey_design=` (TSL on combined IF) paths are unchanged. **BC break on `EfficientDiDResults`:** the `cluster` field renamed to `cluster_name`; new `n_clusters` + `vcov_type` fields added; `to_dict()` method added (mirrors TripleDifferenceResults). `DiagnosticReport._pt_hausman` updated to read the renamed `cluster_name` field for the Hausman pretest replay (`diff_diff/diagnostic_report.py:2444`). `EfficientDiD.set_params(vcov_type=bad)` raises immediately rather than deferring to `fit()` — intentional eager-validation pattern matching EfficientDiD's existing handling of `pt_assumption`/`control_group` etc, diverging from `ImputationDiD`/`TripleDifference`/`CallawaySantAnna` (which use sklearn mutate-then-validate-at-use). Survey-PSU bootstrap path returns NaN SE when fewer than 2 independent PSUs are available (was ≈0 SE from BLAS roundoff). New summary block: `Variance estimator: <label>` line rendered after the survey block when not under bootstrap; suppressed under bootstrap (replaced with `Inference method: bootstrap` + `Bootstrap replications: <n>`). 
Default `cluster=None` (no survey) renders "HC1 heteroskedasticity-robust" — methodologically correct because the per-unit EIF SE `sqrt(mean(EIF²)/n)` is HC1-style (no Liang-Zeger G/(G-1) finite-sample correction); diverges from `ImputationDiD` which auto-clusters at unit per BJS Theorem 3.
+- **TwoStageDiD `vcov_type` threading + Results metadata (Phase 1b interstitial #5, final, permanently narrow).** `TwoStageDiD(vcov_type=...)` now accepts `{"hc1"}` only (default), completing the Phase 1b initiative across all 8 standalone estimators. Analytical-sandwich families `{classical, hc2, hc2_bm}` and `conley` are REJECTED at `__init__` / `fit()` with methodology-rooted messages: TwoStageDiD's variance is the Gardner (2022) two-stage GMM cluster-sandwich whose meat is the per-cluster GMM-corrected score `S_g = gamma_hat' c_g - X'_{2g} eps_{2g}`, which folds first-stage FE estimation uncertainty into the score — there is no single hat matrix spanning both stages on which HC2 leverage or Bell-McCaffrey Satterthwaite DOF can be defined, and the Gardner correction has not been derived for the leverage-corrected/homoskedastic meat (no reference implementation; mirrors the SpilloverDiD `classical` rejection). `cluster=` and `survey_design=` paths are numerically unchanged (bit-identical for healthy fits). **`TwoStageDiDResults` additions (no rename, no BC break):** new `vcov_type` / `cluster_name` / `n_clusters` fields + `to_dict()` method. `summary()` renders a `Variance estimator: <label>` line after the survey block (suppressed under bootstrap — `Inference method: bootstrap` + `Bootstrap replications: <n>` shown instead — and under any survey design). Default `cluster=None` renders `"CR1 cluster-robust at <unit>, G=<n_units>"` because the Gardner sandwich auto-clusters at the unit column (did2s no-FSA convention — the `CR1` label carries no `(n-1)/(n-p)` factor, matching R `did2s`; same convention as ImputationDiD's Theorem 3 variance). Defensive `n_clusters<2` NaN guard added to the multiplier-bootstrap path (was ≈0 SE from BLAS roundoff) plus a survey-PSU `n_psu<2` parity guard. `cluster=` with a replicate-weight survey design now raises `NotImplementedError` (replicate-refit variance ignores `cluster=`). 
`vcov_type='conley'` deferred to a TODO follow-up row.
 
 ### Fixed
 - **Bertanha-Imbens 2014 citation correction (16 sites across 5 files).** A verification spike confirmed the citation across `diff_diff/linalg.py` (×8), `diff_diff/conley.py` (×1), `diff_diff/guides/llms-full.txt` (×2), `docs/methodology/REGISTRY.md` (×4), and `docs/api/spillover.rst` (×1) was incorrect — NBER w20773 *External Validity in Fuzzy Regression Discontinuity Designs* (JBES 2020, 38(3):593-612) by Bertanha & Imbens covers fuzzy RD external validity, NOT weighted spatial-HAC under sampling weights. Replaced across all 16 sites with the open-problem framing: "weighted spatial-HAC under probability sampling is an open methodological question; no canonical extension of Conley (1999) exists for the combination." At the four `REGISTRY.md` sites the replacement is wrapped in the canonical `**Note (open methodological question):**` label per CLAUDE.md "Documenting Deviations (AI Review Compatibility)". REGISTRY ConleySpatialHAC section gains a new `**Note (deferral status, 2026-05-26):**` splitting the boundary into three parts: **Shipped** — SpilloverDiD + Conley + survey via Wave E.1/E.2/E.3 (PR #468/#474/#482), TwoStageDiD + Conley + survey via Wave E.3 parity (PR #485). **Deferred (generic linalg surface, any `weight_type`)** — DiD/MPD/TWFE/LinearRegression generic path + Conley + `survey_design=`; `LinearRegression` / `compute_robust_vcov` Conley + `weights=` rejected for `pweight`, `aweight`, AND `fweight` (weighted Conley is not implemented on the generic linalg surface). **Open methodological question (subset)** — the `pweight` / `survey_design` portion of the deferral additionally lacks a canonical methodological extension of Conley (1999) for weighted spatial-HAC under probability sampling. **No source-code logic changes:** verified via diff-in-diff pytest output before and after the citation strip (175 passed + 14 warnings, bit-identical pass set on `tests/test_conley_vcov.py`). 
**Historical CHANGELOG entries (pre-[Unreleased]) intentionally retain the Bertanha-Imbens 2014 attribution** as accurate records of what was claimed at the time of each release; the [Unreleased] entry above supersedes those rationales going forward.
diff --git a/TODO.md b/TODO.md
index 95a5a24e..6383036e 100644
--- a/TODO.md
+++ b/TODO.md
@@ -104,7 +104,7 @@ Deferred items from PR reviews that were not addressed before merge.
 | PreTrendsPower: CS/SA `anticipation=1` R-parity fixture. The PR-C R-parity goldens cover NIS power + γ_p MDV at `atol=1e-4` on four shifted-grid / regular / irregular / K=1 fixtures, but R `pretrends` has no anticipation parameter so the Python-side `_extract_pre_period_params` anticipation filter (`if t < _pre_cutoff` in `pretrends.py` lines 1138-1150 for CS; mirror in SA branch) is not R-parity-locked. Build a synthetic `CallawaySantAnnaResults` (or `SunAbrahamResults`) with `anticipation=1` and a t=-1 event-study entry that should be filtered before reaching `_compute_power_nis`, then assert the resulting γ_p matches R's `slope_for_power()` on the K=4 shifted-grid fixture. Existing PR-B MC-based tests (`TestPretrendsPropositions`) and full-VCV tests (`TestPretrendsCovarianceSource`) already cover the filter mechanically; this would close the loop against R. | `tests/test_methodology_pretrends.py::TestPretrendsParityR`, `benchmarks/R/generate_pretrends_golden.R` | PR-C follow-up | Low |
 
 
-| Thread `vcov_type` (classical / hc1 / hc2 / hc2_bm) through the standalone estimators that expose `cluster=` but not yet `vcov_type=`: `TwoStageDiD`. Phase 1a added the chain to DiD/MPD/TWFE; Phase 1b PR 1/8 added `SunAbraham`; PR 2/8 added `StackedDiD`; PR 3/8 added `WooldridgeDiD` OLS path. **Four interstitial PRs (post-PR-3/8) addressed the IF-based estimators separately, each permanently narrow to `{"hc1"}`**: (a) `CallawaySantAnna` per Callaway & Sant'Anna (2021) Theorem 2 (also fixed CS's bare-`cluster=` silent no-op); (b) `TripleDifference` per Ortiz-Villavicencio & Sant'Anna (2025) on the 3-pairwise-DiD decomposition; (c) `ImputationDiD` per Borusyak-Jaravel-Spiess (2024) Theorem 3 on per-unit IF aggregation (also added defensive `n_clusters<2`/`n_psu<2` NaN guard on the bootstrap path + `cluster=` + replicate-weights `NotImplementedError`); (d) `EfficientDiD` per Chen-Sant'Anna-Xie (2025) EIF aggregation achieving the semiparametric efficiency bound (also renamed `EfficientDiDResults.cluster` → `cluster_name`, added `n_clusters`/`vcov_type` fields + `to_dict()`, added defensive survey-PSU n<2 NaN guard, eager set_params validation diverging from sibling IF-based estimators). Analytical-sandwich families don't compose with IF-based variance for any of the four. This row tracks the remaining 1 (`TwoStageDiD` is sandwich-class with GMM-corrected meat). | `diff_diff/two_stage.py` | Phase 1b | Medium |
+| Thread `vcov_type` (classical / hc1 / hc2 / hc2_bm) through the standalone estimators that expose `cluster=` but not yet `vcov_type=`: `TwoStageDiD`. Phase 1a added the chain to DiD/MPD/TWFE; Phase 1b PR 1/8 added `SunAbraham`; PR 2/8 added `StackedDiD`; PR 3/8 added `WooldridgeDiD` OLS path. **Four interstitial PRs (post-PR-3/8) addressed the IF-based estimators separately, each permanently narrow to `{"hc1"}`**: (a) `CallawaySantAnna` per Callaway & Sant'Anna (2021) Theorem 2 (also fixed CS's bare-`cluster=` silent no-op); (b) `TripleDifference` per Ortiz-Villavicencio & Sant'Anna (2025) on the 3-pairwise-DiD decomposition; (c) `ImputationDiD` per Borusyak-Jaravel-Spiess (2024) Theorem 3 on per-unit IF aggregation (also added defensive `n_clusters<2`/`n_psu<2` NaN guard on the bootstrap path + `cluster=` + replicate-weights `NotImplementedError`); (d) `EfficientDiD` per Chen-Sant'Anna-Xie (2025) EIF aggregation achieving the semiparametric efficiency bound (also renamed `EfficientDiDResults.cluster` → `cluster_name`, added `n_clusters`/`vcov_type` fields + `to_dict()`, added defensive survey-PSU n<2 NaN guard, eager set_params validation diverging from sibling IF-based estimators). Analytical-sandwich families don't compose with IF-based variance for any of the four. PR 8/8 (Phase 1b interstitial #5, final) added `TwoStageDiD` per the Gardner (2022) two-stage GMM cluster-sandwich — sandwich-class, but the GMM-corrected meat `S_g = gamma_hat' c_g - X'_{2g} eps_{2g}` admits the same permanently-narrow `{"hc1"}` contract (added `vcov_type`/`cluster_name`/`n_clusters` + `to_dict()`; bootstrap `n_clusters<2`/`n_psu<2` NaN guard; `cluster=`+replicate `NotImplementedError`). **Initiative COMPLETE — all 8 standalone estimators threaded;** remaining work is the per-estimator `conley` follow-ups below. | `diff_diff/two_stage.py` | Phase 1b | Done |
 | Extend `SunAbraham` with `vcov_type="conley"` (Conley spatial-HAC) as a first-class feature: thread `conley_coords` / `conley_cutoff_km` / `conley_metric` / `conley_kernel` / `conley_time` / `conley_unit` / `conley_lag_cutoff` through `_fit_saturated_regression`. Phase 1b PR 1/8 deferred this; SA currently rejects `vcov_type="conley"` at `__init__` with a deferral message. | `diff_diff/sun_abraham.py` | follow-up | Medium |
 | Extend `StackedDiD` with `vcov_type="conley"` (Conley spatial-HAC) — thread the six `conley_*` params through `solve_ols` at `stacked_did.py:419` (and the `_refit_stacked` closure at `:444`). Phase 1b PR 2/8 deferred this; StackedDiD currently rejects `vcov_type="conley"` at `__init__` with a deferral message. Same shape as the SunAbraham conley follow-up. | `diff_diff/stacked_did.py` | follow-up | Medium |
 | Extend `WooldridgeDiD` with `vcov_type="conley"` — thread the six `conley_*` params through `solve_ols` in `_fit_ols`. Phase 1b PR 3/8 deferred this; WooldridgeDiD currently rejects `vcov_type="conley"` at `__init__` with a deferral message. Same shape as the SunAbraham / StackedDiD conley follow-ups. | `diff_diff/wooldridge.py` | follow-up | Medium |
@@ -113,6 +113,7 @@ Deferred items from PR reviews that were not addressed before merge.
 | Extend `TripleDifference` with `vcov_type="conley"` — would require deriving a spatial-HAC composition for the 3-pairwise-DiD influence-function decomposition (Conley 1999 spatial kernel × `inf = w3·IF_3 + w2·IF_2 - w1·IF_1` aggregation); no reference implementation exists today. Phase 1b interstitial #2 PR rejected this at `__init__` with a deferral pointer here. | `diff_diff/triple_diff.py` | follow-up | Low |
 | Extend `ImputationDiD` with `vcov_type="conley"` — would require deriving a spatial-HAC composition with the Theorem 3 per-unit IF aggregation (Conley 1999 spatial kernel × `sigma_sq = (cluster_psi_sums**2).sum()` reduction); no reference implementation exists today. Phase 1b interstitial #3 PR rejected this at `__init__` with a deferral pointer here. | `diff_diff/imputation.py` | follow-up | Low |
 | Extend `EfficientDiD` with `vcov_type="conley"` — would require deriving a spatial-HAC composition with the per-unit EIF aggregation (Conley 1999 spatial kernel × `_compute_se_from_eif` reduction); no reference implementation exists today. Phase 1b interstitial #4 PR rejected this at `__init__` with a deferral pointer here. | `diff_diff/efficient_did.py` | follow-up | Low |
+| Extend `TwoStageDiD` with `vcov_type="conley"` — thread a spatial-HAC composition into the GMM sandwich meat (`_compute_gmm_variance`); the Conley machinery already exists in the sibling SpilloverDiD `_compute_gmm_corrected_meat` (same module) and could be adapted to TwoStageDiD's per-cluster GMM score `S_g = gamma_hat' c_g - X'_{2g} eps_{2g}`, but two-stage GMM × Conley has no reference implementation. Phase 1b interstitial #5 PR rejected this at `__init__`/`fit()` with a deferral pointer here. | `diff_diff/two_stage.py` | follow-up | Low |
 | Decide whether to formally deprecate `CallawaySantAnna.cluster=X` in favor of `survey_design=SurveyDesign(psu=X)`. Both APIs are first-class today (the bare-cluster path synthesizes a minimal SurveyDesign internally), but having two equivalent paths to express the same intent creates redundant surface. Mirrors a similar question for ImputationDiD / EfficientDiD / TwoStageDiD if those estimators ever face the same review. | `diff_diff/staggered.py` | follow-up | Low |
 | Harmonize SunAbraham's HC1 within-transform finite-sample correction with `fixest::sunab()`. SA's `solve_ols` applies `n / (n - k_dm)` (within-transform columns only); fixest applies `n / (n - k_total)` (counts absorbed FE). SE values differ by ~1-2% on typical panel sizes (documented in REGISTRY.md "Deviation from R"; pinned at `atol=5e-3` in `tests/test_methodology_sun_abraham.py`). Either thread `df_adjustment` into the vcov scaling or document as an intentional difference. | `diff_diff/sun_abraham.py`, `diff_diff/linalg.py::compute_robust_vcov` | follow-up | Low |
 <!-- Rows 104-105 LIFTED 2026-05-20 via the clubSandwich WLS-CR2 port. The diff-diff
diff --git a/diff_diff/guides/llms-full.txt b/diff_diff/guides/llms-full.txt
index f444a5eb..e89ee907 100644
--- a/diff_diff/guides/llms-full.txt
+++ b/diff_diff/guides/llms-full.txt
@@ -435,6 +435,7 @@ TwoStageDiD(
     seed: int | None = None,
     rank_deficient_action: str = "warn",
     horizon_max: int | None = None,
+    vcov_type: str = "hc1",          # {"hc1"} only — Gardner (2022) two-stage GMM cluster-sandwich; analytical-sandwich {classical, hc2, hc2_bm} and conley REJECTED at __init__/fit (see REGISTRY.md IF-vs-sandwich subsection)
 )
 ```
 
diff --git a/diff_diff/two_stage.py b/diff_diff/two_stage.py
index c5ee0170..4d4eadf2 100644
--- a/diff_diff/two_stage.py
+++ b/diff_diff/two_stage.py
@@ -1256,6 +1256,16 @@ class TwoStageDiD(TwoStageDiDBootstrapMixin):
         pre-trends assessment. Pre-period effects should be ~0 under
         parallel trends. Only affects event_study aggregation; overall
         ATT and group aggregation are unchanged.
+    vcov_type : str, default="hc1"
+        Variance estimator family. Permanently narrow to ``{"hc1"}`` — the
+        Gardner (2022) two-stage GMM cluster-sandwich. Analytical-sandwich
+        families ``{"classical", "hc2", "hc2_bm"}`` and ``"conley"`` are
+        rejected at ``__init__`` / ``fit()`` because the GMM-corrected meat
+        folds first-stage estimation uncertainty into the score, leaving no
+        single hat matrix on which HC2 leverage or Bell-McCaffrey
+        Satterthwaite DOF can be defined. Use ``cluster=<col>`` to select the
+        cluster level; ``cluster=None`` (the default) clusters at the unit
+        level, so the summary renders the unit-cluster CR1 label.
 
     Attributes
     ----------
@@ -1309,6 +1319,7 @@ def __init__(
         rank_deficient_action: str = "warn",
         horizon_max: Optional[int] = None,
         pretrends: bool = False,
+        vcov_type: str = "hc1",
     ):
         if rank_deficient_action not in ("warn", "error", "silent"):
             raise ValueError(
@@ -1320,10 +1331,12 @@ def __init__(
                 f"bootstrap_weights must be 'rademacher', 'mammen', or 'webb', "
                 f"got '{bootstrap_weights}'"
             )
+        self._validate_vcov_type(vcov_type)
 
         self.anticipation = anticipation
         self.alpha = alpha
         self.cluster = cluster
+        self.vcov_type = vcov_type
         self.n_bootstrap = n_bootstrap
         self.bootstrap_weights = bootstrap_weights
         self.seed = seed
@@ -1387,6 +1400,11 @@ def fit(
         ValueError
             If required columns are missing or data validation fails.
         """
+        # Re-validate vcov_type at fit-time so sklearn-style set_params
+        # mutations (e.g. set_params(vcov_type="classical")) are re-checked
+        # rather than silently accepted by the attribute setter.
+        self._validate_vcov_type(self.vcov_type)
+
         # ---- Data validation ----
         required_cols = [outcome, unit, time, first_treat]
         if covariates:
@@ -1417,6 +1435,16 @@ def fit(
                 "Cannot use n_bootstrap > 0 with replicate-weight survey designs. "
                 "Replicate weights provide their own variance estimation."
             )
+        if _uses_replicate_ts and self.cluster is not None:
+            raise NotImplementedError(
+                "TwoStageDiD(cluster=...) with a replicate-weight survey design "
+                "is not supported: replicate-weight variance "
+                "(compute_replicate_refit_variance) estimates the SE by "
+                "per-replicate re-fit and ignores cluster= entirely, so the "
+                "cluster specification would be silently dropped. Use cluster= "
+                "with analytical/TSL inference (no replicate weights), or a "
+                "replicate-weight design without cluster=."
+            )
         # Validate within-unit constancy for panel survey designs
         if resolved_survey is not None:
             _validate_unit_constant_survey(data, unit, survey_design)
@@ -2036,6 +2064,24 @@ def _refit_ts(w_r):
                                 eff_val, se_val, alpha=self.alpha
                             )[0]
 
+        # Resolve cluster_name / n_clusters for Results metadata. Suppress under
+        # ANY survey design (the summary survey block already reports the
+        # design's PSU/strata/replicate metadata; replicate-weight variance
+        # ignores cluster entirely). Otherwise:
+        #   bare cluster= -> the user-named cluster column
+        #   cluster=None  -> the Gardner GMM sandwich still clusters at the unit
+        #                    column by default (cluster_var = unit above), so the
+        #                    summary label reports unit-cluster CR1, not HC1.
+        if resolved_survey is not None:
+            _cluster_name_for_results: Optional[str] = None
+            _n_clusters_for_results: Optional[int] = None
+        elif self.cluster is not None:
+            _cluster_name_for_results = self.cluster
+            _n_clusters_for_results = int(data[self.cluster].nunique())
+        else:
+            _cluster_name_for_results = unit
+            _n_clusters_for_results = int(data[unit].nunique())
+
         # Construct results
         self.results_ = TwoStageDiDResults(
             treatment_effects=treated_df,
@@ -2057,6 +2103,9 @@ def _refit_ts(w_r):
             anticipation=self.anticipation,
             bootstrap_results=bootstrap_results,
             survey_metadata=survey_metadata,
+            vcov_type=self.vcov_type,
+            cluster_name=_cluster_name_for_results,
+            n_clusters=_n_clusters_for_results,
         )
 
         self.is_fitted_ = True
@@ -3204,12 +3253,63 @@ def _build_rows(mask=None):
     # sklearn-compatible interface
     # =========================================================================
 
+    @staticmethod
+    def _validate_vcov_type(vcov_type: str) -> None:
+        """Validate ``vcov_type`` against TwoStageDiD's narrow GMM-sandwich
+        variance contract.
+
+        Called from ``__init__`` AND ``fit()`` so sklearn-style
+        ``set_params(vcov_type=...)`` mutations are re-checked at use time rather
+        than silently accepted by the setter (mirrors the ImputationDiD /
+        TripleDifference / CallawaySantAnna pattern). TwoStageDiD's variance is
+        the Gardner (2022) two-stage GMM cluster-sandwich
+        (``V = bread @ (S' S) @ bread`` with the per-cluster GMM-corrected score
+        ``S_g = gamma_hat' c_g - X_2g' eps_2g``); the contract is permanently
+        narrow to ``{"hc1"}``.
+        """
+        _accepted_vcov = {"hc1"}
+        _sandwich_incompatible = {"classical", "hc2", "hc2_bm"}
+        _deferred_vcov = {"conley"}
+
+        if vcov_type in _sandwich_incompatible:
+            raise ValueError(
+                f"TwoStageDiD(vcov_type={vcov_type!r}) is rejected: TwoStageDiD "
+                "uses the Gardner (2022) two-stage GMM sandwich, whose meat is "
+                "the per-cluster GMM-corrected score "
+                "S_g = gamma_hat' c_g - X_2g' eps_2g, which folds first-stage FE "
+                "estimation uncertainty into the score via the gamma_hat' c_g "
+                "term. Hat-matrix leverage (hc2) and Bell-McCaffrey "
+                "Satterthwaite DOF (hc2_bm) are defined for textbook "
+                "single-equation OLS residuals; there is no single hat matrix "
+                "spanning both stages, and the Gardner first-stage correction "
+                "has not been derived for the leverage-corrected or "
+                "homoskedastic (classical) meat structures (no reference "
+                "implementation — clubSandwich covers single-equation WLS/OLS "
+                "CR2, not two-stage GMM). Use vcov_type='hc1' (the default) with "
+                "cluster=<col> for cluster-robust inference."
+            )
+        if vcov_type in _deferred_vcov:
+            raise ValueError(
+                f"TwoStageDiD(vcov_type={vcov_type!r}) is not yet supported: "
+                "TwoStageDiD's GMM sandwich (_compute_gmm_variance) has no "
+                "spatial-HAC path today (the Conley machinery lives in the "
+                "separate SpilloverDiD helper). See TODO.md for the deferred "
+                "follow-up row. Use vcov_type='hc1' (the default) with "
+                "cluster=<col> for cluster-robust inference."
+            )
+        if vcov_type not in _accepted_vcov:
+            raise ValueError(
+                f"TwoStageDiD(vcov_type={vcov_type!r}) is invalid. "
+                f"Accepted: {sorted(_accepted_vcov)}."
+            )
+
     def get_params(self) -> Dict[str, Any]:
         """Get estimator parameters (sklearn-compatible)."""
         return {
             "anticipation": self.anticipation,
             "alpha": self.alpha,
             "cluster": self.cluster,
+            "vcov_type": self.vcov_type,
             "n_bootstrap": self.n_bootstrap,
             "bootstrap_weights": self.bootstrap_weights,
             "seed": self.seed,
diff --git a/diff_diff/two_stage_bootstrap.py b/diff_diff/two_stage_bootstrap.py
index 391cdbd4..2a2a693d 100644
--- a/diff_diff/two_stage_bootstrap.py
+++ b/diff_diff/two_stage_bootstrap.py
@@ -216,6 +216,56 @@ def _compute_cluster_S_scores(
 
         return S, bread, unique_clusters
 
+    def _build_nan_bootstrap_results(
+        self,
+        original_event_study: Optional[Dict[int, Dict[str, Any]]],
+        original_group: Optional[Dict[Any, Dict[str, Any]]],
+    ) -> TwoStageBootstrapResults:
+        """Build an all-NaN TwoStageBootstrapResults for degenerate-design
+        bootstrap paths (n_clusters<2 / n_psu<2).
+
+        Per-horizon and per-group dicts are populated with NaN entries keyed by
+        the SAME horizons/groups as the analytical originals so the downstream
+        post-bootstrap override loop in :meth:`TwoStageDiD.fit` iterates over
+        them and propagates NaN to ``event_study_effects[h]["se"]`` /
+        ``group_effects[g]["se"]`` (rather than silently no-oping by finding
+        ``None``).
+        """
+        n_nan = float("nan")
+        ci_nan: Tuple[float, float] = (n_nan, n_nan)
+
+        es_ses: Optional[Dict[int, float]] = None
+        es_cis: Optional[Dict[int, Tuple[float, float]]] = None
+        es_ps: Optional[Dict[int, float]] = None
+        if original_event_study:
+            es_ses = {h: n_nan for h in original_event_study}
+            es_cis = {h: ci_nan for h in original_event_study}
+            es_ps = {h: n_nan for h in original_event_study}
+
+        g_ses: Optional[Dict[Any, float]] = None
+        g_cis: Optional[Dict[Any, Tuple[float, float]]] = None
+        g_ps: Optional[Dict[Any, float]] = None
+        if original_group:
+            g_ses = {g: n_nan for g in original_group}
+            g_cis = {g: ci_nan for g in original_group}
+            g_ps = {g: n_nan for g in original_group}
+
+        return TwoStageBootstrapResults(
+            n_bootstrap=self.n_bootstrap,
+            weight_type=self.bootstrap_weights,
+            alpha=self.alpha,
+            overall_att_se=n_nan,
+            overall_att_ci=ci_nan,
+            overall_att_p_value=n_nan,
+            event_study_ses=es_ses,
+            event_study_cis=es_cis,
+            event_study_p_values=es_ps,
+            group_ses=g_ses,
+            group_cis=g_cis,
+            group_p_values=g_ps,
+            bootstrap_distribution=None,
+        )
+
     def _run_bootstrap(
         self,
         df: pd.DataFrame,
@@ -277,8 +327,11 @@ def _run_bootstrap(
 
         X_2_static = D.reshape(-1, 1)
         coef_static = solve_ols(
-            X_2_static, y_tilde, return_vcov=False,
-            weights=survey_weights, weight_type=survey_weight_type,
+            X_2_static,
+            y_tilde,
+            return_vcov=False,
+            weights=survey_weights,
+            weight_type=survey_weight_type,
         )[0]
         eps_2_static = y_tilde - np.dot(X_2_static, coef_static)
 
@@ -300,6 +353,28 @@ def _run_bootstrap(
 
         n_clusters = len(unique_clusters)
 
+        # Degenerate-design guard (load-bearing). The bootstrap perturbs exactly
+        # `n_clusters` cluster scores: `boot_att_vec = all_weights @ S_static`
+        # with `all_weights` shape (B, n_clusters) and `S_static` shape
+        # (n_clusters, k). With <2 clusters the multiplier draws collapse to
+        # constants and BLAS roundoff yields a ~0 SE (NOT NaN), producing
+        # near-infinite t-stats for inference that is actually undefined. Fail
+        # closed with all-NaN bootstrap results. `n_clusters` is the POST-DROP
+        # effective cluster count and never exceeds the survey generator's PSU
+        # count (post-drop clusters are a subset of the full-domain PSUs), so
+        # this guard also catches the Wave E.3 always-treated-drop-collapse case
+        # where the full-domain resolved_survey still retains >=2 PSUs. See
+        # feedback_bootstrap_g_less_than_2_blas_roundoff.
+        if n_clusters < 2:
+            warnings.warn(
+                f"TwoStageDiD bootstrap: n_clusters={n_clusters} (<2). Cluster "
+                "variance is unidentified with fewer than 2 clusters; returning "
+                "NaN bootstrap inference so downstream statistics NaN-propagate.",
+                UserWarning,
+                stacklevel=3,
+            )
+            return self._build_nan_bootstrap_results(original_event_study, original_group)
+
         # Generate bootstrap weights — PSU-level when survey design is present
         _use_survey_bootstrap = resolved_survey is not None and (
             resolved_survey.strata is not None
@@ -311,6 +386,21 @@ def _run_bootstrap(
             psu_weights, psu_ids = _generate_survey_multiplier_weights_batch(
                 self.n_bootstrap, resolved_survey, self.bootstrap_weights, rng
             )
+            # Defense-in-depth + ImputationDiD-precedent parity
+            # (imputation_bootstrap.py:387): NaN-out when the survey generator
+            # itself yields <2 PSUs. Subsumed by the ungated n_clusters guard
+            # above (post-drop clusters are a subset of the full-domain PSUs, so
+            # n_clusters<2 already fired whenever len(psu_ids)<2), but kept
+            # explicit so the survey path's degeneracy is self-evident.
+            if len(psu_ids) < 2:
+                warnings.warn(
+                    f"TwoStageDiD survey-PSU bootstrap: n_psu={len(psu_ids)} "
+                    "(<2). Cluster variance is unidentified; returning NaN "
+                    "bootstrap inference.",
+                    UserWarning,
+                    stacklevel=3,
+                )
+                return self._build_nan_bootstrap_results(original_event_study, original_group)
             # Map unique_clusters (PSU values) to PSU weight columns.
             # When survey+PSU is active, cluster_var == "_survey_cluster" so
             # unique_clusters are the PSU ids used in S-score aggregation.
@@ -396,8 +486,11 @@ def _run_bootstrap(
                             X_2_es[i, horizon_to_col[h_int]] = 1.0
 
                 coef_es = solve_ols(
-                    X_2_es, y_tilde, return_vcov=False,
-                    weights=survey_weights, weight_type=survey_weight_type,
+                    X_2_es,
+                    y_tilde,
+                    return_vcov=False,
+                    weights=survey_weights,
+                    weight_type=survey_weight_type,
                 )[0]
                 eps_2_es = y_tilde - np.dot(X_2_es, coef_es)
 
@@ -464,8 +557,11 @@ def _run_bootstrap(
                         X_2_grp[i, group_to_col[g]] = 1.0
 
             coef_grp = solve_ols(
-                X_2_grp, y_tilde, return_vcov=False,
-                weights=survey_weights, weight_type=survey_weight_type,
+                X_2_grp,
+                y_tilde,
+                return_vcov=False,
+                weights=survey_weights,
+                weight_type=survey_weight_type,
             )[0]
             eps_2_grp = y_tilde - np.dot(X_2_grp, coef_grp)
 
diff --git a/diff_diff/two_stage_results.py b/diff_diff/two_stage_results.py
index 591f0cd1..3bc550e8 100644
--- a/diff_diff/two_stage_results.py
+++ b/diff_diff/two_stage_results.py
@@ -140,6 +140,16 @@ class TwoStageDiDResults:
     bootstrap_results: Optional[TwoStageBootstrapResults] = field(default=None, repr=False)
     # Survey design metadata (SurveyMetadata instance from diff_diff.survey)
     survey_metadata: Optional[Any] = field(default=None, repr=False)
+    # --- Variance-estimator metadata (Phase 1b vcov_type threading) ---
+    # vcov_type is permanently narrow to {"hc1"} per the Gardner (2022) GMM
+    # cluster-sandwich. cluster_name/n_clusters carry the cluster label (the
+    # Gardner sandwich always clusters — default at the unit column, see
+    # two_stage.py:1547) so summary() renders the unit-cluster CR1 label rather
+    # than generic HC1. Both are None under any survey design (the survey block
+    # already reports the design's PSU/strata metadata).
+    vcov_type: str = "hc1"
+    cluster_name: Optional[str] = None
+    n_clusters: Optional[int] = None
 
     # --- Inference-field aliases (balance/external-adapter compatibility) ---
     @property
@@ -218,6 +228,38 @@ def summary(self, alpha: Optional[float] = None) -> str:
             sm = self.survey_metadata
             lines.extend(_format_survey_block(sm, 85))
 
+        # Variance-estimator label (Phase 1b vcov_type), with two suppression
+        # gates mirroring DiDResults.summary() (results.py:213-226) and
+        # ImputationDiDResults.summary():
+        #   1. Bootstrap fits: fit() overwrites the reported SE/CI/p-value with
+        #      bootstrap_results, so the analytical variance-family label would
+        #      mislabel the inference source — surface "Inference method:
+        #      bootstrap" + the replication count instead.
+        #   2. Survey fits: _format_survey_block above already reports the design
+        #      (weight type, strata/PSU counts, replicate method), so a parallel
+        #      variance line would be redundant/misleading.
+        # Default cluster=None still clusters at the unit column (Gardner GMM
+        # sandwich, two_stage.py:1547), so cluster_name carries the unit column
+        # and _format_vcov_label renders the unit-cluster CR1 label, not HC1.
+        if self.bootstrap_results is not None:
+            lines.append(f"{'Inference method:':<30} {'bootstrap':>15}")
+            lines.append(
+                f"{'Bootstrap replications:':<30} {self.bootstrap_results.n_bootstrap:>15}"
+            )
+            lines.append("")
+        elif self.survey_metadata is None:
+            from diff_diff.results import _format_vcov_label
+
+            vcov_label = _format_vcov_label(
+                self.vcov_type,
+                cluster_name=self.cluster_name,
+                n_clusters=self.n_clusters,
+                n_obs=self.n_obs,
+            )
+            if vcov_label:
+                lines.append(f"{'Variance estimator:':<30} {vcov_label:>15}")
+                lines.append("")
+
         # Overall ATT
         lines.extend(
             [
@@ -411,6 +453,55 @@ def to_dataframe(self, level: str = "event_study") -> pd.DataFrame:
                 f"Unknown level: {level}. Use 'event_study', 'group', or 'observation'."
             )
 
+    def to_dict(self) -> Dict[str, Any]:
+        """
+        Convert headline results to a dictionary.
+
+        Provides flat aliases (``att``/``se``/``t_stat``/``p_value``/
+        ``conf_int_lower``/``conf_int_upper``) plus variance-estimator metadata
+        (``vcov_type``, optional ``cluster_name``/``n_clusters``, optional
+        ``n_bootstrap``, ``inference_method``). Per-cohort / per-horizon detail is
+        exposed via :meth:`to_dataframe`. ``inference_method`` reports
+        ``"cluster"`` for the default fit because the Gardner GMM sandwich
+        clusters at the unit column (``n_clusters`` populated) — consistent with
+        the CR1-at-unit summary label.
+
+        Returns
+        -------
+        Dict[str, Any]
+            Headline overall ATT plus inference metadata.
+        """
+        result: Dict[str, Any] = {
+            "att": self.overall_att,
+            "se": self.overall_se,
+            "t_stat": self.overall_t_stat,
+            "p_value": self.overall_p_value,
+            "conf_int_lower": self.overall_conf_int[0],
+            "conf_int_upper": self.overall_conf_int[1],
+            "n_obs": self.n_obs,
+            "n_treated_obs": self.n_treated_obs,
+            "n_untreated_obs": self.n_untreated_obs,
+            "n_treated_units": self.n_treated_units,
+            "n_control_units": self.n_control_units,
+            "alpha": self.alpha,
+            "anticipation": self.anticipation,
+            "vcov_type": self.vcov_type,
+        }
+        if self.cluster_name is not None:
+            result["cluster_name"] = self.cluster_name
+        if self.n_clusters is not None:
+            result["n_clusters"] = self.n_clusters
+        if self.bootstrap_results is not None:
+            result["n_bootstrap"] = self.bootstrap_results.n_bootstrap
+            result["inference_method"] = "bootstrap"
+        elif self.survey_metadata is not None:
+            result["inference_method"] = "survey"
+        elif self.n_clusters is not None:
+            result["inference_method"] = "cluster"
+        else:
+            result["inference_method"] = "analytical"
+        return result
+
     @property
     def is_significant(self) -> bool:
         """Check if overall ATT is significant."""
diff --git a/docs/api/_autosummary/diff_diff.TwoStageDiD.rst b/docs/api/_autosummary/diff_diff.TwoStageDiD.rst
index 0eaedf9b..73311ad6 100644
--- a/docs/api/_autosummary/diff_diff.TwoStageDiD.rst
+++ b/docs/api/_autosummary/diff_diff.TwoStageDiD.rst
@@ -30,4 +30,5 @@
       ~TwoStageDiD.alpha
       ~TwoStageDiD.seed
       ~TwoStageDiD.horizon_max
+      ~TwoStageDiD.vcov_type
 
diff --git a/docs/api/_autosummary/diff_diff.TwoStageDiDResults.rst b/docs/api/_autosummary/diff_diff.TwoStageDiDResults.rst
index 00c5d5c0..c57d1fc3 100644
--- a/docs/api/_autosummary/diff_diff.TwoStageDiDResults.rst
+++ b/docs/api/_autosummary/diff_diff.TwoStageDiDResults.rst
@@ -15,6 +15,7 @@
       ~TwoStageDiDResults.print_summary
       ~TwoStageDiDResults.summary
       ~TwoStageDiDResults.to_dataframe
+      ~TwoStageDiDResults.to_dict
 
 
 
@@ -50,4 +51,7 @@
       ~TwoStageDiDResults.n_untreated_obs
       ~TwoStageDiDResults.n_treated_units
       ~TwoStageDiDResults.n_control_units
+      ~TwoStageDiDResults.vcov_type
+      ~TwoStageDiDResults.cluster_name
+      ~TwoStageDiDResults.n_clusters
 
diff --git a/docs/api/two_stage.rst b/docs/api/two_stage.rst
index ea432b8d..4f7c01aa 100644
--- a/docs/api/two_stage.rst
+++ b/docs/api/two_stage.rst
@@ -71,6 +71,7 @@ Results container for two-stage DiD estimation.
       ~TwoStageDiDResults.summary
       ~TwoStageDiDResults.print_summary
       ~TwoStageDiDResults.to_dataframe
+      ~TwoStageDiDResults.to_dict
 
 TwoStageBootstrapResults
 ------------------------
diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md
index 101cc3c0..e00a1e76 100644
--- a/docs/methodology/REGISTRY.md
+++ b/docs/methodology/REGISTRY.md
@@ -348,9 +348,13 @@ Sant'Anna 2021 for `CallawaySantAnna`; Borusyak-Jaravel-Spiess 2024 for
 This split is a structural property of the estimator's variance derivation,
 not a missing feature. The `vcov_type` input contract for IF-based estimators
 is **permanently narrow** at `{"hc1"}`. Enforced today on
-`CallawaySantAnna`, `TripleDifference`, `ImputationDiD`, and `EfficientDiD`;
-the same narrow contract is expected when `TwoStageDiD` reaches `vcov_type`
-threading (sandwich + GMM-corrected meat; non-IF, so threading model differs).
+`CallawaySantAnna`, `TripleDifference`, `ImputationDiD`, `EfficientDiD`, and
+`TwoStageDiD`. `TwoStageDiD` is sandwich-class rather than pure-IF — its
+variance is the Gardner (2022) two-stage GMM sandwich — but reaches the same
+narrow `{"hc1"}` contract for a meat-specific reason: the GMM-corrected score
+`S_g = gamma_hat' c_g - X'_{2g} eps_{2g}` folds first-stage FE uncertainty into
+the meat, so no single hat matrix spans both stages and `{classical, hc2,
+hc2_bm}` have no derivation (see the TwoStageDiD section).
 
 **Note:** This routing is a documented synthesis. The clustered-`hc1`
 activation path is estimator-specific: `CallawaySantAnna` synthesizes
@@ -1401,6 +1405,7 @@ Our implementation uses multiplier bootstrap on the GMM influence function: clus
 - **Note:** Both the iterative FE solver (`_iterative_fe`, Stage 1) and the iterative alternating-projection demeaning helper (`_iterative_demean`, used in covariate residualization) emit `UserWarning` when `max_iter` exhausts without reaching `tol`, via `diff_diff.utils.warn_if_not_converged`. Silent return of the current iterate was classified as a silent failure under the Phase 2 audit and replaced with an explicit signal to match the logistic/Poisson IRLS pattern in `linalg.py`.
 - **Note:** When the Stage-2 bread `X'_2 W X_2` is singular, both the analytical TSL variance (`two_stage.py`) and the multiplier-bootstrap bread (`two_stage_bootstrap.py`) now emit a `UserWarning` before falling back to `np.linalg.lstsq`. Previously this fallback was silent. Sibling of axis-A finding #17 in the Phase 2 silent-failures audit; surfaced by the repo-wide lstsq-fallback pattern grep that accompanied the StaggeredTripleDifference fix.
 - **Note:** The GMM sandwich and bootstrap paths both use `scipy.sparse.linalg.factorized` for the Stage 1 normal-equations solve `(X'_{10} W X_{10}) gamma = X'_1 W X_2` and fall back to dense `lstsq` when the sparse factorization raises `RuntimeError` on a near-singular matrix. Both fallback sites emit a `UserWarning` (silent-failure audit axis C) so callers know SE estimates came from the degraded path rather than the fast sparse path.
+- **Note:** `vcov_type` is permanently narrow to `{"hc1"}` (Phase 1b threading). TwoStageDiD's variance is the Gardner (2022) two-stage GMM cluster-sandwich `V = (X'_2 W X_2)^{-1} (S' S) (X'_2 W X_2)^{-1}` with the per-cluster GMM-corrected score `S_g = gamma_hat' c_g - X'_{2g} eps_{2g}`. Analytical-sandwich families `{classical, hc2, hc2_bm}` are rejected at `__init__`/`fit()`: the GMM-corrected meat folds first-stage FE estimation uncertainty into the score via the `gamma_hat' c_g` term, so there is no single hat matrix spanning both stages on which HC2 leverage or Bell-McCaffrey Satterthwaite DOF can be defined, and the Gardner first-stage correction has not been derived for the leverage-corrected or homoskedastic meat structures (no reference implementation — `clubSandwich` covers single-equation WLS/OLS CR2, not two-stage GMM; mirrors the SpilloverDiD `vcov_type="classical"` rejection). `cluster=<col>` selects the cluster level; `cluster=None` (the default) still clusters at the `unit` column (`cluster_var = unit`), so the summary renders `"CR1 cluster-robust at <unit>, G=<n_units>"` rather than the generic `"HC1"` label. **Note:** the did2s GMM sandwich uses NO finite-sample multiplier (meat `= S' S`), so the rendered `CR1` family label carries no Stata-style `(n-1)/(n-p)` or `G/(G-1)` factor (matches R `did2s`; same FSA-free convention as ImputationDiD's Theorem 3 variance). Under bootstrap (`n_bootstrap > 0`) the analytical variance-family label is suppressed in `summary()` because `fit()` overwrites the reported SE/CI/p-value with `bootstrap_results` (mirrors `DiDResults` at `results.py:213-226`). `cluster=<col>` combined with a replicate-weight survey design raises `NotImplementedError` (replicate-refit variance ignores `cluster=`). `vcov_type='conley'` is deferred to the TwoStageDiD Conley follow-up row in TODO.md.
 
 **Reference implementation(s):**
 - R: `did2s::did2s()` (Kyle Butts & John Gardner)
diff --git a/tests/test_two_stage.py b/tests/test_two_stage.py
index ae714a3b..59be18a0 100644
--- a/tests/test_two_stage.py
+++ b/tests/test_two_stage.py
@@ -1910,3 +1910,283 @@ def test_h_psu_entirely_always_treated_unidentified_gate(self):
         assert np.isfinite(result.overall_se)
         assert result.survey_metadata is not None
         assert result.survey_metadata.n_psu == 6
+
+
+# =============================================================================
+# Phase 1b interstitial #5 (final): vcov_type threading on TwoStageDiD
+# =============================================================================
+
+from diff_diff import SurveyDesign  # noqa: E402
+
+
+def _add_survey_cols(data, n_rep=8):
+    """Add a constant pweight 'w' + unit-constant JK1 replicate-weight columns.
+
+    Each replicate zeroes one block of units and rescales survivors by
+    n_rep/(n_rep-1) (JK1 convention), broadcast panel-constant per unit.
+    """
+    d = data.copy()
+    d["w"] = 1.0
+    units = np.sort(d["unit"].unique())
+    n_units = len(units)
+    unit_pos = {u: i for i, u in enumerate(units)}
+    rows = d["unit"].map(unit_pos).values
+    units_per_rep = max(n_units // n_rep, 1)
+    rep_cols = []
+    for r in range(n_rep):
+        w_r = np.ones(n_units)
+        start = r * units_per_rep
+        end = min((r + 1) * units_per_rep, n_units)
+        w_r[start:end] = 0.0
+        nz = w_r > 0
+        w_r[nz] = w_r[nz] * n_rep / (n_rep - 1)
+        d[f"rep_{r}"] = w_r[rows]
+        rep_cols.append(f"rep_{r}")
+    return d, rep_cols
+
+
+def _fit_ts(est, data, **kw):
+    """Fit helper that suppresses all warnings (e.g. convergence, bootstrap-size)."""
+    with warnings.catch_warnings():
+        warnings.simplefilter("ignore")
+        return est.fit(
+            data,
+            outcome="outcome",
+            unit="unit",
+            time="time",
+            first_treat="first_treat",
+            **kw,
+        )
+
+
+def _assert_results_bit_equal(r0, r1):
+    """Assert two TwoStageDiDResults are numerically identical (NaN-aware)."""
+
+    def _eq(a, b):
+        return a == b or (np.isnan(a) and np.isnan(b))
+
+    assert _eq(r0.overall_att, r1.overall_att)
+    assert _eq(r0.overall_se, r1.overall_se)
+    for attr in ("event_study_effects", "group_effects"):
+        e0 = getattr(r0, attr)
+        e1 = getattr(r1, attr)
+        if e0 is None:
+            assert e1 is None
+            continue
+        assert set(e0) == set(e1)
+        for k in e0:
+            for f in ("effect", "se"):
+                assert _eq(e0[k][f], e1[k][f]), f"{attr}[{k}][{f}] differs"
+
+
+class TestTwoStageDiDVcovType:
+    """Phase 1b interstitial #5 (final): vcov_type input contract on TwoStageDiD.
+
+    TwoStageDiD's variance is the Gardner (2022) two-stage GMM cluster-sandwich
+    (``V = bread @ (S' S) @ bread``; always clusters, default at the unit
+    column). ``vcov_type`` is permanently narrow to ``{"hc1"}``;
+    analytical-sandwich families ``{classical, hc2, hc2_bm}`` and ``conley`` are
+    rejected with GMM-meat-specific messages. Mirrors the ImputationDiD
+    interstitial #3 template
+    (``tests/test_imputation.py::TestImputationDiDVcovType``).
+    """
+
+    # ---- introspection / defaults ----
+
+    def test_default_vcov_type(self):
+        est = TwoStageDiD()
+        assert est.vcov_type == "hc1"
+        assert est.get_params()["vcov_type"] == "hc1"
+
+    def test_results_carry_vcov_metadata(self):
+        data = generate_test_data(n_units=80, seed=1)
+        r = _fit_ts(TwoStageDiD(), data, aggregate="all")
+        assert r.vcov_type == "hc1"
+        assert r.cluster_name == "unit"
+        assert r.n_clusters == 80
+
+    def test_to_dict_carries_vcov(self):
+        data = generate_test_data(n_units=80, seed=1)
+        d = _fit_ts(TwoStageDiD(), data).to_dict()
+        assert d["vcov_type"] == "hc1"
+        assert d["cluster_name"] == "unit"
+        assert d["n_clusters"] == 80
+        assert d["inference_method"] == "cluster"
+
+    def test_convenience_function_threads_vcov_type(self):
+        data = generate_test_data(n_units=60, seed=2)
+        with warnings.catch_warnings():
+            warnings.simplefilter("ignore")
+            r = two_stage_did(data, "outcome", "unit", "time", "first_treat", vcov_type="hc1")
+        assert r.vcov_type == "hc1"
+        with pytest.raises(ValueError):
+            two_stage_did(data, "outcome", "unit", "time", "first_treat", vcov_type="classical")
+
+    def test_fit_clone_idempotence(self):
+        data = generate_test_data(n_units=60, seed=3)
+        est = TwoStageDiD()
+        r1 = _fit_ts(est, data)
+        clone = TwoStageDiD(**est.get_params())
+        r2 = _fit_ts(clone, data)
+        assert clone.vcov_type == "hc1"
+        assert r1.overall_att == pytest.approx(r2.overall_att)
+        assert r1.overall_se == pytest.approx(r2.overall_se)
+
+    # ---- rejections ----
+
+    @pytest.mark.parametrize("bad", ["classical", "hc2", "hc2_bm", "conley", "garbage"])
+    def test_invalid_vcov_type_rejected_at_init(self, bad):
+        with pytest.raises(ValueError, match=bad):
+            TwoStageDiD(vcov_type=bad)
+
+    @pytest.mark.parametrize("bad", ["classical", "hc2", "hc2_bm", "conley"])
+    def test_fit_revalidates_after_set_params(self, bad):
+        data = generate_test_data(n_units=60, seed=4)
+        est = TwoStageDiD()
+        est.set_params(vcov_type=bad)  # sklearn mutate-then-validate-at-use
+        assert est.vcov_type == bad
+        with pytest.raises(ValueError, match=bad):
+            est.fit(data, outcome="outcome", unit="unit", time="time", first_treat="first_treat")
+
+    def test_rejection_messages_are_methodology_specific(self):
+        with pytest.raises(ValueError, match="GMM"):
+            TwoStageDiD(vcov_type="hc2")
+        with pytest.raises(ValueError, match="Conley|spatial"):
+            TwoStageDiD(vcov_type="conley")
+
+    # ---- summary labels ----
+
+    def test_summary_default_renders_unit_cluster_cr1(self):
+        data = generate_test_data(n_units=80, seed=5)
+        s = _fit_ts(TwoStageDiD(), data).summary()
+        assert "CR1 cluster-robust at unit" in s
+        assert "HC1 heteroskedasticity" not in s
+
+    def test_summary_explicit_cluster_renders_named_cr1(self):
+        data = generate_test_data(n_units=80, seed=6)
+        data["st"] = data["unit"] % 6
+        r = _fit_ts(TwoStageDiD(cluster="st"), data)
+        assert r.cluster_name == "st" and r.n_clusters == 6
+        assert "CR1 cluster-robust at st, G=6" in r.summary()
+
+    def test_summary_suppresses_variance_label_under_bootstrap(self):
+        data = generate_test_data(n_units=80, seed=7)
+        s = _fit_ts(TwoStageDiD(n_bootstrap=199, seed=7), data).summary()
+        assert "Inference method:" in s and "bootstrap" in s
+        assert "Variance estimator:" not in s
+
+    def test_cluster_name_suppressed_under_tsl_survey(self):
+        data, _ = _add_survey_cols(generate_test_data(n_units=80, seed=8))
+        r = _fit_ts(TwoStageDiD(), data, survey_design=SurveyDesign(weights="w"))
+        assert r.cluster_name is None
+        assert r.n_clusters is None
+
+    def test_cluster_name_suppressed_under_replicate_survey(self):
+        data, rep_cols = _add_survey_cols(generate_test_data(n_units=80, seed=9))
+        design = SurveyDesign(weights="w", replicate_weights=rep_cols, replicate_method="JK1")
+        r = _fit_ts(TwoStageDiD(), data, survey_design=design)
+        assert r.cluster_name is None
+        assert r.n_clusters is None
+
+    # ---- bit-equality regression guards (vcov_type='hc1' is a pure no-op) ----
+
+    @pytest.mark.parametrize("aggregate", [None, "event_study", "group", "all"])
+    def test_default_path_bit_equal(self, aggregate):
+        data = generate_test_data(n_units=80, seed=10)
+        r0 = _fit_ts(TwoStageDiD(), data, aggregate=aggregate)
+        r1 = _fit_ts(TwoStageDiD(vcov_type="hc1"), data, aggregate=aggregate)
+        _assert_results_bit_equal(r0, r1)
+
+    @pytest.mark.parametrize("aggregate", [None, "event_study", "group", "all"])
+    def test_cluster_path_bit_equal(self, aggregate):
+        data = generate_test_data(n_units=80, seed=11)
+        data["st"] = data["unit"] % 7
+        r0 = _fit_ts(TwoStageDiD(cluster="st"), data, aggregate=aggregate)
+        r1 = _fit_ts(TwoStageDiD(cluster="st", vcov_type="hc1"), data, aggregate=aggregate)
+        _assert_results_bit_equal(r0, r1)
+
+    def test_tsl_survey_path_bit_equal(self):
+        data, _ = _add_survey_cols(generate_test_data(n_units=80, seed=12))
+        design = SurveyDesign(weights="w")
+        r0 = _fit_ts(TwoStageDiD(), data, survey_design=design, aggregate="all")
+        r1 = _fit_ts(TwoStageDiD(vcov_type="hc1"), data, survey_design=design, aggregate="all")
+        _assert_results_bit_equal(r0, r1)
+
+    def test_replicate_survey_path_bit_equal(self):
+        data, rep_cols = _add_survey_cols(generate_test_data(n_units=80, seed=13))
+        design = SurveyDesign(weights="w", replicate_weights=rep_cols, replicate_method="JK1")
+        r0 = _fit_ts(TwoStageDiD(), data, survey_design=design, aggregate="event_study")
+        r1 = _fit_ts(
+            TwoStageDiD(vcov_type="hc1"), data, survey_design=design, aggregate="event_study"
+        )
+        _assert_results_bit_equal(r0, r1)
+
+    def test_bootstrap_path_bit_equal(self):
+        data = generate_test_data(n_units=80, seed=14)
+        r0 = _fit_ts(TwoStageDiD(n_bootstrap=199, seed=99), data, aggregate="all")
+        r1 = _fit_ts(TwoStageDiD(n_bootstrap=199, seed=99, vcov_type="hc1"), data, aggregate="all")
+        _assert_results_bit_equal(r0, r1)
+        b0, b1 = r0.bootstrap_results, r1.bootstrap_results
+        assert b0.overall_att_se == b1.overall_att_se
+        if b0.event_study_ses:
+            assert b0.event_study_ses == b1.event_study_ses
+        if b0.group_ses:
+            assert b0.group_ses == b1.group_ses
+
+    # ---- cluster + replicate rejection ----
+
+    def test_cluster_plus_replicate_weights_rejected(self):
+        data, rep_cols = _add_survey_cols(generate_test_data(n_units=80, seed=15))
+        data["st"] = data["unit"] % 5
+        design = SurveyDesign(weights="w", replicate_weights=rep_cols, replicate_method="JK1")
+        with pytest.raises(NotImplementedError, match="replicate"):
+            TwoStageDiD(cluster="st").fit(
+                data,
+                outcome="outcome",
+                unit="unit",
+                time="time",
+                first_treat="first_treat",
+                survey_design=design,
+            )
+
+    # ---- bootstrap G<2 NaN guard (both entry paths) ----
+
+    def test_bootstrap_single_cluster_returns_nan(self):
+        data = generate_test_data(n_units=80, seed=16)
+        data["solo"] = 1  # single cluster
+        est = TwoStageDiD(cluster="solo", n_bootstrap=199, seed=3)
+        with pytest.warns(UserWarning, match="n_clusters=1"):
+            r = est.fit(
+                data,
+                outcome="outcome",
+                unit="unit",
+                time="time",
+                first_treat="first_treat",
+                aggregate="all",
+            )
+        assert np.isnan(r.overall_se)
+        assert np.isnan(r.coef_var)
+        assert np.isnan(r.bootstrap_results.overall_att_se)
+        assert all(np.isnan(x) for x in r.bootstrap_results.overall_att_ci)
+        if r.bootstrap_results.event_study_ses:
+            assert all(np.isnan(v) for v in r.bootstrap_results.event_study_ses.values())
+        if r.bootstrap_results.group_ses:
+            assert all(np.isnan(v) for v in r.bootstrap_results.group_ses.values())
+
+    def test_bootstrap_single_psu_survey_returns_nan(self):
+        data, _ = _add_survey_cols(generate_test_data(n_units=80, seed=17))
+        data["onepsu"] = 0  # single PSU
+        design = SurveyDesign(weights="w", psu="onepsu")
+        est = TwoStageDiD(n_bootstrap=199, seed=3)
+        with pytest.warns(UserWarning):
+            r = est.fit(
+                data,
+                outcome="outcome",
+                unit="unit",
+                time="time",
+                first_treat="first_treat",
+                survey_design=design,
+                aggregate="event_study",
+            )
+        assert np.isnan(r.overall_se)
+        assert np.isnan(r.bootstrap_results.overall_att_se)

From 489dae4b6e4425f64853417a7bf45b9406bfa2dd Mon Sep 17 00:00:00 2001
From: igerber <isaac.gerber@gmail.com>
Date: Sat, 30 May 2026 06:48:12 -0400
Subject: [PATCH 2/4] two-stage-did: report post-drop cluster count in vcov
 metadata (codex P2)

n_clusters was counted on the full input `data`, but the GMM sandwich computes
variance over `cluster_ids=df[cluster_var]` on the POST-DROP
fit sample (always-treated units are removed before estimation). When an
always-treated unit/cluster is excluded, reporting the full-input count
overstates the effective G the SE is based on. Count clusters on `df` instead,
matching the variance. Survey suppression (cluster_name=None) is unchanged;
the Wave E.3 full-domain survey accounting is a separate, intentional path.

Adds a regression test asserting n_clusters equals the post-drop count when an
always-treated cohort is dropped.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
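Reviewer note (ignored by git am): a minimal sketch of the counting
difference this patch fixes. It assumes a toy panel in which
`first_treat == 0` marks never-treated units; the frame and the names
`full_g` / `post_drop_g` are illustrative and not taken from the library.

```python
import pandas as pd

# Hypothetical toy panel: units 1-4 observed at times 3-5.
# first_treat == 0 marks never-treated units (an assumption of this sketch).
data = pd.DataFrame(
    {
        "unit": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
        "time": [3, 4, 5] * 4,
        "first_treat": [3] * 3 + [4] * 3 + [5] * 3 + [0] * 3,
    }
)

min_t = data["time"].min()
first_by_unit = data.groupby("unit")["first_treat"].first()
# A unit is always-treated when it is already treated in the first observed period.
always_treated = first_by_unit[(first_by_unit > 0) & (first_by_unit <= min_t)].index

# Post-drop fit sample, analogous to the `df` the sandwich clusters on.
df = data[~data["unit"].isin(always_treated)]

full_g = int(data["unit"].nunique())     # count on the full input
post_drop_g = int(df["unit"].nunique())  # count on the sample the variance uses
print(full_g, post_drop_g)
```

Here unit 1 is treated from the first observed period, so it is dropped and
the two counts differ by one; reporting `full_g` would overstate G.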
 diff_diff/two_stage.py  | 21 ++++++++++++++-------
 tests/test_two_stage.py | 33 +++++++++++++++++++++++++++++++++
 2 files changed, 47 insertions(+), 7 deletions(-)

diff --git a/diff_diff/two_stage.py b/diff_diff/two_stage.py
index 4d4eadf2..3b903fe4 100644
--- a/diff_diff/two_stage.py
+++ b/diff_diff/two_stage.py
@@ -2067,20 +2067,27 @@ def _refit_ts(w_r):
         # Resolve cluster_name / n_clusters for Results metadata. Suppress under
         # ANY survey design (the summary survey block already reports the
         # design's PSU/strata/replicate metadata; replicate-weight variance
-        # ignores cluster entirely). Otherwise:
-        #   bare cluster= -> the user-named cluster column
-        #   cluster=None  -> the Gardner GMM sandwich still clusters at the unit
-        #                    column by default (cluster_var = unit above), so the
-        #                    summary label reports unit-cluster CR1, not HC1.
+        # ignores cluster entirely). Otherwise count clusters on the POST-DROP
+        # fit sample `df` (always-treated units were removed above at the
+        # `df = df[~df[unit].isin(always_treated_units)]` step), NOT the full
+        # input `data`: the GMM sandwich computes variance over
+        # `cluster_ids=df[cluster_var].values` (see the `_compute_gmm_variance`
+        # call sites), so the reported G must match that effective count — using
+        # `data` would overstate the clusters the SE is actually based on when an
+        # always-treated unit/cluster is excluded. Branches:
+        #   bare cluster= -> the user-named cluster column (df[self.cluster])
+        #   cluster=None  -> the Gardner GMM sandwich clusters at `unit` by
+        #                    default (cluster_var = unit above), so the summary
+        #                    label reports unit-cluster CR1, not HC1.
         if resolved_survey is not None:
             _cluster_name_for_results: Optional[str] = None
             _n_clusters_for_results: Optional[int] = None
         elif self.cluster is not None:
             _cluster_name_for_results = self.cluster
-            _n_clusters_for_results = int(data[self.cluster].nunique())
+            _n_clusters_for_results = int(df[self.cluster].nunique())
         else:
             _cluster_name_for_results = unit
-            _n_clusters_for_results = int(data[unit].nunique())
+            _n_clusters_for_results = int(df[unit].nunique())
 
         # Construct results
         self.results_ = TwoStageDiDResults(
diff --git a/tests/test_two_stage.py b/tests/test_two_stage.py
index 59be18a0..9a59fed6 100644
--- a/tests/test_two_stage.py
+++ b/tests/test_two_stage.py
@@ -2190,3 +2190,36 @@ def test_bootstrap_single_psu_survey_returns_nan(self):
             )
         assert np.isnan(r.overall_se)
         assert np.isnan(r.bootstrap_results.overall_att_se)
+
+    # ---- metadata reflects the post-drop fit sample (codex P2) ----
+
+    def test_n_clusters_reflects_post_drop_fit_sample(self):
+        """Always-treated units are dropped before estimation, so reported
+        n_clusters must equal the POST-DROP effective cluster count the GMM
+        sandwich uses (cluster_ids = df[cluster_var] on the post-drop df), not
+        the full-input cluster count. Regression for the codex P2 metadata bug.
+        """
+        # Restrict to time >= 3 so the first_treat=3 cohort becomes
+        # always-treated (treated in every observed period) and is dropped.
+        data = generate_test_data(n_units=80, seed=20)
+        data = data[data["time"] >= 3].copy()
+        first_by_unit = data.groupby("unit")["first_treat"].first()
+        min_t = data["time"].min()
+        always_treated = first_by_unit[(first_by_unit > 0) & (first_by_unit <= min_t)].index
+        assert len(always_treated) > 0, "fixture should contain always-treated units"
+        full_units = data["unit"].nunique()
+        expected_g = full_units - len(always_treated)
+
+        with warnings.catch_warnings():
+            warnings.simplefilter("ignore")
+            r = TwoStageDiD().fit(
+                data, outcome="outcome", unit="unit", time="time", first_treat="first_treat"
+            )
+        assert r.cluster_name == "unit"
+        assert r.n_clusters == expected_g, (
+            f"n_clusters should be the post-drop count {expected_g}, "
+            f"got {r.n_clusters} (full input has {full_units})"
+        )
+        assert r.n_clusters < full_units  # would equal full_units under the bug
+        # to_dict mirrors the corrected count
+        assert r.to_dict()["n_clusters"] == expected_g

From 64119a0ee27feb7fa53c037cb3f30159ec6fa5c5 Mon Sep 17 00:00:00 2001
From: igerber <isaac.gerber@gmail.com>
Date: Sat, 30 May 2026 07:01:29 -0400
Subject: [PATCH 3/4] two-stage-did: explicit vcov_type on convenience fn +
 full NaN-contract tests (codex P3)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- two_stage_did(): expose vcov_type="hc1" explicitly (was hidden behind **kwargs)
  and forward it, matching the imputation_did/efficient_did sibling wrappers — the
  convenience API surface, generated signature, and IDE help now show the param.
- Degenerate-bootstrap tests now assert the FULL public NaN-propagation contract
  (overall t_stat/p_value/conf_int + every event-study/group inference field) via
  a shared _assert_full_bootstrap_nan helper, not just overall_se, so a partial
  regression in _build_nan_bootstrap_results can't slip through.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 diff_diff/two_stage.py  |  8 ++++++-
 tests/test_two_stage.py | 49 ++++++++++++++++++++++++++++++++---------
 2 files changed, 46 insertions(+), 11 deletions(-)
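
Note (not part of the commit): a minimal sketch of why the first bullet matters for the generated signature and IDE help. Both wrappers below are hypothetical stand-ins, not the library's `two_stage_did`. They only show that a parameter hidden behind `**kwargs` is invisible to `inspect.signature`, while a named parameter that is forwarded explicitly is visible along with its default.

```python
import inspect


def wrapper_hidden(data, **kwargs):
    """Hypothetical wrapper: vcov_type can only arrive through **kwargs."""
    return dict(kwargs)


def wrapper_explicit(data, vcov_type="hc1", **kwargs):
    """Hypothetical wrapper: vcov_type is named, then forwarded explicitly."""
    return dict(vcov_type=vcov_type, **kwargs)


# Parameter names as a documentation generator or IDE would see them.
hidden_params = list(inspect.signature(wrapper_hidden).parameters)
explicit_params = list(inspect.signature(wrapper_explicit).parameters)

# The default is introspectable only when the parameter is named.
explicit_default = inspect.signature(wrapper_explicit).parameters["vcov_type"].default

# Calling with no extra arguments: only the explicit wrapper forwards a value.
forwarded_hidden = wrapper_hidden(None)
forwarded_explicit = wrapper_explicit(None)
```

In this sketch the forwarded value is the same whenever the caller passes `vcov_type` by keyword; the difference is in what introspection reports and in the default being applied at the wrapper level.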

diff --git a/diff_diff/two_stage.py b/diff_diff/two_stage.py
index 3b903fe4..625c20d3 100644
--- a/diff_diff/two_stage.py
+++ b/diff_diff/two_stage.py
@@ -3361,6 +3361,7 @@ def two_stage_did(
     aggregate: Optional[str] = None,
     balance_e: Optional[int] = None,
     survey_design: object = None,
+    vcov_type: str = "hc1",
     **kwargs,
 ) -> TwoStageDiDResults:
     """
@@ -3392,6 +3393,11 @@ def two_stage_did(
         PSU, and FPC for design-based GMM sandwich variance. Strata enters
         survey df for t-distribution inference.
         Both analytical (n_bootstrap=0) and bootstrap inference are supported.
+    vcov_type : str, default="hc1"
+        Variance estimator family; permanently narrow to ``{"hc1"}`` (the Gardner
+        2022 two-stage GMM cluster-sandwich). Analytical-sandwich families
+        ``{"classical", "hc2", "hc2_bm"}`` and ``"conley"`` are rejected. See
+        :class:`TwoStageDiD`.
     **kwargs
         Additional keyword arguments passed to TwoStageDiD constructor.
 
@@ -3408,7 +3414,7 @@ def two_stage_did(
     ...                         'first_treat', aggregate='event_study')
     >>> results.print_summary()
     """
-    est = TwoStageDiD(**kwargs)
+    est = TwoStageDiD(vcov_type=vcov_type, **kwargs)
     return est.fit(
         data,
         outcome=outcome,
diff --git a/tests/test_two_stage.py b/tests/test_two_stage.py
index 9a59fed6..8b0799f8 100644
--- a/tests/test_two_stage.py
+++ b/tests/test_two_stage.py
@@ -1979,6 +1979,43 @@ def _eq(a, b):
                 assert _eq(e0[k][f], e1[k][f]), f"{attr}[{k}][{f}] differs"
 
 
+def _assert_full_bootstrap_nan(r):
+    """Assert the FULL public NaN-propagation contract under a degenerate
+    bootstrap (n_clusters<2 / n_psu<2): every overall + per-horizon + per-group
+    inference field is NaN, not just the SE (REGISTRY NaN-inference contract).
+    """
+    # Overall inference fields
+    assert np.isnan(r.overall_se)
+    assert np.isnan(r.overall_t_stat)
+    assert np.isnan(r.overall_p_value)
+    assert all(np.isnan(x) for x in r.overall_conf_int)
+    assert np.isnan(r.coef_var)
+    # Bootstrap payload
+    b = r.bootstrap_results
+    assert np.isnan(b.overall_att_se)
+    assert np.isnan(b.overall_att_p_value)
+    assert all(np.isnan(x) for x in b.overall_att_ci)
+    if b.event_study_ses:
+        assert all(np.isnan(v) for v in b.event_study_ses.values())
+    if b.group_ses:
+        assert all(np.isnan(v) for v in b.group_ses.values())
+    # Per-horizon event-study inference fields (skip reference-period markers,
+    # which carry n_obs == 0 and are not real effects).
+    for eff in (r.event_study_effects or {}).values():
+        if eff.get("n_obs", 1) == 0:
+            continue
+        assert np.isnan(eff["se"])
+        assert np.isnan(eff["t_stat"])
+        assert np.isnan(eff["p_value"])
+        assert all(np.isnan(x) for x in eff["conf_int"])
+    # Per-group inference fields
+    for eff in (r.group_effects or {}).values():
+        assert np.isnan(eff["se"])
+        assert np.isnan(eff["t_stat"])
+        assert np.isnan(eff["p_value"])
+        assert all(np.isnan(x) for x in eff["conf_int"])
+
+
 class TestTwoStageDiDVcovType:
     """Phase 1b interstitial #5 (final): vcov_type input contract on TwoStageDiD.
 
@@ -2164,14 +2201,7 @@ def test_bootstrap_single_cluster_returns_nan(self):
                 first_treat="first_treat",
                 aggregate="all",
             )
-        assert np.isnan(r.overall_se)
-        assert np.isnan(r.coef_var)
-        assert np.isnan(r.bootstrap_results.overall_att_se)
-        assert all(np.isnan(x) for x in r.bootstrap_results.overall_att_ci)
-        if r.bootstrap_results.event_study_ses:
-            assert all(np.isnan(v) for v in r.bootstrap_results.event_study_ses.values())
-        if r.bootstrap_results.group_ses:
-            assert all(np.isnan(v) for v in r.bootstrap_results.group_ses.values())
+        _assert_full_bootstrap_nan(r)
 
     def test_bootstrap_single_psu_survey_returns_nan(self):
         data, _ = _add_survey_cols(generate_test_data(n_units=80, seed=17))
@@ -2188,8 +2218,7 @@ def test_bootstrap_single_psu_survey_returns_nan(self):
                 survey_design=design,
                 aggregate="event_study",
             )
-        assert np.isnan(r.overall_se)
-        assert np.isnan(r.bootstrap_results.overall_att_se)
+        _assert_full_bootstrap_nan(r)
 
     # ---- metadata reflects the post-drop fit sample (codex P2) ----
 

From 9f3368a7d96af6ff1dee5f22456996d442cb3147 Mon Sep 17 00:00:00 2001
From: igerber <isaac.gerber@gmail.com>
Date: Sat, 30 May 2026 07:13:25 -0400
Subject: [PATCH 4/4] two-stage-did: count vcov n_clusters via np.unique like
 the variance (codex P2)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

n_clusters used Series.nunique() (drops NaN), but the GMM sandwich counts
np.unique(cluster_ids) (keeps a single NaN group). A non-survey cluster= column
with missing IDs would make the reported G undercount the SE's actual cluster
count. Count clusters the same way the variance does — np.unique(df[cluster_var])
— which also consolidates the two non-survey branches and still excludes
always-treated-dropped units (df, not data). Adds a NaN-cluster regression test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 diff_diff/two_stage.py  | 32 +++++++++++++++-----------------
 tests/test_two_stage.py | 26 ++++++++++++++++++++++++++
 2 files changed, 41 insertions(+), 17 deletions(-)
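
Note (not part of the commit): a minimal sketch of the counting mismatch described above, using a made-up cluster column rather than the test fixture. It assumes NumPy >= 1.21, where `np.unique` collapses repeated NaN values into a single entry; older NumPy versions may return each NaN separately.

```python
import numpy as np
import pandas as pd

# Hypothetical cluster column: six valid IDs, with two rows missing an ID.
cl = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, np.nan, np.nan, 1.0])

# Series.nunique() drops NaN by default, so missing IDs are not counted.
n_nunique = int(cl.nunique())

# np.unique keeps NaN as one extra group (assumes NumPy >= 1.21).
n_unique = int(np.unique(cl.values).size)

# pandas can be asked to keep NaN, which then agrees with np.unique.
n_nunique_keep = int(cl.nunique(dropna=False))

print(n_nunique, n_unique, n_nunique_keep)
```

The patch counts with `np.unique` rather than `nunique(dropna=False)` so that the metadata goes through the same call the variance uses, instead of relying on the two libraries agreeing.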

diff --git a/diff_diff/two_stage.py b/diff_diff/two_stage.py
index 625c20d3..96eb398f 100644
--- a/diff_diff/two_stage.py
+++ b/diff_diff/two_stage.py
@@ -2067,27 +2067,25 @@ def _refit_ts(w_r):
         # Resolve cluster_name / n_clusters for Results metadata. Suppress under
         # ANY survey design (the summary survey block already reports the
         # design's PSU/strata/replicate metadata; replicate-weight variance
-        # ignores cluster entirely). Otherwise count clusters on the POST-DROP
-        # fit sample `df` (always-treated units were removed above at the
-        # `df = df[~df[unit].isin(always_treated_units)]` step), NOT the full
-        # input `data`: the GMM sandwich computes variance over
-        # `cluster_ids=df[cluster_var].values` (see the `_compute_gmm_variance`
-        # call sites), so the reported G must match that effective count — using
-        # `data` would overstate the clusters the SE is actually based on when an
-        # always-treated unit/cluster is excluded. Branches:
-        #   bare cluster= -> the user-named cluster column (df[self.cluster])
-        #   cluster=None  -> the Gardner GMM sandwich clusters at `unit` by
-        #                    default (cluster_var = unit above), so the summary
-        #                    label reports unit-cluster CR1, not HC1.
+        # ignores cluster entirely). Otherwise count clusters EXACTLY the way the
+        # GMM sandwich does — `np.unique(df[cluster_var].values)` — so the
+        # reported G can never disagree with the SE:
+        #   - `df` (not the full input `data`) excludes always-treated units
+        #     dropped above at `df = df[~df[unit].isin(always_treated_units)]`,
+        #     matching the post-drop `cluster_ids=df[cluster_var].values` fed to
+        #     `_compute_gmm_variance`;
+        #   - `np.unique` (not `Series.nunique()`, which drops NaN) keeps the
+        #     same single NaN group the variance forms, so missing cluster IDs
+        #     cannot make the metadata undercount.
+        # `cluster_var` is `self.cluster`, or the `unit` column when
+        # `cluster=None` (the Gardner sandwich always clusters, at unit by
+        # default), so the summary renders the unit-cluster CR1 label, not HC1.
         if resolved_survey is not None:
             _cluster_name_for_results: Optional[str] = None
             _n_clusters_for_results: Optional[int] = None
-        elif self.cluster is not None:
-            _cluster_name_for_results = self.cluster
-            _n_clusters_for_results = int(df[self.cluster].nunique())
         else:
-            _cluster_name_for_results = unit
-            _n_clusters_for_results = int(df[unit].nunique())
+            _cluster_name_for_results = self.cluster if self.cluster is not None else unit
+            _n_clusters_for_results = int(np.unique(df[cluster_var].values).size)
 
         # Construct results
         self.results_ = TwoStageDiDResults(
diff --git a/tests/test_two_stage.py b/tests/test_two_stage.py
index 8b0799f8..1e5b29fb 100644
--- a/tests/test_two_stage.py
+++ b/tests/test_two_stage.py
@@ -2252,3 +2252,29 @@ def test_n_clusters_reflects_post_drop_fit_sample(self):
         assert r.n_clusters < full_units  # would equal full_units under the bug
         # to_dict mirrors the corrected count
         assert r.to_dict()["n_clusters"] == expected_g
+
+    def test_n_clusters_counts_nan_cluster_like_the_variance(self):
+        """The GMM sandwich counts clusters via np.unique(cluster_ids), which
+        keeps a single NaN group; Series.nunique() would drop NaN. n_clusters
+        metadata must match the variance so a `cluster=` column with missing IDs
+        cannot make the reported G undercount the SE's actual cluster count.
+        Regression for the codex round-3 P2.
+        """
+        data = generate_test_data(n_units=80, seed=21)
+        data["cl"] = (data["unit"] % 6).astype(float)
+        data.loc[data["unit"].isin([0, 1, 2]), "cl"] = np.nan
+        # No always-treated drop here (cohorts start at t=3, min_time=0), so
+        # df == data; count clusters the way the variance does.
+        expected_g = int(np.unique(data["cl"].values).size)
+        n_valid = int(data.loc[data["cl"].notna(), "cl"].nunique())
+        with warnings.catch_warnings():
+            warnings.simplefilter("ignore")
+            r = TwoStageDiD(cluster="cl").fit(
+                data, outcome="outcome", unit="unit", time="time", first_treat="first_treat"
+            )
+        assert r.cluster_name == "cl"
+        assert r.n_clusters == expected_g
+        # NaN is counted as a cluster (Series.nunique() would have dropped it),
+        # so G strictly exceeds the distinct non-NaN cluster count.
+        assert r.n_clusters > n_valid
+        assert r.to_dict()["n_clusters"] == expected_g