callaway-santanna: fix cluster= silent no-op + narrow vcov_type contract#487
Conversation
Phase 1b interstitial PR addressing four tightly-coupled items in
CallawaySantAnna and adjacent infrastructure:
1. Bug fix: bare CallawaySantAnna(cluster="state").fit(...) was a silent
no-op — the parameter was accepted at __init__, stored as self.cluster,
returned by get_params(), but never consumed in the fit / aggregator /
bootstrap pipeline. The docstring at staggered.py:154-156 claimed
"Defaults to unit-level clustering" but the aggregator at
staggered_aggregation.py:193-213 computed per-unit IF variance regardless
of self.cluster, and the bootstrap at staggered_bootstrap.py:323-347
drew per-unit multiplier weights regardless. Users who explicitly set
cluster="state" got per-unit inference believing they had cluster-robust
SE — typically SE too small under intra-cluster correlation, no warning.
Survey-PSU clustering via survey_design=SurveyDesign(psu="state") was
NOT affected and continued to cluster correctly via the existing
_compute_stratified_psu_meat machinery.
The fix synthesizes a minimal SurveyDesign(psu=self.cluster,
weight_type="pweight") when bare cluster= is set without an explicit
survey design, threading the synthesized PSU through the existing
survey-PSU machinery (aggregator + bootstrap). Three-branch wiring
mirrors estimators.py:497-516:
- bare cluster + no survey → synthesize SurveyDesign(psu=cluster)
- survey w/o PSU + cluster → inject cluster as PSU via
_inject_cluster_as_psu, AND construct effective_survey_design =
replace(survey_design, psu=cluster) so _validate_unit_constant_survey
catches movers on panel data
- survey w/ PSU + cluster → PSU wins; _resolve_effective_cluster
emits UserWarning if partitions differ
Wiring is inserted BEFORE the existing validators so synthesized /
injected designs pass through unit-constancy + pweight checks.
cluster_name on CallawaySantAnnaResults reflects the canonical PSU:
survey_design.psu when explicit PSU is provided, self.cluster when bare
cluster synthesizes / injects.
2. Narrow vcov_type input contract: CallawaySantAnna(vcov_type=...) now
accepts {"hc1"} only at __init__. Analytical-sandwich families
{classical, hc2, hc2_bm} and conley rejected with methodology-rooted
messages. The rejection is library-architectural, not paper-prescribed:
CS uses influence-function-based variance per Callaway & Sant'Anna
(2021); per-(g,t) doubly-robust / IPW / outcome-regression structure
has no single design matrix to compute hat-matrix leverage or
Bell-McCaffrey Satterthwaite DOF on. The narrow contract is permanent;
the same surface applies to other IF-based estimators (ImputationDiD,
EfficientDiD) when their PRs land. _validate_vcov_type extracted as
@staticmethod; called from both __init__ and fit() so sklearn-style
set_params mutations are re-checked at use time (preventing silent
propagation of bad values to Results metadata).
3. Survey/non-survey contract separation via dedicated df_inference field:
bare cluster= populates the new Results.df_inference field (carrying
cluster-level df for downstream HonestDiD t-critical-value selection)
but leaves Results.survey_metadata as None — preserving the canonical
"user provided a survey design" signal that DiagnosticReport
(diagnostic_report.py:848-856 Bacon skip + 1150-1158 2x2 PT skip) and
CallawaySantAnnaResults.summary() (staggered_results.py:235-238 survey
block render) read. Legitimate survey_design= fits populate both
df_inference AND survey_metadata; the two surfaces agree
(df_inference == survey_metadata.df_survey). HonestDiD at
honest_did.py prefers df_inference over survey_metadata.df_survey at
BOTH extraction sites (MPD branch + CS branch).
Replicate-weight survey designs combined with bare cluster= raise
NotImplementedError: replicate-weight variance is computed by replicate
reweighting (BRR / Fay / JK1 / JKn / SDR) and ignores PSU/cluster
entirely (survey.py:104-109 enforces replicate_weights are mutually
exclusive with strata/psu/fpc). Honoring bare cluster= would silently
have no effect on variance while populating cluster_name/n_clusters
dishonestly. Fail-closed per feedback_no_silent_failures.
vcov_type, cluster_name, n_clusters, and df_inference added to
CallawaySantAnnaResults with full class-docstring attribute entries.
4. TripleDifference defensive test_cluster_changes_ses regression test:
TripleDifference's bare-cluster code path (triple_diff.py:1245-1259)
was already correct but lacked a positive regression test (only an
error-handling test for missing cluster columns existed). Added
defensive coverage on a fixed-seed panel with state-level random
effects, mirroring tests/test_two_stage.py::test_cluster_changes_ses.
REGISTRY.md adds the "IF-based variance estimators vs analytical-sandwich
estimators" taxonomy subsection explaining the structural split (CS,
ImputationDiD, EfficientDiD vs DiD, MPD, TWFE, SA, StackedDiD,
WooldridgeDiD), plus CS variance-families + cluster-wiring documentation.
CHANGELOG Fixed + Added bullets. TODO.md Phase 1b row updated (4
remaining estimators, CS path narrowed permanently); new CS conley and
cluster=-deprecation follow-up rows. llms-full.txt CS entry updated with
the new vcov_type + cluster behavior.
Audit verified the cluster= silent no-op was CS-specific — the other 7
Phase 1b estimators (SunAbraham, StackedDiD, WooldridgeDiD, ImputationDiD,
TripleDifference, TwoStageDiD, EfficientDiD) handle bare cluster=
correctly.
Tests: 25 new tests across TestCallawaySantAnnaClusterWiring (8 panel +
RCS + survey precedence coverage), TestCallawaySantAnnaVcovTypeNarrowContract
(10 input contract + set_params re-validation), TestCallawaySantAnnaClusterSafetyGates
(6 mover validation + replicate reject + survey/non-survey contract +
end-to-end HonestDiD integration), and TestTripleDifferenceClusterDefensive
(1 defensive SE-changes regression). All 326 tests
(test_staggered + test_staggered_rc + test_triple_diff + test_honest_did)
pass. cluster=None path is bit-equal to pre-PR (wiring guarded by
if self.cluster is not None).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
# Conflicts: # CHANGELOG.md
|
Overall Assessment Executive Summary
Methodology
Code Quality
Performance Maintainability Tech Debt
Security Documentation/Tests
Path to Approval
|
…compat + bootstrap test coverage
Three issues:
1. df_inference contract narrowing: prior plumbing populated df_inference
from resolved_survey.df_survey for ALL cluster branches (synthesize,
inject, conflict), and HonestDiD preferred df_inference unconditionally.
But CS internally uses resolved_survey_unit.df_survey (unit-collapsed)
for safe_inference and can tighten that further via overall_effective_df
recompute at staggered.py:1995-1999. Preferring resolved_survey.df_survey
in HonestDiD on panel survey fits without explicit PSU (weights/strata/
fpc only) could switch from the unit-level df actually used by CS to
the larger observation-level df, making HonestDiD CIs/p-values anti-
conservative without warning. Fix narrows df_inference to bare-cluster-
synthesize path only (the case where survey_metadata is intentionally
None to preserve the survey/non-survey contract for DiagnosticReport /
summary). For inject/conflict branches, df_inference stays None and
HonestDiD reads df_survey directly from survey_metadata (which carries
the actual CS-internal df, including any post-recompute tightening).
HonestDiD preference inverted: survey_metadata.df_survey first,
df_inference fallback for bare-cluster fits.
2. Positional-arg compatibility: vcov_type was inserted between cluster
and n_bootstrap in __init__, breaking positional callers who passed
n_bootstrap or later args positionally (e.g.,
CallawaySantAnna("never_treated", 0, "dr", 0.05, None, 999) would
silently bind 999 to vcov_type → ValueError at validation). Moved
vcov_type to the end of __init__ (matches SA / WooldridgeDiD
convention of appending new params).
3. Bootstrap path coverage: added two new tests in
TestCallawaySantAnnaClusterSafetyGates:
- test_bare_cluster_bootstrap_se_differs_from_unit_level pins that
bare cluster= routes bootstrap through the PSU-multiplier-weights
branch at staggered_bootstrap.py:323-347 (closes the bootstrap-
side of the original silent-no-op bug class)
- test_survey_design_psu_wins_under_bootstrap pins that explicit
PSU wins under bootstrap when cluster= + survey_design.psu have
different partitions (mirrors the analytical-sandwich precedence)
Existing test_explicit_survey_design_does_populate_survey_metadata
updated to assert df_inference IS None in the inject branch (matches
new narrowed contract); the survey_metadata.df_survey assertion is
retained.
CHANGELOG Fixed bullet updated to describe the narrowed df_inference
contract + the inverted HonestDiD preference order.
All 328 tests (test_staggered + test_staggered_rc + test_triple_diff
+ test_honest_did) pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment Executive Summary
Methodology
Code Quality Performance Maintainability Tech Debt
Security Documentation/Tests
|
Two P3 doc-only follow-ups: 1. llms-full.txt CS entry kept vcov_type before n_bootstrap in the pseudo-signature even after the constructor was reordered to append vcov_type at the end. Moved vcov_type to the end of the pseudo- signature so the generated guide matches the actual __init__ order (matches SA / WooldridgeDiD convention). 2. staggered_results.py class attribute docstring + inline field comment for df_inference still described it as populated on "Legitimate survey_design= fits" alongside survey_metadata. Updated to reflect the narrowed contract: df_inference is populated on the bare-cluster- synthesize path ONLY (where survey_metadata is None). When user provides survey_design=, df_inference stays None and the canonical df carrier is survey_metadata.df_survey. HonestDiD reads survey_metadata first, df_inference fallback. 3. test_explicit_survey_design_does_populate_survey_metadata docstring still said "+ df_inference also populated"; updated to match the body assertion (which already correctly asserts df_inference is None). No source/test logic changes. All 80 targeted tests + test_honest_did pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ⛔ Blocker — one unmitigated P0 finding. Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
|
CI codex R3 P0: the cluster wiring contract documented in REGISTRY.md
("cluster=X means CR1 Liang-Zeger on the IF") was honored at the
aggregate inference surface (overall_se, event-study, group, bootstrap)
but the per-cell public surface results.group_time_effects[(g,t)]["se"]
remained unit-level. Users inspecting per-cell ATT(g,t) inference under
cluster= got silently misleading SE/t/p/CI even though overall inference
was correctly cluster-robust.
Fix: new module-level _cluster_robust_se_from_per_gt_if helper that
aggregates the per-(g,t) IF by PSU and returns CR1 Liang-Zeger SE.
Applied at all 4 ATT(g,t) computation sites identified by the codex:
1. _compute_all_att_gt_vectorized (no-covariate vectorized batch) —
recompute se after building inf_info, overwrite group_time_effects
[(g,t)]["se"] which was set with the unit-level value
2. _compute_all_att_gt_covariate_reg (covariate-reg batch) — same pattern
3. Main panel single-cell loop (after _compute_att_gt_fast) — local
se_gt update flows into gte_entry["se"]
4. RC fit loop (after _compute_att_gt_rc) — uses resolved_survey.psu
(per-obs) instead of resolved_survey_unit.psu (per-unit)
The recompute is gated by `if psu is not None`, so cluster=None remains
bit-equal to pre-PR. For cluster=unit (each unit its own cluster), the
CR1 formula coincides with the unit-level IF formula (modulo ddof
conventions in the underlying OR path) — methodologically consistent
with Williams (2000) CR1-on-IF for IF-based estimators.
Tests:
- test_per_gt_analytical_se_changes_with_cluster: asserts at least one
(g,t) cell shows measurable SE divergence between cluster=None and
cluster="state" on a panel with intra-cluster correlation
- test_per_gt_se_matches_explicit_survey_design: asserts per-(g,t) SE
agrees (rel=1e-10) between bare cluster="state" and explicit
SurveyDesign(psu="state") — both activate the same CR1 aggregation
All 414 tests (test_staggered + test_staggered_rc + test_triple_diff +
test_honest_did + test_two_stage) pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
|
…elper CI codex R4 P1: the per-(g,t) cluster helper from R3 computed sqrt(sum(psi_per_cluster^2)) — the basic IF aggregation — but the documented CR1 contract in REGISTRY.md routes through _compute_stratified_psu_meat, which adds: - G/(G-1) finite-sample correction - PSU centering (subtracting per-stratum mean) - FPC handling - lonely-PSU / G<2 → NaN behavior Without these, per-(g,t) clustered SEs were methodologically mismatched to the aggregate path and still understated SE on few-cluster designs. Fix: rewrite _cluster_robust_se_from_per_gt_if to build the per-(g,t) psi vector from inf_info and route through compute_survey_if_variance (survey.py:2139) which delegates to _compute_stratified_psu_meat. The helper signature changes from taking a bare psu_array to taking the full ResolvedSurveyDesign object — all 4 call sites updated. Per-cell SEs now inherit the SAME design-based machinery as overall + event-study + bootstrap inference. For panel path: ``resolved_survey_unit`` (length n_units). For RCS path: ``resolved_survey`` (length n_obs). The shared helper handles both index spaces uniformly. New regression test test_per_gt_se_matches_compute_survey_if_variance _helper: hand-constructs a small-G (10 cluster) design + synthetic inf_info, calls the helper, and asserts it equals direct compute_survey_if_variance reconstruction at rel=0/abs=0. Pins that the helper bypasses NO part of the shared machinery — any future refactor that diverges from compute_survey_if_variance fails this test. Closes CI codex R4 P1 + P2 findings. All 331 tests (test_staggered + test_staggered_rc + test_triple_diff + test_honest_did) pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
|
…gnment
Three fixes:
1. P1: when compute_survey_if_variance returns NaN (clustered variance
undefined — e.g., G<2, lonely-PSU removed all strata), the per-(g,t)
helper returned None and all 4 call sites silently kept the
unit-level SE. group_time_effects[(g,t)]["se"] could remain finite
under requested clustered inference even though the shared survey
helper said the variance was unidentified — violates the
safe_inference NaN-consistent contract + feedback_no_silent_failures.
Fix: change helper return contract:
- float SE → use it
- NaN → propagate NaN through to safe_inference (NaNs out the
full per-cell inference surface)
- None → malformed (negative variance), caller keeps unit-level
All 4 call sites updated to drop the np.isfinite(se_cluster) guard —
NaN now flows through and triggers the safe_inference NaN-consistent
behavior. The full per-cell surface (se, t_stat, p_value, conf_int)
becomes NaN when clustered variance is undefined.
2. P3: HonestDiD CS branch (honest_did.py:835) and dCDH branch
(honest_did.py:1003) read df_inference first, falling back to
survey_metadata.df_survey. The MPD branch (honest_did.py:654) was
updated in R2 to prefer survey_metadata first. The CS + dCDH
branches still had the old preference order, creating a refactor
trap (mutual-exclusion invariant kept this harmless today, but the
mismatch between docs/comments and code is a bug magnet). Updated
both CS + dCDH branches via replace_all to mirror the MPD ordering:
survey_metadata first, df_inference fallback.
3. P2: added test_per_gt_se_propagates_nan_when_cluster_variance_
undefined regression. Forces G=1 (single-cluster design) so
compute_survey_if_variance returns NaN, then asserts at least one
(g,t) cell has NaN se AND that the full inference surface (se,
t_stat, p_value, both CI bounds) is NaN-consistent. Pins the
NaN-propagation contract end-to-end.
All 416 tests (test_staggered + test_staggered_rc + test_triple_diff +
test_honest_did + test_two_stage) pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good — re-review of the prior clustered-SE finding is favorable. I do not see any unmitigated P0/P1 issues in the changed code. Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
|
CI codex R6 P3: _cluster_robust_se_from_per_gt_if docstring still said "compute_survey_if_variance returning NaN leads to None and fallback to unit-level SE" — but R5 changed the contract: NaN now propagates so the per-cell inference surface NaN-outs uniformly. Docstring documented the opposite of the code, creating a refactor trap. Rewrote the return-contract paragraph to enumerate the three explicit return types (float SE / NaN / None) with the caller's contract for each. Cross-references feedback_no_silent_failures for the rationale. No code/test changes — docstring-only fix. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good — re-review of the prior clustered-inference findings is favorable. I do not see any unmitigated P0/P1 issues in the changed code. Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
|
CI codex R7 P3 wanted a regression pinning that bare-cluster + bootstrap NaN-propagates when clustered bootstrap variance is undefined (G=1 single-cluster, no within-PSU variability). Writing the test exposed a real silent-failure gap: when n_psu < 2 in the PSU-multiplier-weights bootstrap path at staggered_bootstrap.py:323-347, all multiplier draws collapse to constants → BLAS produces ≈0 variance (e.g., 6.7e-16), not NaN. Downstream safe_inference treats this as finite-but-tiny SE, yielding near-infinite t-stats and ~0 p-values for inference that is actually undefined. Fix: capture `_bootstrap_cluster_variance_unidentified = len(psu_ids) < 2` after PSU multiplier weight generation; emit a UserWarning explaining the diagnosis; before returning CSBootstrapResults, NaN-out ALL bootstrap inference fields (overall_se, gt_ses, event_study_ses, group_effect_ses, plus their CIs / p-values; cband_crit_value=None). The conditional dict construction guards against optional fields being None (e.g., when no event-study aggregation was requested). New test_bare_cluster_bootstrap_propagates_nan_when_g_less_than_2 pins the NaN contract end-to-end: fits CS with cluster="single_cluster" (all units in 1 cluster) + n_bootstrap=99 + aggregate="event_study", asserts overall_* + per-(g,t) + event-study are NaN-consistent (se, t_stat, p_value, both CI bounds). Closes CI codex R7 P3. All 417 tests (test_staggered + test_staggered_rc + test_triple_diff + test_honest_did + test_two_stage) pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good — this re-review is favorable. I do not see any unmitigated P0/P1 issues in the changed code. The prior clustered-inference concerns appear resolved. Static review only: I could not run Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
|
CI codex R8 P3: aggregate="group" has its own SE computation at staggered_aggregation.py:782-860 (_aggregate_by_group) independent of overall + event-study, so a regression isolated to grouped effects would escape the existing tests. Added test that asserts: 1. Grouped SEs differ between cluster=None and cluster="state" on a panel with intra-cluster correlation (max diff > 1e-6) 2. Bare cluster="state" matches explicit SurveyDesign(psu="state") on the grouped surface (rel=1e-10, abs=1e-12) — confirms both routes activate the same CR1 aggregation through _aggregate_by_group All 418 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good. I do not see any unmitigated P0/P1 issues in the changed code. On re-review, the prior clustered-inference concerns appear resolved, including the earlier Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
|
Summary
Phase 1b interstitial. Four tightly-coupled items in
CallawaySantAnnaand adjacent infrastructure:Bug fix: bare
CallawaySantAnna(cluster="state").fit(...)was a silent no-op — accepted at__init__, stored asself.cluster, returned byget_params(), but never consumed in the fit / aggregator / bootstrap pipeline. Users got per-unit IF inference believing they had cluster-robust SE; typically SE too small under intra-cluster correlation, no warning. Survey-PSU clustering viasurvey_design=SurveyDesign(psu="state")was NOT affected. Fix synthesizes a minimalSurveyDesign(psu=self.cluster, weight_type="pweight")when barecluster=is set without an explicit survey design, threading through the existing PSU-meat aggregator + PSU-multiplier bootstrap. Three-branch wiring mirrorsestimators.py:497-516; wiring inserted BEFORE_validate_unit_constant_surveyso synthesized / injected designs catch movers on panel data.Narrow
vcov_typeinput contract:CallawaySantAnna(vcov_type=...)now accepts{"hc1"}only.{classical, hc2, hc2_bm, conley}REJECTED at__init__with methodology-rooted messages. Rejection is library-architectural, not paper-prescribed: CS uses influence-function-based variance per Callaway & Sant'Anna (2021); per-(g,t) doubly-robust / IPW / outcome-regression structure has no single design matrix to compute hat-matrix leverage or Bell-McCaffrey Satterthwaite DOF on. Narrow contract is permanent._validate_vcov_typeextracted as@staticmethod; called from both__init__andfit()so sklearn-styleset_paramsmutations are re-checked at use time.Survey / non-survey contract via dedicated
df_inferencefield: bare cluster= populates newResults.df_inference(cluster-level df for HonestDiD's t-critical-value selection) but leavesResults.survey_metadataasNone— preserving the canonical "user provided a survey design" signal thatDiagnosticReport(Bacon skip + 2x2 PT skip) andCallawaySantAnnaResults.summary()(survey block render) read. HonestDiD at both extraction sites prefersdf_inferenceoversurvey_metadata.df_survey. Replicate-weight survey designs combined with barecluster=raiseNotImplementedError(replicate IF variance ignores PSU entirely).TripleDifference defensive test: added
test_cluster_changes_sesregression test assertingTripleDifference(cluster="state")produces SE differing fromcluster=None. TripleDifference's bare-cluster code path attriple_diff.py:1245-1259was already correct but lacked a positive regression test.Audit verified the bare-
cluster=no-op was CS-specific — the other 7 Phase 1b estimators (SunAbraham, StackedDiD, WooldridgeDiD, ImputationDiD, TripleDifference, TwoStageDiD, EfficientDiD) handle barecluster=correctly.Methodology references
vcov_typeto{"hc1"}: library-architectural, not a deviation from CS 2021 — analytical-sandwich variance families and Conley spatial-HAC don't compose with IF-based variance machinery. Documented indocs/methodology/REGISTRY.md"IF-based variance estimators vs analytical-sandwich estimators" subsection.SurveyDesign(psu=cluster)synthesis for barecluster=is a documented synthesis (per CLAUDE.md "Documenting Deviations" —**Note:**label in REGISTRY.md): the synthesis itself is new wiring; the downstream PSU-meat machinery (_compute_stratified_psu_meat) is the established survey-side path; the CR1 Liang-Zeger algebra on IF is Williams (2000) / Hansen (2007).Validation
tests/test_staggered.py—TestCallawaySantAnnaClusterWiring(8 tests: panel + RCS + survey precedence coverage),TestCallawaySantAnnaVcovTypeNarrowContract(10 tests: input contract + set_params re-validation),TestCallawaySantAnnaClusterSafetyGates(6 tests: mover validation + replicate reject + survey/non-survey contract + end-to-end HonestDiD integration)tests/test_triple_diff.py—TestTripleDifferenceClusterDefensive::test_cluster_changes_sestests/test_staggered.py::TestCallawaySantAnnaClusterWiring::test_cluster_none_path_unchangedverifies cluster=None path is bit-equal to pre-PR (wiring guarded byif self.cluster is not None:)Security / privacy
🤖 Generated with Claude Code