Reject covariate names colliding with reserved structural terms (DifferenceInDifferences, MultiPeriodDiD, TwoWayFixedEffects)#518
Conversation
…iD family)
DifferenceInDifferences, MultiPeriodDiD, and TwoWayFixedEffects build their
coefficients dict by zipping a var_names list (structural term names + user
covariate column names appended verbatim) with the fitted coefficient vector.
A covariate whose name equaled a reserved structural name (const; the
treatment/time column names; the {treatment}:{time} interaction; MPD period_{p}
dummies and {treatment}:period_{p} interactions; TWFE ATT; fixed-effect / unit /
time dummy names; or an internal _-prefixed working column such as _treat_time /
_did_treatment / _treatment_post) silently overwrote that structural coefficient
via dict last-write-wins -- e.g. a covariate named const dropped the intercept --
with no error.
Add a shared validate_covariate_names helper (utils.py) called in each fit()
before the design matrix is built: it raises ValueError on a collision
(case-sensitive, so Const is allowed) and on duplicate covariate names.
Fixed-effect/unit/time dummy reserved names are taken from the same
pd.get_dummies(..., drop_first=True) call used to build them, so they match
exactly (including for Categorical columns with a non-default category order).
The TWFE guard fires on both variance paths: the within-transform path returns
only {"ATT": att}, but a covariate named _treatment_post would still clobber the
internal interaction column.
Potentially breaking: a fit that previously succeeded with a colliding (or
duplicated) covariate name -- silently returning a corrupted coefficient dict --
now raises. The staggered / influence-function estimators key results by (g, t)
tuples / relative-time indices and are unaffected; SyntheticDiD overrides fit()
and never reaches the base-class guard.
Tests: tests/test_utils.py::TestValidateCovariateNames,
tests/test_estimators.py::TestCovariateNameCollision,
tests/test_estimators_vcov_type.py::TestTWFECovariateNameCollision.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ions
The upfront validate_covariate_names guard only validates the user `covariates`
list. Fixed-effect dummy names (`{fe}_{value}`) are appended to var_names AFTER
that guard and were not themselves checked against structural names — so
MultiPeriodDiD(fixed_effects=["period"]) (a non-time FE whose dummies are named
period_1, period_2, ...) could still collide with the structural period_{p}
event-study keys and silently overwrite them in the coefficients dict, the exact
bug class this PR addresses (caught in local review).
Add a shared validate_design_term_names(var_names) backstop in utils.py that
enforces uniqueness on the FINAL assembled term list (structural + covariates +
fixed-effect dummies) right before the coefficients dict is built, in all three
estimators. This catches data-dependent collisions (FE-dummy vs structural,
FE-dummy vs FE-dummy) that cannot be known up front. The upfront covariate guard
is kept for its precise, fail-fast messaging on the common case.
Tests: tests/test_utils.py::TestValidateDesignTermNames + an MPD fixed-effect-
dummy-vs-period_{p} regression in TestCovariateNameCollision. Full estimator +
TWFE-methodology suites pass with the backstop (no false positives on existing
fixed-effect fits).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… (perf) The collision guard reserved fixed-effect/unit/time dummy names via pd.get_dummies(col, ...).columns, which materializes the full (n x G) dummy matrix. In TwoWayFixedEffects this ran UNCONDITIONALLY — including the default within-transform (hc1/classical/conley) path whose entire purpose is to avoid expanding high-cardinality FE dummies — so a large-G panel paid an avoidable O(n x G) allocation (latency / potential OOM) before estimation (caught in local review). Add utils.fe_dummy_names(col, prefix): derives the same names get_dummies would produce, from the category levels only (O(G)), via pd.Categorical(col).categories (sorted unique for a plain column, declared order for a Categorical) + drop_first. Verified to match pd.get_dummies(drop_first=True).columns exactly across int / str / float / NaN / non-default-order Categorical dtypes. Wire it into the DiD, MPD, and TWFE reserved-name construction in place of get_dummies. The within- transform TWFE path now performs no FE-dummy materialization at all. Tests: tests/test_utils.py::TestFeDummyNames (equivalence to get_dummies, incl. Categorical order) + a TwoWayFixedEffects within-path regression that monkeypatches pd.get_dummies to raise and asserts the hc1 fit still succeeds. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
validate_design_term_names ran just before coef_dict construction — i.e. AFTER
the OLS fit — in DifferenceInDifferences and MultiPeriodDiD, so a duplicate
assembled term name (e.g. an MPD fixed-effect dummy colliding with a structural
period_{p} key) drove a wasted rank-deficient fit and emitted a misleading
multicollinearity warning before the intended ValueError (local review P3).
Move the check to immediately after var_names is fully assembled and before the
regression call in both estimators (TwoWayFixedEffects already checked pre-fit).
var_names is not mutated between assembly and coef_dict, so this is behavior-
preserving except that the ValueError now fires fast and warning-free.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Overall Assessment ✅ Looks good. No unmitigated P0/P1 findings. Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
|
Summary
DifferenceInDifferences,MultiPeriodDiD, andTwoWayFixedEffects. These estimators build theircoefficientsdict by zipping avar_nameslist (structural terms + user covariate names appended verbatim) with the coefficient vector, so a covariate named like a structural term (const, the treatment/time columns, the{treatment}:{time}interaction, MPDperiod_{p}dummies, TWFEATT, FE/unit/time dummy names, or an internal_-prefixed working column) silently overwrote that coefficient (dict last-write-wins) with no error. A sharedvalidate_covariate_namesguard now raisesValueErroron such a collision (case-sensitive) and on duplicate covariate names.validate_design_term_namesbackstop on the FINAL assembledvar_names(before the regression), catching fixed-effect-dummy collisions the covariate guard can't know up front — e.g.MultiPeriodDiD(fixed_effects=["period"])whose dummies collide with the structuralperiod_{p}event-study keys.utils.fe_dummy_names(O(G) category-level, matchingpd.get_dummies(drop_first=True)exactly incl.Categoricalorder) instead of materializing the dummy matrix — so the TWFE within-transform path keeps its high-cardinality scaling contract.(g, t)and are unaffected;TripleDifference/SyntheticControl/SyntheticDiDdo not expose covariates by name.Methodology references (required if estimator / math changes)
DifferenceInDifferences,MultiPeriodDiD,TwoWayFixedEffects— input-validation guard on coefficient-dict labeling.docs/methodology/REGISTRY.mdunder the DiD/MPD/TWFE "covariate-name collision guard" Notes.Validation
tests/test_utils.py(TestValidateCovariateNames,TestValidateDesignTermNames,TestFeDummyNames),tests/test_estimators.py(TestCovariateNameCollision),tests/test_estimators_vcov_type.py(TestTWFECovariateNameCollision+ a within-path no-FE-dummy-materialization regression).fe_dummy_namesverified equal topd.get_dummies(drop_first=True).columnsacross int/str/float/NaN/non-default-orderCategorical.Security / privacy
Generated with Claude Code