Skip to content

Add survey data support (Phases 1-2)#218

Merged
igerber merged 24 commits intomainfrom
survey-data
Mar 21, 2026
Merged

Add survey data support (Phases 1-2)#218
igerber merged 24 commits intomainfrom
survey-data

Conversation

@igerber
Copy link
Owner

@igerber igerber commented Mar 20, 2026

Summary

  • Add SurveyDesign class for specifying complex survey structures (stratification, clustering, weights, FPC)
  • Implement weighted OLS via sqrt(w) transformation in solve_ols() with three weight types (pweight/fweight/aweight)
  • Implement Taylor Series Linearization (TSL) variance estimation with strata, PSU, and FPC support
  • Integrate survey_design parameter into DifferenceInDifferences.fit(), TwoWayFixedEffects.fit(), and MultiPeriodDiD.fit()
  • Add weighted demeaning to demean_by_group() and within_transform() for correct absorb/within-transformation with survey weights
  • Add SurveyMetadata to DiDResults with effective sample size, design effect, and survey d.f.
  • Create Phase 3-5 roadmap (docs/survey-roadmap.md) for future standalone estimator integration

Methodology references (required if estimator / math changes)

  • Method name(s): Weighted Least Squares (WLS), Taylor Series Linearization (TSL), Survey degrees of freedom
  • Paper / source link(s):
    • Lumley (2004) "Analysis of Complex Survey Samples", JSS 9(8)
    • Binder (1983) "On the Variances of Asymptotically Normal Estimators from Complex Surveys", ISR 51(3)
    • Solon, Haider, & Wooldridge (2015) "What Are We Weighting For?", JHR 50(2)
    • Korn & Graubard (1990) "Simultaneous Testing of Regression Coefficients with Complex Survey Data", JASA 85(409)
  • Any intentional deviations from the source (and why):
    • Weight normalization uses sum(w)=n convention (DRDID/Stata), not raw weights (R survey). Coefficients identical; documented in REGISTRY.md.
    • lonely_psu defaults to "remove" (not R's "fail"), matching common applied practice. Documented in REGISTRY.md.
    • Survey df uses n_PSU - n_strata (not Satterthwaite approximation). Documented in REGISTRY.md.

Validation

  • Tests added/updated:
    • tests/test_survey.py — 42 tests across 5 tiers:
      • Tier 1: Analytical verification (exact match, ≤1e-10 tolerance)
      • Tier 2: R cross-validation (reference values)
      • Tier 3: Consistency and invariance (FPC monotonicity, stratification, degeneracy)
      • Tier 4: Integration (full pipeline DiD/TWFE/MultiPeriod, validation, PSU override, summary output)
      • Tier 5: Monte Carlo coverage (500 sims, 96.6% coverage at nominal 95%)
    • tests/reference/survey_reference.R — R script for generating reference values
    • tests/reference/survey_reference.csv — Pre-computed R reference outputs
  • Full test suite: 1783 passed, 0 failures, 0 regressions

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

@github-actions
Copy link

Overall Assessment

Blocker

Static review only: I could not execute the test suite in this sandbox because the available python does not have the project dependencies (numpy/pandas), so this review is based on the diff plus surrounding code.

Executive Summary

  • I checked the new survey registry entry in docs/methodology/REGISTRY.md; it documents weight normalization, lonely_psu default, aweight/fweight conventions, and survey d.f. None of the blocking issues below are covered by a Note/Deviation entry.
  • solve_ols() now applies survey weights in the sqrt(w) transform and then applies them again inside the weighted vcov code, so weighted analytic SEs are wrong on the non-TSL paths.
  • TwoWayFixedEffects’ new weighted within-transformation is not the weighted FWL projection, so weighted TWFE can return the wrong ATT, not just the wrong SE.
  • The new singleton-strata path can produce se=0 and then finite-looking inference (t=±inf, p=0, point CI) instead of NaNs.
  • FPC handling is inconsistent with the stated N_h per stratum semantics: validation uses observation counts instead of sampled PSUs, does not enforce constancy within stratum, and ignores fpc-only designs.
  • Non-blocking gaps remain in MultiPeriodDiD result plumbing and in the claimed R cross-validation tests.

Methodology

Affected methods: weighted least squares (WLS), Taylor Series Linearization (TSL), survey degrees of freedom, and weighted TWFE residualization.

  • Severity P0. Impact: weighted analytic inference is mathematically wrong anywhere the new code relies on solve_ols() vcov rather than compute_survey_vcov() (for example, weights-only survey designs, PSU-only survey designs, and direct weighted solve_ols() / LinearRegression calls). solve_ols() first builds X_w = sqrt(w) X and y_w = sqrt(w) y, but _compute_robust_vcov_numpy() then multiplies by weights again, which changes the bread from X'WX to X'W^2X and the score terms from wXu to w^2Xu. The non-robust weighted fallback is also wrong because it uses X'X instead of X'WX. This directly violates the registry’s WLS and weighted sandwich formulas and is not a documented deviation. Concrete fix: after the sqrt(w) transform, compute vcov on transformed data with weights=None, or compute vcov on original-scale X and u with exactly one W; also update the classical weighted vcov to use X'WX and weighted RSS. Add exact oracle tests against manual WLS HC1/cluster formulas and the external R/Stata references. Location: diff_diff/linalg.py lines 484-498, diff_diff/linalg.py lines 715-741, diff_diff/linalg.py lines 883-929, diff_diff/linalg.py lines 1430-1468, diff_diff/estimators.py lines 1059-1077.

  • Severity P0. Impact: the new weighted within_transform() used by TwoWayFixedEffects is not equivalent to weighted least squares with unit and time dummies. The implementation simply substitutes weighted means into the unweighted identity x_it - x_i. - x_.t + x_.., but the correct weighted FWL residualization is M_A^W x = x - A(A'WA)^{-1}A'Wx. Those are only equivalent in special cases; with general survey weights they produce different regressors and therefore different ATT estimates. This is an undocumented methodology mismatch in an estimator path. Concrete fix: when weights are present, do not use the one-shot two-way demeaning formula; either fit explicit unit/time dummies under WLS or implement a weighted alternating-projection / absorb algorithm that solves the weighted normal equations. Add a regression test where weights vary within both unit and time and compare against explicit weighted dummy-variable regression. Location: diff_diff/twfe.py lines 136-155, diff_diff/utils.py lines 1777-1862.

  • Severity P0. Impact: the new survey variance path can silently emit misleading inference when no stratum contributes variance. In compute_survey_vcov(), singleton strata under lonely_psu="remove" or "certainty" are skipped, so a design with only singleton strata yields an all-zero vcov. That then flows into LinearRegression.get_inference(), which converts se<=0 into t=±inf/0, p=0/1, and a point CI instead of NaNs, violating the project’s explicit safe_inference() rule. This is a silent statistical-output bug on a newly introduced code path. Concrete fix: detect “no contributing strata” / non-positive survey d.f. and return NaN vcov (or raise) before inference, and route survey inference through the same all-or-nothing NaN gate as safe_inference(). Add tests for all-singleton-strata survey designs under both "remove" and "certainty". Location: diff_diff/survey.py lines 467-493, diff_diff/linalg.py lines 1582-1628, diff_diff/utils.py lines 152-183, tests/test_survey.py lines 934-959.

  • Severity P1. Impact: FPC handling is internally inconsistent. The docstring says fpc is N_h per stratum, and the TSL code uses n_psu_h / N_h, but validation compares fpc_h to the number of observations in the stratum, not the number of sampled PSUs. That rejects valid clustered survey designs whenever a PSU contributes multiple rows. The code also takes fpc_arr[mask_h][0] / fpc_arr[0] without checking that FPC is constant within stratum, and the fpc-only path silently ignores FPC altogether. None of those behaviors are documented in the registry. Concrete fix: when psu is present, validate N_h >= n_psu_h; when psu is absent, validate against observation count; require FPC to be constant within stratum (or overall if unstratified); and either implement the SRS FPC case or reject fpc-only input with a clear error. Add tests for repeated observations per PSU, non-constant FPC within stratum, and unstratified fpc-only designs. Location: diff_diff/survey.py lines 153-180, diff_diff/survey.py lines 419-435, diff_diff/survey.py lines 480-484.

Code Quality

  • Severity P3. Impact: _resolve_effective_cluster() compares factorized PSU codes to raw cluster labels, so cluster='<same PSU column>' can still trigger a “different groupings” warning even when the grouping is identical. That makes the warning noisy and undermines trust in the real mismatch case. Concrete fix: compare the raw PSU column to the cluster column before factorization, or compare partition equivalence rather than unique-label sets. Location: diff_diff/survey.py lines 351-372.

Performance

No material new performance findings beyond the correctness issues above.

Maintainability

  • Severity P2. Impact: MultiPeriodDiD.fit() resolves survey metadata and then throws it away, and MultiPeriodDiDResults has no survey_metadata field or summary/export support. One of the three advertised estimator integrations therefore hides weight type, design effect, and survey d.f. from the public results surface. Concrete fix: add survey_metadata to MultiPeriodDiDResults, pass it from fit(), and update summary() / to_dict() plus integration tests. Location: diff_diff/estimators.py lines 951-956, diff_diff/estimators.py lines 1143-1164, diff_diff/results.py lines 289-551.

Tech Debt

  • Severity P3. Impact: if the non-blocking API/doc/test gaps are intentionally deferred, they are not currently tracked under TODO.md, so they would not count as mitigated under this review rubric. Concrete fix: if those P2/P3 items are not fixed here, add explicit entries under TODO.mdTech Debt from Code Reviews with this PR reference. Location: TODO.md lines 16-65.

Security

No material security findings in the changed files.

Documentation/Tests

  • Severity P2. Impact: the claimed “Tier 2: R cross-validation” is not actually wired into the tests. survey_reference.R says the CSV is read by tests/test_survey.py, but the tests never load it; instead they compare coefficients to loose DGP constants and compare TSL SEs to _vcov from the same implementation, which cannot catch a shared algebra bug. The committed CSV also contains placeholder-looking values, so it is not a trustworthy oracle. Concrete fix: either load tests/reference/survey_reference.csv in Tier 2 and assert exact coefficient/SE/df matches, or remove the unused artifacts and reword the claim. Add targeted tests for exact manual weighted HC1/cluster formulas, weighted TWFE vs explicit dummy WLS, all-singleton lonely_psu, and FPC validation. Location: tests/reference/survey_reference.R lines 1-10 and 39-71, tests/reference/survey_reference.csv lines 1-5, tests/test_survey.py lines 261-356.

  • Severity P3. Impact: the new public parameters are not documented in the inline API docs. survey_design appears in estimator signatures but not in the parameter sections, and solve_ols() now accepts weights / weight_type without documenting them. That makes the new survey API much less discoverable. Concrete fix: update the docstrings for DifferenceInDifferences.fit(), MultiPeriodDiD.fit(), TwoWayFixedEffects.fit(), and solve_ols() with parameter semantics, supported combinations, and weight-type behavior. Location: diff_diff/estimators.py lines 146-206, diff_diff/estimators.py lines 748-800, diff_diff/twfe.py lines 57-87, diff_diff/linalg.py lines 339-425.

Path to Approval

  1. Fix the weighted vcov algebra so weighted analytic inference is computed with exactly one W, and correct the non-robust weighted vcov path to use X'WX plus weighted RSS.
  2. Replace the weighted TWFE within-transformation with a method that is actually equivalent to weighted FE regression, or fall back to explicit weighted unit/time dummies when weights are present.
  3. Add an all-or-nothing NaN guard for survey designs with zero / undefined variance contribution so t_stat, p_value, and CI all become NaN together.
  4. Correct FPC validation/application: validate against sampled PSUs when psu is present, enforce constancy within stratum, and either implement or reject fpc-only designs clearly.

igerber added a commit that referenced this pull request Mar 20, 2026
P0-1: Fix double-weighting in vcov — solve_ols() now computes vcov on
original-scale data with weights applied once (not on sqrt(w)-transformed
data with weights applied again). Fix HC1 meat to use X'diag(w*u²)X
instead of (X*w*u)'(X*w*u) which gave w² in the meat.

P0-2: Replace one-shot weighted within-transformation with iterative
alternating projections (max_iter=100, tol=1e-8) for correct weighted
FWL residualization in TWFE.

P0-3: Add NaN-vcov guard when no stratum contributes variance (all
singletons skipped), preventing se=0 → t=±inf instead of NaN.

P1: FPC validation now checks against PSU count (not obs count),
enforces constancy within stratum, and rejects fpc-only designs.

P2: Add survey_metadata to MultiPeriodDiDResults. Replace placeholder
R reference tests with exact manual oracle tests.

P3: Fix cluster comparison to use partition equivalence. Add docstrings
for survey_design/weights/weight_type parameters.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Owner Author

igerber commented Mar 20, 2026

/ai-review

@github-actions
Copy link

🔁 AI review rerun (requested by @igerber)

Head SHA: 0b42b407c63b76e91bd31b5a4220ba0eebccbb75


Overall Assessment

⚠️ Needs changes

Static review only: I could not execute pytest in this sandbox because pytest is unavailable and the available Python environment is missing numpy/pandas.

Executive Summary

  • Most of the previous blocker findings appear addressed in the diff: weighted TWFE residualization now uses iterative weighted projections, the all-singleton survey variance path now returns NaN vcov, MultiPeriodDiD now carries survey metadata, and the registry/docstrings/tests are substantially improved.
  • I do not see a remaining P0.
  • One methodology gap remains in the core survey routing: weights-only and PSU-only SurveyDesign inputs still bypass the survey vcov path and fall back to generic OLS HC1/cluster logic.
  • MultiPeriodDiD(..., survey_design=..., robust=False) still overwrites weighted/survey vcov with an unweighted homoskedastic OLS vcov.
  • The FPC fix is still partial: unstratified PSU-only designs can pass row-varying fpc, and only the first value is used.
  • The documented deviations in REGISTRY.md are correctly labeled and should be treated as informational, not defects.

Methodology

Affected methods: survey-weighted WLS, Taylor Series Linearization variance, and survey degrees of freedom. The survey references and official implementations treat weighted estimating equations as survey totals and compute design-based/model-robust linearized SEs from those totals, with customary degrees of freedom based on PSUs minus strata. (stata.com)

  • Severity P1. Impact: weights-only and PSU-only survey designs still bypass the survey vcov path. ResolvedSurveyDesign.needs_tsl_vcov is True only when strata or fpc is present, so DiD/TWFE fall back to LinearRegression.fit()’s generic HC1/cluster vcov and MultiPeriodDiD.fit() does the same on its direct solve_ols() path, even though compute_survey_vcov() has dedicated branches for both no-strata/no-PSU and PSU-only survey designs. That is an undocumented methodology mismatch: the survey references and official implementations use design-based/model-robust linearized SEs for survey-weighted regression, not the generic OLS fallback. References: diff_diff/survey.py:L263-L265, diff_diff/linalg.py:L1417-L1503, diff_diff/estimators.py:L1037-L1058, diff_diff/survey.py:L443-L460. Concrete fix: make any non-None SurveyDesign use the survey vcov path (or reject unsupported combinations explicitly), and add weights-only/PSU-only oracle tests against survey::svyglm or Stata svy. (stata.com)
  • Severity P1. Impact: MultiPeriodDiD.fit() still has a wrong weighted/survey inference branch when robust=False and cluster=None. It computes survey vcov at diff_diff/estimators.py:L1054-L1058, but then immediately overwrites it with an unweighted homoskedastic OLS vcov based on X'X and unweighted RSS at diff_diff/estimators.py:L1067-L1085. That silently produces wrong SEs/CIs for any weighted or survey-weighted multi-period fit and regresses the weighted classical logic already added in diff_diff/linalg.py:L1450-L1498. Concrete fix: skip this override whenever weights or survey_design are present, or recompute the fallback with weighted RSS and X'WX; preferably reject robust=False under survey_design unless a model-based variance mode is intentionally supported and documented.
  • Severity P1. Impact: the FPC validation fix is still partial. SurveyDesign.resolve() now enforces constant FPC within strata, but in the unstratified PSU-only branch it only checks fpc_arr[0] >= n_psu, and compute_survey_vcov() later uses only resolved.fpc[0]. A row-varying fpc therefore passes validation and the reported SEs depend on row order. Both the PR’s own docstring and the survey references treat FPC as a stratum/stage-level population size, not an observation-level free-form vector. References: diff_diff/survey.py:L44-L45, diff_diff/survey.py:L197-L203, diff_diff/survey.py:L455-L458. Concrete fix: when strata is None and psu is not None, require a single unique FPC value (or implement/document richer semantics explicitly) and add a negative test for varying unstratified FPC. (stata.com)
  • Severity P3 informational. Impact: the intentional deviations on weight normalization, lonely_psu="remove", aweight/fweight handling, and survey d.f. are explicitly documented, so I am not counting those as defects. Reference: docs/methodology/REGISTRY.md:L1838-L1875. Concrete fix: none.

Code Quality

  • No material new findings beyond the methodology issues above.

Performance

  • No material new findings.

Maintainability

  • No material new findings.

Tech Debt

  • No material new findings.

Security

  • No security findings in the changed files.

Documentation/Tests

  • Severity P2. Impact: the new tests are much better than the prior revision, but they still do not pin the two remaining survey edge cases above. test_no_strata_degeneracy() accepts any PSU-only discrepancy inside a 0.2x-to-5x band instead of asserting a reference value, and there is no test that row-varying unstratified FPC is rejected. References: tests/test_survey.py:L449-L483, tests/test_survey.py:L1131-L1187. Concrete fix: replace that loose ratio check with an external-oracle assertion for PSU-only survey designs, add a weights-only survey_design oracle, and add a failing-validation test for varying unstratified FPC.

Path to Approval

  1. Route any non-None survey_design through the survey vcov path, including weights-only and PSU-only designs, instead of falling back to generic HC1/cluster vcov.
  2. In MultiPeriodDiD.fit(), stop overwriting weighted/survey vcov in the not self.robust and self.cluster is None branch; either remove that override under weights/survey_design or recompute it with weighted RSS and X'WX.
  3. Enforce FPC constancy for unstratified PSU designs before using fpc_arr[0], and add a validation test that row-varying unstratified FPC raises.
  4. Add exact survey::svyglm or Stata svy oracle tests for weights-only and PSU-only survey designs.

igerber added a commit that referenced this pull request Mar 20, 2026
- Rename needs_tsl_vcov to needs_survey_vcov and always route through
  survey vcov path when a SurveyDesign is resolved (P1-1)
- Guard MultiPeriodDiD robust=False override so it does not overwrite
  survey vcov when survey weights are present (P1-2)
- Validate FPC constancy for unstratified PSU-only designs (P1-3)
- Replace loose ratio check with exact oracle assertions for PSU-only
  and weights-only survey vcov tests, add negative FPC test (P2)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Owner Author

igerber commented Mar 20, 2026

/ai-review

@github-actions
Copy link

🔁 AI review rerun (requested by @igerber)

Head SHA: 72c48e14580b814dd2bb9c04774770eb7ccfe63b


PR Review

Overall Assessment

⚠️ Needs changes

Static review only: I could not execute the new survey test suite in this sandbox because the available Python environment is missing pytest and numpy.

Executive Summary

  • The previous survey-routing issues from the last review appear fixed: all SurveyDesign inputs now go through the survey vcov path, the MultiPeriodDiD(..., robust=False) overwrite is gated off for survey-weighted fits, and unstratified varying FPC is now rejected.
  • One methodology issue remains in the new fweight support: the registry documents df = sum(w) - k, but estimator-level inference still falls back to row-count n - k in the no-PSU path, and weights-only survey vcov hard-codes the row-count HC1 adjustment.
  • One new edge-case regression exists in the weighted solver path: rank-deficient weighted/survey fits no longer preserve the existing R-style “drop column / return NaN for unidentified coefficients” behavior and can instead produce all-NaN fitted values/residuals or raise a generic singular X'WX error.
  • The documented methodology deviations in the registry are correctly labeled and should be treated as informational only.
  • Test coverage is much better, but it still misses the two broken branches above: fweight inference and weighted rank-deficiency handling.

Methodology

  • Severity P1. Impact: fweight inference is only partially implemented. The Methodology Registry explicitly says fweight uses df = sum(w) - k, not n - k, but estimator-level inference still uses row count in the non-PSU path. _compute_robust_vcov_numpy() applies n_eff = sum(weights) for fweight small-sample adjustments in the generic weighted sandwich path, yet LinearRegression.fit() stores self.df_ as n_obs - k and get_inference() uses that value, while MultiPeriodDiD.fit() also computes df from len(y) when no survey PSU is present. On top of that, any SurveyDesign is forced onto the survey-vcov path, and the weights-only survey branch hard-codes adjustment = n / (n - k) instead of the documented fweight convention. This makes p-values/CIs incorrect for supported fweight fits without PSU structure, including weights-only and strata-only survey designs. References: docs/methodology/REGISTRY.md:L1859-L1866, diff_diff/linalg.py:L895-L926, diff_diff/linalg.py:L1515-L1519, diff_diff/linalg.py:L1633-L1640, diff_diff/estimators.py:L1061-L1065, diff_diff/survey.py:L267-L270, diff_diff/survey.py:L448-L452. Concrete fix: propagate an effective_n/df_weighted concept through LinearRegression.fit() and MultiPeriodDiD.fit() for fweight, and make the weights-only survey vcov branch reuse the fweight-aware HC1 logic from _compute_robust_vcov_numpy() or reject SurveyDesign(..., weight_type="fweight") without PSU until that path is implemented correctly.

  • Severity P3 informational. Impact: the intentional deviations for weight normalization, lonely_psu="remove", and survey d.f. are now documented in the registry, so they should not block approval. References: docs/methodology/REGISTRY.md:L1838-L1842, docs/methodology/REGISTRY.md:L1849-L1857, docs/methodology/REGISTRY.md:L1868-L1872. Concrete fix: none.

Code Quality

  • Severity P1. Impact: weighted and survey-weighted rank-deficient fits lose the solver’s existing R-style dropped-column handling. In the unweighted rank-deficient path, _solve_ols_numpy() computes fitted values on the reduced design and expands vcov with NaN only for dropped columns. In the new weighted path, solve_ols() disables backend vcov, then back-transforms with fitted_orig = _original_X @ coefficients even when coefficients already contains NaN for dropped columns; that turns fitted values and residuals into all-NaN. It then tries to compute vcov on the full singular X'WX, which raises a generic singular-matrix error instead of honoring rank_deficient_action. That breaks exactly the behavior TWFE is trying to preserve with its post-fit dropped-column handling. References: diff_diff/linalg.py:L725-L733, diff_diff/linalg.py:L508-L511, diff_diff/linalg.py:L586-L605, diff_diff/linalg.py:L934-L945, diff_diff/linalg.py:L1499-L1503, diff_diff/survey.py:L528-L537, diff_diff/twfe.py:L216-L246. Concrete fix: preserve the reduced-system fitted values/residuals and kept-column mask when weights are present, compute weighted/survey vcov on the reduced design, then expand the vcov back to full size with NaN rows/columns for dropped coefficients.

Performance

  • No material findings in the changed code.

Maintainability

  • No additional material findings beyond the weighted rank-deficiency regression above.

Tech Debt

  • No mitigating TODO.md entry covers the remaining issues, so they remain unmitigated for assessment purposes.

Security

  • No security findings in the changed files.

Documentation/Tests

  • Severity P2. Impact: the new test suite exercises fweight coefficient equivalence, but not fweight inference semantics or weighted/survey rank-deficiency behavior, which is why the remaining bugs survive. The only fweight oracle in the new file stops at coefficient equivalence against row expansion. References: tests/test_survey.py:L526-L540. Concrete fix: add one oracle test that compares fweight SE/p-value/CI behavior against an explicitly expanded dataset, and add one weighted rank-deficiency test at the solver level (plus one estimator-level regression test for a survey-weighted TWFE/DiD case) asserting dropped columns yield NaN coefficients with valid residuals/vcov on identified parameters rather than singular X'WX.

Path to Approval

  1. Make fweight inference consistent with the documented sum(w) - k convention across all new paths: LinearRegression.fit(), LinearRegression.get_inference(), MultiPeriodDiD.fit(), and the weights-only compute_survey_vcov() branch.
  2. Restore rank-deficient behavior for weighted and survey-weighted fits by carrying reduced-design fitted values/residuals and vcov expansion through the weighted back-transform path instead of recomputing on the full NaN coefficient vector / singular X'WX.
  3. Add targeted regression tests for fweight inference and weighted/survey rank-deficiency so these paths are pinned going forward.

igerber added a commit that referenced this pull request Mar 20, 2026
…w (round 3)

P1-1: Use sum(w) - k instead of n - k for fweight degrees of freedom in
compute_survey_vcov(), LinearRegression.fit() (classical MSE and df_ storage),
and MultiPeriodDiD.fit().

P1-2: Fix weighted rank-deficient fits producing all-NaN residuals by using only
identified columns for fitted-value computation in solve_ols() back-transform,
and computing vcov on the reduced system then expanding.

P2: Add regression tests for fweight SE oracle match, LinearRegression fweight df,
and weighted rank-deficiency at both solver and estimator levels.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Owner Author

igerber commented Mar 20, 2026

/ai-review

igerber and others added 4 commits March 20, 2026 14:59
…timator integration

Implements SurveyDesign for complex survey structures (stratification,
clustering, weights, FPC) with Taylor Series Linearization variance
estimation. Adds weighted OLS via sqrt(w) transformation, survey-aware
sandwich estimator with three weight types (pweight/fweight/aweight),
and survey degrees of freedom (n_PSU - n_strata).

Integrates with DifferenceInDifferences, TwoWayFixedEffects, and
MultiPeriodDiD via survey_design parameter on fit(). Includes weighted
demeaning for absorb/within-transformation, SurveyMetadata in results,
and 42 tests across 5 tiers (analytical, R cross-validation,
consistency, integration, Monte Carlo coverage).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
P0-1: Fix double-weighting in vcov — solve_ols() now computes vcov on
original-scale data with weights applied once (not on sqrt(w)-transformed
data with weights applied again). Fix HC1 meat to use X'diag(w*u²)X
instead of (X*w*u)'(X*w*u) which gave w² in the meat.

P0-2: Replace one-shot weighted within-transformation with iterative
alternating projections (max_iter=100, tol=1e-8) for correct weighted
FWL residualization in TWFE.

P0-3: Add NaN-vcov guard when no stratum contributes variance (all
singletons skipped), preventing se=0 → t=±inf instead of NaN.

P1: FPC validation now checks against PSU count (not obs count),
enforces constancy within stratum, and rejects fpc-only designs.

P2: Add survey_metadata to MultiPeriodDiDResults. Replace placeholder
R reference tests with exact manual oracle tests.

P3: Fix cluster comparison to use partition equivalence. Add docstrings
for survey_design/weights/weight_type parameters.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Rename needs_tsl_vcov to needs_survey_vcov and always route through
  survey vcov path when a SurveyDesign is resolved (P1-1)
- Guard MultiPeriodDiD robust=False override so it does not overwrite
  survey vcov when survey weights are present (P1-2)
- Validate FPC constancy for unstratified PSU-only designs (P1-3)
- Replace loose ratio check with exact oracle assertions for PSU-only
  and weights-only survey vcov tests, add negative FPC test (P2)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…w (round 3)

P1-1: Use sum(w) - k instead of n - k for fweight degrees of freedom in
compute_survey_vcov(), LinearRegression.fit() (classical MSE and df_ storage),
and MultiPeriodDiD.fit().

P1-2: Fix weighted rank-deficient fits producing all-NaN residuals by using only
identified columns for fitted-value computation in solve_ols() back-transform,
and computing vcov on the reduced system then expanding.

P2: Add regression tests for fweight SE oracle match, LinearRegression fweight df,
and weighted rank-deficiency at both solver and estimator levels.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Copy link

🔁 AI review rerun (requested by @igerber)

Head SHA: 80d43c8500f3206fd9a2dd7a7f34e6c0419e7905


Overall Assessment

⚠️ Needs changes

Static review only: I could not run the new survey tests in this sandbox because the available Python environment is missing numpy and pytest.

Executive Summary

  • The re-review items around survey routing and FPC validation look improved: survey fits do route through the survey-vcov path, MultiPeriodDiD(..., robust=False) no longer overwrites survey vcov, and unstratified row-varying FPC is rejected.
  • One prior methodology P1 remains: fweight inference is still only partially implemented. The Registry says df = sum(w) - k, but row-count n - k is still used in several estimator-facing paths.
  • One prior edge-case P1 remains: weighted and survey-weighted rank-deficient fits still lose the reduced-design handling that the unweighted solver has, so they can produce all-NaN residuals or raise on the singular full weighted design.
  • The documented deviations for weight normalization, lonely_psu="remove", and survey d.f. are properly labeled in the Registry and are non-blocking.
  • The new tests still do not pin the two remaining failure modes above.

Methodology

Affected methods: Weighted Least Squares (WLS), Taylor Series Linearization (TSL), and survey/frequency-weight degrees of freedom.

  • Severity P1. Impact: fweight inference still diverges from the Methodology Registry’s stated df = sum(w) - k convention. LinearRegression.fit() stores self.df_ from row count, so get_inference() falls back to n - k when no PSU-based survey d.f. exists; MultiPeriodDiD.fit() does the same; and the weights-only survey-HC1 branch still hard-codes n / (n - k). That silently gives wrong p-values/CIs for DifferenceInDifferences, TwoWayFixedEffects, MultiPeriodDiD, and public LinearRegression fweight fits without PSU-based survey d.f., and it also mis-scales SEs in the weights-only survey path. References: docs/methodology/REGISTRY.md:L1861-L1866, diff_diff/linalg.py:L1458-L1519, diff_diff/linalg.py:L1633-L1640, diff_diff/survey.py:L448-L452, diff_diff/estimators.py:L1061-L1065. Concrete fix: propagate an n_eff_df = sum(weights) concept through LinearRegression.fit() and MultiPeriodDiD.fit(), use it in weighted classical denominators and estimator d.f., and make the weights-only compute_survey_vcov() branch reuse the same fweight adjustment or reject SurveyDesign(..., weight_type="fweight") without PSU until that path is correct.
  • Severity P3. Impact: the deviations for weight normalization, lonely_psu="remove", and df = n_PSU - n_strata are documented in the Registry, so they are informational only. References: docs/methodology/REGISTRY.md:L1840-L1842, docs/methodology/REGISTRY.md:L1854-L1857, docs/methodology/REGISTRY.md:L1870-L1875. Concrete fix: none.

Code Quality

  • Severity P1. Impact: weighted rank-deficient fits still lose the solver’s existing R-style “drop column, keep valid fitted values/residuals/vcov on identified coefficients” behavior. _solve_ols_numpy() correctly computes fitted values on the reduced design, but solve_ols() immediately overwrites that with np.dot(_original_X, coefficients) even when coefficients contains dropped-column NaNs, which poisons fitted values/residuals. If vcov is requested, the code then re-enters the full singular weighted design; in survey fits, compute_survey_vcov() receives that full singular design as well. References: diff_diff/linalg.py:L586-L605, diff_diff/linalg.py:L720-L733, diff_diff/linalg.py:L1499-L1503, diff_diff/survey.py:L528-L537. Concrete fix: preserve the kept-column mask and reduced-system fitted/residuals through the weighted back-transform, compute weighted/survey vcov on the reduced design only, then expand the vcov back with NaN rows/columns for dropped coefficients.

Performance

No material findings in the changed code.

Maintainability

No additional material findings beyond the weighted rank-deficiency regression above.

Tech Debt

  • Severity P3. Impact: there is no existing TODO.md entry that tracks the remaining survey correctness issues, so they remain unmitigated for assessment purposes. Reference: TODO.md:L38-L66. Concrete fix: none; these are correctness issues and need code changes, not TODO-only tracking.

Security

No security findings in the changed files.

Documentation/Tests

  • Severity P2. Impact: the survey test file still pins fweight coefficient equivalence only; it does not cover estimator-level df = sum(w) - k behavior or weighted/survey rank-deficiency, which is why the two remaining P1s survive. Reference: tests/test_survey.py:L526-L542. Concrete fix: add one oracle test that compares fweight SE/p-value/CI (or LinearRegression.df_) against explicit row expansion, and add weighted plus survey-weighted rank-deficiency regression tests asserting finite residuals and vcov NaNs only on dropped coefficients.

Path to Approval

  1. Make fweight inference fully consistent with the Registry across all user-facing paths: LinearRegression.fit()/get_inference(), MultiPeriodDiD.fit(), and the weights-only compute_survey_vcov() branch.
  2. Restore R-style rank-deficient behavior for weighted and survey-weighted fits by carrying reduced-design fitted values/residuals and reduced-design vcov expansion through the weighted back-transform path.
  3. Add targeted regression tests for fweight inference semantics and weighted/survey rank-deficiency so these branches are pinned going forward.

@igerber
Copy link
Owner Author

igerber commented Mar 20, 2026

/ai-review

…s from PR #218 review (round 4)

P0: Replace scores'scores (w²*e²) with correct X'diag(w*e²)X in no-structure
survey vcov branch. P1: Handle NaN coefficients in survey vcov callers
(LinearRegression.fit, MultiPeriodDiD.fit) by computing on kept columns and
expanding with _expand_vcov_with_nan. P2: Fix oracle test and add fweight
oracle + rank-deficiency tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Owner Author

igerber commented Mar 20, 2026

/ai-review

@github-actions
Copy link

🔁 AI review rerun (requested by @igerber)

Head SHA: a67d940606cba434c51c1d64826f8ac3b086ef34


Overall Assessment
⚠️ Needs changes

Static review only: the sandbox Python environment here is missing numpy, so I could not execute tests/test_survey.py.

Executive Summary

  • The prior re-review blockers on fweight d.f. and weighted rank-deficient residual/vcov handling look addressed in the current diff.
  • Severity P1: supported survey designs without an explicit psu still fall back to residual n-k d.f., even though the new TSL code treats each observation as a PSU. That conflicts with the Registry’s df = n_PSU - n_strata rule and changes p-values/CIs.
  • Severity P1: the unstratified PSU path never applies lonely_psu; with one PSU it divides by n_psu - 1, so this new edge case has no valid variance handling.
  • Severity P1: the new LinearRegression(..., survey_design=...) API only uses survey_design for vcov/d.f.; unless callers also duplicate weights=..., it can mix unweighted coefficients with weighted survey SEs.
  • The documented deviations for weight normalization, lonely_psu="remove", and the simple n_PSU - n_strata survey d.f. convention are properly labeled in the Registry and are non-blocking.
  • The new tests do not pin the no-PSU d.f. rule, unstratified single-PSU handling, or direct LinearRegression(survey_design=...) weight propagation.

Methodology

  • Severity P1. Impact: survey degrees of freedom are only populated when psu is explicitly provided, but the new TSL implementation already treats psu is None as “each observation is its own PSU” inside each stratum. By the code’s own implicit-PSU construction plus the Registry’s df = n_PSU - n_strata rule, supported no-PSU survey fits should use survey d.f. (n_obs - n_strata, or n_obs - 1 unstratified), not residual n-k. As written, DifferenceInDifferences/TwoWayFixedEffects fall back to LinearRegression.df_, and MultiPeriodDiD does the same, so p-values and CIs are inconsistent with the documented method for weights-only and stratified-no-PSU designs. References: docs/methodology/REGISTRY.md:L1868-L1875, diff_diff/survey.py:L258-L265, diff_diff/survey.py:L474-L530, diff_diff/linalg.py:L1661-L1667, diff_diff/linalg.py:L1777-L1784, diff_diff/estimators.py:L1074-L1082. Concrete fix: define implicit PSU counts when psu is omitted and thread that through ResolvedSurveyDesign.df_survey, SurveyMetadata, LinearRegression, and MultiPeriodDiD, or explicitly document a deviation in the Registry.
  • Severity P1. Impact: the documented lonely_psu handling is only enforced for stratified designs. In the unstratified psu branch, the variance code always applies n_psu / (n_psu - 1) and never consults lonely_psu, while resolve() never warns about n_psu < 2 unless strata is also present. A one-PSU unstratified survey therefore has no defined method behavior and can hit divide-by-zero/NaN output instead of the documented remove/certainty/adjust policy. References: docs/methodology/REGISTRY.md:L1844-L1857, diff_diff/survey.py:L210-L227, diff_diff/survey.py:L461-L473. Concrete fix: treat an unstratified PSU design as a single stratum for lonely-PSU handling, or reject n_psu < 2 with a clear error/warning before variance computation.
  • Severity P3. Impact: the weight-normalization convention, default lonely_psu="remove", and df = n_PSU - n_strata survey d.f. convention are all explicitly documented in the Registry, so they are informational rather than defects. References: docs/methodology/REGISTRY.md:L1840-L1842, docs/methodology/REGISTRY.md:L1854-L1857, docs/methodology/REGISTRY.md:L1872-L1875. Concrete fix: none.

Code Quality

  • Severity P1. Impact: the new LinearRegression.survey_design constructor parameter is only partially threaded. fit() passes only self.weights into solve_ols() for the coefficient solve, but later compute_survey_vcov() always reads self.survey_design.weights. So LinearRegression(survey_design=resolved).fit(X, y) can silently estimate an unweighted beta and then attach weighted survey SEs unless the caller separately duplicates the same weights in weights=; the code also never validates that explicit weights= matches survey_design.weights. This is the exact “new __init__ param missing downstream” anti-pattern from the review checklist. References: diff_diff/linalg.py:L1465-L1483, diff_diff/linalg.py:L1545-L1557, diff_diff/linalg.py:L1624-L1639. Concrete fix: when survey_design is present, derive weights/weight_type from it and assert consistency if explicit weights are also supplied, or reject survey_design without matching weights.

Performance

  • No material findings in the changed code.

Maintainability

  • No material findings beyond the parameter-threading issue above.

Tech Debt

  • Severity P3. Impact: TODO.md has no survey-related correctness deferral, so the P1 issues above are untracked and are not eligible for TODO-only mitigation. References: TODO.md:L44-L71. Concrete fix: fix these in code rather than deferring them.

Security

  • No security findings in the changed files.

Documentation/Tests

  • Severity P2. Impact: the new test file pins explicit-PSU survey d.f. and weighted rank-deficiency, but it does not cover the no-PSU survey d.f. rule, unstratified single-PSU handling, or weighted coefficient propagation for direct LinearRegression(survey_design=...) calls; the current direct-LinearRegression survey tests only assert finiteness, so they would not catch the weighting mismatch above. The public LinearRegression docstring also still omits the new weights, weight_type, and survey_design parameters. References: tests/test_survey.py:L952-L973, tests/test_survey.py:L995-L1035, tests/test_survey.py:L1502-L1603, diff_diff/linalg.py:L1400-L1417. Concrete fix: add regression tests for implicit-PSU d.f., unstratified one-PSU lonely_psu behavior, and coefficient/SE equivalence for direct LinearRegression(survey_design=...), and update the LinearRegression docstring to document the new parameters.

Path to Approval

  1. Make no-PSU survey fits follow the Registry’s survey d.f. rule consistently: if psu is omitted, treat each observation as an implicit PSU for d.f. purposes as well as for the TSL meat, or document a deliberate deviation in REGISTRY.md.
  2. Route unstratified n_psu < 2 designs through lonely_psu handling instead of n_psu / (n_psu - 1), and add explicit tests for remove, certainty, and adjust.
  3. Make LinearRegression(survey_design=...) own the survey weights, or reject missing/mismatched weights= so beta, residuals, and vcov come from the same weighted problem.
  4. Add targeted tests for weights-only and stratified-no-PSU survey inference, single-PSU unstratified designs, and direct LinearRegression survey equivalence to the explicit weighted call.

…Regression weight propagation from PR #218 review (round 5)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Owner Author

igerber commented Mar 20, 2026

/ai-review

@github-actions
Copy link

🔁 AI review rerun (requested by @igerber)

Head SHA: 84434806284a5ce174a022621a6d998571d8526c


Overall Assessment

⚠️ Needs changes

Static review only: the local sandbox here is missing numpy, so I could not execute tests/test_survey.py or run the new survey paths end-to-end.

Executive Summary

  • The prior blockers on implicit-PSU survey d.f. and unstratified single-PSU handling look addressed in this diff.
  • Severity P1: the weights-only survey variance branch does not follow the same PSU-score TSL construction used elsewhere in the PR; it special-cases to X' diag(w u^2) X, which is an undocumented methodology deviation from the Registry’s Binder/Lumley TSL description.
  • Severity P1 [Previously flagged, partially resolved]: LinearRegression(..., survey_design=...) now auto-derives missing weights, but it still accepts conflicting explicit weights= / weight_type= and can combine coefficients from one weighted problem with survey SEs from another.
  • The documented deviations for weight normalization, default lonely_psu="remove", simplified survey d.f., and single-PSU-unstratified NaN variance are properly labeled in the Registry and are non-blocking.
  • The new tests still miss the conflicting-input direct-LinearRegression case, and the public LinearRegression docstring does not document the new survey/weight parameters.

Methodology

Affected methods: weighted least squares and Binder/Lumley-style survey linearization as wired into diff_diff/estimators.py:L237-L320, diff_diff/twfe.py:L124-L208, diff_diff/estimators.py:L961-L1082, and diff_diff/linalg.py:L1465-L1671, against the Registry entry at docs/methodology/REGISTRY.md:L1835-L1886.

  • Severity P1. Impact: the no-strata/no-PSU branch in compute_survey_vcov() falls back to a weighted HC1 meat X' diag(w u^2) X, while the rest of the survey code builds PSU score totals from w_i X_i u_i and accumulates outer products of those totals. That makes a weights-only SurveyDesign(weights=...) numerically inconsistent with the same design made explicit via observation-level PSUs, and the Registry does not document that deviation. References: diff_diff/survey.py:L463-L489, diff_diff/survey.py:L526-L566, docs/methodology/REGISTRY.md:L1849-L1886. Concrete fix: route the weights-only case through the same implicit-PSU score construction as the other TSL branches, or explicitly document the deviation in REGISTRY.md and validate it against an external survey reference implementation.

  • Severity P3. Impact: the PR now documents its deliberate deviations on weight normalization, default lonely_psu, simplified survey d.f., and unstratified single-PSU NaN behavior, so I do not count those as defects. References: docs/methodology/REGISTRY.md:L1845-L1847, docs/methodology/REGISTRY.md:L1859-L1865, docs/methodology/REGISTRY.md:L1880-L1886. Concrete fix: none.

Code Quality

  • Severity P1 [Previously flagged, partially resolved]. Impact: LinearRegression.fit() only auto-derives weights / weight_type from survey_design when self.weights is None; if a caller supplies explicit weights= or weight_type= alongside survey_design, the coefficient solve uses the explicit inputs, but survey vcov still comes from self.survey_design. That can silently mix coefficients/residuals from one weighted problem with SEs from another. References: diff_diff/linalg.py:L1537-L1560, diff_diff/linalg.py:L1628-L1643. Concrete fix: when survey_design is present, canonicalize weights and weight_type from it and raise on any conflicting explicit inputs.

Performance

  • No material findings.

Maintainability

  • No material findings beyond the blocking parameter-consistency issue above.

Tech Debt

  • No separate findings. The blocking items above are not tracked in TODO.md:L44-L72, and by project policy they are not eligible for TODO-only deferral.

Security

  • No security findings in the changed files.

Documentation/Tests

  • Severity P2. Impact: the tests only pin the happy-path direct LinearRegression(..., survey_design=...) case, and the weights-only oracle test hard-codes the current HC1 fallback instead of checking equivalence to an explicit-PSU/reference implementation. The public LinearRegression docstring also still omits weights, weight_type, and survey_design. References: tests/test_survey.py:L492-L523, tests/test_survey.py:L1764-L1814, diff_diff/linalg.py:L1391-L1417. Concrete fix: add a regression test for conflicting explicit weights= with survey_design, add a weights-only equivalence/reference test, and document the new public parameters on LinearRegression.

Path to Approval

  1. Make the weights-only survey vcov branch use the same implicit-PSU TSL score construction as the rest of the survey code, or add an explicit Registry deviation note plus external reference validation if the HC1 fallback is intentional.
  2. Reject or canonicalize conflicting explicit weights= / weight_type= when LinearRegression(..., survey_design=...) is used, and add a regression test that currently fails with mismatched inputs.
  3. Update the public LinearRegression docstring and direct-API survey tests to cover the new survey/weights contract.

…218 review (round 6)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Owner Author

igerber commented Mar 20, 2026

/ai-review

@github-actions
Copy link

🔁 AI review rerun (requested by @igerber)

Head SHA: 21690d2ad4b3480b009b14ae64759f05a52b326d


Overall Assessment

⚠️ Needs changes

Static review only: the local environment here is missing numpy, pandas, pytest, and scipy, so I could not execute tests/test_survey.py or run the new survey paths end to end.

Executive Summary

  • The prior re-review blockers on the weights-only TSL branch and conflicting explicit weights= with survey_design look largely addressed; the corresponding deviations are now documented in the Registry.
  • Severity P1: when survey_design is present but has no explicit psu, the new survey path silently drops user-supplied cluster IDs and falls back to implicit per-observation PSUs, which changes both SEs and survey d.f. without warning.
  • Severity P1: the new public weighted APIs validate only weight length, not weight validity. Invalid weight_type strings and non-finite/negative direct weights can flow into the new weighting formulas instead of failing fast.
  • Severity P2: the new tests cover psu overriding cluster and SurveyDesign-level validation, but they do not cover the no-psu + cluster interaction or invalid direct weighted-API inputs, so the two issues above are currently unguarded.

Methodology

Affected methods cross-checked against the new Registry section: weighted least squares, Binder/Lumley Taylor Series Linearization, and Korn/Graubard-style survey d.f.

  • Severity P1. Impact: cluster is silently ignored whenever survey_design is present without an explicit psu. The new helper says only survey PSU should override the user cluster, and the estimator entry points pass effective_cluster_ids, but the survey-vcov path then recomputes variance from survey_design alone. In practice, weights-only or stratified-no-PSU fits fall back to implicit per-observation PSUs and df_survey = n_obs - n_strata even when cluster= was supplied. That is an undocumented variance/d.f. change in the new weighting path. References: diff_diff/survey.py:408, diff_diff/linalg.py:1546, diff_diff/linalg.py:1650, diff_diff/estimators.py:1034, diff_diff/twfe.py:173. Concrete fix: when survey_design.psu is absent and cluster/cluster_ids is present, either inject those IDs into the survey-vcov / survey-d.f. / metadata path as the effective PSU structure, or reject the combination explicitly with a clear error and documentation.
  • Severity P3. Impact: the intentional deviations called out in the PR body are now properly documented in the Registry, so they are non-blocking under the project policy. That includes normalized p/aweights, default lonely_psu="remove", implicit-PSU weights-only TSL, and df = n_PSU - n_strata. References: docs/methodology/REGISTRY.md:1859, docs/methodology/REGISTRY.md:1866, docs/methodology/REGISTRY.md:1873, docs/methodology/REGISTRY.md:1880. Concrete fix: none.

Code Quality

  • Severity P1. Impact: the new weighted public APIs do not validate their own weight contract. SurveyDesign rejects invalid weight_type, NaN/Inf weights, and non-positive survey weights, but direct solve_ols, compute_robust_vcov, and LinearRegression calls only check weights.shape[0] before applying sqrt(w) or branching on weight_type. A typo like weight_type="pwieght" silently takes a pweight-like path, and negative/NaN direct weights can enter the transformed regression instead of raising. That is a missing assumption check in a newly added weighting feature. References: diff_diff/survey.py:66, diff_diff/survey.py:97, diff_diff/linalg.py:526, diff_diff/linalg.py:545, diff_diff/linalg.py:1474. Concrete fix: add a shared validator for weight_type and direct weights in solve_ols, compute_robust_vcov, and LinearRegression before any transformation; reject invalid weight_type, non-finite weights, and negative weights, and document whether zero weights are allowed.

Performance

No material findings in the changed code.

Maintainability

No material findings beyond the two blocking API/variance-path inconsistencies above.

Tech Debt

No mitigating TODO.md entry covers the blocking findings above; the current deferred review items are unrelated. Reference: TODO.md:44.

Security

No security findings in the changed files.

Documentation/Tests

  • Severity P2. Impact: the new test suite is broad, but it misses the two risky interaction points introduced here. It covers psu overriding cluster and SurveyDesign validation, and it covers conflicting explicit weights with survey_design, but it does not cover survey_design without psu plus cluster=..., nor invalid direct solve_ols / LinearRegression weighted inputs. References: tests/test_survey.py:615, tests/test_survey.py:740, tests/test_survey.py:1884. Concrete fix: add regression tests for weights-only and stratified-no-PSU survey fits with cluster=... across DifferenceInDifferences, TwoWayFixedEffects, and MultiPeriodDiD, plus direct solve_ols / LinearRegression tests for invalid weight_type, NaN weights, and negative weights.

Path to Approval

  1. Preserve cluster semantics when survey_design.psu is absent, or reject that combination explicitly. If supported, use those cluster IDs in the survey-vcov, survey d.f., and survey-metadata paths.
  2. Add centralized validation for the new direct weighted APIs so invalid weight_type and non-finite/negative weights fail fast instead of flowing into the WLS/SE formulas.
  3. Add regression tests for both behaviors and document the intended cluster + survey_design contract.

… review (round 7)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Owner Author

igerber commented Mar 20, 2026

/ai-review

@github-actions
Copy link

🔁 AI review rerun (requested by @igerber)

Head SHA: b4cd77065e4f521fe97fab566c92f65394154db3


Overall Assessment

⚠️ Needs changes

Static review only: the local environment here is missing numpy, pandas, pytest, and scipy, so I could not execute the new survey tests end to end.

Executive Summary

Methodology

  • Severity P1. Impact: the non-clustered weighted HC1 implementation is internally inconsistent for pweight. solve_ols() now estimates coefficients by running ordinary OLS on sqrt(w)X and sqrt(w)y, which implies score contributions w_i x_i u_i, and the cluster branch uses exactly those scores. But the no-cluster branch explicitly switches to X' diag(w_i u_i^2) X instead of the transformed-regression sandwich implied by the stated implementation. That is an undocumented methodology mismatch in a newly added inference path, and it yields different SEs for the same estimator depending on whether you call the weighted API with or without clustering. References: diff_diff/linalg.py:L567-L579, diff_diff/linalg.py:L1046-L1078, docs/methodology/REGISTRY.md:L1840-L1847. Concrete fix: make the non-clustered pweight HC1 meat consistent with the documented sqrt(w) transformed-OLS path, or explicitly document and justify a different pweight covariance formula in REGISTRY.md and then align all branches/tests to that choice.
  • Severity P3. Impact: the intentional deviations that matter for the survey estimator path are now properly documented, including normalized p/aweights, lonely_psu="remove", implicit-PSU weights-only TSL, survey df, and cluster injection as PSU. Those are non-blocking under the review policy. References: docs/methodology/REGISTRY.md:L1845-L1847, docs/methodology/REGISTRY.md:L1859-L1869, docs/methodology/REGISTRY.md:L1873-L1894. Concrete fix: none.

Code Quality

Performance

  • Severity P3. Impact: no material performance findings in the changed code. Concrete fix: none.

Maintainability

  • Severity P3. Impact: no material maintainability findings beyond the two blocking inference issues above. Concrete fix: none.

Tech Debt

  • Severity P3. Impact: TODO.md does not currently track the two P1 issues, so they remain unmitigated for assessment. In any case, TODO tracking would not downgrade incorrect SE / missing-assumption P1 findings under the project rules. References: TODO.md:L37-L58. Concrete fix: fix these in the PR rather than deferring them.

Security

  • Severity P3. Impact: no security findings in the changed files. Concrete fix: none.

Documentation/Tests

  • Severity P2. Impact: the new suite will not catch the current blockers. The new manual oracle test hardcodes the current X' diag(w u^2) X formula, and the added validation tests stop at solve_ols() / LinearRegression() instead of covering the new weighted compute_robust_vcov() API. That means CI can pass while both the pweight HC1 mismatch and the missing validation path remain. References: tests/test_survey.py:L292-L316, tests/test_survey.py:L942-L955, tests/test_survey.py:L2062-L2106, tests/test_linalg.py:L478-L555. Concrete fix: replace the weighted HC1 oracle with one that matches the chosen pweight covariance formula, add direct compute_robust_vcov() tests for invalid weight_type / NaN / Inf / negative / mismatched weights, and add a regression test for the chosen fractional-fweight policy.

Path to Approval

  1. Make the non-clustered weighted pweight HC1 calculation consistent with the documented sqrt(w) WLS implementation, or document a different formula in docs/methodology/REGISTRY.md and align the code/tests to it.
  2. Apply _validate_weights() to compute_robust_vcov() and make the fweight contract consistent across SurveyDesign, solve_ols(), LinearRegression, and the df adjustments.
  3. Update the tests so they would fail on both classes of bug: weighted pweight HC1 oracle coverage, direct compute_robust_vcov() validation coverage, and a fractional-fweight regression test matching the intended policy.

… fix fweight df rounding from PR #218 review (round 8)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Owner Author

igerber commented Mar 20, 2026

/ai-review

…abel consistency from PR #218 review (round 15)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Owner Author

igerber commented Mar 21, 2026

/ai-review

@github-actions
Copy link

🔁 AI review rerun (requested by @igerber)

Head SHA: d12bab26d943ecdab21552aa49fddf6885f51dc2


Overall Assessment

⚠️ Needs changes

Static review only: this environment does not have numpy/pytest, so I could not execute tests/test_survey.py. The assessment below is based on code inspection and the added regression coverage.

Executive Summary

Methodology

Affected methods reviewed: Weighted Least Squares, Taylor Series Linearization, survey degrees of freedom, and weighted absorbed/within transformations.

  • Severity: P1. Impact: _inject_cluster_as_psu() globally factorizes injected cluster_ids and sets n_psu from the global unique count, but the only nesting support lives in the explicit psu= resolution path. In stratified designs where cluster labels repeat across strata, the variance meat is still formed stratum-by-stratum, but n_psu and therefore df_survey = n_PSU - n_strata are too small, so p-values/CIs use the wrong reference distribution with no warning. References: diff_diff/survey.py:153, diff_diff/survey.py:279, diff_diff/survey.py:442, diff_diff/survey.py:289, docs/methodology/REGISTRY.md:1908. Concrete fix: preserve nest on ResolvedSurveyDesign and apply it during cluster injection, or explicitly reject injected cluster IDs that are not globally unique across strata when nest=False.
  • Severity: P1. Impact: SurveyDesign.resolve() validates fpc before cluster IDs can be promoted to PSUs. That means survey_design=SurveyDesign(..., fpc=...) plus cluster= either fails outright when no psu/strata was declared, or is checked against the number of observations in each stratum instead of the number of sampled PSUs. The variance code then switches to PSU-level FPC once clusters are injected, so the validation rule and the estimation rule disagree. References: diff_diff/survey.py:173, diff_diff/survey.py:200, diff_diff/survey.py:442, diff_diff/survey.py:596, docs/methodology/REGISTRY.md:1908. Concrete fix: validate fpc against the final effective PSU structure after cluster injection, not against the pre-injection design.

Code Quality

  • No material findings beyond the shared survey-resolution issues above.

Performance

  • No material performance regressions stood out in the changed paths.

Maintainability

  • No separate maintainability findings beyond the need to unify survey resolution, cluster injection, and FPC validation in one place.

Tech Debt

Security

  • No material security or secret-handling issues found in the changed files.

Documentation/Tests

  • Severity: P2. Impact: The added suite covers explicit nest=True on psu= and unstratified cluster injection, but it does not cover the two injected-PSU interactions that fail here: cluster + strata with repeated cluster labels, and cluster + fpc. That leaves the documented cluster-as-PSU path unguarded. References: tests/test_survey.py:858, tests/test_survey.py:1978. Concrete fix: add regression tests for both cases, including assertions on n_psu, df_survey, and acceptance/rejection behavior.

Path to Approval

  1. Make the injected-PSU path honor stratified nesting, or explicitly reject non-unique injected cluster labels across strata when nesting is not supported.
  2. Rework fpc validation so it runs on the effective PSU structure after cluster= has been resolved into PSUs.
  3. Add targeted regressions for cluster + strata + repeated labels and for cluster + fpc in the survey test suite.

… late PSU injection from PR #218 review (round 16)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Owner Author

igerber commented Mar 21, 2026

/ai-review

@github-actions
Copy link

🔁 AI review rerun (requested by @igerber)

Head SHA: 8a0ba95d2d3160d49b673d82cbf9d5e8861a5222


Overall Assessment

⚠️ Needs changes

Static review only: the local python environment here is missing numpy, so I could not execute tests/test_survey.py. The assessment below is based on diff/code inspection.

Executive Summary

  • Previous re-review work on injected-PSU nesting looks resolved: _inject_cluster_as_psu() now nests cluster labels within strata, and the new tests cover that path.
  • The prior FPC finding is not fully resolved. FPC handling is still keyed to the pre-injection design, not the final effective PSU structure actually used by Taylor Series Linearization.
  • That leaves two bad branches: valid unstratified cluster + fpc survey designs are still rejected up front, while stratified strata + fpc / strata + fpc + cluster designs can now bypass the required FPC >= n_PSU check and flow into variance estimation.
  • Outside that gap, the core survey methodology lines up with the new registry: sqrt(w) WLS, documented weight normalization, documented lonely_psu="remove", and documented survey d.f. convention.
  • The weighted multi-absorb limitation is now documented in REGISTRY.md and tracked in TODO.md, so that part is informational, not blocking.

Methodology

Affected methods reviewed: WLS, Taylor Series Linearization variance/FPC handling, survey degrees of freedom, and weighted absorbed/within transformations.

  • Severity P1. Impact: FPC handling is still based on the unresolved design rather than the effective PSU structure used by TSL. SurveyDesign.resolve() rejects any fpc when both psu and strata are absent, so a valid unstratified survey_design=SurveyDesign(..., fpc=...) plus estimator cluster= still cannot be represented. But when strata is present and psu is absent, resolve() now skips the FPC >= n_PSU_h check entirely, and later _inject_cluster_as_psu() / implicit per-observation PSUs feed f_h = n_PSU_h / N_h directly into compute_survey_vcov() with no post-injection validation. That is an undocumented mismatch with the registry’s cluster-as-PSU / implicit-PSU rules and can produce wrong or undefined survey SEs and p-values instead of a clear validation error. Concrete fix: keep the “constant within stratum” checks in resolve(), but move the FPC >= effective n_PSU_h validation onto the final effective-design path, after explicit PSU resolution, after cluster injection, and after the implicit per-observation PSU fallback. Locations: docs/methodology/REGISTRY.md:1854, docs/methodology/REGISTRY.md:1905, docs/methodology/REGISTRY.md:1908, diff_diff/survey.py:173, diff_diff/survey.py:200, diff_diff/survey.py:440, diff_diff/survey.py:545, diff_diff/survey.py:598

Code Quality

  • No separate code-quality findings beyond the methodology issue above.

Performance

  • No material performance regressions stood out in the changed paths.

Maintainability

  • No separate maintainability findings beyond centralizing final FPC validation on the same effective-design path used for survey variance computation.

Tech Debt

Security

  • No material security or secret-handling issues found in the changed files.

Documentation/Tests

  • Severity P2. Impact: The new suite covers explicit-PSU FPC validation and the fixed nested-cluster injection path, but it does not add a regression for invalid FPC < effective n_PSU after cluster injection or implicit PSU fallback. Worse, test_fpc_with_strata_no_psu_accepted currently enshrines unconditional acceptance of the no-PSU path, which points the suite in the wrong direction for the remaining methodology bug. Concrete fix: replace or extend that coverage with one valid unstratified cluster + fpc case, one invalid stratified implicit-PSU case (FPC < n_obs_h), and one invalid stratified cluster-injected case (FPC < n_cluster_h). Locations: tests/test_survey.py:1191, tests/test_survey.py:2880, tests/test_survey.py:2903

Path to Approval

  1. Rework FPC validation so it runs on the final effective PSU structure, not on the pre-injection SurveyDesign.resolve() state.
  2. Add a passing regression for unstratified survey_design(..., fpc=...) plus cluster=..., and failing regressions for invalid FPC < n_obs_h and invalid FPC < n_cluster_h cases; replace the unconditional acceptance assertion at tests/test_survey.py:2903.

… structure from PR #218 review (round 17)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Owner Author

igerber commented Mar 21, 2026

/ai-review

@github-actions
Copy link

🔁 AI review rerun (requested by @igerber)

Head SHA: e57b748f2be06f8f3871020e082ab601e08fda1a


Overall Assessment

⚠️ Needs changes

Static review only; local execution was not possible here because the current environment lacks numpy.

Executive Summary

  • The prior FPC / effective-PSU issue is mostly fixed: cluster-as-PSU injection now nests within strata, and the stratified / injected-PSU TSL branches validate FPC >= effective n_PSU.
  • One blocking gap remains: unstratified weights-only designs that supply fpc are accepted, but the implicit-PSU TSL branch never reads resolved.fpc, so FPC is silently ignored.
  • That means valid finite-population corrections do not reduce SEs, full-census FPC == n_obs does not collapse variance to zero, and invalid FPC < n_obs values are not rejected.
  • The prior zero-SE inference issue looks resolved: LinearRegression.get_inference() now routes through safe_inference(), so inference fields go NaN together when SE is non-positive.
  • The documented deviations from R are properly recorded in REGISTRY.md, so they are informational only.
  • The weighted multi-absorb limitation is now rejected at runtime and tracked in TODO.md / REGISTRY.md, so it is non-blocking tech debt.

Methodology

Affected methods reviewed: WLS via sqrt(w) transformation, Taylor Series Linearization variance/FPC handling, survey degrees of freedom, and weighted absorbed/within transformations.

  • Severity P1. Impact: SurveyDesign.resolve() now explicitly allows fpc with no explicit PSU/strata, but the unstratified implicit-PSU TSL branch (strata is None and psu is None) never applies or validates resolved.fpc. Any SurveyDesign(weights=..., fpc=...) path with no PSU and no cluster injection therefore returns the same vcov as no FPC, does not zero out at full census (FPC == n_obs), and does not reject invalid FPC < n_obs. This is an undocumented mismatch with the registry’s implicit-per-observation-PSU rule and yields incorrect survey SEs/p-values with no warning. Concrete fix: in the implicit-PSU branch, treat n_obs as the effective PSU count for FPC validation, apply the same (1 - f) correction used in the other TSL branches, and raise when FPC < n_obs. References: diff_diff/survey.py:173, diff_diff/survey.py:513, docs/methodology/REGISTRY.md:1866, tests/test_survey.py:1235, tests/test_survey.py:2920
  • Severity P3. Impact: the sum(w)=n normalization, lonely_psu="remove" default, and survey d.f. convention are explicit deviations from R and are documented in the registry, so they are informational only. Concrete fix: none. References: docs/methodology/REGISTRY.md:1845, docs/methodology/REGISTRY.md:1859, docs/methodology/REGISTRY.md:1901

Code Quality

No separate findings.

Performance

No separate findings.

Maintainability

No separate findings.

Tech Debt

Security

No material security or secret-handling findings in the changed files.

Documentation/Tests

  • Severity P2. Impact: the new suite asserts that fpc-only designs are accepted, but it never verifies that the accepted unstratified implicit-PSU path actually uses FPC or rejects FPC < n_obs. As written, the suite would miss the P1 above. Concrete fix: add direct and estimator-level regressions for SurveyDesign(weights='w', fpc='pop') covering FPC > n_obs, FPC == n_obs, and FPC < n_obs. References: tests/test_survey.py:1235, tests/test_survey.py:2920, tests/test_survey.py:2933, diff_diff/survey.py:513

Path to Approval

  1. Apply FPC in the unstratified implicit-PSU TSL branch (strata is None and psu is None) using n_obs as the effective PSU count, with a ValueError for FPC < n_obs.
  2. Add regressions for valid, full-census, and invalid weights-only fpc cases; at least one test should go through an estimator path, not only compute_survey_vcov().

…obs from PR #218 review (round 18)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Owner Author

igerber commented Mar 21, 2026

/ai-review

@github-actions
Copy link

🔁 AI review rerun (requested by @igerber)

Head SHA: ed28c01e295957b48499dbaf904aab7725e9d917


Overall Assessment

⚠️ Needs changes

Static review only; I couldn’t run the suite in this environment because the runner is missing numpy.

Executive Summary

  • Re-review status: the prior weights-only FPC blocker appears fixed. The implicit-PSU TSL branch now applies and validates FPC, and the suite adds direct regressions for FPC > n_obs, FPC == n_obs, and FPC < n_obs. References: diff_diff/survey.py:L513-L531, tests/test_survey.py:L2961-L3027
  • The prior zero-SE inference issue also appears fixed: coefficient inference now routes through safe_inference(), so non-positive SEs propagate NaN consistently. References: diff_diff/linalg.py:L1841-L1868
  • The weighted multi-absorb limitation is now explicitly rejected at runtime and tracked in both the Methodology Registry and TODO.md, so that part is informational rather than blocking. References: diff_diff/estimators.py:L253-L261, diff_diff/estimators.py:L1009-L1017, docs/methodology/REGISTRY.md:L1888-L1895, TODO.md:L48-L56
  • One unmitigated methodology blocker remains: explicit PSU labels reused across strata are accepted with nest=False, which undercounts n_psu, feeds the wrong survey d.f. into inference, and can also distort lonely_psu="adjust" variance calculations.
  • I did not find separate performance or security blockers in the changed files.

Methodology

Affected methods reviewed: WLS via sqrt(w), Taylor Series Linearization variance/FPC handling, survey degrees of freedom, and weighted absorbed/within transformations.

Code Quality

  • No separate findings.

Performance

  • No separate findings.

Maintainability

  • No separate findings.

Tech Debt

Security

  • No findings.

Documentation/Tests

  • Severity P2. Impact: the new survey suite currently enshrines the incorrect nest=False acceptance path for repeated PSU IDs across strata, so it would not catch the P1 above. Concrete fix: replace that expectation with a negative regression for reused PSU labels under nest=False, and add a positive nest=True regression that asserts the correct n_psu/df_survey and inference behavior on the same toy design. References: tests/test_survey.py:L858-L886

Path to Approval

  1. In SurveyDesign.resolve(), add a nesting/uniqueness check for the strata + psu + nest=False case so reused raw PSU IDs across strata are rejected before n_psu and df_survey are computed.
  2. Update the survey tests so repeated PSU labels across strata fail (or raise the chosen blocking warning/error) when nest=False.
  3. Add a positive nest=True regression that asserts the expected n_psu, df_survey, and inference output, and covers the lonely_psu="adjust" path or an equivalent inference-level assertion.

…rom PR #218 review (round 19)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Owner Author

igerber commented Mar 21, 2026

/ai-review

@github-actions
Copy link

🔁 AI review rerun (requested by @igerber)

Head SHA: eebc13e1380a5cee672758a5db56336c897aa38f


Overall Assessment
⚠️ Needs changes

Static review only; runtime verification wasn’t possible in this environment because Python is missing numpy.

Executive Summary

Methodology

  • Severity P1. Impact: SurveyDesign.resolve() still global-factorizes raw PSU ids when nest=False and never validates that PSU labels are already unique across strata; the added tests explicitly assert that repeated labels should succeed under nest=False. That silently turns the default path into a de facto nested-PSU interpretation for n_psu/df_survey, and compute_survey_vcov() still builds the lonely_psu="adjust" grand mean with groupby(psu) on those globally reused codes, so repeated labels can still be merged across strata in the adjusted-variance branch. Concrete fix: add a resolve-time check.strata-style validation for strata + psu + nest=False, require nest=True when raw PSU ids repeat across strata, and keep the adjust path operating only on uniquely nested PSU ids. diff_diff/survey.py:L153-L169, diff_diff/survey.py:L289-L300, diff_diff/survey.py:L577-L584, tests/test_survey.py:L874-L888, tests/test_survey.py:L3079-L3089. R survey::svydesign() defaults check.strata = !nest, assumes PSU ids are unique across strata unless nest=TRUE, and documents lonely.psu="adjust" as centering singleton strata at the grand mean rather than the stratum mean. (r-survey.r-forge.r-project.org)
  • Severity P3. Impact: the sum-weights normalization, lonely_psu="remove" default, df = n_PSU - n_strata convention, and the weighted multi-absorb rejection are all documented in the Methodology Registry, so those deviations are informational only. Concrete fix: none. docs/methodology/REGISTRY.md:L1845-L1847, docs/methodology/REGISTRY.md:L1859-L1869, docs/methodology/REGISTRY.md:L1888-L1911

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3. Impact: the remaining exact weighted multi-absorb demeaning work is explicitly tracked in TODO.md, so it should not block merge. Concrete fix: none for this PR. TODO.md:L52-L56

Security

  • No findings.

Documentation/Tests

  • Severity P2. Impact: the new repeated-PSU tests now enshrine the incorrect nest=False acceptance path, and there is still no regression that exercises lonely_psu="adjust" on repeated raw PSU labels across strata. As written, the suite will not catch the remaining P1 above. Concrete fix: replace the nest=False acceptance expectations with a negative regression, and add a positive nest=True + lonely_psu="adjust" regression that asserts n_psu, df_survey, and finite adjusted variance on the same toy design. tests/test_survey.py:L874-L888, tests/test_survey.py:L3079-L3089

Path to Approval

  1. In SurveyDesign.resolve(), add validation for strata + psu + nest=False so reused raw PSU labels across strata raise a clear error telling the caller to set nest=True if ids are only unique within strata.
  2. Ensure the lonely_psu="adjust" path only computes its grand mean from uniquely nested PSU ids. The clean fix is step 1; if the code keeps accepting reused labels, the adjustment mean still needs to be built from strata × psu, not bare psu.
  3. Update the new survey tests so repeated PSU labels fail under nest=False, and add a nest=True regression that covers lonely_psu="adjust" on the same shape of input.

…st fixtures from PR #218 review (round 20)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Owner Author

igerber commented Mar 21, 2026

/ai-review

@github-actions
Copy link

🔁 AI review rerun (requested by @igerber)

Head SHA: b6db0342c5f58fea3be88ca273e9b02738265fa4


Overall Assessment

⚠️ Needs changes

Static review only; I could not run the new survey suite here because pytest is unavailable and the Python environment is missing numpy.

Executive Summary

Methodology

Cross-check performed against the new registry plus the cited Lumley/Binder survey references and the survey::svydesign() documentation. (jstatsoft.org)

  • Severity P1 [Newly identified]. Impact: in both DifferenceInDifferences.fit() and MultiPeriodDiD.fit(), the absorb block now residualizes outcome/treatment/time/interactions/covariates (diff_diff/estimators.py:L263-L283, diff_diff/estimators.py:L1022-L1050), but any fixed_effects= dummies are still appended afterward from raw data (diff_diff/estimators.py:L306-L314, diff_diff/estimators.py:L1101-L1107). That violates the PR’s own registry note that the absorbed-FE path should within-transform all regressors under the FWL step (docs/methodology/REGISTRY.md:L1888-L1895), so survey_design + absorb + fixed_effects silently fits a different model than the advertised absorbed-FE WLS/TSL specification. Concrete fix: either build the fixed_effects dummy columns before demeaning and include them in vars_to_demean, or reject absorb + fixed_effects whenever survey_design is present until those dummies are residualized correctly.
  • Severity P3. Impact: the previous nest=False / reused-PSU mismatch no longer appears live; the code now enforces globally unique PSU labels unless nest=True, and the new tests cover both reject and accept behavior (diff_diff/survey.py:L153-L175, tests/test_survey.py:L3059-L3094). That is consistent with svydesign()’s documented check.strata=!nest default. Concrete fix: none. (r-survey.r-forge.r-project.org)

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3. Impact: the weighted multi-absorb limitation is now explicitly tracked, so I did not treat it as a blocker. Concrete fix: none for this PR. docs/methodology/REGISTRY.md:L1888-L1895, TODO.md:L48-L56
  • Severity P3. Impact: the roadmap correctly calls out subpopulation/domain analysis as deferred; survey-domain inference needs to preserve original cluster/stratum information rather than naïve row subsetting. Concrete fix: none for this PR beyond keeping that limitation explicit until a dedicated API lands. docs/survey-roadmap.md:L59-L62. (r-survey.r-forge.r-project.org)

Security

  • No findings.

Documentation/Tests

  • Severity P2. Impact: the new round-9 weighted absorb equivalence tests only compare absorb=["region"] and fixed_effects=["region"] separately; they never exercise the combined absorb + fixed_effects + survey_design path, so the P1 above is currently untested (tests/test_survey.py:L2216-L2335). Concrete fix: add DiD and MultiPeriodDiD regressions that compare (absorb=["a"], fixed_effects=["b"], survey_design=...) against a fully explicit weighted-dummy regression on a toy panel where b varies within absorbed groups.

Path to Approval

  1. In DifferenceInDifferences.fit() and MultiPeriodDiD.fit(), either residualize the generated fixed_effects dummy columns through the same absorb step or reject absorb + fixed_effects whenever survey_design is used.
  2. Add regression tests for that combined path in both estimators; the test data should make the explicit FE vary within the absorbed groups so the current bug cannot cancel out accidentally.

…n from PR #218 review (round 21)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Owner Author

igerber commented Mar 21, 2026

/ai-review

@github-actions
Copy link

🔁 AI review rerun (requested by @igerber)

Head SHA: 68ee72b89a99102efe79eb19ea35ba16fbc82105


Overall Assessment

⚠️ Needs changes

Static review only; I couldn’t run pytest here because the local Python environment is missing numpy.

Executive Summary

  • The prior survey-specific blocker is fixed: both DiD paths now reject survey_design + absorb + fixed_effects, and the new rejection tests cover that path in diff_diff/estimators.py:L263-L269, diff_diff/estimators.py:L1027-L1033, and tests/test_survey.py:L3139-L3165.
  • The new survey methodology is otherwise aligned with the added registry notes, and the weighted multi-absorb limitation is now documented and tracked, so I treated that item as P3 informational rather than blocking.
  • Severity P1 [Newly identified]: the same FWL violation still remains in the non-survey absorb + fixed_effects path, because fixed-effect dummies are still appended after the absorb step without being residualized.
  • The new regression tests/test_survey.py:L3167-L3178 now locks in that incorrect unweighted path instead of comparing it to the explicit-dummy oracle or rejecting it.

Methodology

High-level cross-check: the new survey section is directionally consistent with the cited survey sources on linearisation-based variance, PSU nesting checks, and design-based degrees of freedom tied to degf(design) / PSUs minus strata. (jstatsoft.org)

  • Severity P1 [Newly identified]. Impact: the PR fixes absorb + fixed_effects only when survey_design is present, but both DifferenceInDifferences.fit() and MultiPeriodDiD.fit() still add fixed_effects dummies from raw working_data[fe] after the absorb step in the unweighted path (diff_diff/estimators.py:L271-L322, diff_diff/estimators.py:L1038-L1123). That contradicts the new registry note that, under absorb, all regressors are within-transformed per FWL (docs/methodology/REGISTRY.md:L1888-L1895). When a fixed effect varies within absorbed groups, fit(..., absorb=['a'], fixed_effects=['b']) still estimates a different model than the explicit-dummy regression, with no warning. Concrete fix: either extend the rejection to the non-survey path as well, or generate the dummy columns before the absorb step and residualize them through the same within-transformation before adding them to X.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

Security

  • No findings.

Documentation/Tests

  • Severity P2. Impact: tests/test_survey.py:L3167-L3178 asserts that the unweighted absorb + fixed_effects path should succeed, which will hide the methodology bug above from future regressions. Concrete fix: replace that test with an explicit-dummy oracle comparison on data where the non-absorbed fixed effect varies within the absorbed groups, or change it to expect the same rejection as the survey path.

Path to Approval

  1. In diff_diff/estimators.py:L263-L269 and diff_diff/estimators.py:L1027-L1033, either remove the and survey_weights is not None condition so all absorb + fixed_effects combinations are rejected, or fully residualize the generated fixed-effect dummies through the absorb step in both estimators.
  2. Replace tests/test_survey.py:L3167-L3178 with an oracle test that compares absorb=['a'], fixed_effects=['b'] against the explicit-dummy model on data where b varies within a, and add the analogous oracle or rejection coverage for MultiPeriodDiD.

…tion) from PR #218 review (round 22)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Owner Author

igerber commented Mar 21, 2026

/ai-review

@github-actions
Copy link

🔁 AI review rerun (requested by @igerber)

Head SHA: 349b0f591cd5429d1b33f42dc99a344eff65639e


Overall Assessment

⚠️ Needs changes

Static review only; the local environment here is missing numpy and pytest, so I could not execute the added tests.

Executive Summary

  • The prior blocker is fixed: both DiD estimators now reject absorb + fixed_effects unconditionally, and the new regression tests cover both survey and non-survey paths in diff_diff/estimators.py:L263, diff_diff/estimators.py:L1028, and tests/test_survey.py:L3139.
  • No additional unmitigated P0/P1 issue stood out in the DiD or MultiPeriod survey paths; the remaining blocker is isolated to the new TWFE survey path.
  • Severity P1 [Newly identified]: TwoWayFixedEffects.fit() silently injects the estimator’s default unit clustering as survey PSUs when survey_design has no explicit psu, changing SEs and survey d.f. away from the no-PSU method documented in the Methodology Registry.
  • The new TWFE survey tests cover explicit PSU designs and explicit cluster= injection, but they do not cover the default cluster=None no-PSU survey path tied to the P1 above.
  • The weighted multi-absorb limitation is properly documented and tracked, so I treated that item as informational rather than blocking.

Methodology

High-level cross-check: the new survey registry section is directionally consistent with the survey package documentation on survey.lonely.psu="adjust" and on representing no-PSU designs with id=~1; the explicit deviations called out in REGISTRY.md are therefore appropriately treated as documented choices rather than defects. (r-survey.r-forge.r-project.org)

  • Severity P1 [Newly identified]. Impact: TwoWayFixedEffects documents that cluster=None defaults to unit-level clustering in the non-survey path diff_diff/twfe.py:L31, and the new survey code reuses that default by setting cluster_var = unit diff_diff/twfe.py:L131 and then injecting those IDs as PSUs whenever survey_design.psu is absent diff_diff/twfe.py:L171, diff_diff/twfe.py:L179. That contradicts the registry’s documented no-PSU rule for weights-only and stratified-no-PSU survey designs, which says they should use implicit per-observation PSUs and n_obs - 1 / n_obs - n_strata survey d.f. docs/methodology/REGISTRY.md:L1905. The current TWFE behavior silently changes both the variance estimator and reported survey d.f. without any user opt-in. Concrete fix: only inject cluster IDs as effective PSUs in TWFE when the user explicitly supplied cluster=...; if self.cluster is None and survey_design.psu is None, leave the survey design on the documented no-PSU path.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3. Impact: the multi-absorb weighted demeaning limitation is explicitly documented in docs/methodology/REGISTRY.md:L1888 and tracked in TODO.md:L55, so I did not count it against the assessment. Concrete fix: none for this PR.

Security

  • No findings.

Documentation/Tests

  • Severity P2. Impact: the new TWFE survey coverage exercises explicit PSU designs tests/test_survey.py:L886 and explicit cluster= injection tests/test_survey.py:L1991, but it does not test the default cluster=None + no-PSU survey path that currently triggers the P1 above. Concrete fix: add regression tests for TwoWayFixedEffects().fit(..., survey_design=SurveyDesign(weights="w")) and the analogous stratified-no-PSU case with cluster=None, asserting survey_metadata.n_psu, survey_metadata.df_survey, and SEs match the documented implicit-PSU path.

Path to Approval

  1. Change TwoWayFixedEffects.fit() so cluster-as-PSU injection only happens when the user explicitly passes cluster=...; keep the existing unit-cluster default for the non-survey analytical path, but do not silently convert that default into survey PSUs.
  2. Add TWFE regression tests for the weights-only and stratified-no-PSU survey defaults with cluster=None, checking both the reported survey metadata and the resulting SEs against the documented no-PSU survey oracle.

 review (round 23)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Owner Author

igerber commented Mar 21, 2026

/ai-review

@github-actions
Copy link

🔁 AI review rerun (requested by @igerber)

Head SHA: 74f0438c9e46e61a23d4f880200c1b4bdd3410a4


Overall Assessment

✅ Looks good

Static review only; I could not execute the added tests locally because pytest is not installed in this environment.

Executive Summary

  • Severity P2. The new weighted TWFE residualization path in diff_diff/utils.py:1835 is only regression-tested against explicit weighted dummies on a small balanced panel in tests/test_survey.py:1247, so the unbalanced/high-weight-dispersion case is still under-covered.
  • The prior P1 is fixed: TWFE no longer turns its default unit clustering into survey PSUs when cluster=None; that split now happens explicitly in diff_diff/twfe.py:181 and is covered by tests/test_survey.py:3197.
  • No new unmitigated P0/P1 methodology issue stood out in the changed DiD, TWFE, MultiPeriod, WLS, or survey-variance paths.
  • The multi-absorb weighted-demeaning limitation is documented in docs/methodology/REGISTRY.md:1888 and tracked in TODO.md:55, so it is informational rather than blocking.

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

Security

  • No findings.

Documentation/Tests

  • Severity P2. Impact: the new weighted alternating-projection path in diff_diff/utils.py:1835 is only validated by a balanced 10x3 equivalence test in tests/test_survey.py:1247. That is enough to cover the fixed regression case, but not the unbalanced panels and heavier weight dispersion where convergence behavior is least obvious. Concrete fix: add at least one explicit-dummies equivalence regression test for within_transform(..., weights=...) and TwoWayFixedEffects.fit(..., survey_design=...) on an unbalanced panel with heterogeneous weights.

@igerber igerber merged commit 5d5bfa0 into main Mar 21, 2026
13 checks passed
@igerber igerber deleted the survey-data branch March 21, 2026 18:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant