Conversation
Overall Assessment
Static review only. I could not execute the test suite in this shell because the available Python environment is missing core dependencies.
Executive Summary
Methodology
The documented extension itself is acceptable; the blockers below are implementation errors inside that extension.
Code Quality
No material findings beyond the methodology issues above.
Performance
No material findings in the changed code.
Maintainability
Tech Debt
Security
No findings.
Documentation/Tests
Path to Approval
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment
Static review only; I could not run the test suite in this environment.
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment
Static review only; this shell does not have the required dependencies.
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment
Static review only; this shell does not have the required dependencies.
Executive Summary
Methodology
Affected methods: the core dCDH survey extension, the Phase 3 heterogeneity test, and the TWFE diagnostic surface.
Code Quality
No additional findings beyond the methodology propagation issues above.
Performance
No material performance findings in the changed code.
Maintainability
Tech Debt
No relevant findings.
Security
No findings.
Documentation/Tests
Path to Approval
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment
Static review only. I could not run the test suite in this shell because the available Python environment is missing core dependencies.
Executive Summary
Methodology
Code Quality
Performance
No findings.
Maintainability
No findings.
Tech Debt
Security
No findings.
Documentation/Tests
Path to Approval
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment
✅ Looks good. Static review only. I could not execute the test suite in this shell because the available Python environment is missing core dependencies.
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment
✅ Looks good. Static review only. I could not run the test suite in this shell because the available Python environment is missing core dependencies.
Executive Summary
Methodology
Code Quality
No findings.
Performance
No findings.
Maintainability
No findings.
Tech Debt
No findings.
Security
No findings.
Documentation/Tests
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment
Static review only. This shell does not have the required dependencies.
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment
Static review only. I could not run the test suite in this environment because the required dependencies are missing.
Executive Summary
Methodology
Code Quality
Performance
No findings.
Maintainability
No findings.
Tech Debt
No findings. I did not find a mitigating entry for the P1 above.
Security
No findings.
Documentation/Tests
Path to Approval
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment
✅ Looks good. Static review only. I could not run the test suite here because the required dependencies are missing.
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
- Weighted cell aggregation in _validate_and_aggregate_to_cells()
- Survey resolution via _resolve_survey_for_fit() with pweight-only and group-constant validation
- IF expansion from group to observation level for TSL variance
- Survey-aware SE at all call sites (overall, joiners, leavers, multi-horizon, placebos) via _compute_se() dispatcher
- Bootstrap + survey warning (PSU-level deferred)
- 12 new tests in test_survey_dcdh.py
- Documentation updates: REGISTRY.md, ROADMAP.md, llms-full.txt, choosing_estimator.rst

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
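Weighted cell aggregation of this kind can be sketched with pandas. This is a minimal illustration under hypothetical column names (`g`, `t`, `y`, `w`), not the library's actual _validate_and_aggregate_to_cells() implementation: each (group, time) cell gets outcome sum(w*y)/sum(w) and weight sum(w).

```python
import pandas as pd

def aggregate_to_cells(df, w="w", y="y", g="g", t="t"):
    """Collapse observation-level data to (group, time) cells using survey
    weights: cell outcome = sum(w*y)/sum(w), cell weight = sum(w)."""
    df = df.copy()
    df["_wy"] = df[w] * df[y]
    cells = (df.groupby([g, t])
               .agg(w_gt=(w, "sum"), _wy=("_wy", "sum"))
               .reset_index())
    cells["y_gt"] = cells["_wy"] / cells["w_gt"]
    return cells.drop(columns="_wy")

df = pd.DataFrame({"g": [1, 1, 2], "t": [0, 0, 0],
                   "y": [1.0, 3.0, 2.0], "w": [1.0, 3.0, 2.0]})
cells = aggregate_to_cells(df)
# group 1 at time 0: (1*1 + 3*3) / (1 + 3) = 2.5
```

With uniform weights this reduces to the unweighted cell mean, which is why the non-survey path is unaffected.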
…torize IF expansion
- Remove _validate_group_constant_survey() call: the IF expansion psi_i = U[g] * (w_i / W_g) handles observation-level variation in weights, strata, and PSU within groups correctly
- Vectorize _survey_se_from_group_if using np.bincount + np.unique (was Python loops over all observations)
- Replace test_rejects_varying_weights_within_group with two positive tests: varying weights accepted, and varying weights change ATT (time-varying noise to survive first-differencing)
- Remove unused survey_weight_type variable

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
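The vectorized expansion can be sketched as follows. This is a hedged numpy illustration of psi_i = U[g] * (w_i / W_g) using np.unique + np.bincount; the function and variable names are assumptions, not the library's internals.

```python
import numpy as np

def expand_group_if(U, group_ids, w):
    """Expand a group-level influence function U (one value per group) to
    observation level: psi_i = U[g(i)] * (w_i / W_g), where W_g is the
    total survey weight of observation i's group."""
    uniq, inv = np.unique(group_ids, return_inverse=True)  # group index per obs
    W_g = np.bincount(inv, weights=w)                      # total weight per group
    return U[inv] * (w / W_g[inv])                         # vectorized expansion

U = np.array([4.0, 6.0])            # group-level IF values (groups sorted: a, b)
g = np.array(["a", "a", "b"])       # observation -> group
w = np.array([1.0, 3.0, 2.0])       # survey weights
psi = expand_group_if(U, g, w)
# psi = [1.0, 3.0, 6.0]; within-group sums recover U
```

Because the expansion only uses the weight ratio w_i / W_g, within-group weight variation is handled without any constancy validation, which is the rationale for removing the helper call.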
- P1-A: Scale U_centered by 1/divisor before survey IF expansion. dCDH IFs are numerator-scale (U.sum() == N_S * DID_M), but compute_survey_if_variance() expects estimator-scale psi.
- P1-B: Zero-weight cells (w_gt <= 0) now treated as absent by setting n_gt=0, preventing NaN propagation into estimates.
- P2: Add SE-pinning test (uniform weights + PSU=group matches plug-in SE) and zero-weight cell exclusion test.
- P3: Delete unused _validate_group_constant_survey() from survey.py that contradicted the supported within-group variation contract.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ey propagation
- P1-A: Drop zero-weight rows entirely from the cell DataFrame (instead of just zeroing n_gt) so the ragged-panel validator doesn't see them
- P1-B: Survey-weighted covariate aggregation in the DID^X path (sum(w*x)/sum(w) instead of unweighted mean)
- P1-C: Thread _df_survey to all remaining safe_inference() calls: bootstrap t-stats, normalized effects, cost-benefit delta, placebo bootstrap t-stats
- P3: Fix REGISTRY overview paragraph (it still said survey support was deferred)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- P0: The delta overall surface now uses _df_survey instead of df=None at both safe_inference sites (primary delta path + placebo NaN-SE fallback). This makes overall_* under L_max >= 2 use survey-t inference and respects safe_inference's df <= 0 NaN guard.
- P1: HonestDiD dCDH extraction now propagates df_survey from survey_metadata (mirrors the CS pattern). Survey-backed dCDH HonestDiD bounds now use survey-aware critical values.
- P2: Add 4 regressions (survey delta t matches reported, t-vs-z differs, survey + controls, survey + honest_did df propagation). Update stale comment in test_dcdh_extraction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- P1 #1: _compute_heterogeneity_test now accepts obs_survey_info and runs survey-aware WLS + a Binder TSL IF when survey_design is active. Point estimate via solve_ols(weights=W_elig, weight_type='pweight'); group-level IF ψ_g[X] = inv(X'WX)[1,:] @ x_g * W_g * r_g, expanded to obs level via the w_i/W_g ratio, then compute_survey_if_variance for stratified/PSU variance. safe_inference uses df_survey. Rank deficiency short-circuits to NaN to avoid a point-estimate/IF mismatch between solve_ols's R-style drop and pinv's minimum-norm solution.
- P1 #2: twowayfeweights() now accepts Optional[SurveyDesign]. When provided, it resolves weights via _resolve_survey_for_fit and passes them to _validate_and_aggregate_to_cells, restoring fit-vs-helper parity under survey-backed inputs. fweight/aweight are rejected.
- P3: REGISTRY updates: the TWFE parity sentence now includes survey; the heterogeneity Note documents the TSL IF mechanics and the library extension disclaimer; the checklist line 651 lists survey-aware surfaces; new survey + bootstrap-fallback Note after line 652.
- P2: 5 new regression tests in test_survey_dcdh.py: TestSurveyHeterogeneity (uniform-weights match, non-uniform beta change, t-dist df_survey) and TestSurveyTWFEParity (fit-vs-helper match, non-pweight rejection).

All 254 targeted tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
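The group-level IF construction for the slope, ψ_g = inv(X'WX)[1,:] @ x_g * W_g * r_g, can be sketched in plain numpy. This is an illustrative weighted-least-squares fragment under assumed names, not the library's solve_ols / compute_survey_if_variance API; the simulated design and coefficients are made up for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
n_g = 50
X = np.column_stack([np.ones(n_g), rng.normal(size=n_g)])  # intercept + covariate
W = rng.uniform(0.5, 2.0, size=n_g)                        # group-level weights
y = 1.0 + 2.0 * X[:, 1] + rng.normal(size=n_g)

XtWX = X.T @ (W[:, None] * X)
beta = np.linalg.solve(XtWX, X.T @ (W * y))                # weighted least squares
r = y - X @ beta                                           # residuals

# Group-level IF of the slope: row 1 of the "bread" times the weighted score.
bread_row = np.linalg.inv(XtWX)[1, :]
psi_g = (X @ bread_row) * W * r

# By the WLS normal equations X'W r = 0, the IF values sum to ~0.
```

Feeding psi_g into a stratified/PSU variance formula then yields the design-based SE of the slope, which is the Binder linearization idea the commit describes.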
- P1 #1: _compute_twfe_diagnostic now uses cell_weight (w_gt when available, else n_gt) for the FE regressions, the normalization denominator, the contribution weights, and the Corollary 1 observation shares. On survey-backed inputs the outputs now match the observation-level pweighted TWFE estimand; the non-survey path is byte-identical.
- P1 #2: Zero-weight rows are dropped before the groupby in _validate_and_aggregate_to_cells when weights are provided, so that d_min/d_max/n_gt reflect the effective sample. This prevents zero-weight subpopulation rows from tripping the fuzzy-DiD guard or inflating downstream n_gt counts.
- P2: 2 new regression tests in test_survey_dcdh.py: TestSurveyTWFEOracle.test_survey_twfe_matches_obs_level_pweighted_ols verifies that beta_fe matches an observation-level pweighted OLS under survey (it would fail if n_gt were still used), and TestZeroWeightSubpopulation.test_mixed_zero_weight_row_excluded_from_validation verifies that an injected zero-weight row with the opposite treatment value doesn't trip the within-cell constancy check.

All 256 targeted tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
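The oracle property being tested, that a cell-level WLS with weight w_c = Σ w_i and weighted-mean outcomes reproduces the observation-level pweighted fit for cell-constant regressors, can be checked with a short sketch. This uses a simple intercept-plus-slope regression rather than the library's FE machinery; all names and the simulated data are assumptions for the demo.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({"cell": np.repeat(np.arange(10), 4),
                   "w": rng.uniform(0.5, 2.0, 40)})
df["d"] = (df["cell"] % 2).astype(float)         # cell-constant regressor
df["y"] = 1.0 + 0.5 * df["d"] + rng.normal(size=40)

def wls_slope(x, y, w):
    """Slope from an intercept + x weighted least squares fit."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))[1]

# Observation-level pweighted fit
b_obs = wls_slope(df["d"].to_numpy(), df["y"].to_numpy(), df["w"].to_numpy())

# Cell-level fit: cell weight = sum of obs weights, outcome = weighted mean
df["wy"] = df["w"] * df["y"]
cells = df.groupby("cell").agg(w_c=("w", "sum"), d=("d", "first"),
                               wy=("wy", "sum")).reset_index()
cells["y_c"] = cells["wy"] / cells["w_c"]
b_cell = wls_slope(cells["d"].to_numpy(), cells["y_c"].to_numpy(),
                   cells["w_c"].to_numpy())
# b_obs and b_cell agree to machine precision
```

The agreement is algebraic: for cell-constant regressors, both X'WX and X'Wy collapse cell by cell to the same sums, so using raw cell counts n_gt in place of w_gt is exactly the bug the commit fixes.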
- twowayfeweights() now raises NotImplementedError when the resolved survey design carries replicate weights, matching fit()'s contract. The fit/helper parity promise now holds for unsupported inputs too.
- docs/llms-full.txt and ROADMAP.md no longer claim that Phase 3 parameters raise NotImplementedError; both now correctly note that only `aggregate` remains gated.
- Added a helper-level regression that pins the replicate-weight rejection, complementing the existing fit()-level test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The results dataclass docstring still described survey_metadata as 'always None' and survey integration as deferred. survey_design is now implemented and results.survey_metadata is populated whenever a SurveyDesign is passed to fit(). Docstring now describes the populated field and its role in survey-aware inference and HonestDiD propagation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…vey branch tests
- P1 #1: The R5 zero-weight filter only ran inside the cell aggregation step, after the NaN/coercion checks for group/time/treatment/outcome. Moved the filter to the very top of _validate_and_aggregate_to_cells so validation only sees the effective sample. fit()'s controls, trends_nonparam, and heterogeneity blocks now also scope their NaN/time-invariance checks to positive-weight rows when survey_weights is active. Legitimate SurveyDesign.subpopulation() inputs with NaN in excluded rows now fit cleanly. The TSL variance path is unchanged (zero-weight observations still contribute zero psi).
- P2: 5 new regression tests in test_survey_dcdh.py: TestZeroWeightSubpopulation now covers NaN outcome and NaN het columns in excluded rows; new TestSurveyTrendsLinear / TestSurveyTrendsNonparam / TestSurveyDesign2 classes exercise survey_design combined with those previously untested branches.

All 262 targeted tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R8's controls-block fix scoped NaN/Inf validation to the positive-weight subset (a shorter data_controls) but then assigned those shorter arrays into an x_agg_input built from the full-length frame, causing a length mismatch on any SurveyDesign.subpopulation() / zero-weight excluded row before covariate aggregation could run.

Root-caused fix: derive both the validation frame AND the aggregation frame from the same positive-weight effective sample (data_eff, survey_weights_eff). Zero-weight rows are genuinely out-of-sample throughout the DID^X path now. Non-survey fits are unchanged.

Added TestZeroWeightSubpopulation.test_zero_weight_row_with_nan_control pinning the subpopulation contract for the DID^X path: it injects a zero-weight row with a NaN control value and asserts fit() succeeds.

All 263 targeted tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
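The effective-sample pattern in this fix can be sketched as follows. data_eff and survey_weights_eff are the commit's own names; the toy validation and the column name `x` are assumptions, not the actual fit() code.

```python
import numpy as np
import pandas as pd

def effective_sample(data, survey_weights):
    """Derive validation AND aggregation inputs from the same positive-weight
    subset, so zero-weight (subpopulation-excluded) rows, including any NaN
    values they carry, are out-of-sample throughout."""
    mask = survey_weights > 0
    data_eff = data.loc[mask].reset_index(drop=True)
    survey_weights_eff = survey_weights[mask]
    # Validation now only sees the effective sample, and every downstream
    # array is built from the same frame, so lengths cannot mismatch.
    if data_eff["x"].isna().any() or np.isinf(data_eff["x"]).any():
        raise ValueError("NaN/Inf control values in effective sample")
    return data_eff, survey_weights_eff

data = pd.DataFrame({"x": [0.5, 1.2, np.nan]})
w = np.array([1.0, 2.0, 0.0])        # the NaN sits on a zero-weight row
data_eff, w_eff = effective_sample(data, w)
# fit proceeds: the zero-weight NaN row never reaches validation or aggregation
```

The length-mismatch bug arose precisely because validation and aggregation were derived from two different frames; deriving both from one masked frame removes the failure mode by construction.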
- diff_diff/guides/llms-practitioner.txt: add a reversible-treatment branch to the Step 4 decision tree pointing at ChaisemartinDHaultfoeuille (alias DCDH), and a short Step 5 example showing the survey_design + L_max + TSL fallback for dCDH. The practitioner guide previously had zero dCDH coverage despite the full API being live.
- diff_diff/guides/llms-full.txt: fix the stale SyntheticDiDResults table (line 1024): variance_method now accepts "bootstrap", "jackknife", or "placebo" (jackknife was added for synthdid::vcov parity but the table only listed bootstrap/placebo).

Bundled-guide paths only. The old docs/llms-*.txt were moved to diff_diff/guides/ in main (PR #310); these edits target the new canonical location consumed by get_llm_guide().

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed from 3624305 to d045987.
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment
Static review only. I could not run the test suite in this environment.
Executive Summary
Methodology
Affected methods: survey-backed
Code Quality
Performance
No findings.
Maintainability
No findings.
Tech Debt
No findings. I did not see a new unmitigated finding.
Security
No findings.
Documentation/Tests
Path to Approval
- P1 #1/#2: Add a _validate_group_constant_strata_psu() helper and call it from fit() after the weight_type/replicate-weights checks. The dCDH IF expansion psi_i = U[g] * (w_i / W_g) treats each group as the effective sampling unit; when strata or PSU vary within a group it silently spreads horizon-specific IF mass across observations in different PSUs, contaminating the stratified-PSU variance. Walk back the overstated claim in the old line-669 comment to match. Within-group-varying weights remain supported.
- P1 #3: _survey_se_from_group_if now filters zero-weight rows before np.unique/np.bincount so NaN / non-comparable group IDs on excluded subpopulation rows cannot crash SE factorization. psi stays full-length with zeros in excluded positions to preserve alignment with resolved.strata / resolved.psu inside compute_survey_if_variance.
- REGISTRY.md line-652 Note updated: it now explicitly states the within-group-constant strata/PSU requirement and the within-group-varying weights support.
- Tests: new TestSurveyWithinGroupValidation class (4 tests: rejects varying PSU, rejects varying strata, accepts varying weights, and ignores zero-weight rows during the constancy check) plus TestZeroWeightSubpopulation.test_zero_weight_row_with_nan_group_id.

All 268 targeted tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
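The within-group constancy check described above can be sketched with pandas groupby + nunique. This is a hypothetical standalone helper, not the library's _validate_group_constant_strata_psu(); column names are assumptions.

```python
import pandas as pd

def check_group_constant(df, group="g", cols=("stratum", "psu"), w="w"):
    """Raise if strata or PSU vary within any group, ignoring zero-weight
    rows (excluded subpopulation rows may carry NaN / arbitrary values)."""
    eff = df[df[w] > 0]                     # constancy only on effective sample
    for col in cols:
        n_unique = eff.groupby(group)[col].nunique(dropna=False)
        bad = n_unique[n_unique > 1]
        if len(bad):
            raise ValueError(f"{col!r} varies within groups: {list(bad.index)}")

df = pd.DataFrame({"g": [1, 1, 2, 2],
                   "stratum": ["a", "a", "b", "b"],
                   "psu": [1, 1, 2, 9],        # varies within group 2...
                   "w": [1.0, 1.0, 1.0, 0.0]}) # ...but only on a zero-weight row
check_group_constant(df)  # passes: the offending row is zero-weight
```

Making the check skip zero-weight rows is what lets SurveyDesign.subpopulation() inputs with junk values on excluded rows validate cleanly while still rejecting genuine within-group PSU variation.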
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment
Static review only. I could not run the test suite in this environment.
Executive Summary
Methodology
Code Quality
No additional findings beyond the methodology issue above.
Performance
No findings.
Maintainability
No findings.
Tech Debt
No mitigating entry found.
Security
No findings.
Documentation/Tests
Path to Approval
The dCDH methodology note says the group is the effective sampling unit for the TSL IF expansion psi_i = U[g] * (w_i / W_g). But when the user passed SurveyDesign(weights=...) with no explicit PSU, compute_survey_if_variance() defaulted to per-observation PSUs (survey.py:1201-1218), inflating the effective PSU count and df_survey and silently contradicting the documented contract.

- fit() now auto-injects psu=<group column> whenever the resolved SurveyDesign has no PSU, rebuilds a replacement SurveyDesign preserving strata/FPC/weight_type/nest/lonely_psu, and re-resolves so every downstream consumer (_survey_se_from_group_if, the heterogeneity survey path, survey_metadata.df_survey → HonestDiD critical values) uses the per-group structure. Replicate-weight designs are skipped (they're rejected by the subsequent check) and zero-weight rows are filtered before re-resolution so NaN group IDs on excluded subpopulation rows don't block PSU resolution.
- REGISTRY.md line-652 Note documents the auto-inject behavior and the user escape hatch (pass an explicit psu constant within group).
- Updated the test_uniform_survey_se_matches_plugin docstring to reflect the new default (no-PSU ≡ psu=group) instead of the prior "independent-obs" note.
- Added 2 regressions: test_auto_inject_psu_matches_explicit_group_psu (invariance of SE/df_survey between no-psu and explicit psu=group) and test_off_horizon_row_duplication_does_not_change_se (pins the per-group sampling-unit invariant under row duplication).

All 270 targeted tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment
Static review only. I could not run the test suite in this environment.
Executive Summary
Methodology
Code Quality
No findings.
Performance
No findings.
Maintainability
No findings.
Tech Debt
No unmitigated findings.
Security
No findings.
Documentation/Tests
Path to Approval
R11's auto-inject of psu=group filtered `data` to positive-weight rows before re-resolving the effective SurveyDesign. That silently shrank `df_survey` on SurveyDesign.subpopulation() inputs without an explicit PSU, violating the documented subpopulation contract that keeps the full design intact so t critical values, p-values, CIs, and HonestDiD bounds match full-design expectations.

Replace the pre-filter with a synthesized PSU column built on a private copy of `data`:
- Valid group values flow through unchanged as the per-row PSU label.
- NaN / invalid group values on zero-weight rows (the edge case that motivated the R11 filter) are replaced with a single shared dummy label so the PSU resolver accepts them.
- Zero-weight rows contribute psi_i = 0 to the variance, but remain in the resolved design so n_psu / n_strata / df_survey reflect the full sample, matching the library's subpopulation contract.

Added TestSurveyWithinGroupValidation.test_subpopulation_preserves_full_design_df_survey: it zero-weights an entire group (mimicking SurveyDesign.subpopulation) and asserts that the auto-inject df_survey equals the explicit psu='group' df_survey, the full-design reference.

All 271 targeted tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
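The synthesized-PSU construction can be sketched as follows. This is an illustrative pandas fragment under assumed column names and a made-up dummy label, not the actual fit() code.

```python
import numpy as np
import pandas as pd

def synthesize_psu(data, group="g", w="w"):
    """Build a per-row PSU label: valid group values pass through, while
    NaN group values on zero-weight rows collapse to one shared dummy
    label so the PSU resolver accepts them. All rows are kept, so
    n_psu / df_survey reflect the full design (subpopulation contract)."""
    psu = data[group].astype(object).copy()
    invalid = data[group].isna() & (data[w] <= 0)
    psu[invalid] = "_zero_weight_dummy"
    return psu

data = pd.DataFrame({"g": [1.0, 1.0, 2.0, np.nan],
                     "w": [1.0, 2.0, 1.5, 0.0]})
data["psu"] = synthesize_psu(data)
# 3 distinct PSU labels: 1.0, 2.0, and the shared dummy for the excluded row
```

Because no rows are dropped, the degrees of freedom computed from the resolved design match the full-design reference even when whole groups are zero-weighted.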
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment
⛔ Blocker. Static review only. I could not run the test suite in this environment.
Executive Summary
Methodology
The source pairing is coherent: the AER 2020 paper covers the TWFE weighting problem, the dynamic NBER companion covers the reversible-treatment/event-study dCDH estimators, and Binder is the survey-variance reference. That means the new survey linearization is an extension rather than paper-parity code.
Code Quality
No findings.
Performance
No findings.
Maintainability
No findings.
Tech Debt
Security
No findings.
Documentation/Tests
Path to Approval
…ey fits
_plugin_se returns NaN when the cohort-recentered IF is empty or
identically zero (documented degenerate-cohort contract — every
variance-eligible group is its own (D_{g,1}, F_g, S_g) singleton).
_survey_se_from_group_if only rejected negative variance, so on the
same panel it computed sqrt(0) = 0.0, suppressed the degenerate-cohort
warning (gated on np.isnan(overall_se)), and exposed a false zero SE.
The bug affected every surface routed through _compute_se — top-level
ATT, joiners/leavers, multi-horizon ATT, placebos, and derived
normalized/cumulated SEs.
Mirror the _plugin_se contract: short-circuit to NaN when U_centered
is empty or sum(U_centered**2) <= 0, before delegating to
compute_survey_if_variance.
Added TestSurveyWithinGroupValidation.test_degenerate_cohort_survey_se_is_nan:
4 groups × 5 periods, each switching at a unique F_g so every cohort
is a singleton; asserts overall_se is NaN (not 0.0) and that the
degenerate-cohort warning fires under the survey path.
All 272 targeted tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
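The guard described in this commit can be sketched as follows. This is a hypothetical helper mirroring the documented contract (NaN on an empty or identically-zero centered IF, instead of sqrt(0) = 0.0), not the library's _survey_se_from_group_if; variance_fn stands in for compute_survey_if_variance.

```python
import numpy as np

def survey_se_from_group_if(U_centered, variance_fn):
    """Short-circuit to NaN when the centered IF is empty or identically
    zero (degenerate cohorts), instead of reporting a false 0.0 SE."""
    U_centered = np.asarray(U_centered, dtype=float)
    if U_centered.size == 0 or np.sum(U_centered**2) <= 0:
        return float("nan")
    return np.sqrt(variance_fn(U_centered))

# Degenerate cohorts: every group is its own singleton, IF identically zero.
se = survey_se_from_group_if(np.zeros(4), lambda u: float(np.mean(u**2)))
# se is NaN (not 0.0), so a warning gated on np.isnan(se) can fire
```

Returning NaN rather than 0.0 matters because a zero SE would silently produce infinite t-statistics and zero-width CIs downstream, while NaN propagates as "no valid inference".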
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:
Overall Assessment
This re-review passes. The earlier survey-SE blocker is addressed, and I did not identify a new unmitigated P0 or P1 issue in the changed code. Static review only. I could not execute the test suite here because this environment is missing the required dependencies.
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Extends the group-level Taylor Series Linearization survey support landed in PR #307 with two opt-in variance mechanisms:

1. Replicate-weight variance (BRR / Fay / JK1 / JKn / SDR) routed through the unified compute_replicate_if_variance helper at every IF site (overall DID_M, joiners DID_+, leavers DID_-, multi-horizon DID_l, placebo DID^pl_l, heterogeneity beta^het_l, twowayfeweights helper). Uses the inline-branch pattern mirroring EfficientDiD (efficient_did.py:1119-1142) and TripleDifference (triple_diff.py:1206-1238). Effective df_survey reduces to min(n_valid) - 1 across IF sites when replicates fail, matching the precedent in efficient_did.py:1133-1135 and triple_diff.py:676-686.
2. PSU-level Hall-Mammen wild bootstrap via group_to_psu_map threaded through _compute_dcdh_bootstrap. Under auto-inject psu=group the identity-map fast path preserves pre-change behavior bit for bit. Under an explicitly coarser PSU (e.g., psu=state with county-level groups), all groups in the same PSU receive the same bootstrap multiplier so within-PSU correlation is preserved.

Locked contracts:
- Replicate weights and n_bootstrap > 0 are mutually exclusive (replicate variance is closed-form; bootstrap would double-count). Raises NotImplementedError, matching efficient_did.py:989 etc.
- Strata participate only through analytical TSL variance, not the bootstrap (multiplier weights stay unit-by-unit).
- The warning fires only when n_psu_eff < n_groups_eff (strictly coarser PSU). Under auto-inject psu=group the warning is correctly suppressed.

Adds 36 regression tests in tests/test_survey_dcdh_replicate_psu.py covering Class A sites (via _survey_se_from_group_if), Class B sites (heterogeneity + twowayfeweights), PSU bootstrap semantics, and cross-cutting invariants (replicate + bootstrap rejection, HonestDiD under replicate). REGISTRY.md lines 651-653 updated to document both mechanisms.

Full suite: 525 passing (was 489 before this change).
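A PSU-level wild-bootstrap multiplier draw along these lines can be sketched as follows, using Mammen's two-point distribution (one common wild-bootstrap weight choice). The function name and the PSU mapping are illustrative assumptions, not the library's _compute_dcdh_bootstrap.

```python
import numpy as np

def psu_wild_multipliers(group_psu, rng):
    """Draw one Mammen two-point multiplier per PSU and broadcast it to
    every group in that PSU, preserving within-PSU correlation.
    group_psu: array mapping each group to its PSU label."""
    phi = (1 + np.sqrt(5)) / 2
    p = phi / np.sqrt(5)                       # P(v = 1 - phi); gives E[v] = 0
    psus, inv = np.unique(group_psu, return_inverse=True)
    draws = np.where(rng.random(len(psus)) < p, 1 - phi, phi)
    return draws[inv]                          # co-PSU groups share a multiplier

rng = np.random.default_rng(42)
group_psu = np.array(["TX", "TX", "CA", "CA", "NY"])  # county groups -> state PSUs
v = psu_wild_multipliers(group_psu, rng)
# groups sharing a PSU share a multiplier: v[0] == v[1] and v[2] == v[3]
```

When the PSU map is the identity (psu=group, the auto-inject default), each group gets its own independent multiplier, which is why that fast path can reproduce the pre-change behavior exactly.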
Summary
Methodology references (required if estimator / math changes)
Validation
Security / privacy
Generated with Claude Code