
Add comprehensive code review for diff-diff library #5

Merged
igerber merged 2 commits into main from claude/code-review-session-fnLIe on Jan 2, 2026

Conversation


@igerber igerber commented Jan 2, 2026

Conduct thorough review from two perspectives:

  1. Subject matter expert (econometrics/methodology) - verifying DiD
    methodology, robust SE, parallel trends testing, and TOST equivalence
  2. Expert engineer - identifying bugs, edge cases, and code quality issues

Key findings:

  • Critical bug in absorbed fixed effects (not demeaning treatment vars)
  • Division by zero risks in SE and df calculations
  • FE dummies using wrong data source when combined with absorb
  • Overall solid foundation but needs fixes before production use
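The demeaning bug can be illustrated with a short sketch (the helper and variable names are hypothetical, not the library's code): under absorbed fixed effects, every regressor, including the treatment variable, must receive the within transform; leaving the treatment column raw mixes demeaned and untransformed columns in the same regression.

```python
import numpy as np

def within_demean(X, groups):
    """Subtract each group's mean from every column of X (within transform)."""
    Xd = X.astype(float).copy()
    for g in np.unique(groups):
        mask = groups == g
        Xd[mask] -= Xd[mask].mean(axis=0)
    return Xd

# The bug pattern: demeaning the outcome and controls but leaving the
# treatment column untouched. The fix is to demean ALL columns together.
rng = np.random.default_rng(0)
groups = np.repeat(np.arange(3), 4)           # 3 absorbed groups, 4 obs each
treat = (groups == 2).astype(float)           # treatment var must be demeaned too
X = np.column_stack([treat, rng.normal(size=12)])
X_demeaned = within_demean(X, groups)         # correct: every column demeaned
```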

claude added 2 commits January 2, 2026 13:33
Conduct thorough review from two perspectives:
1. Subject matter expert (econometrics/methodology) - verifying DiD
   methodology, robust SE, parallel trends testing, and TOST equivalence
2. Expert engineer - identifying bugs, edge cases, and code quality issues

Key findings:
- Critical bug in absorbed fixed effects (not demeaning treatment vars)
- Division by zero risks in SE and df calculations
- FE dummies using wrong data source when combined with absorb
- Overall solid foundation but needs fixes before production use
Fixes applied:
- Add matrix rank check to detect multicollinearity in design matrix
- Fix fixed effects dummies to use working_data consistently
- Add binary variable validation before any transformations
- Fix division by zero in parallel trends SE calculation
- Fix division by zero in equivalence test df calculation
- Use local RNG (np.random.default_rng) instead of global seed
- Make Wasserstein threshold configurable (wasserstein_threshold param)
- Add proper error handling for zero variance and insufficient data
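The guard patterns behind these fixes might look like the following sketch (illustrative names only, not the library's actual implementation):

```python
import numpy as np

def safe_se(ss_resid, dof):
    """Standard error with an explicit degrees-of-freedom guard
    instead of a silent division by zero."""
    if dof <= 0:
        raise ValueError(f"Non-positive degrees of freedom: {dof}")
    return np.sqrt(ss_resid / dof)

def check_rank(X):
    """Detect multicollinearity before solving the normal equations."""
    rank = np.linalg.matrix_rank(X)
    if rank < X.shape[1]:
        raise ValueError(
            f"Design matrix is rank-deficient: rank {rank} < {X.shape[1]} columns"
        )

# Local RNG: no global np.random.seed side effects leaking across calls.
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))
check_rank(X)   # a random Gaussian matrix is full rank
```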

Tests added:
- Multicollinearity detection test
- Custom Wasserstein threshold test
- Insufficient data edge case tests
- Single pre-period edge case test
- TwoWayFixedEffects comprehensive tests
- Cluster-robust standard errors test

All 42 tests passing.
@igerber igerber merged commit c2aa92a into main Jan 2, 2026
@igerber igerber deleted the claude/code-review-session-fnLIe branch January 2, 2026 14:51
igerber pushed a commit that referenced this pull request Jan 4, 2026
Revised review reflects:
- #1, #4 verified as non-issues (correct by design)
- #3, #5, #6, #8, #13 addressed in commit e40d6b4
- Updated recommendation to approve and merge
- Remaining items are low-priority style suggestions for future PRs
igerber added a commit that referenced this pull request Apr 18, 2026
TwoStageDiD and ImputationDiD each run two iterative alternating-projection
solvers (_iterative_fe, _iterative_demean) whose convergence loop exited
silently on max_iter exhaustion, returning the current iterate as if
converged. This matches the silent-failure pattern audited under axis B of
the silent-failures initiative (findings #2-#5).

Adds a shared warn_if_not_converged helper in diff_diff.utils and calls it
from all four alternating-projection loops on non-convergence. Pattern
mirrors the existing logistic + Poisson IRLS convergence warnings in
linalg.py (lines 1329-1376). Warning-only: no new public parameter, no
behavior change on inputs that already converge.
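A minimal sketch of such a helper, assuming the behavior described here (the real `warn_if_not_converged` in `diff_diff.utils` may have a different signature):

```python
import warnings

def warn_if_not_converged(converged, n_iter, max_iter, solver_name):
    """Emit a warning instead of failing silently on max_iter exhaustion.

    Illustrative sketch of the pattern described above.
    """
    if not converged:
        warnings.warn(
            f"{solver_name} did not converge after {n_iter}/{max_iter} "
            "iterations; results are the last iterate and may be inaccurate.",
            RuntimeWarning,
            stacklevel=2,
        )

# Typical call site at the bottom of an alternating-projection loop:
converged = False
for it in range(5):            # stand-in for range(max_iter)
    pass                       # pretend the tolerance was never met
warn_if_not_converged(converged, 5, 5, "_iterative_demean")
```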

Updates REGISTRY.md entries for ImputationDiD and TwoStageDiD with Note
labels describing the new signal.

Axis-B regression-lint baseline: 10 silent range(max_iter) loops -> 6
remaining (Frank-Wolfe and TROP addressed in follow-up PRs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 19, 2026
Addresses the second-round CI review findings:

- P1 false-pass (remaining): removed five phase-local try/except blocks
  that swallowed sub-step exceptions (HonestDiD M-grids in brand-awareness
  and BRFSS, dCDH HonestDiD and heterogeneity refit, dose-response
  dataframe extraction). Exceptions now escape, the phase is marked
  ok=false, and run_scenario's atexit handler exits nonzero. The fix
  caught a real API-usage bug on its first rerun: dose_response extract
  phase tried to pull event_study level on a result fit with
  aggregate="dose"; the event_study fit lives in a dedicated phase, so
  that level is removed from the extraction loop.
- P2 scenario-spec drift: BRFSS scenario text now says pweight TSL
  stage-2 (matching the aggregate_survey-returned design), not "Full
  replicate-weight path"; dCDH reversible scenario text now says
  heterogeneity="group" (matching the script), not "cohort".
- P3 path leakage: tracemalloc output now scrubs $HOME, repo root, and
  site-packages before writing the committed txt.
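The scrubbing step can be sketched as an ordered prefix replacement (function name and placeholder labels are illustrative, not the script's actual code):

```python
import os
import site

def scrub_paths(text, repo_root):
    """Replace machine-specific path prefixes before committing profiler output.

    Order matters: scrub the most specific prefix (repo root, which usually
    lives under $HOME) before the more general ones.
    """
    home = os.path.expanduser("~")
    for prefix, label in [(repo_root, "<repo>"), (home, "<home>")]:
        text = text.replace(prefix, label)
    for sp in site.getsitepackages():
        text = text.replace(sp, "<site-packages>")
    return text

line = "/home/ci/project/diff_diff/linalg.py:1329"
print(scrub_paths(line, "/home/ci/project"))  # -> <repo>/diff_diff/linalg.py:1329
```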

Drift-prevention layer:

- gen_findings_tables.py reads every JSON baseline and rewrites the
  numerical tables in performance-plan.md between
  <!-- TABLE:start <id> --> / <!-- TABLE:end <id> --> markers. Tables
  now re-derive from data on every rerun, eliminating the hand-edit
  drift the prior review flagged. Narrative prose stays hand-written
  by design, forcing a human re-read of findings when numbers shift.
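A marker-based rewrite like the one described can be sketched as follows (the marker syntax is taken from the text above; the real `gen_findings_tables.py` may differ):

```python
import re

def replace_table(doc, table_id, new_table):
    """Rewrite only the block between TABLE markers, leaving prose untouched."""
    pattern = re.compile(
        rf"(<!-- TABLE:start {re.escape(table_id)} -->\n)"
        rf".*?"
        rf"(\n<!-- TABLE:end {re.escape(table_id)} -->)",
        re.DOTALL,
    )
    # A callable replacement avoids backslash-escaping issues in new_table.
    return pattern.sub(lambda m: m.group(1) + new_table + m.group(2), doc)

doc = (
    "Narrative stays hand-written.\n"
    "<!-- TABLE:start timings -->\n"
    "| old | data |\n"
    "<!-- TABLE:end timings -->\n"
)
updated = replace_table(doc, "timings", "| fresh | data |")
```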

Findings refresh (the numbers moved slightly; three narrative claims
needed updating):

- "Rust marginally slower than Python on JK1 at large scale" -> removed;
  fresh data has Rust and Python within noise on brand awareness at
  large (JK1 phase 0.577s Py / 0.562s Rust, totals 1.03 / 1.04).
- "ImputationDiD consistently dominant phase at all scales" -> narrowed
  to "dominant under Python; tied with SunAbraham under Rust at large".
- "Nine-figures of MB" in memory finding #3 was a phrasing error
  (literally 100+ TB); corrected to "mid-100s of MB".

Priority of optimization opportunities refreshed against new data:

- #1 aggregate_survey precompute stratum scaffolding: High (unchanged,
  now strongly supported - 24.75s Python / 25.41s Rust at 1M rows, 100%
  of chain runtime, growth only +31 MB).
- #2 Staggered CS working-memory audit: Low with explicit bump-trigger
  (Rust large crosses 512 MB Lambda line).
- #5 Rust-port JK1 replicate fit loop: demoted from Medium to Low -
  the "Rust regression to fix" leg of the rationale is gone because
  Rust is no longer slower.

Net: one clear priority (aggregate_survey fix), four optional follow-ups.
Still measurement only. No changes under diff_diff/ or rust/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 19, 2026
CI re-review P1: `bench_brand_awareness_survey.py` declared the
analytical TSL path with `SurveyDesign(weights, strata, psu, fpc, nest)`
only in phase 2; phases 4 (multi-outcome), 6 (placebo), and 7 (event
study + HonestDiD) built their own SurveyDesigns without `fpc`. That
means a material share of the committed brand-awareness baselines
timed a different variance path than the scenario doc declares.

Fix:
- One analytical `sd_tsl` SurveyDesign (strata + PSU + FPC + nest=True)
  is now constructed once at the top of `make_phases` and reused across
  phases 2, 4, 6, and 7. Phase 3 (replicate weights, JK1) is a
  different variance surface and correctly keeps its own design.
- Regenerated baselines for both backends.
- Regenerated findings tables via gen_findings_tables.py.

Narrative refreshed against the new tables:

- Brand-aware medium: on Python JK1 now leads by ~2.2x (was 1.9x in
  the previous rerun); on Rust the multi-outcome loop and JK1 come in
  essentially tied. Medium is also where Python is slowest relative
  to Rust (~1.6x) - the full analytical TSL path with FPC exposes
  vectorization differences at that shape. Totals re-converge at
  large scale.
- Reversible dCDH: ~48-52% split under both backends (previously the
  Python heterogeneity refit edged out the main fit slightly).
- Scaling finding #5 retuned: Rust-only uplift is still the SDiD
  story; brand-aware medium now surfaces as a secondary, modest
  ~1.6x case rather than "within noise."

Still measurement only. No changes under diff_diff/ or rust/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 19, 2026
…ht variance"

CI re-review P2 + P3, both docs/label only:

- docs/performance-plan.md had two remaining specific-magnitude
  claims about brand-awareness medium ("~1.6x under Python",
  "Python and Rust separate the most at medium", "~1.6x at worst on
  brand-awareness medium"). Those were true on one earlier rerun but
  the current committed baselines show medium at 0.56 / 0.55
  (essentially tied) and the widest non-SDiD gap is now ~1.1x at
  brand-large. Reworded per-scenario paragraph and scaling finding #5
  to describe the stable aggregate pattern and defer exact ratios to
  the scale-sweep table. Same treatment as the earlier staggered/dCDH
  pass: narrative stops claiming magnitudes that can shift on rerun;
  the generator-owned table carries the specifics.
- bench_brand_awareness_survey.py module docstring labeled JK1 as
  "replicate-weight bootstrap". Per REGISTRY.md, JK1 is replicate-
  weight variance (jackknife-style), not bootstrap inference - they
  are distinct methodology surfaces. Renamed to "replicate-weight
  variance (JK1 delete-one-PSU)" with an inline note pointing to the
  registry.

Docs + docstring only. No script behaviour change; no baseline
regeneration needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
…light checks as standard-workflow predictions, not estimator gates

Reviewer correctly noted that calling
{has_never_treated, treatment_varies_within_unit==False,
is_balanced, no duplicate_unit_time_rows alert, dose_min > 0}
the "screening checks" / "necessary" gates of `ContinuousDiD`
overstates the contract. `ContinuousDiD.fit()` keys off the
separate `first_treat` column (which `profile_panel` does not see),
defines never-treated controls as `first_treat == 0` rows,
force-zeroes nonzero `dose` on those rows with a `UserWarning`,
and rejects negative dose only among treated units `first_treat > 0`
(see `continuous_did.py:276-327` and `:348-360`).

Two of the five checks (`has_never_treated`, `dose_min > 0`) are
first_treat-dependent: agents who relabel positive- or negative-dose
units as `first_treat == 0` trigger the force-zero coercion path
with a `UserWarning` and may still fit panels that fail those
preflights, with the methodology shifting. The other three
(`treatment_varies_within_unit`, `is_balanced`, duplicate-row
absence) are real fit-time gates that hold regardless of how
`first_treat` is constructed.

Reframed every wording site to call these "standard-workflow
preflight checks" — predictive when the agent derives `first_treat`
from the same dose column passed to `profile_panel`, but not the
estimator's literal contract:

- `diff_diff/profile.py` `TreatmentDoseShape` docstring (rewrote
  the multi-paragraph block; explicit standard-workflow definition
  + per-check first_treat dependency map + force-zero coercion
  caveat).
- `diff_diff/profile.py` `_compute_treatment_dose` helper docstring
  (already brief; stays consistent).
- `diff_diff/guides/llms-autonomous.txt` §2 field reference (long
  rewrite covering the standard-workflow framing + override paths).
- `diff_diff/guides/llms-autonomous.txt` §4.7 opening bullet +
  trailing paragraph (both updated; opening bullet now spells out
  which of the five checks are first_treat-dependent vs. hard
  fit-time stops; trailing paragraph promotes the standard-
  workflow framing).
- `diff_diff/guides/llms-autonomous.txt` §5.2 reasoning chain step
  2 (rewrote the gate-checking paragraph; counter-example #4
  expanded to enumerate (a) supply matching first_treat and accept
  rejection, (b) deliberate relabel + coercion, (c) different
  estimator; counter-example #5 distinguishes negative-dose
  treated-unit rejection from never-treated coercion).
- `CHANGELOG.md` Wave 2 entry (matches the new framing).
- `ROADMAP.md` AI-Agent Track building block (matches).

Test coverage:
- Renamed assertion messages in
  `test_treatment_dose_descriptive_fields_supplement_existing_gates`
  and `test_treatment_dose_min_flags_negative_dose_continuous_panels`
  to remove "authoritative gate" phrasing; reframed as "standard-
  workflow preflight" assertions consistent with the corrected docs.
- Added `test_negative_dose_on_never_treated_coerces_not_rejects`
  in `tests/test_continuous_did.py::TestEdgeCases` covering the
  reviewer's specific request: never-treated rows with NEGATIVE
  nonzero dose must coerce (with `UserWarning`) rather than raise
  the treated-unit negative-dose error. Sister to the existing
  `test_nonzero_dose_on_never_treated_warns` which covers the
  positive-dose case.
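The coercion contract that test locks in can be sketched with a toy function (hypothetical; the real logic in `continuous_did.py` operates on a full panel and is not a sanctioned routing path):

```python
import warnings
import numpy as np

def coerce_never_treated_dose(dose, first_treat):
    """Toy version of the coercion contract described above: nonzero dose
    (positive OR negative) on never-treated rows is force-zeroed with a
    UserWarning rather than raising the treated-unit negative-dose error."""
    dose = np.asarray(dose, dtype=float).copy()
    never_treated = np.asarray(first_treat) == 0
    bad = never_treated & (dose != 0)
    if bad.any():
        warnings.warn(
            f"Force-zeroing {bad.sum()} nonzero dose value(s) on never-treated rows.",
            UserWarning,
        )
        dose[bad] = 0.0
    return dose

# Negative dose on a never-treated row coerces (warns) rather than raising:
with warnings.catch_warnings(record=True) as rec:
    warnings.simplefilter("always")
    out = coerce_never_treated_dose([-0.5, 2.0], [0, 3])
```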

Rebased onto origin/main during this round (no conflicts beyond
prior CHANGELOG resolutions; main advanced 19 commits).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
…s-fallback wording; correct duplicate-row "fit-time stop" claim

P1 (relabel-to-manufacture-controls misframing):
Round 11 introduced wording across the guide, profile docstring,
CHANGELOG, ROADMAP, and test docstrings that presented intentional
`first_treat == 0` relabeling of nonzero-dose units as an
"option" / "fallback" for fitting `ContinuousDiD` when the
profile-side preflights (`has_never_treated`, `dose_min > 0`)
fail. REGISTRY does not document this as a routing option, and the
estimator still requires actual `P(D=0) > 0` because Remark 3.1
lowest-dose-as-control is not yet implemented. The force-zero
coercion at `continuous_did.py:311-327` is implementation behavior
for INCONSISTENT inputs (e.g., user accidentally passes nonzero
dose on a never-treated row), not a methodology fallback.

Reworded every site to remove the relabeling-as-option framing and
replace it with the registry-documented fixes when (1) or (5)
fails: re-encode the treatment column to a non-negative scale that
contains a true never-treated group, or route to a different
estimator (`HeterogeneousAdoptionDiD` for graded-adoption panels;
linear DiD with the treatment as a continuous covariate). Every
remaining "manufacture controls" mention in the guide, profile,
and tests is now an explicit anti-recommendation ("do not relabel
... to manufacture controls"). Updated:

- `diff_diff/profile.py` `TreatmentDoseShape` docstring (item (1):
  "not an opportunity to relabel ..."; item (5): coercion is
  "implementation behavior for inconsistent inputs, not a
  methodological fallback").
- `diff_diff/guides/llms-autonomous.txt` §2 field reference (the
  When-(1)-or-(5)-fails paragraph names re-encode + alternative
  estimator only; explicit anti-relabel warning).
- `diff_diff/guides/llms-autonomous.txt` §4.7 opening bullet +
  trailing paragraph (consolidated; opening bullet drops the
  relabel-as-fallback framing; trailing paragraph trimmed to a
  pointer to §2).
- `diff_diff/guides/llms-autonomous.txt` §5.2 step 2 + counter-
  example #4 + counter-example #5 (relabel-as-option language
  removed; explicit "do not relabel" callouts; counter-example #4
  options trimmed to (a) re-encode and (b) different estimator).
- `CHANGELOG.md` (relabel-as-option clause removed; replaced with
  re-encode / different-estimator framing).
- `ROADMAP.md` (same).
- `tests/test_profile_panel.py` two test docstrings (relabel-as-
  workflow language removed).

P2 (duplicate-row "hard fit-time stop" misclaim):
Round 11 wording said "duplicate-row failures are hard fit-time
stops" — incorrect. `_precompute_structures` at
`continuous_did.py:818-823` silently overwrites with last-row-wins,
no exception raised. Reworded as "hard preflight veto: the agent
must deduplicate before fit because `ContinuousDiD` otherwise uses
last-row-wins, no fit-time exception" in profile.py docstring,
guide §4.7 opening bullet, and §5.2 step 2 (now defers to §2 for
the breakdown). The previously-correct §2 description of the
silent-coerce path is preserved.

Length housekeeping:
The round-11-to-round-12 expansion pushed `llms-autonomous.txt`
above `llms-full.txt`, breaking `test_full_is_largest`. Trimmed
~2.7KB by consolidating the §4.7 trailing paragraph + §5.2 step 2
trailing block to point at §2's full breakdown rather than
duplicating the per-check semantics. autonomous: 65364 chars,
full: 66058 chars.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
… first_treat from dose" framing; add PanelProfile backward-compat defaults; fix test_continuous_did docstring

P1 (canonical ContinuousDiD setup vs. derive-from-dose framing):
Round 12 introduced a "standard workflow" description across the
guide, profile docstring, CHANGELOG, ROADMAP, and test docstrings
that said agents derive `first_treat` from the same dose column
passed to `profile_panel`. Reviewer correctly noted this conflicts
with the actual ContinuousDiD contract (`continuous_did.py:222-228`,
`prep_dgp.py:970-993`, `docs/methodology/continuous-did.md:65-73`):
the canonical setup uses a **time-invariant per-unit dose** `D_i`
and a **separate `first_treat` column** the caller supplies — the
dose column has no within-unit time variation in this setup, so it
cannot encode timing. An agent following the rejected framing would
either build a `0,0,d,d` path (which `fit()` rejects) or keep a
valid constant-dose panel (in which case the dose column carries no
timing information).

Reworded every site to drop the derive-from-dose framing and
replace with the canonical setup. The five facts on the dose column
remain predictive of `fit()` outcomes BECAUSE the canonical
convention ties `first_treat == 0` to `D_i == 0` and treated units
carry their constant dose across all periods — so `has_never_treated`
proxies `P(D=0) > 0` and `dose_min > 0` predicts the strictly-
positive-treated-dose requirement, without any "derivation" of
`first_treat` from the dose column. Updated:

- `diff_diff/profile.py` `TreatmentDoseShape` docstring (rewrote
  the multi-paragraph block to use the canonical-setup framing
  and added an explicit "agent must validate `first_treat`
  independently" note).
- `diff_diff/guides/llms-autonomous.txt` §2 field reference.
- `diff_diff/guides/llms-autonomous.txt` §4.7 opening bullet.
- `diff_diff/guides/llms-autonomous.txt` §5.2 reasoning chain
  step 2 + counter-examples #4 and #5 (now describe the
  canonical setup rather than a derive-from-dose workflow).
- `CHANGELOG.md` Wave 2 entry.
- `ROADMAP.md` AI-Agent Track building block.
- `tests/test_profile_panel.py` `test_treatment_dose_min_flags
  _negative_dose_continuous_panels` docstring/comments.

P2 (PanelProfile direct-construction backward compat):
Wave 2 added `outcome_shape` and `treatment_dose` to PanelProfile
without defaults, breaking direct `PanelProfile(...)` calls that
predate Wave 2. Made both fields default to `None` (moved them to
the end of the field list; both are `Optional[...]`). Added
`test_panel_profile_direct_construction_without_wave2_fields`
asserting that direct construction without the new fields succeeds
and yields `None` defaults that serialize correctly through
`to_dict()`.

P3 (test_continuous_did.py docstring overstating sanction):
The new `test_negative_dose_on_never_treated_coerces_not_rejects`
docstring said the contract "lets agents legally relabel
negative-dose units as `first_treat == 0` to coerce them away."
Reworded as observed implementation behavior for inconsistent
inputs, NOT a sanctioned routing option — the test locks in the
coercion contract while the autonomous guide §5.2 explicitly tells
agents not to use this path methodologically.

Length invariant maintained: autonomous (65748 chars) < full
(66031 chars); `test_full_is_largest` still passes (compares
character count, not byte count, so on-disk size with UTF-8
multi-byte characters differs from the assertion target).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
…s "negative dose" branches; HAD only valid on the former

Reviewer correctly noted that the round-15/16 wording listed
`HeterogeneousAdoptionDiD` as a routing alternative whenever
`ContinuousDiD` fails on the dose-related preflights, but HAD
itself requires non-negative dose support and raises on negative
post-period dose at `had.py:1450-1459` (paper Section 2). On a
panel with `dose_min < 0`, routing to HAD silently steers an agent
into the same fit-time error. Verified the rejection at
`had.py:1450-1459`.

Reworded every site to split the two failure modes:

- Branch (a): `has_never_treated == False` (no zero-dose controls
  but all observed doses non-negative). `ContinuousDiD` does not
  apply (Remark 3.1 not implemented). HAD IS a routing alternative
  on this branch (HAD's contract requires non-negative dose,
  satisfied here); linear DiD with a continuous covariate is
  another.
- Branch (e): `dose_min < 0` (negative treated doses).
  `ContinuousDiD` does not apply AND HAD is **not** a fallback
  either — HAD raises on negative post-period dose
  (`had.py:1450-1459`). Linear DiD with a signed continuous
  covariate is the applicable alternative on this branch.

Updated wording across:
- `diff_diff/profile.py` `TreatmentDoseShape` docstring (refactored
  from item-by-item duplication into a numbered list with a single
  "Routing alternatives when (1) or (5) fails" section that splits
  the two branches; trimmed redundancy).
- `diff_diff/guides/llms-autonomous.txt` §2 field reference (split
  the When-(1)-or-(5)-fails paragraph into the two branches).
- `diff_diff/guides/llms-autonomous.txt` §4.7 trailing paragraph
  (consolidated to a pointer at §2's split discussion).
- `diff_diff/guides/llms-autonomous.txt` §5.2 reasoning chain
  counter-example #4 (no never-treated branch: HAD applies) and
  counter-example #5 (negative-dose branch: HAD does NOT apply,
  cite `had.py:1450-1459`).
- `CHANGELOG.md` Wave 2 entry.
- `ROADMAP.md` AI-Agent Track building block.
- `tests/test_profile_panel.py` two test docstrings/comments.

Added `test_autonomous_negative_dose_path_does_not_route_to_had`
in `tests/test_guides.py` asserting that §5.2 explicitly cites
`had.py:1450-1459` on the negative-dose branch (used a single-
line fingerprint since the prose phrase "non-negative dose
support" is split across newlines in the rendered guide).

Length housekeeping: trimmed counter-example #4 and #5 prose +
§4.7 trailing paragraph to point at §2's split discussion;
autonomous (65374 chars) < full (66031), `test_full_is_largest`
green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
