Add comprehensive code review for diff-diff library #5
Merged
Conduct a thorough review from two perspectives:

1. Subject matter expert (econometrics/methodology): verifying DiD methodology, robust SEs, parallel trends testing, and TOST equivalence
2. Expert engineer: identifying bugs, edge cases, and code quality issues

Key findings:

- Critical bug in absorbed fixed effects (treatment variables not demeaned)
- Division-by-zero risks in SE and df calculations
- FE dummies using the wrong data source when combined with absorb
- Overall a solid foundation, but fixes needed before production use
Fixes applied:

- Add a matrix rank check to detect multicollinearity in the design matrix
- Fix fixed-effects dummies to use working_data consistently
- Add binary-variable validation before any transformations
- Fix division by zero in the parallel trends SE calculation
- Fix division by zero in the equivalence test df calculation
- Use a local RNG (np.random.default_rng) instead of the global seed
- Make the Wasserstein threshold configurable (wasserstein_threshold param)
- Add proper error handling for zero variance and insufficient data

Tests added:

- Multicollinearity detection test
- Custom Wasserstein threshold test
- Insufficient-data edge case tests
- Single pre-period edge case test
- TwoWayFixedEffects comprehensive tests
- Cluster-robust standard errors test

All 42 tests passing.
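The rank-check and local-RNG fixes listed above can be sketched as follows. This is an illustrative sketch, not the library's actual code; the helper name `check_design_rank` is hypothetical, and only `np.random.default_rng` and the rank-check idea come from the PR description.

```python
import numpy as np

def check_design_rank(X: np.ndarray) -> None:
    """Raise on a rank-deficient design matrix (perfect multicollinearity)."""
    rank = int(np.linalg.matrix_rank(X))
    if rank < X.shape[1]:
        raise ValueError(
            f"Design matrix is rank-deficient: rank {rank} < {X.shape[1]} "
            "columns; drop collinear regressors before fitting."
        )

# Local RNG instead of seeding the global np.random state: an isolated
# generator with no side effects on other code that uses np.random.
rng = np.random.default_rng(42)
draws = rng.normal(size=10)
```

Using a local generator means two estimators run in the same process cannot perturb each other's random streams, which a global `np.random.seed` call would.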
igerber pushed a commit that referenced this pull request on Jan 4, 2026
igerber added a commit that referenced this pull request on Apr 18, 2026
TwoStageDiD and ImputationDiD each run two iterative alternating-projection solvers (`_iterative_fe`, `_iterative_demean`) whose convergence loops exited silently on max_iter exhaustion, returning the current iterate as if it had converged. This matches the silent-failure pattern audited under axis B of the silent-failures initiative (findings #2-#5).

Adds a shared `warn_if_not_converged` helper in `diff_diff.utils` and calls it from all four alternating-projection loops on non-convergence. The pattern mirrors the existing logistic + Poisson IRLS convergence warnings in linalg.py (lines 1329-1376). Warning-only: no new public parameter, no behavior change on inputs that already converge.

Updates REGISTRY.md entries for ImputationDiD and TwoStageDiD with Note labels describing the new signal. Axis-B regression-lint baseline: 10 silent `range(max_iter)` loops -> 6 remaining (Frank-Wolfe and TROP addressed in follow-up PRs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
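A minimal sketch of what such a `warn_if_not_converged` helper could look like. The signature here is an assumption; only the helper's name, its home in `diff_diff.utils`, and the warning-only contract come from the commit message.

```python
import warnings

def warn_if_not_converged(converged: bool, method: str,
                          max_iter: int, tol: float) -> None:
    """Warn when an iterative solver exhausts max_iter without meeting
    its tolerance. Warning-only: the caller still returns the current
    iterate, so behavior is unchanged on inputs that already converge."""
    if not converged:
        warnings.warn(
            f"{method} did not converge within {max_iter} iterations "
            f"(tol={tol}); results use the last iterate and may be "
            "inaccurate.",
            RuntimeWarning,
            stacklevel=2,
        )

# Typical call site at the end of an alternating-projection loop:
#
#   converged = False
#   for it in range(max_iter):
#       ...update...
#       if delta < tol:
#           converged = True
#           break
#   warn_if_not_converged(converged, "_iterative_demean", max_iter, tol)
```

The point of centralizing the check is that all four loops emit the same signal, so a regression lint can count remaining silent `range(max_iter)` loops mechanically.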
igerber added a commit that referenced this pull request on Apr 19, 2026
Addresses the second-round CI review findings:

- P1 false-pass (remaining): removed five phase-local try/except blocks that swallowed sub-step exceptions (HonestDiD M-grids in brand-awareness and BRFSS, dCDH HonestDiD and heterogeneity refit, dose-response dataframe extraction). Exceptions now escape, the phase is marked ok=false, and run_scenario's atexit handler exits nonzero. The fix caught a real API-usage bug on its first rerun: the dose_response extract phase tried to pull the event_study level on a result fit with aggregate="dose"; the event_study fit lives in a dedicated phase, so that level is removed from the extraction loop.
- P2 scenario-spec drift: BRFSS scenario text now says pweight TSL stage-2 (matching the aggregate_survey-returned design), not "Full replicate-weight path"; dCDH reversible scenario text now says heterogeneity="group" (matching the script), not "cohort".
- P3 path leakage: tracemalloc output now scrubs $HOME, the repo root, and site-packages before writing the committed txt.

Drift-prevention layer:

- gen_findings_tables.py reads every JSON baseline and rewrites the numerical tables in performance-plan.md between <!-- TABLE:start <id> --> / <!-- TABLE:end <id> --> markers. Tables now re-derive from data on every rerun, eliminating the hand-edit drift the prior review flagged. Narrative prose stays hand-written by design, forcing a human re-read of findings when numbers shift.

Findings refresh (the numbers moved slightly; three narrative claims needed updating):

- "Rust marginally slower than Python on JK1 at large scale" -> removed; fresh data has Rust and Python within noise on brand awareness at large (JK1 phase 0.577s Py / 0.562s Rust, totals 1.03 / 1.04).
- "ImputationDiD consistently dominant phase at all scales" -> narrowed to "dominant under Python; tied with SunAbraham under Rust at large".
- "Nine-figures of MB" in memory finding #3 was a phrasing error (literally 100+ TB); corrected to "mid-100s of MB".

Priority of optimization opportunities refreshed against new data:

- #1 aggregate_survey precompute stratum scaffolding: High (unchanged, now strongly supported: 24.75s Python / 25.41s Rust at 1M rows, 100% of chain runtime, growth only +31 MB).
- #2 Staggered CS working-memory audit: Low, with an explicit bump trigger (Rust large crosses the 512 MB Lambda line).
- #5 Rust-port JK1 replicate fit loop: demoted from Medium to Low; the "Rust regression to fix" leg of the rationale is gone because Rust is no longer slower.

Net: one clear priority (the aggregate_survey fix), four optional follow-ups. Still measurement only. No changes under diff_diff/ or rust/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
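The marker-delimited table rewrite can be sketched with a small regex helper. `replace_table` is a hypothetical name; only the `<!-- TABLE:start <id> -->` / `<!-- TABLE:end <id> -->` marker format comes from the commit message.

```python
import re

def replace_table(doc: str, table_id: str, new_table: str) -> str:
    """Replace the block between the start/end markers for table_id,
    leaving hand-written narrative prose outside the markers untouched."""
    pattern = re.compile(
        rf"(<!-- TABLE:start {re.escape(table_id)} -->\n)"
        rf".*?"
        rf"(<!-- TABLE:end {re.escape(table_id)} -->)",
        flags=re.DOTALL,
    )
    # A callable replacement avoids backslash-escape surprises in new_table.
    return pattern.sub(lambda m: m.group(1) + new_table + "\n" + m.group(2), doc)
```

Because only the span between markers is generator-owned, reruns re-derive every number from the JSON baselines while the surrounding narrative stays hand-written, which is exactly the drift-prevention split the commit describes.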
igerber added a commit that referenced this pull request on Apr 19, 2026
CI re-review P1: `bench_brand_awareness_survey.py` declared the analytical TSL path with `SurveyDesign(weights, strata, psu, fpc, nest)` only in phase 2; phases 4 (multi-outcome), 6 (placebo), and 7 (event study + HonestDiD) built their own SurveyDesigns without `fpc`. That means a material share of the committed brand-awareness baselines timed a different variance path than the scenario doc declares.

Fix:

- One analytical `sd_tsl` SurveyDesign (strata + PSU + FPC + nest=True) is now constructed once at the top of `make_phases` and reused across phases 2, 4, 6, and 7. Phase 3 (replicate weights, JK1) is a different variance surface and correctly keeps its own design.
- Regenerated baselines for both backends.
- Regenerated findings tables via gen_findings_tables.py.

Narrative refreshed against the new tables:

- Brand-aware medium: on Python, JK1 now leads by ~2.2x (was 1.9x in the previous rerun); on Rust, the multi-outcome loop and JK1 come in essentially tied. Medium is also where Python is slowest relative to Rust (~1.6x): the full analytical TSL path with FPC exposes vectorization differences at that shape. Totals re-converge at large scale.
- Reversible dCDH: ~48-52% split under both backends (previously the Python heterogeneity refit edged out the main fit slightly).
- Scaling finding #5 retuned: the Rust-only uplift is still the SDiD story; brand-aware medium now surfaces as a secondary, modest ~1.6x case rather than "within noise."

Still measurement only. No changes under diff_diff/ or rust/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
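The construct-once-reuse pattern can be illustrated with a stand-in class. The real `SurveyDesign` lives in the library; the field names here follow the signature quoted above, and the phase numbering comes from the commit message.

```python
from dataclasses import dataclass

# Stand-in for the library's SurveyDesign, for illustration only.
@dataclass(frozen=True)
class SurveyDesign:
    weights: str
    strata: str
    psu: str
    fpc: str
    nest: bool

def make_phases() -> dict:
    # Construct the analytical TSL design ONCE and share the same object
    # across the analytical-variance phases (2, 4, 6, 7), so every phase
    # times the same variance path: same FPC, same nesting.
    sd_tsl = SurveyDesign("pweight", "stratum", "psu", "fpc", nest=True)
    # Phase 3 (replicate weights, JK1) is a different variance surface
    # and keeps its own design, so it is deliberately absent here.
    return {2: sd_tsl, 4: sd_tsl, 6: sd_tsl, 7: sd_tsl}
```

Sharing one object also makes the invariant checkable: a test can assert the phases hold the identical design rather than structurally similar copies.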
igerber added a commit that referenced this pull request on Apr 19, 2026
…ht variance"
CI re-review P2 + P3, both docs/label only:
- docs/performance-plan.md had two remaining specific-magnitude
claims about brand-awareness medium ("~1.6x under Python",
"Python and Rust separate the most at medium", "~1.6x at worst on
brand-awareness medium"). Those were true on one earlier rerun but
the current committed baselines show medium at 0.56 / 0.55
(essentially tied) and the widest non-SDiD gap is now ~1.1x at
brand-large. Reworded per-scenario paragraph and scaling finding #5
to describe the stable aggregate pattern and defer exact ratios to
the scale-sweep table. Same treatment as the earlier staggered/dCDH
pass: narrative stops claiming magnitudes that can shift on rerun;
the generator-owned table carries the specifics.
- bench_brand_awareness_survey.py module docstring labeled JK1 as
"replicate-weight bootstrap". Per REGISTRY.md, JK1 is replicate-
weight variance (jackknife-style), not bootstrap inference - they
are distinct methodology surfaces. Renamed to "replicate-weight
variance (JK1 delete-one-PSU)" with an inline note pointing to the
registry.
Docs + docstring only. No script behaviour change; no baseline
regeneration needed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request on Apr 25, 2026
…light checks as standard-workflow predictions, not estimator gates
Reviewer correctly noted that calling
{has_never_treated, treatment_varies_within_unit==False,
is_balanced, no duplicate_unit_time_rows alert, dose_min > 0}
the "screening checks" / "necessary" gates of `ContinuousDiD`
overstates the contract. `ContinuousDiD.fit()` keys off the
separate `first_treat` column (which `profile_panel` does not see),
defines never-treated controls as `first_treat == 0` rows,
force-zeroes nonzero `dose` on those rows with a `UserWarning`,
and rejects negative dose only among treated units `first_treat > 0`
(see `continuous_did.py:276-327` and `:348-360`).
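An illustrative sketch (not the library's code) of the coercion/rejection contract described above: nonzero dose on rows the caller marked never-treated (`first_treat == 0`) is force-zeroed with a `UserWarning`, while negative dose is rejected only among treated units. Column names are assumed.

```python
import warnings
import pandas as pd

def validate_dose(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the fit-time dose handling: coerce on never-treated
    rows, reject negative dose on treated rows."""
    df = df.copy()
    never = df["first_treat"] == 0
    inconsistent = never & (df["dose"] != 0)
    if inconsistent.any():
        warnings.warn(
            f"Zeroing nonzero dose on {int(inconsistent.sum())} "
            "never-treated (first_treat == 0) rows.",
            UserWarning,
        )
        df.loc[inconsistent, "dose"] = 0.0
    if ((~never) & (df["dose"] < 0)).any():
        raise ValueError("Negative dose on treated units (first_treat > 0).")
    return df
```

The asymmetry is the crux of the wording fix: the same negative dose is a hard error on a treated row but a warn-and-coerce on a relabeled never-treated row, which is why the checks are predictions about the standard workflow rather than unconditional gates.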
Two of the five checks (`has_never_treated`, `dose_min > 0`) are
first_treat-dependent: agents who relabel positive- or negative-dose
units as `first_treat == 0` trigger the force-zero coercion path
with a `UserWarning` and may still fit panels that fail those
preflights, with the methodology shifting. The other three
(`treatment_varies_within_unit`, `is_balanced`, duplicate-row
absence) are real fit-time gates that hold regardless of how
`first_treat` is constructed.
Reframed every wording site to call these "standard-workflow
preflight checks" — predictive when the agent derives `first_treat`
from the same dose column passed to `profile_panel`, but not the
estimator's literal contract:
- `diff_diff/profile.py` `TreatmentDoseShape` docstring (rewrote
the multi-paragraph block; explicit standard-workflow definition
+ per-check first_treat dependency map + force-zero coercion
caveat).
- `diff_diff/profile.py` `_compute_treatment_dose` helper docstring
(already brief; stays consistent).
- `diff_diff/guides/llms-autonomous.txt` §2 field reference (long
rewrite covering the standard-workflow framing + override paths).
- `diff_diff/guides/llms-autonomous.txt` §4.7 opening bullet +
trailing paragraph (both updated; opening bullet now spells out
which of the five checks are first_treat-dependent vs. hard
fit-time stops; trailing paragraph promotes the standard-
workflow framing).
- `diff_diff/guides/llms-autonomous.txt` §5.2 reasoning chain step
2 (rewrote the gate-checking paragraph; counter-example #4
expanded to enumerate (a) supply matching first_treat and accept
rejection, (b) deliberate relabel + coercion, (c) different
estimator; counter-example #5 distinguishes negative-dose
treated-unit rejection from never-treated coercion).
- `CHANGELOG.md` Wave 2 entry (matches the new framing).
- `ROADMAP.md` AI-Agent Track building block (matches).
Test coverage:
- Renamed assertion messages in
`test_treatment_dose_descriptive_fields_supplement_existing_gates`
and `test_treatment_dose_min_flags_negative_dose_continuous_panels`
to remove "authoritative gate" phrasing; reframed as "standard-
workflow preflight" assertions consistent with the corrected docs.
- Added `test_negative_dose_on_never_treated_coerces_not_rejects`
in `tests/test_continuous_did.py::TestEdgeCases` covering the
reviewer's specific request: never-treated rows with NEGATIVE
nonzero dose must coerce (with `UserWarning`) rather than raise
the treated-unit negative-dose error. Sister to the existing
`test_nonzero_dose_on_never_treated_warns` which covers the
positive-dose case.
Rebased onto origin/main during this round (no conflicts beyond
prior CHANGELOG resolutions; main advanced 19 commits).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request on Apr 25, 2026
…s-fallback wording; correct duplicate-row "fit-time stop" claim
P1 (relabel-to-manufacture-controls misframing):
Round 11 introduced wording across the guide, profile docstring,
CHANGELOG, ROADMAP, and test docstrings that presented intentional
`first_treat == 0` relabeling of nonzero-dose units as an
"option" / "fallback" for fitting `ContinuousDiD` when the
profile-side preflights (`has_never_treated`, `dose_min > 0`)
fail. REGISTRY does not document this as a routing option, and the
estimator still requires actual `P(D=0) > 0` because Remark 3.1
lowest-dose-as-control is not yet implemented. The force-zero
coercion at `continuous_did.py:311-327` is implementation behavior
for INCONSISTENT inputs (e.g., user accidentally passes nonzero
dose on a never-treated row), not a methodology fallback.
Reworded every site to remove the relabeling-as-option framing and
replace it with the registry-documented fixes when (1) or (5)
fails: re-encode the treatment column to a non-negative scale that
contains a true never-treated group, or route to a different
estimator (`HeterogeneousAdoptionDiD` for graded-adoption panels;
linear DiD with the treatment as a continuous covariate). Every
remaining "manufacture controls" mention in the guide, profile,
and tests is now an explicit anti-recommendation ("do not relabel
... to manufacture controls"). Updated:
- `diff_diff/profile.py` `TreatmentDoseShape` docstring (item (1):
"not an opportunity to relabel ..."; item (5): coercion is
"implementation behavior for inconsistent inputs, not a
methodological fallback").
- `diff_diff/guides/llms-autonomous.txt` §2 field reference (the
When-(1)-or-(5)-fails paragraph names re-encode + alternative
estimator only; explicit anti-relabel warning).
- `diff_diff/guides/llms-autonomous.txt` §4.7 opening bullet +
trailing paragraph (consolidated; opening bullet drops the
relabel-as-fallback framing; trailing paragraph trimmed to a
pointer to §2).
- `diff_diff/guides/llms-autonomous.txt` §5.2 step 2 + counter-
example #4 + counter-example #5 (relabel-as-option language
removed; explicit "do not relabel" callouts; counter-example #4
options trimmed to (a) re-encode and (b) different estimator).
- `CHANGELOG.md` (relabel-as-option clause removed; replaced with
re-encode / different-estimator framing).
- `ROADMAP.md` (same).
- `tests/test_profile_panel.py` two test docstrings (relabel-as-
workflow language removed).
P2 (duplicate-row "hard fit-time stop" misclaim):
Round 11 wording said "duplicate-row failures are hard fit-time
stops" — incorrect. `_precompute_structures` at
`continuous_did.py:818-823` silently overwrites with last-row-wins,
no exception raised. Reworded as "hard preflight veto: the agent
must deduplicate before fit because `ContinuousDiD` otherwise uses
last-row-wins, no fit-time exception" in profile.py docstring,
guide §4.7 opening bullet, and §5.2 step 2 (now defers to §2 for
the breakdown). The previously-correct §2 description of the
silent-coerce path is preserved.
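The "hard preflight veto" reading above can be sketched as a small check the agent runs before fit; the helper name and column names are illustrative, not library API.

```python
import pandas as pd

def require_unique_unit_time(df: pd.DataFrame,
                             unit: str = "unit",
                             time: str = "time") -> pd.DataFrame:
    """Preflight veto: ContinuousDiD raises no fit-time exception on
    duplicate (unit, time) rows and silently keeps the last one
    (last-row-wins), so duplicates must be resolved BEFORE fit."""
    dups = df.duplicated(subset=[unit, time])
    if dups.any():
        raise ValueError(
            f"{int(dups.sum())} duplicate (unit, time) rows; resolve them "
            "explicitly instead of relying on last-row-wins."
        )
    return df
```

Raising here, rather than calling `drop_duplicates`, forces the caller to decide which row is correct; an automatic drop would just reproduce the silent last-row-wins behavior one layer up.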
Length housekeeping:
The round-11 -> round-12 expansion pushed `llms-autonomous.txt`
above `llms-full.txt`, breaking `test_full_is_largest`. Trimmed
~2.7KB by consolidating the §4.7 trailing paragraph + §5.2 step 2
trailing block to point at §2's full breakdown rather than
duplicating the per-check semantics. autonomous: 65364 chars,
full: 66058 chars.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request on Apr 25, 2026
… first_treat from dose" framing; add PanelProfile backward-compat defaults; fix test_continuous_did docstring

P1 (canonical ContinuousDiD setup vs. derive-from-dose framing):
Round 12 introduced a "standard workflow" description across the guide, profile docstring, CHANGELOG, ROADMAP, and test docstrings that said agents derive `first_treat` from the same dose column passed to `profile_panel`. The reviewer correctly noted this conflicts with the actual ContinuousDiD contract (`continuous_did.py:222-228`, `prep_dgp.py:970-993`, `docs/methodology/continuous-did.md:65-73`): the canonical setup uses a **time-invariant per-unit dose** `D_i` and a **separate `first_treat` column** the caller supplies; the dose column has no within-unit time variation in this setup, so it cannot encode timing. An agent following the rejected framing would either build a `0,0,d,d` path (which `fit()` rejects) or keep a valid constant-dose panel (in which case the dose column carries no timing information).

Reworded every site to drop the derive-from-dose framing and replace it with the canonical setup. The five facts on the dose column remain predictive of `fit()` outcomes BECAUSE the canonical convention ties `first_treat == 0` to `D_i == 0` and treated units carry their constant dose across all periods — so `has_never_treated` proxies `P(D=0) > 0` and `dose_min > 0` predicts the strictly-positive-treated-dose requirement, without any "derivation" of `first_treat` from the dose column. Updated:
- `diff_diff/profile.py` `TreatmentDoseShape` docstring (rewrote the multi-paragraph block to use the canonical-setup framing and added an explicit "agent must validate `first_treat` independently" note).
- `diff_diff/guides/llms-autonomous.txt` §2 field reference.
- `diff_diff/guides/llms-autonomous.txt` §4.7 opening bullet.
- `diff_diff/guides/llms-autonomous.txt` §5.2 reasoning chain step 2 + counter-examples #4 and #5 (now describe the canonical setup rather than a derive-from-dose workflow).
- `CHANGELOG.md` Wave 2 entry.
- `ROADMAP.md` AI-Agent Track building block.
- `tests/test_profile_panel.py` `test_treatment_dose_min_flags_negative_dose_continuous_panels` docstring/comments.

P2 (PanelProfile direct-construction backward compat):
Wave 2 added `outcome_shape` and `treatment_dose` to PanelProfile without defaults, breaking direct `PanelProfile(...)` calls that predate Wave 2. Made both fields default to `None` (moved them to the end of the field list; both are `Optional[...]`). Added `test_panel_profile_direct_construction_without_wave2_fields` asserting that direct construction without the new fields succeeds and yields `None` defaults that serialize correctly through `to_dict()`.

P3 (test_continuous_did.py docstring overstating sanction):
The new `test_negative_dose_on_never_treated_coerces_not_rejects` docstring said the contract "lets agents legally relabel negative-dose units as `first_treat == 0` to coerce them away." Reworded as observed implementation behavior for inconsistent inputs, NOT a sanctioned routing option: the test locks in the coercion contract while the autonomous guide §5.2 explicitly tells agents not to use this path methodologically.

Length invariant maintained: autonomous (65748 chars) < full (66031 chars); `test_full_is_largest` still passes (it compares character count, not byte count, so the on-disk size with UTF-8 multi-byte characters differs from the assertion target).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
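The backward-compat pattern in P2 can be sketched with a minimal dataclass. Only `outcome_shape`, `treatment_dose`, the `None` defaults, the field ordering, and `to_dict()` come from the commit; the other field names are hypothetical placeholders.

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class PanelProfile:
    # Hypothetical pre-Wave-2 fields, shown only to demonstrate the pattern.
    n_units: int
    n_periods: int
    # Wave 2 additions: Optional with None defaults, appended at the END of
    # the field list so older positional PanelProfile(...) calls still work.
    outcome_shape: Optional[dict] = None
    treatment_dose: Optional[dict] = None

    def to_dict(self) -> dict:
        # None defaults serialize as-is, so pre-Wave-2 consumers see
        # explicit nulls rather than missing keys.
        return asdict(self)
```

Appending defaulted fields at the end is the standard dataclass-compatible move: inserting them earlier would either shift positional arguments or trigger the "non-default argument follows default argument" error.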
igerber added a commit that referenced this pull request on Apr 25, 2026
…s "negative dose" branches; HAD only valid on the former

The reviewer correctly noted that the round-15/16 wording listed `HeterogeneousAdoptionDiD` as a routing alternative whenever `ContinuousDiD` fails on the dose-related preflights, but HAD itself requires non-negative dose support and raises on negative post-period dose at `had.py:1450-1459` (paper Section 2). On a panel with `dose_min < 0`, routing to HAD silently steers an agent into the same fit-time error. Verified the rejection at `had.py:1450-1459`.

Reworded every site to split the two failure modes:
- Branch (a): `has_never_treated == False` (no zero-dose controls but all observed doses non-negative). `ContinuousDiD` does not apply (Remark 3.1 not implemented). HAD IS a routing alternative on this branch (HAD's contract requires non-negative dose, satisfied here); linear DiD with a continuous covariate is another.
- Branch (e): `dose_min < 0` (negative treated doses). `ContinuousDiD` does not apply AND HAD is **not** a fallback either — HAD raises on negative post-period dose (`had.py:1450-1459`). Linear DiD with a signed continuous covariate is the applicable alternative on this branch.

Updated wording across:
- `diff_diff/profile.py` `TreatmentDoseShape` docstring (refactored from item-by-item duplication into a numbered list with a single "Routing alternatives when (1) or (5) fails" section that splits the two branches; trimmed redundancy).
- `diff_diff/guides/llms-autonomous.txt` §2 field reference (split the When-(1)-or-(5)-fails paragraph into the two branches).
- `diff_diff/guides/llms-autonomous.txt` §4.7 trailing paragraph (consolidated to a pointer at §2's split discussion).
- `diff_diff/guides/llms-autonomous.txt` §5.2 reasoning chain counter-example #4 (no never-treated branch: HAD applies) and counter-example #5 (negative-dose branch: HAD does NOT apply, cite `had.py:1450-1459`).
- `CHANGELOG.md` Wave 2 entry.
- `ROADMAP.md` AI-Agent Track building block.
- `tests/test_profile_panel.py` two test docstrings/comments.

Added `test_autonomous_negative_dose_path_does_not_route_to_had` in `tests/test_guides.py` asserting that §5.2 explicitly cites `had.py:1450-1459` on the negative-dose branch (used a single-line fingerprint since the prose phrase "non-negative dose support" is split across newlines in the rendered guide).

Length housekeeping: trimmed counter-example #4 and #5 prose + the §4.7 trailing paragraph to point at §2's split discussion; autonomous (65374 chars) < full (66031), `test_full_is_largest` green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>