
Add comprehensive code review for diff-diff library #5

Merged
igerber merged 2 commits into main from claude/code-review-session-fnLIe on Jan 2, 2026

Conversation


@igerber igerber commented Jan 2, 2026

Conduct thorough review from two perspectives:

  1. Subject matter expert (econometrics/methodology) - verifying DiD
    methodology, robust SE, parallel trends testing, and TOST equivalence
  2. Expert engineer - identifying bugs, edge cases, and code quality issues

Key findings:

  • Critical bug in absorbed fixed effects (not demeaning treatment vars)
  • Division by zero risks in SE and df calculations
  • FE dummies using wrong data source when combined with absorb
  • Overall solid foundation but needs fixes before production use
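The demeaning bug can be illustrated with a short sketch (the helper and variable names are hypothetical, not the library's code): under absorbed fixed effects, every regressor, including the treatment variable, must receive the within transform; leaving the treatment column raw mixes demeaned and untransformed columns in the same regression.

```python
import numpy as np

def within_demean(X, groups):
    """Subtract each group's mean from every column of X (within transform)."""
    Xd = X.astype(float).copy()
    for g in np.unique(groups):
        mask = groups == g
        Xd[mask] -= Xd[mask].mean(axis=0)
    return Xd

# The bug pattern: demeaning the outcome and controls but leaving the
# treatment column untouched. The fix is to demean ALL columns together.
rng = np.random.default_rng(0)
groups = np.repeat(np.arange(3), 4)           # 3 absorbed groups, 4 obs each
treat = (groups == 2).astype(float)           # treatment var must be demeaned too
X = np.column_stack([treat, rng.normal(size=12)])
X_demeaned = within_demean(X, groups)         # correct: every column demeaned
```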

claude added 2 commits January 2, 2026 13:33
Conduct thorough review from two perspectives:
1. Subject matter expert (econometrics/methodology) - verifying DiD
   methodology, robust SE, parallel trends testing, and TOST equivalence
2. Expert engineer - identifying bugs, edge cases, and code quality issues

Key findings:
- Critical bug in absorbed fixed effects (not demeaning treatment vars)
- Division by zero risks in SE and df calculations
- FE dummies using wrong data source when combined with absorb
- Overall solid foundation but needs fixes before production use
Fixes applied:
- Add matrix rank check to detect multicollinearity in design matrix
- Fix fixed effects dummies to use working_data consistently
- Add binary variable validation before any transformations
- Fix division by zero in parallel trends SE calculation
- Fix division by zero in equivalence test df calculation
- Use local RNG (np.random.default_rng) instead of global seed
- Make Wasserstein threshold configurable (wasserstein_threshold param)
- Add proper error handling for zero variance and insufficient data
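The guard patterns behind these fixes might look like the following sketch (illustrative names only, not the library's actual implementation):

```python
import numpy as np

def safe_se(ss_resid, dof):
    """Standard error with an explicit degrees-of-freedom guard
    instead of a silent division by zero."""
    if dof <= 0:
        raise ValueError(f"Non-positive degrees of freedom: {dof}")
    return np.sqrt(ss_resid / dof)

def check_rank(X):
    """Detect multicollinearity before solving the normal equations."""
    rank = np.linalg.matrix_rank(X)
    if rank < X.shape[1]:
        raise ValueError(
            f"Design matrix is rank-deficient: rank {rank} < {X.shape[1]} columns"
        )

# Local RNG: no global np.random.seed side effects leaking across calls.
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))
check_rank(X)   # a random Gaussian matrix is full rank
```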

Tests added:
- Multicollinearity detection test
- Custom Wasserstein threshold test
- Insufficient data edge case tests
- Single pre-period edge case test
- TwoWayFixedEffects comprehensive tests
- Cluster-robust standard errors test

All 42 tests passing.
@igerber igerber merged commit c2aa92a into main Jan 2, 2026
@igerber igerber deleted the claude/code-review-session-fnLIe branch January 2, 2026 14:51
igerber pushed a commit that referenced this pull request Jan 4, 2026
Revised review reflects:
- #1, #4 verified as non-issues (correct by design)
- #3, #5, #6, #8, #13 addressed in commit e40d6b4
- Updated recommendation to approve and merge
- Remaining items are low-priority style suggestions for future PRs
igerber added a commit that referenced this pull request Apr 18, 2026
TwoStageDiD and ImputationDiD each run two iterative alternating-projection
solvers (_iterative_fe, _iterative_demean) whose convergence loop exited
silently on max_iter exhaustion, returning the current iterate as if
converged. This matches the silent-failure pattern audited under axis B of
the silent-failures initiative (findings #2-#5).

Adds a shared warn_if_not_converged helper in diff_diff.utils and calls it
from all four alternating-projection loops on non-convergence. Pattern
mirrors the existing logistic + Poisson IRLS convergence warnings in
linalg.py (lines 1329-1376). Warning-only: no new public parameter, no
behavior change on inputs that already converge.
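A minimal sketch of such a helper, assuming the behavior described here (the real `warn_if_not_converged` in `diff_diff.utils` may have a different signature):

```python
import warnings

def warn_if_not_converged(converged, n_iter, max_iter, solver_name):
    """Emit a warning instead of failing silently on max_iter exhaustion.

    Illustrative sketch of the pattern described above.
    """
    if not converged:
        warnings.warn(
            f"{solver_name} did not converge after {n_iter}/{max_iter} "
            "iterations; results are the last iterate and may be inaccurate.",
            RuntimeWarning,
            stacklevel=2,
        )

# Typical call site at the bottom of an alternating-projection loop:
converged = False
for it in range(5):            # stand-in for range(max_iter)
    pass                       # pretend the tolerance was never met
warn_if_not_converged(converged, 5, 5, "_iterative_demean")
```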

Updates REGISTRY.md entries for ImputationDiD and TwoStageDiD with Note
labels describing the new signal.

Axis-B regression-lint baseline: 10 silent range(max_iter) loops -> 6
remaining (Frank-Wolfe and TROP addressed in follow-up PRs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 19, 2026
Addresses the second-round CI review findings:

- P1 false-pass (remaining): removed five phase-local try/except blocks
  that swallowed sub-step exceptions (HonestDiD M-grids in brand-awareness
  and BRFSS, dCDH HonestDiD and heterogeneity refit, dose-response
  dataframe extraction). Exceptions now escape, the phase is marked
  ok=false, and run_scenario's atexit handler exits nonzero. The fix
  caught a real API-usage bug on its first rerun: dose_response extract
  phase tried to pull event_study level on a result fit with
  aggregate="dose"; the event_study fit lives in a dedicated phase, so
  that level is removed from the extraction loop.
- P2 scenario-spec drift: BRFSS scenario text now says pweight TSL
  stage-2 (matching the aggregate_survey-returned design), not "Full
  replicate-weight path"; dCDH reversible scenario text now says
  heterogeneity="group" (matching the script), not "cohort".
- P3 path leakage: tracemalloc output now scrubs $HOME, repo root, and
  site-packages before writing the committed txt.
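The scrubbing step can be sketched as an ordered prefix replacement (function name and placeholder labels are illustrative, not the script's actual code):

```python
import os
import site

def scrub_paths(text, repo_root):
    """Replace machine-specific path prefixes before committing profiler output.

    Order matters: scrub the most specific prefix (repo root, which usually
    lives under $HOME) before the more general ones.
    """
    home = os.path.expanduser("~")
    for prefix, label in [(repo_root, "<repo>"), (home, "<home>")]:
        text = text.replace(prefix, label)
    for sp in site.getsitepackages():
        text = text.replace(sp, "<site-packages>")
    return text

line = "/home/ci/project/diff_diff/linalg.py:1329"
print(scrub_paths(line, "/home/ci/project"))  # -> <repo>/diff_diff/linalg.py:1329
```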

Drift-prevention layer:

- gen_findings_tables.py reads every JSON baseline and rewrites the
  numerical tables in performance-plan.md between
  <!-- TABLE:start <id> --> / <!-- TABLE:end <id> --> markers. Tables
  now re-derive from data on every rerun, eliminating the hand-edit
  drift the prior review flagged. Narrative prose stays hand-written
  by design, forcing a human re-read of findings when numbers shift.
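A marker-based rewrite like the one described can be sketched as follows (the marker syntax is taken from the text above; the real `gen_findings_tables.py` may differ):

```python
import re

def replace_table(doc, table_id, new_table):
    """Rewrite only the block between TABLE markers, leaving prose untouched."""
    pattern = re.compile(
        rf"(<!-- TABLE:start {re.escape(table_id)} -->\n)"
        rf".*?"
        rf"(\n<!-- TABLE:end {re.escape(table_id)} -->)",
        re.DOTALL,
    )
    # A callable replacement avoids backslash-escaping issues in new_table.
    return pattern.sub(lambda m: m.group(1) + new_table + m.group(2), doc)

doc = (
    "Narrative stays hand-written.\n"
    "<!-- TABLE:start timings -->\n"
    "| old | data |\n"
    "<!-- TABLE:end timings -->\n"
)
updated = replace_table(doc, "timings", "| fresh | data |")
```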

Findings refresh (the numbers moved slightly; three narrative claims
needed updating):

- "Rust marginally slower than Python on JK1 at large scale" -> removed;
  fresh data has Rust and Python within noise on brand awareness at
  large (JK1 phase 0.577s Py / 0.562s Rust, totals 1.03 / 1.04).
- "ImputationDiD consistently dominant phase at all scales" -> narrowed
  to "dominant under Python; tied with SunAbraham under Rust at large".
- "Nine-figures of MB" in memory finding #3 was a phrasing error
  (literally 100+ TB); corrected to "mid-100s of MB".

Priority of optimization opportunities refreshed against new data:

- #1 aggregate_survey precompute stratum scaffolding: High (unchanged,
  now strongly supported - 24.75s Python / 25.41s Rust at 1M rows, 100%
  of chain runtime, growth only +31 MB).
- #2 Staggered CS working-memory audit: Low with explicit bump-trigger
  (Rust large crosses 512 MB Lambda line).
- #5 Rust-port JK1 replicate fit loop: demoted from Medium to Low -
  the "Rust regression to fix" leg of the rationale is gone because
  Rust is no longer slower.

Net: one clear priority (aggregate_survey fix), four optional follow-ups.
Still measurement only. No changes under diff_diff/ or rust/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 19, 2026
CI re-review P1: `bench_brand_awareness_survey.py` declared the
analytical TSL path with `SurveyDesign(weights, strata, psu, fpc, nest)`
only in phase 2; phases 4 (multi-outcome), 6 (placebo), and 7 (event
study + HonestDiD) built their own SurveyDesigns without `fpc`. That
means a material share of the committed brand-awareness baselines
timed a different variance path than the scenario doc declares.

Fix:
- One analytical `sd_tsl` SurveyDesign (strata + PSU + FPC + nest=True)
  is now constructed once at the top of `make_phases` and reused across
  phases 2, 4, 6, and 7. Phase 3 (replicate weights, JK1) is a
  different variance surface and correctly keeps its own design.
- Regenerated baselines for both backends.
- Regenerated findings tables via gen_findings_tables.py.

Narrative refreshed against the new tables:

- Brand-aware medium: on Python JK1 now leads by ~2.2x (was 1.9x in
  the previous rerun); on Rust the multi-outcome loop and JK1 come in
  essentially tied. Medium is also where Python is slowest relative
  to Rust (~1.6x) - the full analytical TSL path with FPC exposes
  vectorization differences at that shape. Totals re-converge at
  large scale.
- Reversible dCDH: ~48-52% split under both backends (previously the
  Python heterogeneity refit edged out the main fit slightly).
- Scaling finding #5 retuned: Rust-only uplift is still the SDiD
  story; brand-aware medium now surfaces as a secondary, modest
  ~1.6x case rather than "within noise."

Still measurement only. No changes under diff_diff/ or rust/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 19, 2026
…ht variance"

CI re-review P2 + P3, both docs/label only:

- docs/performance-plan.md had two remaining specific-magnitude
  claims about brand-awareness medium ("~1.6x under Python",
  "Python and Rust separate the most at medium", "~1.6x at worst on
  brand-awareness medium"). Those were true on one earlier rerun but
  the current committed baselines show medium at 0.56 / 0.55
  (essentially tied) and the widest non-SDiD gap is now ~1.1x at
  brand-large. Reworded per-scenario paragraph and scaling finding #5
  to describe the stable aggregate pattern and defer exact ratios to
  the scale-sweep table. Same treatment as the earlier staggered/dCDH
  pass: narrative stops claiming magnitudes that can shift on rerun;
  the generator-owned table carries the specifics.
- bench_brand_awareness_survey.py module docstring labeled JK1 as
  "replicate-weight bootstrap". Per REGISTRY.md, JK1 is replicate-
  weight variance (jackknife-style), not bootstrap inference - they
  are distinct methodology surfaces. Renamed to "replicate-weight
  variance (JK1 delete-one-PSU)" with an inline note pointing to the
  registry.

Docs + docstring only. No script behaviour change; no baseline
regeneration needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
…light checks as standard-workflow predictions, not estimator gates

Reviewer correctly noted that calling
{has_never_treated, treatment_varies_within_unit==False,
is_balanced, no duplicate_unit_time_rows alert, dose_min > 0}
the "screening checks" / "necessary" gates of `ContinuousDiD`
overstates the contract. `ContinuousDiD.fit()` keys off the
separate `first_treat` column (which `profile_panel` does not see),
defines never-treated controls as `first_treat == 0` rows,
force-zeroes nonzero `dose` on those rows with a `UserWarning`,
and rejects negative dose only among treated units `first_treat > 0`
(see `continuous_did.py:276-327` and `:348-360`).

Two of the five checks (`has_never_treated`, `dose_min > 0`) are
first_treat-dependent: agents who relabel positive- or negative-dose
units as `first_treat == 0` trigger the force-zero coercion path
with a `UserWarning` and may still fit panels that fail those
preflights, with the methodology shifting. The other three
(`treatment_varies_within_unit`, `is_balanced`, duplicate-row
absence) are real fit-time gates that hold regardless of how
`first_treat` is constructed.

Reframed every wording site to call these "standard-workflow
preflight checks" — predictive when the agent derives `first_treat`
from the same dose column passed to `profile_panel`, but not the
estimator's literal contract:

- `diff_diff/profile.py` `TreatmentDoseShape` docstring (rewrote
  the multi-paragraph block; explicit standard-workflow definition
  + per-check first_treat dependency map + force-zero coercion
  caveat).
- `diff_diff/profile.py` `_compute_treatment_dose` helper docstring
  (already brief; stays consistent).
- `diff_diff/guides/llms-autonomous.txt` §2 field reference (long
  rewrite covering the standard-workflow framing + override paths).
- `diff_diff/guides/llms-autonomous.txt` §4.7 opening bullet +
  trailing paragraph (both updated; opening bullet now spells out
  which of the five checks are first_treat-dependent vs. hard
  fit-time stops; trailing paragraph promotes the standard-
  workflow framing).
- `diff_diff/guides/llms-autonomous.txt` §5.2 reasoning chain step
  2 (rewrote the gate-checking paragraph; counter-example #4
  expanded to enumerate (a) supply matching first_treat and accept
  rejection, (b) deliberate relabel + coercion, (c) different
  estimator; counter-example #5 distinguishes negative-dose
  treated-unit rejection from never-treated coercion).
- `CHANGELOG.md` Wave 2 entry (matches the new framing).
- `ROADMAP.md` AI-Agent Track building block (matches).

Test coverage:
- Renamed assertion messages in
  `test_treatment_dose_descriptive_fields_supplement_existing_gates`
  and `test_treatment_dose_min_flags_negative_dose_continuous_panels`
  to remove "authoritative gate" phrasing; reframed as "standard-
  workflow preflight" assertions consistent with the corrected docs.
- Added `test_negative_dose_on_never_treated_coerces_not_rejects`
  in `tests/test_continuous_did.py::TestEdgeCases` covering the
  reviewer's specific request: never-treated rows with NEGATIVE
  nonzero dose must coerce (with `UserWarning`) rather than raise
  the treated-unit negative-dose error. Sister to the existing
  `test_nonzero_dose_on_never_treated_warns` which covers the
  positive-dose case.
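The coercion contract that test locks in can be sketched with a toy function (hypothetical; the real logic in `continuous_did.py` operates on a full panel and is not a sanctioned routing path):

```python
import warnings
import numpy as np

def coerce_never_treated_dose(dose, first_treat):
    """Toy version of the coercion contract described above: nonzero dose
    (positive OR negative) on never-treated rows is force-zeroed with a
    UserWarning rather than raising the treated-unit negative-dose error."""
    dose = np.asarray(dose, dtype=float).copy()
    never_treated = np.asarray(first_treat) == 0
    bad = never_treated & (dose != 0)
    if bad.any():
        warnings.warn(
            f"Force-zeroing {bad.sum()} nonzero dose value(s) on never-treated rows.",
            UserWarning,
        )
        dose[bad] = 0.0
    return dose

# Negative dose on a never-treated row coerces (warns) rather than raising:
with warnings.catch_warnings(record=True) as rec:
    warnings.simplefilter("always")
    out = coerce_never_treated_dose([-0.5, 2.0], [0, 3])
```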

Rebased onto origin/main during this round (no conflicts beyond
prior CHANGELOG resolutions; main advanced 19 commits).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
…s-fallback wording; correct duplicate-row "fit-time stop" claim

P1 (relabel-to-manufacture-controls misframing):
Round 11 introduced wording across the guide, profile docstring,
CHANGELOG, ROADMAP, and test docstrings that presented intentional
`first_treat == 0` relabeling of nonzero-dose units as an
"option" / "fallback" for fitting `ContinuousDiD` when the
profile-side preflights (`has_never_treated`, `dose_min > 0`)
fail. REGISTRY does not document this as a routing option, and the
estimator still requires actual `P(D=0) > 0` because Remark 3.1
lowest-dose-as-control is not yet implemented. The force-zero
coercion at `continuous_did.py:311-327` is implementation behavior
for INCONSISTENT inputs (e.g., user accidentally passes nonzero
dose on a never-treated row), not a methodology fallback.

Reworded every site to remove the relabeling-as-option framing and
replace it with the registry-documented fixes when (1) or (5)
fails: re-encode the treatment column to a non-negative scale that
contains a true never-treated group, or route to a different
estimator (`HeterogeneousAdoptionDiD` for graded-adoption panels;
linear DiD with the treatment as a continuous covariate). Every
remaining "manufacture controls" mention in the guide, profile,
and tests is now an explicit anti-recommendation ("do not relabel
... to manufacture controls"). Updated:

- `diff_diff/profile.py` `TreatmentDoseShape` docstring (item (1):
  "not an opportunity to relabel ..."; item (5): coercion is
  "implementation behavior for inconsistent inputs, not a
  methodological fallback").
- `diff_diff/guides/llms-autonomous.txt` §2 field reference (the
  When-(1)-or-(5)-fails paragraph names re-encode + alternative
  estimator only; explicit anti-relabel warning).
- `diff_diff/guides/llms-autonomous.txt` §4.7 opening bullet +
  trailing paragraph (consolidated; opening bullet drops the
  relabel-as-fallback framing; trailing paragraph trimmed to a
  pointer to §2).
- `diff_diff/guides/llms-autonomous.txt` §5.2 step 2 + counter-
  example #4 + counter-example #5 (relabel-as-option language
  removed; explicit "do not relabel" callouts; counter-example #4
  options trimmed to (a) re-encode and (b) different estimator).
- `CHANGELOG.md` (relabel-as-option clause removed; replaced with
  re-encode / different-estimator framing).
- `ROADMAP.md` (same).
- `tests/test_profile_panel.py` two test docstrings (relabel-as-
  workflow language removed).

P2 (duplicate-row "hard fit-time stop" misclaim):
Round 11 wording said "duplicate-row failures are hard fit-time
stops" — incorrect. `_precompute_structures` at
`continuous_did.py:818-823` silently overwrites with last-row-wins,
no exception raised. Reworded as "hard preflight veto: the agent
must deduplicate before fit because `ContinuousDiD` otherwise uses
last-row-wins, no fit-time exception" in profile.py docstring,
guide §4.7 opening bullet, and §5.2 step 2 (now defers to §2 for
the breakdown). The previously-correct §2 description of the
silent-coerce path is preserved.

Length housekeeping:
The round-11-to-round-12 expansion pushed `llms-autonomous.txt`
above `llms-full.txt`, breaking `test_full_is_largest`. Trimmed
~2.7KB by consolidating the §4.7 trailing paragraph + §5.2 step 2
trailing block to point at §2's full breakdown rather than
duplicating the per-check semantics. autonomous: 65364 chars,
full: 66058 chars.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
… first_treat from dose" framing; add PanelProfile backward-compat defaults; fix test_continuous_did docstring

P1 (canonical ContinuousDiD setup vs. derive-from-dose framing):
Round 12 introduced a "standard workflow" description across the
guide, profile docstring, CHANGELOG, ROADMAP, and test docstrings
that said agents derive `first_treat` from the same dose column
passed to `profile_panel`. Reviewer correctly noted this conflicts
with the actual ContinuousDiD contract (`continuous_did.py:222-228`,
`prep_dgp.py:970-993`, `docs/methodology/continuous-did.md:65-73`):
the canonical setup uses a **time-invariant per-unit dose** `D_i`
and a **separate `first_treat` column** the caller supplies — the
dose column has no within-unit time variation in this setup, so it
cannot encode timing. An agent following the rejected framing would
either build a `0,0,d,d` path (which `fit()` rejects) or keep a
valid constant-dose panel (in which case the dose column carries no
timing information).

Reworded every site to drop the derive-from-dose framing and
replace with the canonical setup. The five facts on the dose column
remain predictive of `fit()` outcomes BECAUSE the canonical
convention ties `first_treat == 0` to `D_i == 0` and treated units
carry their constant dose across all periods — so `has_never_treated`
proxies `P(D=0) > 0` and `dose_min > 0` predicts the strictly-
positive-treated-dose requirement, without any "derivation" of
`first_treat` from the dose column. Updated:

- `diff_diff/profile.py` `TreatmentDoseShape` docstring (rewrote
  the multi-paragraph block to use the canonical-setup framing
  and added an explicit "agent must validate `first_treat`
  independently" note).
- `diff_diff/guides/llms-autonomous.txt` §2 field reference.
- `diff_diff/guides/llms-autonomous.txt` §4.7 opening bullet.
- `diff_diff/guides/llms-autonomous.txt` §5.2 reasoning chain
  step 2 + counter-examples #4 and #5 (now describe the
  canonical setup rather than a derive-from-dose workflow).
- `CHANGELOG.md` Wave 2 entry.
- `ROADMAP.md` AI-Agent Track building block.
- `tests/test_profile_panel.py` `test_treatment_dose_min_flags
  _negative_dose_continuous_panels` docstring/comments.

P2 (PanelProfile direct-construction backward compat):
Wave 2 added `outcome_shape` and `treatment_dose` to PanelProfile
without defaults, breaking direct `PanelProfile(...)` calls that
predate Wave 2. Made both fields default to `None` (moved them to
the end of the field list; both are `Optional[...]`). Added
`test_panel_profile_direct_construction_without_wave2_fields`
asserting that direct construction without the new fields succeeds
and yields `None` defaults that serialize correctly through
`to_dict()`.

P3 (test_continuous_did.py docstring overstating sanction):
The new `test_negative_dose_on_never_treated_coerces_not_rejects`
docstring said the contract "lets agents legally relabel
negative-dose units as `first_treat == 0` to coerce them away."
Reworded as observed implementation behavior for inconsistent
inputs, NOT a sanctioned routing option — the test locks in the
coercion contract while the autonomous guide §5.2 explicitly tells
agents not to use this path methodologically.

Length invariant maintained: autonomous (65748 chars) < full
(66031 chars); `test_full_is_largest` still passes (compares
character count, not byte count, so on-disk size with UTF-8
multi-byte characters differs from the assertion target).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 25, 2026
…s "negative dose" branches; HAD only valid on the former

Reviewer correctly noted that the round-15/16 wording listed
`HeterogeneousAdoptionDiD` as a routing alternative whenever
`ContinuousDiD` fails on the dose-related preflights, but HAD
itself requires non-negative dose support and raises on negative
post-period dose at `had.py:1450-1459` (paper Section 2). On a
panel with `dose_min < 0`, routing to HAD silently steers an agent
into the same fit-time error. Verified the rejection at
`had.py:1450-1459`.

Reworded every site to split the two failure modes:

- Branch (a): `has_never_treated == False` (no zero-dose controls
  but all observed doses non-negative). `ContinuousDiD` does not
  apply (Remark 3.1 not implemented). HAD IS a routing alternative
  on this branch (HAD's contract requires non-negative dose,
  satisfied here); linear DiD with a continuous covariate is
  another.
- Branch (e): `dose_min < 0` (negative treated doses).
  `ContinuousDiD` does not apply AND HAD is **not** a fallback
  either — HAD raises on negative post-period dose
  (`had.py:1450-1459`). Linear DiD with a signed continuous
  covariate is the applicable alternative on this branch.

Updated wording across:
- `diff_diff/profile.py` `TreatmentDoseShape` docstring (refactored
  from item-by-item duplication into a numbered list with a single
  "Routing alternatives when (1) or (5) fails" section that splits
  the two branches; trimmed redundancy).
- `diff_diff/guides/llms-autonomous.txt` §2 field reference (split
  the When-(1)-or-(5)-fails paragraph into the two branches).
- `diff_diff/guides/llms-autonomous.txt` §4.7 trailing paragraph
  (consolidated to a pointer at §2's split discussion).
- `diff_diff/guides/llms-autonomous.txt` §5.2 reasoning chain
  counter-example #4 (no never-treated branch: HAD applies) and
  counter-example #5 (negative-dose branch: HAD does NOT apply,
  cite `had.py:1450-1459`).
- `CHANGELOG.md` Wave 2 entry.
- `ROADMAP.md` AI-Agent Track building block.
- `tests/test_profile_panel.py` two test docstrings/comments.

Added `test_autonomous_negative_dose_path_does_not_route_to_had`
in `tests/test_guides.py` asserting that §5.2 explicitly cites
`had.py:1450-1459` on the negative-dose branch (used a single-
line fingerprint since the prose phrase "non-negative dose
support" is split across newlines in the rendered guide).

Length housekeeping: trimmed counter-example #4 and #5 prose +
§4.7 trailing paragraph to point at §2's split discussion;
autonomous (65374 chars) < full (66031), `test_full_is_largest`
green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
