
Add EPV diagnostics for propensity score logit #251

Merged
igerber merged 12 commits into main from logit-assessment on Apr 2, 2026

Conversation

@igerber
Owner

@igerber igerber commented Apr 1, 2026

Summary

  • Add an Events Per Variable (EPV) check in solve_logit() that warns when the number of minority-class observations per parameter falls below a threshold (default 10, per Peduzzi et al. 1996). Affects all estimators that use logit for propensity scores: CallawaySantAnna, TripleDifference, StaggeredTripleDiff.
  • A new pscore_fallback parameter defaults to "error" instead of silently dropping all covariates when the logit fails. Set pscore_fallback="unconditional" for the legacy behavior.
  • diagnose_propensity() method on CallawaySantAnna enables pre-estimation EPV assessment across all cohorts without running the full estimation.
  • Per-cohort EPV diagnostics stored in results.epv_diagnostics with epv_summary() method and diagnostic block in summary() output.
  • Fix NaN cache poisoning: zero-fill dropped rank-deficient logit coefficients before caching to prevent NaN propagation on cache reuse.
  • Strict-mode semantics preserved: rank_deficient_action="error" always re-raises regardless of pscore_fallback setting.
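A rough sketch of the EPV check described above (the function shape and names here follow this summary, not the actual solve_logit() internals):

```python
import warnings

import numpy as np


def check_epv(y, n_predictors, epv_threshold=10):
    """Events-per-variable diagnostic for a logistic fit (Peduzzi et al. 1996).

    Illustrative sketch, not the PR's implementation: EPV is the number of
    minority-class observations divided by the number of predictor variables.
    """
    y = np.asarray(y)
    n_events = min(int(y.sum()), int(len(y) - y.sum()))  # minority-class count
    epv = n_events / max(n_predictors, 1)
    if epv < epv_threshold:
        warnings.warn(
            f"Low events per variable: EPV={epv:.1f} < {epv_threshold}; "
            "propensity score estimates may be unstable."
        )
    return epv
```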

Methodology references (required if estimator / math changes)

  • Method name(s): Events Per Variable (EPV) diagnostics for logistic regression
  • Paper / source link(s): Peduzzi, P., Concato, J., Kemper, E., Holford, T.R., & Feinstein, A.R. (1996). A simulation study of the number of events per variable in logistic regression analysis. Journal of Clinical Epidemiology, 49(12), 1373-1379.
  • Any intentional deviations from the source (and why): EPV threshold is configurable (default 10) rather than hard-coded, since the optimal threshold varies by context

Validation

  • Tests added/updated: tests/test_linalg.py (7 EPV unit tests), tests/test_staggered.py (12 integration tests including cache NaN regression, strict-mode interaction, diagnose_propensity), tests/test_methodology_triple_diff.py (fallback test updated), tests/test_survey_staggered_ddd.py (fallback test updated)
  • 305 tests pass across affected test files, 0 failures

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

@github-actions

github-actions bot commented Apr 1, 2026

Overall Assessment

⚠️ Needs changes

The highest-severity unmitigated findings are P1s in the new EPV/propensity-score behavior: one strict-mode contract break on the repeated-cross-section Callaway-Sant’Anna path, one weighted-EPV calculation bug, one diagnose_propensity() path mismatch, and one missing Methodology Registry update for StaggeredTripleDifference.

Executive Summary

  • CallawaySantAnna(panel=False) does not preserve the new strict-mode semantics: the repeated-cross-section IPW/DR paths still fall back under pscore_fallback="unconditional" even when rank_deficient_action="error" should force a re-raise.
  • The new EPV check in solve_logit() counts all rows, not the effective positive-weight sample, so survey/subpopulation fits can overstate EPV and suppress the intended Peduzzi-based warning/error on the actual fitted sample. Peduzzi’s rule is about events per predictor in the fitted logistic sample.
  • diagnose_propensity() always uses the panel preprocessing path and cohort-level counts, even though fit() switches to a different repeated-cross-section sample construction for panel=False; the new diagnostic is therefore not aligned with one supported estimator mode.
  • StaggeredTripleDifference now exposes EPV diagnostics and a new pscore_fallback default in code, but its section in the Methodology Registry was not updated, so this estimator has an undocumented methodology/default-behavior change.
  • I did not flag the configurable EPV threshold itself or the NaN zero-fill cache fix as defects; those are reasonable implementation choices, and I found no new partial-NaN inference anti-patterns or security issues in the diff.

Methodology

  • Severity: P1. Impact: the repeated-cross-section Callaway-Sant’Anna paths _ipw_estimation_rc() and _doubly_robust_rc() only re-raise when pscore_fallback == "error", not when rank_deficient_action == "error", so panel=False can still proceed with unconditional propensity after a logit failure even though the new API/docs say strict mode must always raise. See diff_diff/staggered.py, diff_diff/staggered.py, diff_diff/staggered.py, docs/methodology/REGISTRY.md. Concrete fix: in both RCS catch blocks, mirror the panel logic and re-raise when self.rank_deficient_action == "error"; add panel=False regression tests for both IPW and DR with pscore_fallback="unconditional".

  • Severity: P1. Impact: solve_logit() correctly defines the effective weighted sample via weights > 0 for class/rank identification, but the new EPV calculation still uses the full y vector. In survey/domain fits with many zero-weight rows, EPV can be materially overstated and the low-EPV warning/error can be skipped on the actual fitted sample. This affects every weighted propensity-score caller touched here. See diff_diff/linalg.py, diff_diff/linalg.py, diff_diff/staggered.py, diff_diff/triple_diff.py, diff_diff/staggered_triple_diff.py. Concrete fix: when weights are present, compute n_events from y[weights > 0] after any effective-sample rank reduction; add a regression test with many zero-weight padded rows showing EPV tracks the positive-weight sample.

  • Severity: P1. Impact: the new diagnose_propensity() API is not aligned with the estimator’s repeated-cross-section path. It always calls the panel precompute routine, while fit() uses _precompute_structures_rc() for panel=False; it also returns one row per cohort even though fit-time EPV is stored per (g, t) cell. That makes the new pre-estimation diagnostic unreliable on a supported mode and weaker than the actual fit-time assumption check. See diff_diff/staggered.py, diff_diff/staggered.py, diff_diff/staggered.py, diff_diff/staggered.py, diff_diff/staggered.py. Concrete fix: branch on self.panel; for panel=False, compute diagnostics from the RCS sample construction and actual (g,t) treated/control pools, or explicitly reject panel=False until that logic exists.

  • Severity: P1. Impact: StaggeredTripleDifference now changes propensity-score methodology/default behavior in code by adding EPV diagnostics and changing pscore_fallback to default to "error", but the StaggeredTripleDifference section of the Methodology Registry still documents only the older propensity-score assumptions. That is an undocumented estimator change under the project’s review rules. See diff_diff/staggered_triple_diff.py, diff_diff/staggered_triple_diff.py, diff_diff/staggered_triple_diff.py, docs/methodology/REGISTRY.md. Concrete fix: add registry bullets for EPV diagnostics, fallback behavior, and the default-change note in the StaggeredTripleDifference section, matching the detail already added for Callaway-Sant’Anna and TripleDifference.
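The weighted-EPV fix proposed in the second bullet might look like this sketch (names are illustrative; the repo's solve_logit() internals are not reproduced):

```python
import numpy as np


def weighted_epv(y, weights, n_predictors):
    """EPV computed on the effective (positive-weight) sample.

    Illustrative sketch: rows with zero weight do not contribute to the fit,
    so they are excluded before counting minority-class events.
    """
    y = np.asarray(y)
    w = np.asarray(weights)
    y_eff = y[w > 0]                       # effective fitted sample
    n_pos = int(y_eff.sum())
    n_events = min(n_pos, len(y_eff) - n_pos)
    return n_events / max(n_predictors, 1)
```

With many zero-weight padded rows, the full-sample count would overstate EPV; the positive-weight version tracks the sample the logit actually fits.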

Code Quality

Performance

  • No material performance regressions stood out in the changed code.

Maintainability

  • No additional maintainability findings beyond the methodology/diagnostic propagation issues above.

Tech Debt

  • No separate finding, but none of the P1 items above are already tracked in TODO.md, so they remain unmitigated for assessment purposes.

Security

  • No security findings in the diff.

Documentation/Tests

  • Severity: P2. Impact: the added tests cover solve_logit(), panel Callaway-Sant’Anna, and TripleDifference, but they do not cover the broken panel=False Callaway-Sant’Anna paths or the new EPV/strict-mode behavior for StaggeredTripleDifference. The only changed staggered DDD test exercises weighted unconditional fallback mean, not EPV warnings/results or strict-mode re-raise. See tests/test_staggered.py, tests/test_staggered.py, tests/test_survey_staggered_ddd.py. Concrete fix: add targeted tests for panel=False fallback strict mode, weighted zero-weight EPV behavior, StaggeredTripleDifference EPV warning/result plumbing, and its rank_deficient_action="error" interaction with pscore_fallback="unconditional".

Path to Approval

  1. Fix the repeated-cross-section Callaway-Sant’Anna fallback handlers so rank_deficient_action="error" always re-raises in both IPW and DR paths, and add panel=False regression tests for that contract.
  2. Recompute EPV on the effective positive-weight sample inside solve_logit(), and add a zero-weight-subpopulation regression test that would currently miss the warning/error.
  3. Make diagnose_propensity() respect panel=False and the actual RCS sample construction, or explicitly disable it for that mode until implemented.
  4. Update the StaggeredTripleDifference Methodology Registry section to document EPV diagnostics and the pscore_fallback default change.
  5. Add coverage for cached-diagnostic propagation so results.epv_diagnostics and epv_summary(show_all=True) stay complete on cache hits.

@igerber
Owner Author

igerber commented Apr 1, 2026

/ai-review

@github-actions

github-actions bot commented Apr 1, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 8d462f101a6fc2935b7c4c2c8f90d7012b4b5c63


Overall Assessment
⚠️ Needs changes

Executive Summary

  • The prior P1s from the last review look addressed: repeated-cross-section strict mode now re-raises, solve_logit() EPV uses the positive-weight sample, diagnose_propensity() no longer claims to support panel=False, and the StaggeredTripleDifference Registry section now documents the default/fallback change.
  • The remaining blocker is in StaggeredTripleDifference: EPV diagnostics are documented per DiD/logit, but the implementation stores only one (g, t) entry and overwrites it as each comparison cohort is processed.
  • That can silently hide a low-EPV comparison if a later g_c for the same ATT(g,t) has a higher EPV.
  • Cache-hit paths in SDDD still do not carry diagnostic metadata, so results.epv_summary(show_all=True) can underreport reused propensity fits.
  • I did not find new safe_inference() / NaN-propagation regressions or security issues in the modified paths.

Methodology
Affected methods checked: propensity-score logistic EPV diagnostics in solve_logit(), CallawaySantAnna, TripleDifference, and StaggeredTripleDifference. The default epv_threshold=10 is consistent with the cited Peduzzi simulation, which reported no major problems at EPV ≥ 10 and materially worse bias/coverage behavior below 10.

Code Quality

Performance

  • No material performance findings in the changed code.

Maintainability

  • No additional maintainability findings beyond the SDDD EPV data-model mismatch above.

Tech Debt

  • No separate finding, but the remaining EPV issues are not tracked under TODO.md:L47-L70, so they remain unmitigated for approval.

Security

  • No security findings in the diff.

Documentation/Tests

  • Severity: P2. Impact: the new EPV tests cover solve_logit() and Callaway-Sant’Anna, but the only changed SDDD test still checks weighted fallback only in tests/test_survey_staggered_ddd.py:L833-L864; there is no regression coverage for the multi-g_c overwrite path or the cache-hit diagnostic loss in the modified SDDD implementation. Concrete fix: add one test where a (g, t) cell has at least two valid comparison cohorts with different EPV values and assert the lower EPV survives, plus one universal-base/cache-hit test asserting results.epv_summary(show_all=True) retains diagnostics on reused propensity fits.

Path to Approval

  1. Make SDDD EPV reporting faithfully represent every fitted propensity model contributing to ATT(g,t), or compute a true cell-level minimum across all contributing fits and document that deviation in the Registry with a **Note:**.
  2. Cache and replay EPV diagnostic metadata in StaggeredTripleDifference._compute_pscore().
  3. Add regression tests for multi-comparison overwrite and cache-hit diagnostic persistence.

@igerber
Owner Author

igerber commented Apr 1, 2026

/ai-review

@github-actions

github-actions bot commented Apr 1, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 65afd79ee65dffc8c98c585cfee21eecc5be6afe


Overall Assessment
⚠️ Needs changes

Executive Summary

  • The epv_threshold=10 default itself is aligned with the Peduzzi rule-of-thumb the PR cites; the remaining blockers are in how the new fallback and diagnostic paths are wired into estimation.
  • The prior SDDD EPV overwrite/cache concern looks addressed in diff_diff/staggered_triple_diff.py:L1016-L1083 and diff_diff/staggered_triple_diff.py:L1355-L1361, and the new Registry note at docs/methodology/REGISTRY.md:L1298-L1305 now documents the cell-level worst-case summary.
  • CallawaySantAnna’s new pscore_fallback="unconditional" path still feeds full-covariate propensity-score nuisance corrections into the IF/SE code in several branches, so fallback inference is not actually unconditional.
  • CallawaySantAnna.diagnose_propensity() is not using the same not_yet_treated control definition as estimation, so it can overstate EPV for later (g,t) cells and miss low-EPV fits.
  • The new public params are not fully propagated to results objects: pscore_fallback is missing from the CS and SDDD result containers, and TripleDifferenceResults also omits epv_threshold.
  • Coverage is still missing for the repaired SDDD EPV merge/cache behavior and for diagnose_propensity() under control_group="not_yet_treated".

Methodology

  • Severity: P1. Impact: [Newly identified] CallawaySantAnna’s unconditional-fallback path is only unconditional for the point estimator. After the new fallback branch sets a constant pscore, the panel/RC IPW/DR code still builds propensity-score nuisance corrections from the full covariate design, so SEs, p-values, and CIs are computed as if the dropped propensity covariates had still been estimated. That conflicts with the new Registry/warning contract that fallback drops covariates for the cell. Location: diff_diff/staggered.py:L2127-L2145, diff_diff/staggered.py:L2176-L2207, diff_diff/staggered.py:L2406-L2425, diff_diff/staggered.py:L2453-L2546, diff_diff/staggered.py:L3206-L3331, diff_diff/staggered.py:L3467-L3669, docs/methodology/REGISTRY.md:L412-L415. Concrete fix: thread a ps_fallback_used flag through _ipw_estimation, _doubly_robust, _ipw_estimation_rc, and _doubly_robust_rc; when fallback is used, either skip PS correction entirely or recompute it from an intercept-only model, and document whichever choice you keep.
  • Severity: P1. Impact: diagnose_propensity() does not use the same control-group definition as fit() when control_group="not_yet_treated". The new helper hardcodes nyt_threshold = g - 1 at the cohort level, but actual ATT estimation uses max(t, base_period) + anticipation, so later/post-treatment cells can have far fewer valid controls than the diagnostic reports. That lets the new pre-estimation EPV check label a cohort ok even when some fitted cells are low-EPV or skipped. Location: diff_diff/staggered.py:L443-L457, diff_diff/staggered.py:L649-L656, diff_diff/staggered.py:L2742-L2745. Concrete fix: have diagnose_propensity() evaluate the same valid (g,t) cells as fit() and return per-cell EPV, or report the per-cohort minimum across those cells and add a **Note:** in REGISTRY.md that it is a conservative aggregation.
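To make the control-definition mismatch concrete, here is a minimal sketch of the threshold the review says estimation actually uses (names and the 0-means-never-treated encoding are assumptions taken from this thread, not verified against the repo):

```python
def nyt_controls(first_treat, t, base_period, anticipation=0):
    """Cohorts that remain valid not-yet-treated controls for a (g, t) cell.

    Per the review, estimation requires first_treat > max(t, base_period)
    + anticipation, while the buggy diagnostic only required first_treat > g - 1.
    Sketch only; 0 is assumed to encode never-treated units.
    """
    threshold = max(t, base_period) + anticipation
    return [ft for ft in first_treat if ft == 0 or ft > threshold]
```

For a post-treatment cell this threshold sits at or above t, so late periods can have far fewer valid controls than a cohort-level g - 1 cutoff suggests.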

Code Quality

  • No separate code-quality finding beyond the methodology issues above. I did not find a new safe_inference()/NaN-propagation regression in the touched code.

Performance

  • No material performance findings in the changed paths.

Maintainability

Tech Debt

  • No separate mitigating TODO.md entry covers the P1s above, so they remain unmitigated for approval.

Security

  • No security findings in the diff.

Documentation/Tests

  • Severity: P2. Impact: The re-review target in StaggeredTripleDifference still lacks direct regression coverage for the repaired EPV behavior: the only touched SDDD test now checks opt-in fallback, not the new worst-case merge/cache replay, and the new diagnose_propensity() tests never exercise control_group="not_yet_treated". Location: tests/test_survey_staggered_ddd.py:L836-L864, tests/test_staggered.py:L3716-L3764. Concrete fix: add one SDDD regression with at least two contributing g_c values and different EPVs asserting results.epv_summary(show_all=True) retains the minimum EPV across cache hits, plus one CallawaySantAnna(control_group="not_yet_treated") diagnose_propensity() test where late periods have fewer valid controls than the cohort-level baseline.

Path to Approval

  1. Make CallawaySantAnna’s unconditional-fallback inference actually unconditional: skip PS nuisance correction on fallback or replace it with an intercept-only correction in all affected panel/RC IPW/DR paths.
  2. Rework diagnose_propensity() so its control definition matches estimation for control_group="not_yet_treated"; if you keep a per-cohort summary, make it a documented conservative minimum over actual (g,t) cells.
  3. Finish parameter propagation by storing pscore_fallback on the CS/SDDD results objects and both epv_threshold and pscore_fallback on TripleDifferenceResults.
  4. Add targeted regression tests for the fallback-inference path, the not_yet_treated diagnostic path, and the SDDD worst-case/cache EPV summary path.

@igerber
Owner Author

igerber commented Apr 1, 2026

/ai-review

@github-actions

github-actions bot commented Apr 1, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: b2e02faa43c77c72f89d46d1402c8f727a3967d7


Overall Assessment
⚠️ Needs changes

Executive Summary

  • The prior P1s from the last review look addressed in the changed files: CS now skips the PS IF correction on unconditional fallback, diagnose_propensity() now refuses control_group="not_yet_treated", the new params are propagated into results objects, and the SDDD worst-case EPV aggregation is documented.
  • The EPV threshold itself is not the blocker here; the remaining issue is in how one fallback branch handles survey weights.
  • Severity P1. Impact: [Newly identified] CallawaySantAnna’s survey-weighted panel DR fallback still uses an unweighted treated share when it switches to unconditional propensity, so that fallback no longer matches the survey-weighted intercept-only logit the estimator otherwise uses.
  • Severity P2. Impact: the new diagnose_propensity() helper does not mirror fit()’s first_treat=np.inf normalization, so it can emit a bogus group=inf cohort in supported never-treated inputs.
  • Test coverage added here exercises weighted fallback only for SDDD; there is still no targeted regression for CS survey fallback or for diagnose_propensity() with first_treat=np.inf.

Methodology

Code Quality

  • Severity: P2. Impact: diagnose_propensity() does not apply the same np.inf -> 0 never-treated normalization that fit() applies before deriving treatment_groups, so a supported never-treated encoding can show up as a fake treated cohort in the new diagnostic output. That makes the new helper disagree with the estimator on inputs already covered by fit(). Location: diff_diff/staggered.py:L434, diff_diff/staggered.py:L1468, and the existing fit() regression at tests/test_staggered.py:L93. Concrete fix: normalize np.inf to 0 on a local copy inside diagnose_propensity() before computing treatment_groups / precomputed structures, then add a matching regression test.
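The suggested normalization is a small transform on a local copy; a sketch (assumes first_treat is numeric with 0 encoding never-treated, as the fix describes):

```python
import numpy as np


def normalize_first_treat(first_treat):
    """Map the never-treated encoding np.inf to 0 on a local copy,
    so cohort derivation never produces a bogus group=inf row. Sketch only."""
    ft = np.asarray(first_treat, dtype=float).copy()
    ft[np.isinf(ft)] = 0.0
    return ft
```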

Performance

  • No findings in the changed paths.

Maintainability

  • No separate maintainability finding beyond the fallback-logic inconsistency already noted above.

Tech Debt

  • No mitigating TODO.md entry covers the remaining P1, so it remains unmitigated for approval. Location: TODO.md:L47.

Security

  • No security findings in the diff.

Documentation/Tests

  • Severity: P2. Impact: The new tests do not exercise the blocker path above. This PR adds a weighted-fallback regression only for SDDD, while the CS fallback tests are still unweighted and diagnose_propensity() has no np.inf coverage, which is why both changed branches remained unguarded. Location: tests/test_survey_staggered_ddd.py:L833, tests/test_staggered.py:L3670, tests/test_staggered.py:L3716. Concrete fix: add 1) a panel+survey CallawaySantAnna(estimation_method="dr", pscore_fallback="unconditional") regression with unequal weights and forced logit failure, asserting the fallback matches the weighted intercept-only propensity path; and 2) a diagnose_propensity() regression with never-treated encoded as np.inf, asserting there is no group=inf row.

Path to Approval

  1. Fix the survey-weighted panel DR fallback in CallawaySantAnna so unconditional propensity uses the weighted treated share, not raw counts.
  2. Add a regression test that forces fallback in the survey-weighted panel DR path and asserts the expected weighted-fallback behavior.
  3. Optionally, clean up diagnose_propensity(np.inf) at the same time; it is not blocker severity, but it is a real new-helper correctness gap.
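The weighted treated share in step 1 is just a weighted mean over the positive-weight sample; a sketch (function name and masking are illustrative, not the repo's code):

```python
import numpy as np


def unconditional_pscore(d, weights=None):
    """Constant fallback propensity: treated share of the fitted sample.

    With survey weights, this is the weighted mean of the treatment indicator
    over positive-weight rows, which matches the MLE of a weighted
    intercept-only logit. Sketch based on the review's description.
    """
    d = np.asarray(d, dtype=float)
    if weights is None:
        return float(d.mean())
    w = np.asarray(weights, dtype=float)
    mask = w > 0
    return float(np.average(d[mask], weights=w[mask]))
```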

@igerber
Owner Author

igerber commented Apr 1, 2026

/ai-review

@github-actions

github-actions bot commented Apr 1, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: b1d78e90194ea52bcd6e065246f5e3cd63a49622


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The prior re-review blockers in CallawaySantAnna look addressed in code: the survey-weighted unconditional fallback now uses weighted treated share in both panel and repeated-cross-section paths, and diagnose_propensity() now normalizes first_treat=np.inf before building cohorts.
  • The remaining blocker is in TripleDifference: its new pscore_fallback="unconditional" branch still uses the raw subgroup share even when the propensity logit was being fit with survey weights, so survey-weighted IPW/DR fallback no longer matches the weighted intercept-only logit it is substituting for.
  • The EPV threshold/default-change behavior itself looks fine: the Registry now documents the pscore_fallback="error" default, the CS fallback SE note, and the SDDD worst-case EPV roll-up, so those are not defects.
  • The changed tests still do not exercise the survey-weighted TripleDifference fallback path; the new fallback test there is unweighted, which is why the blocker above is not caught.

Methodology

Affected methods: propensity-score logit fallback and EPV diagnostics in CallawaySantAnna, TripleDifference, and StaggeredTripleDifference.

Code Quality

  • No findings in the changed paths beyond the methodology issue above.

Performance

  • No findings.

Maintainability

  • No separate findings. The new (beta, diag) propensity-cache payload is consumed consistently in the changed callers.

Tech Debt

  • No separate tech-debt finding. TODO.md does not contain a mitigating entry for the P1 above, so it remains unmitigated for approval.

Security

  • No findings.

Documentation/Tests

  • Severity P2. Impact: The new fallback coverage still misses the survey-weighted TripleDifference path that is now wrong. The changed TripleDifference fallback test forces logit failure without survey_design, while the only new weighted fallback regression in this PR was added for StaggeredTripleDifference. That gap is why the blocker above is not caught. Location: tests/test_methodology_triple_diff.py:L1435-L1464, tests/test_survey_staggered_ddd.py:L833-L860, tests/test_staggered.py:L3688-L3714. Concrete fix: add a survey-weighted TripleDifference(..., pscore_fallback="unconditional") regression with unequal weights and forced logit failure, parameterized over estimation_method in {"ipw", "dr"}, and assert the fallback probability equals the weighted subgroup-4 share.

Path to Approval

  1. In TripleDifference, change the unconditional survey fallback to use the positive-weight weighted mean of PA4, matching the survey-weighted logit semantics already used elsewhere in the repo.
  2. Add a survey-weighted regression test for TripleDifference fallback in both ipw and dr modes, with unequal weights and forced logit failure, asserting the constant propensity equals the weighted subgroup-4 share and the fit remains finite.

@igerber
Owner Author

igerber commented Apr 1, 2026

/ai-review

@github-actions

github-actions bot commented Apr 1, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: f7d34f28d810b44bc427d152ab3f92fa321bed4f


Overall Assessment

✅ Looks good

Executive Summary

  • The prior P1 from the last review is resolved: TripleDifference now uses the positive-weight weighted mean in its unconditional fallback when survey weights are present, matching the weighted-logit contract in the other estimators. diff_diff/triple_diff.py:1007
  • I did not find a new P0/P1 in the changed estimator, weighting, or inference paths. The core EPV choice itself is methodologically defensible: Peduzzi is the cited source for the 10-EPV heuristic, and later work explicitly treats that cutoff as a rule of thumb rather than a hard law.
  • P2: CallawaySantAnna.diagnose_propensity() does not measure EPV on the same effective sample that fit() actually uses when outcomes are missing or survey weights are zero, so the new pre-check can overstate EPV on those panels. diff_diff/staggered.py:381 diff_diff/staggered.py:465 diff_diff/staggered.py:677 diff_diff/linalg.py:1291
  • P2: the weighted-fallback methodology fix is still not directly locked down by the changed tests. TripleDifference coverage remains unweighted, and the new StaggeredTripleDifference survey regression only checks finiteness instead of the weighted-share fallback it claims to verify. tests/test_methodology_triple_diff.py:1435 tests/test_survey_staggered_ddd.py:836
  • P3: TripleDifferenceResults now stores EPV diagnostics and exposes epv_summary(), but summary() still does not surface them, unlike the companion results classes added in this PR. diff_diff/triple_diff.py:108 diff_diff/triple_diff.py:122 diff_diff/triple_diff.py:285
  • I could not run the changed tests in this environment because pytest and the runtime deps are not installed, so this is a source-based re-review.

Methodology

Affected methods: propensity-score logit diagnostics/fallback in CallawaySantAnna, TripleDifference, and StaggeredTripleDifference.

Code Quality

  • No findings in the changed code paths beyond the items called out elsewhere.

Performance

  • No findings.

Maintainability

Tech Debt

  • No finding requiring new TODO.md tracking. I did not see an existing TODO.md entry that mitigates the P2 items above.

Security

  • No findings.

Documentation/Tests

@igerber
Owner Author

igerber commented Apr 1, 2026

/ai-review

@github-actions

github-actions bot commented Apr 2, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 4412bf0734e1642f50b7944b71ee71031e3d4a8c


Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1.

Executive Summary

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • Severity: P3. Impact: the new TripleDifference fallback warning tells Python users to switch to est_method='reg', but the public API keyword is estimation_method. That is minor, but it points users to a nonexistent parameter in the changed path. Concrete fix: change the warning text to estimation_method='reg'. diff_diff/triple_diff.py:L1040-L1045

Tech Debt

  • No finding requiring new TODO.md tracking.

Security

  • No findings.

Documentation/Tests

Path to Approval

  1. Change the EPV denominator from “predictors plus intercept” to “predictor variables” everywhere this PR computes or documents EPV: solve_logit(), CallawaySantAnna.diagnose_propensity(), and the Registry sections for Callaway-Sant’Anna, TripleDifference, and StaggeredTripleDifference.
  2. Add boundary tests that lock down the Peduzzi denominator after the fix. Example: with 10 minority-class observations and 1 retained covariate, EPV should be 10, not 5; with 2 retained covariates, EPV should be 5.
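The boundary cases in step 2 are easy to pin down against a minimal stand-in (compute_epv here is hypothetical, not the repo's actual helper):

```python
def compute_epv(n_events, n_predictors):
    """Hypothetical stand-in for the EPV calculation after the fix:
    the denominator counts predictor variables only, excluding the intercept."""
    return n_events / n_predictors


# Boundary cases from the review: 10 minority-class observations.
assert compute_epv(10, 1) == 10  # 1 covariate -> EPV 10, not 5
assert compute_epv(10, 2) == 5   # 2 covariates -> EPV 5
```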

@igerber
Owner Author

igerber commented Apr 2, 2026

/ai-review

@github-actions

github-actions bot commented Apr 2, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 9f64cc1af2eafa9f17fe6405cc5e6dc92315aaed


Overall Assessment

✅ Looks good

Highest unmitigated severity: P2.

Executive Summary

  • The prior P1 methodology blocker appears resolved: EPV is now computed with the intercept excluded, and the registry wording matches.
  • The fallback math itself still looks sound on source review: weighted unconditional fallback and strict-mode re-raise paths are present in the changed estimators.
  • Severity P2: the new DR fallback warnings, plus one Callaway-Sant'Anna registry bullet, say fallback drops all covariates, but the DR code still uses covariates in the outcome-regression component.
  • Severity P2: the changed fallback tests still do not pin down the actual weighted fallback constant they claim to exercise.
  • This was a source-based re-review only; python -m pytest --version fails here because pytest is not installed.

Methodology

  • Severity: P2. Impact: the new fallback messaging is semantically wrong for DR. In CallawaySantAnna, the DR code fits outcome-regression nuisance models before the propensity fallback, and in the repeated-cross-section DR path it does the same; TripleDifference and StaggeredTripleDifference also continue to run their OR components when estimation_method="dr". But the new warnings say fallback means “all covariates dropped,” and the Callaway-Sant'Anna registry now says the same. That can mislead users about which estimator actually ran and how to interpret the result. Concrete fix: make the warning/registry text method-specific. Keep “all covariates dropped” for IPW, but for DR say that the propensity model fell back to an unconditional score while the OR component still uses covariates, and that only the PS influence-function correction is skipped. staggered.py:L2346, staggered.py:L2443, staggered.py:L3425, staggered.py:L3519, triple_diff.py:L1040, triple_diff.py:L1103, staggered_triple_diff.py:L1199, staggered_triple_diff.py:L1389, REGISTRY.md:L412
  • Re-review note: I do not see a remaining source-material defect in the EPV denominator. solve_logit() and diagnose_propensity() now exclude the intercept, which matches Peduzzi’s framing of EPV by predictive variables, and the configurable threshold reads as a reasonable implementation choice rather than a source violation. linalg.py:L1304, staggered.py:L459, REGISTRY.md:L407.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No new issue needs TODO.md tracking.

Security

No findings.

Documentation/Tests

Source-based only in this environment: python -m pytest --version fails with No module named pytest.

@igerber
Owner Author

igerber commented Apr 2, 2026

/ai-review

@github-actions

github-actions bot commented Apr 2, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: b4044ee7e7852a5138fe68022f017236962a14c6


Overall Assessment

✅ Looks good

Highest unmitigated severity: P2.

Executive Summary

  • Re-review status: the earlier EPV-denominator source issue looks resolved.
  • The pscore_fallback="error" default change is documented in the Methodology Registry for the affected estimators, so I did not treat that default change as a defect.
  • Remaining P2: DR fallback text is still misleading in one CallawaySantAnna docstring and in the DDD warning surfaces; it says covariates are dropped even though DR still uses the OR covariates.
  • Remaining P2: the updated fallback tests still do not pin the weighted fallback probability or the DR-under-fallback semantics that actually changed.
  • Source-based only here: python -m pytest --version fails because pytest is not installed.

Methodology

Affected methods: propensity-score logit EPV diagnostics and fallback behavior in solve_logit(), CallawaySantAnna, TripleDifference, and StaggeredTripleDifference. No P0/P1 methodology defects found in the changed math or variance paths. solve_logit() now excludes the intercept from the EPV denominator, and Peduzzi et al. define EPV using predictive variables and report no major problems at EPV values of 10 or greater, so the current denominator and default threshold are source-consistent. diff_diff/linalg.py:1305, docs/methodology/REGISTRY.md:407.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. The remaining P2s are not currently tracked in TODO.md:47, so I treated them as live review items rather than accepted debt.

Security

No findings.

Documentation/Tests

Verification note: source-based only here; python -m pytest --version fails with No module named pytest.

@igerber

igerber commented Apr 2, 2026

/ai-review

@github-actions

github-actions bot commented Apr 2, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 6a2e254b0c5936a50ccb182224f3146c3e25a6b8


Overall Assessment

Looks good

Highest unmitigated severity: P2.

Executive Summary

  • No unmitigated P0/P1 issues found in the changed estimator math, weighting, variance/SE handling, identification checks, or documented default-behavior changes.
  • Methodology alignment looks sound: the new EPV check uses minority-class observations per predictor, excluding the intercept, and the pscore_fallback="error" default change is documented in docs/methodology/REGISTRY.md.
  • Re-review status: the earlier misleading DR fallback wording issue appears resolved in the touched public docstrings and warnings.
  • New P2: all three new epv_summary(show_all=False) helpers can return a column-less empty DataFrame when diagnostics exist but none are low.
  • Remaining P2: the updated fallback tests still do not assert the weighted fallback probability or the DR-under-fallback semantics that the code now claims to support.
  • Verification note: I could not run pytest here because python -m pytest --version fails with No module named pytest.

Methodology

  • Severity: None. Impact: Affected methods are the propensity-score logit EPV diagnostics and fallback behavior in CallawaySantAnna, TripleDifference, and StaggeredTripleDifference. I did not find an undocumented mismatch with Peduzzi or the Methodology Registry. solve_logit() now computes EPV as minority-class observations divided by predictor count excluding the intercept, which matches Peduzzi’s “events per predictive variable” framing, and the fallback/default behavior changes are explicitly documented in the registry. Concrete fix: none. diff_diff/linalg.py:L1292-L1328, docs/methodology/REGISTRY.md:L408-L422, docs/methodology/REGISTRY.md:L1246-L1256, docs/methodology/REGISTRY.md:L1300-L1312.

Code Quality

  • Severity: P2. Impact: The new epv_summary(show_all=False) helpers violate their documented return contract when diagnostics exist but none are low. In that case rows stays empty and each method returns pd.DataFrame(rows), which produces a column-less empty frame instead of the promised schema. Any caller that expects group/time/subgroup columns will break on valid high-EPV fits. Concrete fix: return pd.DataFrame(rows, columns=[...]) in all three helpers, and add one regression test for the no-low-EPV case. diff_diff/staggered_results.py:L291-L313, diff_diff/staggered_triple_diff_results.py:L267-L289, diff_diff/triple_diff.py:L317-L338
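The schema fix the finding recommends is the standard pandas idiom: pass an explicit column list to the DataFrame constructor so an empty row list still yields the documented columns. The helper below is a hypothetical reconstruction; the real epv_summary() signature and column names may differ.

```python
import pandas as pd

# Illustrative schema; the library's actual column names may differ.
EPV_COLUMNS = ["group", "time", "epv", "low_epv"]

def epv_summary(diagnostics, show_all=False, threshold=10.0):
    # With an explicit columns= argument, an empty `rows` list still
    # produces a frame with the documented schema instead of a
    # column-less empty DataFrame.
    rows = [d for d in diagnostics if show_all or d["epv"] < threshold]
    return pd.DataFrame(rows, columns=EPV_COLUMNS)

# A single high-EPV cell: show_all=False filters it out, but the
# returned frame keeps the four documented columns.
diags = [{"group": 2004, "time": 2005, "epv": 25.0, "low_epv": False}]
out = epv_summary(diags)
```

Without `columns=EPV_COLUMNS`, `pd.DataFrame([])` has zero columns, which is exactly the contract violation described above.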

Performance

  • Severity: None. Impact: No material performance regression stood out in the changed paths; the EPV bookkeeping is linear in already-materialized arrays and the cache now stores only a small diagnostics dict alongside the coefficient vector. Concrete fix: none.

Maintainability

  • Severity: None. Impact: Parameter propagation looks complete in the changed files. The new epv_threshold and pscore_fallback parameters are validated, exposed via get_params(), and stored on results objects for the touched estimators. Concrete fix: none.

Tech Debt

  • Severity: None. Impact: I did not find matching TODO.md entries that would mitigate the live P2/P3 items, so I treated them as active review findings rather than accepted debt. Concrete fix: none, unless you want to defer them explicitly in TODO.md. TODO.md:L47-L85

Security

  • Severity: None. Impact: No new secret-handling, subprocess, filesystem, or untrusted-input issues are introduced by this diff. Concrete fix: none.

Documentation/Tests

  • Severity: P2. Impact: The fallback tests still do not pin the behavioral contracts this PR changed. The implementations now use weighted positive-mass treated shares for unconditional fallback and preserve the OR covariates in DR fallback, but the changed tests still only assert warnings or finite outputs. A regression back to an unweighted fallback constant, or a DR fallback that effectively drops all covariates, would still pass. Concrete fix: add unequal-weight fixtures that force fallback and assert the exact fallback constant; for DR, add a paired test where only the OR covariates change and the fallback estimate changes accordingly. diff_diff/staggered.py:L2156-L2174, diff_diff/staggered.py:L2443-L2462, diff_diff/staggered.py:L3521-L3538, diff_diff/staggered_triple_diff.py:L1399-L1410, diff_diff/triple_diff.py:L1033-L1048, tests/test_staggered.py:L3490-L3517, tests/test_methodology_triple_diff.py:L1435-L1464, tests/test_survey_staggered_ddd.py:L833-L864
  • Severity: P3. Impact: Two new EPV test comments still use the wrong arithmetic in the configurable-threshold case. The fixture there is 15 events / 2 predictors = 7.5, but the comments say EPV=5, which reintroduces confusion around the intercept-excluded denominator. Concrete fix: update the comments to EPV = 7.5 or rewrite them as 15 events / 2 predictor variables. tests/test_linalg.py:L1728-L1735
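The weighted-fallback assertion the P2 item asks for can be sketched directly: under survey weights, the unconditional fallback propensity should equal the weighted treated share (np.average with weights), and an unequal-weight fixture distinguishes it from the raw count ratio. This is a standalone illustration of the test idea, not the repository's test code.

```python
import numpy as np

# Unequal weights make the weighted and unweighted shares differ, so a
# regression back to np.mean() would be caught.
treated = np.array([1, 1, 0, 0])
weights = np.array([3.0, 1.0, 1.0, 1.0])

weighted_share = np.average(treated, weights=weights)  # (3 + 1) / 6
raw_share = treated.mean()                             # 2 / 4 = 0.5

assert abs(weighted_share - 4.0 / 6.0) < 1e-12
assert weighted_share != raw_share  # the discriminating assertion
```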

igerber and others added 11 commits April 2, 2026 06:48
…ault

Events Per Variable (EPV) check in solve_logit warns when minority-class
observations per parameter falls below threshold (default 10, per Peduzzi
et al. 1996). Affects CallawaySantAnna, TripleDifference, and
StaggeredTripleDiff — all estimators using logit for propensity scores.

New pscore_fallback parameter defaults to "error" instead of silently
dropping covariates when logit fails. Set pscore_fallback="unconditional"
for legacy behavior. diagnose_propensity() method on CallawaySantAnna
enables pre-estimation EPV assessment across cohorts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…cstrings

P0: Zero-fill NaN coefficients from dropped rank-deficient columns before
caching in CS panel IPW/DR paths, preventing NaN propagation on cache
reuse. Matches existing pattern in StaggeredTripleDiff._compute_pscore().

P1: Restore strict-mode semantics so rank_deficient_action="error" always
re-raises regardless of pscore_fallback setting. Update TripleDifference
REGISTRY.md section with pscore_fallback and EPV documentation. Propagate
epv_threshold and pscore_fallback through triple_difference() wrapper.

P2: Store epv_threshold on results objects for correct summary rendering.
Add class docstrings for new parameters. Add regression tests for cache
NaN poisoning and strict-mode interaction with pscore_fallback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
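The zero-fill-before-caching fix this commit describes can be illustrated in a few lines of NumPy. The shapes and values are made up; the point is that a cached coefficient vector containing NaN poisons every later X @ beta, while zero-filled entries for dropped rank-deficient columns keep the linear predictor finite.

```python
import numpy as np

# A solver returns NaN for a dropped collinear column.
beta = np.array([0.4, np.nan, -1.2])

# Zero-fill dropped coefficients before caching: the dropped column then
# simply contributes nothing to the linear predictor.
beta_cached = np.where(np.isnan(beta), 0.0, beta)

X = np.array([[1.0, 5.0, 2.0]])
assert np.isnan(X @ beta).all()           # cache poisoning without the fix
assert np.isfinite(X @ beta_cached).all() # safe to reuse from cache
```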
…e diagnostics

Fix RCS fallback handlers to re-raise when rank_deficient_action="error",
matching the panel path fix. Compute EPV on positive-weight sample only
when weights have zeros (Peduzzi's rule applies to the fitted sample).
Guard diagnose_propensity() against panel=False with NotImplementedError.
Update StaggeredTripleDifference REGISTRY.md section with EPV diagnostics
and pscore_fallback documentation. Cache EPV diagnostic metadata alongside
logit coefficients so cache-hit cells appear in epv_summary(show_all=True).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…pagation

Retain worst-case (minimum) EPV across all g_c comparison cohorts for the
same (g,t) cell instead of overwriting. Cache EPV diagnostic metadata
alongside logit coefficients in _compute_pscore() and propagate on cache
hits, matching the CallawaySantAnna pattern. Add REGISTRY.md note
documenting the cell-level worst-case reporting convention.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
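The worst-case retention rule in this commit reduces to a min-accumulate keyed by cell. A minimal sketch, with illustrative (g, t) keys and EPV values:

```python
# When several comparison cohorts g_c map to the same (g, t) cell, keep
# the minimum EPV seen for that cell rather than overwriting with the
# most recent fit. Keys and values here are illustrative.
cell_epv = {}
fits = [((2004, 2005), 12.0), ((2004, 2005), 6.5), ((2004, 2005), 9.0)]
for cell, epv in fits:
    cell_epv[cell] = min(epv, cell_epv.get(cell, float("inf")))

# The reported value is the worst case (6.5), not the last fit (9.0).
```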
…y guard, result params

Skip propensity-score influence function correction when unconditional
fallback is used (constant pscore has zero estimation uncertainty).
Adds ps_fallback_used flag across all 4 IPW/DR methods (panel+RCS).

Guard diagnose_propensity() against control_group='not_yet_treated'
with NotImplementedError since the control set varies per (g,t) cell.

Propagate pscore_fallback to all three results dataclasses and
epv_threshold to TripleDifferenceResults for full audit trail.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…f handling

Use weighted treated share (np.average with survey weights) for
unconditional fallback propensity instead of raw count ratio. Applies
to all 4 panel/RCS IPW/DR fallback sites.

Normalize np.inf → 0 for never-treated encoding in diagnose_propensity()
to match fit()'s treatment_groups derivation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
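The np.inf → 0 normalization mentioned above is a one-line np.where over the first-treatment-period array; never-treated units encoded as np.inf are mapped to the 0 code that fit()'s treatment_groups derivation uses. Variable names below are assumptions.

```python
import numpy as np

# First treatment period per unit, with np.inf marking never-treated.
first_treated = np.array([2004.0, np.inf, 2006.0, np.inf])

# Normalize to fit()'s encoding: never-treated -> 0.
groups = np.where(np.isinf(first_treated), 0.0, first_treated)
```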
Use weighted subgroup share (np.average with survey weights) for
unconditional fallback instead of raw np.mean(PA4), matching the
survey-weighted logit semantics used in CS and SDDD fallback paths.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…aveat

Add low-EPV diagnostic block to TripleDifferenceResults.summary(),
matching the pattern in CS and SDDD results classes.

Document diagnose_propensity() as a raw-count heuristic that may
overstate EPV vs. fit-time effective sample (missing outcomes, zero
survey weights). Direct users to results.epv_diagnostics for
authoritative per-cell EPV.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… convention)

Peduzzi et al. (1996) define EPV using independent predictor variables,
not including the intercept. Change denominator from k_solve (which
includes the intercept column) to n_predictors = k_solve - 1.

Also fix TripleDifference fallback warning to use correct API keyword
estimation_method (not est_method).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
DR fallback warnings now say propensity model is unconditional while
outcome regression still uses covariates, instead of misleading "all
covariates dropped" text. IPW warnings unchanged.

Update REGISTRY.md fallback description to distinguish IPW vs DR
behavior. Fix docstrings to say "predictor variables (excluding
intercept)" consistently.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ention

Make all fallback warnings and docstrings method-specific: IPW says
covariates dropped, DR says propensity model unconditional while outcome
regression still uses covariates. Update test comments from old
intercept-inclusive arithmetic to predictor-variable counts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Ensure epv_summary(show_all=False) returns DataFrame with correct
column schema even when no entries have low EPV, across all three
results classes. Fix remaining test comments to use intercept-excluded
EPV arithmetic.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber igerber force-pushed the logit-assessment branch from 6a2e254 to 67bc6db on April 2, 2026 at 10:51
@igerber

igerber commented Apr 2, 2026

/ai-review

@github-actions

github-actions bot commented Apr 2, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 67bc6dbbfe34b2b3e9e543f02594c60ebe1003cf


Overall Assessment

✅ Looks good

Highest unmitigated severity: P2.

Executive Summary

  • Re-review status: the prior epv_summary(show_all=False) contract bug is fixed in all three result containers, and the EPV arithmetic comment in the new solve_logit() tests is corrected. diff_diff/staggered_results.py:L282-L314 diff_diff/staggered_triple_diff_results.py:L258-L290 diff_diff/triple_diff.py:L308-L339 tests/test_linalg.py:L1721-L1735
  • No unmitigated P0/P1 issues found in estimator math, weighting, variance/SE handling, identification checks, or the changed default behaviors.
  • Methodology alignment looks sound for the affected methods: solve_logit() EPV diagnostics, CallawaySantAnna IPW/DR propensity fits, TripleDifference subgroup logits, and StaggeredTripleDifference pairwise logits all match the Methodology Registry as updated.
  • Remaining P2: the fallback tests still mostly assert warnings or finite outputs, not the exact weighted unconditional propensity constant, DR-under-fallback semantics, or the new default-"error" behavior across all affected estimators.
  • Verification note: I could not run pytest here because /usr/bin/python does not have pytest installed.

Methodology

Code Quality

Performance

  • Severity: None. Impact: No material performance regression stood out in the changed paths; the EPV bookkeeping is linear in already-materialized arrays, and the cache growth is limited to a small diagnostics dict next to cached coefficients. Concrete fix: none.

Maintainability

Tech Debt

  • Severity: None. Impact: TODO.md does not already track the remaining live test-gap item, so I treated it as an active P2 finding rather than accepted debt. Concrete fix: none beyond the Documentation/Tests item. TODO.md:L47-L87

Security

  • Severity: None. Impact: No new secret-handling, subprocess, filesystem, or untrusted-input issue is introduced by this diff. Concrete fix: none.

Documentation/Tests

@igerber igerber merged commit e1cd6f5 into main Apr 2, 2026
14 checks passed
@igerber igerber deleted the logit-assessment branch April 2, 2026 12:13
@igerber igerber mentioned this pull request Apr 2, 2026