
Add EPV diagnostics for propensity score logit #251

Merged
igerber merged 12 commits into main from logit-assessment on Apr 2, 2026

Conversation

@igerber
Owner

@igerber igerber commented Apr 1, 2026

Summary

  • Add an Events Per Variable (EPV) check in solve_logit() that warns when the number of minority-class observations per parameter falls below a threshold (default 10, per Peduzzi et al. 1996). Affects all estimators that use logit for propensity scores: CallawaySantAnna, TripleDifference, StaggeredTripleDiff.
  • A new pscore_fallback parameter defaults to "error" instead of silently dropping all covariates when the logit fails. Set pscore_fallback="unconditional" for the legacy behavior.
  • diagnose_propensity() method on CallawaySantAnna enables pre-estimation EPV assessment across all cohorts without running the full estimation.
  • Per-cohort EPV diagnostics stored in results.epv_diagnostics with epv_summary() method and diagnostic block in summary() output.
  • Fix NaN cache poisoning: zero-fill dropped rank-deficient logit coefficients before caching to prevent NaN propagation on cache reuse.
  • Strict-mode semantics preserved: rank_deficient_action="error" always re-raises regardless of pscore_fallback setting.
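A rough sketch of the EPV check described above (the function shape and names here follow this summary, not the actual solve_logit() internals):

```python
import warnings

import numpy as np


def check_epv(y, n_predictors, epv_threshold=10):
    """Events-per-variable diagnostic for a logistic fit (Peduzzi et al. 1996).

    Illustrative sketch, not the PR's implementation: EPV is the number of
    minority-class observations divided by the number of predictor variables.
    """
    y = np.asarray(y)
    n_events = min(int(y.sum()), int(len(y) - y.sum()))  # minority-class count
    epv = n_events / max(n_predictors, 1)
    if epv < epv_threshold:
        warnings.warn(
            f"Low events per variable: EPV={epv:.1f} < {epv_threshold}; "
            "propensity score estimates may be unstable."
        )
    return epv
```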

Methodology references (required if estimator / math changes)

  • Method name(s): Events Per Variable (EPV) diagnostics for logistic regression
  • Paper / source link(s): Peduzzi, P., Concato, J., Kemper, E., Holford, T.R., & Feinstein, A.R. (1996). A simulation study of the number of events per variable in logistic regression analysis. Journal of Clinical Epidemiology, 49(12), 1373-1379.
  • Any intentional deviations from the source (and why): EPV threshold is configurable (default 10) rather than hard-coded, since the optimal threshold varies by context

Validation

  • Tests added/updated: tests/test_linalg.py (7 EPV unit tests), tests/test_staggered.py (12 integration tests including cache NaN regression, strict-mode interaction, diagnose_propensity), tests/test_methodology_triple_diff.py (fallback test updated), tests/test_survey_staggered_ddd.py (fallback test updated)
  • 305 tests pass across affected test files, 0 failures

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

@github-actions

github-actions bot commented Apr 1, 2026

Overall Assessment

⚠️ Needs changes

The highest-severity unmitigated findings are P1s in the new EPV/propensity-score behavior: one strict-mode contract break on the repeated-cross-section Callaway-Sant’Anna path, one weighted-EPV calculation bug, one diagnose_propensity() path mismatch, and one missing Methodology Registry update for StaggeredTripleDifference.

Executive Summary

  • CallawaySantAnna(panel=False) does not preserve the new strict-mode semantics: the repeated-cross-section IPW/DR paths still fall back under pscore_fallback="unconditional" even when rank_deficient_action="error" should force a re-raise.
  • The new EPV check in solve_logit() counts all rows, not the effective positive-weight sample, so survey/subpopulation fits can overstate EPV and suppress the intended Peduzzi-based warning/error on the actual fitted sample. Peduzzi’s rule is about events per predictor in the fitted logistic sample.
  • diagnose_propensity() always uses the panel preprocessing path and cohort-level counts, even though fit() switches to a different repeated-cross-section sample construction for panel=False; the new diagnostic is therefore not aligned with one supported estimator mode.
  • StaggeredTripleDifference now exposes EPV diagnostics and a new pscore_fallback default in code, but its section in the Methodology Registry was not updated, so this estimator has an undocumented methodology/default-behavior change.
  • I did not flag the configurable EPV threshold itself or the NaN zero-fill cache fix as defects; those are reasonable implementation choices, and I found no new partial-NaN inference anti-patterns or security issues in the diff.

Methodology

  • Severity: P1. Impact: the repeated-cross-section Callaway-Sant’Anna paths _ipw_estimation_rc() and _doubly_robust_rc() only re-raise when pscore_fallback == "error", not when rank_deficient_action == "error", so panel=False can still proceed with unconditional propensity after a logit failure even though the new API/docs say strict mode must always raise. See diff_diff/staggered.py, diff_diff/staggered.py, diff_diff/staggered.py, docs/methodology/REGISTRY.md. Concrete fix: in both RCS catch blocks, mirror the panel logic and re-raise when self.rank_deficient_action == "error"; add panel=False regression tests for both IPW and DR with pscore_fallback="unconditional".

  • Severity: P1. Impact: solve_logit() correctly defines the effective weighted sample via weights > 0 for class/rank identification, but the new EPV calculation still uses the full y vector. In survey/domain fits with many zero-weight rows, EPV can be materially overstated and the low-EPV warning/error can be skipped on the actual fitted sample. This affects every weighted propensity-score caller touched here. See diff_diff/linalg.py, diff_diff/linalg.py, diff_diff/staggered.py, diff_diff/triple_diff.py, diff_diff/staggered_triple_diff.py. Concrete fix: when weights are present, compute n_events from y[weights > 0] after any effective-sample rank reduction; add a regression test with many zero-weight padded rows showing EPV tracks the positive-weight sample.

  • Severity: P1. Impact: the new diagnose_propensity() API is not aligned with the estimator’s repeated-cross-section path. It always calls the panel precompute routine, while fit() uses _precompute_structures_rc() for panel=False; it also returns one row per cohort even though fit-time EPV is stored per (g, t) cell. That makes the new pre-estimation diagnostic unreliable on a supported mode and weaker than the actual fit-time assumption check. See diff_diff/staggered.py, diff_diff/staggered.py, diff_diff/staggered.py, diff_diff/staggered.py, diff_diff/staggered.py. Concrete fix: branch on self.panel; for panel=False, compute diagnostics from the RCS sample construction and actual (g,t) treated/control pools, or explicitly reject panel=False until that logic exists.

  • Severity: P1. Impact: StaggeredTripleDifference now changes propensity-score methodology/default behavior in code by adding EPV diagnostics and changing pscore_fallback to default to "error", but the StaggeredTripleDifference section of the Methodology Registry still documents only the older propensity-score assumptions. That is an undocumented estimator change under the project’s review rules. See diff_diff/staggered_triple_diff.py, diff_diff/staggered_triple_diff.py, diff_diff/staggered_triple_diff.py, docs/methodology/REGISTRY.md. Concrete fix: add registry bullets for EPV diagnostics, fallback behavior, and the default-change note in the StaggeredTripleDifference section, matching the detail already added for Callaway-Sant’Anna and TripleDifference.
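The weighted-EPV fix proposed in the second bullet might look like this sketch (names are illustrative; the repo's solve_logit() internals are not reproduced):

```python
import numpy as np


def weighted_epv(y, weights, n_predictors):
    """EPV computed on the effective (positive-weight) sample.

    Illustrative sketch: rows with zero weight do not contribute to the fit,
    so they are excluded before counting minority-class events.
    """
    y = np.asarray(y)
    w = np.asarray(weights)
    y_eff = y[w > 0]                       # effective fitted sample
    n_pos = int(y_eff.sum())
    n_events = min(n_pos, len(y_eff) - n_pos)
    return n_events / max(n_predictors, 1)
```

With many zero-weight padded rows, the full-sample count would overstate EPV; the positive-weight version tracks the sample the logit actually fits.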

Code Quality

Performance

  • No material performance regressions stood out in the changed code.

Maintainability

  • No additional maintainability findings beyond the methodology/diagnostic propagation issues above.

Tech Debt

  • No separate finding, but none of the P1 items above are already tracked in TODO.md, so they remain unmitigated for assessment purposes.

Security

  • No security findings in the diff.

Documentation/Tests

  • Severity: P2. Impact: the added tests cover solve_logit(), panel Callaway-Sant’Anna, and TripleDifference, but they do not cover the broken panel=False Callaway-Sant’Anna paths or the new EPV/strict-mode behavior for StaggeredTripleDifference. The only changed staggered DDD test exercises weighted unconditional fallback mean, not EPV warnings/results or strict-mode re-raise. See tests/test_staggered.py, tests/test_staggered.py, tests/test_survey_staggered_ddd.py. Concrete fix: add targeted tests for panel=False fallback strict mode, weighted zero-weight EPV behavior, StaggeredTripleDifference EPV warning/result plumbing, and its rank_deficient_action="error" interaction with pscore_fallback="unconditional".

Path to Approval

  1. Fix the repeated-cross-section Callaway-Sant’Anna fallback handlers so rank_deficient_action="error" always re-raises in both IPW and DR paths, and add panel=False regression tests for that contract.
  2. Recompute EPV on the effective positive-weight sample inside solve_logit(), and add a zero-weight-subpopulation regression test that would currently miss the warning/error.
  3. Make diagnose_propensity() respect panel=False and the actual RCS sample construction, or explicitly disable it for that mode until implemented.
  4. Update the StaggeredTripleDifference Methodology Registry section to document EPV diagnostics and the pscore_fallback default change.
  5. Add coverage for cached-diagnostic propagation so results.epv_diagnostics and epv_summary(show_all=True) stay complete on cache hits.

@igerber
Owner Author

igerber commented Apr 1, 2026

/ai-review

@github-actions

github-actions bot commented Apr 1, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 8d462f101a6fc2935b7c4c2c8f90d7012b4b5c63


Overall Assessment
⚠️ Needs changes

Executive Summary

  • The prior P1s from the last review look addressed: repeated-cross-section strict mode now re-raises, solve_logit() EPV uses the positive-weight sample, diagnose_propensity() no longer claims to support panel=False, and the StaggeredTripleDifference Registry section now documents the default/fallback change.
  • The remaining blocker is in StaggeredTripleDifference: EPV diagnostics are documented per DiD/logit, but the implementation stores only one (g, t) entry and overwrites it as each comparison cohort is processed.
  • That can silently hide a low-EPV comparison if a later g_c for the same ATT(g,t) has a higher EPV.
  • Cache-hit paths in SDDD still do not carry diagnostic metadata, so results.epv_summary(show_all=True) can underreport reused propensity fits.
  • I did not find new safe_inference() / NaN-propagation regressions or security issues in the modified paths.

Methodology
Affected methods checked: propensity-score logistic EPV diagnostics in solve_logit(), CallawaySantAnna, TripleDifference, and StaggeredTripleDifference. The default epv_threshold=10 is consistent with the cited Peduzzi simulation, which reported no major problems at EPV ≥ 10 and materially worse bias/coverage behavior below 10.

Code Quality

Performance

  • No material performance findings in the changed code.

Maintainability

  • No additional maintainability findings beyond the SDDD EPV data-model mismatch above.

Tech Debt

  • No separate finding, but the remaining EPV issues are not tracked under TODO.md:L47-L70, so they remain unmitigated for approval.

Security

  • No security findings in the diff.

Documentation/Tests

  • Severity: P2. Impact: the new EPV tests cover solve_logit() and Callaway-Sant’Anna, but the only changed SDDD test still checks weighted fallback only in tests/test_survey_staggered_ddd.py:L833-L864; there is no regression coverage for the multi-g_c overwrite path or the cache-hit diagnostic loss in the modified SDDD implementation. Concrete fix: add one test where a (g, t) cell has at least two valid comparison cohorts with different EPV values and assert the lower EPV survives, plus one universal-base/cache-hit test asserting results.epv_summary(show_all=True) retains diagnostics on reused propensity fits.

Path to Approval

  1. Make SDDD EPV reporting faithfully represent every fitted propensity model contributing to ATT(g,t), or compute a true cell-level minimum across all contributing fits and document that deviation in the Registry with a **Note:**.
  2. Cache and replay EPV diagnostic metadata in StaggeredTripleDifference._compute_pscore().
  3. Add regression tests for multi-comparison overwrite and cache-hit diagnostic persistence.

@igerber
Owner Author

igerber commented Apr 1, 2026

/ai-review

@github-actions

github-actions bot commented Apr 1, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 65afd79ee65dffc8c98c585cfee21eecc5be6afe


Overall Assessment
⚠️ Needs changes

Executive Summary

  • The epv_threshold=10 default itself is aligned with the Peduzzi rule-of-thumb the PR cites; the remaining blockers are in how the new fallback and diagnostic paths are wired into estimation.
  • The prior SDDD EPV overwrite/cache concern looks addressed in diff_diff/staggered_triple_diff.py:L1016-L1083 and diff_diff/staggered_triple_diff.py:L1355-L1361, and the new Registry note at docs/methodology/REGISTRY.md:L1298-L1305 now documents the cell-level worst-case summary.
  • CallawaySantAnna’s new pscore_fallback="unconditional" path still feeds full-covariate propensity-score nuisance corrections into the IF/SE code in several branches, so fallback inference is not actually unconditional.
  • CallawaySantAnna.diagnose_propensity() is not using the same not_yet_treated control definition as estimation, so it can overstate EPV for later (g,t) cells and miss low-EPV fits.
  • The new public params are not fully propagated to results objects: pscore_fallback is missing from the CS and SDDD result containers, and TripleDifferenceResults also omits epv_threshold.
  • Coverage is still missing for the repaired SDDD EPV merge/cache behavior and for diagnose_propensity() under control_group="not_yet_treated".

Methodology

  • Severity: P1. Impact: [Newly identified] CallawaySantAnna’s unconditional-fallback path is only unconditional for the point estimator. After the new fallback branch sets a constant pscore, the panel/RC IPW/DR code still builds propensity-score nuisance corrections from the full covariate design, so SEs, p-values, and CIs are computed as if the dropped propensity covariates had still been estimated. That conflicts with the new Registry/warning contract that fallback drops covariates for the cell. Location: diff_diff/staggered.py:L2127-L2145, diff_diff/staggered.py:L2176-L2207, diff_diff/staggered.py:L2406-L2425, diff_diff/staggered.py:L2453-L2546, diff_diff/staggered.py:L3206-L3331, diff_diff/staggered.py:L3467-L3669, docs/methodology/REGISTRY.md:L412-L415. Concrete fix: thread a ps_fallback_used flag through _ipw_estimation, _doubly_robust, _ipw_estimation_rc, and _doubly_robust_rc; when fallback is used, either skip PS correction entirely or recompute it from an intercept-only model, and document whichever choice you keep.
  • Severity: P1. Impact: diagnose_propensity() does not use the same control-group definition as fit() when control_group="not_yet_treated". The new helper hardcodes nyt_threshold = g - 1 at the cohort level, but actual ATT estimation uses max(t, base_period) + anticipation, so later/post-treatment cells can have far fewer valid controls than the diagnostic reports. That lets the new pre-estimation EPV check label a cohort ok even when some fitted cells are low-EPV or skipped. Location: diff_diff/staggered.py:L443-L457, diff_diff/staggered.py:L649-L656, diff_diff/staggered.py:L2742-L2745. Concrete fix: have diagnose_propensity() evaluate the same valid (g,t) cells as fit() and return per-cell EPV, or report the per-cohort minimum across those cells and add a **Note:** in REGISTRY.md that it is a conservative aggregation.
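To make the control-definition mismatch concrete, here is a minimal sketch of the threshold the review says estimation actually uses (names and the 0-means-never-treated encoding are assumptions taken from this thread, not verified against the repo):

```python
def nyt_controls(first_treat, t, base_period, anticipation=0):
    """Cohorts that remain valid not-yet-treated controls for a (g, t) cell.

    Per the review, estimation requires first_treat > max(t, base_period)
    + anticipation, while the buggy diagnostic only required first_treat > g - 1.
    Sketch only; 0 is assumed to encode never-treated units.
    """
    threshold = max(t, base_period) + anticipation
    return [ft for ft in first_treat if ft == 0 or ft > threshold]
```

For a post-treatment cell this threshold sits at or above t, so late periods can have far fewer valid controls than a cohort-level g - 1 cutoff suggests.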

Code Quality

  • No separate code-quality finding beyond the methodology issues above. I did not find a new safe_inference()/NaN-propagation regression in the touched code.

Performance

  • No material performance findings in the changed paths.

Maintainability

Tech Debt

  • No separate mitigating TODO.md entry covers the P1s above, so they remain unmitigated for approval.

Security

  • No security findings in the diff.

Documentation/Tests

  • Severity: P2. Impact: The re-review target in StaggeredTripleDifference still lacks direct regression coverage for the repaired EPV behavior: the only touched SDDD test now checks opt-in fallback, not the new worst-case merge/cache replay, and the new diagnose_propensity() tests never exercise control_group="not_yet_treated". Location: tests/test_survey_staggered_ddd.py:L836-L864, tests/test_staggered.py:L3716-L3764. Concrete fix: add one SDDD regression with at least two contributing g_c values and different EPVs asserting results.epv_summary(show_all=True) retains the minimum EPV across cache hits, plus one CallawaySantAnna(control_group="not_yet_treated") diagnose_propensity() test where late periods have fewer valid controls than the cohort-level baseline.

Path to Approval

  1. Make CallawaySantAnna’s unconditional-fallback inference actually unconditional: skip PS nuisance correction on fallback or replace it with an intercept-only correction in all affected panel/RC IPW/DR paths.
  2. Rework diagnose_propensity() so its control definition matches estimation for control_group="not_yet_treated"; if you keep a per-cohort summary, make it a documented conservative minimum over actual (g,t) cells.
  3. Finish parameter propagation by storing pscore_fallback on the CS/SDDD results objects and both epv_threshold and pscore_fallback on TripleDifferenceResults.
  4. Add targeted regression tests for the fallback-inference path, the not_yet_treated diagnostic path, and the SDDD worst-case/cache EPV summary path.

@igerber
Owner Author

igerber commented Apr 1, 2026

/ai-review

@github-actions

github-actions bot commented Apr 1, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: b2e02faa43c77c72f89d46d1402c8f727a3967d7


Overall Assessment
⚠️ Needs changes

Executive Summary

  • The prior P1s from the last review look addressed in the changed files: CS now skips the PS IF correction on unconditional fallback, diagnose_propensity() now refuses control_group="not_yet_treated", the new params are propagated into results objects, and the SDDD worst-case EPV aggregation is documented.
  • The EPV threshold itself is not the blocker here; the remaining issue is in how one fallback branch handles survey weights.
  • Severity P1. Impact: [Newly identified] CallawaySantAnna’s survey-weighted panel DR fallback still uses an unweighted treated share when it switches to unconditional propensity, so that fallback no longer matches the survey-weighted intercept-only logit the estimator otherwise uses.
  • Severity P2. Impact: the new diagnose_propensity() helper does not mirror fit()’s first_treat=np.inf normalization, so it can emit a bogus group=inf cohort in supported never-treated inputs.
  • Test coverage added here exercises weighted fallback only for SDDD; there is still no targeted regression for CS survey fallback or for diagnose_propensity() with first_treat=np.inf.

Methodology

Code Quality

  • Severity: P2. Impact: diagnose_propensity() does not apply the same np.inf -> 0 never-treated normalization that fit() applies before deriving treatment_groups, so a supported never-treated encoding can show up as a fake treated cohort in the new diagnostic output. That makes the new helper disagree with the estimator on inputs already covered by fit(). Location: diff_diff/staggered.py:L434, diff_diff/staggered.py:L1468, and the existing fit() regression at tests/test_staggered.py:L93. Concrete fix: normalize np.inf to 0 on a local copy inside diagnose_propensity() before computing treatment_groups / precomputed structures, then add a matching regression test.
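The suggested normalization is a small transform on a local copy; a sketch (assumes first_treat is numeric with 0 encoding never-treated, as the fix describes):

```python
import numpy as np


def normalize_first_treat(first_treat):
    """Map the never-treated encoding np.inf to 0 on a local copy,
    so cohort derivation never produces a bogus group=inf row. Sketch only."""
    ft = np.asarray(first_treat, dtype=float).copy()
    ft[np.isinf(ft)] = 0.0
    return ft
```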

Performance

  • No findings in the changed paths.

Maintainability

  • No separate maintainability finding beyond the fallback-logic inconsistency already noted above.

Tech Debt

  • No mitigating TODO.md entry covers the remaining P1, so it remains unmitigated for approval. Location: TODO.md:L47.

Security

  • No security findings in the diff.

Documentation/Tests

  • Severity: P2. Impact: The new tests do not exercise the blocker path above. This PR adds a weighted-fallback regression only for SDDD, while the CS fallback tests are still unweighted and diagnose_propensity() has no np.inf coverage, which is why both changed branches remained unguarded. Location: tests/test_survey_staggered_ddd.py:L833, tests/test_staggered.py:L3670, tests/test_staggered.py:L3716. Concrete fix: add 1) a panel+survey CallawaySantAnna(estimation_method="dr", pscore_fallback="unconditional") regression with unequal weights and forced logit failure, asserting the fallback matches the weighted intercept-only propensity path; and 2) a diagnose_propensity() regression with never-treated encoded as np.inf, asserting there is no group=inf row.

Path to Approval

  1. Fix the survey-weighted panel DR fallback in CallawaySantAnna so unconditional propensity uses the weighted treated share, not raw counts.
  2. Add a regression test that forces fallback in the survey-weighted panel DR path and asserts the expected weighted-fallback behavior.
  3. Optionally, clean up diagnose_propensity(np.inf) at the same time; it is not blocker severity, but it is a real new-helper correctness gap.
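The weighted treated share in step 1 is just a weighted mean over the positive-weight sample; a sketch (function name and masking are illustrative, not the repo's code):

```python
import numpy as np


def unconditional_pscore(d, weights=None):
    """Constant fallback propensity: treated share of the fitted sample.

    With survey weights, this is the weighted mean of the treatment indicator
    over positive-weight rows, which matches the MLE of a weighted
    intercept-only logit. Sketch based on the review's description.
    """
    d = np.asarray(d, dtype=float)
    if weights is None:
        return float(d.mean())
    w = np.asarray(weights, dtype=float)
    mask = w > 0
    return float(np.average(d[mask], weights=w[mask]))
```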

@igerber
Owner Author

igerber commented Apr 1, 2026

/ai-review

@github-actions

github-actions bot commented Apr 1, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: b1d78e90194ea52bcd6e065246f5e3cd63a49622


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The prior re-review blockers in CallawaySantAnna look addressed in code: the survey-weighted unconditional fallback now uses weighted treated share in both panel and repeated-cross-section paths, and diagnose_propensity() now normalizes first_treat=np.inf before building cohorts.
  • The remaining blocker is in TripleDifference: its new pscore_fallback="unconditional" branch still uses the raw subgroup share even when the propensity logit was being fit with survey weights, so survey-weighted IPW/DR fallback no longer matches the weighted intercept-only logit it is substituting for.
  • The EPV threshold/default-change behavior itself looks fine: the Registry now documents the pscore_fallback="error" default, the CS fallback SE note, and the SDDD worst-case EPV roll-up, so those are not defects.
  • The changed tests still do not exercise the survey-weighted TripleDifference fallback path; the new fallback test there is unweighted, which is why the blocker above is not caught.

Methodology

Affected methods: propensity-score logit fallback and EPV diagnostics in CallawaySantAnna, TripleDifference, and StaggeredTripleDifference.

Code Quality

  • No findings in the changed paths beyond the methodology issue above.

Performance

  • No findings.

Maintainability

  • No separate findings. The new (beta, diag) propensity-cache payload is consumed consistently in the changed callers.

Tech Debt

  • No separate tech-debt finding. TODO.md does not contain a mitigating entry for the P1 above, so it remains unmitigated for approval.

Security

  • No findings.

Documentation/Tests

  • Severity P2. Impact: The new fallback coverage still misses the survey-weighted TripleDifference path that is now wrong. The changed TripleDifference fallback test forces logit failure without survey_design, while the only new weighted fallback regression in this PR was added for StaggeredTripleDifference. That gap is why the blocker above is not caught. Location: tests/test_methodology_triple_diff.py:L1435-L1464, tests/test_survey_staggered_ddd.py:L833-L860, tests/test_staggered.py:L3688-L3714. Concrete fix: add a survey-weighted TripleDifference(..., pscore_fallback="unconditional") regression with unequal weights and forced logit failure, parameterized over estimation_method in {"ipw", "dr"}, and assert the fallback probability equals the weighted subgroup-4 share.

Path to Approval

  1. In TripleDifference, change the unconditional survey fallback to use the positive-weight weighted mean of PA4, matching the survey-weighted logit semantics already used elsewhere in the repo.
  2. Add a survey-weighted regression test for TripleDifference fallback in both ipw and dr modes, with unequal weights and forced logit failure, asserting the constant propensity equals the weighted subgroup-4 share and the fit remains finite.

@igerber
Owner Author

igerber commented Apr 1, 2026

/ai-review

@github-actions

github-actions bot commented Apr 1, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: f7d34f28d810b44bc427d152ab3f92fa321bed4f


Overall Assessment

✅ Looks good

Executive Summary

  • The prior P1 from the last review is resolved: TripleDifference now uses the positive-weight weighted mean in its unconditional fallback when survey weights are present, matching the weighted-logit contract in the other estimators. diff_diff/triple_diff.py:1007
  • I did not find a new P0/P1 in the changed estimator, weighting, or inference paths. The core EPV choice itself is methodologically defensible: Peduzzi is the cited source for the 10-EPV heuristic, and later work explicitly treats that cutoff as a rule of thumb rather than a hard law.
  • P2: CallawaySantAnna.diagnose_propensity() does not measure EPV on the same effective sample that fit() actually uses when outcomes are missing or survey weights are zero, so the new pre-check can overstate EPV on those panels. diff_diff/staggered.py:381 diff_diff/staggered.py:465 diff_diff/staggered.py:677 diff_diff/linalg.py:1291
  • P2: the weighted-fallback methodology fix is still not directly locked down by the changed tests. TripleDifference coverage remains unweighted, and the new StaggeredTripleDifference survey regression only checks finiteness instead of the weighted-share fallback it claims to verify. tests/test_methodology_triple_diff.py:1435 tests/test_survey_staggered_ddd.py:836
  • P3: TripleDifferenceResults now stores EPV diagnostics and exposes epv_summary(), but summary() still does not surface them, unlike the companion results classes added in this PR. diff_diff/triple_diff.py:108 diff_diff/triple_diff.py:122 diff_diff/triple_diff.py:285
  • I could not run the changed tests in this environment because pytest and the runtime deps are not installed, so this is a source-based re-review.

Methodology

Affected methods: propensity-score logit diagnostics/fallback in CallawaySantAnna, TripleDifference, and StaggeredTripleDifference.

Code Quality

  • No findings in the changed code paths beyond the items called out elsewhere.

Performance

  • No findings.

Maintainability

Tech Debt

  • No finding requiring new TODO.md tracking. I did not see an existing TODO.md entry that mitigates the P2 items above.

Security

  • No findings.

Documentation/Tests

@igerber
Owner Author

igerber commented Apr 1, 2026

/ai-review

@github-actions

github-actions bot commented Apr 2, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 4412bf0734e1642f50b7944b71ee71031e3d4a8c


Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1.

Executive Summary

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • Severity: P3. Impact: the new TripleDifference fallback warning tells Python users to switch to est_method='reg', but the public API keyword is estimation_method. That is minor, but it points users to a nonexistent parameter in the changed path. Concrete fix: change the warning text to estimation_method='reg'. diff_diff/triple_diff.py:L1040-L1045

Tech Debt

  • No finding requiring new TODO.md tracking.

Security

  • No findings.

Documentation/Tests

Path to Approval

  1. Change the EPV denominator from “predictors plus intercept” to “predictor variables” everywhere this PR computes or documents EPV: solve_logit(), CallawaySantAnna.diagnose_propensity(), and the Registry sections for Callaway-Sant’Anna, TripleDifference, and StaggeredTripleDifference.
  2. Add boundary tests that lock down the Peduzzi denominator after the fix. Example: with 10 minority-class observations and 1 retained covariate, EPV should be 10, not 5; with 2 retained covariates, EPV should be 5.
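The boundary cases in step 2 are easy to pin down against a minimal stand-in (compute_epv here is hypothetical, not the repo's actual helper):

```python
def compute_epv(n_events, n_predictors):
    """Hypothetical stand-in for the EPV calculation after the fix:
    the denominator counts predictor variables only, excluding the intercept."""
    return n_events / n_predictors


# Boundary cases from the review: 10 minority-class observations.
assert compute_epv(10, 1) == 10  # 1 covariate -> EPV 10, not 5
assert compute_epv(10, 2) == 5   # 2 covariates -> EPV 5
```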

@igerber
Owner Author

igerber commented Apr 2, 2026

/ai-review

@github-actions

github-actions bot commented Apr 2, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 9f64cc1af2eafa9f17fe6405cc5e6dc92315aaed


Overall Assessment

✅ Looks good

Highest unmitigated severity: P2.

Executive Summary

  • The prior P1 methodology blocker appears resolved: EPV is now computed with the intercept excluded, and the registry wording matches.
  • The fallback math itself still looks sound on source review: weighted unconditional fallback and strict-mode re-raise paths are present in the changed estimators.
  • Severity P2: the new DR fallback warnings, plus one Callaway-Sant'Anna registry bullet, say fallback drops all covariates, but the DR code still uses covariates in the outcome-regression component.
  • Severity P2: the changed fallback tests still do not pin down the actual weighted fallback constant they claim to exercise.
  • This was a source-based re-review only; python -m pytest --version fails here because pytest is not installed.

Methodology

  • Severity: P2. Impact: the new fallback messaging is semantically wrong for DR. In CallawaySantAnna, the DR code fits outcome-regression nuisance models before the propensity fallback, and in the repeated-cross-section DR path it does the same; TripleDifference and StaggeredTripleDifference also continue to run their OR components when estimation_method="dr". But the new warnings say fallback means “all covariates dropped,” and the Callaway-Sant'Anna registry now says the same. That can mislead users about which estimator actually ran and how to interpret the result. Concrete fix: make the warning/registry text method-specific. Keep “all covariates dropped” for IPW, but for DR say that the propensity model fell back to an unconditional score while the OR component still uses covariates, and that only the PS influence-function correction is skipped. staggered.py:L2346, staggered.py:L2443, staggered.py:L3425, staggered.py:L3519, triple_diff.py:L1040, triple_diff.py:L1103, staggered_triple_diff.py:L1199, staggered_triple_diff.py:L1389, REGISTRY.md:L412
  • Re-review note: I do not see a remaining source-material defect in the EPV denominator. solve_logit() and diagnose_propensity() now exclude the intercept, which matches Peduzzi’s framing of EPV by predictive variables, and the configurable threshold reads as a reasonable implementation choice rather than a source violation. linalg.py:L1304, staggered.py:L459, REGISTRY.md:L407.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No new issue needs TODO.md tracking.

Security

No findings.

Documentation/Tests

Source-based only in this environment: python -m pytest --version fails with No module named pytest.

@igerber
Owner Author

igerber commented Apr 2, 2026

/ai-review

@github-actions

github-actions bot commented Apr 2, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: b4044ee7e7852a5138fe68022f017236962a14c6


Overall Assessment

✅ Looks good

Highest unmitigated severity: P2.

Executive Summary

  • Re-review status: the earlier EPV-denominator source issue looks resolved.
  • The pscore_fallback="error" default change is documented in the Methodology Registry for the affected estimators, so I did not treat that default change as a defect.
  • Remaining P2: DR fallback text is still misleading in one CallawaySantAnna docstring and in the DDD warning surfaces; it says covariates are dropped even though DR still uses the OR covariates.
  • Remaining P2: the updated fallback tests still do not pin the weighted fallback probability or the DR-under-fallback semantics that actually changed.
  • Source-based only here: python -m pytest --version fails because pytest is not installed.

Methodology

Affected methods: propensity-score logit EPV diagnostics and fallback behavior in solve_logit(), CallawaySantAnna, TripleDifference, and StaggeredTripleDifference. No P0/P1 methodology defects found in the changed math or variance paths. solve_logit() now excludes the intercept from the EPV denominator, and Peduzzi et al. define EPV using predictive variables and report no major problems at EPV values of 10 or greater, so the current denominator and default threshold are source-consistent. diff_diff/linalg.py:1305, docs/methodology/REGISTRY.md:407.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. The remaining P2s are not currently tracked in TODO.md:47, so I treated them as live review items rather than accepted debt.

Security

No findings.

Documentation/Tests

Verification note: source-based only here; python -m pytest --version fails with No module named pytest.

@igerber

igerber commented Apr 2, 2026

/ai-review

@github-actions

github-actions bot commented Apr 2, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 6a2e254b0c5936a50ccb182224f3146c3e25a6b8


Overall Assessment

Looks good

Highest unmitigated severity: P2.

Executive Summary

  • No unmitigated P0/P1 issues found in the changed estimator math, weighting, variance/SE handling, identification checks, or documented default-behavior changes.
  • Methodology alignment looks sound: the new EPV check uses minority-class observations per predictor, excluding the intercept, and the pscore_fallback="error" default change is documented in docs/methodology/REGISTRY.md.
  • Re-review status: the earlier misleading DR fallback wording issue appears resolved in the touched public docstrings and warnings.
  • New P2: all three new epv_summary(show_all=False) helpers can return a column-less empty DataFrame when diagnostics exist but none are low.
  • Remaining P2: the updated fallback tests still do not assert the weighted fallback probability or the DR-under-fallback semantics that the code now claims to support.
  • Verification note: I could not run pytest here because python -m pytest --version fails with No module named pytest.

Methodology

  • Severity: None. Impact: Affected methods are the propensity-score logit EPV diagnostics and fallback behavior in CallawaySantAnna, TripleDifference, and StaggeredTripleDifference. I did not find an undocumented mismatch with Peduzzi or the Methodology Registry. solve_logit() now computes EPV as minority-class observations divided by predictor count excluding the intercept, which matches Peduzzi’s “events per predictive variable” framing, and the fallback/default behavior changes are explicitly documented in the registry. Concrete fix: none. diff_diff/linalg.py:L1292-L1328, docs/methodology/REGISTRY.md:L408-L422, docs/methodology/REGISTRY.md:L1246-L1256, docs/methodology/REGISTRY.md:L1300-L1312.

Code Quality

  • Severity: P2. Impact: The new epv_summary(show_all=False) helpers violate their documented return contract when diagnostics exist but none are low. In that case rows stays empty and each method returns pd.DataFrame(rows), which produces a column-less empty frame instead of the promised schema. Any caller that expects group/time/subgroup columns will break on valid high-EPV fits. Concrete fix: return pd.DataFrame(rows, columns=[...]) in all three helpers, and add one regression test for the no-low-EPV case. diff_diff/staggered_results.py:L291-L313, diff_diff/staggered_triple_diff_results.py:L267-L289, diff_diff/triple_diff.py:L317-L338
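The schema fix the finding recommends is the standard pandas idiom: pass an explicit column list to the DataFrame constructor so an empty row list still yields the documented columns. The helper below is a hypothetical reconstruction; the real epv_summary() signature and column names may differ.

```python
import pandas as pd

# Illustrative schema; the library's actual column names may differ.
EPV_COLUMNS = ["group", "time", "epv", "low_epv"]

def epv_summary(diagnostics, show_all=False, threshold=10.0):
    # With an explicit columns= argument, an empty `rows` list still
    # produces a frame with the documented schema instead of a
    # column-less empty DataFrame.
    rows = [d for d in diagnostics if show_all or d["epv"] < threshold]
    return pd.DataFrame(rows, columns=EPV_COLUMNS)

# A single high-EPV cell: show_all=False filters it out, but the
# returned frame keeps the four documented columns.
diags = [{"group": 2004, "time": 2005, "epv": 25.0, "low_epv": False}]
out = epv_summary(diags)
```

Without `columns=EPV_COLUMNS`, `pd.DataFrame([])` has zero columns, which is exactly the contract violation described above.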

Performance

  • Severity: None. Impact: No material performance regression stood out in the changed paths; the EPV bookkeeping is linear in already-materialized arrays and the cache now stores only a small diagnostics dict alongside the coefficient vector. Concrete fix: none.

Maintainability

  • Severity: None. Impact: Parameter propagation looks complete in the changed files. The new epv_threshold and pscore_fallback parameters are validated, exposed via get_params(), and stored on results objects for the touched estimators. Concrete fix: none.

Tech Debt

  • Severity: None. Impact: I did not find matching TODO.md entries that would mitigate the live P2/P3 items, so I treated them as active review findings rather than accepted debt. Concrete fix: none, unless you want to defer them explicitly in TODO.md. TODO.md:L47-L85

Security

  • Severity: None. Impact: No new secret-handling, subprocess, filesystem, or untrusted-input issues are introduced by this diff. Concrete fix: none.

Documentation/Tests

  • Severity: P2. Impact: The fallback tests still do not pin the behavioral contracts this PR changed. The implementations now use weighted positive-mass treated shares for unconditional fallback and preserve the OR covariates in DR fallback, but the changed tests still only assert warnings or finite outputs. A regression back to an unweighted fallback constant, or a DR fallback that effectively drops all covariates, would still pass. Concrete fix: add unequal-weight fixtures that force fallback and assert the exact fallback constant; for DR, add a paired test where only the OR covariates change and the fallback estimate changes accordingly. diff_diff/staggered.py:L2156-L2174, diff_diff/staggered.py:L2443-L2462, diff_diff/staggered.py:L3521-L3538, diff_diff/staggered_triple_diff.py:L1399-L1410, diff_diff/triple_diff.py:L1033-L1048, tests/test_staggered.py:L3490-L3517, tests/test_methodology_triple_diff.py:L1435-L1464, tests/test_survey_staggered_ddd.py:L833-L864
  • Severity: P3. Impact: Two new EPV test comments still use the wrong arithmetic in the configurable-threshold case. The fixture there is 15 events / 2 predictors = 7.5, but the comments say EPV=5, which reintroduces confusion around the intercept-excluded denominator. Concrete fix: update the comments to EPV = 7.5 or rewrite them as 15 events / 2 predictor variables. tests/test_linalg.py:L1728-L1735
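The weighted-fallback assertion the P2 item asks for can be sketched directly: under survey weights, the unconditional fallback propensity should equal the weighted treated share (np.average with weights), and an unequal-weight fixture distinguishes it from the raw count ratio. This is a standalone illustration of the test idea, not the repository's test code.

```python
import numpy as np

# Unequal weights make the weighted and unweighted shares differ, so a
# regression back to np.mean() would be caught.
treated = np.array([1, 1, 0, 0])
weights = np.array([3.0, 1.0, 1.0, 1.0])

weighted_share = np.average(treated, weights=weights)  # (3 + 1) / 6
raw_share = treated.mean()                             # 2 / 4 = 0.5

assert abs(weighted_share - 4.0 / 6.0) < 1e-12
assert weighted_share != raw_share  # the discriminating assertion
```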

igerber and others added 11 commits April 2, 2026 06:48
…ault

Events Per Variable (EPV) check in solve_logit warns when minority-class
observations per parameter falls below threshold (default 10, per Peduzzi
et al. 1996). Affects CallawaySantAnna, TripleDifference, and
StaggeredTripleDiff — all estimators using logit for propensity scores.

New pscore_fallback parameter defaults to "error" instead of silently
dropping covariates when logit fails. Set pscore_fallback="unconditional"
for legacy behavior. diagnose_propensity() method on CallawaySantAnna
enables pre-estimation EPV assessment across cohorts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…cstrings

P0: Zero-fill NaN coefficients from dropped rank-deficient columns before
caching in CS panel IPW/DR paths, preventing NaN propagation on cache
reuse. Matches existing pattern in StaggeredTripleDiff._compute_pscore().

P1: Restore strict-mode semantics so rank_deficient_action="error" always
re-raises regardless of pscore_fallback setting. Update TripleDifference
REGISTRY.md section with pscore_fallback and EPV documentation. Propagate
epv_threshold and pscore_fallback through triple_difference() wrapper.

P2: Store epv_threshold on results objects for correct summary rendering.
Add class docstrings for new parameters. Add regression tests for cache
NaN poisoning and strict-mode interaction with pscore_fallback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
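The zero-fill-before-caching fix this commit describes can be illustrated in a few lines of NumPy. The shapes and values are made up; the point is that a cached coefficient vector containing NaN poisons every later X @ beta, while zero-filled entries for dropped rank-deficient columns keep the linear predictor finite.

```python
import numpy as np

# A solver returns NaN for a dropped collinear column.
beta = np.array([0.4, np.nan, -1.2])

# Zero-fill dropped coefficients before caching: the dropped column then
# simply contributes nothing to the linear predictor.
beta_cached = np.where(np.isnan(beta), 0.0, beta)

X = np.array([[1.0, 5.0, 2.0]])
assert np.isnan(X @ beta).all()           # cache poisoning without the fix
assert np.isfinite(X @ beta_cached).all() # safe to reuse from cache
```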
…e diagnostics

Fix RCS fallback handlers to re-raise when rank_deficient_action="error",
matching the panel path fix. Compute EPV on positive-weight sample only
when weights have zeros (Peduzzi's rule applies to the fitted sample).
Guard diagnose_propensity() against panel=False with NotImplementedError.
Update StaggeredTripleDifference REGISTRY.md section with EPV diagnostics
and pscore_fallback documentation. Cache EPV diagnostic metadata alongside
logit coefficients so cache-hit cells appear in epv_summary(show_all=True).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…pagation

Retain worst-case (minimum) EPV across all g_c comparison cohorts for the
same (g,t) cell instead of overwriting. Cache EPV diagnostic metadata
alongside logit coefficients in _compute_pscore() and propagate on cache
hits, matching the CallawaySantAnna pattern. Add REGISTRY.md note
documenting the cell-level worst-case reporting convention.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
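The worst-case retention rule in this commit reduces to a min-accumulate keyed by cell. A minimal sketch, with illustrative (g, t) keys and EPV values:

```python
# When several comparison cohorts g_c map to the same (g, t) cell, keep
# the minimum EPV seen for that cell rather than overwriting with the
# most recent fit. Keys and values here are illustrative.
cell_epv = {}
fits = [((2004, 2005), 12.0), ((2004, 2005), 6.5), ((2004, 2005), 9.0)]
for cell, epv in fits:
    cell_epv[cell] = min(epv, cell_epv.get(cell, float("inf")))

# The reported value is the worst case (6.5), not the last fit (9.0).
```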
…y guard, result params

Skip propensity-score influence function correction when unconditional
fallback is used (constant pscore has zero estimation uncertainty).
Adds ps_fallback_used flag across all 4 IPW/DR methods (panel+RCS).

Guard diagnose_propensity() against control_group='not_yet_treated'
with NotImplementedError since the control set varies per (g,t) cell.

Propagate pscore_fallback to all three results dataclasses and
epv_threshold to TripleDifferenceResults for full audit trail.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…f handling

Use weighted treated share (np.average with survey weights) for
unconditional fallback propensity instead of raw count ratio. Applies
to all 4 panel/RCS IPW/DR fallback sites.

Normalize np.inf → 0 for never-treated encoding in diagnose_propensity()
to match fit()'s treatment_groups derivation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
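The np.inf → 0 normalization mentioned above is a one-line np.where over the first-treatment-period array; never-treated units encoded as np.inf are mapped to the 0 code that fit()'s treatment_groups derivation uses. Variable names below are assumptions.

```python
import numpy as np

# First treatment period per unit, with np.inf marking never-treated.
first_treated = np.array([2004.0, np.inf, 2006.0, np.inf])

# Normalize to fit()'s encoding: never-treated -> 0.
groups = np.where(np.isinf(first_treated), 0.0, first_treated)
```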
Use weighted subgroup share (np.average with survey weights) for
unconditional fallback instead of raw np.mean(PA4), matching the
survey-weighted logit semantics used in CS and SDDD fallback paths.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…aveat

Add low-EPV diagnostic block to TripleDifferenceResults.summary(),
matching the pattern in CS and SDDD results classes.

Document diagnose_propensity() as a raw-count heuristic that may
overstate EPV vs. fit-time effective sample (missing outcomes, zero
survey weights). Direct users to results.epv_diagnostics for
authoritative per-cell EPV.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… convention)

Peduzzi et al. (1996) define EPV using independent predictor variables,
not including the intercept. Change denominator from k_solve (which
includes the intercept column) to n_predictors = k_solve - 1.

Also fix TripleDifference fallback warning to use correct API keyword
estimation_method (not est_method).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
DR fallback warnings now say propensity model is unconditional while
outcome regression still uses covariates, instead of misleading "all
covariates dropped" text. IPW warnings unchanged.

Update REGISTRY.md fallback description to distinguish IPW vs DR
behavior. Fix docstrings to say "predictor variables (excluding
intercept)" consistently.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ention

Make all fallback warnings and docstrings method-specific: IPW says
covariates dropped, DR says propensity model unconditional while outcome
regression still uses covariates. Update test comments from old
intercept-inclusive arithmetic to predictor-variable counts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Ensure epv_summary(show_all=False) returns DataFrame with correct
column schema even when no entries have low EPV, across all three
results classes. Fix remaining test comments to use intercept-excluded
EPV arithmetic.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber igerber force-pushed the logit-assessment branch from 6a2e254 to 67bc6db on April 2, 2026 at 10:51
@igerber

igerber commented Apr 2, 2026

/ai-review

@github-actions

github-actions bot commented Apr 2, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 67bc6dbbfe34b2b3e9e543f02594c60ebe1003cf


Overall Assessment

✅ Looks good

Highest unmitigated severity: P2.

Executive Summary

  • Re-review status: the prior epv_summary(show_all=False) contract bug is fixed in all three result containers, and the EPV arithmetic comment in the new solve_logit() tests is corrected. diff_diff/staggered_results.py:L282-L314 diff_diff/staggered_triple_diff_results.py:L258-L290 diff_diff/triple_diff.py:L308-L339 tests/test_linalg.py:L1721-L1735
  • No unmitigated P0/P1 issues found in estimator math, weighting, variance/SE handling, identification checks, or the changed default behaviors.
  • Methodology alignment looks sound for the affected methods: solve_logit() EPV diagnostics, CallawaySantAnna IPW/DR propensity fits, TripleDifference subgroup logits, and StaggeredTripleDifference pairwise logits all match the Methodology Registry as updated.
  • Remaining P2: the fallback tests still mostly assert warnings or finite outputs, not the exact weighted unconditional propensity constant, DR-under-fallback semantics, or the new default-"error" behavior across all affected estimators.
  • Verification note: I could not run pytest here because /usr/bin/python does not have pytest installed.

Methodology

Code Quality

Performance

  • Severity: None. Impact: No material performance regression stood out in the changed paths; the EPV bookkeeping is linear in already-materialized arrays, and the cache growth is limited to a small diagnostics dict next to cached coefficients. Concrete fix: none.

Maintainability

Tech Debt

  • Severity: None. Impact: TODO.md does not already track the remaining live test-gap item, so I treated it as an active P2 finding rather than accepted debt. Concrete fix: none beyond the Documentation/Tests item. TODO.md:L47-L87

Security

  • Severity: None. Impact: No new secret-handling, subprocess, filesystem, or untrusted-input issue is introduced by this diff. Concrete fix: none.

Documentation/Tests

@igerber igerber merged commit e1cd6f5 into main Apr 2, 2026
14 checks passed
@igerber igerber deleted the logit-assessment branch April 2, 2026 12:13
@igerber igerber mentioned this pull request Apr 2, 2026