
Surface PowerAnalysis simulation-failure count and narrow except clause#326

Merged
igerber merged 3 commits into main from fix/power-analysis-simulation-failure-rate
Apr 19, 2026

Conversation

@igerber
Owner

@igerber igerber commented Apr 18, 2026

Summary

  • Narrow the blanket except Exception in simulate_power() to (ValueError, np.linalg.LinAlgError, KeyError, RuntimeError, ZeroDivisionError). Programming errors (TypeError, AttributeError, etc.) now propagate instead of being absorbed into the simulation-failure counter.
  • Surface the primary-effect failure count on the result object as SimulationPowerResults.n_simulation_failures, include it in summary(), and add it to to_dict() so it round-trips through to_dataframe().
  • Preserve existing behavior: the > 10% proportional UserWarning still fires per effect size, and the all-failed escape still raises RuntimeError.
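The narrowed catch can be sketched as a per-replicate filter. This is a minimal sketch, not the library's internals: `run_replicates`, `estimator`, and the loop shape are illustrative assumptions; only the caught-exception tuple comes from the PR.

```python
import numpy as np

# Exceptions a reasonable DGP or fit path can raise on adversarial samples
# (the tuple from the PR; everything else here is illustrative).
CAUGHT = (ValueError, np.linalg.LinAlgError, KeyError, RuntimeError, ZeroDivisionError)

def run_replicates(estimator, samples):
    """Tally caught failures; let programming errors propagate."""
    estimates, n_failures = [], 0
    for sample in samples:
        try:
            estimates.append(estimator(sample))
        except CAUGHT:
            n_failures += 1  # counted as a simulation failure
        # TypeError, AttributeError, etc. are NOT caught and propagate.
    return estimates, n_failures
```

With a blanket `except Exception`, the `None + s` bug below would silently inflate the failure counter; under the narrowed tuple it surfaces as a `TypeError` at the call site.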

Methodology references (required if estimator / math changes)

  • Method name(s): simulate_power() (simulation-based power analysis).
  • Paper / source link(s): No estimator math changed. This PR adjusts exception-handling scope and exposes an internal counter on the result object.
  • Any intentional deviations from the source (and why): None. The behavior is now explicitly documented under the PowerAnalysis section of docs/methodology/REGISTRY.md.

Validation

  • Tests added/updated (tests/test_power.py::TestSimulatePower):
    • Failure counter is populated on partial-failure runs and is 0 on clean runs.
    • TypeError propagates out of simulate_power (programming-error contract).
    • Proportional UserWarning still fires above a 10% failure rate, with the same per-effect-size message.
    • n_simulation_failures survives to_dict() serialization.
  • Full TestSimulatePower, TestSimulateMDE, TestSimulateSampleSize, TestEstimatorCoverage suites green (64 tests).
  • Backtest / simulation / notebook evidence (if applicable): N/A — warning-contract and result-surface change, no numeric-contract change.
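The preserved warning contract can be sketched with a hypothetical helper (the `maybe_warn` name and the message wording are assumptions; only the > 10% per-effect-size threshold comes from the PR):

```python
import warnings

def maybe_warn(n_failures, n_simulations, effect_size):
    """Fire the proportional UserWarning when > 10% of replicates failed."""
    if n_failures / n_simulations > 0.10:
        warnings.warn(
            f"{n_failures} of {n_simulations} simulations failed "
            f"for effect size {effect_size}",
            UserWarning,
        )
```

In the sketch above, a helper like this would be called once per effect size after the inner replicate loop, so each effect size gets its own warning.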

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Audit context: resolves finding #11 from the axis-C (silent fallback) slice of the in-flight silent-failures audit. Next up in the plan sequence is PR #7 (axis-A SyntheticDiD diagnostics scale-parity, post-PR #312).

igerber and others added 2 commits April 18, 2026 19:35
Two silent-failure patterns at `power.py:2241` addressed together:

1. Bare `except Exception` absorbed every exception a user-supplied
   estimator or result extractor could raise, including programming
   errors (TypeError, AttributeError, IndexError, ...). Those masked
   bugs counted as "simulation failures" and were aggregated as a
   reliability signal instead of propagating. Narrowed the catch to
   `(ValueError, np.linalg.LinAlgError, KeyError, RuntimeError,
   ZeroDivisionError)` — the set that reasonable DGPs and fit paths
   can raise on adversarial samples.

2. The internal `n_failures` counter was per-effect-size and thrown
   away after the outer loop, leaving users with no programmatic way
   to check whether a run was clean. Surface the primary-effect
   failure count on the results object as
   `SimulationPowerResults.n_simulation_failures` and include it in
   `summary()` output. The proportional > 10% `UserWarning` and the
   raise-if-all-fail escape are preserved.

Covered by audit axis C (silent fallback). Finding #11 from
`docs/audits/silent-failures-findings.md`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The two P2 findings from local AI review were in fact a single issue: the new
`n_simulation_failures` field was added to the dataclass and `summary()`
but not to `to_dict()`. That left a gap between the in-memory result and
the JSON/DataFrame serialization path used by notebooks and pipelines.

- Add `n_simulation_failures` to `SimulationPowerResults.to_dict()`.
- Add a regression test asserting the field round-trips through
  `to_dict()` after a partially-failing run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
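The serialization gap and its fix can be illustrated with a stripped-down stand-in. Field names other than `n_simulation_failures` are illustrative; the real `SimulationPowerResults` carries many more fields.

```python
from dataclasses import dataclass

@dataclass
class PowerResultsSketch:
    # Stand-in for SimulationPowerResults; `power` and `n_simulations`
    # are illustrative placeholders.
    power: float
    n_simulations: int
    n_simulation_failures: int

    def to_dict(self):
        # The fix: the new counter must be serialized too, or it is
        # silently dropped on the JSON/DataFrame path.
        return {
            "power": self.power,
            "n_simulations": self.n_simulations,
            "n_simulation_failures": self.n_simulation_failures,
        }
```

The regression test then just asserts that the counter survives the round trip after a partially failing run.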
@github-actions

Overall Assessment

✅ Looks good. I did not find any unmitigated P0/P1 methodology or correctness issues in the diff. Highest remaining issue is P2.

Executive Summary

  • simulate_power() is the only affected method, and the diff does not change estimator math, weighting, identification assumptions, or variance/SE formulas.
  • The narrowed exception-handling contract and surfaced primary-effect failure count are explicitly documented in the Methodology Registry, so there is no methodology blocker.
  • The new failure counter is wired through SimulationPowerResults, summary(), and to_dict() on the primary-effect path.
  • SimulationPowerResults now has a new required dataclass field, which is a backward-incompatible public __init__ change for callers that construct result objects directly.
  • The new tests cover partial failures and TypeError propagation, but they do not cover the all-failed RuntimeError escape that the PR says is preserved.

Methodology

  • P3 Informational. Impact: the only behavioral change is the per-replicate exception filter plus surfacing the primary-effect failure count; there is no estimator-formula, weighting, identification, or SE/variance change in diff_diff/power.py:L2251-L2312. This contract is explicitly documented in docs/methodology/REGISTRY.md:L2356-L2356, so under the review rubric it is not a defect. Concrete fix: none.

Code Quality

No findings.

Performance

No findings.

Maintainability

Tech Debt

No new untracked technical debt identified in the changed code.

Security

No findings.

Documentation/Tests

  • P3 Impact: the new tests in tests/test_power.py:L559-L690 cover partial failures, zero-failure runs, warning emission, and propagated TypeError, but they do not exercise the changed code’s all-failed branch in diff_diff/power.py:L2263-L2274. That leaves the advertised RuntimeError("All simulations failed. Check estimator and data generator.") contract unguarded. Concrete fix: add a test estimator that always raises a caught exception such as ValueError, and assert the exact RuntimeError message.
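The suggested regression test could look roughly like this. It runs against a stand-in for the guarded loop, since the real `simulate_power` internals are not reproduced here; only the exact `RuntimeError` message comes from the review.

```python
def run_all_replicates(estimator, samples):
    """Stand-in loop with the all-failed escape modeled inline."""
    estimates = []
    for sample in samples:
        try:
            estimates.append(estimator(sample))
        except ValueError:
            pass  # tallied as a simulation failure
    if not estimates:
        raise RuntimeError("All simulations failed. Check estimator and data generator.")
    return estimates

def always_fails(sample):
    # Test estimator that always raises a *caught* exception type,
    # so every replicate lands in the failure path.
    raise ValueError("degenerate sample")
```

Asserting on the exact message guards the advertised contract rather than just the exception type.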

Verification note: I could not run pytest here because pytest is not installed in this environment.

Two items from CI AI review on PR #326:

1. P2 backward-compat: moving `n_simulation_failures` to the end of
   `SimulationPowerResults` with a default of `0`. Users who manually
   instantiate the dataclass with the pre-PR field order continue to
   work; `simulate_power()` still fills the field in via keyword. The
   field remains part of `to_dict()` output (PR-level contract
   unchanged).

2. P3 coverage: adding a regression test for the all-failed escape
   path. An estimator that raises `ValueError` on every replicate now
   asserts both the `RuntimeError("All simulations failed. ...")`
   message and that the narrow-except filter doesn't swallow it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
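The backward-compat fix leans on standard dataclass ordering rules: a defaulted field appended after all non-defaulted fields leaves the pre-existing positional signature intact. A sketch with illustrative field names (only `n_simulation_failures = 0` reflects the actual change):

```python
from dataclasses import dataclass

@dataclass
class PowerResultsSketch:
    power: float                    # illustrative pre-existing fields
    n_simulations: int
    n_simulation_failures: int = 0  # appended last, defaulted

# Old-style positional construction (pre-PR field order) still works:
legacy = PowerResultsSketch(0.82, 500)

# simulate_power() fills the counter in via keyword:
fresh = PowerResultsSketch(0.82, 500, n_simulation_failures=12)
```

Placing the defaulted field anywhere before a non-defaulted one would instead raise a `TypeError` at class definition time, which is why "end of the dataclass" is the only compatible position.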
@igerber
Owner Author

igerber commented Apr 19, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: a1a522d16723db4a4b3d36fc45cab3bb6f3c8a6a


Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings in the diff. Highest remaining item is P3 informational.

Executive Summary

  • This re-review resolves the prior non-blocking items: SimulationPowerResults.n_simulation_failures now defaults to 0, avoiding the constructor compatibility issue, and the all-failed RuntimeError path now has direct test coverage in diff_diff/power.py:L1041-L1041 and tests/test_power.py:L629-L645.
  • simulate_power() is the only affected method, and the diff does not change estimator math, weighting, identification assumptions, or variance/SE formulas; it only narrows caught exception types and surfaces the primary-effect failure counter in the result object at diff_diff/power.py:L2251-L2312.
  • The new exception-handling contract is explicitly documented in the Methodology Registry, so this is a documented behavior change rather than an undocumented methodology deviation at docs/methodology/REGISTRY.md:L2356-L2356.
  • The new field is wired consistently through the changed result surfaces: class docs, dataclass field, summary(), and to_dict() at diff_diff/power.py:L997-L1141.
  • The added tests cover partial failures, zero-failure runs, propagated TypeError, all-failed RuntimeError, warning-threshold behavior, and serialization of n_simulation_failures at tests/test_power.py:L559-L713.
  • I could not execute pytest here because pytest is not installed in this environment.

Methodology

  • Severity: P3. Impact: The only behavioral change is the narrowed per-simulation exception filter plus surfacing the primary-effect failure count; this does not alter the simulation power estimator, weighting, identification, or variance/SE computation, and it matches the explicit registry note in diff_diff/power.py:L2251-L2312 and docs/methodology/REGISTRY.md:L2356-L2356. Concrete fix: None.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

Tech Debt

  • No new untracked technical debt identified in the changed files. I did not find a TODO.md blocker relevant to this diff.

Security

  • No findings.

Documentation/Tests

  • No findings. The prior test gap on the all-failed branch is resolved in tests/test_power.py:L629-L645, and the new tests cover the newly exposed failure counter and programming-error propagation across the changed path in tests/test_power.py:L559-L713.
  • Verification note: I could not run the test suite locally because pytest is unavailable in this environment.

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 19, 2026
@igerber igerber merged commit 61aea00 into main Apr 19, 2026
23 of 24 checks passed
@igerber igerber deleted the fix/power-analysis-simulation-failure-rate branch April 19, 2026 10:06
igerber added a commit that referenced this pull request Apr 20, 2026
…-failures audit

Packages 161 commits across 18 PRs since v3.1.3 as minor release 3.2.0. Per
project SemVer convention, minor bumps are reserved for new estimators or new
module-level public API — BusinessReport / DiagnosticReport / DiagnosticReportResults
(PR #318) add a new public API surface and drive this bump.

Headline work:
- PR #318 BusinessReport + DiagnosticReport (experimental preview) - practitioner-
  ready output layer. Plain-English narrative summaries across all 16 result types,
  with AI-legible to_dict() schemas. See docs/methodology/REPORTING.md.
- PR #327, #335 did-no-untreated foundation - kernel infrastructure, local linear
  regression, HC2/Bell-McCaffrey variance, nprobust port. Foundation for the
  upcoming HeterogeneousAdoptionDiD estimator.
- PR #323, #329, #332 dCDH survey completion - cell-period IF allocator (Class A
  contract), heterogeneity + within-group-varying PSU under Binder TSL, and
  PSU-level Hall-Mammen wild bootstrap at cell granularity.
- PR #333 performance review - docs/performance-scenarios.md documents 5-7
  realistic practitioner workflows; benchmark harness extended.

Silent-failures audit closeouts (PRs #324, #326, #328, #331, #334, #337, #339)
continue the reliability work started in v3.1.2-3.1.3 across axes A/C/E/G/J.

CI infrastructure: PRs #330 and #336 exclude wall-clock timing tests from default
CI after false-positive flakes; perf-review harness is the principled replacement.

Version strings bumped in diff_diff/__init__.py, pyproject.toml, rust/Cargo.toml,
diff_diff/guides/llms-full.txt, and CITATION.cff (version: 3.2.0, date-released:
2026-04-19). CHANGELOG populated with Added / Changed / Fixed sections and the
comparison-link footer. CITATION.cff retains v3.1.3 versioned DOI in identifiers;
the v3.2.0 versioned DOI will be minted by Zenodo on GitHub Release and added in
a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
