
Surface PowerAnalysis simulation-failure count and narrow except clause#326

Merged
igerber merged 3 commits into main from fix/power-analysis-simulation-failure-rate
Apr 19, 2026

Conversation

@igerber
Owner

@igerber igerber commented Apr 18, 2026

Summary

  • Narrow the blanket except Exception in simulate_power() to (ValueError, np.linalg.LinAlgError, KeyError, RuntimeError, ZeroDivisionError). Programming errors (TypeError, AttributeError, etc.) now propagate instead of being absorbed into the simulation-failure counter.
  • Surface the primary-effect failure count on the result object as SimulationPowerResults.n_simulation_failures, include it in summary(), and add it to to_dict() so it round-trips through to_dataframe().
  • Preserve existing behavior: the > 10% proportional UserWarning still fires per effect size, and the all-failed escape still raises RuntimeError.
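The narrowed catch can be sketched as a per-replicate filter. This is a minimal sketch, not the library's internals: `run_replicates`, `estimator`, and the loop shape are illustrative assumptions; only the caught-exception tuple comes from the PR.

```python
import numpy as np

# Exceptions a reasonable DGP or fit path can raise on adversarial samples
# (the tuple from the PR; everything else here is illustrative).
CAUGHT = (ValueError, np.linalg.LinAlgError, KeyError, RuntimeError, ZeroDivisionError)

def run_replicates(estimator, samples):
    """Tally caught failures; let programming errors propagate."""
    estimates, n_failures = [], 0
    for sample in samples:
        try:
            estimates.append(estimator(sample))
        except CAUGHT:
            n_failures += 1  # counted as a simulation failure
        # TypeError, AttributeError, etc. are NOT caught and propagate.
    return estimates, n_failures
```

With a blanket `except Exception`, the `None + s` bug below would silently inflate the failure counter; under the narrowed tuple it surfaces as a `TypeError` at the call site.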

Methodology references (required if estimator / math changes)

  • Method name(s): simulate_power() (simulation-based power analysis).
  • Paper / source link(s): No estimator math changed. This PR adjusts exception-handling scope and exposes an internal counter on the result object.
  • Any intentional deviations from the source (and why): None. The behavior is now explicitly documented under the PowerAnalysis section of docs/methodology/REGISTRY.md.

Validation

  • Tests added/updated (tests/test_power.py::TestSimulatePower):
    • Failure counter is populated on partial-failure runs and is 0 on clean runs.
    • TypeError propagates out of simulate_power (programming-error contract).
    • Proportional UserWarning still fires above a 10% failure rate, with the same per-effect-size message.
    • n_simulation_failures survives to_dict() serialization.
  • Full TestSimulatePower, TestSimulateMDE, TestSimulateSampleSize, TestEstimatorCoverage suites green (64 tests).
  • Backtest / simulation / notebook evidence (if applicable): N/A — warning-contract and result-surface change, no numeric-contract change.
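The preserved warning contract can be sketched with a hypothetical helper (the `maybe_warn` name and the message wording are assumptions; only the > 10% per-effect-size threshold comes from the PR):

```python
import warnings

def maybe_warn(n_failures, n_simulations, effect_size):
    """Fire the proportional UserWarning when > 10% of replicates failed."""
    if n_failures / n_simulations > 0.10:
        warnings.warn(
            f"{n_failures} of {n_simulations} simulations failed "
            f"for effect size {effect_size}",
            UserWarning,
        )
```

In the sketch above, a helper like this would be called once per effect size after the inner replicate loop, so each effect size gets its own warning.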

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Audit context: resolves finding #11 from the axis-C (silent fallback) slice of the in-flight silent-failures audit. Next up in the plan sequence is PR #7 (axis-A SyntheticDiD diagnostics scale-parity, post-PR #312).

igerber and others added 2 commits April 18, 2026 19:35
Two silent-failure patterns at `power.py:2241` addressed together:

1. Bare `except Exception` absorbed every exception a user-supplied
   estimator or result extractor could raise, including programming
   errors (TypeError, AttributeError, IndexError, ...). Those masked
   bugs counted as "simulation failures" and were aggregated as a
   reliability signal instead of propagating. Narrowed the catch to
   `(ValueError, np.linalg.LinAlgError, KeyError, RuntimeError,
   ZeroDivisionError)` — the set that reasonable DGPs and fit paths
   can raise on adversarial samples.

2. The internal `n_failures` counter was per-effect-size and thrown
   away after the outer loop, leaving users with no programmatic way
   to check whether a run was clean. Surface the primary-effect
   failure count on the results object as
   `SimulationPowerResults.n_simulation_failures` and include it in
   `summary()` output. The proportional > 10% `UserWarning` and the
   raise-if-all-fail escape are preserved.

Covered by audit axis C (silent fallback). Finding #11 from
`docs/audits/silent-failures-findings.md`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The two P2 findings from local AI review were in fact a single issue: the new
`n_simulation_failures` field was added to the dataclass and `summary()`
but not to `to_dict()`. That left a gap between the in-memory result and
the JSON/DataFrame serialization path used by notebooks and pipelines.

- Add `n_simulation_failures` to `SimulationPowerResults.to_dict()`.
- Add a regression test asserting the field round-trips through
  `to_dict()` after a partially-failing run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
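The serialization gap and its fix can be illustrated with a stripped-down stand-in. Field names other than `n_simulation_failures` are illustrative; the real `SimulationPowerResults` carries many more fields.

```python
from dataclasses import dataclass

@dataclass
class PowerResultsSketch:
    # Stand-in for SimulationPowerResults; `power` and `n_simulations`
    # are illustrative placeholders.
    power: float
    n_simulations: int
    n_simulation_failures: int

    def to_dict(self):
        # The fix: the new counter must be serialized too, or it is
        # silently dropped on the JSON/DataFrame path.
        return {
            "power": self.power,
            "n_simulations": self.n_simulations,
            "n_simulation_failures": self.n_simulation_failures,
        }
```

The regression test then just asserts that the counter survives the round trip after a partially failing run.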
@github-actions

Overall Assessment

✅ Looks good. I did not find any unmitigated P0/P1 methodology or correctness issues in the diff. Highest remaining issue is P2.

Executive Summary

  • simulate_power() is the only affected method, and the diff does not change estimator math, weighting, identification assumptions, or variance/SE formulas.
  • The narrowed exception-handling contract and surfaced primary-effect failure count are explicitly documented in the Methodology Registry, so there is no methodology blocker.
  • The new failure counter is wired through SimulationPowerResults, summary(), and to_dict() on the primary-effect path.
  • SimulationPowerResults now has a new required dataclass field, which is a backward-incompatible public __init__ change for callers that construct result objects directly.
  • The new tests cover partial failures and TypeError propagation, but they do not cover the all-failed RuntimeError escape that the PR says is preserved.

Methodology

  • P3 Informational. Impact: the only behavioral change is the per-replicate exception filter plus surfacing the primary-effect failure count; there is no estimator-formula, weighting, identification, or SE/variance change in diff_diff/power.py:L2251-L2312. This contract is explicitly documented in docs/methodology/REGISTRY.md:L2356-L2356, so under the review rubric it is not a defect. Concrete fix: none.

Code Quality

No findings.

Performance

No findings.

Maintainability

Tech Debt

No new untracked technical debt identified in the changed code.

Security

No findings.

Documentation/Tests

  • P3 Impact: the new tests in tests/test_power.py:L559-L690 cover partial failures, zero-failure runs, warning emission, and propagated TypeError, but they do not exercise the changed code’s all-failed branch in diff_diff/power.py:L2263-L2274. That leaves the advertised RuntimeError("All simulations failed. Check estimator and data generator.") contract unguarded. Concrete fix: add a test estimator that always raises a caught exception such as ValueError, and assert the exact RuntimeError message.
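The suggested regression test could look roughly like this. It runs against a stand-in for the guarded loop, since the real `simulate_power` internals are not reproduced here; only the exact `RuntimeError` message comes from the review.

```python
def run_all_replicates(estimator, samples):
    """Stand-in loop with the all-failed escape modeled inline."""
    estimates = []
    for sample in samples:
        try:
            estimates.append(estimator(sample))
        except ValueError:
            pass  # tallied as a simulation failure
    if not estimates:
        raise RuntimeError("All simulations failed. Check estimator and data generator.")
    return estimates

def always_fails(sample):
    # Test estimator that always raises a *caught* exception type,
    # so every replicate lands in the failure path.
    raise ValueError("degenerate sample")
```

Asserting on the exact message guards the advertised contract rather than just the exception type.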

Verification note: I could not run pytest here because pytest is not installed in this environment.

Two items from CI AI review on PR #326:

1. P2 backward-compat: moving `n_simulation_failures` to the end of
   `SimulationPowerResults` with a default of `0`. Users who manually
   instantiate the dataclass with the pre-PR field order continue to
   work; `simulate_power()` still fills the field in via keyword. The
   field remains part of `to_dict()` output (PR-level contract
   unchanged).

2. P3 coverage: adding a regression test for the all-failed escape
   path. An estimator that raises `ValueError` on every replicate now
   asserts both the `RuntimeError("All simulations failed. ...")`
   message and that the narrow-except filter doesn't swallow it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
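The backward-compat fix leans on standard dataclass ordering rules: a defaulted field appended after all non-defaulted fields leaves the pre-existing positional signature intact. A sketch with illustrative field names (only `n_simulation_failures = 0` reflects the actual change):

```python
from dataclasses import dataclass

@dataclass
class PowerResultsSketch:
    power: float                    # illustrative pre-existing fields
    n_simulations: int
    n_simulation_failures: int = 0  # appended last, defaulted

# Old-style positional construction (pre-PR field order) still works:
legacy = PowerResultsSketch(0.82, 500)

# simulate_power() fills the counter in via keyword:
fresh = PowerResultsSketch(0.82, 500, n_simulation_failures=12)
```

Placing the defaulted field anywhere before a non-defaulted one would instead raise a `TypeError` at class definition time, which is why "end of the dataclass" is the only compatible position.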
@igerber
Owner Author

igerber commented Apr 19, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: a1a522d16723db4a4b3d36fc45cab3bb6f3c8a6a


Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings in the diff. Highest remaining item is P3 informational.

Executive Summary

  • This re-review resolves the prior non-blocking items: SimulationPowerResults.n_simulation_failures now defaults to 0, avoiding the constructor compatibility issue, and the all-failed RuntimeError path now has direct test coverage in diff_diff/power.py:L1041-L1041 and tests/test_power.py:L629-L645.
  • simulate_power() is the only affected method, and the diff does not change estimator math, weighting, identification assumptions, or variance/SE formulas; it only narrows caught exception types and surfaces the primary-effect failure counter in the result object at diff_diff/power.py:L2251-L2312.
  • The new exception-handling contract is explicitly documented in the Methodology Registry, so this is a documented behavior change rather than an undocumented methodology deviation at docs/methodology/REGISTRY.md:L2356-L2356.
  • The new field is wired consistently through the changed result surfaces: class docs, dataclass field, summary(), and to_dict() at diff_diff/power.py:L997-L1141.
  • The added tests cover partial failures, zero-failure runs, propagated TypeError, all-failed RuntimeError, warning-threshold behavior, and serialization of n_simulation_failures at tests/test_power.py:L559-L713.
  • I could not execute pytest here because pytest is not installed in this environment.

Methodology

  • Severity: P3. Impact: The only behavioral change is the narrowed per-simulation exception filter plus surfacing the primary-effect failure count; this does not alter the simulation power estimator, weighting, identification, or variance/SE computation, and it matches the explicit registry note in diff_diff/power.py:L2251-L2312 and docs/methodology/REGISTRY.md:L2356-L2356. Concrete fix: None.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

Tech Debt

  • No new untracked technical debt identified in the changed files. I did not find a TODO.md blocker relevant to this diff.

Security

  • No findings.

Documentation/Tests

  • No findings. The prior test gap on the all-failed branch is resolved in tests/test_power.py:L629-L645, and the new tests cover the newly exposed failure counter and programming-error propagation across the changed path in tests/test_power.py:L559-L713.
  • Verification note: I could not run the test suite locally because pytest is unavailable in this environment.

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 19, 2026
@igerber igerber merged commit 61aea00 into main Apr 19, 2026
23 of 24 checks passed
@igerber igerber deleted the fix/power-analysis-simulation-failure-rate branch April 19, 2026 10:06
igerber added a commit that referenced this pull request Apr 20, 2026
…-failures audit

Packages 161 commits across 18 PRs since v3.1.3 as minor release 3.2.0. Per
project SemVer convention, minor bumps are reserved for new estimators or new
module-level public API — BusinessReport / DiagnosticReport / DiagnosticReportResults
(PR #318) add a new public API surface and drive this bump.

Headline work:
- PR #318 BusinessReport + DiagnosticReport (experimental preview) - practitioner-
  ready output layer. Plain-English narrative summaries across all 16 result types,
  with AI-legible to_dict() schemas. See docs/methodology/REPORTING.md.
- PR #327, #335 did-no-untreated foundation - kernel infrastructure, local linear
  regression, HC2/Bell-McCaffrey variance, nprobust port. Foundation for the
  upcoming HeterogeneousAdoptionDiD estimator.
- PR #323, #329, #332 dCDH survey completion - cell-period IF allocator (Class A
  contract), heterogeneity + within-group-varying PSU under Binder TSL, and
  PSU-level Hall-Mammen wild bootstrap at cell granularity.
- PR #333 performance review - docs/performance-scenarios.md documents 5-7
  realistic practitioner workflows; benchmark harness extended.

Silent-failures audit closeouts (PRs #324, #326, #328, #331, #334, #337, #339)
continue the reliability work started in v3.1.2-3.1.3 across axes A/C/E/G/J.

CI infrastructure: PRs #330 and #336 exclude wall-clock timing tests from default
CI after false-positive flakes; perf-review harness is the principled replacement.

Version strings bumped in diff_diff/__init__.py, pyproject.toml, rust/Cargo.toml,
diff_diff/guides/llms-full.txt, and CITATION.cff (version: 3.2.0, date-released:
2026-04-19). CHANGELOG populated with Added / Changed / Fixed sections and the
comparison-link footer. CITATION.cff retains v3.1.3 versioned DOI in identifiers;
the v3.2.0 versioned DOI will be minted by Zenodo on GitHub Release and added in
a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
