Surface PowerAnalysis simulation-failure count and narrow except clause#326
Two silent-failure patterns at `power.py:2241` addressed together:

1. Bare `except Exception` absorbed every exception a user-supplied estimator or result extractor could raise, including programming errors (`TypeError`, `AttributeError`, `IndexError`, ...). Those masked bugs counted as "simulation failures" and were aggregated as a reliability signal instead of propagating. Narrowed the catch to `(ValueError, np.linalg.LinAlgError, KeyError, RuntimeError, ZeroDivisionError)` — the set that reasonable DGPs and fit paths can raise on adversarial samples.
2. The internal `n_failures` counter was per-effect-size and thrown away after the outer loop, leaving users with no programmatic way to check whether a run was clean. Surface the primary-effect failure count on the results object as `SimulationPowerResults.n_simulation_failures` and include it in `summary()` output.

The proportional > 10% `UserWarning` and the raise-if-all-fail escape are preserved. Covered by audit axis C (silent fallback). Finding #11 from `docs/audits/silent-failures-findings.md`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
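A minimal sketch of the narrowed-catch pattern described above. The names `RECOVERABLE_ERRORS`, `run_replicates`, and `estimator` are hypothetical stand-ins; the real `power.py` internals differ.

```python
import numpy as np

# Exceptions a reasonable DGP or fit path can raise on an adversarial
# sample. Anything else (TypeError, AttributeError, IndexError, ...)
# is a programming error and must propagate to the caller.
RECOVERABLE_ERRORS = (
    ValueError,
    np.linalg.LinAlgError,
    KeyError,
    RuntimeError,
    ZeroDivisionError,
)


def run_replicates(estimator, samples):
    """Fit `estimator` on each sample, counting recoverable failures."""
    estimates, n_failures = [], 0
    for sample in samples:
        try:
            estimates.append(estimator(sample))
        except RECOVERABLE_ERRORS:
            n_failures += 1  # counted and surfaced, not silently dropped
    if not estimates:
        # Raise-if-all-fail escape. Raised after the loop, so the
        # narrowed except above cannot swallow it.
        raise RuntimeError("All simulations failed.")
    return estimates, n_failures
```

With this shape, a buggy estimator that raises `TypeError` fails loudly on the first replicate instead of quietly inflating the failure counter.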
The two P2 findings from local AI review were one issue: the new `n_simulation_failures` field was added to the dataclass and `summary()` but not to `to_dict()`. That left a gap between the in-memory result and the JSON/DataFrame serialization path used by notebooks and pipelines.

- Add `n_simulation_failures` to `SimulationPowerResults.to_dict()`.
- Add a regression test asserting the field round-trips through `to_dict()` after a partially-failing run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
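A sketch of the serialization gap and its fix, using a pared-down stand-in for the real dataclass (the actual `SimulationPowerResults` carries many more fields; only the relevant ones are shown):

```python
from dataclasses import dataclass


@dataclass
class SimulationPowerResults:
    power: float  # illustrative; the real dataclass has more fields
    # Appended last with a default, so code using the pre-PR
    # positional field order keeps working.
    n_simulation_failures: int = 0

    def to_dict(self):
        # Serialization must mirror the in-memory result; omitting the
        # new field here is exactly the gap the P2 findings flagged.
        return {
            "power": self.power,
            "n_simulation_failures": self.n_simulation_failures,
        }
```

The regression test then reduces to asserting the round-trip, e.g. `SimulationPowerResults(0.8, 3).to_dict()["n_simulation_failures"] == 3`.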
Overall Assessment ✅ Looks good. I did not find any unmitigated P0/P1 methodology or correctness issues in the diff. The highest remaining issue is P2.

Executive Summary
- Methodology
- Code Quality: No findings.
- Performance: No findings.
- Maintainability
- Tech Debt: No new untracked technical debt identified in the changed code.
- Security: No findings.
- Documentation/Tests

Verification note: I could not run
Two items from CI AI review on PR #326:

1. P2 backward-compat: moving `n_simulation_failures` to the end of `SimulationPowerResults` with a default of `0`. Users who manually instantiate the dataclass with the pre-PR field order continue to work; `simulate_power()` still fills the field in via keyword. The field remains part of `to_dict()` output (PR-level contract unchanged).
2. P3 coverage: adding a regression test for the all-failed escape path. An estimator that raises `ValueError` on every replicate now asserts both the `RuntimeError("All simulations failed. ...")` message and that the narrow-except filter doesn't swallow it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
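The all-failed escape test in item 2 can be sketched as follows, assuming pytest and a hypothetical `simulate_power_stub` that mirrors only the failure-handling shape of the real `simulate_power()`:

```python
import pytest


def simulate_power_stub(estimator, n_sims=5):
    """Hypothetical stand-in for simulate_power()'s failure handling."""
    results, n_failures = [], 0
    for i in range(n_sims):
        try:
            results.append(estimator(i))
        except (ValueError, KeyError, RuntimeError, ZeroDivisionError):
            n_failures += 1
    if not results:
        # Raised outside the try block, so the narrowed except above
        # cannot swallow it even though RuntimeError is in the tuple.
        raise RuntimeError("All simulations failed. Check the estimator/DGP.")
    return results, n_failures


def test_all_failed_raises():
    def always_fails(_sample):
        raise ValueError("degenerate sample")

    # Assert both the escape message and that it actually escapes.
    with pytest.raises(RuntimeError, match="All simulations failed"):
        simulate_power_stub(always_fails)
```

The `match=` argument pins the message prefix without over-specifying it, so later wording tweaks after the period don't break the test.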
/ai-review
🔁 AI review rerun (requested by @igerber). Head SHA:

Overall Assessment ✅ Looks good. No unmitigated P0/P1 findings in the diff. The highest remaining item is P3 informational.

Executive Summary
- Methodology
- Code Quality
- Performance
- Maintainability
- Tech Debt
- Security
- Documentation/Tests
…-failures audit

Packages 161 commits across 18 PRs since v3.1.3 as minor release 3.2.0. Per project SemVer convention, minor bumps are reserved for new estimators or new module-level public API — BusinessReport / DiagnosticReport / DiagnosticReportResults (PR #318) add a new public API surface and drive this bump.

Headline work:
- PR #318 BusinessReport + DiagnosticReport (experimental preview) — practitioner-ready output layer. Plain-English narrative summaries across all 16 result types, with AI-legible `to_dict()` schemas. See `docs/methodology/REPORTING.md`.
- PR #327, #335 did-no-untreated foundation — kernel infrastructure, local linear regression, HC2/Bell-McCaffrey variance, nprobust port. Foundation for the upcoming HeterogeneousAdoptionDiD estimator.
- PR #323, #329, #332 dCDH survey completion — cell-period IF allocator (Class A contract), heterogeneity + within-group-varying PSU under Binder TSL, and PSU-level Hall-Mammen wild bootstrap at cell granularity.
- PR #333 performance review — `docs/performance-scenarios.md` documents 5-7 realistic practitioner workflows; benchmark harness extended.

Silent-failures audit closeouts (PRs #324, #326, #328, #331, #334, #337, #339) continue the reliability work started in v3.1.2-3.1.3 across axes A/C/E/G/J.

CI infrastructure: PRs #330 and #336 exclude wall-clock timing tests from default CI after false-positive flakes; the perf-review harness is the principled replacement.

Version strings bumped in `diff_diff/__init__.py`, `pyproject.toml`, `rust/Cargo.toml`, `diff_diff/guides/llms-full.txt`, and `CITATION.cff` (version: 3.2.0, date-released: 2026-04-19). CHANGELOG populated with Added / Changed / Fixed sections and the comparison-link footer. CITATION.cff retains the v3.1.3 versioned DOI in identifiers; the v3.2.0 versioned DOI will be minted by Zenodo on GitHub Release and added in a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- Narrow `except Exception` in `simulate_power()` to `(ValueError, np.linalg.LinAlgError, KeyError, RuntimeError, ZeroDivisionError)`. Programming errors (`TypeError`, `AttributeError`, etc.) now propagate instead of being absorbed into the simulation-failure counter.
- Surface `SimulationPowerResults.n_simulation_failures`, include it in `summary()`, and add it to `to_dict()` so it round-trips through `to_dataframe()`.
- The proportional `> 10%` `UserWarning` still fires per effect size, and the all-failed escape still raises `RuntimeError`.

Methodology references (required if estimator / math changes)
- `simulate_power()` (simulation-based power analysis).
- `docs/methodology/REGISTRY.md`.

Validation
- New tests (`tests/test_power.py::TestSimulatePower`):
  - `TypeError` propagates out of `simulate_power` (programming-error contract).
  - `UserWarning` still fires above 10% failure rate on the same per-effect-size message.
  - `n_simulation_failures` survives `to_dict()` serialization.
- `TestSimulatePower`, `TestSimulateMDE`, `TestSimulateSampleSize`, `TestEstimatorCoverage` suites green (64 tests).

Security / privacy
Audit context: resolves finding #11 from the axis-C (silent fallback) slice of the in-flight silent-failures audit. Next up in the plan sequence is PR #7 (axis-A SyntheticDiD diagnostics scale-parity, post-PR #312).