Surface silent np.linalg.solve fallbacks across axis-A minor solver paths #334
…aths

Addresses findings #17, #18, #19 from the Phase 2 silent-failures audit (axis A, all Minor). Each site previously ran np.linalg.solve against a matrix that could be rank-deficient or near-singular with no user-facing signal.

- StaggeredTripleDifference: `_compute_did_panel` now appends a condition-number sample to an instance tracker on LinAlgError; `fit()` emits ONE aggregate UserWarning listing affected (g, g_c, t) cells and the max condition number instead of silently falling back to np.linalg.lstsq per pair. Tracker resets on repeat fit.
- EfficientDiD covariate sieve (estimate_propensity_ratio_sieve, estimate_inverse_propensity_sieve): precondition-check the normal-equations matrix via np.linalg.cond before solve and reject any K whose condition number exceeds 1/sqrt(eps); partial-K skips now surface via UserWarning listing the skipped K values, instead of being swallowed by `continue`.
- compute_survey_vcov: check cond(X'WX) before the sandwich solve; emit UserWarning above the 1/sqrt(eps) threshold so ill-conditioned bread matrices don't silently produce unstable variance estimates.

Sibling sites picked up via repo-wide lstsq-fallback pattern grep (per the pattern-check feedback memory):

- two_stage.py:1768 (TSL variance bread)
- two_stage_bootstrap.py:197 (multiplier bootstrap bread)

Both now warn before falling back to lstsq.

Adds 8 targeted tests across test_staggered_triple_diff.py, test_efficient_did.py, and test_survey.py, covering collinear/ill-conditioned triggers and happy-path negatives. REGISTRY.md notes added for each affected estimator section. No behavioral change on well-conditioned inputs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
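A minimal sketch of the sieve precondition check and partial-K warning described above, in plain NumPy. It assumes each candidate K maps to a design matrix `X_K` whose normal-equations matrix is `A = X_K' X_K`; the function name `fit_sieve_over_K`, its arguments, and its return value are hypothetical and do not reproduce the `estimate_propensity_ratio_sieve` / `estimate_inverse_propensity_sieve` signatures.

```python
import warnings

import numpy as np

# Threshold quoted in the PR: 1/sqrt(eps) = 2**26 (about 6.7e7) for float64.
COND_THRESHOLD = 1.0 / np.sqrt(np.finfo(np.float64).eps)


def fit_sieve_over_K(designs, y):
    """Hypothetical sketch: solve the normal equations for each candidate K.

    `designs` maps K -> design matrix X_K (n x p_K). Any K whose
    normal-equations matrix A = X_K' X_K has cond(A) > COND_THRESHOLD is
    rejected before calling np.linalg.solve, and a partial skip is
    reported in ONE UserWarning rather than swallowed by `continue`.
    """
    fits, skipped = {}, []
    for K, X in designs.items():
        A = X.T @ X
        if np.linalg.cond(A) > COND_THRESHOLD:
            skipped.append(K)  # explicit rejection, no silent solve
            continue
        fits[K] = np.linalg.solve(A, X.T @ y)
    if skipped and fits:
        warnings.warn(
            f"Sieve: skipped K={skipped} (cond(A) > {COND_THRESHOLD:.2g}); "
            f"continuing with K={sorted(fits)}.",
            UserWarning,
            stacklevel=2,
        )
    # If every K was rejected, `fits` is empty; the estimator's all-K
    # fallback is not modeled in this sketch.
    return fits, skipped
```

Checking `np.linalg.cond` first matters because `np.linalg.solve` only raises on an exactly singular pivot; a merely near-singular A is solved without complaint, so the rejection has to be explicit.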
Review sections: Overall Assessment, Executive Summary, Methodology, Code Quality, Performance, Maintainability, Tech Debt, Security, Documentation/Tests, Path to Approval
…oStage tests

CI review on PR #334 flagged one P1 (incomplete audit of silent solve fallbacks) and two P3s.

P1: Callaway-Sant'Anna `_safe_inv()` sibling site. The shared `_safe_inv(A)` helper in staggered.py was still silently falling back from np.linalg.solve to np.linalg.lstsq. It feeds ~13 analytical SE paths (propensity-score Hessian, OR bread, event-study bread, etc.), so a rank-deficient design could silently ship degraded analytical SEs. Extended `_safe_inv(A, tracker: Optional[list] = None)` to append a condition-number sample on LinAlgError when a tracker is passed. Initialize `self._safe_inv_tracker: List[float] = []` at the top of `CallawaySantAnna.fit()`, thread `tracker=self._safe_inv_tracker` through all 13 callsites, and emit ONE aggregate UserWarning at the end of fit() listing the number of fallbacks and max condition number. Matches the tracker pattern established in STD finding #17. Added TestCallawaySantAnnaSafeInvFallback with two tests: collinear covariates trigger the aggregate warning; well-conditioned data emits no warning (happy-path regression guard). REGISTRY.md §CallawaySantAnna notes the new warning contract.

P3: Sieve docstrings lag behavior. Updated estimate_propensity_ratio_sieve and estimate_inverse_propensity_sieve docstrings to describe the new cond(A) > 1/sqrt(eps) precondition check, the partial-K skip warning, and the all-K fallback semantics.

P3: No TwoStage regression coverage. Added TestTwoStageStage2BreadWarning with two tests covering both the analytical TSL and bootstrap bread paths (contract: if the lstsq fallback triggers, it must warn).

TODO.md: logged honest_did.py:1907 basis-enumeration skip as an intentional algorithm behavior (not a silent failure per the Phase 2 audit definition) but notable for a future diagnostic enhancement.

No behavioral change on well-conditioned inputs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:

Review sections: Overall Assessment, Executive Summary, Methodology, Code Quality, Performance, Maintainability, Tech Debt, Security, Documentation/Tests, Path to Approval
…+ stronger tests

Round 2 CI review surfaced one P1 and two P3s.

P1: StaggeredTripleDifference PS-Hessian fallback. `_compute_pscore` had a second silent lstsq fallback (np.linalg.inv(X'WX) → lstsq on LinAlgError) that was missed in the initial PR. Under IPW/DR inference, a rank-deficient propensity-score design could silently degrade influence-function corrections. Added a separate `self._ps_lstsq_fallback_tracker` alongside the existing OR tracker; `_compute_pscore` appends a condition-number sample per LinAlgError. `fit()` emits a sibling aggregate UserWarning with cell count + max condition number. Added TestStaggeredTripleDiffORSolveFallback::test_collinear_covariates_emit_ps_hessian_warning, which forces the PS-Hessian path under estimation_method="ipw". REGISTRY note added. Also scoped the existing OR-side test to the OR message text so the two aggregate warnings don't collide in the assertion.

P3: TwoStage warning text accuracy. Reviewer correctly pointed out that "drop collinear covariates" was misleading because X_2 is the Stage-2 indicator design (treatment, event-time, or group dummies), not user covariates. Reworded both the analytical and bootstrap warnings to name the actual failure mode (a zero-weight or all-zero indicator column from an aggregation path with no qualifying observations).

P3: TwoStage tests were not verifying the warning path. Reviewer noted that my previous z1/z2 collinear-covariate tests never reached X_2'WX_2 at all because user covariates go into Stage 1, not Stage 2. Rewrote both tests to patch np.linalg.solve and raise LinAlgError specifically on the `solve(X'WX, np.eye(k))` shape, forcing the Stage-2 bread fallback. Tests now directly assert that the warning fires and that the lstsq fallback still produces finite SEs.

No behavioral change on well-conditioned inputs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:

Overall Assessment: Looks good

Review sections: Executive Summary, Methodology, Code Quality, Performance, Maintainability, Tech Debt, Security, Documentation/Tests
PR #330 marked `test_timing_performance` and `TestPerformanceRegression` with `@pytest.mark.slow`, which the default pytest `addopts = "-m 'not slow'"` already excludes. That covers the default Python CI matrix but misses the Rust-backend CI jobs at `.github/workflows/rust-test.yml:155, 162, 190`, which explicitly override the marker filter with `-m ''` so they can exercise the full slow suite (intentionally, since the TROP parity tests live there). That is why PR #334 tripped the 0.1s threshold (0.120s measured) on Windows py3.11 under the Rust backend.

Add a `skipif(os.environ.get("CI") == "true", ...)` marker in addition to `@pytest.mark.slow` on the affected tests:

- `test_se_accuracy.py::TestCallawaySantAnnaSEAccuracy::test_timing_performance`
- `test_se_accuracy.py::TestPerformanceRegression` (class-level)
- `test_methodology_honest_did.py::TestOptimalFLCI::test_m0_short_circuit`

GitHub Actions sets `CI=true` on every runner, so the skip covers both the default-CI and Rust-CI invocation patterns. Local development flows (`pytest`, `pytest -m slow`, `pytest -m ''`) are unaffected: without a `CI` env var, the tests still run as on-demand performance sanity checks.

The `test_m0_short_circuit` case is special: it uses wall-clock time as a proxy for "short-circuit path taken" (the fast path finishes in under 0.5s; the slow optimization would blow past that). The existing PR #330 TODO.md entry already tracks replacing it with a mock/spy; the `skipif` here is the interim guard until that refactor lands.

Verified locally: `CI=true pytest ... -m ''` reports 5 skipped (all three targets); with `CI` unset, the tests run and pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
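A sketch of stacking the two markers, assuming the repo registers a `slow` marker (implied by the `-m 'not slow'` addopts). The `running_in_ci` helper, the `skip_in_ci` name, the `reason` string, and the test name are hypothetical; the commit itself writes the condition inline as `os.environ.get("CI") == "true"`.

```python
import os

import pytest


def running_in_ci(env=None) -> bool:
    """Return True when CI is the string "true" (GitHub Actions sets CI=true)."""
    env = os.environ if env is None else env
    return env.get("CI") == "true"


# Hypothetical reusable marker. `slow` alone keeps the default
# `-m 'not slow'` run from selecting the test; `skipif` additionally
# covers CI jobs that override the filter with `-m ''`.
skip_in_ci = pytest.mark.skipif(
    running_in_ci(),
    reason="wall-clock timing assertion; flaky on shared CI runners",
)


@pytest.mark.slow
@skip_in_ci
def test_timing_performance_example():
    ...  # the real timing assertion would go here
```

Because the condition reads an environment variable, `CI=true pytest -m ''` reproduces the CI skip locally, which is the same check the commit message describes under "Verified locally".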
…-failures audit

Packages 161 commits across 18 PRs since v3.1.3 as minor release 3.2.0. Per project SemVer convention, minor bumps are reserved for new estimators or new module-level public API; BusinessReport / DiagnosticReport / DiagnosticReportResults (PR #318) add a new public API surface and drive this bump.

Headline work:

- PR #318 BusinessReport + DiagnosticReport (experimental preview): practitioner-ready output layer. Plain-English narrative summaries across all 16 result types, with AI-legible to_dict() schemas. See docs/methodology/REPORTING.md.
- PRs #327, #335 did-no-untreated foundation: kernel infrastructure, local linear regression, HC2/Bell-McCaffrey variance, nprobust port. Foundation for the upcoming HeterogeneousAdoptionDiD estimator.
- PRs #323, #329, #332 dCDH survey completion: cell-period IF allocator (Class A contract), heterogeneity + within-group-varying PSU under Binder TSL, and PSU-level Hall-Mammen wild bootstrap at cell granularity.
- PR #333 performance review: docs/performance-scenarios.md documents 5-7 realistic practitioner workflows; benchmark harness extended.

Silent-failures audit closeouts (PRs #324, #326, #328, #331, #334, #337, #339) continue the reliability work started in v3.1.2-3.1.3 across axes A/C/E/G/J.

CI infrastructure: PRs #330 and #336 exclude wall-clock timing tests from default CI after false-positive flakes; the perf-review harness is the principled replacement.

Version strings bumped in diff_diff/__init__.py, pyproject.toml, rust/Cargo.toml, diff_diff/guides/llms-full.txt, and CITATION.cff (version: 3.2.0, date-released: 2026-04-19). CHANGELOG populated with Added / Changed / Fixed sections and the comparison-link footer. CITATION.cff retains the v3.1.3 versioned DOI in identifiers; the v3.2.0 versioned DOI will be minted by Zenodo on GitHub Release and added in a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Phase 2 silent-failures audit, axis-A minor solver paths (findings #17, #18, #19), all Minor severity. Each site previously ran `np.linalg.solve` against a matrix that could be rank-deficient or near-singular with no user-facing signal.

- `staggered_triple_diff.py:1330`: `_compute_did_panel` records a condition-number sample in an instance tracker on `LinAlgError`; `fit()` emits ONE aggregate `UserWarning` at the end listing affected (g, g_c, t) cells and the max condition number, instead of silently falling back to `np.linalg.lstsq` per pair. Tracker resets on repeat `fit()`.
- `efficient_did_covariates.py:253,401`: precondition-check the normal-equations matrix via `np.linalg.cond` before `np.linalg.solve`; a near-singular A whose condition number exceeds `1/sqrt(eps)` (≈ 6.7e7) is rejected explicitly. Partial-K skips now surface via `UserWarning` listing the skipped K values, instead of being swallowed by `continue`.
- `survey.py:1450`: `compute_survey_vcov` checks `cond(X'WX)` before the sandwich solve and warns above the `1/sqrt(eps)` threshold so ill-conditioned bread matrices don't silently produce unstable SEs.

Sibling sites picked up via repo-wide `linalg.solve` pattern grep (per the pattern-check feedback memory):

- `two_stage.py:1768`: TSL variance bread's silent `lstsq` fallback now warns
- `two_stage_bootstrap.py:197`: multiplier bootstrap bread's silent `lstsq` fallback now warns

No behavioral change on well-conditioned inputs. The behavior change is strictly additive warnings plus a K-rejection precondition check that can only produce a fit at least as well-conditioned as before.
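A minimal sketch of a conditioning check ahead of a sandwich variance solve, in the spirit of the `compute_survey_vcov` change. The function `sandwich_vcov`, its arguments, and the warning text are hypothetical; the survey design's actual meat matrix (PSU/strata aggregation of scores) is out of scope here, so `meat` is simply passed in and assumed symmetric.

```python
import warnings

import numpy as np

# 1/sqrt(eps) for float64 is exactly 2**26, about 6.7e7.
COND_THRESHOLD = 1.0 / np.sqrt(np.finfo(np.float64).eps)


def sandwich_vcov(X: np.ndarray, w: np.ndarray, meat: np.ndarray) -> np.ndarray:
    """Hypothetical sketch: V = B^{-1} M B^{-1} with bread B = X'WX.

    Warns (but still solves) when cond(B) exceeds COND_THRESHOLD, so an
    ill-conditioned bread matrix is surfaced instead of silently producing
    unstable variance estimates. Assumes `meat` (M) is symmetric.
    """
    bread = X.T @ (w[:, None] * X)
    cond = np.linalg.cond(bread)
    if cond > COND_THRESHOLD:
        warnings.warn(
            f"Bread matrix X'WX is ill-conditioned (cond={cond:.3g} > "
            f"{COND_THRESHOLD:.3g}); variance estimates may be unstable.",
            UserWarning,
            stacklevel=2,
        )
    left = np.linalg.solve(bread, meat)  # B^{-1} M
    # B^{-1} (B^{-1} M)^T = B^{-1} M B^{-1} for symmetric B and M.
    return np.linalg.solve(bread, left.T)
```

The check warns rather than raising, matching the "strictly additive warnings" contract: well-conditioned inputs take exactly the same solve path as before.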
Out of scope, to be evaluated in a follow-up if needed:

- `staggered.py:_safe_inv` helper (called many times in CS for analytical SE paths; silent `lstsq` fallback per call). The bread matrices it operates on are downstream of `solve_ols`, which already signals rank deficiency via `rank_deficient_action`.

Methodology references (required if estimator / math changes)
Validation
- `tests/test_staggered_triple_diff.py::TestStaggeredTripleDiffORSolveFallback` (3 tests: collinear covariates warn, well-conditioned no-warning, no-covariates no-warning)
- `tests/test_efficient_did.py::TestSievePartialKSkipWarning` (3 tests: ratio-sieve partial skip warns, inverse-propensity partial skip warns, clean data no-warning)
- `tests/test_survey.py::TestSurveyVcovIllConditionedWarning` (2 tests: ill-conditioned X warns, well-conditioned no-warning)
- `fit()` calls on the same estimator.
- `tests/test_staggered_triple_diff.py`, `tests/test_efficient_did.py`, `tests/test_survey.py`, `tests/test_two_stage.py` pass locally (379 tests).

Security / privacy
Generated with Claude Code