Skip to content

Surface silent np.linalg.solve fallbacks across axis-A minor solver paths#334

Merged
igerber merged 3 commits intomainfrom
fix/axis-a-minor-solver-paths
Apr 19, 2026
Merged

Surface silent np.linalg.solve fallbacks across axis-A minor solver paths#334
igerber merged 3 commits intomainfrom
fix/axis-a-minor-solver-paths

Conversation

@igerber
Copy link
Copy Markdown
Owner

@igerber igerber commented Apr 19, 2026

Summary

Phase 2 silent-failures audit — axis-A minor solver paths (findings #17, #18, #19), all Minor severity. Each site previously ran np.linalg.solve against a matrix that could be rank-deficient or near-singular with no user-facing signal.

  • Fix quick start guide output accuracy #17 StaggeredTripleDifference per-pair solve (staggered_triple_diff.py:1330): _compute_did_panel records a condition-number sample in an instance tracker on LinAlgError; fit() emits ONE aggregate UserWarning at the end listing affected (g, g_c, t) cells and the max condition number, instead of silently falling back to np.linalg.lstsq per pair. Tracker resets on repeat fit().
  • Fix result summary table to appropriately show the significance level in the table #18 EfficientDiD covariate sieve (efficient_did_covariates.py:253,401): Precondition-check the normal-equations matrix via np.linalg.cond before np.linalg.solve — near-singular A above 1/sqrt(eps) (≈ 6.7e7) is rejected explicitly. Partial-K skips now surface via UserWarning listing skipped K values, instead of being swallowed by continue.
  • Fix placebo test examples in README documentation #19 Survey TSL variance (survey.py:1450): compute_survey_vcov checks cond(X'WX) before the sandwich solve and warns above the 1/sqrt(eps) threshold so ill-conditioned bread matrices don't silently produce unstable SEs.

Sibling sites picked up via repo-wide linalg.solve pattern grep (per the pattern-check feedback memory):

  • two_stage.py:1768 — TSL variance bread silent lstsq fallback → now warns
  • two_stage_bootstrap.py:197 — multiplier bootstrap bread silent lstsq fallback → now warns

No behavioral change on well-conditioned inputs. Behavior change is strictly additive warnings + K-rejection precondition check that can only produce a fit that is at-least-as-well-conditioned as before.

Out of scope, to be evaluated in a follow-up if needed: staggered.py:_safe_inv helper (called many times in CS for analytical SE paths; silent lstsq fallback per call). The bread matrices it operates on are downstream of solve_ols which already signals rank deficiency via rank_deficient_action.

Methodology references (required if estimator / math changes)

  • Method name(s): Ortiz-Villavicencio & Sant'Anna (2025) StaggeredTripleDifference; Chen, Sant'Anna & Xie (2025) EfficientDiD; Binder (1983) / Lumley (2004) TSL survey variance
  • Paper / source link(s): REGISTRY.md sections §StaggeredTripleDifference (updated), §EfficientDiD (updated), §Survey Data Support / Taylor Series Linearization (updated)
  • Any intentional deviations from the source (and why): None. These are defensive signaling changes; numerical behavior on well-conditioned inputs is unchanged.

Validation

  • Tests added/updated:
    • tests/test_staggered_triple_diff.py::TestStaggeredTripleDiffORSolveFallback (3 tests: collinear covariates warn, well-conditioned no-warning, no-covariates no-warning)
    • tests/test_efficient_did.py::TestSievePartialKSkipWarning (3 tests: ratio-sieve partial skip warns, inverse-propensity partial skip warns, clean data no-warning)
    • tests/test_survey.py::TestSurveyVcovIllConditionedWarning (2 tests: ill-conditioned X warns, well-conditioned no-warning)
    • Repeat-fit idempotency verified manually: instance tracker resets cleanly between consecutive fit() calls on the same estimator.
  • Backtest / simulation / notebook evidence (if applicable): N/A — additive-warning PR.
  • Full tests/test_staggered_triple_diff.py, tests/test_efficient_did.py, tests/test_survey.py, tests/test_two_stage.py pass locally (379 tests).

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

…aths

Addresses findings #17, #18, #19 from the Phase 2 silent-failures audit (axis A,
all Minor). Each site previously ran np.linalg.solve against a matrix that
could be rank-deficient or near-singular with no user-facing signal.

- StaggeredTripleDifference: `_compute_did_panel` now appends a condition-number
  sample to an instance tracker on LinAlgError; `fit()` emits ONE aggregate
  UserWarning listing affected (g, g_c, t) cells and the max condition number
  instead of silently falling back to np.linalg.lstsq per pair. Tracker resets
  on repeat fit.
- EfficientDiD covariate sieve (estimate_propensity_ratio_sieve,
  estimate_inverse_propensity_sieve): precondition-check the normal-equations
  matrix via np.linalg.cond before solve and reject K values above
  1/sqrt(eps); partial-K skips now surface via UserWarning listing the
  skipped K values, instead of being swallowed by `continue`.
- compute_survey_vcov: check cond(X'WX) before the sandwich solve; emit
  UserWarning above the 1/sqrt(eps) threshold so ill-conditioned bread
  matrices don't silently produce unstable variance estimates.

Sibling sites picked up via repo-wide lstsq-fallback pattern grep (per
the pattern-check feedback memory):
- two_stage.py:1768 (TSL variance bread)
- two_stage_bootstrap.py:197 (multiplier bootstrap bread)
Both now warn before the silent lstsq fallback.

Adds 8 targeted tests across test_staggered_triple_diff.py,
test_efficient_did.py, and test_survey.py, covering collinear/ill-conditioned
triggers and happy-path negatives. REGISTRY.md notes added for each affected
estimator section. No behavioral change on well-conditioned inputs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

Overall Assessment

⚠️ Needs changes

Executive Summary

  • No unmitigated P0s in the diff.
  • The methodology-affecting changes that are in the diff are consistent with the new registry notes: EfficientDiD’s sieve K precheck is documented, and the StaggeredTripleDiff / Survey / TwoStage changes are warning-surface additions rather than undocumented estimator changes.
  • No new inline inference or partial-NaN anti-patterns were introduced in the touched files.
  • P1: the repo-wide silent-solver audit is incomplete. The unchanged helper diff_diff/staggered.py:L95-L100 still silently falls back from np.linalg.solve to np.linalg.lstsq on Callaway-Sant’Anna analytical inference paths, which is the same pattern this PR is trying to surface.
  • P3: the sieve helper docstrings lag the new condition-number rejection behavior.
  • P3: there is no direct regression coverage for the new TwoStage Stage-2 bread warnings. I also could not execute tests in this sandbox because pytest/numpy are unavailable.

Methodology

Code Quality

  • No findings. I did not find new inline t = effect / se inference code or partial NaN guards in the changed files.

Performance

  • No material findings. The added cond() checks are on small normal-equation / bread matrices and are unlikely to dominate estimator runtime.

Maintainability

  • Severity: P3
    Impact: The sieve helper docstrings still describe only singular-basis fallback and do not mention the new cond(A) > 1/sqrt(eps) rejection or partial-K warning surface, so the in-code documentation now lags the implemented behavior. See diff_diff/efficient_did_covariates.py:L161-L175 and diff_diff/efficient_did_covariates.py:L343-L356.
    Concrete fix: Update both docstrings to mention the condition-number precheck, partial-K warnings, and all-K fallback semantics.

Tech Debt

  • No separate findings. No new deferred-work entry was added to TODO.md:L51-L85, so the remaining P1 above is currently untracked.

Security

  • No findings.

Documentation/Tests

Path to Approval

  1. Surface the remaining _safe_inv() fallback in Callaway-Sant’Anna analytical inference with a user-visible warning, or equivalent aggregated warning, instead of leaving it silent.
  2. Add a regression test that drives at least one Callaway-Sant’Anna analytical path through that fallback and asserts the warning is emitted.

…oStage tests

CI review on PR #334 flagged one P1 (incomplete audit of silent solve fallbacks)
and two P3s.

P1 — Callaway-Sant'Anna `_safe_inv()` sibling site:
  The shared `_safe_inv(A)` helper in staggered.py was still silently falling
  back from np.linalg.solve to np.linalg.lstsq. It feeds ~13 analytical SE
  paths (propensity-score Hessian, OR bread, event-study bread, etc.), so
  a rank-deficient design could silently ship degraded analytical SEs.

  Extended `_safe_inv(A, tracker: Optional[list] = None)` to append a
  condition-number sample on LinAlgError when a tracker is passed. Initialize
  `self._safe_inv_tracker: List[float] = []` at the top of
  `CallawaySantAnna.fit()`, thread `tracker=self._safe_inv_tracker` through
  all 13 callsites, and emit ONE aggregate UserWarning at the end of fit()
  listing the number of fallbacks and max condition number. Matches the
  tracker pattern established in STD finding #17.

  Added TestCallawaySantAnnaSafeInvFallback with two tests: collinear
  covariates trigger the aggregate warning; well-conditioned data emits no
  warning (happy-path regression guard).

  REGISTRY.md §CallawaySantAnna notes the new warning contract.

P3 — Sieve docstrings lag behavior: Updated estimate_propensity_ratio_sieve
  and estimate_inverse_propensity_sieve docstrings to describe the new
  cond(A) > 1/sqrt(eps) precondition check, the partial-K skip warning,
  and the all-K fallback semantics.

P3 — No TwoStage regression coverage: Added TestTwoStageStage2BreadWarning
  with two tests covering both the analytical TSL and bootstrap bread
  paths (contract: if the lstsq fallback triggers, it must warn).

TODO.md: logged honest_did.py:1907 basis-enumeration skip as an intentional
algorithm behavior (not a silent failure per the Phase 2 audit definition)
but notable for a future diagnostic enhancement.

No behavioral change on well-conditioned inputs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 19, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 90a38316506e2604030627ffba091edb8f441eae


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The previous P1 on CallawaySantAnna._safe_inv() appears addressed: the fallback is now tracked, surfaced once per fit, and documented in the Methodology Registry.
  • EfficientDiD’s new sieve cond(A) rejection is documented in REGISTRY.md, so that methodology deviation is informational rather than defective.
  • Severity: P1 [Newly identified]. StaggeredTripleDifference still has a silent lstsq fallback in the propensity-score Hessian path, so the silent-solver audit is incomplete for IPW/DR inference.
  • Survey and TwoStage changes are additive warning surfaces; I did not find new inline-inference or partial-NaN anti-patterns in the touched code.
  • Severity: P3. The new TwoStage tests do not actually force the warned Stage-2 bread path, so that warning surface is still effectively unverified.
  • Review limitation: pytest is not installed in this sandbox, so this is a static review only.

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

Tech Debt

  • Severity: P3-informational. Impact: The new honest_did.py diagnostic follow-up is properly tracked in TODO.md:L85, so it does not block this PR. Concrete fix: None.

Security

  • No findings.

Documentation/Tests

Path to Approval

  1. Surface the remaining StaggeredTripleDifference PS-Hessian fallback in _compute_pscore() with a user-visible warning, preferably using the same aggregate-warning pattern already added for the OR fallback.
  2. Add a regression test that forces that PS-Hessian path under estimation_method="ipw" or "dr" and asserts the warning is emitted.

…+ stronger tests

Round 2 CI review surfaced one P1 and two P3s.

P1 — StaggeredTripleDifference PS-Hessian fallback:
  `_compute_pscore` had a second silent lstsq fallback (np.linalg.inv(X'WX)
  → lstsq on LinAlgError) that was missed in the initial PR. Under IPW/DR
  inference a rank-deficient propensity-score design could silently degrade
  influence-function corrections.

  Added a separate `self._ps_lstsq_fallback_tracker` alongside the existing
  OR tracker; `_compute_pscore` appends a condition-number sample per
  LinAlgError. `fit()` emits a sibling aggregate UserWarning with cell
  count + max condition number. Added TestStaggeredTripleDiffORSolveFallback
  ::test_collinear_covariates_emit_ps_hessian_warning which forces the
  PS-Hessian path under estimation_method="ipw". REGISTRY note added.

  Also scoped the existing OR-side test to the OR message text so the two
  aggregate warnings don't collide in the assertion.

P3 — TwoStage warning text accuracy:
  Reviewer correctly pointed out that "drop collinear covariates" was
  misleading because X_2 is the Stage-2 indicator design (treatment,
  event-time, or group dummies), not user covariates. Reworded both the
  analytical and bootstrap warnings to name the actual failure mode (zero-
  weight or all-zero indicator column from an aggregation path with no
  qualifying observations).

P3 — TwoStage tests were not verifying the warning path:
  Reviewer noted that my previous z1/z2 collinear-covariate tests never
  reached X_2'WX_2 at all because user covariates go into Stage 1, not
  Stage 2. Rewrote both tests to patch np.linalg.solve and raise
  LinAlgError specifically on the `solve(X'WX, np.eye(k))` shape, forcing
  the Stage-2 bread fallback. Tests now directly assert the warning fires
  and the lstsq fallback still produces finite SEs.

No behavioral change on well-conditioned inputs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 19, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 099507dc1987605a8cc9087c7c9c8ee6e1b2668d


Overall Assessment

Looks good

Executive Summary

  • All previously raised findings appear addressed in the changed code.
  • The prior P1 on the silent StaggeredTripleDifference propensity-score Hessian fallback is resolved: the fallback is now surfaced once per fit(), documented in the Methodology Registry, and covered by a regression test in the IPW path.
  • EfficientDiD's new sieve cond(A) rejection changes behavior only in a documented registry-noted way, so it is informational rather than a methodology defect.
  • The previous TwoStage warning-text and test-reachability concerns are resolved: the warning now points to the singular Stage-2 indicator design, and the tests directly force both analytical and bootstrap bread fallback handlers.
  • I did not find new incorrect variance/SE handling, NaN-inference regressions, control-group errors, or parameter-propagation gaps in the touched code.
  • Review limitation: pytest is unavailable in this sandbox, so this is a static review only.

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3-informational. Impact: The out-of-scope honest_did.py silent basis-rejection diagnostic is now explicitly tracked in TODO.md, so the remaining follow-up is properly deferred and non-blocking under the stated policy. See TODO.md:L85-L85 and diff_diff/honest_did.py:L1906-L1909. Concrete fix: None.

Security

  • No findings.

Documentation/Tests

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 19, 2026
@igerber igerber merged commit 4a48487 into main Apr 19, 2026
22 of 24 checks passed
@igerber igerber deleted the fix/axis-a-minor-solver-paths branch April 19, 2026 18:18
igerber added a commit that referenced this pull request Apr 19, 2026
PR #330 marked `test_timing_performance` and `TestPerformanceRegression`
with `@pytest.mark.slow`, which the default pytest `addopts = "-m 'not
slow'"` already excludes. That catches the default Python CI matrix but
misses the Rust-backend CI jobs at `.github/workflows/rust-test.yml:155,
162, 190`, which explicitly override the marker filter with `-m ''` so
they can exercise the full slow suite (intentional — TROP parity tests
live there). That's why our PR #334 tripped a 0.120s vs 0.1s threshold
on Windows py3.11 under the Rust backend.

Add a `skipif(os.environ.get("CI") == "true", ...)` marker in addition
to `@pytest.mark.slow` on the affected tests:
- `test_se_accuracy.py::TestCallawaySantAnnaSEAccuracy::test_timing_performance`
- `test_se_accuracy.py::TestPerformanceRegression` (class-level)
- `test_methodology_honest_did.py::TestOptimalFLCI::test_m0_short_circuit`

GitHub Actions sets `CI=true` on every runner, so the skip covers both
the default-CI and Rust-CI invocation patterns. Local development flows
(`pytest`, `pytest -m slow`, `pytest -m ''`) are unaffected — no `CI`
env var means the tests still run as on-demand performance sanity.

The `test_m0_short_circuit` case is special: it uses wall-clock time as
a proxy for "short-circuit path taken" (fast path <0.5s, slow
optimization would blow past that). The existing PR #330 TODO.md entry
already tracks replacing it with a mock/spy; the `skipif` here is the
interim guard until that refactor lands.

Verified locally: `CI=true pytest ... -m ''` reports 5 skipped (all
three targets); unset `CI` and the tests run and pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 20, 2026
…-failures audit

Packages 161 commits across 18 PRs since v3.1.3 as minor release 3.2.0. Per
project SemVer convention, minor bumps are reserved for new estimators or new
module-level public API — BusinessReport / DiagnosticReport / DiagnosticReportResults
(PR #318) add a new public API surface and drive this bump.

Headline work:
- PR #318 BusinessReport + DiagnosticReport (experimental preview) - practitioner-
  ready output layer. Plain-English narrative summaries across all 16 result types,
  with AI-legible to_dict() schemas. See docs/methodology/REPORTING.md.
- PR #327, #335 did-no-untreated foundation - kernel infrastructure, local linear
  regression, HC2/Bell-McCaffrey variance, nprobust port. Foundation for the
  upcoming HeterogeneousAdoptionDiD estimator.
- PR #323, #329, #332 dCDH survey completion - cell-period IF allocator (Class A
  contract), heterogeneity + within-group-varying PSU under Binder TSL, and
  PSU-level Hall-Mammen wild bootstrap at cell granularity.
- PR #333 performance review - docs/performance-scenarios.md documents 5-7
  realistic practitioner workflows; benchmark harness extended.

Silent-failures audit closeouts (PRs #324, #326, #328, #331, #334, #337, #339)
continue the reliability work started in v3.1.2-3.1.3 across axes A/C/E/G/J.

CI infrastructure: PRs #330 and #336 exclude wall-clock timing tests from default
CI after false-positive flakes; perf-review harness is the principled replacement.

Version strings bumped in diff_diff/__init__.py, pyproject.toml, rust/Cargo.toml,
diff_diff/guides/llms-full.txt, and CITATION.cff (version: 3.2.0, date-released:
2026-04-19). CHANGELOG populated with Added / Changed / Fixed sections and the
comparison-link footer. CITATION.cff retains v3.1.3 versioned DOI in identifiers;
the v3.2.0 versioned DOI will be minted by Zenodo on GitHub Release and added in
a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-ci Triggers CI test workflows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant