Add axis-G Rust vs Python backend parity edge-case tests#337
Merged
Conversation
Phase 2 silent-failures audit — axis-G (backend parity). Closes the coverage gap the audit flagged in three Rust-backed solver surfaces. Test-only PR; any discovered divergences are marked `xfail(strict=True)` and logged to `TODO.md` as P1 follow-ups rather than fixed in-scope.

Finding #21 — `solve_ols` skip-rank-check parity (`linalg.py:369-373, 597-639`): three parity tests in `TestSolveOLSSkipRankCheckParity` covering mixed-scale columns (norm ratio > 1e6), near-singular full-rank (cond > 1e10), and rank-deficient collinear designs under `skip_rank_check=True` on HC1. Backends agree on fitted values within `rtol=1e-6, atol=1e-8`. All pass; no Rust-side code change needed.

Finding #22 — `compute_synthetic_weights` parity (`utils.py:1134-1199`): three parity tests in `TestSyntheticWeightsBackendParity`. Near-singular `Y'Y` passes at `atol=1e-7`; extreme Y scale (1e9) and lambda_reg variations are `xfail(strict=True)` with a baselined ~15-80% weight divergence. Root cause: the Rust path is Frank-Wolfe, the Python fallback is projected gradient descent (`utils.py:1228`) — same QP, but different simplex vertices under near-degenerate inputs.

Finding #23 — TROP Rust grid-search + bootstrap parity (`trop_global.py:688-750, 966-1006`): two parity tests in `TestTROPRustEdgeCaseParity`, `@pytest.mark.slow` at class level. Both `xfail(strict=True)`: grid-search ATT on rank-deficient Y (~6% divergence) and bootstrap SE under `seed=42` (~28% divergence from an RNG backend mismatch — Rust `rand` crate vs numpy `default_rng`).

Plan governance:
- Per `feedback_ci_reviewer_pattern_checks`, grepped adjacent Rust entry points (`_solve_ols_rust`, `_rust_synthetic_weights`, `_rust_loocv_grid_search_global`, `_rust_bootstrap_trop_variance_global`); no additional silent-fallback surfaces identified.
- Per plan Non-goal #4, did not open an axis-H finding on TROP's `seed=None → 0` substitution at `trop_global.py:994` (out of scope).
- No behavioral changes, no warnings, no REGISTRY changes, no flags.
TODO.md logs three P1 follow-up entries: algorithmic unification for `compute_synthetic_weights` (FW vs PGD), TROP grid-search divergence on rank-deficient Y, TROP bootstrap RNG unification. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
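The parity-test pattern described above can be sketched as follows. This is a toy illustration only: `python_weights` / `rust_weights` are stand-ins (identical here), not the real Frank-Wolfe and projected-gradient implementations in `utils.py`, and the tolerances mirror the commit message rather than the actual test file.

```python
import numpy as np
import pytest

def python_weights(Y0, y1):
    # Crude simplex stand-in: least squares, then fold onto the simplex.
    w, *_ = np.linalg.lstsq(Y0, y1, rcond=None)
    w = np.abs(w)
    return w / w.sum()

def rust_weights(Y0, y1):
    # Stand-in for the Rust path; identical here so the parity test passes.
    return python_weights(Y0, y1)

def test_weights_parity_near_singular():
    rng = np.random.default_rng(0)
    Y0 = rng.normal(size=(20, 5))
    Y0[:, 4] = Y0[:, 3] + 1e-10 * rng.normal(size=20)  # near-singular Y0'Y0
    y1 = rng.normal(size=20)
    np.testing.assert_allclose(rust_weights(Y0, y1),
                               python_weights(Y0, y1), atol=1e-7)

@pytest.mark.xfail(strict=True,
                   reason="FW vs PGD reach different simplex vertices")
def test_weights_parity_extreme_scale():
    # strict=True baselines a known divergence: if the backends ever
    # start agreeing, this test FAILS the run, flagging a stale xfail.
    raise AssertionError("backends diverge by ~15-80% at Y scale 1e9")
```

The `strict=True` choice is the load-bearing detail: an unexpected pass (XPASS) fails the suite, which is what makes these markers a baseline rather than a suppressed failure.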
Overall Assessment ✅ Looks good

No unmitigated P0/P1 findings in the changed files. This PR is limited to backend-parity tests plus TODO tracking; it does not change estimator math, weighting, variance/SE code, identification checks, or defaults.

Review sections: Executive Summary · Methodology · Code Quality · Performance · Maintainability · Tech Debt · Security · Documentation/Tests
igerber added a commit that referenced this pull request on Apr 19, 2026
… cache

Bundles the two remaining S-complexity findings from the Phase 2 audit, closing Phase 3 execution.

Finding #12 — ContinuousDiD B-spline degenerate knot (axis C, Minor, `continuous_did_bspline.py:153`): `bspline_derivative_design_matrix` silently swallowed `ValueError` from `scipy.interpolate.BSpline` in the per-basis derivative loop, leaving affected columns of the derivative design matrix as zero with no user-visible signal. Downstream ContinuousDiD analytical inference then fed a biased `dPsi` into SE computation. The fix aggregates failed-basis indices and emits ONE `UserWarning` naming them. The all-identical-knot degenerate case (single dose value, `knots[0] == knots[-1]`) remains silently handled — derivatives there are mathematically zero, well-defined, and always have been.

Finding #28 — PowerAnalysis survey-design cache staleness (axis J, Major, `power.py:171-180`): `_build_survey_design()` populated `self._cached_survey_design` on first call and never invalidated it. Mutating `config.survey_design` after `__init__` silently returned the stale cached design. Default construction takes microseconds and user-provided designs are reference copies, so the cache never earned its cost. The fix drops the cache entirely; the method now reflects the live `self.survey_design` on every call.

Six new tests:
- `tests/test_continuous_did.py::TestBSplineDerivativeDegenerateBasis` (3): single-dose silent contract, `ValueError`-forced aggregate warning, happy-path no-warning regression.
- `tests/test_power.py::TestSurveyPowerConfigDesignStaleness` (3): mutate-survey_design-picks-up-new, clearing-falls-back-to-default, repeat-calls-equivalent regression.

REGISTRY notes added under §ContinuousDiD (edge cases) and §PowerAnalysis (`survey_config` section). Audit state post-PR: all 28 actionable Phase-2 findings resolved (26 in prior PRs; #12 + #28 here).

Three P1 follow-ups remain logged in `TODO.md` from PR #337's discovered divergences (FW/PGD algorithmic mismatch in `compute_synthetic_weights`, TROP grid-search on rank-deficient Y, TROP bootstrap RNG unification). Those are post-audit cleanup work, not Phase-3 scope. No behavioral changes on clean inputs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
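The aggregate-warning pattern described for Finding #12 can be sketched like this. The function name echoes the commit message, but the signature, shapes, and message text here are illustrative, not the real `bspline_derivative_design_matrix` API.

```python
import warnings
import numpy as np

def derivative_design_matrix(basis_funcs, x):
    """Evaluate each basis derivative at x; collect failures instead of
    silently swallowing them one at a time."""
    D = np.zeros((len(x), len(basis_funcs)))
    failed = []
    for j, f in enumerate(basis_funcs):
        try:
            D[:, j] = f(x)
        except ValueError:
            failed.append(j)  # record the bad column; leave it as zero
    if failed:
        # ONE aggregate warning naming every failed basis, rather than
        # n_failed separate warnings or (worse) none at all.
        warnings.warn(
            f"Derivative evaluation failed for basis columns {failed}; "
            "those columns are left as zero.",
            UserWarning,
        )
    return D
```

Callers on the happy path see no warning (a property one of the three new tests pins down as a regression check), while a degenerate basis now produces exactly one user-visible signal.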
igerber added a commit that referenced this pull request on Apr 20, 2026
…-failures audit

Packages 161 commits across 18 PRs since v3.1.3 as minor release 3.2.0. Per project SemVer convention, minor bumps are reserved for new estimators or new module-level public API — BusinessReport / DiagnosticReport / DiagnosticReportResults (PR #318) add a new public API surface and drive this bump.

Headline work:
- PR #318 BusinessReport + DiagnosticReport (experimental preview) — practitioner-ready output layer. Plain-English narrative summaries across all 16 result types, with AI-legible to_dict() schemas. See docs/methodology/REPORTING.md.
- PR #327, #335 did-no-untreated foundation — kernel infrastructure, local linear regression, HC2/Bell-McCaffrey variance, nprobust port. Foundation for the upcoming HeterogeneousAdoptionDiD estimator.
- PR #323, #329, #332 dCDH survey completion — cell-period IF allocator (Class A contract), heterogeneity + within-group-varying PSU under Binder TSL, and PSU-level Hall-Mammen wild bootstrap at cell granularity.
- PR #333 performance review — docs/performance-scenarios.md documents 5-7 realistic practitioner workflows; benchmark harness extended.

Silent-failures audit closeouts (PRs #324, #326, #328, #331, #334, #337, #339) continue the reliability work started in v3.1.2-3.1.3 across axes A/C/E/G/J.

CI infrastructure: PRs #330 and #336 exclude wall-clock timing tests from default CI after false-positive flakes; the perf-review harness is the principled replacement.

Version strings bumped in diff_diff/__init__.py, pyproject.toml, rust/Cargo.toml, diff_diff/guides/llms-full.txt, and CITATION.cff (version: 3.2.0, date-released: 2026-04-19). CHANGELOG populated with Added / Changed / Fixed sections and the comparison-link footer. CITATION.cff retains the v3.1.3 versioned DOI in identifiers; the v3.2.0 versioned DOI will be minted by Zenodo on GitHub Release and added in a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
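A release that bumps version strings in five separate files invites drift; a small consistency check is cheap insurance. The sketch below is a hypothetical helper (not part of the repo), demonstrated on inline sample strings rather than the real files.

```python
import re

# Illustrative patterns for the kinds of files the release bumps;
# the keys and regexes here are assumptions, not the repo's actual layout.
VERSION_PATTERNS = {
    "pyproject.toml": r'version\s*=\s*"([^"]+)"',
    "Cargo.toml": r'version\s*=\s*"([^"]+)"',
    "__init__.py": r'__version__\s*=\s*"([^"]+)"',
}

def extract_versions(contents):
    """Pull the first version match out of each file's text."""
    found = {}
    for name, text in contents.items():
        m = re.search(VERSION_PATTERNS[name], text)
        if m:
            found[name] = m.group(1)
    return found

sample = {
    "pyproject.toml": 'name = "diff_diff"\nversion = "3.2.0"',
    "Cargo.toml": '[package]\nversion = "3.2.0"',
    "__init__.py": '__version__ = "3.2.0"',
}
versions = extract_versions(sample)
assert len(set(versions.values())) == 1  # every file agrees on one version
```

Run against the real file contents in CI, a failing assertion would catch a half-applied bump before the tag is cut.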
Summary
Phase 2 silent-failures audit — axis-G (backend parity). Closes the coverage gap the audit flagged on three Rust-backed solver surfaces (findings #21, #22, #23). Test-only PR; any discovered divergences are marked `xfail(strict=True)` and logged to `TODO.md` as P1 follow-ups rather than fixed in-scope.

- `solve_ols` skip-rank-check parity — three tests in `TestSolveOLSSkipRankCheckParity` covering mixed-scale, near-singular full-rank, and rank-deficient designs under `skip_rank_check=True` on HC1. All pass at `rtol=1e-6, atol=1e-8` on fitted values; no Rust-side code change needed.
- `compute_synthetic_weights` parity — three tests. Near-singular `Y'Y` passes. Extreme Y scale (1e9) and lambda_reg variations `xfail(strict=True)` — root cause: the Rust path is Frank-Wolfe, the Python fallback is projected-gradient descent. Same QP, different simplex vertices.
- TROP grid-search + bootstrap parity (`@pytest.mark.slow`) — grid-search ATT on rank-deficient Y `xfail(strict=True)` (~6% divergence). Bootstrap SE under a fixed seed `xfail(strict=True)` (~28% divergence due to RNG-backend mismatch: Rust `rand` crate vs numpy `default_rng`).

The four `xfail(strict=True)` markers baseline the gaps so we get notified if/when the algorithms align. Three P1 follow-up entries added to `TODO.md`.

Methodology references
Validation
- `tests/test_rust_backend.py` (8 tests total: 4 pass, 4 xfail).
- `pytest tests/test_rust_backend.py -m 'not slow'` → 72 passed, 14 deselected, 2 xfailed
- `pytest tests/test_rust_backend.py::TestTROPRustEdgeCaseParity -m ''` → all pass (2 xfailed as expected)
- `DIFF_DIFF_BACKEND` coverage: tests patch `HAS_RUST_BACKEND=False` to force the Python path, matching the pattern already established by `test_trop_global_solver_parity_no_lowrank` (`test_rust_backend.py:1687`).

Security / privacy
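The `HAS_RUST_BACKEND=False` patching pattern mentioned above can be sketched as follows. The `backend` class here is a self-contained stand-in for the real module (both "paths" are numpy, merely routed by the flag); only the patch-the-flag-then-compare structure reflects the actual tests.

```python
from unittest.mock import patch

import numpy as np

class backend:
    """Stand-in for a module exposing a HAS_RUST_BACKEND dispatch flag."""
    HAS_RUST_BACKEND = True

    @staticmethod
    def solve_ols(X, y):
        if backend.HAS_RUST_BACKEND:
            # Stand-in for the compiled path.
            return np.linalg.lstsq(X, y, rcond=None)[0]
        # Stand-in for the pure-Python fallback.
        return np.linalg.pinv(X) @ y

def test_python_fallback_parity():
    rng = np.random.default_rng(42)
    X, y = rng.normal(size=(30, 3)), rng.normal(size=30)
    rust_beta = backend.solve_ols(X, y)
    # Force the Python path for the second solve only.
    with patch.object(backend, "HAS_RUST_BACKEND", False):
        python_beta = backend.solve_ols(X, y)
    np.testing.assert_allclose(rust_beta, python_beta,
                               rtol=1e-6, atol=1e-8)
```

Patching the flag (rather than setting an environment variable per test) keeps the override scoped to the `with` block, so other tests in the same process still exercise the compiled path.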
Generated with Claude Code