
Add axis-G Rust vs Python backend parity edge-case tests #337

Merged
igerber merged 1 commit into main from
fix/axis-g-rust-parity-tests
Apr 19, 2026

Conversation

@igerber
Owner

@igerber igerber commented Apr 19, 2026

Summary

Phase 2 silent-failures audit — axis-G (backend parity). Closes the coverage gap the audit flagged on three Rust-backed solver surfaces (findings #21, #22, #23). Test-only PR; any discovered divergences are marked xfail(strict=True) and logged to TODO.md as P1 follow-ups rather than fixed in-scope.

The four xfail(strict=True) markers baseline the gaps, so a future alignment of the algorithms surfaces as an XPASS failure instead of passing unnoticed. Three P1 follow-up entries were added to TODO.md.
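
As a minimal sketch of this baseline pattern (the two weight functions are hypothetical stand-ins, not the project's backends), a strict xfail lets a known divergence report as XFAIL today and turns an unexpected pass into a test failure later:

```python
import pytest


def weights_backend_a():
    # Hypothetical stand-in for one backend's result.
    return [1.0, 0.0, 0.0]


def weights_backend_b():
    # Hypothetical stand-in for the other backend's result.
    return [0.5, 0.5, 0.0]


@pytest.mark.xfail(strict=True, reason="known backend divergence; tracked in TODO.md")
def test_weights_match_across_backends():
    a, b = weights_backend_a(), weights_backend_b()
    # The assertion fails today, so pytest reports XFAIL. Once the backends
    # agree, the assertion passes and strict=True reports that XPASS as a
    # failure, which prompts removal of the marker.
    assert all(abs(x - y) <= 1e-7 for x, y in zip(a, b))
```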

Methodology references

  • Method name(s): N/A - no methodology changes
  • Paper / source link(s): N/A
  • Any intentional deviations from the source (and why): None

Validation

  • Tests added/updated: 3 new test classes in tests/test_rust_backend.py (8 tests total: 4 pass, 4 xfail).
  • Backtest / simulation / notebook evidence: N/A (test-only PR)
  • Local verification:
    • pytest tests/test_rust_backend.py -m 'not slow' → 72 passed, 14 deselected, 2 xfailed
    • pytest tests/test_rust_backend.py::TestTROPRustEdgeCaseParity -m '' → no unexpected failures (both tests xfailed as expected)
  • DIFF_DIFF_BACKEND coverage: tests patch HAS_RUST_BACKEND=False to force Python path, matching the pattern already established by test_trop_global_solver_parity_no_lowrank (test_rust_backend.py:1687).
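
The forcing pattern can be sketched roughly as follows. This is an illustration only: `backend` and `solve` are hypothetical stand-ins for a module with a module-level `HAS_RUST_BACKEND` gate, not the project's code.

```python
import types
from unittest import mock

# Hypothetical stand-in for a module that exposes a module-level backend gate.
backend = types.SimpleNamespace(HAS_RUST_BACKEND=True)


def solve(x):
    # Dispatch on the gate at call time, so patching the flag changes the path.
    if backend.HAS_RUST_BACKEND:
        return ("rust", x * 2)
    return ("python", x * 2)


def run_both(x):
    rust_result = solve(x)
    # Patch the gate only for the duration of the block; it is restored after.
    with mock.patch.object(backend, "HAS_RUST_BACKEND", False):
        python_result = solve(x)
    return rust_result, python_result
```

A parity test would then compare the numeric parts of the two results. The patch must target the module whose attribute the tested code actually reads.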

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

Phase 2 silent-failures audit — axis-G (backend parity). Closes the
coverage gap the audit flagged in three Rust-backed solver surfaces.
Test-only PR; any discovered divergences are marked `xfail(strict=True)`
and logged to `TODO.md` as P1 follow-ups rather than fixed in-scope.

Finding #21 — `solve_ols` skip-rank-check parity (`linalg.py:369-373,
597-639`): three parity tests in `TestSolveOLSSkipRankCheckParity`
covering mixed-scale columns (norm ratio > 1e6), near-singular full-rank
(cond > 1e10), and rank-deficient collinear designs under
`skip_rank_check=True` on HC1. Backends agree on fitted values within
`rtol=1e-6, atol=1e-8`. All pass; no Rust-side code change needed.
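
One reason to compare fitted values rather than coefficients on a collinear design, sketched with made-up numbers (the helper names and the `allclose`-style tolerance rule are assumptions for illustration, not the test suite's code): two different coefficient vectors can produce identical fitted values.

```python
def allclose(a, b, rtol=1e-6, atol=1e-8):
    # Element-wise check in the style of |a - b| <= atol + rtol * |b|.
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))


def fitted(X, beta):
    # Plain matrix-vector product X @ beta.
    return [sum(xij * bj for xij, bj in zip(row, beta)) for row in X]


# Intercept plus two identical regressors: the design is rank-deficient.
X = [[1.0, 2.0, 2.0], [1.0, 3.0, 3.0], [1.0, 5.0, 5.0]]

# Two coefficient vectors that split the slope differently across the
# duplicated columns (illustrative values only).
beta_a = [0.5, 2.0, 0.0]
beta_b = [0.5, 0.5, 1.5]

coefficients_agree = allclose(beta_a, beta_b)
fitted_agree = allclose(fitted(X, beta_a), fitted(X, beta_b))
```

Here `coefficients_agree` is false while `fitted_agree` is true, so a fitted-value comparison stays meaningful even when the solution is not unique.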

Finding #22 — `compute_synthetic_weights` parity (`utils.py:1134-1199`):
three parity tests in `TestSyntheticWeightsBackendParity`. Near-singular
`Y'Y` passes at `atol=1e-7`; extreme Y scale (1e9) and lambda_reg
variations are `xfail(strict=True)` with a baselined ~15-80% weight
divergence. Root cause: Rust path is Frank-Wolfe, Python fallback is
projected gradient descent (`utils.py:1228`) — same QP, different
simplex vertices under near-degenerate inputs.
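
A toy illustration of how two correct solvers can disagree on weights when the minimizer is not unique. This is a sketch under stated assumptions: the data, step size, and function names are invented for the example, and neither routine is the Rust or Python implementation. Two donor columns are identical, so any split of weight between them attains the same objective.

```python
def objective_grad(Y, target, w):
    # Residual r = Y w - target; objective ||r||^2; gradient 2 Y^T r.
    r = [sum(yij * wj for yij, wj in zip(row, w)) - t for row, t in zip(Y, target)]
    obj = sum(ri * ri for ri in r)
    grad = [2.0 * sum(Y[i][j] * r[i] for i in range(len(Y))) for j in range(len(w))]
    return obj, grad


def project_simplex(v):
    # Euclidean projection onto the probability simplex (sort-based method).
    u = sorted(v, reverse=True)
    css, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        css += ui
        t = (css - 1.0) / i
        if ui - t > 0:
            theta = t
    return [max(x - theta, 0.0) for x in v]


def frank_wolfe(Y, target, n_iter=200, tol=1e-12):
    n = len(Y[0])
    w = [1.0 / n] * n
    for k in range(n_iter):
        _, g = objective_grad(Y, target, w)
        j = min(range(n), key=lambda idx: g[idx])  # vertex minimizing <g, s>
        gap = sum(gi * wi for gi, wi in zip(g, w)) - g[j]
        if gap <= tol:  # Frank-Wolfe duality gap as stopping rule
            break
        gamma = 2.0 / (k + 2.0)
        w = [(1.0 - gamma) * wi for wi in w]
        w[j] += gamma
    return w


def projected_gradient(Y, target, step=0.03, n_iter=500):
    # step is chosen below 1/L for this toy data (an assumption of the sketch).
    n = len(Y[0])
    w = [1.0 / n] * n
    for _ in range(n_iter):
        _, g = objective_grad(Y, target, w)
        w = project_simplex([wi - step * gi for wi, gi in zip(w, g)])
    return w


# Columns 0 and 1 are identical donors; the target equals that column.
Y = [[1.0, 1.0, 3.0], [2.0, 2.0, 0.0]]
target = [1.0, 2.0]
w_fw = frank_wolfe(Y, target)
w_pgd = projected_gradient(Y, target)
```

On this input Frank-Wolfe jumps to a single vertex, while projected gradient descent keeps the symmetric split between the duplicated donors. Both reach an objective of essentially zero, so a weight-level tolerance fails even though both solve the same QP.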

Finding #23 — TROP Rust grid-search + bootstrap parity
(`trop_global.py:688-750, 966-1006`): two parity tests in
`TestTROPRustEdgeCaseParity`, `@pytest.mark.slow` class-level. Both
`xfail(strict=True)`: grid-search ATT on rank-deficient Y (~6%
divergence), bootstrap SE under `seed=42` (~28% divergence, RNG
backend mismatch — Rust `rand` crate vs numpy `default_rng`).
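
Why a shared seed does not imply shared bootstrap draws can be sketched in pure Python. The assumptions are loud here: `Lcg32` is a toy generator standing in for "a different RNG algorithm", the data are invented, and nothing below reproduces the Rust `rand` or numpy streams.

```python
import random
import statistics


class Lcg32:
    """Toy 32-bit linear congruential generator (illustrative stand-in)."""

    def __init__(self, seed):
        self.state = seed & 0xFFFFFFFF

    def randrange(self, n):
        self.state = (1664525 * self.state + 1013904223) & 0xFFFFFFFF
        return (self.state >> 16) % n  # use the higher-quality upper bits


def bootstrap_se(data, rng, n_boot=500):
    # Standard error of the mean from resampling with replacement.
    n = len(data)
    means = []
    for _ in range(n_boot):
        sample = [data[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    return statistics.stdev(means)


data = [float((7 * i) % 23) for i in range(40)]

# Same seed, different generator algorithms: the resample indices differ,
# so the two bootstrap SE estimates differ as well.
se_mersenne = bootstrap_se(data, random.Random(42))
se_lcg = bootstrap_se(data, Lcg32(42))
```

The two estimates target the same quantity but are not bitwise comparable, which is why a seed-matched parity test needs either a shared RNG stream or a statistical tolerance.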

Plan governance:
- Per `feedback_ci_reviewer_pattern_checks`, grepped adjacent Rust
  entry points (`_solve_ols_rust`, `_rust_synthetic_weights`,
  `_rust_loocv_grid_search_global`, `_rust_bootstrap_trop_variance_global`);
  no additional silent-fallback surfaces identified.
- Per plan Non-goal #4, did not open an axis-H finding on TROP's
  `seed=None → 0` substitution at `trop_global.py:994` (out of scope).
- No behavioral changes, no warnings, no REGISTRY changes, no flags.

TODO.md logs three P1 follow-up entries: algorithmic unification for
`compute_synthetic_weights` (FW vs PGD), TROP grid-search divergence on
rank-deficient Y, TROP bootstrap RNG unification.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings in the changed files. This PR is limited to backend-parity tests plus TODO tracking; it does not change estimator math, weighting, variance/SE code, identification checks, or defaults.

Executive Summary

Methodology

  • No findings. Affected surfaces are existing backend implementations only: solve_ols, compute_synthetic_weights, and TROP global parity. The PR does not alter estimator equations, assumptions, weighting, or SE formulas.

Code Quality

  • No findings. The mock-based Rust/Python forcing pattern matches the existing style in tests/test_rust_backend.py:1571 and correctly patches the module-level backend gates used by the tested paths.

Performance

  • No findings. The heavier TROP parity coverage is correctly isolated behind @pytest.mark.slow at tests/test_rust_backend.py:2317, so default suite runtime is unchanged.

Maintainability

  • No findings. The new TODO entries are concrete, scoped to specific call sites, and tied directly to the strict xfail baselines.

Tech Debt

  • Severity: P3
    Impact: The backend parity gaps are now explicitly tracked in TODO.md:86, TODO.md:87, and TODO.md:88, which is the right treatment for this test-only PR under the project’s deferred-work policy.
    Concrete fix: None required for approval; the follow-up work is already recorded.

Security

  • No findings. The diff adds tests and TODO text only; no new secret-handling, I/O, or network-facing behavior appears.

Documentation/Tests

  • No findings. xfail(strict=True) is the correct mechanism for these baselines because an eventual backend alignment will surface as XPASS instead of disappearing silently.
  • Verification note: I was not able to run pytest locally in this environment because the executable is unavailable.

@igerber igerber added the ready-for-ci label (Triggers CI test workflows) Apr 19, 2026
@igerber igerber merged commit dab5771 into main Apr 19, 2026
22 of 23 checks passed
@igerber igerber deleted the fix/axis-g-rust-parity-tests branch April 19, 2026 21:27
igerber added a commit that referenced this pull request Apr 19, 2026
… cache

Bundles the two remaining S-complexity findings from the Phase 2 audit,
closing Phase 3 execution.

Finding #12 — ContinuousDiD B-spline degenerate knot (axis C, Minor,
`continuous_did_bspline.py:153`): `bspline_derivative_design_matrix`
silently swallowed `ValueError` from `scipy.interpolate.BSpline` in the
per-basis derivative loop, leaving affected columns of the derivative
design matrix as zero with no user-visible signal. Downstream
ContinuousDiD analytical inference then fed a biased `dPsi` into SE
computation. Fix aggregates failed-basis indices and emits ONE
`UserWarning` naming them. The all-identical-knot degenerate case
(single dose value, `knots[0] == knots[-1]`) remains silently handled —
derivatives there are mathematically zero, well-defined, and always
have been.
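
The aggregate-then-warn-once shape described above might look like the sketch below. The names `derivative_columns`, `eval_derivative`, and `flaky_derivative` are hypothetical, and the callable stands in for the per-basis scipy evaluation rather than reproducing it.

```python
import warnings


def derivative_columns(n_basis, eval_derivative):
    """Collect per-basis derivative values, warning once about any failures.

    `eval_derivative` is a hypothetical callable that returns the derivative
    value for basis j and raises ValueError when that basis is degenerate.
    """
    columns, failed = [], []
    for j in range(n_basis):
        try:
            columns.append(eval_derivative(j))
        except ValueError:
            columns.append(0.0)  # keep the zero column, but remember the index
            failed.append(j)
    if failed:
        # One warning naming every failed basis, instead of silence or spam.
        warnings.warn(
            f"Derivative evaluation failed for basis indices {failed}; "
            "those columns are left as zero.",
            UserWarning,
            stacklevel=2,
        )
    return columns


def flaky_derivative(j):
    # Illustrative failure pattern: bases 1 and 3 are degenerate.
    if j in (1, 3):
        raise ValueError("degenerate knot span")
    return float(j)
```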

Finding #28 — PowerAnalysis survey-design cache staleness (axis J,
Major, `power.py:171-180`): `_build_survey_design()` populated
`self._cached_survey_design` on first call and never invalidated.
Mutating `config.survey_design` after `__init__` silently returned the
stale cached design. Default construction is microseconds and
user-provided designs are reference copies, so the cache never earned
its cost. Fix drops the cache entirely; method now reflects live
`self.survey_design` every call.
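
The staleness pattern and the cache-free fix can be illustrated with two toy classes. `CachedPower`, `LivePower`, and the string designs are invented for this sketch; only the attribute name `_cached_survey_design` comes from the description above.

```python
class CachedPower:
    """Buggy pattern: the first build is cached and never invalidated."""

    def __init__(self, survey_design=None):
        self.survey_design = survey_design
        self._cached_survey_design = None

    def build_survey_design(self):
        if self._cached_survey_design is None:
            self._cached_survey_design = self.survey_design or "default-design"
        return self._cached_survey_design


class LivePower:
    """Fixed pattern: no cache, so every call reflects the current attribute."""

    def __init__(self, survey_design=None):
        self.survey_design = survey_design

    def build_survey_design(self):
        return self.survey_design or "default-design"


stale = CachedPower("design-A")
stale.build_survey_design()
stale.survey_design = "design-B"
stale_result = stale.build_survey_design()  # still the first design

live = LivePower("design-A")
live.build_survey_design()
live.survey_design = "design-B"
live_result = live.build_survey_design()  # picks up the mutation
```

Dropping the cache is reasonable when rebuilding is cheap, as the message argues; otherwise the cache would need invalidation on every write to `survey_design`.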

Six new tests:
- `tests/test_continuous_did.py::TestBSplineDerivativeDegenerateBasis` (3):
  single-dose silent contract, `ValueError`-forced aggregate warning,
  happy-path no-warning regression.
- `tests/test_power.py::TestSurveyPowerConfigDesignStaleness` (3):
  mutate-survey_design-picks-up-new, clearing-falls-back-to-default,
  repeat-calls-equivalent regression.

REGISTRY notes added under §ContinuousDiD (edge cases) and §PowerAnalysis
(`survey_config` section).

Audit state post-PR: all 28 actionable Phase-2 findings resolved (26 in
prior PRs; #12 + #28 here). Three P1 follow-ups remain logged in
`TODO.md` from PR #337's discovered divergences (FW/PGD algorithmic
mismatch in `compute_synthetic_weights`, TROP grid-search on rank-
deficient Y, TROP bootstrap RNG unification). Those are post-audit
cleanup work, not Phase-3 scope.

No behavioral changes on clean inputs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 20, 2026
…-failures audit

Packages 161 commits across 18 PRs since v3.1.3 as minor release 3.2.0. Per
project SemVer convention, minor bumps are reserved for new estimators or new
module-level public API — BusinessReport / DiagnosticReport / DiagnosticReportResults
(PR #318) add a new public API surface and drive this bump.

Headline work:
- PR #318 BusinessReport + DiagnosticReport (experimental preview) - practitioner-
  ready output layer. Plain-English narrative summaries across all 16 result types,
  with AI-legible to_dict() schemas. See docs/methodology/REPORTING.md.
- PR #327, #335 did-no-untreated foundation - kernel infrastructure, local linear
  regression, HC2/Bell-McCaffrey variance, nprobust port. Foundation for the
  upcoming HeterogeneousAdoptionDiD estimator.
- PR #323, #329, #332 dCDH survey completion - cell-period IF allocator (Class A
  contract), heterogeneity + within-group-varying PSU under Binder TSL, and
  PSU-level Hall-Mammen wild bootstrap at cell granularity.
- PR #333 performance review - docs/performance-scenarios.md documents 5-7
  realistic practitioner workflows; benchmark harness extended.

Silent-failures audit closeouts (PRs #324, #326, #328, #331, #334, #337, #339)
continue the reliability work started in v3.1.2-3.1.3 across axes A/C/E/G/J.

CI infrastructure: PRs #330 and #336 exclude wall-clock timing tests from default
CI after false-positive flakes; perf-review harness is the principled replacement.

Version strings bumped in diff_diff/__init__.py, pyproject.toml, rust/Cargo.toml,
diff_diff/guides/llms-full.txt, and CITATION.cff (version: 3.2.0, date-released:
2026-04-19). CHANGELOG populated with Added / Changed / Fixed sections and the
comparison-link footer. CITATION.cff retains v3.1.3 versioned DOI in identifiers;
the v3.2.0 versioned DOI will be minted by Zenodo on GitHub Release and added in
a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>