Add axis-G Rust vs Python backend parity edge-case tests#337
Merged
Conversation
Phase 2 silent-failures audit — axis-G (backend parity). Closes the coverage gap the audit flagged in three Rust-backed solver surfaces. Test-only PR; any discovered divergences are marked `xfail(strict=True)` and logged to `TODO.md` as P1 follow-ups rather than fixed in-scope.

Finding #21 — `solve_ols` skip-rank-check parity (`linalg.py:369-373, 597-639`): three parity tests in `TestSolveOLSSkipRankCheckParity` covering mixed-scale columns (norm ratio > 1e6), near-singular full-rank (cond > 1e10), and rank-deficient collinear designs under `skip_rank_check=True` on HC1. Backends agree on fitted values within `rtol=1e-6, atol=1e-8`. All pass; no Rust-side code change needed.

Finding #22 — `compute_synthetic_weights` parity (`utils.py:1134-1199`): three parity tests in `TestSyntheticWeightsBackendParity`. Near-singular `Y'Y` passes at `atol=1e-7`; extreme Y scale (1e9) and lambda_reg variations are `xfail(strict=True)` with a baselined ~15-80% weight divergence. Root cause: the Rust path is Frank-Wolfe, the Python fallback is projected gradient descent (`utils.py:1228`) — same QP, but different simplex vertices under near-degenerate inputs.

Finding #23 — TROP Rust grid-search + bootstrap parity (`trop_global.py:688-750, 966-1006`): two parity tests in `TestTROPRustEdgeCaseParity`, `@pytest.mark.slow` at class level. Both `xfail(strict=True)`: grid-search ATT on rank-deficient Y (~6% divergence) and bootstrap SE under `seed=42` (~28% divergence from an RNG backend mismatch — Rust `rand` crate vs numpy `default_rng`).

Plan governance:
- Per `feedback_ci_reviewer_pattern_checks`, grepped adjacent Rust entry points (`_solve_ols_rust`, `_rust_synthetic_weights`, `_rust_loocv_grid_search_global`, `_rust_bootstrap_trop_variance_global`); no additional silent-fallback surfaces identified.
- Per plan Non-goal #4, did not open an axis-H finding on TROP's `seed=None → 0` substitution at `trop_global.py:994` (out of scope).
- No behavioral changes, no warnings, no REGISTRY changes, no flags.
TODO.md logs three P1 follow-up entries: algorithmic unification for `compute_synthetic_weights` (FW vs PGD), TROP grid-search divergence on rank-deficient Y, TROP bootstrap RNG unification. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
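The parity-test pattern described above can be sketched as follows. This is a toy illustration only: `python_weights` / `rust_weights` are stand-ins (identical here), not the real Frank-Wolfe and projected-gradient implementations in `utils.py`, and the tolerances mirror the commit message rather than the actual test file.

```python
import numpy as np
import pytest

def python_weights(Y0, y1):
    # Crude simplex stand-in: least squares, then fold onto the simplex.
    w, *_ = np.linalg.lstsq(Y0, y1, rcond=None)
    w = np.abs(w)
    return w / w.sum()

def rust_weights(Y0, y1):
    # Stand-in for the Rust path; identical here so the parity test passes.
    return python_weights(Y0, y1)

def test_weights_parity_near_singular():
    rng = np.random.default_rng(0)
    Y0 = rng.normal(size=(20, 5))
    Y0[:, 4] = Y0[:, 3] + 1e-10 * rng.normal(size=20)  # near-singular Y0'Y0
    y1 = rng.normal(size=20)
    np.testing.assert_allclose(rust_weights(Y0, y1),
                               python_weights(Y0, y1), atol=1e-7)

@pytest.mark.xfail(strict=True,
                   reason="FW vs PGD reach different simplex vertices")
def test_weights_parity_extreme_scale():
    # strict=True baselines a known divergence: if the backends ever
    # start agreeing, this test FAILS the run, flagging a stale xfail.
    raise AssertionError("backends diverge by ~15-80% at Y scale 1e9")
```

The `strict=True` choice is the load-bearing detail: an unexpected pass (XPASS) fails the suite, which is what makes these markers a baseline rather than a suppressed failure.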
Overall Assessment ✅ Looks good

No unmitigated P0/P1 findings in the changed files. This PR is limited to backend-parity tests plus TODO tracking; it does not change estimator math, weighting, variance/SE code, identification checks, or defaults.

Review sections: Executive Summary · Methodology · Code Quality · Performance · Maintainability · Tech Debt · Security · Documentation/Tests
igerber added a commit that referenced this pull request on Apr 19, 2026
… cache

Bundles the two remaining S-complexity findings from the Phase 2 audit, closing Phase 3 execution.

Finding #12 — ContinuousDiD B-spline degenerate knot (axis C, Minor, `continuous_did_bspline.py:153`): `bspline_derivative_design_matrix` silently swallowed `ValueError` from `scipy.interpolate.BSpline` in the per-basis derivative loop, leaving affected columns of the derivative design matrix as zero with no user-visible signal. Downstream ContinuousDiD analytical inference then fed a biased `dPsi` into SE computation. The fix aggregates failed-basis indices and emits ONE `UserWarning` naming them. The all-identical-knot degenerate case (single dose value, `knots[0] == knots[-1]`) remains silently handled — derivatives there are mathematically zero, well-defined, and always have been.

Finding #28 — PowerAnalysis survey-design cache staleness (axis J, Major, `power.py:171-180`): `_build_survey_design()` populated `self._cached_survey_design` on first call and never invalidated it. Mutating `config.survey_design` after `__init__` silently returned the stale cached design. Default construction takes microseconds and user-provided designs are reference copies, so the cache never earned its cost. The fix drops the cache entirely; the method now reflects the live `self.survey_design` on every call.

Six new tests:
- `tests/test_continuous_did.py::TestBSplineDerivativeDegenerateBasis` (3): single-dose silent contract, `ValueError`-forced aggregate warning, happy-path no-warning regression.
- `tests/test_power.py::TestSurveyPowerConfigDesignStaleness` (3): mutate-survey_design-picks-up-new, clearing-falls-back-to-default, repeat-calls-equivalent regression.

REGISTRY notes added under §ContinuousDiD (edge cases) and §PowerAnalysis (`survey_config` section). Audit state post-PR: all 28 actionable Phase-2 findings resolved (26 in prior PRs; #12 + #28 here).

Three P1 follow-ups remain logged in `TODO.md` from PR #337's discovered divergences (FW/PGD algorithmic mismatch in `compute_synthetic_weights`, TROP grid-search on rank-deficient Y, TROP bootstrap RNG unification). Those are post-audit cleanup work, not Phase-3 scope. No behavioral changes on clean inputs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
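The aggregate-warning pattern described for Finding #12 can be sketched like this. The function name echoes the commit message, but the signature, shapes, and message text here are illustrative, not the real `bspline_derivative_design_matrix` API.

```python
import warnings
import numpy as np

def derivative_design_matrix(basis_funcs, x):
    """Evaluate each basis derivative at x; collect failures instead of
    silently swallowing them one at a time."""
    D = np.zeros((len(x), len(basis_funcs)))
    failed = []
    for j, f in enumerate(basis_funcs):
        try:
            D[:, j] = f(x)
        except ValueError:
            failed.append(j)  # record the bad column; leave it as zero
    if failed:
        # ONE aggregate warning naming every failed basis, rather than
        # n_failed separate warnings or (worse) none at all.
        warnings.warn(
            f"Derivative evaluation failed for basis columns {failed}; "
            "those columns are left as zero.",
            UserWarning,
        )
    return D
```

Callers on the happy path see no warning (a property one of the three new tests pins down as a regression check), while a degenerate basis now produces exactly one user-visible signal.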
igerber added a commit that referenced this pull request on Apr 20, 2026
…-failures audit

Packages 161 commits across 18 PRs since v3.1.3 as minor release 3.2.0. Per project SemVer convention, minor bumps are reserved for new estimators or new module-level public API — BusinessReport / DiagnosticReport / DiagnosticReportResults (PR #318) add a new public API surface and drive this bump.

Headline work:
- PR #318 BusinessReport + DiagnosticReport (experimental preview) — practitioner-ready output layer. Plain-English narrative summaries across all 16 result types, with AI-legible to_dict() schemas. See docs/methodology/REPORTING.md.
- PR #327, #335 did-no-untreated foundation — kernel infrastructure, local linear regression, HC2/Bell-McCaffrey variance, nprobust port. Foundation for the upcoming HeterogeneousAdoptionDiD estimator.
- PR #323, #329, #332 dCDH survey completion — cell-period IF allocator (Class A contract), heterogeneity + within-group-varying PSU under Binder TSL, and PSU-level Hall-Mammen wild bootstrap at cell granularity.
- PR #333 performance review — docs/performance-scenarios.md documents 5-7 realistic practitioner workflows; benchmark harness extended.

Silent-failures audit closeouts (PRs #324, #326, #328, #331, #334, #337, #339) continue the reliability work started in v3.1.2-3.1.3 across axes A/C/E/G/J.

CI infrastructure: PRs #330 and #336 exclude wall-clock timing tests from default CI after false-positive flakes; the perf-review harness is the principled replacement.

Version strings bumped in diff_diff/__init__.py, pyproject.toml, rust/Cargo.toml, diff_diff/guides/llms-full.txt, and CITATION.cff (version: 3.2.0, date-released: 2026-04-19). CHANGELOG populated with Added / Changed / Fixed sections and the comparison-link footer. CITATION.cff retains the v3.1.3 versioned DOI in identifiers; the v3.2.0 versioned DOI will be minted by Zenodo on GitHub Release and added in a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
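A release that bumps version strings in five separate files invites drift; a small consistency check is cheap insurance. The sketch below is a hypothetical helper (not part of the repo), demonstrated on inline sample strings rather than the real files.

```python
import re

# Illustrative patterns for the kinds of files the release bumps;
# the keys and regexes here are assumptions, not the repo's actual layout.
VERSION_PATTERNS = {
    "pyproject.toml": r'version\s*=\s*"([^"]+)"',
    "Cargo.toml": r'version\s*=\s*"([^"]+)"',
    "__init__.py": r'__version__\s*=\s*"([^"]+)"',
}

def extract_versions(contents):
    """Pull the first version match out of each file's text."""
    found = {}
    for name, text in contents.items():
        m = re.search(VERSION_PATTERNS[name], text)
        if m:
            found[name] = m.group(1)
    return found

sample = {
    "pyproject.toml": 'name = "diff_diff"\nversion = "3.2.0"',
    "Cargo.toml": '[package]\nversion = "3.2.0"',
    "__init__.py": '__version__ = "3.2.0"',
}
versions = extract_versions(sample)
assert len(set(versions.values())) == 1  # every file agrees on one version
```

Run against the real file contents in CI, a failing assertion would catch a half-applied bump before the tag is cut.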
Summary
Phase 2 silent-failures audit — axis-G (backend parity). Closes the coverage gap the audit flagged on three Rust-backed solver surfaces (findings #21, #22, #23). Test-only PR; any discovered divergences are marked `xfail(strict=True)` and logged to `TODO.md` as P1 follow-ups rather than fixed in-scope.

- `solve_ols` skip-rank-check parity — three tests in `TestSolveOLSSkipRankCheckParity` covering mixed-scale, near-singular full-rank, and rank-deficient designs under `skip_rank_check=True` on HC1. All pass at `rtol=1e-6, atol=1e-8` on fitted values; no Rust-side code change needed.
- `compute_synthetic_weights` parity — three tests. Near-singular `Y'Y` passes. Extreme Y scale (1e9) and lambda_reg variations `xfail(strict=True)` — root cause: the Rust path is Frank-Wolfe, the Python fallback is projected-gradient descent. Same QP, different simplex vertices.
- TROP grid-search + bootstrap parity (`@pytest.mark.slow`) — grid-search ATT on rank-deficient Y `xfail(strict=True)` (~6% divergence). Bootstrap SE under a fixed seed `xfail(strict=True)` (~28% divergence due to RNG-backend mismatch: Rust `rand` crate vs numpy `default_rng`).

The four `xfail(strict=True)` markers baseline the gaps so we get notified if/when the algorithms align. Three P1 follow-up entries added to `TODO.md`.

Methodology references
Validation
- `tests/test_rust_backend.py` (8 tests total: 4 pass, 4 xfail).
- `pytest tests/test_rust_backend.py -m 'not slow'` → 72 passed, 14 deselected, 2 xfailed
- `pytest tests/test_rust_backend.py::TestTROPRustEdgeCaseParity -m ''` → all pass (2 xfailed as expected)
- `DIFF_DIFF_BACKEND` coverage: tests patch `HAS_RUST_BACKEND=False` to force the Python path, matching the pattern already established by `test_trop_global_solver_parity_no_lowrank` (`test_rust_backend.py:1687`).

Security / privacy
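The `HAS_RUST_BACKEND=False` patching pattern mentioned above can be sketched as follows. The `backend` class here is a self-contained stand-in for the real module (both "paths" are numpy, merely routed by the flag); only the patch-the-flag-then-compare structure reflects the actual tests.

```python
from unittest.mock import patch

import numpy as np

class backend:
    """Stand-in for a module exposing a HAS_RUST_BACKEND dispatch flag."""
    HAS_RUST_BACKEND = True

    @staticmethod
    def solve_ols(X, y):
        if backend.HAS_RUST_BACKEND:
            # Stand-in for the compiled path.
            return np.linalg.lstsq(X, y, rcond=None)[0]
        # Stand-in for the pure-Python fallback.
        return np.linalg.pinv(X) @ y

def test_python_fallback_parity():
    rng = np.random.default_rng(42)
    X, y = rng.normal(size=(30, 3)), rng.normal(size=30)
    rust_beta = backend.solve_ols(X, y)
    # Force the Python path for the second solve only.
    with patch.object(backend, "HAS_RUST_BACKEND", False):
        python_beta = backend.solve_ols(X, y)
    np.testing.assert_allclose(rust_beta, python_beta,
                               rtol=1e-6, atol=1e-8)
```

Patching the flag (rather than setting an environment variable per test) keeps the override scoped to the `with` block, so other tests in the same process still exercise the compiled path.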
Generated with Claude Code