Skip to content

Add LSMR/MLSMR Krylov solver alongside CG#25

Merged
schroedk merged 18 commits intomainfrom
feature/mlsmr-mlsqr
Apr 29, 2026
Merged

Add LSMR/MLSMR Krylov solver alongside CG#25
schroedk merged 18 commits intomainfrom
feature/mlsmr-mlsqr

Conversation

@schroedk
Copy link
Copy Markdown
Collaborator

@schroedk schroedk commented Apr 28, 2026

Summary

Adds LSMR (Fong & Saunders) and a preconditioned variant as
KrylovMethod::Lsmr { local_size }, fully wired through to the Python
API as within.LSMR(tol, maxiter, local_size). CG remains the default;
LSMR is opt-in via config=LSMR(...).

The motivation is accuracy on ill-conditioned designs. CG's residual
stopping rule under-resolves rank-deficient fixed-effects systems
(sparse-mobility AKM, multi-factor high-cardinality panels), while
LSMR's normal-equation stop bounds the demean error directly.

Headline benchmark

Sparse-mobility AKM, 5 M obs (worker = 833 k × firm = 25 k, 1 % mobility):

method tol per_solve demean_rel demean_max
CG 1e-12 692 ms 5.73e-10 1.82e-8
LSMR 1e-12 773 ms 4.98e-12 1.96e-10

~100× tighter demean at +12 % wall time. Tightening CG by one tolerance
magnitude does not close the gap on this shape (CG-1e-13 still
trails LSMR-1e-12 by ~10× in both relative L2 and pointwise max).

On well-conditioned designs (1-/2-/3-FE panels, AKM with 10 % mobility)
the two methods are within 10–25 % on time, and CG with one-magnitude
tighter tolerance does match LSMR's accuracy. CG remains the right
default for those cases.

Full sweep over 1fe-highcard / 2fe-panel / 3fe-panel / akm / akm-sparse / 3fe-highcard / 4fe-chain at {500 k, 1 M, 5 M} obs:
crates/within/examples/lsmr_vs_cg.rs.

Diff buckets

  • Solver corecrates/schwarz-precond/src/solve/lsmr/{,bidiag,recurrence,tests}.rs
  • within wiring + Python bindingscrates/within/src/{solver,operator,config}.rs,
    crates/within-py/src/lib.rs, python/within/{__init__.py,_within.pyi}
  • Examples and bench-suite additions — ~850 lines, splittable to a
    follow-up PR if reviewer scope is tight

schroedk added 14 commits April 23, 2026 10:15
Modified LSMR (Fong & Saunders 2011) using the Modified Golub-Kahan
bidiagonalization (Arridge et al. 2014), which requires only one M^-1
application per iteration — the preconditioner M ≈ A^T A is applied
directly with no square-root factorization needed.

Exposed as `schwarz_precond::solve::lsmr::mlsmr`. Covered by unit tests
for unpreconditioned, preconditioned, inconsistent, and maxiter paths.
- KrylovMethod::Lsmr variant dispatched inline in Solver::solve.
- WeightedDesignOperator adapts WeightedDesign to the rectangular A = W^{1/2}D
  operator MLSMR expects; its normal equations A^T A coincide with the
  Gramian, so the existing Schwarz preconditioner carries over unchanged.
- Python: new LSMR config class (tol, maxiter). CG-style preconditioner
  validation now also covers LSMR.
- Tests: orchestrate_solve covers weighted LSMR and LSMR-vs-CG agreement.
The old threshold `EPS * rz_init` fired when `rz_new/rz_init < EPS`, i.e.
`||r||/||b|| < sqrt(EPS) ~ 1.5e-8`, colliding with user tolerances near
1e-8 and cutting off CG above the attainable-accuracy floor of
`||r||/||b|| ~ u` for well-conditioned problems. The new threshold
EPS^2 * rz_init fires only when rz reaches true numerical noise
(`||r|| ~ EPS*||b||`), restoring legitimate convergence near sqrt(EPS)
tol. Resolves the `prop_single_factor_converges` property-test failure.
Documents the LSMR rectangular least-squares solver and the CG
stagnation guard fix landed on this branch.
Include LSMR(Schwarz) alongside CG(Schwarz) and GMRES(Mult-Schwarz) in
standard_solver_configs so every comparison suite exercises all three
Krylov methods. Thread an optional maxiter override through
standard_solver_configs for suites that raise the iteration cap.

Suites with the plain CG+GMRES pair (preconditioners_3fe, high_fe,
high_fe_scaling) now delegate to standard_solver_configs.
preconditioner_comparison keeps its CG(none) baseline and appends the
standard trio. Suites whose purpose is a different comparison axis
(fixest_comparison exact/approx Schur, ac_comparison AC/AC2 local
solver, graph_backend_comparison) append the standard trio alongside
their existing configs with clearer label naming.
Fuse and parallelize the per-iteration axpy-with-sq-norm, axpby, dot,
and the h̄/x/h/v update block behind a 10k-element threshold with
4096-sized chunks. Small problems stay sequential to avoid rayon
dispatch overhead; the chunk size keeps each task L1-resident, which
beat alternative schemes (per-thread chunks, always-parallel) in
profiling on 1M/5M-obs 3FE panels.
- Cargo profile `profiling` inherits release + line-table debug info so
  samply/perf/dtrace can resolve stack frames without LTO symbol loss.
- samply added to pixi dev deps.
- Two new examples under within/: lsmr_profile (single-shape 3FE DGP for
  samply capture) and lsmr_vs_cg (head-to-head CG vs LSMR across shapes
  and tolerances, reporting wall time and demean error vs a reference).
Restructure lsmr.rs so the code mirrors the four blocks Fong & Saunders
write in Algorithm 2.8: bidiagonalization, P̂_k, P̄_k, and the (x, h, h̄)
recurrence.

- Add `Givens { c, s, r }` with `Givens::new(a, b)` so each rotation
  construction is one named call, not three lines duplicated twice.
- Introduce a `Bidiagonalization` trait with two impls. `GolubKahan`
  handles the unpreconditioned path with no `p_tilde` buffer and no `M⁻¹`
  apply; `ModifiedGolubKahan` (renamed from `MgkBidiag`) keeps the
  M-norm dot-product trick. The choice of preconditioner is now confined
  to picking which bidiagonalization to construct.
- `LsmrRecurrenceState::step` shrinks to constructing P̂_k and P̄_k and
  committing chain state; it returns a `RotationStep` of natural rotation
  outputs (ρ, ρ̄, θ_new, θ̄, ζ) instead of pre-baked solution coefficients.
- `SolutionState::update` takes the current and previous `RotationStep`
  and derives its own (t_x, t_hbar, t_h) ratios at the point of use.
  `RotationStep::initial()` seeds the first iteration so `h̄₀ = 0` falls
  out without a special case.
- `mlsmr` dispatches `None`/`Some` once into a generic `lsmr_from_bidiag`
  that knows nothing about `M`.

No behavioral change; all four LSMR tests pass.
…ence

`mlsmr` with `None` (GolubKahan) and with `Some(&IdentityOperator)`
(ModifiedGolubKahan with M = I) are mathematically the same algorithm.
Add a test that runs both on the existing 4x3 `OverdeterminedOp` system
and asserts the iteration counts, solutions, and residual-norm estimates
agree to 1e-10. Guards against future drift between the two
bidiagonalization implementations.
Adds local (windowed) modified Gram-Schmidt to both Golub-Kahan and
Modified Golub-Kahan bidiagonalizations, restoring v-orthogonality on
ill-conditioned problems where the short recurrence drifts. Surfaced as
`local_size: Option<usize>` on `KrylovMethod::Lsmr` and the Python `LSMR`
config. `None` (default) is the zero-cost no-op fast path.

Includes a Vandermonde demo example, regression and orthogonality tests
for the windowed path, and a Python sweep script comparing window sizes
across panel-data topologies.
Three follow-ups from review of the windowed-reorth change:

* `mlsmr` no longer treats `‖b‖ < EPSILON` as exactly zero — uses
  `b_norm == 0.0` instead, so scaled-small RHS vectors flow through
  the normal solve path rather than silently returning x = 0 with a
  false residual of 0.
* `LocalReorth::new` and `ModifiedLocalReorth::new` now clamp ring
  capacity by `min(m, n)` instead of just `n`, matching the actual
  bound on the bidiagonalization sequence and avoiding unnecessary
  n-length allocations on wide rectangular operators.
* New `test_mlsmr_none_matches_identity_precond_windowed` extends the
  existing None-vs-IdentityOperator equivalence guarantee to the
  windowed path, guarding the M-weighted scaling logic in
  `ModifiedLocalReorth::push` against future drift.
`solve/lsmr.rs` had grown to ~1500 lines mixing the bidiagonalization
machinery, the Givens/recurrence/convergence state machines, and a
~500-line inline test module. Split along the producer/consumer seam:

- `lsmr.rs` (~90 lines) — facade with `mlsmr`, `LsmrResult`, the
  `lsmr_from_bidiag` driver, and the shared parallel-kernel constants.
- `lsmr/bidiag.rs` — the `(α, β)` stream: `Bidiagonalization` trait,
  `GolubKahan`/`ModifiedGolubKahan`, `LocalReorth`/`ModifiedLocalReorth`,
  and the vector kernels they consume.
- `lsmr/recurrence.rs` — the consumer: `Givens`, `RotationStep`,
  `LsmrRecurrenceState`, `SolutionState`, `ConvergenceState`, `Stop`.
- `lsmr/tests.rs` — the existing test suite, unchanged.

Public API frozen: `schwarz_precond::solve::lsmr::{mlsmr, LsmrResult}`
still resolves identically for `within::solver` and the
`lsmr_local_reorth` example.
The "0 = disabled" convention was carried via empty `slots` Vecs and
`is_empty()` short-circuits in `reorthogonalize`/`push`. Encoding the
disabled state as `Option<LocalReorth>` / `Option<ModifiedLocalReorth>`
on the buffer puts it in the type, drops the sentinel branches, and
removes the `is_empty` helpers.
…rage)

Comment-only changes to bidiag.rs and recurrence.rs (algebra unchanged):
- bidiag: debug_assert alpha>0 before p_coeff division; restructure
  ModifiedGolubKahan::step comments into three self-contained phases
- recurrence: doc-comment Stop enum and variants; expand Givens-rotation
  comments to map onto Fong & Saunders Algorithm 2.8; document why
  absolute f64::EPSILON guards are acceptable in SolutionState::update

Test improvements:
- Extract normal_equation_residual helper, reuse across 3 sites
- Relax iter-count assert in none/identity equivalence test to <=1
- Replace exact equality on r1.x / residual_norm with <1e-15 tolerance
  to keep determinism testable under future parallel reductions
- Add test_mlsmr_step1_alpha_zero_early_exit covering the A^T b == 0 path
- Add test_mlsmr_local_reorth_window_boundary_sizes for window in
  {1, n_cols, n_cols+1} on Vandermonde(30,12)
- Rename closure param size -> window_size; add cond(A) ~ 1e10 preface
@schroedk schroedk force-pushed the feature/mlsmr-mlsqr branch from 98cdaf4 to 8cddfb0 Compare April 28, 2026 14:27
- Drop lsmr_profile (samply harness), lsmr_local_reorth (one-off
  reorth-window study), and bench_lsmr_local_size.py (manual local-size
  sweep). All three were exploration scaffolding; their findings are
  baked into the solver defaults and the lsmr/cg comparison in
  lsmr_vs_cg.
- Extend lsmr_vs_cg with akm-sparse / 3fe-highcard / 4fe-chain shapes
  and CG tol=1e-13 to back the PR's headline accuracy claims.
@schroedk schroedk force-pushed the feature/mlsmr-mlsqr branch from 8cddfb0 to 9c98f2d Compare April 28, 2026 14:31
- Expose lsmr() and preconditioned_lsmr() as standalone entry points;
  mlsmr() remains as a thin Option-dispatched wrapper for the internal
  round-trip test and the existing public API.
- Add LsmrStopReason for diagnostics, splitting Stop::Converged into
  ResidualTolerance / NormalEquationTolerance and surfacing
  ZeroRhs / InitialNormalEquationResidualZero / BidiagonalizationBreakdown
  / MaxIterations.
- Add SolveError::InvalidInput, validate rhs length / finiteness, tol,
  and (preconditioned) shape before any work runs.
- Convergence wins over breakdown when both fire on the same step; the
  user-supplied tolerance is the contract.
- Factor the zero-rhs early-return path into zero_rhs_result(n).
- Tests: cover underdetermined / rank-deficient / zero-column-and-row
  systems, mid-stream beta=0 lucky breakdown, zero-rhs and
  bad-input rejection, and dispatcher equivalence.
- Replace Vec<Vec<f64>> reorth slots with a single flat Vec<f64> and
  index helpers. One allocation per LocalReorth / ModifiedLocalReorth
  instead of `cap`; better locality for the windowed MGS pass.
- Use par_dot(b, b).sqrt() in init() instead of vec_norm(b); par_dot
  falls back to sequential below LSMR_PAR_THRESHOLD.
- Handle the beta == 0 lucky breakdown explicitly: zero v (and the
  paired p_tilde in the modified path) and return alpha = 0 so the
  caller's solution.update is a no-op. The outer solver's alpha == 0
  branch then exits cleanly with BidiagonalizationBreakdown.
Make future variant additions to SolveError non-breaking for downstream
Rust consumers. Bundle the attribute with the InvalidInput variant added
earlier in this branch so the breakage happens once, not twice.

Also fold in CHANGELOG entries for the non_exhaustive/InvalidInput
change and the iterations-semantics shift (now includes refinement
iterations), and patch a duplicated docstring line + missing time_total
entry in the SolveResult attributes block of _within.pyi.
@schroedk schroedk merged commit ef3ad04 into main Apr 29, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant