v0.9.0
Fixed — batched iterative refinement for wide multi-RHS solves (#58)
Solver::solve_many_refined looped the single-RHS refiner per column,
which bypassed the BLAS-3 panel kernel — so with refinement on (the
default) batched multi-RHS solves were 3–7× slower per RHS than the
unrefined batched path, and could be slower than looping single-RHS.
Wide refined solves (nrhs ≥ 16) now refine through a batched loop:
the initial and per-step correction solves go through solve_sparse_many
(one panel solve over the still-active columns), and the residual is a
per-column SpMV. Per-column best-iterate tracking and the convergence
predicates (ε·√n relative target, 2-strike plateau, 100× divergence)
are preserved exactly, and each step compacts the active columns so the
batched path never does more solve work than the per-column loop. Narrow
solves (nrhs < 16, e.g. the IPM predictor-corrector) keep the
per-column loop unchanged.
For 16 ≤ nrhs < 32 the batched result is bit-identical to the
per-column loop (the solve runs the rank-1 kernels there); for
nrhs ≥ 32 it agrees with the per-column oracle to the refinement
residual target. Measured (bench_multirhs, 2-D Laplacians): batched
refined is ~2.5–3× faster per RHS than the per-column refined loop.
The common case (every column already at the target after the direct
solve, i.e. ~0 correction steps) now returns before allocating the wide
best_x/residual buffers — it allocates only the solution and a
length-n scratch. Those three n × nrhs allocations were cheap in the
native binary but cost ~50 µs/RHS under the Python process's allocator,
masking the win through Solver.solve_refined on 2-D inputs; with them
gone the Python refined path shows the full ~2–3× too.
Changed — multi-RHS sparse solve: BLAS-3 panel kernels (#57 fix #2)
Wide multi-RHS solves (nrhs ≥ 32) now run each supernode's
forward/backward substitution as a register-blocked dense panel solve
— a TRSM on the unit-lower triangle L_11 plus a MR×NR (4×8)
register-tiled GEMM on the trailing block L_21 — instead of a rank-1
cascade. Narrow solves (the IPM hot path) stay on the rank-1 kernels.
Two supporting layout changes, both internal (no public API/ABI change,
RHS/solution stay column-major n × nrhs):
- The internal
yworking buffer is now row-major, so every
per-supernode gather/scatter is a contiguous memcpy instead of a
stride-ntranspose. This removed a cache-conflict pathology on
power-of-twon(the n=1024 grid had regressed) and roughly halved
the wide-solve time across the board. - The forward substitution and D-block solve are fused into one
postorder pass (a node's eliminated rows are final once its
forward-sub completes), saving one gather/scatter round trip per
supernode.
Forward substitution stays bit-identical to looping single-RHS;
back-substitution differs only by floating-point reassociation
(~κ·eps). The multi-RHS parity suite (nrhs up to 64, MR/NR tail
sizes) verifies max|many − single| < 1e-12 against the independent
single-RHS oracle (observed ≤ 1.6e-15).
Measured per-RHS speedup vs looping single-RHS, on 2-D Laplacians
(cargo run --release --bin bench_multirhs, nrhs ∈ {64, 256}):
n=484 ~4–5×, n=1024 ~3×, n=2025 ~5–6×.
Changed — multi-RHS sparse solve: row-major working buffer (#57)
The per-supernode working buffer in solve_sparse_core_many_into now
uses a row-major layout (w[i*nrhs + c]) instead of column-major
(w[c*nrow + i]). The per-RHS inner loops in forward-sub, the D-block
solve, and back-sub are now contiguous (stride-1) and auto-vectorize.
The caller-visible RHS/solution layout is unchanged — it remains
column-major n × nrhs, matching MUMPS/SSIDS — so there is no public
API or ABI change. The single-RHS path (solve_sparse_core_into) and
the iterative-refinement path are untouched. Results are bit-identical
to looping the single-RHS solve (verified by the multi-RHS parity
suite, max|many − single| = 0).
Measured per-RHS speedup of one batched solve_sparse_many call vs
looping single-RHS, on 2-D Laplacians (cargo run --release --bin bench_multirhs): batched/looped ratio 0.76–0.99 (up to ~1.3× faster
per RHS) at nrhs ∈ {64, 256}. The larger 5–10× regime needs the
BLAS-3 panel kernels (issue #57 fix #2, deferred).