v0.2.0
Fixed — Honest resolved_method and consistent Auto routing (#3)
SymbolicFactorization.resolved_method now reflects what the symbolic
pipeline actually ran rather than what the caller requested. Two
behavior changes:
- ScotchND silent fallback is surfaced. When SCOTCH's nested-
dissection recursion produces no separator for the entire graph
(bordered-KKT shapes such as PoissonControl trigger this), the
driver falls back to AMD viaamd_leaffor every recursion node.
Previouslyresolved_methodstill reportedScotchNDwhile the
permutation was bit-identical to AMD's. It now reportsAmd. The
recovery itself is unchanged — only its visibility is fixed. OrderingMethod::Autois consistent with the no-arg default.
Auto resolution now happens against the original matrix's pattern
before anyLdltCompresspreprocessor reshapes the graph, and the
residual branch delegates topick_default_method(the rule used
by the no-argsymbolic_factorize). Previously Auto could pick a
different concrete method than the default rule on the same
matrix, depending on whether compression triggered. Auto is now a
strict superset: same answer as the default plus the two extra
shape-bakeoff branches (n>100_000→ ScotchND,n<10_000→
KahipND).
Reported by independent triage on K=158 PoissonControl benchmarks.
Fixed — CscMatrix::from_triplets rejects upper-triangle entries (#4)
CscMatrix::from_triplets and CscMatrix::validate now return
FeralError::InvalidInput when any triplet has row < col. Previously
upper-triangle entries were silently accepted and routed through
sort_and_sum_duplicates, producing a CscMatrix whose row indices
violated the documented "lower triangle only" invariant. Downstream
consumers (e.g. symmetric_pattern) assume lower-triangle storage, so
the same symmetric matrix described with upper- vs lower-triangle
triplets produced different solve results. The error message identifies
the offending triplet by index and (row, col). Reported by @janosh.
Changed
Solvernow defaults to the rayon-parallel multifrontal driver
(factorize_multifrontal_parallel_with_workspace). The driver is
bit-exact with the sequential supernodal path on a per-supernode
basis and falls through to the sequential path when the supernode
count is belowN_PAR_MIN = 32, so small-problem latency is not
affected. Override withSolver::new().with_parallel(false).
Closes #7. Motivation: pounce'smarine_1600/pinene_3200
Mittelmann runs were spending all their time in sequential
factor_one_supernodeeven though the parallel driver was
available; this wiresSolverdirectly to it.
Added
Solver::with_parallel(bool)— opt out of the rayon-parallel
driver (returnsSelffor builder chaining).Solver::parallel()— test/diagnostic accessor for the current
flag value.SymmetricMatrix::from_pooled_buf(n, buf)constructor that zeros
only the lower triangle when reusing a pooled buffer; cuts the
dead upper-triangle memset out offactor_one_supernode's
per-supernode hot path. See dev/decisions.md 2026-05-12.compute_infnorm_dense(&SymmetricMatrix)insrc/scaling/infnorm.rs
— dense-native Knight-Ruiz iteration for the D.3/D.4 dense
fast-path; the sparsecompute_infnormremains the path for the
multifrontal driver.Solver::inertia()—Option<&Inertia>accessor returning the
full inertia of the last successful factor. Complements the
Ipopt-shapednum_negative_eigenvalues(which panics if no factor
is stored) for callers that prefer to branch onNone. Used by
the cross-solver bench harness.
Tooling
- Cross-solver comparison harness under
external_benchmarks/comparison/(run.py, aggregate.py, report.py)
measures feral against MUMPS 5.8.2 and HSL MA97 2.8.1 on a sampled
SuiteSparse subset and emitsREPORT.md. Each solver is configured
to its production-quality (refinement-on) settings so the residual
comparison is apples-to-apples:- feral driver routes through
solve_sparse_refined(Richardson
refinement with stagnation exit) insrc/bin/bench_one_matrix.rs. external_benchmarks/mumps_oracle/mumps_bench.Fsets
ICNTL(10) = 2(max two iterative-refinement steps) — MUMPS
default is 0 (no refinement).external_benchmarks/hsl_bench/hsl_bench.cwraps
ma97_solve_din a 4-step Richardson loop because MA97 has no
native residual-based refinement entry for non-singular systems.
Configuration is documented in the generated report's Solvers
table.
- feral driver routes through
Performance
- Dense fast-path (
dense_fast_factor) now runs Knight-Ruiz ∞-norm
scaling directly on the column-major lower-triangle buffer
produced byto_dense_into, removing therow_idx[k]
indirection that dominated wall time on small-dense matrices.
Routing:ScalingStrategy::AutoandInfNormgo through the
dense KR;Mc64Symmetric,Identity,Externalare honored
via the unchanged sparse path. Bit-exact with the sparse KR on
every fast-path-gate matrix (should_use_dense_fast_path⇒
matrix is small enough that every column-major slot maps 1:1
onto a CSC entry or a known-zero, and(d_i · 0 · d_j) = 0
is a no-op in the max-reduction). Targets the
dev/results/lever-d3/stage1-stage2-2026-04-19.md§1 finding
thatcompute_scalingwas 82% (34 of 41 µs) of dense-path
wall time on TRO3X3_0013. - Pooled
local_contribsper rayon worker insideFactorWorkspace,
removing a per-taskVec<Option<ContribBlock>>of length
n_snodesfrom the parallel driver. Decisive on cont-201
(sequential –34%, parallel-at-T=8 –10%); also helps
bratu3d (–6% / –5%). Bit-exact. See dev/decisions.md
2026-05-12 and dev/sessions/2026-05-12-01.md. - Skip the upper-triangle zero on pooled frontal buffer reuse
(SymmetricMatrix::from_pooled_buf). Bit-exact; 5–10% sequential
wall reduction across mid-size matrices.
Investigated
-
Parallel multifrontal driver lock contention (T=4) — falsified.
Added opt-inAtomicLockStatstelemetry toNumericParamswith
per-task lock wait/hold counters and eight per-phase wall-time
counters wrapping the sequential prologue/epilogue. cont-201's
previously-reported residual headroom is sequential symbolic
factorize (157 ms of a 214 ms single-shot wall), not mutex
contention (worst-case 3.4% of body time on cont-201, 0.02% on
c-big). On the cached-symbolic path (production / pounce-IPM
regime) cont-201 wall drops 214 → 56 ms with body_frac jumping
0.15× → 0.55×; remaining 1.5× headroom is inside the rayon::scope,
not at lock sites. Full analysis in
dev/debugging/2026-05-12-cont201-cached-headroom.md. -
Parallel driver within-scope localization (iteration 2) —
rayon idle dominates. Addedtask_wall_ns(whole-closure
bracket) andws_lock_wait_ns(per-worker workspace mutex
wait) toAtomicLockStats. Derivedrayon_idle = scope·T − task_wall_aggquantifies the parallelism deficit attributable
to etree dependencies. cont-201 cached: rayon_idle = 12.3 ms/T
(78% of the gap) vs locks 1.7 ms/T (10%) and ctrl-flow 1.5
ms/T (10%). c-big at T=4 is essentially sequential (74%
rayon-idle capacity, 1.04× speedup vs body_agg). Conclusion:
assembly-tree parallelism is exhausted on these matrices;
within-supernode parallelism (panel-BK / threaded dense
kernels) is the only remaining axis. Closes the cont-201
assembly-tree investigation. See iteration 2 in
dev/debugging/2026-05-12-cont201-cached-headroom.md. -
Scaling cache verification (iteration 3) —
compute_scaling_with_cache
works as designed. Addedsolver_scaling_phase_splittest
(#[ignore]) that loads the corpus and times
pick_scaling_strategy+compute_scaling_with_cache(cache=None)- reorder gather. c-big picks
Mc64Symmetricand the no-cache
path takes 2.3 seconds (full Hungarian); the cached path
in production takes 2.4 ms — 1000× speedup, cache hits.
cont-201 and bcsstk38 pickInfNorm, which is values-dependent
Knight-Ruiz iteration (~4 ms per call) and is not cacheable
across IPM iterations. The 3.95 ms scaling slice on cached
cont-201 is fundamental per-factor work, not a missed cache.
Closes scaling probe from session 2026-05-12-02 "Next session
should #2".
- reorder gather. c-big picks
-
Issue #5 (MSS1 BK inertia non-monotone under δ_w·I): triage
complete, closed on the feral side. Landed a reproducer test- zero_tol/pivot_threshold sweep diagnostics in
src/numeric/factorize.rs::tests. Empirically demonstrated
that no in-kernel magnitude-floor lever cures the wandering;
cross-checked MUMPS 5.8.2 and MA57 (via Ipopt's wrapper) and
confirmed neither implements eigenvalue-aware 2×2 splitting.
Recommended fix is upstream (caller-side δ_c bump matching
Ipopt'sPerturbForSingularity). Full analysis in
dev/research/issue-5-mss1-inertia-monotonicity.md§9.
- zero_tol/pivot_threshold sweep diagnostics in