Releases: jkitchin/feral
feral v0.11.0
feral 0.11.0
Highlights
- Unsymmetric LU basis engine (
feral::lu, #81) — general square
A = P L U Qfactorization withftran/btransolves and simplex-style
product-form column updates, auto-routing between a dense and a sparse
engine. Exposed in Python asLuFactor. - Expanded Python interface (all additive, backward-compatible):
LuFactor; factor access viaSolver.factors()(unit-lowerLas
CSC / optional scipy, block-diagonalD,perm/scalingfor the
L D Lᵀ = P (S A S) Pᵀreconstruction); symbolic analysis via
feral.analyze(...)/Solver.symbolic(); and introspection plus new
Solver(...)tuning knobs (ordering,profiling,mc64_cache, …),
pivot-magnitude and MC64 counters,last_factor_stats(),
profile_report(),scaling_info. - Documentation — new mdBook Python chapter sections, a new executed
example notebook05_lu_and_introspection, and a refreshed top-level
README (LU engine, production-ready Status framing).
Packaging
- The six fill-reducing ordering crates (
feral-ordering-core,feral-amd,
feral-amf,feral-metis,feral-scotch,feral-kahip) are bumped
0.2.0 → 0.2.1so crates.io picks up the API-compatible fixes and perf
work accumulated since0.2.0(they had been silently skipped at publish
across0.3.0–0.10.0).feral'sversion = "0.2"requirement resolves
to0.2.1unchanged. A release-checklist guard now flags this staleness
class so it cannot recur.
Correctness & performance campaign
This release also folds in the repo-review campaign: a large set of
correctness fixes (inertia reporting, MC64 partial-singular fallback,
sparse/dense LU singularity handling, MTX-reader validation, CSC validation,
2×2 D-block gating, static-pivot scale-invariance) and ordering-crate
algorithm fixes (real SHEM coarsening and true GGP gain in feral-metis,
vertex-weight flow balancing and n_components in feral-kahip,
separator-FM in feral-scotch, sorted-rows invariant in
feral-ordering-core), plus several performance improvements (pooled scratch
reuse, borrow-not-clone in feral_factor, linear default postorder).
See CHANGELOG.md
for the complete, itemized list (every fix is tagged with its finding ID).
Inertia remains validated against the feral + MUMPS 5.8.2 + SPRAL/SSIDS
consensus oracle; 705 tests passing.
feral v0.10.0
Ordering-routing release. No open issues at cut; 20 commits since v0.9.0.
This is a minor release (not a patch) because the headline change is a
deliberate change to default Auto ordering behavior, not just a bug fix.
Changed
Autonow prefers AMF over MetisND at every size (#73, follow-up to #67).
TheAMF_BAND_MAXceiling is removed: whenever the size rule would pick
MetisND,choose_adaptivenow routes to AMF at everyn. The #50 powerflow
guard (n > 100k && avg_deg < 5 → Amd) and the #64 arrow/dense-border catch
still fire first, so only the uniformly-thin would-be-MetisND population is
affected. The #73 A/B (factor+solve) showed AMF winning on every measured
matrix in the n>100k thin regime (dtoc2 2.49×, cont5_1_l 2.75×, nql180 2.05×,
pinene 1.18×, YATP1NE 2.13×) — including a case where MetisND had smaller
symbolic fill yet was 2× slower, so a fill-guarded race was rejected.- 144 diagnostic binaries relocated to
crates/feral-diagnostics(#71). Root
cargo build/test/clippynow operate on theferalpackage only. Run
diagnostics on demand withcargo run -p feral-diagnostics --bin <name>.
Fixed
- Spurious zero pivots → wrong inertia under
Autoscaling on ill-conditioned
KKTs (#65). Inertia-guided MC64 fallback corrects the false-infeasible signal. - Arrow/bordered-KKT ordering blow-up (#64).
Autono longer picks MetisND on
arrow/bordered patterns that caused ~7× fill blow-up (r05 LP); routes to AMF.
Compatibility
Public Rust API and C ABI unchanged. Functional changes are confined to the root
feral package; the ordering plugin crates are unchanged (stay at 0.2.0).
See CHANGELOG.md for the full per-change detail.
v0.9.0
Fixed — batched iterative refinement for wide multi-RHS solves (#58)
Solver::solve_many_refined looped the single-RHS refiner per column,
which bypassed the BLAS-3 panel kernel — so with refinement on (the
default) batched multi-RHS solves were 3–7× slower per RHS than the
unrefined batched path, and could be slower than looping single-RHS.
Wide refined solves (nrhs ≥ 16) now refine through a batched loop:
the initial and per-step correction solves go through solve_sparse_many
(one panel solve over the still-active columns), and the residual is a
per-column SpMV. Per-column best-iterate tracking and the convergence
predicates (ε·√n relative target, 2-strike plateau, 100× divergence)
are preserved exactly, and each step compacts the active columns so the
batched path never does more solve work than the per-column loop. Narrow
solves (nrhs < 16, e.g. the IPM predictor-corrector) keep the
per-column loop unchanged.
For 16 ≤ nrhs < 32 the batched result is bit-identical to the
per-column loop (the solve runs the rank-1 kernels there); for
nrhs ≥ 32 it agrees with the per-column oracle to the refinement
residual target. Measured (bench_multirhs, 2-D Laplacians): batched
refined is ~2.5–3× faster per RHS than the per-column refined loop.
The common case (every column already at the target after the direct
solve, i.e. ~0 correction steps) now returns before allocating the wide
best_x/residual buffers — it allocates only the solution and a
length-n scratch. Those three n × nrhs allocations were cheap in the
native binary but cost ~50 µs/RHS under the Python process's allocator,
masking the win through Solver.solve_refined on 2-D inputs; with them
gone the Python refined path shows the full ~2–3× too.
Changed — multi-RHS sparse solve: BLAS-3 panel kernels (#57 fix #2)
Wide multi-RHS solves (nrhs ≥ 32) now run each supernode's
forward/backward substitution as a register-blocked dense panel solve
— a TRSM on the unit-lower triangle L_11 plus a MR×NR (4×8)
register-tiled GEMM on the trailing block L_21 — instead of a rank-1
cascade. Narrow solves (the IPM hot path) stay on the rank-1 kernels.
Two supporting layout changes, both internal (no public API/ABI change,
RHS/solution stay column-major n × nrhs):
- The internal
yworking buffer is now row-major, so every
per-supernode gather/scatter is a contiguous memcpy instead of a
stride-ntranspose. This removed a cache-conflict pathology on
power-of-twon(the n=1024 grid had regressed) and roughly halved
the wide-solve time across the board. - The forward substitution and D-block solve are fused into one
postorder pass (a node's eliminated rows are final once its
forward-sub completes), saving one gather/scatter round trip per
supernode.
Forward substitution stays bit-identical to looping single-RHS;
back-substitution differs only by floating-point reassociation
(~κ·eps). The multi-RHS parity suite (nrhs up to 64, MR/NR tail
sizes) verifies max|many − single| < 1e-12 against the independent
single-RHS oracle (observed ≤ 1.6e-15).
Measured per-RHS speedup vs looping single-RHS, on 2-D Laplacians
(cargo run --release --bin bench_multirhs, nrhs ∈ {64, 256}):
n=484 ~4–5×, n=1024 ~3×, n=2025 ~5–6×.
Changed — multi-RHS sparse solve: row-major working buffer (#57)
The per-supernode working buffer in solve_sparse_core_many_into now
uses a row-major layout (w[i*nrhs + c]) instead of column-major
(w[c*nrow + i]). The per-RHS inner loops in forward-sub, the D-block
solve, and back-sub are now contiguous (stride-1) and auto-vectorize.
The caller-visible RHS/solution layout is unchanged — it remains
column-major n × nrhs, matching MUMPS/SSIDS — so there is no public
API or ABI change. The single-RHS path (solve_sparse_core_into) and
the iterative-refinement path are untouched. Results are bit-identical
to looping the single-RHS solve (verified by the multi-RHS parity
suite, max|many − single| = 0).
Measured per-RHS speedup of one batched solve_sparse_many call vs
looping single-RHS, on 2-D Laplacians (cargo run --release --bin bench_multirhs): batched/looped ratio 0.76–0.99 (up to ~1.3× faster
per RHS) at nrhs ∈ {64, 256}. The larger 5–10× regime needs the
BLAS-3 panel kernels (issue #57 fix #2, deferred).
feral v0.8.0
Removed — 4 synthetic rank-deficient stress-corpus matrices
Dropped rankdef_10_3, rankdef_50_5, rankdef_exact_50_5, and
stokes_q1p0_8 from the stress corpus. Under #54's SSIDS-aligned
strict-zero routing, feral reported inertia.zero = 1 on all four —
which contradicted MUMPS, SSIDS, and MA57 simultaneously,
violating CLAUDE.md's "must agree with at least one canonical" rule
and red-lighting the stress-smoke gate on the v0.8.0 release commit.
Rather than allowlist them or narrow #54's zero_tol (which would
reopen the IPM δ-cascade instability on nuffield2_trap_iter1.mtx
that motivated #54 in the first place), the four matrices were
removed entirely. They were synthetic borderline fixtures where the
"correct" zero count depends on order-1e-15 round-off and no
3-of-4-oracle consensus exists — the same disagreement that
compute_consensus.py already tags excluded. The rank-deficient
regime remains covered by rankdef_5_2, rankdef_200_20,
rankdef_exact_100_10, saddle_rankdef_50_10_3,
saddle_rankdef_100_20_5. See dev/decisions.md (2026-05-28 entry).
Performance — Thomson-Hessian per-iter throughput (#56)
Three additive levers on the dense / wide-supernode IPM-KKT hot path,
landed on issue-56-thomson-hessian-throughput and merged to main:
- Lever A — pre-built
permuted_pattern+ permute structure cache.
Collapsed the symbolicpermute (P A P^T)+symmetric_pattern
phases to <0.5 % of total wall on warm calls. - Lever B — fused single-pass
contribextractwrite. Replaced
resize(cdim², 0.0)+ lower-triangle overwrite with
reserve + unsafe set_len+ write-each-cell-once. Bit-identical
contrib block (extend_addreads onlyci ≥ cj; root-Schur
extractor canonicalizes via transpose;parallel_corpus_parity
binary-compares the full buffer). Roughly 33 % fewer writes.
Re-measurement at Thomson n=200 (sequential, 9 warm reps):
factor min −5.5 % (parallel ON), −10 % (parallel OFF). - Lever C — InfNorm Knight-Ruiz inner-loop vectorization. Hoisted
the loop-carriedrow_max[j]dependency to a register accumulator
(sparse and dense paths, bit-identical by associativity of max on
non-NaN finite inputs); added a pulp-dispatched SIMD off-diagonal
kernel on the dense path. Re-measurement at Thomson n=200:
scaling phase −19 %, total wall −5 %.
No regression on the Phase 2.8.1 corpus partition gates; both
small-frontal and medium buckets improved relative to the 2026-05-27
baseline. KIRBY2 family worst-case sparse outlier improved from
10.25× → 7.97× vs MUMPS.
See dev/sessions/2026-05-28-01.md and
dev/research/issue-56-thomson-hessian-throughput-2026-05-27.md for
the localization data and re-measurement tables.
Added — symbolic-analysis-time delayed-pivot budget + CB rewire (#55)
Per-supernode delayed_capacity is assigned during symbolic
analysis. The numeric phase enforces the bound at the frontal
expansion site: if more pivots delay into a supernode than its
capacity, the factor returns the structured
FeralError::DelayBudgetExceeded { supernode, required, capacity }
(MUMPS INFO(2) workspace-overflow analog) rather than growing
the front unboundedly. Cascade-break now triggers on budget
exhaustion instead of the heuristic ratio gate, matching MUMPS's
dfac_front_aux.F:1251-1331 invariant that static perturbation
fires only when delay is structurally impossible.
Capacity formula:
tight = max(4 * own_ncol, 16)capacity = min(subtree_ncol(s) - own_ncol(s), tight)
Root-supernode cap (defensive, n >= 1024): declines amalgamations
that would push the root past min(0.05 * n, 2048) columns.
FeralConfig::default() now ships with cascade-break armed
(cascade_break_ratio = Some(0.5), cascade_break_eps = Some(1e-10))
as the budget-exhausted fallback. Pounce's per-problem .opt
overrides for nql180 / pinene_3200 are no longer required.
See dev/research/symbolic-delay-budget-2026-05-27.md and
dev/decisions.md (2026-05-27 entry, frozen convention).
Changed — strict-zero pivots route to inertia.zero (SSIDS-aligned, #54)
When ZeroPivotAction::ForceAccept accepts a 1×1 pivot whose
magnitude satisfies |d| <= zero_tol, the inertia now increments
zero instead of routing by sign(d). This matches SSIDS
(NumericSubtree.hxx:259-267) and HSL MA57 (INFO(24) = neig,
INFO(25) = number of zero pivots) and supersedes the Issue #42
Option A sign-routing rule.
Motivation: pounce's IPM δ-cascade on nuffield2_trap_iter1.mtx
(LP-shaped KKT, n=26155) stalled for 600 s (vs 1.8 s on MA57)
because Option A split bit-exact zero pivots by IEEE round-off and
the counter jumped backwards mid-cascade (13042 → 12615). The new
accounting restores monotonicity (probe confirms 0 backwards jumps
across δ_x ∈ {0 … 6.99e19}).
Behavior changes visible to callers:
Solver::inertia()andfactor()'sInertiafield now report
the mathematical (Sylvester) inertia on matrices with rank
deficiency that lands in strict-zero pivots.num_negative_eigenvalues()returns strictinertia.negative
only (unchanged convention). IPM callers comparing against an
expected oracle should now sumnegative + zeroto match
MA57's INFO(24)+INFO(25) convention.
Tests updated to reflect the new convention; see
dev/decisions.md (2026-05-26 entry) for the full trade-off
analysis and dev/research/issue-54-lp-kkt-inertia.md for the
oracle cross-check.
Added — DelayBudgetExceeded exposed through Python bindings
feral-solver (Python) now re-exports the #55 error as a named
exception:
feral.DelayBudgetExceeded— subclass offeral.FactorError
(and transitivelyferal.FeralError), raised when
factor()/refactor()hits the symbolic-analysis-time
delayed-pivot budget. The message carries the supernode index,
required and capacity columns. Python callers no longer see this
routed through the genericNumericFailurewildcard.
Added — documentation site (mdBook + rustdoc on GitHub Pages)
book/— mdBook skeleton (introduction,getting-started,
inertia,api). Content is intentionally thin; thedev/research/
notes remain internal and are not auto-published..github/workflows/pages.yml— builds the book and
cargo doc --workspace --no-deps --lib, mounts rustdoc under/api/,
and deploys to GitHub Pages on every push tomain. Site URL is
https://jkitchin.github.io/feral/; rustdoc at/api/feral/.Cargo.toml:/bookadded to the crateexcludelist so the book
source does not ship in the published crate tarball..gitignore:/book/bookignored (mdBook output is regenerated by
the Pages workflow).
Repo setting required once: Settings → Pages → Source: GitHub
Actions. Until that flag is set, the deploy job will fail with a
"Pages not enabled" error; the build job is unaffected.
feral v0.7.0
v0.7.0 — opt-in instrumentation accessors
Minor release for #52.
New public surface on Solver for downstream IPM debugging of
linear-solver behavior, without re-running with cargo flamegraph
or instrumenting forks.
Added
Phase A — always-on snapshot. Solver::last_factor_stats()
returns Option<FactorStats> populated after every successful
factor(). Fields: nnz_a, nnz_l, fill_ratio, inertia,
min_abs_pivot, max_abs_pivot, pattern_reused, scaling_info.
No gating flag — the two extra integer writes per factor() are
cheaper than a gate check would be.
Phase B — opt-in profiler. Solver::with_profiling(true)
enables per-supernode numeric timings and one-shot symbolic timings,
read back via Solver::profile_report() and
Solver::symbolic_profile_report(). The symbolic report is
populated only on cache-miss factors — None on cache hits is the
unambiguous "did symbolic actually run" signal.
Performance
Default is false; when off, no profiler arcs are allocated and the
code path is byte-identical to a pre-issue-52 build. Measured by
benches/issue52_overhead.rs on tridiagonal SPDs
(n ∈ {64, 256, 1024}, sequential, 30 samples × 3 s, dev machine):
| n | main baseline | default-off | profiling-on |
|---|---|---|---|
| 64 | 260.6 µs | 257.6 µs | 258.9 µs |
| 256 | 348.2 µs | 345.2 µs | 347.0 µs |
| 1024 | 719.6 µs | 714.7 µs | 709.1 µs |
Default-off vs main baseline: within ±1.2% (often faster — noise).
Profiling-on vs default-off: within ±1%. Both deltas sit inside the
criterion noise band on tridiagonal workloads.
Deferred
- Thread-local accumulator for parallel-driver profiler contention
(plan §B2b) — documented as escape hatch; revisit only if a real
IPM workload regresses with profiling on. - Phase C residual-norm / refinement-step accessors — out of scope
for this issue.
feral v0.6.0
Added — Solver::with_scaling() builder ([#51][i51])
New public builder method on Solver to pin the scaling strategy
explicitly. Lets IPM hosts override ScalingStrategy::Auto when
they already know what they want — recommended escape hatch when
the picker's per-matrix heuristic is wrong for a problem class.
Default (Auto) is now sticky on cached pattern (see "Fixed" below)
so the picker no longer flaps; this builder is for callers that
want full control.
Fixed — sticky Auto scaling on cached pattern ([#51][i51])
Three coordinated fixes in src/numeric/solver.rs address a ~50×
numeric-phase slowdown on IPM workloads when the picker re-routes
across iters or PartialSingular outcomes drop the MC64 cache.
Reproducer (pounce/gams/nlpbench/feral_repro/powerflow22/,
n=2,813,976 IPM KKT, default Auto):
| call | pre-fix | post-fix |
|---|---|---|
| factor (cold) | 54.77 s | 54.42 s |
| refactor (cached symbolic, iter A values) | 1.00 s | 1.02 s |
| factor #2 (cached symbolic, iter B values) | 53.80 s | 1.07 s |
The three fixes:
-
Sticky Auto pick. First
factor()on a pattern runs the full
compute_scaling_auto_with_cachepipeline (preserves Policy-4
fallback semantics — InfNorm-spread guard, off-diag-ratio guard,
MC64 catastrophic-spread guard). Post-call we derive the resolved
strategy fromfactors.scaling_info
(Mc64FallbackToInfnorm → InfNorm,
PartialSingular → Mc64Symmetric,NotApplied → Identity,
Applied → pick_scaling_strategy(matrix)) and stash it on the
Solverasauto_picked_strategy. Every subsequent factor on
the same pattern uses the stashed strategy directly, bypassing
Auto. Pattern change clears it alongside the MC64 cache. Mirrors
MUMPS ICNTL(7) / SSIDSoptions%ordering: structural decision
once at first call, reuse every refactor. -
MC64 cache gate widened to
PartialSingular. The Hungarian
on a structurally rank-deficient KKT still produced a real
scaling vector (unmatched positions land at 1.0 per
mc64.rs:222); the value-bound check still gates reuse, so
caching is correctness-safe. Pre-fix the post-#49 gate required
ScalingInfo::Applied, which dropped the cache on every IPM iter
over a structurally rank-deficient KKT and forced the Hungarian
to rerun from scratch. -
Solver::with_scaling()builder (see "Added" above).
Tests
Three unit tests in src/numeric/solver.rs:
issue_51_with_scaling_builder_overrides_defaultissue_51_auto_pick_is_sticky_on_cached_patternissue_51_partial_singular_populates_cache
And one #[ignore]'d corpus regression test:
issue_51_corpus_sticky_auto_holds_across_ipm_iters walks
tests/data/parity/<family>/*.mtx for every family with ≥ 2 IPM
iter snapshots, factors all iters against one Solver::new(), and
asserts the sticky pick holds across every iter (13 families
covered: acopp30, hatfldbne, hahn1, ssi, …).
mc64_fallback_surfaces_via_solver_api was updated to reflect the
new sticky semantics: iter 2 now asserts
mc64_fallback_count == 1 (sticky pin runs straight InfNorm; no
fallback to surface) and locks
auto_picked_strategy = Some(InfNorm).
Changed — Auto dispatcher rewrites ([#50][i50])
The OrderingMethod::Auto dispatcher (src/symbolic/mod.rs:: choose_adaptive) was simplified to two rules on top of
pick_default_method: kept only the very-large-and-sparse catch
(n > 100_000 && full_avg_deg < 5.0 → Amd); everything else
delegates to pick_default_method (n ≤ 10_000 → Amf,
n > 10_000 → MetisND).
-
Fix A — large-and-sparse swap (
c442a0c). The pre-fix
n > 100_000 && full_avg_deg < 5.0 → ScotchNDbranch is swapped
toAmd. Onpowerflow22(n=2.8 M, full_avg_deg ≈ 3.7) prior
ScotchND took 113.8 s symbolic (15.8 M nnz_L); AMD takes 55 s
(10.4 M nnz_L). The ScotchND advantage at very large n had been
load-bearing against the same BK pivoting cascade that motivated
the chain catches; issue #46 eliminated that amplifier in May
2026 and removed the justification for routing very-large sparse
matrices through nested dissection. Corpus inventory:
dev/research/issue-50-numeric-inventory.csvshows the IPM
corpus's [100k, 200k) bucket has AMD/MetisND num_nnz_l ratio 1.00
on both representatives. Validation: 258 chain-catch corpus rows
under post-Fix Auto — 0 failures, 0 num_nnz_l regressions for
matrices that actually reroute. See
dev/research/issue-50-metisnd-symbolic-cost.md§F7–§F11. -
F11 follow-up — small-and-sparse retire (
3f8f6f6). The
pre-fixn < 10_000 && full_avg_deg < 15.0 → KahipNDbranch is
deleted entirely; the population now falls through to
pick_default_method'sn ≤ 10_000 → Amfdefault (MUMPS
ana_set_ordering.FSYM=2 N≤10000). Justified by an 838-matrix
4-way inventory (dev/research/small-sparse-inventory.csv): AMF
wins 169/838 strict per-matrix (KahipND 16); aggregate num_nnz_l
ratios to AMD are AMF 0.870×, KahipND 0.984×; aggregate factor_us
ratios are AMF 0.832×, KahipND 0.990×. KahipND remains reachable
viaOrderingMethod::KahipNDfor the 41 high-avg-deg cases
(STEENBRD, HADAMARD, TABLE8) where it still wins — all sub-22k
nnz_L absolute. Seedev/research/issue-50-metisnd-symbolic-cost.md
§F12.
feral v0.5.0
feral v0.5.0 (2026-05-22)
First tagged release. Sparse symmetric indefinite direct solver in
pure Rust with certified inertia counts.
Highlights since 0.4.0:
- Added: value-bounded MC64 scaling cache; near-singularity signal.
- Changed: inertia counts every pivot by sign; oracle-consensus parity.
- Performance: Schur trailing-update kernel widened (quad NEON-tile).
- Fixed: 8 correctness/robustness fixes across scaling, inertia, and
delayed-pivot handling (#39, #45, #46, #47, #48, #49).
See CHANGELOG.md [0.5.0] for the full list.
v0.4.0 — Mittelmann parity with MA57 + auto-armed cascade-break
Headline
feral is now at parity with MA57 on the 47-problem Mittelmann NLP
benchmark panel (Ipopt 3.14.20 + --linear_solver=feral vs
--linear_solver=ma57, 600 s wall timeout each):
| solver | solve-rate | geomean wall (37 both-solved) | aggregate wall |
|---|---|---|---|
| MA57 | 39/47 | 1.000 | 3166 s |
| feral | 39/47 | 0.960 | 1699 s (54%) |
Two changes closed the gap: an auto-armed cascade-break default
in the C-ABI and a tightened MC64 routing gate. Headline rescues
on Mittelmann: marine_1600 -96%, clnlbeam -87%, robot_1600 -79%,
corkscrw -70%, dtoc2 rescued from timeout.
Highlights
- Auto-armed cascade-break in
feral_new(Ipopt C-ABI). Arms
Solver::with_auto_cascade_break(0.05)by default, disable with
FERAL_AUTO_CB_BETA=0. (#37, #38) - Scaling routing tightened.
ScalingStrategy::Autonow requires
a dense-arrow head (max_col_nnz > 32) in addition to the
diag-only mass threshold, fixing a 1-D-PDE-KKT regression. (#68) - MA57-style static-pivot perturbation knob + non-finite input
validation + closed-form 2x2 eigenvalue inertia classifier +
stale-MC64-cache invalidation. (#38) mittelmann_ipoptbenchmark harness — runs Ipopt+feral and
Ipopt+MA57 over the 47-problem NLP panel and produces a side-by-
side report.- Python bindings (
feral-solveron PyPI) now at 0.4.0 with
full IPM helper (feral.ipm.KktSolver), scipy.sparse adapters,
and JAX interop. - Ipopt integration documented end-to-end in the README, with the
feral-ipopt-shimsubproject providing one-shot
make ipopt && make hs071.
See CHANGELOG.md
for the full per-PR release notes (auto-CB, scaling routing,
diagnostic tracers, env-knob reference, parity panel notes,
oracle-consensus correctness gate, and v0.4.0 housekeeping).
Install
pip install feral-solver # Python wheels (CPython 3.10+)
cargo add feral # Rust crateCompiling FERAL into Ipopt
See the "Using FERAL inside Ipopt" section of the README — one-shot
recipe is cd feral-ipopt-shim && make ipopt && make hs071.
v0.3.0 — Feral C ABI for Ipopt + 3-way NLP comparison
Added — Feral C ABI for Ipopt linkage (feral::capi)
New pub mod capi (src/capi.rs) exposes a minimal C ABI surface
matching Ipopt's SparseSymLinearSolverInterface plug-in shape:
feral_new, feral_free, feral_set_structure, feral_values_ptr,
feral_factor, feral_solve, feral_num_neg. Matrix format is
Ipopt's CSR_Format_0_Offset (upper-triangle CSR, 0-based) which is
byte-identical to feral's lower-triangle CSC. Status codes mirror
Ipopt's ESymSolverStatus enum.
Cargo.toml adds staticlib to crate-type so the ABI can be
linked into the C++ Ipopt build via the feral-ipopt-shim/ patch
(opt-in for downstream Ipopt builders; pure-Rust consumers continue
to use the rlib). See dev/research/feral-ipopt-c-shim.md and
dev/plans/feral-ipopt-shim.md for the design.
Added — Ipopt 3-way NLP comparison harness
external_benchmarks/nlp_comparison/ runs the Ipopt
ScalableProblems suite against three Ipopt 3.14.20 binaries
(build-mumps, build-ma57, build-feral), each linked to a
single sparse direct solver. 35 problems × 3 solvers; see
REPORT.md for the 2026-05-13 sweep. MUMPS 35/35 optimal, MA57
34/35, feral 34/35; geomean over triple-optimal subset: MUMPS
139 ms, feral 158 ms, MA57 162 ms. Generates results.json and a
Markdown report. Logs/out/RHS blobs gitignored; only the harness +
report are tracked.
Added — MA57 oracle + 4-way cross-solver comparison
external_benchmarks/ma57_oracle/ builds a CoinHSL MA57 benchmark
binary alongside the existing MUMPS/SSIDS oracles.
external_benchmarks/comparison/ is extended from 3-way to 4-way
(feral + MUMPS + SSIDS + MA57), with new run.py / aggregate.py
/ report.py wiring MA57 into the per-matrix sample comparison and
REPORT.md summary.
Added — Issue #9 Steps 2 + 3: 32×32 register-resident kernel wired into production
Step 3 (SIMD body). update_1x1_block32 in
src/dense/block_ldlt32.rs tiles trailing destination columns in
groups of four through schur_panel_minus_nofma_strided_quad
(n_elim=1), with a trailing _dual for the 2-column tail and a final
axpy_minus_unroll4_nofma for the 1-column tail. Each tile packs 4
dst columns per pulp dispatch sharing one source-column load — the
intended Phase 2.4.3 register-resident pattern. Per-element output is
byte-identical to the scalar reference and to factor::do_1x1_update
(verified by 4 bit-parity unit tests at p=0, p=5, p=30, zero-pivot).
Step 2 (dispatch wiring). do_1x1_update and do_2x2_update
(factor.rs) gain an n == 32 fast-path delegating to
update_1x1_block32 / update_2x2_block32.
factor_frontal_blocked_in_place_with_scratch dispatches
nrow==ncol==32 fronts to factor_block32 (which delegates to
factor_frontal); the eager unblocked BK loop drives the SIMD update
via the fast-paths. This bypasses lblt_panel_frontal for full
32×32 fronts because, at bs==ncol==32, the panel's
apply_blocked_schur_panel quad-dispatch path is unreachable
(j_start = k + n_elim == nrow skips the batched trailing update),
so all trailing-update FLOPs are done by single-column peek-ahead
axpys. The eager-update path issues quad dispatches for every
trailing tile of 4 columns instead.
Bench: median small p90 1.33 (was 1.36), median medium p90 1.74
(was 1.78) across 3 runs. Modest but consistent improvement at the
better edge of the noise band. Inertia 154428/154481, byte-identical
to baseline.
Step 4 (rank-2 SIMD body) remains deferred — the quad kernel's
per-q sequential rounding chain is 1-ULP-divergent from
axpy2_minus_unroll4_nofma's fused chain, so a custom
4-dst-column 2-src pulp dispatch is required. 2×2 pivots are rare on
the bench corpus (no measurable bench impact expected); tracked as
follow-up. Step 5 (cross-arch CI gate) also tracked as
follow-up.
Changed — Per-supernode fixed-overhead reduction (#13, Phases A + B + C)
Phase C (single-slot contrib pool). New pub contrib_pool: Option<Vec<f64>> field on FactorScratch. The multifrontal driver puts
the child's ContribBlock.data into the slot after extend_add consumes
it; the kernel takes at extract time, clears+resizes to cdim*cdim,
and writes. When the slot is empty (cold scratch, or take outpaces put),
the kernel falls back to a fresh Vec allocation — bit-identical to the
pre-Phase-C path. An initial multi-slot Vec<Vec<f64>> variant was
abandoned: it preserved bit-parity but regressed bench p90 by ~+0.19
(small) / ~+0.30 (medium) in 4 consecutive runs (growable-indirection
bookkeeping cost more than the malloc/free pairs it avoided). The
single-slot variant is bench-neutral vs Phase A+B (small p90 1.41,
medium p90 1.83–1.85) and bit-parity is preserved across all four
parity cases including the new (d) pool-hot pre-seeded case.
Phase C contributes no measurable bench movement on this corpus, but
the infrastructure is correct and ready: if a future kernel change
makes the contrib allocation a bigger fraction of factor cost, the
recycle path engages automatically. Final issue #13 standing:
criterion #1 (ns/sup reduction) MET, criterion #2 (bench p90 small <
1.30 OR medium < 1.60) unreachable via allocation pooling on this
corpus, criterion #3 (no correctness regression) MET, criterion #4
(bit-exact blocked_ldlt) MET. Per-front kernel cost (32×32 SIMD,
issue #9) is the next plausible lever for the bench-ratio gap.
Changed — Per-supernode fixed-overhead reduction (#13, Phases A + B background)
Phase A (FactorScratch pool). New FactorScratch { subdiag, d_panel }
struct in src/dense/factor.rs pools the two internal-only working buffers
that factor_frontal_blocked_in_place previously allocated per supernode.
New entry point factor_frontal_blocked_in_place_with_scratch accepts
&mut FactorScratch; the existing function is now a thin wrapper that
allocates a fresh scratch and delegates. FactorWorkspace carries a
factor_scratch field that the three hot-path call sites in
src/numeric/factorize.rs (D.3 dense fast path, factor_one_supernode,
factor_one_small_leaf) thread through. The scratch is safe to re-warm
across different (nrow, bs) shapes — the kernel prologue clears and
resizes unconditionally. Bit-parity gated by
tests/factor_scratch_parity.rs (7-case size sweep + 6-case repeated-
calls regression) plus the 19 byte-identity tests/blocked_ldlt.rs
integration tests.
Phase B (extend_add direct writes). The multifrontal extend_add
in src/numeric/factorize.rs now bypasses SymmetricMatrix::set/get
and writes directly into frontal.data using the lower-triangle column-
major linear index. Per-cell work drops by one indirection, one branch,
and one redundant i >= j sanity check, with the symmetric-storage
canonicalisation preserved at the caller.
Diagnostic (cargo run --bin diag_supernode_cost --release): Phase A
delivered −16 % to −54 % ns/sup on the CRESC100 / ACOPR30 / HAIFAM /
KIRBY2 cluster (issue #13 acceptance criterion #1 MET). Phase B is
within run-to-run noise of Phase A on ns/sup, which is expected
because extend_add is a child-driven post-factor cost rather than
per-supernode.
Bench (cargo run --bin bench --release): dense small-frontal p90
1.33–1.37 and medium p90 1.75–1.78 (vs issue baseline 1.33 / 1.70).
Issue #13 acceptance criterion #2 (small p90 < 1.30 OR medium p90 <
1.60) NOT met by Phases A+B alone. 154428/154481 inertia match
preserved exactly. Phase C (return-struct pooling for l, d_diag,
d_subdiag, contrib, perm, perm_inv) is deferred to a separate
session; design choice (ABI break vs take-into vs with_capacity hints)
is unresolved.
Added — BLAS-3 quad-column trailing-update kernel (#9, parked on #13)
schur_panel_minus_nofma_strided_quad in src/dense/schur_kernel.rs
processes four trailing columns per pulp dispatch, halving src memory
traffic vs the existing dual kernel. Wired into
apply_blocked_schur_panel — every front with ≥ 4 trailing columns
now routes through quad → dual → single fall-through. Bit-exact per
column with four sequential single-column dispatches (176-config
parity sweep + 19 byte-identical blocked_ldlt integration tests).
Zero corpus regression: dense small-frontal p90 1.33 (target ≤ 2.0
PASS), medium p90 1.70 (target ≤ 3.0 PASS); 154428/154481 inertia
match, 99.8 % residual pass.
No measurable headline-throughput win on the current corpus — the
2026-04-27 CHAINWOO_0000 root that motivated the work (1984 × 32) no
longer exists on the current build (max actual nrow = 18 after METIS-
ND on this build). The new bottleneck is per-supernode fixed
overhead, tracked as issue #13. Kernel retained as parked
infrastructure: it re-engages automatically when fronts grow tall-
skinny again. See dev/decisions.md 2026-05-12 (c).
Added — block_ldlt32 scaffold and trailing-update primitives (#9)
New module src/dense/block_ldlt32.rs with BLOCK_SIZE = 32,
factor_block32 stub (delegates to factor_frontal pending the
const-generic driver port), update_1x1_block32, update_2x2_block32
scalar primitives, and a bit-parity test harness diffing factors by
to_bits(). Signatures match the planned pulp dispatch contract; the
SIMD body swap is a surgical follow-up gated on issue #13.
v0.2.0
Fixed — Honest resolved_method and consistent Auto routing (#3)
SymbolicFactorization.resolved_method now reflects what the symbolic
pipeline actually ran rather than what the caller requested. Two
behavior changes:
- ScotchND silent fallback is surfaced. When SCOTCH's nested-
dissection recursion produces no separator for the entire graph
(bordered-KKT shapes such as PoissonControl trigger this), the
driver falls back to AMD viaamd_leaffor every recursion node.
Previouslyresolved_methodstill reportedScotchNDwhile the
permutation was bit-identical to AMD's. It now reportsAmd. The
recovery itself is unchanged — only its visibility is fixed. OrderingMethod::Autois consistent with the no-arg default.
Auto resolution now happens against the original matrix's pattern
before anyLdltCompresspreprocessor reshapes the graph, and the
residual branch delegates topick_default_method(the rule used
by the no-argsymbolic_factorize). Previously Auto could pick a
different concrete method than the default rule on the same
matrix, depending on whether compression triggered. Auto is now a
strict superset: same answer as the default plus the two extra
shape-bakeoff branches (n>100_000→ ScotchND,n<10_000→
KahipND).
Reported by independent triage on K=158 PoissonControl benchmarks.
Fixed — CscMatrix::from_triplets rejects upper-triangle entries (#4)
CscMatrix::from_triplets and CscMatrix::validate now return
FeralError::InvalidInput when any triplet has row < col. Previously
upper-triangle entries were silently accepted and routed through
sort_and_sum_duplicates, producing a CscMatrix whose row indices
violated the documented "lower triangle only" invariant. Downstream
consumers (e.g. symmetric_pattern) assume lower-triangle storage, so
the same symmetric matrix described with upper- vs lower-triangle
triplets produced different solve results. The error message identifies
the offending triplet by index and (row, col). Reported by @janosh.
Changed
Solvernow defaults to the rayon-parallel multifrontal driver
(factorize_multifrontal_parallel_with_workspace). The driver is
bit-exact with the sequential supernodal path on a per-supernode
basis and falls through to the sequential path when the supernode
count is belowN_PAR_MIN = 32, so small-problem latency is not
affected. Override withSolver::new().with_parallel(false).
Closes #7. Motivation: pounce'smarine_1600/pinene_3200
Mittelmann runs were spending all their time in sequential
factor_one_supernodeeven though the parallel driver was
available; this wiresSolverdirectly to it.
Added
Solver::with_parallel(bool)— opt out of the rayon-parallel
driver (returnsSelffor builder chaining).Solver::parallel()— test/diagnostic accessor for the current
flag value.SymmetricMatrix::from_pooled_buf(n, buf)constructor that zeros
only the lower triangle when reusing a pooled buffer; cuts the
dead upper-triangle memset out offactor_one_supernode's
per-supernode hot path. See dev/decisions.md 2026-05-12.compute_infnorm_dense(&SymmetricMatrix)insrc/scaling/infnorm.rs
— dense-native Knight-Ruiz iteration for the D.3/D.4 dense
fast-path; the sparsecompute_infnormremains the path for the
multifrontal driver.Solver::inertia()—Option<&Inertia>accessor returning the
full inertia of the last successful factor. Complements the
Ipopt-shapednum_negative_eigenvalues(which panics if no factor
is stored) for callers that prefer to branch onNone. Used by
the cross-solver bench harness.
Tooling
- Cross-solver comparison harness under
external_benchmarks/comparison/(run.py, aggregate.py, report.py)
measures feral against MUMPS 5.8.2 and HSL MA97 2.8.1 on a sampled
SuiteSparse subset and emitsREPORT.md. Each solver is configured
to its production-quality (refinement-on) settings so the residual
comparison is apples-to-apples:- feral driver routes through
solve_sparse_refined(Richardson
refinement with stagnation exit) insrc/bin/bench_one_matrix.rs. external_benchmarks/mumps_oracle/mumps_bench.Fsets
ICNTL(10) = 2(max two iterative-refinement steps) — MUMPS
default is 0 (no refinement).external_benchmarks/hsl_bench/hsl_bench.cwraps
ma97_solve_din a 4-step Richardson loop because MA97 has no
native residual-based refinement entry for non-singular systems.
Configuration is documented in the generated report's Solvers
table.
- feral driver routes through
Performance
- Dense fast-path (
dense_fast_factor) now runs Knight-Ruiz ∞-norm
scaling directly on the column-major lower-triangle buffer
produced byto_dense_into, removing therow_idx[k]
indirection that dominated wall time on small-dense matrices.
Routing:ScalingStrategy::AutoandInfNormgo through the
dense KR;Mc64Symmetric,Identity,Externalare honored
via the unchanged sparse path. Bit-exact with the sparse KR on
every fast-path-gate matrix (should_use_dense_fast_path⇒
matrix is small enough that every column-major slot maps 1:1
onto a CSC entry or a known-zero, and(d_i · 0 · d_j) = 0
is a no-op in the max-reduction). Targets the
dev/results/lever-d3/stage1-stage2-2026-04-19.md§1 finding
thatcompute_scalingwas 82% (34 of 41 µs) of dense-path
wall time on TRO3X3_0013. - Pooled
local_contribsper rayon worker insideFactorWorkspace,
removing a per-taskVec<Option<ContribBlock>>of length
n_snodesfrom the parallel driver. Decisive on cont-201
(sequential –34%, parallel-at-T=8 –10%); also helps
bratu3d (–6% / –5%). Bit-exact. See dev/decisions.md
2026-05-12 and dev/sessions/2026-05-12-01.md. - Skip the upper-triangle zero on pooled frontal buffer reuse
(SymmetricMatrix::from_pooled_buf). Bit-exact; 5–10% sequential
wall reduction across mid-size matrices.
Investigated
-
Parallel multifrontal driver lock contention (T=4) — falsified.
Added opt-inAtomicLockStatstelemetry toNumericParamswith
per-task lock wait/hold counters and eight per-phase wall-time
counters wrapping the sequential prologue/epilogue. cont-201's
previously-reported residual headroom is sequential symbolic
factorize (157 ms of a 214 ms single-shot wall), not mutex
contention (worst-case 3.4% of body time on cont-201, 0.02% on
c-big). On the cached-symbolic path (production / pounce-IPM
regime) cont-201 wall drops 214 → 56 ms with body_frac jumping
0.15× → 0.55×; remaining 1.5× headroom is inside the rayon::scope,
not at lock sites. Full analysis in
dev/debugging/2026-05-12-cont201-cached-headroom.md. -
Parallel driver within-scope localization (iteration 2) —
rayon idle dominates. Addedtask_wall_ns(whole-closure
bracket) andws_lock_wait_ns(per-worker workspace mutex
wait) toAtomicLockStats. Derivedrayon_idle = scope·T − task_wall_aggquantifies the parallelism deficit attributable
to etree dependencies. cont-201 cached: rayon_idle = 12.3 ms/T
(78% of the gap) vs locks 1.7 ms/T (10%) and ctrl-flow 1.5
ms/T (10%). c-big at T=4 is essentially sequential (74%
rayon-idle capacity, 1.04× speedup vs body_agg). Conclusion:
assembly-tree parallelism is exhausted on these matrices;
within-supernode parallelism (panel-BK / threaded dense
kernels) is the only remaining axis. Closes the cont-201
assembly-tree investigation. See iteration 2 in
dev/debugging/2026-05-12-cont201-cached-headroom.md. -
Scaling cache verification (iteration 3) —
compute_scaling_with_cache
works as designed. Addedsolver_scaling_phase_splittest
(#[ignore]) that loads the corpus and times
pick_scaling_strategy+compute_scaling_with_cache(cache=None)- reorder gather. c-big picks
Mc64Symmetricand the no-cache
path takes 2.3 seconds (full Hungarian); the cached path
in production takes 2.4 ms — 1000× speedup, cache hits.
cont-201 and bcsstk38 pickInfNorm, which is values-dependent
Knight-Ruiz iteration (~4 ms per call) and is not cacheable
across IPM iterations. The 3.95 ms scaling slice on cached
cont-201 is fundamental per-factor work, not a missed cache.
Closes scaling probe from session 2026-05-12-02 "Next session
should #2".
- reorder gather. c-big picks
-
Issue #5 (MSS1 BK inertia non-monotone under δ_w·I): triage
complete, closed on the feral side. Landed a reproducer test- zero_tol/pivot_threshold sweep diagnostics in
src/numeric/factorize.rs::tests. Empirically demonstrated
that no in-kernel magnitude-floor lever cures the wandering;
cross-checked MUMPS 5.8.2 and MA57 (via Ipopt's wrapper) and
confirmed neither implements eigenvalue-aware 2×2 splitting.
Recommended fix is upstream (caller-side δ_c bump matching
Ipopt'sPerturbForSingularity). Full analysis in
dev/research/issue-5-mss1-inertia-monotonicity.md§9.
- zero_tol/pivot_threshold sweep diagnostics in