Releases · jkitchin/feral

12 Jun 15:10

jkitchin

v0.11.0

5ab8074

feral v0.11.0 Latest

Latest

feral 0.11.0

Highlights

Unsymmetric LU basis engine (feral::lu, #81) — general square
A = P L U Q factorization with ftran/btran solves and simplex-style
product-form column updates, auto-routing between a dense and a sparse
engine. Exposed in Python as LuFactor.
Expanded Python interface (all additive, backward-compatible):
LuFactor; factor access via Solver.factors() (unit-lower L as
CSC / optional scipy, block-diagonal D, perm/scaling for the
L D Lᵀ = P (S A S) Pᵀ reconstruction); symbolic analysis via
feral.analyze(...) / Solver.symbolic(); and introspection plus new
Solver(...) tuning knobs (ordering, profiling, mc64_cache, …),
pivot-magnitude and MC64 counters, last_factor_stats(),
profile_report(), scaling_info.
Documentation — new mdBook Python chapter sections, a new executed
example notebook 05_lu_and_introspection, and a refreshed top-level
README (LU engine, production-ready Status framing).

Packaging

The six fill-reducing ordering crates (feral-ordering-core, feral-amd,
feral-amf, feral-metis, feral-scotch, feral-kahip) are bumped
0.2.0 → 0.2.1 so crates.io picks up the API-compatible fixes and perf
work accumulated since 0.2.0 (they had been silently skipped at publish
across 0.3.0–0.10.0). feral's version = "0.2" requirement resolves
to 0.2.1 unchanged. A release-checklist guard now flags this staleness
class so it cannot recur.

Correctness & performance campaign

This release also folds in the repo-review campaign: a large set of
correctness fixes (inertia reporting, MC64 partial-singular fallback,
sparse/dense LU singularity handling, MTX-reader validation, CSC validation,
2×2 D-block gating, static-pivot scale-invariance) and ordering-crate
algorithm fixes (real SHEM coarsening and true GGP gain in feral-metis,
vertex-weight flow balancing and n_components in feral-kahip,
separator-FM in feral-scotch, sorted-rows invariant in
feral-ordering-core), plus several performance improvements (pooled scratch
reuse, borrow-not-clone in feral_factor, linear default postorder).

See CHANGELOG.md
for the complete, itemized list (every fix is tagged with its finding ID).

Inertia remains validated against the feral + MUMPS 5.8.2 + SPRAL/SSIDS
consensus oracle; 705 tests passing.

Assets 2

04 Jun 11:54

jkitchin

v0.10.0

23ecfaf

feral v0.10.0

Ordering-routing release. No open issues at cut; 20 commits since v0.9.0.

This is a minor release (not a patch) because the headline change is a
deliberate change to default Auto ordering behavior, not just a bug fix.

Changed

Auto now prefers AMF over MetisND at every size (#73, follow-up to #67).
The AMF_BAND_MAX ceiling is removed: whenever the size rule would pick
MetisND, choose_adaptive now routes to AMF at every n. The #50 powerflow
guard (n > 100k && avg_deg < 5 → Amd) and the #64 arrow/dense-border catch
still fire first, so only the uniformly-thin would-be-MetisND population is
affected. The #73 A/B (factor+solve) showed AMF winning on every measured
matrix in the n>100k thin regime (dtoc2 2.49×, cont5_1_l 2.75×, nql180 2.05×,
pinene 1.18×, YATP1NE 2.13×) — including a case where MetisND had smaller
symbolic fill yet was 2× slower, so a fill-guarded race was rejected.
144 diagnostic binaries relocated to crates/feral-diagnostics (#71). Root
cargo build/test/clippy now operate on the feral package only. Run
diagnostics on demand with cargo run -p feral-diagnostics --bin <name>.

Fixed

Spurious zero pivots → wrong inertia under Auto scaling on ill-conditioned
KKTs (#65). Inertia-guided MC64 fallback corrects the false-infeasible signal.
Arrow/bordered-KKT ordering blow-up (#64). Auto no longer picks MetisND on
arrow/bordered patterns that caused ~7× fill blow-up (r05 LP); routes to AMF.

Compatibility

Public Rust API and C ABI unchanged. Functional changes are confined to the root
feral package; the ordering plugin crates are unchanged (stay at 0.2.0).

See CHANGELOG.md for the full per-change detail.

Assets 2

31 May 00:18

jkitchin

v0.9.0

cd12735

v0.9.0

Fixed — batched iterative refinement for wide multi-RHS solves (#58)

Solver::solve_many_refined looped the single-RHS refiner per column,
which bypassed the BLAS-3 panel kernel — so with refinement on (the
default) batched multi-RHS solves were 3–7× slower per RHS than the
unrefined batched path, and could be slower than looping single-RHS.

Wide refined solves (nrhs ≥ 16) now refine through a batched loop:
the initial and per-step correction solves go through solve_sparse_many
(one panel solve over the still-active columns), and the residual is a
per-column SpMV. Per-column best-iterate tracking and the convergence
predicates (ε·√n relative target, 2-strike plateau, 100× divergence)
are preserved exactly, and each step compacts the active columns so the
batched path never does more solve work than the per-column loop. Narrow
solves (nrhs < 16, e.g. the IPM predictor-corrector) keep the
per-column loop unchanged.

For 16 ≤ nrhs < 32 the batched result is bit-identical to the
per-column loop (the solve runs the rank-1 kernels there); for
nrhs ≥ 32 it agrees with the per-column oracle to the refinement
residual target. Measured (bench_multirhs, 2-D Laplacians): batched
refined is ~2.5–3× faster per RHS than the per-column refined loop.

The common case (every column already at the target after the direct
solve, i.e. ~0 correction steps) now returns before allocating the wide
best_x/residual buffers — it allocates only the solution and a
length-n scratch. Those three n × nrhs allocations were cheap in the
native binary but cost ~50 µs/RHS under the Python process's allocator,
masking the win through Solver.solve_refined on 2-D inputs; with them
gone the Python refined path shows the full ~2–3× too.

Changed — multi-RHS sparse solve: BLAS-3 panel kernels (#57 fix #2)

Wide multi-RHS solves (nrhs ≥ 32) now run each supernode's
forward/backward substitution as a register-blocked dense panel solve
— a TRSM on the unit-lower triangle L_11 plus a MR×NR (4×8)
register-tiled GEMM on the trailing block L_21 — instead of a rank-1
cascade. Narrow solves (the IPM hot path) stay on the rank-1 kernels.

Two supporting layout changes, both internal (no public API/ABI change,
RHS/solution stay column-major n × nrhs):

The internal y working buffer is now row-major, so every
per-supernode gather/scatter is a contiguous memcpy instead of a
stride-n transpose. This removed a cache-conflict pathology on
power-of-two n (the n=1024 grid had regressed) and roughly halved
the wide-solve time across the board.
The forward substitution and D-block solve are fused into one
postorder pass (a node's eliminated rows are final once its
forward-sub completes), saving one gather/scatter round trip per
supernode.

Forward substitution stays bit-identical to looping single-RHS;
back-substitution differs only by floating-point reassociation
(~κ·eps). The multi-RHS parity suite (nrhs up to 64, MR/NR tail
sizes) verifies max|many − single| < 1e-12 against the independent
single-RHS oracle (observed ≤ 1.6e-15).

Measured per-RHS speedup vs looping single-RHS, on 2-D Laplacians
(cargo run --release --bin bench_multirhs, nrhs ∈ {64, 256}):
n=484 ~4–5×, n=1024 ~3×, n=2025 ~5–6×.

Changed — multi-RHS sparse solve: row-major working buffer (#57)

The per-supernode working buffer in solve_sparse_core_many_into now
uses a row-major layout (w[i*nrhs + c]) instead of column-major
(w[c*nrow + i]). The per-RHS inner loops in forward-sub, the D-block
solve, and back-sub are now contiguous (stride-1) and auto-vectorize.

The caller-visible RHS/solution layout is unchanged — it remains
column-major n × nrhs, matching MUMPS/SSIDS — so there is no public
API or ABI change. The single-RHS path (solve_sparse_core_into) and
the iterative-refinement path are untouched. Results are bit-identical
to looping the single-RHS solve (verified by the multi-RHS parity
suite, max|many − single| = 0).

Measured per-RHS speedup of one batched solve_sparse_many call vs
looping single-RHS, on 2-D Laplacians (cargo run --release --bin bench_multirhs): batched/looped ratio 0.76–0.99 (up to ~1.3× faster
per RHS) at nrhs ∈ {64, 256}. The larger 5–10× regime needs the
BLAS-3 panel kernels (issue #57 fix #2, deferred).

Assets 2

29 May 01:40

jkitchin

v0.8.0

6088078

feral v0.8.0

Removed — 4 synthetic rank-deficient stress-corpus matrices

Dropped rankdef_10_3, rankdef_50_5, rankdef_exact_50_5, and
stokes_q1p0_8 from the stress corpus. Under #54's SSIDS-aligned
strict-zero routing, feral reported inertia.zero = 1 on all four —
which contradicted MUMPS, SSIDS, and MA57 simultaneously,
violating CLAUDE.md's "must agree with at least one canonical" rule
and red-lighting the stress-smoke gate on the v0.8.0 release commit.

Rather than allowlist them or narrow #54's zero_tol (which would
reopen the IPM δ-cascade instability on nuffield2_trap_iter1.mtx
that motivated #54 in the first place), the four matrices were
removed entirely. They were synthetic borderline fixtures where the
"correct" zero count depends on order-1e-15 round-off and no
3-of-4-oracle consensus exists — the same disagreement that
compute_consensus.py already tags excluded. The rank-deficient
regime remains covered by rankdef_5_2, rankdef_200_20,
rankdef_exact_100_10, saddle_rankdef_50_10_3,
saddle_rankdef_100_20_5. See dev/decisions.md (2026-05-28 entry).

Performance — Thomson-Hessian per-iter throughput (#56)

Three additive levers on the dense / wide-supernode IPM-KKT hot path,
landed on issue-56-thomson-hessian-throughput and merged to main:

Lever A — pre-built permuted_pattern + permute structure cache.
Collapsed the symbolic permute (P A P^T) + symmetric_pattern
phases to <0.5 % of total wall on warm calls.
Lever B — fused single-pass contribextract write. Replaced
resize(cdim², 0.0) + lower-triangle overwrite with
reserve + unsafe set_len + write-each-cell-once. Bit-identical
contrib block (extend_add reads only ci ≥ cj; root-Schur
extractor canonicalizes via transpose; parallel_corpus_parity
binary-compares the full buffer). Roughly 33 % fewer writes.
Re-measurement at Thomson n=200 (sequential, 9 warm reps):
factor min −5.5 % (parallel ON), −10 % (parallel OFF).
Lever C — InfNorm Knight-Ruiz inner-loop vectorization. Hoisted
the loop-carried row_max[j] dependency to a register accumulator
(sparse and dense paths, bit-identical by associativity of max on
non-NaN finite inputs); added a pulp-dispatched SIMD off-diagonal
kernel on the dense path. Re-measurement at Thomson n=200:
scaling phase −19 %, total wall −5 %.

No regression on the Phase 2.8.1 corpus partition gates; both
small-frontal and medium buckets improved relative to the 2026-05-27
baseline. KIRBY2 family worst-case sparse outlier improved from
10.25× → 7.97× vs MUMPS.

See dev/sessions/2026-05-28-01.md and
dev/research/issue-56-thomson-hessian-throughput-2026-05-27.md for
the localization data and re-measurement tables.

Added — symbolic-analysis-time delayed-pivot budget + CB rewire (#55)

Per-supernode delayed_capacity is assigned during symbolic
analysis. The numeric phase enforces the bound at the frontal
expansion site: if more pivots delay into a supernode than its
capacity, the factor returns the structured
FeralError::DelayBudgetExceeded { supernode, required, capacity }
(MUMPS INFO(2) workspace-overflow analog) rather than growing
the front unboundedly. Cascade-break now triggers on budget
exhaustion instead of the heuristic ratio gate, matching MUMPS's
dfac_front_aux.F:1251-1331 invariant that static perturbation
fires only when delay is structurally impossible.

Capacity formula:

tight = max(4 * own_ncol, 16)
capacity = min(subtree_ncol(s) - own_ncol(s), tight)

Root-supernode cap (defensive, n >= 1024): declines amalgamations
that would push the root past min(0.05 * n, 2048) columns.

FeralConfig::default() now ships with cascade-break armed
(cascade_break_ratio = Some(0.5), cascade_break_eps = Some(1e-10))
as the budget-exhausted fallback. Pounce's per-problem .opt
overrides for nql180 / pinene_3200 are no longer required.

See dev/research/symbolic-delay-budget-2026-05-27.md and
dev/decisions.md (2026-05-27 entry, frozen convention).

Changed — strict-zero pivots route to `inertia.zero` (SSIDS-aligned, #54)

When ZeroPivotAction::ForceAccept accepts a 1×1 pivot whose
magnitude satisfies |d| <= zero_tol, the inertia now increments
zero instead of routing by sign(d). This matches SSIDS
(NumericSubtree.hxx:259-267) and HSL MA57 (INFO(24) = neig,
INFO(25) = number of zero pivots) and supersedes the Issue #42
Option A sign-routing rule.

Motivation: pounce's IPM δ-cascade on nuffield2_trap_iter1.mtx
(LP-shaped KKT, n=26155) stalled for 600 s (vs 1.8 s on MA57)
because Option A split bit-exact zero pivots by IEEE round-off and
the counter jumped backwards mid-cascade (13042 → 12615). The new
accounting restores monotonicity (probe confirms 0 backwards jumps
across δ_x ∈ {0 … 6.99e19}).

Behavior changes visible to callers:

Solver::inertia() and factor()'s Inertia field now report
the mathematical (Sylvester) inertia on matrices with rank
deficiency that lands in strict-zero pivots.
num_negative_eigenvalues() returns strict inertia.negative
only (unchanged convention). IPM callers comparing against an
expected oracle should now sum negative + zero to match
MA57's INFO(24)+INFO(25) convention.

Tests updated to reflect the new convention; see
dev/decisions.md (2026-05-26 entry) for the full trade-off
analysis and dev/research/issue-54-lp-kkt-inertia.md for the
oracle cross-check.

Added — `DelayBudgetExceeded` exposed through Python bindings

feral-solver (Python) now re-exports the #55 error as a named
exception:

feral.DelayBudgetExceeded — subclass of feral.FactorError
(and transitively feral.FeralError), raised when
factor()/refactor() hits the symbolic-analysis-time
delayed-pivot budget. The message carries the supernode index,
required and capacity columns. Python callers no longer see this
routed through the generic NumericFailure wildcard.

Added — documentation site (mdBook + rustdoc on GitHub Pages)

book/ — mdBook skeleton (introduction, getting-started,
inertia, api). Content is intentionally thin; the dev/research/
notes remain internal and are not auto-published.
.github/workflows/pages.yml — builds the book and
cargo doc --workspace --no-deps --lib, mounts rustdoc under /api/,
and deploys to GitHub Pages on every push to main. Site URL is
https://jkitchin.github.io/feral/; rustdoc at /api/feral/.
Cargo.toml: /book added to the crate exclude list so the book
source does not ship in the published crate tarball.
.gitignore: /book/book ignored (mdBook output is regenerated by
the Pages workflow).

Repo setting required once: Settings → Pages → Source: GitHub
Actions. Until that flag is set, the deploy job will fail with a
"Pages not enabled" error; the build job is unaffected.

Assets 2

25 May 17:13

jkitchin

v0.7.0

575c40b

feral v0.7.0

v0.7.0 — opt-in instrumentation accessors

Minor release for #52.
New public surface on Solver for downstream IPM debugging of
linear-solver behavior, without re-running with cargo flamegraph
or instrumenting forks.

Added

Phase A — always-on snapshot. Solver::last_factor_stats()
returns Option<FactorStats> populated after every successful
factor(). Fields: nnz_a, nnz_l, fill_ratio, inertia,
min_abs_pivot, max_abs_pivot, pattern_reused, scaling_info.
No gating flag — the two extra integer writes per factor() are
cheaper than a gate check would be.

Phase B — opt-in profiler. Solver::with_profiling(true)
enables per-supernode numeric timings and one-shot symbolic timings,
read back via Solver::profile_report() and
Solver::symbolic_profile_report(). The symbolic report is
populated only on cache-miss factors — None on cache hits is the
unambiguous "did symbolic actually run" signal.

Performance

Default is false; when off, no profiler arcs are allocated and the
code path is byte-identical to a pre-issue-52 build. Measured by
benches/issue52_overhead.rs on tridiagonal SPDs
(n ∈ {64, 256, 1024}, sequential, 30 samples × 3 s, dev machine):

n	main baseline	default-off	profiling-on
64	260.6 µs	257.6 µs	258.9 µs
256	348.2 µs	345.2 µs	347.0 µs
1024	719.6 µs	714.7 µs	709.1 µs

Default-off vs main baseline: within ±1.2% (often faster — noise).
Profiling-on vs default-off: within ±1%. Both deltas sit inside the
criterion noise band on tridiagonal workloads.

Deferred

Thread-local accumulator for parallel-driver profiler contention
(plan §B2b) — documented as escape hatch; revisit only if a real
IPM workload regresses with profiling on.
Phase C residual-norm / refinement-step accessors — out of scope
for this issue.

Assets 2

23 May 22:31

jkitchin

v0.6.0

b548a8c

feral v0.6.0

Added — `Solver::with_scaling()` builder ([#51][i51])

New public builder method on Solver to pin the scaling strategy
explicitly. Lets IPM hosts override ScalingStrategy::Auto when
they already know what they want — recommended escape hatch when
the picker's per-matrix heuristic is wrong for a problem class.
Default (Auto) is now sticky on cached pattern (see "Fixed" below)
so the picker no longer flaps; this builder is for callers that
want full control.

Fixed — sticky `Auto` scaling on cached pattern ([#51][i51])

Three coordinated fixes in src/numeric/solver.rs address a ~50×
numeric-phase slowdown on IPM workloads when the picker re-routes
across iters or PartialSingular outcomes drop the MC64 cache.

Reproducer (pounce/gams/nlpbench/feral_repro/powerflow22/,
n=2,813,976 IPM KKT, default Auto):

call	pre-fix	post-fix
factor (cold)	54.77 s	54.42 s
refactor (cached symbolic, iter A values)	1.00 s	1.02 s
factor #2 (cached symbolic, iter B values)	53.80 s	1.07 s

The three fixes:

Sticky Auto pick. First factor() on a pattern runs the full
compute_scaling_auto_with_cache pipeline (preserves Policy-4
fallback semantics — InfNorm-spread guard, off-diag-ratio guard,
MC64 catastrophic-spread guard). Post-call we derive the resolved
strategy from factors.scaling_info
(Mc64FallbackToInfnorm → InfNorm,
PartialSingular → Mc64Symmetric, NotApplied → Identity,
Applied → pick_scaling_strategy(matrix)) and stash it on the
Solver as auto_picked_strategy. Every subsequent factor on
the same pattern uses the stashed strategy directly, bypassing
Auto. Pattern change clears it alongside the MC64 cache. Mirrors
MUMPS ICNTL(7) / SSIDS options%ordering: structural decision
once at first call, reuse every refactor.
MC64 cache gate widened to PartialSingular. The Hungarian
on a structurally rank-deficient KKT still produced a real
scaling vector (unmatched positions land at 1.0 per
mc64.rs:222); the value-bound check still gates reuse, so
caching is correctness-safe. Pre-fix the post-#49 gate required
ScalingInfo::Applied, which dropped the cache on every IPM iter
over a structurally rank-deficient KKT and forced the Hungarian
to rerun from scratch.
Solver::with_scaling() builder (see "Added" above).

Tests

Three unit tests in src/numeric/solver.rs:

issue_51_with_scaling_builder_overrides_default
issue_51_auto_pick_is_sticky_on_cached_pattern
issue_51_partial_singular_populates_cache

And one #[ignore]'d corpus regression test:
issue_51_corpus_sticky_auto_holds_across_ipm_iters walks
tests/data/parity/<family>/*.mtx for every family with ≥ 2 IPM
iter snapshots, factors all iters against one Solver::new(), and
asserts the sticky pick holds across every iter (13 families
covered: acopp30, hatfldbne, hahn1, ssi, …).

mc64_fallback_surfaces_via_solver_api was updated to reflect the
new sticky semantics: iter 2 now asserts
mc64_fallback_count == 1 (sticky pin runs straight InfNorm; no
fallback to surface) and locks
auto_picked_strategy = Some(InfNorm).

Changed — `Auto` dispatcher rewrites ([#50][i50])

The OrderingMethod::Auto dispatcher (src/symbolic/mod.rs:: choose_adaptive) was simplified to two rules on top of
pick_default_method: kept only the very-large-and-sparse catch
(n > 100_000 && full_avg_deg < 5.0 → Amd); everything else
delegates to pick_default_method (n ≤ 10_000 → Amf,
n > 10_000 → MetisND).

Fix A — large-and-sparse swap (c442a0c). The pre-fix
n > 100_000 && full_avg_deg < 5.0 → ScotchND branch is swapped
to Amd. On powerflow22 (n=2.8 M, full_avg_deg ≈ 3.7) prior
ScotchND took 113.8 s symbolic (15.8 M nnz_L); AMD takes 55 s
(10.4 M nnz_L). The ScotchND advantage at very large n had been
load-bearing against the same BK pivoting cascade that motivated
the chain catches; issue #46 eliminated that amplifier in May
2026 and removed the justification for routing very-large sparse
matrices through nested dissection. Corpus inventory:
dev/research/issue-50-numeric-inventory.csv shows the IPM
corpus's [100k, 200k) bucket has AMD/MetisND num_nnz_l ratio 1.00
on both representatives. Validation: 258 chain-catch corpus rows
under post-Fix Auto — 0 failures, 0 num_nnz_l regressions for
matrices that actually reroute. See
dev/research/issue-50-metisnd-symbolic-cost.md §F7–§F11.
F11 follow-up — small-and-sparse retire (3f8f6f6). The
pre-fix n < 10_000 && full_avg_deg < 15.0 → KahipND branch is
deleted entirely; the population now falls through to
pick_default_method's n ≤ 10_000 → Amf default (MUMPS
ana_set_ordering.F SYM=2 N≤10000). Justified by an 838-matrix
4-way inventory (dev/research/small-sparse-inventory.csv): AMF
wins 169/838 strict per-matrix (KahipND 16); aggregate num_nnz_l
ratios to AMD are AMF 0.870×, KahipND 0.984×; aggregate factor_us
ratios are AMF 0.832×, KahipND 0.990×. KahipND remains reachable
via OrderingMethod::KahipND for the 41 high-avg-deg cases
(STEENBRD, HADAMARD, TABLE8) where it still wins — all sub-22k
nnz_L absolute. See dev/research/issue-50-metisnd-symbolic-cost.md
§F12.

Assets 2

22 May 19:51

jkitchin

v0.5.0

dfb5029

feral v0.5.0

feral v0.5.0 (2026-05-22)

First tagged release. Sparse symmetric indefinite direct solver in
pure Rust with certified inertia counts.

Highlights since 0.4.0:

Added: value-bounded MC64 scaling cache; near-singularity signal.
Changed: inertia counts every pivot by sign; oracle-consensus parity.
Performance: Schur trailing-update kernel widened (quad NEON-tile).
Fixed: 8 correctness/robustness fixes across scaling, inertia, and
delayed-pivot handling (#39, #45, #46, #47, #48, #49).

See CHANGELOG.md [0.5.0] for the full list.

Assets 2

17 May 21:42

jkitchin

v0.4.0

2442d1f

v0.4.0 — Mittelmann parity with MA57 + auto-armed cascade-break

Headline

feral is now at parity with MA57 on the 47-problem Mittelmann NLP
benchmark panel (Ipopt 3.14.20 + --linear_solver=feral vs
--linear_solver=ma57, 600 s wall timeout each):

solver	solve-rate	geomean wall (37 both-solved)	aggregate wall
MA57	39/47	1.000	3166 s
feral	39/47	0.960	1699 s (54%)

Two changes closed the gap: an auto-armed cascade-break default
in the C-ABI and a tightened MC64 routing gate. Headline rescues
on Mittelmann: marine_1600 -96%, clnlbeam -87%, robot_1600 -79%,
corkscrw -70%, dtoc2 rescued from timeout.

Highlights

Auto-armed cascade-break in feral_new (Ipopt C-ABI). Arms
Solver::with_auto_cascade_break(0.05) by default, disable with
FERAL_AUTO_CB_BETA=0. (#37, #38)
Scaling routing tightened. ScalingStrategy::Auto now requires
a dense-arrow head (max_col_nnz > 32) in addition to the
diag-only mass threshold, fixing a 1-D-PDE-KKT regression. (#68)
MA57-style static-pivot perturbation knob + non-finite input
validation + closed-form 2x2 eigenvalue inertia classifier +
stale-MC64-cache invalidation. (#38)
mittelmann_ipopt benchmark harness — runs Ipopt+feral and
Ipopt+MA57 over the 47-problem NLP panel and produces a side-by-
side report.
Python bindings (feral-solver on PyPI) now at 0.4.0 with
full IPM helper (feral.ipm.KktSolver), scipy.sparse adapters,
and JAX interop.
Ipopt integration documented end-to-end in the README, with the
feral-ipopt-shim subproject providing one-shot
make ipopt && make hs071.

See CHANGELOG.md
for the full per-PR release notes (auto-CB, scaling routing,
diagnostic tracers, env-knob reference, parity panel notes,
oracle-consensus correctness gate, and v0.4.0 housekeeping).

Install

pip install feral-solver         # Python wheels (CPython 3.10+)
cargo add feral                  # Rust crate

Compiling FERAL into Ipopt

See the "Using FERAL inside Ipopt" section of the README — one-shot
recipe is cd feral-ipopt-shim && make ipopt && make hs071.

Assets 2

13 May 17:15

jkitchin

v0.3.0

11d0a90

v0.3.0 — Feral C ABI for Ipopt + 3-way NLP comparison

Added — Feral C ABI for Ipopt linkage (`feral::capi`)

New pub mod capi (src/capi.rs) exposes a minimal C ABI surface
matching Ipopt's SparseSymLinearSolverInterface plug-in shape:
feral_new, feral_free, feral_set_structure, feral_values_ptr,
feral_factor, feral_solve, feral_num_neg. Matrix format is
Ipopt's CSR_Format_0_Offset (upper-triangle CSR, 0-based) which is
byte-identical to feral's lower-triangle CSC. Status codes mirror
Ipopt's ESymSolverStatus enum.

Cargo.toml adds staticlib to crate-type so the ABI can be
linked into the C++ Ipopt build via the feral-ipopt-shim/ patch
(opt-in for downstream Ipopt builders; pure-Rust consumers continue
to use the rlib). See dev/research/feral-ipopt-c-shim.md and
dev/plans/feral-ipopt-shim.md for the design.

Added — Ipopt 3-way NLP comparison harness

external_benchmarks/nlp_comparison/ runs the Ipopt
ScalableProblems suite against three Ipopt 3.14.20 binaries
(build-mumps, build-ma57, build-feral), each linked to a
single sparse direct solver. 35 problems × 3 solvers; see
REPORT.md for the 2026-05-13 sweep. MUMPS 35/35 optimal, MA57
34/35, feral 34/35; geomean over triple-optimal subset: MUMPS
139 ms, feral 158 ms, MA57 162 ms. Generates results.json and a
Markdown report. Logs/out/RHS blobs gitignored; only the harness +
report are tracked.

Added — MA57 oracle + 4-way cross-solver comparison

external_benchmarks/ma57_oracle/ builds a CoinHSL MA57 benchmark
binary alongside the existing MUMPS/SSIDS oracles.
external_benchmarks/comparison/ is extended from 3-way to 4-way
(feral + MUMPS + SSIDS + MA57), with new run.py / aggregate.py
/ report.py wiring MA57 into the per-matrix sample comparison and
REPORT.md summary.

Added — Issue #9 Steps 2 + 3: 32×32 register-resident kernel wired into production

Step 3 (SIMD body). update_1x1_block32 in
src/dense/block_ldlt32.rs tiles trailing destination columns in
groups of four through schur_panel_minus_nofma_strided_quad
(n_elim=1), with a trailing _dual for the 2-column tail and a final
axpy_minus_unroll4_nofma for the 1-column tail. Each tile packs 4
dst columns per pulp dispatch sharing one source-column load — the
intended Phase 2.4.3 register-resident pattern. Per-element output is
byte-identical to the scalar reference and to factor::do_1x1_update
(verified by 4 bit-parity unit tests at p=0, p=5, p=30, zero-pivot).

Step 2 (dispatch wiring). do_1x1_update and do_2x2_update
(factor.rs) gain an n == 32 fast-path delegating to
update_1x1_block32 / update_2x2_block32.
factor_frontal_blocked_in_place_with_scratch dispatches
nrow==ncol==32 fronts to factor_block32 (which delegates to
factor_frontal); the eager unblocked BK loop drives the SIMD update
via the fast-paths. This bypasses lblt_panel_frontal for full
32×32 fronts because, at bs==ncol==32, the panel's
apply_blocked_schur_panel quad-dispatch path is unreachable
(j_start = k + n_elim == nrow skips the batched trailing update),
so all trailing-update FLOPs are done by single-column peek-ahead
axpys. The eager-update path issues quad dispatches for every
trailing tile of 4 columns instead.

Bench: median small p90 1.33 (was 1.36), median medium p90 1.74
(was 1.78) across 3 runs. Modest but consistent improvement at the
better edge of the noise band. Inertia 154428/154481, byte-identical
to baseline.

Step 4 (rank-2 SIMD body) remains deferred — the quad kernel's
per-q sequential rounding chain is 1-ULP-divergent from
axpy2_minus_unroll4_nofma's fused chain, so a custom
4-dst-column 2-src pulp dispatch is required. 2×2 pivots are rare on
the bench corpus (no measurable bench impact expected); tracked as
follow-up. Step 5 (cross-arch CI gate) also tracked as
follow-up.

Changed — Per-supernode fixed-overhead reduction (#13, Phases A + B + C)

Phase C (single-slot contrib pool). New pub contrib_pool: Option<Vec<f64>> field on FactorScratch. The multifrontal driver puts
the child's ContribBlock.data into the slot after extend_add consumes
it; the kernel takes at extract time, clears+resizes to cdim*cdim,
and writes. When the slot is empty (cold scratch, or take outpaces put),
the kernel falls back to a fresh Vec allocation — bit-identical to the
pre-Phase-C path. An initial multi-slot Vec<Vec<f64>> variant was
abandoned: it preserved bit-parity but regressed bench p90 by ~+0.19
(small) / ~+0.30 (medium) in 4 consecutive runs (growable-indirection
bookkeeping cost more than the malloc/free pairs it avoided). The
single-slot variant is bench-neutral vs Phase A+B (small p90 1.41,
medium p90 1.83–1.85) and bit-parity is preserved across all four
parity cases including the new (d) pool-hot pre-seeded case.

Phase C contributes no measurable bench movement on this corpus, but
the infrastructure is correct and ready: if a future kernel change
makes the contrib allocation a bigger fraction of factor cost, the
recycle path engages automatically. Final issue #13 standing:
criterion #1 (ns/sup reduction) MET, criterion #2 (bench p90 small <
1.30 OR medium < 1.60) unreachable via allocation pooling on this
corpus, criterion #3 (no correctness regression) MET, criterion #4
(bit-exact blocked_ldlt) MET. Per-front kernel cost (32×32 SIMD,
issue #9) is the next plausible lever for the bench-ratio gap.

Changed — Per-supernode fixed-overhead reduction (#13, Phases A + B background)

Phase A (FactorScratch pool). New FactorScratch { subdiag, d_panel }
struct in src/dense/factor.rs pools the two internal-only working buffers
that factor_frontal_blocked_in_place previously allocated per supernode.
New entry point factor_frontal_blocked_in_place_with_scratch accepts
&mut FactorScratch; the existing function is now a thin wrapper that
allocates a fresh scratch and delegates. FactorWorkspace carries a
factor_scratch field that the three hot-path call sites in
src/numeric/factorize.rs (D.3 dense fast path, factor_one_supernode,
factor_one_small_leaf) thread through. The scratch is safe to re-warm
across different (nrow, bs) shapes — the kernel prologue clears and
resizes unconditionally. Bit-parity gated by
tests/factor_scratch_parity.rs (7-case size sweep + 6-case repeated-
calls regression) plus the 19 byte-identity tests/blocked_ldlt.rs
integration tests.

Phase B (extend_add direct writes). The multifrontal extend_add
in src/numeric/factorize.rs now bypasses SymmetricMatrix::set/get
and writes directly into frontal.data using the lower-triangle column-
major linear index. Per-cell work drops by one indirection, one branch,
and one redundant i >= j sanity check, with the symmetric-storage
canonicalisation preserved at the caller.

Diagnostic (cargo run --bin diag_supernode_cost --release): Phase A
delivered −16 % to −54 % ns/sup on the CRESC100 / ACOPR30 / HAIFAM /
KIRBY2 cluster (issue #13 acceptance criterion #1 MET). Phase B is
within run-to-run noise of Phase A on ns/sup, which is expected
because extend_add is a child-driven post-factor cost rather than
per-supernode.

Bench (cargo run --bin bench --release): dense small-frontal p90
1.33–1.37 and medium p90 1.75–1.78 (vs issue baseline 1.33 / 1.70).
Issue #13 acceptance criterion #2 (small p90 < 1.30 OR medium p90 <
1.60) NOT met by Phases A+B alone. 154428/154481 inertia match
preserved exactly. Phase C (return-struct pooling for l, d_diag,
d_subdiag, contrib, perm, perm_inv) is deferred to a separate
session; design choice (ABI break vs take-into vs with_capacity hints)
is unresolved.

Added — BLAS-3 quad-column trailing-update kernel (#9, parked on #13)

schur_panel_minus_nofma_strided_quad in src/dense/schur_kernel.rs
processes four trailing columns per pulp dispatch, halving src memory
traffic vs the existing dual kernel. Wired into
apply_blocked_schur_panel — every front with ≥ 4 trailing columns
now routes through quad → dual → single fall-through. Bit-exact per
column with four sequential single-column dispatches (176-config
parity sweep + 19 byte-identical blocked_ldlt integration tests).
Zero corpus regression: dense small-frontal p90 1.33 (target ≤ 2.0
PASS), medium p90 1.70 (target ≤ 3.0 PASS); 154428/154481 inertia
match, 99.8 % residual pass.

No measurable headline-throughput win on the current corpus — the
2026-04-27 CHAINWOO_0000 root that motivated the work (1984 × 32) no
longer exists on the current build (max actual nrow = 18 after METIS-
ND on this build). The new bottleneck is per-supernode fixed
overhead, tracked as issue #13. Kernel retained as parked
infrastructure: it re-engages automatically when fronts grow tall-
skinny again. See dev/decisions.md 2026-05-12 (c).

Added — block_ldlt32 scaffold and trailing-update primitives (#9)

New module src/dense/block_ldlt32.rs with BLOCK_SIZE = 32,
factor_block32 stub (delegates to factor_frontal pending the
const-generic driver port), update_1x1_block32, update_2x2_block32
scalar primitives, and a bit-parity test harness diffing factors by
to_bits(). Signatures match the planned pulp dispatch contract; the
SIMD body swap is a surgical follow-up gated on issue #13.

Assets 2

12 May 23:47

jkitchin

v0.2.0

da6727f

v0.2.0

Fixed — Honest `resolved_method` and consistent `Auto` routing (#3)

SymbolicFactorization.resolved_method now reflects what the symbolic
pipeline actually ran rather than what the caller requested. Two
behavior changes:

ScotchND silent fallback is surfaced. When SCOTCH's nested-
dissection recursion produces no separator for the entire graph
(bordered-KKT shapes such as PoissonControl trigger this), the
driver falls back to AMD via amd_leaf for every recursion node.
Previously resolved_method still reported ScotchND while the
permutation was bit-identical to AMD's. It now reports Amd. The
recovery itself is unchanged — only its visibility is fixed.
OrderingMethod::Auto is consistent with the no-arg default.
Auto resolution now happens against the original matrix's pattern
before any LdltCompress preprocessor reshapes the graph, and the
residual branch delegates to pick_default_method (the rule used
by the no-arg symbolic_factorize). Previously Auto could pick a
different concrete method than the default rule on the same
matrix, depending on whether compression triggered. Auto is now a
strict superset: same answer as the default plus the two extra
shape-bakeoff branches (n>100_000 → ScotchND, n<10_000 →
KahipND).

Reported by independent triage on K=158 PoissonControl benchmarks.

Fixed — `CscMatrix::from_triplets` rejects upper-triangle entries (#4)

CscMatrix::from_triplets and CscMatrix::validate now return
FeralError::InvalidInput when any triplet has row < col. Previously
upper-triangle entries were silently accepted and routed through
sort_and_sum_duplicates, producing a CscMatrix whose row indices
violated the documented "lower triangle only" invariant. Downstream
consumers (e.g. symmetric_pattern) assume lower-triangle storage, so
the same symmetric matrix described with upper- vs lower-triangle
triplets produced different solve results. The error message identifies
the offending triplet by index and (row, col). Reported by @janosh.

Changed

Solver now defaults to the rayon-parallel multifrontal driver
(factorize_multifrontal_parallel_with_workspace). The driver is
bit-exact with the sequential supernodal path on a per-supernode
basis and falls through to the sequential path when the supernode
count is below N_PAR_MIN = 32, so small-problem latency is not
affected. Override with Solver::new().with_parallel(false).
Closes #7. Motivation: pounce's marine_1600 / pinene_3200
Mittelmann runs were spending all their time in sequential
factor_one_supernode even though the parallel driver was
available; this wires Solver directly to it.

Added

Solver::with_parallel(bool) — opt out of the rayon-parallel
driver (returns Self for builder chaining).
Solver::parallel() — test/diagnostic accessor for the current
flag value.
SymmetricMatrix::from_pooled_buf(n, buf) constructor that zeros
only the lower triangle when reusing a pooled buffer; cuts the
dead upper-triangle memset out of factor_one_supernode's
per-supernode hot path. See dev/decisions.md 2026-05-12.
compute_infnorm_dense(&SymmetricMatrix) in src/scaling/infnorm.rs
— dense-native Knight-Ruiz iteration for the D.3/D.4 dense
fast-path; the sparse compute_infnorm remains the path for the
multifrontal driver.
Solver::inertia() — Option<&Inertia> accessor returning the
full inertia of the last successful factor. Complements the
Ipopt-shaped num_negative_eigenvalues (which panics if no factor
is stored) for callers that prefer to branch on None. Used by
the cross-solver bench harness.

Tooling

Cross-solver comparison harness under
external_benchmarks/comparison/ (run.py, aggregate.py, report.py)
measures feral against MUMPS 5.8.2 and HSL MA97 2.8.1 on a sampled
SuiteSparse subset and emits REPORT.md. Each solver is configured
to its production-quality (refinement-on) settings so the residual
comparison is apples-to-apples:
- feral driver routes through solve_sparse_refined (Richardson
  refinement with stagnation exit) in src/bin/bench_one_matrix.rs.
- external_benchmarks/mumps_oracle/mumps_bench.F sets
  ICNTL(10) = 2 (max two iterative-refinement steps) — MUMPS
  default is 0 (no refinement).
- external_benchmarks/hsl_bench/hsl_bench.c wraps
  ma97_solve_d in a 4-step Richardson loop because MA97 has no
  native residual-based refinement entry for non-singular systems.
  Configuration is documented in the generated report's Solvers
  table.

Performance

Dense fast-path (dense_fast_factor) now runs Knight-Ruiz ∞-norm
scaling directly on the column-major lower-triangle buffer
produced by to_dense_into, removing the row_idx[k]
indirection that dominated wall time on small-dense matrices.
Routing: ScalingStrategy::Auto and InfNorm go through the
dense KR; Mc64Symmetric, Identity, External are honored
via the unchanged sparse path. Bit-exact with the sparse KR on
every fast-path-gate matrix (should_use_dense_fast_path ⇒
matrix is small enough that every column-major slot maps 1:1
onto a CSC entry or a known-zero, and (d_i · 0 · d_j) = 0
is a no-op in the max-reduction). Targets the
dev/results/lever-d3/stage1-stage2-2026-04-19.md §1 finding
that compute_scaling was 82% (34 of 41 µs) of dense-path
wall time on TRO3X3_0013.
Pooled local_contribs per rayon worker inside FactorWorkspace,
removing a per-task Vec<Option<ContribBlock>> of length
n_snodes from the parallel driver. Decisive on cont-201
(sequential –34%, parallel-at-T=8 –10%); also helps
bratu3d (–6% / –5%). Bit-exact. See dev/decisions.md
2026-05-12 and dev/sessions/2026-05-12-01.md.
Skip the upper-triangle zero on pooled frontal buffer reuse
(SymmetricMatrix::from_pooled_buf). Bit-exact; 5–10% sequential
wall reduction across mid-size matrices.

Investigated

Parallel multifrontal driver lock contention (T=4) — falsified.
Added opt-in AtomicLockStats telemetry to NumericParams with
per-task lock wait/hold counters and eight per-phase wall-time
counters wrapping the sequential prologue/epilogue. cont-201's
previously-reported residual headroom is sequential symbolic
factorize (157 ms of a 214 ms single-shot wall), not mutex
contention (worst-case 3.4% of body time on cont-201, 0.02% on
c-big). On the cached-symbolic path (production / pounce-IPM
regime) cont-201 wall drops 214 → 56 ms with body_frac jumping
0.15× → 0.55×; remaining 1.5× headroom is inside the rayon::scope,
not at lock sites. Full analysis in
dev/debugging/2026-05-12-cont201-cached-headroom.md.
Parallel driver within-scope localization (iteration 2) —
rayon idle dominates. Added task_wall_ns (whole-closure
bracket) and ws_lock_wait_ns (per-worker workspace mutex
wait) to AtomicLockStats. Derived rayon_idle = scope·T − task_wall_agg quantifies the parallelism deficit attributable
to etree dependencies. cont-201 cached: rayon_idle = 12.3 ms/T
(78% of the gap) vs locks 1.7 ms/T (10%) and ctrl-flow 1.5
ms/T (10%). c-big at T=4 is essentially sequential (74%
rayon-idle capacity, 1.04× speedup vs body_agg). Conclusion:
assembly-tree parallelism is exhausted on these matrices;
within-supernode parallelism (panel-BK / threaded dense
kernels) is the only remaining axis. Closes the cont-201
assembly-tree investigation. See iteration 2 in
dev/debugging/2026-05-12-cont201-cached-headroom.md.
Scaling cache verification (iteration 3) — compute_scaling_with_cache
works as designed. Added solver_scaling_phase_split test
(#[ignore]) that loads the corpus and times
pick_scaling_strategy + compute_scaling_with_cache(cache=None)
- reorder gather. c-big picks Mc64Symmetric and the no-cache
  path takes 2.3 seconds (full Hungarian); the cached path
  in production takes 2.4 ms — 1000× speedup, cache hits.
  cont-201 and bcsstk38 pick InfNorm, which is values-dependent
  Knight-Ruiz iteration (~4 ms per call) and is not cacheable
  across IPM iterations. The 3.95 ms scaling slice on cached
  cont-201 is fundamental per-factor work, not a missed cache.
  Closes scaling probe from session 2026-05-12-02 "Next session
  should #2".
Issue #5 (MSS1 BK inertia non-monotone under δ_w·I): triage
complete, closed on the feral side. Landed a reproducer test
- zero_tol/pivot_threshold sweep diagnostics in
  src/numeric/factorize.rs::tests. Empirically demonstrated
  that no in-kernel magnitude-floor lever cures the wandering;
  cross-checked MUMPS 5.8.2 and MA57 (via Ipopt's wrapper) and
  confirmed neither implements eigenvalue-aware 2×2 splitting.
  Recommended fix is upstream (caller-side δ_c bump matching
  Ipopt's PerturbForSingularity). Full analysis in
  dev/research/issue-5-mss1-inertia-monotonicity.md §9.

Contributors

janosh

Assets 2

Releases: jkitchin/feral

feral v0.11.0

feral 0.11.0

Highlights

Packaging

Correctness & performance campaign

Uh oh!

feral v0.10.0

Changed

Fixed

Compatibility

Uh oh!

v0.9.0

Fixed — batched iterative refinement for wide multi-RHS solves (#58)

Changed — multi-RHS sparse solve: BLAS-3 panel kernels (#57 fix #2)

Changed — multi-RHS sparse solve: row-major working buffer (#57)

Uh oh!

feral v0.8.0

Removed — 4 synthetic rank-deficient stress-corpus matrices

Performance — Thomson-Hessian per-iter throughput (#56)

Added — symbolic-analysis-time delayed-pivot budget + CB rewire (#55)

Changed — strict-zero pivots route to inertia.zero (SSIDS-aligned, #54)

Added — DelayBudgetExceeded exposed through Python bindings

Added — documentation site (mdBook + rustdoc on GitHub Pages)

Uh oh!

feral v0.7.0

v0.7.0 — opt-in instrumentation accessors

Added

Performance

Deferred

Uh oh!

feral v0.6.0

Added — Solver::with_scaling() builder ([#51][i51])

Fixed — sticky Auto scaling on cached pattern ([#51][i51])

Tests

Changed — Auto dispatcher rewrites ([#50][i50])

Uh oh!

feral v0.5.0

Uh oh!

v0.4.0 — Mittelmann parity with MA57 + auto-armed cascade-break

Headline

Highlights

Install

Compiling FERAL into Ipopt

Uh oh!

v0.3.0 — Feral C ABI for Ipopt + 3-way NLP comparison

Added — Feral C ABI for Ipopt linkage (feral::capi)

Added — Ipopt 3-way NLP comparison harness

Added — MA57 oracle + 4-way cross-solver comparison

Added — Issue #9 Steps 2 + 3: 32×32 register-resident kernel wired into production

Changed — Per-supernode fixed-overhead reduction (#13, Phases A + B + C)

Changed — Per-supernode fixed-overhead reduction (#13, Phases A + B background)

Added — BLAS-3 quad-column trailing-update kernel (#9, parked on #13)

Added — block_ldlt32 scaffold and trailing-update primitives (#9)

Uh oh!

v0.2.0

Fixed — Honest resolved_method and consistent Auto routing (#3)

Fixed — CscMatrix::from_triplets rejects upper-triangle entries (#4)

Changed

Added

Tooling

Performance

Investigated

Contributors

Uh oh!

Changed — strict-zero pivots route to `inertia.zero` (SSIDS-aligned, #54)

Added — `DelayBudgetExceeded` exposed through Python bindings

Added — `Solver::with_scaling()` builder ([#51][i51])

Fixed — sticky `Auto` scaling on cached pattern ([#51][i51])

Changed — `Auto` dispatcher rewrites ([#50][i50])

Added — Feral C ABI for Ipopt linkage (`feral::capi`)

Fixed — Honest `resolved_method` and consistent `Auto` routing (#3)

Fixed — `CscMatrix::from_triplets` rejects upper-triangle entries (#4)