Skip to content

Releases: jkitchin/feral

feral v0.11.0

12 Jun 15:10

Choose a tag to compare

feral 0.11.0

Highlights

  • Unsymmetric LU basis engine (feral::lu, #81) — general square
    A = P L U Q factorization with ftran/btran solves and simplex-style
    product-form column updates, auto-routing between a dense and a sparse
    engine. Exposed in Python as LuFactor.
  • Expanded Python interface (all additive, backward-compatible):
    LuFactor; factor access via Solver.factors() (unit-lower L as
    CSC / optional scipy, block-diagonal D, perm/scaling for the
    L D Lᵀ = P (S A S) Pᵀ reconstruction); symbolic analysis via
    feral.analyze(...) / Solver.symbolic(); and introspection plus new
    Solver(...) tuning knobs (ordering, profiling, mc64_cache, …),
    pivot-magnitude and MC64 counters, last_factor_stats(),
    profile_report(), scaling_info.
  • Documentation — new mdBook Python chapter sections, a new executed
    example notebook 05_lu_and_introspection, and a refreshed top-level
    README (LU engine, production-ready Status framing).

Packaging

  • The six fill-reducing ordering crates (feral-ordering-core, feral-amd,
    feral-amf, feral-metis, feral-scotch, feral-kahip) are bumped
    0.2.0 → 0.2.1 so crates.io picks up the API-compatible fixes and perf
    work accumulated since 0.2.0 (they had been silently skipped at publish
    across 0.3.00.10.0). feral's version = "0.2" requirement resolves
    to 0.2.1 unchanged. A release-checklist guard now flags this staleness
    class so it cannot recur.

Correctness & performance campaign

This release also folds in the repo-review campaign: a large set of
correctness fixes (inertia reporting, MC64 partial-singular fallback,
sparse/dense LU singularity handling, MTX-reader validation, CSC validation,
2×2 D-block gating, static-pivot scale-invariance) and ordering-crate
algorithm fixes (real SHEM coarsening and true GGP gain in feral-metis,
vertex-weight flow balancing and n_components in feral-kahip,
separator-FM in feral-scotch, sorted-rows invariant in
feral-ordering-core), plus several performance improvements (pooled scratch
reuse, borrow-not-clone in feral_factor, linear default postorder).

See CHANGELOG.md
for the complete, itemized list (every fix is tagged with its finding ID).

Inertia remains validated against the feral + MUMPS 5.8.2 + SPRAL/SSIDS
consensus oracle; 705 tests passing.

feral v0.10.0

04 Jun 11:54

Choose a tag to compare

Ordering-routing release. No open issues at cut; 20 commits since v0.9.0.

This is a minor release (not a patch) because the headline change is a
deliberate change to default Auto ordering behavior, not just a bug fix.

Changed

  • Auto now prefers AMF over MetisND at every size (#73, follow-up to #67).
    The AMF_BAND_MAX ceiling is removed: whenever the size rule would pick
    MetisND, choose_adaptive now routes to AMF at every n. The #50 powerflow
    guard (n > 100k && avg_deg < 5 → Amd) and the #64 arrow/dense-border catch
    still fire first, so only the uniformly-thin would-be-MetisND population is
    affected. The #73 A/B (factor+solve) showed AMF winning on every measured
    matrix in the n>100k thin regime (dtoc2 2.49×, cont5_1_l 2.75×, nql180 2.05×,
    pinene 1.18×, YATP1NE 2.13×) — including a case where MetisND had smaller
    symbolic fill yet was 2× slower, so a fill-guarded race was rejected.
  • 144 diagnostic binaries relocated to crates/feral-diagnostics (#71). Root
    cargo build/test/clippy now operate on the feral package only. Run
    diagnostics on demand with cargo run -p feral-diagnostics --bin <name>.

Fixed

  • Spurious zero pivots → wrong inertia under Auto scaling on ill-conditioned
    KKTs (#65).
    Inertia-guided MC64 fallback corrects the false-infeasible signal.
  • Arrow/bordered-KKT ordering blow-up (#64). Auto no longer picks MetisND on
    arrow/bordered patterns that caused ~7× fill blow-up (r05 LP); routes to AMF.

Compatibility

Public Rust API and C ABI unchanged. Functional changes are confined to the root
feral package; the ordering plugin crates are unchanged (stay at 0.2.0).

See CHANGELOG.md for the full per-change detail.

v0.9.0

31 May 00:18

Choose a tag to compare

Fixed — batched iterative refinement for wide multi-RHS solves (#58)

Solver::solve_many_refined looped the single-RHS refiner per column,
which bypassed the BLAS-3 panel kernel — so with refinement on (the
default) batched multi-RHS solves were 3–7× slower per RHS than the
unrefined batched path, and could be slower than looping single-RHS.

Wide refined solves (nrhs ≥ 16) now refine through a batched loop:
the initial and per-step correction solves go through solve_sparse_many
(one panel solve over the still-active columns), and the residual is a
per-column SpMV. Per-column best-iterate tracking and the convergence
predicates (ε·√n relative target, 2-strike plateau, 100× divergence)
are preserved exactly, and each step compacts the active columns so the
batched path never does more solve work than the per-column loop. Narrow
solves (nrhs < 16, e.g. the IPM predictor-corrector) keep the
per-column loop unchanged.

For 16 ≤ nrhs < 32 the batched result is bit-identical to the
per-column loop (the solve runs the rank-1 kernels there); for
nrhs ≥ 32 it agrees with the per-column oracle to the refinement
residual target. Measured (bench_multirhs, 2-D Laplacians): batched
refined is ~2.5–3× faster per RHS than the per-column refined loop.

The common case (every column already at the target after the direct
solve, i.e. ~0 correction steps) now returns before allocating the wide
best_x/residual buffers — it allocates only the solution and a
length-n scratch. Those three n × nrhs allocations were cheap in the
native binary but cost ~50 µs/RHS under the Python process's allocator,
masking the win through Solver.solve_refined on 2-D inputs; with them
gone the Python refined path shows the full ~2–3× too.

Changed — multi-RHS sparse solve: BLAS-3 panel kernels (#57 fix #2)

Wide multi-RHS solves (nrhs ≥ 32) now run each supernode's
forward/backward substitution as a register-blocked dense panel solve
— a TRSM on the unit-lower triangle L_11 plus a MR×NR (4×8)
register-tiled GEMM on the trailing block L_21 — instead of a rank-1
cascade. Narrow solves (the IPM hot path) stay on the rank-1 kernels.

Two supporting layout changes, both internal (no public API/ABI change,
RHS/solution stay column-major n × nrhs):

  • The internal y working buffer is now row-major, so every
    per-supernode gather/scatter is a contiguous memcpy instead of a
    stride-n transpose. This removed a cache-conflict pathology on
    power-of-two n (the n=1024 grid had regressed) and roughly halved
    the wide-solve time across the board.
  • The forward substitution and D-block solve are fused into one
    postorder pass
    (a node's eliminated rows are final once its
    forward-sub completes), saving one gather/scatter round trip per
    supernode.

Forward substitution stays bit-identical to looping single-RHS;
back-substitution differs only by floating-point reassociation
(~κ·eps). The multi-RHS parity suite (nrhs up to 64, MR/NR tail
sizes) verifies max|many − single| < 1e-12 against the independent
single-RHS oracle (observed ≤ 1.6e-15).

Measured per-RHS speedup vs looping single-RHS, on 2-D Laplacians
(cargo run --release --bin bench_multirhs, nrhs ∈ {64, 256}):
n=484 ~4–5×, n=1024 ~3×, n=2025 ~5–6×.

Changed — multi-RHS sparse solve: row-major working buffer (#57)

The per-supernode working buffer in solve_sparse_core_many_into now
uses a row-major layout (w[i*nrhs + c]) instead of column-major
(w[c*nrow + i]). The per-RHS inner loops in forward-sub, the D-block
solve, and back-sub are now contiguous (stride-1) and auto-vectorize.

The caller-visible RHS/solution layout is unchanged — it remains
column-major n × nrhs, matching MUMPS/SSIDS — so there is no public
API or ABI change. The single-RHS path (solve_sparse_core_into) and
the iterative-refinement path are untouched. Results are bit-identical
to looping the single-RHS solve (verified by the multi-RHS parity
suite, max|many − single| = 0).

Measured per-RHS speedup of one batched solve_sparse_many call vs
looping single-RHS, on 2-D Laplacians (cargo run --release --bin bench_multirhs): batched/looped ratio 0.76–0.99 (up to ~1.3× faster
per RHS) at nrhs ∈ {64, 256}. The larger 5–10× regime needs the
BLAS-3 panel kernels (issue #57 fix #2, deferred).

feral v0.8.0

29 May 01:40

Choose a tag to compare

Removed — 4 synthetic rank-deficient stress-corpus matrices

Dropped rankdef_10_3, rankdef_50_5, rankdef_exact_50_5, and
stokes_q1p0_8 from the stress corpus. Under #54's SSIDS-aligned
strict-zero routing, feral reported inertia.zero = 1 on all four —
which contradicted MUMPS, SSIDS, and MA57 simultaneously,
violating CLAUDE.md's "must agree with at least one canonical" rule
and red-lighting the stress-smoke gate on the v0.8.0 release commit.

Rather than allowlist them or narrow #54's zero_tol (which would
reopen the IPM δ-cascade instability on nuffield2_trap_iter1.mtx
that motivated #54 in the first place), the four matrices were
removed entirely. They were synthetic borderline fixtures where the
"correct" zero count depends on order-1e-15 round-off and no
3-of-4-oracle consensus exists — the same disagreement that
compute_consensus.py already tags excluded. The rank-deficient
regime remains covered by rankdef_5_2, rankdef_200_20,
rankdef_exact_100_10, saddle_rankdef_50_10_3,
saddle_rankdef_100_20_5. See dev/decisions.md (2026-05-28 entry).

Performance — Thomson-Hessian per-iter throughput (#56)

Three additive levers on the dense / wide-supernode IPM-KKT hot path,
landed on issue-56-thomson-hessian-throughput and merged to main:

  • Lever A — pre-built permuted_pattern + permute structure cache.
    Collapsed the symbolic permute (P A P^T) + symmetric_pattern
    phases to <0.5 % of total wall on warm calls.
  • Lever B — fused single-pass contribextract write. Replaced
    resize(cdim², 0.0) + lower-triangle overwrite with
    reserve + unsafe set_len + write-each-cell-once. Bit-identical
    contrib block (extend_add reads only ci ≥ cj; root-Schur
    extractor canonicalizes via transpose; parallel_corpus_parity
    binary-compares the full buffer). Roughly 33 % fewer writes.
    Re-measurement at Thomson n=200 (sequential, 9 warm reps):
    factor min −5.5 % (parallel ON), −10 % (parallel OFF).
  • Lever C — InfNorm Knight-Ruiz inner-loop vectorization. Hoisted
    the loop-carried row_max[j] dependency to a register accumulator
    (sparse and dense paths, bit-identical by associativity of max on
    non-NaN finite inputs); added a pulp-dispatched SIMD off-diagonal
    kernel on the dense path. Re-measurement at Thomson n=200:
    scaling phase −19 %, total wall −5 %.

No regression on the Phase 2.8.1 corpus partition gates; both
small-frontal and medium buckets improved relative to the 2026-05-27
baseline. KIRBY2 family worst-case sparse outlier improved from
10.25× → 7.97× vs MUMPS.

See dev/sessions/2026-05-28-01.md and
dev/research/issue-56-thomson-hessian-throughput-2026-05-27.md for
the localization data and re-measurement tables.

Added — symbolic-analysis-time delayed-pivot budget + CB rewire (#55)

Per-supernode delayed_capacity is assigned during symbolic
analysis. The numeric phase enforces the bound at the frontal
expansion site: if more pivots delay into a supernode than its
capacity, the factor returns the structured
FeralError::DelayBudgetExceeded { supernode, required, capacity }
(MUMPS INFO(2) workspace-overflow analog) rather than growing
the front unboundedly. Cascade-break now triggers on budget
exhaustion instead of the heuristic ratio gate, matching MUMPS's
dfac_front_aux.F:1251-1331 invariant that static perturbation
fires only when delay is structurally impossible.

Capacity formula:

  • tight = max(4 * own_ncol, 16)
  • capacity = min(subtree_ncol(s) - own_ncol(s), tight)

Root-supernode cap (defensive, n >= 1024): declines amalgamations
that would push the root past min(0.05 * n, 2048) columns.

FeralConfig::default() now ships with cascade-break armed
(cascade_break_ratio = Some(0.5), cascade_break_eps = Some(1e-10))
as the budget-exhausted fallback. Pounce's per-problem .opt
overrides for nql180 / pinene_3200 are no longer required.

See dev/research/symbolic-delay-budget-2026-05-27.md and
dev/decisions.md (2026-05-27 entry, frozen convention).

Changed — strict-zero pivots route to inertia.zero (SSIDS-aligned, #54)

When ZeroPivotAction::ForceAccept accepts a 1×1 pivot whose
magnitude satisfies |d| <= zero_tol, the inertia now increments
zero instead of routing by sign(d). This matches SSIDS
(NumericSubtree.hxx:259-267) and HSL MA57 (INFO(24) = neig,
INFO(25) = number of zero pivots) and supersedes the Issue #42
Option A sign-routing rule.

Motivation: pounce's IPM δ-cascade on nuffield2_trap_iter1.mtx
(LP-shaped KKT, n=26155) stalled for 600 s (vs 1.8 s on MA57)
because Option A split bit-exact zero pivots by IEEE round-off and
the counter jumped backwards mid-cascade (13042 → 12615). The new
accounting restores monotonicity (probe confirms 0 backwards jumps
across δ_x ∈ {0 … 6.99e19}).

Behavior changes visible to callers:

  • Solver::inertia() and factor()'s Inertia field now report
    the mathematical (Sylvester) inertia on matrices with rank
    deficiency that lands in strict-zero pivots.
  • num_negative_eigenvalues() returns strict inertia.negative
    only (unchanged convention). IPM callers comparing against an
    expected oracle should now sum negative + zero to match
    MA57's INFO(24)+INFO(25) convention.

Tests updated to reflect the new convention; see
dev/decisions.md (2026-05-26 entry) for the full trade-off
analysis and dev/research/issue-54-lp-kkt-inertia.md for the
oracle cross-check.

Added — DelayBudgetExceeded exposed through Python bindings

feral-solver (Python) now re-exports the #55 error as a named
exception:

  • feral.DelayBudgetExceeded — subclass of feral.FactorError
    (and transitively feral.FeralError), raised when
    factor()/refactor() hits the symbolic-analysis-time
    delayed-pivot budget. The message carries the supernode index,
    required and capacity columns. Python callers no longer see this
    routed through the generic NumericFailure wildcard.

Added — documentation site (mdBook + rustdoc on GitHub Pages)

  • book/ — mdBook skeleton (introduction, getting-started,
    inertia, api). Content is intentionally thin; the dev/research/
    notes remain internal and are not auto-published.
  • .github/workflows/pages.yml — builds the book and
    cargo doc --workspace --no-deps --lib, mounts rustdoc under /api/,
    and deploys to GitHub Pages on every push to main. Site URL is
    https://jkitchin.github.io/feral/; rustdoc at /api/feral/.
  • Cargo.toml: /book added to the crate exclude list so the book
    source does not ship in the published crate tarball.
  • .gitignore: /book/book ignored (mdBook output is regenerated by
    the Pages workflow).

Repo setting required once: Settings → Pages → Source: GitHub
Actions
. Until that flag is set, the deploy job will fail with a
"Pages not enabled" error; the build job is unaffected.

feral v0.7.0

25 May 17:13

Choose a tag to compare

v0.7.0 — opt-in instrumentation accessors

Minor release for #52.
New public surface on Solver for downstream IPM debugging of
linear-solver behavior, without re-running with cargo flamegraph
or instrumenting forks.

Added

Phase A — always-on snapshot. Solver::last_factor_stats()
returns Option<FactorStats> populated after every successful
factor(). Fields: nnz_a, nnz_l, fill_ratio, inertia,
min_abs_pivot, max_abs_pivot, pattern_reused, scaling_info.
No gating flag — the two extra integer writes per factor() are
cheaper than a gate check would be.

Phase B — opt-in profiler. Solver::with_profiling(true)
enables per-supernode numeric timings and one-shot symbolic timings,
read back via Solver::profile_report() and
Solver::symbolic_profile_report(). The symbolic report is
populated only on cache-miss factors — None on cache hits is the
unambiguous "did symbolic actually run" signal.

Performance

Default is false; when off, no profiler arcs are allocated and the
code path is byte-identical to a pre-issue-52 build. Measured by
benches/issue52_overhead.rs on tridiagonal SPDs
(n ∈ {64, 256, 1024}, sequential, 30 samples × 3 s, dev machine):

n main baseline default-off profiling-on
64 260.6 µs 257.6 µs 258.9 µs
256 348.2 µs 345.2 µs 347.0 µs
1024 719.6 µs 714.7 µs 709.1 µs

Default-off vs main baseline: within ±1.2% (often faster — noise).
Profiling-on vs default-off: within ±1%. Both deltas sit inside the
criterion noise band on tridiagonal workloads.

Deferred

  • Thread-local accumulator for parallel-driver profiler contention
    (plan §B2b) — documented as escape hatch; revisit only if a real
    IPM workload regresses with profiling on.
  • Phase C residual-norm / refinement-step accessors — out of scope
    for this issue.

feral v0.6.0

23 May 22:31

Choose a tag to compare

Added — Solver::with_scaling() builder ([#51][i51])

New public builder method on Solver to pin the scaling strategy
explicitly. Lets IPM hosts override ScalingStrategy::Auto when
they already know what they want — recommended escape hatch when
the picker's per-matrix heuristic is wrong for a problem class.
Default (Auto) is now sticky on cached pattern (see "Fixed" below)
so the picker no longer flaps; this builder is for callers that
want full control.

Fixed — sticky Auto scaling on cached pattern ([#51][i51])

Three coordinated fixes in src/numeric/solver.rs address a ~50×
numeric-phase slowdown on IPM workloads when the picker re-routes
across iters or PartialSingular outcomes drop the MC64 cache.

Reproducer (pounce/gams/nlpbench/feral_repro/powerflow22/,
n=2,813,976 IPM KKT, default Auto):

call pre-fix post-fix
factor (cold) 54.77 s 54.42 s
refactor (cached symbolic, iter A values) 1.00 s 1.02 s
factor #2 (cached symbolic, iter B values) 53.80 s 1.07 s

The three fixes:

  1. Sticky Auto pick. First factor() on a pattern runs the full
    compute_scaling_auto_with_cache pipeline (preserves Policy-4
    fallback semantics — InfNorm-spread guard, off-diag-ratio guard,
    MC64 catastrophic-spread guard). Post-call we derive the resolved
    strategy from factors.scaling_info
    (Mc64FallbackToInfnorm → InfNorm,
    PartialSingular → Mc64Symmetric, NotApplied → Identity,
    Applied → pick_scaling_strategy(matrix)) and stash it on the
    Solver as auto_picked_strategy. Every subsequent factor on
    the same pattern uses the stashed strategy directly, bypassing
    Auto. Pattern change clears it alongside the MC64 cache. Mirrors
    MUMPS ICNTL(7) / SSIDS options%ordering: structural decision
    once at first call, reuse every refactor.

  2. MC64 cache gate widened to PartialSingular. The Hungarian
    on a structurally rank-deficient KKT still produced a real
    scaling vector (unmatched positions land at 1.0 per
    mc64.rs:222); the value-bound check still gates reuse, so
    caching is correctness-safe. Pre-fix the post-#49 gate required
    ScalingInfo::Applied, which dropped the cache on every IPM iter
    over a structurally rank-deficient KKT and forced the Hungarian
    to rerun from scratch.

  3. Solver::with_scaling() builder (see "Added" above).

Tests

Three unit tests in src/numeric/solver.rs:

  • issue_51_with_scaling_builder_overrides_default
  • issue_51_auto_pick_is_sticky_on_cached_pattern
  • issue_51_partial_singular_populates_cache

And one #[ignore]'d corpus regression test:
issue_51_corpus_sticky_auto_holds_across_ipm_iters walks
tests/data/parity/<family>/*.mtx for every family with ≥ 2 IPM
iter snapshots, factors all iters against one Solver::new(), and
asserts the sticky pick holds across every iter (13 families
covered: acopp30, hatfldbne, hahn1, ssi, …).

mc64_fallback_surfaces_via_solver_api was updated to reflect the
new sticky semantics: iter 2 now asserts
mc64_fallback_count == 1 (sticky pin runs straight InfNorm; no
fallback to surface) and locks
auto_picked_strategy = Some(InfNorm).

Changed — Auto dispatcher rewrites ([#50][i50])

The OrderingMethod::Auto dispatcher (src/symbolic/mod.rs:: choose_adaptive) was simplified to two rules on top of
pick_default_method: kept only the very-large-and-sparse catch
(n > 100_000 && full_avg_deg < 5.0 → Amd); everything else
delegates to pick_default_method (n ≤ 10_000 → Amf,
n > 10_000 → MetisND).

  • Fix A — large-and-sparse swap (c442a0c). The pre-fix
    n > 100_000 && full_avg_deg < 5.0 → ScotchND branch is swapped
    to Amd. On powerflow22 (n=2.8 M, full_avg_deg ≈ 3.7) prior
    ScotchND took 113.8 s symbolic (15.8 M nnz_L); AMD takes 55 s
    (10.4 M nnz_L). The ScotchND advantage at very large n had been
    load-bearing against the same BK pivoting cascade that motivated
    the chain catches; issue #46 eliminated that amplifier in May
    2026 and removed the justification for routing very-large sparse
    matrices through nested dissection. Corpus inventory:
    dev/research/issue-50-numeric-inventory.csv shows the IPM
    corpus's [100k, 200k) bucket has AMD/MetisND num_nnz_l ratio 1.00
    on both representatives. Validation: 258 chain-catch corpus rows
    under post-Fix Auto — 0 failures, 0 num_nnz_l regressions for
    matrices that actually reroute. See
    dev/research/issue-50-metisnd-symbolic-cost.md §F7–§F11.

  • F11 follow-up — small-and-sparse retire (3f8f6f6). The
    pre-fix n < 10_000 && full_avg_deg < 15.0 → KahipND branch is
    deleted entirely; the population now falls through to
    pick_default_method's n ≤ 10_000 → Amf default (MUMPS
    ana_set_ordering.F SYM=2 N≤10000). Justified by an 838-matrix
    4-way inventory (dev/research/small-sparse-inventory.csv): AMF
    wins 169/838 strict per-matrix (KahipND 16); aggregate num_nnz_l
    ratios to AMD are AMF 0.870×, KahipND 0.984×; aggregate factor_us
    ratios are AMF 0.832×, KahipND 0.990×. KahipND remains reachable
    via OrderingMethod::KahipND for the 41 high-avg-deg cases
    (STEENBRD, HADAMARD, TABLE8) where it still wins — all sub-22k
    nnz_L absolute. See dev/research/issue-50-metisnd-symbolic-cost.md
    §F12.

feral v0.5.0

22 May 19:51

Choose a tag to compare

feral v0.5.0 (2026-05-22)

First tagged release. Sparse symmetric indefinite direct solver in
pure Rust with certified inertia counts.

Highlights since 0.4.0:

  • Added: value-bounded MC64 scaling cache; near-singularity signal.
  • Changed: inertia counts every pivot by sign; oracle-consensus parity.
  • Performance: Schur trailing-update kernel widened (quad NEON-tile).
  • Fixed: 8 correctness/robustness fixes across scaling, inertia, and
    delayed-pivot handling (#39, #45, #46, #47, #48, #49).

See CHANGELOG.md [0.5.0] for the full list.

v0.4.0 — Mittelmann parity with MA57 + auto-armed cascade-break

17 May 21:42

Choose a tag to compare

Headline

feral is now at parity with MA57 on the 47-problem Mittelmann NLP
benchmark panel (Ipopt 3.14.20 + --linear_solver=feral vs
--linear_solver=ma57, 600 s wall timeout each):

solver solve-rate geomean wall (37 both-solved) aggregate wall
MA57 39/47 1.000 3166 s
feral 39/47 0.960 1699 s (54%)

Two changes closed the gap: an auto-armed cascade-break default
in the C-ABI and a tightened MC64 routing gate. Headline rescues
on Mittelmann: marine_1600 -96%, clnlbeam -87%, robot_1600 -79%,
corkscrw -70%, dtoc2 rescued from timeout.

Highlights

  • Auto-armed cascade-break in feral_new (Ipopt C-ABI). Arms
    Solver::with_auto_cascade_break(0.05) by default, disable with
    FERAL_AUTO_CB_BETA=0. (#37, #38)
  • Scaling routing tightened. ScalingStrategy::Auto now requires
    a dense-arrow head (max_col_nnz > 32) in addition to the
    diag-only mass threshold, fixing a 1-D-PDE-KKT regression. (#68)
  • MA57-style static-pivot perturbation knob + non-finite input
    validation + closed-form 2x2 eigenvalue inertia classifier +
    stale-MC64-cache invalidation. (#38)
  • mittelmann_ipopt benchmark harness — runs Ipopt+feral and
    Ipopt+MA57 over the 47-problem NLP panel and produces a side-by-
    side report.
  • Python bindings (feral-solver on PyPI) now at 0.4.0 with
    full IPM helper (feral.ipm.KktSolver), scipy.sparse adapters,
    and JAX interop.
  • Ipopt integration documented end-to-end in the README, with the
    feral-ipopt-shim subproject providing one-shot
    make ipopt && make hs071.

See CHANGELOG.md
for the full per-PR release notes (auto-CB, scaling routing,
diagnostic tracers, env-knob reference, parity panel notes,
oracle-consensus correctness gate, and v0.4.0 housekeeping).

Install

pip install feral-solver         # Python wheels (CPython 3.10+)
cargo add feral                  # Rust crate

Compiling FERAL into Ipopt

See the "Using FERAL inside Ipopt" section of the README — one-shot
recipe is cd feral-ipopt-shim && make ipopt && make hs071.

v0.3.0 — Feral C ABI for Ipopt + 3-way NLP comparison

13 May 17:15

Choose a tag to compare

Added — Feral C ABI for Ipopt linkage (feral::capi)

New pub mod capi (src/capi.rs) exposes a minimal C ABI surface
matching Ipopt's SparseSymLinearSolverInterface plug-in shape:
feral_new, feral_free, feral_set_structure, feral_values_ptr,
feral_factor, feral_solve, feral_num_neg. Matrix format is
Ipopt's CSR_Format_0_Offset (upper-triangle CSR, 0-based) which is
byte-identical to feral's lower-triangle CSC. Status codes mirror
Ipopt's ESymSolverStatus enum.

Cargo.toml adds staticlib to crate-type so the ABI can be
linked into the C++ Ipopt build via the feral-ipopt-shim/ patch
(opt-in for downstream Ipopt builders; pure-Rust consumers continue
to use the rlib). See dev/research/feral-ipopt-c-shim.md and
dev/plans/feral-ipopt-shim.md for the design.

Added — Ipopt 3-way NLP comparison harness

external_benchmarks/nlp_comparison/ runs the Ipopt
ScalableProblems suite against three Ipopt 3.14.20 binaries
(build-mumps, build-ma57, build-feral), each linked to a
single sparse direct solver. 35 problems × 3 solvers; see
REPORT.md for the 2026-05-13 sweep. MUMPS 35/35 optimal, MA57
34/35, feral 34/35; geomean over triple-optimal subset: MUMPS
139 ms, feral 158 ms, MA57 162 ms. Generates results.json and a
Markdown report. Logs/out/RHS blobs gitignored; only the harness +
report are tracked.

Added — MA57 oracle + 4-way cross-solver comparison

external_benchmarks/ma57_oracle/ builds a CoinHSL MA57 benchmark
binary alongside the existing MUMPS/SSIDS oracles.
external_benchmarks/comparison/ is extended from 3-way to 4-way
(feral + MUMPS + SSIDS + MA57), with new run.py / aggregate.py
/ report.py wiring MA57 into the per-matrix sample comparison and
REPORT.md summary.

Added — Issue #9 Steps 2 + 3: 32×32 register-resident kernel wired into production

Step 3 (SIMD body). update_1x1_block32 in
src/dense/block_ldlt32.rs tiles trailing destination columns in
groups of four through schur_panel_minus_nofma_strided_quad
(n_elim=1), with a trailing _dual for the 2-column tail and a final
axpy_minus_unroll4_nofma for the 1-column tail. Each tile packs 4
dst columns per pulp dispatch sharing one source-column load — the
intended Phase 2.4.3 register-resident pattern. Per-element output is
byte-identical to the scalar reference and to factor::do_1x1_update
(verified by 4 bit-parity unit tests at p=0, p=5, p=30, zero-pivot).

Step 2 (dispatch wiring). do_1x1_update and do_2x2_update
(factor.rs) gain an n == 32 fast-path delegating to
update_1x1_block32 / update_2x2_block32.
factor_frontal_blocked_in_place_with_scratch dispatches
nrow==ncol==32 fronts to factor_block32 (which delegates to
factor_frontal); the eager unblocked BK loop drives the SIMD update
via the fast-paths. This bypasses lblt_panel_frontal for full
32×32 fronts because, at bs==ncol==32, the panel's
apply_blocked_schur_panel quad-dispatch path is unreachable
(j_start = k + n_elim == nrow skips the batched trailing update),
so all trailing-update FLOPs are done by single-column peek-ahead
axpys. The eager-update path issues quad dispatches for every
trailing tile of 4 columns instead.

Bench: median small p90 1.33 (was 1.36), median medium p90 1.74
(was 1.78) across 3 runs. Modest but consistent improvement at the
better edge of the noise band. Inertia 154428/154481, byte-identical
to baseline.

Step 4 (rank-2 SIMD body) remains deferred — the quad kernel's
per-q sequential rounding chain is 1-ULP-divergent from
axpy2_minus_unroll4_nofma's fused chain, so a custom
4-dst-column 2-src pulp dispatch is required. 2×2 pivots are rare on
the bench corpus (no measurable bench impact expected); tracked as
follow-up. Step 5 (cross-arch CI gate) also tracked as
follow-up.

Changed — Per-supernode fixed-overhead reduction (#13, Phases A + B + C)

Phase C (single-slot contrib pool). New pub contrib_pool: Option<Vec<f64>> field on FactorScratch. The multifrontal driver puts
the child's ContribBlock.data into the slot after extend_add consumes
it; the kernel takes at extract time, clears+resizes to cdim*cdim,
and writes. When the slot is empty (cold scratch, or take outpaces put),
the kernel falls back to a fresh Vec allocation — bit-identical to the
pre-Phase-C path. An initial multi-slot Vec<Vec<f64>> variant was
abandoned: it preserved bit-parity but regressed bench p90 by ~+0.19
(small) / ~+0.30 (medium) in 4 consecutive runs (growable-indirection
bookkeeping cost more than the malloc/free pairs it avoided). The
single-slot variant is bench-neutral vs Phase A+B (small p90 1.41,
medium p90 1.83–1.85) and bit-parity is preserved across all four
parity cases including the new (d) pool-hot pre-seeded case.

Phase C contributes no measurable bench movement on this corpus, but
the infrastructure is correct and ready: if a future kernel change
makes the contrib allocation a bigger fraction of factor cost, the
recycle path engages automatically. Final issue #13 standing:
criterion #1 (ns/sup reduction) MET, criterion #2 (bench p90 small <
1.30 OR medium < 1.60) unreachable via allocation pooling on this
corpus
, criterion #3 (no correctness regression) MET, criterion #4
(bit-exact blocked_ldlt) MET. Per-front kernel cost (32×32 SIMD,
issue #9) is the next plausible lever for the bench-ratio gap.

Changed — Per-supernode fixed-overhead reduction (#13, Phases A + B background)

Phase A (FactorScratch pool). New FactorScratch { subdiag, d_panel }
struct in src/dense/factor.rs pools the two internal-only working buffers
that factor_frontal_blocked_in_place previously allocated per supernode.
New entry point factor_frontal_blocked_in_place_with_scratch accepts
&mut FactorScratch; the existing function is now a thin wrapper that
allocates a fresh scratch and delegates. FactorWorkspace carries a
factor_scratch field that the three hot-path call sites in
src/numeric/factorize.rs (D.3 dense fast path, factor_one_supernode,
factor_one_small_leaf) thread through. The scratch is safe to re-warm
across different (nrow, bs) shapes — the kernel prologue clears and
resizes unconditionally. Bit-parity gated by
tests/factor_scratch_parity.rs (7-case size sweep + 6-case repeated-
calls regression) plus the 19 byte-identity tests/blocked_ldlt.rs
integration tests.

Phase B (extend_add direct writes). The multifrontal extend_add
in src/numeric/factorize.rs now bypasses SymmetricMatrix::set/get
and writes directly into frontal.data using the lower-triangle column-
major linear index. Per-cell work drops by one indirection, one branch,
and one redundant i >= j sanity check, with the symmetric-storage
canonicalisation preserved at the caller.

Diagnostic (cargo run --bin diag_supernode_cost --release): Phase A
delivered −16 % to −54 % ns/sup on the CRESC100 / ACOPR30 / HAIFAM /
KIRBY2 cluster (issue #13 acceptance criterion #1 MET). Phase B is
within run-to-run noise of Phase A on ns/sup, which is expected
because extend_add is a child-driven post-factor cost rather than
per-supernode.

Bench (cargo run --bin bench --release): dense small-frontal p90
1.33–1.37 and medium p90 1.75–1.78 (vs issue baseline 1.33 / 1.70).
Issue #13 acceptance criterion #2 (small p90 < 1.30 OR medium p90 <
1.60) NOT met by Phases A+B alone. 154428/154481 inertia match
preserved exactly. Phase C (return-struct pooling for l, d_diag,
d_subdiag, contrib, perm, perm_inv) is deferred to a separate
session; design choice (ABI break vs take-into vs with_capacity hints)
is unresolved.

Added — BLAS-3 quad-column trailing-update kernel (#9, parked on #13)

schur_panel_minus_nofma_strided_quad in src/dense/schur_kernel.rs
processes four trailing columns per pulp dispatch, halving src memory
traffic vs the existing dual kernel. Wired into
apply_blocked_schur_panel — every front with ≥ 4 trailing columns
now routes through quad → dual → single fall-through. Bit-exact per
column with four sequential single-column dispatches (176-config
parity sweep + 19 byte-identical blocked_ldlt integration tests).
Zero corpus regression: dense small-frontal p90 1.33 (target ≤ 2.0
PASS), medium p90 1.70 (target ≤ 3.0 PASS); 154428/154481 inertia
match, 99.8 % residual pass.

No measurable headline-throughput win on the current corpus — the
2026-04-27 CHAINWOO_0000 root that motivated the work (1984 × 32) no
longer exists on the current build (max actual nrow = 18 after METIS-
ND on this build). The new bottleneck is per-supernode fixed
overhead, tracked as issue #13. Kernel retained as parked
infrastructure: it re-engages automatically when fronts grow tall-
skinny again. See dev/decisions.md 2026-05-12 (c).

Added — block_ldlt32 scaffold and trailing-update primitives (#9)

New module src/dense/block_ldlt32.rs with BLOCK_SIZE = 32,
factor_block32 stub (delegates to factor_frontal pending the
const-generic driver port), update_1x1_block32, update_2x2_block32
scalar primitives, and a bit-parity test harness diffing factors by
to_bits(). Signatures match the planned pulp dispatch contract; the
SIMD body swap is a surgical follow-up gated on issue #13.

v0.2.0

12 May 23:47

Choose a tag to compare

Fixed — Honest resolved_method and consistent Auto routing (#3)

SymbolicFactorization.resolved_method now reflects what the symbolic
pipeline actually ran rather than what the caller requested. Two
behavior changes:

  • ScotchND silent fallback is surfaced. When SCOTCH's nested-
    dissection recursion produces no separator for the entire graph
    (bordered-KKT shapes such as PoissonControl trigger this), the
    driver falls back to AMD via amd_leaf for every recursion node.
    Previously resolved_method still reported ScotchND while the
    permutation was bit-identical to AMD's. It now reports Amd. The
    recovery itself is unchanged — only its visibility is fixed.
  • OrderingMethod::Auto is consistent with the no-arg default.
    Auto resolution now happens against the original matrix's pattern
    before any LdltCompress preprocessor reshapes the graph, and the
    residual branch delegates to pick_default_method (the rule used
    by the no-arg symbolic_factorize). Previously Auto could pick a
    different concrete method than the default rule on the same
    matrix, depending on whether compression triggered. Auto is now a
    strict superset: same answer as the default plus the two extra
    shape-bakeoff branches (n>100_000 → ScotchND, n<10_000
    KahipND).

Reported by independent triage on K=158 PoissonControl benchmarks.

Fixed — CscMatrix::from_triplets rejects upper-triangle entries (#4)

CscMatrix::from_triplets and CscMatrix::validate now return
FeralError::InvalidInput when any triplet has row < col. Previously
upper-triangle entries were silently accepted and routed through
sort_and_sum_duplicates, producing a CscMatrix whose row indices
violated the documented "lower triangle only" invariant. Downstream
consumers (e.g. symmetric_pattern) assume lower-triangle storage, so
the same symmetric matrix described with upper- vs lower-triangle
triplets produced different solve results. The error message identifies
the offending triplet by index and (row, col). Reported by @janosh.

Changed

  • Solver now defaults to the rayon-parallel multifrontal driver
    (factorize_multifrontal_parallel_with_workspace). The driver is
    bit-exact with the sequential supernodal path on a per-supernode
    basis and falls through to the sequential path when the supernode
    count is below N_PAR_MIN = 32, so small-problem latency is not
    affected. Override with Solver::new().with_parallel(false).
    Closes #7. Motivation: pounce's marine_1600 / pinene_3200
    Mittelmann runs were spending all their time in sequential
    factor_one_supernode even though the parallel driver was
    available; this wires Solver directly to it.

Added

  • Solver::with_parallel(bool) — opt out of the rayon-parallel
    driver (returns Self for builder chaining).
  • Solver::parallel() — test/diagnostic accessor for the current
    flag value.
  • SymmetricMatrix::from_pooled_buf(n, buf) constructor that zeros
    only the lower triangle when reusing a pooled buffer; cuts the
    dead upper-triangle memset out of factor_one_supernode's
    per-supernode hot path. See dev/decisions.md 2026-05-12.
  • compute_infnorm_dense(&SymmetricMatrix) in src/scaling/infnorm.rs
    — dense-native Knight-Ruiz iteration for the D.3/D.4 dense
    fast-path; the sparse compute_infnorm remains the path for the
    multifrontal driver.
  • Solver::inertia()Option<&Inertia> accessor returning the
    full inertia of the last successful factor. Complements the
    Ipopt-shaped num_negative_eigenvalues (which panics if no factor
    is stored) for callers that prefer to branch on None. Used by
    the cross-solver bench harness.

Tooling

  • Cross-solver comparison harness under
    external_benchmarks/comparison/ (run.py, aggregate.py, report.py)
    measures feral against MUMPS 5.8.2 and HSL MA97 2.8.1 on a sampled
    SuiteSparse subset and emits REPORT.md. Each solver is configured
    to its production-quality (refinement-on) settings so the residual
    comparison is apples-to-apples:
    • feral driver routes through solve_sparse_refined (Richardson
      refinement with stagnation exit) in src/bin/bench_one_matrix.rs.
    • external_benchmarks/mumps_oracle/mumps_bench.F sets
      ICNTL(10) = 2 (max two iterative-refinement steps) — MUMPS
      default is 0 (no refinement).
    • external_benchmarks/hsl_bench/hsl_bench.c wraps
      ma97_solve_d in a 4-step Richardson loop because MA97 has no
      native residual-based refinement entry for non-singular systems.
      Configuration is documented in the generated report's Solvers
      table.

Performance

  • Dense fast-path (dense_fast_factor) now runs Knight-Ruiz ∞-norm
    scaling directly on the column-major lower-triangle buffer
    produced by to_dense_into, removing the row_idx[k]
    indirection that dominated wall time on small-dense matrices.
    Routing: ScalingStrategy::Auto and InfNorm go through the
    dense KR; Mc64Symmetric, Identity, External are honored
    via the unchanged sparse path. Bit-exact with the sparse KR on
    every fast-path-gate matrix (should_use_dense_fast_path
    matrix is small enough that every column-major slot maps 1:1
    onto a CSC entry or a known-zero, and (d_i · 0 · d_j) = 0
    is a no-op in the max-reduction). Targets the
    dev/results/lever-d3/stage1-stage2-2026-04-19.md §1 finding
    that compute_scaling was 82% (34 of 41 µs) of dense-path
    wall time on TRO3X3_0013.
  • Pooled local_contribs per rayon worker inside FactorWorkspace,
    removing a per-task Vec<Option<ContribBlock>> of length
    n_snodes from the parallel driver. Decisive on cont-201
    (sequential –34%, parallel-at-T=8 –10%); also helps
    bratu3d (–6% / –5%). Bit-exact. See dev/decisions.md
    2026-05-12 and dev/sessions/2026-05-12-01.md.
  • Skip the upper-triangle zero on pooled frontal buffer reuse
    (SymmetricMatrix::from_pooled_buf). Bit-exact; 5–10% sequential
    wall reduction across mid-size matrices.

Investigated

  • Parallel multifrontal driver lock contention (T=4) — falsified.
    Added opt-in AtomicLockStats telemetry to NumericParams with
    per-task lock wait/hold counters and eight per-phase wall-time
    counters wrapping the sequential prologue/epilogue. cont-201's
    previously-reported residual headroom is sequential symbolic
    factorize
    (157 ms of a 214 ms single-shot wall), not mutex
    contention (worst-case 3.4% of body time on cont-201, 0.02% on
    c-big). On the cached-symbolic path (production / pounce-IPM
    regime) cont-201 wall drops 214 → 56 ms with body_frac jumping
    0.15× → 0.55×; remaining 1.5× headroom is inside the rayon::scope,
    not at lock sites. Full analysis in
    dev/debugging/2026-05-12-cont201-cached-headroom.md.

  • Parallel driver within-scope localization (iteration 2) —
    rayon idle dominates. Added task_wall_ns (whole-closure
    bracket) and ws_lock_wait_ns (per-worker workspace mutex
    wait) to AtomicLockStats. Derived rayon_idle = scope·T − task_wall_agg quantifies the parallelism deficit attributable
    to etree dependencies. cont-201 cached: rayon_idle = 12.3 ms/T
    (78% of the gap) vs locks 1.7 ms/T (10%) and ctrl-flow 1.5
    ms/T (10%). c-big at T=4 is essentially sequential (74%
    rayon-idle capacity, 1.04× speedup vs body_agg). Conclusion:
    assembly-tree parallelism is exhausted on these matrices;
    within-supernode parallelism (panel-BK / threaded dense
    kernels) is the only remaining axis. Closes the cont-201
    assembly-tree investigation. See iteration 2 in
    dev/debugging/2026-05-12-cont201-cached-headroom.md.

  • Scaling cache verification (iteration 3) — compute_scaling_with_cache
    works as designed. Added solver_scaling_phase_split test
    (#[ignore]) that loads the corpus and times
    pick_scaling_strategy + compute_scaling_with_cache(cache=None)

    • reorder gather. c-big picks Mc64Symmetric and the no-cache
      path takes 2.3 seconds (full Hungarian); the cached path
      in production takes 2.4 ms — 1000× speedup, cache hits.
      cont-201 and bcsstk38 pick InfNorm, which is values-dependent
      Knight-Ruiz iteration (~4 ms per call) and is not cacheable
      across IPM iterations. The 3.95 ms scaling slice on cached
      cont-201 is fundamental per-factor work, not a missed cache.
      Closes scaling probe from session 2026-05-12-02 "Next session
      should #2".
  • Issue #5 (MSS1 BK inertia non-monotone under δ_w·I): triage
    complete, closed on the feral side. Landed a reproducer test

    • zero_tol/pivot_threshold sweep diagnostics in
      src/numeric/factorize.rs::tests. Empirically demonstrated
      that no in-kernel magnitude-floor lever cures the wandering;
      cross-checked MUMPS 5.8.2 and MA57 (via Ipopt's wrapper) and
      confirmed neither implements eigenvalue-aware 2×2 splitting.
      Recommended fix is upstream (caller-side δ_c bump matching
      Ipopt's PerturbForSingularity). Full analysis in
      dev/research/issue-5-mss1-inertia-monotonicity.md §9.