Skip to content

v0.3.0 — Feral C ABI for Ipopt + 3-way NLP comparison

Choose a tag to compare

@jkitchin jkitchin released this 13 May 17:15
· 423 commits to main since this release

Added — Feral C ABI for Ipopt linkage (feral::capi)

New pub mod capi (src/capi.rs) exposes a minimal C ABI surface
matching Ipopt's SparseSymLinearSolverInterface plug-in shape:
feral_new, feral_free, feral_set_structure, feral_values_ptr,
feral_factor, feral_solve, feral_num_neg. Matrix format is
Ipopt's CSR_Format_0_Offset (upper-triangle CSR, 0-based) which is
byte-identical to feral's lower-triangle CSC. Status codes mirror
Ipopt's ESymSolverStatus enum.

Cargo.toml adds staticlib to crate-type so the ABI can be
linked into the C++ Ipopt build via the feral-ipopt-shim/ patch
(opt-in for downstream Ipopt builders; pure-Rust consumers continue
to use the rlib). See dev/research/feral-ipopt-c-shim.md and
dev/plans/feral-ipopt-shim.md for the design.

Added — Ipopt 3-way NLP comparison harness

external_benchmarks/nlp_comparison/ runs the Ipopt
ScalableProblems suite against three Ipopt 3.14.20 binaries
(build-mumps, build-ma57, build-feral), each linked to a
single sparse direct solver. 35 problems × 3 solvers; see
REPORT.md for the 2026-05-13 sweep. MUMPS 35/35 optimal, MA57
34/35, feral 34/35; geomean over triple-optimal subset: MUMPS
139 ms, feral 158 ms, MA57 162 ms. Generates results.json and a
Markdown report. Logs/out/RHS blobs gitignored; only the harness +
report are tracked.

Added — MA57 oracle + 4-way cross-solver comparison

external_benchmarks/ma57_oracle/ builds a CoinHSL MA57 benchmark
binary alongside the existing MUMPS/SSIDS oracles.
external_benchmarks/comparison/ is extended from 3-way to 4-way
(feral + MUMPS + SSIDS + MA57), with new run.py / aggregate.py
/ report.py wiring MA57 into the per-matrix sample comparison and
REPORT.md summary.

Added — Issue #9 Steps 2 + 3: 32×32 register-resident kernel wired into production

Step 3 (SIMD body). update_1x1_block32 in
src/dense/block_ldlt32.rs tiles trailing destination columns in
groups of four through schur_panel_minus_nofma_strided_quad
(n_elim=1), with a trailing _dual for the 2-column tail and a final
axpy_minus_unroll4_nofma for the 1-column tail. Each tile packs 4
dst columns per pulp dispatch sharing one source-column load — the
intended Phase 2.4.3 register-resident pattern. Per-element output is
byte-identical to the scalar reference and to factor::do_1x1_update
(verified by 4 bit-parity unit tests at p=0, p=5, p=30, zero-pivot).

Step 2 (dispatch wiring). do_1x1_update and do_2x2_update
(factor.rs) gain an n == 32 fast-path delegating to
update_1x1_block32 / update_2x2_block32.
factor_frontal_blocked_in_place_with_scratch dispatches
nrow==ncol==32 fronts to factor_block32 (which delegates to
factor_frontal); the eager unblocked BK loop drives the SIMD update
via the fast-paths. This bypasses lblt_panel_frontal for full
32×32 fronts because, at bs==ncol==32, the panel's
apply_blocked_schur_panel quad-dispatch path is unreachable
(j_start = k + n_elim == nrow skips the batched trailing update),
so all trailing-update FLOPs are done by single-column peek-ahead
axpys. The eager-update path issues quad dispatches for every
trailing tile of 4 columns instead.

Bench: median small p90 1.33 (was 1.36), median medium p90 1.74
(was 1.78) across 3 runs. Modest but consistent improvement at the
better edge of the noise band. Inertia 154428/154481, byte-identical
to baseline.

Step 4 (rank-2 SIMD body) remains deferred — the quad kernel's
per-q sequential rounding chain is 1-ULP-divergent from
axpy2_minus_unroll4_nofma's fused chain, so a custom
4-dst-column 2-src pulp dispatch is required. 2×2 pivots are rare on
the bench corpus (no measurable bench impact expected); tracked as
follow-up. Step 5 (cross-arch CI gate) also tracked as
follow-up.

Changed — Per-supernode fixed-overhead reduction (#13, Phases A + B + C)

Phase C (single-slot contrib pool). New pub contrib_pool: Option<Vec<f64>> field on FactorScratch. The multifrontal driver puts
the child's ContribBlock.data into the slot after extend_add consumes
it; the kernel takes at extract time, clears+resizes to cdim*cdim,
and writes. When the slot is empty (cold scratch, or take outpaces put),
the kernel falls back to a fresh Vec allocation — bit-identical to the
pre-Phase-C path. An initial multi-slot Vec<Vec<f64>> variant was
abandoned: it preserved bit-parity but regressed bench p90 by ~+0.19
(small) / ~+0.30 (medium) in 4 consecutive runs (growable-indirection
bookkeeping cost more than the malloc/free pairs it avoided). The
single-slot variant is bench-neutral vs Phase A+B (small p90 1.41,
medium p90 1.83–1.85) and bit-parity is preserved across all four
parity cases including the new (d) pool-hot pre-seeded case.

Phase C contributes no measurable bench movement on this corpus, but
the infrastructure is correct and ready: if a future kernel change
makes the contrib allocation a bigger fraction of factor cost, the
recycle path engages automatically. Final issue #13 standing:
criterion #1 (ns/sup reduction) MET, criterion #2 (bench p90 small <
1.30 OR medium < 1.60) unreachable via allocation pooling on this
corpus
, criterion #3 (no correctness regression) MET, criterion #4
(bit-exact blocked_ldlt) MET. Per-front kernel cost (32×32 SIMD,
issue #9) is the next plausible lever for the bench-ratio gap.

Changed — Per-supernode fixed-overhead reduction (#13, Phases A + B background)

Phase A (FactorScratch pool). New FactorScratch { subdiag, d_panel }
struct in src/dense/factor.rs pools the two internal-only working buffers
that factor_frontal_blocked_in_place previously allocated per supernode.
New entry point factor_frontal_blocked_in_place_with_scratch accepts
&mut FactorScratch; the existing function is now a thin wrapper that
allocates a fresh scratch and delegates. FactorWorkspace carries a
factor_scratch field that the three hot-path call sites in
src/numeric/factorize.rs (D.3 dense fast path, factor_one_supernode,
factor_one_small_leaf) thread through. The scratch is safe to re-warm
across different (nrow, bs) shapes — the kernel prologue clears and
resizes unconditionally. Bit-parity gated by
tests/factor_scratch_parity.rs (7-case size sweep + 6-case repeated-
calls regression) plus the 19 byte-identity tests/blocked_ldlt.rs
integration tests.

Phase B (extend_add direct writes). The multifrontal extend_add
in src/numeric/factorize.rs now bypasses SymmetricMatrix::set/get
and writes directly into frontal.data using the lower-triangle column-
major linear index. Per-cell work drops by one indirection, one branch,
and one redundant i >= j sanity check, with the symmetric-storage
canonicalisation preserved at the caller.

Diagnostic (cargo run --bin diag_supernode_cost --release): Phase A
delivered −16 % to −54 % ns/sup on the CRESC100 / ACOPR30 / HAIFAM /
KIRBY2 cluster (issue #13 acceptance criterion #1 MET). Phase B is
within run-to-run noise of Phase A on ns/sup, which is expected
because extend_add is a child-driven post-factor cost rather than
per-supernode.

Bench (cargo run --bin bench --release): dense small-frontal p90
1.33–1.37 and medium p90 1.75–1.78 (vs issue baseline 1.33 / 1.70).
Issue #13 acceptance criterion #2 (small p90 < 1.30 OR medium p90 <
1.60) NOT met by Phases A+B alone. 154428/154481 inertia match
preserved exactly. Phase C (return-struct pooling for l, d_diag,
d_subdiag, contrib, perm, perm_inv) is deferred to a separate
session; design choice (ABI break vs take-into vs with_capacity hints)
is unresolved.

Added — BLAS-3 quad-column trailing-update kernel (#9, parked on #13)

schur_panel_minus_nofma_strided_quad in src/dense/schur_kernel.rs
processes four trailing columns per pulp dispatch, halving src memory
traffic vs the existing dual kernel. Wired into
apply_blocked_schur_panel — every front with ≥ 4 trailing columns
now routes through quad → dual → single fall-through. Bit-exact per
column with four sequential single-column dispatches (176-config
parity sweep + 19 byte-identical blocked_ldlt integration tests).
Zero corpus regression: dense small-frontal p90 1.33 (target ≤ 2.0
PASS), medium p90 1.70 (target ≤ 3.0 PASS); 154428/154481 inertia
match, 99.8 % residual pass.

No measurable headline-throughput win on the current corpus — the
2026-04-27 CHAINWOO_0000 root that motivated the work (1984 × 32) no
longer exists on the current build (max actual nrow = 18 after METIS-
ND on this build). The new bottleneck is per-supernode fixed
overhead, tracked as issue #13. Kernel retained as parked
infrastructure: it re-engages automatically when fronts grow tall-
skinny again. See dev/decisions.md 2026-05-12 (c).

Added — block_ldlt32 scaffold and trailing-update primitives (#9)

New module src/dense/block_ldlt32.rs with BLOCK_SIZE = 32,
factor_block32 stub (delegates to factor_frontal pending the
const-generic driver port), update_1x1_block32, update_2x2_block32
scalar primitives, and a bit-parity test harness diffing factors by
to_bits(). Signatures match the planned pulp dispatch contract; the
SIMD body swap is a surgical follow-up gated on issue #13.