Release v0.2.0 · jkitchin/feral

Fixed — Honest `resolved_method` and consistent `Auto` routing (#3)

SymbolicFactorization.resolved_method now reflects what the symbolic
pipeline actually ran rather than what the caller requested. Two
behavior changes:

ScotchND silent fallback is surfaced. When SCOTCH's nested-
dissection recursion produces no separator for the entire graph
(bordered-KKT shapes such as PoissonControl trigger this), the
driver falls back to AMD via amd_leaf for every recursion node.
Previously resolved_method still reported ScotchND while the
permutation was bit-identical to AMD's. It now reports Amd. The
recovery itself is unchanged — only its visibility is fixed.
OrderingMethod::Auto is consistent with the no-arg default.
Auto resolution now happens against the original matrix's pattern
before any LdltCompress preprocessor reshapes the graph, and the
residual branch delegates to pick_default_method (the rule used
by the no-arg symbolic_factorize). Previously Auto could pick a
different concrete method than the default rule on the same
matrix, depending on whether compression triggered. Auto is now a
strict superset: same answer as the default plus the two extra
shape-bakeoff branches (n>100_000 → ScotchND, n<10_000 →
KahipND).

Reported by independent triage on K=158 PoissonControl benchmarks.

Fixed — `CscMatrix::from_triplets` rejects upper-triangle entries (#4)

CscMatrix::from_triplets and CscMatrix::validate now return
FeralError::InvalidInput when any triplet has row < col. Previously
upper-triangle entries were silently accepted and routed through
sort_and_sum_duplicates, producing a CscMatrix whose row indices
violated the documented "lower triangle only" invariant. Downstream
consumers (e.g. symmetric_pattern) assume lower-triangle storage, so
the same symmetric matrix described with upper- vs lower-triangle
triplets produced different solve results. The error message identifies
the offending triplet by index and (row, col). Reported by @janosh.

Changed

Solver now defaults to the rayon-parallel multifrontal driver
(factorize_multifrontal_parallel_with_workspace). The driver is
bit-exact with the sequential supernodal path on a per-supernode
basis and falls through to the sequential path when the supernode
count is below N_PAR_MIN = 32, so small-problem latency is not
affected. Override with Solver::new().with_parallel(false).
Closes #7. Motivation: pounce's marine_1600 / pinene_3200
Mittelmann runs were spending all their time in sequential
factor_one_supernode even though the parallel driver was
available; this wires Solver directly to it.

Added

Solver::with_parallel(bool) — opt out of the rayon-parallel
driver (returns Self for builder chaining).
Solver::parallel() — test/diagnostic accessor for the current
flag value.
SymmetricMatrix::from_pooled_buf(n, buf) constructor that zeros
only the lower triangle when reusing a pooled buffer; cuts the
dead upper-triangle memset out of factor_one_supernode's
per-supernode hot path. See dev/decisions.md 2026-05-12.
compute_infnorm_dense(&SymmetricMatrix) in src/scaling/infnorm.rs
— dense-native Knight-Ruiz iteration for the D.3/D.4 dense
fast-path; the sparse compute_infnorm remains the path for the
multifrontal driver.
Solver::inertia() — Option<&Inertia> accessor returning the
full inertia of the last successful factor. Complements the
Ipopt-shaped num_negative_eigenvalues (which panics if no factor
is stored) for callers that prefer to branch on None. Used by
the cross-solver bench harness.

Tooling

Cross-solver comparison harness under
external_benchmarks/comparison/ (run.py, aggregate.py, report.py)
measures feral against MUMPS 5.8.2 and HSL MA97 2.8.1 on a sampled
SuiteSparse subset and emits REPORT.md. Each solver is configured
to its production-quality (refinement-on) settings so the residual
comparison is apples-to-apples:
- feral driver routes through solve_sparse_refined (Richardson
  refinement with stagnation exit) in src/bin/bench_one_matrix.rs.
- external_benchmarks/mumps_oracle/mumps_bench.F sets
  ICNTL(10) = 2 (max two iterative-refinement steps) — MUMPS
  default is 0 (no refinement).
- external_benchmarks/hsl_bench/hsl_bench.c wraps
  ma97_solve_d in a 4-step Richardson loop because MA97 has no
  native residual-based refinement entry for non-singular systems.
  Configuration is documented in the generated report's Solvers
  table.

Performance

Dense fast-path (dense_fast_factor) now runs Knight-Ruiz ∞-norm
scaling directly on the column-major lower-triangle buffer
produced by to_dense_into, removing the row_idx[k]
indirection that dominated wall time on small-dense matrices.
Routing: ScalingStrategy::Auto and InfNorm go through the
dense KR; Mc64Symmetric, Identity, External are honored
via the unchanged sparse path. Bit-exact with the sparse KR on
every fast-path-gate matrix (should_use_dense_fast_path ⇒
matrix is small enough that every column-major slot maps 1:1
onto a CSC entry or a known-zero, and (d_i · 0 · d_j) = 0
is a no-op in the max-reduction). Targets the
dev/results/lever-d3/stage1-stage2-2026-04-19.md §1 finding
that compute_scaling was 82% (34 of 41 µs) of dense-path
wall time on TRO3X3_0013.
Pooled local_contribs per rayon worker inside FactorWorkspace,
removing a per-task Vec<Option<ContribBlock>> of length
n_snodes from the parallel driver. Decisive on cont-201
(sequential –34%, parallel-at-T=8 –10%); also helps
bratu3d (–6% / –5%). Bit-exact. See dev/decisions.md
2026-05-12 and dev/sessions/2026-05-12-01.md.
Skip the upper-triangle zero on pooled frontal buffer reuse
(SymmetricMatrix::from_pooled_buf). Bit-exact; 5–10% sequential
wall reduction across mid-size matrices.

Investigated

Parallel multifrontal driver lock contention (T=4) — falsified.
Added opt-in AtomicLockStats telemetry to NumericParams with
per-task lock wait/hold counters and eight per-phase wall-time
counters wrapping the sequential prologue/epilogue. cont-201's
previously-reported residual headroom is sequential symbolic
factorize (157 ms of a 214 ms single-shot wall), not mutex
contention (worst-case 3.4% of body time on cont-201, 0.02% on
c-big). On the cached-symbolic path (production / pounce-IPM
regime) cont-201 wall drops 214 → 56 ms with body_frac jumping
0.15× → 0.55×; remaining 1.5× headroom is inside the rayon::scope,
not at lock sites. Full analysis in
dev/debugging/2026-05-12-cont201-cached-headroom.md.
Parallel driver within-scope localization (iteration 2) —
rayon idle dominates. Added task_wall_ns (whole-closure
bracket) and ws_lock_wait_ns (per-worker workspace mutex
wait) to AtomicLockStats. Derived rayon_idle = scope·T − task_wall_agg quantifies the parallelism deficit attributable
to etree dependencies. cont-201 cached: rayon_idle = 12.3 ms/T
(78% of the gap) vs locks 1.7 ms/T (10%) and ctrl-flow 1.5
ms/T (10%). c-big at T=4 is essentially sequential (74%
rayon-idle capacity, 1.04× speedup vs body_agg). Conclusion:
assembly-tree parallelism is exhausted on these matrices;
within-supernode parallelism (panel-BK / threaded dense
kernels) is the only remaining axis. Closes the cont-201
assembly-tree investigation. See iteration 2 in
dev/debugging/2026-05-12-cont201-cached-headroom.md.
Scaling cache verification (iteration 3) — compute_scaling_with_cache
works as designed. Added solver_scaling_phase_split test
(#[ignore]) that loads the corpus and times
pick_scaling_strategy + compute_scaling_with_cache(cache=None)
- reorder gather. c-big picks Mc64Symmetric and the no-cache
  path takes 2.3 seconds (full Hungarian); the cached path
  in production takes 2.4 ms — 1000× speedup, cache hits.
  cont-201 and bcsstk38 pick InfNorm, which is values-dependent
  Knight-Ruiz iteration (~4 ms per call) and is not cacheable
  across IPM iterations. The 3.95 ms scaling slice on cached
  cont-201 is fundamental per-factor work, not a missed cache.
  Closes scaling probe from session 2026-05-12-02 "Next session
  should #2".
Issue #5 (MSS1 BK inertia non-monotone under δ_w·I): triage
complete, closed on the feral side. Landed a reproducer test
- zero_tol/pivot_threshold sweep diagnostics in
  src/numeric/factorize.rs::tests. Empirically demonstrated
  that no in-kernel magnitude-floor lever cures the wandering;
  cross-checked MUMPS 5.8.2 and MA57 (via Ipopt's wrapper) and
  confirmed neither implements eigenvalue-aware 2×2 splitting.
  Recommended fix is upstream (caller-side δ_c bump matching
  Ipopt's PerturbForSingularity). Full analysis in
  dev/research/issue-5-mss1-inertia-monotonicity.md §9.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v0.2.0

Choose a tag to compare

Sorry, something went wrong.

Sorry, something went wrong.

Uh oh!

No results found

Fixed — Honest `resolved_method` and consistent `Auto` routing (#3)

Fixed — `CscMatrix::from_triplets` rejects upper-triangle entries (#4)

Changed

Added

Tooling

Performance

Investigated

Contributors

Uh oh!

v0.2.0

Fixed — Honest resolved_method and consistent Auto routing (#3)

Fixed — CscMatrix::from_triplets rejects upper-triangle entries (#4)

Changed

Added

Tooling

Performance

Investigated

Contributors

Uh oh!

Fixed — Honest `resolved_method` and consistent `Auto` routing (#3)

Fixed — `CscMatrix::from_triplets` rejects upper-triangle entries (#4)