Skip to content

v0.2.0

Choose a tag to compare

@jkitchin jkitchin released this 12 May 23:47
· 452 commits to main since this release

Fixed — Honest resolved_method and consistent Auto routing (#3)

SymbolicFactorization.resolved_method now reflects what the symbolic
pipeline actually ran rather than what the caller requested. Two
behavior changes:

  • ScotchND silent fallback is surfaced. When SCOTCH's nested-
    dissection recursion produces no separator for the entire graph
    (bordered-KKT shapes such as PoissonControl trigger this), the
    driver falls back to AMD via amd_leaf for every recursion node.
    Previously resolved_method still reported ScotchND while the
    permutation was bit-identical to AMD's. It now reports Amd. The
    recovery itself is unchanged — only its visibility is fixed.
  • OrderingMethod::Auto is consistent with the no-arg default.
    Auto resolution now happens against the original matrix's pattern
    before any LdltCompress preprocessor reshapes the graph, and the
    residual branch delegates to pick_default_method (the rule used
    by the no-arg symbolic_factorize). Previously Auto could pick a
    different concrete method than the default rule on the same
    matrix, depending on whether compression triggered. Auto is now a
    strict superset: same answer as the default plus the two extra
    shape-bakeoff branches (n>100_000 → ScotchND, n<10_000
    KahipND).

Reported by independent triage on K=158 PoissonControl benchmarks.

Fixed — CscMatrix::from_triplets rejects upper-triangle entries (#4)

CscMatrix::from_triplets and CscMatrix::validate now return
FeralError::InvalidInput when any triplet has row < col. Previously
upper-triangle entries were silently accepted and routed through
sort_and_sum_duplicates, producing a CscMatrix whose row indices
violated the documented "lower triangle only" invariant. Downstream
consumers (e.g. symmetric_pattern) assume lower-triangle storage, so
the same symmetric matrix described with upper- vs lower-triangle
triplets produced different solve results. The error message identifies
the offending triplet by index and (row, col). Reported by @janosh.

Changed

  • Solver now defaults to the rayon-parallel multifrontal driver
    (factorize_multifrontal_parallel_with_workspace). The driver is
    bit-exact with the sequential supernodal path on a per-supernode
    basis and falls through to the sequential path when the supernode
    count is below N_PAR_MIN = 32, so small-problem latency is not
    affected. Override with Solver::new().with_parallel(false).
    Closes #7. Motivation: pounce's marine_1600 / pinene_3200
    Mittelmann runs were spending all their time in sequential
    factor_one_supernode even though the parallel driver was
    available; this wires Solver directly to it.

Added

  • Solver::with_parallel(bool) — opt out of the rayon-parallel
    driver (returns Self for builder chaining).
  • Solver::parallel() — test/diagnostic accessor for the current
    flag value.
  • SymmetricMatrix::from_pooled_buf(n, buf) constructor that zeros
    only the lower triangle when reusing a pooled buffer; cuts the
    dead upper-triangle memset out of factor_one_supernode's
    per-supernode hot path. See dev/decisions.md 2026-05-12.
  • compute_infnorm_dense(&SymmetricMatrix) in src/scaling/infnorm.rs
    — dense-native Knight-Ruiz iteration for the D.3/D.4 dense
    fast-path; the sparse compute_infnorm remains the path for the
    multifrontal driver.
  • Solver::inertia()Option<&Inertia> accessor returning the
    full inertia of the last successful factor. Complements the
    Ipopt-shaped num_negative_eigenvalues (which panics if no factor
    is stored) for callers that prefer to branch on None. Used by
    the cross-solver bench harness.

Tooling

  • Cross-solver comparison harness under
    external_benchmarks/comparison/ (run.py, aggregate.py, report.py)
    measures feral against MUMPS 5.8.2 and HSL MA97 2.8.1 on a sampled
    SuiteSparse subset and emits REPORT.md. Each solver is configured
    to its production-quality (refinement-on) settings so the residual
    comparison is apples-to-apples:
    • feral driver routes through solve_sparse_refined (Richardson
      refinement with stagnation exit) in src/bin/bench_one_matrix.rs.
    • external_benchmarks/mumps_oracle/mumps_bench.F sets
      ICNTL(10) = 2 (max two iterative-refinement steps) — MUMPS
      default is 0 (no refinement).
    • external_benchmarks/hsl_bench/hsl_bench.c wraps
      ma97_solve_d in a 4-step Richardson loop because MA97 has no
      native residual-based refinement entry for non-singular systems.
      Configuration is documented in the generated report's Solvers
      table.

Performance

  • Dense fast-path (dense_fast_factor) now runs Knight-Ruiz ∞-norm
    scaling directly on the column-major lower-triangle buffer
    produced by to_dense_into, removing the row_idx[k]
    indirection that dominated wall time on small-dense matrices.
    Routing: ScalingStrategy::Auto and InfNorm go through the
    dense KR; Mc64Symmetric, Identity, External are honored
    via the unchanged sparse path. Bit-exact with the sparse KR on
    every fast-path-gate matrix (should_use_dense_fast_path
    matrix is small enough that every column-major slot maps 1:1
    onto a CSC entry or a known-zero, and (d_i · 0 · d_j) = 0
    is a no-op in the max-reduction). Targets the
    dev/results/lever-d3/stage1-stage2-2026-04-19.md §1 finding
    that compute_scaling was 82% (34 of 41 µs) of dense-path
    wall time on TRO3X3_0013.
  • Pooled local_contribs per rayon worker inside FactorWorkspace,
    removing a per-task Vec<Option<ContribBlock>> of length
    n_snodes from the parallel driver. Decisive on cont-201
    (sequential –34%, parallel-at-T=8 –10%); also helps
    bratu3d (–6% / –5%). Bit-exact. See dev/decisions.md
    2026-05-12 and dev/sessions/2026-05-12-01.md.
  • Skip the upper-triangle zero on pooled frontal buffer reuse
    (SymmetricMatrix::from_pooled_buf). Bit-exact; 5–10% sequential
    wall reduction across mid-size matrices.

Investigated

  • Parallel multifrontal driver lock contention (T=4) — falsified.
    Added opt-in AtomicLockStats telemetry to NumericParams with
    per-task lock wait/hold counters and eight per-phase wall-time
    counters wrapping the sequential prologue/epilogue. cont-201's
    previously-reported residual headroom is sequential symbolic
    factorize
    (157 ms of a 214 ms single-shot wall), not mutex
    contention (worst-case 3.4% of body time on cont-201, 0.02% on
    c-big). On the cached-symbolic path (production / pounce-IPM
    regime) cont-201 wall drops 214 → 56 ms with body_frac jumping
    0.15× → 0.55×; remaining 1.5× headroom is inside the rayon::scope,
    not at lock sites. Full analysis in
    dev/debugging/2026-05-12-cont201-cached-headroom.md.

  • Parallel driver within-scope localization (iteration 2) —
    rayon idle dominates. Added task_wall_ns (whole-closure
    bracket) and ws_lock_wait_ns (per-worker workspace mutex
    wait) to AtomicLockStats. Derived rayon_idle = scope·T − task_wall_agg quantifies the parallelism deficit attributable
    to etree dependencies. cont-201 cached: rayon_idle = 12.3 ms/T
    (78% of the gap) vs locks 1.7 ms/T (10%) and ctrl-flow 1.5
    ms/T (10%). c-big at T=4 is essentially sequential (74%
    rayon-idle capacity, 1.04× speedup vs body_agg). Conclusion:
    assembly-tree parallelism is exhausted on these matrices;
    within-supernode parallelism (panel-BK / threaded dense
    kernels) is the only remaining axis. Closes the cont-201
    assembly-tree investigation. See iteration 2 in
    dev/debugging/2026-05-12-cont201-cached-headroom.md.

  • Scaling cache verification (iteration 3) — compute_scaling_with_cache
    works as designed. Added solver_scaling_phase_split test
    (#[ignore]) that loads the corpus and times
    pick_scaling_strategy + compute_scaling_with_cache(cache=None)

    • reorder gather. c-big picks Mc64Symmetric and the no-cache
      path takes 2.3 seconds (full Hungarian); the cached path
      in production takes 2.4 ms — 1000× speedup, cache hits.
      cont-201 and bcsstk38 pick InfNorm, which is values-dependent
      Knight-Ruiz iteration (~4 ms per call) and is not cacheable
      across IPM iterations. The 3.95 ms scaling slice on cached
      cont-201 is fundamental per-factor work, not a missed cache.
      Closes scaling probe from session 2026-05-12-02 "Next session
      should #2".
  • Issue #5 (MSS1 BK inertia non-monotone under δ_w·I): triage
    complete, closed on the feral side. Landed a reproducer test

    • zero_tol/pivot_threshold sweep diagnostics in
      src/numeric/factorize.rs::tests. Empirically demonstrated
      that no in-kernel magnitude-floor lever cures the wandering;
      cross-checked MUMPS 5.8.2 and MA57 (via Ipopt's wrapper) and
      confirmed neither implements eigenvalue-aware 2×2 splitting.
      Recommended fix is upstream (caller-side δ_c bump matching
      Ipopt's PerturbForSingularity). Full analysis in
      dev/research/issue-5-mss1-inertia-monotonicity.md §9.