
Inline quintic extension#197

Merged
TomWambsgans merged 2 commits into leanEthereum:main from Barnadrot:inline-quintic-extension
Apr 17, 2026

Conversation

Contributor

@Barnadrot Barnadrot commented Apr 17, 2026

perf(quintic-extension): force-inline quintic field arithmetic, ~3.6% faster xmss_leaf_1400sigs on Zen 4

Summary

Two stacked #[inline(always)] patches on the quintic extension field
arithmetic, targeting the compiler's inlining cost model for large generic
functions. LLVM's default heuristic was declining to inline the
monomorphized quintic_mul (which expands to 5 dot_product::<5> calls,
~80 LLVM IR instructions) and related functions, causing function-call
overhead on every field multiplication in the sumcheck/GKR/WHIR hot paths.

Net result on AMD EPYC Genoa (c7a.2xlarge, AVX-512 active): ~3.6% faster
xmss_leaf_1400sigs at 1400 XMSS signatures, reproducible across runs,
both changes confirmed by revert-A/B.

Diff shape

 koala-bear/src/quintic_extension/extension.rs        |  4 ++--
 koala-bear/src/quintic_extension/packed_extension.rs  |  8 ++++----
 koala-bear/src/quintic_extension/packing.rs           |  8 ++++----
 3 files changed, 10 insertions(+), 10 deletions(-)

All changes are annotation-only (#[inline] -> #[inline(always)]).
No algorithmic, behavioral, or API changes.

Changes

(a) quintic_mul + packed Mul impls (iter 9: -2.38%)

extension.rs, packed_extension.rs

The generic quintic_mul function (5 dot products of 5 packed elements)
and the PackedQuinticExtensionField Mul<Self> + Mul<QuinticExtensionField>
impls were marked #[inline]. When monomorphized for PackedMontyField31AVX512,
the function body is large enough (~80 IR instructions from the 5 inlined
dot_product::<5> calls) that LLVM's cost model declined to inline it.

Each call-site paid ~5 cycles of function-call overhead (register push/pop,
call, and return). With quintic_mul called millions of times per proof
(every extension-field multiplication in every sumcheck round, GKR layer,
and WHIR commitment), this overhead accumulated to ~2.4% of total runtime.

Changed to #[inline(always)] on:

  • quintic_mul (the generic function in extension.rs)
  • Mul<Self> for PackedQuinticExtensionField (packed x packed)
  • Mul<QuinticExtensionField> for PackedQuinticExtensionField (packed x scalar)

Measured: -2.38%, p = 0.0, revert-A/B confirmed.

(b) quintic_square + quintic_mul_packed + MulAssign (iter 19: -1.25%)

extension.rs, packing.rs, packed_extension.rs

Same pattern applied to additional multiplication-related functions:

  • quintic_square (used by every square() call; has 16 multiplications
    when monomorphized)
  • All platform-specific quintic_mul_packed variants (AVX-512, AVX2, NEON,
    generic fallback — the scalar quintic multiplication path using
    dot_product_2)
  • MulAssign<Self> and MulAssign<QuinticExtensionField> for
    PackedQuinticExtensionField (the *= eq_val pattern in
    compute_sumcheck_terms)

Measured: -1.25%, p = 0.0, revert-A/B confirmed.

I-cache budget boundary

Extensive testing established a precise I-cache budget for forced inlining:

| Functions force-inlined | Delta | Status |
| --- | --- | --- |
| quintic_mul + packed Mul (3 fns) | -2.38% | KEEP |
| + quintic_square + quintic_mul_packed + MulAssign (6 more) | -1.25% | KEEP |
| + Mul&lt;PF&gt; + MulAssign&lt;PF&gt; (2 more) | +0.30% | Regression |
| + Add/Sub/vector_add/vector_sub (4 more) | -0.29% | Regression |

Beyond 9 force-inlined functions, I-cache pressure from the expanded code
negates the call-overhead savings. The two keeps represent the optimal set.

Validation

  • Correctness: correctness.sh (KoalaBear unit tests + full WHIR
    proof integration test) passes on each change.
  • Platform: AMD EPYC Genoa (c7a.2xlarge, Zen 4, AVX-512), KVM
    virtualized.
  • Toolchain: stable Rust with RUSTFLAGS="-C target-cpu=native".
  • Measurement: paired wall-clock A/B via eval_paired.sh (builds
    both binaries with cargo clean --release between, asserts distinct
    md5 hashes, burn-in + paired loop). Both keeps confirmed by
    eval_revert_ab.sh (temporary revert reproduces >= 50% of claimed
    improvement).

Benchmark results

| Iter | Change | Delta | p | Revert-A/B |
| --- | --- | --- | --- | --- |
| 9 | quintic_mul + packed Mul #[inline(always)] | -2.38% | 0.0 | PASS |
| 19 | + quintic_square + quintic_mul_packed + MulAssign | -1.25% | 0.0 | PASS |
| | Combined | ~-3.6% | | |

Baseline after both keeps: 5.17s median on xmss_leaf_1400sigs.
Pre-optimization baseline: 5.36s +/- 0.3s (calibrated).

Key architectural insight

Scalar quintic_mul_packed (AVX-512) packs all 25 products of a 5x5
quintic multiplication into 2 wide SIMD operations via dot_product_2,
achieving 2 packed base muls per quintic multiplication. The packed
quintic_mul (operating on 16-wide packed extension values) uses
dot_product::<5> called 5 times, requiring 15 packed base muls per
quintic multiplication. This 7.5x SIMD efficiency gap explains why
eval_eq_basic's scalar-then-transpose approach outperforms direct packed
computation, and why changes to the eq polynomial structure always regress
wall-clock despite reducing instruction count.

How to reproduce

cd ~/zk-autoresearch/leanMultisig-bench
RUSTFLAGS="-C target-cpu=native" cargo bench --bench xmss_leaf -- xmss_leaf_1400sigs \
  --measurement-time 60 --sample-size 10

Notes for reviewers

  • All changes are pure annotation changes (#[inline] -> #[inline(always)]).
    Zero behavioral difference. No algorithmic, API, or semantic changes.
  • Safe on all platforms. #[inline(always)] affects codegen, not correctness.
    Performance benefit validated on Zen 4 (AMD EPYC Genoa, 32 KB L1I) only;
platforms with larger L1I (e.g. Apple M-series, 192 KB) may tolerate more
inlining; platforms with similar L1I (Intel Sapphire Rapids, 32 KB) should
see comparable results. No platform will regress correctness.
  • The I-cache budget boundary (9 functions = optimal, 11+ = regression) is
    specific to xmss_leaf_1400sigs on Zen 4. A different workload or
    microarchitecture may have a different optimal set.
  • Iter 27 (GKR accumulator combining, −0.62%, p=0.0) is a real optimization
    that was below our measurement threshold. It could be included as a
    low-risk additional win if validated independently on a less noisy setup.

Related: quintic extension property tests (separate PR)

During this optimization work we found that quintic_extension/ has zero
direct unit tests anywhere in the codebase — the only coverage is implicit
through the WHIR end-to-end proof test. We wrote 15 algebraic property
tests covering:

Scalar arithmetic (10 tests): commutativity, associativity,
distributivity, multiplicative identity, add/sub roundtrip, double
negation, square == self·self, inverse roundtrip, zero not invertible,
base-field embedding preservation.

Packed ↔ scalar consistency (5 tests): packed add/sub/mul/base-mul
match scalar lane-by-lane, pack-unpack roundtrip.

Each test runs 200 randomized iterations with a seeded RNG. Total runtime
< 1 s under --release. These tests may be more appropriate for Plonky3
upstream (since quintic_extension originates there) — happy to submit
to whichever repo makes sense. Available on branch
feat/quintic-extension-tests at
https://github.com/Barnadrot/leanMultisig.

Experimentally ruled out (29 iterations total)

Details

Structural changes to eval_eq_basic (4 variants, all regress +7-9%)

  • eval_eq_4 base case (#[inline(always)]): -7.4% iai improvement,
    +8.1% wall-clock regression. I-cache pressure from inflating the
    recursive function body.
  • eval_eq_4 (#[inline(never)]): -5.7% iai, +9.2% wall-clock.
    Separate function avoids I-cache bloat but the 4-variable base case
    has worse ILP than the recursive 3-variable approach (longer dependency
    chain, less out-of-order overlap).
  • 2-var-per-level recursion: -1.1% iai, +7.3% wall-clock. Same
    ILP degradation.
  • Direct packed eq computation (bypass scalar+transpose): +9.6%.
    Packed quintic_mul (15 packed base muls via dot_product::<5>) is
    7.5x less SIMD-efficient than scalar quintic_mul_packed (2 packed base
    muls via dot_product_2).

Conclusion: eval_eq_basic's recursive structure is at a wall-clock local
optimum for Zen 4. Any change that reduces instruction count causes
wall-clock regression through ILP/cache/branch-prediction degradation.

Compiler micro-optimizations (all 0% iai delta)

LLVM already handles: CSE of redundant multiplications in quintic_square,
LICM of loop-invariant broadcasts, constant propagation through match arms
(eliminating assert_eq in base cases), dead code elimination, and
pre-broadcasting of fold factors.

GKR quotient accumulator combining (iter 27: -0.62%)

Combining single*alpha + double before eq_lo multiplication in
compute_gkr_quotient_sumcheck_polynomial_split_eq reduces 4 accumulators
to 2, saving 2 eq_lo multiplications per b_lo block. Real -0.62% (p=0.0)
but below the 1.0% wall-clock-only threshold. Applying the same change to
fold_and_compute_gkr_quotient_split_eq caused +10% regression (disrupts
the complex par_chunks_mut optimization).

Algorithmic approaches analyzed and rejected

  • Karatsuba quintic_mul: 25->15 packed muls but +30 packed adds.
    Port-balance analysis on Zen 4: exactly equal throughput (47.5 cycles).
  • Deferred fold for base x extension products: Saves ~100 packed muls
    per element but affects only 1 round per GKR layer (~0.03% e2e).
  • Delayed u128 reduction for product_sumcheck round 2: y-fold dominates
    at 300 cycles/element; x-side savings are ~0.003% e2e.
  • Packed sumcheck (ePrint 2025/719): 2.78x reported but requires major
    protocol restructuring.
  • 3-way split_eq: Adds 1 extra extension multiplication per inner
    element; net negative.
  • Batched inverse for GKR fractions: No division in the sumcheck inner
    loop (fractions cleared by cross-multiplication).
  • Cross-caller eq precomputation: Each sumcheck instance uses different
    random points; no sharing possible.

  LLVM was not force-inlining quintic_mul despite #[inline] — the monomorphized
  body is large enough that LLVM's cost heuristic declined. Each call-site paid
  ~5 cycles of function-call overhead. With quintic_mul called millions of times
  per proof, this accumulated to ~2.4% of total runtime.

  Zen 4 (c7a.2xlarge): -2.38% on xmss_leaf_1400sigs, p=0.0, revert-A/B confirmed.
…sign

  Extends the previous commit's inlining pattern to additional multiplication-
  related functions: quintic_square, all platform-specific quintic_mul_packed
  variants (AVX-512/AVX2/NEON/fallback), and MulAssign<Self>/MulAssign<QEF>.

  Testing established the I-cache budget boundary for forced inlining on Zen 4:
  these 9 functions are the optimal set. Inlining more (e.g. Add/Sub/Neg) causes
  regression from expanded code size.

  Zen 4 (c7a.2xlarge): additional -1.25% on xmss_leaf_1400sigs, p=0.0,
  revert-A/B confirmed. Combined with previous commit: ~-3.6% total.
@Barnadrot Barnadrot force-pushed the inline-quintic-extension branch from 7d9c770 to c02f10e Compare April 17, 2026 20:28
@TomWambsgans TomWambsgans merged commit 44cb7be into leanEthereum:main Apr 17, 2026