
Inline quintic extension#197

Merged
TomWambsgans merged 2 commits into leanEthereum:main from Barnadrot:inline-quintic-extension
Apr 17, 2026

Conversation

Contributor

@Barnadrot Barnadrot commented Apr 17, 2026

perf(quintic-extension): force-inline quintic field arithmetic, ~3.6% faster xmss_leaf_1400sigs on Zen 4

Summary

Two stacked #[inline(always)] patches on the quintic extension field
arithmetic, targeting the compiler's inlining cost model for large generic
functions. LLVM's default heuristic was declining to inline the
monomorphized quintic_mul (which expands to 5 dot_product::<5> calls,
~80 LLVM IR instructions) and related functions, causing function-call
overhead on every field multiplication in the sumcheck/GKR/WHIR hot paths.

Net result on AMD EPYC Genoa (c7a.2xlarge, AVX-512 active): ~3.6% faster
xmss_leaf_1400sigs at 1400 XMSS signatures, reproducible across runs,
both changes confirmed by revert-A/B.

Diff shape

 koala-bear/src/quintic_extension/extension.rs        |  4 ++--
 koala-bear/src/quintic_extension/packed_extension.rs  |  8 ++++----
 koala-bear/src/quintic_extension/packing.rs           |  8 ++++----
 3 files changed, 10 insertions(+), 10 deletions(-)

All changes are annotation-only (#[inline] -> #[inline(always)]).
No algorithmic, behavioral, or API changes.

Changes

(a) quintic_mul + packed Mul impls (iter 9: -2.38%)

extension.rs, packed_extension.rs

The generic quintic_mul function (5 dot products of 5 packed elements)
and the PackedQuinticExtensionField Mul<Self> + Mul<QuinticExtensionField>
impls were marked #[inline]. When monomorphized for PackedMontyField31AVX512,
the function body is large enough (~80 IR instructions from the 5 inlined
dot_product::<5> calls) that LLVM's cost model declined to inline it.

Each call-site paid ~5 cycles of function-call overhead (register push/pop,
call, and return). With quintic_mul called millions of times per proof
(every extension-field multiplication in every sumcheck round, GKR layer,
and WHIR commitment), this overhead accumulated to ~2.4% of total runtime.

Changed to #[inline(always)] on:

  • quintic_mul (the generic function in extension.rs)
  • Mul<Self> for PackedQuinticExtensionField (packed x packed)
  • Mul<QuinticExtensionField> for PackedQuinticExtensionField (packed x scalar)

Measured: -2.38%, p = 0.0, revert-A/B confirmed.

(b) quintic_square + quintic_mul_packed + MulAssign (iter 19: -1.25%)

extension.rs, packing.rs, packed_extension.rs

Same pattern applied to additional multiplication-related functions:

  • quintic_square (used by every square() call; has 16 multiplications
    when monomorphized)
  • All platform-specific quintic_mul_packed variants (AVX-512, AVX2, NEON,
    generic fallback — the scalar quintic multiplication path using
    dot_product_2)
  • MulAssign<Self> and MulAssign<QuinticExtensionField> for
    PackedQuinticExtensionField (the *= eq_val pattern in
    compute_sumcheck_terms)

Measured: -1.25%, p = 0.0, revert-A/B confirmed.

I-cache budget boundary

Extensive testing established a precise I-cache budget for forced inlining:

| Functions force-inlined | Delta | Status |
| --- | --- | --- |
| quintic_mul + packed Mul (3 fns) | -2.38% | KEEP |
| + quintic_square + quintic_mul_packed + MulAssign (6 more) | -1.25% | KEEP |
| + Mul&lt;PF&gt; + MulAssign&lt;PF&gt; (2 more) | +0.30% | Regression |
| + Add/Sub/vector_add/vector_sub (4 more) | -0.29% | Regression |

Beyond 9 force-inlined functions, I-cache pressure from the expanded code
negates the call-overhead savings. The two keeps represent the optimal set.

Validation

  • Correctness: correctness.sh (KoalaBear unit tests + full WHIR
    proof integration test) passes on each change.
  • Platform: AMD EPYC Genoa (c7a.2xlarge, Zen 4, AVX-512), KVM
    virtualized.
  • Toolchain: stable Rust with RUSTFLAGS="-C target-cpu=native".
  • Measurement: paired wall-clock A/B via eval_paired.sh (builds
    both binaries with cargo clean --release between, asserts distinct
    md5 hashes, burn-in + paired loop). Both keeps confirmed by
    eval_revert_ab.sh (temporary revert reproduces >= 50% of claimed
    improvement).

Benchmark results

| Iter | Change | Delta | p | Revert-A/B |
| --- | --- | --- | --- | --- |
| 9 | quintic_mul + packed Mul #[inline(always)] | -2.38% | 0.0 | PASS |
| 19 | + quintic_square + quintic_mul_packed + MulAssign | -1.25% | 0.0 | PASS |
| | Combined | ~-3.6% | | |

Baseline after both keeps: 5.17s median on xmss_leaf_1400sigs.
Pre-optimization baseline: 5.36s +/- 0.3s (calibrated).

Key architectural insight

Scalar quintic_mul_packed (AVX-512) packs all 25 products of a 5x5
quintic multiplication into 2 wide SIMD operations via dot_product_2,
achieving 2 packed base muls per quintic multiplication. The packed
quintic_mul (operating on 16-wide packed extension values) uses
dot_product::<5> called 5 times, requiring 15 packed base muls per
quintic multiplication. This 7.5x SIMD efficiency gap explains why
eval_eq_basic's scalar-then-transpose approach outperforms direct packed
computation, and why changes to the eq polynomial structure always regress
wall-clock despite reducing instruction count.

How to reproduce

cd ~/zk-autoresearch/leanMultisig-bench
RUSTFLAGS="-C target-cpu=native" cargo bench --bench xmss_leaf -- xmss_leaf_1400sigs \
  --measurement-time 60 --sample-size 10

Notes for reviewers

  • All changes are pure annotation changes (#[inline] -> #[inline(always)]).
    Zero behavioral difference. No algorithmic, API, or semantic changes.
  • Safe on all platforms. #[inline(always)] affects codegen, not correctness.
    Performance benefit validated on Zen 4 (AMD EPYC Genoa, 32 KB L1I) only;
platforms with larger L1I (e.g. Apple M-series, 192 KB) may tolerate more
inlining; platforms with similar L1I (Intel Sapphire Rapids, 32 KB) should
see comparable results. No platform will regress correctness.
  • The I-cache budget boundary (9 functions = optimal, 11+ = regression) is
    specific to xmss_leaf_1400sigs on Zen 4. A different workload or
    microarchitecture may have a different optimal set.
  • Iter 27 (GKR accumulator combining, −0.62%, p=0.0) is a real optimization
    that was below our measurement threshold. It could be included as a
    low-risk additional win if validated independently on a less noisy setup.

Related: quintic extension property tests (separate PR)

During this optimization work we found that quintic_extension/ has zero
direct unit tests anywhere in the codebase — the only coverage is implicit
through the WHIR end-to-end proof test. We wrote 15 algebraic property
tests covering:

Scalar arithmetic (10 tests): commutativity, associativity,
distributivity, multiplicative identity, add/sub roundtrip, double
negation, square == self·self, inverse roundtrip, zero not invertible,
base-field embedding preservation.

Packed ↔ scalar consistency (5 tests): packed add/sub/mul/base-mul
match scalar lane-by-lane, pack-unpack roundtrip.

Each test runs 200 randomized iterations with a seeded RNG. Total runtime
< 1 s under --release. These tests may be more appropriate for Plonky3
upstream (since quintic_extension originates there) — happy to submit
to whichever repo makes sense. Available on branch
feat/quintic-extension-tests at
https://github.com/Barnadrot/leanMultisig.

Experimentally ruled out (29 iterations total)

Details

Structural changes to eval_eq_basic (4 variants, all regress +7-9%)

  • eval_eq_4 base case (#[inline(always)]): -7.4% iai improvement,
    +8.1% wall-clock regression. I-cache pressure from inflating the
    recursive function body.
  • eval_eq_4 (#[inline(never)]): -5.7% iai, +9.2% wall-clock.
    Separate function avoids I-cache bloat but the 4-variable base case
    has worse ILP than the recursive 3-variable approach (longer dependency
    chain, less out-of-order overlap).
  • 2-var-per-level recursion: -1.1% iai, +7.3% wall-clock. Same
    ILP degradation.
  • Direct packed eq computation (bypass scalar+transpose): +9.6%.
    Packed quintic_mul (15 packed base muls via dot_product::<5>) is
    7.5x less SIMD-efficient than scalar quintic_mul_packed (2 packed base
    muls via dot_product_2).

Conclusion: eval_eq_basic's recursive structure is at a wall-clock local
optimum for Zen 4. Any change that reduces instruction count causes
wall-clock regression through ILP/cache/branch-prediction degradation.

Compiler micro-optimizations (all 0% iai delta)

LLVM already handles: CSE of redundant multiplications in quintic_square,
LICM of loop-invariant broadcasts, constant propagation through match arms
(eliminating assert_eq in base cases), dead code elimination, and
pre-broadcasting of fold factors.

GKR quotient accumulator combining (iter 27: -0.62%)

Combining single*alpha + double before eq_lo multiplication in
compute_gkr_quotient_sumcheck_polynomial_split_eq reduces 4 accumulators
to 2, saving 2 eq_lo multiplications per b_lo block. Real -0.62% (p=0.0)
but below the 1.0% wall-clock-only threshold. Applying the same change to
fold_and_compute_gkr_quotient_split_eq caused +10% regression (disrupts
the complex par_chunks_mut optimization).

Algorithmic approaches analyzed and rejected

  • Karatsuba quintic_mul: 25->15 packed muls but +30 packed adds.
    Port-balance analysis on Zen 4: exactly equal throughput (47.5 cycles).
  • Deferred fold for base x extension products: Saves ~100 packed muls
    per element but affects only 1 round per GKR layer (~0.03% e2e).
  • Delayed u128 reduction for product_sumcheck round 2: y-fold dominates
    at 300 cycles/element; x-side savings are ~0.003% e2e.
  • Packed sumcheck (ePrint 2025/719): 2.78x reported but requires major
    protocol restructuring.
  • 3-way split_eq: Adds 1 extra extension multiplication per inner
    element; net negative.
  • Batched inverse for GKR fractions: No division in the sumcheck inner
    loop (fractions cleared by cross-multiplication).
  • Cross-caller eq precomputation: Each sumcheck instance uses different
    random points; no sharing possible.

  LLVM was not force-inlining quintic_mul despite #[inline] — the monomorphized
  body is large enough that LLVM's cost heuristic declined. Each call-site paid
  ~5 cycles of function-call overhead. With quintic_mul called millions of times
  per proof, this accumulated to ~2.4% of total runtime.

  Zen 4 (c7a.2xlarge): -2.38% on xmss_leaf_1400sigs, p=0.0, revert-A/B confirmed.
…sign

  Extends the previous commit's inlining pattern to additional multiplication-
  related functions: quintic_square, all platform-specific quintic_mul_packed
  variants (AVX-512/AVX2/NEON/fallback), and MulAssign<Self>/MulAssign<QEF>.

  Testing established the I-cache budget boundary for forced inlining on Zen 4:
  these 9 functions are the optimal set. Inlining more (e.g. Add/Sub/Neg) causes
  regression from expanded code size.

  Zen 4 (c7a.2xlarge): additional -1.25% on xmss_leaf_1400sigs, p=0.0,
  revert-A/B confirmed. Combined with previous commit: ~-3.6% total.
@Barnadrot Barnadrot force-pushed the inline-quintic-extension branch from 7d9c770 to c02f10e Compare April 17, 2026 20:28
@TomWambsgans TomWambsgans merged commit 44cb7be into leanEthereum:main Apr 17, 2026