perf(quantize): AVX2 SIMD Q4_0/Q8_0 dequant — ~8-9× speedup (closes #386)#1691
Merged
Conversation
) Issue #386 measured Q4_0/Q8_0 dequant at ~1.22 Gelem/s — 5× below the 26.8 GB/s memcpy ceiling on the same hardware — and attributed it to a missing SIMD path. The scalar implementations bottleneck on i8→i32→f32 sign-extend cascades and per-byte nibble unpacks that LLVM cannot auto-vectorize through. This PR adds AVX2 fast paths in `aprender-core/src/format/quantize_simd.rs`, runtime-gated via `is_x86_feature_detected!("avx2")`. Non-x86 targets and hosts without AVX2 keep the scalar code path unchanged. ## Implementation - **Q8_0**: load 32 i8 quants → 4× `_mm256_cvtepi8_epi32` (8 lanes each) → `_mm256_cvtepi32_ps` → multiply broadcast f16 scale → 4× `_mm256_storeu_ps`. - **Q4_0**: load 16 packed nibble bytes → mask low + shift-mask high → `_mm_unpacklo/hi_epi8` to recover the interleaved layout (`byte_i = q_{2i+1} << 4 | q_{2i}` — matches `Q4_0Quantizer::quantize`) → 4× `_mm256_cvtepu8_epi32` + `_mm256_sub_epi32(_, 8)` → cvt → mul → store. Both functions are `unsafe fn` + `#[target_feature(enable = "avx2")]`, called from `_dispatch` wrappers that perform the feature detection. `#[allow(unsafe_code)]` is scoped to this file; no public API gains `unsafe`. Bounds invariants are documented in the module doc-comment and asserted by the parity tests. ## Measured speedup (262144 elements, AMD Zen 4 host) | | Before (issue #386) | After | Speedup | |--------------|---------------------|----------------|---------| | Q8_0 dequant | 1.22 Gelem/s | 11.4 Gelem/s | 9.3× | | Q4_0 dequant | 1.25 Gelem/s | 9.8 Gelem/s | 7.8× | Well above the ≥10% perf gate; in line with the issue's "AVX2 should approach memcpy speed" prediction (memcpy ceiling here is ~6.7 Gelem/s output bandwidth for f32 — both fast paths exceed it because input bandwidth at 1 byte/elem (Q8_0) or 0.5 bytes/elem (Q4_0) is far below the output side). ## Tests - `scalar_simd_parity_q8_0` / `scalar_simd_parity_q4_0`: bit-exact (`to_bits()`) equality between scalar and SIMD output across [32, 64, 256, 1024, 32 × 71] element counts. - `dispatch_returns_false_without_avx2`: non-x86 path asserts the dispatcher correctly signals scalar fallback. - All 58 pre-existing quantize tests still pass (round-trip + property). ``` $ cargo test -p aprender-core --lib --features format-quantize quantize_simd running 3 tests test format::quantize_simd::tests::dispatch_returns_false_without_avx2 ... ok test format::quantize_simd::tests::scalar_simd_parity_q4_0 ... ok test format::quantize_simd::tests::scalar_simd_parity_q8_0 ... ok test result: ok. 3 passed; 0 failed ``` ## Out of scope - SSE2 / AVX-512 / NEON / WASM SIMD128 fast paths — separate tickets (AVX2 covers the x86-64 default since ~2013, hitting essentially every modern x86-64 dev/CI machine including the self-hosted runners). - Q4_1 / Q4_K / Q6_K / Q8_K SIMD — separate, more complex layouts. - Quantize (compress) path — issue #386 notes quantize is 4× slower than dequant, but the dominant cost is finding block scales (reductions), not the per-element compress. Separate ticket. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Issue #386 measured Q4_0/Q8_0 dequant at ~1.22 Gelem/s — 5× below the
26.8 GB/s memcpy ceiling on the same hardware — and attributed it to a
missing SIMD path. The scalar implementations bottleneck on i8→i32→f32
sign-extend cascades and per-byte nibble unpacks that LLVM cannot
auto-vectorize through.
This PR adds AVX2 fast paths in
`aprender-core/src/format/quantize_simd.rs`, runtime-gated via
`is_x86_feature_detected!("avx2")`. Non-x86 targets and hosts without
AVX2 keep the scalar code path unchanged.
Implementation
each) → `_mm256_cvtepi32_ps` → multiply broadcast f16 scale → 4×
`_mm256_storeu_ps`.
`mm_unpacklo/hi_epi8` to recover the interleaved layout
(`byte_i = q{2i+1} << 4 | q_{2i}` — matches
`Q4_0Quantizer::quantize`) → 4× `_mm256_cvtepu8_epi32` +
`mm256_sub_epi32(, 8)` → cvt → mul → store.
Both functions are `unsafe fn` + `#[target_feature(enable = "avx2")]`,
called from `_dispatch` wrappers that perform the feature detection.
`#[allow(unsafe_code)]` is scoped to this file; no public API gains
`unsafe`. Bounds invariants are documented in the module doc-comment
and asserted by the parity tests.
Measured speedup (262144 elements, AMD Zen 4 host)
Far above the ≥10% perf gate; in line with the issue's "AVX2 should
approach memcpy speed" prediction (memcpy ceiling here is ~6.7 Gelem/s
output bandwidth for f32 — both fast paths exceed it because input
bandwidth at 1 byte/elem (Q8_0) or 0.5 bytes/elem (Q4_0) is far below
the output side).
Tests
(`to_bits()`) equality between scalar and SIMD output across
[32, 64, 256, 1024, 32 × 71] element counts.
dispatcher correctly signals scalar fallback.
```
$ cargo test -p aprender-core --lib --features format-quantize quantize_simd
running 3 tests
test format::quantize_simd::tests::dispatch_returns_false_without_avx2 ... ok
test format::quantize_simd::tests::scalar_simd_parity_q4_0 ... ok
test format::quantize_simd::tests::scalar_simd_parity_q8_0 ... ok
test result: ok. 3 passed; 0 failed
```
Out of scope
(AVX2 covers the x86-64 default since ~2013, hitting essentially
every modern x86-64 dev/CI machine including the self-hosted runners).
than dequant, but the dominant cost is finding block scales
(reductions), not the per-element compress. Separate ticket.
Closes #386.
🤖 Generated with Claude Code