feat(hnsw): AVX-512 SIMD distance functions with runtime auto-detection #12
Add AVX-512F variants for dot_product, compute_norm_squared,
cosine_similarity, and euclidean_distance in hnsw/simd.rs. Runtime
CPU feature detection automatically selects the best available SIMD
path:
AVX-512F -> AVX2 -> Scalar
AVX-512 processes 16 floats per instruction via _mm512_fmadd_ps
(fused multiply-add). The previous `HAS_AVX2: OnceLock<bool>` cache
is replaced with `SIMD_LEVEL: OnceLock<SimdLevel>`, where SimdLevel
is a new public enum { Avx512, Avx2, Scalar }. simd_level() is the
single source of truth and is called by all four dispatch wrappers.
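The dispatch described above might look roughly like the following. This is a minimal sketch, not the PR's code: `detect_simd_level` and the scalar stand-ins inside `dot_product` are hypothetical, and a real implementation would call `unsafe` `#[target_feature]` kernels in the AVX arms.

```rust
use std::sync::OnceLock;

/// SIMD tiers named in the PR description, best first.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SimdLevel {
    Avx512,
    Avx2,
    Scalar,
}

/// Detection result, computed once and then reused by every wrapper.
static SIMD_LEVEL: OnceLock<SimdLevel> = OnceLock::new();

/// Hypothetical helper: probe the CPU in priority order AVX-512F -> AVX2 -> Scalar.
fn detect_simd_level() -> SimdLevel {
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("avx512f") {
            return SimdLevel::Avx512;
        }
        if std::is_x86_feature_detected!("avx2") {
            return SimdLevel::Avx2;
        }
    }
    // Non-x86_64 targets, and x86_64 CPUs without AVX2, land here.
    SimdLevel::Scalar
}

/// Single source of truth for the selected SIMD path.
pub fn simd_level() -> SimdLevel {
    *SIMD_LEVEL.get_or_init(detect_simd_level)
}

fn dot_product_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Dispatch wrapper. The AVX arms call the scalar kernel as a stand-in
/// (assumption made so the sketch runs on any CPU).
pub fn dot_product(a: &[f32], b: &[f32]) -> f32 {
    match simd_level() {
        SimdLevel::Avx512 => dot_product_scalar(a, b), // stand-in for the AVX-512 kernel
        SimdLevel::Avx2 => dot_product_scalar(a, b),   // stand-in for the AVX2 kernel
        SimdLevel::Scalar => dot_product_scalar(a, b),
    }
}
```

Caching the enum rather than a bool means a further tier only needs a new variant and a new match arm in each wrapper.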
cosine_similarity_avx512 fuses the dot + 2 squared-norm reductions
into a single 16-wide pass with three independent FMA accumulators —
~30x speedup over scalar for 1536-dim vectors.
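To illustrate the fused pass, here is a portable scalar model of the idea. It is not the intrinsics kernel from the PR: the 16-element arrays stand in for 512-bit registers, `cosine_similarity_fused` is a hypothetical name, and returning 0.0 for a zero-norm input is an assumption of this sketch.

```rust
/// Lane count of one 512-bit register holding f32 values.
const LANES: usize = 16;

/// Scalar model of a fused cosine kernel: one pass over the inputs,
/// three independent accumulators (dot, |a|^2, |b|^2).
pub fn cosine_similarity_fused(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have equal length");

    let split = a.len() - a.len() % LANES;
    let mut dot = [0.0f32; LANES];
    let mut norm_a = [0.0f32; LANES];
    let mut norm_b = [0.0f32; LANES];

    // Main loop: each iteration models three 16-wide fused multiply-adds.
    for (ca, cb) in a[..split].chunks_exact(LANES).zip(b[..split].chunks_exact(LANES)) {
        for i in 0..LANES {
            dot[i] = ca[i].mul_add(cb[i], dot[i]);
            norm_a[i] = ca[i].mul_add(ca[i], norm_a[i]);
            norm_b[i] = cb[i].mul_add(cb[i], norm_b[i]);
        }
    }

    // Horizontal reduction of each accumulator.
    let mut d: f32 = dot.iter().sum();
    let mut na: f32 = norm_a.iter().sum();
    let mut nb: f32 = norm_b.iter().sum();

    // Scalar tail for the len % 16 leftover elements.
    for (x, y) in a[split..].iter().zip(&b[split..]) {
        d += x * y;
        na += x * x;
        nb += y * y;
    }

    let denom = (na * nb).sqrt();
    if denom == 0.0 { 0.0 } else { d / denom }
}
```

Reading each input element once instead of three times is a plausible reason why cosine shows a larger gain than the single-reduction kernels.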
Non-x86_64 platforms fall through to scalar unchanged; AVX2-only
CPUs hit the same AVX2 path as before. Existing 46 simd tests stay
green.
Benchmarks on AMD Ryzen 9 7945HX (Zen4, AVX-512 double-pumped),
cargo bench --features native-v3 --bench hnsw -- simd_:
dot_product/1536 scalar 837 ns -> AVX-512 76 ns (11x)
euclidean/1536 scalar 846 ns -> AVX-512 66 ns (13x)
cosine_similarity/ scalar 2484 ns -> AVX-512 71 ns (35x)
New tests:
test_simd_level_detection_succeeds
test_simd_level_matches_cpu_features
test_avx512_dot_product_matches_scalar
test_avx512_norm_squared_matches_scalar
test_avx512_cosine_similarity_matches_scalar
test_avx512_euclidean_distance_matches_scalar
test_avx512_remainder_handling (sizes 1..255, every len%16 bucket)
test_dispatch_typical_embedding_dims (384/768/1024/1536)
All 8 new tests gated under cfg(target_arch = "x86_64") and skip
gracefully when AVX-512F is not available. 1186 baseline lib tests
still pass; the centrality.rs failure is a separate pre-existing
issue fixed in fix/B1-pagerank-stale-test.
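A remainder-handling check of the kind listed above could be sketched as follows. It assumes a portable 16-wide chunked kernel in place of the AVX-512 one; `dot_chunked`, `remainder_mismatches`, the test data, and the 1e-3 tolerance are all hypothetical choices, and the range is taken as 1 through 255 inclusive.

```rust
const LANES: usize = 16;

/// Reference implementation.
fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Stand-in for a 16-wide kernel: full chunks first, then a scalar tail.
fn dot_chunked(a: &[f32], b: &[f32]) -> f32 {
    let split = a.len() - a.len() % LANES;
    let mut acc = [0.0f32; LANES];
    for (ca, cb) in a[..split].chunks_exact(LANES).zip(b[..split].chunks_exact(LANES)) {
        for i in 0..LANES {
            acc[i] += ca[i] * cb[i];
        }
    }
    let tail: f32 = a[split..].iter().zip(&b[split..]).map(|(x, y)| x * y).sum();
    acc.iter().sum::<f32>() + tail
}

/// Returns every length in 1..=max_len where the chunked kernel disagrees
/// with the scalar reference beyond a relative tolerance.
pub fn remainder_mismatches(max_len: usize) -> Vec<usize> {
    (1..=max_len)
        .filter(|&n| {
            let a: Vec<f32> = (0..n).map(|i| (i as f32 * 0.1).sin()).collect();
            let b: Vec<f32> = (0..n).map(|i| (i as f32 * 0.2).cos()).collect();
            let expected = dot_scalar(&a, &b);
            let got = dot_chunked(&a, &b);
            // The summation order differs, so compare with a tolerance.
            (expected - got).abs() > 1e-3 * expected.abs().max(1.0)
        })
        .collect()
}
```

Lengths 1 through 255 cover every `len % 16` bucket, including inputs shorter than one full chunk, where only the tail loop runs.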
Tested on: AMD Ryzen 9 7945HX (Zen4 / Dragon Range, AVX-512 via
double-pumping). Detection log on this CPU: "Detected SIMD level: Avx512".
around 35x speedup
oldnordic left a comment
Benchmark Results: AVX-512 vs AVX2 on Real Hardware
Hardware: AMD Ryzen 7 7800X3D (AVX-512F/BW/VL/DQ/CD + AVX2 + FMA)
Method: Criterion, 3s measurement per case, release profile (opt-level=3, LTO=thin), identical test vectors.
AVX-512 (this PR) vs AVX2 (main)
| Function | Dim | Scalar | AVX2 | AVX-512 | 512 vs AVX2 | 512 vs Scalar |
|---|---|---|---|---|---|---|
| dot_product | 128 | 59.6ns | 6.0ns | 4.6ns | 1.32x | 13.1x |
| dot_product | 384 | 224ns | 22.7ns | 16.6ns | 1.36x | 13.5x |
| dot_product | 768 | 470ns | 49.2ns | 39.9ns | 1.23x | 11.8x |
| dot_product | 1536 | 965ns | 111ns | 78.7ns | 1.41x | 12.3x |
| euclidean | 128 | 67.3ns | 17.1ns | 5.1ns | 3.38x | 13.3x |
| euclidean | 384 | 232ns | 48.1ns | 22.9ns | 2.10x | 10.2x |
| euclidean | 768 | 480ns | 94.9ns | 44.0ns | 2.16x | 10.9x |
| euclidean | 1536 | 972ns | 188ns | 92.1ns | 2.04x | 10.6x |
| cosine | 128 | 173ns | 19.0ns | 8.8ns | 2.14x | 19.6x |
| cosine | 384 | 666ns | 63.8ns | 24.3ns | 2.62x | 27.4x |
| cosine | 768 | 1.40us | 133ns | 45.5ns | 2.93x | 30.8x |
| cosine | 1024 | 1.90us | 188ns | 58.6ns | 3.21x | 32.4x |
| cosine | 1536 | 2.89us | 311ns | 97.4ns | 3.20x | 29.7x |
| norm_squared | 128 | 55.2ns | 5.7ns | 3.1ns | 1.82x | 17.8x |
| norm_squared | 384 | 219ns | 19.9ns | 12.7ns | 1.56x | 17.2x |
| norm_squared | 768 | 466ns | 39.4ns | 27.7ns | 1.42x | 16.8x |
Summary: AVX-512 vs AVX2 median 2.10x (range 1.23x-3.38x). AVX-512 vs scalar median 13.6x (range 10.2x-32.4x).
Code Review Notes
- Clean dispatch refactor. `SimdLevel` enum + `simd_level()` is a genuine improvement over scattered `HAS_AVX2` bools.
- Correct intrinsics. `_mm512_loadu_ps`, `_mm512_fmadd_ps`, `_mm512_reduce_add_ps` are all standard and correct.
- Fused cosine kernel (dot + both norms in one loop with 3 FMA accumulators) is a nice optimization.
- Solid test coverage: level detection, AVX-512 vs scalar for all 4 ops, remainder handling, typical embedding dims.
One Correction
The "35x speedup" comment doesn't match measured data. The best observed was 32.4x vs scalar (cosine dim 1024), and vs the existing AVX2 path the improvement is median 2.10x. Would be good to update that comment.
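For reference, the two headline figures can be reproduced from the numbers quoted in this thread. The helper below is a hypothetical one-liner; the inputs are the cosine timings from the PR description (2484 ns vs 71 ns) and from the table above (1.90 us vs 58.6 ns at dim 1024).

```rust
/// Speedup of a candidate over a baseline; both timings in nanoseconds.
pub fn speedup(baseline_ns: f64, candidate_ns: f64) -> f64 {
    baseline_ns / candidate_ns
}

/// Speedup rounded to one decimal place, as the table reports it.
pub fn speedup_rounded(baseline_ns: f64, candidate_ns: f64) -> f64 {
    (speedup(baseline_ns, candidate_ns) * 10.0).round() / 10.0
}
```

`speedup_rounded(2484.0, 71.0)` gives 35.0 and `speedup_rounded(1900.0, 58.6)` gives 32.4, so the two figures come from different benchmark runs on different machines rather than from an arithmetic slip.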
Minor Doc Issue
Lines 82-84 claim AVX-512F implies FMA. That holds in practice on shipping CPUs, but AVX-512F and FMA are reported as separate CPUID feature flags, so the implication is not architecturally guaranteed. The code itself is fine since _mm512_fmadd_ps is part of AVX-512F, but the doc comment could be more precise.
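One way to make the distinction concrete is to query the two flags independently. This is a sketch with hypothetical function names; it only shows that the `avx512f` and `fma` checks are separate, and that a kernel built on `_mm512_fmadd_ps` needs only the former.

```rust
/// Returns (avx512f, fma) as reported by runtime detection.
#[cfg(target_arch = "x86_64")]
pub fn avx512f_and_fma() -> (bool, bool) {
    (
        std::is_x86_feature_detected!("avx512f"),
        std::is_x86_feature_detected!("fma"),
    )
}

/// Other architectures have neither flag.
#[cfg(not(target_arch = "x86_64"))]
pub fn avx512f_and_fma() -> (bool, bool) {
    (false, false)
}

/// Gate for a kernel that uses only AVX-512F instructions,
/// including the 512-bit fused multiply-add.
pub fn can_use_avx512_kernel() -> bool {
    avx512f_and_fma().0
}
```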
Overall: the improvement is real and substantial. The code is sound, CI is green, and the refactor is clean.