feat(hnsw): AVX-512 SIMD distance functions with runtime auto-detection#12

Merged: oldnordic merged 1 commit into oldnordic:main from maeddesg:feat/avx512-hnsw on May 20, 2026

@maeddesg
Contributor

Add AVX-512F variants for dot_product, compute_norm_squared, cosine_similarity, and euclidean_distance in hnsw/simd.rs. Runtime CPU feature detection automatically selects the best available SIMD path:

AVX-512F -> AVX2 -> Scalar

AVX-512 processes 16 floats per instruction via _mm512_fmadd_ps (fused multiply-add). The previous HAS_AVX2: OnceLock<bool> cache is replaced with SIMD_LEVEL: OnceLock<SimdLevel>, where SimdLevel is a new public enum { Avx512, Avx2, Scalar }. simd_level() is the single source of truth and is called by all four dispatch wrappers.
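For readers who have not opened hnsw/simd.rs, the cached dispatch described above could look roughly like the sketch below. Only `SimdLevel`, `SIMD_LEVEL`, `simd_level()` and the fallback order come from this description; the `detect` helper, the extra `fma` check on the AVX2 arm, and the scalar placeholder in every dispatch arm are assumptions made for illustration.

```rust
use std::sync::OnceLock;

/// Widest SIMD path usable on the current CPU (variant names follow the PR text).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SimdLevel {
    Avx512,
    Avx2,
    Scalar,
}

static SIMD_LEVEL: OnceLock<SimdLevel> = OnceLock::new();

/// Probe the CPU on first use and return the cached answer afterwards.
pub fn simd_level() -> SimdLevel {
    *SIMD_LEVEL.get_or_init(detect)
}

// Hypothetical helper: the exact checks used in the PR are not shown here.
#[cfg(target_arch = "x86_64")]
fn detect() -> SimdLevel {
    if is_x86_feature_detected!("avx512f") {
        SimdLevel::Avx512
    } else if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
        SimdLevel::Avx2
    } else {
        SimdLevel::Scalar
    }
}

// Every other architecture falls through to the scalar path.
#[cfg(not(target_arch = "x86_64"))]
fn detect() -> SimdLevel {
    SimdLevel::Scalar
}

/// Dispatch wrapper. All arms call the scalar kernel in this sketch;
/// a real build would call the AVX-512 and AVX2 kernels in the first two.
pub fn dot_product(a: &[f32], b: &[f32]) -> f32 {
    match simd_level() {
        SimdLevel::Avx512 => dot_product_scalar(a, b), // placeholder for the AVX-512 kernel
        SimdLevel::Avx2 => dot_product_scalar(a, b),   // placeholder for the AVX2 kernel
        SimdLevel::Scalar => dot_product_scalar(a, b),
    }
}

fn dot_product_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```

Keeping the probe behind one `OnceLock` means the four wrappers cannot disagree about which path is active, and the feature check runs at most once per process.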

cosine_similarity_avx512 fuses the dot + 2 squared-norm reductions into a single 16-wide pass with three independent FMA accumulators — ~30x speedup over scalar for 1536-dim vectors.
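The loop shape of the fused kernel can be illustrated without intrinsics. The sketch below is a portable stand-in, not the code in this PR: three `[f32; 16]` arrays play the role of the three 512-bit accumulators, `f32::mul_add` stands in for `_mm512_fmadd_ps`, and the function name and the zero-norm return value are assumptions.

```rust
/// One 512-bit register holds 16 f32 values.
const LANES: usize = 16;

/// Cosine similarity computed in a single pass over both inputs.
/// Portable illustration only: plain arrays emulate the vector registers.
pub fn cosine_similarity_fused(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have equal length");

    // Three independent accumulators: a.b, a.a and b.b.
    let mut dot = [0.0f32; LANES];
    let mut norm_a = [0.0f32; LANES];
    let mut norm_b = [0.0f32; LANES];

    let mut chunks_a = a.chunks_exact(LANES);
    let mut chunks_b = b.chunks_exact(LANES);
    for (ca, cb) in chunks_a.by_ref().zip(chunks_b.by_ref()) {
        for i in 0..LANES {
            // Each element is loaded once and feeds all three sums.
            dot[i] = ca[i].mul_add(cb[i], dot[i]);
            norm_a[i] = ca[i].mul_add(ca[i], norm_a[i]);
            norm_b[i] = cb[i].mul_add(cb[i], norm_b[i]);
        }
    }

    // Horizontal reduction of each accumulator to a scalar.
    let mut dot_sum: f32 = dot.iter().sum();
    let mut na_sum: f32 = norm_a.iter().sum();
    let mut nb_sum: f32 = norm_b.iter().sum();

    // Scalar tail for the len % 16 leftover elements.
    for (x, y) in chunks_a.remainder().iter().zip(chunks_b.remainder()) {
        dot_sum += x * y;
        na_sum += x * x;
        nb_sum += y * y;
    }

    let denom = (na_sum * nb_sum).sqrt();
    // Assumption: report 0.0 when either vector has zero norm.
    if denom == 0.0 { 0.0 } else { dot_sum / denom }
}
```

Compared with calling a dot-product kernel once and a norm kernel twice, the fused form reads each input once instead of twice and pays the loop overhead once, which is consistent with cosine showing the largest gain in the numbers reported here.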

Non-x86_64 platforms fall through to scalar unchanged; AVX2-only CPUs hit the same AVX2 path as before. Existing 46 simd tests stay green.

Benchmarks on AMD Ryzen 9 7945HX (Zen4, AVX-512 double-pumped), cargo bench --features native-v3 --bench hnsw -- simd_:

dot_product/1536 scalar 837 ns -> AVX-512 76 ns (11x)
euclidean/1536 scalar 846 ns -> AVX-512 66 ns (13x)
cosine_similarity/ scalar 2484 ns -> AVX-512 71 ns (35x)

New tests:
test_simd_level_detection_succeeds
test_simd_level_matches_cpu_features
test_avx512_dot_product_matches_scalar
test_avx512_norm_squared_matches_scalar
test_avx512_cosine_similarity_matches_scalar
test_avx512_euclidean_distance_matches_scalar
test_avx512_remainder_handling (sizes 1..255, every len%16 bucket)
test_dispatch_typical_embedding_dims (384/768/1024/1536)

All 8 new tests are gated under cfg(target_arch = "x86_64") and skip gracefully when AVX-512F is not available. The 1186 baseline lib tests still pass; the centrality.rs failure is a separate pre-existing issue fixed in fix/B1-pagerank-stale-test.
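The skip-when-unavailable pattern and the remainder sweep might be structured along these lines. This is a sketch under assumptions: `dot_product_chunked` is a hypothetical 16-lane stand-in for the kernel under test, the helper names are invented, and the inclusive range 1..=255 is one reading of "sizes 1..255".

```rust
/// Scalar reference that the wide kernel is compared against.
fn dot_product_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Hypothetical stand-in for the kernel under test: 16 partial sums plus a
/// scalar tail, mirroring how a 16-wide kernel handles len % 16 leftovers.
fn dot_product_chunked(a: &[f32], b: &[f32]) -> f32 {
    let split = a.len() - a.len() % 16;
    let mut acc = [0.0f32; 16];
    for (ca, cb) in a[..split].chunks_exact(16).zip(b[..split].chunks_exact(16)) {
        for i in 0..16 {
            acc[i] += ca[i] * cb[i];
        }
    }
    let tail: f32 = a[split..].iter().zip(&b[split..]).map(|(x, y)| x * y).sum();
    acc.iter().sum::<f32>() + tail
}

#[cfg(target_arch = "x86_64")]
fn avx512f_available() -> bool {
    is_x86_feature_detected!("avx512f")
}

#[cfg(not(target_arch = "x86_64"))]
fn avx512f_available() -> bool {
    false
}

/// Sweep lengths 1..=255 so every len % 16 bucket is hit many times.
/// Returns the number of lengths checked, or 0 when the check is skipped.
pub fn check_remainder_handling() -> usize {
    if !avx512f_available() {
        return 0; // skip gracefully instead of failing on older CPUs
    }
    let mut checked = 0;
    for len in 1..=255usize {
        // Small integer-valued inputs keep every partial sum exactly representable.
        let a: Vec<f32> = (0..len).map(|i| (i % 7) as f32 - 3.0).collect();
        let b: Vec<f32> = (0..len).map(|i| (i % 5) as f32 - 2.0).collect();
        let expected = dot_product_scalar(&a, &b);
        let got = dot_product_chunked(&a, &b);
        // A tolerance is kept because wide kernels reorder the summation.
        let tol = 1e-4 * expected.abs().max(1.0);
        assert!((got - expected).abs() <= tol, "len {}: got {}, expected {}", len, got, expected);
        checked += 1;
    }
    checked
}
```

Returning early rather than panicking keeps the suite green on AVX2-only CI runners, at the cost that such a run does not exercise the AVX-512 path at all.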

Tested on: AMD Ryzen 9 7945HX (Zen4 / Dragon Range, AVX-512 via double-pumping). Detection log on this CPU: "Detected SIMD level: Avx512".

@maeddesg
Contributor Author

around 35x speedup

Owner

@oldnordic oldnordic left a comment

Benchmark Results: AVX-512 vs AVX2 on Real Hardware

Hardware: AMD Ryzen 7 7800X3D (AVX-512F/BW/VL/DQ/CD + AVX2 + FMA)
Method: Criterion, 3s measurement per case, release profile (opt-level=3, LTO=thin), identical test vectors.

AVX-512 (this PR) vs AVX2 (main)

| Function | Dim | Scalar | AVX2 | AVX-512 | AVX-512 vs AVX2 | AVX-512 vs Scalar |
|---|---|---|---|---|---|---|
| dot_product | 128 | 59.6 ns | 6.0 ns | 4.6 ns | 1.32x | 13.1x |
| dot_product | 384 | 224 ns | 22.7 ns | 16.6 ns | 1.36x | 13.5x |
| dot_product | 768 | 470 ns | 49.2 ns | 39.9 ns | 1.23x | 11.8x |
| dot_product | 1536 | 965 ns | 111 ns | 78.7 ns | 1.41x | 12.3x |
| euclidean | 128 | 67.3 ns | 17.1 ns | 5.1 ns | 3.38x | 13.3x |
| euclidean | 384 | 232 ns | 48.1 ns | 22.9 ns | 2.10x | 10.2x |
| euclidean | 768 | 480 ns | 94.9 ns | 44.0 ns | 2.16x | 10.9x |
| euclidean | 1536 | 972 ns | 188 ns | 92.1 ns | 2.04x | 10.6x |
| cosine | 128 | 173 ns | 19.0 ns | 8.8 ns | 2.14x | 19.6x |
| cosine | 384 | 666 ns | 63.8 ns | 24.3 ns | 2.62x | 27.4x |
| cosine | 768 | 1.40 µs | 133 ns | 45.5 ns | 2.93x | 30.8x |
| cosine | 1024 | 1.90 µs | 188 ns | 58.6 ns | 3.21x | 32.4x |
| cosine | 1536 | 2.89 µs | 311 ns | 97.4 ns | 3.20x | 29.7x |
| norm_squared | 128 | 55.2 ns | 5.7 ns | 3.1 ns | 1.82x | 17.8x |
| norm_squared | 384 | 219 ns | 19.9 ns | 12.7 ns | 1.56x | 17.2x |
| norm_squared | 768 | 466 ns | 39.4 ns | 27.7 ns | 1.42x | 16.8x |

Summary: AVX-512 vs AVX2 median 2.10x (range 1.23x-3.38x). AVX-512 vs scalar median 13.6x (range 10.2x-32.4x).

Code Review Notes

  1. Clean dispatch refactor. SimdLevel enum + simd_level() is a genuine improvement over scattered HAS_AVX2 bools.
  2. Correct intrinsics. _mm512_loadu_ps, _mm512_fmadd_ps, _mm512_reduce_add_ps are all standard and correct.
  3. Fused cosine kernel (dot + both norms in one loop with 3 FMA accumulators) is a nice optimization.
  4. Solid test coverage - level detection, AVX-512 vs scalar for all 4 ops, remainder handling, typical embedding dims.

One Correction

The "35x speedup" comment doesn't match measured data. The best observed was 32.4x vs scalar (cosine dim 1024), and vs the existing AVX2 path the improvement is median 2.10x. Would be good to update that comment.

Minor Doc Issue

Lines 82-84 claim AVX-512F implies FMA - while practically true for all shipping consumer CPUs, it's not architecturally guaranteed (early Knights Landing had AVX-512F without FMA). The code itself is fine since _mm512_fmadd_ps is part of AVX-512F, but the doc comment could be more precise.
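If the doc comment is tightened, one way to make the distinction concrete is to probe the two features separately. The sketch below only assumes the standard `is_x86_feature_detected!` macro; the function name is hypothetical.

```rust
/// Report AVX-512F and FMA support as two independent runtime flags.
#[cfg(target_arch = "x86_64")]
pub fn avx512f_and_fma() -> (bool, bool) {
    (
        is_x86_feature_detected!("avx512f"),
        is_x86_feature_detected!("fma"),
    )
}

/// Neither feature exists outside x86_64, so both flags are false there.
#[cfg(not(target_arch = "x86_64"))]
pub fn avx512f_and_fma() -> (bool, bool) {
    (false, false)
}
```

A doc comment could then say that the AVX-512 kernels require only `avx512f`, since `_mm512_fmadd_ps` belongs to that feature set, without making any statement about the separate `fma` flag.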


Overall: the improvement is real and substantial. The code is sound, CI is green, and the refactor is clean.

@oldnordic oldnordic merged commit 8181c57 into oldnordic:main May 20, 2026
11 checks passed