feat(hnsw): AVX-512 SIMD distance functions with runtime auto-detection by maeddesg · Pull Request #12 · oldnordic/sqlitegraph

maeddesg · 2026-05-20T04:41:54Z

Add AVX-512F variants for dot_product, compute_norm_squared, cosine_similarity, and euclidean_distance in hnsw/simd.rs. Runtime CPU feature detection automatically selects the best available SIMD path:

AVX-512F -> AVX2 -> Scalar

AVX-512 processes 16 floats per instruction via _mm512_fmadd_ps (fused multiply-add). The previous HAS_AVX2: OnceLock<bool> cache is replaced with SIMD_LEVEL: OnceLock<SimdLevel>, where SimdLevel is a new public enum { Avx512, Avx2, Scalar }. simd_level() is the single source of truth and is called by all four dispatch wrappers.
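For readers who have not opened the diff, the cached-detection pattern described above might look roughly like the sketch below. Only `SimdLevel`, `SIMD_LEVEL`, and `simd_level()` are names from this PR; `detect`, `dot_product_scalar`, and the body of the `dot_product` wrapper are illustrative assumptions, as is checking `fma` alongside `avx2`. The SIMD arms are placeholders that call the scalar kernel so the sketch stays portable.

```rust
use std::sync::OnceLock;

/// SIMD tiers in order of preference, mirroring the enum named in the PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SimdLevel {
    Avx512,
    Avx2,
    Scalar,
}

static SIMD_LEVEL: OnceLock<SimdLevel> = OnceLock::new();

/// Probes the CPU on first use; later calls return the cached value.
pub fn simd_level() -> SimdLevel {
    *SIMD_LEVEL.get_or_init(detect)
}

// Hypothetical helper: the real detection code may check other flags.
#[cfg(target_arch = "x86_64")]
fn detect() -> SimdLevel {
    if is_x86_feature_detected!("avx512f") {
        SimdLevel::Avx512
    } else if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
        // Assumption: the AVX2 kernels use FMA, so both flags are required.
        SimdLevel::Avx2
    } else {
        SimdLevel::Scalar
    }
}

// Non-x86_64 targets never probe and always report Scalar.
#[cfg(not(target_arch = "x86_64"))]
fn detect() -> SimdLevel {
    SimdLevel::Scalar
}

/// Scalar reference kernel (hypothetical name).
fn dot_product_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Dispatch wrapper: one match on the cached level per call.
pub fn dot_product(a: &[f32], b: &[f32]) -> f32 {
    match simd_level() {
        // Placeholders: a real build would call the AVX-512 / AVX2 kernels here.
        SimdLevel::Avx512 => dot_product_scalar(a, b),
        SimdLevel::Avx2 => dot_product_scalar(a, b),
        SimdLevel::Scalar => dot_product_scalar(a, b),
    }
}
```

Caching an enum rather than a bool means adding a further tier later only touches `detect` and the match arms.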

cosine_similarity_avx512 fuses the dot + 2 squared-norm reductions into a single 16-wide pass with three independent FMA accumulators — ~30x speedup over scalar for 1536-dim vectors.
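The loop structure of the fused kernel can be modelled without intrinsics. The sketch below is a portable stand-in, not the code from this PR: it uses `[f32; 16]` arrays in place of 512-bit registers and `f32::mul_add` in place of `_mm512_fmadd_ps`. The name `cosine_similarity_fused` and the choice to return 0.0 for a zero-length or zero-norm input are assumptions.

```rust
const LANES: usize = 16; // one 512-bit register holds 16 f32 values

/// Cosine similarity in a single pass over both slices.
/// Three independent accumulators (dot, |a|^2, |b|^2) are updated per lane,
/// so each element is loaded once and feeds all three sums.
pub fn cosine_similarity_fused(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same length");

    let mut dot = [0.0f32; LANES];
    let mut norm_a = [0.0f32; LANES];
    let mut norm_b = [0.0f32; LANES];

    let mut chunks_a = a.chunks_exact(LANES);
    let mut chunks_b = b.chunks_exact(LANES);
    for (ca, cb) in (&mut chunks_a).zip(&mut chunks_b) {
        for i in 0..LANES {
            dot[i] = ca[i].mul_add(cb[i], dot[i]);
            norm_a[i] = ca[i].mul_add(ca[i], norm_a[i]);
            norm_b[i] = cb[i].mul_add(cb[i], norm_b[i]);
        }
    }

    // Horizontal reduction of each accumulator to a single value.
    let mut dot_sum: f32 = dot.iter().sum();
    let mut na_sum: f32 = norm_a.iter().sum();
    let mut nb_sum: f32 = norm_b.iter().sum();

    // Scalar tail for the len % 16 leftover elements.
    for (x, y) in chunks_a.remainder().iter().zip(chunks_b.remainder()) {
        dot_sum = x.mul_add(*y, dot_sum);
        na_sum = x.mul_add(*x, na_sum);
        nb_sum = y.mul_add(*y, nb_sum);
    }

    let denom = (na_sum * nb_sum).sqrt();
    // Assumption: a zero-norm input yields 0.0 rather than NaN.
    if denom == 0.0 { 0.0 } else { dot_sum / denom }
}
```

Compared with calling a dot product and two norm kernels separately, the fused form reads each input once instead of twice, which is consistent with cosine showing the largest gain in the benchmarks.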

Non-x86_64 platforms fall through to scalar unchanged; AVX2-only CPUs hit the same AVX2 path as before. Existing 46 simd tests stay green.

Benchmarks on AMD Ryzen 9 7945HX (Zen4, AVX-512 double-pumped), cargo bench --features native-v3 --bench hnsw -- simd_:

dot_product/1536 scalar 837 ns -> AVX-512 76 ns (11x)
euclidean/1536 scalar 846 ns -> AVX-512 66 ns (13x)
cosine_similarity/ scalar 2484 ns -> AVX-512 71 ns (35x)

New tests:
test_simd_level_detection_succeeds
test_simd_level_matches_cpu_features
test_avx512_dot_product_matches_scalar
test_avx512_norm_squared_matches_scalar
test_avx512_cosine_similarity_matches_scalar
test_avx512_euclidean_distance_matches_scalar
test_avx512_remainder_handling (sizes 1..255, every len%16 bucket)
test_dispatch_typical_embedding_dims (384/768/1024/1536)
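As an illustration of what the remainder-handling test exercises, here is a minimal sketch that compares a 16-wide chunked dot product against a scalar reference for every length from 1 through 255. It is not the test from this PR: `dot_scalar`, `dot_chunked`, and `remainder_mismatches` are hypothetical names, the chunked kernel is a portable stand-in for the AVX-512 one, and the inclusive upper bound is an assumption.

```rust
const LANES: usize = 16;

/// Scalar reference.
fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Portable stand-in for a 16-wide kernel: full chunks, then a scalar tail.
fn dot_chunked(a: &[f32], b: &[f32]) -> f32 {
    let mut acc = [0.0f32; LANES];
    let mut ca = a.chunks_exact(LANES);
    let mut cb = b.chunks_exact(LANES);
    for (xa, xb) in (&mut ca).zip(&mut cb) {
        for i in 0..LANES {
            acc[i] = xa[i].mul_add(xb[i], acc[i]);
        }
    }
    let mut sum: f32 = acc.iter().sum();
    // The tail covers the len % 16 elements that do not fill a chunk.
    for (x, y) in ca.remainder().iter().zip(cb.remainder()) {
        sum = x.mul_add(*y, sum);
    }
    sum
}

/// Lengths in 1..=255 where the two kernels differ by more than `rel_tol`.
/// Every len % 16 bucket (0 through 15) is hit many times in that range.
fn remainder_mismatches(rel_tol: f32) -> Vec<usize> {
    (1..=255usize)
        .filter(|&len| {
            let a: Vec<f32> = (0..len).map(|i| (i % 7) as f32 * 0.25 - 0.5).collect();
            let b: Vec<f32> = (0..len).map(|i| (i % 5) as f32 * 0.5 + 0.125).collect();
            let expected = dot_scalar(&a, &b);
            let got = dot_chunked(&a, &b);
            (got - expected).abs() > rel_tol * expected.abs().max(1.0)
        })
        .collect()
}
```

A relative tolerance is used because SIMD and scalar kernels sum in different orders, so bit-exact equality is not expected for arbitrary inputs.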

All 8 new tests are gated under cfg(target_arch = "x86_64") and skip gracefully when AVX-512F is not available. The 1186 baseline lib tests still pass; the centrality.rs failure is a separate pre-existing issue fixed in fix/B1-pagerank-stale-test.

Tested on: AMD Ryzen 9 7945HX (Zen4 / Dragon Range, AVX-512 via double-pumping). Detection log on this CPU: "Detected SIMD level: Avx512".


maeddesg · 2026-05-20T04:50:29Z

around 35x speedup

oldnordic

Benchmark Results: AVX-512 vs AVX2 on Real Hardware

Hardware: AMD Ryzen 7 7800X3D (AVX-512F/BW/VL/DQ/CD + AVX2 + FMA)
Method: Criterion, 3s measurement per case, release profile (opt-level=3, LTO=thin), identical test vectors.

AVX-512 (this PR) vs AVX2 (main)

Function	Dim	Scalar	AVX2	AVX-512	AVX-512 vs AVX2	AVX-512 vs Scalar
dot_product	128	59.6ns	6.0ns	4.6ns	1.32x	13.1x
dot_product	384	224ns	22.7ns	16.6ns	1.36x	13.5x
dot_product	768	470ns	49.2ns	39.9ns	1.23x	11.8x
dot_product	1536	965ns	111ns	78.7ns	1.41x	12.3x
euclidean	128	67.3ns	17.1ns	5.1ns	3.38x	13.3x
euclidean	384	232ns	48.1ns	22.9ns	2.10x	10.2x
euclidean	768	480ns	94.9ns	44.0ns	2.16x	10.9x
euclidean	1536	972ns	188ns	92.1ns	2.04x	10.6x
cosine	128	173ns	19.0ns	8.8ns	2.14x	19.6x
cosine	384	666ns	63.8ns	24.3ns	2.62x	27.4x
cosine	768	1.40us	133ns	45.5ns	2.93x	30.8x
cosine	1024	1.90us	188ns	58.6ns	3.21x	32.4x
cosine	1536	2.89us	311ns	97.4ns	3.20x	29.7x
norm_squared	128	55.2ns	5.7ns	3.1ns	1.82x	17.8x
norm_squared	384	219ns	19.9ns	12.7ns	1.56x	17.2x
norm_squared	768	466ns	39.4ns	27.7ns	1.42x	16.8x

Summary: AVX-512 vs AVX2 median 2.10x (range 1.23x-3.38x). AVX-512 vs scalar median 13.6x (range 10.2x-32.4x).
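The ratio columns in the table are simply the slower time divided by the faster one. As a worked check, the sketch below reproduces the cosine dim-1024 row (1.90us scalar, 188ns AVX2, 58.6ns AVX-512); `speedup` and `speedup_label` are hypothetical helper names, not part of the benchmark harness.

```rust
/// Speedup of the fast path over the slow path, both timed in nanoseconds.
fn speedup(slow_ns: f64, fast_ns: f64) -> f64 {
    slow_ns / fast_ns
}

/// Formats a speedup the way the table does, e.g. "32.4x".
fn speedup_label(slow_ns: f64, fast_ns: f64, decimals: usize) -> String {
    format!("{:.*}x", decimals, speedup(slow_ns, fast_ns))
}
```

With these, `speedup_label(1900.0, 58.6, 1)` gives "32.4x" and `speedup_label(188.0, 58.6, 2)` gives "3.21x", matching the table.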

Code Review Notes

Clean dispatch refactor. SimdLevel enum + simd_level() is a genuine improvement over scattered HAS_AVX2 bools.
Correct intrinsics. _mm512_loadu_ps, _mm512_fmadd_ps, _mm512_reduce_add_ps are all standard and correct.
Fused cosine kernel (dot + both norms in one loop with 3 FMA accumulators) is a nice optimization.
Solid test coverage - level detection, AVX-512 vs scalar for all 4 ops, remainder handling, typical embedding dims.

One Correction

The "35x speedup" comment doesn't match measured data. The best observed was 32.4x vs scalar (cosine dim 1024), and vs the existing AVX2 path the improvement is median 2.10x. Would be good to update that comment.

Minor Doc Issue

Lines 82-84 claim AVX-512F implies FMA. That holds in practice on shipping CPUs, but the two are reported as separate CPUID feature flags, so the implication is not architecturally guaranteed. The code itself is fine since _mm512_fmadd_ps is part of AVX-512F, but the doc comment could be more precise.
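To make the distinction concrete, a sketch along these lines treats the two flags as independent. `FmaSupport`, `probe`, and `can_use_512bit_fma` are hypothetical names for illustration; only `is_x86_feature_detected!` is the real standard-library macro.

```rust
/// Feature flags as reported independently by the CPU.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FmaSupport {
    pub avx512f: bool,
    pub fma: bool,
}

#[cfg(target_arch = "x86_64")]
pub fn probe() -> FmaSupport {
    FmaSupport {
        avx512f: is_x86_feature_detected!("avx512f"),
        fma: is_x86_feature_detected!("fma"),
    }
}

// Non-x86_64 targets report neither flag.
#[cfg(not(target_arch = "x86_64"))]
pub fn probe() -> FmaSupport {
    FmaSupport { avx512f: false, fma: false }
}

/// The 512-bit fused multiply-add belongs to AVX-512F itself, so the
/// 512-bit kernels depend only on that flag and not on the separate
/// `fma` flag, which governs the 128/256-bit FMA instructions.
pub fn can_use_512bit_fma(s: FmaSupport) -> bool {
    s.avx512f
}
```

A doc comment worded this way ("requires AVX-512F, which includes the 512-bit FMA forms") avoids stating an implication between the two flags.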

Overall: the improvement is real and substantial. The code is sound, CI is green, and the refactor is clean.

oldnordic reviewed May 20, 2026

oldnordic merged commit 8181c57 into oldnordic:main May 20, 2026
11 checks passed

oldnordic merged 1 commit into oldnordic:main from maeddesg:feat/avx512-hnsw