perf(mem_wal): match hnswlib throughput via runtime AVX-512 f32 distance by touch-of-grey · Pull Request #7009 · lance-format/lance

touch-of-grey · 2026-05-30T16:02:25Z

Summary

Makes the in-memory MemWAL HNSW (rust/lance/src/dataset/mem_wal/hnsw/) as fast as hnswlib on insert and search.

Decomposing the gap against hnswlib showed the index was never algorithmically slower — recall was equal-or-better and memory ~44% lower at every size. The only real gap was SIMD width: the shipped binary targets target-cpu=haswell, so f32::l2/f32::dot only autovectorize to AVX2, while a -march=native hnswlib uses AVX-512 on capable CPUs (a target-cpu=native control build confirmed parity).

This adds runtime-dispatched l2_f32/dot_f32 in lance-linalg (#[target_feature(avx512f)] 16-wide kernels gated by SIMD_SUPPORT, AVX2 fallback via the existing autovectorized path — same pattern as dot_u8) and routes the MemWAL HNSW distance through them, so the shipped build uses AVX-512 at runtime.

Results (c7i.12xlarge, 48 threads, dim=1024, m=12, ef=64, k=10)

Shipped build with this change vs hnswlib:

rows	insert L/H	query L/H	recall@10 L/H	peak RSS L/H
100k	0.98×	1.11×	0.80 / 0.84	0.56×
500k	0.96×	0.96×	0.51 / 0.52	0.56×
1M	0.98×	1.05×	0.49 / 0.43	0.56×

vs the pre-change shipped build: insert @1m +14%, query @1m +57%. perf stat @1m: cycles within 1% (Lance 0.99×), 41% fewer instructions (AVX-512 density + zero-copy Arrow vs hnswlib's per-vector memcpy), ~44% less RSS.

Net: matches or beats hnswlib on insert and query, with equivalent CPU cycles, lower memory, and comparable recall.

Changes

lance-linalg: new l2_f32/dot_f32 runtime AVX-512 dispatch (+ unit tests asserting they match the scalar reference across 16-multiple and tail lengths).
mem_wal/hnsw/storage.rs: route compute_f32_distance (L2 + Dot) through the dispatchers.
benches/mem_wal/vector/hnsw/: parity-suite driver and --query-repeats for a stable query window.

Follow-ups: AVX-512 cosine for the memtable; adopt the same dispatch in the broader f32::l2/f32::dot for all vector search.

cc @jackye1995 — please review.

The shipped binary targets target-cpu=haswell, so the autovectorized f32 L2/dot in lance-linalg only ever emit AVX2 even on AVX-512 CPUs, while a -march=native HNSW competitor uses AVX-512. Add runtime-dispatched l2_f32/dot_f32 (target_feature avx512f 16-wide kernels gated by SIMD_SUPPORT, AVX2 fallback via the existing autovectorized path) and route the in-memory MemWAL HNSW distance through them. Brings the MemWAL HNSW to parity with hnswlib on insert and search on AVX-512 hardware, with comparable recall and ~44% lower memory, keeping the AVX2 path for other CPUs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…L HNSW Add a parity-suite driver (Lance HNSW primitive vs hnswlib across 100k/500k/1M, capturing throughput and peak RSS) and a --query-repeats option so the query phase runs long enough to measure reliably. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

touch-of-grey · 2026-05-30T16:02:37Z

@jackye1995 ping for review. Summary: the MemWAL HNSW was never algorithmically slower than hnswlib (equal-or-better recall, ~44% less memory); the only gap was that the shipped target-cpu=haswell build never emits AVX-512 for f32 L2/dot. Adding runtime AVX-512 dispatch brings insert/query to parity (sometimes faster) on AVX-512 hardware. perf @1m: cycles within 1%, 41% fewer instructions. Full numbers in the description.

codecov · 2026-05-30T16:42:32Z

Codecov Report

❌ Patch coverage is 35.93750% with 41 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
rust/lance-linalg/src/distance/l2.rs	34.37%	21 Missing ⚠️
rust/lance-linalg/src/distance/dot.rs	36.66%	19 Missing ⚠️
rust/lance/src/dataset/mem_wal/hnsw/storage.rs	50.00%	1 Missing ⚠️

📢 Thoughts on this report? Let us know!

jackye1995

Looks good to me!

@jackye1995

…nch (#7010) ## Summary Adds **sustained-duration measurement** to the MemWAL HNSW parity bench so it reports steady-state throughput under continuous load rather than a short burst. This follows up the AVX-512 distance work in #7009. Changes (bench-only, no library changes): - `--insert-seconds` / `--query-seconds`: run the write (graph build) and read (query) workloads in a loop for a fixed wall-clock duration; report aggregate throughput over all passes (`insert_passes` / `query_passes`). - `insert_core` breakdown: times the insertion itself separately from per-build graph allocation + teardown. - Both knobs added to the Lance bench and the hnswlib reference bench; `run_parity_suite.sh` gains `INSERT_SECONDS` / `QUERY_SECONDS`. Motivation: a sub-second query window gave noisy/optimistic numbers and hid AVX-512 frequency throttling. Measuring 30 s of continuous load makes read/write parity (and where it doesn't hold) reproducible. ## Latest perf results (merged main, c7i.12xlarge, 48 threads, dim=1024, m=12, ef=64, k=10) Sustained 30 s read + 30 s write per size; AVX-512 throttles 3.78 GHz → ~2.5 GHz under all-core load (affects both impls). Read (query_qps), Lance / hnswlib: | rows | ratio | |------|------| | 100k | 1.01 | | 500k | 0.995 | | 1M | 0.996 | Write — insertion compute only (`insert_core`), Lance / hnswlib: | rows | ratio | |------|------| | 100k | 0.99 | | 500k | 0.98 | | 1M | 0.96 | Write — end-to-end incl. per-build graph alloc + teardown: | rows | ratio | |------|------| | 100k | 0.96 | | 500k | 0.89 | | 1M | 0.87 | Takeaways the improved bench makes visible: - **Read is at parity** under sustained throttled load (confirms #7009 holds; the burst window wasn't hiding a regression). - **Insertion compute is at parity** — AVX-512 distance keeps pace even while downclocked. - The end-to-end write gap at scale is **entirely graph allocation/teardown** (Lance's per-node `Vec`/`Mutex`/`Arc` vs hnswlib's flat arrays), not the algorithm — and it's allocator-sensitive: with mimalloc/jemalloc as the global allocator Lance is actually faster than hnswlib (≈1.08–1.25×). No in-tree change is warranted; using a modern allocator for the memtable workload closes it. cc @jackye1995 — please review. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

touch-of-grey and others added 2 commits May 30, 2026 09:01

claude Bot reviewed May 30, 2026

View reviewed changes

github-actions Bot added the performance label May 30, 2026

jackye1995 approved these changes May 30, 2026

View reviewed changes

jackye1995 merged commit eeef69d into lance-format:main May 30, 2026
29 checks passed

touch-of-grey deleted the VectorMemTableHnswParity branch May 30, 2026 17:24

touch-of-grey mentioned this pull request May 30, 2026

test(bench): sustained-duration measurement for MemWAL HNSW parity bench #7010

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(mem_wal): match hnswlib throughput via runtime AVX-512 f32 distance#7009

perf(mem_wal): match hnswlib throughput via runtime AVX-512 f32 distance#7009
jackye1995 merged 2 commits into
lance-format:mainfrom
touch-of-grey:VectorMemTableHnswParity

touch-of-grey commented May 30, 2026

Uh oh!

claude Bot left a comment

Uh oh!

touch-of-grey commented May 30, 2026

Uh oh!

codecov Bot commented May 30, 2026

Uh oh!

jackye1995 left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

touch-of-grey commented May 30, 2026

Summary

Results (c7i.12xlarge, 48 threads, dim=1024, m=12, ef=64, k=10)

Changes

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Claude Code Review

Uh oh!

touch-of-grey commented May 30, 2026

Uh oh!

codecov Bot commented May 30, 2026

Codecov Report

Uh oh!

jackye1995 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants