Round 3 of the laurus perf push for the vector indexing pipeline. Recent rounds shipped HNSW visited-set bitmap (#406), cached `vector_ids` per field (#405), SQ + two-stage rerank (#481), PQ POC + full (#481 stage 3), QuantizedVectorPool optimisations (#522/#526). The remaining gap to Qdrant / Lucene 9 KnnVectorsFormat / LanceDB is in the in-memory and on-disk data structures: `Vec<Vec<Vec>>` HNSW graph (triple pointer chase), AoS f32 vector storage, hand-rolled serial k-means, full graph rebuild on any deletion, `AHashSet` deletion bitmap, per-vector inline `field_name` strings.
Scope
- HNSW graph: CSR (offsets + neighbours) layout with u32 internal IDs.
- Vector storage: SoA / columnar / mmap-friendly fixed-stride payload (new LVS2 segment format).
- HNSW build: lock-free arena, parallel level-generation, thread-local build buffers.
- IVF / PQ training: rayon-parallel k-means + sub-quantiser.
- HNSW deletions: tombstone-then-replace (no full rebuild).
- DeletionBitmap: Roaring with `arc_swap` (cross-cutting with U1 lexical).
- f16 / bf16 / int4 dense storage variants.
- DiskANN / Vamana-style out-of-core HNSW build for segments > RAM.
- Shared / per-schema PQ codebook (instead of per-segment training).
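
To make the first scope item concrete, here is a minimal sketch of a CSR adjacency layout for a single HNSW layer. `CsrLayer`, `from_nested`, `neighbours_of`, and `node_count` are hypothetical names for illustration, not existing laurus APIs, and the real layout would also need to cover multiple layers.

```rust
/// One HNSW layer in CSR form (hypothetical type, illustration only).
pub struct CsrLayer {
    /// `offsets[i]..offsets[i + 1]` is the range in `neighbours` owned by node `i`.
    offsets: Vec<u32>,
    /// All adjacency lists concatenated, stored as u32 internal IDs.
    neighbours: Vec<u32>,
}

impl CsrLayer {
    /// Flatten a nested adjacency list into two contiguous arrays.
    pub fn from_nested(adj: &[Vec<u32>]) -> Self {
        let total: usize = adj.iter().map(Vec::len).sum();
        let mut offsets = Vec::with_capacity(adj.len() + 1);
        let mut neighbours = Vec::with_capacity(total);
        offsets.push(0);
        for list in adj {
            neighbours.extend_from_slice(list);
            // Assumption: the total edge count of a layer fits in u32.
            let end = u32::try_from(neighbours.len()).expect("edge count exceeds u32");
            offsets.push(end);
        }
        Self { offsets, neighbours }
    }

    /// Neighbours of `node` as one contiguous slice (a single bounds-checked range).
    pub fn neighbours_of(&self, node: u32) -> &[u32] {
        let start = self.offsets[node as usize] as usize;
        let end = self.offsets[node as usize + 1] as usize;
        &self.neighbours[start..end]
    }

    /// Number of nodes in this layer.
    pub fn node_count(&self) -> usize {
        self.offsets.len() - 1
    }
}
```

A neighbour lookup here touches two flat arrays instead of following three nested `Vec` pointers, which is the pointer-chase cost the scope item targets.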
Reference: Qdrant / LanceDB / Lucene99 parity
| Area | laurus today | Qdrant / LanceDB / Lucene99 |
| --- | --- | --- |
| HNSW adjacency | `Vec<Vec<Vec>>` per node × layer × neighbour | CSR `int[]` arena (Lucene99HnswVectorsFormat), `GraphLayers` (Qdrant) |
| Vector storage | `Arc<Vec>` per vector + per-record `field_name` inline | fixed-stride `.vec` blob (Lucene99), `ChunkedVectorStorage` (Qdrant), Arrow `fixed_size_list` (LanceDB) |
| Concurrent build | `Vec<RwLock<Vec<u64>>>` | fixed-stride per-node arena + spinlock byte (hnswlib), `AtomicRefCell` (Qdrant) |
| IVF / PQ training | serial Lloyd + serial sub-quantiser | rayon-parallel (Qdrant), OpenMP (FAISS) |
| HNSW deletion | drop graph + full rebuild on next finalize | `mark_deleted` + `replace_deleted` (hnswlib) |
| Deletion bitmap | `RwLock<AHashSet<u64>>` | Roaring (Qdrant `IdTracker`, Lucene) |
| Storage variants | int8 + PQ8 only | int8/int4 (Lucene 9.10), f16/bf16 (Qdrant), int4 PQ |
| Out-of-core build | none — caps at RAM | DiskANN / Vamana two-stage, LanceDB streaming IVF-PQ |
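
As an illustration of the "Vector storage" row, the following is a rough sketch of a fixed-stride store in which every vector lives in one flat `f32` buffer and is addressed by a dense ordinal. `FixedStrideVectors` and its methods are hypothetical names, not laurus or Lucene APIs; the sketch keeps the buffer in memory, whereas the mmap-backed variant would map the same layout from a segment file.

```rust
/// Flat, fixed-stride vector store (hypothetical type, illustration only).
pub struct FixedStrideVectors {
    /// Number of f32 components per vector; the stride of the buffer.
    dim: usize,
    /// All vectors back to back: vector `i` occupies `data[i * dim..(i + 1) * dim]`.
    data: Vec<f32>,
}

impl FixedStrideVectors {
    pub fn new(dim: usize) -> Self {
        assert!(dim > 0, "dimension must be non-zero");
        Self { dim, data: Vec::new() }
    }

    /// Append a vector and return its dense ordinal.
    pub fn push(&mut self, v: &[f32]) -> u32 {
        assert_eq!(v.len(), self.dim, "dimension mismatch");
        // Assumption: a segment holds fewer than 2^32 vectors.
        let ord = u32::try_from(self.len()).expect("too many vectors for u32 ordinals");
        self.data.extend_from_slice(v);
        ord
    }

    /// Borrow vector `ord` without cloning or pointer chasing.
    pub fn get(&self, ord: u32) -> &[f32] {
        let start = ord as usize * self.dim;
        &self.data[start..start + self.dim]
    }

    /// Number of stored vectors.
    pub fn len(&self) -> usize {
        self.data.len() / self.dim
    }
}
```

Because the offset of vector `i` is just `i * dim`, no per-vector allocation or per-record `field_name` is needed to locate a payload.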
Exit criteria
- All sub-issues below closed or explicitly deferred.
- Round-3 benchmark suite (`make bench BENCH=vector_indexing`) shows HNSW build throughput within ±25% of Qdrant's on the SIFT 1M dataset.
Sub-issues
| ID | Issue | Size | Title |
| --- | --- | --- | --- |
| VI-01 | #619 | XL | perf(vector/index): switch HNSW adjacency to CSR (offsets + neighbours) packed layout |
| VI-02 | #620 | XL | perf(vector/index): move from AoS f32 buffers to columnar / SoA vector storage with mmap-friendly alignment |
| VI-03 | #621 | L | perf(vector/index): replace `Vec<RwLock<Vec<u64>>>` build-time graph with lock-free / sharded build |
| VI-04 | #622 | L | perf(vector/index): parallelise IVF k-means training (Lloyd iterations + k-means++ init) and SIMD-vectorise inner loops |
| VI-05 | #623 | L | perf(vector/index): add support for f16 / bf16 / int4 dense storage variants |
| VI-06 | #624 | M | perf(vector/index): tombstone-then-replace ("replace_deleted") for HNSW deletions instead of full graph rebuild |
| VI-07 | #625 | M | perf(vector/index): replace `DeletionBitmap`'s `AHashSet<u64>` with a Roaring bitmap (with optional dense `BitVec` shard) |
| VI-08 | #626 | M | perf(vector/index): replace `HashMap<(u64, String), u32>` field/doc->position index with field-id + dense ordinal lookup |
| VI-09 | #627 | M | perf(vector/index): stop persisting the HNSW graph as redundant manual u64s; pack as CSR + variable-byte deltas |
| VI-10 | #628 | M | perf(vector/index): transpose PQ codes to per-sub-vector columns for SIMD FastScan / PQ4 |
| VI-11 | #629 | M | perf(vector/index): IVF inverted lists: store as contiguous `Vec<u32>` per cluster with offsets, not `Vec<Vec<(u64, String, Vector)>>` |
| VI-12 | #630 | M | perf(vector/index): out-of-core HNSW build for segments > available RAM (DiskANN / Vamana inspired) |
| VI-13 | #631 | M | perf(vector/index): train-once / reuse PQ codebook across segments instead of per-segment training |
| VI-14 | #632 | M | perf(vector/index): build-time visited-set / candidate-heap reuse with thread-local arenas |
| VI-15 | #633 | M | perf(vector/index): tighten on-disk record by promoting `field_name` to a per-segment dictionary |
| VI-16 | #634 | M | perf(vector/index): commit pipeline: write payload incrementally instead of one giant `create_output` at flush |
| VI-17 | #635 | S | perf(vector/index): use `Vec<u32>` doc-id internal type throughout the index (instead of `u64`) |
| VI-18 | #636 | S | perf(vector/index): avoid `vectors.clone()` on the segment-merge read path |
| VI-19 | #637 | S | perf(vector/index): HNSW writer level-distribution generation is single-threaded |
| VI-20 | #638 | S | perf(vector/index): eagerly build per-segment search-time PQ LUT cache |
| VI-21 | #639 | S | perf(vector/index): drop `serde_json` segments.json — use a binary checksum-protected segment manifest |
| VI-22 | #640 | S | perf(vector/index): active-segment (`SegmentedVectorField`) brute-force search uses O(N) f32 scan with no SIMD batching |
| VI-23 | #641 | S | perf(vector/index): HNSW per-level fixed-stride neighbour packing during build (M0 / M known up front) |
| VI-24 | #642 | S | perf(vector/index): eliminate `field_index: HashMap<String, ...>` lookup hash on every NRT query |
| VI-25 | #643 | S | perf(vector/index): multi-vector / sparse vector layout reservation in segment header |
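
For VI-09, one possible shape of "variable-byte deltas" is sketched below: each neighbour list is stored as LEB128-style varints of the gaps between consecutive IDs. This assumes neighbour IDs are sorted ascending by ID before persisting, which the issue title does not state; `encode_sorted` and `decode_sorted` are hypothetical helper names, and the actual on-disk format may differ.

```rust
/// Append `ids` (assumed sorted ascending) to `out` as varint-encoded gaps.
pub fn encode_sorted(ids: &[u32], out: &mut Vec<u8>) {
    let mut prev = 0u32;
    for &id in ids {
        debug_assert!(id >= prev, "ids must be sorted ascending");
        let mut delta = id - prev;
        prev = id;
        // Emit 7 bits per byte, low bits first; the high bit marks "more bytes follow".
        while delta >= 0x80 {
            out.push((delta & 0x7F) as u8 | 0x80);
            delta >>= 7;
        }
        out.push(delta as u8);
    }
}

/// Decode `count` IDs from `bytes`; returns `None` on truncated or overlong input.
pub fn decode_sorted(bytes: &[u8], count: usize) -> Option<Vec<u32>> {
    let mut out = Vec::with_capacity(count);
    let mut pos = 0usize;
    let mut prev = 0u32;
    for _ in 0..count {
        let mut delta = 0u32;
        let mut shift = 0u32;
        loop {
            let byte = *bytes.get(pos)?;
            pos += 1;
            // A u32 needs at most five 7-bit groups (shifts 0, 7, 14, 21, 28).
            if shift > 28 {
                return None;
            }
            delta |= u32::from(byte & 0x7F) << shift;
            if byte & 0x80 == 0 {
                break;
            }
            shift += 7;
        }
        // Undo the delta step; reject input whose running sum overflows u32.
        prev = prev.checked_add(delta)?;
        out.push(prev);
    }
    Some(out)
}
```

Small gaps between sorted u32 IDs fit in one or two bytes each, compared with eight bytes per raw u64 neighbour.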
Round-3 investigation report: ~/.claude/tasks/laurus/20260523_perf_round3_audit/.