Concurrency hardening for BfTree parallel workloads#1158
Conversation
There was a problem hiding this comment.
Pull request overview
This PR hardens diskann-bftree for concurrent workloads by synchronizing neighbor-list mutations (preventing lost updates), correcting dual-store (full + quant) lifecycle behavior, and adding concurrent stress tests to validate the new behavior.
Changes:
- Add striped per-vertex
RwLocksynchronization toNeighborProvider(read locks for reads, write locks for mutations) and introduce concurrent stress tests. - Adjust hard-delete behavior to also delete quantized vectors when present via a small abstraction over the quant store.
- Reorder dual-store writes in
set_elementto write full-precision first, then quantized.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| diskann-bftree/src/neighbors.rs | Adds striped locking around neighbor-list read/modify/write and adds concurrent stress tests. |
| diskann-bftree/src/provider.rs | Adds quant-store cleanup on delete via DeleteQuant and reorders full/quant write sequencing in set_element. |
| diskann-bftree/src/quant.rs | Adds quant-store delete support used by hard-delete cleanup. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Jordan, what is the perf impact of this change? Is ingestion speed the same as before? thanks |
04a1dd8 to
c94cdcc
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1158 +/- ##
==========================================
+ Coverage 89.79% 90.85% +1.06%
==========================================
Files 488 489 +1
Lines 93312 93575 +263
==========================================
+ Hits 83791 85022 +1231
+ Misses 9521 8553 -968
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
|
@copilot give it another review. |
I re-reviewed and found one remaining concurrency issue: beam expansion paths were reading neighbors without taking the striped read lock. I fixed both call sites in commit Validation run:
|
562c341 to
535d925
Compare
|
@harsha-simhadri after chatting a bit with Mark and and changing the approach to locking in the provider. I ran a few more wiki100k streaming benchmarks on my devbox instance it looks like the perf overhead is negligible. |
harsha-simhadri
left a comment
There was a problem hiding this comment.
Left some comments inline. Ack on perf numbers.
harsha-simhadri
left a comment
There was a problem hiding this comment.
Left some comments inline. Ack on perf numbers.
566d5ee to
412b5ce
Compare
Add striped RwLock to NeighborProvider to eliminate the TOCTOU race in append_vector's read-modify-write cycle. Under 8-thread contention on the same vertex, the unprotected path loses 11-51% of edges. Striped locks (16384 stripes, Fibonacci multiply-shift hash): - Constant ~128 KB memory regardless of dataset size (vs ~8 GB at 1B vectors for per-vertex locks) - RwLock allows concurrent readers during search; only writers serialize - Read lock on get_neighbors, write lock on set/append/delete - Internal get_neighbors_unlocked avoids deadlock in append_vector Dual-store write ordering (SetElement for QuantVectorProvider): - Write full-precision (authoritative) before quant - If quant write fails, worst case is a vector missing from quantized search but still reachable in full-precision (benign) Hard-delete cleanup (Delete trait): - Added DeleteQuant trait; QuantVectorProvider deletes its entry, NoStore is a no-op - Single generic Delete impl with Q: DeleteQuant bound - Neighbor adjacency cleanup handled by upper DiskANNIndex layer Tests: - 3 concurrent stress tests proving the TOCTOU fix - Verified 10/10 failures without locks, 10/10 passes with locks Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move StripedLocks from NeighborProvider to BfTreeProvider so that delete and set_element operations are synchronized under the same lock table as neighbor mutations. This eliminates a potential race between concurrent delete and set_element on the same vertex ID. - Extract StripedLocks into its own module (locks.rs) - BfTreeProvider owns Arc<StripedLocks>, shared with NeighborAccessor - Dynamic stripe count based on available_parallelism (4x cores, min 64) - delete() and set_element() now acquire write locks per vertex Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Addresses review feedback requesting documentation of lock acquisition count, ordering, deadlock freedom, and read behavior. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
412b5ce to
fd863e5
Compare
Address review feedback on the concurrent neighbor-access tests: - test_concurrent_append_no_lost_edges: repeat the scenario over 100 outer iterations and scale the writer count to the host's available parallelism (with a floor) so lost-update regressions surface reliably; size max_degree to the workload. - test_concurrent_read_write_consistency: add an upper-bound check that every observed neighbor ID is within the range writers can produce (catching torn reads), wrap the read/write phase in an outer repeat loop, and host-scale writers/readers. - test_concurrent_append_independent_vertices: host-scale writers and leave the multi_thread runtime unsized. Factor out shared test helpers (stress_thread_count, new_provider, new_shared_provider) and apply them consistently across the bftree neighbor tests. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The reader task asserted iterations > 0, but nothing guaranteed a reader was polled before the writers finished and set done. Under thread oversubscription a reader could first run after done was already true, read zero times, and trip the assertion — a scheduling artifact, not a data-consistency defect. Restructure the reader as a do-while so it validates the invariants at least once before observing done. Reading the fully-written state is still a valid, consistent observation, so all consistency checks remain meaningful. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Problem
NeighborProvider::append_vectorperforms a read-modify-write cycle (read existing neighbors → append → write back) without synchronization. Under concurrent mutation, the last writer wins and silently drops the other's edges. Stress testing shows edge loss under multi-thread contention on a single vertex.Additionally, the dual-store
set_elementwrote quant before full-precision, meaning if the full-precision write failed after quant succeeded, it would leave a quantized ghost with no backing data. Anddeletedid not clean up the quant store entry.Solution
Striped
Mutextable for per-vertex synchronizationStripedLocks: a fixed-size table ofMutex<()>stripes (diskann-bftree/src/locks.rs).(cpus * 4).next_power_of_two().max(64)— constant memory regardless of dataset size, suitable for billion-scale workloads.floor(2^64 / phi)), keeping false contention negligible.Why striped instead of per-vertex locks?
BfTree's value proposition is spilling to disk for datasets exceeding RAM. At 1B vectors, per-vertex locks would consume gigabytes of in-memory locks — defeating the purpose. Striped locks keep memory constant at the cost of rare false contention.
Reads are lock-free
get_neighborsandget_vectoracquire no stripe lock. They rely on bf-tree's internal thread safety and the fact that DiskANN search is approximate by design: a reader may briefly observe a partially-updated neighbor list, which affects recall momentarily rather than correctness. Writers (set_element,delete,set_neighbors/write_neighbors,append_vector/write_append) serialize on the stripe.Dual-store write ordering
set_element.Quant store cleanup on delete
DeleteQuanttrait: real impl forQuantVectorProvider(deletes the quant entry), no-op forNoStore.deletenow removes both full-precision and quant entries under the stripe lock. Neighbor-topology cleanup remains in the upperDiskANNIndexlayer (drop_adj_list).Testing
Three new concurrent stress tests in
diskann-bftree/src/neighbors.rs:test_concurrent_append_no_lost_edges— 8 threads × 10 edges to the same vertex; asserts every edge survives.test_concurrent_append_independent_vertices— 8 threads appending to disjoint vertex ranges; asserts total edge count.test_concurrent_read_write_consistency— mixed readers + writers on the same vertex; validates no torn reads.