perf(maintenance): back DeletionBitmap with a Roaring bitmap #771
Merged
`DeletionBitmap` stored deleted ids as `RwLock<ahash::AHashSet<u64>>` and the `.delmap` (v3) wrote them as a raw `u64` list. For the dense deletion sets that accumulate over a segment's life this is huge: a 10M-doc/10%-deleted segment is ~8 MB on disk (and a multi-MB hash table in RAM whose `is_deleted` probes miss cache). This same bitmap is consumed by both lexical (`filter_deleted_soa`) and every vector index (HNSW/Flat/IVF `is_deleted`), which are per-doc / per-neighbour hot paths.

Swap the internal representation to `roaring::RoaringTreemap` (already a dependency since #578; doc ids are the global u64 space):

- `is_deleted` is now a branch-light bit test that stays cache-resident for dense sets; `delete_document` uses the bool returned by `RoaringTreemap::insert` (true when the id was newly added); `get_deleted_docs` returns ascending ids; `memory_usage` reports the Roaring `serialized_size`.
- `.delmap` gains **v4** (`RoaringTreemap::serialize_into`), and the reader still loads v1/v2/v3 (raw id list) for backward compatibility.
- The public API is unchanged (`is_deleted` / `delete_document` / etc. keep their signatures), so the ~20 lexical/vector consumers benefit with no changes.

Locking is kept as `RwLock<RoaringTreemap>` (in-place O(1) delete). An ArcSwap + RCU model would clone the whole bitmap per delete, which is O(D²) over a segment's deletions, so lock-free `is_deleted` (consumer snapshot-once or batched delete) is left as a follow-up.

`#541`/`#625` (per-domain versions) are subsumed.

Tests: v4 round-trip and v3 backward-read compat; existing lexical/vector deletion tests pass unchanged. Docs (en/ja) updated.

Closes #684
Closes #684 (cross-cutting data-structure rewrite, umbrella #537).
Problem
`DeletionBitmap` (`laurus/src/maintenance/deletion.rs`) stored deleted ids as `RwLock<ahash::AHashSet<u64>>`, and the `.delmap` (v3) wrote them as a raw `u64` list. For the dense deletion sets that accumulate over a segment's life this is huge: a 10M-doc / 10%-deleted segment is ~8 MB on disk (and a multi-MB hash table in RAM whose `is_deleted` probes miss cache). This same bitmap is consumed by both lexical (`filter_deleted_soa`) and every vector index (HNSW / Flat / IVF `is_deleted`), which are per-doc / per-neighbour hot paths.
Change
Swap the internal representation to `roaring::RoaringTreemap` (a dependency since #578; doc ids are the global `u64` space):

- `is_deleted` becomes a branch-light bit test that stays cache-resident for dense sets; `delete_document` uses the bool returned by `RoaringTreemap::insert` (true when the id was newly added); `get_deleted_docs` returns ascending ids; `memory_usage` reports the Roaring `serialized_size`.
- `.delmap` gains a v4 payload (`RoaringTreemap::serialize_into`); the reader still loads v1/v2/v3 (raw id list) for backward compatibility.
- The public API is unchanged (`is_deleted` / `delete_document` / `get_deleted_docs` / … keep their signatures), so the ~20 lexical/vector consumers benefit with no changes. This single change covers both domains and subsumes the per-domain "perf(lexical/index): replace `ahash::AHashSet<u64>` liveDocs with `Roaring`/`FixedBitSet`" #541 (lexical) and "perf(vector/index): replace `DeletionBitmap`'s `AHashSet<u64>` with a Roaring bitmap (with optional dense `BitVec` shard)" #625 (vector).

For a dense set, ~125 KB stays L2-resident (bit test) vs a multi-MB hash table that misses cache on each probe; the RAM/on-disk drop is >50×.
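To illustrate why a dense deletion set turns `is_deleted` into a bit test, here is a minimal std-only sketch of a Roaring-style layout behind an `RwLock`. It is not the PR's code and not the `roaring` crate: `ChunkedBitmap`, `DeletionBitmapSketch`, and the method signatures are hypothetical, and every chunk is a dense 65,536-bit block (the real crate also keeps sparse chunks in a more compact form, so this sketch overstates memory for sparse sets).

```rust
use std::collections::BTreeMap;
use std::sync::RwLock;

const WORDS_PER_CHUNK: usize = 1024; // 1024 * 64 bits = 65,536 ids per chunk

/// Hypothetical stand-in for a Roaring-style bitmap: the high bits of an id
/// select a chunk, the low 16 bits select one bit inside that chunk.
#[derive(Default)]
struct ChunkedBitmap {
    chunks: BTreeMap<u64, Box<[u64; WORDS_PER_CHUNK]>>,
}

impl ChunkedBitmap {
    /// Splits an id into (chunk key, word index, bit mask).
    fn split(id: u64) -> (u64, usize, u64) {
        let low = (id & 0xFFFF) as usize;
        (id >> 16, low / 64, 1u64 << (low % 64))
    }

    /// Sets the bit for `id`; returns true only if it was not set before.
    fn insert(&mut self, id: u64) -> bool {
        let (key, word, mask) = Self::split(id);
        let chunk = self
            .chunks
            .entry(key)
            .or_insert_with(|| Box::new([0u64; WORDS_PER_CHUNK]));
        let newly_added = chunk[word] & mask == 0;
        chunk[word] |= mask;
        newly_added
    }

    /// One map lookup plus one bit test.
    fn contains(&self, id: u64) -> bool {
        let (key, word, mask) = Self::split(id);
        self.chunks.get(&key).map_or(false, |c| c[word] & mask != 0)
    }

    /// Walks chunks and words in order, so ids come out ascending.
    fn to_sorted_vec(&self) -> Vec<u64> {
        let mut out = Vec::new();
        for (&key, chunk) in &self.chunks {
            for (w, &word) in chunk.iter().enumerate() {
                let mut bits = word;
                while bits != 0 {
                    let b = bits.trailing_zeros() as u64;
                    out.push((key << 16) | (w as u64 * 64 + b));
                    bits &= bits - 1; // clear the lowest set bit
                }
            }
        }
        out
    }

    fn bitmap_bytes(&self) -> usize {
        self.chunks.len() * WORDS_PER_CHUNK * 8
    }
}

/// Hypothetical wrapper mirroring the method names used in this PR.
#[derive(Default)]
struct DeletionBitmapSketch {
    deleted: RwLock<ChunkedBitmap>,
}

impl DeletionBitmapSketch {
    fn delete_document(&self, id: u64) -> bool {
        self.deleted.write().unwrap().insert(id) // in-place update under the write lock
    }
    fn is_deleted(&self, id: u64) -> bool {
        self.deleted.read().unwrap().contains(id)
    }
    fn get_deleted_docs(&self) -> Vec<u64> {
        self.deleted.read().unwrap().to_sorted_vec()
    }
    fn memory_usage(&self) -> usize {
        self.deleted.read().unwrap().bitmap_bytes()
    }
}
```

In this sketch a second `delete_document` call for the same id returns `false`, and `get_deleted_docs` needs no sort step because the chunk map and the bit scan are both ordered.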
Locking decision
Kept as `RwLock<RoaringTreemap>` (in-place O(1) delete). An `ArcSwap` + RCU model would clone the whole bitmap on every `delete_document`, which is O(D²) over a segment's deletions (these are applied one-at-a-time at upsert/commit and accumulate to hundreds of thousands), so it would badly regress the write path. Truly lock-free `is_deleted` (consumer snapshot-once, or a batched-delete API) is left as a follow-up.
Verification
- `cargo build` (full workspace + bindings) ✅
- `cargo clippy --all-targets -- -D warnings`: zero warnings ✅
- `cargo fmt --check`: clean ✅
- `cargo test -p laurus --lib`: 1104 passed / 0 failed (+2); `cargo test --workspace`: exit 0, 51 binaries ✅
- `markdownlint-cli2`: 0 errors; docs (en + ja) updated ✅

New unit tests: v4 round-trip (write → read yields the same set/order/metadata) and v3 backward-read (a hand-written v3 payload still loads correctly). The existing lexical/vector deletion tests pass unchanged.
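The shape of those two tests (a new-format round-trip plus a backward read of the old raw list) can be sketched with a version-dispatching reader. Everything about the layout below is an assumption made for illustration: the one-byte version tag, the `u64` count, and the function names are hypothetical, and the v4 payload here is a delta + varint encoding standing in for the Roaring serialization that the PR actually uses. Only v3 and v4 are handled, whereas the real reader also loads v1/v2.

```rust
use std::io::{self, Read, Write};

const V3: u8 = 3; // hypothetical layout: raw little-endian u64 ids
const V4: u8 = 4; // hypothetical layout: sorted ids as delta + varint

fn write_header<W: Write>(w: &mut W, version: u8, count: u64) -> io::Result<()> {
    w.write_all(&[version])?;
    w.write_all(&count.to_le_bytes())
}

/// Legacy writer, kept only so a test can hand-build an old payload.
fn write_v3<W: Write>(w: &mut W, ids: &[u64]) -> io::Result<()> {
    write_header(w, V3, ids.len() as u64)?;
    for id in ids {
        w.write_all(&id.to_le_bytes())?;
    }
    Ok(())
}

fn write_v4<W: Write>(w: &mut W, ids: &[u64]) -> io::Result<()> {
    let mut sorted = ids.to_vec();
    sorted.sort_unstable();
    sorted.dedup();
    write_header(w, V4, sorted.len() as u64)?;
    let mut prev = 0u64;
    for id in sorted {
        let mut delta = id - prev; // gap to the previous id
        prev = id;
        while delta >= 0x80 {
            w.write_all(&[(delta as u8 & 0x7F) | 0x80])?;
            delta >>= 7;
        }
        w.write_all(&[delta as u8])?;
    }
    Ok(())
}

fn read_u64<R: Read>(r: &mut R) -> io::Result<u64> {
    let mut b = [0u8; 8];
    r.read_exact(&mut b)?;
    Ok(u64::from_le_bytes(b))
}

fn read_varint<R: Read>(r: &mut R) -> io::Result<u64> {
    let mut out = 0u64;
    for shift in (0..64).step_by(7) {
        let mut b = [0u8; 1];
        r.read_exact(&mut b)?;
        out |= u64::from(b[0] & 0x7F) << shift;
        if b[0] & 0x80 == 0 {
            return Ok(out);
        }
    }
    Err(io::Error::new(io::ErrorKind::InvalidData, "varint too long"))
}

/// Reads either version and always returns ascending, de-duplicated ids.
fn read_delmap<R: Read>(r: &mut R) -> io::Result<Vec<u64>> {
    let mut version = [0u8; 1];
    r.read_exact(&mut version)?;
    let count = read_u64(r)?;
    let mut ids = Vec::new();
    match version[0] {
        V3 => {
            for _ in 0..count {
                ids.push(read_u64(r)?);
            }
            ids.sort_unstable(); // a raw list carries no order guarantee
            ids.dedup();
        }
        V4 => {
            let mut prev = 0u64;
            for _ in 0..count {
                prev = prev
                    .checked_add(read_varint(r)?)
                    .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "id overflow"))?;
                ids.push(prev);
            }
        }
        other => {
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                format!("unsupported .delmap version {other}"),
            ))
        }
    }
    Ok(ids)
}
```

Dispatching on the version tag keeps old files readable while every new write goes through the compact path, which is the property the backward-read test guards.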
Note on benchmarking
The plan called for an `is_deleted` micro-bench, but `DeletionBitmap` lives in a private module (`maintenance`), unreachable from an external Criterion bench, and making it `pub` purely for a bench contradicts this PR's "no public API change" goal. The RAM/on-disk win is structural (Roaring vs a raw `u64` list), and the dense-set `is_deleted` speedup is the well-known cache-resident-bitmap vs cache-missing-hash-table property; correctness is covered by the tests above.
Follow-up
- Lock-free `is_deleted` (consumer snapshot-once or batched delete), which avoids the COW-per-delete regression that naive ArcSwap would cause.

🤖 Generated with Claude Code
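The copy-on-write cost behind the locking decision and this follow-up can be made concrete with a small model. This is a sketch under stated assumptions, not the project's code: `CowDeletions` and both cost functions are hypothetical, a `BTreeSet<u64>` stands in for the bitmap, and replacing an `Arc` stands in for an `ArcSwap` store. It counts how many ids get copied when every delete publishes a fresh clone, versus one clone for a whole batch.

```rust
use std::collections::BTreeSet;
use std::sync::Arc;

/// Hypothetical copy-on-write deletion set: readers would hold an `Arc`
/// snapshot, and every write publishes a freshly cloned set.
#[derive(Default)]
struct CowDeletions {
    current: Arc<BTreeSet<u64>>,
    ids_copied: u64, // bookkeeping for this sketch only
}

impl CowDeletions {
    /// One-at-a-time delete: pays a full clone of the current set.
    fn delete_document(&mut self, id: u64) {
        self.delete_batch(&[id]);
    }

    /// Batched delete: pays one clone no matter how many ids are in the batch.
    fn delete_batch(&mut self, ids: &[u64]) {
        self.ids_copied += self.current.len() as u64;
        let mut next: BTreeSet<u64> = (*self.current).clone(); // O(current size)
        next.extend(ids.iter().copied());
        self.current = Arc::new(next); // publish the new snapshot
    }

    /// Readers take a snapshot once and probe it without further locking.
    fn snapshot(&self) -> Arc<BTreeSet<u64>> {
        Arc::clone(&self.current)
    }
}

/// Total ids copied when `d` deletions arrive one at a time.
fn copy_cost_one_at_a_time(d: u64) -> u64 {
    let mut cow = CowDeletions::default();
    for id in 0..d {
        cow.delete_document(id);
    }
    cow.ids_copied
}

/// Total ids copied when the same `d` deletions arrive as a single batch.
fn copy_cost_batched(d: u64) -> u64 {
    let mut cow = CowDeletions::default();
    let ids: Vec<u64> = (0..d).collect();
    cow.delete_batch(&ids);
    cow.ids_copied
}
```

With one-at-a-time deletes the copies sum to 0 + 1 + … + (D − 1) = D(D − 1)/2, which is the O(D²) growth named above, while a single batch into an empty set copies nothing. A real bitmap would copy container bytes rather than individual ids, but the cost still grows with the size of the set on every publish.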