Skip to content

perf(maintenance): migrate DeletionBitmap to RoaringBitmap (lexical + vector) #684

@mosuka

Description

@mosuka

Cross-cutting data-structure rewrite tracked under #537.

Current state

laurus/src/maintenance/deletion.rs:52-77 defines DeletionBitmap as RwLock<ahash::AHashSet<u64>>. This same DeletionBitmap is consumed by both the lexical store and every vector index (HNSW / Flat / IVF). The on-disk .delmap (v3) writes every deleted doc id as a raw u64.

Pain points:

  • For a 10M-doc segment at 10% deletion the on-disk format is ~8 MB; a Roaring bitmap is ~125 KB.
  • Every is_deleted() is a hashed-u64 probe — measurable on HNSW per-neighbour traversals.
  • HNSW search hot path takes a read lock per call.

Proposed direction

  • New .delmap v4 with RoaringBitmap::serialize payload (or FixedBitSet byte array, density-dependent).
  • In-memory representation as arc_swap::ArcSwap<RoaringBitmap>; is_deleted becomes one Arc load + one Roaring::contains.
  • Single migration covering both lexical (InvertedIndexWriter::pending_deletions: HashSet<u64>, MergeEngine::deleted_doc_ids) and vector (HnswSearcher filter, DeletionBitmap).

Acceptance

  • v3 reader path retained for back-compat.
  • HNSW per-neighbour is_deleted shows measurable improvement in microbench.
  • LiveDocs RAM for 10M-doc/10%-deleted segment drops by >50x.

References

  • Lucene FixedBitSet, IndexedDISI (Lucene 10+) for sparse iteration.
  • Qdrant IdTracker (RoaringBitmap-backed).

Round-3 investigation report: ~/.claude/tasks/laurus/20260523_perf_round3_audit/raw/lexical_indexing.md and .../raw/vector_indexing.md.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions