perf: cross-cutting data-structure rewrites (Round 3) #537

@mosuka

Description

Round 3 of the laurus perf push surfaces five data-structure rewrites that cut across the lexical/vector boundary. Splitting them across the per-area umbrellas would create merge conflicts and obscure the "shared infra" theme, so this umbrella tracks them together.

Related per-area umbrellas: #533 (lexical/index), #534 (lexical/search), #535 (vector/index), and the vector/search umbrella filed alongside this one.

Scope

  • DeletionBitmap → Roaring/FixedBitSet (cross-cutting with lexical + vector index umbrellas): the current `RwLock<AHashSet>` deletion set is shared by both the lexical and the vector index, so it needs a single migration. Choose the sparse or dense representation via a density heuristic.
  • FieldId(u16) registry (cross-cutting with lexical/index + vector/index + vector/search): replace `HashMap<String, _>` field lookup throughout the lexical writer, vector storage pools, and search hot paths. `String` field names retained only at the public API boundary.
  • Internal-id u32 migration (cross-cutting with vector index + search): doc IDs are carried as `u64` end-to-end inside the vector index even though they are segment-local ordinals; the same opportunity exists in the lexical search packed top-K. Use `InternalId = u32` everywhere downstream of segment open.
  • Packed-u64 top-K collector & heap (cross-cutting with lexical/search + vector/search): both the lexical `TopDocsCollector` and the HNSW `Candidate / ResultCandidate` heap could use `(score_bits << 32) | doc_id` packed u64 entries for single-integer compares (no NaN branch).
  • Columnar / packed-bits DocValues column format (cross-cutting with lexical/index + lexical/search): per-segment Numeric / Sorted / Bytes columns with `bitpacking` (already a dep) for the lexical side; reused by vector index for ordinal stores.
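As a rough illustration of the "sparse/dense decision via density heuristic" in the first item, here is a minimal std-only sketch. It is not the laurus implementation: `DeletionSet`, `Repr`, and the 1/32 threshold are hypothetical names and values chosen for this example, a sorted `Vec<u32>` stands in for a Roaring-style sparse container, and a `Vec<u64>` stands in for a fixed bit set. The threshold is the memory break-even point (32 bits per deleted ID versus 1 bit per document in the segment). Locking, which the current `RwLock<AHashSet>` provides, is left out.

```rust
/// Hypothetical alias matching the proposed `InternalId = u32`.
type InternalId = u32;

/// Illustrative representations: sorted ids (sparse) or one bit per doc (dense).
enum Repr {
    Sparse(Vec<InternalId>),
    Dense(Vec<u64>),
}

/// Hypothetical deletion set for one segment holding `max_doc` documents.
struct DeletionSet {
    max_doc: u32,
    deleted: usize,
    repr: Repr,
}

impl DeletionSet {
    fn new(max_doc: u32) -> Self {
        Self { max_doc, deleted: 0, repr: Repr::Sparse(Vec::new()) }
    }

    /// Marks `doc` as deleted; returns true if it was not deleted before.
    fn delete(&mut self, doc: InternalId) -> bool {
        if doc >= self.max_doc {
            return false;
        }
        let newly = match &mut self.repr {
            Repr::Sparse(ids) => match ids.binary_search(&doc) {
                Ok(_) => false,
                Err(pos) => {
                    ids.insert(pos, doc);
                    true
                }
            },
            Repr::Dense(words) => {
                let (w, bit) = ((doc / 64) as usize, 1u64 << (doc % 64));
                let was_set = words[w] & bit != 0;
                words[w] |= bit;
                !was_set
            }
        };
        if newly {
            self.deleted += 1;
            self.maybe_densify();
        }
        newly
    }

    /// Assumed heuristic: switch to dense once 32 bits per deleted id would
    /// cost more memory than 1 bit per document in the segment.
    fn maybe_densify(&mut self) {
        let words = match &self.repr {
            Repr::Sparse(ids) if ids.len() as u64 * 32 > self.max_doc as u64 => {
                let mut words = vec![0u64; (self.max_doc as usize + 63) / 64];
                for &d in ids {
                    words[(d / 64) as usize] |= 1u64 << (d % 64);
                }
                words
            }
            _ => return,
        };
        self.repr = Repr::Dense(words);
    }

    fn is_deleted(&self, doc: InternalId) -> bool {
        match &self.repr {
            Repr::Sparse(ids) => ids.binary_search(&doc).is_ok(),
            Repr::Dense(words) => words
                .get((doc / 64) as usize)
                .map_or(false, |w| w & (1u64 << (doc % 64)) != 0),
        }
    }

    fn len(&self) -> usize {
        self.deleted
    }

    fn is_dense(&self) -> bool {
        matches!(self.repr, Repr::Dense(_))
    }
}
```

A real Roaring bitmap makes this choice per 2^16-ID container rather than per segment, so the sketch only shows the shape of the decision, not the final layout.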

Why this is a separate umbrella

Each item touches both lexical and vector code paths. They are not standalone, additive perf improvements: each one unlocks several sub-issues in the per-area umbrellas (e.g. `TopFieldCollector` ordinals need the columnar DV; `filterable HNSW` benefits from the Roaring liveDocs; packed candidate heaps need u32 internal IDs).
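To make the last dependency concrete (packed candidate heaps need u32 internal IDs), here is a minimal sketch of a `(score_bits << 32) | doc_id` top-K collector. It is an assumption-laden illustration, not the existing `TopDocsCollector` or HNSW heap: `TopK`, `score_key`, `pack`, and `unpack` are hypothetical names. One detail the raw formula glosses over: plain `f32::to_bits` only orders correctly for non-negative scores, so the sketch applies the usual sign-flip transform to get a key whose unsigned order matches the float order.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical alias matching the proposed `InternalId = u32`.
type InternalId = u32;

/// Maps an f32 to a u32 whose unsigned order matches the float order.
/// Negative values have all bits flipped; non-negative values get the sign bit set.
fn score_key(score: f32) -> u32 {
    let bits = score.to_bits();
    if bits & 0x8000_0000 != 0 { !bits } else { bits | 0x8000_0000 }
}

/// Inverse of `score_key`.
fn key_to_score(key: u32) -> f32 {
    let bits = if key & 0x8000_0000 != 0 { key & 0x7FFF_FFFF } else { !key };
    f32::from_bits(bits)
}

/// Packs score and doc id into one u64; comparing entries is one integer compare.
fn pack(score: f32, doc: InternalId) -> u64 {
    ((score_key(score) as u64) << 32) | doc as u64
}

fn unpack(entry: u64) -> (f32, InternalId) {
    (key_to_score((entry >> 32) as u32), entry as u32)
}

/// Hypothetical top-K collector: a min-heap of packed entries, capped at `k`.
struct TopK {
    k: usize,
    heap: BinaryHeap<Reverse<u64>>,
}

impl TopK {
    fn new(k: usize) -> Self {
        Self { k, heap: BinaryHeap::with_capacity(k) }
    }

    /// Keeps the entry if the heap is not full or it beats the current worst.
    fn offer(&mut self, score: f32, doc: InternalId) {
        let entry = pack(score, doc);
        if self.heap.len() < self.k {
            self.heap.push(Reverse(entry));
        } else if let Some(mut worst) = self.heap.peek_mut() {
            if entry > worst.0 {
                *worst = Reverse(entry);
            }
        }
    }

    /// Returns the collected hits, best score first.
    fn into_sorted(self) -> Vec<(f32, InternalId)> {
        let mut entries: Vec<u64> = self.heap.into_iter().map(|Reverse(e)| e).collect();
        entries.sort_unstable_by(|a, b| b.cmp(a));
        entries.into_iter().map(unpack).collect()
    }
}
```

With this layout, ties on score are broken in favor of the higher doc ID; packing `!doc_id` instead would prefer lower IDs. A positive NaN maps to a key above `+inf`, so callers that can produce NaN scores would need to filter them before `offer`. The packing only works because the ID fits in 32 bits, which is why the u32 migration comes first.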

Exit criteria

  • All cross-cutting sub-issues below closed or explicitly deferred.
  • Per-umbrella benchmarks show the dependent issues unlocked (e.g. `TopFieldCollector` via ordinal column delivers 2x sort throughput on numeric fields).
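For readers unfamiliar with the ordinal column referred to above, the following is a minimal scalar sketch of a fixed-width bit-packed column. It is a std-only stand-in written for illustration and does not use or mirror the `bitpacking` crate's API; `PackedColumn` and its methods are hypothetical names. Each value is stored in just enough bits to hold the largest ordinal in the segment, so a sort-by-field collector can compare small integers instead of decoding the field values.

```rust
/// Hypothetical fixed-width packed column of per-document ordinals.
struct PackedColumn {
    bits: u32,       // bits per value, between 1 and 32
    len: usize,      // number of documents in the column
    words: Vec<u64>, // packed payload, values laid out back to back
}

impl PackedColumn {
    /// Packs `values` using the smallest width that fits the maximum value.
    fn from_values(values: &[u32]) -> Self {
        let max = values.iter().copied().max().unwrap_or(0);
        let bits = (32 - max.leading_zeros()).max(1);
        let total_bits = values.len() * bits as usize;
        let mut words = vec![0u64; (total_bits + 63) / 64];
        for (i, &v) in values.iter().enumerate() {
            let bit_pos = i * bits as usize;
            let (w, off) = (bit_pos / 64, (bit_pos % 64) as u32);
            words[w] |= (v as u64) << off;
            // A value may straddle two words; spill its high bits into the next one.
            if off + bits > 64 {
                words[w + 1] |= (v as u64) >> (64 - off);
            }
        }
        Self { bits, len: values.len(), words }
    }

    /// Random access by document ordinal; `None` when out of range.
    fn get(&self, doc: usize) -> Option<u32> {
        if doc >= self.len {
            return None;
        }
        let bit_pos = doc * self.bits as usize;
        let (w, off) = (bit_pos / 64, (bit_pos % 64) as u32);
        let mut raw = self.words[w] >> off;
        if off + self.bits > 64 {
            raw |= self.words[w + 1] << (64 - off);
        }
        let mask = (1u64 << self.bits) - 1;
        Some((raw & mask) as u32)
    }
}
```

Block-oriented SIMD packers trade this per-value random access for much faster bulk decode, so the production format may well look different from this layout.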

Sub-issues

| ID | Issue | Size | Title |
|------|------|------|-------|
| X-01 | #684 | X | perf(maintenance): migrate DeletionBitmap to RoaringBitmap (lexical + vector) |
| X-02 | #685 | X | perf: FieldId(u16) registry across crates (drop HashMap<String, _> from hot paths) |
| X-03 | #686 | X | perf(vector): migrate to InternalId = u32 throughout vector index + search |
| X-04 | #687 | X | perf: packed-u64 top-K candidate / heap entries (lexical + vector) |
| X-05 | #688 | X | perf(lexical): columnar / packed-bits DocValues column format |

Round-3 investigation report: ~/.claude/tasks/laurus/20260523_perf_round3_audit/.

Labels: enhancement (New feature or request)
