perf(vector): migrate to InternalId = u32 throughout vector index + search #686

@mosuka

Description

Cross-cutting data-structure rewrite tracked under #537.

Current state

Doc IDs flow through the vector index as u64:

  • HnswGraph::id_to_index: AHashMap<u64, usize> and nodes: Vec<Vec<Vec<u64>>> (vector/index/hnsw/graph.rs:21,27).
  • ConcurrentHnswGraph::nodes: HashMap<u64, Vec<RwLock<Vec<u64>>>> (vector/index/hnsw/writer.rs:40-41).
  • Candidate / ResultCandidate carry id: u64.
  • IVF inverted lists hold Vec<(u64, String, Vector)>.

A single vector segment already holds at most u32::MAX entries elsewhere (the IVF / Flat headers store vector_count: u32), so the extra width of u64 buys nothing inside the index.

Proposed direction

  • Define InternalId(u32) with TryFrom<u64> at the segment boundary.
  • HNSW neighbour lists: Vec<u32> (4 bytes per entry instead of 8, a 50% saving vs Vec<u64>).
  • IVF inverted list ids: Vec<u32>.
  • Pair with the CSR HNSW migration (perf(vector/index): Round-3 roadmap — Qdrant/LanceDB parity #535 children) so the neighbour arena is a single contiguous Vec<u32>.
  • Keep external u64 doc id at the searcher result boundary.
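A minimal sketch of how the boundary conversion might look. Only the InternalId(u32) name and the TryFrom<u64> requirement come from the proposal above; InternalIdOverflow, IdTable, and the choice to assign dense ordinals with a lookup table back to the external u64 are assumptions made for illustration, not the crate's actual design.

```rust
use std::convert::TryFrom;
use std::fmt;

/// Segment-local id as proposed: a u32 newtype.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub struct InternalId(pub u32);

/// Hypothetical error type: the u64 value did not fit in 32 bits.
#[derive(Debug, PartialEq, Eq)]
pub struct InternalIdOverflow(pub u64);

impl fmt::Display for InternalIdOverflow {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "value {} does not fit in a u32 InternalId", self.0)
    }
}

impl std::error::Error for InternalIdOverflow {}

impl TryFrom<u64> for InternalId {
    type Error = InternalIdOverflow;

    /// Checked narrowing; never truncates silently.
    fn try_from(value: u64) -> Result<Self, Self::Error> {
        u32::try_from(value)
            .map(InternalId)
            .map_err(|_| InternalIdOverflow(value))
    }
}

/// Hypothetical per-segment table mapping dense internal ids back to
/// external u64 doc ids (assumption: ids are assigned in insertion order).
#[derive(Default)]
pub struct IdTable {
    external: Vec<u64>,
}

impl IdTable {
    /// Registers an external doc id and returns its segment-local id.
    pub fn push(&mut self, doc_id: u64) -> Result<InternalId, InternalIdOverflow> {
        let id = InternalId::try_from(self.external.len() as u64)?;
        self.external.push(doc_id);
        Ok(id)
    }

    /// Resolves an internal id at the searcher result boundary.
    pub fn external(&self, id: InternalId) -> Option<u64> {
        self.external.get(id.0 as usize).copied()
    }
}
```

With a table like this, the graph, candidate heaps, and inverted lists would only ever see InternalId, and the external u64 would reappear once per hit when results leave the searcher.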

Acceptance

  • HNSW graph RAM drops by ~50% on the neighbour arrays.
  • Candidate heap packed format (X-04) becomes natural.
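The ~50% figure can be checked with simple arithmetic on the neighbour ids alone. The sketch below is illustrative: neighbour_bytes and CsrLayer are hypothetical names, and the CSR layout is one plausible reading of "a single contiguous Vec<u32>", not the layout chosen in the #535 children.

```rust
use std::mem::size_of;

/// Bytes occupied by the neighbour ids themselves for a given id type
/// (excludes per-Vec headers, locks, and hash-map overhead).
pub fn neighbour_bytes<T>(total_edges: usize) -> usize {
    total_edges * size_of::<T>()
}

/// Hypothetical CSR layout for one HNSW layer: the neighbours of node `i`
/// live in `neighbours[offsets[i]..offsets[i + 1]]`.
pub struct CsrLayer {
    offsets: Vec<usize>,
    neighbours: Vec<u32>,
}

impl CsrLayer {
    /// Flattens per-node adjacency lists into one contiguous arena.
    pub fn from_adjacency(lists: &[Vec<u32>]) -> Self {
        let mut offsets = Vec::with_capacity(lists.len() + 1);
        let mut neighbours = Vec::new();
        offsets.push(0);
        for list in lists {
            neighbours.extend_from_slice(list);
            offsets.push(neighbours.len());
        }
        Self { offsets, neighbours }
    }

    /// Returns the neighbour slice of `node`; panics if `node` is out of range.
    pub fn neighbours(&self, node: u32) -> &[u32] {
        let start = self.offsets[node as usize];
        let end = self.offsets[node as usize + 1];
        &self.neighbours[start..end]
    }
}
```

As an illustrative example (the node count and degree are made up, not measurements), 1,000,000 nodes with 32 neighbours each give `neighbour_bytes::<u64>(32_000_000)` = 256,000,000 bytes versus `neighbour_bytes::<u32>(32_000_000)` = 128,000,000 bytes. The id width does not change per-Vec or per-lock overhead, which is why the acceptance criterion is scoped to the neighbour arrays.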

References

  • Lucene uses int ordinals throughout its vector index.
  • Qdrant uses a u32 point offset (PointOffsetType).

Labels: enhancement