feat: support distributed bitmap index build #6598

Merged
jackye1995 merged 8 commits into lance-format:main from zhangyue19921010:Distributed-Construction-of-Bitmap-Index-v2
Apr 30, 2026

@zhangyue19921010
Contributor

@zhangyue19921010 zhangyue19921010 commented Apr 23, 2026

closes: #6599

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions github-actions Bot added the enhancement, python, and java labels Apr 23, 2026
@codecov

codecov Bot commented Apr 23, 2026

Codecov Report

❌ Patch coverage is 74.88584% with 110 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| rust/lance-index/src/scalar/bitmap.rs | 65.17% | 65 Missing and 44 partials ⚠️ |
| rust/lance/src/index/create.rs | 99.15% | 1 Missing ⚠️ |

@Xuanwo
Collaborator

Xuanwo commented Apr 23, 2026

@claude review

@github-actions
Contributor

Claude Code is working…

I'll analyze this and get back to you.

@claude claude Bot left a comment

This adds non-trivial distributed bitmap build logic — streaming k-way merge across shard files, a tagged partition-ID scheme to distinguish explicit shard_id from implicit fragment_id, and new shard lifecycle handling in merge_index_metadata. I didn't find any bugs, but the new merge machinery and the parameter-routing contract (shard_id requires fragment_ids; multi-fragment shards require explicit shard_id) are worth a human review, especially given the 64% patch coverage on bitmap.rs.

Overview

The PR introduces distributed bitmap index building across 6 files. The meaningful changes are concentrated in rust/lance-index/src/scalar/bitmap.rs (~330 new lines), which adds:

  • BitmapParameters { shard_id: Option<u32> } and a custom BitmapTrainingRequest to replace the previous DefaultTrainingRequest
  • Per-shard output files using a part_<partition_id>_bitmap_page_lookup.lance naming scheme, where partition_id is a tagged u64 that embeds either the explicit shard_id (tag 0) or an implicit fragment_id (tag 1) to prevent collisions
  • BitmapShardCursor, BitmapHeapItem, and a streaming k-way merge (merge_shards) driven by a BinaryHeap<Reverse<...>>, which unions same-key bitmaps across shards without materializing all keys in memory
  • A public merge_index_files entry point wired into Dataset::merge_index_metadata via IndexType::Bitmap
  • Shard file cleanup after a successful merge (best-effort, logs on failure)
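
The tagged partition-ID scheme from the list above can be sketched as follows. This is a minimal illustration, not the PR's code: the bit layout (tag in the upper 32 bits, id in the lower 32 bits) and the names `PartitionSource`, `encode_partition_id`, and `decode_partition_id` are assumptions made for the example. Only the tag values (0 for an explicit shard_id, 1 for an implicit fragment_id) come from the description.

```rust
// Assumed layout (hypothetical): tag in the upper 32 bits, 32-bit id in the lower 32 bits.
const TAG_SHIFT: u32 = 32;
const TAG_EXPLICIT_SHARD: u64 = 0; // tag 0: caller supplied shard_id
const TAG_IMPLICIT_FRAGMENT: u64 = 1; // tag 1: id derived from a single fragment_id

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PartitionSource {
    ExplicitShard(u32),
    ImplicitFragment(u32),
}

/// Pack the source kind and its id into one u64 so that shard 7 and
/// fragment 7 never map to the same partition id.
fn encode_partition_id(source: PartitionSource) -> u64 {
    match source {
        PartitionSource::ExplicitShard(id) => (TAG_EXPLICIT_SHARD << TAG_SHIFT) | u64::from(id),
        PartitionSource::ImplicitFragment(id) => {
            (TAG_IMPLICIT_FRAGMENT << TAG_SHIFT) | u64::from(id)
        }
    }
}

/// Recover the source kind and id; unknown tags are rejected.
fn decode_partition_id(partition_id: u64) -> Option<PartitionSource> {
    let id = (partition_id & 0xFFFF_FFFF) as u32;
    match partition_id >> TAG_SHIFT {
        TAG_EXPLICIT_SHARD => Some(PartitionSource::ExplicitShard(id)),
        TAG_IMPLICIT_FRAGMENT => Some(PartitionSource::ImplicitFragment(id)),
        _ => None,
    }
}
```

Under this assumed layout, an explicit shard 7 and an implicit fragment 7 encode to different values, which is the collision the tag is meant to prevent.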

The Java builder exposes shardId(int), the Python SupportedDistributedIndices enum gains BITMAP, and both Rust and Python integration tests cover the end-to-end distributed build → merge → commit → query flow.

Security risks

None identified. This is internal indexing logic with no new auth/crypto/permission surface. The shard filename parsing is bounded to internal prefixes/suffixes and uses safe u64::parse.
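
A rough sketch of what bounded shard filename parsing could look like, assuming the part_<partition_id>_bitmap_page_lookup.lance naming scheme described above. The helper names `shard_file_name` and `parse_shard_file_name` are hypothetical and are not taken from the PR.

```rust
const SHARD_PREFIX: &str = "part_";
const SHARD_SUFFIX: &str = "_bitmap_page_lookup.lance";

/// Build the per-shard file name for a partition id.
fn shard_file_name(partition_id: u64) -> String {
    format!("{}{}{}", SHARD_PREFIX, partition_id, SHARD_SUFFIX)
}

/// Return the partition id if `name` matches the shard naming scheme.
/// Anything without the exact prefix and suffix, or with a middle part
/// that is not a valid u64, yields None instead of panicking.
fn parse_shard_file_name(name: &str) -> Option<u64> {
    name.strip_prefix(SHARD_PREFIX)?
        .strip_suffix(SHARD_SUFFIX)?
        .parse::<u64>()
        .ok()
}
```

Because the parse goes through `str::parse::<u64>`, malformed or out-of-range ids are returned as `None` rather than causing a panic.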

Level of scrutiny

Higher-than-mechanical scrutiny is warranted. This is a feature addition to a production index build path, not a config/typo change. The parameter-routing contract is subtle: shard_id is only valid with fragment_ids, and a multi-fragment distributed build requires an explicit shard_id (otherwise partition-id collisions would corrupt the merged output). The k-way merge is new machinery with its own ordering/drain-same-key logic. A regression here could silently produce incorrect bitmap results post-merge.
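
To make the ordering and drain-same-key behavior concrete, here is a simplified sketch of a k-way merge over key-sorted shards using `BinaryHeap<Reverse<...>>`. It is an illustration under assumptions, not the PR's implementation: keys are plain `i64`, a `BTreeSet<u64>` stands in for the real bitmap type, shards are in-memory vectors rather than files, and the output is collected into a `Vec` instead of being streamed to a writer. The names `ShardCursor`, `take_and_advance`, and `merge_sorted_shards` are hypothetical.

```rust
use std::cmp::Reverse;
use std::collections::{BTreeSet, BinaryHeap};

// Stand-in for a real row-id bitmap (assumption for this sketch).
type Bitmap = BTreeSet<u64>;

/// One cursor per shard: the remaining entries plus the bitmap whose key
/// is currently on the heap.
struct ShardCursor {
    entries: std::vec::IntoIter<(i64, Bitmap)>,
    current: Option<Bitmap>,
}

/// Take the cursor's current bitmap and push its next key (if any) onto the heap.
fn take_and_advance(
    idx: usize,
    cursors: &mut [ShardCursor],
    heap: &mut BinaryHeap<Reverse<(i64, usize)>>,
) -> Bitmap {
    let cursor = &mut cursors[idx];
    let bitmap = cursor.current.take().expect("cursor on the heap holds a bitmap");
    if let Some((key, next)) = cursor.entries.next() {
        cursor.current = Some(next);
        heap.push(Reverse((key, idx)));
    }
    bitmap
}

/// Merge shards that are each sorted by key, unioning bitmaps of equal keys.
fn merge_sorted_shards(shards: Vec<Vec<(i64, Bitmap)>>) -> Vec<(i64, Bitmap)> {
    let mut heap = BinaryHeap::new();
    let mut cursors: Vec<ShardCursor> = Vec::new();
    for (idx, shard) in shards.into_iter().enumerate() {
        let mut entries = shard.into_iter();
        let mut current = None;
        if let Some((key, bitmap)) = entries.next() {
            current = Some(bitmap);
            heap.push(Reverse((key, idx)));
        }
        cursors.push(ShardCursor { entries, current });
    }

    let mut merged = Vec::new();
    // Reverse turns the max-heap into a min-heap, so the smallest key pops first.
    while let Some(Reverse((key, idx))) = heap.pop() {
        let mut acc = take_and_advance(idx, &mut cursors, &mut heap);
        // Drain every other shard that is positioned on the same key.
        while let Some(Reverse((next_key, next_idx))) = heap.peek().copied() {
            if next_key != key {
                break;
            }
            heap.pop();
            acc.extend(take_and_advance(next_idx, &mut cursors, &mut heap));
        }
        merged.push((key, acc));
    }
    merged
}
```

Only one entry per shard is held at a time, which is what lets a merge of this shape avoid materializing all keys in memory. An empty heap after initialization simply produces an empty result in this sketch.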

Other factors

  • Codecov reports 64.77% patch coverage on bitmap.rs (68 missing + 44 partial lines). The new merge/cursor/error paths are likely where coverage is thin.
  • The Rust test test_distributed_build_bitmap exercises the shard → merge → commit → query flow, and two Python tests cover both the explicit shard_id path and the single-fragment-per-shard implicit path. Error paths (mismatched value types across shards, empty shards, missing shard files, multi-fragment without shard_id) do not appear to have dedicated tests.
  • Best-effort shard cleanup (cleanup_bitmap_shard_files) only logs a warning on failure; whether that is the desired behavior for leaked part_* files after a successful merge is worth a maintainer's call.
  • The PR was explicitly flagged for @claude review by Xuanwo, so a human-facing summary is appropriate.
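
The parameter-routing contract mentioned above (shard_id requires fragment_ids; multi-fragment shards require an explicit shard_id) could be expressed as a small validation step like the one below. This is a hedged sketch: `RoutingError` and `validate_routing` are hypothetical names, and the actual checks and error types in the PR may differ.

```rust
#[derive(Debug, PartialEq, Eq)]
enum RoutingError {
    /// shard_id was given but no fragment_ids were selected.
    ShardIdWithoutFragments,
    /// More than one fragment was selected without an explicit shard_id.
    MultiFragmentWithoutShardId,
}

/// Check the two rules of the contract; every other combination is accepted.
fn validate_routing(
    shard_id: Option<u32>,
    fragment_ids: Option<&[u32]>,
) -> Result<(), RoutingError> {
    match (shard_id, fragment_ids) {
        (Some(_), None) => Err(RoutingError::ShardIdWithoutFragments),
        (None, Some(frags)) if frags.len() > 1 => {
            Err(RoutingError::MultiFragmentWithoutShardId)
        }
        _ => Ok(()),
    }
}
```

A test for the untested error paths could assert on each `RoutingError` variant directly, for example that two fragments without a shard_id are rejected.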

@zhangyue19921010
Contributor Author

@BubbleCal Hi, would you mind taking a look? Thanks!

Comment thread rust/lance-index/src/scalar/bitmap.rs Outdated
}
}

if heap.is_empty() {
Contributor

Bitmap indexes already support an empty final page lookup file, so we can finish an empty merged bitmap file using the shard schema's value_type instead of failing.

@jackye1995
Contributor

mostly looks good to me, just 1 comment

Contributor

@jackye1995 jackye1995 left a comment

Pending CI

@jackye1995 jackye1995 merged commit 57d2aad into lance-format:main Apr 30, 2026
28 checks passed
Labels

enhancement New feature or request java python

Development

Successfully merging this pull request may close these issues.

Support distributed bitmap index build

3 participants