feat: support distributed bitmap index build#6598
Conversation
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
|
@claude review |
|
I'll analyze this and get back to you. |
There was a problem hiding this comment.
This adds non-trivial distributed bitmap build logic — streaming k-way merge across shard files, a tagged partition-ID scheme to distinguish explicit shard_id from implicit fragment_id, and new shard lifecycle handling in merge_index_metadata. I didn't find any bugs, but the new merge machinery and the parameter-routing contract (shard_id requires fragment_ids; multi-fragment shards require explicit shard_id) are worth a human review, especially given the 64% patch coverage on bitmap.rs.
Extended reasoning...
Overview
The PR introduces distributed bitmap index building across 6 files. The meaningful changes are concentrated in rust/lance-index/src/scalar/bitmap.rs (~330 new lines), which adds:
BitmapParameters { shard_id: Option<u32> }and a customBitmapTrainingRequestto replace the previousDefaultTrainingRequest- Per-shard output files using a
part_<partition_id>_bitmap_page_lookup.lancenaming scheme, wherepartition_idis a tagged u64 that embeds either the explicit shard_id (tag 0) or an implicit fragment_id (tag 1) to prevent collisions BitmapShardCursor,BitmapHeapItem, and a streaming k-way merge (merge_shards) driven by aBinaryHeap<Reverse<...>>, which unions same-key bitmaps across shards without materializing all keys in memory- A public
merge_index_filesentry point wired intoDataset::merge_index_metadataviaIndexType::Bitmap - Shard file cleanup after a successful merge (best-effort, logs on failure)
The Java builder exposes shardId(int), the Python SupportedDistributedIndices enum gains BITMAP, and both Rust and Python integration tests cover the end-to-end distributed build → merge → commit → query flow.
Security risks
None identified. This is internal indexing logic with no new auth/crypto/permission surface. The shard filename parsing is bounded to internal prefixes/suffixes and uses safe u64::parse.
Level of scrutiny
Higher-than-mechanical scrutiny is warranted. This is a feature addition to a production index build path, not a config/typo change. The parameter-routing contract is subtle: shard_id is only valid with fragment_ids, and a multi-fragment distributed build requires an explicit shard_id (otherwise partition-id collisions would corrupt the merged output). The k-way merge is new machinery with its own ordering/drain-same-key logic. A regression here could silently produce incorrect bitmap results post-merge.
Other factors
- Codecov reports 64.77% patch coverage on bitmap.rs (68 missing + 44 partial lines). The new merge/cursor/error paths are likely where coverage is thin.
- The Rust test
test_distributed_build_bitmapexercises the shard → merge → commit → query flow, and two Python tests cover both the explicit shard_id path and the single-fragment-per-shard implicit path. Error paths (mismatched value types across shards, empty shards, missing shard files, multi-fragment without shard_id) do not appear to have dedicated tests. - Best-effort shard cleanup (
cleanup_bitmap_shard_files) silently warns on failure — worth a maintainer's call on whether that's the desired behavior for leakedpart_*files after a successful merge. - The PR was explicitly flagged for
@claude reviewby Xuanwo, so a human-facing summary is appropriate.
|
@BubbleCal Hi. Would u mind to take a look? Thanks! |
| } | ||
| } | ||
|
|
||
| if heap.is_empty() { |
There was a problem hiding this comment.
bitmap indexes already support an empty final page lookup file, we can finish an empty merged bitmap file using the shard schema’s value_type instead of failing
|
mostly looks good to me, just 1 comment |

closes: #6599