perf: reduce IO requests from loading bitmap index#6703

Merged
wjones127 merged 2 commits into lance-format:main from wjones127:feat/stream-bitmap-index-load on May 6, 2026

@wjones127 wjones127 commented May 6, 2026

The current code loads 2048 values at a time from the lookup file, regardless of value width. This PR switches to a streaming read and defers the choice of batch size to the reader.

In my simple benchmark, this reduced IO requests by about 55x (from 497 down to 9).

import lance
import pyarrow as pa

data = pa.table({"id": range(1_000_000)})
ds = lance.write_dataset(data, "/tmp/test_bitmap", mode="overwrite")

ds.create_scalar_index("id", "bitmap")

ds = lance.dataset("/tmp/test_bitmap", block_size=64 * 1024)
ds.io_stats_incremental()
ds.to_table(filter="id == 123456").to_pandas()
print(ds.io_stats_incremental())
# Before: IOStats(read_iops=497, read_bytes=2621958, write_iops=0, write_bytes=0)
# After: IOStats(read_iops=9, read_bytes=2621958, write_iops=0, write_bytes=0)
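
As a rough sanity check on the benchmark numbers above, here is a sketch of the request-count arithmetic. It assumes each fixed-size chunk of 2048 values costs exactly one read request, which is an assumption about the old behavior rather than something measured here. The helper name `chunked_read_requests` is hypothetical.

```python
import math


def chunked_read_requests(num_values: int, chunk_size: int = 2048) -> int:
    """Requests needed if every chunk of `chunk_size` values is one read (assumed)."""
    return math.ceil(num_values / chunk_size)


# Figures reported in the benchmark above.
before_iops, after_iops = 497, 9

# The benchmark table holds 1M unique ids.
chunked = chunked_read_requests(1_000_000)

print(chunked)                              # chunked reads under the assumption
print(before_iops - after_iops)             # requests eliminated in the benchmark
print(round(before_iops / after_iops, 1))   # reduction factor
```

Under that assumption the old path would issue 489 chunked reads, which is close to the 488 requests that disappeared in the benchmark.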

Closes #6660

🤖 Generated with Claude Code

Previously, BitmapIndex::load read all keys into a single RecordBatch
before iterating them into the index map. With high-cardinality columns
(e.g. 1M unique 1KB strings), this roughly doubled peak memory because
both the batch and the map had to coexist.

Add a streaming read_range_stream API to IndexReader that returns a
stream of batches instead of a single collected batch. The
current_reader::FileReader implementation uses the new
read_range_as_stream method on FileReader to pass a per-call
batch_size_bytes (8MB default), enabling byte-bounded batching for
v2.1+ files with a graceful fallback for v2.0.
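
To illustrate why a byte budget behaves differently from a fixed row count, here is a minimal sketch. It is not Lance's reader logic: `rows_per_batch` and the per-row widths are hypothetical, and the budget below reads the "8MB default" as 8 * 1024 * 1024 bytes, which is an assumption.

```python
def rows_per_batch(batch_size_bytes: int, bytes_per_row: int) -> int:
    """Rows that fit in a byte budget, always at least one row (illustrative only)."""
    if bytes_per_row <= 0:
        raise ValueError("bytes_per_row must be positive")
    return max(1, batch_size_bytes // bytes_per_row)


BATCH_SIZE_BYTES = 8 * 1024 * 1024  # assumed reading of the 8MB default

narrow = rows_per_batch(BATCH_SIZE_BYTES, 8)     # e.g. 64-bit integer keys
wide = rows_per_batch(BATCH_SIZE_BYTES, 1024)    # e.g. 1KB string keys

print(narrow, wide)
```

With a byte budget, narrow keys are read in large batches and wide keys in smaller ones, whereas a fixed row count produces the same number of batches for both.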

BitmapIndex::load now consumes the keys column as a stream, keeping
only one batch in memory at a time. This also reduces IOPs from
ceil(N/2048) to a single streaming read with readahead.
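
The streaming pattern described above can be sketched in plain Python. This is only an illustration: Python lists stand in for Arrow record batches, a key-to-position dict stands in for the index map, and the names `key_batches` and `load_index_map` are hypothetical rather than part of Lance.

```python
from typing import Dict, Iterable, Iterator, List


def key_batches(keys: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield keys in slices, standing in for a stream of record batches."""
    for start in range(0, len(keys), batch_size):
        yield keys[start:start + batch_size]


def load_index_map(batches: Iterable[List[str]]) -> Dict[str, int]:
    """Build a key -> position map while holding only the current batch."""
    index_map: Dict[str, int] = {}
    position = 0
    for batch in batches:
        # Each batch can be dropped once its keys are copied into the map.
        for key in batch:
            index_map[key] = position
            position += 1
    return index_map


keys = [f"key-{i}" for i in range(10)]
index_map = load_index_map(key_batches(keys, batch_size=4))
print(index_map["key-7"])
```

In this toy version the full `keys` list is still in memory. In a real reader the batches would come from storage, so only the map and one batch would be resident at a time.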

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions Bot commented May 6, 2026

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@wjones127 wjones127 changed the title from "Stream keys during bitmap index load to reduce peak memory" to "perf: reduce IO requests from loading bitmap index" May 6, 2026
Remove the dedicated read_range_as_stream method from FileReader and
instead call read_stream_projected directly in the IndexReader override.
Return Pin<Box<dyn RecordBatchStream>> to avoid double boxing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

codecov Bot commented May 6, 2026

Codecov Report

❌ Patch coverage is 64.28571% with 5 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| rust/lance-index/src/scalar.rs | 0.00% | 3 Missing ⚠️ |
| rust/lance-index/src/scalar/bitmap.rs | 77.77% | 0 Missing and 2 partials ⚠️ |

@wjones127 wjones127 marked this pull request as ready for review May 6, 2026 20:48

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@westonpace westonpace left a comment

Nice fix!

@wjones127 wjones127 merged commit cbda485 into lance-format:main May 6, 2026
28 checks passed

Development

Successfully merging this pull request may close these issues.

Loading a high cardinality bitmap index can generate a lot of IO requests