Skip to content

Conversation

@jackye1995
Copy link
Contributor

Patch to release/v1.0

This also include related PRs to making the related benchmark and CI work

The btree and bitmap index are important and critical features but
lacking a good micro-benchmark for the search (we do have some
macro-benchmarks). This PR adds micro benchmarks for btree and bitmap
search and covers a variety of scenarios.

* Floats (fixed-width data) vs. Strings (variable width data)
* Equality vs. Range
* Small result set vs. Large result set
* High cardinality data vs. Low cardinality data
lance-format@82df343
parallelizes bitmap loading but it introduced a clone of the bitmap
index. Since this was a clone of the index itself and not just an arc
clone it ended up being a very expensive deep clone, especially at
scale. It also led to unbounded memory usage since all clones needed to
live at the same time and this could cause the search to crash.

Fortunately, these clones are not needed, so we can simply remove them.
@chatgpt-codex-connector
Copy link

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@github-actions github-actions bot added bug Something isn't working python labels Dec 5, 2025
@jackye1995 jackye1995 changed the title fix: remove expensive clone in bitmap search fix: critical fixes for 1.0.0-rc3 Dec 5, 2025
valkum and others added 6 commits December 5, 2025 11:55
Part of lancedb/lancedb#2771, continuing
lancedb/lancedb#2568.
`lance-namespace-impls` uses `lance` which pulled in default features
(aws,gcp,azure,oss), which pulled `aws-*` crates into `lancedb` even
when `lancedb` was used with `default-features = false`.

I don't have tests at hand, but it would probably be good to run `cargo
tree -p lance -e no-build -e no-dev --no-default-features -i aws-config`
in CI (maybe for each crate?) to avoid adding any dependency back by
default in a future PR.

---------

Co-authored-by: Jack Ye <yezhaoqin@gmail.com>
This PR will remove cached wheels before building to address CI broken
like:

```shell
Processing ./target/wheels/pylance-1.1.0b1-cp39-abi3-manylinux_2_28_aarch64.whl
Processing ./target/wheels/pylance-1.1.0b2-cp39-abi3-manylinux_2_28_aarch64.whl (from pylance==1.1.0b2)
ERROR: Cannot install pylance 1.1.0b1 (from /runner/_work/lance/lance/python/target/wheels/pylance-1.1.0b1-cp39-abi3-manylinux_2_28_aarch64.whl) and pylance 1.1.0b2 (from /runner/_work/lance/lance/python/target/wheels/pylance-1.1.0b2-cp39-abi3-manylinux_2_28_aarch64.whl) because these package versions have conflicting dependencies.

The conflict is caused by:
The user requested pylance 1.1.0b1 (from /runner/_work/lance/lance/python/target/wheels/pylance-1.1.0b1-cp39-abi3-manylinux_2_28_aarch64.whl)
The user requested pylance 1.1.0b2 (from /runner/_work/lance/lance/python/target/wheels/pylance-1.1.0b2-cp39-abi3-manylinux_2_28_aarch64.whl)

Additionally, some packages in these conflicts have no matching distributions available for your environment:
pylance

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip to attempt to solve the dependency conflict

ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts
```

---

**Parts of this PR were drafted with assistance from Codex (with
`gpt-5.1-codex-max`) and fully reviewed and edited by me. I take full
responsibility for all changes.**

Signed-off-by: Xuanwo <github@xuanwo.io>
I've noticed that debug / info logging has been flooded lately with
things like...

```
[2025-11-18T14:12:05Z INFO  lance::events] target="lance::execution" type="plan_run" plan_summary="Projection(OneShot)" output_rows=2 iops=0 requests=0 bytes_read=0 indices_loaded=0 parts_loaded=0 index_comparisons=0 
[2025-11-18T14:12:05Z INFO  lance::events] target="lance::execution" type="plan_run" plan_summary="Projection(OneShot)" output_rows=2 iops=0 requests=0 bytes_read=0 indices_loaded=0 parts_loaded=0 index_comparisons=0 
[2025-11-18T14:12:05Z INFO  lance::events] target="lance::execution" type="plan_run" plan_summary="Projection(OneShot)" output_rows=2 iops=0 requests=0 bytes_read=0 indices_loaded=0 parts_loaded=0 index_comparisons=0 
[2025-11-18T14:12:05Z INFO  lance::events] target="lance::execution" type="plan_run" plan_summary="Projection(OneShot)" output_rows=2 iops=0 requests=0 bytes_read=0 indices_loaded=0 parts_loaded=0 index_comparisons=0 
[2025-11-18T14:12:05Z INFO  lance::events] target="lance::execution" type="plan_run" plan_summary="Projection(OneShot)" output_rows=2 iops=0 requests=0 bytes_read=0 indices_loaded=0 parts_loaded=0 index_comparisons=0 
```

This is because we create and execute a dummy plan to apply projection.
This "dummy plan execution" is overly heavyweight on its own and should
be optimized (see lance-format#5069).
However, this short-term patch will at least remove the extraneous
logging
If we log `self` then it will log the object store which can be a lot of
info (especially if it is a memory object store)
This adds in-memory datasets to the random access benchmarks and adds a
parallel version of the benchmark. In addition we instrument the
project_batch function which currently is too heavyweight and it is
useful to track.
@codecov
Copy link

codecov bot commented Dec 5, 2025

Codecov Report

❌ Patch coverage is 72.72727% with 3 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance-datafusion/src/exec.rs 66.66% 3 Missing ⚠️

📢 Thoughts on this report? Let us know!

@jackye1995 jackye1995 force-pushed the release-1.0-patch-btree branch from 86e3653 to 06bb00e Compare December 5, 2025 20:03
…ance-format#5392)

When `enable_stable_row_ids=true`, `take_blobs_by_indices` was failing
with "index out of bounds" for rows in fragment 1+. The bug was caused
by passing row addresses to `blob::take_blobs` which expected row IDs.

Root cause:
- `take_blobs_by_indices` converts indices to row addresses
- It passed these addresses to `take_blobs` which calls `take_builder`
- `TakeBuilder.get_row_addrs()` looked up the values in the row ID index
- For fragment 0: addresses (0,1,2) matched row IDs (0,1,2) by accident
- For fragment 1+: addresses (4294967296+) didn't match any row IDs
- This caused empty results and missing `_rowaddr` column → panic

Fix:
- Add `take_blobs_by_addresses()` that uses
`TakeBuilder::try_new_from_addresses` to bypass the row ID index lookup
- Update `take_blobs_by_indices` to call the new function
- Add defensive fix in `do_take_rows` to include `_rowaddr` column in
empty batches

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude <noreply@anthropic.com>
@jackye1995 jackye1995 force-pushed the release-1.0-patch-btree branch from 06bb00e to 9023ad4 Compare December 5, 2025 20:20
@jackye1995 jackye1995 merged commit 0140adf into lance-format:release/v1.0 Dec 5, 2025
30 of 32 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working python

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants