feat: add range-based sorted BTree index support #440
touch-of-grey wants to merge 1 commit into lance-format:main
Conversation
Add a `sorted=True` parameter to `create_scalar_index()` for range-based BTree indexing. When enabled, the indexed column is globally sorted via Ray, split into non-overlapping ranges, and each range builds a separate BTree partition, replacing the expensive k-way merge with a fast sequential concat.

- Add `_build_sorted_btree_index` with a 6-phase workflow (read, sort, split, build, merge, commit)
- Add `_build_range_partition` Ray remote task
- Add `TestDistributedSortedBTreeIndexing` with 13 test cases covering basic functionality, correctness vs. fragment-based indexing, string columns, validation, and the single-worker edge case

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
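For reference, here is a minimal sketch of the six phases in terms of public Ray Data APIs. The URI, column name, and worker count are hypothetical, and phases 4-6 depend on Lance index internals introduced in this PR, so they appear only as comments.

```python
import ray

ray.init()

uri, column, num_workers = "/tmp/example.lance", "id", 8  # hypothetical inputs

ds = ray.data.read_lance(uri)                  # Phase 1: read the dataset
sorted_ds = ds.sort(column)                    # Phase 2: global sort (range-partitioning shuffle)
sorted_ds = sorted_ds.repartition(num_workers, shuffle=False)  # Phase 3: split into non-overlapping ranges
table_refs = sorted_ds.to_arrow_refs()         # one Arrow table ref per range
# Phase 4: pass each ref to the _build_range_partition remote task to build one BTree partition.
# Phase 5: concatenate the partitions sequentially (no k-way merge, since ranges do not overlap).
# Phase 6: commit the merged index to the dataset.
```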
Thanks a lot for your contribution! Will review and get back to you in a few days due to the holiday.
logger.info("Phase 3: Splitting into %d range partitions", num_workers)
sorted_ds = sorted_ds.repartition(num_workers, shuffle=False)
table_refs = sorted_ds.to_arrow_refs()
This uses `ray.data.Dataset.sort(column)` to perform a global range-partitioning shuffle. At 1B rows, this becomes a very heavy operation in terms of memory, network, and disk I/O. Could we better document the expected scale limits and recommend machine sizes for different data sizes?
Btw, this may involve object spilling, and we could also estimate the memory/disk requirements in the documentation:
https://docs.ray.io/en/latest/ray-core/objects/object-spilling.html
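For the docs, it may also be worth showing how to point Ray's object spilling at a large local disk, following the linked page; the spill directory below is just a placeholder.

```python
import json
import ray

# Direct spilled objects to a disk with enough free space for the shuffle
# (directory path is a placeholder; see the object-spilling docs linked above).
ray.init(
    _system_config={
        "object_spilling_config": json.dumps(
            {"type": "filesystem", "params": {"directory_path": "/mnt/large_disk/ray_spill"}}
        )
    }
)
```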
I don't have much experience, to be honest, with the exact scale at which Ray sorting becomes limiting. I read this blog: https://www.anyscale.com/blog/ray-breaks-the-usd1-tb-barrier-as-the-worlds-most-cost-efficient-sorting, but also saw https://discuss.ray.io/t/implementation-of-sort-is-not-optimal/12123, so results for Ray-based sorting appear to be mixed. Do you have a sense of the recommended machine size and scale limits to share? Would you suggest any other way to sort, if not Ray?
Yes, based on your investigation, Ray sort looks like the best option for this use case, and I'm aligned with that direction.
From my previous experience, processing ~1 billion rows typically requires at least ~100 CPU cores, with a memory-to-core ratio of around 4 GB/core (i.e., ~400 GB RAM) for workloads at this scale.
If possible, we could run a more robust test with a larger row count (closer to production scale), so we can provide a more reliable capacity estimate.
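As a rough illustration of that estimate (the core count and memory-to-core ratio are the figures above, not measured numbers):

```python
rows = 1_000_000_000     # target scale from the discussion
cores = 100              # suggested minimum CPU cores at ~1B rows
ram_per_core_gb = 4      # suggested memory-to-core ratio
total_ram_gb = cores * ram_per_core_gb
print(f"~{total_ram_gb} GB cluster RAM for ~{rows:,} rows")  # ~400 GB
```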
Overall looks good to me. I will test the performance using a 1-billion-row dataset, and merge it once the pylance PR is merged and the UTs pass. Thanks a lot for your contribution! Happy Chinese New Year!
Summary
- Add `sorted=True` parameter to `create_scalar_index()` for range-based BTree indexing (see the usage sketch after the test plan)

Changes
- `lance_ray/index.py`: Add `_build_range_partition` Ray remote task and `_build_sorted_btree_index` orchestration function
- `tests/test_distributed_indexing.py`: Add `TestDistributedSortedBTreeIndexing` with 13 test cases

Test plan
- `test_sorted_btree_basic`: multi-fragment dataset, equality/range queries
- `test_sorted_btree_vs_fragment_correctness`: 9 parametrized comparisons against fragment-based BTree
- `test_sorted_btree_string_column`: string column sorting
- `test_sorted_btree_rejects_non_btree`: validation error for non-BTREE index types
- `test_sorted_btree_single_worker`: edge case with num_workers=1
- `make fix` passes

Closes #92
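For illustration, a hypothetical call with the new flag might look like the sketch below; the exact signature of `create_scalar_index` in lance-ray is not shown in this excerpt, so every parameter name other than `sorted` is an assumption.

```python
import ray
from lance_ray import create_scalar_index  # assumed import path

ray.init()

create_scalar_index(
    "/tmp/example.lance",   # hypothetical dataset location
    column="id",            # column to index
    index_type="BTREE",     # sorted=True is only supported for BTree indexes
    sorted=True,            # use the range-based sorted path added in this PR
    num_workers=8,          # number of non-overlapping range partitions
)
```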
cc @jackye1995 @chenghao-guo
🤖 Generated with Claude Code