
feat: add range-based sorted BTree index support #440

Open
touch-of-grey wants to merge 1 commit into lance-format:main from touch-of-grey:feat/sorted-btree-index

Conversation

@touch-of-grey

Summary

  • Add sorted=True parameter to create_scalar_index() for range-based BTree indexing (usage sketch after this list)
  • When enabled, the indexed column is globally sorted via Ray, split into non-overlapping ranges, and each range builds a separate BTree partition
  • Replaces the expensive k-way merge with a fast sequential concat for large datasets
  • Implements 6-phase workflow: read → sort → split → build → merge → commit
  • Depends on pylance PR: feat(python): add preprocessed_data and range_id for range-based BTree index lance#5941
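
A minimal usage sketch of the new parameter. Only sorted=, num_workers=, the BTREE index type, and the create_scalar_index name are taken from this PR; the dataset URI and the rest of the call shape are illustrative assumptions:

```python
import ray
from lance_ray import create_scalar_index  # assumed export from lance_ray/index.py

ray.init()

# Build a range-based sorted BTree index on the "id" column.
# sorted=True triggers the global Ray sort, the split into
# non-overlapping ranges, and the per-range partition build
# described above.
create_scalar_index(
    "s3://bucket/my_dataset.lance",  # illustrative dataset URI
    column="id",
    index_type="BTREE",
    sorted=True,
    num_workers=8,  # one range partition per worker
)
```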

Changes

  • lance_ray/index.py: Add _build_range_partition Ray remote task and _build_sorted_btree_index orchestration function (remote-task sketch after this list)
  • tests/test_distributed_indexing.py: Add TestDistributedSortedBTreeIndexing with 13 test cases
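
A hedged sketch of the remote task. The name comes from this PR; the body, and the preprocessed_data/range_id parameters that land in pylance via lance#5941, are assumptions about the eventual API:

```python
import ray


@ray.remote
def _build_range_partition(table_ref, range_id, uri, column):
    """Build one BTree index partition over a pre-sorted range.

    Sketch only: the real task in lance_ray/index.py may differ. Each
    worker receives an Arrow table holding one non-overlapping, already
    globally sorted range of the indexed column, so no re-sorting or
    k-way merge is needed afterwards.
    """
    import lance

    ds = lance.dataset(uri)
    # preprocessed_data / range_id are the hooks added by pylance PR
    # lance#5941 (keyword names assumed from that PR's title).
    ds.create_scalar_index(
        column,
        "BTREE",
        preprocessed_data=table_ref,
        range_id=range_id,
    )
```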

Test plan

  • test_sorted_btree_basic — multi-fragment dataset, equality/range queries (sketched after this list)
  • test_sorted_btree_vs_fragment_correctness — 9 parametrized comparisons against the fragment-based BTree index
  • test_sorted_btree_string_column — string column sorting
  • test_sorted_btree_rejects_non_btree — validation error for non-BTREE index types
  • test_sorted_btree_single_worker — edge case with num_workers=1
  • make fix passes
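
As referenced above, a hedged sketch of what test_sorted_btree_basic could look like; the fixtures, assertions, and the exact create_scalar_index call shape are assumptions, and the real test lives in tests/test_distributed_indexing.py:

```python
import lance
import pyarrow as pa

from lance_ray import create_scalar_index  # assumed export


def test_sorted_btree_basic(tmp_path):
    # Multi-fragment dataset: two writes produce at least two fragments.
    uri = str(tmp_path / "ds.lance")
    lance.write_dataset(pa.table({"id": list(range(0, 500))}), uri)
    lance.write_dataset(
        pa.table({"id": list(range(500, 1000))}), uri, mode="append"
    )

    # Build the index through the new code path (assumed call shape).
    create_scalar_index(uri, column="id", index_type="BTREE", sorted=True)

    ds = lance.dataset(uri)
    # Equality query: exactly one matching row.
    assert ds.to_table(filter="id = 123").num_rows == 1
    # Range query: a half-open range covering 100 ids.
    assert ds.to_table(filter="id >= 100 AND id < 200").num_rows == 100
```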

Closes #92

cc @jackye1995 @chenghao-guo

🤖 Generated with Claude Code

Add sorted=True parameter to create_scalar_index() for range-based
BTree indexing. When enabled, the indexed column is globally sorted
via Ray, split into non-overlapping ranges, and each range builds a
separate BTree partition — replacing the expensive k-way merge with
a fast sequential concat.

- Add _build_sorted_btree_index with 6-phase workflow (read, sort,
  split, build, merge, commit); a phase schematic follows below
- Add _build_range_partition Ray remote task
- Add TestDistributedSortedBTreeIndexing with 13 test cases covering
  basic functionality, correctness vs fragment-based, string columns,
  validation, and single-worker edge case

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
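
For orientation, here is how the six phases might map onto Ray Data calls. This is a schematic under assumptions, not the PR's actual code: whether the PR uses ray.data.read_lance is assumed, and _merge_and_commit is a hypothetical stand-in for the final two phases. The Phase 3 lines mirror the snippet discussed in review below:

```python
import ray


def _build_sorted_btree_index(uri, column, num_workers):
    # Phase 1: read -- load the indexed column as a Ray Dataset.
    ds = ray.data.read_lance(uri, columns=[column])

    # Phase 2: sort -- global range-partitioning shuffle on the column.
    sorted_ds = ds.sort(column)

    # Phase 3: split -- one non-overlapping range per worker.
    sorted_ds = sorted_ds.repartition(num_workers, shuffle=False)
    table_refs = sorted_ds.to_arrow_refs()

    # Phase 4: build -- each range builds its BTree partition in parallel.
    tasks = [
        _build_range_partition.remote(ref, range_id, uri, column)
        for range_id, ref in enumerate(table_refs)
    ]
    ray.get(tasks)

    # Phases 5-6: merge + commit -- because the ranges are pre-sorted and
    # non-overlapping, partitions concatenate sequentially (no k-way
    # merge), then the index is committed to the dataset.
    _merge_and_commit(uri, column, num_workers)  # hypothetical helper
```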
@github-actions github-actions bot added the enhancement label on Feb 12, 2026
@chenghao-guo
Collaborator

Thanks a lot for your contribution! Will review and get back to you in a few days due to the holiday.


logger.info("Phase 3: Splitting into %d range partitions", num_workers)
sorted_ds = sorted_ds.repartition(num_workers, shuffle=False)
table_refs = sorted_ds.to_arrow_refs()
Collaborator

ray.data.Dataset.sort(column) performs a global range-partitioning shuffle. At 1B rows this becomes a very heavy operation in terms of memory, network, and disk I/O. Could we document the expected scale limits and recommend machine sizes for different data sizes?

Btw, this may involve object spilling, so we should also estimate the memory/disk requirements in the documentation:
https://docs.ray.io/en/latest/ray-core/objects/object-spilling.html
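
For reference, the linked page shows spilling being directed at a specific disk via ray.init; a sketch of that configuration (the spill directory is an example path):

```python
import json
import ray

# Point Ray's object spilling at a volume with enough free space for the
# sort shuffle (configuration shape from the Ray object-spilling docs).
ray.init(
    _system_config={
        "object_spilling_config": json.dumps(
            {"type": "filesystem", "params": {"directory_path": "/tmp/ray_spill"}}
        )
    }
)
```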

Author

Tbh I don't have much experience with the exact scale at which Ray sorting becomes the limiting factor. I read this blog: https://www.anyscale.com/blog/ray-breaks-the-usd1-tb-barrier-as-the-worlds-most-cost-efficient-sorting, but also saw https://discuss.ray.io/t/implementation-of-sort-is-not-optimal/12123, so results for Ray-based sorting look mixed. Do you have any sense of the recommended machine size and scale limits to share? Would you suggest any other way to sort, if not Ray?

Collaborator

Yes, based on your investigation, Ray sort looks like the best option for this use case, and I'm aligned with that direction.
From my previous experience, processing ~1 billion rows typically requires at least ~100 CPU cores with a memory-to-core ratio of around 4 GB/core (i.e., ~400 GB of RAM) for workloads at this scale.
If possible, we could run a more robust test with a larger row count (closer to production scale), so we can provide a more reliable capacity estimate.

@chenghao-guo
Collaborator

Overall looks good to me. I will test the performance with a 1-billion-row dataset and merge once the pylance PR is merged and the UTs pass. Thanks a lot for your contribution! Happy Chinese New Year!
