feat: add range-based sorted BTree index support #440
touch-of-grey wants to merge 1 commit into lance-format:main
Conversation
Add a `sorted=True` parameter to `create_scalar_index()` for range-based BTree indexing. When enabled, the indexed column is globally sorted via Ray, split into non-overlapping ranges, and each range builds a separate BTree partition, replacing the expensive k-way merge with a fast sequential concat.

- Add `_build_sorted_btree_index` with a 6-phase workflow (read, sort, split, build, merge, commit)
- Add `_build_range_partition` Ray remote task
- Add `TestDistributedSortedBTreeIndexing` with 13 test cases covering basic functionality, correctness vs. fragment-based indexing, string columns, validation, and the single-worker edge case

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
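For reference, here is a minimal sketch of the six phases in terms of public Ray Data APIs. The URI, column name, and worker count are hypothetical, and phases 4-6 depend on Lance index internals introduced in this PR, so they appear only as comments.

```python
import ray

ray.init()

uri, column, num_workers = "/tmp/example.lance", "id", 8  # hypothetical inputs

ds = ray.data.read_lance(uri)                  # Phase 1: read the dataset
sorted_ds = ds.sort(column)                    # Phase 2: global sort (range-partitioning shuffle)
sorted_ds = sorted_ds.repartition(num_workers, shuffle=False)  # Phase 3: split into non-overlapping ranges
table_refs = sorted_ds.to_arrow_refs()         # one Arrow table ref per range
# Phase 4: pass each ref to the _build_range_partition remote task to build one BTree partition.
# Phase 5: concatenate the partitions sequentially (no k-way merge, since ranges do not overlap).
# Phase 6: commit the merged index to the dataset.
```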
Thanks a lot for your contribution! Will review and get back to you in a few days due to the holiday.
logger.info("Phase 3: Splitting into %d range partitions", num_workers)
sorted_ds = sorted_ds.repartition(num_workers, shuffle=False)
table_refs = sorted_ds.to_arrow_refs()
This uses `ray.data.Dataset.sort(column)` to perform a global range-partitioning shuffle. At 1B rows, this becomes a very heavy operation in terms of memory, network, and disk I/O. Could we better document the expected scale limits and recommend machine sizes for different data sizes?
Btw, this may involve object spilling, and we could also estimate the memory/disk requirements in the documentation:
https://docs.ray.io/en/latest/ray-core/objects/object-spilling.html
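For the docs, it may also be worth showing how to point Ray's object spilling at a large local disk, following the linked page; the spill directory below is just a placeholder.

```python
import json
import ray

# Direct spilled objects to a disk with enough free space for the shuffle
# (directory path is a placeholder; see the object-spilling docs linked above).
ray.init(
    _system_config={
        "object_spilling_config": json.dumps(
            {"type": "filesystem", "params": {"directory_path": "/mnt/large_disk/ray_spill"}}
        )
    }
)
```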
I don't have much experience, to be honest, with the exact scale at which Ray sorting becomes limiting. I read this blog: https://www.anyscale.com/blog/ray-breaks-the-usd1-tb-barrier-as-the-worlds-most-cost-efficient-sorting, but also saw https://discuss.ray.io/t/implementation-of-sort-is-not-optimal/12123, so results for Ray-based sorting appear to be mixed. Do you have a sense of the recommended machine size and scale limits to share? Would you suggest any other way to sort, if not Ray?
Yes, based on your investigation, Ray sort looks like the best option for this use case, and I'm aligned with that direction.
From my previous experience, processing ~1 billion rows typically requires at least ~100 CPU cores, with a memory-to-core ratio of around 4 GB/core (i.e., ~400 GB RAM) for workloads at this scale.
If possible, we could run a more robust test with a larger row count (closer to production scale), so we can provide a more reliable capacity estimate.
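As a rough illustration of that estimate (the core count and memory-to-core ratio are the figures above, not measured numbers):

```python
rows = 1_000_000_000     # target scale from the discussion
cores = 100              # suggested minimum CPU cores at ~1B rows
ram_per_core_gb = 4      # suggested memory-to-core ratio
total_ram_gb = cores * ram_per_core_gb
print(f"~{total_ram_gb} GB cluster RAM for ~{rows:,} rows")  # ~400 GB
```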
Overall looks good to me. I will test the performance using a 1-billion-row dataset, and merge it once the pylance PR is merged and the UTs pass. Thanks a lot for your contribution! Happy Chinese New Year!
Summary
- Add `sorted=True` parameter to `create_scalar_index()` for range-based BTree indexing (see the usage sketch after the test plan)

Changes
- `lance_ray/index.py`: Add `_build_range_partition` Ray remote task and `_build_sorted_btree_index` orchestration function
- `tests/test_distributed_indexing.py`: Add `TestDistributedSortedBTreeIndexing` with 13 test cases

Test plan
- `test_sorted_btree_basic`: multi-fragment dataset, equality/range queries
- `test_sorted_btree_vs_fragment_correctness`: 9 parametrized comparisons against fragment-based BTree
- `test_sorted_btree_string_column`: string column sorting
- `test_sorted_btree_rejects_non_btree`: validation error for non-BTREE index types
- `test_sorted_btree_single_worker`: edge case with num_workers=1
- `make fix` passes

Closes #92
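For illustration, a hypothetical call with the new flag might look like the sketch below; the exact signature of `create_scalar_index` in lance-ray is not shown in this excerpt, so every parameter name other than `sorted` is an assumption.

```python
import ray
from lance_ray import create_scalar_index  # assumed import path

ray.init()

create_scalar_index(
    "/tmp/example.lance",   # hypothetical dataset location
    column="id",            # column to index
    index_type="BTREE",     # sorted=True is only supported for BTree indexes
    sorted=True,            # use the range-based sorted path added in this PR
    num_workers=8,          # number of non-overlapping range partitions
)
```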
cc @jackye1995 @chenghao-guo
🤖 Generated with Claude Code