Lance Adaptive Partition Artifact Shuffle #6981
Replies: 2 comments 4 replies
-
|
Benchmark over a PoC:
|
Beta Was this translation helpful? Give feedback.
-
|
(sigh it's really annoying to comment on GH discussion, but I'll do my best here.)
issue(non-blocking): That's more of the solution, not the goal, right? Goal is something like: "Be able to read a partition with no more than 5 read requests" or something like that.
question: Why is this important? Why should buckets get bigger?
praise: This is a good goal On design goals: It seems like you are missing a lot of important factors here. Here's a few key constraints that I would add:
^ Both of these are goals I have in my optimizations to the two-file reader.
issue(blocking): Could you explain why the current solution is lacking? it's very unclear from this document why we need the new solution.
question: Why JSON for manifest? Versus Lance for example. We could have 30,000+ partitions, and hundreds of ranges per partition. I would think we'd like a more compact /efficient format, no?
question: So is the peak memory use then One thing I found in my research is there was a tradeoff between
note: this is essentially what I was saying above 👍
suggestion: having bucket assignment adds a lot of complexity. I think it would be reasonable to show benchmarks demonstrating they have real benefits before we add this.
question: is there a reason we need
suggestion: these should be pulled up into goals. It's a lot easier to read a design document when the goals are up front. Otherwise it's hard to judge the design details and whether they are accomplishing the state goals.
issue(blocking): Can you explain why this doesn't work in the current format? I would think the offsets file would support this pretty easily.
issue(blocking): Can you explain why more files is better? What is the benefit of splitting across files?
question: Is the two-file shuffler not agnostic to the payload? I thought it was.
issue(non-blocking): again, this sounds like a benefit that isn't new to this design. Why mention it? |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
-
Abstract
This proposal introduces an adaptive bucketed partition artifact as the shuffle layout used during Lance IVF index builds.
During an IVF index build, each row vector is first assigned to a target partition. The build pipeline then produces encoded payloads that will later be written into the final index. The role of the shuffle layout is to reorganize these encoded rows into partition-local streams so that the finalization phase can read them partition by partition and write the final index storage sequentially.
The adaptive bucketed partition artifact uses a manifest to record the physical location of every logical partition directly:
Bucket files store the encoded payload, while the manifest records the mapping from partitions to row ranges. The number of bucket files is determined adaptively based on the size of the input: small inputs use only a handful of buckets, and the bucket count grows as the input size increases. This approach combines the low fixed overhead of a small-file layout with the read efficiency gained from explicit partition ranges.
Background
Lance’s IVF index divides the vector space into multiple partitions. During the build, each vector is assigned to a partition and then encoded into a payload that the final storage writer can consume.
A typical encoded row contains:
The exact content of the payload depends on the index variant. For example, IVF-PQ uses a PQ code, while other variants may use different encoded representations. The shuffle layout only cares about the physical organization of
_part_idand the payload.The shuffle step during the build can be abstracted as:
During finalization, rows are consumed partition by partition and written to the final index file. This access pattern means that the core data structure of the shuffle layout should be designed around partition ranges.
Design Goals
Current Shuffle Layout
The current two-file shuffle layout uses one data file and one offsets file:
The data file stores rows that have been sorted per batch. The offsets file stores the end position of each partition inside each batch. To read a particular partition, the reader first calculates the row range for that partition in every batch from the offsets file, then reads those ranges from the data file.
Key characteristics of this layout:
The adaptive bucketed partition artifact keeps the approach of concentrated writes while promoting partition ranges to first-class information in a manifest.
Proposed Layout
The new shuffle artifact consists of a manifest and a set of bucket files:
Bucket files store the encoded payload rows. The manifest records, for each logical partition, the associated bucket file and row ranges:
The schema stored inside a bucket file is the input schema with
_part_idremoved — i.e., the pure payload schema. For IVF-PQ, the bucket rows might look like:_part_idis only used during writes to decide which logical partition a row belongs to. Because the manifest already expresses partition membership, the rows inside bucket files can focus on the payload that the finalization phase actually needs.Write Model
The writer receives encoded batches and distributes rows into bucket buffers based on
_part_id.When a bucket buffer is flushed, the writer sorts the buffered rows within that bucket by
_part_id, appends the payload rows to the corresponding bucket file, and records the resulting partition ranges in the manifest state.The overall flow is:
A single logical partition may end up with multiple ranges in the manifest. This allows the writer to flush continuously under bounded memory while still preserving the ability to read partitions directly.
Read Model
To read a particular partition, the reader queries the manifest directly:
The finalization phase sees a partition-local stream. The artifact’s bucket policy and range organization are encapsulated behind the manifest reader, so the finalization logic can keep working in terms of per-partition streams.
Adaptive Bucket Policy
The bucket count is determined by the estimated total payload size:
A suggested initial policy:
Small inputs naturally converge to a single bucket:
Large inputs increase the bucket count as the payload size grows:
The flush threshold also scales with the bucket count. The goal is to cap total buffer memory while ensuring that each flush produces reasonably large contiguous ranges:
This policy keeps fixed costs low for small inputs and gives large inputs more bucket-level parallelism and shorter per-bucket partition range lists.
Bucket Assignment
Bucket assignment determines how logical partitions are mapped to bucket files.
HashModulo
HashModulo scatters consecutive partition IDs across different buckets. It is suitable when partition size estimates are not yet available, providing stable bucket size distribution.
RangeByPartitionId
RangeByPartitionId places consecutive partition IDs into the same bucket. It preserves partition locality, allowing the finalization phase to reuse the same bucket file sequentially when processing partitions in ID order.
SizeBalancedRange
SizeBalancedRange uses partition size estimates to form contiguous partition ranges so that each bucket is roughly the target size:
This strategy combines range locality with bucket size balance and is the preferred choice when partition size estimates are available.
Manifest Semantics
The manifest is the logical entry point for the artifact. It must describe:
The semantics of a partition entry are:
Ranges are expressed as row offsets. This aligns naturally with the batch- and range-based reading model of the Lance file reader and keeps the abstraction at the schema level.
Layout Example
Assume 8 IVF partitions and an adaptive policy that chooses 2 buckets. With
RangeByPartitionId, the layout could look like:The manifest records:
Even when partition 2 has zero rows, the manifest can still keep its logical entry. The reader uses
num_rowsto return an empty stream.Expected Benefits
Direct partition reads
The finalization phase consumes rows partition by partition. The manifest stores partition-to-range mappings directly, so the reader can locate physical rows straight from a partition ID.
Adaptive file count
The bucket count grows with the payload size. Small inputs use only a few buckets, large inputs use more. The number of files is controlled by the layout policy and scales with the data volume.
Generic payload support
Bucket files store payload columns. The shuffle layout treats the payload as opaque columns, so the same layout can host PQ codes, SQ codes, RQ codes, or any other encoded representation.
Stable index format boundary
The artifact is a build-time shuffle layout. The final Lance index format continues to be written by the finalization phase. This boundary allows the shuffle layout to evolve independently.
Design Decision
The long-term primary path for Lance IVF shuffles should center around the adaptive bucketed partition artifact.
This layout makes partition ranges the core data structure and uses a manifest to explicitly express the mapping from logical partitions to physical rows. The adaptive bucket policy selects the right file organization for both small and large inputs. The bucket assignment policy chooses between bucket size balance and partition locality.
The overall design aligns the shuffle layer with the real data flow of the IVF build:
Beta Was this translation helpful? Give feedback.
All reactions