Lance Adaptive Partition Artifact Shuffle #6981

Xuanwo · 2026-05-28T11:24:56Z

Xuanwo
May 28, 2026
Maintainer

Abstract

This proposal introduces an adaptive bucketed partition artifact as the shuffle layout used during Lance IVF index builds.

During an IVF index build, each row vector is first assigned to a target partition. The build pipeline then produces encoded payloads that will later be written into the final index. The role of the shuffle layout is to reorganize these encoded rows into partition-local streams so that the finalization phase can read them partition by partition and write the final index storage sequentially.

The adaptive bucketed partition artifact uses a manifest to record the physical location of every logical partition directly:

partition_id -> bucket file + row ranges

Bucket files store the encoded payload, while the manifest records the mapping from partitions to row ranges. The number of bucket files is determined adaptively based on the size of the input: small inputs use only a handful of buckets, and the bucket count grows as the input size increases. This approach combines the low fixed overhead of a small-file layout with the read efficiency gained from explicit partition ranges.

Background

Lance’s IVF index divides the vector space into multiple partitions. During the build, each vector is assigned to a partition and then encoded into a payload that the final storage writer can consume.

A typical encoded row contains:

_rowid: original row id
_part_id: target IVF partition id
payload: encoded vector representation

The exact content of the payload depends on the index variant. For example, IVF-PQ uses a PQ code, while other variants may use different encoded representations. The shuffle layout only cares about the physical organization of _part_id and the payload.

The shuffle step during the build can be abstracted as:

input:
  unordered encoded rows

output:
  partition 0 -> encoded rows
  partition 1 -> encoded rows
  ...
  partition N -> encoded rows

During finalization, rows are consumed partition by partition and written to the final index file. This access pattern means that the core data structure of the shuffle layout should be designed around partition ranges.
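The input/output contract above can be sketched in a few lines of Python. This is only an in-memory illustration: the `(rowid, part_id, payload)` tuple shape and the function name `shuffle_by_partition` are assumptions made for the example, not part of Lance.

```python
from collections import defaultdict

def shuffle_by_partition(rows, num_partitions):
    """Group unordered encoded rows into partition-local lists.

    Each row is a hypothetical (rowid, part_id, payload) tuple.
    """
    partitions = defaultdict(list)
    for rowid, part_id, payload in rows:
        if not 0 <= part_id < num_partitions:
            raise ValueError(f"part_id {part_id} out of range")
        # _part_id selects the stream; only rowid and payload are kept.
        partitions[part_id].append((rowid, payload))
    # Emit every partition, including empty ones, in partition-ID order.
    return [partitions.get(p, []) for p in range(num_partitions)]

rows = [(10, 2, b"\x01"), (11, 0, b"\x02"), (12, 2, b"\x03")]
streams = shuffle_by_partition(rows, num_partitions=3)
```

A real shuffle cannot hold all rows in memory like this, which is why the layout on disk matters.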

Design Goals

Make partition ranges the central piece of information in the layout.
Allow the finalization phase to read encoded rows directly by partition.
Let bucket file sizes scale adaptively with the size of the input.
Support different vector index payloads and keep the shuffle layout decoupled from the concrete encoding.
Keep the final Lance index format stable; this design lives purely at the build-time shuffle layout layer.

Current Shuffle Layout

The current two-file shuffle layout uses one data file and one offsets file:

shuffle_data.lance
  [batch0 sorted by _part_id][batch1 sorted by _part_id][batch2 sorted by _part_id]...

shuffle_offsets.lance
  batch0: end(part0), end(part1), ..., end(partN)
  batch1: end(part0), end(part1), ..., end(partN)
  batch2: end(part0), end(part1), ..., end(partN)

The data file stores rows that have been sorted per batch. The offsets file stores the end position of each partition inside each batch. To read a particular partition, the reader first calculates the row range for that partition in every batch from the offsets file, then reads those ranges from the data file.
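As a rough illustration of that read path, the sketch below derives the row ranges of one partition from an offsets matrix. It assumes each stored value is a batch-local cumulative end offset, so the last value of a row equals the batch row count; the actual on-disk encoding may differ, and the function name is hypothetical.

```python
def partition_ranges_two_file(batch_end_offsets, partition_id):
    """Return [start, end) row ranges in the data file for one partition.

    batch_end_offsets[b][p] is assumed to be the batch-local end offset of
    partition p in batch b (a cumulative count within the batch).
    """
    ranges = []
    base = 0  # global row offset of the current batch in the data file
    for ends in batch_end_offsets:
        start = ends[partition_id - 1] if partition_id > 0 else 0
        end = ends[partition_id]
        if end > start:
            ranges.append((base + start, base + end))
        base += ends[-1]  # the last end offset equals the batch row count
    return ranges

offsets = [
    [3, 5, 9],  # batch0: part0 has 3 rows, part1 has 2, part2 has 4
    [0, 4, 6],  # batch1: part0 is empty, part1 has 4, part2 has 2
]
```

The reader touches one offsets row per batch, so a partition yields at most one range per spill batch.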

Key characteristics of this layout:

Data is written into a small number of concentrated files.
Partition ranges are derived from the offsets matrix.
The size of the offsets matrix grows with both the number of spill batches and the number of partitions.

The adaptive bucketed partition artifact keeps the approach of concentrated writes while promoting partition ranges to first-class information in a manifest.

Proposed Layout

The new shuffle artifact consists of a manifest and a set of bucket files:

artifact/
  manifest.json
  partitions/
    bucket-00000.lance
    bucket-00001.lance
    bucket-00002.lance

Bucket files store the encoded payload rows. The manifest records, for each logical partition, the associated bucket file and row ranges:

manifest:
  num_partitions
  payload_schema
  bucket_policy
  partitions:
    0:
      bucket: partitions/bucket-00000.lance
      num_rows: 1024
      ranges:
        - offset: 0
          num_rows: 1024
    1:
      bucket: partitions/bucket-00001.lance
      num_rows: 2048
      ranges:
        - offset: 4096
          num_rows: 2048

The schema stored inside a bucket file is the input schema with _part_id removed — i.e., the pure payload schema. For IVF-PQ, the bucket rows might look like:

_rowid, __pq_code

_part_id is only used during writes to decide which logical partition a row belongs to. Because the manifest already expresses partition membership, the rows inside bucket files can focus on the payload that the finalization phase actually needs.

Write Model

The writer receives encoded batches and distributes rows into bucket buffers based on _part_id.

When a bucket buffer is flushed, the writer sorts the buffered rows within that bucket by _part_id, appends the payload rows to the corresponding bucket file, and records the resulting partition ranges in the manifest state.

The overall flow is:

encoded batches
  -> bucket assignment
  -> bucket-local grouping by partition
  -> append payload rows to bucket file
  -> record partition ranges in manifest

A single logical partition may end up with multiple ranges in the manifest. This allows the writer to flush continuously under bounded memory while still preserving the ability to read partitions directly.
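The flow above might look roughly like the following in-memory sketch. Python lists stand in for bucket files, HashModulo is used for bucket assignment, and the class and method names (`BucketWriter`, `write`, `flush`, `finish`) are hypothetical, not the PoC's API.

```python
from collections import defaultdict

class BucketWriter:
    """In-memory stand-in for the artifact writer (hypothetical names)."""

    def __init__(self, num_buckets, bucket_buffer_rows):
        self.num_buckets = num_buckets
        self.bucket_buffer_rows = bucket_buffer_rows
        self.buffers = [[] for _ in range(num_buckets)]
        # Lists stand in for the append-only bucket files.
        self.bucket_files = [[] for _ in range(num_buckets)]
        # partition_id -> list of (offset, num_rows) inside its bucket file
        self.ranges = defaultdict(list)

    def write(self, rowid, part_id, payload):
        bucket_id = part_id % self.num_buckets  # HashModulo assignment
        buf = self.buffers[bucket_id]
        buf.append((part_id, rowid, payload))
        if len(buf) >= self.bucket_buffer_rows:
            self.flush(bucket_id)

    def flush(self, bucket_id):
        buf = self.buffers[bucket_id]
        buf.sort(key=lambda row: row[0])  # stable sort: group by _part_id
        out = self.bucket_files[bucket_id]
        i = 0
        while i < len(buf):
            part_id, start = buf[i][0], len(out)
            while i < len(buf) and buf[i][0] == part_id:
                out.append(buf[i][1:])  # drop _part_id, keep the payload row
                i += 1
            self.ranges[part_id].append((start, len(out) - start))
        buf.clear()

    def finish(self):
        for bucket_id in range(self.num_buckets):
            self.flush(bucket_id)
        return self.bucket_files, dict(self.ranges)

writer = BucketWriter(num_buckets=2, bucket_buffer_rows=2)
for rowid, part_id in enumerate([0, 2, 1, 0, 2, 2]):
    writer.write(rowid, part_id, b"code")
files, ranges = writer.finish()
```

With a buffer of only two rows, partition 2 ends up with three separate ranges in this run, which shows how small buffers fragment a partition. The sketch does not merge ranges that happen to be adjacent.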

Read Model

To read a particular partition, the reader queries the manifest directly:

partition_id
  -> bucket file
  -> row ranges
  -> encoded row stream

The finalization phase sees a partition-local stream. The artifact’s bucket policy and range organization are encapsulated behind the manifest reader, so the finalization logic can keep working in terms of per-partition streams.
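A minimal sketch of such a manifest reader follows. It assumes the manifest has been loaded into a plain dict shaped like the example in the Proposed Layout section, and it uses in-memory lists in place of bucket files; `read_partition` is a hypothetical name.

```python
def read_partition(manifest, bucket_files, partition_id):
    """Yield the encoded rows of one logical partition.

    manifest["partitions"][pid] is assumed to hold "bucket", "num_rows" and
    "ranges" (a list of {"offset", "num_rows"}), mirroring the proposed layout.
    """
    entry = manifest["partitions"][partition_id]
    if entry["num_rows"] == 0:
        return  # empty partition: yield nothing
    rows = bucket_files[entry["bucket"]]
    for rng in entry["ranges"]:
        start = rng["offset"]
        yield from rows[start:start + rng["num_rows"]]

bucket = "partitions/bucket-00000.lance"
bucket_files = {bucket: ["a0", "b0", "b1", "a1"]}
manifest = {"partitions": {
    0: {"bucket": bucket, "num_rows": 2,
        "ranges": [{"offset": 0, "num_rows": 1}, {"offset": 3, "num_rows": 1}]},
    1: {"bucket": bucket, "num_rows": 2,
        "ranges": [{"offset": 1, "num_rows": 2}]},
    2: {"bucket": bucket, "num_rows": 0, "ranges": []},
}}
```

The caller only ever sees a per-partition iterator, regardless of how many buckets or ranges sit behind it.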

Adaptive Bucket Policy

The bucket count is determined by the estimated total payload size:

estimated_payload_bytes = expected_num_rows * payload_row_width
num_buckets = ceil(estimated_payload_bytes / target_bucket_bytes)
num_buckets = clamp(num_buckets, 1, min(max_buckets, num_partitions))

A suggested initial policy:

target_bucket_bytes = 64 MiB
max_buckets = 256
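The policy can be written as a small function. This sketch reads the clamp as a lower bound of 1 and an upper bound of min(max_buckets, num_partitions), and it assumes a 24-byte payload row (an 8-byte _rowid plus a 16-byte PQ code) for the example values; the function name is hypothetical.

```python
MIB = 1024 * 1024

def choose_num_buckets(expected_num_rows, payload_row_width, num_partitions,
                       target_bucket_bytes=64 * MIB, max_buckets=256):
    """Pick a bucket count from the estimated payload size."""
    estimated_payload_bytes = expected_num_rows * payload_row_width
    # Integer ceiling division avoids floating-point rounding.
    num_buckets = -(-estimated_payload_bytes // target_bucket_bytes)
    # Assumed reading of the clamp: at least 1, at most
    # min(max_buckets, num_partitions).
    upper = max(1, min(max_buckets, num_partitions))
    return max(1, min(num_buckets, upper))

small = choose_num_buckets(1_000_000, 24, num_partitions=1024)
medium = choose_num_buckets(3_000_000, 24, num_partitions=1024)
large = choose_num_buckets(10_000_000, 24, num_partitions=1024)
```

Under the assumed 24-byte row width, these three inputs give 1, 2, and 4 buckets.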

Small inputs naturally converge to a single bucket:

artifact/
  manifest.json
  partitions/
    bucket-00000.lance

Large inputs increase the bucket count as the payload size grows:

artifact/
  manifest.json
  partitions/
    bucket-00000.lance
    bucket-00001.lance
    ...
    bucket-00NNN.lance

The flush threshold also scales with the bucket count. The goal is to cap total buffer memory while ensuring that each flush produces reasonably large contiguous ranges:

target_total_buffer_bytes = 128 MiB
bucket_buffer_rows =
  target_total_buffer_bytes / num_buckets / payload_row_width

This policy keeps fixed costs low for small inputs and gives large inputs more bucket-level parallelism and shorter per-bucket partition range lists.
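The flush threshold formula can be checked with a short calculation. The sketch below uses integer division and again assumes a 24-byte payload row for the example; `bucket_buffer_rows` follows the name in the formula, everything else is illustrative.

```python
MIB = 1024 * 1024

def bucket_buffer_rows(num_buckets, payload_row_width,
                       target_total_buffer_bytes=128 * MIB):
    """Rows each bucket buffer may hold before it is flushed."""
    per_bucket_bytes = target_total_buffer_bytes // num_buckets
    return max(1, per_bucket_bytes // payload_row_width)

rows_1 = bucket_buffer_rows(num_buckets=1, payload_row_width=24)
rows_4 = bucket_buffer_rows(num_buckets=4, payload_row_width=24)
# Total staged bytes stay within the budget regardless of the bucket count.
total_bytes_4 = 4 * rows_4 * 24
```

Because the per-bucket threshold shrinks as the bucket count grows, the total staged payload stays within the 128 MiB budget in both cases.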

Bucket Assignment

Bucket assignment determines how logical partitions are mapped to bucket files.

HashModulo

bucket_id = partition_id % num_buckets

HashModulo scatters consecutive partition IDs across different buckets. It is suitable when partition size estimates are not yet available, and it provides a stable bucket size distribution.

RangeByPartitionId

partitions_per_bucket = ceil(num_partitions / num_buckets)
bucket_id = floor(partition_id / partitions_per_bucket)

RangeByPartitionId places consecutive partition IDs into the same bucket. It preserves partition locality, allowing the finalization phase to reuse the same bucket file sequentially when processing partitions in ID order.

SizeBalancedRange

SizeBalancedRange uses partition size estimates to form contiguous partition ranges so that each bucket is roughly the target size:

bucket-00000: partitions [0, 512),  target size ~= 64 MiB
bucket-00001: partitions [512, 980), target size ~= 64 MiB
bucket-00002: partitions [980, N),   target size ~= 64 MiB

This strategy combines range locality with bucket size balance and is the preferred choice when partition size estimates are available.
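The three strategies can be compared side by side in a short sketch. The function names are hypothetical, `partitions_per_bucket` is assumed to be the ceiling of num_partitions / num_buckets, and the size-balanced variant is a simple greedy split that does not cap the resulting bucket count.

```python
def hash_modulo(partition_id, num_buckets):
    return partition_id % num_buckets

def range_by_partition_id(partition_id, num_partitions, num_buckets):
    # Ceiling division so that every partition lands in a valid bucket.
    partitions_per_bucket = -(-num_partitions // num_buckets)
    return partition_id // partitions_per_bucket

def size_balanced_range(partition_sizes, target_bucket_bytes):
    """Greedy split: close a bucket once it reaches the target size.

    Returns a list of [start, end) partition-ID ranges, one per bucket.
    """
    buckets, start, acc = [], 0, 0
    for pid, size in enumerate(partition_sizes):
        acc += size
        if acc >= target_bucket_bytes:
            buckets.append((start, pid + 1))
            start, acc = pid + 1, 0
    if start < len(partition_sizes):
        buckets.append((start, len(partition_sizes)))
    return buckets

modulo = [hash_modulo(p, 2) for p in range(4)]
by_range = [range_by_partition_id(p, 8, 2) for p in range(8)]
balanced = size_balanced_range([30, 30, 10, 50, 20, 5], target_bucket_bytes=60)
```

The first two need only the partition ID, while the third needs a size estimate for every partition before the first row is written.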

Manifest Semantics

The manifest is the logical entry point for the artifact. It must describe:

artifact version
number of logical partitions
payload schema
bucket assignment policy
row count for each partition
bucket file and ranges for each partition

The semantics of a partition entry are:

partition_id:
  num_rows: total rows in this logical partition
  bucket: physical bucket file
  ranges: row ranges inside the bucket file

Ranges are expressed as row offsets. This aligns naturally with the batch- and range-based reading model of the Lance file reader and keeps the abstraction at the schema level.
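One possible in-memory shape for this logical schema is sketched below with dataclasses, together with a consistency check that the ranges of each partition add up to its num_rows. The class names, the `version` field name, and the use of JSON here are assumptions for illustration, not a proposed physical encoding.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class RowRange:
    offset: int
    num_rows: int

@dataclass
class PartitionEntry:
    bucket: str
    num_rows: int
    ranges: list = field(default_factory=list)

@dataclass
class Manifest:
    version: int
    num_partitions: int
    payload_schema: list
    bucket_policy: str
    partitions: dict = field(default_factory=dict)

    def validate(self):
        if set(self.partitions) != set(range(self.num_partitions)):
            raise ValueError("every logical partition needs an entry")
        for pid, entry in self.partitions.items():
            total = sum(r.num_rows for r in entry.ranges)
            if total != entry.num_rows:
                raise ValueError(f"partition {pid}: ranges cover {total} rows, "
                                 f"expected {entry.num_rows}")

manifest = Manifest(
    version=1, num_partitions=2,
    payload_schema=["_rowid", "__pq_code"], bucket_policy="HashModulo",
    partitions={
        0: PartitionEntry("partitions/bucket-00000.lance", 1024,
                          [RowRange(0, 1024)]),
        1: PartitionEntry("partitions/bucket-00001.lance", 2048,
                          [RowRange(4096, 2048)]),
    },
)
manifest.validate()
encoded = json.dumps(asdict(manifest))
```

Note that JSON turns the integer partition keys into strings, so a reader has to convert them back.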

Layout Example

Assume 8 IVF partitions and an adaptive policy that chooses 2 buckets. With RangeByPartitionId, the layout could look like:

artifact/
  manifest.json
  partitions/
    bucket-00000.lance  # partitions 0, 1, 2, 3
    bucket-00001.lance  # partitions 4, 5, 6, 7

The manifest records:

partition 0 -> bucket-00000.lance ranges [0..100)
partition 1 -> bucket-00000.lance ranges [100..260)
partition 2 -> bucket-00000.lance ranges [260..260)
partition 3 -> bucket-00000.lance ranges [260..500)
partition 4 -> bucket-00001.lance ranges [0..80)
partition 5 -> bucket-00001.lance ranges [80..220)
partition 6 -> bucket-00001.lance ranges [220..390)
partition 7 -> bucket-00001.lance ranges [390..600)

Even when partition 2 has zero rows, the manifest can still keep its logical entry. The reader uses num_rows to return an empty stream.
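The ranges in this example follow from the per-partition row counts. The sketch below recomputes them, assuming each bucket is written in a single flush so that every partition has exactly one range; the row counts are read off the example above and `layout_ranges` is a hypothetical helper.

```python
def layout_ranges(partition_rows, num_buckets):
    """Map partition_id -> (bucket_id, start, end) for a single-flush write."""
    num_partitions = len(partition_rows)
    partitions_per_bucket = -(-num_partitions // num_buckets)  # ceil division
    next_offset = [0] * num_buckets
    layout = {}
    for pid, rows in enumerate(partition_rows):
        bucket_id = pid // partitions_per_bucket  # RangeByPartitionId
        start = next_offset[bucket_id]
        next_offset[bucket_id] = start + rows
        layout[pid] = (bucket_id, start, start + rows)
    return layout

layout = layout_ranges([100, 160, 0, 240, 80, 140, 170, 210], num_buckets=2)
```

Partition 2 comes out as the empty range [260..260) in bucket 0, matching the manifest entry above.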

Expected Benefits

Direct partition reads

The finalization phase consumes rows partition by partition. The manifest stores partition-to-range mappings directly, so the reader can locate physical rows straight from a partition ID.

Adaptive file count

The bucket count grows with the payload size. Small inputs use only a few buckets, large inputs use more. The number of files is controlled by the layout policy and scales with the data volume.

Generic payload support

Bucket files store payload columns. The shuffle layout treats the payload as opaque columns, so the same layout can host PQ codes, SQ codes, RQ codes, or any other encoded representation.

Stable index format boundary

The artifact is a build-time shuffle layout. The final Lance index format continues to be written by the finalization phase. This boundary allows the shuffle layout to evolve independently.

Design Decision

The long-term primary path for Lance IVF shuffles should center around the adaptive bucketed partition artifact.

This layout makes partition ranges the core data structure and uses a manifest to explicitly express the mapping from logical partitions to physical rows. The adaptive bucket policy selects the right file organization for both small and large inputs. The bucket assignment policy chooses between bucket size balance and partition locality.

The overall design aligns the shuffle layer with the real data flow of the IVF build:

encoded rows -> partition-aware artifact -> partition-local finalization streams

Xuanwo · 2026-05-28T11:25:53Z

Xuanwo
May 28, 2026
Maintainer Author

Benchmark over a PoC:

rows / partitions | two-file write / read | adaptive artifact write / read | artifact files
1M / 1024         | 73ms / 102ms          | 55ms / 77ms                    | 2
3M / 1024         | 214ms / 128ms         | 132ms / 101ms                  | 3
10M / 256         | 629ms / 81ms          | 387ms / 50ms                   | 5
10M / 1024        | 776ms / 290ms         | 490ms / 165ms                  | 5
10M / 4096        | 777ms / 1083ms        | 473ms / 522ms                  | 5

3 replies

BubbleCal May 28, 2026
Maintainer

Great to see these improvements!

Which policy are these numbers from? Data distribution really matters for shuffling / kmeans performance, and so does the policy.

Xuanwo Jun 1, 2026
Maintainer Author

These numbers are from HashModulo:

bucket_id = partition_id % num_buckets

The benchmark uses pq_width=16 and batch_size=65536.

_part_id is generated as:

partition_id = (row_id * 2654435761 + 17) % num_partitions
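For anyone who wants to look at the distribution this produces, here is a small sketch. It assumes the arithmetic does not wrap (Python integers are unbounded); the PoC's integer width is not stated in this thread, and the function name is hypothetical.

```python
from collections import Counter

def benchmark_partition_id(row_id, num_partitions):
    return (row_id * 2654435761 + 17) % num_partitions

num_partitions = 1024
counts = Counter(benchmark_partition_id(r, num_partitions)
                 for r in range(4 * num_partitions))
```

Because the multiplier is odd and the tested partition counts (256, 1024, 4096) are powers of two, consecutive row IDs are spread evenly: every partition here receives exactly 4 of the 4096 rows. The benchmark therefore exercises a near-uniform distribution rather than a skewed one.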

For bucket count, the tested cases were:

1M / 1024: num_buckets=1, artifact files = 2
3M / 1024: num_buckets=2, artifact files = 3
10M / 256: num_buckets=4, artifact files = 5
10M / 1024: num_buckets=4, artifact files = 5
10M / 4096: num_buckets=4, artifact files = 5

wjones127 Jun 16, 2026
Maintainer

question: where do the performance gains come from? That wasn't clear from the document.

wjones127 · 2026-06-16T20:40:01Z

wjones127
Jun 16, 2026
Maintainer

(sigh it's really annoying to comment on GH discussion, but I'll do my best here.)

> Make partition ranges the central piece of information in the layout.

issue(non-blocking): That's more of the solution, not the goal, right? Goal is something like: "Be able to read a partition with no more than 5 read requests" or something like that.

> Let bucket file sizes scale adaptively with the size of the input.

question: Why is this important? Why should buckets get bigger?

> Support different vector index payloads and keep the shuffle layout decoupled from the concrete encoding.

praise: This is a good goal

On design goals: It seems like you are missing a lot of important factors here. Here are a few key constraints that I would add:

Should be able to shuffle in a limited amount of memory: 1.5 GB / core or less.
Should be able to read a partition from the bucket files in tens of reads, not thousands.

^ Both of these are goals I have in my optimizations to the two-file reader.

> Key characteristics of this layout:
>
> Data is written into a small number of concentrated files.
>
> Partition ranges are derived from the offsets matrix.
>
> The size of the offsets matrix grows with both the number of spill batches and the number of partitions.
>
> The adaptive bucketed partition artifact keeps the approach of concentrated writes while promoting partition ranges to first-class information in a manifest.

issue(blocking): Could you explain why the current solution is lacking? It's very unclear from this document why we need the new solution.

> Proposed Layout

question: Why JSON for manifest? Versus Lance for example. We could have 30,000+ partitions, and hundreds of ranges per partition. I would think we'd like a more compact/efficient format, no?

> The writer receives encoded batches and distributes rows into bucket buffers based on _part_id.
>
> When a bucket buffer is flushed, the writer sorts the buffered rows within that bucket by _part_id, appends the payload rows to the corresponding bucket file, and records the resulting partition ranges in the manifest state.

question: So is the peak memory use then O(num_buckets * buffer_size)? Whereas the current solution is just the buffer size? How many buckets do we think we might typically use?

One thing I found in my research is there was a tradeoff between buffer_size and number of read requests needed to load a partition. If you made the buffer_size smaller, each partition would be split into more ranges. Larger buffer_size meant fewer ranges. Realistically, though, if you have a big enough machine (even 32GB of RAM), you can sort the full PQ codes in memory.
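The tradeoff can be put into rough numbers. The sketch below estimates an upper bound on ranges per partition as the number of flushes of the bucket that holds it. It assumes rows are spread evenly over buckets, that every flush contains at least one row of each partition in that bucket, and that the total buffer budget is split evenly across buckets as in the proposal; the 24-byte row width and the function name are also assumptions.

```python
MIB = 1024 * 1024

def ranges_per_partition(num_rows, payload_row_width, num_buckets,
                         target_total_buffer_bytes=128 * MIB):
    """Upper-bound estimate of manifest ranges per partition.

    Assumes rows are spread evenly over buckets and that every flush of a
    bucket contains at least one row of each partition mapped to it.
    """
    buffer_rows = max(
        1, target_total_buffer_bytes // num_buckets // payload_row_width)
    rows_per_bucket = -(-num_rows // num_buckets)  # ceil division
    return -(-rows_per_bucket // buffer_rows)      # flushes per bucket

big_budget = [ranges_per_partition(10_000_000, 24, b) for b in (1, 4)]
small_budget = [ranges_per_partition(10_000_000, 24, b, 16 * MIB) for b in (1, 4)]
```

Under these assumptions the estimate depends on the total budget and not on the bucket count: 10M rows give about 2 ranges per partition with 128 MiB and about 15 with 16 MiB, for both 1 and 4 buckets.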

> The flush threshold also scales with the bucket count. The goal is to cap total buffer memory while ensuring that each flush produces reasonably large contiguous ranges

note: this is essentially what I was saying above 👍

> Bucket assignment determines how logical partitions are mapped to bucket files.

suggestion: having bucket assignment adds a lot of complexity. I think it would be reasonable to show benchmarks demonstrating they have real benefits before we add this.

> Manifest semantics ... bucket assignment policy

question: is there a reason we need bucket_policy in the manifest? It's just a writer concern right? The reader just figures out the buckets based on the bucket key in each partition, right?

> Expected benefits

suggestion: these should be pulled up into goals. It's a lot easier to read a design document when the goals are up front. Otherwise it's hard to judge the design details and whether they are accomplishing the stated goals.

> The finalization phase consumes rows partition by partition. The manifest stores partition-to-range mappings directly, so the reader can locate physical rows straight from a partition ID.

issue(blocking): Can you explain why this doesn't work in the current format? I would think the offsets file would support this pretty easily.

> The bucket count grows with the payload size. Small inputs use only a few buckets, large inputs use more. The number of files is controlled by the layout policy and scales with the data volume.

issue(blocking): Can you explain why more files is better? What is the benefit of splitting across files?

> Bucket files store payload columns. The shuffle layout treats the payload as opaque columns, so the same layout can host PQ codes, SQ codes, RQ codes, or any other encoded representation.

question: Is the two-file shuffler not agnostic to the payload? I thought it was.

> The artifact is a build-time shuffle layout. The final Lance index format continues to be written by the finalization phase. This boundary allows the shuffle layout to evolve independently.

issue(non-blocking): again, this sounds like a benefit that isn't new to this design. Why mention it?

1 reply

Xuanwo Jun 17, 2026
Maintainer Author

Let me clarify the framing.

The original motivation is the GPU build path. We need a Rust-native shuffle artifact that can be produced by GPU / external encoders and consumed by the normal Lance finalizer. The CPU TwoFileShuffler is the baseline for deciding whether this artifact can become the shared shuffle layout instead of being GPU-only.

The posted numbers are PoC-level microbenchmarks. They show the artifact layout is competitive with TwoFileShuffler in the tested CPU shuffle cases, but they are not a formal benchmark suite yet.

> Could you explain why the current solution is lacking?

There are two current baselines: the existing GPU path has a separate Python-side shuffle format, while the CPU path uses TwoFileShuffler. The proposal should explain this more clearly: the goal is a shared shuffle contract for CPU/GPU/external encoders.

> Can you explain why this doesn't work in the current format?

TwoFileShuffler can read a partition. The difference is representation: two-file derives partition ranges from per-batch offsets, while the artifact stores the materialized partition_id -> bucket file + row ranges mapping directly.

> Can you explain why more files is better?

More files are not the goal. The bucketed artifact is meant to provide a finalizer-readable format for the GPU/external encoding path. The adaptive bucket count is there so that, when compared with TwoFileShuffler, small inputs stay close to the two-file shape while larger inputs can split payload into a small number of bucket files.

> So is the peak memory use then O(num_buckets * buffer_size)?

It should be bounded by a total staging budget, not num_buckets * fixed_buffer_size.

> is there a reason we need bucket_policy in the manifest?

Probably not as required reader semantics. The reader only needs the materialized partition-to-ranges mapping. The policy is mainly a writer concern.

> Why JSON for manifest?

JSON was only a PoC choice. The proposal should define the logical manifest schema; the physical encoding can be compact if the manifest gets large.

> where do the performance gains come from?

For the PoC CPU benchmark, mainly from directly materialized partition ranges and adaptive bucket sizing.

I’ll update the proposal framing and run a formal benchmark against the GPU path and TwoFileShuffler, including different artifact policies / data distributions before claiming a default.
