feat: define IVFShuffler trait #2376

Closed
1 change: 1 addition & 0 deletions rust/lance-index/src/vector.rs
@@ -19,6 +19,7 @@ pub mod residual;
pub mod sq;
pub mod transform;
pub mod utils;
pub mod v3;

// TODO: Make these crate private once the migration from lance to lance-index is done.
pub const PQ_CODE_COLUMN: &str = "__pq_code";
4 changes: 4 additions & 0 deletions rust/lance-index/src/vector/v3.rs
@@ -0,0 +1,4 @@
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright The Lance Authors

pub mod shuffler;
35 changes: 35 additions & 0 deletions rust/lance-index/src/vector/v3/shuffler.rs
@@ -0,0 +1,35 @@
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright The Lance Authors

//! Shuffler is a component that takes a stream of record batches and shuffles them into
//! the corresponding IVF partitions.

use lance_core::Result;
use lance_io::stream::RecordBatchStream;

#[async_trait::async_trait]
/// A reader that can read the shuffled partitions.
pub trait IvfShuffleReader: Send + Sync {
/// Read a partition by partition_id.
/// Returns Ok(None) if the partition size is 0;
/// check reader.partiton_size(partition_id) before calling this function.
async fn read_partition(
&self,
partition_id: usize,
) -> Result<Option<Box<dyn RecordBatchStream + Unpin + 'static>>>;

/// Get the size of the partition by partition_id
fn partiton_size(&self, partition_id: usize) -> Result<usize>;
}

#[async_trait::async_trait]
/// A shuffler that shuffles the incoming stream of record batches into IVF partitions.
/// Returns an IvfShuffleReader that can be used to read the shuffled partitions.
pub trait IvfShuffler {
/// Shuffle the incoming stream of record batches into IVF partitions.
/// Returns an IvfShuffleReader that can be used to read the shuffled partitions.
async fn shuffle(
Contributor

Once this function has finished, do we expect all shuffles to be done? Or can shuffles still be in flight, so that IvfShuffleReader might return partial results in read_partition()?

Contributor Author

Once this function finishes, all shuffling should have completed.

mut self,
data: Box<dyn RecordBatchStream + Unpin + 'static>,
) -> Result<Box<dyn IvfShuffleReader>>;
}
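
For illustration only, the contract discussed above can be sketched with a simplified, synchronous, in-memory analogue of the two traits. Everything in this block is hypothetical and not part of the PR: `Row` stands in for a record batch, the traits drop `async`, `Result`, and `RecordBatchStream`, the size method is spelled `partition_size`, and `InMemoryShuffler` / `InMemoryReader` are made-up names. It shows the two behaviours the review thread settles on: `shuffle` consumes the whole input before returning a reader, and `read_partition` yields `None` for a partition of size 0.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a record batch: one row tagged with its IVF partition.
#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub partition_id: usize,
    pub row_id: u64,
}

/// Simplified, synchronous analogue of `IvfShuffleReader` (assumption: no async, no Result).
pub trait ShuffleReader {
    /// Returns None when the partition holds no rows.
    fn read_partition(&self, partition_id: usize) -> Option<&[Row]>;
    /// Number of rows stored for the partition (0 if it was never written).
    fn partition_size(&self, partition_id: usize) -> usize;
}

/// Simplified, synchronous analogue of `IvfShuffler`.
pub trait Shuffler {
    /// Consumes the shuffler and the input; all rows are placed before this returns.
    fn shuffle(self, data: Box<dyn Iterator<Item = Row>>) -> Box<dyn ShuffleReader>;
}

/// Hypothetical reader backed by a map from partition id to rows.
pub struct InMemoryReader {
    partitions: HashMap<usize, Vec<Row>>,
}

impl ShuffleReader for InMemoryReader {
    fn read_partition(&self, partition_id: usize) -> Option<&[Row]> {
        // Only non-empty partitions are ever inserted, so a missing key means size 0.
        self.partitions.get(&partition_id).map(|rows| rows.as_slice())
    }

    fn partition_size(&self, partition_id: usize) -> usize {
        self.partitions.get(&partition_id).map_or(0, |rows| rows.len())
    }
}

/// Hypothetical shuffler that groups rows in memory.
pub struct InMemoryShuffler;

impl Shuffler for InMemoryShuffler {
    fn shuffle(self, data: Box<dyn Iterator<Item = Row>>) -> Box<dyn ShuffleReader> {
        let mut partitions: HashMap<usize, Vec<Row>> = HashMap::new();
        // Drain the entire input first, so the returned reader never sees partial results.
        for row in data {
            partitions.entry(row.partition_id).or_default().push(row);
        }
        Box::new(InMemoryReader { partitions })
    }
}
```

A caller would build the reader with `InMemoryShuffler.shuffle(Box::new(rows.into_iter()))`, then check `partition_size(id)` before calling `read_partition(id)`, mirroring the doc comment on the PR's reader trait.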