Skip to content

Merge batches from multiple datafiles in the same Fragment#815

Merged
eddyxu merged 28 commits intomainfrom
lei/merge_from_data_files
May 3, 2023
Merged

Merge batches from multiple datafiles in the same Fragment#815
eddyxu merged 28 commits intomainfrom
lei/merge_from_data_files

Conversation

@eddyxu
Copy link
Copy Markdown
Member

@eddyxu eddyxu commented Apr 30, 2023

No description provided.

@eddyxu eddyxu marked this pull request as draft April 30, 2023 00:32
@eddyxu eddyxu added WIP work in progress donotmerge Do not merge labels Apr 30, 2023
@eddyxu eddyxu force-pushed the lei/merge_from_data_files branch 2 times, most recently from 4d0d3e3 to 6737601 Compare May 2, 2023 22:40
@eddyxu eddyxu force-pushed the lei/merge_from_data_files branch from 6737601 to bf3b27a Compare May 3, 2023 03:11
@eddyxu eddyxu removed WIP work in progress donotmerge Do not merge labels May 3, 2023
Copy link
Copy Markdown
Contributor

@changhiskhan changhiskhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comments/questions


def add_columns(
self,
value_func: Callable[[pa.RecordBatch], pa.RecordBatch],
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so the value_func's output will be the augmented Fragment that will then get written as the new version right? Going to assume the API will improve in subsequent PRs

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so the idea is that value_func will take of matching the input data and the merge data?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the value_func will be "the producer" of the new data. This is similar to pandas udf in spark, so it takes in a "RecordBatch" as input, and create a same size "RecordBatch" as output

Comment thread rust/src/dataset.rs
}

/// Create a new version of [`Dataset`] from a collection of fragments.
pub async fn create_version_from_fragments(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we use this to manually create a version after distributed writes?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is exactly the purpose of exposing this function as pub.

Comment thread rust/src/io/exec/scan.rs
)
.await
{
let file_fragment = FileFragment::new(dataset.clone(), frag.clone());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice the FileFragment abstraction really simplifies this code

// TODO: use tokio::async buffer to make parallel reads.
let mut batches = vec![];
for (reader, schema) in self.readers.iter() {
let batch = reader
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can these readers be constructed with the schema so you don't need to pass tuples around?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can be done via another refactory. Currently FileReader does take the schema. Can make FileReader hold the projection schema?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's fine, not a blocker 🤷

@eddyxu eddyxu changed the title [WIP] Merge batches from multiple datafiles in the same Fragment Merge batches from multiple datafiles in the same Fragment May 3, 2023
Copy link
Copy Markdown
Contributor

@changhiskhan changhiskhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gogogogogogogo

@eddyxu eddyxu merged commit 9f08965 into main May 3, 2023
@eddyxu eddyxu deleted the lei/merge_from_data_files branch May 3, 2023 23:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants