[Rust] Expose DataFragment as public dataset API. #769

Merged: eddyxu merged 12 commits into main from lei/fragment on Apr 12, 2023

Conversation

Member

@eddyxu commented Apr 12, 2023

Adds a FileFragment struct modeled after pyarrow.dataset.Fragment.

@eddyxu self-assigned this Apr 12, 2023
@eddyxu added the arrow (Apache Arrow related issues) and rust (Rust related tasks) labels Apr 12, 2023
Comment thread rust/src/dataset.rs
/// Get fragments.
///
/// If `filter` is provided, only fragments with the given name will be returned.
pub fn get_fragments(&self) -> Vec<FileFragment> {
Contributor

Isn't it confusing to have both get_fragments and fragments?

Member Author

Yeah, I will refactor it later to remove fragments entirely.

Comment thread rust/src/dataset/fragment.rs Outdated
}

async fn do_open(&self, paths: &[&str]) -> Result<FileReader> {
// TODO: support open multiple data failes.
Contributor

typo: data files

Member Author

done

offset: None,
nearest: None,
with_row_id: false,
fragment: None,
Contributor

what's the purpose of a scanner that only does 1 fragment?

Member Author

This is to support pyarrow.dataset.Fragment.to_batches().
By exposing DataFragment through the pyarrow API, users can obtain a list of Fragments on different machines, and each Fragment can serve as the basic unit for reading a file in a distributed manner.
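
To make the single-fragment scanner idea concrete, here is a minimal, self-contained sketch. It is not the code from this PR: the `Dataset`, `Fragment`, `FileFragment`, and `Scanner` types below are simplified stand-ins, `collect_rows` and `read_distributed` are hypothetical helpers, and threads play the role of separate machines. It only illustrates the shape of the design, where a scanner with `fragment: None` reads everything and a scanner built with `from_fragment` reads one fragment.

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for one fragment's metadata.
/// A real fragment would reference data files, not hold rows inline.
#[derive(Clone, Debug)]
struct Fragment {
    rows: Vec<i32>,
}

/// Hypothetical stand-in for the dataset, shared between workers via `Arc`.
struct Dataset {
    fragments: Vec<Fragment>,
}

/// Toy scanner: `fragment: None` scans the whole dataset,
/// `Some(..)` restricts the scan to a single fragment.
struct Scanner {
    dataset: Arc<Dataset>,
    fragment: Option<Fragment>,
}

impl Scanner {
    fn new(dataset: Arc<Dataset>) -> Self {
        Self { dataset, fragment: None }
    }

    fn from_fragment(dataset: Arc<Dataset>, fragment: Fragment) -> Self {
        Self { dataset, fragment: Some(fragment) }
    }

    /// Hypothetical helper that materializes the scanned rows.
    fn collect_rows(&self) -> Vec<i32> {
        match &self.fragment {
            Some(f) => f.rows.clone(),
            None => self
                .dataset
                .fragments
                .iter()
                .flat_map(|f| f.rows.iter().copied())
                .collect(),
        }
    }
}

/// A fragment handle that knows which dataset it belongs to.
struct FileFragment {
    dataset: Arc<Dataset>,
    metadata: Fragment,
}

impl FileFragment {
    /// Build a scanner limited to this fragment.
    fn scan(&self) -> Scanner {
        Scanner::from_fragment(self.dataset.clone(), self.metadata.clone())
    }
}

fn get_fragments(dataset: &Arc<Dataset>) -> Vec<FileFragment> {
    dataset
        .fragments
        .iter()
        .map(|f| FileFragment { dataset: dataset.clone(), metadata: f.clone() })
        .collect()
}

/// Each worker thread scans exactly one fragment; results are
/// concatenated in fragment order.
fn read_distributed(dataset: &Arc<Dataset>) -> Vec<i32> {
    let handles: Vec<_> = get_fragments(dataset)
        .into_iter()
        .map(|frag| thread::spawn(move || frag.scan().collect_rows()))
        .collect();
    handles
        .into_iter()
        .flat_map(|h| h.join().expect("worker thread panicked"))
        .collect()
}
```

In this sketch, scanning every fragment independently and concatenating the results yields the same rows as one whole-dataset scan, which is the property that lets a fragment act as the unit of distributed work.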

}
}

pub fn from_fragment(dataset: Arc<Dataset>, fragment: Fragment) -> Self {
Contributor

where is this used?

Member Author

FileFragment::scan()

@eddyxu merged commit 019d910 into main Apr 12, 2023
@eddyxu deleted the lei/fragment branch April 12, 2023 16:57
2 participants