Skip to content

Conversation

@jstlaurent
Copy link
Contributor

Changelogs

This PR updates the file size and checksum set as metadata on a dataset's table content to use the Parquet file's attributes. This will let the Hub correctly check the size of the uploaded file and
set the state of the dataset to ready.

Links


Checklist:

  • Was this PR discussed in an issue? It is recommended to first discuss a new feature into a GitHub issue before opening a PR.
  • [ ] Add tests to cover the fixed bug(s) or the newly introduced feature(s) (if appropriate).
  • [ ] Update the API documentation if a new function is added, or an existing one is deleted.
  • Write concise and explanatory changelogs above.
  • If possible, assign one of the following labels to the PR: feature, fix or test (or ask a maintainer to do it for you).

…d attribute to use those of the Parquet file to be uploaded
@jstlaurent jstlaurent requested a review from hadim as a code owner November 13, 2023 20:56
@jstlaurent jstlaurent added the fix Annotates any PR that fixes bugs label Nov 13, 2023
@jstlaurent jstlaurent self-assigned this Nov 13, 2023
Copy link
Contributor

@hadim hadim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks Julien.

I would assume we don't have any concern about buffer not being garbage collected right? And so no need for anything like del buffer?

@jstlaurent
Copy link
Contributor Author

LGTM. Thanks Julien.

I would assume we don't have any concern about buffer not being garbage collected right? And so no need for anything like del buffer?

It'll be GC-able when it goes out of scope at the end of the function call. That being said, a large enough Parquet file generated here will probably blow up the process memory. I don't know Pandas well enough to say if it's also an issue with the dataframe in the Dataset instance.

In any case, we'll have to be smarter when handling very large datasets, like streaming from disk and multi-part uploads.

Copy link
Collaborator

@cwognum cwognum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 🚢

Copy link
Contributor

@zhu0619 zhu0619 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jstlaurent

@jstlaurent jstlaurent merged commit 861ee1e into main Nov 14, 2023
@jstlaurent jstlaurent deleted the fix/dataset-table-upload-metadata branch November 14, 2023 21:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

fix Annotates any PR that fixes bugs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants