
Support parquet datasets with >10k files #2503

Merged
merged 3 commits into from
Feb 27, 2024

Conversation

lhoestq
Member

@lhoestq lhoestq commented Feb 26, 2024

This PR respects the limit of 10k files per folder when copying Parquet files to refs/convert/parquet.

To do so, I'm using a directory scheme like config/train-part0/0000.parquet up to config/train-part9/9999.parquet when there are more than 10k files.
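
The directory scheme described here can be sketched as follows. This is a minimal illustration, not the PR's actual code: the function name `shard_path`, the constant `MAX_FILES_PER_DIRECTORY`, and the flat `config/split/0000.parquet` layout used when there are at most 10k files are all assumptions.

```python
# Assumption: the per-folder limit discussed in this PR.
MAX_FILES_PER_DIRECTORY = 10_000


def shard_path(config: str, split: str, shard_idx: int, num_shards: int) -> str:
    """Return a hypothetical repo path for one Parquet shard (0-based index)."""
    if num_shards > MAX_FILES_PER_DIRECTORY:
        # Spread the shards over "<split>-part<N>" directories of 10k files each.
        part, idx_in_part = divmod(shard_idx, MAX_FILES_PER_DIRECTORY)
        return f"{config}/{split}-part{part}/{idx_in_part:04d}.parquet"
    # Assumed layout when everything fits in a single directory.
    return f"{config}/{split}/{shard_idx:04d}.parquet"


first = shard_path("en", "train", 0, 100_000)
last = shard_path("en", "train", 99_999, 100_000)
```

With 100,000 shards, `first` is `en/train-part0/0000.parquet` and `last` is `en/train-part9/9999.parquet`, matching the range given in the description.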

Close #2498

cc @guipenedo

@lhoestq lhoestq requested a review from severo February 26, 2024 18:11
        self.partial = partial

    @property
    def path_in_repo(self) -> str:
        # For paths like "en/partial-train/0000.parquet" in the C4 dataset.
        # Note that "-" is forbidden for split names so it doesn't create directory names collisions.
        partial_prefix = PARTIAL_PREFIX if self.partial else ""
        # For paths like "en/train-part0/0000.parquet", "en/train-part1/0000.parquet" up to "en/train-part9/9999.parquet".
Collaborator


Do you think we could have more than 10 directories? More than 100? Ideally, we would want the directories to be ordered alphabetically, so maybe we should add zero padding, such as train-part00 or train-part000...

Member Author


No: the maximum number of files is 100k, so the last file name is train-part9/9999 (file number 100,000).

Collaborator


OK. Is it a hard limit? It does not seem so, from https://huggingface.co/docs/hub/repositories-recommendations

They recommend:

The maximum number of files per folder cannot exceed 10k files per folder. A simple solution is to create a repository structure that uses subdirectories. For example, a repo with 1k folders from 000/ to 999/, each containing at most 1000 files, is already enough.

So... that would support 1M files.
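
To illustrate why zero padding matters once there are more than 10 directories, here is a small sketch. It assumes a hypothetical scheme with 100 directories of 10k files each (1M files in total), which is not what this PR implements; the `-part` naming follows the PR, the two-digit padding is an assumption.

```python
# Hypothetical: 100 directories x 10,000 files = 1,000,000 files.
NUM_DIRECTORIES = 100

unpadded = [f"train-part{i}" for i in range(NUM_DIRECTORIES)]
padded = [f"train-part{i:02d}" for i in range(NUM_DIRECTORIES)]

# Alphabetical (lexicographic) sorting only preserves numeric order with padding:
# without it, "train-part10" sorts before "train-part2".
unpadded_is_ordered = sorted(unpadded) == unpadded
padded_is_ordered = sorted(padded) == padded
```

With at most 10 directories (train-part0 to train-part9), the part number has a single digit, so the unpadded names already sort in numeric order.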

Collaborator

@severo severo left a comment


OK.

I still think we could allow more than 100k files, i.e. 1M. But since we raise an explicit exception, it will be easy to recompute later too...
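
A rough sketch of what such an explicit check could look like, assuming the 100k limit comes from 10 directories of 10k files each. The exception class and function names are hypothetical and are not taken from the PR.

```python
MAX_FILES_PER_DIRECTORY = 10_000
MAX_DIRECTORIES = 10  # train-part0 .. train-part9
MAX_FILES = MAX_FILES_PER_DIRECTORY * MAX_DIRECTORIES  # 100,000


class TooManyParquetFilesError(ValueError):
    """Hypothetical error raised when a split has more files than the scheme supports."""


def check_num_files(num_files: int) -> None:
    # Fail loudly rather than silently writing to an unsupported path.
    if num_files > MAX_FILES:
        raise TooManyParquetFilesError(
            f"Got {num_files} files, but at most {MAX_FILES} are supported."
        )
```

Raising the limit later would then only require changing the constants (and possibly the padding of the directory names) before recomputing the affected datasets.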

@lhoestq lhoestq merged commit de720f1 into main Feb 27, 2024
22 checks passed
@lhoestq lhoestq deleted the support-parquet-datasets-with-more-than-10k-files branch February 27, 2024 16:52
Development

Successfully merging this pull request may close these issues.

Support datasets with >10k Parquet files