Support parquet datasets with >10k files #2503
        self.partial = partial

    @property
    def path_in_repo(self) -> str:
        # For paths like "en/partial-train/0000.parquet" in the C4 dataset.
        # Note that "-" is forbidden for split names so it doesn't create directory names collisions.
        partial_prefix = PARTIAL_PREFIX if self.partial else ""
        # For paths like "en/train-part0/0000.parquet", "en/train-part1/0000.parquet" up to "en/train-part9/9999.parquet".
Do you think we could have more than 10 directories? More than 100? Ideally we would want the directories to be ordered alphabetically, so maybe we should add zero padding, such as train-part00 or train-part000.
No: the maximum number of files is 100k, so the last file name is train-part9/9999 (file number 100,000).
OK. Is it a hard limit? It does not seem so, from https://huggingface.co/docs/hub/repositories-recommendations
They recommend:
The maximum number of files per folder cannot exceed 10k files per folder. A simple solution is to create a repository structure that uses subdirectories. For example, a repo with 1k folders from 000/ to 999/, each containing at most 1000 files, is already enough.
so... supporting 1M files
OK.
I still think we could allow more than 100k, i.e. 1M. But since we raise an explicit exception, it will be easy to recompute later too.
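To illustrate the ordering concern raised in this thread, here is a minimal sketch, not code from this PR. The helper `part_dir` and its `num_parts` parameter are hypothetical. It pads the part number only as much as the number of directories requires, so the 10-directory case keeps the names used in this PR (train-part0 to train-part9), while a larger limit would get train-part00, train-part000, and so on.

```python
def part_dir(split: str, part: int, num_parts: int) -> str:
    """Hypothetical helper: directory name for one part of a split.

    The part number is zero-padded to the width of the largest part index,
    so that lexicographic order matches numeric order.
    """
    if not 0 <= part < num_parts:
        raise ValueError(f"part must be in [0, {num_parts}), got {part}")
    width = len(str(num_parts - 1))
    return f"{split}-part{part:0{width}d}"


# Without padding, 12 directories no longer sort in numeric order.
unpadded = [f"train-part{i}" for i in range(12)]
unpadded_is_ordered = sorted(unpadded) == unpadded

# With padding, lexicographic order and numeric order agree.
padded = [part_dir("train", i, 12) for i in range(12)]
padded_is_ordered = sorted(padded) == padded
```

With at most 10 directories the width is 1, so no padding is added and the names sort correctly as they are, which is consistent with the 100k limit discussed above.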
By respecting the 10k files limit per folder when copying them to refs/convert/parquet. To do so, I'm using a directory scheme like config/train-part0/0000.parquet up to config/train-part9/9999.parquet when there are >10k files.
Close #2498
cc @guipenedo
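The directory scheme described above can be sketched as follows. This is a rough illustration under stated assumptions, not the code in this PR: the function name `parquet_path`, its parameters, and the two constants are hypothetical, and the flat `config/split/0000.parquet` layout for splits with at most 10k files is assumed.

```python
MAX_FILES_PER_DIRECTORY = 10_000  # per-folder limit discussed in this PR
MAX_FILES_PER_SPLIT = 100_000     # 10 directories of 10k files each


def parquet_path(config: str, split: str, file_index: int, num_files: int) -> str:
    """Hypothetical helper: path of the file_index-th parquet file (0-based)."""
    if num_files > MAX_FILES_PER_SPLIT:
        # Mirrors the explicit exception mentioned in the review thread.
        raise ValueError(f"too many parquet files: {num_files} > {MAX_FILES_PER_SPLIT}")
    if not 0 <= file_index < num_files:
        raise ValueError(f"file_index must be in [0, {num_files}), got {file_index}")
    if num_files <= MAX_FILES_PER_DIRECTORY:
        # Assumed flat layout when the split fits in a single directory.
        return f"{config}/{split}/{file_index:04d}.parquet"
    # Otherwise spread the files over train-part0 ... train-part9.
    part, index_in_part = divmod(file_index, MAX_FILES_PER_DIRECTORY)
    return f"{config}/{split}-part{part}/{index_in_part:04d}.parquet"
```

For example, `parquet_path("en", "train", 99_999, 100_000)` gives `en/train-part9/9999.parquet`, which is file number 100,000 when counting from 1.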