Help dataset owner to choose between configs and splits? #2721

Open
severo opened this issue Apr 15, 2024 · 2 comments
Labels
P2 Nice to have question Further information is requested

Comments

Collaborator

severo commented Apr 15, 2024

See https://huggingface.slack.com/archives/C039P47V1L5/p1713172703779839

Am I correct in assuming that if you specify a "config" in a dataset, only the given config is downloaded, but if you specify a split, all splits for that config are downloaded? I came across it when using facebook's belebele (https://huggingface.co/datasets/facebook/belebele). Instead of a config for each language, they use a split for each language, but that seems to mean that the full dataset is downloaded, even if you select just one language split.

For languages, we recommend using different configs, not splits.
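As a minimal sketch of what that recommendation looks like for a dataset user, assuming each language is published as its own config: `load_dataset` takes the config through its `name` argument and the split through `split`, so a request names one language and, per the behaviour described in the question above, should only fetch that config. The helper name `load_language`, the injectable `loader` argument, and the repo id and language code in the usage note are hypothetical and only there for illustration.

```python
from typing import Any, Callable, Optional


def load_language(
    repo_id: str,
    language: str,
    split: str = "test",
    loader: Optional[Callable[..., Any]] = None,
) -> Any:
    """Load one language, assuming every language is its own config."""
    if loader is None:
        # Imported lazily so the sketch can be read without the library installed.
        from datasets import load_dataset

        loader = load_dataset
    # `name` selects the config; `split` selects a split inside that config.
    return loader(repo_id, name=language, split=split)
```

Usage would look like `load_language("some-org/some-multilingual-dataset", "eng_Latn")`, where both strings are placeholders rather than a real repository layout.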

Maybe we should also show a warning, or open a PR or discussion, when a dataset contains more than 5 splits, hinting that configs might be a better fit?
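A rough sketch of that check, assuming the split names of each config are already known: it flags every config whose number of distinct splits exceeds the threshold of 5 mentioned above. The function name, the constant, and the example layout are hypothetical; this is not how the Hub or the dataset viewer implements anything.

```python
from typing import Dict, List

# Threshold taken from the suggestion above; not an official limit.
MAX_SPLITS = 5


def configs_with_many_splits(
    splits_by_config: Dict[str, List[str]],
    max_splits: int = MAX_SPLITS,
) -> List[str]:
    """Return the configs that have more than `max_splits` distinct splits."""
    return sorted(
        config
        for config, splits in splits_by_config.items()
        if len(set(splits)) > max_splits
    )


# Illustrative layout: one config that uses a split per language.
example = {
    "default": ["eng_Latn", "fra_Latn", "deu_Latn", "spa_Latn", "ita_Latn", "por_Latn"],
    "classic": ["train", "validation", "test"],
}
flagged = configs_with_many_splits(example)
```

For a real dataset, the mapping could probably be built with `datasets.get_dataset_config_names` and `datasets.get_dataset_split_names`; that wiring is left out here.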

severo added the "question (Further information is requested)" label Apr 15, 2024
severo changed the title from "Docs: make it clear when to use splits or configs" to "Docs: make it clear when to use splits or configs?" Apr 15, 2024
severo changed the title from "Docs: make it clear when to use splits or configs?" to "Help dataset owner to choose between configs and splits?" Apr 15, 2024
Collaborator Author

severo commented Apr 15, 2024

See a discussion on the Hub: https://huggingface.co/datasets/facebook/belebele/discussions/5

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

severo added the "P2 (Nice to have)" label May 24, 2024
severo reopened this May 24, 2024