provide one "partial" field per entry in aggregated responses #1532

Open
severo opened this issue Jul 19, 2023 · 6 comments
Labels
feature request (Request for a new feature), P2 (Nice to have), question (Further information is requested)

Comments

@severo
Collaborator

severo commented Jul 19, 2023

For example, https://datasets-server.huggingface.co/size?dataset=c4 only provides a global partial: true field; the response does not make explicit that the "train" split is partial while the "test" one is complete.

Every entry in configs and splits should also include its own partial field, so that the viewer can show this information in its select menus.

  • currently:
    [screenshot, 2023-07-19 16:00:28]
  • ideally, something like:
    [screenshot, 2023-07-19 16:01:39]

Endpoints where we want these extra fields:

  • /info, dataset-level
  • /size, dataset-level
  • /size, config-level
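
To make the request concrete, here is a minimal sketch of what "one partial field per entry" could look like for a /size-like response. The response shape is simplified, and the function name `add_partial_fields`, the `partial_by_split` mapping and the config name are hypothetical, not part of the current API. It also assumes that a config is partial as soon as one of its splits is.

```python
from copy import deepcopy
from typing import Any


def add_partial_fields(
    response: dict[str, Any],
    partial_by_split: dict[tuple[str, str], bool],
) -> dict[str, Any]:
    """Return a copy of a /size-like response with `partial` on every entry.

    `partial_by_split` maps (config, split) to the split's partial status.
    How that mapping is obtained is out of scope for this sketch.
    """
    out = deepcopy(response)  # leave the input untouched
    size = out["size"]
    for split in size["splits"]:
        split["partial"] = partial_by_split[(split["config"], split["split"])]
    for config in size["configs"]:
        # Assumption: a config is partial if any of its splits is partial.
        config["partial"] = any(
            s["partial"] for s in size["splits"] if s["config"] == config["config"]
        )
    return out


# Simplified response: only identifiers, plus the existing global field.
# "some_config" is a placeholder name.
response = {
    "size": {
        "configs": [{"dataset": "c4", "config": "some_config"}],
        "splits": [
            {"dataset": "c4", "config": "some_config", "split": "train"},
            {"dataset": "c4", "config": "some_config", "split": "test"},
        ],
    },
    "partial": True,
}
annotated = add_partial_fields(
    response,
    {("some_config", "train"): True, ("some_config", "test"): False},
)
```

With this, the viewer could read `partial` directly from each entry of `annotated["size"]["splits"]` instead of relying on the global field.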
@severo
Collaborator Author

severo commented Jul 19, 2023

Note that this means changing the format (and implementation) of the config-parquet-and-info step, and recomputing all its artifacts 😬

Also: the field partial should be added to every entry of splits in the /info response (or provided in another format, if we want to preserve the "info" as exported by the datasets library)

@severo severo added the feature request Request for a new feature label Jul 19, 2023
@severo
Collaborator Author

severo commented Jul 19, 2023

Maybe https://github.com/huggingface/moon-landing/pull/7079 (internal) is sufficient for now, i.e. show a general warning for the dataset if any of its splits is partial.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@severo severo added question Further information is requested P2 Nice to have labels Aug 22, 2023
@severo
Collaborator Author

severo commented May 15, 2024

> Note that this means changing the format (and implementation) of the config-parquet-and-info step, and recomputing all its artifacts 😬

I think we should store which splits are partial and which are complete. Opening an issue for that -> #2809, and this one will depend on it.

@lhoestq
Member

lhoestq commented May 16, 2024

Note that we can get this info per split already for free for most datasets:

  • if there is only one split then its partial value is equal to the partial value at config level
  • if the split size info is in the YAML in the README.md, then the partial value is False if it matches the size generated by config-parquet-and-info, and True otherwise

So actually we should be able to retrieve most of the partial values, no?
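
A rough sketch of the two rules above, assuming that a split whose declared size equals the generated size is complete (partial is False). The function name and parameters are hypothetical, and "size" is left generic: it could be a number of examples or of bytes, as long as both values use the same unit.

```python
from typing import Optional


def infer_split_partial(
    config_partial: bool,
    num_splits_in_config: int,
    declared_size: Optional[int] = None,
    generated_size: Optional[int] = None,
) -> Optional[bool]:
    """Infer a split's partial value, or return None if it cannot be inferred.

    declared_size: split size from the YAML in README.md, if present.
    generated_size: split size produced by config-parquet-and-info.
    """
    # Rule 1: a single split shares the partial value of its config.
    if num_splits_in_config == 1:
        return config_partial
    # Rule 2: compare the declared size with the generated one.
    if declared_size is not None and generated_size is not None:
        return declared_size != generated_size
    # Neither rule applies: the value would have to be recomputed.
    return None
```

Entries for which this returns None are the ones that would still need another source of information.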

@severo
Collaborator Author

severo commented May 16, 2024

yes, it would be a good way to migrate the cache entries to the new schema instead of recomputing in #2809
