In each step, we should store whether we truncated the data or not.
Currently, config-parquet-and-info only stores the fact that some of the splits have been partially converted to parquet, but not the list of them. We want this info for each split.
The same goes when we convert to duckdb, and when we compute the statistics. This should apply to every truncation step, so that we can show trustworthy information in the viewer.
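A minimal sketch of what storing per-split truncation info could look like. The field names `partial` and `truncated_splits`, and the helper itself, are assumptions for illustration, not the actual datasets-server API:

```python
# Hypothetical sketch: aggregate per-split truncation flags into a step's
# response, instead of only a single config-level "partial" boolean.
# `partial` and `truncated_splits` are assumed names, not the real schema.

def build_config_response(split_results: dict[str, bool]) -> dict:
    """split_results maps split name -> whether that split was truncated."""
    truncated = sorted(
        name for name, was_truncated in split_results.items() if was_truncated
    )
    return {
        "partial": bool(truncated),      # config-level flag (what is stored today)
        "truncated_splits": truncated,   # proposed: the list of truncated splits
    }

response = build_config_response({"train": True, "validation": False, "test": True})
print(response)
# {'partial': True, 'truncated_splits': ['test', 'train']}
```

With per-split flags, the viewer (and duckdb/statistics steps) could read which specific splits were truncated rather than inferring it from a single config-wide boolean.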
There is a new dataset https://huggingface.co/datasets/H-D-T/Buzz with 31M samples, but the viewer only shows 3.1M. It looks like it only loads the first json file instead of all batches. That led me and others to wrongly believe it's 3.1M instead of 31M.
On another subject, we noticed that the stats shown for datasets like https://huggingface.co/datasets/bigcode/the-stack-v2 are computed only on the first 5GB, but the viewer doesn't show this info anywhere (it could be very discreet, like on hover or via a small "(i)" icon).