In each step, we should store whether we truncated the data or not.
Currently, config-parquet-and-info only stores the fact that some of the splits have been partially converted to parquet, but not the list of them. We want this info for each split.
The same goes when we convert to duckdb, and when we compute the statistics. This should apply to every truncation step, so that we can show trustworthy information in the viewer.
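A minimal sketch of what storing per-split truncation info could look like. The field names `partial` and `truncated_splits`, and the helper itself, are assumptions for illustration, not the actual datasets-server API:

```python
# Hypothetical sketch: aggregate per-split truncation flags into a step's
# response, instead of only a single config-level "partial" boolean.
# `partial` and `truncated_splits` are assumed names, not the real schema.

def build_config_response(split_results: dict[str, bool]) -> dict:
    """split_results maps split name -> whether that split was truncated."""
    truncated = sorted(
        name for name, was_truncated in split_results.items() if was_truncated
    )
    return {
        "partial": bool(truncated),      # config-level flag (what is stored today)
        "truncated_splits": truncated,   # proposed: the list of truncated splits
    }

response = build_config_response({"train": True, "validation": False, "test": True})
print(response)
# {'partial': True, 'truncated_splits': ['test', 'train']}
```

With per-split flags, the viewer (and duckdb/statistics steps) could read which specific splits were truncated rather than inferring it from a single config-wide boolean.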
There is a new dataset https://huggingface.co/datasets/H-D-T/Buzz with 31M samples, but the viewer only shows 3.1M. It looks like it only loads the first json file instead of all batches. That led me and others to wrongly believe it's 3.1M instead of 31M.
On another subject, we noticed that the stats shown for datasets like https://huggingface.co/datasets/bigcode/the-stack-v2 are computed only on the first 5GB, but the viewer doesn't show this info anywhere (it could be very discreet, like on hover or via a small "(i)" icon).