Open
Description
Describe the bug
While browsing the dataset at https://huggingface.co/datasets/NeuML/wikipedia-20250123, I noticed that its size was estimated at only about 4M rows, although it actually contains nearly 7M entries, so the estimate is little more than half the true count. Both loading the dataset after download and the dataset_info (https://huggingface.co/datasets/NeuML/wikipedia-20250123/blob/main/train/dataset_info.json) confirm that the true row count is close to 7M. A discrepancy this large could mislead users who sort datasets by row count. Why not read the row count directly from dataset_info?
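To illustrate the suggestion, here is a minimal sketch of reading the row count from an already-downloaded dataset_info.json. It assumes the file follows the usual layout written by the `datasets` library, with a top-level `splits` mapping whose entries carry a `num_examples` field; I have not checked how the estimate is actually computed on the Hub side. The function name `total_rows` and the sample payload (including its count) are made up for illustration and are not the real values from this dataset.

```python
import json


def total_rows(dataset_info: dict) -> int:
    """Sum `num_examples` over every split in a parsed dataset_info.json.

    Assumes the layout {"splits": {"<split>": {"num_examples": <int>, ...}}}.
    Splits without a `num_examples` field are counted as 0 rows.
    """
    splits = dataset_info.get("splits") or {}
    return sum(int(split.get("num_examples", 0)) for split in splits.values())


# Placeholder payload for illustration only; the count is NOT the real value.
sample = json.loads(
    '{"splits": {"train": {"name": "train", "num_examples": 7000000}}}'
)
print(total_rows(sample))
```

If the file is missing or has no `splits` entry, the function returns 0, so a caller could fall back to the current estimate in that case.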
Not sure if this is the right place to report this bug, but leaving it here for the team's awareness.