Front-end statistical data quantity deviation #7507

@rangehow

Describe the bug

While browsing the dataset at https://huggingface.co/datasets/NeuML/wikipedia-20250123, I noticed that the page estimates its size at only about 4M rows, even though the dataset has nearly 7M entries. The estimate is little more than half the actual amount. Both loading the dataset after download and the dataset_info file (https://huggingface.co/datasets/NeuML/wikipedia-20250123/blob/main/train/dataset_info.json) confirm that the true row count is close to 7M. A discrepancy this large could mislead users who sort datasets by row count. Why not retrieve this number directly from dataset_info?

Not sure if this is the right place to report this bug, but leaving it here for the team's awareness.
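
To illustrate the suggestion, here is a minimal sketch of reading the row count from dataset_info.json instead of estimating it. It rests on assumptions: the file is taken to have a top-level `splits` mapping whose entries carry `num_examples`, and the `resolve` form of the URL above is taken to serve the raw file. The helper names `fetch_dataset_info` and `total_rows` are hypothetical and are not part of any Hugging Face API.

```python
import json
from urllib.request import urlopen

# Assumption: swapping "blob" for "resolve" in the reported URL returns the raw JSON file.
INFO_URL = (
    "https://huggingface.co/datasets/NeuML/wikipedia-20250123"
    "/resolve/main/train/dataset_info.json"
)


def fetch_dataset_info(url: str = INFO_URL) -> dict:
    """Download and parse a dataset_info.json file (hypothetical helper)."""
    with urlopen(url) as resp:
        return json.load(resp)


def total_rows(info: dict) -> int:
    """Sum `num_examples` over all splits listed in the parsed file.

    Assumes the layout {"splits": {"<split>": {"num_examples": <int>, ...}}}.
    Returns 0 when no split information is present.
    """
    splits = info.get("splits") or {}
    return sum(int(split.get("num_examples", 0)) for split in splits.values())
```

Calling `total_rows(fetch_dataset_info())` would then give the recorded row count, provided the file follows the assumed layout.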
