
How are download stats generated for datasets on the hub  #100

@davanstrien

I am hoping this is the most appropriate repository for this topic. Let me know if I should open elsewhere. I have tried searching for this information but I may have missed something.

Is your feature request related to a problem? Please describe.
The Hugging Face hub provides some download stats in the public interface for datasets:
[Screenshot of the download stats on a dataset page, 2022-04-14]

It would be useful to know a bit more about how they are calculated in different situations:

For datasets hosted outside the hub (i.e. a dataset script points to files hosted somewhere else):

I'm assuming, in this case, a download is registered when you do:

from datasets import load_dataset
ds = load_dataset('awesome-dataset')

For datasets hosted inside the hub (i.e. the files are hosted directly on the hub):

Where files are hosted directly on the hub, is any distinction made based on the number of files downloaded? For example, if only a subset of a dataset is loaded:

ds = load_dataset('awesome-dataset', 'subset')

and this only requires accessing a subset of the files hosted on the hub, is this counted differently from downloading all the data hosted on the hub?

For direct downloads of files hosted on the hub:

For example, if for the dataset hosted at https://huggingface.co/datasets/w11wo/imdb-javanese I wget the underlying csv.zip files, is that registered in the download stats?
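To make the direct-download case concrete, here is a minimal sketch of the kind of URL I have in mind. It assumes the hub's `/resolve/<revision>/<filename>` URL pattern for dataset repositories; the helper name `dataset_file_url` and the filename `example.csv.zip` are hypothetical and only for illustration. Whether fetching such a URL is counted in the stats is exactly the open question.

```python
from urllib.parse import quote

HUB_ENDPOINT = "https://huggingface.co"


def dataset_file_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build a direct-download URL for one file in a dataset repo on the hub.

    Assumes the hub's `/resolve/<revision>/<filename>` URL pattern.
    """
    # Percent-encode the revision and filename; slashes in the filename are
    # kept so that files in subdirectories still resolve to a valid path.
    safe_revision = quote(revision, safe="")
    safe_filename = quote(filename)
    return f"{HUB_ENDPOINT}/datasets/{repo_id}/resolve/{safe_revision}/{safe_filename}"


# "example.csv.zip" is a hypothetical filename, not one taken from the repo.
url = dataset_file_url("w11wo/imdb-javanese", "example.csv.zip")
print(url)
```

Passing the printed URL to `wget` (or any HTTP client) would bypass `load_dataset` entirely, which is why it is unclear to me whether it shows up in the download count.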

Describe the solution you'd like
It would be great to provide more documentation (apologies if I've missed existing documentation) on how the download stats are generated for datasets.

For context, in the library/academic publishing world, some repositories/publishers follow COUNTER guidance on recording stats; see, for example, the repositories listed here: https://www.projectcounter.org/code-practice-research-data/repositories-that-have-implemented-the-code-of-practice-for-research-data/. This guidance is intended to ensure that you can make more valid comparisons across platforms. It is likely to be relevant only to a subset of your users, so it might not be worth the effort to meet these requirements, but adding documentation about how stats are generated might be helpful. There is an implementation guide related to this: https://github.com/CDLUC3/Make-Data-Count/blob/master/getting-started.md

Describe alternatives you've considered
For files hosted outside of Hugging Face, it is often possible to get download stats; however, it would be useful to be able to cross-reference those with the hub stats.

Additional context
Having download stats is useful in a few scenarios:

  • For institutions thinking about sharing datasets, they often need to demonstrate impact. One way to do this is to cite download stats from the Hub. In particular, it is useful to be able to compare those with other stats in order to have a sense of how much putting data in the Hub helps with overall usage. As an example, the British Library recently added some datasets to the hub (https://huggingface.co/datasets/blbooksgenre) -- it would be great to know how much new usage this has generated compared to only hosting the files on the BL's data repository.

  • For researchers/academics, there is a slow move away from only focusing on 'traditional' metrics (i.e. citation counts, h-index etc.) to also considering other ways in which they can make contributions. Again, contributing a dataset to the hub could be a valuable contribution that they may want to cite in whatever reporting structures they have in their university/country/funder -- for this, having more confidence in the stats from the hub could be very helpful in demonstrating impact.
