How are download stats generated for datasets on the hub  

I am hoping this is the most appropriate repository for this topic. Let me know if I should open elsewhere. I have tried searching for this information but I may have missed something.

**Is your feature request related to a problem? Please describe.**
The Hugging Face hub provides some download stats in the public interface for datasets: 
<img width="124" alt="Screenshot 2022-04-14 at 16 03 43" src="https://user-images.githubusercontent.com/8995957/163418570-1063d67e-8c2e-4f0f-b9fc-47c980d69467.png">

It would be useful to know a bit more about how they are calculated in different situations:

### For datasets hosted outside the hub (i.e. a dataset script points to files hosted somewhere else):

I'm assuming, in this case, a download is registered when you do: 

```python
ds = load_dataset('awesome-dataset')
```

### For datasets hosted inside the hub (i.e. the data files are hosted directly on the hub):

In the case where files are hosted directly on the hub, is any distinction made based on the number of files that are downloaded? For example, if only a subset of a dataset is loaded:

```python
ds = load_dataset('awesome-dataset', 'subset')
```
If this only requires accessing a subset of the files hosted on the hub, is it counted differently from downloading all the data hosted on the hub?

### For direct downloads of files hosted on the hub. 

For example, if I `wget` the underlying `csv.zip` files for the dataset hosted at https://huggingface.co/datasets/w11wo/imdb-javanese, is that registered in the download stats?

**Describe the solution you'd like**
It would be great to provide more documentation (apologies if I've missed existing documentation) on how the download stats are generated for datasets.

For context, in the library/academic publishing world, some repositories/publishers follow [COUNTER](https://www.projectcounter.org/) guidance on recording stats. A list of repositories that follow it is available here: https://www.projectcounter.org/code-practice-research-data/repositories-that-have-implemented-the-code-of-practice-for-research-data/. The guidance is intended to ensure that you can make more valid comparisons across platforms. This is likely to be relevant only to a subset of your users, so it might not be worth the effort to meet these requirements, but adding documentation about how stats are generated might be helpful. There is an implementation guide related to this: https://github.com/CDLUC3/Make-Data-Count/blob/master/getting-started.md

**Describe alternatives you've considered**
For files hosted outside of Hugging Face, it is often possible to get download stats; however, it would be useful to be able to cross-reference those with the hub stats.

**Additional context**
Having download stats is useful in a few scenarios:

- For institutions thinking about sharing datasets, they often need to demonstrate impact. One way to do this is to cite download stats from the Hub. In particular, it is useful to be able to compare those with other stats in order to have a sense of how much putting data in the Hub helps with overall usage. As an example, the British Library recently added some datasets to the hub (https://huggingface.co/datasets/blbooksgenre) -- it would be great to know how much new usage this has generated compared to only hosting the files on the BL's data repository.

- For researchers/academics, there is a slow move away from only focusing on 'traditional' metrics (i.e. citation counts, h-index, etc.) to also considering other ways in which they can make contributions. Again, contributing a dataset to the hub could be a valuable contribution that they may want to cite in whatever reporting structures they have in their university/country/funder -- for this, having more confidence in the stats from the hub could be very helpful in demonstrating impact.



