From 0aaaf81d4f977093c0c038232cbeee97e0e858b1 Mon Sep 17 00:00:00 2001
From: osanseviero
Date: Fri, 17 Nov 2023 16:34:07 +0100
Subject: [PATCH 1/8] Add docs on download stats

---
 docs/hub/_toctree.yml | 2 ++
 docs/hub/datasets-faq.md | 8 ++++++++
 docs/hub/models-faq.md | 8 ++++++--
 3 files changed, 16 insertions(+), 2 deletions(-)
 create mode 100644 docs/hub/datasets-faq.md

diff --git a/docs/hub/_toctree.yml b/docs/hub/_toctree.yml
index ad3406ade..d7d9727ab 100644
--- a/docs/hub/_toctree.yml
+++ b/docs/hub/_toctree.yml
@@ -156,6 +156,8 @@
 title: File names and splits
 - local: datasets-manual-configuration
 title: Manual Configuration
+ - local: datasets-faq
+ title: Frequently Asked Questions
 - local: spaces
 title: Spaces
 isExpanded: true
diff --git a/docs/hub/datasets-faq.md b/docs/hub/datasets-faq.md
new file mode 100644
index 000000000..ee9216c68
--- /dev/null
+++ b/docs/hub/datasets-faq.md
@@ -0,0 +1,8 @@
+# Datasets Frequently Asked Questions
+
+## How are download stats generated for datasets?
+
+The Hub provides download stats for all datasets loadable via the `datasets` library. To determine the number of downloads, the Hub counts every time `load_dataset` is called in Python, excluding Hugging Face's CI tooling on GitHub. This means that:
+
+* Whether the data is directly stored on the Hub repo or if the repository has a script to load the data from an external source, the download count is not impacted.
+* If a user manually downloads the data using tools like `wget` or through the Hub's user interface (UI), those downloads will not be included in the download count.
\ No newline at end of file
diff --git a/docs/hub/models-faq.md b/docs/hub/models-faq.md
index 890da38a5..32258c094 100644
--- a/docs/hub/models-faq.md
+++ b/docs/hub/models-faq.md
@@ -1,4 +1,4 @@
-# Frequently Asked Questions
+# Models Frequently Asked Questions
 ## How can I see what dataset was used to train the model?
@@ -42,4 +42,8 @@ If the model card includes a link to a paper on arXiv, the Hugging Face Hub will
-Read more about paper pages [here](./paper-pages).
\ No newline at end of file
+Read more about paper pages [here](./paper-pages).
+
+## How are download stats generated for models?
+
+Counting the number of downloads for models is not a trivial task as a single model repository might contain multiple files, including multiple model files (e.g., with sharded models), and different formats depending on the library. To avoid double counting downloads (e.g., counting a single download of a model as multiple downloads), the Hub uses a set of query files that are employed for download counting. Every `GET` request to these files will be counted as a download. By default, when no library is specified, the Hub uses `config.json` as the default query file. Otherwise, the query file depends on each library, and the Hub might examine files such as `pytorch_model.bin` and `adapter_config.json`.

From 2b638fbda68f5c6f6ac52db5117dce527ab6cb30 Mon Sep 17 00:00:00 2001
From: osanseviero
Date: Fri, 17 Nov 2023 16:59:07 +0100
Subject: [PATCH 2/8] Destroy the wall

---
 docs/hub/models-faq.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/docs/hub/models-faq.md b/docs/hub/models-faq.md
index 32258c094..0a9fbc247 100644
--- a/docs/hub/models-faq.md
+++ b/docs/hub/models-faq.md
@@ -46,4 +46,6 @@ Read more about paper pages [here](./paper-pages).
 ## How are download stats generated for models?
-Counting the number of downloads for models is not a trivial task as a single model repository might contain multiple files, including multiple model files (e.g., with sharded models), and different formats depending on the library. To avoid double counting downloads (e.g., counting a single download of a model as multiple downloads), the Hub uses a set of query files that are employed for download counting. Every `GET` request to these files will be counted as a download. By default, when no library is specified, the Hub uses `config.json` as the default query file. Otherwise, the query file depends on each library, and the Hub might examine files such as `pytorch_model.bin` and `adapter_config.json`.
+Counting the number of downloads for models is not a trivial task as a single model repository might contain multiple files, including multiple model files (e.g., with sharded models), and different formats depending on the library. To avoid double counting downloads (e.g., counting a single download of a model as multiple downloads), the Hub uses a set of query files that are employed for download counting.
+
+Every `GET` request to these files will be counted as a download. By default, when no library is specified, the Hub uses `config.json` as the default query file. Otherwise, the query file depends on each library, and the Hub might examine files such as `pytorch_model.bin` and `adapter_config.json`.

From b21de0af9f0930384d8f2379172567cb2aa34421 Mon Sep 17 00:00:00 2001
From: Omar Sanseviero
Date: Fri, 17 Nov 2023 19:13:19 +0100
Subject: [PATCH 3/8] Apply suggestions from code review

Co-authored-by: Daniel van Strien
Co-authored-by: Julien Chaumond
---
 docs/hub/datasets-faq.md | 4 ++--
 docs/hub/models-faq.md | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/hub/datasets-faq.md b/docs/hub/datasets-faq.md
index ee9216c68..b7ef7e65f 100644
--- a/docs/hub/datasets-faq.md
+++ b/docs/hub/datasets-faq.md
@@ -4,5 +4,5 @@
 The Hub provides download stats for all datasets loadable via the `datasets` library. To determine the number of downloads, the Hub counts every time `load_dataset` is called in Python, excluding Hugging Face's CI tooling on GitHub. This means that:
-* Whether the data is directly stored on the Hub repo or if the repository has a script to load the data from an external source, the download count is not impacted.
-* If a user manually downloads the data using tools like `wget` or through the Hub's user interface (UI), those downloads will not be included in the download count.
\ No newline at end of file
+* The download count is the same regardless of whether the data is directly stored on the Hub repo or if the repository has a script to load the data from an external source.
+* If a user manually downloads the data using tools like `wget` or the Hub's user interface (UI), those downloads will not be included in the download count.
\ No newline at end of file
diff --git a/docs/hub/models-faq.md b/docs/hub/models-faq.md
index 0a9fbc247..49b2db4d0 100644
--- a/docs/hub/models-faq.md
+++ b/docs/hub/models-faq.md
@@ -46,6 +46,6 @@ Read more about paper pages [here](./paper-pages).
 ## How are download stats generated for models?
-Counting the number of downloads for models is not a trivial task as a single model repository might contain multiple files, including multiple model files (e.g., with sharded models), and different formats depending on the library. To avoid double counting downloads (e.g., counting a single download of a model as multiple downloads), the Hub uses a set of query files that are employed for download counting.
+Counting the number of downloads for models is not a trivial task as a single model repository might contain multiple files, including multiple model weight files (e.g., with sharded models), and different formats depending on the library. To avoid double counting downloads (e.g., counting a single download of a model as multiple downloads), the Hub uses a set of query files that are employed for download counting.
-Every `GET` request to these files will be counted as a download. By default, when no library is specified, the Hub uses `config.json` as the default query file. Otherwise, the query file depends on each library, and the Hub might examine files such as `pytorch_model.bin` and `adapter_config.json`.
+Every HTTP request to these files, including `GET` and `HEAD` will be counted as a download. By default, when no library is specified, the Hub uses `config.json` as the default query file. Otherwise, the query file depends on each library, and the Hub might examine files such as `pytorch_model.bin` and `adapter_config.json`.

From 747ca2b9f8b801c5d1dff79c5e79db7fa3b0226b Mon Sep 17 00:00:00 2001
From: osanseviero
Date: Fri, 17 Nov 2023 19:19:55 +0100
Subject: [PATCH 4/8] Move to their own sections

---
 docs/hub/_toctree.yml | 6 ++++--
 docs/hub/{datasets-faq.md => datasets-download-stats.md} | 2 +-
 docs/hub/models-download-stats.md | 7 +++++++
 docs/hub/models-faq.md | 6 ------
 4 files changed, 12 insertions(+), 9 deletions(-)
 rename docs/hub/{datasets-faq.md => datasets-download-stats.md} (94%)
 create mode 100644 docs/hub/models-download-stats.md

diff --git a/docs/hub/_toctree.yml b/docs/hub/_toctree.yml
index d7d9727ab..24069f59c 100644
--- a/docs/hub/_toctree.yml
+++ b/docs/hub/_toctree.yml
@@ -108,6 +108,8 @@
 title: Widget Examples
 - local: models-inference
 title: Inference API docs
+ - local: models-download-stats
+ title: Models Download Stats
 - local: models-faq
 title: Frequently Asked Questions
 - local: models-advanced
@@ -156,8 +158,8 @@
 title: File names and splits
 - local: datasets-manual-configuration
 title: Manual Configuration
- - local: datasets-faq
- title: Frequently Asked Questions
+ - local: datasets-download-stats
+ title: Datasets Download Stats
 - local: spaces
 title: Spaces
 isExpanded: true
diff --git a/docs/hub/datasets-faq.md b/docs/hub/datasets-download-stats.md
similarity index 94%
rename from docs/hub/datasets-faq.md
rename to docs/hub/datasets-download-stats.md
index b7ef7e65f..84fc46cde 100644
--- a/docs/hub/datasets-faq.md
+++ b/docs/hub/datasets-download-stats.md
@@ -1,4 +1,4 @@
-# Datasets Frequently Asked Questions
+# Datasets Download Stats
 ## How are download stats generated for datasets?
diff --git a/docs/hub/models-download-stats.md b/docs/hub/models-download-stats.md
new file mode 100644
index 000000000..63b1e02f6
--- /dev/null
+++ b/docs/hub/models-download-stats.md
@@ -0,0 +1,7 @@
+# Models Download Stats
+
+## How are download stats generated for models?
+
+Counting the number of downloads for models is not a trivial task as a single model repository might contain multiple files, including multiple model weight files (e.g., with sharded models), and different formats depending on the library. To avoid double counting downloads (e.g., counting a single download of a model as multiple downloads), the Hub uses a set of query files that are employed for download counting.
+
+Every HTTP request to these files, including `GET` and `HEAD` will be counted as a download. By default, when no library is specified, the Hub uses `config.json` as the default query file. Otherwise, the query file depends on each library, and the Hub might examine files such as `pytorch_model.bin` and `adapter_config.json`.
diff --git a/docs/hub/models-faq.md b/docs/hub/models-faq.md
index 49b2db4d0..0ec34bf63 100644
--- a/docs/hub/models-faq.md
+++ b/docs/hub/models-faq.md
@@ -43,9 +43,3 @@ If the model card includes a link to a paper on arXiv, the Hugging Face Hub will
 Read more about paper pages [here](./paper-pages).
-
-## How are download stats generated for models?
-
-Counting the number of downloads for models is not a trivial task as a single model repository might contain multiple files, including multiple model weight files (e.g., with sharded models), and different formats depending on the library. To avoid double counting downloads (e.g., counting a single download of a model as multiple downloads), the Hub uses a set of query files that are employed for download counting.
-
-Every HTTP request to these files, including `GET` and `HEAD` will be counted as a download. By default, when no library is specified, the Hub uses `config.json` as the default query file. Otherwise, the query file depends on each library, and the Hub might examine files such as `pytorch_model.bin` and `adapter_config.json`.

From 209404356478f4f577de67281e20047202fdf916 Mon Sep 17 00:00:00 2001
From: osanseviero
Date: Fri, 17 Nov 2023 19:39:59 +0100
Subject: [PATCH 5/8] Open source all the way

---
 docs/hub/models-download-stats.md | 150 ++++++++++++++++++++++++++++++
 1 file changed, 150 insertions(+)

diff --git a/docs/hub/models-download-stats.md b/docs/hub/models-download-stats.md
index 63b1e02f6..003f1afe6 100644
--- a/docs/hub/models-download-stats.md
+++ b/docs/hub/models-download-stats.md
@@ -5,3 +5,153 @@
 Counting the number of downloads for models is not a trivial task as a single model repository might contain multiple files, including multiple model weight files (e.g., with sharded models), and different formats depending on the library. To avoid double counting downloads (e.g., counting a single download of a model as multiple downloads), the Hub uses a set of query files that are employed for download counting.
 Every HTTP request to these files, including `GET` and `HEAD` will be counted as a download. By default, when no library is specified, the Hub uses `config.json` as the default query file. Otherwise, the query file depends on each library, and the Hub might examine files such as `pytorch_model.bin` and `adapter_config.json`.
+
+## Which are the query files for different libraries?
+
+By default, the Hub looks at `config.json`, `config.yaml`, `hyperparams.yaml`, and `meta.yaml`. For the following set of libraries, there are specific query files
+
+```json
+{
+    "adapter-transformers": {
+        filter: [
+            {
+                term: { path: "adapter_config.json" },
+            },
+        ],
+    },
+    "asteroid": {
+        filter: [
+            {
+                term: { path: "pytorch_model.bin" },
+            },
+        ],
+    },
+    "flair": {
+        filter: [
+            {
+                term: { path: "pytorch_model.bin" },
+            },
+        ],
+    },
+    "keras": {
+        filter: [
+            {
+                term: { path: "saved_model.pb" },
+            },
+        ],
+    },
+    "ml-agents": {
+        filter: [
+            {
+                wildcard: { path: "*.onnx" },
+            },
+        ],
+    },
+    "nemo": {
+        filter: [
+            {
+                wildcard: { path: "*.nemo" },
+            },
+        ],
+    },
+    "open_clip": {
+        filter: [
+            {
+                wildcard: { path: "*pytorch_model.bin" },
+            },
+        ],
+    },
+    "sample-factory": {
+        filter: [
+            {
+                term: { path: "cfg.json" },
+            },
+        ],
+    },
+    "paddlenlp": {
+        filter: [
+            {
+                term: { path: "model_config.json" },
+            },
+        ],
+    },
+    "speechbrain": {
+        filter: [
+            {
+                term: { path: "hyperparams.yaml" },
+            },
+        ],
+    },
+    "sklearn": {
+        filter: [
+            {
+                term: { path: "sklearn_model.joblib" },
+            },
+        ],
+    },
+    "spacy": {
+        filter: [
+            {
+                wildcard: { path: "*.whl" },
+            },
+        ],
+    },
+    "stanza": {
+        filter: [
+            {
+                term: { path: "models/default.zip" },
+            },
+        ],
+    },
+    "stable-baselines3": {
+        filter: [
+            {
+                wildcard: { path: "*.zip" },
+            },
+        ],
+    },
+    "timm": {
+        filter: [
+            {
+                terms: { path: ["pytorch_model.bin", "model.safetensors"] },
+            },
+        ],
+    },
+    "diffusers": {
+        /// Filter out nested safetensors and pickle weights to avoid double counting downloads from the diffusers lib
+        must_not: [
+            {
+                wildcard: { path: "*/*.safetensors" },
+            },
+            {
+                wildcard: { path: "*/*.bin" },
+            },
+        ],
+        /// Include documents that match at least one of the following rules
+        should: [
+            /// Downloaded from diffusers lib
+            {
+                term: { path: "model_index.json" },
+            },
+            /// Direct downloads (LoRa, Auto1111 and others)
+            {
+                wildcard: { path: "*.safetensors" },
+            },
+            {
+                wildcard: { path: "*.ckpt" },
+            },
+            {
+                wildcard: { path: "*.bin" },
+            },
+        ],
+        minimum_should_match: 1,
+    },
+    "peft": {
+        filter: [
+            {
+                term: { path: "adapter_config.json" },
+            },
+        ],
+    }
+}
+```
\ No newline at end of file

From 7975404994e9efec6831df7a6367e92a6ca79746 Mon Sep 17 00:00:00 2001
From: osanseviero
Date: Fri, 17 Nov 2023 19:45:57 +0100
Subject: [PATCH 6/8] Update order and add to index

---
 docs/hub/_toctree.yml | 4 ++--
 docs/hub/index.md | 2 ++
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/docs/hub/_toctree.yml b/docs/hub/_toctree.yml
index 24069f59c..c9ba0ef02 100644
--- a/docs/hub/_toctree.yml
+++ b/docs/hub/_toctree.yml
@@ -151,6 +151,8 @@
 sections:
 - local: datasets-viewer-configure
 title: Configure the Dataset Viewer
+ - local: datasets-download-stats
+ title: Datasets Download Stats
 - local: datasets-data-files-configuration
 title: Data files Configuration
 sections:
@@ -158,8 +160,6 @@
 title: File names and splits
 - local: datasets-manual-configuration
 title: Manual Configuration
- - local: datasets-download-stats
- title: Datasets Download Stats
 - local: spaces
 title: Spaces
 isExpanded: true
diff --git a/docs/hub/index.md b/docs/hub/index.md
index ec13e8612..d7f0a8570 100644
--- a/docs/hub/index.md
+++ b/docs/hub/index.md
@@ -31,6 +31,7 @@ The Hugging Face Hub is a platform with over 350k models, 75k datasets, and 150k
 Tasks Widgets Inference API
+Download Stats
@@ -44,6 +45,7 @@ The Hugging Face Hub is a platform with over 350k models, 75k datasets, and 150k
 Downloading Datasets Libraries Dataset Viewer
+Download Stats
 Data files Configuration
From 26e58098a0d8daaf1ef3625c4a9439a1fd7c90f2 Mon Sep 17 00:00:00 2001
From: Omar Sanseviero
Date: Mon, 20 Nov 2023 09:08:58 +0100
Subject: [PATCH 7/8] Update datasets-download-stats.md

---
 docs/hub/datasets-download-stats.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/hub/datasets-download-stats.md b/docs/hub/datasets-download-stats.md
index 84fc46cde..8f3c3a5db 100644
--- a/docs/hub/datasets-download-stats.md
+++ b/docs/hub/datasets-download-stats.md
@@ -2,7 +2,7 @@
 ## How are download stats generated for datasets?
-The Hub provides download stats for all datasets loadable via the `datasets` library. To determine the number of downloads, the Hub counts every time `load_dataset` is called in Python, excluding Hugging Face's CI tooling on GitHub. This means that:
+The Hub provides download stats for all datasets loadable via the `datasets` library. To determine the number of downloads, the Hub counts every time `load_dataset` is called in Python, excluding Hugging Face's CI tooling on GitHub. No information is sent from the user, and no additional calls are made for this. The count is done server-side as we serve files for downloads. This means that:
 * The download count is the same regardless of whether the data is directly stored on the Hub repo or if the repository has a script to load the data from an external source.
-* If a user manually downloads the data using tools like `wget` or the Hub's user interface (UI), those downloads will not be included in the download count.
\ No newline at end of file
+* If a user manually downloads the data using tools like `wget` or the Hub's user interface (UI), those downloads will not be included in the download count.
From 5f1705047e32ee9b931d5cf980a63470b3f72d52 Mon Sep 17 00:00:00 2001
From: Omar Sanseviero
Date: Mon, 20 Nov 2023 09:09:27 +0100
Subject: [PATCH 8/8] Update models-download-stats.md

---
 docs/hub/models-download-stats.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/hub/models-download-stats.md b/docs/hub/models-download-stats.md
index 003f1afe6..4acfa9785 100644
--- a/docs/hub/models-download-stats.md
+++ b/docs/hub/models-download-stats.md
@@ -2,9 +2,9 @@
 ## How are download stats generated for models?
-Counting the number of downloads for models is not a trivial task as a single model repository might contain multiple files, including multiple model weight files (e.g., with sharded models), and different formats depending on the library. To avoid double counting downloads (e.g., counting a single download of a model as multiple downloads), the Hub uses a set of query files that are employed for download counting.
+Counting the number of downloads for models is not a trivial task as a single model repository might contain multiple files, including multiple model weight files (e.g., with sharded models), and different formats depending on the library. To avoid double counting downloads (e.g., counting a single download of a model as multiple downloads), the Hub uses a set of query files that are employed for download counting. No information is sent from the user, and no additional calls are made for this. The count is done server-side as we serve files for downloads.
-Every HTTP request to these files, including `GET` and `HEAD` will be counted as a download. By default, when no library is specified, the Hub uses `config.json` as the default query file. Otherwise, the query file depends on each library, and the Hub might examine files such as `pytorch_model.bin` and `adapter_config.json`.
+Every HTTP request to these files, including `GET` and `HEAD` will be counted as a download. By default, when no library is specified, the Hub uses `config.json` as the default query file. Otherwise, the query file depends on each library, and the Hub might examine files such as `pytorch_model.bin` and `adapter_config.json`.
 ## Which are the query files for different libraries?
@@ -154,4 +154,4 @@ By default, the Hub looks at `config.json`, `config.yaml`, `hyperparams.yaml`, a
 ],
 }
 }
-```
\ No newline at end of file
+```
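
Review note: the query-file rules documented in PATCH 5/8 can be read as a small path-matching procedure. The Python sketch below is one possible reading, not the Hub's server-side code. The rule keys (`filter`, `must_not`, `should`, `minimum_should_match`, `term`, `terms`, `wildcard`) and the file names come from the patch. The helper names `clause_matches`, `is_query_file`, and `count_downloads`, the use of `fnmatch` for wildcards (so `*` may span `/`), and the fallback to the default files for any library without a rule are assumptions made for illustration.

```python
from fnmatch import fnmatchcase

# Default query files named in the patch, used here when no library rule exists.
DEFAULT_QUERY_FILES = {"config.json", "config.yaml", "hyperparams.yaml", "meta.yaml"}

# A subset of the per-library rules from the patch, transcribed as Python dicts.
RULES = {
    "peft": {"filter": [{"term": {"path": "adapter_config.json"}}]},
    "timm": {"filter": [{"terms": {"path": ["pytorch_model.bin", "model.safetensors"]}}]},
    "spacy": {"filter": [{"wildcard": {"path": "*.whl"}}]},
    "diffusers": {
        # Nested weight files are excluded to avoid double counting.
        "must_not": [
            {"wildcard": {"path": "*/*.safetensors"}},
            {"wildcard": {"path": "*/*.bin"}},
        ],
        # At least one of these must match.
        "should": [
            {"term": {"path": "model_index.json"}},
            {"wildcard": {"path": "*.safetensors"}},
            {"wildcard": {"path": "*.ckpt"}},
            {"wildcard": {"path": "*.bin"}},
        ],
        "minimum_should_match": 1,
    },
}


def clause_matches(clause, path):
    """Evaluate a single term / terms / wildcard clause against a file path."""
    if "term" in clause:
        return path == clause["term"]["path"]
    if "terms" in clause:
        return path in clause["terms"]["path"]
    if "wildcard" in clause:
        # Assumption: "*" matches any run of characters, including "/".
        return fnmatchcase(path, clause["wildcard"]["path"])
    raise ValueError(f"unsupported clause: {clause!r}")


def is_query_file(library, path):
    """Return True if a request for `path` would count as a download."""
    rule = RULES.get(library)
    if rule is None:
        return path in DEFAULT_QUERY_FILES
    if any(clause_matches(c, path) for c in rule.get("must_not", [])):
        return False
    if not all(clause_matches(c, path) for c in rule.get("filter", [])):
        return False
    hits = sum(clause_matches(c, path) for c in rule.get("should", []))
    return hits >= rule.get("minimum_should_match", 0)


def count_downloads(library, requests):
    """Count (method, path) requests that hit a query file, whatever the HTTP method."""
    return sum(1 for _method, path in requests if is_query_file(library, path))


# One client fetching a sharded model: three requests, one counted download.
sharded = [
    ("GET", "config.json"),
    ("GET", "pytorch_model-00001-of-00002.bin"),
    ("GET", "pytorch_model-00002-of-00002.bin"),
]
print(count_downloads(None, sharded))  # 1
```

Under this reading, a `diffusers` pipeline download counts once through `model_index.json`, while the nested `unet/...safetensors` weights are excluded by `must_not`; a single-file checkpoint at the repository root still counts through the `should` wildcards.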