diff --git a/docs/hub/datasets-adding.md b/docs/hub/datasets-adding.md
index a7a10e6f5..a9267dfd8 100644
--- a/docs/hub/datasets-adding.md
+++ b/docs/hub/datasets-adding.md
@@ -12,7 +12,7 @@ The Hub's web-based interface allows users without any developer experience to u
 A repository hosts all your dataset files, including the revision history, making storing more than one dataset version possible.

-1. Click on your profile and select **New Dataset** to create a [new dataset repository](https://huggingface.co/new-dataset).
+1. Click on your profile and select **New Dataset** to create a [new dataset repository](https://huggingface.co/new-dataset).
 2. Pick a name for your dataset, and choose whether it is a public or private dataset. A public dataset is visible to anyone, whereas a private dataset can only be viewed by you or members of your organization.
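For readers who prefer scripting over the web UI, the repository from steps 1–2 can also be created with the `huggingface_hub` client, which this diff uses elsewhere. This is a minimal sketch only: the repo name `username/my_dataset` is a placeholder, and it assumes you are already logged in (for example with `huggingface-cli login`).

```python
from huggingface_hub import HfApi

api = HfApi()
# Equivalent of picking a name and visibility in the "New Dataset" dialog.
api.create_repo(
    repo_id="username/my_dataset",  # placeholder name
    repo_type="dataset",
    private=True,  # set to False for a public dataset
)
```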
@@ -21,7 +21,7 @@ A repository hosts all your dataset files, including the revision history, makin
 ### Upload dataset

-1. Once you've created a repository, navigate to the **Files and versions** tab to add a file. Select **Add file** to upload your dataset files. We support many text, audio, and image data extensions such as `.csv`, `.mp3`, and `.jpg` among many others (see full list [here](./datasets-viewer-configure.md)).
+1. Once you've created a repository, navigate to the **Files and versions** tab to add a file. Select **Add file** to upload your dataset files. We support many text, audio, image, and other data extensions such as `.csv`, `.mp3`, and `.jpg` (see the full list of [File formats](#file-formats)).
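The same upload can be done outside the UI. Below is a minimal, hypothetical sketch with the `huggingface_hub` client, assuming a placeholder repo `username/my_dataset` and a local file `data.csv` (neither is part of this diff).

```python
from huggingface_hub import HfApi

api = HfApi()
# Uploads a single local file; path_in_repo sets its name in the repository.
api.upload_file(
    path_or_fileobj="data.csv",
    path_in_repo="data.csv",
    repo_id="username/my_dataset",
    repo_type="dataset",
)
```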
@@ -70,7 +70,7 @@ Make sure the Dataset Viewer correctly shows your data, or [Configure the Datase
 ## Using the `huggingface_hub` client library

-The rich features set in the `huggingface_hub` library allows you to manage repositories, including creating repos and uploading datasets to the Model Hub. Visit [the client library's documentation](https://huggingface.co/docs/huggingface_hub/index) to learn more.
+The rich feature set in the `huggingface_hub` library allows you to manage repositories, including creating repos and uploading datasets to the Hub. Visit [the client library's documentation](https://huggingface.co/docs/huggingface_hub/index) to learn more.

 ## Using other libraries

@@ -79,7 +79,7 @@ See the list of [Libraries supported by the Datasets Hub](./datasets-libraries)
 ## Using Git

-Since dataset repos are just Git repositories, you can use Git to push your data files to the Hub. Follow the guide on [Getting Started with Repositories](repositories-getting-started) to learn about using the `git` CLI to commit and push your datasets.
+Since dataset repos are Git repositories, you can use Git to push your data files to the Hub. Follow the guide on [Getting Started with Repositories](repositories-getting-started) to learn about using the `git` CLI to commit and push your datasets.

 ## File formats

@@ -94,7 +94,7 @@ The Hub natively supports multiple file formats:
 It also supports files compressed using ZIP (.zip), GZIP (.gz), ZSTD (.zst), BZ2 (.bz2), LZ4 (.lz4) and LZMA (.xz).

-Image and audio resources can also have additional metadata files, see the [Data files Configuration](./datasets-data-files-configuration) on image and audio datasets.
+Image and audio resources can also have additional metadata files; see the [Data files Configuration](./datasets-data-files-configuration#image-and-audio-datasets) section on image and audio datasets.

 You may want to convert your files to these formats to benefit from all the Hub features. Other formats and structures may not be recognized by the Hub.
diff --git a/docs/hub/datasets-cards.md b/docs/hub/datasets-cards.md
index e2d2dc4b0..debe84099 100644
--- a/docs/hub/datasets-cards.md
+++ b/docs/hub/datasets-cards.md
@@ -4,7 +4,7 @@
 Each dataset may be documented by the `README.md` file in the repository. This file is called a **dataset card**, and the Hugging Face Hub will render its contents on the dataset's main page. To inform users about how to responsibly use the data, it's a good idea to include information about any potential biases within the dataset. Generally, dataset cards help users understand the contents of the dataset and give context for how the dataset should be used.

-You can also add dataset metadata to your card. The metadata describes important information about a dataset such as its license, language, and size. It also contains tags to help users discover a dataset on the Hub. Tags are defined in a YAML metadata section at the top of the `README.md` file.
+You can also add dataset metadata to your card. The metadata describes important information about a dataset such as its license, language, and size. It also contains tags to help users discover a dataset on the Hub, as well as [data files configuration](./datasets-manual-configuration.md) options. Tags are defined in a YAML metadata section at the top of the `README.md` file.
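To make the YAML metadata section concrete, here is a minimal, hypothetical sketch of writing and pushing a dataset card with `huggingface_hub`; the repo id, license, and language values are placeholders and not part of this diff.

```python
from huggingface_hub import DatasetCard

content = """---
license: cc-by-4.0
language:
- en
---

# Dataset card for my_dataset

A short description of what the dataset contains and how it should be used.
"""

# Pushes the README.md (dataset card) to a placeholder dataset repository.
card = DatasetCard(content)
card.push_to_hub("username/my_dataset")
```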
 ## Dataset card metadata
diff --git a/docs/hub/datasets-dask.md b/docs/hub/datasets-dask.md
index 00e5ef840..414db5c5d 100644
--- a/docs/hub/datasets-dask.md
+++ b/docs/hub/datasets-dask.md
@@ -1,7 +1,7 @@
 # Dask

 [Dask](https://github.com/dask/dask) is a parallel and distributed computing library that scales the existing Python and PyData ecosystem.
-Since it uses [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths (`hf://`) to read and write data on the Hub:
+Since it uses [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths ([`hf://`](https://huggingface.co/docs/huggingface_hub/guides/hf_file_system#integrations)) to read and write data on the Hub:

 First you need to [Login with your Hugging Face account](../huggingface_hub/quick-start#login), for example using:

@@ -17,7 +17,7 @@ from huggingface_hub import HfApi
 HfApi().create_repo(repo_id="username/my_dataset", repo_type="dataset")
 ```

-Finally, you can use Hugging Face paths in Dask:
+Finally, you can use [Hugging Face paths](https://huggingface.co/docs/huggingface_hub/guides/hf_file_system#integrations) in Dask:

 ```python
 import dask.dataframe as dd
diff --git a/docs/hub/datasets-downloading.md b/docs/hub/datasets-downloading.md
index 1ed63e38d..0dbd78d54 100644
--- a/docs/hub/datasets-downloading.md
+++ b/docs/hub/datasets-downloading.md
@@ -2,7 +2,7 @@
 ## Integrated libraries

-If a dataset on the Hub is tied to a [supported library](./datasets-libraries), loading the dataset can be done in just a few lines. For information on accessing the dataset, you can click on the "Use in _Library_" button on the dataset page to see how to do so. For example, `samsum` shows how to do so with 🤗 Datasets below.
+If a dataset on the Hub is tied to a [supported library](./datasets-libraries), loading the dataset can be done in just a few lines. For information on accessing the dataset, you can click on the "Use in dataset library" button on the dataset page to see how to do so. For example, [`samsum`](https://huggingface.co/datasets/samsum?library=true) shows how to load it with 🤗 Datasets below.
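For reference, the code copied by the "Use in dataset library" button for `samsum` looks roughly like the sketch below; this is an illustration, and the exact snippet shown on the dataset page may differ.

```python
from datasets import load_dataset

# Loads the samsum dataset referenced above; splits are returned in a DatasetDict.
dataset = load_dataset("samsum")
print(dataset)
```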
@@ -16,7 +16,7 @@ If a dataset on the Hub is tied to a [supported library](./datasets-libraries),
 ## Using the Hugging Face Client Library

-You can use the [`huggingface_hub`](https://github.com/huggingface/huggingface_hub) library to create, delete, update and retrieve information from repos. You can also download files from repos or integrate them into your library! For example, you can quickly load a CSV dataset with a few lines using Pandas.
+You can use the [`huggingface_hub`](https://huggingface.co/docs/huggingface_hub) library to create, delete, update and retrieve information from repos. You can also download files from repos or integrate them into your library! For example, you can quickly load a CSV dataset with a few lines using Pandas.

 ```py
 from huggingface_hub import hf_hub_download
@@ -32,7 +32,7 @@ dataset = pd.read_csv(
 ## Using Git

-Since all datasets on the dataset Hub are Git repositories, you can clone the datasets locally by running:
+Since all datasets on the Hub are Git repositories, you can clone the datasets locally by running:

 ```bash
 git lfs install
diff --git a/docs/hub/datasets-duckdb.md b/docs/hub/datasets-duckdb.md
index a308a972d..38769c707 100644
--- a/docs/hub/datasets-duckdb.md
+++ b/docs/hub/datasets-duckdb.md
@@ -1,7 +1,7 @@
 # DuckDB

 [DuckDB](https://github.com/duckdb/duckdb) is an in-process SQL [OLAP](https://en.wikipedia.org/wiki/Online_analytical_processing) database management system.
-Since it supports [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths (`hf://`) to read and write data on the Hub:
+Since it supports [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths ([`hf://`](https://huggingface.co/docs/huggingface_hub/guides/hf_file_system#integrations)) to read and write data on the Hub:

 First you need to [Login with your Hugging Face account](../huggingface_hub/quick-start#login), for example using:

@@ -17,7 +17,7 @@ from huggingface_hub import HfApi
 HfApi().create_repo(repo_id="username/my_dataset", repo_type="dataset")
 ```

-Finally, you can use Hugging Face paths in DuckDB:
+Finally, you can use [Hugging Face paths](https://huggingface.co/docs/huggingface_hub/guides/hf_file_system#integrations) in DuckDB:

 ```python
 >>> from huggingface_hub import HfFileSystem
@@ -39,3 +39,5 @@ You can reload it later:
 >>> duckdb.register_filesystem(fs)
 >>> df = duckdb.query("SELECT * FROM 'hf://datasets/username/my_dataset/data.parquet' LIMIT 10;").df()
 ```
+
+For more information on Hugging Face paths and how they are implemented, please refer to [the client library's documentation on the HfFileSystem](https://huggingface.co/docs/huggingface_hub/guides/hf_file_system).
diff --git a/docs/hub/datasets-file-names-and-splits.md b/docs/hub/datasets-file-names-and-splits.md
index 897af7ff4..a242f2014 100644
--- a/docs/hub/datasets-file-names-and-splits.md
+++ b/docs/hub/datasets-file-names-and-splits.md
@@ -2,7 +2,7 @@
 To host and share your dataset, create a dataset repository on the Hugging Face Hub and upload your data files.

-This guide will show you how to name your files and directories in your dataset repository when you upload it and enable all the Dataset Hub features like the Dataset Viewer.
+This guide will show you how to name your files and directories in your dataset repository when you upload it and enable all the Datasets Hub features like the Dataset Viewer.
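As a quick illustration of the naming conventions this guide covers, the hypothetical sketch below uploads a local folder whose file names (`train.csv`, `test.csv`) let the Hub infer the train and test splits; the repo id and folder name are placeholders, not part of this diff.

```python
from huggingface_hub import HfApi

api = HfApi()
# The local folder is assumed to contain train.csv and test.csv;
# those file names map to the train and test splits on the Hub.
api.upload_folder(
    folder_path="my_dataset",
    repo_id="username/my_dataset",
    repo_type="dataset",
)
```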
 A dataset with a supported structure and [file formats](./datasets-adding#file-formats) automatically has a dataset viewer on its page on the Hub. Note that if none of the structures below suits your case, you can have more control over how you define splits and subsets with the [Manual Configuration](./datasets-manual-configuration).
diff --git a/docs/hub/datasets-gated.md b/docs/hub/datasets-gated.md
index 9094a9878..1ad90e526 100644
--- a/docs/hub/datasets-gated.md
+++ b/docs/hub/datasets-gated.md
@@ -12,9 +12,10 @@ The User Access request dialog can be modified to include additional text and ch
 ---
 extra_gated_prompt: "You agree to not attempt to determine the identity of individuals in this dataset"
 extra_gated_fields:
-  Company: text
-  Country: text
-  I agree to use this dataset for non-commercial use ONLY: checkbox
+  Name: text
+  Affiliation: text
+  Email: text
+  I agree to not attempt to determine the identity of speakers in this dataset: checkbox
 ---
 ```

@@ -73,4 +74,4 @@ In some cases, you might also want to modify the text in the heading of the gate
 extra_gated_heading: "Acknowledge license to accept the repository"
 extra_gated_button_content: "Acknowledge license"
 ---
-```
\ No newline at end of file
+```
diff --git a/docs/hub/datasets-libraries.md b/docs/hub/datasets-libraries.md
index bdf6470cf..656a68f52 100644
--- a/docs/hub/datasets-libraries.md
+++ b/docs/hub/datasets-libraries.md
@@ -1,6 +1,6 @@
 # Libraries

-The Dataset Hub has support for several libraries in the Open Source ecosystem.
+The Datasets Hub has support for several libraries in the Open Source ecosystem.
 Thanks to the [huggingface_hub Python library](../huggingface_hub), it's easy to enable sharing your datasets on the Hub.
 We're happy to welcome to the Hub a set of Open Source libraries that are pushing Machine Learning forward.
diff --git a/docs/hub/datasets-overview.md b/docs/hub/datasets-overview.md
index dd227b922..7a9a3dc8e 100644
--- a/docs/hub/datasets-overview.md
+++ b/docs/hub/datasets-overview.md
@@ -2,13 +2,13 @@
 ## Datasets on the Hub

-The Hugging Face Hub hosts a [large number of community-curated datasets](https://huggingface.co/datasets) for a diverse range of tasks such as translation, automatic speech recognition, and image classification. Alongside the information contained in the [dataset card](./datasets-cards), many datasets, such as [GLUE](https://huggingface.co/datasets/glue), include a Dataset Preview to showcase the data.
+The Hugging Face Hub hosts a [large number of community-curated datasets](https://huggingface.co/datasets) for a diverse range of tasks such as translation, automatic speech recognition, and image classification. Alongside the information contained in the [dataset card](./datasets-cards), many datasets, such as [GLUE](https://huggingface.co/datasets/glue), include a [Dataset Viewer](./datasets-viewer) to showcase the data.

-Each dataset is a [Git repository](./repositories), equipped with the necessary scripts to download the data and generate splits for training, evaluation, and testing. For information on how a dataset repository is structured, refer to the [Structure your repository guide](https://huggingface.co/docs/datasets/repository_structure). Following the supported repo structure will ensure that your repository will have a preview on its dataset page on the Hub.
+Each dataset is a [Git repository](./repositories), equipped with the necessary scripts to download the data and generate splits for training, evaluation, and testing. For information on how a dataset repository is structured, refer to the [Data files Configuration page](./datasets-data-files-configuration). Following the supported repo structure will ensure that the dataset page on the Hub will have a Viewer.

 ## Search for datasets

-Like models and Spaces, you can search the Hub for datasets using the search bar in the top navigation or on the [main datasets page](https://huggingface.co/datasets). There's a large number of languages, tasks, and licenses that you can use to filter your results to find a dataset that's right for you.
+Like models and spaces, you can search the Hub for datasets using the search bar in the top navigation or on the [main datasets page](https://huggingface.co/datasets). There's a large number of languages, tasks, and licenses that you can use to filter your results to find a dataset that's right for you.
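Alongside the search bar and filters, datasets can also be listed from code. The sketch below is a hypothetical aside, not part of the diff, and the query string is only an example.

```python
from huggingface_hub import HfApi

api = HfApi()
# Lists a few datasets whose name or card matches the query.
for ds in api.list_datasets(search="summarization", limit=5):
    print(ds.id)
```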
diff --git a/docs/hub/datasets-pandas.md b/docs/hub/datasets-pandas.md
index 9972816fb..6de07287c 100644
--- a/docs/hub/datasets-pandas.md
+++ b/docs/hub/datasets-pandas.md
@@ -1,7 +1,7 @@
 # Pandas

 [Pandas](https://github.com/pandas-dev/pandas) is a widely used Python data analysis toolkit.
-Since it uses [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths (`hf://`) to read and write data on the Hub:
+Since it uses [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths ([`hf://`](https://huggingface.co/docs/huggingface_hub/guides/hf_file_system#integrations)) to read and write data on the Hub:

 First you need to [Login with your Hugging Face account](../huggingface_hub/quick-start#login), for example using:

@@ -17,7 +17,7 @@ from huggingface_hub import HfApi
 HfApi().create_repo(repo_id="username/my_dataset", repo_type="dataset")
 ```

-Finally, you can use Hugging Face paths in Pandas:
+Finally, you can use [Hugging Face paths](https://huggingface.co/docs/huggingface_hub/guides/hf_file_system#integrations) in Pandas:

 ```python
 import pandas as pd
diff --git a/docs/hub/datasets-usage.md b/docs/hub/datasets-usage.md
index d00bdd148..af44032be 100644
--- a/docs/hub/datasets-usage.md
+++ b/docs/hub/datasets-usage.md
@@ -1,6 +1,6 @@
 # Using 🤗 Datasets

-Once you've found an interesting dataset on the Hugging Face Hub, you can load the dataset using 🤗 Datasets. You can click on the **Use in dataset library** button to copy the code to load a dataset.
+Once you've found an interesting dataset on the Hugging Face Hub, you can load the dataset using 🤗 Datasets. You can click on the [**Use in dataset library** button](https://huggingface.co/datasets/samsum?library=true) to copy the code to load a dataset.

 First you need to [Login with your Hugging Face account](../huggingface_hub/quick-start#login), for example using: