10 changes: 5 additions & 5 deletions docs/hub/datasets-adding.md
Original file line number Diff line number Diff line change
@@ -12,7 +12,7 @@ The Hub's web-based interface allows users without any developer experience to u

A repository hosts all your dataset files, including the revision history, making it possible to store more than one dataset version.

1. Click on your profile and select **New Dataset** to create a [new dataset repository](https://huggingface.co/new-dataset).
1. Click on your profile and select **New Dataset** to create a [new dataset repository](https://huggingface.co/new-dataset).
2. Pick a name for your dataset, and choose whether it is a public or private dataset. A public dataset is visible to anyone, whereas a private dataset can only be viewed by you or members of your organization.

<div class="flex justify-center">
@@ -21,7 +21,7 @@ A repository hosts all your dataset files, including the revision history, makin

### Upload dataset

1. Once you've created a repository, navigate to the **Files and versions** tab to add a file. Select **Add file** to upload your dataset files. We support many text, audio, and image data extensions such as `.csv`, `.mp3`, and `.jpg` among many others (see full list [here](./datasets-viewer-configure.md)).
1. Once you've created a repository, navigate to the **Files and versions** tab to add a file. Select **Add file** to upload your dataset files. We support many text, audio, image, and other file formats such as `.csv`, `.mp3`, and `.jpg` (see the full list of [File formats](#file-formats)).

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/upload_files.png"/>
@@ -70,7 +70,7 @@ Make sure the Dataset Viewer correctly shows your data, or [Configure the Datase

## Using the `huggingface_hub` client library

The rich features set in the `huggingface_hub` library allows you to manage repositories, including creating repos and uploading datasets to the Model Hub. Visit [the client library's documentation](https://huggingface.co/docs/huggingface_hub/index) to learn more.
The rich feature set of the `huggingface_hub` library allows you to manage repositories, including creating repos and uploading datasets to the Hub. Visit [the client library's documentation](https://huggingface.co/docs/huggingface_hub/index) to learn more.
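As a hedged sketch of those two operations (the repo id `username/my_dataset` and the file name are placeholders, and authentication via `huggingface-cli login` is assumed), creating a dataset repo and uploading a file could look like:

```python
from huggingface_hub import HfApi


def upload_dataset_file(repo_id: str, local_path: str) -> None:
    """Create a dataset repo on the Hub (if needed) and upload one file.

    Assumes you have already authenticated, e.g. via `huggingface-cli login`.
    """
    api = HfApi()
    # exist_ok=True makes the call idempotent if the repo already exists.
    api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)
    api.upload_file(
        path_or_fileobj=local_path,
        path_in_repo=local_path,
        repo_id=repo_id,
        repo_type="dataset",
    )


# Example call (placeholder repo id):
# upload_dataset_file("username/my_dataset", "train.csv")
```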

## Using other libraries

@@ -79,7 +79,7 @@ See the list of [Libraries supported by the Datasets Hub](./datasets-libraries)

## Using Git

Since dataset repos are just Git repositories, you can use Git to push your data files to the Hub. Follow the guide on [Getting Started with Repositories](repositories-getting-started) to learn about using the `git` CLI to commit and push your datasets.
Since dataset repos are Git repositories, you can use Git to push your data files to the Hub. Follow the guide on [Getting Started with Repositories](repositories-getting-started) to learn about using the `git` CLI to commit and push your datasets.
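As a minimal local sketch (the remote URL is a placeholder, and large files additionally need `git lfs track` as described in that guide), committing a data file and pushing it could look like:

```shell
# Initialize a local repo and commit a small CSV file.
git init my_dataset
cd my_dataset
printf 'text,label\nhello,0\nworld,1\n' > train.csv
git add train.csv
git -c user.name=demo -c user.email=demo@example.com commit -m "Add train split"

# Then point it at your dataset repo on the Hub and push (placeholder URL):
# git remote add origin https://huggingface.co/datasets/username/my_dataset
# git push origin main
```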

## File formats

@@ -94,7 +94,7 @@ The Hub natively supports multiple file formats:

It also supports files compressed using ZIP (.zip), GZIP (.gz), ZSTD (.zst), BZ2 (.bz2), LZ4 (.lz4) and LZMA (.xz).
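For example, a gzip-compressed CSV can be produced locally with nothing but the Python standard library before uploading — a minimal sketch:

```python
import csv
import gzip

# Write a small CSV compressed with gzip; the Hub supports `.csv.gz` directly.
rows = [["text", "label"], ["hello", "0"], ["world", "1"]]
with gzip.open("train.csv.gz", "wt", newline="") as f:
    csv.writer(f).writerows(rows)

# Read it back to check the compression round-trips.
with gzip.open("train.csv.gz", "rt", newline="") as f:
    print(len(list(csv.reader(f))))  # 3 (header + 2 rows)
```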

Image and audio resources can also have additional metadata files, see the [Data files Configuration](./datasets-data-files-configuration) on image and audio datasets.
Image and audio resources can also have additional metadata files, see the [Data files Configuration](./datasets-data-files-configuration#image-and-audio-datasets) on image and audio datasets.

You may want to convert your files to these formats to benefit from all the Hub features.
Other formats and structures may not be recognized by the Hub.
2 changes: 1 addition & 1 deletion docs/hub/datasets-cards.md
@@ -4,7 +4,7 @@

Each dataset may be documented by the `README.md` file in the repository. This file is called a **dataset card**, and the Hugging Face Hub will render its contents on the dataset's main page. To inform users about how to responsibly use the data, it's a good idea to include information about any potential biases within the dataset. Generally, dataset cards help users understand the contents of the dataset and give context for how the dataset should be used.

You can also add dataset metadata to your card. The metadata describes important information about a dataset such as its license, language, and size. It also contains tags to help users discover a dataset on the Hub. Tags are defined in a YAML metadata section at the top of the `README.md` file.
You can also add dataset metadata to your card. The metadata describes important information about a dataset such as its license, language, and size. It also contains tags to help users discover a dataset on the Hub, as well as [data files configuration](./datasets-manual-configuration.md) options. Tags are defined in a YAML metadata section at the top of the `README.md` file.
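For instance, a minimal YAML metadata section at the top of `README.md` might look like this (illustrative values):

```yaml
---
license: mit
language:
- en
tags:
- translation
size_categories:
- 1K<n<10K
---
```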

## Dataset card metadata

4 changes: 2 additions & 2 deletions docs/hub/datasets-dask.md
@@ -1,7 +1,7 @@
# Dask

[Dask](https://github.com/dask/dask) is a parallel and distributed computing library that scales the existing Python and PyData ecosystem.
Since it uses [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths (`hf://`) to read and write data on the Hub:
Since it uses [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths ([`hf://`](https://huggingface.co/docs/huggingface_hub/guides/hf_file_system#integrations)) to read and write data on the Hub:

First you need to [Login with your Hugging Face account](../huggingface_hub/quick-start#login), for example using:

@@ -17,7 +17,7 @@ from huggingface_hub import HfApi
HfApi().create_repo(repo_id="username/my_dataset", repo_type="dataset")
```

Finally, you can use Hugging Face paths in Dask:
Finally, you can use [Hugging Face paths](https://huggingface.co/docs/huggingface_hub/guides/hf_file_system#integrations) in Dask:

```python
import dask.dataframe as dd
6 changes: 3 additions & 3 deletions docs/hub/datasets-downloading.md
@@ -2,7 +2,7 @@

## Integrated libraries

If a dataset on the Hub is tied to a [supported library](./datasets-libraries), loading the dataset can be done in just a few lines. For information on accessing the dataset, you can click on the "Use in _Library_" button on the dataset page to see how to do so. For example, `samsum` shows how to do so with 🤗 Datasets below.
If a dataset on the Hub is tied to a [supported library](./datasets-libraries), loading the dataset can be done in just a few lines. For information on accessing the dataset, you can click on the "Use in dataset library" button on the dataset page. For example, [`samsum`](https://huggingface.co/datasets/samsum?library=true) shows how to load it with 🤗 Datasets below.

<div class="flex justify-center">
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/datasets-usage.png"/>
@@ -16,7 +16,7 @@ If a dataset on the Hub is tied to a [supported library](./datasets-libraries),

## Using the Hugging Face Client Library

You can use the [`huggingface_hub`](https://github.com/huggingface/huggingface_hub) library to create, delete, update and retrieve information from repos. You can also download files from repos or integrate them into your library! For example, you can quickly load a CSV dataset with a few lines using Pandas.
You can use the [`huggingface_hub`](https://huggingface.co/docs/huggingface_hub) library to create, delete, update and retrieve information from repos. You can also download files from repos or integrate them into your library! For example, you can quickly load a CSV dataset with a few lines using Pandas.

```py
from huggingface_hub import hf_hub_download
@@ -32,7 +32,7 @@ dataset = pd.read_csv(

## Using Git

Since all datasets on the dataset Hub are Git repositories, you can clone the datasets locally by running:
Since all datasets on the Hub are Git repositories, you can clone the datasets locally by running:

```bash
git lfs install
6 changes: 4 additions & 2 deletions docs/hub/datasets-duckdb.md
@@ -1,7 +1,7 @@
# DuckDB

[DuckDB](https://github.com/duckdb/duckdb) is an in-process SQL [OLAP](https://en.wikipedia.org/wiki/Online_analytical_processing) database management system.
Since it supports [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths (`hf://`) to read and write data on the Hub:
Since it supports [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths ([`hf://`](https://huggingface.co/docs/huggingface_hub/guides/hf_file_system#integrations)) to read and write data on the Hub:

First you need to [Login with your Hugging Face account](../huggingface_hub/quick-start#login), for example using:

@@ -17,7 +17,7 @@ from huggingface_hub import HfApi
HfApi().create_repo(repo_id="username/my_dataset", repo_type="dataset")
```

Finally, you can use Hugging Face paths in DuckDB:
Finally, you can use [Hugging Face paths](https://huggingface.co/docs/huggingface_hub/guides/hf_file_system#integrations) in DuckDB:

```python
>>> from huggingface_hub import HfFileSystem
@@ -39,3 +39,5 @@ You can reload it later:
>>> duckdb.register_filesystem(fs)
>>> df = duckdb.query("SELECT * FROM 'hf://datasets/username/my_dataset/data.parquet' LIMIT 10;").df()
```

For more information on Hugging Face paths and how they are implemented, please refer to [the client library's documentation on the HfFileSystem](https://huggingface.co/docs/huggingface_hub/guides/hf_file_system).
2 changes: 1 addition & 1 deletion docs/hub/datasets-file-names-and-splits.md
@@ -2,7 +2,7 @@

To host and share your dataset, create a dataset repository on the Hugging Face Hub and upload your data files.

This guide will show you how to name your files and directories in your dataset repository when you upload it and enable all the Dataset Hub features like the Dataset Viewer.
This guide will show you how to name your files and directories in your dataset repository when you upload it and enable all the Datasets Hub features like the Dataset Viewer.
A dataset with a supported structure and [file formats](./datasets-adding#file-formats) automatically has a dataset viewer on its page on the Hub.

Note that if none of the structures below suits your case, you can have more control over how you define splits and subsets with the [Manual Configuration](./datasets-manual-configuration).
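For example, one supported structure simply names the data files after the splits — a minimal sketch:

```
my_dataset/
├── README.md
├── train.csv
└── test.csv
```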
9 changes: 5 additions & 4 deletions docs/hub/datasets-gated.md
@@ -12,9 +12,10 @@ The User Access request dialog can be modified to include additional text and ch
---
extra_gated_prompt: "You agree to not attempt to determine the identity of individuals in this dataset"
extra_gated_fields:
Company: text
Country: text
I agree to use this dataset for non-commercial use ONLY: checkbox
Name: text
Affiliation: text
Email: text
I agree to not attempt to determine the identity of speakers in this dataset: checkbox
---
```

@@ -73,4 +74,4 @@ In some cases, you might also want to modify the text in the heading of the gate
extra_gated_heading: "Acknowledge license to accept the repository"
extra_gated_button_content: "Acknowledge license"
---
```
```
2 changes: 1 addition & 1 deletion docs/hub/datasets-libraries.md
@@ -1,6 +1,6 @@
# Libraries

The Dataset Hub has support for several libraries in the Open Source ecosystem.
The Datasets Hub has support for several libraries in the Open Source ecosystem.
Thanks to the [huggingface_hub Python library](../huggingface_hub), it's easy to enable sharing your datasets on the Hub.
We're happy to welcome to the Hub a set of Open Source libraries that are pushing Machine Learning forward.

6 changes: 3 additions & 3 deletions docs/hub/datasets-overview.md
@@ -2,13 +2,13 @@

## Datasets on the Hub

The Hugging Face Hub hosts a [large number of community-curated datasets](https://huggingface.co/datasets) for a diverse range of tasks such as translation, automatic speech recognition, and image classification. Alongside the information contained in the [dataset card](./datasets-cards), many datasets, such as [GLUE](https://huggingface.co/datasets/glue), include a Dataset Preview to showcase the data.
The Hugging Face Hub hosts a [large number of community-curated datasets](https://huggingface.co/datasets) for a diverse range of tasks such as translation, automatic speech recognition, and image classification. Alongside the information contained in the [dataset card](./datasets-cards), many datasets, such as [GLUE](https://huggingface.co/datasets/glue), include a [Dataset Viewer](./datasets-viewer) to showcase the data.

Each dataset is a [Git repository](./repositories), equipped with the necessary scripts to download the data and generate splits for training, evaluation, and testing. For information on how a dataset repository is structured, refer to the [Structure your repository guide](https://huggingface.co/docs/datasets/repository_structure). Following the supported repo structure will ensure that your repository will have a preview on its dataset page on the Hub.
Each dataset is a [Git repository](./repositories), equipped with the necessary scripts to download the data and generate splits for training, evaluation, and testing. For information on how a dataset repository is structured, refer to the [Data files Configuration page](./datasets-data-files-configuration). Following the supported repo structure will ensure that the dataset page on the Hub will have a Viewer.

## Search for datasets

Like models and Spaces, you can search the Hub for datasets using the search bar in the top navigation or on the [main datasets page](https://huggingface.co/datasets). There's a large number of languages, tasks, and licenses that you can use to filter your results to find a dataset that's right for you.
Like models and Spaces, you can search the Hub for datasets using the search bar in the top navigation or on the [main datasets page](https://huggingface.co/datasets). There's a large number of languages, tasks, and licenses that you can use to filter your results to find a dataset that's right for you.

<div class="flex justify-center">
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/datasets-main.png"/>
4 changes: 2 additions & 2 deletions docs/hub/datasets-pandas.md
@@ -1,7 +1,7 @@
# Pandas

[Pandas](https://github.com/pandas-dev/pandas) is a widely used Python data analysis toolkit.
Since it uses [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths (`hf://`) to read and write data on the Hub:
Since it uses [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths ([`hf://`](https://huggingface.co/docs/huggingface_hub/guides/hf_file_system#integrations)) to read and write data on the Hub:

First you need to [Login with your Hugging Face account](../huggingface_hub/quick-start#login), for example using:

@@ -17,7 +17,7 @@ from huggingface_hub import HfApi
HfApi().create_repo(repo_id="username/my_dataset", repo_type="dataset")
```

Finally, you can use Hugging Face paths in Pandas:
Finally, you can use [Hugging Face paths](https://huggingface.co/docs/huggingface_hub/guides/hf_file_system#integrations) in Pandas:

```python
import pandas as pd
2 changes: 1 addition & 1 deletion docs/hub/datasets-usage.md
@@ -1,6 +1,6 @@
# Using 🤗 Datasets

Once you've found an interesting dataset on the Hugging Face Hub, you can load the dataset using 🤗 Datasets. You can click on the **Use in dataset library** button to copy the code to load a dataset.
Once you've found an interesting dataset on the Hugging Face Hub, you can load the dataset using 🤗 Datasets. You can click on the [**Use in dataset library** button](https://huggingface.co/datasets/samsum?library=true) to copy the code to load a dataset.

First you need to [Login with your Hugging Face account](../huggingface_hub/quick-start#login), for example using:

Expand Down