Merged

Commits (39):
e60a969 more datasets docs (lhoestq, Oct 31, 2023)
4d7bc4a add configure your dataset (lhoestq, Oct 31, 2023)
0f7a201 minor (lhoestq, Oct 31, 2023)
de191ca minor (lhoestq, Oct 31, 2023)
1216878 update toc (lhoestq, Oct 31, 2023)
441c8cc minor (lhoestq, Oct 31, 2023)
c9e24a0 add dataset structure docs (lhoestq, Oct 31, 2023)
c4e5ed0 rename sections (lhoestq, Nov 2, 2023)
3a50846 Apply suggestions from code review (lhoestq, Nov 3, 2023)
0d1e1f4 sylvain's comments: Adding a new dataset (lhoestq, Nov 3, 2023)
b1b6f5b sylvain-s comments: Configure the Dataset Viewer (lhoestq, Nov 3, 2023)
1e2923a sylvain's comments: File names and splits (lhoestq, Nov 3, 2023)
07c493e sylvain's comments: Manual Configuration (lhoestq, Nov 3, 2023)
10fc614 sylvain's comments: Libraries (lhoestq, Nov 3, 2023)
624c0c4 sylvain's comment: login (lhoestq, Nov 3, 2023)
3d72fb8 sylvain's comments: Using 🤗 Datasets (lhoestq, Nov 3, 2023)
bb04591 lucain's comment: Adding a new dataset (lhoestq, Nov 3, 2023)
1b463dc add duckdb write (lhoestq, Nov 3, 2023)
164589b add create repo step for dask and pandas (lhoestq, Nov 3, 2023)
f13de19 minor (lhoestq, Nov 5, 2023)
0002a38 minor (lhoestq, Nov 5, 2023)
167ea21 typo (lhoestq, Nov 6, 2023)
a379824 Update docs/hub/_toctree.yml (lhoestq, Nov 6, 2023)
7bf18c5 Apply suggestions from code review (lhoestq, Nov 6, 2023)
3e4836b rename titles for consistency (lhoestq, Nov 6, 2023)
1b4b81a Add Downloading Datasets (lhoestq, Nov 6, 2023)
e6d7d39 Apply suggestions from code review (lhoestq, Nov 8, 2023)
4b74ddc Merge branch 'more-datasets-docs-continued' into more-datasets-docs (lhoestq, Nov 8, 2023)
0028030 more links to list of supported formats (lhoestq, Nov 8, 2023)
fe1e75e move supported file formats to upload page (lhoestq, Nov 9, 2023)
e257db7 fix link to row (lhoestq, Nov 9, 2023)
79d2812 fix links (lhoestq, Nov 9, 2023)
2ffc2cc again (lhoestq, Nov 9, 2023)
c9d0f53 remove duplicate (lhoestq, Nov 9, 2023)
dac5618 move configure viewer to bottom of page (lhoestq, Nov 9, 2023)
74dbd4f fix to_parquet example (lhoestq, Nov 9, 2023)
8e3da3c minor fix (lhoestq, Nov 14, 2023)
e63d0e4 Update docs/hub/datasets-file-names-and-splits.md (lhoestq, Nov 14, 2023)
fa1093e Apply suggestions from code review (lhoestq, Nov 15, 2023)
31 changes: 27 additions & 4 deletions docs/hub/_toctree.yml
@@ -125,12 +125,35 @@
title: Dataset Cards
- local: datasets-gated
title: Gated Datasets
- local: datasets-adding
title: Uploading Datasets
- local: datasets-downloading
title: Downloading Datasets
- local: datasets-libraries
title: Integrated Libraries
sections:
- local: datasets-dask
title: Dask
- local: datasets-usage
title: Datasets
- local: datasets-duckdb
title: DuckDB
- local: datasets-pandas
title: Pandas
- local: datasets-webdataset
title: WebDataset
- local: datasets-viewer
title: Dataset Viewer
- local: datasets-usage
title: Using Datasets
- local: datasets-adding
title: Adding New Datasets
sections:
- local: datasets-viewer-configure
title: Configure the Dataset Viewer
- local: datasets-data-files-configuration
title: Data files Configuration
sections:
- local: datasets-file-names-and-splits
title: File names and splits
- local: datasets-manual-configuration
title: Manual Configuration
- local: spaces
title: Spaces
isExpanded: true
103 changes: 95 additions & 8 deletions docs/hub/datasets-adding.md
@@ -1,13 +1,100 @@
# Adding new datasets
# Uploading datasets

Any Hugging Face user can create a dataset! You can start by [creating your dataset repository](https://huggingface.co/new-dataset) and choosing one of the following methods to upload your dataset:
The [Hub](https://huggingface.co/datasets) is home to an extensive collection of community-curated and research datasets. We encourage you to share your dataset to the Hub to help grow the ML community and accelerate progress for everyone. All contributions are welcome; adding a dataset is just a drag and drop away!

* [Add files manually to the repository through the UI](https://huggingface.co/docs/datasets/upload_dataset#upload-your-files)
* [Push files with the `push_to_hub` method from 🤗 Datasets](https://huggingface.co/docs/datasets/upload_dataset#upload-from-python)
* [Use Git to commit and push your dataset files](https://huggingface.co/docs/datasets/share#clone-the-repository)
Start by [creating a Hugging Face Hub account](https://huggingface.co/join) if you don't have one yet.

While in many cases it's possible to just add raw data to your dataset repo in any supported formats (JSON, CSV, Parquet, text, images, audio files, …), for some large datasets you may want to [create a loading script](https://huggingface.co/docs/datasets/dataset_script#create-a-dataset-loading-script). This script defines the different configurations and splits of your dataset, as well as how to download and process the data.
## Upload using the Hub UI

## Datasets outside a namespace
The Hub's web-based interface allows users without any developer experience to upload a dataset.

Datasets outside a namespace are maintained by the Hugging Face team. Unlike the naming convention used for community datasets (`username/dataset_name` or `org/dataset_name`), datasets outside a namespace can be referenced directly by their name (e.g. [`glue`](https://huggingface.co/datasets/glue)). If you find that an improvement is needed, use their "Community" tab to open a discussion or submit a PR on the Hub to propose edits.
### Create a repository

A repository hosts all your dataset files, including the revision history, making it possible to store more than one dataset version.

1. Click on your profile and select **New Dataset** to create a [new dataset repository](https://huggingface.co/new-dataset).
2. Pick a name for your dataset, and choose whether it is a public or private dataset. A public dataset is visible to anyone, whereas a private dataset can only be viewed by you or members of your organization.

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/create_repo.png"/>
</div>

### Upload dataset

1. Once you've created a repository, navigate to the **Files and versions** tab to add a file. Select **Add file** to upload your dataset files. We support many text, audio, and image data extensions such as `.csv`, `.mp3`, and `.jpg`, among many others (see the full list [here](./datasets-viewer-configure.md)).

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/upload_files.png"/>
</div>

2. Drag and drop your dataset files.

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/commit_files.png"/>
</div>

3. After uploading your dataset files, they are stored in your dataset repository.

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/files_stored.png"/>
</div>

### Create a Dataset card

Adding a Dataset card is super valuable for helping users find your dataset and understand how to use it responsibly.
> **Contributor:** I would add a link to https://huggingface.co/docs/hub/datasets-cards for users that don't know what a Dataset card is.

1. Click on **Create Dataset Card** to create a [Dataset card](./datasets-cards). This button creates a `README.md` file in your repository.

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/dataset_card.png"/>
</div>

2. At the top, you'll see the **Metadata UI** with several fields to select from such as license, language, and task categories. These are the most important tags for helping users discover your dataset on the Hub (when applicable). When you select an option for a field, it will be automatically added to the top of the dataset card.

You can also look at the [Dataset Card specifications](https://github.com/huggingface/hub-docs/blob/main/datasetcard.md?plain=1), which has a complete set of allowed tags, including optional tags like `annotations_creators`, to help you choose the ones that are useful for your dataset.

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/metadata_ui.png"/>
</div>
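
For reference, the options you pick in the Metadata UI are saved as YAML at the top of the `README.md`. A minimal sketch with illustrative values:

```yaml
---
license: apache-2.0
language:
- en
task_categories:
- text-classification
---
```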

3. Write your dataset documentation in the Dataset Card to introduce your dataset to the community and help users understand what is inside: the use cases and limitations, where the data comes from, any important ethical considerations, and any other relevant details.

You can click on the **Import dataset card template** link at the top of the editor to automatically create a dataset card template. For a detailed example of what a good Dataset card should look like, take a look at the [CNN DailyMail Dataset card](https://huggingface.co/datasets/cnn_dailymail).

### Dataset Viewer

The [Dataset Viewer](./datasets-viewer) is useful to know what the data actually looks like before you download it.
It is enabled by default for all public datasets.

Make sure the Dataset Viewer correctly shows your data, or [Configure the Dataset Viewer](./datasets-viewer-configure).

## Using the `huggingface_hub` client library

The rich feature set in the `huggingface_hub` library allows you to manage repositories, including creating repos and uploading datasets to the Hub. Visit [the client library's documentation](https://huggingface.co/docs/huggingface_hub/index) to learn more.
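
A short sketch of what an upload can look like with this library (the repository name is illustrative, and `create_repo` is only needed the first time):

```python
from huggingface_hub import HfApi

api = HfApi()
# create the dataset repo (skip if it already exists)
api.create_repo(repo_id="username/my_dataset", repo_type="dataset")
# upload a local folder of data files to the repo
api.upload_folder(
    folder_path="path/to/local/dataset",
    repo_id="username/my_dataset",
    repo_type="dataset",
)
```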

## Using other libraries

Some libraries like [🤗 Datasets](https://huggingface.co/docs/datasets/index), [Pandas](https://pandas.pydata.org/), [Dask](https://www.dask.org/) or [DuckDB](https://duckdb.org/) can upload files to the Hub.
See the list of [Libraries supported by the Datasets Hub](./datasets-libraries) for more information.
Comment on lines +77 to +78

> **Contributor:** Would be better to have links to the upload examples from these libraries' dedicated doc pages here (we can split these pages into the Upload and Download sections to make them linkable)
>
> **Member Author:** The links already appear on the navigation tab on the left when you are on this page
>
> **Contributor:** Okay, I see. But I'm not sure why these links are not expanded (automatically) when clicking on the [Libraries supported by the Datasets Hub](./datasets-libraries) link on my machine.
>
> **Member Author:** Hmm, I'm also seeing this behavior when clicking on the Libraries link from the documentation index page. Let's see with the docs front-end team if we can fix that.
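
For example, with Pandas, uploading can be as simple as writing a DataFrame to an `hf://` path. A minimal sketch, assuming `huggingface_hub` is installed, you are logged in, and the dataset repository already exists:

```python
import pandas as pd

# a small example DataFrame standing in for your data
df = pd.DataFrame({"text": ["hello", "world"]})
df.to_parquet("hf://datasets/username/my_dataset/data.parquet")
```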


## Using Git

Since dataset repos are just Git repositories, you can use Git to push your data files to the Hub. Follow the guide on [Getting Started with Repositories](repositories-getting-started) to learn about using the `git` CLI to commit and push your datasets.
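
As an illustration (the repository name and file are hypothetical), a typical Git workflow looks like:

```bash
git lfs install
git clone git@hf.co:datasets/username/my_dataset
cd my_dataset
cp /path/to/train.csv .
git add train.csv
git commit -m "Add train split"
git push
```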

## File formats

The Hub natively supports multiple file formats:

- CSV (.csv, .tsv)
- JSON Lines, JSON (.jsonl, .json)
- Parquet (.parquet)
- Text (.txt)
- Images (.png, .jpg, etc.)
- Audio (.wav, .mp3, etc.)

It also supports files compressed using ZIP (.zip), GZIP (.gz), ZSTD (.zst), BZ2 (.bz2), LZ4 (.lz4) and LZMA (.xz).

Image and audio resources can also have additional metadata files; see the [Data files Configuration](./datasets-data-files-configuration) documentation on image and audio datasets.

You may want to convert your files to these formats to benefit from all the Hub features.
Other formats and structures may not be recognized by the Hub.
47 changes: 47 additions & 0 deletions docs/hub/datasets-dask.md
@@ -0,0 +1,47 @@
# Dask

[Dask](https://github.com/dask/dask) is a parallel and distributed computing library that scales the existing Python and PyData ecosystem.
Since it uses [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use Hugging Face paths (`hf://`) to read and write data on the Hub.

First you need to [log in with your Hugging Face account](../huggingface_hub/quick-start#login), for example using:

```bash
huggingface-cli login
```

Then you can [create a dataset repository](../huggingface_hub/quick-start#create-a-repository), for example using:

```python
from huggingface_hub import HfApi

HfApi().create_repo(repo_id="username/my_dataset", repo_type="dataset")
```

Finally, you can use Hugging Face paths in Dask:

```python
import dask.dataframe as dd

# df is an existing Dask DataFrame (e.g. created with dd.read_csv or dd.from_pandas)
df.to_parquet("hf://datasets/username/my_dataset")

# or write in separate directories if the dataset has train/validation/test splits
df_train.to_parquet("hf://datasets/username/my_dataset/train")
df_valid.to_parquet("hf://datasets/username/my_dataset/validation")
df_test.to_parquet("hf://datasets/username/my_dataset/test")
```

This creates a dataset repository `username/my_dataset` containing your Dask dataset in Parquet format.
You can reload it later:

```python
import dask.dataframe as dd

df = dd.read_parquet("hf://datasets/username/my_dataset")

# or read from separate directories if the dataset has train/validation/test splits
df_train = dd.read_parquet("hf://datasets/username/my_dataset/train")
df_valid = dd.read_parquet("hf://datasets/username/my_dataset/validation")
df_test = dd.read_parquet("hf://datasets/username/my_dataset/test")
```

For more information on Hugging Face paths and how they are implemented, please refer to [the client library's documentation on the HfFileSystem](https://huggingface.co/docs/huggingface_hub/guides/hf_file_system).
29 changes: 29 additions & 0 deletions docs/hub/datasets-data-files-configuration.md
@@ -0,0 +1,29 @@
# Data files Configuration

There are no constraints on how to structure dataset repositories.

However, if you want the Dataset Viewer to show certain data files, or to separate your dataset in train/validation/test splits, you need to structure your dataset accordingly.
Often it is as simple as naming your data files according to their split names, e.g. `train.csv` and `test.csv`.

## File names and splits

To structure your dataset by naming your data files or directories according to their split names, see the [File names and splits](./datasets-file-names-and-splits) documentation.
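
For example, a repository with the following layout (illustrative file names) gets a `train` and a `test` split:

```
my_dataset_repository/
├── README.md
├── train.csv
└── test.csv
```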

## Manual configuration

You can choose the data files to show in the Dataset Viewer for your dataset using YAML.
This is useful if you want to manually specify which file goes into which split.

You can also define multiple configurations (or subsets) for your dataset, and pass dataset building parameters (e.g. the separator to use for CSV files).

See the documentation on [Manual configuration](./datasets-manual-configuration) for more information.
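
As a sketch of what this can look like (illustrative paths; the exact fields are described in the linked page), the YAML goes at the top of your dataset's `README.md`:

```yaml
configs:
- config_name: default
  data_files:
  - split: train
    path: "data/train.csv"
  - split: test
    path: "data/test.csv"
```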

## Image and Audio datasets

For image and audio classification datasets, you can also use directories to name the image and audio classes.
If your images/audio files have metadata (e.g. captions, bounding boxes, transcriptions, etc.), you can place metadata files next to them.
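
For illustration, a hypothetical image classification layout where directory names are the class labels:

```
my_dataset_repository/
├── README.md
├── cat/
│   ├── img_0.jpg
│   └── img_1.jpg
└── dog/
    ├── img_2.jpg
    └── img_3.jpg
```

A `metadata.csv` file with a `file_name` column placed next to the images is one way to attach captions or other metadata; the guides below cover the details.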

We provide two guides that you can check out:

- [How to create an image dataset](https://huggingface.co/docs/datasets/image_dataset)
- [How to create an audio dataset](https://huggingface.co/docs/datasets/audio_dataset)
44 changes: 44 additions & 0 deletions docs/hub/datasets-downloading.md
@@ -0,0 +1,44 @@
# Downloading datasets

## Integrated libraries

If a dataset on the Hub is tied to a [supported library](./datasets-libraries), loading the dataset can be done in just a few lines. For information on accessing the dataset, you can click on the "Use in _Library_" button on the dataset page. For example, `samsum` shows how to do so with 🤗 Datasets below.

<div class="flex justify-center">
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/datasets-usage.png"/>
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/datasets-usage-dark.png"/>
</div>

<div class="flex justify-center">
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/datasets-usage-modal.png"/>
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/datasets-usage-modal-dark.png"/>
</div>
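
For instance, loading `samsum` with 🤗 Datasets comes down to a couple of lines (a sketch of what the modal shows):

```python
from datasets import load_dataset

dataset = load_dataset("samsum")
```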

## Using the Hugging Face Client Library

You can use the [`huggingface_hub`](https://github.com/huggingface/huggingface_hub) library to create, delete, and update repos, and to retrieve information from them. You can also download files from repos or integrate them into your library! For example, you can quickly load a CSV dataset with a few lines using Pandas.

```py
from huggingface_hub import hf_hub_download
import pandas as pd

REPO_ID = "YOUR_REPO_ID"
FILENAME = "data.csv"

dataset = pd.read_csv(
hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type="dataset")
)
```
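
To fetch a whole dataset repository rather than a single file, you can use `snapshot_download` from the same library (a minimal sketch):

```python
from huggingface_hub import snapshot_download

# downloads every file in the repo to a local folder
snapshot_download(repo_id="YOUR_REPO_ID", repo_type="dataset", local_dir="my_dataset")
```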

## Using Git

Since all datasets on the Hub are Git repositories, you can clone them locally by running:

```bash
git lfs install
git clone git@hf.co:datasets/<dataset ID> # example: git clone git@hf.co:datasets/allenai/c4
```

If you have write access to the particular dataset repo, you'll also be able to commit and push revisions to the dataset.

Add your SSH public key to [your user settings](https://huggingface.co/settings/keys) to push changes and/or access private repos.
41 changes: 41 additions & 0 deletions docs/hub/datasets-duckdb.md
@@ -0,0 +1,41 @@
# DuckDB

[DuckDB](https://github.com/duckdb/duckdb) is an in-process SQL [OLAP](https://en.wikipedia.org/wiki/Online_analytical_processing) database management system.
Since it supports [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use Hugging Face paths (`hf://`) to read and write data on the Hub.

First you need to [log in with your Hugging Face account](../huggingface_hub/quick-start#login), for example using:

```bash
huggingface-cli login
```

Then you can [create a dataset repository](../huggingface_hub/quick-start#create-a-repository), for example using:

```python
from huggingface_hub import HfApi

HfApi().create_repo(repo_id="username/my_dataset", repo_type="dataset")
```

Finally, you can use Hugging Face paths in DuckDB:

```python
>>> from huggingface_hub import HfFileSystem
>>> import duckdb

>>> fs = HfFileSystem()
>>> duckdb.register_filesystem(fs)
>>> # "tbl" is an existing DuckDB table or view containing your dataset
>>> duckdb.sql("COPY tbl TO 'hf://datasets/username/my_dataset/data.parquet' (FORMAT PARQUET);")
```

This creates a file `data.parquet` in the dataset repository `username/my_dataset` containing your dataset in Parquet format.
You can reload it later:

```python
>>> from huggingface_hub import HfFileSystem
>>> import duckdb

>>> fs = HfFileSystem()
>>> duckdb.register_filesystem(fs)
>>> df = duckdb.query("SELECT * FROM 'hf://datasets/username/my_dataset/data.parquet' LIMIT 10;").df()
```