From e60a969d5dff28ec88bf214a550c8621f3bd875b Mon Sep 17 00:00:00 2001 From: Quentin Lhoest Date: Tue, 31 Oct 2023 16:54:25 +0100 Subject: [PATCH 01/38] more datasets docs --- docs/hub/_toctree.yml | 19 ++++-- docs/hub/datasets-adding.md | 87 ++++++++++++++++++++++++--- docs/hub/datasets-dask.md | 39 ++++++++++++ docs/hub/datasets-duckdb.md | 21 +++++++ docs/hub/datasets-libraries.md | 16 +++++ docs/hub/datasets-pandas.md | 39 ++++++++++++ docs/hub/datasets-usage.md | 6 +- docs/hub/datasets-viewer-configure.md | 56 +++++++++++++++++ docs/hub/datasets-viewer.md | 27 +++++---- docs/hub/datasets-webdataset.md | 20 ++++++ docs/hub/index.md | 3 +- 11 files changed, 304 insertions(+), 29 deletions(-) create mode 100644 docs/hub/datasets-dask.md create mode 100644 docs/hub/datasets-duckdb.md create mode 100644 docs/hub/datasets-libraries.md create mode 100644 docs/hub/datasets-pandas.md create mode 100644 docs/hub/datasets-viewer-configure.md create mode 100644 docs/hub/datasets-webdataset.md diff --git a/docs/hub/_toctree.yml b/docs/hub/_toctree.yml index dd584bf59..70991e6f7 100644 --- a/docs/hub/_toctree.yml +++ b/docs/hub/_toctree.yml @@ -125,12 +125,23 @@ title: Dataset Cards - local: datasets-gated title: Gated Datasets - - local: datasets-viewer - title: Dataset Viewer - - local: datasets-usage - title: Using Datasets - local: datasets-adding title: Adding New Datasets + - local: datasets-viewer + title: Dataset Viewer + - local: datasets-libraries + title: Libraries + sections: + - local: datasets-dask + title: Dask + - local: datasets-usage + title: Datasets + - local: datasets-duckdb + title: DuckDB + - local: datasets-pandas + title: Pandas + - local: datasets-webdataset + title: WebDataset - local: spaces title: Spaces isExpanded: true diff --git a/docs/hub/datasets-adding.md b/docs/hub/datasets-adding.md index 1a7f01fc8..492dffce2 100644 --- a/docs/hub/datasets-adding.md +++ b/docs/hub/datasets-adding.md @@ -1,13 +1,84 @@ -# Adding new datasets +# Adding a new dataset -Any Hugging Face user can create a dataset! You can start by [creating your dataset repository](https://huggingface.co/new-dataset) and choosing one of the following methods to upload your dataset: +The [Hub](https://huggingface.co/datasets) is home to an extensive collection of community-curated and popular research datasets. We encourage you to share your dataset to the Hub to help grow the ML community and accelerate progress for everyone. All contributions are welcome; adding a dataset is just a drag and drop away! -* [Add files manually to the repository through the UI](https://huggingface.co/docs/datasets/upload_dataset#upload-your-files) -* [Push files with the `push_to_hub` method from πŸ€— Datasets](https://huggingface.co/docs/datasets/upload_dataset#upload-from-python) -* [Use Git to commit and push your dataset files](https://huggingface.co/docs/datasets/share#clone-the-repository) +Start by [creating a Hugging Face Hub account](https://huggingface.co/join) if you don't have one yet. -While in many cases it's possible to just add raw data to your dataset repo in any supported formats (JSON, CSV, Parquet, text, images, audio files, …), for some large datasets you may want to [create a loading script](https://huggingface.co/docs/datasets/dataset_script#create-a-dataset-loading-script). This script defines the different configurations and splits of your dataset, as well as how to download and process the data. 
+## Upload with the Hub UI -## Datasets outside a namespace +The Hub's web-based interface allows users without any developer experience to upload a dataset. -Datasets outside a namespace are maintained by the Hugging Face team. Unlike the naming convention used for community datasets (`username/dataset_name` or `org/dataset_name`), datasets outside a namespace can be referenced directly by their name (e.g. [`glue`](https://huggingface.co/datasets/glue)). If you find that an improvement is needed, use their "Community" tab to open a discussion or submit a PR on the Hub to propose edits. \ No newline at end of file +### Create a repository + +A repository hosts all your dataset files, including the revision history, making storing more than one dataset version possible. + +1. Click on your profile and select **New Dataset** to create a new dataset repository. +2. Pick a name for your dataset, and choose whether it is a public or private dataset. A public dataset is visible to anyone, whereas a private dataset can only be viewed by you or members of your organization. + +
+ +
+ +### Upload dataset + +1. Once you've created a repository, navigate to the **Files and versions** tab to add a file. Select **Add file** to upload your dataset files. We support many text, audio, and image data extensions such as `.csv`, `.mp3`, and `.jpg` among many others. For text data extensions like `.csv`, `.json`, `.jsonl`, and `.txt`, we recommend compressing them before uploading to the Hub (to `.zip` or `.gz` file extension for example). + + Text file extensions are not tracked by Git LFS by default, and if they're greater than 10MB, they will not be committed and uploaded. Take a look at the `.gitattributes` file in your repository for a complete list of tracked file extensions. For this tutorial, you can use the following sample `.csv` files since they're small: train.csv, test.csv. + +
+ +
+ +2. Drag and drop your dataset files and add a brief descriptive commit message. + +
+ +
+ +3. After uploading your dataset files, they are stored in your dataset repository. + +
+ +
+ +### Create a Dataset card + +Adding a Dataset card is super valuable for helping users find your dataset and understand how to use it responsibly. + +1. Click on **Create Dataset Card** to create a Dataset card. This button creates a `README.md` file in your repository. + +
+ +
+ +2. At the top, you'll see the **Metadata UI** with several fields to select from like license, language, and task categories. These are the most important tags for helping users discover your dataset on the Hub. When you select an option from each field, they'll be automatically added to the top of the dataset card. + + You can also look at the [Dataset Card specifications](https://github.com/huggingface/hub-docs/blob/main/datasetcard.md?plain=1), which has a complete set of (but not required) tag options like `annotations_creators`, to help you choose the appropriate tags. + +
+ +

3. Click on the **Import dataset card template** link at the top of the editor to automatically create a dataset card template. Filling out the template is a great way to introduce your dataset to the community and help users understand how to use it. For a detailed example of what a good Dataset card should look like, take a look at the [CNN DailyMail Dataset card](https://huggingface.co/datasets/cnn_dailymail).

### Dataset Viewer

The [Dataset Viewer](./datasets-viewer) is crucial to know what the data actually look like.
You can [configure it](./datasets-viewer-configure) and specify which files to show and how they should be shown.

## Using Git

Since model repos are just Git repositories, you can use Git to push your model files to the Hub. Follow the guide on [Getting Started with Repositories](repositories-getting-started) to learn about using the `git` CLI to commit and push your models.

## Using the `huggingface_hub` client library

The rich feature set in the `huggingface_hub` library allows you to manage repositories, including creating repos and uploading models to the Model Hub. Visit [the client library's documentation](https://huggingface.co/docs/huggingface_hub/index) to learn more.

## Using other libraries

Some libraries such as [πŸ€— Datasets](https://huggingface.co/docs/datasets/index), [Pandas](https://pandas.pydata.org/), [Dask](https://www.dask.org/) or [DuckDB](https://duckdb.org/) can upload files to the Hub.
See the list of [Libraries supported by the Datasets Hub](./datasets-libraries) for more information.

## Using Git

Since dataset repos are just Git repositories, you can use Git to push your data files to the Hub. Follow the guide on [Getting Started with Repositories](repositories-getting-started) to learn about using the `git` CLI to commit and push your datasets.
diff --git a/docs/hub/datasets-dask.md b/docs/hub/datasets-dask.md
new file mode 100644
index 000000000..2ca11ad89
--- /dev/null
+++ b/docs/hub/datasets-dask.md
@@ -0,0 +1,39 @@
# Dask

[Dask](https://github.com/dask/dask) is a parallel and distributed computing library that scales the existing Python and PyData ecosystem.
Since it uses [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths (`hf://`) to read and write data on the Hub:

First login using

```
huggingface-cli login
```

And then you can use Hugging Face paths in Dask:

```python
import dask.dataframe as dd

# Dask DataFrames are written with `to_parquet` (df is an existing Dask DataFrame)
df.to_parquet("hf://datasets/username/my_dataset")

# or write in separate directories if the dataset has train/validation/test splits
df_train.to_parquet("hf://datasets/username/my_dataset/train")
df_valid.to_parquet("hf://datasets/username/my_dataset/validation")
df_test.to_parquet("hf://datasets/username/my_dataset/test")
```

This creates a dataset repository `username/my_dataset` containing your Dask dataset in Parquet format. 
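In the snippet above, `df` is assumed to be an existing Dask DataFrame. For a quick local test, one way to build a small one (a sketch with made-up columns):

```python
import pandas as pd
import dask.dataframe as dd

# build a tiny Dask DataFrame from an in-memory pandas DataFrame
pandas_df = pd.DataFrame({"text": ["hello", "world"], "label": [0, 1]})
df = dd.from_pandas(pandas_df, npartitions=2)
```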
+You can reload it later: + +```python +import dask.dataframe as dd + +df = dd.read_parquet("hf://datasets/username/my_dataset") + +# or read from separate directories if the dataset has train/validation/test splits +df_train = dd.read_parquet("hf://datasets/username/my_dataset/train") +df_valid = dd.read_parquet("hf://datasets/username/my_dataset/validation") +df_test = dd.read_parquet("hf://datasets/username/my_dataset/test") +``` + +To have more information on the Hugging Face paths and how they are implemented, please refer to the [the client library's documentation on the HfFileSystem](https://huggingface.co/docs/huggingface_hub/guides/hf_file_system). diff --git a/docs/hub/datasets-duckdb.md b/docs/hub/datasets-duckdb.md new file mode 100644 index 000000000..0ea465fe0 --- /dev/null +++ b/docs/hub/datasets-duckdb.md @@ -0,0 +1,21 @@ +# DuckDB + +[DuckDB](https://github.com/duckdb/duckdb) is an in-process SQL OLAP database management system. +Since it supports [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths (`hf://`) to read and write data on the Hub: + +First login using + +``` +huggingface-cli login +``` + +And then you can use Hugging Face paths in DuckDB: + +```python +>>> from huggingface_hub import HfFileSystem +>>> import duckdb + +>>> fs = HfFileSystem() +>>> duckdb.register_filesystem(fs) +>>> df = duckdb.query(f"SELECT * FROM 'hf://datasets/username/my_dataset/data.parquet' LIMIT 10").df() +``` diff --git a/docs/hub/datasets-libraries.md b/docs/hub/datasets-libraries.md new file mode 100644 index 000000000..6f95fadd4 --- /dev/null +++ b/docs/hub/datasets-libraries.md @@ -0,0 +1,16 @@ +# Libraries + +The Dataset Hub has support for several libraries in the Open Source ecosystem. +Thanks to the `huggingface_hub` Python library, it's easy to enable sharing your datasets on the Hub. +The Hub supports many libraries, and we're working on expanding this support! +We're happy to welcome to the Hub a set of Open Source libraries that are pushing Machine Learning forward. + +The table below summarizes the supported libraries and their level of integration. + +| Library | Description | Download from Hub | Push to Hub | +|-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------|---|----| +| [Dask](https://github.com/dask/dask) | Parallel and distributed computing library that scales the existing Python and PyData ecosystem. | βœ… | βœ… | +| [Datasets](https://github.com/huggingface/datasets) | πŸ€— Datasets is a library for accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP). | βœ… | βœ… | +| [DuckDB](https://github.com/duckdb/duckdb) | In-process SQL OLAP database management system. | βœ… | βœ… | +| [Pandas](https://github.com/pandas-dev/pandas) | Python data analysis toolkit. | βœ… | βœ… | +| [WebDataset](https://github.com/webdataset/webdataset) | Library to write I/O pipelines for large datasets. | βœ… | ❌ | diff --git a/docs/hub/datasets-pandas.md b/docs/hub/datasets-pandas.md new file mode 100644 index 000000000..abe02f1ea --- /dev/null +++ b/docs/hub/datasets-pandas.md @@ -0,0 +1,39 @@ +# Pandas + +[Pandas](https://github.com/pandas-dev/pandas) is widely used Python data analysis toolkit. 
Since it uses [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths (`hf://`) to read and write data on the Hub:

First login using

```
huggingface-cli login
```

And then you can use Hugging Face paths in Pandas:

```python
import pandas as pd

# pandas DataFrames are written with `to_parquet` (df is an existing DataFrame)
df.to_parquet("hf://datasets/username/my_dataset/data.parquet")

# or write in separate files if the dataset has train/validation/test splits
df_train.to_parquet("hf://datasets/username/my_dataset/train.parquet")
df_valid.to_parquet("hf://datasets/username/my_dataset/validation.parquet")
df_test.to_parquet("hf://datasets/username/my_dataset/test.parquet")
```

This creates a dataset repository `username/my_dataset` containing your Pandas dataset in Parquet format.
You can reload it later:

```python
import pandas as pd

df = pd.read_parquet("hf://datasets/username/my_dataset/data.parquet")

# or read from separate files if the dataset has train/validation/test splits
df_train = pd.read_parquet("hf://datasets/username/my_dataset/train.parquet")
df_valid = pd.read_parquet("hf://datasets/username/my_dataset/validation.parquet")
df_test = pd.read_parquet("hf://datasets/username/my_dataset/test.parquet")
```

To have more information on the Hugging Face paths and how they are implemented, please refer to [the client library's documentation on the HfFileSystem](https://huggingface.co/docs/huggingface_hub/guides/hf_file_system).
diff --git a/docs/hub/datasets-usage.md b/docs/hub/datasets-usage.md
index b32807d7b..b156c2efd 100644
--- a/docs/hub/datasets-usage.md
+++ b/docs/hub/datasets-usage.md
@@ -2,8 +2,4 @@
 Once you've found an interesting dataset on the Hugging Face Hub, you can load the dataset using πŸ€— Datasets. You can click on the **Use in dataset library** button to copy the code to load a dataset.
-Some datasets on the Hub contain a [loading script](https://huggingface.co/docs/datasets/dataset_script), which allows you to easily [load the dataset when you need it](https://huggingface.co/docs/datasets/load_hub).
-
-Many datasets however do not need to include a loading script, for instance when their data is stored directly in the repository in formats such as CSV, JSON and Parquet. πŸ€— Datasets can [load those kinds of datasets](https://huggingface.co/docs/datasets/loading#hugging-face-hub) automatically without a loading script.
-
-For more information about using πŸ€— Datasets, check out the [tutorials](https://huggingface.co/docs/datasets/tutorial) and [how-to guides](https://huggingface.co/docs/datasets/how_to) available in the πŸ€— Datasets documentation. \ No newline at end of file
+For more information about using πŸ€— Datasets, check out the [tutorials](https://huggingface.co/docs/datasets/tutorial) and [how-to guides](https://huggingface.co/docs/datasets/how_to) available in the πŸ€— Datasets documentation.
diff --git a/docs/hub/datasets-viewer-configure.md b/docs/hub/datasets-viewer-configure.md
new file mode 100644
index 000000000..9bce3239d
--- /dev/null
+++ b/docs/hub/datasets-viewer-configure.md
@@ -0,0 +1,56 @@
# Configure the Dataset Viewer

The Dataset Viewer supports many data file formats, from text to tabular and from image to audio formats.
It also separates the train/validation/test splits based on file and folder names.

To configure the Dataset Viewer for your dataset, make sure your dataset is in a supported data format and structured the right way. 
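For example, a repository as simple as the following sketch is already supported, with the split names taken from the file names:

```
my_dataset_repository/
β”œβ”€β”€ README.md
β”œβ”€β”€ train.csv
└── test.csv
```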
+ +## Supported data formats + +The dataset viewer supports multiple file formats: + +- CSV (.csv, .tsv) +- JSON Lines, JSON (.jsonl, .json) +- Text (.txt) +- Images (.png, .jpg, etc.) +- Audio (.wav, .mp3, etc.) +- Parquet (.parquet) + +Parquet is often a good option: it is a column-oriented format designed for data storage and retrieval. +Parquet files are smaller than CSV files, and they also support nested data structures which makes them ideal for storing complex data. +Parquet key features are: + +- Efficient compression: Parquet’s columnar storage format enables efficient data compression, reducing storage costs and download/upload times. +- Fast query performance: Parquet’s columnar storage format allows to only load or scan the data you need, which enables faster query performance. +- Compatible with many data tools: Parquet is compatible with a wide range of data analysis and manipulation tools like Pandas and DuckDB, and also with big data processing frameworks including Apache Spark and Dask. + +The dataset viewer also supports files compressed using ZIP (.zip), GZIP (.gz), ZSTD (.zst), BZ2 (.bz2), LZ4 (.lz4) and LZMA (.xz). + +## Supported dataset structures + +You can name the data files or their folder after their split names train/validation/test. +If there are no split names, all the data files are considered part of the train split. + +For image and audio classification datasets, you can also use directories to name the image and audio classes. +And if your images/audio files have metadata (e.g. captions, bounding boxes, transcriptions, etc.), you can have metadata files next to them. + +It is also possible to customize your splits manually. +Indeed, you can use YAML to: + +- List the data files per split +- Define multiple datasets configurations (e.g. if you dataset has multiple subsets or languages) +- Pass dataset building parameters (e.g. the separator used in your CSV files). + +For more information, feel free to check out the guide on [How to structure your dataset repository](https://huggingface.co/docs/datasets/repository_structure) + +## Disable the viewer + +The dataset viewer can be disabled. To do this, add a YAML section to the dataset's `README.md` file (create one if it does not already exist) and add a `viewer` property with the value `false`. + +``` +--- +viewer: false +--- +``` + +Note that the viewer is always disabled on the private datasets. diff --git a/docs/hub/datasets-viewer.md b/docs/hub/datasets-viewer.md index 61377f2e7..00ecc215b 100644 --- a/docs/hub/datasets-viewer.md +++ b/docs/hub/datasets-viewer.md @@ -7,9 +7,19 @@ The dataset page includes a table with the contents of the dataset, arranged by +## Inspect data distributions + +At the top of each column you can see histograms representing the distributions of numerical values and text lengths. +For categorical data there is also the number of rows from each class. + +## Filter by value + +If you click on a bar of a histogram from a numerical column, the dataset viewer will filter the data and show only the rows with values that fall in the selected range. +Similarly, if you select one class from a categorical column, it will show only the rows from the selected category. + ## Search a word in the dataset -You can search for a word in the dataset by typing it in the search bar at the top of the table. The search is case-insensitive and will match any row containing the word. The text is searched in the columns of type `string`, even if the values are nested in a dictionary. 
+You can search for a word in the dataset by typing it in the search bar at the top of the table. The search is case-insensitive and will match any row containing the word. The text is searched in the columns of `string`, even if the values are nested in a dictionary or a list. ## Share a specific row @@ -17,7 +27,7 @@ You can share a specific row by clicking on it, and then copying the URL in the ## Access the parquet files -Every dataset is auto-converted to the Parquet format. Click on [_"Auto-converted to Parquet"_](https://huggingface.co/datasets/glue/tree/refs%2Fconvert%2Fparquet/cola) to access the Parquet files. Refer to the [Datasets Server docs](/docs/datasets-server/parquet_process) to learn how to query the dataset with libraries such as Polars, Pandas or DuckDB. +To power the dataset viewer, every dataset is auto-converted to the Parquet format. Click on [_"Auto-converted to Parquet"_](https://huggingface.co/datasets/glue/tree/refs%2Fconvert%2Fparquet/cola) to access the Parquet files. Refer to the [Datasets Server docs](/docs/datasets-server/parquet_process) to learn how to query the dataset with libraries such as Polars, Pandas or DuckDB. You can also access the list of Parquet files programmatically using the [Hub API](./api#endpoints-table): https://huggingface.co/api/datasets/glue/parquet. @@ -35,14 +45,9 @@ For the biggest datasets, the page shows a preview of the first 100 rows instead -## Disable the viewer - -The dataset viewer can be disabled. To do this, add a YAML section to the dataset's `README.md` file (create one if it does not already exist) and add a `viewer` property with the value `false`. +## Configure the Dataset Viewer -``` ---- -viewer: false ---- -``` +To have a nice and working Dataset Viewer for your dataset, make sure your dataset is in a supported format and structure. +There is also an option to configure the Dataset Viewer using YAML. -Note that the viewer is always disabled on the private datasets. +For more information see our guide on [How to configure the Dataset Viewer](./datasets-viewer-configure) diff --git a/docs/hub/datasets-webdataset.md b/docs/hub/datasets-webdataset.md new file mode 100644 index 000000000..987cf6d3e --- /dev/null +++ b/docs/hub/datasets-webdataset.md @@ -0,0 +1,20 @@ +# WebDataset + +[WebDataset](https://github.com/webdataset/webdataset) is a library to write I/O pipelines for large datasets. +Since it supports streaming data using HTTP, you can use the Hugging Face data files URLs to stream a dataset in WebDataset format: + +First login using + +``` +huggingface-cli login +``` + +And then you can stream Hugging Face datasets in WebDataset: + +```python +>>> import webdataset as wds +>>> from huggingface_hub import HfFolder + +>>> hf_token = HfFolder().get_token() +>>> dataset = wds.WebDataset(f"pipe:curl -s -L https://huggingface.co/datasets/username/my_wds_dataset/resolve/main/train-000000.tar -H 'Authorization:Bearer {hf_token}'") +``` diff --git a/docs/hub/index.md b/docs/hub/index.md index 86402db0d..c33d3cc3e 100644 --- a/docs/hub/index.md +++ b/docs/hub/index.md @@ -42,9 +42,10 @@ The Hugging Face Hub is a platform with over 120k models, 20k datasets, and 50k Datasets Overview Dataset Cards Gated Datasets +Adding New Datasets Dataset viewer +Libraries Using Datasets -Adding New Datasets
From 4d7bc4a4aab760e8552e461e7510c14a558c0bdc Mon Sep 17 00:00:00 2001 From: Quentin Lhoest Date: Tue, 31 Oct 2023 17:06:43 +0100 Subject: [PATCH 02/38] add configure your dataset --- docs/hub/datasets-viewer-configure.md | 24 ++++++++++++++++++------ 1 file changed, 18 insertions(+), 6 deletions(-) diff --git a/docs/hub/datasets-viewer-configure.md b/docs/hub/datasets-viewer-configure.md index 9bce3239d..9dc9b8e66 100644 --- a/docs/hub/datasets-viewer-configure.md +++ b/docs/hub/datasets-viewer-configure.md @@ -26,22 +26,34 @@ Parquet key features are: The dataset viewer also supports files compressed using ZIP (.zip), GZIP (.gz), ZSTD (.zst), BZ2 (.bz2), LZ4 (.lz4) and LZMA (.xz). -## Supported dataset structures +## Define the dataset splits -You can name the data files or their folder after their split names train/validation/test. +You can name the data files or their folder after their split names (train/validation/test). If there are no split names, all the data files are considered part of the train split. -For image and audio classification datasets, you can also use directories to name the image and audio classes. -And if your images/audio files have metadata (e.g. captions, bounding boxes, transcriptions, etc.), you can have metadata files next to them. +For more information, feel free to check out the documentation on on [Automatic splits detection](https://huggingface.co/docs/datasets/repository_structure#automatic-splits-detection) + +## Configure the dataset It is also possible to customize your splits manually. Indeed, you can use YAML to: - List the data files per split -- Define multiple datasets configurations (e.g. if you dataset has multiple subsets or languages) +- Use custom split names - Pass dataset building parameters (e.g. the separator used in your CSV files). +- Define multiple datasets configurations (e.g. if you dataset has multiple subsets or languages) + +Check out the guide on [How to structure your dataset repository](https://huggingface.co/docs/datasets/repository_structure) for more details. + +## Image and audio datasets + +For image and audio classification datasets, you can also use directories to name the image and audio classes. +And if your images/audio files have metadata (e.g. captions, bounding boxes, transcriptions, etc.), you can have metadata files next to them. + +Those two guides can be useful: -For more information, feel free to check out the guide on [How to structure your dataset repository](https://huggingface.co/docs/datasets/repository_structure) +- [How to create an image dataset](https://huggingface.co/docs/datasets/image_dataset) +- [How to create an audio dataset](https://huggingface.co/docs/datasets/audio_dataset) ## Disable the viewer From 0f7a20198504f530e553b8e03e845827e3b2f581 Mon Sep 17 00:00:00 2001 From: Quentin Lhoest Date: Tue, 31 Oct 2023 17:33:48 +0100 Subject: [PATCH 03/38] minor --- docs/hub/index.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/docs/hub/index.md b/docs/hub/index.md index c33d3cc3e..8e6d03232 100644 --- a/docs/hub/index.md +++ b/docs/hub/index.md @@ -42,10 +42,9 @@ The Hugging Face Hub is a platform with over 120k models, 20k datasets, and 50k Datasets Overview Dataset Cards Gated Datasets -Adding New Datasets +Adding a new Dataset Dataset viewer Libraries -Using Datasets
From de191caf0f546fa609c09da4b2901beeae06ee46 Mon Sep 17 00:00:00 2001 From: Quentin Lhoest Date: Tue, 31 Oct 2023 17:37:10 +0100 Subject: [PATCH 04/38] minor --- docs/hub/_toctree.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/hub/_toctree.yml b/docs/hub/_toctree.yml index 70991e6f7..32d7fd0c0 100644 --- a/docs/hub/_toctree.yml +++ b/docs/hub/_toctree.yml @@ -126,7 +126,7 @@ - local: datasets-gated title: Gated Datasets - local: datasets-adding - title: Adding New Datasets + title: Adding a new Dataset - local: datasets-viewer title: Dataset Viewer - local: datasets-libraries From 121687849f9818211064198f9bca3eb4120d0809 Mon Sep 17 00:00:00 2001 From: Quentin Lhoest Date: Tue, 31 Oct 2023 17:39:37 +0100 Subject: [PATCH 05/38] update toc --- docs/hub/_toctree.yml | 3 +++ 1 file changed, 3 insertions(+) diff --git a/docs/hub/_toctree.yml b/docs/hub/_toctree.yml index 32d7fd0c0..27284e163 100644 --- a/docs/hub/_toctree.yml +++ b/docs/hub/_toctree.yml @@ -129,6 +129,9 @@ title: Adding a new Dataset - local: datasets-viewer title: Dataset Viewer + sections: + - local: datasets-viewer-configure + title: Configure the Dataset Viewer - local: datasets-libraries title: Libraries sections: From 441c8cc3bb148cec39efe8e994a127ab9758c7e5 Mon Sep 17 00:00:00 2001 From: Quentin Lhoest Date: Tue, 31 Oct 2023 17:41:21 +0100 Subject: [PATCH 06/38] minor --- docs/hub/datasets-viewer.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/docs/hub/datasets-viewer.md b/docs/hub/datasets-viewer.md index 00ecc215b..62883f925 100644 --- a/docs/hub/datasets-viewer.md +++ b/docs/hub/datasets-viewer.md @@ -7,6 +7,13 @@ The dataset page includes a table with the contents of the dataset, arranged by
+## Configure the Dataset Viewer + +To have a nice and working Dataset Viewer for your dataset, make sure your dataset is in a supported format and structure. +There is also an option to configure the Dataset Viewer using YAML. + +For more information see our guide on [How to configure the Dataset Viewer](./datasets-viewer-configure). + ## Inspect data distributions At the top of each column you can see histograms representing the distributions of numerical values and text lengths. @@ -44,10 +51,3 @@ For the biggest datasets, the page shows a preview of the first 100 rows instead - -## Configure the Dataset Viewer - -To have a nice and working Dataset Viewer for your dataset, make sure your dataset is in a supported format and structure. -There is also an option to configure the Dataset Viewer using YAML. - -For more information see our guide on [How to configure the Dataset Viewer](./datasets-viewer-configure) From c9e24a06eb5f5fd8e5dfcb738945bae16f7b19a4 Mon Sep 17 00:00:00 2001 From: Quentin Lhoest Date: Tue, 31 Oct 2023 19:51:41 +0100 Subject: [PATCH 07/38] add dataset structure docs --- docs/hub/_toctree.yml | 8 ++ docs/hub/datasets-adding.md | 4 +- docs/hub/datasets-basic-structure.md | 144 ++++++++++++++++++++++ docs/hub/datasets-custom-structure.md | 142 +++++++++++++++++++++ docs/hub/datasets-repository-structure.md | 27 ++++ docs/hub/datasets-viewer-configure.md | 39 ++---- docs/hub/datasets-viewer.md | 2 +- docs/hub/index.md | 3 +- 8 files changed, 335 insertions(+), 34 deletions(-) create mode 100644 docs/hub/datasets-basic-structure.md create mode 100644 docs/hub/datasets-custom-structure.md create mode 100644 docs/hub/datasets-repository-structure.md diff --git a/docs/hub/_toctree.yml b/docs/hub/_toctree.yml index 27284e163..b2b32d32c 100644 --- a/docs/hub/_toctree.yml +++ b/docs/hub/_toctree.yml @@ -132,6 +132,14 @@ sections: - local: datasets-viewer-configure title: Configure the Dataset Viewer + - local: datasets-repository-structure + title: Dataset Repository Structure + sections: + - local: datasets-basic-structure + title: Basic Structure + sections: + - local: datasets-custom-structure + title: Custom Structure - local: datasets-libraries title: Libraries sections: diff --git a/docs/hub/datasets-adding.md b/docs/hub/datasets-adding.md index 492dffce2..64f24eb16 100644 --- a/docs/hub/datasets-adding.md +++ b/docs/hub/datasets-adding.md @@ -64,7 +64,9 @@ Adding a Dataset card is super valuable for helping users find your dataset and ### Dataset Viewer The [Dataset Viewer](./datasets-viewer) is crucial to know what the data actually look like. -You can [configure it](./datasets-viewer-configure) and specify which files to show and how they should be shown. +It is generally enabled by default for any dataset, depending on the dataset structure. + +Please refer to the documentation on [Dataset Structure](./datasets-structure). ## Using Git diff --git a/docs/hub/datasets-basic-structure.md b/docs/hub/datasets-basic-structure.md new file mode 100644 index 000000000..7e15f551a --- /dev/null +++ b/docs/hub/datasets-basic-structure.md @@ -0,0 +1,144 @@ +# Basic Repository Structure + +To host and share your dataset, create a dataset repository on the Hugging Face Hub and upload your data files. + +This guide will show you how to structure your dataset repository when you upload it and enable all the Dataset Hub features like the Dataset Viewer. +A dataset with a supported structure and file format (`.txt`, `.csv`, `.parquet`, `.jsonl`, `.mp3`, `.jpg`, `.zip` etc.) 
automatically has a dataset viewer on its dataset page on the Hub. + +Note that you can also define your own custom structure, see the documentation on [Custom Structure](./datasets-custom-structure) for more information + +## Main use-case + +The simplest dataset structure has two files: `train.csv` and `test.csv` (this works with any supported file format). + +Your repository will also contain a `README.md` file, the [dataset card](dataset_card) displayed on your dataset page. + +``` +my_dataset_repository/ +β”œβ”€β”€ README.md +β”œβ”€β”€ train.csv +└── test.csv +``` + +In this simple case, you'll get a dataset with two splits: `train` (containing examples from `train.csv`) and `test` (containing examples from `test.csv`). + +## Splits + +Certain patterns in the dataset repository can be used to assign certain files to train/validation/test splits. + +### Directory name + +You can place your data files into different directories named `train`, `test`, and `validation` where each directory contains the data files for that split: + +``` +my_dataset_repository/ +β”œβ”€β”€ README.md +└── data/ + β”œβ”€β”€ train/ + β”‚ └── bees.csv + β”œβ”€β”€ test/ + β”‚ └── more_bees.csv + └── validation/ + └── even_more_bees.csv +``` + +### File name + +If you don't have any non-traditional splits, then you can place the split name anywhere in the data file and it is automatically inferred. The only rule is that the split name must be delimited by non-word characters, like `test-file.csv` for example instead of `testfile.csv`. Supported delimiters include underscores, dashes, spaces, dots, and numbers. + +For example, the following file names are all acceptable: + +- train split: `train.csv`, `my_train_file.csv`, `train1.csv` +- validation split: `validation.csv`, `my_validation_file.csv`, `validation1.csv` +- test split: `test.csv`, `my_test_file.csv`, `test1.csv` + +Here is an example where all the files are placed into a directory named `data`: + +``` +my_dataset_repository/ +β”œβ”€β”€ README.md +└── data/ + β”œβ”€β”€ train.csv + β”œβ”€β”€ test.csv + └── validation.csv +``` + +### Keywords + +There are several ways to name splits. Validation splits are sometimes called "dev", and test splits may be referred to as "eval". +These other split names are also supported, and the following keywords are equivalent: + +- train, training +- validation, valid, val, dev +- test, testing, eval, evaluation + +Therefore the structure below is a valid repository: + +``` +my_dataset_repository/ +β”œβ”€β”€ README.md +└── data/ + β”œβ”€β”€ training.csv + β”œβ”€β”€ eval.csv + └── valid.csv +``` + +### Custom split name + +If your dataset splits have custom names that aren't `train`, `test`, or `validation`, then you can name your data files like `data/-xxxxx-of-xxxxx.csv`. + +Here is an example with three splits, `train`, `test`, and `random`: + +``` +my_dataset_repository/ +β”œβ”€β”€ README.md +└── data/ + β”œβ”€β”€ train-00000-of-00003.csv + β”œβ”€β”€ train-00001-of-00003.csv + β”œβ”€β”€ train-00002-of-00003.csv + β”œβ”€β”€ test-00000-of-00001.csv + β”œβ”€β”€ random-00000-of-00003.csv + β”œβ”€β”€ random-00001-of-00003.csv + └── random-00002-of-00003.csv +``` + +### Multiple files per split + +If one of your splits comprises several files, πŸ€— Datasets can still infer whether it is the train, validation, and test split from the file name. 
+For example, if your train and test splits span several files: + +``` +my_dataset_repository/ +β”œβ”€β”€ README.md +β”œβ”€β”€ train_0.csv +β”œβ”€β”€ train_1.csv +β”œβ”€β”€ train_2.csv +β”œβ”€β”€ train_3.csv +β”œβ”€β”€ test_0.csv +└── test_1.csv +``` + +Make sure all the files of your `train` set have *train* in their names (same for test and validation). +Even if you add a prefix or suffix to `train` in the file name (like `my_train_file_00001.csv` for example), +πŸ€— Datasets can still infer the appropriate split. + +For convenience, you can also place your data files into different directories. +In this case, the split name is inferred from the directory name. + +``` +my_dataset_repository/ +β”œβ”€β”€ README.md +└── data/ + β”œβ”€β”€ train/ + β”‚ β”œβ”€β”€ shard_0.csv + β”‚ β”œβ”€β”€ shard_1.csv + β”‚ β”œβ”€β”€ shard_2.csv + β”‚ └── shard_3.csv + └── test/ + β”œβ”€β”€ shard_0.csv + └── shard_1.csv +``` + +### Single split + +If you don't define splits using directory or file names, then it'll treat all the files as a single train split. If your dataset splits aren't loading as expected, it may be due to an incorrect pattern. diff --git a/docs/hub/datasets-custom-structure.md b/docs/hub/datasets-custom-structure.md new file mode 100644 index 000000000..7c0ee2b56 --- /dev/null +++ b/docs/hub/datasets-custom-structure.md @@ -0,0 +1,142 @@ +# Custom Structure + +To host and share your dataset, create a dataset repository on the Hugging Face Hub and upload your data files. + +This guide will show you how to configure a custom structure for your dataset repository. +A dataset with a supported structure and file format (`.txt`, `.csv`, `.parquet`, `.jsonl`, `.mp3`, `.jpg`, `.zip` etc.) automatically has a dataset viewer on its dataset page on the Hub. + +## Define your splits and subsets in YAML + +## Splits + +If you have multiple files and want to define which file goes into which split, you can use the YAML `configs` field at the top of your README.md. + +For example, given a repository like this one: + +``` +my_dataset_repository/ +β”œβ”€β”€ README.md +β”œβ”€β”€ data.csv +└── holdout.csv +``` + +You can define your splits by adding the `configs` field in the YAML block at the top of your README.md: + +```yaml +--- +configs: +- config_name: default + data_files: + - split: train + path: "data.csv" + - split: test + path: "holdout.csv" +--- +``` + + +You can select multiple files per split using a list of paths: + +``` +my_dataset_repository/ +β”œβ”€β”€ README.md +β”œβ”€β”€ data/ +β”‚ β”œβ”€β”€ abc.csv +β”‚ └── def.csv +└── holdout/ + └── ghi.csv +``` + +```yaml +--- +configs: +- config_name: default + data_files: + - split: train + path: + - "data/abc.csv" + - "data/def.csv" + - split: test + path: "holdout/ghi.csv" +--- +``` + +Or you can use glob patterns to automatically list all the files you need: + +```yaml +--- +configs: +- config_name: default + data_files: + - split: train + path: "data/*.csv" + - split: test + path: "holdout/*.csv" +--- +``` + + + +Note that `config_name` field is required even if you have a single configuration. + + + +## Configurations + +Your dataset might have several subsets of data that you want to be able to load separately. 
In that case you can define a list of configurations inside the `configs` field in YAML: + +``` +my_dataset_repository/ +β”œβ”€β”€ README.md +β”œβ”€β”€ main_data.csv +└── additional_data.csv +``` + +```yaml +--- +configs: +- config_name: main_data + data_files: "main_data.csv" +- config_name: additional_data + data_files: "additional_data.csv" +--- +``` + +Each configuration is shown separately on the Hugging Face Hub, and can be loaded by passing its name as a second parameter: + +```python +from datasets import load_dataset + +main_data = load_dataset("my_dataset_repository", "main_data") +additional_data = load_dataset("my_dataset_repository", "additional_data") +``` + +## Builder parameters + +Not only `data_files`, but other builder-specific parameters can be passed via YAML, allowing for more flexibility on how to load the data while not requiring any custom code. For example, define which separator to use in which configuration to load your `csv` files: + +```yaml +--- +configs: +- config_name: tab + data_files: "main_data.csv" + sep: "\t" +- config_name: comma + data_files: "additional_data.csv" + sep: "," +--- +``` + +Refer to [specific builders' documentation](./package_reference/builder_classes) to see what configuration parameters they have. + + + +You can set a default configuration using `default: true`, e.g. you can run `main_data = load_dataset("my_dataset_repository")` if you set + +```yaml +- config_name: main_data + data_files: "main_data.csv" + default: true +``` + + diff --git a/docs/hub/datasets-repository-structure.md b/docs/hub/datasets-repository-structure.md new file mode 100644 index 000000000..0d7579cbe --- /dev/null +++ b/docs/hub/datasets-repository-structure.md @@ -0,0 +1,27 @@ +# Dataset Repository Structure + +There are no constrains in how to structure dataset repositories. + +However certain features of the Hub expect certain structures. +For example if you want the Dataset Viewer to show certain data files or to separate your dataset in train/validation/test splits, you need to structure your dataset accordingly. +Often it is as simple as naming your data files according to their split names, e.g. `train.csv` and `test.csv`. + +## Define splits and subsets + +To structure your dataset by naming your data files or directories according to their split names, see the [Basic Repository Structure](./datasets-basic-structure) documentation. + +Alternatively you can define the a custom structure for your dataset using YAML. +It is useful if you want to specify which file goes in which split manually, and also to define multiple configurations (or subsets) for your dataset. +It is also possible to pass dataset building parameters (e.g. the separator to use for CSV files). + +See the documentation on datasets [Custom Structure](./datasets-custom-structure) for more information. + +## Image and Audio datasets + +For image and audio classification datasets, you can also use directories to name the image and audio classes. +And if your images/audio files have metadata (e.g. captions, bounding boxes, transcriptions, etc.), you can have metadata files next to them. 
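For example, an image dataset with captions could be laid out as in the sketch below, where πŸ€— Datasets expects the metadata file to have a `file_name` column pointing to each image:

```
my_dataset_repository/
└── train/
    β”œβ”€β”€ metadata.csv
    β”œβ”€β”€ 0001.png
    └── 0002.png
```

where `metadata.csv` might contain:

```
file_name,caption
0001.png,a photo of a cat
0002.png,a photo of a dog
```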
+ +We provide two guides that you can check out: + +- [How to create an image dataset](https://huggingface.co/docs/datasets/image_dataset) +- [How to create an audio dataset](https://huggingface.co/docs/datasets/audio_dataset) diff --git a/docs/hub/datasets-viewer-configure.md b/docs/hub/datasets-viewer-configure.md index 9dc9b8e66..a68e11c32 100644 --- a/docs/hub/datasets-viewer-configure.md +++ b/docs/hub/datasets-viewer-configure.md @@ -11,49 +11,26 @@ The dataset viewer supports multiple file formats: - CSV (.csv, .tsv) - JSON Lines, JSON (.jsonl, .json) +- Parquet (.parquet) - Text (.txt) - Images (.png, .jpg, etc.) - Audio (.wav, .mp3, etc.) -- Parquet (.parquet) - -Parquet is often a good option: it is a column-oriented format designed for data storage and retrieval. -Parquet files are smaller than CSV files, and they also support nested data structures which makes them ideal for storing complex data. -Parquet key features are: - -- Efficient compression: Parquet’s columnar storage format enables efficient data compression, reducing storage costs and download/upload times. -- Fast query performance: Parquet’s columnar storage format allows to only load or scan the data you need, which enables faster query performance. -- Compatible with many data tools: Parquet is compatible with a wide range of data analysis and manipulation tools like Pandas and DuckDB, and also with big data processing frameworks including Apache Spark and Dask. The dataset viewer also supports files compressed using ZIP (.zip), GZIP (.gz), ZSTD (.zst), BZ2 (.bz2), LZ4 (.lz4) and LZMA (.xz). -## Define the dataset splits - -You can name the data files or their folder after their split names (train/validation/test). -If there are no split names, all the data files are considered part of the train split. - -For more information, feel free to check out the documentation on on [Automatic splits detection](https://huggingface.co/docs/datasets/repository_structure#automatic-splits-detection) +## Configure dropdowns for split or subsets -## Configure the dataset +In the Dataset Viewer you can view the train/validation/test splits of datasets, and sometimes additionally choose between multiple subsets (e.g. one per language). -It is also possible to customize your splits manually. -Indeed, you can use YAML to: +To define those dropdowns, you can name the data files or their folder after their split names (train/validation/test). +It is also possible to customize your splits manually using YAML, which allows to: -- List the data files per split - Use custom split names +- List the data files per split - Pass dataset building parameters (e.g. the separator used in your CSV files). -- Define multiple datasets configurations (e.g. if you dataset has multiple subsets or languages) - -Check out the guide on [How to structure your dataset repository](https://huggingface.co/docs/datasets/repository_structure) for more details. - -## Image and audio datasets - -For image and audio classification datasets, you can also use directories to name the image and audio classes. -And if your images/audio files have metadata (e.g. captions, bounding boxes, transcriptions, etc.), you can have metadata files next to them. - -Those two guides can be useful: +- Define multiple dataset configurations (e.g. 
if you dataset has multiple subsets or languages) -- [How to create an image dataset](https://huggingface.co/docs/datasets/image_dataset) -- [How to create an audio dataset](https://huggingface.co/docs/datasets/audio_dataset) +For more information, feel free to check out the documentation on [Dataset Repository Structure](./datasets-repository-structure.md). ## Disable the viewer diff --git a/docs/hub/datasets-viewer.md b/docs/hub/datasets-viewer.md index 62883f925..fc393640d 100644 --- a/docs/hub/datasets-viewer.md +++ b/docs/hub/datasets-viewer.md @@ -10,7 +10,7 @@ The dataset page includes a table with the contents of the dataset, arranged by ## Configure the Dataset Viewer To have a nice and working Dataset Viewer for your dataset, make sure your dataset is in a supported format and structure. -There is also an option to configure the Dataset Viewer using YAML. +There is also an option to configure your dataset using YAML. For more information see our guide on [How to configure the Dataset Viewer](./datasets-viewer-configure). diff --git a/docs/hub/index.md b/docs/hub/index.md index 8e6d03232..c5ed6582d 100644 --- a/docs/hub/index.md +++ b/docs/hub/index.md @@ -43,7 +43,8 @@ The Hugging Face Hub is a platform with over 120k models, 20k datasets, and 50k Dataset Cards Gated Datasets Adding a new Dataset -Dataset viewer +Dataset Viewer +Dataset Repository Structure Libraries From c4e5ed01ebc13fcac80ca3773eeef96a2728d9a8 Mon Sep 17 00:00:00 2001 From: Quentin Lhoest Date: Thu, 2 Nov 2023 16:03:56 +0100 Subject: [PATCH 08/38] rename sections --- docs/hub/_toctree.yml | 12 +++++------ ...d => datasets-data-files-configuration.md} | 20 ++++++++++--------- ...e.md => datasets-file-names-and-splits.md} | 8 +++++--- ...re.md => datasets-manual-configuration.md} | 2 +- docs/hub/datasets-viewer-configure.md | 11 +++------- docs/hub/index.md | 2 +- 6 files changed, 27 insertions(+), 28 deletions(-) rename docs/hub/{datasets-repository-structure.md => datasets-data-files-configuration.md} (50%) rename docs/hub/{datasets-basic-structure.md => datasets-file-names-and-splits.md} (92%) rename docs/hub/{datasets-custom-structure.md => datasets-manual-configuration.md} (99%) diff --git a/docs/hub/_toctree.yml b/docs/hub/_toctree.yml index b2b32d32c..38510247c 100644 --- a/docs/hub/_toctree.yml +++ b/docs/hub/_toctree.yml @@ -132,14 +132,14 @@ sections: - local: datasets-viewer-configure title: Configure the Dataset Viewer - - local: datasets-repository-structure - title: Dataset Repository Structure + - local: datasets-data-files-configuration + title: Data files Configuration sections: - - local: datasets-basic-structure - title: Basic Structure + - local: datasets-file-names-and-splits + title: File names and splits sections: - - local: datasets-custom-structure - title: Custom Structure + - local: datasets-manual-configuration + title: Manual Configuration - local: datasets-libraries title: Libraries sections: diff --git a/docs/hub/datasets-repository-structure.md b/docs/hub/datasets-data-files-configuration.md similarity index 50% rename from docs/hub/datasets-repository-structure.md rename to docs/hub/datasets-data-files-configuration.md index 0d7579cbe..ea5616efc 100644 --- a/docs/hub/datasets-repository-structure.md +++ b/docs/hub/datasets-data-files-configuration.md @@ -1,20 +1,22 @@ -# Dataset Repository Structure +# Data files Configuration There are no constrains in how to structure dataset repositories. -However certain features of the Hub expect certain structures. 
-For example if you want the Dataset Viewer to show certain data files or to separate your dataset in train/validation/test splits, you need to structure your dataset accordingly. +But if you want the Dataset Viewer to show certain data files, or to separate your dataset in train/validation/test splits, you need to structure your dataset accordingly. Often it is as simple as naming your data files according to their split names, e.g. `train.csv` and `test.csv`. -## Define splits and subsets +## File names and splits -To structure your dataset by naming your data files or directories according to their split names, see the [Basic Repository Structure](./datasets-basic-structure) documentation. +To structure your dataset by naming your data files or directories according to their split names, see the [File names and splits](./datasets-file-names-and-splits) documentation. -Alternatively you can define the a custom structure for your dataset using YAML. -It is useful if you want to specify which file goes in which split manually, and also to define multiple configurations (or subsets) for your dataset. -It is also possible to pass dataset building parameters (e.g. the separator to use for CSV files). +## Manual configuration -See the documentation on datasets [Custom Structure](./datasets-custom-structure) for more information. +You can choose the data files to show in the Dataset Viewer for your dataset using YAML. +It is useful if you want to specify which file goes in which split manually for example + +You can also define multiple configurations (or subsets) for your dataset, and pass dataset building parameters (e.g. the separator to use for CSV files). + +See the documentation on [Manual configuration](./datasets-manual-configuration) for more information. ## Image and Audio datasets diff --git a/docs/hub/datasets-basic-structure.md b/docs/hub/datasets-file-names-and-splits.md similarity index 92% rename from docs/hub/datasets-basic-structure.md rename to docs/hub/datasets-file-names-and-splits.md index 7e15f551a..e903b6878 100644 --- a/docs/hub/datasets-basic-structure.md +++ b/docs/hub/datasets-file-names-and-splits.md @@ -1,11 +1,11 @@ -# Basic Repository Structure +# File names and splits To host and share your dataset, create a dataset repository on the Hugging Face Hub and upload your data files. -This guide will show you how to structure your dataset repository when you upload it and enable all the Dataset Hub features like the Dataset Viewer. +This guide will show you how to name your fiels and directories in your dataset repository when you upload it and enable all the Dataset Hub features like the Dataset Viewer. A dataset with a supported structure and file format (`.txt`, `.csv`, `.parquet`, `.jsonl`, `.mp3`, `.jpg`, `.zip` etc.) automatically has a dataset viewer on its dataset page on the Hub. -Note that you can also define your own custom structure, see the documentation on [Custom Structure](./datasets-custom-structure) for more information +Note that you can also define your own custom structure, see the documentation on [Manual Configuration](./datasets-manual-configuration) for more information ## Main use-case @@ -22,6 +22,8 @@ my_dataset_repository/ In this simple case, you'll get a dataset with two splits: `train` (containing examples from `train.csv`) and `test` (containing examples from `test.csv`). +If your dataset doesn't have any train/validation/test splits, feel free to use whatever file names you want. 
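Once the files are uploaded, such a repository can be loaded directly with πŸ€— Datasets. A sketch, where `username/my_dataset_repository` stands in for your own repository id:

```python
from datasets import load_dataset

# train.csv and test.csv are mapped to the "train" and "test" splits automatically
dataset = load_dataset("username/my_dataset_repository")
```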
+ ## Splits Certain patterns in the dataset repository can be used to assign certain files to train/validation/test splits. diff --git a/docs/hub/datasets-custom-structure.md b/docs/hub/datasets-manual-configuration.md similarity index 99% rename from docs/hub/datasets-custom-structure.md rename to docs/hub/datasets-manual-configuration.md index 7c0ee2b56..518d77403 100644 --- a/docs/hub/datasets-custom-structure.md +++ b/docs/hub/datasets-manual-configuration.md @@ -1,4 +1,4 @@ -# Custom Structure +# Manual Configuration To host and share your dataset, create a dataset repository on the Hugging Face Hub and upload your data files. diff --git a/docs/hub/datasets-viewer-configure.md b/docs/hub/datasets-viewer-configure.md index a68e11c32..489ed6107 100644 --- a/docs/hub/datasets-viewer-configure.md +++ b/docs/hub/datasets-viewer-configure.md @@ -18,19 +18,14 @@ The dataset viewer supports multiple file formats: The dataset viewer also supports files compressed using ZIP (.zip), GZIP (.gz), ZSTD (.zst), BZ2 (.bz2), LZ4 (.lz4) and LZMA (.xz). -## Configure dropdowns for split or subsets +## Configure dropdowns for splits or subsets In the Dataset Viewer you can view the train/validation/test splits of datasets, and sometimes additionally choose between multiple subsets (e.g. one per language). To define those dropdowns, you can name the data files or their folder after their split names (train/validation/test). -It is also possible to customize your splits manually using YAML, which allows to: +It is also possible to customize your splits manually using YAML. -- Use custom split names -- List the data files per split -- Pass dataset building parameters (e.g. the separator used in your CSV files). -- Define multiple dataset configurations (e.g. if you dataset has multiple subsets or languages) - -For more information, feel free to check out the documentation on [Dataset Repository Structure](./datasets-repository-structure.md). +For more information, feel free to check out the documentation on [Data files Configuration](./datasets-data-files-configuration.md). ## Disable the viewer diff --git a/docs/hub/index.md b/docs/hub/index.md index c5ed6582d..3f2dd834b 100644 --- a/docs/hub/index.md +++ b/docs/hub/index.md @@ -44,7 +44,7 @@ The Hugging Face Hub is a platform with over 120k models, 20k datasets, and 50k Gated Datasets Adding a new Dataset Dataset Viewer -Dataset Repository Structure +Data files Configuration Libraries From 3a508462e9482528a02dceaf1071aefed13c1128 Mon Sep 17 00:00:00 2001 From: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> Date: Fri, 3 Nov 2023 16:48:58 +0100 Subject: [PATCH 09/38] Apply suggestions from code review Co-authored-by: Sylvain Lesage Co-authored-by: Lucain Co-authored-by: Julien Chaumond --- docs/hub/datasets-adding.md | 6 +++--- docs/hub/datasets-data-files-configuration.md | 2 +- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/hub/datasets-adding.md b/docs/hub/datasets-adding.md index 64f24eb16..e8ed39207 100644 --- a/docs/hub/datasets-adding.md +++ b/docs/hub/datasets-adding.md @@ -12,7 +12,7 @@ The Hub's web-based interface allows users without any developer experience to u A repository hosts all your dataset files, including the revision history, making storing more than one dataset version possible. -1. Click on your profile and select **New Dataset** to create a new dataset repository. +1. Click on your profile and select **New Dataset** to create a [new dataset repository](https://huggingface.co/new-dataset). 2. 
Pick a name for your dataset, and choose whether it is a public or private dataset. A public dataset is visible to anyone, whereas a private dataset can only be viewed by you or members of your organization.
@@ -51,7 +51,7 @@ Adding a Dataset card is super valuable for helping users find your dataset and
-2. At the top, you'll see the **Metadata UI** with several fields to select from like license, language, and task categories. These are the most important tags for helping users discover your dataset on the Hub. When you select an option from each field, they'll be automatically added to the top of the dataset card. +2. At the top, you'll see the **Metadata UI** with several fields to select from such as license, language, and task categories. These are the most important tags for helping users discover your dataset on the Hub (when applicable). When you select an option for a field, it will be automatically added to the top of the dataset card. You can also look at the [Dataset Card specifications](https://github.com/huggingface/hub-docs/blob/main/datasetcard.md?plain=1), which has a complete set of (but not required) tag options like `annotations_creators`, to help you choose the appropriate tags. @@ -74,7 +74,7 @@ Since model repos are just Git repositories, you can use Git to push your model ## Using the `huggingface_hub` client library -The rich feature set in the `huggingface_hub` library allows you to manage repositories, including creating repos and uploading models to the Model Hub. Visit [the client library's documentation](https://huggingface.co/docs/huggingface_hub/index) to learn more. +The rich features set in the `huggingface_hub` library allows you to manage repositories, including creating repos and uploading datasets to the Model Hub. Visit [the client library's documentation](https://huggingface.co/docs/huggingface_hub/index) to learn more. ## Using other libraries diff --git a/docs/hub/datasets-data-files-configuration.md b/docs/hub/datasets-data-files-configuration.md index ea5616efc..c0807a40e 100644 --- a/docs/hub/datasets-data-files-configuration.md +++ b/docs/hub/datasets-data-files-configuration.md @@ -1,6 +1,6 @@ # Data files Configuration -There are no constrains in how to structure dataset repositories. +There are no constraints in how to structure dataset repositories. But if you want the Dataset Viewer to show certain data files, or to separate your dataset in train/validation/test splits, you need to structure your dataset accordingly. Often it is as simple as naming your data files according to their split names, e.g. `train.csv` and `test.csv`. From 0d1e1f481c4f9d7753b466dd4d79d726c4f7659f Mon Sep 17 00:00:00 2001 From: Quentin Lhoest Date: Fri, 3 Nov 2023 17:23:38 +0100 Subject: [PATCH 10/38] sylvain's comments: Adding a new dataset --- docs/hub/datasets-adding.md | 16 ++++++---------- 1 file changed, 6 insertions(+), 10 deletions(-) diff --git a/docs/hub/datasets-adding.md b/docs/hub/datasets-adding.md index e8ed39207..acd0b5e84 100644 --- a/docs/hub/datasets-adding.md +++ b/docs/hub/datasets-adding.md @@ -21,15 +21,13 @@ A repository hosts all your dataset files, including the revision history, makin ### Upload dataset -1. Once you've created a repository, navigate to the **Files and versions** tab to add a file. Select **Add file** to upload your dataset files. We support many text, audio, and image data extensions such as `.csv`, `.mp3`, and `.jpg` among many others. For text data extensions like `.csv`, `.json`, `.jsonl`, and `.txt`, we recommend compressing them before uploading to the Hub (to `.zip` or `.gz` file extension for example). - - Text file extensions are not tracked by Git LFS by default, and if they're greater than 10MB, they will not be committed and uploaded. 
Take a look at the `.gitattributes` file in your repository for a complete list of tracked file extensions. For this tutorial, you can use the following sample `.csv` files since they're small: train.csv, test.csv. +1. Once you've created a repository, navigate to the **Files and versions** tab to add a file. Select **Add file** to upload your dataset files. We support many text, audio, and image data extensions such as `.csv`, `.mp3`, and `.jpg` among many others.
-2. Drag and drop your dataset files and add a brief descriptive commit message. +2. Drag and drop your dataset files.
@@ -59,18 +57,16 @@ Adding a Dataset card is super valuable for helping users find your dataset and
-3. Click on the **Import dataset card template** link at the top of the editor to automatically create a dataset card template. Filling out the template is a great way to introduce your dataset to the community and help users understand how to use it. For a detailed example of what a good Dataset card should look like, take a look at the [CNN DailyMail Dataset card](https://huggingface.co/datasets/cnn_dailymail). +3. Write your dataset documentation in the Dataset Card to introduce your dataset to the community and help users understand how to use it. + + You can click on the **Import dataset card template** link at the top of the editor to automatically create a dataset card template. For a detailed example of what a good Dataset card should look like, take a look at the [CNN DailyMail Dataset card](https://huggingface.co/datasets/cnn_dailymail). ### Dataset Viewer The [Dataset Viewer](./datasets-viewer) is crucial to know what the data actually look like. It is generally enabled by default for any dataset, depending on the dataset structure. -Please refer to the documentation on [Dataset Structure](./datasets-structure). - -## Using Git - -Since model repos are just Git repositories, you can use Git to push your model files to the Hub. Follow the guide on [Getting Started with Repositories](repositories-getting-started) to learn about using the `git` CLI to commit and push your models. +Make sure the Dataset Viewer correctly shows your data, or [Configure the Dataset Viewer](./datasets-viewer-configure). ## Using the `huggingface_hub` client library From b1b6f5b29bde2c611ac7a02b751782edb6ef68ea Mon Sep 17 00:00:00 2001 From: Quentin Lhoest Date: Fri, 3 Nov 2023 17:33:25 +0100 Subject: [PATCH 11/38] sylvain-s comments: Configure the Dataset Viewer --- docs/hub/datasets-viewer-configure.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/docs/hub/datasets-viewer-configure.md b/docs/hub/datasets-viewer-configure.md index 489ed6107..b3c00724f 100644 --- a/docs/hub/datasets-viewer-configure.md +++ b/docs/hub/datasets-viewer-configure.md @@ -17,15 +17,16 @@ The dataset viewer supports multiple file formats: - Audio (.wav, .mp3, etc.) The dataset viewer also supports files compressed using ZIP (.zip), GZIP (.gz), ZSTD (.zst), BZ2 (.bz2), LZ4 (.lz4) and LZMA (.xz). +Image and audio resources can also have additional metadata files, see the [Data files Configuration](./datasets-data-files-configuration) on image and audio datasets. ## Configure dropdowns for splits or subsets -In the Dataset Viewer you can view the train/validation/test splits of datasets, and sometimes additionally choose between multiple subsets (e.g. one per language). +In the Dataset Viewer you can view the [train/validation/test](https://en.wikipedia.org/wiki/Training,_validation,_and_test_data_sets) splits of datasets, and sometimes additionally choose between multiple subsets (e.g. one per language). To define those dropdowns, you can name the data files or their folder after their split names (train/validation/test). It is also possible to customize your splits manually using YAML. -For more information, feel free to check out the documentation on [Data files Configuration](./datasets-data-files-configuration.md). +For more information, feel free to check out the documentation on [Data files Configuration](./datasets-data-files-configuration). 
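+
+As an illustration, such a manual split configuration in the YAML block at the top of the README.md could look like this (a minimal sketch assuming hypothetical files `train.csv` and `test.csv`; the `configs` syntax is the one detailed in the Data files Configuration docs):
+
+```yaml
+configs:
+- config_name: default
+  data_files:
+  - split: train
+    path: train.csv
+  - split: test
+    path: test.csv
+```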
## Disable the viewer

From 1e2923aefa1039e6017f5be046024e78180d92c9 Mon Sep 17 00:00:00 2001
From: Quentin Lhoest
Date: Fri, 3 Nov 2023 17:48:38 +0100
Subject: [PATCH 12/38] sylvain's comments: File names and splits

---
 docs/hub/datasets-file-names-and-splits.md | 93 ++++++++++------------
 1 file changed, 41 insertions(+), 52 deletions(-)

diff --git a/docs/hub/datasets-file-names-and-splits.md b/docs/hub/datasets-file-names-and-splits.md
index e903b6878..8261acccc 100644
--- a/docs/hub/datasets-file-names-and-splits.md
+++ b/docs/hub/datasets-file-names-and-splits.md
@@ -2,51 +2,41 @@
 
 To host and share your dataset, create a dataset repository on the Hugging Face Hub and upload your data files.
 
-This guide will show you how to name your fiels and directories in your dataset repository when you upload it and enable all the Dataset Hub features like the Dataset Viewer.
+This guide will show you how to name your files and directories in your dataset repository when you upload it and enable all the Dataset Hub features like the Dataset Viewer.
 
 A dataset with a supported structure and file format (`.txt`, `.csv`, `.parquet`, `.jsonl`, `.mp3`, `.jpg`, `.zip` etc.) automatically has a dataset viewer on its dataset page on the Hub.
 
 Note that you can also define your own custom structure, see the documentation on [Manual Configuration](./datasets-manual-configuration) for more information
 
-## Main use-case
+## Basic use-case
 
-The simplest dataset structure has two files: `train.csv` and `test.csv` (this works with any supported file format).
+If your dataset isn't split into [train/validation/test splits](https://en.wikipedia.org/wiki/Training,_validation,_and_test_data_sets), the simplest dataset structure is to have one file: `data.csv` (this works with any supported file format and any file name).
 
-Your repository will also contain a `README.md` file, the [dataset card](dataset_card) displayed on your dataset page.
+Your repository will also contain a `README.md` file, the [dataset card](./datasets-cards) displayed on your dataset page.
 
 ```
 my_dataset_repository/
 β”œβ”€β”€ README.md
-β”œβ”€β”€ train.csv
-└── test.csv
+└── data.csv
 ```
 
-In this simple case, you'll get a dataset with two splits: `train` (containing examples from `train.csv`) and `test` (containing examples from `test.csv`).
-
-If your dataset doesn't have any train/validation/test splits, feel free to use whatever file names you want.
-
 ## Splits
 
 Certain patterns in the dataset repository can be used to assign certain files to train/validation/test splits.
 
-### Directory name
+### File name
 
-You can place your data files into different directories named `train`, `test`, and `validation` where each directory contains the data files for that split:
+
+You can name your data files after the `train`, `test`, and `validation` splits:
 
 ```
 my_dataset_repository/
 β”œβ”€β”€ README.md
-└── data/
-    β”œβ”€β”€ train/
-    β”‚   └── bees.csv
-    β”œβ”€β”€ test/
-    β”‚   └── more_bees.csv
-    └── validation/
-        └── even_more_bees.csv
+β”œβ”€β”€ train.csv
+β”œβ”€β”€ test.csv
+└── validation.csv
 ```
 
-### File name
-
-If you don't have any non-traditional splits, then you can place the split name anywhere in the data file and it is automatically inferred. The only rule is that the split name must be delimited by non-word characters, like `test-file.csv` for example instead of `testfile.csv`. Supported delimiters include underscores, dashes, spaces, dots, and numbers.
+If you don't have any non-traditional splits, then you can place the split name anywhere in the data file. The only rule is that the split name must be delimited by non-word characters, like `test-file.csv` for example instead of `testfile.csv`. Supported delimiters include underscores, dashes, spaces, dots, and numbers. For example, the following file names are all acceptable: @@ -54,20 +44,25 @@ For example, the following file names are all acceptable: - validation split: `validation.csv`, `my_validation_file.csv`, `validation1.csv` - test split: `test.csv`, `my_test_file.csv`, `test1.csv` -Here is an example where all the files are placed into a directory named `data`: +### Directory name + +You can place your data files into different directories named `train`, `test`, and `validation` where each directory contains the data files for that split: ``` my_dataset_repository/ β”œβ”€β”€ README.md └── data/ - β”œβ”€β”€ train.csv - β”œβ”€β”€ test.csv - └── validation.csv + β”œβ”€β”€ train/ + β”‚ └── data.csv + β”œβ”€β”€ test/ + β”‚ └── more_data.csv + └── validation/ + └── even_more_data.csv ``` ### Keywords -There are several ways to name splits. Validation splits are sometimes called "dev", and test splits may be referred to as "eval". +There are several ways to refer to train/validation/test splits. Validation splits are sometimes called "dev", and test splits may be referred to as "eval". These other split names are also supported, and the following keywords are equivalent: - train, training @@ -85,29 +80,9 @@ my_dataset_repository/ └── valid.csv ``` -### Custom split name - -If your dataset splits have custom names that aren't `train`, `test`, or `validation`, then you can name your data files like `data/-xxxxx-of-xxxxx.csv`. - -Here is an example with three splits, `train`, `test`, and `random`: - -``` -my_dataset_repository/ -β”œβ”€β”€ README.md -└── data/ - β”œβ”€β”€ train-00000-of-00003.csv - β”œβ”€β”€ train-00001-of-00003.csv - β”œβ”€β”€ train-00002-of-00003.csv - β”œβ”€β”€ test-00000-of-00001.csv - β”œβ”€β”€ random-00000-of-00003.csv - β”œβ”€β”€ random-00001-of-00003.csv - └── random-00002-of-00003.csv -``` - ### Multiple files per split -If one of your splits comprises several files, πŸ€— Datasets can still infer whether it is the train, validation, and test split from the file name. -For example, if your train and test splits span several files: +Splits can span several files, for example: ``` my_dataset_repository/ @@ -121,8 +96,7 @@ my_dataset_repository/ ``` Make sure all the files of your `train` set have *train* in their names (same for test and validation). -Even if you add a prefix or suffix to `train` in the file name (like `my_train_file_00001.csv` for example), -πŸ€— Datasets can still infer the appropriate split. +You can even add a prefix or suffix to `train` in the file name (like `my_train_file_00001.csv` for example). For convenience, you can also place your data files into different directories. In this case, the split name is inferred from the directory name. @@ -141,6 +115,21 @@ my_dataset_repository/ └── shard_1.csv ``` -### Single split +### Custom split name + +If your dataset splits have custom names that aren't `train`, `test`, or `validation`, then you can name your data files like `data/-xxxxx-of-xxxxx.csv`. + +Here is an example with three splits, `train`, `test`, and `random`: -If you don't define splits using directory or file names, then it'll treat all the files as a single train split. 
If your dataset splits aren't loading as expected, it may be due to an incorrect pattern.
+```
+my_dataset_repository/
+β”œβ”€β”€ README.md
+└── data/
+    β”œβ”€β”€ train-00000-of-00003.csv
+    β”œβ”€β”€ train-00001-of-00003.csv
+    β”œβ”€β”€ train-00002-of-00003.csv
+    β”œβ”€β”€ test-00000-of-00001.csv
+    β”œβ”€β”€ random-00000-of-00003.csv
+    β”œβ”€β”€ random-00001-of-00003.csv
+    └── random-00002-of-00003.csv
+```

From 07c493ea1d337e9aec8a066da4394cb322c01048 Mon Sep 17 00:00:00 2001
From: Quentin Lhoest
Date: Fri, 3 Nov 2023 18:56:22 +0100
Subject: [PATCH 13/38] sylvain's comments: Manual Configuration

---
 docs/hub/datasets-manual-configuration.md | 32 +++++++++--------------
 1 file changed, 13 insertions(+), 19 deletions(-)

diff --git a/docs/hub/datasets-manual-configuration.md b/docs/hub/datasets-manual-configuration.md
index 518d77403..b5dc99b08 100644
--- a/docs/hub/datasets-manual-configuration.md
+++ b/docs/hub/datasets-manual-configuration.md
@@ -1,15 +1,16 @@
 # Manual Configuration
 
-To host and share your dataset, create a dataset repository on the Hugging Face Hub and upload your data files.
-
 This guide will show you how to configure a custom structure for your dataset repository.
-A dataset with a supported structure and file format (`.txt`, `.csv`, `.parquet`, `.jsonl`, `.mp3`, `.jpg`, `.zip` etc.) automatically has a dataset viewer on its dataset page on the Hub.
+
+A dataset with a supported structure and file format (`.txt`, `.csv`, `.parquet`, `.jsonl`, `.mp3`, `.jpg`, `.zip` etc.) automatically has a Dataset Viewer on its dataset page on the Hub. You can use YAML to configure the splits and builder parameters that are used by the Viewer.
+
+It is even possible to define multiple configurations for the same dataset (e.g. if the dataset has various independant files).
 
 ## Define your splits and subsets in YAML
 
 ## Splits
 
-If you have multiple files and want to define which file goes into which split, you can use the YAML `configs` field at the top of your README.md.
+If you have multiple files and want to define which file goes into which split, you can use YAML at the top of your README.md.
 
 For example, given a repository like this one:
 
@@ -20,7 +21,7 @@ my_dataset_repository/
 └── holdout.csv
 ```
 
-You can define your splits by adding the `configs` field in the YAML block at the top of your README.md:
+You can define a configuration for your splits by adding the `configs` field in the YAML block at the top of your README.md:
 
 ```yaml
 ---
@@ -34,7 +35,6 @@ configs:
 ---
 ```
 
-
 You can select multiple files per split using a list of paths:
 
 ```
@@ -81,9 +81,12 @@ Note that `config_name` field is required even if you have a single configuratio
 
-## Configurations
+## Multiple Configurations
+
+Your dataset might have several subsets of data that you want to be able to use separately.
+For example, each configuration has its own dropdown in the Dataset Viewer on the Hugging Face Hub.
 
-Your dataset might have several subsets of data that you want to be able to load separately.
In that case you can define a list of configurations inside the `configs` field in YAML: +In that case you can define a list of configurations inside the `configs` field in YAML: ``` my_dataset_repository/ @@ -102,15 +105,6 @@ configs: --- ``` -Each configuration is shown separately on the Hugging Face Hub, and can be loaded by passing its name as a second parameter: - -```python -from datasets import load_dataset - -main_data = load_dataset("my_dataset_repository", "main_data") -additional_data = load_dataset("my_dataset_repository", "additional_data") -``` - ## Builder parameters Not only `data_files`, but other builder-specific parameters can be passed via YAML, allowing for more flexibility on how to load the data while not requiring any custom code. For example, define which separator to use in which configuration to load your `csv` files: @@ -127,11 +121,11 @@ configs: --- ``` -Refer to [specific builders' documentation](./package_reference/builder_classes) to see what configuration parameters they have. +Refer to the [specific builders' documentation](../datasets/package_reference/builder_classes) to see what configuration parameters they have. -You can set a default configuration using `default: true`, e.g. you can run `main_data = load_dataset("my_dataset_repository")` if you set +You can set a default configuration using `default: true` ```yaml - config_name: main_data From 10fc6147e3e1311844cdf94e6390281189a61da3 Mon Sep 17 00:00:00 2001 From: Quentin Lhoest Date: Fri, 3 Nov 2023 19:02:49 +0100 Subject: [PATCH 14/38] sylvain's comments: Libraries --- docs/hub/datasets-libraries.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/docs/hub/datasets-libraries.md b/docs/hub/datasets-libraries.md index 6f95fadd4..bdf6470cf 100644 --- a/docs/hub/datasets-libraries.md +++ b/docs/hub/datasets-libraries.md @@ -1,8 +1,7 @@ # Libraries The Dataset Hub has support for several libraries in the Open Source ecosystem. -Thanks to the `huggingface_hub` Python library, it's easy to enable sharing your datasets on the Hub. -The Hub supports many libraries, and we're working on expanding this support! +Thanks to the [huggingface_hub Python library](../huggingface_hub), it's easy to enable sharing your datasets on the Hub. We're happy to welcome to the Hub a set of Open Source libraries that are pushing Machine Learning forward. The table below summarizes the supported libraries and their level of integration. From 624c0c40c200f37f06f9f503c21702786a34eaea Mon Sep 17 00:00:00 2001 From: Quentin Lhoest Date: Fri, 3 Nov 2023 19:06:26 +0100 Subject: [PATCH 15/38] sylvain's comment: login --- docs/hub/datasets-dask.md | 2 +- docs/hub/datasets-duckdb.md | 2 +- docs/hub/datasets-pandas.md | 2 +- docs/hub/datasets-webdataset.md | 2 +- 4 files changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/hub/datasets-dask.md b/docs/hub/datasets-dask.md index 2ca11ad89..dbab6c534 100644 --- a/docs/hub/datasets-dask.md +++ b/docs/hub/datasets-dask.md @@ -3,7 +3,7 @@ [Dask](https://github.com/dask/dask) is a parallel and distributed computing library that scales the existing Python and PyData ecosystem. 
Since it uses [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths (`hf://`) to read and write data on the Hub: -First login using +First you need to [Login with your Hugging Face account](../huggingface_hub/quick-start#login), for example using: ``` huggingface-cli login diff --git a/docs/hub/datasets-duckdb.md b/docs/hub/datasets-duckdb.md index 0ea465fe0..eff9f2969 100644 --- a/docs/hub/datasets-duckdb.md +++ b/docs/hub/datasets-duckdb.md @@ -3,7 +3,7 @@ [DuckDB](https://github.com/duckdb/duckdb) is an in-process SQL OLAP database management system. Since it supports [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths (`hf://`) to read and write data on the Hub: -First login using +First you need to [Login with your Hugging Face account](../huggingface_hub/quick-start#login), for example using: ``` huggingface-cli login diff --git a/docs/hub/datasets-pandas.md b/docs/hub/datasets-pandas.md index abe02f1ea..963bf37e5 100644 --- a/docs/hub/datasets-pandas.md +++ b/docs/hub/datasets-pandas.md @@ -3,7 +3,7 @@ [Pandas](https://github.com/pandas-dev/pandas) is widely used Python data analysis toolkit. Since it uses [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths (`hf://`) to read and write data on the Hub: -First login using +First you need to [Login with your Hugging Face account](../huggingface_hub/quick-start#login), for example using: ``` huggingface-cli login diff --git a/docs/hub/datasets-webdataset.md b/docs/hub/datasets-webdataset.md index 987cf6d3e..15842fc98 100644 --- a/docs/hub/datasets-webdataset.md +++ b/docs/hub/datasets-webdataset.md @@ -3,7 +3,7 @@ [WebDataset](https://github.com/webdataset/webdataset) is a library to write I/O pipelines for large datasets. Since it supports streaming data using HTTP, you can use the Hugging Face data files URLs to stream a dataset in WebDataset format: -First login using +First you need to [Login with your Hugging Face account](../huggingface_hub/quick-start#login), for example using: ``` huggingface-cli login From 3d72fb807cf81f470fe7e9a54180e804ee35de93 Mon Sep 17 00:00:00 2001 From: Quentin Lhoest Date: Fri, 3 Nov 2023 19:16:01 +0100 Subject: [PATCH 16/38] =?UTF-8?q?sylvain's=20comments:=20Using=20?= =?UTF-8?q?=F0=9F=A4=97=20Datasets?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- docs/hub/datasets-usage.md | 27 +++++++++++++++++++++++++++ 1 file changed, 27 insertions(+) diff --git a/docs/hub/datasets-usage.md b/docs/hub/datasets-usage.md index b156c2efd..51478ecc8 100644 --- a/docs/hub/datasets-usage.md +++ b/docs/hub/datasets-usage.md @@ -2,4 +2,31 @@ Once you've found an interesting dataset on the Hugging Face Hub, you can load the dataset using πŸ€— Datasets. You can click on the **Use in dataset library** button to copy the code to load a dataset. 
+First you need to [Login with your Hugging Face account](../huggingface_hub/quick-start#login), for example using:
+
+```
+huggingface-cli login
+```
+
+And then you can load a dataset from the Hugging Face Hub using:
+
+```python
+from datasets import load_dataset
+
+dataset = load_dataset("username/my_dataset")
+
+# or load the separate splits if the dataset has train/validation/test splits
+train_dataset = load_dataset("username/my_dataset", split="train")
+valid_dataset = load_dataset("username/my_dataset", split="validation")
+test_dataset = load_dataset("username/my_dataset", split="test")
+```
+
+You can also upload datasets on Hugging Face:
+
+```python
+my_new_dataset.push_to_hub("username/my_new_dataset")
+```
+
+This creates a dataset repository `username/my_new_dataset` containing your Dataset in Parquet format, which you can reload later.
+
 For more information about using πŸ€— Datasets, check out the [tutorials](https://huggingface.co/docs/datasets/tutorial) and [how-to guides](https://huggingface.co/docs/datasets/how_to) available in the πŸ€— Datasets documentation.

From bb0459108ab541b495691616bb2258be41dab269 Mon Sep 17 00:00:00 2001
From: Quentin Lhoest
Date: Fri, 3 Nov 2023 19:27:02 +0100
Subject: [PATCH 17/38] lucain's comment: Adding a new dataset

---
 docs/hub/datasets-adding.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/hub/datasets-adding.md b/docs/hub/datasets-adding.md
index acd0b5e84..94c7d7bbe 100644
--- a/docs/hub/datasets-adding.md
+++ b/docs/hub/datasets-adding.md
@@ -43,7 +43,7 @@ Adding a Dataset card is super valuable for helping users find your dataset and
 
-1. Click on **Create Dataset Card** to create a Dataset card. This button creates a `README.md` file in your repository.
+1. Click on **Create Dataset Card** to create a [Dataset card](./datasets-cards). This button creates a `README.md` file in your repository.
From 1b463dce1cc5a6a8a058af6d72195c12d8be5fb4 Mon Sep 17 00:00:00 2001 From: Quentin Lhoest Date: Fri, 3 Nov 2023 20:03:10 +0100 Subject: [PATCH 18/38] add duckdb write --- docs/hub/datasets-duckdb.md | 22 +++++++++++++++++++++- 1 file changed, 21 insertions(+), 1 deletion(-) diff --git a/docs/hub/datasets-duckdb.md b/docs/hub/datasets-duckdb.md index eff9f2969..ea85240e0 100644 --- a/docs/hub/datasets-duckdb.md +++ b/docs/hub/datasets-duckdb.md @@ -9,7 +9,27 @@ First you need to [Login with your Hugging Face account](../huggingface_hub/quic huggingface-cli login ``` -And then you can use Hugging Face paths in DuckDB: +Then you can [Create a dataset repository](../huggingface_hub/quick-start#create-a-repository), for example using: + +```python +from huggingface_hub import HfApi + +HfApi().create_repo(repo_id="username/my_dataset", repo_type="dataset") +``` + +Finally you can use Hugging Face paths in DuckDB: + +```python +>>> from huggingface_hub import HfFileSystem +>>> import duckdb + +>>> fs = HfFileSystem() +>>> duckdb.register_filesystem(fs) +>>> duckdb.sql("COPY tbl TO 'hf://datasets/username/my_dataset/data.parquet' (FORMAT PARQUET);") +``` + +This creates a file `data.parquet` in the dataset repository `username/my_dataset` containing your dataset in Parquet format. +You can reload it later: ```python >>> from huggingface_hub import HfFileSystem From 164589baa4c3bf6e17c12ca430bf15692afa5e7d Mon Sep 17 00:00:00 2001 From: Quentin Lhoest Date: Fri, 3 Nov 2023 20:03:18 +0100 Subject: [PATCH 19/38] add create repo step for dask and pandas --- docs/hub/datasets-dask.md | 10 +++++++++- docs/hub/datasets-pandas.md | 10 +++++++++- 2 files changed, 18 insertions(+), 2 deletions(-) diff --git a/docs/hub/datasets-dask.md b/docs/hub/datasets-dask.md index dbab6c534..888eed865 100644 --- a/docs/hub/datasets-dask.md +++ b/docs/hub/datasets-dask.md @@ -9,7 +9,15 @@ First you need to [Login with your Hugging Face account](../huggingface_hub/quic huggingface-cli login ``` -And then you can use Hugging Face paths in Dask: +Then you can [Create a dataset repository](../huggingface_hub/quick-start#create-a-repository), for example using: + +```python +from huggingface_hub import HfApi + +HfApi().create_repo(repo_id="username/my_dataset", repo_type="dataset") +``` + +Finally you can use Hugging Face paths in Dask: ```python import dask.dataframe as dd diff --git a/docs/hub/datasets-pandas.md b/docs/hub/datasets-pandas.md index 963bf37e5..85d4879bb 100644 --- a/docs/hub/datasets-pandas.md +++ b/docs/hub/datasets-pandas.md @@ -9,7 +9,15 @@ First you need to [Login with your Hugging Face account](../huggingface_hub/quic huggingface-cli login ``` -And then you can use Hugging Face paths in Pandas: +Then you can [Create a dataset repository](../huggingface_hub/quick-start#create-a-repository), for example using: + +```python +from huggingface_hub import HfApi + +HfApi().create_repo(repo_id="username/my_dataset", repo_type="dataset") +``` + +Finally you can use Hugging Face paths in Pandas: ```python import pandas as pd From f13de1998eae985b90ba5a0d055c4f27a65122c9 Mon Sep 17 00:00:00 2001 From: Quentin Lhoest Date: Sun, 5 Nov 2023 14:37:29 +0100 Subject: [PATCH 20/38] minor --- docs/hub/datasets-adding.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/hub/datasets-adding.md b/docs/hub/datasets-adding.md index 94c7d7bbe..ca90af7af 100644 --- a/docs/hub/datasets-adding.md +++ b/docs/hub/datasets-adding.md @@ -75,7 +75,7 @@ The rich features set in the `huggingface_hub` library allows 
you to manage repo ## Using other libraries Some libraries [πŸ€— Datasets](https://huggingface.co/docs/datasets/index), [Pandas](https://pandas.pydata.org/), [Dask](https://www.dask.org/) or [DuckDB](https://duckdb.org/) can upload files to the Hub. -See the list of [Libraries supported by the Datasets Hub](./datasets-libraries.md) for more information. +See the list of [Libraries supported by the Datasets Hub](./datasets-libraries) for more information. ## Using Git From 0002a386544899fc81ebf7cd1e6638d7f66b382c Mon Sep 17 00:00:00 2001 From: Quentin Lhoest Date: Sun, 5 Nov 2023 14:37:46 +0100 Subject: [PATCH 21/38] minor --- docs/hub/datasets-adding.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/hub/datasets-adding.md b/docs/hub/datasets-adding.md index ca90af7af..3461ad48b 100644 --- a/docs/hub/datasets-adding.md +++ b/docs/hub/datasets-adding.md @@ -74,7 +74,7 @@ The rich features set in the `huggingface_hub` library allows you to manage repo ## Using other libraries -Some libraries [πŸ€— Datasets](https://huggingface.co/docs/datasets/index), [Pandas](https://pandas.pydata.org/), [Dask](https://www.dask.org/) or [DuckDB](https://duckdb.org/) can upload files to the Hub. +Some libraries like [πŸ€— Datasets](https://huggingface.co/docs/datasets/index), [Pandas](https://pandas.pydata.org/), [Dask](https://www.dask.org/) or [DuckDB](https://duckdb.org/) can upload files to the Hub. See the list of [Libraries supported by the Datasets Hub](./datasets-libraries) for more information. ## Using Git From 167ea21aef7e44cf04451f1912af09a45923fc4f Mon Sep 17 00:00:00 2001 From: Quentin Lhoest Date: Mon, 6 Nov 2023 12:03:42 +0100 Subject: [PATCH 22/38] typo --- docs/hub/datasets-manual-configuration.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/hub/datasets-manual-configuration.md b/docs/hub/datasets-manual-configuration.md index b5dc99b08..21688ceff 100644 --- a/docs/hub/datasets-manual-configuration.md +++ b/docs/hub/datasets-manual-configuration.md @@ -4,7 +4,7 @@ This guide will show you how to configure a custom structure for your dataset re A dataset with a supported structure and file format (`.txt`, `.csv`, `.parquet`, `.jsonl`, `.mp3`, `.jpg`, `.zip` etc.) automatically has a Dataset Viewer on its dataset page on the Hub. You can use YAML to configure the splits and builder parameters that are used by the Viewer. -It is even possible to define multiple configurations for the same dataset (e.g. if the dataset has various independant files). +It is even possible to define multiple configurations for the same dataset (e.g. if the dataset has various independent files). 
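+
+For example, two such subsets could be declared in the `configs` YAML field like this (a sketch with hypothetical file names, borrowing the syntax shown in the Multiple Configurations section below):
+
+```yaml
+configs:
+- config_name: main_data
+  data_files: "main_data.csv"
+- config_name: additional_data
+  data_files: "additional_data.csv"
+```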
## Define your splits and subsets in YAML

From a37982427a6247c5dd9ba48482e826e4908ee61e Mon Sep 17 00:00:00 2001
From: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Date: Mon, 6 Nov 2023 12:36:59 +0100
Subject: [PATCH 23/38] Update docs/hub/_toctree.yml

Co-authored-by: Mishig
---
 docs/hub/_toctree.yml | 1 -
 1 file changed, 1 deletion(-)

diff --git a/docs/hub/_toctree.yml b/docs/hub/_toctree.yml
index 38510247c..ca6008ca0 100644
--- a/docs/hub/_toctree.yml
+++ b/docs/hub/_toctree.yml
@@ -137,7 +137,6 @@
     sections:
     - local: datasets-file-names-and-splits
       title: File names and splits
-    sections:
     - local: datasets-manual-configuration
       title: Manual Configuration
     - local: datasets-libraries

From 7bf18c5febdea664e59fe4dd85969cd51c799fad Mon Sep 17 00:00:00 2001
From: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Date: Mon, 6 Nov 2023 22:00:55 +0100
Subject: [PATCH 24/38] Apply suggestions from code review

Co-authored-by: Daniel van Strien
---
 docs/hub/datasets-adding.md                   | 4 ++--
 docs/hub/datasets-dask.md                     | 2 +-
 docs/hub/datasets-data-files-configuration.md | 2 +-
 docs/hub/datasets-duckdb.md                   | 2 +-
 4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/docs/hub/datasets-adding.md b/docs/hub/datasets-adding.md
index 3461ad48b..f1c242578 100644
--- a/docs/hub/datasets-adding.md
+++ b/docs/hub/datasets-adding.md
@@ -1,10 +1,10 @@
 # Adding a new dataset
 
-The [Hub](https://huggingface.co/datasets) is home to an extensive collection of community-curated and popular research datasets. We encourage you to share your dataset to the Hub to help grow the ML community and accelerate progress for everyone. All contributions are welcome; adding a dataset is just a drag and drop away!
+The [Hub](https://huggingface.co/datasets) is home to an extensive collection of community-curated and research datasets. We encourage you to share your dataset to the Hub to help grow the ML community and accelerate progress for everyone. All contributions are welcome; adding a dataset is just a drag and drop away!
 
 Start by [creating a Hugging Face Hub account](https://huggingface.co/join) if you don't have one yet.
 
-## Upload with the Hub UI
+## Upload using the Hub UI
 
 The Hub's web-based interface allows users without any developer experience to upload a dataset.
 
diff --git a/docs/hub/datasets-dask.md b/docs/hub/datasets-dask.md
index 888eed865..beeb5f9b7 100644
--- a/docs/hub/datasets-dask.md
+++ b/docs/hub/datasets-dask.md
@@ -44,4 +44,4 @@ df_valid = dd.read_parquet("hf://datasets/username/my_dataset/validation")
 df_test = dd.read_parquet("hf://datasets/username/my_dataset/test")
 ```
 
-To have more information on the Hugging Face paths and how they are implemented, please refer to the [the client library's documentation on the HfFileSystem](https://huggingface.co/docs/huggingface_hub/guides/hf_file_system).
+For more information on the Hugging Face paths and how they are implemented, please refer to [the client library's documentation on the HfFileSystem](https://huggingface.co/docs/huggingface_hub/guides/hf_file_system).
 
diff --git a/docs/hub/datasets-data-files-configuration.md b/docs/hub/datasets-data-files-configuration.md
index c0807a40e..c3decfa3b 100644
--- a/docs/hub/datasets-data-files-configuration.md
+++ b/docs/hub/datasets-data-files-configuration.md
@@ -2,7 +2,7 @@
 # Data files Configuration
 
 There are no constraints in how to structure dataset repositories.
-But if you want the Dataset Viewer to show certain data files, or to separate your dataset in train/validation/test splits, you need to structure your dataset accordingly. +However, if you want the Dataset Viewer to show certain data files, or to separate your dataset in train/validation/test splits, you need to structure your dataset accordingly. Often it is as simple as naming your data files according to their split names, e.g. `train.csv` and `test.csv`. ## File names and splits diff --git a/docs/hub/datasets-duckdb.md b/docs/hub/datasets-duckdb.md index ea85240e0..733b7575d 100644 --- a/docs/hub/datasets-duckdb.md +++ b/docs/hub/datasets-duckdb.md @@ -1,6 +1,6 @@ # DuckDB -[DuckDB](https://github.com/duckdb/duckdb) is an in-process SQL OLAP database management system. +[DuckDB](https://github.com/duckdb/duckdb) is an in-process SQL [OLAP](https://en.wikipedia.org/wiki/Online_analytical_processing) database management system. Since it supports [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths (`hf://`) to read and write data on the Hub: First you need to [Login with your Hugging Face account](../huggingface_hub/quick-start#login), for example using: From 3e4836b0279d6d4f9abccab6286a449148b36be0 Mon Sep 17 00:00:00 2001 From: Quentin Lhoest Date: Mon, 6 Nov 2023 22:37:33 +0100 Subject: [PATCH 25/38] rename titles for consistency --- docs/hub/_toctree.yml | 32 ++++++++++++++++++-------------- docs/hub/datasets-adding.md | 4 ++-- docs/hub/index.md | 5 +++-- 3 files changed, 23 insertions(+), 18 deletions(-) diff --git a/docs/hub/_toctree.yml b/docs/hub/_toctree.yml index 38510247c..744ece8ac 100644 --- a/docs/hub/_toctree.yml +++ b/docs/hub/_toctree.yml @@ -126,7 +126,24 @@ - local: datasets-gated title: Gated Datasets - local: datasets-adding - title: Adding a new Dataset + title: Uploading Datasets + - local: datasets-adding + title: Uploading Datasets + - local: datasets-downloading + title: Downloading Datasets + - local: datasets-libraries + title: Integrated Libraries + sections: + - local: datasets-dask + title: Dask + - local: datasets-usage + title: Datasets + - local: datasets-duckdb + title: DuckDB + - local: datasets-pandas + title: Pandas + - local: datasets-webdataset + title: WebDataset - local: datasets-viewer title: Dataset Viewer sections: @@ -140,19 +157,6 @@ sections: - local: datasets-manual-configuration title: Manual Configuration - - local: datasets-libraries - title: Libraries - sections: - - local: datasets-dask - title: Dask - - local: datasets-usage - title: Datasets - - local: datasets-duckdb - title: DuckDB - - local: datasets-pandas - title: Pandas - - local: datasets-webdataset - title: WebDataset - local: spaces title: Spaces isExpanded: true diff --git a/docs/hub/datasets-adding.md b/docs/hub/datasets-adding.md index 3461ad48b..0eb4324de 100644 --- a/docs/hub/datasets-adding.md +++ b/docs/hub/datasets-adding.md @@ -1,4 +1,4 @@ -# Adding a new dataset +# Uploading datasets The [Hub](https://huggingface.co/datasets) is home to an extensive collection of community-curated and popular research datasets. We encourage you to share your dataset to the Hub to help grow the ML community and accelerate progress for everyone. All contributions are welcome; adding a dataset is just a drag and drop away! @@ -21,7 +21,7 @@ A repository hosts all your dataset files, including the revision history, makin ### Upload dataset -1. 
Once you've created a repository, navigate to the **Files and versions** tab to add a file. Select **Add file** to upload your dataset files. We support many text, audio, and image data extensions such as `.csv`, `.mp3`, and `.jpg` among many others.
+1. Once you've created a repository, navigate to the **Files and versions** tab to add a file. Select **Add file** to upload your dataset files. We support many text, audio, and image data extensions such as `.csv`, `.mp3`, and `.jpg` among many others (see full list [here](./datasets-viewer-configure)).
diff --git a/docs/hub/index.md b/docs/hub/index.md index 3f2dd834b..3ce3e15bf 100644 --- a/docs/hub/index.md +++ b/docs/hub/index.md @@ -42,10 +42,11 @@ The Hugging Face Hub is a platform with over 120k models, 20k datasets, and 50k Datasets Overview Dataset Cards Gated Datasets -Adding a new Dataset +Uploading Datasets +Downloading Datasets +Libraries Dataset Viewer Data files Configuration -Libraries
From 1b4b81ae90d0b449e87be05d1d4b2a8752a39015 Mon Sep 17 00:00:00 2001 From: Quentin Lhoest Date: Mon, 6 Nov 2023 22:37:44 +0100 Subject: [PATCH 26/38] Add Downloading Datasets --- docs/hub/datasets-downloading.md | 44 ++++++++++++++++++++++++++++++++ 1 file changed, 44 insertions(+) create mode 100644 docs/hub/datasets-downloading.md diff --git a/docs/hub/datasets-downloading.md b/docs/hub/datasets-downloading.md new file mode 100644 index 000000000..5edccddda --- /dev/null +++ b/docs/hub/datasets-downloading.md @@ -0,0 +1,44 @@ +# Downloading datasets + +## Integrated libraries + +If a dataset on the Hub is tied to a [supported library](./datasets-libraries), loading the dataset can be done in just a few lines. For information on accessing the dataset, you can click on the "Use in _Library_" button on the dataset page to see how to do so. For example, `samsum` shows how to do so with πŸ€— Datasets below. + +
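+For instance, the copied code for `samsum` would be similar to this (a sketch; the exact snippet shown in the modal may differ or require extra dependencies):
+
+```python
+from datasets import load_dataset
+
+# downloads the dataset from the Hub and loads it
+dataset = load_dataset("samsum")
+```
+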
+ + +
+ +
+ + +

+
+## Using the Hugging Face Client Library
+
+You can use the [`huggingface_hub`](https://github.com/huggingface/huggingface_hub) library to create, delete, update and retrieve information from repos. You can also download files from repos or integrate them into your library! For example, you can quickly load a CSV dataset with a few lines using Pandas.
+
+```py
+from huggingface_hub import hf_hub_download
+import pandas as pd
+
+REPO_ID = "YOUR_REPO_ID"
+FILENAME = "data.csv"
+
+dataset = pd.read_csv(
+    hf_hub_download(repo_id=REPO_ID, filename=FILENAME)
+)
+```
+
+## Using Git
+
+Since all datasets on the dataset Hub are Git repositories, you can clone the datasets locally by running:
+
+```bash
+git lfs install
+git clone git@hf.co:datasets/<dataset ID> # example: git clone git@hf.co:datasets/allenai/c4
+```
+
+If you have write-access to the particular dataset repo, you'll also have the ability to commit and push revisions to the dataset.
+
+Add your SSH public key to [your user settings](https://huggingface.co/settings/keys) to push changes and/or access private repos.

From e6d7d3932f504edb75c32602179a2a90e4039896 Mon Sep 17 00:00:00 2001
From: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Date: Wed, 8 Nov 2023 13:58:49 +0100
Subject: [PATCH 27/38] Apply suggestions from code review

Co-authored-by: Polina Kazakova
---
 docs/hub/datasets-adding.md                   | 6 +++---
 docs/hub/datasets-data-files-configuration.md | 4 ++--
 docs/hub/datasets-duckdb.md                   | 2 +-
 docs/hub/datasets-file-names-and-splits.md    | 2 +-
 docs/hub/datasets-pandas.md                   | 2 +-
 docs/hub/datasets-usage.md                    | 2 +-
 docs/hub/datasets-viewer.md                   | 7 +++----
 7 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/docs/hub/datasets-adding.md b/docs/hub/datasets-adding.md
--- a/docs/hub/datasets-adding.md
+++ b/docs/hub/datasets-adding.md
@@ -51,19 +51,19 @@ Adding a Dataset card is super valuable for helping users find your dataset and
 
 2. At the top, you'll see the **Metadata UI** with several fields to select from such as license, language, and task categories. These are the most important tags for helping users discover your dataset on the Hub (when applicable). When you select an option for a field, it will be automatically added to the top of the dataset card.
 
-   You can also look at the [Dataset Card specifications](https://github.com/huggingface/hub-docs/blob/main/datasetcard.md?plain=1), which has a complete set of (but not required) tag options like `annotations_creators`, to help you choose the appropriate tags.
+   You can also look at the [Dataset Card specifications](https://github.com/huggingface/hub-docs/blob/main/datasetcard.md?plain=1), which has a complete set of allowed tags, including optional ones like `annotations_creators`, to help you choose the ones that are useful for your dataset.
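+
+   For illustration, the selected tags land in the YAML block at the top of the dataset card, similar to this sketch (hypothetical values):
+
+   ```yaml
+   ---
+   license: apache-2.0
+   language:
+   - en
+   task_categories:
+   - text-classification
+   ---
+   ```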

-3. Write your dataset documentation in the Dataset Card to introduce your dataset to the community and help users understand how to use it.
+3. Write your dataset documentation in the Dataset Card to introduce your dataset to the community and help users understand what is inside: the use cases and limitations, where the data comes from, any important ethical considerations, and any other relevant details.
 
    You can click on the **Import dataset card template** link at the top of the editor to automatically create a dataset card template. For a detailed example of what a good Dataset card should look like, take a look at the [CNN DailyMail Dataset card](https://huggingface.co/datasets/cnn_dailymail).
 
 ### Dataset Viewer
 
-The [Dataset Viewer](./datasets-viewer) is crucial to know what the data actually look like.
+The [Dataset Viewer](./datasets-viewer) is useful for knowing what the data actually looks like before you download it.
 It is generally enabled by default for any dataset, depending on the dataset structure.
 
 Make sure the Dataset Viewer correctly shows your data, or [Configure the Dataset Viewer](./datasets-viewer-configure).
 
diff --git a/docs/hub/datasets-data-files-configuration.md b/docs/hub/datasets-data-files-configuration.md
--- a/docs/hub/datasets-data-files-configuration.md
+++ b/docs/hub/datasets-data-files-configuration.md
@@ -1,6 +1,6 @@
 # Data files Configuration
 
-There are no constraints in how to structure dataset repositories.
+There are no constraints on how to structure dataset repositories.
 However, if you want the Dataset Viewer to show certain data files, or to separate your dataset in train/validation/test splits, you need to structure your dataset accordingly.
 Often it is as simple as naming your data files according to their split names, e.g. `train.csv` and `test.csv`.
 
@@ -12,7 +12,7 @@ To structure your dataset by naming your data files or directories according to
 ## Manual configuration
 
 You can choose the data files to show in the Dataset Viewer for your dataset using YAML.
-It is useful if you want to specify which file goes in which split manually for example
+It is useful if you want to specify which file goes into which split manually.
diff --git a/docs/hub/datasets-duckdb.md b/docs/hub/datasets-duckdb.md index 733b7575d..cbcfdb6e3 100644 --- a/docs/hub/datasets-duckdb.md +++ b/docs/hub/datasets-duckdb.md @@ -37,5 +37,5 @@ You can reload it later: >>> fs = HfFileSystem() >>> duckdb.register_filesystem(fs) ->>> df = duckdb.query(f"SELECT * FROM 'hf://datasets/username/my_dataset/data.parquet' LIMIT 10").df() +>>> df = duckdb.query("SELECT * FROM 'hf://datasets/username/my_dataset/data.parquet' LIMIT 10;").df() ``` diff --git a/docs/hub/datasets-file-names-and-splits.md b/docs/hub/datasets-file-names-and-splits.md index 8261acccc..063ecd62a 100644 --- a/docs/hub/datasets-file-names-and-splits.md +++ b/docs/hub/datasets-file-names-and-splits.md @@ -69,7 +69,7 @@ These other split names are also supported, and the following keywords are equiv - validation, valid, val, dev - test, testing, eval, evaluation -Therefore the structure below is a valid repository: +Therefore, the structure below is a valid repository: ``` my_dataset_repository/ diff --git a/docs/hub/datasets-pandas.md b/docs/hub/datasets-pandas.md index 85d4879bb..082a429ab 100644 --- a/docs/hub/datasets-pandas.md +++ b/docs/hub/datasets-pandas.md @@ -1,6 +1,6 @@ # Pandas -[Pandas](https://github.com/pandas-dev/pandas) is widely used Python data analysis toolkit. +[Pandas](https://github.com/pandas-dev/pandas) is a widely used Python data analysis toolkit. Since it uses [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths (`hf://`) to read and write data on the Hub: First you need to [Login with your Hugging Face account](../huggingface_hub/quick-start#login), for example using: diff --git a/docs/hub/datasets-usage.md b/docs/hub/datasets-usage.md index 51478ecc8..d00bdd148 100644 --- a/docs/hub/datasets-usage.md +++ b/docs/hub/datasets-usage.md @@ -21,7 +21,7 @@ valid_dataset = load_dataset("username/my_dataset", split="validation") test_dataset = load_dataset("username/my_dataset", split="test") ``` -You can also upload datasets on Hugging Face: +You can also upload datasets to the Hugging Face Hub: ```python my_new_dataset.push_to_hub("username/my_new_dataset") diff --git a/docs/hub/datasets-viewer.md b/docs/hub/datasets-viewer.md index fc393640d..64796e382 100644 --- a/docs/hub/datasets-viewer.md +++ b/docs/hub/datasets-viewer.md @@ -9,15 +9,14 @@ The dataset page includes a table with the contents of the dataset, arranged by ## Configure the Dataset Viewer -To have a nice and working Dataset Viewer for your dataset, make sure your dataset is in a supported format and structure. +To have a properly working Dataset Viewer for your dataset, make sure your dataset is in a supported format and structure. There is also an option to configure your dataset using YAML. For more information see our guide on [How to configure the Dataset Viewer](./datasets-viewer-configure). ## Inspect data distributions -At the top of each column you can see histograms representing the distributions of numerical values and text lengths. -For categorical data there is also the number of rows from each class. +At the top of the columns you can see the graphs representing the distribution of their data. This gives you a quick insight on how balanced your classes are, what are the range and distribution of numerical data and lengths of texts, and what portion of the column data is missing. 
## Filter by value @@ -34,7 +33,7 @@ You can share a specific row by clicking on it, and then copying the URL in the ## Access the parquet files -To power the dataset viewer, every dataset is auto-converted to the Parquet format. Click on [_"Auto-converted to Parquet"_](https://huggingface.co/datasets/glue/tree/refs%2Fconvert%2Fparquet/cola) to access the Parquet files. Refer to the [Datasets Server docs](/docs/datasets-server/parquet_process) to learn how to query the dataset with libraries such as Polars, Pandas or DuckDB. +To power the dataset viewer, every dataset is auto-converted to the Parquet format. Click on [_"Auto-converted to Parquet"_](https://huggingface.co/datasets/glue/tree/refs%2Fconvert%2Fparquet/cola) to access the Parquet files. Refer to the [Datasets Server docs](/docs/datasets-server/parquet_process) to learn how to query the dataset parquet files with libraries such as Polars, Pandas or DuckDB. You can also access the list of Parquet files programmatically using the [Hub API](./api#endpoints-table): https://huggingface.co/api/datasets/glue/parquet. From 00280301762e9ce89ca836eb3618e2b8a430405a Mon Sep 17 00:00:00 2001 From: Quentin Lhoest Date: Wed, 8 Nov 2023 14:23:38 +0100 Subject: [PATCH 28/38] more links to list of supported formats --- docs/hub/datasets-file-names-and-splits.md | 4 ++-- docs/hub/datasets-manual-configuration.md | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/hub/datasets-file-names-and-splits.md b/docs/hub/datasets-file-names-and-splits.md index 063ecd62a..cf9e8b3e0 100644 --- a/docs/hub/datasets-file-names-and-splits.md +++ b/docs/hub/datasets-file-names-and-splits.md @@ -3,13 +3,13 @@ To host and share your dataset, create a dataset repository on the Hugging Face Hub and upload your data files. This guide will show you how to name your files and directories in your dataset repository when you upload it and enable all the Dataset Hub features like the Dataset Viewer. -A dataset with a supported structure and file format (`.txt`, `.csv`, `.parquet`, `.jsonl`, `.mp3`, `.jpg`, `.zip` etc.) automatically has a dataset viewer on its dataset page on the Hub. +A dataset with a [supported structure and file format]((./datasets-viewer-configure#supported-data-formats)) automatically has a dataset viewer on its dataset page on the Hub. Note that you can also define your own custom structure, see the documentation on [Manual Configuration](./datasets-manual-configuration) for more information ## Basic use-case -If your dataset isn't split into [train/validation/test splits](https://en.wikipedia.org/wiki/Training,_validation,_and_test_data_sets), the simplest dataset structure is to have one file: `data.csv` (this works with any supported file format and any file name). +If your dataset isn't split into [train/validation/test splits](https://en.wikipedia.org/wiki/Training,_validation,_and_test_data_sets), the simplest dataset structure is to have one file: `data.csv` (this works with any [supported file format](./datasets-viewer-configure#supported-data-formats) and any file name). Your repository will also contain a `README.md` file, the [dataset card](./dataset-cards) displayed on your dataset page. diff --git a/docs/hub/datasets-manual-configuration.md b/docs/hub/datasets-manual-configuration.md index 21688ceff..8a45fe8a7 100644 --- a/docs/hub/datasets-manual-configuration.md +++ b/docs/hub/datasets-manual-configuration.md @@ -2,7 +2,7 @@ This guide will show you how to configure a custom structure for your dataset repository. 
-A dataset with a supported structure and file format (`.txt`, `.csv`, `.parquet`, `.jsonl`, `.mp3`, `.jpg`, `.zip` etc.) automatically has a Dataset Viewer on its dataset page on the Hub. You can use YAML to configure the splits and builder parameters that are used by the Viewer. +A dataset with a [supported structure and file format]((./datasets-viewer-configure#supported-data-formats)) automatically has a Dataset Viewer on its dataset page on the Hub. You can use YAML to configure the splits and builder parameters that are used by the Viewer. It is even possible to define multiple configurations for the same dataset (e.g. if the dataset has various independent files). From fe1e75eca31102f1b83b0f1f06c699e0dcd03b6a Mon Sep 17 00:00:00 2001 From: Quentin Lhoest Date: Thu, 9 Nov 2023 16:11:27 +0100 Subject: [PATCH 29/38] move supported file formats to upload page --- docs/hub/datasets-adding.md | 20 +++++++++++++++++++- docs/hub/datasets-viewer-configure.md | 16 +--------------- 2 files changed, 20 insertions(+), 16 deletions(-) diff --git a/docs/hub/datasets-adding.md b/docs/hub/datasets-adding.md index 5cd4ac48b..a7a10e6f5 100644 --- a/docs/hub/datasets-adding.md +++ b/docs/hub/datasets-adding.md @@ -64,7 +64,7 @@ Adding a Dataset card is super valuable for helping users find your dataset and ### Dataset Viewer The [Dataset Viewer](./datasets-viewer) is useful to know how the data actually looks like before you download it. -It is generally enabled by default for any dataset, depending on the dataset structure. +It is enabled by default for all public datasets. Make sure the Dataset Viewer correctly shows your data, or [Configure the Dataset Viewer](./datasets-viewer-configure). @@ -80,3 +80,21 @@ See the list of [Libraries supported by the Datasets Hub](./datasets-libraries) ## Using Git Since dataset repos are just Git repositories, you can use Git to push your data files to the Hub. Follow the guide on [Getting Started with Repositories](repositories-getting-started) to learn about using the `git` CLI to commit and push your datasets. + +## File formats + +The Hub natively supports multiple file formats: + +- CSV (.csv, .tsv) +- JSON Lines, JSON (.jsonl, .json) +- Parquet (.parquet) +- Text (.txt) +- Images (.png, .jpg, etc.) +- Audio (.wav, .mp3, etc.) + +It also supports files compressed using ZIP (.zip), GZIP (.gz), ZSTD (.zst), BZ2 (.bz2), LZ4 (.lz4) and LZMA (.xz). + +Image and audio resources can also have additional metadata files, see the [Data files Configuration](./datasets-data-files-configuration) on image and audio datasets. + +You may want to convert your files to these formats to benefit from all the Hub features. +Other formats and structures may not be recognized by the Hub. diff --git a/docs/hub/datasets-viewer-configure.md b/docs/hub/datasets-viewer-configure.md index b3c00724f..7c2ba595e 100644 --- a/docs/hub/datasets-viewer-configure.md +++ b/docs/hub/datasets-viewer-configure.md @@ -3,21 +3,7 @@ The Dataset Viewer supports many data files formats, from text to tabular and from image to audio formats. It also separates the train/validation/test splits based on file and folder names. -To configure the Dataset Viewer for your dataset, make sure your dataset is in a supported data format and structured the right way. - -## Supported data formats - -The dataset viewer supports multiple file formats: - -- CSV (.csv, .tsv) -- JSON Lines, JSON (.jsonl, .json) -- Parquet (.parquet) -- Text (.txt) -- Images (.png, .jpg, etc.) -- Audio (.wav, .mp3, etc.) 
- -The dataset viewer also supports files compressed using ZIP (.zip), GZIP (.gz), ZSTD (.zst), BZ2 (.bz2), LZ4 (.lz4) and LZMA (.xz). -Image and audio resources can also have additional metadata files, see the [Data files Configuration](./datasets-data-files-configuration) on image and audio datasets. +To configure the Dataset Viewer for your dataset, first make sure your dataset is in a [supported data format](./datasets-adding#files-formats). ## Configure dropdowns for splits or subsets From e257db7b14741346e16b13c80b4d3d7b9a58d1a6 Mon Sep 17 00:00:00 2001 From: Quentin Lhoest Date: Thu, 9 Nov 2023 16:14:56 +0100 Subject: [PATCH 30/38] fix link to row --- docs/hub/datasets-viewer.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/hub/datasets-viewer.md b/docs/hub/datasets-viewer.md index 64796e382..5496d1b56 100644 --- a/docs/hub/datasets-viewer.md +++ b/docs/hub/datasets-viewer.md @@ -29,7 +29,7 @@ You can search for a word in the dataset by typing it in the search bar at the t ## Share a specific row -You can share a specific row by clicking on it, and then copying the URL in the address bar of your browser. For example https://huggingface.co/datasets/glue/viewer/mrpc/test?row=241 will open the dataset viewer on the MRPC dataset, on the test split, and on the 241st row. +You can share a specific row by clicking on it, and then copying the URL in the address bar of your browser. For example https://huggingface.co/datasets/glue/viewer/mrpc/test?p=2&row=241 will open the dataset viewer on the MRPC dataset, on the test split, and on the 241st row. ## Access the parquet files From 79d28124792ba2217f105324c7406bfb533d0061 Mon Sep 17 00:00:00 2001 From: Quentin Lhoest Date: Thu, 9 Nov 2023 16:16:18 +0100 Subject: [PATCH 31/38] fix links --- docs/hub/datasets-file-names-and-splits.md | 4 ++-- docs/hub/datasets-manual-configuration.md | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/hub/datasets-file-names-and-splits.md b/docs/hub/datasets-file-names-and-splits.md index cf9e8b3e0..83c06703c 100644 --- a/docs/hub/datasets-file-names-and-splits.md +++ b/docs/hub/datasets-file-names-and-splits.md @@ -3,13 +3,13 @@ To host and share your dataset, create a dataset repository on the Hugging Face Hub and upload your data files. This guide will show you how to name your files and directories in your dataset repository when you upload it and enable all the Dataset Hub features like the Dataset Viewer. -A dataset with a [supported structure and file format]((./datasets-viewer-configure#supported-data-formats)) automatically has a dataset viewer on its dataset page on the Hub. +A dataset with a [supported structure and file format]((./datasets-adding#files-formats)) automatically has a dataset viewer on its dataset page on the Hub. Note that you can also define your own custom structure, see the documentation on [Manual Configuration](./datasets-manual-configuration) for more information ## Basic use-case -If your dataset isn't split into [train/validation/test splits](https://en.wikipedia.org/wiki/Training,_validation,_and_test_data_sets), the simplest dataset structure is to have one file: `data.csv` (this works with any [supported file format](./datasets-viewer-configure#supported-data-formats) and any file name). 
+If your dataset isn't split into [train/validation/test splits](https://en.wikipedia.org/wiki/Training,_validation,_and_test_data_sets), the simplest dataset structure is to have one file: `data.csv` (this works with any [supported file format](./datasets-adding#files-formats) and any file name). Your repository will also contain a `README.md` file, the [dataset card](./dataset-cards) displayed on your dataset page. diff --git a/docs/hub/datasets-manual-configuration.md b/docs/hub/datasets-manual-configuration.md index 8a45fe8a7..503b8b4c9 100644 --- a/docs/hub/datasets-manual-configuration.md +++ b/docs/hub/datasets-manual-configuration.md @@ -2,7 +2,7 @@ This guide will show you how to configure a custom structure for your dataset repository. -A dataset with a [supported structure and file format]((./datasets-viewer-configure#supported-data-formats)) automatically has a Dataset Viewer on its dataset page on the Hub. You can use YAML to configure the splits and builder parameters that are used by the Viewer. +A dataset with a [supported structure and file format]((./datasets-adding#files-formats)) automatically has a Dataset Viewer on its dataset page on the Hub. You can use YAML to configure the splits and builder parameters that are used by the Viewer. It is even possible to define multiple configurations for the same dataset (e.g. if the dataset has various independent files). From 2ffc2cccca8958cef78e52ba3dee0481339748f8 Mon Sep 17 00:00:00 2001 From: Quentin Lhoest Date: Thu, 9 Nov 2023 16:17:05 +0100 Subject: [PATCH 32/38] again --- docs/hub/datasets-file-names-and-splits.md | 2 +- docs/hub/datasets-manual-configuration.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/hub/datasets-file-names-and-splits.md b/docs/hub/datasets-file-names-and-splits.md index 83c06703c..c823050f9 100644 --- a/docs/hub/datasets-file-names-and-splits.md +++ b/docs/hub/datasets-file-names-and-splits.md @@ -3,7 +3,7 @@ To host and share your dataset, create a dataset repository on the Hugging Face Hub and upload your data files. This guide will show you how to name your files and directories in your dataset repository when you upload it and enable all the Dataset Hub features like the Dataset Viewer. -A dataset with a [supported structure and file format]((./datasets-adding#files-formats)) automatically has a dataset viewer on its dataset page on the Hub. +A dataset with a [supported structure and file format](./datasets-adding#files-formats) automatically has a dataset viewer on its dataset page on the Hub. Note that you can also define your own custom structure, see the documentation on [Manual Configuration](./datasets-manual-configuration) for more information diff --git a/docs/hub/datasets-manual-configuration.md b/docs/hub/datasets-manual-configuration.md index 503b8b4c9..09c02ba41 100644 --- a/docs/hub/datasets-manual-configuration.md +++ b/docs/hub/datasets-manual-configuration.md @@ -2,7 +2,7 @@ This guide will show you how to configure a custom structure for your dataset repository. -A dataset with a [supported structure and file format]((./datasets-adding#files-formats)) automatically has a Dataset Viewer on its dataset page on the Hub. You can use YAML to configure the splits and builder parameters that are used by the Viewer. +A dataset with a [supported structure and file format](./datasets-adding#files-formats) automatically has a Dataset Viewer on its dataset page on the Hub. 
You can use YAML to configure the splits and builder parameters that are used by the Viewer. It is even possible to define multiple configurations for the same dataset (e.g. if the dataset has various independent files). From c9d0f53a63da91b9e2a3c2de336564802cfad273 Mon Sep 17 00:00:00 2001 From: Quentin Lhoest Date: Thu, 9 Nov 2023 16:19:07 +0100 Subject: [PATCH 33/38] remove duplicate --- docs/hub/_toctree.yml | 2 -- 1 file changed, 2 deletions(-) diff --git a/docs/hub/_toctree.yml b/docs/hub/_toctree.yml index 0407fab0f..04001c216 100644 --- a/docs/hub/_toctree.yml +++ b/docs/hub/_toctree.yml @@ -127,8 +127,6 @@ title: Gated Datasets - local: datasets-adding title: Uploading Datasets - - local: datasets-adding - title: Uploading Datasets - local: datasets-downloading title: Downloading Datasets - local: datasets-libraries From dac5618997404c3518660936cb2109c99250d188 Mon Sep 17 00:00:00 2001 From: Quentin Lhoest Date: Thu, 9 Nov 2023 16:20:39 +0100 Subject: [PATCH 34/38] move configure viewer to bottom of page --- docs/hub/datasets-viewer.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/docs/hub/datasets-viewer.md b/docs/hub/datasets-viewer.md index 5496d1b56..05c21eb3c 100644 --- a/docs/hub/datasets-viewer.md +++ b/docs/hub/datasets-viewer.md @@ -7,13 +7,6 @@ The dataset page includes a table with the contents of the dataset, arranged by
-
-## Configure the Dataset Viewer
-
-To have a properly working Dataset Viewer for your dataset, make sure your dataset is in a supported format and structure.
-There is also an option to configure your dataset using YAML.
-
-For more information see our guide on [How to configure the Dataset Viewer](./datasets-viewer-configure).
 
 ## Inspect data distributions
 
 At the top of the columns you can see the graphs representing the distribution of their data. This gives you a quick insight on how balanced your classes are, what are the range and distribution of numerical data and lengths of texts, and what portion of the column data is missing.
 
@@ -50,3 +43,10 @@ For the biggest datasets, the page shows a preview of the first 100 rows instead
 
+
+## Configure the Dataset Viewer
+
+To have a properly working Dataset Viewer for your dataset, make sure your dataset is in a supported format and structure.
+There is also an option to configure your dataset using YAML.
+
+For more information see our guide on [How to configure the Dataset Viewer](./datasets-viewer-configure).

From 74dbd4f969b08635062b75ec46d23d619f312215 Mon Sep 17 00:00:00 2001
From: Quentin Lhoest
Date: Thu, 9 Nov 2023 17:10:34 +0100
Subject: [PATCH 35/38] fix to_parquet example

---
 docs/hub/datasets-dask.md   | 8 ++++----
 docs/hub/datasets-pandas.md | 8 ++++----
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/docs/hub/datasets-dask.md b/docs/hub/datasets-dask.md
index beeb5f9b7..701b1aaec 100644
--- a/docs/hub/datasets-dask.md
+++ b/docs/hub/datasets-dask.md
@@ -22,12 +22,12 @@ Finally you can use Hugging Face paths in Dask:

 ```python
 import dask.dataframe as dd

-df.write_parquet("hf://datasets/username/my_dataset")
+df.to_parquet("hf://datasets/username/my_dataset")

 # or write in separate directories if the dataset has train/validation/test splits
-df_train.write_parquet("hf://datasets/username/my_dataset/train")
-df_valid.write_parquet("hf://datasets/username/my_dataset/validation")
-df_test .write_parquet("hf://datasets/username/my_dataset/test")
+df_train.to_parquet("hf://datasets/username/my_dataset/train")
+df_valid.to_parquet("hf://datasets/username/my_dataset/validation")
+df_test .to_parquet("hf://datasets/username/my_dataset/test")
 ```

 This creates a dataset repository `username/my_dataset` containing your Dask dataset in Parquet format.
diff --git a/docs/hub/datasets-pandas.md b/docs/hub/datasets-pandas.md
index 082a429ab..9d1ebac2b 100644
--- a/docs/hub/datasets-pandas.md
+++ b/docs/hub/datasets-pandas.md
@@ -22,12 +22,12 @@ Finally you can use Hugging Face paths in Pandas:

 ```python
 import pandas as pd

-df.write_parquet("hf://datasets/username/my_dataset/data.parquet")
+df.to_parquet("hf://datasets/username/my_dataset/data.parquet")

 # or write in separate files if the dataset has train/validation/test splits
-df_train.write_parquet("hf://datasets/username/my_dataset/train.parquet")
-df_valid.write_parquet("hf://datasets/username/my_dataset/validation.parquet")
-df_test .write_parquet("hf://datasets/username/my_dataset/test.parquet")
+df_train.to_parquet("hf://datasets/username/my_dataset/train.parquet")
+df_valid.to_parquet("hf://datasets/username/my_dataset/validation.parquet")
+df_test .to_parquet("hf://datasets/username/my_dataset/test.parquet")
 ```

 This creates a dataset repository `username/my_dataset` containing your Pandas dataset in Parquet format.

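The `to_parquet` fix above is easiest to sanity-check by reading the files back over the same `hf://` paths. A minimal sketch, assuming the placeholder repository `username/my_dataset` from the examples exists and is public, and that `huggingface_hub` is installed so Pandas and Dask can resolve `hf://` URLs:

```python
# Read back the Parquet files written in the examples above.
# "username/my_dataset" is a placeholder, not a real repository;
# a private repository would also require huggingface_hub.login().
import pandas as pd
import dask.dataframe as dd

# Pandas reads a single Parquet file eagerly into memory
df_train = pd.read_parquet("hf://datasets/username/my_dataset/train.parquet")

# Dask reads a directory of Parquet files lazily, one partition per file
ddf_train = dd.read_parquet("hf://datasets/username/my_dataset/train")
print(ddf_train.head())  # only computes the first partition
```

Note the asymmetry the patch fixes: `write_parquet` is not a Pandas or Dask DataFrame method, while `to_parquet` mirrors the `read_parquet` readers used here.
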
From 8e3da3c6d53fa497d99958c4f4727697c67de03c Mon Sep 17 00:00:00 2001 From: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> Date: Tue, 14 Nov 2023 15:03:51 +0100 Subject: [PATCH 36/38] minor fix --- docs/hub/datasets-downloading.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/hub/datasets-downloading.md b/docs/hub/datasets-downloading.md index 5edccddda..1ed63e38d 100644 --- a/docs/hub/datasets-downloading.md +++ b/docs/hub/datasets-downloading.md @@ -26,7 +26,7 @@ REPO_ID = "YOUR_REPO_ID" FILENAME = "data.csv" dataset = pd.read_csv( - hf_hub_download(repo_id=REPO_ID, filename=FILENAME) + hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type="dataset") ) ``` From e63d0e4b006c72a62506e9d56404b795cddd568d Mon Sep 17 00:00:00 2001 From: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> Date: Tue, 14 Nov 2023 15:18:12 +0100 Subject: [PATCH 37/38] Update docs/hub/datasets-file-names-and-splits.md Co-authored-by: Lucain --- docs/hub/datasets-file-names-and-splits.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/hub/datasets-file-names-and-splits.md b/docs/hub/datasets-file-names-and-splits.md index c823050f9..c361af83e 100644 --- a/docs/hub/datasets-file-names-and-splits.md +++ b/docs/hub/datasets-file-names-and-splits.md @@ -21,7 +21,7 @@ my_dataset_repository/ ## Splits -Certain patterns in the dataset repository can be used to assign certain files to train/validation/test splits. +Some patterns in the dataset repository can be used to assign certain files to train/validation/test splits. ### File name From fa1093e5c0252561323c9c67a93f2339de80a52d Mon Sep 17 00:00:00 2001 From: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> Date: Wed, 15 Nov 2023 12:31:12 +0100 Subject: [PATCH 38/38] Apply suggestions from code review Co-authored-by: Polina Kazakova --- docs/hub/datasets-dask.md | 2 +- docs/hub/datasets-duckdb.md | 2 +- docs/hub/datasets-file-names-and-splits.md | 4 ++-- docs/hub/datasets-manual-configuration.md | 4 ++-- docs/hub/datasets-pandas.md | 2 +- docs/hub/datasets-viewer-configure.md | 2 +- 6 files changed, 8 insertions(+), 8 deletions(-) diff --git a/docs/hub/datasets-dask.md b/docs/hub/datasets-dask.md index 701b1aaec..00e5ef840 100644 --- a/docs/hub/datasets-dask.md +++ b/docs/hub/datasets-dask.md @@ -17,7 +17,7 @@ from huggingface_hub import HfApi HfApi().create_repo(repo_id="username/my_dataset", repo_type="dataset") ``` -Finally you can use Hugging Face paths in Dask: +Finally, you can use Hugging Face paths in Dask: ```python import dask.dataframe as dd diff --git a/docs/hub/datasets-duckdb.md b/docs/hub/datasets-duckdb.md index cbcfdb6e3..a308a972d 100644 --- a/docs/hub/datasets-duckdb.md +++ b/docs/hub/datasets-duckdb.md @@ -17,7 +17,7 @@ from huggingface_hub import HfApi HfApi().create_repo(repo_id="username/my_dataset", repo_type="dataset") ``` -Finally you can use Hugging Face paths in DuckDB: +Finally, you can use Hugging Face paths in DuckDB: ```python >>> from huggingface_hub import HfFileSystem diff --git a/docs/hub/datasets-file-names-and-splits.md b/docs/hub/datasets-file-names-and-splits.md index c361af83e..897af7ff4 100644 --- a/docs/hub/datasets-file-names-and-splits.md +++ b/docs/hub/datasets-file-names-and-splits.md @@ -3,9 +3,9 @@ To host and share your dataset, create a dataset repository on the Hugging Face Hub and upload your data files. 
This guide will show you how to name your files and directories in your dataset repository when you upload it and enable all the Dataset Hub features like the Dataset Viewer. -A dataset with a [supported structure and file format](./datasets-adding#files-formats) automatically has a dataset viewer on its dataset page on the Hub. +A dataset with a supported structure and [file formats](./datasets-adding#file-formats) automatically has a dataset viewer on its page on the Hub. -Note that you can also define your own custom structure, see the documentation on [Manual Configuration](./datasets-manual-configuration) for more information +Note that if none of the structures below suits your case, you can have more control over how you define splits and subsets with the [Manual Configuration](./datasets-manual-configuration). ## Basic use-case diff --git a/docs/hub/datasets-manual-configuration.md b/docs/hub/datasets-manual-configuration.md index 09c02ba41..28586cd7f 100644 --- a/docs/hub/datasets-manual-configuration.md +++ b/docs/hub/datasets-manual-configuration.md @@ -2,9 +2,9 @@ This guide will show you how to configure a custom structure for your dataset repository. -A dataset with a [supported structure and file format](./datasets-adding#files-formats) automatically has a Dataset Viewer on its dataset page on the Hub. You can use YAML to configure the splits and builder parameters that are used by the Viewer. +A dataset with a supported structure and [file formats](./datasets-adding#file-formats) automatically has a Dataset Viewer on its dataset page on the Hub. You can use YAML to define the splits, configurations and builder parameters that are used by the Viewer. -It is even possible to define multiple configurations for the same dataset (e.g. if the dataset has various independent files). +It is also possible to define multiple configurations for the same dataset (e.g. if the dataset has various independent files). ## Define your splits and subsets in YAML diff --git a/docs/hub/datasets-pandas.md b/docs/hub/datasets-pandas.md index 9d1ebac2b..9972816fb 100644 --- a/docs/hub/datasets-pandas.md +++ b/docs/hub/datasets-pandas.md @@ -17,7 +17,7 @@ from huggingface_hub import HfApi HfApi().create_repo(repo_id="username/my_dataset", repo_type="dataset") ``` -Finally you can use Hugging Face paths in Pandas: +Finally, you can use Hugging Face paths in Pandas: ```python import pandas as pd diff --git a/docs/hub/datasets-viewer-configure.md b/docs/hub/datasets-viewer-configure.md index 7c2ba595e..dd08f7848 100644 --- a/docs/hub/datasets-viewer-configure.md +++ b/docs/hub/datasets-viewer-configure.md @@ -1,6 +1,6 @@ # Configure the Dataset Viewer -The Dataset Viewer supports many data files formats, from text to tabular and from image to audio formats. +The Dataset Viewer supports many [data files formats](./datasets-adding#file-formats), from text to tabular and from image to audio formats. It also separates the train/validation/test splits based on file and folder names. To configure the Dataset Viewer for your dataset, first make sure your dataset is in a [supported data format](./datasets-adding#files-formats).
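
The pages patched above repeatedly point to the YAML-based Viewer configuration without showing it. As an illustrative sketch only — the repository name, split names, and file paths below are placeholders, and the exact schema is the one described in the Manual Configuration guide — a `README.md` header declaring two splits can be pushed with `huggingface_hub`:

```python
# Minimal sketch: upload a README.md whose YAML header defines the Viewer splits.
# Assumes a placeholder dataset repo "username/my_dataset" that already contains
# train.csv and test.csv, and a write token configured for huggingface_hub.
from huggingface_hub import HfApi

readme = """---
configs:
- config_name: default
  data_files:
  - split: train
    path: train.csv
  - split: test
    path: test.csv
---
# My dataset
"""

HfApi().upload_file(
    path_or_fileobj=readme.encode(),  # also accepts a file path or file object
    path_in_repo="README.md",
    repo_id="username/my_dataset",
    repo_type="dataset",
)
```

With file names like `train.csv` and `test.csv`, this YAML is often unnecessary: as the last hunk notes, the Viewer already derives train/validation/test splits from file and folder names, so the explicit `configs` section mainly matters for custom layouts or multiple subsets.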