@lhoestq
Member

@lhoestq lhoestq commented Oct 31, 2023

The goal is to make the Dataset Hub docs more focused on Hub features like the Viewer, and less on the datasets lib. In this PR, datasets becomes one of the many libraries that can be used with the Hub instead of being shown as the entry point.

  1. I added a Configure the Dataset Viewer page

    -> this will help users have a working Dataset Viewer without knowledge about how datasets works

  2. I added a Data files Configuration page

    -> this gives more detail on how to structure a dataset (e.g. for splits)

  3. I added a Libraries page and dedicated pages for:

    • Dask
    • DuckDB
    • Datasets (redirects to the datasets docs)
    • Pandas
    • WebDataset

    -> the focus is less on the datasets library to show that people are actually free to use whatever tools they want.
    -> they're pretty simple for now and we should keep enriching them

  4. Added Uploading datasets and Downloading datasets for consistency with model docs

TODO in the datasets docs (will open a PR shortly):

  • remove "create a dataset" docs content and redirect to the one on the Hub instead
  • remove "repository structure" docs content and redirect to the one on the Hub instead

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Oct 31, 2023

The documentation is not available anymore as the PR was closed or merged.

@lhoestq lhoestq marked this pull request as ready for review October 31, 2023 17:04
@lhoestq lhoestq marked this pull request as draft October 31, 2023 17:08
@lhoestq lhoestq changed the title Dataset Viewer and Libraries docs Dataset Viewer, Structure and Libraries docs Oct 31, 2023
@lhoestq lhoestq marked this pull request as ready for review November 2, 2023 17:12
@lhoestq
Member Author

lhoestq commented Nov 2, 2023

The docs preview doesn't work (internal report here) but feel free to start reviewing on GitHub :)

also cc @julien-c if you want to check the docs I'm adding to the Datasets section of the Hub documentation

Collaborator

@severo severo left a comment

Thanks a lot for these docs, they are at the correct level of simplicity and avoid having to understand the details of the datasets library.

Member

@julien-c julien-c left a comment

can you ping me again when the doc-build worked? 😅

Member

from a quick glance i do like the fact that we present multiple tools that are compatible with dataset repos! That's quite cool

Member

@julien-c julien-c left a comment

one last comment, you should also tag @davanstrien for review IMO given the recent post https://huggingface.co/blog/researcher-dataset-sharing and interests in dataset advocacy!!


### Create a Dataset card

Adding a Dataset card is super valuable for helping users find your dataset and understand how to use it responsibly.
Contributor

I would add a link to https://huggingface.co/docs/hub/datasets-cards for users that don't know what a Dataset card is.

@davanstrien
Member

Question: do we also want to add Argilla + Prodigy to the overview table in docs/hub/datasets-libraries.md? Otherwise, could it be worth adding a section on annotation tools that integrate with the Hub?

I understand it will be in another PR. See #1070 (comment)

Sorry, missed that :)

Copy link
Contributor

@mariosasko mariosasko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And some comments from me:

Comment on lines +77 to +78
Some libraries like [🤗 Datasets](https://huggingface.co/docs/datasets/index), [Pandas](https://pandas.pydata.org/), [Dask](https://www.dask.org/) or [DuckDB](https://duckdb.org/) can upload files to the Hub.
See the list of [Libraries supported by the Datasets Hub](./datasets-libraries) for more information.
Contributor

Would be better to have links to the upload examples from these libraries' dedicated doc pages here (we can split these pages into the Upload and Download sections to make them linkable)

Member Author

The links already appear on the navigation tab on the left when you are on this page

Contributor

Okay, I see. But I'm not sure why these links are not expanded (automatically) when clicking on the [Libraries supported by the Datasets Hub](./datasets-libraries) link on my machine.

Member Author

Hmm I'm also seeing this behavior when clicking on the Libraries link from the documentation index page.
Let's see with the docs front-end team if we can fix that

lhoestq and others added 3 commits November 6, 2023 22:00
Co-authored-by: Daniel van Strien <davanstrien@users.noreply.github.com>
@lhoestq
Member Author

lhoestq commented Nov 6, 2023

Thanks for the comments, I took them into account :)


@polinaeterna polinaeterna left a comment

very much like! i also added a few suggestions :)


## Basic use-case

If your dataset isn't split into [train/validation/test splits](https://en.wikipedia.org/wiki/Training,_validation,_and_test_data_sets), the simplest dataset structure is to have one file: `data.csv` (this works with any supported file format and any file name).

any supported data format - there are a lot of mentions of supported formats through all the docs so maybe indeed it makes sense to list a complete set of them somewhere and point to it?
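
As an illustration of the basic use case quoted above, here is a minimal sketch of the one-file layout using only the Python standard library. The folder name `my_dataset`, the column names, and the two helper functions are hypothetical and not taken from the PR; only the file name `data.csv` comes from the quoted docs.

```python
import csv
import tempfile
from pathlib import Path


def write_single_file_dataset(root: Path) -> Path:
    """Create the simplest layout: a single data file at the repository root."""
    root.mkdir(parents=True, exist_ok=True)
    data_file = root / "data.csv"
    # Hypothetical example rows; any tabular content would do.
    example_rows = [
        {"text": "hello", "label": "0"},
        {"text": "world", "label": "1"},
    ]
    with data_file.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "label"])
        writer.writeheader()
        writer.writerows(example_rows)
    return data_file


def read_rows(data_file: Path) -> list[dict[str, str]]:
    """Read the CSV back as a list of row dictionaries."""
    with data_file.open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


# Build the layout in a temporary folder and inspect it.
with tempfile.TemporaryDirectory() as tmp:
    path = write_single_file_dataset(Path(tmp) / "my_dataset")
    rows = read_rows(path)
    layout = sorted(p.name for p in path.parent.iterdir())
```

According to the quoted docs, any other supported format or file name would work the same way; CSV is used here only because it needs no third-party dependency.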

@lhoestq
Member Author

lhoestq commented Nov 9, 2023

Let me know if you have other comments @mariosasko @polinaeterna @Wauplin

Also cc @julien-c the documentation preview is working now if you want to take a look

Contributor

@Wauplin Wauplin left a comment

Very nice!
Made a last check on the docs and everything looks fine 👍

Co-authored-by: Lucain <lucainp@gmail.com>
Contributor

@mariosasko mariosasko left a comment

Thanks, LGTM!


@polinaeterna polinaeterna left a comment

thank you a lot for this work! i've left a couple of nit suggestions; you can ignore them except for the link to the file-formats section (otherwise it's broken) and some punctuation.

@julien-c
Member

let's 🚢 this verrrrry long-standing PR no?

Co-authored-by: Polina Kazakova <polina@huggingface.co>
@lhoestq lhoestq merged commit 35884ae into main Nov 15, 2023
@lhoestq lhoestq deleted the more-datasets-docs branch November 15, 2023 11:51
@lhoestq
Member Author

lhoestq commented Nov 15, 2023

Thanks for all the reviews :)

@lhoestq lhoestq mentioned this pull request Nov 15, 2023
10 participants