Datasets: Adding doc for DuckDB CLI integration #1297

Merged: 8 commits, May 27, 2024

AndreaFrancis
Contributor

Following up on duckdb/duckdb#11831, this PR adds documentation for the DuckDB CLI integration:

  • Authentication for private and gated datasets (using the DuckDB Secrets Manager)
  • Querying datasets (basic SELECT examples, plus DESCRIBE and SUMMARIZE for stats)
  • Performing SQL operations (text functions and aggregations)
  • Combining datasets, exporting them, and publishing on the Hub
  • Performing vector similarity search
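
As a rough illustration of the first two items (not taken from the PR itself), the sketch below builds the kind of statements one might paste into the DuckDB CLI. It only assembles strings and does not run DuckDB. The helper names (`hf_path`, `create_secret_sql`, and so on), the repository `my_username/my_dataset`, and the file `data.parquet` are hypothetical placeholders; the `hf://datasets/...` path form and the `TYPE HUGGINGFACE` secret are assumed to match the DuckDB version in use.

```python
from typing import Optional


def sql_literal(value: str) -> str:
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def hf_path(repo_id: str, file_path: str, revision: Optional[str] = None) -> str:
    """Build an hf:// path to a file in a dataset repository (hypothetical helper)."""
    repo = f"{repo_id}@{revision}" if revision else repo_id
    return f"hf://datasets/{repo}/{file_path}"


def create_secret_sql(token: str) -> str:
    """Statement that stores a Hub token in the DuckDB Secrets Manager."""
    return f"CREATE SECRET hf_token (TYPE HUGGINGFACE, TOKEN {sql_literal(token)});"


def preview_sql(path: str, limit: int = 5) -> str:
    """Basic SELECT over a remote file."""
    return f"SELECT * FROM {sql_literal(path)} LIMIT {int(limit)};"


def describe_sql(path: str) -> str:
    """DESCRIBE lists the column names and types of the query result."""
    return f"DESCRIBE SELECT * FROM {sql_literal(path)};"


def summarize_sql(path: str) -> str:
    """SUMMARIZE computes per-column statistics for the query result."""
    return f"SUMMARIZE SELECT * FROM {sql_literal(path)};"


if __name__ == "__main__":
    # Placeholder repository and file; replace with a real dataset.
    path = hf_path("my_username/my_dataset", "data.parquet")
    for statement in (
        create_secret_sql("<your_hf_token>"),  # only needed for private or gated datasets
        preview_sql(path),
        describe_sql(path),
        summarize_sql(path),
    ):
        print(statement)
```

Keeping the path construction separate from the statements makes it easy to point the same queries at a different revision or file.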

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

AndreaFrancis marked this pull request as ready for review on May 23, 2024, 14:44
@AndreaFrancis
Contributor Author

I am not sure why the DuckDB sections don't seem to be working as expected at https://moon-ci-docs.huggingface.co/docs/hub/pr_1297/en/datasets-duckdb

@lhoestq
Member

lhoestq commented May 23, 2024

I think the tabs on the left are buggy in the preview but work when deployed. The other pages at https://moon-ci-docs.huggingface.co/docs/hub/pr_1297/en/datasets-duckdb are working though.

@lhoestq
Member

lhoestq commented May 23, 2024

The current section main page is https://moon-ci-docs.huggingface.co/docs/hub/pr_1297/en/datasets-duckdb and shows the old way of using duckdb (register the filesystem). Maybe we can set datasets-duckdb-cli as the main section page instead, and add links to the other pages at the end of this file?

@AndreaFrancis
Contributor Author

> The current section main page is https://moon-ci-docs.huggingface.co/docs/hub/pr_1297/en/datasets-duckdb and shows the old way of using duckdb (register the filesystem). Maybe we can set datasets-duckdb-cli as the main section page instead, and add links to the other pages at the end of this file?

Done. I made datasets-duckdb-cli the main section page and added the first part of the old file (the introduction to what DuckDB is).
I didn't copy the rest (how to upload the results to a new dataset repository) because it is already covered.

Collaborator

@severo severo left a comment

Awesome content, congrats @AndreaFrancis! I learnt a lot.

Resolved review threads:
  • docs/hub/datasets-duckdb.md (1)
  • docs/hub/datasets-duckdb-cli.md (4, outdated)
  • docs/hub/datasets-duckdb-cli-select.md (2, outdated)
  • docs/hub/datasets-duckdb-cli-sql.md (2, outdated)
  • docs/hub/datasets-duckdb-cli-combine-and-export.md (1, outdated)
AndreaFrancis and others added 2 commits May 27, 2024 10:20
Co-authored-by: Sylvain Lesage <sylvain.lesage@huggingface.co>
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Member

@lhoestq lhoestq left a comment

Amazing! Just added one comment:

Resolved review thread: docs/hub/datasets-duckdb-combine-and-export.md (outdated)
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Contributor

@polinaeterna polinaeterna left a comment

I've learned some new things, thank you!

I've left a couple of minor comments; feel free to ignore them if you think they're not important.

Resolved review threads:
  • docs/hub/datasets-duckdb-select.md (2, one outdated)
  • docs/hub/datasets-duckdb.md (1)
@AndreaFrancis
Contributor Author

Thanks for all your feedback, @severo @lhoestq and @polinaeterna; I think it is ready and will merge.

AndreaFrancis merged commit 2bb69f0 into main on May 27, 2024 (1 check passed)
AndreaFrancis deleted the duckdb-cli-integration branch on May 27, 2024, 20:47
@severo
Collaborator

severo commented May 28, 2024

Member

@julien-c julien-c left a comment

Nice work, this is way clearer than before.

Let's now make the integration as visible as possible and keep building on top of it.

6 participants