Create endpoint /dataset-info #670

Merged: severo merged 30 commits into main from create-endpoint-size on Jan 18, 2023
Conversation

@severo (Collaborator) commented on Dec 23, 2022

No description provided.

@severo (Collaborator, Author) commented on Dec 23, 2022

It's only the start, but I'm interested in your opinion on https://github.com/huggingface/datasets-server/pull/670/files#diff-0f066cc0774e19de939dd3c15c9b224c193fe83b71468cdb33315fce49a45ddfR27-R34, @huggingface/datasets

@codecov-commenter commented on Dec 23, 2022

Codecov Report

Base: 90.67% // Head: 91.07% // Increases project coverage by +0.40% 🎉

Coverage data is based on head (69f2d0b) compared to base (30b508c).
Patch coverage: 95.44% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #670      +/-   ##
==========================================
+ Coverage   90.67%   91.07%   +0.40%     
==========================================
  Files          38       27      -11     
  Lines        2648     1849     -799     
==========================================
- Hits         2401     1684     -717     
+ Misses        247      165      -82     
Flag                     Coverage Δ
libs_libcommon           ?
workers_datasets_based   91.07% <95.44%> (-1.86%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
workers/datasets_based/tests/fixtures/datasets.py 100.00% <ø> (ø)
workers/datasets_based/tests/conftest.py 90.76% <73.68%> (-7.15%) ⬇️
...s_based/src/datasets_based/workers/dataset_info.py 92.00% <92.00%> (ø)
...datasets_based/workers/parquet_and_dataset_info.py 92.52% <92.52%> (ø)
...tasets_based/src/datasets_based/workers/parquet.py 92.15% <93.54%> (+0.33%) ⬆️
...orkers/datasets_based/src/datasets_based/config.py 98.78% <100.00%> (ø)
...orkers/datasets_based/src/datasets_based/worker.py 87.91% <100.00%> (ø)
...atasets_based/src/datasets_based/worker_factory.py 100.00% <100.00%> (ø)
...s/datasets_based/src/datasets_based/worker_loop.py 48.10% <100.00%> (ø)
...c/datasets_based/workers/_datasets_based_worker.py 96.07% <100.00%> (ø)
... and 27 more


☔ View full report at Codecov.

@severo (Collaborator, Author) commented on Dec 27, 2022

Blocked until https://github.com/huggingface/hffs/ is released publicly. (Edit: it's now public.)

@severo (Collaborator, Author) commented on Dec 27, 2022

Should we use Dask (https://docs.dask.org/en/stable/generated/dask.dataframe.read_parquet.html) to read the parquet files, or is it better to use pyarrow directly?
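
For context, a minimal sketch of the two options (not the PR's code; the URL is the example parquet file quoted later in this thread, and Dask's HTTP support assumes the optional aiohttp dependency is installed):

import dask.dataframe as dd
import fsspec
import pyarrow.parquet as pq

url = "https://huggingface.co/datasets/super_glue/resolve/refs%2Fconvert%2Fparquet/axb/super_glue-test.parquet"

# Option 1: Dask builds a lazy dataframe; data is fetched only on .compute().
ddf = dd.read_parquet(url)

# Option 2: pyarrow reads just the footer metadata through fsspec.
with fsspec.open(url) as f:
    metadata = pq.read_metadata(f)
print(metadata.num_rows, metadata.num_columns)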

@albertvillanova (Member) left a comment

Awesome!!!

I like your idea of using the metadata in the Parquet footer.

It also contains the features info as additional metadata ;)

>>> json.loads(metadata.metadata[b"huggingface"])

{'info': {'features': {'sentence1': {'dtype': 'string', '_type': 'Value'},
   'sentence2': {'dtype': 'string', '_type': 'Value'},
   'idx': {'dtype': 'int32', '_type': 'Value'},
   'label': {'names': ['entailment', 'not_entailment'],
    '_type': 'ClassLabel'}}}}
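
As an aside, that serialized mapping can be turned back into a datasets.Features object; a sketch, reusing the metadata variable from the snippet under review below:

import json
from datasets import Features

# Rebuild Features from the "huggingface" key of the parquet footer metadata.
# Assumes `metadata` is the pyarrow FileMetaData read in the snippet below.
info = json.loads(metadata.metadata[b"huggingface"])["info"]
features = Features.from_dict(info["features"])
print(features["label"].names)  # ['entailment', 'not_entailment']

The snippet under review (from the PR diff):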

REVISION = "refs/convert/parquet"
fs = hffs.HfFileSystem(self.dataset, repo_type="dataset", revision=REVISION)
metadata = pq.read_metadata(f"{self.config}/{self.filename}", filesystem=fs)
# ^ are we streaming to only read the metadata in the footer, or is the whole parquet file read?
A Member replied:

> are we streaming to only read the metadata in the footer, or is the whole parquet file read?

I think pq.read_metadata only reads the metadata. However, when no filesystem is passed, it only accepts a local path string (not a remote URL) or a file-like object.

Our datasets streaming mode calls fsspec under the hood, analogously to:

import fsspec
import pyarrow.parquet as pq

url = "https://huggingface.co/datasets/super_glue/resolve/refs%2Fconvert%2Fparquet/axb/super_glue-test.parquet"
with fsspec.open(url) as f:
    metadata = pq.read_metadata(f)  # seeks to the footer; fsspec serves it via HTTP range requests
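
Recent pyarrow versions also accept a filesystem argument directly, which is what the path-based hffs call above relies on; a sketch using a plain fsspec HTTP filesystem (not the PR's code):

import fsspec
import pyarrow.parquet as pq

# Pass an fsspec filesystem so read_metadata accepts a URL/path
# instead of an already-open file object.
fs = fsspec.filesystem("https")
url = "https://huggingface.co/datasets/super_glue/resolve/refs%2Fconvert%2Fparquet/axb/super_glue-test.parquet"
metadata = pq.read_metadata(url, filesystem=fs)
print(metadata.num_row_groups)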

Commit messages from the 30 commits in this PR:

- it reads the remote parquet files (from the Hub) with hffs and pyarrow
- using libcommon 0.6.2, we implement get_new_splits to be able to create the children jobs. Also: ensure the type of the config and split (str) in /splits
- and also update the libraries to fix vulnerabilities (torch, gitpython)
- use only one definition. Also: remove the ignored vulnerabilities, since the dependencies have been updated
- we will add "stats" with more details in the parquet worker.
  BREAKING CHANGE: 🧨 change the /splits response (num_bytes and num_examples are removed)
- It's not very efficient, but we stay in the same architecture model. So: we first get the list of parquet files and the dataset-info for each config, then we copy each part to its own response
- it does not make sense to have them in libcommon, since we will come back to only one generic "worker"
- we don't check for its value anyway
- To test only one subpath, e.g. TEST_PATH=tests/test_one.py make test
@severo changed the title from "Create endpoint size" to "Create endpoint /dataset-info" on Jan 18, 2023
@severo merged commit b3ac6a1 into main on Jan 18, 2023
@severo deleted the create-endpoint-size branch on Jan 18, 2023 at 21:26