Bad error message when trying to download gated dataset #5953

Closed
patrickvonplaten opened this issue Jun 14, 2023 · 8 comments · Fixed by #5954

Comments

@patrickvonplaten
Contributor

Describe the bug

When I try to download a gated model from the Hub without being logged in, I get a nice error message. E.g.:

Repository Not Found for url: https://huggingface.co/api/models/DeepFloyd/IF-I-XL-v1.0.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated.
Invalid username or password..
Will try to load from local cache.

If I do the same for a gated dataset on the Hub, I don't get a nice error message IMO:

File ~/hf/lib/python3.10/site-packages/fsspec/implementations/http.py:430, in HTTPFileSystem._info(self, url, **kwargs)
    427     except Exception as exc:
    428         if policy == "get":
    429             # If get failed, then raise a FileNotFoundError
--> 430             raise FileNotFoundError(url) from exc
    431         logger.debug(str(exc))
    433 return {"name": url, "size": None, **info, "type": "file"}

FileNotFoundError: https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0/resolve/main/n_shards.json

Steps to reproduce the bug

huggingface-cli logout

and then:

from datasets import load_dataset, Audio

# English
stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "en", split="test", streaming=True)
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
en_sample = next(iter(stream_data))["audio"]["array"]

# Swahili
stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "sw", split="test", streaming=True)
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
sw_sample = next(iter(stream_data))["audio"]["array"]

Expected behavior

Better error message
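
As a rough illustration of what a better message could look like, here is a minimal sketch that catches the bare `FileNotFoundError` and re-raises it with a login hint when the missing path is a Hub URL. The names `with_auth_hint`, `HUB_PREFIX` and `fake_gated_load` are hypothetical and not part of `datasets`; the loader is a stand-in so the snippet runs offline, and this is not a description of how the eventual fix works.

```python
from typing import Callable, TypeVar

T = TypeVar("T")

# Assumption: any URL under this prefix is treated as a Hub resource.
HUB_PREFIX = "https://huggingface.co/"


def with_auth_hint(load: Callable[[], T]) -> T:
    """Run `load`; re-raise Hub-related FileNotFoundError with a login hint."""
    try:
        return load()
    except FileNotFoundError as exc:
        # Only add the hint when the missing path looks like a Hub URL.
        if HUB_PREFIX in str(exc):
            raise FileNotFoundError(
                f"{exc}\n"
                "If the repo is private or gated, make sure to log in "
                "with `huggingface-cli login`."
            ) from exc
        # Unrelated missing files are passed through unchanged.
        raise


def fake_gated_load() -> None:
    """Hypothetical stand-in that fails the way the report above does."""
    raise FileNotFoundError(
        "https://huggingface.co/datasets/mozilla-foundation/"
        "common_voice_13_0/resolve/main/n_shards.json"
    )
```

In user code the same wrapper could be applied to the real call, for example `with_auth_hint(lambda: next(iter(stream_data)))`.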

Environment info

Copy-and-paste the text below in your GitHub issue.

  • datasets version: 2.12.0
  • Platform: Linux-6.2.0-76060200-generic-x86_64-with-glibc2.35
  • Python version: 3.10.6
  • Huggingface_hub version: 0.16.0.dev0
  • PyArrow version: 11.0.0
  • Pandas version: 1.5.3
@patrickvonplaten
Contributor Author

cc @sanchit-gandhi @Vaibhavs10 @lhoestq - this is mainly for demos that use Common Voice datasets as done here: https://github.com/facebookresearch/fairseq/tree/main/examples/mms#-transformers

@lhoestq
Member

lhoestq commented Jun 14, 2023

Hi! The error for me is:

FileNotFoundError: Couldn't find a dataset script at /content/mozilla-foundation/common_voice_13_0/common_voice_13_0.py or any data file in the same directory. Couldn't find 'mozilla-foundation/common_voice_13_0' on the Hugging Face Hub either: FileNotFoundError: Dataset 'mozilla-foundation/common_voice_13_0' doesn't exist on the Hub. If the repo is private or gated, make sure to log in with `huggingface-cli login`.

And tbh idk how you managed to get your error. "n_shards.json" is not even a thing in datasets

@Vaibhavs10
Member

Okay, I am able to reproduce @patrickvonplaten's original error: https://github.com/Vaibhavs10/scratchpad/blob/main/cv13_datasets_test.ipynb

Also not sure why it looks for n_shards.json

@lhoestq
Member

lhoestq commented Jun 14, 2023

Ok I see, this file is downloaded from the CV dataset script - let me investigate

@lhoestq
Member

lhoestq commented Jun 14, 2023

Ok I see: when you log out you no longer have access to the repository.

Therefore the dataset script is loaded from cache:

WARNING:datasets.load:Using the latest cached version of the module from /root/.cache/huggingface/modules/datasets_modules/datasets/mozilla-foundation--common_voice_13_0/22809012aac1fc9803eaffc44122e4149043748e93933935d5ea19898587e4d7 (last modified on Wed Jun 14 10:13:17 2023) since it couldn't be found locally at mozilla-foundation/common_voice_13_0., or remotely on the Hugging Face Hub.

and the script tries to download the n_shards.json but fails
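
The sequence above can be sketched as a toy model. Everything here is hypothetical (`HubAccessError`, `resolve_script`, `download_data_file` and the dict-based cache are illustrative stand-ins, not the real `datasets` internals); the point is only to show how a silent fallback to a cached script can hide the access problem until a later download fails.

```python
from typing import Dict


class HubAccessError(Exception):
    """Hypothetical stand-in for 'repo not found, private or gated'."""


def fetch_script_from_hub(repo_id: str, logged_in: bool) -> str:
    # Toy rule: a gated repo is only visible when logged in.
    if not logged_in:
        raise HubAccessError(
            f"Dataset '{repo_id}' doesn't exist on the Hub. If the repo is "
            "private or gated, make sure to log in with `huggingface-cli login`."
        )
    return f"script for {repo_id}"


def resolve_script(repo_id: str, logged_in: bool, script_cache: Dict[str, str]) -> str:
    try:
        script = fetch_script_from_hub(repo_id, logged_in)
    except HubAccessError:
        if repo_id in script_cache:
            # Fallback to the cached script: the clear error is swallowed here.
            return script_cache[repo_id]
        raise
    script_cache[repo_id] = script
    return script


def download_data_file(repo_id: str, filename: str, logged_in: bool) -> str:
    url = f"https://huggingface.co/datasets/{repo_id}/resolve/main/{filename}"
    if not logged_in:
        # The file fetch only reports a missing URL, with no login hint.
        raise FileNotFoundError(url)
    return url


def outcome(repo_id: str, logged_in: bool, script_cache: Dict[str, str]) -> str:
    """Classify what the caller would see for one load attempt."""
    try:
        resolve_script(repo_id, logged_in, script_cache)
        download_data_file(repo_id, "n_shards.json", logged_in)
    except HubAccessError:
        return "clear auth error"
    except FileNotFoundError:
        return "bare FileNotFoundError"
    return "ok"
```

In this model, a logged-out call with an empty cache gives the clear error, while a logged-out call after an earlier logged-in load gives the bare `FileNotFoundError`, which would be consistent with the two different errors reported in this thread.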

@lhoestq
Member

lhoestq commented Jun 14, 2023

Is #5954 OK for you?

I'll do a release this afternoon

@patrickvonplaten
Contributor Author

Cool!

@lhoestq
Member

lhoestq commented Jun 14, 2023

This is included in the new release, 2.13.0.
