Loading mozilla-foundation/common_voice_7_0 dataset failed #4062

Closed
aapot opened this issue Mar 30, 2022 · 10 comments
Labels: dataset bug (A bug in a dataset script provided in the library)

aapot commented Mar 30, 2022

Describe the bug

I tried to load the mozilla-foundation/common_voice_7_0 dataset with the fi language and the test split using datasets in a Colab/Kaggle notebook, but loading fails with the error JSONDecodeError: [Errno Expecting value] Not Found: 0. The bug seems to affect other languages and splits too, not just fi and test.

Steps to reproduce the bug

from datasets import load_dataset
dataset = load_dataset("mozilla-foundation/common_voice_7_0", "fi", split="test", use_auth_token="YOUR TOKEN")

Expected results

The mozilla-foundation/common_voice_7_0 dataset loads successfully.

Actual results

JSONDecodeError                           Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/requests/models.py in json(self, **kwargs)
    909         try:
--> 910             return complexjson.loads(self.text, **kwargs)
    911         except JSONDecodeError as e:

/opt/conda/lib/python3.7/site-packages/simplejson/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, use_decimal, **kw)
    524             and not use_decimal and not kw):
--> 525         return _default_decoder.decode(s)
    526     if cls is None:

/opt/conda/lib/python3.7/site-packages/simplejson/decoder.py in decode(self, s, _w, _PY3)
    369             s = str(s, self.encoding)
--> 370         obj, end = self.raw_decode(s)
    371         end = _w(s, end).end()

/opt/conda/lib/python3.7/site-packages/simplejson/decoder.py in raw_decode(self, s, idx, _w, _PY3)
    399                 idx += 3
--> 400         return self.scan_once(s, idx=_w(s, idx).end())

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

JSONDecodeError                           Traceback (most recent call last)
/tmp/ipykernel_358/370980805.py in <module>
      1 # load Common Voice 7.0 dataset from Huggingface with Finnish "test" split
----> 2 test_dataset = load_dataset("mozilla-foundation/common_voice_7_0", "fi", split="test", use_auth_token=True)

/opt/conda/lib/python3.7/site-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, **config_kwargs)
   1690         ignore_verifications=ignore_verifications,
   1691         try_from_hf_gcs=try_from_hf_gcs,
-> 1692         use_auth_token=use_auth_token,
   1693     )
   1694 

/opt/conda/lib/python3.7/site-packages/datasets/builder.py in download_and_prepare(self, download_config, download_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, **download_and_prepare_kwargs)
    604                     if not downloaded_from_gcs:
    605                         self._download_and_prepare(
--> 606                             dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
    607                         )
    608                     # Sync info

/opt/conda/lib/python3.7/site-packages/datasets/builder.py in _download_and_prepare(self, dl_manager, verify_infos)
   1102 
   1103     def _download_and_prepare(self, dl_manager, verify_infos):
-> 1104         super()._download_and_prepare(dl_manager, verify_infos, check_duplicate_keys=verify_infos)
   1105 
   1106     def _get_examples_iterable_for_split(self, split_generator: SplitGenerator) -> ExamplesIterable:

/opt/conda/lib/python3.7/site-packages/datasets/builder.py in _download_and_prepare(self, dl_manager, verify_infos, **prepare_split_kwargs)
    670         split_dict = SplitDict(dataset_name=self.name)
    671         split_generators_kwargs = self._make_split_generators_kwargs(prepare_split_kwargs)
--> 672         split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
    673 
    674         # Checksums verification

~/.cache/huggingface/modules/datasets_modules/datasets/mozilla-foundation--common_voice_7_0/fe20cac47c166e25b1f096ab661832e3da7cf298ed4a91dcaa1343ad972d175b/common_voice_7_0.py in _split_generators(self, dl_manager)
    151 
    152         self._log_download(self.config.name, bundle_version, hf_auth_token)
--> 153         archive = dl_manager.download(self._get_bundle_url(self.config.name, bundle_url_template))
    154 
    155         if self.config.version < datasets.Version("5.0.0"):

~/.cache/huggingface/modules/datasets_modules/datasets/mozilla-foundation--common_voice_7_0/fe20cac47c166e25b1f096ab661832e3da7cf298ed4a91dcaa1343ad972d175b/common_voice_7_0.py in _get_bundle_url(self, locale, url_template)
    130         path = urllib.parse.quote(path.encode("utf-8"), safe="~()*!.'")
    131         use_cdn = self.config.size_bytes < 20 * 1024 * 1024 * 1024
--> 132         response = requests.get(f"{_API_URL}/bucket/dataset/{path}/{use_cdn}", timeout=10.0).json()
    133         return response["url"]
    134 

/opt/conda/lib/python3.7/site-packages/requests/models.py in json(self, **kwargs)
    915                 raise RequestsJSONDecodeError(e.message)
    916             else:
--> 917                 raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
    918 
    919     @property

JSONDecodeError: [Errno Expecting value] Not Found: 0

Environment info

  • datasets version: 2.0.0
  • Platform: Linux-5.10.90+-x86_64-with-debian-bullseye-sid
  • Python version: 3.7.12
  • PyArrow version: 5.0.0
  • Pandas version: 1.3.5
@aapot aapot added the bug Something isn't working label Mar 30, 2022
@albertvillanova albertvillanova self-assigned this Mar 30, 2022
@albertvillanova
Member

Hi @aapot, thanks for reporting.

We are investigating the cause of this issue. We will keep you informed.

@albertvillanova
Member

albertvillanova commented Mar 30, 2022

The failure comes from the HTTP request made on this line of the loading script:

response = requests.get(f"{_API_URL}/bucket/dataset/{path}/{use_cdn}", timeout=10.0).json()

The server answers with a 404 Not Found, so the response body is not JSON and .json() raises the JSONDecodeError.

The request succeeds if /{use_cdn} is removed from the URL.

Maybe there was a change in the Common Voice API?
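
To make this kind of failure easier to diagnose, here is a minimal sketch of a more defensive version of that request logic. It is not the code of the dataset script: the function names `bundle_request_url` and `extract_bundle_url`, the `example.invalid` host, and the choice of `RuntimeError` are all assumptions made for illustration. Only the URL layout and the `quote(...)` call are taken from the traceback above. The sketch works on a status code and body text, so it runs without network access.

```python
import json
import urllib.parse
from typing import Optional


def bundle_request_url(api_url: str, path: str, use_cdn: Optional[bool] = None) -> str:
    """Build the bundle request URL (hypothetical helper).

    The /{use_cdn} suffix is appended only when use_cdn is given, so the
    same function can produce the URL with or without it.
    """
    # Same quoting as in the traceback: "/" is not in `safe`, so it becomes %2F.
    quoted = urllib.parse.quote(path.encode("utf-8"), safe="~()*!.'")
    url = f"{api_url}/bucket/dataset/{quoted}"
    if use_cdn is not None:
        url = f"{url}/{use_cdn}"
    return url


def extract_bundle_url(status_code: int, body: str) -> str:
    """Return the "url" field of the API response, or raise a readable error."""
    # Check the HTTP status first, so a 404 is reported as a 404
    # instead of surfacing later as a JSON decoding error.
    if status_code != 200:
        raise RuntimeError(f"Bundle API returned HTTP {status_code}: {body[:200]!r}")
    try:
        payload = json.loads(body)
    except json.JSONDecodeError as err:
        raise RuntimeError(f"Bundle API returned a non-JSON body: {body[:200]!r}") from err
    return payload["url"]


if __name__ == "__main__":
    # Placeholder host and path, not the real Common Voice API.
    print(bundle_request_url("https://example.invalid", "cv/fi.tar.gz"))
    print(bundle_request_url("https://example.invalid", "cv/fi.tar.gz", use_cdn=True))
```

With requests, this would be called as `extract_bundle_url(response.status_code, response.text)` on the object returned by `requests.get(url, timeout=10.0)`, so a 404 shows up as "HTTP 404" rather than as "Expecting value".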

CC: @anton-l @patrickvonplaten @polinaeterna

@lhoestq lhoestq added dataset bug A bug in a dataset script provided in the library and removed bug Something isn't working labels Mar 30, 2022
@albertvillanova
Member

We have contacted the data owners of the Common Voice dataset by email.

@albertvillanova
Member

Hotfix: https://huggingface.co/datasets/mozilla-foundation/common_voice_7_0/commit/17b237961e4f7f84a2a0aea645abe5428a9d568e

@albertvillanova
Member

I have also applied the hotfix to all the other Common Voice script versions: 8.0, 6.1, 6.0, ..., 1.0

@ngoquanghuy99

Hey, is there anything new?
I could not load the dataset.

@patrickvonplaten
Contributor

cc @lhoestq @polinaeterna

@anton-l
Member

anton-l commented Jun 21, 2022

Hi @ngoquanghuy99! The dataset should load fine if you go through the following steps:

  1. Go to https://huggingface.co/datasets/mozilla-foundation/common_voice_9_0 and click "Access repository" if you see a message about sharing your contact information with Mozilla Foundation at the top of the page. If you've already done that then skip to step 2.
  2. Run the command huggingface-cli login in your terminal or notebook to authenticate your machine.
  3. Load the dataset with use_auth_token=True:
from datasets import load_dataset

dataset = load_dataset("mozilla-foundation/common_voice_9_0", "ab", use_auth_token=True)

@ngoquanghuy99

Thanks @anton-l
I could load the dataset now, but in another way.
Thanks anyways!

@abdullaha1rafi

Thanks @anton-l I could load the dataset now, but in another way. Thanks anyways!

Can you share the "another way" please?
