Loading mozilla-foundation/common_voice_7_0 dataset failed #4062

aapot · 2022-03-30T11:39:41Z

Describe the bug

I wanted to load the mozilla-foundation/common_voice_7_0 dataset with the fi language and the test split using datasets in a Colab/Kaggle notebook, but loading fails with the error JSONDecodeError: [Errno Expecting value] Not Found: 0. The bug seems to affect other languages and splits as well, not just fi and test.

Steps to reproduce the bug

from datasets import load_dataset
dataset = load_dataset("mozilla-foundation/common_voice_7_0", "fi", split="test", use_auth_token="YOUR TOKEN")

Expected results

The mozilla-foundation/common_voice_7_0 dataset loads successfully.

Actual results

JSONDecodeError                           Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/requests/models.py in json(self, **kwargs)
    909         try:
--> 910             return complexjson.loads(self.text, **kwargs)
    911         except JSONDecodeError as e:

/opt/conda/lib/python3.7/site-packages/simplejson/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, use_decimal, **kw)
    524             and not use_decimal and not kw):
--> 525         return _default_decoder.decode(s)
    526     if cls is None:

/opt/conda/lib/python3.7/site-packages/simplejson/decoder.py in decode(self, s, _w, _PY3)
    369             s = str(s, self.encoding)
--> 370         obj, end = self.raw_decode(s)
    371         end = _w(s, end).end()

/opt/conda/lib/python3.7/site-packages/simplejson/decoder.py in raw_decode(self, s, idx, _w, _PY3)
    399                 idx += 3
--> 400         return self.scan_once(s, idx=_w(s, idx).end())

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

JSONDecodeError                           Traceback (most recent call last)
/tmp/ipykernel_358/370980805.py in <module>
      1 # load Common Voice 7.0 dataset from Huggingface with Finnish "test" split
----> 2 test_dataset = load_dataset("mozilla-foundation/common_voice_7_0", "fi", split="test", use_auth_token=True)

/opt/conda/lib/python3.7/site-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, **config_kwargs)
   1690         ignore_verifications=ignore_verifications,
   1691         try_from_hf_gcs=try_from_hf_gcs,
-> 1692         use_auth_token=use_auth_token,
   1693     )
   1694 

/opt/conda/lib/python3.7/site-packages/datasets/builder.py in download_and_prepare(self, download_config, download_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, **download_and_prepare_kwargs)
    604                     if not downloaded_from_gcs:
    605                         self._download_and_prepare(
--> 606                             dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
    607                         )
    608                     # Sync info

/opt/conda/lib/python3.7/site-packages/datasets/builder.py in _download_and_prepare(self, dl_manager, verify_infos)
   1102 
   1103     def _download_and_prepare(self, dl_manager, verify_infos):
-> 1104         super()._download_and_prepare(dl_manager, verify_infos, check_duplicate_keys=verify_infos)
   1105 
   1106     def _get_examples_iterable_for_split(self, split_generator: SplitGenerator) -> ExamplesIterable:

/opt/conda/lib/python3.7/site-packages/datasets/builder.py in _download_and_prepare(self, dl_manager, verify_infos, **prepare_split_kwargs)
    670         split_dict = SplitDict(dataset_name=self.name)
    671         split_generators_kwargs = self._make_split_generators_kwargs(prepare_split_kwargs)
--> 672         split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
    673 
    674         # Checksums verification

~/.cache/huggingface/modules/datasets_modules/datasets/mozilla-foundation--common_voice_7_0/fe20cac47c166e25b1f096ab661832e3da7cf298ed4a91dcaa1343ad972d175b/common_voice_7_0.py in _split_generators(self, dl_manager)
    151 
    152         self._log_download(self.config.name, bundle_version, hf_auth_token)
--> 153         archive = dl_manager.download(self._get_bundle_url(self.config.name, bundle_url_template))
    154 
    155         if self.config.version < datasets.Version("5.0.0"):

~/.cache/huggingface/modules/datasets_modules/datasets/mozilla-foundation--common_voice_7_0/fe20cac47c166e25b1f096ab661832e3da7cf298ed4a91dcaa1343ad972d175b/common_voice_7_0.py in _get_bundle_url(self, locale, url_template)
    130         path = urllib.parse.quote(path.encode("utf-8"), safe="~()*!.'")
    131         use_cdn = self.config.size_bytes < 20 * 1024 * 1024 * 1024
--> 132         response = requests.get(f"{_API_URL}/bucket/dataset/{path}/{use_cdn}", timeout=10.0).json()
    133         return response["url"]
    134 

/opt/conda/lib/python3.7/site-packages/requests/models.py in json(self, **kwargs)
    915                 raise RequestsJSONDecodeError(e.message)
    916             else:
--> 917                 raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
    918 
    919     @property

JSONDecodeError: [Errno Expecting value] Not Found: 0

Environment info

datasets version: 2.0.0
Platform: Linux-5.10.90+-x86_64-with-debian-bullseye-sid
Python version: 3.7.12
PyArrow version: 5.0.0
Pandas version: 1.3.5

albertvillanova · 2022-03-30T13:38:58Z

Hi @aapot, thanks for reporting.

We are investigating the cause of this issue. We will keep you informed.

albertvillanova · 2022-03-30T14:08:47Z

The HTTP request made on this line of the dataset script:

response = requests.get(f"{_API_URL}/bucket/dataset/{path}/{use_cdn}", timeout=10.0).json()

returns a 404 Not Found response, so the response body cannot be decoded as JSON.

The request succeeds when /{use_cdn} is removed from the URL.

Maybe there was a change in the Common Voice API?
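
The failure mode described above can be sketched without any network access. This is only an illustration under stated assumptions: `extract_bundle_url` is a hypothetical helper, not part of the dataset script, and the 404 body is assumed to be the plain text `Not Found`, as the final error message suggests. It shows why decoding such a body raises the "Expecting value" error, and how checking the status code first gives a more readable failure.

```python
import json


def extract_bundle_url(status_code: int, body: str) -> str:
    """Return the "url" field of a bundle API reply, or raise a readable error.

    Hypothetical helper for illustration; not part of the dataset script.
    """
    if status_code != 200:
        # An error reply such as 404 carries plain text, not JSON,
        # so trying to decode it would fail with a JSONDecodeError.
        raise RuntimeError(
            f"Bundle URL request failed with HTTP {status_code}: {body[:100]!r}"
        )
    return json.loads(body)["url"]


# Decoding a plain-text error body reproduces the kind of error in the traceback.
try:
    json.loads("Not Found")
except json.JSONDecodeError as err:
    decode_error = str(err)
```

With a real `requests` response, the same check would be applied to `response.status_code` and `response.text` before calling `.json()`.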

CC: @anton-l @patrickvonplaten @polinaeterna

albertvillanova · 2022-03-30T15:07:08Z

We have contacted the data owners of the Common Voice dataset by email.

albertvillanova · 2022-03-30T16:17:16Z

Hotfix: https://huggingface.co/datasets/mozilla-foundation/common_voice_7_0/commit/17b237961e4f7f84a2a0aea645abe5428a9d568e

albertvillanova · 2022-03-31T11:52:51Z

I have also applied the hotfix to all the other Common Voice script versions: 8.0, 6.1, 6.0, ..., 1.0

ngoquanghuy99 · 2022-06-19T13:04:59Z

Hey, is there anything new?
I could not load the dataset.

patrickvonplaten · 2022-06-20T16:52:40Z

cc @lhoestq @polinaeterna

anton-l · 2022-06-21T07:12:14Z

Hi @ngoquanghuy99! The dataset should load fine if you go through the following steps:

1. Go to https://huggingface.co/datasets/mozilla-foundation/common_voice_9_0 and click "Access repository" if you see a message about sharing your contact information with Mozilla Foundation at the top of the page. If you've already done that then skip to step 2.
2. Run the command huggingface-cli login in your terminal or notebook to authenticate your machine.
3. Load the dataset with use_auth_token=True:

from datasets import load_dataset

dataset = load_dataset("mozilla-foundation/common_voice_9_0", "ab", use_auth_token=True)

ngoquanghuy99 · 2022-06-21T07:36:23Z

Thanks @anton-l
I could load the dataset now, but in another way.
Thanks anyways!

abdullaha1rafi · 2024-06-09T12:12:45Z

Thanks @anton-l I could load the dataset now, but in another way. Thanks anyways!

Can you share the "another way" please?

aapot added the "bug" label (Something isn't working) Mar 30, 2022

albertvillanova self-assigned this Mar 30, 2022

lhoestq added the "dataset bug" label (A bug in a dataset script provided in the library) and removed the "bug" label (Something isn't working) Mar 30, 2022

albertvillanova closed this as completed Mar 31, 2022
