
Yelp not working #4005

Closed
patrickvonplaten opened this issue Mar 24, 2022 · 6 comments · Fixed by #4018
Dataset viewer issue for yelp_review_full

Link: https://huggingface.co/datasets/yelp_review_full/viewer/yelp_review_full/train

Doesn't work:

Server error
Status code:   400
Exception:     Error
Message:       line contains NULL

Am I the one who added this dataset? No

A seemingly identical copy of the dataset, https://huggingface.co/datasets/SetFit/yelp_review_full, works. The original one, https://huggingface.co/datasets/yelp_review_full, has >20K downloads.

@patrickvonplaten patrickvonplaten added the dataset-viewer Related to the dataset viewer on huggingface.co label Mar 24, 2022
@severo severo self-assigned this Mar 24, 2022
severo (Contributor) commented Mar 24, 2022

I don't think it's an issue with the dataset-viewer. Maybe @lhoestq or @albertvillanova could confirm.

>>> from datasets import load_dataset, DownloadMode
>>> import itertools
>>> # without streaming
>>> dataset = load_dataset("yelp_review_full", name="yelp_review_full", split="train", download_mode=DownloadMode.FORCE_REDOWNLOAD)

Downloading builder script: 4.39kB [00:00, 5.97MB/s]
Downloading metadata: 2.13kB [00:00, 3.14MB/s]
Downloading and preparing dataset yelp_review_full/yelp_review_full (download: 187.06 MiB, generated: 496.94 MiB, post-processed: Unknown size, total: 684.00 MiB) to /home/slesage/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/13c31a618ba62568ec8572a222a283dfc29a6517776a3ac5945fb508877dde43...
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.10k/1.10k [00:00<00:00, 1.39MB/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/slesage/hf/datasets/src/datasets/load.py", line 1687, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/slesage/hf/datasets/src/datasets/builder.py", line 605, in download_and_prepare
    self._download_and_prepare(
  File "/home/slesage/hf/datasets/src/datasets/builder.py", line 1104, in _download_and_prepare
    super()._download_and_prepare(dl_manager, verify_infos, check_duplicate_keys=verify_infos)
  File "/home/slesage/hf/datasets/src/datasets/builder.py", line 676, in _download_and_prepare
    verify_checksums(
  File "/home/slesage/hf/datasets/src/datasets/utils/info_utils.py", line 40, in verify_checksums
    raise NonMatchingChecksumError(error_msg + str(bad_urls))
datasets.utils.info_utils.NonMatchingChecksumError: Checksums didn't match for dataset source files:
['https://drive.google.com/uc?export=download&id=0Bz8a_Dbh9QhbZlU4dXhHTFhZQU0']

>>> # with streaming
>>> dataset = load_dataset("yelp_review_full", name="yelp_review_full", split="train", download_mode=DownloadMode.FORCE_REDOWNLOAD, streaming=True)

Downloading builder script: 4.39kB [00:00, 5.53MB/s]
Downloading metadata: 2.13kB [00:00, 3.14MB/s]
Traceback (most recent call last):
  File "/home/slesage/.pyenv/versions/datasets/lib/python3.8/site-packages/fsspec/implementations/http.py", line 375, in _info
    await _file_info(
  File "/home/slesage/.pyenv/versions/datasets/lib/python3.8/site-packages/fsspec/implementations/http.py", line 736, in _file_info
    r.raise_for_status()
  File "/home/slesage/.pyenv/versions/datasets/lib/python3.8/site-packages/aiohttp/client_reqrep.py", line 1000, in raise_for_status
    raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 403, message='Forbidden', url=URL('https://doc-0g-bs-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/gklhpdq1arj8v15qrg7ces34a8c3413d/1648144575000/07511006523564980941/*/0Bz8a_Dbh9QhbZlU4dXhHTFhZQU0?e=download')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/slesage/hf/datasets/src/datasets/load.py", line 1677, in load_dataset
    return builder_instance.as_streaming_dataset(
  File "/home/slesage/hf/datasets/src/datasets/builder.py", line 906, in as_streaming_dataset
    splits_generators = {sg.name: sg for sg in self._split_generators(dl_manager)}
  File "/home/slesage/.cache/huggingface/modules/datasets_modules/datasets/yelp_review_full/13c31a618ba62568ec8572a222a283dfc29a6517776a3ac5945fb508877dde43/yelp_review_full.py", line 102, in _split_generators
    data_dir = dl_manager.download_and_extract(my_urls)
  File "/home/slesage/hf/datasets/src/datasets/utils/streaming_download_manager.py", line 800, in download_and_extract
    return self.extract(self.download(url_or_urls))
  File "/home/slesage/hf/datasets/src/datasets/utils/streaming_download_manager.py", line 778, in extract
    urlpaths = map_nested(self._extract, path_or_paths, map_tuple=True)
  File "/home/slesage/hf/datasets/src/datasets/utils/py_utils.py", line 306, in map_nested
    return function(data_struct)
  File "/home/slesage/hf/datasets/src/datasets/utils/streaming_download_manager.py", line 783, in _extract
    protocol = _get_extraction_protocol(urlpath, use_auth_token=self.download_config.use_auth_token)
  File "/home/slesage/hf/datasets/src/datasets/utils/streaming_download_manager.py", line 372, in _get_extraction_protocol
    with fsspec.open(urlpath, **kwargs) as f:
  File "/home/slesage/.pyenv/versions/datasets/lib/python3.8/site-packages/fsspec/core.py", line 102, in __enter__
    f = self.fs.open(self.path, mode=mode)
  File "/home/slesage/.pyenv/versions/datasets/lib/python3.8/site-packages/fsspec/spec.py", line 978, in open
    f = self._open(
  File "/home/slesage/.pyenv/versions/datasets/lib/python3.8/site-packages/fsspec/implementations/http.py", line 335, in _open
    size = size or self.info(path, **kwargs)["size"]
  File "/home/slesage/.pyenv/versions/datasets/lib/python3.8/site-packages/fsspec/asyn.py", line 88, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/home/slesage/.pyenv/versions/datasets/lib/python3.8/site-packages/fsspec/asyn.py", line 69, in sync
    raise result[0]
  File "/home/slesage/.pyenv/versions/datasets/lib/python3.8/site-packages/fsspec/asyn.py", line 25, in _runner
    result[0] = await coro
  File "/home/slesage/.pyenv/versions/datasets/lib/python3.8/site-packages/fsspec/implementations/http.py", line 388, in _info
    raise FileNotFoundError(url) from exc
FileNotFoundError: https://drive.google.com/uc?export=download&id=0Bz8a_Dbh9QhbZlU4dXhHTFhZQU0&confirm=t

And this is before even trying to access the rows with

>>> rows = list(itertools.islice(dataset, 100))
>>> rows = list(dataset.take(100))

@severo severo removed their assignment Mar 24, 2022
@severo severo removed the dataset-viewer Related to the dataset viewer on huggingface.co label Mar 24, 2022
lhoestq (Member) commented Mar 24, 2022

Yet another issue related to Google Drive not being nice. Most likely your IP has been banned from using their API programmatically. Do you know if we are allowed to host and redistribute the data ourselves?

ankitk2109 commented

Hi,

Facing the same issue while loading the dataset:

Error: {NonMatchingChecksumError}Checksums didn't match for dataset source files

Thanks

lhoestq (Member) commented Mar 25, 2022

> Facing the same issue while loading the dataset:
>
> Error: {NonMatchingChecksumError}Checksums didn't match for dataset source files

Thanks for reporting. I think this is the same issue. Feel free to try again later, once Google Drive has stopped blocking you. You can retry by passing download_mode="force_redownload" to load_dataset.
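For anyone scripting around this, the retry can be automated with exponential backoff. A minimal sketch in pure Python — the `retry` helper is hypothetical, not part of the `datasets` library; the commented usage assumes the `load_dataset` call from this thread:

```python
import time


def retry(fn, attempts=3, base_delay=1.0, retriable=(Exception,)):
    """Call fn(), retrying with exponential backoff on retriable errors."""
    for attempt in range(attempts):
        try:
            return fn()
        except retriable:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** attempt)


# Hypothetical usage, forcing a fresh download each attempt:
# from datasets import load_dataset
# ds = retry(lambda: load_dataset("yelp_review_full", split="train",
#                                 download_mode="force_redownload"))
```

Note that this only helps with transient blocks; if Google Drive has banned the IP outright, every attempt will fail the checksum verification the same way.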

lhoestq (Member) commented Mar 25, 2022

I noticed that FastAI hosts the Yelp dataset at https://s3.amazonaws.com/fast-ai-nlp/yelp_review_full_csv.tgz (from their catalog).

Let's update the yelp dataset script to download from there instead of Google Drive.

lhoestq (Member) commented Mar 25, 2022

I updated the link so it no longer uses Google Drive. We will do a release early next week with the updated download URL of the dataset :)
