
Yelp not working #4005

Closed
patrickvonplaten opened this issue Mar 24, 2022 · 6 comments · Fixed by #4018
Dataset viewer issue for yelp_review_full

Link: https://huggingface.co/datasets/yelp_review_full/viewer/yelp_review_full/train

Doesn't work:

Server error
Status code:   400
Exception:     Error
Message:       line contains NULL

Am I the one who added this dataset? No

A seemingly identical copy of the dataset, https://huggingface.co/datasets/SetFit/yelp_review_full, works. The original one, https://huggingface.co/datasets/yelp_review_full, has >20K downloads.

@patrickvonplaten patrickvonplaten added the dataset-viewer Related to the dataset viewer on huggingface.co label Mar 24, 2022
@severo severo self-assigned this Mar 24, 2022
severo (Contributor) commented Mar 24, 2022

I don't think it's an issue with the dataset-viewer. Maybe @lhoestq or @albertvillanova could confirm.

>>> from datasets import load_dataset, DownloadMode
>>> import itertools
>>> # without streaming
>>> dataset = load_dataset("yelp_review_full", name="yelp_review_full", split="train", download_mode=DownloadMode.FORCE_REDOWNLOAD)

Downloading builder script: 4.39kB [00:00, 5.97MB/s]
Downloading metadata: 2.13kB [00:00, 3.14MB/s]
Downloading and preparing dataset yelp_review_full/yelp_review_full (download: 187.06 MiB, generated: 496.94 MiB, post-processed: Unknown size, total: 684.00 MiB) to /home/slesage/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/13c31a618ba62568ec8572a222a283dfc29a6517776a3ac5945fb508877dde43...
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.10k/1.10k [00:00<00:00, 1.39MB/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/slesage/hf/datasets/src/datasets/load.py", line 1687, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/slesage/hf/datasets/src/datasets/builder.py", line 605, in download_and_prepare
    self._download_and_prepare(
  File "/home/slesage/hf/datasets/src/datasets/builder.py", line 1104, in _download_and_prepare
    super()._download_and_prepare(dl_manager, verify_infos, check_duplicate_keys=verify_infos)
  File "/home/slesage/hf/datasets/src/datasets/builder.py", line 676, in _download_and_prepare
    verify_checksums(
  File "/home/slesage/hf/datasets/src/datasets/utils/info_utils.py", line 40, in verify_checksums
    raise NonMatchingChecksumError(error_msg + str(bad_urls))
datasets.utils.info_utils.NonMatchingChecksumError: Checksums didn't match for dataset source files:
['https://drive.google.com/uc?export=download&id=0Bz8a_Dbh9QhbZlU4dXhHTFhZQU0']

>>> # with streaming
>>> dataset = load_dataset("yelp_review_full", name="yelp_review_full", split="train", download_mode=DownloadMode.FORCE_REDOWNLOAD, streaming=True)

Downloading builder script: 4.39kB [00:00, 5.53MB/s]
Downloading metadata: 2.13kB [00:00, 3.14MB/s]
Traceback (most recent call last):
  File "/home/slesage/.pyenv/versions/datasets/lib/python3.8/site-packages/fsspec/implementations/http.py", line 375, in _info
    await _file_info(
  File "/home/slesage/.pyenv/versions/datasets/lib/python3.8/site-packages/fsspec/implementations/http.py", line 736, in _file_info
    r.raise_for_status()
  File "/home/slesage/.pyenv/versions/datasets/lib/python3.8/site-packages/aiohttp/client_reqrep.py", line 1000, in raise_for_status
    raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 403, message='Forbidden', url=URL('https://doc-0g-bs-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/gklhpdq1arj8v15qrg7ces34a8c3413d/1648144575000/07511006523564980941/*/0Bz8a_Dbh9QhbZlU4dXhHTFhZQU0?e=download')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/slesage/hf/datasets/src/datasets/load.py", line 1677, in load_dataset
    return builder_instance.as_streaming_dataset(
  File "/home/slesage/hf/datasets/src/datasets/builder.py", line 906, in as_streaming_dataset
    splits_generators = {sg.name: sg for sg in self._split_generators(dl_manager)}
  File "/home/slesage/.cache/huggingface/modules/datasets_modules/datasets/yelp_review_full/13c31a618ba62568ec8572a222a283dfc29a6517776a3ac5945fb508877dde43/yelp_review_full.py", line 102, in _split_generators
    data_dir = dl_manager.download_and_extract(my_urls)
  File "/home/slesage/hf/datasets/src/datasets/utils/streaming_download_manager.py", line 800, in download_and_extract
    return self.extract(self.download(url_or_urls))
  File "/home/slesage/hf/datasets/src/datasets/utils/streaming_download_manager.py", line 778, in extract
    urlpaths = map_nested(self._extract, path_or_paths, map_tuple=True)
  File "/home/slesage/hf/datasets/src/datasets/utils/py_utils.py", line 306, in map_nested
    return function(data_struct)
  File "/home/slesage/hf/datasets/src/datasets/utils/streaming_download_manager.py", line 783, in _extract
    protocol = _get_extraction_protocol(urlpath, use_auth_token=self.download_config.use_auth_token)
  File "/home/slesage/hf/datasets/src/datasets/utils/streaming_download_manager.py", line 372, in _get_extraction_protocol
    with fsspec.open(urlpath, **kwargs) as f:
  File "/home/slesage/.pyenv/versions/datasets/lib/python3.8/site-packages/fsspec/core.py", line 102, in __enter__
    f = self.fs.open(self.path, mode=mode)
  File "/home/slesage/.pyenv/versions/datasets/lib/python3.8/site-packages/fsspec/spec.py", line 978, in open
    f = self._open(
  File "/home/slesage/.pyenv/versions/datasets/lib/python3.8/site-packages/fsspec/implementations/http.py", line 335, in _open
    size = size or self.info(path, **kwargs)["size"]
  File "/home/slesage/.pyenv/versions/datasets/lib/python3.8/site-packages/fsspec/asyn.py", line 88, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/home/slesage/.pyenv/versions/datasets/lib/python3.8/site-packages/fsspec/asyn.py", line 69, in sync
    raise result[0]
  File "/home/slesage/.pyenv/versions/datasets/lib/python3.8/site-packages/fsspec/asyn.py", line 25, in _runner
    result[0] = await coro
  File "/home/slesage/.pyenv/versions/datasets/lib/python3.8/site-packages/fsspec/implementations/http.py", line 388, in _info
    raise FileNotFoundError(url) from exc
FileNotFoundError: https://drive.google.com/uc?export=download&id=0Bz8a_Dbh9QhbZlU4dXhHTFhZQU0&confirm=t

And this is before even trying to access the rows with

>>> rows = list(itertools.islice(dataset, 100))
>>> rows = list(dataset.take(100))

@severo severo removed their assignment Mar 24, 2022
@severo severo removed the dataset-viewer Related to the dataset viewer on huggingface.co label Mar 24, 2022
lhoestq (Member) commented Mar 24, 2022

Yet another issue related to Google Drive not being nice. Most likely your IP has been banned from using their API programmatically. Do you know if we are allowed to host and redistribute the data ourselves?

ankitk2109 commented

Hi,

Facing the same issue while loading the dataset:

Error: {NonMatchingChecksumError}Checksums didn't match for dataset source files

Thanks

lhoestq (Member) commented Mar 25, 2022

> Facing the same issue while loading the dataset:
>
> Error: {NonMatchingChecksumError}Checksums didn't match for dataset source files

Thanks for reporting. I think this is the same issue. Feel free to try again later, once Google Drive has stopped blocking you. You can retry by passing download_mode="force_redownload" to load_dataset.
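For anyone scripting around this, the retry can be automated with exponential backoff. A minimal sketch in pure Python — the `retry` helper is hypothetical, not part of the `datasets` library; the commented usage assumes the `load_dataset` call from this thread:

```python
import time


def retry(fn, attempts=3, base_delay=1.0, retriable=(Exception,)):
    """Call fn(), retrying with exponential backoff on retriable errors."""
    for attempt in range(attempts):
        try:
            return fn()
        except retriable:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** attempt)


# Hypothetical usage, forcing a fresh download each attempt:
# from datasets import load_dataset
# ds = retry(lambda: load_dataset("yelp_review_full", split="train",
#                                 download_mode="force_redownload"))
```

Note that this only helps with transient blocks; if Google Drive has banned the IP outright, every attempt will fail the checksum verification the same way.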

lhoestq (Member) commented Mar 25, 2022

I noticed that FastAI hosts the Yelp dataset at https://s3.amazonaws.com/fast-ai-nlp/yelp_review_full_csv.tgz (from their catalog).

Let's update the yelp dataset script to download from there instead of Google Drive.

lhoestq (Member) commented Mar 25, 2022

I updated the link so it no longer uses Google Drive. We will do a release early next week with the updated download URL of the dataset :)
