Yelp not working #4005
I don't think it's an issue with the dataset-viewer. Maybe @lhoestq or @albertvillanova could confirm.

>>> from datasets import load_dataset, DownloadMode
>>> import itertools
>>> # without streaming
>>> dataset = load_dataset("yelp_review_full", name="yelp_review_full", split="train", download_mode=DownloadMode.FORCE_REDOWNLOAD)
Downloading builder script: 4.39kB [00:00, 5.97MB/s]
Downloading metadata: 2.13kB [00:00, 3.14MB/s]
Downloading and preparing dataset yelp_review_full/yelp_review_full (download: 187.06 MiB, generated: 496.94 MiB, post-processed: Unknown size, total: 684.00 MiB) to /home/slesage/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/13c31a618ba62568ec8572a222a283dfc29a6517776a3ac5945fb508877dde43...
Downloading data: 100%|██████████| 1.10k/1.10k [00:00<00:00, 1.39MB/s]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/slesage/hf/datasets/src/datasets/load.py", line 1687, in load_dataset
builder_instance.download_and_prepare(
File "/home/slesage/hf/datasets/src/datasets/builder.py", line 605, in download_and_prepare
self._download_and_prepare(
File "/home/slesage/hf/datasets/src/datasets/builder.py", line 1104, in _download_and_prepare
super()._download_and_prepare(dl_manager, verify_infos, check_duplicate_keys=verify_infos)
File "/home/slesage/hf/datasets/src/datasets/builder.py", line 676, in _download_and_prepare
verify_checksums(
File "/home/slesage/hf/datasets/src/datasets/utils/info_utils.py", line 40, in verify_checksums
raise NonMatchingChecksumError(error_msg + str(bad_urls))
datasets.utils.info_utils.NonMatchingChecksumError: Checksums didn't match for dataset source files:
['https://drive.google.com/uc?export=download&id=0Bz8a_Dbh9QhbZlU4dXhHTFhZQU0']
>>> # with streaming
>>> dataset = load_dataset("yelp_review_full", name="yelp_review_full", split="train", download_mode=DownloadMode.FORCE_REDOWNLOAD, streaming=True)
Downloading builder script: 4.39kB [00:00, 5.53MB/s]
Downloading metadata: 2.13kB [00:00, 3.14MB/s]
Traceback (most recent call last):
File "/home/slesage/.pyenv/versions/datasets/lib/python3.8/site-packages/fsspec/implementations/http.py", line 375, in _info
await _file_info(
File "/home/slesage/.pyenv/versions/datasets/lib/python3.8/site-packages/fsspec/implementations/http.py", line 736, in _file_info
r.raise_for_status()
File "/home/slesage/.pyenv/versions/datasets/lib/python3.8/site-packages/aiohttp/client_reqrep.py", line 1000, in raise_for_status
raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 403, message='Forbidden', url=URL('https://doc-0g-bs-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/gklhpdq1arj8v15qrg7ces34a8c3413d/1648144575000/07511006523564980941/*/0Bz8a_Dbh9QhbZlU4dXhHTFhZQU0?e=download')
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/slesage/hf/datasets/src/datasets/load.py", line 1677, in load_dataset
return builder_instance.as_streaming_dataset(
File "/home/slesage/hf/datasets/src/datasets/builder.py", line 906, in as_streaming_dataset
splits_generators = {sg.name: sg for sg in self._split_generators(dl_manager)}
File "/home/slesage/.cache/huggingface/modules/datasets_modules/datasets/yelp_review_full/13c31a618ba62568ec8572a222a283dfc29a6517776a3ac5945fb508877dde43/yelp_review_full.py", line 102, in _split_generators
data_dir = dl_manager.download_and_extract(my_urls)
File "/home/slesage/hf/datasets/src/datasets/utils/streaming_download_manager.py", line 800, in download_and_extract
return self.extract(self.download(url_or_urls))
File "/home/slesage/hf/datasets/src/datasets/utils/streaming_download_manager.py", line 778, in extract
urlpaths = map_nested(self._extract, path_or_paths, map_tuple=True)
File "/home/slesage/hf/datasets/src/datasets/utils/py_utils.py", line 306, in map_nested
return function(data_struct)
File "/home/slesage/hf/datasets/src/datasets/utils/streaming_download_manager.py", line 783, in _extract
protocol = _get_extraction_protocol(urlpath, use_auth_token=self.download_config.use_auth_token)
File "/home/slesage/hf/datasets/src/datasets/utils/streaming_download_manager.py", line 372, in _get_extraction_protocol
with fsspec.open(urlpath, **kwargs) as f:
File "/home/slesage/.pyenv/versions/datasets/lib/python3.8/site-packages/fsspec/core.py", line 102, in __enter__
f = self.fs.open(self.path, mode=mode)
File "/home/slesage/.pyenv/versions/datasets/lib/python3.8/site-packages/fsspec/spec.py", line 978, in open
f = self._open(
File "/home/slesage/.pyenv/versions/datasets/lib/python3.8/site-packages/fsspec/implementations/http.py", line 335, in _open
size = size or self.info(path, **kwargs)["size"]
File "/home/slesage/.pyenv/versions/datasets/lib/python3.8/site-packages/fsspec/asyn.py", line 88, in wrapper
return sync(self.loop, func, *args, **kwargs)
File "/home/slesage/.pyenv/versions/datasets/lib/python3.8/site-packages/fsspec/asyn.py", line 69, in sync
raise result[0]
File "/home/slesage/.pyenv/versions/datasets/lib/python3.8/site-packages/fsspec/asyn.py", line 25, in _runner
result[0] = await coro
File "/home/slesage/.pyenv/versions/datasets/lib/python3.8/site-packages/fsspec/implementations/http.py", line 388, in _info
raise FileNotFoundError(url) from exc
FileNotFoundError: https://drive.google.com/uc?export=download&id=0Bz8a_Dbh9QhbZlU4dXhHTFhZQU0&confirm=t

And this is before even trying to access the rows with:

>>> rows = list(itertools.islice(dataset, 100))
>>> rows = list(dataset.take(100))
Yet another issue related to Google Drive not being nice. Most likely your IP has been banned from using their API programmatically. Do you know if we are allowed to host and redistribute the data ourselves?
Hi, I'm facing the same issue while loading the dataset.
Thanks
Thanks for reporting. I think this is the same issue. Feel free to try again later, once Google Drive has stopped blocking you. You can retry by passing download_mode=DownloadMode.FORCE_REDOWNLOAD, as in the snippet above.
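For reference, the retry is simply the same call as in the reproduction above, issued again once the block is lifted; nothing here is new API:

>>> from datasets import load_dataset, DownloadMode
>>> # force a fresh download instead of reusing previously cached (and possibly corrupted) files
>>> dataset = load_dataset("yelp_review_full", split="train", download_mode=DownloadMode.FORCE_REDOWNLOAD)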
I noticed that FastAI hosts the Yelp dataset at https://s3.amazonaws.com/fast-ai-nlp/yelp_review_full_csv.tgz (from their catalog here). Let's update the yelp dataset script to download from there instead of Google Drive; a sketch of the change is below.
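For illustration only, a minimal sketch of what that change could look like in yelp_review_full.py. The _split_generators method and the dl_manager.download_and_extract call appear in the traceback above, but the variable names, the yelp_review_full_csv/ directory layout inside the tarball, and the CSV file names are assumptions, not the official script:

# Hypothetical excerpt of yelp_review_full.py: swap the Google Drive URL for the
# FastAI S3 mirror. Assumes the usual `import os` / `import datasets` at the top
# of the script and the standard GeneratorBasedBuilder structure.
_DOWNLOAD_URL = "https://s3.amazonaws.com/fast-ai-nlp/yelp_review_full_csv.tgz"

def _split_generators(self, dl_manager):
    # download_and_extract fetches the .tgz and returns the path to the extracted folder
    data_dir = dl_manager.download_and_extract(_DOWNLOAD_URL)
    return [
        datasets.SplitGenerator(
            name=datasets.Split.TRAIN,
            gen_kwargs={"filepath": os.path.join(data_dir, "yelp_review_full_csv", "train.csv")},
        ),
        datasets.SplitGenerator(
            name=datasets.Split.TEST,
            gen_kwargs={"filepath": os.path.join(data_dir, "yelp_review_full_csv", "test.csv")},
        ),
    ]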
I updated the link so it no longer uses Google Drive; we will do a release early next week with the updated download URL of the dataset :)
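Once that release is out, the calls from the reproduction above should work again without any change, e.g.:

>>> from datasets import load_dataset
>>> dataset = load_dataset("yelp_review_full", split="train")
>>> streamed = load_dataset("yelp_review_full", split="train", streaming=True)
>>> rows = list(streamed.take(100))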
Dataset viewer issue for 'yelp_review_full'
Link: https://huggingface.co/datasets/yelp_review_full/viewer/yelp_review_full/train
Doesn't work:
Am I the one who added this dataset? No
A seemingly identical copy of the dataset, https://huggingface.co/datasets/SetFit/yelp_review_full, works. The original one, https://huggingface.co/datasets/yelp_review_full, has > 20K downloads.
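As a stopgap until the fix is released, the mirrored copy can presumably be loaded the same way (I have not verified that its splits and schema match the original):

>>> from datasets import load_dataset
>>> # hypothetical workaround: load the SetFit mirror instead of the broken original
>>> dataset = load_dataset("SetFit/yelp_review_full", split="train")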