
[Bug]504 Server Error when loading dataset which was already cached #5831

Open
SingL3 opened this issue May 9, 2023 · 6 comments
SingL3 commented May 9, 2023

Describe the bug

I have already cached the dataset using:

dataset = load_dataset("databricks/databricks-dolly-15k",
                        cache_dir="/mnt/data/llm/datasets/databricks-dolly-15k")

After that, I tried to load it again on the same machine and got this error:

Traceback (most recent call last):
  File "/mnt/home/llm/pythia/train.py", line 16, in <module>
    dataset = load_dataset("databricks/databricks-dolly-15k",
  File "/mnt/data/conda/envs/pythia_ft/lib/python3.9/site-packages/datasets/load.py", line 1773, in load_dataset
    builder_instance = load_dataset_builder(
  File "/mnt/data/conda/envs/pythia_ft/lib/python3.9/site-packages/datasets/load.py", line 1502, in load_dataset_builder
    dataset_module = dataset_module_factory(
  File "/mnt/data/conda/envs/pythia_ft/lib/python3.9/site-packages/datasets/load.py", line 1219, in dataset_module_factory
    raise e1 from None
  File "/mnt/data/conda/envs/pythia_ft/lib/python3.9/site-packages/datasets/load.py", line 1186, in dataset_module_factory
    raise e
  File "/mnt/data/conda/envs/pythia_ft/lib/python3.9/site-packages/datasets/load.py", line 1160, in dataset_module_factory
    dataset_info = hf_api.dataset_info(
  File "/mnt/data/conda/envs/pythia_ft/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 120, in _inner_fn
    return fn(*args, **kwargs)
  File "/mnt/data/conda/envs/pythia_ft/lib/python3.9/site-packages/huggingface_hub/hf_api.py", line 1667, in dataset_info
    hf_raise_for_status(r)
  File "/mnt/data/conda/envs/pythia_ft/lib/python3.9/site-packages/huggingface_hub/utils/_errors.py", line 301, in hf_raise_for_status
    raise HfHubHTTPError(str(e), response=response) from e
huggingface_hub.utils._errors.HfHubHTTPError: 504 Server Error: Gateway Time-out for url: https://huggingface.co/api/datasets/databricks/databricks-dolly-15k

Steps to reproduce the bug

  1. Cache the databricks/databricks-dolly-15k dataset using load_dataset, setting a cache_dir.
  2. Call load_dataset again, setting the same cache_dir.

Expected behavior

Dataset loads successfully.

Environment info

  • datasets version: 2.12.0
  • Platform: Linux-4.18.0-372.16.1.el8_6.x86_64-x86_64-with-glibc2.27
  • Python version: 3.9.16
  • Huggingface_hub version: 0.14.1
  • PyArrow version: 11.0.0
  • Pandas version: 1.5.3
lucidBrot commented May 9, 2023

I am experiencing the same problem with the following environment:

  • datasets version: 2.11.0
  • Platform: Linux 5.19.0-41-generic x86_64 GNU/Linux
  • Python version: 3.8.5
  • Huggingface_hub version: 0.13.3
  • PyArrow version: 11.0.0
  • Pandas version: 1.5.3

Trying to get some diagnostics, I got the following:

>>> from huggingface_hub import scan_cache_dir
>>> sd = scan_cache_dir()
>>> sd
HFCacheInfo(size_on_disk=0, repos=frozenset(), warnings=[CorruptedCacheException('Repo path is not a directory: /home/myname/.cache/huggingface/hub/version_diffusers_cache.txt')])

However, that might also be because I had tried to manually specify the cache_dir and that resulted in trying to download the dataset again ... but into a folder one level higher up than it should have.

Note that my issue is with the huggan/wikiart dataset, so it is not a dataset-specific issue.

@TobiasLee
Same problem with a private dataset repo; it seems the Hugging Face Hub server is having connection problems?

qmeeus commented May 9, 2023

Yes, dataset server seems down for now

@mariosasko (Collaborator)

@SingL3 You can avoid this error by setting the HF_DATASETS_OFFLINE environment variable to 1. By default, if an internet connection is available, we check whether the local cache of a dataset is up to date.
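For reference, a minimal sketch of that suggestion. Note that `datasets` reads this variable when it is imported, so it needs to be set beforehand (either in the shell or at the top of the script); the `load_dataset` call shown commented out is the one from the original report.

```python
import os

# HF_DATASETS_OFFLINE=1 makes `datasets` use only the local cache and skip
# the Hub request that failed with the 504 above. Set it BEFORE importing
# `datasets`, since the library reads it at import time.
os.environ["HF_DATASETS_OFFLINE"] = "1"

# from datasets import load_dataset
# dataset = load_dataset("databricks/databricks-dolly-15k",
#                        cache_dir="/mnt/data/llm/datasets/databricks-dolly-15k")
```

Equivalently, `export HF_DATASETS_OFFLINE=1` in the shell before running the training script.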

@lucidBrot datasets' cache is still not aligned with huggingface_hub's. We plan to align it eventually.

@albertvillanova (Member)

Today we had a big incident affecting the Hugging Face Hub, which caused all the 504 Server Error: Gateway Time-out errors.

It is fixed now and loading your datasets should work as expected.

SingL3 closed this as completed May 9, 2023
SingL3 (Author) commented May 10, 2023

Hi @albertvillanova.
If there is a locally cached version of a dataset (or files cached via huggingface_hub) and a network problem occurs (on either the client or the server side), wouldn't it be better to fall back to the current cached version rather than raise an exception and exit?
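A rough sketch of that fallback idea. This is a hypothetical wrapper, not part of the `datasets` API: `load_fn` stands in for `datasets.load_dataset`, and since `datasets` may read HF_DATASETS_OFFLINE at import time, a real implementation inside the library would toggle its own offline flag rather than the environment variable.

```python
import os

def load_with_cache_fallback(load_fn, *args, **kwargs):
    """Try an online load first; on a network/server error, retry in
    offline mode so the locally cached copy (if present) is used.
    Hypothetical helper -- not part of the datasets library."""
    try:
        return load_fn(*args, **kwargs)
    except Exception:  # e.g. HfHubHTTPError: 504 Gateway Time-out
        os.environ["HF_DATASETS_OFFLINE"] = "1"
        try:
            return load_fn(*args, **kwargs)
        finally:
            os.environ.pop("HF_DATASETS_OFFLINE", None)
```

The second attempt only succeeds if the dataset is already in the cache; otherwise the original exception surfaces again, which matches the behavior being asked for here.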

SingL3 reopened this May 10, 2023