Issue with offline mode #4760

Closed
SaulLu opened this issue Jul 28, 2022 · 16 comments

@SaulLu
Contributor

SaulLu commented Jul 28, 2022

Describe the bug

I can't retrieve a cached dataset with offline mode enabled

Steps to reproduce the bug

To reproduce the issue, first run a script that caches the dataset:

import os
os.environ["HF_DATASETS_OFFLINE"] = "0"

import datasets

datasets.logging.set_verbosity_info()
ds_name = "SaulLu/toy_struc_dataset"
ds = datasets.load_dataset(ds_name)
print(ds)

then, you can try to reload it in offline mode:

import os
# Set the flag before importing datasets so the library sees it when it is imported.
os.environ["HF_DATASETS_OFFLINE"] = "1"

import datasets

datasets.logging.set_verbosity_info()
ds_name = "SaulLu/toy_struc_dataset"
ds = datasets.load_dataset(ds_name)
print(ds)
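
The two snippets above can also be driven from a single script. This is a minimal sketch, not part of the original report: `run_snippet`, `LOAD_SNIPPET` and `reproduce` are hypothetical helper names, and the only assumption about `datasets` is the one the report already relies on, namely that it honours the `HF_DATASETS_OFFLINE` environment variable. Each run uses a fresh interpreter so the flag is in place before `datasets` is imported.

```python
import os
import subprocess
import sys


def run_snippet(snippet: str, offline: bool) -> subprocess.CompletedProcess:
    """Run `snippet` in a fresh interpreter with HF_DATASETS_OFFLINE set."""
    env = dict(os.environ)
    env["HF_DATASETS_OFFLINE"] = "1" if offline else "0"
    return subprocess.run(
        [sys.executable, "-c", snippet],
        env=env,
        capture_output=True,
        text=True,
    )


# The loading code shared by both snippets, minus the env-var lines.
LOAD_SNIPPET = (
    "import datasets; "
    "print(datasets.load_dataset('SaulLu/toy_struc_dataset'))"
)


def reproduce() -> None:
    """Populate the cache online, then try to reload it offline."""
    online = run_snippet(LOAD_SNIPPET, offline=False)
    offline = run_snippet(LOAD_SNIPPET, offline=True)
    print("online exit code:", online.returncode)
    print("offline exit code:", offline.returncode)
    # Show the tail of the offline traceback, if there is one.
    print(offline.stderr[-500:])
```

Calling `reproduce()` needs network access for the first run and an installed `datasets`; a non-zero offline exit code would correspond to the failure described in this report.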

Expected results

I expected the second snippet to load the cached dataset without raising an error.

Actual results

The second snippet raises:

Traceback (most recent call last):
  File "/home/lucile_huggingface_co/sandbox/evaluate/test_cache_datasets.py", line 8, in <module>
    ds = datasets.load_dataset(ds_name)
  File "/home/lucile_huggingface_co/anaconda3/envs/evaluate-dev/lib/python3.8/site-packages/datasets/load.py", line 1723, in load_dataset
    builder_instance = load_dataset_builder(
  File "/home/lucile_huggingface_co/anaconda3/envs/evaluate-dev/lib/python3.8/site-packages/datasets/load.py", line 1500, in load_dataset_builder
    dataset_module = dataset_module_factory(
  File "/home/lucile_huggingface_co/anaconda3/envs/evaluate-dev/lib/python3.8/site-packages/datasets/load.py", line 1241, in dataset_module_factory
    raise ConnectionError(f"Couln't reach the Hugging Face Hub for dataset '{path}': {e1}") from None
ConnectionError: Couln't reach the Hugging Face Hub for dataset 'SaulLu/toy_struc_dataset': Offline mode is enabled.

Environment info

  • datasets version: 2.4.0
  • Platform: Linux-4.19.0-21-cloud-amd64-x86_64-with-glibc2.17
  • Python version: 3.8.13
  • PyArrow version: 8.0.0
  • Pandas version: 1.4.3

Maybe I'm misunderstanding how offline mode is meant to be used (see doc). Is that the case?

@SaulLu SaulLu added the bug Something isn't working label Jul 28, 2022
@albertvillanova
Member

Hi @SaulLu, thanks for reporting.

I think offline mode is not supported for datasets that contain only data files (without any loading script). I'm looking into this.

@albertvillanova albertvillanova self-assigned this Jul 28, 2022
@SaulLu
Contributor Author

SaulLu commented Jul 28, 2022

Thanks for your feedback!

To give you a little more info, if you don't set the offline mode flag, the script will load the cache. I first noticed this behavior with the evaluate library, and while trying to understand the downloading flow I realized that I had a similar error with datasets.

@albertvillanova
Member

This is an issue we have to fix.

@lhoestq
Member

lhoestq commented Jul 28, 2022

This is related to #3547

@thuzhf

thuzhf commented May 10, 2023

Still not fixed?

@lhoestq
Member

lhoestq commented May 11, 2023

#5331 will help fix this, as it updates the cache directory template to align it with the other datasets.

@ManuelFay
Contributor

Any updates?

@je-santos

I'm facing the same problem

@lhoestq
Member

lhoestq commented Jan 23, 2024

This issue has been fixed in datasets 2.16 by #6493. The cache is now working properly :)

You just have to update datasets:

pip install -U datasets
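
To confirm that the upgrade took effect, the installed version can be compared against 2.16. The following is a rough sketch using only the standard library; `parse_release`, `is_at_least` and `installed_datasets_version` are hypothetical helper names, and the parser deliberately ignores pre-release and dev suffixes rather than implementing full version semantics.

```python
from importlib import metadata


def parse_release(version: str) -> tuple:
    """Turn '2.17.0' or '2.16.1.dev0' into a tuple of leading integers."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(piece))
    return tuple(parts)


def is_at_least(installed: str, minimum: str) -> bool:
    """Compare two versions after padding the shorter one with zeros."""
    a, b = parse_release(installed), parse_release(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_datasets_version():
    """Return the installed datasets version, or None if it is not installed."""
    try:
        return metadata.version("datasets")
    except metadata.PackageNotFoundError:
        return None
```

For example, `is_at_least(installed_datasets_version() or "0", "2.16")` reports whether the environment already includes the release mentioned above.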

@lhoestq lhoestq closed this as completed Jan 23, 2024
@jaded0

jaded0 commented Feb 15, 2024

I'm on version 2.17.0, and this exact problem is still persisting.

@lhoestq
Member

lhoestq commented Feb 15, 2024

Can you share some code to reproduce your issue?

Also make sure your cache was populated with recent versions of datasets. Datasets cached with old versions may not be reloadable in offline mode, though we did our best to keep as much backward compatibility as possible.

@BramVanroy
Contributor

I'm not sure if this is related, @lhoestq, but I am experiencing a similar issue when using offline mode:

$ python -c "from datasets import load_dataset; load_dataset('openai_humaneval', split='test')"
$ HF_DATASETS_OFFLINE=1 python -c "from datasets import load_dataset; load_dataset('openai_humaneval', split='test')"
Using the latest cached version of the dataset since openai_humaneval couldn't be found on the Hugging Face Hub (offline mode is enabled).
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/dodrio/scratch/projects/2023_071/alignment-handbook/.venv/lib/python3.10/site-packages/datasets/load.py", line 2556, in load_dataset
    builder_instance = load_dataset_builder(
  File "/dodrio/scratch/projects/2023_071/alignment-handbook/.venv/lib/python3.10/site-packages/datasets/load.py", line 2265, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
  File "/dodrio/scratch/projects/2023_071/alignment-handbook/.venv/lib/python3.10/site-packages/datasets/packaged_modules/cache/cache.py", line 122, in __init__
    config_name, version, hash = _find_hash_in_cache(
  File "/dodrio/scratch/projects/2023_071/alignment-handbook/.venv/lib/python3.10/site-packages/datasets/packaged_modules/cache/cache.py", line 48, in _find_hash_in_cache
    raise ValueError(
ValueError: Couldn't find cache for openai_humaneval for config 'default'
Available configs in the cache: ['openai_humaneval']
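
When debugging an error like this, it can help to look at which config directories actually exist in the cache. The sketch below rests on assumptions that are not confirmed in this thread: that the cache lives under `~/.cache/huggingface/datasets` unless `HF_DATASETS_CACHE` is set, that each dataset directory holds one subdirectory per config, and that a `user/name` repository is stored as `user___name`. The helper names `default_cache_dir` and `cached_configs` are hypothetical.

```python
import os
from pathlib import Path


def default_cache_dir() -> Path:
    """Best-guess location of the datasets cache (assumed layout)."""
    override = os.environ.get("HF_DATASETS_CACHE")
    if override:
        return Path(override)
    return Path.home() / ".cache" / "huggingface" / "datasets"


def cached_configs(cache_dir: Path, dataset_name: str) -> list:
    """List config subdirectories found for `dataset_name`, if any."""
    # Assumption: "user/name" is stored on disk as "user___name".
    dataset_dir = cache_dir / dataset_name.replace("/", "___")
    if not dataset_dir.is_dir():
        return []
    return sorted(p.name for p in dataset_dir.iterdir() if p.is_dir())
```

Calling `cached_configs(default_cache_dir(), "openai_humaneval")` should then show the same names as the "Available configs in the cache" line. Passing one of those names explicitly as the second argument to `load_dataset` may serve as a workaround, though that is not confirmed in this thread.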

@lhoestq
Member

lhoestq commented Mar 19, 2024

Thanks for reporting, @BramVanroy. I managed to reproduce it and opened a fix here: #6741

@BramVanroy
Contributor

Awesome, thanks for the quick fix @lhoestq! Looking forward to updating my dependency version list.

@noforit

noforit commented Mar 25, 2024

Thanks for reporting @BramVanroy, I managed to reproduce and I opened a fix here: #6741

Thanks a lot! I have run into the same problem. Can I apply the code from your fix directly on top of the installed version? I noticed that the fix has not been merged yet. Will it affect other functionality?

@lhoestq
Member

lhoestq commented Mar 25, 2024

I just merged the fix. You can install datasets from source or wait for the patch release, which will be out in the coming days.
