
Can't load a dataset #6261

Closed
joaopedrosdmm opened this issue Sep 26, 2023 · 5 comments

@joaopedrosdmm

Describe the bug

Can't seem to load the JourneyDB dataset.

It throws the following error:

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[15], line 2
     1 # If the dataset is gated/private, make sure you have run huggingface-cli login
----> 2 dataset = load_dataset("JourneyDB/JourneyDB", data_files="data", use_auth_token=True)

File /opt/conda/lib/python3.10/site-packages/datasets/load.py:1664, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, **config_kwargs)
  1661 ignore_verifications = ignore_verifications or save_infos
  1663 # Create a dataset builder
-> 1664 builder_instance = load_dataset_builder(
  1665     path=path,
  1666     name=name,
  1667     data_dir=data_dir,
  1668     data_files=data_files,
  1669     cache_dir=cache_dir,
  1670     features=features,
  1671     download_config=download_config,
  1672     download_mode=download_mode,
  1673     revision=revision,
  1674     use_auth_token=use_auth_token,
  1675     **config_kwargs,
  1676 )
  1678 # Return iterable dataset in case of streaming
  1679 if streaming:

File /opt/conda/lib/python3.10/site-packages/datasets/load.py:1490, in load_dataset_builder(path, name, data_dir, data_files, cache_dir, features, download_config, download_mode, revision, use_auth_token, **config_kwargs)
  1488     download_config = download_config.copy() if download_config else DownloadConfig()
  1489     download_config.use_auth_token = use_auth_token
-> 1490 dataset_module = dataset_module_factory(
  1491     path,
  1492     revision=revision,
  1493     download_config=download_config,
  1494     download_mode=download_mode,
  1495     data_dir=data_dir,
  1496     data_files=data_files,
  1497 )
  1499 # Get dataset builder class from the processing script
  1500 builder_cls = import_main_class(dataset_module.module_path)

File /opt/conda/lib/python3.10/site-packages/datasets/load.py:1238, in dataset_module_factory(path, revision, download_config, download_mode, force_local_path, dynamic_modules_path, data_dir, data_files, **download_kwargs)
  1236                 raise ConnectionError(f"Couln't reach the Hugging Face Hub for dataset '{path}': {e1}") from None
  1237             if isinstance(e1, FileNotFoundError):
-> 1238                 raise FileNotFoundError(
  1239                     f"Couldn't find a dataset script at {relative_to_absolute_path(combined_path)} or any data file in the same directory. "
  1240                     f"Couldn't find '{path}' on the Hugging Face Hub either: {type(e1).__name__}: {e1}"
  1241                 ) from None
  1242             raise e1 from None
  1243 else:

FileNotFoundError: Couldn't find a dataset script at /kaggle/working/JourneyDB/JourneyDB/JourneyDB.py or any data file in the same directory. Couldn't find 'JourneyDB/JourneyDB' on the Hugging Face Hub either: FileNotFoundError: Unable to find data in dataset repository JourneyDB/JourneyDB with any supported extension ['csv', 'tsv', 'json', 'jsonl', 'parquet', 'txt', 'blp', 'bmp', 'dib', 'bufr', 'cur', 'pcx', 'dcx', 'dds', 'ps', 'eps', 'fit', 'fits', 'fli', 'flc', 'ftc', 'ftu', 'gbr', 'gif', 'grib', 'h5', 'hdf', 'png', 'apng', 'jp2', 'j2k', 'jpc', 'jpf', 'jpx', 'j2c', 'icns', 'ico', 'im', 'iim', 'tif', 'tiff', 'jfif', 'jpe', 'jpg', 'jpeg', 'mpg', 'mpeg', 'msp', 'pcd', 'pxr', 'pbm', 'pgm', 'ppm', 'pnm', 'psd', 'bw', 'rgb', 'rgba', 'sgi', 'ras', 'tga', 'icb', 'vda', 'vst', 'webp', 'wmf', 'emf', 'xbm', 'xpm', 'zip']

Steps to reproduce the bug

from huggingface_hub import notebook_login
notebook_login()
!pip install -q datasets
from datasets import load_dataset

dataset = load_dataset("JourneyDB/JourneyDB", data_files="data", use_auth_token=True)

Expected behavior

Load the dataset

Environment info

Notebook

@joaopedrosdmm (Author)

I believe it's because `datasets` doesn't work with `.tgz` files.

@mariosasko (Collaborator) commented Sep 26, 2023

JourneyDB/JourneyDB is a gated dataset, so this error means you are not authenticated to access it, either by using an invalid token or by not agreeing to the terms in the dialog on the dataset page.

> I believe it's because `datasets` doesn't work with `.tgz` files.

Indeed, the dataset's data-file structure is not natively supported by `datasets`. To load it, one option is to clone the repo (or download it with `huggingface_hub.snapshot_download`) and use `Dataset.from_generator` to process the files.
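A minimal sketch of that suggestion, assuming the repo's `.tgz` archives contain JSON-lines annotation files. The `*.jsonl` member pattern and the record layout are assumptions, not the actual JourneyDB structure; adjust both after inspecting the downloaded files:

```python
# Sketch: download the gated repo and build a Dataset from its .tgz archives,
# since `datasets` cannot read this layout natively.
import json
import tarfile
from pathlib import Path


def iter_tgz_records(archive_path):
    """Yield one dict per JSON-lines record found inside a .tgz archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            # Assumption: annotations are stored as *.jsonl members.
            if member.isfile() and member.name.endswith(".jsonl"):
                for line in tar.extractfile(member):
                    yield json.loads(line)


def load_journeydb():
    # Requires being logged in (notebook_login / huggingface-cli login)
    # and having accepted the terms on the dataset page.
    from datasets import Dataset
    from huggingface_hub import snapshot_download

    local_dir = snapshot_download("JourneyDB/JourneyDB", repo_type="dataset")

    def gen():
        for archive in sorted(Path(local_dir).rglob("*.tgz")):
            yield from iter_tgz_records(archive)

    return Dataset.from_generator(gen)
```

For the image archives you would instead extract the files to disk (or yield raw bytes) and pair them with the annotations by filename.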

@joaopedrosdmm (Author)

> JourneyDB/JourneyDB is a gated dataset, so this error means you are not authenticated to access it, either by using an invalid token or by not agreeing to the terms in the dialog on the dataset page.

I did authentication with:

from huggingface_hub import notebook_login
notebook_login()

Isn't that the correct way to do it?

> Indeed, the dataset's data-file structure is not natively supported by `datasets`. To load it, one option is to clone the repo (or download it with `huggingface_hub.snapshot_download`) and use `Dataset.from_generator` to process the files.

Great suggestion I will give it a try.

@mariosasko (Collaborator)

Have you accepted the terms in the dialog here?

IIRC Kaggle preinstalls an outdated `datasets` version, so it's also a good idea to update it before importing `datasets` (and do the same for `huggingface_hub`).

@joaopedrosdmm (Author)

Sorry for the late reply. Yes, I did. Thanks for the tip!
