
Can't load a dataset #6261

Closed
joaopedrosdmm opened this issue Sep 26, 2023 · 5 comments

@joaopedrosdmm

Describe the bug

Can't seem to load the JourneyDB dataset.

It throws the following error:

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[15], line 2
     1 # If the dataset is gated/private, make sure you have run huggingface-cli login
----> 2 dataset = load_dataset("JourneyDB/JourneyDB", data_files="data", use_auth_token=True)

File /opt/conda/lib/python3.10/site-packages/datasets/load.py:1664, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, **config_kwargs)
  1661 ignore_verifications = ignore_verifications or save_infos
  1663 # Create a dataset builder
-> 1664 builder_instance = load_dataset_builder(
  1665     path=path,
  1666     name=name,
  1667     data_dir=data_dir,
  1668     data_files=data_files,
  1669     cache_dir=cache_dir,
  1670     features=features,
  1671     download_config=download_config,
  1672     download_mode=download_mode,
  1673     revision=revision,
  1674     use_auth_token=use_auth_token,
  1675     **config_kwargs,
  1676 )
  1678 # Return iterable dataset in case of streaming
  1679 if streaming:

File /opt/conda/lib/python3.10/site-packages/datasets/load.py:1490, in load_dataset_builder(path, name, data_dir, data_files, cache_dir, features, download_config, download_mode, revision, use_auth_token, **config_kwargs)
  1488     download_config = download_config.copy() if download_config else DownloadConfig()
  1489     download_config.use_auth_token = use_auth_token
-> 1490 dataset_module = dataset_module_factory(
  1491     path,
  1492     revision=revision,
  1493     download_config=download_config,
  1494     download_mode=download_mode,
  1495     data_dir=data_dir,
  1496     data_files=data_files,
  1497 )
  1499 # Get dataset builder class from the processing script
  1500 builder_cls = import_main_class(dataset_module.module_path)

File /opt/conda/lib/python3.10/site-packages/datasets/load.py:1238, in dataset_module_factory(path, revision, download_config, download_mode, force_local_path, dynamic_modules_path, data_dir, data_files, **download_kwargs)
  1236                 raise ConnectionError(f"Couln't reach the Hugging Face Hub for dataset '{path}': {e1}") from None
  1237             if isinstance(e1, FileNotFoundError):
-> 1238                 raise FileNotFoundError(
  1239                     f"Couldn't find a dataset script at {relative_to_absolute_path(combined_path)} or any data file in the same directory. "
  1240                     f"Couldn't find '{path}' on the Hugging Face Hub either: {type(e1).__name__}: {e1}"
  1241                 ) from None
  1242             raise e1 from None
  1243 else:

FileNotFoundError: Couldn't find a dataset script at /kaggle/working/JourneyDB/JourneyDB/JourneyDB.py or any data file in the same directory. Couldn't find 'JourneyDB/JourneyDB' on the Hugging Face Hub either: FileNotFoundError: Unable to find data in dataset repository JourneyDB/JourneyDB with any supported extension ['csv', 'tsv', 'json', 'jsonl', 'parquet', 'txt', 'blp', 'bmp', 'dib', 'bufr', 'cur', 'pcx', 'dcx', 'dds', 'ps', 'eps', 'fit', 'fits', 'fli', 'flc', 'ftc', 'ftu', 'gbr', 'gif', 'grib', 'h5', 'hdf', 'png', 'apng', 'jp2', 'j2k', 'jpc', 'jpf', 'jpx', 'j2c', 'icns', 'ico', 'im', 'iim', 'tif', 'tiff', 'jfif', 'jpe', 'jpg', 'jpeg', 'mpg', 'mpeg', 'msp', 'pcd', 'pxr', 'pbm', 'pgm', 'ppm', 'pnm', 'psd', 'bw', 'rgb', 'rgba', 'sgi', 'ras', 'tga', 'icb', 'vda', 'vst', 'webp', 'wmf', 'emf', 'xbm', 'xpm', 'zip']

Steps to reproduce the bug

from huggingface_hub import notebook_login
notebook_login()
!pip install -q datasets
from datasets import load_dataset

dataset = load_dataset("JourneyDB/JourneyDB", data_files="data", use_auth_token=True)

Expected behavior

Load the dataset

Environment info

Notebook

@joaopedrosdmm (Author)

I believe it's because `datasets` doesn't work with `.tgz` files.

@mariosasko (Collaborator) commented Sep 26, 2023

JourneyDB/JourneyDB is a gated dataset, so this error means you are not authenticated to access it, either by using an invalid token or by not agreeing to the terms in the dialog on the dataset page.

> I believe it's because `datasets` doesn't work with `.tgz` files.

Indeed, the dataset's data-file structure is not natively supported by `datasets`. To load it, one option is to clone the repo (or download it with `huggingface_hub.snapshot_download`) and use `Dataset.from_generator` to process the files.
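A minimal sketch of that suggestion, assuming the repo's `.tgz` archives contain JSON-lines annotation files. The `*.jsonl` member pattern and the record layout are assumptions, not the actual JourneyDB structure; adjust both after inspecting the downloaded files:

```python
# Sketch: download the gated repo and build a Dataset from its .tgz archives,
# since `datasets` cannot read this layout natively.
import json
import tarfile
from pathlib import Path


def iter_tgz_records(archive_path):
    """Yield one dict per JSON-lines record found inside a .tgz archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            # Assumption: annotations are stored as *.jsonl members.
            if member.isfile() and member.name.endswith(".jsonl"):
                for line in tar.extractfile(member):
                    yield json.loads(line)


def load_journeydb():
    # Requires being logged in (notebook_login / huggingface-cli login)
    # and having accepted the terms on the dataset page.
    from datasets import Dataset
    from huggingface_hub import snapshot_download

    local_dir = snapshot_download("JourneyDB/JourneyDB", repo_type="dataset")

    def gen():
        for archive in sorted(Path(local_dir).rglob("*.tgz")):
            yield from iter_tgz_records(archive)

    return Dataset.from_generator(gen)
```

For the image archives you would instead extract the files to disk (or yield raw bytes) and pair them with the annotations by filename.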

@joaopedrosdmm (Author)

> JourneyDB/JourneyDB is a gated dataset, so this error means you are not authenticated to access it, either by using an invalid token or by not agreeing to the terms in the dialog on the dataset page.

I did authentication with:

from huggingface_hub import notebook_login
notebook_login()

Isn't that the correct way to do it?

> Indeed, the dataset's data-file structure is not natively supported by `datasets`. To load it, one option is to clone the repo (or download it with `huggingface_hub.snapshot_download`) and use `Dataset.from_generator` to process the files.

Great suggestion I will give it a try.

@mariosasko (Collaborator)

Have you accepted the terms in the dialog here?

IIRC Kaggle preinstalls an outdated `datasets` version, so it's also a good idea to update it before importing `datasets` (and do the same for `huggingface_hub`).

@joaopedrosdmm (Author)

Sorry for the late reply. Yes, I did. Thanks for the tip!
