Raise a proper exception when trying to stream a dataset that requires to manually download files #2749

Closed
severo opened this issue Aug 3, 2021 · 2 comments · Fixed by #2758
Labels
bug Something isn't working

severo commented Aug 3, 2021

Describe the bug

For at least the following datasets: 'reclor', 'telugu_books', 'turkish_movie_sentiment', 'ubuntu_dialogs_corpus' and 'wikihow', calling load_dataset in streaming mode raises a TypeError whose message gives no detail about why it fails.

Steps to reproduce the bug

from datasets import load_dataset
dataset = load_dataset("reclor", streaming=True)

Expected results

Ideally: raise a specific exception, something like ManualDownloadError.

Or at least give the reason in the message, as when we load in normal mode:

from datasets import load_dataset
dataset = load_dataset("reclor")

AssertionError: The dataset reclor with config default requires manual data.
 Please follow the manual download instructions:   to use ReClor you need to download it manually. Please go to its homepage (http://whyu.me/reclor/) fill the google
  form and you will receive a download link and a password to extract it.Please extract all files in one folder and use the path folder in datasets.load_dataset('reclor', data_dir='path/to/folder/folder_name')
  .
 Manual data can be loaded with `datasets.load_dataset(reclor, data_dir='<path/to/manual/data>')
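
A minimal sketch of what such an exception could look like. ManualDownloadError is only the name suggested above, not an existing class in the library, and the constructor arguments and message wording here are assumptions made for illustration:

```python
class ManualDownloadError(Exception):
    """Hypothetical exception for datasets whose files must be fetched by hand."""

    def __init__(self, dataset_name, instructions):
        # Keep the pieces as attributes so callers can inspect them.
        self.dataset_name = dataset_name
        self.instructions = instructions
        super().__init__(
            f"The dataset {dataset_name} requires manual data and cannot be "
            f"loaded in streaming mode. Please follow the manual download "
            f"instructions: {instructions}"
        )


# Example: the caller gets a specific exception type and a readable reason.
try:
    raise ManualDownloadError("reclor", "go to http://whyu.me/reclor/ and fill in the form")
except ManualDownloadError as err:
    message = str(err)
    print(message)
```

With a dedicated type, callers can catch this one case separately from unrelated TypeErrors.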

Actual results

TypeError: expected str, bytes or os.PathLike object, not NoneType
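
For context, this is the message Python itself produces when None is passed where a filesystem path is expected. The sketch below reproduces it in isolation; that the streaming code path ends up with a None manual directory is an assumption, not something confirmed in this thread, and the variable and file names are hypothetical:

```python
import os

# Assumption: no data_dir was supplied, so the manual directory is None.
manual_dir = None

try:
    # Joining a path onto None fails inside the standard library.
    os.path.join(manual_dir, "train.json")
except TypeError as err:
    error_text = str(err)
    print(error_text)
```

The resulting text says nothing about manual downloads, which is why the failure is hard to diagnose.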

Environment info

  • datasets version: 1.11.0
  • Platform: macOS-11.5-x86_64-i386-64bit
  • Python version: 3.8.11
  • PyArrow version: 4.0.1

severo added the bug label on Aug 3, 2021

albertvillanova (Member) commented

Hi @severo, thanks for reporting.

As discussed, datasets requiring manual download should be:

  • programmatically identifiable
  • handled properly, with a clearer error message, when loading them in streaming mode

Regarding programmatic identifiability, note that for datasets requiring manual download, the builder has a property manual_download_instructions that is not None:

# Dataset requiring manual download:
builder.manual_download_instructions is not None
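
The check above could be wrapped in a small guard that runs before streaming starts. This is a sketch under stated assumptions: FakeBuilder is a stand-in that models only the manual_download_instructions property, and requires_manual_download and ensure_streamable are hypothetical helper names, not part of the library:

```python
class FakeBuilder:
    """Stand-in for a dataset builder; models only the property discussed here."""

    def __init__(self, name, manual_download_instructions=None):
        self.name = name
        self.manual_download_instructions = manual_download_instructions


def requires_manual_download(builder):
    # A builder needs manual download when its instructions are not None.
    return getattr(builder, "manual_download_instructions", None) is not None


def ensure_streamable(builder):
    # Fail early, with the reason, instead of a bare TypeError later on.
    if requires_manual_download(builder):
        raise ValueError(
            f"The dataset {builder.name} requires manual data and cannot be "
            f"streamed. Instructions: {builder.manual_download_instructions}"
        )


manual = FakeBuilder("reclor", "download the archive from the homepage")
automatic = FakeBuilder("some_dataset")  # hypothetical dataset name

print(requires_manual_download(manual))     # True
print(requires_manual_download(automatic))  # False
```

ValueError is used only to keep the sketch self-contained; a dedicated exception type would serve the same purpose.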


severo commented Aug 9, 2021

Thanks @albertvillanova
