
Got error when load cnn_dailymail dataset #3830

Closed
wgong0510 opened this issue Mar 5, 2022 · 2 comments
Labels: duplicate This issue or pull request already exists

@wgong0510

When using the datasets.load_dataset method to load the cnn_dailymail dataset, I got the errors below:

  • Windows OS: FileNotFoundError: [WinError 3] The system cannot find the path specified.: 'D:\SourceCode\DataScience\HuggingFace\Data\downloads\1bc05d24fa6dda2468e83a73cf6dc207226e01e3c48a507ea716dc0421da583b\cnn\stories'
  • Google Colab: NotADirectoryError: [Errno 20] Not a directory: '/root/.cache/huggingface/datasets/downloads/1bc05d24fa6dda2468e83a73cf6dc207226e01e3c48a507ea716dc0421da583b/cnn/stories'

The code used to load the dataset:
Windows OS:

from datasets import load_dataset

dataset = load_dataset("cnn_dailymail", "3.0.0", cache_dir="D:\\SourceCode\\DataScience\\HuggingFace\\Data")

Google Colab:

import datasets

train_data = datasets.load_dataset("cnn_dailymail", "3.0.0", split="train")
@dynamicwebpaige commented Mar 5, 2022

Was able to reproduce the issue on Colab; full logs below.

---------------------------------------------------------------------------
NotADirectoryError                        Traceback (most recent call last)
<ipython-input-2-39967739ba7f> in <module>()
      1 import datasets
      2 
----> 3 train_data = datasets.load_dataset("cnn_dailymail", "3.0.0", split="train")

5 frames
/usr/local/lib/python3.7/dist-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, script_version, **config_kwargs)
   1705         ignore_verifications=ignore_verifications,
   1706         try_from_hf_gcs=try_from_hf_gcs,
-> 1707         use_auth_token=use_auth_token,
   1708     )
   1709 

/usr/local/lib/python3.7/dist-packages/datasets/builder.py in download_and_prepare(self, download_config, download_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, **download_and_prepare_kwargs)
    593                     if not downloaded_from_gcs:
    594                         self._download_and_prepare(
--> 595                             dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
    596                         )
    597                     # Sync info

/usr/local/lib/python3.7/dist-packages/datasets/builder.py in _download_and_prepare(self, dl_manager, verify_infos, **prepare_split_kwargs)
    659         split_dict = SplitDict(dataset_name=self.name)
    660         split_generators_kwargs = self._make_split_generators_kwargs(prepare_split_kwargs)
--> 661         split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
    662 
    663         # Checksums verification

/root/.cache/huggingface/modules/datasets_modules/datasets/cnn_dailymail/3cb851bf7cf5826e45d49db2863f627cba583cbc32342df7349dfe6c38060234/cnn_dailymail.py in _split_generators(self, dl_manager)
    253     def _split_generators(self, dl_manager):
    254         dl_paths = dl_manager.download_and_extract(_DL_URLS)
--> 255         train_files = _subset_filenames(dl_paths, datasets.Split.TRAIN)
    256         # Generate shared vocabulary
    257 

/root/.cache/huggingface/modules/datasets_modules/datasets/cnn_dailymail/3cb851bf7cf5826e45d49db2863f627cba583cbc32342df7349dfe6c38060234/cnn_dailymail.py in _subset_filenames(dl_paths, split)
    154     else:
    155         logger.fatal("Unsupported split: %s", split)
--> 156     cnn = _find_files(dl_paths, "cnn", urls)
    157     dm = _find_files(dl_paths, "dm", urls)
    158     return cnn + dm

/root/.cache/huggingface/modules/datasets_modules/datasets/cnn_dailymail/3cb851bf7cf5826e45d49db2863f627cba583cbc32342df7349dfe6c38060234/cnn_dailymail.py in _find_files(dl_paths, publisher, url_dict)
    133     else:
    134         logger.fatal("Unsupported publisher: %s", publisher)
--> 135     files = sorted(os.listdir(top_dir))
    136 
    137     ret_files = []

NotADirectoryError: [Errno 20] Not a directory: '/root/.cache/huggingface/datasets/downloads/1bc05d24fa6dda2468e83a73cf6dc207226e01e3c48a507ea716dc0421da583b/cnn/stories'
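
One hypothetical way to narrow this down, using the path copied verbatim from the traceback (the check itself is an assumption added here, not part of the original report): os.listdir raises exactly this NotADirectoryError when a component of the path is a regular file rather than a directory, so we can inspect what is actually sitting in the cache.

import os

# Cache entry from the traceback above; the loader joins "cnn/stories" onto it.
base = "/root/.cache/huggingface/datasets/downloads/1bc05d24fa6dda2468e83a73cf6dc207226e01e3c48a507ea716dc0421da583b"
print(os.path.isfile(base))  # True would mean a single file was cached where a directory tree was expected
if os.path.isfile(base):
    with open(base, "rb") as f:
        print(f.read(64))  # the first bytes reveal what was actually downloaded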

@albertvillanova added the duplicate label Mar 7, 2022
@albertvillanova (Member) commented:

Hi @wgong0510, thanks for reporting. And hi @dynamicwebpaige, thanks for your investigation.

This issue was already reported, and its root cause is a change in the Google Drive service: for large files, Google Drive now serves a virus-scan warning page instead of the file itself. We have already fixed it on the master branch.
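
For context, a minimal sketch of the general workaround for this Google Drive behavior (illustrative only, not the actual datasets fix; drive_download, the file id handling, and the cookie-name prefix are assumptions):

import requests

def drive_download(file_id: str, dest: str) -> None:
    # Google Drive returns an HTML virus-scan warning page for large files;
    # re-requesting with the confirm token from the warning page's cookie
    # yields the real bytes.
    url = "https://docs.google.com/uc?export=download"
    with requests.Session() as session:
        resp = session.get(url, params={"id": file_id}, stream=True)
        token = next(
            (v for k, v in resp.cookies.items() if k.startswith("download_warning")),
            None,
        )
        if token is not None:
            resp = session.get(url, params={"id": file_id, "confirm": token}, stream=True)
        with open(dest, "wb") as f:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                f.write(chunk)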

We are planning to make a patch release today (indeed, we were planning to do it last Friday).

In the meantime, you can get this fix by installing our library from the GitHub master branch:

pip install git+https://github.com/huggingface/datasets#egg=datasets

Then, if you had previously tried to load the data and got the checksum error, you should force a re-download of the data (before the fix, you just downloaded and cached the virus-scan warning page instead of the data file):

load_dataset("...", download_mode="force_redownload")
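
Applied to this dataset, the full recovery sequence would look like this (a sketch, assuming the master-branch install above has already been done):

from datasets import load_dataset

# With the patched library installed, force a fresh download so the cached
# virus-scan warning page is replaced by the actual archive.
train_data = load_dataset(
    "cnn_dailymail",
    "3.0.0",
    split="train",
    download_mode="force_redownload",
)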

CC: @lhoestq
