
Got error when load cnn_dailymail dataset #3830

Closed
wgong0510 opened this issue Mar 5, 2022 · 2 comments
Labels: duplicate This issue or pull request already exists

@wgong0510

When using the datasets.load_dataset method to load the cnn_dailymail dataset, I got the errors below:

  • Windows OS: FileNotFoundError: [WinError 3] The system cannot find the path specified.: 'D:\SourceCode\DataScience\HuggingFace\Data\downloads\1bc05d24fa6dda2468e83a73cf6dc207226e01e3c48a507ea716dc0421da583b\cnn\stories'
  • Google Colab: NotADirectoryError: [Errno 20] Not a directory: '/root/.cache/huggingface/datasets/downloads/1bc05d24fa6dda2468e83a73cf6dc207226e01e3c48a507ea716dc0421da583b/cnn/stories'

The code used to load the dataset:
Windows OS:

from datasets import load_dataset

dataset = load_dataset("cnn_dailymail", "3.0.0", cache_dir="D:\\SourceCode\\DataScience\\HuggingFace\\Data")

Google Colab:

import datasets

train_data = datasets.load_dataset("cnn_dailymail", "3.0.0", split="train")
@dynamicwebpaige commented Mar 5, 2022

Was able to reproduce the issue on Colab; full logs below.

---------------------------------------------------------------------------
NotADirectoryError                        Traceback (most recent call last)
<ipython-input-2-39967739ba7f> in <module>()
      1 import datasets
      2 
----> 3 train_data = datasets.load_dataset("cnn_dailymail", "3.0.0", split="train")

5 frames
/usr/local/lib/python3.7/dist-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, script_version, **config_kwargs)
   1705         ignore_verifications=ignore_verifications,
   1706         try_from_hf_gcs=try_from_hf_gcs,
-> 1707         use_auth_token=use_auth_token,
   1708     )
   1709 

/usr/local/lib/python3.7/dist-packages/datasets/builder.py in download_and_prepare(self, download_config, download_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, **download_and_prepare_kwargs)
    593                     if not downloaded_from_gcs:
    594                         self._download_and_prepare(
--> 595                             dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
    596                         )
    597                     # Sync info

/usr/local/lib/python3.7/dist-packages/datasets/builder.py in _download_and_prepare(self, dl_manager, verify_infos, **prepare_split_kwargs)
    659         split_dict = SplitDict(dataset_name=self.name)
    660         split_generators_kwargs = self._make_split_generators_kwargs(prepare_split_kwargs)
--> 661         split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
    662 
    663         # Checksums verification

/root/.cache/huggingface/modules/datasets_modules/datasets/cnn_dailymail/3cb851bf7cf5826e45d49db2863f627cba583cbc32342df7349dfe6c38060234/cnn_dailymail.py in _split_generators(self, dl_manager)
    253     def _split_generators(self, dl_manager):
    254         dl_paths = dl_manager.download_and_extract(_DL_URLS)
--> 255         train_files = _subset_filenames(dl_paths, datasets.Split.TRAIN)
    256         # Generate shared vocabulary
    257 

/root/.cache/huggingface/modules/datasets_modules/datasets/cnn_dailymail/3cb851bf7cf5826e45d49db2863f627cba583cbc32342df7349dfe6c38060234/cnn_dailymail.py in _subset_filenames(dl_paths, split)
    154     else:
    155         logger.fatal("Unsupported split: %s", split)
--> 156     cnn = _find_files(dl_paths, "cnn", urls)
    157     dm = _find_files(dl_paths, "dm", urls)
    158     return cnn + dm

/root/.cache/huggingface/modules/datasets_modules/datasets/cnn_dailymail/3cb851bf7cf5826e45d49db2863f627cba583cbc32342df7349dfe6c38060234/cnn_dailymail.py in _find_files(dl_paths, publisher, url_dict)
    133     else:
    134         logger.fatal("Unsupported publisher: %s", publisher)
--> 135     files = sorted(os.listdir(top_dir))
    136 
    137     ret_files = []

NotADirectoryError: [Errno 20] Not a directory: '/root/.cache/huggingface/datasets/downloads/1bc05d24fa6dda2468e83a73cf6dc207226e01e3c48a507ea716dc0421da583b/cnn/stories'
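
One hypothetical way to narrow this down, using the path copied verbatim from the traceback (the check itself is an assumption added here, not part of the original report): os.listdir raises exactly this NotADirectoryError when a component of the path is a regular file rather than a directory, so we can inspect what is actually sitting in the cache.

import os

# Cache entry from the traceback above; the loader joins "cnn/stories" onto it.
base = "/root/.cache/huggingface/datasets/downloads/1bc05d24fa6dda2468e83a73cf6dc207226e01e3c48a507ea716dc0421da583b"
print(os.path.isfile(base))  # True would mean a single file was cached where a directory tree was expected
if os.path.isfile(base):
    with open(base, "rb") as f:
        print(f.read(64))  # the first bytes reveal what was actually downloaded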

@albertvillanova added the duplicate label Mar 7, 2022
@albertvillanova (Member) commented:

Hi @wgong0510, thanks for reporting. And hi @dynamicwebpaige, thanks for your investigation.

This issue was already reported, and its root cause is a change in the Google Drive service: for large files, Google Drive now serves a virus-scan warning page instead of the file itself. We have already fixed it on the master branch.
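
For context, a minimal sketch of the general workaround for this Google Drive behavior (illustrative only, not the actual datasets fix; drive_download, the file id handling, and the cookie-name prefix are assumptions):

import requests

def drive_download(file_id: str, dest: str) -> None:
    # Google Drive returns an HTML virus-scan warning page for large files;
    # re-requesting with the confirm token from the warning page's cookie
    # yields the real bytes.
    url = "https://docs.google.com/uc?export=download"
    with requests.Session() as session:
        resp = session.get(url, params={"id": file_id}, stream=True)
        token = next(
            (v for k, v in resp.cookies.items() if k.startswith("download_warning")),
            None,
        )
        if token is not None:
            resp = session.get(url, params={"id": file_id, "confirm": token}, stream=True)
        with open(dest, "wb") as f:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                f.write(chunk)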

We are planning to make a patch release today (indeed, we were planning to do it last Friday).

In the meantime, you can get this fix by installing our library from the GitHub master branch:

pip install git+https://github.com/huggingface/datasets#egg=datasets

Then, if you had previously tried to load the data and got the checksum error, you should force a re-download of the data (before the fix, you just downloaded and cached the virus-scan warning page instead of the data file):

load_dataset("...", download_mode="force_redownload")
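
Applied to this dataset, the full recovery sequence would look like this (a sketch, assuming the master-branch install above has already been done):

from datasets import load_dataset

# With the patched library installed, force a fresh download so the cached
# virus-scan warning page is replaced by the actual archive.
train_data = load_dataset(
    "cnn_dailymail",
    "3.0.0",
    split="train",
    download_mode="force_redownload",
)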

CC: @lhoestq
