
FileNotFoundError: Couldn't find file at https://dumps.wikimedia.org/enwiki/20220301/dumpstatus.json #5286

Closed

roritol opened this issue Nov 23, 2022 · 2 comments

roritol commented Nov 23, 2022

Describe the bug

I followed the steps provided on the dataset card at https://huggingface.co/datasets/wikipedia:

```
$ pip install apache_beam mwparserfromhell
```

```python
from datasets import load_dataset
load_dataset("wikipedia", "20220301.en")
```

However, this results in the following error:

```
    raise MissingBeamOptions(
datasets.builder.MissingBeamOptions: Trying to generate a dataset using Apache Beam, yet no Beam Runner or PipelineOptions() has been provided in load_dataset or in the builder arguments. For big datasets it has to run on large-scale data processing tools like Dataflow, Spark, etc. More information about Apache Beam runners at https://beam.apache.org/documentation/runners/capability-matrix/
If you really want to run it locally because you feel like the Dataset is small enough, you can use the local beam runner called DirectRunner (you may run out of memory).
Example of usage:
    load_dataset('wikipedia', '20220301.en', beam_runner='DirectRunner')
```
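(For reference, the error text points at two ways to satisfy Beam: naming a runner, or passing explicit PipelineOptions. A minimal sketch of both follows; the multi-processing flags are standard Beam DirectRunner options, not anything confirmed in this issue:)

```python
from datasets import load_dataset
from apache_beam.options.pipeline_options import PipelineOptions

# Option 1: name the local runner directly (may run out of memory).
ds = load_dataset('wikipedia', '20220301.en', beam_runner='DirectRunner')

# Option 2: pass explicit PipelineOptions, e.g. to spread the work
# over several local worker processes.
opts = PipelineOptions([
    '--direct_running_mode=multi_processing',
    '--direct_num_workers=4',
])
ds = load_dataset('wikipedia', '20220301.en', beam_options=opts)
```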

If I then run, as suggested:

```python
load_dataset('wikipedia', '20220301.en', beam_runner='DirectRunner')
```

the following error occurs:

```
    raise FileNotFoundError(f"Couldn't find file at {url}")
FileNotFoundError: Couldn't find file at https://dumps.wikimedia.org/enwiki/20220301/dumpstatus.json
```
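The URL itself appears to be the problem: dumps.wikimedia.org only keeps the most recent dumps online, so a dump date that used to exist can start returning 404. A quick probe (my own check, not part of datasets) shows whether the 20220301 dump is still there:

```python
import requests

# Wikimedia prunes old dumps, so the dated status file can disappear.
status = requests.get(
    "https://dumps.wikimedia.org/enwiki/20220301/dumpstatus.json", timeout=30
)
print(status.status_code)  # 404 once the 20220301 dump has been purged

# The directory index lists the dump dates that are still available.
listing = requests.get("https://dumps.wikimedia.org/enwiki/", timeout=30)
print(listing.text)  # plain HTML listing of dates such as 20231001/
```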

Here is the exact session:

```
Python 3.10.6 (main, Nov 2 2022, 18:53:38) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from datasets import load_dataset
>>> load_dataset('wikipedia', '20220301.en')
Downloading and preparing dataset wikipedia/20220301.en to /home/[EDITED]/.cache/huggingface/datasets/wikipedia/20220301.en/2.0.0/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559...
Downloading: 100%|████████████████████████████████████████████████████████████████████████████| 15.3k/15.3k [00:00<00:00, 22.2MB/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/dist-packages/datasets/load.py", line 1741, in load_dataset
    builder_instance.download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 822, in download_and_prepare
    self._download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1879, in _download_and_prepare
    raise MissingBeamOptions(
datasets.builder.MissingBeamOptions: Trying to generate a dataset using Apache Beam, yet no Beam Runner or PipelineOptions() has been provided in load_dataset or in the builder arguments. For big datasets it has to run on large-scale data processing tools like Dataflow, Spark, etc. More information about Apache Beam runners at https://beam.apache.org/documentation/runners/capability-matrix/
If you really want to run it locally because you feel like the Dataset is small enough, you can use the local beam runner called DirectRunner (you may run out of memory).
Example of usage:
    load_dataset('wikipedia', '20220301.en', beam_runner='DirectRunner')
>>> load_dataset('wikipedia', '20220301.en', beam_runner='DirectRunner')
Downloading and preparing dataset wikipedia/20220301.en to /home/[EDITED]/.cache/huggingface/datasets/wikipedia/20220301.en/2.0.0/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559...
Downloading: 100%|████████████████████████████████████████████████████████████████████████████| 15.3k/15.3k [00:00<00:00, 18.8MB/s]
Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/dist-packages/datasets/load.py", line 1741, in load_dataset
    builder_instance.download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 822, in download_and_prepare
    self._download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1909, in _download_and_prepare
    super()._download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 891, in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
  File "/home/rorytol/.cache/huggingface/modules/datasets_modules/datasets/wikipedia/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559/wikipedia.py", line 945, in _split_generators
    downloaded_files = dl_manager.download_and_extract({"info": info_url})
  File "/usr/local/lib/python3.10/dist-packages/datasets/download/download_manager.py", line 447, in download_and_extract
    return self.extract(self.download(url_or_urls))
  File "/usr/local/lib/python3.10/dist-packages/datasets/download/download_manager.py", line 311, in download
    downloaded_path_or_paths = map_nested(
  File "/usr/local/lib/python3.10/dist-packages/datasets/utils/py_utils.py", line 444, in map_nested
    mapped = [
  File "/usr/local/lib/python3.10/dist-packages/datasets/utils/py_utils.py", line 445, in <listcomp>
    _single_map_nested((function, obj, types, None, True, None))
  File "/usr/local/lib/python3.10/dist-packages/datasets/utils/py_utils.py", line 346, in _single_map_nested
    return function(data_struct)
  File "/usr/local/lib/python3.10/dist-packages/datasets/download/download_manager.py", line 338, in _download
    return cached_path(url_or_filename, download_config=download_config)
  File "/usr/local/lib/python3.10/dist-packages/datasets/utils/file_utils.py", line 183, in cached_path
    output_path = get_from_cache(
  File "/usr/local/lib/python3.10/dist-packages/datasets/utils/file_utils.py", line 530, in get_from_cache
    raise FileNotFoundError(f"Couldn't find file at {url}")
FileNotFoundError: Couldn't find file at https://dumps.wikimedia.org/enwiki/20220301/dumpstatus.json
```

Steps to reproduce the bug

```
$ pip install apache_beam mwparserfromhell
```

```python
from datasets import load_dataset
load_dataset("wikipedia", "20220301.en")
load_dataset('wikipedia', '20220301.en', beam_runner='DirectRunner')
```

Expected behavior

The dataset downloads and builds successfully.

Environment info

Linux on a remote workstation, accessed from a MacBook terminal.

Python 3.10.6

roritol (author) commented Nov 25, 2022

I found a solution.

If you specifically install datasets==1.18 and then run

```python
import datasets
wiki = datasets.load_dataset('wikipedia', '20200501.en')
```

then this should work (it worked for me).
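A slightly fuller sketch of the same workaround (pinning to the 1.18.0 patch release specifically is my assumption; any 1.18.x should behave the same):

```python
# Shell step first: pip install "datasets==1.18.0"
import datasets

# Sanity-check that the downgrade took effect; load_dataset behaves
# very differently across major versions of the library.
assert datasets.__version__.startswith("1.18"), datasets.__version__

# With the old release, 20200501.en can be fetched as an already-processed
# copy, so no local Apache Beam pipeline has to run (assumption based on
# how older releases handled preprocessed datasets).
wiki = datasets.load_dataset('wikipedia', '20200501.en')
print(wiki)
```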

roritol closed this as completed Nov 25, 2022
Armanasq commented Oct 3, 2023

I have the same problem here, but installing datasets==1.18 won't work for me.
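A workaround that avoids the dumps server entirely is the preprocessed copy republished on the Hub; the repo id and config date below are assumptions on my part, not something confirmed in this thread:

```python
from datasets import load_dataset

# Preprocessed Wikipedia republished on the Hugging Face Hub: no Apache
# Beam run and no download from dumps.wikimedia.org involved.
# Repo id and config date are assumptions, not confirmed in this thread.
wiki = load_dataset("wikimedia/wikipedia", "20231101.en")
```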
