
FileNotFoundError: Couldn't find file at https://dumps.wikimedia.org/enwiki/20220301/dumpstatus.json #5286

Closed

roritol opened this issue Nov 23, 2022 · 2 comments

roritol commented Nov 23, 2022

Describe the bug

I followed the steps provided on the dataset card at https://huggingface.co/datasets/wikipedia:

```
$ pip install apache_beam mwparserfromhell
```

```python
from datasets import load_dataset
load_dataset("wikipedia", "20220301.en")
```

However, this results in the following error:

```
    raise MissingBeamOptions(
datasets.builder.MissingBeamOptions: Trying to generate a dataset using Apache Beam, yet no Beam Runner or PipelineOptions() has been provided in load_dataset or in the builder arguments. For big datasets it has to run on large-scale data processing tools like Dataflow, Spark, etc. More information about Apache Beam runners at https://beam.apache.org/documentation/runners/capability-matrix/
If you really want to run it locally because you feel like the Dataset is small enough, you can use the local beam runner called DirectRunner (you may run out of memory).
Example of usage:
    load_dataset('wikipedia', '20220301.en', beam_runner='DirectRunner')
```
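(For reference, the error text points at two ways to satisfy Beam: naming a runner, or passing explicit PipelineOptions. A minimal sketch of both follows; the multi-processing flags are standard Beam DirectRunner options, not anything confirmed in this issue:)

```python
from datasets import load_dataset
from apache_beam.options.pipeline_options import PipelineOptions

# Option 1: name the local runner directly (may run out of memory).
ds = load_dataset('wikipedia', '20220301.en', beam_runner='DirectRunner')

# Option 2: pass explicit PipelineOptions, e.g. to spread the work
# over several local worker processes.
opts = PipelineOptions([
    '--direct_running_mode=multi_processing',
    '--direct_num_workers=4',
])
ds = load_dataset('wikipedia', '20220301.en', beam_options=opts)
```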

If I then run, as suggested:

```python
load_dataset('wikipedia', '20220301.en', beam_runner='DirectRunner')
```

the following error occurs:

```
    raise FileNotFoundError(f"Couldn't find file at {url}")
FileNotFoundError: Couldn't find file at https://dumps.wikimedia.org/enwiki/20220301/dumpstatus.json
```
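The URL itself appears to be the problem: dumps.wikimedia.org only keeps the most recent dumps online, so a dump date that used to exist can start returning 404. A quick probe (my own check, not part of datasets) shows whether the 20220301 dump is still there:

```python
import requests

# Wikimedia prunes old dumps, so the dated status file can disappear.
status = requests.get(
    "https://dumps.wikimedia.org/enwiki/20220301/dumpstatus.json", timeout=30
)
print(status.status_code)  # 404 once the 20220301 dump has been purged

# The directory index lists the dump dates that are still available.
listing = requests.get("https://dumps.wikimedia.org/enwiki/", timeout=30)
print(listing.text)  # plain HTML listing of dates such as 20231001/
```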

Here is the exact session:

```
Python 3.10.6 (main, Nov 2 2022, 18:53:38) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from datasets import load_dataset
>>> load_dataset('wikipedia', '20220301.en')
Downloading and preparing dataset wikipedia/20220301.en to /home/[EDITED]/.cache/huggingface/datasets/wikipedia/20220301.en/2.0.0/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559...
Downloading: 100%|████████████████████████████████████████████████████████████████████████████| 15.3k/15.3k [00:00<00:00, 22.2MB/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/dist-packages/datasets/load.py", line 1741, in load_dataset
    builder_instance.download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 822, in download_and_prepare
    self._download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1879, in _download_and_prepare
    raise MissingBeamOptions(
datasets.builder.MissingBeamOptions: Trying to generate a dataset using Apache Beam, yet no Beam Runner or PipelineOptions() has been provided in load_dataset or in the builder arguments. For big datasets it has to run on large-scale data processing tools like Dataflow, Spark, etc. More information about Apache Beam runners at https://beam.apache.org/documentation/runners/capability-matrix/
If you really want to run it locally because you feel like the Dataset is small enough, you can use the local beam runner called DirectRunner (you may run out of memory).
Example of usage:
    load_dataset('wikipedia', '20220301.en', beam_runner='DirectRunner')
>>> load_dataset('wikipedia', '20220301.en', beam_runner='DirectRunner')
Downloading and preparing dataset wikipedia/20220301.en to /home/[EDITED]/.cache/huggingface/datasets/wikipedia/20220301.en/2.0.0/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559...
Downloading: 100%|████████████████████████████████████████████████████████████████████████████| 15.3k/15.3k [00:00<00:00, 18.8MB/s]
Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/dist-packages/datasets/load.py", line 1741, in load_dataset
    builder_instance.download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 822, in download_and_prepare
    self._download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1909, in _download_and_prepare
    super()._download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 891, in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
  File "/home/rorytol/.cache/huggingface/modules/datasets_modules/datasets/wikipedia/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559/wikipedia.py", line 945, in _split_generators
    downloaded_files = dl_manager.download_and_extract({"info": info_url})
  File "/usr/local/lib/python3.10/dist-packages/datasets/download/download_manager.py", line 447, in download_and_extract
    return self.extract(self.download(url_or_urls))
  File "/usr/local/lib/python3.10/dist-packages/datasets/download/download_manager.py", line 311, in download
    downloaded_path_or_paths = map_nested(
  File "/usr/local/lib/python3.10/dist-packages/datasets/utils/py_utils.py", line 444, in map_nested
    mapped = [
  File "/usr/local/lib/python3.10/dist-packages/datasets/utils/py_utils.py", line 445, in <listcomp>
    _single_map_nested((function, obj, types, None, True, None))
  File "/usr/local/lib/python3.10/dist-packages/datasets/utils/py_utils.py", line 346, in _single_map_nested
    return function(data_struct)
  File "/usr/local/lib/python3.10/dist-packages/datasets/download/download_manager.py", line 338, in _download
    return cached_path(url_or_filename, download_config=download_config)
  File "/usr/local/lib/python3.10/dist-packages/datasets/utils/file_utils.py", line 183, in cached_path
    output_path = get_from_cache(
  File "/usr/local/lib/python3.10/dist-packages/datasets/utils/file_utils.py", line 530, in get_from_cache
    raise FileNotFoundError(f"Couldn't find file at {url}")
FileNotFoundError: Couldn't find file at https://dumps.wikimedia.org/enwiki/20220301/dumpstatus.json
```

Steps to reproduce the bug

```
$ pip install apache_beam mwparserfromhell
```

```python
from datasets import load_dataset
load_dataset("wikipedia", "20220301.en")
load_dataset('wikipedia', '20220301.en', beam_runner='DirectRunner')
```

Expected behavior

The dataset downloads and builds successfully.

Environment info

Linux on a remote workstation, accessed from a MacBook terminal.

Python 3.10.6

roritol (author) commented Nov 25, 2022

I found a solution.

If you specifically install datasets==1.18 and then run

```python
import datasets
wiki = datasets.load_dataset('wikipedia', '20200501.en')
```

then this should work (it worked for me).
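A slightly fuller sketch of the same workaround (pinning to the 1.18.0 patch release specifically is my assumption; any 1.18.x should behave the same):

```python
# Shell step first: pip install "datasets==1.18.0"
import datasets

# Sanity-check that the downgrade took effect; load_dataset behaves
# very differently across major versions of the library.
assert datasets.__version__.startswith("1.18"), datasets.__version__

# With the old release, 20200501.en can be fetched as an already-processed
# copy, so no local Apache Beam pipeline has to run (assumption based on
# how older releases handled preprocessed datasets).
wiki = datasets.load_dataset('wikipedia', '20200501.en')
print(wiki)
```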

roritol closed this as completed Nov 25, 2022
Armanasq commented Oct 3, 2023

I have the same problem here, but installing datasets==1.18 won't work for me.
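A workaround that avoids the dumps server entirely is the preprocessed copy republished on the Hub; the repo id and config date below are assumptions on my part, not something confirmed in this thread:

```python
from datasets import load_dataset

# Preprocessed Wikipedia republished on the Hugging Face Hub: no Apache
# Beam run and no download from dumps.wikimedia.org involved.
# Repo id and config date are assumptions, not confirmed in this thread.
wiki = load_dataset("wikimedia/wikipedia", "20231101.en")
```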
