Python 3.10.6 (main, Nov 2 2022, 18:53:38) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from datasets import load_dataset
>>> load_dataset('wikipedia', '20220301.en')
Downloading and preparing dataset wikipedia/20220301.en to /home/[EDITED]/.cache/huggingface/datasets/wikipedia/20220301.en/2.0.0/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559...
Downloading: 100%|████████████████████████████████████████████████████████████████████████████| 15.3k/15.3k [00:00<00:00, 22.2MB/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/dist-packages/datasets/load.py", line 1741, in load_dataset
    builder_instance.download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 822, in download_and_prepare
    self._download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1879, in _download_and_prepare
    raise MissingBeamOptions(
datasets.builder.MissingBeamOptions: Trying to generate a dataset using Apache Beam, yet no Beam Runner or PipelineOptions() has been provided in load_dataset or in the builder arguments. For big datasets it has to run on large-scale data processing tools like Dataflow, Spark, etc. More information about Apache Beam runners at https://beam.apache.org/documentation/runners/capability-matrix/
If you really want to run it locally because you feel like the Dataset is small enough, you can use the local beam runner called DirectRunner (you may run out of memory).
Example of usage: load_dataset('wikipedia', '20220301.en', beam_runner='DirectRunner')
>>> load_dataset('wikipedia', '20220301.en', beam_runner='DirectRunner')
Downloading and preparing dataset wikipedia/20220301.en to /home/[EDITED]/.cache/huggingface/datasets/wikipedia/20220301.en/2.0.0/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559...
Downloading: 100%|████████████████████████████████████████████████████████████████████████████| 15.3k/15.3k [00:00<00:00, 18.8MB/s]
Downloading data files: 0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/dist-packages/datasets/load.py", line 1741, in load_dataset
    builder_instance.download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 822, in download_and_prepare
    self._download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1909, in _download_and_prepare
    super()._download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 891, in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
  File "/home/rorytol/.cache/huggingface/modules/datasets_modules/datasets/wikipedia/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559/wikipedia.py", line 945, in _split_generators
    downloaded_files = dl_manager.download_and_extract({"info": info_url})
  File "/usr/local/lib/python3.10/dist-packages/datasets/download/download_manager.py", line 447, in download_and_extract
    return self.extract(self.download(url_or_urls))
  File "/usr/local/lib/python3.10/dist-packages/datasets/download/download_manager.py", line 311, in download
    downloaded_path_or_paths = map_nested(
  File "/usr/local/lib/python3.10/dist-packages/datasets/utils/py_utils.py", line 444, in map_nested
    mapped = [
  File "/usr/local/lib/python3.10/dist-packages/datasets/utils/py_utils.py", line 445, in <listcomp>
    _single_map_nested((function, obj, types, None, True, None))
  File "/usr/local/lib/python3.10/dist-packages/datasets/utils/py_utils.py", line 346, in _single_map_nested
    return function(data_struct)
  File "/usr/local/lib/python3.10/dist-packages/datasets/download/download_manager.py", line 338, in _download
    return cached_path(url_or_filename, download_config=download_config)
  File "/usr/local/lib/python3.10/dist-packages/datasets/utils/file_utils.py", line 183, in cached_path
    output_path = get_from_cache(
  File "/usr/local/lib/python3.10/dist-packages/datasets/utils/file_utils.py", line 530, in get_from_cache
    raise FileNotFoundError(f"Couldn't find file at {url}")
FileNotFoundError: Couldn't find file at https://dumps.wikimedia.org/enwiki/20220301/dumpstatus.json
Steps to reproduce the bug
$ pip install apache_beam mwparserfromhell
from datasets import load_dataset
load_dataset("wikipedia", "20220301.en")
load_dataset('wikipedia', '20220301.en', beam_runner='DirectRunner')
Expected behavior
Download the dataset
Environment info
Running Linux on a remote workstation, accessed through a MacBook terminal
Python 3.10.6
Describe the bug
I followed the steps provided on the website https://huggingface.co/datasets/wikipedia:
$ pip install apache_beam mwparserfromhell
from datasets import load_dataset
load_dataset("wikipedia", "20220301.en")
However, this results in the following error:
raise MissingBeamOptions(
datasets.builder.MissingBeamOptions: Trying to generate a dataset using Apache Beam, yet no Beam Runner or PipelineOptions() has been provided in load_dataset or in the builder arguments. For big datasets it has to run on large-scale data processing tools like Dataflow, Spark, etc. More information about Apache Beam runners at https://beam.apache.org/documentation/runners/capability-matrix/
If you really want to run it locally because you feel like the Dataset is small enough, you can use the local beam runner called DirectRunner (you may run out of memory).
Example of usage: load_dataset('wikipedia', '20220301.en', beam_runner='DirectRunner')
If I then run the suggested call:
load_dataset('wikipedia', '20220301.en', beam_runner='DirectRunner')
the following error occurs:
raise FileNotFoundError(f"Couldn't find file at {url}")
FileNotFoundError: Couldn't find file at https://dumps.wikimedia.org/enwiki/20220301/dumpstatus.json
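To check whether the status file is reachable at all, independently of `datasets`, something like the following could be used. It is only a minimal sketch: the helper names `dump_status_url` and `dump_is_available` are hypothetical, and the URL pattern is an assumption inferred from the address in the error message above, not taken from the `datasets` source.

```python
import urllib.request


def dump_status_url(config_name: str) -> str:
    """Build the dumpstatus.json URL for a config name such as '20220301.en'.

    Assumption: the pattern is inferred from the URL shown in the
    FileNotFoundError, i.e. <language>wiki/<date>/dumpstatus.json.
    """
    date, language = config_name.split(".", 1)
    return f"https://dumps.wikimedia.org/{language}wiki/{date}/dumpstatus.json"


def dump_is_available(config_name: str, timeout: float = 10.0) -> bool:
    """Return True if the status file answers with HTTP 200 (needs network access)."""
    request = urllib.request.Request(dump_status_url(config_name), method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError (e.g. a 404) and socket timeouts are all OSError subclasses.
        return False
```

Calling `dump_is_available("20220301.en")` returning `False` would suggest that the file is missing on the server side (or that the host cannot be reached from this machine), rather than a problem with the Beam runner argument.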
Here is the exact code:
Python 3.10.6 (main, Nov 2 2022, 18:53:38) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.