FileNotFoundError while downloading wikipedia dataset for any language #4915

Open
Shilpac20 opened this issue Aug 30, 2022 · 5 comments
Shilpac20 commented Aug 30, 2022

Describe the bug

Hi, I am trying to download the Wikipedia dataset using
load_dataset("wikipedia", language="aa", date="20220401", split="train", beam_runner='DirectRunner'). However, I end up with a FileNotFoundError. I get this error for every language I try to download.

Environment:

Steps to reproduce the bug

from datasets import load_dataset
load_dataset("wikipedia", language="aa", date="20220401", split="train",beam_runner='DirectRunner')

Expected results

The dataset loads without error.

Actual results

I am pasting the error trace here:
Downloading builder script: 35.9kB [00:00, ?B/s]
Downloading metadata: 30.4kB [00:00, 1.94MB/s]
Using custom data configuration 20220401.aa-date=20220401,language=aa
Downloading and preparing dataset wikipedia/20220401.aa to C:\Users\Shilpa\.cache\huggingface\datasets\wikipedia\20220401.aa-date=20220401,language=aa\2.0.0\aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559...
Downloading data: 100%|████████████████████████████████████████████████████████████| 11.1k/11.1k [00:00<00:00, 712kB/s]
Downloading data files: 100%|████████████████████████████████████████████████████████████| 1/1 [00:02<00:00, 2.82s/it]
Extracting data files: 100%|█████████████████████████████████████████████████████████████████████| 1/1 [00:00<?, ?it/s]
Downloading data: 100%|███████████████████████████████████████████████████████████| 35.6k/35.6k [00:00<00:00, 84.3kB/s]
Downloading data files: 100%|████████████████████████████████████████████████████████████| 1/1 [00:02<00:00, 2.93s/it]
Traceback (most recent call last):
File "apache_beam\runners\common.py", line 1417, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam\runners\common.py", line 837, in apache_beam.runners.common.PerWindowInvoker.invoke_process
File "apache_beam\runners\common.py", line 981, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
File "apache_beam\runners\common.py", line 1571, in apache_beam.runners.common._OutputHandler.handle_process_outputs
File "G:\Python3.7\lib\site-packages\apache_beam\io\iobase.py", line 1193, in process
self.writer = self.sink.open_writer(init_result, str(uuid.uuid4()))
File "G:\Python3.7\lib\site-packages\apache_beam\options\value_provider.py", line 193, in _f
return fnc(self, *args, **kwargs)
File "G:\Python3.7\lib\site-packages\apache_beam\io\filebasedsink.py", line 202, in open_writer
return FileBasedSinkWriter(self, writer_path)
File "G:\Python3.7\lib\site-packages\apache_beam\io\filebasedsink.py", line 419, in __init__
self.temp_handle = self.sink.open(temp_shard_path)
File "G:\Python3.7\lib\site-packages\apache_beam\io\parquetio.py", line 553, in open
self._file_handle = super().open(temp_path)
File "G:\Python3.7\lib\site-packages\apache_beam\options\value_provider.py", line 193, in _f
return fnc(self, *args, **kwargs)
File "G:\Python3.7\lib\site-packages\apache_beam\io\filebasedsink.py", line 139, in open
temp_path, self.mime_type, self.compression_type)
File "G:\Python3.7\lib\site-packages\apache_beam\io\filesystems.py", line 224, in create
return filesystem.create(path, mime_type, compression_type)
File "G:\Python3.7\lib\site-packages\apache_beam\io\localfilesystem.py", line 163, in create
return self._path_open(path, 'wb', mime_type, compression_type)
File "G:\Python3.7\lib\site-packages\apache_beam\io\localfilesystem.py", line 140, in _path_open
raw_file = io.open(path, mode)
FileNotFoundError: [Errno 2] No such file or directory: 'C:\Users\Shilpa\.cache\huggingface\datasets\wikipedia\20220401.aa-date=20220401,language=aa\2.0.0\aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559.incomplete\beam-temp-wikipedia-train-880233e8287e11edaf9d3ca067f2714e\20a05238-6106-4420-a713-4eca6dd5959a.wikipedia-train'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "G:/abc/temp.py", line 32, in <module>
beam_runner='DirectRunner')
File "G:\Python3.7\lib\site-packages\datasets\load.py", line 1751, in load_dataset
use_auth_token=use_auth_token,
File "G:\Python3.7\lib\site-packages\datasets\builder.py", line 705, in download_and_prepare
dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
File "G:\Python3.7\lib\site-packages\datasets\builder.py", line 1394, in _download_and_prepare
pipeline_results = pipeline.run()
File "G:\Python3.7\lib\site-packages\apache_beam\pipeline.py", line 574, in run
return self.runner.run_pipeline(self, self._options)
File "G:\Python3.7\lib\site-packages\apache_beam\runners\direct\direct_runner.py", line 131, in run_pipeline
return runner.run_pipeline(pipeline, options)
File "G:\Python3.7\lib\site-packages\apache_beam\runners\portability\fn_api_runner\fn_runner.py", line 201, in run_pipeline
options)
File "G:\Python3.7\lib\site-packages\apache_beam\runners\portability\fn_api_runner\fn_runner.py", line 212, in run_via_runner_api
return self.run_stages(stage_context, stages)
File "G:\Python3.7\lib\site-packages\apache_beam\runners\portability\fn_api_runner\fn_runner.py", line 443, in run_stages
runner_execution_context, bundle_context_manager, bundle_input)
File "G:\Python3.7\lib\site-packages\apache_beam\runners\portability\fn_api_runner\fn_runner.py", line 776, in _execute_bundle
bundle_manager))
File "G:\Python3.7\lib\site-packages\apache_beam\runners\portability\fn_api_runner\fn_runner.py", line 1000, in _run_bundle
data_input, data_output, input_timers, expected_timer_output)
File "G:\Python3.7\lib\site-packages\apache_beam\runners\portability\fn_api_runner\fn_runner.py", line 1309, in process_bundle
result_future = self._worker_handler.control_conn.push(process_bundle_req)
File "G:\Python3.7\lib\site-packages\apache_beam\runners\portability\fn_api_runner\worker_handlers.py", line 380, in push
response = self.worker.do_instruction(request)
File "G:\Python3.7\lib\site-packages\apache_beam\runners\worker\sdk_worker.py", line 598, in do_instruction
getattr(request, request_type), request.instruction_id)
File "G:\Python3.7\lib\site-packages\apache_beam\runners\worker\sdk_worker.py", line 635, in process_bundle
bundle_processor.process_bundle(instruction_id))
File "G:\Python3.7\lib\site-packages\apache_beam\runners\worker\bundle_processor.py", line 1004, in process_bundle
element.data)
File "G:\Python3.7\lib\site-packages\apache_beam\runners\worker\bundle_processor.py", line 227, in process_encoded
self.output(decoded_value)
File "apache_beam\runners\worker\operations.py", line 526, in apache_beam.runners.worker.operations.Operation.output
File "apache_beam\runners\worker\operations.py", line 528, in apache_beam.runners.worker.operations.Operation.output
File "apache_beam\runners\worker\operations.py", line 237, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
File "apache_beam\runners\worker\operations.py", line 240, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
File "apache_beam\runners\worker\operations.py", line 907, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam\runners\worker\operations.py", line 908, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam\runners\common.py", line 1419, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam\runners\common.py", line 1491, in apache_beam.runners.common.DoFnRunner._reraise_augmented
File "apache_beam\runners\common.py", line 1417, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam\runners\common.py", line 623, in apache_beam.runners.common.SimpleInvoker.invoke_process
File "apache_beam\runners\common.py", line 1581, in apache_beam.runners.common._OutputHandler.handle_process_outputs
File "apache_beam\runners\common.py", line 1694, in apache_beam.runners.common._OutputHandler._write_value_to_tag
File "apache_beam\runners\worker\operations.py", line 240, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
File "apache_beam\runners\worker\operations.py", line 907, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam\runners\worker\operations.py", line 908, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam\runners\common.py", line 1419, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam\runners\common.py", line 1491, in apache_beam.runners.common.DoFnRunner._reraise_augmented
File "apache_beam\runners\common.py", line 1417, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam\runners\common.py", line 623, in apache_beam.runners.common.SimpleInvoker.invoke_process
File "apache_beam\runners\common.py", line 1581, in apache_beam.runners.common._OutputHandler.handle_process_outputs
File "apache_beam\runners\common.py", line 1694, in apache_beam.runners.common._OutputHandler._write_value_to_tag
File "apache_beam\runners\worker\operations.py", line 240, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
File "apache_beam\runners\worker\operations.py", line 907, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam\runners\worker\operations.py", line 908, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam\runners\common.py", line 1419, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam\runners\common.py", line 1491, in apache_beam.runners.common.DoFnRunner._reraise_augmented
File "apache_beam\runners\common.py", line 1417, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam\runners\common.py", line 837, in apache_beam.runners.common.PerWindowInvoker.invoke_process
File "apache_beam\runners\common.py", line 981, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
File "apache_beam\runners\common.py", line 1581, in apache_beam.runners.common._OutputHandler.handle_process_outputs
File "apache_beam\runners\common.py", line 1694, in apache_beam.runners.common._OutputHandler._write_value_to_tag
File "apache_beam\runners\worker\operations.py", line 240, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
File "apache_beam\runners\worker\operations.py", line 907, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam\runners\worker\operations.py", line 908, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam\runners\common.py", line 1419, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam\runners\common.py", line 1491, in apache_beam.runners.common.DoFnRunner._reraise_augmented
File "apache_beam\runners\common.py", line 1417, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam\runners\common.py", line 623, in apache_beam.runners.common.SimpleInvoker.invoke_process
File "apache_beam\runners\common.py", line 1581, in apache_beam.runners.common._OutputHandler.handle_process_outputs
File "apache_beam\runners\common.py", line 1694, in apache_beam.runners.common._OutputHandler._write_value_to_tag
File "apache_beam\runners\worker\operations.py", line 324, in apache_beam.runners.worker.operations.GeneralPurposeConsumerSet.receive
File "apache_beam\runners\worker\operations.py", line 905, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam\runners\worker\operations.py", line 907, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam\runners\worker\operations.py", line 908, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam\runners\common.py", line 1419, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam\runners\common.py", line 1491, in apache_beam.runners.common.DoFnRunner._reraise_augmented
File "apache_beam\runners\common.py", line 1417, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam\runners\common.py", line 623, in apache_beam.runners.common.SimpleInvoker.invoke_process
File "apache_beam\runners\common.py", line 1581, in apache_beam.runners.common._OutputHandler.handle_process_outputs
File "apache_beam\runners\common.py", line 1694, in apache_beam.runners.common._OutputHandler._write_value_to_tag
File "apache_beam\runners\worker\operations.py", line 240, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
File "apache_beam\runners\worker\operations.py", line 907, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam\runners\worker\operations.py", line 908, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam\runners\common.py", line 1419, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam\runners\common.py", line 1491, in apache_beam.runners.common.DoFnRunner._reraise_augmented
File "apache_beam\runners\common.py", line 1417, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam\runners\common.py", line 837, in apache_beam.runners.common.PerWindowInvoker.invoke_process
File "apache_beam\runners\common.py", line 981, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
File "apache_beam\runners\common.py", line 1581, in apache_beam.runners.common._OutputHandler.handle_process_outputs
File "apache_beam\runners\common.py", line 1694, in apache_beam.runners.common._OutputHandler._write_value_to_tag
File "apache_beam\runners\worker\operations.py", line 240, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
File "apache_beam\runners\worker\operations.py", line 907, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam\runners\worker\operations.py", line 908, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam\runners\common.py", line 1419, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam\runners\common.py", line 1507, in apache_beam.runners.common.DoFnRunner._reraise_augmented
File "apache_beam\runners\common.py", line 1417, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam\runners\common.py", line 837, in apache_beam.runners.common.PerWindowInvoker.invoke_process
File "apache_beam\runners\common.py", line 981, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
File "apache_beam\runners\common.py", line 1571, in apache_beam.runners.common._OutputHandler.handle_process_outputs
File "G:\Python3.7\lib\site-packages\apache_beam\io\iobase.py", line 1193, in process
self.writer = self.sink.open_writer(init_result, str(uuid.uuid4()))
File "G:\Python3.7\lib\site-packages\apache_beam\options\value_provider.py", line 193, in _f
return fnc(self, *args, **kwargs)
File "G:\Python3.7\lib\site-packages\apache_beam\io\filebasedsink.py", line 202, in open_writer
return FileBasedSinkWriter(self, writer_path)
File "G:\Python3.7\lib\site-packages\apache_beam\io\filebasedsink.py", line 419, in __init__
self.temp_handle = self.sink.open(temp_shard_path)
File "G:\Python3.7\lib\site-packages\apache_beam\io\parquetio.py", line 553, in open
self._file_handle = super().open(temp_path)
File "G:\Python3.7\lib\site-packages\apache_beam\options\value_provider.py", line 193, in _f
return fnc(self, *args, **kwargs)
File "G:\Python3.7\lib\site-packages\apache_beam\io\filebasedsink.py", line 139, in open
temp_path, self.mime_type, self.compression_type)
File "G:\Python3.7\lib\site-packages\apache_beam\io\filesystems.py", line 224, in create
return filesystem.create(path, mime_type, compression_type)
File "G:\Python3.7\lib\site-packages\apache_beam\io\localfilesystem.py", line 163, in create
return self._path_open(path, 'wb', mime_type, compression_type)
File "G:\Python3.7\lib\site-packages\apache_beam\io\localfilesystem.py", line 140, in _path_open
raw_file = io.open(path, mode)
RuntimeError: FileNotFoundError: [Errno 2] No such file or directory: 'C:\Users\Shilpa\.cache\huggingface\datasets\wikipedia\20220401.aa-date=20220401,language=aa\2.0.0\aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559.incomplete\beam-temp-wikipedia-train-880233e8287e11edaf9d3ca067f2714e\20a05238-6106-4420-a713-4eca6dd5959a.wikipedia-train' [while running 'train/Save to parquet/Write/WriteImpl/WriteBundles']

Environment info

Python: 3.7.6
OS: Windows 10 Pro
datasets: 2.4.0
apache_beam: 2.41.0
mwparserfromhell: 0.6.4

Shilpac20 added the bug label Aug 30, 2022
@albertvillanova
Member

Hi @Shilpac20,

As explained in the Wikipedia dataset card: https://huggingface.co/datasets/wikipedia

You can find the full list of languages and dates here.

This means that, before passing a specific date, you should first make sure it is available online, as Wikimedia only keeps the last X months of dumps (depending on the size of the corresponding language dump). For example, to see which dates are available online for the "aa" Wikipedia, see https://dumps.wikimedia.org/aawiki/ (as of today, 2022-08-31, the available dates range from 20220401 to 20220820).
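
The availability check described above can also be scripted. The following is a minimal sketch, not part of the datasets library: the helper names dump_status_url and dump_is_online are hypothetical, and the URL layout is assumed from the dumpstatus.json address that appears in an error message later in this thread.

```python
import urllib.request

DUMPS_ROOT = "https://dumps.wikimedia.org"


def dump_status_url(language: str, date: str) -> str:
    """Build the dumpstatus.json URL for one language and dump date.

    Hypothetical helper; the layout is assumed from URLs quoted in this thread.
    """
    return f"{DUMPS_ROOT}/{language}wiki/{date}/dumpstatus.json"


def dump_is_online(language: str, date: str, timeout: float = 10.0) -> bool:
    """Return True if the dump status file can be fetched (needs network access)."""
    request = urllib.request.Request(dump_status_url(language, date), method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError (e.g. 404) and socket timeouts all derive from OSError.
        return False
```

Calling dump_is_online("aa", "20220401") before load_dataset would then tell a missing dump apart from a local write problem.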

@Shilpac20
Author

Hi, the date I specified, "20220401", is available for the language "aa". The error also persists for every other available date listed at https://dumps.wikimedia.org/aawiki/. The error seems to come mainly from Apache Beam not being able to write the downloaded files. Any help on this?
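
One thing that may be worth ruling out for a write failure like this on Windows is the length of the temporary file path. This is an assumption, not a diagnosis confirmed anywhere in this thread. The sketch below compares the path from the traceback against the classic Win32 MAX_PATH limit of 260 characters; exceeds_windows_max_path is a hypothetical helper.

```python
WINDOWS_MAX_PATH = 260  # classic Win32 limit, counting the terminating NUL


def exceeds_windows_max_path(path: str, limit: int = WINDOWS_MAX_PATH) -> bool:
    """Return True if the path is too long for the classic Win32 limit."""
    # The limit includes the terminating NUL, so the usable length is limit - 1.
    return len(path) > limit - 1


# Path copied from the FileNotFoundError in the traceback above.
failing_path = (
    r"C:\Users\Shilpa\.cache\huggingface\datasets\wikipedia"
    r"\20220401.aa-date=20220401,language=aa\2.0.0"
    r"\aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559.incomplete"
    r"\beam-temp-wikipedia-train-880233e8287e11edaf9d3ca067f2714e"
    r"\20a05238-6106-4420-a713-4eca6dd5959a.wikipedia-train"
)

print(len(failing_path), exceeds_windows_max_path(failing_path))
```

If the length does turn out to matter, passing a shorter cache_dir to load_dataset might be worth trying, though that is untested here.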

@albertvillanova
Member

I see, sorry, I misread your issue.

We are investigating this.


multipleofzero commented Nov 30, 2022

I am struggling with basically the same issue. I am trying to download the German Wikipedia dump.

As per the documentation, "20220301.de" should be available as a pre-processed dataset.

Issuing the command mentioned in the documentation cited above

from datasets import load_dataset
load_dataset("wikipedia", "20220301.de")

raises the following FileNotFound error

FileNotFoundError: Couldn't find file at https://dumps.wikimedia.org/dewiki/20220301/dumpstatus.json

Using the (undocumented?) call to load_dataset() with language and date parameters

load_dataset("wikipedia", language="de", date="20220301", beam_runner="DirectRunner")

produces the same error.

EDIT: as I am using datasets v2.7.1, I should be looking at that version's documentation! It mentions that additional kwargs are "passed to the BuilderConfig and used in the DatasetBuilder". So I guess that is how language and date are used.
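
To illustrate how the two call styles seem to relate, here is a small sketch of the naming pattern visible in this thread ("20220301.de", "20220401.aa"). Both helpers are hypothetical and only mirror the observed "<date>.<language>" pattern; they are not the builder's actual code.

```python
from typing import Tuple


def wikipedia_config_name(date: str, language: str) -> str:
    """Join date and language into the '<date>.<language>' form seen in the logs."""
    return f"{date}.{language}"


def split_config_name(name: str) -> Tuple[str, str]:
    """Split a '<date>.<language>' config name back into (date, language)."""
    date, _, language = name.partition(".")
    return date, language


print(wikipedia_config_name("20220301", "de"))
print(split_config_name("20220401.aa"))
```

Under this reading, load_dataset("wikipedia", "20220301.de") and the language/date keyword form point at the same dump, which would be consistent with both calls failing in the same way.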

As I can see a folder 20221130 on https://dumps.wikimedia.org/dewiki/, I also tried

from datasets import load_dataset
load_dataset("wikipedia", "20221130.de")

which throws another error:

ValueError: BuilderConfig 20221120.de not found. Available: ['20220301.aa', ... '20220301.de', ...

basically telling me that the dataset I originally requested ('20220301.de') is available...

It seems that load_dataset is not handling the vanishing older dumps for Wikipedia correctly?


RandyAndy-byte commented Dec 4, 2022

I am able to start downloading the dataset when I use the recent dumps for 20221201. But those are the big wiki dumps, and I need the smaller preloaded version.

I now get an error after the files show up in my cache: the download ends with a FileNotFoundError for some reason. The cache directory datasets\wikipedia\date.bn\ had something in it, but that disappeared when the error came up.

It is easy to test with the language "bn" because the number of files is low.

dataset = load_dataset('wikipedia', date="20221201", language="bn", split='train', beam_runner='DirectRunner')
