-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FileNotFoundError while downloading wikipedia dataset for any language #4915
Comments
Hi @Shilpac20, As explained in the Wikipedia dataset card: https://huggingface.co/datasets/wikipedia
This means that, before passing a specific date, you should first make sure it is available online, as Wikimedia only keeps last X months (depending on the size of the corresponding language dump)): e.g. to see which dates "aa" Wikipedia is available online, see https://dumps.wikimedia.org/aawiki/ (as of today 2022-08-31, the available dates are from 20220401 to 20220820). |
Hi, the date that I have specified "20220401" is available for the language "aa". The error persists for any other available dates as present in https://dumps.wikimedia.org/aawiki/. The error is mainly due to apache beam not able to write the downloaded files. Any help on this? |
I see, sorry, I misread your issue. We are investigating this. |
I am struggling with basically the same issue. I am trying to download the German Wikipedia dump. As per the documentation, Issuing the command mentioned in the documentation cited above
raises the following
Using the (undocumented?) call to
produces the same error. EDIT: as I am using As I can see a folder
which throws another error:
basically telling me that the dataset I originally requested ( It seems that |
I am able to start downloading the dataset when trying anything with the recent dumps for 20221201. But obviously, those are the big wiki dumps and I need the smaller preloaded version. I am now getting some error when the files show up in my cache but it will say FileNotFoundError at the end of the download for some reason. The cache directory to the datasets\wikipedia\date.bn\ had something in it, then when the error came up it disappeared. It is easy to test with the langauge "bn" because the amount of files is low. dataset = load_dataset('wikipedia', date="20221201", language="bn", split='train', beam_runner='DirectRunner') |
Describe the bug
Hi, I am currently trying to download wikipedia dataset using
load_dataset("wikipedia", language="aa", date="20220401", split="train",beam_runner='DirectRunner'). However, I end up in getting filenotfound error. I get this error for any language I try to download.
Environment:
Steps to reproduce the bug
Expected results
to load the dataset
Actual results
I am pasting the error trace here:
Downloading builder script: 35.9kB [00:00, ?B/s]
Downloading metadata: 30.4kB [00:00, 1.94MB/s]
Using custom data configuration 20220401.aa-date=20220401,language=aa
Downloading and preparing dataset wikipedia/20220401.aa to C:\Users\Shilpa.cache\huggingface\datasets\wikipedia\20220401.aa-date=20220401,language=aa\2.0.0\aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559...
Downloading data: 100%|████████████████████████████████████████████████████████████| 11.1k/11.1k [00:00<00:00, 712kB/s]
Downloading data files: 100%|████████████████████████████████████████████████████████████| 1/1 [00:02<00:00, 2.82s/it]
Extracting data files: 100%|█████████████████████████████████████████████████████████████████████| 1/1 [00:00<?, ?it/s]
Downloading data: 100%|███████████████████████████████████████████████████████████| 35.6k/35.6k [00:00<00:00, 84.3kB/s]
Downloading data files: 100%|████████████████████████████████████████████████████████████| 1/1 [00:02<00:00, 2.93s/it]
Traceback (most recent call last):
File "apache_beam\runners\common.py", line 1417, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam\runners\common.py", line 837, in apache_beam.runners.common.PerWindowInvoker.invoke_process
File "apache_beam\runners\common.py", line 981, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
File "apache_beam\runners\common.py", line 1571, in apache_beam.runners.common._OutputHandler.handle_process_outputs
File "G:\Python3.7\lib\site-packages\apache_beam\io\iobase.py", line 1193, in process
self.writer = self.sink.open_writer(init_result, str(uuid.uuid4()))
File "G:\Python3.7\lib\site-packages\apache_beam\options\value_provider.py", line 193, in _f
return fnc(self, *args, **kwargs)
File "G:\Python3.7\lib\site-packages\apache_beam\io\filebasedsink.py", line 202, in open_writer
return FileBasedSinkWriter(self, writer_path)
File "G:\Python3.7\lib\site-packages\apache_beam\io\filebasedsink.py", line 419, in init
self.temp_handle = self.sink.open(temp_shard_path)
File "G:\Python3.7\lib\site-packages\apache_beam\io\parquetio.py", line 553, in open
self._file_handle = super().open(temp_path)
File "G:\Python3.7\lib\site-packages\apache_beam\options\value_provider.py", line 193, in _f
return fnc(self, *args, **kwargs)
File "G:\Python3.7\lib\site-packages\apache_beam\io\filebasedsink.py", line 139, in open
temp_path, self.mime_type, self.compression_type)
File "G:\Python3.7\lib\site-packages\apache_beam\io\filesystems.py", line 224, in create
return filesystem.create(path, mime_type, compression_type)
File "G:\Python3.7\lib\site-packages\apache_beam\io\localfilesystem.py", line 163, in create
return self._path_open(path, 'wb', mime_type, compression_type)
File "G:\Python3.7\lib\site-packages\apache_beam\io\localfilesystem.py", line 140, in _path_open
raw_file = io.open(path, mode)
FileNotFoundError: [Errno 2] No such file or directory: 'C:\Users\Shilpa\.cache\huggingface\datasets\wikipedia\20220401.aa-date=20220401,language=aa\2.0.0\aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559.incomplete\beam-temp-wikipedia-train-880233e8287e11edaf9d3ca067f2714e\20a05238-6106-4420-a713-4eca6dd5959a.wikipedia-train'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "G:/abc/temp.py", line 32, in
beam_runner='DirectRunner')
File "G:\Python3.7\lib\site-packages\datasets\load.py", line 1751, in load_dataset
use_auth_token=use_auth_token,
File "G:\Python3.7\lib\site-packages\datasets\builder.py", line 705, in download_and_prepare
dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
File "G:\Python3.7\lib\site-packages\datasets\builder.py", line 1394, in _download_and_prepare
pipeline_results = pipeline.run()
File "G:\Python3.7\lib\site-packages\apache_beam\pipeline.py", line 574, in run
return self.runner.run_pipeline(self, self._options)
File "G:\Python3.7\lib\site-packages\apache_beam\runners\direct\direct_runner.py", line 131, in run_pipeline
return runner.run_pipeline(pipeline, options)
File "G:\Python3.7\lib\site-packages\apache_beam\runners\portability\fn_api_runner\fn_runner.py", line 201, in run_pipeline
options)
File "G:\Python3.7\lib\site-packages\apache_beam\runners\portability\fn_api_runner\fn_runner.py", line 212, in run_via_runner_api
return self.run_stages(stage_context, stages)
File "G:\Python3.7\lib\site-packages\apache_beam\runners\portability\fn_api_runner\fn_runner.py", line 443, in run_stages
runner_execution_context, bundle_context_manager, bundle_input)
File "G:\Python3.7\lib\site-packages\apache_beam\runners\portability\fn_api_runner\fn_runner.py", line 776, in _execute_bundle
bundle_manager))
File "G:\Python3.7\lib\site-packages\apache_beam\runners\portability\fn_api_runner\fn_runner.py", line 1000, in _run_bundle
data_input, data_output, input_timers, expected_timer_output)
File "G:\Python3.7\lib\site-packages\apache_beam\runners\portability\fn_api_runner\fn_runner.py", line 1309, in process_bundle
result_future = self._worker_handler.control_conn.push(process_bundle_req)
File "G:\Python3.7\lib\site-packages\apache_beam\runners\portability\fn_api_runner\worker_handlers.py", line 380, in push
response = self.worker.do_instruction(request)
File "G:\Python3.7\lib\site-packages\apache_beam\runners\worker\sdk_worker.py", line 598, in do_instruction
getattr(request, request_type), request.instruction_id)
File "G:\Python3.7\lib\site-packages\apache_beam\runners\worker\sdk_worker.py", line 635, in process_bundle
bundle_processor.process_bundle(instruction_id))
File "G:\Python3.7\lib\site-packages\apache_beam\runners\worker\bundle_processor.py", line 1004, in process_bundle
element.data)
File "G:\Python3.7\lib\site-packages\apache_beam\runners\worker\bundle_processor.py", line 227, in process_encoded
self.output(decoded_value)
File "apache_beam\runners\worker\operations.py", line 526, in apache_beam.runners.worker.operations.Operation.output
File "apache_beam\runners\worker\operations.py", line 528, in apache_beam.runners.worker.operations.Operation.output
File "apache_beam\runners\worker\operations.py", line 237, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
File "apache_beam\runners\worker\operations.py", line 240, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
File "apache_beam\runners\worker\operations.py", line 907, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam\runners\worker\operations.py", line 908, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam\runners\common.py", line 1419, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam\runners\common.py", line 1491, in apache_beam.runners.common.DoFnRunner._reraise_augmented
File "apache_beam\runners\common.py", line 1417, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam\runners\common.py", line 623, in apache_beam.runners.common.SimpleInvoker.invoke_process
File "apache_beam\runners\common.py", line 1581, in apache_beam.runners.common._OutputHandler.handle_process_outputs
File "apache_beam\runners\common.py", line 1694, in apache_beam.runners.common._OutputHandler._write_value_to_tag
File "apache_beam\runners\worker\operations.py", line 240, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
File "apache_beam\runners\worker\operations.py", line 907, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam\runners\worker\operations.py", line 908, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam\runners\common.py", line 1419, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam\runners\common.py", line 1491, in apache_beam.runners.common.DoFnRunner._reraise_augmented
File "apache_beam\runners\common.py", line 1417, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam\runners\common.py", line 623, in apache_beam.runners.common.SimpleInvoker.invoke_process
File "apache_beam\runners\common.py", line 1581, in apache_beam.runners.common._OutputHandler.handle_process_outputs
File "apache_beam\runners\common.py", line 1694, in apache_beam.runners.common._OutputHandler._write_value_to_tag
File "apache_beam\runners\worker\operations.py", line 240, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
File "apache_beam\runners\worker\operations.py", line 907, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam\runners\worker\operations.py", line 908, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam\runners\common.py", line 1419, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam\runners\common.py", line 1491, in apache_beam.runners.common.DoFnRunner._reraise_augmented
File "apache_beam\runners\common.py", line 1417, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam\runners\common.py", line 837, in apache_beam.runners.common.PerWindowInvoker.invoke_process
File "apache_beam\runners\common.py", line 981, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
File "apache_beam\runners\common.py", line 1581, in apache_beam.runners.common._OutputHandler.handle_process_outputs
File "apache_beam\runners\common.py", line 1694, in apache_beam.runners.common._OutputHandler._write_value_to_tag
File "apache_beam\runners\worker\operations.py", line 240, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
File "apache_beam\runners\worker\operations.py", line 907, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam\runners\worker\operations.py", line 908, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam\runners\common.py", line 1419, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam\runners\common.py", line 1491, in apache_beam.runners.common.DoFnRunner._reraise_augmented
File "apache_beam\runners\common.py", line 1417, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam\runners\common.py", line 623, in apache_beam.runners.common.SimpleInvoker.invoke_process
File "apache_beam\runners\common.py", line 1581, in apache_beam.runners.common._OutputHandler.handle_process_outputs
File "apache_beam\runners\common.py", line 1694, in apache_beam.runners.common._OutputHandler._write_value_to_tag
File "apache_beam\runners\worker\operations.py", line 324, in apache_beam.runners.worker.operations.GeneralPurposeConsumerSet.receive
File "apache_beam\runners\worker\operations.py", line 905, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam\runners\worker\operations.py", line 907, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam\runners\worker\operations.py", line 908, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam\runners\common.py", line 1419, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam\runners\common.py", line 1491, in apache_beam.runners.common.DoFnRunner._reraise_augmented
File "apache_beam\runners\common.py", line 1417, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam\runners\common.py", line 623, in apache_beam.runners.common.SimpleInvoker.invoke_process
File "apache_beam\runners\common.py", line 1581, in apache_beam.runners.common._OutputHandler.handle_process_outputs
File "apache_beam\runners\common.py", line 1694, in apache_beam.runners.common._OutputHandler._write_value_to_tag
File "apache_beam\runners\worker\operations.py", line 240, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
File "apache_beam\runners\worker\operations.py", line 907, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam\runners\worker\operations.py", line 908, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam\runners\common.py", line 1419, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam\runners\common.py", line 1491, in apache_beam.runners.common.DoFnRunner._reraise_augmented
File "apache_beam\runners\common.py", line 1417, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam\runners\common.py", line 837, in apache_beam.runners.common.PerWindowInvoker.invoke_process
File "apache_beam\runners\common.py", line 981, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
File "apache_beam\runners\common.py", line 1581, in apache_beam.runners.common._OutputHandler.handle_process_outputs
File "apache_beam\runners\common.py", line 1694, in apache_beam.runners.common._OutputHandler._write_value_to_tag
File "apache_beam\runners\worker\operations.py", line 240, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
File "apache_beam\runners\worker\operations.py", line 907, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam\runners\worker\operations.py", line 908, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam\runners\common.py", line 1419, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam\runners\common.py", line 1507, in apache_beam.runners.common.DoFnRunner._reraise_augmented
File "apache_beam\runners\common.py", line 1417, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam\runners\common.py", line 837, in apache_beam.runners.common.PerWindowInvoker.invoke_process
File "apache_beam\runners\common.py", line 981, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
File "apache_beam\runners\common.py", line 1571, in apache_beam.runners.common._OutputHandler.handle_process_outputs
File "G:\Python3.7\lib\site-packages\apache_beam\io\iobase.py", line 1193, in process
self.writer = self.sink.open_writer(init_result, str(uuid.uuid4()))
File "G:\Python3.7\lib\site-packages\apache_beam\options\value_provider.py", line 193, in _f
return fnc(self, *args, **kwargs)
File "G:\Python3.7\lib\site-packages\apache_beam\io\filebasedsink.py", line 202, in open_writer
return FileBasedSinkWriter(self, writer_path)
File "G:\Python3.7\lib\site-packages\apache_beam\io\filebasedsink.py", line 419, in init
self.temp_handle = self.sink.open(temp_shard_path)
File "G:\Python3.7\lib\site-packages\apache_beam\io\parquetio.py", line 553, in open
self._file_handle = super().open(temp_path)
File "G:\Python3.7\lib\site-packages\apache_beam\options\value_provider.py", line 193, in _f
return fnc(self, *args, **kwargs)
File "G:\Python3.7\lib\site-packages\apache_beam\io\filebasedsink.py", line 139, in open
temp_path, self.mime_type, self.compression_type)
File "G:\Python3.7\lib\site-packages\apache_beam\io\filesystems.py", line 224, in create
return filesystem.create(path, mime_type, compression_type)
File "G:\Python3.7\lib\site-packages\apache_beam\io\localfilesystem.py", line 163, in create
return self._path_open(path, 'wb', mime_type, compression_type)
File "G:\Python3.7\lib\site-packages\apache_beam\io\localfilesystem.py", line 140, in _path_open
raw_file = io.open(path, mode)
RuntimeError: FileNotFoundError: [Errno 2] No such file or directory: 'C:\Users\Shilpa\.cache\huggingface\datasets\wikipedia\20220401.aa-date=20220401,language=aa\2.0.0\aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559.incomplete\beam-temp-wikipedia-train-880233e8287e11edaf9d3ca067f2714e\20a05238-6106-4420-a713-4eca6dd5959a.wikipedia-train' [while running 'train/Save to parquet/Write/WriteImpl/WriteBundles']
Environment info
Python: 3.7.6
Windows 10 Pro
datasets :2.4.0
apache_beam: 2.41.0
mwparserfromhell: 0.6.4
The text was updated successfully, but these errors were encountered: