
[Question] How to solve `datasets.builder.DatasetGenerationError: An error occurred while generating the dataset` #35

Open
Shawnzheng011019 opened this issue Oct 23, 2023 · 7 comments
Labels
question Further information is requested

Comments

@Shawnzheng011019

What is your question?

Traceback (most recent call last):
File "C:\Users\shawn\anaconda3\envs\pytorch\lib\site-packages\datasets\builder.py", line 1618, in _prepare_split_single
writer = writer_class(
File "C:\Users\shawn\anaconda3\envs\pytorch\lib\site-packages\datasets\arrow_writer.py", line 334, in __init__
self.stream = self._fs.open(fs_token_paths[2][0], "wb")
File "C:\Users\shawn\anaconda3\envs\pytorch\lib\site-packages\fsspec\spec.py", line 1309, in open
f = self._open(
File "C:\Users\shawn\anaconda3\envs\pytorch\lib\site-packages\fsspec\implementations\local.py", line 180, in _open
return LocalFileOpener(path, mode, fs=self, **kwargs)
File "C:\Users\shawn\anaconda3\envs\pytorch\lib\site-packages\fsspec\implementations\local.py", line 298, in __init__
self._open()
File "C:\Users\shawn\anaconda3\envs\pytorch\lib\site-packages\fsspec\implementations\local.py", line 303, in _open
self.f = open(self.path, mode=self.mode)
FileNotFoundError: [Errno 2] No such file or directory: 'C:/Users/shawn/.cache/huggingface/datasets/named_entity_recognition_dataset_builder/default-c270794ce0d23d06/0.0.0/db737b9bb893f20fb03d04403a30bf7c033256c212b7e9f0ebc6e9c958535c51.incomplete/named_entity_recognition_dataset_builder-train-00000-00000-of-NNNNN.arrow'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "C:\Users\shawn\anaconda3\envs\pytorch\lib\runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\shawn\anaconda3\envs\pytorch\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\shawn\anaconda3\envs\pytorch\Scripts\adaseq.exe\__main__.py", line 7, in <module>
File "C:\Users\shawn\anaconda3\envs\pytorch\lib\site-packages\adaseq\main.py", line 13, in run
main(prog='adaseq')
File "C:\Users\shawn\anaconda3\envs\pytorch\lib\site-packages\adaseq\commands\__init__.py", line 29, in main
args.func(args)
File "C:\Users\shawn\anaconda3\envs\pytorch\lib\site-packages\adaseq\commands\train.py", line 84, in train_model_from_args
train_model(
File "C:\Users\shawn\anaconda3\envs\pytorch\lib\site-packages\adaseq\commands\train.py", line 156, in train_model
trainer = build_trainer_from_partial_objects(
File "C:\Users\shawn\anaconda3\envs\pytorch\lib\site-packages\adaseq\commands\train.py", line 185, in build_trainer_from_partial_objects
dm = DatasetManager.from_config(task=config.task, **config.dataset)
File "C:\Users\shawn\anaconda3\envs\pytorch\lib\site-packages\adaseq\data\dataset_manager.py", line 182, in from_config
hfdataset = hf_load_dataset(path, name=name, **kwargs)
File "C:\Users\shawn\anaconda3\envs\pytorch\lib\site-packages\datasets\load.py", line 1797, in load_dataset
builder_instance.download_and_prepare(
File "C:\Users\shawn\anaconda3\envs\pytorch\lib\site-packages\datasets\builder.py", line 909, in download_and_prepare
self._download_and_prepare(
File "C:\Users\shawn\anaconda3\envs\pytorch\lib\site-packages\datasets\builder.py", line 1670, in _download_and_prepare
super()._download_and_prepare(
File "C:\Users\shawn\anaconda3\envs\pytorch\lib\site-packages\datasets\builder.py", line 1004, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "C:\Users\shawn\anaconda3\envs\pytorch\lib\site-packages\datasets\builder.py", line 1508, in _prepare_split
for job_id, done, content in self._prepare_split_single(
File "C:\Users\shawn\anaconda3\envs\pytorch\lib\site-packages\datasets\builder.py", line 1665, in _prepare_split_single
raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset
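[Editor's note] The `FileNotFoundError` above points at a cache path roughly 200 characters long; on Windows, opening paths near the 260-character `MAX_PATH` limit can fail unless long-path support is enabled. A common mitigation (a hedged sketch, not confirmed as the fix in this thread; the directory name is an example) is to redirect the Hugging Face datasets cache to a short path before the `datasets` library is imported:

```python
import os

# Redirect the Hugging Face datasets cache to a short directory so that
# generated .arrow file paths stay well under the Windows MAX_PATH limit.
# This must run before `datasets` is first imported, since the library
# reads the variable at import time.
os.environ["HF_DATASETS_CACHE"] = r"C:\hf_cache"

# from datasets import load_dataset  # import only after the env var is set
```

Alternatively, recent Windows 10/11 builds allow enabling long-path support system-wide via the `LongPathsEnabled` registry key.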

What have you tried?

Set an HTTP proxy and successfully connected to YouTube.

Code (if necessary)

No response

What's your environment?

  • AdaSeq Version (e.g., 1.0 or master):
  • ModelScope Version (e.g., 1.0 or master):
  • PyTorch Version (e.g., 1.12.1):
  • OS (e.g., Ubuntu 20.04):
  • Python version:
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • Any other relevant information:

Code of Conduct

  • I agree to follow this project's Code of Conduct
@Shawnzheng011019 Shawnzheng011019 added the question Further information is requested label Oct 23, 2023
@Shawnzheng011019
Author

The environment was set up automatically from requirements.txt.

@ykallan

ykallan commented Dec 16, 2023

I'm running into the same problem. It looks like there may be a bug in AdaSeq's dataset-loading logic; the dataset format

```text
data_type: json_spans
```

may be part of the problem.

@PPPP-kaqiu

It's because the dataset can't be found, or it isn't in the standard format the parser expects. You can rewrite the data loading by following the toy msra loading code.

@houyuchao

@PPPP-kaqiu Did you rewrite it? Could you share it?

@lichen146

@Shawnzheng011019 Have you solved this?

@PPPP-kaqiu

Write the data-loading script strictly following the HF dataset format; then in the yaml config, just point the data loading at the data folder.
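[Editor's note] For readers following this suggestion, here is a minimal sketch of the sentence-parsing logic such a loading script's `_generate_examples` would wrap, assuming CoNLL-style NER data ("token label" per line, blank line between sentences, as in the toy msra example). Names and format are illustrative, not AdaSeq's actual code:

```python
# Illustrative parser for CoNLL-style NER lines; a custom Hugging Face
# `datasets` builder's _generate_examples would iterate a file and wrap
# each yielded pair into an example dict.
def read_conll(lines):
    """Yield (tokens, labels) pairs, one pair per sentence."""
    tokens, labels = [], []
    for line in lines:
        line = line.strip()
        if not line:  # a blank line closes the current sentence
            if tokens:
                yield tokens, labels
                tokens, labels = [], []
            continue
        token, label = line.split()[:2]  # "token<space>label"
        tokens.append(token)
        labels.append(label)
    if tokens:  # final sentence may lack a trailing blank line
        yield tokens, labels
```

Usage: `list(read_conll(open("train.txt", encoding="utf-8")))` returns the parsed sentences; whatever format your data is actually in, the key point from the comment above is that the script must match the HF `datasets` builder conventions exactly.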

@lichen146

@PPPP-kaqiu Could you add me on WeChat? I'd really appreciate the help. WX: Xugeyuan923
