[Bug]: Error when profiling with scalene #104

Closed
3 tasks done
simplew2011 opened this issue Nov 29, 2023 · 4 comments
Labels
bug (Something isn't working), stale-issue

Comments

@simplew2011

Before Reporting

  • I have pulled the latest code from the main branch and run it again, and the bug still exists.

  • I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend asking a question using the Question template.)

Search before reporting

  • I have searched the Data-Juicer issues and found no similar bugs.

OS

ubuntu

Installation Method

pip

Data-Juicer Version

v0.1.2

Python Version

3.8

Describe the bug

To Reproduce

pip install -U scalene
scalene tools/process_data.py --config configs/demo/process.yaml

Configs

# Process config example for dataset

# global parameters
project_name: 'demo-process'
dataset_path: 'demos/data/demo-dataset.jsonl'  # path to your dataset directory or file
np: 4  # number of subprocesses used to process your dataset
use_cache: false
export_path: './outputs/demo-process/demo-processed.jsonl'

# process schedule
# a list of several process operators with their arguments
process:
  - language_id_score_filter:
      lang: 'zh'

Logs

2023-11-29 11:13:15 | INFO | data_juicer.core.executor:107 - Processing data...
2023-11-29 11:13:15 | ERROR | data_juicer.core.executor:165 - An error occurred during Op [language_id_score_filter].
Traceback (most recent call last):
  File "/home/wzp/code/LLMData/open_source/data-juicer/data_juicer/core/executor.py", line 131, in run
    dataset = dataset.add_column(name=Fields.stats,
  File "/home/wzp/code/LLMData/open_source/data-juicer/data_juicer/core/data.py", line 255, in add_column
    return NestedDataset(super().add_column(*args, **kargs))
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/fingerprint.py", line 511, in wrapper
    out = func(dataset, *args, **kwargs)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 5446, in add_column
    dataset = self.flatten_indices() if self._indices is not None else self
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/fingerprint.py", line 511, in wrapper
    out = func(dataset, *args, **kwargs)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3592, in flatten_indices
    return self.map(
  File "/home/wzp/code/LLMData/open_source/data-juicer/data_juicer/core/data.py", line 180, in map
    new_ds = NestedDataset(super().map(*args, **kargs))
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 563, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3004, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3392, in _map_single
    buf_writer, writer, tmp_file = init_buffer_and_writer()
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3326, in init_buffer_and_writer
    tmp_file = tempfile.NamedTemporaryFile("wb", dir=os.path.dirname(cache_file_name), delete=False)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/tempfile.py", line 541, in NamedTemporaryFile
    (fd, name) = _mkstemp_inner(dir, prefix, suffix, flags, output_type)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/tempfile.py", line 250, in _mkstemp_inner
    fd = _os.open(file, flags, 0o600)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp1pvg2dc5/tmpbf_w1ds6'

Screenshots

image

Additional

No response

simplew2011 added the bug (Something isn't working) label Nov 29, 2023
@simplew2011
Author

It seems to be related to use_cache: false.
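
One possible reading of the last traceback frames, not a confirmed diagnosis: tempfile.NamedTemporaryFile is called with dir= pointing at /tmp/tmp1pvg2dc5, and that directory does not exist at the time of the call. With caching disabled, datasets writes its intermediate files into a temporary directory rather than the regular cache, which may be why use_cache: false matters here. The sketch below only reproduces the final error in isolation with the standard library. The helper name create_in_removed_dir is hypothetical, and the sketch does not explain why the directory would go missing under scalene.

```python
import errno
import shutil
import tempfile


def create_in_removed_dir():
    """Try to create a temp file inside a directory that was just deleted.

    Returns the errno of the resulting FileNotFoundError, or None if the
    file was (unexpectedly) created.
    """
    parent = tempfile.mkdtemp()   # stands in for the temporary cache directory
    shutil.rmtree(parent)         # assumption: the directory is gone before the write

    try:
        # Same call shape as the datasets frame in the traceback above.
        tmp_file = tempfile.NamedTemporaryFile("wb", dir=parent, delete=False)
    except FileNotFoundError as exc:
        return exc.errno

    tmp_file.close()
    return None


print(create_in_removed_dir() == errno.ENOENT)
```

If this reading is right, the question for the library maintainers is what removes or hides the temporary directory when the script runs under scalene with multiple processes.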

@HYLcool
Collaborator

HYLcool commented Nov 29, 2023

Hi @simplew2011 , thanks for your interest and use!

We think the problem lies in how scalene and datasets interact. We wrote a small test snippet to check:

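from datasets import load_dataset, disable_caching

# With caching disabled, datasets writes intermediate files to a temporary directory
disable_caching()

ds = load_dataset('json', data_files='demo-dataset.jsonl', split='train')
# Multi-process filter: removing num_proc avoids the error under scalene
ds = ds.filter(lambda s: s['text'], num_proc=4)
# add_column is where the FileNotFoundError surfaces under scalene
ds = ds.add_column(name='id', column=list(range(ds.num_rows)))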

The result depends on how the snippet is run:

  • Run with the plain python command, it works.
  • Run with the installed scalene command, it raises the same FileNotFoundError at the same position as in your traceback.
  • Run with the installed scalene command in a single process, after removing the num_proc parameter from the ds.filter call in the snippet above, it also works.
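
To make the three runs above easier to compare, the snippet can be wrapped so that the worker count comes from outside the script. This is only a sketch under assumptions: the environment variable REPRO_NUM_PROC and the helper names are hypothetical, and the data file name is taken from the snippet above. Leaving the variable unset passes num_proc=None to ds.filter, which is the single-process case.

```python
import os

DATA_FILE = "demo-dataset.jsonl"   # same file name as in the snippet above
NUM_PROC_ENV = "REPRO_NUM_PROC"    # hypothetical variable, not part of any library


def parse_num_proc(environ):
    """Return the worker count from the environment, or None for single-process."""
    raw = environ.get(NUM_PROC_ENV, "").strip()
    return int(raw) if raw else None


def run_repro(data_file, num_proc):
    """Run the same three steps as the test snippet and return the row count."""
    # Imported lazily so parse_num_proc can be used without datasets installed.
    from datasets import disable_caching, load_dataset

    disable_caching()
    ds = load_dataset("json", data_files=data_file, split="train")
    ds = ds.filter(lambda s: s["text"], num_proc=num_proc)
    ds = ds.add_column(name="id", column=list(range(ds.num_rows)))
    return ds.num_rows


# Only run the reproduction when the demo data file is actually present.
if __name__ == "__main__" and os.path.exists(DATA_FILE):
    print(run_repro(DATA_FILE, parse_num_proc(os.environ)))
```

Saved under a name of your choice, the script can be started with python or with scalene, with REPRO_NUM_PROC=4 set or unset, so the failing combination can be isolated without editing the code between runs.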

So it seems there is nothing we can do on the Data-Juicer side to fix this. You may need to open an issue in the scalene or datasets repos and ask their developers for help.


This issue is marked as stale because there has been no activity for 21 days. Remove the stale label or add a new comment, or this issue will be closed in 3 days.


Close this stale issue.

2 participants