[Bug]: Error when profiling with scalene #104

simplew2011 · 2023-11-29T03:24:50Z

Before Reporting

I have pulled the latest code from the main branch and run it again, and the bug still exists.
I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend asking a question using the Question template.)

Search before reporting

I have searched the Data-Juicer issues and found no similar bugs.

OS

ubuntu

Installation Method

pip

Data-Juicer Version

v0.1.2

Python Version

3.8

Describe the bug

https://zhuanlan.zhihu.com/p/654006173

To Reproduce

pip install -U scalene
scalene tools/process_data.py --config configs/demo/process.yaml

Configs

# Process config example for dataset

# global parameters
project_name: 'demo-process'
dataset_path: 'demos/data/demo-dataset.jsonl'  # path to your dataset directory or file
np: 4  # number of subprocesses used to process your dataset
use_cache: false
export_path: './outputs/demo-process/demo-processed.jsonl'

# process schedule
# a list of several process operators with their arguments
process:
  - language_id_score_filter:
      lang: 'zh'

Logs

2023-11-29 11:13:15 | INFO | data_juicer.core.executor:107 - Processing data...
2023-11-29 11:13:15 | ERROR | data_juicer.core.executor:165 - An error occurred during Op [language_id_score_filter].
Traceback (most recent call last):
  File "/home/wzp/code/LLMData/open_source/data-juicer/data_juicer/core/executor.py", line 131, in run
    dataset = dataset.add_column(name=Fields.stats,
  File "/home/wzp/code/LLMData/open_source/data-juicer/data_juicer/core/data.py", line 255, in add_column
    return NestedDataset(super().add_column(*args, **kargs))
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/fingerprint.py", line 511, in wrapper
    out = func(dataset, *args, **kwargs)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 5446, in add_column
    dataset = self.flatten_indices() if self._indices is not None else self
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/fingerprint.py", line 511, in wrapper
    out = func(dataset, *args, **kwargs)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3592, in flatten_indices
    return self.map(
  File "/home/wzp/code/LLMData/open_source/data-juicer/data_juicer/core/data.py", line 180, in map
    new_ds = NestedDataset(super().map(*args, **kargs))
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 563, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3004, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3392, in _map_single
    buf_writer, writer, tmp_file = init_buffer_and_writer()
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3326, in init_buffer_and_writer
    tmp_file = tempfile.NamedTemporaryFile("wb", dir=os.path.dirname(cache_file_name), delete=False)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/tempfile.py", line 541, in NamedTemporaryFile
    (fd, name) = _mkstemp_inner(dir, prefix, suffix, flags, output_type)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/tempfile.py", line 250, in _mkstemp_inner
    fd = _os.open(file, flags, 0o600)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp1pvg2dc5/tmpbf_w1ds6'

Screenshots

Additional

No response


simplew2011 · 2023-11-29T03:37:07Z

It seems to be related to use_cache: false.
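
A minimal sketch of why that setting might matter, using only the standard library. The traceback ends in tempfile.NamedTemporaryFile(..., dir=os.path.dirname(cache_file_name)), and with caching disabled datasets writes its cache files to a temporary directory. If that directory no longer exists when the file is created, the same FileNotFoundError appears. That the directory is removed early under scalene is an assumption here, not something confirmed in this thread, and open_in_removed_dir is a hypothetical helper name.

```python
import os
import shutil
import tempfile


def open_in_removed_dir():
    """Try to create a temp file inside a directory that was already deleted."""
    # Stand-in for the temporary cache directory (hypothetical; this is not
    # the real code path inside datasets).
    cache_dir = tempfile.mkdtemp()
    shutil.rmtree(cache_dir)  # simulate the directory disappearing early

    try:
        # Same call shape as the failing line in the traceback.
        tmp_file = tempfile.NamedTemporaryFile("wb", dir=cache_dir, delete=False)
    except FileNotFoundError as err:
        return err

    # Only reached if the directory unexpectedly still exists: clean up.
    tmp_file.close()
    os.remove(tmp_file.name)
    return None


if __name__ == "__main__":
    error = open_in_removed_dir()
    print(type(error).__name__, error)
```

If this is indeed the mechanism, the interesting question for the upstream projects is which process deletes the temporary directory, and when.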

HYLcool · 2023-11-29T13:36:59Z

Hi @simplew2011, thanks for your interest in Data-Juicer!

We think the problem lies in the interaction between scalene and datasets. We wrote a small test snippet:

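from datasets import load_dataset, disable_caching

# Disable caching, as use_cache: false does in the config
disable_caching()

ds = load_dataset('json', data_files='demo-dataset.jsonl', split='train')
# Multi-process filter that keeps samples with non-empty text
ds = ds.filter(lambda s: s['text'], num_proc=4)
# add_column is the call that fails in the reported traceback
ds = ds.add_column(name='id', column=list(range(ds.num_rows)))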

And we got different results:

Run with the plain python command: it works.
Run with the installed scalene command: the same FileNotFoundError is raised at the same position you reported.
Run with the installed scalene command in a single process (after removing the num_proc argument from ds.filter in the snippet above): it works as well.

So it seems there is nothing we can do in Data-Juicer to fix this. You may need to open an issue in the scalene or datasets repository and ask their developers for help.
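
For anyone rerunning the snippet above before filing upstream, a small input file is enough. The following is a sketch that writes a JSONL file with a text field, one JSON object per line, which is what load_dataset('json', ...) reads. The helper name write_demo_dataset and the sample rows are made up for illustration and are not the contents of the demo dataset in the Data-Juicer repo. Whether a file this small triggers the error under scalene has not been verified.

```python
import json
from pathlib import Path


def write_demo_dataset(path="demo-dataset.jsonl", n_rows=8):
    """Write n_rows non-empty samples plus one empty sample as JSONL."""
    # Made-up rows; at least as many as num_proc=4 in the snippet.
    rows = [{"text": f"sample sentence {i}"} for i in range(n_rows)]
    # An empty text is falsy, so the filter in the snippet drops this row.
    rows.append({"text": ""})

    with Path(path).open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return len(rows)


if __name__ == "__main__":
    count = write_demo_dataset()
    print(f"wrote {count} rows")
```

Running the test snippet once with python and once with scalene against the same file keeps the comparison limited to the profiler.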

github-actions · 2023-12-21T09:31:59Z

This issue is marked as stale because there has been no activity for 21 days. Remove the stale label or add a new comment, or this issue will be closed in 3 days.

github-actions · 2023-12-25T09:31:54Z

Closing this stale issue.

simplew2011 added the bug Something isn't working label Nov 29, 2023

github-project-automation bot added this to data-juicer Nov 29, 2023

github-project-automation bot moved this to Todo in data-juicer Nov 29, 2023

HYLcool moved this from Todo to In Progress in data-juicer Nov 29, 2023

HYLcool self-assigned this Nov 29, 2023

simplew2011 mentioned this issue Nov 30, 2023

No such file or directory: '/tmp/tmp1pvg2dc5/tmpbf_w1ds6' plasma-umass/scalene#731

Open

github-actions bot added the stale-issue label Dec 21, 2023

github-actions bot closed this as completed Dec 25, 2023

github-project-automation bot moved this from In Progress to Done in data-juicer Dec 25, 2023
