[Bug]: Running tools/analyze_data.py fails with KeyError: 'text' #296

Closed
3 tasks done
promisecc opened this issue Apr 15, 2024 · 2 comments · Fixed by #300
Labels: bug, good first issue


Before Reporting

  • I have pulled the latest code from the main branch and run it again, and the bug still exists.

  • I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend asking a question using the Question template.)

Search before reporting

  • I have searched the Data-Juicer issues and found no similar bugs.

OS

macOS

Installation Method

from source

Data-Juicer Version

latest version

Python Version

3.9

Describe the bug

2024-04-15 16:15:39 | ERROR | __main__:13 - An error has been caught in function '<module>', process 'MainProcess' (83335), thread 'MainThread' (8040734208):
multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1353, in _write_generator_to_queue
    for i, result in enumerate(func(**kwargs)):
  File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3358, in _map_single
    example = apply_function_on_filtered_inputs(example, i, offset=offset)
  File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3261, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 47, in wrapped_f
    return f(*args, **kargs)
  File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 47, in wrapped_f
    return f(*args, **kargs)
  File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/ops/filter/language_id_score_filter.py", line 56, in compute_stats
    text = sample[self.text_key].lower().replace('\n', ' ')
  File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/formatting/formatting.py", line 280, in __getitem__
    value = self.data[key]
KeyError: 'text'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

File "/Users/guangshengliu/LLM/data/data-juicer/tools/analyze_data.py", line 13, in <module>
main()
└ <function main at 0x122843dc0>

File "/Users/guangshengliu/LLM/data/data-juicer/tools/analyze_data.py", line 9, in main
analyser.run()
│ └ <function Analyser.run at 0x17edb4f70>
└ <data_juicer.core.analyser.Analyser object at 0x101d91250>

File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/analyser.py", line 100, in run
dataset = dataset.map(op.compute_stats,
│ │ │ └ <function LanguageIDScoreFilter.compute_stats at 0x16f77a280>
│ │ └ <data_juicer.ops.filter.language_id_score_filter.LanguageIDScoreFilter object at 0x16f77fac0>
│ └ <function NestedDataset.map at 0x17edb44c0>
└ Dataset({
features: ['instruction', 'input', 'output', '__dj__stats__'],
num_rows: 1000
})

File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 180, in map
new_ds = NestedDataset(super().map(*args, **kargs))
│ │ └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '8d414177a08f8240'}
│ └ [<function LanguageIDScoreFilter.compute_stats at 0x16f77a1f0>]
└ <class 'data_juicer.core.data.NestedDataset'>

File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 563, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
│ │ │ │ └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '8d414177a08f8240'}
│ │ │ └ (<function LanguageIDScoreFilter.compute_stats at 0x16f77a1f0>,)
│ │ └ Dataset({
│ │ features: ['instruction', 'input', 'output', '__dj__stats__'],
│ │ num_rows: 1000
│ │ })
│ └ <function Dataset.map at 0x139ca51f0>
└ typing.Union
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
│ │ │ │ └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '8d414177a08f8240'}
│ │ │ └ (<function LanguageIDScoreFilter.compute_stats at 0x16f77a1f0>,)
│ │ └ Dataset({
│ │ features: ['instruction', 'input', 'output', '__dj__stats__'],
│ │ num_rows: 1000
│ │ })
│ └ <function Dataset.map at 0x139ca5160>
└ typing.Union
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3097, in map
for rank, done, content in iflatmap_unordered(
│ │ │ └ <function iflatmap_unordered at 0x1394380d0>
│ │ └ 0
│ └ False
└ 3
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1377, in iflatmap_unordered
[async_result.get() for async_result in async_results]
└ [<multiprocess.pool.ApplyResult object at 0x16f839e50>, <multiprocess.pool.ApplyResult object at 0x16f850790>, <multiprocess....
(data_juicer) guangshengliu@MacBook-Air data % cd data-juicer
(data_juicer) guangshengliu@MacBook-Air data-juicer % python tools/analyze_data.py --config configs/demo/analyser.yaml
2024-04-15 16:20:47.909 | DEBUG | data_juicer.utils.availability_utils:_is_package_available:116 - Detected torch version 2.2.2
2024-04-15 16:20:48.682 | INFO | data_juicer:setup_mp:58 - Setting multiprocess start method to 'fork'.
2024-04-15 16:20:48.682 | DEBUG | data_juicer:setup_cuda:72 - _USE_CUDA: False | MP: fork (MainProcess)
2024-04-15 16:20:51 | INFO | data_juicer.config.config:533 - Back up the input config file [/Users/guangshengliu/LLM/data/data-juicer/configs/demo/analyser.yaml] into the work_dir [/Users/guangshengliu/LLM/data/data-juicer/outputs/demo-analyser]
2024-04-15 16:20:51 | INFO | data_juicer.config.config:554 - Configuration table:
╒════════════════════════╤══════════════════════════════════════════════════════════════════════════════════════════════╕
│ key │ values │
╞════════════════════════╪══════════════════════════════════════════════════════════════════════════════════════════════╡
│ config │ [Path_fr(configs/demo/analyser.yaml, cwd=/Users/guangshengliu/LLM/data/data-juicer)] │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ hpo_config │ None │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ path_3sigma_recipe │ None │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ project_name │ 'demo-analyser' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ executor_type │ 'default' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ dataset_path │ '/Users/guangshengliu/LLM/data/data-juicer/raw_data/aqua_train.json' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ export_path │ '/Users/guangshengliu/LLM/data/data-juicer/outputs/demo-analyser/demo-analyser-result.jsonl' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ export_shard_size │ 0 │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ export_in_parallel │ False │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ keep_stats_in_res_ds │ False │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ keep_hashes_in_res_ds │ False │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ np │ 4 │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ text_keys │ ['instruction', 'output'] │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ image_key │ 'images' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ image_special_token │ '<__dj__image>' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ audio_key │ 'audios' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ audio_special_token │ '<__dj__audio>' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ video_key │ 'videos' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ video_special_token │ '<__dj__video>' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ eoc_special_token │ '<|__dj__eoc|>' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ suffixes │ [] │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ use_cache │ True │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ ds_cache_dir │ '/Users/guangshengliu/.cache/huggingface/datasets' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ cache_compress │ None │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ use_checkpoint │ False │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ temp_dir │ None │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ open_tracer │ False │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ op_list_to_trace │ [] │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ trace_num │ 10 │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ op_fusion │ False │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ process │ [{'language_id_score_filter': {'accelerator': 'cpu', │
│ │ 'audio_key': 'audios', │
│ │ 'cpu_required': 1, │
│ │ 'image_key': 'images', │
│ │ 'lang': 'zh', │
│ │ 'mem_required': 0, │
│ │ 'min_score': 0.8, │
│ │ 'spec_numprocs': 0, │
│ │ 'text_key': 'text', │
│ │ 'use_actor': False, │
│ │ 'video_key': 'videos'}}, │
│ │ {'perplexity_filter': {'accelerator': 'cpu', │
│ │ 'audio_key': 'audios', │
│ │ 'cpu_required': 1, │
│ │ 'image_key': 'images', │
│ │ 'lang': 'zh', │
│ │ 'max_ppl': 1500, │
│ │ 'mem_required': 0, │
│ │ 'spec_numprocs': 0, │
│ │ 'text_key': 'text', │
│ │ 'use_actor': False, │
│ │ 'video_key': 'videos'}}] │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ save_stats_in_one_file │ False │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ ray_address │ 'auto' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ work_dir │ '/Users/guangshengliu/LLM/data/data-juicer/outputs/demo-analyser' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ timestamp │ '20240415162051' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ dataset_dir │ '/Users/guangshengliu/LLM/data/data-juicer/raw_data' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ add_suffix │ False │
╘════════════════════════╧══════════════════════════════════════════════════════════════════════════════════════════════╛
2024-04-15 16:20:51 | INFO | data_juicer.core.analyser:39 - Using cache compression method: [None]
2024-04-15 16:20:51 | INFO | data_juicer.core.analyser:44 - Setting up data formatter...
2024-04-15 16:20:51 | INFO | data_juicer.core.analyser:53 - Preparing exporter...
2024-04-15 16:20:51 | INFO | data_juicer.core.analyser:75 - Loading dataset from data formatter...
2024-04-15 16:20:53 | INFO | datasets.load:1791 - Downloading and preparing dataset json/default to /Users/guangshengliu/.cache/huggingface/datasets/json/default-f3511e4c073db59c/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e...
Downloading data files: 100%|##########| 1/1 [00:00<00:00, 5377.31it/s]
Extracting data files: 100%|##########| 1/1 [00:00<00:00, 484.95it/s]
2024-04-15 16:20:53 | INFO | logging:952 - Setting num_proc from 4 back to 1 for the json split to disable multiprocessing as it only contains one shard.
2024-04-15 16:20:53 | INFO | datasets.load:1791 - Dataset json downloaded and prepared to /Users/guangshengliu/.cache/huggingface/datasets/json/default-f3511e4c073db59c/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e. Subsequent calls will reuse this data.
100%|##########| 1/1 [00:00<00:00, 352.61it/s]
2024-04-15 16:20:53 | INFO | data_juicer.format.formatter:185 - Unifying the input dataset formats...
2024-04-15 16:20:53 | INFO | data_juicer.format.formatter:200 - There are 2728 sample(s) in the original dataset.
2024-04-15 16:20:53 | INFO | data_juicer.format.formatter:214 - 2728 samples left after filtering empty text.
2024-04-15 16:20:53 | INFO | data_juicer.format.mixture_formatter:136 - sampled 2728 from 2728
2024-04-15 16:20:53 | INFO | data_juicer.format.mixture_formatter:142 - There are 2728 in final dataset
2024-04-15 16:20:53 | INFO | data_juicer.core.analyser:81 - Preparing process operators...
2024-04-15 16:20:53 | INFO | data_juicer.utils.model_utils:102 - Loading fasttext language identification model...
Warning : load_model does not return WordVectorModel or SupervisedModel any more, but a FastText object which is very similar.
2024-04-15 16:20:53 | WARNING | data_juicer.ops.load:24 - This OP [perplexity_filter] is unavailable due to importing third-party requirements of this OP failure: ['sentencepiece', 'kenlm']. You can either run pip install -v -e .[sci] to install all requirements for all OPs, or run pip install sentencepiece kenlm with library version specified by environments/science_requires.txt to install libraries required by this OP. Data processing will skip this OP later.
2024-04-15 16:20:53 | INFO | data_juicer.core.analyser:86 - Computing the stats of dataset...
2024-04-15 16:20:53 | ERROR | __main__:13 - An error has been caught in function '<module>', process 'MainProcess' (84118), thread 'MainThread' (8040734208):
multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1353, in _write_generator_to_queue
    for i, result in enumerate(func(**kwargs)):
  File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3358, in _map_single
    example = apply_function_on_filtered_inputs(example, i, offset=offset)
  File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3261, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 47, in wrapped_f
    return f(*args, **kargs)
  File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 47, in wrapped_f
    return f(*args, **kargs)
  File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/ops/filter/language_id_score_filter.py", line 56, in compute_stats
    text = sample[self.text_key].lower().replace('\n', ' ')
  File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/formatting/formatting.py", line 280, in __getitem__
    value = self.data[key]
KeyError: 'text'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

File "/Users/guangshengliu/LLM/data/data-juicer/tools/analyze_data.py", line 13, in <module>
main()
└ <function main at 0x124c03dc0>

File "/Users/guangshengliu/LLM/data/data-juicer/tools/analyze_data.py", line 9, in main
analyser.run()
│ └ <function Analyser.run at 0x291876f70>
└ <data_juicer.core.analyser.Analyser object at 0x10405c250>

File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/analyser.py", line 100, in run
dataset = dataset.map(op.compute_stats,
│ │ │ └ <function LanguageIDScoreFilter.compute_stats at 0x291ebea60>
│ │ └ <data_juicer.ops.filter.language_id_score_filter.LanguageIDScoreFilter object at 0x16d487d60>
│ └ <function NestedDataset.map at 0x2918764c0>
└ Dataset({
features: ['instruction', 'input', 'output', '__dj__stats__'],
num_rows: 2728
})

File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 180, in map
new_ds = NestedDataset(super().map(*args, **kargs))
│ │ └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '573d685b3bbeed84'}
│ └ [<function LanguageIDScoreFilter.compute_stats at 0x291e9a0d0>]
└ <class 'data_juicer.core.data.NestedDataset'>

File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 563, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
│ │ │ │ └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '573d685b3bbeed84'}
│ │ │ └ (<function LanguageIDScoreFilter.compute_stats at 0x291e9a0d0>,)
│ │ └ Dataset({
│ │ features: ['instruction', 'input', 'output', '__dj__stats__'],
│ │ num_rows: 2728
│ │ })
│ └ <function Dataset.map at 0x13fad51f0>
└ typing.Union
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
│ │ │ │ └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '573d685b3bbeed84'}
│ │ │ └ (<function LanguageIDScoreFilter.compute_stats at 0x291e9a0d0>,)
│ │ └ Dataset({
│ │ features: ['instruction', 'input', 'output', '__dj__stats__'],
│ │ num_rows: 2728
│ │ })
│ └ <function Dataset.map at 0x13fad5160>
└ typing.Union
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3097, in map
for rank, done, content in iflatmap_unordered(
│ │ │ └ <function iflatmap_unordered at 0x13f2780d0>
│ │ └ 0
│ └ False
└ 3
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1377, in iflatmap_unordered
[async_result.get() for async_result in async_results]
└ [<multiprocess.pool.ApplyResult object at 0x291e2a670>, <multiprocess.pool.ApplyResult object at 0x291e2a790>, <multiprocess....
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1377, in <listcomp>
[async_result.get() for async_result in async_results]
│ │ └ <multiprocess.pool.ApplyResult object at 0x291e2a670>
│ └ <function ApplyResult.get at 0x13f2764c0>
└ <multiprocess.pool.ApplyResult object at 0x291e2a670>
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/multiprocess/pool.py", line 771, in get
raise self._value
│ └ KeyError('text')
└ <multiprocess.pool.ApplyResult object at 0x291e2a670>

KeyError: 'text'
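
The log above points at a likely cause: the dataset columns are `instruction`, `input` and `output` (plus the stats column), the global `text_keys` is `['instruction', 'output']`, yet the `language_id_score_filter` entry in the configuration table still shows `text_key: 'text'`. The sketch below reproduces that mismatch with a plain dict standing in for one dataset row. `find_missing_text_keys` is a hypothetical helper written for illustration, not part of Data-Juicer, and the sample values are made up.

```python
# Hypothetical stand-in for one dataset row. Column names come from the
# `features` list in the traceback; the values are invented.
sample = {"instruction": "Solve the problem.", "input": "", "output": "42"}

# Per-operator settings, abridged from the configuration table in the log.
process = [
    {"language_id_score_filter": {"lang": "zh", "min_score": 0.8, "text_key": "text"}},
    {"perplexity_filter": {"lang": "zh", "max_ppl": 1500, "text_key": "text"}},
]


def find_missing_text_keys(process, columns):
    """Return (op_name, text_key) pairs whose text_key is not a dataset column."""
    columns = set(columns)
    missing = []
    for op in process:
        for op_name, args in op.items():
            key = args.get("text_key")
            if key is not None and key not in columns:
                missing.append((op_name, key))
    return missing


missing = find_missing_text_keys(process, sample.keys())
print(missing)

# The same kind of lookup as in compute_stats fails the same way as in the report.
try:
    sample["text"].lower().replace("\n", " ")
except KeyError as err:
    print("KeyError:", err)
```

Running a check like this before computing stats would surface the mismatch as a readable message rather than a traceback from a worker process. Whether the proper fix is to point each operator at an existing column or to change how `text_keys` is propagated to operators is not stated in this report; the issue was closed by #300.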

To Reproduce

(data_juicer) guangshengliu@MacBook-Air data % cd data-juicer
(data_juicer) guangshengliu@MacBook-Air data-juicer % python tools/analyze_data.py --config configs/demo/analyser.yaml
2024-04-15 16:20:53 | INFO | datasets.load:1791 - Downloading and preparing dataset json/default to /Users/guangshengliu/.cache/huggingface/datasets/json/default-f3511e4c073db59c/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e...
Downloading data files: 100%|##########| 1/1 [00:00<00:00, 5377.31it/s]
Extracting data files: 100%|##########| 1/1 [00:00<00:00, 484.95it/s]
2024-04-15 16:20:53 | INFO | logging:952 - Setting num_proc from 4 back to 1 for the json split to disable multiprocessing as it only contains one shard.
2024-04-15 16:20:53 | INFO | datasets.load:1791 - Dataset json downloaded and prepared to /Users/guangshengliu/.cache/huggingface/datasets/json/default-f3511e4c073db59c/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e. Subsequent calls will reuse this data.
100%|##########| 1/1 [00:00<00:00, 352.61it/s]
2024-04-15 16:20:53 | INFO | data_juicer.format.formatter:185 - Unifying the input dataset formats...
2024-04-15 16:20:53 | INFO | data_juicer.format.formatter:200 - There are 2728 sample(s) in the original dataset.
2024-04-15 16:20:53 | INFO | data_juicer.format.formatter:214 - 2728 samples left after filtering empty text.
2024-04-15 16:20:53 | INFO | data_juicer.format.mixture_formatter:136 - sampled 2728 from 2728
2024-04-15 16:20:53 | INFO | data_juicer.format.mixture_formatter:142 - There are 2728 in final dataset
2024-04-15 16:20:53 | INFO | data_juicer.core.analyser:81 - Preparing process operators...
2024-04-15 16:20:53 | INFO | data_juicer.utils.model_utils:102 - Loading fasttext language identification model...
Warning : load_model does not return WordVectorModel or SupervisedModel any more, but a FastText object which is very similar.
2024-04-15 16:20:53 | WARNING | data_juicer.ops.load:24 - This OP [perplexity_filter] is unavailable due to importing third-party requirements of this OP failure: ['sentencepiece', 'kenlm']. You can either run pip install -v -e .[sci] to install all requirements for all OPs, or run pip install sentencepiece kenlm with library version specified by environments/science_requires.txt to install libraries required by this OP. Data processing will skip this OP later.
2024-04-15 16:20:53 | INFO | data_juicer.core.analyser:86 - Computing the stats of dataset...
2024-04-15 16:20:53 | ERROR | __main__:13 - An error has been caught in function '<module>', process 'MainProcess' (84118), thread 'MainThread' (8040734208):
multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/multiprocess/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1353, in _write_generator_to_queue
for i, result in enumerate(func(**kwargs)):
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3358, in _map_single
example = apply_function_on_filtered_inputs(example, i, offset=offset)
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3261, in apply_function_on_filtered_inputs
processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 47, in wrapped_f
return f(*args, **kargs)
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 47, in wrapped_f
return f(*args, **kargs)
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/ops/filter/language_id_score_filter.py", line 56, in compute_stats
text = sample[self.text_key].lower().replace('\n', ' ')
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/formatting/formatting.py", line 280, in __getitem__
value = self.data[key]
KeyError: 'text'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

File "/Users/guangshengliu/LLM/data/data-juicer/tools/analyze_data.py", line 13, in <module>
main()
└ <function main at 0x124c03dc0>

File "/Users/guangshengliu/LLM/data/data-juicer/tools/analyze_data.py", line 9, in main
analyser.run()
│ └ <function Analyser.run at 0x291876f70>
└ <data_juicer.core.analyser.Analyser object at 0x10405c250>

File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/analyser.py", line 100, in run
dataset = dataset.map(op.compute_stats,
│ │ │ └ <function LanguageIDScoreFilter.compute_stats at 0x291ebea60>
│ │ └ <data_juicer.ops.filter.language_id_score_filter.LanguageIDScoreFilter object at 0x16d487d60>
│ └ <function NestedDataset.map at 0x2918764c0>
└ Dataset({
features: ['instruction', 'input', 'output', 'dj__stats'],
num_rows: 2728
})

File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 180, in map
new_ds = NestedDataset(super().map(*args, **kargs))
│ │ └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '573d685b3bbeed84'}
│ └ [<function LanguageIDScoreFilter.compute_stats at 0x291e9a0d0>]
└ <class 'data_juicer.core.data.NestedDataset'>

File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 563, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
│ │ │ │ └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '573d685b3bbeed84'}
│ │ │ └ (<function LanguageIDScoreFilter.compute_stats at 0x291e9a0d0>,)
│ │ └ Dataset({
│ │ features: ['instruction', 'input', 'output', 'dj__stats'],
│ │ num_rows: 2728
│ │ })
│ └ <function Dataset.map at 0x13fad51f0>
└ typing.Union
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
│ │ │ │ └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '573d685b3bbeed84'}
│ │ │ └ (<function LanguageIDScoreFilter.compute_stats at 0x291e9a0d0>,)
│ │ └ Dataset({
│ │ features: ['instruction', 'input', 'output', 'dj__stats'],
│ │ num_rows: 2728
│ │ })
│ └ <function Dataset.map at 0x13fad5160>
└ typing.Union
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3097, in map
for rank, done, content in iflatmap_unordered(
│ │ │ └ <function iflatmap_unordered at 0x13f2780d0>
│ │ └ 0
│ └ False
└ 3
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1377, in iflatmap_unordered
[async_result.get() for async_result in async_results]
└ [<multiprocess.pool.ApplyResult object at 0x291e2a670>, <multiprocess.pool.ApplyResult object at 0x291e2a790>, <multiprocess....
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1377, in <listcomp>
[async_result.get() for async_result in async_results]
│ │ └ <multiprocess.pool.ApplyResult object at 0x291e2a670>
│ └ <function ApplyResult.get at 0x13f2764c0>
└ <multiprocess.pool.ApplyResult object at 0x291e2a670>
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/multiprocess/pool.py", line 771, in get
raise self._value
│ └ KeyError('text')
└ <multiprocess.pool.ApplyResult object at 0x291e2a670>

KeyError: 'text'

Configs

No response

Logs

No response

Screenshots

No response

Additional

No response

@promisecc promisecc added the bug Something isn't working label Apr 15, 2024
@HYLcool
Collaborator

HYLcool commented Apr 16, 2024

@promisecc

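Thank you for your interest in and use of data-juicer.

I noticed that the dataset you are analyzing contains the following three text fields: ['instruction', 'input', 'output']. Although you set text_keys to ['instruction', 'output'], the operator's text_key is still 'text'. Please check whether you set the text_key parameter to 'text' separately for the operator. If so, you can remove the text_key setting from the operator, so that it inherits the global text_keys setting.

In addition, if it is convenient, you could share the contents of your config file, which would help us locate the problem further.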
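The check suggested above can be sketched in a few lines of Python. This is only an illustration, not part of Data-Juicer: `find_missing_text_keys` is a hypothetical helper, and the `columns` and `process` values are copied (in shortened form) from the dataset features and the `process` entry shown in the log.

```python
def find_missing_text_keys(columns, process):
    """Return (op_name, text_key) pairs whose text_key is not a dataset column.

    Hypothetical helper for illustration; `process` uses the same
    list-of-single-key-dicts layout as the `process` entry in the log.
    """
    missing = []
    for op in process:
        for op_name, op_args in op.items():
            key = op_args.get('text_key')
            if key is not None and key not in columns:
                missing.append((op_name, key))
    return missing


# Columns reported in the traceback, without the stats column.
columns = ['instruction', 'input', 'output']

# Shortened copy of the operator settings shown in the config table.
process = [
    {'language_id_score_filter': {'lang': 'zh', 'min_score': 0.8, 'text_key': 'text'}},
    {'perplexity_filter': {'lang': 'zh', 'max_ppl': 1500, 'text_key': 'text'}},
]

print(find_missing_text_keys(columns, process))
# [('language_id_score_filter', 'text'), ('perplexity_filter', 'text')]
```

Both operators point at a `text` column that the dataset does not have, which matches the `KeyError: 'text'` raised in `compute_stats`.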

@HYLcool HYLcool self-assigned this Apr 16, 2024
@HYLcool HYLcool added the good first issue Good for newcomers label Apr 18, 2024
@HYLcool HYLcool linked a pull request Apr 18, 2024 that will close this issue
HYLcool pushed a commit that referenced this issue Apr 18, 2024
* fix Bug: KeyError: 'text' 

In data_juicer/config/config.py, lines 418-429 did not handle the case where the text_key argument was already initialized to 'text'. As a result, text_key was not updated properly and always kept the value 'text'.

* Fix bug: text_key is not updated correctly

* Update config.py

Normalize Format
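The behaviour described in this commit message can be illustrated with a small sketch. It is not the code from PR #300: `resolve_op_args` and `DEFAULT_TEXT_KEY` are hypothetical names, and handing the first entry of the global `text_keys` to the operator is an assumption made for the example.

```python
DEFAULT_TEXT_KEY = 'text'  # assumed default value of an operator's text_key


def resolve_op_args(op_args, global_text_keys):
    """Return a copy of op_args with text_key filled in from the global setting.

    Hypothetical helper: the key is replaced when it is missing or still equal
    to the default, which is the case the commit message says was overlooked.
    """
    resolved = dict(op_args)
    current = resolved.get('text_key')
    if global_text_keys and current in (None, DEFAULT_TEXT_KEY):
        # Assumption: the first global key is the one handed to the operator.
        resolved['text_key'] = global_text_keys[0]
    return resolved


op_args = {'lang': 'zh', 'min_score': 0.8, 'text_key': 'text'}
resolved = resolve_op_args(op_args, ['instruction', 'output'])
print(resolved['text_key'])  # instruction
```

A check that only tests for a missing `text_key` would leave the default `'text'` in place. In this sketch an operator that explicitly sets a different key, such as `'output'`, keeps its own value.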
@github-project-automation github-project-automation bot moved this from Todo to Done in data-juicer Apr 18, 2024
@HYLcool
Collaborator

HYLcool commented Apr 18, 2024

Closed by PR #300, fixed by @shiweijiezero. Thanks! 👍🏻
