[Bug]: Running tools/analyze_data.py fails with KeyError: 'text' #296

promisecc · 2024-04-15T08:25:19Z

Before Reporting

I have pulled the latest code from the main branch and run it again, and the bug still exists.
I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you ask a question using the Question template.)

Search before reporting

I have searched the Data-Juicer issues and found no similar bugs.

OS

macOS

Installation Method

from source

Data-Juicer Version

latest version

Python Version

3.9

Describe the bug

2024-04-15 16:15:39 | ERROR | __main__:13 - An error has been caught in function '<module>', process 'MainProcess' (83335), thread 'MainThread' (8040734208):
multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/multiprocess/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1353, in _write_generator_to_queue
for i, result in enumerate(func(**kwargs)):
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3358, in _map_single
example = apply_function_on_filtered_inputs(example, i, offset=offset)
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3261, in apply_function_on_filtered_inputs
processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 47, in wrapped_f
return f(*args, **kargs)
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 47, in wrapped_f
return f(*args, **kargs)
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/ops/filter/language_id_score_filter.py", line 56, in compute_stats
text = sample[self.text_key].lower().replace('\n', ' ')
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/formatting/formatting.py", line 280, in __getitem__
value = self.data[key]
KeyError: 'text'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

File "/Users/guangshengliu/LLM/data/data-juicer/tools/analyze_data.py", line 13, in <module>
main()
└ <function main at 0x122843dc0>

File "/Users/guangshengliu/LLM/data/data-juicer/tools/analyze_data.py", line 9, in main
analyser.run()
│ └ <function Analyser.run at 0x17edb4f70>
└ <data_juicer.core.analyser.Analyser object at 0x101d91250>

File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/analyser.py", line 100, in run
dataset = dataset.map(op.compute_stats,
│ │ │ └ <function LanguageIDScoreFilter.compute_stats at 0x16f77a280>
│ │ └ <data_juicer.ops.filter.language_id_score_filter.LanguageIDScoreFilter object at 0x16f77fac0>
│ └ <function NestedDataset.map at 0x17edb44c0>
└ Dataset({
features: ['instruction', 'input', 'output', '__dj__stats__'],
num_rows: 1000
})

File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 180, in map
new_ds = NestedDataset(super().map(*args, **kargs))
│ │ └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '8d414177a08f8240'}
│ └ [<function LanguageIDScoreFilter.compute_stats at 0x16f77a1f0>]
└ <class 'data_juicer.core.data.NestedDataset'>

File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 563, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
│ │ │ │ └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '8d414177a08f8240'}
│ │ │ └ (<function LanguageIDScoreFilter.compute_stats at 0x16f77a1f0>,)
│ │ └ Dataset({
│ │ features: ['instruction', 'input', 'output', '__dj__stats__'],
│ │ num_rows: 1000
│ │ })
│ └ <function Dataset.map at 0x139ca51f0>
└ typing.Union
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
│ │ │ │ └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '8d414177a08f8240'}
│ │ │ └ (<function LanguageIDScoreFilter.compute_stats at 0x16f77a1f0>,)
│ │ └ Dataset({
│ │ features: ['instruction', 'input', 'output', '__dj__stats__'],
│ │ num_rows: 1000
│ │ })
│ └ <function Dataset.map at 0x139ca5160>
└ typing.Union
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3097, in map
for rank, done, content in iflatmap_unordered(
│ │ │ └ <function iflatmap_unordered at 0x1394380d0>
│ │ └ 0
│ └ False
└ 3
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1377, in iflatmap_unordered
[async_result.get() for async_result in async_results]
└ [<multiprocess.pool.ApplyResult object at 0x16f839e50>, <multiprocess.pool.ApplyResult object at 0x16f850790>, <multiprocess....
(data_juicer) guangshengliu@MacBook-Air data % cd data-juicer
(data_juicer) guangshengliu@MacBook-Air data-juicer % python tools/analyze_data.py --config configs/demo/analyser.yaml
2024-04-15 16:20:47.909 | DEBUG | data_juicer.utils.availability_utils:_is_package_available:116 - Detected torch version 2.2.2
2024-04-15 16:20:48.682 | INFO | data_juicer:setup_mp:58 - Setting multiprocess start method to 'fork'.
2024-04-15 16:20:48.682 | DEBUG | data_juicer:setup_cuda:72 - _USE_CUDA: False | MP: fork (MainProcess)
2024-04-15 16:20:51 | INFO | data_juicer.config.config:533 - Back up the input config file [/Users/guangshengliu/LLM/data/data-juicer/configs/demo/analyser.yaml] into the work_dir [/Users/guangshengliu/LLM/data/data-juicer/outputs/demo-analyser]
2024-04-15 16:20:51 | INFO | data_juicer.config.config:554 - Configuration table:
╒════════════════════════╤══════════════════════════════════════════════════════════════════════════════════════════════╕
│ key │ values │
╞════════════════════════╪══════════════════════════════════════════════════════════════════════════════════════════════╡
│ config │ [Path_fr(configs/demo/analyser.yaml, cwd=/Users/guangshengliu/LLM/data/data-juicer)] │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ hpo_config │ None │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ path_3sigma_recipe │ None │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ project_name │ 'demo-analyser' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ executor_type │ 'default' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ dataset_path │ '/Users/guangshengliu/LLM/data/data-juicer/raw_data/aqua_train.json' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ export_path │ '/Users/guangshengliu/LLM/data/data-juicer/outputs/demo-analyser/demo-analyser-result.jsonl' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ export_shard_size │ 0 │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ export_in_parallel │ False │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ keep_stats_in_res_ds │ False │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ keep_hashes_in_res_ds │ False │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ np │ 4 │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ text_keys │ ['instruction', 'output'] │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ image_key │ 'images' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ image_special_token │ '<__dj__image>' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ audio_key │ 'audios' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ audio_special_token │ '<__dj__audio>' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ video_key │ 'videos' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ video_special_token │ '<__dj__video>' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ eoc_special_token │ '<|__dj__eoc|>' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ suffixes │ [] │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ use_cache │ True │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ ds_cache_dir │ '/Users/guangshengliu/.cache/huggingface/datasets' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ cache_compress │ None │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ use_checkpoint │ False │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ temp_dir │ None │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ open_tracer │ False │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ op_list_to_trace │ [] │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ trace_num │ 10 │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ op_fusion │ False │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ process │ [{'language_id_score_filter': {'accelerator': 'cpu', │
│ │ 'audio_key': 'audios', │
│ │ 'cpu_required': 1, │
│ │ 'image_key': 'images', │
│ │ 'lang': 'zh', │
│ │ 'mem_required': 0, │
│ │ 'min_score': 0.8, │
│ │ 'spec_numprocs': 0, │
│ │ 'text_key': 'text', │
│ │ 'use_actor': False, │
│ │ 'video_key': 'videos'}}, │
│ │ {'perplexity_filter': {'accelerator': 'cpu', │
│ │ 'audio_key': 'audios', │
│ │ 'cpu_required': 1, │
│ │ 'image_key': 'images', │
│ │ 'lang': 'zh', │
│ │ 'max_ppl': 1500, │
│ │ 'mem_required': 0, │
│ │ 'spec_numprocs': 0, │
│ │ 'text_key': 'text', │
│ │ 'use_actor': False, │
│ │ 'video_key': 'videos'}}] │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ save_stats_in_one_file │ False │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ ray_address │ 'auto' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ work_dir │ '/Users/guangshengliu/LLM/data/data-juicer/outputs/demo-analyser' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ timestamp │ '20240415162051' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ dataset_dir │ '/Users/guangshengliu/LLM/data/data-juicer/raw_data' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ add_suffix │ False │
╘════════════════════════╧══════════════════════════════════════════════════════════════════════════════════════════════╛
2024-04-15 16:20:51 | INFO | data_juicer.core.analyser:39 - Using cache compression method: [None]
2024-04-15 16:20:51 | INFO | data_juicer.core.analyser:44 - Setting up data formatter...
2024-04-15 16:20:51 | INFO | data_juicer.core.analyser:53 - Preparing exporter...
2024-04-15 16:20:51 | INFO | data_juicer.core.analyser:75 - Loading dataset from data formatter...
2024-04-15 16:20:53 | INFO | datasets.load:1791 - Downloading and preparing dataset json/default to /Users/guangshengliu/.cache/huggingface/datasets/json/default-f3511e4c073db59c/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e...
Downloading data files: 100%|##########| 1/1 [00:00<00:00, 5377.31it/s]
Extracting data files: 100%|##########| 1/1 [00:00<00:00, 484.95it/s]
2024-04-15 16:20:53 | INFO | logging:952 - Setting num_proc from 4 back to 1 for the json split to disable multiprocessing as it only contains one shard.
2024-04-15 16:20:53 | INFO | datasets.load:1791 - Dataset json downloaded and prepared to /Users/guangshengliu/.cache/huggingface/datasets/json/default-f3511e4c073db59c/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e. Subsequent calls will reuse this data.
100%|##########| 1/1 [00:00<00:00, 352.61it/s]
2024-04-15 16:20:53 | INFO | data_juicer.format.formatter:185 - Unifying the input dataset formats...
2024-04-15 16:20:53 | INFO | data_juicer.format.formatter:200 - There are 2728 sample(s) in the original dataset.
2024-04-15 16:20:53 | INFO | data_juicer.format.formatter:214 - 2728 samples left after filtering empty text.
2024-04-15 16:20:53 | INFO | data_juicer.format.mixture_formatter:136 - sampled 2728 from 2728
2024-04-15 16:20:53 | INFO | data_juicer.format.mixture_formatter:142 - There are 2728 in final dataset
2024-04-15 16:20:53 | INFO | data_juicer.core.analyser:81 - Preparing process operators...
2024-04-15 16:20:53 | INFO | data_juicer.utils.model_utils:102 - Loading fasttext language identification model...
Warning : load_model does not return WordVectorModel or SupervisedModel any more, but a FastText object which is very similar.
2024-04-15 16:20:53 | WARNING | data_juicer.ops.load:24 - This OP [perplexity_filter] is unavailable due to importing third-party requirements of this OP failure: ['sentencepiece', 'kenlm']. You can either run pip install -v -e .[sci] to install all requirements for all OPs, or run pip install sentencepiece kenlm with library version specified by environments/science_requires.txt to install libraries required by this OP. Data processing will skip this OP later.
2024-04-15 16:20:53 | INFO | data_juicer.core.analyser:86 - Computing the stats of dataset...
2024-04-15 16:20:53 | ERROR | __main__:13 - An error has been caught in function '<module>', process 'MainProcess' (84118), thread 'MainThread' (8040734208):
multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/multiprocess/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1353, in _write_generator_to_queue
for i, result in enumerate(func(**kwargs)):
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3358, in _map_single
example = apply_function_on_filtered_inputs(example, i, offset=offset)
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3261, in apply_function_on_filtered_inputs
processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 47, in wrapped_f
return f(*args, **kargs)
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 47, in wrapped_f
return f(*args, **kargs)
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/ops/filter/language_id_score_filter.py", line 56, in compute_stats
text = sample[self.text_key].lower().replace('\n', ' ')
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/formatting/formatting.py", line 280, in __getitem__
value = self.data[key]
KeyError: 'text'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

File "/Users/guangshengliu/LLM/data/data-juicer/tools/analyze_data.py", line 13, in <module>
main()
└ <function main at 0x124c03dc0>

File "/Users/guangshengliu/LLM/data/data-juicer/tools/analyze_data.py", line 9, in main
analyser.run()
│ └ <function Analyser.run at 0x291876f70>
└ <data_juicer.core.analyser.Analyser object at 0x10405c250>

File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/analyser.py", line 100, in run
dataset = dataset.map(op.compute_stats,
│ │ │ └ <function LanguageIDScoreFilter.compute_stats at 0x291ebea60>
│ │ └ <data_juicer.ops.filter.language_id_score_filter.LanguageIDScoreFilter object at 0x16d487d60>
│ └ <function NestedDataset.map at 0x2918764c0>
└ Dataset({
features: ['instruction', 'input', 'output', '__dj__stats__'],
num_rows: 2728
})

File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 180, in map
new_ds = NestedDataset(super().map(*args, **kargs))
│ │ └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '573d685b3bbeed84'}
│ └ [<function LanguageIDScoreFilter.compute_stats at 0x291e9a0d0>]
└ <class 'data_juicer.core.data.NestedDataset'>

File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 563, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
│ │ │ │ └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '573d685b3bbeed84'}
│ │ │ └ (<function LanguageIDScoreFilter.compute_stats at 0x291e9a0d0>,)
│ │ └ Dataset({
│ │ features: ['instruction', 'input', 'output', '__dj__stats__'],
│ │ num_rows: 2728
│ │ })
│ └ <function Dataset.map at 0x13fad51f0>
└ typing.Union
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
│ │ │ │ └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '573d685b3bbeed84'}
│ │ │ └ (<function LanguageIDScoreFilter.compute_stats at 0x291e9a0d0>,)
│ │ └ Dataset({
│ │ features: ['instruction', 'input', 'output', '__dj__stats__'],
│ │ num_rows: 2728
│ │ })
│ └ <function Dataset.map at 0x13fad5160>
└ typing.Union
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3097, in map
for rank, done, content in iflatmap_unordered(
│ │ │ └ <function iflatmap_unordered at 0x13f2780d0>
│ │ └ 0
│ └ False
└ 3
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1377, in iflatmap_unordered
[async_result.get() for async_result in async_results]
└ [<multiprocess.pool.ApplyResult object at 0x291e2a670>, <multiprocess.pool.ApplyResult object at 0x291e2a790>, <multiprocess....
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1377, in <listcomp>
[async_result.get() for async_result in async_results]
│ │ └ <multiprocess.pool.ApplyResult object at 0x291e2a670>
│ └ <function ApplyResult.get at 0x13f2764c0>
└ <multiprocess.pool.ApplyResult object at 0x291e2a670>
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/multiprocess/pool.py", line 771, in get
raise self._value
│ └ KeyError('text')
└ <multiprocess.pool.ApplyResult object at 0x291e2a670>

KeyError: 'text'
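
Read together, the traceback and the configuration table point to a key mismatch: the dataset exposes the columns 'instruction', 'input', and 'output', while both operators in the 'process' row are listed with 'text_key': 'text'. The sketch below illustrates that mismatch with plain Python dicts only. It does not call Data-Juicer, and `find_missing_text_keys` and the sample values are hypothetical names introduced for illustration, not part of the library.

```python
# Column names copied from the Dataset repr in the traceback (stats column omitted).
dataset_columns = ["instruction", "input", "output"]

# 'text_key' values copied from the 'process' row of the configuration table.
op_text_keys = {
    "language_id_score_filter": "text",
    "perplexity_filter": "text",
}


def find_missing_text_keys(columns, text_keys_by_op):
    """Return {op_name: text_key} for every op whose text_key is not a column.

    Hypothetical helper for illustration; not a Data-Juicer function.
    """
    return {
        op_name: key
        for op_name, key in text_keys_by_op.items()
        if key not in columns
    }


# Hypothetical sample with the same fields as the dataset in this report.
sample = {"instruction": "q", "input": "", "output": "a"}

missing = find_missing_text_keys(dataset_columns, op_text_keys)
print(missing)

try:
    # Same lookup pattern as the failing line in language_id_score_filter.py.
    sample["text"].lower().replace("\n", " ")
except KeyError as err:
    print("KeyError:", err)
```

Running a check like this before the analyser starts would flag both operators, since neither 'text' lookup can succeed on a sample that only carries 'instruction', 'input', and 'output'.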

To Reproduce

(data_juicer) guangshengliu@MacBook-Air data % cd data-juicer
(data_juicer) guangshengliu@MacBook-Air data-juicer % python tools/analyze_data.py --config configs/demo/analyser.yaml
100%|##########| 1/1 [00:00<00:00, 352.61it/s]
2024-04-15 16:20:53 | INFO | data_juicer.format.formatter:185 - Unifying the input dataset formats...
2024-04-15 16:20:53 | INFO | data_juicer.format.formatter:200 - There are 2728 sample(s) in the original dataset.
2024-04-15 16:20:53 | INFO | data_juicer.format.formatter:214 - 2728 samples left after filtering empty text.
2024-04-15 16:20:53 | INFO | data_juicer.format.mixture_formatter:136 - sampled 2728 from 2728
2024-04-15 16:20:53 | INFO | data_juicer.format.mixture_formatter:142 - There are 2728 in final dataset
2024-04-15 16:20:53 | INFO | data_juicer.core.analyser:81 - Preparing process operators...
2024-04-15 16:20:53 | INFO | data_juicer.utils.model_utils:102 - Loading fasttext language identification model...
Warning : load_model does not return WordVectorModel or SupervisedModel any more, but a FastText object which is very similar.
2024-04-15 16:20:53 | WARNING | data_juicer.ops.load:24 - This OP [perplexity_filter] is unavailable due to importing third-party requirements of this OP failure: ['sentencepiece', 'kenlm']. You can either run pip install -v -e .[sci] to install all requirements for all OPs, or run pip install sentencepiece kenlm with library version specified by environments/science_requires.txt to install libraries required by this OP. Data processing will skip this OP later.
2024-04-15 16:20:53 | INFO | data_juicer.core.analyser:86 - Computing the stats of dataset...
2024-04-15 16:20:53 | ERROR | __main__:13 - An error has been caught in function '<module>', process 'MainProcess' (84118), thread 'MainThread' (8040734208):
multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/multiprocess/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1353, in _write_generator_to_queue
for i, result in enumerate(func(**kwargs)):
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3358, in _map_single
example = apply_function_on_filtered_inputs(example, i, offset=offset)
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3261, in apply_function_on_filtered_inputs
processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 47, in wrapped_f
return f(*args, **kargs)
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 47, in wrapped_f
return f(*args, **kargs)
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/ops/filter/language_id_score_filter.py", line 56, in compute_stats
text = sample[self.text_key].lower().replace('\n', ' ')
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/formatting/formatting.py", line 280, in __getitem__
value = self.data[key]
KeyError: 'text'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

File "/Users/guangshengliu/LLM/data/data-juicer/tools/analyze_data.py", line 13, in <module>
main()
└ <function main at 0x124c03dc0>

File "/Users/guangshengliu/LLM/data/data-juicer/tools/analyze_data.py", line 9, in main
analyser.run()
│ └ <function Analyser.run at 0x291876f70>
└ <data_juicer.core.analyser.Analyser object at 0x10405c250>

File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/analyser.py", line 100, in run
dataset = dataset.map(op.compute_stats,
│ │ │ └ <function LanguageIDScoreFilter.compute_stats at 0x291ebea60>
│ │ └ <data_juicer.ops.filter.language_id_score_filter.LanguageIDScoreFilter object at 0x16d487d60>
│ └ <function NestedDataset.map at 0x2918764c0>
└ Dataset({
features: ['instruction', 'input', 'output', '__dj__stats__'],
num_rows: 2728
})

File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 180, in map
new_ds = NestedDataset(super().map(*args, **kargs))
│ │ └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '573d685b3bbeed84'}
│ └ [<function LanguageIDScoreFilter.compute_stats at 0x291e9a0d0>]
└ <class 'data_juicer.core.data.NestedDataset'>

File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 563, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
│ │ │ │ └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '573d685b3bbeed84'}
│ │ │ └ (<function LanguageIDScoreFilter.compute_stats at 0x291e9a0d0>,)
│ │ └ Dataset({
│ │ features: ['instruction', 'input', 'output', '__dj__stats__'],
│ │ num_rows: 2728
│ │ })
│ └ <function Dataset.map at 0x13fad51f0>
└ typing.Union
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
│ │ │ │ └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '573d685b3bbeed84'}
│ │ │ └ (<function LanguageIDScoreFilter.compute_stats at 0x291e9a0d0>,)
│ │ └ Dataset({
│ │ features: ['instruction', 'input', 'output', '__dj__stats__'],
│ │ num_rows: 2728
│ │ })
│ └ <function Dataset.map at 0x13fad5160>
└ typing.Union
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3097, in map
for rank, done, content in iflatmap_unordered(
│ │ │ └ <function iflatmap_unordered at 0x13f2780d0>
│ │ └ 0
│ └ False
└ 3
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1377, in iflatmap_unordered
[async_result.get() for async_result in async_results]
└ [<multiprocess.pool.ApplyResult object at 0x291e2a670>, <multiprocess.pool.ApplyResult object at 0x291e2a790>, <multiprocess....
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1377, in <listcomp>
[async_result.get() for async_result in async_results]
│ │ └ <multiprocess.pool.ApplyResult object at 0x291e2a670>
│ └ <function ApplyResult.get at 0x13f2764c0>
└ <multiprocess.pool.ApplyResult object at 0x291e2a670>
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/multiprocess/pool.py", line 771, in get
raise self._value
│ └ KeyError('text')
└ <multiprocess.pool.ApplyResult object at 0x291e2a670>

KeyError: 'text'

Configs

No response

Logs

No response

Screenshots

No response

Additional

No response

HYLcool · 2024-04-16T06:26:51Z

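Hi @promisecc

Thanks for your interest in and use of data-juicer~

We noticed that the dataset you are analyzing contains three text fields: ['instruction', 'input', 'output']. Although you set text_keys to ['instruction', 'output'], the operator's text_key is still 'text'. Please check whether you set the text_key parameter to 'text' separately on the operator. If so, remove the text_key setting from the operator so that it inherits the global text_keys setting.

If convenient, you could also share the contents of your config file, which would help us locate the problem further~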
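
To make the mismatch concrete, here is a minimal sketch in plain Python with no Data-Juicer imports. The `sample` dict and the `compute_stats` helper are hypothetical stand-ins that only mirror the field names and the lookup shown in the traceback (language_id_score_filter.py, line 56); they are not the library's actual code.

```python
# Hypothetical stand-in for one row of the dataset in this issue:
# it has 'instruction', 'input' and 'output', but no 'text' field.
sample = {
    "instruction": "Translate to English",
    "input": "你好",
    "output": "Hello",
}


def compute_stats(sample, text_key):
    """Mimic the field lookup that appears in the traceback."""
    return sample[text_key].lower().replace("\n", " ")


# The operator's text_key is still 'text', so the lookup fails.
try:
    compute_stats(sample, text_key="text")
    error = None
except KeyError as exc:
    error = exc

# With a key that actually exists in the dataset, the lookup succeeds.
text = compute_stats(sample, text_key="instruction")
```

The first call raises the same `KeyError: 'text'` reported above, while the second call works because 'instruction' is a real column of the dataset.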

* fix Bug: KeyError: 'text'. In data_juicer/config/config.py, lines 418-429 did not handle the case where the argument text_key was initialized to 'text', so text_key was never updated properly and always kept the value 'text'.
* Fix Bug: text_key does not update correctly
* Update config.py: normalize format
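
The idea behind this fix can be sketched roughly as follows. This is not the actual patch from PR #300: `resolve_text_key` and `DEFAULT_TEXT_KEY` are hypothetical names, and using the first entry of the global text_keys is an assumption made for illustration. The sketch only shows the rule that an operator whose text_key is missing or still equal to the default 'text' should inherit the global setting.

```python
# Assumed per-operator default, matching the 'text_key': 'text' entries
# visible in the config dump of this issue.
DEFAULT_TEXT_KEY = "text"


def resolve_text_key(op_args, global_text_keys):
    """Return a copy of op_args whose text_key is inherited from the
    global setting unless the operator overrides it (hypothetical helper)."""
    resolved = dict(op_args)
    current = resolved.get("text_key")
    # Treat both a missing value and the built-in default as "not set".
    if current is None or current == DEFAULT_TEXT_KEY:
        # Assumption: the first global key is the one the operator uses.
        resolved["text_key"] = global_text_keys[0]
    return resolved


op_args = {"lang": "zh", "min_score": 0.8, "text_key": "text"}
resolved = resolve_text_key(op_args, ["instruction", "output"])
```

One consequence of this rule is that an operator can no longer be pointed at a column literally named 'text' when the global text_keys says otherwise; whether the real fix accepts the same trade-off is not stated in this thread.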

HYLcool · 2024-04-18T04:04:29Z

Closed by PR #300, fixed by @shiweijiezero. Thanks! 👍🏻

promisecc added the bug Something isn't working label Apr 15, 2024

github-project-automation bot added this to data-juicer Apr 15, 2024

github-project-automation bot moved this to Todo in data-juicer Apr 15, 2024

HYLcool self-assigned this Apr 16, 2024

shiweijiezero mentioned this issue Apr 17, 2024

fix Bug: KeyError: 'text' Corresponding to issue #296 #300

Merged

HYLcool added the good first issue Good for newcomers label Apr 18, 2024

HYLcool linked a pull request Apr 18, 2024 that will close this issue

fix Bug: KeyError: 'text' Corresponding to issue #296 #300

Merged

HYLcool closed this as completed in #300 Apr 18, 2024

github-project-automation bot moved this from Todo to Done in data-juicer Apr 18, 2024
