I have pulled the latest code of the main branch and run it again, and the bug still exists.
I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend asking a question using the Question template.)
Search before reporting
I have searched the Data-Juicer issues and found no similar bugs.
OS
macOS
Installation Method
from source
Data-Juicer Version
latest version
Python Version
3.9
Describe the bug
2024-04-15 16:15:39 | ERROR | __main__:<module>:13 - An error has been caught in function '<module>', process 'MainProcess' (83335), thread 'MainThread' (8040734208):
multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/multiprocess/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1353, in _write_generator_to_queue
for i, result in enumerate(func(**kwargs)):
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3358, in _map_single
example = apply_function_on_filtered_inputs(example, i, offset=offset)
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3261, in apply_function_on_filtered_inputs
processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 47, in wrapped_f
return f(*args, **kargs)
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 47, in wrapped_f
return f(*args, **kargs)
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/ops/filter/language_id_score_filter.py", line 56, in compute_stats
text = sample[self.text_key].lower().replace('\n', ' ')
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/formatting/formatting.py", line 280, in __getitem__
value = self.data[key]
KeyError: 'text'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/guangshengliu/LLM/data/data-juicer/tools/analyze_data.py", line 13, in <module>
main()
└ <function main at 0x122843dc0>
File "/Users/guangshengliu/LLM/data/data-juicer/tools/analyze_data.py", line 9, in main
analyser.run()
│ └ <function Analyser.run at 0x17edb4f70>
└ <data_juicer.core.analyser.Analyser object at 0x101d91250>
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/analyser.py", line 100, in run
dataset = dataset.map(op.compute_stats,
│ │ │ └ <function LanguageIDScoreFilter.compute_stats at 0x16f77a280>
│ │ └ <data_juicer.ops.filter.language_id_score_filter.LanguageIDScoreFilter object at 0x16f77fac0>
│ └ <function NestedDataset.map at 0x17edb44c0>
└ Dataset({
features: ['instruction', 'input', 'output', 'dj__stats'],
num_rows: 1000
})
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 180, in map
new_ds = NestedDataset(super().map(*args, **kargs))
│ │ └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '8d414177a08f8240'}
│ └ [<function LanguageIDScoreFilter.compute_stats at 0x16f77a1f0>]
└ <class 'data_juicer.core.data.NestedDataset'>
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 563, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
│ │ │ │ └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '8d414177a08f8240'}
│ │ │ └ (<function LanguageIDScoreFilter.compute_stats at 0x16f77a1f0>,)
│ │ └ Dataset({
│ │ features: ['instruction', 'input', 'output', 'dj__stats'],
│ │ num_rows: 1000
│ │ })
│ └ <function Dataset.map at 0x139ca51f0>
└ typing.Union
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
│ │ │ │ └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '8d414177a08f8240'}
│ │ │ └ (<function LanguageIDScoreFilter.compute_stats at 0x16f77a1f0>,)
│ │ └ Dataset({
│ │ features: ['instruction', 'input', 'output', 'dj__stats'],
│ │ num_rows: 1000
│ │ })
│ └ <function Dataset.map at 0x139ca5160>
└ typing.Union
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3097, in map
for rank, done, content in iflatmap_unordered(
│ │ │ └ <function iflatmap_unordered at 0x1394380d0>
│ │ └ 0
│ └ False
└ 3
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1377, in iflatmap_unordered
[async_result.get() for async_result in async_results]
└ [<multiprocess.pool.ApplyResult object at 0x16f839e50>, <multiprocess.pool.ApplyResult object at 0x16f850790>, <multiprocess....
(data_juicer) guangshengliu@MacBook-Air data % cd data-juicer
(data_juicer) guangshengliu@MacBook-Air data-juicer % python tools/analyze_data.py --config configs/demo/analyser.yaml
2024-04-15 16:20:47.909 | DEBUG | data_juicer.utils.availability_utils:_is_package_available:116 - Detected torch version 2.2.2
2024-04-15 16:20:48.682 | INFO | data_juicer:setup_mp:58 - Setting multiprocess start method to 'fork'.
2024-04-15 16:20:48.682 | DEBUG | data_juicer:setup_cuda:72 - _USE_CUDA: False | MP: fork (MainProcess)
2024-04-15 16:20:51 | INFO | data_juicer.config.config:533 - Back up the input config file [/Users/guangshengliu/LLM/data/data-juicer/configs/demo/analyser.yaml] into the work_dir [/Users/guangshengliu/LLM/data/data-juicer/outputs/demo-analyser]
2024-04-15 16:20:51 | INFO | data_juicer.config.config:554 - Configuration table:
╒════════════════════════╤══════════════════════════════════════════════════════════════════════════════════════════════╕
│ key │ values │
╞════════════════════════╪══════════════════════════════════════════════════════════════════════════════════════════════╡
│ config │ [Path_fr(configs/demo/analyser.yaml, cwd=/Users/guangshengliu/LLM/data/data-juicer)] │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ hpo_config │ None │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ path_3sigma_recipe │ None │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ project_name │ 'demo-analyser' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ executor_type │ 'default' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ dataset_path │ '/Users/guangshengliu/LLM/data/data-juicer/raw_data/aqua_train.json' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ export_path │ '/Users/guangshengliu/LLM/data/data-juicer/outputs/demo-analyser/demo-analyser-result.jsonl' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ export_shard_size │ 0 │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ export_in_parallel │ False │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ keep_stats_in_res_ds │ False │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ keep_hashes_in_res_ds │ False │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ np │ 4 │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ text_keys │ ['instruction', 'output'] │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ image_key │ 'images' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ image_special_token │ '<__dj__image>' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ audio_key │ 'audios' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ audio_special_token │ '<__dj__audio>' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ video_key │ 'videos' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ video_special_token │ '<__dj__video>' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ eoc_special_token │ '<|__dj__eoc|>' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ suffixes │ [] │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ use_cache │ True │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ ds_cache_dir │ '/Users/guangshengliu/.cache/huggingface/datasets' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ cache_compress │ None │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ use_checkpoint │ False │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ temp_dir │ None │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ open_tracer │ False │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ op_list_to_trace │ [] │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ trace_num │ 10 │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ op_fusion │ False │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ process │ [{'language_id_score_filter': {'accelerator': 'cpu', │
│ │ 'audio_key': 'audios', │
│ │ 'cpu_required': 1, │
│ │ 'image_key': 'images', │
│ │ 'lang': 'zh', │
│ │ 'mem_required': 0, │
│ │ 'min_score': 0.8, │
│ │ 'spec_numprocs': 0, │
│ │ 'text_key': 'text', │
│ │ 'use_actor': False, │
│ │ 'video_key': 'videos'}}, │
│ │ {'perplexity_filter': {'accelerator': 'cpu', │
│ │ 'audio_key': 'audios', │
│ │ 'cpu_required': 1, │
│ │ 'image_key': 'images', │
│ │ 'lang': 'zh', │
│ │ 'max_ppl': 1500, │
│ │ 'mem_required': 0, │
│ │ 'spec_numprocs': 0, │
│ │ 'text_key': 'text', │
│ │ 'use_actor': False, │
│ │ 'video_key': 'videos'}}] │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ save_stats_in_one_file │ False │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ ray_address │ 'auto' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ work_dir │ '/Users/guangshengliu/LLM/data/data-juicer/outputs/demo-analyser' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ timestamp │ '20240415162051' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ dataset_dir │ '/Users/guangshengliu/LLM/data/data-juicer/raw_data' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ add_suffix │ False │
╘════════════════════════╧══════════════════════════════════════════════════════════════════════════════════════════════╛
2024-04-15 16:20:51 | INFO | data_juicer.core.analyser:39 - Using cache compression method: [None]
2024-04-15 16:20:51 | INFO | data_juicer.core.analyser:44 - Setting up data formatter...
2024-04-15 16:20:51 | INFO | data_juicer.core.analyser:53 - Preparing exporter...
2024-04-15 16:20:51 | INFO | data_juicer.core.analyser:75 - Loading dataset from data formatter...
2024-04-15 16:20:53 | INFO | datasets.load:1791 - Downloading and preparing dataset json/default to /Users/guangshengliu/.cache/huggingface/datasets/json/default-f3511e4c073db59c/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e...
Downloading data files: 100%|##########| 1/1 [00:00<00:00, 5377.31it/s]
Extracting data files: 100%|##########| 1/1 [00:00<00:00, 484.95it/s]
2024-04-15 16:20:53 | INFO | logging:952 - Setting num_proc from 4 back to 1 for the json split to disable multiprocessing as it only contains one shard.
2024-04-15 16:20:53 | INFO | datasets.load:1791 - Dataset json downloaded and prepared to /Users/guangshengliu/.cache/huggingface/datasets/json/default-f3511e4c073db59c/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e. Subsequent calls will reuse this data.
100%|##########| 1/1 [00:00<00:00, 352.61it/s]
2024-04-15 16:20:53 | INFO | data_juicer.format.formatter:185 - Unifying the input dataset formats...
2024-04-15 16:20:53 | INFO | data_juicer.format.formatter:200 - There are 2728 sample(s) in the original dataset.
2024-04-15 16:20:53 | INFO | data_juicer.format.formatter:214 - 2728 samples left after filtering empty text.
2024-04-15 16:20:53 | INFO | data_juicer.format.mixture_formatter:136 - sampled 2728 from 2728
2024-04-15 16:20:53 | INFO | data_juicer.format.mixture_formatter:142 - There are 2728 in final dataset
2024-04-15 16:20:53 | INFO | data_juicer.core.analyser:81 - Preparing process operators...
2024-04-15 16:20:53 | INFO | data_juicer.utils.model_utils:102 - Loading fasttext language identification model...
Warning : load_model does not return WordVectorModel or SupervisedModel any more, but a FastText object which is very similar.
2024-04-15 16:20:53 | WARNING | data_juicer.ops.load:24 - This OP [perplexity_filter] is unavailable due to importing third-party requirements of this OP failure: ['sentencepiece', 'kenlm']. You can either run pip install -v -e .[sci] to install all requirements for all OPs, or run pip install sentencepiece kenlm with library version specified by environments/science_requires.txt to install libraries required by this OP. Data processing will skip this OP later.
2024-04-15 16:20:53 | INFO | data_juicer.core.analyser:86 - Computing the stats of dataset...
2024-04-15 16:20:53 | ERROR | __main__:<module>:13 - An error has been caught in function '<module>', process 'MainProcess' (84118), thread 'MainThread' (8040734208):
multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/multiprocess/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1353, in _write_generator_to_queue
for i, result in enumerate(func(**kwargs)):
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3358, in _map_single
example = apply_function_on_filtered_inputs(example, i, offset=offset)
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3261, in apply_function_on_filtered_inputs
processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 47, in wrapped_f
return f(*args, **kargs)
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 47, in wrapped_f
return f(*args, **kargs)
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/ops/filter/language_id_score_filter.py", line 56, in compute_stats
text = sample[self.text_key].lower().replace('\n', ' ')
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/formatting/formatting.py", line 280, in __getitem__
value = self.data[key]
KeyError: 'text'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/guangshengliu/LLM/data/data-juicer/tools/analyze_data.py", line 13, in <module>
main()
└ <function main at 0x124c03dc0>
File "/Users/guangshengliu/LLM/data/data-juicer/tools/analyze_data.py", line 9, in main
analyser.run()
│ └ <function Analyser.run at 0x291876f70>
└ <data_juicer.core.analyser.Analyser object at 0x10405c250>
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/analyser.py", line 100, in run
dataset = dataset.map(op.compute_stats,
│ │ │ └ <function LanguageIDScoreFilter.compute_stats at 0x291ebea60>
│ │ └ <data_juicer.ops.filter.language_id_score_filter.LanguageIDScoreFilter object at 0x16d487d60>
│ └ <function NestedDataset.map at 0x2918764c0>
└ Dataset({
features: ['instruction', 'input', 'output', 'dj__stats'],
num_rows: 2728
})
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 180, in map
new_ds = NestedDataset(super().map(*args, **kargs))
│ │ └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '573d685b3bbeed84'}
│ └ [<function LanguageIDScoreFilter.compute_stats at 0x291e9a0d0>]
└ <class 'data_juicer.core.data.NestedDataset'>
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 563, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
│ │ │ │ └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '573d685b3bbeed84'}
│ │ │ └ (<function LanguageIDScoreFilter.compute_stats at 0x291e9a0d0>,)
│ │ └ Dataset({
│ │ features: ['instruction', 'input', 'output', 'dj__stats'],
│ │ num_rows: 2728
│ │ })
│ └ <function Dataset.map at 0x13fad51f0>
└ typing.Union
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
│ │ │ │ └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '573d685b3bbeed84'}
│ │ │ └ (<function LanguageIDScoreFilter.compute_stats at 0x291e9a0d0>,)
│ │ └ Dataset({
│ │ features: ['instruction', 'input', 'output', 'dj__stats'],
│ │ num_rows: 2728
│ │ })
│ └ <function Dataset.map at 0x13fad5160>
└ typing.Union
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3097, in map
for rank, done, content in iflatmap_unordered(
│ │ │ └ <function iflatmap_unordered at 0x13f2780d0>
│ │ └ 0
│ └ False
└ 3
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1377, in iflatmap_unordered
[async_result.get() for async_result in async_results]
└ [<multiprocess.pool.ApplyResult object at 0x291e2a670>, <multiprocess.pool.ApplyResult object at 0x291e2a790>, <multiprocess....
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1377, in <listcomp>
[async_result.get() for async_result in async_results]
│ │ └ <multiprocess.pool.ApplyResult object at 0x291e2a670>
│ └ <function ApplyResult.get at 0x13f2764c0>
└ <multiprocess.pool.ApplyResult object at 0x291e2a670>
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/multiprocess/pool.py", line 771, in get
raise self._value
│ └ KeyError('text')
└ <multiprocess.pool.ApplyResult object at 0x291e2a670>
KeyError: 'text'
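A minimal sketch of what the log above seems to show, using plain Python dicts rather than Data-Juicer itself: the dataset columns are 'instruction', 'input' and 'output', while both ops in the process list are configured with text_key 'text', so looking up sample['text'] fails. The helper name find_missing_text_keys is hypothetical and only illustrates a pre-flight check; it is not a Data-Juicer API, and this is not necessarily how the project would fix the problem.

```python
# Hypothetical pre-flight check (not part of Data-Juicer): list every op whose
# configured text_key is not a column of the dataset.
def find_missing_text_keys(features, process):
    missing = []
    for op in process:
        for op_name, op_args in op.items():
            key = op_args.get("text_key")
            if key is not None and key not in features:
                missing.append((op_name, key))
    return missing


# Values copied from the log: dataset features and the relevant op arguments.
features = ["instruction", "input", "output", "dj__stats"]
process = [
    {"language_id_score_filter": {"lang": "zh", "min_score": 0.8, "text_key": "text"}},
    {"perplexity_filter": {"lang": "zh", "max_ppl": 1500, "text_key": "text"}},
]

# A sample shaped like the dataset rows (contents are placeholders).
sample = {"instruction": "placeholder", "input": "", "output": "placeholder"}

try:
    # Mirrors the failing lookup sample[self.text_key] with text_key == 'text'.
    sample["text"].lower()
except KeyError as err:
    print("KeyError:", err)

print(find_missing_text_keys(features, process))
```

If this reading is right, the op-level text_key and the top-level text_keys (['instruction', 'output'] in the configuration table) disagree, and a check like this would flag both ops before any stats are computed.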
To Reproduce
(data_juicer) guangshengliu@MacBook-Air data % cd data-juicer
(data_juicer) guangshengliu@MacBook-Air data-juicer % python tools/analyze_data.py --config configs/demo/analyser.yaml
2024-04-15 16:20:53 | INFO | datasets.load:1791 - Dataset json downloaded and prepared to /Users/guangshengliu/.cache/huggingface/datasets/json/default-f3511e4c073db59c/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e. Subsequent calls will reuse this data.
100%|##########| 1/1 [00:00<00:00, 352.61it/s]
2024-04-15 16:20:53 | INFO | data_juicer.format.formatter:185 - Unifying the input dataset formats...
2024-04-15 16:20:53 | INFO | data_juicer.format.formatter:200 - There are 2728 sample(s) in the original dataset.
2024-04-15 16:20:53 | INFO | data_juicer.format.formatter:214 - 2728 samples left after filtering empty text.
2024-04-15 16:20:53 | INFO | data_juicer.format.mixture_formatter:136 - sampled 2728 from 2728
2024-04-15 16:20:53 | INFO | data_juicer.format.mixture_formatter:142 - There are 2728 in final dataset
2024-04-15 16:20:53 | INFO | data_juicer.core.analyser:81 - Preparing process operators...
2024-04-15 16:20:53 | INFO | data_juicer.utils.model_utils:102 - Loading fasttext language identification model...
Warning : load_model does not return WordVectorModel or SupervisedModel any more, but a FastText object which is very similar.
2024-04-15 16:20:53 | WARNING | data_juicer.ops.load:24 - This OP [perplexity_filter] is unavailable due to importing third-party requirements of this OP failure: ['sentencepiece', 'kenlm']. You can either run pip install -v -e .[sci] to install all requirements for all OPs, or run pip install sentencepiece kenlm with library version specified by environments/science_requires.txt to install libraries required by this OP. Data processing will skip this OP later.
2024-04-15 16:20:53 | INFO | data_juicer.core.analyser:86 - Computing the stats of dataset...
2024-04-15 16:20:53 | ERROR | __main__:13 - An error has been caught in function '<module>', process 'MainProcess' (84118), thread 'MainThread' (8040734208):
multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/multiprocess/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1353, in _write_generator_to_queue
for i, result in enumerate(func(**kwargs)):
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3358, in _map_single
example = apply_function_on_filtered_inputs(example, i, offset=offset)
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3261, in apply_function_on_filtered_inputs
processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 47, in wrapped_f
return f(*args, **kargs)
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 47, in wrapped_f
return f(*args, **kargs)
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/ops/filter/language_id_score_filter.py", line 56, in compute_stats
text = sample[self.text_key].lower().replace('\n', ' ')
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/formatting/formatting.py", line 280, in __getitem__
value = self.data[key]
KeyError: 'text'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/guangshengliu/LLM/data/data-juicer/tools/analyze_data.py", line 13, in <module>
main()
└ <function main at 0x124c03dc0>
File "/Users/guangshengliu/LLM/data/data-juicer/tools/analyze_data.py", line 9, in main
analyser.run()
│ └ <function Analyser.run at 0x291876f70>
└ <data_juicer.core.analyser.Analyser object at 0x10405c250>
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/analyser.py", line 100, in run
dataset = dataset.map(op.compute_stats,
│ │ │ └ <function LanguageIDScoreFilter.compute_stats at 0x291ebea60>
│ │ └ <data_juicer.ops.filter.language_id_score_filter.LanguageIDScoreFilter object at 0x16d487d60>
│ └ <function NestedDataset.map at 0x2918764c0>
└ Dataset({
features: ['instruction', 'input', 'output', '__dj__stats__'],
num_rows: 2728
})
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 180, in map
new_ds = NestedDataset(super().map(*args, **kargs))
│ │ └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '573d685b3bbeed84'}
│ └ [<function LanguageIDScoreFilter.compute_stats at 0x291e9a0d0>]
└ <class 'data_juicer.core.data.NestedDataset'>
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 563, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
│ │ │ │ └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '573d685b3bbeed84'}
│ │ │ └ (<function LanguageIDScoreFilter.compute_stats at 0x291e9a0d0>,)
│ │ └ Dataset({
│ │ features: ['instruction', 'input', 'output', '__dj__stats__'],
│ │ num_rows: 2728
│ │ })
│ └ <function Dataset.map at 0x13fad51f0>
└ typing.Union
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
│ │ │ │ └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '573d685b3bbeed84'}
│ │ │ └ (<function LanguageIDScoreFilter.compute_stats at 0x291e9a0d0>,)
│ │ └ Dataset({
│ │ features: ['instruction', 'input', 'output', '__dj__stats__'],
│ │ num_rows: 2728
│ │ })
│ └ <function Dataset.map at 0x13fad5160>
└ typing.Union
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3097, in map
for rank, done, content in iflatmap_unordered(
│ │ │ └ <function iflatmap_unordered at 0x13f2780d0>
│ │ └ 0
│ └ False
└ 3
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1377, in iflatmap_unordered
[async_result.get() for async_result in async_results]
└ [<multiprocess.pool.ApplyResult object at 0x291e2a670>, <multiprocess.pool.ApplyResult object at 0x291e2a790>, <multiprocess....
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1377, in <listcomp>
[async_result.get() for async_result in async_results]
│ │ └ <multiprocess.pool.ApplyResult object at 0x291e2a670>
│ └ <function ApplyResult.get at 0x13f2764c0>
└ <multiprocess.pool.ApplyResult object at 0x291e2a670>
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/multiprocess/pool.py", line 771, in get
raise self._value
│ └ KeyError('text')
└ <multiprocess.pool.ApplyResult object at 0x291e2a670>
KeyError: 'text'
Configs
No response
Logs
No response
Screenshots
No response
Additional
No response
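The configuration table and traceback above already show the mismatch: the dataset columns are `instruction`, `input`, `output` and the stats column, `text_keys` is `['instruction', 'output']`, yet both operators still carry `text_key: 'text'`. A minimal pre-flight sketch of that comparison follows. `find_missing_text_keys` is a hypothetical helper written for illustration, not a Data-Juicer function, and the `process` list is abbreviated from the table above.

```python
from typing import Dict, List


def find_missing_text_keys(columns: List[str],
                           process: List[Dict[str, dict]]) -> Dict[str, str]:
    """Return {op_name: text_key} for every op whose text_key is not a dataset column.

    Hypothetical helper for illustration only; it is not part of Data-Juicer.
    """
    missing = {}
    for op in process:
        for op_name, op_args in op.items():
            key = op_args.get("text_key")
            if key is not None and key not in columns:
                missing[op_name] = key
    return missing


# Values copied (abbreviated) from the dataset features and config table in the log.
columns = ["instruction", "input", "output", "dj__stats"]
process = [
    {"language_id_score_filter": {"lang": "zh", "min_score": 0.8, "text_key": "text"}},
    {"perplexity_filter": {"lang": "zh", "max_ppl": 1500, "text_key": "text"}},
]

missing = find_missing_text_keys(columns, process)
print(missing)
```

Any operator reported here would hit the same `KeyError` as soon as `compute_stats` indexes the sample with its `text_key`.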
* Fix bug: KeyError: 'text'
In data_juicer/config/config.py (lines 418-429), the case where an operator's text_key argument has already been initialized to 'text' was not handled. As a result, text_key was never updated and always kept its initial value 'text'.
* Fix bug: text_key is not updated correctly
* Update config.py
Normalize Format
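The commit notes describe the cause only in words, so here is a rough sketch of the kind of propagation they imply. It assumes that an operator's `text_key` should follow the first entry of the global `text_keys` whenever the operator still holds the default `'text'`. The function name `propagate_text_key` and the constant `DEFAULT_TEXT_KEY` are made up for this example; the actual code in config.py is not shown in this thread and may differ.

```python
from typing import Dict, List

DEFAULT_TEXT_KEY = "text"  # assumed default, taken from the config table in the log


def propagate_text_key(text_keys: List[str],
                       process: List[Dict[str, dict]]) -> List[Dict[str, dict]]:
    """Return a copy of `process` in which default text_key values follow `text_keys`.

    Illustrative sketch only; not the actual Data-Juicer implementation.
    """
    global_key = text_keys[0] if text_keys else DEFAULT_TEXT_KEY
    updated = []
    for op in process:
        new_op = {}
        for op_name, op_args in op.items():
            args = dict(op_args)  # copy so the input config is left untouched
            # Handle both "not set" and "already initialized to the default":
            # the second case is the one the commit notes say was missed.
            if args.get("text_key") in (None, DEFAULT_TEXT_KEY):
                args["text_key"] = global_key
            new_op[op_name] = args
        updated.append(new_op)
    return updated


process = [
    {"language_id_score_filter": {"lang": "zh", "text_key": "text"}},
    {"perplexity_filter": {"lang": "zh", "text_key": "output"}},
]
fixed = propagate_text_key(["instruction", "output"], process)
print(fixed)
```

One limitation of this sketch: an operator that explicitly sets `text_key: 'text'` cannot be told apart from one that merely received the default, so it would be overwritten as well.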
Before Reporting
I have pulled the latest code of the main branch and run it again, and the bug still exists.
I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you ask a question using the Question template.)
Search before reporting
OS
macos
Installation Method
from source
Data-Juicer Version
last version
Python Version
3.9
Describe the bug
2024-04-15 16:15:39 | ERROR | __main__:13 - An error has been caught in function '<module>', process 'MainProcess' (83335), thread 'MainThread' (8040734208):
multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/multiprocess/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1353, in _write_generator_to_queue
for i, result in enumerate(func(**kwargs)):
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3358, in _map_single
example = apply_function_on_filtered_inputs(example, i, offset=offset)
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3261, in apply_function_on_filtered_inputs
processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 47, in wrapped_f
return f(*args, **kargs)
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 47, in wrapped_f
return f(*args, **kargs)
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/ops/filter/language_id_score_filter.py", line 56, in compute_stats
text = sample[self.text_key].lower().replace('\n', ' ')
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/formatting/formatting.py", line 280, in __getitem__
value = self.data[key]
KeyError: 'text'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/guangshengliu/LLM/data/data-juicer/tools/analyze_data.py", line 9, in main
analyser.run()
│ └ <function Analyser.run at 0x17edb4f70>
└ <data_juicer.core.analyser.Analyser object at 0x101d91250>
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/analyser.py", line 100, in run
dataset = dataset.map(op.compute_stats,
│ │ │ └ <function LanguageIDScoreFilter.compute_stats at 0x16f77a280>
│ │ └ <data_juicer.ops.filter.language_id_score_filter.LanguageIDScoreFilter object at 0x16f77fac0>
│ └ <function NestedDataset.map at 0x17edb44c0>
└ Dataset({
features: ['instruction', 'input', 'output', '__dj__stats__'],
num_rows: 1000
})
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 180, in map
new_ds = NestedDataset(super().map(*args, **kargs))
│ │ └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '8d414177a08f8240'}
│ └ [<function LanguageIDScoreFilter.compute_stats at 0x16f77a1f0>]
└ <class 'data_juicer.core.data.NestedDataset'>
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 563, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
│ │ │ │ └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '8d414177a08f8240'}
│ │ │ └ (<function LanguageIDScoreFilter.compute_stats at 0x16f77a1f0>,)
│ │ └ Dataset({
│ │ features: ['instruction', 'input', 'output', '__dj__stats__'],
│ │ num_rows: 1000
│ │ })
│ └ <function Dataset.map at 0x139ca51f0>
└ typing.Union
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
│ │ │ │ └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '8d414177a08f8240'}
│ │ │ └ (<function LanguageIDScoreFilter.compute_stats at 0x16f77a1f0>,)
│ │ └ Dataset({
│ │ features: ['instruction', 'input', 'output', '__dj__stats__'],
│ │ num_rows: 1000
│ │ })
│ └ <function Dataset.map at 0x139ca5160>
└ typing.Union
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3097, in map
for rank, done, content in iflatmap_unordered(
│ │ │ └ <function iflatmap_unordered at 0x1394380d0>
│ │ └ 0
│ └ False
└ 3
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1377, in iflatmap_unordered
[async_result.get() for async_result in async_results]
└ [<multiprocess.pool.ApplyResult object at 0x16f839e50>, <multiprocess.pool.ApplyResult object at 0x16f850790>, <multiprocess....
(data_juicer) guangshengliu@MacBook-Air data % cd data-juicer
(data_juicer) guangshengliu@MacBook-Air data-juicer % python tools/analyze_data.py --config configs/demo/analyser.yaml
2024-04-15 16:20:47.909 | DEBUG | data_juicer.utils.availability_utils:_is_package_available:116 - Detected torch version 2.2.2
2024-04-15 16:20:48.682 | INFO | data_juicer:setup_mp:58 - Setting multiprocess start method to 'fork'.
2024-04-15 16:20:48.682 | DEBUG | data_juicer:setup_cuda:72 - _USE_CUDA: False | MP: fork (MainProcess)
2024-04-15 16:20:51 | INFO | data_juicer.config.config:533 - Back up the input config file [/Users/guangshengliu/LLM/data/data-juicer/configs/demo/analyser.yaml] into the work_dir [/Users/guangshengliu/LLM/data/data-juicer/outputs/demo-analyser]
2024-04-15 16:20:51 | INFO | data_juicer.config.config:554 - Configuration table:
╒════════════════════════╤══════════════════════════════════════════════════════════════════════════════════════════════╕
│ key │ values │
╞════════════════════════╪══════════════════════════════════════════════════════════════════════════════════════════════╡
│ config │ [Path_fr(configs/demo/analyser.yaml, cwd=/Users/guangshengliu/LLM/data/data-juicer)] │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ hpo_config │ None │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ path_3sigma_recipe │ None │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ project_name │ 'demo-analyser' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ executor_type │ 'default' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ dataset_path │ '/Users/guangshengliu/LLM/data/data-juicer/raw_data/aqua_train.json' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ export_path │ '/Users/guangshengliu/LLM/data/data-juicer/outputs/demo-analyser/demo-analyser-result.jsonl' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ export_shard_size │ 0 │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ export_in_parallel │ False │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ keep_stats_in_res_ds │ False │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ keep_hashes_in_res_ds │ False │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ np │ 4 │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ text_keys │ ['instruction', 'output'] │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ image_key │ 'images' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ image_special_token │ '<__dj__image>' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ audio_key │ 'audios' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ audio_special_token │ '<__dj__audio>' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ video_key │ 'videos' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ video_special_token │ '<__dj__video>' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ eoc_special_token │ '<|__dj__eoc|>' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ suffixes │ [] │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ use_cache │ True │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ ds_cache_dir │ '/Users/guangshengliu/.cache/huggingface/datasets' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ cache_compress │ None │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ use_checkpoint │ False │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ temp_dir │ None │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ open_tracer │ False │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ op_list_to_trace │ [] │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ trace_num │ 10 │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ op_fusion │ False │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ process │ [{'language_id_score_filter': {'accelerator': 'cpu', │
│ │ 'audio_key': 'audios', │
│ │ 'cpu_required': 1, │
│ │ 'image_key': 'images', │
│ │ 'lang': 'zh', │
│ │ 'mem_required': 0, │
│ │ 'min_score': 0.8, │
│ │ 'spec_numprocs': 0, │
│ │ 'text_key': 'text', │
│ │ 'use_actor': False, │
│ │ 'video_key': 'videos'}}, │
│ │ {'perplexity_filter': {'accelerator': 'cpu', │
│ │ 'audio_key': 'audios', │
│ │ 'cpu_required': 1, │
│ │ 'image_key': 'images', │
│ │ 'lang': 'zh', │
│ │ 'max_ppl': 1500, │
│ │ 'mem_required': 0, │
│ │ 'spec_numprocs': 0, │
│ │ 'text_key': 'text', │
│ │ 'use_actor': False, │
│ │ 'video_key': 'videos'}}] │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ save_stats_in_one_file │ False │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ ray_address │ 'auto' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ work_dir │ '/Users/guangshengliu/LLM/data/data-juicer/outputs/demo-analyser' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ timestamp │ '20240415162051' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ dataset_dir │ '/Users/guangshengliu/LLM/data/data-juicer/raw_data' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ add_suffix │ False │
╘════════════════════════╧══════════════════════════════════════════════════════════════════════════════════════════════╛
2024-04-15 16:20:51 | INFO | data_juicer.core.analyser:39 - Using cache compression method: [None]
2024-04-15 16:20:51 | INFO | data_juicer.core.analyser:44 - Setting up data formatter...
2024-04-15 16:20:51 | INFO | data_juicer.core.analyser:53 - Preparing exporter...
2024-04-15 16:20:51 | INFO | data_juicer.core.analyser:75 - Loading dataset from data formatter...
2024-04-15 16:20:53 | INFO | datasets.load:1791 - Downloading and preparing dataset json/default to /Users/guangshengliu/.cache/huggingface/datasets/json/default-f3511e4c073db59c/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e...
Downloading data files: 100%|##########| 1/1 [00:00<00:00, 5377.31it/s]
Extracting data files: 100%|##########| 1/1 [00:00<00:00, 484.95it/s]
2024-04-15 16:20:53 | INFO | logging:952 - Setting num_proc from 4 back to 1 for the json split to disable multiprocessing as it only contains one shard.
2024-04-15 16:20:53 | INFO | datasets.load:1791 - Dataset json downloaded and prepared to /Users/guangshengliu/.cache/huggingface/datasets/json/default-f3511e4c073db59c/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e. Subsequent calls will reuse this data.
100%|##########| 1/1 [00:00<00:00, 352.61it/s]
2024-04-15 16:20:53 | INFO | data_juicer.format.formatter:185 - Unifying the input dataset formats...
2024-04-15 16:20:53 | INFO | data_juicer.format.formatter:200 - There are 2728 sample(s) in the original dataset.
2024-04-15 16:20:53 | INFO | data_juicer.format.formatter:214 - 2728 samples left after filtering empty text.
2024-04-15 16:20:53 | INFO | data_juicer.format.mixture_formatter:136 - sampled 2728 from 2728
2024-04-15 16:20:53 | INFO | data_juicer.format.mixture_formatter:142 - There are 2728 in final dataset
2024-04-15 16:20:53 | INFO | data_juicer.core.analyser:81 - Preparing process operators...
2024-04-15 16:20:53 | INFO | data_juicer.utils.model_utils:102 - Loading fasttext language identification model...
Warning : load_model does not return WordVectorModel or SupervisedModel any more, but a FastText object which is very similar.
2024-04-15 16:20:53 | WARNING | data_juicer.ops.load:24 - This OP [perplexity_filter] is unavailable due to importing third-party requirements of this OP failure: ['sentencepiece', 'kenlm']. You can either run pip install -v -e .[sci] to install all requirements for all OPs, or run pip install sentencepiece kenlm with library version specified by environments/science_requires.txt to install libraries required by this OP. Data processing will skip this OP later.
2024-04-15 16:20:53 | INFO | data_juicer.core.analyser:86 - Computing the stats of dataset...
2024-04-15 16:20:53 | ERROR | __main__:13 - An error has been caught in function '<module>', process 'MainProcess' (84118), thread 'MainThread' (8040734208):
multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/multiprocess/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1353, in _write_generator_to_queue
for i, result in enumerate(func(**kwargs)):
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3358, in _map_single
example = apply_function_on_filtered_inputs(example, i, offset=offset)
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3261, in apply_function_on_filtered_inputs
processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 47, in wrapped_f
return f(*args, **kargs)
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 47, in wrapped_f
return f(*args, **kargs)
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/ops/filter/language_id_score_filter.py", line 56, in compute_stats
text = sample[self.text_key].lower().replace('\n', ' ')
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/formatting/formatting.py", line 280, in __getitem__
value = self.data[key]
KeyError: 'text'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/guangshengliu/LLM/data/data-juicer/tools/analyze_data.py", line 9, in main
analyser.run()
│ └ <function Analyser.run at 0x291876f70>
└ <data_juicer.core.analyser.Analyser object at 0x10405c250>
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/analyser.py", line 100, in run
dataset = dataset.map(op.compute_stats,
│ │ │ └ <function LanguageIDScoreFilter.compute_stats at 0x291ebea60>
│ │ └ <data_juicer.ops.filter.language_id_score_filter.LanguageIDScoreFilter object at 0x16d487d60>
│ └ <function NestedDataset.map at 0x2918764c0>
└ Dataset({
features: ['instruction', 'input', 'output', '__dj__stats__'],
num_rows: 2728
})
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 180, in map
new_ds = NestedDataset(super().map(*args, **kargs))
│ │ └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '573d685b3bbeed84'}
│ └ [<function LanguageIDScoreFilter.compute_stats at 0x291e9a0d0>]
└ <class 'data_juicer.core.data.NestedDataset'>
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 563, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
│ │ │ │ └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '573d685b3bbeed84'}
│ │ │ └ (<function LanguageIDScoreFilter.compute_stats at 0x291e9a0d0>,)
│ │ └ Dataset({
│ │ features: ['instruction', 'input', 'output', '__dj__stats__'],
│ │ num_rows: 2728
│ │ })
│ └ <function Dataset.map at 0x13fad51f0>
└ typing.Union
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
│ │ │ │ └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '573d685b3bbeed84'}
│ │ │ └ (<function LanguageIDScoreFilter.compute_stats at 0x291e9a0d0>,)
│ │ └ Dataset({
│ │ features: ['instruction', 'input', 'output', '__dj__stats__'],
│ │ num_rows: 2728
│ │ })
│ └ <function Dataset.map at 0x13fad5160>
└ typing.Union
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3097, in map
for rank, done, content in iflatmap_unordered(
│ │ │ └ <function iflatmap_unordered at 0x13f2780d0>
│ │ └ 0
│ └ False
└ 3
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1377, in iflatmap_unordered
[async_result.get() for async_result in async_results]
└ [<multiprocess.pool.ApplyResult object at 0x291e2a670>, <multiprocess.pool.ApplyResult object at 0x291e2a790>, <multiprocess....
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1377, in <listcomp>
[async_result.get() for async_result in async_results]
│ │ └ <multiprocess.pool.ApplyResult object at 0x291e2a670>
│ └ <function ApplyResult.get at 0x13f2764c0>
└ <multiprocess.pool.ApplyResult object at 0x291e2a670>
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/multiprocess/pool.py", line 771, in get
raise self._value
│ └ KeyError('text')
└ <multiprocess.pool.ApplyResult object at 0x291e2a670>
KeyError: 'text'
To Reproduce
2024-04-15 16:15:39 | ERROR | __main__:13 - An error has been caught in function '<module>', process 'MainProcess' (83335), thread 'MainThread' (8040734208):
multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/multiprocess/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1353, in _write_generator_to_queue
for i, result in enumerate(func(**kwargs)):
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3358, in _map_single
example = apply_function_on_filtered_inputs(example, i, offset=offset)
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3261, in apply_function_on_filtered_inputs
processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 47, in wrapped_f
return f(*args, **kargs)
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 47, in wrapped_f
return f(*args, **kargs)
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/ops/filter/language_id_score_filter.py", line 56, in compute_stats
text = sample[self.text_key].lower().replace('\n', ' ')
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/formatting/formatting.py", line 280, in __getitem__
value = self.data[key]
KeyError: 'text'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/guangshengliu/LLM/data/data-juicer/tools/analyze_data.py", line 9, in main
analyser.run()
│ └ <function Analyser.run at 0x17edb4f70>
└ <data_juicer.core.analyser.Analyser object at 0x101d91250>
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/analyser.py", line 100, in run
dataset = dataset.map(op.compute_stats,
│ │ │ └ <function LanguageIDScoreFilter.compute_stats at 0x16f77a280>
│ │ └ <data_juicer.ops.filter.language_id_score_filter.LanguageIDScoreFilter object at 0x16f77fac0>
│ └ <function NestedDataset.map at 0x17edb44c0>
└ Dataset({
features: ['instruction', 'input', 'output', 'dj__stats'],
num_rows: 1000
})
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 180, in map
new_ds = NestedDataset(super().map(*args, **kargs))
│ │ └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '8d414177a08f8240'}
│ └ [<function LanguageIDScoreFilter.compute_stats at 0x16f77a1f0>]
└ <class 'data_juicer.core.data.NestedDataset'>
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 563, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
│ │ │ │ └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '8d414177a08f8240'}
│ │ │ └ (<function LanguageIDScoreFilter.compute_stats at 0x16f77a1f0>,)
│ │ └ Dataset({
│ │ features: ['instruction', 'input', 'output', 'dj__stats'],
│ │ num_rows: 1000
│ │ })
│ └ <function Dataset.map at 0x139ca51f0>
└ typing.Union
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
│ │ │ │ └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '8d414177a08f8240'}
│ │ │ └ (<function LanguageIDScoreFilter.compute_stats at 0x16f77a1f0>,)
│ │ └ Dataset({
│ │ features: ['instruction', 'input', 'output', 'dj__stats'],
│ │ num_rows: 1000
│ │ })
│ └ <function Dataset.map at 0x139ca5160>
└ typing.Union
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3097, in map
for rank, done, content in iflatmap_unordered(
│ │ │ └ <function iflatmap_unordered at 0x1394380d0>
│ │ └ 0
│ └ False
└ 3
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1377, in iflatmap_unordered
[async_result.get() for async_result in async_results]
└ [<multiprocess.pool.ApplyResult object at 0x16f839e50>, <multiprocess.pool.ApplyResult object at 0x16f850790>, <multiprocess....
(data_juicer) guangshengliu@MacBook-Air data % cd data-juicer
(data_juicer) guangshengliu@MacBook-Air data-juicer % python tools/analyze_data.py --config configs/demo/analyser.yaml
2024-04-15 16:20:47.909 | DEBUG | data_juicer.utils.availability_utils:_is_package_available:116 - Detected torch version 2.2.2
2024-04-15 16:20:48.682 | INFO | data_juicer:setup_mp:58 - Setting multiprocess start method to 'fork'.
2024-04-15 16:20:48.682 | DEBUG | data_juicer:setup_cuda:72 - _USE_CUDA: False | MP: fork (MainProcess)
2024-04-15 16:20:51 | INFO | data_juicer.config.config:533 - Back up the input config file [/Users/guangshengliu/LLM/data/data-juicer/configs/demo/analyser.yaml] into the work_dir [/Users/guangshengliu/LLM/data/data-juicer/outputs/demo-analyser]
2024-04-15 16:20:51 | INFO | data_juicer.config.config:554 - Configuration table:
╒════════════════════════╤══════════════════════════════════════════════════════════════════════════════════════════════╕
│ key │ values │
╞════════════════════════╪══════════════════════════════════════════════════════════════════════════════════════════════╡
│ config │ [Path_fr(configs/demo/analyser.yaml, cwd=/Users/guangshengliu/LLM/data/data-juicer)] │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ hpo_config │ None │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ path_3sigma_recipe │ None │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ project_name │ 'demo-analyser' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ executor_type │ 'default' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ dataset_path │ '/Users/guangshengliu/LLM/data/data-juicer/raw_data/aqua_train.json' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ export_path │ '/Users/guangshengliu/LLM/data/data-juicer/outputs/demo-analyser/demo-analyser-result.jsonl' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ export_shard_size │ 0 │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ export_in_parallel │ False │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ keep_stats_in_res_ds │ False │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ keep_hashes_in_res_ds │ False │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ np │ 4 │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ text_keys │ ['instruction', 'output'] │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ image_key │ 'images' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ image_special_token │ '<__dj__image>' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ audio_key │ 'audios' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ audio_special_token │ '<__dj__audio>' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ video_key │ 'videos' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ video_special_token │ '<__dj__video>' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ eoc_special_token │ '<|__dj__eoc|>' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ suffixes │ [] │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ use_cache │ True │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ ds_cache_dir │ '/Users/guangshengliu/.cache/huggingface/datasets' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ cache_compress │ None │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ use_checkpoint │ False │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ temp_dir │ None │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ open_tracer │ False │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ op_list_to_trace │ [] │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ trace_num │ 10 │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ op_fusion │ False │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ process │ [{'language_id_score_filter': {'accelerator': 'cpu', │
│ │ 'audio_key': 'audios', │
│ │ 'cpu_required': 1, │
│ │ 'image_key': 'images', │
│ │ 'lang': 'zh', │
│ │ 'mem_required': 0, │
│ │ 'min_score': 0.8, │
│ │ 'spec_numprocs': 0, │
│ │ 'text_key': 'text', │
│ │ 'use_actor': False, │
│ │ 'video_key': 'videos'}}, │
│ │ {'perplexity_filter': {'accelerator': 'cpu', │
│ │ 'audio_key': 'audios', │
│ │ 'cpu_required': 1, │
│ │ 'image_key': 'images', │
│ │ 'lang': 'zh', │
│ │ 'max_ppl': 1500, │
│ │ 'mem_required': 0, │
│ │ 'spec_numprocs': 0, │
│ │ 'text_key': 'text', │
│ │ 'use_actor': False, │
│ │ 'video_key': 'videos'}}] │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ save_stats_in_one_file │ False │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ ray_address │ 'auto' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ work_dir │ '/Users/guangshengliu/LLM/data/data-juicer/outputs/demo-analyser' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ timestamp │ '20240415162051' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ dataset_dir │ '/Users/guangshengliu/LLM/data/data-juicer/raw_data' │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│ add_suffix │ False │
╘════════════════════════╧══════════════════════════════════════════════════════════════════════════════════════════════╛
2024-04-15 16:20:51 | INFO | data_juicer.core.analyser:39 - Using cache compression method: [None]
2024-04-15 16:20:51 | INFO | data_juicer.core.analyser:44 - Setting up data formatter...
2024-04-15 16:20:51 | INFO | data_juicer.core.analyser:53 - Preparing exporter...
2024-04-15 16:20:51 | INFO | data_juicer.core.analyser:75 - Loading dataset from data formatter...
2024-04-15 16:20:53 | INFO | datasets.load:1791 - Downloading and preparing dataset json/default to /Users/guangshengliu/.cache/huggingface/datasets/json/default-f3511e4c073db59c/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e...
Downloading data files: 100%|##########| 1/1 [00:00<00:00, 5377.31it/s]
Extracting data files: 100%|##########| 1/1 [00:00<00:00, 484.95it/s]
2024-04-15 16:20:53 | INFO | logging:952 - Setting num_proc from 4 back to 1 for the json split to disable multiprocessing as it only contains one shard.
2024-04-15 16:20:53 | INFO | datasets.load:1791 - Dataset json downloaded and prepared to /Users/guangshengliu/.cache/huggingface/datasets/json/default-f3511e4c073db59c/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e. Subsequent calls will reuse this data.
100%|##########| 1/1 [00:00<00:00, 352.61it/s]
2024-04-15 16:20:53 | INFO | data_juicer.format.formatter:185 - Unifying the input dataset formats...
2024-04-15 16:20:53 | INFO | data_juicer.format.formatter:200 - There are 2728 sample(s) in the original dataset.
2024-04-15 16:20:53 | INFO | data_juicer.format.formatter:214 - 2728 samples left after filtering empty text.
2024-04-15 16:20:53 | INFO | data_juicer.format.mixture_formatter:136 - sampled 2728 from 2728
2024-04-15 16:20:53 | INFO | data_juicer.format.mixture_formatter:142 - There are 2728 in final dataset
2024-04-15 16:20:53 | INFO | data_juicer.core.analyser:81 - Preparing process operators...
2024-04-15 16:20:53 | INFO | data_juicer.utils.model_utils:102 - Loading fasttext language identification model...
Warning : load_model does not return WordVectorModel or SupervisedModel any more, but a FastText object which is very similar.
2024-04-15 16:20:53 | WARNING | data_juicer.ops.load:24 - This OP [perplexity_filter] is unavailable due to importing third-party requirements of this OP failure: ['sentencepiece', 'kenlm']. You can either run pip install -v -e .[sci] to install all requirements for all OPs, or run pip install sentencepiece kenlm with library version specified by environments/science_requires.txt to install libraries required by this OP. Data processing will skip this OP later.
2024-04-15 16:20:53 | INFO | data_juicer.core.analyser:86 - Computing the stats of dataset...
2024-04-15 16:20:53 | ERROR | __main__:13 - An error has been caught in function '<module>', process 'MainProcess' (84118), thread 'MainThread' (8040734208):
multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/multiprocess/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1353, in _write_generator_to_queue
for i, result in enumerate(func(**kwargs)):
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3358, in _map_single
example = apply_function_on_filtered_inputs(example, i, offset=offset)
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3261, in apply_function_on_filtered_inputs
processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 47, in wrapped_f
return f(*args, **kargs)
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 47, in wrapped_f
return f(*args, **kargs)
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/ops/filter/language_id_score_filter.py", line 56, in compute_stats
text = sample[self.text_key].lower().replace('\n', ' ')
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/formatting/formatting.py", line 280, in __getitem__
value = self.data[key]
KeyError: 'text'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/guangshengliu/LLM/data/data-juicer/tools/analyze_data.py", line 9, in main
analyser.run()
│ └ <function Analyser.run at 0x291876f70>
└ <data_juicer.core.analyser.Analyser object at 0x10405c250>
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/analyser.py", line 100, in run
dataset = dataset.map(op.compute_stats,
│ │ │ └ <function LanguageIDScoreFilter.compute_stats at 0x291ebea60>
│ │ └ <data_juicer.ops.filter.language_id_score_filter.LanguageIDScoreFilter object at 0x16d487d60>
│ └ <function NestedDataset.map at 0x2918764c0>
└ Dataset({
features: ['instruction', 'input', 'output', 'dj__stats'],
num_rows: 2728
})
File "/Users/guangshengliu/LLM/data/data-juicer/data_juicer/core/data.py", line 180, in map
new_ds = NestedDataset(super().map(*args, **kargs))
│ │ └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '573d685b3bbeed84'}
│ └ [<function LanguageIDScoreFilter.compute_stats at 0x291e9a0d0>]
└ <class 'data_juicer.core.data.NestedDataset'>
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 563, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
│ │ │ │ └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '573d685b3bbeed84'}
│ │ │ └ (<function LanguageIDScoreFilter.compute_stats at 0x291e9a0d0>,)
│ │ └ Dataset({
│ │ features: ['instruction', 'input', 'output', 'dj__stats'],
│ │ num_rows: 2728
│ │ })
│ └ <function Dataset.map at 0x13fad51f0>
└ typing.Union
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
│ │ │ │ └ {'num_proc': 4, 'desc': 'language_id_score_filter_compute_stats', 'new_fingerprint': '573d685b3bbeed84'}
│ │ │ └ (<function LanguageIDScoreFilter.compute_stats at 0x291e9a0d0>,)
│ │ └ Dataset({
│ │ features: ['instruction', 'input', 'output', 'dj__stats'],
│ │ num_rows: 2728
│ │ })
│ └ <function Dataset.map at 0x13fad5160>
└ typing.Union
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3097, in map
for rank, done, content in iflatmap_unordered(
│ │ │ └ <function iflatmap_unordered at 0x13f2780d0>
│ │ └ 0
│ └ False
└ 3
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1377, in iflatmap_unordered
[async_result.get() for async_result in async_results]
└ [<multiprocess.pool.ApplyResult object at 0x291e2a670>, <multiprocess.pool.ApplyResult object at 0x291e2a790>, <multiprocess....
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1377, in <listcomp>
[async_result.get() for async_result in async_results]
│ │ └ <multiprocess.pool.ApplyResult object at 0x291e2a670>
│ └ <function ApplyResult.get at 0x13f2764c0>
└ <multiprocess.pool.ApplyResult object at 0x291e2a670>
File "/Users/guangshengliu/opt/anaconda3/envs/data_juicer/lib/python3.9/site-packages/multiprocess/pool.py", line 771, in get
raise self._value
│ └ KeyError('text')
└ <multiprocess.pool.ApplyResult object at 0x291e2a670>
KeyError: 'text'
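Comparing the configuration table with the traceback, the failure looks like a key mismatch: the dataset columns are ['instruction', 'input', 'output', 'dj__stats'] and text_keys is ['instruction', 'output'], but both entries under process carry 'text_key': 'text'. The sketch below reproduces that mismatch with plain Python dicts instead of Data-Juicer objects. The helper names (find_missing_text_keys, compute_stats_like) and the sample row are hypothetical and made up for illustration; only the column names, the op settings and the failing expression are taken from the output above.

```python
# Column names copied from the Dataset repr in the traceback above.
dataset_columns = ["instruction", "input", "output", "dj__stats"]

# Per-op settings copied (abridged) from the 'process' row of the config table.
process = [
    {"language_id_score_filter": {"lang": "zh", "min_score": 0.8, "text_key": "text"}},
    {"perplexity_filter": {"lang": "zh", "max_ppl": 1500, "text_key": "text"}},
]


def find_missing_text_keys(process_cfg, columns):
    """Hypothetical helper: return (op_name, text_key) pairs whose key is not a column."""
    missing = []
    for op in process_cfg:
        for op_name, args in op.items():
            key = args.get("text_key")
            if key is not None and key not in columns:
                missing.append((op_name, key))
    return missing


def compute_stats_like(sample, text_key):
    """Same expression as the failing line 56 of language_id_score_filter.py."""
    return sample[text_key].lower().replace("\n", " ")


# Made-up row with the same columns as the dataset (stats column omitted).
sample = {"instruction": "Solve for x.", "input": "", "output": "x = 2"}

# Both ops point at a column that does not exist.
missing = find_missing_text_keys(process, dataset_columns)
print(missing)

# Looking up the missing key fails the same way as in the traceback.
try:
    compute_stats_like(sample, "text")
except KeyError as err:
    print("KeyError:", err)

# With a key that is present, the same expression succeeds.
print(compute_stats_like(sample, "instruction"))
```

If this reading is right, the op-level text_key is not picking up the text_keys setting from the config, so a check of this kind before calling dataset.map would surface the mismatch earlier than the worker-side KeyError does.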
Configs 配置信息
No response
Logs 报错日志
No response
Screenshots 截图
No response
Additional 额外信息
No response