preprocess_dataset dataset.map crashed with TypeError: cannot pickle 'builtins.CoreBPE' object #328

Closed
songkq opened this issue Aug 3, 2023 · 5 comments
Labels
solved This problem has been already solved.

Comments

@songkq

songkq commented Aug 3, 2023

preprocess_dataset crashed while mapping the dataset with dataset.map. Could you please give some advice?

Spawning 32 processes
Running tokenizer on dataset (num_proc=32):   0%|          | 0/11694 [00:00<?, ? examples/s]
@hiyouga
Owner

hiyouga commented Aug 3, 2023

Use preprocessing_num_workers=1.

@hiyouga hiyouga added the pending label Aug 3, 2023
@songkq
Author

songkq commented Aug 3, 2023

@hiyouga preprocessing_num_workers=1 works, but it is a little slow.
I'm confused about why preprocessing_num_workers cannot be set greater than 1 as of commit 2780792,
since preprocessing_num_workers > 1 works well with commit 513e1f1.

@hiyouga
Owner

hiyouga commented Aug 3, 2023

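This problem is related to the tiktoken tokenizer. It looks like you are using the Qwen-7B model, whose tokenizer wraps a tiktoken CoreBPE object that cannot be pickled, so it cannot be sent to the worker processes when preprocessing_num_workers > 1.

Related issues: huggingface/datasets#5536, huggingface/datasets#5769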
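
For readers hitting the same error, here is a minimal standard-library sketch of this class of failure and one common way around it. It does not use tiktoken or datasets: NativeHandle is a hypothetical stand-in for CoreBPE (it holds a threading.Lock, which pickle also refuses), and NaiveTokenizer, PicklableTokenizer and try_pickle are illustrative names, not part of any library. The idea is to drop the unpicklable member in __getstate__ and rebuild it in __setstate__; whether this fits a given tokenizer depends on how cheaply its native object can be reconstructed.

```python
import pickle
import threading


class NativeHandle:
    """Hypothetical stand-in for an unpicklable native object such as CoreBPE."""

    def __init__(self, vocab):
        self.vocab = dict(vocab)
        self._lock = threading.Lock()  # lock objects cannot be pickled

    def encode(self, text):
        # Toy whitespace "tokenizer": unknown tokens map to -1.
        return [self.vocab.get(token, -1) for token in text.split()]


class NaiveTokenizer:
    """Stores the native handle directly, so pickling the wrapper fails."""

    def __init__(self, vocab):
        self.vocab = dict(vocab)
        self.handle = NativeHandle(self.vocab)

    def encode(self, text):
        return self.handle.encode(text)


class PicklableTokenizer(NaiveTokenizer):
    """Drops the handle when pickled and rebuilds it when unpickled."""

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["handle"]  # keep only plain, picklable data
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.handle = NativeHandle(self.vocab)  # rebuilt in the receiving process


def try_pickle(obj):
    """Return (True, round-tripped copy) or (False, error message)."""
    try:
        return True, pickle.loads(pickle.dumps(obj))
    except TypeError as exc:
        return False, str(exc)


if __name__ == "__main__":
    vocab = {"hello": 0, "world": 1}
    print(try_pickle(NaiveTokenizer(vocab)))          # fails with a pickle TypeError
    ok, clone = try_pickle(PicklableTokenizer(vocab))
    print(ok, clone.encode("hello world"))
```

With a single worker nothing has to be shipped to another process, which is consistent with preprocessing_num_workers=1 avoiding the crash.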

@songkq
Author

songkq commented Aug 3, 2023

@hiyouga Thanks.

@songkq songkq closed this as completed Aug 3, 2023
@hiyouga hiyouga added the solved label and removed the pending label Aug 3, 2023
@songkq
Author

songkq commented Aug 12, 2023

As an alternative, use this GPT2Tokenizer-based version, which supports multithreading: https://huggingface.co/vonjack/Qwen-LLaMAfied-HFTok-7B-Chat/tree/main
