New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tokenization slows towards end of dataset #6734
Comments
Hi ! First note that if the dataset is not heterogeneous / shuffled, there might be places in the data with shorter texts that are faster to tokenize. Moreover, the way |
I did see some comments about how num_proc=None could help and outputting numpy arrays can also help in the docs, but this seems quite odd now dropping down to 1it/s Running tokenizer on dataset (num_proc=48): 99%|█████████▉| 46048888/46390354 [12:33:30<4:20:32, 21.84 examples/s]
Running tokenizer on dataset (num_proc=48): 99%|█████████▉| 46049888/46390354 [12:36:11<8:37:59, 10.95 examples/s]
Running tokenizer on dataset (num_proc=48): 99%|█████████▉| 46050888/46390354 [12:46:35<24:56:56, 3.78 examples/s]
Running tokenizer on dataset (num_proc=48): 99%|█████████▉| 46051888/46390354 [12:56:43<35:08:10, 2.68 examples/s]
Running tokenizer on dataset (num_proc=48): 99%|█████████▉| 46052888/46390354 [13:06:58<42:05:41, 2.23 examples/s]
Running tokenizer on dataset (num_proc=48): 99%|█████████▉| 46053888/46390354 [13:16:01<44:40:18, 2.09 examples/s]
Running tokenizer on dataset (num_proc=48): 99%|█████████▉| 46054888/46390354 [13:25:11<46:35:28, 2.00 examples/s]
Running tokenizer on dataset (num_proc=48): 99%|█████████▉| 46055888/46390354 [13:34:23<47:55:34, 1.94 examples/s] |
@ethansmith2000 Hi, did you solve this problem? I'm strugging with the same problem now. |
Describe the bug
Mapped tokenization slows down substantially towards end of dataset.
train set started off very slow, caught up to 20k then tapered off til the end.
what's particularly strange is that the tokenization crashed a few times before due to errors with invalid tokens somewhere or corrupted downloads, and the speed ups/downs consistently happened the same times
and validation set as well
Steps to reproduce the bug
running through slurm script
using this dataset https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T
Expected behavior
Constant speed throughout
Environment info
datasets
version: 2.15.0huggingface_hub
version: 0.19.4fsspec
version: 2023.10.0The text was updated successfully, but these errors were encountered: