
Tokenization slows towards end of dataset #6734

Open

ethansmith2000 opened this issue Mar 15, 2024 · 3 comments

Comments

ethansmith2000 commented Mar 15, 2024

Describe the bug

Mapped tokenization slows down substantially towards the end of the dataset.

The train set started off very slow, caught up to ~20k examples/s, then tapered off until the end.

What's particularly strange is that the tokenization had crashed a few times before (due to invalid tokens somewhere or corrupted downloads), and the speed-ups and slowdowns consistently happened at the same points each run.

Running tokenizer on dataset (num_proc=48):   0%|          | 847000/881416735 [12:18<252:45:45, 967.72 examples/s]
Running tokenizer on dataset (num_proc=48):   0%|          | 848000/881416735 [12:19<224:16:10, 1090.66 examples/s]

Running tokenizer on dataset (num_proc=48):  10%|| 84964000/881416735 [3:48:00<11:21:34, 19476.01 examples/s]
Running tokenizer on dataset (num_proc=48):  10%|| 84967000/881416735 [3:48:00<12:04:01, 18333.79 examples/s]

Running tokenizer on dataset (num_proc=48):  61%|██████    | 538631977/881416735 [13:46:40<27:50:04, 3420.84 examples/s]
Running tokenizer on dataset (num_proc=48):  61%|██████    | 538632977/881416735 [13:46:40<23:48:20, 3999.77 examples/s]

Running tokenizer on dataset (num_proc=48): 100%|█████████▉| 881365886/881416735 [38:30:19<04:34, 185.10 examples/s]
Running tokenizer on dataset (num_proc=48): 100%|█████████▉| 881366886/881416735 [38:30:25<04:36, 180.57 examples/s]

The validation set shows the same pattern:

Running tokenizer on dataset (num_proc=48):  90%|████████▉ | 41544000/46390354 [28:44<02:37, 30798.76 examples/s]
Running tokenizer on dataset (num_proc=48):  90%|████████▉ | 41550000/46390354 [28:44<02:08, 37698.08 examples/s]

Running tokenizer on dataset (num_proc=48):  96%|█████████▋| 44747422/46390354 [2:15:48<12:22:44, 36.87 examples/s]
Running tokenizer on dataset (num_proc=48):  96%|█████████▋| 44747422/46390354 [2:16:00<12:22:44, 36.87 examples/s]

Steps to reproduce the bug

Using the following kwargs:

with accelerator.main_process_first():
    lm_datasets = tokenized_datasets.map(
        group_texts,
        batched=True,
        num_proc=48,
        load_from_cache_file=True,
        desc=f"Grouping texts in chunks of {block_size}",
    )

Running through a Slurm script:

#SBATCH --partition=gpu-nvidia-a100
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gpus-per-task=8
#SBATCH --cpus-per-task=96

Using this dataset: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T
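
For context, a minimal end-to-end sketch of the setup described above. The tokenizer ("gpt2"), the "text" column name, and the absence of a dataset config are assumptions, since the full original script isn't shown:

# Hedged reproduction sketch -- tokenizer choice, column name, and dataset
# config are assumptions, not taken from the original report.
from datasets import load_dataset
from transformers import AutoTokenizer

# A specific RedPajama subset/config may need to be passed here.
raw_datasets = load_dataset("togethercomputer/RedPajama-Data-1T")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def tokenize_function(examples):
    return tokenizer(examples["text"])

tokenized_datasets = raw_datasets.map(
    tokenize_function,
    batched=True,
    num_proc=48,
    remove_columns=raw_datasets["train"].column_names,
    desc="Running tokenizer on dataset",
)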

Expected behavior

Constant speed throughout

Environment info

  • datasets version: 2.15.0
  • Platform: Linux-5.15.0-1049-aws-x86_64-with-glibc2.10
  • Python version: 3.8.18
  • huggingface_hub version: 0.19.4
  • PyArrow version: 14.0.1
  • Pandas version: 2.0.3
  • fsspec version: 2023.10.0
lhoestq (Member) commented Mar 15, 2024

Hi! First, note that if the dataset is not heterogeneous/shuffled, there might be places in the data with shorter texts that are faster to tokenize.

Moreover, the way num_proc works is by slicing the dataset and passing each slice to a process to run the map() function. So at the very end of map(), some processes might have finished transforming their slice of data while others are still running, causing the throughput to become lower.
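For illustration, here is a rough sketch of that slicing behavior (a toy example, not the actual datasets internals; the data and shard counts are made up):

# Toy illustration of why the tail of a multiprocess map() can crawl:
# num_proc=N splits the dataset into N contiguous shards, one per worker.
# If the slow-to-process rows are clustered near the end, most workers
# finish early and overall throughput collapses to whatever the last
# worker can manage.
from datasets import Dataset

ds = Dataset.from_dict(
    {"text": ["short"] * 9_000 + ["a much longer document " * 200] * 1_000}
)

num_proc = 4
shards = [
    ds.shard(num_shards=num_proc, index=i, contiguous=True)
    for i in range(num_proc)
]
for i, shard in enumerate(shards):
    # The last shard here contains all the long documents, so in a real
    # map(num_proc=4) run three workers would finish early while the
    # progress bar crawls at ~99%.
    print(f"shard {i}: {len(shard)} rows")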

ethansmith2000 (Author) commented:

I did see some notes in the docs about how num_proc=None can help, and that outputting NumPy arrays can also help (sketched after the logs below), but this still seems quite odd, now dropping down to about 1 it/s:

Running tokenizer on dataset (num_proc=48):  99%|█████████▉| 46048888/46390354 [12:33:30<4:20:32, 21.84 examples/s]
Running tokenizer on dataset (num_proc=48):  99%|█████████▉| 46049888/46390354 [12:36:11<8:37:59, 10.95 examples/s]
Running tokenizer on dataset (num_proc=48):  99%|█████████▉| 46050888/46390354 [12:46:35<24:56:56,  3.78 examples/s]
Running tokenizer on dataset (num_proc=48):  99%|█████████▉| 46051888/46390354 [12:56:43<35:08:10,  2.68 examples/s]
Running tokenizer on dataset (num_proc=48):  99%|█████████▉| 46052888/46390354 [13:06:58<42:05:41,  2.23 examples/s]
Running tokenizer on dataset (num_proc=48):  99%|█████████▉| 46053888/46390354 [13:16:01<44:40:18,  2.09 examples/s]
Running tokenizer on dataset (num_proc=48):  99%|█████████▉| 46054888/46390354 [13:25:11<46:35:28,  2.00 examples/s]
Running tokenizer on dataset (num_proc=48):  99%|█████████▉| 46055888/46390354 [13:34:23<47:55:34,  1.94 examples/s]
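
For reference, a minimal sketch of the two suggestions from the docs mentioned above (num_proc=None and NumPy outputs). The tokenizer and the padding/truncation settings are assumptions, not the original setup:

# Hedged sketch of the docs suggestions referenced above; raw_datasets is
# the dataset loaded as in the earlier sketch.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default

def tokenize_function(examples):
    # return_tensors="np" yields NumPy arrays, which are cheaper to convert
    # to Arrow than Python lists; padding/truncation keep them rectangular.
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=1024,
        return_tensors="np",
    )

tokenized = raw_datasets.map(
    tokenize_function,
    batched=True,
    num_proc=None,  # let the fast (Rust) tokenizer use its own parallelism
    remove_columns=["text"],
    desc="Running tokenizer on dataset",
)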

lsh0520 commented Apr 11, 2024

@ethansmith2000 Hi, did you solve this problem? I'm struggling with the same problem now.
