Training data processing is slow #4

Open
yunlongia opened this issue May 29, 2024 · 1 comment

@yunlongia
Hello, I'm processing the RedPajama data and it's unacceptably slow, especially for the books domain. Do you have any suggestions?
Or could you share a copy of your processed training data? Thanks a lot!

@howard-yen
Collaborator

Hi, can you share some details on which step is giving you trouble?

If the tokenization is what's slow, I would recommend the SentencePiece tokenizer instead of the Transformers tokenizer (I talk about it here).
In my experience, the SentencePiece tokenizer is much faster on long sequences (which matters a lot for the books domain), whereas the Transformers tokenizer is faster on large batches of shorter sequences.
Switching over to the SentencePiece tokenizer is easy: simply uncomment this line. A minimal sketch of the idea is below.
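For illustration, here is a minimal sketch of tokenizing one long document directly with SentencePiece; the model path is a placeholder, not this repo's actual file:

```python
import sentencepiece as spm

# "tokenizer.model" is a placeholder path for your SentencePiece model file.
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

def tokenize(text: str) -> list[int]:
    # encode() with the default out_type=int returns token ids directly;
    # on a single book-length string this avoids the Python-side overhead
    # that makes a slow Transformers tokenizer drag on long sequences.
    return sp.encode(text)
```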
You can also shard this work across multiple processes if you have the CPU cores to spare. To do this, change this line and specify a larger shard_size; a rough sketch of that pattern follows.
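A rough sketch of sharding the tokenization across a process pool, assuming the same placeholder model file; the shard size, pool size, and toy corpus here are made-up values for illustration, not this repo's interface:

```python
import sentencepiece as spm
from multiprocessing import Pool

SHARD_SIZE = 1000  # documents per shard; tune to your corpus and core count

def tokenize_shard(docs: list[str]) -> list[list[int]]:
    # Load the model inside the worker so each process gets its own copy.
    sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
    return [sp.encode(d) for d in docs]

def make_shards(docs: list[str], shard_size: int):
    # Yield consecutive slices of the corpus, one slice per worker task.
    for i in range(0, len(docs), shard_size):
        yield docs[i:i + shard_size]

if __name__ == "__main__":
    docs = ["example document"] * 10_000  # stand-in for your corpus
    with Pool(processes=16) as pool:
        shards = pool.map(tokenize_shard, make_shards(docs, SHARD_SIZE))
    tokenized = [ids for shard in shards for ids in shard]
```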

If the sampling step is what's slow, you can try increasing the number of shards, as we discuss here. Let me know if you need help with anything else!

I would be happy to share the training data, though it totals about 5T, which can be very slow to transfer over a network. If you are still running into issues and want me to send the training data, please email me at hyen@princeton.edu and we can figure out a way to do this :)
