Training data processing is slow #4

Open
yunlongia opened this issue May 29, 2024 · 1 comment

@yunlongia
Hello, I'm processing the RedPajama data and it's unacceptably slow, especially for the books domain. Do you have any suggestions?
Or could you share a copy of your processed training data? Thanks a lot!

@howard-yen
Collaborator

Hi, can you share some details on which step is giving you trouble?

If the tokenization is what's slow, I would recommend the SentencePiece tokenizer instead of the Transformers tokenizer (I talk about it here).
In my experience, the SentencePiece tokenizer is much faster on long sequences (which matters a lot for the books domain), whereas the Transformers tokenizer is faster on large batches of shorter sequences.
Switching over to the SentencePiece tokenizer is easy: simply uncomment this line. A minimal sketch of the idea is below.
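For illustration, here is a minimal sketch of tokenizing one long document directly with SentencePiece; the model path is a placeholder, not this repo's actual file:

```python
import sentencepiece as spm

# "tokenizer.model" is a placeholder path for your SentencePiece model file.
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

def tokenize(text: str) -> list[int]:
    # encode() with the default out_type=int returns token ids directly;
    # on a single book-length string this avoids the Python-side overhead
    # that makes a slow Transformers tokenizer drag on long sequences.
    return sp.encode(text)
```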
You can also shard this work across multiple processes if you have the CPU cores to spare. To do this, change this line and specify a larger shard_size; a rough sketch of that pattern follows.
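A rough sketch of sharding the tokenization across a process pool, assuming the same placeholder model file; the shard size, pool size, and toy corpus here are made-up values for illustration, not this repo's interface:

```python
import sentencepiece as spm
from multiprocessing import Pool

SHARD_SIZE = 1000  # documents per shard; tune to your corpus and core count

def tokenize_shard(docs: list[str]) -> list[list[int]]:
    # Load the model inside the worker so each process gets its own copy.
    sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
    return [sp.encode(d) for d in docs]

def make_shards(docs: list[str], shard_size: int):
    # Yield consecutive slices of the corpus, one slice per worker task.
    for i in range(0, len(docs), shard_size):
        yield docs[i:i + shard_size]

if __name__ == "__main__":
    docs = ["example document"] * 10_000  # stand-in for your corpus
    with Pool(processes=16) as pool:
        shards = pool.map(tokenize_shard, make_shards(docs, SHARD_SIZE))
    tokenized = [ids for shard in shards for ids in shard]
```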

If the sampling step is what's slow, you can try increasing the number of shards, as we discuss here. Let me know if you need help with anything else!

I would be happy to share the training data, though it totals about 5T, which can be very slow to transfer over a network. If you are still running into issues and want me to send the training data, please email me at hyen@princeton.edu and we can figure out a way to do this :)
