How to speed up tokenizer.encode? #62

Closed
PeiqinSun opened this issue Oct 12, 2023 · 2 comments

Comments

@PeiqinSun

I found that in the pre-training datasets, some docs have a very large number of characters, which makes them take a long time to encode. For example, a doc with 15,955,671 chars takes 6.6 hours to encode.

How do you speed this up? Split the doc into many sub-docs? I'm using Megatron for pre-training; any ideas?

Looking forward to hearing from you when you have time. Thank you very much.
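
A minimal sketch of the splitting idea, assuming a Hugging Face fast tokenizer; the model name, chunk size, and helper name below are illustrative and not from this issue:

```python
# Illustrative only: chunked encoding of one very long document with a
# Hugging Face tokenizer. "gpt2" and chunk_chars are placeholder choices.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=True)

def encode_long_doc(text: str, chunk_chars: int = 1_000_000) -> list[int]:
    """Encode a long document in fixed-size character chunks.

    Splitting mid-word can change the tokens right at a chunk boundary,
    so cutting at a newline or space is safer when exact reproducibility matters.
    """
    ids: list[int] = []
    for start in range(0, len(text), chunk_chars):
        chunk = text[start:start + chunk_chars]
        ids.extend(tokenizer.encode(chunk, add_special_tokens=False))
    return ids
```

Since each chunk is independent, the per-chunk calls could also be spread across worker processes (e.g. with multiprocessing.Pool) to use more cores.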

@VatsaDev

Wrong repo? This is TinyLlama, not Megatron-LM. Besides, tokenizer.encode is most likely an HF method, so you would have to look at the HF repos instead.

  • if possible, you could try tiktoken, which is supposedly about 3x faster (see the sketch below)
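
For reference, a quick sketch of the tiktoken alternative; the encoding name here is an assumption, and tiktoken ships OpenAI BPE vocabularies rather than TinyLlama's, so the resulting token ids will differ:

```python
# Illustrative only: encoding with tiktoken. The "gpt2" encoding is a
# placeholder; tiktoken does not provide the Llama tokenizer's vocabulary.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

# encode_ordinary skips special-token handling, which is convenient for
# arbitrary pre-training text.
ids = enc.encode_ordinary("a very long document ...")
print(len(ids))
```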

@PeiqinSun
Author

Thanks for your time.
