I found that in the pre-training datasets there are some docs with a very large number of characters, which makes them take a long time to encode. For example, a doc with 15,955,671 chars takes about 6.6 hours to encode.
How do you speed this up? Split the doc into many sub-docs? I use Megatron for pre-training, so do you have any ideas?
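To be concrete, here is roughly what I mean by splitting. This is only a sketch; the tokenizer name and chunk size are placeholders, not what I actually run:

```python
from transformers import AutoTokenizer

# Placeholder tokenizer; substitute whatever tokenizer your pre-training run uses.
# AutoTokenizer picks the fast (Rust-backed) tokenizer by default when one exists,
# which already matters a lot for huge documents.
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

def encode_in_chunks(text, chunk_chars=100_000):
    """Encode a very long document by splitting it into character chunks
    and concatenating the token ids.

    Caveat: a fixed-offset boundary can fall inside a word and change the
    BPE merges right at the seam, so splitting on newlines or paragraph
    breaks is safer for pre-training data."""
    ids = []
    for start in range(0, len(text), chunk_chars):
        chunk = text[start:start + chunk_chars]
        ids.extend(tokenizer.encode(chunk, add_special_tokens=False))
    return ids
```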
Looking forward to hearing from you when you have time. Thank you very much.
Wrong repo? This is TinyLlama, not Megatron-LM. Besides, tokenizer.encode is most likely an HF method, so you would have to look at the HF repos instead.
If possible, you could try tiktoken, which is supposedly about 3x faster.
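Just as an illustration of the tiktoken API, not a drop-in replacement: `cl100k_base` below is one of tiktoken's own encodings, not the Llama/TinyLlama vocabulary, and the file path is hypothetical.

```python
import tiktoken

# cl100k_base is a built-in tiktoken BPE encoding; it is NOT the TinyLlama
# vocabulary, so this only illustrates the encode-speed path.
enc = tiktoken.get_encoding("cl100k_base")

with open("huge_doc.txt", encoding="utf-8") as f:  # hypothetical 15M-char document
    text = f.read()

token_ids = enc.encode(text)
print(f"{len(text)} chars -> {len(token_ids)} tokens")
```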