How to speed up tokenizer.encode? #62

Closed
PeiqinSun opened this issue Oct 12, 2023 · 2 comments

Comments

@PeiqinSun

I found that in the pre-training datasets, some docs have a very large number of characters, which makes them take a long time to encode. For example, a doc with 15,955,671 chars takes 6.6 hours to encode.

How do you speed this up? Split the doc into many sub-docs? I'm using Megatron for pre-training; any ideas?

Looking forward to hearing from you when you have time. Thank you very much.
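
A minimal sketch of the splitting idea, assuming a Hugging Face fast tokenizer; the model name, chunk size, and helper name below are illustrative and not from this issue:

```python
# Illustrative only: chunked encoding of one very long document with a
# Hugging Face tokenizer. "gpt2" and chunk_chars are placeholder choices.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=True)

def encode_long_doc(text: str, chunk_chars: int = 1_000_000) -> list[int]:
    """Encode a long document in fixed-size character chunks.

    Splitting mid-word can change the tokens right at a chunk boundary,
    so cutting at a newline or space is safer when exact reproducibility matters.
    """
    ids: list[int] = []
    for start in range(0, len(text), chunk_chars):
        chunk = text[start:start + chunk_chars]
        ids.extend(tokenizer.encode(chunk, add_special_tokens=False))
    return ids
```

Since each chunk is independent, the per-chunk calls could also be spread across worker processes (e.g. with multiprocessing.Pool) to use more cores.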

@VatsaDev

Wrong repo? This is TinyLlama, not Megatron-LM. Besides, tokenizer.encode is most likely an HF method, so you would have to look at the HF repos instead.

  • if possible, you could try tiktoken, which is supposedly about 3x faster (see the sketch below)
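
For reference, a quick sketch of the tiktoken alternative; the encoding name here is an assumption, and tiktoken ships OpenAI BPE vocabularies rather than TinyLlama's, so the resulting token ids will differ:

```python
# Illustrative only: encoding with tiktoken. The "gpt2" encoding is a
# placeholder; tiktoken does not provide the Llama tokenizer's vocabulary.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

# encode_ordinary skips special-token handling, which is convenient for
# arbitrary pre-training text.
ids = enc.encode_ordinary("a very long document ...")
print(len(ids))
```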

@PeiqinSun
Author

Thanks for your time.
