Why not use rayon? #54

Closed
smr97 opened this issue Jan 12, 2020 · 6 comments
Comments

smr97 commented Jan 12, 2020

Hey, to start off, congratulations on this successful release of a tokenizer written in Rust. It is a great idea, and as a fellow Rust user I'm happy to see it in use in NLP. I was wondering why the functional code you have for tokenization does not, at its core, use Rayon? Word tokenization seems embarrassingly parallel over the number of words, so it should be a free speedup on typical multi-core machines. Furthermore, it should be a relatively small code change rather than a complete rewrite, I think.

Let me know what your thoughts are!

n1t0 (Member) commented Jan 12, 2020

We are actually using rayon to encode multiple sequences in parallel, see https://github.com/huggingface/tokenizers/blob/master/tokenizers/src/tokenizer/mod.rs#L357.

Otherwise, we don't make any assumptions about the type of content we are going to process. Depending on your choice of PreTokenizer and Model, there isn't necessarily a concept of word, so parallelization at that level isn't really possible, at least not trivially.
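
From the Python bindings, the practical way to benefit from that parallelism is to hand the library a whole batch in a single call. A minimal sketch, assuming the bindings expose `encode_batch` and a `BertWordPieceTokenizer` class (exact names, constructor arguments, and the vocab path are assumptions, not a quote of the real API):

```python
from tokenizers import BertWordPieceTokenizer  # class name assumed from the Python bindings

# The vocab path is an assumption; any BERT WordPiece vocab file works.
tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt")

texts = ["Hello world!", "Tokenizers are fast.", "Rayon parallelizes the batch."]

# One sequence at a time: each call encodes on a single core.
single = [tokenizer.encode(t) for t in texts]

# Whole batch in one call: the Rust side iterates with rayon's par_iter,
# so the sequences are encoded across multiple cores.
batch = tokenizer.encode_batch(texts)
```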

smr97 (Author) commented Jan 13, 2020

Thanks for the response! I am new to NLP, so I did not know that the concept of a word is not generic across models. I had also missed that call to par_iter(), so thanks for pointing it out.

On an unrelated note, could you please point me to a time/throughput comparison with the Python WordPiece tokenizer used for the BERT model? It would help me make the case for switching to this tokenizer at my workplace.

n1t0 (Member) commented Jan 13, 2020

Hey @smr97, you can have a look at this file: https://github.com/huggingface/tokenizers/blob/master/bindings/python/examples/example.py
It provides an example comparing the Python tokenizer available in transformers with the one provided by this library. You can process a large enough file, like one of the wiki-text-raw dumps, to see the difference.
If you need to download a vocabulary for BERT, all the links are available here: https://github.com/huggingface/transformers/blob/master/src/transformers/tokenization_bert.py#L32
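
A stripped-down sketch along the lines of that example is below; the class names and file paths (`BertTokenizer` from transformers, `BertWordPieceTokenizer` from this library, `bert-base-uncased-vocab.txt`, `wiki.train.raw`) are assumptions standing in for what the actual example.py uses:

```python
import time

from tokenizers import BertWordPieceTokenizer   # Rust-backed tokenizer from this library
from transformers import BertTokenizer          # pure-Python reference implementation

VOCAB = "bert-base-uncased-vocab.txt"            # assumed local path to a downloaded vocab
CORPUS = "wiki.train.raw"                        # assumed path to a wiki-text-raw file

with open(CORPUS, encoding="utf-8") as f:
    lines = [line for line in f if line.strip()]

python_tok = BertTokenizer(VOCAB)
rust_tok = BertWordPieceTokenizer(VOCAB)

# Time the pure-Python tokenizer, line by line.
start = time.time()
for line in lines:
    python_tok.tokenize(line)
print(f"transformers (Python): {time.time() - start:.2f}s")

# Time the Rust tokenizer on the whole batch (parallelized internally).
start = time.time()
rust_tok.encode_batch(lines)
print(f"tokenizers (Rust):     {time.time() - start:.2f}s")
```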

smr97 (Author) commented Jan 14, 2020

Thanks for the useful example, that is exactly what I wanted. I ran it on the wiki-text-raw-train file (541 MB) and saw a pretty amazing speedup (2017 MacBook)!

Just FYI though, there seems to be a bug in the script: it throws the error shown in the figure below. I checked the Rust source, and the decode_batch function does indeed take two positional arguments, so the error seems uncalled for. I have not worked with Python bindings to Rust, so I am not sure where the bug really is.

[Screenshot 2020-01-14: the error thrown by decode_batch]

mfuntowicz added a commit that referenced this issue Jan 15, 2020
…ly one possible arg.

This is suggested by the current issue #54 (comment).

kwargs cannot be passed as positional arguments, they have to be named; replacing kwargs with the actual skip_special_tokens parameter
allows both (named and positional) syntaxes.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
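
Roughly, the change means the following from the caller's side (a hypothetical before/after sketch, not the actual binding code; class and file names are assumed as in the earlier sketches):

```python
from tokenizers import BertWordPieceTokenizer  # names assumed, as above

tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt")
ids = [e.ids for e in tokenizer.encode_batch(["Hello world!", "Rust is fast."])]

# Before the fix: the flag was swallowed by **kwargs, so only the keyword form worked.
tokenizer.decode_batch(ids, skip_special_tokens=True)   # OK
# tokenizer.decode_batch(ids, True)                     # raised the error from the screenshot

# After the fix: skip_special_tokens is a real named parameter, so both forms work.
tokenizer.decode_batch(ids, True)
tokenizer.decode_batch(ids, skip_special_tokens=True)
```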
mfuntowicz (Member) commented:

@smr97 It should be fixed on master now :)

smr97 (Author) commented Jan 16, 2020

Sounds good! Thanks for the prompt responses on this. I find this work very impressive and will spread the word.

smr97 closed this as completed Jan 16, 2020