Why not use rayon? #54

Closed
smr97 opened this issue Jan 12, 2020 · 6 comments
Comments

smr97 commented Jan 12, 2020

Hey, to start off, congratulations on this successful release of a tokenizer written in Rust. It is a great idea, and as a fellow Rust user I'm happy to see it in use in NLP. I was wondering why the functional code you have for tokenization does not, at its core, use Rayon? Word tokenization seems embarrassingly parallel over the number of words, so it should be a free speedup on typical multi-core machines. Furthermore, it should be a relatively small code change rather than a complete rewrite, I think.

Let me know what your thoughts are!

n1t0 (Member) commented Jan 12, 2020

We are actually using rayon to encode multiple sequences in parallel, see https://github.com/huggingface/tokenizers/blob/master/tokenizers/src/tokenizer/mod.rs#L357.

Otherwise, we don't make any assumptions about the type of content we are going to process. Depending on your choice of PreTokenizer and Model, there isn't necessarily a concept of word, so parallelization at that level isn't really possible, at least not trivially.
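
From the Python bindings, the practical way to benefit from that parallelism is to hand the library a whole batch in a single call. A minimal sketch, assuming the bindings expose `encode_batch` and a `BertWordPieceTokenizer` class (exact names, constructor arguments, and the vocab path are assumptions, not a quote of the real API):

```python
from tokenizers import BertWordPieceTokenizer  # class name assumed from the Python bindings

# The vocab path is an assumption; any BERT WordPiece vocab file works.
tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt")

texts = ["Hello world!", "Tokenizers are fast.", "Rayon parallelizes the batch."]

# One sequence at a time: each call encodes on a single core.
single = [tokenizer.encode(t) for t in texts]

# Whole batch in one call: the Rust side iterates with rayon's par_iter,
# so the sequences are encoded across multiple cores.
batch = tokenizer.encode_batch(texts)
```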

smr97 (Author) commented Jan 13, 2020

Thanks for the response! I am new to NLP, so I did not know that the concept of a word is not generic across models. I had also missed that call to par_iter(), so thanks for pointing it out.

On an unrelated note, could you please point me to a time/throughput comparison with the Python WordPiece tokenizer used for the BERT model? It would help me make the case for switching to this tokenizer at my workplace.

n1t0 (Member) commented Jan 13, 2020

Hey @smr97, you can have a look at this file: https://github.com/huggingface/tokenizers/blob/master/bindings/python/examples/example.py
It provides an example comparing the Python tokenizer available in transformers with the one provided by this library. You can process a large enough file, like one of the wiki-text-raw dumps, to see the difference.
If you need to download a vocabulary for BERT, all the links are available here: https://github.com/huggingface/transformers/blob/master/src/transformers/tokenization_bert.py#L32
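
A stripped-down sketch along the lines of that example is below; the class names and file paths (`BertTokenizer` from transformers, `BertWordPieceTokenizer` from this library, `bert-base-uncased-vocab.txt`, `wiki.train.raw`) are assumptions standing in for what the actual example.py uses:

```python
import time

from tokenizers import BertWordPieceTokenizer   # Rust-backed tokenizer from this library
from transformers import BertTokenizer          # pure-Python reference implementation

VOCAB = "bert-base-uncased-vocab.txt"            # assumed local path to a downloaded vocab
CORPUS = "wiki.train.raw"                        # assumed path to a wiki-text-raw file

with open(CORPUS, encoding="utf-8") as f:
    lines = [line for line in f if line.strip()]

python_tok = BertTokenizer(VOCAB)
rust_tok = BertWordPieceTokenizer(VOCAB)

# Time the pure-Python tokenizer, line by line.
start = time.time()
for line in lines:
    python_tok.tokenize(line)
print(f"transformers (Python): {time.time() - start:.2f}s")

# Time the Rust tokenizer on the whole batch (parallelized internally).
start = time.time()
rust_tok.encode_batch(lines)
print(f"tokenizers (Rust):     {time.time() - start:.2f}s")
```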

smr97 (Author) commented Jan 14, 2020

Thanks for the useful example, that is exactly what I wanted. I ran it on the wiki-text-raw-train file (541 MB) and saw a pretty amazing speedup (2017 MacBook)!

Just FYI though, there seems to be a bug in the script: it throws the error shown in the figure below. I checked the Rust source, and the decode_batch function does indeed take two positional arguments, so the error seems uncalled for. I have not worked with Python bindings to Rust, so I am not sure where the bug really is.

[Screenshot 2020-01-14: the error thrown by decode_batch]

mfuntowicz added a commit that referenced this issue Jan 15, 2020
…ly one possible arg.

This is suggested by the current issue #54 (comment).

kwargs cannot be passed as positional arguments, they have to be named; replacing kwargs with the actual skip_special_tokens parameter
allows both (named and positional) syntaxes.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
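
Roughly, the change means the following from the caller's side (a hypothetical before/after sketch, not the actual binding code; class and file names are assumed as in the earlier sketches):

```python
from tokenizers import BertWordPieceTokenizer  # names assumed, as above

tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt")
ids = [e.ids for e in tokenizer.encode_batch(["Hello world!", "Rust is fast."])]

# Before the fix: the flag was swallowed by **kwargs, so only the keyword form worked.
tokenizer.decode_batch(ids, skip_special_tokens=True)   # OK
# tokenizer.decode_batch(ids, True)                     # raised the error from the screenshot

# After the fix: skip_special_tokens is a real named parameter, so both forms work.
tokenizer.decode_batch(ids, True)
tokenizer.decode_batch(ids, skip_special_tokens=True)
```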
mfuntowicz (Member) commented:

@smr97 It should be fixed on master now :)

smr97 (Author) commented Jan 16, 2020

Sounds good! Thanks for the prompt responses on this. I find this work very impressive and will spread the word.

smr97 closed this as completed Jan 16, 2020