Why not use rayon? #54

Hey, to start off, congratulations on the successful release of a tokenizer written in Rust. It is a great idea, and as a fellow Rust user I'm happy to see it in use in NLP. I was wondering why the functional code you have for tokenization does not, at its core, use Rayon. Word tokenization seems embarrassingly parallel over the number of words, so it should be a free speedup on typical multi-core machines. Furthermore, it should be a relatively small code change rather than a complete rewrite, I think.
Let me know what your thoughts are!

Comments
We are actually using rayon. Otherwise, we don't make any assumptions about the type of content that we are going to process. According to your choice of […]
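For context, the rayon pattern being referred to is roughly the following. This is a minimal sketch, not the library's actual code: `tokenize_word` is a hypothetical stand-in for the real per-word model (WordPiece, BPE, etc.), and the input is assumed to be already split on whitespace.

```rust
// Minimal sketch of per-word parallelism with rayon (assumes `rayon = "1"`
// in Cargo.toml). `tokenize_word` is a hypothetical stand-in for the real
// per-word model, used only to make the example self-contained.
use rayon::prelude::*;

fn tokenize_word(word: &str) -> String {
    // Toy "tokenization": keep alphanumeric characters and lowercase them.
    word.chars()
        .filter(|c| c.is_alphanumeric())
        .collect::<String>()
        .to_lowercase()
}

fn main() {
    let words: Vec<&str> = "Hello, world! Tokenization is embarrassingly parallel."
        .split_whitespace()
        .collect();

    // par_iter() spreads the independent per-word calls across rayon's
    // thread pool; collect() reassembles the results in the original order.
    let tokens: Vec<String> = words.par_iter().map(|w| tokenize_word(w)).collect();

    println!("{:?}", tokens);
}
```

Because each word is processed independently, no synchronization is needed, which is what makes this embarrassingly parallel.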
Thanks for the response! I am new to NLP, so I did not know that this concept is not generic. I had also missed this call to par_iter(), so thanks for pointing that out. On an unrelated note, could you please point me to a time/throughput comparison with the WordPiece tokenizer in Python that is used for the BERT model? It would help me push the case for switching to this tokenizer at my workplace.
Hey @smr97, you can have a look at this file: https://github.com/huggingface/tokenizers/blob/master/bindings/python/examples/example.py
Thanks for the useful example; that is exactly what I wanted. I ran this with the wiki-text-raw-train file (541 MB) and saw a pretty amazing speedup (MacBook 2017)! Just FYI though, there seems to be a bug in the script: it throws the error shown in the attached screenshot. I checked the Rust source, and it seems that the […]
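For anyone wanting to reproduce this kind of measurement against the Rust core itself, below is a minimal sketch of a sequential-vs-parallel wall-clock comparison. It reuses the hypothetical `tokenize_word` from the earlier sketch and is not the benchmark from example.py.

```rust
use rayon::prelude::*;
use std::time::Instant;

// Same hypothetical per-word tokenizer as in the earlier sketch.
fn tokenize_word(word: &str) -> String {
    word.chars()
        .filter(|c| c.is_alphanumeric())
        .collect::<String>()
        .to_lowercase()
}

fn main() {
    // Synthetic input: a couple of million words so the timing is measurable.
    let text = "Hello, tokenizers! ".repeat(1_000_000);
    let words: Vec<&str> = text.split_whitespace().collect();

    let t0 = Instant::now();
    let sequential: Vec<String> = words.iter().map(|w| tokenize_word(w)).collect();
    println!("sequential: {:?}", t0.elapsed());

    let t1 = Instant::now();
    let parallel: Vec<String> = words.par_iter().map(|w| tokenize_word(w)).collect();
    println!("parallel:   {:?}", t1.elapsed());

    // collect() preserves input order in both cases, so outputs must match.
    assert_eq!(sequential, parallel);
}
```

Since the per-word work is independent, the parallel run can approach a speedup proportional to the number of physical cores.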
…ly one possible arg. This is suggested by the current issue #54 (comment). kwargs cannot be passed as a positional argument; it has to be named. Replacing kwargs with the actual skip_special_tokens allows both (named and positional) syntaxes. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
@smr97 It should be fixed on master now :)
Sounds good! Thanks for the prompt responses on this. I find this work very impressive and will spread the word.