WordLevelTrainer not deterministic #717
Comments
Yes you're totally right! Do you want to take a stab at it?
Hi @n1t0, I am on 0.10.3 and I can still replicate the non-determinism issue with the following code:

```python
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from tokenizers import Tokenizer
from tokenizers.pre_tokenizers import Sequence, Digits, Whitespace
from tokenizers.normalizers import BertNormalizer

unk_token = "[UNK]"
tok = Tokenizer(WordLevel(unk_token=unk_token))
trainer = WordLevelTrainer(vocab_size=3, special_tokens=[unk_token])
tok.pre_tokenizer = Sequence([Digits(), Whitespace()])
tok.normalizer = BertNormalizer()

example = "Hello, my name is Bert and I am 20 years old"
tok.train_from_iterator([example], trainer=trainer)
tok.get_vocab()
# run 1: {'[UNK]': 0, 'old': 1, 'is': 2}
# run 2: {'[UNK]': 0, 'is': 1, 'mimmo': 2}
```

Is the fix only available on master?
Hi everyone and thanks for the great work.

I'm using the library to create a word-level tokenizer. Unfortunately, given the exact same data, the output produced by the `WordLevelTrainer` class is not deterministic. I suspect this happens when the counts of two words are exactly the same: their order in the resulting JSON (and thus the corresponding token ids) is sometimes inverted.

I don't know much Rust, but I believe it would be an easy fix at line 43 of `src/models/wordlevel/trainer.rs`. If I understand correctly, the hashmap (dictionary) is sorted there by values, with the keys ignored (`_`). It would be sufficient to also sort by keys when the counts are equal. Reproducibility of the tokenizer given the same data can be very useful in some scenarios.

Let me know if I should look a bit into Rust and create a pull request for this, or if I understood the code wrong and the source of the non-determinism is somewhere else.
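For illustration, the tie-breaking idea can be sketched in Python (a hypothetical stand-in for the Rust sort in `trainer.rs`, not the library's actual implementation; `build_vocab` is an invented helper): sort word counts by descending count, then alphabetically by word, so equal counts always yield the same order.

```python
# Deterministic vocabulary ordering: sort by count (descending),
# then by word (ascending) to break ties. Illustrative sketch only,
# not the actual tokenizers implementation.
from collections import Counter

def build_vocab(words, vocab_size, special_tokens):
    counts = Counter(words)
    # Sorting by (-count, word) makes the tie order reproducible.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    vocab = {tok: i for i, tok in enumerate(special_tokens)}
    for word, _count in ordered:
        if len(vocab) >= vocab_size:
            break
        vocab[word] = len(vocab)
    return vocab

words = "hello my name is bert and i am bert".split()
print(build_vocab(words, vocab_size=3, special_tokens=["[UNK]"]))
# → {'[UNK]': 0, 'bert': 1, 'am': 2}
```

With this ordering, repeated runs over the same data always assign the same ids, even when several words share a count.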
Thanks a lot for the help!
Luca
Using `tokenizers==0.10.3`