
WordLevelTrainer not deterministic #717

Closed
lucacampanella opened this issue May 26, 2021 · 3 comments · Fixed by #718

Comments

@lucacampanella
Contributor

Hi everyone and thanks for the great work.
I'm using the library to create a word-level tokenizer. Unfortunately, given the exact same data, the output produced by the WordLevelTrainer class is not deterministic. I suspect this happens when the counts of two words are exactly the same: their order in the resulting JSON (and thus the corresponding token ids) is sometimes inverted.

I don't know much Rust, but I believe it would be an easy fix at line 43 of src/models/wordlevel/trainer.rs. If I understand correctly, the hashmap (dictionary) is sorted there by its values, with the keys ignored (_). It would be sufficient to also order by the keys when the counts are equal.
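A minimal, self-contained sketch of that idea (this is not the actual code in src/models/wordlevel/trainer.rs; the toy word counts and the standalone main are made up for illustration): sort by count descending and fall back to comparing the words themselves when counts are equal, so the ordering, and therefore the assigned ids, comes out the same on every run.

use std::collections::HashMap;

fn main() {
    // Toy word counts; "bert" and "old" have the same count.
    let mut word_counts: HashMap<String, u64> = HashMap::new();
    word_counts.insert("is".to_string(), 3);
    word_counts.insert("bert".to_string(), 1);
    word_counts.insert("old".to_string(), 1);

    let mut counts: Vec<(String, u64)> = word_counts.into_iter().collect();

    // Sort by count descending; on equal counts, compare the words so the
    // order (and thus the ids assigned below) is deterministic.
    counts.sort_by(|(word_a, count_a), (word_b, count_b)| {
        count_b.cmp(count_a).then_with(|| word_a.cmp(word_b))
    });

    for (id, (word, count)) in counts.iter().enumerate() {
        println!("{word} -> id {id} (count {count})");
    }
}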

Reproducibility of the tokenizer given the same data can be very useful in some scenarios.
Let me know if I should look a bit into Rust and create a pull request for this, or if I understood the code wrong and the source of the non-determinism is somewhere else.

Thanks a lot for the help!
Luca

Using tokenizers==0.10.3

@n1t0
Member

n1t0 commented May 26, 2021

Yes, you're totally right! Do you want to take a stab at it?

@lucacampanella
Contributor Author

Hi @n1t0 and thanks a lot for the fast reply.
I took a stab at it and created a pull request #718

There may be a more concise way of doing it; let me know what you think. The code was only tested in the Rust playground, not against the codebase.
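One possibly more concise variant of the same tie-break (again only a sketch, not the code from the pull request; the toy counts are made up) is to sort by a composite key, with the count reversed so larger counts come first:

use std::cmp::Reverse;

fn main() {
    let mut counts = vec![
        ("is".to_string(), 3u64),
        ("old".to_string(), 1),
        ("bert".to_string(), 1),
    ];
    // Descending count first, then the word itself on ties.
    counts.sort_unstable_by_key(|(word, count)| (Reverse(*count), word.clone()));
    assert_eq!(counts[1].0, "bert"); // ties resolve alphabetically on every run
    println!("{:?}", counts);
}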

n1t0 closed this as completed in #718 on Aug 13, 2021
@pietrolesci

pietrolesci commented Oct 4, 2021

Hi @n1t0,

I am on 0.10.3 and I can still replicate the non-determinism issue with the following code:

from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from tokenizers import Tokenizer
from tokenizers.pre_tokenizers import Sequence, Digits, Whitespace
from tokenizers.normalizers import BertNormalizer

unk_token = "[UNK]"
tok = Tokenizer(WordLevel(unk_token=unk_token))
trainer = WordLevelTrainer(vocab_size=3, special_tokens=[unk_token])
tok.pre_tokenizer = Sequence([Digits(), Whitespace()])
tok.normalizer = BertNormalizer()

example = "Hello, my name is Bert and I am 20 years old"
tok.train_from_iterator([example], trainer=trainer)
tok.get_vocab()

# run 1: {'[UNK]': 0, 'old': 1, 'is': 2}
# run 2: {'[UNK]': 0, 'is': 1, 'mimmo': 2}

Is the fix only available on master?
