
WordLevelTrainer not deterministic #717

Closed
lucacampanella opened this issue May 26, 2021 · 3 comments · Fixed by #718

Comments

@lucacampanella
Contributor

Hi everyone and thanks for the great work.
I'm using the library to create a word-level tokenizer. Unfortunately, given the exact same data, the output produced by the WordLevelTrainer class is not deterministic. I suspect this happens when the counts of two words are exactly the same: their order in the resulting JSON (and thus the corresponding token ids) is sometimes inverted.

I don't know much Rust, but I believe it would be an easy fix at line 43 of src/models/wordlevel/trainer.rs. If I understand correctly, the hashmap (dictionary) is sorted there by its values, with the keys ignored (_). It would be sufficient to also order by the keys when the counts are equal.
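A minimal, self-contained sketch of that idea (this is not the actual code in src/models/wordlevel/trainer.rs; the toy word counts and the standalone main are made up for illustration): sort by count descending and fall back to comparing the words themselves when counts are equal, so the ordering, and therefore the assigned ids, comes out the same on every run.

use std::collections::HashMap;

fn main() {
    // Toy word counts; "bert" and "old" have the same count.
    let mut word_counts: HashMap<String, u64> = HashMap::new();
    word_counts.insert("is".to_string(), 3);
    word_counts.insert("bert".to_string(), 1);
    word_counts.insert("old".to_string(), 1);

    let mut counts: Vec<(String, u64)> = word_counts.into_iter().collect();

    // Sort by count descending; on equal counts, compare the words so the
    // order (and thus the ids assigned below) is deterministic.
    counts.sort_by(|(word_a, count_a), (word_b, count_b)| {
        count_b.cmp(count_a).then_with(|| word_a.cmp(word_b))
    });

    for (id, (word, count)) in counts.iter().enumerate() {
        println!("{word} -> id {id} (count {count})");
    }
}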

Reproducibility of the tokenizer given the same data can be very useful in some scenarios.
Let me know if I should look a bit into Rust and create a pull request for this, or if I understood the code wrong and the source of the non-determinism is somewhere else.

Thanks a lot for the help!
Luca

Using tokenizers==0.10.3

@n1t0
Member

n1t0 commented May 26, 2021

Yes, you're totally right! Do you want to take a stab at it?

@lucacampanella
Contributor Author

Hi @n1t0 and thanks a lot for the fast reply.
I took a stab at it and created a pull request #718

There may be a more concise way of doing it; let me know what you think. The code was only tested in the Rust playground, not against the codebase.
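One possibly more concise variant of the same tie-break (again only a sketch, not the code from the pull request; the toy counts are made up) is to sort by a composite key, with the count reversed so larger counts come first:

use std::cmp::Reverse;

fn main() {
    let mut counts = vec![
        ("is".to_string(), 3u64),
        ("old".to_string(), 1),
        ("bert".to_string(), 1),
    ];
    // Descending count first, then the word itself on ties.
    counts.sort_unstable_by_key(|(word, count)| (Reverse(*count), word.clone()));
    assert_eq!(counts[1].0, "bert"); // ties resolve alphabetically on every run
    println!("{:?}", counts);
}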

n1t0 closed this as completed in #718 on Aug 13, 2021
@pietrolesci

pietrolesci commented Oct 4, 2021

Hi @n1t0,

I am on 0.10.3 and I can still replicate the non-determinism issue with the following code:

from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from tokenizers import Tokenizer
from tokenizers.pre_tokenizers import Sequence, Digits, Whitespace
from tokenizers.normalizers import BertNormalizer

unk_token = "[UNK]"
tok = Tokenizer(WordLevel(unk_token=unk_token))
trainer = WordLevelTrainer(vocab_size=3, special_tokens=[unk_token])
tok.pre_tokenizer = Sequence([Digits(), Whitespace()])
tok.normalizer = BertNormalizer()

example = "Hello, my name is Bert and I am 20 years old"
tok.train_from_iterator([example], trainer=trainer)
tok.get_vocab()

# run 1: {'[UNK]': 0, 'old': 1, 'is': 2}
# run 2: {'[UNK]': 0, 'is': 1, 'mimmo': 2}

Is the fix only available on master?
