AddedVocabulary does not play well with the Model #523

Closed
n1t0 opened this issue Nov 13, 2020 · 5 comments
Assignees
Labels
enhancement New feature or request Stale

Comments

@n1t0
Member

n1t0 commented Nov 13, 2020

Current state

The AddedVocabulary adds new tokens on top of the Model, making the following assumption: "The Model will never change".
So, this makes a few things impossible:

  • Re-training/changing the Model after tokens were added
  • Adding tokens manually at the very beginning of the vocab, for example before training
  • We can't extract these added tokens during the pre-processing step of training, which would be desirable to fix issues like "bug when using special token with uppercase" (#438)

Goal

Make the AddedVocabulary more versatile and robust toward the change/update of the Model, and also capable of having tokens at the beginning and at the end of the vocabulary.
This would probably require that the AddedVocabulary is always the one doing the tokens<->ids conversion, providing the vocab, etc., effectively letting us keep the Model unaware of its existence.
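To make this concrete, here is a rough pseudo-Python sketch of what it could look like if the AddedVocabulary owned the whole tokens<->ids mapping (simplified names, not the actual Rust API; begin_tokens/end_tokens are placeholders for however the added tokens end up being stored):

class AddedVocabulary:
    def __init__(self):
        self.begin_tokens = {}  # tokens placed before the model's vocab
        self.end_tokens = {}    # tokens placed after the model's vocab

    def token_to_id(self, token, model):
        # The AddedVocabulary owns the full mapping and recomputes offsets
        # from the model's current size, so retraining the model is fine.
        if token in self.begin_tokens:
            return self.begin_tokens[token]
        model_id = model.token_to_id(token)
        if model_id is not None:
            return len(self.begin_tokens) + model_id
        if token in self.end_tokens:
            return len(self.begin_tokens) + model.get_vocab_size() + self.end_tokens[token]
        return None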

@n1t0 n1t0 added the enhancement New feature or request label Nov 13, 2020
@patrickvonplaten patrickvonplaten self-assigned this Dec 12, 2020
@patrickvonplaten
Copy link
Contributor

patrickvonplaten commented Jan 6, 2021

Sorry to have taken forever on this issue! I looked into it and I think I understand the problem now, more or less.

As I understood it, the assumption that "the model never changes" is cemented in this line:

let new_id = (model.get_vocab_size() + self.added_tokens_map.len()) as u32;
=> When adding a token to the AddedVocabulary, its id corresponds to the size of the model's vocab at the time the token is added, but it cannot change anymore if the model's vocab changes afterward.
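A small worked example of how that fixed id goes stale (numbers made up):

# Model vocab has 100 tokens when '[NEW]' is added:
model_vocab_size_at_add_time = 100
new_id = model_vocab_size_at_add_time + 0   # '[NEW]' is stored with id 100, forever

# The model is later retrained and now has 200 tokens:
# ids 0..199 all belong to the model, so id 100 now also names a model token.
# '[NEW]' should have moved to id 200, but its stored id never changes.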

Currently, the class flow when doing the tokens<->ids conversion is implemented as follows (please correct me if I'm wrong): TokenizerImpl is asked to convert token_to_id. The TokenizerImpl then just forwards the call to its self.added_vocabulary, but also passes a read-only reference of self.model to self.added_vocabulary's token_to_id method. Then the AddedVocabulary checks if the token is in the Model's vocab (usually self.vocab) and, if it's not, checks in its own map.
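In pseudo-Python, that current flow as I read it is roughly (simplified, not the real signatures):

# TokenizerImpl
def token_to_id(self, token):
    # just forwards, handing the AddedVocabulary a read-only view of the model
    return self.added_vocabulary.token_to_id(token, self.model)

# AddedVocabulary
def token_to_id(self, token, model):
    # the model's own vocab answers first; the added tokens map is only a fallback
    id = model.token_to_id(token)
    if id is not None:
        return id
    return self.added_tokens_map.get(token)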

I see three different approaches for now:

  1. As proposed, to make the AddedVocabulary more versatile, instead of passing self.model to the AddedVocabulary we could pass self.model.vocab to the AddedVocabulary for the token_to_id and id_to_token methods. This way, we are always aware of the current vocab's size.
    Following this approach, I guess it then makes more sense to save "relative" token ids in AddedVocabulary instead of absolute ones for those tokens that are appended to the end of the vocabulary, no? Also, the add_tokens method of AddedVocabulary would not need a reference to the model anymore. Tokens that are prepended to the beginning of the vocabulary could get absolute ids and tokens that are appended to the end could get relative ids => so token_to_id in AddedVocabulary could look something like:
def token_to_id(self, token, vocab):
    # Token already in the model's vocab: return its id directly
    if token in vocab:
        return vocab[token]
    # Tokens added at the beginning store absolute ids
    elif token in self.begin_added_tokens:
        return self.begin_added_tokens[token]
    # Tokens added at the end store relative ids, offset by the model's vocab size
    elif token in self.end_added_tokens:
        return len(vocab) + self.end_added_tokens[token]
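And the id_to_token counterpart could presumably mirror it (sketch only; begin_added_tokens_r / end_added_tokens_r would be hypothetical reverse maps, and vocab_r the model's id->token map):

def id_to_token(self, id, vocab_r):
    if id in self.begin_added_tokens_r:
        return self.begin_added_tokens_r[id]
    elif id in vocab_r:
        return vocab_r[id]
    # ids past the model's vocab belong to the tokens appended at the end
    elif (id - len(vocab_r)) in self.end_added_tokens_r:
        return self.end_added_tokens_r[id - len(vocab_r)]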

If I understood correctly, does this mean we should delete all the token_to_id and id_to_token methods of the models? Since the AddedVocabulary has the vocab, we wouldn't need them anymore, no?

One small problem I see here is that not all models have the same type of "vocabulary". BPE, Wordpiece and Wordlevel all have self.vocab and self.vocab_r, but Unigram seems to use a self.tokens_to_ids of type TokenMap instead of Vocab -> maybe change that for Unigram?

  2. Or should we give the AddedVocabulary a mutable reference to the vocab so that the vocab can be changed directly? Then we wouldn't need the additional added_tokens_map in AddedVocabulary. Approach 1) sounds better to me, but I'm not 100% sure.

  3. Or, a completely different approach would be to give the TokenizerImpl more responsibility for token_to_id, so that the TokenizerImpl class actually checks whether the token is in the model's vocab and, if not, just passes the size of self.model's vocab to the AddedVocabulary's token_to_id method (see the sketch after this list). This approach actually seems like the most logical to me since both AddedVocabulary and Model are members of TokenizerImpl, so ideally AddedVocabulary should not handle any Model logic but just get the minimal information it needs from model.vocab (which is just the size of the vocab IMO) for token_to_id from TokenizerImpl.
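A rough sketch of approach 3), again in pseudo-Python with simplified names:

# TokenizerImpl
def token_to_id(self, token):
    # the model answers first for its own vocab
    id = self.model.token_to_id(token)
    if id is not None:
        return id
    # otherwise the AddedVocabulary only needs the current vocab size
    return self.added_vocabulary.token_to_id(token, self.model.get_vocab_size())

# AddedVocabulary
def token_to_id(self, token, model_vocab_size):
    if token in self.begin_added_tokens:
        return self.begin_added_tokens[token]
    if token in self.end_added_tokens:
        return model_vocab_size + self.end_added_tokens[token]
    return None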

Both 1) & 3) seem reasonable to me... I would be super happy about your feedback, and about any misunderstanding on my side, @n1t0 :-)

@n1t0
Member Author

n1t0 commented Jan 6, 2021

Awesome! There's no rush so take the time you need!

I'll add a few pieces of information that may help with the decision:

  • Every Model is different, and so we can't rely on their internals. The only thing we have to deal with all of them is the interface described in the trait Model. Also, any Model can be used independently of the TokenizerImpl so it makes sense to keep all these methods.
  • The AddedVocabulary exists because we can't really modify the vocabulary of the Model in many cases. For example with BPE, we can't add a new token because it would also require adding all the necessary merge operations, which we can't do. Same with Unigram. So the AddedVocabulary works by adding vocabulary on top of the Model, and it has the responsibility of choosing which one to use (the snippet after this list illustrates it from the Python side).
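To illustrate with the Python bindings (the exact sizes depend on the tokenizer; the point is just that added tokens live outside the model's own vocab):

from tokenizers import Tokenizer, AddedToken
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE())
tokenizer.add_tokens([AddedToken("[NEW]")])

# The model's own vocab is untouched; the added token only shows up
# when added tokens are taken into account:
print(tokenizer.get_vocab_size(with_added_tokens=False))  # model vocab only
print(tokenizer.get_vocab_size(with_added_tokens=True))   # model vocab + 1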

So I think approach 1) is probably the most likely to work, but we'll have to stick with what's available on trait Model.

One thing that is very important and might be tricky to handle (I haven't dug into this at all yet) is the serialization/deserialization of the TokenizerImpl and AddedVocabulary. We need all the existing tokenizers out in the wild to keep working, while probably having to add some info to track whether a token is at the beginning or the end of the vocabulary. (It will probably be necessary to add some tests there.) I'll be able to help on this side of course!
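For reference, an added token is currently serialized in tokenizer.json as an entry roughly like the one below; one backward-compatible option could be an optional extra field (the "position" key here is purely hypothetical) that old files simply don't have and that defaults to today's behaviour:

added_token_entry = {
    "id": 30000,
    "content": "[NEW]",
    "single_word": False,
    "lstrip": False,
    "rstrip": False,
    "normalized": True,
    "special": False,
    # hypothetical new field: whether the token sits before or after the
    # model's vocab; missing => "end", i.e. the current behaviour
    "position": "end",
}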

@sadra-barikbin

Hi @n1t0 @patrickvonplaten
I just came across the third issue @n1t0 mentioned:

We can't extract these added tokens during the pre-processing step of the training, which would be desirable to fix issues like #438

In my use case I want to replace all the numbers with an added token '[NUM]'. My code is as follows:

from tokenizers import Tokenizer, AddedToken, Regex, normalizers
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer

tokenizer = Tokenizer(Unigram())

NUM_token = AddedToken('[NUM]')
tokenizer.add_tokens([NUM_token])

# Normalization is applied in both training and encoding. For training see:
# https://github.com/huggingface/tokenizers/blob/adf90dcd722cbc20af930441988b317a19815878/tokenizers/src/tokenizer/mod.rs#L1028
tokenizer.normalizer = normalizers.Replace(Regex('[0-9]+'), '[NUM]')

trainer = UnigramTrainer()
# iterator: any iterable of training strings
tokenizer.train_from_iterator(iterator, trainer)

During encoding, extract_and_normalize of added_vocabulary is called, which breaks the input sequence on added tokens so as to prevent tokenizing them, but during training no such thing takes place, and the added tokens present in the corpus end up being used to train the model.

The solution might be to ask added_vocabulary to break input sequences on added tokens and then feed them to the trainer.
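Until that's handled inside the library, a possible user-side workaround is to split the training text around the numbers before feeding it to the trainer, so the Replace normalizer never produces '[NUM]' in the training corpus (a sketch only; split_around_numbers and corpus are just illustrative names):

import re

NUM_PATTERN = re.compile(r'[0-9]+')

def split_around_numbers(iterator):
    # Drop the number spans entirely so that, once the Replace normalizer
    # runs during training, there is no '[NUM]' left for the trainer to learn.
    for text in iterator:
        for piece in NUM_PATTERN.split(text):
            if piece.strip():
                yield piece

tokenizer.train_from_iterator(split_around_numbers(corpus), trainer)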

@Narsil
Collaborator

Narsil commented Jul 15, 2022

The solution might be to ask added_vocabulary to break input sequences on added tokens then feed them to the trainer.

I think you're spot on.
PRs are welcome on this, but as you might understand this is a very delicate change to operate, as backward compatibility is critical now for transformers to work properly. So this PR will take time, no matter who takes it up.
(Also, we're scarce on time to dedicate to tokenizers too, unfortunately; we still have to release 13.0, which is going to happen as soon as I have time to dedicate to it.)


github-actions bot commented May 6, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label May 6, 2024
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale May 11, 2024