AddedVocabulary does not play well with the Model #523
Sorry to have taken forever on this issue! I looked into it and I think I understand the problem now more or less. As I understand it, the assumption that "the model never changes" is cemented in this line:
In AddedVocabulary, the id corresponds to the size of the model's vocab at the time the token is added, but cannot change anymore if the model's vocab changes afterward.
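To make the consequence concrete, here is a minimal sketch (my own illustration, not from the thread, using the Python bindings with a toy BPE model and made-up training data): a token added while the model is still empty gets id 0, and once the model is trained its own tokens also start at id 0, so the two id spaces overlap.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE())
# The model vocab is empty here, so the added token gets id 0.
tokenizer.add_tokens(["<special>"])

# Now train the model: it assigns its own ids starting from 0.
trainer = BpeTrainer(vocab_size=50)
tokenizer.train_from_iterator(["some text", "more text", "even more text"], trainer)

# "<special>" kept id 0, but the trained model also uses id 0 for one of its
# own tokens, so the two now overlap in the id space.
model_vocab = tokenizer.get_vocab(with_added_tokens=False)
print(tokenizer.token_to_id("<special>"))   # 0: frozen at add time
print(min(model_vocab.values()))            # 0: the model's ids also start at 0
```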
Currently, the class flow when doing the tokens<->ids conversion is implemented as follows (please correct me if I'm wrong):

```python
def token_to_id(token, vocab):
    if token in vocab:
        return vocab[token]
    elif token in self.begin_added_tokens:
        return self.begin_added_tokens[token]
    elif token in self.end_added_tokens:
        return len(vocab) + self.end_added_tokens[token]
```

I see three different approaches for now. As I understood it, does it mean we should delete all …? One small problem I see here is that not all models have the same type of "vocabulary": BPE, Wordpiece and Wordlevel all have …
Both 1) & 3) seem reasonable to me... I would be super happy to hear about any misunderstanding I might have, and about your feedback @n1t0 :-)
Awesome! There's no rush so take the time you need! I'll add a few pieces of information that may help with the decision:
So I think approach 1) is probably the most likely to work, but we'll have to stick with what's available on the Model. One thing that is very important and might be tricky to handle (I didn't dig into this at all for now) is the serialization/deserialization of the AddedVocabulary.
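For context on why serialization matters here, the following is a small illustration I put together (not from the thread; the toy vocab and file name are made up): added tokens are written into the tokenizer JSON together with their fixed ids, so any change to how those ids are assigned also has to remain compatible with files saved under the old scheme.

```python
import json
from tokenizers import Tokenizer
from tokenizers.models import WordLevel

tokenizer = Tokenizer(WordLevel({"[UNK]": 0, "hello": 1, "world": 2}, unk_token="[UNK]"))
tokenizer.add_tokens(["<special>"])

# The added token is serialized with a fixed id next to the model's own vocab.
tokenizer.save("tokenizer.json")
with open("tokenizer.json") as f:
    print(json.dumps(json.load(f)["added_tokens"], indent=2))
```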
Hi @n1t0 @patrickvonplaten
In my use case I want to replace all the numbers with an added token '[NUM]'. My code is as follows:

```python
from tokenizers import Tokenizer, AddedToken, Regex, normalizers
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer

tokenizer = Tokenizer(Unigram())
NUM_token = AddedToken('[NUM]')
tokenizer.add_tokens([NUM_token])
# Normalization is applied in both training and encoding. For training see:
# https://github.com/huggingface/tokenizers/blob/adf90dcd722cbc20af930441988b317a19815878/tokenizers/src/tokenizer/mod.rs#L1028
tokenizer.normalizer = normalizers.Replace(Regex('[0-9]+'), '[NUM]')
trainer = UnigramTrainer()
tokenizer.train_from_iterator(iterator, trainer)  # `iterator` yields the training texts
```

During encoding, extract_and_normalize of … The solution might be to ask …
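To make the symptom easier to inspect, a check like the following (my addition, continuing from the snippet above, with a made-up input sentence) shows whether '[NUM]' actually ends up as a single token in the encoding:

```python
output = tokenizer.encode("there are 42 apples")
print(output.tokens)
# If the added token were picked up after normalization we would expect a
# single '[NUM]' entry for "42"; whether that happens depends on when the
# added-token extraction runs relative to the Replace normalizer.
```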
I think you're spot on.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Current state
The AddedVocabulary adds new tokens on top of the Model, making the following assumption: "The Model will never change". So, this makes a few things impossible:
- Changing or updating the Model after tokens were added

Goal
Make the AddedVocabulary more versatile and robust toward the change/update of the Model, and also capable of having tokens at the beginning and at the end of the vocabulary.
This would probably require that the AddedVocabulary is always the one doing the conversion tokens<->ids, providing the vocab, etc., effectively letting us keep the Model unaware of its existence.
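As a thought experiment (my own sketch, not the actual tokenizers implementation), the design described above could look roughly like this in Python: an AddedVocabulary-like object that owns the tokens<->ids conversion, keeps separate lists for tokens placed before and after the model's vocab, and asks the model for its current size on every call, so added-token ids never collide with the model's ids even when the model is retrained or grows. The `model` argument is assumed to be any object exposing token_to_id, id_to_token and get_vocab_size.

```python
class AddedVocabularySketch:
    """Sketch of an added vocabulary that never caches the model's vocab size."""

    def __init__(self, begin_added=None, end_added=None):
        # Tokens placed before the model's ids, and tokens placed after them.
        self.begin_added = list(begin_added or [])
        self.end_added = list(end_added or [])

    def token_to_id(self, token, model):
        if token in self.begin_added:
            return self.begin_added.index(token)
        offset = len(self.begin_added)
        model_id = model.token_to_id(token)
        if model_id is not None:
            return offset + model_id
        if token in self.end_added:
            return offset + model.get_vocab_size() + self.end_added.index(token)
        return None

    def id_to_token(self, id, model):
        if id < len(self.begin_added):
            return self.begin_added[id]
        id -= len(self.begin_added)
        if id < model.get_vocab_size():
            return model.id_to_token(id)
        id -= model.get_vocab_size()
        return self.end_added[id] if id < len(self.end_added) else None
```

With this layout, retraining the model simply shifts the ids of the end tokens instead of colliding with them, and the model itself never needs to know that added tokens exist.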