When adding new tokens to an existing tokenizer, the tokenizer's `vocab_size` attribute doesn't change. I believe it should be updated every time the token set changes.
I think you should use `print(len(tokenizer))` instead of `print(tokenizer.vocab_size)`, as `vocab_size` is a fixed attribute that refers to the base vocabulary only, without any additional tokens.
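For illustration, here is a minimal sketch of the difference; the resize step is a common downstream reason to prefer `len(tokenizer)` (the causal-LM model class here is just an example, not something from this thread):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.add_tokens(["new_token"])

print(tokenizer.vocab_size)  # 50257: base vocabulary only, never updated
print(len(tokenizer))        # 50258: base vocabulary plus added tokens

# The embedding matrix must cover every token id the tokenizer can emit,
# so it is resized to len(tokenizer), not vocab_size.
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))
```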
Env:
transformers version: 4.8.2
Here is a Google Colab notebook to reproduce: https://colab.research.google.com/drive/1mC_eSmHOgA_F5fPX7AsUt86jAbC7iSSw?usp=sharing
Specifics:
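The reproduction code isn't inlined here; a minimal sketch consistent with the reported output (assuming the `gpt2` checkpoint and a single added token, as in the Colab link above) would be:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
before = tokenizer.vocab_size        # 50257: size of the base GPT-2 vocabulary

tokenizer.add_tokens(["new_token"])  # register one new token

after = tokenizer.vocab_size         # still 50257: the attribute is not updated
total = len(tokenizer)               # 50258: base vocabulary plus added tokens

print((before, after, total))
```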
Outputs: (50257, 50257, 50258)
The same behavior occurs when I do the following as well:
```python
tokenizer = AutoTokenizer.from_pretrained("gpt2", additional_special_tokens=["new_token"])
```
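Presumably the same two checks apply on that path as well; a quick sketch (again assuming `gpt2`):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "gpt2", additional_special_tokens=["new_token"]
)

print(tokenizer.vocab_size)  # 50257: base vocabulary only
print(len(tokenizer))        # 50258: includes the added special token
```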