
Vocab Size does not change when adding new tokens #12632

Closed

ncoop57 opened this issue Jul 11, 2021 · 2 comments

Comments

@ncoop57 (Contributor) commented Jul 11, 2021

Environment:

  • transformers version: 4.8.2
  • Platform: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.10
  • PyTorch version (GPU?): 1.9.0+cu102

When adding new tokens to an existing tokenizer, the tokenizer's vocab_size attribute doesn't change. I believe it should be updated every time the tokens change.

Here is a google colab to reproduce: https://colab.research.google.com/drive/1mC_eSmHOgA_F5fPX7AsUt86jAbC7iSSw?usp=sharing

Specifics:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
current_size = tokenizer.vocab_size
tokenizer.add_tokens(["new_token"])
tokenizer.vocab_size, current_size, len(tokenizer.vocab)

Outputs: (50257, 50257, 50258)

The same happens when I load the tokenizer with additional special tokens:
tokenizer = AutoTokenizer.from_pretrained("gpt2", additional_special_tokens=["new_token"])

@NielsRogge (Contributor) commented Jul 12, 2021

I think you should use print(len(tokenizer)) instead of print(tokenizer.vocab_size), as vocab_size is a fixed attribute that refers only to the base vocabulary, without any additional tokens. Refer to this and this.
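The distinction can be modeled with a minimal stand-in class. This is only an illustration of the documented behavior (vocab_size reports the fixed base vocabulary; len() also counts tokens added later), not the real transformers internals; the class and its names are made up for this sketch:

```python
class ToyTokenizer:
    """Simplified model of the vocab_size vs. len() distinction."""

    def __init__(self, base_vocab):
        self._base_vocab = dict(base_vocab)  # fixed base vocabulary
        self._added_tokens = {}              # tokens added after loading

    @property
    def vocab_size(self):
        # Reports ONLY the base vocabulary, never the added tokens.
        return len(self._base_vocab)

    def add_tokens(self, tokens):
        # Assign new ids after the base vocabulary, skipping duplicates.
        for tok in tokens:
            if tok not in self._base_vocab and tok not in self._added_tokens:
                self._added_tokens[tok] = len(self._base_vocab) + len(self._added_tokens)

    def __len__(self):
        # The full effective vocabulary: base + added tokens.
        return len(self._base_vocab) + len(self._added_tokens)


tok = ToyTokenizer({"hello": 0, "world": 1})
tok.add_tokens(["new_token"])
print(tok.vocab_size, len(tok))  # → 2 3
```

This mirrors why the reproduction above prints (50257, 50257, 50258): vocab_size stays at the base size while the full vocabulary has grown by one. In practice this is also why code that resizes model embeddings after adding tokens uses len(tokenizer), e.g. model.resize_token_embeddings(len(tokenizer)).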

@ncoop57 (Contributor, Author) commented Jul 13, 2021

Ah okay, I didn't realize this was expected behavior. Thanks!

@ncoop57 ncoop57 closed this as completed Jul 13, 2021