
Vocab Size does not change when adding new tokens #12632

Closed

ncoop57 opened this issue Jul 11, 2021 · 2 comments

Comments

@ncoop57 (Contributor) commented Jul 11, 2021

Environment:

  • transformers version: 4.8.2
  • Platform: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.10
  • PyTorch version (GPU?): 1.9.0+cu102

When adding new tokens to an existing tokenizer, the tokenizer's vocab_size attribute doesn't change. I believe it should be updated every time the tokens change.

Here is a google colab to reproduce: https://colab.research.google.com/drive/1mC_eSmHOgA_F5fPX7AsUt86jAbC7iSSw?usp=sharing

Specifics:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
current_size = tokenizer.vocab_size
tokenizer.add_tokens(["new_token"])
tokenizer.vocab_size, current_size, len(tokenizer.vocab)

Outputs: (50257, 50257, 50258)

The same happens when I load the tokenizer with additional special tokens:
tokenizer = AutoTokenizer.from_pretrained("gpt2", additional_special_tokens=["new_token"])

@NielsRogge (Contributor) commented Jul 12, 2021

I think you should use print(len(tokenizer)) instead of print(tokenizer.vocab_size), as vocab_size is a fixed attribute that refers only to the base vocabulary, without any additional tokens. Refer to this and this.
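The distinction can be modeled with a minimal stand-in class. This is only an illustration of the documented behavior (vocab_size reports the fixed base vocabulary; len() also counts tokens added later), not the real transformers internals; the class and its names are made up for this sketch:

```python
class ToyTokenizer:
    """Simplified model of the vocab_size vs. len() distinction."""

    def __init__(self, base_vocab):
        self._base_vocab = dict(base_vocab)  # fixed base vocabulary
        self._added_tokens = {}              # tokens added after loading

    @property
    def vocab_size(self):
        # Reports ONLY the base vocabulary, never the added tokens.
        return len(self._base_vocab)

    def add_tokens(self, tokens):
        # Assign new ids after the base vocabulary, skipping duplicates.
        for tok in tokens:
            if tok not in self._base_vocab and tok not in self._added_tokens:
                self._added_tokens[tok] = len(self._base_vocab) + len(self._added_tokens)

    def __len__(self):
        # The full effective vocabulary: base + added tokens.
        return len(self._base_vocab) + len(self._added_tokens)


tok = ToyTokenizer({"hello": 0, "world": 1})
tok.add_tokens(["new_token"])
print(tok.vocab_size, len(tok))  # → 2 3
```

This mirrors why the reproduction above prints (50257, 50257, 50258): vocab_size stays at the base size while the full vocabulary has grown by one. In practice this is also why code that resizes model embeddings after adding tokens uses len(tokenizer), e.g. model.resize_token_embeddings(len(tokenizer)).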

@ncoop57 (Contributor, Author) commented Jul 13, 2021

Ah okay, I didn't realize this was expected behavior. Thanks!

@ncoop57 ncoop57 closed this as completed Jul 13, 2021