
Include unused entries in vocabulary of "from scratch" models #41

Closed
jowagner opened this issue Nov 27, 2020 · 6 comments
Labels
enhancement New feature or request

Comments

@jowagner
Collaborator

As discussed in issue #33, having a few unused entries in the vocabulary is a great idea to make it easier for users of a model to add extra tokens for fine-tuning. We should do this as well when training our final "from scratch" models. Multilingual BERT provides 99 such entries. We should use the same number of entries and use the same ["[unused%d]" %i for i in range(99)] format.
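(A quick way to see exactly what that comprehension evaluates to:)

# Evaluate the comprehension from the comment above.
entries = ["[unused%d]" % i for i in range(99)]
print(len(entries))    # 99
print(entries[0])      # [unused0]
print(entries[-1])     # [unused98]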

jowagner added the enhancement (New feature or request) label on Nov 27, 2020
@jbrry
Owner

jbrry commented Nov 27, 2020

Good point. I checked the vocab.txt files produced by wiki-bert-pipeline and I can confirm that they follow the same approach:

head -n 106 vocab.txt

[unused..]
[unused98]
[unused99]
[UNK]
[CLS]
[SEP]
[MASK]
a
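(For anyone repeating this check, a minimal sketch that counts the unused entries in a vocab file; the "vocab.txt" path is a placeholder:)

import re

# Count the "[unusedN]" entries in a WordPiece vocab file.
# "vocab.txt" is a placeholder path; point it at the pipeline output.
with open("vocab.txt", encoding="utf-8") as f:
    unused = [line.strip() for line in f
              if re.fullmatch(r"\[unused\d+\]", line.strip())]
print(len(unused), unused[0], unused[-1])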

@alanagiasi
Collaborator

> As discussed in issue #33, having a few unused entries in the vocabulary is a great idea to make it easier for users of a model to add extra tokens for fine-tuning. We should do this as well when training our final "from scratch" models. Multilingual BERT provides 99 such entries. We should use the same number of entries and use the same ["[unused%d]" %i for i in range(99)] format.

Just to note a tiny boundary issue there: the stop argument of range() is exclusive, so covering 0-99 requires ["[unused%d]" %i for i in range(100)].
Also note that the first token is typically [PAD], followed by ["[unused%d]" %i for i in range(100)], followed by [UNK], [CLS], [SEP], [MASK], etc.

Sorry if that seems pedantic to point out; I ran into the opposite confusion in R recently, since its c() function is inclusive :^)
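Putting the two notes together, a minimal sketch of that header layout:

# [PAD] first, then the 100 unused slots, then the special tokens,
# matching the layout described above.
header = ["[PAD]"]
header += ["[unused%d]" % i for i in range(100)]  # unused0 .. unused99
header += ["[UNK]", "[CLS]", "[SEP]", "[MASK]"]
print(len(header))  # 105: [PAD] + 100 unused + 4 special tokens
# The first real wordpiece ("a" in the head output above) then sits at line 106.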

@jowagner
Collaborator Author

jowagner commented Nov 28, 2020

@alanagiasi Your code would produce 100 entries, one more than previous work quoted in #33 (comment).

As to the placement of [PAD] and the other special tokens, I agree it is best to follow what the existing tools do, as otherwise it may cause problems with other software, including software that we are not currently using.

As to confusion when moving between programming languages, I like to write code in a way that makes things clear even when the reader doesn't know the language-specific details. E.g. instead of the above, I would write something like

number_of_entries = 99
list_of_entries = []
for entry_index in range(number_of_entries):
    list_of_entries.append('[unused%d]' %entry_index)

where

  • the use of _index implies it starts at 0,
  • number_of_ makes clear how many items are created,
  • the two facts above together make clear that the last entry will be '[unused98]',
  • the use of list_ makes clear that [] is an empty list here, and
  • the initialisation of the list and the use of .append() make clear that we are creating a list and in what order the elements will appear. (To somebody with mathematical training, Python's list comprehensions look like a set definition, i.e. one would expect the order of items to be undefined and duplicate elements to be discarded; a set is, by the way, the right data structure for a vocabulary.)
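To make that last contrast concrete, the two comprehension forms differ exactly in ordering and duplicates:

tokens = ['[CLS]', 'a', 'a', '[SEP]']
print([t for t in tokens])  # list: ['[CLS]', 'a', 'a', '[SEP]'] - order and duplicates kept
print({t for t in tokens})  # set: duplicates dropped, iteration order not guaranteed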

@alanagiasi
Collaborator

@jowagner yes, it produces 100 entries. I double-checked the vocabulary file on Google Drive and it has 100 entries, i.e. unused0 to unused99. Thanks for the comment you linked earlier; I double-checked mBERT (the HuggingFace implementation) and it has 99 entries, i.e. unused1 to unused99, which agrees with the Chau et al. (2020) paper James cited.
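(That check can be reproduced with the HuggingFace transformers library; the checkpoint name below, the standard cased mBERT, is an assumption:)

from transformers import BertTokenizer

# Load the mBERT vocabulary and list its unused slots.
# "bert-base-multilingual-cased" is an assumed checkpoint name.
tok = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
unused = [t for t in tok.vocab if t.startswith("[unused")]
print(len(unused))  # 99 per the check above: [unused1] .. [unused99]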

@jbrry How was the vocab.txt file on Google Drive generated? Do you happen to know if there are options to specify the number of "unused" tokens, etc.?

@jbrry
Owner

jbrry commented Nov 29, 2020

@alanagiasi, yes, the file used to create the vocabulary in wiki-bert-pipeline can be seen here: https://github.com/spyysalo/sent2wordpiece/blob/47ba44e4bb4faa50bc617a7da93987f94a934d3f/sent2wordpiece.py. It populates the unused entries as well as the padding and special tokens.

@jowagner
Collaborator Author

Ok, so the answer is yes, we have those unused entries in our "from scratch" models. Nothing to do. Closing.
