Include unused entries in vocabulary of "from scratch" models #41
Good point. I checked the
Just to note a tiny boundary issue there: the stop argument in range() is exclusive. Sorry if that seems pedantic to point out; I had the opposite issue in R recently since its c() function is inclusive :^)
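To make the boundary behaviour concrete, here is a minimal Python sketch (the inclusive-sequence comparison in the comments is illustrative, not from the thread):

```python
# Python's range(stop) excludes the stop value: range(99) yields 0..98,
# i.e. exactly 99 values.
values = list(range(99))
print(len(values))             # 99
print(values[0], values[-1])   # 0 98

# By contrast, an inclusive sequence (as in R's 0:98) needs an
# endpoint of 98 to produce the same 99 values.
```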
@alanagiasi Your code would produce 100 entries, one more than the previous work quoted in #33 (comment). As to confusions when moving between programming languages, I like to write code in a way that makes things clear even when the reader doesn't know the language-specific details, e.g. instead of the above I would write something like

number_of_entries = 99
list_of_entries = []
for entry_index in range(number_of_entries):
    list_of_entries.append('[unused%d]' % entry_index)
@jowagner Yes, it produces 100 entries. I double-checked the vocabulary file on Google Drive and it has 100 entries. @jbrry How was the vocab.txt file on Google Drive generated? Do you happen to know if there are options to specify the number of `unused` tokens etc.?
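One way to double-check such a file is to count the matching entries directly. A minimal sketch, assuming vocab.txt holds one token per line (the helper name and the toy vocabulary below are hypothetical, not from the thread):

```python
import re

def count_unused(lines):
    """Count vocabulary entries of the form [unused0], [unused1], ..."""
    pattern = re.compile(r'^\[unused\d+\]$')
    return sum(1 for line in lines if pattern.match(line.strip()))

# A toy vocabulary with 100 unused slots, mimicking the file discussed above;
# with a real file, pass open('vocab.txt') instead.
vocab = ['[PAD]'] + ['[unused%d]' % i for i in range(100)] + ['[CLS]', '[SEP]']
print(count_unused(vocab))  # 100
```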
@alanagiasi Yes, the file used to create the vocabulary in
Ok, so the answer is yes, we have those unused entries in our "from scratch" models. Nothing to do. Closing.
As discussed in issue #33, having a few unused entries in the vocabulary is a great idea to make it easier for users of a model to add extra tokens for fine-tuning. We should do this as well when training our final "from scratch" models. Multilingual BERT provides 99 such entries. We should use the same number of entries and the same

["[unused%d]" % i for i in range(99)]

format.
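For reference, the comprehension above yields 99 entries, [unused0] through [unused98]; a quick check:

```python
# Same format as the issue description: 99 placeholder tokens.
entries = ["[unused%d]" % i for i in range(99)]
print(len(entries))             # 99
print(entries[0], entries[-1])  # [unused0] [unused98]
```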