
How can I change vocab size for pretrained model? #237

Closed
hahmyg opened this issue Jan 30, 2019 · 7 comments


@hahmyg

hahmyg commented Jan 30, 2019

Is there way to change (expand) vocab size for pretrained model?

When I feed the new token id to the model, it returns:

/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
1108 with torch.no_grad():
1109 torch.embedding_renorm_(weight, input, max_norm, norm_type)
-> 1110 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
1111
1112

RuntimeError: index out of range at /pytorch/aten/src/TH/generic/THTensorMath.cpp:352
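The error comes from the word-embedding lookup: the pretrained checkpoint's embedding matrix has exactly vocab_size rows, so any token id at or above that size is out of range. A minimal sketch that reproduces the failure, assuming bert-base-uncased's 30522-token vocabulary and 768-dimensional hidden size:

```python
import torch
import torch.nn as nn

# bert-base-uncased ships a 30522-token vocabulary and 768-dim embeddings,
# so its word-embedding table has exactly 30522 rows.
word_embeddings = nn.Embedding(num_embeddings=30522, embedding_dim=768)

word_embeddings(torch.tensor([101, 2023, 102]))  # ids below 30522: fine
word_embeddings(torch.tensor([30522]))           # any id >= 30522 raises the
                                                 # index-out-of-range error above
```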

@rodgzilla
Contributor

Hi,

If you want to modify the vocabulary, you should refer to this section of the original BERT repo's README: https://github.com/google-research/bert#learning-a-new-wordpiece-vocabulary

@tholor
Contributor

tholor commented Jan 30, 2019

If you don't want a completely new vocabulary (which would require training from scratch) but only want to extend the pretrained one with a couple of domain-specific tokens, this comment from Jacob Devlin might help:

[...] if you want to add more vocab you can either:
(a) Just replace the "[unusedX]" tokens with your vocabulary. Since these were not used they are effectively randomly initialized.
(b) Append it to the end of the vocab, and write a script which generates a new checkpoint that is identical to the pre-trained checkpoint, but with a bigger vocab where the new embeddings are randomly initialized (for initialization we used tf.truncated_normal_initializer(stddev=0.02)). This will likely require mucking around with some tf.concat() and tf.assign() calls.

(google-research/bert#9)

I am currently experimenting with approach (a). Since there are 993 unused tokens, this might already be enough to cover the most important tokens in your domain.
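For approach (b), the TensorFlow recipe in the quote translates fairly directly to PyTorch: concatenate freshly initialized rows onto the pretrained embedding matrix. A minimal sketch, assuming the model exposes its word embeddings at model.embeddings.word_embeddings (as the library's BertModel does) and using a plain normal init with stddev 0.02 in place of TensorFlow's truncated normal; a tied MLM output layer would need the same treatment:

```python
import torch
import torch.nn as nn

def append_embeddings(model, num_new_tokens, std=0.02):
    """Grow the word-embedding table by `num_new_tokens` randomly initialized rows."""
    old_weight = model.embeddings.word_embeddings.weight.data  # (vocab_size, hidden)
    hidden = old_weight.size(1)

    # New rows, roughly matching tf.truncated_normal_initializer(stddev=0.02)
    new_rows = torch.empty(num_new_tokens, hidden).normal_(mean=0.0, std=std)

    new_embeddings = nn.Embedding(old_weight.size(0) + num_new_tokens, hidden)
    new_embeddings.weight.data = torch.cat([old_weight, new_rows], dim=0)
    model.embeddings.word_embeddings = new_embeddings
    return model
```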

@thomwolf
Member

thomwolf commented Feb 5, 2019

@tholor's and @rodgzilla's answers are the way to go.
Closing this issue since there is no activity.
Feel free to re-open if needed.

@thomwolf closed this as completed Feb 5, 2019
@chenshaolong

> If you don't want a completely new vocabulary (which would require training from scratch) but only want to extend the pretrained one with a couple of domain-specific tokens, this comment from Jacob Devlin might help:
>
> [...] if you want to add more vocab you can either:
> (a) Just replace the "[unusedX]" tokens with your vocabulary. Since these were not used they are effectively randomly initialized.
> (b) Append it to the end of the vocab, and write a script which generates a new checkpoint that is identical to the pre-trained checkpoint, but with a bigger vocab where the new embeddings are randomly initialized (for initialization we used tf.truncated_normal_initializer(stddev=0.02)). This will likely require mucking around with some tf.concat() and tf.assign() calls.
>
> (google-research/bert#9)
>
> I am currently experimenting with approach (a). Since there are 993 unused tokens, this might already be enough to cover the most important tokens in your domain.

@tholor I have exactly the same situation as you had. I'm wondering if you can tell me how your experiment with approach (a) went. Did it improve accuracy? I'd really appreciate it if you could share your conclusions.

@vyraun

vyraun commented Oct 3, 2019

> @tholor's and @rodgzilla's answers are the way to go.
> Closing this issue since there is no activity.
> Feel free to re-open if needed.

Hi @thomwolf, for implementing models like VideoBERT we need to append thousands of entries to the word-embedding lookup table. How could we do so in PyTorch? Are there any such examples using the library?
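One way to do this with the library is tokenizer.add_tokens followed by model.resize_token_embeddings, which appends randomly initialized rows for the new ids. A minimal sketch, assuming a recent transformers release (the library had been renamed from pytorch-pretrained-bert by the time of this comment) and an illustrative list of new tokens:

```python
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Illustrative placeholder tokens; in a VideoBERT-style setup these would be the
# thousands of visual-token entries to append.
new_tokens = ["[VIS_0]", "[VIS_1]", "[VIS_2]"]
tokenizer.add_tokens(new_tokens)

# Grows the embedding matrix (and any tied output layer) to the new vocab size;
# the appended rows are randomly initialized and need fine-tuning.
model.resize_token_embeddings(len(tokenizer))
```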

@sachinshinde1391

@tholor Can you explain how you are counting 993 unused tokens? I only see the first 100 unused-token slots.

@aribenjamin

For those finding this on the web, I found the following answer helpful: #1413 (comment)
