
Does fine-tuned model use the same vocabulary as pre-trained model? #5

Closed

zhangguanqun opened this issue Dec 22, 2020 · 5 comments

@zhangguanqun

The pre-trained model uses a learned joint vocabulary with 64867 tokens, built with 32k BPE merge operations; that is the file vocab.bpe.32000.
Does the fine-tuned model (e.g. en2de) released on icloud use the same vocabulary file as the pre-trained model?

@linzehui
Owner

yes

@zhangguanqun
Author

Appreciated, thanks.

@zhangguanqun
Author

If I want to fine-tune this model to support new languages, should the new tokens be added to the existing file?
That is, would the new file have a vocabulary size larger than 64867?
If so, the embedding parameters of the checkpoint released in this project (e.g. pretrain_checkpoint_last_RAS.pt) would have to be expanded before fine-tuning.
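For reference, this is how I would check the embedding shapes stored in the released checkpoint (just a sketch; the parameter names follow fairseq's usual conventions and are an assumption on my side, not confirmed here):

```python
# Sketch: inspect the vocabulary dimension of the released checkpoint.
# Parameter names follow fairseq conventions and are an assumption here.
import torch

ckpt = torch.load("pretrain_checkpoint_last_RAS.pt", map_location="cpu")
state = ckpt["model"]

for name, tensor in state.items():
    if "embed_tokens" in name or "output_projection" in name:
        # The first dimension should match the dictionary size built from
        # vocab.bpe.32000 (plus fairseq's special tokens).
        print(name, tuple(tensor.shape))
```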

@PANXiao1994
Collaborator

If I want to fine-tune this model to support new languages, should the new tokens be added to the existing file?
That is, would the new file have a vocabulary size larger than 64867?
If so, the embedding parameters of the checkpoint released in this project (e.g. pretrain_checkpoint_last_RAS.pt) would have to be expanded before fine-tuning.

Yes, you are right. If you want to expand the set of supported languages, you need to merge the newly added tokens into the existing vocabulary and then randomly initialize the embedding vectors of the new tokens.
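Something like the following sketch, assuming a fairseq-style checkpoint where the token embeddings live under names containing embed_tokens (the parameter names, the new vocabulary size, and the output file name are placeholders to adapt, not values taken from this project):

```python
# Sketch: grow the embedding matrices of a fairseq-style checkpoint so the
# pre-trained rows are kept and the rows for newly merged tokens are
# randomly initialized. Names and sizes below are illustrative assumptions.
import torch

OLD_CKPT = "pretrain_checkpoint_last_RAS.pt"
NEW_CKPT = "pretrain_checkpoint_last_RAS_expanded.pt"
NEW_VOCAB_SIZE = 70000  # hypothetical size after merging new-language tokens

ckpt = torch.load(OLD_CKPT, map_location="cpu")
state = ckpt["model"]

for name in list(state.keys()):
    if "embed_tokens" in name or "output_projection" in name:
        old = state[name]
        old_vocab, dim = old.shape
        if NEW_VOCAB_SIZE <= old_vocab:
            continue
        # Keep the pre-trained rows; randomly initialize the appended rows
        # (roughly the scale fairseq uses for token embeddings: N(0, dim^-0.5)).
        new = old.new_empty(NEW_VOCAB_SIZE, dim)
        torch.nn.init.normal_(new, mean=0.0, std=dim ** -0.5)
        new[:old_vocab] = old
        state[name] = new

torch.save(ckpt, NEW_CKPT)
```

The expanded dictionary file should list the old tokens first, in their original order, so that the copied embedding rows keep their indices.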

@zhangguanqun
Author

Appreciated again, thanks.
