Are the weights of the lm head of the model tied with the word embeddings? #138
Comments
I am also curious about the difference between tied and untied word embeddings. Does it have an effect on training stability?
Long ago, in the era of LSTMs, I used to train a lot of LMs, and tying the embedding weights was a game changer: much faster convergence and lower memory usage. I don't know about nowadays, though. To answer your first question, I looked at the tok_embeddings and output tensors in the checkpoint files and they differ, so weight tying was not used for training.
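For reference, a minimal sketch of that check, assuming a single-shard checkpoint (e.g. consolidated.00.pth) where both tensors have shape (vocab_size, dim) and are stored under the keys tok_embeddings.weight and output.weight; adjust the path and key names to match your files:

```python
import torch

# Load one checkpoint shard onto the CPU.
ckpt = torch.load("consolidated.00.pth", map_location="cpu")

emb = ckpt["tok_embeddings.weight"]  # token embedding matrix, (vocab_size, dim)
out = ckpt["output.weight"]          # output projection matrix, (vocab_size, dim)

print("same shape:", emb.shape == out.shape)
print("identical values (i.e. tied):", torch.equal(emb, out))
print("max absolute difference:", (emb.float() - out.float()).abs().max().item())
```

If the two tensors were tied during training, they would be bitwise identical in the saved checkpoint; any nonzero difference means two separate parameter matrices were learned.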
if by "lm head" you are referring to the output layer on top of the transformer (the |
I want to know if there are particular reasons not to use weight tying. @glample
Thanks for the amazing work.
I wonder whether the weights of the model's lm head are tied with its word embeddings. From the code, it seems that they are not tied.
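For context, "weight tying" in a decoder-style LM usually means reusing the embedding matrix as the output projection. A generic PyTorch sketch of the idea (not the code from this repo; the class and parameter names here are illustrative):

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    # Minimal decoder-style LM skeleton, just to illustrate weight tying.
    def __init__(self, vocab_size: int, dim: int, tie_weights: bool = True):
        super().__init__()
        self.tok_embeddings = nn.Embedding(vocab_size, dim)
        self.output = nn.Linear(dim, vocab_size, bias=False)
        if tie_weights:
            # Tied: the output projection reuses the embedding matrix, so both
            # modules share a single (vocab_size, dim) parameter tensor.
            self.output.weight = self.tok_embeddings.weight

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h = self.tok_embeddings(tokens)  # (batch, seq, dim); a real model would run transformer blocks here
        return self.output(h)            # (batch, seq, vocab_size) logits

model = TinyLM(vocab_size=32000, dim=512)
# With tying, both names point at the same underlying storage:
print(model.output.weight.data_ptr() == model.tok_embeddings.weight.data_ptr())  # True
```

With tie_weights=False the model learns two independent matrices, which is what the checkpoint comparison above suggests was done here.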