
Are the weights of the lm head of the model tied with the word embeddings? #138

Closed
lvcc2018 opened this issue Mar 6, 2023 · 4 comments


lvcc2018 commented Mar 6, 2023

Thanks for the amazing work.
I wonder whether the weights of the model's lm head are tied with its word embeddings. From the code, it seems that they are not.

lvcc2018 commented Mar 6, 2023

And I am curious about the difference between tied and untied word embeddings. Does it have an effect on training stability?


benob commented Mar 8, 2023

Long ago, in the LSTM era, I used to train a lot of LMs, and tying embedding weights was a game changer: much faster convergence and lower memory usage. I don't know about nowadays, though.

To answer your first question, I looked at the weights of the tok_embeddings and output tensors in the checkpoint files and they differ. So tying was not used for training.
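
A sketch of that check (the shard path is illustrative, and it assumes the tensors are stored under the keys tok_embeddings.weight and output.weight):

    import torch

    # Load one checkpoint shard on CPU; adjust the path to your local download.
    ckpt = torch.load("llama-7B/consolidated.00.pth", map_location="cpu")

    emb = ckpt["tok_embeddings.weight"]  # input word embedding matrix
    out = ckpt["output.weight"]          # output projection ("lm head") matrix

    # If the two were tied, they would hold identical values.
    print(torch.equal(emb, out))  # False, per the check described above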

glample commented Mar 10, 2023

If by "lm head" you are referring to the output layer on top of the transformer (the Linear(hidden_dim, vocab_size)), then no, its weights are not shared with the input word embeddings.
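
For readers unfamiliar with the distinction, a minimal PyTorch sketch of the two setups (plain nn modules for illustration, not the repo's actual parallel layers; the sizes are made up):

    import torch.nn as nn

    vocab_size, hidden_dim = 32000, 4096  # illustrative sizes

    tok_embeddings = nn.Embedding(vocab_size, hidden_dim)

    # Untied (as described above): the output layer has its own weight matrix.
    output = nn.Linear(hidden_dim, vocab_size, bias=False)

    # Tied: the output layer reuses the embedding matrix; both point to the
    # same nn.Parameter, so they stay identical and share gradient updates.
    output_tied = nn.Linear(hidden_dim, vocab_size, bias=False)
    output_tied.weight = tok_embeddings.weight

With tying, a single (vocab_size, hidden_dim) matrix serves as both the input lookup table and the output projection, roughly halving the vocabulary-related parameter count.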

@wuwuwuxxx

I want to know if there are any particular reasons not to use weight tying. @glample
