
Embedding scaling #1

Open

jamesanto opened this issue Mar 11, 2024 · 1 comment
Comments

@jamesanto

Thank you for the excellent blog post and code.

Looking at the embedding scaling logic, it appears that you first scale the word embedding weights by 1/scale during initialisation, and then scale the word embeddings by scale after lookup. Is this intentional? Don't the two operations cancel each other out?
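For reference, here is a minimal PyTorch sketch of the pattern as I read it; the names and the choice of scale = sqrt(d_model) are illustrative, not taken from the repository:

```python
import math
import torch
import torch.nn as nn

d_model, vocab_size = 512, 32000
scale = math.sqrt(d_model)  # assumed scale factor

# Weights are scaled down by 1/scale at initialisation ...
embedding = nn.Embedding(vocab_size, d_model)
nn.init.normal_(embedding.weight, mean=0.0, std=1.0)
with torch.no_grad():
    embedding.weight /= scale

def embed(token_ids: torch.Tensor) -> torch.Tensor:
    # ... and the looked-up vectors are scaled back up by scale.
    return embedding(token_ids) * scale
```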

@jamesanto
Author

I think I get it now.

We scale down the weights so that they can be shared with the output layer. To offset this scaling, the embedding values are scaled back up before being fed to the transformer.

This is also different from many implementations, where the embedding is only scaled up before being fed to the transformer. Some discussions suggest the scaling is there so that the positional encodings do not overwhelm the word embeddings, but there is no consensus.

Do you have any references for the approach you have implemented?
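For comparison, a minimal sketch of the more common variant mentioned above, where the input embedding and output projection share weights and only the lookup is scaled up by sqrt(d_model), as in the original Transformer paper (module and method names are hypothetical, not from the repository):

```python
import math
import torch
import torch.nn as nn

class TiedEmbedding(nn.Module):
    """Tied input embedding / output projection; only the lookup
    is scaled up by sqrt(d_model), with no 1/scale at init."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.d_model = d_model
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.proj = nn.Linear(d_model, vocab_size, bias=False)
        self.proj.weight = self.embedding.weight  # weight tying

    def embed(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Scale up before adding positional encodings.
        return self.embedding(token_ids) * math.sqrt(self.d_model)

    def logits(self, hidden: torch.Tensor) -> torch.Tensor:
        return self.proj(hidden)
```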
