
Is the Encoder implementation different from the original "Attention Is All You Need" paper? #5

Closed
chaoyanghe opened this issue Nov 19, 2020 · 4 comments

Comments

@chaoyanghe

chaoyanghe commented Nov 19, 2020

Hi, I checked your code at

class Block(nn.Module):

Your implementation is Attention(LayerNorm(x)) + x, but the original Transformer uses LayerNorm(x + Attention(x)). Is this an error, or was it implemented this way deliberately?

@jeonsworld
Owner

jeonsworld commented Nov 20, 2020

As described in the paper On Layer Normalization in the Transformer Architecture, the position of Layer Normalization in Transformer implementations is referred to as either pre-LN or post-LN. For example, the Transformer-encoder-based BERT uses post-LN, while Vision Transformer uses pre-LN.
In conclusion, this implementation is correct.
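
For illustration, here is a minimal PyTorch sketch of the two placements. The module and parameter names (PreLNBlock, PostLNBlock, dim, num_heads) are my own, and nn.MultiheadAttention stands in for the repository's Attention module; the sketch only shows where LayerNorm sits relative to the residual connection, not the exact implementation in this repo.

```python
import torch
import torch.nn as nn


class PreLNBlock(nn.Module):
    """Pre-LN residual block (Vision Transformer style): x + Attention(LayerNorm(x))."""

    def __init__(self, dim, num_heads):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)            # normalize first (pre-LN)
        h, _ = self.attn(h, h, h)   # self-attention on the normalized input
        return x + h                # residual connection added afterwards


class PostLNBlock(nn.Module):
    """Post-LN residual block (original Transformer / BERT style): LayerNorm(x + Attention(x))."""

    def __init__(self, dim, num_heads):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        h, _ = self.attn(x, x, x)   # self-attention on the raw input
        return self.norm(x + h)     # normalize after the residual sum


# Quick shape check: (batch, tokens, dim)
block = PreLNBlock(dim=64, num_heads=4)
out = block(torch.randn(2, 16, 64))
```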

@jeonsworld
Owner

An additional comment:
Attention Is All You Need uses post-LN.

@jeonsworld
Owner

I believe the question has been answered, so I am closing the issue.

@chaoyanghe
Author

Thank you!
