LayerNorm(x + Sublayer(x)) #1
Comments
So their notation is a little confusing: from their code, it seems like the x you add in does not have the previous norm applied. At the end we return x_3 = layer_norm(x_2), where x_2 = x_1 + dropout(sublayer(layer_norm(x_1))). So you can think of the layer_norm as being applied last, but not to the residual you add.
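A minimal PyTorch sketch of the pattern described above (the module name PreNormResidual and the use of nn.LayerNorm are illustrative assumptions, not the repo's exact code):

```python
import torch.nn as nn

class PreNormResidual(nn.Module):
    # Hypothetical sketch: returns x + dropout(sublayer(norm(x))).
    # The residual branch x is added back un-normalized; only the
    # sublayer's input is normalized, matching the comment above.
    def __init__(self, size, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        return x + self.dropout(sublayer(self.norm(x)))
```

A final layer_norm is then applied once to the output of the whole stack, which is the x_3 = layer_norm(x_2) step described above.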
Hi, I have the same question. Is there a reason why the Annotated Transformer code uses a different layernorm+residual ordering than the one in the original paper?
Won't that be a problem? Coding it like that also normalizes the input embedding.
In the text (between cells 7 and 8), you say the output of each sub-layer is:
LayerNorm(x + Sublayer(x))
but in the code it looks like it's implemented as:
x + Dropout(Sublayer(LayerNorm(x)))
Am I understanding correctly? (I'm guessing this doesn't matter in practice, but just checking.)
Thanks
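For concreteness, the two orderings being compared, as a runnable PyTorch fragment (the nn.Linear stand-in for the sublayer and all names here are assumptions for illustration):

```python
import torch
import torch.nn as nn

d_model = 8
x = torch.randn(2, 4, d_model)
norm = nn.LayerNorm(d_model)
dropout = nn.Dropout(0.1)
sublayer = nn.Linear(d_model, d_model)  # stand-in for attention / feed-forward

# As written in the paper's text (post-norm): normalize after the residual add.
post = norm(x + dropout(sublayer(x)))

# As implemented in the code (pre-norm): normalize only the sublayer's input;
# the residual is added back un-normalized.
pre = x + dropout(sublayer(norm(x)))
```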