
LayerNorm(x + Sublayer(x)) #1

Closed
bkj opened this issue Apr 6, 2018 · 3 comments
Labels: question (Further information is requested)

Comments

bkj commented Apr 6, 2018

In the text (between cells 7 and 8), you say the output of each sub-layer is:

layer_norm(x + dropout(sublayer(x)))

but in the code it looks like it's implemented as

x + dropout(sublayer(layer_norm(x)))

Am I understanding this correctly? (I'm guessing this doesn't matter much in practice, but I wanted to check.)

Thanks
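For concreteness, here is a minimal PyTorch sketch of the two orderings being compared. The function names and the `nn.Linear` stand-in for the sublayer are purely illustrative and are not taken from the notebook:

```python
import torch
import torch.nn as nn

def post_norm(x, sublayer, norm, dropout):
    # Ordering described in the paper's text: LayerNorm(x + Sublayer(x)),
    # with dropout applied to the sublayer output.
    return norm(x + dropout(sublayer(x)))

def pre_norm(x, sublayer, norm, dropout):
    # Ordering the code appears to implement: x + Dropout(Sublayer(LayerNorm(x))).
    return x + dropout(sublayer(norm(x)))

d_model = 8
x = torch.randn(2, 5, d_model)
norm = nn.LayerNorm(d_model)
dropout = nn.Dropout(0.1)
sublayer = nn.Linear(d_model, d_model)  # stand-in for attention / feed-forward

y_paper = post_norm(x, sublayer, norm, dropout)
y_code = pre_norm(x, sublayer, norm, dropout)
```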

srush commented Apr 6, 2018

So their notation is a little confusing; from their code, it seems like the x you add in does not have the previous norm applied.

At the end we return x_3 = layer_norm(x_2), where

x_2 = x_1 + dropout(sublayer(layer_norm(x_1)))
x_1 = x_0 + dropout(sublayer(layer_norm(x_0)))

So you can think of the layer_norm as being applied last, but not to the residual you add.
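A hedged sketch of that stacking, assuming illustrative class names (`PreNormBlock`, `Stack`) and a linear layer as a stand-in sublayer; it only approximates the notebook's own modules:

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """One residual step: x_{i+1} = x_i + dropout(sublayer(layer_norm(x_i)))."""
    def __init__(self, d_model, p=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(p)
        self.sublayer = nn.Linear(d_model, d_model)  # stand-in sublayer

    def forward(self, x):
        return x + self.dropout(self.sublayer(self.norm(x)))

class Stack(nn.Module):
    """Blocks followed by one final norm, i.e. x_3 = layer_norm(x_2)."""
    def __init__(self, d_model, n_blocks=2):
        super().__init__()
        self.blocks = nn.ModuleList([PreNormBlock(d_model) for _ in range(n_blocks)])
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x):
        for block in self.blocks:   # produces x_1, x_2, ...
            x = block(x)
        return self.final_norm(x)   # the norm is applied last, not to each residual input
```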

sth4k commented Jun 29, 2019

Hi, I have the same question. Is there a reason why the Annotated Transformer code uses a different layernorm + residual ordering than the one in the original paper?

imleibao commented Sep 3, 2019

Won't that be a problem? With the code written this way, the input embedding also gets normalized.
