
LayerNorm(x + Sublayer(x)) #1

Closed
bkj opened this issue Apr 6, 2018 · 3 comments
Labels: question (Further information is requested)

Comments

bkj commented Apr 6, 2018

In the text (between cells 7 and 8), you say the output of each sub-layer is:

layer_norm(x + dropout(sublayer(x)))

but in the code it looks like it's implemented as

x + dropout(sublayer(layer_norm(x)))

Am I understanding this correctly? (I'm guessing this doesn't matter much in practice, but I wanted to check.)

Thanks
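For concreteness, here is a minimal PyTorch sketch of the two orderings being compared. The function names and the `nn.Linear` stand-in for the sublayer are purely illustrative and are not taken from the notebook:

```python
import torch
import torch.nn as nn

def post_norm(x, sublayer, norm, dropout):
    # Ordering described in the paper's text: LayerNorm(x + Sublayer(x)),
    # with dropout applied to the sublayer output.
    return norm(x + dropout(sublayer(x)))

def pre_norm(x, sublayer, norm, dropout):
    # Ordering the code appears to implement: x + Dropout(Sublayer(LayerNorm(x))).
    return x + dropout(sublayer(norm(x)))

d_model = 8
x = torch.randn(2, 5, d_model)
norm = nn.LayerNorm(d_model)
dropout = nn.Dropout(0.1)
sublayer = nn.Linear(d_model, d_model)  # stand-in for attention / feed-forward

y_paper = post_norm(x, sublayer, norm, dropout)
y_code = pre_norm(x, sublayer, norm, dropout)
```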

srush commented Apr 6, 2018

So their notation is a little confusing; from their code, it seems like the x you add in does not have the previous norm applied.

At the end we return x_3 = layer_norm(x_2), where

x_2 = x_1 + dropout(sublayer(layer_norm(x_1)))
x_1 = x_0 + dropout(sublayer(layer_norm(x_0)))

So you can think of the layer_norm as being applied last, but not to the residual you add.
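A hedged sketch of that stacking, assuming illustrative class names (`PreNormBlock`, `Stack`) and a linear layer as a stand-in sublayer; it only approximates the notebook's own modules:

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """One residual step: x_{i+1} = x_i + dropout(sublayer(layer_norm(x_i)))."""
    def __init__(self, d_model, p=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(p)
        self.sublayer = nn.Linear(d_model, d_model)  # stand-in sublayer

    def forward(self, x):
        return x + self.dropout(self.sublayer(self.norm(x)))

class Stack(nn.Module):
    """Blocks followed by one final norm, i.e. x_3 = layer_norm(x_2)."""
    def __init__(self, d_model, n_blocks=2):
        super().__init__()
        self.blocks = nn.ModuleList([PreNormBlock(d_model) for _ in range(n_blocks)])
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x):
        for block in self.blocks:   # produces x_1, x_2, ...
            x = block(x)
        return self.final_norm(x)   # the norm is applied last, not to each residual input
```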

sth4k commented Jun 29, 2019

Hi, I have the same question. Is there a reason why the Annotated Transformer code uses a different layernorm + residual ordering than the one in the original paper?

imleibao commented Sep 3, 2019

Won't that be a problem? With the code written this way, the input embedding also gets normalized.
