
Curious Question: Is LayerNorm in the wrong position or is that deliberate? #33

Closed

turner-rovco opened this issue Aug 26, 2020 · 2 comments

turner-rovco commented Aug 26, 2020
Code:

```python
def forward(self, x):
    x = x + self.attn(self.ln1(x))
    x = x + self.mlp(self.ln2(x))
    return x
```

The paper says the output of each sub-layer is `LayerNorm(x + Sublayer(x))`.

I changed the code to the following and the results seem (slightly) better on play_char, although that's a qualitative assessment...

```python
def forward(self, x):
    x = self.ln1(x + self.attn(x))
    x = self.ln2(x + self.mlp(x))
    return x
```

fpgaminer (Contributor) commented:
GPT-2 changed it:

> Layer normalization (Ba et al., 2016) was moved to the input of each sub-block, similar to a pre-activation residual network (He et al., 2016) and an additional layer normalization was added after the final self-attention block.

Source: https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

(NOTE: I'm not the repo owner, just trying to help)
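For concreteness, here is a minimal sketch of the pre-LN arrangement the quote describes, including the extra LayerNorm after the block stack (`ln_f` in GPT-2's released code). This is not this repo's code; the module names and the use of `nn.MultiheadAttention` are illustrative assumptions, and causal masking is omitted for brevity:

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Pre-LN transformer block: LayerNorm is applied to the *input*
    of each sub-block, so the residual path is never normalized."""
    def __init__(self, d_model, n_head):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        h = self.ln1(x)
        # nn.MultiheadAttention returns (output, weights); keep the output.
        # (Causal masking omitted for brevity.)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.ln2(x))
        return x

class TinyGPT2Style(nn.Module):
    """Stack of pre-LN blocks plus the 'additional layer normalization'
    the paper mentions, applied after the final block."""
    def __init__(self, d_model=64, n_head=4, n_layer=2):
        super().__init__()
        self.blocks = nn.Sequential(
            *[PreLNBlock(d_model, n_head) for _ in range(n_layer)]
        )
        self.ln_f = nn.LayerNorm(d_model)  # the extra final LayerNorm

    def forward(self, x):
        return self.ln_f(self.blocks(x))

x = torch.randn(1, 8, 64)  # (batch, seq, d_model)
print(TinyGPT2Style()(x).shape)  # torch.Size([1, 8, 64])
```

Because pre-LN never normalizes the residual stream inside a block, the final `ln_f` is what keeps the output scale in check before the language-model head.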

turner-rovco (Author) commented:

Ah perfect, thanks!
