Layer normalization (Ba et al., 2016) was moved to the input of each sub-block, similar to a pre-activation residual network (He et al., 2016), and an additional layer normalization was added after the final self-attention block.
Code:
def forward(self, x):
    x = x + self.attn(self.ln1(x))  # pre-norm: LayerNorm on the sub-layer input, residual added afterwards
    x = x + self.mlp(self.ln2(x))
    return x
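As an aside, the "additional layer normalization after the final self-attention block" from the quote sits at the model level rather than inside the block. A minimal, self-contained sketch of just that ordering follows; every name and size here is a toy stand-in for illustration, not the repository's code:

import torch
import torch.nn as nn

# Stand-ins purely to show the ordering: pre-norm blocks, then one final LayerNorm, then the head.
n_embd, vocab_size = 64, 100
blocks = nn.Sequential(nn.Linear(n_embd, n_embd), nn.Linear(n_embd, n_embd))  # placeholder for the stack of Blocks
ln_f = nn.LayerNorm(n_embd)                       # the extra LayerNorm after the final block
head = nn.Linear(n_embd, vocab_size, bias=False)  # projection to vocabulary logits

x = torch.randn(1, 16, n_embd)   # pretend these are already-embedded tokens
logits = head(ln_f(blocks(x)))   # blocks -> final LayerNorm -> logits
print(logits.shape)              # torch.Size([1, 16, 100])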
Paper says: Output of each sub-layer is LayerNorm(x + Sublayer(x))
I changed the code to the following and the results seem (slightly) better on play_char, though it's only a qualitative assessment...
def forward(self, x):
    x = self.ln1(x + self.attn(x))  # post-norm: LayerNorm(x + Sublayer(x)), as in the paper
    x = self.ln2(x + self.mlp(x))
    return x
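For concreteness, here is a minimal, self-contained sketch of the two orderings side by side. It is not the repository's code: nn.MultiheadAttention and the toy sizes are stand-ins for the real causal attention and config, and only the placement of the LayerNorms is the point.

import torch
import torch.nn as nn

class BlockBase(nn.Module):
    # Shared layers; nn.MultiheadAttention replaces the repo's causal attention purely for illustration.
    def __init__(self, n_embd=64, n_head=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(n_embd, 4 * n_embd), nn.GELU(), nn.Linear(4 * n_embd, n_embd))

class PreNormBlock(BlockBase):
    # GPT-2 ordering: x + Sublayer(LayerNorm(x))
    def forward(self, x):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, need_weights=False)
        x = x + a
        x = x + self.mlp(self.ln2(x))
        return x

class PostNormBlock(BlockBase):
    # Original Transformer ordering: LayerNorm(x + Sublayer(x))
    def forward(self, x):
        a, _ = self.attn(x, x, x, need_weights=False)
        x = self.ln1(x + a)
        x = self.ln2(x + self.mlp(x))
        return x

x = torch.randn(2, 16, 64)      # (batch, sequence, embedding)
print(PreNormBlock()(x).shape)  # torch.Size([2, 16, 64])
print(PostNormBlock()(x).shape) # torch.Size([2, 16, 64])

The pre-norm form keeps an un-normalized residual path running through the whole stack (hence the extra final LayerNorm at the end of the model), which is generally credited with more stable training for deeper models, while the post-norm form matches the LayerNorm(x + Sublayer(x)) formula quoted above.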