Dear Juho,
Thanks for making the code public!
One quick question: if I read the code correctly, LayerNorm was never used in any of the three examples you open-sourced in this repo. Is that correct?
If so, is it because it gives slightly inferior performance? And have you tried moving the LayerNorm layer inside the skip connections, instead of before/after them as done in several more recent papers, so that there is a connection directly from the output to the input?
Thanks in advance and looking forward to your reply!
We empirically found that multiple stacks of ISABs won't train with LayerNorm on some data, and the results didn't degrade much without LayerNorm even on the data where it did work, so we decided not to include it.
A recent result (https://arxiv.org/abs/2002.04745, though not really recent anymore) shows that moving the LayerNorm to before the attention improves performance, so if you are going to apply it you might consider that placement.
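For concreteness, here is a minimal sketch of the two placements being discussed. This is not the repo's MAB/ISAB code; it uses PyTorch's built-in `nn.MultiheadAttention`, and the module names and feed-forward layout are illustrative only:

```python
import torch
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Post-LN (original Transformer): LayerNorm applied after the residual add,
    so the norm sits on the main path between input and output."""
    def __init__(self, dim, num_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.ln1 = nn.LayerNorm(dim)
        self.ln2 = nn.LayerNorm(dim)

    def forward(self, x):
        x = self.ln1(x + self.attn(x, x, x)[0])  # normalize after the skip connection
        return self.ln2(x + self.ff(x))

class PreLNBlock(nn.Module):
    """Pre-LN (https://arxiv.org/abs/2002.04745): LayerNorm moved inside the
    residual branch, leaving an identity path directly from input to output."""
    def __init__(self, dim, num_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.ln1 = nn.LayerNorm(dim)
        self.ln2 = nn.LayerNorm(dim)

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h)[0]  # skip connection bypasses the norm entirely
        return x + self.ff(self.ln2(x))
```

The pre-LN variant keeps an unnormalized residual path through the whole stack, which is the "connection directly from output to input" the question refers to and is generally reported to train more stably in deep stacks.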