Dear Juho,
Thanks for making the code public!
One quick question: if I read the code correctly, LayerNorm was never used in any of the three examples you open-sourced in this repo. Is that correct?
If so, is it because it gives slightly inferior performance? And have you tried moving the LayerNorm layer inside the skip connections, instead of before/after them as done in several more recent papers, so that there is a connection directly from the output to the input?
Thanks in advance and looking forward to your reply!
We empirically found that multiple stacks of ISABs won't train with LayerNorm on some data, and the results didn't degrade much without LayerNorm even on the data where it did work, so we decided not to include it.
A recent result (https://arxiv.org/abs/2002.04745, though not really recent anymore) shows that moving the LayerNorm to before the attention improves performance, so if you are going to apply it you might consider that placement.
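For concreteness, here is a minimal sketch of the two placements being discussed. This is not the repo's MAB/ISAB code; it uses PyTorch's built-in `nn.MultiheadAttention`, and the module names and feed-forward layout are illustrative only:

```python
import torch
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Post-LN (original Transformer): LayerNorm applied after the residual add,
    so the norm sits on the main path between input and output."""
    def __init__(self, dim, num_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.ln1 = nn.LayerNorm(dim)
        self.ln2 = nn.LayerNorm(dim)

    def forward(self, x):
        x = self.ln1(x + self.attn(x, x, x)[0])  # normalize after the skip connection
        return self.ln2(x + self.ff(x))

class PreLNBlock(nn.Module):
    """Pre-LN (https://arxiv.org/abs/2002.04745): LayerNorm moved inside the
    residual branch, leaving an identity path directly from input to output."""
    def __init__(self, dim, num_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.ln1 = nn.LayerNorm(dim)
        self.ln2 = nn.LayerNorm(dim)

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h)[0]  # skip connection bypasses the norm entirely
        return x + self.ff(self.ln2(x))
```

The pre-LN variant keeps an unnormalized residual path through the whole stack, which is the "connection directly from output to input" the question refers to and is generally reported to train more stably in deep stacks.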