rezero with norm #4

AllenDun · 2020-03-13T09:01:02Z

Great work! In your paper, ReZero shows two main benefits: training deeper networks and faster convergence. Various forms of norm and residual connections are listed in Table 1. I am curious about ReZero combined with a norm, e.g., x(i+1) = x(i) + a F(Norm(x(i))). Would it be worse or better?
Thanks

majumderb · 2020-03-13T09:06:16Z

Hi, thanks for your comment. Your equation corresponds to Pre-norm (Table 1, row 4). We show that, in the case of Transformers, Pre-norm performs as poorly as GPT-norm relative to ReZero.
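
For readers comparing the update rules mentioned in this thread, here is a minimal pure-Python sketch, not the repository's implementation. It assumes a parameter-free layer norm, a toy elementwise `sublayer` standing in for F, and a scalar `alpha` for the residual weight written as `a` in the question; all function names are hypothetical. It only illustrates the algebraic forms and says nothing about training behavior.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sublayer(x):
    """Hypothetical stand-in for F (e.g. attention or feed-forward): elementwise tanh."""
    return [math.tanh(2.0 * v) for v in x]

def rezero(x, alpha):
    # x(i+1) = x(i) + alpha * F(x(i))
    return [xi + alpha * fi for xi, fi in zip(x, sublayer(x))]

def pre_norm(x):
    # x(i+1) = x(i) + F(Norm(x(i)))
    return [xi + fi for xi, fi in zip(x, sublayer(layer_norm(x)))]

def rezero_with_norm(x, alpha):
    # x(i+1) = x(i) + alpha * F(Norm(x(i))), the form asked about above
    return [xi + alpha * fi for xi, fi in zip(x, sublayer(layer_norm(x)))]

x = [0.5, -1.0, 2.0]
print(rezero(x, 0.0))                            # alpha = 0: the layer is the identity
print(rezero_with_norm(x, 0.0))                  # also the identity when alpha = 0
print(rezero_with_norm(x, 1.0) == pre_norm(x))   # alpha = 1 reduces to the pre-norm update
```

With `alpha` set to 0 both gated forms pass the input through unchanged, and with `alpha` fixed at 1 the form from the question coincides with the pre-norm update.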

majumderb closed this as completed Mar 13, 2020
