Great work! In your paper, ReZero shows two main benefits: training deeper networks and faster convergence. Various forms of normalization and residual connections are listed in Table 1. I am curious about combining ReZero with normalization, e.g., x(i+1) = x(i) + aF(Norm(x(i))). Will it be worse or better?
Thanks
Hi, thanks for your comment. Your equation corresponds to pre-norm (Table 1, row 4). We show that, for Transformers, pre-norm performs as poorly as GPT-norm when compared to ReZero.
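To make the forms under discussion concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: `sublayer` is a hypothetical stand-in for F (an elementwise tanh rather than attention or an MLP), `layer_norm` omits the learned gain and bias, and all function names are made up for illustration.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sublayer(x):
    """Hypothetical stand-in for F: an elementwise tanh."""
    return [math.tanh(v) for v in x]

def rezero_step(x, alpha):
    # ReZero: x(i+1) = x(i) + a * F(x(i))
    return [xi + alpha * fi for xi, fi in zip(x, sublayer(x))]

def prenorm_step(x):
    # Pre-norm: x(i+1) = x(i) + F(Norm(x(i)))
    return [xi + fi for xi, fi in zip(x, sublayer(layer_norm(x)))]

def rezero_prenorm_step(x, alpha):
    # Form asked about in the question: x(i+1) = x(i) + a * F(Norm(x(i)))
    return [xi + alpha * fi for xi, fi in zip(x, sublayer(layer_norm(x)))]

x = [0.5, -1.0, 2.0]
# With a = 0 (the ReZero initialization), both gated variants are the identity map.
print(rezero_step(x, 0.0) == x)
print(rezero_prenorm_step(x, 0.0) == x)
```

In this sketch, setting a = 1 in `rezero_prenorm_step` reduces it to `prenorm_step`, so the form in the question can be read as pre-norm with an added learnable gate on the residual branch.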