Thanks for your great work. Have you run any experiments applying ReZero to Transformers of different depths, e.g. a 1-layer Transformer and its performance, a 2-layer Transformer and its performance, and so on? Does it also speed up convergence in networks that are not very deep? Thank you.
In our paper, we experimented with 12-layer Transformers and observed a speedup in convergence. We have not tried shallower models, but we hypothesize that ReZero should bring speed improvements there as well.
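For reference, the core ReZero idea is small enough to sketch directly: each residual connection `x + F(x)` is replaced by `x + alpha * F(x)`, where `alpha` is a learnable scalar initialized to zero, so every block starts as the identity map regardless of depth. Below is a minimal NumPy sketch (hypothetical helper names, not the authors' code; the sublayer is a stand-in for attention or a feed-forward network):

```python
import numpy as np

class ReZeroBlock:
    """Sketch of one ReZero residual block: out = x + alpha * F(x)."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        # Stand-in for a Transformer sublayer (attention or FFN):
        # here just a random linear map followed by a ReLU.
        self.w = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.alpha = 0.0  # learnable residual weight, initialized to zero

    def forward(self, x):
        f = np.maximum(x @ self.w, 0.0)  # sublayer output F(x)
        return x + self.alpha * f        # ReZero: identity map at init

x = np.ones((2, 4))
block = ReZeroBlock(dim=4)
print(np.allclose(block.forward(x), x))  # True at init, since alpha == 0
```

Because every block is initially the identity, signals propagate unchanged through an arbitrarily deep stack at the start of training, which is the mechanism behind the faster convergence; the same mechanism applies whether the stack has 1 layer or 12.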