
LayerNorm/GatedRMS inconsistency #1

Open
inspirit opened this issue Apr 1, 2022 · 6 comments

inspirit commented Apr 1, 2022

Hi!
Looking through the pipeline, it seems there are some inconsistencies with the normalisation:

```
# ReLA
input to GRMSNorm
# attention code
output: Linear(inner_dim, dim) + GRMSNorm
# next, in the FF module
input to LayerNorm
```

Here we have a problem with double normalisation, since the last layer of the attention block is a GRMSNorm and the first layer of the FF block is a LayerNorm.

Looking at the paper, it seems that in ReLA the GRMSNorm is applied to the result of mult(attn, v) before the output projection, not after the projection as in this code.
I'm also confused about the use of LayerNorm in the FF module: should it be GRMSNorm instead? That isn't clear from the paper either.
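
For clarity, here is a rough single-head sketch of the ordering I read in the paper: normalise the aggregated values, then project. The class and argument names are just mine for illustration, not the repo's actual code.

```python
import torch.nn.functional as F
from torch import nn

class ReLAAttention(nn.Module):
    def __init__(self, dim, inner_dim, norm):
        super().__init__()
        self.scale = inner_dim ** -0.5
        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)
        self.norm = norm                       # e.g. a gated RMSNorm over inner_dim
        self.to_out = nn.Linear(inner_dim, dim)

    def forward(self, x):
        q, k, v = self.to_qkv(x).chunk(3, dim = -1)
        attn = F.relu(q @ k.transpose(-1, -2) * self.scale)  # ReLU in place of softmax
        out = attn @ v                         # aggregate the values
        out = self.norm(out)                   # normalise the aggregated values ...
        return self.to_out(out)                # ... then apply the output projection
```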

lucidrains (Owner) commented

@inspirit hello there! yea, i kind of did some improvisation there

i'm using the sandwich normalization formulation from another paper https://arxiv.org/abs/2105.13290 rather than just normalizing the aggregated values directly

for the feedforward, i'm not entirely sure; it probably wouldn't make that huge of a difference
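
for reference, a minimal sketch of what i mean by sandwich norm, just illustrative (made-up names and plain LayerNorms here, not the exact modules in this repo; in the attention block here the post norm is the GRMSNorm):

```python
from torch import nn

class SandwichNormBlock(nn.Module):
    def __init__(self, dim, sublayer):
        super().__init__()
        self.pre_norm = nn.LayerNorm(dim)     # norm going into the sublayer
        self.sublayer = sublayer              # attention or feedforward
        self.post_norm = nn.LayerNorm(dim)    # norm on the sublayer output

    def forward(self, x):
        # both norms sit inside the residual branch
        return x + self.post_norm(self.sublayer(self.pre_norm(x)))
```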

inspirit commented Apr 6, 2022

Aha, I see, yup, I remember the sandwich norm paper :)
Another difference I noticed: you use projection-based gating (with a Linear layer) in GRMSNorm, while the original paper uses simple per-element multiplication, i.e. `return normed_x * (x * gate).sigmoid()`, where `gate` is a learned per-element parameter of shape `(dim,)`, e.g. `gate = nn.Parameter(torch.ones(dim))`.
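
In case it helps, a minimal sketch of the per-element gating I mean, assuming the usual RMSNorm formulation; the parameter initialisations are my guess, not necessarily the paper's exact values.

```python
import torch
from torch import nn

class GatedRMSNorm(nn.Module):
    def __init__(self, dim, eps = 1e-8):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(dim))   # RMSNorm gain
        self.gate = nn.Parameter(torch.ones(dim))    # per-element gate (init is a guess)

    def forward(self, x):
        # root mean square over the feature dimension
        rms = x.norm(dim = -1, keepdim = True) * (x.shape[-1] ** -0.5)
        normed_x = x / rms.clamp(min = self.eps) * self.scale
        return normed_x * (x * self.gate).sigmoid()  # simple elementwise gating, no Linear
```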

lucidrains (Owner) commented

@inspirit ohh apologies, yea, i didn't build that correctly b58b121

let me know if that works! i've seen ReLU-based attention in another recent paper (https://github.com/lucidrains/FLASH-pytorch), so maybe there's something to it!

lucidrains (Owner) commented

@inspirit how did it go? :) any interesting experimental results?

inspirit commented Apr 8, 2022

It seems to be less stable compared to normal softmax attention. I fused it with a Perceiver for my experiments; sometimes it gives slightly better results, sometimes not :) The reason might be the small model inner dimension (128) and the sparser attention that results from the ReLU.

lucidrains (Owner) commented

@inspirit yea, i thought it would be too good to be true if relu attention worked 😞 it must have worked for FLASH because they confine their quadratic attention to local windows
