Would be good to experiment with rescaling attention values, with full attention as well, to see if we can show that the attention values are what causes training instability. The current theory is that large language models are easily pushed into local minima that are hard to break out of when the attention values get too big before being softmax-ed. A possible experiment: use large model dimensions and, to save on memory, only a single head and a single layer of multi-head attention (with feed-forward layers after it), a comically small sequence length (around 4), and small batch sizes; also use no dropout, to remove regularization. Consider using only a subset of wikitext-2. Show that larger dimensions make the effect worse, and measure the pre-softmax values to see whether they grow over time (a rough sketch of the setup is below).
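A minimal sketch of what that experiment could look like, assuming PyTorch and a small slice of wikitext-2 tokenized elsewhere; the module and variable names here are made up for illustration, not taken from the repo. The only point is to log the largest pre-softmax logit each step and compare across model widths.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OneHeadBlock(nn.Module):
    """Single-head attention + feed-forward, no dropout, for probing logit growth."""
    def __init__(self, d_model):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.d_model = d_model
        self.max_presoftmax = 0.0  # largest pre-softmax attention value seen this step

    def forward(self, x):  # x: (batch, seq, d_model), seq can be as small as 4
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        logits = q @ k.transpose(-2, -1) / self.d_model ** 0.5  # pre-softmax values
        self.max_presoftmax = logits.abs().max().item()         # record magnitude
        attn = F.softmax(logits, dim=-1)
        x = x + self.proj(attn @ v)
        return x + self.ff(x)

# Idea: train this (plus embedding and output head) for d_model in, say,
# (128, 1024, 4096) with seq_len=4, small batches, no dropout, on a wikitext-2
# subset, and plot block.max_presoftmax against training step for each width.
```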
Would also be good to try a different implementation of rescaling for LEAP, which might work better, since it's still oddly unstable for how strongly scaled it should be.
The current rescaling seems to be good now, after figuring out that there need to be normalization terms in the denominator. What remains is to run the experiments that validate the rescaling.
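For reference, a hedged sketch of what "normalization terms in the denominator" could mean for the pre-softmax values; this is only an illustration of the idea, not the actual LEAP implementation, and the function name and epsilon are assumptions.

```python
import torch

def rescaled_logits(q, k, eps=1e-6):
    """Divide raw query-key scores by per-row norm terms so they stay bounded."""
    q_norm = q.norm(dim=-1, keepdim=True)            # normalization term from the queries
    k_norm = k.norm(dim=-1, keepdim=True)            # normalization term from the keys
    raw = q @ k.transpose(-2, -1)
    denom = q_norm * k_norm.transpose(-2, -1) + eps  # normalization in the denominator
    return raw / denom                               # cosine-similarity style, bounded in [-1, 1]
```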