Would be good to experiment with rescaling attention values, with full attention as well, to see if we can show that the attention values are what causes training instability. The current theory is that large language models are easily pushed into local minima that are hard to break out of when the attention values get too big before being softmax-ed. A possible experiment: use large model dimensions and, to save on memory, only a single head and a single layer of multi-head attention (with feed-forward layers after it), a comically small sequence length (around 4), and small batch sizes; also use no dropout, to remove regularization. Consider using only a subset of wikitext-2. Show that larger dimensions make the effect worse, and measure the pre-softmax values to see whether they grow over time (a rough sketch of the setup is below).
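A minimal sketch of what that experiment could look like, assuming PyTorch and a small slice of wikitext-2 tokenized elsewhere; the module and variable names here are made up for illustration, not taken from the repo. The only point is to log the largest pre-softmax logit each step and compare across model widths.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OneHeadBlock(nn.Module):
    """Single-head attention + feed-forward, no dropout, for probing logit growth."""
    def __init__(self, d_model):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.d_model = d_model
        self.max_presoftmax = 0.0  # largest pre-softmax attention value seen this step

    def forward(self, x):  # x: (batch, seq, d_model), seq can be as small as 4
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        logits = q @ k.transpose(-2, -1) / self.d_model ** 0.5  # pre-softmax values
        self.max_presoftmax = logits.abs().max().item()         # record magnitude
        attn = F.softmax(logits, dim=-1)
        x = x + self.proj(attn @ v)
        return x + self.ff(x)

# Idea: train this (plus embedding and output head) for d_model in, say,
# (128, 1024, 4096) with seq_len=4, small batches, no dropout, on a wikitext-2
# subset, and plot block.max_presoftmax against training step for each width.
```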
Would also be good to try a different implementation of rescaling for LEAP, which might work better, since it's still oddly unstable for how strongly scaled it should be.
The current rescaling seems to be good now, after figuring out that there need to be normalization terms in the denominator. What remains is to run the experiments that validate the rescaling.
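For reference, a hedged sketch of what "normalization terms in the denominator" could mean for the pre-softmax values; this is only an illustration of the idea, not the actual LEAP implementation, and the function name and epsilon are assumptions.

```python
import torch

def rescaled_logits(q, k, eps=1e-6):
    """Divide raw query-key scores by per-row norm terms so they stay bounded."""
    q_norm = q.norm(dim=-1, keepdim=True)            # normalization term from the queries
    k_norm = k.norm(dim=-1, keepdim=True)            # normalization term from the keys
    raw = q @ k.transpose(-2, -1)
    denom = q_norm * k_norm.transpose(-2, -1) + eps  # normalization in the denominator
    return raw / denom                               # cosine-similarity style, bounded in [-1, 1]
```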