
Rescaling Attention values experiments/different implementations #10

Closed
mtanghu opened this issue Aug 31, 2022 · 2 comments
Labels
enhancement New feature or request

Comments

mtanghu (Owner) commented Aug 31, 2022

It would be good to experiment with rescaling attention values, including with full attention, to see whether you can show that the attention values are what causes training instability (the current theory is that large language models get pushed into hard-to-escape local minima when the attention values grow too large before being softmaxed). A possible experiment, sketched in code below:

- Use large model dimensions, but to save memory, only a single head and a single layer of multi-head attention (with feed-forward layers after it).
- Use comically small sequence lengths (around 4) and small batch sizes, with no dropout so regularization doesn't confound the effect.
- Consider training on only a subset of wikitext-2.
- Show that larger dimensions make the effect worse.
- Measure the pre-softmax values and check whether they grow over training.
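A minimal sketch of that setup in PyTorch (the class name `TinyAttentionLM` and the exact architecture are hypothetical, not taken from this repo):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyAttentionLM(nn.Module):
    """Single-head, single-layer attention LM that records pre-softmax score magnitudes."""

    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.out = nn.Linear(d_model, vocab_size)
        self.score_stats = []  # max |pre-softmax score| per forward pass

    def forward(self, x):  # x: (batch, seq) token ids
        h = self.embed(x)
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        # Record the pre-softmax magnitudes *before* causal masking,
        # since the mask fills positions with -inf.
        self.score_stats.append(scores.detach().abs().max().item())
        causal = torch.triu(
            torch.ones(x.size(1), x.size(1), dtype=torch.bool, device=x.device),
            diagonal=1,
        )
        scores = scores.masked_fill(causal, float("-inf"))
        h = h + F.softmax(scores, dim=-1) @ v
        h = h + self.ff(h)
        return self.out(h)


# Usage: large d_model, tiny sequences, small batches, no dropout.
model = TinyAttentionLM(vocab_size=30000, d_model=1024)
x = torch.randint(0, 30000, (8, 4))  # batch of 8, sequence length 4
logits = model(x)
print(model.score_stats)  # track this across training steps, per d_model
```

If the theory holds, the recorded max |score| should climb over training, and faster for larger `d_model`.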

It would also be good to try a different rescaling implementation for LEAP, which could work better, since LEAP is still oddly unstable given how strongly scaled it should be.

mtanghu (Owner) commented Sep 2, 2022

The current rescaling seems good now, after figuring out that there need to be normalization terms in the denominator. All that's left is to work on experiments that validate the rescaling.
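For context, a minimal sketch of the general idea of putting normalization terms in the denominator, in the style of cosine-similarity rescaling (this is an illustration of the concept only; the actual LEAP implementation may differ):

```python
import torch

def rescaled_scores(q, k, eps=1e-5):
    # Divide the raw dot products by the product of the q and k norms,
    # so each pre-softmax score is bounded in [-1, 1] regardless of d_model.
    num = q @ k.transpose(-2, -1)                            # (batch, seq, seq)
    q_norm = q.norm(dim=-1, keepdim=True)                    # (batch, seq, 1)
    k_norm = k.norm(dim=-1, keepdim=True).transpose(-2, -1)  # (batch, 1, seq)
    return num / (q_norm * k_norm + eps)
```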

@mtanghu added the enhancement, help wanted, and good first issue labels and removed the help wanted and good first issue labels on Sep 3, 2022
mtanghu (Owner) commented Sep 3, 2022

Closing this issue as the rescaling implementation makes sense; now only experiments/measurements are needed, which is addressed in #17.

@mtanghu closed this as completed on Sep 3, 2022