use_linear_attn = True produces noise and unstable loss #55
Comments
@ken012git 🙏 do you want to try
Sure! Thanks for your immediate response! I would also like to know what causes the issue. =)
@ken012git forgot the residual 🤦 and also needed a feedforward after it anyways
Hi @lucidrains , I have tested v0.2.4 and the issue seems gone. Thanks!
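The fix described just above (restoring the missing residual connection and adding a feedforward after the attention) can be sketched roughly like this; the class and argument names are illustrative, not the library's actual code:

```python
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    """Illustrative block: residual around attention, then a residual feedforward."""
    def __init__(self, dim, attn: nn.Module, ff_mult=4):
        super().__init__()
        self.attn = attn
        self.ff = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, dim * ff_mult),
            nn.GELU(),
            nn.Linear(dim * ff_mult, dim),
        )

    def forward(self, x):        # x: (batch, seq_len, dim)
        x = self.attn(x) + x     # the residual that was missing
        x = self.ff(x) + x       # the feedforward that was added
        return x
```

Without the residual, the attention output replaces the features outright, which plausibly explains the noisy samples and unstable loss in the early steps.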
[Loss curve and early-stage result images attached.] I am wondering whether we should use transformer or linear attention layers at this line. Would you point me to relevant papers? Thanks
@ken012git thank you for the experiments! basically, in a lot of papers, researchers remove attention past a certain token length (1024 or 2048) since it is prohibitively expensive due to the quadratic compute. but i like to substitute them with linear attention, even if it is a bit weaker. my favorite linear attention remains https://arxiv.org/abs/1812.01243 , and here i am also giving it a depthwise conv as recommended by the primer paper
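The variant described above can be sketched as follows, assuming the "efficient attention" formulation of https://arxiv.org/abs/1812.01243 (softmax applied separately to queries and keys, so the cost is linear in sequence length) plus a depthwise conv on the projections as in the Primer paper. All names and hyperparameters here are illustrative, not the repository's actual implementation:

```python
import torch
import torch.nn as nn

class LinearAttention(nn.Module):
    def __init__(self, dim, heads=4, dim_head=8):
        super().__init__()
        inner = heads * dim_head
        self.heads = heads
        self.to_qkv = nn.Conv1d(dim, inner * 3, 1, bias=False)
        # depthwise conv after the q/k/v projection (Primer-style)
        self.dw_conv = nn.Conv1d(inner * 3, inner * 3, 3, padding=1, groups=inner * 3)
        self.to_out = nn.Conv1d(inner, dim, 1)

    def forward(self, x):                        # x: (batch, dim, seq_len)
        b, _, n = x.shape
        qkv = self.dw_conv(self.to_qkv(x))
        q, k, v = qkv.chunk(3, dim=1)
        q, k, v = (t.reshape(b * self.heads, -1, n) for t in (q, k, v))
        q = q.softmax(dim=1)                     # softmax over the feature dim
        k = k.softmax(dim=-1)                    # softmax over the sequence dim
        context = k @ v.transpose(1, 2)          # (b*h, d, d): no n-by-n matrix
        out = context.transpose(1, 2) @ q        # (b*h, d, n)
        return self.to_out(out.reshape(b, -1, n))
```

Because the `(d, d)` context matrix is formed first, compute scales as O(n·d²) rather than the O(n²·d) of standard attention, which is why it remains affordable at long token lengths.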
After moving from v0.0.60 to v0.1.10, I found the Imagen loss is unstable in the early training steps and the results are noisy from the early stage.
The problem is gone when I set
use_linear_attn = False
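On affected versions, the workaround amounts to passing the flag when building the Unet. A sketch, assuming the `use_linear_attn` keyword discussed in this thread; the other arguments are illustrative placeholders:

```python
from imagen_pytorch import Unet

# sketch: disable linear attention on pre-v0.2.4 versions;
# `dim` here is an illustrative placeholder value
unet = Unet(
    dim = 128,
    use_linear_attn = False,
)
```

As confirmed above, v0.2.4 fixes the underlying bug, so the flag can be re-enabled after upgrading.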