
use_linear_attn = True produce noise and unstable loss #55

Closed
ken012git opened this issue Jun 13, 2022 · 5 comments
@ken012git

After moving from v0.0.60 to v0.1.10, I found that the Imagen loss is unstable in the early training steps and the results are noisy from the early stage.

[Screenshot: unstable loss curve]
[Image: noisy early-stage samples]

The problem goes away when I set use_linear_attn = False.

@lucidrains
Owner

@ken012git 🙏 do you want to try 0.2.4? i think i found the issue 🤦‍♂️

@ken012git
Author

Sure! Thanks for your immediate response!

I would also like to know what causes the issue. =)

@lucidrains
Owner

@ken012git forgot the residual 🤦 and also needed a feedforward after it anyways
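For illustration, a minimal NumPy sketch of the pattern described above: a residual connection around the attention output, followed by a feedforward block with its own residual. All names and the ReLU feedforward here are illustrative assumptions, not the repo's actual code:

```python
import numpy as np

def feedforward(x, w1, w2):
    # position-wise feedforward: linear -> ReLU -> linear (illustrative)
    return np.maximum(x @ w1, 0.0) @ w2

def attn_block(x, attn_fn, w1, w2):
    x = attn_fn(x) + x               # the previously-missing residual
    x = feedforward(x, w1, w2) + x   # feedforward added after attention
    return x
```

Without the first residual, the attention output replaces the features instead of refining them, which matches the unstable loss seen above.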

@ken012git
Author

Hi @lucidrains ,

I have tested v0.2.4 and the issue seems to be gone. Thanks!

# test model, resolution 64
from imagen_pytorch import Unet

unet1 = Unet(
    dim = 32,
    cond_dim = 512,
    dim_mults = (1, 2, 4, 8),
    num_resnet_blocks = (2, 2, 2, 2),    # small
    layer_attns = (False, False, False, True),
    layer_cross_attns = (False, False, False, True),
    # use_linear_attn = False,
    use_linear_attn = True,
)

Loss curve (blue: use_linear_attn = False, red: use_linear_attn = True):
[Screenshot: loss curve comparison]

Early-stage results (left: use_linear_attn = False, right: use_linear_attn = True):
[Screenshot: sample comparison]

I am wondering whether we should use transformer blocks or linear attention layers at this line, as configured by use_linear_attn.

Would you point me to relevant papers? Thanks!

@lucidrains
Owner

lucidrains commented Jun 13, 2022

@ken012git thank you for the experiments! basically, in a lot of papers, researchers remove attention past a certain token length (1024 or 2048) since it is prohibitively expensive due to the quadratic compute. but i like to substitute them with linear attention, even if it is a bit weaker. my favorite linear attention remains https://arxiv.org/abs/1812.01243 , and here i am also giving it a depthwise conv recommended by the primer paper
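For reference, a minimal NumPy sketch of the efficient attention from that paper (Shen et al., arXiv:1812.01243): queries are softmaxed over the feature dimension and keys over the sequence dimension, so the context matrix is dim × dim rather than seq × seq, making compute linear in sequence length. This is an illustration under those assumptions, not the repo's implementation, and it omits the depthwise conv from the Primer paper:

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def linear_attention(q, k, v):
    # q, k: (seq, dim), v: (seq, dim_v)
    q = softmax(q, axis=-1)                   # normalize queries over features
    k = softmax(k, axis=-2)                   # normalize keys over sequence
    context = np.einsum('nd,ne->de', k, v)    # (dim, dim_v) global context
    return np.einsum('nd,de->ne', q, context) # (seq, dim_v), linear in seq
```

Since the (dim × dim_v) context is formed first, cost scales as O(n·d·e) instead of the O(n²) of standard attention, which is why it remains affordable at long token lengths.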
