L2 attention is implemented wrong! #14
Comments
@PeterL1n thanks Peter! will get this all resolved this weekend. did they end up using tied qk for their final model?
@PeterL1n this is news to me that they are using the square of the euclidean distance; i will reread the original paper, thank you!
@PeterL1n if the token attends to itself, wouldn't it always have a distance of 0 and attend to itself the most? maybe it works out for their Lipschitz proof, but how does this make sense in the tied scenario?
The paper only proved that L2 attention with tied qk is Lipschitz for self-attention. It must be tied to be Lipschitz! It is also not Lipschitz for cross-attention, which is why GigaGAN's discriminator uses only self-attention. However, they used both self- and cross-attention in the generator; knowing the generator can't be Lipschitz, there is no point in using L2 attention there, so I believe they used regular dot-product attention for the generator. You are correct that with tied qk a token's distance to itself is always zero, and therefore it is always the most similar, so the token's own value is always included in the attention. Other positions can still have small L2 distances and take probability mass away from the self token. This is my understanding.
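A minimal sketch of the tied-qk point above (plain PyTorch, with the paper's temperature/scaling omitted; not code from this repo or thread): with k = q, each token's squared distance to itself is exactly zero, so its own logit is the row maximum and softmax always puts the largest single weight on the token itself.

```python
import torch

torch.manual_seed(0)
q = torch.randn(1, 5, 8)                    # (batch, tokens, dim); tied qk means k = q

dist_sq = torch.cdist(q, q, p=2).square()   # (1, 5, 5) pairwise squared L2 distances
logits = -dist_sq                           # L2 attention logits (scaling omitted)
attn = logits.softmax(dim=-1)

print(torch.diagonal(dist_sq, dim1=-2, dim2=-1))  # diagonal is 0: self-distance is zero
print(attn.argmax(dim=-1))                        # each token attends to itself most
```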
Let's roll with that! Thank you Peter for the review 🙏
it would be euclidean distance squared, so it would have to be quite close. that is strange. just thinking out loud
@PeterL1n do you want to see if 0.0.18 unblocks you for your research / startup?
@PeterL1n i will get back to wiring up the training code later this month
@PeterL1n reviewed the old deepmind paper and indeed it is the squared distance! thanks for catching this and correcting my misunderstanding
closing as it should be resolved, feel free to reopen if you note any further issues |
From the paper: https://arxiv.org/pdf/2006.04710.pdf

First, a token needs to attend to itself to ensure Lipschitz!

Second, `torch.cdist` is not the correct way to do it. Follow the original paper, which gives the formulation both for tied qk and for separate qk (see the sketch after this post). It is basically `torch.cdist().square()`, but more efficient, and it supports double backward for R1 regularization.

Last, I believe the paper only used L2 self-attention in the discriminator. The generator should still use dot-product attention.
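A sketch of the computation I understand the post to be suggesting (my reconstruction, not the original post's code; the function name and shapes are illustrative, and the paper's temperature is omitted): compute the logits as negative squared Euclidean distances via the expansion ||q_i - k_j||^2 = ||q_i||^2 - 2 q_i·k_j + ||k_j||^2, which matches -torch.cdist(q, k).square() but is cheaper and plays well with the double backward needed for the R1 penalty.

```python
import torch
from torch import einsum

def l2_attn_logits(q, k=None):
    """Negative squared-L2 attention logits; pass only q for tied qk (k = q)."""
    if k is None:                     # tied qk, as required for the Lipschitz result
        k = q
    q_sq = q.square().sum(dim=-1)                    # (b, i)  ||q_i||^2
    k_sq = k.square().sum(dim=-1)                    # (b, j)  ||k_j||^2
    dots = einsum('b i d, b j d -> b i j', q, k)     # (b, i, j)  q_i . k_j
    return 2 * dots - q_sq.unsqueeze(-1) - k_sq.unsqueeze(-2)   # = -||q_i - k_j||^2

# sanity check against the cdist formulation mentioned above
q = torch.randn(2, 6, 16)
assert torch.allclose(l2_attn_logits(q), -torch.cdist(q, q, p=2).square(), atol=1e-4)
```

The softmax over these logits then gives the attention weights as usual.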