
Maybe scale is wrong #3

Open
denadai2 opened this issue May 20, 2022 · 3 comments

denadai2 commented May 20, 2022

```python
sim = einsum('b h i d, b j d -> b h i j', q, k) * scale
```

Shouldn't this be (1-scale)?

denadai2 changed the title from "Maybe sim is wrong" to "Maybe scale is wrong" on May 20, 2022
lucidrains (Owner) commented

Ohh no, that is actually the learned temperature from a variant of attention (cosine-similarity attention): https://github.com/lucidrains/x-transformers#query-key-normalization. The temperature is kept in log space and exponentiated here: https://github.com/lucidrains/memorizing-transformers-pytorch/blob/main/memorizing_transformers_pytorch/memorizing_transformers_pytorch.py#L235
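For context, here is a minimal sketch of what that line is doing under the cosine-similarity-attention variant described above. The shapes, tensor names, and temperature initialization are illustrative only, not the repo's exact code:

```python
import math
import torch
import torch.nn.functional as F
from torch import einsum

# illustrative shapes: q is (batch, heads, seq, dim_head); k is shared across heads
b, h, n, d = 2, 8, 128, 64
q = torch.randn(b, h, n, d)
k = torch.randn(b, n, d)

# learned per-head temperature, stored in log space (initial value is hypothetical)
log_scale = torch.full((h, 1, 1), math.log(10.), requires_grad = True)

# l2-normalize queries and keys so their dot product is a cosine similarity in [-1, 1]
q, k = F.normalize(q, dim = -1), F.normalize(k, dim = -1)

# exponentiate the log-space temperature, then use it to scale the similarities,
# which is where the `* scale` in the quoted line comes from
scale = log_scale.exp()
sim = einsum('b h i d, b j d -> b h i j', q, k) * scale

attn = sim.softmax(dim = -1)
```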


lucidrains commented May 20, 2022

@denadai2 Ohh, if you were looking for the sigmoid gating, I removed that, since it was not working well for me or for another researcher (we thought that was one of the weaker parts of the paper). I went with the other researcher's suggestion of attending across both sets of similarities, local and distant (a softmax across the concatenated attention logits).
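For anyone following along, here is a rough sketch of that idea: a single softmax over the concatenated local and memory attention logits instead of a learned sigmoid gate. Shapes and tensor names are made up for illustration, causal masking is omitted, and this is not the exact code from this repo:

```python
import torch
from torch import einsum

# illustrative shapes: n local positions, m retrieved memories per query
b, h, n, m, d = 2, 8, 128, 32, 64
q       = torch.randn(b, h, n, d)
k_local = torch.randn(b, n, d)
v_local = torch.randn(b, n, d)
k_mem   = torch.randn(b, h, n, m, d)   # per-query keys retrieved from the kNN memory
v_mem   = torch.randn(b, h, n, m, d)

scale = d ** -0.5

# attention logits to the local context and to the retrieved (distant) memories
sim_local = einsum('b h i d, b j d -> b h i j', q, k_local) * scale      # (b, h, n, n)
sim_mem   = einsum('b h i d, b h i j d -> b h i j', q, k_mem) * scale    # (b, h, n, m)

# instead of a learned sigmoid gate, concat the logits and softmax across both at once
sim  = torch.cat((sim_mem, sim_local), dim = -1)                         # (b, h, n, m + n)
attn = sim.softmax(dim = -1)
attn_mem, attn_local = attn[..., :m], attn[..., m:]

# aggregate values from both sources with the jointly-normalized attention weights
out = einsum('b h i j, b h i j d -> b h i d', attn_mem, v_mem) \
    + einsum('b h i j, b j d -> b h i d', attn_local, v_local)
```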


denadai2 commented May 20, 2022

Thanks for the prompt answer! I see it now :)

By the way, I'd say this increases the complexity... it makes sense, though.
