Multi-Scale Retention: Why include position embeddings explicitly? #48

Closed
fkodom opened this issue Aug 2, 2023 · 3 comments

fkodom commented Aug 2, 2023

My question is about the RetNet paper, on which the implementation here is based.

Why include the positional embedding updates directly in the multi-scale retention layer, rather than just applying them to the RetNet inputs?

[Screenshots of the multi-scale retention equations from the RetNet paper]

IMO, this seems overly specific to the language modeling use case. Other applications of retention/attention should be free to use whatever positional embeddings they need/want.

The retention formulation is still self-consistent (i.e. equivalent for parallel, recurrent, chunkwise) without explicitly including positional embeddings in the retention layer. See Equations (1) and (2):

[Screenshot of Equations (1) and (2) from the RetNet paper]

Instead of forcing positional embeddings into the retention formulation, we can just set A equal to the decay matrix D. The parallel/recurrent/chunkwise formulations are still equivalent, and we remove the hard-coded dependence on xPos embeddings in the retention layer.
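As a concrete check, here is a minimal sketch (not this repo's code; the tensor shapes and decay value are illustrative) showing that the parallel and recurrent forms still agree when A is just the scalar decay γ, i.e. when the xPos rotation is dropped and only the decay matrix D remains:

```python
import torch

torch.manual_seed(0)
B, T, d = 2, 8, 16          # batch, sequence length, head dim (illustrative)
gamma = 0.9                 # scalar decay; no xPos rotation anywhere below

q = torch.randn(B, T, d)
k = torch.randn(B, T, d)
v = torch.randn(B, T, d)

# Parallel form: (Q K^T ⊙ D) V, with D[n, m] = gamma^(n-m) for n >= m, else 0.
idx = torch.arange(T)
D = (gamma ** (idx[:, None] - idx[None, :])) * (idx[:, None] >= idx[None, :])
parallel_out = (q @ k.transpose(-1, -2) * D) @ v

# Recurrent form: s_n = gamma * s_{n-1} + k_n^T v_n,  o_n = q_n s_n.
s = torch.zeros(B, d, d)
recurrent_out = []
for t in range(T):
    s = gamma * s + k[:, t, :, None] * v[:, t, None, :]   # outer-product update
    recurrent_out.append(torch.einsum("bi,bij->bj", q[:, t], s))
recurrent_out = torch.stack(recurrent_out, dim=1)

print(torch.allclose(parallel_out, recurrent_out, atol=1e-4))  # True
```

The chunkwise form follows the same pattern (parallel within each chunk, recurrent across chunks), so it stays equivalent as well.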

Conceptually, I'm thinking about how to apply RetNet to other data domains (images, heterogeneous graphs, etc.). In those cases, xPos embeddings don't reflect the actual structure of the data (2D position in an image, generic position within a graph, etc.). Does it make sense to remove the explicit position embedding from the retention layer, or am I missing something?

sunyt32 (Contributor) commented Aug 3, 2023

$e^{i\theta}$ works well for language modeling, so we set it as the default. We haven't evaluated other domains yet, and I agree that the rotation may not be the best option there. An optimization trick may also be needed: naively making the angles learnable parameters causes NaNs in the gradients. You can try adjusting them manually, or explore a workable way to optimize them.
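For reference, a hypothetical sketch of what the naive "learnable angles" parameterization could look like (module name and initialization are made up for illustration; this is the variant the comment above warns is unstable, not a recommended recipe):

```python
import torch
import torch.nn as nn

class NaiveLearnableRotation(nn.Module):
    """Rotation angles as free parameters (hypothetical illustration only)."""

    def __init__(self, head_dim: int):
        super().__init__()
        # Initialize like a fixed RoPE/xPos frequency schedule, then let it train.
        freqs = 1.0 / (10000 ** (torch.arange(0, head_dim, 2).float() / head_dim))
        self.theta = nn.Parameter(freqs)            # [head_dim // 2]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq_len, head_dim]; rotate each (even, odd) channel pair by n * theta.
        n = torch.arange(x.shape[1], device=x.device, dtype=x.dtype)[:, None]
        angles = n * self.theta                     # [seq_len, head_dim // 2]
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = x[..., 0::2], x[..., 1::2]
        return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```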

donglixp (Contributor) commented Aug 3, 2023

It depends on how you interpret "position embeddings". For example, we can also add position embeddings (such as a "generic position within a graph") to the token embeddings, treating the positions as attributes.
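A minimal sketch of that idea (module and parameter names are illustrative, not from this repo): embed the positions as attributes, add them to the token embeddings, and feed the result to the RetNet blocks, leaving the retention layers free of any built-in positional term.

```python
import torch
import torch.nn as nn

class EmbeddingWithPositionAttributes(nn.Module):
    """Add attribute-style position embeddings to token embeddings (illustrative)."""

    def __init__(self, vocab_size: int, num_positions: int, d_model: int):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(num_positions, d_model)

    def forward(self, tokens: torch.Tensor, positions: torch.Tensor) -> torch.Tensor:
        # tokens, positions: [batch, seq_len]; `positions` can be any discrete
        # attribute (graph node index, flattened 2D patch coordinate, ...).
        return self.token_embed(tokens) + self.pos_embed(positions)
```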

fkodom (Author) commented Aug 8, 2023

Thanks! This is exactly what I was looking for. 😎
