
MultiHeadAttention is missing a position_bias call argument, making Alibi or T5-style position embedding impossible to implement #18423

Closed
martin-gorner opened this issue Aug 2, 2023 · 10 comments

@martin-gorner (Contributor) commented Aug 2, 2023

Alibi and T5-style relative position embeddings modify the attention score computation instead of simply being added to the token embeddings.

The T5 implementation of MultiHeadAttention has a position_bias argument that allows this.

The Keras MultiHeadAttention seems to be missing this argument.

Without this, I don't think that implementing T5-style or Alibi relative position embeddings is possible.
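
For concreteness, the pattern such an argument enables is to add a bias to the attention scores before the softmax, rather than to the token embeddings. A minimal sketch in plain NumPy (the function and the position_bias shape are illustrative, not an existing Keras signature):

    import numpy as np

    def scaled_dot_product_attention(query, key, value, position_bias=None):
        # query/key/value: (batch, num_heads, seq_len, head_dim)
        # position_bias:   (1 or batch, num_heads, q_len, k_len), e.g. Alibi
        #                  slopes times distances, or T5 relative-position buckets
        scores = np.einsum("bhqd,bhkd->bhqk", query, key) / np.sqrt(query.shape[-1])
        if position_bias is not None:
            scores = scores + position_bias  # the bias shifts scores pre-softmax
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return np.einsum("bhqk,bhkd->bhqd", weights, value)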

@fchollet (Member) commented Aug 2, 2023

@mattdangerw what do you think?

@mattdangerw (Member) commented

Could add! Honestly, this is one of many extensions/variations to MHA we could consider...

  1. Cached attention, which is needed for any efficient decoding.
  2. Attention score biases, which are needed for alibi and t5 style attention.
  3. Rotary embeddings, which are applied to the key and query after the dense projections. (overall, seems a bit more popular than alibi or t5 bias)
  4. Multi query attention, grouped query attention. (I think this is supplanting multi-head as best practice)
  5. Probably more I am missing.

Of all of these, 1) is probably the most important, because without a caching solution, you wouldn't want to use the layer for anything generative in practice. And most of these new techniques are becoming popular strictly for generative models. But might be worth taking a step back...

Probably the first question we should answer is whether we want a robust attention layer offering to live in keras-core or keras-nlp.

After that, some design choices...

  • A single MHA layer with a lot of options?
  • A few different layers? And if so, where do we split? We face a cartesian product of features you could combine.
  • Subclassing as a preferred solution?

@fchollet (Member) commented

My strategy would be:

  1. Use subclassing for any one-off or new use case and make the layer subclass part of the model codebase.
  2. If a certain pattern of subclassing occurs often across multiple models, then we need to add it as a built-in feature for better UX.
  3. If it can be added to the layer in a "flat" way (meaning that it has no/low coupling with existing logic and does not add combinatorial complexity), we can add it as a layer argument on the MHA layer.
  4. If it interacts with the rest of the layer logic in a way that would make the MHA layer very complex, roll out a new layer (in KerasNLP, since it will be NLP-specific) that subclasses MHA, or maybe even rewrite it from scratch -- with the feature name in the name of the layer.

@mattdangerw (Member) commented Aug 15, 2023

Talked with Francois. I think we could make the following changes...

Add the following call arguments:

  • Add cache=None, or key_cache=None and value_cache=None, depending on whether we want a single tensor or two.
  • Add cache_index=None or cache_update_index=None (in KerasNLP we use the latter), which controls where the newly computed keys/values will update the cache (a rough usage sketch follows this list).
  • Add attention_bias=None. Not as popular as RoPE, but well defined and simple to add without making our layer spaghetti.
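
A rough usage sketch of how the cache arguments could work during generation (shapes and names are illustrative, not an implemented API): the new keys/values are written into a preallocated cache at cache_update_index, and attention then runs over everything seen so far.

    import numpy as np

    def cached_attention_step(query, new_key, new_value,
                              key_cache, value_cache, cache_update_index):
        # query/new_key/new_value: (batch, heads, step_len, head_dim) for this step
        # key_cache/value_cache:   (batch, heads, max_len, head_dim), preallocated
        end = cache_update_index + new_key.shape[2]
        key_cache[:, :, cache_update_index:end] = new_key
        value_cache[:, :, cache_update_index:end] = new_value
        # Attend over everything computed so far, not just the current step.
        keys, values = key_cache[:, :, :end], value_cache[:, :, :end]
        scores = np.einsum("bhqd,bhkd->bhqk", query, keys) / np.sqrt(query.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return np.einsum("bhqk,bhkd->bhqd", weights, values)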

MultiQueryAttention/GroupQueryAttention I am not sure about. Either we add some init arguments or make a subclass. Essentially we just need to allow controlling the query and key/value head counts separately.
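
To illustrate the head-count split: with, say, 8 query heads and 2 key/value heads, each key/value head is shared by a group of 4 query heads. A sketch of the repeat trick (illustrative only):

    import numpy as np

    def expand_kv_heads(key, value, num_query_heads):
        # key/value: (batch, num_key_value_heads, seq_len, head_dim)
        # Repeat each key/value head so a contiguous group of query heads shares it;
        # num_query_heads must be a multiple of num_key_value_heads.
        repeats = num_query_heads // key.shape[1]
        return np.repeat(key, repeats, axis=1), np.repeat(value, repeats, axis=1)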

For RoPE, we can recommend a subclass for now. There is some variation in how it's configured/applied, so it would be awkward to add arguments for it. I think this would do it...

    def _compute_attention(self, query, key, value, **kwargs):
        # Apply rotary position embeddings to the projected query and key
        # before the standard attention computation.
        query = rotary_embedding(query)
        key = rotary_embedding(key)
        return super()._compute_attention(query, key, value, **kwargs)
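
For reference, one common way the rotary_embedding helper above could be implemented (a sketch of the standard RoPE rotation; conventions for pairing the features vary between implementations):

    import numpy as np

    def rotary_embedding(x, base=10000):
        # x: (batch, num_heads, seq_len, head_dim), head_dim must be even.
        # Rotates each (even, odd) feature pair by a position-dependent angle.
        seq_len, head_dim = x.shape[2], x.shape[3]
        positions = np.arange(seq_len)[:, None]
        inv_freq = 1.0 / base ** (np.arange(0, head_dim, 2) / head_dim)
        angles = positions * inv_freq[None, :]  # (seq_len, head_dim // 2)
        cos, sin = np.cos(angles), np.sin(angles)
        x_even, x_odd = x[..., 0::2], x[..., 1::2]
        out = np.empty_like(x)
        out[..., 0::2] = x_even * cos - x_odd * sin
        out[..., 1::2] = x_even * sin + x_odd * cos
        return out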

@fchollet (Member) commented

> MultiQueryAttention/GroupQueryAttention I am not sure about. Either we add some init arguments or make a subclass. Essentially we just need to allow controlling the query and key/value head counts separately.

Ideally we just need to check that our current factoring of MHA makes it possible to implement this via a subclass without too much hassle.

@mattdangerw (Member) commented

> Ideally we just need to check that our current factoring of MHA makes it possible to implement this via a subclass without too much hassle.

I just did a multi-query attention layer for a model conversion, and I would say it's a hassle right now, sadly. I ended up abandoning a subclass and just rewriting the layer. But it should be doable to improve!

@martin-gorner (Contributor, Author) commented

Thank you for bringing a broader perspective to this conversation!

@fchollet fchollet transferred this issue from keras-team/keras-core Sep 22, 2023
@sachinprasadhs sachinprasadhs self-assigned this Apr 11, 2024
@sachinprasadhs sachinprasadhs added the type:feature The user is asking for a new feature. label Apr 11, 2024
@sachinprasadhs (Collaborator) commented

If the addition of the GroupedQueryAttention layer addresses the concern of this issue, feel free to close it. Thanks!

github-actions bot commented

This issue is stale because it has been open for 14 days with no activity. It will be closed if no further activity occurs. Thank you.

@github-actions github-actions bot added the stale label Apr 26, 2024
github-actions bot commented

This issue was closed because it has been inactive for 28 days. Please reopen if you'd like to work on this further.
