MultiHeadAttention is missing a position_bias call argument, making ALiBi or T5-style position embeddings impossible to implement #18423

ALiBi and T5 relative position embeddings modify the attention computation itself, rather than simply being added to the token embeddings. The T5 implementation of MultiHeadAttention has a position_bias argument that allows this. The Keras MultiHeadAttention seems to be missing this argument. Without it, I don't think implementing T5-style or ALiBi relative position embeddings is possible.
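For context, this is the hook being requested: a bias tensor added to the pre-softmax attention scores. A minimal NumPy sketch of the idea (the function names are illustrative, not Keras API, and the ALiBi bias shown is a simplified symmetric variant of the paper's causal formulation):

```python
import numpy as np

def attention_with_position_bias(query, key, value, position_bias):
    # query, key, value: (batch, heads, seq, head_dim)
    # position_bias: (1, heads, seq, seq), added to the raw scores
    scores = query @ np.swapaxes(key, -1, -2) / np.sqrt(query.shape[-1])
    scores = scores + position_bias  # the hook this issue asks for
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ value

def alibi_bias(num_heads, seq_len):
    # Per-head linear penalty on query/key distance (simplified,
    # symmetric variant; the ALiBi paper applies it causally).
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    distances = np.arange(seq_len)[None, :] - np.arange(seq_len)[:, None]
    return (slopes[:, None, None] * -np.abs(distances))[None, ...]
```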
Comments
@mattdangerw what do you think?
Could add! Honestly, this is one of many extensions/variations to MHA we could consider...
Of all of these, probably the first question we should answer is whether we want a robust attention-layer offering to live in keras-core or in keras-nlp. After that, some design choices...
My strategy would be:
Talked with Francois. I think we could make the following changes... Add the following call arguments:

MultiQueryAttention/GroupQueryAttention I am not sure about. Either we add some init arguments or make a subclass. Essentially we just need to allow controlling the query and key/value head counts separately.

For RoPE, we can recommend a subclass for now. There is some variation in how it's configured/applied, so it would be awkward to add arguments for it. I think this would do it...

```python
def _compute_attention(self, query, key, value, **kwargs):
    # Rotate queries and keys, then delegate to the stock attention.
    # (rotary_embedding is assumed to be available on the subclass.)
    query = rotary_embedding(query)
    key = rotary_embedding(key)
    return super()._compute_attention(query, key, value, **kwargs)
```
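To make the "query and key/value head counts controlled separately" idea concrete, here is a hedged NumPy sketch (expand_kv_heads is an illustrative name, not an existing Keras function): grouped-query attention runs many query heads against fewer key/value heads by repeating each K/V head across its group.

```python
import numpy as np

def expand_kv_heads(kv, num_query_heads):
    # kv: (batch, num_kv_heads, seq_len, head_dim)
    # Repeat each key/value head so every query head has a matching
    # K/V head; num_query_heads must be a multiple of num_kv_heads.
    num_kv_heads = kv.shape[1]
    assert num_query_heads % num_kv_heads == 0
    return np.repeat(kv, num_query_heads // num_kv_heads, axis=1)
```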
Ideally we just need to check that our current factoring of MHA makes it possible to implement this via a subclass without too much hassle.
I just did a multi-query attn layer for a model conversion, and I would say it's a hassle right now sadly. I ended up abandoning a subclass and just rewriting the layer. But should be doable to improve!
Thank you for bringing a broader perspective to this conversation!
With the addition of the GroupedQueryAttention layer, if the concern of this issue is addressed, feel free to close the issue. Thanks!
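For reference, a minimal usage sketch of that layer, assuming it landed in Keras 3 as keras.layers.GroupQueryAttention with head_dim/num_query_heads/num_key_value_heads arguments (verify the exact name and signature against the current API docs):

```python
import numpy as np
import keras

# Assumed signature; check the installed Keras version before relying on it.
gqa = keras.layers.GroupQueryAttention(
    head_dim=32, num_query_heads=8, num_key_value_heads=2
)
x = np.random.normal(size=(2, 10, 64)).astype("float32")
y = gqa(query=x, value=x)  # self-attention; key defaults to value
print(y.shape)  # expected (2, 10, 64): output projects back to the query dim
```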
This issue is stale because it has been open for 14 days with no activity. It will be closed if no further activity occurs. Thank you.

This issue was closed because it has been inactive for 28 days. Please reopen if you'd like to work on this further.