
Add support for Keras mask & causal mask to MultiHeadAttention #16619

Merged
merged 6 commits into from Jul 11, 2022

Conversation

ageron
Contributor

@ageron ageron commented May 29, 2022

This is a first go at adding Keras masking support and causal masking support to the MultiHeadAttention layer.

See the discussion in #16248.

@ageron ageron changed the title Add support for Keras masking and causal masking Add support for Keras mask & causal mask to MultiHeadAttention May 29, 2022
Member

@fchollet fchollet left a comment


Thank you for the PR! For consistency, I would favor the same API as the one used in KerasNLP here: https://github.com/keras-team/keras-nlp/blob/master/keras_nlp/layers/transformer_decoder.py#L191

As in, a use_causal_mask argument in call(). It would not change the PR much otherwise.
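
For reference, a minimal sketch of what that call()-level API would look like (illustrative only; the exact signature is what this PR is deciding):

```python
import tensorflow as tf

# Hypothetical usage of the proposed API: causality requested via a
# call() argument rather than a constructor argument.
mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=16)
x = tf.random.normal((4, 10, 32))  # (batch, seq_len, features)
# Each token attends only to itself and earlier tokens.
out = mha(x, x, use_causal_mask=True)
```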

@fchollet
Member

FYI @mattdangerw @chenmoneygithub

@ageron
Contributor Author

ageron commented May 29, 2022

Hi @fchollet , thanks for reviewing this PR.

Note that the keras.layers.Attention layer uses an init parameter called causal, as noted by @haifeng-jin in #16248, so it seems to me that consistency within Keras is more important than consistency with KerasNLP.

That said, I just made the change you requested and I'm happy to push it if you prefer; I don't have a strong opinion either way. Perhaps it's the Attention layer that should be updated to use use_causal_mask in the call() method instead.
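
For context, the two conventions side by side (a sketch for comparison only):

```python
import tensorflow as tf

# Existing convention in keras.layers.Attention: causality is an
# __init__ argument.
attn = tf.keras.layers.Attention(causal=True)

# Proposed convention for MultiHeadAttention (per KerasNLP): causality
# is a call() argument instead.
mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=16)
q = tf.random.normal((2, 5, 16))
out = mha(q, q, use_causal_mask=True)
```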

@fchollet
Member

This is a good question -- we should make a choice one way or the other and standardize everything (Keras, KerasNLP) on that choice. Let me chat with the team and we'll figure out what to do.

@chenmoneygithub
Contributor

@fchollet To me, causal is not the best argument name, as it is too abstract. I did a code search, and there are only a few usages of the causal argument of tf.keras.layers.Attention in google3, so changing the argument name won't be too hard.

@ageron
Contributor Author

ageron commented May 30, 2022

Sounds good. I just replaced causal in the constructor with use_causal_mask in the call() method.

@gbaned gbaned added this to Assigned Reviewer in PR Queue via automation May 30, 2022
@fchollet
Member

I also think it makes sense to have the argument in call() because that's where we pass the other masking-related arguments. So let's keep use_causal_mask in call().

@chenmoneygithub could you please file an issue to change the Attention layer API to adopt this convention?

@chenmoneygithub
Contributor

@fchollet Sure!

PR Queue automation moved this from Assigned Reviewer to Approved by Reviewer Jun 2, 2022
Member

@fchollet fchollet left a comment


LGTM

@google-ml-butler google-ml-butler bot added kokoro:force-run and ready to pull labels Jun 2, 2022
@mattdangerw
Member

@ageron did some more testing and came across a few things.

  1. I think we need to be more defensive about assuming mask types. We have real-world usage where the mask (explicit or implicit) is floating point, which leads to errors when doing the bitwise &. Maybe we can just make sure we cast all masks to bool before working with them (see the sketch after this list)? We should probably add a test for this too.

  2. In an encoder where Q, K, and V are all the same, with an implicit mask, this code looks like it will produce an attention mask with zeros along both the bottom and the right. I am used to seeing an attention mask with zeros along only one of those dimensions. This colab shows the difference. Am I reading your code correctly? Do you know what the right approach is here?
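
A minimal sketch of the casting suggested in point 1, with illustrative names (not the PR's actual code):

```python
import tensorflow as tf

def combine_masks(attention_mask, causal_mask):
    # Coerce both masks to bool first: real-world masks may be
    # float-valued (e.g. 0./1.), and bitwise & on floats raises an error.
    attention_mask = tf.cast(attention_mask, tf.bool)
    causal_mask = tf.cast(causal_mask, tf.bool)
    return attention_mask & causal_mask  # & is logical AND on bool tensors
```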

@ageron
Contributor Author

ageron commented Jun 15, 2022

Hi @mattdangerw ,
Sorry for the late response, got Covid in the family.
Thanks for the careful review, you make two good points.
I'll take care of the mask types.
However, regarding the padding difference: after some thought, I'm not sure it matters, since it only affects masked-out tokens, whose outputs will normally be ignored anyway.

For example, let's take a three word sentence padded to 5 tokens like this:

"I am happy <pad> <pad>"

With causal self-attention, we want "I" to attend only to "I", "am" to attend only to "I am", and "happy" to attend only to "I am happy". So far so good. Beyond that, it doesn't really matter what the two padding tokens attend to, since their output representations will be ignored downstream anyway.

I tried to think of ways in which it could matter, but I couldn't find any, other than possibly a performance difference. Wdyt?
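
To make this concrete, a small sketch of the two mask variants for this sentence (illustrative only, not the PR's code):

```python
import numpy as np

# "I am happy <pad> <pad>": 5 positions, the last 2 are padding.
padding_mask = np.array([True, True, True, False, False])
causal_mask = np.tril(np.ones((5, 5), dtype=bool))  # lower-triangular

# Variant 1: mask keys only (zeros along the right).
keys_only = causal_mask & padding_mask[np.newaxis, :]
# Variant 2: mask queries too (zeros along the right AND the bottom).
both = keys_only & padding_mask[:, np.newaxis]

# The variants differ only in rows 3 and 4, i.e. the padding tokens'
# queries, whose output representations are ignored downstream anyway.
print((keys_only != both).any(axis=1))  # [False False False  True  True]
```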

@mattdangerw
Member

> I tried to think of ways in which it could matter, but I couldn't find any, other than possibly a performance difference. Wdyt?

Yeah, that makes sense to me. Thanks for the explanation. Let's go with what we have.

Could you add the defensive casting to bool types for the masks, and a test?

@ageron
Contributor Author

ageron commented Jun 25, 2022

Hi @mattdangerw ,
Thanks for your feedback, I'll add the defensive casting + test by ~Wednesday, sounds good?

@google-ml-butler google-ml-butler bot removed the ready to pull label Jun 26, 2022
@ageron
Contributor Author

ageron commented Jun 26, 2022

Hi @mattdangerw, I just added defensive casting to bool (with tests).
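
A sketch of the kind of test this adds (shapes and names illustrative, not the exact test code):

```python
import tensorflow as tf

mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=8)
x = tf.random.normal((2, 4, 8))
# A float-valued attention mask (not bool): the layer should cast it
# defensively instead of failing on the bitwise & used to combine masks.
float_mask = tf.ones((2, 4, 4), dtype=tf.float32)
out = mha(x, x, attention_mask=float_mask, use_causal_mask=True)
assert out.shape == (2, 4, 8)
```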

@mattdangerw
Member

@ageron thank you so much! I will kick off another round of testing on this.

@google-ml-butler google-ml-butler bot added kokoro:force-run and ready to pull labels Jun 29, 2022
@mattdangerw
Member

(approving just to trigger our import flow for testing)

@mattdangerw
Member

Still looking at this. We have one production failure from an oddly shaped Keras implicit mask we are trying to figure out.

Hopefully no more action needed here, but will ping if anything comes up!

@copybara-service copybara-service bot merged commit 645b361 into keras-team:master Jul 11, 2022
PR Queue automation moved this from Approved by Reviewer to Merged Jul 11, 2022
@ageron ageron deleted the mha_automask branch July 12, 2022 17:40