Issue 182: Modified TransformerDecoder with optional parameter #217
Conversation
…d and edited tests
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
Thanks for the PR! Dropped some comments.
)

if encoder_sequence is not None:
    # Encoder-decoder attention.
Move this comment over `self._encoder_decoder_attention_layer`.
        self._feedforward_layernorm,
    )
else:
    # Skip Encoder-Decoder attention, Feedforward.
This is a bit confusing - the comma "," could suggest Feedforward is skipped as well. Maybe just say "# Skip Encoder-Decoder attention if no encoder_sequence is provided."?
output = decoder(decoder_input)
model = keras.Model(
    inputs=decoder_input,
    outputs = output,
remove the space surrounding "=" => outputs=output
    use_causal_mask=True,
)

def test_valid_call_without_encoder_with_mask(self):
We can delete this test case because it is covered by test_valid_call_with_mask
Right now there are two things that indicate decoder-only, and they can conflict. The first is the `decoder_only` attribute that is passed in at initialization. The second is implicit in the optional `encoder_sequence` parameter. These are the current behaviors for these two things:
I added this comment to the docstring. Let me know if this is enough to explain and whether it is intuitive, or if I should make any changes, thanks!
@jessechancy Yea, we should throw an explicit error message to our users if the two places contradict:
@mattdangerw Does this look good to you?
Example for TransformerDecoder usage:
…hether encoder_sequence input is received
    self._feedforward_layernorm,
)

if self._encoder_decoder_attention_layer is None:
One minor comment. It might be nice if you rename the self_attended variable to attention_output, and do something like this.
attention_output = self._add_and_norm(...)
if encoder_sequence is not None:
    ... cross attention ...
    attention_output = self._add_and_norm(...)
feed_forward_output = self._feed_forward(attention_output)
return self._add_and_norm(...)
So basically bring this back to the single return statement. As a reader, that would make it much clearer how the computation is flowing overall with and without encoder_sequence.
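For concreteness, a minimal sketch of that single-return structure for the layer's `call` method, assuming the helper names from the comment above (`_add_and_norm`, `_feed_forward`) with an assumed signature of `_add_and_norm(residual_input, layer_output, norm_layer)`, and attribute names from elsewhere in this PR; the real implementation may differ:

```python
# Sketch only: helper and attribute names follow this PR's discussion and are
# not guaranteed to match the final code exactly.
def call(self, decoder_sequence, encoder_sequence=None):
    # Self-attention over the decoder sequence (causal mask always applied).
    attention_output = self._self_attention_layer(
        decoder_sequence, decoder_sequence, use_causal_mask=True
    )
    attention_output = self._add_and_norm(
        decoder_sequence, attention_output, self._self_attention_layernorm
    )
    if encoder_sequence is not None:
        # Cross-attention, only when an encoder sequence is provided.
        cross_attended = self._cross_attention_layer(
            attention_output, encoder_sequence
        )
        attention_output = self._add_and_norm(
            attention_output, cross_attended, self._cross_attention_layernorm
        )
    # Feedforward block, then a single return for both code paths.
    feed_forward_output = self._feed_forward(attention_output)
    return self._add_and_norm(
        attention_output, feed_forward_output, self._feedforward_layernorm
    )
```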
Thanks! This looks great. Left a few comments, mostly minor
[guide](https://keras.io/guides/understanding_masking_and_padding/)
for more details.
If decoder_only is set to True, the encoder layer would not be built,
We have removed this argument. We should remove docs too.
We should update the class-level docs with a few things:
- In the second paragraph about masking. Add as a first sentence, "This layer will always apply a causal mask to the decoder attention layer."
- Add a new paragraph. Some suggested text below:
This layer can be called with either one or two inputs as follows:
- `layer(decoder_sequence)`: no cross-attention will be built into the decoder
block. This is useful when building a "decoder-only" transformer such as GPT-2.
- `layer(decoder_sequence, encoder_sequence)`: cross-attention will be built into
the encoder block. This is useful when building an "encoder-decoder" transformer,
such as the original transformer model described in Attention is All You Need.
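As a usage sketch of the two calling conventions above (argument values and shapes are illustrative only, and the two paths use separate layer instances since the number of call arguments should not change between calls):

```python
import numpy as np
import keras_nlp

decoder_sequence = np.random.uniform(size=(2, 10, 32))  # (batch, dec_len, hidden)
encoder_sequence = np.random.uniform(size=(2, 12, 32))  # (batch, enc_len, hidden)

# Decoder-only call: no cross-attention is built into the block.
decoder_only_layer = keras_nlp.layers.TransformerDecoder(
    intermediate_dim=64, num_heads=4
)
decoder_only_output = decoder_only_layer(decoder_sequence)

# Encoder-decoder call: cross-attention is built into the decoder block.
cross_attention_layer = keras_nlp.layers.TransformerDecoder(
    intermediate_dim=64, num_heads=4
)
cross_attention_output = cross_attention_layer(decoder_sequence, encoder_sequence)
```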
defaults to "zeros". The bias initializer for | ||
the dense and multiheaded attention layers. | ||
name: string, defaults to None. The name of the layer. | ||
decoder_only: bool, defaults to False. If True, only the decoder layers |
remove
self.supports_masking = True

def _build(self, input_shape):
def _build(self, input_shape, cross_attention):
Maybe `include_cross_attention`, so it is more obvious this is a boolean value?
raise ValueError(
    f"The number of call arguments to "
    f"`keras_nlp.layers.TransformerDecoder` should not change."
    f"\nUse `layer(decoder_sequence, encoder_sequence)` to "
Remove all the `\n` in both error messages.
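For illustration, a sketch of what the first message could read like with the `\n` removed (note the trailing space after "change." so the concatenated pieces stay readable); the text after "to " is cut off in the quoted diff, so the final string below is only a hypothetical placeholder:

```python
# Sketch only: the last string is a hypothetical placeholder for the part of
# the message that is not visible in the quoted diff.
raise ValueError(
    "The number of call arguments to "
    "`keras_nlp.layers.TransformerDecoder` should not change. "
    "Use `layer(decoder_sequence, encoder_sequence)` to "
    "call the layer with cross-attention."  # hypothetical placeholder
)
```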
    encoder_sequence, encoder_padding_mask, encoder_attention_mask
)
# Encoder-decoder attention.
encoder_decoder_attended = self._encoder_decoder_attention_layer(
Let's clean up some variable names:
- `_encoder_decoder_attention_layer` -> `_cross_attention_layer`
- `_enc_dec_attentiondropout` -> `_cross_attention_dropout`
- `_enc_dec_attention_layernorm` -> `_cross_attention_layernorm`
- `encoder_decoder_attended` -> `cross_attended`
output = decoder(encoder_input, decoder_input)
output = decoder(decoder_input, encoder_input)
# should raise ValueError if encoder_input is not provided
try:
Remove the try block. You can add a separate test for these using `self.assertRaises(ValueError)`. There are other examples in this test file.
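A minimal sketch of such a test using `self.assertRaises(ValueError)`; the test name, layer arguments, and tensor shapes here are illustrative, not taken from the PR:

```python
import tensorflow as tf
import keras_nlp


class TransformerDecoderTest(tf.test.TestCase):
    def test_error_when_cross_attention_input_is_dropped(self):
        # Build the layer with cross-attention, then call it without an
        # encoder sequence; per this PR, that should raise a ValueError.
        decoder = keras_nlp.layers.TransformerDecoder(
            intermediate_dim=4, num_heads=2
        )
        decoder_input = tf.random.uniform(shape=(2, 4, 6))
        encoder_input = tf.random.uniform(shape=(2, 4, 6))
        decoder(decoder_input, encoder_input)
        with self.assertRaises(ValueError):
            decoder(decoder_input)
```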
    use_causal_mask=True,
output = decoder(decoder_input)
# should raise ValueError if encoder_input is provided
try:
Same here, remove the try/except.
self.assertGreater(len(grad), 1)
optimizer.apply_gradients(zip(grad, model.trainable_variables))

def test_one_training_step_of_transformer_without_encoder(self):
`without_encoder` -> `without_cross_attention`, here and elsewhere.
model_output = model(decoder_sequence)
loaded_model_output = loaded_model(decoder_sequence)
self.assertAllClose(model_output, loaded_model_output)
Remove extra newlines
LGTM! Thanks.
One nit, and I think there are some format issues still.
decoder block. This is useful when building a "decoder-only"
transformer such as GPT-2.
`layer(decoder_sequence, encoder_sequence)`: cross-attention will be
built into the encoder block. This is useful when building an
encoder block -> decoder block
Made `encoder_sequence` an optional parameter and added testing for this change.