Add FNet Encoder Layer #43
Conversation
Hey, @mattdangerw. Any thoughts on this?
chenmoneygithub
left a comment
Thanks for the PR!
Dropped some comments, and please add unit tests to test the layer's functionality, thanks!
keras_nlp/layers/fnet_encoder.py
Outdated

    if not self._built:
        self._build(inputs.shape)

    # Apply fourier transform on the input. Note: We don't have padding
I want to understand this a bit more - are you saying the original paper does not do padding, and takes a RaggedTensor as input? Thanks!
So, as far as I understand, they do pad the sequences so that every sequence is of the same length. See this: https://github.com/google-research/google-research/blob/master/f_net/layers.py#L106. The expected shape is (bsz, max_seq_length, hidden_dim).
However, the padding mask is not used at all. They delete it here: https://github.com/google-research/google-research/blob/master/f_net/layers.py#L114. So, the mixing is done over all the tokens, including the padding tokens. See the input pipeline here: https://github.com/google-research/google-research/blob/master/f_net/input_pipeline.py#L100 and https://github.com/google-research/google-research/blob/master/f_net/input_pipeline.py#L107-L109.
My guess would be that FLAX/JAX probably doesn't have the equivalent of a Ragged Tensor. Hence, they had to pad the sequences in order to be able to batch them up. As to why they did not mask the pad tokens, I'm not sure.
Edit: I've mailed the authors of FNet: Mixing Tokens with Fourier Transforms asking them about why the padding mask is deleted.
I received a reply from the first author, James Lee-Thorp. This is the gist of the message:
For FNet, padding tokens are used, but they are not masked. There is a possibility that the model may become sensitive to samples with varying numbers of padding tokens...but in practice, the authors did not notice any major performance issues. So, I think we can keep it the same; we don't need to apply padding masks.
If we do want to apply a padding mask, we should only apply it after the embedding layer, and not after every Fourier sublayer. This is because if we apply it after every Fourier sublayer, we will end up zeroing out some frequencies, which is counter-productive.
I'll add a more detailed comment on Line 129; the current comment doesn't properly convey the meaning.
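The mask-free mixing described above can be sketched in a few lines of NumPy; the shapes and values here are made up purely for illustration:

```python
import numpy as np

# Hypothetical shapes for illustration: 2 padded sequences,
# max_seq_length=4, hidden_dim=8.
batch_size, max_seq_length, hidden_dim = 2, 4, 8
rng = np.random.default_rng(0)
x = rng.normal(size=(batch_size, max_seq_length, hidden_dim))

# Zero out the last two positions of the second sequence to mimic pad tokens.
x[1, 2:, :] = 0.0

# FNet-style mixing: 2D FFT over the (sequence, hidden) axes, real part only.
# No padding mask is applied, so padded positions take part in the mixing.
mixed = np.real(np.fft.fft2(x, axes=(-2, -1)))

assert mixed.shape == x.shape
```

Because every output position is a weighted sum over all (position, feature) pairs, even the all-zero pad rows come out non-zero after mixing - which is also why re-masking after every Fourier sublayer would distort the spectrum.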
Actually, if it's ok, maybe wait until this Tuesday to fully flesh this out? We have an internal sync meeting then, and will discuss where we want to set our bar (e.g. with citations) for inclusion in this repo. I'm pretty sure we will want this, but if you wait till then, we could give a more confident green light. Thanks for your patience! Really do appreciate the contributions!
Sure! Thank you :)
Addressed the first round of comments. Thank you, @chenmoneygithub, for reviewing the code :)
Thanks for your patience @abheesht17! We discussed in our meeting and confirmed we would like to add this. Please do go ahead and add tests. Also, if you could add initializer arguments similar to #50, that would be great. Thank you!
Awesome, @mattdangerw! Thank you :). I'll add the tests and the initialiser arguments. One more thing - could you and @chenmoneygithub please give #43 (comment) and #43 (comment) a look so that we can have closure on the padding mask issue? Should I also add a text classification example (stacks of FNet Encoder layers, with a dense layer on top), or should we leave that to a future PR?
@mattdangerw, @chenmoneygithub, I've done the following:
Let me know if it's okay. Thanks!
fchollet
left a comment
Thanks for the PR!
keras_nlp/layers/fnet_encoder.py
Outdated
    self.bias_initializer = keras.initializers.get(bias_initializer)
    self._built = False

    def _build(self, input_shape):
This should just be the built-in build()
Ah, overriding the build() function in tf.keras.layers.Layer? Done! Should I make the same changes to the TransformerEncoder class?
https://github.com/keras-team/keras-nlp/blob/master/keras_nlp/layers/transformer_encoder.py#L83
The default build() only works if your call() method takes one input; the TransformerEncoder takes multiple inputs, so the default build() cannot be used there.
Ohh, I see. Thanks! :)
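For reference, a minimal sketch of relying on the built-in build() hook when call() takes a single input (the layer name and sizes here are made up, and this is not the keras-nlp implementation):

```python
import tensorflow as tf
from tensorflow import keras

class TinyFeedForward(keras.layers.Layer):
    """Toy layer: Keras calls build(input_shape) automatically before the
    first call(), so no manual `self._built` bookkeeping is needed."""

    def __init__(self, intermediate_dim, **kwargs):
        super().__init__(**kwargs)
        self.intermediate_dim = intermediate_dim

    def build(self, input_shape):
        # The last input dimension is known here, so sublayers can be created.
        hidden_dim = input_shape[-1]
        self.intermediate_dense = keras.layers.Dense(
            self.intermediate_dim, activation="relu"
        )
        self.output_dense = keras.layers.Dense(hidden_dim)

    def call(self, inputs):
        return self.output_dense(self.intermediate_dense(inputs))

layer = TinyFeedForward(intermediate_dim=16)
outputs = layer(tf.ones((2, 4, 8)))  # build() runs here with shape (2, 4, 8)
```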
keras_nlp/layers/fnet_encoder.py
Outdated
    )
    self._output_dropout = keras.layers.Dropout(rate=self.dropout)

    def _fourier_transform(self, input):
In-line it in call
Done!
keras_nlp/layers/fnet_encoder.py
Outdated
    mixing_output = tf.math.real(tf.signal.fft2d(input))
    return mixing_output

    def _add_and_norm(self, input1, input2, norm_layer):
In-line it in call
Done!
keras_nlp/layers/fnet_encoder.py
Outdated
    def _add_and_norm(self, input1, input2, norm_layer):
        return norm_layer(input1 + input2)

    def _feed_forward(self, input):
In-line it in call
Done!
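Put together, the in-lined call() might look roughly like this (a sketch with assumed sublayer names, not the final PR code; note the explicit cast to complex64, which is needed because tf.signal.fft2d expects a complex input):

```python
import tensorflow as tf
from tensorflow import keras

def fnet_encoder_call(inputs, mixing_norm, output_norm,
                      intermediate_dense, output_dense, output_dropout):
    # Fourier mixing sublayer: 2D FFT over (sequence, hidden), real part only.
    mixing_output = tf.math.real(
        tf.signal.fft2d(tf.cast(inputs, tf.complex64))
    )
    # Add & norm around the mixing sublayer.
    mixing_output = mixing_norm(inputs + mixing_output)

    # Feed-forward sublayer, then add & norm around it.
    x = intermediate_dense(mixing_output)
    x = output_dense(x)
    x = output_dropout(x)
    return output_norm(x + mixing_output)

out = fnet_encoder_call(
    tf.random.normal((2, 4, 8)),
    mixing_norm=keras.layers.LayerNormalization(),
    output_norm=keras.layers.LayerNormalization(),
    intermediate_dense=keras.layers.Dense(16, activation="gelu"),
    output_dense=keras.layers.Dense(8),
    output_dropout=keras.layers.Dropout(0.1),
)
```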
keras_nlp/layers/fnet_encoder.py
Outdated
    def __init__(
        self,
        intermediate_dim=3072,
Where does this value come from?
Let's remove the default. intermediate_dim should be a required argument to pass in.
At the component level, let's not set the default; we will pass this value when we build the actual model.
Ah, alright. I've removed the default value!
keras_nlp/layers/fnet_encoder.py
Outdated
    def __init__(
        self,
        intermediate_dim=3072,
        dropout=0.1,
Should default to 0
Right, but we are not shipping a model architecture here, we are shipping a reusable layer. I would be inclined to use the same defaults as the TransformerEncoder block for most of the args here so this is essentially a drop-in replacement. Having each encoder block we ship use different defaults everywhere would be a lot to track, and would lead to bad usability.
Right. I'll set the defaults to the ones used by TransformerEncoder 👍🏼
keras_nlp/layers/fnet_encoder.py
Outdated
    intermediate_dim=3072,
    dropout=0.1,
    activation="gelu",
    layer_norm_epsilon=1e-12,
This seems way too small
    dropout=0.1,
    activation="gelu",
    layer_norm_epsilon=1e-12,
    kernel_initializer="glorot_uniform",
Is this choice of initializer the best practice for this layer?
I believe it is tf.keras.initializers.RandomNormal(mean=0.0, stddev=2e-2) in Keras. Will change it to this.
Intermediate Dense Layer:
kernel_initializer = tf.keras.initializers.RandomNormal(mean=0.0, stddev=2e-2)
bias_initializer = tf.keras.initializers.RandomNormal(mean=0.0, stddev=2e-2)
Output Dense Layer:
kernel_initializer = tf.keras.initializers.RandomNormal(mean=0.0, stddev=2e-2)
bias_initializer = "zeros" # They don't specify the bias initializer in the output dense layer. The default in Flax is "zeros".
https://github.com/google-research/google-research/blob/master/f_net/layers.py#L73-L81
Should I set it as tf.keras.initializers.RandomNormal(mean=0.0, stddev=2e-2) for all, or should I separate the bias initializer for the output dense layer and set that to "zeros"?
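As a sketch of the setup under discussion (the stddev value comes from the FNet reference code; the layer widths below are arbitrary placeholders, not the PR's API):

```python
import tensorflow as tf
from tensorflow import keras

# FNet reference code uses a normal initializer with stddev 2e-2 for the
# dense kernels and (intermediate) biases.
fnet_init = keras.initializers.RandomNormal(mean=0.0, stddev=2e-2)

intermediate_dense = keras.layers.Dense(
    64,
    activation="gelu",
    kernel_initializer=fnet_init,
    bias_initializer=fnet_init,
)
output_dense = keras.layers.Dense(
    8,
    kernel_initializer=fnet_init,
    bias_initializer="zeros",  # Flax's Dense default bias initializer
)
```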
Thank you for the review comments, @fchollet, @mattdangerw! :)
Summary:
- For all argument-related comments, I've attached links to the config present in the official FNet code. What do you think I should do - keep the same defaults as given in the FNet config, or change them to the defaults given in TransformerEncoder?
- Please see the comment on initializers. I've changed everything to RandomNormal with a standard deviation of 2e-2.
- For the comments on making the functions in-line, I had followed what's been done in the TransformerEncoder class. I've now changed it to what @fchollet mentioned in the comments. Should I make the same changes in TransformerEncoder and TransformerDecoder?
keras_nlp/layers/fnet_encoder.py
Outdated
    x = self._output_dense(x)
    return self._output_dropout(x)

    # Apply fourier transform on the input.
I would prefer putting these comments into the class' docstring, since the line mixing_output = _fourier_transform(inputs) does not convey this much information on its own.
Additionally, this code reference is error-prone, because the line number is subject to change. Let's delete it, thanks!
Done!
abheesht17
left a comment
Thank you for the comments, @chenmoneygithub! I've addressed them.
mattdangerw
left a comment
Left some small comments. Once we resolve the default argument discussion this looks good to me.
keras_nlp/layers/fnet_encoder.py
Outdated
        A Tensor of the same shape as the `inputs`.
        """

    def _fourier_transform(input):
I don't think there's any reason to keep these inner functions private with underscores - they are local to the call function.
You only need the underscore for self._x to communicate that it's private.
Makes sense. Changed!
keras_nlp/layers/fnet_encoder.py
Outdated
    # Apply fourier transform on the input.
    mixing_output = _fourier_transform(inputs)

    # LayerNorm layer.
Remove the # LayerNorm layer. and # Feedforward layer. comments; the function names communicate that fine.
👍🏼
abheesht17
left a comment
Thank you, @mattdangerw, for the review! I've changed all defaults to what they are in TransformerEncoder (including the initialisers).
mattdangerw
left a comment
Thanks! This looks good to me.
Thank you, @mattdangerw! 🥳
keras_nlp/layers/fnet_encoder.py
Outdated
    Args:
        intermediate_dim: int, defaults to 3072. The hidden size of the
            feedforward network.
        dropout: float, defaults to 0.1. The dropout value, applied in the
Oops, just saw this - update the defaults in the docstrings too (or maybe just drop them); when we generate docs, we can include the default from the function signature.
Ah, my bad. I've fixed it 👍🏼. Thanks for noticing!
fchollet
left a comment
API LGTM.
* Add rough code for FNet Encoder
* Format code
* Minor doc-string changes
* Format __init__.py
* Address review comments - 1
* Add detailed comment about padding masks
* Add kernel and bias initialisers
* Add unit tests for the layer
* Address review comments - 2
* Address review comments - 3
* Address review comments - 4
* Minor change
* Correct doc-string
Resolves #40.
I've implemented the FNet Encoder Layer. I'll add unit tests once the code has been finalised. @mattdangerw ,@chenmoneygithub, please go through it. Thanks!
Here is the link to the Colab notebook where I do a forward pass: https://colab.research.google.com/drive/1ldWMSqw8q8qwndHUxtuILxowRRFiBTbT?usp=sharing
I'm planning to add an example for this (text classification). I'll stack 2-3 encoder layers and train the model.
Caveat: In the paper, the authors mention that they used the SentencePiece tokenizer (this hasn't been implemented in keras-nlp yet). So, should I add the example later in a separate PR? Or should I just use SentencePiece from TF Text?
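For what it's worth, a rough functional-API sketch of such a classification model; the encoder block body below is a stand-in for the FNetEncoder layer this PR adds, and the vocabulary size, depth, and widths are made up:

```python
import tensorflow as tf
from tensorflow import keras

VOCAB_SIZE, MAX_LEN, HIDDEN_DIM, NUM_CLASSES = 1000, 64, 32, 2

def fourier_mix(t):
    # Real part of the 2D FFT over (sequence, hidden); no padding mask.
    return tf.math.real(tf.signal.fft2d(tf.cast(t, tf.complex64)))

inputs = keras.Input(shape=(MAX_LEN,), dtype="int32")
x = keras.layers.Embedding(VOCAB_SIZE, HIDDEN_DIM)(inputs)
for _ in range(3):  # stack of 3 encoder blocks
    # Stand-in for: x = keras_nlp.layers.FNetEncoder(intermediate_dim=64)(x)
    mixed = keras.layers.Lambda(fourier_mix)(x)
    x = keras.layers.LayerNormalization()(x + mixed)
    ff = keras.layers.Dense(64, activation="gelu")(x)
    ff = keras.layers.Dense(HIDDEN_DIM)(ff)
    x = keras.layers.LayerNormalization()(x + ff)
x = keras.layers.GlobalAveragePooling1D()(x)
outputs = keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)
model = keras.Model(inputs, outputs)

preds = model(tf.zeros((2, MAX_LEN), dtype="int32"))
```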