
Conversation

@abheesht17 (Collaborator) commented Mar 15, 2022

Resolves #40.

I've implemented the FNet Encoder Layer. I'll add unit tests once the code has been finalised. @mattdangerw, @chenmoneygithub, please go through it. Thanks!

Here is the link to the Colab notebook where I do a forward pass: https://colab.research.google.com/drive/1ldWMSqw8q8qwndHUxtuILxowRRFiBTbT?usp=sharing

I'm planning to add an example for this (text classification): I'll stack 2-3 encoder layers and train the model.
Caveat: in the paper, the authors mention that they used the SentencePiece tokenizer, which hasn't been implemented in keras-nlp yet. So, should I add the example later in a separate PR? Or should I just use SentencePiece from TF Text?

@abheesht17 abheesht17 changed the title Fnet encoder Add FNet Encoder Layer Mar 15, 2022
@abheesht17 (Collaborator, Author):

Hey, @mattdangerw. Any thoughts on this?

@chenmoneygithub (Contributor) left a comment:

Thanks for the PR!

Dropped some comments; please also add unit tests for the layer's functionality. Thanks!

if not self._built:
    self._build(inputs.shape)

# Apply fourier transform on the input. Note: We don't have padding
Contributor:

I want to understand this a bit more: are you saying the original paper does not do padding, and takes in a RaggedTensor as input? Thanks!

@abheesht17 (Collaborator, Author) commented Mar 20, 2022:

So, as far as I understand, they do pad the sequences so that every sequence is of the same length. See this: https://github.com/google-research/google-research/blob/master/f_net/layers.py#L106. The expected shape is (bsz, max_seq_length, hidden_dim).

However, the padding mask is not used at all. They delete it here: https://github.com/google-research/google-research/blob/master/f_net/layers.py#L114. So, the mixing is done over all the tokens, including the padding tokens. See the input pipeline here: https://github.com/google-research/google-research/blob/master/f_net/input_pipeline.py#L100 and https://github.com/google-research/google-research/blob/master/f_net/input_pipeline.py#L107-L109.

My guess would be that Flax/JAX probably doesn't have an equivalent of RaggedTensor, so they had to pad the sequences in order to be able to batch them. As to why they did not mask the pad tokens, I'm not sure.

Edit: I've mailed the authors of FNet: Mixing Tokens with Fourier Transforms asking them about why the padding mask is deleted.
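To make the unmasked mixing concrete, here is a minimal numpy sketch (toy shapes, not the actual keras-nlp code) of the Fourier sublayer described above:

```python
import numpy as np

# Toy batch (hypothetical shapes): 2 sequences of length 4, hidden dim 8.
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4, 8))

# FNet's mixing sublayer: a 2D FFT over the sequence and hidden axes,
# keeping only the real part. No padding mask is involved, so padding
# positions get mixed in exactly like any other token.
mixed = np.real(np.fft.fft2(x, axes=(-2, -1)))

print(mixed.shape)  # (2, 4, 8), same shape as the input
```

Because the FFT mixes every position, changing a single (padding) position changes the output at all positions, which is exactly why the masking question comes up.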

@abheesht17 (Collaborator, Author) commented Mar 22, 2022:

I received a reply from the first author, James Lee-Thorp. This is the gist of the message:

For FNet, padding tokens are applied, but padding tokens are not masked. There is a possibility that the model may become sensitive to samples with varying numbers of padding tokens... but in practice, the authors did not notice any major performance issues. So, I think we can keep it the same; we don't need to apply padding masks.

If we do want to apply a padding mask, we should only apply it after the embedding layer, and not after every Fourier sublayer. This is because if we apply it after every Fourier sublayer, we will end up zeroing out some frequencies which is counter-productive.
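If we did want the mask, a minimal numpy sketch of that suggestion (hypothetical toy shapes, not the actual layer) would look like:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 4, 8))  # (batch, seq_len, hidden_dim)

# Hypothetical padding mask: the last position in the sequence is padding.
padding_mask = np.array([[1.0, 1.0, 1.0, 0.0]])

# Zero out the padding embeddings once, right after the embedding layer...
x = x * padding_mask[:, :, None]

# ...then run the Fourier sublayers unmasked. Re-applying the mask after
# each Fourier sublayer would instead zero out frequency components,
# which is the counter-productive case described above.
mixed = np.real(np.fft.fft2(x, axes=(-2, -1)))
print(mixed.shape)  # (1, 4, 8)
```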

I'll add a more detailed comment on Line 129; I haven't properly conveyed the meaning with the current comment.

@mattdangerw (Member):

Actually, if it's OK, maybe wait until this Tuesday to fully flesh this out? We have an internal sync meeting then, and will be discussing where we want to set our bar (e.g. with citations) for inclusion in this repo.

I'm pretty sure we will want this, but if you wait till then, we could give a more confident green light. Thanks for your patience! We really do appreciate the contributions!

@abheesht17 (Collaborator, Author):

Sure! Thank you :)

@abheesht17 (Collaborator, Author) left a comment:

Addressed the first round of comments. Thank you, @chenmoneygithub, for reviewing the code :)


@mattdangerw (Member):

Thanks for patience @abheesht17! We discussed in our meeting and confirmed we would like to add this.

Please do go ahead and add tests. Also, if you could add initializer arguments similar to #50, that would be great.

Thank you!

@abheesht17 (Collaborator, Author) commented Mar 22, 2022:

Awesome, @mattdangerw! Thank you :). I'll add the tests and the initialiser arguments.

One more thing - could you and @chenmoneygithub please give #43 (comment) and #43 (comment) a look so that we can have closure on the padding mask issue?

Should I also add a text classification example (stacks of FNet Encoder layers, with a dense layer on top), or should we leave that to a future PR?

@abheesht17 (Collaborator, Author) commented Mar 22, 2022:

@mattdangerw, @chenmoneygithub, I've done the following:

  • Added unit tests (following the ones for TransformerEncoder).
  • Added kernel and bias initialisers.

Let me know if it's okay. Thanks!

@fchollet (Collaborator) left a comment:

Thanks for the PR!

self.bias_initializer = keras.initializers.get(bias_initializer)
self._built = False

def _build(self, input_shape):
Collaborator:

This should just be the built-in build()

@abheesht17 (Collaborator, Author) commented Mar 23, 2022:

Ah, overriding the build() function in tf.keras.layers.Layer? Done! Should I make the same changes to the TransformerEncoder class?
https://github.com/keras-team/keras-nlp/blob/master/keras_nlp/layers/transformer_encoder.py#L83

Contributor:

The default build() only works if your call() method takes one input; TransformerEncoder takes multiple inputs, so the default build() cannot be used there.

@abheesht17 (Collaborator, Author):

Ohh, I see. Thanks! :)
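As a minimal sketch of the pattern (a toy layer, not FNetEncoder itself): when call() takes a single input, Keras invokes the built-in build(input_shape) automatically on the first call, so no manual self._built bookkeeping is needed:

```python
import tensorflow as tf
from tensorflow import keras

class ToyDense(keras.layers.Layer):
    """Toy layer that overrides the built-in build() method."""

    def __init__(self, units):
        super().__init__()
        self.units = units

    def build(self, input_shape):
        # Called by Keras with the shape of the first input it sees.
        self.kernel = self.add_weight(
            name="kernel", shape=(input_shape[-1], self.units)
        )

    def call(self, inputs):
        return tf.matmul(inputs, self.kernel)

layer = ToyDense(3)
out = layer(tf.zeros((2, 5)))  # build() runs here, then call()
print(out.shape)  # (2, 3)
```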

)
self._output_dropout = keras.layers.Dropout(rate=self.dropout)

def _fourier_transform(self, input):
Collaborator:

In-line it in call

@abheesht17 (Collaborator, Author):

Done!

mixing_output = tf.math.real(tf.signal.fft2d(input))
return mixing_output

def _add_and_norm(self, input1, input2, norm_layer):
Collaborator:

In-line it in call

@abheesht17 (Collaborator, Author):

Done!

def _add_and_norm(self, input1, input2, norm_layer):
return norm_layer(input1 + input2)

def _feed_forward(self, input):
Collaborator:

In-line it in call

@abheesht17 (Collaborator, Author):

Done!


def __init__(
self,
intermediate_dim=3072,
Collaborator:

Where does this value come from?

Member:

Let's remove the default. intermediate_dim should be a required argument to pass in.

Contributor:

At the component level, let's not set the default; we will pass this value when we are building the actual model.

@abheesht17 (Collaborator, Author):

Ah, alright. I've removed the default value!

def __init__(
self,
intermediate_dim=3072,
dropout=0.1,
Collaborator:

Should default to 0

Member:

Right, but we are not shipping a model architecture here; we are shipping a reusable layer. I would be inclined to use the same defaults as the TransformerEncoder block for most of the args here, so this is essentially a swap-in replacement. Having each encoder block we ship use different defaults everywhere would be a lot to track, and would lead to bad usability.

@abheesht17 (Collaborator, Author):

Right. I'll set the defaults to the ones used by TransformerEncoder 👍🏼

intermediate_dim=3072,
dropout=0.1,
activation="gelu",
layer_norm_epsilon=1e-12,
Collaborator:

This seems way too small

dropout=0.1,
activation="gelu",
layer_norm_epsilon=1e-12,
kernel_initializer="glorot_uniform",
Collaborator:

Is this choice of initializer the best practice for this layer?

@abheesht17 (Collaborator, Author):

FNet repo: https://github.com/google-research/google-research/blob/6fd7a1a0872e9d27c1f1764836bbf9048a7903e7/f_net/models.py#L39-L42

I believe it is tf.keras.initializers.RandomNormal(mean=0.0, stddev=2e-2) in Keras. Will change it to this.

Intermediate Dense Layer:

kernel_initializer = tf.keras.initializers.RandomNormal(mean=0.0, stddev=2e-2)
bias_initializer = tf.keras.initializers.RandomNormal(mean=0.0, stddev=2e-2)

Output Dense Layer:

kernel_initializer = tf.keras.initializers.RandomNormal(mean=0.0, stddev=2e-2)
bias_initializer = "zeros" # They don't specify the bias initializer in the output dense layer. The default in Flax is "zeros".

https://github.com/google-research/google-research/blob/master/f_net/layers.py#L73-L81

Should I set it as tf.keras.initializers.RandomNormal(mean=0.0, stddev=2e-2) for all, or should I separate the bias initializer for the output dense layer and set that to "zeros"?
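For illustration, a quick sketch of the second option with stand-in Dense layers (the names and dims here are hypothetical, not from the PR):

```python
from tensorflow import keras

# Kernel/bias initializer from the FNet reference config.
init = keras.initializers.RandomNormal(mean=0.0, stddev=2e-2)

# Intermediate dense layer: RandomNormal for both kernel and bias.
intermediate_dense = keras.layers.Dense(
    8, kernel_initializer=init, bias_initializer=init
)

# Output dense layer: RandomNormal kernel, "zeros" bias, matching
# the Flax default noted above.
output_dense = keras.layers.Dense(
    4, kernel_initializer=init, bias_initializer="zeros"
)

output_dense.build((None, 8))
print(output_dense.bias.numpy().sum())  # 0.0
```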

@abheesht17 (Collaborator, Author) left a comment:

Thank you for the review comments, @fchollet, @mattdangerw! :)

Summary:

  • For all argument-related comments, I've attached links to the config present in the official FNet code. What do you think I should do: keep the same defaults as given in the FNet config, or change them to the defaults given in TransformerEncoder?
  • Please see the comment on initializers. I've changed everything to RandomNormal with a standard deviation of 2e-2.
  • For the comments on making the functions in-line, I had followed what's been done in the TransformerEncoder class. I've now changed it to what @fchollet mentioned in the comments. Should I make the same changes in TransformerEncoder and TransformerDecoder?


x = self._output_dense(x)
return self._output_dropout(x)

# Apply fourier transform on the input.
Contributor:

I would prefer putting these comments into the class's docstring, since the line mixing_output = _fourier_transform(inputs) does not convey this much information.

Additionally, this code reference is error-prone, because the line number is subject to change. Let's delete it, thanks!

@abheesht17 (Collaborator, Author):

Done!

@abheesht17 (Collaborator, Author) left a comment:

Thank you for the comments, @chenmoneygithub! I've addressed them.


@mattdangerw (Member) left a comment:

Left some small comments. Once we resolve the default argument discussion this looks good to me.

A Tensor of the same shape as the `inputs`.
"""

def _fourier_transform(input):
Member:

I don't think there's any reason to keep these inner functions private with underscores; they are local to the call function.

You only need the underscore for self._x to communicate that it's private.

@abheesht17 (Collaborator, Author):

Makes sense. Changed!

# Apply fourier transform on the input.
mixing_output = _fourier_transform(inputs)

# LayerNorm layer.
Member:

# Remove the `# LayerNorm layer.` and `# Feedforward layer.` comments; the function names communicate that fine.

@abheesht17 (Collaborator, Author):

👍🏼

@abheesht17 (Collaborator, Author) left a comment:

Thank you, @mattdangerw, for the review! I've changed all defaults to what they are in TransformerEncoder (including the initialisers).


@mattdangerw (Member) left a comment:

Thanks! This looks good to me.

@abheesht17 (Collaborator, Author):

Thank you, @mattdangerw! 🥳

Args:
intermediate_dim: int, defaults to 3072. The hidden size of feedforward
network.
dropout: float, defaults to 0.1. The dropout value, applied in the
@mattdangerw (Member) commented Mar 25, 2022:

Oops, just saw this. Update the default in the docstrings too (or maybe just drop them); when we generate docs, we can include the default from the function signature.

@abheesht17 (Collaborator, Author):

Ah, my bad. I've fixed it 👍🏼. Thanks for noticing!

@fchollet (Collaborator) left a comment:

API LGTM.

@mattdangerw mattdangerw merged commit b17ca27 into keras-team:master Mar 28, 2022
adhadse pushed a commit to adhadse/keras-nlp that referenced this pull request Sep 17, 2022
* Add rough code for FNet Encoder

* Format code

* Minor doc-string changes

* Format __init__.py

* Address review comments - 1

* Add detailed comment about padding masks

* Add kernel and bias initialisers

* Add unit tests for the layer

* Address review comments - 2

* Address review comments - 3

* Address review comments - 4

* Minor change

* Correct doc-string
Successfully merging this pull request may close these issues: Add FNet Encoder Layer