[Flax] Add FlaxBlenderbot #13633
Conversation
* Add tests
* Clean source code and fix some bugs
@patrickvonplaten I would like to kindly ping for a review. :) I've been struggling to achieve PT-Flax equivalence, but I cannot find the difference/bug in this new Flax implementation. Thanks a lot! :)
Hey @stancld, thanks a lot for the PR! The difference between PT and Flax in your PR is actually very small (< 0.1), so it might very well be that the implementation is correct! I'll try to take a deeper look at the end of next week. Could you try one last thing: add print statements before the word embeddings, after the word embeddings, after each encoder transformer layer, before the decoder word embeddings, after the decoder attention layers, ... to see when the activations start to diverge. If it happens gradually, it might very well be that the model is correct and there is just a numerical difference. If it happens all of a sudden at some point, then there might be a subtle bug.
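A minimal sketch of such a comparison helper (the helper name and the commented call sites are illustrative, not part of the actual PR):

```python
import numpy as np

def report_divergence(name, pt_tensor, fx_tensor):
    """Print the max absolute difference between a PyTorch and a Flax activation."""
    pt = pt_tensor.detach().cpu().numpy()
    fx = np.asarray(fx_tensor)
    print(f"{name}: max abs diff = {np.abs(pt - fx).max():.6f}")

# Example call sites, one per stage of the forward pass:
# report_divergence("encoder word embeddings", pt_embeds, fx_embeds)
# report_divergence("encoder layer 0 output", pt_hidden_states, fx_hidden_states)
```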
@patrickvonplaten Thank you for the tip! I'll have a look :)

Hello @patrickvonplaten, I ran a few tests and one output is below. There is some level of divergence, but I'm not sure if it's too severe. I'm going to check the Flax code once again today :)
```diff
@@ -405,7 +405,7 @@ def setup(self) -> None:
             num_heads=self.config.encoder_attention_heads,
             dropout=self.config.attention_dropout,
         )
-        self.self_attn_layer_norm = nn.LayerNorm(dtype=self.dtype)
+        self.self_attn_layer_norm = nn.LayerNorm(dtype=self.dtype, epsilon=1e-05)
```
@patil-suraj - the default in PyTorch is 1e-05, so I adapted it for all Bart-like models. Given that the PT and Flax tests were passing for Bart before, I think this "bug correction" is fine in terms of backwards compatibility.
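For context, a quick check of the two libraries' defaults (a sketch; PyTorch's nn.LayerNorm defaults to eps=1e-05, while Flax's defaults to epsilon=1e-06):

```python
import torch
import flax.linen as fnn

print(torch.nn.LayerNorm(8).eps)  # 1e-05 (PyTorch default)
print(fnn.LayerNorm().epsilon)    # 1e-06 (Flax default)
# So epsilon=1e-05 must be set explicitly in Flax to match the PyTorch models.
```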
Good for me now!
@patrickvonplaten Thank you very much for spotting the problem! :]
Tests on master seem to be broken currently :-/ But I think the PR is good to go. @patil-suraj, could you maybe take a look once you're back (and maybe rebase to master with @stancld to fix the CircleCI runner)?

Awesome - I'll let you merge, @patil-suraj, once you're back :-)
```python
self.embed_dim,
use_bias=self.bias,
dtype=self.dtype,
kernel_init=jax.nn.initializers.normal(self.config.init_std, self.dtype),
```
We should not pass the dtype to kernel_init anymore; it's meant to specify the dtype of computation, not of parameters. This was a bug in all Flax models, which is fixed by #13098.
@stancld Could you please rebase the branch again with master and fix this according to what is explained in #13098?
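For reference, a sketch of what the corrected pattern from #13098 looks like (the module and attribute names here are illustrative, not the PR's actual code): dtype stays on the layer to control computation precision, while the initializer no longer receives it, so the parameters themselves remain float32.

```python
import jax
import jax.numpy as jnp
import flax.linen as nn

class FixedProjection(nn.Module):
    """Illustrative module showing the corrected kernel_init per #13098."""
    embed_dim: int
    init_std: float
    dtype: jnp.dtype = jnp.float32

    def setup(self):
        self.out_proj = nn.Dense(
            self.embed_dim,
            dtype=self.dtype,  # computation dtype only
            # No dtype passed to the initializer: parameters stay float32.
            kernel_init=jax.nn.initializers.normal(self.init_std),
        )

    def __call__(self, x):
        return self.out_proj(x)
```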
@patil-suraj Thank you for providing me with the context. Should be fixed now :]
Thank you @stancld for adding this, LGTM!
Will push a couple of Flax checkpoints and then merge :)
```python
if is_flax_available():
    import jax

    jax_device = jax.default_backend()
else:
    jax_device = None
```
nice!
What does this PR do?
This PR adds a Flax implementation of Blenderbot.
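A hedged usage sketch of the added model (the checkpoint name is assumed, and from_pt=True may be needed until official Flax weights are pushed):

```python
from transformers import BlenderbotTokenizer, FlaxBlenderbotForConditionalGeneration

model = FlaxBlenderbotForConditionalGeneration.from_pretrained(
    "facebook/blenderbot-400M-distill", from_pt=True
)
tokenizer = BlenderbotTokenizer.from_pretrained("facebook/blenderbot-400M-distill")

inputs = tokenizer("My friends are cool but they eat too many carbs.", return_tensors="np")
reply_ids = model.generate(**inputs).sequences
print(tokenizer.batch_decode(reply_ids, skip_special_tokens=True))
```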
Before submitting
* Did you read the contributor guideline, Pull Request section?
* Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
TODOs:
Who can review?
@patrickvonplaten @patil-suraj