[Keras Sprint] Migrate Vision Transformers from Legacy to Backbone API #1914
Conversation
performs steps 1-3 for moving ViT out of legacy and into backbone in single commit. moved code out of legacy and into newer folders and files Signed-off-by: Suvaditya Mukherjee <suvadityamuk@gmail.com>
moved weights.py information into backbone_presets.py, summarized information for each version of the model and kept the notop weights as default for the backbone Signed-off-by: Suvaditya Mukherjee <suvadityamuk@gmail.com>
renamed symbols from vit to vitbackbone, removed include_top and other arguments from backbone args directly, changed get_config, added classproperty Signed-off-by: Suvaditya Mukherjee <suvadityamuk@gmail.com>
replaced all extended vits to call the superclass with differing args with respect to architecture type Signed-off-by: Suvaditya Mukherjee <suvadityamuk@gmail.com>
updated all docstrings Signed-off-by: Suvaditya Mukherjee <suvadityamuk@gmail.com>
changed docstrings, added preliminary tests, need to find more information to fill in some more tests Signed-off-by: Suvaditya Mukherjee <suvadityamuk@gmail.com>
Thanks for the PR -- this is awesome!
It generally looks good to me, just some minor comments.
@ianstenbit Please take a look! Tried to fix most things, but for some reason the linting test fails. Locally, I have run the linter.
I'm seeing lint failures in CI. Maybe you didn't commit the changes after the linter ran?
But I did! That's what makes me more surprised! I even tried the process again, just in case, but git shows that nothing has changed.
I just gave reformatting a shot -- it did make changes on my end, and it looks like they ended up in the commit. Maybe you had an issue with your editor or something?
This LGTM -- I'm going to make sure that @jbischof takes a look as well, as he's been the expert on these backbone migrations
return cls.presets


class ViTH16Backbone(ViTBackbone):
@jbischof IIRC you had expressed some preference to call these "huge" models "XL" instead of "H" -- we haven't really been consistent about preset naming here, and have generally just mirrored the convention of the original paper or implementation for each backbone. WDYT?
IMO, "H" is better than XL. More papers follow that convention than "XL", and other ViT-like models (such as for SAM, DPT, etc.) all follow the B, L, H convention.
I'm personally not sure I saw "XL" in any paper, sans the new SD model? Then again, the API here may stray from that convention if it helps with consistency/clarity. Jonathan is better at those choices than me :D
This is a hard call. It's good to stick to the paper conventions for sure. But what's more intuitive?
Ti, S, B, L, H
XS, S, M, L, XL
Also, t-shirt sizes could be kept consistent across models, whereas with paper conventions YOLOV8, for example, might use "N" for the smallest size.
Actually, in NLP we have been even more verbose and used "extra_small", "small", "base", "large", and "extra_large" (link). Can't fault our enthusiasm!
Understood, must be that. Thank you so much!
Thanks for porting this @suvadityamuk! Excited to see it back in the new API 🎊
@suvadityamuk @ianstenbit before merging - one nit from my end. It looks like the backbone assumes a fixed input resolution for the positional embeddings. If the input is different, we need to interpolate the positional embeddings to match the seq_len after patching/embedding that input resolution.
That's actually something that has been on my mind as well. The problem is, if I do not give a certain input_shape, the num_patches calculation in the PatchingAndEmbedding layer breaks. Do you mind elaborating on the interpolation problem a bit?
Given a different input shape, the patching/embedding process will produce a different sequence length. For 224x224 it's 196 + class token; for 336x336 it would be 441 + class token. Hence, the learned positional embeddings that correspond to the original sequence length wouldn't match, and the weights wouldn't be able to load. This was seen as a major downside of the design of ViTs, but someone figured out that you can just interpolate the 196-length embeddings to any larger number, and it still works. There are tests for the interpolation method that showcase its usage.
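For context, a minimal sketch of that interpolation idea (not the PR's actual interpolate_positional_embeddings method; it assumes a square patch grid, a single class token, and bilinear resizing):

import tensorflow as tf

def interpolate_position_embeddings(pos_embed, patch_size, new_height, new_width):
    # pos_embed: (1, 1 + num_patches, dim), learned at the pretraining
    # resolution (e.g. 1 + 196 for 224x224 images with 16x16 patches).
    class_token, grid_embed = pos_embed[:, :1, :], pos_embed[:, 1:, :]
    dim = grid_embed.shape[-1]
    old_size = int(grid_embed.shape[1] ** 0.5)  # e.g. 14 for 196 patches
    new_h, new_w = new_height // patch_size, new_width // patch_size
    # Reshape to a 2D grid, resize bilinearly, then flatten back to a sequence.
    grid_embed = tf.reshape(grid_embed, (1, old_size, old_size, dim))
    grid_embed = tf.image.resize(grid_embed, (new_h, new_w), method="bilinear")
    grid_embed = tf.reshape(grid_embed, (1, new_h * new_w, dim))
    return tf.concat([class_token, grid_embed], axis=1)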
I'm pretty sure that this is because of the
I believe
Ah, I must've missed adding it back in the original implementation then. It may be better to open a new issue for it, so as not to mix the port PR with the fix PR?
I agree, but there are cases wherein the shapes come through as None, as in this test failure:

self = <keras_cv.models.backbones.vit.vit_backbone_test.ViTBackboneTest testMethod=test_valid_call>
def test_valid_call(self):
> model = ViTBackbone(
patch_size=16,
transformer_layer_num=12,
project_dim=192,
mlp_dim=768,
num_heads=3,
mlp_dropout=0.0,
attention_dropout=0.0,
include_rescaling=False,
)
keras_cv/models/backbones/vit/vit_backbone_test.py:29:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
keras_cv/models/backbones/vit/vit_backbone.py:117: in __init__
encoded_patches = PatchingAndEmbedding(project_dim, patch_size)(x)
../oss-ml/lib64/python3.8/site-packages/keras/utils/traceback_utils.py:61: in error_handler
return fn(*args, **kwargs)
../oss-ml/lib64/python3.8/site-packages/keras/engine/base_layer.py:1058: in __call__
return self._functional_construction_call(
../oss-ml/lib64/python3.8/site-packages/keras/engine/base_layer.py:2572: in _functional_construction_call
outputs = self._keras_tensor_symbolic_call(
../oss-ml/lib64/python3.8/site-packages/keras/engine/base_layer.py:2419: in _keras_tensor_symbolic_call
return self._infer_output_signature(
../oss-ml/lib64/python3.8/site-packages/keras/engine/base_layer.py:2476: in _infer_output_signature
self._maybe_build(inputs)
../oss-ml/lib64/python3.8/site-packages/keras/engine/base_layer.py:3023: in _maybe_build
self.build(input_shapes)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <keras_cv.layers.vit_layers.PatchingAndEmbedding object at 0x7f784c726400>, input_shape = TensorShape([None, None, None, 3])
def build(self, input_shape):
self.class_token = self.add_weight(
shape=[1, 1, self.project_dim], name="class_token", trainable=True
)
self.num_patches = (
> input_shape[1]
// self.patch_size
* input_shape[2]
// self.patch_size
)
E TypeError: unsupported operand type(s) for //: 'NoneType' and 'int'
keras_cv/layers/vit_layers.py:85: TypeError
Feel free to make a suggestion to this PR! I'll incorporate that into the current commits, killing two birds with one stone, since this was not supposed to happen anyway.
I'll have to take a look at when the returned shapes are None.
Also, about the
Hi @jbischof! Gentle ping for a review!
This is great, sorry for the delay in review!
Could you please merge with master so we can identify potential bugs with multibackend Keras?
        to use as image input for the model.
    patch_size: the patch size to be supplied to the Patching layer to turn
        input images into a flattened sequence of patches.
    transformer_layer_num: the number of transformer encoder layers to stack
Could we match the arg names in KerasNLP (link)? In the long term I'd love to merge implementations and these names are fairly standardized.
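For reference, a rough rename map based on my reading of typical KerasNLP backbone arguments (num_layers, num_heads, hidden_dim, intermediate_dim, dropout); treat the exact mapping as an assumption rather than something decided in this PR:

# Hypothetical mapping; current KerasCV names on the left, KerasNLP-style
# names on the right. Illustrative only -- final naming is up to the reviewers.
KERAS_NLP_STYLE_NAMES = {
    "transformer_layer_num": "num_layers",
    "project_dim": "hidden_dim",
    "mlp_dim": "intermediate_dim",
    "mlp_dropout": "dropout",
    "num_heads": "num_heads",  # already matches
}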
I wonder if we couldn't use the KerasNLP layer OOB since the CV implementation seems quite standard.
This might also be something to coordinate with @DavidLandup0 since he is merging in Transformer layers.
My 2 cents is that we should keep them separate. IMO, it's better to have a bit of duplicated code (i.e. one class) than to try to make cross-lib dependencies. Since we'll be adding a few CV-specific transformer encoders and attention types which will necessarily have to be part of KCV only, fragmenting the implementations doesn't feel like the best idea to me.
We could try to see if a vanilla transformer encoder could become a part of core Keras though, and have both being used in KNLP and KCV. As far as I can tell, there's demand for the class as part of the main framework and it would make sense to move a layer to it if both KNLP and KCV would use it, since that means that it's not specific enough to sit in only one lib?
There's also a slightly different impl (KNLP seems to use one layer norm, i.e. either if self.normalize_first or if not self.normalize_first), while this layer uses two (one before all ops and one after the residual). Merging the implementations would wreck the weights of one lib.
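To make the difference concrete, here is a simplified sketch of the two layouts being compared (illustrative only, not either library's actual code; sublayer stands in for the attention or MLP op):

def single_norm_block(x, sublayer, norm, normalize_first=True):
    # KerasNLP-style: one LayerNorm per sublayer, applied before the op
    # (pre-norm) or after the residual (post-norm) depending on the flag.
    if normalize_first:
        return x + sublayer(norm(x))
    return norm(x + sublayer(x))

def double_norm_block(x, sublayer, norm_before, norm_after):
    # The layout described above for this layer: one norm before all ops and a
    # second norm after the residual add, i.e. two sets of LayerNorm weights.
    return norm_after(x + sublayer(norm_before(x)))

Because the two layouts carry different numbers of LayerNorm weights, a pretrained checkpoint from one cannot be loaded into the other without conversion.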
model = ViTBackbone.from_preset("vittiny16_imagenet")
outputs = model(tf.ones(shape=(1, 224, 224, 3)))

test_output = outputs[0, :5, :5]
Maybe just assert outputs[0, :5, 0] to reduce the cognitive load here.
For get_config you can just use config.update() instead of this more verbose method.
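A minimal sketch of that pattern (the argument names are taken from the test snippet earlier in the thread; the exact keys the backbone serializes may differ):

def get_config(self):
    # Start from the parent config and extend it in one update() call instead
    # of rebuilding the dict key by key.
    config = super().get_config()
    config.update(
        {
            "patch_size": self.patch_size,
            "transformer_layer_num": self.transformer_layer_num,
            "project_dim": self.project_dim,
            "mlp_dim": self.mlp_dim,
            "num_heads": self.num_heads,
            "mlp_dropout": self.mlp_dropout,
            "attention_dropout": self.attention_dropout,
            "include_rescaling": self.include_rescaling,
        }
    )
    return config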
@suvadityamuk are you still interested in working on this?
Hi @jbischof, I am interested in this, but a bit tied up until Sept 6th. Happy to take it up fully after that date, though. Will send in the updated files as soon as I can. Thanks for waiting, and apologies for the delay!
@suvadityamuk do you have any updates on this? Please let us know if you need help. Thanks!
What does this PR do?
Fixes #1628 as part of the Keras Sprint.
Some important things to note:
1. The backbone takes the input_shape parameter as (224, 224, 3) instead of the normal default (None, None, 3). This is because, within the PatchingAndEmbedding layer, there is a num_patches calculation that requires a constant integer input_shape (Line 84); see the sketch after this list.
2. Renamed the method in vit_layers.py from __interpolate_positional_embeddings to interpolate_positional_embeddings. Previously, the name-mangled version was making tf.AutoGraph unable to work on the PatchingAndEmbedding layer, since name-mangling is not yet supported. It no longer throws any errors after this change.
3. Added # noqa: E501 tags on the weights URLs in vit_aliases.py to make sure that the links do not break. Let me know if there's a better way to handle that!

Please let me know what more changes you think are necessary. Thanks!
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@ianstenbit @jbischof