
[Keras Sprint] Migrate Vision Transformers from Legacy to Backbone API #1914

Open
wants to merge 13 commits into master
Conversation

suvadityamuk
Contributor

What does this PR do?

Fixes #1628 as part of the Keras Sprint.

Some important things to note:

  • Within the backbone, I had to set the default input_shape parameter to (224, 224, 3) instead of the usual default (None, None, 3). This is because, within the PatchingAndEmbedding layer, there is a num_patches calculation that requires a concrete integer input_shape (Line 84); see the sketch just after this list.
  • I have changed a function name within vit_layers.py from __interpolate_positional_embeddings to interpolate_positional_embeddings. Previously, the double-underscore prefix triggered Python name-mangling, which prevented tf.AutoGraph from working on the PatchingAndEmbedding layer, since name-mangling is not yet supported there. It no longer throws any errors after this change.
  • I have taken the liberty to add some # noqa: E501 tags on the weights URLs in vit_aliases.py to make sure that the links do not break. Let me know if there's a better way to handle that!
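
For context, here's a minimal sketch (illustrative values, not the exact keras_cv code) of why the spatial dimensions must be static:

patch_size = 16
input_shape = (224, 224, 3)  # concrete default; (None, None, 3) fails below

# The positional Embedding layer needs num_patches as a plain int at build
# time; integer floor division raises a TypeError if a spatial dim is None.
num_patches = (input_shape[0] // patch_size) * (input_shape[1] // patch_size)
print(num_patches)  # 196 patches for a 224x224 input with 16x16 patches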

Please let me know what more changes you think are necessary. Thanks!

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue? Please add a link
    to it if that's the case.
  • Did you write any new necessary tests?
  • If this adds a new model, can you run a few training steps on TPU in Colab to ensure that no XLA-incompatible ops are used?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ianstenbit @jbischof

performs steps 1-3 for moving ViT out of legacy and into backbone in a single commit; moved code out of legacy and into newer folders and files

Signed-off-by: Suvaditya Mukherjee <suvadityamuk@gmail.com>
moved weights.py information into backbone_presets.py, summarized information for each version of the model and kept the notop weights as default for the backbone

Signed-off-by: Suvaditya Mukherjee <suvadityamuk@gmail.com>
renamed symbols from vit to vitbackbone, removed include_top and other arguments from backbone args directly, changed get_config, added classproperty

Signed-off-by: Suvaditya Mukherjee <suvadityamuk@gmail.com>
replaced all extended vits to call the superclass with differing args with respect to architecture type

Signed-off-by: Suvaditya Mukherjee <suvadityamuk@gmail.com>
updated all docstrings

Signed-off-by: Suvaditya Mukherjee <suvadityamuk@gmail.com>
changed docstrings, added preliminary tests, need to find more information to fill in some more tests

Signed-off-by: Suvaditya Mukherjee <suvadityamuk@gmail.com>
Signed-off-by: Suvaditya Mukherjee <suvadityamuk@gmail.com>
Signed-off-by: Suvaditya Mukherjee <suvadityamuk@gmail.com>
Contributor

@ianstenbit left a comment

Thanks for the PR -- this is awesome!

It generally looks good to me, just some minor comments.

Review comment threads (resolved):
keras_cv/models/backbones/vit/vit_aliases.py
keras_cv/models/backbones/vit/vit_backbone.py
keras_cv/models/backbones/vit/vit_backbone_test.py
Signed-off-by: Suvaditya Mukherjee <suvadityamuk@gmail.com>
Signed-off-by: Suvaditya Mukherjee <suvadityamuk@gmail.com>
Signed-off-by: Suvaditya Mukherjee <suvadityamuk@gmail.com>
@suvadityamuk
Contributor Author

@ianstenbit Please take a look! Tried to fix most things, but for some reason, the linting test fails.

Locally, I have run bash shell/format.sh and bash shell/lint.sh, and both succeed, but CI fails the lint check for some reason. It does seem to skip one file, though. I reckon it might be the vit_backbone.py file itself, but I'm not sure why it would skip it 🤔

[screenshot]

Signed-off-by: Suvaditya Mukherjee <suvadityamuk@gmail.com>
@ianstenbit
Contributor

@ianstenbit Please take a look! Tried to fix most things, but for some reason, the linting test fails.

Locally, I have run bash shell/format.sh and bash shell/lint.sh, and both succeed, but CI fails the lint check for some reason. It does seem to skip one file, though. I reckon it might be the vit_backbone.py file itself, but I'm not sure why it would skip it 🤔

[screenshot]

I'm seeing

would reformat /home/runner/work/keras-cv/keras-cv/keras_cv/layers/vit_layers_test.py
would reformat /home/runner/work/keras-cv/keras-cv/keras_cv/models/backbones/vit/vit_backbone.py

Maybe you didn't commit the changes after the linter ran?

@suvadityamuk
Contributor Author

suvadityamuk commented Jul 6, 2023

@ianstenbit Please take a look! Tried to fix most things, but for some reason, the linting test fails.
Locally, I have run bash shell/format.sh and bash shell/lint.sh, and both succeed, but CI fails the lint check for some reason. It does seem to skip one file, though. I reckon it might be the vit_backbone.py file itself, but I'm not sure why it would skip it 🤔

I'm seeing

would reformat /home/runner/work/keras-cv/keras-cv/keras_cv/layers/vit_layers_test.py
would reformat /home/runner/work/keras-cv/keras-cv/keras_cv/models/backbones/vit/vit_backbone.py

Maybe you didn't commit the changes after the linter ran?

But I did! That's what makes it more surprising! I even tried the process again in case it would change anything, but git shows that nothing has changed.

This is what it shows me now:
[screenshot]

@ianstenbit
Contributor

I just gave reformatting a shot -- it did make changes on my end and it looks like they ended up in the commit. Maybe you had an issue with your editor or something?

Contributor

@ianstenbit left a comment

This LGTM -- I'm going to make sure that @jbischof takes a look as well, as he's been the expert on these backbone migrations

return cls.presets


class ViTH16Backbone(ViTBackbone):
Contributor

@jbischof IIRC you had expressed some preference to call these "huge" models "XL" instead of "H" -- we haven't really been consistent about preset naming here, and have generally just mirrored the convention of the original paper or implementation for each backbone. WDYT?

Contributor

IMO, "H" is better than XL. More papers follow that convention than "XL", and other ViT-like models (such as for SAM, DPT, etc.) all follow the B, L, H convention.

I'm personally not sure I saw "XL" in any paper, sans the new SD model? Then again, the API here may stray from that convention if it helps with consistency/clarity. Jonathan is better at those choices than me :D

Contributor

@jbischof Jul 21, 2023

This is a hard call. It's good to stick to the paper conventions for sure. But what's more intuitive?
Ti, S, B, L, H
XS, S, M, L, XL

Also, we can keep t-shirt sizes consistent across models, whereas YoloV8 might use "N" for the smallest size.

Contributor

Actually in NLP we have been even more verbose and used "extra_small", "small", "base", "large", "extra_large" (link). Can't fault our enthusiasm!

@suvadityamuk
Contributor Author

I just gave reformatting a shot -- it did make changes on my end and it looks like they ended up in the commit. Maybe you had an issue with your editor or something?

Understood, must be that. Thank you so much!

@DavidLandup0
Contributor

Thanks for porting this @suvadityamuk! Excited to see it back in the new API 🎊

@DavidLandup0
Contributor

@suvadityamuk @ianstenbit before merging - one nit from my end. It looks like the _interpolate_embeddings() call isn't actually used in the forward pass of the model. Not sure if it was missed in the original PR or accidentally omitted here, but as it is, the model works only on an input resolution of 224x224x3

If the input is different, we need to interpolate the positional embeddings to match the seq_len after patching/embedding that input resolution

@suvadityamuk
Contributor Author

@suvadityamuk @ianstenbit before merging - one nit from my end. It looks like the _interpolate_embeddings() call isn't actually used in the forward pass of the model. Not sure if it was missed in the original PR or accidentally omitted here, but as it is, the model works only on an input resolution of 224x224x3

If the input is different, we need to interpolate the positional embeddings to match the seq_len after patching/embedding that input resolution

That's actually something that has been on my mind as well. The problem is, if I do not give a concrete input_shape param in the ViTBackbone, the PatchingAndEmbedding layer throws an error. It's something I've been trying to fix, but to no avail.
That is why I moved the default from the original (None, None, 3) to (224, 224, 3). As far as I understand, though, supplying input_shape as an argument and using that shouldn't be a problem either.

Do you mind elaborating on the interpolation problem a bit?

@DavidLandup0
Contributor

Do you mind elaborating on the interpolation problem a bit?

Given a different input shape, the patching/embedding process will produce a different sequence length. For 224x224 with 16x16 patches, it's 196 + class token. For 336x336, the sequence length would be 441 + class token. Hence, the learned positional embeddings that correspond to the original sequence length wouldn't match, and the weights wouldn't load. This was seen as a major downside of the design of ViTs, but someone figured out that you can just interpolate the 196-token sequence to any larger number, and it still works.
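
For concreteness, a rough sketch of such an interpolation (assumed layout with the class token at index 0; not necessarily the keras_cv implementation):

import tensorflow as tf

# Resize a learned positional-embedding grid so pretrained weights fit a new
# resolution, e.g. 224 (14x14 = 196 patches) -> 336 (21x21 = 441 patches).
def interpolate_positional_embeddings(pos_embed, src=14, dst=21):
    cls_token, grid = pos_embed[:, :1, :], pos_embed[:, 1:, :]
    dim = grid.shape[-1]
    grid = tf.reshape(grid, (1, src, src, dim))
    grid = tf.image.resize(grid, (dst, dst), method="bicubic")
    grid = tf.reshape(grid, (1, dst * dst, dim))
    return tf.concat([cls_token, grid], axis=1)

pos_embed = tf.random.normal((1, 1 + 14 * 14, 192))
print(interpolate_positional_embeddings(pos_embed).shape)  # (1, 442, 192)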

There are tests for the method to interpolate, which showcase the PatchingAndEmbedding calls - so that should work correctly? 🤔

@DavidLandup0
Contributor

The problem is, if I do not give a certain input_shape param in the ViTBackbone, the PatchingAndEmbedding layer seems to throw an error.

I'm pretty sure that this is because of the Embedding layer for which we need to know the number of positional embeddings at creation time. If you instantiate PatchingAndEmbedding with the provided input shape instead of (None, None, 3), it should work
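
For example (constructor signature as in the traceback below; values illustrative, and the import path is my assumption):

from tensorflow import keras
from keras_cv.layers import PatchingAndEmbedding  # assumed import path

# With a static spatial shape, build() can compute
# num_patches = (224 // 16) * (224 // 16) = 196 at layer-creation time.
inputs = keras.layers.Input(shape=(224, 224, 3))
encoded = PatchingAndEmbedding(192, 16)(inputs)  # (project_dim, patch_size)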

@suvadityamuk
Contributor Author

Do you mind elaborating on the interpolation problem a bit?

Given a different input shape, the patching/embedding process will produce a different sequence length. For 224x224 with 16x16 patches, it's 196 + class token. For 336x336, the sequence length would be 441 + class token. Hence, the learned positional embeddings that correspond to the original sequence length wouldn't match, and the weights wouldn't load. This was seen as a major downside of the design of ViTs, but someone figured out that you can just interpolate the 196-token sequence to any larger number, and it still works.

There are tests for the method to interpolate, which showcase the PatchingAndEmbedding calls - so that should work correctly? 🤔

I believe interpolate_embeddings() was not there while I was making the port. There were tests which did use the functionality (as can be seen in vit_layers_test.py), but it was not in the actual forward pass of the model.

@DavidLandup0
Contributor

Ah, I must've missed adding it back in the original implementation then. It may be better to open a new issue for it, so as not to mix the port PR with the fix PR?

@suvadityamuk
Contributor Author

The problem is, if I do not give a certain input_shape param in the ViTBackbone, the PatchingAndEmbedding layer seems to throw an error.

I'm pretty sure that this is because of the Embedding layer for which we need to know the number of positional embeddings at creation time. If you instantiate PatchingAndEmbedding with the provided input shape instead of (None, None, 3), it should work

I agree, but there are cases wherein utils.parse_model_inputs() returns a (None, None, None, 3) tuple, which throws the error within the build() method for PatchingAndEmbedding. I've tried to make it work, but these are the results.

self = <keras_cv.models.backbones.vit.vit_backbone_test.ViTBackboneTest testMethod=test_valid_call>

    def test_valid_call(self):
>       model = ViTBackbone(
            patch_size=16,
            transformer_layer_num=12,
            project_dim=192,
            mlp_dim=768,
            num_heads=3,
            mlp_dropout=0.0,
            attention_dropout=0.0,
            include_rescaling=False,
        )

keras_cv/models/backbones/vit/vit_backbone_test.py:29: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
keras_cv/models/backbones/vit/vit_backbone.py:117: in __init__
    encoded_patches = PatchingAndEmbedding(project_dim, patch_size)(x)
../oss-ml/lib64/python3.8/site-packages/keras/utils/traceback_utils.py:61: in error_handler
    return fn(*args, **kwargs)
../oss-ml/lib64/python3.8/site-packages/keras/engine/base_layer.py:1058: in __call__
    return self._functional_construction_call(
../oss-ml/lib64/python3.8/site-packages/keras/engine/base_layer.py:2572: in _functional_construction_call
    outputs = self._keras_tensor_symbolic_call(
../oss-ml/lib64/python3.8/site-packages/keras/engine/base_layer.py:2419: in _keras_tensor_symbolic_call
    return self._infer_output_signature(
../oss-ml/lib64/python3.8/site-packages/keras/engine/base_layer.py:2476: in _infer_output_signature
    self._maybe_build(inputs)
../oss-ml/lib64/python3.8/site-packages/keras/engine/base_layer.py:3023: in _maybe_build
    self.build(input_shapes)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <keras_cv.layers.vit_layers.PatchingAndEmbedding object at 0x7f784c726400>, input_shape = TensorShape([None, None, None, 3])

    def build(self, input_shape):
        self.class_token = self.add_weight(
            shape=[1, 1, self.project_dim], name="class_token", trainable=True
        )
        self.num_patches = (
>           input_shape[1]
            // self.patch_size
            * input_shape[2]
            // self.patch_size
        )
E       TypeError: unsupported operand type(s) for //: 'NoneType' and 'int'

keras_cv/layers/vit_layers.py:85: TypeError
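
One option (purely a suggestion on my part, not a committed fix) would be to fail fast with a clearer error than this TypeError:

# Hypothetical guard that could sit at the top of PatchingAndEmbedding.build().
def check_static_spatial_dims(input_shape, patch_size=16):
    if input_shape[1] is None or input_shape[2] is None:
        raise ValueError(
            "PatchingAndEmbedding requires static height/width; got "
            f"input_shape={input_shape}. Pass e.g. input_shape=(224, 224, 3)."
        )
    return (input_shape[1] // patch_size) * (input_shape[2] // patch_size)

print(check_static_spatial_dims((None, 224, 224, 3)))  # 196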

@suvadityamuk
Contributor Author

Feel free to make a suggestion on this PR! I'll incorporate it into the current commits; two birds with one stone, since this wasn't supposed to happen anyway.

@DavidLandup0
Contributor

I'll have to take a look at when the returned shapes are (None, None, None, 3), but it feels like we should be able to instantiate this somehow 🤔
If I come up with anything, I'll post it here. Thanks! Sorry for derailing the thread 😓

@suvadityamuk
Contributor Author

Also, about the interpolate_embeddings call: it seems it is in fact included in the forward pass, so that worry is solved. The only remaining problem is the use of (224, 224, 3) as the default.

@suvadityamuk
Contributor Author

Hi @jbischof!

Gentle ping for a review!

Contributor

@jbischof left a comment

This is great, sorry for the delay in review!

Could you please merge with master so we can identify potential bugs with multibackend Keras?


to use as image input for the model.
patch_size: the patch size to be supplied to the Patching layer to turn
input images into a flattened sequence of patches.
transformer_layer_num: the number of transformer encoder layers to stack
Contributor

Could we match the arg names in KerasNLP (link)? In the long term I'd love to merge implementations, and these names are fairly standardized.

Contributor

I wonder if we couldn't use the KerasNLP layer out of the box, since the CV implementation seems quite standard.

Contributor

This might also be something to coordinate with @DavidLandup0, since he is merging in Transformer layers.

Contributor

My 2 cents is that we should keep them separate. IMO, it's better to have a bit of duplicated code (i.e. one class) than to try to make cross-lib dependencies. Since we'll be adding a few CV-specific transformer encoders and attention types which will necessarily be part of KCV only, fragmenting the implementations doesn't feel like the best idea to me.

We could try to see if a vanilla transformer encoder could become part of core Keras, though, and have it used in both KNLP and KCV. As far as I can tell, there's demand for the class as part of the main framework, and it would make sense to move a layer there if both KNLP and KCV would use it, since that would mean it's not specific enough to sit in only one lib.

Contributor

There's also a slightly different impl: KNLP seems to use one layer norm per sublayer (i.e. if self.normalize_first and if not self.normalize_first), while this layer uses two (one before all ops and one after the residual). Merging the implementations would wreck the weights of one lib.
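
Roughly, the two structures described (a paraphrase, not either library's actual code):

import tensorflow as tf
from tensorflow import keras

# KerasNLP-style: a single norm per sublayer, placed before OR after.
def knlp_style_block(x, attn, norm, normalize_first=True):
    residual = x
    if normalize_first:
        x = norm(x)
    x = residual + attn(x, x)
    return x if normalize_first else norm(x)

# This ViT layer: one norm before all ops AND one after the residual add.
def vit_style_block(x, attn, norm1, norm2):
    y = norm1(x)
    y = x + attn(y, y)
    return norm2(y)

x = tf.random.normal((1, 197, 192))
attn = keras.layers.MultiHeadAttention(num_heads=3, key_dim=64)
n1, n2 = keras.layers.LayerNormalization(), keras.layers.LayerNormalization()
print(vit_style_block(x, attn, n1, n2).shape)  # (1, 197, 192)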

model = ViTBackbone.from_preset("vittiny16_imagenet")
outputs = model(tf.ones(shape=(1, 224, 224, 3)))

test_output = outputs[0, :5, :5]
Contributor

Maybe just assert outputs[0, :5, 0] to reduce the cognitive load here.

Contributor

For get_config you can just use config.update() instead of this more verbose method.
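
i.e. something like this (argument names taken from this PR's tests; the exact set is illustrative):

def get_config(self):
    config = super().get_config()
    # Single update() call instead of rebuilding the dict key by key.
    config.update(
        {
            "patch_size": self.patch_size,
            "project_dim": self.project_dim,
            "transformer_layer_num": self.transformer_layer_num,
        }
    )
    return config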

@jbischof
Contributor

jbischof commented Aug 8, 2023

@suvadityamuk are you still interested in working on this?

@suvadityamuk
Contributor Author

Hi @jbischof,

I am interested in this, but a bit tied up until Sept 6th. Happy to take it up fully after that date, though. I will send in the updated files as soon as I can; thanks for waiting, and apologies for the delay!

@divyashreepathihalli
Collaborator

@suvadityamuk do you have any updates on this? Please let us know if you need help. Thanks!
