[Keras Sprint] Migrate Vision Transformers from Legacy to Backbone API #1914
Conversation
performs steps 1-3 for moving ViT out of legacy and into backbone in single commit. moved code out of legacy and into newer folders and files Signed-off-by: Suvaditya Mukherjee <suvadityamuk@gmail.com>
moved weights.py information into backbone_presets.py, summarized information for each version of the model and kept the notop weights as default for the backbone Signed-off-by: Suvaditya Mukherjee <suvadityamuk@gmail.com>
renamed symbols from vit to vitbackbone, removed include_top and other arguments from backbone args directly, changed get_config, added classproperty Signed-off-by: Suvaditya Mukherjee <suvadityamuk@gmail.com>
replaced all extended vits to call the superclass with differing args with respect to architecture type Signed-off-by: Suvaditya Mukherjee <suvadityamuk@gmail.com>
updated all docstrings Signed-off-by: Suvaditya Mukherjee <suvadityamuk@gmail.com>
changed docstrings, added preliminary tests, need to find more information to fill in some more tests Signed-off-by: Suvaditya Mukherjee <suvadityamuk@gmail.com>
Thanks for the PR -- this is awesome!
It generally looks good to me, just some minor comments.
@ianstenbit Please take a look! Tried to fix most things, but for some reason the linting test fails. Locally, I have run the linter.
I'm seeing lint failures in CI. Maybe you didn't commit the changes after the linter ran?
But I did! That's what makes me more surprised! I even tried the process again, just in case, but git shows that nothing has changed.
I just gave reformatting a shot -- it did make changes on my end, and it looks like they ended up in the commit. Maybe you had an issue with your editor or something?
This LGTM -- I'm going to make sure that @jbischof takes a look as well, as he's been the expert on these backbone migrations
return cls.presets


class ViTH16Backbone(ViTBackbone):
@jbischof IIRC you had expressed some preference to call these "huge" models "XL" instead of "H" -- we haven't really been consistent about preset naming here, and have generally just mirrored the convention of the original paper or implementation for each backbone. WDYT?
IMO, "H" is better than XL. More papers follow that convention than "XL", and other ViT-like models (such as for SAM, DPT, etc.) all follow the B, L, H convention.
I'm personally not sure I saw "XL" in any paper, sans the new SD model? Then again, the API here may stray from that convention if it helps with consistency/clarity. Jonathan is better at those choices than me :D
This is a hard call. It's good to stick to the paper conventions for sure. But what's more intuitive?
Ti, S, B, L, H
XS, S, M, L, XL
Also, t-shirt sizes could be kept consistent across models, whereas with paper conventions YOLOV8, for example, might use "N" for the smallest size.
Actually, in NLP we have been even more verbose and used "extra_small", "small", "base", "large", and "extra_large" (link). Can't fault our enthusiasm!
Understood, must be that. Thank you so much!
Thanks for porting this @suvadityamuk! Excited to see it back in the new API 🎊
@suvadityamuk @ianstenbit before merging - one nit from my end. It looks like the backbone assumes a fixed input resolution for the positional embeddings. If the input is different, we need to interpolate the positional embeddings to match the seq_len after patching/embedding that input resolution.
That's actually something that has been on my mind as well. The problem is, if I do not give a certain input_shape, the num_patches calculation in the PatchingAndEmbedding layer breaks. Do you mind elaborating on the interpolation problem a bit?
Given a different input shape, the patching/embedding process will produce a different sequence length. For 224x224 it's 196 + class token; for 336x336 it would be 441 + class token. Hence, the learned positional embeddings that correspond to the original sequence length wouldn't match, and the weights wouldn't be able to load. This was seen as a major downside of the design of ViTs, but someone figured out that you can just interpolate the 196-length embeddings to any larger number, and it still works. There are tests for the interpolation method that showcase its usage.
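For context, a minimal sketch of that interpolation idea (not the PR's actual interpolate_positional_embeddings method; it assumes a square patch grid, a single class token, and bilinear resizing):

import tensorflow as tf

def interpolate_position_embeddings(pos_embed, patch_size, new_height, new_width):
    # pos_embed: (1, 1 + num_patches, dim), learned at the pretraining
    # resolution (e.g. 1 + 196 for 224x224 images with 16x16 patches).
    class_token, grid_embed = pos_embed[:, :1, :], pos_embed[:, 1:, :]
    dim = grid_embed.shape[-1]
    old_size = int(grid_embed.shape[1] ** 0.5)  # e.g. 14 for 196 patches
    new_h, new_w = new_height // patch_size, new_width // patch_size
    # Reshape to a 2D grid, resize bilinearly, then flatten back to a sequence.
    grid_embed = tf.reshape(grid_embed, (1, old_size, old_size, dim))
    grid_embed = tf.image.resize(grid_embed, (new_h, new_w), method="bilinear")
    grid_embed = tf.reshape(grid_embed, (1, new_h * new_w, dim))
    return tf.concat([class_token, grid_embed], axis=1)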
I'm pretty sure that this is because of the
I believe
Ah, I must've missed adding it back in the original implementation then. It may be better to open a new issue for it, so as not to mix the port PR with the fix PR?
I agree, but there are cases wherein the shapes come through as None, as in this test failure:

self = <keras_cv.models.backbones.vit.vit_backbone_test.ViTBackboneTest testMethod=test_valid_call>
def test_valid_call(self):
> model = ViTBackbone(
patch_size=16,
transformer_layer_num=12,
project_dim=192,
mlp_dim=768,
num_heads=3,
mlp_dropout=0.0,
attention_dropout=0.0,
include_rescaling=False,
)
keras_cv/models/backbones/vit/vit_backbone_test.py:29:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
keras_cv/models/backbones/vit/vit_backbone.py:117: in __init__
encoded_patches = PatchingAndEmbedding(project_dim, patch_size)(x)
../oss-ml/lib64/python3.8/site-packages/keras/utils/traceback_utils.py:61: in error_handler
return fn(*args, **kwargs)
../oss-ml/lib64/python3.8/site-packages/keras/engine/base_layer.py:1058: in __call__
return self._functional_construction_call(
../oss-ml/lib64/python3.8/site-packages/keras/engine/base_layer.py:2572: in _functional_construction_call
outputs = self._keras_tensor_symbolic_call(
../oss-ml/lib64/python3.8/site-packages/keras/engine/base_layer.py:2419: in _keras_tensor_symbolic_call
return self._infer_output_signature(
../oss-ml/lib64/python3.8/site-packages/keras/engine/base_layer.py:2476: in _infer_output_signature
self._maybe_build(inputs)
../oss-ml/lib64/python3.8/site-packages/keras/engine/base_layer.py:3023: in _maybe_build
self.build(input_shapes)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <keras_cv.layers.vit_layers.PatchingAndEmbedding object at 0x7f784c726400>, input_shape = TensorShape([None, None, None, 3])
def build(self, input_shape):
self.class_token = self.add_weight(
shape=[1, 1, self.project_dim], name="class_token", trainable=True
)
self.num_patches = (
> input_shape[1]
// self.patch_size
* input_shape[2]
// self.patch_size
)
E TypeError: unsupported operand type(s) for //: 'NoneType' and 'int'
keras_cv/layers/vit_layers.py:85: TypeError
Feel free to make a suggestion to this PR! I'll incorporate that into the current commits, killing two birds with one stone, since this was not supposed to happen anyway.
I'll have to take a look at when the returned shapes are None.
Also, about the
Hi @jbischof! Gentle ping for a review!
This is great, sorry for the delay in review!
Could you please merge with master so we can identify potential bugs with multibackend Keras?
        to use as image input for the model.
    patch_size: the patch size to be supplied to the Patching layer to turn
        input images into a flattened sequence of patches.
    transformer_layer_num: the number of transformer encoder layers to stack
Could we match the arg names in KerasNLP (link)? In the long term I'd love to merge implementations and these names are fairly standardized.
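For reference, a rough rename map based on my reading of typical KerasNLP backbone arguments (num_layers, num_heads, hidden_dim, intermediate_dim, dropout); treat the exact mapping as an assumption rather than something decided in this PR:

# Hypothetical mapping; current KerasCV names on the left, KerasNLP-style
# names on the right. Illustrative only -- final naming is up to the reviewers.
KERAS_NLP_STYLE_NAMES = {
    "transformer_layer_num": "num_layers",
    "project_dim": "hidden_dim",
    "mlp_dim": "intermediate_dim",
    "mlp_dropout": "dropout",
    "num_heads": "num_heads",  # already matches
}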
I wonder if we couldn't use the KerasNLP layer OOB since the CV implementation seems quite standard.
This might also be something to coordinate with @DavidLandup0 since he is merging in Transformer layers.
My 2 cents is that we should keep them separate. IMO, it's better to have a bit of duplicated code (i.e. one class) than to try to make cross-lib dependencies. Since we'll be adding a few CV-specific transformer encoders and attention types which will necessarily have to be part of KCV only, fragmenting the implementations doesn't feel like the best idea to me.
We could try to see if a vanilla transformer encoder could become a part of core Keras though, and have both being used in KNLP and KCV. As far as I can tell, there's demand for the class as part of the main framework and it would make sense to move a layer to it if both KNLP and KCV would use it, since that means that it's not specific enough to sit in only one lib?
There's also a slightly different impl (KNLP seems to use one layer norm, i.e. either if self.normalize_first or if not self.normalize_first), while this layer uses two (one before all ops and one after the residual). Merging the implementations would wreck the weights of one lib.
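To make the difference concrete, here is a simplified sketch of the two layouts being compared (illustrative only, not either library's actual code; sublayer stands in for the attention or MLP op):

def single_norm_block(x, sublayer, norm, normalize_first=True):
    # KerasNLP-style: one LayerNorm per sublayer, applied before the op
    # (pre-norm) or after the residual (post-norm) depending on the flag.
    if normalize_first:
        return x + sublayer(norm(x))
    return norm(x + sublayer(x))

def double_norm_block(x, sublayer, norm_before, norm_after):
    # The layout described above for this layer: one norm before all ops and a
    # second norm after the residual add, i.e. two sets of LayerNorm weights.
    return norm_after(x + sublayer(norm_before(x)))

Because the two layouts carry different numbers of LayerNorm weights, a pretrained checkpoint from one cannot be loaded into the other without conversion.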
model = ViTBackbone.from_preset("vittiny16_imagenet")
outputs = model(tf.ones(shape=(1, 224, 224, 3)))

test_output = outputs[0, :5, :5]
Maybe just assert outputs[0, :5, 0] to reduce the cognitive load here.
For get_config you can just use config.update() instead of this more verbose method.
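A minimal sketch of that pattern (the argument names are taken from the test snippet earlier in the thread; the exact keys the backbone serializes may differ):

def get_config(self):
    # Start from the parent config and extend it in one update() call instead
    # of rebuilding the dict key by key.
    config = super().get_config()
    config.update(
        {
            "patch_size": self.patch_size,
            "transformer_layer_num": self.transformer_layer_num,
            "project_dim": self.project_dim,
            "mlp_dim": self.mlp_dim,
            "num_heads": self.num_heads,
            "mlp_dropout": self.mlp_dropout,
            "attention_dropout": self.attention_dropout,
            "include_rescaling": self.include_rescaling,
        }
    )
    return config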
@suvadityamuk are you still interested in working on this?
Hi @jbischof, I am interested in this, but a bit tied up until Sept 6th. Happy to take it up fully after that date, though. Will send in the updated files as soon as I can. Thanks for waiting, and apologies for the delay!
@suvadityamuk do you have any updates on this? Please let us know if you need help. Thanks!
What does this PR do?
Fixes #1628 as part of the Keras Sprint.
Some important things to note:
1. The backbone takes the input_shape parameter as (224, 224, 3) instead of the normal default (None, None, 3). This is because, within the PatchingAndEmbedding layer, there is a num_patches calculation that requires a constant integer input_shape (Line 84); see the sketch after this list.
2. Renamed the method in vit_layers.py from __interpolate_positional_embeddings to interpolate_positional_embeddings. Previously, the name-mangled version was making tf.AutoGraph unable to work on the PatchingAndEmbedding layer, since name-mangling is not yet supported. It no longer throws any errors after this change.
3. Added # noqa: E501 tags on the weights URLs in vit_aliases.py to make sure that the links do not break. Let me know if there's a better way to handle that!

Please let me know what more changes you think are necessary. Thanks!
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@ianstenbit @jbischof