
adding CoCa #256

Closed
wants to merge 123 commits into from

Conversation

@gpucce (Contributor) commented Nov 25, 2022

This PR adds the CoCa model as implemented in https://github.com/lucidrains/CoCa-pytorch, reusing existing parts as much as possible.

Ideally it will also add the possibility to choose between the custom and non-custom Attention implementations, as is done for CLIP.

@rom1504 (Collaborator) commented Nov 25, 2022

It would be best to see whether both the tower support and the losses can be unified with the existing code, so we can train with a variety of text towers and benefit from the existing efficient loss implementations without too much duplicated code.

@gpucce (Contributor, Author) commented Nov 25, 2022

Sure, I will try to reuse as much as I can. For now it is mostly copied from the coca-pytorch repo; I will probably ask for some help as I move on :)

@gpucce (Contributor, Author) commented Nov 27, 2022

@rom1504 I will reuse the visual model from open_clip. However, in coca-pytorch the transformer layers for the text model differ from the regular ones: the feed-forward and attention branches run in parallel. Do you prefer it like that, or regular layers? I have no idea how much difference it makes.

Even if I use the regular attention, I think the current implementation doesn't allow cross-attention. Would you prefer a separate CrossAttention layer, or adding the cross-attention option to the regular attention block via kwargs to its forward, as in the sketch below?
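
For illustration, a minimal sketch of the kwargs option: the usual residual self-attention block, extended with optional key/value inputs that switch it into cross-attention. Names and signatures here are assumptions, not the existing open_clip block:

```python
import torch
from torch import nn

class ResidualAttentionBlock(nn.Module):
    def __init__(self, d_model: int, n_head: int):
        super().__init__()
        self.ln_1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head)
        self.ln_2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x, k_x=None, v_x=None, attn_mask=None):
        q = self.ln_1(x)
        # With k_x/v_x given, attend over those tokens (cross-attention);
        # otherwise this is plain self-attention.
        k = k_x if k_x is not None else q
        v = v_x if v_x is not None else q
        x = x + self.attn(q, k, v, attn_mask=attn_mask, need_weights=False)[0]
        x = x + self.mlp(self.ln_2(x))
        return x
```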

@rom1504 (Collaborator) commented Nov 27, 2022

I think we should bring options into the current text model so it supports CoCa:

  • Parallel feed-forward and self-attention can be added as an option; see the sketch after this list. I don't think it will make a huge difference for these relatively small models. GPT-J and PaLM used this mechanism to improve speed at large scale.
  • Cross-attention is indeed the important feature. Bringing it to the existing model would be great, and that should be possible for our text attention implementation. For HF encoders there's a chance it's already implemented, in which case we can just use their implementation.
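
A minimal sketch of that parallel block (GPT-J / PaLM style); all names here are illustrative, not the final implementation:

```python
import torch
from torch import nn

class ParallelBlock(nn.Module):
    """Transformer block where self-attention and the MLP both read the same
    normalized input and their outputs are summed, instead of running
    sequentially. A sketch under assumed names, not open_clip code."""

    def __init__(self, d_model: int, n_head: int, mlp_ratio: float = 4.0):
        super().__init__()
        self.ln = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        hidden = int(d_model * mlp_ratio)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, hidden),
            nn.GELU(),
            nn.Linear(hidden, d_model),
        )

    def forward(self, x: torch.Tensor, attn_mask=None) -> torch.Tensor:
        y = self.ln(x)  # one shared pre-norm instead of two sequential ones
        attn_out, _ = self.attn(y, y, y, attn_mask=attn_mask, need_weights=False)
        return x + attn_out + self.mlp(y)  # branches computed in parallel
```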

Thanks for working on this!

@gpucce (Contributor, Author) commented Dec 1, 2022

@rom1504 I am moving forward. If you have time, could you take a look at how the cross-attention and decoder are added to the existing models, to see whether the integration is going in a reasonable direction?

@rom1504 (Collaborator) commented Dec 18, 2022

Another bonus-feature idea (probably not for this PR): support many HF decoders for the "multimodal transformer" that got added here.

@iejMac (Contributor) commented Dec 18, 2022

@gpucce Do you think you could give me push access to your fork? I'd love to help out, but I don't want you to have to manually merge all of my suggested changes each time I make them.

@gpucce (Contributor, Author) commented Dec 18, 2022

> Do you think you could give me push access to your fork? I'd love to help out, but I don't want you to have to manually merge all of my suggested changes each time I make them.

Sure, I will do it as soon as I am at a computer.

@rwightman (Collaborator) commented

> would appreciate your review @rwightman if you think anything big needs to be done

Made some code review comments; the most important points:

  • Make the CoCa model dual tower (.visual, .text) from the start so builtin and HF text transformer models are handled the same way; there is no backward compatibility to worry about.
  • Revert the changes to Attention / CustomResidualAttentionBlock; we should avoid using them with the CoCa model.
  • Move all models to output dicts instead of tuples. That makes it more flexible to pass outputs through optional losses / filters without brittle indexing and tuple-length checks that are bound to fail with one more addition. The dict output could be made optional at construction time to prevent possible breaks for other downstream users (see the sketch after this comment).

and then test test test.
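
A rough sketch of the dict-output idea from the third point; the class shape, field names, and the output_dict constructor flag are assumptions for illustration, not the final API:

```python
import torch
from torch import nn
import torch.nn.functional as F

class CLIPWithDictOutput(nn.Module):
    # Returns a dict when output_dict=True, so losses and filters can select
    # outputs by name instead of relying on brittle tuple positions.
    def __init__(self, visual: nn.Module, text: nn.Module, output_dict: bool = False):
        super().__init__()
        self.visual = visual
        self.text = text
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~ln(1/0.07), as in CLIP
        self.output_dict = output_dict

    def forward(self, image, text):
        image_features = F.normalize(self.visual(image), dim=-1)
        text_features = F.normalize(self.text(text), dim=-1)
        if self.output_dict:
            # New outputs (e.g. CoCa caption logits) can be added later
            # without breaking callers that look fields up by name.
            return {
                "image_features": image_features,
                "text_features": text_features,
                "logit_scale": self.logit_scale.exp(),
            }
        return image_features, text_features, self.logit_scale.exp()  # legacy tuple
```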

@gpucce (Contributor, Author) commented Dec 18, 2022

Thanks for the review @rwightman. @iejMac had raised a point similar to the second one. To coordinate with everyone (@rom1504), since the list of todos is getting longer: I am planning to address all of them, but I won't be able to proceed too fast. If someone is already working on any of them to speed up the whole process, please let me know.

Otherwise I will implement everything as suggested, taking a bit of time, and in general will start with the generative part.

@rom1504 (Collaborator) commented Dec 18, 2022

Yes, I think starting with implementing captioning will give us confidence that things are working.

            else LayerNorm
        )

        text = _build_input_dependent_text_tower(embed_dim, text_cfg, quick_gelu, cast_dtype, multimodal=False)
Collaborator: could be self.text here

Contributor (Author): I think adding this would make the state_dict of the model you have just trained incompatible, is that fine?

Collaborator: yes, we'll retrain

        text_embs = text_embs.permute(1, 0, 2)    # NLD -> LND
        image_embs = image_embs.permute(1, 0, 2)  # NLD -> LND

        for r, ca in zip(self.resblocks, self.cross_attn):
Collaborator: could you name those resblock and cross_attn? r and ca are a bit confusing.
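
For reference, the loop with descriptive names might read as follows (the cross-attention call signature is an assumption):

```python
for resblock, cross_attn in zip(self.resblocks, self.cross_attn):
    text_embs = resblock(text_embs, attn_mask=attn_mask)               # causal self-attention
    text_embs = cross_attn(text_embs, k_x=image_embs, v_x=image_embs)  # attend over image tokens
```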

        mask.triu_(1)  # zero out the lower diagonal
        return mask

    def forward(self, image_embs, text_embs):
Collaborator: could you add a comment explaining the shapes of those? I think it'll help.
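
For example, the requested shape documentation, together with the mask builder above, could look like this (shapes inferred from the surrounding permutes; a sketch, not the PR's exact code):

```python
import torch

def build_attention_mask(num_tokens: int) -> torch.Tensor:
    # Additive causal mask: 0 on and below the diagonal, -inf above it,
    # so each text position attends only to itself and earlier tokens.
    mask = torch.empty(num_tokens, num_tokens)
    mask.fill_(float("-inf"))
    mask.triu_(1)  # zero out the lower diagonal
    return mask

# forward(self, image_embs, text_embs) would then document:
#   image_embs: [N, L_img, D] image tokens from the visual tower
#   text_embs:  [N, L_txt, D] text tokens to be decoded with cross-attention
```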

    def _repeat(self, t, N):
        return t.reshape(1, 1, -1).repeat(N, 1, 1)

    def encode_text(self, text, normalize=True, return_tokens=False):
Collaborator: same comment as for the visual tower, can we use the text tower much more?

Contributor (Author): This one is a bit harder than the visual one, I think.

Collaborator: What is missing? Don't we simply need to add that return-tokens option in the text encoder too?
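
One way the return-tokens option could look in a CLIP-style text encoder; the attribute names are assumed to mirror the usual open_clip text forward, not this PR's exact code:

```python
import torch
import torch.nn.functional as F

def encode_text(self, text, normalize: bool = True, return_tokens: bool = False):
    x = self.token_embedding(text) + self.positional_embedding
    x = x.permute(1, 0, 2)     # NLD -> LND
    x = self.transformer(x)
    x = x.permute(1, 0, 2)     # LND -> NLD
    tokens = self.ln_final(x)  # [N, L, D] per-token features
    # Pool at the eot token (highest token id in CLIP's vocabulary).
    pooled = tokens[torch.arange(tokens.shape[0]), text.argmax(dim=-1)] @ self.text_projection
    if normalize:
        pooled = F.normalize(pooled, dim=-1)
    return (pooled, tokens) if return_tokens else pooled
```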

@AwePhD commented Dec 20, 2022

Hello,

I would like to pre-train CoCa, and from the CoCa implementation's repo I found this PR/branch. I am not familiar with the OpenCLIP code base, but I think it could be a good opportunity for me to get my hands dirty. I cannot tell whether you would appreciate help developing this feature or not, so let me know; I would be glad to participate :)

@rom1504 (Collaborator) commented Dec 20, 2022

@AwePhD help would definitely be appreciated. See the comments above for what we need.

You can open PRs against the branch of this PR.

@Soonhwan-Kwon commented

I also want to help. I've done many experiments with the CoCa model on most public datasets, including caption generation (though without HF compatibility).

@rom1504 (Collaborator) commented Dec 20, 2022 via email

@rom1504 (Collaborator) commented Dec 20, 2022 via email

@gpucce (Contributor, Author) commented Dec 20, 2022

Nice that it gets the same performance!

If you can wait a moment before merging, I should shortly be able to simplify the logic for the visual part; for the text part, more things would need changing.

@gpucce (Contributor, Author) commented Dec 20, 2022

And if you can somehow share the model, that would be very useful.

@@ -160,19 +155,31 @@ def encode_image(self, images, normalize=True, return_tokens=False):
    def _repeat(self, t, N):
        return t.reshape(1, 1, -1).repeat(N, 1, 1)

    def _build_cls_mask(self, text, cast_dtype):
Collaborator: What should be the impact of this change?

Contributor (Author): I think right now the cls token at the end can attend to pad tokens in the sequence; this should not be possible with this extra mask.

Collaborator: sounds good!
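
For context, a minimal sketch of such a cls mask, assuming the cls token is appended after the sequence and a pad id of 0 (not necessarily the PR's exact code):

```python
import torch

def build_cls_mask(text: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    # text: [N, L] token ids. Returns an additive mask [N, L+1, L+1] with
    # -inf in the columns of pad positions, so no query position (including
    # the appended cls token) can attend to padding.
    N, L = text.shape
    valid = torch.ones(N, L + 1, dtype=torch.bool)  # extra slot for the cls token
    valid[:, :L] = text != pad_id                   # True on real tokens
    mask = torch.zeros(N, L + 1, L + 1)
    mask.masked_fill_(~valid[:, None, :], float("-inf"))
    return mask
```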

@gpucce (Contributor, Author) commented Dec 20, 2022

@rom1504 the visual part should be simpler now, so you can make the new branch now. I will keep working on the generative part as soon as I have time.

@@ -465,6 +465,9 @@ def forward(self, x: torch.Tensor):
        x = self.transformer(x)
        x = x.permute(1, 0, 2)  # LND -> NLD

+       if output_tokens:
+           return x
+
        if self.global_average_pool:
            x = x.mean(dim=1)
Collaborator: I'm wondering if this can be done after the ln_post. If yes, it would make it possible to do the ln_post only here and not in CoCa.
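
In other words, something like this ordering (a sketch of the suggestion using the names from the diff above, not tested code):

```python
x = self.transformer(x)
x = x.permute(1, 0, 2)  # LND -> NLD
x = self.ln_post(x)     # normalize the full token sequence once, here

if output_tokens:
    return x            # callers such as CoCa receive already-normalized tokens

if self.global_average_pool:
    x = x.mean(dim=1)
```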

-       x = x.permute(1, 0, 2)  # NLD -> LND
-       x = self.visual.transformer(x)
-       x = x.permute(1, 0, 2)  # LND -> NLD
+       x = self.visual(images, output_tokens=True)
        x = self.visual.ln_post(x)
Collaborator: this ln_post call makes a big assumption about the API of the visual encoder.

@rom1504 mentioned this pull request Dec 20, 2022
@rom1504 (Collaborator) commented Dec 20, 2022

@gpucce I merged into the coca branch. I excluded the 2 last commits you added today to avoid discrepancies with our trained model.
Can you please open a PR with your 2 last commits, and with any further improvements, targeting the coca branch rather than main?
Thanks

All the comments made here remain valid.

@rom1504 closed this Dec 20, 2022
@rom1504 mentioned this pull request Dec 20, 2022
@rom1504 (Collaborator) commented Dec 20, 2022

Please refer to #308 for follow-ups.
