
use TextEncoder in coca encode_image #321

Merged: 25 commits merged into mlfoundations:coca on Jan 6, 2023

Conversation

@gpucce (Contributor) commented Dec 24, 2022

This PR adds the option to use a CLS token as a model parameter in TextEncoder, simplifying the logic of CoCa.
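
A minimal sketch of the idea, assuming PyTorch; the class and attribute names (e.g. cls_emb) are illustrative, not necessarily the PR's exact code:

import torch
import torch.nn as nn

class TextEncoderWithCLS(nn.Module):
    def __init__(self, vocab_size=49408, width=512, context_length=76):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, width)
        # one extra position for the learned CLS token appended to the sequence
        self.positional_embedding = nn.Parameter(torch.empty(context_length + 1, width))
        self.cls_emb = nn.Parameter(torch.empty(width))  # CLS token as a model parameter
        nn.init.normal_(self.positional_embedding, std=0.01)
        nn.init.normal_(self.cls_emb, std=0.01)

    def forward(self, text):
        x = self.token_embedding(text)                 # [batch, seq, width]
        cls = self.cls_emb.expand(x.shape[0], 1, -1)   # [batch, 1, width]
        x = torch.cat([x, cls], dim=1) + self.positional_embedding
        # ... transformer blocks would go here ...
        pooled, tokens = x[:, -1], x[:, :-1]           # CLS output vs. token outputs
        return pooled, tokens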

@gpucce (Contributor, Author) commented Dec 24, 2022

@rom1504 please don't merge this yet

@gpucce (Contributor, Author) commented Dec 27, 2022

@rom1504 this should be ready. Essentially it addresses the points raised in the old PR (quoted below) and consequently adds the possibility for the text-only encoder to be the Hugging Face one; making the multimodal one also available as HF is more difficult and I would do that in a different PR. As usual, reviews, comments and anything else are more than welcome.

Would appreciate your review @rwightman, if you think anything big needs to be done.

Made some code review comments; the most important points:

  • Make the CoCa model dual tower (.visual, .text) from the start so the built-in and HF text transformer models are handled the same way; there is no backward compatibility to be concerned about
  • Revert the changes to Attention / CustomResidualAttentionBlock; we should avoid using them with the CoCa model
  • Move all models to output dicts instead of tuples. This makes it more flexible to pass outputs through optional losses / filters without brittle indexing and tuple-length checks that are bound to fail with one more addition. The dict output could be made optional at construction time to prevent possible breaks for other downstream users (see the sketch after this quote)

and then test test test.

Originally posted by @rwightman in #256 (comment)
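
For context, a minimal sketch of the dict-output idea above, assuming PyTorch; the output_dict flag and the dict keys are assumptions for illustration, not necessarily what the repo settled on:

import torch.nn as nn

class CLIPLike(nn.Module):
    def __init__(self, visual: nn.Module, text: nn.Module, output_dict: bool = False):
        super().__init__()
        self.visual = visual
        self.text = text
        self.output_dict = output_dict  # opt-in at construction time

    def forward(self, image, text):
        image_features = self.visual(image)
        text_features = self.text(text)
        if self.output_dict:
            # named outputs: losses/filters pick what they need, no tuple indexing
            return {"image_features": image_features, "text_features": text_features}
        return image_features, text_features  # legacy tuple output for downstream users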

@gpucce marked this pull request as ready for review December 27, 2022 11:43
Review threads on src/open_clip/coca_model.py (outdated, resolved)
@iejMac (Contributor) commented Dec 28, 2022

Hey, nice work! Left some very minor comments but I still need to look at the HF stuff in more detail. I'll do that later

@gpucce (Contributor, Author) commented Dec 28, 2022

Hey, nice work! Left some very minor comments but I still need to look at the HF stuff in more detail. I'll do that later

Thanks for looking into it! For HF I didn't add a new Pooler, but I can do it if you think it is better.

@iejMac (Contributor) commented Dec 28, 2022

For HF stuff from my quick look I have 2 concerns:

  1. Do all HF models have resize_token_embeddings?
  2. What is the point of embed_cls? We already have a CLS pooler: https://github.com/gpucce/open_clip/blob/0e46a0199d6aaae69dc5c5089a62e8fe3b9d26ab/src/open_clip/hf_model.py#L65

@gpucce (Contributor, Author) commented Dec 28, 2022

For HF stuff from my quick look I have 2 concerns:

  1. Do all HF models have resize_token_embeddings?

  2. What is the point of embed_cls? We already have a CLS pooler: https://github.com/gpucce/open_clip/blob/0e46a0199d6aaae69dc5c5089a62e8fe3b9d26ab/src/open_clip/hf_model.py#L65

  1. It should be a method of every PreTrainedModel (see the sketch below): https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel.resize_token_embeddings

  2. The reason would be to let the model's CLS token and the one used for the contrastive loss be different. Adding this through the HF tokenizer seems more difficult to make stable.
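
A minimal sketch of point 1, assuming the Hugging Face transformers library; the model name and the extra special token are purely illustrative:

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

# add a hypothetical extra CLS-style token, then grow the embedding matrix to match
tokenizer.add_special_tokens({"additional_special_tokens": ["<cls_contrastive>"]})
model.resize_token_embeddings(len(tokenizer))  # new rows are freshly initialized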

@iejMac (Contributor) commented Dec 29, 2022

@gpucce Re point 2: I think we actually want the CLS tokens to be the same. The idea would be to tune the CLS output embedding so that it's useful contrastively, vs. just learning a completely new one. I realize this raises the issue of transformers that don't use CLS pooling, but I don't see a problem with diverging from CoCa a bit here, i.e. allowing for different pooling methods (like averaging) and using that as the text embedding for contrastive learning. To stay consistent with the paper, we can make CLS pooling the pooling method for the default configuration.
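
A rough sketch of what swappable pooling could look like; these classes are simplified illustrations, not the repo's exact poolers:

import torch.nn as nn

class ClsPooler(nn.Module):
    def forward(self, hidden_state, attention_mask):
        return hidden_state[:, 0]  # take the CLS position

class MeanPooler(nn.Module):
    def forward(self, hidden_state, attention_mask):
        # masked mean over the sequence, ignoring padding
        masked = hidden_state * attention_mask.unsqueeze(-1)
        return masked.sum(dim=1) / attention_mask.sum(dim=-1, keepdim=True)

The default configuration could then select ClsPooler to stay consistent with the paper.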

@iejMac (Contributor) commented Dec 29, 2022

Let me know what you think and also if you're busy. I can get to it myself actually in case you are

@gpucce (Contributor, Author) commented Dec 29, 2022

Let me know what you think and also if you're busy. I can get to it myself actually in case you are

Hi, I am getting to it now and I agree with you; I will remove the embed_cls option. Another issue might then be that what is fed to the generative side might not have any <end_of_text> token.

@gpucce (Contributor, Author) commented Dec 29, 2022

@iejMac would it be bad to add the output_tokens option to the poolers themselves?

@iejMac (Contributor) commented Dec 29, 2022

Why do we want to add it to the poolers? Can't we just use the poolers to get the pooled embedding but also return the tokens on the side?

@gpucce (Contributor, Author) commented Dec 29, 2022

Why do we want to add it to the poolers? Can't we just use the poolers to get the pooled embedding but also return the tokens on the side?

I was thinking that with CLS pooling the CLS embedding should not be in the tokens, while with Mean/Max it maybe should? We could also pick from the second token onward for all of them.

@iejMac (Contributor) commented Dec 29, 2022

Ah, do you mean that for CLS pooling the tokens are x[1:] whereas for mean/max it's just x? If so, I still think maybe just keep the poolers doing only pooling, and then we can add a condition along the lines of

if isinstance(self.pooler, ClsPooler):
    tokens = hidden_state[:, 1:]

or something like that, I forgot the exact syntax

Review thread on src/open_clip/hf_model.py (outdated, resolved)
@gpucce (Contributor, Author) commented Dec 29, 2022

@iejMac sure, I will add it here shortly. I have an idea of what the issue is too; later I will experiment with it a bit and then write here what I find.

@gpucce (Contributor, Author) commented Dec 29, 2022

@iejMac this should log them separately

@iejMac (Contributor) commented Dec 29, 2022

@gpucce we should probably also log the total loss; since it's weighted, it might contain important info.

@gpucce (Contributor, Author) commented Dec 29, 2022

@iejMac my idea about the issue seems to be wrong; another possibility is that, since pad tokens are now ignored in the loss, it looks higher. Tomorrow I will check for other things. I saw your run seems to be doing OK; if you get any evaluation on a checkpoint, please share, and sorry for taking so much time today.

@rom1504 (Collaborator) commented Dec 29, 2022

it's quite possible the loss will go down a little later

@iejMac (Contributor) commented Jan 2, 2023

@gpucce if you look at the Wandb, it looks like this PR consistently achieves a much higher loss and slightly lower zero-shot ImageNet performance than our previous version (in the coca branch). Looking over the code again, do you have any idea whether this is expected and, if not, what could be wrong? It looks like the captioning loss is pretty high too; too bad we can't compare to the previous run.

@gpucce (Contributor, Author) commented Jan 4, 2023

@iejMac sorry I didn't have time to work on this sooner and also for wasting some compute:

  • the higher loss is due to now ignoring pad tokens in the captioning loss; performance is indeed lower, but close to the same (see the sketch below)
  • the small decrease in performance is my mistake: I added a projection that doesn't make sense, and I am confident a small commit should fix it
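
A minimal sketch of how pad tokens can be excluded from the captioning loss, assuming PyTorch; pad_id here is illustrative, not the repo's actual constant:

import torch.nn.functional as F

def caption_loss(logits, labels, pad_id=0):
    # logits: [batch, seq_len, vocab_size], labels: [batch, seq_len]
    return F.cross_entropy(
        logits.permute(0, 2, 1),  # cross_entropy expects [batch, vocab, seq]
        labels,
        ignore_index=pad_id,      # padded positions contribute nothing to the loss
    )

Since padded positions no longer dilute the average, the reported loss value sits higher even when the model is doing equally well.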

Before fixing this and making the model not loadable again, I added a PR to CLIP_benchmark (LAION-AI/CLIP_benchmark#63) that works with both the coca branch and this one at this stage. I tested the old checkpoint and I get scores like the following:

Bleu_1: 0.225, Bleu_2: 0.086, Bleu_3: 0.037, Bleu_4: 0.017, METEOR: 0.064, ROUGE_L: 0.169, CIDEr: 0.089, SPICE: 0.028

This is not really a good validation, but it should be enough to check whether the model's generative part is training; running on an untrained model gets all zeros. If you have time, could you run this with the old checkpoint as I did, and with the one from the shorter run that finished a couple of days ago if you have it, so we can confirm that generation is still working?

EDIT: Or of course share the checkpoint and I will do it.

@iejMac (Contributor) commented Jan 4, 2023

@gpucce no worries. I uploaded 2 comparison checkpoints to hf here - https://huggingface.co/iejMac/CoCa_tests/tree/main/checkpoints

epoch_53_new.pt is from latest run
epoch_53_old.pt is from our full run on the coca branch

@gpucce (Contributor, Author) commented Jan 4, 2023

@gpucce no worries. I uploaded 2 comparison checkpoints to hf here - https://huggingface.co/iejMac/CoCa_tests/tree/main/checkpoints

epoch_53_new.pt is from latest run epoch_53_old.pt is from our full run on the coca branch

@iejMac here are the scores

             Bleu_1  Bleu_2  Bleu_3  Bleu_4  CIDEr   METEOR  ROUGE_L  SPICE
epoch_53_new 0.1789  0.0363  0.0056  0       0.0342  0.0742  0.1605   0.0281
epoch_53_old 0.2004  0.0696  0.0284  0.0115  0.0704  0.0573  0.1533   0.0233

These are a bit worse too, but probably for the same reason the CLIP one is; I will push the fix in a few hours.

@iejMac (Contributor) commented Jan 5, 2023

@gpucce let me know when you're happy with the fixes

@gpucce (Contributor, Author) commented Jan 5, 2023

@iejMac hopefully it is fine now; if you could try running it so we can see what happens, that would be very nice.

@iejMac (Contributor) commented Jan 5, 2023

@gpucce I hope you don't mind me asking, but what is this new CoCaEncoderDecoder class? I thought it was only going to be a small change. Why do you think all of this is necessary?

@gpucce (Contributor, Author) commented Jan 5, 2023

@iejMac not at all, the new class is just there to have only .text instead of both .text and .multimodal_decoder.

More generally, the reason the change is larger than I expected is that I would have had to use even more parts of the TextTransformer and VisionTransformer from "outside", with things like if hasattr(self.visual, "smth") ..., and it was becoming too much, so I moved some of that logic inside them as options (see the sketch below).
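
An illustrative sketch of that design choice, with hypothetical names; the point is exposing behavior as a constructor option rather than probing internals from outside:

import torch.nn as nn

class VisionTower(nn.Module):
    # instead of external checks like `if hasattr(model.visual, "attn_pool"): ...`,
    # the module itself owns the option
    def __init__(self, trunk: nn.Module, output_tokens: bool = False):
        super().__init__()
        self.trunk = trunk
        self.output_tokens = output_tokens

    def forward(self, image):
        pooled, tokens = self.trunk(image)  # assumes the trunk returns both
        if self.output_tokens:
            return pooled, tokens           # CoCa needs the token sequence too
        return pooled                       # plain CLIP-style usage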

@iejMac (Contributor) commented Jan 5, 2023

@gpucce Ah ok that sounds good. Going to start a run up and send wandb

@rom1504 (Collaborator) commented Jan 5, 2023

For reference, here is the change compared to main: main...gpucce:open_clip:encode_in_encoder

@rom1504 (Collaborator) commented Jan 6, 2023

new run at https://wandb.ai/iejmac/open-clip/reports/CoCa-ViT-B-32-v2--VmlldzozMjM0Nzg1

merging here

let's continue at #308 to make sure we addressed all the points

@rom1504 rom1504 merged commit 30a73d4 into mlfoundations:coca Jan 6, 2023
@Soonhwan-Kwon commented

new run at https://wandb.ai/iejmac/open-clip/reports/CoCa-ViT-B-32-v2--VmlldzozMjM0Nzg1

merging here

let's continue at #308 to make sure we addressed all the points

It seems that epoch_53_new comes from the older version; are the new run checkpoints from the latest model architecture?

rom1504 added a commit that referenced this pull request Jan 29, 2023
* Add coca trained (#307)

* initial setup

* add coca loss

* remove loss from the model

* fix loss

* add underscores

* name changes

* add cross attention to Residual and CustomResidual

* fix if

* ädd transformer 'decoder'

* minor fix

* looks better

* initlize coca model structure

* clean

* typo and format

* checkpoint signature

* adjust multimodal decoder and add CoCaTransformer

* keep older logic

* remove chunk

* typo

* fix

* make chunk dim explicit

* adjust cfg names

* add attentionalpooling

* add attentional pooling to coca

* small change

* add cocatransformer variants and AttentionPooling

* remoive older attention pooler

* adapt embed text to coca text transformer

* rm coca layers

* rename and remove useless CoCa models

* make attentionpooler pooler only

* refactor for one transformer only

* coca forward works

* separatae context and n_queries

* add inital coca_base config

* remove config

* small loss change

* init training file

* make variable order right

* remove print

* uniform names

* renaming

* add coca funcs to init

* add coca config and exclude from testing

* add and comment simple test (no trained model)

* add L2 norm

* make L2 same as in clip

* remove unused temperature

* type

* clean

* fix config

* make rename and move cfg

* rename

* temptative add coca to factory

* fix config

* update config

* embed contrastive cls token in model

* remove unused arg

* import create_loss

* make factory accept coca

* make caption loss distributed

* make loss customizable

* pass loss trhough training_epoch

* add coca specific params to params

* removed decoder unused parameters

* remove unused attributes

* adjust coca_config

* fix config and remove unused parameters

* remove comment

* remove more comments

* rename attention pooler

* rename TransformerDecoder

* make AttentionalPooler clearer

* add local loss logic to cocaloss

* only create loss if train in data

* remove wrong file

* fix attentional pooler call

* not ready for testing

* really not ready for testing

* eof lien

* uniform names

* add possible generative loss to evaluate

* change _build function names

* remove wrong import

* remove local_loss from captioning loss

* indexing error

* finish renaming

* adjust configs

* add training test for coca

* simplify captioning loss

* remove hf

* fix evaluate and loss

* remove print

* move projection

* add coca vit 32 config

* test on new config

* adjust coca_base config

* remove coca from test_inference

* maybe fix regression test

* make logits and labels contiguous

* simpler logic

* make contiguous after transpose

* last test

* try fix loss

* CoCa PR: loss fix + rename file

* wait for feedback on this

* cleanup

* CoCa PR: add set_grad_checkpointing + fix checkpoint API

* CoCa PR: fix eval (which uses encode_x instead of forward)

* move making space for CLS token into encode_text

* rever zs changes + fix

Co-authored-by: gpucce <g.puccetti92@gmail.com>
Co-authored-by: gpucce <g.puccetti@gmail.com>
Co-authored-by: iejmac <iejmac@ip-172-31-44-155.ec2.internal>

* Add coca to CI

* Add coca to CI pr

* simplify encode_iamge (#313)

Co-authored-by: Romain Beaumont <romain.rom1@gmail.com>

* Add cls mask (#312)

* buil_cls_mask

* add cls_mask to encode_text

* add model properties

Co-authored-by: Romain Beaumont <romain.rom1@gmail.com>
Co-authored-by: gpucce <g.puccetti@gmail.com>

* Ignore pad tokens in captioning loss (#316)

* add ignore_index

* just need to pick right index

Co-authored-by: gpucce <g.puccetti@gmail.com>

* add `generate` to coca model (#314)

* add initial generative support

* make generation context_length independend

* remove kwargs

* last positional embeddings for CLS

* typo

* fix mask len

* add comment

* remove unused args

* simpler logic for input shorter than context length

Co-authored-by: gpucce <g.puccetti@gmail.com>

* use `TextEncoder` in coca `encode_image` (#321)

* use self.text in encode image

* unused var

* rever aAtention and CustoResidualAttentionBlock

* remove whiteline

* add dict output

* bintegrate self.text attributes

* HF compatibility

* better config and minor fixes

* clean

* remove eembed_cls option from HF

* use cls_token_position

* fix cls masking

* resize labels

* text -> self.text

* split loss logging

* add total loss

* minor logs formatting

* fix generate

* simpler logic

* disentangle proj for HF too

* adjust config

* only norm cls

* move attn_pool to VisionTransformer

* adjust coca_base config

* fix grad checkpointing in MultimodalTransformer

Co-authored-by: gpucce <g.puccetti@gmail.com>
Co-authored-by: iejMac <kilianmaciej6@gmail.com>

* Get some basic PEP changes out of the way

* Add tests bis (#355)

* make jit compilable

* redundant annotation

* less tests

* less annotations

* even less annotations

* fix name check in ci

* some annotations back

* make it simpler

* make hf simpler too

* better jit support with tests

* remove extra line

* add customtextclip

* more jit tests

* missing assert

* add eval

* typo

* rever forward changes

* clean coca model

* more cleaning

* last cleaning

* train.py: fix is_clip when doing distributed (#364)

* add README (#365)

* add README

* multimodal_cfg info

* multimodal

* remove output_dict argument (#368)

* remove output_dict argument

* cleaner

* do same thing for _encode_image (#366)

* do same thing for _encode_image

* encoder

* try this

* adjust inference tests

* fix syntax

* True not None

* dumb

* CoCa/forward: remove unused output_dict param

* Revert "do same thing for _encode_image (#366)"

This reverts commit de343fb.

* refactor

* white space

* remove extra layer norm

* move to_logits into decoder

* leave for later

* better torchscript

* annotate hf too

* Add CoCa-ViT-L/14 config (#379)

* Remove dead LN code, refactor attn_pool conditional for more clarity, minor formatting tweaks

* latent_dim to embed_dim

* remove extra cfg

* A bit more cleanup, keep context_length as context len, 'num_pos' to incl extra tokens. None type check for embed_cls instead of getattr

* CoCa: add B/32 pretrained (#389)

* add B/32 pretrained

* fix

* no capital

* slash

* remove coca from ci.yml

---------

Co-authored-by: gpucce <g.puccetti92@gmail.com>
Co-authored-by: gpucce <g.puccetti@gmail.com>
Co-authored-by: iejmac <iejmac@ip-172-31-44-155.ec2.internal>
Co-authored-by: iejMac <kilianmaciej6@gmail.com>
Co-authored-by: Ross Wightman <rwightman@gmail.com>