
use TextEncoder in coca encode_image #321

Merged: 25 commits merged into mlfoundations:coca on Jan 6, 2023

Conversation

@gpucce (Contributor) commented Dec 24, 2022

This PR adds the option to use a CLS token as a model parameter in TextEncoder, simplifying the logic of CoCa.
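
A minimal sketch of the idea, assuming PyTorch; the class and attribute names (e.g. cls_emb) are illustrative, not necessarily the PR's exact code:

import torch
import torch.nn as nn

class TextEncoderWithCLS(nn.Module):
    def __init__(self, vocab_size=49408, width=512, context_length=76):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, width)
        # one extra position for the learned CLS token appended to the sequence
        self.positional_embedding = nn.Parameter(torch.empty(context_length + 1, width))
        self.cls_emb = nn.Parameter(torch.empty(width))  # CLS token as a model parameter
        nn.init.normal_(self.positional_embedding, std=0.01)
        nn.init.normal_(self.cls_emb, std=0.01)

    def forward(self, text):
        x = self.token_embedding(text)                 # [batch, seq, width]
        cls = self.cls_emb.expand(x.shape[0], 1, -1)   # [batch, 1, width]
        x = torch.cat([x, cls], dim=1) + self.positional_embedding
        # ... transformer blocks would go here ...
        pooled, tokens = x[:, -1], x[:, :-1]           # CLS output vs. token outputs
        return pooled, tokens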

@gpucce (Contributor, Author) commented Dec 24, 2022

@rom1504 please don't merge this yet

@gpucce (Contributor, Author) commented Dec 27, 2022

@rom1504 this should be ready. Essentially it addresses the points raised in the old PR (quoted below) and consequently adds the possibility for the text-only encoder to be the Hugging Face one; making the multimodal one also available as HF is more difficult and I would do that in a different PR. As usual, reviews, comments and anything else are more than welcome.

Would appreciate your review @rwightman, if you think anything big needs to be done.

Made some code review comments; the most important points:

  • Make the CoCa model dual tower (.visual, .text) from the start so the built-in and HF text transformer models are handled the same way; there is no backward compatibility to be concerned about
  • Revert the changes to Attention / CustomResidualAttentionBlock; we should avoid using them with the CoCa model
  • Move all models to output dicts instead of tuples. This makes it more flexible to pass outputs through optional losses / filters without brittle indexing and tuple-length checks that are bound to fail with one more addition. The dict output could be made optional at construction time to prevent possible breaks for other downstream users (see the sketch after this quote)

and then test test test.

Originally posted by @rwightman in #256 (comment)
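
For context, a minimal sketch of the dict-output idea above, assuming PyTorch; the output_dict flag and the dict keys are assumptions for illustration, not necessarily what the repo settled on:

import torch.nn as nn

class CLIPLike(nn.Module):
    def __init__(self, visual: nn.Module, text: nn.Module, output_dict: bool = False):
        super().__init__()
        self.visual = visual
        self.text = text
        self.output_dict = output_dict  # opt-in at construction time

    def forward(self, image, text):
        image_features = self.visual(image)
        text_features = self.text(text)
        if self.output_dict:
            # named outputs: losses/filters pick what they need, no tuple indexing
            return {"image_features": image_features, "text_features": text_features}
        return image_features, text_features  # legacy tuple output for downstream users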

@gpucce marked this pull request as ready for review December 27, 2022 11:43
Review threads on src/open_clip/coca_model.py (outdated, resolved)
@iejMac (Contributor) commented Dec 28, 2022

Hey, nice work! Left some very minor comments but I still need to look at the HF stuff in more detail. I'll do that later

@gpucce (Contributor, Author) commented Dec 28, 2022

Hey, nice work! Left some very minor comments but I still need to look at the HF stuff in more detail. I'll do that later

Thanks for looking into it! For HF I didn't add a new Pooler, but I can do it if you think it is better.

@iejMac (Contributor) commented Dec 28, 2022

For HF stuff from my quick look I have 2 concerns:

  1. Do all HF models have resize_token_embeddings?
  2. What is the point of embed_cls? We already have a CLS pooler: https://github.com/gpucce/open_clip/blob/0e46a0199d6aaae69dc5c5089a62e8fe3b9d26ab/src/open_clip/hf_model.py#L65

@gpucce (Contributor, Author) commented Dec 28, 2022

For HF stuff from my quick look I have 2 concerns:

  1. Do all HF models have resize_token_embeddings?

  2. What is the point of embed_cls? We already have a CLS pooler: https://github.com/gpucce/open_clip/blob/0e46a0199d6aaae69dc5c5089a62e8fe3b9d26ab/src/open_clip/hf_model.py#L65

  1. It should be a method of every PreTrainedModel (see the sketch below): https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel.resize_token_embeddings

  2. The reason would be to let the model's CLS token and the one used for the contrastive loss be different. Adding this through the HF tokenizer seems more difficult to make stable.
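
A minimal sketch of point 1, assuming the Hugging Face transformers library; the model name and the extra special token are purely illustrative:

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

# add a hypothetical extra CLS-style token, then grow the embedding matrix to match
tokenizer.add_special_tokens({"additional_special_tokens": ["<cls_contrastive>"]})
model.resize_token_embeddings(len(tokenizer))  # new rows are freshly initialized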

@iejMac (Contributor) commented Dec 29, 2022

@gpucce Re point 2: I think we actually want the CLS tokens to be the same. The idea would be to tune the CLS output embedding so that it's useful contrastively, vs. just learning a completely new one. I realize this raises the issue of transformers that don't use CLS pooling, but I don't see a problem with diverging from CoCa a bit here, i.e. allowing for different pooling methods (like averaging) and using that as the text embedding for contrastive learning. To stay consistent with the paper, we can make CLS pooling the pooling method for the default configuration.
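
A rough sketch of what swappable pooling could look like; these classes are simplified illustrations, not the repo's exact poolers:

import torch.nn as nn

class ClsPooler(nn.Module):
    def forward(self, hidden_state, attention_mask):
        return hidden_state[:, 0]  # take the CLS position

class MeanPooler(nn.Module):
    def forward(self, hidden_state, attention_mask):
        # masked mean over the sequence, ignoring padding
        masked = hidden_state * attention_mask.unsqueeze(-1)
        return masked.sum(dim=1) / attention_mask.sum(dim=-1, keepdim=True)

The default configuration could then select ClsPooler to stay consistent with the paper.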

@iejMac (Contributor) commented Dec 29, 2022

Let me know what you think and also if you're busy. I can get to it myself actually in case you are

@gpucce (Contributor, Author) commented Dec 29, 2022

Let me know what you think and also if you're busy. I can get to it myself actually in case you are

Hi, I am getting to it now and I agree with you; I will remove the embed_cls option. Another issue might then be that what is fed to the generative side might not have any <end_of_text> token.

@gpucce (Contributor, Author) commented Dec 29, 2022

@iejMac would it be bad to add the output_tokens option to the poolers themselves?

@iejMac (Contributor) commented Dec 29, 2022

Why do we want to add it to the poolers? Can't we just use the poolers to get the pooled embedding but also return the tokens on the side?

@gpucce (Contributor, Author) commented Dec 29, 2022

Why do we want to add it to the poolers? Can't we just use the poolers to get the pooled embedding but also return the tokens on the side?

I was thinking that with CLS pooling the CLS embedding should not be in the tokens, while with Mean/Max it maybe should? We could also pick from the second token onward for all of them.

@iejMac (Contributor) commented Dec 29, 2022

Ah, do you mean that for CLS pooling the tokens are x[1:] whereas for mean/max it's just x? If so, I still think maybe just keep the poolers doing only pooling, and then we can add a condition along the lines of

if isinstance(self.pooler, ClsPooler):
    tokens = hidden_state[:, 1:]

or something like that, I forgot the exact syntax

Review thread on src/open_clip/hf_model.py (outdated, resolved)
@gpucce (Contributor, Author) commented Dec 29, 2022

@iejMac sure, I will add it here shortly. I have an idea of what the issue is too; later I will experiment with it a bit and then write here what I find.

@gpucce (Contributor, Author) commented Dec 29, 2022

@iejMac this should log them separately

@iejMac (Contributor) commented Dec 29, 2022

@gpucce we should probably also log the total loss; since it's weighted, it might contain important info.

@gpucce (Contributor, Author) commented Dec 29, 2022

@iejMac my idea about the issue seems to be wrong; another possibility is that, since pad tokens are now ignored in the loss, it looks higher. Tomorrow I will check for other things. I saw your run seems to be doing OK; if you get any evaluation on a checkpoint, please share, and sorry for taking so much time today.

@rom1504 (Collaborator) commented Dec 29, 2022

it's quite possible the loss will go down a little later

@iejMac (Contributor) commented Jan 2, 2023

@gpucce if you look at the Wandb, it looks like this PR consistently achieves a much higher loss and slightly lower zero-shot ImageNet performance than our previous version (in the coca branch). Looking over the code again, do you have any idea whether this is expected and, if not, what could be wrong? It looks like the captioning loss is pretty high too; too bad we can't compare to the previous run.

@gpucce (Contributor, Author) commented Jan 4, 2023

@iejMac sorry I didn't have time to work on this sooner and also for wasting some compute:

  • the higher loss is due to now ignoring pad tokens in the captioning loss; performance is indeed lower, but close to the same (see the sketch below)
  • the small decrease in performance is my mistake: I added a projection that doesn't make sense, and I am confident a small commit should fix it
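
A minimal sketch of how pad tokens can be excluded from the captioning loss, assuming PyTorch; pad_id here is illustrative, not the repo's actual constant:

import torch.nn.functional as F

def caption_loss(logits, labels, pad_id=0):
    # logits: [batch, seq_len, vocab_size], labels: [batch, seq_len]
    return F.cross_entropy(
        logits.permute(0, 2, 1),  # cross_entropy expects [batch, vocab, seq]
        labels,
        ignore_index=pad_id,      # padded positions contribute nothing to the loss
    )

Since padded positions no longer dilute the average, the reported loss value sits higher even when the model is doing equally well.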

Before fixing this and making the model not loadable again, I added a PR to CLIP_benchmark (LAION-AI/CLIP_benchmark#63) that works with both the coca branch and this one at this stage. I tested the old checkpoint and I get scores like the following:

Bleu_1: 0.225, Bleu_2: 0.086, Bleu_3: 0.037, Bleu_4: 0.017, METEOR: 0.064, ROUGE_L: 0.169, CIDEr: 0.089, SPICE: 0.028

This is not really a good validation, but it should be enough to check whether the model's generative part is training; running on an untrained model gets all zeros. If you have time, could you run this with the old checkpoint as I did, and with the one from the shorter run that finished a couple of days ago if you have it, so we can confirm that generation is still working?

EDIT: Or of course share the checkpoint and I will do it.

@iejMac (Contributor) commented Jan 4, 2023

@gpucce no worries. I uploaded 2 comparison checkpoints to hf here - https://huggingface.co/iejMac/CoCa_tests/tree/main/checkpoints

epoch_53_new.pt is from latest run
epoch_53_old.pt is from our full run on the coca branch

@gpucce (Contributor, Author) commented Jan 4, 2023

@gpucce no worries. I uploaded 2 comparison checkpoints to hf here - https://huggingface.co/iejMac/CoCa_tests/tree/main/checkpoints

epoch_53_new.pt is from latest run epoch_53_old.pt is from our full run on the coca branch

@iejMac here are the scores

             Bleu_1  Bleu_2  Bleu_3  Bleu_4  CIDEr   METEOR  ROUGE_L  SPICE
epoch_53_new 0.1789  0.0363  0.0056  0       0.0342  0.0742  0.1605   0.0281
epoch_53_old 0.2004  0.0696  0.0284  0.0115  0.0704  0.0573  0.1533   0.0233

These are a bit worse too, but probably for the same reason the CLIP one is; I will push the fix in a few hours.

@iejMac (Contributor) commented Jan 5, 2023

@gpucce let me know when you're happy with the fixes

@gpucce (Contributor, Author) commented Jan 5, 2023

@iejMac hopefully it is fine now; if you could try running it so we can see what happens, that would be very nice.

@iejMac (Contributor) commented Jan 5, 2023

@gpucce I hope you don't mind me asking, but what is this new CoCaEncoderDecoder class? I thought it was only going to be a small change. Why do you think all of this is necessary?

@gpucce (Contributor, Author) commented Jan 5, 2023

@iejMac not at all, the new class is just there to have only .text instead of both .text and .multimodal_decoder.

More generally, the reason the change is larger than I expected is that I would have had to use even more parts of the TextTransformer and VisionTransformer from "outside", with things like if hasattr(self.visual, "smth") ..., and it was becoming too much, so I moved some of that logic inside them as options (see the sketch below).
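
An illustrative sketch of that design choice, with hypothetical names; the point is exposing behavior as a constructor option rather than probing internals from outside:

import torch.nn as nn

class VisionTower(nn.Module):
    # instead of external checks like `if hasattr(model.visual, "attn_pool"): ...`,
    # the module itself owns the option
    def __init__(self, trunk: nn.Module, output_tokens: bool = False):
        super().__init__()
        self.trunk = trunk
        self.output_tokens = output_tokens

    def forward(self, image):
        pooled, tokens = self.trunk(image)  # assumes the trunk returns both
        if self.output_tokens:
            return pooled, tokens           # CoCa needs the token sequence too
        return pooled                       # plain CLIP-style usage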

@iejMac (Contributor) commented Jan 5, 2023

@gpucce Ah ok that sounds good. Going to start a run up and send wandb

@rom1504 (Collaborator) commented Jan 5, 2023

For reference, here is the change compared to main: main...gpucce:open_clip:encode_in_encoder

@rom1504 (Collaborator) commented Jan 6, 2023

new run at https://wandb.ai/iejmac/open-clip/reports/CoCa-ViT-B-32-v2--VmlldzozMjM0Nzg1

merging here

let's continue at #308 to make sure we addressed all the points

@rom1504 rom1504 merged commit 30a73d4 into mlfoundations:coca Jan 6, 2023
@Soonhwan-Kwon commented

new run at https://wandb.ai/iejmac/open-clip/reports/CoCa-ViT-B-32-v2--VmlldzozMjM0Nzg1

merging here

let's continue at #308 to make sure we addressed all the points

It seems that epoch_53_new comes from the older version; are the new run checkpoints from the latest model architecture?

rom1504 added a commit that referenced this pull request Jan 29, 2023
* Add coca trained (#307)

* initial setup

* add coca loss

* remove loss from the model

* fix loss

* add underscores

* name changes

* add cross attention to Residual and CustomResidual

* fix if

* ädd transformer 'decoder'

* minor fix

* looks better

* initlize coca model structure

* clean

* typo and format

* checkpoint signature

* adjust multimodal decoder and add CoCaTransformer

* keep older logic

* remove chunk

* typo

* fix

* make chunk dim explicit

* adjust cfg names

* add attentionalpooling

* add attentional pooling to coca

* small change

* add cocatransformer variants and AttentionPooling

* remoive older attention pooler

* adapt embed text to coca text transformer

* rm coca layers

* rename and remove useless CoCa models

* make attentionpooler pooler only

* refactor for one transformer only

* coca forward works

* separatae context and n_queries

* add inital coca_base config

* remove config

* small loss change

* init training file

* make variable order right

* remove print

* uniform names

* renaming

* add coca funcs to init

* add coca config and exclude from testing

* add and comment simple test (no trained model)

* add L2 norm

* make L2 same as in clip

* remove unused temperature

* type

* clean

* fix config

* make rename and move cfg

* rename

* temptative add coca to factory

* fix config

* update config

* embed contrastive cls token in model

* remove unused arg

* import create_loss

* make factory accept coca

* make caption loss distributed

* make loss customizable

* pass loss trhough training_epoch

* add coca specific params to params

* removed decoder unused parameters

* remove unused attributes

* adjust coca_config

* fix config and remove unused parameters

* remove comment

* remove more comments

* rename attention pooler

* rename TransformerDecoder

* make AttentionalPooler clearer

* add local loss logic to cocaloss

* only create loss if train in data

* remove wrong file

* fix attentional pooler call

* not ready for testing

* really not ready for testing

* eof lien

* uniform names

* add possible generative loss to evaluate

* change _build function names

* remove wrong import

* remove local_loss from captioning loss

* indexing error

* finish renaming

* adjust configs

* add training test for coca

* simplify captioning loss

* remove hf

* fix evaluate and loss

* remove print

* move projection

* add coca vit 32 config

* test on new config

* adjust coca_base config

* remove coca from test_inference

* maybe fix regression test

* make logits and labels contiguous

* simpler logic

* make contiguous after transpose

* last test

* try fix loss

* CoCa PR: loss fix + rename file

* wait for feedback on this

* cleanup

* CoCa PR: add set_grad_checkpointing + fix checkpoint API

* CoCa PR: fix eval (which uses encode_x instead of forward)

* move making space for CLS token into encode_text

* rever zs changes + fix

Co-authored-by: gpucce <g.puccetti92@gmail.com>
Co-authored-by: gpucce <g.puccetti@gmail.com>
Co-authored-by: iejmac <iejmac@ip-172-31-44-155.ec2.internal>

* Add coca to CI

* Add coca to CI pr

* simplify encode_iamge (#313)

Co-authored-by: Romain Beaumont <romain.rom1@gmail.com>

* Add cls mask (#312)

* buil_cls_mask

* add cls_mask to encode_text

* add model properties

Co-authored-by: Romain Beaumont <romain.rom1@gmail.com>
Co-authored-by: gpucce <g.puccetti@gmail.com>

* Ignore pad tokens in captioning loss (#316)

* add ignore_index

* just need to pick right index

Co-authored-by: gpucce <g.puccetti@gmail.com>

* add `generate` to coca model (#314)

* add initial generative support

* make generation context_length independend

* remove kwargs

* last positional embeddings for CLS

* typo

* fix mask len

* add comment

* remove unused args

* simpler logic for input shorter than context length

Co-authored-by: gpucce <g.puccetti@gmail.com>

* use `TextEncoder` in coca `encode_image` (#321)

* use self.text in encode image

* unused var

* rever aAtention and CustoResidualAttentionBlock

* remove whiteline

* add dict output

* bintegrate self.text attributes

* HF compatibility

* better config and minor fixes

* clean

* remove eembed_cls option from HF

* use cls_token_position

* fix cls masking

* resize labels

* text -> self.text

* split loss logging

* add total loss

* minor logs formatting

* fix generate

* simpler logic

* disentangle proj for HF too

* adjust config

* only norm cls

* move attn_pool to VisionTransformer

* adjust coca_base config

* fix grad checkpointing in MultimodalTransformer

Co-authored-by: gpucce <g.puccetti@gmail.com>
Co-authored-by: iejMac <kilianmaciej6@gmail.com>

* Get some basic PEP changes out of the way

* Add tests bis (#355)

* make jit compilable

* redundant annotation

* less tests

* less annotations

* even less annotations

* fix name check in ci

* some annotations back

* make it simpler

* make hf simpler too

* better jit support with tests

* remove extra line

* add customtextclip

* more jit tests

* missing assert

* add eval

* typo

* rever forward changes

* clean coca model

* more cleaning

* last cleaning

* train.py: fix is_clip when doing distributed (#364)

* add README (#365)

* add README

* multimodal_cfg info

* multimodal

* remove output_dict argument (#368)

* remove output_dict argument

* cleaner

* do same thing for _encode_image (#366)

* do same thing for _encode_image

* encoder

* try this

* adjust inference tests

* fix syntax

* True not None

* dumb

* CoCa/forward: remove unused output_dict param

* Revert "do same thing for _encode_image (#366)"

This reverts commit de343fb.

* refactor

* white space

* remove extra layer norm

* move to_logits into decoder

* leave for later

* better torchscript

* annotate hf too

* Add CoCa-ViT-L/14 config (#379)

* Remove dead LN code, refactor attn_pool conditional for more clarity, minor formatting tweaks

* latent_dim to embed_dim

* remove extra cfg

* A bit more cleanup, keep context_length as context len, 'num_pos' to incl extra tokens. None type check for embed_cls instead of getattr

* CoCa: add B/32 pretrained (#389)

* add B/32 pretrained

* fix

* no capital

* slash

* remove coca from ci.yml

---------

Co-authored-by: gpucce <g.puccetti92@gmail.com>
Co-authored-by: gpucce <g.puccetti@gmail.com>
Co-authored-by: iejmac <iejmac@ip-172-31-44-155.ec2.internal>
Co-authored-by: iejMac <kilianmaciej6@gmail.com>
Co-authored-by: Ross Wightman <rwightman@gmail.com>