
add generate to coca model #314

Merged
merged 10 commits into mlfoundations:coca on Dec 22, 2022

Conversation

@gpucce
Contributor

gpucce commented Dec 21, 2022

This PR adds a `generate` method to the CoCa model to support caption generation,

based on https://github.com/lucidrains/x-transformers/blob/main/x_transformers/autoregressive_wrapper.py
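
For context, the autoregressive wrapper linked above boils down to a plain sampling loop: feed the tokens generated so far back into the model, take the logits for the last position, and sample the next token. A minimal sketch of that idea (the `model(image, tokens)` call signature and parameter names here are illustrative, not the exact API added by this PR):

```python
import torch
import torch.nn.functional as F

def generate_sketch(model, image, prompt_ids, max_new_tokens=20,
                    temperature=1.0, top_k=50, eot_id=None):
    """Autoregressive top-k sampling: append one sampled token per step."""
    tokens = prompt_ids  # (batch, prompt_len)
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(image, tokens)[:, -1, :]        # next-token logits
            logits = logits / temperature
            kth = torch.topk(logits, top_k).values[:, -1, None]
            logits = logits.masked_fill(logits < kth, float("-inf"))  # keep top-k only
            probs = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, 1)       # (batch, 1)
            tokens = torch.cat([tokens, next_token], dim=-1)
            if eot_id is not None and (next_token == eot_id).all():
                break                                      # every sequence emitted <end_of_text>
    return tokens
```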

@gpucce
Contributor Author

gpucce commented Dec 21, 2022

@rom1504 where would you like to test the generative side?

I implemented the generation part as a `generate` method on the CoCa model. Something is happening, but not as much as it should, I believe. Perhaps the captioning loss also needs some adjustment with regard to pad tokens; I might be missing that in the CoCa loss.
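
On the pad-token point: one common way to keep padding out of a captioning loss is cross-entropy with `ignore_index`. A minimal sketch, assuming `pad_id` is whatever id the tokenizer uses for padding (the actual fix landed later in a separate PR):

```python
import torch.nn.functional as F

def caption_loss(logits, labels, pad_id=0):
    # logits: (batch, seq_len, vocab), labels: (batch, seq_len)
    # Positions whose label equals pad_id contribute neither to the loss nor to its gradient.
    return F.cross_entropy(
        logits.permute(0, 2, 1),  # cross_entropy expects (batch, vocab, seq_len)
        labels,
        ignore_index=pad_id,
    )
```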

Nevertheless, it would be nice to set up a proper benchmark; which task/dataset should I use for it?

@gpucce
Contributor Author

gpucce commented Dec 21, 2022

Also, tests pass locally, but it would be nice to have them run here too; I don't know if that is possible since this PR doesn't point at main.

@rom1504
Collaborator

rom1504 commented Dec 21, 2022 via email

@gpucce
Contributor Author

gpucce commented Dec 21, 2022

@rom1504 some examples from ImageNet, always using the prompt "the image shows a". PRED is what the model generates, LABEL is the ImageNet label, followed by the 5 images that were used.

PRED of a a red red - - crested crowned tur
LABEL African grey, African gray, Psittacus erithacus
PRED shows corn corn in on a a farm tractor . 
LABEL ear, spike, capitulum
PRED of a a giant giant panda panda . ( <end_of_text>
LABEL giant panda, panda, panda bear, coon bear, Ailuropoda melanoleuca
PRED shows a a wooden wooden cradle cradle with with a 
LABEL cradle
PRED of a a cowboy cowboy on sitting a on horse 
LABEL cowboy hat, ten-gallon hat

[image]

PRED shows a a red red - - backed brown cou
LABEL coucal
PRED shows a a dog dog looking and at looking a 
LABEL Italian greyhound
PRED shows the the view mountain of of the the volcano 
LABEL volcano
PRED shows a a dog dog sitting sitting on on the 
LABEL Welsh springer spaniel
PRED shows a a black black - - capped throchick
LABEL chickadee

[image]
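
Roughly how such captions can be produced, as an illustrative sketch; the model name, the missing pretrained weights, and the `generate` signature are assumptions here, not the final API:

```python
import torch
from PIL import Image
import open_clip
from open_clip.tokenizer import SimpleTokenizer

# Assumed names: at the time of this PR only the coca_ViT-B-32 config existed;
# pretrained CoCa weights were added in later PRs.
model, _, preprocess = open_clip.create_model_and_transforms("coca_ViT-B-32")
model.eval()

prompt = open_clip.tokenize(["the image shows a"])
image = preprocess(Image.open("imagenet_example.jpg")).unsqueeze(0)

with torch.no_grad():
    generated = model.generate(image, prompt)  # token ids, prompt included

# decode() is the SimpleTokenizer helper; in practice strip pad/<end_of_text> ids first
print(SimpleTokenizer().decode(generated[0].tolist()))
```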

@rom1504
Collaborator

rom1504 commented Dec 21, 2022

Nice!

It seems to love repeating itself.
I'm wondering if we could fix that by tuning the sampling params.

Any opinions, @lucidrains?
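
The usual sampling knobs for curbing repetition look roughly like this: temperature, top-k filtering, and a GPT-2/CTRL-style repetition penalty. A generic sketch, not code from this PR:

```python
import torch
import torch.nn.functional as F

def sample_next(logits, generated, temperature=0.8, top_k=50, repetition_penalty=1.2):
    # logits: (batch, vocab) for the next position; generated: (batch, cur_len) tokens so far
    for b in range(logits.size(0)):
        prev = generated[b].unique()
        # down-weight tokens that already appeared, so the model is less likely to loop
        logits[b, prev] = torch.where(
            logits[b, prev] > 0,
            logits[b, prev] / repetition_penalty,
            logits[b, prev] * repetition_penalty,
        )
    logits = logits / temperature                        # lower temperature -> less random
    kth = torch.topk(logits, top_k).values[:, -1, None]  # k-th largest logit per row
    logits = logits.masked_fill(logits < kth, float("-inf"))
    return torch.multinomial(F.softmax(logits, dim=-1), 1)
```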

@Soonhwan-Kwon

I also tested naive caption generation and it repeats a lot. We need beam search or contrastive search to get better results, and I recommend using the same conventions as Hugging Face generation for future support.
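
Following the Hugging Face convention would mean exposing decoding strategies as keyword arguments on `generate`. Reusing the hypothetical `model`, `image`, and `prompt` from the sketch above, a call might look like this (the keyword names are the ones `transformers` uses; the CoCa-side signature is an assumption):

```python
# Beam search, HF-style keywords (hypothetical CoCa signature):
out = model.generate(image, prompt, max_new_tokens=20, num_beams=5)

# Contrastive search, HF-style keywords:
out = model.generate(image, prompt, max_new_tokens=20, penalty_alpha=0.6, top_k=4)
```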

@Soonhwan-Kwon

Soonhwan-Kwon commented Dec 21, 2022

I have already scored CIDEr 120 on COCO with CoCa, and in that implementation I separated the k/v in attention as HF transformers do, which can be more compatible and efficient. If needed, I can have a look here.

@rom1504
Collaborator

rom1504 commented Dec 21, 2022 via email

@rom1504
Collaborator

rom1504 commented Dec 21, 2022

> which task/dataset should I use for it?

Maybe "CIDEr on COCO". @Soonhwan-Kwon, could you point to code to evaluate this?

@rom1504
Collaborator

rom1504 commented Dec 21, 2022

I checked the code here and it looks good; maybe we can merge and iterate on top.

@rom1504
Collaborator

rom1504 commented Dec 21, 2022

@gpucce could you resolve the merge conflict? (rebase on coca)

@Soonhwan-Kwon

Soonhwan-Kwon commented Dec 21, 2022

> which task/dataset should I use for it?
>
> Maybe "CIDEr on COCO". @Soonhwan-Kwon, could you point to code to evaluate this?

I used this BLIP code as a reference, from line 80 onward:
https://huggingface.co/spaces/Salesforce/BLIP/blob/fe102a54bbdf72e68c97f685d50a25ce6e46cc5e/data/utils.py
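
For reference, scoring CIDEr on COCO captions usually goes through pycocoevalcap, roughly as in the BLIP utilities linked above. A sketch assuming the standard COCO annotation format and a results file produced by the model:

```python
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

# "captions_val2014.json" is the standard COCO caption annotation file;
# "results.json" holds [{"image_id": ..., "caption": ...}, ...] generated by the model.
coco = COCO("captions_val2014.json")
coco_result = coco.loadRes("results.json")

coco_eval = COCOEvalCap(coco, coco_result)
coco_eval.params["image_id"] = coco_result.getImgIds()  # score only the captioned images
coco_eval.evaluate()

print(coco_eval.eval["CIDEr"])  # CIDEr ~1.2 corresponds to the "CIDEr 120" mentioned above
```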

@gpucce
Contributor Author

gpucce commented Dec 22, 2022

Replying here instead of on the closed PR, @rom1504: the reason the text part is harder to rewrite using only `self.text` is that the CLS token for the contrastive loss is a model parameter kept separate from the token embeddings, to avoid confusion with selecting a specific token in the vocabulary. To add it in the right place I had to rewrite the forward, although the logic is very similar.
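
In other words, the CLS token is a learned parameter appended after the token embeddings rather than an id in the vocabulary. A simplified sketch of that forward logic (names are illustrative, not the exact code in this PR):

```python
import torch
import torch.nn as nn

class TextTowerWithCLS(nn.Module):
    def __init__(self, vocab_size, width, context_length):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, width)
        # learned CLS embedding, kept separate from the vocabulary embeddings
        self.cls_emb = nn.Parameter(torch.randn(width) * 0.02)
        # one extra position is reserved for the CLS token
        self.positional_embedding = nn.Parameter(torch.randn(context_length + 1, width) * 0.02)

    def forward(self, text):
        x = self.token_embedding(text)                # (batch, seq_len, width)
        cls = self.cls_emb.expand(x.shape[0], 1, -1)  # (batch, 1, width)
        x = torch.cat([x, cls], dim=1)                # append CLS at the end
        x = x + self.positional_embedding[: x.shape[1]]
        # ... transformer blocks would follow; the last position becomes the
        # contrastive (CLS) embedding, the remaining positions the token features
        return x
```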

@rom1504
Collaborator

rom1504 commented Dec 22, 2022

I see. I'm wondering if we should add an option to have a CLS token in the original text encoder.
Also, some HF encoders already have that, so in that case it would already work by default.

@gpucce
Contributor Author

gpucce commented Dec 22, 2022

I think that to support generation with a shorter context length, adding support for this CLS token to `TextEncoder` would still require more changes in `TextEncoder`. An alternative could be defining a CoCa-specific tokenizer. Whichever you think is best, I would do it in a different PR if you are still OK with merging this one.

rom1504 marked this pull request as ready for review on December 22, 2022, 15:45
rom1504 merged commit dee1ea5 into mlfoundations:coca on Dec 22, 2022
@rom1504
Collaborator

rom1504 commented Dec 22, 2022

Making changes in the text encoder might work.

I think what's important is to find a way to support more kinds of text encoders, such as the HF ones, for example.

rom1504 added a commit that referenced this pull request Jan 29, 2023
* Add coca trained (#307)

* initial setup

* add coca loss

* remove loss from the model

* fix loss

* add underscores

* name changes

* add cross attention to Residual and CustomResidual

* fix if

* ädd transformer 'decoder'

* minor fix

* looks better

* initlize coca model structure

* clean

* typo and format

* checkpoint signature

* adjust multimodal decoder and add CoCaTransformer

* keep older logic

* remove chunk

* typo

* fix

* make chunk dim explicit

* adjust cfg names

* add attentionalpooling

* add attentional pooling to coca

* small change

* add cocatransformer variants and AttentionPooling

* remoive older attention pooler

* adapt embed text to coca text transformer

* rm coca layers

* rename and remove useless CoCa models

* make attentionpooler pooler only

* refactor for one transformer only

* coca forward works

* separatae context and n_queries

* add inital coca_base config

* remove config

* small loss change

* init training file

* make variable order right

* remove print

* uniform names

* renaming

* add coca funcs to init

* add coca config and exclude from testing

* add and comment simple test (no trained model)

* add L2 norm

* make L2 same as in clip

* remove unused temperature

* type

* clean

* fix config

* make rename and move cfg

* rename

* temptative add coca to factory

* fix config

* update config

* embed contrastive cls token in model

* remove unused arg

* import create_loss

* make factory accept coca

* make caption loss distributed

* make loss customizable

* pass loss trhough training_epoch

* add coca specific params to params

* removed decoder unused parameters

* remove unused attributes

* adjust coca_config

* fix config and remove unused parameters

* remove comment

* remove more comments

* rename attention pooler

* rename TransformerDecoder

* make AttentionalPooler clearer

* add local loss logic to cocaloss

* only create loss if train in data

* remove wrong file

* fix attentional pooler call

* not ready for testing

* really not ready for testing

* eof lien

* uniform names

* add possible generative loss to evaluate

* change _build function names

* remove wrong import

* remove local_loss from captioning loss

* indexing error

* finish renaming

* adjust configs

* add training test for coca

* simplify captioning loss

* remove hf

* fix evaluate and loss

* remove print

* move projection

* add coca vit 32 config

* test on new config

* adjust coca_base config

* remove coca from test_inference

* maybe fix regression test

* make logits and labels contiguous

* simpler logic

* make contiguous after transpose

* last test

* try fix loss

* CoCa PR: loss fix + rename file

* wait for feedback on this

* cleanup

* CoCa PR: add set_grad_checkpointing + fix checkpoint API

* CoCa PR: fix eval (which uses encode_x instead of forward)

* move making space for CLS token into encode_text

* rever zs changes + fix

Co-authored-by: gpucce <g.puccetti92@gmail.com>
Co-authored-by: gpucce <g.puccetti@gmail.com>
Co-authored-by: iejmac <iejmac@ip-172-31-44-155.ec2.internal>

* Add coca to CI

* Add coca to CI pr

* simplify encode_iamge (#313)

Co-authored-by: Romain Beaumont <romain.rom1@gmail.com>

* Add cls mask (#312)

* buil_cls_mask

* add cls_mask to encode_text

* add model properties

Co-authored-by: Romain Beaumont <romain.rom1@gmail.com>
Co-authored-by: gpucce <g.puccetti@gmail.com>

* Ignore pad tokens in captioning loss (#316)

* add ignore_index

* just need to pick right index

Co-authored-by: gpucce <g.puccetti@gmail.com>

* add `generate` to coca model (#314)

* add initial generative support

* make generation context_length independend

* remove kwargs

* last positional embeddings for CLS

* typo

* fix mask len

* add comment

* remove unused args

* simpler logic for input shorter than context length

Co-authored-by: gpucce <g.puccetti@gmail.com>

* use `TextEncoder` in coca `encode_image` (#321)

* use self.text in encode image

* unused var

* rever aAtention and CustoResidualAttentionBlock

* remove whiteline

* add dict output

* bintegrate self.text attributes

* HF compatibility

* better config and minor fixes

* clean

* remove eembed_cls option from HF

* use cls_token_position

* fix cls masking

* resize labels

* text -> self.text

* split loss logging

* add total loss

* minor logs formatting

* fix generate

* simpler logic

* disentangle proj for HF too

* adjust config

* only norm cls

* move attn_pool to VisionTransformer

* adjust coca_base config

* fix grad checkpointing in MultimodalTransformer

Co-authored-by: gpucce <g.puccetti@gmail.com>
Co-authored-by: iejMac <kilianmaciej6@gmail.com>

* Get some basic PEP changes out of the way

* Add tests bis (#355)

* make jit compilable

* redundant annotation

* less tests

* less annotations

* even less annotations

* fix name check in ci

* some annotations back

* make it simpler

* make hf simpler too

* better jit support with tests

* remove extra line

* add customtextclip

* more jit tests

* missing assert

* add eval

* typo

* rever forward changes

* clean coca model

* more cleaning

* last cleaning

* train.py: fix is_clip when doing distributed (#364)

* add README (#365)

* add README

* multimodal_cfg info

* multimodal

* remove output_dict argument (#368)

* remove output_dict argument

* cleaner

* do same thing for _encode_image (#366)

* do same thing for _encode_image

* encoder

* try this

* adjust inference tests

* fix syntax

* True not None

* dumb

* CoCa/forward: remove unused output_dict param

* Revert "do same thing for _encode_image (#366)"

This reverts commit de343fb.

* refactor

* white space

* remove extra layer norm

* move to_logits into decoder

* leave for later

* better torchscript

* annotate hf too

* Add CoCa-ViT-L/14 config (#379)

* Remove dead LN code, refactor attn_pool conditional for more clarity, minor formatting tweaks

* latent_dim to embed_dim

* remove extra cfg

* A bit more cleanup, keep context_length as context len, 'num_pos' to incl extra tokens. None type check for embed_cls instead of getattr

* CoCa: add B/32 pretrained (#389)

* add B/32 pretrained

* fix

* no capital

* slash

* remove coca from ci.yml

---------

Co-authored-by: gpucce <g.puccetti92@gmail.com>
Co-authored-by: gpucce <g.puccetti@gmail.com>
Co-authored-by: iejmac <iejmac@ip-172-31-44-155.ec2.internal>
Co-authored-by: iejMac <kilianmaciej6@gmail.com>
Co-authored-by: Ross Wightman <rwightman@gmail.com>