How to use a non-timm pretrained image tower with a HF pretrained text tower? #543
-
I see that in #236 you folks used an OpenCLIP-pretrained image tower (ViT-H/14) along with an HF pretrained text tower (`open_clip/src/open_clip/factory.py`, line 177 at commit fb72f4d). If I try the same, what am I missing?
Replies: 3 comments
-
BTW, it'd be great if, for each trained model, you could share the exact version of the repo (the commit hash, I guess) and the command used to launch training (especially to see which flags were used).
-
I guess this is the current answer: #237 (comment)? Not sure. Another workaround, I guess, would be to use the HF version and load it as a timm model.
-
@bryant1410 there's no built-in support for loading existing CLIP-trained image-tower weights into a new model with a different text encoder. There were some PRs to add support, but they didn't get merged (they were bundled with lots of other changes, or got lost in the shuffle); it would be good to add at some point. To do this right now you'd have to hack some code to load the image weights; the timm pretrained flag just passes down to timm to load its own pretrained (ImageNet) weights.
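The "hack some code" workaround above usually amounts to filtering a full CLIP checkpoint's state dict down to the image-tower keys and loading them into the new model's visual module with `strict=False`. Here is a minimal, hedged sketch of that key filtering; the function name and the `visual.` prefix convention are assumptions for illustration, not open_clip API (check the actual key names in your checkpoint before relying on them):

```python
def extract_tower_state_dict(state_dict, prefix="visual."):
    """Keep only the keys under `prefix` and strip the prefix,
    so the result can be loaded into a standalone image tower via
    `model.visual.load_state_dict(result, strict=False)`.
    Hypothetical helper, not part of open_clip."""
    return {
        key[len(prefix):]: value
        for key, value in state_dict.items()
        if key.startswith(prefix)
    }

# Toy stand-in for a real checkpoint's state dict; in practice you would
# get one from something like torch.load(path, map_location="cpu").
ckpt = {
    "visual.conv1.weight": "w0",
    "visual.ln_post.bias": "b0",
    "text.token_embedding.weight": "t0",
}
image_weights = extract_tower_state_dict(ckpt)
# image_weights now holds only the image-tower entries, prefix stripped:
# {"conv1.weight": "w0", "ln_post.bias": "b0"}
```

Note that `strict=False` will silently skip mismatched keys, so it is worth printing the `missing_keys`/`unexpected_keys` that `load_state_dict` returns to confirm the tower actually received the weights.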