How to use a non-timm pretrained image tower with a HF pretrained text tower? #543
-
I see that in #236 you folks used an OpenCLIP-pretrained image tower (ViT-H/14) along with an HF pretrained text tower (`open_clip/src/open_clip/factory.py`, line 177 at commit fb72f4d). If I try the same, what am I missing?
Replies: 3 comments
-
BTW, it'd be great if, for each trained model, you could share the exact version of the repo (the commit hash, I guess) and the command used to launch training (especially to see which flags were used).
-
I guess this is the current answer: #237 (comment)? Not sure. Another workaround, I guess, would be to use the HF version and load it as a timm model.
-
@bryant1410 there's no built-in support for loading existing CLIP-trained image-tower weights into a new model with a different text encoder. There were some PRs to add support, but they didn't get merged (they were bundled with lots of other changes, or got lost in the shuffle); it would be good to add at some point. To do this right now you'd have to hack some code to load the image weights; the timm pretrained flag just passes down to timm to load its own pretrained (ImageNet) weights.
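The "hack some code" workaround above usually amounts to filtering a full CLIP checkpoint's state dict down to the image-tower keys and loading them into the new model's visual module with `strict=False`. Here is a minimal, hedged sketch of that key filtering; the function name and the `visual.` prefix convention are assumptions for illustration, not open_clip API (check the actual key names in your checkpoint before relying on them):

```python
def extract_tower_state_dict(state_dict, prefix="visual."):
    """Keep only the keys under `prefix` and strip the prefix,
    so the result can be loaded into a standalone image tower via
    `model.visual.load_state_dict(result, strict=False)`.
    Hypothetical helper, not part of open_clip."""
    return {
        key[len(prefix):]: value
        for key, value in state_dict.items()
        if key.startswith(prefix)
    }

# Toy stand-in for a real checkpoint's state dict; in practice you would
# get one from something like torch.load(path, map_location="cpu").
ckpt = {
    "visual.conv1.weight": "w0",
    "visual.ln_post.bias": "b0",
    "text.token_embedding.weight": "t0",
}
image_weights = extract_tower_state_dict(ckpt)
# image_weights now holds only the image-tower entries, prefix stripped:
# {"conv1.weight": "w0", "ln_post.bias": "b0"}
```

Note that `strict=False` will silently skip mismatched keys, so it is worth printing the `missing_keys`/`unexpected_keys` that `load_state_dict` returns to confirm the tower actually received the weights.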