Combining CLIPA-v2 and SigLIP (both big_vision based) models #660
Conversation
…ax based modelling code).
Almost there on this one, it got a bit epic. Still some more cleanup to do, but want to get this in for more eyes. Main goal is to support SigLIP pretrained weights and CLIPA-v2 (I don't think supporting v1 is worth it at this point). There is associated cleanup and additions to support preprocessing / text tokenization sensibly.
For @zw615
I still have more testing to do; I've run through all SigLIP weights and ~50% of the v2 CLIPA ones. At the large end, both the SO400M SigLIP and the new G/14 CLIPA check out at a hair past 83% top-1. There is a drop of ~0-0.2 in the zero-shot scores from the jax versions, which fits past experience. Need to do some more testing and push weights to the hub; CoCa is probably broken right now.
@zw615 do you want to set up an organization on the HF hub so we can push under something official? Or should I do it into the timm organization myself? It's much easier to get the weights on the hub. If you do set up an org, I can provide some command lines to push once I have everything else ready.
Can you fix the merge conflict so tests will run, please?
```python
@torch.no_grad()
def load_big_vision_weights(model: CustomTextCLIP, checkpoint_path: str):
```
How stable do we expect this to be? Should we somehow (at least a comment) lock to a specific commit of big_vision?
@rom1504 it works for big_vision SigLIP weights -> timm + builtin text models only. I don't see why it wouldn't be stable; it has nothing to do with the code revision, only the weight files are relevant. This wouldn't work for their LiT models, but could be extended, etc.
I could drop this code after I convert and push to the HF hub, but it could be useful as a reference or for future models. It's isolated to checkpoint loading with an npz/npy filename and has no extra deps.
Ok, you're saying this will work for some specific big_vision-trained checkpoints.
Then maybe we could indicate those above the function?
Sure, will add a comment that it only supports the big_vision SigLIP weights right now. Any other big_vision weights (present or future) could be added by expanding the support, checking for various keys in the numpy archive...
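(For illustration, a minimal sketch of that kind of key-based dispatch; the key prefixes below are hypothetical placeholders, not the actual big_vision checkpoint layout:)

```python
import numpy as np

def detect_big_vision_variant(checkpoint_path: str) -> str:
    # .npz archives expose their arrays by key, so the model family can be
    # guessed from which parameter names are present
    w = np.load(checkpoint_path)
    keys = set(w.keys())
    # hypothetical prefixes -- the real names depend on the big_vision model defs
    if any(k.startswith('params/txt/') for k in keys):
        return 'siglip'
    if any(k.startswith('params/t/') for k in keys):
        return 'lit'
    raise ValueError(f'unrecognized big_vision checkpoint: {checkpoint_path}')
```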
Pretty sure it's broken at this point, have some more things to do before I'd expect a pass. Will merge main tomorrow. Would like bigger picture feedback/comments early, regardless of tests.
Also @gabrielilharco if you have any comments...
@gpucce while you're poking around here, there are changes impacting CoCa in this PR (and probably some of your future adds)... just a heads up, I put a sketch of an attentional pool as described in the paper in there, while keeping bwd compat.
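(For anyone following along, a rough sketch of what such an attentional pool looks like: a small set of learned queries cross-attends over the encoder's output tokens. This illustrates the idea from the paper; it is not the exact open_clip implementation:)

```python
import torch
import torch.nn as nn

class AttentionalPooler(nn.Module):
    # learned query tokens cross-attend over the encoder's token sequence,
    # pooling a variable-length sequence into a fixed number of outputs
    def __init__(self, d_model: int, n_head: int = 8, n_queries: int = 256):
        super().__init__()
        self.query = nn.Parameter(torch.randn(n_queries, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.ln = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) tokens from the vision/text encoder
        q = self.query.unsqueeze(0).expand(x.size(0), -1, -1)
        out, _ = self.attn(q, x, x, need_weights=False)
        return self.ln(out)  # (batch, n_queries, d_model)
```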
Really cool @rwightman! At a high level the design looks good to me. Since a lot changed and some of it affects existing models, we should make sure that the performance of all pretrained models we support remains the same after the changes.
Thanks, will try and keep it all together. There was another small PR trying to address this, #551, fixing another small issue; don't know if you ever had time to have a look.
It's a lot of changes but looks ok to me. Would definitely be safer to merge (and potentially revert) in small PRs bringing more scoped changes. My main worry is I can't tell from the code if things still work. The tests partially cover inference still working the same; maybe they need to be expanded to new models. What about training, how do we check?
@rom1504 yeah, smaller is usually better, but CLIPA and SigLIP are inter-related in a non-trivial way since they both come from big_vision, with differences that conflict with past decisions made here, and they also need extra (but different) tokenizer support, preprocessing, etc. As a counterpoint, things would have ended up more of a mess if CLIPA had been merged as-is, then someone else added SigLIP on top, then the other fixes got added, etc. ... at least this will be pulled together with a goal to leave things a bit better in the end, with hopefully a cleaner base to add more.
@rom1504 I feel testing inference will be enough here; the training risk should be low, and the breaks should all be detectable by running inference on pretrained models across all types.
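(For illustration, the kind of check being described might look like the sketch below, comparing current outputs against reference features captured on main; the helper and reference-file layout are assumptions:)

```python
import torch
import open_clip

def check_inference_unchanged(model_name, pretrained, ref_path, tol=1e-4):
    # ref_path points at tensors saved from the same inputs on the main branch
    model, _, preprocess = open_clip.create_model_and_transforms(
        model_name, pretrained=pretrained)
    model.eval()
    ref = torch.load(ref_path)
    with torch.no_grad():
        image_features = model.encode_image(ref['image'])
        text_features = model.encode_text(ref['text'])
    assert torch.allclose(image_features, ref['image_features'], atol=tol)
    assert torch.allclose(text_features, ref['text_features'], atol=tol)
```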
…ocessing / text tokenization sensibly.
Looks like tests are passing. Once this is merged, we can adapt open_clip/tests/test_inference.py (line 44 in e7b39e4).
Nice @rwightman! Looking great. I'm able to run evals now for all new models, will send results once they're done.
@gpucce think I found it, the LN wasn't being applied properly with attn_pool set; have to do it differently for the legacy (current) CoCa and future double pool options...
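(Schematically, the ordering issue was something like the following; attribute names here are illustrative, not the actual open_clip ones:)

```python
def _pool(self, tokens):
    # tokens: (batch, seq_len, width) output of the transformer blocks
    if self.attn_pool is not None:
        # fix: apply the final LayerNorm to the full token sequence *before*
        # attentional pooling, so both the pooled output and the 'tokens'
        # output see normalized features
        tokens = self.ln_post(tokens)
        pooled = self.attn_pool(tokens)
    else:
        # legacy path: LayerNorm, then take the class token as the pooled output
        tokens = self.ln_post(tokens)
        pooled = tokens[:, 0]
    return pooled, tokens
```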
@rwightman it seems there is still something different for the captions, let me try and have a better look at the outputs more deeply. Edit: that is, if you meant that last commit should be the fix.
I'm seeing small discrepancies in performance for CoCa models too after the latest change. Some sample results below, before (without this PR) vs. after the latest changes. [result tables not shown]
Oh fun, yeah, thought that last change would bring the CoCa ViT outputs to match the old ones; they would have been different (for the tokens output) before, but the pooled output should have been the same...
FWIW, it's probably worth checking CoCa on the main branch to ensure it was in a good state there too, in case there was a prior regression we didn't pick up.
…atch prev CoCa behaviour, but at odds with argmax which leaves special tokens in (not consistent)
@rwightman until Booster is up (tomorrow I think) I can't run the evals again, because the GPU I have where I am is filled right now and Leonardo is also under maintenance. Currently I am checking instances by hand.
@rwightman good point, I'm running some evals on main now and they are matching the numbers from d7542e4 so far.
@gabrielilharco @gpucce okay, so I did have another issue in the text pool: the 'tokens' included the cls token, which would impact generation but not zero-shot eval. Fixed that in the very latest. But for both of the previous fixes, the broken path was in the 'tokens' output from the vision or text encoder, so zero-shot should have been consistent with main.
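(Schematically, the fix amounts to the kind of split below; here the cls embedding is assumed to sit at the end of the text token sequence:)

```python
# pooled output comes from the cls embedding; the 'tokens' fed to the
# generative decoder must exclude it, or the special token leaks into captions
pooled, tokens = x[:, -1], x[:, :-1]
```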
@gabrielilharco nice, evals looking good, but double-checked that bigG CLIPA, it should be 83; looks like I mixed up a config value, durr... confirming.
@rwightman eval for the already fine-tuned model is consistent with main; I am running fine-tuning.
Fine-tuning seems fine too.
@gpucce nice, thanks! okay, so are we ready to merge? @gabrielilharco
@rwightman all good from my side! I'm getting 83.09 for bigG CLIPA on ImageNet. Full evals are still running (it's slow since the model is so big), I'll update the numbers once they are done.
k, I will merge now so we can get the changes and various fixes in. Maybe hold off on a versioned release for another day or two to give a few early adopters a chance to try the main branch? We should still add some text to README / PRETRAINED.MD about CLIPA-v2 and SigLIP, but can think about that one...
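(For early adopters on the main branch, loading the new models should look like any other open_clip model; the model/pretrained names below are examples and the final published tags may differ:)

```python
import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-SO400M-14-SigLIP', pretrained='webli')
tokenizer = open_clip.get_tokenizer('ViT-SO400M-14-SigLIP')

image = preprocess(Image.open('cat.png')).unsqueeze(0)
text = tokenizer(['a photo of a cat', 'a photo of a dog'])
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
```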
Great!
I think as a follow-up to merging it would be nice to:
- include CLIPA and SigLIP in the regression tests so they keep working
- same for CoCa
- add a regression test for the caption generation of CoCa (see the sketch below)
The first 2 are just about enabling these models in the test, no code change.
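(A rough sketch of such a generation regression test; the model tag and reference caption are placeholders that would be recorded from a known-good commit:)

```python
import torch
import open_clip

# placeholder; a real test would record this from a known-good commit
EXPECTED_CAPTION = '<reference caption>'

def test_coca_caption_regression():
    model, _, preprocess = open_clip.create_model_and_transforms(
        'coca_ViT-B-32', pretrained='mscoco_finetuned_laion2b_s13b_b90k')
    model.eval()
    image = torch.zeros(1, 3, 224, 224)  # fixed dummy input for determinism
    with torch.no_grad():
        generated = model.generate(image)
    caption = open_clip.decode(generated[0])
    assert caption == EXPECTED_CAPTION
```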
…On Fri, Oct 20, 2023, 23:52 Ross Wightman wrote:
Merged #660 into main.
@rom1504 @rwightman I can add the reg tests for generation.
Hi @rwightman, sure! I will discuss and work on this HF organization thing with the other co-authors, and should finish it real quick.
@rwightman Hi Ross, could you please move those weights to the UCSC-VLAA organization? (https://huggingface.co/UCSC-VLAA)
Okay, will ask someone to do that on Monday!