MobileCLIP is a very fast CLIP architecture for mobile inference, roughly 3x faster on iOS / macOS devices than `convnext_base_w`, the fastest publicly available CLIP backbone.
They introduce 3 novel image backbones: `mci{0|1|2}`. It would be amazing if these models were available directly via `timm`. I believe this would be an essential first step towards getting them into `open_clip` for fine-tuning.
The arch, defined here, uses MobileOne and FastViT components, which are already available in `timm`. I'm not sure how compatible the re-implementation there is with the existing ones in `timm` out of the box, but integration definitely seems possible.
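For what it's worth, both component families can already be instantiated from `timm` today (the model names below are existing `timm` variants, not the MobileCLIP ones):

```python
import timm

# Existing timm variants of the two building-block families MobileCLIP uses.
mobileone = timm.create_model("mobileone_s0", pretrained=False)
fastvit = timm.create_model("fastvit_sa12", pretrained=False)
```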
@rsomani95 the components themselves are equivalent at a functional level, but the naming was remapped, so the weights would have to be remapped for this model as well...
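As a very rough illustration of what that remapping involves, a minimal sketch along these lines (the prefix pairs here are hypothetical placeholders; the real mapping would have to come from diffing the ml-mobileclip state_dict against the timm one):

```python
import torch

# Hypothetical prefix renames, applied in order. These are illustrative only,
# NOT the actual MobileCLIP -> timm mapping.
PREFIX_MAP = {
    "image_encoder.model.": "",  # assumed MobileCLIP wrapper prefix
    "patch_embed.": "stem.",     # illustrative rename, not verified
}

def remap_keys(state_dict):
    """Rename checkpoint keys according to PREFIX_MAP (rules applied in order)."""
    out = {}
    for k, v in state_dict.items():
        for old, new in PREFIX_MAP.items():
            if k.startswith(old):
                k = new + k[len(old):]
        out[k] = v
    return out

ckpt = torch.load("mobileclip_s1.pt", map_location="cpu")  # placeholder path
sd = ckpt.get("state_dict", ckpt)  # checkpoints sometimes nest the weights
remapped = remap_keys(sd)
```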
@rsomani95 I took a closer look at this. s1/s2 (mci1/mci2) are the easiest; those could probably be mapped to OpenCLIP w/ a timm FastViT encoder (after a few additions and a key remapping for weights). I think the text encoder for those is compatible.
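For reference, a rough sketch of what an OpenCLIP model config with a timm vision tower looks like. The `timm_*` keys are OpenCLIP's existing mechanism for timm image encoders; the FastViT variant name and all the dims below are placeholders, not MobileCLIP-S1's actual hyperparameters:

```python
# Placeholder OpenCLIP model config using a timm vision tower.
mobileclip_s1_cfg = {
    "embed_dim": 512,
    "vision_cfg": {
        "timm_model_name": "fastvit_sa24",  # placeholder timm FastViT variant
        "timm_model_pretrained": False,
        "timm_pool": "avg",
        "timm_proj": "linear",
        "image_size": 256,
    },
    "text_cfg": {
        "context_length": 77,
        "vocab_size": 49408,
        "width": 512,
        "heads": 8,
        "layers": 12,
    },
}
```

A config like this would normally be dropped as a JSON file into OpenCLIP's `model_configs/` directory so `open_clip.create_model` can pick it up by name.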
S0 uses a RepMixer-based text encoder, so it would need new code in OpenCLIP as well. The image encoder would map to a tweaked version of FastViT.
The B model uses a ViT w/ a different stem, which is doable. I really like that ViT does NOT have BatchNorm, though, so it's a shame this one is a ViT-Base w/ BN in the stem.
@rwightman thanks for looking into that. That's really great to hear re. s1/s2, as those, in my eyes, sit in the perfect sweet spot of speed + accuracy. Given your observations, maybe it makes sense to port those two alone first? Is there something in particular I could help with?