
Performance of ViT-B/32 is worse than RN50 on CC3M #14

Closed
JACKHAHA363 opened this issue Sep 8, 2021 · 3 comments
Comments

@JACKHAHA363 commented Sep 8, 2021

Here are my curves. RN50 roughly matches the one shown in the repo, but ViT-B/32 is worse. I am using the hyperparameters from the README. I am wondering if you could also share the performance curves of ViT-B/32 on CC3M?
[Screenshot: performance curves for RN50 and ViT-B/32, Sep 8, 2021]

@carlini (Collaborator) commented Sep 8, 2021

ViT-B performed worse for us on CC than an RN50. I suspect (but cannot prove) this is because there's not enough data, and vision transformers appear to be more data hungry than ResNets. I don't have the accuracy offhand, but this looks comparable to what we were seeing.

@rwightman (Collaborator) commented Apr 6, 2022

This is expected and your numbers appear reasonable. Having trained quite a few models at the lower end recently, I've found that ViT-B models (even the smaller ones) underperform similarly sized ResNet models on smaller datasets. This holds up to at least the 12-15M sample range, as I was unable to push ViT-B-32 past RN50 on cc12m or yfcc15m. I feel the crossover point is probably in the 40-100M sample range, but I have not verified that.

One could possibly work around this by using a pretrained backbone for the vision tower. There is preliminary, partial support for this right now via timm models...

One could modify a model config such as this one to enable the timm_model_pretrained flag: https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/model_configs/timm-vit_base_patch32_224.json
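As a rough sketch (the `vision_cfg` key is assumed from the layout of the other timm-* configs and may differ in the current repo), flipping that flag could look like this:

```python
# Rough sketch: enable the pretrained flag in a timm model config.
# Assumes the timm settings live under a "vision_cfg" key, as in the
# other timm-* configs; check the actual file before relying on this.
import json

cfg_path = "src/open_clip/model_configs/timm-vit_base_patch32_224.json"

with open(cfg_path) as f:
    cfg = json.load(f)

# Start the vision tower from timm's ImageNet-pretrained weights
# instead of a random init.
cfg["vision_cfg"]["timm_model_pretrained"] = True

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=4)
```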

You'd then be starting with a vision tower pretrained on ImageNet. It significantly speeds up reaching decent zero-shot and eval results, BUT I'd caution against using an ImageNet-pretrained backbone and then doing zero-shot eval on ImageNet; you'd probably want an alternate zero-shot test dataset.

@rwightman (Collaborator)

Moving to a discussion for future reference.

mlfoundations locked and limited conversation to collaborators on Apr 6, 2022
rwightman converted this issue into discussion #56 on Apr 6, 2022

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
