
Performance of ViT-B/32 is worse than RN50 on CC3M #14

Closed
JACKHAHA363 opened this issue Sep 8, 2021 · 3 comments
Comments

@JACKHAHA363 commented Sep 8, 2021

Here are my curves. RN50 roughly matches the one shown in the repo, but ViT-B/32 is worse. I am using the hyperparameters from the README. I am wondering if you could also share the performance curves of ViT-B/32 on CC3M?
[Screenshot: performance curves for RN50 and ViT-B/32, Sep 8, 2021]

@carlini (Collaborator) commented Sep 8, 2021

ViT-B performed worse for us on CC than an RN50. I suspect (but cannot prove) this is because there's not enough data, and vision transformers appear to be more data hungry than ResNets. I don't have the accuracy offhand, but this looks comparable to what we were seeing.

@rwightman (Collaborator) commented Apr 6, 2022

This is expected and your numbers appear reasonable. Having trained quite a few models at the lower end recently, I've found that ViT-B models (even the smaller ones) underperform similarly sized ResNet models on smaller datasets. This holds up to at least the 12-15M sample range, as I was unable to push ViT-B-32 past RN50 on cc12m or yfcc15m. I feel the crossover point is probably in the 40-100M sample range, but I have not verified that.

One could possibly work around this by using a pretrained backbone for the vision tower. There is preliminary, partial support for this right now via timm models...

One could modify a model config such as this one to enable the timm_model_pretrained flag: https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/model_configs/timm-vit_base_patch32_224.json
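As a rough sketch (the `vision_cfg` key is assumed from the layout of the other timm-* configs and may differ in the current repo), flipping that flag could look like this:

```python
# Rough sketch: enable the pretrained flag in a timm model config.
# Assumes the timm settings live under a "vision_cfg" key, as in the
# other timm-* configs; check the actual file before relying on this.
import json

cfg_path = "src/open_clip/model_configs/timm-vit_base_patch32_224.json"

with open(cfg_path) as f:
    cfg = json.load(f)

# Start the vision tower from timm's ImageNet-pretrained weights
# instead of a random init.
cfg["vision_cfg"]["timm_model_pretrained"] = True

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=4)
```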

You'd then be starting with a vision tower pretrained on ImageNet. It significantly speeds up reaching decent zero-shot and eval results, BUT I'd caution against using an ImageNet-pretrained backbone and then doing zero-shot eval on ImageNet; you'd probably want an alternate zero-shot test dataset.

@rwightman (Collaborator)

Moving to a discussion for future reference.

mlfoundations locked and limited conversation to collaborators on Apr 6, 2022
rwightman converted this issue into discussion #56 on Apr 6, 2022

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
