
About the ImageNet zero-shot performance with the released models #24

Closed

jwyang opened this issue Jan 26, 2021 · 9 comments

jwyang commented Jan 26, 2021

Hi, CLIP authors,

Really great work, and thanks a lot for releasing the code!

Recently I have been trying to evaluate the two released models (RN50 and ViT-B/32) on the ImageNet validation set. The numbers I get with prompt engineering but without ensembling are shown below; a sketch of my evaluation loop is at the end of this comment:

ResNet-50 top-1: 55.09, top-5: 83.59
ViT-B/32 top-1: 59.06, top-5: 85.59

I am not sure whether these numbers match those on your side. As a reference for our trial and error, could you report the validation accuracies for these two models?

thanks,
Jianwei
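
For concreteness, my evaluation loop looks roughly like the sketch below (a minimal sketch using the released `clip` package; the single prompt template, the data path, and the batch size are placeholder choices on my side):

```python
# Minimal single-template zero-shot evaluation sketch.
# Assumptions: the released `clip` package, a local ImageNet val split,
# and "a photo of a {}." as the (non-ensembled) prompt template.
import torch
import torchvision
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

dataset = torchvision.datasets.ImageNet("path/to/imagenet", split="val", transform=preprocess)
loader = torch.utils.data.DataLoader(dataset, batch_size=256, num_workers=8)

# One text embedding per class from a single prompt template.
class_names = [names[0] for names in dataset.classes]  # torchvision stores tuples of synonyms
tokens = clip.tokenize([f"a photo of a {name}." for name in class_names]).to(device)
with torch.no_grad():
    text_features = model.encode_text(tokens)
    text_features /= text_features.norm(dim=-1, keepdim=True)

top1 = top5 = total = 0
with torch.no_grad():
    for images, target in loader:
        image_features = model.encode_image(images.to(device))
        image_features /= image_features.norm(dim=-1, keepdim=True)
        logits = 100.0 * image_features @ text_features.T
        preds = logits.topk(5, dim=-1).indices.cpu()
        top1 += (preds[:, 0] == target).sum().item()
        top5 += (preds == target[:, None]).any(dim=-1).sum().item()
        total += target.size(0)

print(f"top-1: {100 * top1 / total:.2f}, top-5: {100 * top5 / total:.2f}")
```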


Newmu commented Jan 26, 2021

We're planning to release a code example for replicating the paper's zero-shot ImageNet results that can be used as a reference for this. Apologies for not including sufficient detail in the paper!


jwyang commented Jan 27, 2021

Hi Alec, thanks for sharing this information. Looking forward to the updates!

jongwook (Collaborator) commented

Just added a code example here. You can also open it in Colab. Hope this helps!


jwyang commented Jan 31, 2021

Hi @jongwook, this is fantastic! I will follow your Colab to reproduce the numbers!


gjtjx commented Feb 1, 2021

> Just added a code example here. You can also open it in Colab. Hope this helps!

I get Top-1 55.73 and Top-5 83.45. Is that expected?


NikashS commented Feb 1, 2021

Hey, just a quick clarifying question: the notebook you released achieves a zero-shot accuracy of 55.73%, but Table 17 of the paper (last page) reports that ViT-B/32 can achieve 63.2% accuracy. Do you know how I can reproduce that number?


Newmu commented Feb 1, 2021

Sorry, it's confusing: the notebook evaluates on ImageNet-V2, since that dataset has a publicly available evaluation set. When evaluated on ImageNet itself, the results should be within noise (± a few tenths of a percent) of those reported in Table 17.

In the notebook this is explained as:

The ILSVRC2012 datasets are no longer publicly available for download. We instead download the ImageNet-V2 dataset by Recht et al.
If you have the ImageNet dataset downloaded, you can replace the dataset with the official torchvision loader, e.g.:

import torchvision
images = torchvision.datasets.ImageNet("path/to/imagenet", split='val', transform=preprocess)
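
And for the ImageNet-V2 route, the notebook pulls the data via the `imagenetv2_pytorch` package; roughly (a sketch, assuming that package is installed from the modestyachts/ImageNetV2_pytorch repo):

```python
# Sketch of the ImageNet-V2 route the notebook takes; assumes
# pip install git+https://github.com/modestyachts/ImageNetV2_pytorch
import torch
from imagenetv2_pytorch import ImageNetV2Dataset

images = ImageNetV2Dataset(transform=preprocess)  # downloads the test set on first use
loader = torch.utils.data.DataLoader(images, batch_size=32, num_workers=2)
```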


jwyang commented Feb 3, 2021

Update: I switched from ImageNet-V2 to the ImageNet validation set and verified that the performance matches the numbers reported in Table 17. Thanks!


KaiyangZhou commented Jul 16, 2021

Hello, I followed this code https://github.com/openai/CLIP/blob/main/notebooks/Prompt_Engineering_for_ImageNet.ipynb to use the template ensemble for zero-shot prediction on ImageNet. The reproduced result is 57.81%, which is far from the 76.2% accuracy reported in Table 1 of the paper. Am I missing something?

Update on 19 Jul:
My mistake. The 76.2% result is obtained with ViT-L/14, which is unavailable in the current repo (I used RN50). So it seems the architecture matters a lot for performance.
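
For reference, the template ensembling in that notebook works roughly as below (a minimal sketch; `templates` here is a tiny illustrative subset of the notebook's 80 ImageNet templates):

```python
# Sketch of the notebook's prompt-template ensembling: each class weight is
# the renormalized mean of the embeddings of many prompt variants.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)

# Tiny illustrative subset; the notebook ensembles over 80 templates.
templates = [
    "a photo of a {}.",
    "a bad photo of a {}.",
    "a photo of many {}.",
]

def build_zeroshot_weights(class_names):
    weights = []
    with torch.no_grad():
        for name in class_names:
            tokens = clip.tokenize([t.format(name) for t in templates]).to(device)
            emb = model.encode_text(tokens)            # (n_templates, d)
            emb /= emb.norm(dim=-1, keepdim=True)      # normalize each prompt embedding
            mean = emb.mean(dim=0)                     # average over templates
            weights.append(mean / mean.norm())         # renormalize the mean
    return torch.stack(weights, dim=1)                 # (d, n_classes)

# Usage: logits = 100.0 * image_features @ build_zeroshot_weights(class_names)
```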
