Reproduction of Region classification in Fig.1 #73

Closed
HatakeKiki opened this issue Jul 10, 2023 · 1 comment

HatakeKiki commented Jul 10, 2023

Thanks for your inspiring work! I'm reproducing the region classification results from your paper. I tried several CLIP models, but the results on LVIS lag behind.

The models used: vanilla CLIP models RN50, RN50x4, and ViT-B-32
The prompt templates used (a sketch of how these are turned into a classifier follows below):
templates = [
    'itap of a {}.',
    'a bad photo of the {}.',
    'a origami {}.',
    'a photo of the large {}.',
    'a {} in a video game.',
    'art of the {}.',
    'a photo of the small {}.',
]
The metric used: top-1 accuracy
The dataset split: results obtained on the official validation set
The results:
ImageNet: 53.34, 59.71, 56.34 (the 59.71 seems pretty close to the 59.6 reported in Fig. 1(b))
LVIS: 7.58, 9.68, 11.93, all of which are far worse than the 19.1 reported in Fig. 1(b)
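
For reference, here is a minimal sketch of the template-ensemble classifier described above, assuming the openai/CLIP package (`clip.load`, `clip.tokenize`, `encode_text`); the function name and overall structure are illustrative, not the exact script used:

```python
import torch
import clip  # openai/CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)

def build_text_classifier(class_names, templates):
    """Encode every template for every class, average per class, and L2-normalize."""
    weights = []
    with torch.no_grad():
        for name in class_names:
            prompts = [t.format(name) for t in templates]
            tokens = clip.tokenize(prompts).to(device)
            emb = model.encode_text(tokens)              # (num_templates, embed_dim)
            emb = emb / emb.norm(dim=-1, keepdim=True)
            mean_emb = emb.mean(dim=0)                   # prompt ensembling
            weights.append(mean_emb / mean_emb.norm())
    return torch.stack(weights, dim=1)                   # (embed_dim, num_classes)
```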

For LVIS, I used load_lvis_json for data loading and cropped the images with the ground-truth 2D bboxes. I also tried a single prompt, "a photo of a {class_name}", as well as the text embeddings downloaded from this repo, but the results were slightly worse. Could you provide more details about the region classification experiments?
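
To make the cropping step concrete, here is a minimal sketch of region classification on ground-truth boxes; it reuses `model`, `preprocess`, and `device` from the sketch above, and assumes the boxes have already been converted to (x1, y1, x2, y2) pixel coordinates (LVIS annotations store boxes as XYWH, so a conversion is needed):

```python
from PIL import Image

def classify_regions(image_path, boxes_xyxy, text_weights):
    """Crop each ground-truth box, encode it with CLIP, and take the top-1 class."""
    image = Image.open(image_path).convert("RGB")
    crops = torch.stack([preprocess(image.crop(tuple(box))) for box in boxes_xyxy]).to(device)
    with torch.no_grad():
        feats = model.encode_image(crops)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        logits = feats @ text_weights                    # (num_boxes, num_classes)
    return logits.argmax(dim=-1)                         # predicted class index per box
```

Top-1 accuracy is then the fraction of boxes whose predicted index matches the ground-truth category.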

Specifically: 1. Which model did you use? 2. Which text prompts did you use? 3. Is there anything wrong with my steps?

Again, thanks for your patience.

@YiwuZhong (Collaborator)

Hi @HatakeKiki, thanks for your interest in our work. I used ResNet-50 (see Table 11 of the CLIP paper). The text prompts include ~80 templates, which can be found in this codebase. To further improve region classification, I applied some augmentation (e.g., cropping a larger region at different scales such as 1.2/1.5/2.0). The accuracy is quite sensitive to the augmentation used, but it will always be low (below 20).
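
For concreteness, a minimal sketch of that kind of region enlargement (the scale factors come from the comment above; the clipping to image bounds and the function name are illustrative assumptions, not the exact code used):

```python
def enlarge_box(box, scale, img_w, img_h):
    """Scale an (x1, y1, x2, y2) box around its center and clip it to the image."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = (x2 - x1) * scale, (y2 - y1) * scale
    return (max(0.0, cx - w / 2), max(0.0, cy - h / 2),
            min(float(img_w), cx + w / 2), min(float(img_h), cy + h / 2))

# Ensembling over several crop scales (hypothetical usage):
# scores = sum(region_scores(image, enlarge_box(box, s, W, H)) for s in (1.0, 1.2, 1.5, 2.0))
```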
