Thanks for your inspiring work! I'm reproducing the region classification results from your paper. I tried several CLIP models, but the results on LVIS lag behind.
The models used: vanilla CLIP model RN50, RN50x4, ViT-B-32
The prompt templates used:

```python
templates = [
    'itap of a {}.',
    'a bad photo of the {}.',
    'a origami {}.',
    'a photo of the large {}.',
    'a {} in a video game.',
    'art of the {}.',
    'a photo of the small {}.',
]
```
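Following the standard CLIP zero-shot recipe, each template is filled with the class name, and the text embeddings of all filled prompts are averaged (and re-normalized) into a single classifier weight per class. A minimal pure-Python sketch of the prompt-building step (the CLIP encoding/averaging itself is assumed):

```python
def build_prompts(templates, classname):
    """Fill each prompt template with the class name.

    In CLIP-style prompt ensembling, every filled template is encoded
    with the text encoder, and the embeddings are averaged and
    re-normalized to form one classifier weight for the class.
    """
    return [t.format(classname) for t in templates]


templates = [
    'itap of a {}.',
    'a bad photo of the {}.',
]
prompts = build_prompts(templates, 'zebra')
# prompts[0] == 'itap of a zebra.'
```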
The metric used: top-1 accuracy
The dataset split: results obtained on official validation set
The results:
ImageNet: 53.34, 59.71 (pretty close to the 59.6 reported in Fig. 1(b)), 56.34
LVIS: 7.58, 9.68, 11.93, all far worse than the 19.1 reported in Fig. 1(b).
For LVIS, I used load_lvis_json for data loading and cropped the images with the ground-truth 2D bboxes. I also tried a single prompt of "a photo of a {class_name}" and the text embedding downloaded from this repo, but the results were slightly worse. Could you provide more details about the region classification experiments?
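Concretely, the cropping step looks like this (a minimal sketch; `ann` and `pil_image` are placeholders, and I assume the LVIS/COCO `[x, y, w, h]` annotation format):

```python
def bbox_to_crop(bbox_xywh, img_w, img_h):
    """Convert an LVIS/COCO [x, y, w, h] box to integer
    (left, top, right, bottom) crop coordinates, clipped to
    the image bounds."""
    x, y, w, h = bbox_xywh
    left = max(0, int(round(x)))
    top = max(0, int(round(y)))
    right = min(img_w, int(round(x + w)))
    bottom = min(img_h, int(round(y + h)))
    return left, top, right, bottom


# Usage with a PIL image (placeholder names):
# region = pil_image.crop(bbox_to_crop(ann['bbox'], *pil_image.size))
```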
Specifically: 1. Which model did you use? 2. What text prompt did you use? 3. Is there anything wrong with my steps?
Again, thanks for your patience.
Hi @HatakeKiki, thanks for your interest in our work. I used ResNet-50 (see Table 11 of the CLIP paper). The text prompts include ~80 templates, which can be found in this codebase. To further improve region classification, I applied some augmentation (e.g., cropping a larger region at different scales like 1.2/1.5/2.0). The result is quite sensitive to the augmentation used, but the accuracy will always be low (below 20).
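The scale expansion can be sketched as follows (a minimal sketch, not the exact code used; it expands an `[x, y, w, h]` box about its center and clips to the image bounds):

```python
def expand_bbox(bbox_xywh, scale, img_w, img_h):
    """Expand a [x, y, w, h] box about its center by `scale`
    (e.g. 1.2 / 1.5 / 2.0), clipping the result to the image
    bounds. Returns the expanded box in [x, y, w, h] format."""
    x, y, w, h = bbox_xywh
    cx, cy = x + w / 2.0, y + h / 2.0
    nw, nh = w * scale, h * scale
    left = max(0.0, cx - nw / 2.0)
    top = max(0.0, cy - nh / 2.0)
    right = min(float(img_w), cx + nw / 2.0)
    bottom = min(float(img_h), cy + nh / 2.0)
    return left, top, right - left, bottom - top


# One crop per scale, then e.g. average the CLIP image embeddings:
# crops = [expand_bbox(ann['bbox'], s, W, H) for s in (1.2, 1.5, 2.0)]
```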