Reproducing Pascal-50S #4
Comments
Hi @jnhwkim! Thanks for reaching out about this! Quick question: do you have results for the other metrics as well?
@jmhessel I just tried to reproduce CLIP-S and RefCLIP-S. For clarity, I successfully reproduced similar results on the other benchmarks: Flickr8k-Expert, CF, Composite, and FOIL. If you think results for the other metrics are necessary to pinpoint the issue, I can try.
Gotcha --- I'm glad you were able to reproduce the other datasets. It's been a while, but I should be able to dig up my evaluation code for Pascal-50S. I will get back to you in the next few days, if not sooner. Everything you mentioned sounds correct, though. By the way --- can you check the md5sum of an image here to make sure it matches up? https://github.com/jmhessel/clipscore/blob/main/checksums/pascal_50s_checksum.txt
@jmhessel I checked the md5sum for one or two images, but I can check the whole set to make sure.
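A minimal sketch of checking every image, assuming the checksum file uses the standard `md5sum` layout of `<digest>  <relative path>` per line (the layout and paths are assumptions, not verified against the repo):

```python
import hashlib
import os

def md5_of(path, chunk_size=1 << 20):
    """md5 hex digest of a file, read in chunks to keep memory low."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_all(checksum_file, image_root):
    """Return the relative paths whose md5 does not match the checksum file."""
    mismatches = []
    with open(checksum_file) as f:
        for line in f:
            if not line.strip():
                continue
            expected, rel_path = line.split(maxsplit=1)
            rel_path = rel_path.strip()
            if md5_of(os.path.join(image_root, rel_path)) != expected:
                mismatches.append(rel_path)
    return mismatches
```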
That sounds good! Let me do a bit of digging --- it's been a while since I've looked at the code.
@jmhessel I understand. Sincerely, thank you for your effort.
Okay, I was able to re-run everything with my internal code on a random sample of references. What I get is:
which is fairly in line with the reported numbers. I often find that numerical precision issues make the numbers vary a bit (e.g., float16 vs. float32 will shift the results slightly). I did my best to standardize everything within this public, cleaner codebase, which is the official implementation :-) By the way, one question --- how do you break ties? I break ties randomly in my code, rather than always defaulting to a loss. That could help explain the slightly lower scores. I will dig a bit more into this in the next few days.
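For concreteness, a minimal sketch of the two tie-handling policies being compared, with illustrative names rather than code taken from either implementation:

```python
import numpy as np

def pascal50s_accuracy(score_a, score_b, human_prefers_a, tie_break="random", seed=0):
    """Fraction of caption pairs where the metric agrees with the human preference.

    score_a / score_b : metric scores for candidate captions A and B.
    human_prefers_a   : boolean array, True where annotators preferred caption A.
    tie_break         : "random" flips a fair coin on exact ties;
                        "loss" always counts a tie against the metric.
    """
    rng = np.random.default_rng(seed)
    score_a = np.asarray(score_a, dtype=np.float64)
    score_b = np.asarray(score_b, dtype=np.float64)
    human_prefers_a = np.asarray(human_prefers_a, dtype=bool)

    pred_a = score_a > score_b
    ties = score_a == score_b
    if tie_break == "random":
        pred_a = np.where(ties, rng.random(len(score_a)) < 0.5, pred_a)
        correct = pred_a == human_prefers_a
    else:  # "loss": a tie never counts as a correct prediction
        correct = (pred_a == human_prefers_a) & ~ties
    return correct.mean()
```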
Another question: what hardware are you running on, and what numpy version are you using? If you're running on GPU and using a particular version of numpy, there's a critical bug that might affect the cosine similarity computations (see #2). I know the other datasets have worked fine, but just to be sure :-)
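One hardware- and version-independent way to sanity-check the similarity step is to redo it in plain float32 numpy. The sketch below also applies the CLIP-S form from the paper (w * max(cos, 0) with w = 2.5); the function and variable names are illustrative, not the repo's:

```python
import numpy as np

def clip_s(image_feats, text_feats, w=2.5, eps=1e-8):
    """CLIP-S = w * max(cos(image, text), 0), computed row-wise in float32."""
    a = np.asarray(image_feats, dtype=np.float32)
    b = np.asarray(text_feats, dtype=np.float32)
    a = a / np.maximum(np.linalg.norm(a, axis=1, keepdims=True), eps)
    b = b / np.maximum(np.linalg.norm(b, axis=1, keepdims=True), eps)
    cos = (a * b).sum(axis=1)
    return w * np.clip(cos, 0, None)
```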
My Pascal-50S dataset class looks like this:
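Not the snippet from this comment, but as a rough illustration only, a pair dataset built on the `new_data` and `category` fields mentioned later in this thread might look roughly like this (the layout inside the `.mat` annotations is an assumption, not the original code):

```python
# Illustrative sketch only; the exact layout of the .mat annotation fields is assumed.
import scipy.io
from torch.utils.data import Dataset

class Pascal50sPairs(Dataset):
    def __init__(self, mat_path):
        mat = scipy.io.loadmat(mat_path, simplify_cells=True)
        self.pairs = mat["new_data"]     # per-pair candidate captions + metadata (assumed)
        self.category = mat["category"]  # pair-type label for each entry (assumed)

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        return self.pairs[idx], self.category[idx]
```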
Okay, I think I figured it out! You've found a minor bug in my setup :-)

The bug: just for the Pascal-50S dataset (interestingly, I fixed this for Abstract-50S...), I used the 11 human ratings from one annotation file rather than the 48 consensus ratings from the other.

Thank you for finding and reporting this! 🎉 I will update the arxiv paper with the updated results and look into submitting an errata note for the ACL Anthology version. I think the overall story won't change much --- but, given that many people compare on Pascal-50S, I will update ASAP. Is it okay if I add an acknowledgement to you for this?

Specifically, the main comparison that this affects is Table 4 in https://arxiv.org/pdf/2104.08718.pdf . The comparison between CLIPScore/RefCLIPScore and the three rows with an asterisk (TIGEr, ViLBERTScore-F, BERT-S++) is not fair, because (presumably) those works evaluated in the setting using the 48 consensus ratings rather than the 11 ratings within the other file. The comparison with the non-asterisked rows is correct, because those were also computed using the 11 ratings.

After fixing this bug, my new numbers are much more aligned with yours. I haven't finalized the numbers yet, but I get some differences.
For CLIPScore:
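In code, the fix amounts to changing which pool the five references are sampled from; a rough sketch with made-up names, not the repo's actual data structures:

```python
import random

def sample_refs(consensus_refs, other_file_refs, k=5, use_consensus=True, seed=0):
    """Draw k references for one image from the chosen pool.

    consensus_refs  : the 48 consensus captions (the intended reference pool).
    other_file_refs : the 11 human ratings from the other annotation file
                      (the pool used by mistake for Pascal-50S).
    """
    pool = consensus_refs if use_consensus else other_file_refs
    return random.Random(seed).sample(pool, k)
```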
It's my honor to be acknowledged. I accept. Thank you for clarifying the issue along the way.
Thanks again for your help :-D! I just submitted an updated version to arxiv, and also submitted an errata fix.
Thanks for sharing the code.
I am trying to reproduce Pascal-50S, but I got somewhat different scores, as follows:

The results are not much different when resampling the five references from among the 48 candidates.
Observations are:
For integrity, I checked the number of samples for each category (=1k) and double-checked the categories using the `new_data` and `category` fields of the Pascal-50S dataset.
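That count check could look roughly like this, assuming the per-pair labels from the `category` field are already loaded into an array (the loading step and names are illustrative):

```python
import numpy as np

def category_counts(categories):
    """Print the number of pairs per category; each should come out to 1k."""
    values, counts = np.unique(np.asarray(categories), return_counts=True)
    for v, c in zip(values, counts):
        print(v, c)
```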
I believe there are very few factors that could swing the results. My questions are:

1. I `strip()` the candidate captions and add the prefix "A photo depicts ".
2. I tokenize with `clip.tokenize(caption, truncate=True)` (see the sketch below).

Is there any other factor needed to reproduce the results in the paper?
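For concreteness, those two preprocessing steps might look like this; the helper name and model handling are illustrative, and whether this matches the repo's exact pipeline is exactly what is being asked:

```python
import clip  # https://github.com/openai/CLIP
import torch

def encode_candidate(model, caption, device="cuda"):
    """Strip whitespace, prepend "A photo depicts ", tokenize with truncation,
    then encode and L2-normalize the text features."""
    text = "A photo depicts " + caption.strip()
    tokens = clip.tokenize(text, truncate=True).to(device)
    with torch.no_grad():
        feats = model.encode_text(tokens)
    return feats / feats.norm(dim=-1, keepdim=True)
```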