
Reproducing Pascal-50S #4

Closed · jnhwkim opened this issue Mar 23, 2022 · 13 comments
@jnhwkim commented Mar 23, 2022

Thanks for sharing the code.

I am trying to reproduce the Pascal-50S results, but I get somewhat different scores:

metric     & HC    & HI    & HM    & MM    & Mean  \\
CLIP-S     & 0.551 & 0.992 & 0.960 & 0.715 & 0.804 \\
RefCLIP-S  & 0.650 & 0.995 & 0.956 & 0.737 & 0.835 \\

The numbers do not change much when resampling the five references from the 48 candidates.

My observations are:

  1. CLIP-S is lower than RefCLIP-S.
  2. The mean of RefCLIP-S is similar to the paper's reported mean, but the HC and MM scores differ significantly.

As a sanity check, I verified the number of samples for each category (1,000 each) and double-checked the categories using the new_data and category fields of the Pascal-50S dataset.

I believe there are only a few factors that could swing the results. My setup is:

  1. I preprocess all captions with strip() and add the prefix "A photo depicts ".
  2. For the CLIP encoding, I use the CLIP tokenizer, i.e., clip.tokenize(caption, truncate=True) (a minimal sketch is at the end of this comment).
  3. For RefCLIP-S, I randomly select 5 references from the 48 candidates.

Is there any other factor needed to reproduce the numbers in the paper?
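
For reference, here is what I mean by (1) and (2) as a minimal sketch, assuming the standard openai/CLIP package and its ViT-B/32 checkpoint (the exact checkpoint is my assumption):

import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

def encode_caption(caption: str) -> torch.Tensor:
    # (1) strip whitespace and prepend the prompt.
    text = "A photo depicts " + caption.strip()
    # (2) tokenize with the CLIP tokenizer, truncating over-long captions.
    tokens = clip.tokenize(text, truncate=True).to(device)
    with torch.no_grad():
        features = model.encode_text(tokens)
    # L2-normalize so that a dot product later gives cosine similarity.
    return features / features.norm(dim=-1, keepdim=True)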

@jmhessel (Owner)

Hi @jnhwkim! Thanks for reaching out about this! Quick question: do you have results for the other metrics as well?

@jnhwkim (Author) commented Mar 23, 2022

@jmhessel I only tried to reproduce CLIP-S and RefCLIP-S here. For clarity, I did successfully reproduce similar results on the other benchmarks: Flickr8k-Expert, CF, Composite, and FOIL.

If you think results for the other metrics would help pinpoint the issue, I can try them.

@jmhessel (Owner)

Gotcha --- I'm glad you were able to reproduce the other datasets. It's been a while, but I should be able to dig up my evaluation code for Pascal-50S. I will get back to you in the next few days, if not sooner. Everything you mentioned sounds correct, though. By the way --- can you check the md5sum of an image here to make sure it matches up? https://github.com/jmhessel/clipscore/blob/main/checksums/pascal_50s_checksum.txt

@jnhwkim (Author) commented Mar 23, 2022

@jmhessel I checked the md5sum for one or two images, but I can check them all to make sure.
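
In case it helps, this is roughly how I would check all of them (assuming the checksum file contains md5sum-style lines of the form "<md5> <filename>"):

import hashlib
import os

def md5_of(path, chunk_size=1 << 20):
    # Hash the file in chunks so large images do not need to fit in memory.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_checksums(checksum_file, image_root):
    # Return the filenames whose md5 does not match the expected value.
    mismatched = []
    with open(checksum_file) as f:
        for line in f:
            expected, name = line.split()
            if md5_of(os.path.join(image_root, name)) != expected:
                mismatched.append(name)
    return mismatched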

@jmhessel (Owner)

That sounds good! Let me do a bit of digging --- it's been a while since I've looked at the code.

@jnhwkim (Author) commented Mar 23, 2022

@jmhessel I understand. Sincerely, thank you for your effort.

@jmhessel (Owner)

Okay, I was able to re-run everything with my internal code for a random sample of references. What I get is:

 metric               & HC   & HI   & HM   & MM   \\
 RefCLIPScore         & 57.6 & 99.5 & 96.2 & 80.8 \\

which is fairly in line with the reported numbers. I often find that numerical precision issues make the numbers vary a bit (e.g., float16 vs. float32 will shift the results slightly). I did my best to standardize everything within this public, cleaner code base, which is the official implementation :-)

By the way, one question --- how do you break ties? I break ties randomly in my code, rather than always defaulting to a loss. That could help explain the slightly lower scores. I will dig a bit more into this in the next few days.
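
To make sure we mean the same thing by random tie-breaking, here is a minimal sketch of the pairwise accuracy I have in mind (the score and label lists are hypothetical placeholders):

import random

def pairwise_accuracy(scores_a, scores_b, labels, tie_break="random"):
    # labels[i] == 0 means the annotators preferred caption A for pair i.
    correct = 0
    for sa, sb, label in zip(scores_a, scores_b, labels):
        if sa == sb:
            # Break ties randomly instead of always counting them as a loss.
            pred = random.randint(0, 1) if tie_break == "random" else 1
        else:
            pred = 0 if sa > sb else 1
        correct += int(pred == label)
    return correct / len(labels)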

@jmhessel (Owner) commented Mar 23, 2022

Another question: what hardware are you running on, and what numpy version are you using?

If you're running on GPU and using a particular version of numpy, there's a critical bug that might affect the cosine similarity computations (see #2 ). I know the other datasets have worked fine, but just to be sure :-)
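
For what it's worth, computing the similarity entirely in torch sidesteps that numpy path. A minimal sketch of CLIP-S (w = 2.5, as in the paper), assuming already-extracted image and text features:

import torch
import torch.nn.functional as F

def clip_s(image_feats: torch.Tensor, text_feats: torch.Tensor, w: float = 2.5):
    # CLIP-S = w * max(cos(image, text), 0), computed row-wise.
    sims = F.cosine_similarity(image_feats, text_feats, dim=-1)
    return w * torch.clamp(sims, min=0)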

@jnhwkim (Author) commented Mar 23, 2022

  • I am using the cosine similarity function from torch.
  • Ties are broken randomly by adding a uniform random number in [-0.5, 0.5) to one side, as in the code below.
  • I will check whether the precision issue is particularly critical for this dataset.

My Pascal-50S dataset class looks like this:

import os
import random

import scipy.io
import torch
from PIL import Image

# keys_to_transforms is a project-local helper that builds the image
# preprocessing pipeline (resize to media_size, tensor conversion, etc.).


class Pascal50sDataset(torch.utils.data.Dataset):
    idx2cat = {1: 'HC', 2: 'HI', 3: 'HM', 4: 'MM'}

    def __init__(self,
                 root: str = "data/Pascal-50s/",
                 media_size: int = 224,
                 voc_path: str = "data/VOC2010/"):
        super().__init__()
        self.voc_path = voc_path
        self.read_data(root)
        self.read_score(root)
        self.transforms = keys_to_transforms([], size=media_size)

    @staticmethod
    def loadmat(path):
        return scipy.io.loadmat(path)

    def read_data(self, root):
        mat = self.loadmat(
            os.path.join(root, "pyCIDErConsensus/pair_pascal.mat"))
        self.data = mat["new_input"][0]
        self.categories = mat["category"][0]
        # Sanity check: each of the four categories (HC, HI, HM, MM)
        # should contain exactly 1,000 pairs.
        c = torch.Tensor(mat["new_data"])
        hc = (c.sum(dim=-1) == 12).int()
        hi = (c.sum(dim=-1) == 13).int()
        hm = ((c < 6).sum(dim=-1) == 1).int()
        mm = ((c < 6).sum(dim=-1) == 2).int()
        assert 1000 == hc.sum()
        assert 1000 == hi.sum()
        assert 1000 == hm.sum()
        assert 1000 == mm.sum()
        assert (hc + hi + hm + mm).sum() == self.categories.shape[0]
        chk = (torch.Tensor(self.categories) - hc - hi * 2 - hm * 3 - mm * 4)
        assert 0 == chk.abs().sum(), chk

    def read_score(self, root):
        mat = self.loadmat(
            os.path.join(root, "pyCIDErConsensus/consensus_pascal.mat"))
        data = mat["triplets"][0]
        self.labels = []
        self.references = []
        for i in range(len(self)):
            votes = {}
            refs = []
            # 48 human judgments (triplets) per pair in consensus_pascal.mat.
            for j in range(i * 48, (i + 1) * 48):
                a, b, c, d = [x[0][0] for x in data[j]]
                key = b[0].strip() if 1 == d else c[0].strip()
                refs.append(a[0].strip())
                votes[key] = votes.get(key, 0) + 1
            assert 2 >= len(votes.keys()), votes
            assert len(votes.keys()) > 0
            vote_a = votes.get(self.data[i][1][0].strip(), 0)
            vote_b = votes.get(self.data[i][2][0].strip(), 0)
            if vote_a == 0 and vote_b == 0:
                # Neither candidate caption matches any voted caption.
                print("warning: data mismatch!")
                print(f"a: {self.data[i][1][0].strip()}")
                print(f"b: {self.data[i][2][0].strip()}")
                print(votes)
                exit()
            # Ties are broken randomly.
            label = 0 if vote_a > vote_b + random.random() - .5 else 1
            self.labels.append(label)
            self.references.append(refs)

    def __len__(self):
        return len(self.data)

    def get_image(self, filename: str):
        path = os.path.join(self.voc_path, "JPEGImages")
        img = Image.open(os.path.join(path, filename)).convert('RGB')
        return self.transforms(img)

    def __getitem__(self, idx: int):
        vid, a, b = [x[0] for x in self.data[idx]]
        label = self.labels[idx]
        feat = self.get_image(vid)
        a = a.strip()
        b = b.strip()
        references = self.references[idx]
        category = self.categories[idx]
        return feat, a, b, references, category, label
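
And roughly how I use it (paths as above; keys_to_transforms is my own preprocessing helper):

dataset = Pascal50sDataset(root="data/Pascal-50s/", voc_path="data/VOC2010/")
feat, a, b, references, category, label = dataset[0]
print(Pascal50sDataset.idx2cat[int(category)], label)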

@jnhwkim (Author) commented Mar 23, 2022

  • I was computing in float32 on the GPU (after encoding with CLIP, I cast to float32 for further operations), but I noticed this official code operates in float16. I followed your code to check, and, as you expected, the difference was too slight to change anything meaningful.
  • I checked the md5sums. All files are OK.
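
As a rough sketch of the kind of gap float16 introduces (random unit vectors here, not the real features):

import torch
import torch.nn.functional as F

img = F.normalize(torch.randn(1000, 512), dim=-1)
txt = F.normalize(torch.randn(1000, 512), dim=-1)
sims32 = (img * txt).sum(dim=-1)
# Round-trip the features through float16 to see how much the similarity moves.
sims16 = (img.half().float() * txt.half().float()).sum(dim=-1)
print((sims16 - sims32).abs().max())  # tiny (on the order of 1e-3 or less)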

@jmhessel (Owner) commented Mar 23, 2022

Okay, I think I figured it out! You've found a minor bug in my setup :-) The bug: just for the Pascal-50S dataset (interestingly, I fixed this for Abstract-50S...), I used the 11 human ratings within the pyCIDErConsensus/pair_pascal.mat instead of what I should have used, which is the 48 human ratings within the pyCIDErConsensus/consensus_pascal.mat! What are those 11 ratings, anyway, and why didn't I notice there were only 11 of them, rather than the 48 that there should have been? doh.
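
For anyone reading along, a quick sanity check (reusing the paths from the dataset class above) to confirm which file you are using; it should report 48 judgments per pair for the consensus file:

import scipy.io

mat = scipy.io.loadmat("data/Pascal-50s/pyCIDErConsensus/consensus_pascal.mat")
n_pairs = 4000  # 1,000 pairs for each of HC, HI, HM, MM
print(len(mat["triplets"][0]) / n_pairs)  # expected: 48.0 judgments per pair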

Thank you for finding and reporting this! 🎉 I will update the arXiv paper with the corrected results and look into submitting an errata note for the ACL Anthology version. I think the overall story won't change much --- but, given that many people compare on Pascal-50S, I will update ASAP. Is it okay if I add an acknowledgement to you for this?

Specifically: the main comparison this affects is Table 4 in https://arxiv.org/pdf/2104.08718.pdf . The comparison between CLIPScore/RefCLIPScore and the three rows with an asterisk (TIGEr, ViLBERTScore-F, BERT-S++) is not fair, because (presumably) those works evaluated using the 48 consensus ratings rather than the 11 ratings in the other file. The comparison with the non-asterisked rows is correct, because those were also computed using the 11 ratings.

After fixing this bug, my new numbers are much more aligned with yours. I haven't finalized the numbers yet, but I do see some differences.
For RefCLIPScore:

metric                & HC   & HI   & HM   & MM   & mean \\
RefCLIPScore          & 64.5 & 99.6 & 95.4 & 72.9 & 83.1 \\
reported RefCLIPScore & 57.9 & 99.5 & 96.1 & 80.8 & 83.6 \\

For CLIPScore:

metric                & HC   & HI   & HM   & MM   & mean \\
CLIPScore             & 56.4 & 99.3 & 96.4 & 70.7 & 80.7 \\
reported CLIPScore    & 60.3 & 99.4 & 97.9 & 77.3 & 83.7 \\

@jnhwkim (Author) commented Mar 23, 2022

It's my honor to be acknowledged. I accept. Thank you for clarifying the issue along the way.

jnhwkim closed this as completed Mar 23, 2022
@jmhessel (Owner)

Thanks again for your help :-D! I just submitted an updated version to arXiv, and also submitted an errata fix:

acl-org/acl-anthology#1859
