Reproducing Pascal-50S #4

jnhwkim · 2022-03-23T00:57:19Z

Thanks for sharing the code.

I am trying to reproduce Pascal-50S, but I got some different scores as follows:

CLIP-S
HC, HI, HM, MM, Mean
0.551, 0.992, 0.960, 0.715, 0.804
RefCLIP-S
HC, HI, HM, MM, Mean
0.650, 0.995, 0.956, 0.737, 0.835

Not much different when resampling the five references among 48 candidates.

Observations are:

CLIP-S is lower than RefCLIP-S.
The mean of RefCLIP-S is similar to the paper's report, but HC and MM reports are significantly different.

For integrity, I checked the number of samples for each category (=1k), and double-check the categories using the fields of the Pascal-50S dataset, new_data and category.

I believe there are very few factors to swing the results. My questions are:

I preprocessed all captions using strip() and add the prefix "A photo depicts ".
For the CLIP encoding, I used the CLIP tokenizer like clip.tokenize(caption, truncate=True).
Random selection of 5 references among the 48 candidates for RefCLIP-S.

Is there any other factor to reproduce as in the paper?

The text was updated successfully, but these errors were encountered:

jmhessel · 2022-03-23T01:50:15Z

Hi @jnhwkim ! Thanks for reaching out about this! Quick question: do you have results for the other metrics as well ?

jnhwkim · 2022-03-23T01:56:06Z

@jmhessel I just tried to reproduce CLIP-S and RefCLIP-S. For clarity, I successfully reproduced with similar results for the other benchmarks, Flickr8k-Expert, CF, Composite, and FOIL.

If you think getting the results for the other metrics is necessary to pinpoint the issue, I should try.

jmhessel · 2022-03-23T02:04:36Z

Gotcha --- I'm glad you were able to reproduce the other datasets. it's been a while, but I should be able to dig up my evaluation code for pascal 50s. I will get back to you in the next few days, if not sooner. Everything you mentioned sounds correct though. By the way --- can you check the md5sum of an image here to make sure it matches up ? https://github.com/jmhessel/clipscore/blob/main/checksums/pascal_50s_checksum.txt

jnhwkim · 2022-03-23T02:06:42Z

@jmhessel I checked md5sum for one or two images, but I can try the whole for making sure.

jmhessel · 2022-03-23T02:08:17Z

That sounds good! let me do a bit of digging --- it's been a while since I've looked at the code

jnhwkim · 2022-03-23T02:10:01Z

@jmhessel I understand. Sincerely, thank you for your effort.

jmhessel · 2022-03-23T03:32:02Z

Okay, I was able to re-run everything for my internal code for a random sample of references. What I get is:

 metric               & HC   & HI   & HM   & MM   \\
 RefCLIPScore         & 57.6 & 99.5 & 96.2 & 80.8 \\

which is fairly in-line with the reported numbers. I often find that numerical precision issues make the numbers vary a bit (e.g., the float16 vs. float32 will make the results vary slightly). I did my best to standardize everything within this public, cleaner code base, which is the official implementation :-)

By the way: one question --- how do you break ties? I break ties randomly in my code, rather than always defaulting to a loss. That could help to explain the slightly lower scores. I will dig a bit more on this in the next few days.

jmhessel · 2022-03-23T03:42:28Z

Another question: what hardware are you running on and what numpy version are you using?

If you're running on GPU and using a particular version of numpy, there's a critical bug that might effect cosine sim computations (see #2 ). I know the other datasets have worked fine, but just to be sure :-)

jnhwkim · 2022-03-23T04:00:03Z

I am using the cosine similarity function from torch.
Ties are broken randomly using add 0~1 random number - 0.5 to one side as following code.
I will check whether the precision issue is particularly critical in this dataset.

My Pascal-50S dataset class looks like this:

class Pascal50sDataset(torch.utils.data.Dataset):
    idx2cat = {1: 'HC', 2: 'HI', 3: 'HM', 4: 'MM'}

    def __init__(self,
                 root: str = "data/Pascal-50s/",
                 media_size: int = 224,
                 voc_path: str = "data/VOC2010/"):
        super().__init__()
        self.voc_path = voc_path
        self.read_data(root)
        self.read_score(root)
        self.transforms = keys_to_transforms([], size=media_size)

    @staticmethod
    def loadmat(path):
        return scipy.io.loadmat(path)

    def read_data(self, root):
        mat = self.loadmat(
            os.path.join(root, "pyCIDErConsensus/pair_pascal.mat"))
        self.data = mat["new_input"][0]
        self.categories = mat["category"][0]
        # sanity check
        c = torch.Tensor(mat["new_data"])
        hc = (c.sum(dim=-1) == 12).int()
        hi = (c.sum(dim=-1) == 13).int()
        hm = ((c < 6).sum(dim=-1) == 1).int()
        mm = ((c < 6).sum(dim=-1) == 2).int()
        assert 1000 == hc.sum()
        assert 1000 == hi.sum()
        assert 1000 == hm.sum()
        assert 1000 == mm.sum()
        assert (hc + hi + hm + mm).sum() == self.categories.shape[0]
        chk = (torch.Tensor(self.categories) - hc - hi * 2 - hm * 3 - mm * 4)
        assert 0 == chk.abs().sum(), chk

    def read_score(self, root):
        mat = self.loadmat(
            os.path.join(root, "pyCIDErConsensus/consensus_pascal.mat"))
        data = mat["triplets"][0]
        self.labels = []
        self.references = []
        for i in range(len(self)):
            votes = {}
            refs = []
            for j in range(i * 48, (i + 1) * 48):
                a,b,c,d = [x[0][0] for x in data[j]]
                key = b[0].strip() if 1 == d else c[0].strip()
                refs.append(a[0].strip())
                votes[key] = votes.get(key, 0) + 1
            assert 2 >= len(votes.keys()), votes
            assert len(votes.keys()) > 0
            try:
                vote_a = votes.get(self.data[i][1][0].strip(), 0)
                vote_b = votes.get(self.data[i][2][0].strip(), 0)
            except KeyError:
                print("warning: data mismatch!")
                print(f"a: {self.data[i][1][0].strip()}")
                print(f"b: {self.data[i][2][0].strip()}")
                print(votes)
                exit()
            # Ties are broken randomly.
            label = 0 if vote_a > vote_b + random.random() - .5 else 1
            self.labels.append(label)
            self.references.append(refs)

    def __len__(self):
        return len(self.data)

    def get_image(self, filename: str):
        path = os.path.join(self.voc_path, "JPEGImages")
        img = Image.open(os.path.join(path, filename)).convert('RGB')
        return self.transforms(img)

    def __getitem__(self, idx: int):
        vid, a, b = [x[0] for x in self.data[idx]]
        label = self.labels[idx]
        feat = self.get_image(vid)
        a = a.strip()
        b = b.strip()
        references = self.references[idx]
        category = self.categories[idx]
        return feat, a, b, references, category, label

jnhwkim · 2022-03-23T05:08:27Z

I was calculating on float32 in GPU (after encoding by CLIP I cast it as float32 for further operation), but I noticed this official code operates on float16. I followed your code to check, but, as you expected, the difference was too slight for meaningful change.
I checked the md5sum. All files are OK.

jmhessel · 2022-03-23T06:38:55Z

Okay, I think I figured it out! You've found a minor bug in my setup :-) The bug: just for the Pascal-50S dataset (interestingly, I fixed this for Abstract-50S...), I used the 11 human ratings within the pyCIDErConsensus/pair_pascal.mat instead of what I should have used, which is the 48 human ratings within the pyCIDErConsensus/consensus_pascal.mat! What are those 11 ratings, anyway, and why didn't I notice there were only 11 of them, rather than the 48 that there should have been? doh.

Thank you for finding and reporting this! 🎉 I will update the arxiv paper with the updated results and look into submitting an errata with a note for the ACL anthology version. I think the overall story won't change much --- but, given that many people compare on Pascal-50S, I will update ASAP. Is it okay if I add an acknowledgement to you for this?

In specific: The main comparison that this affects is Table 4 in https://arxiv.org/pdf/2104.08718.pdf . The comparison between clipscore/refclipscore versus the three rows with an asterisk (TIGEr, ViLBERTScore-F, BERT-S++) is not fair, because (presumably) those works evaluated in the setting using the 48 consensus ratings rather than the 11 ratings within the other file. The comparison with the non-asterisked rows is correct, because those were also computed using the 11 ratings.

When fixing this bug, my new numbers are much more aligned with yours. I haven't finalized the numbers, yet, but I get some differences.
For RefCLIPScore:

metric               &   HC &   HI &   HM &   MM &   mean \\
RefCLIPScore         & 64.5 & 99.6 & 95.4 & 72.9 &   83.1 \\
reported RefCLIPScore 57.9 & 99.5 & 96.1 & 80.8 & 83.6 \\

For CLIPScore:

CLIPScore            & 56.4 & 99.3 & 96.4 & 70.7 &   80.7 \\
reported CLIPScore & 60.3 & 99.4 & 97.9 &77.3 &83.7 \\

jnhwkim · 2022-03-23T07:18:11Z

It's my honor to be acknowledged. I accept. Thank you for clarifying the issue along the way.

jmhessel · 2022-03-23T19:49:51Z

Thanks again for your help :-D ! I just submitted an updated version to arxiv, and also submitted an errata fix.

acl-org/acl-anthology#1859

jnhwkim closed this as completed Mar 23, 2022

jmhessel mentioned this issue Apr 17, 2022

PASCAL-50S #5

Closed

sarasarto mentioned this issue Oct 16, 2023

Pascal-50S and FOIL aimagelab/pacscore#4

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reproducing Pascal-50S #4

Reproducing Pascal-50S #4

jnhwkim commented Mar 23, 2022

jmhessel commented Mar 23, 2022

jnhwkim commented Mar 23, 2022 •

edited

jmhessel commented Mar 23, 2022

jnhwkim commented Mar 23, 2022

jmhessel commented Mar 23, 2022

jnhwkim commented Mar 23, 2022

jmhessel commented Mar 23, 2022

jmhessel commented Mar 23, 2022 •

edited

jnhwkim commented Mar 23, 2022 •

edited

jnhwkim commented Mar 23, 2022 •

edited

jmhessel commented Mar 23, 2022 •

edited

jnhwkim commented Mar 23, 2022

jmhessel commented Mar 23, 2022

Reproducing Pascal-50S #4

Reproducing Pascal-50S #4

Comments

jnhwkim commented Mar 23, 2022

jmhessel commented Mar 23, 2022

jnhwkim commented Mar 23, 2022 • edited

jmhessel commented Mar 23, 2022

jnhwkim commented Mar 23, 2022

jmhessel commented Mar 23, 2022

jnhwkim commented Mar 23, 2022

jmhessel commented Mar 23, 2022

jmhessel commented Mar 23, 2022 • edited

jnhwkim commented Mar 23, 2022 • edited

jnhwkim commented Mar 23, 2022 • edited

jmhessel commented Mar 23, 2022 • edited

jnhwkim commented Mar 23, 2022

jmhessel commented Mar 23, 2022

jnhwkim commented Mar 23, 2022 •

edited

jmhessel commented Mar 23, 2022 •

edited

jnhwkim commented Mar 23, 2022 •

edited

jnhwkim commented Mar 23, 2022 •

edited

jmhessel commented Mar 23, 2022 •

edited