[Question] Regarding Captioning Evaluation on Flickr30k #768

Open
devaansh100 opened this issue Nov 7, 2023 · 6 comments

Comments

@devaansh100

Question

Hi, thanks for the great work! I have been trying to evaluate LLaVA image captioning on Flickr30k, but I am not able to reproduce the results. While the original LLaVA paper does not report these scores, some other works like this and this do; both report a CIDEr score of 27.65.

Due to the lack of eval scripts for captioning (unless I missed them!), I have used llava/eval/model_vqa.py keeping the same settings; however, I am getting near-zero scores. This is due to the detailed captions produced by the model. Reducing max_new_tokens helps bring the score closer to the expected value.

I understand these are separate works, but would you happen to know the script and parameters used to get these scores? More specifically:

  1. The checkpoint used (MPT seems to give better performance on this task than Vicuna).
  2. The generation settings used.

Thanks in advance!

@HenryHZY

HenryHZY commented Nov 8, 2023

Hi @devaansh100
Can you share more information about why "MPT seems to be giving better performance on this task than Vicuna"?
For the CIDEr score of 27.65, I think you can modify their evaluation scripts (link), though I haven't tried it.

@devaansh100
Author

MPT seems to be giving better performance on this task than Vicuna

Let me get back to you on this! I am rechecking my evaluation, and it actually seems to be the other way round; I might have made a typo.

modify their evaluation scripts

I did do that; however, setting max_new_tokens to 256 seems to be a bit too high. Lowering it to around 20 actually gives a CIDEr score of 36.14, while 50 drops it down to 0.18. I am trying some intermediate hyperparameters too (all experiments use 1 beam, temperature = 0.2, and top_k = 1). This variation is where the confusion stems from.
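
For concreteness, the sweep corresponds roughly to the following generate() call (just a sketch; names like model, input_ids, and image_tensor stand in for whatever model_vqa.py sets up):

# Sketch of the generation settings being swept; only max_new_tokens changes
# between runs. top_k=1 with sampling is effectively greedy decoding.
output_ids = model.generate(
    input_ids,
    images=image_tensor,
    do_sample=True,
    temperature=0.2,
    top_k=1,
    num_beams=1,
    max_new_tokens=20,  # 20 gives CIDEr 36.14, 50 gives 0.18, 256 is near zero
)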

On a side note, the current LLaVA model on HF uses 576 image tokens for the LLM (resizing the image to 336x336 and using a patch size of 14), but this paper gets 27.65 using 257 tokens. Is there a previous version of LLaVA with such a setting?

@HenryHZY

HenryHZY commented Nov 9, 2023

Thanks for your reply. I am not a maintainer of LLaVA, but I can offer some opinions.

As mentioned above:

'max_new_tokens=20' with 'CIDEr=36.14' refers to LLaVA-1.5.
'max_new_tokens=16' (the default setting of LVLM-eHub) with 'CIDEr=27.65' refers to the previous LLaVA version, which uses 224x224 image resolution.

I think it makes sense that LLaVA-1.5 performs poorly on the image captioning task compared to other models. After all, LLaVA-1.5 is mainly focused on the VQA task.

By the way, I haven't used LVLM-eHub before. Would you be able to share your LLaVA-1.5 evaluation script on Flickr30k? I am also interested in reproducing your Flickr30k result :)

@ursulalujun

ursulalujun commented Nov 30, 2023

Hello~ @devaansh100
I noticed that the output of LLaVA-1.5 can sometimes be much longer than the GT when evaluating on Flickr30k. Although the content of the outputs is correct, the length mismatch leads to a decrease in similarity and CIDEr. I checked the paper and found this passage:

To address this, we propose using a single response formatting prompt that clearly indicates the output format, to be appended at the end of VQA questions when promoting short answers: Answer the question using a single word or phrase. We empirically show that when LLM is finetuned with such prompts, LLaVA is able to properly adjust the output format according to the user’s instructions, and does not require additional processing of the VQA data.

Then I wrote a prompt similar to this example: "Describe this image using one or more simple sentences". The output of LLaVA-1.5 is indeed much shorter, and CIDEr reaches 66.71.
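
For clarity, the only change on my side is the caption query itself, along these lines (a sketch; the plain query string here is just a placeholder for whatever query was used before):

# Plain caption query vs. a response-formatting variant, in the spirit of the
# "Answer the question using a single word or phrase." suffix from the paper.
plain_query = "Describe this image."  # placeholder
short_query = "Describe this image using one or more simple sentences."
# Using short_query keeps generations close in length to the single-sentence
# Flickr30k references, which is what raises CIDEr to 66.71 here.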

@devaansh100
Author

@ursulalujun I see, thank you! However, while these prompts/hyperparameters work, there is naturally some test leakage happening. Moreover, it doesn't replicate the 27.65 from the previous works. But I guess this lack of standardization is an inherent problem with LLM evaluation 🤷🏻‍♂️.

@HenryHZY Sorry for the late response; I had missed this message! I did not write a special script. Rather, I converted the dataset into the format expected by model_vqa.py. The entries looked like this:

{"image": "1007129816.jpg", "text": "Caption this image.", "question_id": 0}
{"image": "1009434119.jpg", "text": "Caption this image.", "question_id": 1}
{"image": "101362133.jpg", "text": "Caption this image.", "question_id": 2}

For CIDEr:

from pycocoevalcap.tokenizer.ptbtokenizer import PTBTokenizer
from pycocoevalcap.cider.cider import Cider
# from pycocoevalcap.meteor.meteor import Meteor  # uncomment to also report METEOR
import ast
import json
import sys

import pandas as pd


class Evaluator:
    def __init__(self) -> None:
        self.tokenizer = PTBTokenizer()
        self.scorer_list = [
            (Cider(), "CIDEr"),
            # (Meteor(), "METEOR"),
        ]
        self.evaluation_report = {}

    def do_the_thing(self, golden_reference, candidate_reference):
        golden_reference = self.tokenizer.tokenize(golden_reference)
        candidate_reference = self.tokenizer.tokenize(candidate_reference)

        # From this point, some variables are named as in the original code
        # (no idea why they are named like this):
        # https://github.com/salaniz/pycocoevalcap/blob/a24f74c408c918f1f4ec34e9514bc8a76ce41ffd/eval.py#L51-L63
        for scorer, method in self.scorer_list:
            score, scores = scorer.compute_score(golden_reference, candidate_reference)
            if isinstance(method, list):
                for sc, scs, m in zip(score, scores, method):
                    self.evaluation_report[m] = sc
            else:
                self.evaluation_report[method] = score


# Ground truth: the 'raw' column stores the reference captions for each image
# as a list literal, so parse it rather than splitting on commas.
df = pd.read_csv('../Datasets/Flickr30k/flickr_annotations_30k.csv')
df = df[df['split'] == 'test']

# Model outputs: the model_vqa.py answers .jsonl, one line per test image,
# in the same order as the test split above.
with open(f'Flickr30k/{sys.argv[1]}.jsonl') as f:
    outputs = [json.loads(x)['text'] for x in f]

golden_reference = []
candidate_reference = []
for i, (_, row) in enumerate(df.iterrows()):
    golden_reference.append(ast.literal_eval(row['raw']))
    candidate_reference.append(outputs[i])

# pycocoevalcap expects {image_id: [{'caption': ...}, ...]} on both sides.
golden_reference = {k: [{'caption': c} for c in v] for k, v in enumerate(golden_reference)}
candidate_reference = {k: [{'caption': v}] for k, v in enumerate(candidate_reference)}

evaluator = Evaluator()
evaluator.do_the_thing(golden_reference, candidate_reference)
print(evaluator.evaluation_report)
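
(If the script above is saved as, say, eval_cider.py, it is run as python eval_cider.py <answers-file-name>, where the argument is the name of the model_vqa.py answers .jsonl under Flickr30k/, without the extension.)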

Hope this helps!

@ursulalujun

@devaansh100 Yes, I agree with you! LLMs are sensitive to the prompt, and they can memorize prompts seen during the training stage. I have tried the prompt "give me a short description of this image", but it cannot control the length of the output very well. By the way, I evaluated LLaVA-1.5 in ChEF, a newly published benchmark framework: https://github.com/OpenGVLab/LAMM/tree/ChEF
