[Question] Regarding Captioning Evaluation on Flickr30k #768

Open
devaansh100 opened this issue Nov 7, 2023 · 6 comments

Comments

@devaansh100

Question

Hi, thanks for the great work! I have been trying to evaluate LLaVA image captioning on Flickr30k, but I am not able to reproduce the results. While the original LLaVA paper does not report these scores, some other works like this and this do; both report a CIDEr score of 27.65.

Due to the lack of eval scripts for captioning (unless I missed them!), I have used llava/eval/model_vqa.py keeping the same settings; however, I am getting near-zero scores. This is due to the detailed captions produced by the model. Reducing max_new_tokens helps bring the score closer to the expected value.

I understand these are separate works, but would you happen to know the script and parameters used to get these scores? More specifically:

  1. The checkpoint used (MPT seems to give better performance on this task than Vicuna).
  2. The generation settings used.

Thanks in advance!

@HenryHZY

HenryHZY commented Nov 8, 2023

Hi @devaansh100
Can you share more information about why "MPT seems to be giving better performance on this task than Vicuna"?
For the CIDEr score of 27.65, I think you can modify their evaluation scripts (link), though I haven't tried it.

@devaansh100
Author

MPT seems to be giving better performance on this task than Vicuna

Let me get back to you on this! I am rechecking my evaluation, and it actually seems to be the other way round; I might have made a typo.

modify their evaluation scripts

I did do that; however, setting max_new_tokens to 256 seems to be a bit too high. Lowering it to around 20 actually gives a CIDEr score of 36.14, while 50 drops it down to 0.18. I am trying some intermediate hyperparameters too (all experiments use 1 beam, temperature = 0.2, and top_k = 1). This variation is where the confusion stems from.
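
For concreteness, the sweep corresponds roughly to the following generate() call (just a sketch; names like model, input_ids, and image_tensor stand in for whatever model_vqa.py sets up):

# Sketch of the generation settings being swept; only max_new_tokens changes
# between runs. top_k=1 with sampling is effectively greedy decoding.
output_ids = model.generate(
    input_ids,
    images=image_tensor,
    do_sample=True,
    temperature=0.2,
    top_k=1,
    num_beams=1,
    max_new_tokens=20,  # 20 gives CIDEr 36.14, 50 gives 0.18, 256 is near zero
)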

On a side note, the current LLaVA model on HF uses 576 image tokens for the LLM (resizing the image to 336x336 and using a patch size of 14), but this paper gets 27.65 using 257 tokens. Is there a previous version of LLaVA with such a setting?

@HenryHZY

HenryHZY commented Nov 9, 2023

Thanks for your reply. I am not a maintainer of LLaVA, but I can offer some opinions.

As mentioned above:

'max_new_tokens=20' with 'CIDEr=36.14' refers to LLaVA-1.5.
'max_new_tokens=16' (the default setting of LVLM-eHub) with 'CIDEr=27.65' refers to the previous LLaVA version, which uses 224x224 image resolution.

I think it makes sense that LLaVA-1.5 performs poorly on the image captioning task compared to other models. After all, LLaVA-1.5 is mainly focused on the VQA task.

By the way, I haven't used LVLM-eHub before. Would you be able to share your LLaVA-1.5 evaluation script on Flickr30k? I am also interested in reproducing your Flickr30k result :)

@ursulalujun

ursulalujun commented Nov 30, 2023

Hello~ @devaansh100
I noticed that the output of LLaVA-1.5 can sometimes be much longer than the GT when evaluating on Flickr30k. Although the content of the outputs is correct, the length mismatch leads to a decrease in similarity and CIDEr. I checked the paper and found this passage:

To address this, we propose using a single response formatting prompt that clearly indicates the output format, to be appended at the end of VQA questions when promoting short answers: Answer the question using a single word or phrase. We empirically show that when LLM is finetuned with such prompts, LLaVA is able to properly adjust the output format according to the user’s instructions, and does not require additional processing of the VQA data.

Then I wrote a prompt similar to this example: "Describe this image using one or more simple sentences". The output of LLaVA-1.5 is indeed much shorter, and CIDEr reaches 66.71.
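
For clarity, the only change on my side is the caption query itself, along these lines (a sketch; the plain query string here is just a placeholder for whatever query was used before):

# Plain caption query vs. a response-formatting variant, in the spirit of the
# "Answer the question using a single word or phrase." suffix from the paper.
plain_query = "Describe this image."  # placeholder
short_query = "Describe this image using one or more simple sentences."
# Using short_query keeps generations close in length to the single-sentence
# Flickr30k references, which is what raises CIDEr to 66.71 here.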

@devaansh100
Author

@ursulalujun I see, thank you! However, while these prompts/hyperparameters work, there is naturally some test leakage happening. Moreover, it doesn't replicate the 27.65 from the previous works. But I guess this lack of standardization is an inherent problem with LLM evaluation 🤷🏻‍♂️.

@HenryHZY Sorry for the late response; I had missed this message! I did not write a special script. Rather, I converted the dataset into the format expected by model_vqa.py. The entries looked like this:

{"image": "1007129816.jpg", "text": "Caption this image.", "question_id": 0}
{"image": "1009434119.jpg", "text": "Caption this image.", "question_id": 1}
{"image": "101362133.jpg", "text": "Caption this image.", "question_id": 2}

For CIDEr:

from pycocoevalcap.tokenizer.ptbtokenizer import PTBTokenizer
from pycocoevalcap.cider.cider import Cider
# from pycocoevalcap.meteor.meteor import Meteor  # uncomment to also report METEOR
import ast
import json
import sys

import pandas as pd


class Evaluator:
    def __init__(self) -> None:
        self.tokenizer = PTBTokenizer()
        self.scorer_list = [
            (Cider(), "CIDEr"),
            # (Meteor(), "METEOR"),
        ]
        self.evaluation_report = {}

    def do_the_thing(self, golden_reference, candidate_reference):
        golden_reference = self.tokenizer.tokenize(golden_reference)
        candidate_reference = self.tokenizer.tokenize(candidate_reference)

        # From this point, some variables are named as in the original code
        # (no idea why they are named like this):
        # https://github.com/salaniz/pycocoevalcap/blob/a24f74c408c918f1f4ec34e9514bc8a76ce41ffd/eval.py#L51-L63
        for scorer, method in self.scorer_list:
            score, scores = scorer.compute_score(golden_reference, candidate_reference)
            if isinstance(method, list):
                for sc, scs, m in zip(score, scores, method):
                    self.evaluation_report[m] = sc
            else:
                self.evaluation_report[method] = score


# Ground truth: the 'raw' column stores the reference captions for each image
# as a list literal, so parse it rather than splitting on commas.
df = pd.read_csv('../Datasets/Flickr30k/flickr_annotations_30k.csv')
df = df[df['split'] == 'test']

# Model outputs: the model_vqa.py answers .jsonl, one line per test image,
# in the same order as the test split above.
with open(f'Flickr30k/{sys.argv[1]}.jsonl') as f:
    outputs = [json.loads(x)['text'] for x in f]

golden_reference = []
candidate_reference = []
for i, (_, row) in enumerate(df.iterrows()):
    golden_reference.append(ast.literal_eval(row['raw']))
    candidate_reference.append(outputs[i])

# pycocoevalcap expects {image_id: [{'caption': ...}, ...]} on both sides.
golden_reference = {k: [{'caption': c} for c in v] for k, v in enumerate(golden_reference)}
candidate_reference = {k: [{'caption': v}] for k, v in enumerate(candidate_reference)}

evaluator = Evaluator()
evaluator.do_the_thing(golden_reference, candidate_reference)
print(evaluator.evaluation_report)
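
(If the script above is saved as, say, eval_cider.py, it is run as python eval_cider.py <answers-file-name>, where the argument is the name of the model_vqa.py answers .jsonl under Flickr30k/, without the extension.)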

Hope this helps!

@ursulalujun

@devaansh100 Yes, I agree with you! LLMs are sensitive to the prompt, and they can memorize prompts seen during the training stage. I have tried the prompt "give me a short description of this image", but it cannot control the length of the output very well. By the way, I evaluated LLaVA-1.5 in ChEF, a newly published benchmark framework: https://github.com/OpenGVLab/LAMM/tree/ChEF
