[Question] Regarding Captioning Evaluation on Flickr30k #768
Comments
Hi @devaansh100
Let me get back to you on this! I am rechecking my evaluation, and it actually seems to be the other way round; I might have made a typo.
I did do that; however, setting `max_new_tokens` lower still doesn't replicate the reported score. On a side note, the current LLaVA model on HF uses 576 image tokens for the LLM (resizing the image to 336x336 with a patch size of 14), but this paper gets 27.65 using 257 tokens (which corresponds to a 224x224 input: 16x16 patches plus the CLS token). Is there a previous version of LLaVA with such a setting?
Thanks for your reply. I am not the maintainer of LLaVA, but I can offer some of my opinions. As mentioned above:
I think it makes sense that LLaVA-1.5 performs poorly on the image captioning task compared to other models; after all, LLaVA-1.5 is mainly focused on the VQA task. By the way, I haven't used the LVLM-eHub before. Would it be possible for you to release your LLaVA-1.5 evaluation script for Flickr30k? I am also interested in reproducing your Flickr30k result :)
Hello~ @devaansh100
Then I wrote a prompt similar to this example: "Describe this image using one or more simple sentences". The output of LLaVA-1.5 is indeed much shorter, and CIDEr reaches 66.71.
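For anyone who wants to try this prompt directly, here is a minimal generation sketch. It is not the exact script from this thread; it assumes the `llava-hf/llava-1.5-7b-hf` checkpoint and the `transformers` LLaVA API.

```python
# Minimal sketch, not the exact script from this thread. Assumes the
# llava-hf/llava-1.5-7b-hf checkpoint and the transformers LLaVA API.
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example image
image = Image.open(requests.get(url, stream=True).raw)

# The short-caption prompt discussed above, in the LLaVA-1.5 chat format.
prompt = "USER: <image>\nDescribe this image using one or more simple sentences. ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
# Capping max_new_tokens keeps the caption short, which matters for CIDEr.
output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(processor.decode(output[0], skip_special_tokens=True))
```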
@ursulalujun I see, thank you! However, while these prompts/hyperparameters work, there is naturally some test leakage happening. Moreover, it doesn't replicate the 27.65 from the previous works. But I guess this lack of standardization is an inherent problem with LLM evaluation 🤷🏻‍♂️.
@HenryHZY Sorry for the late response, I had missed this message! I did not write a special script; rather, I converted the dataset into the format expected by `llava/eval/model_vqa.py`. For CIDEr, see the scoring sketch below.
Hope this helps!
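For completeness, here is one common way to score CIDEr on the generated captions; a minimal sketch, assuming the `pycocoevalcap` package (the comment above does not specify the exact scorer used).

```python
# CIDEr scoring sketch, assuming the pycocoevalcap package
# (pip install pycocoevalcap; the PTB tokenizer requires Java).
from pycocoevalcap.tokenizer.ptbtokenizer import PTBTokenizer
from pycocoevalcap.cider.cider import Cider

# gts: image_id -> reference captions; res: image_id -> one model caption.
# The ids and captions below are illustrative placeholders.
gts = {
    "1000092795": [
        {"caption": "Two young guys with shaggy hair look at their hands."},
        {"caption": "Two young, white males are outside near many bushes."},
    ]
}
res = {"1000092795": [{"caption": "Two men are standing outside near bushes."}]}

tokenizer = PTBTokenizer()
gts_tok = tokenizer.tokenize(gts)  # image_id -> list of tokenized strings
res_tok = tokenizer.tokenize(res)

score, per_image = Cider().compute_score(gts_tok, res_tok)
print(f"CIDEr: {score * 100:.2f}")  # papers report CIDEr scaled by 100
```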
@devaansh100 Yes, I agree with you! LLMs are sensitive to prompts, and they can memorize prompts seen during the training stage. I tried the prompt "give me a short description of this image", but it does not control the output length very well. By the way, I evaluated LLaVA-1.5 in ChEF, a newly published benchmark framework: https://github.com/OpenGVLab/LAMM/tree/ChEF
Question
Hi, thanks for the great work! I have been trying to evaluate LLaVA image captioning on Flickr30k, but I am not able to reproduce the results. While the original LLaVA paper does not report these scores, some other works like this and this do; both report a CIDEr score of 27.65.
Due to the lack of eval scripts for captioning (unless I missed it!), I have used `llava/eval/model_vqa.py`, keeping the same settings; however, I am getting near-zero scores. This is due to the detailed captions produced by the model, and reducing the `max_new_tokens` helps bring the score closer to the expected one. I understand these are separate works, but by any chance would you be aware of the script and parameters used to get these scores? More specifically, which prompt and which generation hyperparameters were used?
Thanks in advance!
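For anyone reproducing this setup, here is a sketch of the dataset conversion described in the replies above: it writes the Flickr30k test images into the JSONL question format read by `llava/eval/model_vqa.py`. It assumes the Karpathy-split `dataset_flickr30k.json` file; the paths and prompt are illustrative choices, not necessarily those used in the papers.

```python
# Sketch: convert the Flickr30k Karpathy split into the JSONL question file
# read by llava/eval/model_vqa.py (fields: question_id, image, text).
# Assumes dataset_flickr30k.json from the Karpathy splits; paths are examples.
import json

PROMPT = "Describe this image using one or more simple sentences."

with open("dataset_flickr30k.json") as f:
    data = json.load(f)

with open("flickr30k_test_questions.jsonl", "w") as out:
    for img in data["images"]:
        if img["split"] != "test":
            continue
        out.write(json.dumps({
            "question_id": img["imgid"],
            "image": img["filename"],
            "text": PROMPT,
        }) + "\n")
```

The resulting file can then be passed to `model_vqa.py` via `--question-file`, with the Flickr30k image directory as `--image-folder`. Note that in some versions of the script, `max_new_tokens` is hardcoded in the `generate()` call, so shortening the captions may require editing that call in addition to changing the prompt.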