
VQA input construction #48

Closed
fangpang20 opened this issue Mar 21, 2022 · 1 comment
@fangpang20
Hi guys,
Thank you for your diligent work. I'm trying to prepare VQA input for single-sample inference.
I'm not sure about the architecture of the VQA model, in particular the "decoder_prompts" and "prefix_tokens" fields in the automatically constructed "sample".
Also, the following sentence in the README description of VQA is unclear to me:
"we transform original VQA training questions with multiple golden answers into multiple training samples."
Do you have any suggestions?

@yangapku yangapku self-assigned this Mar 21, 2022
yangapku commented Mar 21, 2022

We use decoder_prompts and prefix_tokens for better VQA finetuning performance. Specifically, for VQA we have a hyper-parameter option called --prompt-type, which determines whether to prepend the question to the answer in the decoder's input sequence during finetuning & evaluation. The question has already been fed into the encoder; here we consider whether to feed it into the decoder again. If --prompt-type is not none, then decoder_prompts and prefix_tokens record the prepended question used to construct the decoder input sequence during evaluation: decoder_prompts is used for all-candidate evaluation and prefix_tokens is used for beam-search generative evaluation. In our experiments, we found that concatenating the question with the answer in the decoder input sequence improves accuracy somewhat compared with not performing the concatenation.
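To make this concrete, here is a minimal, self-contained sketch (not the actual OFA code; the tokenize helper and the token handling are simplified placeholders) of how the question could be prepended to the decoder sequence when --prompt-type is not none, and how decoder_prompts / prefix_tokens would then be populated:

```python
def tokenize(text):
    # Hypothetical whitespace "tokenizer" standing in for the real BPE/dictionary.
    return text.lower().strip().split()

def build_decoder_inputs(question, answer, prompt_type="prev_output"):
    """Return (decoder_target, decoder_prompt, prefix_tokens) for one sample.

    prompt_type == "none": the decoder only sees/produces the answer.
    otherwise:             the question is prepended to the decoder sequence;
                           the prepended part is recorded so that
                           - decoder_prompt can be used for all-candidate scoring, and
                           - prefix_tokens can seed beam-search generation.
    """
    q_toks = tokenize(question)
    a_toks = tokenize(answer)

    if prompt_type == "none":
        return a_toks, [], []

    # The question tokens are fed to the decoder again (they were already in the
    # encoder input); during finetuning the loss would only cover the answer part.
    decoder_target = q_toks + a_toks
    decoder_prompt = q_toks          # consumed when scoring every candidate answer
    prefix_tokens = q_toks           # forces beam search to start from the question
    return decoder_target, decoder_prompt, prefix_tokens

print(build_decoder_inputs("what color is the cat ?", "black"))
```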

For the other question, note that in the original VQAv2 dataset, most questions are annotated with more than one ground-truth answer. However, OFA is a seq2seq model, which requires each source sequence (image & question) to be paired with exactly one target sequence (ground-truth answer) during training. We therefore split each original sample, where one question is paired with multiple answers, into multiple seq2seq samples, each consisting of the question paired with one of the ground-truth answers.
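For illustration, a toy sketch of this expansion (the field names below are hypothetical, not the actual VQAv2/OFA data schema):

```python
raw_sample = {
    "image_id": "COCO_val2014_000000123456",
    "question": "what color is the cat ?",
    "answers": ["black", "black and white", "dark"],   # multiple golden answers
}

def expand_to_seq2seq(sample):
    # Each ground-truth answer becomes its own (source, target) training pair,
    # since a seq2seq model needs exactly one target sequence per source.
    return [
        {"image_id": sample["image_id"],
         "question": sample["question"],
         "answer": ans}
        for ans in sample["answers"]
    ]

for s in expand_to_seq2seq(raw_sample):
    print(s)
```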
