
image captioning with ViLBERT #20

Open
szalata opened this issue Oct 3, 2019 · 0 comments

szalata commented Oct 3, 2019

Figure 5 in the paper shows samples of generated image descriptions, but I couldn't reproduce similar results with the pretrained ViLBERT. I used BertForMultiModalPreTraining and supplied the image features, which seem to be fine: the prediction_scores_v output (the h_v vector in the paper) reflects what is in the picture. As the "question" (text stream), I supplied a tensor of 30 [MASK] tokens.
Then, following the paper, I passed that input through the model 30 times, at each iteration setting the i-th token of the text stream to the vocabulary token with the highest score at the i-th position.
I also tried repeating the whole procedure multiple times, but it didn't change much. The result is very poor captions, such as "the a man is a man who is a man who is a man ...".
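For concreteness, here is a minimal sketch of the decoding loop I described (Python/PyTorch). The model call is simplified and partly assumed: the real BertForMultiModalPreTraining forward also takes image locations, token type ids, and attention masks, and `MASK_ID` assumes the standard BERT vocabulary.

```python
import torch

MASK_ID = 103   # [MASK] id in the standard bert-base-uncased vocabulary (assumption)
SEQ_LEN = 30    # length of the all-[MASK] text stream

def greedy_mask_decode(model, img_feats, tokenizer):
    """Iteratively fill a fully masked text stream, left to right.

    model     -- a pretrained BertForMultiModalPreTraining (call simplified here)
    img_feats -- precomputed region features, e.g. shape (1, num_regions, 2048)
    """
    # start from a text stream of 30 [MASK] tokens
    tokens = torch.full((1, SEQ_LEN), MASK_ID, dtype=torch.long)
    for i in range(SEQ_LEN):
        with torch.no_grad():
            # prediction_scores_t: (1, SEQ_LEN, vocab_size) language-modelling head;
            # prediction_scores_v is the visual head (h_v in the paper), unused here
            prediction_scores_t, prediction_scores_v = model(tokens, img_feats)
        # fix the i-th token to the highest-scoring vocabulary entry at position i
        tokens[0, i] = prediction_scores_t[0, i].argmax()
    return tokenizer.convert_ids_to_tokens(tokens[0].tolist())
```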

Could you please elaborate on the captioning method you've presented in the publication?
