Figure 5 in the paper shows samples of generated image descriptions, but I couldn't reproduce similar results with the pretrained ViLBERT. I used BertForMultiModalPreTraining and supplied the image's region features, which seem to be fine, given that prediction_scores_v (the h_v vector in the paper) seems to reflect what is in the picture. As the "question", I supplied a tensor of 30 [MASK] tokens.
Then, following the paper, I passed this through the model 30 times, at each iteration i setting the i-th token of the "question" (text stream) to the text token with the highest score at the i-th position. I also tried repeating the whole procedure several times, but that didn't change much. The result is very poor captions, such as "the a man is a man who is a man who is a man ...".
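For reference, here is a minimal sketch of the loop I'm running. The names `model`, `image_features`, and `image_locations` stand in for my actual setup, and the exact forward() signature and return values of BertForMultiModalPreTraining are my assumption of how to call it, based on the repo:

```python
import torch
from pytorch_transformers.tokenization_bert import BertTokenizer

MAX_LEN = 30  # number of [MASK] tokens used as the text stream

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
mask_id = tokenizer.convert_tokens_to_ids(["[MASK]"])[0]

# model: the pretrained BertForMultiModalPreTraining (loaded elsewhere);
# image_features / image_locations: region features and boxes from my
# feature extractor. Start the text stream as all [MASK] tokens.
input_ids = torch.full((1, MAX_LEN), mask_id, dtype=torch.long)

for i in range(MAX_LEN):
    # Assumed outputs: prediction_scores_t has shape (1, MAX_LEN, vocab_size),
    # prediction_scores_v is the visual-stream prediction (h_v in the paper).
    prediction_scores_t, prediction_scores_v, _ = model(
        input_ids, image_features, image_locations
    )
    # Fix the i-th token to its highest-scoring prediction; earlier
    # tokens stay fixed, later positions remain [MASK].
    input_ids[0, i] = prediction_scores_t[0, i].argmax().item()

print(" ".join(tokenizer.convert_ids_to_tokens(input_ids[0].tolist())))
```

This is exactly the greedy left-to-right mask-filling I described above; if the paper's procedure differs (e.g. in how already-predicted tokens are handled between iterations), that would explain the gap.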
Could you please elaborate on the captioning method you've presented in the publication?