Integration of Resnet features with Transformer #2

yfeng997 · 2019-08-18T06:44:25Z

I am curious about how you integrate the Transformer into the captioning framework. As I understand it, the whole pipeline works as follows:

1. For each image, use ResNet to extract features in C x W x H format, where C is usually 512 and W and H are 7.
2. Each spatial feature (C x 1 x 1) becomes one input to the decoder's dec_enc_att module, so there are W*H inputs to dec_enc_att in total. As an analogy, each input is like a single word embedding in a text-to-text translation task, and each image yields W*H 'words'.
3. Each spatial feature (C x 1 x 1) undergoes a linear transformation before being fed to the dec_enc_att module. The dec_enc_att module treats these W*H projected spatial features as both Keys and Values (as defined in the Transformer architecture), and uses the decoder's outputs as the Queries.
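
To make the shapes concrete, the three steps above can be sketched in NumPy as follows. This is only an illustration of the understanding described in this question, not the repository's code. The names `d_model`, `T`, `W_proj`, `W_q`, `W_k`, `W_v` and both helper functions are assumptions, random arrays stand in for the ResNet output and the decoder states, and a single attention head is used for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def features_to_sequence(feats):
    """Turn a (C, W, H) feature map into W*H vectors of size C."""
    C, W, H = feats.shape
    return feats.reshape(C, W * H).T  # shape (W*H, C)


def cross_attention(queries, memory, W_q, W_k, W_v):
    """Single-head scaled dot-product attention (hypothetical helper).

    queries: (T, d_model) decoder-side states, used as Queries
    memory:  (S, d_model) projected image features, used as Keys and Values
    """
    Q = queries @ W_q
    K = memory @ W_k
    V = memory @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (T, S)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights


rng = np.random.default_rng(0)
C, W, H = 512, 7, 7  # sizes quoted in the question
d_model = 64         # assumed model width
T = 5                # assumed number of caption tokens so far

# Step 1: random stand-in for the ResNet feature map of one image.
feats = rng.standard_normal((C, W, H))

# Step 2: each spatial location becomes one "word" of the image.
seq = features_to_sequence(feats)  # (49, 512)

# Step 3: linear projection, then use the result as Keys and Values.
W_proj = rng.standard_normal((C, d_model)) / np.sqrt(C)
memory = seq @ W_proj  # (49, d_model)

# Random stand-in for the decoder states that act as Queries.
decoder_states = rng.standard_normal((T, d_model))
W_q, W_k, W_v = (
    rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    for _ in range(3)
)
out, attn = cross_attention(decoder_states, memory, W_q, W_k, W_v)

print(seq.shape, memory.shape, out.shape, attn.shape)
# (49, 512) (49, 64) (5, 64) (5, 49)
```

Each row of `attn` is a distribution over the 49 image locations for one caption position. That is the sense in which the image supplies W*H 'words' for the decoder to attend over.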

Is the understanding above correct? If not, would you kindly point out any misunderstanding? Thanks @njchoma!
