I am curious about how you integrate the Transformer into the captioning framework. As I currently understand it, the overall pipeline works as follows:
1. For each image, use ResNet to extract features in C x W x H format, where C is usually 512 and W and H are both 7.
2. Each spatial feature (C x 1 x 1) becomes one input to the decoder's dec_enc_att module, so there are W*H inputs to dec_enc_att in total. As an analogy, each input is like a single word embedding in a text-to-text translation task, and each image contributes W*H 'words'.
3. Each spatial feature (C x 1 x 1) undergoes a linear transformation before being fed to the dec_enc_att module. The dec_enc_att module treats these W*H projected spatial features as both Keys and Values (as defined in the Transformer structure), and uses the outputs as the Queries.
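To make the steps above concrete, here is a minimal NumPy sketch of what I have in mind. It is not taken from the repository: the names `cross_attention`, `w_in`, `w_k`, `w_v`, the sizes `d_model` and `T`, and the random weights are all hypothetical, it uses a single head, and I am assuming the Queries are decoder-side states (for example, the output of the decoder's self-attention).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, regions, w_k, w_v):
    """Single-head scaled dot-product attention over image regions.

    queries: (T, d_model) decoder-side states (assumed source of Queries)
    regions: (N, d_model) projected spatial features, N = W * H
    """
    keys = regions @ w_k                      # (N, d_model)
    values = regions @ w_v                    # (N, d_model)
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)    # (T, N)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ values, weights

rng = np.random.default_rng(0)
C, W, H = 512, 7, 7        # feature map size from step 1
d_model, T = 64, 5         # hypothetical model width and caption length

# Step 1: stand-in for the ResNet output of one image.
feature_map = rng.standard_normal((C, W, H))

# Step 2: flatten into W*H 'words', each a C-dimensional vector.
flat = feature_map.reshape(C, W * H).T        # (49, 512)

# Step 3: linear projection, then use the result as Keys and Values.
w_in = rng.standard_normal((C, d_model)) / np.sqrt(C)
w_k = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
w_v = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
regions = flat @ w_in                         # (49, 64)

queries = rng.standard_normal((T, d_model))   # placeholder decoder states
context, weights = cross_attention(queries, regions, w_k, w_v)

print(context.shape)   # (5, 64)
print(weights.shape)   # (5, 49)
```

In this sketch each caption position ends up with one attention distribution over the 49 image regions, which is the behavior I would expect from dec_enc_att if my reading is right.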
Is the understanding above correct? If not, would you kindly point out anything I have misunderstood? Thanks @njchoma!