
Integration of Resnet features with Transformer #2

Open
yfeng997 opened this issue Aug 18, 2019 · 0 comments


yfeng997 commented Aug 18, 2019

I am curious about the details of how you are integrating the transformer into the captioning framework. As I understand it now, the whole picture works as follows:

  1. For each image, use ResNet to extract features in C x W x H format, where C is usually 512 and W and H are 7.

  2. Each spatial feature (C x 1 x 1) becomes one input to the decoder's dec_enc_att module, so there are a total of W*H inputs to dec_enc_att. As an analogy, each input is like a single word embedding in a text-to-text translation task, and there are W*H 'words' per image.

  3. Each spatial feature (C x 1 x 1) undergoes a linear transformation before being fed to the dec_enc_att module. The dec_enc_att module treats these W*H projected spatial features as both the Keys and the Values (as defined in the transformer architecture), and uses the decoder's self-attention outputs as the Queries.
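To make sure I'm describing the same thing, here is a minimal NumPy sketch of the three steps above. This is my hypothetical reading, not the repository's actual code; the dimensions (d_model, the number of decoder positions) and the projection matrix are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

C, H, W = 512, 7, 7   # step 1: CNN feature map shape, C x H x W
d_model = 256         # transformer model dimension (assumed)
T = 5                 # number of decoder positions, e.g. caption tokens (assumed)

# step 1: ResNet-style features for one image (random stand-in here)
feat = rng.standard_normal((C, H, W))

# step 2: flatten the grid into W*H "words", each a C-dim vector
tokens = feat.reshape(C, H * W).T              # (49, 512)

# step 3: linear projection of each spatial feature
W_proj = rng.standard_normal((C, d_model)) / np.sqrt(C)
enc = tokens @ W_proj                          # (49, d_model) -> Keys and Values

# queries come from the decoder's self-attention outputs (random stand-in)
queries = rng.standard_normal((T, d_model))

# scaled dot-product dec_enc attention with K = V = projected spatial features
scores = queries @ enc.T / np.sqrt(d_model)    # (T, 49)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)       # softmax over the 49 image tokens
out = attn @ enc                               # (T, d_model) attended image context

print(tokens.shape, enc.shape, out.shape)
```

If this matches the intended data flow (each decoder step attending over 49 projected spatial features), then my understanding of steps 1-3 should be correct.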

Is the understanding above correct? If not, would you kindly point out any misunderstanding? Thanks @njchoma !
