Question about the number of image features (embeddings) projected into the input to the text decoder #27

Closed
dlwogns0128 opened this issue Nov 17, 2022 · 1 comment

Comments

@dlwogns0128

Hi, I read the paper and checked the code to see how many image features are projected into the input of the text decoder.
Is it correct that only the [CLS] token feature from the Vision Transformer is projected into the text decoder's input?
If so, does the text decoder then decode with just that one projected image token plus the text tokens?

Thanks for your nice research!

@amsword
Contributor

amsword commented Nov 22, 2022

No. output_grid will be true in this case.
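
For readers following along, here is a minimal sketch of what that difference could look like in terms of sequence shapes. It is not the repository's code: `encode_image`, `build_decoder_input`, and all sizes are hypothetical, and it assumes that with output_grid enabled the encoder returns every token (the [CLS] token plus all patch tokens) rather than the [CLS] feature alone.

```python
import numpy as np

def encode_image(tokens, output_grid):
    """Toy stand-in for a ViT output head (hypothetical, not the repository's code).

    tokens: array of shape (1 + num_patches, dim); row 0 plays the [CLS] token.
    """
    if output_grid:
        # Assumption: keep every token so the decoder sees the full feature grid.
        return tokens
    # Otherwise keep only the [CLS] feature, as a one-row sequence.
    return tokens[:1]

def build_decoder_input(image_tokens, text_embeddings, projection):
    """Project image features to the decoder width and prepend them to the text."""
    projected = image_tokens @ projection
    return np.concatenate([projected, text_embeddings], axis=0)

rng = np.random.default_rng(0)
vit_dim, dec_dim = 8, 4        # illustrative widths only
num_patches, num_text = 9, 5   # e.g. a 3x3 patch grid and 5 text tokens

vit_tokens = rng.normal(size=(1 + num_patches, vit_dim))
projection = rng.normal(size=(vit_dim, dec_dim))
text_embeddings = rng.normal(size=(num_text, dec_dim))

grid_input = build_decoder_input(encode_image(vit_tokens, True), text_embeddings, projection)
cls_input = build_decoder_input(encode_image(vit_tokens, False), text_embeddings, projection)

print(grid_input.shape)  # (15, 4): 10 image tokens + 5 text tokens
print(cls_input.shape)   # (6, 4): 1 image token + 5 text tokens
```

Under these assumptions, the decoder attends over many image tokens when output_grid is true, not a single projected [CLS] token.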

@amsword amsword closed this as completed Nov 22, 2022