Question about the number of image features (embeddings) projected into the input to the text decoder. #27

dlwogns0128 · 2022-11-17T13:11:29Z

Hi, I read the paper and checked the code to find out how many image features are projected into the input of the text decoder.
Is it correct that only the [CLS] token feature from the Vision Transformer is projected into the input of the text decoder?
If so, does the text decoder decode with just that one projected image token plus the text tokens?

GenerativeImage2Text/generativeimage2text/layers/CLIP/model.py

Line 270 in 777860e

x = self.ln_post(x[:, 0, :])

Thanks for your nice research!

amsword · 2022-11-22T01:03:30Z

No. output_grid is True in this case, so the [CLS]-only line you quoted is not the path taken:

GenerativeImage2Text/generativeimage2text/layers/CLIP/model.py

Line 263 in 777860e

if self.output_grid:
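
To make the difference between the two paths concrete, here is a minimal sketch in plain Python. It is not the repository's code: the function name select_image_tokens and the toy input are hypothetical, and it assumes that grid mode keeps the whole token sequence (including position 0), while the other path keeps only the [CLS] feature, as in x[:, 0, :].

```python
# Hypothetical illustration only; names and shapes are assumptions,
# not taken from GenerativeImage2Text.
def select_image_tokens(tokens, output_grid):
    """Pick which encoder outputs would be passed on to the text decoder.

    tokens: one image's encoder output as a list of feature vectors,
            with the [CLS] feature assumed to sit at index 0.
    """
    if output_grid:
        # Grid mode: keep the whole sequence (assumed to include [CLS]).
        return list(tokens)
    # Pooled mode: keep only the [CLS] feature, like x[:, 0, :].
    return [tokens[0]]


# Toy input: 1 [CLS] vector + 4 patch vectors, feature width 3.
tokens = [[float(i)] * 3 for i in range(5)]

grid = select_image_tokens(tokens, output_grid=True)
pooled = select_image_tokens(tokens, output_grid=False)

# Number of image tokens the decoder would see in each mode.
print(len(grid), len(pooled))
```

Under these assumptions, the decoder sees one image token per position in grid mode rather than a single pooled token.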

amsword closed this as completed Nov 22, 2022
