Hi, I read the paper and checked the code to see how many image features are projected into the input of the text decoder.
Is it correct that only the [CLS] token feature from the Vision Transformer is projected into the text decoder's input?
If so, does the text decoder decode with just that one projected image token plus the text tokens?
GenerativeImage2Text/generativeimage2text/layers/CLIP/model.py
Line 270 in 777860e
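To make the question concrete, here is a minimal sketch of the two readings I have in mind. It is not taken from the repository: the function name `build_decoder_input`, the tensor shapes, and the random projection matrix `proj` are all assumptions chosen only for illustration, and NumPy stands in for the actual framework.

```python
import numpy as np

def build_decoder_input(vit_tokens, text_embeds, proj, cls_only):
    """Concatenate projected image features with text embeddings.

    vit_tokens:  (batch, 1 + num_patches, vit_dim); index 0 is the [CLS] token.
    text_embeds: (batch, text_len, dec_dim)
    proj:        (vit_dim, dec_dim), a hypothetical linear projection.
    """
    if cls_only:
        # Reading 1: keep only the [CLS] feature, shape (batch, 1, vit_dim).
        image_feats = vit_tokens[:, :1, :]
    else:
        # Reading 2: keep [CLS] plus every patch token.
        image_feats = vit_tokens
    projected = image_feats @ proj  # (batch, n_img, dec_dim)
    # Image tokens come first, followed by the text tokens.
    return np.concatenate([projected, text_embeds], axis=1)

# Assumed sizes: 196 patches, ViT width 768, decoder width 512, 10 text tokens.
rng = np.random.default_rng(0)
vit_tokens = rng.normal(size=(2, 1 + 196, 768))
text_embeds = rng.normal(size=(2, 10, 512))
proj = rng.normal(size=(768, 512))

a = build_decoder_input(vit_tokens, text_embeds, proj, cls_only=True)
b = build_decoder_input(vit_tokens, text_embeds, proj, cls_only=False)
print(a.shape)  # (2, 11, 512): one image token + 10 text tokens
print(b.shape)  # (2, 207, 512): 197 image tokens + 10 text tokens
```

My question is which of these two cases matches what the code at the line referenced above actually does.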
Thanks for your nice research!