Question about clip.encode_text #217
Comments
You're correct that the argmax operation takes the representation at the EOT position. There's nothing inherently wrong with taking the average along the sequence dimension, but taking the representation at the position of a special token (e.g. the CLS token in ViT and BERT) is empirically known to work better.
Hi, I also have a question. Does that make the representations at other locations meaningless? Since they are not supervised by any loss, the network could output arbitrary values for these representations.
The other representations are still used: in each attention layer, the [EOT] token attends to every other location.
The position of the [EOT] token differs for texts of different lengths. Doesn't this confuse the learning of the position embedding?
@ygfrancois |
Thank you @LikeGiver, I understand this point now. argmax is used to locate the index (i_eot) of [EOT] in the tokenized prompts. Once we locate it, we use the [EOT] features, x[batch_index, i_eot], to represent the features of the prompts.
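The mechanism described above can be sketched in a few lines. This is a minimal illustration (not the actual CLIP code), assuming dummy token sequences and random features; it works because 49407 ([EOT]) is the largest token id in CLIP's vocabulary, so argmax over the token dimension returns its position.

```python
import torch

# Hypothetical mini-batch of tokenized prompts (ids are illustrative).
# 49406 = [SOT], 49407 = [EOT], 0 = padding.
text = torch.tensor([
    [49406, 320, 1125, 49407, 0, 0],      # short prompt, [EOT] at index 3
    [49406, 320, 1125, 539, 320, 49407],  # longer prompt, [EOT] at index 5
])

# Because 49407 is the largest id, argmax finds the [EOT] position per row.
i_eot = text.argmax(dim=-1)
print(i_eot)  # tensor([3, 5])

# Dummy transformer outputs of shape (batch, seq_len, width).
x = torch.randn(2, 6, 512)

# Select the feature at the [EOT] position for each batch item, mirroring
# the indexing in clip/model.py before the projection is applied.
pooled = x[torch.arange(x.shape[0]), i_eot]
print(pooled.shape)  # torch.Size([2, 512])
```

Note that this assumption (EOT has the largest id) is specific to CLIP's tokenizer; with a different vocabulary one would locate the end token explicitly, e.g. `(text == eot_id).int().argmax(dim=-1)`.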
I have the same question. In my prompts, the number 49407 is the end token of each prompt, not a token carrying the meaning of the text. For example, [49406, 518, 34606, 771, 4267, 7863, 6898, 518, 4960, 2445, 537,
To whom it may concern,
I'm checking the code of CLIP. It is simple but wonderful. However, I found that Line 350 in clip/model.py:
x = x[torch.arange(x.shape[0]), text.argmax(dim=-1)] @ self.text_projection
is a little confusing to me. May I ask why this position (text.argmax(dim=-1)) is used out of the 77 positions of the tokenized text? Why not use the average of all positions? Thanks a lot.
P.S. My understanding of text.argmax(dim=-1): it indicates the location of the end of the input text.
Best
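For comparison, here is a hedged sketch of the two pooling strategies discussed in this thread, using dummy tensors (the token ids and feature width are illustrative, not from the actual model). Mean pooling has to mask out padding positions, which is one practical difference from EOT pooling.

```python
import torch

# Dummy transformer outputs (batch, seq_len, width) and tokenized prompts.
# 49407 = [EOT] (largest id in CLIP's vocab), 0 assumed to be padding.
x = torch.randn(2, 6, 512)
text = torch.tensor([[49406, 320, 1125, 49407, 0, 0],
                     [49406, 320, 1125, 539, 320, 49407]])

# CLIP's choice: take the feature at the [EOT] position.
eot_pooled = x[torch.arange(x.shape[0]), text.argmax(dim=-1)]

# The alternative raised in this issue: average over non-padding positions.
mask = (text != 0).unsqueeze(-1).float()          # (batch, seq_len, 1)
mean_pooled = (x * mask).sum(dim=1) / mask.sum(dim=1)

print(eot_pooled.shape, mean_pooled.shape)  # both torch.Size([2, 512])
```

As noted earlier in the thread, both are valid pooling schemes; CLIP's authors report that pooling at a special-token position tends to work better empirically than averaging.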