
Question about clip.encode_text #217

Open
AntonotnaWang opened this issue Feb 15, 2022 · 7 comments

Comments

@AntonotnaWang

To whom it may concern,

I'm reading the code of CLIP. It is simple but wonderful. However, I found line 350 in clip/model.py, x = x[torch.arange(x.shape[0]), text.argmax(dim=-1)] @ self.text_projection, a little confusing. May I ask why this particular position (text.argmax(dim=-1)) is used out of the 77 positions of the tokenized text? Why not use the average over all positions? Thanks a lot.

P.S. My understanding of text.argmax(dim=-1): it indicates the position of the end of the input text.

Best

@jongwook
Collaborator

You're correct that the argmax operation takes the representation at the EOT position. There's nothing inherently wrong with taking the average along the sequence dimension, but taking the representation at the position of a special token (e.g. the CLS token in ViT and BERT) is empirically known to work better.
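For anyone weighing the two options, here is a minimal sketch (not the repository's code) contrasting EOT pooling with mean pooling. It assumes x is the transformer output of shape [batch, 77, width], text the tokenized prompts of shape [batch, 77] padded with zeros, and text_projection a [width, embed_dim] parameter, as in clip/model.py:

```python
import torch

def eot_pooling(x, text, text_projection):
    # CLIP's choice: take the feature at the [EOT] position of each sequence.
    # text.argmax works because [EOT] has the largest token id (49407).
    return x[torch.arange(x.shape[0]), text.argmax(dim=-1)] @ text_projection

def mean_pooling(x, text, text_projection):
    # The alternative asked about: average over the non-padding positions.
    mask = (text != 0).unsqueeze(-1).to(x.dtype)  # padding tokens are 0
    pooled = (x * mask).sum(dim=1) / mask.sum(dim=1)
    return pooled @ text_projection
```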

@Teoge

Teoge commented Dec 15, 2022

Hi, I also have a question. Does that make representations in other locations meaningless? Since they are not supervised by any loss, the network can output arbitrary values for these representations.

@powpos360

Hi, I also have a question. Does that make representations in other locations meaningless? Since they are not supervised by any loss, the network can output arbitrary values for these representations.

The other representations are still used: in each attention layer, the [EOT] token attends to every other position, so their features still contribute to the final [EOT] representation.

@ygfrancois

The position of the [EOT] token is different for texts of different lengths. Doesn't this confuse the learning of the position embedding?

@LikeGiver

LikeGiver commented Apr 25, 2023

@ygfrancois
The position of the [EOT] token is different for texts of different lengths. Doesn't this confuse the learning of the position embedding?
The [EOT] token id is 49407 here, which is the largest value in the tokenized_prompts (i.e. text), so text.argmax(dim=-1) can be used to find the position of [EOT].
[Screenshot: tokenized prompt tensor in which [EOT] appears as token id 49407]
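A quick way to check this, assuming the openai/CLIP package is installed:

```python
import clip

text = clip.tokenize(["a photo of a cat", "a much longer caption about a dog"])
print(text.argmax(dim=-1))  # per-row index of the [EOT] token (id 49407)
# 49407 is the largest id the tokenizer produces, so argmax lands on [EOT]
# no matter where it falls within the padded 77-token sequence.
```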

@stardusts-hj

@ygfrancois The position of the [EOT] token is different for texts of different lengths. Doesn't this confuse the learning of the position embedding? The [EOT] token id is 49407 here, which is the largest value in the tokenized_prompts (i.e. text), so text.argmax(dim=-1) can be used to find the position of [EOT]. [Screenshot: tokenized prompt tensor in which [EOT] appears as token id 49407]

Thank you @LikeGiver, I understand this point now. Argmax is used to locate the index (i_eot) of [EOT] in the tokenized prompts. Once we locate it, we take the [EOT] features, x[batch_index, i_eot], to represent the whole prompt.
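As a toy illustration of that indexing (the shapes and indices here are made up for clarity):

```python
import torch

x = torch.randn(4, 77, 512)          # transformer output: [batch, seq_len, width]
i_eot = torch.tensor([5, 9, 3, 12])  # per-sequence position of [EOT]
pooled = x[torch.arange(4), i_eot]   # -> [4, 512], one feature vector per prompt
```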

@ZHUYMGeo

I have the same question. In my prompts, the token 49407 marks the end of each prompt and does not carry the meaning of any word. For example: [49406, 518, 34606, 771, 4267, 7863, 6898, 518, 4960, 2445, 537, 791, 1025, 33811, 538, 26878, 49407, 0, 0, 0, 0, ..., 0]. I think I should use only the meaningful tokens, excluding 49406 and 49407. Would that work?
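If you want to experiment with that, here is a rough sketch of averaging only the content tokens. This is not what CLIP does; 49406/49407 are the [SOT]/[EOT] ids and 0 is padding in CLIP's tokenizer, and whether it works better is an empirical question:

```python
import torch

def content_mean_pooling(x, text, text_projection):
    # x: [batch, 77, width] transformer output; text: [batch, 77] token ids
    mask = (text != 0) & (text != 49406) & (text != 49407)  # keep content tokens only
    mask = mask.unsqueeze(-1).to(x.dtype)
    pooled = (x * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return pooled @ text_projection
```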
