
Question about clip.encode_text #217

Open
AntonotnaWang opened this issue Feb 15, 2022 · 7 comments

Comments

@AntonotnaWang

To whom it may concern,

I'm reading the code of CLIP. It is simple but wonderful. However, I found line 350 in clip/model.py, x = x[torch.arange(x.shape[0]), text.argmax(dim=-1)] @ self.text_projection, a little confusing. May I ask why this particular position (text.argmax(dim=-1)) is used out of the 77 positions of the tokenized text? Why not use the average over all positions? Thanks a lot.

P.S. My understanding of text.argmax(dim=-1): it indicates the position of the end of the input text.

Best

@jongwook
Collaborator

You're correct that the argmax operation takes the representation at the EOT position. There's nothing inherently wrong with taking the average along the sequence dimension, but taking the representation at the position of a special token (e.g. the CLS token in ViT and BERT) is empirically known to work better.
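For anyone weighing the two options, here is a minimal sketch (not the repository's code) contrasting EOT pooling with mean pooling. It assumes x is the transformer output of shape [batch, 77, width], text the tokenized prompts of shape [batch, 77] padded with zeros, and text_projection a [width, embed_dim] parameter, as in clip/model.py:

```python
import torch

def eot_pooling(x, text, text_projection):
    # CLIP's choice: take the feature at the [EOT] position of each sequence.
    # text.argmax works because [EOT] has the largest token id (49407).
    return x[torch.arange(x.shape[0]), text.argmax(dim=-1)] @ text_projection

def mean_pooling(x, text, text_projection):
    # The alternative asked about: average over the non-padding positions.
    mask = (text != 0).unsqueeze(-1).to(x.dtype)  # padding tokens are 0
    pooled = (x * mask).sum(dim=1) / mask.sum(dim=1)
    return pooled @ text_projection
```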

@Teoge

Teoge commented Dec 15, 2022

Hi, I also have a question. Does that make representations in other locations meaningless? Since they are not supervised by any loss, the network can output arbitrary values for these representations.

@powpos360

Hi, I also have a question. Does that make representations in other locations meaningless? Since they are not supervised by any loss, the network can output arbitrary values for these representations.

The other representations are still used: in each attention layer, the [EOT] token attends to every other position, so their features still contribute to the final [EOT] representation.

@ygfrancois

The position of the [EOT] token is different for texts of different lengths. Doesn't this confuse the learning of the position embedding?

@LikeGiver

LikeGiver commented Apr 25, 2023

@ygfrancois
The position of the [EOT] token is different for texts of different lengths. Doesn't this confuse the learning of the position embedding?
The [EOT] token id is 49407 here, which is the largest value in the tokenized_prompts (i.e. text), so text.argmax(dim=-1) can be used to find the position of [EOT].
[Screenshot: tokenized prompt tensor in which [EOT] appears as token id 49407]
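A quick way to check this, assuming the openai/CLIP package is installed:

```python
import clip

text = clip.tokenize(["a photo of a cat", "a much longer caption about a dog"])
print(text.argmax(dim=-1))  # per-row index of the [EOT] token (id 49407)
# 49407 is the largest id the tokenizer produces, so argmax lands on [EOT]
# no matter where it falls within the padded 77-token sequence.
```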

@stardusts-hj

@ygfrancois The position of the [EOT] token is different for texts of different lengths. Doesn't this confuse the learning of the position embedding? The [EOT] token id is 49407 here, which is the largest value in the tokenized_prompts (i.e. text), so text.argmax(dim=-1) can be used to find the position of [EOT]. [Screenshot: tokenized prompt tensor in which [EOT] appears as token id 49407]

Thank you @LikeGiver, I understand this point now. Argmax is used to locate the index (i_eot) of [EOT] in the tokenized prompts. Once we locate it, we take the [EOT] features, x[batch_index, i_eot], to represent the whole prompt.
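As a toy illustration of that indexing (the shapes and indices here are made up for clarity):

```python
import torch

x = torch.randn(4, 77, 512)          # transformer output: [batch, seq_len, width]
i_eot = torch.tensor([5, 9, 3, 12])  # per-sequence position of [EOT]
pooled = x[torch.arange(4), i_eot]   # -> [4, 512], one feature vector per prompt
```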

@ZHUYMGeo

I have the same question. In my prompts, the token 49407 marks the end of each prompt and does not carry the meaning of any word. For example: [49406, 518, 34606, 771, 4267, 7863, 6898, 518, 4960, 2445, 537, 791, 1025, 33811, 538, 26878, 49407, 0, 0, 0, 0, ..., 0]. I think I should use only the meaningful tokens, excluding 49406 and 49407. Would that work?
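If you want to experiment with that, here is a rough sketch of averaging only the content tokens. This is not what CLIP does; 49406/49407 are the [SOT]/[EOT] ids and 0 is padding in CLIP's tokenizer, and whether it works better is an empirical question:

```python
import torch

def content_mean_pooling(x, text, text_projection):
    # x: [batch, 77, width] transformer output; text: [batch, 77] token ids
    mask = (text != 0) & (text != 49406) & (text != 49407)  # keep content tokens only
    mask = mask.unsqueeze(-1).to(x.dtype)
    pooled = (x * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return pooled @ text_projection
```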
