This is a question about an interesting result reported in the paper:

> We also evaluated on a model initialized with the CLIP image encoder with the same setup and hyperparameters, but observed worse performance than using the ViT initialization.

It seems surprising that the CLIP image encoder, which is already well aligned with the text encoder, is not helpful for this task. Do the authors have any guesses about the reason? Also, was the performance much worse, or only a little worse?
Happy to share some thoughts here. The performance was a bit worse; I didn't try many configurations or tune the hyperparameters, so it might improve with tuning. To some extent, I am not surprised by this result. CLIP is trained primarily for image-level classification. In LSeg, as mentioned in the paper, we only take the pre-trained text encoder and keep it fixed during training; the visual encoder is trained for better localization ability. Segmentation is primarily about pixel-level prediction and localization, which is quite different from what CLIP is optimized for. Of course, our finding is limited, and we would be happy to see more results on this.
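For readers unfamiliar with the setup being discussed, here is a minimal PyTorch sketch of the idea: the pre-trained CLIP text encoder is loaded and frozen, while a dense visual encoder is trained to produce per-pixel embeddings that are matched against the text embeddings. This is not the authors' actual code (LSeg uses a DPT-based backbone); `DenseVisualEncoder` below is a hypothetical stand-in, and the hyperparameters are illustrative only.

```python
import torch
import torch.nn as nn
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

# Load pre-trained CLIP and freeze it; only its text encoder is used.
clip_model, _ = clip.load("ViT-B/32", device="cpu")
for p in clip_model.parameters():
    p.requires_grad = False
clip_model.eval()

# Hypothetical dense visual encoder (stand-in for LSeg's real backbone):
# maps an image to per-pixel embeddings in the CLIP text embedding space.
class DenseVisualEncoder(nn.Module):
    def __init__(self, embed_dim=512):
        super().__init__()
        self.backbone = nn.Conv2d(3, embed_dim, kernel_size=3, padding=1)

    def forward(self, x):            # x: (B, 3, H, W)
        return self.backbone(x)      # (B, C, H, W) pixel embeddings

visual_encoder = DenseVisualEncoder()  # the only trained component

def pixel_logits(image, class_names):
    """Per-pixel similarity between pixel embeddings and class text embeddings."""
    tokens = clip.tokenize(class_names)                   # (K, 77)
    with torch.no_grad():                                 # text encoder stays frozen
        text = clip_model.encode_text(tokens).float()     # (K, C)
    pix = visual_encoder(image)                           # (B, C, H, W)
    pix = pix / pix.norm(dim=1, keepdim=True)
    text = text / text.norm(dim=1, keepdim=True)
    return torch.einsum("bchw,kc->bkhw", pix, text)       # (B, K, H, W)

# Only the visual encoder's parameters are passed to the optimizer,
# so the text encoder receives no gradient updates.
optimizer = torch.optim.Adam(visual_encoder.parameters(), lr=1e-4)
```

The key point this illustrates is that the text side is kept fixed, so all of the training signal goes into making the visual encoder's pixel embeddings land in the right place relative to the frozen text embeddings.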