Reason on bad results of CLIP-based initialization of image encoder #16

Closed
soskek opened this issue Mar 28, 2022 · 3 comments
Comments

@soskek

soskek commented Mar 28, 2022

This is a question about an interesting result reported in the paper.
The paper states:

We also evaluated on a model initialized with the CLIP image encoder with the same setup and hyperparameters, but observed worse performance than using the ViT initialization.

It seems surprising that the CLIP image encoder, which is already well aligned with the text encoder, does not help on this task. Do the authors have any guesses about the reason? And was the performance much worse or only a little worse?

@Boyiliee
Collaborator

Hi @soskek ,

Thanks for your interest in LSeg!

Happy to share some thoughts here. The performance is a bit worse, but I didn't try many settings or tune the hyperparameters, so it might improve with tuning. To some extent, I am not surprised by this result. CLIP primarily focuses on image classification, whereas in LSeg, as mentioned in the paper, we take only the pre-trained text encoder and keep it fixed during training. We train only the visual encoder, for better localization ability. Segmentation is primarily about pixel-level prediction and localization, which is totally different from what CLIP aims to do. Of course, our findings are limited, and we are happy to see more findings on this.
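
To make the image-level versus pixel-level distinction concrete, here is a toy sketch. It is not LSeg's or CLIP's actual code: the 2-D embeddings are made up by hand, mean pooling is only a stand-in for however an image encoder produces a single image embedding, and all function names are hypothetical. It shows that matching one pooled embedding against fixed text embeddings gives a single label per image, while matching each pixel embedding gives a label map.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def best_label(embedding, text_embeddings):
    """Return the label whose text embedding is most similar to `embedding`."""
    e = normalize(embedding)
    return max(text_embeddings, key=lambda label: dot(e, text_embeddings[label]))


# Assumption: hand-made vectors standing in for the output of a frozen text encoder.
text_embeddings = {
    "dog": normalize([1.0, 0.0]),
    "grass": normalize([0.0, 1.0]),
}

# Assumption: hand-made per-pixel embeddings for a 2x3 image (rows x columns).
pixel_embeddings = [
    [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]],
    [[0.8, 0.2], [0.1, 0.9], [0.0, 1.0]],
]


def image_level_label(pixels, text_embeddings):
    """Classification-style matching: pool all pixels, then pick one label."""
    flat = [p for row in pixels for p in row]
    dim = len(flat[0])
    pooled = [sum(p[i] for p in flat) / len(flat) for i in range(dim)]
    return best_label(pooled, text_embeddings)


def pixel_level_labels(pixels, text_embeddings):
    """Segmentation-style matching: pick a label independently for every pixel."""
    return [[best_label(p, text_embeddings) for p in row] for row in pixels]


print(image_level_label(pixel_embeddings, text_embeddings))
for row in pixel_level_labels(pixel_embeddings, text_embeddings):
    print(row)
```

In this toy input the pooled embedding is dominated by the majority region, so the image-level result says nothing about where the smaller region is. That localization information is what the per-pixel matching keeps.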

Hope this helps.

Best,
Boyi

@soskek
Author

soskek commented Mar 29, 2022

Thank you for your fast reply!
I understand, and I agree that CLIP's focus on classification could be harmful to pixel-level tasks.

Your careful comments are very helpful.
Thank you again!

Best,
Sosuke

@Boyiliee
Collaborator

Happy to hear that!

Good luck with your research!
