This is a question about an interesting result reported in the paper:

> We also evaluated on a model initialized with the CLIP image encoder with the same setup and hyperparameters, but observed worse performance than using the ViT initialization.

It seems surprising that the CLIP image encoder, which is already well aligned with the text encoder, is not helpful for this task. Do the authors have any guesses about the reason? Also, was the performance much worse, or only a little worse?
Happy to share some thoughts here. The performance was a bit worse; I didn't try many configurations or tune the hyperparameters, so it might improve with tuning. To some extent, I am not surprised by this result. CLIP is trained primarily for image-level classification. In LSeg, as mentioned in the paper, we only take the pre-trained text encoder and keep it fixed during training; the visual encoder is trained for better localization ability. Segmentation is primarily about pixel-level prediction and localization, which is quite different from what CLIP is optimized for. Of course, our finding is limited, and we would be happy to see more results on this.
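For readers unfamiliar with the setup being discussed, here is a minimal PyTorch sketch of the idea: the pre-trained CLIP text encoder is loaded and frozen, while a dense visual encoder is trained to produce per-pixel embeddings that are matched against the text embeddings. This is not the authors' actual code (LSeg uses a DPT-based backbone); `DenseVisualEncoder` below is a hypothetical stand-in, and the hyperparameters are illustrative only.

```python
import torch
import torch.nn as nn
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

# Load pre-trained CLIP and freeze it; only its text encoder is used.
clip_model, _ = clip.load("ViT-B/32", device="cpu")
for p in clip_model.parameters():
    p.requires_grad = False
clip_model.eval()

# Hypothetical dense visual encoder (stand-in for LSeg's real backbone):
# maps an image to per-pixel embeddings in the CLIP text embedding space.
class DenseVisualEncoder(nn.Module):
    def __init__(self, embed_dim=512):
        super().__init__()
        self.backbone = nn.Conv2d(3, embed_dim, kernel_size=3, padding=1)

    def forward(self, x):            # x: (B, 3, H, W)
        return self.backbone(x)      # (B, C, H, W) pixel embeddings

visual_encoder = DenseVisualEncoder()  # the only trained component

def pixel_logits(image, class_names):
    """Per-pixel similarity between pixel embeddings and class text embeddings."""
    tokens = clip.tokenize(class_names)                   # (K, 77)
    with torch.no_grad():                                 # text encoder stays frozen
        text = clip_model.encode_text(tokens).float()     # (K, C)
    pix = visual_encoder(image)                           # (B, C, H, W)
    pix = pix / pix.norm(dim=1, keepdim=True)
    text = text / text.norm(dim=1, keepdim=True)
    return torch.einsum("bchw,kc->bkhw", pix, text)       # (B, K, H, W)

# Only the visual encoder's parameters are passed to the optimizer,
# so the text encoder receives no gradient updates.
optimizer = torch.optim.Adam(visual_encoder.parameters(), lr=1e-4)
```

The key point this illustrates is that the text side is kept fixed, so all of the training signal goes into making the visual encoder's pixel embeddings land in the right place relative to the frozen text embeddings.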