How to obtain logits (and probabilities) for 0-shot classification of single classes #193
Comments
With CLIP zero-shot classification, AFAIK you can only ask for the probability that it is a dog relative to other targets - you cannot do so in isolation. |
Thanks. It's a pity as it would open quite a few use cases such as reliably estimating whether a certain concept is present in an image or not. |
What about computing the cosine similarity between your images and a dog vector (which can be computed using the text encoder, either from a prompt or an average of multiple prompts, e.g. with different breeds), this will give you a score for each image that you can use to rank them, basically like the image retrieval setup. Or I suppose you necessarily need probabilities? |
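The ranking idea above can be sketched as follows. Random vectors stand in for the actual CLIP image and text embeddings (which would normally come from the encoders), and the averaging mirrors the "multiple prompts, e.g. different breeds" suggestion:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for CLIP embeddings (hypothetical 512-d vectors); in practice
# these come from the CLIP image and text encoders.
image_features = rng.normal(size=(8, 512))
dog_prompts = rng.normal(size=(3, 512))  # e.g. embeddings of several breed prompts

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

image_features = normalize(image_features)
# Average the prompt embeddings into a single "dog" vector, then re-normalize.
dog_vector = normalize(normalize(dog_prompts).mean(axis=0))

# Cosine similarity of each image to the dog vector, usable for ranking.
scores = image_features @ dog_vector
ranking = np.argsort(-scores)  # best match first
```

This gives a per-image score for retrieval-style ranking, without any claim about calibrated probabilities.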
Hi @mehdidc . Yes unfortunately I would need the probabilities to answer questions such as "Is there a dog in the image?", with a certain confidence. I imagine it to be very useful for image retrieval too, as you wouldn't want to return any images from a text query if no dogs are present in any of the images that you search through, while in a cosine similarity + ranking approach you would supposedly still return the top k instances. Instead I would like to return all instances in which I am confident a dog is present and return no instances at all if no dogs are present. |
You can automatically find the right similarity threshold if you have a few annotated labels. Often a threshold around 0.25 works.
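A minimal sketch of tuning such a threshold from a handful of annotated labels. The scores and labels here are synthetic, and maximizing F1 is just one possible selection criterion:

```python
import numpy as np

def best_threshold(scores, labels):
    """Pick the similarity threshold that maximizes F1 on a few labeled examples."""
    best_t, best_f1 = 0.0, -1.0
    for t in np.unique(scores):
        pred = scores >= t
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t

# Hypothetical cosine similarities and binary labels for a few images.
scores = np.array([0.18, 0.22, 0.27, 0.31, 0.24, 0.29])
labels = np.array([0, 0, 1, 1, 0, 1])
print(best_threshold(scores, labels))  # 0.27 on this toy data
```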
|
Thanks @rom1504 for the suggestions. Unfortunately I don't have the luxury of annotated labels (even just a few) for every new, open-ended user query and would require a measure of confidence rather than returning instances above a heuristic-based threshold. I realise I am likely asking for a bit too much here! |
If you could say more about the final task/app you are trying to solve/build, it could help to figure out a good solution.
|
Thanks. I have a few tasks which would benefit from this, but maybe the most obvious one would be the following: Given a visual representation of an image dataset (e.g. 2D scatter plot after dimensionality reduction on top of the CLIP embeddings), I'd like to colour code each of the instances in the dataset according to the presence of the text in the query (e.g. "dog"). Ideally this would go from 0 (no dog at all present in an instance) to 1 (definitely a dog present in an instance). While this is typically close to binary for simple concepts, this can be more nuanced for most other queries (e.g. "daytime image" -> probability of 0.5 at dusk/dawn?), ideally using probabilities. So in essence this is a 0-shot (because any user-defined text query is allowed) binary (concept present vs. concept not present in an image) classification problem. Any tips? Thanks again! |
This is the solution you would like to apply. Do you have a specific use case for this, or do you only want to solve it from a research point of view and not apply it? |
The motivation is to apply it and allow users to feed in their own data and query those to obtain text-driven insights. So I'd like to explore to what extent the research problem can be solved in a nice, principled way, or whether I'll have to resort to a more heuristic-driven approach, and then to what extent the solution can be applied to generic datasets. |
I don't see why these tasks require a 'single class'... the prompts for generating the zero-shot classifiers accept human language, so figure out the prompts required to achieve the goal (i.e. you can say 'picture/scene/image/rendering/painting without a dog'). If you want p(dog), set up two sets of prompts that will achieve the aim: create a set of prompts for images/depictions containing a dog, and another set for scenes that are opposites and do not contain a dog... a daytime / nighttime split is even easier to imagine. |
Also, re softmax: just because its outputs sum to 1 doesn't mean they can be directly interpreted as probabilities, as is so often done; the calibration is typically quite poor. At best you can hope to have reasonable confidence that the object of interest is or isn't in the scene (in which case either of rom's or mehdi's ideas can work well with a bit of fiddling for the thresholding). To say you want a probability and assume that 0.8 from a softmax output actually means an 80% chance there's an x in the scene, well, that's unlikely to be the case. |
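A small numeric illustration of the calibration point: with CLIP's learned logit scale of roughly 100, even a tiny cosine-similarity gap between two prompts produces a near-saturated softmax. The similarity values here are made up for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical cosine similarities of one image to two competing prompts.
sims = np.array([0.26, 0.23])

# With a logit scale of ~100, a 0.03 similarity gap turns into a
# near-certain "probability" -- a sign of saturation, not of 95% true confidence.
probs = softmax(100.0 * sims)
print(probs)  # approximately [0.95, 0.05]
```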
Thanks for the helpful comments @rwightman . As mentioned before I definitely recognise it's a binary classification problem, which ideally (and largely to improve user experience) would be reduced to checking whether a certain concept is present in the images or not. As you note this could be done by also constructing the "opposite prompt". Tbh I've had quite mixed results trying this. E.g.

```python
text = ['a picture with a dog', 'a picture without a dog']
text = ['an image with a dog', 'an image without a dog']
text = ['a picture which contains a dog', 'a picture which does not contain a dog']
```

already returns surprisingly different 0-shot results on fairly simple STL10 instances. So it looks like I need to dig more into prompt engineering, which combined with thresholding should give me a decent baseline to start off with. Thanks all for the help and the super useful repo! |
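One way to soften that prompt sensitivity is to ensemble over several (positive, negative) prompt pairs and average the resulting binary softmax outputs, rather than trusting any single phrasing. A sketch with random stand-ins for the CLIP text and image embeddings:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Stand-ins for the text embeddings of each (positive, negative) prompt pair;
# in practice these come from the CLIP text encoder for phrasings like the ones above.
prompt_pairs = rng.normal(size=(3, 2, 64))
prompt_pairs /= np.linalg.norm(prompt_pairs, axis=-1, keepdims=True)

image = rng.normal(size=64)
image /= np.linalg.norm(image)

# Binary softmax per prompt-pair variant, then average across variants.
probs = softmax(100.0 * prompt_pairs @ image)  # shape (3, 2)
p_dog = probs[:, 0].mean()
```

With real embeddings, the spread of `probs[:, 0]` across variants also gives a rough sense of how prompt-sensitive a given query is.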
I think depending on the distribution of the images you expect, you could try to construct a set of non-dog classes explicitly, each with its own prompt(s), because I don't think you would find a lot of examples in the training set of CLIP models with a caption like "not containing an object x". As a starting point, you might experiment with the imagenet1k classes (https://github.com/mlfoundations/open_clip/blob/main/src/training/imagenet_zeroshot_data.py); it has a lot of dog breeds, other animals, and different daily objects, and you can e.g. accumulate the score of the breeds. |
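The breed-accumulation idea could look like this. The class names and logits below are hypothetical, not real model outputs:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical logits of one image against a class list that mixes
# dog breeds with other classes (as in the imagenet1k prompt set).
classes = ["beagle", "husky", "poodle", "tabby cat", "sports car", "pizza"]
dog_classes = {"beagle", "husky", "poodle"}
logits = np.array([2.1, 1.7, 1.2, 0.4, -0.3, -1.0])

probs = softmax(logits)
# Accumulate probability mass over the breed classes to get a p("dog").
dog_idx = [i for i, c in enumerate(classes) if c in dog_classes]
p_dog = probs[dog_idx].sum()
print(round(float(p_dog), 3))  # about 0.867 for these toy logits
```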
So in general, basically maintain a "library" of positive and negative classes, each time there is false positive or a false negative you update either of them with a new class/concept, but I don't know how easy it is to maintain |
@mehdidc as you mentioned, |
Hello @arnaudvl and @mehdidc, my understanding is that with the original OpenAI CLIP models, we could get a pretty good sense of this by just taking the cosine similarity between the image and the text embedding and clamping it to 0-1, since that's ultimately the metric their model is working with in the loss function. They compute cosine sim, clamp it, then scale it and take the softmax, meaning the original 0-1 score is pretty well calibrated. At least, it seems so in my experiments. The problem comes from the change in loss function in the open clip version, where the logit is directly the dot product instead of a scaled 0-1 similarity indicator. In my work I'm having trouble using the open clip version because the scaling of relevant similarities is less well behaved. I tried just naively passing it through a sigmoid, with no luck. @mehdidc do you have any recommendations? |
open_clip/src/open_clip/model.py Line 570 in 4762fae
|
Regarding the topic mentioned here, I am still missing information on the actual use case. Computing (zero-shot) binary classification with openclip may be possible, but the ranking and similarity properties of CLIP allow much more general problem definitions. |
Oh! It is indeed just cosine sim here; I missed the normalization when I was reviewing the loss function y'all use! It also looks like the original CLIP paper didn't clamp to 0-1 as I thought it did... I guess I'm just misremembering, that's so weird! Such a strong memory. So in that case the -1 to 1 range cosine sim should be a relatively well-behaved similarity score even without the other scores to provide context for a softmax normalization, yes? |
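If a 0-1 score is still wanted for display (e.g. colour coding), the cosine similarity can simply be rescaled linearly. Note this is a monotone remapping, not a calibrated probability:

```python
import numpy as np

def cosine_to_unit(sim):
    """Linearly map cosine similarities from [-1, 1] to [0, 1].

    This is only a rescaling for visualization; it does not make the
    score a calibrated probability of the concept being present.
    """
    return (np.asarray(sim, dtype=float) + 1.0) / 2.0

print(cosine_to_unit([-1.0, 0.0, 0.25, 1.0]))  # [0.    0.5   0.625 1.   ]
```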
Hi @rom1504 , apologies for the late reply. Since you asked about the use case: it is very similar to https://twitter.com/benmschmidt/status/1587847092306837509. They also use a CLIP search module. However, I find the quality of the search results often lacking. E.g. a simple query on https://atlas.nomic.ai/map/809ef16a-5b2d-4291-b772-a913f4c8ee61/9ed7d171-650b-4526-85bf-3592ee51ea31 such as |
First of all, thanks for the amazing work going into this repo!
In the case where we want to return the probability of the presence of 1 class (e.g. "dog") in a set of images, how would we go about it? While `(100.0 * image_features @ text_features.T).softmax(dim=-1)` provides well-calibrated probabilities in the multi-class setting, `(100.0 * image_features @ text_features.T).sigmoid()` does not when we return the logits of only 1 class and have no other classes to compute the softmax against. From `logits = np.dot(I_e, T_e.T) * np.exp(t)` in Figure 3 of the CLIP paper, it would have to follow that `t=4.6...` given `np.exp(t)=100` from the usage snippet in the README; is this correct? (Edit: indeed `model.logit_scale` confirms this.) And wouldn't this be surprisingly consistent across architectures/training runs? I believe the OpenAI implementation initialises the scale to `1/.07`, leading to an initial scaling factor of approximately `14.29`. This is then trained of course (link to code).

Alternatively, could I try sampling a random, normalised vector as `text_feature` for a non-existing "non-dog" class and apply `(100.0 * image_features @ text_features.T).softmax(dim=-1)` as in the multi-class setting?

Thanks
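For the record, a quick arithmetic check of the temperature values discussed in the issue (my own computation, not values read from a checkpoint):

```python
import numpy as np

# If exp(t) = 100 (the scale in the README usage snippet), then t = ln(100).
t = np.log(100.0)
print(round(float(t), 2))  # 4.61

# The OpenAI initialisation corresponds to a scale of 1/0.07.
init_scale = 1 / 0.07
print(round(init_scale, 2))  # 14.29
```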