
How to obtain logits (and probabilities) for 0-shot classification of single classes #193

Closed
arnaudvl opened this issue Oct 19, 2022 · 21 comments


@arnaudvl commented Oct 19, 2022

First of all, thanks for the amazing work going into this repo!
In the case where we want to return the probability that a single class (e.g. "dog") is present in a set of images, how would we go about it? While (100.0 * image_features @ text_features.T).softmax(dim=-1) provides well-calibrated probabilities in the multi-class setting, (100.0 * image_features @ text_features.T).sigmoid() does not when we only have the logit of one class and no other classes to compute the softmax against.

From logits = np.dot(I_e, T_e.T) * np.exp(t) in Figure 3 of the CLIP paper, it would have to follow that t ≈ 4.6, given np.exp(t) = 100 from the usage snippet in the README. Is this correct? (Edit: indeed, model.logit_scale confirms this.) And wouldn't this be surprisingly consistent across architectures/training runs? I believe the OpenAI implementation initialises the temperature so that np.exp(t) = 1/0.07, i.e. an initial scaling factor of approximately 14.29, which is then trained of course (link to code).
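For concreteness, here is a minimal sketch of the two scoring options being compared (the model name, pretrained tag and image path are just placeholders, not specific recommendations):

```python
import torch
import open_clip
from PIL import Image

# Placeholder model/checkpoint and image; any open_clip model should behave similarly.
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='openai')
image = preprocess(Image.open('dog.jpg')).unsqueeze(0)
texts = open_clip.tokenize(['a photo of a dog', 'a photo of a cat'])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(texts)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    scale = model.logit_scale.exp()          # ~100 for released models, i.e. t = ln(100) ~ 4.6
    logits = scale * image_features @ text_features.T

    probs = logits.softmax(dim=-1)           # multi-class: only meaningful relative to the listed prompts
    p_dog_alone = logits[:, :1].sigmoid()    # single class: the scaled cosine saturates the sigmoid, not a usable probability
```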
Alternatively, could I try sampling a random, normalised vector as text_feature for a non-existing "non-dog" class and apply (100.0 * image_features @ text_features.T).softmax(dim=-1) as in the multi-class setting?
Thanks

@mitchellnw (Contributor)

With CLIP zero-shot classification, AFAIK you can only ask for the probability that it is a dog relative to other targets; you cannot do so in isolation.

@arnaudvl (Author)

Thanks. It's a pity, as it would open up quite a few use cases, such as reliably estimating whether a certain concept is present in an image or not.

@mehdidc (Contributor) commented Oct 20, 2022

What about computing the cosine similarity between your images and a dog vector (which can be computed with the text encoder, either from a single prompt or from an average of multiple prompts, e.g. with different breeds)? This would give you a score for each image that you can use to rank them, basically like the image retrieval setup. Or do you necessarily need probabilities?
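A rough sketch of that retrieval-style ranking (the prompt list is just an example; it assumes the model and a batch of normalized image_features as in the earlier snippet):

```python
import torch
import torch.nn.functional as F
import open_clip

# Example prompt ensemble for the "dog" concept; average the text embeddings into one vector.
dog_prompts = ['a photo of a dog', 'a photo of a labrador', 'a photo of a poodle']
with torch.no_grad():
    dog_feats = F.normalize(model.encode_text(open_clip.tokenize(dog_prompts)), dim=-1)
    dog_vec = F.normalize(dog_feats.mean(dim=0), dim=0)

scores = image_features @ dog_vec            # cosine similarity per image, in [-1, 1]
ranking = scores.argsort(descending=True)    # rank images by how "dog-like" they are, no probabilities involved
```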

@arnaudvl (Author) commented Oct 20, 2022

Hi @mehdidc. Yes, unfortunately I would need the probabilities to answer questions such as "Is there a dog in the image?" with a certain confidence. I imagine it would be very useful for image retrieval too: you wouldn't want to return any images for a text query if no dogs are present in any of the images you search through, whereas a cosine similarity + ranking approach would presumably still return the top k instances. Instead I would like to return all instances in which I am confident a dog is present, and no instances at all if no dogs are present.

@rom1504 (Collaborator) commented Oct 20, 2022 via email

@arnaudvl (Author)

Thanks @rom1504 for the suggestions. Unfortunately I don't have the luxury of annotated labels (even just a few) for every new, open-ended user query, and I would need a measure of confidence rather than returning instances above a heuristic-based threshold. I realise I am likely asking for a bit too much here!

@rom1504 (Collaborator) commented Oct 20, 2022 via email

@arnaudvl (Author)

Thanks. I have a few tasks which would benefit from this, but maybe the most obvious one would be the following:

Given a visual representation of an image dataset (e.g. a 2D scatter plot after dimensionality reduction on top of the CLIP embeddings), I'd like to colour-code each instance in the dataset according to the presence of the text in the query (e.g. "dog"). Ideally this would go from 0 (no dog present at all in an instance) to 1 (definitely a dog present in an instance). While this is typically close to binary for simple concepts, it can be more nuanced for many other queries (e.g. "daytime image" -> probability of 0.5 at dusk/dawn?), which is why I'd ideally use probabilities. So in essence this is a 0-shot (because any user-defined text query is allowed) binary (concept present vs. concept not present in an image) classification problem.
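The colour coding itself is trivial once some per-image score in [0, 1] is available; a sketch, where xy and p_concept are assumed to come from the dimensionality reduction and from whichever scoring scheme ends up being used:

```python
import matplotlib.pyplot as plt

# xy: [N, 2] array from e.g. UMAP/t-SNE on the CLIP image embeddings
# p_concept: [N] array of scores in [0, 1] for the user query (e.g. "dog")
plt.scatter(xy[:, 0], xy[:, 1], c=p_concept, cmap='viridis', vmin=0.0, vmax=1.0, s=5)
plt.colorbar(label='concept present (0 = no, 1 = yes)')
plt.show()
```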

Any tips? Thanks again!

@rom1504 (Collaborator) commented Oct 20, 2022

This is the solution you would like to apply.

Do you have a specific use case for this, or do you only want to solve it from a research point of view and not apply it?

@arnaudvl (Author)

The motivation is to apply it and allow users to feed in their own data and query it to obtain text-driven insights. So I'd like to explore to what extent the research problem can be solved in a nice, principled way, or whether I'll have to resort to a more heuristic-driven approach, and then to what extent the solution can be applied to generic datasets.

@rwightman (Collaborator)

I don't see why these tasks require a 'single class'... the prompts for generating the zero-shot classifiers accept human language, so figure out the prompts required to achieve the goal (i.e. you can say 'picture/scene/image/rendering/painting without a dog').

If you want p(dog), then set up two sets of prompts that will achieve the aim: create a set of prompts for images/depictions containing a dog, and another set for scenes that are opposites and do not contain a dog... a daytime / nighttime split is even easier to imagine.
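A sketch of this two-prompt-set approach (the prompt lists are illustrative only; model, open_clip and normalized image_features as in the earlier snippet):

```python
import torch
import torch.nn.functional as F
import open_clip

def classifier_weights(model, prompt_sets):
    """Encode each prompt set and average it into one normalized class embedding."""
    weights = []
    with torch.no_grad():
        for prompts in prompt_sets:
            feats = F.normalize(model.encode_text(open_clip.tokenize(prompts)), dim=-1)
            weights.append(F.normalize(feats.mean(dim=0), dim=0))
    return torch.stack(weights, dim=1)   # [D, num_classes]

dog_prompts = ['a photo of a dog', 'a picture containing a dog', 'a painting of a dog']
no_dog_prompts = ['a photo of a street', 'a picture of furniture', 'a landscape without any animals']

W = classifier_weights(model, [dog_prompts, no_dog_prompts])
logits = model.logit_scale.exp() * image_features @ W    # [N, 2]
p_dog = logits.softmax(dim=-1)[:, 0]                      # p(dog) relative to the chosen negative prompt set
```

Note that the softmax output is only meaningful relative to the chosen negative prompt set, which ties in with the calibration caveat below.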

@rwightman (Collaborator)

Also, re softmax: just because its outputs sum to 1 doesn't mean they can be directly interpreted as probabilities, as is so often done; the calibration is typically quite poor. At best you can hope to have reasonable confidence that there is or isn't the object of interest in the scene (in which case either rom's or mehdi's ideas can work well with a bit of fiddling for the thresholding). To say you want a probability and assume that 0.8 from a softmax output actually means an 80% probability there's an x in the scene, well, that's unlikely to be the case.

@arnaudvl (Author)

Thanks for the helpful comments @rwightman. As mentioned before, I definitely recognise it's a binary classification problem, which ideally (and largely to improve the user experience) would be reduced to checking whether a certain concept is present in the images or not. As you note, this could be done by also constructing the "opposite prompt". To be honest, I've had quite mixed results trying this. E.g.

text = ['a picture with a dog', 'a picture without a dog']
text = ['an image with a dog', 'an image without a dog']
text = ['a picture which contains a dog', 'a picture which does not contain a dog']

already return surprisingly different 0-shot results on fairly simple STL10 instances. So it looks like I need to dig more into prompt engineering, which, combined with thresholding, should give me a decent baseline to start off with.
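For reference, a small sketch of how the comparison above can be run (it assumes the model and a preprocessed image tensor as in the first snippet; the prompt pairs are the ones listed above):

```python
import torch
import open_clip

# Assumes `model` and a preprocessed `image` tensor as in the first snippet.
prompt_pairs = [
    ['a picture with a dog', 'a picture without a dog'],
    ['an image with a dog', 'an image without a dog'],
    ['a picture which contains a dog', 'a picture which does not contain a dog'],
]

with torch.no_grad():
    image_features = model.encode_image(image)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    for pair in prompt_pairs:
        text_features = model.encode_text(open_clip.tokenize(pair))
        text_features /= text_features.norm(dim=-1, keepdim=True)
        probs = (model.logit_scale.exp() * image_features @ text_features.T).softmax(dim=-1)
        print(pair, probs[0].tolist())   # the p(dog) column can differ noticeably between pairs
```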

Thanks all for the help and the super useful repo!

@mehdidc (Contributor) commented Oct 24, 2022

I think, depending on the distribution of images you expect, you could try to construct a set of non-dog classes explicitly, each with its own prompt(s), because I don't think you would find many examples in the training set of CLIP models with a caption like "not containing an object x". As a starting point, you might experiment with the ImageNet-1k classes (https://github.com/mlfoundations/open_clip/blob/main/src/training/imagenet_zeroshot_data.py): they cover a lot of dog breeds, other animals and various everyday objects, and you can e.g. accumulate the scores of the breeds.
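A sketch of the "accumulate the breed scores" idea using that class list (the import path assumes src/ is on the Python path, and the dog-breed index range 151-268 is the commonly cited one; both are worth double-checking):

```python
import torch
import torch.nn.functional as F
import open_clip
from training.imagenet_zeroshot_data import imagenet_classnames, openai_imagenet_template

DOG_CLASSES = slice(151, 269)   # commonly cited ImageNet-1k dog-breed range; verify against the class list

def build_imagenet_classifier(model):
    """One averaged, normalized text embedding per ImageNet-1k class."""
    weights = []
    with torch.no_grad():
        for name in imagenet_classnames:
            texts = open_clip.tokenize([template(name) for template in openai_imagenet_template])
            feats = F.normalize(model.encode_text(texts), dim=-1)
            weights.append(F.normalize(feats.mean(dim=0), dim=0))
    return torch.stack(weights, dim=1)   # [D, 1000]

# W = build_imagenet_classifier(model)
# probs = (model.logit_scale.exp() * image_features @ W).softmax(dim=-1)   # [N, 1000]
# p_dog = probs[:, DOG_CLASSES].sum(dim=-1)                                # probability mass over all dog breeds
```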

@mehdidc (Contributor) commented Oct 24, 2022

So in general, basically maintain a "library" of positive and negative classes; each time there is a false positive or a false negative, you update either of them with a new class/concept. I don't know how easy that is to maintain, though.

@arnaudvl (Author)

@mehdidc as you mentioned, "not containing an object x" is indeed where CLIP seems to struggle quite a bit in my experiments. Calibration in a classification setting is typically pretty reasonable if the instance contains one of the pre-defined classes or is distributed across multiple classes (e.g. a dog which kind of looks like a cat), but it worsens a lot when you try to specify what should not be in the instance.

@isaacrob

Hello @arnaudvl and @mehdidc, my understanding is that with the original OpenAI CLIP models we could get a pretty good sense of this by just taking the cosine similarity between the image and the text embedding and clamping it to 0-1, since that's ultimately the metric their model is working with in the loss function. They compute the cosine similarity, clamp it, then scale it and take the softmax, meaning the original 0-1 score is pretty well calibrated. At least, it seems so in my experiments. The problem comes from the change in loss function in the open_clip version, where the logit is directly the dot product instead of a scaled 0-1 similarity indicator. In my work I'm having trouble using the open_clip version because the scaling of relevant similarities is less well behaved. I tried just naively passing it through a sigmoid, with no luck. @mehdidc do you have any recommendations?

@rom1504 (Collaborator) commented Oct 26, 2022

image_features = F.normalize(image_features, dim=-1)

open_clip computes cosines the same way as CLIP (the embeddings are normalized).
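Roughly, the contrastive logit computation looks like the simplified sketch below (not a verbatim copy of the loss code; image_features, text_features and logit_scale stand in for the encoder outputs and the learned scale):

```python
import torch.nn.functional as F

# Both towers' outputs are L2-normalized, so the dot product below is a cosine similarity,
# scaled by the learned logit_scale (exp of the temperature parameter), as in the original CLIP.
image_features = F.normalize(image_features, dim=-1)
text_features = F.normalize(text_features, dim=-1)
logits_per_image = logit_scale * image_features @ text_features.T
logits_per_text = logits_per_image.T
```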

@rom1504 (Collaborator) commented Oct 26, 2022

Regarding the topic mentioned here, I am still missing information on the actual use case.

Computing (zero-shot) binary classification with open_clip may be possible, but the ranking and similarity properties of CLIP allow much more general problem definitions.

@isaacrob

Oh! It is indeed just the cosine similarity here; I missed the normalization when I was reviewing the loss function y'all use! It also looks like the original CLIP paper didn't clamp to 0-1 as I thought it did... I guess I am just misremembering, that's so weird! Such a strong memory...

So in that case, the -1 to 1 range cosine similarity should be a relatively well-behaved similarity score even without the other scores to provide context for a softmax normalization, yes?

@arnaudvl (Author) commented Nov 5, 2022

Hi @rom1504, apologies for the late reply. Since you asked about the use case: it is very similar to https://twitter.com/benmschmidt/status/1587847092306837509. They also use a CLIP search module. However, I find the quality of the search results often lacking. E.g. a simple query on https://atlas.nomic.ai/map/809ef16a-5b2d-4291-b772-a913f4c8ee61/9ed7d171-650b-4526-85bf-3592ee51ea31 such as "an image of a football team" gives lots of irrelevant instances in the top results. I don't know the exact implementation, but it seems like either the top N instances are returned by default, or anything above a certain similarity threshold is returned up to a maximum of N instances. Either way, it leads to many unwanted search results, especially if what you're searching for is not in the data at all. This could be avoided if we could do relatively reliable binary classification for [an image of a football team, an image not containing a football team] or similar. Unfortunately, I haven't had success with this setting, as CLIP seems to struggle with prompts denoting what should not be present in the instance (sometimes it works, sometimes it doesn't).
