
How to obtain logits (and probabilities) for 0-shot classification of single classes #193

Closed
arnaudvl opened this issue Oct 19, 2022 · 21 comments


@arnaudvl commented Oct 19, 2022

First of all, thanks for the amazing work going into this repo!
In the case where we want to return the probability that a single class (e.g. "dog") is present in a set of images, how would we go about it? While (100.0 * image_features @ text_features.T).softmax(dim=-1) provides well-calibrated probabilities in the multi-class setting, (100.0 * image_features @ text_features.T).sigmoid() does not when we only have the logit of one class and no other classes to compute the softmax against.

From logits = np.dot(I_e, T_e.T) * np.exp(t) in Figure 3 of the CLIP paper, it would have to follow that t ≈ 4.6, given np.exp(t) = 100 from the usage snippet in the README. Is this correct? (Edit: indeed, model.logit_scale confirms this.) And wouldn't this be surprisingly consistent across architectures/training runs? I believe the OpenAI implementation initialises the temperature so that np.exp(t) = 1/0.07, i.e. an initial scaling factor of approximately 14.29, which is then trained of course (link to code).
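For concreteness, here is a minimal sketch of the two scoring options being compared (the model name, pretrained tag and image path are just placeholders, not specific recommendations):

```python
import torch
import open_clip
from PIL import Image

# Placeholder model/checkpoint and image; any open_clip model should behave similarly.
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='openai')
image = preprocess(Image.open('dog.jpg')).unsqueeze(0)
texts = open_clip.tokenize(['a photo of a dog', 'a photo of a cat'])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(texts)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    scale = model.logit_scale.exp()          # ~100 for released models, i.e. t = ln(100) ~ 4.6
    logits = scale * image_features @ text_features.T

    probs = logits.softmax(dim=-1)           # multi-class: only meaningful relative to the listed prompts
    p_dog_alone = logits[:, :1].sigmoid()    # single class: the scaled cosine saturates the sigmoid, not a usable probability
```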
Alternatively, could I try sampling a random, normalised vector as text_feature for a non-existing "non-dog" class and apply (100.0 * image_features @ text_features.T).softmax(dim=-1) as in the multi-class setting?
Thanks

@mitchellnw (Contributor)

With CLIP zero-shot classification, AFAIK you can only ask for the probability that it is a dog relative to other targets; you cannot do so in isolation.

@arnaudvl (Author)

Thanks. It's a pity, as it would open up quite a few use cases, such as reliably estimating whether a certain concept is present in an image or not.

@mehdidc (Contributor) commented Oct 20, 2022

What about computing the cosine similarity between your images and a dog vector (which can be computed with the text encoder, either from a single prompt or from an average of multiple prompts, e.g. with different breeds)? This would give you a score for each image that you can use to rank them, basically like the image retrieval setup. Or do you necessarily need probabilities?
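A rough sketch of that retrieval-style ranking (the prompt list is just an example; it assumes the model and a batch of normalized image_features as in the earlier snippet):

```python
import torch
import torch.nn.functional as F
import open_clip

# Example prompt ensemble for the "dog" concept; average the text embeddings into one vector.
dog_prompts = ['a photo of a dog', 'a photo of a labrador', 'a photo of a poodle']
with torch.no_grad():
    dog_feats = F.normalize(model.encode_text(open_clip.tokenize(dog_prompts)), dim=-1)
    dog_vec = F.normalize(dog_feats.mean(dim=0), dim=0)

scores = image_features @ dog_vec            # cosine similarity per image, in [-1, 1]
ranking = scores.argsort(descending=True)    # rank images by how "dog-like" they are, no probabilities involved
```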

@arnaudvl (Author) commented Oct 20, 2022

Hi @mehdidc. Yes, unfortunately I would need the probabilities to answer questions such as "Is there a dog in the image?" with a certain confidence. I imagine it would be very useful for image retrieval too: you wouldn't want to return any images for a text query if no dogs are present in any of the images you search through, whereas a cosine similarity + ranking approach would presumably still return the top k instances. Instead I would like to return all instances in which I am confident a dog is present, and no instances at all if no dogs are present.

@rom1504 (Collaborator) commented Oct 20, 2022 via email

@arnaudvl (Author)

Thanks @rom1504 for the suggestions. Unfortunately I don't have the luxury of annotated labels (even just a few) for every new, open-ended user query, and I would need a measure of confidence rather than returning instances above a heuristic-based threshold. I realise I am likely asking for a bit too much here!

@rom1504 (Collaborator) commented Oct 20, 2022 via email

@arnaudvl (Author)

Thanks. I have a few tasks which would benefit from this, but maybe the most obvious one would be the following:

Given a visual representation of an image dataset (e.g. a 2D scatter plot after dimensionality reduction on top of the CLIP embeddings), I'd like to colour-code each instance in the dataset according to the presence of the text in the query (e.g. "dog"). Ideally this would go from 0 (no dog present at all in an instance) to 1 (definitely a dog present in an instance). While this is typically close to binary for simple concepts, it can be more nuanced for many other queries (e.g. "daytime image" -> probability of 0.5 at dusk/dawn?), which is why I'd ideally use probabilities. So in essence this is a 0-shot (because any user-defined text query is allowed) binary (concept present vs. concept not present in an image) classification problem.
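The colour coding itself is trivial once some per-image score in [0, 1] is available; a sketch, where xy and p_concept are assumed to come from the dimensionality reduction and from whichever scoring scheme ends up being used:

```python
import matplotlib.pyplot as plt

# xy: [N, 2] array from e.g. UMAP/t-SNE on the CLIP image embeddings
# p_concept: [N] array of scores in [0, 1] for the user query (e.g. "dog")
plt.scatter(xy[:, 0], xy[:, 1], c=p_concept, cmap='viridis', vmin=0.0, vmax=1.0, s=5)
plt.colorbar(label='concept present (0 = no, 1 = yes)')
plt.show()
```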

Any tips? Thanks again!

@rom1504 (Collaborator) commented Oct 20, 2022

This is the solution you would like to apply.

Do you have a specific use case for this, or do you only want to solve it from a research point of view and not apply it?

@arnaudvl (Author)

The motivation is to apply it and allow users to feed in their own data and query it to obtain text-driven insights. So I'd like to explore to what extent the research problem can be solved in a nice, principled way, or whether I'll have to resort to a more heuristic-driven approach, and then to what extent the solution can be applied to generic datasets.

@rwightman (Collaborator)

I don't see why these tasks require a 'single class'... the prompts for generating the zero-shot classifiers accept human language, so figure out the prompts required to achieve the goal (i.e. you can say 'picture/scene/image/rendering/painting without a dog').

If you want p(dog), then set up two sets of prompts that will achieve the aim: create a set of prompts for images/depictions containing a dog, and another set for scenes that are opposites and do not contain a dog... a daytime / nighttime split is even easier to imagine.
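A sketch of this two-prompt-set approach (the prompt lists are illustrative only; model, open_clip and normalized image_features as in the earlier snippet):

```python
import torch
import torch.nn.functional as F
import open_clip

def classifier_weights(model, prompt_sets):
    """Encode each prompt set and average it into one normalized class embedding."""
    weights = []
    with torch.no_grad():
        for prompts in prompt_sets:
            feats = F.normalize(model.encode_text(open_clip.tokenize(prompts)), dim=-1)
            weights.append(F.normalize(feats.mean(dim=0), dim=0))
    return torch.stack(weights, dim=1)   # [D, num_classes]

dog_prompts = ['a photo of a dog', 'a picture containing a dog', 'a painting of a dog']
no_dog_prompts = ['a photo of a street', 'a picture of furniture', 'a landscape without any animals']

W = classifier_weights(model, [dog_prompts, no_dog_prompts])
logits = model.logit_scale.exp() * image_features @ W    # [N, 2]
p_dog = logits.softmax(dim=-1)[:, 0]                      # p(dog) relative to the chosen negative prompt set
```

Note that the softmax output is only meaningful relative to the chosen negative prompt set, which ties in with the calibration caveat below.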

@rwightman (Collaborator)

Also, re softmax: just because its outputs sum to 1 doesn't mean they can be directly interpreted as probabilities, as is so often done; the calibration is typically quite poor. At best you can hope to have reasonable confidence that there is or isn't the object of interest in the scene (in which case either rom's or mehdi's ideas can work well with a bit of fiddling for the thresholding). To say you want a probability and assume that 0.8 from a softmax output actually means an 80% probability there's an x in the scene, well, that's unlikely to be the case.

@arnaudvl (Author)

Thanks for the helpful comments @rwightman. As mentioned before, I definitely recognise it's a binary classification problem, which ideally (and largely to improve the user experience) would be reduced to checking whether a certain concept is present in the images or not. As you note, this could be done by also constructing the "opposite prompt". To be honest, I've had quite mixed results trying this. E.g.

text = ['a picture with a dog', 'a picture without a dog']
text = ['an image with a dog', 'an image without a dog']
text = ['a picture which contains a dog', 'a picture which does not contain a dog']

already return surprisingly different 0-shot results on fairly simple STL10 instances. So it looks like I need to dig more into prompt engineering, which, combined with thresholding, should give me a decent baseline to start off with.
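For reference, a small sketch of how the comparison above can be run (it assumes the model and a preprocessed image tensor as in the first snippet; the prompt pairs are the ones listed above):

```python
import torch
import open_clip

# Assumes `model` and a preprocessed `image` tensor as in the first snippet.
prompt_pairs = [
    ['a picture with a dog', 'a picture without a dog'],
    ['an image with a dog', 'an image without a dog'],
    ['a picture which contains a dog', 'a picture which does not contain a dog'],
]

with torch.no_grad():
    image_features = model.encode_image(image)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    for pair in prompt_pairs:
        text_features = model.encode_text(open_clip.tokenize(pair))
        text_features /= text_features.norm(dim=-1, keepdim=True)
        probs = (model.logit_scale.exp() * image_features @ text_features.T).softmax(dim=-1)
        print(pair, probs[0].tolist())   # the p(dog) column can differ noticeably between pairs
```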

Thanks all for the help and the super useful repo!

@mehdidc (Contributor) commented Oct 24, 2022

I think, depending on the distribution of images you expect, you could try to construct a set of non-dog classes explicitly, each with its own prompt(s), because I don't think you would find many examples in the training set of CLIP models with a caption like "not containing an object x". As a starting point, you might experiment with the ImageNet-1k classes (https://github.com/mlfoundations/open_clip/blob/main/src/training/imagenet_zeroshot_data.py): they cover a lot of dog breeds, other animals and various everyday objects, and you can e.g. accumulate the scores of the breeds.
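A sketch of the "accumulate the breed scores" idea using that class list (the import path assumes src/ is on the Python path, and the dog-breed index range 151-268 is the commonly cited one; both are worth double-checking):

```python
import torch
import torch.nn.functional as F
import open_clip
from training.imagenet_zeroshot_data import imagenet_classnames, openai_imagenet_template

DOG_CLASSES = slice(151, 269)   # commonly cited ImageNet-1k dog-breed range; verify against the class list

def build_imagenet_classifier(model):
    """One averaged, normalized text embedding per ImageNet-1k class."""
    weights = []
    with torch.no_grad():
        for name in imagenet_classnames:
            texts = open_clip.tokenize([template(name) for template in openai_imagenet_template])
            feats = F.normalize(model.encode_text(texts), dim=-1)
            weights.append(F.normalize(feats.mean(dim=0), dim=0))
    return torch.stack(weights, dim=1)   # [D, 1000]

# W = build_imagenet_classifier(model)
# probs = (model.logit_scale.exp() * image_features @ W).softmax(dim=-1)   # [N, 1000]
# p_dog = probs[:, DOG_CLASSES].sum(dim=-1)                                # probability mass over all dog breeds
```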

@mehdidc (Contributor) commented Oct 24, 2022

So in general, basically maintain a "library" of positive and negative classes; each time there is a false positive or a false negative, you update either of them with a new class/concept. I don't know how easy that is to maintain, though.

@arnaudvl (Author)

@mehdidc as you mentioned, "not containing an object x" is indeed where CLIP seems to struggle quite a bit in my experiments. Calibration in a classification setting is typically pretty reasonable if the instance contains one of the pre-defined classes or is distributed across multiple classes (e.g. a dog which kind of looks like a cat), but it worsens a lot when you try to specify what should not be in the instance.

@isaacrob

Hello @arnaudvl and @mehdidc, my understanding is that with the original OpenAI CLIP models we could get a pretty good sense of this by just taking the cosine similarity between the image and the text embedding and clamping it to 0-1, since that's ultimately the metric their model is working with in the loss function. They compute the cosine similarity, clamp it, then scale it and take the softmax, meaning the original 0-1 score is pretty well calibrated. At least, it seems so in my experiments. The problem comes from the change in loss function in the open_clip version, where the logit is directly the dot product instead of a scaled 0-1 similarity indicator. In my work I'm having trouble using the open_clip version because the scaling of relevant similarities is less well behaved. I tried just naively passing it through a sigmoid, with no luck. @mehdidc do you have any recommendations?

@rom1504 (Collaborator) commented Oct 26, 2022

image_features = F.normalize(image_features, dim=-1)

open_clip computes cosines the same way as CLIP (the embeddings are normalized).
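Roughly, the contrastive logit computation looks like the simplified sketch below (not a verbatim copy of the loss code; image_features, text_features and logit_scale stand in for the encoder outputs and the learned scale):

```python
import torch.nn.functional as F

# Both towers' outputs are L2-normalized, so the dot product below is a cosine similarity,
# scaled by the learned logit_scale (exp of the temperature parameter), as in the original CLIP.
image_features = F.normalize(image_features, dim=-1)
text_features = F.normalize(text_features, dim=-1)
logits_per_image = logit_scale * image_features @ text_features.T
logits_per_text = logits_per_image.T
```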

@rom1504 (Collaborator) commented Oct 26, 2022

Regarding the topic mentioned here, I am still missing information on the actual use case.

Computing (zero-shot) binary classification with open_clip may be possible, but the ranking and similarity properties of CLIP allow much more general problem definitions.

@isaacrob

Oh! It is indeed just the cosine similarity here; I missed the normalization when I was reviewing the loss function y'all use! It also looks like the original CLIP paper didn't clamp to 0-1 as I thought it did... I guess I am just misremembering, that's so weird! Such a strong memory...

So in that case, the -1 to 1 range cosine similarity should be a relatively well-behaved similarity score even without the other scores to provide context for a softmax normalization, yes?

@arnaudvl (Author) commented Nov 5, 2022

Hi @rom1504, apologies for the late reply. Since you asked about the use case: it is very similar to https://twitter.com/benmschmidt/status/1587847092306837509. They also use a CLIP search module. However, I find the quality of the search results often lacking. E.g. a simple query on https://atlas.nomic.ai/map/809ef16a-5b2d-4291-b772-a913f4c8ee61/9ed7d171-650b-4526-85bf-3592ee51ea31 such as "an image of a football team" gives lots of irrelevant instances in the top results. I don't know the exact implementation, but it seems like either the top N instances are returned by default, or anything above a certain similarity threshold is returned up to a maximum of N instances. Either way, it leads to many unwanted search results, especially if what you're searching for is not in the data at all. This could be avoided if we could do relatively reliable binary classification for [an image of a football team, an image not containing a football team] or similar. Unfortunately, I haven't had success with this setting, as CLIP seems to struggle with prompts denoting what should not be present in the instance (sometimes it works, sometimes it doesn't).
