
For OWL-ViT, is there a demo that shows how to use image patches as queries for one-shot detection? #325

Closed
Edwardmark opened this issue May 19, 2022 · 11 comments

Comments

@Edwardmark

Hi, thanks for your great work. The text zero-shot demo is amazing.
For OWL-ViT, is there a demo that shows how to use image patches as queries for one-shot detection?
Thanks.

@mjlm
Collaborator

mjlm commented May 20, 2022

Hi, we're actively working on this demo and will let you know when it's available, hopefully some time next week.

@Edwardmark
Author

@mjlm Also, what prompts are used in the COCO evaluation? The paper says it uses the seven best prompts, so what are the seven best text prompts? Thanks.

@AlexeyG
Collaborator

AlexeyG commented May 27, 2022

The prompts can be found in the CLIP repository. During inference we used the 7 ensembling prompts from the colab.
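For illustration only, here is a rough sketch of prompt ensembling as it is commonly done with CLIP-style text encoders. The template strings below are placeholders (the actual seven prompts are the ones listed in the CLIP repository and the colab), and `encode_text` is a hypothetical stand-in for the model's text encoder:

```python
import numpy as np

# Placeholder templates: the real seven ensembling prompts are listed in the
# CLIP repository and in the OWL-ViT colab.
TEMPLATES = [
    "itap of a {}.",
    "a photo of the {}.",
    "a photo of a {}.",
]

def ensemble_text_embedding(class_name, encode_text):
    """Embed a class name with every template, then average and re-normalize.
    `encode_text` is a stand-in for the model's text encoder (str -> 1-D array)."""
    embs = np.stack([encode_text(t.format(class_name)) for t in TEMPLATES])
    embs /= np.linalg.norm(embs, axis=-1, keepdims=True)  # unit-normalize each prompt embedding
    mean = embs.mean(axis=0)
    return mean / np.linalg.norm(mean)  # unit-normalize the ensembled embedding
```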

@AlexeyG AlexeyG closed this as completed May 27, 2022
@stevebottos

Is this still in the works? I've been interested in seeing how image input queries could be used as well.

@xishanhan

Hi, we're actively working on this demo and will let you know when it's available, hopefully some time next week.

Hi, is this one-shot detection demo finished? I'm also very interested in it and would like to try it.

@mjlm mjlm reopened this Jun 13, 2022
@mjlm
Collaborator

mjlm commented Jun 13, 2022

We're still working on this and will let you know here when the demo is ready. I re-opened the issue to keep track.

@xishanhan

We're still working on this and will let you know here when the demo is ready. I re-opened the issue to keep track.

That would be very nice, thank you!

@mjlm
Collaborator

mjlm commented Jun 22, 2022

We just added a Playground Colab with an interactive demo of both text-conditioned and image-conditioned detection:

OWL-ViT text inference demo OWL-ViT image inference demo

The underlying code illustrates how to extract an embedding for a given image patch, specifically here: https://github.com/google-research/scenic/blob/main/scenic/projects/owl_vit/notebooks/inference.py#L110-L131
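For a rough idea of what that step does, here is an illustrative sketch only (not the project's code; `class_embeddings`, `pred_boxes`, and the helper names are hypothetical, and the canonical implementation is in the linked inference.py). It picks the per-token class embedding whose predicted box best overlaps the user-specified query box:

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes given as [x0, y0, x1, y1]."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def get_query_embedding(class_embeddings, pred_boxes, query_box):
    """Select the class embedding of the predicted box that best overlaps the
    query box on the source image, and unit-normalize it for cosine scoring."""
    ious = np.array([box_iou(b, query_box) for b in pred_boxes])
    best = int(np.argmax(ious))
    emb = class_embeddings[best]
    return emb / (np.linalg.norm(emb) + 1e-9)
```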

Let us know if you have any questions!

@xishanhan

We just added a Playground Colab with an interactive demo of both text-conditioned and image-conditioned detection:

OWL-ViT text inference demo OWL-ViT image inference demo

The underlying code illustrates how to extract an embedding for a given image patch, specifically here: https://github.com/google-research/scenic/blob/main/scenic/projects/owl_vit/notebooks/inference.py#L110-L131

Let us know if you have any questions!

Thanks for your reply! I don't have any problems now.

@AlexeyG AlexeyG closed this as completed Jul 1, 2022
@BIGBALLON

Hi, @mjlm , thanks for your great work!

I wonder if there are any plans to implement multi-query image-conditioned detection.

A single query image is often unable to capture all the features of an object, and using multiple query images to represent it can yield better results.

Thanks again!

@mjlm
Collaborator

mjlm commented Sep 26, 2023

You can simply average the embeddings of multiple boxes to get a query embedding. This is how we implemented few-shot (i.e. more than one-shot) detection in the paper.

#890 will add example code for image-conditioned detection to the colab. The example shows how to get a query_embedding from the class_embeddings of the source (query) image. If you have e.g. two query embeddings representing the same object, you can simply do two_shot_query_embedding = (query_embedding_1 + query_embedding_2) / 2. This simple method worked for us. Another option would be to keep the embeddings separate, but map them to the same class after classification.
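As a minimal sketch of that averaging (assuming the per-box query embeddings have already been extracted as above; re-normalizing after the mean is an assumption for cosine-style scoring, not necessarily what the paper does):

```python
import numpy as np

def average_query_embeddings(query_embeddings):
    """Average several per-box query embeddings into a single few-shot query."""
    mean = np.mean(np.stack(query_embeddings), axis=0)
    return mean / (np.linalg.norm(mean) + 1e-9)  # re-normalization is an assumption

# Two-shot example, equivalent to (query_embedding_1 + query_embedding_2) / 2
# up to the final normalization:
# two_shot_query_embedding = average_query_embeddings(
#     [query_embedding_1, query_embedding_2])
```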
