
[Bug] Owlv2 Zero-Shot Object Detection #30131

Closed
nisyad-ms opened this issue Apr 8, 2024 · 9 comments · Fixed by #30686

@nisyad-ms
System Info

transformers==4.39.3

python==3.10.14

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection
from PIL import Image, ImageDraw
import requests
import torch

checkpoint = "google/owlv2-base-patch16-ensemble"
model = AutoModelForZeroShotObjectDetection.from_pretrained(checkpoint)
processor = AutoProcessor.from_pretrained(checkpoint)

url = "https://unsplash.com/photos/oj0zeY2Ltk4/download?ixid=MnwxMjA3fDB8MXxzZWFyY2h8MTR8fHBpY25pY3xlbnwwfHx8fDE2Nzc0OTE1NDk&force=true&w=640"
im = Image.open(requests.get(url, stream=True).raw)

text_queries = ["hat", "book", "sunglasses", "camera"]
inputs = processor(text=text_queries, images=im, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    target_sizes = torch.tensor([im.size[::-1]])
    results = processor.post_process_object_detection(outputs, threshold=0.1, target_sizes=target_sizes)[0]

draw = ImageDraw.Draw(im)

scores = results["scores"].tolist()
labels = results["labels"].tolist()
boxes = results["boxes"].tolist()

for box, score, label in zip(boxes, scores, labels):
    xmin, ymin, xmax, ymax = box
    draw.rectangle((xmin, ymin, xmax, ymax), outline="red", width=1)
    draw.text((xmin, ymin), f"{text_queries[label]}: {round(score,2)}", fill="white")

im.show()

Expected behavior

Expected behavior should be as shown in the second official example here: https://huggingface.co/docs/transformers/main/en/tasks/zero_shot_object_detection

However, the final bounding boxes are still shifted. Please refer to the code above (taken from the official example).

@NielsRogge
Contributor

NielsRogge commented Apr 9, 2024

Hi,

Thanks for your interest in OWLv2. As shown in my demo notebook, you need to visualize the bounding boxes on the padded image rather than the original image.

This is also shown here: https://huggingface.co/docs/transformers/en/model_doc/owlv2#transformers.Owlv2ForObjectDetection.forward.example
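A quick way to see why this happens is to inspect the image processor's config: OWLv2, unlike OWL-ViT, pads the image to a square before resizing, so the predicted boxes are relative to the padded canvas rather than the original image. A minimal sketch (the printed values are assumptions based on current checkpoints, not taken from this thread):

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("google/owlv2-base-patch16-ensemble")

# The OWLv2 image processor pads the image to a square (grey, pixel value 0.5)
# before resizing, so detections come out relative to the padded canvas.
print(processor.image_processor.do_pad)  # expected: True
print(processor.image_processor.size)    # expected: {'height': 960, 'width': 960}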

@nisyad-ms
Author

nisyad-ms commented Apr 9, 2024

Thanks, Niels. I saw your demo notebook, and it works fine there.
But if I try to reproduce the example linked below as-is, I don't see the expected results. Can you re-confirm?
https://huggingface.co/docs/transformers/en/tasks/zero_shot_object_detection#text-prompted-zero-shot-object-detection-by-hand

@NielsRogge
Contributor

NielsRogge commented Apr 10, 2024

Did you visualize results on the unnormalized image?

@nisyad-ms
Author

Yes. If you run the example I mentioned as-is, the final bounding boxes do not match those shown in the result image.

@NielsRogge
Contributor

Yes, that example only works as-is for OWLv1. Perhaps we could add a disclaimer that, for OWLv2, results need to be shown on the preprocessed image. Would you be up for opening a PR for that?

The docs are here: https://github.com/huggingface/transformers/blob/main/docs/source/en/tasks/zero_shot_object_detection.md

@jla524
Contributor

jla524 commented May 4, 2024

@NielsRogge @nisyad-ms I managed to show the preprocessed image with the correct boxes. Below is the full code.

import torch
import requests
import numpy as np
from PIL import Image, ImageDraw
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection
from transformers.utils.constants import OPENAI_CLIP_MEAN, OPENAI_CLIP_STD

checkpoint = "google/owlv2-base-patch16-ensemble"
model = AutoModelForZeroShotObjectDetection.from_pretrained(checkpoint)
processor = AutoProcessor.from_pretrained(checkpoint)

url = "https://unsplash.com/photos/oj0zeY2Ltk4/download?ixid=MnwxMjA3fDB8MXxzZWFyY2h8MTR8fHBpY25pY3xlbnwwfHx8fDE2Nzc0OTE1NDk&force=true&w=640"
im = Image.open(requests.get(url, stream=True).raw)

text_queries = ["hat", "book", "sunglasses", "camera"]
inputs = processor(text=text_queries, images=im, return_tensors="pt")


def get_preprocessed_image(pixel_values):
    # Undo the CLIP normalization applied by the processor (x * std + mean),
    # rescale to 0-255, and move channels last so PIL can display the padded,
    # resized image the model actually saw.
    pixel_values = pixel_values.squeeze().numpy()
    unnormalized_image = (pixel_values * np.array(OPENAI_CLIP_STD)[:, None, None]) + np.array(OPENAI_CLIP_MEAN)[:, None, None]
    unnormalized_image = (unnormalized_image * 255).astype(np.uint8)
    unnormalized_image = np.moveaxis(unnormalized_image, 0, -1)
    return Image.fromarray(unnormalized_image)


unnormalized_image = get_preprocessed_image(inputs.pixel_values)

with torch.no_grad():
    outputs = model(**inputs)
    target_sizes = torch.tensor([unnormalized_image.size[::-1]])
    results = processor.post_process_object_detection(outputs, threshold=0.2, target_sizes=target_sizes)[0]

draw = ImageDraw.Draw(unnormalized_image)

scores = results["scores"].tolist()
labels = results["labels"].tolist()
boxes = results["boxes"].tolist()

for box, score, label in zip(boxes, scores, labels):
    xmin, ymin, xmax, ymax = box
    draw.rectangle((xmin, ymin, xmax, ymax), outline="red", width=1)
    draw.text((xmin, ymin), f"{text_queries[label]}: {round(score,2)}", fill="white")

unnormalized_image.show()
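(For context: the grey band is the constant padding, pixel value 0.5, that the OWLv2 image processor adds to square the image before resizing; it survives the un-normalization above, which is why the padded region shows up grey.)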

Is there an easy way to remove the grey area?

[image: detections drawn on the padded image, grey padding visible]

@nisyad-ms
Author

nisyad-ms commented May 6, 2024

Thanks @jla524 for the example. @NielsRogge also pointed to this.

  • +1 for removing the gray area.

@NielsRogge
Contributor

Yes, there's an easy way to remove the padding; see https://discuss.huggingface.co/t/owl-v2-bounding-box-misalignment-problem/66181/6?u=nielsr

@jla524
Contributor

jla524 commented May 7, 2024

Thanks @NielsRogge! This worked for me:

import torch
import requests
from PIL import Image, ImageDraw
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

checkpoint = "google/owlv2-base-patch16-ensemble"
model = AutoModelForZeroShotObjectDetection.from_pretrained(checkpoint)
processor = AutoProcessor.from_pretrained(checkpoint)

url = "https://unsplash.com/photos/oj0zeY2Ltk4/download?ixid=MnwxMjA3fDB8MXxzZWFyY2h8MTR8fHBpY25pY3xlbnwwfHx8fDE2Nzc0OTE1NDk&force=true&w=640"
image = Image.open(requests.get(url, stream=True).raw)
width, height = image.size

text_queries = ["hat", "book", "sunglasses", "camera"]
inputs = processor(text=text_queries, images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    target_sizes = torch.tensor([image.size[::-1]])
    results = processor.post_process_object_detection(outputs, threshold=0.2, target_sizes=target_sizes)[0]

draw = ImageDraw.Draw(image)

scores = results["scores"].tolist()
labels = results["labels"].tolist()
boxes = results["boxes"].tolist()

# OWLv2 pads the shorter side of the image up to the longer one, so the
# model's normalized boxes refer to that padded square. Post-processing with
# the original (unpadded) size compresses the shorter axis; dividing by the
# aspect ratio below undoes that.
width_ratio = 1
height_ratio = 1

if width < height:
    width_ratio = width / height
elif height < width:
    height_ratio = height / width

for box, score, label in zip(boxes, scores, labels):
    xmin, ymin, xmax, ymax = box
    xmin /= width_ratio
    ymin /= height_ratio
    xmax /= width_ratio
    ymax /= height_ratio
    draw.rectangle((xmin, ymin, xmax, ymax), outline="red", width=1)
    draw.text((xmin, ymin), f"{text_queries[label]}: {round(score,2)}", fill="white")

image.show()

[image: detections drawn on the original image with correctly aligned boxes]
