
[Bug] Owlv2 Zero-Shot Object Detection #30131

Closed
nisyad-ms opened this issue Apr 8, 2024 · 9 comments · Fixed by #30686

@nisyad-ms
System Info

transformers==4.39.3

python==3.10.14

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection
from PIL import Image, ImageDraw
import requests
import torch

checkpoint = "google/owlv2-base-patch16-ensemble"
model = AutoModelForZeroShotObjectDetection.from_pretrained(checkpoint)
processor = AutoProcessor.from_pretrained(checkpoint)

url = "https://unsplash.com/photos/oj0zeY2Ltk4/download?ixid=MnwxMjA3fDB8MXxzZWFyY2h8MTR8fHBpY25pY3xlbnwwfHx8fDE2Nzc0OTE1NDk&force=true&w=640"
im = Image.open(requests.get(url, stream=True).raw)

text_queries = ["hat", "book", "sunglasses", "camera"]
inputs = processor(text=text_queries, images=im, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    target_sizes = torch.tensor([im.size[::-1]])
    results = processor.post_process_object_detection(outputs, threshold=0.1, target_sizes=target_sizes)[0]

draw = ImageDraw.Draw(im)

scores = results["scores"].tolist()
labels = results["labels"].tolist()
boxes = results["boxes"].tolist()

for box, score, label in zip(boxes, scores, labels):
    xmin, ymin, xmax, ymax = box
    draw.rectangle((xmin, ymin, xmax, ymax), outline="red", width=1)
    draw.text((xmin, ymin), f"{text_queries[label]}: {round(score,2)}", fill="white")

im.show()

Expected behavior

Expected behavior should be as shown in the second official example here: https://huggingface.co/docs/transformers/main/en/tasks/zero_shot_object_detection

However, the final bounding boxes are still shifted. Please refer to the code above (taken from the official example).

@NielsRogge
Contributor

NielsRogge commented Apr 9, 2024

Hi,

Thanks for your interest in OWLv2. As shown in my demo notebook, you need to visualize the bounding boxes on the padded image rather than the original image.

This is also shown here: https://huggingface.co/docs/transformers/en/model_doc/owlv2#transformers.Owlv2ForObjectDetection.forward.example
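A quick way to see why this happens is to inspect the image processor's config: OWLv2, unlike OWL-ViT, pads the image to a square before resizing, so the predicted boxes are relative to the padded canvas rather than the original image. A minimal sketch (the printed values are assumptions based on current checkpoints, not taken from this thread):

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("google/owlv2-base-patch16-ensemble")

# The OWLv2 image processor pads the image to a square (grey, pixel value 0.5)
# before resizing, so detections come out relative to the padded canvas.
print(processor.image_processor.do_pad)  # expected: True
print(processor.image_processor.size)    # expected: {'height': 960, 'width': 960}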

@nisyad-ms
Author

nisyad-ms commented Apr 9, 2024

Thanks, Niels. I saw your demo notebook, and it works fine there.
But if I try to reproduce the example linked below as-is, I don't see the expected results. Can you re-confirm?
https://huggingface.co/docs/transformers/en/tasks/zero_shot_object_detection#text-prompted-zero-shot-object-detection-by-hand

@NielsRogge
Contributor

NielsRogge commented Apr 10, 2024

Did you visualize results on the unnormalized image?

@nisyad-ms
Author

Yes. If you run the example I mentioned as-is, the final bounding boxes do not match those shown in the result image.

@NielsRogge
Contributor

Yes, that example only works as-is for OWLv1. Perhaps we could add a disclaimer that, for OWLv2, results need to be shown on the preprocessed image. Would you be up for opening a PR for that?

The docs are here: https://github.com/huggingface/transformers/blob/main/docs/source/en/tasks/zero_shot_object_detection.md

@jla524
Contributor

jla524 commented May 4, 2024

@NielsRogge @nisyad-ms I managed to show the preprocessed image with the correct boxes. Below is the full code.

import torch
import requests
import numpy as np
from PIL import Image, ImageDraw
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection
from transformers.utils.constants import OPENAI_CLIP_MEAN, OPENAI_CLIP_STD

checkpoint = "google/owlv2-base-patch16-ensemble"
model = AutoModelForZeroShotObjectDetection.from_pretrained(checkpoint)
processor = AutoProcessor.from_pretrained(checkpoint)

url = "https://unsplash.com/photos/oj0zeY2Ltk4/download?ixid=MnwxMjA3fDB8MXxzZWFyY2h8MTR8fHBpY25pY3xlbnwwfHx8fDE2Nzc0OTE1NDk&force=true&w=640"
im = Image.open(requests.get(url, stream=True).raw)

text_queries = ["hat", "book", "sunglasses", "camera"]
inputs = processor(text=text_queries, images=im, return_tensors="pt")


def get_preprocessed_image(pixel_values):
    # Undo the CLIP normalization applied by the processor (x * std + mean),
    # rescale to 0-255, and move channels last so PIL can display the padded,
    # resized image the model actually saw.
    pixel_values = pixel_values.squeeze().numpy()
    unnormalized_image = (pixel_values * np.array(OPENAI_CLIP_STD)[:, None, None]) + np.array(OPENAI_CLIP_MEAN)[:, None, None]
    unnormalized_image = (unnormalized_image * 255).astype(np.uint8)
    unnormalized_image = np.moveaxis(unnormalized_image, 0, -1)
    return Image.fromarray(unnormalized_image)


unnormalized_image = get_preprocessed_image(inputs.pixel_values)

with torch.no_grad():
    outputs = model(**inputs)
    target_sizes = torch.tensor([unnormalized_image.size[::-1]])
    results = processor.post_process_object_detection(outputs, threshold=0.2, target_sizes=target_sizes)[0]

draw = ImageDraw.Draw(unnormalized_image)

scores = results["scores"].tolist()
labels = results["labels"].tolist()
boxes = results["boxes"].tolist()

for box, score, label in zip(boxes, scores, labels):
    xmin, ymin, xmax, ymax = box
    draw.rectangle((xmin, ymin, xmax, ymax), outline="red", width=1)
    draw.text((xmin, ymin), f"{text_queries[label]}: {round(score,2)}", fill="white")

unnormalized_image.show()
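(For context: the grey band is the constant padding, pixel value 0.5, that the OWLv2 image processor adds to square the image before resizing; it survives the un-normalization above, which is why the padded region shows up grey.)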

Is there an easy way to remove the grey area?

[image: detections drawn on the padded image, grey padding visible]

@nisyad-ms
Author

nisyad-ms commented May 6, 2024

Thanks @jla524 for the example. @NielsRogge also pointed to this.

  • +1 for removing the gray area.

@NielsRogge
Contributor

Yes, there's an easy way to remove the padding; see https://discuss.huggingface.co/t/owl-v2-bounding-box-misalignment-problem/66181/6?u=nielsr

@jla524
Contributor

jla524 commented May 7, 2024

Thanks @NielsRogge! This worked for me:

import torch
import requests
from PIL import Image, ImageDraw
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

checkpoint = "google/owlv2-base-patch16-ensemble"
model = AutoModelForZeroShotObjectDetection.from_pretrained(checkpoint)
processor = AutoProcessor.from_pretrained(checkpoint)

url = "https://unsplash.com/photos/oj0zeY2Ltk4/download?ixid=MnwxMjA3fDB8MXxzZWFyY2h8MTR8fHBpY25pY3xlbnwwfHx8fDE2Nzc0OTE1NDk&force=true&w=640"
image = Image.open(requests.get(url, stream=True).raw)
width, height = image.size

text_queries = ["hat", "book", "sunglasses", "camera"]
inputs = processor(text=text_queries, images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    target_sizes = torch.tensor([image.size[::-1]])
    results = processor.post_process_object_detection(outputs, threshold=0.2, target_sizes=target_sizes)[0]

draw = ImageDraw.Draw(image)

scores = results["scores"].tolist()
labels = results["labels"].tolist()
boxes = results["boxes"].tolist()

# OWLv2 pads the shorter side of the image up to the longer one, so the
# model's normalized boxes refer to that padded square. Post-processing with
# the original (unpadded) size compresses the shorter axis; dividing by the
# aspect ratio below undoes that.
width_ratio = 1
height_ratio = 1

if width < height:
    width_ratio = width / height
elif height < width:
    height_ratio = height / width

for box, score, label in zip(boxes, scores, labels):
    xmin, ymin, xmax, ymax = box
    xmin /= width_ratio
    ymin /= height_ratio
    xmax /= width_ratio
    ymax /= height_ratio
    draw.rectangle((xmin, ymin, xmax, ymax), outline="red", width=1)
    draw.text((xmin, ymin), f"{text_queries[label]}: {round(score,2)}", fill="white")

image.show()

[image: detections drawn on the original image with correctly aligned boxes]
