# Perform object detection with pollen-vision

Learn how to perform zero-shot object detection with the pollen-vision library, using the OWL-ViT model.

This notebook shows you how to use our wrapper for the OWL-ViT object detection model developed by Google Research.

![Gif Object detection from Reachy's egocentric view](https://media.githubusercontent.com/media/pollen-robotics/pollen-vision/develop/examples/vision_models_examples/gif/reachy_kitchen_detection.gif)

## A word on OWL-ViT
OWL-ViT stands for Vision Transformer for Open-World Localization. It is a zero-shot object detection model: it detects objects described by text queries without having to be retrained on labeled data, as is the case with traditional deep learning object detection models.

You can find more information about the model on the dedicated page of the [Hugging Face documentation](https://huggingface.co/docs/transformers/model_doc/owlvit). The wrapper itself is built on Hugging Face's [transformers library](https://huggingface.co/docs/transformers/index).

## Setup environment

> Note: If you are working locally on your machine and have already installed the library from source, skip the following step.

We first need to install the pollen-vision library. We will install it from source, which might take a couple of minutes because some of the dependencies are quite heavy.

In [None]:
!pip install "pollen-vision[vision] @ git+https://github.com/pollen-robotics/pollen-vision.git@99-make-the-notebooks-runnable-on-google-colab"

## Use OWL-ViT

Let's use the OwlViT wrapper to perform zero-shot object detection.

In [None]:
import numpy as np
from PIL import Image

from pollen_vision.vision_models.object_detection import OwlVitWrapper

In [None]:
object_detection_wrapper = OwlVitWrapper()

## Import example image

Here we will import an example image to test the OwlViT wrapper. We will use an image from the [reachy-doing-things image dataset](https://huggingface.co/datasets/pollen-robotics/reachy-doing-things) available on Hugging Face. In this dataset, we captured images from an egocentric view of Reachy doing manipulation tasks while being teleoperated.

Feel free to try the object detection with your own image instead!

In [None]:
from datasets import load_dataset

dataset = load_dataset("pollen-robotics/reachy-doing-things", split="train")

img = dataset[11]['image']
img

## Run inference with the model

As explained above, OWL-ViT is a zero-shot object detection model that takes text queries as input. Inference is performed with the *infer* method. Pass it the list of candidate labels for the objects you want to detect. OWL-ViT will only try to detect the classes that are in this list.

NB: The image passed to the *infer* method must be a **numpy array object**.
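If you try your own images, they may not all arrive as three-channel arrays. The helper below is a minimal sketch of one way to normalize an input before calling *infer*. The name `to_rgb_uint8` is hypothetical, and the idea that the wrapper expects an RGB `uint8` array of shape (H, W, 3) is an assumption, not something this notebook states.

```python
import numpy as np


def to_rgb_uint8(image) -> np.ndarray:
    """Return an (H, W, 3) uint8 array from a PIL image or array-like input.

    Hypothetical helper: the exact layout expected by the wrapper is an assumption.
    """
    arr = np.asarray(image)
    if arr.ndim == 2:
        # Grayscale input: repeat the single channel three times.
        arr = np.stack([arr] * 3, axis=-1)
    elif arr.ndim == 3 and arr.shape[2] == 4:
        # RGBA input: drop the alpha channel.
        arr = arr[:, :, :3]
    if arr.ndim != 3 or arr.shape[2] != 3:
        raise ValueError(f"expected an (H, W, 3) image, got shape {arr.shape}")
    if arr.dtype != np.uint8:
        # Assumes the values are already on a 0-255 scale.
        arr = np.clip(arr, 0, 255).astype(np.uint8)
    return arr
```

For the dataset image used here, `np.array(img)` is what the notebook itself uses, so this helper is only relevant for other inputs.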

In [None]:
predictions = object_detection_wrapper.infer(
    im=np.array(img),
    candidate_labels=["kettle", "black mug", "sink", "blue mug", "sponge", "bag of chips"],
    detection_threshold=0.15,
)

predictions

Change the list of candidates and see what you can detect!
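Once you have predictions, you will often want to post-process them, for example to keep a single detection per label. The sketch below assumes each prediction is a dictionary with `label`, `score` and `box` keys (with `xmin`, `ymin`, `xmax`, `ymax`), as returned by Hugging Face's zero-shot object detection pipeline. Check the `predictions` printed above before relying on this layout. The example data and both function names are hypothetical.

```python
# Hypothetical predictions, shaped like the output of Hugging Face's
# zero-shot object detection pipeline (an assumption about the wrapper).
example_predictions = [
    {"label": "kettle", "score": 0.42, "box": {"xmin": 10, "ymin": 20, "xmax": 110, "ymax": 220}},
    {"label": "blue mug", "score": 0.31, "box": {"xmin": 200, "ymin": 50, "xmax": 260, "ymax": 130}},
    {"label": "blue mug", "score": 0.18, "box": {"xmin": 300, "ymin": 60, "xmax": 350, "ymax": 120}},
    {"label": "sink", "score": 0.16, "box": {"xmin": 0, "ymin": 300, "xmax": 400, "ymax": 480}},
]


def best_per_label(predictions, min_score=0.0):
    """Keep the highest-scoring detection for each label at or above min_score."""
    best = {}
    for pred in predictions:
        if pred["score"] < min_score:
            continue
        current = best.get(pred["label"])
        if current is None or pred["score"] > current["score"]:
            best[pred["label"]] = pred
    return best


def box_center(box):
    """Return the (x, y) center of a box given as xmin/ymin/xmax/ymax."""
    return ((box["xmin"] + box["xmax"]) / 2, (box["ymin"] + box["ymax"]) / 2)


for label, pred in best_per_label(example_predictions, min_score=0.2).items():
    print(label, pred["score"], box_center(pred["box"]))
```

Raising `min_score` here has an effect similar to raising `detection_threshold` in the call to *infer*: fewer, more confident detections are kept.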

### Visualize detection results

You can easily visualize the predictions of the model with the *Annotator* class from utils.

In [None]:
from pollen_vision.vision_models.utils import Annotator

In [None]:
annotator = Annotator()

img_annotated = annotator.annotate(im=np.array(img), detection_predictions=predictions)
Image.fromarray(img_annotated)  # annotator returns a numpy array object

## Final notes

That's all folks! You can use [this script](https://github.com/pollen-robotics/pollen-vision/blob/99-make-the-notebooks-runnable-on-google-colab/scripts/annotate_video.py) if you want to perform zero-shot object detection on a recorded video. The script gathers all the commands that you saw in this notebook.
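To give an idea of what running detection on a video involves, here is a minimal sketch of a per-frame loop. It is not the content of the linked script. `annotate_frames` is a hypothetical helper that takes the inference and annotation steps as callables, so it makes no assumption about how frames are read or written.

```python
from typing import Any, Callable, Iterable, Iterator, List


def annotate_frames(
    frames: Iterable[Any],
    infer: Callable[[Any], List[dict]],
    annotate: Callable[[Any, List[dict]], Any],
) -> Iterator[Any]:
    """Yield one annotated frame per input frame (hypothetical helper)."""
    for frame in frames:
        predictions = infer(frame)
        yield annotate(frame, predictions)


# With the objects created in this notebook, the callables could be wired as:
#   labels = ["kettle", "black mug", "sink"]
#   infer = lambda im: object_detection_wrapper.infer(
#       im=im, candidate_labels=labels, detection_threshold=0.15
#   )
#   annotate = lambda im, preds: annotator.annotate(im=im, detection_predictions=preds)
```

Because the function is a generator, frames are processed one at a time, which keeps memory use flat on long recordings.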

Check out the [other notebooks](https://drive.google.com/drive/folders/1Xx42Pk4exkS95iyD-5arHIYQLXyRWTXw?usp=drive_link) if you want to learn how to use other vision models like RAM for image tagging or SAM to perform object segmentation.