# Perform object segmentation with pollen-vision

Learn how to perform object segmentation with the pollen-vision library, using the MobileSAM model.

MobileSAM is a lighter version of SAM, a segmentation model developed by Meta AI.

💡 In this notebook, we assume that you have already gone through the notebook dedicated to zero-shot object detection, as we will also perform object detection here.

![Object segmentation from Reachy's egocentric view](https://media.githubusercontent.com/media/pollen-robotics/pollen-vision/develop/examples/vision_models_examples/gif/reachy_kitchen_masks.gif)

## A word on SAM and Mobile SAM

SAM stands for Segment Anything Model. Developed by Meta AI, SAM is a promptable segmentation system with zero-shot generalization to unfamiliar objects and images, without the need for additional training. With SAM, a single point is enough of a prompt for the model to predict the mask of an object of interest.

In 2023, researchers from Kyung Hee University developed MobileSAM, a lighter version of SAM that can run on mobile devices. pollen-vision uses the MobileSAM implementation from its authors. Check the [MobileSAM paper](https://arxiv.org/pdf/2306.14289.pdf), its [GitHub repository](https://github.com/ChaoningZhang/MobileSAM) and the [original SAM paper](https://arxiv.org/pdf/2304.02643.pdf) for more information.

Credits to Chaoning Zhang et al. from Kyung Hee University and to Alexander Kirillov et al. from Meta AI for developing this and making it open source!

## Setup environment

> Note: If you are working locally on your machine and have already installed the library from source, skip the following.

We first need to install the pollen-vision library. We will install it from source, which might take a couple of minutes as the dependencies are quite heavy.

> If you are on Colab and a warning window pops up saying "You must restart the runtime in order to use the newly installed versions.", don't worry. Just press restart session and don't execute the pip install cell again: pollen-vision will already be installed.

In [None]:
!pip install "pollen-vision[vision] @ git+https://github.com/pollen-robotics/pollen-vision.git@main"

## Initialize MobileSAM

Let's instantiate a MobileSAM wrapper to prepare the object segmentation.

In [None]:
import numpy as np
from PIL import Image

from pollen_vision.vision_models.object_segmentation import MobileSamWrapper

In [None]:
object_segmentation_wrapper = MobileSamWrapper()

## Import example image

Here we will import an example image to test the MobileSAM wrapper. We will use an image from the [reachy-doing-things image dataset](https://huggingface.co/datasets/pollen-robotics/reachy-doing-things) available on Hugging Face. In this dataset, we captured images from an egocentric view of Reachy doing manipulation tasks while being teleoperated.

Feel free to try the object segmentation with your own image instead!

In [None]:
from datasets import load_dataset

dataset = load_dataset("pollen-robotics/reachy-doing-things", split="train")

img = dataset[12]["image"]
img

Let's perform object segmentation on objects Reachy could grasp.

## First: object detection

To obtain the segmentation, we first need to do object detection in the image to give inputs to MobileSAM. MobileSAM (and SAM as well) takes either a point, a list of points or a bounding box of an object to perform the segmentation. We show in this example how to use bounding boxes of objects as input. So let's get bounding boxes for objects Reachy could grasp, using the OwlViT wrapper.

In [None]:
from pollen_vision.vision_models.object_detection import OwlVitWrapper

object_detection_wrapper = OwlVitWrapper()

If you chose your own image, replace the *candidate_labels* argument with your own list of candidate objects.

In [None]:
predictions = object_detection_wrapper.infer(
    im=np.array(img), candidate_labels=["blue mug", "paper cup", "kettle", "sponge"], detection_threshold=0.12
)
predictions
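
Detection models often return several candidates for the same label. The sketch below shows one way to keep only the highest-scoring prediction per label before segmenting. It assumes that each prediction is a dictionary with `score`, `label` and `box` keys, which is the format used by Hugging Face zero-shot object detection pipelines; check the content of `predictions` above before relying on it. The `example_predictions` values and the `best_per_label` helper are hypothetical and not part of pollen-vision.

```python
# Hypothetical predictions, mimicking the assumed output format of the detector.
example_predictions = [
    {"score": 0.41, "label": "blue mug", "box": {"xmin": 120, "ymin": 200, "xmax": 220, "ymax": 310}},
    {"score": 0.15, "label": "blue mug", "box": {"xmin": 400, "ymin": 90, "xmax": 450, "ymax": 150}},
    {"score": 0.33, "label": "sponge", "box": {"xmin": 300, "ymin": 350, "xmax": 380, "ymax": 400}},
]


def best_per_label(predictions):
    """Keep only the highest-scoring prediction for each label."""
    best = {}
    for prediction in predictions:
        label = prediction["label"]
        # Replace the stored prediction only if this one scores higher.
        if label not in best or prediction["score"] > best[label]["score"]:
            best[label] = prediction
    return list(best.values())


filtered_predictions = best_per_label(example_predictions)
for prediction in filtered_predictions:
    print(prediction["label"], prediction["score"])
```

Whether this filtering is desirable depends on the scene: if several instances of the same object are present, you probably want to keep all of them.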

We can extract the bounding boxes from the predictions, as we will need them as input for the segmentation.

In [None]:
from pollen_vision.utils import get_bboxes

bboxes = get_bboxes(predictions)

N.B.: the format returned for the bounding boxes is *[xmin, ymin, xmax, ymax]*
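
As an illustration of the *[xmin, ymin, xmax, ymax]* format, here is a minimal sketch that computes the size and center of a box and clamps a box to the image bounds. The helpers `bbox_geometry` and `clip_bbox` and the example values are hypothetical; they are not part of pollen-vision.

```python
def bbox_geometry(bbox):
    """Return the width, height and center of an [xmin, ymin, xmax, ymax] box."""
    xmin, ymin, xmax, ymax = bbox
    width = xmax - xmin
    height = ymax - ymin
    center = (xmin + width / 2, ymin + height / 2)
    return width, height, center


def clip_bbox(bbox, image_width, image_height):
    """Clamp each corner of the box so that it lies inside the image."""
    xmin, ymin, xmax, ymax = bbox
    return [
        max(0, min(xmin, image_width - 1)),
        max(0, min(ymin, image_height - 1)),
        max(0, min(xmax, image_width - 1)),
        max(0, min(ymax, image_height - 1)),
    ]


# A made-up box that sticks out of a 100x80 image.
example_bbox = [-5, 40, 110, 90]
clipped_bbox = clip_bbox(example_bbox, image_width=100, image_height=80)
width, height, center = bbox_geometry(clipped_bbox)
print(clipped_bbox, width, height, center)
```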

### Visualize object detections

You can easily visualize the predictions of the object detection model with the *Annotator* class from utils.

In [None]:
from pollen_vision.utils import Annotator

In [None]:
annotator = Annotator()

img_annotated = annotator.annotate(im=img, detection_predictions=predictions)
Image.fromarray(img_annotated)

## At last, the segmentation!

Now that we have the bounding boxes for the objects we are interested in, we can use the MobileSAM wrapper instantiated earlier to obtain the segmentation of each object.

In [None]:
masks = object_segmentation_wrapper.infer(im=img, bboxes=bboxes)

Note: You could also call `object_segmentation_wrapper.infer(...)` with a list of lists of points as input.
Here, each list of points would correspond to the points of interest for one object. An example of such a list would be:
```python
points = [[[x1, y1], [x2, y2], ...], [[x3, y3], [x4, y4], ...], ...]
```

You could then call `object_segmentation_wrapper.infer(im=img, points_list=points)`.
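
If you only have bounding boxes at hand, one simple way to build such a list is to use the center of each box as a single point of interest. The sketch below is only an illustration: the helper `bbox_centers_as_points` and the example boxes are hypothetical, and it assumes points are given as [x, y] pixel coordinates, as in SAM's point prompts.

```python
def bbox_centers_as_points(bboxes):
    """Build one single-point list per [xmin, ymin, xmax, ymax] box, using the box center."""
    points_list = []
    for xmin, ymin, xmax, ymax in bboxes:
        center_x = int((xmin + xmax) // 2)
        center_y = int((ymin + ymax) // 2)
        # One object -> one list of points (here, a single [x, y] point).
        points_list.append([[center_x, center_y]])
    return points_list


# Made-up boxes for two objects.
example_bboxes = [[10, 20, 50, 60], [100, 40, 140, 100]]
example_points = bbox_centers_as_points(example_bboxes)
print(example_points)
```

Keep in mind that the center of a box does not always lie on the object itself (think of a mug seen with its handle, or a ring-shaped object), so a bounding box prompt is usually the safer choice when one is available.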
            

In [None]:
img_annotated = annotator.annotate(im=img, detection_predictions=predictions, masks=masks)
Image.fromarray(img_annotated)
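
Beyond visualization, masks can be used to compute simple geometric properties of each object. The sketch below assumes that each mask is a 2D array with the same height and width as the image, where non-zero pixels belong to the object; inspect `masks[0].shape` and `masks[0].dtype` to confirm this for your version of the library. The `mask_stats` helper and the synthetic `example_mask` are hypothetical and only there to make the example self-contained.

```python
import numpy as np


def mask_stats(mask):
    """Return the pixel area, centroid (x, y) and tight [xmin, ymin, xmax, ymax] box of a mask."""
    binary = np.asarray(mask) > 0
    area = int(binary.sum())
    if area == 0:
        # Empty mask: there is no centroid or bounding box to report.
        return {"area": 0, "centroid": None, "bbox": None}
    ys, xs = np.nonzero(binary)
    centroid = (float(xs.mean()), float(ys.mean()))
    bbox = [int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())]
    return {"area": area, "centroid": centroid, "bbox": bbox}


# Synthetic 6x8 mask with a filled rectangle standing in for an object.
example_mask = np.zeros((6, 8), dtype=np.uint8)
example_mask[2:5, 3:7] = 255

stats = mask_stats(example_mask)
print(stats)
```

With real data, you could call `mask_stats(mask)` for each element of `masks`, for instance to pick the largest object or to estimate where an object sits in the image.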