## Who this is for
This notebook is designed for:

- Computer vision engineers with basic FiftyOne experience (can load datasets and use the App)

- Practitioners interested in zero-shot computer vision approaches who may be new to segmentation tasks

- Users looking to implement quick segmentation solutions without training custom models or creating labeled datasets

## Assumed Knowledge

### Computer Vision Concepts

- Basic understanding of image segmentation (semantic, instance)

- Familiarity with vision-language models and prompting

- Understanding of coordinate systems in images

### Technical Requirements

- Intermediate Python programming skills

- Experience with Jupyter notebooks

- Basic understanding of PyTorch (for model usage)

### FiftyOne Concepts
You should be familiar with:
- [Datasets and Samples](https://beta-docs.voxel51.com/getting_started/basics.html)
- [The FiftyOne App](https://beta-docs.voxel51.com/getting_started/basic/application_tour/)
- [Working with Labels](https://beta-docs.voxel51.com/user_guide/using_datasets.html#labels)
- [The Model Zoo](https://beta-docs.voxel51.com/user_guide/model_zoo/index.html)

## Time to Complete

~45-60 minutes

## Required Packages

It's recommended to use a virtual environment with FiftyOne already installed. Additionally, you'll need:

```bash
# Install Florence2 plugin requirements
fiftyone plugins download https://github.com/jacobmarks/fiftyone_florence2_plugin
fiftyone plugins requirements @jacobmarks/florence2 --install

# Install Moondream2 plugin requirements
fiftyone plugins download https://github.com/harpreetsahota204/moondream2-plugin
fiftyone plugins requirements @harpreetsahota/moondream2 --install

# Install SAM2
pip install "git+https://github.com/facebookresearch/sam2.git#egg=sam-2"

# Install FastSAM
pip install ultralytics
```

## Content Overview

- **Zero-Shot Segmentation Introduction**: Understanding the basics of zero-shot segmentation and its applications

- **Phrase Grounding Segmentation**: Using Florence2 to segment images based on text descriptions

- **Moondream + SAM2 Integration**: Combining automatic keypoint detection with advanced segmentation

- **FastSAM Implementation**: Using point-based prompts for quick and efficient segmentation


# Zero-Shot Segmentation

Zero-shot segmentation is a computer vision task that aims to segment objects or regions in images without any training samples for those specific categories. It enables models to perform instance, semantic, or panoptic segmentation for novel categories by transferring visual knowledge learned from seen categories to unseen ones. Prompt-based zero-shot segmentation uses prompts to guide the segmentation process at test time without requiring retraining for new categories. This approach allows a single trained model to handle various segmentation tasks dynamically.

### Types of Prompts

**Text Prompts**
- Free-text descriptions that specify what to segment in an image
- The model uses pre-trained knowledge of text-image relationships to identify and segment the described objects

**Image Prompts**
- Visual examples that show what to segment
- Particularly useful when the target is difficult to describe in words
- Can be a reference image containing the object of interest
- The model compares the visual features between the prompt image and the target image to identify similar regions

**Hybrid Approaches**
- Some systems can accept either text or image prompts for the same model
- CLIPSeg is an example of a model that works with both text and image prompts by adding a decoder to CLIP

#### Let's begin by downloading a dataset from the FiftyOne [Dataset Zoo](https://beta-docs.voxel51.com/data/dataset_zoo/datasets/). 

You'll notice I have passed several arguments to the [`load_zoo_dataset`](https://beta-docs.voxel51.com/api/fiftyone.zoo.datasets.html#load_zoo_dataset) function:

- `max_samples`: The [Dataset](https://beta-docs.voxel51.com/fiftyone_concepts/using_datasets/) will contain at most 25 [Samples](https://beta-docs.voxel51.com/getting_started/basic/datasets_samples_fields/)

- `shuffle`: Randomize the [Samples](https://beta-docs.voxel51.com/api/fiftyone.core.sample.Sample.html) that are selected 

- `dataset_name`: Assign a [name](https://beta-docs.voxel51.com/api/fiftyone.core.dataset.Dataset.html#name) to the [Dataset](https://beta-docs.voxel51.com/api/fiftyone.core.dataset.Dataset.html#name)

- `overwrite`: Overwrite any existing files on disk when the dataset is downloaded

In [1]:
import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset(
    "quickstart", 
    max_samples=25,
    shuffle=True,
    dataset_name="mini_quickstart",
    overwrite=True,
    )

Overwriting existing directory '/home/harpreet/fiftyone/quickstart'
Downloading dataset to '/home/harpreet/fiftyone/quickstart'
Downloading dataset...
 100% |████|  187.5Mb/187.5Mb [908.5ms elapsed, 0s remaining, 206.4Mb/s]      
Extracting dataset...
Parsing dataset metadata
Found 200 samples
Dataset info written to '/home/harpreet/fiftyone/quickstart/info.json'
Loading 'quickstart'
 100% |███████████████████| 25/25 [263.8ms elapsed, 0s remaining, 94.8 samples/s]      
Dataset 'mini_quickstart' created


We'll need [metadata](https://beta-docs.voxel51.com/fiftyone_concepts/using_datasets/#storing-field-metadata), such as each Sample's height and width, later on. The [`compute_metadata`](https://beta-docs.voxel51.com/api/fiftyone.core.collections.SampleCollection.html#compute_metadata) method of the Dataset populates it:

In [2]:
dataset.compute_metadata()

Computing metadata...
 100% |███████████████████| 25/25 [12.7ms elapsed, 0s remaining, 2.0K samples/s] 


Here's what the metadata for the first Sample looks like:

In [3]:
dataset.first().metadata

<ImageMetadata: {
    'size_bytes': 157534,
    'mime_type': 'image/jpeg',
    'width': 640,
    'height': 489,
    'num_channels': 3,
}>

# Phrase Grounding Segmentation

Phrase grounding segmentation extends traditional phrase grounding by not only localizing objects mentioned in text but also generating pixel-level segmentation masks for those objects. While phrase grounding typically produces bounding boxes around regions corresponding to textual phrases, phrase grounding segmentation aims to create fine-grained segmentation masks that precisely delineate the boundaries of the referenced objects.

This approach enables more precise visual understanding by:
- Associating specific words or phrases with their corresponding image regions
- Generating pixel-accurate segmentation masks rather than just bounding boxes
- Creating a more detailed alignment between language and visual content


## Plugins

[FiftyOne plugins](https://beta-docs.voxel51.com/plugins/) are powerful extensions that allow users to customize and enhance the functionality of the [FiftyOne App](https://beta-docs.voxel51.com/getting_started/basic/application_tour/). 

[Plugins can be written in Python, JavaScript, or a combination of both](https://beta-docs.voxel51.com/plugins/developing_plugins/), enabling users to add new features, create integrations with other tools and APIs, render custom panels, and add custom actions to menus. They are composed of panels, operators, and components, which together allow for building full-featured interactive data applications tailored to specific use cases. Plugins can range from simple actions like adding a checkbox to complex workflows such as requesting annotations from a configurable backend. This extensibility makes FiftyOne highly adaptable to various computer vision tasks and workflows, limited only by the user's imagination.

We'll use the Plugin framework via the FiftyOne SDK. You can also [refer to the docs on using plugins in the FiftyOne App](https://beta-docs.voxel51.com/plugins/using_plugins/).

### Florence2 Plugin

The [Florence2 Plugin](https://github.com/jacobmarks/fiftyone_florence2_plugin) integrates Microsoft's Florence2 Vision-Language Model with FiftyOne datasets, offering several powerful computer vision capabilities.

One of these tasks is referring segmentation, which segments specific regions of an image based on natural language descriptions. It can be used in two ways:

• Using a direct expression through the `expression` parameter

• Using an existing expression field in your dataset via the `expression_field` parameter
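The two modes differ only in which keyword argument carries the prompt. As a minimal sketch, the helper below builds the keyword arguments for the operator call and checks that exactly one prompt source is supplied. The function `referring_segmentation_kwargs` is hypothetical and not part of the plugin, and the "exactly one" rule is an assumption made for this sketch rather than documented plugin behavior.

```python
def referring_segmentation_kwargs(
    model_path, output_field, expression=None, expression_field=None
):
    """Build keyword arguments for the referring segmentation operator.

    Hypothetical helper: assumes you want exactly one prompt source,
    either a literal `expression` or a dataset `expression_field`.
    """
    if (expression is None) == (expression_field is None):
        raise ValueError(
            "Pass exactly one of `expression` or `expression_field`"
        )

    kwargs = {
        "model_path": model_path,
        "output_field": output_field,
        "delegate": True,
    }
    if expression is not None:
        kwargs["expression"] = expression
    else:
        kwargs["expression_field"] = expression_field
    return kwargs


kwargs = referring_segmentation_kwargs(
    "microsoft/Florence-2-large-ft",
    "open_expression_segmentations",
    expression="human",
)
print(kwargs)
```

The resulting dictionary can then be unpacked into the operator call used in this notebook, for example `await florence2_referring_segmentation(dataset, **kwargs)`.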

Note: Referring segmentation is a hard task in Visual AI, and as powerful as the Florence2 model is, the results are not always the best. It's a good idea to be precise with your open vocabulary prompt and use the shortest caption possible for each Sample.

Let's start by setting an environment variable, [downloading the plugin, and installing its requirements](https://beta-docs.voxel51.com/plugins/using_plugins/).

In [4]:
# set environment variable
import os
os.environ['FIFTYONE_ALLOW_LEGACY_ORCHESTRATORS'] = 'true'

In [None]:
# download the plugin
!fiftyone plugins download https://github.com/jacobmarks/fiftyone_florence2_plugin

In [None]:
# install requirements for the plugin
!fiftyone plugins requirements @jacobmarks/florence2 --install

With the plugin installed, you can instantiate the [Operator](https://beta-docs.voxel51.com/plugins/using_plugins/#using-operators) like this:

In [None]:
import fiftyone.operators as foo

MODEL_PATH = "microsoft/Florence-2-large-ft"

florence2_referring_segmentation = foo.get_operator("@jacobmarks/florence2/referring_expression_segmentation_with_florence2")

To use the operator, you will need to start a [Delegated service](https://beta-docs.voxel51.com/plugins/developing_plugins/#delegated-execution_1) by opening your terminal and running the following command:

```bash
fiftyone delegated launch
```

Then, you can [call the Operator](https://beta-docs.voxel51.com/plugins/using_plugins/#calling-operators) by running the following cell:

In [None]:
await florence2_referring_segmentation(
    dataset,
    model_path=MODEL_PATH,
    expression="human",
    output_field="open_expression_segmentations",
    delegate=True
)

You can examine the results of the first Sample like so:

In [7]:
dataset.first()['open_expression_segmentations']

<Polylines: {
    'polylines': [
        <Polyline: {
            'id': '67eac88e0c94125554ed2dd7',
            'attributes': {},
            'tags': [],
            'label': 'object_1',
            'points': [
                [
                    [0.184499990940094, 0.18949999350955393],
                    [0.28250000476837156, 0.08650000012481628],
                    [0.4114999771118164, 0.03249999984397966],
                    [0.5014999866485595, 0.013500000070209153],
                    [0.6154999732971191, 0.03950000053046915],
                    [0.7244999885559082, 0.08650000012481628],
                    [0.8255000114440918, 0.1804999915124936],
                    [0.8885000228881836, 0.29449999600587934],
                    [0.9215000152587891, 0.4465000019970603],
                    [0.9184999465942383, 0.5855000189720732],
                    [0.8614999771118164, 0.7404999840235174],
                    [0.7494999885559082, 0.8744999860205777],
                   

You can also use [the Florence2 Plugin](https://github.com/jacobmarks/fiftyone_florence2_plugin) when you have existing captions on your dataset. We don't have those here, so let's generate captions and then use them for segmentation. Start by instantiating the Operator for this task:

In [8]:
import fiftyone.operators as foo

florence2_captioning = foo.get_operator("@jacobmarks/florence2/caption_with_florence2")

Then call the operator, as you did previously:

In [None]:
await florence2_captioning(
    dataset,
    model_path=MODEL_PATH,
    detail_level="basic",
    output_field="basic_caption",
    delegate=True
    )

In [10]:
dataset.first()['basic_caption']

'A cat curled up in a bowl on a wooden floor.'

In [None]:
await florence2_referring_segmentation(
    dataset,
    model_path=MODEL_PATH,
    expression_field="basic_caption", #must be a field on your dataset
    output_field="expression_field_segmentations",
    delegate=True
)

In [12]:
dataset.first()['expression_field_segmentations']

<Polylines: {
    'polylines': [
        <Polyline: {
            'id': '67eac93f0c94125554ed2df0',
            'attributes': {},
            'tags': [],
            'label': 'object_1',
            'points': [
                [
                    [0.2244999885559082, 0.13549999712922578],
                    [0.3494999885559082, 0.047499997971735604],
                    [0.4485000133514404, 0.01749999976596949],
                    [0.5275000095367431, 0.015499999430525766],
                    [0.6174999713897705, 0.03649999856461289],
                    [0.7184999942779541, 0.08350000205946846],
                    [0.8164999961853028, 0.1704999927606563],
                    [0.8845000267028809, 0.28050001023493415],
                    [0.9204999923706054, 0.4265000044933857],
                    [0.9234999656677246, 0.5385000154772175],
                    [0.8984999656677246, 0.6594999660499744],
                    [0.8545000076293945, 0.7514999795301317],
                  

### Moondream + SAM2

This next workflow will show you how to leverage the [Moondream2 plugin](https://github.com/harpreetsahota204/moondream2-plugin) for FiftyOne alongside [SAM2 from the FiftyOne Model Zoo](https://voxel51.com/blog/sam-2-is-now-available-in-fiftyone/) for zero-shot segmentation. 

The process works by first using Moondream2 to automatically analyze your images and generate [Keypoints](https://beta-docs.voxel51.com/api/fiftyone.core.labels.Keypoint.html) of interest in a zero-shot fashion, requiring no training data or manual annotation. These points then serve as a prompt for SAM2, which uses them to generate segmentation masks around the detected objects or regions. 

First, install the Moondream2 plugin and its requirements:

In [None]:
# download the plugin from the github repository
!fiftyone plugins download https://github.com/harpreetsahota204/moondream2-plugin

In [None]:
# install requirements for the plugin
!fiftyone plugins requirements @harpreetsahota/moondream2 --install

With the plugin installed, you can instantiate the [Operator](https://beta-docs.voxel51.com/plugins/using_plugins/#using-operators) like this:

In [13]:
import fiftyone.operators as foo

moondream_operator = foo.get_operator("@harpreetsahota/moondream2/moondream")

To use the operator, you will need to start a [Delegated service](https://beta-docs.voxel51.com/plugins/developing_plugins/#delegated-execution_1) by opening your terminal and running the following command:

```bash
fiftyone delegated launch
```

Then, you can [call the Operator](https://beta-docs.voxel51.com/plugins/using_plugins/#calling-operators) by running the following cell:

In [None]:
await moondream_operator(
    dataset,
    revision="2025-01-09",
    operation="point",
    output_field="moondream_point",
    delegate=True,
    object_type="person"
)

Use the [`reload` method](https://beta-docs.voxel51.com/api/fiftyone.core.dataset.Dataset.html#reload) of the Dataset to reload any in-memory samples from the database.

In [15]:
dataset.reload()

In this [Dataset](https://beta-docs.voxel51.com/api/fiftyone.core.dataset.Dataset.html), the first Sample doesn't contain a `person`. To see what a [Keypoint](https://beta-docs.voxel51.com/api/fiftyone.core.labels.Keypoint.html) from Moondream looks like, we can grab a later Sample by combining [the `skip` method](https://beta-docs.voxel51.com/api/fiftyone.core.collections.SampleCollection.html#skip) of the Dataset with [the `first` method](https://beta-docs.voxel51.com/api/fiftyone.core.collections.SampleCollection.html#first).

In [18]:
dataset.skip(15).first()['moondream_point']

<Keypoints: {
    'keypoints': [
        <Keypoint: {
            'id': '67eac9a60c94125554ed2e29',
            'attributes': {},
            'tags': [],
            'label': 'person',
            'points': [[0.59765625, 0.5869140625]],
            'confidence': None,
            'index': None,
        }>,
        <Keypoint: {
            'id': '67eac9a60c94125554ed2e2a',
            'attributes': {},
            'tags': [],
            'label': 'person',
            'points': [[0.2314453125, 0.6767578125]],
            'confidence': None,
            'index': None,
        }>,
    ],
}>

To use SAM2 from the FiftyOne Model Zoo you need to first install its dependencies. You can do so by running the following command:

`pip install "git+https://github.com/facebookresearch/sam2.git#egg=sam-2"`


Please [refer to the SAM2 GitHub repo](https://github.com/facebookresearch/sam2) for details and any troubleshooting. You can [refer to the Model Zoo documentation](https://beta-docs.voxel51.com/models/model_zoo/models/#med-sam-2-video-torch_1) for more information about which checkpoints are available in the FiftyOne Model Zoo.

In [None]:
sam2_model = foz.load_zoo_model("segment-anything-2.1-hiera-base-plus-image-torch")

dataset.apply_model(
    sam2_model,
    label_field="sam_segmentations",
    prompt_field="moondream_point",
)

In [20]:
dataset.skip(15).first()['sam_segmentations']

<Detections: {
    'detections': [
        <Detection: {
            'id': '67eacaa521dbe34b2775a1ec',
            'attributes': {},
            'tags': [],
            'label': 'person',
            'bounding_box': [
                0.578125,
                0.5247058823529411,
                0.0421875,
                0.1811764705882353,
            ],
            'mask': array([[False, False, False, ..., False, False, False],
                   [False, False, False, ..., False, False, False],
                   [False, False, False, ..., False, False, False],
                   ...,
                   [False, False, False, ...,  True,  True,  True],
                   [False, False, False, ...,  True,  True,  True],
                   [False, False, False, ..., False, False, False]]),
            'mask_path': None,
            'confidence': 0.8452418446540833,
            'index': None,
        }>,
        <Detection: {
            'id': '67eacaa521dbe34b2775a1ed',
            'att

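Each `Detection` stores its `bounding_box` as `[top-left-x, top-left-y, width, height]` in relative coordinates, and its `mask` covers only the region inside that box. The sketch below converts the first bounding box from the output above into pixel values. The helper name `box_to_pixels` is hypothetical, and the 640 x 425 image size is an assumption, since the Sample's dimensions are not shown in the output.

```python
def box_to_pixels(bounding_box, width, height):
    """Convert a relative [x, y, w, h] box to integer pixel values."""
    x, y, w, h = bounding_box
    return (
        round(x * width),
        round(y * height),
        round(w * width),
        round(h * height),
    )


# First bounding box from the SAM2 output above
bounding_box = [0.578125, 0.5247058823529411, 0.0421875, 0.1811764705882353]

# Assumed image size; in the notebook, read it from sample.metadata
x, y, w, h = box_to_pixels(bounding_box, width=640, height=425)
print(x, y, w, h)
```

Under that size assumption, a mask for this Detection would span roughly `h` rows and `w` columns, positioned at `(x, y)` in the full image.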
# Prompting with Keypoints

When passing FiftyOne keypoints to third-party segmentation models, such as those from Hugging Face or Ultralytics, you need to convert between coordinate systems.

FiftyOne's [Keypoint class](https://beta-docs.voxel51.com/api/fiftyone.core.labels.Keypoint.html) stores points in normalized coordinates within the [0,1] x [0,1] range, regardless of image dimensions. This normalization enables consistent representation across images of different sizes.

Many segmentation models instead expect point prompts in absolute pixel coordinates. Before feeding keypoints to such a model, multiply each normalized x by the image's width and each normalized y by its height, both of which are available in the Sample's metadata.
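As a minimal sketch of that conversion, the function below scales normalized points to integer pixel coordinates and clamps them to valid pixel indices. The name `to_pixel_coords` is hypothetical, the point values are taken from the Moondream output shown earlier, and the 640 x 425 image size is an assumption for illustration.

```python
def to_pixel_coords(points, width, height):
    """Scale normalized [x, y] points in [0, 1] to integer pixel coordinates."""
    pixel_points = []
    for x, y in points:
        # Clamp so a coordinate of exactly 1.0 maps to the last valid pixel
        px = max(0, min(int(x * width), width - 1))
        py = max(0, min(int(y * height), height - 1))
        pixel_points.append([px, py])
    return pixel_points


# Normalized keypoints from the Moondream output shown earlier
points = [[0.59765625, 0.5869140625], [0.2314453125, 0.6767578125]]

# Assumed image size; in the notebook, read it from sample.metadata
pixel_points = to_pixel_coords(points, width=640, height=425)
print(pixel_points)
```

The clamping step is a defensive choice: without it, a normalized coordinate of 1.0 would land one pixel past the image edge.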

### How to parse segmentation masks that are point coordinates
#### FastSAM from Ultralytics 


[FastSAM](https://docs.ultralytics.com/models/fast-sam) results expose segmentation masks as normalized coordinate arrays (`result.masks.xyn`). Each mask is a NumPy array of (x, y) coordinates, normalized to the [0, 1] range, that traces the boundary of a detected object.

When working with segmentation models that output point coordinates (boundary points of objects), here's what you need to know to display them in FiftyOne:

* Model outputs (often NumPy arrays or tensors) need to be converted to standard Python lists of coordinates.

* Ensure coordinates are normalized to [0,1] range if they aren't already.

* FiftyOne's Polyline expects a specific nesting structure - your points must be organized as a list of shapes, where each shape is a list of points.

* Specify that your polylines should be closed (connecting last point to first) and filled to properly represent segmentation masks.

* Store your polylines in a Polylines field to enable proper visualization in the FiftyOne UI.
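The conversion and nesting steps above can be sketched without any model in the loop. The helper `to_polyline_points` is a hypothetical name, and the rectangular contour and 640 x 480 image size are made-up values for illustration.

```python
def to_polyline_points(contour, width=None, height=None):
    """Convert one contour into the nested list-of-shapes structure.

    If width and height are given, the contour is assumed to be in pixel
    coordinates and is normalized to [0, 1]. Otherwise it is assumed to be
    normalized already.
    """
    shape = []
    for x, y in contour:
        # Cast to plain Python floats (handles NumPy scalars as well)
        x, y = float(x), float(y)
        if width is not None and height is not None:
            x, y = x / width, y / height
        shape.append([x, y])

    # Outer list holds shapes; each shape is a list of [x, y] points
    return [shape]


# Made-up rectangular contour in pixel coordinates
pixel_contour = [(64, 48), (320, 48), (320, 240), (64, 240)]
polyline_points = to_polyline_points(pixel_contour, width=640, height=480)
print(polyline_points)
```

The returned list has the nesting that `fo.Polyline(points=..., closed=True, filled=True)` expects, with one inner list per shape.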

Note: FastSAM can also accept text input and bounding boxes as prompts. Refer to the [Ultralytics documentation](https://docs.ultralytics.com/models/fast-sam) to learn more.

In [None]:
import torch
from ultralytics import FastSAM
import fiftyone as fo

# Set device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Load FastSAM model
model = FastSAM("FastSAM-s.pt")

# Process samples in dataset
for sample in dataset.iter_samples(progress=True):
    # Skip samples without keypoints (the field may be missing or set to None)
    keypoints_label = getattr(sample, "moondream_point", None)
    if keypoints_label is None or not keypoints_label.keypoints:
        continue
        
    # Get image path and dimensions
    image_path = sample.filepath
    image_width = sample.metadata.width
    image_height = sample.metadata.height
    
    # Process all keypoints in the sample
    all_keypoints = sample.moondream_point.keypoints
    
    # Collect all points from all keypoint objects
    all_pixel_points = []
    all_labels = []
    
    for keypoint_obj in all_keypoints:
        points = keypoint_obj.points
        # Convert to pixel coordinates
        pixel_points = [
            [int(point[0] * image_width), int(point[1] * image_height)]
            for point in points
        ]
        all_pixel_points.extend(pixel_points)
        all_labels.extend([1] * len(pixel_points))  # 1 for foreground
    
    # Run inference with all points
    results = model(image_path, 
                    device=device, 
                    retina_masks=True,
                    points=all_pixel_points, 
                    labels=all_labels,
                    conf=0.51,
                    iou=0.51
                    )
    
    result = results[0]
    
    # Check if masks were generated
    if hasattr(result, 'masks') and result.masks:
        masks = result.masks.xyn
        
        # Create polyline objects for each mask
        polylines = []
        
        for mask in masks:
            # Convert NumPy arrays to plain Python lists of floats with list comprehension
            points_list = [[float(point[0]), float(point[1])] for point in mask]
            
            # Create polyline with correct nesting
            polyline = fo.Polyline(
                points=[points_list],  # Each mask is one shape
                closed=True,
                filled=True
            )
            
            polylines.append(polyline)
        
        # Save to sample if we have valid polylines
        if polylines:
            polylines_field = fo.Polylines(polylines=polylines)
            sample["fastsam_segmentation"] = polylines_field
            sample.save()

Note that [we call `sample.save()` after adding predictions to each Sample](https://beta-docs.voxel51.com/faq/#why-didnt-changes-to-my-dataset-save). This method persists your changes to the FiftyOne database, ensuring that your generated segmentation masks are stored and accessible for future use.

In [None]:
dataset.reload()
dataset

In [None]:
dataset.first()['fastsam_segmentation']

## Summary

In this tutorial, you've learned how to:

- Implement zero-shot segmentation using different approaches without training custom models

- Use text prompts with Florence2 for phrase grounding segmentation

- Combine Moondream2's automatic keypoint detection with SAM2 for efficient segmentation

- Leverage FastSAM for point-based segmentation tasks

### Key Takeaways

- Zero-shot segmentation enables quick implementation of segmentation tasks without labeled training data

- Different prompting methods (text, points, hybrid) offer flexibility for various use cases

- FiftyOne plugins can significantly extend your computer vision capabilities

- Combining multiple models (like Moondream2 + SAM2) can create powerful workflows


### Next Steps

- Check out some more [FiftyOne Plugins](https://beta-docs.voxel51.com/plugins/#getting-started)

- Check out [SAM2](https://voxel51.com/blog/sam-2-is-now-available-in-fiftyone/) in the FiftyOne Model Zoo

- Check out [MedSAM2](https://beta-docs.voxel51.com/models/model_zoo/models/#med-sam-2-video-torch_1) in the FiftyOne Model Zoo

- Learn how to [evaluate segmentations](https://beta-docs.voxel51.com/fiftyone_concepts/evaluation/#semantic-segmentations) with the FiftyOne [Evaluation API](https://beta-docs.voxel51.com/fiftyone_concepts/evaluation/)
