# Zero-Shot Segmentation

Zero-shot segmentation is a computer vision task that aims to segment objects or regions in images without any training samples for those specific categories. It enables models to perform instance, semantic, or panoptic segmentation for novel categories by transferring visual knowledge learned from seen categories to unseen ones. Prompt-based zero-shot segmentation uses prompts to guide the segmentation process at test time without requiring retraining for new categories. This approach allows a single trained model to handle various segmentation tasks dynamically.

### Types of Prompts

**Text Prompts**
- Free-text descriptions that specify what to segment in an image
- The model uses pre-trained knowledge of text-image relationships to identify and segment the described objects

**Image Prompts**
- Visual examples that show what to segment
- Particularly useful when the target is difficult to describe in words
- Can be a reference image containing the object of interest
- The model compares the visual features between the prompt image and the target image to identify similar regions

**Hybrid Approaches**
- Some systems can accept either text or image prompts for the same model
- CLIPSeg is an example of a model that works with both text and image prompts by adding a decoder to CLIP
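
To make the comparison step concrete, here is a toy sketch of prompt-based matching. It assumes the prompt (text or image) and each image patch have already been embedded into a shared vector space. The vectors, the `segment_by_prompt` helper, and the 0.8 threshold are made up for illustration; this is not CLIPSeg's actual decoder, which is learned.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def segment_by_prompt(patch_embeddings, prompt_embedding, threshold=0.8):
    """Return a binary mask: 1 where a patch is similar to the prompt."""
    return [
        [1 if cosine_similarity(patch, prompt_embedding) >= threshold else 0
         for patch in row]
        for row in patch_embeddings
    ]


# Toy 2x3 grid of 2-D patch embeddings (made-up values)
patches = [
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]],
    [[0.1, 0.9], [1.0, 0.2], [0.0, 1.0]],
]
prompt = [1.0, 0.0]  # stands in for the embedding of a text or image prompt

mask = segment_by_prompt(patches, prompt)
print(mask)  # [[1, 1, 0], [0, 1, 0]]
```

Because the prompt is only an embedding, the same matching step works whether it came from a sentence or from a reference image.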

Let's begin by downloading a dataset from the FiftyOne [Dataset Zoo](https://beta-docs.voxel51.com/data/dataset_zoo/datasets/). 

You'll notice I have passed several arguments to the [`load_zoo_dataset`](https://beta-docs.voxel51.com/api/fiftyone.zoo.datasets.html#load_zoo_dataset) function:

- `max_samples`: The [Dataset](https://beta-docs.voxel51.com/fiftyone_concepts/using_datasets/) will contain at most 25 [Samples](https://beta-docs.voxel51.com/getting_started/basic/datasets_samples_fields/)

- `shuffle`: Randomize the [Samples](https://beta-docs.voxel51.com/api/fiftyone.core.sample.Sample.html) that are selected 

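- `classes`: Load only Samples containing at least one instance of a specified class. The `quickstart` importer does not support this parameter, so it is ignored here (see the log output below)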

- `dataset_name`: Assign a [name](https://beta-docs.voxel51.com/api/fiftyone.core.dataset.Dataset.html#name) to the [Dataset](https://beta-docs.voxel51.com/api/fiftyone.core.dataset.Dataset.html#name)

In [1]:
import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset(
    "quickstart", 
    max_samples=25,
    shuffle=True,
    classes=['person', 'car', 'train'],
    dataset_name="mini_quickstart",
    overwrite=True,
    )

Overwriting existing directory '/home/harpreet/fiftyone/quickstart'
Downloading dataset to '/home/harpreet/fiftyone/quickstart'
Downloading dataset...
 100% |████|  187.5Mb/187.5Mb [882.5ms elapsed, 0s remaining, 212.5Mb/s]      
Extracting dataset...
Parsing dataset metadata
Found 200 samples
Dataset info written to '/home/harpreet/fiftyone/quickstart/info.json'
Ignoring unsupported parameter 'classes' for importer type <class 'fiftyone.utils.data.importers.FiftyOneDatasetImporter'>
Loading 'quickstart'
 100% |███████████████████| 25/25 [241.3ms elapsed, 0s remaining, 104.3 samples/s]    
Dataset 'mini_quickstart' created


We'll need [metadata](https://beta-docs.voxel51.com/fiftyone_concepts/using_datasets/#storing-field-metadata), such as each Sample's height and width, later on. The [`compute_metadata`](https://beta-docs.voxel51.com/api/fiftyone.core.collections.SampleCollection.html#compute_metadata) method of the Dataset computes it:

In [2]:
dataset.compute_metadata()

Computing metadata...
 100% |███████████████████| 25/25 [12.9ms elapsed, 0s remaining, 1.9K samples/s] 


Here's what the metadata for the first Sample looks like:

In [3]:
dataset.first().metadata

<ImageMetadata: {
    'size_bytes': 108669,
    'mime_type': 'image/jpeg',
    'width': 600,
    'height': 400,
    'num_channels': 3,
}>
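
The width and height matter because FiftyOne stores label coordinates relative to the image size, as values in [0, 1]. Below is a minimal sketch of converting a relative bounding box to pixel coordinates for the 600x400 image above. `to_pixel_box` is a hypothetical helper, not part of the FiftyOne API, and the box values are made up.

```python
def to_pixel_box(rel_box, width, height):
    """Convert a relative [x, y, w, h] box to pixel [x, y, w, h]."""
    x, y, w, h = rel_box
    return [
        round(x * width),
        round(y * height),
        round(w * width),
        round(h * height),
    ]


# Metadata values from the first Sample above
width, height = 600, 400

# A made-up relative box: top-left at (25%, 50%), spanning 50% x 25% of the image
pixel_box = to_pixel_box([0.25, 0.5, 0.5, 0.25], width, height)
print(pixel_box)  # [150, 200, 300, 100]
```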

# Phrase Grounding Segmentation

Phrase grounding segmentation extends traditional phrase grounding by not only localizing objects mentioned in text but also generating pixel-level segmentation masks for those objects. While phrase grounding typically produces bounding boxes around regions corresponding to textual phrases, phrase grounding segmentation aims to create fine-grained segmentation masks that precisely delineate the boundaries of the referenced objects.

This approach enables more precise visual understanding by:
- Associating specific words or phrases with their corresponding image regions
- Generating pixel-accurate segmentation masks rather than just bounding boxes
- Creating a more detailed alignment between language and visual content
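
The difference between a mask and a box can be seen on a toy example. The sketch below derives the tightest bounding box from a binary mask and compares the two areas. `mask_to_box` and the mask values are illustrative assumptions, not output from any model.

```python
def mask_to_box(mask):
    """Return (x_min, y_min, x_max, y_max) of the nonzero pixels in a 2-D mask."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, value in enumerate(row) if value]
    return min(xs), min(ys), max(xs), max(ys)


# Toy 4x5 mask of an L-shaped object (made-up values)
mask = [
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

x_min, y_min, x_max, y_max = mask_to_box(mask)
box_area = (x_max - x_min + 1) * (y_max - y_min + 1)
mask_area = sum(sum(row) for row in mask)

print((x_min, y_min, x_max, y_max))  # (1, 0, 3, 2)
print(box_area, mask_area)           # 9 5
```

Here the box covers 9 pixels but only 5 belong to the object, which is the detail a segmentation mask preserves and a bounding box loses.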


## Plugins

[FiftyOne plugins](https://beta-docs.voxel51.com/plugins/) are powerful extensions that allow users to customize and enhance the functionality of the [FiftyOne App](https://beta-docs.voxel51.com/getting_started/basic/application_tour/). 

[Plugins can be written in Python, JavaScript, or a combination of both](https://beta-docs.voxel51.com/plugins/developing_plugins/), enabling users to add new features, create integrations with other tools and APIs, render custom panels, and add custom actions to menus. They are composed of panels, operators, and components, which together allow for building full-featured interactive data applications tailored to specific use cases. Plugins can range from simple actions like adding a checkbox to complex workflows such as requesting annotations from a configurable backend. This extensibility makes FiftyOne highly adaptable to various computer vision tasks and workflows, limited only by the user's imagination.

We'll use the Plugin framework via the FiftyOne SDK. You can also [refer to the docs on using the Plugin framework in the FiftyOne App](https://beta-docs.voxel51.com/plugins/using_plugins/).

### Florence2 Plugin

The [Florence2 Plugin](https://github.com/jacobmarks/fiftyone_florence2_plugin) integrates Microsoft's Florence2 Vision-Language Model with FiftyOne datasets, offering several powerful computer vision capabilities.

One of these tasks is referring segmentation, which segments specific regions in an image based on natural language descriptions. It can be used in two ways:

- Using a direct expression through the `expression` parameter

- Using an existing expression field in your dataset via the `expression_field` parameter

Note: Referring segmentation is a hard task in Visual AI, and as powerful as the Florence2 model is, the results are not always the best. It's a good idea to be precise with your open vocabulary prompt and use the shortest caption possible for each Sample.

Let's start by setting an environment variable, [downloading the plugin, and installing its requirements](https://beta-docs.voxel51.com/plugins/using_plugins/).

In [4]:
# set environment variable
import os
os.environ['FIFTYONE_ALLOW_LEGACY_ORCHESTRATORS'] = 'true'

In [None]:
# download the plugin
!fiftyone plugins download https://github.com/jacobmarks/fiftyone_florence2_plugin

In [None]:
# install requirements for the plugin
!fiftyone plugins requirements @jacobmarks/florence2 --install

With the plugin installed, you can instantiate the [Operator](https://beta-docs.voxel51.com/plugins/using_plugins/#using-operators) like this:

In [None]:
import fiftyone.operators as foo

MODEL_PATH = "microsoft/Florence-2-large-ft"

florence2_referring_segmentation = foo.get_operator("@jacobmarks/florence2/referring_expression_segmentation_with_florence2")

To use the operator, you will need to start a [Delegated service](https://beta-docs.voxel51.com/plugins/developing_plugins/#delegated-execution_1) by opening your terminal and running the following command:

```bash
fiftyone delegated launch
```

Then, you can [call the Operator](https://beta-docs.voxel51.com/plugins/using_plugins/#calling-operators) by running the following cell:

In [None]:
await florence2_referring_segmentation(
    dataset,
    model_path=MODEL_PATH,
    expression="human",
    output_field="open_expression_segmentations",
    delegate=True
)

You can examine the results of the first Sample like so:

In [None]:
dataset.first()['open_expression_segmentations']

You can also use [the Florence2 Plugin](https://github.com/jacobmarks/fiftyone_florence2_plugin) when you have existing captions on your dataset. We don't have any here, so let's generate captions first and then use them for segmentation. Start by instantiating the Operator for this task:

In [12]:
import fiftyone.operators as foo

florence2_captioning = foo.get_operator("@jacobmarks/florence2/caption_with_florence2")

Then call the Operator, as you did previously:

In [None]:
await florence2_captioning(
    dataset,
    model_path=MODEL_PATH,
    detail_level="basic",
    output_field="basic_caption",
    delegate=True
    )

In [None]:
dataset.first()['basic_caption']

In [None]:
await florence2_referring_segmentation(
    dataset,
    model_path=MODEL_PATH,
    expression_field="basic_caption", #must be a field on your dataset
    output_field="expression_field_segmentations",
    delegate=True
)

In [None]:
dataset.first()['expression_field_segmentations']

### Moondream + SAM2

This next workflow will show you how to leverage the [Moondream2 plugin](https://github.com/harpreetsahota204/moondream2-plugin) for FiftyOne alongside [SAM2 from the FiftyOne Model Zoo](https://voxel51.com/blog/sam-2-is-now-available-in-fiftyone/) for zero-shot segmentation. 

The process works by first using Moondream2 to automatically analyze your images and generate [Keypoints](https://beta-docs.voxel51.com/api/fiftyone.core.labels.Keypoint.html) of interest in a zero-shot fashion, requiring no training data or manual annotation. These points then serve as a prompt for SAM2, which uses them to generate segmentation masks around the detected objects or regions. 

First, install the Moondream2 plugin and its requirements:

In [None]:
# download the plugin from the github repository
!fiftyone plugins download https://github.com/harpreetsahota204/moondream2-plugin

In [None]:
# install requirements for the plugin
!fiftyone plugins requirements @harpreetsahota/moondream2 --install

With the plugin installed, you can instantiate the [Operator](https://beta-docs.voxel51.com/plugins/using_plugins/#using-operators) like this:

In [9]:
import fiftyone.operators as foo

moondream_operator = foo.get_operator("@harpreetsahota/moondream2/moondream")

To use the operator, you will need to start a [Delegated service](https://beta-docs.voxel51.com/plugins/developing_plugins/#delegated-execution_1) by opening your terminal and running the following command:

```bash
fiftyone delegated launch
```

Then, you can [call the Operator](https://beta-docs.voxel51.com/plugins/using_plugins/#calling-operators) by running the following cell:

In [None]:
await moondream_operator(
    dataset,
    revision="2025-01-09",
    operation="point",
    output_field="moondream_point",
    delegate=True,
    object_type="people"
)

To use SAM2 from the FiftyOne Model Zoo you need to first install its dependencies. You can do so by running the following command:

`pip install "git+https://github.com/facebookresearch/sam2.git#egg=sam-2"`


Please [refer to the SAM2 GitHub repo](https://github.com/facebookresearch/sam2) for details and any troubleshooting. You can [refer to the Model Zoo documentation](https://beta-docs.voxel51.com/models/model_zoo/models/#med-sam-2-video-torch_1) for more information about which checkpoints are available in the FiftyOne Model Zoo.

In [None]:
model = foz.load_zoo_model("segment-anything-2.1-hiera-base-plus-image-torch")

dataset.apply_model(
    model,
    label_field="sam_segmentations",
    prompt_field="moondream_point",
)

# Using a Hugging Face Pipeline


#### With a KeyPoint

In [None]:
from PIL import Image
import torch
from transformers import SamModel, SamProcessor
import fiftyone.utils.transformers as fout

# Set device based on availability
device = (
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)

# Initialize model and processor
model = SamModel.from_pretrained("facebook/sam-vit-base", device_map=device)
processor = SamProcessor.from_pretrained("facebook/sam-vit-base")

# Process samples in dataset
for sample in dataset.iter_samples(progress=True):
    # Load image
    image = Image.open(sample.filepath)
    
    # Convert normalized FiftyOne keypoints [0,1] to pixel coordinates
    image_width = sample.metadata["width"]
    image_height = sample.metadata["height"]
    
    # Access points from the Keypoint object
    keypoint_obj = sample.moondream_point  # or whatever field name contains the Keypoint
    
    # Convert all points to pixel coordinates
    pixel_points = [
        [
            int(point[0] * image_width),  # x coordinate
            int(point[1] * image_height)  # y coordinate
        ]
        for point in keypoint_obj.points
    ]
    
    # Format points for SAM (needs list of lists of points)
    input_points = [pixel_points]  # Single list of multiple points
    
    # Prepare inputs and move to appropriate device
    inputs = processor(
        image, 
        input_points=input_points, 
        return_tensors="pt"
    ).to(device)
    
    # Generate predictions
    with torch.no_grad():
        outputs = model(**inputs)
    
    # Post-process masks
    masks = processor.image_processor.post_process_masks(
        outputs.pred_masks.cpu(),  # Move tensors back to CPU for processing
        inputs["original_sizes"].cpu(),
        inputs["reshaped_input_sizes"].cpu()
    )
    
    # Convert masks to FiftyOne format and store predictions
    sample["seg_predictions"] = fout.to_segmentation(masks)
    sample.save()

### Hugging Face Pipeline

#### Without a KeyPoint

In [None]:
from transformers import pipeline

generator = pipeline("mask-generation", model="Zigeng/SlimSAM-uniform-50", points_per_batch=64, device="cuda")
image_url = "https://huggingface.co/ybelkada/segment-anything/resolve/main/assets/car.png"
outputs = generator(image_url)
masks = outputs["masks"]
# array of multiple binary masks returned for each generated mask


In [None]:
import fiftyone.utils.transformers as fout

fout.to_segmentation(masks)
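
The pipeline returns one binary mask per detected region. If a single-channel label map is more convenient downstream, the masks can be flattened into one. The sketch below does this on plain nested lists. `masks_to_label_map` and the larger-first painting order are assumptions for illustration, not part of FiftyOne or Transformers.

```python
def masks_to_label_map(masks):
    """Merge binary masks into one label map; 0 is background, mask i gets label i + 1."""
    height, width = len(masks[0]), len(masks[0][0])
    label_map = [[0] * width for _ in range(height)]

    # Paint larger masks first so smaller masks stay visible on top
    order = sorted(range(len(masks)), key=lambda i: -sum(map(sum, masks[i])))
    for i in order:
        for y in range(height):
            for x in range(width):
                if masks[i][y][x]:
                    label_map[y][x] = i + 1
    return label_map


# Two made-up 3x4 binary masks; the second overlaps the first
toy_masks = [
    [[1, 1, 1, 0], [1, 1, 1, 0], [0, 0, 0, 0]],
    [[0, 0, 0, 0], [0, 1, 1, 0], [0, 0, 0, 0]],
]

label_map = masks_to_label_map(toy_masks)
print(label_map)  # [[1, 1, 1, 0], [1, 2, 2, 0], [0, 0, 0, 0]]
```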

# Arbitrary Models

In [None]:
from transformers import (
    CLIPSegConfig,
    CLIPSegModel,
    CLIPSegTextConfig,
    CLIPSegVisionConfig,
)

# Initializing a CLIPSegConfig with CIDAS/clipseg-rd64 style configuration
configuration = CLIPSegConfig()

# Initializing a CLIPSegModel (with random weights) from the CIDAS/clipseg-rd64 style configuration
model = CLIPSegModel(configuration)

# Accessing the model configuration
configuration = model.config

# We can also initialize a CLIPSegConfig from a CLIPSegTextConfig and a CLIPSegVisionConfig

# Initializing a CLIPSegText and CLIPSegVision configuration
config_text = CLIPSegTextConfig()
config_vision = CLIPSegVisionConfig()

config = CLIPSegConfig.from_text_vision_configs(config_text, config_vision)

In [None]:
from transformers import AutoTokenizer, CLIPSegTextModel

tokenizer = AutoTokenizer.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegTextModel.from_pretrained("CIDAS/clipseg-rd64-refined")

inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")

outputs = model(**inputs)
last_hidden_state = outputs.last_hidden_state
pooled_output = outputs.pooler_output  # pooled (EOS token) states
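
The cell above only runs the text encoder. To get a mask, CLIPSeg's segmentation model (`CLIPSegForImageSegmentation` in Transformers) produces per-pixel logits, which are typically passed through a sigmoid and thresholded. Below is a toy sketch of that last step on made-up logits. The `logits_to_mask` helper and the 0.5 threshold are assumptions.

```python
import math


def logits_to_mask(logits, threshold=0.5):
    """Apply a sigmoid to each logit and threshold the resulting probability."""
    return [
        [1 if 1.0 / (1.0 + math.exp(-value)) > threshold else 0 for value in row]
        for row in logits
    ]


# Made-up 2x3 grid of per-pixel logits
toy_logits = [
    [-3.0, 0.2, 4.0],
    [-0.5, 2.5, -6.0],
]

toy_mask = logits_to_mask(toy_logits)
print(toy_mask)  # [[0, 1, 1], [0, 1, 0]]
```

Raising the threshold trades recall for precision: fewer pixels are kept, but those that remain have higher predicted probability.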



### Next Steps

- Check out some more [FiftyOne Plugins](https://beta-docs.voxel51.com/plugins/#getting-started)

- Check out [SAM2 in the FiftyOne Model Zoo](https://beta-docs.voxel51.com/models/model_zoo/models/#med-sam-2-video-torch_1)

- Learn how to evaluate segmentations with the FiftyOne Evaluation API