## Who this is for

This tutorial is designed for machine learning practitioners who:
- Have basic familiarity with FiftyOne (used it at least once before)
- Are interested in exploring zero-shot object detection without training models
- Want to quickly test different zero-shot detection models on their datasets

## Assumed Knowledge

**Computer Vision Concepts:**
- Understanding of object detection and bounding boxes
- Familiarity with confidence scores and model predictions
- Basic knowledge of zero-shot learning concepts

**Technical Requirements:**
- Intermediate Python programming skills
- Experience with PyTorch and/or Hugging Face
- Ability to work with image datasets and common formats (jpg, png)

**FiftyOne Concepts:**
You should be familiar with:
- [Datasets and Samples](https://beta-docs.voxel51.com/getting_started/basics.html)
- [Working with Labels](https://beta-docs.voxel51.com/user_guide/using_datasets.html#labels)
- [Model Zoo](https://beta-docs.voxel51.com/user_guide/model_zoo/index.html)
- [Dataset Zoo](https://beta-docs.voxel51.com/user_guide/dataset_zoo/index.html)

## Time to complete

Estimated time: 30-45 minutes
- Setup: 5-10 minutes
- Tutorial: 20-25 minutes
- Experimentation: 10+ minutes

## Required packages

Make sure you have a virtual environment with FiftyOne already installed. Then install the following packages:

```bash
# Install required packages
pip install fiftyone
pip install torch torchvision
# Quote the specifier so the shell does not treat "<" as input redirection
pip install "transformers<=4.49"
pip install ultralytics
pip install pillow
```
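
If you want to confirm the environment before continuing, a small check like the one below can help. It is a minimal sketch that uses only the standard library. The helper names (`version_tuple`, `installed_versions`, `satisfies_transformers_pin`) are hypothetical, and the version comparison is a simplification that looks only at the first three numeric components rather than applying pip's full version rules.

```python
import re
from importlib import metadata

# Distribution names taken from the install commands above
REQUIRED = ["fiftyone", "torch", "torchvision", "transformers", "ultralytics", "pillow"]


def version_tuple(version, width=3):
    """Return the first `width` numeric components of a version string, zero-padded."""
    numbers = [int(n) for n in re.findall(r"\d+", version)[:width]]
    return tuple(numbers + [0] * (width - len(numbers)))


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def satisfies_transformers_pin(version, ceiling=(4, 49, 0)):
    """Approximate check for the `transformers<=4.49` pin used in this tutorial."""
    return version_tuple(version) <= ceiling


if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` needs to be installed before you run the cells that follow.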

## What's covered in this tutorial

This tutorial covers:
1. **Dataset Loading** - Loading a street scene dataset from FiftyOne's Dataset Zoo

2. **Hugging Face Integration** - Using OWL-ViT for zero-shot detection through FiftyOne's Hugging Face integration

3. **Ultralytics Integration** - Implementing YOLO-World for zero-shot detection

4. **Plugin Usage** - Exploring the Florence2 plugin for additional zero-shot capabilities

5. **Custom Implementation** - Understanding how to implement arbitrary zero-shot detection models in FiftyOne

Each section builds upon the previous ones, demonstrating different approaches to zero-shot detection while highlighting FiftyOne's flexibility in working with various model frameworks.

# Zero-Shot Detection

# Load Dataset

Let's load a [Dataset](https://beta-docs.voxel51.com/getting_started/basic/datasets_samples_fields/) from the FiftyOne [Dataset Zoo](https://beta-docs.voxel51.com/data/dataset_zoo/datasets/). In this tutorial, we'll use the [Quickstart Geo](https://beta-docs.voxel51.com/data/dataset_zoo/datasets/#dataset-zoo-quickstart-geo) dataset. This is a small Dataset consisting of 500 images from the validation split of the [BDD100K dataset](https://beta-docs.voxel51.com/data/dataset_zoo/datasets/#dataset-zoo-bdd100k) in the New York City area, with object detections and GPS timestamps.

In [None]:
import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart-geo")

Let's make up a list of classes to detect. Since the Dataset we're working with is from New York City streets, we'll focus on vehicles, traffic infrastructure, and urban elements that we'd expect to see in NYC traffic scenes. 

This includes various car types, traffic signals, street furniture, and public transportation.

In [2]:
detection_classes = [
    "yellow cab",
    "sedan",
    "coupe",
    "hatchback",
    "SUV",
    "pickup truck",
    "station wagon",
    "crossover",
    "minivan",
    "green light",
    "red light",
    "illuminated tail lights",
    "illuminated head lights",
    "tow truck",
    "parking meter",
    "traffic barrier",
    "traffic cone",
    "bus stop",
    "storefront",
    "construction vehicle",
    "municipal bus",
    "charter bus"
]

# Model Zoo

The FiftyOne Model Zoo provides a powerful interface for downloading models and applying them to your FiftyOne datasets.

It provides native access to hundreds of pre-trained models, and it also supports downloading arbitrary public or private models whose definitions are provided via GitHub repositories or URLs.

In fact, the [Model Zoo](https://beta-docs.voxel51.com/models/model_zoo/) is so flexible that you can natively load certain Hugging Face Transformers models and Ultralytics models for zero-shot object detection as a Zoo model via the [`load_zoo_model`](https://beta-docs.voxel51.com/api/fiftyone.zoo.models.html#load_zoo_model) method.

## Hugging Face Integration

FiftyOne integrates with [Hugging Face's Transformers](https://beta-docs.voxel51.com/integrations/huggingface/#zero-shot-object-detection) library for Zero Shot Detection models. This allows you to load a Transformers model as a [Zoo Model](https://beta-docs.voxel51.com/models/model_zoo/).


To load a model from the Hugging Face Hub, set `name_or_url="zero-shot-detection-transformer-torch"`. This specifies that you want to load a zero-shot object detection model from the Hugging Face Transformers library. You can then specify the model via the `name_or_path` argument, which should be the repository name or model identifier of the model you want to load.


Note: the `confidence_thresh` parameter is optional and can be used to filter out predictions with confidence scores below the specified threshold. You may need to adjust this value based on the model and dataset you are working with.
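
To make the effect of a threshold concrete, here is a minimal sketch in plain Python. It is not FiftyOne's implementation: the `filter_by_confidence` helper and the example scores are hypothetical, and whether a score exactly equal to the threshold is kept is an assumption of this sketch.

```python
def filter_by_confidence(predictions, confidence_thresh):
    """Keep only predictions whose score is at or above the threshold."""
    return [p for p in predictions if p["confidence"] >= confidence_thresh]


# Hypothetical predictions for a single image (scores are illustrative only)
predictions = [
    {"label": "yellow cab", "confidence": 0.42},
    {"label": "storefront", "confidence": 0.13},
    {"label": "traffic cone", "confidence": 0.07},
]

kept = filter_by_confidence(predictions, confidence_thresh=0.1)
print([p["label"] for p in kept])
```

Raising the threshold trades recall for precision. Zero-shot models often produce low scores for unusual class names, so a low starting value followed by inspection of the results is a reasonable way to tune it.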

In [3]:
import torch
import fiftyone.zoo as foz

device="cuda" if torch.cuda.is_available() else "cpu"

owlvit = foz.load_zoo_model(
    "zero-shot-detection-transformer-torch",
    text_prompt="a photo of a ", # per the model card
    name_or_path="google/owlvit-base-patch32",  # HF model name or path
    classes=detection_classes,
    device=device,
    confidence_thresh=0.1,  # arbitrarily low threshold
    # install_requirements=True # uncomment to install the necessary requirements
)

dataset.apply_model(
    owlvit,
    label_field="owlvit_detections",
)

 100% |█████████████████| 500/500 [29.5s elapsed, 0s remaining, 17.7 samples/s]      


You can examine the results by [skipping to an arbitrary Sample](https://beta-docs.voxel51.com/api/fiftyone.core.collections.SampleCollection.html#skip) as follows:

In [8]:
dataset.skip(42).first()['owlvit_detections']


<Detections: {
    'detections': [
        <Detection: {
            'id': '67e1dae20b9d9cbc6d0ef665',
            'attributes': {},
            'tags': [],
            'label': 'storefront',
            'bounding_box': [
                0.0014444444444444446,
                -0.0026875,
                0.17194444444444443,
                0.47613281250000006,
            ],
            'mask': None,
            'mask_path': None,
            'confidence': 0.13197840750217438,
            'index': None,
        }>,
        <Detection: {
            'id': '67e1dae20b9d9cbc6d0ef666',
            'attributes': {},
            'tags': [],
            'label': 'storefront',
            'bounding_box': [
                0.13490277777777776,
                0.26649218750000003,
                0.10397222222222224,
                0.19666406250000001,
            ],
            'mask': None,
            'mask_path': None,
            'confidence': 0.15454693138599396,
            'index': None
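
The `bounding_box` values in this output are relative `[top-left-x, top-left-y, width, height]` coordinates. If you need pixel coordinates, for example to crop a detection, a conversion along these lines would work. This is a minimal sketch: `to_pixel_box` is a hypothetical helper, not part of FiftyOne, and clamping to the image bounds is a choice made here to handle boxes that extend slightly past the image edge, like the small negative `y` value above.

```python
def to_pixel_box(bounding_box, image_width, image_height):
    """Convert relative [x, y, w, h] to absolute (x1, y1, x2, y2), clamped to the image."""

    def clamp(value):
        # Relative coordinates outside [0, 1] lie beyond the image boundary
        return max(0.0, min(1.0, value))

    x, y, w, h = bounding_box
    x1 = clamp(x) * image_width
    y1 = clamp(y) * image_height
    x2 = clamp(x + w) * image_width
    y2 = clamp(y + h) * image_height
    return (x1, y1, x2, y2)


# A box covering the middle half of a hypothetical 200x100 image
print(to_pixel_box([0.25, 0.5, 0.5, 0.25], image_width=200, image_height=100))
```

The image dimensions must come from the sample itself, for example from the image file or from the sample's metadata if you have computed it.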

Any model that can be run in a Hugging Face pipeline for the `zero-shot-object-detection` task can be loaded as a Zoo model.

A good first step is simply to try it: pass the model name as `name_or_path` to the [`load_zoo_model`](https://beta-docs.voxel51.com/api/fiftyone.zoo.models.html#fiftyone.zoo.models.load_zoo_model) function. If a Hugging Face model is not compatible with the integration, you'll see an error to the effect of:

```python
ValueError: Unrecognized model in <whatever-model-name>
```

In this case, you will need to run the model manually. All this means is that you need to instantiate the model and its processor, then write some logic to parse the model output into a [FiftyOne Detection](https://beta-docs.voxel51.com/api/fiftyone.core.labels.Detection.html). I'll show you how to do this later on in this tutorial.

## Ultralytics Integration

FiftyOne [integrates natively with Ultralytics](https://beta-docs.voxel51.com/integrations/ultralytics/), so you can load, fine-tune, and run inference with your favorite Ultralytics models on your FiftyOne datasets with just a few lines of code.

Check out the [documentation for our Ultralytics integration](https://docs.voxel51.com/integrations/ultralytics.html#open-vocabulary-detection) if you're interested in using an Ultralytics model manually rather than loading it as a Zoo model.

In [None]:
!pip install ultralytics

In [None]:
import torch
import fiftyone.zoo as foz

device="cuda" if torch.cuda.is_available() else "cpu"

yolo_world = foz.load_zoo_model(
    "yolov8s-world-torch", 
    classes=detection_classes,
    device=device,
    confidence_thresh=0.2,
    # install_requirements=True # uncomment to install the necessary requirements
)

dataset.apply_model(yolo_world, label_field="yolow_detections")


  72% |████████████\----| 361/500 [10.1s elapsed, 3.8s remaining, 36.8 samples/s]    

In [None]:
dataset.skip(42).first()['yolow_detections']

# Plugins

You can also run zero-shot detection via FiftyOne Plugins, such as the [Florence2 plugin](https://github.com/jacobmarks/fiftyone_florence2_plugin).

The example below will show you how to use the Florence2 plugin for zero-shot object detection and zero-shot open vocabulary detection. Begin by downloading the plugin and installing requirements:


In [None]:
!fiftyone plugins download https://github.com/jacobmarks/fiftyone_florence2_plugin

In [None]:
!fiftyone plugins requirements @jacobmarks/florence2 --install

Next, instantiate the operator:

In [None]:
import fiftyone.operators as foo

MODEL_PATH = "microsoft/Florence-2-base-ft"

florence2_detection = foo.get_operator("@jacobmarks/florence2/detect_with_florence2")

You should start a [delegated service](https://beta-docs.voxel51.com/plugins/developing_plugins/#delegated-execution_1) for this [Operator](https://beta-docs.voxel51.com/plugins/using_plugins/#calling-operators). To do so, open a terminal and run the following command:

```shell
fiftyone delegated launch
```

You'll use the `await` syntax and pass the `delegate=True` argument when running this plugin via notebook. Here's how you can use the plugin for zero-shot object detection:

In [None]:
await florence2_detection(
    dataset,
    model_path=MODEL_PATH,
    detection_type="detection",
    output_field="zero_shot_detections",
    delegate=True
)

In [None]:
dataset.first()['zero_shot_detections']

You can also use Florence2 for zero-shot open vocabulary detection. Note that the model only supports passing one candidate label for this task:

In [None]:
await florence2_detection(
    dataset,
    model_path=MODEL_PATH,
    detection_type="open_vocabulary_detection",
    text_prompt="pedestrian in intersection",  # the object you want to detect
    output_field="open_detection",
    delegate=True
)

In [None]:
dataset.first()['open_detection']

Visit the [Florence2 Plugin's GitHub Repo](https://github.com/jacobmarks/fiftyone_florence2_plugin) for more detail about using this plugin.

# Arbitrary Models

Regardless of which zero-shot object detection model you use, the process of converting predictions to FiftyOne format follows the same general pattern:

1. **Standardize Bounding Box Format**
   - FiftyOne [Detection labels](https://beta-docs.voxel51.com/api/fiftyone.core.labels.Detection.html) expect bounding boxes in relative coordinates in the range [0, 1]
   - Format must be [top-left-x, top-left-y, width, height]
   - Most models output absolute coordinates or different formats, so conversion is usually needed

2. **Create Detection Objects**
   - Each [Detection](https://beta-docs.voxel51.com/api/fiftyone.core.labels.Detection.html) object needs three main components:
     - `label`: the class name
     - `bounding_box`: the normalized coordinates
     - `confidence`: the detection score
   - The individual Detection objects must be grouped into a [Detections](https://beta-docs.voxel51.com/api/fiftyone.core.labels.Detections.html) [Field for each Sample](https://beta-docs.voxel51.com/getting_started/basic/datasets_samples_fields/). 

3. **Batch Processing Strategy**
   - Instead of updating samples one by one, collect all detections
   - Use [`dataset.set_values()`](https://beta-docs.voxel51.com/api/fiftyone.core.collections.SampleCollection.html#set_values) for efficient batch updates
   - This is much faster than individual [`sample.save()`](https://beta-docs.voxel51.com/api/fiftyone.core.dataset.Dataset.html#save) calls
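
One detail of the batch strategy deserves care: the list passed to `set_values()` needs one entry per sample, in the same order as the samples. The sketch below shows one way to guarantee that when predictions arrive keyed by file path. It uses plain dictionaries in place of FiftyOne label objects, and `align_to_sample_order` is a hypothetical helper, not a FiftyOne function.

```python
def align_to_sample_order(filepaths, predictions_by_path):
    """Return one prediction list per filepath, in dataset order.

    Samples without predictions get an empty list, so the result always
    has exactly one entry per sample.
    """
    return [predictions_by_path.get(path, []) for path in filepaths]


# In a real workflow, `filepaths` would come from dataset.values("filepath")
filepaths = ["/data/a.jpg", "/data/b.jpg", "/data/c.jpg"]

# Hypothetical model output, keyed by path and in arbitrary order
predictions_by_path = {
    "/data/c.jpg": [{"label": "sedan", "confidence": 0.8}],
    "/data/a.jpg": [{"label": "SUV", "confidence": 0.6}],
}

aligned = align_to_sample_order(filepaths, predictions_by_path)
print([len(preds) for preds in aligned])
```

Each inner list would then be wrapped in a `Detections` object before the single `set_values()` call.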

The core workflow is:
- Get model predictions
- Convert coordinates to FiftyOne's expected format
- Create [Detection](https://beta-docs.voxel51.com/api/fiftyone.core.labels.Detection.html) objects
- Group them into [Detections](https://beta-docs.voxel51.com/api/fiftyone.core.labels.Detections.html) objects (one per [Sample](https://beta-docs.voxel51.com/api/fiftyone.core.sample.Sample.html))
- Batch update the [Dataset](https://beta-docs.voxel51.com/api/fiftyone.core.dataset.Dataset.html)

This pattern remains the same regardless of the model you're using, whether it's from the Hugging Face Hub, Torch Hub, or some brand new SOTA model that you can only use via its GitHub Repo. The only part that changes is how you extract and convert the specific model's output into FiftyOne [Detection](https://beta-docs.voxel51.com/api/fiftyone.core.labels.Detection.html) format.
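
The coordinate conversion is usually the only model-specific arithmetic. As a minimal sketch, here are two conversions written in plain Python. Both function names are hypothetical. The first assumes a model that outputs absolute corner coordinates, and the second assumes a model that outputs relative center-based boxes. Check your model's documentation to see which format it actually uses.

```python
def xyxy_abs_to_xywh_rel(box, image_width, image_height):
    """Absolute (x1, y1, x2, y2) pixels -> relative [x, y, width, height]."""
    x1, y1, x2, y2 = box
    return [
        x1 / image_width,
        y1 / image_height,
        (x2 - x1) / image_width,
        (y2 - y1) / image_height,
    ]


def cxcywh_rel_to_xywh_rel(box):
    """Relative (center_x, center_y, width, height) -> relative [x, y, width, height]."""
    cx, cy, w, h = box
    # Shift from the box center to its top-left corner
    return [cx - w / 2, cy - h / 2, w, h]


# A box spanning pixels (50, 50) to (150, 75) in a hypothetical 200x100 image
print(xyxy_abs_to_xywh_rel((50, 50, 150, 75), image_width=200, image_height=100))
```

Either result can be passed directly as the `bounding_box` of a `Detection`.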



In [None]:
import torch
import fiftyone as fo
from PIL import Image
from transformers import AutoProcessor, OmDetTurboForObjectDetection

device = "cuda" if torch.cuda.is_available() else "cpu"

# Initialize model and processor
processor = AutoProcessor.from_pretrained("omlab/omdet-turbo-swin-tiny-hf")
model = OmDetTurboForObjectDetection.from_pretrained(
    "omlab/omdet-turbo-swin-tiny-hf",
    device_map=device)

filepaths = dataset.values("filepath")

all_detections = []
for filepath in filepaths:
    # Load and process image
    image = Image.open(filepath)
    height, width = image.size[::-1]  # Get dimensions in same format as target_sizes
    
    inputs = processor(image, text=detection_classes, return_tensors="pt").to(device)
    
    with torch.no_grad():
        outputs = model(**inputs)
    
    results = processor.post_process_grounded_object_detection(
        outputs,
        text_labels=detection_classes,
        target_sizes=[image.size[::-1]],  # Keep model's expected format
        threshold=0.3,
        nms_threshold=0.3,
    )[0]
    
    scores = results["scores"].cpu().numpy()
    boxes = results["boxes"].cpu().numpy()
    text_labels = results["text_labels"]
    
    detections = []
    for score, class_name, box in zip(scores, text_labels, boxes):
        x1, y1, x2, y2 = box
        
        # First normalize all coordinates by their respective dimensions (x/width, y/height)
        x1 = x1 / width
        y1 = y1 / height
        x2 = x2 / width
        y2 = y2 / height
    
        # Then calculate width and height as differences of normalized coordinates
        w = x2 - x1  # width is right_x - left_x
        h = y2 - y1  # height is bottom_y - top_y
        
        detection = fo.Detection(
            label=class_name,
            bounding_box=[x1, y1, w, h],
            confidence=float(score)
        )
        detections.append(detection)
    
    all_detections.append(fo.Detections(detections=detections))

dataset.set_values("omdet_predictions", all_detections)

## Summary

This tutorial has introduced you to several approaches for performing zero-shot object detection using FiftyOne:

- Using pre-trained models through the Hugging Face integration
- Leveraging Ultralytics' YOLO-World model
- Exploring plugin-based solutions like Florence2
- Implementing custom zero-shot detection models


## Next Steps

To continue learning, you can:

- Learn more about our [integration with Hugging Face](https://beta-docs.voxel51.com/integrations/huggingface/)
- Check out the [Zero-Shot Detection Plugin](https://github.com/jacobmarks/zero-shot-prediction-plugin) and learn [more about Plugins](https://beta-docs.voxel51.com/plugins/using_plugins/) in general
- Learn more about [adding object detections to a Dataset](https://beta-docs.voxel51.com/how_do_i/recipes/adding_detections/)
- Use the [Moondream2 Plugin](https://github.com/harpreetsahota204/moondream2-plugin) for zero-shot object detection
- Learn how to [evaluate object detections with FiftyOne](https://beta-docs.voxel51.com/tutorials/evaluate_detections/)
- Learn more in our blog, [_Zero-Shot Image Classification with Multimodal Models and FiftyOne_](https://beta-docs.voxel51.com/tutorials/zero_shot_classification/)


Remember that zero-shot detection is a rapidly evolving field - the approaches shown here are just the beginning. FiftyOne's flexible architecture allows you to easily incorporate new models and techniques as they become available.