## Who this is for

This tutorial is designed for:

- **FiftyOne Experience**: Beginners who have basic familiarity with FiftyOne's core concepts like [Datasets and Samples](https://beta-docs.voxel51.com/getting_started/basic/datasets_samples_fields/)

- **Expertise Level**: Machine learning practitioners with basic understanding of computer vision and classification tasks

- **Goals**: Users looking to implement classification for new or changing categories without retraining models, or those wanting to quickly label datasets with flexible categories

## Assumed Knowledge

### Computer Vision Concepts
- Basic understanding of image classification
- Familiarity with model inference and confidence scores
- Understanding of zero-shot learning (though not required)

### Technical Prerequisites
- Python programming fundamentals
- Basic understanding of PyTorch
- Experience working with Jupyter notebooks

### FiftyOne Concepts
You should be familiar with:
- [Datasets and Samples](https://beta-docs.voxel51.com/getting_started/basic/datasets_samples_fields)
- [Working with Labels](https://beta-docs.voxel51.com/fiftyone_concepts/using_datasets/#labels)
- [Model Zoo](https://beta-docs.voxel51.com/models/model_zoo/)

## Time to Complete
- Approximately 30-45 minutes 

## Required Packages

It's recommended to use a virtual environment with [FiftyOne already installed](https://beta-docs.voxel51.com/getting_started/basic/install/). You'll need these additional packages:

```bash
# Install required packages
pip install fiftyone
pip install torch torchvision
pip install open_clip_torch
pip install transformers
```

## Content Overview

The notebook covers:

1. **Dataset Download**: Loading the [ImageNet-O dataset](https://huggingface.co/datasets/Voxel51/ImageNet-O) from Hugging Face

2. **FiftyOne Model Zoo**: Using CLIP models from FiftyOne's [built-in model zoo](https://beta-docs.voxel51.com/models/model_zoo/models/) for zero-shot classification

3. **OpenCLIP Integration**: Implementing zero-shot classification [using OpenCLIP models](https://beta-docs.voxel51.com/integrations/openclip/) with various architectures

4. **Hugging Face Integration**: Running zero-shot classification using models from [Hugging Face's](https://beta-docs.voxel51.com/integrations/huggingface/) model hub


# Zero-Shot Classification in FiftyOne

Traditionally, computer vision models are trained to predict a fixed set of categories. For image classification, for instance, many standard models are trained on the ImageNet dataset, which contains 1,000 categories. All images must be assigned to one of these 1,000 categories, and the model is trained to predict the correct category for each image.

Thanks to the recent advances in multimodal models, [it is now possible to perform zero-shot learning](https://beta-docs.voxel51.com/tutorials/zero_shot_classification/), which allows us to predict categories that were not seen during training. This can be especially useful when:

- We want to roughly pre-label images with a new set of categories.

- Obtaining labeled data for all categories is impractical or impossible.

- The categories change over time, and we want to predict new categories without retraining the model.
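To make the idea concrete, here is a minimal sketch of how CLIP-style models typically pick a category at inference time: the image and each candidate class name are embedded into the same vector space, and the class whose text embedding is most similar to the image embedding wins. The three-dimensional vectors and the helper names below are made up for illustration; real models produce embeddings with hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_predict(image_embedding, text_embeddings):
    """Return the best-matching class name and all similarity scores."""
    scores = {
        label: cosine_similarity(image_embedding, embedding)
        for label, embedding in text_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy, hand-made embeddings (hypothetical values, not model outputs)
text_embeddings = {
    "dog": [1.0, 0.0, 0.0],
    "cat": [0.0, 1.0, 0.0],
    "car": [0.0, 0.0, 1.0],
}
image_embedding = [0.9, 0.1, 0.2]

label, scores = zero_shot_predict(image_embedding, text_embeddings)
print(label, scores)
```

Because the candidate classes are only text, swapping in a new set of categories means changing the keys of `text_embeddings`, not retraining anything.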

# Download Dataset

In this tutorial we will use the ImageNet-O dataset.

The [ImageNet-O dataset](https://huggingface.co/datasets/Voxel51/ImageNet-O) consists of images from classes not found in the standard ImageNet-1k dataset. It tests the robustness and out-of-distribution detection capabilities of computer vision models trained on ImageNet-1k.

Let's [load the dataset](https://beta-docs.voxel51.com/integrations/huggingface/#loading-datasets-from-the-hub) from [Voxel51's Hugging Face Org](https://huggingface.co/datasets/Voxel51/ImageNet-O):

```python
import fiftyone as fo
import fiftyone.utils.huggingface as fouh

dataset = fouh.load_from_hub("Voxel51/ImageNet-O")
```

Let's grab the classes from this dataset using the [`distinct` method](https://beta-docs.voxel51.com/api/fiftyone.core.aggregations.distinct.html) of the [Dataset](https://beta-docs.voxel51.com/getting_started/basic/datasets_samples_fields/). These will be the classes we use for zero-shot classification.

```python
dataset_classes = dataset.distinct("ground_truth.label")
```

# FiftyOne Model Zoo

The [FiftyOne Model Zoo](https://beta-docs.voxel51.com/models/model_zoo/) provides a powerful interface for downloading models and applying them to your FiftyOne datasets. It provides native access to hundreds of pre-trained models, and it also supports downloading arbitrary public or private models whose definitions are provided via GitHub repositories or URLs.

The zero-shot models used in this tutorial accept a `text_prompt` keyword argument, which allows you to override the prompt template used to embed the class names. Zero-shot classification results can vary based on this text!
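As a rough illustration of what a prompt template does, the sketch below combines a template with each class name to produce the strings that would be embedded. The helper `build_prompts` is hypothetical, and the exact way FiftyOne joins the template and the class name is an assumption here.

```python
def build_prompts(text_prompt, classes):
    """Combine a prompt template with each class name (illustrative only)."""
    return [f"{text_prompt.strip()} {name}" for name in classes]


classes = ["mousetrap", "frying pan"]

# Two candidate templates; the second one is a made-up variation
for template in ("A photo of a", "A close-up photo of a"):
    print(build_prompts(template, classes))
```

Trying a few templates like this and comparing the resulting predictions is a simple way to see how sensitive a model is to the prompt wording.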

You can load a model from the Model Zoo using the [`load_zoo_model` method](https://beta-docs.voxel51.com/api/fiftyone.zoo.models.html#load_zoo_model). In this example we will use [CLIP](https://beta-docs.voxel51.com/models/model_zoo/models/#clip-vit-base32-torch):

```python
import torch
import fiftyone.zoo as foz

clip_zoo_model = foz.load_zoo_model(
    name_or_url="clip-vit-base32-torch",
    text_prompt="A photo of a ",
    classes=dataset_classes,
    device="cuda" if torch.cuda.is_available() else "cpu",
    # install_requirements=True # uncomment this line if you are running this code for the first time
)

dataset.apply_model(clip_zoo_model, label_field="clip_classification")
```

You can examine the results on the [first Sample](https://beta-docs.voxel51.com/api/fiftyone.core.dataset.Dataset.html#first) as follows:

```python
dataset.first()['clip_classification']
```

```text
<Classification: {
    'id': '67d9ddda99e7fb132baf9334',
    'tags': [],
    'label': 'mousetrap',
    'confidence': 0.34010758996009827,
    'logits': None,
}>
```

# OpenCLIP Integration

FiftyOne [integrates natively with the OpenCLIP library](https://beta-docs.voxel51.com/integrations/openclip/), an open source implementation of OpenAI’s CLIP (Contrastive Language-Image Pre-training) model that you can use to run inference on your FiftyOne datasets with a few lines of code!

To get started with OpenCLIP, install the `open_clip_torch` package:

```bash
pip install open_clip_torch
```

When running inference with [OpenCLIP](https://github.com/mlfoundations/open_clip), you can specify a text prompt to guide the model and restrict the output to a specific set of classes during zero-shot classification.

```python
import torch
import fiftyone.zoo as foz

open_clip_model = foz.load_zoo_model(
    name_or_url="open-clip-torch",
    text_prompt="A photo of a",
    classes=dataset_classes,
    device="cuda" if torch.cuda.is_available() else "cpu",
    # install_requirements=True # uncomment this line if you are running this code for the first time
)

dataset.apply_model(open_clip_model, label_field="open_clip_classification")
```

```python
dataset.first()['open_clip_classification']
```

```text
<Classification: {
    'id': '67d9de8e99e7fb132bafa2d4',
    'tags': [],
    'label': 'mousetrap',
    'confidence': nan,
    'logits': None,
}>
```
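The output above reports `nan` for `confidence`, so any downstream logic that filters or sorts by confidence should guard against missing or NaN values. The helper below is a hypothetical sketch, not part of FiftyOne; it only assumes the label object exposes a `confidence` attribute, and it uses `SimpleNamespace` stand-ins so it runs without a dataset.

```python
import math
from types import SimpleNamespace


def has_valid_confidence(label):
    """Return True if the label carries a usable (non-None, non-NaN) confidence."""
    confidence = getattr(label, "confidence", None)
    return confidence is not None and not math.isnan(confidence)


# Stand-ins for Classification labels (hypothetical values)
with_score = SimpleNamespace(label="mousetrap", confidence=0.34)
with_nan = SimpleNamespace(label="mousetrap", confidence=float("nan"))

print(has_valid_confidence(with_score), has_valid_confidence(with_nan))
```

With a loaded dataset, the same check could be applied to `dataset.first()['open_clip_classification']`.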

You can also specify different model architectures and pretrained weights by passing in optional parameters. Pretrained models can be loaded directly from OpenCLIP with the following syntax:


```python
meta_clip = foz.load_zoo_model(
    name_or_url="open-clip-torch",
    clip_model="ViT-B-32-quickgelu",
    pretrained="metaclip_400m",
    text_prompt="A photo of a",
    classes=dataset_classes,
)
```


Alternatively, you can load a model from Hugging Face’s Model Hub with the following syntax:


```python
import fiftyone.zoo as foz

open_clip_model = foz.load_zoo_model(
    name_or_url="open-clip-torch",
    clip_model="hf-hub:repo-name/model-name",
    pretrained="",
)
```

As a concrete example, if you were interested in the [StreetCLIP model](https://huggingface.co/geolocal/StreetCLIP) you would use:

```python
street_clip_model = foz.load_zoo_model(
    name_or_url="open-clip-torch",
    pretrained="",
    clip_model="hf-hub:geolocal/StreetCLIP"
)
```

# Hugging Face Integration


You can also run models from Hugging Face as a Zoo Model with [FiftyOne's Hugging Face Integration](https://beta-docs.voxel51.com/integrations/huggingface/#zero-shot-classification). Note: These models must be fully integrated in the [Hugging Face transformers](https://github.com/huggingface/transformers) library. Some model weights may be available via Hugging Face but not fully integrated into the transformers library.

To load a model from the Hugging Face Hub, set `name_or_url="zero-shot-classification-transformer-torch"`. This specifies that you want to load a zero-shot image classification model from the Hugging Face Transformers library. You can then specify the model via the `name_or_path` argument, which should be the repository name or model identifier of the model you want to load:

```python
import torch
import fiftyone.zoo as foz

siglip_model = foz.load_zoo_model(
    name_or_url="zero-shot-classification-transformer-torch",
    name_or_path="google/siglip2-so400m-patch14-384",
    classes=dataset_classes,
    device="cuda" if torch.cuda.is_available() else "cpu",
    # install_requirements=True # uncomment this line if you are running this code for the first time
)

dataset.apply_model(siglip_model, label_field="siglip2_classification")
```

We can examine the output for the first Sample as shown below. Note that not all models will output a value for `confidence` or `logits`.

```python
dataset.first()['siglip2_classification']
```

```text
<Classification: {
    'id': '67d9dff499e7fb132bafb274',
    'tags': [],
    'label': 'frying pan',
    'confidence': 0.10734604299068451,
    'logits': array([-14.63465  , -16.027546 , -14.841102 , -15.603795 , -15.634769 ,
           -26.755354 , -28.286345 , -28.364231 , -27.1351   , -26.7131   ,
           -27.507458 , -26.499924 , -21.994976 , -13.691553 , -28.55092  ,
           -25.564491 , -14.591154 , -26.281277 , -26.274792 , -13.6278   ,
           -14.481212 , -24.684128 , -24.921196 , -28.01366  , -24.428066 ,
           -26.221783 , -24.934784 , -27.947395 , -28.367542 , -13.191751 ,
           -26.12117  , -27.978542 , -21.961689 , -13.579732 , -26.58063  ,
           -13.924773 , -27.960869 , -13.0043545, -26.122686 , -20.667011 ,
           -22.799128 , -20.430523 , -15.019728 , -27.692225 , -27.954025 ,
           -12.913255 , -27.146004 , -27.823196 , -28.423857 , -23.00042  ,
           -23.592287 , -25.413513 , -26.26255  , -15.5457735, -25.749825 ,
           -24.
    ... (output truncated)
```
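When a model returns raw `logits`, one common way to turn them into per-class probabilities is a softmax, with the largest probability taken as the confidence. The sketch below shows that calculation on a few made-up values in the same range as the output above. Whether a particular model wrapper derives `confidence` exactly this way is an assumption.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


# Made-up logits for three classes (not taken from a real model run)
logits = [-14.6, -16.0, -12.9]

probabilities = softmax(logits)
confidence = max(probabilities)
predicted_index = probabilities.index(confidence)

print(predicted_index, confidence)
```

Subtracting the maximum logit before exponentiating does not change the result, but it avoids underflow when all logits are large negative numbers.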

Any model that can be run in a Hugging Face pipeline for the `zero-shot-image-classification` task can be loaded as a Zoo model.

The simplest way to check compatibility is to pass the model name as `name_or_path` to [`load_zoo_model`](https://beta-docs.voxel51.com/api/fiftyone.zoo.models.html#fiftyone.zoo.models.load_zoo_model). If a Hugging Face model is not compatible with the integration, you'll see an error to the effect of:

```python
ValueError: Unrecognized model in <whatever-model-name>
```

In this case, you will need to run the model manually. This means instantiating the model and its processor, then writing some logic to parse the model output into a [FiftyOne Classification](https://beta-docs.voxel51.com/api/fiftyone.core.labels.Classification.html).

Refer to [this documentation](https://beta-docs.voxel51.com/how_do_i/recipes/adding_classifications/) for details on how to manually parse model outputs as a [FiftyOne Classification](https://beta-docs.voxel51.com/api/fiftyone.core.labels.Classification.html).
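As a minimal sketch of the parsing step, the helper below takes output shaped like that of a Hugging Face `zero-shot-image-classification` pipeline (a list of dicts with `label` and `score` keys) and reduces it to the fields a Classification needs. The function name and the example scores are hypothetical, and the FiftyOne call is left as a comment so the snippet runs without any extra dependencies.

```python
def top_prediction(pipeline_output):
    """Pick the highest-scoring entry from a list of {"label", "score"} dicts."""
    if not pipeline_output:
        return None
    best = max(pipeline_output, key=lambda item: item["score"])
    return {"label": best["label"], "confidence": float(best["score"])}


# Made-up scores for illustration
example_output = [
    {"label": "frying pan", "score": 0.11},
    {"label": "mousetrap", "score": 0.07},
]

fields = top_prediction(example_output)
print(fields)

# With FiftyOne installed, these fields could be stored on a sample:
# sample["my_classification"] = fo.Classification(**fields)
# sample.save()
```

The field name `my_classification` is a placeholder; any label field name works.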

## Conclusion

In this tutorial, you've learned how to:
- Implement zero-shot classification using multiple model architectures without needing to retrain models
- Use three different approaches for zero-shot classification:
  - FiftyOne's Model Zoo CLIP models
  - OpenCLIP models with custom architectures
  - Hugging Face Transformers integration
- Customize text prompts to improve classification results
- Apply models to a FiftyOne dataset and access the classification results

### Key Takeaways
- Zero-shot classification enables prediction of new categories without model retraining
- Different model architectures and text prompts can significantly impact results
- FiftyOne provides flexible integrations with popular model frameworks
- Classification results are stored as standard FiftyOne Classifications, making them easy to analyze and evaluate



## Next Steps

- Check out this [in-depth end-to-end tutorial](https://beta-docs.voxel51.com/tutorials/zero_shot_classification/) for Zero-Shot Classification which includes details on how to evaluate your results

- Learn how to [evaluate classification](https://beta-docs.voxel51.com/fiftyone_concepts/evaluation/#classifications) results

You might also be interested in reading these blogs:

- [*This Visual Illusions Benchmark Makes Me Question the Power of VLMs*](https://voxel51.com/blog/this-visual-illusions-benchmark-makes-me-question-the-power-of-vlms/)

- [*AIMv2 Outperforms CLIP on Synthetic Dataset ImageNet-D*](https://voxel51.com/blog/aimv2-outperforms-clip-on-synthetic-dataset-imagenet-d/)

- [*A History of CLIP Model Training Data Advances*](https://voxel51.com/blog/a-history-of-clip-model-training-data-advances/)

For more resources and updates, follow us on [LinkedIn](https://www.linkedin.com/company/voxel51/) or join our [Discord Community](https://community.voxel51.com/).