# AIMv2 vs CLIP Robustness on ImageNet-D

![ImageNet-D Examples](assets/imagenet-d.gif)


ImageNet-D is a benchmark of synthetic images, generated with Stable Diffusion, that pushes image classification models to their breaking point and exposes critical failures in model robustness.

A high-level overview of ImageNet-D:

* It's composed of 4,835 "hard images." 

* ImageNet-D spans 113 overlapping categories between ImageNet and ObjectNet.

* The dataset incorporates 547 nuisance variations spread across three categories: backgrounds (3,764 images), textures (498 images), and materials (573 images). This makes it far more diverse than previous benchmarks. By systematically varying these factors, ImageNet-D assesses how well a model can truly "see" beyond superficial image features.

At the heart of ImageNet-D is the concept of "hard images". To create a challenging test, the researchers employed a clever strategy to mine hard samples:

* They generated a large pool of images using diffusion models.

* They then used a set of "surrogate models" (pre-trained vision models) to identify images that were commonly misclassified.

* Only these challenging "hard images" were retained for the final ImageNet-D dataset. This ensures that the benchmark focuses on the weaknesses of current models and provides a more informative evaluation.
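
The mining steps above can be sketched in a few lines. This is a toy illustration, not the paper's pipeline: the "images" are just string IDs, the surrogate models are lookup tables, and `mine_hard_images` and its `min_failures` threshold are hypothetical names I made up for this example.

```python
from typing import Callable, List, Optional, Sequence

# A surrogate model is anything that maps an image to a predicted label
Surrogate = Callable[[str], str]


def mine_hard_images(
    images: Sequence[str],
    labels: Sequence[str],
    surrogates: Sequence[Surrogate],
    min_failures: Optional[int] = None,
) -> List[str]:
    """Keep the images that at least `min_failures` surrogates misclassify."""
    if min_failures is None:
        # Assumption for this sketch: every surrogate must fail
        min_failures = len(surrogates)

    hard = []
    for image, label in zip(images, labels):
        failures = sum(model(image) != label for model in surrogates)
        if failures >= min_failures:
            hard.append(image)
    return hard


# Toy stand-ins for two pre-trained surrogate models
preds_a = {"img1": "chair", "img2": "bench", "img3": "plate"}
preds_b = {"img1": "chair", "img2": "sofa", "img3": "bowl"}
surrogates = [preds_a.get, preds_b.get]

images = ["img1", "img2", "img3"]
labels = ["chair", "chair", "plate"]

hard_images = mine_hard_images(images, labels, surrogates)
print(hard_images)  # ['img2'] -> the only image both surrogates get wrong
```

Lowering `min_failures` keeps more images but makes the retained set less adversarial, which is the trade-off behind filtering on shared failures.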

I wrote an in-depth blog about this dataset, which you can read [here](https://medium.com/voxel51/imagenet-d-a-new-synthetic-test-set-designed-to-rigorously-evaluate-the-robustness-of-neural-ab8978716585).

### What we're doing in this tutorial

In this tutorial, you're going to:

1. Explore the ImageNet-D dataset using FiftyOne

2. Compute and visualize the embeddings for the images in this dataset using AIMv2 and CLIP to gain a deeper understanding of its contents

3. Perform zero-shot classification using CLIP in an attempt to verify/replicate the results in the paper

4. Perform zero-shot classification using AIMv2

5. Compare each model's performance to the ground truth labels to see which performs better

# Preliminaries

Let's kick things off by installing FiftyOne and the dependencies needed for this tutorial, then downloading the ImageNet-D dataset from the [Voxel51 org on Hugging Face](https://huggingface.co/Voxel51).

In [None]:
!pip install fiftyone umap-learn

In [None]:
import fiftyone as fo
import fiftyone.utils.huggingface as fouh

dataset = fouh.load_from_hub(
    "Voxel51/ImageNet-D",
    name="imagenet_d"
    )

Now let's install a plugin that allows us to create custom dashboards and glean more insight into our dataset:

In [None]:
!fiftyone plugins download \
    https://github.com/voxel51/fiftyone-plugins \
    --plugin-names @voxel51/dashboard

Once the dataset has been downloaded, you can do some initial exploration by launching the app. 

There are two ways to use the app:

1. As a cell in your notebook, which you can do by running `fo.launch_app(dataset)`

2. In a separate browser window by running `fiftyone app launch` in your terminal

Once the app is launched you can explore the dataset by:

* Simply scrolling through the images for an initial "vibe check" of what's in it

* Filtering by classes using the sidebar

* Creating a dashboard with a plot for class frequency

![ImageNet-D Examples](assets/explore-imagenetd-in-fo.gif)


In [None]:
fo.launch_app(dataset)

We're going to need the ground truth labels later, so let's go ahead and grab them from the dataset.

In [2]:
gt_labels = dataset.distinct("ground_truth.label")

## What is AIMv2?

AIMv2 is a family of **open vision encoders** pre-trained using a novel **multimodal autoregressive objective**.

It autoregressively generates both **image patches and text tokens**, leveraging signals from all input tokens and patches for efficient training. AIMv2 uses a causal multimodal decoder that first regresses image patches and then decodes text tokens autoregressively. This model excels in tasks like **image recognition, grounding, and multimodal understanding**. AIMv2 consistently matches or outperforms existing self-supervised and vision-language pre-trained models.

AIMv2 deliberately processes **image patches first**, followed by text tokens:  

1. **Visual Foundation**: Text predictions leverage *complete* visual context (like describing a photo only after seeing it in full).  

2. **Unified Processing**: Predicts next image patches (e.g., reconstructing a photo’s bottom half from the top), then generates text autoregressively (e.g., completing "A dog plays in..." → "park").  

3. **Vision-Centric Design**: Forces robust visual representations to support both image reconstruction *and* text generation.  
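
To make the "image patches first, then text" ordering concrete, here's a minimal sketch of a causal attention mask over one combined sequence. It's a simplification under my own assumptions: the function name `multimodal_causal_mask` is made up, and the real model's attention pattern may differ from a plain lower-triangular mask.

```python
import numpy as np


def multimodal_causal_mask(num_patches: int, num_text_tokens: int) -> np.ndarray:
    """Boolean mask where mask[i, j] is True if position i may attend to j.

    The sequence is laid out as [patch_0, ..., patch_n, text_0, ..., text_m],
    and each position only sees itself and the positions before it.
    """
    seq_len = num_patches + num_text_tokens
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))


num_patches, num_text = 4, 3
mask = multimodal_causal_mask(num_patches, num_text)

# Every text position can attend to the complete set of image patches
text_sees_all_patches = bool(mask[num_patches:, :num_patches].all())

# No image patch can peek at any text token
patches_see_no_text = not mask[:num_patches, num_patches:].any()

print(text_sees_all_patches, patches_see_no_text)  # True True
```

Because the patches come first, every text prediction is conditioned on the whole image, while the image side never gets help from the caption.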


I've written an in-depth blog about AIMv2, which you can read [here](https://medium.com/voxel51/visual-understanding-with-aimv2-76c58dcd68f9).

#### **How AIMv2 Differs from CLIP**  

I won't get into details about the CLIP family of models, but if you're interested in learning more [check out this blog](https://medium.com/voxel51/a-history-of-clip-model-training-data-advances-599473b48e1b). I do, however, want to briefly summarize the core differences between AIMv2 and CLIP:

| **AIMv2** | **CLIP** |  
|-----------|----------|  
| Uses **autoregressive modeling** to reconstruct inputs *sequentially* (image patches → text tokens) | Uses **contrastive learning** to align *parallel* image-text pairs |  
| Processes images and text as a **unified sequence** | Processes modalities **separately** |  
| Extracts training signals from **every token** (dense supervision) | Relies on **positive/negative pair contrast** (sparse supervision) |  
| Requires **no specialized batch processing** | Demands **large batches** for effective negative sampling |  
| Learns **implicit relationships** via sequential prediction | Forces **explicit alignment** of embeddings |  
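
The CLIP column of this table is easier to picture with code. Below is a minimal NumPy sketch of a symmetric contrastive loss in the style CLIP uses. It is not CLIP's training code: the embeddings are random toy data, and the `temperature` value is an assumption I picked for illustration.

```python
import numpy as np


def log_softmax(x: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # logits[i, j] = similarity between image i and text j
    logits = image_emb @ text_emb.T / temperature
    idx = np.arange(len(logits))

    # Row i should pick text i; column j should pick image j.
    # Every other item in the batch acts as a negative.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return (loss_i2t + loss_t2i) / 2


rng = np.random.default_rng(0)
images = rng.normal(size=(8, 16))
aligned_text = images + 0.01 * rng.normal(size=(8, 16))  # correct pairs
mismatched_text = aligned_text[::-1]  # reversed order, so no pair matches

aligned = contrastive_loss(images, aligned_text)
mismatched = contrastive_loss(images, mismatched_text)
print(aligned < mismatched)  # True
```

Notice that the negatives come entirely from the other rows of the batch, which is why this style of training leans on large batches.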


## Using AIMv2 in FiftyOne

I've integrated AIMv2 in two plugins:

1. [Zero-shot prediction plugin](https://github.com/jacobmarks/zero-shot-prediction-plugin)

2. [AIMv2 embeddings plugin](https://github.com/harpreetsahota204/aim-embeddings-plugin)

Let’s begin with embeddings.

# Feature Extraction and Embedding Visualization in FiftyOne

First, you'll need to install the plugin:


In [None]:
!fiftyone plugins download https://github.com/harpreetsahota204/aim-embeddings-plugin

With a dataset and plugins downloaded, we’re ready to rock. 

You can, of course, use the plugin via the app. To learn how to do that you can refer to [the blog I wrote about the AIMv2 models](https://medium.com/voxel51/visual-understanding-with-aimv2-76c58dcd68f9) or follow the instructions on the [plugin's GitHub repo](https://github.com/harpreetsahota204/aim-embeddings-plugin).

In this tutorial, however, we're going to stick to using the FiftyOne SDK. So, we need to instantiate an operator:

In [None]:
import fiftyone.operators as foo

aim_embeddings = foo.get_operator("@harpreetsahota/aimv2_embeddings/compute_aimv2_embeddings")

### Run the operator on your dataset

You can choose from any model in the AIMv2 collection. [See the README](https://github.com/harpreetsahota204/aim-embeddings-plugin?tab=readme-ov-file#supported-models) on the plugin's repo for details. In this tutorial, we'll use [`apple/aimv2-large-patch14-224`](https://huggingface.co/apple/aimv2-large-patch14-224).

The plugin supports two types of embeddings:

- **Class Token Embedding (`cls`):** A single embedding vector derived from the special classification token. This represents the global semantic context of an image.

- **Mean Pooling Embedding (`mean`):** An embedding vector computed by averaging the representations of all image patches. This captures distributed contextual information across the entire input.
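
If the difference between the two is fuzzy, this NumPy sketch may help. It's only an illustration: `pool_embeddings` is a hypothetical helper, the token states are random, and I'm assuming the class token sits at index 0 and that mean pooling averages every position. The plugin's internals may differ.

```python
import numpy as np


def pool_embeddings(token_states: np.ndarray, embedding_type: str) -> np.ndarray:
    """Reduce per-token states of shape (batch, tokens, dim) to (batch, dim)."""
    if embedding_type == "cls":
        # Assumption: the special classification token is the first position
        return token_states[:, 0, :]
    if embedding_type == "mean":
        # Average over the token axis
        return token_states.mean(axis=1)
    raise ValueError(f"Unknown embedding type: {embedding_type!r}")


rng = np.random.default_rng(0)
token_states = rng.normal(size=(2, 5, 8))  # 2 images, 5 tokens, 8 dims

cls_emb = pool_embeddings(token_states, "cls")
mean_emb = pool_embeddings(token_states, "mean")
print(cls_emb.shape, mean_emb.shape)  # (2, 8) (2, 8)
```

Both give you one vector per image, they just summarize the token sequence differently.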

We'll compute embeddings using both methods. I’ll assume that you’re running this in a Jupyter notebook, in which case you can run the entire model on the dataset as shown below. 


In [None]:
embedding_types = ['cls', 'mean']

for emb_type in embedding_types:
  await aim_embeddings(
      dataset,
      model_name="apple/aimv2-large-patch14-224",
      embedding_types=emb_type,
      emb_field=f"aimv2_{emb_type}_emb",
      delegate=True
      )

We'll visualize these embeddings shortly. Before we do, let's compute embeddings using the CLIP model as well. This way we can compare how both models represent and organize the same images in their respective embedding spaces.

We can use the CLIP model from the FiftyOne model zoo (which is the same as one of the models they assessed in the ImageNet-D paper). You'll notice that I'm instantiating the model with the `classes` and `text_prompt` arguments. That's because we will use the model for zero-shot classification later. The presence of these arguments won't impact the embeddings we get, as these are computed based only on the image.

In [None]:
import torch 

import fiftyone.zoo as foz

clip_model = foz.load_zoo_model(
    "clip-vit-base32-torch",
    text_prompt="A photo of a",
    classes=gt_labels,
    device="cuda" if torch.cuda.is_available() else "cpu"
    )
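
To see what `text_prompt` and `classes` are for, here's a rough NumPy sketch of how zero-shot classification works in general: each class becomes a prompt, the prompts are embedded, and the image gets the label of its most similar prompt. The helper names and the toy embeddings are made up, and I'm assuming prompts are formed by joining the prompt and class name with a space.

```python
import numpy as np


def build_prompts(text_prompt: str, classes: list) -> list:
    """Turn class names into prompts, e.g. 'A photo of a chair'."""
    return [f"{text_prompt} {c}" for c in classes]


def zero_shot_classify(image_emb: np.ndarray, text_embs: np.ndarray, classes: list):
    """Return the class whose prompt embedding has the highest cosine similarity."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = text_embs @ image_emb
    return classes[int(np.argmax(sims))], sims


classes = ["backpack", "chair", "umbrella"]
prompts = build_prompts("A photo of a", classes)

# Toy embeddings standing in for the text and image encoder outputs
text_embs = np.eye(3)
image_emb = np.array([0.1, 0.9, 0.2])

label, sims = zero_shot_classify(image_emb, text_embs, classes)
print(prompts[1], "->", label)  # A photo of a chair -> chair
```

The image embedding never touches the prompts until the similarity step, which is why these arguments don't change the embeddings themselves.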



With the `clip_model` instantiated we can use the [`compute_embeddings`](https://docs.voxel51.com/api/fiftyone.core.dataset.html?highlight=compute_embeddings#fiftyone.core.dataset.Dataset.compute_embeddings) method of the dataset.

In [None]:
dataset.compute_embeddings(
    model=clip_model,
    embeddings_field="clip_emb"
)

🤔 You're probably wondering why we're using one pattern for computing embeddings with AIMv2 (i.e. using a plugin) and another one to compute embeddings with CLIP (i.e. using a model from the model zoo).

That's a fair question!

FiftyOne has a powerful [plugins framework](https://docs.voxel51.com/plugins/index.html#getting-started) that allows you to extend the functionality of the library without changing the core code, submitting a PR, and then having to wait for the PR to get merged. It's a way to incorporate cutting edge models and methods into your workflow at the speed of you. We host monthly workshops that teach you about the plugin ecosystem, how to use it in your workflow, and also how to develop them. 

You can [check our events calendar for the next workshop](https://voxel51.com/computer-vision-events/), just search for the event titled *Advanced Computer Vision Data Curation and Model Evaluation*.

### Visualizing embeddings

Now that we've computed embeddings, we can visualize them. To do this, we need to project our high dimensional embeddings to two dimensions. For this we can use [UMAP](https://umap-learn.readthedocs.io/en/latest/).



In [None]:
import fiftyone.brain as fob

embedding_fields = ["aimv2_cls_emb", "aimv2_mean_emb", "clip_emb"]

for embeddings in embedding_fields:
  results = fob.compute_visualization(
      dataset,
      embeddings=embeddings,
      method="umap",
      brain_key=f"{embeddings}_viz",
      num_dims=2,
      n_neighbors=10,
      min_dist=0.051,
      verbose=True,
      )

In [None]:
fo.launch_app(dataset)

# Zero-Shot Classification using AIMv2 in FiftyOne

In [None]:
!fiftyone plugins download https://github.com/jacobmarks/zero-shot-prediction-plugin

In [None]:
import fiftyone.operators as foo

zsc = foo.get_operator("@jacobmarks/zero_shot_prediction/zero_shot_classify")

In [None]:
await zsc(
    dataset,
    labels=gt_labels,
    model_name="AIMv2",
    label_field="AIMv2_predictions",
    )

In [None]:
dataset.apply_model(
    model=clip_model, 
    label_field="clip_predictions"
    )

In [12]:
dataset.save()

In [None]:
dataset
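
With both prediction fields saved, the last step is comparing each model against the ground truth. Here's a minimal pure-Python sketch of that comparison. The label lists are made-up placeholders, not results from ImageNet-D. In the notebook you could pull the real ones with `dataset.values("ground_truth.label")`, `dataset.values("clip_predictions.label")`, and `dataset.values("AIMv2_predictions.label")`. The helper names are my own.

```python
from collections import Counter
from typing import Dict, Sequence


def accuracy(gt: Sequence[str], preds: Sequence[str]) -> float:
    """Fraction of predictions that match the ground truth."""
    if len(gt) != len(preds):
        raise ValueError("gt and preds must have the same length")
    if not gt:
        return 0.0
    return sum(g == p for g, p in zip(gt, preds)) / len(gt)


def per_class_accuracy(gt: Sequence[str], preds: Sequence[str]) -> Dict[str, float]:
    """Accuracy computed separately for each ground truth class."""
    totals = Counter(gt)
    correct = Counter(g for g, p in zip(gt, preds) if g == p)
    return {label: correct[label] / totals[label] for label in totals}


# Placeholder labels for illustration only
gt = ["chair", "chair", "plate", "plate"]
clip_preds = ["chair", "bench", "bowl", "plate"]
aimv2_preds = ["chair", "chair", "bowl", "plate"]

print(accuracy(gt, clip_preds), accuracy(gt, aimv2_preds))  # 0.5 0.75
print(per_class_accuracy(gt, aimv2_preds))
```

FiftyOne also ships `dataset.evaluate_classifications()`, which stores per-sample results on the dataset so you can filter for failures in the app. The sketch above just shows the arithmetic behind the headline number.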