# AIMv2 vs CLIP Robustness on ImageNet-D

<img src="/assets/imagenet-d.gif">


ImageNet-D is a benchmark of synthetically generated images (via Stable Diffusion) that pushes image classification models to their breaking point and reveals critical failures in model robustness.

A high-level overview of ImageNet-D:

* It's composed of 4,835 "hard images." 

* ImageNet-D spans 113 overlapping categories between ImageNet and ObjectNet.

* The dataset incorporates 547 nuisance variations spread across backgrounds (3,764 images), textures (498 images), and materials (573 images), making it far more diverse than previous benchmarks. By systematically varying these factors, ImageNet-D assesses how well a model can truly "see" beyond superficial image features.

At the heart of ImageNet-D is the concept of "hard images". To create a challenging test, the researchers employed a clever strategy to mine hard samples:

* They generated a large pool of images using diffusion models.

* They then used a set of "surrogate models" (pre-trained vision models) to identify images that were commonly misclassified.

* Only these challenging "hard images" were retained for the final ImageNet-D dataset. This ensures that the benchmark focuses on the weaknesses of current models and provides a more informative evaluation.
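The mining step above can be sketched in a few lines of plain Python. This is only an illustration, assuming an image counts as "hard" when every surrogate model misclassifies it. The function name, the surrogate names, and the toy predictions are all hypothetical and are not taken from the paper's code.

```python
def mine_hard_images(ground_truth, surrogate_predictions):
    """Return ids of images that every surrogate model misclassifies.

    ground_truth: dict mapping image id -> true label
    surrogate_predictions: dict mapping model name -> {image id -> predicted label}
    """
    hard_ids = []
    for image_id, true_label in ground_truth.items():
        # Assumption: an image is "hard" only if all surrogates get it wrong
        if all(preds[image_id] != true_label for preds in surrogate_predictions.values()):
            hard_ids.append(image_id)
    return hard_ids


# Toy data (hypothetical): three generated images, two surrogate models
ground_truth = {"img_0": "backpack", "img_1": "chair", "img_2": "plate"}
surrogate_predictions = {
    "surrogate_a": {"img_0": "backpack", "img_1": "table", "img_2": "bowl"},
    "surrogate_b": {"img_0": "backpack", "img_1": "chair", "img_2": "frisbee"},
}

hard_images = mine_hard_images(ground_truth, surrogate_predictions)
print(hard_images)  # ['img_2']
```

Only `img_2` survives because it is the one image that both toy surrogates get wrong.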

I wrote an in-depth blog about the ImageNet-D dataset, which you can read [here](https://medium.com/voxel51/imagenet-d-a-new-synthetic-test-set-designed-to-rigorously-evaluate-the-robustness-of-neural-ab8978716585).

### What we're doing in this tutorial.

In this tutorial, you're going to:

1. Explore the ImageNet-D dataset using FiftyOne

2. Compute and visualize the embeddings for the images in this dataset using AIMv2 and CLIP to gain a deeper understanding of its contents

3. Perform zero-shot classification using CLIP in an attempt to verify/replicate the results in the paper

4. Perform zero-shot classification using AIMv2

5. Compare each model's performance to the ground truth labels to see which performs better

# Preliminaries

Let's kick things off by installing FiftyOne, some dependencies needed for this tutorial, and then downloading the ImageNet-D dataset from the [Voxel51 org on Hugging Face](https://huggingface.co/Voxel51).

In [None]:
!pip install fiftyone umap-learn

In [1]:
import os
import fiftyone as fo
import fiftyone.utils.huggingface as fouh

os.environ['FIFTYONE_ALLOW_LEGACY_ORCHESTRATORS'] = 'true'

dataset = fouh.load_from_hub(
    "Voxel51/ImageNet-D",
    name="imagenet_d",
    overwrite=True,
    persistent=True,
    )

Downloading config file fiftyone.yml from Voxel51/ImageNet-D
Loading dataset
Importing samples...
 100% |███████████████| 4835/4835 [68.3ms elapsed, 0s remaining, 70.8K samples/s]   


Now let's install a plugin that allows us to create custom dashboards and glean more insight into our dataset:

In [None]:
!fiftyone plugins download \
    https://github.com/voxel51/fiftyone-plugins \
    --plugin-names @voxel51/dashboard

Once the dataset has been downloaded, you can do some initial exploration by launching the app. 

There are two ways to use the app:

1. As a cell in your notebook, which you can do by running `fo.launch_app(dataset)`

2. In a separate browser window by running `fiftyone app launch` in your terminal

Once the app is launched you can explore the dataset by:

* Simply scrolling through the images for an initial "vibe check" for what's in it

* Filtering by classes using the sidebar

* Creating a dashboard with a plot for class frequency

![ImageNet-D Examples](assets/explore-imagenetd-in-fo.gif)


In [None]:
fo.launch_app(dataset)

We're going to need the ground truth labels later, so let's go ahead and grab them from the dataset.

In [2]:
gt_labels = dataset.distinct("ground_truth.label")

## What is AIMv2?

AIMv2 is a family of **open vision encoders** pre-trained using a novel **multimodal autoregressive objective**.

It autoregressively generates both **image patches and text tokens**, leveraging signals from all input tokens and patches for efficient training. AIMv2 uses a causal multimodal decoder that first regresses image patches and then decodes text tokens in an autoregressive manner. This model excels in tasks like **image recognition, grounding, and multimodal understanding**. AIMv2 consistently matches or outperforms existing self-supervised and vision-language pre-trained models.

AIMv2 deliberately processes **image patches first**, followed by text tokens:  

1. **Visual Foundation**: Text predictions leverage *complete* visual context (like describing a photo only after seeing it in full).  

2. **Unified Processing**: Predicts next image patches (e.g., reconstructing a photo’s bottom half from the top), then generates text autoregressively (e.g., completing "A dog plays in..." → "park").  

3. **Vision-Centric Design**: Forces robust visual representations to support both image reconstruction *and* text generation.  


I've written an in-depth blog about AIMv2, which you can read [here](https://medium.com/voxel51/visual-understanding-with-aimv2-76c58dcd68f9).

#### **How AIMv2 Differs from CLIP**  

I won't get into details about the CLIP family of models, but if you're interested in learning more [check out this blog](https://medium.com/voxel51/a-history-of-clip-model-training-data-advances-599473b48e1b). I do, however, want to briefly summarize the core differences between AIMv2 and CLIP:

| **AIMv2** | **CLIP** |  
|-----------|----------|  
| Uses **autoregressive modeling** to reconstruct inputs *sequentially* (image patches → text tokens) | Uses **contrastive learning** to align *parallel* image-text pairs |  
| Processes images and text as a **unified sequence** | Processes modalities **separately** |  
| Extracts training signals from **every token** (dense supervision) | Relies on **positive/negative pair contrast** (sparse supervision) |  
| Requires **no specialized batch processing** | Demands **large batches** for effective negative sampling |  
| Learns **implicit relationships** via sequential prediction | Forces **explicit alignment** of embeddings |  
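To make the first and third rows of the table concrete, here is a rough NumPy sketch of the two kinds of training signal. It is a simplification under stated assumptions, not either model's actual training code: the function names, the `temperature` value, and the toy inputs are illustrative, and AIMv2's image-patch regression term is left out for brevity.

```python
import numpy as np


def log_softmax(logits, axis=-1):
    # Numerically stable log-softmax
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """CLIP-style symmetric loss: one training signal per image-text pair."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # (batch, batch) similarities
    idx = np.arange(len(logits))  # matching pairs sit on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return (loss_i2t + loss_t2i) / 2


def autoregressive_text_loss(token_logits, target_tokens):
    """Next-token cross-entropy: one training signal per token position."""
    log_probs = log_softmax(token_logits, axis=-1)  # (seq_len, vocab)
    positions = np.arange(len(target_tokens))
    return -log_probs[positions, target_tokens].mean()


rng = np.random.default_rng(0)
image_emb = rng.normal(size=(4, 8))
text_emb = image_emb + 0.01 * rng.normal(size=(4, 8))  # nearly aligned pairs

aligned = contrastive_loss(image_emb, text_emb)
shuffled = contrastive_loss(image_emb, text_emb[::-1])  # every pair mismatched
print(aligned < shuffled)  # True
```

Notice that the contrastive loss only makes sense relative to the other items in the batch (the off-diagonal entries are the negatives), whereas the autoregressive loss is defined per sequence and gets a signal at every position.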


## Using AIMv2 in FiftyOne

I've integrated AIMv2 in two plugins:

1. [Zero-shot prediction plugin](https://github.com/jacobmarks/zero-shot-prediction-plugin)

2. [AIMv2 embeddings plugin](https://github.com/harpreetsahota204/aim-embeddings-plugin)

Let’s begin with embeddings.

#### Feature Extraction and Embedding Visualization in FiftyOne

First, you'll need to install the plugin:


In [None]:
!fiftyone plugins download https://github.com/harpreetsahota204/aim-embeddings-plugin

With a dataset and plugins downloaded, we’re ready to rock. 

You can, of course, use the plugin via the app. To learn how to do that you can refer to [the blog I wrote about the AIMv2 models](https://medium.com/voxel51/visual-understanding-with-aimv2-76c58dcd68f9) or follow the instructions on the [plugin's GitHub repo](https://github.com/harpreetsahota204/aim-embeddings-plugin).

In this tutorial, however, we're going to stick to using the FiftyOne SDK. So, we need to instantiate an operator:

In [3]:
import fiftyone.operators as foo

aim_embeddings = foo.get_operator("@harpreetsahota/aimv2_embeddings/compute_aimv2_embeddings")

Python version is above 3.10, patching the collections module.




### Run the operator on your dataset

You can choose from any model in the AIMv2 collection. [See the README](https://github.com/harpreetsahota204/aim-embeddings-plugin?tab=readme-ov-file#supported-models) on the plugin's repo for details. In this tutorial, we'll use [`apple/aimv2-large-patch14-224`](https://huggingface.co/apple/aimv2-large-patch14-224).

The plugin supports two types of embeddings:

- **Class Token Embedding (`cls`):** A single embedding vector derived from the special classification token. This represents the global semantic context of an image.

- **Mean Pooling Embedding (`mean`):** An embedding vector computed by averaging the representations of all image patches. This captures distributed contextual information across the entire input.
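The difference between the two pooling strategies is easy to see on a toy array. This is a sketch under assumptions rather than the plugin's actual code: it assumes the encoder returns one feature vector per token with the class token at index 0, and the shapes (257 tokens, 1024 dimensions) are illustrative.

```python
import numpy as np

# Hypothetical encoder output for one image: 1 class token + 256 patch tokens
rng = np.random.default_rng(0)
token_features = rng.normal(size=(257, 1024))

cls_embedding = token_features[0]                 # the single class-token vector
mean_embedding = token_features[1:].mean(axis=0)  # average over the patch tokens

print(cls_embedding.shape, mean_embedding.shape)  # (1024,) (1024,)
```

Both strategies yield one vector per image, so either can be stored in a sample field and fed to the same downstream visualization.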

We'll compute embeddings using both methods. I’ll assume that you’re running this in a Jupyter notebook, in which case you can run the entire model on the dataset as shown below. 


In [4]:
embedding_types = ['cls', 'mean']

for emb_type in embedding_types:
  await aim_embeddings(
      dataset,
      model_name="apple/aimv2-large-patch14-224",
      embedding_types=emb_type,
      emb_field=f"aimv2_{emb_type}_emb",
      delegate=True
      )

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


Using CUDA device: NVIDIA RTX 6000 Ada Generation
 100% |███████████████| 4835/4835 [2.6m elapsed, 0s remaining, 30.7 samples/s]      
Using CUDA device: NVIDIA RTX 6000 Ada Generation
 100% |███████████████| 4835/4835 [2.6m elapsed, 0s remaining, 31.3 samples/s]      


We'll visualize these embeddings shortly. Before we do, let's compute embeddings using the CLIP model as well so we can compare how both models represent and organize the same images in their respective embedding spaces.

We can use the CLIP model from the FiftyOne model zoo (which is the same as one of the models assessed in the ImageNet-D paper). You'll notice that I'm instantiating the model with the `classes` and `text_prompt` arguments. That's because we will use the model for zero-shot classification later. These arguments won't affect the embeddings, which are computed from the image alone.

In [5]:
import torch 

import fiftyone.zoo as foz

clip_model = foz.load_zoo_model(
    "clip-vit-base32-torch",
    text_prompt="A photo of a",
    classes=gt_labels,
    device="cuda" if torch.cuda.is_available() else "cpu"
    )



With the `clip_model` instantiated we can use the [`compute_embeddings`](https://docs.voxel51.com/api/fiftyone.core.dataset.html?highlight=compute_embeddings#fiftyone.core.dataset.Dataset.compute_embeddings) method of the dataset.

In [6]:
dataset.compute_embeddings(
    model=clip_model,
    embeddings_field="clip_emb"
)

 100% |███████████████| 4835/4835 [25.6s elapsed, 0s remaining, 181.2 samples/s]      


🤔 You're probably wondering why we're using one pattern for computing embeddings with AIMv2 (i.e. using a plugin) and another one to compute embeddings with CLIP (i.e. using a model from the model zoo).

That's a fair question!

FiftyOne has a powerful [plugins framework](https://docs.voxel51.com/plugins/index.html#getting-started) that allows you to extend the functionality of the library without changing the core code, submitting a PR, and then having to wait for the PR to get merged. It's a way to incorporate cutting edge models and methods into your workflow at the speed of you. We host monthly workshops that teach you about the plugin ecosystem, how to use it in your workflow, and also how to develop them. 

You can [check our events calendar for the next workshop](https://voxel51.com/computer-vision-events/), just search for the event titled *Advanced Computer Vision Data Curation and Model Evaluation*.

### Visualizing embeddings

Now that we've computed embeddings, we can visualize them. To do this, we need to project our high dimensional embeddings to two dimensions. For this we can use [UMAP](https://umap-learn.readthedocs.io/en/latest/).



In [7]:
import fiftyone.brain as fob

embedding_fields = ["aimv2_cls_emb", "aimv2_mean_emb", "clip_emb"]

for embeddings in embedding_fields:
  results = fob.compute_visualization(
      dataset,
      embeddings=embeddings,
      method="umap",
      brain_key=f"{embeddings}_viz",
      num_dims=2,
      n_neighbors=10,
      min_dist=0.051,
      verbose=True,
      )

Generating visualization...
UMAP(min_dist=0.051, verbose=True)
Thu Feb 27 06:55:24 2025 Construct fuzzy simplicial set
Thu Feb 27 06:55:24 2025 Finding Nearest Neighbors
Thu Feb 27 06:55:24 2025 Building RP forest with 8 trees
Thu Feb 27 06:55:27 2025 NN descent for 12 iterations
	 1  /  12
	 2  /  12
	 3  /  12
	 4  /  12
	 5  /  12
	Stopping threshold met -- exiting after 5 iterations
Thu Feb 27 06:55:38 2025 Finished Nearest Neighbor Search
Thu Feb 27 06:55:40 2025 Construct embedding


Epochs completed:   0%|            0/500 [00:00]

	completed  0  /  500 epochs
	completed  50  /  500 epochs
	completed  100  /  500 epochs
	completed  150  /  500 epochs
	completed  200  /  500 epochs
	completed  250  /  500 epochs
	completed  300  /  500 epochs
	completed  350  /  500 epochs
	completed  400  /  500 epochs
	completed  450  /  500 epochs
Thu Feb 27 06:55:42 2025 Finished embedding
Generating visualization...
UMAP(min_dist=0.051, verbose=True)
Thu Feb 27 06:55:42 2025 Construct fuzzy simplicial set
Thu Feb 27 06:55:42 2025 Finding Nearest Neighbors
Thu Feb 27 06:55:42 2025 Building RP forest with 8 trees
Thu Feb 27 06:55:42 2025 NN descent for 12 iterations
	 1  /  12
	 2  /  12
	 3  /  12
	 4  /  12
	Stopping threshold met -- exiting after 4 iterations
Thu Feb 27 06:55:42 2025 Finished Nearest Neighbor Search
Thu Feb 27 06:55:42 2025 Construct embedding


Epochs completed:   0%|            0/500 [00:00]

	completed  0  /  500 epochs
	completed  50  /  500 epochs
	completed  100  /  500 epochs
	completed  150  /  500 epochs
	completed  200  /  500 epochs
	completed  250  /  500 epochs
	completed  300  /  500 epochs
	completed  350  /  500 epochs
	completed  400  /  500 epochs
	completed  450  /  500 epochs
Thu Feb 27 06:55:43 2025 Finished embedding
Generating visualization...
UMAP(min_dist=0.051, verbose=True)
Thu Feb 27 06:55:43 2025 Construct fuzzy simplicial set
Thu Feb 27 06:55:43 2025 Finding Nearest Neighbors
Thu Feb 27 06:55:43 2025 Building RP forest with 8 trees
Thu Feb 27 06:55:43 2025 NN descent for 12 iterations
	 1  /  12
	 2  /  12
	 3  /  12
	 4  /  12
	 5  /  12
	Stopping threshold met -- exiting after 5 iterations
Thu Feb 27 06:55:44 2025 Finished Nearest Neighbor Search
Thu Feb 27 06:55:44 2025 Construct embedding


Epochs completed:   0%|            0/500 [00:00]

	completed  0  /  500 epochs
	completed  50  /  500 epochs
	completed  100  /  500 epochs
	completed  150  /  500 epochs
	completed  200  /  500 epochs
	completed  250  /  500 epochs
	completed  300  /  500 epochs
	completed  350  /  500 epochs
	completed  400  /  500 epochs
	completed  450  /  500 epochs
Thu Feb 27 06:55:44 2025 Finished embedding


In [None]:
fo.launch_app(dataset)

Once you launch the app, take some time to explore how each model organizes these synthetically generated images in its embedding space. 

Since ImageNet-D systematically varies backgrounds, materials, and textures for each object category, pay special attention to whether the models cluster images based on the core object category (C) or if they're distracted by the intentionally introduced nuisance factors (N). For instance, do images of the same object with different backgrounds cluster together, suggesting the model has learned robust object recognition, or do they scatter based on background similarities? 

Look for interesting patterns like whether AIMv2's autoregressive approach is more resilient to these synthetic variations compared to CLIP's contrastive learning. You might notice that one model creates clusters that better preserve semantic object categories despite varying textures and materials, while the other might be more influenced by surface-level visual similarities.  Try filtering by specific classes and examining how well the models handle extreme variations - for example, do common objects remain well-clustered even when rendered with unusual materials or placed in unexpected contexts? 

These patterns can reveal deeper insights about each model's robustness to synthetic perturbations and their ability to distinguish between essential object features and artificially introduced variations.
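If you want a number to go with the visual impression, one option is to measure how often a sample's nearest neighbors in embedding space share its object label. The sketch below is a hypothetical helper (not part of FiftyOne), assuming cosine similarity and a small `k`. The toy embeddings stand in for real ones.

```python
import numpy as np


def knn_label_agreement(embeddings, labels, k=5):
    """Average fraction of each sample's k nearest neighbors (cosine) sharing its label."""
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarity = normed @ normed.T
    np.fill_diagonal(similarity, -np.inf)  # exclude each sample from its own neighbors
    neighbor_idx = np.argsort(-similarity, axis=1)[:, :k]
    same_label = labels[neighbor_idx] == labels[:, None]
    return float(same_label.mean())


# Toy example: two tight, well-separated groups
toy_embeddings = [
    [1.0, 0.1], [1.0, 0.0], [0.9, 0.05],   # "cup"
    [0.1, 1.0], [0.0, 1.0], [0.05, 0.9],   # "chair"
]
toy_labels = ["cup"] * 3 + ["chair"] * 3

score = knn_label_agreement(toy_embeddings, toy_labels, k=2)
print(score)  # 1.0
```

In this tutorial you could feed it `dataset.values("clip_emb")` or `dataset.values("aimv2_cls_emb")` together with `dataset.values("ground_truth.label")`. A higher score suggests the embedding space is organized more by object category than by nuisance factors.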

I'm curious if you find any interesting patterns, examples, or insights. If so, comment below!

![ImageNet-D Examples](assets/imagenet-d-embeddings.gif)

### Zero-Shot Classification in FiftyOne

To get started, let's download the [zero-shot prediction plugin](https://github.com/jacobmarks/zero-shot-prediction-plugin) and instantiate the operator. 

In [None]:
!fiftyone plugins download https://github.com/jacobmarks/zero-shot-prediction-plugin

In [8]:
import fiftyone.operators as foo

zsc = foo.get_operator("@jacobmarks/zero_shot_prediction/zero_shot_classify")

Although several checkpoints and sizes of feature extractors were released as part of the AIMv2 collection, only one has been made available for zero-shot classification: [`aimv2-large-patch14-224-lit`](https://huggingface.co/apple/aimv2-large-patch14-224-lit). This is the model used in the zero-shot prediction plugin. You'll recall that earlier we stored the ground truth labels in the list `gt_labels`. [Under the hood](https://github.com/jacobmarks/zero-shot-prediction-plugin/blob/d85a71c17a9d8a65a5bb1913054347750e6e93f9/classification.py#L382), each class is formatted into the required prompt `Picture of a {category}`, and AIMv2 selects the one with the highest probability as the prediction.

The pattern for using this plugin via the SDK is the same as we saw above: we pass the required arguments to the operator and wait.
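Conceptually, zero-shot classification boils down to comparing one image embedding against one text embedding per class prompt. The following is a minimal NumPy sketch of that idea, not the plugin's implementation: the function name and the tiny hand-written embeddings are hypothetical stand-ins for real encoder outputs, and the learned temperature that real models apply to the logits is omitted.

```python
import numpy as np


def zero_shot_classify(image_emb, text_embs, class_names):
    """Pick the class whose prompt embedding is most similar to the image embedding."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = text_embs @ image_emb       # cosine similarity per class
    exp = np.exp(logits - logits.max())  # stable softmax
    probs = exp / exp.sum()
    best = int(np.argmax(probs))
    return class_names[best], probs


class_names = ["backpack", "chair", "plate"]
prompts = [f"Picture of a {name}" for name in class_names]

# Hypothetical stand-ins for the outputs of a text encoder and an image encoder
text_embs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
image_emb = np.array([0.2, 0.9, 0.1])

label, probs = zero_shot_classify(image_emb, text_embs, class_names)
print(prompts[1], "->", label)  # Picture of a chair -> chair
```

Because the candidate classes are just strings turned into prompts, swapping in a different label set requires no retraining.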

In [9]:
await zsc(
    dataset,
    labels=gt_labels,
    model_name="AIMv2",
    label_field="AIMv2_predictions",
    delegate=True
    )

<fiftyone.operators.executor.ExecutionResult at 0x7006beb17c90>

We'll also use CLIP for zero-shot classification. Recall that when we instantiated the `clip_model`, we did so with the list of `gt_labels` and the required prefix prompt `A photo of a`.

In [11]:
dataset.apply_model(
    model=clip_model, 
    label_field="clip_predictions",
    store_logits=True
    )

# Save the additions we've made to the database

dataset.save()

 100% |███████████████| 4835/4835 [24.7s elapsed, 0s remaining, 190.2 samples/s]      


# Model evaluation

You can use the [`evaluate_classifications`](https://docs.voxel51.com/api/fiftyone.core.collections.html#fiftyone.core.collections.SampleCollection.evaluate_classifications) method to evaluate the predictions of the zero-shot classifiers. It returns a [`ClassificationResults`](https://docs.voxel51.com/api/fiftyone.utils.eval.classification.html#fiftyone.utils.eval.classification.ClassificationResults) instance that provides a variety of methods for generating aggregate evaluation reports about your model.

By default, the classifications are treated as a generic multiclass classification task. For illustration purposes, I am explicitly requesting `simple` evaluation by setting the `method` parameter to "simple", but you can specify other evaluation strategies such as [`top-k`](https://docs.voxel51.com/user_guide/evaluation.html#top-k-evaluation) accuracy or [`binary`](https://docs.voxel51.com/user_guide/evaluation.html#binary-evaluation) evaluation via the same parameter.

In [20]:
zsc_preds = ["AIMv2_predictions", "clip_predictions"]

for pred in zsc_preds:
    __key = pred.split("_")[0]
    dataset.evaluate_classifications(
        pred_field=pred,
        gt_field="ground_truth",
        method="simple",
        eval_key=f"{__key}_simple_eval",
        )

Once the `evaluate_classifications` method has completed, you can analyze the results right in the app with the [Model Evaluation panel](https://docs.voxel51.com/user_guide/app.html#app-model-evaluation-panel). 

With this panel you can analyze the performance of each model individually:

![ImageNet-D Examples](assets/imagenet-d-model-eval.gif)


#### Or you can compare the performance against each other:

![ImageNet-D Examples](assets/imagenet-d-compare-models.gif)

Note that the results displayed are the micro-averaged results. In multiclass classification, when using micro-averaging, precision, recall, and F1 score will have the same value, and this value will be equal to the accuracy.
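A tiny worked example shows why the micro-averaged numbers collapse to accuracy in single-label multiclass classification: every wrong prediction is simultaneously one false positive (for the predicted class) and one false negative (for the true class). The labels below are made up for illustration.

```python
from collections import Counter

y_true = ["cat", "cat", "dog", "dog", "bird", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "cat", "bird"]

tp, fp, fn = Counter(), Counter(), Counter()
for truth, pred in zip(y_true, y_pred):
    if truth == pred:
        tp[truth] += 1
    else:
        fp[pred] += 1   # the predicted class gains a false positive
        fn[truth] += 1  # the true class gains a false negative

total_tp, total_fp, total_fn = sum(tp.values()), sum(fp.values()), sum(fn.values())
micro_precision = total_tp / (total_tp + total_fp)
micro_recall = total_tp / (total_tp + total_fn)
accuracy = total_tp / len(y_true)

print(micro_precision == micro_recall == accuracy)  # True
```

Here four of six predictions are correct, so all three quantities equal 4/6.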

You can also access the results of the evaluation via the SDK:

In [21]:
aim_eval_results = dataset.load_evaluation_results("AIMv2_simple_eval")

clip_eval_results = dataset.load_evaluation_results("clip_simple_eval")

#### A Brief Refresher

*   **Accuracy:** In multiclass classification, accuracy is the ratio of correctly classified instances to the total number of instances. It gives an overall sense of how well the classifier is performing across all classes.

*   **Precision, Recall, and F1-score:** In multiclass classification, these metrics can be calculated in several ways:

    *   **Micro-averaging:** Calculate metrics globally by counting the total true positives, false negatives, and false positives.

    *   **Macro-averaging:** Calculate metrics for each class and then average them. This gives equal weight to each class.
    
    *   **Weighted-averaging:** Calculate metrics for each class and average them, weighting each class by its support (number of true instances for each label).
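To see how macro and weighted averaging diverge on imbalanced data, here is a small pure-Python example for precision. The labels are made up, and treating a class with no predictions as precision 0 is an assumption of this sketch.

```python
from collections import Counter

y_true = ["cat", "cat", "cat", "cat", "dog", "bird"]
y_pred = ["cat", "cat", "cat", "dog", "dog", "cat"]

classes = sorted(set(y_true))
support = Counter(y_true)  # number of true instances per class


def precision_for(cls):
    predicted = sum(p == cls for p in y_pred)
    correct = sum(t == p == cls for t, p in zip(y_true, y_pred))
    # Assumption: a class that is never predicted gets precision 0
    return correct / predicted if predicted else 0.0


per_class = {cls: precision_for(cls) for cls in classes}
macro = sum(per_class.values()) / len(classes)
weighted = sum(per_class[c] * support[c] for c in classes) / len(y_true)

print(per_class)  # {'bird': 0.0, 'cat': 0.75, 'dog': 0.5}
print(round(macro, 4), round(weighted, 4))  # 0.4167 0.5833
```

The weighted figure is higher because the well-handled majority class ("cat") dominates the average. Note also that support-weighted recall always reduces to accuracy, since each class's recall multiplied by its support is simply its count of correct predictions.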

For brevity, let's explore only the results for `weighted`. Note, you can run the following for a classwise breakdown of the model performance:

```python
aim_eval_results.print_report()
```

In Table 3 of the [ImageNet-D paper](https://arxiv.org/pdf/2403.18775), the authors report an accuracy of 21.96 for CLIP ViT-B/32 across the whole of ImageNet-D. As you can see below, we observe similar performance, with an accuracy of 25.07.

However, what stands out is the performance of AIMv2, which has a top-line accuracy of 41.92 and is just mopping the floor with CLIP across the other metrics!

In [22]:
aim_eval_results.print_metrics(average='weighted', digits=4) # you can also pass in "micro" or "macro"

accuracy   0.4192
precision  0.5996
recall     0.4192
fscore     0.451
support    4835


In [23]:
clip_eval_results.print_metrics(average='weighted', digits=4)

accuracy   0.2507
precision  0.4637
recall     0.2507
fscore     0.2856
support    4835


#### Finding the Hardest Samples

The FiftyOne Brain provides a hardness measure that calculates how easy or difficult it is for your model to understand any given sample.

In [27]:
import fiftyone.brain as fob

zsc_preds = ["AIMv2_predictions", "clip_predictions"]

for pred in zsc_preds:
    fob.compute_hardness(dataset, label_field=pred, hardness_field=f"{pred}_hardness")

Computing hardness...
 100% |███████████████| 4835/4835 [4.4s elapsed, 0s remaining, 1.1K samples/s]       
Hardness computation complete
Computing hardness...
 100% |███████████████| 4835/4835 [4.4s elapsed, 0s remaining, 1.1K samples/s]       
Hardness computation complete


![ImageNet-D Examples](assets/imagenet-d-hardness.gif)
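For intuition about what an uncertainty-based hardness score captures, consider the entropy of the softmax distribution over a model's logits. I'm using it here as an illustrative proxy only; treat it as an assumption about a reasonable measure rather than a description of exactly what the Brain computes. The function name and logits are made up.

```python
import numpy as np


def softmax_entropy(logits):
    """Entropy (in nats) of the softmax distribution over class logits."""
    logits = np.asarray(logits, dtype=float)
    exp = np.exp(logits - logits.max())  # stable softmax
    probs = exp / exp.sum()
    return float(-(probs * np.log(probs + 1e-12)).sum())


confident = softmax_entropy([10.0, 0.0, 0.0])  # one class dominates
uncertain = softmax_entropy([1.0, 1.0, 1.0])   # all classes tie

print(confident < uncertain)  # True
```

A sample where the model spreads probability evenly across classes scores high, while a sample with one dominant class scores low, regardless of whether the dominant class is actually correct.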


The concept of hardness is particularly interesting for ImageNet-D because:

1. **By Design Difficulty**: ImageNet-D was specifically created through a "hard image mining" strategy where images were only included if they fooled a set of surrogate models. So in a sense, every image in the dataset was already selected for being "hard."


2. **Layered Hardness**: Computing hardness scores on these already-hard images can reveal which synthetic variations are especially challenging for our specific models (AIMv2 and CLIP). This gives us a "hardness within hardness" perspective.


Key questions to investigate:

- For samples that are "hard" for one model but "easy" for another, what characteristics distinguish them?


- Is there a relationship between embedding cluster position and hardness? Do the hardest samples tend to lie in particular regions of the embedding space?


This analysis could reveal valuable insights about whether certain architectural choices (autoregressive vs. contrastive) make models more robust to specific types of synthetic perturbations.

# Next Steps

You'll notice that I have given you the tools to understand, explore, and analyze the performance of AIMv2 and CLIP on ImageNet-D, but I haven't given you any answers. That's because I want you to take some time and explore it on your own!

After you've taken the time to dig deeper into the dataset and model performance, here's what you can do to level up your analysis (and FiftyOne skills):

- Explore one of the other checkpoints for feature extraction using the AIMv2 embeddings plugin, a good place to start is [`aimv2-large-patch14-native`](https://huggingface.co/apple/aimv2-large-patch14-native).

- Since you've already computed embeddings in this tutorial, you can use them in the FiftyOne Brain to [compute uniqueness values](https://docs.voxel51.com/brain.html#brain-image-uniqueness) for each sample.

- Likewise, you can [compute representativeness](https://docs.voxel51.com/brain.html#brain-image-representativeness) values to find samples that are very similar to large clusters of the ImageNet-D dataset.

- Use the [Janus Pro](https://github.com/harpreetsahota204/janus-vqa-fiftyone) or the [Moondream2](https://github.com/harpreetsahota204/moondream2-plugin) plugin with the prompt `What is the main object in this image? Respond with one word only` and repeat the evaluation as we did in this blog.

If you have any questions or want to stay up to date with us at FiftyOne, feel free to join our [Discord community](https://discord.com/invite/fiftyone-community)!