# AIMv2 vs CLIP Robustness on ImageNet-D

![ImageNet-D Examples](assets/imagenet-d.gif)


ImageNet-D is a benchmark of synthetic images generated with Stable Diffusion. Its challenging images push image classification models to their breaking points and reveal critical failures in model robustness.

A high-level overview of ImageNet-D:

* It's composed of 4,835 "hard images." 

* ImageNet-D spans the 113 categories shared by ImageNet and ObjectNet.

* The dataset incorporates 547 nuisance variations across backgrounds (3,764 images), textures (498 images), and materials (573 images), making it far more diverse than previous benchmarks. By systematically varying these factors, ImageNet-D assesses how well a model can truly "see" beyond superficial image features.

At the heart of ImageNet-D is the concept of "hard images". To create a challenging test, the researchers employed a clever strategy to mine hard samples:

* They generated a large pool of images using diffusion models.

* They then used a set of "surrogate models" (pre-trained vision models) to identify images that were commonly misclassified.

* Only these challenging "hard images" were retained for the final ImageNet-D dataset. This ensures that the benchmark focuses on the weaknesses of current models and provides a more informative evaluation.
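
The mining steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' pipeline: the function name `mine_hard_images`, the `min_failures` threshold, and the toy labels are all assumptions made for the example.

```python
def mine_hard_images(predictions, ground_truth, min_failures):
    """Return image ids that at least `min_failures` surrogate models get wrong.

    predictions: dict of model name -> dict of image id -> predicted label
    ground_truth: dict of image id -> true label
    """
    hard = []
    for image_id, label in ground_truth.items():
        # A missing prediction is counted as a failure
        failures = sum(
            1
            for model_preds in predictions.values()
            if model_preds.get(image_id) != label
        )
        if failures >= min_failures:
            hard.append(image_id)
    return hard


# Toy data (hypothetical): three surrogate models, four generated images
ground_truth = {"img0": "backpack", "img1": "chair", "img2": "plate", "img3": "broom"}
predictions = {
    "surrogate_a": {"img0": "backpack", "img1": "stool", "img2": "bowl", "img3": "broom"},
    "surrogate_b": {"img0": "backpack", "img1": "stool", "img2": "plate", "img3": "mop"},
    "surrogate_c": {"img0": "backpack", "img1": "table", "img2": "plate", "img3": "broom"},
}

# Keep only images that every surrogate model misclassifies
hard_images = mine_hard_images(predictions, ground_truth, min_failures=3)
print(hard_images)
```

Lowering `min_failures` keeps more images but makes the resulting set easier, so the threshold controls how "hard" the retained set is.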

I wrote an in-depth blog post about this dataset, which you can read [here](https://medium.com/voxel51/imagenet-d-a-new-synthetic-test-set-designed-to-rigorously-evaluate-the-robustness-of-neural-ab8978716585).

### What we're doing in this tutorial

In this tutorial, you're going to:

1. Explore the ImageNet-D dataset using FiftyOne

2. Compute and visualize the embeddings for the images in this dataset using AIMv2 and CLIP to gain a deeper understanding of its contents

3. Perform zero-shot classification using CLIP in an attempt to verify/replicate the results in the paper

4. Perform zero-shot classification using AIMv2

5. Compare each model's performance to the ground truth labels to see which performs better

# Preliminaries

Let's kick things off by installing FiftyOne and the dependencies needed for this tutorial, then downloading the ImageNet-D dataset from the [Voxel51 org on Hugging Face](https://huggingface.co/Voxel51).

In [None]:
!pip install fiftyone umap-learn

In [None]:
import fiftyone as fo
import fiftyone.utils.huggingface as fouh

# Download ImageNet-D from the Hugging Face Hub and store it locally as "imagenet_d"
dataset = fouh.load_from_hub(
    "Voxel51/ImageNet-D",
    name="imagenet_d",
)

Once the dataset has been downloaded, you can do some initial exploration by launching the app:

In [None]:
fo.launch_app(dataset)

In [2]:
# Unique ground truth class names, reused later as the zero-shot label set
gt_labels = dataset.distinct("ground_truth.label")

In [None]:
!fiftyone plugins download https://github.com/harpreetsahota204/aim-embeddings-plugin


**How AIMv2 Differs from CLIP**  

**Core Training Differences**

| **AIMv2** | **CLIP** |  
|-----------|----------|  
| Uses **autoregressive modeling** to reconstruct inputs *sequentially* (image patches → text tokens) | Uses **contrastive learning** to align *parallel* image-text pairs |  
| Processes images and text as a **unified sequence** | Processes modalities **separately** |  
| Extracts training signals from **every token** (dense supervision) | Relies on **positive/negative pair contrast** (sparse supervision) |  
| Requires **no specialized batch processing** | Demands **large batches** for effective negative sampling |  
| Learns **implicit relationships** via sequential prediction | Forces **explicit alignment** of embeddings |  

**Sequence Architecture: Why Order Matters**  
AIMv2 deliberately processes **image patches first**, followed by text tokens:  
1. **Visual Foundation**: Text predictions leverage *complete* visual context (like describing a photo only after seeing it in full).  
2. **Unified Processing**: Predicts next image patches (e.g., reconstructing a photo’s bottom half from the top), then generates text autoregressively (e.g., completing "A dog plays in..." → "park").  
3. **Vision-Centric Design**: Forces robust visual representations to support both image reconstruction *and* text generation.  
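
To make the contrast in the table more concrete, here is a toy sketch of the two kinds of training signal: a CLIP-style symmetric contrastive loss over a batch of image-text similarities, and a next-token cross-entropy loss where every position contributes a term. It is not either model's training code. The function names are made up for illustration, and the sketch covers only text-token prediction, not AIMv2's image-patch reconstruction.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_style_loss(similarity):
    """Symmetric contrastive loss over an N x N image-text similarity matrix.

    Image i is paired with text i. Every other entry in its row (and column)
    acts as a negative, which is why batch composition matters.
    """
    n = len(similarity)
    img_to_txt = -sum(log_softmax(similarity[i])[i] for i in range(n)) / n
    columns = [list(col) for col in zip(*similarity)]
    txt_to_img = -sum(log_softmax(columns[i])[i] for i in range(n)) / n
    return (img_to_txt + txt_to_img) / 2


def next_token_loss(logits, targets):
    """Average cross-entropy where each position predicts its next token.

    Every position yields its own loss term (dense supervision).
    """
    per_token = [-log_softmax(row)[t] for row, t in zip(logits, targets)]
    return sum(per_token) / len(per_token)


# Toy inputs (hypothetical values)
similarity = [[5.0, 1.0], [0.5, 4.0]]              # 2 images x 2 captions
token_logits = [[2.0, 0.1, 0.1], [0.2, 3.0, 0.1]]  # 2 positions, vocabulary of 3

contrastive = clip_style_loss(similarity)
autoregressive = next_token_loss(token_logits, targets=[0, 1])
print(f"contrastive={contrastive:.4f} autoregressive={autoregressive:.4f}")
```

The contrastive loss only has meaning relative to the other pairs in the batch, while the autoregressive loss is defined per sequence, which is one way to read the "large batches" row of the table.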


In [None]:
import fiftyone.operators as foo

aim_embeddings = foo.get_operator("@harpreetsahota/aimv2_embeddings/compute_aimv2_embeddings")

In [None]:
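embedding_types = ["cls", "mean"]

# Compute one embedding field per pooling type. Top-level `await` works in notebooks.
# delegate=True queues each run as a delegated operation, which assumes a
# delegated service is running (for example via `fiftyone delegated launch`).
for emb_type in embedding_types:
    await aim_embeddings(
        dataset,
        model_name="apple/aimv2-large-patch14-224",
        embedding_types=emb_type,
        emb_field=f"aimv2_{emb_type}_emb",
        delegate=True,
    )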

In [None]:
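import torch

import fiftyone.zoo as foz

# Load CLIP with the dataset's class names so it can also serve as a zero-shot classifier
clip_model = foz.load_zoo_model(
    "clip-vit-base32-torch",
    text_prompt="A photo of a",
    classes=gt_labels,
    device="cuda" if torch.cuda.is_available() else "cpu",
)

# Store one CLIP embedding per sample in the "clip_emb" field
dataset.compute_embeddings(
    model=clip_model,
    embeddings_field="clip_emb",
)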

In [None]:
import fiftyone.brain as fob

embedding_fields = ["aimv2_cls_emb", "aimv2_mean_emb", "clip_emb"]

# Project each embedding field to 2D with UMAP, one brain run per field
for embeddings in embedding_fields:
    results = fob.compute_visualization(
        dataset,
        embeddings=embeddings,
        method="umap",
        brain_key=f"{embeddings}_viz",
        num_dims=2,
        n_neighbors=10,
        min_dist=0.051,
        verbose=True,
    )

In [None]:
fo.launch_app(dataset)

In [None]:
!fiftyone plugins download https://github.com/jacobmarks/zero-shot-prediction-plugin

In [None]:
import fiftyone.operators as foo

zsc = foo.get_operator("@jacobmarks/zero_shot_prediction/zero_shot_classify")

In [None]:
await zsc(
    dataset,
    labels=gt_labels,
    model_name="AIMv2",
    label_field="AIMv2_predictions",
    )

In [None]:
dataset.apply_model(
    model=clip_model, 
    label_field="clip_predictions"
    )
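
Conceptually, zero-shot classification with a CLIP-style model compares an image embedding against the embedding of a text prompt for each class and picks the closest one. The sketch below shows that idea with hand-written toy vectors. The `zero_shot_classify` helper, the `scale` value, and the embeddings are illustrative assumptions, not what `apply_model` runs internally.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def zero_shot_classify(image_emb, text_embs, scale=100.0):
    """Return (best prompt, softmax confidence) for one image embedding.

    text_embs: dict of prompt -> text embedding
    scale: assumed temperature applied to similarities before the softmax
    """
    prompts = list(text_embs)
    logits = [scale * cosine(image_emb, text_embs[p]) for p in prompts]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(prompts)), key=probs.__getitem__)
    return prompts[best], probs[best]


# Toy embeddings (hypothetical): one prompt per class, built from "A photo of a"
text_embs = {
    "A photo of a chair": [1.0, 0.0, 0.0],
    "A photo of a plate": [0.0, 1.0, 0.0],
}
label, confidence = zero_shot_classify([0.9, 0.1, 0.0], text_embs)
print(label, round(confidence, 3))
```

Because the class list is just a set of text prompts, the same model can be pointed at the 113 ImageNet-D classes without any fine-tuning.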

In [12]:
dataset.save()

In [None]:
dataset
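
With both prediction fields saved, the final step is to compare each model against the ground truth. The sketch below is a minimal, library-free way to compute top-1 accuracy from two lists of labels. The helper name `top1_accuracy` and the toy labels are assumptions for illustration and say nothing about how the models actually score on ImageNet-D.

```python
def top1_accuracy(ground_truth, predictions):
    """Fraction of samples whose predicted label equals the true label.

    A prediction of None (no label stored) is counted as incorrect.
    """
    if len(ground_truth) != len(predictions):
        raise ValueError("ground_truth and predictions must have the same length")
    if not ground_truth:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(1 for g, p in zip(ground_truth, predictions) if p is not None and g == p)
    return correct / len(ground_truth)


# Toy labels (hypothetical), standing in for values pulled from the dataset
gt = ["chair", "plate", "broom", "backpack"]
model_a_preds = ["stool", "plate", "mop", "backpack"]
model_b_preds = ["chair", "plate", None, "backpack"]

for name, preds in [("model_a", model_a_preds), ("model_b", model_b_preds)]:
    print(f"{name}: {top1_accuracy(gt, preds):.2%}")
```

In FiftyOne, the label lists could be pulled with `dataset.values("ground_truth.label")`, `dataset.values("clip_predictions.label")`, and `dataset.values("AIMv2_predictions.label")`. The built-in `dataset.evaluate_classifications()` method offers a fuller report, including per-class metrics.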