When humans encounter optical illusions, our brains often see things that aren't physically present in the image. This perceptual phenomenon, known as pareidolia, has long fascinated neuroscientists and psychologists. Now, researchers are turning these visual puzzles toward Vision-Language Models (VLMs) to test their perceptual capabilities.

I recently came across a paper, *[Illusory VQA: Benchmarking and Enhancing Multimodal Models on Visual Illusions](https://arxiv.org/abs/2412.08169)*, which introduces a novel task called Illusory VQA.

The core challenge presented in the [Illusory VQA](https://github.com/IllusoryVQA/IllusoryVQA) task is deceptively complex: given an image containing both a "Real Concept" (RC) and potentially an "Illusory Concept" (IC), can a VLM detect if an illusion is present and correctly answer questions about that illusory element?

This task demands more than standard image recognition: it assesses how well models can mimic human-like visual understanding. It is challenging because the model must recognize what is actually in the image while also perceiving what appears to be there because of the illusion, much like our own visual system does.

### The Illusory Datasets

For this task, the authors created four benchmark datasets, each targeting different aspects of visual illusion processing:

• **IllusionMNIST:** Built using the classic MNIST handwritten digit dataset as the source material, this dataset contains 3,960 training samples and 1,219 test samples. The researchers added a "No illusion" class to make the task more challenging, requiring models to determine whether an illusion is actually present.

• **IllusionFashionMNIST:** Based on the Fashion-MNIST dataset, which contains clothing items rather than digits, this collection includes 3,300 training samples and 1,267 test samples. Like its MNIST counterpart, it includes a "No illusion" class to further test discrimination abilities.

• **IllusionAnimals:** This dataset features animal images generated using SDXL-Lightning and transformed with ControlNet to create illusory versions. It comprises 3,300 training samples and 1,100 test samples, with the additional "No illusion" class.

• **IllusionChar:** This unique dataset focuses specifically on reading characters in images, with sequences of 3 to 5 characters per image. Including 9,900 training samples and 3,300 test samples, it tests how well models can interpret text within illusory contexts.

What I found particularly interesting is how these datasets were constructed:

<img src="assets/illusoryvqa-datagen.png" width="70%">

The research team:

- Generated scene descriptions using large language models

- Combined these descriptions with raw images

- Used a variant of ControlNet to create the final illusory images

- Conducted human evaluations to validate the quality of the generated images

- Asked participants to identify what they perceived in each picture

- Filtered out inappropriate content using NSFW detectors

This approach ensures that the illusions in the datasets genuinely challenge perceptual abilities in ways that mirror human visual processing.

### Testing Leading Multimodal Models

The study evaluated several state-of-the-art models:

<img src="assets/illusoryvqa-table2.png" width="70%">

The research team focused on zero-shot performance (how well models perform without specific training on illusions) and performance after fine-tuning.

The results show that all models suffered a performance drop when dealing with illusions compared to standard images, mirroring the human experience of being "fooled" by optical illusions. Different models demonstrated varying levels of robustness to different types of illusions, suggesting that architectural differences influence how these systems process visual information.

### A Simple Yet Effective Solution

An interesting finding from the research is their straightforward solution for improving model performance on illusory images. The technique:

1. Apply a Gaussian blur (a low-pass filter) to the illusory images
2. Convert the images to grayscale

This simple preprocessing approach yielded significant performance improvements across all tested models. 
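To make the two steps concrete, here is a minimal NumPy sketch of a Gaussian blur followed by grayscale conversion. The `sigma` value, the edge padding, and the luma weights are placeholder choices I picked for illustration, not the settings used in the paper, and `blur_then_grayscale` is a hypothetical helper name.

```python
import numpy as np


def gaussian_kernel_1d(sigma: float, radius: int) -> np.ndarray:
    """Return a normalized 1D Gaussian kernel of length 2 * radius + 1."""
    xs = np.arange(-radius, radius + 1, dtype=np.float64)
    kernel = np.exp(-(xs ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()


def gaussian_blur(image: np.ndarray, sigma: float = 2.0) -> np.ndarray:
    """Blur an (H, W, C) float image with a separable Gaussian low-pass filter."""
    radius = max(1, int(3 * sigma))
    kernel = gaussian_kernel_1d(sigma, radius)
    # Replicate the edges so the output keeps the input height and width
    padded = np.pad(image, ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    # Convolve along the vertical axis, then along the horizontal axis
    blurred = np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="valid"), 0, padded
    )
    blurred = np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="valid"), 1, blurred
    )
    return blurred


def blur_then_grayscale(image: np.ndarray, sigma: float = 2.0) -> np.ndarray:
    """Apply the blur, then collapse RGB to one channel with luma weights."""
    blurred = gaussian_blur(image, sigma)
    luma_weights = np.array([0.299, 0.587, 0.114])  # assumed RGB channel order
    return blurred @ luma_weights


# Toy input: random RGB noise stands in for a real image
rng = np.random.default_rng(0)
noisy = rng.random((32, 32, 3))
result = blur_then_grayscale(noisy)
print(result.shape, round(float(noisy.std()), 3), round(float(result.std()), 3))
```

The blurred output has much less pixel-to-pixel variation than the input, which is the low-pass effect: fine texture is suppressed and only the coarse structure remains.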

For example, in the IllusionAnimals dataset:

- CLIP initially showed the highest performance on illusory images

- After applying the filter, BLIP-2 achieved the best results—even outperforming humans

- All models saw substantial gains in accuracy after implementing the filtering technique

This finding suggests that relatively simple image processing techniques can help AI systems overcome perceptual challenges posed by illusions. The filtering process essentially helps the models differentiate between real and illusory elements in the images, similar to how certain visual processing aids might help humans see through optical illusions.

## What we're going to do in this tutorial

In this tutorial, we'll explore the IllusionAnimals dataset and evaluate how different AI models perceive visual illusions. We'll:

1. Load and explore the IllusionAnimals dataset using FiftyOne and see if we can reproduce the results from the paper, but only focusing on the CLIP model.

2. Compute embeddings using multiple models:

   - CLIP  

   - SigLIP 2 (a new model released by Google)
   
   - AIMv2 (in my opinion a highly slept on contender to CLIP released in late 2024 by Apple)

3. Visualize these embeddings using UMAP dimensionality reduction

4. Perform zero-shot classification using the models mentioned above

5. Test Visual Question-Answering (VQA) capabilities using:
   - Janus-Pro
   - Moondream2

6. Compare how models perform with and without hints about potential illusions

Let's start by installing some dependencies and downloading the dataset from the Hugging Face Hub:

In [None]:
# installing bleeding edge version of transformers
!pip install git+https://github.com/huggingface/transformers.git#egg=transformers

In [None]:
!pip install fiftyone umap-learn

In [None]:
import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

dataset = load_from_hub(
    "Voxel51/IllusionAnimals",
    overwrite=True,
    persistent=True
    )

In [None]:
fo.launch_app(dataset)

Let's start with an initial visual vibe check of what is in this dataset.

<img src="assets/illusion_animals_initial_explore.gif" width="70%">


Note this is a [Grouped Dataset](https://docs.voxel51.com/user_guide/groups.html#grouped-datasets). Grouped datasets allow us to represent multiple slices of the same data point. This way data for multiple perspectives of the same scene can be stored, visualized, and queried in ways that respect the relationships between the slices of data.

For the IllusionAnimals subset, the dataset includes different slices representing variations of the images with and without illusions, and with and without filters. The specific slices available in the IllusionAnimals dataset are:

*   **Raw Images**: These are the original images of animals without any illusions applied. They serve as a baseline for evaluating the models' performance on standard image recognition tasks. The models should accurately identify the animal in the image.

*   **Illusory Images**: These images have visual illusions incorporated into them. The illusions are designed to make the images appear as one animal while subtly containing elements of another. The goal is to test whether the models can detect the presence of the illusory concept, even with the presence of the real concept.

*   **Filtered Images**: These are the illusory images after they have been processed with a Gaussian blur (a low-pass filter) and converted to grayscale. This filter is applied to enhance the models’ ability to detect the illusions. The idea is that the filter helps to reduce noise and highlight the illusory elements, making it easier for the models to identify and interpret the content. Applying the filter generally improves model performance.

*   **Illusionless Class**: In addition to the above, an extra class called "illusionless" is added to push the models’ capabilities. This class lets us test whether the models can detect when no illusion is present in the picture.


We're only going to work with two of the slices in this tutorial, "main" (the illusion and no illusion images) and "filtered" (which are the images after the Gaussian Blur and grayscaling).

FiftyOne datasets are logical datasets that point to media files on disk rather than storing the media contents directly. So by cloning the dataset we are not duplicating the images on disk, only the sample records (labels and metadata) that point to them.

Note that when I originally parsed the dataset I mapped the images with no illusion to the `illusionless` class. To be consistent with the paper, I'm going to map these to the `no illusion` class. I was too lazy to reparse the dataset, but luckily this is easy to do in FiftyOne using the `map_labels` method of the dataset.

In [51]:
main_images = dataset.select_group_slices("main").map_labels("label", {"illusionless": "no illusion"}).clone(name="illusion_animals") # get the main images
main_images.persistent = True # make the dataset persistent across sessions 

filtered_images = dataset.select_group_slices("filtered").map_labels("label", {"illusionless": "no illusion"}).clone(name="illusion_animals_fitered")
filtered_images.persistent = True

We'll also need to have the labels, so we can grab them now. It doesn't matter which slice we grab them from as they are both the same:

In [52]:
class_names = main_images.distinct("label.label") # get the class names

In [None]:
class_names

### Using Embeddings for Deeper Dataset Understanding

The first thing I want to do is gain a deeper understanding of the images in this dataset. For that, we can use embeddings.

Visual embeddings are high-dimensional vector representations of images that capture semantic and visual features. For the IllusionAnimals dataset, embeddings are particularly valuable because they can help us:

1. **Visualize Relationships**: Reducing these high-dimensional embeddings to 2D using UMAP helps visualize how different images cluster together and potentially identify patterns in the dataset.

2. **Compare Model Perspectives**: Different models may encode visual information differently. By comparing embeddings from multiple models (CLIP, SigLIP 2, and AIMv2 in our case), we can understand how their "perception" of illusions differs.

3. **Analyze Illusion Effects**: We can examine whether illusory versions of images cluster closer to their "real" concept or their "illusory" concept, giving us insights into how effectively the illusions work from a model's perspective.
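As a rough illustration of the third point, here is a standard-library sketch that checks whether a single embedding sits closer to the centroid of a "real" concept or an "illusory" concept using cosine similarity. The 3-dimensional vectors are made-up toy data (real embeddings have hundreds of dimensions), and `closer_concept` is a hypothetical helper, not part of FiftyOne.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def closer_concept(embedding, real_embeddings, illusory_embeddings):
    """Report which concept centroid the embedding is more similar to."""
    real_sim = cosine_similarity(embedding, centroid(real_embeddings))
    illusory_sim = cosine_similarity(embedding, centroid(illusory_embeddings))
    label = "illusory" if illusory_sim > real_sim else "real"
    return label, real_sim, illusory_sim


# Toy data: two reference embeddings per concept and one illusory image
real_scene = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]]
illusory_animal = [[0.0, 0.1, 1.0], [0.1, 0.0, 0.9]]
sample = [0.3, 0.1, 0.8]

label, real_sim, illusory_sim = closer_concept(sample, real_scene, illusory_animal)
print(label, round(real_sim, 3), round(illusory_sim, 3))
```

With real data you would swap the toy lists for vectors read from an embeddings field, for example via `main_images.values("clip_embeddings")`.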

For this analysis, we'll use three models:

- **CLIP**

- **SigLIP 2** 

- **AIMv2**

Let's start by instantiating the models and then computing embeddings.

#### CLIP

There's been a lot written about the CLIP model, so I won't repeat anything here. However, if you're interested in going deep into CLIP and its history then [check out this blog](https://voxel51.com/blog/a-history-of-clip-model-training-data-advances/).

We can use the model via [FiftyOne's integration with Hugging Face](https://docs.voxel51.com/integrations/huggingface.html#zero-shot-classification). In the paper the authors use `CLIP-ViT-base-patch32` in their experiments, which is the checkpoint we will also use. You'll see that we instantiate the model with classes but it will not affect the embeddings.

In [54]:
import torch 

import fiftyone.zoo as foz

clip_model = foz.load_zoo_model(
    "zero-shot-classification-transformer-torch",
    name_or_path="openai/clip-vit-base-patch32", 
    classes=class_names,
    device="cuda" if torch.cuda.is_available() else "cpu",
    # install_requirements=True # uncomment this line if you are running this code for the first time
    )

In [None]:
main_images.compute_embeddings(
    model=clip_model,
    embeddings_field="clip_embeddings"
)

filtered_images.compute_embeddings(
    model=clip_model,
    embeddings_field="clip_embeddings"
)

#### SigLIP 2

SigLIP 2 is a [family of multilingual vision-language encoders](https://huggingface.co/collections/google/siglip2-67b5dcef38c175486e240107) that improves upon the original SigLIP model. It incorporates several techniques, including captioning-based pretraining, self-supervised losses (self-distillation, masked prediction), and online data curation, into a unified training recipe.

I don't want to get too deep into the technical details of the SigLIP 2 model. Feel free to [check out the paper](https://arxiv.org/abs/2502.14786) for more information about the model, its performance, and how it was trained.

##### SigLIP 2 Data Curation

What I do want to spend some time talking about, however, is the data curation method discussed in the paper. It focuses on improving the quality and diversity of training data, especially for smaller models. The primary technique used is **[Active Curation as Implicit Distillation (ACID)](https://arxiv.org/pdf/2411.18674)**. Here's a breakdown of how it works:

1.  **Teacher-Student Model Setup**:
    *   A smaller model (student) is trained with the help of a more powerful, pre-trained model (teacher).

    *   In SigLIP 2, the **teacher model is a fine-tuned SigLIP 2 So400m model**, which is initially trained on a diverse dataset and then further fine-tuned on a high-quality curated dataset.

2.  **Scoring Examples for "Learnability"**:

    *   During training, both the teacher and student models evaluate the training examples.

    *   Each example is scored based on how "learnable" it is. This score is basically the difference in loss values between the current student model and the reference (teacher) model, and is meant to reflect how well the student can learn from that particular example.

    *   Examples that are easy for the reference but difficult for the current student are given high scores. This means the data is "learnable" because the reference model, which has already been trained, can easily understand the patterns in the data. The student model, still in training, finds these patterns more challenging.

3.  **Active Sample Selection**:

    *   Instead of using all available data, **only the most "learnable" examples are selected** for each training batch.

    *   This selection is done jointly, with two main criteria for scoring sub-batches:

        *   Easy-reference scoring: This uses the loss values of the reference model to preferentially sample batches that are easy for the reference model.

        *   Learnability scoring: This uses the difference in loss values between the current student model and the reference model to prioritize batches that are easy for the reference model but difficult for the student model.

    *   By curating data based on the reference model, ACID implicitly distills its knowledge through a data-driven objective. This objective combines model predictions and real labels as targets, and retains targets where the reference model and labels agree, allowing for mutual denoising of model predictions and data labels.

4.  **Filtering Ratio**:
    *   To balance the benefits of curation with computational costs, a filtering ratio is applied.

    *   For example, a filtering ratio of 0.5 means that the super-batch is twice the size of the final batch (64,000 examples), and the best 50% are selected.

    *   The B/32 model uses a filtering ratio of 0.75.

5.  **Implicit Distillation**:

    *   By selectively training on the most informative examples, the student model (the smaller SigLIP 2 model) learns to mimic the behavior of the teacher model.

    *   This process implicitly distills the knowledge from the larger, more capable teacher model into the smaller student model, improving its performance.

    *   This method saves computational resources compared to explicit distillation methods that may use a second teacher model.

In essence, this data curation method ensures that the smaller models are trained on the most valuable data, leading to improved performance and efficiency. By fine-tuning the teacher model on a curated dataset, the method also captures the benefits of diverse knowledge and high-quality data.
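Here is a deliberately simplified sketch of the learnability idea, using only the standard library. It scores each example independently as student loss minus reference loss and keeps the top fraction of a super-batch. The real ACID method selects sub-batches jointly, so treat this as an illustration of the scoring rule rather than the actual algorithm. The loss values are made up, the function names are hypothetical, and I'm assuming (based on the 0.5 example above) that a filtering ratio `f` means keeping a `1 - f` fraction.

```python
def learnability_scores(student_losses, reference_losses):
    """High score = hard for the student but easy for the reference model."""
    return [s - r for s, r in zip(student_losses, reference_losses)]


def select_batch(student_losses, reference_losses, filtering_ratio):
    """Return the indices of the examples kept from a super-batch.

    Assumes a filtering ratio of f keeps the top (1 - f) fraction.
    """
    scores = learnability_scores(student_losses, reference_losses)
    keep = max(1, round(len(scores) * (1.0 - filtering_ratio)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Toy super-batch of four examples with made-up per-example losses
student = [2.0, 0.5, 1.8, 0.4]
reference = [0.3, 0.4, 1.7, 0.1]

kept_half = select_batch(student, reference, filtering_ratio=0.5)
kept_quarter = select_batch(student, reference, filtering_ratio=0.75)
print(kept_half, kept_quarter)
```

Example 0 is kept in both cases because the student struggles with it while the reference finds it easy. Example 2 is dropped because both models find it hard, which suggests it may be noisy rather than learnable.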


In [56]:
import torch 

import fiftyone.zoo as foz

siglip_model = foz.load_zoo_model(
    "zero-shot-classification-transformer-torch",
    name_or_path="google/siglip2-base-patch32-256",
    classes=class_names,
    device="cuda" if torch.cuda.is_available() else "cpu",
    # install_requirements=True # uncomment this line if you are running this code for the first time
    )

Note, I'm using this particular checkpoint because I'm trying to be as "apples-to-apples" with the CLIP model as possible. Of course, these are two different models trained on completely different datasets, so for the purposes of this tutorial, "apples-to-apples" means picking checkpoints that have fairly similar names. In this case, they both use a `ViT-B/32` vision encoder...that's good enough for us.

In [None]:
main_images.compute_embeddings(
    model=siglip_model,
    embeddings_field="siglip_embeddings"
)

filtered_images.compute_embeddings(
    model=siglip_model,
    embeddings_field="siglip_embeddings"
)

#### AIMv2

AIMv2 is a [family of vision encoders released in late 2024](https://huggingface.co/collections/apple/aimv2-6720fe1558d94c7805f7688c) that uses a novel multimodal autoregressive method.

It processes image patches and text tokens as a unified sequence, using a causal multimodal decoder to predict elements sequentially. AIMv2 processes data as one continuous sequence, predicting the next step in the series. It deliberately puts image information first, followed by text, creating a specific sequence: image patches → text tokens. This differs from CLIP's parallel processing of image and text and strengthens the vision encoder. AIMv2 is trained on 12 billion image-text samples, balancing human-written alt-text and synthetically generated captions from diverse sources. I've written about the AIMv2 models in great detail in two blog posts, which you can read [here](https://medium.com/voxel51/visual-understanding-with-aimv2-76c58dcd68f9) and [here](https://medium.com/voxel51/aimv2-outperforms-clip-on-synthetic-dataset-imagenet-d-4452760b624c).

In order to use AIMv2 for embeddings, we need to [install a plugin](https://github.com/harpreetsahota204/aim-embeddings-plugin).

In [None]:
!fiftyone plugins download https://github.com/harpreetsahota204/aimv2_embeddings

!fiftyone plugins requirements @harpreetsahota/aimv2_embeddings --install

We'll need to set an environment variable as well:

In [57]:
import os

os.environ['FIFTYONE_ALLOW_LEGACY_ORCHESTRATORS'] = 'true'

You'll also need to kick off a delegated service by running `fiftyone delegated launch` in the terminal.

In [58]:
import fiftyone.operators as foo

aim_embeddings = foo.get_operator("@harpreetsahota/aimv2_embeddings/compute_aimv2_embeddings")

In [None]:
# Run the operator on your dataset
await aim_embeddings(
    main_images,
    model_name="apple/aimv2-large-patch14-224",  # Choose any supported model
    embedding_types="cls", #can be "cls", "mean"
    emb_field="aimv2_embeddings",
    delegate=True
)

In [None]:
# Run the operator on your dataset
await aim_embeddings(
    filtered_images,
    model_name="apple/aimv2-large-patch14-224",  # Choose any supported model
    embedding_types="cls",
    emb_field="aimv2_embeddings",
    delegate=True
)

#### Exploring embeddings

Now we can use UMAP to reduce the dimensionality of the embeddings and explore them in the FiftyOne app.

In [None]:
import fiftyone.brain as fob

# Define datasets and embedding fields as lists
datasets = [main_images, filtered_images]
embedding_fields = [
    "aimv2_embeddings",
    "clip_embeddings",
    "siglip_embeddings"
]

# Compute UMAP for each dataset and embedding combination
for ds in datasets:
    for field in embedding_fields:
        _fname = field.split("_embeddings")[0]
        brain_key = f"{_fname}_viz"
        
        results = fob.compute_visualization(
            ds,
            embeddings=field,
            method="umap",
            brain_key=brain_key,
            num_dims=2,
        )

In [None]:
fo.launch_app(main_images)

<img src="assets/illusion-embeddings.gif" width="70%">

Looking at the embeddings between filtered and non-filtered images, there are several observations:

1. **Embedding Separation**

- The filtered images (after Gaussian blur and grayscale conversion) tend to form tighter, more distinct clusters compared to their unfiltered counterparts. This is especially noticeable in the AIMv2 and CLIP embedding spaces. Notably, the SigLIP 2 model doesn't show much clustering and the embeddings look uniformly distributed across the embedding space. You'll likely see different results for a different checkpoint, but this particular checkpoint doesn't seem to discern classes that well.

- This suggests the filtering process helps the models create more consistent representations of similar images.

2. **Noise Reduction**

- The filtered embeddings show less scatter/noise, indicating that the preprocessing helps reduce irrelevant visual variations; the notable exception being the SigLIP 2 embeddings.

- This aligns with the paper's findings that filtering improves model performance by helping models focus on the essential features that define the illusions.

3. **Class Boundary Clarity**

- In the filtered version, the boundaries between different classes appear more defined.

- This is particularly noticeable for the "no illusion" class, which forms more cohesive clusters after filtering.

4. **Distance Between Related Concepts**

- The filtered embeddings seem to better capture the relationship between the real concept and illusory concept, as evidenced by more logical spatial relationships in the embedding space.

- This suggests the filtering process helps models better understand both the actual and illusory elements in the images.

This visualization helps explain why the simple preprocessing technique (Gaussian blur + grayscale) improved model performance in the original paper - it's creating cleaner, more structured representations that are easier for the models to work with.

### Computing uniqueness values

We can use the embeddings to compute uniqueness values for the images in our dataset. 

We can use the `compute_uniqueness` method from the FiftyOne Brain, which measures how dissimilar each sample is from its neighbors in an embedding space. It finds each sample's 3 nearest neighbors, weights their distances (60%-30%-10%), and normalizes these weighted distances to produce scores between 0-1. Higher scores indicate samples that are more "isolated" or distinct in the embedding space, while lower scores indicate samples that have many close neighbors. 
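To build some intuition for that description, here is a small standard-library re-creation of the scoring idea. It is not the FiftyOne Brain implementation: the Euclidean distance, giving the largest weight to the nearest neighbor, and normalizing by the maximum are all assumptions I made for the sketch, and the 2D points are toy data.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def uniqueness_scores(embeddings, weights=(0.6, 0.3, 0.1)):
    """Weighted distance to the k nearest neighbors, scaled into [0, 1]."""
    k = len(weights)
    raw = []
    for i, emb in enumerate(embeddings):
        # Sorted distances to every other sample, nearest first
        dists = sorted(
            euclidean(emb, other)
            for j, other in enumerate(embeddings)
            if j != i
        )
        raw.append(sum(w * d for w, d in zip(weights, dists[:k])))
    top = max(raw)
    # Assumed normalization: divide by the largest weighted distance
    return [r / top for r in raw] if top > 0 else [0.0 for _ in raw]


# Four tightly clustered points plus one far-away outlier
points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]]
unique = uniqueness_scores(points)
print([round(u, 3) for u in unique])
```

The outlier gets the top score while the clustered points score close to zero, which matches the intuition that isolated samples are the most "unique".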

I'll compute these values for the CLIP embeddings on the Illusion Animals subset, and leave computing uniqueness values for the other embeddings to the reader:

In [None]:
import fiftyone.brain as fob

fob.compute_uniqueness(
    main_images,
    embeddings="clip_embeddings",
    uniqueness_field="clip_uniqueness",
    )

You can then filter on the uniqueness values to see the most and least unique images in the dataset:

<img src="assets/illusion-animals-uniqueness.gif">

#### Reproducing CLIP results from the paper

We've already instantiated the CLIP model above. To use it for zero-shot classification, we can make use of the `apply_model` method of the dataset:

In [None]:
main_images.apply_model(
    model=clip_model, 
    label_field="clip_predictions",
    text_prompt = "illusion animal ",
)

In [63]:
clip_res_illusions = main_images.evaluate_classifications(
    pred_field="clip_predictions",
    gt_field="label",
    method="simple",
    eval_key="clip_eval",
    )

In [None]:
filtered_images.apply_model(
    model=clip_model, 
    label_field="clip_predictions",
)

In [65]:
clip_res_filtered = filtered_images.evaluate_classifications(
    pred_field="clip_predictions",
    gt_field="label",
    method="simple",
    eval_key=f"clip_eval",
    )

We can use the Model Evaluation panel in the FiftyOne app to see the model performance:

<img src="assets/illusion-eval.gif" width="70%">

And we can also print the classification reports programmatically as shown below:

In [None]:
clip_res_illusions.print_metrics(average='weighted', digits=4)

In [None]:
clip_res_filtered.print_metrics(average='weighted', digits=4)

The paper reported an accuracy of 42.64 for the illusion images and 85.45 for the non-illusion images.

We're not able to reproduce their results, and I suspect that's because of how they implement their [inference function](https://github.com/IllusoryVQA/IllusoryVQA/blob/main/Experiments/Zero-Shot/CLIP/inference_CLIP_IllusionAnimals.ipynb):

```python
def inference(img, labels, model, vis_processors, device):
    image = vis_processors["eval"](img).unsqueeze(0).to(device)
    sample = {"image": image, "text_input": labels}
    clip_features = model.extract_features(sample)
    image_features = clip_features.image_embeds_proj
    text_features = clip_features.text_embeds_proj
    sims = (image_features @ text_features.t())[0] / 0.01
    probs = torch.nn.Softmax(dim=0)(sims).tolist()
    max_index = probs.index(max(probs))
    max_label = labels[max_index]
    return max_label
```

1. **Feature Extraction vs End-to-End**

- Paper implementation explicitly extracts features using `model.extract_features()` and then computes similarities manually

- Standard implementation uses the model's built-in forward pass (`model(**inputs)`) which handles this internally

2. **Temperature Scaling**

- Paper implementation uses a custom temperature value of 0.01: `sims = (image_features @ text_features.t())[0] / 0.01`

- Standard implementation uses CLIP's default temperature scaling built into the model


3. **Feature Projection**

- Paper specifically uses projected embeddings: `image_features = clip_features.image_embeds_proj`

- Standard implementation lets the model handle the projection internally


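These differences may contribute to the gap in results, though I haven't isolated which one matters most. The temperature parameter significantly affects how "sharp" or "soft" the probability distribution becomes after softmax: a low value like 0.01 makes the predicted probabilities much more peaked. One caveat is that dividing every similarity by the same positive constant doesn't change their ranking, so temperature by itself shouldn't change which label gets picked, and therefore shouldn't change accuracy. Differences in the text prompt, the image preprocessing, or the weights each library loads seem like more plausible causes.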
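To make the effect of temperature concrete, here is a small standard-library sketch. The similarity values are made up for illustration and `softmax_with_temperature` is a hypothetical helper. It compares the paper's temperature of 0.01 against a temperature of 1.0 (no scaling).

```python
import math


def softmax_with_temperature(similarities, temperature):
    """Softmax over similarities after dividing by a positive temperature."""
    scaled = [s / temperature for s in similarities]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up cosine similarities between one image and three class prompts
sims = [0.31, 0.28, 0.25]

sharp = softmax_with_temperature(sims, temperature=0.01)
soft = softmax_with_temperature(sims, temperature=1.0)

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in soft])
print(sharp.index(max(sharp)) == soft.index(max(soft)))
```

The low temperature pushes almost all of the probability mass onto the top class, while the unscaled version is nearly uniform. In both cases the same class ranks first, so only the confidence changes.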

#### Testing SigLIP 2 and AIMv2 

We can use the AIMv2 model for zero shot classification directly with FiftyOne's integration with Hugging Face (it's just the embeddings that we needed a plugin for). Let's go ahead and instantiate the model:

In [68]:
aim_model = foz.load_zoo_model(
    "zero-shot-classification-transformer-torch",
    name_or_path="apple/aimv2-large-patch14-224-lit", 
    classes=class_names,
    device="cuda" if torch.cuda.is_available() else "cpu",
    # install_requirements=True #yes
    )

And apply it to our datasets:

In [None]:
main_images.apply_model(
    model=aim_model, 
    label_field="aimv2_predictions",
    text_prompt = "illusion animal ",
    )

aim_res_illusions = main_images.evaluate_classifications(
    pred_field="aimv2_predictions",
    gt_field="label",
    method="simple",
    eval_key=f"aim_eval",
    )

Now for the filtered images:

In [None]:
filtered_images.apply_model(
    model=aim_model, 
    label_field="aimv2_predictions",
    )

aim_res_filtered = filtered_images.evaluate_classifications(
    pred_field="aimv2_predictions",
    gt_field="label",
    method="simple",
    eval_key=f"aim_eval",
    )

Now, let's apply our already instantiated SigLIP 2 model to our datasets as well:

In [None]:
main_images.apply_model(
    model=siglip_model, 
    label_field="siglip2_predictions",
    text_prompt = "illusion animal ",
    )

siglip_res_illusions = main_images.evaluate_classifications(
    pred_field="siglip2_predictions",
    gt_field="label",
    method="simple",
    eval_key=f"siglip2_eval",
    )

And for the filtered images:

In [None]:
filtered_images.apply_model(
    model=siglip_model, 
    label_field="siglip2_predictions",
    )

siglip_res_filtered = filtered_images.evaluate_classifications(
    pred_field="siglip2_predictions",
    gt_field="label",
    method="simple",
    eval_key=f"siglip2_eval",
    )

We can run a comparison of model performance in the FiftyOne App as well.

#### Summary of findings

| Model | Dataset Version | Our Results | Paper Results |
|-------|-----------------|-------------|---------------|
| CLIP | Illusion Images | 56.40% | 42.64% |
| CLIP | Filtered Images | 62.80% | 85.45% |
| SigLIP 2 | Illusion Images | 10.40% | N/A |
| SigLIP 2 | Filtered Images | 19.80% | N/A |
| AIMv2 | Illusion Images | 22.9% | N/A |
| AIMv2 | Filtered Images | 47% | N/A |

I encourage you to dig into the results yourself, and if you find anything interesting please comment below. 

Given this isn't a research paper, and we've already covered a lot of ground, I'll just share my high-level observation: CLIP is crushing it! I had high hopes for the SigLIP 2 model, but on this particular task it doesn't perform as well as CLIP or AIMv2. To be fair, [I'm a huge AIMv2 fanboy](https://medium.com/voxel51/visual-understanding-with-aimv2-76c58dcd68f9)...so I was hoping it would beat both models, but it let me down here.

In [1]:
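import fiftyone as fo

# Reload the persistent datasets created earlier in this tutorial
main_images = fo.load_dataset("illusion_animals")

filtered_images = fo.load_dataset("illusion_animals_fitered")

# Re-derive the class names, which the prompts below rely on
class_names = main_images.distinct("label.label")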

In [None]:
!fiftyone plugins download https://github.com/harpreetsahota204/janus-vqa-fiftyone

!fiftyone plugins requirements @harpreetsahota/janus_vqa --install

!fiftyone plugins download https://github.com/harpreetsahota204/moondream2-plugin

!fiftyone plugins requirements @harpreetsahota/moondream2 --install

In [None]:
NO_HINT_PROMPT = f"""Which class is in the picture: {', '.join(class_names)}. 
Your answer must be one of these exact classes, no other answers allowed. 
Respond in one word for your guess of the correct class without any extra explanation."""

In [None]:
import fiftyone.operators as foo

janus_vqa = foo.get_operator("@harpreetsahota/janus_vqa/janus_vqa")

moondream = foo.get_operator("@harpreetsahota/moondream2/moondream")

#### No hint prompt

In [6]:
await janus_vqa(
    main_images,
    model_path="deepseek-ai/Janus-Pro-1B",
    question=NO_HINT_PROMPT,
    question_field="no_hint_prompt",
    answer_field="janus_no_hint_answer",
    delegate=True
    )

In [None]:
await moondream(
    main_images,
    revision="2025-01-09",
    operation="query",
    output_field="moondream_no_hint_answer",
    query_text=NO_HINT_PROMPT,
    delegate=True
    )

#### Hint prompt

In [24]:
HINT_PROMPT = f"""There might be an image illusion of something in this image. 
These are the classes that the image illusion might belong to: {', '.join(class_names)}.
Your answer must be one of these exact classes, no other answers allowed.  
Respond in one word for your guess of the correct class without any extra explanation.
"""

In [19]:
await janus_vqa(
    main_images,
    model_path="deepseek-ai/Janus-Pro-1B",
    question=HINT_PROMPT,
    question_field="hint_prompt",
    answer_field="janus_hint_answer",
    delegate=True
    )

In [None]:
await moondream(
    main_images,
    revision="2025-01-09",
    operation="query",
    output_field="moondream_hint_answer",
    query_text=HINT_PROMPT,
    delegate=True
    )
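Once the answers are stored, one simple way to compare the hint and no-hint prompts is to normalize each free-text answer to a class name and compute accuracy. The sketch below uses only the standard library. The class names, answers, and ground truth are made-up toy values, and `normalize_answer` and `accuracy` are hypothetical helpers rather than part of either plugin.

```python
def normalize_answer(answer, class_names):
    """Map a raw model answer to an exact class name, or None if no match."""
    cleaned = answer.strip().strip(".!\"'").lower()
    for name in class_names:
        if cleaned == name.lower():
            return name
    return None


def accuracy(answers, ground_truth, class_names):
    """Fraction of answers that normalize to the ground truth class."""
    correct = sum(
        normalize_answer(answer, class_names) == truth
        for answer, truth in zip(answers, ground_truth)
    )
    return correct / len(ground_truth)


# Toy values standing in for the stored answer fields and labels
toy_classes = ["cat", "dog", "no illusion"]
toy_truth = ["cat", "dog", "dog"]
toy_no_hint_answers = ["Cat.", "forest", "dog"]
toy_hint_answers = ["cat", "Dog", "dog"]

acc_no_hint = accuracy(toy_no_hint_answers, toy_truth, toy_classes)
acc_hint = accuracy(toy_hint_answers, toy_truth, toy_classes)
print(round(acc_no_hint, 3), round(acc_hint, 3))
```

Answers that fall outside the allowed classes (like "forest" above) are counted as wrong, which is worth keeping in mind since VQA models don't always follow the "one of these exact classes" instruction.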

Moondream2 can also produce short captions. Let's generate short captions and then compute the similarity between each caption and the ground truth label.

Then let's also see if any of the captions actually include the classes of interest.
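For the second check, a whole-word match is a reasonable starting point. This is a minimal sketch using the standard library's `re` module. The captions and class names are made up, and `classes_mentioned` is a hypothetical helper.

```python
import re


def classes_mentioned(caption, class_names):
    """Return the class names that appear as whole words in the caption."""
    lowered = caption.lower()
    return [
        name
        for name in class_names
        if re.search(rf"\b{re.escape(name.lower())}\b", lowered)
    ]


toy_classes = ["cat", "dog", "no illusion"]

hit = classes_mentioned("A snowy village that looks like a cat's face", toy_classes)
miss = classes_mentioned("A catalog lying on a wooden table", toy_classes)
print(hit, miss)
```

Because this matches whole words only, "catalog" doesn't count as "cat", but plurals like "dogs" won't match either, so you may want to loosen the pattern depending on how the captions are phrased.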

### Why This Matters for AI Practitioners and Researchers

This research opens exciting new avenues for improving multimodal AI systems:

1. **Perceptual Robustness**: Training models to handle illusory images could make them more robust to adversarial attacks and unusual visual inputs. If a model can correctly process information even when presented with potentially misleading visual cues, it may be less susceptible to manipulation or confusion in real-world applications.

2. **Cognitive Alignment**: Understanding how AI models perceive illusions differently from humans can help researchers better align AI visual processing with human cognition. This alignment is crucial for applications where AI systems need to interpret visual information similarly to humans, such as in autonomous driving or medical image analysis.

3. **Preprocessing Solutions**: The simple filtering technique offers an immediate way to improve model performance on challenging visual inputs without requiring extensive retraining or architectural changes.

4. **Benchmark Advancement**: The Illusory VQA datasets provide valuable new benchmarks that push beyond conventional image recognition tasks, helping researchers identify strengths and weaknesses in current multimodal architectures.

5. **Bridging Disciplines**: This work creates interesting connections between AI research and cognitive psychology, potentially leading to cross-disciplinary insights about visual perception.

For AI practitioners working on multimodal systems, incorporating illusion testing into evaluation protocols could reveal important limitations that might otherwise go undetected. Similarly, the preprocessing techniques described in this research could be adapted for a variety of challenging visual inputs beyond just illusions.