When humans encounter optical illusions, our brains often see things that aren't physically present in the image. 

This perceptual phenomenon, known as pareidolia, has long fascinated neuroscientists and psychologists. Now, researchers are turning these visual puzzles toward Vision-Language Models (VLMs) to test their perceptual capabilities.

I recently came across a paper, *[Illusory VQA: Benchmarking and Enhancing Multimodal Models on Visual Illusions](https://arxiv.org/abs/2412.08169)*, which introduces a novel task called Illusory VQA. The core challenge presented in the [Illusory VQA](https://github.com/IllusoryVQA/IllusoryVQA) task is deceptively complex: given an image containing both a "Real Concept" (RC) and potentially an "Illusory Concept" (IC), can a VLM detect if an illusion is present and correctly answer questions about that illusory element?

This task requires perception beyond standard image recognition and assesses how well models can mimic human-like visual understanding. It is challenging because the model must simultaneously recognize what is actually in the image and perceive what appears to be there because of the illusion, much like our own visual system does.

### The Illusory Datasets

For this task, the authors created four benchmark datasets, each targeting different aspects of visual illusion processing:

- **IllusionMNIST:** Built using the classic MNIST handwritten digit dataset as the source material, this dataset contains 3,960 training samples and 1,219 test samples. The researchers added a "No illusion" class to make the task more challenging, requiring models to determine whether an illusion is actually present.

- **IllusionFashionMNIST:** Based on the Fashion-MNIST dataset, which contains clothing items rather than digits, this collection includes 3,300 training samples and 1,267 test samples. Like its MNIST counterpart, it includes a "No illusion" class to further test discrimination abilities.

- **IllusionAnimals:** This dataset features animal images generated using SDXL-Lightning and transformed with ControlNet to create illusory versions. It comprises 3,300 training samples and 1,100 test samples, plus the additional "No illusion" class.

- **IllusionChar:** This dataset focuses specifically on reading characters in images, with sequences of 3 to 5 characters per image. With 9,900 training samples and 3,300 test samples, it tests how well models can interpret text within illusory contexts.

What I found particularly interesting is how these datasets were constructed:

<img src="assets/illusoryvqa-datagen.png" width="70%">

The research team:

- Generated scene descriptions using large language models

- Combined these descriptions with raw images

- Used a variant of ControlNet to create the final illusory images

- Conducted human evaluations to validate the quality of the generated images

- Asked participants to identify what they perceived in each picture

- Filtered out inappropriate content using NSFW detectors

This approach ensures that the illusions in the datasets genuinely challenge perceptual abilities in ways that mirror human visual processing.

### Testing Leading Multimodal Models

The study evaluated several state-of-the-art models:

<img src="assets/illusoryvqa-table2.png" width="70%">

The research team focused on zero-shot performance (how well models perform without specific training on illusions) and performance after fine-tuning.

The results showed that all models suffered a performance drop when dealing with illusions compared to standard images, mirroring the human experience of being "fooled" by optical illusions. Different models demonstrated varying levels of robustness to different types of illusions, suggesting that architectural differences influence how these systems process visual information.

### A Simple Yet Effective Solution

An interesting finding from the research is their straightforward solution for improving model performance on illusory images. The technique:

1. Apply a Gaussian and blur low-pass filter to the illusory images
2. Convert the images to grayscale

This simple preprocessing approach yielded significant performance improvements across all tested models. 

For example, in the IllusionAnimals dataset:

- CLIP initially showed the highest performance on illusory images

- After applying the filter, BLIP-2 achieved the best results—even outperforming humans

- All models saw substantial gains in accuracy after implementing the filtering technique

This finding suggests that relatively simple image processing techniques can help AI systems overcome perceptual challenges posed by illusions. The filtering process essentially helps the models differentiate between real and illusory elements in the images, similar to how certain visual processing aids might help humans see through optical illusions.
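As a rough illustration of the two preprocessing steps, here is a minimal NumPy sketch. It assumes the low-pass filter is a plain Gaussian blur; the function names, the `sigma` default, and the BT.601 grayscale weights are my own choices for illustration, not values taken from the paper.

```python
import numpy as np


def gaussian_kernel(sigma: float) -> np.ndarray:
    """1-D Gaussian kernel truncated at about 3 sigma and normalized to sum to 1."""
    radius = max(1, int(round(3 * sigma)))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return kernel / kernel.sum()


def to_grayscale(rgb: np.ndarray) -> np.ndarray:
    """Collapse an (H, W, 3) RGB array to (H, W) with BT.601 luma weights (an assumption)."""
    return rgb[..., :3].astype(np.float64) @ np.array([0.299, 0.587, 0.114])


def blur_and_gray(rgb: np.ndarray, sigma: float = 2.0) -> np.ndarray:
    """Grayscale conversion plus a separable Gaussian blur; returns a float (H, W) array."""
    gray = to_grayscale(rgb)
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    # Repeat edge pixels so the output keeps the input's height and width
    padded = np.pad(gray, radius, mode="edge")
    # A 2-D Gaussian is separable: filter along rows, then along columns
    rows = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="valid"), 1, padded
    )
    return np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="valid"), 0, rows
    )
```

In this floating-point sketch both steps are linear, so converting to grayscale before or after the blur gives the same result; converting first just means filtering one channel instead of three. A larger `sigma` removes more high-frequency detail.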

## What we're going to do in this tutorial


In this tutorial, we'll explore the IllusionAnimals dataset and evaluate how different AI models perceive visual illusions. We'll:

1. Load and explore the IllusionAnimals dataset using FiftyOne

2. Compute embeddings using multiple state-of-the-art models:
   - SigLIP (by Google)
   - AIM-v2 (by Apple)

3. Visualize these embeddings using UMAP dimensionality reduction

4. Perform zero-shot classification using various models

5. Test Visual Question-Answering (VQA) capabilities using:
   - Janus-Pro
   - Moondream2

6. Compare how models perform with and without hints about potential illusions

Let's start by installing some dependencies and downloading the dataset from the Hugging Face Hub.

In [None]:
!pip install git+https://github.com/huggingface/transformers.git#egg=transformers

In [None]:
!pip install fiftyone umap-learn

In [2]:
import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

In [None]:
dataset = load_from_hub(
    "harpreetsahota/IllusionAnimals",
    overwrite=True,
    persistent=True
    )

Let's start with an initial visual vibe check of what is in this dataset.

Note that this is a [Grouped Dataset](https://docs.voxel51.com/user_guide/groups.html#grouped-datasets). Grouped datasets allow us to represent multiple slices of the same data point. This way, data for multiple perspectives of the same scene can be stored, visualized, and queried in ways that respect the relationships between the slices of data.

For the IllusionAnimals subset, the dataset includes different slices representing variations of the images with and without illusions, and with and without filters. The specific slices available in the IllusionAnimals dataset are:

*   **Raw Images**: These are the original images of animals without any illusions applied. They serve as a baseline for evaluating the models' performance on standard image recognition tasks. The models should accurately identify the animal in the image.

*   **Illusory Images**: These images have visual illusions incorporated into them. The illusions are designed to make the images appear as one animal while subtly containing elements of another. The goal is to test whether the models can detect the presence of the illusory concept, even with the presence of the real concept.

*   **Filtered Images**: These are the illusory images that have been processed with a Gaussian and blur low-pass filter. This filter is applied to enhance the models’ ability to detect the illusions. The idea is that the filter helps to reduce noise and highlight the illusory elements, making it easier for the models to identify and interpret the content. Applying the filter generally improves model performance.

*   **Illusionless Class**: In addition to the slices above, an extra class called "illusionless" is added to push the models’ capabilities. It requires the models to recognize images in which no illusion is present.
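To see which slices this copy of the dataset actually contains, a small helper like the one below can list them together with per-class counts. It is a sketch that relies only on FiftyOne's `group_slices`, `select_group_slices()`, and `count_values()`; the helper name `summarize_group_slices` is hypothetical, and the `label.label` path is assumed to be where the ground truth class lives.

```python
def summarize_group_slices(dataset, label_path="label.label"):
    """Return {slice_name: {class_label: count}} for a FiftyOne grouped dataset."""
    summary = {}
    for slice_name in dataset.group_slices:
        # select_group_slices() gives a flat view containing only this slice
        view = dataset.select_group_slices(slice_name)
        summary[slice_name] = dict(view.count_values(label_path))
    return summary


# Example usage once `dataset` is loaded:
# for name, counts in summarize_group_slices(dataset).items():
#     print(name, counts)
```

Printing the summary before running any model is a cheap way to confirm the slice names and to check that every slice carries the same set of classes.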

In [11]:
class_names = dataset.distinct("label.label")

We'll also set an environment variable, which the delegated operations (`delegate=True`) used later rely on:

In [7]:
import os

os.environ['FIFTYONE_ALLOW_LEGACY_ORCHESTRATORS'] = 'true'

In [None]:
!fiftyone plugins download https://github.com/harpreetsahota204/aimv2_embeddings

In [None]:
!fiftyone plugins requirements @harpreetsahota/aimv2_embeddings --install

In [None]:
!fiftyone plugins download https://github.com/harpreetsahota204/hiera-embeddings-plugin

In [None]:
!fiftyone plugins requirements @harpreetsahota/hiera_embeddings --install

### Computing Embeddings

In [None]:
import torch 

import fiftyone.zoo as foz

siglip_model = foz.load_zoo_model(
    "zero-shot-classification-transformer-torch",
    name_or_path="google/siglip2-base-patch16-512", 
    classes=class_names,
    device="cuda" if torch.cuda.is_available() else "cpu"
    )

In [None]:
dataset.compute_embeddings(
    model=siglip_model,
    embeddings_field="siglip_emb"
)

In [None]:
import fiftyone.operators as foo

aim_embeddings = foo.get_operator("@harpreetsahota/aimv2_embeddings/compute_aimv2_embeddings")

In [None]:
# Run the operator on your dataset
await aim_embeddings(
    dataset,
    model_name="apple/aimv2-large-patch14-224",  # Choose any supported model
    embedding_types="mean",
    emb_field="aimv2_mean_embeddings",
    delegate=True
)

In [None]:
# Run the operator on your dataset
await aim_embeddings(
    dataset,
    model_name="apple/aimv2-large-patch14-224",  # Choose any supported model
    embedding_types="cls",
    emb_field="aimv2_cls_embeddings",
    delegate=True
)

In [None]:
import fiftyone.brain as fob

embedding_fields = [ 
    "aimv2_mean_embeddings",
    "aimv2_cls_embeddings",
    "siglip_emb"
    ]

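# Compute a 2D UMAP projection for each embedding field
for field in embedding_fields:
    # Strip the "_embeddings" suffix to build the brain key; "siglip_emb"
    # has no such suffix, so its key becomes "siglip_emb_viz"
    _fname = field.split("_embeddings")[0]
    results = fob.compute_visualization(
        dataset,
        embeddings=field,
        method="umap",
        brain_key=f"{_fname}_viz",
        num_dims=2,
        )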

### Zero-shot classification using SigLIP and AIMv2

In [None]:
dataset.apply_model(
    model=siglip_model, 
    label_field="siglip2_predictions",
    )

In [None]:
aimv2_model = foz.load_zoo_model(
    "zero-shot-classification-transformer-torch",
    name_or_path="apple/aimv2-large-patch14-224-lit", 
    classes=class_names,
    trust_remote_code=True,
    device="cuda" if torch.cuda.is_available() else "cpu"
    )

In [None]:
dataset.apply_model(
    model=aimv2_model, 
    label_field="aimv2_predictions",
    )

In [None]:
fo.launch_app(dataset)

Evaluate classifications and see the results
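One simple way to score the zero-shot predictions is to compute accuracy separately for each group slice. The sketch below assumes the ground truth lives in the `label` field and that predictions are stored as classification labels in `siglip2_predictions` and `aimv2_predictions`; the helper names are hypothetical. Samples without a label or a prediction are skipped, which matters if a model was only applied to one slice.

```python
def accuracy(ground_truth, predictions):
    """Fraction of scored samples whose predicted label matches the ground truth.

    Samples with a missing label or prediction are skipped; returns None when
    nothing could be scored (e.g. the model was never applied to that slice).
    """
    pairs = [
        (gt, pred)
        for gt, pred in zip(ground_truth, predictions)
        if gt is not None and pred is not None
    ]
    if not pairs:
        return None
    return sum(gt == pred for gt, pred in pairs) / len(pairs)


def accuracy_per_slice(dataset, pred_field, gt_field="label"):
    """Compute accuracy separately for every group slice of a grouped dataset."""
    scores = {}
    for slice_name in dataset.group_slices:
        view = dataset.select_group_slices(slice_name)
        gt = view.values(f"{gt_field}.label")
        pred = view.values(f"{pred_field}.label")
        scores[slice_name] = accuracy(gt, pred)
    return scores


# Example usage with the fields created above (requires the loaded dataset):
# for field in ("siglip2_predictions", "aimv2_predictions"):
#     print(field, accuracy_per_slice(dataset, field))
```

FiftyOne also ships `evaluate_classifications()`, which produces per-class reports and confusion matrices; the hand-rolled version here just keeps the per-slice comparison explicit.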


### Can VLMs do any better?

In [None]:
!fiftyone plugins download https://github.com/harpreetsahota204/janus-vqa-fiftyone

!fiftyone plugins requirements @harpreetsahota/janus_vqa --install

!fiftyone plugins download https://github.com/harpreetsahota204/moondream2-plugin

!fiftyone plugins requirements @harpreetsahota/moondream2 --install

In [None]:
NO_HINT_PROMPT = f"""Which class is in the picture: {', '.join(class_names)}. 
Your answer must be one of these exact classes, no other answers allowed. 
Respond in one word for your guess of the correct class without any extra explanation."""

In [None]:
import fiftyone.operators as foo

janus_vqa = foo.get_operator("@harpreetsahota/janus_vqa/janus_vqa")

moondream = foo.get_operator("@harpreetsahota/moondream2/moondream")

No hint prompt

In [6]:
await janus_vqa(
    dataset,
    model_path="deepseek-ai/Janus-Pro-1B",
    question=NO_HINT_PROMPT,
    question_field="no_hint_prompt",
    answer_field="janus_no_hint_answer",
    delegate=True
    )

In [None]:
await moondream(
    dataset,
    revision="2025-01-09",
    operation="query",
    output_field="moondream_no_hint_answer",
    query_text=NO_HINT_PROMPT,
    delegate=True
    )

Hint prompt

In [24]:
HINT_PROMPT = f"""There might be an image illusion of something in this image. 
These are the classes that the image illusion might belong to: {', '.join(class_names)}.
Your answer must be one of these exact classes, no other answers allowed.  
Respond in one word for your guess of the correct class without any extra explanation.
"""

In [19]:
await janus_vqa(
    dataset,
    model_path="deepseek-ai/Janus-Pro-1B",
    question=HINT_PROMPT,
    question_field="hint_prompt",
    answer_field="janus_hint_answer",
    delegate=True
    )

In [None]:
await moondream(
    dataset,
    revision="2025-01-09",
    operation="query",
    output_field="moondream_hint_answer",
    query_text=HINT_PROMPT,
    delegate=True
    )
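VLMs do not always follow the "one word" instruction, so their answers usually need light normalization before the hint and no-hint runs can be compared. The sketch below assumes the answer fields (for example `janus_no_hint_answer` and `janus_hint_answer`) hold plain strings, which I have not verified for these plugins; the helper names are hypothetical.

```python
import re


def _clean(text):
    """Lowercase and collapse punctuation/whitespace so 'Cat.' matches 'cat'."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def normalize_answer(answer, class_names):
    """Map a free-form answer to exactly one class name, or None if unclear."""
    if not answer:
        return None
    cleaned = _clean(answer)
    lookup = {_clean(name): name for name in class_names}
    if cleaned in lookup:
        return lookup[cleaned]
    # Otherwise accept the answer only if it mentions exactly one class
    mentioned = [
        name
        for key, name in lookup.items()
        if key and re.search(rf"\b{re.escape(key)}\b", cleaned)
    ]
    return mentioned[0] if len(mentioned) == 1 else None


def answer_accuracy(ground_truth, answers, class_names):
    """Share of answers that normalize to the ground truth class."""
    if not ground_truth:
        return None
    hits = sum(
        normalize_answer(ans, class_names) == gt
        for gt, ans in zip(ground_truth, answers)
    )
    return hits / len(ground_truth)


# Example usage (assumes string-valued answer fields on the active slice):
# gt = dataset.values("label.label")
# for field in ("janus_no_hint_answer", "janus_hint_answer"):
#     print(field, answer_accuracy(gt, dataset.values(field), class_names))
```

Answers that mention several classes, or none, count as wrong here. That is a deliberately strict choice; a more lenient scorer would make the two prompts harder to compare.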

Moondream2 also produces short captions. Let's generate them and then compute the similarity between each caption and the ground truth prompt.

Then let's also see whether any of the captions actually mention the classes of interest.
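For the class-mention check, a whole-word search over each caption is enough as a first pass. The sketch below covers only that check and assumes the captions are available as plain strings; the helper names are hypothetical, and the optional trailing "s" is a crude plural rule that will miss forms like "foxes".

```python
import re
from collections import Counter


def classes_in_caption(caption, class_names):
    """Return the class names that appear as whole words in a caption."""
    text = (caption or "").lower()
    return [
        name
        for name in class_names
        # Allow a simple trailing "s" so "cats" still counts as "cat"
        if re.search(rf"\b{re.escape(name.lower())}s?\b", text)
    ]


def mention_rate(ground_truth, captions, class_names):
    """Per-class share of captions that mention their ground truth class."""
    totals, hits = Counter(), Counter()
    for gt, caption in zip(ground_truth, captions):
        totals[gt] += 1
        if gt in classes_in_caption(caption, class_names):
            hits[gt] += 1
    return {label: hits[label] / totals[label] for label in totals}
```

A low mention rate on the illusory slice would suggest that the captions describe the real scene rather than the hidden concept.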

## Why This Matters for AI Practitioners and Researchers

This research opens exciting new avenues for improving multimodal AI systems:

1. **Perceptual Robustness**: Training models to handle illusory images could make them more robust to adversarial attacks and unusual visual inputs. If a model can correctly process information even when presented with potentially misleading visual cues, it may be less susceptible to manipulation or confusion in real-world applications.

2. **Cognitive Alignment**: Understanding how AI models perceive illusions differently from humans can help researchers better align AI visual processing with human cognition. This alignment is crucial for applications where AI systems need to interpret visual information similarly to humans, such as in autonomous driving or medical image analysis.

3. **Preprocessing Solutions**: The simple filtering technique offers an immediate way to improve model performance on challenging visual inputs without requiring extensive retraining or architectural changes.

4. **Benchmark Advancement**: The Illusory VQA datasets provide valuable new benchmarks that push beyond conventional image recognition tasks, helping researchers identify strengths and weaknesses in current multimodal architectures.

5. **Bridging Disciplines**: This work creates interesting connections between AI research and cognitive psychology, potentially leading to cross-disciplinary insights about visual perception.

For AI practitioners working on multimodal systems, incorporating illusion testing into evaluation protocols could reveal important limitations that might otherwise go undetected. Similarly, the preprocessing techniques described in this research could be adapted for a variety of challenging visual inputs beyond just illusions.