Memes are, arguably, the best proving ground for Vision Language Models (VLMs) because they combine multiple challenging aspects of visual and linguistic understanding:

1. Memes require understanding both visual elements and text, and crucially, how they interact

2. They often rely on shared cultural knowledge or references

3. The humor often emerges from subtle interactions between the image and text

4. They appear in different formats, templates, and styles

5. Text can appear in various fonts, sizes, and positions

In this highly rigorous, academic, journal-quality blog post, I'll put two VLMs (Janus-Pro and Moondream2) through their paces on three distinct tasks:


1. OCR: Can they accurately extract text from memes?

2. Meme Understanding: Can they explain what makes a meme funny and relevant?

3. Caption Generation: Can they generate contextually appropriate, humorous captions?

I'll also test their attention to detail by seeing if they can spot subtle watermarks, giving us insight into their visual processing capabilities.

First, set up the environment by installing FiftyOne:

In [None]:
!pip install fiftyone

I've created plugins which allow you to easily use [🌔Moondream2](https://github.com/harpreetsahota204/moondream2-plugin) and [🐋Janus-Pro](https://github.com/harpreetsahota204/janus-vqa-fiftyone) with your FiftyOne dataset.

Next, let's download the plugins and install their dependencies.

> The plugin framework lets you extend and customize the functionality of FiftyOne to suit your needs. If you'd like to learn more about plugins, consider attending one of our monthly workshops. You can [see the full schedule here](https://voxel51.com/computer-vision-events/) and look for the *Advanced Computer Vision Data Curation and Model Evaluation workshop*.

In [None]:
!fiftyone plugins download https://github.com/harpreetsahota204/janus-vqa-fiftyone

In [None]:
!fiftyone plugins requirements @harpreetsahota/janus_vqa --install

In [None]:
!fiftyone plugins download https://github.com/harpreetsahota204/moondream2-plugin

In [None]:
!fiftyone plugins requirements @harpreetsahota/moondream2 --install

We also need to set an environment variable.

In [2]:
import os

os.environ['FIFTYONE_ALLOW_LEGACY_ORCHESTRATORS'] = 'true'

I found [this website - scott.ai from Scott Penberthy](https://scott.ai/2019-08-06-memeified-ng), which has some awesome machine learning memes on it. I parsed these memes into a FiftyOne dataset. You can also download the dataset [from Hugging Face](https://huggingface.co/datasets/harpreetsahota/memes-dataset).

In [None]:
import fiftyone as fo
from fiftyone.utils import huggingface as fouh

ml_memes_dataset = fouh.load_from_hub(
    "harpreetsahota/ml-memes",
    name="ml-memes",
    overwrite=True
    )

Let's quickly explore the dataset:

In [None]:
fo.launch_app(ml_memes_dataset)

Now, let's instantiate our plugins as operators via the FiftyOne SDK.

Alternatively, you can use the app and fill out the operator form. Just hit the backtick key (`) to open the operator menu. Type in "Moondream" or "Janus" and click on it. You'll be presented with a form to fill out, which takes the same information as what we will pass in via the SDK.

In [None]:
import fiftyone.operators as foo

janus_vqa = foo.get_operator("@harpreetsahota/janus_vqa/janus_vqa")

moondream = foo.get_operator("@harpreetsahota/moondream2/moondream")

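Now let's kick off a delegated service by opening a terminal and running `fiftyone delegated launch`. The operator calls in this post pass `delegate=True`, so they are queued and then executed by this service.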

# OCR

Optical Character Recognition (OCR) is a fundamental task in Computer Vision.

And, I think, using it to parse text from memes is a good use case! Memes typically combine both visual elements and text. While traditional OCR systems are specifically trained for text extraction, it's interesting to test how well general-purpose Vision Language Models (VLMs) can perform this task.

Testing VLMs on OCR helps us understand:

1. Their ability to perceive and accurately read text in various fonts, orientations, and styles common in memes

2. How well they can distinguish between text and visual elements

3. Their robustness in handling text that's integrated into images rather than presented as plain text

Let's test both Janus-Pro and Moondream2 on this task using the plugins we downloaded earlier.

First, let's run Janus:

In [None]:
QUESTION = "What does the text on this image say? Respond only with the text on the image and nothing else."

await janus_vqa(
    ml_memes_dataset,
    model_path="deepseek-ai/Janus-Pro-1B",
    question=QUESTION,
    question_field="ocr_questions",
    answer_field="janus_ocr",
    delegate=True
    )

And now, Moondream2:

In [None]:
await moondream(
    ml_memes_dataset,
    revision="2025-01-09",
    operation="query",
    output_field="moondream_ocr",
    query_text=QUESTION,
    delegate=True
    )

Since we don't have ground truth annotations for the text in these memes, we'll do a qualitative evaluation - a "vibe check" - of how well each model performs. 

We can visually inspect the results in the FiftyOne App by comparing the model outputs against the actual meme images to assess accuracy and completeness of text extraction.


In [None]:
fo.launch_app(ml_memes_dataset)
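
With no ground truth, one way to decide which memes to look at first is to rank samples by how much the two models disagree with each other. Below is a minimal sketch of that idea using only the standard library. The example strings are made up for illustration, and `normalize`, `agreement`, and `rank_by_disagreement` are hypothetical helper names, not part of FiftyOne or either plugin. It also assumes each model's answer is stored as a plain string.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return " ".join(text.split())


def agreement(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def rank_by_disagreement(janus_texts, moondream_texts):
    """Return (index, score) pairs, lowest agreement first."""
    scores = [agreement(j, m) for j, m in zip(janus_texts, moondream_texts)]
    return sorted(enumerate(scores), key=lambda pair: pair[1])


# Hypothetical outputs, made up for illustration. With FiftyOne, the real
# ones could be read with dataset.values("janus_ocr") and
# dataset.values("moondream_ocr"), assuming both fields hold plain strings.
janus_texts = ["ONE DOES NOT SIMPLY TRAIN A MODEL", "I have no idea"]
moondream_texts = ["One does not simply train a model.", "gradient descent"]

for index, score in rank_by_disagreement(janus_texts, moondream_texts):
    print(f"sample {index}: agreement {score:.2f}")
```

Low agreement only says that at least one model is wrong, not which one, and high agreement doesn't rule out both models making the same mistake. Treat the ranking as a way to prioritize the vibe check, not as a replacement for it.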

# Meme understanding

Understanding memes is more complex than OCR, because there are multiple levels of comprehension:

1. Recognizing the scene, characters, and their expressions

2. Understanding the reference or template being used

3. Connecting how the text relates to the visual elements

4. Grasping why the combination is meant to be humorous

This means we can test a VLM's ability to:

- Integrate multimodal information (text and visuals)

- Understand cultural references and context

- Grasp abstract concepts and humor

- Explain complex social/cultural phenomena in natural language

Let's see how our models handle this deeper level of understanding:

In [None]:
MEME_UNDERSTANDING_QUESTION = """This image is a meme. Describe the scene of the meme,
its characters, what they are saying, and what the
target audience of this meme might find funny about it.
"""

await janus_vqa(
    ml_memes_dataset,
    model_path="deepseek-ai/Janus-Pro-1B",
    question=MEME_UNDERSTANDING_QUESTION,
    question_field="meme_understanding_question",
    answer_field="janus_meme_understanding",
    delegate=True
    )

In [None]:
await moondream(
    ml_memes_dataset,
    revision="2025-01-09",
    operation="query",
    output_field="moondream_meme_understanding",
    query_text=MEME_UNDERSTANDING_QUESTION,
    delegate=True
    )

In [None]:
fo.launch_app(ml_memes_dataset)

## Can the VLMs find the attribution tag?

Each meme has a small attribution in the left corner, which reads `@scott.ai`. This presents an interesting test case for VLMs' visual capabilities because:

1. The attribution is intentionally subtle - a small watermark that could be easily missed even by human viewers

2. It tests the model's ability to detect and read fine details in images

3. It evaluates whether VLMs can distinguish between the main meme content and metadata like attributions

4. It helps us understand if VLMs can maintain attention to small details while processing the broader image

This kind of test is particularly relevant for real-world applications where models might need to:
- Detect watermarks or copyright information
- Read small print or disclaimers
- Identify subtle branding elements

Let's see if the VLMs can pick up on this subtle detail:

In [None]:
ATTR_QUESTION = """The creator of this meme has tagged themselves for self-attribution.
Who can we attribute as the creator of this meme? Respond with just the author's name."""

await janus_vqa(
    ml_memes_dataset,
    model_path="deepseek-ai/Janus-Pro-1B",
    question=ATTR_QUESTION,
    question_field="attr_question",
    answer_field="janus_attr",
    delegate=True
    )

In [None]:
await moondream(
    ml_memes_dataset,
    revision="2025-01-09",
    operation="query",
    output_field="moondream_attr",
    query_text=ATTR_QUESTION,
    delegate=True
    )

In [None]:
fo.launch_app(ml_memes_dataset)
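
Unlike the other tasks, this one has a known expected answer (`@scott.ai`), so it can be scored with a simple string check in addition to eyeballing the results. Here is a rough sketch, assuming the answers are plain strings. The sample answers are invented for illustration, and `mentions_tag` and `hit_rate` are hypothetical helper names.

```python
import re

EXPECTED_TAG = "scott.ai"  # the watermark on each meme reads "@scott.ai"


def mentions_tag(answer: str, tag: str = EXPECTED_TAG) -> bool:
    """True if the answer contains the tag, ignoring case, spaces, and '@'."""
    cleaned = re.sub(r"[\s@]", "", answer.lower())
    return tag in cleaned


def hit_rate(answers) -> float:
    """Fraction of answers that mention the tag (0.0 for an empty list)."""
    answers = list(answers)
    if not answers:
        return 0.0
    return sum(mentions_tag(a) for a in answers) / len(answers)


# Hypothetical answers, made up for illustration. With FiftyOne, the real
# ones could be read with dataset.values("janus_attr") or
# dataset.values("moondream_attr"), assuming the fields hold plain strings.
sample_answers = ["@scott.ai", "Scott. AI", "The creator is unknown", "scott penberthy"]
print(f"hit rate: {hit_rate(sample_answers):.2f}")
```

The check is deliberately lenient about case and spacing, but it only counts answers that reproduce the watermark text itself. An answer like "scott penberthy" is not counted, since that name does not appear in the tag.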

# Now let's test these models on captioning

Meme captioning is a generative task that's distinct from our previous experiments:

- Unlike OCR, which extracts existing text

- Unlike meme understanding, which interprets the combined meaning

- Captioning requires the model to create novel, contextually appropriate text

This is challenging because good meme captions:

1. Match the visual template's intended use

2. Are culturally relevant to the target audience (in our case, the ML/AI community)

3. Strike the right balance of humor and relatability

4. Follow the established format of the meme template

You could use metrics like BLEU or ROUGE to evaluate captions against references, but they often miss the aspects of humor and cultural relevance. Like the previous tasks, a qualitative "vibe check" is probably the most reliable way to assess the quality of the captions.
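
To make that limitation concrete, here is a small sketch of a word-overlap score. It is a simplified unigram F1 in the spirit of ROUGE-1, not the official ROUGE implementation, and the three captions are invented for illustration.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """F1 over lowercase word overlap, a simplified stand-in for ROUGE-1."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # shared words, counted with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical captions, made up for illustration.
reference = "when your model overfits the training set"
literal = "when your model overfits the test set"
funny = "99% train accuracy, 50% in production"

print(f"literal: {unigram_f1(literal, reference):.2f}")
print(f"funny:   {unigram_f1(funny, reference):.2f}")
```

The near-copy scores high even though swapping one word changes its meaning, while the caption that makes the same joke in different words scores zero. Overlap metrics reward shared wording, which is not the same thing as being funny.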

Let's download another dataset. It's a grouped dataset in which each group holds a captioned and an uncaptioned version of a meme, but we'll work with only the uncaptioned ones. Then we'll see what our models generate:

In [None]:
memes_dataset = fouh.load_from_hub(
    "harpreetsahota/memes-dataset",
    name="meme-captioning",
    overwrite=True
    )

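# Keep only the uncaptioned "template" slice of each group
uncaptioned_memes = memes_dataset.select_group_slices("template")

# Clone the view into its own dataset so the model outputs are stored separately
uncaptioned_memes = uncaptioned_memes.clone(name="vlm-captioned-memes")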

In [None]:
fo.launch_app(uncaptioned_memes)

In [16]:
MEME_GENERATE = """This image is a meme. Write a caption for this meme that is
related to deep learning and artificial intelligence.
Respond only with the caption and nothing else.
"""

In [None]:
await janus_vqa(
    uncaptioned_memes,
    model_path="deepseek-ai/Janus-Pro-1B",
    question=MEME_GENERATE,
    question_field="caption_prompt",
    answer_field="janus_caption",
    delegate=True
    )

In [None]:
await moondream(
    uncaptioned_memes,
    revision="2025-01-09",
    operation="query",
    query_text=MEME_GENERATE,
    output_field="moondream_caption",
    delegate=True
)

In [None]:
fo.launch_app(uncaptioned_memes)