# Project Overview

This notebook implements a benchmark of four distinct Vision Language Models (VLMs) on the Hateful Memes Challenge Dataset (HMCD). This version loads the dataset from local files, draws a balanced sample of 20 hateful and 20 non-hateful memes, and evaluates models from four different families on binary hateful/non-hateful classification. The run recorded in this version covers one of those models, Claude 3.5 Sonnet.

In [1]:
import ollama
import pandas as pd
import os

## Step 1: Load Dataset from Local Files & Prepare Sample

In [2]:
DATASET_FOLDER = 'data' 
ANNOTATION_FILE = os.path.join(DATASET_FOLDER, 'dev.jsonl')
IMG_DIR = os.path.join(DATASET_FOLDER, 'img')

In [3]:
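if not os.path.exists(ANNOTATION_FILE):
    print(f"Error: annotation file not found at '{ANNOTATION_FILE}'.")
else:
    df = pd.read_json(ANNOTATION_FILE, lines=True)
    # Create the full path to each image file.
    # Note: values in the 'img' column already start with 'img/', so joining
    # them with IMG_DIR yields paths of the form data/img/img/<file>.png.
    df['img_path'] = df['img'].apply(lambda x: os.path.join(IMG_DIR, x))
    print("Local dataset loaded successfully.")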


Local dataset loaded successfully.
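
The loading cell only checks that the annotation file exists, so a missing or mislocated image folder would surface later, during classification. A minimal sketch of an earlier check follows, assuming a hypothetical helper `find_missing_files` that is not part of the original notebook. It accepts any iterable of paths, for example `df['img_path']`.

```python
import os

def find_missing_files(paths):
    """Return the paths that do not point to an existing file (hypothetical helper)."""
    return [p for p in paths if not os.path.isfile(p)]

# Example with paths that are assumed not to exist on disk
missing = find_missing_files(["no_such_dir/a.png", "no_such_dir/b.png"])
print(f"{len(missing)} image file(s) missing")
```

In the notebook this could be called as `find_missing_files(df['img_path'])` right after loading, so that path problems are reported before any model is queried.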


### Balanced Sampling Strategy

In [4]:
df['label_text'] = df['label'].map({0: 'non-hateful', 1: 'hateful'})

In [5]:
hateful_df = df[df['label'] == 1].sample(n=20, random_state=70)

In [6]:
non_hateful_df = df[df['label'] == 0].sample(n=20, random_state=70)

In [7]:
df_sample = pd.concat([hateful_df, non_hateful_df])
df_sample = df_sample.sample(frac=1, random_state=50).reset_index(drop=True)

In [8]:
df_sample.head()

Unnamed: 0,id,img,label,text,img_path,label_text
0,56980,img/56980.png,0,coffee isn't helping get the jumper cables,data/img/img/56980.png,non-hateful
1,28905,img/28905.png,1,"hey, i just met you and this is crazy but here...",data/img/img/28905.png,hateful
2,32415,img/32415.png,0,when finishing a race is your passion,data/img/img/32415.png,non-hateful
3,97305,img/97305.png,1,when your wife just died giving birth to your ...,data/img/img/97305.png,hateful
4,63921,img/63921.png,0,taking a photo with family,data/img/img/63921.png,non-hateful


In [9]:
print(f"Created a balanced test set with {len(df_sample)} samples.")
print("Label distribution:")
print(df_sample['label_text'].value_counts())

Created a balanced test set with 40 samples.
Label distribution:
label_text
non-hateful    20
hateful        20
Name: count, dtype: int64
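
The sampling cells above repeat the same pattern once per class. As a rough sketch, the pattern can be wrapped in a single function that also fails early when a class has too few rows. The name `balanced_sample` and its parameters are assumptions made for illustration. Because it uses one seed for a grouped draw, it is not expected to reproduce the exact rows selected above.

```python
import pandas as pd

def balanced_sample(df, label_col="label", n_per_class=20, seed=70):
    """Draw n_per_class rows per label, then shuffle (hypothetical helper)."""
    counts = df[label_col].value_counts()
    short = counts[counts < n_per_class]
    if not short.empty:
        raise ValueError(f"Not enough rows for labels: {short.to_dict()}")
    # Sample the same number of rows from every label group
    sampled = df.groupby(label_col).sample(n=n_per_class, random_state=seed)
    # Shuffle so the classes are interleaved, then rebuild the index
    return sampled.sample(frac=1, random_state=seed).reset_index(drop=True)

# Toy data standing in for the annotation DataFrame
toy = pd.DataFrame({"id": range(100), "label": [0] * 70 + [1] * 30})
toy_sample = balanced_sample(toy)
print(toy_sample["label"].value_counts().to_dict())
```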


## Step 2: Playing with the Prompt

In [10]:
def create_zero_shot_prompt():
    """Asks for a direct classification."""
    return (
        "You are a content moderation AI. Your task is to classify a meme as either 'hateful' or 'non-hateful'. "
        "Analyze both the image and the accompanying text.\n"
        "Respond in *exactly* this format (no explanation before or after):\n"
        "Classification: hateful or non-hateful\n"
        "Justification: <your short reason>\n"
        "Example:\n"
        "Classification: non-hateful\n"
        "Justification: The meme contains no offensive content and appears neutral.\n"
    )


In [11]:
def create_few_shot_prompt():
    """Shows three worked examples, then asks for a classification."""
    return (
        "You are a content moderation AI. Your task is to classify a meme as either **hateful** or **non-hateful**.\n"
        "Analyze both the **image** and any **text** it contains. Think step-by-step, then give your final answer.\n\n"

        "Please follow the **exact format** below:\n"
        "Classification: <hateful|non-hateful>\n"
        "Justification: <1–2 sentence reason based on the image and/or text>\n\n"

        "### EXAMPLES ###\n\n"

        "Example 1:\n"
        "Image: A meme with a cartoon showing people of different races holding hands with the text: 'We’re stronger together.'\n"
        "Classification: non-hateful\n"
        "Justification: The image promotes unity and shows no offensive or targeted content.\n\n"

        "Example 2:\n"
        "Image: A meme with a photo of a specific religion’s attire and a caption mocking it with slurs.\n"
        "Classification: hateful\n"
        "Justification: The image contains derogatory language targeting a religious group, making it hateful.\n\n"

        "Example 3:\n"
        "Image: A meme with a man slipping on a banana peel with the caption: 'My Monday mornings.'\n"
        "Classification: non-hateful\n"
        "Justification: The image is humorous and contains no targeted or harmful content.\n\n"

        "### NEW MEME ###\n"
        "Now analyze the following meme and provide your response in the same format.\n"
        "Respond only with:\n"
        "Classification: <hateful|non-hateful>\n"
        "Justification: <reason>\n"
    )


In [12]:
prompt_template = create_zero_shot_prompt()
# prompt_template = create_few_shot_prompt()

In [13]:
# def parse_response(response_text):
#     cleaned_text = response_text.lower().strip()

#     classification = 'error'
#     justification = '' ''## will modify this later. Adding for debugging

#     if 'justification:' in cleaned_text:
#         try:
#             parts = cleaned_text.split('justification:', 1)
#             classification_part = parts[0]
#             justification = parts[1].strip()
#         except IndexError:
#             classification_part = cleaned_text
#             justification = "Could not parse justification part."
#     else:
#         classification_part = cleaned_text

#     if 'non-hateful' in classification_part:
#         classification = 'non-hateful'
#     elif 'hateful' in classification_part:
#         classification = 'hateful'
        
#     return classification, justification

In [14]:
## Parse function v3
import re

def parse_response(response_text):
    """
    A robust function to extract classification and justification using regular expressions.
    """
    cleaned_text = response_text.lower().strip()

    classification = 'error'
    justification = ''

    classification_match = re.search(r"classification:\s*(hateful|non-hateful)", cleaned_text) 
    
    if classification_match:
        classification = classification_match.group(1).strip()
    else:
        if 'non-hateful' in cleaned_text:
            classification = 'non-hateful'
        elif 'hateful' in cleaned_text:
            classification = 'hateful'

    justification_match = re.search(r"justification:\s*(.*)", cleaned_text, re.DOTALL)
    if justification_match:
        justification = justification_match.group(1).strip()
    elif classification != 'error':
        justification = "Justification not provided in response."
    else:
        # For 'error' cases, store the raw output for debugging
        justification = cleaned_text

    return classification, justification
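
When the `classification:` line is missing, `parse_response` falls back to a substring search, so a reply that only mentions the word "hateful" (a refusal, for instance) is labeled `hateful`. The sketch below is a stricter variant under two assumptions: replies without an explicit classification line should count as `error`, and markdown emphasis such as `**Classification:**` should be tolerated. The name `parse_response_strict` is hypothetical and does not appear in the original notebook.

```python
import re

LABEL_PATTERN = re.compile(r"classification:\s*(non-hateful|hateful)")
JUSTIFICATION_PATTERN = re.compile(r"justification:\s*(.*)", re.DOTALL)

def parse_response_strict(response_text):
    """Parse a reply; no substring fallback (hypothetical variant)."""
    # Drop asterisks and backticks so '**Classification:** hateful' still matches
    cleaned = re.sub(r"[*`]", "", response_text).lower().strip()

    label_match = LABEL_PATTERN.search(cleaned)
    classification = label_match.group(1) if label_match else "error"

    just_match = JUSTIFICATION_PATTERN.search(cleaned)
    # Keep the whole reply when no justification line is found, for debugging
    justification = just_match.group(1).strip() if just_match else cleaned
    return classification, justification

print(parse_response_strict("**Classification:** Non-hateful\n**Justification:** Harmless joke."))
print(parse_response_strict("I cannot say whether this is hateful."))
```

The trade-off is that the strict variant reports more `error` rows, which keeps unparseable replies visible instead of counting them as predictions.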



In [15]:
# def parse_response(response_text): v2
#     cleaned_text = response_text.lower().strip()

#     classification = 'error'
#     justification = ''

#     # Try to extract classification line
#     if 'classification:' in cleaned_text:
#         try:
#             classification_line = cleaned_text.split('classification:')[1].split('\n')[0].strip()
#             if 'non-hateful' in classification_line:
#                 classification = 'non-hateful'
#             elif 'hateful' in classification_line:
#                 classification = 'hateful'
#         except Exception:
#             classification = 'error'

#     # Fallback: try first word
#     # if classification == 'error':
#     #     if cleaned_text.startswith('non-hateful'):
#     #         classification = 'non-hateful'
#     #     elif cleaned_text.startswith('hateful'):
#     #         classification = 'hateful'

#     # Extract justification
#     if 'justification:' in cleaned_text:
#         try:
#             justification = cleaned_text.split('justification:')[1].strip()
#         except IndexError:
#             justification = 'Could not parse justification.'

#     return classification, justification


In [16]:
import anthropic
import base64
import os

In [17]:
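ANTHROPIC_API_KEY = os.getenv('ANTHROPIC_API_KEY')
if ANTHROPIC_API_KEY:
    claude_client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY)
else:
    claude_client = None
    print("Warning: ANTHROPIC_API_KEY is not set; Claude predictions will be 'error'.")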

In [18]:
import base64

In [19]:
def image_to_base64(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

In [20]:
def classify_with_claude(image_path, prompt):
    """Classifies a meme using the Claude 3.5 Sonnet model."""
    if not ANTHROPIC_API_KEY:
        return 'error', "ANTHROPIC_API_KEY not configured."
    try:
        response = claude_client.messages.create(
            model="claude-3-5-sonnet-20240620",
            max_tokens=150,
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "image", 
                            "source": {
                                "type": "base64", 
                                "media_type": "image/png", # Must match the actual file type (e.g. image/jpeg or image/png)
                                "data": image_to_base64(image_path)
                            }
                        },
                        {
                            "type": "text", 
                            "text": prompt
                        }
                    ],
                }
            ],
        )
        return parse_response(response.content[0].text)
    except Exception as e:
        return 'error', f"Claude API Error: {str(e)}"
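
The request above hardcodes `image/png` as the media type, which fits the `.png` files in this sample. If the image folder ever mixes formats, the type could be derived from the file extension instead. The sketch below uses the standard-library `mimetypes` module. The helper name `guess_image_media_type` and the fallback default are assumptions, and the set of supported types reflects the formats the API is understood to accept, so it should be checked against the current documentation. The guess is based on the extension only, not on the file contents.

```python
import mimetypes

# Image types assumed to be accepted by the API; verify against current docs
SUPPORTED_MEDIA_TYPES = {"image/jpeg", "image/png", "image/gif", "image/webp"}

def guess_image_media_type(image_path, default="image/png"):
    """Guess the media type from the file extension (hypothetical helper)."""
    media_type, _ = mimetypes.guess_type(image_path)
    # Fall back to the default for unknown or unsupported extensions
    return media_type if media_type in SUPPORTED_MEDIA_TYPES else default

print(guess_image_media_type("data/img/56980.png"))
print(guess_image_media_type("photo.jpg"))
```

Inside `classify_with_claude`, the `"media_type"` value could then be set to `guess_image_media_type(image_path)`.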


In [21]:
from tqdm import tqdm

In [22]:
models_to_test = {
    'claude-3.5-sonnet': classify_with_claude
}
results_data = {model: [] for model in models_to_test}

for model_name, classification_func in models_to_test.items():
    print(f"\n--- Benchmarking model: {model_name} ---")
    predictions = []
    justifications = []
    for _, row in tqdm(df_sample.iterrows(), total=len(df_sample), desc=f"Processing {model_name}"):
        pred, just = classification_func(row['img_path'], prompt_template)
        predictions.append(pred)
        justifications.append(just)

    df_sample[f'prediction_{model_name}'] = predictions
    df_sample[f'justification_{model_name}'] = justifications


print("\n--- Benchmark Complete! ---")
# display(df_sample[['id', 'label_text'] + [f'prediction_{model}' for model in models_to_test.keys()]].head())

display(df_sample.head())



--- Benchmarking model: claude-3.5-sonnet ---


Processing claude-3.5-sonnet: 100%|██████████| 40/40 [00:00<00:00, 53550.00it/s]


--- Benchmark Complete! ---





Unnamed: 0,id,img,label,text,img_path,label_text,prediction_claude-3.5-sonnet,justification_claude-3.5-sonnet
0,56980,img/56980.png,0,coffee isn't helping get the jumper cables,data/img/img/56980.png,non-hateful,error,ANTHROPIC_API_KEY not configured.
1,28905,img/28905.png,1,"hey, i just met you and this is crazy but here...",data/img/img/28905.png,hateful,error,ANTHROPIC_API_KEY not configured.
2,32415,img/32415.png,0,when finishing a race is your passion,data/img/img/32415.png,non-hateful,error,ANTHROPIC_API_KEY not configured.
3,97305,img/97305.png,1,when your wife just died giving birth to your ...,data/img/img/97305.png,hateful,error,ANTHROPIC_API_KEY not configured.
4,63921,img/63921.png,0,taking a photo with family,data/img/img/63921.png,non-hateful,error,ANTHROPIC_API_KEY not configured.


In [23]:
ground_truth = df_sample['label_text']


In [24]:
from sklearn.metrics import classification_report

for model_name in models_to_test.keys():
    print(f"\n--- Evaluation Report for: {model_name} ---")
    model_predictions = df_sample[f'prediction_{model_name}']
    model_predictions = model_predictions.apply(lambda x: x[0] if isinstance(x, tuple) else x)

    report = classification_report(
        ground_truth, 
        model_predictions, 
        labels=['hateful', 'non-hateful'], 
        zero_division=0)
    print(report)


--- Evaluation Report for: claude-3.5-sonnet ---
              precision    recall  f1-score   support

     hateful       0.00      0.00      0.00      20.0
 non-hateful       0.00      0.00      0.00      20.0

   micro avg       0.00      0.00      0.00      40.0
   macro avg       0.00      0.00      0.00      40.0
weighted avg       0.00      0.00      0.00      40.0
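
The all-zero report follows from the predictions themselves. With the API key unset, `classify_with_claude` returns `'error'` for every row, and `'error'` is not among the `labels` passed to `classification_report`, so no prediction can match a ground-truth label. A small sketch of a check that could run before scoring is shown here. The helper name `summarize_predictions` is hypothetical, and it works on any iterable of labels, including a pandas Series such as `df_sample['prediction_claude-3.5-sonnet']`.

```python
from collections import Counter

def summarize_predictions(predictions, valid_labels=("hateful", "non-hateful")):
    """Count predictions that fall outside the valid labels (hypothetical helper)."""
    counts = Counter(predictions)
    total = sum(counts.values())
    invalid = sum(n for label, n in counts.items() if label not in valid_labels)
    return {
        "total": total,
        "invalid": invalid,
        "invalid_rate": invalid / total if total else 0.0,
    }

# Mirrors the run above, where every prediction came back as 'error'
summary = summarize_predictions(["error"] * 40)
print(summary)
```

A high `invalid_rate` suggests that the metrics reflect a configuration or parsing problem, not model quality.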



In [25]:
columns_to_keep = ['id', 'text', 'label_text']
for model_name in models_to_test.keys():
    columns_to_keep.append(f'prediction_{model_name}')
    columns_to_keep.append(f'justification_{model_name}')

In [26]:
full_results_df = df_sample[columns_to_keep]

In [27]:
full_results_df.to_csv('claude zero shot.csv', index=False)