# **MetaLogic Image-Evaluation**

**Authors:**  
- **Jason Olefson**  
- **Jooyoung Gonzalez**  
- **Shrikanth Akkaru**

---

This project implements a full evaluation pipeline for assessing **semantic robustness** and **logical consistency** in text-to-image (T2I) generative models using the *MetaLogic* methodology.

We analyze whether models produce **consistent images** when prompted with logically equivalent descriptions, and we measure:

- **Faithfulness** — How well generated images reflect the semantics of the prompt  
- **Stability** — How consistent outputs remain under logically equivalent perturbations  
- **Robustness** — Sensitivity of image features to structural changes in prompts  

Our pipeline includes:

- **Image generation** using diffusion models  
- **CLIP-based scoring** to quantify text–image alignment  
- **Custom VQA (TIFA-lite)** semantic checks using LLM-generated questions  
- **Pairwise logic-equivalence analysis** across different logical laws  

This notebook contains end-to-end code to reproduce the evaluation, generate datasets, compute metrics, and export results.


## Environment Setup



In [None]:
!nvidia-smi

Wed Dec  3 18:22:24 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-80GB          Off |   00000000:00:05.0 Off |                    0 |
| N/A   33C    P0             53W /  400W |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os

# Create base directory for images if none exist
base_image_dir = "/content/drive/MyDrive/Trustworthy-AI-Final/Images"
os.makedirs(base_image_dir, exist_ok=True)

print(f"Base image directory '{base_image_dir}' created or already exists.")

Base image directory '/content/drive/MyDrive/Trustworthy-AI-Final/Images' created or already exists.


In [None]:
!pip install --upgrade --index-url https://download.pytorch.org/whl/cu121 torch torchvision torchaudio
!pip install --upgrade diffusers transformers accelerate safetensors

Looking in indexes: https://download.pytorch.org/whl/cu121
Collecting transformers
  Downloading transformers-4.57.3-py3-none-any.whl.metadata (43 kB)
Downloading transformers-4.57.3-py3-none-any.whl (12.0 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.57.2
    Uninstalling transformers-4.57.2:
      Successfully uninstalled transformers-4.57.2
Successfully installed transformers-4.57.3


## Import Models

### V1-5

In [None]:
# from diffusers import StableDiffusionPipeline
# import torch

# model_id = "sd-legacy/stable-diffusion-v1-5"
# pipe = StableDiffusionPipeline.from_pretrained(model_id, dtype=torch.float16)
# pipe = pipe.to("cuda")

# prompt = "a red cat and a yellow apple on a wooden table"
# image = pipe(prompt).images[0]

# image.save("sample.png")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Keyword arguments {'dtype': torch.float16} are not expected by StableDiffusionPipeline and will be ignored.



You have disabled the safety checker for <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline'> by passing `safety_checker=None`. Ensure that you abide to the conditions of the Stable Diffusion license and do not expose unfiltered results in services or applications open to the public. Both the diffusers team and Hugging Face strongly recommend to keep the safety filter enabled in all public facing circumstances, disabling it only for use-cases that involve analyzing network behavior or auditing its results. For more information, please have a look at https://github.com/huggingface/diffusers/pull/254 .



TypeError: argument of type 'NoneType' is not iterable

### XL Base

In [None]:
from diffusers import StableDiffusionXLPipeline
import torch

model_id = "stabilityai/stable-diffusion-xl-base-1.0"

pipe = StableDiffusionXLPipeline.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16"
).to("cuda")

prompt = "a red cat and a yellow apple on a wooden table"
image = pipe(prompt).images[0]
image.save("sample_sdxl.png")



`torch_dtype` is deprecated! Use `dtype` instead!



## Import prompt-pairs from CSV file

In [None]:
import pandas as pd

Prompt_Pairs = pd.read_csv('/content/drive/MyDrive/Trustworthy-AI-Final/prompts_metalogic.csv')
print(Prompt_Pairs.shape)
Prompt_Pairs.head()

(100, 6)


| | pair_id | category_id | logical_law | semantic_dimension | prompt_A | prompt_B |
|---|---|---|---|---|---|---|
| 0 | comm_conj_001 | 1 | Commutative | Conjunctive | a red cat and a yellow apple on a wooden table | a yellow apple and a red cat on a wooden table |
| 1 | comm_conj_002 | 1 | Commutative | Conjunctive | a blue dog and a yellow cat on a dark metal floor | a yellow cat and a blue dog on a dark metal floor |
| 2 | comm_conj_003 | 1 | Commutative | Conjunctive | a red banana and a blue cat on a wooden table | a blue cat and a red banana on a wooden table |
| 3 | comm_conj_004 | 1 | Commutative | Conjunctive | a blue dog and a yellow banana on a wooden table | a yellow banana and a blue dog on a wooden table |
| 4 | comm_conj_005 | 1 | Commutative | Conjunctive | a blue banana and a red dog on a dark metal floor | a red dog and a blue banana on a dark metal floor |


## Generate Images and Organize Storage



### Image Generation

With the 100 prompt pairs generated from our MetaLogic-inspired prompt templates, the next step is to generate an image for each prompt with the model under evaluation.

SD v1.5 is an attractive choice: although it is older and deprecated, it generates images quickly and is commonly used in research, so it may work better with VQA or CLIP.

SD v1.5
https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5

SD xl
https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0


In [None]:
for index, row in Prompt_Pairs.iterrows():
    logical_law = row['logical_law'].replace(' ', '_').replace('/', '_')  # sanitize for use as a directory name
    category_id = row['category_id']
    semantic_dimension = row['semantic_dimension'].replace(' ', '_').replace('/', '_')  # sanitize for use as a directory name
    pair_id = row['pair_id']

    # Construct directory path
    current_image_dir = os.path.join(base_image_dir, logical_law, str(category_id), semantic_dimension)
    os.makedirs(current_image_dir, exist_ok=True)

    # Generate / save image for prompt A
    prompt_A = row['prompt_A']
    image_A_filename = f"{pair_id}_A.png"
    image_A_path = os.path.join(current_image_dir, image_A_filename)
    print(f"Generating image for prompt A: {prompt_A}")
    image_A = pipe(prompt_A).images[0]
    image_A.save(image_A_path)
    print(f"Saved image A to: {image_A_path}")

    # Generate / save image for prompt B
    prompt_B = row['prompt_B']
    image_B_filename = f"{pair_id}_B.png"
    image_B_path = os.path.join(current_image_dir, image_B_filename)
    print(f"Generating image for prompt B: {prompt_B}")
    image_B = pipe(prompt_B).images[0]
    image_B.save(image_B_path)
    print(f"Saved image B to: {image_B_path}\n")

print("All images generated and saved.")
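
The directory layout built in the loop above (`<base>/<logical_law>/<category_id>/<semantic_dimension>/<pair_id>_<variant>.png`) is rebuilt again in the CLIP and VQA cells. As a minimal sketch, the same logic could live in one helper. The names `sanitize` and `pair_image_path` are hypothetical and not part of the original notebook, and the label `"De Morgan/Negation"` is only an illustrative input.

```python
import os


def sanitize(name):
    """Make a label safe as a directory name: spaces and slashes become underscores."""
    return str(name).replace(" ", "_").replace("/", "_")


def pair_image_path(base_dir, logical_law, category_id, semantic_dimension, pair_id, variant):
    """Build <base>/<law>/<category>/<dimension>/<pair_id>_<variant>.png (hypothetical helper)."""
    if variant not in ("A", "B"):
        raise ValueError(f"variant must be 'A' or 'B', got {variant!r}")
    image_dir = os.path.join(
        base_dir,
        sanitize(logical_law),
        str(category_id),
        sanitize(semantic_dimension),
    )
    return os.path.join(image_dir, f"{pair_id}_{variant}.png")


# Illustrative call with values from the first row of the prompt table
example_path = pair_image_path(
    "/content/drive/MyDrive/Trustworthy-AI-Final/Images",
    "Commutative", 1, "Conjunctive", "comm_conj_001", "A",
)
print(example_path)
```

Keeping the path logic in one place would mean the generation, CLIP, and VQA cells cannot drift apart if the folder layout changes.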

## Evaluation Metrics



### CLIPScore
For each prompt $p$ and its generated image $I_p$, compute the cosine similarity between their CLIP embeddings (scaled by 100 in the code below).

Average the individual scores to get the mean over all prompts.

Average the scores again within each perturbation category to get per-category means.

For text-image pairs $(p_1, I_1), (p_2, I_2)$, compute alignment stability as:

$s_1 = CLIPScore(p_1, I_1)$, $s_2 = CLIPScore(p_2, I_2)$

$AlignmentStability(p_1, p_2) = 1 - \frac{|s_1 - s_2|}{100}$
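
The alignment-stability formula above can be sketched as a small function. This assumes both scores are on the 0 to 100 scale (cosine similarity times 100); the function name `alignment_stability` is ours and does not appear elsewhere in the notebook.

```python
def alignment_stability(score_a, score_b):
    """Return 1 - |s1 - s2| / 100 for two CLIPScores on the 0-100 scale."""
    return 1.0 - abs(score_a - score_b) / 100.0


# CLIPScores reported for pair comm_conj_001 in this notebook's CLIP output
print(round(alignment_stability(37.01, 34.35), 4))  # 0.9734
```

A value of 1 means both prompts of a pair received identical CLIPScores; larger gaps between the two scores pull the value down.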

#### 1. Import OpenAI CLIP Model

https://huggingface.co/openai/clip-vit-base-patch32

In [None]:
from PIL import Image
import requests
import numpy as np
import torch
import matplotlib.pyplot as plt
from transformers import CLIPProcessor, CLIPModel

# Load CLIP model and processor
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

print("CLIP model and processor loaded successfully.")

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.



CLIP model and processor loaded successfully.


In [None]:
import os
import torch
from PIL import Image

def calculate_clip_score(image_path, text_prompt, model, processor):
    """Text–image CLIPScore (cosine similarity * 100)."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=text_prompt, images=image, return_tensors="pt", padding=True)

    # Move to GPU if available
    if torch.cuda.is_available():
        inputs = {k: v.to("cuda") for k, v in inputs.items()}
        model.to("cuda")

    with torch.no_grad():
        outputs = model(**inputs)

    image_features = outputs.image_embeds
    text_features = outputs.text_embeds

    # Normalize
    image_features_norm = image_features / image_features.norm(p=2, dim=-1, keepdim=True)
    text_features_norm = text_features / text_features.norm(p=2, dim=-1, keepdim=True)

    # Cosine similarity
    cosine_similarity = torch.sum(image_features_norm * text_features_norm, dim=-1)
    return (cosine_similarity * 100).item()  # Scale for readability


def get_image_embedding(image_path, model, processor):
    """Return normalized CLIP image embedding for image–image similarity."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")

    if torch.cuda.is_available():
        inputs = {k: v.to("cuda") for k, v in inputs.items()}
        model.to("cuda")

    with torch.no_grad():
        image_features = model.get_image_features(**inputs)

    image_features_norm = image_features / image_features.norm(p=2, dim=-1, keepdim=True)
    return image_features_norm


clip_results = []

print("Calculating pair-wise CLIP scores and image similarity...")

for index, row in Prompt_Pairs.iterrows():
    logical_law = row['logical_law'].replace(' ', '_').replace('/', '_')
    category_id = row['category_id']
    semantic_dimension = row['semantic_dimension'].replace(' ', '_').replace('/', '_')
    pair_id = row['pair_id']

    current_image_dir = os.path.join(base_image_dir, logical_law, str(category_id), semantic_dimension)

    image_A_path = os.path.join(current_image_dir, f"{pair_id}_A.png")
    image_B_path = os.path.join(current_image_dir, f"{pair_id}_B.png")

    # Skip the pair (with a warning) if either image is missing
    if not os.path.exists(image_A_path) or not os.path.exists(image_B_path):
        print(f"Warning: Missing image for pair {pair_id}. A: {os.path.exists(image_A_path)}, B: {os.path.exists(image_B_path)}")
        continue

    prompt_A = row['prompt_A']
    prompt_B = row['prompt_B']

    # Text–image faithfulness scores
    clip_A = calculate_clip_score(image_A_path, prompt_A, model, processor)
    clip_B = calculate_clip_score(image_B_path, prompt_B, model, processor)

    # Per-pair aggregate measures
    clip_mean = 0.5 * (clip_A + clip_B)
    clip_diff = abs(clip_A - clip_B)

    # Image–image similarity (robustness to prompt perturbation)
    emb_A = get_image_embedding(image_A_path, model, processor)
    emb_B = get_image_embedding(image_B_path, model, processor)
    image_similarity = torch.sum(emb_A * emb_B, dim=-1).item() * 100  # cosine * 100

    clip_results.append({
        "pair_id": pair_id,
        "logical_law": logical_law,
        "category_id": category_id,
        "semantic_dimension": semantic_dimension,
        "image_A_path": image_A_path,
        "image_B_path": image_B_path,
        "clip_A": clip_A,
        "clip_B": clip_B,
        "clip_mean": clip_mean,          # average faithfulness over pair
        "clip_diff": clip_diff,          # sensitivity to perturbation
        "image_similarity": image_similarity  # how similar two outputs are
    })

    print(f"{pair_id} | CLIP A: {clip_A:.2f}, CLIP B: {clip_B:.2f}, "
          f"Δ: {clip_diff:.2f}, img-sim: {image_similarity:.2f}")

print("Done computing CLIP metrics for all pairs.")


Calculating pair-wise CLIP scores and image similarity...
comm_conj_001 | CLIP A: 37.01, CLIP B: 34.35, Δ: 2.66, img-sim: 84.44
comm_conj_002 | CLIP A: 31.77, CLIP B: 34.19, Δ: 2.42, img-sim: 75.35
comm_conj_003 | CLIP A: 37.04, CLIP B: 34.71, Δ: 2.33, img-sim: 81.44
comm_conj_004 | CLIP A: 29.73, CLIP B: 34.95, Δ: 5.22, img-sim: 93.73
comm_conj_005 | CLIP A: 33.25, CLIP B: 32.64, Δ: 0.61, img-sim: 75.13
comm_horiz_001 | CLIP A: 30.84, CLIP B: 30.22, Δ: 0.62, img-sim: 65.87
comm_horiz_002 | CLIP A: 28.07, CLIP B: 40.38, Δ: 12.32, img-sim: 67.79
comm_horiz_003 | CLIP A: 36.88, CLIP B: 32.58, Δ: 4.30, img-sim: 80.70
comm_horiz_004 | CLIP A: 34.40, CLIP B: 31.09, Δ: 3.31, img-sim: 92.09
comm_horiz_005 | CLIP A: 35.47, CLIP B: 31.16, Δ: 4.31, img-sim: 90.06
comm_vert_001 | CLIP A: 36.68, CLIP B: 34.60, Δ: 2.09, img-sim: 73.51
comm_vert_002 | CLIP A: 36.10, CLIP B: 40.38, Δ: 4.28, img-sim: 73.16
comm_vert_003 | CLIP A: 34.07, CLIP B: 34.98, Δ: 0.91, img-sim: 83.05
comm_vert_004 | CLIP A: 34

In [None]:
import pandas as pd

clip_df = pd.DataFrame(clip_results)
clip_df.head()


| | pair_id | logical_law | category_id | semantic_dimension | image_A_path | image_B_path | clip_A | clip_B | clip_mean | clip_diff | image_similarity |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | comm_conj_001 | Commutative | 1 | Conjunctive | /content/drive/MyDrive/Trustworthy-AI-Final/Im... | /content/drive/MyDrive/Trustworthy-AI-Final/Im... | 37.00927 | 34.345402 | 35.677336 | 2.663868 | 84.444147 |
| 1 | comm_conj_002 | Commutative | 1 | Conjunctive | /content/drive/MyDrive/Trustworthy-AI-Final/Im... | /content/drive/MyDrive/Trustworthy-AI-Final/Im... | 31.770533 | 34.190731 | 32.980632 | 2.420198 | 75.345421 |
| 2 | comm_conj_003 | Commutative | 1 | Conjunctive | /content/drive/MyDrive/Trustworthy-AI-Final/Im... | /content/drive/MyDrive/Trustworthy-AI-Final/Im... | 37.042191 | 34.709461 | 35.875826 | 2.332729 | 81.441838 |
| 3 | comm_conj_004 | Commutative | 1 | Conjunctive | /content/drive/MyDrive/Trustworthy-AI-Final/Im... | /content/drive/MyDrive/Trustworthy-AI-Final/Im... | 29.729563 | 34.947586 | 32.338574 | 5.218023 | 93.725157 |
| 4 | comm_conj_005 | Commutative | 1 | Conjunctive | /content/drive/MyDrive/Trustworthy-AI-Final/Im... | /content/drive/MyDrive/Trustworthy-AI-Final/Im... | 33.249329 | 32.642426 | 32.945877 | 0.606903 | 75.129962 |


In [None]:
import pandas as pd

out_path_clip = "/content/drive/MyDrive/Trustworthy-AI-Final/clip_results.csv"

clip_df.to_csv(out_path_clip, index=False)

print("Saved CLIP results to:", out_path_clip)

Saved CLIP results to: /content/drive/MyDrive/Trustworthy-AI-Final/clip_results.csv
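
The CLIPScore description calls for a mean over all prompts and a mean per perturbation category, which the cells above do not compute. A minimal standard-library sketch follows; the helper `mean_by_group` is hypothetical, and the two sample records are copied from the first two rows of the `clip_df` preview. On the real DataFrame, `clip_df.groupby("logical_law")["clip_mean"].mean()` should give the same per-law means.

```python
from collections import defaultdict
from statistics import mean


def mean_by_group(records, group_key, value_key):
    """Average `value_key` over records that share the same `group_key` (hypothetical helper)."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_key]].append(rec[value_key])
    return {group: mean(values) for group, values in groups.items()}


# First two rows of the clip_df preview shown in this notebook
sample = [
    {"pair_id": "comm_conj_001", "logical_law": "Commutative",
     "clip_mean": 35.677336, "clip_diff": 2.663868},
    {"pair_id": "comm_conj_002", "logical_law": "Commutative",
     "clip_mean": 32.980632, "clip_diff": 2.420198},
]

overall_mean = mean(rec["clip_mean"] for rec in sample)     # mean over all pairs
per_law = mean_by_group(sample, "logical_law", "clip_mean")  # mean per logical law
print(overall_mean, per_law)
```

The same helper works for `clip_diff` or `image_similarity` by changing `value_key`, and for `semantic_dimension` by changing `group_key`.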


### VQA (TIFA-lite)

For each prompt $p$ and its generated image $I_p$, assume a set of question-answer pairs generated from the original text prompt:

$Q_p = \{(q_i, a_i)\}_{i=1}^{n}$, where $n$ is the number of generated QA pairs

* $q_i =$ yes/no or short-answer question about the image content
* $a_i =$ expected (correct) answer

Given a generated image $I_p$, the VQA model returns an answer $\hat{a}_i$ for each $q_i$.

#### 1. Import VLM models

BLIP https://huggingface.co/Salesforce/blip-vqa-base

mPLUG https://huggingface.co/mPLUG/mPLUG-Owl3-7B-241101

BLIP may suit our purposes better: we want stable, predictable behavior, and it may handle our simplified, synthetic prompt structures well. mPLUG could be the better choice for longer natural-language prompts with perturbations.

#### 2. Call OpenAI API to generate Q/A's for VQA evaluation

In [None]:
import json
import pandas as pd
from openai import OpenAI

# API key placeholder: insert your own key to run, and delete it afterwards so it is never saved or shared
client = OpenAI(api_key="Jason's key")

# Load prompts CSV from Drive
df = pd.read_csv("/content/drive/MyDrive/Trustworthy-AI-Final/prompts_metalogic.csv")

rows = []

def generate_questions(promptA, promptB):
    system_msg = (
        "You generate VQA (visual question answering) yes/no checks "
        "for image-generation prompts. You MUST respond with ONLY a JSON array "
        'of strings, like ["Q1", "Q2", "Q3"]. No backticks, no markdown, '
        "no explanations, no extra keys. Just the JSON array."
    )

    user_msg = f"""
Given two logically equivalent image-generation prompts:

PROMPT A: "{promptA}"
PROMPT B: "{promptB}"

Generate 5 semantic yes/no questions that:
- Must be true for BOTH prompts.
- Focus on OBJECTS, COLORS, ATTRIBUTES, and SCENE.
- Avoid negatives.
- Are short and direct.
- Should be answerable from the image.
- Use natural phrasing.

Return ONLY a JSON list of 5 strings, e.g.:

["Question 1?", "Question 2?", "Question 3?", "Question 4?", "Question 5?"]
"""

    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[
            {"role": "system", "content": system_msg},
            {"role": "user", "content": user_msg},
        ],
        temperature=0.3,
    )

    raw = response.choices[0].message.content.strip()

    # Defensive: strip code fences if model ignores instructions
    if raw.startswith("```"):
        # Remove leading ```language\n and trailing ```
        parts = raw.split("```")
        # parts like ["", "python\n[ ... ]", ""] or ["", "json\n[...]", ""]
        if len(parts) >= 2:
            raw = parts[1]
        raw = raw.strip()
        # If language tag on the first line... (e.g. "python")
        if "\n" in raw:
            first_line, rest = raw.split("\n", 1)
            if first_line.strip().lower() in ("python", "json"):
                raw = rest.strip()

    # parse as JSON array of strings
    try:
        questions = json.loads(raw)
    except json.JSONDecodeError:
        print("Failed to parse JSON from model. Raw content was:\n", raw[:300])
        raise

    return questions


print("Generating VQA checks...")

for idx, row in df.iterrows():
    pair_id = row["pair_id"]
    logic = row["logical_law"]
    cat = row["category_id"]
    sem = row["semantic_dimension"]

    qlist = generate_questions(row["prompt_A"], row["prompt_B"])

    # Keep at most 5 questions (truncates extras; shorter lists are not padded)
    qlist = qlist[:5]

    for variant in ["A", "B"]:
        for q in qlist:
            rows.append({
                "pair_id": pair_id,
                "logical_law": logic,
                "category_id": cat,
                "semantic_dimension": sem,
                "image_variant": variant,
                "question": q,
                "expected_answer": "yes",
            })

    if idx % 10 == 0:
        print(f"Processed {idx}/{len(df)} pairs")

vqa_df = pd.DataFrame(rows)
out = "/content/drive/MyDrive/Trustworthy-AI-Final/vqa_custom_checks.csv"
vqa_df.to_csv(out, index=False)

print("Saved:", out)
vqa_df.head()
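
The cell above truncates the model's reply to five questions but does not check that the reply is actually a list of strings or that it contains enough items. A minimal sketch of a stricter check is shown here; `parse_question_list` is a hypothetical helper, not part of the original notebook, and raising on short lists is one possible policy rather than the authors' choice.

```python
import json


def parse_question_list(raw, expected=5):
    """Parse a JSON array of question strings and enforce its shape (hypothetical helper)."""
    questions = json.loads(raw)
    if not isinstance(questions, list) or not all(isinstance(q, str) for q in questions):
        raise ValueError("expected a JSON array of strings")
    # Drop empty entries, then require at least `expected` usable questions
    questions = [q.strip() for q in questions if q.strip()]
    if len(questions) < expected:
        raise ValueError(f"expected {expected} questions, got {len(questions)}")
    return questions[:expected]


# Illustrative reply with one question too many; the extra one is dropped
reply = '["Is there a cat?", "Is the cat red?", "Is there an apple?", "Is the apple yellow?", "Is there a table?", "Is the table wooden?"]'
print(parse_question_list(reply))
```

Failing loudly on a short list keeps every pair at the same number of checks, so per-image accuracies remain comparable across pairs.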


#### VQA Faithfulness Score

* How many questions does the VQA model answer correctly for the generated image? This indicates how "faithful" the image is to the text prompt.
* $C_i \in \{0, 1\}$: 1 if the predicted answer $\hat{a}_i$ matches the expected answer $a_i$, 0 otherwise
* $Faithfulness(p, I_p) = \frac{1}{N_p} \sum_{i=1}^{N_p} C_i$, where $N_p$ is the number of questions for prompt $p$
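
As a minimal sketch of the faithfulness score, the function below averages the per-question correctness indicators for one image. The name `vqa_faithfulness` and the example answers are hypothetical; answers are assumed to be normalized strings already.

```python
def vqa_faithfulness(expected_answers, predicted_answers):
    """Fraction of questions whose predicted answer equals the expected answer."""
    if len(expected_answers) != len(predicted_answers):
        raise ValueError("expected and predicted answers must have the same length")
    if not expected_answers:
        raise ValueError("need at least one question")
    # C_i is 1 when the answers match and 0 otherwise
    correct = [int(e == p) for e, p in zip(expected_answers, predicted_answers)]
    return sum(correct) / len(correct)


# Hypothetical answers for one image with five yes/no checks
expected = ["yes", "yes", "yes", "yes", "yes"]
predicted = ["yes", "yes", "no", "yes", "yes"]
print(vqa_faithfulness(expected, predicted))  # 0.8
```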

## VQA

### Load BLIP VQA Model


The QA pairs themselves must come from an LLM (the notes name GPT-3; the generation cell above calls `gpt-4.1-mini`). BLIP is only used to answer them.

In [None]:
!pip install -q transformers accelerate timm sentencepiece

import torch
from transformers import BlipProcessor, BlipForQuestionAnswering

# BLIP VQA model
vqa_model_name = "Salesforce/blip-vqa-base"

vqa_processor = BlipProcessor.from_pretrained(vqa_model_name)
vqa_model = BlipForQuestionAnswering.from_pretrained(vqa_model_name)
vqa_model.eval()

if torch.cuda.is_available():
    vqa_model.to("cuda")

print("Loaded VQA model:", vqa_model_name)


### VQA Helper Functions

In [None]:
from PIL import Image
import re
import os

def answer_vqa(image_path, question, model, processor):
    """Run BLIP VQA on (image, question) and return a lowercase string answer."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, text=question, return_tensors="pt")

    if torch.cuda.is_available():
        inputs = {k: v.to("cuda") for k, v in inputs.items()}

    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=10)

    answer = processor.decode(out[0], skip_special_tokens=True)
    return answer.strip().lower()


def normalize_answer(ans):
    """Light normalization so we can compare predicted vs expected answers."""
    ans = ans.lower().strip()
    ans = re.sub(r"[^\w\s]", "", ans)  # remove punctuation

    # Map common yes/no variants
    if ans in ["yes", "yeah", "yep", "yup", "of course"]:
        return "yes"
    if ans in ["no", "nope", "nah"]:
        return "no"

    return ans


### Run TIFA-lite style VQA and compute metrics

In [None]:
import pandas as pd
import os

vqa_results = []

# Load the custom checks we generated
qa_df = pd.read_csv("/content/drive/MyDrive/Trustworthy-AI-Final/vqa_custom_checks.csv")
print("Loaded custom VQA checks:", qa_df.shape)

print("Running VQA over custom semantic checks...")

for idx, row in qa_df.iterrows():
    pair_id = row["pair_id"]
    logical_law = row["logical_law"]
    category_id = row["category_id"]
    semantic_dim = row["semantic_dimension"]
    variant = row["image_variant"]   # 'A' or 'B'
    question = row["question"]
    expected = normalize_answer(str(row["expected_answer"]))

    # Rebuild image path (same structure as CLIP)
    logical_law_fs = logical_law.replace(" ", "_").replace("/", "_")
    semantic_fs = semantic_dim.replace(" ", "_").replace("/", "_")

    image_dir = os.path.join(base_image_dir, logical_law_fs, str(category_id), semantic_fs)
    image_path = os.path.join(image_dir, f"{pair_id}_{variant}.png")

    if not os.path.exists(image_path):
        print(f"Missing image for pair {pair_id} {variant} at {image_path}. Skipping.")
        continue

    pred_raw = answer_vqa(image_path, question, vqa_model, vqa_processor)
    pred_norm = normalize_answer(pred_raw)
    is_correct = int(pred_norm == expected)

    vqa_results.append({
        "pair_id": pair_id,
        "logical_law": logical_law,
        "category_id": category_id,
        "semantic_dimension": semantic_dim,
        "image_variant": variant,
        "image_path": image_path,
        "question": question,
        "expected_answer": expected,
        "predicted_answer_raw": pred_raw,
        "predicted_answer_norm": pred_norm,
        "is_correct": is_correct,
    })

    if idx % 25 == 0:
        print(f"[{idx}/{len(qa_df)}] processed pair {pair_id} {variant}")

print("Finished VQA on all custom checks.")

vqa_df = pd.DataFrame(vqa_results)
vqa_df.head()


## Compute Faithfulness and Stability Metrics

In [None]:
# Faithfulness per image (per prompt)
faith_per_image = (
    vqa_df
      .groupby(["pair_id", "image_variant"])
      .agg(
          vqa_accuracy=("is_correct", "mean"),
          num_questions=("is_correct", "size"),
      )
      .reset_index()
)

faith_per_image.head()

# Put A and B in same row for each pair
faith_pivot = faith_per_image.pivot(
    index="pair_id",
    columns="image_variant",
    values="vqa_accuracy",
).reset_index()

faith_pivot = faith_pivot.rename(columns={"A": "faith_A", "B": "faith_B"})

# Merge metadata from Prompt_Pairs
meta_cols = ["pair_id", "logical_law", "category_id", "semantic_dimension"]
meta_df = Prompt_Pairs[meta_cols].drop_duplicates()

vqa_pair_metrics = faith_pivot.merge(meta_df, on="pair_id", how="left")

# TIFA-lite style scores
vqa_pair_metrics["faith_mean"] = 0.5 * (vqa_pair_metrics["faith_A"] + vqa_pair_metrics["faith_B"])
vqa_pair_metrics["faith_diff"] = (vqa_pair_metrics["faith_A"] - vqa_pair_metrics["faith_B"]).abs()

vqa_pair_metrics.head()


## Save Metrics to Drive

In [None]:
out_path_all = "/content/drive/MyDrive/Trustworthy-AI-Final/vqa_raw_results.csv"
out_path_pairs = "/content/drive/MyDrive/Trustworthy-AI-Final/vqa_tifa_lite_pair_metrics.csv"

vqa_df.to_csv(out_path_all, index=False)
vqa_pair_metrics.to_csv(out_path_pairs, index=False)

print("Saved raw VQA results to:", out_path_all)
print("Saved pair-wise TIFA-lite metrics to:", out_path_pairs)
