# FESTA Demo - LLaVA 1.6 7B Testing

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/iiscleap/mllm-uncertainty-estimation/blob/main/festa_demo/FESTA_Simple_Demo.ipynb)

A simple notebook for testing the LLaVA 1.6 7B model on FESTA example images, using the same prompting as the research setup.

In [None]:
# Install required packages
!pip install torch torchvision transformers pillow accelerate bitsandbytes
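# Optional: the cells below only use the LLaVA-NeXT classes shipped with transformers, not this package
!pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git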

In [None]:
import torch
import json
import requests
from PIL import Image
import matplotlib.pyplot as plt
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
from io import BytesIO
import warnings
from transformers import logging

# Suppress warnings and transformers logging (same as research script)
warnings.filterwarnings("ignore")
logging.set_verbosity_error()

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

In [None]:
# Load LLaVA 1.6 7B model (exactly same as research script)
print("Loading Llava model...")
processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf", 
    torch_dtype=torch.float16, 
    device_map="auto"
)
print("Model loaded successfully!")

In [None]:
# GitHub base URL for examples
base_url = "https://raw.githubusercontent.com/iiscleap/mllm-uncertainty-estimation/main/festa_demo/examples/"

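def load_image_from_url(image_name):
    """Download an example image from the FESTA demo folder and return it as an RGB PIL image."""
    url = base_url + image_name
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # Fail early on a bad URL instead of passing an error page to PIL
    return Image.open(BytesIO(response.content)).convert("RGB")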

def generate_response(image, question, dataset_type="blink"):
    """Generate response using exact same prompt format as research script"""
    
    # Set choices based on dataset (same as research script)
    if dataset_type == "blink":
        choices = "A. Yes\nB. No"
    elif dataset_type == "vsr":
        choices = "A. True\nB. False"
    else:
        choices = "A. Yes\nB. No"  # default to blink format
    
    # Create instruction with exact same format as research script
    instruction = f"{question}\nChoices:\n{choices}\nReturn only the option (A or B), and nothing else.\nMAKE SURE your output is A or B"
    
    # Create conversation with exact same structure as research script
    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                {"type": "image"},
            ],
        },
    ]
    
    # Apply chat template and generate (same as research script)
    prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)
    
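    # Generate with exact same parameters as research script.
    # The decoded string holds the prompt followed by the single new token,
    # so the last character of the stripped string is taken as the answer.
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=1)
        res = processor.decode(output[0], skip_special_tokens=True).strip()[-1]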
    
    return res

def test_example(image_name, question, title, dataset_type="blink"):
    """Test example with research-grade prompting"""
    image = load_image_from_url(image_name)
    response = generate_response(image, question, dataset_type)
    
    # Determine full answer text (any dataset type other than "vsr" uses the BLINK labels)
    if dataset_type == "vsr":
        labels = {"A": "A (True)", "B": "B (False)"}
    else:
        labels = {"A": "A (Yes)", "B": "B (No)"}
    # Anything other than A or B is shown unchanged
    full_answer = labels.get(response, response)
    
    plt.figure(figsize=(12, 6))
    plt.imshow(image)
    plt.axis('off')
    plt.title(f"{title}\nQuestion: {question}\nLLaVA Answer: {full_answer}", fontsize=12, pad=20)
    plt.tight_layout()
    plt.show()
    
    return response

print("✅ Functions loaded with research-grade prompting!")

## Test with 6 FESTA Examples

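Each example below uses the same prompting setup as the FESTA research paper. Examples 2, 3, and 6 are equivalent samples (perturbed images that should keep the original answer), while Example 4 is a complementary sample (a negated scene that should flip it).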

In [None]:
# Example 1: Original Spatial Relation
response = test_example(
    "val_Spatial_Relation_1.jpg",
    "Is the car beneath the cat?",
    "Example 1: Original Spatial Relation",
    "blink"
)
print(f"Raw Response: {response}\n")

In [None]:
# Example 2: Contrast Perturbation (Equivalent Sample)
response = test_example(
    "val_Spatial_Relation_1_contrast1.jpg",
    "Is the car beneath the cat?",
    "Example 2: Contrast Perturbation (Should be same as Example 1)",
    "blink"
)
print(f"Raw Response: {response}\n")

In [None]:
# Example 3: Masking Perturbation (Equivalent Sample)
response = test_example(
    "val_Spatial_Relation_1_masking1.jpg",
    "Is the car beneath the cat?",
    "Example 3: Masking Perturbation (Should be same as Example 1)",
    "blink"
)
print(f"Raw Response: {response}\n")

In [None]:
# Example 4: Negated/Complementary Version (Should toggle answer)
response = test_example(
    "val_Spatial_Relation_1_negated_contrast1.jpg",
    "Is the car beneath the cat?",
    "Example 4: Negated Scene (Should give opposite answer)",
    "blink"
)
print(f"Raw Response: {response}\n")

In [None]:
# Example 5: Different Scene Original
response = test_example(
    "val_Spatial_Relation_5.jpg",
    "Are there animals in this image?",
    "Example 5: Different Scene Original",
    "blink"
)
print(f"Raw Response: {response}\n")

In [None]:
# Example 6: Different Scene Blur (Equivalent Sample)
response = test_example(
    "val_Spatial_Relation_5_blur1.jpg",
    "Are there animals in this image?",
    "Example 6: Blur Perturbation (Should be same as Example 5)",
    "blink"
)
print(f"Raw Response: {response}\n")

## Compare Responses

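Check for FESTA failure patterns by asking the same question on the original image and its equivalent perturbations, then comparing the answers: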

In [None]:
# Ask the same question on the original and perturbed images (FESTA equivalent sampling)
print("🔍 FESTA Equivalent Sampling Test:")
print("Testing same question on original vs perturbed images (should be consistent)\n")

question = "Is the car beneath the cat?"
images = [
    "val_Spatial_Relation_1.jpg",
    "val_Spatial_Relation_1_contrast1.jpg", 
    "val_Spatial_Relation_1_masking1.jpg"
]

labels = ["Original", "Contrast", "Masking"]
responses = []
for img_type, img in zip(labels, images):
    image = load_image_from_url(img)
    resp = generate_response(image, question, "blink")
    responses.append(resp)
    print(f"{img_type:>10}: {resp}")

# Check consistency
all_same = len(set(responses)) == 1
print(f"\n{'✅ CONSISTENT' if all_same else '❌ INCONSISTENT'}: {'All responses match' if all_same else 'Responses vary across equivalent samples'}")

if not all_same:
    print("⚠️  FESTA Equivalent Sampling Failure Detected!")

In [None]:
# Custom testing function - modify as needed
def test_custom():
    # Change these values to test other images/questions
    image_name = "val_Spatial_Relation_1.jpg"  # Change this
    question = "Is the car beneath the cat?"      # Change this
    dataset_type = "blink"                       # "blink" or "vsr"
    
    response = test_example(image_name, question, "Custom Test", dataset_type)
    print(f"Custom Response: {response}")

# Uncomment to use custom testing
# test_custom()

## Research Notes

This notebook uses the **exact same prompting setup** as the FESTA research paper:

- **Instruction** (sent as the user message; no separate system prompt is set): `"{question}\nChoices:\n{choices}\nReturn only the option (A or B), and nothing else.\nMAKE SURE your output is A or B"`
- **Chat Template**: Applied via `processor.apply_chat_template()`
- **Generation**: `max_new_tokens=1` for single token A/B response
- **Model**: `llava-hf/llava-v1.6-mistral-7b-hf` with `torch.float16`
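
The notebook reads the answer as the last character of the decoded prompt-plus-generation string. As a minimal sketch of a stricter alternative (not part of the research script; `parse_option` is a hypothetical helper introduced here), the function below takes only the newly generated text and returns `A`, `B`, or `None` when neither option letter is present:

```python
import re
from typing import Optional


def parse_option(generated_text: str) -> Optional[str]:
    """Return 'A' or 'B' if the text starts with that option letter, else None.

    Hypothetical helper for illustration; the research script instead keeps
    the last character of the full decoded string.
    """
    # Accept an optional opening parenthesis, then a standalone A or B
    match = re.match(r"\(?([AB])\b", generated_text.strip(), flags=re.IGNORECASE)
    return match.group(1).upper() if match else None


print(parse_option("A"))    # option letter on its own
print(parse_option(" b."))  # lower case with punctuation
print(parse_option("Yes"))  # no option letter, so None
```

To use it, the new tokens would need to be separated from the prompt first. With Hugging Face `generate` this is commonly done by slicing the output past the prompt length, for example `output[0][inputs["input_ids"].shape[1]:]`, before decoding; treat that slicing as an assumption to verify against the installed transformers version.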

**FESTA Framework Tests:**
- **Equivalent Samples**: Same question, different perturbations → Should give consistent answers
- **Complementary Samples**: Opposite scenarios → Should give different answers
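
The two checks above can be tallied with a few lines of plain Python. This is an illustrative sketch only: the function names are hypothetical, the answers passed in are placeholders rather than real model outputs, and the resulting fractions are a simple consistency tally, not the uncertainty measure defined in the FESTA paper.

```python
from typing import Dict, Sequence


def equivalent_agreement(original: str, equivalents: Sequence[str]) -> float:
    """Fraction of equivalent-sample answers that match the original answer."""
    if not equivalents:
        raise ValueError("need at least one equivalent-sample answer")
    return sum(ans == original for ans in equivalents) / len(equivalents)


def complementary_flip_rate(original: str, complements: Sequence[str]) -> float:
    """Fraction of complementary-sample answers that differ from the original answer."""
    if not complements:
        raise ValueError("need at least one complementary-sample answer")
    return sum(ans != original for ans in complements) / len(complements)


def consistency_report(
    original: str, equivalents: Sequence[str], complements: Sequence[str]
) -> Dict[str, float]:
    """Bundle both fractions; 1.0 in each field means fully consistent behaviour."""
    return {
        "equivalent_agreement": equivalent_agreement(original, equivalents),
        "complementary_flip_rate": complementary_flip_rate(original, complements),
    }


# Placeholder answers (NOT real model outputs): one original answer,
# three equivalent-sample answers, and one complementary-sample answer.
report = consistency_report("B", ["B", "B", "A"], ["A"])
print(report)
```

In the notebook, `original` would be the response from Example 1, `equivalents` the responses from Examples 2 and 3, and `complements` the response from Example 4.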