# Lab 3.6: Image Captioning and VQA

**Objective**: Generate captions and answer questions about images

**Duration**: 30 minutes

## Learning Outcomes
- Use BLIP for image captioning
- Perform visual question answering
- Understand multi-modal model outputs

In [None]:
import sys
sys.path.insert(0, "../../../src")
from hf_ecosystem import __version__
print(f"hf-ecosystem version: {__version__}")

In [None]:
from transformers import pipeline
from PIL import Image
import requests
from io import BytesIO

## 1. Image Captioning with BLIP

In [None]:
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base", device="cpu")
print("BLIP captioning model loaded")

In [None]:
# Load image
url = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/300px-PNG_transparency_demonstration_1.png"
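response = requests.get(url, timeout=30)
response.raise_for_status()  # stop on an HTTP error instead of handing an error page to PIL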
image = Image.open(BytesIO(response.content)).convert("RGB")

# Generate caption
caption = captioner(image)
print(f"Caption: {caption[0]['generated_text']}")
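
The cell above reads the caption from `caption[0]['generated_text']`, so the captioner returns a list of dictionaries. As a minimal sketch of handling that shape, the hypothetical helper `extract_captions` below (not part of `transformers`) collects the non-empty caption strings. The `sample_output` list is a made-up stand-in for pipeline output, so the sketch runs without loading the model.

```python
from typing import Any


def extract_captions(outputs: list[dict[str, Any]]) -> list[str]:
    """Return the non-empty caption strings from an image-to-text style result list."""
    captions = []
    for item in outputs:
        # Each entry is expected to carry its text under "generated_text".
        text = item.get("generated_text", "").strip()
        if text:
            captions.append(text)
    return captions


# Illustrative stand-in for pipeline output; not a real model result.
sample_output = [{"generated_text": "  a placeholder caption "}]
print(extract_captions(sample_output))
```

In the lab itself, `extract_captions(caption)` should give the same text that the `print` call above shows, with surrounding whitespace removed.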

## 2. Visual Question Answering

In [None]:
vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa", device="cpu")
print("VQA model loaded")

In [None]:
# Ask questions about the image
questions = ["What color is prominent?", "Is this a photo?"]

for q in questions:
    answer = vqa(image, q)
    print(f"Q: {q}")
    print(f"A: {answer[0]['answer']} ({answer[0]['score']:.3f})")
    print()
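
The loop above prints only `answer[0]`, but the VQA result is a list of candidate answers, each with an `answer` string and a `score`. As a rough sketch of interpreting that output, the hypothetical helpers below keep the candidates above a confidence threshold and format the best one. The `min_score` default and the `sample_results` values are assumptions chosen for illustration, not real model output.

```python
def confident_answers(results, min_score=0.5):
    """Keep candidates scoring at least min_score, highest score first."""
    kept = [r for r in results if r["score"] >= min_score]
    return sorted(kept, key=lambda r: r["score"], reverse=True)


def format_answer(question, results, min_score=0.5):
    """Format the best confident answer, or note that none passed the threshold."""
    kept = confident_answers(results, min_score)
    if not kept:
        return f"Q: {question}\nA: (no answer above {min_score:.2f})"
    best = kept[0]
    return f"Q: {question}\nA: {best['answer']} ({best['score']:.3f})"


# Illustrative values only; not produced by the model.
sample_results = [
    {"score": 0.62, "answer": "yes"},
    {"score": 0.30, "answer": "no"},
]
print(format_answer("Is this a photo?", sample_results))
```

A low top score is a hint that the question may not suit the image, so reporting "no confident answer" can be more honest than always printing the first candidate.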

## Verification

In [None]:
def verify_lab():
    assert len(caption) > 0
    assert "generated_text" in caption[0]
    print("✅ Lab completed successfully!")

verify_lab()