# Multimodal AI with Hugging Face

This notebook demonstrates several multimodal AI tasks using Hugging Face Transformers and Diffusers:
- Image captioning
- Visual question answering (VQA)
- Text-to-image generation
- Image-text similarity
- Document understanding

In [None]:
# Install required packages
!pip install transformers torch torchvision pillow diffusers accelerate requests matplotlib

In [None]:
import torch
from transformers import (
    BlipProcessor, BlipForConditionalGeneration,
    ViltProcessor, ViltForQuestionAnswering,
    CLIPProcessor, CLIPModel,
    LayoutLMv3Processor, LayoutLMv3ForTokenClassification
)
from diffusers import StableDiffusionPipeline
from PIL import Image
import requests
import matplotlib.pyplot as plt
import numpy as np

## 1. Image Captioning with BLIP

Generate descriptive captions for images using the BLIP model.
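
The captioning cell in this section passes `num_beams=5` to `generate`. As a rough illustration of what that setting changes, here is a minimal beam search over a made-up two-step vocabulary. The probability table and the `beam_search` helper are hypothetical and exist only for this sketch; the real `generate` method also handles end-of-sequence tokens, length penalties, and batching.

```python
import math

# Hypothetical next-token probabilities for a toy two-word "captioner".
NEXT_TOKEN_PROBS = {
    (): {"a": 0.6, "the": 0.4},
    ("a",): {"dog": 0.55, "cat": 0.45},
    ("the",): {"dog": 0.9, "cat": 0.1},
}

def beam_search(num_beams, steps=2):
    """Return the highest-scoring token sequence after `steps` expansions."""
    beams = [((), 0.0)]  # (tokens so far, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            for token, prob in NEXT_TOKEN_PROBS[tokens].items():
                candidates.append((tokens + (token,), score + math.log(prob)))
        # Keep only the `num_beams` best partial sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0][0]

greedy_caption = beam_search(num_beams=1)  # greedy decoding is a beam of size 1
beam_caption = beam_search(num_beams=2)
print("greedy:", " ".join(greedy_caption))
print("beam:  ", " ".join(beam_caption))
```

With a single beam the search commits to the locally best first word, while a wider beam keeps the alternative alive and can end up with a higher-probability sequence overall.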

In [None]:
# Load BLIP model for image captioning
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Load sample image (convert to RGB so the processor always receives 3 channels)
url = "https://images.unsplash.com/photo-1518717758536-85ae29035b6d?w=400"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Generate caption
inputs = processor(image, return_tensors="pt")
out = model.generate(**inputs, max_length=50, num_beams=5)
caption = processor.decode(out[0], skip_special_tokens=True)

# Display results
plt.figure(figsize=(10, 6))
plt.imshow(image)
plt.axis('off')
plt.title(f"Generated Caption: {caption}", fontsize=14, pad=20)
plt.show()

print(f"Caption: {caption}")

## 2. Visual Question Answering with ViLT

Answer questions about images using Vision-and-Language Transformer.
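
The VQA checkpoint used in this section treats question answering as classification: the model emits one logit per entry in a fixed answer vocabulary, and the predicted answer is the label with the highest logit. The sketch below shows that lookup on made-up values and extends it to the top few answers. The `top_k_answers` helper, the logits, and the four-entry `id2label` mapping are hypothetical; the real mapping lives in `vqa_model.config.id2label`.

```python
def top_k_answers(logits, id2label, k=3):
    """Return the k (answer, logit) pairs with the highest logits."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return [(id2label[i], logits[i]) for i in ranked[:k]]

# Made-up values standing in for one row of model logits.
toy_id2label = {0: "dog", 1: "cat", 2: "brown", 3: "outside"}
toy_logits = [4.2, 1.1, -0.5, 2.7]

best_answers = top_k_answers(toy_logits, toy_id2label, k=2)
for answer, logit in best_answers:
    print(f"{answer}: {logit:.1f}")
```

Looking at more than the single best label can help when the top answer is a near tie with the runner-up.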

In [None]:
# Load ViLT model for VQA
vqa_processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
vqa_model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

# Use the same image from before
questions = [
    "What animal is in the image?",
    "What color is the animal?",
    "Is the animal indoors or outdoors?",
    "What is the animal doing?"
]

print("Visual Question Answering Results:")
print("=" * 40)

for question in questions:
    # Prepare inputs
    encoding = vqa_processor(image, question, return_tensors="pt")
    
    # Forward pass (no gradients are needed for inference)
    with torch.no_grad():
        outputs = vqa_model(**encoding)
    logits = outputs.logits
    idx = logits.argmax(-1).item()
    answer = vqa_model.config.id2label[idx]
    
    print(f"Q: {question}")
    print(f"A: {answer}")
    print()

# Display the image again for reference
plt.figure(figsize=(8, 6))
plt.imshow(image)
plt.axis('off')
plt.title("Image for VQA", fontsize=14)
plt.show()

## 3. Text-to-Image Generation with Stable Diffusion

Generate images from text descriptions using Stable Diffusion.
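
The pipeline call in this section sets `guidance_scale=7.5`. In classifier-free guidance, each denoising step combines an unconditional noise prediction with a text-conditioned one, and the scale controls how far the result is pushed toward the prompt. The following is a minimal sketch of that combination on plain Python lists; the `apply_guidance` helper and the three-element noise vectors are made up for illustration, and the real pipeline applies the same arithmetic to latent tensors.

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Combine unconditional and text-conditioned noise predictions."""
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]

# Made-up noise predictions for a single denoising step.
noise_uncond = [0.10, -0.20, 0.05]
noise_text = [0.30, -0.10, 0.00]

guided = apply_guidance(noise_uncond, noise_text, guidance_scale=7.5)
print([round(value, 3) for value in guided])
```

A scale of 1.0 reproduces the text-conditioned prediction, and larger values exaggerate the difference between the two predictions, which typically trades diversity for closer adherence to the prompt.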

In [None]:
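# Load Stable Diffusion pipeline
# Note: if this repo ID is unavailable on the Hub, the same weights are
# mirrored as "stable-diffusion-v1-5/stable-diffusion-v1-5".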
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32
)

if torch.cuda.is_available():
    pipe = pipe.to("cuda")

# Generate images from different prompts
prompts = [
    "A serene mountain landscape at sunset with a crystal clear lake",
    "A futuristic city with flying cars and neon lights",
    "A cozy library with books floating magically in the air"
]

fig, axes = plt.subplots(1, len(prompts), figsize=(15, 5))

for i, prompt in enumerate(prompts):
    print(f"Generating image for: {prompt}")
    
    # Generate image (use a new name so the sample photo in `image` is not overwritten)
    generated = pipe(prompt, num_inference_steps=20, guidance_scale=7.5).images[0]
    
    # Display
    axes[i].imshow(generated)
    axes[i].axis('off')
    axes[i].set_title(prompt[:30] + "...", fontsize=10, pad=10)

plt.tight_layout()
plt.show()

## 4. Image-Text Similarity with CLIP

Measure similarity between images and text descriptions using CLIP.
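
CLIP scores an image-text pair by the cosine similarity of their embeddings, multiplied by a learned temperature; a softmax over the text queries then turns each image's scores into probabilities. The sketch below reproduces that arithmetic in plain Python. The three-dimensional embeddings and the `logit_scale` value are assumptions chosen for illustration; real CLIP embeddings come from the model and have hundreds of dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up embeddings standing in for CLIP's image and text features.
image_embedding = [0.9, 0.1, 0.2]
text_embeddings = {
    "a photo of a dog": [0.8, 0.2, 0.1],
    "a photo of a car": [0.1, 0.9, 0.3],
}
logit_scale = 100.0  # assumed value; CLIP learns this temperature during training

toy_logits = [logit_scale * cosine(image_embedding, t) for t in text_embeddings.values()]
toy_probs = softmax(toy_logits)
for query, prob in zip(text_embeddings, toy_probs):
    print(f"{query}: {prob:.3f}")
```

Because the temperature is large, small differences in cosine similarity become sharp differences in probability, which is why the softmax output in the cell below can look very confident.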

In [None]:
# Load CLIP model
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Load multiple images for comparison
image_urls = [
    "https://images.unsplash.com/photo-1518717758536-85ae29035b6d?w=300",  # Dog
    "https://images.unsplash.com/photo-1514888286974-6c03e2ca1dba?w=300",  # Cat
    "https://images.unsplash.com/photo-1449824913935-59a10b8d2000?w=300",  # Car
]

images = [Image.open(requests.get(url, stream=True).raw) for url in image_urls]
image_labels = ["Dog", "Cat", "Car"]

# Text descriptions to compare
text_queries = [
    "a photo of a dog",
    "a photo of a cat",
    "a photo of a car",
    "a photo of an animal",
    "a photo of a vehicle"
]

# Process inputs
inputs = clip_processor(
    text=text_queries,
    images=images,
    return_tensors="pt",
    padding=True
)

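# Compute image-text similarity logits (no gradients are needed for inference)
with torch.no_grad():
    outputs = clip_model(**inputs)
logits_per_image = outputs.logits_per_image  # shape: (num_images, num_texts)
probs = logits_per_image.softmax(dim=1)  # softmax over the text queries for each image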

# Display results
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Show the three images in the first three panels; the fourth holds the heatmap
for i, (image, label) in enumerate(zip(images, image_labels)):
    row, col = divmod(i, 2)
    axes[row, col].imshow(image)
    axes[row, col].axis('off')
    axes[row, col].set_title(f"Image {i+1}: {label}", fontsize=12)

# Show similarity heatmap
similarity_matrix = logits_per_image.detach().numpy()
im = axes[1, 1].imshow(similarity_matrix, cmap='viridis', aspect='auto')
axes[1, 1].set_xticks(range(len(text_queries)))
axes[1, 1].set_xticklabels([q[:20] + "..." if len(q) > 20 else q for q in text_queries], rotation=45)
axes[1, 1].set_yticks(range(len(image_labels)))
axes[1, 1].set_yticklabels(image_labels)
axes[1, 1].set_title("Image-Text Similarity Scores")
plt.colorbar(im, ax=axes[1, 1])

plt.tight_layout()
plt.show()

# Print top matches for each image
print("\nTop text matches for each image:")
print("=" * 40)
for i, label in enumerate(image_labels):
    top_match_idx = probs[i].argmax().item()
    confidence = probs[i][top_match_idx].item() * 100
    print(f"{label}: '{text_queries[top_match_idx]}' ({confidence:.1f}% confidence)")

## 5. Document Understanding with LayoutLMv3

Extract and classify information from document images using LayoutLMv3.
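
LayoutLMv3 takes one bounding box per word, expressed on a 0-1000 scale relative to the page size rather than in pixels. The helper below is a minimal sketch of that conversion; `normalize_box` is a hypothetical name, and the clamping to the 0-1000 range is an added safeguard for boxes that spill past the page edge, not something the cell in this section does.

```python
def normalize_box(box, width, height):
    """Scale a pixel box [x1, y1, x2, y2] to the 0-1000 range."""
    x1, y1, x2, y2 = box

    def scale(value, size):
        # Clamp so boxes that extend past the page edge stay in range.
        return max(0, min(1000, int(1000 * value / size)))

    return [scale(x1, width), scale(y1, height), scale(x2, width), scale(y2, height)]

# Example: the "INVOICE" box on an 800x600 page.
invoice_box = normalize_box([50, 50, 150, 74], width=800, height=600)
print(invoice_box)
```

Keeping the conversion in one function makes it easy to reuse with real OCR output, where page sizes vary from document to document.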

In [None]:
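# Load LayoutLMv3 for document understanding
# Note: the base checkpoint has no fine-tuned token-classification head, so the
# head is randomly initialized and its predictions are not meaningful until the
# model is fine-tuned on labeled documents.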
layout_processor = LayoutLMv3Processor.from_pretrained(
    "microsoft/layoutlmv3-base", 
    apply_ocr=False
)
layout_model = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base"
)

# Create a simple document-like image with text
from PIL import Image, ImageDraw, ImageFont

# Create a sample invoice-like document
img = Image.new('RGB', (800, 600), color='white')
draw = ImageDraw.Draw(img)

# Try to use Arial; fall back to PIL's built-in font if it is not installed
try:
    font_large = ImageFont.truetype("arial.ttf", 24)
    font_medium = ImageFont.truetype("arial.ttf", 16)
    font_small = ImageFont.truetype("arial.ttf", 12)
except OSError:
    font_large = ImageFont.load_default()
    font_medium = ImageFont.load_default()
    font_small = ImageFont.load_default()

# Add document content
draw.text((50, 50), "INVOICE", fill='black', font=font_large)
draw.text((50, 100), "Invoice #: INV-2024-001", fill='black', font=font_medium)
draw.text((50, 130), "Date: January 15, 2024", fill='black', font=font_medium)
draw.text((50, 180), "Bill To:", fill='black', font=font_medium)
draw.text((50, 210), "John Doe", fill='black', font=font_medium)
draw.text((50, 240), "123 Main Street", fill='black', font=font_medium)
draw.text((50, 270), "New York, NY 10001", fill='black', font=font_medium)
draw.text((50, 320), "Description: Web Development Services", fill='black', font=font_medium)
draw.text((50, 350), "Amount: $2,500.00", fill='black', font=font_medium)
draw.text((50, 400), "Total: $2,500.00", fill='black', font=font_large)

# For this example, we'll simulate OCR data
# In practice, you would use an OCR service to extract text and bounding boxes
words = [
    "INVOICE", "Invoice", "#:", "INV-2024-001", "Date:", "January", "15,", "2024",
    "Bill", "To:", "John", "Doe", "123", "Main", "Street", "New", "York,", "NY", "10001",
    "Description:", "Web", "Development", "Services", "Amount:", "$2,500.00",
    "Total:", "$2,500.00"
]

# Simulate bounding boxes in pixel coordinates, as [x1, y1, x2, y2]
boxes = [
    [50, 50, 150, 74], [50, 100, 100, 116], [101, 100, 115, 116], [116, 100, 220, 116],
    [50, 130, 85, 146], [86, 130, 140, 146], [141, 130, 165, 146], [166, 130, 210, 146],
    [50, 180, 85, 196], [86, 180, 115, 196], [50, 210, 90, 226], [91, 210, 130, 226],
    [50, 240, 75, 256], [76, 240, 115, 256], [116, 240, 160, 256], [50, 270, 85, 286],
    [86, 270, 130, 286], [131, 270, 155, 286], [156, 270, 200, 286],
    [50, 320, 120, 336], [121, 320, 155, 336], [156, 320, 250, 336], [251, 320, 310, 336],
    [50, 350, 105, 366], [106, 350, 180, 366], [50, 400, 95, 416], [96, 400, 170, 416]
]

# Normalize bounding boxes to 0-1000 scale (LayoutLM expects this)
normalized_boxes = []
for box in boxes:
    normalized_box = [
        int(box[0] * 1000 / 800),  # x1
        int(box[1] * 1000 / 600),  # y1
        int(box[2] * 1000 / 800),  # x2
        int(box[3] * 1000 / 600)   # y2
    ]
    normalized_boxes.append(normalized_box)

# Display the document
plt.figure(figsize=(10, 8))
plt.imshow(img)
plt.axis('off')
plt.title("Sample Document for Layout Analysis", fontsize=14, pad=20)
plt.show()

print("Document inputs prepared (no model inference was run).")
print(f"Prepared {len(words)} simulated OCR words for the document.")
print("\nSample words with normalized boxes:")
for i, (word, box) in enumerate(zip(words[:10], normalized_boxes[:10])):
    print(f"{i+1:2d}. '{word}' at position {box}")

print("\nThis example prepares the inputs LayoutLMv3 expects; it does not run the model.")
print("In production, you would:")
print("1. Use OCR to extract text and bounding boxes")
print("2. Fine-tune LayoutLMv3 for your specific document types")
print("3. Extract structured information like dates, amounts, addresses")

## Summary

This notebook demonstrated five key multimodal AI capabilities:

1. **Image Captioning**: Automatically generating descriptive text for images
2. **Visual Question Answering**: Answering natural language questions about image content
3. **Text-to-Image Generation**: Creating images from textual descriptions
4. **Image-Text Similarity**: Measuring semantic similarity between visual and textual content
5. **Document Understanding**: Extracting structured information from document images

### Key Takeaways:

- **BLIP** generates image captions, optionally conditioned on a text prompt
- **ViLT** answers questions about an image by classifying over a fixed answer vocabulary
- **Stable Diffusion** generates images from text prompts through iterative denoising
- **CLIP** scores image-text pairs by comparing them in a shared embedding space
- **LayoutLMv3** combines text, layout, and image features to model document structure

### Applications:

- **Content Moderation**: Automatically analyzing and describing user-uploaded images
- **Accessibility**: Generating alt-text for images to assist visually impaired users
- **Search & Retrieval**: Finding images based on natural language descriptions
- **Creative Tools**: Generating artwork and designs from text descriptions
- **Document Processing**: Automating invoice processing, form extraction, and document analysis

Together, these models bridge visual and textual understanding, and each task shown here can be run with a few lines of code.