# 03 - Multimodal Pipelines

This notebook covers pipelines for non-text modalities:

**Computer Vision:**
- Image Classification
- Object Detection
- Image Segmentation
- Depth Estimation

**Audio:**
- Automatic Speech Recognition (ASR)
- Audio Classification

**Multimodal:**
- Image-to-Text (Captioning)
- Visual Question Answering
- Document Question Answering
- Zero-Shot Image Classification

In [None]:
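# Install additional dependencies for multimodal
# !pip install transformers torch torchvision torchaudio
# !pip install pillow soundfile librosa
# !pip install timm          # Needed by the DETR checkpoints in 1.2 and 1.3
# !pip install pytesseract   # OCR for document QA in 3.3 (also needs the Tesseract binary)
# Decoding audio from a file path or URL (Part 2) also needs ffmpeg on the system PATH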

In [None]:
from transformers import pipeline
import torch
from PIL import Image
import requests
from io import BytesIO

In [None]:
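# Helper function to load images from URL
def load_image_from_url(url, timeout=30):
    # Some hosts (Wikimedia among them) may reject requests that lack a User-Agent
    headers = {"User-Agent": "multimodal-pipelines-notebook/1.0"}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # Fail loudly instead of parsing an error page as an image
    return Image.open(BytesIO(response.content)).convert("RGB")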

# Sample image URLs for testing
CAT_IMAGE = "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"
DOG_IMAGE = "https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/YellowLabradorLooking_new.jpg/1200px-YellowLabradorLooking_new.jpg"
STREET_IMAGE = "https://upload.wikimedia.org/wikipedia/commons/thumb/1/1e/A_modern_city_street.jpg/1280px-A_modern_city_street.jpg"

---
## Part 1: Computer Vision Pipelines

### 1.1 Image Classification

Classify entire images into categories.

In [None]:
# Image classification pipeline
image_classifier = pipeline("image-classification", model="google/vit-base-patch16-224")

# Load a sample image
image = load_image_from_url(CAT_IMAGE)

# Classify
results = image_classifier(image)

print("Image Classification Results:")
for result in results[:5]:
    print(f"  {result['label']:30} ({result['score']:.4f})")

In [None]:
# You can also pass image paths or URLs directly
results = image_classifier(CAT_IMAGE)  # URL
print("Classification from URL:", results[0]['label'])

# Or local file path
# results = image_classifier("./my_image.jpg")

In [None]:
# Batch classification
images = [CAT_IMAGE, DOG_IMAGE]
batch_results = image_classifier(images)

for img_url, result in zip(images, batch_results):
    name = img_url.rsplit("/", 1)[-1]  # File name portion of the URL
    print(f"{name}: {result[0]['label']} ({result[0]['score']:.4f})")

### 1.2 Object Detection

Detect and locate objects within images.

In [None]:
# Object detection pipeline
object_detector = pipeline("object-detection", model="facebook/detr-resnet-50")

# Detect objects
image = load_image_from_url(STREET_IMAGE)
detections = object_detector(image)

print(f"Found {len(detections)} objects:")
for det in detections:
    print(f"  {det['label']:15} (score: {det['score']:.3f})")
    print(f"    Box: {det['box']}")

In [None]:
# Filter by confidence threshold
threshold = 0.9
confident_detections = [d for d in detections if d['score'] > threshold]

print(f"\nHigh-confidence detections (>{threshold}):")
for det in confident_detections:
    print(f"  {det['label']}: {det['score']:.3f}")
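Once detections are filtered, a common next step is to summarize them. The sketch below counts detections per label and records the largest box area for each. It assumes each detection is a dict with `label`, `score`, and a `box` holding `xmin`/`ymin`/`xmax`/`ymax`, as printed above. The helper names (`box_area`, `summarize_detections`) and the sample values are made up for illustration, so the cell runs without a model.

```python
from collections import Counter

# Illustrative detections in the shape printed by the cells above.
# The values are invented; swap in `confident_detections` for real output.
sample_detections = [
    {"label": "car", "score": 0.98, "box": {"xmin": 10, "ymin": 20, "xmax": 110, "ymax": 80}},
    {"label": "car", "score": 0.95, "box": {"xmin": 200, "ymin": 30, "xmax": 260, "ymax": 70}},
    {"label": "person", "score": 0.91, "box": {"xmin": 150, "ymin": 10, "xmax": 170, "ymax": 90}},
]

def box_area(box):
    """Area in pixels of an axis-aligned box with xmin/ymin/xmax/ymax keys."""
    width = max(0, box["xmax"] - box["xmin"])
    height = max(0, box["ymax"] - box["ymin"])
    return width * height

def summarize_detections(detections):
    """Count detections per label and find the largest box area per label."""
    counts = Counter(d["label"] for d in detections)
    largest = {}
    for d in detections:
        area = box_area(d["box"])
        if area > largest.get(d["label"], 0):
            largest[d["label"]] = area
    return counts, largest

counts, largest = summarize_detections(sample_detections)
for label, n in counts.most_common():
    print(f"{label}: {n} detection(s), largest box {largest[label]} px^2")
```

Box area is only a rough proxy for how prominent an object is, but it is often enough to rank detections for a quick review.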

### 1.3 Image Segmentation

Partition an image into labeled regions at the pixel level; each segment comes back with its own mask.

In [None]:
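# Panoptic segmentation (this DETR checkpoint labels both objects and background regions)
segmenter = pipeline("image-segmentation", model="facebook/detr-resnet-50-panoptic")

image = load_image_from_url(STREET_IMAGE)
segments = segmenter(image)

print(f"Found {len(segments)} segments:")
for seg in segments:
    # 'score' can be None for some segmentation models, so format it defensively
    score = seg.get('score')
    score_text = f"{score:.3f}" if score is not None else "N/A"
    print(f"  {seg['label']:20} (score: {score_text})")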

### 1.4 Depth Estimation

Estimate depth from single images.

In [None]:
# Depth estimation pipeline
depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")

image = load_image_from_url(STREET_IMAGE)
result = depth_estimator(image)

print(f"Depth map size (width, height): {result['depth'].size}")
print(f"Predicted depth type: {type(result['predicted_depth'])}")

# The result contains:
# - 'depth': PIL Image with depth visualization
# - 'predicted_depth': Raw depth tensor

---
## Part 2: Audio Pipelines

### 2.1 Automatic Speech Recognition (ASR)

Convert speech to text.

In [None]:
# ASR pipeline
asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")

# You can pass:
# - Audio file path: asr("audio.mp3")
# - Audio URL: asr("https://example.com/audio.mp3")
# - NumPy array with sample rate

# Example with a sample audio URL
AUDIO_SAMPLE = "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac"

result = asr(AUDIO_SAMPLE)
print(f"Transcription: {result['text']}")

In [None]:
# ASR with timestamps
result = asr(AUDIO_SAMPLE, return_timestamps=True)

print("Transcription with timestamps:")
if 'chunks' in result:
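    for chunk in result['chunks']:
        start, end = chunk['timestamp']
        # The final chunk's end time can be None when the audio is cut off mid-segment
        end_text = f"{end:.2f}s" if end is not None else "end"
        print(f"  [{start:.2f}s - {end_text}]: {chunk['text']}")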
else:
    print(result['text'])
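Timestamped chunks map naturally onto subtitle files. The sketch below converts chunks into SRT text, assuming each chunk is a dict with a `timestamp` tuple `(start, end)` in seconds and a `text` string, as in the loop above. The helpers `format_srt_time` and `chunks_to_srt` and the `fallback_duration` parameter are hypothetical names introduced here, and the sample chunks are invented.

```python
def format_srt_time(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def chunks_to_srt(chunks, fallback_duration=2.0):
    """Build SRT text from chunks shaped like {'timestamp': (start, end), 'text': ...}."""
    entries = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:  # Open-ended final chunk: assume a short display time
            end = start + fallback_duration
        entries.append(
            f"{index}\n{format_srt_time(start)} --> {format_srt_time(end)}\n{chunk['text'].strip()}\n"
        )
    return "\n".join(entries)

# Invented chunks; with real output, pass result['chunks'] instead.
sample_chunks = [
    {"timestamp": (0.0, 3.5), "text": " First sentence."},
    {"timestamp": (3.5, None), "text": " Second sentence."},
]
srt_text = chunks_to_srt(sample_chunks)
print(srt_text)
```

The fallback for a missing end time is a design choice for this sketch; a real subtitle workflow might instead use the audio duration.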

In [None]:
# Long-form transcription (for audio > 30 seconds)
long_asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-base",
    chunk_length_s=30,  # Process in 30-second chunks
    stride_length_s=5   # Context kept on each side of a chunk (overlaps its neighbors)
)

# This handles long audio files automatically
# result = long_asr("long_audio.mp3")
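To build intuition for how `chunk_length_s` and `stride_length_s` interact, the sketch below computes the time windows a chunked transcription might cover. This is a simplified model, not the library's implementation: it assumes a single stride value is applied to both sides of each chunk, so consecutive windows start `chunk_length_s - 2 * stride_length_s` seconds apart. The function name `chunk_windows` is hypothetical.

```python
def chunk_windows(duration_s, chunk_length_s=30.0, stride_length_s=5.0):
    """Return (start, end) windows in seconds for a simplified chunking scheme.

    Assumption: the stride is kept on both sides of every chunk, so each new
    window starts chunk_length_s - 2 * stride_length_s after the previous one.
    """
    step = chunk_length_s - 2 * stride_length_s
    if step <= 0:
        raise ValueError("chunk_length_s must be greater than 2 * stride_length_s")
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_length_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:  # Last window reached the end of the audio
            break
        start += step
    return windows

# A 70-second recording with the settings used above
for start, end in chunk_windows(70.0):
    print(f"{start:5.1f}s -> {end:5.1f}s")
```

Under this model, a larger stride gives each chunk more surrounding context at the cost of more windows to process.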

### 2.2 Audio Classification

Classify audio into categories.

In [None]:
# Audio classification pipeline
audio_classifier = pipeline(
    "audio-classification",
    model="MIT/ast-finetuned-audioset-10-10-0.4593"
)

# Classify audio
# result = audio_classifier("audio_sample.wav")

# The model can detect:
# - Speech, music, environmental sounds
# - Specific instruments
# - Animal sounds
# - etc.

print("Audio classification can detect various sound categories.")
print("Pass an audio file path to classify.")
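If no audio file is at hand, a synthetic one is enough to exercise the pipeline end to end. The sketch below writes a one-second 440 Hz sine tone as a 16-bit mono WAV using only the standard library. The helper `write_test_tone`, the 16 kHz sample rate, and the output location are assumptions made for this example; the classifier call is left commented out because it needs the model loaded above.

```python
import math
import os
import struct
import tempfile
import wave

def write_test_tone(path, frequency_hz=440.0, duration_s=1.0, sample_rate=16000):
    """Write a mono 16-bit PCM sine tone to `path` and return the frame count."""
    n_frames = int(duration_s * sample_rate)
    amplitude = 0.5 * 32767  # Half of full scale to avoid clipping
    samples = (
        int(amplitude * math.sin(2 * math.pi * frequency_hz * i / sample_rate))
        for i in range(n_frames)
    )
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes = 16-bit samples
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"".join(struct.pack("<h", s) for s in samples))
    return n_frames

tone_path = os.path.join(tempfile.gettempdir(), "audio_sample.wav")
n_frames = write_test_tone(tone_path)
print(f"Wrote {n_frames} frames to {tone_path}")

# With the pipeline above loaded, the file could then be classified:
# result = audio_classifier(tone_path)
```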

---
## Part 3: Multimodal Pipelines

### 3.1 Image-to-Text (Image Captioning)

Generate text descriptions of images.

In [None]:
# Image captioning pipeline
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

# Generate caption
image = load_image_from_url(CAT_IMAGE)
result = captioner(image)

print(f"Caption: {result[0]['generated_text']}")

In [None]:
# Multiple images
images = [CAT_IMAGE, DOG_IMAGE]
results = captioner(images)

for img, result in zip(['Cat', 'Dog'], results):
    print(f"{img}: {result[0]['generated_text']}")

### 3.2 Visual Question Answering (VQA)

Answer questions about images.

In [None]:
# VQA pipeline
vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

image = load_image_from_url(CAT_IMAGE)

# Ask questions about the image
questions = [
    "What animal is in the image?",
    "What color is the animal?",
    "Is the animal sleeping?"
]

for question in questions:
    result = vqa(image=image, question=question)
    print(f"Q: {question}")
    print(f"A: {result[0]['answer']} (score: {result[0]['score']:.3f})")
    print()
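The loop above always trusts the top answer. When low-confidence answers should be flagged rather than reported, a small guard helps. The sketch below assumes VQA results are a list of dicts with `answer` and `score` keys, as in the cell above; `pick_answer`, `min_score`, and `fallback` are hypothetical names, and the sample candidates are invented.

```python
def pick_answer(candidates, min_score=0.5, fallback="unsure"):
    """Return (answer, score) for the best candidate, or the fallback if confidence is low."""
    if not candidates:
        return fallback, 0.0
    best = max(candidates, key=lambda c: c["score"])
    if best["score"] < min_score:
        return fallback, best["score"]
    return best["answer"], best["score"]

# Invented candidates in the same shape as the VQA results above
confident = [{"answer": "cat", "score": 0.97}, {"answer": "dog", "score": 0.02}]
ambiguous = [{"answer": "yes", "score": 0.41}, {"answer": "no", "score": 0.39}]

print(pick_answer(confident))
print(pick_answer(ambiguous))
```

The 0.5 cutoff is arbitrary; a sensible value depends on the model and on how costly a wrong answer is.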

### 3.3 Document Question Answering

Extract information from document images (invoices, forms, etc.).

In [None]:
# Document QA pipeline
doc_qa = pipeline("document-question-answering", model="impira/layoutlm-document-qa")

# This works with document images (PDFs rendered as images, scanned documents, etc.)
# Words are read from the image via OCR, so pytesseract and the Tesseract binary must be installed
# Particularly useful for:
# - Invoice processing
# - Form extraction
# - Receipt parsing

# Example usage (with a document image):
# result = doc_qa(
#     image="invoice.png",
#     question="What is the total amount?"
# )

print("Document QA pipeline loaded.")
print("Pass a document image and question to extract information.")
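For invoice or form processing, one question is usually asked per field. The sketch below wraps that pattern in a helper that accepts any callable taking `image=` and `question=` and returning a list of answer dicts, best first (an assumption about the output shape). `extract_fields`, `fake_qa`, and all sample answers are hypothetical; the stand-in callable lets the cell run without a model or OCR.

```python
def extract_fields(qa, image, field_questions, min_score=0.3):
    """Ask one question per field; keep answers whose score clears min_score.

    `qa` is any callable that accepts image= and question= keyword arguments
    and returns a list of {'answer': ..., 'score': ...} dicts, best first.
    """
    fields = {}
    for field, question in field_questions.items():
        answers = qa(image=image, question=question)
        top = answers[0] if answers else None
        if top is not None and top["score"] >= min_score:
            fields[field] = top["answer"]
        else:
            fields[field] = None  # Leave low-confidence fields for manual review
    return fields

def fake_qa(image, question):
    """Stand-in for a real pipeline so the sketch runs without a model."""
    canned = {
        "What is the total amount?": [{"answer": "$154.06", "score": 0.92}],
        "What is the invoice number?": [{"answer": "us-001", "score": 0.12}],
    }
    return canned.get(question, [])

invoice_fields = extract_fields(
    fake_qa,
    image="invoice.png",
    field_questions={
        "total": "What is the total amount?",
        "invoice_number": "What is the invoice number?",
        "due_date": "What is the due date?",
    },
)
print(invoice_fields)
```

With the real pipeline loaded, `doc_qa` could be passed in place of `fake_qa`, provided its output matches the assumed shape.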

### 3.4 Zero-Shot Image Classification

Classify images into custom categories without training.

In [None]:
# Zero-shot image classification (CLIP)
zero_shot_image = pipeline(
    "zero-shot-image-classification",
    model="openai/clip-vit-base-patch32"
)

image = load_image_from_url(CAT_IMAGE)

# Define custom categories
candidate_labels = ["a photo of a cat", "a photo of a dog", "a photo of a bird", "a photo of a fish"]

result = zero_shot_image(image, candidate_labels=candidate_labels)

print("Zero-shot classification:")
for item in result:
    print(f"  {item['label']:25} ({item['score']:.4f})")
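The candidate labels above are full prompts ("a photo of a cat"), which is convenient for CLIP but awkward in downstream code that wants plain class names. The sketch below builds prompts from class names and maps results back. It assumes results are a list of dicts with `label` and `score`, as printed above; `build_prompts` and `to_class_scores` are hypothetical helpers and the sample scores are invented.

```python
def build_prompts(class_names, template="a photo of a {}"):
    """Map each prompt string back to the plain class name it was built from."""
    return {template.format(name): name for name in class_names}

def to_class_scores(result, prompt_to_class):
    """Replace prompt labels in a zero-shot result with plain class names."""
    return [
        {"label": prompt_to_class.get(item["label"], item["label"]), "score": item["score"]}
        for item in result
    ]

prompt_to_class = build_prompts(["cat", "dog", "bird", "fish"])
candidate_prompts = list(prompt_to_class)  # Pass these as candidate_labels

# Invented scores in the shape returned by the zero-shot cell above
sample_result = [
    {"label": "a photo of a cat", "score": 0.99},
    {"label": "a photo of a dog", "score": 0.01},
]
for item in to_class_scores(sample_result, prompt_to_class):
    print(f"{item['label']:6} {item['score']:.2f}")
```

Keeping the template in one place also makes it easy to compare different prompt wordings for the same set of classes.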

---
## 🎯 Multimodal Pipeline Reference

| Pipeline | Input | Output | Model Examples |
|----------|-------|--------|----------------|
| `image-classification` | Image | Labels + scores | google/vit-base-patch16-224 |
| `object-detection` | Image | Boxes + labels | facebook/detr-resnet-50 |
| `image-segmentation` | Image | Masks + labels | facebook/detr-resnet-50-panoptic |
| `depth-estimation` | Image | Depth map | Intel/dpt-large |
| `automatic-speech-recognition` | Audio | Text | openai/whisper-base |
| `audio-classification` | Audio | Labels + scores | MIT/ast-finetuned-audioset-10-10-0.4593 |
| `image-to-text` | Image | Caption | Salesforce/blip-image-captioning-base |
| `visual-question-answering` | Image + Question | Answer | dandelin/vilt-b32-finetuned-vqa |
| `document-question-answering` | Doc image + Question | Answer | impira/layoutlm-document-qa |
| `zero-shot-image-classification` | Image + Labels | Scores | openai/clip-vit-base-patch32 |

## Next Steps

Continue to [04_advanced_pipelines.ipynb](04_advanced_pipelines.ipynb) for custom pipelines and optimization!