# Multimodal LLMs - Easy Tasks

Basic concepts with CLIP and BLIP-2: loading models, creating embeddings, and running simple image tasks.

**Topics:**
- CLIP text/image embeddings
- Computing similarity scores
- Basic image captioning
- Simple visual Q&A

## Setup

Run all cells in this section.

### [Optional] - Installing Packages on Google Colab

If you are viewing this notebook on Google Colab, uncomment and run the following code to install dependencies.

**Note**: Use a GPU for this notebook. In Google Colab, go to Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4.

In [None]:
# %%capture
# !pip install matplotlib transformers datasets accelerate sentence-transformers pillow

### Import Libraries

In [None]:
from urllib.request import urlopen
from PIL import Image
import numpy as np
import torch
import matplotlib.pyplot as plt
from transformers import CLIPTokenizerFast, CLIPProcessor, CLIPModel
from transformers import AutoProcessor, Blip2ForConditionalGeneration

print("Imports ready")

### Load Sample Images

We'll use a few AI-generated images for testing.

In [None]:
# Image URLs
puppy = "https://raw.githubusercontent.com/HandsOnLLM/Hands-On-Large-Language-Models/main/chapter09/images/puppy.png"
cat = "https://raw.githubusercontent.com/HandsOnLLM/Hands-On-Large-Language-Models/main/chapter09/images/cat.png"
car = "https://raw.githubusercontent.com/HandsOnLLM/Hands-On-Large-Language-Models/main/chapter09/images/car.png"

print("Image URLs loaded")

## Easy Tasks

Basic operations with multimodal models.

### Task 1: CLIP Text Embeddings

Create embeddings for text using CLIP. Text embeddings capture semantic meaning.

**Goal**: Embed a caption and inspect the result.

In [None]:
# Load CLIP model
model_id = "openai/clip-vit-base-patch32"
print("Loading CLIP...")

clip_tok = CLIPTokenizerFast.from_pretrained(model_id)
clip_proc = CLIPProcessor.from_pretrained(model_id)
clip_model = CLIPModel.from_pretrained(model_id)

print("Loaded")

In [None]:
# Tokenize text
caption = "a puppy playing in the snow"

inputs = clip_tok(caption, return_tensors="pt")
print(f"Caption: {caption}")

In [None]:
# Check tokens
tokens = clip_tok.convert_ids_to_tokens(inputs["input_ids"][0])
print(f"Tokens: {tokens}")

In [None]:
# Create text embedding
txt_emb = clip_model.get_text_features(**inputs)

print(f"Embedding shape: {txt_emb.shape}")
print(f"First 5 values: {txt_emb[0][:5].tolist()}")

**Questions:**

1. Try a different caption - how does the embedding change?
2. What happens with very long text?
3. Try text in another language - does it work?

### Task 2: CLIP Image Embeddings

Same idea, but for images. CLIP's vision encoder splits each image into patches and embeds the result into the same vector space as the text.

**Goal**: Load an image, embed it, check the shape.
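To make the patch step concrete, here is a minimal arithmetic sketch. It assumes a 224x224 input and a 32x32 patch size, which is what the checkpoint name `clip-vit-base-patch32` suggests; `vit_patch_grid` is a hypothetical helper written for this illustration, not part of any library.

```python
def vit_patch_grid(image_size: int, patch_size: int) -> tuple[int, int]:
    """Return (patches_per_side, total_patches) for a square ViT input."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side, per_side * per_side


# Values assumed from the checkpoint name "clip-vit-base-patch32":
# 224x224 input, 32x32 patches.
per_side, num_patches = vit_patch_grid(image_size=224, patch_size=32)
seq_len = num_patches + 1  # one extra class token is prepended to the patches

print(f"{per_side}x{per_side} grid -> {num_patches} patches, {seq_len} tokens")
```

Under these assumptions the vision encoder sees a short sequence of patch tokens rather than raw pixels, which is why every image is resized to a fixed size first.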

In [None]:
# Load image
img = Image.open(urlopen(puppy)).convert("RGB")

print("Loaded image")
print(f"Size: {img.size}")

In [None]:
# Show image
plt.imshow(img)
plt.axis('off')
plt.title("Puppy image")
plt.show()

In [None]:
# Preprocess
proc_img = clip_proc(
    text=None, 
    images=img, 
    return_tensors='pt'
)['pixel_values']

print(f"Processed shape: {proc_img.shape}")

In [None]:
# Compare the original size with the processed tensor
print(f"Original (W, H): {img.size}")
print(f"After processing (H, W): {tuple(proc_img.shape[-2:])}")

In [None]:
# Create image embedding
img_emb = clip_model.get_image_features(proc_img)

print(f"Image embedding shape: {img_emb.shape}")
print(f"Same as text? {img_emb.shape == txt_emb.shape}")

**Questions:**

1. Load the cat or car image - what's different?
2. Why does CLIP resize to 224x224?
3. What happens if you load a very small image?

### Task 3: Text-Image Similarity

Compare the text and image embeddings to see how well a caption matches an image.

**Goal**: Calculate similarity score between text and image.
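As a small worked example of why this task normalizes before taking a dot product, the sketch below uses made-up 3-dimensional vectors (not real CLIP embeddings) and plain Python. The helper names are hypothetical. It shows that the dot product of L2-normalized vectors equals cosine similarity computed from the definition.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine similarity computed directly from the definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Made-up 3-dimensional "embeddings" for illustration only.
text_vec = [3.0, 4.0, 0.0]
image_vec = [4.0, 3.0, 0.0]

via_definition = cosine_similarity(text_vec, image_vec)
via_normalized_dot = dot(l2_normalize(text_vec), l2_normalize(image_vec))

print(f"{via_definition:.4f} {via_normalized_dot:.4f}")
```

Both routes give the same number, so once the embeddings are normalized a single matrix product is enough to score many text-image pairs at once.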

In [None]:
# L2-normalize embeddings so that their dot product equals cosine similarity
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)

print("Normalized embeddings")

In [None]:
# Calculate similarity
txt_np = txt_emb.detach().cpu().numpy()
img_np = img_emb.detach().cpu().numpy()

sim = txt_np @ img_np.T

print(f"Similarity: {sim[0][0]:.4f}")

In [None]:
# Try wrong caption
wrong = "a car driving at sunset"
wrong_inp = clip_tok(wrong, return_tensors="pt")

wrong_emb = clip_model.get_text_features(**wrong_inp)
wrong_emb = wrong_emb / wrong_emb.norm(dim=-1, keepdim=True)

wrong_np = wrong_emb.detach().cpu().numpy()
sim_wrong = wrong_np @ img_np.T

print(f"\nCorrect caption: {sim[0][0]:.4f}")
print(f"Wrong caption: {sim_wrong[0][0]:.4f}")

**Questions:**

1. Try other caption variations - which scores highest?
2. What's a "good" similarity score?
3. Can you find a caption that scores even lower?

### Task 4: Zero-Shot Image Classification

Use CLIP to classify images without any task-specific training: compare the image embedding to the embeddings of text descriptions, one per class.

**Goal**: Given an image, find which class description matches best.
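Raw cosine similarities for different classes often sit close together, so it can help to turn them into a probability-like distribution. The sketch below is one possible way to do that in plain Python. The similarity values are hypothetical (not real model output), and the scale of 100 is an assumption standing in for CLIP's learned logit scale.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical cosine similarities for four class prompts (not real model output).
labels = ["dog", "cat", "car", "bird"]
similarities = [0.21, 0.27, 0.15, 0.19]

# Assumed temperature; CLIP multiplies similarities by a learned logit scale.
logit_scale = 100.0
probs = softmax([logit_scale * s for s in similarities])

best = max(range(len(labels)), key=lambda i: probs[i])
for label, p in zip(labels, probs):
    print(f"{label}: {p:.4f}")
print(f"Prediction: {labels[best]}")
```

The ranking is the same as with raw similarities; the scaling only sharpens the gap between the top class and the rest, and the result depends on which classes are in the list.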

In [None]:
# Load cat image
cat_img = Image.open(urlopen(cat)).convert("RGB")

plt.imshow(cat_img)
plt.axis('off')
plt.title("What is this?")
plt.show()

In [None]:
# Possible classes
classes = [
    "a photo of a dog",
    "a photo of a cat", 
    "a photo of a car",
    "a photo of a bird"
]

print("Classes:")
for i, c in enumerate(classes, 1):
    print(f"{i}. {c}")

In [None]:
# Embed image
cat_proc = clip_proc(images=cat_img, return_tensors='pt')['pixel_values']
cat_emb = clip_model.get_image_features(cat_proc)
cat_emb = cat_emb / cat_emb.norm(dim=-1, keepdim=True)

print("Image embedded")

In [None]:
# Embed all classes
class_embs = []

for cls in classes:
    inp = clip_tok(cls, return_tensors="pt")
    emb = clip_model.get_text_features(**inp)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    class_embs.append(emb)

print(f"Embedded {len(class_embs)} classes")

In [None]:
# Calculate similarities
cat_np = cat_emb.detach().cpu().numpy()
scores = []

for emb in class_embs:
    emb_np = emb.detach().cpu().numpy()
    sc = cat_np @ emb_np.T
    scores.append(sc[0][0])

print("\nScores:")
for cls, sc in zip(classes, scores):
    print(f"{cls}: {sc:.4f}")

In [None]:
# Find best match
best_idx = np.argmax(scores)
best_class = classes[best_idx]

print(f"\nPrediction: {best_class}")
print(f"Similarity: {scores[best_idx]:.4f}")

**Questions:**

1. Try the car image - does it classify correctly?
2. Add more classes - does accuracy drop?
3. What if you use very specific class names?

### Task 5: Basic Image Captioning with BLIP-2

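Generate text descriptions of images. BLIP-2 bridges vision and language by connecting a frozen image encoder to a frozen language model (here OPT-2.7B) through a lightweight Querying Transformer (Q-Former).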

**Goal**: Load an image, generate a caption.
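The captioning cells in this task pass `max_new_tokens` to `generate`. As a toy illustration of what a token budget does, the sketch below runs a greedy loop over a hypothetical lookup table that stands in for a language model. It is not how BLIP-2 is implemented; `greedy_generate` and `toy_model` are made up for this example.

```python
def greedy_generate(next_token, start, eos, max_new_tokens):
    """Append the predicted next token until EOS or the token budget runs out."""
    tokens = [start]
    for _ in range(max_new_tokens):
        nxt = next_token[tokens[-1]]
        if nxt == eos:
            break
        tokens.append(nxt)
    return tokens[1:]  # drop the start marker


# Hypothetical lookup table standing in for a language model's top prediction.
toy_model = {
    "<s>": "a", "a": "red", "red": "car", "car": "on",
    "on": "the", "the": "road", "road": "</s>",
}

short = greedy_generate(toy_model, "<s>", "</s>", max_new_tokens=3)
full = greedy_generate(toy_model, "<s>", "</s>", max_new_tokens=20)

print(" ".join(short))
print(" ".join(full))
```

A small budget cuts the output off mid-sentence, while a larger budget only matters until the end-of-sequence token appears.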

In [None]:
# Load BLIP-2
print("Loading BLIP-2 (this takes a minute)...")

blip_proc = AutoProcessor.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    revision="51572668da0eb669e01a189dc22abe6088589a24"
)

blip_model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    revision="51572668da0eb669e01a189dc22abe6088589a24",
    torch_dtype=torch.float16
)

dev = "cuda" if torch.cuda.is_available() else "cpu"
# float16 weights are intended for GPU; on CPU, inference may be slow or unsupported
blip_model.to(dev)

print(f"Loaded on {dev}")

In [None]:
# Load car image
car_img = Image.open(urlopen(car)).convert("RGB")

plt.imshow(car_img)
plt.axis('off')
plt.show()

In [None]:
# Preprocess
inp = blip_proc(car_img, return_tensors="pt").to(dev, torch.float16)

print("Image preprocessed")
print(f"Shape: {inp['pixel_values'].shape}")

In [None]:
# Generate caption
gen_ids = blip_model.generate(**inp, max_new_tokens=20)

caption = blip_proc.batch_decode(gen_ids, skip_special_tokens=True)
caption = caption[0].strip()

print(f"Caption: {caption}")

In [None]:
# Try with puppy
pup_img = Image.open(urlopen(puppy)).convert("RGB")

inp = blip_proc(pup_img, return_tensors="pt").to(dev, torch.float16)
gen_ids = blip_model.generate(**inp, max_new_tokens=20)
caption = blip_proc.batch_decode(gen_ids, skip_special_tokens=True)[0].strip()

print(f"Puppy caption: {caption}")

**Questions:**

1. Try your own images - how accurate are captions?
2. What happens with abstract or artistic images?
3. Increase max_new_tokens to 50 - do captions get better?

### Task 6: Simple Visual Q&A

Ask questions about images. The model processes both the image and the question, then generates the answer as text.

**Goal**: Give BLIP-2 an image and a question, get an answer.
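The questions in this task follow a `Question: ... Answer:` template. A small helper can keep that format consistent across cells. The sketch below is a hypothetical helper, not part of the BLIP-2 API. The way it prepends earlier question-answer pairs is an assumption modeled on the same template, and the example answer is made up.

```python
def build_vqa_prompt(question, history=()):
    """Format a question, plus optional earlier (question, answer) pairs."""
    parts = [f"Question: {q} Answer: {a}." for q, a in history]
    parts.append(f"Question: {question} Answer:")
    return " ".join(parts)


single = build_vqa_prompt("What color is the car?")
follow_up = build_vqa_prompt(
    "Is it parked or moving?",
    history=[("What color is the car?", "orange")],  # hypothetical answer
)

print(single)
print(follow_up)
```

The returned string can be passed as the `text` argument in the cells of this task; ending the prompt with `Answer:` leaves the model to complete the answer.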

In [None]:
# Load car image again
car_img = Image.open(urlopen(car)).convert("RGB")

plt.imshow(car_img)
plt.axis('off')
plt.title("Ask me about this image")
plt.show()

In [None]:
# Ask a question
q = "Question: What color is the car? Answer:"

inp = blip_proc(car_img, text=q, return_tensors="pt")
inp = inp.to(dev, torch.float16)

print(f"Question: {q}")

In [None]:
# Generate answer
gen_ids = blip_model.generate(**inp, max_new_tokens=20)
ans = blip_proc.batch_decode(gen_ids, skip_special_tokens=True)[0].strip()

print(f"Answer: {ans}")

In [None]:
# Try another question
q2 = "Question: Is this indoors or outdoors? Answer:"

inp = blip_proc(car_img, text=q2, return_tensors="pt")
inp = inp.to(dev, torch.float16)

gen_ids = blip_model.generate(**inp, max_new_tokens=20)
ans2 = blip_proc.batch_decode(gen_ids, skip_special_tokens=True)[0].strip()

print(f"Q: {q2}")
print(f"A: {ans2}")

**Questions:**

1. Try yes/no questions - does it answer correctly?
2. Ask about details not in the image - what happens?
3. What types of questions work best?
