
# 🚀 Open in Google Colab

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kamranjaved/Workshop-on-AI-and-Hand-on-training/blob/main/S6%3A%20CLIP%20inference%20tutorial%20/CLIP_MSCOCO_Benchmark.ipynb)



# 🧠 CLIP Benchmark on MSCOCO 5K Test Split

This notebook evaluates text-to-image retrieval with OpenAI's CLIP (ViT-L/14) model on the 5K-image COCO Karpathy test split.

**Steps:**
1. Load the CLIP model and COCO dataset.
2. Encode all 5,000 test images (their files come from COCO val2014).
3. Perform text-to-image retrieval for roughly 25,000 captions (about five per image).
4. Compute Recall@1, Recall@5, and Recall@10 metrics.


## 1. Setup and Imports

In [None]:

import os
import torch
import clip
from PIL import Image
from tqdm import tqdm
from datasets import load_dataset


## 2. Define Paths and Load CLIP Model

In [None]:

# Local folder holding the COCO val2014 JPEGs; change this to match your machine
image_folder = "/home/coop2025/Documents/COOP/Sara/coco_val2014v2"

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)
model.eval()

print(f"Using device: {device}")


## 3. Load COCO Karpathy Dataset

In [None]:

dataset = load_dataset("yerevann/coco-karpathy", split="test")
print(f"Loaded dataset with {len(dataset)} examples.")
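
The retrieval step issues one query per caption, so the query total equals the combined length of the `sentences` lists. Below is a minimal sketch of that count. It uses two made-up records that carry only the `cocoid` and `sentences` fields this notebook reads; the IDs and captions are invented for illustration, and `count_queries` is a hypothetical helper, not part of any library.

```python
# Hypothetical records that mimic the two fields this notebook reads.
records = [
    {"cocoid": 101, "sentences": ["a dog on a beach", "a dog runs on sand"]},
    {"cocoid": 202, "sentences": ["a red bus", "a bus on a street", "a city bus"]},
]


def count_queries(examples):
    """Return (number of images, number of caption queries)."""
    n_images = 0
    n_captions = 0
    for example in examples:
        n_images += 1
        n_captions += len(example["sentences"])
    return n_images, n_captions


n_images, n_captions = count_queries(records)
print(f"{n_images} images, {n_captions} captions")
```

Running `count_queries(dataset)` on the loaded split should give the exact number of queries the benchmark will issue.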


## 4. Encode All Images

In [None]:

image_features = []
image_ids = []
caption_to_cocoid = {}

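# Map each caption to its COCO image ID. Identical captions on different images
# overwrite each other, and the retrieval loop below does not use this map.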
for example in dataset:
    cocoid = example["cocoid"]
    for caption in example["sentences"]:
        caption_to_cocoid[caption] = cocoid

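print("Encoding test images...")
for example in tqdm(dataset):
    cocoid = example["cocoid"]
    # COCO 2014 file names zero-pad the image ID to 12 digits
    image_path = os.path.join(image_folder, f"COCO_val2014_{cocoid:012d}.jpg")

    if not os.path.exists(image_path):
        # A skipped image can never be retrieved, so its captions count as misses
        print(f"Missing image file: {image_path}")
        continue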

    try:
        image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0).to(device)
        with torch.no_grad():
            feat = model.encode_image(image)
            feat /= feat.norm(dim=-1, keepdim=True)
            image_features.append(feat.cpu())
            image_ids.append(cocoid)
    except Exception as e:
        print(f"Error with image {cocoid}: {e}")

image_features = torch.cat(image_features)
image_features = image_features.to(device)
print(f"✅ Encoded {len(image_features)} images.")
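
The loop above runs one forward pass per image, which is simple but slow. A common alternative is to group inputs into fixed-size batches. Below is a minimal sketch of the chunking step only. `iter_batches` and the toy file names are hypothetical and exist purely for illustration.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Toy stand-in for a list of image paths.
paths = [f"img_{i}.jpg" for i in range(7)]
for batch in iter_batches(paths, batch_size=3):
    print(batch)
```

In this notebook, each batch of preprocessed image tensors could be combined with `torch.stack` and passed to `model.encode_image` in a single call. The right batch size depends on available GPU memory.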


## 5. Perform Caption-Based Image Retrieval

In [None]:

r1 = r5 = r10 = total = 0

print("Running caption-based image retrieval (~25K queries)...")
for example in tqdm(dataset):
    cocoid = example["cocoid"]
    for caption in example["sentences"]:
        prompt = f"A photo of {caption}"
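        # truncate=True keeps a caption longer than CLIP's 77-token context from raising an error
        text_token = clip.tokenize([prompt], truncate=True).to(device)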

        with torch.no_grad():
            text_feat = model.encode_text(text_token)
            text_feat /= text_feat.norm(dim=-1, keepdim=True)

        sims = (text_feat @ image_features.T).squeeze(0)
        topk = sims.topk(10)
        top_image_ids = [image_ids[i] for i in topk.indices.tolist()]

        if cocoid == top_image_ids[0]:
            r1 += 1
        if cocoid in top_image_ids[:5]:
            r5 += 1
        if cocoid in top_image_ids[:10]:
            r10 += 1
        total += 1
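
Because both the text and image features are L2-normalized, the matrix product in the loop above yields cosine similarities, and `topk` picks the highest-scoring images. Below is a toy, pure-Python sketch of the same ranking on 2-D vectors. The vectors and the helper names `l2_normalize` and `rank_images` are made up for illustration, and the sketch assumes no vector is all zeros.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def rank_images(text_vec, image_vecs, k):
    """Return indices of the k images most similar to the text vector."""
    query = l2_normalize(text_vec)
    # Dot product of unit vectors equals cosine similarity.
    sims = [
        sum(a * b for a, b in zip(query, l2_normalize(vec)))
        for vec in image_vecs
    ]
    order = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    return order[:k]


image_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_vec = [2.0, 0.1]
print(rank_images(text_vec, image_vecs, k=2))  # prints [0, 2]
```

The query points almost exactly along the first image vector, so index 0 ranks first and the diagonal vector at index 2 ranks second.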


## 6. Report Final Results

In [None]:

print("\n===== 🧾 Final CLIP Benchmark Results on MSCOCO 5K Test Split =====")
print(f"Total queries (about 5 captions × 5K images): {total}")
print(f"Recall@1  = {r1/total:.2%}  ({r1}/{total})")
print(f"Recall@5  = {r5/total:.2%}  ({r5}/{total})")
print(f"Recall@10 = {r10/total:.2%}  ({r10}/{total})")



## ✅ Summary
This notebook benchmarks CLIP (ViT-L/14) on text-to-image retrieval over the MSCOCO 5K Karpathy test split.

**Results Interpretation:**
- **Recall@1**: percentage of captions for which the correct image was ranked first.
- **Recall@5 / @10**: percentage of captions for which the correct image appeared within the top 5 or top 10 results.

Higher recall values indicate stronger image-text alignment.
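
The metric definitions above can be sketched as a small function. This is a minimal illustration on made-up rankings, not the code the notebook ran. `recall_at_k` is a hypothetical helper, and the IDs are invented.

```python
def recall_at_k(ranked_ids_per_query, true_ids, k):
    """Fraction of queries whose true ID appears in the first k ranked IDs."""
    if len(ranked_ids_per_query) != len(true_ids):
        raise ValueError("need one ranked list per query")
    hits = sum(
        1
        for ranked, true_id in zip(ranked_ids_per_query, true_ids)
        if true_id in ranked[:k]
    )
    return hits / len(true_ids)


# Four toy queries that all target image 7, with different rankings.
ranked = [[7, 3, 9], [3, 7, 9], [9, 3, 7], [3, 9, 7]]
truth = [7, 7, 7, 7]
for k in (1, 2, 3):
    print(f"Recall@{k} = {recall_at_k(ranked, truth, k):.2%}")
```

On this toy data the values are 25%, 50%, and 100%. Recall@K can never decrease as K grows, so Recall@1 ≤ Recall@5 ≤ Recall@10 always holds.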
