
# 🚀 Open in Google Colab

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kamranjaved/Workshop-on-AI-and-Hand-on-training/blob/main/S6%3A%20CLIP%20inference%20tutorial%20/CLIP_MSCOCO_Benchmark.ipynb)



# 🧠 CLIP Benchmark on MSCOCO 5K Test Split

This notebook evaluates OpenAI's CLIP (ViT-L/14) model on the COCO Karpathy test split.

**Steps:**
1. Load the CLIP model and COCO dataset.
2. Encode all 5,000 validation images.
3. Perform text-to-image retrieval for 25,000 captions.
4. Compute Recall@1, Recall@5, and Recall@10 metrics.


## 1. Setup and Imports

In [2]:
!pip install git+https://github.com/openai/CLIP.git



Collecting git+https://github.com/openai/CLIP.git
  Cloning https://github.com/openai/CLIP.git to /tmp/pip-req-build-e9m9dch0
  Running command git clone --filter=blob:none --quiet https://github.com/openai/CLIP.git /tmp/pip-req-build-e9m9dch0
  Resolved https://github.com/openai/CLIP.git to commit dcba3cb2e2827b402d2701e7e1c7d9fed8a20ef1
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting ftfy (from clip==1.0)
  Downloading ftfy-6.3.1-py3-none-any.whl.metadata (7.3 kB)
Downloading ftfy-6.3.1-py3-none-any.whl (44 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m44.8/44.8 kB[0m [31m3.2 MB/s[0m eta [36m0:00:00[0m
[?25hBuilding wheels for collected packages: clip
  Building wheel for clip (setup.py) ... [?25l[?25hdone
  Created wheel for clip: filename=clip-1.0-py3-none-any.whl size=1369490 sha256=896aec5bd8f11a59bb0f9561def471fc65bad209414ea5775d4f26f1e012e9f5
  Stored in directory: /tmp/pip-ephem-wheel-cache-8dcf3q_r/wheels/35/3e/df/3d24cbfb3b6a06f17

In [3]:
!pip install torch



In [4]:

import os
import torch
import clip
from PIL import Image
from tqdm import tqdm
from datasets import load_dataset


## 2. Define Paths and Load CLIP Model

In [9]:

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)
model.eval()

print(f"Using device: {device}")


Using device: cuda


## 3. Load COCO Karpathy Dataset

In [21]:

#dataset = load_dataset("yerevann/coco-karpathy", split="test")
full_dataset = load_dataset("yerevann/coco-karpathy", split="test")

# Create a new dataset containing only the first 1000 examples
dataset = full_dataset.select(range(250))

print(f"Loaded dataset with {len(dataset)} examples.")


Loaded dataset with 250 examples.


## 4. Encode All Images

In [22]:
image_folder = "/content/coco_200"
image_features = []
image_ids = []
caption_to_cocoid = {}

# Map each caption to its COCO image ID
for example in dataset:
    cocoid = example["cocoid"]
    for caption in example["sentences"]:
        caption_to_cocoid[caption] = cocoid

print("Encoding all 5K images...")
for example in tqdm(dataset):
    cocoid = example["cocoid"]
    image_path = os.path.join(image_folder, f"COCO_val2014_000000{cocoid:06d}.jpg")

    if not os.path.exists(image_path):
        continue

    try:
        image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0).to(device)
        with torch.no_grad():
            feat = model.encode_image(image)
            feat /= feat.norm(dim=-1, keepdim=True)
            image_features.append(feat.cpu())
            image_ids.append(cocoid)
    except Exception as e:
        print(f"Error with image {cocoid}: {e}")

image_features = torch.cat(image_features)
image_features = image_features.to(device)
print(f"✅ Encoded {len(image_features)} images.")


Encoding all 5K images...


100%|██████████| 250/250 [00:00<00:00, 2946.85it/s]

✅ Encoded 2 images.





## 5. Perform Caption-Based Image Retrieval

In [23]:
print(sims.shape)

torch.Size([1])


In [24]:

r1 = 0
r5 = 0
r10 = 0
total = 0

print("Running caption-based image retrieval (25K queries)...")
for example in tqdm(dataset):
    cocoid = example["cocoid"]
    for caption in example["sentences"]:
        prompt = f"A photo of {caption}"
        text_token = clip.tokenize([prompt]).to(device)

        with torch.no_grad():
            text_feat = model.encode_text(text_token)
            text_feat /= text_feat.norm(dim=-1, keepdim=True)

        sims = (text_feat @ image_features.T).squeeze(0)
        k = min(10, sims.size(-1))

        # Now run topk with the new k
        topk = sims.topk(k)
        #topk = sims.topk(10)
        top_image_ids = [image_ids[i] for i in topk.indices.tolist()]

        if cocoid == top_image_ids[0]:
            r1 += 1
        if cocoid in top_image_ids[:5]:
            r5 += 1
        if cocoid in top_image_ids[:10]:
            r10 += 1
        total += 1


Running caption-based image retrieval (25K queries)...


100%|██████████| 250/250 [00:11<00:00, 22.66it/s]


## 6. Report Final Results

In [19]:

print("\n===== 🧾 Final CLIP Benchmark Results on MSCOCO 5K Test Split =====")
print(f"Total Queries (5 captions × 5K images): {total}")
print(f"Recall@1  = {r1/total:.2%}  ({r1}/{total})")
print(f"Recall@5  = {r5/total:.2%}  ({r5}/{total})")
print(f"Recall@10 = {r10/total:.2%}  ({r10}/{total})")



===== 🧾 Final CLIP Benchmark Results on MSCOCO 5K Test Split =====
Total Queries (5 captions × 5K images): 1001
Recall@1  = 0.50%  (5/1001)
Recall@5  = 0.50%  (5/1001)
Recall@10 = 0.50%  (5/1001)



## ✅ Summary
This notebook benchmarks the CLIP model on the MSCOCO test split.

**Results Interpretation:**
- **Recall@1**: % of times the correct image was ranked 1st for a caption.
- **Recall@5 / @10**: % of times the correct image appeared within the top 5 or 10 results.

Higher recall values indicate stronger image-text alignment.
