# CLIP performance testing

This notebook is based on the [Interacting with CLIP notebook](https://github.com/openai/CLIP/tree/main/notebooks) shared by OpenAI in CLIP's GitHub repository. We use the same test settings as described in the [paper](https://arxiv.org/abs/2103.00020) for zero-shot and linear-probe classification. On CIFAR-10 we observe a zero-shot accuracy of `89.59%` against the `89.83%` reported by [OpenCLIP](https://github.com/mlfoundations/open_clip/blob/main/docs/openclip_results.csv#L84), a third-party library. For the linear probe we observe `95.02%`, which is close to the `95.1%` reported in the [paper](https://arxiv.org/abs/2103.00020).

We further examine how performance varies with the class names and with the surrounding prefix text passed to the model. We also attempted to improve performance by averaging the cosine similarity across two versions of each image (original and augmented), but observed no improvement.

# Preparation for Colab

Make sure you're running a GPU runtime; if not, select "GPU" as the hardware accelerator under Runtime > Change Runtime Type in the menu. The next cells install the `clip` package and its dependencies, and print the installed PyTorch version so you can confirm it is 1.7.1 or later.

In [1]:
! pip install ftfy regex tqdm
! pip install git+https://github.com/openai/CLIP.git

Collecting ftfy
  Downloading ftfy-6.3.1-py3-none-any.whl.metadata (7.3 kB)
Downloading ftfy-6.3.1-py3-none-any.whl (44 kB)
Installing collected packages: ftfy
Successfully installed ftfy-6.3.1
Collecting git+https://github.com/openai/CLIP.git
  Cloning https://github.com/openai/CLIP.git to /tmp/pip-req-build-y6ngyv5y
  Running command git clone --filter=blob:none --quiet https://github.com/openai/CLIP.git /tmp/pip-req-build-y6ngyv5y
  Resolved https://github.com/openai/CLIP.git to commit dcba3cb2e2827b402d2701e7e1c7d9fed8a20ef1
  Preparing metadata (setup.py) ... done
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch->clip==1.0)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch->clip==1.0)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-

In [2]:
import numpy as np
import torch

# Print the installed PyTorch version (CLIP needs 1.7.1 or later)
print("Torch version:", torch.__version__)

Torch version: 2.6.0+cu124
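
The cell above only prints the version. As a minimal sketch, the comparison against 1.7.1 could be automated as follows. The helpers `parse_version` and `meets_minimum` are hypothetical names introduced here for illustration, not part of `torch` or `clip`, and the parsing assumes a plain `major.minor.patch` string with an optional `+local` suffix.

```python
import re


def parse_version(version: str) -> tuple:
    """Turn '2.6.0+cu124' into (2, 6, 0), ignoring the local build suffix."""
    public = version.split("+", 1)[0]
    parts = []
    for component in public.split("."):
        # Keep only the leading integer of each dotted component
        match = re.match(r"\d+", component)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "1.7.1") -> bool:
    """Return True when `installed` is at least `minimum`."""
    return parse_version(installed) >= parse_version(minimum)


print(meets_minimum("2.6.0+cu124"))  # version string printed by the cell above
print(meets_minimum("1.6.0"))        # an older, hypothetical installation
```

In the notebook itself the string would come from `torch.__version__`.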



# Loading the model

`clip.available_models()` will list the names of available CLIP models.

In [3]:
import clip

clip.available_models()

['RN50',
 'RN101',
 'RN50x4',
 'RN50x16',
 'RN50x64',
 'ViT-B/32',
 'ViT-B/16',
 'ViT-L/14',
 'ViT-L/14@336px']

In [4]:
model, preprocess = clip.load("ViT-B/32")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

input_resolution = model.visual.input_resolution
context_length = model.context_length
vocab_size = model.vocab_size

print("Model parameters:", f"{np.sum([int(np.prod(p.shape)) for p in model.parameters()]):,}")
print("Input resolution:", input_resolution)
print("Context length:", context_length)
print("Vocab size:", vocab_size)

100%|████████████████████████████████████████| 338M/338M [00:01<00:00, 200MiB/s]


Model parameters: 151,277,313
Input resolution: 224
Context length: 77
Vocab size: 49408


# Preprocessing

## Image
The input images are resized and center-cropped to conform with the image resolution that the model expects. Furthermore, the pixel intensity is normalised using the dataset mean and standard deviation.

The second return value from `clip.load()` contains a torchvision `Transform` as shown below that performs this preprocessing.

## Text
The CLIP model uses a case-insensitive text tokenizer, which can be invoked using `clip.tokenize()`. By default, the outputs are padded to 77 tokens, which is the length the CLIP models expect.

In [5]:
preprocess

Compose(
    Resize(size=224, interpolation=bicubic, max_size=None, antialias=True)
    CenterCrop(size=(224, 224))
    <function _convert_image_to_rgb at 0x7b07430cdbc0>
    ToTensor()
    Normalize(mean=(0.48145466, 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711))
)
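
To illustrate the last two steps of this pipeline, here is a rough NumPy sketch of the `ToTensor()` scaling followed by the `Normalize` step, using the mean and standard deviation printed above. It skips the resize and center crop, and the helper name `normalize_image` and the grey test image are assumptions made for illustration.

```python
import numpy as np

# Per-channel statistics copied from the Normalize step printed above
CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)


def normalize_image(pixels_uint8: np.ndarray) -> np.ndarray:
    """Map an (H, W, 3) uint8 RGB image to a normalised (3, H, W) float array."""
    scaled = pixels_uint8.astype(np.float32) / 255.0  # same scaling as ToTensor()
    normalised = (scaled - CLIP_MEAN) / CLIP_STD      # broadcasts over the channel axis
    return normalised.transpose(2, 0, 1)              # channels-first layout


# Hypothetical input: a uniform mid-grey image already at the model resolution
grey = np.full((224, 224, 3), 128, dtype=np.uint8)
out = normalize_image(grey)
print(out.shape)
```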

In [6]:
clip.tokenize("Hello World!")

tensor([[49406,  3306,  1002,   256, 49407,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0]], dtype=torch.int32)
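
The padding visible in this output can be sketched in plain Python. The start and end ids (49406 and 49407) and the three ids in between are read off the tensor above; `pad_tokens` is a hypothetical helper written for illustration and is not the implementation used by `clip.tokenize()`.

```python
SOT_TOKEN = 49406    # start-of-text id, first value in the output above
EOT_TOKEN = 49407    # end-of-text id, last non-zero value in the output above
CONTEXT_LENGTH = 77  # context length reported by the model


def pad_tokens(token_ids, context_length=CONTEXT_LENGTH):
    """Wrap token ids in start/end markers and zero-pad to the context length."""
    wrapped = [SOT_TOKEN, *token_ids, EOT_TOKEN]
    if len(wrapped) > context_length:
        raise ValueError("input is too long for the context length")
    return wrapped + [0] * (context_length - len(wrapped))


# The three ids produced for "Hello World!" in the output above
padded = pad_tokens([3306, 1002, 256])
print(len(padded), padded[:6])
```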

# Quantitative performance

## Zero-Shot image classification

The CIFAR-10 images are classified using the cosine similarity (times 100) as the logits to the softmax operation.
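
As a small sketch of this step, the NumPy snippet below turns cosine similarities into class probabilities with a softmax over logits scaled by 100. The random features and their dimension are stand-ins chosen for illustration, not CLIP outputs, and `zero_shot_probabilities` is a hypothetical helper name.

```python
import numpy as np


def zero_shot_probabilities(image_features, text_features, scale=100.0):
    """Softmax over scaled cosine similarities, one row of probabilities per image."""
    image_features = image_features / np.linalg.norm(image_features, axis=-1, keepdims=True)
    text_features = text_features / np.linalg.norm(text_features, axis=-1, keepdims=True)
    logits = scale * image_features @ text_features.T   # shape (num_images, num_classes)
    logits = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
image_features = rng.normal(size=(4, 512))   # 4 stand-in image embeddings
text_features = rng.normal(size=(10, 512))   # 10 stand-in class prompt embeddings
probs = zero_shot_probabilities(image_features, text_features)
print(probs.shape, probs.sum(axis=-1))
```

Because the softmax is monotonic, the most probable class is the one with the highest cosine similarity, so ranking the raw similarities yields the same top-1 and top-5 predictions.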

OpenAI CLIP's performance on CIFAR-10 as reported by the [OpenCLIP library](https://github.com/mlfoundations/open_clip/blob/main/docs/openclip_results.csv#L84) is `89.83%`. When we measure it ourselves, the best result is `89.59%`, which is close. We also examine how varying i) the class names and ii) the text prefix improves or degrades performance.

In [7]:
import torch
import numpy as np
from torchvision import datasets, transforms
from tqdm import tqdm

batch_size = 32
cifar10 = datasets.CIFAR10(root='~/.cache', train=False, download=True, transform=preprocess)
dataloader = torch.utils.data.DataLoader(cifar10, batch_size=batch_size, shuffle=False)

def get_model_accuracy(model, dataloader, text_tokens, show=True):
    """Compute zero-shot top-1 and top-5 accuracy of `model` over `dataloader`."""
    top1_accuracies = []
    top5_accuracies = []

    with torch.no_grad():
        # The prompts are the same for every batch, so encode and normalise them once
        text_features = model.encode_text(text_tokens).float()
        text_features /= text_features.norm(dim=-1, keepdim=True)
        text_features = text_features.cpu().numpy()

        for images, targets in tqdm(dataloader, desc='Processing batch'):
            images = images.to(device)
            image_features = model.encode_image(images).float()
            image_features /= image_features.norm(dim=-1, keepdim=True)

            # Cosine similarity with shape (num_classes, batch_size)
            similarity = text_features @ image_features.cpu().numpy().T
            # Per image, class indices sorted by ascending similarity (best match last)
            sorted_preds = torch.from_numpy(similarity.T.argsort())

            top1_accuracies.extend((sorted_preds[:, -1] == targets).tolist())
            top5_accuracies.extend((sorted_preds[:, -5:] == targets[:, None]).any(dim=1).tolist())

    top1, top5 = np.mean(top1_accuracies), np.mean(top5_accuracies)
    if show:
        print(f"\nAcc@1: {top1 * 100:.2f}%")
        print(f"Acc@5: {top5 * 100:.2f}%")

    return top1, top5

100%|██████████| 170M/170M [00:03<00:00, 46.1MB/s]
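
The top-1 and top-5 bookkeeping in `get_model_accuracy` can be checked on a toy example. The sketch below uses a hand-made similarity matrix with three classes and `k=2` in place of 5; `topk_accuracy` is a hypothetical helper and the numbers are invented for illustration.

```python
import numpy as np


def topk_accuracy(similarity, targets, k):
    """Fraction of rows whose target is among the k highest-scoring classes."""
    # argsort is ascending, so the last k columns hold the k best classes
    topk = np.argsort(similarity, axis=1)[:, -k:]
    hits = (topk == targets[:, None]).any(axis=1)
    return float(hits.mean())


# Rows are images, columns are classes
similarity = np.array([
    [0.9, 0.1, 0.0],  # best class 0, runner-up 1
    [0.2, 0.3, 0.5],  # best class 2, runner-up 1
    [0.4, 0.6, 0.0],  # best class 1, runner-up 0
    [0.1, 0.2, 0.7],  # best class 2, runner-up 1
])
targets = np.array([0, 1, 0, 0])

print(topk_accuracy(similarity, targets, k=1))  # 0.25: only the first image is right
print(topk_accuracy(similarity, targets, k=2))  # 0.75: the last image still misses
```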


In [None]:
# NOTE: A higher accuracy was achieved by varying the text description as shown in subsequent code
text_descriptions = [f"This is a photo of a {label}" for label in cifar10.classes]
text_tokens = clip.tokenize(text_descriptions).to(device)
top1, top5 = get_model_accuracy(model, dataloader, text_tokens)

Processing batch: 100%|██████████| 313/313 [00:28<00:00, 11.11it/s]


Acc@1: 88.96%
Acc@5: 99.37%





## Evaluating impact of class name change on performance
Each CIFAR-10 class has several similar or synonymous terms. We test performance when these class-name variations are passed to the model. To simplify the experiment, no text is prefixed to the class names; the effect of varying the prefix text is evaluated in the next section.
1. Airplane: aircraft, plane, jet, airliner, aeroplane
2. Automobile: car, vehicle, auto, sedan, motorcar
3. Bird: avian, fowl, songbird, feathered friend, winged creature
4. Cat: feline, kitty, kitten, tomcat, house cat
5. Deer: stag, doe, buck, fawn, antelope
6. Dog: canine, puppy, pooch, hound, mutt
7. Frog: amphibian, toad, tree frog, bullfrog, croaker
8. Horse: equine, steed, pony, stallion, mare
9. Ship: vessel, boat, yacht, schooner, liner
10. Truck: lorry, pickup, hauler, freight truck, delivery truck

Using `variant 1` instead of the original class names gives a `0.39` percentage point improvement in Acc@1 (`87.77%` against `87.38%`). The remaining variants reduce accuracy, down to `71.59%` for `variant 5`.

In [9]:
class_variants = [
    ["airplane", "automobile", "bird", "cat", "deer", "dog", "frog", "horse", "ship", "truck"],                                 # original cifar 10
    ["aircraft", "car", "avian", "feline", "stag", "canine", "amphibian", "equine", "vessel", "lorry"],                         # variant 1
    ["plane", "vehicle", "fowl", "kitty", "doe", "puppy", "toad", "steed", "boat", "pickup"],                                   # variant 2
    ["jet", "auto", "songbird", "kitten", "buck", "pooch", "tree frog", "pony", "yacht", "hauler"],                             # variant 3
    ["airliner", "sedan", "feathered friend", "tomcat", "fawn", "hound", "bullfrog", "stallion", "schooner", "freight truck"],  # variant 4
    ["aeroplane", "motorcar", "winged creature", "house cat", "antelope", "mutt", "croaker", "mare", "liner", "delivery truck"] # variant 5
]

for class_variant in class_variants:
    text_tokens = clip.tokenize(class_variant).to(device)
    top1, top5 = get_model_accuracy(model, dataloader, text_tokens)

Processing batch: 100%|██████████| 313/313 [00:27<00:00, 11.48it/s]



Acc@1: 87.38%
Acc@5: 99.15%


Processing batch: 100%|██████████| 313/313 [00:27<00:00, 11.57it/s]



Acc@1: 87.77%
Acc@5: 99.10%


Processing batch: 100%|██████████| 313/313 [00:26<00:00, 11.81it/s]



Acc@1: 79.43%
Acc@5: 98.33%


Processing batch: 100%|██████████| 313/313 [00:26<00:00, 11.99it/s]



Acc@1: 80.64%
Acc@5: 98.00%


Processing batch: 100%|██████████| 313/313 [00:26<00:00, 11.86it/s]



Acc@1: 74.19%
Acc@5: 97.05%


Processing batch: 100%|██████████| 313/313 [00:26<00:00, 11.81it/s]


Acc@1: 71.59%
Acc@5: 96.98%
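
To make the comparison explicit, the short sketch below computes the change in Acc@1 relative to the original class names. The accuracies are copied from the outputs above, in the order of `class_variants`; the dictionary keys are labels chosen here for readability.

```python
# Acc@1 values (%) copied from the six runs above, in the order of `class_variants`
acc_at_1 = {
    "original": 87.38,
    "variant 1": 87.77,
    "variant 2": 79.43,
    "variant 3": 80.64,
    "variant 4": 74.19,
    "variant 5": 71.59,
}

baseline = acc_at_1["original"]
# Difference to the original class names, in percentage points
deltas = {name: round(acc - baseline, 2) for name, acc in acc_at_1.items()}
best = max(acc_at_1, key=acc_at_1.get)

for name, delta in deltas.items():
    print(f"{name:>9}: {delta:+.2f}")
print("best:", best)
```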





## Evaluating impact of text description change on performance

We observe that `variant 3` achieves the best performance at `89.59%`.

In [12]:
# Variations use different phrasings for the text prefixed to the class name
text_descriptions_v0 = [f"This is a photo of a {label}" for label in cifar10.classes]           # Tested earlier
text_descriptions_v1 = [f"A picture of a {label}" for label in cifar10.classes]
text_descriptions_v2 = [f"This is a clear photo of a {label}" for label in cifar10.classes]
text_descriptions_v3 = [f"An image showing a {label}" for label in cifar10.classes]             # Highest accuracy
text_descriptions_v4 = [f"A snapshot of a {label}" for label in cifar10.classes]
text_descriptions_v5 = [f"This is a photo of a {label} in its natural habitat" for label in cifar10.classes]

# Tokenize the text descriptions
text_tokens_v0 = clip.tokenize(text_descriptions_v0).to(device)
text_tokens_v1 = clip.tokenize(text_descriptions_v1).to(device)
text_tokens_v2 = clip.tokenize(text_descriptions_v2).to(device)
text_tokens_v3 = clip.tokenize(text_descriptions_v3).to(device)
text_tokens_v4 = clip.tokenize(text_descriptions_v4).to(device)
text_tokens_v5 = clip.tokenize(text_descriptions_v5).to(device)

# Evaluate each variation with get_model_accuracy defined earlier
top1_v0, top5_v0 = get_model_accuracy(model, dataloader, text_tokens_v0)
top1_v1, top5_v1 = get_model_accuracy(model, dataloader, text_tokens_v1)
top1_v2, top5_v2 = get_model_accuracy(model, dataloader, text_tokens_v2)
top1_v3, top5_v3 = get_model_accuracy(model, dataloader, text_tokens_v3)
top1_v4, top5_v4 = get_model_accuracy(model, dataloader, text_tokens_v4)
top1_v5, top5_v5 = get_model_accuracy(model, dataloader, text_tokens_v5)

Processing batch: 100%|██████████| 313/313 [00:28<00:00, 11.14it/s]



Acc@1: 88.96%
Acc@5: 99.37%


Processing batch: 100%|██████████| 313/313 [00:29<00:00, 10.60it/s]



Acc@1: 88.61%
Acc@5: 99.37%


Processing batch: 100%|██████████| 313/313 [00:27<00:00, 11.31it/s]



Acc@1: 87.88%
Acc@5: 99.09%


Processing batch: 100%|██████████| 313/313 [00:26<00:00, 11.71it/s]



Acc@1: 89.59%
Acc@5: 99.57%


Processing batch: 100%|██████████| 313/313 [00:31<00:00,  9.85it/s]



Acc@1: 88.06%
Acc@5: 99.30%


Processing batch: 100%|██████████| 313/313 [00:33<00:00,  9.24it/s]


Acc@1: 87.39%
Acc@5: 98.74%
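
The six prompt lists above follow one pattern, so they could also be generated from a table of templates. The sketch below is one possible arrangement, assuming the class names match `cifar10.classes`; `TEMPLATES` and `build_prompts` are hypothetical names introduced for illustration.

```python
# CIFAR-10 class names, hard-coded so the sketch runs without downloading the dataset
CIFAR10_CLASSES = ["airplane", "automobile", "bird", "cat", "deer",
                   "dog", "frog", "horse", "ship", "truck"]

# Prompt templates copied from the cell above
TEMPLATES = {
    "v0": "This is a photo of a {label}",
    "v1": "A picture of a {label}",
    "v2": "This is a clear photo of a {label}",
    "v3": "An image showing a {label}",
    "v4": "A snapshot of a {label}",
    "v5": "This is a photo of a {label} in its natural habitat",
}


def build_prompts(template, labels):
    """Fill one template with every class name."""
    return [template.format(label=label) for label in labels]


prompts = {name: build_prompts(template, CIFAR10_CLASSES)
           for name, template in TEMPLATES.items()}
print(prompts["v3"][3])
```

Each list in `prompts` could then be tokenized with `clip.tokenize` and passed to `get_model_accuracy`, as in the cell above.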





In [14]:
# We also measured performance when combining class name variant 1 with text description
# variant 3, as each achieved the highest accuracy in its own experiment. However, 89.59% remained the best
text_descriptions = [f"An image showing a {label}" for label in class_variants[1]]
text_tokens = clip.tokenize(text_descriptions).to(device)
top1, top5 = get_model_accuracy(model, dataloader, text_tokens)

Processing batch: 100%|██████████| 313/313 [00:26<00:00, 11.66it/s]


Acc@1: 89.31%
Acc@5: 99.54%





### Experiments with augmentation
We further attempted to improve performance by passing two versions of the same image to the model. The first is the original image and the second is an augmented version with horizontal flipping and random rotation applied. We then compute the similarity for each version and average the two. The expected gain in accuracy did not materialise: performance dropped slightly from 89.59% to 89.44%. Because the rotation angle is sampled randomly, the exact figure can vary between runs.

In [29]:
import torchvision.transforms as transforms

new_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=1),
    transforms.RandomRotation(15, interpolation=transforms.InterpolationMode.BICUBIC),
    preprocess
])

batch_size = 32
new_dataset = datasets.CIFAR10(root='~/.cache', train=False, download=True, transform=new_transform)
new_dataloader = torch.utils.data.DataLoader(new_dataset, batch_size=batch_size, shuffle=False)
text_descriptions = [f"An image showing a {label}" for label in cifar10.classes]
text_tokens = clip.tokenize(text_descriptions).to(device)

def get_model_accuracy_2(model, dataloader1, dataloader2, text_tokens, show=True):
    """Zero-shot accuracy using the mean similarity over two views of each image."""
    top1_accuracies = []
    top5_accuracies = []

    with torch.no_grad():
        # The prompts are the same for every batch, so encode and normalise them once
        text_features = model.encode_text(text_tokens).float()
        text_features /= text_features.norm(dim=-1, keepdim=True)

        # Both loaders use shuffle=False, so zip() pairs each image with its augmented copy
        for (images1, targets), (images2, _) in tqdm(zip(dataloader1, dataloader2), desc='Processing batch', total=len(dataloader1)):
            image_features1 = model.encode_image(images1.to(device)).float()
            image_features2 = model.encode_image(images2.to(device)).float()
            image_features1 /= image_features1.norm(dim=-1, keepdim=True)
            image_features2 /= image_features2.norm(dim=-1, keepdim=True)

            # Cosine similarity per view, each with shape (num_classes, batch_size)
            similarity1 = text_features @ image_features1.T
            similarity2 = text_features @ image_features2.T
            similarity = (similarity1 + similarity2) / 2
            # Per image, class indices sorted by ascending similarity (best match last)
            sorted_preds = similarity.T.argsort().cpu()

            top1_accuracies.extend((sorted_preds[:, -1] == targets).tolist())
            top5_accuracies.extend((sorted_preds[:, -5:] == targets[:, None]).any(dim=1).tolist())

    top1, top5 = np.mean(top1_accuracies), np.mean(top5_accuracies)
    if show:
        print(f"\nAcc@1: {top1 * 100:.2f}%")
        print(f"Acc@5: {top5 * 100:.2f}%")

    return top1, top5

top1, top5 = get_model_accuracy_2(model, dataloader, new_dataloader, text_tokens)

Processing batch: 100%|██████████| 313/313 [01:01<00:00,  5.13it/s]


Acc@1: 89.44%
Acc@5: 99.66%
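
The averaging step can be illustrated on a toy case with two images and two classes. The similarity values below are invented for illustration, and `predict_with_two_views` is a hypothetical helper; the point is only that averaging can change a prediction in either direction.

```python
import numpy as np


def predict_with_two_views(sim_original, sim_augmented):
    """Average two (num_images, num_classes) similarity matrices and pick the best class."""
    mean_similarity = (sim_original + sim_augmented) / 2
    return mean_similarity.argmax(axis=1)


# Invented cosine similarities: rows are images, columns are classes
sim_original = np.array([[0.30, 0.28],
                         [0.20, 0.25]])
sim_augmented = np.array([[0.22, 0.32],
                          [0.21, 0.27]])

print(sim_original.argmax(axis=1))                          # [0 1] from the original view
print(predict_with_two_views(sim_original, sim_augmented))  # [1 1] after averaging
```

Here the augmented view overturns the prediction for the first image. Whether such a flip helps depends on the true label, which is consistent with averaging not improving accuracy in the run above.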





# Linear probing results on CIFAR-10

The result of `95.02%` is close to the `95.1%` reported in the paper, and uses the same hyperparameter values as described there. In the paper, some hyperparameters such as the L2 regularisation strength λ were chosen by a sweep over a logarithmic range, with a parametric binary search to reduce its cost. We did not reproduce that optimised search, as suitable performance was achieved without it; the cell below instead selects the regularisation strength by 5-fold cross-validation with `LogisticRegressionCV`.

In [None]:
import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torchvision.datasets as datasets
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score

# Load CIFAR-10 dataset
batch_size = 32
train_dataset = datasets.CIFAR10(root='~/.cache', train=True, download=True, transform=preprocess)
test_dataset = datasets.CIFAR10(root='~/.cache', train=False, download=True, transform=preprocess)
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
test_loader = torch.utils.data.DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False)

# Freeze the parameters of the pretrained model
for param in model.parameters():
    param.requires_grad = False

# Utility method for extracting features using the pretrained model
def extract_features(loader, desc='Train'):
    features = []
    labels = []
    with torch.no_grad():
        for images, targets in tqdm(loader, desc=desc):
            image_features = model.encode_image(images.to(device)).float()
            image_features /= image_features.norm(dim=-1, keepdim=True)
            features.append(image_features.cpu()) # NOTE: Only image features considered in linear probe
            labels.append(targets)
    return torch.cat(features), torch.cat(labels)

train_features, train_labels = extract_features(train_loader, desc='Train')
test_features, test_labels = extract_features(test_loader, desc='Test')

# Flatten the features to use with LogisticRegression
train_features = train_features.view(train_features.size(0), -1).numpy()
test_features = test_features.view(test_features.size(0), -1).numpy()
train_labels = train_labels.numpy()
test_labels = test_labels.numpy()

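# The CLIP paper sweeps the L2 regularization strength (λ) over this log-spaced range.
# NOTE: scikit-learn's `Cs` are inverse regularization strengths; the range is symmetric in log space
lambda_range = np.logspace(-6, 6, 96)

# Perform logistic regression with cross-validation to find the best regularization strength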
clf = LogisticRegressionCV(
    Cs=lambda_range,
    cv=5,
    max_iter=1000,
    solver='lbfgs',
    scoring='accuracy',
    n_jobs=-1
)

clf.fit(train_features, train_labels)

# Evaluate the model
predictions = clf.predict(test_features)
accuracy = accuracy_score(test_labels, predictions)
print(f'Accuracy: {accuracy * 100:.2f}%')

Train: 100%|██████████| 3125/3125 [02:19<00:00, 22.39it/s]
Test: 100%|██████████| 625/625 [00:28<00:00, 22.08it/s]


Accuracy: 95.02%
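
One detail of the sweep is worth spelling out: scikit-learn's `Cs` parameter holds inverse regularization strengths, whereas the range above is described in terms of λ. Assuming the usual convention `C = 1 / λ` (up to how the loss is scaled), the sketch below shows that the grid is symmetric in log space, so inverting it covers the same set of values.

```python
import numpy as np

# Same grid as in the cell above: 96 log-spaced values between 1e-6 and 1e6
lambda_range = np.logspace(-6, 6, 96)

# Assumed convention: C is the inverse of the regularization strength
c_values = np.sort(1.0 / lambda_range)

print(lambda_range.min(), lambda_range.max())
print(np.allclose(c_values, lambda_range))
```

Passing `lambda_range` directly as `Cs` therefore searches the same span of regularization strengths, only in reverse order.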
