<a href="https://colab.research.google.com/github/nyp-sit/iti107-2024s2/blob/main/session-7/clip-zero-shot-image-classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Zero-shot Image Classification using CLIP

CLIP is a multi-modal embedding model trained to learn a joint embedding space for image-text pairs. As a result, CLIP can be used to measure how similar an image is to a piece of text, and vice versa.

In this notebook, we will see how we can apply CLIP to do zero-shot image classification.

In [None]:
%%capture
!pip install datasets transformers

## Dataset

We will use the `frgfm/imagenette` dataset via Hugging Face Datasets. It is a small subset of ImageNet containing 10 easily classified classes.

In [None]:
from datasets import load_dataset

imagenette = load_dataset(
    'frgfm/imagenette',
    '320px',
    split='validation'
)
# show dataset info
imagenette

In [None]:
set(imagenette['label'])

The dataset contains 10 labels, all stored as integer values. To perform classification with CLIP, we need the text content of these labels. Most Hugging Face datasets include the mapping to text labels inside the dataset info:

In [None]:
labels = imagenette.info.features['label'].names
labels

We format the one-word class names into sentences because we expect CLIP to have seen more sentence-like text than single words during pretraining. For ImageNet, a 1.3 percentage point improvement in accuracy was reported when using the prompt template "a photo of a {label}" [1].

Prompt templates don’t always improve performance and they should be tested for each dataset.
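One way to set up such a comparison is to keep several candidate templates side by side and generate one prompt list per template. The sketch below is only an illustration: the template names, the extra templates, and the stand-in class names are assumptions, and only "a photo of a {label}" comes from the text above.

```python
# Hypothetical candidate templates; only "photo" is taken from the notebook text.
TEMPLATES = {
    "plain": "{label}",
    "photo": "a photo of a {label}",
    "photo_type": "a photo of a {label}, a type of object",
}

def build_prompts(labels, template):
    """Fill one prompt template with every class name."""
    return [template.format(label=label) for label in labels]

# Stand-in class names for illustration (use the `labels` list in practice).
example_labels = ["tench", "cassette player", "church"]

# One list of prompts per template, ready to be embedded and scored separately.
prompt_sets = {name: build_prompts(example_labels, t) for name, t in TEMPLATES.items()}
for name, prompts in prompt_sets.items():
    print(name, prompts)
```

Each prompt list can then be run through the same classification steps, and the resulting accuracies compared to pick a template for the dataset at hand.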

In [None]:
# generate sentences
clip_labels = [f"a photo of a {label}" for label in labels]
clip_labels

Before we can compare labels and photos, we need to initialize CLIP. We will use the CLIP implementation found via Hugging Face transformers.

`CLIPProcessor` wraps a CLIP image processor and a CLIP tokenizer into a single object. The image processor handles image preprocessing, such as resizing, cropping, rescaling, and normalizing, while the tokenizer splits the text into tokens using byte pair encoding.

In [None]:
# initialization
from transformers import CLIPProcessor, CLIPModel

model_id = "openai/clip-vit-base-patch32"

processor = CLIPProcessor.from_pretrained(model_id)
model = CLIPModel.from_pretrained(model_id)

In [None]:
import torch

# if you have CUDA set it to the active device like this
device = "cuda" if torch.cuda.is_available() else "cpu"
# move the model to the device
model.to(device)

device

Text transformers cannot read text directly. Instead, they need a set of integer values known as token IDs (or input_ids), where each unique integer represents a word or sub-word (known as a token).

We create these token IDs alongside another tensor called the attention mask (used by the transformer’s attention mechanism) using the processor we just initialized.
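To see what padding and the attention mask look like, here is a toy sketch with a tiny word-level vocabulary. It is not CLIP's tokenizer: the vocabulary, the padding ID, and the helper names are all made up for illustration, and the real tokenizer works on sub-word units.

```python
# Toy word-level vocabulary (hypothetical; CLIP's real tokenizer uses byte pair encoding).
PAD_ID = 0
vocab = {"<start>": 1, "<end>": 2, "a": 3, "photo": 4, "of": 5,
         "church": 6, "golf": 7, "ball": 8}

def encode(sentence):
    """Map a sentence to token IDs, wrapped in start and end markers."""
    return [vocab["<start>"]] + [vocab[w] for w in sentence.split()] + [vocab["<end>"]]

def pad_batch(sequences):
    """Pad every sequence to the longest one and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [PAD_ID] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that attention should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask

sentences = ["a photo of a church", "a photo of a golf ball"]
input_ids, attention_mask = pad_batch([encode(s) for s in sentences])
print(input_ids)
print(attention_mask)
```

The shorter sentence receives one padding ID at the end, and its attention mask has a 0 in that position so the padded slot does not influence the result.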

In [None]:
clip_labels

In [None]:
# create label tokens
label_tokens = processor(
    text=clip_labels,
    padding=True,
    images=None,
    return_tensors='pt'
).to(device)

print(label_tokens['input_ids'][0])
print(processor.tokenizer.decode(label_tokens['input_ids'][0]))

In [None]:
label_tokens['attention_mask'][0]

Using these transformer-readable tensors, we create the label text embeddings like so:

In [None]:
# encode tokens to sentence embeddings
label_emb = model.get_text_features(**label_tokens)
# detach from pytorch gradient computation
label_emb = label_emb.detach().cpu().numpy()
label_emb.shape

The vectors that CLIP outputs are not normalized, meaning dot product similarity will give inaccurate results unless the vectors are normalized beforehand. We do that like so:

In [None]:
import numpy as np

# normalize each label embedding (each row) to unit length
label_emb = label_emb / np.linalg.norm(label_emb, axis=1, keepdims=True)
label_emb.min(), label_emb.max()
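As a sanity check on what normalization should achieve, the sketch below uses random stand-in vectors (not CLIP output; the shape of 10 by 512 is an assumption for illustration) and verifies that every embedding has unit length afterwards.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the label embeddings: 10 random vectors with 512 dimensions each.
toy_emb = rng.normal(size=(10, 512))

# Divide each row by its own L2 norm; keepdims gives shape (10, 1) for broadcasting.
row_norms = np.linalg.norm(toy_emb, axis=1, keepdims=True)
toy_unit = toy_emb / row_norms

# Every row should now have length 1, so dot products equal cosine similarities.
unit_lengths = np.linalg.norm(toy_unit, axis=1)
print(np.allclose(unit_lengths, 1.0))
```

With unit-length vectors, the dot product between two embeddings depends only on the angle between them, not on their magnitudes.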

All we have left is to work through the same process with the images from our dataset. We will test this with a single image first.

In [None]:
imagenette[0]['image']

In [None]:
image = processor(
    text=None,
    images=imagenette[0]['image'],
    return_tensors='pt'
)['pixel_values'].to(device)
image.shape

After processing, we get a tensor holding a single image (1) with three color channels (3), a height of 224 pixels, and a width of 224 pixels. Incoming images must be resized and normalized to fit the input size requirements of the ViT model.
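The kind of work the image processor does can be sketched in NumPy. This is a simplified illustration, not the processor's actual code: it leaves out the resizing step and only center-crops, the mean and standard deviation values are assumptions that should be checked against the processor's configuration, and the input is a random array rather than a real photo.

```python
import numpy as np

# Per-channel mean and standard deviation; assumed values, check the processor config.
MEAN = np.array([0.48145466, 0.4578275, 0.40821073])
STD = np.array([0.26862954, 0.26130258, 0.27577711])

def center_crop(img, size):
    """Cut a size x size square from the middle of an (H, W, 3) array."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

def to_model_input(img_uint8, size=224):
    """Crop, rescale to [0, 1], normalize per channel, and move channels first."""
    cropped = center_crop(img_uint8, size)
    scaled = cropped.astype(np.float32) / 255.0
    normalized = (scaled - MEAN) / STD
    # (H, W, 3) -> (3, H, W), then add a batch dimension in front
    return normalized.transpose(2, 0, 1)[np.newaxis, ...]

# Random stand-in image of 320 x 426 pixels (not a real photo).
rng = np.random.default_rng(0)
fake_image = rng.integers(0, 256, size=(320, 426, 3), dtype=np.uint8)
pixel_values = to_model_input(fake_image)
print(pixel_values.shape)
```

The result has the same layout as the tensor above: batch, channels, height, width.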

We can create the image embedding with:

In [None]:
img_emb = model.get_image_features(image)
img_emb.shape

In [None]:
img_emb = img_emb.detach().cpu().numpy()

From here, all we need to do is calculate the dot product similarity between our image embedding and the ten label text embeddings. The highest score is our predicted class.
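To make the scoring step concrete, here is a toy sketch on made-up three-dimensional vectors. The helper names and the softmax temperature are assumptions for illustration. The softmax step is optional and only turns the scores into values that sum to one; it does not change which label wins.

```python
import numpy as np

def cosine_scores(image_vecs, label_vecs):
    """Cosine similarity between every image vector and every label vector."""
    image_unit = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    label_unit = label_vecs / np.linalg.norm(label_vecs, axis=1, keepdims=True)
    return image_unit @ label_unit.T

def softmax(x, temperature=1.0):
    """Turn each row of scores into probabilities; a lower temperature sharpens them."""
    z = x / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Toy vectors (made up): one image and three labels.
toy_image = np.array([[2.0, 0.1, 0.0]])
toy_labels = np.array([[0.0, 1.0, 0.0],
                       [1.0, 0.0, 0.0],
                       [0.0, 0.0, 1.0]])

toy_scores = cosine_scores(toy_image, toy_labels)   # shape (1, 3)
toy_pred = int(np.argmax(toy_scores, axis=1)[0])    # index of the best-matching label
toy_probs = softmax(toy_scores, temperature=0.01)   # hypothetical temperature
print(toy_scores, toy_pred, toy_probs)
```

The image vector points almost entirely along the first axis, so the second label, which lies on that axis, gets the highest score.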

In [None]:
scores = np.dot(img_emb, label_emb.T)
scores.shape

In [None]:
# get index of highest score
pred = np.argmax(scores)
pred

In [None]:
# find text label with highest score
labels[pred]

Label 2, i.e., “cassette player”, is our winner, and the prediction is correct. We can repeat this logic over the entire `frgfm/imagenette` validation split to get the classification accuracy of CLIP.

In [None]:
from tqdm.auto import tqdm

preds = []
batch_size = 32

for i in tqdm(range(0, len(imagenette), batch_size)):
    i_end = min(i + batch_size, len(imagenette))
    images = processor(
        text=None,
        images=imagenette[i:i_end]['image'],
        return_tensors='pt'
    )['pixel_values'].to(device)
    img_emb = model.get_image_features(images)
    img_emb = img_emb.detach().cpu().numpy()
    scores = np.dot(img_emb, label_emb.T)
    preds.extend(np.argmax(scores, axis=1))

In [None]:
# compare predictions with ground-truth labels (1 = correct, 0 = wrong)
true_preds = [
    int(label == pred)
    for label, pred in zip(imagenette['label'], preds)
]

# fraction of correct predictions
sum(true_preds) / len(true_preds)

In our run, that gives an impressive zero-shot accuracy of 98.7%. CLIP predicts image classes accurately with little more than minor reformatting of the text labels into sentences.
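A single accuracy number hides which classes cause the errors. A per-class breakdown can be sketched as follows; the function name and the toy labels are made up for illustration, and in practice the inputs would be the dataset labels and the `preds` list.

```python
import numpy as np

def per_class_accuracy(y_true, y_pred, num_classes):
    """Overall accuracy plus the fraction of correct predictions for each class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    correct = y_true == y_pred
    overall = correct.mean()
    by_class = {}
    for c in range(num_classes):
        mask = y_true == c
        if mask.any():  # skip classes with no examples
            by_class[c] = correct[mask].mean()
    return overall, by_class

# Made-up labels and predictions for illustration only.
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 0, 1, 2, 2, 2]
overall, by_class = per_class_accuracy(toy_true, toy_pred, num_classes=3)
print(overall, by_class)
```

In this toy example, class 1 is the only one with an error: one of its two images is predicted as class 2.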

Zero-shot classification with CLIP delivers high-performance image classification with minimal effort and no fine-tuning.

Before CLIP, this level of zero-shot performance was out of reach. With CLIP, it is almost too easy. Multi-modality and contrastive pretraining have enabled a technological leap forward.

From multi-modal search, zero-shot image classification, and object detection to industry-changing tools like Stable Diffusion and OpenAI’s DALL-E, CLIP has enabled many new use cases that were previously blocked by insufficient data or compute.