# How to use CLIP Zero-Shot on your own classification dataset

This notebook provides an example of how to benchmark CLIP's zero-shot classification performance on your own classification dataset.

[CLIP](https://openai.com/blog/clip/) is a zero-shot image classifier released by OpenAI that was trained on 400 million text/image pairs collected from the web. CLIP uses what it learned from those pairs to make predictions over a flexible set of candidate classification categories.

CLIP is zero-shot, which means **no training is required**.
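
To make the "no training" point concrete, here is a minimal sketch of the scoring step that zero-shot classification relies on: one image embedding is compared against one text embedding per candidate caption, and a softmax over the scaled cosine similarities gives class probabilities. The toy vectors, the `zero_shot_probs` helper, and the scale of 100 are illustrative assumptions in plain Python, not CLIP's actual encoders or outputs.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_probs(image_vec, text_vecs, scale=100.0):
    """Softmax over scaled cosine similarities (hypothetical helper)."""
    logits = [scale * cosine(image_vec, t) for t in text_vecs]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 3-d embeddings standing in for encoder outputs (made-up numbers).
image_vec = [0.9, 0.1, 0.2]
text_vecs = [
    [1.0, 0.0, 0.1],  # caption for class 0
    [0.0, 1.0, 0.0],  # caption for class 1
    [0.1, 0.2, 1.0],  # caption for class 2
]

probs = zero_shot_probs(image_vec, text_vecs)
predicted = max(range(len(probs)), key=probs.__getitem__)
print("predicted class index:", predicted)
```

Because the only inputs are embeddings and captions, swapping in a new set of categories means changing the caption list, not retraining anything.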

Try it out on your own task here!

Be sure to experiment with various text prompts to unlock the richness of CLIP's pretraining procedure.
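
One simple way to experiment with prompts is to generate a caption per class from a template and compare templates against each other. The sketch below assumes a hypothetical `build_prompts` helper and two example templates; which wording works best on a given dataset is something to measure, not something this snippet decides.

```python
def build_prompts(class_names, template="a photo of a {}"):
    """Fill one caption per class, preserving the class order."""
    return [template.format(name) for name in class_names]

class_names = ["paper", "rock", "scissors"]

# Two hypothetical templates to compare; neither is known to be the best.
generic = build_prompts(class_names)
task_specific = build_prompts(
    class_names, template="The {} sign in rock paper scissors"
)

for name, a, b in zip(class_names, generic, task_specific):
    print(f"{name}: {a!r} | {b!r}")
```

Keeping the captions in the same order as the class names matters, because the predicted index is later mapped back to a class by position.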


![Roboflow Wordmark](https://i.imgur.com/dcLNMhV.png)


# Download and Install CLIP Dependencies

In [None]:
#installing some dependencies, CLIP was released in PyTorch
import subprocess

CUDA_version = [s for s in subprocess.check_output(["nvcc", "--version"]).decode("UTF-8").split(", ") if s.startswith("release")][0].split(" ")[-1]
print("CUDA version:", CUDA_version)

if CUDA_version == "10.0":
    torch_version_suffix = "+cu100"
elif CUDA_version == "10.1":
    torch_version_suffix = "+cu101"
elif CUDA_version == "10.2":
    torch_version_suffix = ""
else:
    torch_version_suffix = "+cu110"

!pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu126

import numpy as np
import torch
import os

print("Torch version:", torch.__version__)
os.kill(os.getpid(), 9)
#Your notebook process will restart after these installs

CUDA version: 12.5
Looking in indexes: https://download.pytorch.org/whl/cu126
Collecting torch==2.7.0
  Downloading https://download.pytorch.org/whl/cu126/torch-2.7.0%2Bcu126-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (29 kB)
Collecting torchvision==0.22.0
  Downloading https://download.pytorch.org/whl/cu126/torchvision-0.22.0%2Bcu126-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (6.1 kB)
Collecting torchaudio==2.7.0
  Downloading https://download.pytorch.org/whl/cu126/torchaudio-2.7.0%2Bcu126-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (6.6 kB)
Collecting sympy>=1.13.3 (from torch==2.7.0)
  Downloading https://download.pytorch.org/whl/sympy-1.13.3-py3-none-any.whl.metadata (12 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.6.77 (from torch==2.7.0)
  Downloading https://download.pytorch.org/whl/cu126/nvidia_cuda_nvrtc_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.6.77 (from torch==2.7.0)
  Downloading https://download.pytorc

In [1]:
#clone the CLIP repository
!git clone https://github.com/openai/CLIP.git
%cd CLIP

Cloning into 'CLIP'...
remote: Enumerating objects: 256, done.
remote: Counting objects: 100% (154/154), done.
remote: Compressing objects: 100% (44/44), done.
remote: Total 256 (delta 126), reused 110 (delta 110), pack-reused 102 (from 1)
Receiving objects: 100% (256/256), 8.86 MiB | 9.14 MiB/s, done.
Resolving deltas: 100% (140/140), done.
/content/CLIP


# Download Classification Data or Object Detection Data

We will download the [public flowers classification dataset](https://public.roboflow.com/classification/flowers_classification) from Roboflow. The data comes as folders split into train/valid/test, with a separate subfolder for each class label.

You can easily download your own dataset from Roboflow in this format, too.
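
As a quick sanity check on that layout, the sketch below counts images per class folder inside one split. The `count_images` helper, the accepted extensions, and the throwaway folder tree (including its class and file names) are assumptions made so the example runs on its own; point it at your real split directory instead.

```python
import tempfile
from pathlib import Path

def count_images(split_dir, extensions=(".jpg", ".jpeg", ".png")):
    """Return {class_name: image_count} for one split folder."""
    counts = {}
    for class_dir in sorted(Path(split_dir).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1 for f in class_dir.iterdir()
                if f.suffix.lower() in extensions
            )
    return counts

# Build a tiny throwaway tree so the sketch runs anywhere; the class
# names and file names here are hypothetical.
with tempfile.TemporaryDirectory() as root:
    for cls, n in [("daisy", 2), ("dandelion", 1)]:
        folder = Path(root) / "test" / cls
        folder.mkdir(parents=True)
        for i in range(n):
            (folder / f"img_{i}.jpg").touch()
    counts = count_images(Path(root) / "test")

print(counts)
```

An empty or missing class folder shows up immediately as a zero or absent key, which is easier to catch here than inside the inference loop.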

Roboflow can also convert object detection datasets into CLIP text prompts, if you want to try that out.


To get your data into Roboflow, follow the [Getting Started Guide](https://blog.roboflow.ai/getting-started-with-roboflow/).

In [2]:
#follow the link below to get your download code from Roboflow
!pip install -q roboflow



In [3]:
from roboflow import Roboflow

rf = Roboflow(api_key="YOUR-KEY", model_format="clip", notebook="roboflow-clip")


In [None]:
#we auto generate some example tokenizations in Roboflow but you should edit this file to try out your own prompts
#CLIP gets a lot better with the right prompting!
#be sure the tokenizations are in the same order as your class_names above!
%cat {dataset.location}/test/_tokenization.txt

In [None]:
#edit your prompts as you see fit here, be sure the classes are in the same order as above
%%writefile {dataset.location}/test/_tokenization.txt
The paper sign in rock paper scissors
The rock sign in rock paper scissors
The scissors sign in rock paper scissors
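
Since the prompts are matched to classes purely by line order, it can help to pair them explicitly before running inference. The following is a minimal sketch, assuming a hypothetical `pair_prompts_with_classes` helper and an inline string standing in for the contents of `_tokenization.txt`.

```python
def pair_prompts_with_classes(prompt_text, class_names):
    """Pair each non-empty prompt line with the class at the same index."""
    prompts = [line.strip() for line in prompt_text.splitlines() if line.strip()]
    if len(prompts) != len(class_names):
        raise ValueError(
            f"{len(prompts)} prompts for {len(class_names)} classes"
        )
    return list(zip(class_names, prompts))

# Inline stand-in for the prompt file contents.
prompt_text = """The paper sign in rock paper scissors
The rock sign in rock paper scissors
The scissors sign in rock paper scissors
"""

pairs = pair_prompts_with_classes(prompt_text, ["paper", "rock", "scissors"])
for cls, prompt in pairs:
    print(f"{cls:<9} -> {prompt}")
```

Printing the pairs makes an ordering mistake visible at a glance, and the length check catches a missing or extra line.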

## Config

In [5]:
!pip install ftfy
from google.colab import files
files.upload()

Collecting ftfy
  Downloading ftfy-6.3.1-py3-none-any.whl.metadata (7.3 kB)
Downloading ftfy-6.3.1-py3-none-any.whl (44 kB)
Installing collected packages: ftfy
Successfully installed ftfy-6.3.1


In [6]:
import torch
import clip
from PIL import Image
import glob
import os

In [18]:
# poetry install
# cd data
!kaggle datasets download --unzip frtgnn/rock-paper-scissor

Dataset URL: https://www.kaggle.com/datasets/frtgnn/rock-paper-scissor
License(s): CC-BY-NC-SA-4.0
Downloading rock-paper-scissor.zip to /content/CLIP
 74% 162M/220M [00:00<00:00, 735MB/s] 
100% 220M/220M [00:00<00:00, 371MB/s]


# Run CLIP inference on your classification dataset

In [26]:
dataset_root = "/content/CLIP/rps-test-set/rps-test-set"
"""
# In zero-shot learning with CLIP, you do not train the model. You only:
- Compare the image vector with the vectors of the text descriptions (a photo of rock, a photo of paper, etc.)

- Run inference directly

➡️ So you do not need the train/ set; test/ is enough to evaluate CLIP's accuracy on new images"""

device = "cuda" if torch.cuda.is_available() else "cpu"

In [21]:
# ----------------------
# 🔁 LOAD MODEL
# ----------------------
model, preprocess = clip.load("ViT-B/32", device=device)

100%|████████████████████████████████████████| 338M/338M [00:02<00:00, 149MiB/s]


In [34]:
# ----------------------
# 📁 LOAD CLASS NAMES
# ----------------------
class_names = sorted([
    d for d in os.listdir(dataset_root)
    if os.path.isdir(os.path.join(dataset_root, d))
])
print("✅ Detected classes:", class_names)

✅ Detected classes: ['paper', 'rock', 'scissors']


In [40]:

# ----------------------
# 🧠 Tokenize text prompts
# ----------------------
candidate_captions = [
    "a photo of a hand shaped like paper",
    "The rock sign in rock paper scissors",
    "The scissors sign in rock paper scissors"
]

class_names = ["paper", "rock", "scissors"]  # Fix the order to match the captions

text_tokens = clip.tokenize(candidate_captions).to(device)

In [41]:

# ----------------------
# 🔍 Inference Loop
# ----------------------
correct = []

for cls in class_names:
    class_correct = []
    test_imgs = glob.glob(os.path.join(dataset_root, cls, "*.png"))

    for img_path in test_imgs:
        image = preprocess(Image.open(img_path).convert("RGB")).unsqueeze(0).to(device)

        with torch.no_grad():
            # Encode image and compare with text tokens
            logits_per_image, _ = model(image, text_tokens)
            probs = logits_per_image.softmax(dim=-1).cpu().numpy()[0]
            pred_index = probs.argmax()
            pred_label = class_names[pred_index]

        # Evaluation
        if pred_label == cls:
            correct.append(1)
            class_correct.append(1)
        else:
            correct.append(0)
            class_correct.append(0)

    acc = sum(class_correct) / len(class_correct) if class_correct else 0
    print(f"🎯 Accuracy on class '{cls}': {acc:.2%}")


🎯 Accuracy on class 'paper': 100.00%
🎯 Accuracy on class 'rock': 0.00%
🎯 Accuracy on class 'scissors': 0.00%
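
Per-class accuracy alone does not show where the wrong predictions went, and a result with one class at 100% and the others at 0% is worth inspecting more closely. A minimal sketch of one way to do that: collect `(true_label, predicted_label)` pairs and count them. The `summarize` helper and the example pairs below are made up for illustration and are not results from this notebook; in the loop above, the pairs could be gathered by appending `(cls, pred_label)` to a list.

```python
from collections import Counter

def summarize(pairs):
    """Return overall accuracy and a Counter of (true, predicted) pairs."""
    confusion = Counter(pairs)
    hits = sum(n for (true, pred), n in confusion.items() if true == pred)
    accuracy = hits / len(pairs) if pairs else 0.0
    return accuracy, confusion

# Made-up (true_label, predicted_label) pairs for illustration only.
example_pairs = [
    ("paper", "paper"),
    ("paper", "paper"),
    ("rock", "paper"),
    ("rock", "rock"),
    ("scissors", "paper"),
]

accuracy, confusion = summarize(example_pairs)
print(f"overall accuracy: {accuracy:.2%}")
for (true, pred), n in sorted(confusion.items()):
    print(f"true={true:<9} predicted={pred:<9} count={n}")
```

If most errors land on a single predicted class, the caption for that class is a reasonable first thing to reword.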


In [None]:
#Hope you enjoyed!
#As always, happy inferencing
#Roboflow