# Interacting with CLIP

This is a self-contained notebook that shows how to download and run CLIP models, calculate the similarity between arbitrary image and text inputs, and perform zero-shot image classification.

# Preparation for Colab

Make sure you're running a GPU runtime; if not, select "GPU" as the hardware accelerator under Runtime > Change Runtime Type in the menu. The next cells install the `clip` package and its dependencies, then print the installed PyTorch version so you can confirm it is 1.7.1 or later.

In [None]:
! pip install ftfy regex tqdm
! pip install git+https://github.com/openai/CLIP.git

In [2]:
!git clone https://github.com/ttengwang/PDVC.git

Cloning into 'PDVC'...
remote: Enumerating objects: 338, done.
remote: Counting objects: 100% (91/91), done.
remote: Compressing objects: 100% (35/35), done.
remote: Total 338 (delta 65), reused 56 (delta 56), pack-reused 247
Receiving objects: 100% (338/338), 37.84 MiB | 22.88 MiB/s, done.
Resolving deltas: 100% (143/143), done.


In [None]:
%%capture
%cd /content/PDVC
!bash /content/PDVC/data/yc2/features/download_yc2_tsn_features.sh

In [4]:
import numpy as np
import torch
from pkg_resources import packaging

print("Torch version:", torch.__version__)

Torch version: 1.13.1+cu116
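
The cell above only prints the version string. If an explicit check against the 1.7.1 minimum is wanted, it could look roughly like the sketch below. The helper names `version_tuple` and `meets_minimum` are hypothetical, and the parser assumes the version starts with a plain `major.minor.patch` prefix, optionally followed by a local suffix such as `+cu116`.

```python
import re

MIN_TORCH = (1, 7, 1)  # minimum PyTorch version named in the preparation notes


def version_tuple(version: str) -> tuple:
    """Parse the leading 'major.minor.patch' of a version string into ints."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(version: str, minimum: tuple = MIN_TORCH) -> bool:
    """Return True if `version` is at least `minimum`."""
    return version_tuple(version) >= minimum


# In the notebook this would be called as meets_minimum(torch.__version__).
print(meets_minimum("1.13.1+cu116"))  # True
print(meets_minimum("1.6.0"))         # False
```

Comparing integer tuples avoids the pitfall of comparing version strings lexicographically, where "1.13.1" would sort before "1.7.1".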


# Loading the model

`clip.available_models()` will list the names of available CLIP models.

In [5]:
import clip
import gc
gc.collect()
torch.cuda.empty_cache()

clip.available_models()

['RN50',
 'RN101',
 'RN50x4',
 'RN50x16',
 'RN50x64',
 'ViT-B/32',
 'ViT-B/16',
 'ViT-L/14',
 'ViT-L/14@336px']

In [6]:
model, preprocess = clip.load("RN50")
device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU when no GPU is present
model = model.to(device)
model.eval()
input_resolution = model.visual.input_resolution
context_length = model.context_length
vocab_size = model.vocab_size

print("Model parameters:", f"{np.sum([int(np.prod(p.shape)) for p in model.parameters()]):,}")
print("Input resolution:", input_resolution)
print("Context length:", context_length)
print("Vocab size:", vocab_size)

100%|███████████████████████████████████████| 244M/244M [00:04<00:00, 51.6MiB/s]


Model parameters: 102,007,137
Input resolution: 224
Context length: 77
Vocab size: 49408


# Image Preprocessing

We resize the input images and center-crop them to match the image resolution that the model expects. After that, the images are converted to tensors and the pixel intensities are normalized using the dataset mean and standard deviation.

The second return value from `clip.load()` contains a torchvision `Transform` that performs this preprocessing.



In [7]:
preprocess

Compose(
    Resize(size=224, interpolation=bicubic, max_size=None, antialias=None)
    CenterCrop(size=(224, 224))
    <function _convert_image_to_rgb at 0x7fcc7dc28310>
    ToTensor()
    Normalize(mean=(0.48145466, 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711))
)
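
To make the steps in this pipeline concrete, here is a rough sketch in plain Python of the geometry and the normalization, using the constants printed in the `Compose` above. The helper names and the 640x480 input are hypothetical, and the rounding only approximates what torchvision does internally, so results may differ by a pixel in edge cases.

```python
# Constants copied from the Compose printed above.
SIZE = 224
MEAN = (0.48145466, 0.4578275, 0.40821073)
STD = (0.26862954, 0.26130258, 0.27577711)


def resized_size(width, height, size=SIZE):
    """Scale so the shorter side equals `size`, keeping the aspect ratio."""
    if width <= height:
        return size, int(size * height / width)
    return int(size * width / height), size


def center_crop_box(width, height, size=SIZE):
    """Return (left, top, right, bottom) of a centered size x size crop."""
    left = int(round((width - size) / 2.0))
    top = int(round((height - size) / 2.0))
    return left, top, left + size, top + size


def normalize_pixel(rgb, mean=MEAN, std=STD):
    """Normalize one RGB pixel whose channels are already scaled to [0, 1]."""
    return tuple((c - m) / s for c, m, s in zip(rgb, mean, std))


w, h = resized_size(640, 480)   # a hypothetical 640x480 input image
print((w, h))                   # (298, 224)
print(center_crop_box(w, h))    # (37, 0, 261, 224)
print(normalize_pixel((0.5, 0.5, 0.5)))
```

Note the order: the crop operates on the resized image, and normalization is applied last, per channel, after `ToTensor()` has scaled the pixel values to the range [0, 1].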

# Text Preprocessing

We use a case-insensitive tokenizer, which can be invoked using `clip.tokenize()`. By default, the outputs are padded to 77 tokens, which is the context length the CLIP models expect.
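
The padding behaviour can be illustrated without loading the tokenizer. The sketch below is a simplified stand-in for what `clip.tokenize()` does after byte-pair encoding, not the library's code: it wraps the ids in start and end tokens and zero-pads to the context length. The function `pad_tokens` and the three BPE ids are hypothetical, and the start and end ids are assumed to be the last two entries of the 49,408-entry vocabulary.

```python
SOT_TOKEN = 49406  # assumed id of <|startoftext|>
EOT_TOKEN = 49407  # assumed id of <|endoftext|>
CONTEXT_LENGTH = 77


def pad_tokens(token_ids, context_length=CONTEXT_LENGTH, truncate=False):
    """Wrap BPE ids in start/end tokens and zero-pad to `context_length`."""
    tokens = [SOT_TOKEN] + list(token_ids) + [EOT_TOKEN]
    if len(tokens) > context_length:
        if not truncate:
            raise RuntimeError(f"Input is too long for context length {context_length}")
        # Keep the first context_length tokens and force the last one to be the end token.
        tokens = tokens[:context_length]
        tokens[-1] = EOT_TOKEN
    return tokens + [0] * (context_length - len(tokens))


# Hypothetical BPE ids standing in for a short caption.
padded = pad_tokens([320, 1125, 539])
print(len(padded))   # 77
print(padded[:6])    # [49406, 320, 1125, 539, 49407, 0]
```

Captions that exceed the context length raise an error unless truncation is requested, which matters when encoding long free-form sentences.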

In [None]:
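import pandas as pd

# Each column is one video; the "sentences" row holds that video's list of captions.
df = pd.read_json('/content/PDVC/data/yc2/captiondata/yc2_val.json')

text_features = []
with torch.no_grad():  # inference only, so no autograd graph is kept in memory
    for sentences in df.loc["sentences"]:
        features = []
        for sentence in sentences:
            try:
                tokens = clip.tokenize(sentence).to(device)
                # Move each embedding to the CPU so GPU memory does not grow with the dataset.
                features.append(model.encode_text(tokens).cpu())
            except Exception as e:
                # For example, clip.tokenize raises when a caption exceeds the context length.
                # Only the failing caption is skipped; the rest of the video is still encoded.
                print(e)
        text_features.append(features)

# Store the per-video lists of embeddings as a new "Text" row, after the loop has finished.
df.loc["Text"] = pd.Series(text_features, index=df.columns, dtype=object)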
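
Text features like these are typically compared, with each other or with image features, by cosine similarity. The sketch below shows that computation in plain Python on hypothetical three-dimensional vectors; the real CLIP embeddings are much larger, and in the notebook the same arithmetic would be applied to tensors instead of lists.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    """Dot product of the two unit-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


# Hypothetical embeddings: b points the same way as a, c is orthogonal to a.
caption_a = [1.0, 2.0, 0.0]
caption_b = [2.0, 4.0, 0.0]
caption_c = [0.0, 0.0, 3.0]

print(cosine_similarity(caption_a, caption_b))
print(cosine_similarity(caption_a, caption_c))
```

Because both vectors are normalized first, the score depends only on direction and not on magnitude, so it always falls between -1 and 1.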