<a href="https://colab.research.google.com/github/itachi-452b/learning-clip/blob/main/Testing_CLIP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Trying Out OpenAI's CLIP

Today I am trying out OpenAI's CLIP model, using it to match images against text prompts.

[Reference](https://towardsdatascience.com/clip-the-most-influential-ai-model-from-openai-and-how-to-use-it-f8ee408958b1#:~:text=CLIP%20is%20an%20open%20source,and%20open%2Dsourced%20by%20OpenAI.)

[Hugging Face](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPProcessor)

In [None]:
# Install libraries via pip
!pip install transformers
!pip install datasets

In [None]:
# Import libraries
import transformers
import datasets
import numpy as np
import pandas as pd
import torch
from PIL import Image
import requests

from transformers import CLIPTokenizerFast, CLIPProcessor, CLIPModel

In [None]:
# Pick the device (GPU if available), then load the CLIP model from Hugging Face
# along with its tokenizer (turns text into token IDs) and processor (prepares text and images)
device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "openai/clip-vit-base-patch32"

# we initialize a tokenizer, image processor, and the model itself
tokenizer = CLIPTokenizerFast.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)
model = CLIPModel.from_pretrained(model_id).to(device)


In [None]:
urls=['https://images.unsplash.com/photo-1662955676669-c5d141718bfd?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=687&q=80',
    'https://images.unsplash.com/photo-1552053831-71594a27632d?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=662&q=80',
    'https://images.unsplash.com/photo-1530281700549-e82e7bf110d6?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=688&q=80']

# Download each image and open it with PIL
images = [Image.open(requests.get(url, stream=True).raw) for url in urls]

In [None]:
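text_prompts = ["a girl wearing a beanie", "a boy wearing a beanie", "a dog", "a dog at the beach"]
# Tokenize the prompts and preprocess the images, then move the tensors to the model's device
inputs = processor(text=text_prompts, images=images, return_tensors="pt", padding=True).to(device)

# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # similarity scores, shape (n_images, n_prompts)
probs = logits_per_image.softmax(dim=1).cpu()  # per-image probabilities over the prompts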
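
To make the scoring step less opaque: `logits_per_image` holds one similarity score per (image, prompt) pair, and the softmax turns each row into probabilities over the prompts. The following is a minimal pure-Python sketch of that computation, not the library's implementation. The 3-dimensional embeddings and the `logit_scale` of 100.0 are made-up toy values; in the real model the embeddings are much larger and the scale is a learned parameter.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def clip_style_probs(image_embeds, text_embeds, logit_scale=100.0):
    """Return one probability row per image, with one column per prompt.

    logit_scale is an assumed toy value, not read from any real checkpoint.
    """
    images = [l2_normalize(v) for v in image_embeds]
    texts = [l2_normalize(v) for v in text_embeds]
    probs = []
    for img in images:
        # Scaled cosine similarity between this image and every prompt
        logits = [logit_scale * sum(a * b for a, b in zip(img, txt)) for txt in texts]
        probs.append(softmax(logits))
    return probs

# Toy embeddings (hypothetical): two images, three prompts
toy_image_embeds = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2]]
toy_text_embeds = [[0.9, 0.0, 0.1], [0.1, 1.0, 0.0], [0.0, 0.2, 1.0]]

toy_probs = clip_style_probs(toy_image_embeds, toy_text_embeds)
for row in toy_probs:
    print([round(p, 4) for p in row])
```

Because the softmax runs over `dim=1`, each image's row sums to 1 across the prompts, which is why every row in the output below adds up to roughly 100%.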

In [None]:
# Show the probabilities as percentages, one row per image and one column per prompt
pd.DataFrame(probs.detach().cpu().numpy() * 100, columns=text_prompts, index=["image1", "image2", "image3"]).style.background_gradient(axis=None, low=0, high=0.91).format(precision=2)

| | a girl wearing a beanie | a boy wearing a beanie | a dog | a dog at the beach |
|---|---|---|---|---|
| image1 | 99.26 | 0.74 | 0.0 | 0.0 |
| image2 | 0.1 | 0.06 | 98.76 | 1.08 |
| image3 | 0.0 | 0.0 | 0.81 | 99.18 |
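
A common next step is to reduce each row to its single best-matching prompt. Below is a small sketch that does this with the percentages from the output above, copied by hand into a plain dictionary. The helper name `best_prompt` is made up for this illustration.

```python
prompts = ["a girl wearing a beanie", "a boy wearing a beanie", "a dog", "a dog at the beach"]

# Percentages copied from the output above, one row per image
scores = {
    "image1": [99.26, 0.74, 0.0, 0.0],
    "image2": [0.1, 0.06, 98.76, 1.08],
    "image3": [0.0, 0.0, 0.81, 99.18],
}

def best_prompt(row, labels):
    """Return (label, score) for the highest-scoring prompt in one row."""
    best_index = max(range(len(row)), key=lambda i: row[i])
    return labels[best_index], row[best_index]

predictions = {name: best_prompt(row, prompts) for name, row in scores.items()}
for name, (label, score) in predictions.items():
    print(f"{name}: {label} ({score:.2f}%)")
```

With the live tensor, `probs.argmax(dim=1)` should give the same column indices without copying numbers by hand.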


Let us run another test with a different set of images and prompts.

In [None]:
urls=['https://img.i-scmp.com/cdn-cgi/image/fit=contain,width=1098,format=auto/sites/default/files/styles/1200x800/public/d8/images/methode/2019/12/12/2fa2638e-1ca7-11ea-8971-922fdc94075f_image_hires_174609.JPG?itok=axx7y6Tu&v=1576143981',
    'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSao-iQSjVUd1Ed3Ac4kvEs1dL_cnrnFxOPNA&usqp=CAU',
    'https://www.lifeisabeachparty.com/assets/image/content/photogallery/photogallery_7.jpg']

# Download each image and open it with PIL
images = [Image.open(requests.get(url, stream=True).raw) for url in urls]

In [None]:
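text_prompts = ["volcano", "naruto", "pool party", "beach", "party", "nightlife", "happy people"]
# Tokenize the prompts and preprocess the images, then move the tensors to the model's device
inputs = processor(text=text_prompts, images=images, return_tensors="pt", padding=True).to(device)

# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # similarity scores, shape (n_images, n_prompts)
probs = logits_per_image.softmax(dim=1).cpu()  # per-image probabilities over the prompts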

In [None]:
# Show the probabilities as percentages, one row per image and one column per prompt
pd.DataFrame(probs.detach().cpu().numpy() * 100, columns=text_prompts, index=["image1", "image2", "image3"]).style.background_gradient(axis=None, low=0, high=0.91).format(precision=2)

| | volcano | naruto | pool party | beach | party | nightlife | happy people |
|---|---|---|---|---|---|---|---|
| image1 | 99.96 | 0.01 | 0.0 | 0.01 | 0.02 | 0.0 | 0.0 |
| image2 | 0.02 | 99.98 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| image3 | 0.0 | 0.0 | 93.48 | 0.87 | 2.5 | 0.0 | 3.14 |
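
These numbers are relative to the prompt list: the softmax only compares the prompts that were supplied. The sketch below illustrates this with the `image3` row above, under the assumption that each image-prompt logit does not depend on which other prompts are in the batch. Under that assumption, dropping a prompt is equivalent to renormalizing the remaining probabilities. The helper name `drop_prompt` is hypothetical, and the inputs are the rounded percentages from the output, so the result is approximate.

```python
prompts = ["volcano", "naruto", "pool party", "beach", "party", "nightlife", "happy people"]
image3 = [0.0, 0.0, 93.48, 0.87, 2.5, 0.0, 3.14]  # percentages copied from the output above

def drop_prompt(labels, row, removed):
    """Renormalize a probability row after removing one prompt."""
    kept = [(label, p) for label, p in zip(labels, row) if label != removed]
    total = sum(p for _, p in kept)
    return {label: 100.0 * p / total for label, p in kept}

without_pool_party = drop_prompt(prompts, image3, "pool party")
for label, p in without_pool_party.items():
    print(f"{label}: {p:.1f}%")
```

With "pool party" removed, "happy people" and "party" would take roughly 48% and 38% of the probability for `image3`. A high score therefore says a prompt is the best of the candidates offered, not that it is a good description in absolute terms.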


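The next step is to fine-tune the model. The Hugging Face Transformers repository has a contrastive image-text example that can serve as a starting point; you can check it [here](https://github.com/huggingface/transformers/tree/main/examples/pytorch/contrastive-image-text).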

Using [finetuner](https://github.com/huggingface/transformers/tree/main/examples/pytorch/contrastive-image-text)
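
For context on what fine-tuning optimizes: CLIP-style training uses a symmetric contrastive loss over a batch of matched image-text pairs, where pair i should score higher than every mismatched combination. The following is a minimal pure-Python sketch of that loss on a toy similarity matrix, not the code from the linked example. The function names and the logit values are made up for illustration.

```python
import math

def cross_entropy(logits, target_index):
    """Cross-entropy of one row of logits against the index of the correct class."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target_index]

def symmetric_contrastive_loss(logits_per_image):
    """Average of image-to-text and text-to-image cross-entropy.

    logits_per_image[i][j] is the similarity of image i and text j;
    the matching pairs are assumed to sit on the diagonal.
    """
    n = len(logits_per_image)
    # Transpose to get one row per text
    logits_per_text = [list(col) for col in zip(*logits_per_image)]
    image_loss = sum(cross_entropy(logits_per_image[i], i) for i in range(n)) / n
    text_loss = sum(cross_entropy(logits_per_text[i], i) for i in range(n)) / n
    return (image_loss + text_loss) / 2.0

# Toy logits (hypothetical): a well-aligned batch and a fully confused one
aligned = [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]]
confused = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]

aligned_loss = symmetric_contrastive_loss(aligned)
confused_loss = symmetric_contrastive_loss(confused)
print(aligned_loss, confused_loss)
```

The loss is near zero when each image scores highest against its own caption, and it rises to `log(batch_size)` when the model cannot tell the pairs apart.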