**Assignment No. 7:** Implement a basic multimodal system for image captioning using CLIP

In [None]:
!pip uninstall -y torch torchaudio fastai

Found existing installation: torch 2.5.1+cu124
Uninstalling torch-2.5.1+cu124:
  Successfully uninstalled torch-2.5.1+cu124
Found existing installation: torchaudio 2.5.1+cu124
Uninstalling torchaudio-2.5.1+cu124:
  Successfully uninstalled torchaudio-2.5.1+cu124
Found existing installation: fastai 2.7.18
Uninstalling fastai-2.7.18:
  Successfully uninstalled fastai-2.7.18


In [None]:
!pip install torch==2.6.0 torchvision==0.21.0

Collecting torch==2.6.0
  Downloading torch-2.6.0-cp311-cp311-manylinux1_x86_64.whl.metadata (28 kB)
Collecting torchvision==0.21.0
  Downloading torchvision-0.21.0-cp311-cp311-manylinux1_x86_64.whl.metadata (6.1 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch==2.6.0)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch==2.6.0)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch==2.6.0)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch==2.6.0)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch==2.6.0)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metad

In [None]:
!pip install transformers



In [None]:
import transformers
print(transformers.__version__)

4.48.3


In [None]:
import torch
import torchvision

print("Torch version:", torch.__version__)
print("Torchvision version:", torchvision.__version__)


Torch version: 2.6.0+cu124
Torchvision version: 0.21.0+cu124


In [None]:
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch

# Load the CLIP model and processor
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Load the local image and convert it to RGB (WebP files may carry an alpha channel)
image_path = "/content/Image_assignment_7.webp"  # Path to your local image
image = Image.open(image_path).convert("RGB")

# Candidate captions. CLIP does not generate text; it scores each candidate against the image.
# For a more dynamic approach, the candidates could come from a caption generation model.
text = [
    "a small blue bird on a branch",
    "a bird sitting on a flowering tree branch",
    "a close-up of a bird with white flowers",
    "a beautiful spring scene with a blue bird",
    "a blue bird perched on a blooming tree",
    "a nature photograph of a bird and flowers",
    "a peaceful bird resting on a branch",
    "a scenic view of a bird and sky"
]


# Preprocess the image and captions into the tensors CLIP expects
inputs = processor(text=text, images=image, return_tensors="pt", padding=True)

# Use the GPU if one is available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Move the model and the input tensors to the same device
model.to(device)
inputs = {key: value.to(device) for key, value in inputs.items()}

# Run CLIP; gradients are not needed for inference
with torch.no_grad():
    outputs = model(**inputs)
    # Shape (1, num_captions): image-to-text similarity logits
    logits_per_image = outputs.logits_per_image

# Softmax over the caption axis turns the logits into relative probabilities.
# outputs.logits_per_text is the transpose, shape (num_captions, 1); a softmax
# over its last axis would return 1.0 for every caption, so it is not used here.
caption_probs = logits_per_image.softmax(dim=-1)

# Print similarity scores
print("Image to Text Similarity Scores:")
for idx, caption in enumerate(text):
    print(f"{caption}: {caption_probs[0][idx].item():.4f}")

# Pick the caption with the highest probability
best_caption_idx = caption_probs.argmax(dim=-1).item()
print("\nBest caption for the image:", text[best_caption_idx])

Image to Text Similarity Scores:
a small blue bird on a branch: 0.0057
a bird sitting on a flowering tree branch: 0.0160
a close-up of a bird with white flowers: 0.0002
a beautiful spring scene with a blue bird: 0.4551
a blue bird perched on a blooming tree: 0.4994
a nature photograph of a bird and flowers: 0.0173
a peaceful bird resting on a branch: 0.0061
a scenic view of a bird and sky: 0.0002

Best caption for the image: a blue bird perched on a blooming tree
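
For context, the scores above start from cosine similarities: CLIP embeds the image and each caption, L2-normalizes the embeddings, and multiplies their dot products by a learned temperature to obtain `logits_per_image`. A minimal pure-Python sketch of that step follows. The three-dimensional vectors and the scale of 100.0 are made-up illustrative values, not outputs from the model, and `l2_normalize` and `clip_style_logits` are hypothetical helper names introduced here.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def clip_style_logits(image_emb, text_embs, logit_scale=100.0):
    """Return scaled cosine similarities between one image and several texts.

    logit_scale=100.0 is an assumed illustrative temperature.
    """
    img = l2_normalize(image_emb)
    logits = []
    for emb in text_embs:
        txt = l2_normalize(emb)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(logit_scale * cosine)
    return logits

# Toy embeddings (assumptions for illustration only)
image_emb = [1.0, 2.0, 0.5]
text_embs = [
    [1.0, 2.0, 0.4],   # close to the image direction
    [-1.0, 0.0, 2.0],  # orthogonal to the image direction
    [2.0, 4.0, 1.0],   # same direction as the image, different length
]

logits = clip_style_logits(image_emb, text_embs)
for i, value in enumerate(logits):
    print(f"caption {i}: logit = {value:.2f}")
```

Because the embeddings are normalized first, only the direction of each vector matters: the third toy caption points the same way as the image and therefore receives the maximum logit regardless of its length.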

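
The softmax step can be reproduced without the model to see why the printed scores are relative rather than absolute: they always sum to 1 over whatever candidate list is supplied, and a list containing a single caption would always score 1.0. The sketch below uses made-up logits (not the ones CLIP produced for this image), and `softmax` is a small helper written here for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)  # subtract the max to avoid overflow in exp
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical similarity logits for five candidate captions
logits = [24.1, 25.2, 20.7, 28.5, 28.6]
probs = softmax(logits)

# Index of the highest-probability caption
best_idx = max(range(len(probs)), key=probs.__getitem__)

print([round(p, 4) for p in probs], "sum =", round(sum(probs), 4))
print("best index:", best_idx)

# With only one candidate, softmax has nothing to compare against
single = softmax([28.6])
print(single)
```

One consequence is that a high score only means a caption beat the other candidates; adding or removing candidates changes every score, so the values are not comparable across different caption lists.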