# AI Image Caption Recommendation System

This repository demonstrates how to build an AI Image Caption Recommendation System using a retrieval-based approach. The system uses the CLIP (Contrastive Language–Image Pre-training) model to embed both images and text in a shared vector space. The goal is to recommend the most relevant captions for a given image from a predefined list of candidate captions, rather than to generate new text. This approach is particularly useful for social media platforms, where captions need to be engaging and contextually relevant.



Install Required Packages

Before running the script, ensure the following libraries are installed:

torch: PyTorch library for deep learning.
transformers: Library for pre-trained models like CLIP.
Pillow: Library for image processing.
scikit-learn: Library used here for the cosine similarity calculation.

You can install them using pip:

In [None]:
pip install torch transformers pillow scikit-learn

In [None]:
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel
from sklearn.metrics.pairwise import cosine_similarity

Load and Preprocess the Image

This function opens the image with PIL, converts it to RGB (so that grayscale or RGBA files yield the three channels CLIP expects), and passes it to the CLIPProcessor, which handles resizing, cropping, and normalization. It returns the processed inputs, a dictionary-like object whose pixel_values entry is a PyTorch tensor ready for CLIP, together with the processor so that it can be reused for the captions.

In [None]:
def load_and_preprocess_image(image_path):
    image = Image.open(image_path).convert("RGB")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    inputs = processor(images=image, return_tensors="pt")
    return inputs, processor
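
The normalization step that the processor applies can be illustrated on a single pixel. The following is a minimal pure-Python sketch, not the processor's actual implementation: `normalize_pixel` is a hypothetical helper, and the mean and standard deviation values are the ones commonly published for the OpenAI CLIP checkpoints, so treat them as an assumption and check the processor's own configuration if exact values matter.

```python
# Per-channel statistics commonly published for OpenAI CLIP checkpoints
# (assumed here; the processor's configuration is the source of truth).
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def normalize_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple(
        (channel / 255.0 - mean) / std
        for channel, mean, std in zip(rgb, CLIP_MEAN, CLIP_STD)
    )


black = normalize_pixel((0, 0, 0))
white = normalize_pixel((255, 255, 255))
print(black)
print(white)
```

After this step, dark pixels map to negative values and bright pixels to positive values, which is the value range the model saw during training.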

Generate Image Embeddings

This function loads the pre-trained CLIP model. The crucial step is model.get_image_features(**inputs), which passes the preprocessed image tensor through CLIP's image encoder and returns a feature vector (embedding) representing the image’s visual content. torch.no_grad() disables gradient tracking during inference, which saves memory and speeds up the computation. The function returns both the embedding and the model so that the model can be reused for the captions.


In [None]:
def generate_image_embeddings(inputs):
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    with torch.no_grad():
        image_features = model.get_image_features(**inputs)
    return image_features, model
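
Both helper functions above call from_pretrained on every invocation, so processing many images reloads the weights each time. One possible way to avoid that, sketched here under the assumption that a single checkpoint is used throughout, is to cache the loaded objects. `get_clip_components` and `MODEL_NAME` are hypothetical names introduced for illustration and are not part of the original notebook.

```python
from functools import lru_cache

MODEL_NAME = "openai/clip-vit-base-patch32"


@lru_cache(maxsize=1)
def get_clip_components(model_name=MODEL_NAME):
    """Load the CLIP model and processor once and reuse them on later calls."""
    # Imported lazily so that defining this helper does not trigger the import.
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained(model_name)
    # from_pretrained already returns the model in evaluation mode;
    # calling eval() only makes that explicit.
    model.eval()
    processor = CLIPProcessor.from_pretrained(model_name)
    return model, processor
```

The first call to `get_clip_components()` downloads or loads the checkpoint, and later calls with the same argument return the cached pair.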

Match Captions

This function takes the image features and a list of candidate captions as input. It tokenizes the captions with the same CLIPProcessor (now used for text) and passes the result to the model's get_text_features method to obtain the text embeddings. It then calculates the cosine similarity between the image embedding and each text embedding. Cosine similarity is the cosine of the angle between two vectors; a value closer to 1 indicates higher similarity. The function returns the captions ranked from most to least similar, along with their corresponding similarity scores.

In [None]:
def match_captions(image_features, captions, clip_model, processor):
    text_inputs = processor(text=captions, return_tensors="pt", padding=True)
    with torch.no_grad():
        text_features = clip_model.get_text_features(**text_inputs)

    image_features = image_features.detach().cpu().numpy()
    text_features = text_features.detach().cpu().numpy()

    similarities = cosine_similarity(image_features, text_features)

    best_indices = similarities.argsort(axis=1)[0][::-1]
    best_captions = [captions[i] for i in best_indices]

    return best_captions, similarities[0][best_indices].tolist()
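
The ranking logic can be checked by hand on small vectors. The following is a minimal pure-Python sketch of cosine similarity and descending ranking. The helper names `cosine` and `rank_by_similarity` are hypothetical, and the toy 3-dimensional vectors stand in for real CLIP embeddings, which have far more dimensions.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(image_vec, text_vecs, captions):
    """Return (caption, score) pairs sorted from most to least similar."""
    scored = [
        (cosine(image_vec, vec), caption)
        for vec, caption in zip(text_vecs, captions)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(caption, score) for score, caption in scored]


# Toy vectors chosen so that the angles are easy to verify by hand.
image_vec = [1.0, 0.0, 0.0]
text_vecs = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
captions = ["orthogonal", "same direction", "45 degrees"]

ranking = rank_by_similarity(image_vec, text_vecs, captions)
for caption, score in ranking:
    print(f"{caption}: {score:.4f}")
```

Note that the vector `[2.0, 0.0, 0.0]` scores 1 despite being twice as long as the image vector, because cosine similarity depends only on direction and not on magnitude.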

Main Function

This function ties together the preprocessing, feature extraction, and matching processes. It takes an image path and a list of candidate captions, processes the image, gets its features, and then matches these features against the captions. The result will be a list of best-fit captions with their similarity scores.

In [None]:
def image_captioning(image_path, candidate_captions):
    inputs, processor = load_and_preprocess_image(image_path)
    image_features, clip_model = generate_image_embeddings(inputs)

    best_captions, similarities = match_captions(image_features, candidate_captions, clip_model, processor)
    return best_captions, similarities

Example Captions

In [None]:
candidate_captions = [
    "Trees, Travel and Tea!",
    "A refreshing beverage.",
    "A moment of indulgence.",
    "The perfect thirst quencher.",
    "Your daily dose of delight.",
    "Taste the tradition.",
    "Savor the flavor.",
    "Refresh and rejuvenate.",
    "Unwind and enjoy.",
    "The taste of home.",
    "A treat for your senses.",
    "A taste of adventure.",
    "A moment of bliss.",
    "Your travel companion.",
    "Fuel for your journey.",
    "The essence of nature.",
    "The warmth of comfort.",
    "A sip of happiness.",
    "Pure indulgence.",
    "Quench your thirst, ignite your spirit.",
    "Awaken your senses, embrace the moment.",
    "The taste of faraway lands.",
    "A taste of home, wherever you are.",
    "Your moment of serenity.",
    "The perfect pick-me-up.",
    "The perfect way to unwind.",
    "Taste the difference.",
    "Experience the difference.",
    "A refreshing escape.",
    "A delightful escape.",
    "The taste of tradition, the spirit of adventure.",
    "The warmth of home, the joy of discovery.",
    "Your passport to flavor.",
    "Your ticket to tranquility.",
    "Sip, savor, and explore.",
    "Indulge, relax, and rejuvenate.",
    "The taste of wanderlust.",
    "The comfort of home.",
    "A journey for your taste buds.",
    "A haven for your senses.",
    "Your refreshing companion.",
    "Your delightful escape.",
    "Taste the world, one sip at a time.",
    "Embrace the moment, one cup at a time.",
    "The essence of exploration.",
    "The comfort of connection.",
    "Quench your thirst for adventure.",
    "Savor the moment of peace.",
    "The taste of discovery.",
    "The warmth of belonging.",
    "Your travel companion, your daily delight.",
    "Your moment of peace, your daily indulgence.",
    "The spirit of exploration, the comfort of home.",
    "The joy of discovery, the warmth of connection.",
    "Sip, savor, and set off on an adventure.",
    "Indulge, relax, and find your peace.",
    "A delightful beverage.",
    "A moment of relaxation.",
    "The perfect way to start your day.",
    "The perfect way to end your day.",
    "A treat for yourself.",
    "Something to savor.",
    "A moment of calm.",
    "A taste of something special.",
    "A refreshing pick-me-up.",
    "A comforting drink.",
    "A moment of peace.",
    "A small indulgence.",
    "A daily ritual.",
    "A way to connect with others.",
    "A way to connect with yourself.",
    "A taste of home.",
    "A taste of something new.",
    "A moment to enjoy.",
    "A moment to remember."
]
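
Hand-written candidate lists can easily contain repeated entries, and a repeated caption would appear more than once in the ranking. As a small, optional safeguard, the list can be deduplicated while preserving its order. `deduplicate` is a hypothetical helper introduced for illustration.

```python
def deduplicate(captions):
    """Remove repeated captions, keeping the first occurrence of each."""
    # dict preserves insertion order (Python 3.7+), so the original order is kept.
    return list(dict.fromkeys(captions))


sample = ["A taste of adventure.", "A moment of bliss.", "A taste of adventure."]
unique_sample = deduplicate(sample)
print(unique_sample)
```

Applied to the notebook, this would be `candidate_captions = deduplicate(candidate_captions)` before calling the matching function.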

Test the System

In [None]:
best_captions, similarities = image_captioning("/content/aman.png", candidate_captions)

# Get the top 5 results
top_n = min(5, len(best_captions))
top_best_captions = best_captions[:top_n]
top_similarities = similarities[:top_n]

# Report the actual number of results, which may be fewer than 5
print(f"Top {top_n} Best Captions:")
for i, (caption, similarity) in enumerate(zip(top_best_captions, top_similarities)):
    print(f"{i+1}. {caption} (Similarity: {similarity:.4f})")