## Image Captioning using KNN

Although VLMs (Vision Language Models) are the go-to tools for image captioning right now, earlier work used KNN for captioning and performed surprisingly well.

Further, libraries like [Faiss](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/) perform the nearest neighbor computation efficiently and are used in many industrial applications.

- In this question you will implement an algorithm to perform captioning using KNN, based on the paper [A Distributed Representation Based Query Expansion Approach for Image Captioning](https://aclanthology.org/P15-2018.pdf)

- Dataset: [MS COCO](https://cocodataset.org/#home) 2014 (val set only)

- Algorithm:
    1. Given: Image embeddings and the corresponding caption embeddings (5 per image).
    1. For every image, find the k nearest images and compute its query vector as the weighted sum of the caption embeddings of those nearest images (k*5 captions per image).
    1. The predicted caption is the caption in the dataset that is closest to the query vector. (For the sake of the assignment, use the same COCO val set captions as the dataset.)

- The image and text embeddings are extracted from the [CLIP](https://openai.com/research/clip) model. (You need not know about this right now)

- Tasks:
    1. Implement the algorithm and compute the BLEU score. Use Faiss for nearest neighbor computation. Starter code is provided below.
    1. Try a few options for k. Record your observations.
    1. For a fixed k, try a few options in the Faiss index factory to speed up the computation in step 2 of the algorithm. Record your observations.
    1. Qualitative study: Visualize five images, their ground truth captions and the predicted caption.
    
Note: Run this notebook on Colab for fastest results.
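
The three algorithm steps listed above can be sketched on toy data with plain NumPy before moving to Faiss. This is a minimal brute-force sketch, not the paper's exact method: the function name `knn_caption_sketch`, the inverse-distance weights, and the random toy arrays are all assumptions made for illustration.

```python
import numpy as np

def knn_caption_sketch(img_emb, cap_emb, k):
    """Toy brute-force version of the three steps (no Faiss).

    img_emb: (n, d) image embeddings; cap_emb: (n, m, d) caption embeddings.
    Returns, for each image, the flat index (into the n*m captions) of the
    predicted caption.
    """
    n, m, d = cap_emb.shape

    # Step 2a: squared L2 distance between every pair of images.
    diff = img_emb[:, None, :] - img_emb[None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(dist, np.inf)              # an image is not its own neighbor
    nbrs = np.argsort(dist, axis=1)[:, :k]      # (n, k) neighbor indices
    nbr_dist = np.take_along_axis(dist, nbrs, axis=1)

    # Step 2b: neighbor weights. Inverse-distance weighting is an assumption;
    # the task only asks for "a weighted sum".
    w = 1.0 / (1.0 + nbr_dist)
    w /= w.sum(axis=1, keepdims=True)

    # Weighted average over the k*m caption embeddings of the neighbors
    # (every caption of neighbor i gets weight w_i / m).
    cap_mean = cap_emb.mean(axis=1)                             # (n, d)
    queries = (w[:, :, None] * cap_mean[nbrs]).sum(axis=1)      # (n, d)

    # Step 3: closest caption in the whole caption pool.
    flat = cap_emb.reshape(n * m, d)
    cap_dist = ((queries[:, None, :] - flat[None, :, :]) ** 2).sum(axis=-1)
    return cap_dist.argmin(axis=1)

# Hypothetical toy data: 6 images, 5 captions each, 4-dimensional embeddings.
rng = np.random.default_rng(0)
toy_imgs = rng.normal(size=(6, 4))
toy_caps = rng.normal(size=(6, 5, 4))
pred = knn_caption_sketch(toy_imgs, toy_caps, k=2)
print(pred.shape)  # (6,)
```

The pairwise distance matrices here need memory proportional to n squared and to n times the number of captions, so this only works for toy sizes. For the full val set, Faiss replaces both nearest-neighbor searches.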

In [2]:
!gdown 1RwhwntZGZ9AX8XtGIDAcQD3ByTcUiOoO #image embeddings



In [3]:
!gdown 1b-4hU2Kp93r1nxMUGEgs1UbZov0OqFfW #caption embeddings



In [4]:
!wget http://images.cocodataset.org/zips/val2014.zip
!unzip /content/val2014.zip
!wget http://images.cocodataset.org/annotations/annotations_trainval2014.zip
!unzip /content/annotations_trainval2014.zip
!pip install faiss-cpu



Collecting faiss-cpu
  Downloading faiss_cpu-1.7.4-cp311-cp311-win_amd64.whl (10.8 MB)


In [None]:
import torchvision.datasets as dset
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
import torch
import torch.nn as nn
import torch.nn.functional as F
from nltk.translate import bleu_score
import faiss
import numpy as np

In [None]:
def get_transform():
    transform = transforms.Compose([
        transforms.Resize((224,224)),
        transforms.ToTensor(),  # convert the PIL Image to a tensor
        transforms.Normalize(
            (0.485, 0.456, 0.406),  # normalize image for pre-trained model
            (0.229, 0.224, 0.225),
        )
    ])
    return transform

coco_dset = dset.CocoCaptions(root = '/content/val2014',
                        annFile = '/content/annotations/captions_val2014.json',
                        transform=get_transform())

print('Number of samples: ', len(coco_dset))
img, target = coco_dset[3] # load 4th sample

print("Image Size: ", img.shape)
print(target)

loading annotations into memory...
Done (t=0.58s)
creating index...
index created!
Number of samples:  40504
Image Size:  torch.Size([3, 224, 224])
['A loft bed with a dresser underneath it.', 'A bed and desk in a small room.', 'Wooden bed on top of a white dresser.', 'A bed sits on top of a dresser and a desk.', 'Bunk bed with a narrow shelf sitting underneath it. ']


In [None]:
ids = list(sorted(coco_dset.coco.imgs.keys()))
captions = []
for i in range(len(ids)):
    captions.append([ele['caption'] for ele in coco_dset.coco.loadAnns(coco_dset.coco.getAnnIds(ids[i]))][:5]) #5 per image
captions_np = np.array(captions)
print('Captions:', captions_np.shape)

Captions: (40504, 5)


In [None]:
captions_flat = captions_np.flatten().tolist()
print('Total captions:', len(captions_flat))

Total captions: 202520


In [None]:
cap_path = '/content/coco_captions.npy'
caption_embeddings = np.load(cap_path)
print('Caption embeddings',caption_embeddings.shape)

Caption embeddings (40504, 5, 512)


In [None]:
img_path = '/content/coco_imgs.npy'
image_embeddings = np.load(img_path)
print('Image embeddings',image_embeddings.shape)

Image embeddings (40504, 512)


In [None]:
def accuracy(predict, real):
    '''
    use bleu score as a measurement of accuracy
    :param predict: a list of predicted captions (strings)
    :param real: a list of lists of reference captions (strings), one list per prediction
    :return: mean sentence-level bleu score
    '''
    total = 0
    for i, pre in enumerate(predict):
        # sentence_bleu expects token lists; raw strings would be compared character by character
        references = [ref.split() for ref in real[i]]
        total += bleu_score.sentence_bleu(references, pre.split())
    return total / len(predict)

In [None]:
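def knn_captioning(image_embeddings, caption_embeddings, k, ground_truth_captions):
    '''
    Predict one caption per image with the KNN query-expansion scheme.
    :param image_embeddings: array of shape (n_images, dim)
    :param caption_embeddings: array of shape (n_images, captions_per_image, dim)
    :param k: number of neighboring images to use
    :param ground_truth_captions: flat list of all captions (n_images * captions_per_image)
    :return: a list with one predicted caption string per image
    '''
    n_images, caps_per_image, dim = caption_embeddings.shape
    image_embeddings = np.ascontiguousarray(image_embeddings, dtype=np.float32)
    flat_caption_embeddings = np.ascontiguousarray(
        caption_embeddings.reshape(-1, dim), dtype=np.float32)

    # Exact L2 index over the image embeddings
    image_index = faiss.IndexFlatL2(dim)
    image_index.add(image_embeddings)

    # Each image is its own nearest neighbor, so search for k + 1 and drop the first hit
    D, I = image_index.search(image_embeddings, k + 1)
    D, I = D[:, 1:], I[:, 1:]

    # Neighbor weights: inverse-distance weighting is an assumption,
    # the task only asks for a weighted sum
    weights = 1.0 / (1.0 + D)
    weights /= weights.sum(axis=1, keepdims=True)

    # Query vector = weighted average of the k * caps_per_image neighbor caption embeddings
    caption_means = caption_embeddings.mean(axis=1).astype(np.float32)
    query_vectors = np.zeros((n_images, dim), dtype=np.float32)
    for j in range(k):
        query_vectors += weights[:, j:j + 1] * caption_means[I[:, j]]

    # Predicted caption = caption in the dataset closest to the query vector
    caption_index = faiss.IndexFlatL2(dim)
    caption_index.add(flat_caption_embeddings)
    _, nearest = caption_index.search(query_vectors, 1)
    return [ground_truth_captions[idx] for idx in nearest[:, 0]]


# Example usage:
k_value = 5  # You can experiment with different values of k
predicted_captions = knn_captioning(image_embeddings, caption_embeddings, k_value, captions_flat)

# Compute BLEU score for each image (sentence_bleu expects tokenized text)
bleu_scores = [
    bleu_score.sentence_bleu([ref.split() for ref in captions[i]], predicted_captions[i].split())
    for i in range(len(captions))
]
average_bleu = np.mean(bleu_scores)
print(f"Average BLEU score: {average_bleu}")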
