<div align="center"><img src="./asset/UniTn-Logo.svg" style="width: 7em;"/></div>
<h1 align="center" style="font-size: 3.2em;">University of Trento<br/>Machine Learning - Deep Learning Module</h1>
<h2 align="center" style="font-size: 1.5em;">
    Marco Garosi<br/>
    Matteo Minardi<br/>
    Eric Suardi<br/>
    A.Y. 2022-2023
</h2>
<h2 align="center" style="font-size: 1.2em;">Dr. Alessandro Conti<br/>Prof. Elisa Ricci</h2>

# Table of Contents
* Introduction
* How to run this Notebook
* Steps
* Metrics
* "Boilerplate" Code
* Baseline
    * Results
    * Issues
* Core Ideas
    1. Adding a bottleneck on top of CLIP
        * Results
    2. Adding a Fully-Connected Neural Network on top of CLIP
    3. GradCam and ClipSeg
    4. Natural Language Understanding
    5. Graphs
    6. Incorporating more information into the inputs
        * Encoding the position using colors
            * Encoding position in the background
            * Encoding position in the whole image
        * [Our Final Proposal] Deviating CLIP's "attention" to relevant parts
* Results
* Notes
* References

# Introduction

Deep Learning is becoming increasingly relevant both in the research and industrial worlds. Thanks to its exceptional ability in solving problems and tasks, Deep Learning is being actively researched and developed to discover and implement new exciting architectures.

Among the most cutting-edge techniques are the so-called ***multimodal models***, which can operate on a mixture of data types, for instance text and visual inputs.

The task assigned for this project builds on this type of model. By leveraging OpenAI's ***Contrastive Language-Image Pre-training*** model, or **CLIP** for short, the objective was to perform *visual grounding* on the *RefCOCOg* benchmark dataset.

CLIP, which comprises two encoders - one for the textual information, one for the visual one -, comes with a variety of *backbone* models which lead to different performance. The following backbones are available:
* ResNet50
* ResNet101
* ResNet50x4
* ResNet50x16
* ResNet50x64
* ViT-B/32
* ViT-B/16
* ViT-L/14
* ViT-L/14@336px

# How to run this Notebook

This Notebook contains: the baseline implementation that we developed; all the ideas that we worked on between the baseline and our final proposal; and **our model/architecture proposal, at the very end** (the *Deviating CLIP's "attention" to relevant parts* section).

We decided to include the intermediate ideas as they were part of our process and we think some of them may be of interest.

To run this Notebook, we suggest running the first part up to the baseline. Then, our results can be checked by jumping to the respective section, as mentioned earlier, and running that.

It is not necessary to run the code in the cells in between, since they only contain implementations of some of the ideas we are proposing.

All the results can be found in the tables at the end of every section, so it is not strictly necessary to run the Notebook other than to check it.

There is quite some code. However, we did our best to keep it as simple and understandable as possible. Comments will guide the readers into the details, whereas Markdown cells will drive them across the macro-blocks by providing a more general overview of what is going on in each part of this Notebook.
Some code may be repeated for clarity and to ensure that it is consistent across all the sections of this comprehensive Notebook.

___
### Package installation

To run this Notebook, some additional packages are required. Running the next cell will install them using `pip`. We assume that PyTorch is already installed and properly configured on the machine (since installation may vary depending on hardware and software configurations).

___
### Dataset download

The dataset will be downloaded from Google Drive, using the shared folder ID and the `gdown` command. However, Google Drive may refuse to serve the `.tar` archive. This can happen if the same file has been downloaded several times recently, and there is nothing we can do to avoid it: it is a Google Drive policy.

If the download does not work, we kindly ask you to load the dataset manually from Google Drive rather than downloading it into the environment. While this solves the issue, the implementation then depends on the path of the dataset in one's Google Drive storage, so we cannot handle this directly.

The dataset is then extracted into the `refcocog` folder, which is where the code looks for all the data.
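As a quick sanity check after extraction, the layout that the loading code relies on can be verified before running anything heavy. The sketch below is only an illustration: the helper name `missing_dataset_entries` is hypothetical, and the expected paths are the ones used by the dataset class and the dataset-loading cell later in this Notebook.

```python
from pathlib import Path

# Paths taken from the dataset-loading code in this Notebook
EXPECTED_ENTRIES = [
    Path("annotations") / "refs(umd).p",
    Path("annotations") / "instances.json",
    Path("images"),
]

def missing_dataset_entries(root_dir="refcocog"):
    """Return the expected files/folders that are missing under root_dir."""
    root = Path(root_dir)
    return [str(entry) for entry in EXPECTED_ENTRIES if not (root / entry).exists()]

missing = missing_dataset_entries("refcocog")
if missing:
    print("Missing dataset entries:", missing)
else:
    print("Dataset layout looks complete.")
```

If anything is reported as missing, the archive was probably extracted into a different folder or the download did not complete.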

In [None]:
!pip install pandas
!pip install numpy

!pip install ftfy regex tqdm
!pip install git+https://github.com/openai/CLIP.git

Run the following two cells only if executing on Google Colab. The first downloads the dataset; the second mounts Google Drive and extracts the archive into the current environment.

In [None]:
!gdown --id 1xijq32XfEm6FPhUb7RsZYWHc2UuwVkiq

In [None]:
from google.colab import drive
drive.mount('/content/drive')

!tar -xvzf "drive/MyDrive/Deep Learning Project/refcocog.tar.gz"

To conclude this setup section, please remember that CLIP models will be downloaded if they are not found in the cache. Therefore, this may slow down the running time.

In addition, some parts of this Notebook use different CLIP backbones, therefore multiple downloads may occur.

# Steps

To develop our project, we went through the following steps:
1. Reading
    * We have read the foundational papers for this project
1. Dataset exploration
    * We have thoroughly explored the dataset to understand its structure, how to link information together and what to expect from it
1. Baseline implementation
    * We have implemented the baseline to understand how it performs and what issues it brings in
1. Results examination
    * We have looked at several outputs one-by-one to better understand the problems that the baseline was unable to solve
1. Brainstorming, looking for novel ideas and a lot of reading
1. Exploration
    * We have implemented and tested many different architectures and pipelines in order to understand what could work and what could not
1. Improvement
    * Based on the results of the previous step, we have decided to focus on a couple of architectures and did our best, both in terms of reading new papers and creativity, to push them forwards and improve the baseline's results
1. Results examination and conclusions

Some of these steps were, of course, iterated multiple times as we had to gather knowledge about the problems of each solution we had thought of.

# Metrics

Before presenting our work, it is crucial to explain the metrics we utilized to measure the performance of our models.

We implemented and employed the following metrics:
* Intersection over Union (IoU), average
* Accuracy, computed as the fraction of samples having an Intersection over Union $\ge 0.5$
* Cosine similarity between the embedding of the image cropped on the estimated bounding box and the embedding of the image cropped on the golden/ground truth bounding box, average

Other metrics can be computed (e.g. top-k accuracy, which checks whether a correct result appears among the k bounding boxes with the highest confidence). However, these can be derived from, or are strongly correlated with, the three we decided to use for the overall assignment. We therefore chose to compute the three aforementioned metrics on the whole set of examples used for testing.
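For reference, top-k accuracy could be sketched as follows. This metric is not computed anywhere in this Notebook; the function name, the input layout (per-sample IoUs already sorted by decreasing confidence) and the toy values are assumptions.

```python
def top_k_accuracy(ranked_ious_per_sample, k, threshold=0.5):
    """
    ranked_ious_per_sample[i] holds the IoUs of the candidate boxes for
    sample i, sorted by decreasing model confidence.
    A sample counts as correct if any of its top-k candidates reaches
    the IoU threshold.
    """
    hits = sum(
        any(iou >= threshold for iou in ranked[:k])
        for ranked in ranked_ious_per_sample
    )
    return hits / len(ranked_ious_per_sample)

ranked = [
    [0.8, 0.1],  # correct at rank 1
    [0.2, 0.6],  # correct only at rank 2
    [0.1, 0.3],  # never correct
]
top1 = top_k_accuracy(ranked, k=1)
top2 = top_k_accuracy(ranked, k=2)
```

With `k=1` this reduces to the accuracy defined above, which is one reason the additional metric adds little information here.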

# "Boilerplate" Code

In the following sections we will present our code and our findings. In this section we will instead present and briefly explain, through comments, the "boilerplate" code that is necessary to run our architecture and assess its performance.

> Some pieces of code are repeated across different proposals. This happens especially for loading CLIP model(s). We decided to repeat that code across different proposals to ensure the correct CLIP models are loaded at each spot and to make it easier to refer to the relevant part of the architectures (and CLIP backbones are one of those).

In [1]:
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
import torchvision.transforms as T
import torch.nn.functional as F
from torchvision.io import read_image, ImageReadMode
from PIL import Image
import os
import json
import pickle
import numpy as np

In [2]:
class RefCocoG_Dataset(Dataset):
    full_annotations = None

    def __init__(self, root_dir, annotations_f, instances_f, split='train', transform=None, target_transform=None) -> None:
        super().__init__()

        self.root_dir = root_dir
        self.annotations_f = annotations_f
        self.instances_f = instances_f

        self.split = split

        self.transform = transform
        self.target_transform = target_transform

        self.get_annotations()
        self.image_names = list([
            self.annotations[id]['image']['actual_file_name']
            for id in self.annotations
        ])

    def get_annotations(self):
        if RefCocoG_Dataset.full_annotations:
            self.annotations = dict(filter(lambda match: match[1]['image']['split'] == self.split, RefCocoG_Dataset.full_annotations.items()))
            return

        # Load pickle data
        with open(os.path.join(self.root_dir, 'annotations', self.annotations_f), 'rb') as file:
            self.data = pickle.load(file)

        # Load instances
        with open(os.path.join(self.root_dir, 'annotations', self.instances_f), 'rb') as file:
            self.instances = json.load(file)

        # Match data between the two files and build the actual dataset
        self.annotations = {}

        images_actual_file_names = {}
        for image in self.instances['images']:
            images_actual_file_names[image['id']] = image['file_name']

        for image in self.data:
            if image['ann_id'] not in self.annotations:
                self.annotations[image['ann_id']] = {}

            self.annotations[image['ann_id']]['image'] = image
            self.annotations[image['ann_id']]['image']['actual_file_name'] = images_actual_file_names[image['image_id']]

        for annotation in self.instances['annotations']:
            if annotation['id'] not in self.annotations:
                continue

            self.annotations[annotation['id']]['annotation'] = annotation

        # Keep only samples from the given split
        RefCocoG_Dataset.full_annotations = self.annotations
        self.annotations = dict(filter(lambda match: match[1]['image']['split'] == self.split, self.annotations.items()))

    def __len__(self):
        # Return the number of images
        return len(self.image_names)

    def corner_size_to_corners(self, bounding_box):
        """
        Transform (top_left_x, top_left_y, width, height) bounding box representation
        into (top_left_x, top_left_y, bottom_right_x, bottom_right_y)
        """

        return [
            bounding_box[0],
            bounding_box[1],
            bounding_box[0] + bounding_box[2],
            bounding_box[1] + bounding_box[3]
        ]

    def __getitem__(self, idx):
        # Get the image name at the given index
        image_name = self.image_names[idx]

        # Load the image file as a PIL image
        image = Image.open(os.path.join(self.root_dir, 'images', image_name)).convert('RGB')
        
        image_id = list(self.annotations)[idx]

        # Get the referring expressions (prompts) for this annotation
        prompts = [
            prompt['sent'] for prompt in self.annotations[image_id]['image']['sentences']
        ]

        # Get the bounding box for the prompts for the image
        bounding_box = self.corner_size_to_corners(self.annotations[image_id]['annotation']['bbox'])

        # Apply the transform if given
        if self.transform:
            image = self.transform(image)

        sample = [
            image,
            bounding_box,
            prompts,
        ]

        # Return the sample as a list
        return sample

In [3]:
# Load the dataset with the three splits

dataset_train = RefCocoG_Dataset('refcocog', 'refs(umd).p', 'instances.json', split='train')
dataset_val = RefCocoG_Dataset('refcocog', 'refs(umd).p', 'instances.json', split='val')
dataset_test = RefCocoG_Dataset('refcocog', 'refs(umd).p', 'instances.json', split='test')

dataset_splits = [
    dataset_train,
    dataset_val,
    dataset_test
]

In [4]:
# Display how many samples there are per split to check whether
# it was loaded correctly
len(RefCocoG_Dataset.full_annotations), len(dataset_train.annotations), len(dataset_val.annotations), len(dataset_test.annotations)

(49820, 42224, 2573, 5023)

In [5]:
# In order to be able to move lists of objects around (especially lists of
# PIL Images, rather than tensors), we need a custom collation function.
# This ensures we can feed the original images to the pipeline, rather
# than tensor-transformed (with scaling, cropping, etc.) versions.

def collate_differently_sized_prompts(batch):
    images = [item[0] for item in batch]
    bboxes = [item[1] for item in batch]
    prompts = [item[2] for item in batch]
    
    return list(images), list(bboxes), list(prompts)

def get_data(dataset_splits, batch_size=64, test_batch_size=256, num_workers=0):
    training_data = dataset_splits[0]
    validation_data = dataset_splits[1]
    test_data = dataset_splits[2]

    train_loader = torch.utils.data.DataLoader(training_data, batch_size, shuffle=True, drop_last=True, collate_fn=collate_differently_sized_prompts, num_workers=num_workers)
    val_loader = torch.utils.data.DataLoader(validation_data, test_batch_size, shuffle=False, collate_fn=collate_differently_sized_prompts, num_workers=num_workers)
    test_loader = torch.utils.data.DataLoader(test_data, test_batch_size, shuffle=False, collate_fn=collate_differently_sized_prompts, num_workers=num_workers)

    return train_loader, val_loader, test_loader

In [6]:
train_loader, val_loader, test_loader = get_data(dataset_splits, batch_size=64, test_batch_size=64, num_workers=0)

Get the correct device to work with.

In [7]:
if torch.cuda.is_available():
    device = torch.device("cuda:0") # First GPU
else:
    device = torch.device("cpu")

In [8]:
import torch

def cosine_similarity(a: torch.Tensor, b: torch.Tensor, keep_on_same_device=False):
    """
    Cosine Similarity

    Normalizes both tensors a and b. Returns <b, a.T> (inner product).
    """

    a_norm = a / a.norm(dim=-1, keepdim=True)
    b_norm = b / b.norm(dim=-1, keepdim=True)

    similarity = (b_norm @ a_norm.T)

    if keep_on_same_device:
        return similarity
    
    return similarity.cpu()

And metrics-related code:

In [9]:
import torch.nn as nn

class Evaluator(nn.Module):
    def __init__(self, device=None, models=None, preprocesses=None) -> None:
        super().__init__()
        
        if device:
            self.device = device
        else:
            self.device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
        
        if not models or not preprocesses:
            raise ValueError('Models and preprocesses for CLIP model should be provided')

        self.models = models
        self.preprocesses = preprocesses

        self.clip_backbones = list(self.models.keys())

    def forward(self, indices, images, gt_bounding_boxes, pred_bounding_boxes):
        self.device = indices.device
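        # Read the index from the device itself: parsing the last character
        # of its string form breaks for device indices >= 10
        if indices.is_cuda:
            self.device_index = indices.device.index
        else:
            self.device_index = 0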

        # -- Getting the right data and moving it to the correct device --

        # Images remain on the CPU because they are PIL Images, not Tensors
        # Converting to Tensors leads to errors with YOLO
        images = [images[i] for i in indices]

        gt_bounding_boxes = [torch.tensor(gt_bounding_boxes[i]).unsqueeze(0) for i in indices]
        pred_bounding_boxes = [torch.tensor(pred_bounding_boxes[i]) for i in indices]
        
        pred_crops = [self.get_cropped_bounding_boxes(image, bbox_list) for image, bbox_list in zip(images, pred_bounding_boxes)]
        gt_crops = [self.get_cropped_bounding_boxes(image, bbox, len(pred_bboxes)) for image, bbox, pred_bboxes in zip(images, gt_bounding_boxes, pred_bounding_boxes)]

        # Store overall results across all the provided backbones
        overall_results = []

        with torch.no_grad():
            for backbone in self.clip_backbones:
                preprocessed_gt_crops = torch.stack([self.preprocesses[backbone][self.device_index](image) for sample in gt_crops for image in sample]).to(self.device)
                preprocessed_pred_crops = torch.stack([self.preprocesses[backbone][self.device_index](image) for sample in pred_crops for image in sample]).to(self.device)

                gt_crop_features = self.models[backbone][self.device_index].encode_image(preprocessed_gt_crops)
                pred_crop_features = self.models[backbone][self.device_index].encode_image(preprocessed_pred_crops)

                result = cosine_similarity(gt_crop_features, pred_crop_features, keep_on_same_device=True)
                result = torch.diagonal(result)
                overall_results.append(result)
        
        # Computing an average for each column
        overall_results = torch.stack(overall_results).mean(dim=0)

        return overall_results

    def get_cropped_bounding_boxes(self, image, bounding_boxes, repeat=1):
        cropped_bounding_boxes = []
        
        if bounding_boxes is None:
            return [image] * repeat

        for bounding_box in bounding_boxes:
            cropped_img = image.crop((bounding_box[0].item(), bounding_box[1].item(), bounding_box[2].item(), bounding_box[3].item()))
            cropped_bounding_boxes += [cropped_img] * repeat
        
        if len(cropped_bounding_boxes) == 0:
            return [image] * repeat

        return cropped_bounding_boxes

In [10]:
from torchvision.ops import box_iou

def iou_metric(bounding_boxes, ground_truth_bounding_boxes):
    """
    Localization Accuracy Metric

    Intersection over Union (IoU) is a common metric measure for localization accuracy.
    """

    if not torch.is_tensor(ground_truth_bounding_boxes):
        ground_truth_bounding_boxes = torch.tensor(ground_truth_bounding_boxes)
    ground_truth_bounding_boxes = ground_truth_bounding_boxes.unsqueeze(0).to(device)

    return box_iou(bounding_boxes, ground_truth_bounding_boxes)

def cosine_similarity_metric(bounding_boxes, ground_truth_bounding_boxes):
    """
    Cosine Similarity Metric

    Cosine similarity is a common metric measure for semantic similarity.
    This measures the cosine similarity between the bounding boxes considered
    as sets of points.
    """

    if not torch.is_tensor(ground_truth_bounding_boxes):
        ground_truth_bounding_boxes = torch.tensor(ground_truth_bounding_boxes)
    ground_truth_bounding_boxes = ground_truth_bounding_boxes.to(device)
    
    return cosine_similarity(bounding_boxes, ground_truth_bounding_boxes)

def selected_area_cosine_similarity_metric(images, ground_truth_bounding_boxes, pred_bounding_boxes, evaluator_model):
    """
    Cosine Similarity between the embeddings of the crops Metric

    Cosine similarity is a common metric measure for semantic similarity.
    This measures the cosine similarity between the embeddings of the areas
    as cropped by the bounding boxes.
    """

    pred_bounding_boxes = [bbox.cpu().numpy() for bbox in pred_bounding_boxes]
    indices = torch.tensor(list(range(len(pred_bounding_boxes)))).to(device)
    
    return evaluator_model(indices, images, ground_truth_bounding_boxes, pred_bounding_boxes)

In [11]:
accuracy_iou_threshold = 0.5

And lastly, define a function to test the model on a given split of the dataset.

In [26]:
def test_model(data_loader, model, evaluator_model, device, verbose=False):
    IoUs = []
    cosine_similarities = []
    
    for batch_idx, (images, gt_bounding_boxes, prompts) in enumerate(data_loader):
        if verbose:
            print(f'-- Batch index: {batch_idx} --')

        # Run the model
        indices = torch.tensor(list(range(len(images)))).to(device)
        outputs = model(indices, images, prompts)

        # Group the outputs by sample, since each sample may have multiple
        # prompts which were treated independently of one another, thus
        # generating multiple bounding boxes.

        outputs_grouped_by_sample = []
        outputs_idx = 0
        for sample_prompts in prompts:
            outputs_grouped_by_sample.append(
                outputs[outputs_idx : outputs_idx + len(sample_prompts)]
            )
            outputs_idx += len(sample_prompts)

        cosine_similarities += selected_area_cosine_similarity_metric(images, gt_bounding_boxes, outputs_grouped_by_sample, evaluator_model)
        
        for output_bboxes, gt_bboxes in zip(outputs_grouped_by_sample, gt_bounding_boxes):
            # There is one output bounding box for each prompt given in input.
            # Note that each prompt for a given input is actually a list of prompts,
            # therefore it can contain an arbitrary number of prompts. Hence, there is
            # a bounding box for each one of them.
            
            result_ious = iou_metric(output_bboxes, gt_bboxes)
            for iou in result_ious:
                IoUs.append(iou)

    IoUs_to_cpu = np.array([tensor.item() if torch.is_tensor(tensor) else 0 for tensor in IoUs])
    mIoU = np.nanmean(IoUs_to_cpu)

    counter = np.sum([1 if iou >= accuracy_iou_threshold else 0 for iou in IoUs_to_cpu])
    accuracy = counter / len(IoUs)

    cosine_similarities_to_cpu = np.array([tensor.item() if torch.is_tensor(tensor) else 0 for tensor in cosine_similarities])
    m_cos_sim = np.nanmean(cosine_similarities_to_cpu)

    print('--- Metrics ---')
    print(f'Mean Intersection over Union (mIoU): {mIoU}')
    print(f'Accuracy: {accuracy}')
    print(f'Mean Cosine Similarity: {m_cos_sim}')

    return mIoU, accuracy, m_cos_sim

# Baseline

In this section, we present our implementation of the baseline as proposed in the assignment.

<div align="center"><img src="./asset/baseline-architecture.png"/></div>

More precisely, the region proposals are dealt with as follows:

<div align="center"><img src="./asset/baseline-detail.png"/></div>

As the image shows, region/object proposals are extracted for every image using the YOLOv5 model. The image is then cropped for every bounding box found by YOLO and these crops are then fed to the CLIP Image Encoder. The prompt(s) given for the image are also fed to CLIP's Textual Encoder.

The cosine similarity (i.e. the dot product of the normalized embeddings) is then computed for every prompt/object proposal pair, thus creating a matrix of size $n \times m$, where $n$ is the number of prompts and $m$ is the number of object proposals.

To ensure that a bounding box is computed for each prompt, rows are treated independently. The maximum value of the cosine similarity is therefore taken for each row, thus allowing for finding the best crop matching the $i^{\textit{th}}$ prompt.
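The matching step can be illustrated with a small, dependency-free sketch. It only mirrors the logic described above; the 2-D "embeddings" are made up, and the helper names are hypothetical (the actual implementation below works on CLIP features with PyTorch).

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def best_crop_per_prompt(text_embeddings, crop_embeddings):
    """
    Build the n x m cosine-similarity matrix (n prompts, m crops) and
    return it together with, for each prompt, the index of the most
    similar crop.
    """
    texts = [l2_normalize(t) for t in text_embeddings]
    crops = [l2_normalize(c) for c in crop_embeddings]
    similarity = [
        [sum(a * b for a, b in zip(text, crop)) for crop in crops]
        for text in texts
    ]
    # Rows are treated independently: one arg-max per prompt
    best = [max(range(len(row)), key=row.__getitem__) for row in similarity]
    return similarity, best

# Toy embeddings (made up for illustration): 2 prompts, 3 crops
text_embeddings = [[1.0, 0.0], [0.0, 2.0]]
crop_embeddings = [[0.1, 1.0], [3.0, 0.2], [1.0, 1.0]]
similarity, best = best_crop_per_prompt(text_embeddings, crop_embeddings)
# best[i] is the index of the region proposal selected for the i-th prompt
```

Applying a softmax to each row, as the implementation below does, does not change which crop is selected, since softmax preserves the ordering of the values within a row.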

The code implementation is presented below.

First of all, we have to load the small variant of YOLOv5.

In [13]:
if torch.cuda.is_available():
    yolo_models = [torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True).to(f'cuda:{i}') for i in range(torch.cuda.device_count())]
else:
    yolo_models = [torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True).to(device)]

Using cache found in C:\Users\operatore/.cache\torch\hub\ultralytics_yolov5_master
YOLOv5  2023-4-28 Python-3.10.11 torch-2.0.0+cu117 CUDA:0 (Quadro P2000, 5120MiB)

requirements: C:\Users\operatore\.cache\torch\hub\requirements.txt not found, check failed.

Fusing layers... 
YOLOv5s summary: 213 layers, 7225885 parameters, 0 gradients
Adding AutoShape... 


Then, we have to load CLIP. For the baseline, we use ResNet50x16. Note that, as for YOLO, we load one model for each available GPU: this allows us to exploit multi-GPU systems to speed up calculations.

In [14]:
import clip

clip_backbones = ['RN50x16']

models, preprocesses = {}, {}

for clip_backbone in clip_backbones:
    models[clip_backbone] = []
    preprocesses[clip_backbone] = []

    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            model, preprocess = clip.load(clip_backbone, device=f'cuda:{i}')
            
            models[clip_backbone].append(model)
            preprocesses[clip_backbone].append(preprocess)
    else:
        model, preprocess = clip.load(clip_backbone, device=device)
        models[clip_backbone].append(model)
        preprocesses[clip_backbone].append(preprocess)

In [20]:
model_to_use_for_baseline = 'RN50x16'

Now we can define the actual model. It extends `torch.nn.Module` so that it can be used like any other model and can be deployed in multi-GPU settings more easily.

In [21]:
import torch
import torch.nn as nn
import clip
import numpy as np

class BaselineModel(nn.Module):
    def __init__(self, device=None, models=None, preprocesses=None, yolo_models=None) -> None:
        """
        Initialize a BaselineModel.

        CLIP models have to be given, along with the preprocessors.
        YOLO models are expected as well.
        """
        
        super().__init__()
        
        if device:
            self.device = device
        else:
            self.device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
        
        if not models or not preprocesses:
            raise ValueError('Models and preprocesses for CLIP model should be provided')

        self.models = models
        self.preprocesses = preprocesses
        
        if not yolo_models:
            raise ValueError('Models for YOLO should be provided')
        self.yolo_models = yolo_models

    def forward(self, indices: torch.Tensor, images, prompts_list):
        """
        Forward call

        `indices` represents a list of indices for the images and prompts lists.
        `indices` is automatically split in a multi-GPU setting, therefore it allows
        for extracting only the inputs that should be processed by the single GPU.
        """

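        # Get the device (and its index) that this replica is running on.
        # The index is read from the device itself: parsing the last character
        # of its string form breaks for device indices >= 10
        self.device = indices.device
        if indices.is_cuda:
            self.device_index = indices.device.index
        else:
            self.device_index = 0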

        # -- Getting the right data and moving it to the correct device --

        # Images remain on the CPU because they are PIL Images, not Tensors
        images = [images[i] for i in indices]

        prompts_list = [prompts_list[i] for i in indices]
        prompts_tensor = [clip.tokenize(prompt_list).to(self.device) for prompt_list in prompts_list]

        # -- Actual processing --

        bounding_boxes = self.get_bounding_boxes(images)

        # It contains the predicted bounding box for each image for each prompt
        # Then, it is a list of length len(images) and for each entry there is a
        # list with len(prompts[i]), where i is the i-th image 
        overall_outputs = []

        with torch.no_grad():
            for idx, prompts_tensor_for_sample in enumerate(prompts_tensor):
                # Image crops
                image_crops = self.get_cropped_bounding_boxes(images[idx], bounding_boxes.xyxy[idx])

                preprocessed_image_crops = torch.stack([self.preprocesses[self.device_index](image).to(self.device) for image in image_crops])

                crop_features = self.models[self.device_index].encode_image(preprocessed_image_crops)
                crop_features /= crop_features.norm(dim=-1, keepdim=True)

                text_features = self.models[self.device_index].encode_text(prompts_tensor_for_sample)

                similarity = cosine_similarity(crop_features, text_features).float()
                texts_p = (100 * similarity).softmax(dim=-1)

                # Get the best-matching region proposals for the given prompt
                _, max_indices = texts_p.max(dim=1)
                try:
                    for max_idx in max_indices:
                        overall_outputs.append(
                            torch.tensor(bounding_boxes.xyxy[idx][max_idx, 0:4]).to(self.device)
                        )
                except Exception:  # e.g. no region proposals: fall back to the whole image
                    for max_idx in max_indices:
                        overall_outputs.append(
                            torch.tensor((0, 0, images[idx].size[0], images[idx].size[1])).to(self.device)
                        )

        return torch.stack(overall_outputs)

    def get_bounding_boxes(self, pil_images):
        """
        Extract bounding boxes (region proposals) using YOLOv5
        """

        bounding_boxes = self.yolo_models[self.device_index](pil_images)
        return bounding_boxes
    
    def get_cropped_bounding_boxes(self, image, bounding_boxes):
        """
        Crop the input image on each object found by YOLO.
        `bounding_boxes` is a list of boxes, where each box is represented
        as (top_left_x, top_left_y, bottom_right_x, bottom_right_y)
        """

        cropped_bounding_boxes = []
        
        for bounding_box in bounding_boxes:
            cropped_img = image.crop((bounding_box[0].item(), bounding_box[1].item(), bounding_box[2].item(), bounding_box[3].item()))
            cropped_bounding_boxes.append(cropped_img)

        if len(cropped_bounding_boxes) == 0:
            cropped_bounding_boxes.append(image)
                
        return cropped_bounding_boxes

# Instantiate the model
baseline_model = BaselineModel(models=models[model_to_use_for_baseline], preprocesses=preprocesses[model_to_use_for_baseline], yolo_models=yolo_models)

# And make it work with multiple GPUs if they are available
if torch.cuda.device_count() > 1:
    baseline_model = torch.nn.DataParallel(baseline_model)

Instantiate the evaluator with the same CLIP backbones:

In [22]:
evaluator_model = Evaluator(models=models, preprocesses=preprocesses)

if torch.cuda.device_count() > 1:
    evaluator_model = torch.nn.DataParallel(evaluator_model)

Run the model on the validation and the test set to measure the performance:

In [23]:
print('-- Test Set --')
test_model(test_loader, baseline_model, evaluator_model, device, verbose=True)

print('-- Validation Set --')
test_model(val_loader, baseline_model, evaluator_model, device, verbose=True)

-- Batch index: 0 --
--- Metrics ---
Mean Intersection over Union (mIoU): 0.5377648168453767
Accuracy: 0.5564516129032258
Mean Cosine Similarity: 0.879729240171371
-- Batch index: 0 --
--- Metrics ---
Mean Intersection over Union (mIoU): 0.5657518891818916
Accuracy: 0.559322033898305
Mean Cosine Similarity: 0.8941795219809322


(0.5657518891818916, 0.559322033898305, 0.8941795219809322)

## Results

In order to test the baseline thoroughly, we have run it using different CLIP backbone models. The following table summarizes the results we found for each metric.

| Backbone | Mean Intersection over Union | Accuracy | Mean Cosine Similarity |
| -------- | ---------------------------- | -------- | ----------------- |
| RN50x16  | 0.504                        | 0.517    | 0.877             |

## Issues

While the baseline performs decently, we noticed that it had severe issues in dealing with spatial relationships. We found that it performed rather well on those samples which do not have spatial information in the referring expression (e.g. "The banana" does not contain spatial information, "The banana on the right" does).

We strongly believe that this issue derives from the fact that cropping the proposed regions deletes the spatial information in the visual input, thus making it impossible (even for a human being!) to actually discern among a set of visually similar objects whose only distinctive feature is position in the image.

While the overall architecture may have other problems, we found this one to be the most "annoying". Therefore, we decided to focus on this single issue, with the goal of improving the mean Intersection over Union (mIoU) metric over the baseline.

# Core Ideas

We tried two main ways towards solving this issue:
* fine-tuning the CLIP model;
* leveraging CLIP's strong zero-shot and multi-modal reasoning capabilities.

The idea which led to the best results is the second one. However, we will briefly present all the paths we tried for completeness.

In addition, there are some ideas which we thought about but only partially implemented, due to lack of resources and/or because we were already focusing on our final candidate. We would like to briefly present them anyway, as we think they might be of interest.

## 1. Adding a bottleneck on top of CLIP

We tried adding a bottleneck on top of CLIP with the following architecture:

<div align="center"><img src="./asset/regressor-model.png"/></div>

It takes as input the concatenation of the text and image embeddings produced by CLIP. It produces as output a set of coordinates, which represent the top-left and bottom-right points of the bounding box.

We did not expect it to produce good results, but we implemented it in order to measure its performance accurately. It did not improve on the baseline, and we noticed that it usually produces much larger bounding boxes than the ground-truth ones.

On the other hand, it does not include any region-proposal algorithm, thus being faster in estimating a bounding box.

We trained on the full training set for 30 epochs. We utilized the Adam optimizer, with a learning rate of $0.01$ (the Stochastic Gradient Descent (SGD) optimizer tended to diverge, thus being unstable). We tried both the IoU loss and the Mean Squared Error (MSE) loss. In either case, we were not able to get satisfactory results.

We chose this architecture as a test to understand whether it was possible to solve the task in this simple way. The architecture is a basic bottleneck, which progressively reduces the dimensionality at each layer (1024 → 512 → 256 → 128 → 4), thus producing a four-value output as expected.

The following cell shows the architecture's code.

In [16]:
import clip

backbone = 'ViT-B/32'

models, preprocesses = [], []

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        model, preprocess = clip.load(backbone, device=f'cuda:{i}')
        
        models.append(model)
        preprocesses.append(preprocess)
else:
    model, preprocess = clip.load(backbone, device=device)
    models.append(model)
    preprocesses.append(preprocess)

In [17]:
import torch
import torch.nn as nn
import clip
import torch.nn.functional as F

class MultiNetModel(nn.Module):
    def __init__(self, models, preprocesses, downstream_model):
        super().__init__()

        self.models = models
        self.preprocesses = preprocesses

        self.clip_out_features = 512

        self.regressor = downstream_model

    @torch.autocast(device_type="cpu", dtype=torch.bfloat16)
    @torch.autocast(device_type="cuda")
    def forward(self, indices: torch.Tensor, images, prompts) -> torch.Tensor:
        # Get the indices of the samples to work with in the batch for the
        # current computing device
        self.device = indices.device
        if indices.is_cuda:
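            # Use the device's own index (parsing the last character of the
            # device string would break for indices >= 10)
            self.device_index = self.device.index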
        else:
            self.device_index = 0

        model, preprocess = self.models[self.device_index], self.preprocesses[self.device_index]
        
        images = [images[i] for i in indices]
        prompts = [prompts[i] for i in indices]

        preprocessed_images = torch.stack([
            preprocess(image) for image in images
        ]).to(self.device)
        preprocessed_prompts = torch.cat([
            tokenized for tokenized in
            [clip.tokenize(prompt_list) for prompt_list in prompts]
        ]).to(self.device)

        # Storing the index for each prompt so as to easily retrieve its encoding
        prompts_indices_for_image = []
        start_index = 0
        for prompt_list in prompts:
            prompts_indices_for_image.append(
                torch.tensor(list(range(0, len(prompt_list)))).to(self.device) + start_index
            )
            start_index += len(prompt_list)

        with torch.no_grad():
            images_features = model.encode_image(preprocessed_images)
            texts_features = model.encode_text(preprocessed_prompts)

        # Retrieve the given prompts for each image
        text_features_by_image = [
            texts_features[indices]
            for indices in prompts_indices_for_image
        ]

        # Feed each (image, prompt) pair to the FFNN and get
        # predictions in output
        bboxes = []
        for image_features, text_features_for_image in zip(images_features, text_features_by_image):
            for text_features in text_features_for_image:
                x = torch.cat([image_features, text_features], dim=0)#.to(torch.float16)
                bbox = self.regressor(x)
                bboxes.append(bbox)
        
        return torch.stack(bboxes)

In [18]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegressorModel(nn.Module):
    def __init__(self, clip_out_features):
        super().__init__()

        self.clip_out_features = clip_out_features

        self.fc1 = nn.Linear(2 * self.clip_out_features, self.clip_out_features)
        self.fc2 = nn.Linear(self.clip_out_features, self.clip_out_features // 2)
        self.fc3 = nn.Linear(self.clip_out_features // 2, 128)
        self.fc4 = nn.Linear(128, 4)

    def forward(self, x):
        """
        x is a tensor of features representing CLIP's image features and
        CLIP's text features concatenated into a tensor with length 1024 (512 + 512)
        """

        x = self.fc1(x)
        x = F.relu(x)

        x = self.fc2(x)
        x = F.relu(x)

        x = self.fc3(x)
        x = F.relu(x)

        x = self.fc4(x)

        return x

In [85]:
# Instantiate the model
regressor_model = RegressorModel(512).to(device)
multinet_model = MultiNetModel(models, preprocesses, regressor_model)

# And make it work with multiple GPUs if they are available
if torch.cuda.device_count() > 1:
    multinet_model = torch.nn.DataParallel(multinet_model)

Let's define training and testing functions:

In [20]:
# Using an Adam optimizer (the `wd` and `momentum` arguments are currently ignored)
def get_optimizer(model, lr, wd=None, momentum=None):
    try:
        actual_model = model.module
    except AttributeError:
        actual_model = model
  
    optimizer = torch.optim.Adam(actual_model.parameters(), lr=lr)

    return optimizer

In [170]:
def iou_loss(pred_bboxes, gt_bboxes):
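    # Top-left corner of the intersection: element-wise maximum
    x1 = torch.max(pred_bboxes[:, 0], gt_bboxes[:, 0])
    y1 = torch.max(pred_bboxes[:, 1], gt_bboxes[:, 1])
    # Bottom-right corner of the intersection: element-wise minimum
    x2 = torch.min(pred_bboxes[:, 2], gt_bboxes[:, 2])
    y2 = torch.min(pred_bboxes[:, 3], gt_bboxes[:, 3])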

    intersection_area = torch.clamp(x2 - x1, min=0) * torch.clamp(y2 - y1, min=0)

    predicted_area = (pred_bboxes[:, 2] - pred_bboxes[:, 0]) * (pred_bboxes[:, 3] - pred_bboxes[:, 1])
    target_area = (gt_bboxes[:, 2] - gt_bboxes[:, 0]) * (gt_bboxes[:, 3] - gt_bboxes[:, 1])

    union_area = predicted_area + target_area - intersection_area

    # Adding 1e-6 to the denominator for numerical stability
    # print(f'IoU is: {(intersection_area / (union_area + 1e-6))}, with intersection {intersection_area} and union {union_area}')
    iou_loss = torch.clamp(1.0 - (intersection_area / (union_area + 1e-6)), min=0.0, max=1)
    
    # print(f'iou loss: {iou_loss}, with mean {iou_loss.nanmean()}')

    mean = iou_loss.nanmean()

    # Ensure not to return NaN
    if torch.isnan(mean):
        # print('returning 1')
        return torch.tensor(1.0)
    # print(f'returning {mean}')
    return mean

def get_cost_function_multinet_model():
    cost_function = torch.nn.MSELoss()
    # cost_function = iou_loss
    return cost_function

In [171]:
# Defining both training and testing steps

@torch.autocast(device_type="cpu", dtype=torch.bfloat16)
@torch.autocast(device_type="cuda")
def training_step(net, data_loader, optimizer, cost_function, device='cuda'):
    samples = 0.0
    cumulative_loss = 0.0
    cumulative_iou = 0.0

    # Set the network to training mode
    net.train()
    optimizer.zero_grad()

    # Iterate over the training set
    for batch_idx, (images, gt_bounding_boxes, prompts) in enumerate(data_loader):
        # print(f'-- Batch index: {batch_idx} --')

        indices = torch.tensor(list(range(len(images)))).to(device)
        prompts = [prompt_list[0:1] for prompt_list in prompts]
        
        # Forward pass
        outputs = net(indices, images, prompts)

        # Ground truth
        gt_bounding_boxes = torch.tensor(gt_bounding_boxes).to(device)

        # Loss computation
        loss = cost_function(outputs, gt_bounding_boxes)

        # Backward pass
        loss.backward()
        
        # Parameters update
        optimizer.step()
        
        # Gradients reset
        optimizer.zero_grad()

        # Update statistics
        samples += float(outputs.shape[0])
        cumulative_loss += float(loss.item())

        # Compute IoU
        for output_bbox, gt_bbox in zip(outputs, gt_bounding_boxes):
            cumulative_iou += np.nansum(np.array([tensor.item() if torch.is_tensor(tensor) else 0 for tensor in iou_metric(output_bbox.unsqueeze(0), gt_bbox)]))

    return cumulative_loss / samples, cumulative_iou / samples

def test_step(net, data_loader, cost_function, device='cuda'):
    samples = 0.0
    cumulative_loss = 0.0
    cumulative_iou = 0.0

    # set the network to evaluation mode
    net.eval() 

    # Disable gradient computation (only testing!)
    with torch.no_grad():
        # Iterate over the test set
        for batch_idx, (images, gt_bounding_boxes, prompts) in enumerate(data_loader):
            # print(f'-- Batch index: {batch_idx} --')
            
            indices = torch.tensor(list(range(len(images)))).to(device)
            prompts = [prompt_list[0:1] for prompt_list in prompts]
            
            # Forward pass
            outputs = net(indices, images, prompts)

            # Ground truth
            gt_bounding_boxes = torch.tensor(gt_bounding_boxes).to(device)

            # Loss computation
            loss = cost_function(outputs, gt_bounding_boxes)

            # Update statistics
            samples += float(outputs.shape[0])
            cumulative_loss += float(loss.item())

            # Compute IoU
            for output_bbox, gt_bbox in zip(outputs, gt_bounding_boxes):
                cumulative_iou += np.nansum(np.array([tensor.item() if torch.is_tensor(tensor) else 0 for tensor in iou_metric(output_bbox.unsqueeze(0), gt_bbox)]))

    return cumulative_loss / samples, cumulative_iou / samples

Now we can define the function that actually handles and interleaves training and testing:

In [172]:
from torch.utils.tensorboard import SummaryWriter

def log_values(writer, step, loss, accuracy, prefix):
    writer.add_scalar(f'{prefix}/loss', loss, step)
    writer.add_scalar(f'{prefix}/accuracy', accuracy, step)

def final_tests(net, train_loader, val_loader, test_loader, cost_function, device, writer=None, epochs=None):
    # Compute final evaluation results
    print('After training:')
    train_loss, train_accuracy = test_step(net, train_loader, cost_function, device=device)
    val_loss, val_accuracy = test_step(net, val_loader, cost_function, device=device)
    test_loss, test_accuracy = test_step(net, test_loader, cost_function, device=device)

    if writer is not None and epochs is not None:
        # Log to TensorBoard
        log_values(writer, epochs, train_loss, train_accuracy, "train")
        log_values(writer, epochs, val_loss, val_accuracy, "validation")
        log_values(writer, epochs, test_loss, test_accuracy, "test")

    print('\tTraining loss {:.5f}, Training accuracy {:.2f}'.format(train_loss, train_accuracy))
    print('\tValidation loss {:.5f}, Validation accuracy {:.2f}'.format(val_loss, val_accuracy))
    print('\tTest loss {:.5f}, Test accuracy {:.2f}'.format(test_loss, test_accuracy))
    print('-----------------------------------------------------')

def train(
    model,
    train_loader, val_loader, test_loader,
    cost_function,
    device='cuda:0',
    learning_rate=0.01,
    weight_decay=0.000001,
    momentum=0.9,
    epochs=10,
    skip_initial_train_test=True,
    skip_initial_val_test=False,
    skip_initial_test_test=False,
    ):

    # Create a logger
    writer = SummaryWriter(log_dir="runs/exp1")

    # Store the network
    net = model
    
    # Instantiate the optimizer
    optimizer = get_optimizer(net, learning_rate, weight_decay, momentum)

    # Evaluate before training and log to TensorBoard
    print('Before training:')
    if not skip_initial_train_test:
        train_loss, train_accuracy = test_step(net, train_loader, cost_function, device=device)
        log_values(writer, -1, train_loss, train_accuracy, "train")
        print('\tTraining loss {:.5f}, Training accuracy {:.2f}'.format(train_loss, train_accuracy))

    if not skip_initial_val_test:
        val_loss, val_accuracy = test_step(net, val_loader, cost_function, device=device)
        log_values(writer, -1, val_loss, val_accuracy, "validation")
        print('\tValidation loss {:.5f}, Validation accuracy {:.2f}'.format(val_loss, val_accuracy))
        
    if not skip_initial_test_test:
        test_loss, test_accuracy = test_step(net, test_loader, cost_function, device=device)
        log_values(writer, -1, test_loss, test_accuracy, "test")
        print('\tTest loss {:.5f}, Test accuracy {:.2f}'.format(test_loss, test_accuracy))

    print('-----------------------------------------------------')

    print('\n-- Starting training --')

    # Train for `epochs` steps
    for e in range(epochs):
        train_loss, train_accuracy = training_step(net, train_loader, optimizer, cost_function, device=device)
        val_loss, val_accuracy = test_step(net, val_loader, cost_function, device=device)
        
        # Log to TensorBoard
        log_values(writer, e, train_loss, train_accuracy, "train")
        log_values(writer, e, val_loss, val_accuracy, "validation")

        print('Epoch: {:d}'.format(e+1))
        print('\tTraining loss {:.5f}, Training accuracy {:.2f}'.format(train_loss, train_accuracy))
        print('\tValidation loss {:.5f}, Validation accuracy {:.2f}'.format(val_loss, val_accuracy))
        print('-----------------------------------------------------')

    final_tests(net, train_loader, val_loader, test_loader, cost_function, device, writer, epochs)

    # closes the logger
    writer.close()

Now we can train the network:

In [86]:
# Define the cost function
cost_function = get_cost_function_multinet_model()

# And train
train(multinet_model, train_loader, val_loader, test_loader, cost_function, epochs=30, device=device, skip_initial_test_test=True, skip_initial_train_test=True, skip_initial_val_test=True)

Before training:
-----------------------------------------------------

-- Starting training --
torch.Size([64, 4]) torch.Size([64, 4])
tensor([[0.0000e+00, 3.2748e+02, 4.5900e+02, 6.3169e+02],
        [3.3342e+02, 3.4340e+01, 5.7159e+02, 4.0771e+02],
        [3.1641e+02, 3.1116e+02, 4.6317e+02, 4.8000e+02],
        [4.0402e+02, 1.3467e+02, 4.7775e+02, 3.5787e+02],
        [4.3619e+02, 2.1093e+02, 5.9039e+02, 4.2631e+02],
        [3.9568e+02, 1.4919e+02, 4.8865e+02, 4.1730e+02],
        [5.2679e+02, 9.6910e+01, 6.4000e+02, 2.6676e+02],
        [8.7950e+01, 1.8550e+01, 4.2692e+02, 5.6745e+02],
        [4.1077e+02, 1.4659e+02, 5.3907e+02, 3.6777e+02],
        [3.6672e+02, 1.5425e+02, 6.1567e+02, 4.0607e+02],
        [3.2939e+02, 9.2370e+01, 5.7968e+02, 4.5210e+02],
        [1.3315e+02, 1.3172e+02, 4.6067e+02, 3.2931e+02],
        [2.1573e+02, 1.2809e+02, 3.8539e+02, 2.4494e+02],
        [3.4899e+02, 0.0000e+00, 4.9742e+02, 2.5417e+02],
        [1.1968e+02, 1.3650e+01, 3.2043e+02, 4.7800e

KeyboardInterrupt: 

The model overfits very badly: the training accuracy (which, in this code, is the mean IoU) keeps increasing, up to 0.49, whereas the validation accuracy drops after the first epochs, ending around 0.19. The same goes for the loss: on the training set it decreases continuously, whilst on the validation set it increases considerably.

Regularization techniques may be introduced, however we think this is not the "right" way to address this task. A more complex and larger model is presented in the next section.

### Results

| Backbone | Training Mean IoU | Validation Mean IoU | Test Mean IoU |
| -------- | ----------------- | ------------------- | ------------- |
| ViT-B/32  | 0.49             | 0.19              | 0.19          |

We used the Mean Squared Error (MSE) loss and the Adam optimizer. We ran for 30 epochs on the whole shuffled training set and evaluated on the whole training set, validation set and test set.

## 2. Adding a Fully-Connected Neural Network on top of CLIP

We tried adding a fully-connected neural network on top of CLIP. Unlike the previous idea, which reduced the dimensionality at each layer, this one first enlarges it and then shrinks it down to only four outputs (like the previous one).

The architecture is as follows:

<div align="center"><img src="./asset/large-regressor-model.png"/></div>

This network is larger, and we trained it similarly to the previous one. We had to use the Mean Squared Error loss, as the IoU loss was not stable enough. We also tried to combine the two as:

$\mathcal{L} = \lambda_1 \textit{MSE} + \lambda_2 \textit{IoU}$

where we set $\lambda_1 = 0.7$ and $\lambda_2 = 1.0$, and $\textit{IoU}$ denotes the IoU loss (one minus the Intersection over Union). In the end, MSE alone gave the best results.

Mean Squared Error, in this specific context, computes the average squared error between each pair of coordinates (predicted, ground truth). Therefore, if a bounding box is "perfect" (perfect overlap with the ground truth), it will have an MSE of 0. MSE penalizes small errors less than large errors, so it should allow the network to predict reasonably good values even if precision with respect to the ground truth is not optimal. Moreover, MSE is the standard go-to loss function for regression problems, which is how we framed this problem.

Intersection over Union focuses instead on maximizing the overlap between the two boxes. If the prediction is (x, y, x, y) (so it collapses to a single point) and the ground truth is (x, y, ..., ...), then MSE would be quite low, since half the coordinates perfectly match those of the ground truth. However, the goal is to make sure the output has an area and that the overlap with the ground truth is as large as possible. Thus, introducing the IoU may help serve this purpose.
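To make the difference between the two losses concrete, here is a small worked example in plain Python. The boxes and helper names (`mse`, `iou`, `gt_box`, `degenerate`, `enclosing`) are made up for illustration and are not part of our pipeline. It compares a prediction collapsed to the ground truth's top-left corner with a box that is too large but fully covers the target.

```python
def mse(pred, gt):
    """Mean squared error over the four box coordinates."""
    return sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(pred)


def iou(pred, gt):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    inter_h = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    intersection = inter_w * inter_h
    area_pred = max(0.0, pred[2] - pred[0]) * max(0.0, pred[3] - pred[1])
    area_gt = max(0.0, gt[2] - gt[0]) * max(0.0, gt[3] - gt[1])
    union = area_pred + area_gt - intersection
    return intersection / union if union > 0 else 0.0


gt_box = (100.0, 100.0, 120.0, 120.0)
degenerate = (100.0, 100.0, 100.0, 100.0)  # collapsed to a point: zero area
enclosing = (85.0, 85.0, 135.0, 135.0)     # too large, but covers the target

for name, box in [("degenerate", degenerate), ("enclosing", enclosing)]:
    print(f"{name}: MSE = {mse(box, gt_box):.1f}, IoU = {iou(box, gt_box):.2f}")
# degenerate: MSE = 200.0, IoU = 0.00
# enclosing: MSE = 225.0, IoU = 0.16
```

In this toy case MSE ranks the zero-area prediction above the enclosing one, while IoU ranks them the other way round, which is the behaviour the combined loss is meant to counteract.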

Training took approximately one hour when using the pre-computed embeddings. In this Notebook, where they are computed on demand, it takes a lot longer, up to some hours. Indeed, to optimize computing time, we pre-computed the embeddings of all samples in the dataset and stored them on the file system, so that we could train several networks with different parameters much more quickly.

> To speed training up, we preprocessed the whole dataset at the beginning by computing the embeddings for both images and textual prompts. Then we stored all those embeddings into a CSV file (approximate size of 1.2 GB for the training set). Eventually, we loaded the whole file containing all the embeddings of the training set into memory, thus cutting off CLIP to focus only on training the neural network. Otherwise, it would have taken much longer.
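The caching scheme described above can be sketched as follows. This is only an illustration under stated assumptions: `fake_embed` is a hypothetical stand-in for CLIP's encoders, the one-row-per-sample CSV layout is our guess at a reasonable format, and the function names are not taken from our actual preprocessing scripts.

```python
import csv
import os
import tempfile


def fake_embed(sample_id, dim=4):
    """Hypothetical stand-in for CLIP's encoders: a deterministic toy vector."""
    return [float((sample_id * 31 + k) % 7) for k in range(dim)]


def precompute_embeddings(sample_ids, csv_path, embed_fn):
    """Run the (expensive) encoder once per sample and store one row per sample."""
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        for sample_id in sample_ids:
            writer.writerow([sample_id, *embed_fn(sample_id)])


def load_embeddings(csv_path):
    """Load every cached embedding into memory, keyed by sample id."""
    with open(csv_path, newline="") as f:
        return {int(row[0]): [float(v) for v in row[1:]] for row in csv.reader(f)}


csv_path = os.path.join(tempfile.mkdtemp(), "train_embeddings.csv")
precompute_embeddings(range(3), csv_path, fake_embed)

# Training can now read from `cache` instead of calling the encoder again
cache = load_embeddings(csv_path)
print(cache[2])
```

With such a cache in place, each training run only touches the small regressor, so trying different hyperparameters no longer requires re-encoding the dataset.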

We chose this architecture by starting from the one presented in the previous section and expanding it, also introducing dropout layers to try to limit overfitting. The architecture first projects the input into a higher-dimensional space, which is then reduced progressively down to 4 values, as in the previous architecture.

The introduction of dropout layers with probability $p = 0.5$ should reduce at least partially the issues encountered with the previous model. Adding more linear layers should instead help in making it possible to learn a better, richer and more complex representation of the data.

In the following, we present the code and architecture of this model. Some elements are shared with the previous architecture, therefore they are not repeated again. To successfully run the following code, then, it is necessary to run the cells in the previous section to get CLIP models loaded, the MultiNetModel architecture defined, etc.

In [52]:
class LargeRegressorModel(nn.Module):
    def __init__(self, clip_out_features):
        super().__init__()

        self.clip_out_features = clip_out_features

        self.fc1 = nn.Linear(2 * self.clip_out_features, 4 * self.clip_out_features)
        self.fc2 = nn.Linear(4 * self.clip_out_features, 2 * self.clip_out_features)
        self.fc3 = nn.Linear(2 * self.clip_out_features, self.clip_out_features)
        self.fc4 = nn.Linear(self.clip_out_features, self.clip_out_features // 2)
        self.fc5 = nn.Linear(self.clip_out_features // 2, self.clip_out_features // 8)
        self.fc6 = nn.Linear(self.clip_out_features // 8, 4)

        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(p=0.5)

    def forward(self, x):
        """
        x is a tensor of features representing CLIP's image features and
        CLIP's text features concatenated into a tensor with length 1024 (512 + 512)
        """

        x = self.fc1(x)
        x = self.relu(x)
        x = self.dropout(x)

        x = self.fc2(x)
        x = self.relu(x)
        x = self.dropout(x)

        x = self.fc3(x)
        x = self.relu(x)
        x = self.dropout(x)

        x = self.fc4(x)
        x = self.relu(x)
        x = self.dropout(x)

        x = self.fc5(x)
        x = self.relu(x)
        x = self.dropout(x)

        x = self.fc6(x)

        return x

In [177]:
# Instantiate the model
large_regressor_model = LargeRegressorModel(512).to(device)
multinet_large_model = MultiNetModel(models, preprocesses, large_regressor_model)

# And make it work with multiple GPUs if they are available
if torch.cuda.device_count() > 1:
    multinet_large_model = torch.nn.DataParallel(multinet_large_model)

Let's define the cost function:

In [178]:
mse = torch.nn.MSELoss()

def loss_function_multinet_large_model(pred_bboxes, gt_bboxes):
    lambda_1 = 0.7
    lambda_2 = 1.0

    return lambda_1 * mse(pred_bboxes, gt_bboxes) + lambda_2 * iou_loss(pred_bboxes, gt_bboxes)

In [179]:
def get_cost_function_multinet_large_model():
    cost_function = loss_function_multinet_large_model
    return cost_function

Finally, we train the model.

In [180]:
# Define the cost function
cost_function = get_cost_function_multinet_large_model()

# And train
train(multinet_large_model, train_loader, val_loader, test_loader, cost_function, learning_rate=0.001, epochs=30, device=device)

Before training:
	Validation loss 1203.30518, Validation accuracy 0.00
	Test loss 1204.80607, Test accuracy 0.00
-----------------------------------------------------

-- Starting training --
Epoch: 1
	Training loss 274.51980, Training accuracy 0.13
	Validation loss 200.15798, Validation accuracy 0.17
-----------------------------------------------------
After training:
	Training loss 191.87883, Training accuracy 0.17
	Validation loss 200.15798, Validation accuracy 0.17
	Test loss 196.14407, Test accuracy 0.17
-----------------------------------------------------


## 3. GradCam and CLIPSeg

We tried using GradCam on CLIP with a Vision Transformer backbone to understand what CLIP is looking at. The idea was to compute bounding boxes starting from the heatmaps produced by GradCam itself. However, after manually examining the outputs for hundreds of inputs from the test set, we found it to be quite imprecise, so we decided not to develop this idea further.

We then read the paper of CLIPSeg, a way of predicting the segmentation area of an object using CLIP. The overall architecture is quite simple, as it uses a frozen version of CLIP and a transformer is trained on the values of neurons sampled from the last layers of CLIP.

We tested CLIPSeg on hundreds of inputs as well, and we found it to be very good at producing a segmentation mask for the object referred to by the prompt. Computing a bounding box given the segmentation mask is straightforward.
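As an illustration of that last step, the following minimal sketch derives a bounding box from a soft segmentation mask. It assumes the mask is a 2D grid of scores in $[0, 1]$; the helper name `mask_to_bbox`, the 0.5 threshold and the exclusive bottom-right convention are choices made for this example only.

```python
def mask_to_bbox(mask, threshold=0.5):
    """Return (x1, y1, x2, y2) enclosing all pixels above `threshold`.

    `mask` is a list of rows of scores. The bottom-right corner is exclusive.
    Returns None when no pixel exceeds the threshold.
    """
    rows = [r for r, row in enumerate(mask) if any(v > threshold for v in row)]
    cols = [
        c for c in range(len(mask[0]))
        if any(row[c] > threshold for row in mask)
    ]
    if not rows or not cols:
        return None
    return (cols[0], rows[0], cols[-1] + 1, rows[-1] + 1)


toy_mask = [
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.9, 0.8, 0.0, 0.0],
    [0.0, 0.7, 0.9, 0.9, 0.0, 0.0],
    [0.0, 0.0, 0.6, 0.2, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
]
print(mask_to_bbox(toy_mask))  # (1, 1, 4, 4)
```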

However, it was unable to discern spatial relationships, which was the issue we wanted to address. Therefore, we had to discard the idea of introducing CLIPSeg in our architecture, even though it would have allowed us to fully remove YOLO and any other region-proposal algorithm.

## 4. Natural Language Understanding (not implemented, only investigated)

Another idea we would like to propose is to work on the textual input to retrieve relevant information by parsing it using tools like [spaCy](https://spacy.io).

By parsing the sentence, Noun Phrases (NP) can be extracted. Assuming each NP refers to an object in the image, it is possible to find a match between each NP and each object proposal by computing the embeddings of both of them and getting the object with the highest cosine similarity for each NP.
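Since we did not implement this idea, the snippet below is only a sketch of the matching step, assuming the embeddings of the noun phrases and of the object proposals have already been computed. The toy three-dimensional vectors and the names `cosine_similarity` and `match_phrases_to_proposals` are invented for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def match_phrases_to_proposals(phrase_embeddings, proposal_embeddings):
    """Map each noun phrase to the index of its most similar proposal."""
    matches = {}
    for phrase, phrase_embedding in phrase_embeddings.items():
        scores = [
            cosine_similarity(phrase_embedding, proposal)
            for proposal in proposal_embeddings
        ]
        matches[phrase] = max(range(len(scores)), key=scores.__getitem__)
    return matches


# Toy embeddings standing in for CLIP's text and image features
phrases = {"the banana": [1.0, 0.1, 0.0], "the table": [0.0, 0.2, 1.0]}
proposals = [[0.9, 0.0, 0.1], [0.1, 0.1, 0.9], [0.5, 0.5, 0.5]]
print(match_phrases_to_proposals(phrases, proposals))
# {'the banana': 0, 'the table': 1}
```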

Then, other structures (for instance, Verb Phrases (VP)) can be extracted, thus making it theoretically possible to relate NPs and find the actual object referred to by the prompt.

However, we did not implement this solution as it has at least two issues:
1. If there are multiple objects which could be the "right" one and these only differ by their spatial position in the image (we found this condition to be quite common in the dataset), this would not work, as image crops do not contain any spatial information.
2. The prompts in the dataset are rarely well-formed and complete sentences. Most of the time, they are simple descriptions which often lack verbs and sometimes contain grammatical mistakes. Therefore, spaCy would struggle to deal with them, and more complex models, which introduce additional overhead on top of that of YOLO, would be needed.

We also thought about using GPT-2 ([code](https://github.com/openai/gpt-2) and [paper](https://arxiv.org/abs/1908.09203)) to augment/engineer the prompts in an automatic way. More precisely, GPT-2 could be used to add additional information to the prompts based on their actual content. This could help the downstream pipeline/architecture better recognize the dominant/main object referred to by the textual piece of information. However, this technique can add noise to the input, thus leading to a degradation of performance.

## 5. Graphs (partially implemented)

Another idea we would like to propose is to construct a graph on the object proposals representing their spatial relationships. **In other terms, building a scene graph.** In the graph, each node would represent an object and each edge the spatial relationship between the two objects (nodes).

The graph would be represented by an $m \times m$ adjacency matrix, $m$ being the number of object proposals.

We implemented the construction of the graph in this way:
1. Compute the centroid of each object
2. Relate it with any other centroid and test whether it lies on its left, right, top, bottom. Transform this information into a numerical value to weight the edge

The second step, which we decided not to implement due to limited computational resources and in order to focus on other strategies, would be to create a Graph Neural Network (GNN) that learns to select a node of the graph given CLIP's embedding of the prompt and, possibly, of each object proposal. This would reduce to a classification task (telling, for each node, whether it is the correct one or not), thus a Cross-Entropy loss may work fine.

While this technique could work excellently in the task of spatial-reference visual grounding, we believe that it may struggle with all the samples that do not contain any spatial information. We think it may perform as well as the baseline in these cases, since the graph information (which, as mentioned earlier, encodes the spatial relationships among objects through a scene graph) would be basically ignored owing to the lack of spatial references in the textual prompt.

## 6. Incorporating more information into the inputs

We think that CLIP's extensive training set, consisting of approximately 400 million (image, text) pairs, is so large that CLIP must have learnt good enough representations for our task.

The main problem, then, is to understand CLIP's representations and how to leverage them for visual grounding. Therefore, we focused on:
* gathering knowledge on CLIP's encoders by extensive tests, and
* augmenting the features in such a way to exploit CLIP's capabilities.

We tried the following two approaches:
* incorporating additional knowledge into the visual prompt;
* incorporating additional knowledge into the textual prompt.

It turned out that the best performance can be achieved by combining the two. The following sections explain in detail two architectures that we devised.

### 6.1 Encoding the position using colors

The first idea we present is encoding the position (spatial) information using colors. We came up with this idea because, during our extensive preliminary tests, we noticed that CLIP was able to distinguish among objects of the same category whose only difference is color (e.g. two teddy bears, where one is light brown and one is dark brown).

To achieve a successful incorporation of spatial information into the prompts, we tried two paths:
1. encoding it in the background;
2. encoding it on the whole image.

#### 6.1.1 Encoding position in the background

For the first path, we did the following:
1. Find region proposals
1. Create a new image with the same size as the given one for each region proposal
1. Set all the image content but the proposed region to a given color. The color codes for the position of the crop in the image. The position is computed using the centroid of the region proposal. More in detail, the centroid coordinates are normalized in the range $[0, 1] \times [0, 1]$. Then, a color is assigned based on a Look-Up Table (LUT), which is defined as:

| | Left | Center | Right |
| - | - | - | -|
| Top | Blue | Yellow | Red |
| Center | Green | Magenta | Black |
| Bottom | Brown | Cyan | White |

Overall, nine regions were defined, each coding for a different spatial location.

4. Augment the prompt to make CLIP take into account this color information. Augmentation is performed as: `prompt + '. The image has a {color} background'`, where `{color}` would take all the possible values on a given row or column based on the prompt itself. For instance, if the prompt contained the keyword "left", then it would be augmented by generating new prompts, one for each color in the left column of the table above.

<div align="center"><img src="./asset/colors-detail.png"/></div>

The following cells contain the highlights extracted from the Python code that we developed for this solution. We are not presenting the whole, runnable code as it is similar to the next section (for which instead we present the code) and we do not want to make it redundant.

The following function is used to perform textual prompt augmentation, where a prompt is checked for containing a spatial reference from the set $\{\textit{left}, \textit{right}, \textit{above}, \textit{below}\}$. If there is a match, new prompts are created for three colors (the code emits two phrasings per color):
* for left and right, three prompts corresponding, respectively, to the colors on the left or right column of the grid shown above;
* for above and below, three prompts corresponding to the colors on the upper or lower rows of the grid shown above.

```python
def augment_prompts(sample):
    """
    Return the original prompts of `sample` plus their color-augmented variants.
    (The function name and signature are illustrative: only the body is
    excerpted from our code.)
    """
    prompts = []

    for prompt in sample['image']['sentences']:
        prompts.append(prompt['sent'])

        # Left column of the LUT
        if 'left' in prompt['tokens']:
            for color in ['blue', 'green', 'brown']:
                prompts.append(f"{prompt['sent']}. The image has a {color} background")
                prompts.append(f"{color}. {prompt['sent']}")

        # Right column of the LUT
        if 'right' in prompt['tokens']:
            for color in ['red', 'black', 'white']:
                prompts.append(f"{prompt['sent']}. The image has a {color} background")
                prompts.append(f"{color}. {prompt['sent']}")

        # Top row of the LUT
        if 'above' in prompt['tokens']:
            # Keep only the part of the sentence that precedes the keyword
            updated = prompt['sent']
            updated = updated[:updated.index('above')].strip()

            for color in ['blue', 'yellow', 'red']:
                prompts.append(f"{updated}. The image has a {color} background")
                prompts.append(f"{color}. {updated}")

        # Bottom row of the LUT
        if 'below' in prompt['tokens']:
            # Keep only the part of the sentence that precedes the keyword
            updated = prompt['sent']
            updated = updated[:updated.index('below')].strip()

            for color in ['brown', 'cyan', 'white']:
                prompts.append(f"{updated}. The image has a {color} background")
                prompts.append(f"{color}. {updated}")

    return prompts
```

Then we have to do something similar for the proposed regions, for which we have to encode the spatial position using the color mapping presented in the LUT above.

```python
def get_background_color(normalized_x, normalized_y):
    color = [0, 0, 0]

    # Left
    if normalized_x <= 0.35:
        # Top
        if normalized_y <= 0.33:
            color = [0, 0, 255] # Blue
        # Center
        elif normalized_y <= 0.67:
            color = [0, 255, 0] # Green
        # Bottom
        else:
            color = [139, 69, 19] # Brown
        
    # Center
    elif normalized_x <= 0.65:
        # Top
        if normalized_y <= 0.33:
            color = [255, 255, 0] # Yellow
        # Center
        elif normalized_y <= 0.67:
            color = [255, 0, 255] # Magenta
        # Bottom
        else:
            color = [0, 255, 255] # Cyan

    # Right
    else:
        # Top
        if normalized_y <= 0.33:
            color = [255, 0, 0] # Red
        # Center
        elif normalized_y <= 0.67:
            color = [0, 0, 0] # Black
        # Bottom
        else:
            color = [255, 255, 255] # White
    
    # Every branch above assigns a color; the initial value is only a fallback
    return color
```

As can be seen, only one color is applied to each region proposal, based on its location. Conversely, for each prompt containing a spatial reference, new prompts are generated for three colors, because the prompt leaves the other coordinate unspecified (_e.g._ "left" could mean upper-left, center-left or bottom-left).
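
The relationship between the LUT, the spatial keywords and the centroid thresholds can also be expressed in a table-driven form. The following is only a minimal sketch, not the code we actually ran: the names `LUT`, `colors_for_keyword` and `color_for_centroid` are hypothetical, while the colors and thresholds are copied from the table and from `get_background_color` above.

```python
# Color LUT from the table above: rows are Top/Center/Bottom,
# columns are Left/Center/Right.
LUT = [
    ['blue',  'yellow',  'red'],
    ['green', 'magenta', 'black'],
    ['brown', 'cyan',    'white'],
]

def colors_for_keyword(keyword):
    """Colors compatible with a spatial keyword: one column or one row of the LUT."""
    if keyword == 'left':
        return [row[0] for row in LUT]
    if keyword == 'right':
        return [row[2] for row in LUT]
    if keyword == 'above':
        return list(LUT[0])
    if keyword == 'below':
        return list(LUT[2])
    return []

def color_for_centroid(nx, ny, x_cuts=(0.35, 0.65), y_cuts=(0.33, 0.67)):
    """Single color assigned to a region whose normalized centroid is (nx, ny)."""
    col = 0 if nx <= x_cuts[0] else 1 if nx <= x_cuts[1] else 2
    row = 0 if ny <= y_cuts[0] else 1 if ny <= y_cuts[1] else 2
    return LUT[row][col]

# A region in the bottom-left corner gets exactly one color...
region_color = color_for_centroid(0.1, 0.9)
# ...which is one of the three candidates generated for a "left" prompt.
left_colors = colors_for_keyword('left')
```

In this form it is easy to see why a prompt needs three candidate colors while a region needs only one: the keyword fixes a column (or a row) of the LUT, whereas the centroid fixes a single cell.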

This led to an overall increase in the performance, with an average IoU of 0.54.

Nevertheless, this method had some issues with very crowded images or with samples whose prompt was too complicated. This is why we explored further ideas.

#### 6.1.2 Encoding position in the whole image

For this path, we did the following:
1. Find region proposals
1. Crop each region proposal, thus generating $m$ new images
1. Add a colored overlay to each image based on the object's position in the original image. Color is assigned using a Look-Up Table (LUT).
1. Augment the prompt to make CLIP take this color information into account. Augmentation is performed as `prompt + ' with a {color} overlay'`, where `{color}` is set according to the spatial keyword found in the prompt itself (e.g. "left", "right").

The following cells present the implementation of our idea.


In [14]:
use_class_weights_for_proposals = False

Reload CLIP model since it was overridden in the previous sections where training happened.

In [17]:
import clip

clip_backbones = ['RN50x16']

models, preprocesses = {}, {}

for clip_backbone in clip_backbones:
    models[clip_backbone] = []
    preprocesses[clip_backbone] = []

    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            model, preprocess = clip.load(clip_backbone, device=f'cuda:{i}')
            
            models[clip_backbone].append(model)
            preprocesses[clip_backbone].append(preprocess)
    else:
        model, preprocess = clip.load(clip_backbone, device=device)
        models[clip_backbone].append(model)
        preprocesses[clip_backbone].append(preprocess)

In [18]:
model_to_use_for_colors_architecture = 'RN50x16'

Create class prompts in the form `A photo of a {class}`.

In [21]:
import clip

classes = {id: class_name for id, class_name in yolo_models[0].names.items()}
class_prompts = {id: f'A photo of a {class_name}' for id, class_name in yolo_models[0].names.items()}

with torch.no_grad():
    # Tensor with one row per class
    prompts_tensor = clip.tokenize(class_prompts.values()).to(device)

    # Tensor with one row per class and one column per embedding dimension
    # (the embedding size depends on the CLIP backbone), normalized
    class_prompts_embeddings = models[model_to_use_for_colors_architecture][0].encode_text(prompts_tensor)
    class_prompts_embeddings /= class_prompts_embeddings.norm(dim=-1, keepdim=True)
    class_prompts_embeddings = class_prompts_embeddings.to(device)

In [28]:
import torch
import torch.nn as nn
import clip
from PIL import Image

class ColorsOverlayModel(nn.Module):
    def __init__(self, device=None, models=None, preprocesses=None, yolo_models=None, classes=None, class_embeddings=None) -> None:
        """
        Initialize a ColorsOverlay model.
        """
        
        super().__init__()
        
        if device:
            self.device = device
        else:
            self.device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
        
        if not models or not preprocesses:
            raise ValueError('Models and preprocesses for CLIP model should be provided')

        self.models = models
        self.preprocesses = preprocesses
        
        if not yolo_models:
            raise ValueError('Models for YOLO should be provided')
        self.yolo_models = yolo_models

        if classes is None or class_embeddings is None:
            raise ValueError('Classes and class embeddings should be provided')
        self.classes = classes
        self.class_embeddings = class_embeddings

    def forward(self, indices, images, prompts_list):
        """
        Forward call

        `indices` represents a list of indices for the images and prompts lists.
        `indices` is automatically split in a multi-GPU setting, therefore it allows
        for extracting only the inputs that should be processed by the single GPU.
        """
        
        self.device = indices.device
        # Read the GPU index from the device itself (parsing the last character
        # of its string form would break with 10 or more GPUs)
        self.device_index = self.device.index if indices.is_cuda else 0

        # -- Getting the right data and moving it to the correct device --

        # Images remain on the CPU because they are PIL Images, not Tensors
        # Converting to Tensors leads to errors with YOLO
        images = [images[i] for i in indices]

        prompts_list = [prompts_list[i] for i in indices]
        prompts_list = self.update_prompts(prompts_list)
        prompts_tensor = [clip.tokenize(prompt_list).to(self.device) for prompt_list in prompts_list]

        # -- Actual processing --

        bounding_boxes = self.get_bounding_boxes(images)

        # It contains the predicted bounding box for each image for each prompt
        # Then, it is a list of length len(images) and for each entry there is a
        # list with len(prompts[i]), where i is the i-th image 
        overall_outputs = []

        with torch.no_grad():
            for idx, prompts_tensor_for_sample in enumerate(prompts_tensor):
                # Image crops
                image_crops = self.get_cropped_bounding_boxes(images[idx], bounding_boxes.pred[idx])

                preprocessed_image_crops = torch.stack([self.preprocesses[self.device_index](image).to(self.device) for image in image_crops])

                crop_features = self.models[self.device_index].encode_image(preprocessed_image_crops)
                crop_features /= crop_features.norm(dim=-1, keepdim=True)

                text_features = self.models[self.device_index].encode_text(prompts_tensor_for_sample)

                text_similarity = cosine_similarity(self.class_embeddings.to(self.device), text_features).float()
                prompt_categories_p = (100 * text_similarity).softmax(dim=-1)
                
                similarity = cosine_similarity(crop_features, text_features).float().to(self.device)
                
                if use_class_weights_for_proposals:
                    # Compute an a-priori score for each crop. These scores are based on the crop's category as
                    # predicted by YOLO
                    weights_for_crops = torch.zeros((prompt_categories_p.shape[0], len(image_crops))).to(self.device)
                    for prompt_idx, t_s in enumerate(text_similarity):
                        for weight_idx, crop in enumerate(bounding_boxes.pred[idx]):
                            weights_for_crops[prompt_idx, weight_idx] = t_s[int(crop[-1])]
                    
                    similarity *= weights_for_crops
                    
                texts_p = (100 * similarity).softmax(dim=-1)

                _, max_indices = texts_p.max(dim=1)
                try:
                    for max_idx in max_indices:
                        overall_outputs.append(
                            torch.tensor(bounding_boxes.xyxy[idx][max_idx, 0:4]).to(self.device)
                        )
                except Exception:
                    for max_idx in max_indices:
                        overall_outputs.append(
                            torch.tensor((0, 0, images[idx].size[0], images[idx].size[1])).to(self.device)
                        )

        return torch.stack(overall_outputs)

    def update_prompts(self, prompts):
        """
        Update the prompts by introducing a color reference based on spatial
        references.
        """
        
        updated_prompts = []

        for sample in prompts:
            sample_prompts = []
            for prompt in sample:
                if 'left' in prompt:
                    prompt += ' with a red overlay'
                elif 'right' in prompt:
                    prompt += ' with a green overlay'
                sample_prompts.append(prompt)
            updated_prompts.append(sample_prompts)

        return updated_prompts

    def get_bounding_boxes(self, pil_images):
        bounding_boxes = self.yolo_models[self.device_index](pil_images)
        return bounding_boxes
    
    def get_cropped_bounding_boxes(self, image, bounding_boxes):
        """
        Bounding boxes in the form:
        [top left x, top left y, bottom right x, bottom right y, confidence, category]
        """

        cropped_bounding_boxes = []

        image_width, image_height = image.size
        
        for bbox_idx, bounding_box in enumerate(bounding_boxes):
            cropped_img = image.crop((bounding_box[0].item(), bounding_box[1].item(), bounding_box[2].item(), bounding_box[3].item()))

            # Centroid: (min + (max - min) / 2) / dimension
            crop_centroid_normalized = (
                ((bounding_box[2].item() + bounding_box[0].item()) / 2) / image_width,
                ((bounding_box[3].item() + bounding_box[1].item()) / 2 ) / image_height
            )

            # Assign a color overlay based on centroid position
            if crop_centroid_normalized[0] < 0.5:
                overlay = Image.new('RGBA', cropped_img.size, overlay_colors[0])
            elif crop_centroid_normalized[0] > 0.5:
                overlay = Image.new('RGBA', cropped_img.size, overlay_colors[1])
            else:
                overlay = Image.new('RGBA', cropped_img.size, overlay_colors[-1])
            blended = Image.alpha_composite(cropped_img.convert('RGBA'), overlay)
            cropped_bounding_boxes.append(blended)

        if len(cropped_bounding_boxes) == 0:
            cropped_bounding_boxes.append(image)
                
        return cropped_bounding_boxes

overlay_colors = [
    (255, 0, 0, 128),   # Red, alpha = 0.5
    (0, 255, 0, 128),   # Green, alpha = 0.5
    (0, 0, 255, 128),   # Blue, alpha = 0.5
    (0, 0, 0, 0),       # None
]

colors_overlay_model = ColorsOverlayModel(models=models[model_to_use_for_colors_architecture], preprocesses=preprocesses[model_to_use_for_colors_architecture], yolo_models=yolo_models, classes=classes, class_embeddings=class_prompts_embeddings)
if torch.cuda.device_count() > 1:
    colors_overlay_model = torch.nn.DataParallel(colors_overlay_model)

Instantiate the evaluator with the same CLIP backbones.

In [29]:
evaluator_model = Evaluator(models=models, preprocesses=preprocesses)

if torch.cuda.device_count() > 1:
    evaluator_model = torch.nn.DataParallel(evaluator_model)

Finally, we can test the model:

In [30]:
test_model(test_loader, colors_overlay_model, evaluator_model, device, True)


This led to an overall increase in the performance with respect to the baseline.

| Backbone | Mean Intersection over Union | Accuracy | Cosine Similarity |
| -------- | ---------------------------- | -------- | ----------------- |

### 6.2 Deviating CLIP's "attention" to relevant parts (our final proposal)

We decided to try to make CLIP focus on specific areas of the input image while still feeding it the whole image, so as not to lose spatial information and context: indeed, we believe CLIP has the ability to understand language and, if properly guided, can be more accurate.

Therefore, we pushed the prompt engineering we had already done for the previously presented architectures even further. The following key elements make up the final pipeline, which we present as our project:
1. Attention engineering to make CLIP focus on the relevant object in the image. By introducing masking operators, visual annotations (e.g. rectangles, ellipses) and background blurring to reduce distractions, we found that CLIP is able to shift its attention to the highlighted block. Note that "attention" here does not refer only to the concept of attention in Transformer-based models: it refers to the general ability to focus on a given portion of the image. Indeed, our method works well even with ResNet-based models.
1. Prompt adaptation to shift CLIP's focus even further towards each object proposal. We tried several textual prompt engineering techniques, for instance:
    * `prompt + ' with a {color} {shape} around it'`, where `{color}` was either fixed for all region proposals, independently of their position, or dependent on their position, and `{shape}` depends on the shape actually used for highlighting the region proposal ("rectangle", "ellipse" or "circle").
    * `'This is ' + prompt`.

We discovered that this idea could work "by accident" when, due to a wrong implementation, we drew a rectangle instead of cropping the image on the region proposals. Combined with the aforementioned ideas, we thought it would be interesting, so we fully developed it and thoroughly investigated the outcomes it produced.

<div align="center"><img src="./asset/visual-prompting-detail.png"/></div>

The main issue is making sure that spatial structure is preserved while not deviating CLIP's attention to the wrong objects. In the beginning, we were only using visual markers to shift CLIP's focus, but it was not enough for overly crowded scenes. Therefore, we decided to smooth things out by blurring the background and keeping the rectangular window around a region proposal sharp. (We also tried other window shapes, for instance ellipses and rounded-corner rectangles, but we found straight rectangles to work the best.)

As can be seen in the previous image, this allows us to leverage both CLIP's spatial reasoning abilities and its natural language understanding capabilities.

In addition, we found that generating two distinct visual prompts for each region proposal helps improve the overall performance. More precisely, we tried the following modifications:
1. Adding a visual marker
2. Adding a visual marker and blurring
3. Adding a visual marker and gray-scaling the rest of the image
4. Adding a visual marker and gray-scaling and blurring the rest of the image

As mentioned earlier, we found that background blurring plays a key role in improving the performance, therefore we decided to only use modifications 2 and 4.

**This resembles the way our vision system works: we can perceive the whole image, but we can truly focus only on a small region.**

To summarize, if there are $m$ object proposals, we have to send $2 \times m$ images through CLIP, one for each visual prompt.

After extensive tests, we found the following parameters to work the best:
* using a stroke color of red for visual markers;
* using a stroke width of 3 px for visual markers;
* using a blur radius of 20 px, with a Gaussian blur being applied;
* masking using rectangles and using ellipses as visual markers.

We also tried the following hyperparameters:
* Stroke color: red, orange, purple, yellow, green, blue, none (transparent, to hide it).
* Stroke width: 1 to 10 px, with all intermediate integer values.
* Blur radius: 0 px (no blur), 1 px, 1.5 px, 2 px, 5 px, 10 px, 20 px, 25 px and 50 px.

Testing all these hyperparameter combinations is challenging. Considering that the backbone model also plays a crucial role and could interact with some settings, we had to try them in a controlled way to understand which combination works best. Therefore, we drew approximately 100 samples from the dataset and tested about fifty distinct combinations on them. We manually examined all the outcomes to understand how the model was behaving. In the end, after these tests, we decided to use the configuration presented earlier.
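
To give an idea of the size of the search space, the grid defined by the values listed above can be enumerated as in the sketch below. The variable names are hypothetical, and the seeded random selection is only an assumption made for illustration: it is not meant to describe how the tested combinations were actually chosen.

```python
import itertools
import random

# Values listed above (None stands for "no visible stroke")
stroke_colors = ['red', 'orange', 'purple', 'yellow', 'green', 'blue', None]
stroke_widths = list(range(1, 11))              # 1 to 10 px
blur_radii = [0, 1, 1.5, 2, 5, 10, 20, 25, 50]  # px, 0 = no blur

# Full Cartesian product of the three hyperparameters
grid = list(itertools.product(stroke_colors, stroke_widths, blur_radii))

# Illustrative assumption: pick fifty combinations with a fixed seed
rng = random.Random(0)
subset = rng.sample(grid, 50)

print(f'{len(grid)} combinations in the full grid, {len(subset)} selected')
```

The full grid is more than ten times larger than the subset that can realistically be inspected by hand, and it still has to be multiplied by the number of backbones under consideration.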

---

To further improve the performance, we wanted to find a way of discouraging proposals which are unlikely to be the correct ones. We thought of the following two ideas, which we implemented and tested on the whole test set:
* Disallowing region proposals whose area is larger than $a\%$ of the area of the given image. We set $a = 80$, thus prohibiting region proposals spanning at least $80\%$ of the given image. The rationale is that most of the objects referred to by prompts are a lot smaller than the image they are contained in. At the same time, region proposal algorithms/models (such as the YOLO we are using) sometimes propose almost the whole image: for instance, if the photo depicts a table with some objects on it, YOLO would also find the whole table, thus generating a proposal that contains almost all the original image and which will likely get the highest score for this reason. This approach may lead to wrong results in some edge cases, when the actual object to be found is (almost) the whole image. Nevertheless, the overall loss due to this is negligible considering how rare such a situation is. However, we found that this technique does not significantly improve the overall performance, so we decided *not* to leave it in the final proposal. The lines of code that deal with it can be uncommented to reintroduce it.
* Weighting the region proposals by their content. That is, we used the label provided by YOLO (_e.g._ 'person') to create new prompts (_i.e._, 'A picture of a {category}'), which we will refer to as "category prompts". We then encoded all these prompts (80 in total, as many as the categories recognized by YOLO) using the CLIP text encoder. When a region was proposed, we weighted it by the similarity between its category prompt and the given prompt. For instance, the prompt "An orange in the bowl" will have a higher similarity with the category prompt "A picture of a orange" than with "A picture of a cat". Therefore, if the image contains both an orange and a cat, the cat region will be assigned a lower likelihood. In this way, we are basically computing a prior probability of a region being the correct one.
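
The first idea (the area filter) can be sketched in isolation as follows. This is a simplified, stand-alone version: the function name `filter_large_proposals` and the plain-tuple box format are assumptions made for the example, whereas in our pipeline boxes come from YOLO as tensors.

```python
def filter_large_proposals(boxes, image_size, max_area_ratio=0.8):
    """
    Keep only the boxes (x0, y0, x1, y1) whose area is strictly smaller than
    `max_area_ratio` times the image area; `image_size` is (width, height).
    """
    image_area = image_size[0] * image_size[1]
    kept = []
    for box in boxes:
        x0, y0, x1, y1 = box
        box_area = (x1 - x0) * (y1 - y0)
        if box_area < max_area_ratio * image_area:
            kept.append(box)
    return kept

# A 640x480 image with one near-full-frame proposal and one small proposal
boxes = [(0, 0, 630, 470), (100, 120, 220, 300)]
kept = filter_large_proposals(boxes, image_size=(640, 480))
print(kept)
```

Only the small proposal survives here, since the first box covers roughly 96% of the image. Note that the filter has to be applied to the same list of boxes from which the final prediction is taken, otherwise the indices no longer match.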

---

All this information has to be combined in order to predict a single bounding box for the given pair (image, prompt). We tried different "ensembling" methods, however we found the following to work best:
* computing the mean between the cosine similarities of the visual prompts referring to the same proposal. We also tried getting the max or the median (when we tried with 3+ visual prompts for each proposal);
* extracting the maximum cosine similarity value across all proposals to find the best-matching one.
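
The two ensembling steps above can be illustrated on plain Python lists. This is a minimal sketch under stated assumptions: the function name `pick_best_proposal` and the flat layout of the similarities (visual prompts ordered proposal by proposal) are chosen for the example, and the numbers are made up.

```python
from statistics import mean

def pick_best_proposal(similarities, prompts_per_proposal=2, reduce=mean):
    """
    `similarities` holds the cosine similarities between ONE textual prompt
    and all the visual prompts, ordered proposal by proposal.
    Returns (index of the best proposal, list of per-proposal scores).
    """
    k = prompts_per_proposal
    # Step 1: combine the visual prompts that refer to the same proposal
    scores = [reduce(similarities[i:i + k]) for i in range(0, len(similarities), k)]
    # Step 2: pick the proposal with the highest combined score
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Three proposals, two visual prompts each (illustrative values)
sims = [0.20, 0.30, 0.50, 0.10, 0.35, 0.45]

best_mean, scores_mean = pick_best_proposal(sims)           # mean per proposal
best_max, scores_max = pick_best_proposal(sims, reduce=max)  # max per proposal
print(best_mean, best_max)
```

With these values the two reductions disagree: the mean favours the last proposal, whose two visual prompts both score well, while the max favours the second one because of a single high score.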

In addition, we found evidence that CLIP is biased towards certain categories of objects, perhaps because it was exposed to them more during pre-training. This conjecture is also supported by some of the papers we have studied, which propose techniques to address it. To overcome this issue and gain some more accuracy, a "de-biasing" technique can be applied.

We decided to sample $l$ (in the end, $l = 5000$) random prompts from the training set and encode them using CLIP's text encoder. Then, at inference time, we compute the cosine similarity between all region proposals and both the given prompt(s) for the image and all these "de-biasing" prompts. We therefore get a huge matrix, whose size is $(\#\textit{prompts} + \#\textit{de-biasing prompts}) \times \#\textit{visual prompts}$. We compute the average on a per-column basis, so we get $\#\textit{visual prompts}$ average values. We drop all the rows but the first $\#\textit{prompts}$ ones, which are those referring to the actually given prompts. Then, from each of these remaining rows, we subtract the averages computed earlier. This helps de-bias the result.
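
The de-biasing step can be followed on a tiny worked example. The sketch below uses nested Python lists instead of tensors, a hypothetical function name (`debias`) and made-up similarity values; it only mirrors the procedure described above.

```python
def debias(similarity, num_prompts):
    """
    `similarity` has one row per textual prompt (the given prompts first,
    then the de-biasing prompts) and one column per visual prompt.
    Returns the de-biased rows of the given prompts only.
    """
    num_rows = len(similarity)
    num_cols = len(similarity[0])
    # Per-column average over ALL rows (given + de-biasing prompts)
    column_means = [
        sum(row[j] for row in similarity) / num_rows for j in range(num_cols)
    ]
    # Keep only the given prompts and subtract the column averages
    return [
        [row[j] - column_means[j] for j in range(num_cols)]
        for row in similarity[:num_prompts]
    ]

similarity = [
    [0.5, 0.4],  # the given prompt
    [0.7, 0.2],  # de-biasing prompt 1
    [0.6, 0.3],  # de-biasing prompt 2
]
debiased = debias(similarity, num_prompts=1)
print(debiased)
```

Before de-biasing, the first visual prompt has the higher raw similarity with the given prompt. However, it scores high with every prompt (column average of about 0.6 against 0.3), so after subtracting the averages the second visual prompt comes out on top.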

To speed up computations, we also tried to pre-compute the de-biasing terms and even to learn them, but this did not lead to satisfactory results: in terms of metrics, the best performance is achieved by proceeding as explained above. However, depending on the application, we suggest considering pre-computing these values or training a network to predict them. In either case - and even in the one we have decided to use - we consider it to be a sort of training: we are exploiting data and information to improve performance on unseen data.

To summarize, these are the main components of our architecture:
* visual prompt editing to guide CLIP towards our goal;
* textual prompt editing to help CLIP find better matches;
* using prior knowledge to improve the performance on novel, unseen data.

The following cell contains the code of our implementation. We present a general framework which we then adapted to work both with YOLOv5 and the recently-released YOLOv8.

In [105]:
import torch
import torch.nn as nn
import clip
import numpy as np
from PIL import Image, ImageDraw, ImageFilter

class CirclesModel(nn.Module):
    def __init__(self, device=None, models=None, preprocesses=None, yolo_models=None, text_features_bias=None) -> None:
        """
        Initialize a CirclesModel. Note that it does not actually work, as it acts as an abstract class, which
        has to be subclassed in order to work properly with different YOLO versions.
        """

        super().__init__()
        
        if device:
            self.device = device
        else:
            self.device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
        
        if not models or not preprocesses:
            raise ValueError('Models and preprocesses for CLIP model should be provided')

        self.models = models
        self.preprocesses = preprocesses

        self.clip_backbones = list(self.models.keys())
        
        if not yolo_models:
            raise ValueError('Models for YOLO should be provided')
        self.yolo_models = yolo_models

        # De-biasing features
        if not text_features_bias:
            self.text_features_bias = {}
            for backbone in self.clip_backbones:
                self.text_features_bias[backbone] = None
        else:
            self.text_features_bias = text_features_bias

        # How many visual prompts are generated for each image
        self.visual_augmentation = 2

    def forward(self, indices, images, prompts_list):
        self.device = indices.device
        # Read the GPU index from the device itself (parsing the last character
        # of its string form would break with 10 or more GPUs)
        self.device_index = self.device.index if indices.is_cuda else 0

        # -- Getting the right data and moving it to the correct device --

        # Images remain on the CPU because they are PIL Images, not Tensors
        # Converting to Tensors leads to errors with YOLO
        images = [images[i] for i in indices]

        prompts_list = [prompts_list[i] for i in indices]
        prompts_list = self.update_prompts_with_this_is(prompts_list)
        prompts_tensor = [clip.tokenize(prompt_list).to(self.device) for prompt_list in prompts_list]

        # -- Actual processing --

        bounding_boxes = self.get_bounding_boxes(images)

        # It contains the predicted bounding box for each image for each prompt
        # Then, it is a list of length len(images) and for each entry there is a
        # list with len(prompts[i]), where i is the i-th image 
        overall_outputs = []

        with torch.no_grad():
            for idx, prompts_tensor_for_sample in enumerate(prompts_tensor):
                # Image crops
                image_crops, bounding_boxes[idx] = self.get_visual_prompts(images[idx], bounding_boxes[idx])

                preprocessed_image_crops = {}
                for backbone in self.clip_backbones:
                    preprocessed_image_crops[backbone] = torch.stack([self.preprocesses[backbone][self.device_index](image) for image in image_crops]).to(self.device)

                similarities = {}
                for backbone in self.clip_backbones:
                    pic_batches = np.array_split(range(len(preprocessed_image_crops[backbone])), len(preprocessed_image_crops[backbone]) // self.visual_augmentation)
                    
                    visual_features = []
                    for pic_batch in pic_batches:
                        visual_features.append(self.models[backbone][self.device_index].encode_image(preprocessed_image_crops[backbone][pic_batch]))
                    visual_features = torch.cat(visual_features)

                    text_features = self.models[backbone][self.device_index].encode_text(prompts_tensor_for_sample)

                    if self.text_features_bias[backbone] is not None:
                        text_features = torch.cat([text_features, self.text_features_bias[backbone].to(self.device)])

                    similarities[backbone] = cosine_similarity(visual_features, text_features)


                similarity = torch.empty_like(similarities[self.clip_backbones[0]])
                for prompt_idx in range(similarity.shape[0]):
                    for proposal_idx in range(similarity.shape[1]):
                        similarity[prompt_idx, proposal_idx] = torch.mean(torch.stack([
                            similarities[backbone][prompt_idx, proposal_idx] for backbone in self.clip_backbones
                        ]))


                average = similarity.mean(dim=0)
                scores = (similarity - average)[range(len(prompts_tensor_for_sample))]

                final_scores = torch.empty((scores.shape[0], scores.shape[1] // self.visual_augmentation))

                for prompt_idx in range(final_scores.shape[0]):
                    for final_score_idx, proposal_idx in enumerate(range(0, scores.shape[1], self.visual_augmentation)):
                        final_scores[prompt_idx, final_score_idx] = torch.max(torch.stack([
                            scores[prompt_idx, proposal_idx + i] for i in range(self.visual_augmentation)
                        ]))

                
                _, max_indices = final_scores.max(dim=-1)
                try:
                    for max_idx in max_indices:
                        overall_outputs.append(
                            (bounding_boxes[idx][max_idx, 0:4]).to(self.device)
                        )
                except Exception:
                    for max_idx in max_indices:
                        overall_outputs.append(
                            torch.tensor((0, 0, images[idx].size[0], images[idx].size[1])).to(self.device)
                        )

        return torch.stack(overall_outputs)

    def update_prompts_with_this_is(self, prompts):
        """
        Update the textual prompts by adding 'This is' at the beginning.
        This seems to help in guiding CLIP to focus on the correct part
        of the visual prompt.
        """

        return [['This is ' + prompt for prompt in sample] for sample in prompts]

    def get_bounding_boxes(self, pil_images):
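        # Abstract method: each YOLO-specific subclass provides its own implementation
        raise NotImplementedError('Subclasses must implement get_bounding_boxes')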

    def get_image_with_marker(self, image, bbox, stroke_color='red', stroke_width=1):
        """
        Add a visual marker to the image at the position specified by the
        bounding box (bbox), which is expected to be in the format
        (top_left_x, top_left_y, bottom_right_x, bottom_right_y).
        """
        
        result = image.copy()
        draw = ImageDraw.Draw(result)
        draw.ellipse(bbox, outline=stroke_color, width=stroke_width)
        
        return result
    
    def get_image_with_marker_and_blur(self, image, bbox, stroke_color='red', stroke_width=1, blur_radius=1):
        """
        Add a visual marker to the image at the position specified by the
        bounding box (bbox), which is expected to be in the format
        (top_left_x, top_left_y, bottom_right_x, bottom_right_y).
        The background is then blurred.
        """
        
        result = image.filter(ImageFilter.GaussianBlur(radius=blur_radius))
        mask = Image.new('L', image.size, 0)
        draw = ImageDraw.Draw(mask)
        draw.rectangle(bbox, fill=255)
        result.paste(image, mask=mask)
        draw = ImageDraw.Draw(result)
        draw.ellipse(bbox, outline=stroke_color, width=stroke_width)
        
        return result

    def get_image_with_marker_and_grayscale(self, image, bbox, stroke_color='red', stroke_width=1):
        """
        Add a visual marker to the image at the position specified by the
        bounding box (bbox), which is expected to be in the format
        (top_left_x, top_left_y, bottom_right_x, bottom_right_y).
        The background is grayscaled.
        """
        
        result = image.convert('L').convert('RGB')
        mask = Image.new('L', image.size, 0)
        draw = ImageDraw.Draw(mask)
        draw.rectangle(bbox, fill=255)
        result.paste(image, mask=mask)
        draw = ImageDraw.Draw(result)
        draw.ellipse(bbox, outline=stroke_color, width=stroke_width)

        return result

    def get_image_with_marker_and_blur_grayscale(self, image, bbox, stroke_color='red', stroke_width=1, blur_radius=1):
        """
        Add a visual marker to the image at the position specified by the
        bounding box (bbox), which is expected to be in the format
        (top_left_x, top_left_y, bottom_right_x, bottom_right_y).
        The background is both grayscaled and blurred.
        """
        
        result = image.filter(ImageFilter.GaussianBlur(radius=blur_radius)).convert('L').convert('RGB')
        mask = Image.new('L', image.size, 0)
        draw = ImageDraw.Draw(mask)
        draw.rectangle(bbox, fill=255)
        result.paste(image, mask=mask)
        draw = ImageDraw.Draw(result)
        draw.ellipse(bbox, outline=stroke_color, width=stroke_width)

        return result

    def get_visual_prompts(self, image, bounding_boxes):
        self.visual_augmentation = 2
        visual_prompts = []
        keep_bbox = []

        if bounding_boxes is None:
            return [image] * self.visual_augmentation, bounding_boxes

        # Setting the parameters for the visual markers
        stroke_color = 'red'
        stroke_width = 3
        blur_radius = 20

        # `bounding_boxes` cannot be None here: that case returned early above
        for idx, bounding_box in enumerate(bounding_boxes):
            # Keep only the (x1, y1, x2, y2) coordinates as plain Python numbers
            bounding_box = tuple(coord.item() for coord in bounding_box[:4])

            # Optional filter: skip boxes covering more than 80% of the image.
            # For it to have any effect, the same boxes must also be removed from
            # YOLO's results, as those are what is actually used in the end.
            # if (bounding_box[2] - bounding_box[0]) * (bounding_box[3] - bounding_box[1]) > (image.size[0] * image.size[1]) * 0.8:
            #     continue

            # If the previous condition was uncommented, this line would not execute
            # for boxes covering more than 80% of the area of the image, thus they
            # would be removed from the results
            keep_bbox.append(idx)

            # Uncomment the following lines to add or remove visual markers.
            # Remember to update `self.visual_augmentation` to match the number
            # of visual prompts that are generated for each region proposal.
            bbox_visual_prompts = [
                # self.get_image_with_marker(image, bounding_box, stroke_color=stroke_color, stroke_width=stroke_width),
                self.get_image_with_marker_and_blur(image, bounding_box, stroke_color=stroke_color, stroke_width=stroke_width, blur_radius=blur_radius),
                # self.get_image_with_marker_and_grayscale(image, bounding_box, stroke_color=stroke_color, stroke_width=stroke_width),
                self.get_image_with_marker_and_blur_grayscale(image, bounding_box, stroke_color=stroke_color, stroke_width=stroke_width, blur_radius=blur_radius),
            ]

            for el in bbox_visual_prompts:
                visual_prompts.append(el)

        # if self.device_index == 0:
        #     print('From', len(bounding_boxes))
        bounding_boxes = bounding_boxes[keep_bbox]
        # if self.device.index == 0:
        #     print('To', len(bounding_boxes))

        if len(visual_prompts) == 0:
            # If no region proposal, return the whole image.
            # It is inserted as many times as each region would
            # be augmented to ensure consistency in the algorithm
            for _ in range(self.visual_augmentation):
                visual_prompts.append(image)
                
        return visual_prompts, bounding_boxes
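
The commented-out filter in `get_visual_prompts` is described as dropping region proposals that cover more than 80% of the image. The idea can be sketched in isolation as follows. This is only a minimal illustration, assuming boxes in `(x1, y1, x2, y2)` pixel format; `keep_small_boxes` and `max_area_fraction` are hypothetical names that do not appear in the notebook.

```python
def keep_small_boxes(boxes, image_size, max_area_fraction=0.8):
    """Return the indices of the boxes that cover at most
    `max_area_fraction` of the image area.

    boxes: iterable of (x1, y1, x2, y2) tuples, in pixels
    image_size: (width, height), as returned by PIL's `image.size`
    """
    image_area = image_size[0] * image_size[1]
    kept = []
    for idx, (x1, y1, x2, y2) in enumerate(boxes):
        box_area = max(0, x2 - x1) * max(0, y2 - y1)
        if box_area > max_area_fraction * image_area:
            continue  # the box is almost the whole image: discard it
        kept.append(idx)
    return kept


boxes = [(0, 0, 100, 100), (10, 10, 50, 60), (0, 0, 95, 90)]
print(keep_small_boxes(boxes, image_size=(100, 100)))  # [1]
```

The returned indices play the same role as `keep_bbox` above: they can be used to index the original boxes so that proposals and visual prompts stay aligned.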

In the following cell, we define two concrete implementations of `CirclesModel` that rely on different YOLO versions, namely YOLOv5 and YOLOv8.

We initially used YOLOv5, but we moved to YOLOv8 after noticing improved performance. In particular, in some cases YOLOv5 was *not* able to find the correct region proposals, and the ones it returned were not always accurate. YOLOv8, on the other hand, performs noticeably better thanks to the improvements made to its architecture and training phase.

Replacing YOLOv5 with YOLOv8 gives an average improvement of $+4\%$ on our task.

To keep the comparison fair, we used the "small" version of YOLO in both cases (that is, YOLOv5s and YOLOv8s).

In [106]:
class CirclesModelYOLOv5(CirclesModel):
    def get_bounding_boxes(self, pil_images):
        bounding_boxes = self.yolo_models[self.device_index](pil_images)
        return bounding_boxes.pred

class CirclesModelYOLOv8(CirclesModel):
    def get_bounding_boxes(self, pil_images):
        bounding_boxes = self.yolo_models[self.device_index].predict(pil_images, verbose=False)
        bounding_boxes = [torch.cat([box.xyxy for box in res.boxes]) if res.boxes else None for res in bounding_boxes]
        return bounding_boxes

First, we check whether YOLOv5 has already been loaded and, if necessary, we load it. We then load YOLOv8.

In [107]:
try:
    yolo_models  # raises NameError if YOLOv5 has not been loaded yet
except NameError:
    if torch.cuda.is_available():
        yolo_models = [torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True).to(f'cuda:{i}') for i in range(torch.cuda.device_count())]
    else:
        yolo_models = [torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True).to(device)]

In [108]:
from ultralytics import YOLO

if torch.cuda.is_available():
    yolo_v8_models = [YOLO('yolov8s.pt') for i in range(torch.cuda.device_count())]
else:
    yolo_v8_models = [YOLO('yolov8s.pt')]

Next, we load the CLIP model(s). We decided to stick to the ResNet backbone (RN50x16), as the assignment asked to use it if possible. Nevertheless, we also tried our architecture with different backbones, and the next section presents the results.

In [33]:
import clip

# The following list is used to choose which CLIP versions to load.
# If multiple backbones are chosen, they are all loaded and the models
# will use all of them as an ensemble.
# While this may improve the metric performance by a tiny fraction, it
# dramatically slows down the computation, as all the calculations have
# to be repeated for all backbones and eventually ensembled.
clip_backbones = [
    'RN50x16',
    # 'RN50x64',
    # 'ViT-B/16',
    # 'ViT-L/14@336px',
    # 'ViT-B/32',
]

models, preprocesses = {}, {}

for clip_backbone in clip_backbones:
    models[clip_backbone] = []
    preprocesses[clip_backbone] = []

    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            model, preprocess = clip.load(clip_backbone, device=f'cuda:{i}')
            
            models[clip_backbone].append(model)
            preprocesses[clip_backbone].append(preprocess)
    else:
        model, preprocess = clip.load(clip_backbone, device=device)
        models[clip_backbone].append(model)
        preprocesses[clip_backbone].append(preprocess)

Before instantiating the models, we have to load debiasing features. This is what the following cell does.

Debiasing features are prompts which are randomly sampled from the training set. In general, more prompts give a better estimate of the bias, but they also increase the compute time, so a trade-off has to be found. We decided to use 5000 randomly sampled prompts.

As explained earlier, the core idea is to compute the cosine similarity between all the region proposals and the text prompts (both the ones given for the image and the debiasing ones). The average is then computed column-wise (each column corresponds to a region proposal). All the rows except the ones corresponding to the given prompts are discarded, which greatly reduces the size of the cosine similarity matrix, and the column-wise average is subtracted from each remaining row to obtain the debiased scores. This idea was proposed by Shtedritski et al. (see the References section).
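
The debiasing step described above can be sketched on plain feature matrices. This is a minimal illustration rather than the notebook's implementation: it uses NumPy instead of PyTorch, random vectors stand in for CLIP embeddings, and `debiased_similarity` is a hypothetical name.

```python
import numpy as np


def debiased_similarity(region_features, prompt_features, bias_features):
    """Cosine similarity between prompts (rows) and regions (columns),
    minus the column-wise mean taken over prompts and debiasing prompts.

    region_features: (R, D), prompt_features: (P, D), bias_features: (B, D)
    Returns an array of shape (P, R).
    """
    def l2_normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    regions = l2_normalize(region_features)
    texts = l2_normalize(np.concatenate([prompt_features, bias_features]))

    similarity = texts @ regions.T                        # (P + B, R)
    column_mean = similarity.mean(axis=0, keepdims=True)  # (1, R)

    # Keep only the rows of the given prompts and remove the bias
    n_prompts = prompt_features.shape[0]
    return similarity[:n_prompts] - column_mean


rng = np.random.default_rng(0)
regions = rng.normal(size=(4, 8))   # 4 region proposals
prompts = rng.normal(size=(2, 8))   # 2 prompts given for the image
bias = rng.normal(size=(50, 8))     # 50 debiasing prompts
scores = debiased_similarity(regions, prompts, bias)
print(scores.shape)  # (2, 4)
```

A region that scores high against almost any text gets a high column mean, so subtracting that mean lowers its score for every prompt and favours regions that match the given prompts specifically.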

In [18]:
import numpy as np

# -- Step 1 --
# Loading the training set
all_prompts = []
for batch_idx, (images, gt_bounding_boxes, prompts) in enumerate(train_loader):
    refined_prompts = ['This is ' + prompt for sample in prompts for prompt in sample]

    for prompt in refined_prompts:
        all_prompts.append(prompt)

# -- Step 2 --
# Choosing 5000 random prompts
# Note that even setting the random seed does not guarantee getting the very
# same prompts, since the DataLoader for the training set shuffles the data,
# thus loading it in a different order every time.
# `np.random.seed` is a function: it must be called, not assigned to.
np.random.seed(42)
randomly_sampled_examples = np.random.choice(all_prompts, 5000)
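
As the comment in Step 2 notes, a fixed seed alone does not make the sample reproducible when the prompts arrive in a shuffled order. One possible workaround, shown here only as a sketch, is to sort the prompts before sampling and to use a local, seeded generator. The example swaps in Python's standard `random` module for NumPy and samples without replacement, which differs from the default behaviour of `np.random.choice`; `sample_prompts` is a hypothetical helper.

```python
import random


def sample_prompts(prompts, k, seed=42):
    """Draw `k` prompts reproducibly, regardless of their input order."""
    # Sorting removes the dependence on the (shuffled) loading order
    ordered = sorted(prompts)
    # A local generator avoids touching the global random state
    rng = random.Random(seed)
    return rng.sample(ordered, min(k, len(ordered)))


prompts = [f"This is object {i}" for i in range(100)]
shuffled = prompts[:]
random.Random(7).shuffle(shuffled)

# The same prompts are selected for both orderings
print(sample_prompts(prompts, 5) == sample_prompts(shuffled, 5))  # True
```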

In [34]:
# -- Step 3 --
# Encoding the prompts using CLIP
text_features_full = {}

text = clip.tokenize(randomly_sampled_examples).to(device)

for clip_backbone in clip_backbones:
    encoder = models[clip_backbone][0]

    # All the prompts are split into 10 mini-batches, as loading all
    # of them onto the GPU at the same time may cause out-of-memory errors.
    # Splitting into 10 smaller batches should prevent this problem
    # for reasonably sized debiasing sets.
    # If there are still issues due to low-capacity GPUs, the number of
    # batches should be increased.
    indices = range(text.shape[0])
    batches = np.array_split(indices, 10)

    text_features_full[clip_backbone] = []

    for batch in batches:
        with torch.no_grad():
            text_features_full[clip_backbone].append(encoder.encode_text(text[batch]))

# -- Step 4 --
# Concatenate all prompt features to form a matrix with as many rows as the
# number of randomly sampled prompts and as many columns as the embedding
# dimensionality
for clip_backbone in clip_backbones:
    text_features_full[clip_backbone] = torch.cat(text_features_full[clip_backbone])
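
The split, encode and concatenate pattern used in Steps 3 and 4 can be isolated in a small helper. The sketch below uses NumPy arrays and a toy encoder in place of CLIP's `encode_text`; `encode_in_batches` and `toy_encoder` are hypothetical names introduced only for illustration.

```python
import numpy as np


def encode_in_batches(encode_fn, inputs, n_batches=10):
    """Apply `encode_fn` to `inputs` in `n_batches` chunks and stack the results."""
    index_batches = np.array_split(np.arange(len(inputs)), n_batches)
    # Empty chunks appear when there are fewer inputs than batches: skip them
    chunks = [encode_fn(inputs[idx]) for idx in index_batches if len(idx) > 0]
    return np.concatenate(chunks)


def toy_encoder(tokens):
    # Stand-in for a text encoder: maps (N, L) token ids to (N, 4) features
    mean = tokens.astype(np.float32).mean(axis=1, keepdims=True)
    return mean * np.ones((1, 4), dtype=np.float32)


tokens = np.arange(23 * 5).reshape(23, 5)
features = encode_in_batches(toy_encoder, tokens, n_batches=10)
print(features.shape)  # (23, 4)
```

Because each row is encoded independently, the batched result matches a single call on the full input while the peak memory use is bounded by the largest chunk.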

We can now instantiate both models.

In [109]:
circles_model_yolo_v5 = CirclesModelYOLOv5(models=models, preprocesses=preprocesses, yolo_models=yolo_models, text_features_bias=text_features_full)
if torch.cuda.device_count() > 1:
    circles_model_yolo_v5 = torch.nn.DataParallel(circles_model_yolo_v5)

circles_model_yolo_v8 = CirclesModelYOLOv8(models=models, preprocesses=preprocesses, yolo_models=yolo_v8_models, text_features_bias=text_features_full)
if torch.cuda.device_count() > 1:
    circles_model_yolo_v8 = torch.nn.DataParallel(circles_model_yolo_v8)

Instantiate the evaluator with the same CLIP backbones.

In [110]:
evaluator_model = Evaluator(models=models, preprocesses=preprocesses)

if torch.cuda.device_count() > 1:
    evaluator_model = torch.nn.DataParallel(evaluator_model)

Finally, we can test the model.

In [111]:
test_model(test_loader, circles_model_yolo_v8, evaluator_model, device, True)

-- Batch index: 0 --
-- Batch index: 1 --
-- Batch index: 2 --
-- Batch index: 3 --
-- Batch index: 4 --
-- Batch index: 5 --
-- Batch index: 6 --
-- Batch index: 7 --
-- Batch index: 8 --
-- Batch index: 9 --
-- Batch index: 10 --
-- Batch index: 11 --
-- Batch index: 12 --
-- Batch index: 13 --
-- Batch index: 14 --
-- Batch index: 15 --
-- Batch index: 16 --
-- Batch index: 17 --
-- Batch index: 18 --
-- Batch index: 19 --
-- Batch index: 20 --
-- Batch index: 21 --
-- Batch index: 22 --
-- Batch index: 23 --
-- Batch index: 24 --
-- Batch index: 25 --
-- Batch index: 26 --
-- Batch index: 27 --
-- Batch index: 28 --
-- Batch index: 29 --
-- Batch index: 30 --
-- Batch index: 31 --
-- Batch index: 32 --
-- Batch index: 33 --
-- Batch index: 34 --
-- Batch index: 35 --
-- Batch index: 36 --
-- Batch index: 37 --
-- Batch index: 38 --
-- Batch index: 39 --
-- Batch index: 40 --
-- Batch index: 41 --
-- Batch index: 42 --
-- Batch index: 43 --
-- Batch index: 44 --
-- Batch index: 45 --

(0.5064360232455498, 0.4830508474576271, 0.8853325278072034)

# Results

As the main results of this project, we present the outcomes obtained with the final architecture we introduced.

| Backbone | Mean Intersection over Union | Accuracy | Cosine Similarity |
| -------- | ---------------------------- | -------- | ----------------- |

# Notes

## 1

We implemented the baseline and all the models using YOLOv5, downloaded from PyTorch Hub. Once the final architecture was completed, we decided to try YOLOv8, the latest iteration of the YOLO family, which was expected to perform better. Indeed, we had noticed that in some cases YOLOv5 was *not* able to propose any (meaningful) regions.

We therefore switched from YOLOv5 (small model) to YOLOv8 (small model, for comparable results). We found that it provides an overall +4% boost, both on the baseline and on the final architecture we have proposed.

We also thought of using other region-proposal methods, such as Fast R-CNN, Faster R-CNN and the Detectron2 suite by FAIR. However, we decided to stick to YOLO for the following reasons:
* results are comparable across different architectures;
* YOLO, in the small version we are using, is fast enough for the task;
* YOLO was trained on the COCO dataset, which makes it very accurate in proposing regions.

Nonetheless, the alternatives mentioned above could be employed, for instance to obtain faster inference or to avoid relying on YOLO.

## 2

In the end, we mostly proposed training-free architectures. We did so for the following reasons:
1. CLIP was trained on a huge dataset, therefore it should have a good representation of concepts in both the visual and the textual domain (which are then brought to the same latent space). We wanted to find out whether, and to what extent, this assumption holds. We did not want to alter or "ruin" CLIP itself: instead, we wanted to leverage its capabilities.
2. Training requires lots of resources. We would not have been able to train large models, for instance the Graph Neural Network one, as we intended. We could only train relatively small models (see previous sections) to test some more basic ideas.

# References

* [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/pdf/2103.00020.pdf)
* [CLIP-Lite: Information Efficient Visual Representation Learning with Language Supervision](https://arxiv.org/abs/2112.07133)
* [Adapting CLIP For Phrase Localization Without Further Training](https://arxiv.org/pdf/2204.03647.pdf)
* [Zero-shot Referring Image Segmentation with Global-Local Context Features](https://arxiv.org/pdf/2303.17811.pdf)
* [Hierarchical Local-Global Transformer for Temporal Sentence Grounding](https://arxiv.org/pdf/2208.14882.pdf)
* [[CLS] Token is All You Need for Zero-Shot Semantic Segmentation](https://arxiv.org/pdf/2304.06212.pdf)
* [What does CLIP know about a red circle? Visual prompt engineering for VLMs](https://arxiv.org/pdf/2304.06712.pdf)
* [CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching](https://arxiv.org/pdf/2303.13076.pdf)
* [ActBERT: Learning Global-Local Video-Text Representations](https://openaccess.thecvf.com/content_CVPR_2020/papers/Zhu_ActBERT_Learning_Global-Local_Video-Text_Representations_CVPR_2020_paper.pdf)
* [A Local-to-Global Approach to Multi-modal Movie Scene Segmentation](https://openaccess.thecvf.com/content_CVPR_2020/papers/Rao_A_Local-to-Global_Approach_to_Multi-Modal_Movie_Scene_Segmentation_CVPR_2020_paper.pdf)
* [CoLLIE: Continual Learning of Language Grounding from Language-Image Embeddings](https://www.jair.org/index.php/jair/article/view/13689/26825)
* [Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs](https://arxiv.org/pdf/2212.00785.pdf)
* [CLIP-Event: Connecting Text and Images with Event Structures](https://openaccess.thecvf.com/content/CVPR2022/papers/Li_CLIP-Event_Connecting_Text_and_Images_With_Event_Structures_CVPR_2022_paper.pdf)
* [STAIR: Learning Sparse Text and Image Representation in Grounded Tokens](https://arxiv.org/pdf/2301.13081.pdf)
* [Learning Generalized Zero-Shot Learners for Open-Domain Image Geolocalization](https://arxiv.org/pdf/2302.00275.pdf)
* [CRIS: CLIP-Driven Referring Image Segmentation](https://arxiv.org/pdf/2111.15174.pdf)
* [LAVT: Language-Aware Vision Transformer for Referring Image Segmentation](https://arxiv.org/pdf/2112.02244.pdf)
* [Weakly-supervised segmentation of referring expressions](https://arxiv.org/pdf/2205.04725.pdf)
* [Focusing on Targets for Improving Weakly Supervised Visual Grounding](https://arxiv.org/pdf/2302.11252v1.pdf)
* [Fine-tuned CLIP Models are Efficient Video Learners](https://arxiv.org/pdf/2212.03640.pdf)
* [ClipCap: CLIP Prefix for Image Captioning](https://arxiv.org/pdf/2111.09734)
* [Simple but Effective: CLIP Embeddings for Embodied AI](https://openaccess.thecvf.com/content/CVPR2022/papers/Khandelwal_Simple_but_Effective_CLIP_Embeddings_for_Embodied_AI_CVPR_2022_paper.pdf)
* [Fast R-CNN](https://arxiv.org/abs/1504.08083)
* [Faster R-CNN](https://arxiv.org/abs/1506.01497)
* [Detectron2](https://ai.facebook.com/tools/detectron2/)