<a href="https://colab.research.google.com/github/nicolazilio0/deepRiccy/blob/main/Complete_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1> VISUAL GROUNDING WITH CLIP AND STABLE DIFFUSION </h1>

## Abstract


This project focuses on addressing the challenge of visual grounding, which involves establishing a connection between language and visual information. Visual grounding is essential for bridging the gap between symbolic language representation and understanding the visual world.

This project builds on the CLIP paradigm, modifying its architecture to enhance performance.

Three distinct approaches were explored and combined. The first uses natural language processing (NLP) techniques to identify the subject of the sentence and to filter out unwanted YOLO bounding-box predictions. Together with the baseline, it forms our initial strategy: once we have computed the similarity between each cropped bounding box and the caption, we filter out the boxes that do not contain the extracted subject and return the remaining box with the highest similarity.

The second approach starts from the CLIP paradigm and replaces the text-to-image similarity with an image-to-image similarity. For each caption in the dataset we generate synthetic images that should represent the desired target as closely as possible. We then filter out the irrelevant bounding boxes (as in the first approach), compute the similarity between the remaining bounding boxes and the generated images, and return the box with the highest similarity.

Conversely, the third approach can be viewed as the "reverse" of the second. Here the goal is to obtain a textual description of each YOLO bounding box, so that the comparison is between texts. Once we have the captions for the YOLO bounding boxes, we filter out the "wrong" ones as in the previous approaches, compute the similarities, and return the box with the highest similarity.

Once all of these models were ready, we decided to combine them, merging the textual and visual approaches to try to improve performance. We first combined them linearly, giving a weight of 0.75 to the images and 0.25 to the text, then 0.5 to both, and finally 0.25 to the images and 0.75 to the text. Based on these results we then decided how to combine the scores for the best performance.




By exploring these approaches and their combinations, this project aims to advance the field of visual grounding and enable effective language-to-visual connections for improved understanding and interpretation of visual data.


## Setup environment

In [1]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

#! tar -zxvf /content/drive/MyDrive/refcocog.tar.gz

! pip3 install ftfy regex tqdm --quiet
! pip3 install diffusers==0.11.1 --quiet
! pip3 install transformers scipy ftfy accelerate --quiet
! pip3 install stanza --quiet
! pip3 install -qr https://raw.githubusercontent.com/ultralytics/yolov5/master/requirements.txt  --quiet
! pip install ftfy regex tqdm --quiet
! pip install git+https://github.com/openai/CLIP.git --quiet
! pip install rouge-metric --quiet
! pip install torchmetrics --quiet
! pip install torchvision --quiet



Mounted at /content/drive

In [None]:
local_path = './refcocog/images/' 
local_annotations = './refcocog/annotations/' 

In [8]:
local_path = '/content/refcocog/images/'
local_annotations = '/content/refcocog/annotations/'

In [10]:
!pip install datasets


In [11]:
# standard library
import argparse
import json
import os
import pickle
import warnings
from collections import OrderedDict

# scientific stack and plotting
import IPython.display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
import skimage
from PIL import Image
from pkg_resources import packaging
from sklearn.model_selection import train_test_split
from tqdm import tqdm

# PyTorch
import torch
import torchmetrics as tm
import torchvision
from torch.utils.data import Dataset
from torchvision import ops, transforms

# models: CLIP, Stable Diffusion, Stanza, Hugging Face
import clip
import datasets
import stanza
from diffusers import DPMSolverMultistepScheduler, StableDiffusionPipeline
from transformers import (AutoProcessor, AutoTokenizer, BlipForConditionalGeneration,
                          RobertaTokenizerFast, Seq2SeqTrainer, Seq2SeqTrainingArguments,
                          ViTFeatureExtractor, ViTImageProcessor, VisionEncoderDecoderModel,
                          default_data_collator)

# ignore warnings for cleaner output
warnings.filterwarnings("ignore")

## Utilities

In [12]:
# strip the trailing annotation id from a RefCOCOg file name, e.g.
# "COCO_train2014_000000380440_491042.jpg" -> "COCO_train2014_000000380440.jpg"
def split_string(string):
    # drop everything after the last underscore, then restore the extension
    return "_".join(string.split("_")[:-1]) + ".jpg"

In [30]:
# Run the model on every row of the dataframe, then save and print the metrics
def validate(model, dataframe):
    model.reset_metrics()
    for i in tqdm(range(0, len(dataframe))):
        sample = dataframe.iloc[i]
        image_path = split_string(sample["file_name"])
        sentence = sample["sentences"]["raw"]
        gt = sample["bbox"]
        original_img = Image.open(local_path + image_path).convert("RGB")
        # predict the box for this caption and update the running metrics
        model.evaluate(image_path, sentence, gt, original_img)

    model.save_metrics()
    print(model.get_metrics())

In [14]:
def test_on_one_image(model, dataframe, index):
    model.reset_metrics()

    input = dataframe.iloc[index]
    image_path = split_string(input["file_name"])
    sentence = input["sentences"]["raw"]
    gt = input["bbox"]

    original_img = Image.open(local_path + image_path).convert("RGB")

    # run the model on this sample; the returned box is in (x1, y1, x2, y2) format
    bbox, _ = model.evaluate(image_path, sentence, gt, original_img, index)
    bbox = bbox.cpu().numpy()
    # show the image with the ground truth (red) and the predicted bbox (blue)
    %matplotlib inline
    plt.imshow(original_img)

    x1, y1, width, height = gt

    plt.gca().add_patch(plt.Rectangle((x1, y1), width, height, fill=False, edgecolor='red', linewidth=2))
    
    plt.gca().add_patch(plt.Rectangle((bbox[0], bbox[1]), bbox[2]-bbox[0], bbox[3]-bbox[1], fill=False, edgecolor='blue', linewidth=2))
    print(sentence)
    plt.show()

    print(model.get_metrics())

In [15]:
# convert a COCO-format bbox (x, y, width, height) into corner format (x1, y1, x2, y2)
def convert_bbox(bbox, img):
    x1, y1, width, height = bbox
    x2, y2 = x1 + width, y1 + height

    # verify that the box lies inside the image
    if x1 < 0 or y1 < 0 or x2 > img.width or y2 > img.height:
        # note: nothing is returned in this case, so the caller receives None
        print("Bounding box outside the image bounds!")
    else:
        return x1, y1, x2, y2

# a YOLO detection row is (x1, y1, x2, y2, confidence, class); keep only the coordinates
def convert_yolo_bbox(bbox):
    return bbox[:4]

In [16]:
def clear_caption(caption):
    caption = caption.replace('<s>', '')
    caption = caption.replace('</s>', '')
    return caption

In [4]:
def crop_yolo(yolo_output, img, index):
    x1 = yolo_output.xyxy[0][index][0].cpu().numpy()
    x1 = np.rint(x1)
    y1 = yolo_output.xyxy[0][index][1].cpu().numpy()
    y1 = np.rint(y1)
    x2 = yolo_output.xyxy[0][index][2].cpu().numpy()
    x2 = np.rint(x2)
    y2 = yolo_output.xyxy[0][index][3].cpu().numpy()
    y2 = np.rint(y2)

    cropped_img = img.crop((x1, y1, x2, y2))

    return cropped_img

## Dataset

In [5]:
# dataset class definition
class Coco(Dataset):
    def __init__(self, path_json, path_pickle, train=True):
        self.path_json = path_json
        self.path_pickle = path_pickle
        self.train = train

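        # load images and annotations
        with open(self.path_json) as json_data:
            data = json.load(json_data)
            self.ann_frame = pd.DataFrame(data['annotations'])
            self.ann_frame = self.ann_frame.reset_index(drop=False)
            # image metadata used by get_imgframe (assumes the COCO-style 'images' key)
            self.img_frame = pd.DataFrame(data['images'])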

        with open(self.path_pickle, 'rb') as pickle_data:
            data = pickle.load(pickle_data)
            self.refs_frame = pd.DataFrame(data)

        # separate each sentence in dataframe
        self.refs_frame = self.refs_frame.explode('sentences')
        self.refs_frame = self.refs_frame.reset_index(drop=False)

        self.size = self.refs_frame.shape[0]

        # merge the dataframes
        self.dataset = pd.merge(
            self.refs_frame, self.ann_frame, left_on='ann_id', right_on='id')
        # drop useless columns for cleaner and smaller dataset
        self.dataset = self.dataset.drop(columns=['segmentation', 'id', 'category_id_y', 'ref_id', 'index_x',
                                         'iscrowd', 'image_id_y', 'image_id_x', 'category_id_x', 'ann_id', 'sent_ids', 'index_y', 'area'])

    def __len__(self):
        return self.size

    def __getitem__(self, idx):
        return self.dataset.iloc[idx]

    def get_annotation(self, idx):
        return self.ann_frame.iloc[idx]

    def get_imgframe(self, idx):
        return self.img_frame.iloc[idx]

    def get_validation(self):
        return self.dataset[self.dataset['split'] == 'val']

    def get_test(self):
        return self.dataset[self.dataset['split'] == 'test']

    def get_train(self):
        return self.dataset[self.dataset['split'] == 'train']

In [17]:
#test dataset

dataset = Coco(local_annotations + 'instances.json', local_annotations + "refs(umd).p")
print(dataset[0])

split                                                     test
sentences    {'tokens': ['the', 'man', 'in', 'yellow', 'coa...
file_name               COCO_train2014_000000380440_491042.jpg
bbox                           [374.31, 65.06, 136.04, 201.94]
Name: 0, dtype: object


## Metrics

In [18]:
class Metrics:
    def __init__(self, model, name):
        self.name = name
        self.treshold = 0.5
        self.transform = torchvision.transforms.Compose([
            torchvision.transforms.ToTensor(),
            torchvision.transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                             std=[0.229, 0.224, 0.225])
        ])
        # initialize torch tensor
        self.iou = torch.tensor([]).cuda()
        self.recall = Recall()
        self.model = model
        self.cosine_similarity = torch.tensor([]).cuda()
        self.euclidean_distance = torch.tensor([]).cuda()

    def update(self, predicted_bbox, target_bbox, predicted_image, target_image):
        predicted_bbox = torch.tensor(predicted_bbox)
        target_bbox = torch.tensor(target_bbox)

        with torch.no_grad():
            # Preprocess the predicted image and compute the predicted embedding
            predicted_image = padd_image(predicted_image)
            image_tensor = self.transform(predicted_image)
            image_tensor = image_tensor.unsqueeze(
                0).cuda()  # Add batch dimension
            predicted_embedding = self.model.encode_image(image_tensor)

            # Preprocess the target image and compute the target embedding
            target_image = padd_image(target_image)
            target_image_tensor = self.transform(target_image)
            target_image_tensor = target_image_tensor.unsqueeze(
                0).cuda()  # Add batch dimension
            target_embedding = self.model.encode_image(target_image_tensor)

        similarity = torch.nn.functional.cosine_similarity(
            predicted_embedding, target_embedding)
        distance = torch.nn.functional.pairwise_distance(
            predicted_embedding, target_embedding)

        # keep only the four box coordinates (YOLO rows also carry confidence and class)
        predicted_bbox = convert_yolo_bbox(predicted_bbox)
        # IoU between the predicted and the target box, both in (x1, y1, x2, y2) format
        actual_iou = ops.box_iou(predicted_bbox.unsqueeze(
            0).cuda(), target_bbox.unsqueeze(0).cuda())
        self.iou = torch.cat((self.iou, actual_iou), 0)
        # count the prediction as correct when the IoU exceeds the threshold
        if actual_iou > self.treshold:
            self.recall.update(True)
        else:
            self.recall.update(False)
        self.cosine_similarity = torch.cat(
            (self.cosine_similarity, similarity), 0)
        self.euclidean_distance = torch.cat(
            (self.euclidean_distance, distance), 0)

    def to_string(self):
        mean_iou = torch.mean(self.iou)
        recall_at_05_iou = self.recall.compute()
        mean_cosine_similarity = torch.mean(self.cosine_similarity)
        mean_euclidean_distance = torch.mean(self.euclidean_distance)

        return f"Mean IoU: {mean_iou:.4f}, Recall@0.5 IoU: {recall_at_05_iou:.4f}, Mean Cosine Similarity: {mean_cosine_similarity:.4f}, Mean Euclidean Distance: {mean_euclidean_distance:.4f}"

    def save(self):
        iou = self.iou.cpu().numpy()
        cosine_similarity = self.cosine_similarity.cpu().numpy()
        euclidean_distance = self.euclidean_distance.cpu().numpy()

        np.savetxt(self.name+"_iou.csv", iou, delimiter=",")
        np.savetxt(self.name+"_cosine_similarity.csv", cosine_similarity, delimiter=",")
        np.savetxt(self.name+"_euclidean_distance.csv", euclidean_distance, delimiter=",")


    def reset(self):
        self.iou = torch.tensor([]).cuda()
        self.recall.reset()
        self.cosine_similarity = torch.tensor([]).cuda()
        self.euclidean_distance = torch.tensor([]).cuda()

In [19]:
# definition of recall metric
class Recall:
    def __init__(self):
        self.true_positives = 0
        self.false_negatives = 0

    def update(self, correct):
        if correct:
            self.true_positives += 1
        else:
            self.false_negatives += 1

    def compute(self):
        return self.true_positives / (self.true_positives + self.false_negatives)

    def reset(self):
        self.true_positives = 0
        self.false_negatives = 0
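
As a sanity check on the metrics above, the following minimal sketch recomputes the IoU of two boxes and the resulting Recall@0.5 in plain Python. The helper names `coco_to_xyxy` and `iou_xyxy` and the sample boxes are illustrative assumptions, not part of the notebook, which relies on `torchvision.ops.box_iou` instead.

```python
def coco_to_xyxy(bbox):
    """Convert a COCO box (x, y, width, height) to corners (x1, y1, x2, y2)."""
    x, y, w, h = bbox
    return (x, y, x + w, y + h)


def iou_xyxy(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # the intersection is empty when the boxes do not overlap
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# hypothetical ground truth (COCO format) and two hypothetical predictions (corner format)
ground_truth = coco_to_xyxy((0, 0, 100, 100))
predictions = [(0, 0, 100, 80), (50, 50, 150, 150)]

ious = [iou_xyxy(p, ground_truth) for p in predictions]
# a prediction counts as a hit only when its IoU is strictly above 0.5
hits = [iou > 0.5 for iou in ious]
recall_at_05 = sum(hits) / len(hits)
```

The first prediction overlaps the ground truth well and is counted as a hit, while the second only shares a corner region and is counted as a miss.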

## Baseline Model

In [20]:
class VisualGrounding_baseline(torch.nn.Module):
    def __init__(self, yolo_version, clip_version, local_path, img_path):
        super(VisualGrounding_baseline, self).__init__()
        self.local_path = local_path
        self.img_path = img_path

        # initialize models
        self.yolo = torch.hub.load(
            'ultralytics/yolov5', yolo_version, pretrained=True)
        self.clip, self.preprocess = clip.load(clip_version)

        self.name = "baseline"
        # define metrics
        self.metrics = Metrics(self.clip, self.name)

        

    def forward(self, img_path, sentence):
        max_similarity = 0
        max_image = None
        max_bbox = None

        yolo_output = self.yolo(self.local_path+img_path)

        original_img = Image.open(self.local_path+img_path).convert("RGB")
        
        # encode the caption once: it is the same for every candidate box
        text = clip.tokenize([sentence]).cuda()
        with torch.no_grad():
            text_features = self.clip.encode_text(text).float()
        text_features /= text_features.norm(dim=-1, keepdim=True)

        for i in range(len(yolo_output.xyxy[0])):
            # crop the image based on the yolo output
            img_cropped = crop_yolo(yolo_output, original_img, i)

            img = self.preprocess(img_cropped).cuda().unsqueeze(0)

            with torch.no_grad():
                image_features = self.clip.encode_image(img).float()

            image_features /= image_features.norm(dim=-1, keepdim=True)
            # cosine similarity between the caption and the cropped box
            similarity = text_features.cpu().numpy() @ image_features.cpu().numpy().T

            if similarity > max_similarity:
                max_similarity = similarity
                max_image = img_cropped
                max_bbox = yolo_output.xyxy[0][i]

        if max_image is None:
            #set bbox to the whole image
            max_bbox = [0, 0, original_img.width, original_img.height]
            max_image = original_img

        return max_bbox, max_image

    def evaluate(self, img_path, sentence, gt, original_img):
        bbox = convert_bbox(gt, original_img)
        gt_crop = original_img.crop(bbox)
        prediction_bbox, prediction_img = self.forward(img_path, sentence)
        self.metrics.update(prediction_bbox, bbox, prediction_img, gt_crop)
        return prediction_bbox, prediction_img

    def reset_metrics(self):
        self.metrics.reset()

    def get_metrics(self):
        return self.metrics.to_string()

    def save_metrics(self):
        self.metrics.save()

### Pad YOLO bounding boxes with the average pixel value

The bounding boxes that YOLO outputs do not have a predefined shape, while CLIP by default preprocesses its inputs to a size of 224x224. Feeding the crops to the CLIP encoder as they are would therefore stretch or enlarge them, with a loss of information. To avoid this, we pad each crop so that it can be resized to 224x224 while preserving its content.
The padding colour is the mean colour of the image's pixels. This reduces the loss of information without adding noise or extraneous content to the image.
How the padding is applied depends on the size of the image:
- if at least one of the two dimensions is larger than 224, we create a square image whose side is the larger of the image's width and height, filled with the mean colour. We place the original image at its center and finally resize the result to 224x224
- if both dimensions are at most 224, the original image is simply padded to 224x224

An example of the results we obtain is the following:
![picture](https://drive.google.com/uc?id=1UgdwmBkeVIV0Zz21nKdJpYZ2NEQfk229
)
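
The two padding cases above can be sketched without any image library by computing only the geometry of the padded canvas. The function name `padding_layout` and the sample crop sizes are illustrative assumptions; filling the canvas with the mean colour and the final resize are left to the notebook's `padd_image`.

```python
def padding_layout(width, height, target=224):
    """Return (canvas_side, x_offset, y_offset) for centring a crop on a square canvas."""
    if width > target or height > target:
        # at least one side exceeds the target: square canvas as large as the longer side
        side = max(width, height)
    else:
        # both sides fit: pad directly onto a target x target canvas
        side = target
    # offsets that centre the crop on the canvas
    x_offset = (side - width) // 2
    y_offset = (side - height) // 2
    return side, x_offset, y_offset


# a wide crop larger than 224 pixels: the 300x300 canvas is later resized to 224x224
large = padding_layout(300, 100)
# a small crop: the canvas is already 224x224, so the final resize leaves it unchanged
small = padding_layout(100, 60)
```

In both cases the aspect ratio of the original crop is preserved, because only the square canvas is resized.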


In [21]:
def padd_image(img):
    """Pad a PIL image to a square with its mean colour, then resize it to 224x224."""
    # mean colour over all pixels, used as the padding colour
    avg_color_per_row = np.average(img, axis=0)
    avg_color = np.average(avg_color_per_row, axis=0)
    old_image_width, old_image_height = img.size

    if old_image_height > 224 or old_image_width > 224:
        # large crop: square canvas whose side is the longer image dimension
        new_image_width = new_image_height = max(old_image_width, old_image_height)
    else:
        # small crop: pad directly onto a 224x224 canvas
        new_image_width = new_image_height = 224

    # canvas of the desired size filled with the mean colour
    result = np.full((new_image_height, new_image_width, 3), avg_color, dtype=np.uint8)

    # compute center offset
    x_center = (new_image_width - old_image_width) // 2
    y_center = (new_image_height - old_image_height) // 2

    # copy the original image into the center of the canvas
    result[y_center:y_center + old_image_height,
           x_center:x_center + old_image_width] = img

    img = Image.fromarray(result)
    img = img.resize((224, 224))
    return img

In [22]:
original_images = []
images = []
texts = []
plt.figure(figsize=(16, 5))

# pad and display every .png/.jpg image found in the current working directory
for filename in [filename for filename in sorted(os.listdir(os.getcwd())) if filename.endswith((".png", ".jpg"))]:
    print(filename)
    name = os.path.splitext(filename)[0]

    image = Image.open(os.path.join(os.getcwd(), filename)).convert("RGB")
    image = padd_image(image)

    plt.subplot(4, 8, len(images) + 1)
    plt.imshow(image)
    plt.xticks([])
    plt.yticks([])

    original_images.append(image)
    # assumes `preprocess` is the CLIP transform returned by clip.load(...)
    images.append(preprocess(image))
    texts.append(filename)


plt.tight_layout()

<Figure size 1600x500 with 0 Axes>

# Stanza NLP analysis 

To discriminate between our predictions, we use Natural Language Processing techniques to find the subject of each caption.
We rely on the dependency graph, which represents the dependencies between the tokens of a sentence.
In each sentence we can identify the so-called **root**, the token on which all the other tokens depend. This leads to two possible scenarios:
- **The root is a noun**: this is the best case for our task, since we can take that noun as the subject of the caption.
- **The root is a verb**: this creates what is called a "verbal phrase", a sentence whose root is the verb; removing it would make the sentence lose its meaning. In this kind of sentence the dependencies are very likely to contain an "NSUBJ", or [Nominal Subject](https://universaldependencies.org/en/dep/nsubj.html) (active or passive). If it is present, it is the subject of our sentence.
Otherwise, starting from the root verb, we go back towards the beginning of the sentence and select the first noun we find as the subject.
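
The decision procedure above can be sketched on a toy parse. The `Word` tuple below mimics the fields Stanza exposes on each parsed word (`id`, `text`, `upos`, `head`, `deprel`); the hand-written parses are illustrative assumptions rather than real parser output, and `find_subject` is a hypothetical helper, not the notebook's implementation.

```python
from collections import namedtuple

# minimal stand-in for a parsed word: 1-based id, text, universal POS tag,
# id of the head word (0 for the root) and dependency relation
Word = namedtuple("Word", ["id", "text", "upos", "head", "deprel"])


def find_subject(words):
    """Return the text of the word chosen as subject, or None if no candidate exists."""
    root = next(w for w in words if w.head == 0)
    # case 1: the root itself is a noun
    if root.upos == "NOUN":
        return root.text
    # case 2: the root has a nominal subject (active or passive)
    for w in words:
        if w.head == root.id and w.deprel in ("nsubj", "nsubj:pass"):
            return w.text
    # fallback: walk back from the root towards the start and take the first noun
    for w in reversed(words[: root.id - 1]):
        if w.upos == "NOUN":
            return w.text
    return None


# "the dog on the sofa": the root is a noun
noun_root = [Word(1, "the", "DET", 2, "det"), Word(2, "dog", "NOUN", 0, "root"),
             Word(3, "on", "ADP", 5, "case"), Word(4, "the", "DET", 5, "det"),
             Word(5, "sofa", "NOUN", 2, "nmod")]
# "the boy is holding two bears": verbal root with an nsubj dependent
verb_root = [Word(1, "the", "DET", 2, "det"), Word(2, "boy", "NOUN", 4, "nsubj"),
             Word(3, "is", "AUX", 4, "aux"), Word(4, "holding", "VERB", 0, "root"),
             Word(5, "two", "NUM", 6, "nummod"), Word(6, "bears", "NOUN", 4, "obj")]
# hypothetical parse with a verbal root and no nsubj: the fallback picks the preceding noun
no_nsubj = [Word(1, "the", "DET", 2, "det"), Word(2, "boy", "NOUN", 3, "dep"),
            Word(3, "holding", "VERB", 0, "root"), Word(4, "bears", "NOUN", 3, "obj")]

subjects = [find_subject(p) for p in (noun_root, verb_root, no_nsubj)]
```

With a real Stanza pipeline, the list passed to `find_subject` would come from the words of a parsed sentence, which carry the same attribute names.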






<img src='https://editor.analyticsvidhya.com/uploads/29920Screenshot%20(127).png' >
Example of a dependency graph

### Install stanza dependency graph model

In [23]:
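# download the English 'partut' package into /models/english
stanza.download('en',model_dir='/models/english',package='partut')
# note: this pipeline is built without model_dir/package, so it loads Stanza's default English models
nlp = stanza.Pipeline(lang='en')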


INFO:stanza:Loading these models for language: en (English):
| Processor    | Package   |
----------------------------
| tokenize     | combined  |
| pos          | combined  |
| lemma        | combined  |
| constituency | wsj       |
| depparse     | combined  |
| sentiment    | sstplus   |
| ner          | ontonotes |

INFO:stanza:Using device: cuda
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: pos
INFO:stanza:Loading: lemma
INFO:stanza:Loading: constituency
INFO:stanza:Loading: depparse
INFO:stanza:Loading: sentiment
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!


### Preprocess input caption 

We noticed a problem with so-called "not well-formed" sentences (*in natural language processing, a not well-formed sentence is one that does not conform to the rules and conventions of the language, making it difficult or impossible for a computer program or a human reader to process and understand it accurately*): for these sentences we could not find the correct root.
For example, one case we encountered was the sentence "*boy holding two bears*", which, analyzed with [corenlp](https://corenlp.run/), gave the following result:
![picture](https://drive.google.com/uc?id=1KrY63kV1KwWWsjyyPnIlsOW1rUtU0KXM
)

This is an example of a "not well-formed" sentence: although the root of the sentence is the verb "holding", the NLP parse does not let us retrieve the noun "boy" as the subject, since it is recognised as an [INTJ](https://universaldependencies.org/u/pos/INTJ.html).

To solve this problem, we noticed that adding a [determiner](https://universaldependencies.org/u/pos/DET.html) at the beginning of the sentence avoids such errors. Among all the possible determiners we chose "*the*", as it does not change the meaning of the sentence and is the most general one.

With this change we obtain the following dependency parse:

![picture](https://drive.google.com/uc?id=1Rt4nU5kXKSn-pq6nRkaCAcyp__m7Z1cd
)
Here the word "boy" is parsed as a [noun](https://universaldependencies.org/u/pos/NOUN.html).
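
A minimal sketch of this preprocessing step follows. The function name `add_determiner` and the rule of leaving captions that already start with an article untouched are assumptions made for illustration; the text above only states that "the" is added at the beginning of the sentence.

```python
def add_determiner(sentence, determiner="the"):
    """Prepend a determiner unless the caption already starts with an article."""
    stripped = sentence.strip()
    if not stripped:
        # nothing to parse: return the empty caption unchanged
        return stripped
    first_word = stripped.split(" ", 1)[0].lower()
    # assumption: captions that already begin with an article are left untouched
    if first_word in ("the", "a", "an"):
        return stripped
    return f"{determiner} {stripped}"


fixed = add_determiner("boy holding two bears")
unchanged = add_determiner("the boy holding two bears")
```

Both calls yield the same caption, so the step can be applied to every sentence before parsing without duplicating the article.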

<h3> Stanza sentence preprocess </h3> 
Before analyzing a sentence, we apply a preliminary step that prevents certain errors in the subject prediction. One observation is that the phrase "that is" often introduces positional information, which can conflict with our objective, so we remove it together with everything that follows it. Another issue arises with specific quantifiers, such as "the part of", "the side of", "the bunch of", and "the piece of": the word between "the" and "of" is mistakenly recognized as the subject of the sentence. By removing these phrases, along with anything that precedes them, we avoid this error and correctly identify the intended subject.

Consider the sentence "The end of a table, with a pink tablecloth, at which eight people are sitting." Without pre-processing, the extracted subject would be "end", since it is a noun and the root of the sentence, whereas the desired subject is "table". Removing everything up to and including "the end of" gives the desired result.
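
The same idea can be expressed compactly by deriving the cut position from the matched phrase itself, which avoids hard-coded offsets. This is only a sketch under assumptions: `strip_partitive`, the short phrase list, the case-insensitive search and the final whitespace trimming are hypothetical choices, not the notebook's implementation.

```python
# illustrative subset of the "the ... of" phrases (assumption: the real list is longer)
PARTITIVE_PHRASES = ("the side of", "the end of", "the bunch of", "the piece of")


def strip_partitive(sentence):
    """Drop a 'the ... of' phrase with what precedes it, or a trailing 'that is ...' clause."""
    lowered = sentence.lower()
    for phrase in PARTITIVE_PHRASES:
        index = lowered.find(phrase)
        if index != -1:
            # keep only what follows the phrase; len(phrase) replaces a hard-coded offset
            return sentence[index + len(phrase):].strip()
    index = lowered.find("that is")
    if index != -1:
        # keep only what precedes "that is"
        return sentence[:index].strip()
    return sentence


table = strip_partitive("The end of a table, with a pink tablecloth")
cat = strip_partitive("the cat that is on the left")
```

Using `len(phrase)` keeps the offset in sync with the phrase, so adding a new phrase to the tuple requires no further changes.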

In [24]:
def remove_of(sentence):
 if "the side of" in sentence:
    index=sentence.find("the side of")
    sentence=sentence[index+11:]
    return sentence
 if "the handle of" in sentence:
    index=sentence.find("the handle of")
    sentence=sentence[index+13:]
    return sentence
 if "the bunch of" in sentence:
    index=sentence.find("the bunch of")
    sentence=sentence[index+12:]
    return sentence   
 if "the corner of" in sentence:
    index=sentence.find("the corner of")
    sentence=sentence[index+13:]
    return sentence
 if "the end of" in sentence:
    index=sentence.find("the end of")
    sentence=sentence[index+10:]
    return sentence
 if "the half of" in sentence:
    index=sentence.find("the half of")
    sentence=sentence[index+11:]
    return sentence    
 if "the edge of" in sentence:
    index=sentence.find("the edge of")
    sentence=sentence[index+11:]    
    return sentence
 if "the back of" in sentence:
    index=sentence.find("the back of")
    sentence=sentence[index+11:]
    return sentence   
 if "the smaller of" in sentence:
    index=sentence.find("the smaller of")
    sentence=sentence[index+14:]
    return sentence    
 if "the piece of" in sentence:
    index=sentence.find("the piece of")
    sentence=sentence[index+12:]
    return sentence 
 if "the wing of" in sentence:
    index=sentence.find("the wing of")
    sentence=sentence[index+11:]    
    return sentence
 if "the front of" in sentence:
    index=sentence.find("the front of")
    sentence=sentence[index+12:]   
    return sentence 
 if "the back side of" in sentence:
    index=sentence.find("the back side of")
    sentence=sentence[index+16:]   
    return sentence
 if "the front side of" in sentence:
    index=sentence.find("the front side of")
    sentence=sentence[index+17:]   
    return sentence 
 if "the left side of" in sentence:
    index=sentence.find("the left side of")
    sentence=sentence[index+16:]   
    return sentence
 if "the right side of" in sentence:
    index=sentence.find("the right side of")
    sentence=sentence[index+17:]   
    return sentence
  
 if "the pile of" in sentence:
    index=sentence.find("the pile of")
    sentence=sentence[index+11:]    
    return sentence
 if "the pair of" in sentence:
    index=sentence.find("the pair of")
    sentence=sentence[index+11:] 
    return sentence   
 if "the pieces of" in sentence:
    index=sentence.find("the pieces of")
    sentence=sentence[index+13:]   
    return sentence 
 if "the intersection of" in sentence:
    index=sentence.find("the intersection of")
    sentence=sentence[index+19:]  
    return sentence  
 if "the middle of" in sentence:
    index=sentence.find("the middle of")
    sentence=sentence[index+13:]
    return sentence    
 if "the patch of" in sentence:
    index=sentence.find("the patch of")
    sentence=sentence[index+12:]    
    return sentence
 if "the couple of" in sentence:
    index=sentence.find("the couple of")
    sentence=sentence[index+13:]  # len("the couple of") == 13
    return sentence
 if "the slice of" in sentence:
    index=sentence.find("the slice of")
    sentence=sentence[index+12:]    
    return sentence
 if "the tallest of" in sentence:
    index=sentence.find("the tallest of")
    sentence=sentence[index+14:]    
    return sentence
 if "the kind of" in sentence:
    index=sentence.find("the kind of")
    sentence=sentence[index+11:]
    return sentence    
 if "that is" in sentence:
    index=sentence.find("that is")
    sentence=sentence[:index]
    return sentence
 if "the part of" in sentence:
    index=sentence.find("the part of")
    sentence=sentence[index+11:]
    return sentence
 if "the corner of" in sentence:
    index=sentence.find("the corner of")
    sentence=sentence[index+13:]
    return sentence
 if "the half of" in sentence:
    index=sentence.find("the half of")
    sentence=sentence[index+11:]
    return sentence
 if "the top of" in sentence:
    index=sentence.find("the top of")
    sentence=sentence[index+10:]
    return sentence
 if "the right half of" in sentence:
    index=sentence.find("the right half of")
    sentence=sentence[index+17:]
    return sentence
 if "the larger of" in sentence:
    index=sentence.find("the larger of")
    sentence=sentence[index+13:]
    return sentence
 if "the open part of" in sentence:
    index=sentence.find("the open part of")
    sentence=sentence[index+16:]
    return sentence
 if "the arm of" in sentence:
    index=sentence.find("the arm of")
    sentence=sentence[index+10:]
    return sentence
 if "the set of" in sentence:
    index=sentence.find("the set of")
    sentence=sentence[index+10:]
    return sentence
 if "the partial view of" in sentence:
    index=sentence.find("the partial view of")
    sentence=sentence[index+19:]
    return sentence
 if "the bunch of" in sentence:
    index=sentence.find("the bunch of")
    sentence=sentence[index+12:]
    return sentence
 return sentence
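
The same prefix stripping can also be written in a table-driven way. The sketch below is only an illustration: the names `PARTITIVE_PHRASES` and `strip_partitive` are hypothetical, the phrase list is a small subset, and "that is" is checked last here, which is a simplification of the order used above. Deriving the offset from `len(phrase)` avoids hand-written offsets that can drift out of sync with the phrase.

```python
# Hypothetical, table-driven variant of the prefix-stripping step.
# The phrase list is a small subset chosen for illustration.
PARTITIVE_PHRASES = [
    "the back side of",
    "the right side of",
    "the side of",
    "the end of",
    "the piece of",
    "the couple of",
]


def strip_partitive(sentence: str) -> str:
    """Drop the first matching phrase and everything before it."""
    for phrase in PARTITIVE_PHRASES:
        index = sentence.find(phrase)
        if index != -1:
            # len(phrase) keeps the offset in sync with the phrase itself
            return sentence[index + len(phrase):]
    # "that is" introduces positional detail: keep only what precedes it
    index = sentence.find("that is")
    if index != -1:
        return sentence[:index]
    return sentence


print(strip_partitive("the end of a table, with a pink tablecloth"))
```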

### Beginning of sentence processing

Before removing the quantifier phrases, two further cases need to be handled:

- Sentences that lack an article at the beginning. Without a leading article, our parser struggles to identify the subject of the sentence, so we prepend an appropriate article.

- Sentences that start with "there is." The parser recognizes "there" as the nominal subject (nsubj), and since we select an nsubj whenever one exists, "there" would be returned as the subject. We therefore strip this phrase before parsing.
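
As a minimal sketch of these two cases, the hypothetical helper `normalize_start` below strips a leading "there is" and then prepends an article when one is missing. Accepting "an" as an article is an assumption of this sketch and is not part of the code in this notebook.

```python
def normalize_start(sentence: str) -> str:
    """Hypothetical helper: normalise the beginning of a caption."""
    sentence = sentence.lower().strip()
    # Strip a leading "there is " first, otherwise the article check
    # below would prepend "the" and hide the pattern.
    if sentence.startswith("there is "):
        sentence = sentence[len("there is "):]
    # Prepend an article when the caption does not start with one.
    first_word = sentence.split(" ")[0]
    if first_word not in ("the", "a", "an"):
        sentence = "the " + sentence
    return sentence


print(normalize_start("There is a dog on the sofa"))
print(normalize_start("Man riding a horse"))
```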

In [None]:
'''nlp_sent=input["sentences"]["raw"]
nlp_sent=nlp_sent.lower()
if nlp_sent.startswith('there is '):
    nlp_sent = 'the ' + nlp_sent[9:]
x = nlp_sent.split(" ")
if(x[0]!="the" and x[0]!="a"):
    nlp_sent="the "+nlp_sent
nlp_sent=remove_of(nlp_sent)

print(nlp_sent)'''

<h3> Analyze caption and compute features of its root </h3>
This is the core function of our sentence processing. It works as follows:
- We lowercase the sentence, which improves the behaviour of the NLP parser.
- We apply the beginning-of-sentence processing seen above.
- We parse the sentence and return the first word that is either a nominal subject (`nsubj` or `nsubj:pass`) or the root of the dependency tree.


In [25]:
def sent_stanza_processing(sentence):
    # lowercase the sentence: the parser behaves better on lowercase captions
    sentence = sentence.lower()
    # "there is ..." would make the parser pick "there" as the subject
    if sentence.startswith('there is '):
        sentence = 'the ' + sentence[9:]
    # len('this is ') == 8
    if sentence.startswith('this is '):
        sentence = sentence[8:]
    sentence = remove_of(sentence)

    # if the sentence does not start with "a" or "the", insert "the"
    nlp_sent = sentence
    x = nlp_sent.split(" ")
    if (x[0] != "the" and x[0] != "a"):
        nlp_sent = "the " + nlp_sent

    # `nlp` is assumed to be the Stanza pipeline created earlier in the notebook
    doc = nlp(nlp_sent)
    # doc.sentences[0].print_dependencies()  # uncomment to inspect the parse
    for sent in doc.sentences:
        for word in sent.words:
            # if it is a verbal phrase, take the nominal subject of the phrase
            if (word.deprel == 'nsubj' or word.deprel == 'nsubj:pass'):
                return word.text
            # otherwise take the root of the phrase
            if (word.head == 0):
                return word.text



## Analyze YOLO class prediction
We also encode the class names that appear in our YOLO predictions. These encodings are used to compute the similarity between the root of our sentence and the YOLO classes, which maps the root back to one of the original YOLO classes. Once the subject of the sentence is expressed as a YOLO class, we can exclude every bounding box whose label differs from it.
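
The mapping from root to class is a nearest-neighbour search under cosine similarity. The sketch below illustrates it with made-up 3-dimensional vectors in place of CLIP embeddings; the helper names `cosine` and `closest_class` and all numbers are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def closest_class(root_vec, class_vecs):
    """Return the class whose embedding is most similar to the root."""
    if not class_vecs:
        return "empty"  # same sentinel the notebook uses for no detections
    return max(class_vecs, key=lambda name: cosine(root_vec, class_vecs[name]))


# Toy 3-d embeddings, invented for illustration only.
toy_classes = {
    "dining table": [0.9, 0.1, 0.0],
    "person": [0.0, 0.2, 0.9],
}
print(closest_class([1.0, 0.0, 0.1], toy_classes))
```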

In [26]:
def get_root(yolo_output, sentence, model, yolo):
    root = sent_stanza_processing(sentence)
    # print(root)
    prompt_tokens = clip.tokenize(
        root, context_length=77, truncate=True).cuda()
    with torch.no_grad():
        prompt_features = model.encode_text(prompt_tokens).float()

    names = []
    for a in range(len(yolo_output.xyxy[0])):
        class_index = int(yolo_output.pred[0][a][5])
        label = yolo.names[class_index]
        names.append(label)
    tokens = clip.tokenize(names, context_length=77, truncate=True).cuda()
    with torch.no_grad():
        classes_features = model.encode_text(tokens).float()
    prompt_features /= prompt_features.norm(dim=-1, keepdim=True)
    classes_features /= classes_features.norm(dim=-1, keepdim=True)
    prompt_similarity = classes_features.cpu().numpy() @ prompt_features.cpu().numpy().T
    if prompt_similarity.shape[0] == 0:
        return "empty"
    rappresentation = np.argmax(prompt_similarity)

    interested_class = names[rappresentation]
    return interested_class

### Take class with maximum similarity 

We keep only the class with the highest similarity to the root of the sentence.

In [None]:
# take as desired class the one with the highest similarity
rappresentation=np.argmax(prompt_similarity)
print(root)
interested_class=names[rappresentation]

print(interested_class)

We achieved an accuracy of 87% in predicting the subject of the sentence using Stanza. The remaining 13% of errors can be grouped into three types:

- The first type of error stems from issues in the sentence itself.  These errors can occur in two cases: when the sentence contains grammatical errors such as "The class of water in front of the bowl of bread" or when the sentence is not well-formed such as "Pot boiling water with green bell peppers in man's kitchen" that is missing the "of" between "Pot" and "boiling". NLP parsers are primarily designed for well-formed written languages, so errors may arise when working with other types of languages. One possible solution to address this problem is to introduce a machine learning model that can correct these errors and transform the original sentence into a well-formed one without altering its intended meaning.

- The second type of error is intrinsic to the dataset. Here the subject class is labeled incorrectly in the dataset, while our NLP analysis provides the correct subject. These errors can be considered "false negatives," as they do not significantly impact the final outcome of the pipeline. (Example: the teddy bear image.)

- The third type of error is related to our own process. This error occurs when we are unable to correctly map the predicted subject to one of the classes in YOLO. This issue arises due to the utilization of the CLIP encoder, which has a fixed context length of 77 tokens, during the process of embedding the subject of the sentence and the YOLO classes. The results we obtain are highly influenced by this context.
To illustrate this, let's consider an example where the word "lemon" is the subject in our context, and we intend to associate it with the "orange" class. However, if the "banana" class is selected as the most similar, it is because the word "banana" appears more frequently in the analyzed context near the word "lemon" than the word "orange" does.
The embedding process relies heavily on the surrounding context, and the choice of the most similar class depends on the co-occurrence patterns observed within that context. Therefore, variations in the embedding results can occur based on the specific context utilized by the CLIP encoder.

## Baseline + stanza

In [27]:
# class that defines the baseline model

class VisualGrounding_stanza(torch.nn.Module):
    def __init__(self, yolo_version, clip_version, local_path, img_path):
        super(VisualGrounding_stanza, self).__init__()
        self.local_path = local_path
        self.img_path = img_path
        # initialize models
        self.yolo = torch.hub.load(
            'ultralytics/yolov5', yolo_version, pretrained=True)
        self.clip, self.preprocess = clip.load(clip_version)
        self.name = "stanza"
        # define metrics
        self.metrics = Metrics(self.clip, self.name)

        

    def forward(self, img_path, sentence):
        max_similarity = 0
        max_image = None
        max_bbox = None

        yolo_output = self.yolo(self.local_path+img_path)

        original_img = Image.open(self.local_path+img_path).convert("RGB")

        root = get_root(yolo_output, sentence, self.clip, self.yolo)

        for i in range(len(yolo_output.xyxy[0])):
            if root != "empty" and self.yolo.names[int(yolo_output.pred[0][i][5])] != root:
                continue
            #crop the image based on the yolo output
            img_cropped = crop_yolo(yolo_output, original_img, i)

            img = self.preprocess(img_cropped).cuda().unsqueeze(0)
            text = clip.tokenize([sentence]).cuda()

            with torch.no_grad():
                image_features = self.clip.encode_image(img).float()
                text_features = self.clip.encode_text(text).float()

            image_features /= image_features.norm(dim=-1, keepdim=True)
            text_features /= text_features.norm(dim=-1, keepdim=True)
            similarity = text_features.cpu().numpy() @ image_features.cpu().numpy().T

            if similarity > max_similarity:
                max_similarity = similarity
                max_image = img_cropped
                max_bbox = yolo_output.xyxy[0][i]

        if max_image is None:
            #set bbox to the whole image
            max_bbox = [0, 0, original_img.width, original_img.height]
            max_image = original_img

        return max_bbox, max_image

    def evaluate(self, img_path, sentence, gt, original_img):
        bbox = convert_bbox(gt, original_img)
        gt_crop = original_img.crop(bbox)
        prediction_bbox, prediction_img = self.forward(img_path, sentence)
        self.metrics.update(prediction_bbox, bbox, prediction_img, gt_crop)
        return prediction_bbox, prediction_img

    def reset_metrics(self):
        self.metrics.reset()

    def get_metrics(self):
        return self.metrics.to_string()
    
    def save_metrics(self):
        self.metrics.save()

# Stable Diffusion

## **What is Stable Diffusion?** <br>
It is a **text-to-image model**: give it a **text prompt** and it will return an image matching the text. <br>
Here is an example of how it works:
![picture](https://drive.google.com/uc?id=1NAPD4WLEjGT3orZe3AqE6aN8jLpkMHV4)

## **Diffusion Models**
Stable Diffusion belongs to a class of deep learning models called **Diffusion Models.** They are generative models, meaning they are designed to generate new data similar to what they have seen in training. <br>
The model is based on two processes:
* Forward Diffusion
* Reverse Diffusion

**Forward Diffusion** <br>
This process adds noise to a training image, gradually turning it into a featureless noise image.
Below is an example of an image undergoing forward diffusion. 
![picture](https://drive.google.com/uc?id=10wKNn2adHGUb9ceXuUdAFmkMaGN8Nzwa)

**Reverse Diffusion** <br>
Starting from a noisy image, reverse diffusion recovers the original one.
![picture](https://drive.google.com/uc?id=1sxYKCVgEt8Hr3Rzb8__NT0M6iJ578OKm)

**How training is done** <br>
To reverse the diffusion, we need to know how much noise was added to an image. The solution is to train a neural network to predict the added noise. This network is called the **noise predictor** and it is a [U-Net model](https://arxiv.org/abs/1505.04597). Training proceeds as follows.

* Pick a training image;
* Generate a random noise image;
* Corrupt the training image by adding this noisy image up to a certain number of steps;
* Teach the noise predictor to tell us how much noise was added. This is done by tuning its weights and showing it the correct answer.<br>
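
The training steps above can be sketched with a toy example. The closed-form mixing used here (square roots of a "signal kept" fraction `alpha_bar`) is the DDPM-style formulation and is an assumption of this sketch, since the text does not state how the noise is mixed in. The names `add_noise` and `make_training_pair` are hypothetical.

```python
import math
import random


def add_noise(image, noise, alpha_bar):
    """Mix a clean image with Gaussian noise.

    alpha_bar in [0, 1] is the fraction of signal kept: values near 1
    barely change the image, values near 0 leave almost pure noise.
    """
    keep = math.sqrt(alpha_bar)
    spread = math.sqrt(1.0 - alpha_bar)
    return [keep * x + spread * n for x, n in zip(image, noise)]


def make_training_pair(image, alpha_bar, rng):
    """Return (noisy_image, target_noise) for one training step."""
    noise = [rng.gauss(0.0, 1.0) for _ in image]
    return add_noise(image, noise, alpha_bar), noise


rng = random.Random(0)
toy_image = [0.2, 0.8, 0.5, 0.1]  # a flattened 2x2 "image", made up
noisy, target = make_training_pair(toy_image, alpha_bar=0.5, rng=rng)
# A noise predictor would be trained so that predictor(noisy) is close to target.
```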

After training, the noise predictor is capable of estimating the noise added to an image.
To generate an image, we start from a completely random image and ask the noise predictor to estimate its noise. The estimated noise is then subtracted from the image, and this process is repeated multiple times.
![picture](https://drive.google.com/uc?id=1YyCxaPd6GFaJQDVaSP-zdQC-CAElp4CN)

## **Stable Diffusion Model**
Stable Diffusion is a **latent diffusion model**. Instead of operating in the high-dimensional image space, it first compresses the image into the **latent space**. 
This is done using a [Variational Autoencoder (VAE)](https://arxiv.org/abs/1312.6114). <br>
The latent space of Stable Diffusion model is 4x64x64, 48 times smaller than the image pixel space. All the forward and reverse diffusions we talked about are actually done in the latent space.
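
The factor of 48 can be checked with a few lines of arithmetic, assuming the 3x512x512 pixel space of Stable Diffusion v1 (an assumption of this sketch).

```python
# Assumed shapes: RGB images of 512x512 pixels and latents of 4x64x64.
pixel_values = 3 * 512 * 512
latent_values = 4 * 64 * 64
compression = pixel_values // latent_values
print(pixel_values, latent_values, compression)  # 786432 16384 48
```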

So during training, instead of generating a noisy image, it generates a random tensor in latent space (latent noise). Instead of corrupting an image with noise, it corrupts the representation of the image in latent space with the latent noise. The reason for doing that is it is a lot faster since the latent space is smaller. <br>

**Reverse Diffusion in latent space** <br>
Here’s how latent reverse diffusion in Stable Diffusion works.
1. A random latent space matrix is generated.
2. The noise predictor estimates the noise of the latent matrix.
3. The estimated noise is then subtracted from the latent matrix.
4. Steps 2 and 3 are repeated for a fixed number of sampling steps.
5. The decoder of the VAE converts the latent matrix to the final image.
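
The loop above can be sketched as follows. This is a toy illustration only: `toy_noise_predictor` and `toy_decoder` are hypothetical stand-ins for the U-Net and the VAE decoder, and subtracting a fixed fraction (`step_size`) of the estimated noise is a simplification of what real samplers do.

```python
import random


def toy_noise_predictor(latent):
    """Stand-in for the U-Net: treats the whole latent as noise."""
    return list(latent)


def toy_decoder(latent):
    """Stand-in for the VAE decoder: clamps values to [0, 1]."""
    return [min(1.0, max(0.0, v)) for v in latent]


def reverse_diffusion(num_values, sampling_steps, step_size, rng):
    # start from a random latent
    latent = [rng.gauss(0.0, 1.0) for _ in range(num_values)]
    for _ in range(sampling_steps):
        # estimate the noise in the current latent
        noise = toy_noise_predictor(latent)
        # subtract a fraction of the estimated noise
        latent = [v - step_size * n for v, n in zip(latent, noise)]
    # decode the final latent into an "image"
    return toy_decoder(latent)


image = reverse_diffusion(num_values=8, sampling_steps=20,
                          step_size=0.1, rng=random.Random(0))
```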

**Conditioning** <br>
The purpose of conditioning is to steer the noise predictor so that the predicted noise will give us what we want after subtracting from the image. <br>
Below is an overview of how a text prompt is processed and fed into the noise predictor. The **tokenizer** first converts each word into tokens. Each token is then converted to a 768-value **embedding** vector. The embeddings are then processed by the text transformer and are ready to be consumed by the noise predictor. <br>
![picture](https://drive.google.com/uc?id=1yHryHT8IutyeG46PtPvnu0PSpQ0KqZHs)

**Tokenizer:** The text prompt is first tokenized by a CLIP tokenizer. Stable Diffusion is limited to 75 tokens per prompt. <br>
**Embedding:** Stable Diffusion v1 uses OpenAI’s ViT-L/14 CLIP model. Each embedding is a 768-value vector. The embeddings are learned when CLIP is trained and are kept fixed afterwards. <br>
**Text Transformer:** The embedding needs to be further processed by the text transformer before feeding into the noise predictor. The transformer not only further processes the data but also provides a mechanism to include different conditioning modalities. <br>
**Cross-Attention:** The output of the text transformer is used multiple times by the noise predictor throughout the U-Net. The U-Net consumes it by a cross-attention mechanism, that's where the prompt meets the image. 

### **Classifier-Free Guidance (CFG)**

**Classifier guidance** <br>

[Classifier guidance](https://arxiv.org/abs/2105.05233) is a way to incorporate image labels in diffusion models. The label is used to guide the diffusion process. The classifier guidance scale is a parameter that controls how closely the diffusion process should follow the label.

For example, suppose there are 3 groups of images with labels “cat”, “dog”, and “human”. If the diffusion is unguided, the model will draw samples from each group’s total population, but sometimes it may draw images that could fit two labels, e.g. a boy petting a dog.
With high classifier guidance, the images produced by the diffusion model are biased toward the extreme or unambiguous examples. If you ask the model for a cat, it will return an image that is unambiguously a cat and nothing else. <br>
The **classifier guidance scale** controls how closely the guidance is followed.
![picture](https://drive.google.com/uc?id=1xU0Y8A9kesgAeX7ustn8doEg-cQBa--N)

In the figure above, the sampling on the right has a higher classifier guidance scale than the one in the middle. In practice, this scale value is simply the multiplier to the drift term toward the data with that label. <br>

**Classifier-free guidance** <br>
Although classifier guidance achieved record-breaking performance, it needs an extra model to provide that guidance. This has presented some difficulties in training. <br>
[Classifier-free guidance](https://arxiv.org/abs/2207.12598), in its authors’ terms, is a way to achieve *“classifier guidance without a classifier”*. Instead of using class labels and a separate model for guidance, they proposed to use image captions and train a conditional diffusion model. <br>
They put the classifier part as conditioning of the noise predictor U-Net, achieving the so-called *“classifier-free”* (i.e. without a separate image classifier) guidance in image generation.
The text prompt provides this guidance in text-to-image. <br>
In summary, **Classifier-free guidance (CFG) scale** is a value that controls how much the text prompt conditions the diffusion process. The image generation is unconditioned (i.e. the prompt is ignored) when it is set to 0. A higher value steers the diffusion towards the prompt.
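
A common way to express the effect of the CFG scale is to blend two noise predictions, one made without the prompt and one made with it. The sketch below assumes this formulation; `apply_cfg` is a hypothetical name and the numbers are toy values, not outputs of a real model.

```python
def apply_cfg(noise_uncond, noise_cond, scale):
    """Blend unconditional and prompt-conditioned noise predictions."""
    return [u + scale * (c - u) for u, c in zip(noise_uncond, noise_cond)]


uncond = [0.10, -0.20, 0.05]  # toy prediction with an empty prompt
cond = [0.30, -0.10, 0.00]    # toy prediction with the text prompt

print(apply_cfg(uncond, cond, 0.0))  # prompt ignored
print(apply_cfg(uncond, cond, 7.5))  # pushed further toward the prompt
```

With a scale of 0 the result equals the unconditional prediction, which matches the statement that the prompt is ignored; larger values move the prediction further in the direction indicated by the prompt.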





## **Why Stable Diffusion?**
Having described Stable Diffusion, we now explain why we decided to introduce a generative model. Our initial focus was on analyzing the issues present in CLIP, particularly the problem of polysemy.

Polysemy can be described as the phenomenon wherein the model struggles to distinguish between the various meanings of certain words due to a lack of context. As mentioned earlier, some images in the dataset are only labeled with a class tag, without a complete textual prompt. The CLIP authors provide an example from the Oxford-IIIT Pet dataset, where the term 'boxer' can refer to either a dog breed or a type of athlete. In this case, the issue lies with the quality of the data rather than the model itself.

Having identified this problem, our next step was to find a solution. Introducing a generative model was our attempt to address the challenges of Polysemy. We decided to utilize the captions provided in the dataset as prompts for generating synthetic images. These images serve two purposes in our predictions.

Firstly, since they are generated based on the same description as the target image, they tend to exhibit greater similarity to the desired goal compared to other bounding boxes being analyzed. Additionally, during the image generation process, we incorporate additional information regarding the subject we aim to recognize. When these images are encoded, this extra information enhances the similarity between the generated images and the desired target image.

(TODO: add an example of polysemy and of the added information)



### Import stable diffusion model and create pipeline

In [None]:
# Load the Stable Diffusion model
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16)  
pipe = pipe.to("cuda")


### Generate images 

Since Stable Diffusion has a component of randomness in image generation, we decided to generate more than one image per caption, in order to reduce the weight of a possible "outlier" among the generated images.

We then encode the generated images with CLIP and average their similarities with each bounding box.

In [None]:
# build the prompt from the caption provided by the dataset
prompt = ('Use deep learning algorithms to generate a hyper-realistic portrait of a '
          + input["sentences"]["raw"]
          + ' Use advanced image processing techniques to make the image appear as if it were a photograph')

# create the images with stable diffusion; each result is a PIL image
# (https://pillow.readthedocs.io/en/stable/)
stable_input = pipe(prompt, num_inference_steps=50).images[0]
stable_input2 = pipe(prompt, num_inference_steps=50).images[0]
stable_input3 = pipe(prompt, num_inference_steps=50).images[0]

stable_input

Examples of stable diffusion generated images 

![picture](https://drive.google.com/uc?id=1N0pMFQD_htgEnoN3sIQEN4xhFXRALpm_) ![picture](https://drive.google.com/uc?id=1WRGaSHdUlKx8oYe3Gh5SwALpudP6Si3D) ![picture](https://drive.google.com/uc?id=1dZlmX1oMfNp41HEeqqrTd7SdPrs2eJzW)


### Create encoding for images

In [None]:

with torch.no_grad():
    # transform image into tensor
    convert_tensor = transforms.ToTensor()
    # resize image for clip
    stable_input=stable_input.resize([224,224])   
    stable_input2=stable_input2.resize([224,224])   
    stable_input3=stable_input3.resize([224,224])   

    # create tensors
    image_stable=torch.tensor(np.stack(convert_tensor(stable_input))).cuda()   
    image_stable2=torch.tensor(np.stack(convert_tensor(stable_input2))).cuda()   
    image_stable3=torch.tensor(np.stack(convert_tensor(stable_input3))).cuda()   
    
    # stack images into a tensor
    img_tens = torch.stack([image_stable,image_stable2,image_stable3])
    
    print(image_input.size())
    print(img_tens.size())
    
    # encode stable diffusion images
    text_features= model.encode_image(img_tens).float()
    
    # encode YOLO bounding boxes
    image_features = model.encode_image(image_input).float()
    
    print(text_features.size())
    print(image_features.size())

    #stable_input=stable_input.resize([224,224])   
    


    #text_features = model.encode_text(text_tokens).float()

### Compute Cosine similarity

In [None]:
# compute cosine similarity
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = text_features.cpu().numpy() @ image_features.cpu().numpy().T

In [None]:
# compute mean of similarities
similarity_vec=np.mean(similarity,axis=0)


![picture](https://drive.google.com/uc?id=11E-447bBm4y4B35ZzqurtT__b73a3knj)



### Take only the best bounding box

Once we have computed the similarity values and obtained the mean vector of the similarities, we filter the results to make our prediction. First, we keep only the entries of the similarity vector whose bounding boxes have the same class as the one we obtained before.
Then, within the "filtered" vector, we take the maximum value, which corresponds to our prediction.
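
The filtering step can be sketched in isolation as follows. The function name `pick_best_box`, the toy values, and the fallback to the best box overall when no box has the wanted class are assumptions of this sketch, not part of the notebook code.

```python
def pick_best_box(similarity_vec, labels, interested_class):
    """Return the index of the best box of the wanted class.

    Falls back to the best box overall when no box has that class
    (a fallback chosen for this sketch, not taken from the notebook).
    """
    if not similarity_vec:
        return None
    candidates = [i for i, label in enumerate(labels)
                  if label == interested_class]
    if not candidates:
        candidates = list(range(len(similarity_vec)))
    return max(candidates, key=lambda i: similarity_vec[i])


mean_similarity = [0.81, 0.64, 0.77]   # toy values, one per box
box_labels = ["person", "dog", "dog"]  # toy YOLO labels
print(pick_best_box(mean_similarity, box_labels, "dog"))  # 2
```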

In [None]:
possible_values=[]
indexes=[]
for i in range(len(yolo_output.xyxy[0])):
    class_index=int(yolo_output.pred[0][i][5])
    label=class_names[class_index]
    # take only the YOLO predictions that have as class the one we computed in the NLP analysis
    if(label==interested_class):
      possible_values.append(similarity_vec[i])
      indexes.append(i)
# take only the best class      
print(max(possible_values))
index=possible_values.index(max(possible_values))
index=indexes[index]
print(nlp_sent)

original_images[index]

![picture](https://drive.google.com/uc?id=1ZSey_FP4GwBCwVhaFO1-XGX3ejMm1Hv1)


In [None]:
def process_stable_images(index, clip_model, device):
    convert_tensor = transforms.ToTensor()
    images = []
    # load the three images generated for this sample
    for k in range(1, 4):
        stable_input = Image.open("stable_diffusion/stable_diffusion_"+str(index)+"_"+str(k)+".jpg")
        # resize image for clip
        stable_input = stable_input.resize([224, 224])
        images.append(convert_tensor(stable_input).to(device))

    with torch.no_grad():
        # stack images into a single batch tensor
        img_tens = torch.stack(images)
        stable_features = clip_model.encode_image(img_tens).float()
        # normalize so that the dot product with normalized box features
        # is a cosine similarity
        stable_features /= stable_features.norm(dim=-1, keepdim=True)

    return stable_features


## Stable Diffusion model

In [None]:
# class that defines the Stable Diffusion (image-to-image) model
class VisualGrounding_stable_diffusion(torch.nn.Module):
    def __init__(self, yolo_version, clip_version, local_path, img_path):
        super(VisualGrounding_stable_diffusion, self).__init__()
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

        self.local_path = local_path
        self.img_path = img_path
        # initialize models
        self.yolo = torch.hub.load('ultralytics/yolov5', yolo_version, pretrained=True)
        self.clip, self.preprocess = clip.load(clip_version)

        self.name = "stable_diffusion"
        self.metrics = Metrics(self.clip, self.name)


    def forward(self, img_path, sentence,index):
        similarities = []
        bboxes = []
        max_similarity = 0
        max_image = None
        max_bbox = None

        yolo_output = self.yolo(self.local_path+img_path)

        original_img = Image.open(self.local_path+img_path).convert("RGB")

        root = get_root(yolo_output, sentence, self.clip, self.yolo)
        
        stable_features=process_stable_images(index, self.clip, self.device)

        no_bbox=True

        for i in range(len(yolo_output.xyxy[0])):
            if root != "empty" and self.yolo.names[int(yolo_output.pred[0][i][5])] != root:
                continue
            
            no_bbox=False
            img_cropped = crop_yolo(yolo_output, original_img, i)

            #plt.imshow(img_cropped)
            #plt.show()
            img_cropped = padd_image(img_cropped)
            img = self.preprocess(img_cropped).cuda().unsqueeze(0)
            #text = clip.tokenize([sentence]).cuda()
            
            with torch.no_grad():
                image_features = self.clip.encode_image(img).float()

                image_features /= image_features.norm(dim=-1, keepdim=True)
                similarity = stable_features.cpu().numpy() @ image_features.cpu().numpy().T
                similarity = similarity.reshape(3, 1)

                similarities.append(torch.tensor(similarity))
                bboxes.append((yolo_output.xyxy[0][i][0].cpu().numpy(), yolo_output.xyxy[0][i][1].cpu().numpy(), yolo_output.xyxy[0][i][2].cpu().numpy(), yolo_output.xyxy[0][i][3].cpu().numpy()))

        if no_bbox:
            similarities.append(torch.zeros(3,1))
            max_bbox = [0, 0, original_img.width, original_img.height]
            max_bbox = torch.tensor(max_bbox)
            max_image = original_img
            return max_bbox, max_image
            
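        stacked_similarity = torch.cat(similarities, dim=1)
        # for each generated image, the (filtered) box it is most similar to
        max_indices = torch.argmax(stacked_similarity, dim=1)
        max_count = torch.bincount(max_indices)

        # parity case: every generated image voted for a different box
        if torch.max(max_count) == 1:
            column_means = torch.mean(stacked_similarity, dim=0)
            best_bbox = int(torch.argmax(column_means))
        else:
            best_bbox = int(torch.argmax(max_count))

        # best_bbox indexes the boxes that passed the class filter,
        # so map it back to the corresponding index in the YOLO output
        kept = [i for i in range(len(yolo_output.xyxy[0]))
                if root == "empty" or self.yolo.names[int(yolo_output.pred[0][i][5])] == root]
        best_bbox = kept[best_bbox]

        max_bbox_new = yolo_output.xyxy[0][best_bbox]
        max_image_new = crop_yolo(yolo_output, original_img, best_bbox)

        return max_bbox_new, max_image_new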


    def evaluate(self, img_path, sentence, gt, original_img, index):
        bbox = convert_bbox(gt, original_img)
        gt_crop = original_img.crop(bbox)
        prediction_bbox, prediction_img = self.forward(img_path, sentence,index)
        self.metrics.update(prediction_bbox, bbox, prediction_img, gt_crop)
        return prediction_bbox, prediction_img

    def reset_metrics(self):
        self.metrics.reset()

    def get_metrics(self):
        return self.metrics.to_string()
    
    def save_metrics(self):
        self.metrics.save()

# Image Captioning 

In [None]:
# class that defines the image captioning (text-to-text) model

class VisualGrounding_ttt(torch.nn.Module):
    def __init__(self, yolo_version, clip_version, local_path, img_path):
        super(VisualGrounding_ttt, self).__init__()
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

        self.local_path = local_path
        self.img_path = img_path

        # initialize models
        self.yolo = torch.hub.load('ultralytics/yolov5', yolo_version, pretrained=True).to(self.device)
        self.clip, self.preprocess = clip.load(clip_version)

        # text to text section
        self.text_tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base', max_length=20)
        self.text_feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224-in21k')
        self.text_model = VisionEncoderDecoderModel.from_pretrained('/home/pappol/Scrivania/deepLearning/Image_Captioning_VIT_Roberta_final_4')
        self.text_model.to(self.device)
        self.name = "text to text"
        # define metrics
        self.metrics = Metrics(self.clip, self.name)

        

    def forward(self, img_path, sentence):
        similarity = torch.tensor([]).to(self.device)
        # YOLO indices of the boxes that pass the subject filter, so that the
        # argmax over `similarity` can be mapped back to the original detection
        kept_indices = []

        yolo_output = self.yolo(self.local_path+img_path)
        original_img = Image.open(self.local_path+img_path).convert("RGB")

        root = get_root(yolo_output, sentence, self.clip, self.yolo)

        sentence_tokens = clip.tokenize([sentence]).to(device=self.device)
        with torch.no_grad():
            embedding_sent = self.clip.encode_text(sentence_tokens)

        for i in range(len(yolo_output.xyxy[0])):
            # skip boxes whose YOLO class is not the subject of the sentence
            if root != "empty" and self.yolo.names[int(yolo_output.pred[0][i][5])] != root:
                continue

            with torch.no_grad():
                kept_indices.append(i)
                # crop the image based on the yolo output
                img_cropped = crop_yolo(yolo_output, original_img, i)
                # generate a caption for the crop
                features = self.text_feature_extractor(img_cropped, return_tensors="pt").pixel_values.to(self.device)
                generated = self.text_model.generate(features)[0]
                caption = self.text_tokenizer.decode(generated)
                caption = clear_caption(caption)
                caption_tokens = clip.tokenize([caption]).to(device=self.device)
                embedding_gen = self.clip.encode_text(caption_tokens)

                # cosine similarity between the generated caption and the sentence
                similarity = torch.cat((similarity, torch.nn.functional.cosine_similarity(embedding_gen, embedding_sent)), 0)

        if not kept_indices:
            # no usable box: fall back to the whole image
            max_bbox = [0, 0, original_img.width, original_img.height]
            return max_bbox, original_img

        # argmax over the kept boxes, mapped back to the YOLO index
        index = kept_indices[int(torch.argmax(similarity))]
        max_bbox = yolo_output.xyxy[0][index]
        max_image = crop_yolo(yolo_output, original_img, index)

        return max_bbox, max_image

    def evaluate(self, img_path, sentence, gt, original_img):
        bbox = convert_bbox(gt, original_img)
        gt_crop = original_img.crop(bbox)
        prediction_bbox, prediction_img = self.forward(img_path, sentence)
        self.metrics.update(prediction_bbox, bbox, prediction_img, gt_crop)
        return prediction_bbox, prediction_img

    def reset_metrics(self):
        self.metrics.reset()

    def get_metrics(self):
        return self.metrics.to_string()
    
    def save_metrics(self):
        self.metrics.save()
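
One detail of the filter-then-argmax step is easy to get wrong. When the subject filter skips some detections, the position of the best score among the kept boxes is not the index of that box in the YOLO output. The sketch below illustrates the mapping with plain lists; `pick_box`, `labels` and `score_fn` are hypothetical names used only for illustration, with `score_fn` standing in for the caption-to-sentence similarity.

```python
def pick_box(labels, root, score_fn):
    """Return the index, in the original detection list, of the best kept box."""
    kept_indices, scores = [], []
    for i, label in enumerate(labels):
        # same filter as in forward: keep everything when no subject was found
        if root != "empty" and label != root:
            continue
        kept_indices.append(i)
        scores.append(score_fn(i))

    if not kept_indices:
        return None  # the caller falls back to the whole image

    # position of the best score among the kept boxes ...
    best_position = max(range(len(scores)), key=scores.__getitem__)
    # ... translated back to an index into the original detections
    return kept_indices[best_position]


labels = ["dog", "person", "person"]
scores = {0: 0.9, 1: 0.2, 2: 0.6}
print(pick_box(labels, "person", scores.get))
```

Here the best kept box sits at position 1 of the filtered list but at index 2 of the detections, so using the position directly would select the wrong box.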

## Fine tuning image captioning

In [None]:
TRAIN_BATCH_SIZE = 16  # batch size for training
VALID_BATCH_SIZE = 6   # batch size for validation

TRAIN_EPOCHS = 45      # number of training epochs
VAL_EPOCHS = 1

LEARNING_RATE = 1e-4   # learning rate
SEED = 42              # random seed
MAX_LEN = 128          # maximum tokenizer sequence length
SUMMARY_LEN = 20       # maximum caption length (in tokens)
WEIGHT_DECAY = 0.01    # weight decay

In [None]:
class IAMDataset(Dataset):
    def __init__(self, df, tokenizer,feature_extractor, decoder_max_length = 20):
        self.df = df
        self.tokenizer = tokenizer
        self.feature_extractor = feature_extractor
        self.decoder_max_length = decoder_max_length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        # get file name + text 
        img_path = self.df['images'][idx]
        caption = self.df['captions'][idx]
        # prepare image (i.e. resize + normalize)
        image = Image.open(img_path).convert("RGB")
        pixel_values = self.feature_extractor(image, return_tensors="pt").pixel_values
        
        # add labels (input_ids) by encoding the text
        labels = self.tokenizer(caption, truncation = True,
                                          padding="max_length", 
                                          max_length=self.decoder_max_length).input_ids
        
        # important: make sure that PAD tokens are ignored by the loss function
        labels = [label if label != self.tokenizer.pad_token_id else -100 for label in labels]

        encoding = {"pixel_values": pixel_values.squeeze(), "labels": torch.tensor(labels)}
        return encoding
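
The label masking in `__getitem__` relies on the fact that PyTorch's cross-entropy loss ignores targets equal to `-100` by default. A minimal sketch of the masking and of the reverse step needed before decoding is shown below. The helper names are hypothetical, and the token ids in the example are illustrative, apart from `1`, which is the padding id of `roberta-base`.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_padding(input_ids, pad_token_id):
    """Replace padding ids so that the loss skips those positions."""
    return [tok if tok != pad_token_id else IGNORE_INDEX for tok in input_ids]


def unmask_padding(labels, pad_token_id):
    """Restore padding ids so that the labels can be decoded back to text."""
    return [tok if tok != IGNORE_INDEX else pad_token_id for tok in labels]


padded = [0, 250, 300, 2, 1, 1]  # <s> ... </s> <pad> <pad>
masked = mask_padding(padded, pad_token_id=1)
print(masked)
print(unmask_padding(masked, pad_token_id=1))
```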

In [None]:
def compute_metrics(pred):
    # Decode with the same RoBERTa tokenizer that produced the labels (the
    # module-level `tokenizer`) and reuse the module-level `rouge` metric.
    # Both are defined in a later cell, before training starts.
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    # all unnecessary tokens are removed
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    # restore the PAD id where the labels were masked with -100
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    rouge_output = rouge.compute(predictions=pred_str, references=label_str, rouge_types=["rouge2"])["rouge2"].mid

    return {
        "rouge2_precision": round(rouge_output.precision, 4),
        "rouge2_recall": round(rouge_output.recall, 4),
        "rouge2_fmeasure": round(rouge_output.fmeasure, 4),
    }
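
To make the reported numbers easier to read, here is a simplified sketch of what ROUGE-2 measures for a single prediction and reference pair: the overlap of word bigrams. It assumes whitespace tokenization and lowercasing only, so its values can differ from those of the `rouge` metric used above, which applies its own tokenization and aggregates over the whole evaluation set. The function names are hypothetical.

```python
from collections import Counter


def bigrams(text):
    """Count the word bigrams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2(prediction, reference):
    """Return (precision, recall, f-measure) of the bigram overlap."""
    pred, ref = bigrams(prediction), bigrams(reference)
    overlap = sum((pred & ref).values())  # clipped bigram matches
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


print(rouge2("a man riding a horse", "a man on a horse"))
```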

In [None]:
path = "Image_Captioning_VIT_Roberta_final_3"

print("Loading data")
df = pd.read_csv("RefCOCOg_cropped.csv")
df['cropped'] = df['cropped'].str.replace('refcocog/', '')
df = df.rename(columns={'cropped': 'images', 'raw': 'captions'})
df['captions'] = df['captions'].str.lower()

train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base', max_len=MAX_LEN)

feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")

batch_size=TRAIN_BATCH_SIZE

train_dataset = IAMDataset(df=train_df.sample(frac=1, random_state=2).reset_index(drop=True),
                        tokenizer=tokenizer,
                        feature_extractor=feature_extractor)

test_dataset = IAMDataset(df=test_df.sample(frac=1, random_state=2).reset_index(drop=True),
                        tokenizer=tokenizer,
                        feature_extractor=feature_extractor)

# load the captioning checkpoint stored at `path`

model = VisionEncoderDecoderModel.from_pretrained(path)
# set special tokens used for creating the decoder_input_ids from the labels
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# make sure vocab size is set correctly
model.config.vocab_size = model.config.decoder.vocab_size

# set beam search parameters
model.config.eos_token_id = tokenizer.sep_token_id
model.config.max_length = 20
model.config.early_stopping = True
model.config.no_repeat_ngram_size = 3
model.config.length_penalty = 2.0
model.config.num_beams = 4

# load rouge for validation
rouge = datasets.load_metric("rouge")

captioning_model = 'VIT_Captioning'

In [None]:

training_args = Seq2SeqTrainingArguments(
    output_dir=captioning_model,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    predict_with_generate=True,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    do_train=True,
    do_eval=True,
    logging_steps=1024,
    save_steps=2048,
    warmup_steps=1024,
    num_train_epochs=TRAIN_EPOCHS,
    overwrite_output_dir=True,
)

# instantiate trainer
trainer = Seq2SeqTrainer(
    tokenizer=feature_extractor,
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    data_collator=default_data_collator,
)

In [None]:
trainer.train()

In [None]:
trainer.save_model('Image_Captioning_VIT_Roberta_final_4')

# Model evaluations and comparisons

### Load data

In [None]:
# dataset load
dataset = Coco(local_annotations + 'instances.json', local_annotations + "refs(umd).p")

### Baseline

In [None]:
# model load
baseline = VisualGrounding_baseline('yolov5x', 'ViT-B/32', local_path, local_annotations)

In [None]:
validate(baseline, dataset.get_test())

In [None]:
test_on_one_image(baseline, dataset.get_test(), 455)

### Baseline + Stanza

In [28]:
stanza_baseline = VisualGrounding_stanza('yolov5x', 'ViT-B/32', local_path, local_annotations)

YOLOv5 🚀 2023-6-8 Python-3.10.11 torch-2.0.1+cu118 CUDA:0 (Tesla T4, 15102MiB)
YOLOv5x summary: 444 layers, 86705005 parameters, 0 gradients

In [None]:
# evaluate on the test split, like the other models, so the results are comparable
validate(stanza_baseline, dataset.get_test())

In [None]:
test_on_one_image(stanza_baseline, dataset.get_test(), 455)

### Image captioning

In [None]:
text_to_text = VisualGrounding_ttt('yolov5x', 'ViT-B/32', local_path, local_annotations)

In [None]:
validate(text_to_text, dataset.get_test())

In [None]:
test_on_one_image(text_to_text, dataset.get_test(), 455)

### Stable diffusion

In [None]:
stable_model = VisualGrounding_stable_diffusion('yolov5x', 'ViT-B/32', local_path, local_annotations)

In [None]:
validate(stable_model, dataset.get_test())

In [None]:
test_on_one_image(stable_model, dataset.get_test(), 455)

# Bias in the results

## Considerations on the data (a.k.a. how to cover ourselves)

When evaluating our results, it is important to account for bias from several sources, each of which can affect the accuracy of the outcomes:

- Bias from positional information in the dataset: when a caption consists mainly of positional information (for example, "the one on the left"), it is hard to discriminate between candidate targets. The same subject may appear in several positions, so our model effectively picks one of them at random, which can produce wrong predictions.

- Bias from the intrinsic randomness of generated images: our results depend heavily on the generated images, since they are used to discriminate between bounding boxes. Generation can fail in two ways: the images are of low quality, or they are not suitable for our purpose. Both cases affect the final prediction.

- Propagation of errors from the Stanza NLP process: as mentioned earlier, some errors made by Stanza do not affect the later stages, but others can still propagate and influence the final results.

- Influence of YOLO on bounding box selection: since we obtain bounding boxes from YOLO, our predictions are limited to the boxes it proposes. This affects the metrics: we may identify the correct subject and still output a box that differs from the ground truth, because of how YOLO predicts it.


- Errors from the image captioner: TODO (Luca).
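
The point about YOLO's boxes can be made concrete with intersection over union (IoU). The sketch below assumes that an IoU-style overlap is among the metrics tracked by `Metrics`, which is not shown in this section, and that boxes are given as `(x1, y1, x2, y2)`. The function name `iou` and the example boxes are illustrative only.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # width and height of the overlap, clamped to zero when the boxes are disjoint
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


ground_truth = (0, 0, 100, 100)
shifted_box = (10, 10, 110, 110)  # same object, box drawn slightly differently
print(round(iou(ground_truth, shifted_box), 3))
```

A box on the right subject that is shifted by only ten pixels already loses about a third of the overlap, so a correct choice of subject can still score poorly.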