# **Text to Text solution**

**Image captioning** is an additional tool we have considered to distinguish, among the bounding boxes extracted from YOLO, the one associated with the input sentence. <br>
More specifically, using an image captioning model, we generated a caption for each bounding box, then we compared these generated captions with the reference caption provided as input and, based on the similarity score, we selected the corresponding bounding box.

## **What is the image captioning task?**
Image captioning is an **end-to-end sequence-to-sequence** task, where the image pixels are the input sequence and a caption describing the image is the desired output. <br>
Because images and text sequences are different modalities, two different models are required to solve this task: one dedicated to **_encode_** the image, and the other dedicated to **_decode_** a text sequence. <br>
As shown in the figure below, the encoder for the image is a [Vision Transformer](https://arxiv.org/abs/2010.11929v2) (ViT), and its output embeddings are fed to the decoder, which uses the [RoBERTa](https://arxiv.org/abs/1907.11692) transformer architecture.
![picture](https://drive.google.com/uc?id=1wifY-iD3FShRWAR6jFcLriQN5cgUoo6X)


Once the **Vision Encoder Decoder** is initialized with the pretrained models (Vision Transformer and RoBERTa), it creates an **image encoder** and a **language decoder** instance and connects the encoder's output embeddings to the decoder through cross-attention layers. <br>
During training, both the image and the desired caption are passed as inputs to the model. This is an example of **"teacher forcing"** training. <br>
In the example above, the image and the caption `<start> Dog is running in grass with ball in its mouth <end>` are the inputs to the model. <br>
Given these inputs, the model is trained to reproduce the same caption, which leads it to learn the correlation between the words in the captions and the objects in the input images: at every training step, the model predicts a sequence of words, which is compared with the actual caption, and the loss is backpropagated to update the weights.
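
To make the teacher-forcing idea concrete, here is a minimal pure-Python sketch. It is not the training code used in this project: the token names and the per-position probabilities are hypothetical, and a real model would produce a full distribution over the vocabulary at each position.

```python
import math


def teacher_forcing_pairs(caption_tokens):
    """Return (decoder inputs, targets): the targets are the inputs shifted by one."""
    return caption_tokens[:-1], caption_tokens[1:]


def mean_cross_entropy(predicted_probs, targets):
    """Average negative log-probability assigned to the correct next token."""
    losses = [-math.log(probs[token]) for probs, token in zip(predicted_probs, targets)]
    return sum(losses) / len(losses)


caption = ["<start>", "Dog", "is", "running", "<end>"]
decoder_inputs, targets = teacher_forcing_pairs(caption)

# Hypothetical model output: at each position the correct token gets probability 0.7.
predicted_probs = [{token: 0.7} for token in targets]
loss = mean_cross_entropy(predicted_probs, targets)
print(decoder_inputs, targets, round(loss, 4))
```

At every position the decoder sees the ground-truth prefix rather than its own previous predictions, which is what distinguishes teacher forcing from free-running generation.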
    
After training, the image below explains how we generate a text sequence.
    
* We start with the image and the start token `<start>` as input to the model.
* From these inputs the model again produces an output of size [sequence length × vocabulary size], but unlike in training (where all the tokens are used) only the prediction for the second token (immediately following the start token) is considered, and all the other positions are masked. In the figure, `Dog` is the first word predicted after the start token, and all the other words are masked.
* Once we have the first word, we repeat the first step, this time using `<start> Dog` as input, to generate the next word (the third token).
* This loop continues until the model outputs an end token or the output reaches the maximum allowed length.
![picture](https://drive.google.com/uc?id=1LyB9Xa8fZ6m8Qb2oPnFUtPybB_TRc_yK)
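
The generation loop described above can be sketched as follows. This is a simplified greedy version (one candidate at a time, whereas the fine-tuned model uses beam search), and `toy_next_token` is a hypothetical stand-in for the real encoder-decoder forward pass.

```python
def greedy_decode(next_token_fn, image, max_len=20, start="<start>", end="<end>"):
    """Generate tokens one at a time until the end token or max_len is reached."""
    tokens = [start]
    while len(tokens) < max_len:
        # Only the prediction for the next position is used at each step.
        token = next_token_fn(image, tokens)
        tokens.append(token)
        if token == end:
            break
    return tokens


# Hypothetical stand-in for the model: it replays a fixed caption.
REFERENCE = ["Dog", "is", "running", "<end>"]


def toy_next_token(image, tokens):
    return REFERENCE[len(tokens) - 1]


full_caption = greedy_decode(toy_next_token, image=None)
truncated_caption = greedy_decode(toy_next_token, image=None, max_len=3)
print(full_caption)
print(truncated_caption)
```

The second call shows the other stopping condition: generation is cut off when the maximum length is reached, even if no end token has been produced.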


Starting from the Image Captioning pretrained model based on [google/vit-base-patch16-224-in21k](https://huggingface.co/google/vit-base-patch16-224-in21k) for the Encoder Vision Transformer and [RoBERTa](https://huggingface.co/roberta-base) Decoder, we have conducted a **fine-tuning process**, training it on the **test-images** of RefCOCOg dataset in order to improve its performance and adapt it to our specific image captioning needs.
The fine tuning has been done over **75 epochs**, with **train batch size =** and **test batch size =**. <br>
Using the **RobertaTokenizerFast** class, the maximum number of tokens allowed for training descriptions has been set to **max len =**, while the maximum length of the generated captions has been defined as **summary len =**. The **weight decay =** and **learning rate =** have also been set. <br>
Since **beams** are the different candidate sequences that the model considers during generation (a higher number of beams allows the model to explore more possibilities), the other parameters that have been set are: <br>

* **model.config.eos_token_id** sets the end-of-sequence (EOS) token ID. During generation, a candidate sequence is considered finished as soon as it produces the EOS token.

* **model.config.early_stopping = True** enables early stopping during beam search: generation stops as soon as `num_beams` finished candidates (i.e. ending with an EOS token) are available, instead of continuing until the summary length is reached.

* **model.config.no_repeat_ngram_size = 3** prevents any 3-gram from appearing more than once in the generated sequence. It helps to avoid repetitive or redundant phrases in the generated output.

* **model.config.length_penalty = 2.0** controls how the sequence length affects the beam scores. In the Hugging Face implementation, the score of a beam (a sum of log-probabilities, hence negative) is divided by the sequence length raised to `length_penalty`, so values greater than 0 favor longer sequences, while values less than 0 favor shorter ones. It helps to balance between concise and more detailed output.

* **model.config.num_beams = 4** sets the number of beams used during beam search.
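
As a small illustration of the effect of `length_penalty`, the sketch below scores two hypothetical beams the way Hugging Face beam search is understood to do it, i.e. by dividing the summed log-probability by `length ** length_penalty`. The log-probabilities and lengths are made-up values, and the exact length used by the library may differ slightly between versions.

```python
def beam_score(sum_logprobs, length, length_penalty):
    """Length-normalized beam score: summed log-probability / length ** penalty."""
    return sum_logprobs / (length ** length_penalty)


# Hypothetical finished beams: (summed log-probability, number of tokens).
short_beam = (-2.0, 4)
long_beam = (-3.0, 8)

for penalty in (0.0, 1.0, 2.0):
    short_score = beam_score(*short_beam, penalty)
    long_score = beam_score(*long_beam, penalty)
    winner = "long" if long_score > short_score else "short"
    print(f"length_penalty={penalty}: short={short_score:.4f}, long={long_score:.4f} -> {winner}")
```

With no normalization (`0.0`) the shorter beam wins because it has accumulated fewer negative terms, while with `2.0` the longer beam obtains the higher score.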











## **SPICE**
To evaluate the quality of the image captioning model, we used the *SPICE* metric, which stands for *Semantic Propositional Image Caption Evaluation*. <br>
It evaluates automatically generated image captions by comparing the semantic propositional content of a model-generated caption with that of human-written captions, which is closer to how humans judge captions.
According to its authors, *SPICE* outperforms existing metrics such as *BLEU*, *METEOR*, *ROUGE-L* and *CIDEr* in terms of agreement with human evaluations of model-generated captions.  <br>
This means that *SPICE* can better simulate human judgment and provide more accurate assessments of the quality of image captions generated by different models. <br> Additionally, this evaluation metric can answer specific questions about the strengths and weaknesses of different caption-generating models, such as which model is better at understanding colors or counting objects in an image.
*SPICE* works as follows: <br>
**1. Scene graph generation:** the first step is to generate a scene graph for each caption, both the model-generated one and the human-generated ones. A scene graph is a structured representation of the objects, attributes, and relationships mentioned in a caption. To build it, the caption is first parsed with the Stanford PCFG dependency parser, and the resulting dependency tree is converted into a scene graph by the rule-based [Stanford Scene Graph Parser](https://nlp.stanford.edu/software/scenegraph-parser.shtml).

**2. Tuple extraction:** each scene graph is flattened into a set of logical tuples of three kinds:

   - Object tuples, e.g. `(dog)`;
   - Attribute tuples, e.g. `(dog, brown)`;
   - Relationship tuples, e.g. `(dog, hold, ball)`.

**3. Tuple matching:** the tuples of the model-generated caption are compared with the tuples of the human-generated captions. Two tuples match when their elements have the same lemma or are WordNet synonyms, so the comparison is based on meaning rather than on exact wording or word order.

**4. Overall score:** precision (P) is the fraction of tuples of the model-generated caption that are matched, recall (R) is the fraction of tuples of the human-generated captions that are matched, and the *SPICE* score is their F1 score, 2 · P · R / (P + R). Because the tuples are typed, the same score can also be computed on a subset of them (for example only color attributes or only counts), which is what makes the per-category analysis mentioned above possible.
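
As a rough illustration of how the final SPICE number is obtained, the sketch below computes an F1 score over two sets of scene-graph tuples. It assumes the standard definition of SPICE as an F-score over tuples; the tuples are written by hand (no parser is run) and matching is exact, whereas the real metric also accepts WordNet synonyms.

```python
def spice_like_f1(candidate_tuples, reference_tuples):
    """F1 score over scene-graph tuples, using exact matching only."""
    candidate, reference = set(candidate_tuples), set(reference_tuples)
    matched = candidate & reference
    if not matched:
        return 0.0
    precision = len(matched) / len(candidate)
    recall = len(matched) / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hand-written tuples: objects, (object, attribute), (object, relation, object).
reference = {("dog",), ("ball",), ("dog", "brown"), ("dog", "hold", "ball")}
candidate = {("dog",), ("ball",), ("dog", "black")}

score = spice_like_f1(candidate, reference)
print(round(score, 4))
```

Here two of the three candidate tuples are matched (precision 2/3) and two of the four reference tuples are covered (recall 1/2), which gives an F1 of 4/7.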

## **In addition to the mean value, we now need to add a histogram of all the SPICE scores we obtained for each sentence, together with a box plot**

## **AIRPLANE EXAMPLE**

## **Text to Text pipeline implementation (WRONG)**
![picture](https://drive.google.com/uc?id=17hlAnBGBuvWYlk6q3Q7YRU4En05O2x8G)
The proposed pipeline works as follows:

* The input caption is provided to the model.
* The input caption is processed using the previously described STANZA model to extract the root.
* Based on the STANZA analysis, the extracted caption root is associated with the YOLO class that has the highest similarity score, calculated using the CLIP embedding. Only the bounding boxes corresponding to this class are selected.
* For each of the selected bounding boxes, a description is generated using the previously discussed Image Captioning model.
* Using CLIP, the cosine similarity is calculated between the generated caption and the reference caption provided as input.
* The final prediction is the bounding box whose caption obtained the highest similarity value.
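
The last two steps of the pipeline can be sketched as follows. The embeddings here are tiny hand-written vectors standing in for the CLIP text embeddings of the captions, and the function and variable names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def select_box_by_caption(reference_embedding, box_caption_embeddings):
    """Return the box whose generated caption is most similar to the reference."""
    scores = {
        box_id: cosine_similarity(reference_embedding, embedding)
        for box_id, embedding in box_caption_embeddings.items()
    }
    return max(scores, key=scores.get), scores


# Toy embeddings in place of CLIP text embeddings.
reference_embedding = [1.0, 0.0, 1.0]
box_caption_embeddings = {"box_0": [1.0, 0.0, 0.9], "box_1": [0.0, 1.0, 0.0]}

best_box, scores = select_box_by_caption(reference_embedding, box_caption_embeddings)
print(best_box, scores)
```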

## **THE LOCOMOTIVE EXAMPLE GOES HERE, SIMILAR TO THE ONE I DID WITH THE OBESE PEOPLE FOR STABLE DIFFUSION**

## **Here instead we put the plots of the CSV files**

# **Image to Image solution**

In this proposed version, starting from the input caption, we have considered using **Stable Diffusion** in the process of discriminating the bounding box. Before delving into the implementation of this solution, below is an introduction to how Stable Diffusion works.


## **What is Stable Diffusion?**
It is a **text-to-image model**: give it a **text prompt** and it will return an image matching the text. <br>
Here is an example of how it works:
![picture](https://drive.google.com/uc?id=1NAPD4WLEjGT3orZe3AqE6aN8jLpkMHV4)

## **Diffusion Models**
Stable Diffusion belongs to a class of deep learning models called **Diffusion Models.** They are generative models, meaning they are designed to generate new data similar to what they have seen in training. <br>
The model is based on two processes:
* Forward Diffusion
* Reverse Diffusion

**Forward Diffusion** <br>
This process adds noise to a training image, gradually turning it into an uncharacteristic noise image.
Below is an example of an image undergoing forward diffusion. 
![picture](https://drive.google.com/uc?id=10wKNn2adHGUb9ceXuUdAFmkMaGN8Nzwa)

**Reverse Diffusion** <br>
Starting from a noisy image, reverse diffusion recovers the original one.
![picture](https://drive.google.com/uc?id=1sxYKCVgEt8Hr3Rzb8__NT0M6iJ578OKm)

**How training is done** <br>
To reverse the diffusion, we need to know how much noise has been added to an image. The answer is to train a neural network to predict the added noise. This network is called the **noise predictor** and it is a [U-Net model](https://arxiv.org/abs/1505.04597). The training goes as follows.

* Pick a training image;
* Generate a random noise image;
* Corrupt the training image by adding this noise image for a certain number of steps;
* Teach the noise predictor to tell us how much noise was added. This is done by tuning its weights and showing it the correct answer.<br>
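
The construction of one training example can be sketched as below. This is only an illustration under an assumption not stated in this report: it uses the closed-form noising of DDPM-style models, where a single coefficient `alpha_bar` (between 0 and 1) summarizes how many noising steps have been applied. The "image" is a short list of numbers.

```python
import math
import random


def make_training_pair(image, alpha_bar, rng):
    """Return (noisy image, noise); the noise is the target the predictor must learn."""
    noise = [rng.gauss(0.0, 1.0) for _ in image]
    noisy = [
        math.sqrt(alpha_bar) * pixel + math.sqrt(1.0 - alpha_bar) * n
        for pixel, n in zip(image, noise)
    ]
    return noisy, noise


rng = random.Random(0)
image = [0.2, -0.5, 0.9, 0.0]

# alpha_bar close to 1: almost no noise; alpha_bar close to 0: almost pure noise.
slightly_noisy, target_noise = make_training_pair(image, alpha_bar=0.9, rng=rng)
print(slightly_noisy, target_noise)
```

During training, the noise predictor receives the noisy image (and the step) and its weights are adjusted so that its output gets closer to the target noise.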

After training, the noise predictor is capable of estimating the noise added to an image.
To generate a new image, the process starts from a completely random image and asks the noise predictor to estimate its noise. The estimated noise is then subtracted from the image, and this process is repeated multiple times.
![picture](https://drive.google.com/uc?id=1YyCxaPd6GFaJQDVaSP-zdQC-CAElp4CN)

## **Stable Diffusion Model**
Stable Diffusion is a **latent diffusion model**. Instead of operating in the high-dimensional image space, it first compresses the image into the **latent space**.
This is done using a [Variational Autoencoder (VAE)](https://arxiv.org/abs/1312.6114). <br>
The latent space of the Stable Diffusion model is 4x64x64, 48 times smaller than the pixel space of a 512x512 RGB image. All the forward and reverse diffusions we talked about are actually done in the latent space.
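
The factor of 48 can be checked with a few lines of arithmetic, assuming a 512×512 image with 3 color channels:

```python
# Number of values in pixel space (3 channels, 512x512) and in latent space (4x64x64).
pixel_values = 3 * 512 * 512
latent_values = 4 * 64 * 64

ratio = pixel_values // latent_values
print(pixel_values, latent_values, ratio)
```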

So during training, instead of generating a noisy image, it generates a random tensor in latent space (latent noise). Instead of corrupting an image with noise, it corrupts the representation of the image in latent space with the latent noise. This is a lot faster because the latent space is smaller. <br>

**Reverse Diffusion in latent space** <br>
Here’s how latent reverse diffusion in Stable Diffusion works.
* A random latent space matrix is generated.
* The noise predictor estimates the noise of the latent matrix.
* The estimated noise is then subtracted from the latent matrix.
* Steps 2 and 3 are repeated for a specified number of sampling steps.
* The decoder of VAE converts the latent matrix to the final image.
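
The five steps above can be sketched as a simple loop. Everything here is a toy stand-in: the latent is a short list instead of a 4x64x64 tensor, `toy_noise_predictor` replaces the U-Net, `toy_decoder` replaces the VAE decoder, and real samplers rescale the predicted noise according to a noise schedule instead of subtracting it directly.

```python
import random


def reverse_diffusion(latent, predict_noise, steps, decode):
    """Repeatedly subtract the estimated noise, then decode the final latent."""
    for step in range(steps):
        noise = predict_noise(latent, step)
        latent = [z - n for z, n in zip(latent, noise)]
    return decode(latent)


# Toy setup: the "clean" latent the hypothetical predictor steers towards.
CLEAN_LATENT = [0.5, -1.0, 2.0]


def toy_noise_predictor(latent, step):
    # Pretend that half of the remaining deviation from the clean latent is noise.
    return [0.5 * (z - c) for z, c in zip(latent, CLEAN_LATENT)]


def toy_decoder(latent):
    return [round(z, 3) for z in latent]


rng = random.Random(0)
random_latent = [rng.gauss(0.0, 1.0) for _ in CLEAN_LATENT]
image = reverse_diffusion(random_latent, toy_noise_predictor, steps=30, decode=toy_decoder)
print(image)
```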

**Conditioning** <br>
The purpose of conditioning is to steer the noise predictor so that the predicted noise will give us what we want after subtracting from the image. <br>
Below is an overview of how a text prompt is processed and fed into the noise predictor. The **tokenizer** first converts each word into tokens. Each token is then converted to a 768-value **embedding** vector. The embeddings are then processed by the text transformer and are ready to be consumed by the noise predictor. <br>
![picture](https://drive.google.com/uc?id=1yHryHT8IutyeG46PtPvnu0PSpQ0KqZHs)

**Tokenizer:** The text prompt is first tokenized by a CLIP tokenizer: the Stable Diffusion model is limited to using 75 tokens in a prompt. <br>
**Embedding:** Stable Diffusion v1 uses OpenAI’s ViT-L/14 CLIP model. Each embedding is a 768-value vector. The embedding is fixed by the CLIP model, which learned it during its own training. <br>
**Text Transformer:** The embedding needs to be further processed by the text transformer before feeding into the noise predictor. The transformer not only further processes the data but also provides a mechanism to include different conditioning modalities. <br>
**Cross-Attention:** The output of the text transformer is used multiple times by the noise predictor throughout the U-Net. The U-Net consumes it by a cross-attention mechanism, that's where the prompt meets the image. 

### **Classifier-Free Guidance (CFG)**

**Classifier guidance** <br>

[Classifier guidance](https://arxiv.org/abs/2105.05233) is a way to incorporate image labels in diffusion models. The label is used to guide the diffusion process. The classifier guidance scale is a parameter that controls how closely the diffusion process should follow the label.

For example, suppose there are 3 groups of images with labels “cat”, “dog”, and “human”. If the diffusion is unguided, the model will draw samples from each group’s total population, but sometimes it may draw images that could fit two labels, e.g. a boy petting a dog.
With high classifier guidance, the images produced by the diffusion model would be biased toward the extreme or unambiguous examples. If you ask the model for a cat, it will return an image that is unambiguously a cat and nothing else. <br>
The **classifier guidance scale** controls how closely the guidance is followed.
![picture](https://drive.google.com/uc?id=1xU0Y8A9kesgAeX7ustn8doEg-cQBa--N)
In the figure above, the sampling on the right has a higher classifier guidance scale than the one in the middle. In practice, this scale value is simply the multiplier to the drift term toward the data with that label. <br>

**Classifier-free guidance** <br>
Although classifier guidance achieved record-breaking performance, it needs an extra model to provide that guidance. This has presented some difficulties in training. <br>
[Classifier-free guidance](https://arxiv.org/abs/2207.12598), in its authors’ terms, is a way to achieve *“classifier guidance without a classifier”*. Instead of using class labels and a separate model for guidance, they proposed to use image captions and train a conditional diffusion model. <br>
They put the classifier part as conditioning of the noise predictor U-Net, achieving the so-called *“classifier-free”* (i.e. without a separate image classifier) guidance in image generation.
The text prompt provides this guidance in text-to-image. <br>
In summary, **Classifier-free guidance (CFG) scale** is a value that controls how much the text prompt conditions the diffusion process. The image generation is unconditioned (i.e. the prompt is ignored) when it is set to 0. A higher value steers the diffusion towards the prompt. <br>
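
A minimal sketch of how the CFG scale is typically applied at each sampling step is shown below. It assumes the convention used by common implementations, where the noise predictor is run twice (with and without the prompt) and the two predictions are combined linearly; the noise values are made-up numbers.

```python
def apply_cfg(noise_uncond, noise_cond, guidance_scale):
    """Combine unconditional and prompt-conditioned noise predictions."""
    return [u + guidance_scale * (c - u) for u, c in zip(noise_uncond, noise_cond)]


# Hypothetical noise predictions for the same latent, without and with the prompt.
noise_uncond = [0.0, 1.0]
noise_cond = [1.0, 3.0]

for scale in (0.0, 1.0, 7.5):
    print(scale, apply_cfg(noise_uncond, noise_cond, scale))
```

With a scale of 0 the prompt is ignored (only the unconditional prediction is used), with 1 the conditional prediction is used as is, and larger values push the result further in the direction indicated by the prompt.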


The Stable Diffusion model that has been used is Stable Diffusion v1.4, which is trained with: <br>
* 237k steps at resolution 256×256 on [laion2B-en](https://huggingface.co/datasets/laion/laion2B-en) dataset. <br>
* 194k steps at resolution 512×512 on [laion-high-resolution](https://huggingface.co/datasets/laion/laion-high-resolution).
* 225k steps at 512×512 on [laion-aesthetics v2 5+](https://laion.ai/blog/laion-aesthetics/), with 10% dropping of text conditioning.<br>

Since Stable Diffusion relies on text prompts, we have investigated the best type of prompt to generate images that closely resemble the style of those in the dataset. As a result, we adopted the following prompt structure for each generation process:

"**Use deep learning algorithms to generate a hyper-realistic portrait of a** `input caption `. **Use advanced image processing techniques to make the image appear as if it were a photograph."**

# **Image to Image pipeline implementation**

#NICOLA'S PIPELINE IMAGE GOES HERE, TO BE REPLACED, adding the fact that we generate three images.
![picture](https://drive.google.com/uc?id=15UJy8ws6BbrBMlCXWPdi6zGsPQjbg45r)

The proposed pipeline works as follows:

* The input caption is provided to the model;
* Using the Stable Diffusion model, three images are generated from this caption;
* The input caption is processed using the previously described STANZA model to extract the root;
* Based on the STANZA analysis, the extracted caption root is associated with the YOLO class that has the highest similarity score, which is calculated using the CLIP embedding. Only the bounding boxes corresponding to this class are selected;
* For each selected bounding box, the cosine similarity between its CLIP embedding and that of each of the three images generated by Stable Diffusion is computed;
* Finally, the bounding box that achieves the highest similarity score on at least two of the three generated images is selected as the final prediction. In more "ambiguous" situations, where no bounding box is the best one on at least two of the three images, the bounding box with the overall highest cosine similarity value is chosen.
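
The final selection rule can be sketched as follows. This is one possible reading of the rule above (majority vote over the three generated images, with a fallback to the single highest similarity), and the similarity values and names are hypothetical.

```python
from collections import Counter


def select_box_by_vote(similarities):
    """similarities maps each box id to its cosine similarity with every generated image."""
    n_images = len(next(iter(similarities.values())))
    # For each generated image, find the box that is most similar to it.
    winners = [
        max(similarities, key=lambda box: similarities[box][i]) for i in range(n_images)
    ]
    box, votes = Counter(winners).most_common(1)[0]
    if votes >= 2:
        return box
    # Ambiguous case: fall back to the box with the overall highest similarity.
    return max(similarities, key=lambda box: max(similarities[box]))


clear_case = {"box_a": [0.30, 0.20, 0.25], "box_b": [0.20, 0.25, 0.30]}
ambiguous_case = {
    "box_a": [0.50, 0.10, 0.10],
    "box_b": [0.10, 0.40, 0.10],
    "box_c": [0.10, 0.10, 0.30],
}

print(select_box_by_vote(clear_case))
print(select_box_by_vote(ambiguous_case))
```

In the first case `box_b` wins on two of the three images; in the second each box wins on exactly one image, so the box with the highest single similarity (`box_a`) is returned.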

It is noticeable that due to the low quality of the images generated by Stable Diffusion, the bounding boxes with the highest similarity refer to different portions of the image, and not to the one we are actually seeking (in green).

![picture](https://drive.google.com/uc?id=1Ve_ipDDYYtBbohb0YxDM4IjMNuwnPPCh)

After removing the bounding boxes whose class does not match the one selected with *STANZA*, it is possible to observe how the model is able to discriminate the correct one (similarity values on the left are higher), despite not achieving very high similarity values.
![picture](https://drive.google.com/uc?id=1VOkkE3NRKqxfFVr8IjSXC1YSQGELAo9W)



As shown in the previous example, the images generated by Stable Diffusion are not always highly faithful to the provided captions. However, in cases where the generated images are more relevant, the following example demonstrates how the attention map of the vision transformer in CLIP is very similar for both images (the first one is taken from RefCOCOg, while the second one has been generated with Stable Diffusion from the provided caption). <br> In cases like this, the similarity helps the model to discriminate the correct portion of the image.
![picture](https://drive.google.com/uc?id=1bX7j37llDRfSW6w3LwrZJwq0hfG3S1GK)

## **AMBIGUITY EXAMPLE WITH THE BOXER IMAGE** (IF ONLY IT WORKED)
In addition to that, another reason we introduced Stable Diffusion in our model relates to one of the issues with the CLIP model: polysemy. At times, the CLIP model struggles to differentiate the meaning of certain words due to a lack of context. Some images in particular are labeled with only a class label and not a complete textual prompt. The authors provide an example using the Oxford-IIIT Pet dataset, where the word 'boxer' can refer to a dog breed, but in other images, it can be interpreted as a reference to an athlete.

## **The plots of the CSV files go here**

## **CONCLUSIONS**