## 1. Application Scenario

In this project I perform visual entity extraction on images. This is useful in, for example, fake news detection, where we want to bring external knowledge about the visual part of a news article into the classification. The general idea is to use a pre-trained model to generate a caption for each image and then extract entities from the captions. The former is a vision-language task known as [Visual Captioning (VC)](https://theaisummer.com/vision-language-models/); popular vision-language models include [CLIP](https://arxiv.org/abs/2103.00020) and [BLIP](https://arxiv.org/abs/2201.12086), of which BLIP is the one used here to generate captions. The latter is a language task known as [Named-entity recognition (NER)](https://link.springer.com/chapter/10.1007/978-3-642-45358-8_7), for which we will use the Python library [flair](https://github.com/flairNLP/flair). The goal of this project is to combine these two tasks to perform visual entity extraction.

### 1.1. The novelty

I have found only one [paper](https://arxiv.org/abs/2108.10509) that takes on the task of extracting visual entities, at least within the field of fake news detection. However, that paper extracts entities from images using an advanced OCR system, which is not the same as extracting entities from generated captions.

## 2. Implementation

### 2.1. Loading the images for testing

First we need to load all the images we are going to test the system on. We will use the PIL library for this.

In [34]:
import torch
import glob
from PIL import Image

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
raw_images = [{"name": file, "img": Image.open(file).convert("RGB")} for file in glob.glob("img/*.jpg")]
print(f"Loaded {len(raw_images)} images")

Loaded 6 images


### 2.2. Generating the captions

Now that all the images are loaded into memory in a list, we will use [BLIP](https://arxiv.org/abs/2201.12086) to generate a caption for each image. The captions are stored in a list of dictionaries, each holding the image name and its caption.

During pre-training, [BLIP](https://arxiv.org/abs/2201.12086) encodes images and text separately: a vision transformer encodes the image, while the text encoder is based on the BERT transformer architecture, and multiple cross- and self-attention layers connect the two. BERT itself is pre-trained on unannotated text to learn a representation of the language. This is a form of self-supervised learning, since the training signal comes from the data itself rather than from manual labels.

In [42]:

from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load the BLIP base captioning model, fine-tuned on the MSCOCO captioning dataset.
# This also loads the associated image processors.
model, vis_processors, _ = load_model_and_preprocess(name="blip_caption", model_type="base_coco", is_eval=True, device=device)

# vis_processors stores image transforms for "train" and "eval" (validation / testing / inference)
unprocessed_img_captions = []
for raw_image in raw_images:
    # Preprocess the image and add a batch dimension
    image = vis_processors["eval"](raw_image["img"]).unsqueeze(0).to(device)
    # model.generate returns a list of captions; take the first one
    caption = model.generate({"image": image})[0]
    print("Result for ", raw_image["name"], ":", caption)
    # Store the file name (not the image object) alongside the caption
    unprocessed_img_captions.append({"img_name": raw_image["name"], "caption": caption})

Result for  img/GettyImages-1145422105-1024x683.jpg : the eiffel tower towering over the city of paris
Result for  img/ratio3x2_1800.jpg : a large group of people standing on the side of a bridge
Result for  img/NotreDame20190415QuaideMontebello_(cropped).jpg : a large cathedral with a massive fire coming out of it
Result for  img/181002113456-01-golden-gate-bridge-restricted.jpg : a view of the golden gate bridge in san francisco
Result for  img/GettyImages-917361830.jpg : a couple of men standing next to each other holding a bottle of wine
Result for  img/77336840.jpg : a man in a suit and tie giving a speech


### 2.3. Converting the captions to true case

The flair NER model expects entities to be capitalized, so we restore the proper casing of the lower-case captions using the [SentenceCase API](https://rapidapi.com/Matt11/api/sentence-case-converter-truecaser/). An alternative approach I tried was to train an NER model on lower-cased versions of the training and test datasets, in an attempt to get a model that does not need entities to be capitalized. However, training took too long for me to test this approach, even when I ran it on the IDUN cluster. The code for this approach is included in the file *train.py*. In the resources folder you can find the model I was training, which never finished because of an 80-hour time limit. However, the training.log shows that it achieved a relatively high accuracy of above 90% after only 2 epochs. The XLM-RoBERTa transformer was used for embedding the text during training.

This conversion is known as truecasing, the process of restoring a text's correct capitalization. Transformers like BERT can be used to implement such systems. I am unfamiliar with the details of the [SentenceCase API](https://rapidapi.com/Matt11/api/sentence-case-converter-truecaser/) architecture, but I assume it includes a transformer model that has been trained on a large corpus of text.
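To make the idea of truecasing concrete, here is a minimal sketch that restores capitalization with a longest-match lookup in a small lexicon. It is only a toy: the `LEXICON` contents and the `truecase` helper are hypothetical, and this is not how the SentenceCase API works internally, which presumably learns casing from data rather than from a hand-written list.

```python
# Toy lexicon-based truecaser. The lexicon is hypothetical and tiny;
# a real system would learn casing from a large corpus.
LEXICON = {
    ("eiffel", "tower"): "Eiffel Tower",
    ("golden", "gate", "bridge"): "Golden Gate Bridge",
    ("san", "francisco"): "San Francisco",
    ("paris",): "Paris",
}
MAX_PHRASE_LEN = max(len(key) for key in LEXICON)


def truecase(text: str) -> str:
    """Restore capitalization using longest-match lookup in LEXICON."""
    tokens = text.lower().split()
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest phrase first so multi-word names are matched as a unit.
        for n in range(min(MAX_PHRASE_LEN, len(tokens) - i), 0, -1):
            phrase = tuple(tokens[i:i + n])
            if phrase in LEXICON:
                out.append(LEXICON[phrase])
                i += n
                break
        else:
            # No lexicon entry starts here: keep the token in lower case.
            out.append(tokens[i])
            i += 1
    result = " ".join(out)
    # Capitalize the first character of the sentence.
    return result[:1].upper() + result[1:]


print(truecase("a view of the golden gate bridge in san francisco"))
```

A lookup like this only knows the names it was given, which is why a learned model is the more practical choice for arbitrary captions.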

In [48]:
import os
import time

import requests

url = "https://sentence-case-converter-truecaser.p.rapidapi.com/v1/SentenceCase"
headers = {
	"content-type": "application/json",
	# Read the API key from the environment instead of hard-coding it in the notebook.
	# RAPIDAPI_KEY is an assumed variable name; set it before running this cell.
	"X-RapidAPI-Key": os.environ["RAPIDAPI_KEY"],
	"X-RapidAPI-Host": "sentence-case-converter-truecaser.p.rapidapi.com"
}
payload = {
	"text": "",
	"language": "en",
	"tagSpeciesNames": 0,
	"useStraightQuotes": 0
}

processed_img_captions = []

for caption in unprocessed_img_captions:
	payload["text"] = caption["caption"]
	response = requests.post(url, json=payload, headers=headers)
	print(response.text)
	# response.text is the raw JSON body, so the {"result": ...} wrapper is stored with the caption
	processed_img_captions.append(response.text)
	time.sleep(1.5) # to avoid rate limit

{"result":"The Eiffel Tower towering over the city of Paris"}
{"result":"A large group of people standing on the side of a bridge"}
{"result":"A large cathedral with a massive fire coming out of it"}
{"result":"A view of the Golden Gate Bridge in San Francisco"}
{"result":"A couple of men standing next to each other holding a bottle of wine"}
{"result":"A man in a suit and tie giving a speech"}
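The printed responses are JSON objects, so the caption itself sits in the `result` field. The sketch below shows one way to unwrap it using only the standard library; `extract_caption` is a hypothetical helper name, and the two sample strings are copied from the output above. With the `requests` library, `response.json()["result"]` would give the same value.

```python
import json

# Two raw response bodies copied from the output above.
raw_responses = [
    '{"result":"The Eiffel Tower towering over the city of Paris"}',
    '{"result":"A large group of people standing on the side of a bridge"}',
]


def extract_caption(raw: str) -> str:
    """Return the truecased caption from a raw JSON response body."""
    return json.loads(raw)["result"]


clean_captions = [extract_caption(raw) for raw in raw_responses]
print(clean_captions[0])
```

Passing only the unwrapped caption to the NER step keeps the JSON punctuation out of the tokenized sentence.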


### 2.4. Extracting entities

After restoring the capitalization of the captions, we use [flair](https://github.com/flairNLP/flair) to extract entities from them. The `ner` model loaded below is flair's English model (it resolves to `ner-english` in the log output), which tags persons (PER), locations (LOC), organizations (ORG) and miscellaneous entities (MISC). As far as I can tell, this model is an LSTM-CRF sequence tagger trained on the English part of the CoNLL-2003 dataset, which covers English and German, and it combines GloVe word embeddings with contextual Flair embeddings. The multilingual XLM-RoBERTa transformer, which as mentioned previously is pre-trained with self-supervised learning, is used in my own training attempt described in section 2.3 rather than in this model.

In [46]:
##### NER ON MANUALLY TRAINED MODEL #####
# from flair.data import Sentence
# from flair.models import SequenceTagger

# model = SequenceTagger.load('resources/taggers/sota-ner-flert/final-model.pt')
# sentences = [Sentence(caption) for caption in processed_img_captions]
# [model.predict(sentence) for sentence in sentences]

In [49]:
from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load('ner')
sentences = [Sentence(caption) for caption in processed_img_captions]
[tagger.predict(sentence) for sentence in sentences]



2022-11-10 22:35:24,414 loading file /Users/oysteinlondalnilsen/.flair/models/ner-english/4f4cdab26f24cb98b732b389e6cebc646c36f54cfd6e0b7d3b90b25656e4262f.8baa8ae8795f4df80b28e7f7b61d788ecbb057d1dc85aacb316f1bd02837a4a4
2022-11-10 22:35:26,383 SequenceTagger predicts: Dictionary with 20 tags: <unk>, O, S-ORG, S-MISC, B-PER, E-PER, S-LOC, B-ORG, E-ORG, I-PER, S-PER, B-MISC, I-MISC, E-MISC, I-ORG, B-LOC, E-LOC, I-LOC, <START>, <STOP>


[None, None, None, None, None, None]
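The tag dictionary in the log above follows the BIOES scheme: `S-` marks a single-token entity, `B-`, `I-` and `E-` mark the beginning, inside and end of a multi-token entity, and `O` marks tokens outside any entity. As a rough illustration of how such per-token tags turn into entity spans, here is a simplified decoder. The `decode_bioes` function is a hypothetical helper, not flair's own implementation, and the tags in the example are hand-written for illustration rather than actual model output.

```python
from typing import List, Tuple


def decode_bioes(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIOES-tagged tokens into (entity text, entity type) pairs."""
    entities = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":                  # single-token entity
            entities.append((token, label))
            current = []
        elif prefix == "B":                # first token of a multi-token entity
            current = [token]
        elif prefix == "I" and current:    # inside a multi-token entity
            current.append(token)
        elif prefix == "E" and current:    # last token: close the entity
            current.append(token)
            entities.append((" ".join(current), label))
            current = []
        else:                              # "O" or a malformed sequence
            current = []
    return entities


# Hand-written tags for illustration, not actual model output.
tokens = "The Eiffel Tower towering over the city of Paris".split()
tags = ["B-MISC", "I-MISC", "E-MISC", "O", "O", "O", "O", "O", "S-LOC"]
print(decode_bioes(tokens, tags))
```

For simplicity the decoder does not check that the labels within one entity agree with each other.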

### 2.5. The results

Finally, we retrieve the results by printing each caption together with its extracted entities. If any entities are found, their type and confidence are printed; if none are found, the list under the header stays empty. Note that each sentence still contains the `{"result": ...}` wrapper, because the raw JSON response was passed to flair instead of only the value of its `result` field.

In [50]:
for sentence in sentences: 
    print("---------------------------------------------")
    print(sentence)
    print('The following NER tags are found:')
    for entity in sentence.get_spans('ner'):
        print(entity)
    print("---------------------------------------------")

---------------------------------------------
Sentence: "{" result ":" The Eiffel Tower towering over the city of Paris "}" → ["The Eiffel Tower"/MISC, "Paris"/LOC]
The following NER tags are found:
Span[3:6]: "The Eiffel Tower" → MISC (0.6867)
Span[11:12]: "Paris" → LOC (0.9986)
---------------------------------------------
---------------------------------------------
Sentence: "{" result ":" A large group of people standing on the side of a bridge "}"
The following NER tags are found:
---------------------------------------------
---------------------------------------------
Sentence: "{" result ":" A large cathedral with a massive fire coming out of it "}"
The following NER tags are found:
---------------------------------------------
---------------------------------------------
Sentence: "{" result ":" A view of the Golden Gate Bridge in San Francisco "}" → ["Golden Gate Bridge"/LOC, "San Francisco"/LOC]
The following NER tags are found:
Span[7:10]: "Golden Gate Bridge" → LOC (0.
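Since every entity comes with a confidence score, a downstream system such as a fake news classifier might want to keep only the confident ones. The sketch below shows one possible way to do that with plain Python data. The `Entity` type, the `filter_entities` helper and the threshold of 0.8 are all assumptions made for illustration; the scores are copied from the first result above.

```python
from typing import Dict, List, NamedTuple


class Entity(NamedTuple):
    text: str
    label: str
    score: float


def filter_entities(found: Dict[str, List[Entity]], min_score: float) -> Dict[str, List[Entity]]:
    """Keep only entities whose confidence is at least min_score."""
    return {
        caption: [e for e in entities if e.score >= min_score]
        for caption, entities in found.items()
    }


# Values copied from the first result above; the second caption had no entities.
found = {
    "The Eiffel Tower towering over the city of Paris": [
        Entity("The Eiffel Tower", "MISC", 0.6867),
        Entity("Paris", "LOC", 0.9986),
    ],
    "A large group of people standing on the side of a bridge": [],
}

# 0.8 is an arbitrary example threshold, not a tuned value.
confident = filter_entities(found, min_score=0.8)
for caption, entities in confident.items():
    print(caption, "->", [(e.text, e.label) for e in entities])
```

With this threshold the lower-scoring "The Eiffel Tower" span is dropped, which shows the trade-off: a stricter threshold removes uncertain tags but can also discard correct entities.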

## 3. Discussion

The results above show that this system is mostly useful when the image was taken at a well-known location or contains a famous monument. The image captioning model is the limiting factor here, since it does not appear to have been trained to recognize, for example, famous people. In fake news detection this is essential, since a lot of fake news involves famous people. This could be solved by adding another component to the system that uses a face recognition model to extract the names of the people in the image, or by performing [Visual Question Answering (VQA)](https://theaisummer.com/vision-language-models/) on the image. To make the latter work, one would need a way of generating relevant questions for each image separately, for example by asking who the people in the image are whenever people are mentioned in the caption.
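The question generation step could be sketched roughly as follows: scan the caption for words that refer to people and emit a matching "who" question for a VQA model. The word lists and the `person_questions` helper are hypothetical and deliberately small; this is an untested idea, not part of the system above.

```python
import re
from typing import List

# Hypothetical, deliberately small lists of words that signal people in a caption.
PERSON_SINGULAR = {"man", "woman", "person", "boy", "girl"}
PERSON_PLURAL = {"men", "women", "people", "boys", "girls"}


def person_questions(caption: str) -> List[str]:
    """Return VQA-style questions for each person word found in the caption."""
    questions = []
    for word in re.findall(r"[a-z]+", caption.lower()):
        if word in PERSON_SINGULAR:
            questions.append(f"Who is the {word} in the image?")
        elif word in PERSON_PLURAL:
            questions.append(f"Who are the {word} in the image?")
    return questions


print(person_questions("a man in a suit and tie giving a speech"))
print(person_questions("a large cathedral with a massive fire coming out of it"))
```

Captions without any person word produce no questions, so the VQA model would only be queried when the caption suggests that people are present.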


Self-supervised learning is particularly helpful for this task, since it would be extremely hard to annotate the huge textual datasets used for training the components. Hence, it is better for the models to learn the representations of the text and images from the data itself. Another benefit comes in the context of transformers like BERT. Using these for embedding text lets the model capture the context of each word. This is because BERT is bidirectional: it looks at the words both before and after the current word.
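As a toy illustration of both points, the sketch below builds one masked-word training example from an unannotated caption: the label is simply the hidden word, so no manual annotation is needed, and the model can draw on context from both sides of the mask. The `make_masked_example` helper is hypothetical, and BERT's actual procedure, which masks random subword tokens, is more involved than this.

```python
from typing import List, Tuple

MASK = "[MASK]"


def make_masked_example(tokens: List[str], position: int) -> Tuple[List[str], str]:
    """Hide one token and return (masked input, target word)."""
    masked = tokens[:position] + [MASK] + tokens[position + 1:]
    return masked, tokens[position]


tokens = "a view of the golden gate bridge in san francisco".split()
position = 5  # index of "gate"
masked, target = make_masked_example(tokens, position)

# A bidirectional model sees the words on both sides of the mask.
left_context = masked[:position]
right_context = masked[position + 1:]

print(" ".join(masked))
print("target:", target)
print("left context:", left_context)
print("right context:", right_context)
```

Here "golden" on the left and "bridge" on the right both help to predict the hidden word, which a purely left-to-right model could not exploit.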