# Multimodal NLP

## Goal of the session

The goal of the session is to turn an LLM into a multimodal chatbot without any training. To do so we will draw inspiration from https://arxiv.org/pdf/2201.05299.pdf.

Concretely, you turn images into text or a set of labels and give these to the LLM.

❗❗❗ SELECT A GPU RUNTIME ❗❗❗

### Requirements

Install packages & import packages

In [None]:
!pip install transformers evaluate
!pip install sentencepiece
!pip install accelerate
!pip install bitsandbytes
!pip install xformers
!pip install bert_score

In [None]:
from PIL import Image
import time
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import BitsAndBytesConfig, AutoModelForCausalLM, AutoTokenizer, TextStreamer, BlipProcessor, BlipForConditionalGeneration
import torchvision
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.rpn import AnchorGenerator
from torchvision.transforms import v2 as T
import numpy as np


In [None]:
from evaluate import load
bertscore = load("bertscore")


In [None]:
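import locale
# Workaround commonly used on Colab: force UTF-8 as the preferred encoding so that `!` shell commands keep working.
locale.getpreferredencoding = lambda: "UTF-8"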

### Inference with an LLM

In this section you will use TinyLlama Chat, a language model with 1.1 billion parameters.

In [None]:
model_name = "TinyLlama/TinyLlama-1.1b-Chat-v1.0"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Define the prompt function that builds inputs in the chat format the model expects.

In [None]:
def user_prompt(text):
  messages = f'''
  <|system|> \
  You are a friendly chatbot who always responds in the style of a pirate.</s> \
  <|user|> \
  {text}</s> \
  <|assistant|>
  '''
  return messages
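
The f-string above relies on backslash continuations, so the template lines are joined with spaces rather than newlines. A minimal sketch of a variant that writes the newlines explicitly is given below. It assumes the Zephyr-style layout (`<|system|>`, `<|user|>`, `<|assistant|>`, each turn closed by `</s>`); the name `build_chat_prompt` and the short-answer system message are hypothetical and not part of the lab. When the tokenizer is loaded, `tokenizer.apply_chat_template` is the more reliable way to obtain the model's own format.

```python
def build_chat_prompt(user_text: str,
                      system_text: str = "Answer with a few words only.") -> str:
    """Hypothetical helper: format one system turn and one user turn, Zephyr style."""
    return (
        f"<|system|>\n{system_text}</s>\n"
        f"<|user|>\n{user_text.strip()}</s>\n"
        "<|assistant|>\n"  # generation starts right after this tag
    )

example_prompt = build_chat_prompt("What colour is the sky?")
print(example_prompt)
```

Changing `system_text` is a convenient place to experiment with instructions that keep the answers short.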

To save memory, you will load the model quantized to 4-bit NF4.

In [None]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit= True,
    bnb_4bit_quant_type= "nf4",
    bnb_4bit_compute_dtype= torch.float16,
    bnb_4bit_use_double_quant= False,
)
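
To see why 4-bit loading helps, here is a rough back-of-the-envelope estimate of the memory taken by the weights alone. It is only a sketch: it ignores quantization constants, any layers kept in higher precision, activations and the KV cache, and `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to store the weights alone, in GB (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 1.1e9  # TinyLlama size quoted above

for name, bits in [("float32", 32), ("float16", 16), ("NF4", 4)]:
    print(f"{name:>8}: {weight_memory_gb(NUM_PARAMS, bits):.2f} GB")
```

Under these assumptions the weights shrink from roughly 2.2 GB in float16 to about 0.55 GB in NF4.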

In [None]:
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map={"": 0}
)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

Example

In [None]:
text = '''
"The 2018 FIFA World Cup was the 21st FIFA World Cup, the quadrennial world championship \
for national football teams organized by FIFA. It took place in Russia from 14 June \
to 15 July 2018, after the country was awarded the hosting rights in late 2010. \
It was the eleventh time the championships had been held in Europe, the first \
time they were held in Eastern Europe, and the first time they were held across \
two continents (Europe and Asia). At an estimated cost of over $14.2 billion, \
it was the most expensive World Cup ever held until it was surpassed by the 2022 World Cup in Qatar." \
Based on the previous text, where did the World Cup take place?
'''

prompt = user_prompt(text)

In [None]:
inputs = tokenizer([prompt], return_tensors="pt").to("cuda:0")
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
with torch.no_grad():
  _ = model.generate(**inputs, streamer=streamer, max_new_tokens=500)



## Downloading data

In [None]:
import gdown
gdown.download_folder("https://drive.google.com/drive/folders/1UsPWqPTFznqXBKoFFRUkud489CBR8-t1?usp=sharing")


['/content/Lab session 9 - Multimodal NLP/class_labels.txt',
 '/content/Lab session 9 - Multimodal NLP/valid_annotations.json',
 '/content/Lab session 9 - Multimodal NLP/valid_imgs.tar.gz',
 '/content/Lab session 9 - Multimodal NLP/vqa_valid_questions.json']

In [None]:
!mv 'Lab session 9 - Multimodal NLP'/* .
!rm -r 'Lab session 9 - Multimodal NLP'
!tar -xvzf valid_imgs.tar.gz

## Evaluation

Evaluate TinyLlama (text-only) on the first 1,000 examples of the validation benchmark.

In [None]:
class EvalDataset(Dataset):
  def __init__(self):

    self.questions_fname = "vqa_valid_questions.json"
    self.answers_fname = "valid_annotations.json"
    self.imgs_folder = "valid_imgs"

    ## TODO:
    # Load data

  def __len__(self):
    ... # To complete

  def __getitem__(self, item):
    ... # To complete
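
One possible shape for the data-loading part is sketched below. It assumes that both JSON files are lists aligned by position and that the records carry the `question`, `image_id` and `multiple_choice_answer` fields used by the evaluation cells of this notebook. `QAPairs` is a hypothetical plain-Python class; a `torch.utils.data.Dataset` subclass would expose the same two methods.

```python
import json

class QAPairs:
    """Sketch of a question/answer container (not the reference solution)."""

    def __init__(self, questions_fname: str, answers_fname: str, limit: int = 1000):
        with open(questions_fname, "r") as fq:
            self.questions = json.load(fq)[:limit]
        with open(answers_fname, "r") as fa:
            self.answers = json.load(fa)[:limit]
        # Assumption: the two lists are aligned, one annotation per question.
        assert len(self.questions) == len(self.answers)

    def __len__(self) -> int:
        return len(self.questions)

    def __getitem__(self, item: int) -> dict:
        ques, annot = self.questions[item], self.answers[item]
        return {
            "question": ques["question"],
            "image_id": str(ques["image_id"]),  # string ids, as used for captions
            "answer": annot["multiple_choice_answer"],
        }
```

For the text-only baseline the image itself is not needed, so this sketch leaves image loading out.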

In [None]:
### TODO
"""

!!! Take the first 1,000 examples as the test set. !!!

Evaluate TinyLlama (text-only version) on the VQA dataset. You will use BERTScore; below is an example:

predictions = ["The table is made of wood"]
references = ["made of wood"]
results = bertscore.compute(predictions=predictions, references=references, lang="en")

IMPORTANT: Try different prompts to make sure TinyLlama does not output long responses.
"""

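Prompting is the intended way to obtain short answers. As a complementary safeguard, a response can also be trimmed before scoring. The sketch below uses a hypothetical helper, `shorten_answer`, that keeps the first sentence and at most a few words. Truncation changes what is being scored, so it should be reported alongside the results.

```python
import re

def shorten_answer(response: str, max_words: int = 3) -> str:
    """Hypothetical post-processing: first sentence, then at most `max_words` words."""
    first_line = response.strip().split("\n")[0]
    # Split after sentence-ending punctuation followed by whitespace.
    first_sentence = re.split(r"(?<=[.!?])\s+", first_line)[0]
    words = first_sentence.rstrip(".!?").split()
    return " ".join(words[:max_words])

print(shorten_answer("Made of wood. The table also has four legs."))
```
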
## Use an image captioner and an object detector to convert images into text

Image captioner

In [None]:
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
# Load the captioner in half precision to save GPU memory.
captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base",
                                                         torch_dtype=torch.float16)
captioner.to(device)
captioner.eval()

# For usage, have a look here: https://huggingface.co/docs/transformers/model_doc/blip#transformers.BlipForConditionalGeneration

In [None]:
## TODO:
"""
Convert images into a set of captions
"""
import os
from PIL import Image

class ImgDataset(Dataset):
  def __init__(self, processor):
    self.imgs_folder = "valid_imgs"
    self.img_name_list = os.listdir(self.imgs_folder)
    self.processor = processor

  def img_name_to_img_idx(self, img_name):
    return str(int(img_name.replace(".jpg", "").split("_")[-1]))

  def __len__(self):
    return len(self.img_name_list)

  def __getitem__(self, item):
    # Load the image and preprocess it with the BLIP processor.
    im_name = self.img_name_list[item]
    img_idx = self.img_name_to_img_idx(im_name)

    img = Image.open(os.path.join(self.imgs_folder, im_name))
    text = "A picture of"
    inputs = self.processor(img, text, return_tensors="pt")
    inputs.update({"img_idx": img_idx})

    return inputs

img_dataset = ImgDataset(processor)

BATCH_SIZE = 16
dataloader = DataLoader(img_dataset, batch_size=BATCH_SIZE, num_workers=2, shuffle=False)

img_captions = {"captions": [], "img_idx": []} # parallel lists: captions[i] is the caption generated for image img_idx[i].


In [None]:
from tqdm import tqdm

for batch in tqdm(dataloader):
  img_captions["img_idx"] += batch.pop("img_idx")

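  batch = {k: v.to(device) for k, v in batch.items()}
  # The processor returns tensors with a leading batch dimension of 1: drop it.
  # Pixel values are cast to the captioner's dtype (float16 here) to avoid a dtype mismatch.
  batch["pixel_values"] = batch["pixel_values"][:, 0].to(captioner.dtype)
  batch["input_ids"] = batch["input_ids"][:, 0]
  batch["attention_mask"] = batch["attention_mask"][:, 0]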

  with torch.no_grad():
    output = captioner.generate(**batch)
    captions = processor.tokenizer.batch_decode(output, skip_special_tokens=True)
    captions = [cap.replace("[SEP]", "").replace("[PAD]", "").strip(".").strip(" ") + "." for cap in captions]
    img_captions["captions"] += captions

100%|██████████| 300/300 [02:49<00:00,  1.77it/s]
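
`img_captions` stores captions and image ids as two parallel lists, so finding a caption with `list.index` scans the list on every lookup. A small sketch of building a dictionary once is shown below, using toy values invented for illustration rather than real outputs.

```python
# Toy stand-in for the `img_captions` structure filled by the loop above.
toy_captions = {
    "captions": ["a picture of a dog on a beach.", "a picture of two giraffes."],
    "img_idx": ["42", "1337"],
}

# One pass to build the index; afterwards each lookup is a dictionary access.
caption_by_img = dict(zip(toy_captions["img_idx"], toy_captions["captions"]))

print(caption_by_img["1337"])
```
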


Object detector

In [None]:
import torchvision
from torchvision.transforms import v2 as T

# load a model pre-trained on COCO
# (`weights="DEFAULT"` replaces the deprecated `pretrained=True` argument)
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()
detector.to(device)

# Faster R-CNN normalizes its inputs internally, so the images only need to be
# converted to float tensors in [0, 1]; no extra T.Normalize is required.
transforms_fn = T.ToTensor()

Downloading: "https://download.pytorch.org/models/fasterrcnn_resnet50_fpn_coco-258fb6c6.pth" to /root/.cache/torch/hub/checkpoints/fasterrcnn_resnet50_fpn_coco-258fb6c6.pth
100%|██████████| 160M/160M [00:01<00:00, 140MB/s]


In [None]:
## TODO: read the class labels from class_labels.txt

with open("class_labels.txt", "r") as f:
  labels = f.read().strip("\n").split("\n")

id_2_classname = {idx: lab for idx, lab in enumerate(labels)}

In [None]:
## TODO:
"""
Convert images into a set of detected objects. Apply a threshold of 0.5 on the detection score and save the box coordinates of each detected object.
"""
import os
import numpy as np
from tqdm import tqdm

class ObjectDataset(Dataset):
  def __init__(self, transform_fn):
    self.imgs_folder = "valid_imgs"
    self.img_name_list = os.listdir(self.imgs_folder)
    self.transform_fn = transform_fn

  def img_name_to_img_idx(self, img_name):
    return str(int(img_name.replace(".jpg", "").split("_")[-1]))

  def __len__(self):
    return len(self.img_name_list)

  def __getitem__(self, item):
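    # Load the image and convert it to a tensor.
    im_name = self.img_name_list[item]
    img_idx = self.img_name_to_img_idx(im_name)

    img = Image.open(os.path.join(self.imgs_folder, im_name))
    # Resize to a fixed size so images can be batched; box coordinates are therefore
    # expressed in this 224x224 frame, not in the original image.
    img_inps = T.Resize((224, 224))(self.transform_fn(np.array(img)))

    # Grayscale images have a single channel: repeat it to get 3 channels.
    if img_inps.size(0) == 1:
      img_inps = img_inps.repeat(3, 1, 1)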

    return {"pixel_values": img_inps, "img_idx": img_idx}

obj_dataset = ObjectDataset(transforms_fn)

BATCH_SIZE = 16
dataloader = DataLoader(obj_dataset, batch_size=BATCH_SIZE, num_workers=1, shuffle=False)

img_obj = {} # keys are img_index - values are dict where keys are ["detected_obj", "obj_loc"] and values are class names and detected coordinates

In [None]:
for batch in tqdm(dataloader):
  pixel_values = batch["pixel_values"].to(device)
  img_index = batch["img_idx"]
  with torch.no_grad():
    output = detector(pixel_values)
  for b_idx, out in enumerate(output):
    # Keep detections whose confidence score exceeds 0.5.
    valid_idx = torch.where(out["scores"] > 0.5)
    box_loc = out["boxes"][valid_idx].cpu().tolist()
    label_ids = out["labels"][valid_idx].cpu().tolist()  # avoid shadowing the global `labels` list
    lab_names = [id_2_classname[lab] for lab in label_ids]

    img_obj[img_index[b_idx]] = {"detected_obj": lab_names, "obj_loc": box_loc}


100%|██████████| 300/300 [06:17<00:00,  1.26s/it]
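
The thresholding step can be illustrated on plain Python lists. The sketch below also merges repeated labels into counts, which may give the LLM a more compact description. `summarize_detections` is a hypothetical helper and the scores are made up.

```python
from collections import Counter

def summarize_detections(labels, scores, threshold=0.5):
    """Keep labels whose score exceeds `threshold` and return e.g. '2 person, 1 dog'."""
    kept = [lab for lab, score in zip(labels, scores) if score > threshold]
    counts = Counter(kept)
    # most_common() orders by count; ties keep first-seen order.
    return ", ".join(f"{n} {name}" for name, n in counts.most_common())

print(summarize_detections(["person", "dog", "person", "kite"],
                           [0.98, 0.91, 0.75, 0.30]))
```
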


In [None]:
# Delete the detector first, then release the cached GPU memory it was using.
del detector
torch.cuda.empty_cache()

## Build & Eval the newly multimodal model

Create a prompt function based on the question, the detected objects, captions and (optionally) object locations.

In [None]:
def visual_prompt(caption: str, detected_obj: list, obj_loc: list, question: str):
  ### TODO: below is an example; replace it with your own prompt. IMPORTANT: Try to force the model to output a short answer.
  text = f'''
  {caption.strip(".")}. These objects {",".join(detected_obj)} are visible in the image. {question} Answer:
  '''
  return text
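
The example prompt ignores `obj_loc`. One way to use it is to turn each box into a coarse position phrase. This is a sketch, assuming boxes are `[x_min, y_min, x_max, y_max]` in the 224x224 frame produced by the resize in `ObjectDataset`; `describe_location` and `objects_with_locations` are hypothetical helpers.

```python
def describe_location(box, img_width=224, img_height=224):
    """Map an [x_min, y_min, x_max, y_max] box to a phrase such as 'top left'."""
    x_center = (box[0] + box[2]) / 2
    y_center = (box[1] + box[3]) / 2
    # Split each axis into three equal bands; clamp to the last band at the border.
    horizontal = ["left", "center", "right"][min(int(3 * x_center / img_width), 2)]
    vertical = ["top", "middle", "bottom"][min(int(3 * y_center / img_height), 2)]
    return f"{vertical} {horizontal}"

def objects_with_locations(detected_obj, obj_loc):
    """Join object names with their coarse positions for use in a prompt."""
    return ", ".join(f"{name} ({describe_location(box)})"
                     for name, box in zip(detected_obj, obj_loc))

print(objects_with_locations(["person", "dog"],
                             [[10, 80, 60, 150], [150, 160, 220, 220]]))
```

The resulting string could replace the plain comma-separated label list in the prompt.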

In [None]:
## TODO: Eval TinyLlama multimodal and compare to the text-only version
import json

with open("vqa_valid_questions.json", "r") as fj:
  questions = json.load(fj)

test_questions = questions[:1000]

In [None]:
with open("valid_annotations.json", "r") as fj:
  annotations = json.load(fj)

In [None]:
test_annotations = annotations[:1000]

In [None]:
all_q, all_ans, all_model_ans = [], [], []
for annot, ques in tqdm(zip(test_annotations, test_questions)):
  q = ques["question"]
  img_id = ques["image_id"]
  ans = annot["multiple_choice_answer"]

  cap_idx = img_captions["img_idx"].index(str(img_id))
  cap = img_captions["captions"][cap_idx]

  obj_dict = img_obj[str(img_id)]
  obj_labs = obj_dict["detected_obj"]

  vis_prompt = visual_prompt(cap, obj_labs, [], q)
  prompt = user_prompt(vis_prompt)
  inputs = tokenizer([prompt], return_tensors="pt").to("cuda:0")
  with torch.no_grad():
    pred = model.generate(**inputs, streamer=streamer, max_new_tokens=15)

  model_answer = tokenizer.batch_decode(pred, skip_special_tokens=True)[0].split("<|assistant|>")[-1].strip("\n").strip()

  # Save results
  all_q.append(q)
  all_ans.append(ans)
  all_model_ans.append(model_answer)


In [None]:
score = bertscore.compute(predictions=all_model_ans, references=all_ans, lang="en")



In [None]:
print(f"Precision: {np.mean(score['precision'])}")
print(f"Recall: {np.mean(score['recall'])}")
print(f"F1 Score: {np.mean(score['f1'])}")

Precision: 0.8116519441604614
Recall: 0.8261530102491379
F1 Score: 0.8186268402934075


## In-context learning abilities

You will add examples to the prompt and evaluate the results, using 1, 2 and 4 examples.


In [None]:
## TODO: Build a training dataset from which to draw examples RANDOMLY. You will use the 4,000 remaining examples as your training set.

In [None]:
## TODO: build a new prompt function to include the in context examples
import random

train_questions, train_answers = questions[1000:], annotations[1000:]

def incontext_visual_prompt(caption: str, detected_obj: list, obj_loc: list,
                            question: str, answer: str):
  text = visual_prompt(caption, detected_obj, obj_loc, question) + f" {answer}."
  return text

def full_visual_prompt(caption: str, detected_obj: list, obj_loc: list,
                       question: str, num_in_context_exs: int):
  text = ""
  for in_context_ex in range(num_in_context_exs):
    tr_id = random.randrange(len(train_questions))  # index of a random training example
    ques = train_questions[tr_id]
    annot = train_answers[tr_id]

    q = ques["question"]
    img_id = ques["image_id"]
    ans = annot["multiple_choice_answer"]

    cap_idx = img_captions["img_idx"].index(str(img_id))
    cap = img_captions["captions"][cap_idx]

    obj_dict = img_obj[str(img_id)]
    obj_labs = obj_dict["detected_obj"]

    text += incontext_visual_prompt(cap, obj_labs, [], q, ans)
    ## TODO: to complete

  text+= visual_prompt(caption, detected_obj, obj_loc, question)
  return text
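
Drawing each example independently can pick the same training example twice, or one that shares its image with the evaluated question. A sketch of sampling without replacement while skipping the evaluated image is given below; `sample_in_context_ids` and the toy records are hypothetical.

```python
import random

def sample_in_context_ids(train_questions, eval_img_id, k, rng=random):
    """Draw k distinct training indices whose image differs from the evaluated one."""
    candidates = [i for i, ques in enumerate(train_questions)
                  if str(ques["image_id"]) != str(eval_img_id)]
    return rng.sample(candidates, k)

toy_train = [{"question": "What color is the cat?", "image_id": 1},
             {"question": "Is it raining?", "image_id": 2},
             {"question": "How many dogs?", "image_id": 3}]

# A seeded generator makes the draw reproducible across runs.
picked = sample_in_context_ids(toy_train, eval_img_id=2, k=2, rng=random.Random(0))
print(picked)
```
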

In [None]:
all_q, all_ans, all_model_ans = [], [], []
for annot, ques in tqdm(zip(test_annotations, test_questions)):
  q = ques["question"]
  img_id = ques["image_id"]
  ans = annot["multiple_choice_answer"]

  cap_idx = img_captions["img_idx"].index(str(img_id))
  cap = img_captions["captions"][cap_idx]

  obj_dict = img_obj[str(img_id)]
  obj_labs = obj_dict["detected_obj"]

  vis_prompt = full_visual_prompt(cap, obj_labs, [], q, 1)
  print(vis_prompt)
  prompt = user_prompt(vis_prompt)
  inputs = tokenizer([prompt], return_tensors="pt").to("cuda:0")
  with torch.no_grad():
    pred = model.generate(**inputs, streamer=streamer, max_new_tokens=15)

  model_answer = tokenizer.batch_decode(pred, skip_special_tokens=True)[0].split("<|assistant|>")[-1].strip("\n").strip()

  # Save results
  all_q.append(q)
  all_ans.append(ans)
  all_model_ans.append(model_answer)


In [None]:
## TODO: Eval multimodal TinyLlama on the 1,000 eval examples

## Selective in-context examples

To further improve the results, you can select examples close to the one you are currently evaluating your model on instead of choosing them randomly.

To do so, you will search the training set for questions similar to the current eval question.

In [None]:
## TODO: Use BERTScore to rank the questions in the training set w.r.t. the eval question.

In [None]:
## TODO: Select in-context examples for each eval sample. IMPORTANT: make sure that the images of the in-context examples differ from the image being evaluated.
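
The selection logic can be sketched independently of the similarity measure. The example below deliberately swaps in a word-overlap (Jaccard) score as a cheap stand-in for BERTScore, so that it runs without any model. `jaccard`, `top_k_similar` and the toy records are hypothetical. To follow the lab, pass a `similarity` function that wraps `bertscore.compute` instead, keeping in mind that scoring every eval question against every training question may be slow.

```python
def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity in [0, 1]; a cheap stand-in for BERTScore."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

def top_k_similar(eval_question, eval_img_id, train_questions, k, similarity=jaccard):
    """Return indices of the k most similar training questions from other images."""
    scored = [(similarity(eval_question, ques["question"]), idx)
              for idx, ques in enumerate(train_questions)
              if str(ques["image_id"]) != str(eval_img_id)]  # skip the evaluated image
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]

toy_train = [{"question": "What color is the cat?", "image_id": 1},
             {"question": "Is it raining?", "image_id": 2},
             {"question": "What color is the bus?", "image_id": 3},
             {"question": "What color is the dog?", "image_id": 9}]

print(top_k_similar("What color is the dog?", 9, toy_train, k=2))
```
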

In [None]:
## TODO: Eval multimodal TinyLlama with this method and compare with previous results.

# Play with the model

Now that you have finished the lab, you can play with the model. Give it any image you like and ask it anything about it.