<a href="https://colab.research.google.com/github/nyp-sit/iti107-2024s2/blob/main/session-7/blip_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# %%capture
# !pip install matplotlib transformers datasets accelerate sentence-transformers

# Multimodal Large Language Model

BLIP-2, LLaVA, and other visual LLMs use CLIP-like visual encoders to extract visual features from images and project them into the language embedding space. The projected embeddings can then be used as input for an LLM. This capability enables interesting use cases such as image captioning and visual Q&A.
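
As a rough illustration of this projection step, the sketch below maps a handful of visual feature vectors into an LLM's embedding space with a single linear layer and places them in front of some text embeddings. The sizes, the random weights, and all variable names are illustrative assumptions, not values read from BLIP-2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not taken from the real model)
num_visual_tokens, vision_dim, llm_dim = 4, 8, 16
num_text_tokens = 3

# Stand-in for the features produced by the visual encoder
visual_features = rng.normal(size=(num_visual_tokens, vision_dim))

# A linear projection from the vision feature size to the LLM embedding size
projection_weight = rng.normal(scale=0.02, size=(vision_dim, llm_dim))
projection_bias = np.zeros(llm_dim)
visual_embeddings = visual_features @ projection_weight + projection_bias

# Stand-in for the embeddings of the text prompt tokens
text_embeddings = rng.normal(size=(num_text_tokens, llm_dim))

# The LLM receives the projected visual embeddings followed by the text embeddings
llm_inputs = np.concatenate([visual_embeddings, text_embeddings], axis=0)
print(llm_inputs.shape)  # (7, 16)
```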

In this exercise, we will learn how to use BLIP-2 to perform image captioning and visual Q&A tasks.

*Due to a bug in transformers version 4.46.3, the BLIP-2 model does not work for the image captioning case, so we need to update transformers to the main branch.*

In [None]:
!pip install --upgrade git+https://github.com/huggingface/transformers.git

In [None]:
import transformers

# Check that the installed transformers version is the one we just upgraded to
transformers.__version__

## BLIP-2 model

Let's load the BLIP-2 model and its associated processor.

In [None]:
from transformers import AutoProcessor, Blip2ForConditionalGeneration
import torch

# Load processor and main model
blip_processor = AutoProcessor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    torch_dtype=torch.float16
)

# Send the model to GPU to speed up inference
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

### Preprocessing Images

Let us first look at what kind of image processing the processor does. We will use the following as a test image.


In [None]:
from PIL import Image
from urllib.request import urlopen

# Load the test image (the Merlion)
image_url = 'https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png'
image = Image.open(urlopen(image_url)).convert("RGB")
image

Let's see what the BLIP-2 processor does to the image.

In [None]:
# Preprocess the image
inputs = blip_processor(image, return_tensors="pt").to(device, torch.float16)
print(inputs)

**Questions**

- What is the shape of the tensor?
- What can you tell from the shape?
- What are the max and min values of the pixel values?

<details><summary>Click here for answer</summary>

The tensor has shape (1, 3, 224, 224). The processor resizes the image to 224 × 224, so even a very wide photo is turned into a square one.
The image is RGB (axis=1 has size 3).
The pixel values also seem to have been normalized: the maximum value is 2.0742 and the minimum value is -1.7773.



</details>
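
The value range in the answer is what per-channel normalization would produce. The sketch below assumes the processor rescales pixels to [0, 1] and then normalizes with the commonly used CLIP mean and standard deviation. Those constants are an assumption here; you can compare them with blip_processor.image_processor.image_mean and image_std in your own session.

```python
# Assumed per-channel statistics (the commonly used CLIP values)
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

def normalize_pixel(value, mean, std):
    """Rescale an 8-bit pixel value to [0, 1], then standardize it."""
    return (value / 255.0 - mean) / std

# Smallest and largest values any channel can reach after normalization
lowest = min(normalize_pixel(0, m, s) for m, s in zip(CLIP_MEAN, CLIP_STD))
highest = max(normalize_pixel(255, m, s) for m, s in zip(CLIP_MEAN, CLIP_STD))
print(round(lowest, 4), round(highest, 4))
```

Under these assumptions the bounds come out at roughly -1.79 and 2.15, and the observed minimum of -1.7773 and maximum of 2.0742 both fall inside that range.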

To display the pixel values as a PIL image, we need to transform them first.

1. The array passed to PIL must be laid out as (height, width, channels), but the torch tensor is laid out as (channels, height, width), so we need to swap the axes of the original tensor. We can use np.einsum() to accomplish this in 2 steps:
   - reverse the axes so that the channel axis becomes the 3rd axis (ijk->kji), giving (width, height, channels)
   - swap the width and height axes (ijk->jik), giving (height, width, channels)
2. We also need to scale the pixel values to between 0 and 255, using MinMaxScaler. Since MinMaxScaler works with 2D data, we reshape the array to 2D by flattening axis=0 (height) and axis=1 (width) together.


In [None]:
from sklearn.preprocessing import MinMaxScaler
import numpy as np

# Convert to numpy and go from (1, C, H, W) to (H, W, C) in shape
image_inputs = inputs["pixel_values"][0].detach().cpu().numpy()
image_inputs = np.einsum('ijk->kji', image_inputs)
image_inputs = np.einsum('ijk->jik', image_inputs)

# Scale image inputs to 0-255 to represent RGB values
scaler = MinMaxScaler(feature_range=(0, 255))

image_inputs = scaler.fit_transform(image_inputs.reshape(-1, image_inputs.shape[-1])).reshape(image_inputs.shape)
image_inputs = np.array(image_inputs, dtype=np.uint8)

# Convert numpy array to Image
Image.fromarray(image_inputs)
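
The two einsum calls can also be written as a single transpose. The sketch below uses a small random array as a stand-in for the real pixel values (an assumption, so that it runs without the model). It checks that both routes give the same (H, W, C) array, then scales each channel to 0-255 with plain NumPy instead of MinMaxScaler.

```python
import numpy as np

# Stand-in for inputs["pixel_values"][0]: a (C, H, W) array with made-up values
rng = np.random.default_rng(0)
chw = rng.normal(size=(3, 4, 5)).astype(np.float32)

# Route 1: the two einsum steps, (C, H, W) -> (W, H, C) -> (H, W, C)
two_step = np.einsum('ijk->jik', np.einsum('ijk->kji', chw))

# Route 2: one transpose that moves the channel axis to the end
hwc = np.transpose(chw, (1, 2, 0))
print(hwc.shape, np.array_equal(two_step, hwc))  # (4, 5, 3) True

# Per-channel min-max scaling to 0-255, the same idea as MinMaxScaler on the flattened pixels
channel_min = hwc.min(axis=(0, 1), keepdims=True)
channel_max = hwc.max(axis=(0, 1), keepdims=True)
scaled = (hwc - channel_min) / (channel_max - channel_min) * 255
image_array = scaled.astype(np.uint8)
```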

### Preprocessing Text

Let’s continue this exploration of the processor with text instead. First, we
can access the tokenizer used to tokenize the input text:

In [None]:
blip_processor.tokenizer

To explore how GPT2Tokenizer works, we can try it out with a small
sentence. We start by converting the sentence to token IDs before converting
them back to tokens:

In [None]:
# Preprocess the text
text = "Her vocalization was remarkably melodic"
tokens = blip_processor.tokenizer(text, return_tensors="pt")
input_ids = tokens['input_ids'][0]

tokens = blip_processor.tokenizer.convert_ids_to_tokens(input_ids)
tokens

When you inspect the tokens, you might notice a strange symbol at the
beginning of some tokens, namely, the Ġ symbol. This is actually supposed
to be a space. However, an internal function takes bytes that are not
printable characters and maps them to code points starting at 256 to make
them printable. As a result, the space (code point 32) becomes Ġ (code point 288).
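
The mapping can be reproduced in a few lines. The sketch below is modeled on the byte-to-unicode table used by GPT-2 style byte-level tokenizers; the function name and details are our own reconstruction rather than the tokenizer's actual code.

```python
def byte_to_printable():
    """Map each of the 256 byte values to a printable unicode character."""
    # Bytes that are already printable keep their own character
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in printable}
    # Every other byte gets the next free code point starting at 256
    offset = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + offset)
            offset += 1
    return mapping

table = byte_to_printable()
print(table[ord(" ")], ord(table[ord(" ")]))  # Ġ 288
print(table[ord("A")])                        # A
```

Because the space is the 33rd non-printable byte (bytes 0 to 32 all come before "!"), it lands on code point 256 + 32 = 288.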

In [None]:
# Replace the space token with an underscore
tokens = [token.replace("Ġ", "_") for token in tokens]
tokens

### Use Case 1: Image Captioning

The most straightforward usage of a model like BLIP-2 is to create captions of images that you have in your data. An image is converted to pixel values that the model can read. These pixel values are passed to BLIP-2 to be converted into soft visual prompts that the LLM can use to decide on a proper caption.

In [None]:
# Load a test image
image = Image.open(urlopen(image_url)).convert("RGB")

# Convert an image into inputs and preprocess it
inputs = blip_processor(image, return_tensors="pt").to(device, torch.float16)
image

In [None]:
# Generate the caption's token IDs (the visual features are passed to the decoder LLM internally)
generated_ids = model.generate(**inputs, max_new_tokens=20)

# Decode the generated token IDs into text
generated_text = blip_processor.batch_decode(generated_ids, skip_special_tokens=True)
generated_text = generated_text[0].strip()
generated_text
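
The caption steps above (preprocess, generate, decode) can be wrapped in a small helper. The function name caption_image and its parameters are hypothetical; it only reuses the calls already shown in this notebook and takes the processor and model as arguments.

```python
def caption_image(image, processor, model, device, dtype, max_new_tokens=20):
    """Return a caption for one image using the same steps as the cells above."""
    # Preprocess the image into model inputs on the right device and dtype
    inputs = processor(image, return_tensors="pt").to(device, dtype)
    # Generate the caption's token IDs
    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode the token IDs into text and drop surrounding whitespace
    texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
    return texts[0].strip()
```

In this notebook it would be called as caption_image(image, blip_processor, model, device, torch.float16).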

### Use Case 2: Visual Question Answering

In the previous example, we showed going from one modality, vision (image), to another, text (caption). Instead of following this linear structure, we can try to present both modalities simultaneously by performing what is called visual question answering. In this particular use case, we give the model an image along with a question about that specific image for it to answer. The model needs to process both the image as well as the question at once.
To demonstrate, let’s start with the picture and ask BLIP-2 to describe the image. To do so, we first need to load the image as we did before:

In [None]:
# Load a test image
image = Image.open(urlopen(image_url)).convert("RGB")

To perform visual question answering, we need to give BLIP-2 more than just the image, namely the prompt. Without it, the model would generate a caption as it did before. We will ask the model to describe the image we just loaded:

In [None]:
# Visual Question Answering
prompt = "Question: Write down what you see in this picture. Answer:"

# Process both the image and the prompt
inputs = blip_processor(image, text=prompt, return_tensors="pt").to(device, torch.float16)

# Generate text
generated_ids = model.generate(**inputs, max_new_tokens=30)
generated_text = blip_processor.batch_decode(generated_ids, skip_special_tokens=True)
generated_text = generated_text[0].strip()
generated_text

It correctly describes the image. However, this is a rather simple example since our question is essentially asking the model to create a caption. Instead, we can ask follow-up questions in a chat-based manner. To do so, we can give the model our previous conversation, including its answer to our question. We then ask it a follow-up question:

In [None]:
# Chat-like prompting
prompt = "Question: Write down what you see in this picture. Answer: Merlion. Question: Where is the merlion sitting? Answer:"

# Generate output
inputs = blip_processor(image, text=prompt, return_tensors="pt").to(device, torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=30)
generated_text = blip_processor.batch_decode(generated_ids, skip_special_tokens=True)
generated_text = generated_text[0].strip()
generated_text
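
Writing the chat prompt by hand gets tedious as the conversation grows. A minimal sketch of a prompt builder follows; build_chat_prompt is a hypothetical helper that simply joins earlier question and answer pairs in the "Question: ... Answer: ..." format used above.

```python
def build_chat_prompt(history, question):
    """Join earlier (question, answer) pairs and a new question into one prompt."""
    # Each finished turn ends with a single period after the answer
    turns = [f"Question: {q} Answer: {a.rstrip('.')}." for q, a in history]
    # The final turn leaves the answer open for the model to complete
    turns.append(f"Question: {question} Answer:")
    return " ".join(turns)

history = [("Write down what you see in this picture.", "Merlion")]
prompt = build_chat_prompt(history, "Where is the merlion sitting?")
print(prompt)
```

The resulting string matches the hand-written prompt above and can be passed as text=prompt to blip_processor in the same way.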