## Multimodal Granite 3.1 2B

IBM released Granite-Vision-3.1-2B preview, a compact LLaVA-like vision language model that uses Granite Instruct 3.1 as its text backbone and SigLIP as its image backbone.

It posts strong [scores on different benchmarks](https://huggingface.co/ibm-granite/granite-vision-3.1-2b-preview#granite-vision-31-2b-preview) for its size, in both vision understanding and document understanding.

It comes with transformers and vLLM integration from the start, too! Let's put it to the test.

In [None]:
!pip install -q pdf2image git+https://github.com/huggingface/transformers.git
!sudo apt-get install -y -q poppler-utils

In [None]:
!wget -q https://www.europarl.europa.eu/pdfs/news/expert/2018/7/story/20180706STO07407/20180706STO07407_en.pdf

In [None]:
from pdf2image import convert_from_path

pdf_path = "20180706STO07407_en.pdf"
images = convert_from_path(pdf_path)
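
`convert_from_path` renders every page of the PDF, although only one page is used below. A minimal sketch of rendering a single page instead, using pdf2image's `first_page`, `last_page`, and `dpi` parameters. The helper name `load_page` and the 200 DPI default are illustrative choices, not part of the original notebook.

```python
def load_page(pdf_path, index, dpi=200):
    """Render one PDF page (zero-based index) as a PIL image."""
    # Imported inside the function so pdf2image is only needed when it is called.
    from pdf2image import convert_from_path

    if index < 0:
        raise ValueError("index must be non-negative")
    page_number = index + 1  # pdf2image counts pages from 1
    pages = convert_from_path(
        pdf_path, dpi=dpi, first_page=page_number, last_page=page_number
    )
    return pages[0]
```

With this helper, `load_page(pdf_path, 1)` would render the same page as `images[1]` without converting the rest of the document.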

We'd like to ask Granite Vision to explain a chart, and perhaps ask further questions.

In [None]:
images[1]

We can use the `LlavaNextForConditionalGeneration` class to load Granite Vision and run inference.

In [None]:
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_path = "ibm-granite/granite-vision-3.1-2b-preview"
processor = LlavaNextProcessor.from_pretrained(model_path)
model = LlavaNextForConditionalGeneration.from_pretrained(model_path, device_map="cuda:0")

We will use the processor's chat template to format our image and text input together, then pass the result to the model.

In [None]:
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": images[1]},
            {"type": "text", "text": "Explain the chart in the image in detail."},
        ],
    },
]
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to("cuda:0")


# autoregressively complete prompt
output = model.generate(**inputs, max_new_tokens=500)
print(processor.decode(output[0], skip_special_tokens=True))
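
To ask further questions about the same image, the conversation can be extended with the model's answer and a new user turn before calling `apply_chat_template` again. The sketch below only builds the message list. `user_turn` and `add_follow_up` are hypothetical helpers, and the assistant message format is an assumption about what the model's chat template accepts, so check it against the model card before relying on it.

```python
def user_turn(prompt, image=None):
    """Build one user message; the image part is optional."""
    content = []
    if image is not None:
        # Same "url" key as the cell above, which passes a PIL image there.
        content.append({"type": "image", "url": image})
    content.append({"type": "text", "text": prompt})
    return {"role": "user", "content": content}


def add_follow_up(conversation, answer, question):
    """Return a new conversation with the model's answer and a new question appended."""
    assistant_turn = {
        "role": "assistant",
        # Assumption: the chat template accepts assistant text in this form.
        "content": [{"type": "text", "text": answer}],
    }
    return conversation + [assistant_turn, user_turn(question)]
```

For example, `[user_turn("Explain the chart in the image in detail.", images[1])]` reproduces the conversation above, and `add_follow_up(conversation, answer, "Which value is the largest?")` returns a three-message list that can go through the same `apply_chat_template` and `generate` calls. The follow-up question here is only a placeholder.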


The authors report various benchmarks. One that caught my eye is LiveXiv, which is about arXiv paper understanding. Let's put it to the test.

In [None]:
!wget -q https://arxiv.org/pdf/2405.04324

In [None]:
# wget saved the file into the working directory (/content on Colab)
pdf_path = "2405.04324"
images = convert_from_path(pdf_path)

We'll test its understanding of a page from IBM's Granite Code paper.

In [None]:
images[5]

In [None]:
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": images[5]},
            {"type": "text", "text": "What differences does this paper contribute to model architecture and training?"},
        ],
    },
]
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to("cuda:0")


# autoregressively complete prompt
output = model.generate(**inputs, max_new_tokens=500)
print(processor.decode(output[0], skip_special_tokens=True))
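
`model.generate` returns the prompt tokens followed by the newly generated ones, so decoding `output[0]` prints the question again before the answer. A minimal sketch of keeping only the completion follows. `completion_only` is a hypothetical helper, and it assumes a single unpadded prompt as in the cells above.

```python
def completion_only(output_ids, prompt_length):
    """Drop the first `prompt_length` tokens, leaving only the generated ones."""
    if prompt_length > len(output_ids):
        raise ValueError("prompt is longer than the generated sequence")
    return output_ids[prompt_length:]
```

In this notebook that would look like `answer_ids = completion_only(output[0], inputs["input_ids"].shape[1])`, followed by `processor.decode(answer_ids, skip_special_tokens=True)`.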


The answer is correct!

Get more info about the model from [the model repository](https://huggingface.co/ibm-granite/granite-vision-3.1-2b-preview), including benchmarks and how to get started with transformers and vLLM.