# Qwen2-VL-2B-Instruct


<img src="https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/Qwen2-VL/qwen2_vl.jpg" alt="Qwen2-VL" width=30% height=30%/>

---

## Qwen2-VL vs Qwen2-VL-Instruct


Qwen2-VL-Instruct is the instruction-tuned version of the foundational Qwen2-VL model. 
Qwen2-VL is the base multimodal large language model, with enhancements such as dynamic resolution and advanced video processing, 
while Qwen2-VL-Instruct is fine-tuned to respond to user instructions and questions in a conversational, task-oriented manner. 

### Qwen2-VL (Base Model)
- Foundation: 
    - The core multimodal model that understands both text and visual information. 
- Advanced Vision: 
    - Improves substantially on vision: it processes images at arbitrary resolution (Naive Dynamic Resolution) and understands long videos (over 20 minutes). 
- Agent Functionality: 
    - Designed to serve as a versatile visual agent capable of complex reasoning, decision-making, and operating devices. 
- Language Capabilities: 
    - Maintains the strong linguistic foundation of the Qwen2 LLM series. 

### Qwen2-VL-Instruct
- Instruction Following: 
    - This is the version you would use to ask questions about images or videos and get direct answers. 
- Fine-tuned for Tasks: 
    - It has undergone instruction tuning, which is a process of training the base model to follow commands and generate helpful, relevant responses to instructions. 
- Conversational Interaction: 
    - Ideal for applications requiring dialogue and direct task execution, such as visual question answering or content creation from visual input. 
- Example Use Cases: 
    - You would use Qwen2-VL-Instruct to ask "What is happening in this picture?" or "Summarize this video". 
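
A question like the ones above reaches the model as a chat message whose content mixes image and text entries, the same structure the Quickstart cell in this notebook uses. The sketch below builds that structure; `build_user_message` is a hypothetical helper name introduced here for illustration, not part of `transformers` or `qwen_vl_utils`.

```python
def build_user_message(question, image_sources=()):
    """Build one user turn: zero or more image entries followed by the text prompt.

    Hypothetical helper for illustration only; it mirrors the `messages`
    layout used in the Quickstart cell of this notebook.
    """
    content = [{"type": "image", "image": src} for src in image_sources]
    content.append({"type": "text", "text": question})
    return {"role": "user", "content": content}


messages = [
    build_user_message(
        "What is happening in this picture?",
        ["https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"],
    )
]
print(messages[0]["content"][-1])
```

Keeping the images before the text matches the ordering used in the cells of this notebook.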

### In Summary
Think of Qwen2-VL as the powerful engine and Qwen2-VL-Instruct as the user-friendly dashboard designed for interaction. 

(powered by Google)


----

### References
- ***Paper***
    - [Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution](https://arxiv.org/abs/2409.12191)
- ***Model card***
    - https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct
- ***GitHub***
    - https://github.com/xwjim/Qwen2-VL

## Device Setup

In [None]:
# import torch

# if torch.backends.mps.is_available():
#     g_device = "mps"
# elif torch.cuda.is_available():
#     g_device = "cuda"
#     !nvidia-smi
# else:
#     g_device = "cpu"

# print(f"Available device : {g_device}")

g_device = "cpu"
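
The commented-out detection logic above can also be wrapped in a function so that it degrades gracefully. The following is a minimal sketch, assuming the same preference order as the comments (MPS, then CUDA, then CPU); `pick_device` is a hypothetical name, and the fallback to `"cpu"` when `torch` is not installed is an added assumption rather than something the notebook does.

```python
def pick_device():
    """Return "mps", "cuda", or "cpu" (hypothetical helper, not a torch API)."""
    try:
        import torch
    except ImportError:
        # Assumption: without torch there is nothing to accelerate, so report CPU.
        return "cpu"
    # Older torch builds may not expose the MPS backend at all.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


detected = pick_device()
print(f"Detected device: {detected}")
```

The notebook itself pins `g_device` to `"cpu"`; assigning `detected` to `g_device` would opt in to whatever accelerator is found.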

## Quickstart

In [None]:
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map=g_device
)

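# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# Note: the block below needs `import torch`, the flash-attn package, and a supported GPU, so it does not apply while g_device is "cpu".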
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-2B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

# The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(g_device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
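
The `min_pixels` / `max_pixels` comments in the cell above express the pixel budget as a token count times `28*28`, which suggests roughly one visual token per 28x28-pixel patch. The sketch below uses that reading to estimate how many visual tokens an image would consume under a given budget. `estimate_visual_tokens` is a hypothetical helper, and its rounding rules are a simplifying assumption; the processor's actual resizing may differ in detail.

```python
import math

PATCH = 28  # pixels per token side, taken from the 28*28 factor in the comments above


def estimate_visual_tokens(height, width,
                           min_pixels=4 * PATCH * PATCH,
                           max_pixels=16384 * PATCH * PATCH):
    """Roughly estimate the visual tokens for one image (illustrative approximation)."""
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")
    # Snap each side to the nearest multiple of PATCH, keeping at least one patch.
    h = max(PATCH, round(height / PATCH) * PATCH)
    w = max(PATCH, round(width / PATCH) * PATCH)
    if h * w > max_pixels:
        # Too large: shrink both sides by the same factor, rounding down.
        scale = math.sqrt(height * width / max_pixels)
        h = max(PATCH, math.floor(height / scale / PATCH) * PATCH)
        w = max(PATCH, math.floor(width / scale / PATCH) * PATCH)
    elif h * w < min_pixels:
        # Too small: enlarge both sides by the same factor, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / PATCH) * PATCH
        w = math.ceil(width * scale / PATCH) * PATCH
    return (h // PATCH) * (w // PATCH)


print(estimate_visual_tokens(448, 448))  # 16 x 16 patches -> 256
print(estimate_visual_tokens(1120, 1120, 256 * PATCH * PATCH, 1280 * PATCH * PATCH))
```

Lowering `max_pixels` caps the token count per image, which is the speed and memory trade-off the comment in the cell describes.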


## Without qwen_vl_utils

In [None]:
from PIL import Image
import requests

# `model`, `processor`, and `g_device` are reused from the cells above.


# Image
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]


# Preprocess the inputs
text_prompt = processor.apply_chat_template(
    conversation, tokenize=False, add_generation_prompt=True
)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'

inputs = processor(
    text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
inputs = inputs.to(g_device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
# Use distinct loop names so the comprehension does not shadow `generated_ids`.
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text)
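
To make the expected prompt string in the cell above easier to read, the sketch below hand-renders the same shape for a single user turn. It is a simplified illustration only: the real template is applied by `processor.apply_chat_template` and covers more cases (videos, assistant turns, custom system prompts). `render_single_turn` and the constants are hypothetical names introduced here.

```python
DEFAULT_SYSTEM = "You are a helpful assistant."
IMAGE_PLACEHOLDER = "<|vision_start|><|image_pad|><|vision_end|>"


def render_single_turn(conversation, system=DEFAULT_SYSTEM):
    """Hand-render a prompt in the shape of the expected-output comment (illustration only)."""
    parts = [f"<|im_start|>system\n{system}<|im_end|>\n"]
    for turn in conversation:
        parts.append(f"<|im_start|>{turn['role']}\n")
        for item in turn["content"]:
            if item["type"] == "image":
                # Each image entry becomes one placeholder the processor later expands.
                parts.append(IMAGE_PLACEHOLDER)
            elif item["type"] == "text":
                parts.append(item["text"])
        parts.append("<|im_end|>\n")
    # The generation prompt: an open assistant turn for the model to complete.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


demo_conversation = [
    {
        "role": "user",
        "content": [{"type": "image"}, {"type": "text", "text": "Describe this image."}],
    }
]
print(repr(render_single_turn(demo_conversation)))
```

Comparing this output with `text_prompt` is a quick way to check that the image placeholder and the generation prompt are where you expect them.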


## Multi-image inference

In [None]:
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "./data/bee.jpg"},
            {"type": "image", "image": "./data/flowers.jpg"},
            {"type": "text", "text": "Identify the similarities between these images."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(g_device)

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
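
Each cell above slices `out_ids[len(in_ids):]` before decoding. The reason, as far as the code shows, is that `generate` returns the prompt tokens followed by the newly generated ones, so the prompt has to be cut off to decode only the answer. The sketch below shows the same slicing on plain Python lists; the token ids are made up and `trim_prompt_tokens` is a hypothetical helper name.

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Drop the echoed prompt from each generated sequence, keeping only new tokens."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Toy token ids standing in for tensors; the values are invented for illustration.
prompt_batch = [[101, 7, 8, 9], [101, 5, 6, 9]]
generated_batch = [[101, 7, 8, 9, 42, 43], [101, 5, 6, 9, 44, 45]]

print(trim_prompt_tokens(prompt_batch, generated_batch))  # [[42, 43], [44, 45]]
```

Without this step, the decoded text would start with the rendered chat prompt instead of the model's reply.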