[![Roboflow Notebooks](https://media.roboflow.com/notebooks/template/bannertest2-2.png?ik-sdk-version=javascript-1.4.3&updatedAt=1672932710194)](https://github.com/roboflow/notebooks)

# Open Vocabulary Object Detection with Qwen3-VL

[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/open-vocabluary-object-detection-with-qwen3-vl.ipynb)
[![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/QwenLM/Qwen3-VL)

Qwen3-VL gives you a strong omni recognition layer. It is trained on broader and cleaner visual data, so it recognizes many entity types in one model: celebrities, anime characters, products, landmarks, animals, and plants. It handles dense scenes with many instances and returns structured JSON with labels and bounding boxes, which fits detection-style workflows. You can point it at an image, ask high-level questions, and still get precise, machine-friendly outputs that plug into your own tools or code.

For grounding, Qwen3-VL moves to a relative 2D coordinate system that runs from 0 to 1000 along both width and height, which simplifies integration with resizing, slicing, and postprocessing. It supports multi-target grounding in complex scenes and can output hundreds of boxes across categories such as head, hand, man, woman, and glasses in one pass. It also parses spatial relations like left or right, in front or behind, and far or near, and handles occlusion, viewpoint changes, and motion patterns. On top of that, it supports 3D grounding with depth-aware reasoning, which gives you a starting point for robotics, navigation, and other embodied AI applications.
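
To make the relative coordinate system concrete, here is a minimal sketch of mapping a box from the 0 to 1000 grid back to pixels. The helper name `to_pixel_box` and the `(x1, y1, x2, y2)` corner order are assumptions made for illustration; later in this notebook the conversion is handled by `sv.Detections.from_vlm`.

```python
def to_pixel_box(box_1000, image_wh):
    """Scale an (x1, y1, x2, y2) box from the 0-1000 relative grid to pixels.

    Hypothetical helper: the corner order is an assumption, not a documented API.
    """
    width, height = image_wh
    x1, y1, x2, y2 = box_1000
    return (
        x1 / 1000 * width,
        y1 / 1000 * height,
        x2 / 1000 * width,
        y2 / 1000 * height,
    )


# Made-up box on a 1920x1080 image.
print(to_pixel_box((250, 500, 750, 1000), (1920, 1080)))
# → (480.0, 540.0, 1440.0, 1080.0)
```

Because the grid is independent of resolution, the same relative box stays valid after the image is resized; only the `image_wh` argument changes.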

![Qwen3-VL](https://storage.googleapis.com/com-roboflow-marketing/notebooks/examples/qwen3vl_arc.jpg)

## Environment setup

### Configure your API keys

To load Qwen3-VL from the Hugging Face Hub, provide your Hugging Face token. Follow these steps:

- Open your Hugging Face Settings page. Click Access Tokens, then New Token, to generate a new token.
- In Colab, go to the left pane and click on Secrets (🔑). Store your Hugging Face access token under the name HF_TOKEN.

In [None]:
import os
from google.colab import userdata

os.environ["HF_TOKEN"] = userdata.get("HF_TOKEN")

### Check GPU availability

Let's make sure we have access to a GPU by running the `nvidia-smi` command. If no GPU is listed, navigate to `Edit` -> `Notebook settings` -> `Hardware accelerator`, set it to `L4 GPU`, and then click `Save`.

The required VRAM depends on the Qwen3-VL checkpoint you want to use. An A100 works for `Qwen/Qwen3-VL-8B-Instruct`. If you plan to load a smaller checkpoint, you can switch to an L4 or T4.
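
As a rough guide to picking a GPU, the sketch below estimates the memory needed for the model weights alone. It assumes the nominal parameter counts implied by the checkpoint names (2B, 4B, 8B) and ignores activations, the KV cache, and framework overhead, so treat the numbers as lower bounds. The function name is hypothetical.

```python
# Bytes used to store one parameter in common dtypes.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}


def estimate_weight_memory_gb(num_params_billion, dtype="bfloat16"):
    """Rough memory for the weights only, in GB (10^9 bytes).

    billions of parameters x bytes per parameter = gigabytes.
    """
    return num_params_billion * BYTES_PER_PARAM[dtype]


# Nominal sizes taken from the checkpoint names; real counts differ slightly.
for size in (2, 4, 8):
    print(
        f"{size}B params: ~{estimate_weight_memory_gb(size)} GB in bfloat16, "
        f"~{estimate_weight_memory_gb(size, 'float32')} GB in float32"
    )
```

Under these assumptions the 8B checkpoint needs about 16 GB for weights in bfloat16 and about 32 GB in float32, which is why a larger GPU is the safer choice for it.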

In [None]:
!nvidia-smi

### Install dependencies

In [None]:
!pip install -q transformers supervision==0.27.0rc6

### Download example data

In [None]:
!wget -q https://storage.googleapis.com/com-roboflow-marketing/notebooks/examples/traffic_jam.jpg
!wget -q https://storage.googleapis.com/com-roboflow-marketing/notebooks/examples/solvay_conference.jpg
!wget -q https://storage.googleapis.com/com-roboflow-marketing/notebooks/examples/basketball_game.jpg

## Load Qwen3-VL model

Loads the Qwen3-VL model (and its processor) from Hugging Face, preparing the model for inference.

In [None]:
from transformers import AutoProcessor, AutoModelForImageTextToText

# MODEL_ID = "Qwen/Qwen3-VL-2B-Instruct"
# MODEL_ID = "Qwen/Qwen3-VL-4B-Instruct"
MODEL_ID = "Qwen/Qwen3-VL-8B-Instruct"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(MODEL_ID).to("cuda")

## Qwen3-VL inference and visualization

### Define helper functions

Creates a function to run text prompts against the Qwen3-VL model and parse the results.

In [None]:
import torch
from PIL import Image

def qwen_detect(image: Image.Image, target: str, max_new_tokens: int = 1024) -> str:
    """Ask Qwen3-VL to locate `target` in `image` and return the raw text response."""

    prompt = (
        f"Outline the position of {target} and output all the coordinates in JSON format."
    )

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": prompt},
            ],
        }
    ]

    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
    ).to("cuda")

    with torch.inference_mode():
        gen = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
        )

    trimmed = [g[len(i):] for i, g in zip(inputs.input_ids, gen)]
    text = processor.batch_decode(trimmed, skip_special_tokens=True)[0]
    return text
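
`qwen_detect` returns raw text, so it helps to see how such a response could be turned into Python objects. The sketch below assumes the model replies with a JSON list of objects that carry `bbox_2d` and `label` keys, optionally wrapped in a Markdown code fence; that response shape and the helper name `parse_qwen_boxes` are assumptions for illustration. In this notebook the parsing is done by `sv.Detections.from_vlm`.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker, built here to keep the source readable


def parse_qwen_boxes(text):
    """Return the list of dicts that contain a "bbox_2d" key, or [] on failure."""
    # Drop an optional Markdown code fence around the JSON payload.
    pattern = FENCE + r"(?:json)?\s*(.*?)" + FENCE
    match = re.search(pattern, text, flags=re.DOTALL)
    payload = match.group(1) if match else text
    try:
        items = json.loads(payload)
    except json.JSONDecodeError:
        return []  # truncated or malformed output
    if isinstance(items, dict):
        items = [items]  # a single object instead of a list
    if not isinstance(items, list):
        return []
    return [item for item in items if isinstance(item, dict) and "bbox_2d" in item]


# Made-up response in the assumed format.
sample = FENCE + 'json\n[{"bbox_2d": [100, 200, 300, 400], "label": "taxi"}]\n' + FENCE
print(parse_qwen_boxes(sample))
# → [{'bbox_2d': [100, 200, 300, 400], 'label': 'taxi'}]
```

Returning an empty list on malformed JSON is a deliberate choice: when `max_new_tokens` cuts a long response short, the caller gets no boxes rather than an exception.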

Creates a function to annotate an image with detection results.

In [None]:
import supervision as sv
from PIL import Image

COLOR = sv.ColorPalette.from_hex([
    "#ffff00", "#ff9b00", "#ff66ff", "#3399ff", "#ff66b2", "#ff8080",
    "#b266ff", "#9999ff", "#66ffff", "#33ff99", "#66ff66", "#99ff00"
])

def annotate_image(image: Image.Image, detections: sv.Detections, smart_position: bool = True) -> Image.Image:
    """Draw boxes and labels for `detections` on a copy of `image`."""
    text_scale = sv.calculate_optimal_text_scale(resolution_wh=image.size)
    thickness = sv.calculate_optimal_line_thickness(resolution_wh=image.size)

    color_by_class = detections.class_id is not None
    box_annotator = sv.BoxAnnotator(
        color=COLOR,
        color_lookup=sv.ColorLookup.CLASS if color_by_class else sv.ColorLookup.INDEX,
        thickness=thickness
    )
    label_annotator = sv.LabelAnnotator(
        color=COLOR,
        color_lookup=sv.ColorLookup.CLASS if color_by_class else sv.ColorLookup.INDEX,
        text_color=sv.Color.BLACK,
        text_scale=text_scale,
        text_thickness=thickness - 1,
        smart_position=smart_position
    )

    annotated_image = image.copy()
    annotated_image = box_annotator.annotate(annotated_image, detections)
    return label_annotator.annotate(annotated_image, detections)

### Detect "yellow taxi"

In [None]:
IMAGE = "/content/traffic_jam.jpg"

TARGET = "yellow taxi"

image = Image.open(IMAGE).convert("RGB")
response = qwen_detect(image, TARGET)

print(response)

detections = sv.Detections.from_vlm(
    vlm=sv.VLM.QWEN_3_VL,
    result=response,
    resolution_wh=image.size
)

annotated_image = image.copy()
annotated_image = annotate_image(image=annotated_image, detections=detections)
annotated_image.thumbnail((800, 800))
annotated_image

### Detect "taxi"

In [None]:
IMAGE = "/content/traffic_jam.jpg"

TARGET = "taxi"

image = Image.open(IMAGE).convert("RGB")
response = qwen_detect(image, TARGET)

print(response)

detections = sv.Detections.from_vlm(
    vlm=sv.VLM.QWEN_3_VL,
    result=response,
    resolution_wh=image.size
)

annotated_image = image.copy()
annotated_image = annotate_image(image=annotated_image, detections=detections)
annotated_image.thumbnail((800, 800))
annotated_image

### Detect "man, woman"

In [None]:
IMAGE = "/content/solvay_conference.jpg"

TARGET = "man, woman"
CLASSES = ["man", "woman"]

image = Image.open(IMAGE).convert("RGB")
response = qwen_detect(image, TARGET)

print(response)

detections = sv.Detections.from_vlm(
    vlm=sv.VLM.QWEN_3_VL,
    result=response,
    resolution_wh=image.size,
    classes=CLASSES
)

annotated_image = image.copy()
annotated_image = annotate_image(image=annotated_image, detections=detections, smart_position=False)
annotated_image.thumbnail((800, 800))
annotated_image
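
The cell above passes a `CLASSES` list so that each detection can be tied to a class index, which `annotate_image` then uses to pick a color per class. The sketch below illustrates that idea in plain Python. It is not supervision's implementation; the helper name and the choice to drop labels that are not in the list are assumptions.

```python
CLASSES = ["man", "woman"]


def labels_to_class_ids(labels, classes):
    """Keep labels found in `classes` and map each one to its list index."""
    kept = [label for label in labels if label in classes]
    return kept, [classes.index(label) for label in kept]


labels = ["man", "woman", "man", "hat"]  # made-up model labels
kept, class_ids = labels_to_class_ids(labels, CLASSES)
print(kept)       # → ['man', 'woman', 'man']
print(class_ids)  # → [0, 1, 0]
```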

### Detect "Albert and Marie"

In [None]:
IMAGE = "/content/solvay_conference.jpg"

TARGET = "albert and marie"

image = Image.open(IMAGE).convert("RGB")
response = qwen_detect(image, TARGET)

print(response)

detections = sv.Detections.from_vlm(
    vlm=sv.VLM.QWEN_3_VL,
    result=response,
    resolution_wh=image.size
)

annotated_image = image.copy()
annotated_image = annotate_image(image=annotated_image, detections=detections, smart_position=False)
annotated_image.thumbnail((800, 800))
annotated_image

### Detect "person between Albert and Marie"

In [None]:
IMAGE = "/content/solvay_conference.jpg"

TARGET = "person between albert and marie"

image = Image.open(IMAGE).convert("RGB")
response = qwen_detect(image, TARGET)

print(response)

detections = sv.Detections.from_vlm(
    vlm=sv.VLM.QWEN_3_VL,
    result=response,
    resolution_wh=image.size
)

annotated_image = image.copy()
annotated_image = annotate_image(image=annotated_image, detections=detections, smart_position=False)
annotated_image.thumbnail((800, 800))
annotated_image

### Detect "hat"

In [None]:
IMAGE = "/content/solvay_conference.jpg"

TARGET = "hat"

image = Image.open(IMAGE).convert("RGB")
response = qwen_detect(image, TARGET)

print(response)

detections = sv.Detections.from_vlm(
    vlm=sv.VLM.QWEN_3_VL,
    result=response,
    resolution_wh=image.size
)

annotated_image = image.copy()
annotated_image = annotate_image(image=annotated_image, detections=detections, smart_position=False)
annotated_image.thumbnail((800, 800))
annotated_image

### Detect "cane"

In [None]:
IMAGE = "/content/solvay_conference.jpg"

TARGET = "cane"

image = Image.open(IMAGE).convert("RGB")
response = qwen_detect(image, TARGET)

print(response)

detections = sv.Detections.from_vlm(
    vlm=sv.VLM.QWEN_3_VL,
    result=response,
    resolution_wh=image.size
)

annotated_image = image.copy()
annotated_image = annotate_image(image=annotated_image, detections=detections, smart_position=False)
annotated_image.thumbnail((800, 800))
annotated_image

### Detect "people in front row"

In [None]:
IMAGE = "/content/solvay_conference.jpg"

TARGET = "people in front row"

image = Image.open(IMAGE).convert("RGB")
response = qwen_detect(image, TARGET)

print(response)

detections = sv.Detections.from_vlm(
    vlm=sv.VLM.QWEN_3_VL,
    result=response,
    resolution_wh=image.size
)

annotated_image = image.copy()
annotated_image = annotate_image(image=annotated_image, detections=detections, smart_position=False)
annotated_image.thumbnail((800, 800))
annotated_image

### Detect "player in red"

In [None]:
IMAGE = "/content/basketball_game.jpg"

TARGET = "player in red"

image = Image.open(IMAGE).convert("RGB")
response = qwen_detect(image, TARGET)

print(response)

detections = sv.Detections.from_vlm(
    vlm=sv.VLM.QWEN_3_VL,
    result=response,
    resolution_wh=image.size
)

annotated_image = image.copy()
annotated_image = annotate_image(image=annotated_image, detections=detections, smart_position=False)
annotated_image.thumbnail((800, 800))
annotated_image

### Detect "player in white"

In [None]:
IMAGE = "/content/basketball_game.jpg"

TARGET = "player in white"

image = Image.open(IMAGE).convert("RGB")
response = qwen_detect(image, TARGET)

print(response)

detections = sv.Detections.from_vlm(
    vlm=sv.VLM.QWEN_3_VL,
    result=response,
    resolution_wh=image.size
)

annotated_image = image.copy()
annotated_image = annotate_image(image=annotated_image, detections=detections, smart_position=False)
annotated_image.thumbnail((800, 800))
annotated_image

### Detect "player #7"

In [None]:
IMAGE = "/content/basketball_game.jpg"

TARGET = "player #7"

image = Image.open(IMAGE).convert("RGB")
response = qwen_detect(image, TARGET)

print(response)

detections = sv.Detections.from_vlm(
    vlm=sv.VLM.QWEN_3_VL,
    result=response,
    resolution_wh=image.size
)

annotated_image = image.copy()
annotated_image = annotate_image(image=annotated_image, detections=detections, smart_position=False)
annotated_image.thumbnail((800, 800))
annotated_image

### Detect "player new york knicks"

In [None]:
IMAGE = "/content/basketball_game.jpg"

TARGET = "player new york knicks"

image = Image.open(IMAGE).convert("RGB")
response = qwen_detect(image, TARGET)

print(response)

detections = sv.Detections.from_vlm(
    vlm=sv.VLM.QWEN_3_VL,
    result=response,
    resolution_wh=image.size
)

annotated_image = image.copy()
annotated_image = annotate_image(image=annotated_image, detections=detections, smart_position=False)
annotated_image.thumbnail((800, 800))
annotated_image

### Detect "player about to shoot"

In [None]:
IMAGE = "/content/basketball_game.jpg"

TARGET = "player about to shoot"

image = Image.open(IMAGE).convert("RGB")
response = qwen_detect(image, TARGET)

print(response)

detections = sv.Detections.from_vlm(
    vlm=sv.VLM.QWEN_3_VL,
    result=response,
    resolution_wh=image.size
)

annotated_image = image.copy()
annotated_image = annotate_image(image=annotated_image, detections=detections, smart_position=False)
annotated_image.thumbnail((800, 800))
annotated_image

### Detect "player new york knicks on bench"

In [None]:
IMAGE = "/content/basketball_game.jpg"

TARGET = "player new york knicks on bench"

image = Image.open(IMAGE).convert("RGB")
response = qwen_detect(image, TARGET)

print(response)

detections = sv.Detections.from_vlm(
    vlm=sv.VLM.QWEN_3_VL,
    result=response,
    resolution_wh=image.size
)

annotated_image = image.copy()
annotated_image = annotate_image(image=annotated_image, detections=detections, smart_position=False)
annotated_image.thumbnail((800, 800))
annotated_image

### Detect "sponsor logo"

In [None]:
IMAGE = "/content/basketball_game.jpg"

TARGET = "sponsor logo"

image = Image.open(IMAGE).convert("RGB")
response = qwen_detect(image, TARGET)

print(response)

detections = sv.Detections.from_vlm(
    vlm=sv.VLM.QWEN_3_VL,
    result=response,
    resolution_wh=image.size
)

annotated_image = image.copy()
annotated_image = annotate_image(image=annotated_image, detections=detections, smart_position=False)
annotated_image.thumbnail((800, 800))
annotated_image

<div align="center">
  <p>
    Looking for more tutorials or have questions?
    Check out our <a href="https://github.com/roboflow/notebooks">GitHub repo</a> for more notebooks,
    or visit our <a href="https://discord.gg/GbfgXGJ8Bk">Discord</a>.
  </p>
  
  <p>
    <strong>If you found this helpful, please consider giving us a ⭐
    <a href="https://github.com/roboflow/notebooks">on GitHub</a>!</strong>
  </p>

</div>