<a href="https://colab.research.google.com/github/kima-rafayelyan/vllm-coco-captioner/blob/main/main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Environment Setup and Library Imports**
This block installs the required packages (vllm, qwen-vl-utils, transformers, accelerate), imports the libraries used later in the notebook (including sqlite3 and pandas), and creates a local data directory.



In [None]:
!pip install vllm
!pip install qwen-vl-utils
!pip install -U transformers accelerate


import os
import sqlite3

import pandas as pd
from qwen_vl_utils import process_vision_info
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Create the local data directory (no error if it already exists).
os.makedirs("data", exist_ok=True)

Collecting qwen-vl-utils
  Using cached qwen_vl_utils-0.0.14-py3-none-any.whl.metadata (9.0 kB)
Collecting av (from qwen-vl-utils)
  Using cached av-16.0.1-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (4.6 kB)
Using cached qwen_vl_utils-0.0.14-py3-none-any.whl (8.1 kB)
Using cached av-16.0.1-cp312-cp312-manylinux_2_28_x86_64.whl (40.5 MB)
Installing collected packages: av, qwen-vl-utils
Successfully installed av-16.0.1 qwen-vl-utils-0.0.14
Collecting transformers
  Downloading transformers-4.57.3-py3-none-any.whl.metadata (43 kB)
Downloading transformers-4.57.3-py3-none-any.whl (12.0 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.57.2
    Uninstalling transfo

# **Model Configuration and Database Initialization**
This block defines the model path and initializes the SQLite database. It connects to the file qwen_t4_captions.db and runs a SQL command that creates the captions table if it does not already exist, with fields for the image id (the primary key), the url, and the generated caption.

In [None]:
MODEL_PATH = "Qwen/Qwen2-VL-2B-Instruct"

conn = sqlite3.connect("qwen_t4_captions.db")
cursor = conn.cursor()
cursor.execute('CREATE TABLE IF NOT EXISTS captions (id INTEGER PRIMARY KEY, url TEXT, caption TEXT)')
conn.commit()
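
Because id is the table's primary key, the INSERT OR REPLACE statement used later in the notebook overwrites an existing row instead of adding a duplicate. The following is a minimal sketch of that behavior. It assumes an in-memory database and two made-up caption strings, so it does not touch qwen_t4_captions.db.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind (an assumption;
# the notebook itself writes to qwen_t4_captions.db).
demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()
demo_cur.execute(
    "CREATE TABLE IF NOT EXISTS captions "
    "(id INTEGER PRIMARY KEY, url TEXT, caption TEXT)"
)

# Two rows with the same id; the caption strings are placeholders.
url = "http://images.cocodataset.org/val2017/000000000785.jpg"
for placeholder_caption in ("first draft", "second draft"):
    demo_cur.execute(
        "INSERT OR REPLACE INTO captions (id, url, caption) VALUES (?, ?, ?)",
        (785, url, placeholder_caption),
    )
demo_conn.commit()

# Only one row remains, holding the most recent caption.
stored = demo_cur.execute("SELECT id, caption FROM captions").fetchall()
demo_conn.close()
print(stored)
```

Re-running the captioning cell should therefore refresh captions in place rather than grow the table.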

# **vLLM and Sampling Parameter Setup**
This is the engine setup phase. It initializes the vLLM engine (llm), loading the Qwen model onto the GPU in half precision with a maximum model length of 4096 tokens. It also loads the corresponding tokenizer and defines the SamplingParams that control generation: a low temperature=0.2 keeps sampling close to the most likely tokens, max_tokens=128 limits the caption length, and stop_token_ids=[151645] ends generation when that token is produced.

In [None]:

llm = LLM(
    model=MODEL_PATH,
    dtype="half",
    max_model_len=4096,
    gpu_memory_utilization=0.95,
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
sampling_params = SamplingParams(temperature=0.2, max_tokens=128, stop_token_ids=[151645])

INFO 12-03 22:57:58 [utils.py:253] non-default args: {'trust_remote_code': True, 'dtype': 'half', 'seed': None, 'max_model_len': 4096, 'gpu_memory_utilization': 0.95, 'disable_log_stats': True, 'model': 'Qwen/Qwen2-VL-2B-Instruct'}


The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



INFO 12-03 22:58:31 [model.py:637] Resolved architecture: Qwen2VLForConditionalGeneration
INFO 12-03 22:58:31 [model.py:1750] Using max model len 4096
INFO 12-03 22:58:34 [scheduler.py:228] Chunked prefill is enabled with max_num_batched_tokens=8192.



INFO 12-03 23:06:31 [llm.py:343] Supported tasks: ['generate']
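
To give a rough intuition for why temperature=0.2 makes captions more repeatable, here is a toy sketch of temperature-scaled softmax. It illustrates the general technique only and is not vLLM's internal sampler. The function name and the three logits are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate tokens.
toy_logits = [2.0, 1.0, 0.5]

probs_default = softmax_with_temperature(toy_logits, 1.0)
probs_low = softmax_with_temperature(toy_logits, 0.2)

print([round(p, 3) for p in probs_default])
print([round(p, 3) for p in probs_low])
```

At the lower temperature, almost all of the probability mass moves to the highest-scoring token, so repeated runs tend to produce the same text.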


# **Input Data Definition (COCO Subset)**
This block explicitly defines the input data structure: a list named coco_subset. This list contains dictionaries, each linking a unique COCO image identifier (id) to its direct HTTP URL. The database connection is also re-established to ensure it's ready for insertion in the subsequent loop.

In [None]:
coco_subset = [
    {"id": 397133, "url": "http://images.cocodataset.org/val2017/000000397133.jpg"},
    {"id": 785,    "url": "http://images.cocodataset.org/val2017/000000000785.jpg"},
    {"id": 87038,  "url": "http://images.cocodataset.org/val2017/000000087038.jpg"},
    {"id": 174482, "url": "http://images.cocodataset.org/val2017/000000174482.jpg"}
]

conn = sqlite3.connect("qwen_t4_captions.db")
cursor = conn.cursor()
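
The four URLs above share one pattern: the image id, zero-padded to 12 digits, under the val2017 path. As a sketch, a list like this could be generated from the ids alone. The helper name coco_val2017_url is hypothetical, and the pattern is inferred only from the URLs listed in this notebook.

```python
def coco_val2017_url(image_id: int) -> str:
    """Build a val2017 image URL by zero-padding the id to 12 digits."""
    return f"http://images.cocodataset.org/val2017/{image_id:012d}.jpg"


# Same ids as in coco_subset above.
image_ids = [397133, 785, 87038, 174482]
generated_subset = [{"id": i, "url": coco_val2017_url(i)} for i in image_ids]

for entry in generated_subset:
    print(entry)
```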

# **The Main Processing Loop and Database Storage**
This block loops over each image in coco_subset. For each one, it constructs the multimodal chat prompt (image URL plus a text instruction), prepares the input data for vLLM, and calls llm.generate() to produce the caption. The generated text is extracted and stored (id, url, caption) in the SQLite database using an INSERT OR REPLACE command. A try...except block catches errors raised during generation or storage so that one failed item does not stop the loop. Finally, the database connection is closed.

In [None]:
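for item in coco_subset:
    # Multimodal chat message: the image URL plus a text instruction.
    messages = [
        {"role": "user", "content": [
            {"type": "image", "image": item['url']},
            {"type": "text", "text": "Describe this image in one sentence."}
        ]}
    ]

    try:
        # Render the chat template to a prompt string and load the image.
        # Both steps sit inside the try block so that a failed image download
        # is reported for this item instead of aborting the whole loop.
        prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        image_data, video_data = process_vision_info(messages)

        inputs = {
            "prompt": prompt,
            "multi_modal_data": {"image": image_data},
        }

        # Generate one caption for this image.
        outputs = llm.generate([inputs], sampling_params=sampling_params)
        caption = outputs[0].outputs[0].text.strip()

        # Insert or overwrite the row for this image id.
        cursor.execute("INSERT OR REPLACE INTO captions (id, url, caption) VALUES (?, ?, ?)",
                       (item['id'], item['url'], caption))
        conn.commit()
        print(f"ID {item['id']}: {caption}")

    except Exception as e:
        print(f"Error on {item['id']}: {e}")

# Close the connection once every image has been processed.
conn.close()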


The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.



ID 397133: A person is standing in a kitchen, wearing an apron, and pointing at a table with dough and other kitchen items. The kitchen is equipped with various cooking utensils and appliances, including a stove, sink, and oven.



ID 785: A woman in a red jacket and black pants is skiing down a snowy slope, holding ski poles and wearing goggles.



ID 87038: A skateboarder in a purple shirt is performing a trick in a skate park with graffiti-covered walls in the background.



ID 174482: A blue bicycle is parked on a sidewalk next to a street, with a white van and several other vehicles visible in the background.
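
The per-item try...except pattern used in the loop can be sketched in isolation. This version swaps the model for a stand-in function so that it runs without a GPU. The names caption_all and fake_caption and the example.com URLs are hypothetical and are not part of the notebook.

```python
def caption_all(items, caption_fn):
    """Caption every item, recording failures instead of stopping at the first."""
    captions, failures = {}, {}
    for item in items:
        try:
            captions[item["id"]] = caption_fn(item["url"]).strip()
        except Exception as exc:
            failures[item["id"]] = str(exc)
    return captions, failures


def fake_caption(url):
    """Stand-in for the model call: fails on one URL, succeeds otherwise."""
    if url.endswith("bad.jpg"):
        raise ValueError("could not load image")
    return "  a placeholder caption  "


demo_items = [
    {"id": 1, "url": "http://example.com/good.jpg"},
    {"id": 2, "url": "http://example.com/bad.jpg"},
    {"id": 3, "url": "http://example.com/also_good.jpg"},
]

captions, failures = caption_all(demo_items, fake_caption)
print(captions)
print(failures)
```

Collecting failures in a dictionary, rather than only printing them, makes it easier to retry just the items that failed.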


# **Display**
This final block reconnects to the database and uses Pandas (pd.read_sql) to load every row of the captions table into a DataFrame. It configures Pandas to show the full caption text (pd.set_option) and displays the final table, confirming that the captions were generated and stored.

In [None]:
conn = sqlite3.connect("qwen_t4_captions.db")
df = pd.read_sql("SELECT * FROM captions", conn)
conn.close()

pd.set_option('display.max_colwidth', None)
display(df)

| | id | url | caption |
|---|---|---|---|
| 0 | 785 | http://images.cocodataset.org/val2017/000000000785.jpg | A woman in a red jacket and black pants is skiing down a snowy slope, holding ski poles and wearing goggles. |
| 1 | 87038 | http://images.cocodataset.org/val2017/000000087038.jpg | A skateboarder in a purple shirt is performing a trick in a skate park with graffiti-covered walls in the background. |
| 2 | 174482 | http://images.cocodataset.org/val2017/000000174482.jpg | A blue bicycle is parked on a sidewalk next to a street, with a white van and several other vehicles visible in the background. |
| 3 | 397133 | http://images.cocodataset.org/val2017/000000397133.jpg | A person is standing in a kitchen, wearing an apron, and pointing at a table with dough and other kitchen items. The kitchen is equipped with various cooking utensils and appliances, including a stove, sink, and oven. |
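
The rows above appear in ascending id order, not in the order the loop processed them (397133 came first). That matches how SQLite typically scans a table whose INTEGER PRIMARY KEY is id, but SQL does not guarantee row order without an ORDER BY clause. The sketch below uses an in-memory database and placeholder captions (both assumptions) to show an explicit ordering.

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")
conn_demo.execute(
    "CREATE TABLE captions (id INTEGER PRIMARY KEY, url TEXT, caption TEXT)"
)

# Insert in the same order as coco_subset; the captions are placeholders.
for image_id in (397133, 785, 87038, 174482):
    conn_demo.execute(
        "INSERT OR REPLACE INTO captions (id, url, caption) VALUES (?, ?, ?)",
        (
            image_id,
            f"http://images.cocodataset.org/val2017/{image_id:012d}.jpg",
            "placeholder",
        ),
    )

# ORDER BY makes the row order explicit instead of relying on scan order.
ordered_ids = [
    row[0] for row in conn_demo.execute("SELECT id FROM captions ORDER BY id")
]
conn_demo.close()
print(ordered_ids)
```

The same ORDER BY clause can be added to the query passed to pd.read_sql if a fixed row order matters.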
