## Multimodal RAG Example using Qwen3-VL Models

This example demonstrates a complete multimodal RAG pipeline that can:

- Process PDF documents into images
- Create embeddings for both text queries and document images, then use them to search for and retrieve relevant pages
- Rerank results for better accuracy
- Generate answers using Qwen3-VL

**Note:** These models are large, so make sure you have sufficient GPU memory. The embedding and reranker models also need the wrapper scripts from the `scripts` folder of their model repositories, which are downloaded below. If your GPU supports it, you can install FlashAttention, and you can quantize the models with bitsandbytes to reduce memory use.
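
The quantization mentioned in the note can be sketched as follows for the generator model, once the packages in the next cell are installed. This is a minimal sketch rather than the setup used in this notebook: it assumes the `BitsAndBytesConfig` route in `transformers` works for `Qwen/Qwen3-VL-2B-Instruct` on your GPU, and the helper name `load_quantized_vlm` and the 4-bit NF4 settings are illustrative choices. Whether the `Qwen3VLEmbedder` and `Qwen3VLReranker` wrappers accept the same argument is not shown in this example, so the sketch covers only the generator.

```python
def load_quantized_vlm(model_id="Qwen/Qwen3-VL-2B-Instruct"):
    """Load the generator in 4-bit precision (hypothetical helper, illustrative settings)."""
    # Imported lazily so the helper can be defined before the packages are installed.
    import torch
    from transformers import BitsAndBytesConfig, Qwen3VLForConditionalGeneration

    # Assumed settings: 4-bit NF4 weights with bfloat16 compute.
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    return Qwen3VLForConditionalGeneration.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",
    )
```

Calling `load_quantized_vlm()` would stand in for the plain `from_pretrained` call used later in the generation step.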

In [None]:
!pip install -q "pdf2image>=1.16.0" qwen-vl-utils flash-attn bitsandbytes


In [None]:
!sudo apt-get install -y poppler-utils


Download the scripts required to wrap the Qwen3-VL models.

In [None]:
!wget https://huggingface.co/Qwen/Qwen3-VL-Embedding-8B/resolve/main/scripts/qwen3_vl_embedding.py
!wget https://huggingface.co/Qwen/Qwen3-VL-Reranker-8B/resolve/main/scripts/qwen3_vl_reranker.py
!mkdir -p scripts
!mv qwen3_vl_embedding.py scripts/
!mv qwen3_vl_reranker.py scripts/


Some utilities to download a PDF and convert its pages to images.

In [None]:
from pdf2image import convert_from_path
import requests

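def download_pdf(url, save_path="document.pdf"):
    """Download a PDF from `url` and write it to `save_path`."""
    response = requests.get(url, timeout=60)
    response.raise_for_status()  # Fail early instead of saving an error page as a PDF
    with open(save_path, "wb") as f:
        f.write(response.content)
    print(f"PDF saved to {save_path}")
    return save_path

def pdf_to_images(pdf_path):
    """Render each page of the PDF as a PIL image (requires poppler-utils)."""
    return convert_from_path(pdf_path)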

We'll use a PDF about climate change and write queries related to it.

In [None]:
pdf_url = "https://climate.ec.europa.eu/system/files/2018-06/youth_magazine_en.pdf"
pdf_path = download_pdf(pdf_url)
document_images = pdf_to_images(pdf_path)

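# Keep a six-page subset (zero-based indices 4-9).
# Page numbers printed later are relative to this subset, not to the full PDF.
document_images = document_images[4:10]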

queries = [
    {"text": "How much did the world temperature change so far?"},
    {"text": "What are the main causes of climate change?"},
]

PDF saved to document.pdf
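
Converting every page and slicing afterwards, as above, renders pages that are then discarded. A possible alternative, sketched below, asks `pdf2image` for only the needed range through the `first_page` and `last_page` arguments of `convert_from_path`, which are 1-based and inclusive. The helper names `slice_to_page_range` and `pdf_slice_to_images` are hypothetical and not part of this notebook.

```python
def slice_to_page_range(start, stop):
    """Map a zero-based, end-exclusive slice to a 1-based, inclusive page range."""
    if not 0 <= start < stop:
        raise ValueError("expected 0 <= start < stop")
    return start + 1, stop


def pdf_slice_to_images(pdf_path, start, stop, dpi=200):
    """Render only pages [start:stop] of the PDF (hypothetical helper)."""
    # Imported lazily: pdf2image needs poppler-utils installed on the system.
    from pdf2image import convert_from_path

    first_page, last_page = slice_to_page_range(start, stop)
    return convert_from_path(
        pdf_path, dpi=dpi, first_page=first_page, last_page=last_page
    )


# The slice document_images[4:10] corresponds to PDF pages 5 through 10.
print(slice_to_page_range(4, 10))  # (5, 10)
```

With this helper, `pdf_slice_to_images(pdf_path, 4, 10)` should yield the same six pages as the slice above.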


We can now embed the documents and queries and get similarity scores.

In [None]:
import numpy as np
import torch
from io import BytesIO
from PIL import Image

from scripts.qwen3_vl_embedding import Qwen3VLEmbedder
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

embedder = Qwen3VLEmbedder("Qwen/Qwen3-VL-Embedding-2B")

# Save each page to disk, since the image inputs are passed as file paths
document_inputs = []
for idx, img in enumerate(document_images):
    img_path = f"temp_page_{idx}.png"
    img.save(img_path)
    document_inputs.append({"image": img_path})

document_embeddings = embedder.process(document_inputs)
query_embeddings = embedder.process(queries)
print(f"Query embeddings shape: {query_embeddings.shape}")

Query embeddings shape: torch.Size([2, 2048])


Next, we retrieve the top-k pages for each query by similarity score.

In [None]:
def retrieve_top_k(query_embedding, document_embeddings, k=3):
    """Retrieve the top-k most similar documents for a query."""
    # Convert to float32 NumPy arrays; NumPy cannot represent bfloat16 tensors
    if torch.is_tensor(query_embedding):
        query_embedding = query_embedding.float().cpu().numpy()
    if torch.is_tensor(document_embeddings):
        document_embeddings = document_embeddings.float().cpu().numpy()
    # Dot product of the query with every document embedding
    similarity_scores = query_embedding @ document_embeddings.T
    # argsort is ascending: take the last k indices and reverse them
    top_k_indices = np.argsort(similarity_scores)[-k:][::-1]
    top_k_scores = similarity_scores[top_k_indices]
    return top_k_indices, top_k_scores


for query_idx, query in enumerate(queries):
    print(f"\nQuery {query_idx + 1}: {query['text']}")
    print("-" * 60)

    top_indices, top_scores = retrieve_top_k(
        query_embeddings[query_idx],
        document_embeddings,
        k=3
    )

    print("Top 3 pages (by similarity):")
    for rank, (page_idx, score) in enumerate(zip(top_indices, top_scores), 1):
        print(f"  {rank}. Page {page_idx + 1} (score: {score:.4f})")


Query 1: How much did the world temperature change so far?
------------------------------------------------------------
Top 3 pages (by similarity):
  1. Page 1 (score: 0.4605)
  2. Page 5 (score: 0.4408)
  3. Page 2 (score: 0.4360)

Query 2: What are the main causes of climate change?
------------------------------------------------------------
Top 3 pages (by similarity):
  1. Page 1 (score: 0.4903)
  2. Page 2 (score: 0.4770)
  3. Page 5 (score: 0.4555)
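
The dot product in `retrieve_top_k` equals cosine similarity only when the embeddings are L2-normalized. Whether `Qwen3VLEmbedder.process` returns normalized vectors is an assumption here, not something this example verifies. The sketch below uses small made-up vectors and a hypothetical `cosine_top_k` helper that normalizes explicitly, so the ranking does not depend on that assumption.

```python
import numpy as np


def cosine_top_k(query, documents, k=3):
    """Return indices and cosine scores of the k most similar rows of `documents`."""
    query = np.asarray(query, dtype=np.float64)
    documents = np.asarray(documents, dtype=np.float64)
    # Normalize so the dot product below is a cosine similarity.
    query = query / np.linalg.norm(query)
    documents = documents / np.linalg.norm(documents, axis=1, keepdims=True)
    scores = documents @ query
    # argsort is ascending, so take the last k entries and reverse them.
    top = np.argsort(scores)[-k:][::-1]
    return top, scores[top]


# Toy data (not real embeddings): three "pages" and one "query".
docs = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
indices, scores = cosine_top_k([1.0, 0.0], docs, k=2)
print(indices.tolist())  # [0, 2]
```

The second document has the largest raw dot product potential because of its length, but after normalization only direction matters, which is why the first and third rows rank highest.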


We can delete the embedder to free memory.

In [None]:
import gc

del embedder
gc.collect()
torch.cuda.empty_cache()  # Release cached GPU memory held by PyTorch

Let's rerank the retrieved pages for the first query.

In [None]:
from scripts.qwen3_vl_reranker import Qwen3VLReranker

reranker = Qwen3VLReranker("Qwen/Qwen3-VL-Reranker-2B")

# Enable FlashAttention-2 if your GPU supports it
# reranker = Qwen3VLReranker(
#     model_name_or_path="Qwen/Qwen3-VL-Reranker-2B",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2"
# )

query_for_reranking = queries[0]
top_indices, _ = retrieve_top_k(query_embeddings[0], document_embeddings, k=3)

print(f"\nReranking results for: {query_for_reranking['text']}")

reranker_inputs = {
    "instruction": "Retrieve pages relevant to the user's query about climate change.",
    "query": query_for_reranking,
    "documents": [{"image": f"temp_page_{idx}.png"} for idx in top_indices],
    "fps": 1.0
}

reranker_scores = reranker.process(reranker_inputs)

# Pages are listed in embedding-retrieval order; the score shown is the reranker's
for rank, (page_idx, score) in enumerate(zip(top_indices, reranker_scores), 1):
    print(f"  {rank}. Page {page_idx + 1} (reranker score: {score:.4f})")

best_page_idx = top_indices[np.argmax(reranker_scores)]
print(f"\nBest page after reranking: Page {best_page_idx + 1}")


Reranking results for: How much did the world temperature change so far?




  1. Page 1 (reranker score: 0.5188)
  2. Page 5 (reranker score: 0.4743)
  3. Page 2 (reranker score: 0.6626)

Best page after reranking: Page 2
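
The loop above prints pages in embedding-retrieval order, so the printed rank does not reflect the reranker's ordering. One way to list candidates by reranker score is sketched below. It assumes `reranker.process` returns one float per document in input order, as the printed output suggests, and it reuses the indices and scores shown above as literal values. `rerank_order` is a hypothetical helper.

```python
def rerank_order(page_indices, scores):
    """Pair each page index with its score and sort by score, highest first."""
    pairs = zip((int(i) for i in page_indices), (float(s) for s in scores))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Zero-based indices and scores copied from the output above.
top_indices = [0, 4, 1]
reranker_scores = [0.5188, 0.4743, 0.6626]

for rank, (page_idx, score) in enumerate(rerank_order(top_indices, reranker_scores), 1):
    print(f"  {rank}. Page {page_idx + 1} (reranker score: {score:.4f})")
```

The first entry of the sorted list is the same page that `np.argmax(reranker_scores)` selects.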


We are done with the reranker, so we can delete it to free memory.

In [None]:
import gc

del reranker
gc.collect()
torch.cuda.empty_cache()  # Release cached GPU memory held by PyTorch

We will now initialize the Qwen3-VL model and pass it the best-ranked page together with our text prompt to generate the answer.

In [None]:
vlm_model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-2B-Instruct",
    dtype="auto",
    device_map="auto"  # Handles device placement; no extra .to("cuda") is needed
)
# Alternatively, enable FlashAttention-2 if your GPU supports it
# vlm_model = Qwen3VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen3-VL-2B-Instruct",
#     dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-2B-Instruct")

print(f"Generating answer for: {query_for_reranking['text']}")
print(f"Using retrieved page: {best_page_idx + 1}")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": f"temp_page_{best_page_idx}.png",
            },
            {
                "type": "text",
                "text": f"Based on this page from a climate change document, please answer the following question: {query_for_reranking['text']}"
            },
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(vlm_model.device)

generated_ids = vlm_model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens so that only the newly generated answer is decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)

print(f"\nQuery: {query_for_reranking['text']}")
print(f"\nAnswer:\n{output_text[0]}")


Generating answer for: How much did the world temperature change so far?
Using retrieved page: 2

Query: How much did the world temperature change so far?

Answer:
Based on the information provided in the document, the world temperature has changed by approximately 1.1°C since the late 19th century.
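
The slicing step `out_ids[len(in_ids):]` in the cell above relies on `generate` returning the prompt tokens followed by the newly generated ones, so dropping the first `len(in_ids)` entries leaves only the answer. The toy example below illustrates this with plain lists and made-up token IDs; `trim_prompt` is a hypothetical helper, not part of the notebook.

```python
def trim_prompt(input_ids, generated_ids):
    """Drop each sequence's prompt tokens from the front of its generated output."""
    return [
        out_ids[len(in_ids):]
        for in_ids, out_ids in zip(input_ids, generated_ids)
    ]


# Made-up token IDs: each output starts with a copy of its prompt.
prompts = [[101, 7, 8], [101, 9]]
outputs = [[101, 7, 8, 42, 43], [101, 9, 55]]
print(trim_prompt(prompts, outputs))  # [[42, 43], [55]]
```

Without this step, the decoded text would repeat the chat template and the question before the answer.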


The answer is correct!