# Multimodal Retrieval-Augmented Generation (RAG) with Document Retrieval (ColPali) and Vision Language Models (VLMs)

In this example, we will build a **Multimodal Retrieval-Augmented Generation (RAG)** system by combining the [**ColPali**](https://huggingface.co/blog/manu/colpali) retriever for document retrieval with the [**Qwen2-VL**](https://qwenlm.github.io/blog/qwen2-vl/) Vision Language Model (VLM). This RAG system is capable of enhancing query responses with both text-based documents and visual data.

Instead of relying on a complex document-processing pipeline that extracts data through OCR, we will use a Document Retrieval Model to efficiently retrieve the relevant documents for a specific user query.

## Setup

In [None]:
!pip install -U -q byaldi pdf2image qwen-vl-utils transformers
# Tested with byaldi==0.0.4, pdf2image==1.17.0, qwen-vl-utils==0.0.8, transformers==4.45.0

We also need to install `poppler-utils`, which provides the command-line tools that `pdf2image` relies on to render PDF pages as images.

In [None]:
!sudo apt-get install -y poppler-utils

## Load dataset

In this section, we will utilize IKEA assembly instructions as our dataset. These PDFs contain step-by-step guidance for assembling various furniture pieces.

In [None]:
import requests
import os

pdfs = {
    "MALM": "https://www.ikea.com/us/en/assembly_instructions/malm-4-drawer-chest-white__AA-2398381-2-100.pdf",
    "BILLY": "https://www.ikea.com/us/en/assembly_instructions/billy-bookcase-white__AA-1844854-6-2.pdf",
    "BOAXEL": "https://www.ikea.com/us/en/assembly_instructions/boaxel-wall-upright-white__AA-2341341-2-100.pdf",
    "ADILS": "https://www.ikea.com/us/en/assembly_instructions/adils-leg-white__AA-844478-6-2.pdf",
    "MICKE": "https://www.ikea.com/us/en/assembly_instructions/micke-desk-white__AA-476626-10-100.pdf",
}

output_dir = 'data'
os.makedirs(output_dir, exist_ok=True)

for name, url in pdfs.items():
    response = requests.get(url, timeout=60)
    response.raise_for_status()  # fail early instead of saving an error page as a PDF
    pdf_path = os.path.join(output_dir, f'{name}.pdf')

    with open(pdf_path, 'wb') as f:
        f.write(response.content)

    print(f"Downloaded {name} to {pdf_path}")

print('Downloaded files:', os.listdir(output_dir))
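
Remote URLs can change over time, so it may be worth checking that each downloaded file really is a PDF before moving on. The sketch below is one possible sanity check, not part of the original pipeline: it relies on the fact that PDF files begin with the bytes `%PDF-`, and the helper names `looks_like_pdf` and `find_invalid_pdfs` are hypothetical.

```python
from pathlib import Path


def looks_like_pdf(path):
    """Return True if the file starts with the PDF magic bytes `%PDF-`."""
    with open(path, 'rb') as f:
        return f.read(5) == b'%PDF-'


def find_invalid_pdfs(folder):
    """Return the sorted names of `.pdf` files in `folder` that fail the header check."""
    return sorted(
        p.name for p in Path(folder).glob('*.pdf') if not looks_like_pdf(p)
    )
```

Calling `find_invalid_pdfs('data')` after the download cell should return an empty list if every file was fetched correctly.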

After downloading the assembly instructions, we will convert the PDFs into images. ColPali works on page images rather than extracted text, and we will later pass the retrieved page images to the VLM.

In [None]:
import os
from pdf2image import convert_from_path


def convert_pdfs_to_images(pdf_folder):
    pdf_files = [f for f in os.listdir(pdf_folder) if f.endswith('.pdf')]
    all_images = {}

    for doc_id, pdf_file in enumerate(pdf_files):
        pdf_path = os.path.join(pdf_folder, pdf_file)
        images = convert_from_path(pdf_path)
        all_images[doc_id] = images

    return all_images


# Use the same relative folder the PDFs were downloaded to
all_images = convert_pdfs_to_images('data/')

We can visualize a sample assembly guide to see how these instructions are presented. This will help us understand the format and layout of the content.

In [None]:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 8, figsize=(15, 10))

for i, ax in enumerate(axes.flat):
    img = all_images[0][i]
    ax.imshow(img)
    ax.axis('off')

plt.tight_layout()
plt.show()

## Initialize the ColPali multimodal document retrieval model

Now that our dataset is ready, we will initialize the Document Retrieval Model, which will be responsible for extracting relevant information from the raw images and providing us with the appropriate documents based on our queries.

For this task, we will use [`Byaldi`](https://github.com/AnswerDotAI/byaldi), a library that provides a simple wrapper around the ColPali repository, making it easy to use late-interaction multimodal models such as ColPali with a familiar API.
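
To give an intuition for what "late interaction" means, here is a minimal sketch of the scoring idea in plain Python. It is not Byaldi's or ColPali's actual implementation: the 2-dimensional embeddings are made-up toy values, and the function names are hypothetical. Each query token is matched against its most similar page patch, and those per-token maxima are summed into one score per page.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def late_interaction_score(query_embeddings, page_embeddings):
    """Sum, over query tokens, of the best dot product against any page patch."""
    return sum(
        max(dot(q, p) for p in page_embeddings)
        for q in query_embeddings
    )


# Toy embeddings (assumed values, for illustration only)
query = [[1.0, 0.0], [0.0, 1.0]]   # two query tokens
page_a = [[0.9, 0.1], [0.2, 0.8]]  # two patches from page A
page_b = [[0.5, 0.5], [0.4, 0.3]]  # two patches from page B

scores = {
    'page_a': late_interaction_score(query, page_a),
    'page_b': late_interaction_score(query, page_b),
}
best_page = max(scores, key=scores.get)
print(scores, best_page)
```

Because every query token is scored against every patch, a page can rank highly when different regions of it match different parts of the query.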

Top-performing retrievers can be found on the [ViDoRe (Visual Document Retrieval Benchmark)](https://huggingface.co/spaces/vidore/vidore-leaderboard) leaderboard.

In [None]:
from byaldi import RAGMultiModalModel

docs_retrieval_model = RAGMultiModalModel.from_pretrained('vidore/colpali-v1.2')

Next, we can directly index our documents using the document retrieval model by specifying the folder where the PDFs are stored. This will allow the model to process and organize the documents for efficient retrieval based on our queries.

In [None]:
docs_retrieval_model.index(
    input_path='data/',
    index_name='image_index',
    store_collection_with_index=False,
    overwrite=True
)

## Retrieving documents with the document retrieval model

In [None]:
text_query = "How many people are needed to assemble the Malm?"

results = docs_retrieval_model.search(text_query, k=3)
results

In [None]:
def get_grouped_images(results, all_images):
    grouped_images = []

    for result in results:
        doc_id = result['doc_id']
        page_num = result['page_num']
        grouped_images.append(
            all_images[doc_id][page_num - 1]
        ) # page_num is 1-indexed, while doc_id is 0-indexed

    return grouped_images
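
The index arithmetic above is easy to get wrong, so here is a small self-contained illustration of the same lookup. The plain dictionaries are hypothetical stand-ins for the retriever's result objects, the strings stand in for page images, and the scores are invented. `pages_for` is an illustrative helper that mirrors the lookup logic.

```python
# Stand-in for all_images: doc_id -> list of pages (0-indexed list positions)
toy_images = {
    0: ['doc0-page1', 'doc0-page2'],
    1: ['doc1-page1', 'doc1-page2', 'doc1-page3'],
}

# Stand-in for search results: page_num counts from 1
toy_results = [
    {'doc_id': 1, 'page_num': 3, 'score': 0.9},
    {'doc_id': 0, 'page_num': 1, 'score': 0.7},
]


def pages_for(results, images_by_doc):
    """Map each result to its page, converting the 1-indexed page_num to a list index."""
    return [images_by_doc[r['doc_id']][r['page_num'] - 1] for r in results]


toy_pages = pages_for(toy_results, toy_images)
print(toy_pages)
```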

In [None]:
grouped_images = get_grouped_images(results, all_images)

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(15, 10))

for i, ax in enumerate(axes.flat):
    img = grouped_images[i]
    ax.imshow(img)
    ax.axis('off')

plt.tight_layout()
plt.show()

## Initialize the vision language model for question answering

In [None]:
from transformers import Qwen2VLForConditionalGeneration, Qwen2VLProcessor
from qwen_vl_utils import process_vision_info
import torch

vl_model = Qwen2VLForConditionalGeneration.from_pretrained(
    'Qwen/Qwen2-VL-7B-Instruct',
    torch_dtype=torch.bfloat16,
)
vl_model.cuda().eval()

In [None]:
min_pixels = 224 * 224
max_pixels = 1024 * 1024

vl_model_processor = Qwen2VLProcessor.from_pretrained(
    'Qwen/Qwen2-VL-7B-Instruct',
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)
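
The pixel bounds control how many visual tokens each page can consume. As a rough guide, and assuming the relationship described in the Qwen2-VL model card where each 28x28 pixel block corresponds to about one visual token, the bounds above translate into the following per-image token budget. The helper name is hypothetical, and the real preprocessing also rounds image dimensions, so treat these numbers as estimates.

```python
# Assumption: roughly one visual token per 28x28 pixel block
PIXELS_PER_VISUAL_TOKEN = 28 * 28


def approx_visual_tokens(pixels):
    """Estimate the number of visual tokens for an image with `pixels` total pixels."""
    return pixels // PIXELS_PER_VISUAL_TOKEN


min_tokens = approx_visual_tokens(224 * 224)
max_tokens = approx_visual_tokens(1024 * 1024)
print(f'about {min_tokens} to {max_tokens} visual tokens per image')
```

Lowering `max_pixels` reduces memory use when several pages are passed at once, at the cost of fine detail in the rendered instructions.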

## Assembling the VLM and testing the system

We will create the chat structure by providing the system with the three retrieved images along with the user query.

In [None]:
chat_template = [
    {
        'role': 'user',
        'content': [
            {
                'type': 'image',
                'image': grouped_images[0],
            },
            {
                'type': 'image',
                'image': grouped_images[1],
            },
            {
                'type': 'image',
                'image': grouped_images[2],
            },
            {
                'type': 'text',
                'text': text_query,
            }
        ]
    }
]

Now we can apply the chat template with the processor to build the text prompt:

In [None]:
text = vl_model_processor.apply_chat_template(
    chat_template,
    tokenize=False,
    add_generation_prompt=True
)

In [None]:
text

Next, we will process the inputs to ensure that they are properly formatted and ready to be used as input for the VLM.

In [None]:
image_inputs, _ = process_vision_info(chat_template)
inputs = vl_model_processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors='pt'
)
inputs = inputs.to('cuda')

Now we are ready to generate the answer.

In [None]:
generated_ids = vl_model.generate(**inputs, max_new_tokens=500)

# Post-process: drop the prompt tokens so only the newly generated tokens remain
generated_ids_trimmed = [
    out_ids[len(in_ids) :]
    for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
# Decode the generated tokens into text
output_text = vl_model_processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)
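
The trimming step is needed because `generate` returns the prompt tokens followed by the new tokens. A toy illustration with plain lists, where the token ids are made up for the example:

```python
# Made-up token ids: each output starts with a copy of its prompt
prompt_ids = [[101, 7, 8, 9], [101, 5, 6, 4]]
generated = [[101, 7, 8, 9, 42, 43], [101, 5, 6, 4, 77, 78]]

# Slice off the first len(prompt) tokens of each output
trimmed = [out[len(inp):] for inp, out in zip(prompt_ids, generated)]
print(trimmed)
```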

In [None]:
print(output_text[0])

## Assembling everything

We will create a function that wraps the entire pipeline (retrieval, prompt construction, and generation) so that we can easily reuse it in future applications.

In [None]:
def answer_with_multimodal_rag(
        vl_model,
        vl_model_processor,
        docs_retrieval_model,
        text_query,
        all_images,
        top_k,
        max_new_tokens
):
    # Retrieve documents
    results = docs_retrieval_model.search(text_query, k=top_k)
    # Get the retrieved images
    grouped_images = get_grouped_images(results, all_images)

    # Construct chat template
    chat_template = [
        {
            'role': 'user',
            'content': [
                {'type': 'image', 'image': image} for image in grouped_images
            ] + [
                {'type': 'text', 'text': text_query}
            ]
        }
    ]

    # Prepare the inputs
    text = vl_model_processor.apply_chat_template(
        chat_template,
        tokenize=False,
        add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(chat_template)
    inputs = vl_model_processor(
        text=[text],
        images=image_inputs,
        padding=True,
        return_tensors='pt'
    )
    inputs = inputs.to('cuda')

    # Generate text from the VLM
    generated_ids = vl_model.generate(**inputs, max_new_tokens=max_new_tokens)
    generated_ids_trimmed = [
        out_ids[len(in_ids) :]
        for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]

    # Decode the generated text
    output_text = vl_model_processor.batch_decode(
        generated_ids_trimmed,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )

    return output_text

Now we can apply the complete multimodal RAG system:

In [None]:
text_query = 'How do I assemble the Micke desk?'

output_text = answer_with_multimodal_rag(
    vl_model=vl_model,
    vl_model_processor=vl_model_processor,
    docs_retrieval_model=docs_retrieval_model,
    text_query=text_query,
    all_images=all_images,
    top_k=3,
    max_new_tokens=500
)
print(output_text[0])