# Smol Multimodal RAG: Building with ColSmolVLM and SmolVLM on a Consumer GPU

In this example, we will build a **Multimodal Retrieval-Augmented Generation (RAG)** system by integrating [`ColSmolVLM`](https://huggingface.co/vidore/colsmolvlm-v0.1) for document retrieval and [`SmolVLM`](https://huggingface.co/blog/smolvlm) as the vision language model (VLM).

## Setup

In [None]:
!pip install -q git+https://github.com/sergiopaniego/byaldi.git@colsmolvlm-support

## Load dataset

In this example, we will use charts and maps from [Our World in Data](https://ourworldindata.org/), an open-access platform offering a wealth of data and visualizations. We focus on the life expectancy data and load it from a [curated subset](https://huggingface.co/datasets/sergiopaniego/ourworldindata_example) hosted on Hugging Face.

In [None]:
from datasets import load_dataset

dataset = load_dataset("sergiopaniego/ourworldindata_example", split="train")

After downloading the visual data, we will save the images locally as PNG files. The document retrieval model (ColSmolVLM) builds its index from a folder on disk, so the images need to be available there before indexing.

In [None]:
import os
from PIL import Image

def save_images_to_local(dataset, output_folder='data/'):
    os.makedirs(output_folder, exist_ok=True)

    for image_id, image_data in enumerate(dataset):
        image = image_data['image']

        if isinstance(image, str):
            image = Image.open(image)

        output_path = os.path.join(output_folder, f"image_{image_id}.png")
        image.save(output_path, format='PNG')
        print(f"Image saved at: {output_path}")

save_images_to_local(dataset)

Next, we will load the images to explore the dataset.

In [None]:
import os
from PIL import Image

def load_png_images(image_folder):
    png_files = [f for f in os.listdir(image_folder) if f.endswith('.png')]
    # Use a dict keyed by image index; assigning by index into an empty list raises IndexError
    all_images = {}

    for image_id, png_file in enumerate(png_files):
        image_path = os.path.join(image_folder, png_file)
        image = Image.open(image_path)
        all_images[image_id] = image

    return all_images

# Same folder that save_images_to_local() wrote to (on Colab this resolves to /content/data/)
all_images = load_png_images('data/')

In [None]:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 5, figsize=(20, 15))

for i, ax in enumerate(axes.flat):
    img = all_images[i]
    ax.imshow(img)
    ax.axis('off')

plt.tight_layout()
plt.show()

## Initialize the ColSmolVLM multimodal document retrieval model

The **Document Retrieval Model** will extract relevant information from the raw images and return the appropriate documents based on our queries.

For this task, we will use the `Byaldi` library, which is designed to streamline multimodal RAG pipelines. Byaldi provides APIs that integrate multimodal retrievers and vision language models for efficient retrieval-augmented generation workflows. Here we focus specifically on **ColSmolVLM**.
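
To build some intuition for what the retriever does with a query: ColSmolVLM follows the ColBERT-style late-interaction approach, where the query and each page are represented by many vectors, and a page's score sums, for every query vector, its best match among the page vectors. The toy sketch below illustrates that scoring rule in plain Python with made-up 2-dimensional vectors. `maxsim_score` and `rank_pages` are hypothetical helper names, not part of Byaldi, and the real model works on learned high-dimensional embeddings.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product between two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vectors: Sequence[Vector], page_vectors: Sequence[Vector]) -> float:
    """Late-interaction score: for each query vector take its best-matching
    page vector, then sum those maxima over the whole query."""
    return sum(max(dot(q, p) for p in page_vectors) for q in query_vectors)


def rank_pages(
    query_vectors: Sequence[Vector],
    pages: Dict[int, Sequence[Vector]],
    k: int = 1,
) -> List[Tuple[int, float]]:
    """Return the k highest-scoring (doc_id, score) pairs."""
    scored = [(doc_id, maxsim_score(query_vectors, vecs)) for doc_id, vecs in pages.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Made-up embeddings, purely for illustration
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    0: [[0.9, 0.1], [0.2, 0.8]],  # has a good match for each query vector
    1: [[0.5, 0.5], [0.4, 0.4]],  # only mediocre matches
}

print(rank_pages(query, pages, k=1))
```

In this toy setup page `0` ranks first because each query vector finds a close match among its vectors.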

In [None]:
from byaldi import RAGMultiModalModel

docs_retrieval_model = RAGMultiModalModel.from_pretrained('vidore/colsmolvlm-alpha')

Next, we will index our documents using the document retrieval model by specifying the folder where the images are stored.

In [None]:
docs_retrieval_model.index(
    input_path='data/',
    index_name='image_index',
    store_collection_with_index=False,
    overwrite=True
)

## Retrieve documents with the document retrieval model

In [None]:
text_query = "What is the overall trend in life expectancy across different countries and regions?"

results = docs_retrieval_model.search(text_query, k=1)
results

We can take a look at the retrieved document and check whether the model has correctly matched our query with the best possible results.

In [None]:
result_image = all_images[results[0]['doc_id']]
result_image
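
If we ask for more than one hit (`k > 1`), we need to look up an image for every result rather than only the first. The sketch below shows one way to do that. `images_for_results` is a hypothetical helper, and the example data are plain dicts and strings standing in for real search results and PIL images; it only assumes each result can be indexed with `'doc_id'` and `'score'`, as in the cell above.

```python
def images_for_results(results, all_images):
    """Pair each search hit with its image, skipping hits without one."""
    matched = []
    for result in results:
        doc_id = result['doc_id']
        try:
            image = all_images[doc_id]
        except (KeyError, IndexError):
            continue  # no image was loaded for this doc_id
        matched.append((doc_id, result['score'], image))
    return matched


# Hypothetical stand-ins for search results and loaded images
example_results = [
    {'doc_id': 2, 'score': 17.1},
    {'doc_id': 0, 'score': 15.4},
    {'doc_id': 9, 'score': 12.0},  # no matching image, so it is skipped
]
example_images = {0: 'image_0.png', 1: 'image_1.png', 2: 'image_2.png'}

for doc_id, score, image in images_for_results(example_results, example_images):
    print(doc_id, score, image)
```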

## Initialize the vision language model for question answering

In [None]:
from transformers import AutoProcessor, Idefics3ForConditionalGeneration
import torch

model_id = 'HuggingFaceTB/SmolVLM-Instruct'
vl_model_processor = AutoProcessor.from_pretrained(model_id)
vl_model = Idefics3ForConditionalGeneration.from_pretrained(
    model_id,
    device_map='auto',
    torch_dtype=torch.bfloat16,
    _attn_implementation='eager'
)
vl_model.eval()

## Assemble the VLM and test the system

With all components loaded, we are ready to assemble the system for testing.

In [None]:
chat_template = [
    {
        'role': 'user',
        'content': [
            {
                'type': 'image'
            },
            {
                'type': 'text',
                'text': text_query
            }
        ]
    }
]
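
The message structure above holds one image placeholder followed by the question. If we later want to pass several retrieved images, the same structure can be generated programmatically. This is a minimal sketch: `build_chat_messages` is a hypothetical helper, and it assumes the processor's chat template accepts one `{'type': 'image'}` entry per image, in the same order the images are passed to the processor.

```python
def build_chat_messages(text_query, num_images=1):
    """Build a single-turn user message with `num_images` image placeholders
    followed by the text question."""
    if num_images < 1:
        raise ValueError("num_images must be at least 1")
    content = [{'type': 'image'} for _ in range(num_images)]
    content.append({'type': 'text', 'text': text_query})
    return [{'role': 'user', 'content': content}]


messages = build_chat_messages("What does this chart show?", num_images=2)
print(messages)
```

With `num_images=1` the helper returns the same structure as the hand-written list above.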

We will now apply the processor's chat template to turn this message list into the prompt string the model expects, including the placeholder for the image.

In [None]:
text = vl_model_processor.apply_chat_template(chat_template, add_generation_prompt=True)
text

Next, we will process the inputs to ensure they are properly formatted and ready for use with the VLM.

In [None]:
inputs = vl_model_processor(
    text=text,
    images=[result_image],
    return_tensors='pt'
)
inputs = inputs.to('cuda')

In [None]:
generated_ids = vl_model.generate(**inputs, max_new_tokens=500)

In [None]:
generated_ids_trimmed = [
    out_ids[len(in_ids) :]
    for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]

output_text = vl_model_processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)
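
The slicing step above is needed because `generate` returns the prompt tokens followed by the newly generated ones; dropping the first `len(in_ids)` tokens of each sequence leaves only the answer. The same logic on plain Python lists, with made-up token IDs and a hypothetical helper name `trim_prompt_tokens`, looks like this:

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Remove the prompt prefix from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Made-up token IDs: the first four tokens are the prompt
prompt_ids = [[101, 7, 8, 9]]
full_output_ids = [[101, 7, 8, 9, 42, 43, 2]]

print(trim_prompt_tokens(prompt_ids, full_output_ids))  # [[42, 43, 2]]
```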

In [None]:
print(output_text[0])

In [None]:
print(f"GPU allocated memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
print(f"GPU reserved memory: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")
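
Since we repeat this memory check several times in the notebook, it can be wrapped in a small helper. The sketch below is one possible version: `format_memory` and `gpu_memory_report` are hypothetical names, the values are labeled GiB because they are divided by 1024³, and the report also includes the peak allocation via `torch.cuda.max_memory_allocated()`. It returns an empty list when PyTorch or a CUDA device is not available.

```python
def bytes_to_gib(num_bytes):
    """Convert a byte count to GiB (1 GiB = 1024**3 bytes)."""
    return num_bytes / 1024**3


def format_memory(label, num_bytes):
    """Format one line of the memory report."""
    return f"{label}: {bytes_to_gib(num_bytes):.2f} GiB"


def gpu_memory_report():
    """Return formatted memory lines, or an empty list without torch/CUDA."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [
        format_memory("GPU allocated memory", torch.cuda.memory_allocated()),
        format_memory("GPU reserved memory", torch.cuda.memory_reserved()),
        format_memory("GPU peak allocated memory", torch.cuda.max_memory_allocated()),
    ]


for line in gpu_memory_report():
    print(line)
```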

## Assemble everything

We will create a function to encompass the entire pipeline, allowing us to easily reuse it in future applications.

In [None]:
def answer_with_multimodal_rag(
        vl_model,
        vl_model_processor,
        docs_retrieval_model,
        all_images,
        text_query,
        retrival_top_k,
        max_new_tokens,
):
    results = docs_retrieval_model.search(text_query, k=retrival_top_k)
    result_image = all_images[results[0]['doc_id']]

    chat_template = [
        {
            'role': 'user',
            'content': [
                {
                    'type': 'image'
                },
                {
                    'type': 'text',
                    'text': text_query
                }
            ]
        }
    ]

    # Prepare the inputs
    text = vl_model_processor.apply_chat_template(chat_template, add_generation_prompt=True)
    inputs = vl_model_processor(
        text=text,
        images=[result_image],
        return_tensors='pt'
    )
    inputs = inputs.to('cuda')

    # Generate text from the vl_model
    generated_ids = vl_model.generate(**inputs, max_new_tokens=max_new_tokens)
    generated_ids_trimmed = [
        out_ids[len(in_ids) :]
        for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]

    # Decode the generated text
    output_text = vl_model_processor.batch_decode(
        generated_ids_trimmed,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )

    return output_text

This is the complete RAG system:

In [None]:
text_query = "What is the overall trend in life expectancy across different countries and regions?"

output_text = answer_with_multimodal_rag(
    vl_model=vl_model,
    vl_model_processor=vl_model_processor,
    docs_retrieval_model=docs_retrieval_model,
    all_images=all_images,
    text_query=text_query,
    retrival_top_k=1,
    max_new_tokens=500
)

print(output_text[0])

In [None]:
print(f"GPU allocated memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
print(f"GPU reserved memory: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")

## Go even Smoler

We could use a quantized version of the **SmolVLM** model to further reduce the system's resource requirements.

In [None]:
!pip install -q -U bitsandbytes

In [None]:
from transformers import BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16
)
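
To get a rough idea of what 4-bit quantization can save, we can estimate the memory taken by the weights alone as parameters × bits per parameter ÷ 8. The sketch below does this arithmetic; the parameter count of roughly 2 billion is an assumption used for illustration, and `estimate_weight_memory_gib` is a hypothetical helper. Actual usage will be higher, since the estimate ignores activations, the KV cache, quantization constants, and any layers that are kept in higher precision.

```python
def estimate_weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Assumption for illustration: roughly 2 billion parameters
ASSUMED_NUM_PARAMS = 2_000_000_000

for label, bits in [("bfloat16", 16), ("4-bit", 4)]:
    size = estimate_weight_memory_gib(ASSUMED_NUM_PARAMS, bits)
    print(f"{label}: ~{size:.2f} GiB for weights")
```

Comparing this estimate with the measured GPU memory after loading each variant gives a sense of how much of the footprint comes from the weights.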

Next, we will load the model using the quantization configuration.

In [None]:
from transformers import AutoProcessor, Idefics3ForConditionalGeneration

model_id = 'HuggingFaceTB/SmolVLM-Instruct'
vl_model_processor = AutoProcessor.from_pretrained(model_id)
vl_model = Idefics3ForConditionalGeneration.from_pretrained(
    model_id,
    device_map='auto',
    quantization_config=bnb_config,
    _attn_implementation='eager'
)

Now we can test the capabilities of our quantized model:

In [None]:
text_query = "What is the overall trend in life expectancy across different countries and regions?"

output_text = answer_with_multimodal_rag(
    vl_model=vl_model,
    vl_model_processor=vl_model_processor,
    docs_retrieval_model=docs_retrieval_model,
    all_images=all_images,
    text_query=text_query,
    retrival_top_k=1,
    max_new_tokens=500
)

print(output_text[0])

In [None]:
print(f"GPU allocated memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
print(f"GPU reserved memory: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")