# arXiv Multimodal RAG

## Introduction

In this notebook we will build a multimodal RAG system for scientific papers that leverages two models:
* Llama 4, for its vision capability, hosted on [Nscale serverless](https://www.nscale.com)
* ColPali, a retrieval model built on a vision language model (VLM) and capable of generating accurate embeddings of image data.

If you are not familiar with RAG, here's a quick overview:
1. Indexing phase:
   1. The user first uploads a document.
   2. The document's content is then split into chunks.
   3. Those chunks are fed to an embedding model that converts the text into a vector of numbers capturing the semantic meaning of the chunk.
   4. The chunk and its vector are then stored in a vector store or database.
2. Retrieval phase:
   1. The user queries the system.
   2. The query itself is converted to an embedding.
   3. A vector similarity search is run between the query vector and the vectors stored in the database.
3. Generation phase:
   1. Once the most similar chunks are retrieved, we use a large language model to generate a response based on the query and the retrieved context.
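
The three phases above can be sketched in a few lines of plain Python. This is only a toy illustration: the bag-of-words `embed` function stands in for a real embedding model, the list `index` stands in for a vector store, and all function names are hypothetical rather than part of any library used in this notebook.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=5):
    """Indexing step 2: split the text into chunks of `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def embed(text):
    """Toy embedding (assumption): word counts instead of a learned vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(document, chunk_size=5):
    """Indexing steps 3 and 4: embed every chunk and store (chunk, vector) pairs."""
    return [(chunk, embed(chunk)) for chunk in chunk_text(document, chunk_size)]


def retrieve(index, query, k=1):
    """Retrieval phase: embed the query and rank the stored chunks by similarity."""
    query_vector = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query, context_chunks):
    """Generation phase input: the retrieved context followed by the question."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}"


document = (
    "ColPali embeds page images directly. "
    "Late interaction compares every query token with every document patch. "
    "Poppler converts PDF files into images."
)
question = "How does late interaction compare tokens?"

index = build_index(document, chunk_size=5)
top_chunks = retrieve(index, question, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

In a real system the prompt would then be sent to a large language model; here it is only printed.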

While simple RAG systems can work well, they often fall short on tasks that require parsing complex layouts. Scientific papers, for example, are hard to parse because of their often intricate structure.
One solution is to use projects such as LlamaParse or unstructured.io to parse those documents using OCR, among other techniques. Such an approach can work well but adds overhead to the indexing phase.

In this notebook we will explore building a RAG system using ColPali. At a high level, ColPali is based on PaliGemma-3B, a vision language model, and is further trained to generate ColBERT-style multi-vector representations of text and image data, among other optimisations. This model can therefore be used for multimodal retrieval tasks.

If the concept is not clear yet, do not worry: we will build a quick multimodal RAG system and use it to answer questions about ColPali itself.

Now let's get started!

## Install the required libraries

In [None]:
!pip install arxiv # arXiv API
!pip install byaldi # RAG model
!pip install pdf2image # Convert pdf to images
!pip install openai # LLM

In [None]:
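# Install poppler, which pdf2image needs to convert PDF pages to images.
# Uncomment the line below on Debian/Ubuntu:
# !sudo apt-get install -y poppler-utils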

## Export the necessary variables

In [None]:
import os
nscale_api_key = os.environ["NSCALE_API_KEY"]
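
If the variable is not set, `os.environ["NSCALE_API_KEY"]` raises a bare `KeyError`. A small helper like the one below (a hypothetical name, not part of any library) gives a clearer message instead. It is a minimal sketch that assumes the key is exported in the shell before the notebook is started.

```python
import os


def require_env(name):
    """Return the value of an environment variable or fail with a readable message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running this notebook.")
    return value


# Usage in this notebook (left commented so the cell does not fail when the key is missing):
# nscale_api_key = require_env("NSCALE_API_KEY")
```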

## Retrieve the arXiv data

If you haven't heard of [arXiv](https://arxiv.org), it is an open-access repository where researchers share preprints of scientific papers before formal peer review, primarily in fields like physics, mathematics, and computer science.

For our use case we will use the [paper](https://arxiv.org/pdf/2407.01449) "ColPali: Efficient Document Retrieval with Vision Language Models" by Faysse et al., and retrieve it through the [arXiv API](https://info.arxiv.org/help/api/basics.html).

In [68]:
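# Search for the most relevant paper on ColPali and download it to our data folder.

import arxiv

search = arxiv.Search(
    query="ColPali",
    max_results=1,
    sort_by=arxiv.SortCriterion.Relevance,
)

# Search.results() is deprecated in recent versions of the arxiv package,
# so we run the search through a Client instead.
arxiv_client = arxiv.Client()
results = list(arxiv_client.results(search))
paper = results[0]

# Make sure the target folder exists before downloading.
download_dir = "data"
os.makedirs(download_dir, exist_ok=True)

pdf_path = os.path.join(download_dir, f"{paper.get_short_id()}.pdf")
paper.download_pdf(filename=pdf_path)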


'./data/2407.01449v6.pdf'

## Initialise the multimodal model

We will initialise the ColPali model using the byaldi library. [ColPali](https://huggingface.co/vidore/colpali-v1.2) generates the embeddings of the document: each page is converted into an image, the image is cut into patches, and each patch is embedded in a 128-dimensional vector space.

![ColPali_retrieval](./images/ColPali_retrieval.png)
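
Because each page is represented by many patch vectors rather than a single vector, pages are ranked with ColBERT-style late interaction: for every query vector, take its best dot product against all page vectors, then sum these maxima. The sketch below illustrates that scoring rule in plain Python with made-up 2-dimensional vectors. It is an illustration only, not byaldi's or ColPali's actual implementation, and the function names are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def late_interaction_score(query_vectors, page_vectors):
    """Sum, over query vectors, of the maximum dot product with any page vector."""
    return sum(max(dot(q, p) for p in page_vectors) for q in query_vectors)


# Toy data (assumption): 2 query token vectors, and pages with 3 and 2 patch vectors.
# Real ColPali embeddings have 128 dimensions and many more vectors per page.
query_vectors = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page_a": [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]],
    "page_b": [[0.1, 0.1], [0.3, 0.2]],
}

scores = {name: late_interaction_score(query_vectors, vectors) for name, vectors in pages.items()}
best_page = max(scores, key=scores.get)
print(scores, best_page)
```

Here `page_a` wins because each query vector finds a closely matching patch on that page, which is the behaviour the retrieval step relies on.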

### Indexing phase

In [None]:
from byaldi import RAGMultiModalModel

# Initialise the multimodal model
retrieval_model = RAGMultiModalModel.from_pretrained("vidore/colpali-v1.2")

Verbosity is set to 1 (active). Pass verbose=0 to make quieter.


Fetching 2 files: 100%|██████████| 2/2 [00:00<00:00, 34379.54it/s]
Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00,  1.56it/s]


In [None]:
# Index the data

retrieval_model.index(
    input_path="data/", index_name="image_index", store_collection_with_index=True, overwrite=True
)

## Initialise the LLM

We will use Llama 4 Scout as the LLM. It is a mixture-of-experts (MoE) model with 17 billion active parameters and 16 experts. It is natively multimodal, performs strongly, and has an extremely large context window.

Running such a model locally is not feasible on most machines. For this reason we will run inference through [Nscale serverless](https://www.nscale.com)! Nscale offers $5 of free credit upon signup, more than enough to fully understand the ColPali paper!

### Retrieval phase

In [95]:
from openai import OpenAI

# Initialise the client
nscale_base_url = "https://inference.api.nscale.com/v1"

client = OpenAI(
    api_key=nscale_api_key,
    base_url=nscale_base_url
)

In [119]:
# Query the retrieval model on the ColPali paper.
# We ask for the two best-matching pages and keep the base64 image of the top one.

query = "Can you explain the Late interaction formula in detail?"
returned_page = retrieval_model.search(query, k=2)[0].base64
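
The search above asks for two pages but only the first is passed to the LLM. If you want to send several retrieved pages in one request, a helper along these lines could build the message. This is a sketch under two assumptions: the endpoint accepts more than one `image_url` part per message, and the retrieved images are PNG or JPEG. Both function names are hypothetical.

```python
import base64


def guess_image_mime(b64_data):
    """Guess the MIME type from the decoded file signature; default to JPEG."""
    header = base64.b64decode(b64_data[:24])
    if header.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if header.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    return "image/jpeg"


def build_user_message(query, pages_b64):
    """Build one user message holding the query text and one image part per page."""
    content = [{"type": "text", "text": query}]
    for page in pages_b64:
        mime = guess_image_mime(page)
        content.append(
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{page}"}}
        )
    return {"role": "user", "content": content}


# Stand-in images (assumption): only the file signatures matter for this demo.
fake_png = base64.b64encode(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8).decode("ascii")
fake_jpeg = base64.b64encode(b"\xff\xd8\xff\xe0" + b"\x00" * 8).decode("ascii")

message = build_user_message("What is late interaction?", [fake_png, fake_jpeg])
print([part["type"] for part in message["content"]])
```

With the notebook's objects, the page list could be built as `[r.base64 for r in retrieval_model.search(query, k=2)]` and the result passed as `messages=[message]`.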

In [120]:
response = client.chat.completions.create(
  model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
  messages=[
    {
      "role": "user",
      "content": [
        {"type": "text", "text": query},
        {
          "type": "image_url",
          "image_url": {
            "url": f"data:image/jpeg;base64,{returned_page}", 
          },
        },
      ],
    }
  ],
  max_tokens=300,
)

print(response.choices[0].message.content)

The Late Interaction formula, as presented in the text, is used to calculate the interaction between a query and a document in a Vision-Language Model (VLM). 

**Late Interaction Formula**

The Late Interaction formula is defined as:

$\operatorname{LI}(q, d)=\sum_{i \in\left[1, N_{q}\right]} \max _{j \in\left[1, N_{d}\right]}\left\langle\mathbf{E}_{\mathbf{q}}^{(i)} \mid \mathbf{E}_{\mathbf{d}}^{(j)}\right\rangle$

where:

*   $\mathbf{E}_{\mathbf{q}} \in \mathbb{R}^{N_{q} \times D}$ and $\mathbf{E}_{\mathbf{d}} \in \mathbb{R}^{N_{d} \times D}$ are vector representations of the query and document, respectively.
*   $N_q$ and $N_d$ are the number of vectors in the query and in the document page embeddings.
*   $\langle\cdot \mid \cdot\rangle$ is the dot product.

**Definition of the Variables**

*   $q$ is the query.
*   $d$ is the document.
*   $\mathbf{E}_{\mathbf{q}}$ and $\mathbf{E}_{\mathbf{d}}$ are the embedding vectors of $q$ and $d$, respectively.
*   $N_q$ and $N_d$ are the nu