### Chat with your PDFs using byaldi + Claude 🚀

ColPali is an *image retrieval model*: it takes in full documents, such as .PDF or .PNG files, and represents them as they are. It doesn't perform any particular processing to extract information from the document. Rather, it's a model that has been trained to "read" full-page content, complex layouts, tables, etc., and create fine-grained representations of them. The good thing is: you can then use these representations to query your documents in plain text!
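
To give a feel for how a plain-text query can be matched against such fine-grained page representations, here is a toy sketch of ColBERT-style late-interaction ("MaxSim") scoring, the mechanism described in the ColPali paper. The vectors are tiny made-up values rather than real embeddings, and `maxsim_score` is a hypothetical helper written for illustration, not a byaldi function.

```python
# Toy sketch of late-interaction (MaxSim) scoring; all vectors are made up.

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs, page_vecs):
    """For each query-token vector, keep its best match among the page's
    patch vectors, then sum those best matches."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)

# Two query-token vectors, and two pages with three patch vectors each.
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8], [0.0, 0.0]]
page_b = [[0.5, 0.5], [0.1, 0.1], [0.3, 0.0]]

scores = {
    "page_a": maxsim_score(query_vecs, page_a),
    "page_b": maxsim_score(query_vecs, page_b),
}
best_page = max(scores, key=scores.get)  # the page we would retrieve
print(scores, best_page)
```

Because every query token looks for its own best-matching patch, a page can score well when different parts of it answer different parts of the query.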

It's radically different from how document-based Retrieval-Augmented Generation (RAG) is generally done: in current pipelines, your document needs to go through a lot of processing steps (extracting text from the document, re-adding layout information, representing tables and images so they can be queried, etc.), which often carry a hefty time and complexity cost, and you can lose information in the process.

We think ColPali and similar approaches are about to unlock many use cases thanks to their straightforward yet very powerful pipeline.

How does it work, in practice? How do you go from having a document, to being able to retrieve it, to getting your LLM to answer with the relevant page in context? Don't LLMs need text?

Well, it's a lot simpler than it looks at first glance! A hot topic recently has been Vision Language Models, or VLMs. These are basically LLMs that can read "image tokens" just like they read "text tokens", which means that all you need to do is give them your page as an image input, and they'll be able to use it just as they would textual input.

In this notebook, we'll show you how easy it is to use Byaldi in conjunction with Claude to answer queries based on a given document. To interact with Claude, we are going to use the [claudette](https://github.com/answerdotai/claudette) library, a very simple wrapper that takes care of all the cumbersome stuff for you!

The full steps are as follows:
- first download your chosen model (e.g. [ColPali](https://huggingface.co/vidore/colpali))
- then create an index for your pdf
- search the index for your chosen query
- pass the top search result to Claude along with your query
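
The last three steps can be sketched as a single function. This is a minimal sketch, not byaldi's or claudette's own API: `answer_from_pdf` is a hypothetical helper, `rag` is assumed to behave like the `RAGMultiModalModel` loaded later in this notebook, and `chat` like a claudette `Chat`. Both are passed in as arguments so the function can be tried with stand-ins.

```python
import base64

def answer_from_pdf(rag, chat, pdf_path, index_name, query):
    """Index a PDF, retrieve the best page for `query`, and ask the chat model.

    `rag` is assumed to expose index() and search() like byaldi's
    RAGMultiModalModel; `chat` is assumed to be callable like a claudette Chat.
    """
    rag.index(
        input_path=pdf_path,
        index_name=index_name,
        store_collection_with_index=True,  # keep page images as base64
        overwrite=True,
    )
    top_result = rag.search(query, k=1)[0]              # best-matching page
    image_bytes = base64.b64decode(top_result.base64)   # Claude expects bytes
    return chat([image_bytes, query])
```

The first step (downloading the model) happens once, outside this function, so the same loaded model can be reused across documents.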

We'll show how this works with an academic paper and a financial report. Let's get started!

*Note: This notebook will consume a small amount of Claude Sonnet 3.5 tokens, costing around $0.01.*

### Setup

To get started, you'll need to install byaldi and claudette:

In [None]:
!pip install byaldi claudette

But there's another catch: to work with PDFs, we need to be able to convert them to images. To do so, we use the `pdf2image` library, which relies on `poppler` under the hood. Thankfully, it's very easy to install, though the steps depend on your system:

- **MacOS**: `brew install poppler`
- **Linux**: `sudo apt-get install poppler-utils`
- **Windows**: Follow the instructions from [Poppler-Windows](https://github.com/oschwartz10612/poppler-windows/) (I'm sorry, there's no one liner here).
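
If you're not sure whether the install worked, a quick check is to look for one of Poppler's command-line tools on your `PATH`. This is a minimal sketch: `poppler_available` is a hypothetical helper, and it assumes that finding the `pdftoppm` executable is a good enough signal that Poppler is usable.

```python
import shutil

def poppler_available():
    """Return True if Poppler's `pdftoppm` executable is found on PATH."""
    return shutil.which("pdftoppm") is not None

if not poppler_available():
    print("Poppler not found on PATH; PDF-to-image conversion will likely fail.")
```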

Finally, if you want it to go even faster, we recommend following the byaldi setup instructions [here](https://github.com/AnswerDotAI/byaldi/) to get flash attention going.

And that's it, you're ready to go! The next step is importing all we need, and setting up our environment variables:

In [1]:
import base64
import os
os.environ["HF_TOKEN"] = "YOUR_HF_TOKEN" # to download the ColPali model
# os.environ["ANTHROPIC_API_KEY"] = "YOUR_ANTHROPIC_API_KEY"
from byaldi import RAGMultiModalModel
from claudette import *

Now that all the boilerplate is out of the way, let's actually load the model. It's very straightforward, a single call does the trick (notice we pass verbose=1, which means that the model will be quite loud when indexing):

In [2]:
RAG = RAGMultiModalModel.from_pretrained("vidore/colpali-v1.2", verbose=1)

Verbosity is set to 1 (active). Pass verbose=0 to make quieter.


`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.
Loading checkpoint shards: 100%|██████████| 2/2 [00:05<00:00,  2.67s/it]


And here you are: ColPali-v1.2 is now loaded and ready to help you ask questions of your documents. But how does that look in practice? Let's look at how we can easily use Byaldi with Claude Sonnet 3.5 to answer a user query without pre-processing our input documents in any way.

### Is Vision All You Need to query Attention Is All You Need?

For our academic paper, it feels right to start with the epochal [Attention Is All You Need paper](https://arxiv.org/pdf/1706.03762).

Specifically, we're going to ask "What's the BLEU score for the transformer base model?" The answer to this question is found in Table 2 on page 8. Answering this question without image understanding would be quite annoying, as we'd need our data extraction model to parse the table in a way we could pass to our LLM, and table parsing can get tricky very quickly!

This is the part that we want our retrieval model to find:

<img src="./docs/attention_table.png" alt="A table from the Attention is All You Need paper, showing the BLEU score on various translation tasks for the proposed Transformer architecture compared to previous approaches. There is a red box highlighting the line with the BLEU score we are interested in" width="768" height="512">


#### Using Byaldi to get relevant context

Our query is the question above; we'll define it in code just before searching the index.

First, let's `wget` the paper so we can feed it to our `RAG` model for indexing. If you've cloned this notebook from our repository, it's already in the `docs` folder, so you can skip this step.

In [5]:
!mkdir -p docs
!wget https://arxiv.org/pdf/1706.03762 -O docs/attention.pdf
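
If `wget` isn't available on your system (on Windows, for instance), the same download can be done from Python with the standard library. This is a minimal sketch: `download_pdf` is a hypothetical helper, and the custom `User-Agent` header is only there on the assumption that some servers reject Python's default one.

```python
import shutil
import urllib.request
from pathlib import Path

def download_pdf(url, dest):
    """Download `url` to `dest`, creating the parent folder if needed."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    if dest.exists():  # nothing to do if the file is already there
        return dest
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request) as response, open(dest, "wb") as f:
        shutil.copyfileobj(response, f)
    return dest

# Usage (requires network access):
# download_pdf("https://arxiv.org/pdf/1706.03762", "docs/attention.pdf")
```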


This is the full extent of our document preprocessing: downloading it! We're now going to pass it to our model for indexing. In this case, "indexing" means creating representations of the document that we will store in-memory, as well as persist on disk to be re-used later:

In [3]:
RAG.index(
    input_path="./docs/attention.pdf",
    index_name="attention",
    store_collection_with_index=True,
    overwrite=True
)

overwrite is on. Deleting existing index attention to build a new one.




Added page 1 of document 0 to index.
Added page 2 of document 0 to index.
Added page 3 of document 0 to index.
Added page 4 of document 0 to index.
Added page 5 of document 0 to index.
Added page 6 of document 0 to index.
Added page 7 of document 0 to index.
Added page 8 of document 0 to index.
Added page 9 of document 0 to index.
Added page 10 of document 0 to index.
Added page 11 of document 0 to index.
Added page 12 of document 0 to index.
Added page 13 of document 0 to index.
Added page 14 of document 0 to index.
Added page 15 of document 0 to index.
Index exported to .byaldi\attention
Index exported to .byaldi\attention


{0: 'docs\\attention.pdf'}

The model's now done processing the document. It has split it into pages (if you want to use a smaller unit of information, this is a processing step you will need to do yourself).

With our index created, let's now ask it our query. This is done with the `search()` method, which will return the top `k` pages whose content matches our textual query:

In [8]:
query = "What's the BLEU score for the transformer base model?"

results = RAG.search(query, k=1)
results

[{'doc_id': 0, 'page_num': 8, 'score': ..., 'metadata': {}, 'base64': 'iVBORw0KGgo...'}]

Here, you can also see that our results include a very long `base64` attribute! This is because, earlier, we asked our index to also store the image of each page via the `store_collection_with_index` flag, to make it even easier to pass to Claude. If you're indexing a lot of documents, or have limited disk space or RAM, you might want to disable this flag and instead use `doc_id` and `page_num` to retrieve the relevant page yourself!
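
If you do disable the flag, rendering the retrieved page yourself could look roughly like the sketch below. It assumes that `page_num` is 1-indexed (as the "Added page 1 ..." log lines suggest) and uses `pdf2image.convert_from_path`, which needs Poppler. `render_page_png` is a hypothetical helper, and its `convert` parameter exists only so the function can be tried without a real PDF.

```python
import io

def render_page_png(pdf_path, page_num, convert=None):
    """Render one 1-indexed PDF page and return it as PNG bytes.

    `convert` defaults to pdf2image.convert_from_path (requires Poppler).
    """
    if convert is None:
        from pdf2image import convert_from_path as convert
    # Only render the page we need rather than the whole document.
    page = convert(pdf_path, first_page=page_num, last_page=page_num)[0]
    buffer = io.BytesIO()
    page.save(buffer, format="PNG")
    return buffer.getvalue()

# Usage, assuming `doc_paths` is the {doc_id: path} dictionary returned by
# RAG.index() and `result` is one search result:
# image_bytes = render_page_png(doc_paths[result.doc_id], result.page_num)
```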

#### Getting Claude to use this context

This is great! Page 8 is indeed where the relevant table is. But ColPali is just a retrieval model, so all Byaldi can do for you is get the relevant page. It's time for another hero to step in: Claude!

As mentioned earlier, using images as inputs to VLMs such as Sonnet 3.5 is very simple, and we mean it! If everything works as expected Claude should tell us that the BLEU score is 27.3 for EN-DE and 38.1 for EN-FR.

*The image passed to Claude is large, so depending on your account settings you might hit a token limit. This is because we currently keep the images at a high resolution, leaving users free to resize them to their liking (there are accuracy trade-offs with lower resolutions, and they're domain-dependent!). If you'd like an example with an already smaller image for testing, you can also skip to the next section about the financial report.*
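
If you want to resize before sending, the arithmetic is simple enough to sketch. Both helpers below are hypothetical, and the token estimate assumes the rule of thumb from Anthropic's vision documentation (roughly width × height / 750 tokens), so treat the numbers as ballpark figures only.

```python
def fit_within(width, height, max_side):
    """Scale (width, height) down so the longer side is at most `max_side`,
    preserving the aspect ratio. Sizes that already fit are returned as-is."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))

def approx_image_tokens(width, height):
    """Rough token estimate, assuming tokens ~= width * height / 750."""
    return round(width * height / 750)

new_size = fit_within(1700, 2200, max_side=1568)
print(new_size, approx_image_tokens(*new_size))
```

With Pillow, the resulting size can be passed to `Image.resize` before re-encoding the page as PNG bytes.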

Not all VLMs use the same input format for images. Claude expects bytes, so let's decode our base64 image:

In [12]:
image_bytes = base64.b64decode(results[0].base64)
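
As a sanity check, you can confirm that the decoded bytes really are a PNG image: PNG files always begin with the same 8-byte signature, which is why the `base64` string in the search output starts with `iVBORw0KGg`. A minimal sketch, where `decode_png` is a hypothetical helper:

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file

def decode_png(b64_string):
    """Decode a base64 string and check that the result looks like a PNG."""
    data = base64.b64decode(b64_string)
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("decoded data does not start with the PNG signature")
    return data

# Usage with a search result:
# image_bytes = decode_png(results[0].base64)
```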

Next, we create a `claudette` `Chat` object, which will handle all interaction with Claude in a very simple way:

In [10]:
chat = Chat(models[1])
# models is a claudette helper that contains the list of models available on your account, as of 2024-09-06, [1] is Claude Sonnet 3.5:
models

('claude-3-opus-20240229',
 'claude-3-5-sonnet-20240620',
 'claude-3-haiku-20240307')

And we're now all set! Moment of truth: will Claude tell us that the BLEU score is 27.3 for EN-DE and 38.1 for EN-FR?

In [13]:
chat([image_bytes, query])

According to the table in the image, the BLEU score for the Transformer (base model) is:

- 27.3 for EN-DE (English to German)
- 38.1 for EN-FR (English to French)

<details>

- id: `msg_01R7hnmFd4EotEbVLuE9BDYy`
- content: `[{'text': 'According to the table in the image, the BLEU score for the Transformer (base model) is:\n\n- 27.3 for EN-DE (English to German)\n- 38.1 for EN-FR (English to French)', 'type': 'text'}]`
- model: `claude-3-5-sonnet-20240620`
- role: `assistant`
- stop_reason: `end_turn`
- stop_sequence: `None`
- type: `message`
- usage: `{'input_tokens': 1520, 'output_tokens': 58, 'cache_creation_input_tokens': 0, 'cache_read_input_tokens': 0}`

</details>
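
The `usage` field above also lets you estimate what a request costs. The sketch below assumes Claude 3.5 Sonnet list prices of $3 per million input tokens and $15 per million output tokens; these figures are an assumption, so check Anthropic's pricing page before relying on them. `request_cost_usd` is a hypothetical helper.

```python
# Assumed Claude 3.5 Sonnet prices in USD per million tokens (verify them
# against Anthropic's pricing page; they are not taken from this notebook).
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00

def request_cost_usd(usage):
    """Estimate the cost of one request from its `usage` dictionary."""
    return (
        usage["input_tokens"] * INPUT_PRICE_PER_MTOK
        + usage["output_tokens"] * OUTPUT_PRICE_PER_MTOK
    ) / 1_000_000

usage = {"input_tokens": 1520, "output_tokens": 58}
print(f"${request_cost_usd(usage):.4f}")
```

Under those assumed prices, this request comes to roughly half a cent, in line with the estimate of around $0.01 for the two requests in this notebook.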

And here you are! A basic, functional chat-with-your-pdf app in 7 lines of code.

### Financial Report

Let's move to another situation: I am an executive at a fictitious company called ACME, and I've just been given a report containing the monthly revenue for our 529 products, with just one per page. Thankfully, my assistant has forwarded me the mini-version of the report, which has the data for just our creatively named top 5 products (A, B, C, D, E).

I personally proposed Product C, so I'm very interested in how it's doing. I'm going to ask "In which month did Product C generate the most revenue?". As you can see, the expected answer is **June**:

<img src="./docs/product_c.png" alt="A monthly revenue report titled Product C, with a revenue peak in June" width="768" height="512">

In [14]:
query = "In which month did Product C generate the most revenue?"

We're going to go through the same process as for the Attention is All You Need paper, and we'll index our financial report:

In [15]:
RAG.index(
    input_path="./docs/financial_report.pdf",
    index_name="financial_report",
    store_collection_with_index=True,
    overwrite=True
)

Added page 1 of document 1 to index.
Added page 2 of document 1 to index.
Added page 3 of document 1 to index.
Added page 4 of document 1 to index.
Added page 5 of document 1 to index.
Added page 6 of document 1 to index.
Index exported to .byaldi/financial_report
Index exported to .byaldi/financial_report


{0: 'docs/attention.pdf', 1: 'docs/financial_report.pdf'}

Now, let's search the index for our query. We expect the top result to be page 4.

In [16]:
results = RAG.search(query, k=1)
results[0].page_num

4

Finally, we again repeat the process we went through earlier: first, convert the top search result to bytes, then pass it to Claude with our query. If everything works as expected Claude should tell us Product C generated the most revenue in **June**.

In [17]:
chat = Chat(models[1])
image_bytes = base64.b64decode(results[0].base64)
chat([image_bytes, query])

According to the bar graph showing monthly revenue for Product C, the month with the highest revenue was June. The bar for June is visibly the tallest, reaching above 2500 on the revenue scale, indicating it generated the most revenue compared to all other months shown.

<details>

- id: `msg_01Q963zBYFRuKvJPFbTDGYCN`
- content: `[{'text': 'According to the bar graph showing monthly revenue for Product C, the month with the highest revenue was June. The bar for June is visibly the tallest, reaching above 2500 on the revenue scale, indicating it generated the most revenue compared to all other months shown.', 'type': 'text'}]`
- model: `claude-3-5-sonnet-20240620`
- role: `assistant`
- stop_reason: `end_turn`
- stop_sequence: `None`
- type: `message`
- usage: `{'input_tokens': 1573, 'output_tokens': 59, 'cache_creation_input_tokens': 0, 'cache_read_input_tokens': 0}`

</details>

Hooray! Claude gets this one right too. It seems to be pretty decent at reading documents, so why not try it with your own :)?

### Full Attention is All You Need Example Code

Below is the full Python code for the notebook for easier copy-pasting:

In [None]:
import base64
import os
os.environ["HF_TOKEN"] = "YOUR_HF_TOKEN" # to download the ColPali model
os.environ["ANTHROPIC_API_KEY"] = "YOUR_ANTHROPIC_API_KEY"
from byaldi import RAGMultiModalModel
from claudette import *

# Load model
RAG = RAGMultiModalModel.from_pretrained("vidore/colpali-v1.2", verbose=1)

# Index document
RAG.index(
    input_path="./docs/attention.pdf",
    index_name="attention",
    store_collection_with_index=True,
    overwrite=True
)

# Define query
query = "What's the BLEU score for the transformer base model?"

# Query model
results = RAG.search(query, k=1)

# Pass top result to Claude
image_bytes = base64.b64decode(results[0].base64)
chat = Chat(models[1])

# This will print claude's answer
print(chat([image_bytes, query]))