# ColSmolVLM model testing
* cookbook
    * colpali-engine
    * https://huggingface.co/learn/cookbook/multimodal_rag_using_document_retrieval_and_smol_vlm#3-initialize-the-colsmolvlm-multimodal-document-retrieval-model-
* vidore benchmark: https://huggingface.co/spaces/vidore/vidore-leaderboard

models:
* [colSmol-500M](https://huggingface.co/vidore/colSmol-500M)
    * base model: ColSmolVLM-Instruct-500M


In [22]:
import os
import torch
from PIL import Image

from transformers import AutoTokenizer
from colpali_engine.models import ColIdefics3, ColIdefics3Processor

In [3]:
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(
        env_file="../.env", env_file_encoding="utf-8", extra="ignore"
    )
    model_dir: str
    
settings = Settings()




In [6]:
# Load Colpali engine
model_dir = os.path.join(
    settings.model_dir, "multimodal_retriever/colSmol-500M"
)

model = ColIdefics3.from_pretrained(
    model_dir,
    torch_dtype=torch.bfloat16,
    device_map="mps",
    # attn_implementation="flash_attention_2" # or eager
).eval()


In [23]:
tokenizer = AutoTokenizer.from_pretrained(model_dir)

In [9]:
processor = ColIdefics3Processor.from_pretrained(model_dir)

Some kwargs in processor config are unused and will not have any effect: image_seq_len. 


In [78]:
## Following the official example
images = [
    Image.new("RGB", (32, 32), color="white"),
    Image.new("RGB", (16, 16), color="black"),
]
queries = [
    "Is attention really all you need?",
    "What is the amount of bananas farmed in Salvador?",
]

# Process the inputs
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

scores = processor.score_multi_vector(query_embeddings, image_embeddings)

In [79]:
print(image_embeddings.shape)
print(query_embeddings.shape)
print(scores)

torch.Size([2, 1135, 128])
torch.Size([2, 23, 128])
tensor([[5.8438, 5.8750],
        [7.6562, 7.8125]])
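
Each row of `scores` corresponds to one query and each column to one image, so retrieval amounts to picking the highest-scoring column per row. The sketch below does this in plain Python with the values copied from the printed output above; the variable name `best_image_per_query` is made up for illustration. On the tensor itself, `scores.argmax(dim=1)` should give the same indices.

```python
# Scores copied from the printed output above: rows are queries, columns are images.
scores = [
    [5.8438, 5.8750],
    [7.6562, 7.8125],
]

# For each query (row), take the index of the highest-scoring image (column).
best_image_per_query = [
    max(range(len(row)), key=lambda j: row[j]) for row in scores
]
print(best_image_per_query)
```

Both queries pick image index 1 here, which says little on its own because the inputs are blank placeholder images.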


# Check ColSmolVLM templates
* uses ColIdefics3, ColIdefics3Processor
    * colpali repo: https://github.com/illuin-tech/colpali/tree/59e94a92790b67bd60507608c3115a2e48f83a07/colpali_engine/models/idefics3/colidefics3
    

## process_images
* Used when embedding the documents to be retrieved
    * Assumes the input is images only
* The prompt template is hardcoded
    * https://github.com/illuin-tech/colpali/blob/59e94a92790b67bd60507608c3115a2e48f83a07/colpali_engine/models/idefics3/colidefics3/processing_colidefics3.py#L37
    * "<|im_start|>User: Describe the image." + image tokens + "<end_of_utterance>"

In [43]:
# The prompt template is hardcoded in the colpali code
messages_doc = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the image."},
            {"type": "image"},
        ],
    },
]

text_doc = processor.apply_chat_template(messages_doc, add_generation_prompt=False)
print(text_doc)

<|im_start|>User: Describe the image.<image><end_of_utterance>



In [44]:
## Image insertion step
batch_doc = processor(
    text=[text_doc],
    images=[Image.new("RGB", (32, 32), color="white")],
    return_tensors="pt",
    padding="longest",
)

In [59]:
print(tokenizer.decode(batch_doc['input_ids'][0]).replace("<image>", ""))

<|im_start|>User: Describe the image.<fake_token_around_image><row_1_col_1><fake_token_around_image><row_1_col_2><fake_token_around_image><row_1_col_3><fake_token_around_image><row_1_col_4>
<fake_token_around_image><row_2_col_1><fake_token_around_image><row_2_col_2><fake_token_around_image><row_2_col_3><fake_token_around_image><row_2_col_4>
<fake_token_around_image><row_3_col_1><fake_token_around_image><row_3_col_2><fake_token_around_image><row_3_col_3><fake_token_around_image><row_3_col_4>
<fake_token_around_image><row_4_col_1><fake_token_around_image><row_4_col_2><fake_token_around_image><row_4_col_3><fake_token_around_image><row_4_col_4>

<fake_token_around_image><global-img><fake_token_around_image><end_of_utterance>



## process_queries
* Used when embedding query text
* Each query is formatted as `self.query_prefix + query + suffix + "\n"`
    * query_prefix: "Query: "
    * suffix (default): self.query_augmentation_token * 10
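
The formatting rule above can be sketched in plain Python. This is a hypothetical re-implementation for illustration, not the library's code: the function name `format_query` and its parameters are assumptions, and the default values are taken from the bullets above.

```python
# Hypothetical helper mirroring the formatting rule described above.
# Not part of colpali-engine; names and defaults are assumptions for illustration.
def format_query(
    query: str,
    prefix: str = "Query: ",
    augmentation_token: str = "<end_of_utterance>",
    n_augmentation_tokens: int = 10,
) -> str:
    # The suffix is the augmentation token repeated a fixed number of times.
    suffix = augmentation_token * n_augmentation_tokens
    return prefix + query + suffix + "\n"


formatted = format_query("sample_query")
print(repr(formatted))
```

The repeated augmentation tokens give the query extra token positions, each of which yields its own embedding vector.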

In [56]:
print(processor.query_augmentation_token)
print(processor.query_prefix)

<end_of_utterance>
Query: 


In [58]:
query = "sample_query"
input_ids = processor.process_queries([query])['input_ids'][0]
print(tokenizer.decode(input_ids))

Query: sample_query<end_of_utterance><end_of_utterance><end_of_utterance><end_of_utterance><end_of_utterance><end_of_utterance><end_of_utterance><end_of_utterance><end_of_utterance><end_of_utterance>



## score_multi_vector
* https://github.com/illuin-tech/colpali/blob/59e94a92790b67bd60507608c3115a2e48f83a07/colpali_engine/utils/processing_utils.py#L68
* Each query and each passage is represented as a set of embedding vectors rather than a single vector.

Input
* qs (`Union[torch.Tensor, List[torch.Tensor]]`): Query embeddings.
* ps (`Union[torch.Tensor, List[torch.Tensor]]`): Passage embeddings.

Example shapes:
```
image: torch.Size([2, 1135, 128])
text: torch.Size([2, 23, 128])
```

### Score computation (ColBERT-style late interaction)
1. `torch.einsum("bnd,csd->bcns", qs_batch, ps_batch).max(dim=3)[0].sum(dim=2)`
* qs_batch shape: bnd
    * b: batch, n: number of query embeddings (sequence length), d: dim
* ps_batch shape: csd
    * c: batch, s: number of passage embeddings (sequence length), d: dim
* Multiplication (bnd * csd):
    * Computes dot products between every query vector (N vectors) and every passage vector (S vectors)
* Output shape is (B, C, N, S): similarity matrix
    * B: number of queries in the batch
    * C: number of passages in the batch
    * N: number of query tokens
    * S: number of passage tokens
    * Entry [b, c, n, s]: dot-product similarity between the n-th token of query b and the s-th token of passage c.

```
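B, N, _ = qs_batch.shape  # queries: (B, N, D)
C, S, _ = ps_batch.shape  # passages: (C, S, D)
output = torch.empty(B, C, N, S)
for b in range(B):  # Loop over queries in the batch
    for c in range(C):  # Loop over passages in the batch
        for n in range(N):  # Loop over query tokens
            for s in range(S):  # Loop over passage tokens
                # Dot product between the query token embedding and the passage token embedding
                output[b, c, n, s] = torch.dot(qs_batch[b, n], ps_batch[c, s])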
```

2. max(dim=3)[0]
* Finds the maximum similarity for each query token (n) across all passage tokens (s).
* For each query token, we take the **best-matching passage token.**
* Output shape: (b, c, n)
    * Best-match score for the n-th query token

3. Aggregation: sum(dim=2)
* Produces a single score per query-passage pair.
* Each query token contributes its best-match score to the total.
* If several query tokens each find a highly similar passage token, their best-match scores add up.
* Output shape: (b, c)
    * Overall similarity score between query b and passage c.
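
The three steps above (pairwise dot products, max over passage tokens, sum over query tokens) can be sketched in plain Python on toy data small enough to check by hand. This is a minimal illustration under stated assumptions, not the library's implementation: the helper names `dot` and `maxsim_score` and the local `score_multi_vector` function are made up for this sketch, and it ignores batching, padding, and device handling.

```python
# Plain-Python sketch of ColBERT-style late interaction (MaxSim).
# Illustrative only: helper names and toy vectors are assumptions.
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, passage_vecs):
    """For each query token take its best-matching passage token, then sum."""
    return sum(max(dot(q, p) for p in passage_vecs) for q in query_vecs)


def score_multi_vector(queries, passages):
    """Return a (num_queries x num_passages) score matrix as nested lists."""
    return [[maxsim_score(q, p) for p in passages] for q in queries]


# Toy data with embedding dimension 2.
queries = [
    [[1.0, 0.0], [0.0, 1.0]],  # query 0: two token vectors
    [[1.0, 1.0]],              # query 1: one token vector
]
passages = [
    [[1.0, 0.0], [0.5, 0.5]],  # passage 0: two token vectors
    [[0.0, 2.0]],              # passage 1: one token vector
]

scores = score_multi_vector(queries, passages)
print(scores)  # [[1.5, 2.0], [1.0, 2.0]]
```

For query 0 against passage 0, the first query token's best match is 1.0 and the second's is 0.5, giving 1.5. The number of passage tokens never enters the sum directly; only the best match per query token does.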

# Trying to embed images + text directly

In [36]:
# images = [
#     [
#         Image.new("RGB", (32, 32), color="white"),
#         Image.new("RGB", (32, 32), color="black"),
#     ],
#     # Image.new("RGB", (16, 16), color="black"),
# ]
# queries = [
#     "Is attention really all you need? <image> \n\nimage2 <image>",
#     # "What is the amount of bananas farmed in Salvador?",
# ]

In [65]:
text1 = "Is attention really all you need?"
text2 = "maybe? maybe not?"
messages_doc = [
    {
        "role": "user",
        "content": [
            {
                "type": "text", "text": "Describe the text and image. text1: {}:".format(text1) # text1
            },
            {
                "type": "text", "text": "\nimage1: "
            },
            {"type": "image"}, # image1
            {
                "type": "text", "text": "\nimage2: "
            },
            {"type": "image"}, # image2
            {
                "type": "text", "text": "\ntext2: {}".format(text2)
            }
        ],
    },
]

text_doc = processor.apply_chat_template(messages_doc, add_generation_prompt=False)
print(text_doc)

<|im_start|>User: Describe the text and image. text1: Is attention really all you need?:
image1: <image>
image2: <image>
text2: maybe? maybe not?<end_of_utterance>



In [75]:
## Image insertion step
# Processor code
# https://github.com/huggingface/transformers/blob/e6f4a4ebbf970c12fe475be79a039f943c28f975/src/transformers/models/idefics3/processing_idefics3.py#L111
'''
>>> images = [[image1], [image2]]

>>> text = [
...     "<image>In this image, we see",
...     "bla bla bla<image>",
... ]
>>> outputs = processor(images=images, text=text, return_tensors="pt", padding=True)
'''

images = [
    Image.new("RGB", (32, 32), color="white"),
    Image.new("RGB", (32, 32), color="black"),
]

batch_doc = processor(
    text=[text_doc],
    images=[images],
    return_tensors="pt",
    padding="longest",
)
input_ids = batch_doc['input_ids'][0]
print(tokenizer.decode(input_ids).replace("<image>", ""))

<|im_start|>User: Describe the text and image. text1: Is attention really all you need?:
image1: <fake_token_around_image><row_1_col_1><fake_token_around_image><row_1_col_2><fake_token_around_image><row_1_col_3><fake_token_around_image><row_1_col_4>
<fake_token_around_image><row_2_col_1><fake_token_around_image><row_2_col_2><fake_token_around_image><row_2_col_3><fake_token_around_image><row_2_col_4>
<fake_token_around_image><row_3_col_1><fake_token_around_image><row_3_col_2><fake_token_around_image><row_3_col_3><fake_token_around_image><row_3_col_4>
<fake_token_around_image><row_4_col_1><fake_token_around_image><row_4_col_2><fake_token_around_image><row_4_col_3><fake_token_around_image><row_4_col_4>

<fake_token_around_image><global-img><fake_token_around_image>
image2: <fake_token_around_image><row_1_col_1><fake_token_around_image><row_1_col_2><fake_token_around_image><row_1_col_3><fake_token_around_image><row_1_col_4>
<fake_token_around_image><row_2_col_1><fake_token_around_image><ro

In [76]:
with torch.no_grad():
    embeddings = model(**batch_doc.to("mps"))

In [77]:
embeddings.shape

torch.Size([1, 2294, 128])