# PyLate package: ColBERT implementation
* https://github.com/lightonai/pylate

In [1]:
import json
import os
from config import settings

## Notable ColBERT Args

```
prompts
    A dictionary with prompts for the model. The key is the prompt name, the value is the prompt text. The prompt text will be prepended before any text to encode. For example:
    `{"query": "query: ", "passage": "passage: "}` or `{"clustering": "Identify the main category based on the
    titles in "}`.
embedding_size
    The output size of the projection layer. Defaults to 128.
query_prefix
    Prefix to add to the queries.
document_prefix
    Prefix to add to the documents.
add_special_tokens
    Add the prefix to the inputs.
truncation
    Truncate the inputs to the encoder max lengths or use sliding window encoding.
query_length
    The length to which queries are truncated or padded (with mask tokens). If set, overrides the config value. Defaults to 32.
document_length
    The maximum document length; longer documents are truncated. If set, overrides the config value. Defaults to 180.
attend_to_expansion_tokens
    Whether the original tokens attend to the expansion tokens in the model's attention layers. If False, the
    original tokens will not attend to the expansion tokens; only the expansion tokens will attend to the
    original tokens. Default is False (as in the original ColBERT codebase).
skiplist_words
    A list of words to exclude from document scoring (these tokens are still used for encoding and are only skipped during scoring). Default is the list of characters in string.punctuation.
model_kwargs : dict, optional
    Additional model configuration parameters to be passed to the Hugging Face Transformers model.
```
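
To make the effect of `query_length`, `document_length`, and `skiplist_words` concrete, here is a minimal pure-Python sketch on toy token lists. The helper names (`pad_query`, `truncate_document`, `scoring_mask`) and the `"[MASK]"` string are assumptions made for illustration, not PyLate functions; the library itself works on tokenizer IDs and tensors.

```python
import string

MASK = "[MASK]"  # stand-in for the tokenizer's mask token (assumption)


def pad_query(tokens, query_length=32):
    """Truncate to query_length, then pad with mask tokens up to query_length."""
    tokens = tokens[:query_length]
    return tokens + [MASK] * (query_length - len(tokens))


def truncate_document(tokens, document_length=180):
    """Documents are only truncated; they are not padded with mask tokens."""
    return tokens[:document_length]


def scoring_mask(tokens, skiplist=frozenset(string.punctuation)):
    """True for tokens that take part in scoring; skiplist tokens are still encoded."""
    return [tok not in skiplist for tok in tokens]


# Hypothetical, pre-tokenized inputs with small lengths so the effect is visible.
query = pad_query(["what", "is", "colbert", "?"], query_length=8)
doc = truncate_document(
    ["colbert", "is", "a", "late", "-", "interaction", "model", "."],
    document_length=6,
)
mask = scoring_mask(doc)

print(query)
print(doc)
print(mask)
```

The query is always brought to a fixed length, while the document keeps its own length up to the cap, which is why passage embeddings can differ in their first dimension.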

In [None]:
from pylate.models import ColBERT

model_dir = os.path.join(settings.model_weight_dir, "late_interaction/ModernBERT-Korean-ColBERT-preview-v1")

# https://github.com/lightonai/pylate/blob/fe115ff8bd93351670d516859952804ced1198f7/pylate/models/colbert.py#L35
model = ColBERT(
    model_name_or_path=model_dir,
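    embedding_size=1024,  # defaults to 128 if not set; note the recorded outputs below have dimension 128
    document_length=None,  # leave unset to keep the value from the model config
    device="mps",
    prompts={"query": "query: ", "passage": "passage: "},  # text prepended to each input before encoding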
)

PyLate model loaded successfully.


In [3]:
texts = [
    "예시 문서 1번 입니다.",  # "This is example document no. 1."
    "안녕하세요 제 이름은 송영록 입니다",  # "Hello, my name is Song Young-rok."
]

passage_embeddings = model.encode(
    sentences=texts,
    batch_size=32,
    is_query=False,
    show_progress_bar=True,
)


In [8]:
len(passage_embeddings), passage_embeddings[0].shape, type(passage_embeddings[0])

(2, (19, 128), numpy.ndarray)

In [9]:
# Each passage has a different shape: (num_tokens, embedding_dim), where num_tokens depends on passage length
passage_embeddings[0].shape, passage_embeddings[1].shape

((19, 128), (29, 128))
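
Because each passage yields a different number of token vectors, a query is compared against a passage token by token rather than through a single pooled vector. The sketch below illustrates ColBERT-style MaxSim scoring with NumPy on random stand-in arrays that reuse the shapes shown above. The `maxsim` and `l2_normalize` helpers, the random data, and the assumption of unit-normalized token vectors are all illustrative; PyLate has its own scoring and retrieval utilities, which are the better choice in practice.

```python
import numpy as np


def maxsim(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late-interaction score: for each query token, take the best-matching
    document token (by dot product), then sum over query tokens."""
    sim = query_emb @ doc_emb.T  # shape: (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row (token vector) to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


rng = np.random.default_rng(0)

# Random stand-ins with the shapes seen above (not real model embeddings).
query_emb = l2_normalize(rng.normal(size=(32, 128)))
doc_embs = [l2_normalize(rng.normal(size=(n, 128))) for n in (19, 29)]

scores = [maxsim(query_emb, d) for d in doc_embs]
# Indices of the documents, best score first.
ranking = sorted(range(len(doc_embs)), key=lambda i: scores[i], reverse=True)

print(scores)
print(ranking)
```

With real embeddings, `query_emb` would come from `model.encode(..., is_query=True)` and each entry of `doc_embs` from the `passage_embeddings` list above.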