# How ColBERTv2 Works

ColBERTv2 is a **late-interaction** retrieval model that balances effectiveness and efficiency. This notebook explains its core concepts and demonstrates how its scoring works under the hood.

## The Retrieval Spectrum

There are three main approaches to neural retrieval:

| Approach | Interaction | Efficiency | Effectiveness |
|----------|-------------|------------|---------------|
| **Bi-encoder** (Dense Retrieval) | None - single vector per doc | ⭐⭐⭐ Fast | ⭐ Good |
| **Cross-encoder** | Full attention between Q & D | ⭐ Slow | ⭐⭐⭐ Best |
| **ColBERT** (Late Interaction) | Token-level MaxSim | ⭐⭐ Medium | ⭐⭐ Better |

ColBERT's key insight: **delay the interaction between query and document until after encoding**, but still allow fine-grained token-level matching.

## Setup

In [None]:
import numpy as np
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

In [None]:
# Load ColBERTv2 model
model_name = "colbert-ir/colbertv2.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

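print(f"Model hidden size: {model.config.hidden_size}")
# Note: AutoModel loads the underlying BERT encoder, so the embeddings produced
# in this notebook keep the hidden size. The full ColBERTv2 model adds a linear
# projection to 128 dimensions on top of the encoder.
print("Full ColBERTv2 projects to: 128 dimensions (projection not applied in this notebook)")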

## Core Concept 1: Token-Level Embeddings

Unlike single-vector dense retrieval, which produces **one vector per document**, ColBERT produces **one vector per token**.

```
Dense Retrieval:  "The cat sat" → [0.1, 0.2, ..., 0.8]  (single 768-dim vector)
ColBERT:          "The cat sat" → [[0.1, ...], [0.3, ...], [0.2, ...]]  (3 × 128-dim vectors)
```
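
To make the shape difference concrete, here is a small dependency-free sketch. The 4-dim "encoder outputs", the 4×2 projection matrix, and the helper names are all made up for illustration. The real model projects from 768 to 128 dimensions with learned weights, and mean pooling is only one common way to build a single vector.

```python
import math

# Hypothetical encoder outputs: 3 tokens, 4 dims each (stand-in for 768).
hidden = [
    [1.0, 0.0, 2.0, 0.0],  # "the"
    [0.0, 3.0, 0.0, 4.0],  # "cat"
    [2.0, 2.0, 1.0, 0.0],  # "sat"
]

# Hypothetical projection matrix: 4 dims -> 2 dims (stand-in for 768 -> 128).
projection = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 0.0],
    [0.0, 1.0],
]


def project(vec, matrix):
    """Multiply a row vector by an (in_dim x out_dim) matrix."""
    out_dim = len(matrix[0])
    return [sum(vec[i] * matrix[i][j] for i in range(len(vec))) for j in range(out_dim)]


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


# Late interaction: keep one projected, unit-length vector per token.
token_vectors = [l2_normalize(project(h, projection)) for h in hidden]

# Single-vector retrieval: collapse all tokens into one vector (mean pooling here).
mean_hidden = [sum(col) / len(hidden) for col in zip(*hidden)]
pooled_vector = l2_normalize(mean_hidden)

print(len(token_vectors), "x", len(token_vectors[0]))  # 3 x 2
print(len(pooled_vector))  # 4
```

The per-token output grows with the length of the text, while the pooled output stays the same size no matter how many tokens went in.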

In [None]:
def encode_colbert(texts, is_query=False):
    """Encode texts into per-token embeddings, ColBERT-style (simplified).

    The full ColBERTv2 pipeline prepends a [Q] marker to queries and a [D]
    marker to documents, then projects each token to 128 dimensions.
    This simplified version skips both steps, so `is_query` is accepted
    for readability at the call site but is currently unused.
    """
    inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)
        # Per-token embeddings from the last hidden layer: (batch, seq_len, hidden_size)
        token_embeddings = outputs.last_hidden_state

    # ColBERTv2 applies a linear projection to 128 dimensions before normalizing.
    # AutoModel loads only the encoder, so that projection is not applied here
    # and the embeddings keep the encoder's hidden size.
    # L2-normalize each token vector so dot products equal cosine similarities.
    token_embeddings = F.normalize(token_embeddings, p=2, dim=-1)

    return token_embeddings, inputs["attention_mask"]

In [None]:
# Encode a query and see the token-level embeddings
query = "What is machine learning?"
query_embeds, query_mask = encode_colbert([query], is_query=True)

tokens = tokenizer.tokenize(query)
print(f"Query: '{query}'")
print(f"Tokens: {['[CLS]'] + tokens + ['[SEP]']}")
print(f"Embedding shape: {query_embeds.shape}")
print(f"  -> {query_embeds.shape[1]} tokens x {query_embeds.shape[2]} dimensions each")

## Core Concept 2: MaxSim (Maximum Similarity)

The magic of ColBERT is in how it scores query-document pairs using **MaxSim**:

1. For each query token, find its **maximum similarity** with any document token
2. Sum these maximum similarities across all query tokens

$$\text{Score}(Q, D) = \sum_{q \in Q} \max_{d \in D} (q \cdot d)$$

This allows **soft matching**: a query token can match its best semantic partner in the document.
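
The formula can be checked by hand on a tiny example. The sketch below uses made-up, unit-length 2-D token embeddings and a hypothetical `maxsim` helper written in plain Python so that each step of the arithmetic is visible.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vecs, doc_vecs):
    """Sum over query tokens of the max dot product with any document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Hypothetical unit-length 2-D token embeddings.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.8, 0.6], [0.0, 1.0], [0.6, 0.8]]
doc_b = [[-1.0, 0.0], [0.6, -0.8]]

# Query token 1 vs doc_a: 0.8, 0.0, 0.6 -> max 0.8
# Query token 2 vs doc_a: 0.6, 1.0, 0.8 -> max 1.0
print(round(maxsim(query, doc_a), 3))  # 1.8

# Query token 1 vs doc_b: -1.0, 0.6 -> max 0.6
# Query token 2 vs doc_b: 0.0, -0.8 -> max 0.0
print(round(maxsim(query, doc_b), 3))  # 0.6
```

Each query token contributes only its single best match, so extra unrelated tokens in a document do not lower the score.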

In [None]:
def maxsim_score(query_embeds, doc_embeds, query_mask, doc_mask):
    """Compute ColBERT MaxSim score between query and document.

    Args:
        query_embeds: (1, num_query_tokens, dim)
        doc_embeds: (1, num_doc_tokens, dim)
        query_mask: (1, num_query_tokens)
        doc_mask: (1, num_doc_tokens)

    Returns:
        Scalar score
    """
    # Compute all pairwise similarities: (num_query_tokens, num_doc_tokens)
    # Since embeddings are L2-normalized, dot product = cosine similarity
    similarity_matrix = torch.matmul(query_embeds[0], doc_embeds[0].T)

    # Mask out padding tokens in document
    similarity_matrix = similarity_matrix.masked_fill(doc_mask[0].unsqueeze(0) == 0, float("-inf"))

    # MaxSim: for each query token, take max similarity across doc tokens
    max_similarities = similarity_matrix.max(dim=-1).values  # (num_query_tokens,)

    # Mask out padding tokens in query and sum
    max_similarities = max_similarities * query_mask[0].float()
    score = max_similarities.sum()

    return score, similarity_matrix, max_similarities

In [None]:
# Example: Score a query against two documents
query = "How do neural networks learn?"
doc1 = "Neural networks learn by adjusting weights through backpropagation."
doc2 = "The weather today is sunny and warm."

# Encode
query_embeds, query_mask = encode_colbert([query])
doc1_embeds, doc1_mask = encode_colbert([doc1])
doc2_embeds, doc2_mask = encode_colbert([doc2])

# Score
score1, sim_matrix1, max_sims1 = maxsim_score(query_embeds, doc1_embeds, query_mask, doc1_mask)
score2, sim_matrix2, max_sims2 = maxsim_score(query_embeds, doc2_embeds, query_mask, doc2_mask)

print(f"Query: '{query}'\n")
print(f"Doc 1: '{doc1}'")
print(f"Score: {score1.item():.3f}\n")
print(f"Doc 2: '{doc2}'")
print(f"Score: {score2.item():.3f}\n")
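# Report the ranking from the computed scores rather than assuming it
better = "Doc 1" if score1 > score2 else "Doc 2"
print(f"→ {better} is ranked higher")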

## Visualizing MaxSim

Let's visualize how MaxSim works by showing which document tokens each query token matches with.

In [None]:
def visualize_maxsim(query, doc, query_embeds, doc_embeds, query_mask, doc_mask):
    """Visualize which document tokens each query token matches."""
    score, sim_matrix, max_sims = maxsim_score(query_embeds, doc_embeds, query_mask, doc_mask)

    query_tokens = ["[CLS]"] + tokenizer.tokenize(query) + ["[SEP]"]
    doc_tokens = ["[CLS]"] + tokenizer.tokenize(doc) + ["[SEP]"]

    # Pad tokens to match embedding length
    while len(query_tokens) < sim_matrix.shape[0]:
        query_tokens.append("[PAD]")
    while len(doc_tokens) < sim_matrix.shape[1]:
        doc_tokens.append("[PAD]")

    print(f"Query: '{query}'")
    print(f"Doc:   '{doc}'")
    print("\nMaxSim breakdown (query token → best matching doc token):")
    print("-" * 60)

    for i, (q_tok, max_sim) in enumerate(zip(query_tokens, max_sims, strict=False)):
        if query_mask[0, i] == 0:  # Skip padding
            continue
        best_doc_idx = sim_matrix[i].argmax().item()
        best_doc_tok = doc_tokens[best_doc_idx]
        print(f"  {q_tok:15} → {best_doc_tok:15} (sim: {max_sim.item():.3f})")

    print("-" * 60)
    print(f"Total MaxSim Score: {score.item():.3f}")

In [None]:
# Visualize for the relevant document
visualize_maxsim(
    "How do neural networks learn?",
    "Neural networks learn by adjusting weights through backpropagation.",
    query_embeds,
    doc1_embeds,
    query_mask,
    doc1_mask,
)

In [None]:
# Visualize for the irrelevant document
visualize_maxsim(
    "How do neural networks learn?",
    "The weather today is sunny and warm.",
    query_embeds,
    doc2_embeds,
    query_mask,
    doc2_mask,
)

## Core Concept 3: Efficient Retrieval with ColBERT

The power of ColBERT is that document embeddings can be **precomputed and indexed**:

1. **Offline**: Encode all documents, store token embeddings
2. **Online**: Encode query, compute MaxSim against stored embeddings

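ColBERTv2 adds two optimizations on top of the original ColBERT:
- **Residual compression**: Store each token embedding as a centroid ID plus a quantized residual
- **Denoised supervision**: Train with cross-encoder distillation and hard negatives

Like the original ColBERT, it also keeps embeddings small by projecting to 128 dimensions (vs 768 for BERT).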

In [None]:
# Simulate a mini retrieval system
documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Deep learning uses neural networks with many layers.",
    "Python is a popular programming language for data science.",
    "Transformers revolutionized natural language processing.",
    "The stock market closed higher today.",
    "Gradient descent optimizes neural network parameters.",
]

# "Index" the documents (in practice, this would be stored on disk)
print("Indexing documents...")
doc_embeddings = []
doc_masks = []
for doc in documents:
    embeds, mask = encode_colbert([doc])
    doc_embeddings.append(embeds)
    doc_masks.append(mask)
print(f"Indexed {len(documents)} documents")

In [None]:
def search(query, top_k=3):
    """Search documents using ColBERT MaxSim."""
    query_embeds, query_mask = encode_colbert([query])

    scores = []
    for doc_emb, doc_mask in zip(doc_embeddings, doc_masks, strict=False):
        score, _, _ = maxsim_score(query_embeds, doc_emb, query_mask, doc_mask)
        scores.append(score.item())

    # Rank by score
    ranked_indices = np.argsort(scores)[::-1]

    print(f"Query: '{query}'\n")
    print(f"Top {top_k} results:")
    for rank, idx in enumerate(ranked_indices[:top_k], 1):
        print(f"  {rank}. [{scores[idx]:.2f}] {documents[idx]}")
    return ranked_indices[:top_k]

In [None]:
# Try some queries
search("What is deep learning?")

In [None]:
search("How are neural networks trained?")

In [None]:
search("programming for AI")

## Why ColBERT Works Well

1. **Fine-grained matching**: Token-level embeddings capture nuanced semantics
   - "neural networks" in query can match "neural" AND "networks" separately in doc
   
2. **Soft matching via MaxSim**: Each query term finds its best match
   - Synonyms work: "learn" can match "train", "optimize", etc.
   
3. **Precomputable**: Documents encoded offline, only query encoding at search time
   - Much faster than cross-encoders

4. **Better than single-vector**: Multiple vectors capture more information
   - Dense retrieval loses information by compressing to one vector
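
The information-loss point can be illustrated with a toy case. In the sketch below (made-up 2-D embeddings, hypothetical helper names, mean pooling as one possible single-vector strategy), two different documents collapse to the same pooled vector, while MaxSim still tells them apart.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def mean_pool(vecs):
    """Collapse token vectors into one unit-length vector."""
    return l2_normalize([sum(col) / len(vecs) for col in zip(*vecs)])


def maxsim(query_vecs, doc_vecs):
    """Sum over query tokens of the max dot product with any document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


s = math.sqrt(0.5)
query = [[1.0, 0.0], [0.0, 1.0]]  # two distinct query tokens
doc_x = [[1.0, 0.0], [0.0, 1.0]]  # has an exact match for each query token
doc_y = [[s, s], [s, s]]          # two identical "in-between" tokens

# Single-vector scoring: both documents pool to the same direction.
single_x = dot(mean_pool(query), mean_pool(doc_x))
single_y = dot(mean_pool(query), mean_pool(doc_y))
print(round(single_x, 3), round(single_y, 3))  # 1.0 1.0

# Late interaction: doc_x scores higher because each query token finds its own match.
print(round(maxsim(query, doc_x), 3), round(maxsim(query, doc_y), 3))  # 2.0 1.414
```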

## ColBERTv2 Improvements

ColBERTv2 (2022) improved on the original ColBERT (2020):

| Feature | ColBERT v1 | ColBERTv2 |
|---------|------------|----------|
| Embedding dim | 128 | 128 |
| Compression | None | Residual compression |
| Training | Query/positive/negative triples with a pairwise loss | Denoised supervision (cross-encoder distillation + hard negatives) |
| Index size | Large | ~6-10x smaller |
| Effectiveness | Good | State-of-the-art at publication |
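
A rough per-embedding byte count shows where the index savings come from. The figures below are assumptions chosen for illustration: a 16-bit float baseline, a 4-byte centroid ID, and 1 or 2 bits per residual dimension. They cover only the per-vector payload, so they will not match whole-index ratios exactly.

```python
DIM = 128              # embedding dimensions
BASELINE_BITS = 16     # assumed fp16 storage for uncompressed embeddings
CENTROID_ID_BYTES = 4  # assumed 32-bit centroid index


def uncompressed_bytes(dim=DIM, bits=BASELINE_BITS):
    """Bytes needed to store one uncompressed embedding."""
    return dim * bits // 8


def compressed_bytes(residual_bits, dim=DIM):
    """Centroid ID plus `residual_bits` bits for each residual dimension."""
    return CENTROID_ID_BYTES + dim * residual_bits // 8


baseline = uncompressed_bytes()
for bits in (1, 2):
    size = compressed_bytes(bits)
    print(f"{bits}-bit residuals: {size} bytes vs {baseline} bytes ({baseline / size:.1f}x smaller)")
# 1-bit residuals: 20 bytes vs 256 bytes (12.8x smaller)
# 2-bit residuals: 36 bytes vs 256 bytes (7.1x smaller)
```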

The residual compression works by:
1. Learning centroids (via k-means) that capture common token embedding patterns
2. Representing each embedding as the ID of its nearest centroid plus a residual (the difference from that centroid)
3. Quantizing each residual dimension to 1 or 2 bits to reduce storage
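
The three steps above can be sketched in a few lines of plain Python. This is a deliberately simplified illustration, not the ColBERTv2 implementation: the centroids are hand-picked rather than learned, the quantizer keeps only one sign bit per dimension, and `STEP` and all function names are hypothetical.

```python
def nearest_centroid(vec, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    def sq_dist(c):
        return sum((x - y) ** 2 for x, y in zip(vec, c))
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i]))


def compress(vec, centroids):
    """Return (centroid_id, bits) with one sign bit per residual dimension."""
    cid = nearest_centroid(vec, centroids)
    residual = [x - c for x, c in zip(vec, centroids[cid])]
    bits = [1 if r >= 0 else 0 for r in residual]
    return cid, bits


def decompress(cid, bits, centroids, step):
    """Approximate the vector as centroid +/- step in each dimension."""
    return [c + (step if b else -step) for c, b in zip(centroids[cid], bits)]


# Hypothetical centroids; a real index would learn these from the data.
centroids = [[1.0, 0.0], [0.0, 1.0]]
STEP = 0.05  # assumed reconstruction magnitude for each sign bit

vec = [0.9, 0.1]
cid, bits = compress(vec, centroids)
approx = decompress(cid, bits, centroids, STEP)

print(cid, bits)                      # 0 [0, 1]
print([round(x, 2) for x in approx])  # [0.95, 0.05]
```

In this toy case the reconstruction is off by 0.05 per dimension, compared with 0.1 when using the centroid alone, at a cost of one extra bit per dimension.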

## Summary

The key ideas behind ColBERTv2:

1. **Late interaction**: Encode Q and D independently, interact via MaxSim
2. **Token embeddings**: Preserve fine-grained information (not single vector)
3. **MaxSim scoring**: Sum of max similarities enables soft matching
4. **Efficient indexing**: Residual compression for practical deployment

This makes ColBERTv2 an excellent choice when you need:
- Better effectiveness than dense retrieval
- Better efficiency than cross-encoders
- Fine-grained semantic matching capabilities