# ColModernVBERT for Visual Document Retrieval

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/harpreetsahota204/colmodernvbert/blob/main/colmodernvbert_for_Visual_Document_Retrieval.ipynb)

This notebook demonstrates how to use **ColModernVBERT** - a multi-vector vision-language model for fine-grained document retrieval and zero-shot classification - integrated as a FiftyOne Zoo Model.

## What is ColModernVBERT?

ColModernVBERT is a state-of-the-art multi-vector vision-language model that:
- Generates **~884 vectors per image** (128-dim each) for fine-grained representation
- Uses **ColBERT-style late interaction** (MaxSim scoring) for accurate matching
- Supports both **similarity search** (pooled embeddings) and **zero-shot classification** (full multi-vectors)
- Excels at **visual document understanding** tasks

## What You'll Learn

1. ✅ Load document datasets from Hugging Face
2. ✅ Register and use ColModernVBERT as a FiftyOne Zoo Model
3. ✅ Compute multi-vector embeddings with different pooling strategies
4. ✅ Visualize document embeddings with UMAP
5. ✅ Build text-to-image similarity search
6. ✅ Perform zero-shot document classification

---

**Note**: This notebook requires a GPU runtime. In Colab, go to **Runtime > Change runtime type** and select a GPU (T4, A100, etc.).


## 📦 Installation

First, let's install the required dependencies:
- **FiftyOne**: For dataset management and visualization
- **UMAP**: For embedding visualization
- **ColPali Engine**: The backend library that provides the ColModernVBERT implementation


In [None]:
!pip install fiftyone umap-learn

Install the ColPali Engine from the `vbert` branch, which provides the ColModernVBERT implementation:


In [None]:
!pip install git+https://github.com/illuin-tech/colpali.git@vbert#egg=colpali-engine

## 📚 Load Dataset

We'll use the **document-haystack-10pages** dataset from Hugging Face, which contains document images designed for retrieval tasks.

This dataset simulates a "needle in a haystack" scenario where you need to find specific documents containing particular text or information.


In [None]:
import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

# Load the dataset from the Hugging Face Hub
# Note: load_from_hub accepts other arguments, such as max_samples
dataset = load_from_hub(
    "Voxel51/document-haystack-10pages",
    overwrite=True,
)

## 🔧 Register ColModernVBERT Zoo Model

FiftyOne's Zoo Model system allows you to register remote model sources. Let's register the ColModernVBERT repository:


In [None]:
import fiftyone.zoo as foz

foz.register_zoo_model_source(
    "https://github.com/harpreetsahota204/colmodernvbert",
    overwrite=True,
)

Download the model weights and configuration:


In [None]:
foz.download_zoo_model(
    "https://github.com/harpreetsahota204/colmodernvbert",
    model_name="ModernVBERT/colmodernvbert"
)

## 🧠 Load the Model

Now let's load ColModernVBERT with a specific pooling strategy.

### Pooling Strategies

- **`mean`**: Averages all vectors → tends to suit holistic semantic matching
- **`max`**: Takes the element-wise maximum across all vectors → tends to suit keyword-based search

For document retrieval, both work well. We'll use **`max`** pooling to focus on finding specific content matches.
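
To make the difference between the two strategies concrete, here is a minimal NumPy sketch of pooling a multi-vector output into a single vector. The function `pool_multi_vectors` and the random stand-in data are illustrative assumptions, not the zoo model's actual implementation, which may apply additional steps such as normalization.

```python
import numpy as np

def pool_multi_vectors(vectors: np.ndarray, strategy: str = "max") -> np.ndarray:
    """Collapse a (num_vectors, dim) array into a single (dim,) vector."""
    if vectors.ndim != 2:
        raise ValueError("expected a 2-D array of shape (num_vectors, dim)")
    if strategy == "mean":
        # Average each dimension over all vectors
        return vectors.mean(axis=0)
    if strategy == "max":
        # Keep the largest value seen in each dimension
        return vectors.max(axis=0)
    raise ValueError(f"unknown pooling strategy: {strategy!r}")

# Random stand-in for one image's multi-vector output (884 vectors, 128 dims each)
rng = np.random.default_rng(0)
multi_vectors = rng.normal(size=(884, 128))

mean_pooled = pool_multi_vectors(multi_vectors, "mean")
max_pooled = pool_multi_vectors(multi_vectors, "max")
print(mean_pooled.shape, max_pooled.shape)  # (128,) (128,)
```

Either way the result has the same dimensionality as a single token vector, which is what lets it be stored and searched like an ordinary embedding.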


In [None]:
import fiftyone.zoo as foz

model = foz.load_zoo_model(
    "ModernVBERT/colmodernvbert",
    pooling_strategy="max",  # could also choose "mean"
)


## 📊 Compute Embeddings

Now we'll compute embeddings for all documents in the dataset.

### What's Happening Under the Hood?
1. Each image is processed by ColModernVBERT → generates **~884 vectors** (128-dim each)
2. These multi-vectors are **pooled** (using max pooling) → single **128-dim embedding**
3. The pooled embeddings are stored in FiftyOne for efficient similarity search

Pooling trades some of the fine-grained multi-vector detail for compact single vectors that are efficient to store and search.


In [None]:
dataset.compute_embeddings(
    model=model,
    embeddings_field="colmodernbert_embeddings"
)

Let's verify the embedding shape - it should be a 128-dimensional vector:


In [None]:
dataset.first()['colmodernbert_embeddings'].shape

## 🗺️ Visualize Embeddings with UMAP

Let's create a 2D visualization of our document embeddings using UMAP (Uniform Manifold Approximation and Projection).

This will help us:
- See how documents cluster in the embedding space
- Identify similar documents visually
- Understand the semantic structure of our dataset


In [None]:
import fiftyone.brain as fob

results = fob.compute_visualization(
    dataset,
    embeddings="colmodernbert_embeddings",
    method="umap",
    brain_key="colmodernbert_viz",
    num_dims=2,
)

## 🔍 Build Text-to-Image Similarity Index

Now let's build a similarity index that allows us to search for documents using text queries.

This index enables:
- **Text-to-image search**: Find documents matching text descriptions
- **Image-to-image search**: Find similar documents
- **Efficient k-NN lookups**: Fast retrieval at scale
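
Conceptually, a k-NN lookup over pooled embeddings boils down to ranking documents by their similarity to a query vector. The sketch below shows a brute-force version in NumPy, assuming cosine similarity; `top_k_cosine` and the random data are hypothetical, and the actual index backend may use a different metric or an approximate search structure.

```python
import numpy as np

def top_k_cosine(query: np.ndarray, embeddings: np.ndarray, k: int = 3) -> np.ndarray:
    """Return the indices of the k rows of `embeddings` most similar to `query`."""
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = e @ q  # cosine similarity of every document to the query
    return np.argsort(-sims)[:k]  # highest similarity first

# Random stand-ins for 10 pooled document embeddings (128-dim each)
rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(10, 128))

# A query that is a slightly noisy copy of document 4
query_embedding = doc_embeddings[4] + 0.01 * rng.normal(size=128)

nearest = top_k_cosine(query_embedding, doc_embeddings, k=3)
print(nearest)
```

Because the query is a near-copy of document 4, that document ranks first; the remaining positions depend on the random data.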


In [None]:
import fiftyone.brain as fob

text_img_index = fob.compute_similarity(
    dataset,
    model="ModernVBERT/colmodernvbert",
    embeddings_field="colmodernbert_embeddings",
    brain_key="colmodernbert_sim",
    model_kwargs={"pooling_strategy": "max"},
)

### Search with Text Queries

Let's extract the "needle texts" (target content) from the dataset and use them as queries to find matching documents.

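We pass the full list of queries in one call and keep the **top 3** documents. When several queries are given, `sort_by_similarity` ranks samples by their aggregate similarity to all of them, so the result is a single top-3 list rather than three matches per query: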


In [None]:
queries = dataset.distinct("needle_texts")

sims = text_img_index.sort_by_similarity(
    queries,
    k=3
)
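
If you want a separate result set for each needle text instead of one combined ranking, one option is to loop over the queries and call `sort_by_similarity` once per prompt. The helper `top_k_per_query` below is a hypothetical convenience function, not part of FiftyOne; it only assumes that the index object exposes `sort_by_similarity(query, k=...)` as used in the cell above.

```python
def top_k_per_query(index, queries, k=3):
    """Run one similarity search per query and collect the results by query."""
    results = {}
    for query in queries:
        # A single string prompt ranks samples against that prompt alone
        results[query] = index.sort_by_similarity(query, k=k)
    return results
```

Calling `top_k_per_query(text_img_index, queries, k=3)` would return a dictionary keyed by query text, where each value is whatever `sort_by_similarity` returns for that query.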

## 🎯 Zero-Shot Classification

Now let's use ColModernVBERT for zero-shot classification!

Instead of pooled embeddings, this mode uses the **full multi-vectors** with **MaxSim scoring**, trading some speed for finer-grained matching.

### How MaxSim Works
1. Image → ~884 vectors (128-dim each)
2. Each text class → ~13 vectors (128-dim each)
3. For each text vector, find its max similarity with any image vector
4. Sum these max similarities → classification score

This allows the model to match specific text tokens to relevant image regions!
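
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: similarity is a dot product between unit-normalized vectors, and the helper names (`maxsim_score`, `classify`, `unit_rows`) and random data are hypothetical. The zoo model's own implementation may differ in details such as padding masks, normalization, or how scores are converted to confidences.

```python
import numpy as np

def maxsim_score(text_vectors: np.ndarray, image_vectors: np.ndarray) -> float:
    """Late-interaction score: sum over text vectors of the max similarity to any image vector."""
    sim = text_vectors @ image_vectors.T  # shape (num_text, num_image)
    return float(sim.max(axis=1).sum())   # max over image vectors, then sum over text vectors

def classify(image_vectors: np.ndarray, class_vectors: dict) -> tuple:
    """Score every class against the image and return (best_label, all_scores)."""
    scores = {
        label: maxsim_score(vectors, image_vectors)
        for label, vectors in class_vectors.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores

def unit_rows(x: np.ndarray) -> np.ndarray:
    """Normalize each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Random stand-ins: one image (884 vectors) and two text classes (13 vectors each)
rng = np.random.default_rng(0)
image_vectors = unit_rows(rng.normal(size=(884, 128)))
class_vectors = {
    # Text vectors that coincide with 13 of the image vectors (a perfect match)
    "matching": image_vectors[:13].copy(),
    # Text vectors unrelated to the image
    "unrelated": unit_rows(rng.normal(size=(13, 128))),
}

label, scores = classify(image_vectors, class_vectors)
print(label, scores)
```

In this toy setup every "matching" text vector finds an identical image vector, so each contributes a similarity of about 1.0 and that class wins over the unrelated one.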


Set the model's classes to our query texts (these become the classification labels):


In [None]:
model.classes = queries

Apply the model to classify all documents based on which query text they best match:


In [None]:
import fiftyone as fo
import fiftyone.zoo as foz

dataset.apply_model(
    model,
    label_field="predictions"
    )

Let's examine the predictions for a sample document:


In [None]:
dataset.first()['predictions']

## 🎨 Launch FiftyOne App

Finally, let's launch the FiftyOne App to interactively explore our results!

The app provides:
- Visual browsing of documents
- Interactive similarity search
- Classification predictions overlay
- UMAP embedding visualization
- Filtering and sorting capabilities


In [None]:
session = fo.launch_app(sims, auto=False)

session.url

## Summary

✅ Loaded a document dataset from Hugging Face  
✅ Registered ColModernVBERT as a FiftyOne Zoo Model  
✅ Computed 128-dimensional pooled embeddings from multi-vector representations  
✅ Created UMAP visualizations of document embeddings  
✅ Built a text-to-image similarity search index  
✅ Performed zero-shot classification using MaxSim scoring  

### Key Takeaways

**Multi-Vector Architecture**: ColModernVBERT generates ~884 vectors per image, enabling fine-grained region-to-text matching.

**Dual-Mode Operation**:
- **Retrieval mode**: Pooled 128-dim embeddings for efficient similarity search
- **Classification mode**: Full multi-vectors with MaxSim for highest accuracy

**Pooling Strategies**:
- `mean`: Holistic semantic matching
- `max`: Keyword-based search

### Next Steps

- Try different classification tasks by changing `model.classes`
- Experiment with custom text prompts
- Compare `mean` vs `max` pooling strategies
- Use the similarity index for custom text queries
- Explore the FiftyOne App's interactive features

### Resources

- **Model**: [ModernVBERT/colmodernvbert](https://huggingface.co/ModernVBERT/colmodernvbert)
- **Repository**: [github.com/harpreetsahota204/colmodernvbert](https://github.com/harpreetsahota204/colmodernvbert)
- **Paper**: [ModernVBERT: Towards Smaller Visual Document Retrievers](https://arxiv.org/abs/2510.01149)
- **FiftyOne Docs**: [docs.voxel51.com](https://docs.voxel51.com)
