# [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/harpreetsahota204/visual_document_retrieval_in_fiftyone_talk/blob/main/vdr_talk_notebook.ipynb)

### The Challenge:

- 1,134 vision papers at NeurIPS
- 3 days to explore
- Which 30-40 papers should you prioritize?


### The Workflow:

#### Step 1: Visualize the Landscape

- Load dataset → Compute embeddings → Generate UMAP

- See what research clusters emerge: diffusion, transformers, 3D, video

- Understand: What's hot? What's emerging? Where do areas overlap?

#### Step 2: Find Core Interests

- Semantic search based on your interests

- Lasso entire clusters: Tag interesting papers as 'core_interest'

- Filter by presentation type: Oral vs Poster

#### Step 3: Discover Through Semantic Similarity

- Find papers with similar research niches

- Find papers similar to ones you already like

- Discover cross-domain connections

#### Step 4: Identify Novel Work

- Sort by representativeness (low scores = outliers)

- Papers that don't fit existing categories

- Potential breakthroughs or ambitious cross-domain work

#### Step 5: Build Your Schedule

- Core papers + Adjacent + Outliers

- Filter to oral presentations → 15 must-attend

- Export personalized conference guide

#### Setup

Let's install our dependencies:

In [None]:
!pip install fiftyone torch transformers pillow umap-learn

In [None]:
!pip install git+https://github.com/illuin-tech/colpali.git@vbert#egg=colpali-engine

Let's install some plugins to help us along the way. Run the following in your terminal:

1. `fiftyone plugins download https://github.com/jacobmarks/keyword-search-plugin`

2. `fiftyone plugins download https://github.com/harpreetsahota204/caption-viewer`

3. `fiftyone plugins download https://github.com/voxel51/fiftyone-plugins --plugin-names @voxel51/dashboard`

Begin by loading the [Visual AI at NeurIPS 2025 dataset](https://huggingface.co/datasets/Voxel51/visual_ai_at_neurips2025), which is hosted on Hugging Face. 

This dataset contains NeurIPS 2025 accepted papers focused on computer vision and related fields, enriched with arXiv metadata and first-page images. 

It includes papers from multiple vision-related categories including Computer Vision (cs.CV), Multimedia (cs.MM), Image and Video Processing (eess.IV), Graphics (cs.GR), and Robotics (cs.RO). 

Each entry includes paper metadata, abstracts, author information, and a high-resolution (500 DPI) PNG image of the paper's first page.

In [None]:
import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

# Load the dataset
# Note: other available arguments include 'max_samples', etc
dataset = load_from_hub("Voxel51/visual_ai_at_neurips2025")

Let's call the Dataset.

When you "call the dataset" in FiftyOne, for example by printing it with `print(dataset)`, you get a summary of the dataset's structure and contents.

This includes information like the number of samples, available fields, and possibly a preview of the first or last sample. 

This is a useful way to inspect your dataset after loading or creating it.

In [None]:
print(dataset)

We've got 1,134 papers here. Let's understand what we're working with by looking at the first sample:


In [None]:
dataset.first()

We can get a sense of the distribution of `arxiv_category` as follows:

In [8]:
dataset.count_values("arxiv_category.label")

{'cs.RO': 94, 'cs.MM': 2, 'cs.CV': 996, 'cs.GR': 22, 'eess.IV': 20}

Now, let's [map these category labels](https://docs.voxel51.com/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.map_labels) to something more human-readable. We're doing this because, towards the end of this notebook, we'll use a visual document retrieval model to perform zero-shot classification of the document images.

Begin by [cloning the sample field](https://docs.voxel51.com/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.clone_sample_field):

In [None]:
dataset.clone_sample_field("arxiv_category", "arxiv_category_mapped")

In [None]:
mapping = {
    "cs.CV": "Computer Vision",
    "cs.MM": "Multimedia",
    "eess.IV": "Image and Video Processing",
    "cs.GR": "Graphics",
    "cs.RO": "Robotics",
}

view = dataset.map_labels("arxiv_category_mapped", mapping)
view.save()

And we can verify this worked:

In [9]:
dataset.count_values("arxiv_category_mapped.label")

{'Graphics': 22,
 'Robotics': 94,
 'Multimedia': 2,
 'Image and Video Processing': 20,
 'Computer Vision': 996}

Now, launch the App and do some initial exploration of the dataset:

In [None]:
session = fo.launch_app(dataset)

### Setup the model

For this demo, we're going to use [ColModernVBert](https://docs.voxel51.com/plugins/plugins_ecosystem/colmodernvbert.html).

ColModernVBert is a multi-vector vision-language model built on the ModernVBert architecture that generates ColBERT-style embeddings for both images and text. 

Unlike single-vector models that compress entire images into a single representation, ColModernVBert produces multiple 128-dimensional vectors per input, enabling fine-grained matching between specific image regions and text tokens.
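To make "fine-grained matching" concrete, here is a minimal sketch of ColBERT-style MaxSim scoring in plain Python. The function names and the toy 2-dim vectors are my own illustration (real vectors are 128-dim and come from the model); this is not the plugin's actual implementation.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vectors, doc_vectors):
    """Late-interaction score: for each query vector, keep its best match
    among the document vectors, then sum those maxima."""
    return sum(max(dot(q, d) for d in doc_vectors) for q in query_vectors)


# Toy 2-dim vectors standing in for 128-dim token/patch embeddings (assumed values).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.9, 0.1], [0.1, 0.8], [0.5, 0.5]]  # has a good match for both query vectors
doc_b = [[0.9, 0.1], [0.8, 0.2]]              # only matches the first query vector well

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

Because every query vector picks its own best-matching document vector, a page that covers all parts of the query (`doc_a`) outscores one that only covers part of it (`doc_b`).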

I'm using it here because it's lightweight (250M parameters), and even without a GPU you can run the model and explore later on.

##### 📌 Some other models you may want to check out later:

| Model | Parameters | Output | Key Features | Good For |
|:---|:---|:---|:---|:---|
| **[`nomic-embed-multimodal`](https://docs.voxel51.com/plugins/plugins_ecosystem/nomic_embed_multimodal.html)** | 3B and 7B | Multi-dimensional vectors | Available in two sizes | Multimodal embedding tasks |
| **[`bimodernvbert`](https://docs.voxel51.com/plugins/plugins_ecosystem/bimodernvbert.html)** | 250M | 768-dim single vectors | Runs fast on CPU - about 7x faster than comparable models | When you need speed and don't have a GPU |
| **[`colmodernvbert`](https://docs.voxel51.com/plugins/plugins_ecosystem/colmodernvbert.html)** | 250M | Multi-vectors (ColBERT-style) | Same base as bimodernvbert, matches models 10x its size on vidore benchmarks | Fine-grained document matching with maxsim scoring|
| **[`jina-embeddings-v4`](https://docs.voxel51.com/plugins/plugins_ecosystem/jina_embeddings_v4.html)** | 3.8B | 2048-dim single-vector or multi-vector | Supports 30+ languages, task-specific LoRA adapters for retrieval, text-matching, and code | Multilingual document retrieval across different tasks|
| **[`colqwen2-5-v0-2`](https://docs.voxel51.com/plugins/plugins_ecosystem/colqwen2_5_v0_2.html)** | qwen2.5-vl-3B | Multi-vectors | Preserves aspect ratios, dynamic resolution up to 768 patches, token pooling keeps ~97.8% accuracy | Document layouts where aspect ratio matters |
| **[`colpali-v1-3`](https://docs.voxel51.com/plugins/plugins_ecosystem/colpali_v1_3.html)** | paligemma-3B | Multi-vector late interaction | Original model that showed visual doc retrieval could beat OCR pipelines | Baseline multi-vector retrieval, well-tested |



### Register the Zoo Model


In [None]:
import fiftyone.zoo as foz

# Register this repository as a remote zoo model source
foz.register_zoo_model_source(
    "https://github.com/harpreetsahota204/colmodernvbert",
    overwrite=True
)

### Instantiate the model

In [28]:
# Load ColModernVBert model
model = foz.load_zoo_model(
    "ModernVBERT/colmodernvbert",
    pooling_strategy="mean"  # or "max"
)

### Compute embeddings

Now, we can use the [`compute_embeddings`](https://docs.voxel51.com/api/fiftyone.core.models.html#fiftyone.core.models.compute_embeddings) method on our entire document collection. 

This is a one-time operation that turns each document into a vector representation that captures its visual and semantic meaning.

##### What's Happening Under the Hood?

- Each image is processed by ColModernVBERT → generates ~884 vectors (128-dim each)

- These multi-vectors are pooled (using max/mean pooling) → single 128-dim embedding

- The pooled embeddings are stored as fields of the FiftyOne dataset

- This gives us the best of both worlds: fine-grained multi-vector representation compressed into efficient single vectors for retrieval.
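
The pooling step can be sketched in a few lines of plain Python. The helper names and the toy 4-dim vectors below are assumptions for illustration; the real model pools roughly 884 vectors of 128 dims each, and its internal code may differ.

```python
def mean_pool(vectors):
    """Average each dimension across all vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def max_pool(vectors):
    """Take the largest value in each dimension across all vectors."""
    return [max(column) for column in zip(*vectors)]


# Three toy 4-dim vectors standing in for the model's multi-vector output.
multi_vectors = [
    [1.0, 0.0, 2.0, -1.0],
    [3.0, 2.0, 0.0, 1.0],
    [2.0, 4.0, 1.0, 0.0],
]

mean_embedding = mean_pool(multi_vectors)
max_embedding = max_pool(multi_vectors)
print(mean_embedding)  # one vector, same dimensionality as each input vector
print(max_embedding)
```

Either way, the output has the same dimensionality as a single input vector, which is why the stored embedding has shape `(128,)`.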

**Note:** This took ~1.5 hours on my Mac M3.

In [None]:
dataset.compute_embeddings(
    model=model,
    embeddings_field="colmodernvbert_embeddings"
)

In [10]:
# Check embedding dimensions
print(dataset.first()['colmodernvbert_embeddings'].shape)

(128,)


#### ℹ️ Let me save you some time

If you want to skip waiting for the model run, you can download a dataset with these embeddings (and the zero-shot classifications we do later) and follow along with the rest of the notebook.

This is how you can download it:

```python
import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

dataset = load_from_hub("harpreetsahota/visual_ai_at_neurips2025_colmodernvbert")
```

### Visualization

Once we have embeddings, we can visualize them. This is where the magic happens.

The [`compute_visualization`](https://docs.voxel51.com/api/fiftyone.brain.visualization.html#fiftyone.brain.visualization.visualize) method in FiftyOne will create a 2D visualization of our document embeddings using UMAP (Uniform Manifold Approximation and Projection).

This will help us:

- See how documents cluster in the embedding space
- Identify similar documents visually
- Understand the semantic structure of our dataset

In [13]:
import fiftyone.brain as fob

results = fob.compute_visualization(
    dataset,
    embeddings="colmodernvbert_embeddings",
    method="umap",
    brain_key="colmodernvbert_viz",
    num_dims=2,
)

Generating visualization...
UMAP( verbose=True)
Thu Nov  6 10:32:30 2025 Construct fuzzy simplicial set




Thu Nov  6 10:32:31 2025 Finding Nearest Neighbors
Thu Nov  6 10:32:31 2025 Finished Nearest Neighbor Search
Thu Nov  6 10:32:31 2025 Construct embedding


Epochs completed:   0%|            0/500 [00:00]

	completed  0  /  500 epochs
	completed  50  /  500 epochs
	completed  100  /  500 epochs
	completed  150  /  500 epochs
	completed  200  /  500 epochs
	completed  250  /  500 epochs
	completed  300  /  500 epochs
	completed  350  /  500 epochs
	completed  400  /  500 epochs
	completed  450  /  500 epochs
Thu Nov  6 10:32:31 2025 Finished embedding


When you open the [embeddings panel](https://docs.voxel51.com/user_guide/app.html#embeddings-panel) in the FiftyOne App, you'll see a bunch of dots.

Each dot is a document. Documents that are visually and semantically similar are placed close together. 

And without us telling it anything about document types or categories, natural clusters emerge.

### Build Similarity Index

Now let's use the [`compute_similarity`](https://docs.voxel51.com/api/fiftyone.brain.similarity.html#fiftyone-brain-similarity) method to build a similarity index. This is where visual document retrieval becomes incredibly powerful for research discovery.

This index enables three types of search that transform how you explore 1,134 papers:

1. Text-to-image search

    Natural language queries like "diffusion models for medical imaging" or "papers with architecture diagrams" find relevant content in abstracts and visuals.

2. Image-to-image search

    Click any paper to find others with similar diagrams, notation, or presentation styles.

3. Cross-domain discovery

    Find connections that keywords miss, such as papers sharing architectural approaches across different fields or citing similar foundational work.

Search by semantic meaning, visual structure, and notation style simultaneously. This could help in discovering papers traditional keyword search wouldn't find.    
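
Conceptually, all three search modes reduce to ranking stored embeddings by their similarity to a query embedding. The sketch below shows that idea with cosine similarity in plain Python. The paper ids, the toy 3-dim vectors, and the helper names are hypothetical, and the metric the index actually uses is an assumption here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query, embeddings, k):
    """Return the k (id, score) pairs most similar to the query."""
    scored = [(doc_id, cosine_similarity(query, emb)) for doc_id, emb in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical paper ids with toy 3-dim embeddings.
embeddings = {
    "paper_a": [1.0, 0.0, 0.0],
    "paper_b": [0.8, 0.6, 0.0],
    "paper_c": [0.0, 1.0, 0.0],
    "paper_d": [0.0, 0.0, 1.0],
}
# The query could come from a text prompt or from another paper's image.
query_embedding = [1.0, 0.1, 0.0]

results = top_k(query_embedding, embeddings, k=2)
print(results)
```

Since text and images are embedded into the same space, the same ranking code serves text-to-image and image-to-image search; only the source of `query_embedding` changes.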

In [None]:
import fiftyone.brain as fob

text_img_index = fob.compute_similarity(
    dataset,
    model="ModernVBERT/colmodernvbert",
    embeddings_field="colmodernvbert_embeddings",
    brain_key="colmodernvbert_sim",
    model_kwargs={"pooling_strategy": "mean"}
)

You'll see how to do all of this in the App as well, but you can also perform semantic similarity search with text queries in code.

For this query, we'll retrieve the top 3 most similar documents.

The [`sort_by_similarity`](https://docs.voxel51.com/api/fiftyone.brain.similarity.html) method returns a `fiftyone.core.view.DatasetView` containing the 3 most similar samples to your text query.

You can use this view directly in various ways:

- Display it in the FiftyOne App: `session.view = sims`
- Iterate over the samples: `for sample in sims: ...`
- Apply additional view operations: [`sims.match(...)`](https://docs.voxel51.com/api/fiftyone.core.collections.html#fiftyone.core.collections.SampleCollection.match)
- Access the samples: `sims.first()`, [`sims.take(n)`](https://docs.voxel51.com/api/fiftyone.core.collections.html#fiftyone.core.collections.SampleCollection.take), etc.

If you want to persist this view for later use, you can [save it to your dataset](https://docs.voxel51.com/user_guide/using_views.html#similarity-views) by tagging the samples or by storing the similarity scores in a field using the `dist_field` parameter.

In the cell below, `dist_field` stores the similarity distance for each sample in a field called `similarity_score` on the samples themselves.

In [15]:
sims = text_img_index.sort_by_similarity(
    ["visual document retrieval"],
    k=3,
    dist_field="similarity_score"
)

Fetching 12 files:   0%|          | 0/12 [00:00<?, ?it/s]

### Compute Uniqueness

With the embeddings we can [compute a uniqueness score](https://docs.voxel51.com/brain.html#brain-image-uniqueness) for every paper - how different is it from all the others?

**`compute_uniqueness`** assigns each paper a uniqueness score (0-1) based on how different it is from the rest of the conference.

**Low scores (0.1-0.3)**: Papers in heavily researched areas with incremental variations. Read one representative, skip the rest.

**High scores (0.7-0.9)**: Novel approaches that don't fit existing categories. These are your potential breakthrough papers.

**Use this to** prioritize unique contributions over the 10th variation of the same idea, and discover papers that don't fit the mainstream.
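
To build intuition for what a uniqueness score measures, here is one plausible way to compute such a score: the mean distance to each point's k nearest neighbors, min-max scaled to 0-1. This is an illustrative sketch only. I'm not claiming it is the algorithm FiftyOne Brain uses, and `knn_uniqueness` is a made-up helper.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def knn_uniqueness(embeddings, k=2):
    """Score each item by its mean distance to its k nearest neighbors,
    then min-max scale the scores into [0, 1]."""
    raw = []
    for i, emb in enumerate(embeddings):
        dists = sorted(
            euclidean(emb, other) for j, other in enumerate(embeddings) if j != i
        )
        raw.append(sum(dists[:k]) / k)
    lo, hi = min(raw), max(raw)
    return [(r - lo) / (hi - lo) if hi > lo else 0.0 for r in raw]


# Four tightly packed points and one far-away point (toy data).
points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]]
scores = knn_uniqueness(points, k=2)
print([round(s, 3) for s in scores])
```

The four clustered points have close neighbors and score near 0, while the isolated point scores highest, mirroring the "crowded area" versus "doesn't fit anywhere" reading of the scores above.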

In [16]:
results = fob.compute_uniqueness(
    dataset,
    embeddings="colmodernvbert_embeddings"
)

Computing uniqueness...


INFO:fiftyone.brain.internal.core.uniqueness:Computing uniqueness...


Uniqueness computation complete


INFO:fiftyone.brain.internal.core.uniqueness:Uniqueness computation complete


### Near Duplicates

**[`compute_near_duplicates`](https://docs.voxel51.com/brain.html#near-duplicates)** finds groups of very similar papers by comparing embeddings against a threshold. At a large conference like NeurIPS, this helps you:

- **Avoid redundancy**: Don't read multiple papers that are essentially the same approach with minor variations

- **Identify research trends**: Find groups of papers from different teams converging on similar solutions

- **Efficient scheduling**: If 3 papers in your queue are near-duplicates, attend one talk and skim the others
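
The threshold idea is simple enough to sketch by hand: compute pairwise distances and keep the pairs that fall below the threshold. The example below uses cosine distance and made-up 2-dim embeddings. Both the metric and the helper names are assumptions for illustration, not a description of FiftyOne's internals.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def near_duplicate_pairs(embeddings, threshold):
    """Return every pair of ids whose embeddings are closer than the threshold."""
    return [
        (id_a, id_b)
        for (id_a, emb_a), (id_b, emb_b) in combinations(embeddings.items(), 2)
        if cosine_distance(emb_a, emb_b) < threshold
    ]


# Hypothetical paper ids with toy embeddings.
embeddings = {
    "paper_a": [1.0, 0.0],
    "paper_b": [0.99, 0.05],  # almost the same direction as paper_a
    "paper_c": [0.0, 1.0],    # orthogonal to paper_a
}

pairs = near_duplicate_pairs(embeddings, threshold=0.02)
print(pairs)
```

Raising the threshold flags more pairs as duplicates and lowering it flags fewer, so it is worth trying a few values and inspecting the results in the App.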



In [17]:
import fiftyone.brain as fob

dup_index = fob.compute_near_duplicates(
    dataset,
    embeddings="colmodernvbert_embeddings",
    threshold=0.02,  # Adjust as needed for your data/model
)

Computing duplicate samples...


INFO:fiftyone.brain.similarity:Computing duplicate samples...


Duplicates computation complete


INFO:fiftyone.brain.similarity:Duplicates computation complete


This creates two saved views on your dataset:

- **`near duplicates`**: All papers that are very similar to one or more other papers. These are your "related work clusters" - papers you should compare side-by-side to understand subtle differences in approach.

- **`representatives of near duplicates`**: One representative from each cluster of similar papers. Read these first to understand each approach, then decide if the variations are worth diving into.

**Example use case**: You find 5 papers about diffusion models for medical imaging that cluster tightly together. Read the representative paper to understand the core approach, then skim the others to see what each team did differently - architecture tweaks, different datasets, alternative loss functions.

##### 🤔 What's the difference between computing uniqueness and near duplicates?

| Method | `compute_near_duplicates` | `compute_uniqueness` |
|:---|:---|:---|
| **Purpose** | Detects potential near-duplicate samples | Scores how unique each sample is |
| **Goal** | Find groups of very similar samples | Rank all samples by uniqueness |
| **How it works** | Measures distance between embeddings; samples below threshold are duplicates | Analyzes similarity distribution across entire dataset |
| **Output** | `SimilarityIndex` object with duplicate IDs and neighbor mappings | Adds scalar `uniqueness` field (0-1) to each sample |
| **Score meaning** | Binary: duplicate or not | Higher = more unique, Lower = more similar to others |
| **Primary use case** | Dataset cleaning (remove redundant data) | Sample selection (choose diverse samples for annotation/training) |
| **Requires threshold** | Yes | No |


**Key difference:** One finds duplicates to remove; the other ranks samples to find the most diverse ones to keep.



### Compute Representativeness

This finds [the most prototypical](https://docs.voxel51.com/brain.html#image-representativeness) papers in your dataset.

##### One way to interpret these scores

**High representativeness scores** identify mainstream papers - the ones that best represent each research cluster. These are your "survey the field" papers that show what's typical in diffusion models, vision transformers, or 3D reconstruction. If you want to understand the current state of a research area, start here.

**Low representativeness scores** identify outliers and boundary papers - the ones that don't fit neatly into existing clusters. These are often the most interesting: novel approaches combining multiple areas, cross-domain applications, or genuinely new methods. These are your "potential breakthrough" papers.

For conference planning: read the high-representativeness papers to get oriented in each area, then explore the low-representativeness papers to find cutting-edge work that might define future directions.
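
As a rough mental model for the `cluster-center` idea, you can think of representativeness as closeness to a cluster's centroid. The sketch below scores the members of a single, already-known toy cluster. The scoring formula and function names are my own simplification, and FiftyOne Brain computes its own clusters and scores, which may work differently.

```python
import math


def centroid(vectors):
    """Mean of each dimension across all vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def representativeness(cluster):
    """Score each member as 1 - (distance to centroid / largest distance),
    so the farthest member gets 0.0 and central members score near 1.0."""
    center = centroid(cluster)
    dists = [math.dist(vec, center) for vec in cluster]
    farthest = max(dists)
    return [1.0 - d / farthest if farthest > 0 else 1.0 for d in dists]


# One toy cluster: four nearby points plus one boundary point.
cluster = [[1.0, 1.0], [0.0, 1.0], [2.0, 1.0], [1.0, 0.0], [1.0, 5.0]]
scores = representativeness(cluster)
print([round(s, 3) for s in scores])
```

The point nearest the centroid plays the role of the "survey the field" paper, and the boundary point plays the role of the outlier worth a second look.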

In [18]:
# Compute representativeness scores
fob.compute_representativeness(
    dataset,
    representativeness_field="colmodernvbert_represent",
    method="cluster-center",
    embeddings="colmodernvbert_embeddings"
)

Computing representativeness...


INFO:fiftyone.brain.internal.core.representativeness:Computing representativeness...


Computing clusters for 1134 embeddings; this may take awhile...


INFO:fiftyone.brain.internal.core.representativeness:Computing clusters for 1134 embeddings; this may take awhile...
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Representativeness computation complete


INFO:fiftyone.brain.internal.core.representativeness:Representativeness computation complete


### Zero-shot Classification

We can even use this model to perform zero-shot classification. In this example, we will see how well this model can classify the arXiv category of the paper.

Let's get a list of the categories:

In [22]:
arxiv_categories = dataset.distinct("arxiv_category_mapped.label")

In [23]:
arxiv_categories

['Computer Vision',
 'Graphics',
 'Image and Video Processing',
 'Multimedia',
 'Robotics']

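Then we can use the [`apply_model`](https://docs.voxel51.com/api/fiftyone.core.collections.html#fiftyone.core.collections.SampleCollection.apply_model) method of the dataset.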

Notice the `text_prompt` argument. This customizes how class names are embedded for comparison with images. It's a template (e.g., "A research paper from the arXiv category of ") that's combined with each class label to form text inputs like "A research paper from the arXiv category of Robotics" or "A research paper from the arXiv category of Graphics".
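
Here is a small sketch of the template mechanics and the "pick the best-matching prompt" step. The two helpers and the 2-dim embeddings are made up for illustration: a real model would embed the prompts and the page image itself, and its scoring may differ from a plain dot product.

```python
def build_prompts(template, classes):
    """Append each class label to the shared prompt template."""
    return [template + label for label in classes]


def classify(image_embedding, prompt_embeddings, classes):
    """Pick the class whose prompt embedding has the highest dot product
    with the image embedding."""
    scores = [
        sum(a * b for a, b in zip(image_embedding, prompt))
        for prompt in prompt_embeddings
    ]
    best = max(range(len(classes)), key=lambda i: scores[i])
    return classes[best]


classes = ["Computer Vision", "Graphics", "Robotics"]
prompts = build_prompts("A research paper from the arXiv category of ", classes)
print(prompts)

# Made-up embeddings, one per prompt, plus one for a document image.
toy_prompt_embeddings = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
toy_image_embedding = [0.1, 0.9]

predicted = classify(toy_image_embedding, toy_prompt_embeddings, classes)
print(predicted)
```

Note the trailing space in the template: without it, the label would be glued directly onto the last word of the prompt.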

In [None]:
import fiftyone as fo
import fiftyone.zoo as foz

model.text_prompt = "A research paper from the arXiv category of "
model.classes = arxiv_categories

dataset.apply_model(
    model,
    label_field="arxiv_category_predictions",
)

We can also see how well it does with unmapped categories:

In [24]:
unmapped_arxiv_categories = dataset.distinct("arxiv_category.label")

In [25]:
unmapped_arxiv_categories

['cs.CV', 'cs.GR', 'cs.MM', 'cs.RO', 'eess.IV']

In [None]:
import fiftyone as fo
import fiftyone.zoo as foz

model.text_prompt = "A research paper from the arXiv category of "
model.classes = unmapped_arxiv_categories

dataset.apply_model(
    model,
    label_field="unmapped_arxiv_category_predictions",
)

ValueError: Unknown key 'model_kwargs'. The supported keys are ['filepath']

### Evaluate Classifications

FiftyOne has a nice [evaluation API](https://docs.voxel51.com/user_guide/evaluation.html) that you can use to assess how well a model performs.

By default, `evaluate_classifications` will treat your classifications as generic multiclass predictions, and it will evaluate each prediction by directly comparing its label to the associated ground truth label.
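
"Directly comparing labels" boils down to something like the following sketch, which computes accuracy and confusion counts for a handful of made-up labels. The helper functions are hypothetical stand-ins for what the evaluation API reports, not its implementation.

```python
from collections import Counter


def accuracy(ground_truth, predictions):
    """Fraction of samples whose predicted label equals the ground truth label."""
    correct = sum(gt == pred for gt, pred in zip(ground_truth, predictions))
    return correct / len(ground_truth)


def confusion_counts(ground_truth, predictions):
    """Count how often each (ground truth, predicted) label pair occurs."""
    return Counter(zip(ground_truth, predictions))


# Made-up labels for five papers.
ground_truth = ["Computer Vision", "Robotics", "Graphics", "Computer Vision", "Robotics"]
predictions = ["Computer Vision", "Computer Vision", "Graphics", "Computer Vision", "Robotics"]

print(accuracy(ground_truth, predictions))
print(confusion_counts(ground_truth, predictions))
```

With a class distribution as skewed as this dataset's (996 of 1,134 papers are Computer Vision), it's worth looking at the per-class confusion counts and not just overall accuracy.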

In [19]:
results = dataset.evaluate_classifications(
    "arxiv_category_predictions",
    gt_field="arxiv_category_mapped",
    eval_key="eval_simple",
)

In [20]:
dataset.save()

### Now let's go to the App and explore in more detail

In [21]:
session = fo.launch_app(dataset, auto=False)
session.url

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Session launched. Run `session.show()` to open the App in a cell output.


INFO:fiftyone.core.session.session:Session launched. Run `session.show()` to open the App in a cell output.


'http://localhost:5151/'

When you started this talk, you had documents. 

Maybe you had metadata: filenames, dates, categories. 

But you didn't really know your data.

Now? You can see it.

You can see how documents cluster. 

You can find the duplicates inflating your dataset. 

You can discover connections between documents that keywords would miss. 

You can identify the prototypical examples and the edge cases. 

You can search for documents with similar diagrams, similar table structures, similar visual patterns.

You transformed from 'I have documents' to 'I understand my dataset.'

You've seen what's possible. How do you actually start using this on your own documents?

The workflow is simple. 

Four steps:

One: Embed. Load your documents, pick a model, compute embeddings. BiModernVBERT is a great starting point because it runs on CPU and is fast enough for most use cases.

Two: Visualize. Generate a UMAP plot and look at your data. What clusters form? Where are the outliers? This 30-second view tells you more than hours of manual sampling.

Three: Explore. Use similarity search, uniqueness, representativeness - whatever insights you need. Find duplicates. Discover similar documents. Identify prototypes.

Four: Understand. You now know what you have, what you're missing, and what's unusual. You can make informed decisions about what to annotate, what to use for training, what to use for testing.

Take 100 documents from your current project. Run this code. Look at the visualization. I guarantee you'll see something you didn't know about your dataset:

- Clusters you didn't expect
- Outliers that surprise you
- Duplicates you didn't know existed
- Connections keywords can't find
