# Document Search using ModernBERT and Milvus Vector DB

![modernbert](images/modernbert.png)

**ModernBERT** builds upon the foundational success of BERT while incorporating advancements that address the evolving needs of real-world NLP tasks. As highlighted in [Hugging Face's blog on ModernBERT](https://huggingface.co/blog/modernbert), this model introduces several key improvements that make it particularly suitable for high-performance semantic search:  

1. **Optimized Architecture:** ModernBERT adopts rotary positional embeddings (RoPE), alternating local and global attention, and unpadding with sequence packing, which let it generate embeddings faster and at lower computational cost than its predecessors.  
2. **Enhanced Contextual Understanding:** With a context window of 8,192 tokens, ModernBERT can embed long documents in a single pass and capture the meaning and relationships within and between documents, which helps with complex queries.  
3. **Real-World Benchmarking:** ModernBERT is pretrained on 2 trillion tokens from diverse sources, including code and scientific articles, and is evaluated on real-world benchmarks, so it performs robustly across applications such as search and classification.  

The motivation for choosing ModernBERT lies in its ability to bridge the gap between theoretical advancements in NLP and practical applications. Its embeddings are lightweight yet powerful, making it a strong choice for scenarios where both speed and accuracy are critical, such as large-scale document retrieval.  

Pairing ModernBERT with **Milvus**, a high-performance vector database, further amplifies its capabilities. Milvus enables efficient storage and retrieval of high-dimensional embeddings, so that searches are not only semantically accurate but also fast and scalable. For datasets comprising thousands or millions of documents, this combination keeps semantic search responsive as the collection grows.  

### Basic System Architecture

![system-design](images/system_design.png)


### Step 1: Setting Up the Environment

Start by installing the necessary dependencies. We will need the following Python libraries:

- `sentence-transformers` for generating embeddings.
- `datasets` for loading the ML paper dataset.
- `pymilvus` for interacting with the Milvus vector database.
- `python-dotenv` for loading environment variables (e.g., API keys).
- `litellm` for calling the LLM that generates synthetic queries in Step 8.

You can install the required libraries via pip:

```bash
pip install sentence-transformers datasets pymilvus python-dotenv litellm
```

#### Loading Environment Variables

```python
from pathlib import Path
from sentence_transformers import SentenceTransformer

# Load variables from a local .env file into the environment
from dotenv import load_dotenv
load_dotenv()
```

```text
False
```

A return value of `False` means no variables were loaded, for example because no `.env` file was found.
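
Once the environment is loaded, settings can be read with `os.getenv`. The sketch below is illustrative only: the helper `get_setting` and the variable name `MILVUS_URI` are hypothetical choices that the tutorial does not define, and the fallback is the local Milvus address used in Step 5.

```python
import os

def get_setting(name: str, default: str) -> str:
    """Return an environment variable, or a default when it is unset or empty."""
    value = os.getenv(name, "")
    return value if value else default

# MILVUS_URI is a hypothetical variable name used only for illustration
milvus_uri = get_setting("MILVUS_URI", "http://localhost:19530")
print(milvus_uri)
```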

### Step 2: Load the ModernBERT Model

We will use the `nomic-ai/modernbert-embed-base` model, loaded through the Sentence Transformers library. This model generates high-quality embeddings suitable for semantic search.

```python
# Load the SentenceTransformer model
model = SentenceTransformer("nomic-ai/modernbert-embed-base")

# Function to generate embeddings for a single text
def generate_embeddings(text: str):
    return model.encode(text)
```

### Step 3: Prepare the Dataset

We will use the "CShorten/ML-ArXiv-Papers" dataset from Hugging Face, which contains machine learning research papers, to demonstrate the document search process.

```python
from datasets import load_dataset

ds = load_dataset("CShorten/ML-ArXiv-Papers")

# Keep only "title" and "abstract" columns in train set
train_ds = ds["train"].select_columns(["title", "abstract"])
```


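To work with a smaller subset for demo purposes, we will shuffle the dataset and select the first 1000 rows. The model expects a task prefix on every input, so we prepend `search_document:` to each document here and `search_query:` to each query later.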

```python
# Shuffle the dataset and select the first 1000 rows
small_dataset = train_ds.shuffle(seed=57).select(range(1000))

query_prefix = "search_query:"
document_prefix = "search_document:"

# Concatenate the prefix, title, and abstract into a single "text" field
def combine_text(row):
    row["text"] = document_prefix + " " + row["title"] + " " + row["abstract"]
    return row

# Apply function to entire dataset
small_dataset = small_dataset.map(combine_text)

# Print number of rows
print(f"Number of rows: {len(small_dataset)}")
```

```text
Number of rows: 1000
```
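
The two prefixes have to be applied consistently: documents get `search_document:` at indexing time and queries get `search_query:` at search time. A minimal sketch of helpers that keep this in one place follows; the names `format_document` and `format_query` are illustrative and not part of any library.

```python
QUERY_PREFIX = "search_query:"
DOCUMENT_PREFIX = "search_document:"

def format_document(title: str, abstract: str) -> str:
    """Build the text that is embedded for a paper (same layout as combine_text)."""
    return f"{DOCUMENT_PREFIX} {title} {abstract}"

def format_query(query: str) -> str:
    """Build the text that is embedded for a user query."""
    return f"{QUERY_PREFIX} {query}"

print(format_document("Minimax Active Learning", "Example abstract."))
print(format_query("active learning for retinopathy"))
```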


### Step 4: Generate Embeddings for the Dataset

Next, we define a function to generate embeddings for each document and apply it to the dataset.


```python
# Function to generate the embedding for one dataset row
# (this replaces the single-text helper defined in Step 2)
def generate_embeddings(example):
    example["embeddings"] = model.encode(example["text"])
    return example

# Apply the function to the dataset using map
embeddings_ds = small_dataset.map(generate_embeddings)
```
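
Encoding one row at a time is simple but can be slow on larger datasets. `model.encode` also accepts a list of strings, and `Dataset.map` passes whole batches when called with `batched=True`, in which case each column arrives as a list. The sketch below shows the shape of such a function under those assumptions; `embed_batch` is an illustrative name and `fake_encode` is a stand-in for `model.encode` so the example runs without downloading the model.

```python
from typing import Callable, Dict, List

def embed_batch(batch: Dict[str, list], encode: Callable[[List[str]], list]) -> Dict[str, list]:
    """Add an "embeddings" column to a batch given as a dict of column lists."""
    batch["embeddings"] = list(encode(batch["text"]))
    return batch

def fake_encode(texts: List[str]) -> List[List[float]]:
    # Stand-in for model.encode: one tiny 2-dimensional vector per input text
    return [[float(len(t)), 1.0] for t in texts]

demo = embed_batch({"text": ["search_document: a", "search_document: bb"]}, fake_encode)
print(len(demo["embeddings"]))
```

With the real model, the call might look like `small_dataset.map(lambda b: embed_batch(b, model.encode), batched=True, batch_size=32)`; the batch size of 32 is an arbitrary example value.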

We can convert the dataset to a Pandas DataFrame for easier inspection:

```python
import pandas as pd

# Convert HF dataset to Pandas DF
df = embeddings_ds.to_pandas()

# Take a peek at the data
df.head()
```

| | title | abstract | text | embeddings |
|---|---|---|---|---|
| 0 | An Active Learning Method for Diabetic Retinop... | In recent years, deep learning (DL) techniqu... | search_document: An Active Learning Method for... | [-0.029155258, -0.008791113, 0.009420881, -0.0... |
| 1 | A general approximation lower bound in $L^p$ n... | We study the fundamental limits to the expre... | search_document: A general approximation lower... | [0.0038754004, -0.05021903, 0.0012840546, 0.02... |
| 2 | TripleSpin - a generic compact paradigm for fa... | We present a generic compact computational f... | search_document: TripleSpin - a generic compac... | [0.046790063, -0.016866842, -0.016364306, -0.0... |
| 3 | Self-Supervised Contrastive Learning for Unsup... | We propose a self-supervised representation ... | search_document: Self-Supervised Contrastive L... | [0.02431918, -0.035683442, -0.07060653, -0.071... |
| 4 | Comparing learning algorithms in neural networ... | Today data mining techniques are exploited i... | search_document: Comparing learning algorithms... | [-0.03197413, -0.019541739, 0.043272045, -0.00... |


```python
# Get the maximum length, in characters, of the "text" column
df['text_length'] = df['text'].str.len()

max_text_length = int(df["text_length"].max())
print(f"Max text length: {max_text_length}")
```

```text
Max text length: 2127
```
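
Note that `str.len()` counts characters, while a database may limit string fields by bytes. Recent Milvus documentation describes `max_length` for `VARCHAR` fields as a byte limit, but treat that as an assumption to verify for your version. For text with non-ASCII characters the two counts differ, so a safer bound is the longest UTF-8 encoding. A minimal sketch, with `max_utf8_length` as an illustrative helper name:

```python
def max_utf8_length(texts):
    """Return the largest UTF-8 encoded size, in bytes, among the given strings."""
    return max((len(t.encode("utf-8")) for t in texts), default=0)

samples = ["plain ascii", "naïve Bayes", "L² norm"]
print(max(len(s) for s in samples))  # longest string in characters
print(max_utf8_length(samples))      # longest string in bytes
```

In the notebook this could be applied as `max_utf8_length(df["text"])`.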


## Milvus Vector Database

![milvus](images/milvus.png)

### Step 5: Set Up the Milvus Vector Database

Now, we will set up the Milvus vector database to store and search the embeddings. Milvus is an open-source vector database optimized for fast similarity search.

```python
from pymilvus import MilvusClient, DataType

# Alternative: Milvus Lite, stored in a local file
# client = MilvusClient("papers.db")

client = MilvusClient(
    uri="http://localhost:19530"
)

# Create schema
schema = MilvusClient.create_schema(
    auto_id=True,
    enable_dynamic_field=True,
)

collection_name = "modernbert_search"

# Add fields to schema
schema.add_field(field_name="pk", datatype=DataType.VARCHAR, is_primary=True, max_length=100)
schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=max_text_length)
schema.add_field(field_name="dense_vector", datatype=DataType.FLOAT_VECTOR, dim=768)

# Prepare index parameters
index_params = client.prepare_index_params()

# Add index
index_params.add_index(
    field_name="dense_vector",
    index_type="AUTOINDEX",
    metric_type="COSINE"
)

# Start from a clean collection on every run
if client.has_collection(collection_name):
    client.drop_collection(collection_name)

# Create collection with index loaded
client.create_collection(
    collection_name=collection_name,
    schema=schema,
    index_params=index_params
)
```

```python
client.has_collection(collection_name)
```

```text
True
```

### Step 6: Insert Data into Milvus

Now that we've set up the Milvus collection, we can insert the data in batches of 50. Each record holds a paper's embedding together with its title, which is stored in the `text` field.

```python
from tqdm import tqdm

batch_size = 50

for i in tqdm(range(0, len(embeddings_ds), batch_size)):
    # Slicing a Hugging Face dataset returns a dict of column lists
    batch = embeddings_ds[i : i + batch_size]
    batch_data = [
        {
            "text": title,  # only the title is stored in the "text" field
            "dense_vector": embedding
        }
        for title, embedding in zip(batch["title"], batch["embeddings"])
    ]

    client.insert(
        collection_name=collection_name,
        data=batch_data
    )
```

```text
100%|██████████| 20/20 [00:00<00:00, 31.62it/s]
```
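
The insertion loop walks the dataset in slices of 50 rows, which is why the progress bar shows 20 steps for 1000 rows. The same batching pattern can be written as a small reusable generator. This is a generic sketch, not part of `pymilvus`, and `chunked` is an illustrative name.

```python
from typing import Iterator, List, Sequence

def chunked(items: Sequence, size: int) -> Iterator[List]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start : start + size])

batches = list(chunked(list(range(1000)), 50))
print(len(batches), len(batches[0]), len(batches[-1]))
```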


```python
# Enter your search query
query = "Active learning methods for diabetic retinopathy classification using Bayesian CNN"
print(query)
```

```text
Active learning methods for diabetic retinopathy classification using Bayesian CNN
```


### Step 7: Perform a Search Query

Once the data is inserted into Milvus, we can perform a similarity search to find documents that match a given query.

```python
from IPython.display import Markdown, display

def dense_search(query, limit=10):
    # Queries use the "search_query:" prefix, matching the document prefix used at indexing time
    query_embeddings = model.encode(query_prefix + " " + query)
    res = client.search(
        collection_name=collection_name,
        data=[query_embeddings],
        anns_field="dense_vector",
        limit=limit,
        output_fields=["text"],
        search_params={"metric_type": "COSINE"}
    )
    return res
```
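
With `metric_type="COSINE"`, the score Milvus reports for each hit is the cosine similarity between the query vector and the stored vector, so higher means more similar. The sketch below computes the same quantity by brute force over a few toy vectors to make the ranking concrete. It is for illustration only, and it ignores the approximate index that Milvus uses for speed.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def brute_force_search(query_vec, docs, limit=10):
    """Rank (text, vector) pairs by cosine similarity to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]

toy_docs = [
    ("doc A", [1.0, 0.0]),
    ("doc B", [0.0, 1.0]),
    ("doc C", [1.0, 1.0]),
]
for score, text in brute_force_search([1.0, 0.1], toy_docs, limit=2):
    print(f"{text}: {score:.3f}")
```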

```python
result = dense_search(query)
for hit in result[0]:
    display(Markdown(f"**Title:** {hit['entity']['text']},\n**Score:** {hit['distance']}\n"))
```

| Title | Score |
|---|---|
| An Active Learning Method for Diabetic Retinopathy Classification with Uncertainty Quantification | 0.8416206240653992 |
| Learn to Segment Retinal Lesions and Beyond | 0.6053228378295898 |
| Learning Discriminative Bayesian Networks from High-dimensional Continuous Neuroimaging Data | 0.572968065738678 |
| Blood Vessel Detection using Modified Multiscale MF-FDOG Filters for Diabetic Retinopathy | 0.5701384544372559 |
| Deep Active Learning for Axon-Myelin Segmentation on Histology Data | 0.5581481456756592 |
| DeepBrain: Functional Representation of Neural In-Situ Hybridization Images for Gene Ontology Classification Using Deep Convolutional Autoencoders | 0.5342372059822083 |
| A Black-box Adversarial Attack Strategy with Adjustable Sparsity and Generalizability for Deep Image Classifiers | 0.5283135175704956 |
| AdvFilter: Predictive Perturbation-aware Filtering against Adversarial Attack via Multi-domain Learning | 0.5278029441833496 |
| Evaluation of Big Data based CNN Models in Classification of Skin Lesions with Melanoma | 0.5260496139526367 |
| Minimax Active Learning | 0.5215473175048828 |
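
Results far down the list have scores close to one another and may be only loosely related to the query. One option is to drop hits below a minimum score. The sketch below assumes hits shaped like those accessed in the display loop (a dict with `distance` and `entity["text"]`); `filter_hits` is an illustrative name and the threshold of 0.6 is an arbitrary example, not a recommended value.

```python
def filter_hits(hits, min_score):
    """Keep (title, score) pairs whose similarity score is at least min_score."""
    return [
        (hit["entity"]["text"], hit["distance"])
        for hit in hits
        if hit["distance"] >= min_score
    ]

# Hand-written stand-ins for two of the hits in the results
sample_hits = [
    {
        "distance": 0.8416206240653992,
        "entity": {"text": "An Active Learning Method for Diabetic Retinopathy Classification with Uncertainty Quantification"},
    },
    {
        "distance": 0.5215473175048828,
        "entity": {"text": "Minimax Active Learning"},
    },
]
print(filter_hits(sample_hits, min_score=0.6))
```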


### Step 8: Generating Synthetic Queries

We can generate synthetic queries and evaluation questions to aid in the search process. These queries can be used for type-ahead suggestions or to evaluate search results.

````python
query_generation_prompt = """
You are an AI expert skilled in creating semantic search queries and evaluation questions for scientific content. 
Based on the given title and abstract of a machine learning research paper, generate a set of synthetic queries and questions. 
These should help users search for or evaluate the paper in a semantic search system.

**Requirements:**
1. Generate **1-3 synthetic queries** that a user might type into a search bar to find this paper. These queries should:
   - Be varied in phrasing and focus on different aspects of the paper (e.g., problem addressed, methods used, results, applications, etc.).
   - Use natural language and keywords relevant to the paper's topic.

2. Generate **1-3 evaluation questions** that can help assess the relevance of search results to this paper. These questions should:
   - Focus on the key contributions, concepts, or applications discussed in the paper.
   - Be clear and relevant to researchers interested in this topic.

3. Provide the output in JSON format with the following structure:
```json
{
  "synthetic_queries": [
    "Query 1",
    "Query 2",
    "... (up to 3 queries)"
  ],
  "evaluation_questions": [
    "Question 1",
    "Question 2",
    "... (up to 3 questions)"
  ]
}
```
"""
````

```python
import json
import os

from litellm import completion

items = df['text'].tolist()

all_queries = []
for item in tqdm(items[:100]):
    response = completion(
        model="mistral/mistral-large-2407",
        # Read the key from the environment (loaded from .env) instead of hard-coding it
        api_key=os.getenv("MISTRAL_API_KEY"),
        messages=[
            {
                "role": "user",
                "content": f"""{query_generation_prompt.strip()}\n\nPaper Title and Abstract:\n {item}"""
            }
        ],
        response_format={"type": "json_object"}
    )
    queries = json.loads(response.choices[0].message.content)
    all_queries.extend(queries['synthetic_queries'] + queries['evaluation_questions'])
```

```text
100%|██████████| 100/100 [08:16<00:00,  4.96s/it]
```
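
LLM output is not guaranteed to match the requested structure, and a single malformed response would stop the generation loop with an exception. A defensive parser is one way to handle this. The sketch below is an assumption about how one might do it, and `parse_generated` is an illustrative name.

```python
import json

def parse_generated(raw: str) -> list:
    """Extract queries and questions from a JSON string, tolerating bad input."""
    try:
        payload = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return []
    if not isinstance(payload, dict):
        return []
    items = []
    for key in ("synthetic_queries", "evaluation_questions"):
        value = payload.get(key, [])
        if isinstance(value, list):
            # Keep only non-empty strings
            items.extend(v.strip() for v in value if isinstance(v, str) and v.strip())
    return items

print(parse_generated('{"synthetic_queries": ["q1"], "evaluation_questions": ["e1", ""]}'))
print(parse_generated("not json"))
```

Inside the loop, `all_queries.extend(parse_generated(response.choices[0].message.content))` would then skip malformed responses instead of failing.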


```python
# Inspect the parsed JSON of a response
json.loads(response.choices[0].message.content)
```

```text
{'synthetic_queries': ['How does the proposed method for diabetic retinopathy classification utilize active learning and uncertainty quantification?',
  'What are the benefits of using a Bayesian convolutional neural network in medical imaging tasks?',
  'Can you explain the challenges faced in annotating medical data and how this research addresses them?'],
 'evaluation_questions': ['What key techniques are employed in the proposed hybrid model for improving diabetic retinopathy classification?',
  'How does the paper measure the performance of its proposed framework compared to existing methods?',
  'What insights does the study provide regarding the transparency and interpretability of deep learning models in medical applications?']}
```

```python
# Save one query per line
with open('all_queries.txt', 'w') as f:
    for query in all_queries:
        f.write(f"{query}\n")
```
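
To use the generated queries for evaluation, each query needs to stay linked to the paper it was generated from, which the flat `all_queries` list does not record. Assuming such pairs are kept, a simple metric is hit rate at k: the fraction of queries whose source paper appears in the top k results. In this sketch, `hit_rate_at_k` and `toy_search` are illustrative names, and `search_fn` is a hypothetical callable returning ranked titles; in the tutorial it could wrap `dense_search`.

```python
from typing import Callable, Iterable, List, Tuple

def hit_rate_at_k(
    pairs: Iterable[Tuple[str, str]],
    search_fn: Callable[[str, int], List[str]],
    k: int = 10,
) -> float:
    """Fraction of (query, expected_title) pairs whose title is in the top k results."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(1 for query, expected in pairs if expected in search_fn(query, k))
    return hits / len(pairs)

# Toy search function standing in for a real vector search
def toy_search(query: str, k: int) -> List[str]:
    catalog = ["Paper A", "Paper B", "Paper C"]
    return catalog[:k]

pairs = [("query about a", "Paper A"), ("query about c", "Paper C")]
print(hit_rate_at_k(pairs, toy_search, k=2))
```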

### Conclusion

In this tutorial, we demonstrated how to set up a document search system using ModernBERT for embedding generation and Milvus for storing and querying those embeddings. We also explored the use of synthetic queries to enhance the search experience and evaluation. This setup can be extended for larger datasets and used in various real-world applications such as academic paper search or document retrieval systems.

**Note:** The full application (backend and frontend) is available at https://github.com/mallahyari/modernbert-semantic-search