[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/milvus-io/bootcamp/blob/master/bootcamp/tutorials/quickstart/hybrid_serach_with_milvus.ipynb)

# Hybrid Semantic Search with Milvus

In this tutorial, we will demonstrate the use of Milvus Hybrid Search with the BGE-M3 model to enhance search result relevance.

Milvus Hybrid Search integrates Dense, Sparse, and Hybrid retrieval methods:

- Dense Retrieval: Utilizes semantic context to understand the meaning behind queries.
- Sparse Retrieval: Emphasizes keyword matching to find results based on specific terms.
- Hybrid Retrieval: Combines both Dense and Sparse approaches, capturing the full context and specific keywords for comprehensive search results.

By integrating these methods, the Milvus Hybrid Search balances semantic and lexical similarities, improving the overall relevance of search outcomes. This notebook will walk through the process of setting up and using these retrieval strategies, highlighting their effectiveness in various search scenarios.

### Dependencies and Environment

In [None]:
!pip install --upgrade pymilvus "pymilvus[model]" milvus-lite pandas numpy

### Download Dataset

Download the Quora Duplicate Questions dataset and place it in the same directory.

Credit for the dataset: [First Quora Dataset Release: Question Pairs](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs)

In [None]:
# Run this cell to download the dataset
!wget http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv

### Load and Prepare Data

We will load the dataset and prepare a small corpus for search.

In [2]:
import pandas as pd

file_path = "quora_duplicate_questions.tsv"
df = pd.read_csv(file_path, sep="\t")
questions = set()
for _, row in df.iterrows():
    obj = row.to_dict()
    questions.add(obj["question1"][:512])
    questions.add(obj["question2"][:512])
    if len(questions) > 10000:
        break

docs = list(questions)

# example question
print(docs[0])

Whose questions do you follow the most on a regular basis?


### Generate Random Embeddings (Optional)

If you do not have the BGE-M3 model, you can generate random embeddings for demonstration purposes.

In [3]:
import random
import numpy as np

def random_embedding(texts):
    rng = np.random.default_rng()
    return {
        "dense": np.random.rand(len(texts), 768),
        "sparse": [
            {
                d: rng.random()
                for d in random.sample(range(1000), random.randint(20, 30))
            }
            for _ in texts
        ],
    }

dense_dim = 768
ef = random_embedding

### Use BGE-M3 Model for Embeddings

The BGE-M3 model can embed texts as dense and sparse vectors. Ensure you have installed the `model` module in pymilvus.

To install it, simply run

In [None]:
!pip install "pymilvus[model]"

In [5]:
from milvus_model.hybrid import BGEM3EmbeddingFunction

ef = BGEM3EmbeddingFunction(use_fp16=False, device="cpu")
dense_dim = ef.dim["dense"]

# Generate embeddings using BGE-M3 model
docs_embeddings = ef(docs)

  from .autonotebook import tqdm as notebook_tqdm
Fetching 30 files: 100%|██████████| 30/30 [00:00<00:00, 302473.85it/s]
Inference Embeddings: 100%|██████████| 626/626 [04:43<00:00,  2.21it/s]


### Setup Milvus Collection and Index

We will now set up the Milvus collection and create indices for the vector fields.

In [6]:
from pymilvus import (
    FieldSchema,
    CollectionSchema,
    DataType,
    Collection,
    connections,
)

# Connect to Milvus
connections.connect("default", uri="milvus.db")

# Specify the data schema for the new Collection
fields = [
    # Use auto generated id as primary key
    FieldSchema(
        name="pk", dtype=DataType.VARCHAR, is_primary=True, auto_id=True, max_length=100
    ),
    # Store the original text to retrieve based on semantically distance
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=512),
    # Milvus now supports both sparse and dense vectors,
    # we can store each in a separate field to conduct hybrid search on both vectors
    FieldSchema(name="sparse_vector", dtype=DataType.SPARSE_FLOAT_VECTOR),
    FieldSchema(name="dense_vector", dtype=DataType.FLOAT_VECTOR, dim=dense_dim),
]
schema = CollectionSchema(fields, "")
col_name = "hybrid_demo"
col = Collection(col_name, schema, consistency_level="Strong")

# Create indices for the vector fields
sparse_index = {"index_type": "SPARSE_INVERTED_INDEX", "metric_type": "IP"}
col.create_index("sparse_vector", sparse_index)
dense_index = {"index_type": "FLAT", "metric_type": "IP"}
col.create_index("dense_vector", dense_index)
col.load()

### Insert Data into Milvus Collection
Insert the text and sparse/dense vector representations into the collection.

In [7]:
# Insert documents and their embeddings into the collection
entities = [docs, docs_embeddings["sparse"], docs_embeddings["dense"]]
for i in range(0, len(docs), 50):
    batched_entities = [
        docs[i : i + 50],
        docs_embeddings["sparse"][i : i + 50],
        docs_embeddings["dense"][i : i + 50],
    ]
    col.insert(batched_entities)
col.flush()

### Model and Collection Initialization
Initialize the BGE-M3 model and Milvus collection.

In [8]:
# Initialize the model
def get_model():
    ef = BGEM3EmbeddingFunction(use_fp16=False, device="cpu")
    return ef

# Initialize the collection
def get_collection():
    col_name = "hybrid_demo"
    connections.connect("default", uri="milvus.db")
    col = Collection(col_name)
    return col

# Fetch the model and collection
ef = get_model()
col = get_collection()

Fetching 30 files: 100%|██████████| 30/30 [00:00<00:00, 319363.25it/s]


### Hybrid Search

Define helper functions for hybrid search.


In [9]:
def search_from_source(source, query):
    return [f"{source} Result {i+1} for {query}" for i in range(5)]

def get_tokenizer():
    tokenizer = ef.model.tokenizer
    return tokenizer

def doc_text_formatting(query, docs):
    tokenizer = get_tokenizer()
    query_tokens_ids = tokenizer.encode(query, return_offsets_mapping=True)
    query_tokens = tokenizer.convert_ids_to_tokens(query_tokens_ids)
    formatted_texts = []

    for doc in docs:
        ldx = 0
        landmarks = []
        encoding = tokenizer.encode_plus(doc, return_offsets_mapping=True)
        tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"])[1:-1]
        offsets = encoding["offset_mapping"][1:-1]
        for token, (start, end) in zip(tokens, offsets):
            if token in query_tokens:
                if len(landmarks) != 0 and start == landmarks[-1]:
                    landmarks[-1] = end
                else:
                    landmarks.append(start)
                    landmarks.append(end)
        close = False
        formatted_text = ""
        for i, c in enumerate(doc):
            if ldx == len(landmarks):
                pass
            elif i == landmarks[ldx]:
                if close is True:
                    formatted_text += "]"
                else:
                    formatted_text += "["
                close = not close
                ldx = ldx + 1
            formatted_text += c
        if close is True:
            formatted_text += "]"
        formatted_texts.append(formatted_text)
    return formatted_texts

In [10]:
from pymilvus import (
    AnnSearchRequest,
    WeightedRanker,
)

def hybrid_search(query_embeddings, sparse_weight=1.0, dense_weight=1.0):
    col = get_collection()
    sparse_search_params = {"metric_type": "IP"}
    sparse_req = AnnSearchRequest(
        query_embeddings["sparse"], "sparse_vector", sparse_search_params, limit=10
    )
    dense_search_params = {"metric_type": "IP"}
    dense_req = AnnSearchRequest(
        query_embeddings["dense"], "dense_vector", dense_search_params, limit=10
    )
    rerank = WeightedRanker(sparse_weight, dense_weight)
    res = col.hybrid_search(
        [sparse_req, dense_req], rerank=rerank, limit=10, output_fields=["text"]
    )
    if len(res):
        return [hit.fields["text"] for hit in res[0]]
    else:
        return []

### Enter Your Search Query

Enter your search query and run the search.

In [11]:
# Enter your search query
query = input("Enter your search query: ")
print(query)

# Generate embeddings for the query
query_embeddings = ef([query])
print(query_embeddings)

Who started AI research?
{'dense': [array([-0.03658685, -0.01750261, -0.01536112, ..., -0.02266536,
        0.01365146,  0.00908284], dtype=float32)], 'sparse': <1x250002 sparse array of type '<class 'numpy.float32'>'
	with 5 stored elements in Compressed Sparse Row format>}


### Display Search Results

Perform the search and display the results for Dense, Sparse, and Hybrid methods.

In [12]:
# Dense search results
print("Dense Search Results:")
results = hybrid_search(query_embeddings, sparse_weight=0.0, dense_weight=1.0)
formatted_results = doc_text_formatting(query, results)
for result in results:
    print(result)

# Sparse search results
print("\nSparse Search Results:")
results = hybrid_search(query_embeddings, sparse_weight=1.0, dense_weight=0.0)
formatted_results = doc_text_formatting(query, results)
for result in formatted_results:
    print(result)

# Hybrid search results
print("\nHybrid Search Results:")
results = hybrid_search(query_embeddings, sparse_weight=0.7, dense_weight=1.0)
formatted_results = doc_text_formatting(query, results)
for result in formatted_results:
    print(result)

Dense Search Results:
What's the best way to start learning robotics?
When in history did we start giving people names?
Why did humans come into existence?
Who invented thermometer?
Should a machine learning beginner go straight for deep learning?
How do I start learning or strengthen my knowledge of data structures and algorithms?
How can undergraduate help with machine learning research?
Why do so many people believe that the IQ test determines your intelligence?
Why is the term "research" used instead of scientific investigation?
Do humans fear artificial intelligence because it has no soul?
What is research objective?
What topic should I research for my EPQ project?
What is the best way to do an MUN research?
What advice will you give to an IIT graduate in Mechanical/Civil Engineering, who strongly wants to pursue a research career in computer science/mathematics, and has completed the basic courses of CS in coursera?
How do I research for MUN?
How could neutrinos be used for scien

### Quick Deploy

To learn about how to start an online demo with this tutorial, please refer to [the example application](https://github.com/milvus-io/bootcamp/tree/master/bootcamp/tutorials/quickstart/apps/hybrid_search_with_milvus).

<img src="https://raw.githubusercontent.com/milvus-io/bootcamp/master/bootcamp/tutorials/quickstart/apps/hybrid_demo_with_milvus/pics/demo.jpg"/>